[
  {
    "path": ".gitignore",
    "content": "\n*.pdf\n\n*.SUNet\n*.pyc\n.env/*\nexamples/data\nexamples/graphs/*\nexamples/checkpoints/*\nexamples/visualization/*\n"
  },
  {
    "path": "2017/README.md",
    "content": "# tf-stanford-tutorials\nThis repository contains code examples for the 2017 course CS 20SI: TensorFlow for Deep Learning Research.<br>\nDetailed syllabus and lecture notes can be found [here](http://cs20si.stanford.edu/2017).<br>\nFor this year's course, please see [here](http://cs20.stanford.edu)\n\n# Note (as of July 11, 2017)\nI've updated the code to TensorFlow 1.2 and Python3, except the code for chatbot. I will update the code for chatbot soon.\n\n\n\n## Models include: <br>\n### In the folder \"examples\": <br>\nLinear Regression with Chicago's Fire-Theft dataset<br>\nLogistic Regression with MNIST<br>\nWord2vec skip-gram model with NCE loss<br>\nConvnets with MNIST<br>\nAutoencoder (by Nishith Khandwala)<br>\nDeepdream (by Jon Shlens)<br>\nCharacter-level language modeling <br>\n<br>\n### In the folder \"assignments\":<br>\nStyle Transfer<br>\nChatbot using sequence to sequence with attention<br>\n<br>\n## Misc<br>\nExamples on how to use data readers, TFRecord<br>\nEmbedding visualization with TensorBoard<br>\nUsage of summary ops<br>\nExercises to be familiar with other special TensorFlow ops<br>\nDemonstration of the danger of lazy loading <br>\nConvolutional GRU (CRGU) (by Lukasz Kaiser)"
  },
  {
    "path": "2017/assignments/chatbot/README.md",
    "content": "<<<<<<<\nA neural chatbot using sequence to sequence model with\nattentional decoder. This is a fully functional chatbot.\n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen as the starter code for assignment 3,\nclass CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\nThe detailed assignment handout and information on training time can be found at http://web.stanford.edu/class/cs20si/assignments/a3.pdf \n\n<h2>Sample conversation</h2>\nHUMAN ++++ hi <br>\nBOT ++++ hi . what ' s your name ?<br>\nHUMAN ++++ i'm chip<br>\nBOT ++++ please .<br>\nHUMAN ++++ please what?<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ where are you from?<br>\nBOT ++++ california .<br>\nHUMAN ++++ do you like california?<br>\nBOT ++++ i ' m okay .<br>\nHUMAN ++++ i like california a lot<br>\nBOT ++++ of what ?<br>\nHUMAN ++++ i live here<br>\nBOT ++++ and what is it ?<br>\nHUMAN ++++ california<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ i'm a student here<br>\nBOT ++++ and get it out of here .<br>\nHUMAN ++++ what do you like to do in your free time?<br>\nBOT ++++ and what about the gun ?<br>\nHUMAN ++++ do you have a gun?<br>\nBOT ++++ yes .<br>\nHUMAN ++++ you're dangerous!<br>\nBOT ++++ i ' m not .<br>\nHUMAN ++++ do you go hunting?<br>\nBOT ++++ i can do .<br>\nHUMAN ++++ what do you hunt?<br>\nBOT ++++ virgin .<br>\nHUMAN ++++ that's very bad<br>\nBOT ++++ and you in love ?<br>\nHUMAN ++++ no i'm not<br>\nBOT ++++ please - -<br>\nHUMAN ++++ okay, i am<br>\nBOT ++++ and you are good .<br>\nHUMAN ++++ yeah i am good too.<br>\nBOT ++++ good good to you .<br>\nHUMAN ++++ are you in love?<br>\nBOT ++++ i ' m ready .<br>\nHUMAN ++++ ready for what?<br>\nBOT ++++ for what ?<br>\nHUMAN ++++ for what?<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ tell me<br>\nBOT 
++++ i can do it .<br>\nHUMAN ++++ okay, tell me what you're ready for<br>\nBOT ++++ i ' m doing you .<br>\nHUMAN ++++ no you're not<br>\nBOT ++++ please !<br>\nHUMAN ++++ this is a strange conversation<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ are you a guy?<br>\nBOT ++++ of what ?<br>\nHUMAN ++++ are you a boy?<br>\nBOT ++++ i ' m fine .<br>\nHUMAN ++++ are you a man?<br>\nBOT ++++ i ' m a little .<br>\nHUMAN ++++ what's your name?<br>\nBOT ++++ chris -<br>\nHUMAN ++++ chris who?<br>\nBOT ++++ cornelius .<br>\nHUMAN ++++ so i guess you're a guy.<br>\nBOT ++++ i am .<br>\nHUMAN ++++ okay<br>\nBOT ++++ okay ?<br>\nHUMAN ++++ do you have a girlfriend?<br>\nBOT ++++ i ' m sure .<br>\nHUMAN ++++ what's her name?<br>\nBOT ++++ let ' s talk about something else .<br>\n\nSee output_convo.txt for more sample conversations.\n\n<h2>Usage</h2>\n\nStep 1: create a data folder in your project directory, download\nthe Cornell Movie-Dialogs Corpus from \nhttps://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html\nUnzip it\n\nStep 2: python data.py\n<br>This will do all the pre-processing for the Cornell dataset.\n\nStep 3:\npython chatbot.py --mode [train/chat] <br>\nIf mode is train, then you train the chatbot. By default, the model will\nrestore the previously trained weights (if there is any) and continue\ntraining up on that.\n\nIf you want to start training from scratch, please delete all the checkpoints\nin the checkpoints folder.\n\nIf the mode is chat, you'll go into the interaction mode with the bot.\n\nBy default, all the conversations you have with the chatbot will be written\ninto the file output_convo.txt in the processed folder. If you run this chatbot,\nI kindly ask you to send me the output_convo.txt so that I can improve\nthe chatbot. 
My email is huyenn@stanford.edu\n\nIf you find the tutorial helpful, please head over to <a href=\"http://web.stanford.edu/class/cs20si/anonymous_chatlog.pdf\">Anonymous Chatlog Donation</a>\nto see how you can help us create the first realistic dialogue dataset.\n\nThank you very much!\n>>>>>>> origin/master\n"
  },
  {
    "path": "2017/assignments/chatbot/chatbot.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen as the starter code for assignment 3,\nclass CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\nThis file contains the code to run the model.\n\nSee readme.md for instruction on how to run the starter code.\n\"\"\"\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport argparse\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport random\nimport sys\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom model import ChatBotModel\nimport config\nimport data\n\ndef _get_random_bucket(train_buckets_scale):\n    \"\"\" Get a random bucket from which to choose a training sample \"\"\"\n    rand = random.random()\n    return min([i for i in range(len(train_buckets_scale))\n                if train_buckets_scale[i] > rand])\n\ndef _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks):\n    \"\"\" Assert that the encoder inputs, decoder inputs, and decoder masks are\n    of the expected lengths \"\"\"\n    if len(encoder_inputs) != encoder_size:\n        raise ValueError(\"Encoder length must be equal to the one in bucket,\"\n                        \" %d != %d.\" % (len(encoder_inputs), encoder_size))\n    if len(decoder_inputs) != decoder_size:\n        raise ValueError(\"Decoder length must be equal to the one in bucket,\"\n                       \" %d != %d.\" % (len(decoder_inputs), decoder_size))\n    if len(decoder_masks) != decoder_size:\n        raise ValueError(\"Weights length must be equal to the one in bucket,\"\n                       \" %d != %d.\" % (len(decoder_masks), decoder_size))\n\ndef run_step(sess, model, encoder_inputs, decoder_inputs, 
decoder_masks, bucket_id, forward_only):\n    \"\"\" Run one step in training.\n    @forward_only: boolean value to decide whether a backward pass should be created\n    forward_only is set to True when you just want to evaluate on the test set,\n    or when you want the bot to be in chat mode. \"\"\"\n    encoder_size, decoder_size = config.BUCKETS[bucket_id]\n    _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks)\n\n    # input feed: encoder inputs, decoder inputs, target_weights, as provided.\n    input_feed = {}\n    for step in range(encoder_size):\n        input_feed[model.encoder_inputs[step].name] = encoder_inputs[step]\n    for step in range(decoder_size):\n        input_feed[model.decoder_inputs[step].name] = decoder_inputs[step]\n        input_feed[model.decoder_masks[step].name] = decoder_masks[step]\n\n    last_target = model.decoder_inputs[decoder_size].name\n    input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32)\n\n    # output feed: depends on whether we do a backward step or not.\n    if not forward_only:\n        output_feed = [model.train_ops[bucket_id],  # update op that does SGD.\n                       model.gradient_norms[bucket_id],  # gradient norm.\n                       model.losses[bucket_id]]  # loss for this batch.\n    else:\n        output_feed = [model.losses[bucket_id]]  # loss for this batch.\n        for step in range(decoder_size):  # output logits.\n            output_feed.append(model.outputs[bucket_id][step])\n\n    outputs = sess.run(output_feed, input_feed)\n    if not forward_only:\n        return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.\n    else:\n        return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.\n\ndef _get_buckets():\n    \"\"\" Load the dataset into buckets based on their lengths.\n    train_buckets_scale is the interval that'll help us \n    choose a random bucket later on.\n    \"\"\"\n    
test_buckets = data.load_data('test_ids.enc', 'test_ids.dec')\n    data_buckets = data.load_data('train_ids.enc', 'train_ids.dec')\n    train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))]\n    print(\"Number of samples in each bucket:\\n\", train_bucket_sizes)\n    train_total_size = sum(train_bucket_sizes)\n    # list of increasing numbers from 0 to 1 that we'll use to select a bucket.\n    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size\n                           for i in range(len(train_bucket_sizes))]\n    print(\"Bucket scale:\\n\", train_buckets_scale)\n    return test_buckets, data_buckets, train_buckets_scale\n\ndef _get_skip_step(iteration):\n    \"\"\" How many steps should the model train before it saves all the weights. \"\"\"\n    if iteration < 100:\n        return 30\n    return 100\n\ndef _check_restore_parameters(sess, saver):\n    \"\"\" Restore the previously trained parameters if there are any. \"\"\"\n    ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint'))\n    if ckpt and ckpt.model_checkpoint_path:\n        print(\"Loading parameters for the Chatbot\")\n        saver.restore(sess, ckpt.model_checkpoint_path)\n    else:\n        print(\"Initializing fresh parameters for the Chatbot\")\n\ndef _eval_test_set(sess, model, test_buckets):\n    \"\"\" Evaluate on the test set. 
\"\"\"\n    for bucket_id in range(len(config.BUCKETS)):\n        if len(test_buckets[bucket_id]) == 0:\n            print(\"  Test: empty bucket %d\" % (bucket_id))\n            continue\n        start = time.time()\n        encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id], \n                                                                        bucket_id,\n                                                                        batch_size=config.BATCH_SIZE)\n        _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, \n                                   decoder_masks, bucket_id, True)\n        print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start))\n\ndef train():\n    \"\"\" Train the bot \"\"\"\n    test_buckets, data_buckets, train_buckets_scale = _get_buckets()\n    # in train mode, we need to create the backward path, so forwrad_only is False\n    model = ChatBotModel(False, config.BATCH_SIZE)\n    model.build_graph()\n\n    saver = tf.train.Saver()\n\n    with tf.Session() as sess:\n        print('Running session')\n        sess.run(tf.global_variables_initializer())\n        _check_restore_parameters(sess, saver)\n\n        iteration = model.global_step.eval()\n        total_loss = 0\n        while True:\n            skip_step = _get_skip_step(iteration)\n            bucket_id = _get_random_bucket(train_buckets_scale)\n            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id], \n                                                                           bucket_id,\n                                                                           batch_size=config.BATCH_SIZE)\n            start = time.time()\n            _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False)\n            total_loss += step_loss\n            iteration += 1\n\n            if iteration % skip_step == 
0:\n                print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start))\n                start = time.time()\n                total_loss = 0\n                saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step)\n                if iteration % (10 * skip_step) == 0:\n                    # Run evals on development set and print their loss\n                    _eval_test_set(sess, model, test_buckets)\n                    start = time.time()\n                sys.stdout.flush()\n\ndef _get_user_input():\n    \"\"\" Get user's input, which will be transformed into encoder input later \"\"\"\n    print(\"> \", end=\"\")\n    sys.stdout.flush()\n    return sys.stdin.readline()\n\ndef _find_right_bucket(length):\n    \"\"\" Find the proper bucket for an encoder input based on its length \"\"\"\n    return min([b for b in range(len(config.BUCKETS))\n                if config.BUCKETS[b][0] >= length])\n\ndef _construct_response(output_logits, inv_dec_vocab):\n    \"\"\" Construct a response to the user's encoder input.\n    @output_logits: the outputs from sequence to sequence wrapper.\n    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB\n    \n    This is a greedy decoder - outputs are just argmaxes of output_logits.\n    \"\"\"\n    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]\n    # If there is an EOS symbol in outputs, cut them at that point.\n    if config.EOS_ID in outputs:\n        outputs = outputs[:outputs.index(config.EOS_ID)]\n    # Build the sentence corresponding to outputs.\n    return \" \".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])\n\ndef chat():\n    \"\"\" In chat mode, we don't need to create the backward pass.\n    \"\"\"\n    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))\n    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 
'vocab.dec'))\n\n    model = ChatBotModel(True, batch_size=1)\n    model.build_graph()\n\n    saver = tf.train.Saver()\n\n    with tf.Session() as sess:\n        sess.run(tf.global_variables_initializer())\n        _check_restore_parameters(sess, saver)\n        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')\n        # Decode from standard input.\n        max_length = config.BUCKETS[-1][0]\n        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)\n        while True:\n            line = _get_user_input()\n            if len(line) > 0 and line[-1] == '\\n':\n                line = line[:-1]\n            if line == '':\n                break\n            output_file.write('HUMAN ++++ ' + line + '\\n')\n            # Get token-ids for the input sentence.\n            token_ids = data.sentence2id(enc_vocab, str(line))\n            if len(token_ids) > max_length:\n                print('Max length I can handle is:', max_length)\n                # The next input is read at the top of the loop; reading it\n                # here as well would silently drop one user message.\n                continue\n            # Which bucket does it belong to?\n            bucket_id = _find_right_bucket(len(token_ids))\n            # Get a 1-element batch to feed the sentence to the model.\n            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])], \n                                                                            bucket_id,\n                                                                            batch_size=1)\n            # Get output logits for the sentence.\n            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,\n                                           decoder_masks, bucket_id, True)\n            response = _construct_response(output_logits, inv_dec_vocab)\n            print(response)\n            output_file.write('BOT ++++ ' + response + '\\n')\n        output_file.write('=============================================\\n')\n    
    output_file.close()\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--mode', choices={'train', 'chat'},\n                        default='train', help=\"mode. if not specified, it's in the train mode\")\n    args = parser.parse_args()\n\n    if not os.path.isdir(config.PROCESSED_PATH):\n        data.prepare_raw_data()\n        data.process_data()\n    print('Data ready!')\n    # create checkpoints folder if there isn't one already\n    data.make_dir(config.CPT_PATH)\n\n    if args.mode == 'train':\n        train()\n    elif args.mode == 'chat':\n        chat()\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "2017/assignments/chatbot/config.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen as the starter code for assignment 3,\nclass CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\nThis file contains the hyperparameters for the model.\n\nSee readme.md for instruction on how to run the starter code.\n\"\"\"\n\n# parameters for processing the dataset\nDATA_PATH = '/Users/Chip/data/cornell movie-dialogs corpus'\nCONVO_FILE = 'movie_conversations.txt'\nLINE_FILE = 'movie_lines.txt'\nOUTPUT_FILE = 'output_convo.txt'\nPROCESSED_PATH = 'processed'\nCPT_PATH = 'checkpoints'\n\nTHRESHOLD = 2\n\nPAD_ID = 0\nUNK_ID = 1\nSTART_ID = 2\nEOS_ID = 3\n\nTESTSET_SIZE = 25000\n\n# model parameters\n\"\"\" Train encoder length distribution:\n[175, 92, 11883, 8387, 10656, 13613, 13480, 12850, 11802, 10165, \n8973, 7731, 7005, 6073, 5521, 5020, 4530, 4421, 3746, 3474, 3192, \n2724, 2587, 2413, 2252, 2015, 1816, 1728, 1555, 1392, 1327, 1248, \n1128, 1084, 1010, 884, 843, 755, 705, 660, 649, 594, 558, 517, 475, \n426, 444, 388, 349, 337]\nThese buckets size seem to work the best\n\"\"\"\n# [19530, 17449, 17585, 23444, 22884, 16435, 17085, 18291, 18931]\n# BUCKETS = [(6, 8), (8, 10), (10, 12), (13, 15), (16, 19), (19, 22), (23, 26), (29, 32), (39, 44)]\n\n# [37049, 33519, 30223, 33513, 37371]\n# BUCKETS = [(8, 10), (12, 14), (16, 19), (23, 26), (39, 43)]\n\n# BUCKETS = [(8, 10), (12, 14), (16, 19)]\nBUCKETS = [(16, 19)]\n\nNUM_LAYERS = 3\nHIDDEN_SIZE = 256\nBATCH_SIZE = 64\n\nLR = 0.5\nMAX_GRAD_NORM = 5.0\n\nNUM_SAMPLES = 512\n"
  },
  {
    "path": "2017/assignments/chatbot/data.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen as the starter code for assignment 3,\nclass CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\nThis file contains the code to do the pre-processing for the\nCornell Movie-Dialogs Corpus.\n\nSee readme.md for instruction on how to run the starter code.\n\"\"\"\nfrom __future__ import print_function\n\nimport os\nimport random\nimport re\n\nimport numpy as np\n\nimport config\n\ndef get_lines():\n    id2line = {}\n    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)\n    with open(file_path, 'rb') as f:\n        lines = f.readlines()\n        for line in lines:\n            parts = line.split(' +++$+++ ')\n            if len(parts) == 5:\n                if parts[4][-1] == '\\n':\n                    parts[4] = parts[4][:-1]\n                id2line[parts[0]] = parts[4]\n    return id2line\n\ndef get_convos():\n    \"\"\" Get conversations from the raw data \"\"\"\n    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)\n    convos = []\n    with open(file_path, 'rb') as f:\n        for line in f.readlines():\n            parts = line.split(' +++$+++ ')\n            if len(parts) == 4:\n                convo = []\n                for line in parts[3][1:-2].split(', '):\n                    convo.append(line[1:-1])\n                convos.append(convo)\n\n    return convos\n\ndef question_answers(id2line, convos):\n    \"\"\" Divide the dataset into two sets: questions and answers. 
\"\"\"\n    questions, answers = [], []\n    for convo in convos:\n        for index, line in enumerate(convo[:-1]):\n            questions.append(id2line[convo[index]])\n            answers.append(id2line[convo[index + 1]])\n    assert len(questions) == len(answers)\n    return questions, answers\n\ndef prepare_dataset(questions, answers):\n    # create path to store all the train & test encoder & decoder\n    make_dir(config.PROCESSED_PATH)\n    \n    # random convos to create the test set\n    test_ids = random.sample([i for i in range(len(questions))],config.TESTSET_SIZE)\n    \n    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']\n    files = []\n    for filename in filenames:\n        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'wb'))\n\n    for i in range(len(questions)):\n        if i in test_ids:\n            files[2].write(questions[i] + '\\n')\n            files[3].write(answers[i] + '\\n')\n        else:\n            files[0].write(questions[i] + '\\n')\n            files[1].write(answers[i] + '\\n')\n\n    for file in files:\n        file.close()\n\ndef make_dir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass\n\ndef basic_tokenizer(line, normalize_digits=True):\n    \"\"\" A basic tokenizer to tokenize text into tokens.\n    Feel free to change this to suit your need. 
\"\"\"\n    line = re.sub('<u>', '', line)\n    line = re.sub('</u>', '', line)\n    line = re.sub('\\[', '', line)\n    line = re.sub('\\]', '', line)\n    words = []\n    _WORD_SPLIT = re.compile(b\"([.,!?\\\"'-<>:;)(])\")\n    _DIGIT_RE = re.compile(r\"\\d\")\n    for fragment in line.strip().lower().split():\n        for token in re.split(_WORD_SPLIT, fragment):\n            if not token:\n                continue\n            if normalize_digits:\n                token = re.sub(_DIGIT_RE, b'#', token)\n            words.append(token)\n    return words\n\ndef build_vocab(filename, normalize_digits=True):\n    in_path = os.path.join(config.PROCESSED_PATH, filename)\n    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))\n\n    vocab = {}\n    with open(in_path, 'rb') as f:\n        for line in f.readlines():\n            for token in basic_tokenizer(line):\n                if not token in vocab:\n                    vocab[token] = 0\n                vocab[token] += 1\n\n    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)\n    with open(out_path, 'wb') as f:\n        f.write('<pad>' + '\\n')\n        f.write('<unk>' + '\\n')\n        f.write('<s>' + '\\n')\n        f.write('<\\s>' + '\\n') \n        index = 4\n        for word in sorted_vocab:\n            if vocab[word] < config.THRESHOLD:\n                with open('config.py', 'ab') as cf:\n                    if filename[-3:] == 'enc':\n                        cf.write('ENC_VOCAB = ' + str(index) + '\\n')\n                    else:\n                        cf.write('DEC_VOCAB = ' + str(index) + '\\n')\n                break\n            f.write(word + '\\n')\n            index += 1\n\ndef load_vocab(vocab_path):\n    with open(vocab_path, 'rb') as f:\n        words = f.read().splitlines()\n    return words, {words[i]: i for i in range(len(words))}\n\ndef sentence2id(vocab, line):\n    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]\n\ndef 
token2id(data, mode):\n    \"\"\" Convert all the tokens in the data into their corresponding\n    index in the vocabulary. \"\"\"\n    vocab_path = 'vocab.' + mode\n    in_path = data + '.' + mode\n    out_path = data + '_ids.' + mode\n\n    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))\n    in_file = open(os.path.join(config.PROCESSED_PATH, in_path), 'rb')\n    out_file = open(os.path.join(config.PROCESSED_PATH, out_path), 'wb')\n    \n    lines = in_file.read().splitlines()\n    for line in lines:\n        if mode == 'dec': # only decoder sequences get the '<s>' and '<\\s>' markers\n            ids = [vocab['<s>']]\n        else:\n            ids = []\n        ids.extend(sentence2id(vocab, line))\n        # ids.extend([vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)])\n        if mode == 'dec':\n            ids.append(vocab['<\\s>'])\n        out_file.write(' '.join(str(id_) for id_ in ids) + '\\n')\n    # close explicitly so the ids are flushed to disk before load_data reads them\n    in_file.close()\n    out_file.close()\n\ndef prepare_raw_data():\n    print('Preparing raw data into train set and test set ...')\n    id2line = get_lines()\n    convos = get_convos()\n    questions, answers = question_answers(id2line, convos)\n    prepare_dataset(questions, answers)\n\ndef process_data():\n    print('Preparing data to be model-ready ...')\n    build_vocab('train.enc')\n    build_vocab('train.dec')\n    token2id('train', 'enc')\n    token2id('train', 'dec')\n    token2id('test', 'enc')\n    token2id('test', 'dec')\n\ndef load_data(enc_filename, dec_filename, max_training_size=None):\n    encode_file = open(os.path.join(config.PROCESSED_PATH, enc_filename), 'rb')\n    decode_file = open(os.path.join(config.PROCESSED_PATH, dec_filename), 'rb')\n    encode, decode = encode_file.readline(), decode_file.readline()\n    data_buckets = [[] for _ in config.BUCKETS]\n    i = 0\n    while encode and decode:\n        if (i + 1) % 10000 == 0:\n            print(\"Bucketing conversation number\", i)\n        encode_ids = [int(id_) for id_ in encode.split()]\n        
decode_ids = [int(id_) for id_ in decode.split()]\n        for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):\n            if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:\n                data_buckets[bucket_id].append([encode_ids, decode_ids])\n                break\n        encode, decode = encode_file.readline(), decode_file.readline()\n        i += 1\n    return data_buckets\n\ndef _pad_input(input_, size):\n    return input_ + [config.PAD_ID] * (size - len(input_))\n\ndef _reshape_batch(inputs, size, batch_size):\n    \"\"\" Create batch-major inputs. Batch inputs are just re-indexed inputs\n    \"\"\"\n    batch_inputs = []\n    for length_id in range(size):\n        batch_inputs.append(np.array([inputs[batch_id][length_id]\n                                    for batch_id in range(batch_size)], dtype=np.int32))\n    return batch_inputs\n\n\ndef get_batch(data_bucket, bucket_id, batch_size=1):\n    \"\"\" Return one batch to feed into the model \"\"\"\n    # only pad to the max length of the bucket\n    encoder_size, decoder_size = config.BUCKETS[bucket_id]\n    encoder_inputs, decoder_inputs = [], []\n\n    for _ in range(batch_size):\n        encoder_input, decoder_input = random.choice(data_bucket)\n        # pad both encoder and decoder, reverse the encoder\n        encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size))))\n        decoder_inputs.append(_pad_input(decoder_input, decoder_size))\n\n    # now we create batch-major vectors from the data selected above.\n    batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size)\n    batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size)\n\n    # create decoder_masks to be 0 for decoders that are padding.\n    batch_masks = []\n    for length_id in range(decoder_size):\n        batch_mask = np.ones(batch_size, dtype=np.float32)\n        for batch_id in range(batch_size):\n           
 # we set mask to 0 if the corresponding target is a PAD symbol.\n            # the corresponding decoder is decoder_input shifted by 1 forward.\n            if length_id < decoder_size - 1:\n                target = decoder_inputs[batch_id][length_id + 1]\n            if length_id == decoder_size - 1 or target == config.PAD_ID:\n                batch_mask[batch_id] = 0.0\n        batch_masks.append(batch_mask)\n    return batch_encoder_inputs, batch_decoder_inputs, batch_masks\n\nif __name__ == '__main__':\n    prepare_raw_data()\n    process_data()"
  },
  {
    "path": "2017/assignments/chatbot/model.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen as the starter code for assignment 3,\nclass CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\nThis file contains the code to build the model\n\nSee readme.md for instruction on how to run the starter code.\n\"\"\"\nfrom __future__ import print_function\n\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport config\n\nclass ChatBotModel(object):\n    def __init__(self, forward_only, batch_size):\n        \"\"\"forward_only: if set, we do not construct the backward pass in the model.\n        \"\"\"\n        print('Initialize new model')\n        self.fw_only = forward_only\n        self.batch_size = batch_size\n    \n    def _create_placeholders(self):\n        # Feeds for inputs. 
It's a list of placeholders\n        print('Create placeholders')\n        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))\n                               for i in range(config.BUCKETS[-1][0])]\n        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))\n                               for i in range(config.BUCKETS[-1][1] + 1)]\n        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))\n                              for i in range(config.BUCKETS[-1][1] + 1)]\n\n        # Our targets are decoder inputs shifted by one (to ignore <s> symbol)\n        self.targets = self.decoder_inputs[1:]\n        \n    def _inference(self):\n        print('Create inference')\n        # If we use sampled softmax, we need an output projection.\n        # Sampled softmax only makes sense if we sample less than vocabulary size.\n        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:\n            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])\n            b = tf.get_variable('proj_b', [config.DEC_VOCAB])\n            self.output_projection = (w, b)\n\n        def sampled_loss(inputs, labels):\n            labels = tf.reshape(labels, [-1, 1])\n            return tf.nn.sampled_softmax_loss(tf.transpose(w), b, inputs, labels, \n                                              config.NUM_SAMPLES, config.DEC_VOCAB)\n        self.softmax_loss_function = sampled_loss\n\n        single_cell = tf.nn.rnn_cell.GRUCell(config.HIDDEN_SIZE)\n        self.cell = tf.nn.rnn_cell.MultiRNNCell([single_cell] * config.NUM_LAYERS)\n\n    def _create_loss(self):\n        print('Creating loss... 
\\nIt might take a couple of minutes depending on how many buckets you have.')\n        start = time.time()\n        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):\n            return tf.nn.seq2seq.embedding_attention_seq2seq(\n                    encoder_inputs, decoder_inputs, self.cell,\n                    num_encoder_symbols=config.ENC_VOCAB,\n                    num_decoder_symbols=config.DEC_VOCAB,\n                    embedding_size=config.HIDDEN_SIZE,\n                    output_projection=self.output_projection,\n                    feed_previous=do_decode)\n\n        if self.fw_only:\n            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(\n                                        self.encoder_inputs, \n                                        self.decoder_inputs, \n                                        self.targets,\n                                        self.decoder_masks, \n                                        config.BUCKETS, \n                                        lambda x, y: _seq2seq_f(x, y, True),\n                                        softmax_loss_function=self.softmax_loss_function)\n            # If we use output projection, we need to project outputs for decoding.\n            if self.output_projection:\n                for bucket in range(len(config.BUCKETS)):\n                    self.outputs[bucket] = [tf.matmul(output, \n                                            self.output_projection[0]) + self.output_projection[1]\n                                            for output in self.outputs[bucket]]\n        else:\n            self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(\n                                        self.encoder_inputs, \n                                        self.decoder_inputs, \n                                        self.targets,\n                                        self.decoder_masks,\n                                        config.BUCKETS,\n                      
                  lambda x, y: _seq2seq_f(x, y, False),\n                                        softmax_loss_function=self.softmax_loss_function)\n        print('Time:', time.time() - start)\n\n    def _creat_optimizer(self):\n        print('Create optimizer... \\nIt might take a couple of minutes depending on how many buckets you have.')\n        with tf.variable_scope('training') as scope:\n            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\n            if not self.fw_only:\n                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)\n                trainables = tf.trainable_variables()\n                self.gradient_norms = []\n                self.train_ops = []\n                start = time.time()\n                for bucket in range(len(config.BUCKETS)):\n                    \n                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket], \n                                                                 trainables),\n                                                                 config.MAX_GRAD_NORM)\n                    self.gradient_norms.append(norm)\n                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables), \n                                                            global_step=self.global_step))\n                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))\n                    start = time.time()\n\n\n    def _create_summary(self):\n        pass\n\n    def build_graph(self):\n        self._create_placeholders()\n        self._inference()\n        self._create_loss()\n        self._creat_optimizer()\n        self._create_summary()\n"
  },
  {
    "path": "2017/assignments/chatbot/output_convo.txt",
    "content": "HUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ really?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ how're you?\nBOT ++++ i ' m fine .\nHUMAN ++++ where're you from?\nBOT ++++ california .\nHUMAN ++++ which city?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ you don't like california?\nBOT ++++ oh . . .\nHUMAN ++++ why not?\nBOT ++++ i can ' t .\nHUMAN ++++ how old are you?\nBOT ++++ eighteen .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's your name?\nBOT ++++ what ?\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ prisoner .\nHUMAN ++++ ha ha you really don't want to tell your name\nBOT ++++ i ' m not going to do it .\nHUMAN ++++ how old are you?\nBOT ++++ can you guess ?\nHUMAN ++++ eighteen?\nBOT ++++ in the bathroom .\nHUMAN ++++ in the bathroom?\nBOT ++++ in the next time .\nHUMAN ++++ why are you talking so funny?\nBOT ++++ i ' m looking .\nHUMAN ++++ where are you from?\nBOT ++++ general .\nHUMAN ++++ where are you from?\nBOT ++++ how ?\nHUMAN ++++ do you like the us?\nBOT ++++ i ' m sure .\nHUMAN ++++ do you vote?\nBOT ++++ i ' m afraid .\nHUMAN ++++ what are you afraid of?\nBOT ++++ i ' m looking .\nHUMAN ++++ what are you looking for?\nBOT ++++ can you guess ?\nHUMAN ++++ no\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . 
what ' s your name ?\nHUMAN ++++ i'm chip\nBOT ++++ please .\nHUMAN ++++ please what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ where are you from?\nBOT ++++ california .\nHUMAN ++++ do you like california?\nBOT ++++ i ' m okay .\nHUMAN ++++ i like california a lot\nBOT ++++ of what ?\nHUMAN ++++ i live here\nBOT ++++ and what is it ?\nHUMAN ++++ california\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ i'm a student here\nBOT ++++ and get it out of here .\nHUMAN ++++ what do you like to do in your free time?\nBOT ++++ and what about the gun ?\nHUMAN ++++ do you have a gun?\nBOT ++++ yes .\nHUMAN ++++ you're dangerous!\nBOT ++++ i ' m not .\nHUMAN ++++ do you go hunting?\nBOT ++++ i can do .\nHUMAN ++++ what do you hunt?\nBOT ++++ virgin .\nHUMAN ++++ that's very bad\nBOT ++++ and you in love ?\nHUMAN ++++ no i'm not\nBOT ++++ please - -\nHUMAN ++++ okay, i am\nBOT ++++ and you are good .\nHUMAN ++++ yeah i am good too.\nBOT ++++ good good to you .\nHUMAN ++++ are you in love?\nBOT ++++ i ' m ready .\nHUMAN ++++ ready for what?\nBOT ++++ for what ?\nHUMAN ++++ for what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ tell me\nBOT ++++ i can do it .\nHUMAN ++++ okay, tell me what you're ready for\nBOT ++++ i ' m doing you .\nHUMAN ++++ no you're not\nBOT ++++ please !\nHUMAN ++++ this is a strange conversation\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ are you a guy?\nBOT ++++ of what ?\nHUMAN ++++ are you a boy?\nBOT ++++ i ' m fine .\nHUMAN ++++ are you a man?\nBOT ++++ i ' m a little .\nHUMAN ++++ what's your name?\nBOT ++++ chris -\nHUMAN ++++ chris who?\nBOT ++++ cornelius .\nHUMAN ++++ so i guess you're a guy.\nBOT ++++ i am .\nHUMAN ++++ okay\nBOT ++++ okay ?\nHUMAN ++++ do you have a girlfriend?\nBOT ++++ i ' m sure .\nHUMAN ++++ what's her name?\nBOT ++++ let ' s talk about something else .\n=============================================\nHUMAN ++++ hi chris\nBOT ++++ hi\nHUMAN ++++ how're you today?\nBOT ++++ okay .\nHUMAN ++++ the 
weather is nice today\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ just say the weather is nice\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how're you today?\nBOT ++++ okay .\nHUMAN ++++ just okay?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's on your mind?\nBOT ++++ i ' m in my room .\nHUMAN ++++ doing what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ you're boring.\nBOT ++++ i ' m not .\nHUMAN ++++ prove it\nBOT ++++ and what ?\nHUMAN ++++ what what?\nBOT ++++ sorry , i ' m still very stupid .\nHUMAN ++++ it's okay. we're all stupid sometimes.\nBOT ++++ ' s good to see .\nHUMAN ++++ what's good to see?\nBOT ++++ i ' m in .\nHUMAN ++++ you're in your room?\nBOT ++++ i ' m here .\nHUMAN ++++ where's your room?\nBOT ++++ i ' m in my room .\nHUMAN ++++ are you alone?\nBOT ++++ i ' m fine .\nHUMAN ++++ are you by yourself?\nBOT ++++ in the morning .\nHUMAN ++++ what time is it?\nBOT ++++ # # # # # # # .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . 
what ' s your name ?\nHUMAN ++++ i'm marie\nBOT ++++ you are .\nHUMAN ++++ yes i am\nBOT ++++ and you are my friend .\nHUMAN ++++ yes you are \nBOT ++++ and you are my best .\nHUMAN ++++ that's so sweet\nBOT ++++ please .\nHUMAN ++++ do you know what time it is?\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ what time is it?\nBOT ++++ 5 9 6 9 1 4 5 .\nHUMAN ++++ it doesn't make any sense\nBOT ++++ are you ?\nHUMAN ++++ i do \nBOT ++++ and you are ?\nHUMAN ++++ i'm a human.\nBOT ++++ you in .\nHUMAN ++++ are you in trouble?\nBOT ++++ i ' m in .\nHUMAN ++++ what trouble?\nBOT ++++ i ' m in love .\nHUMAN ++++ wow that's nice.\nBOT ++++ like that ?\nHUMAN ++++ yeah\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ who are you?\nBOT ++++ i don ' t understand .\nHUMAN ++++ who are you?\nBOT ++++ no idea .\nHUMAN ++++ what's your name?\nBOT ++++ chris -\nHUMAN ++++ chris who?\nBOT ++++ cornelius .\nHUMAN ++++ where are you from?\nBOT ++++ california .\nHUMAN ++++ are you american?\nBOT ++++ i ' m fine .\nHUMAN ++++ of course you're fine.\nBOT ++++ and you ?\nHUMAN ++++ i'm okay.\nBOT ++++ you are .\nHUMAN ++++ yes, i am.\nBOT ++++ and you are my friend .\nHUMAN ++++ yes, i am.\nBOT ++++ and you are my friend .\nHUMAN ++++ that's correct.\nBOT ++++ and you ' re in trouble ?\nHUMAN ++++ no, i'm not. 
why did you think so?\nBOT ++++ because it ' s a good idea .\nHUMAN ++++ what is your good idea?\nBOT ++++ i ' ll be fine .\n=============================================\nHUMAN ++++ hello\nBOT ++++ and you are in trouble ?\nHUMAN ++++ what did you say that?\nBOT ++++ ' s life .\nHUMAN ++++ i'm not in trouble.\nBOT ++++ what the hell is that ?\nHUMAN ++++ okay, what trouble?\nBOT ++++ i ' m looking .\nHUMAN ++++ why did you say i'm in trouble?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what do you want to talk about?\nBOT ++++ i can ' t .\nHUMAN ++++ what can't?\nBOT ++++ i ' ll have a drink .\nHUMAN ++++ okay, me too.\nBOT ++++ you were here ?\nHUMAN ++++ i'm still here.\nBOT ++++ what is it ?\nHUMAN ++++ i thought you wanted a drink.\nBOT ++++ i ' m not .\nHUMAN ++++ you just said so!\nBOT ++++ i was wrong .\nHUMAN ++++ it's fine.\nBOT ++++ okay ?\nHUMAN ++++ yeah\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ do you drink a lot?\nBOT ++++ i can do it .\nHUMAN ++++ do you drink often?\nBOT ++++ in the time .\nHUMAN ++++ that doesn't make any sense.\nBOT ++++ are you ?\nHUMAN ++++ what do you drink?\nBOT ++++ ' s in a drink .\nHUMAN ++++ do you drink alcohol?\nBOT ++++ in time .\nHUMAN ++++ i like beer.\nBOT ++++ oh . . .\nHUMAN ++++ don't you like beer?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ where do you live?\nBOT ++++ up .\nHUMAN ++++ up where?\nBOT ++++ in the hospital .\nHUMAN ++++ i'm sorry. are you okay?\nBOT ++++ i ' m fine .\nHUMAN ++++ why are you in the hospital?\nBOT ++++ and what ?\nHUMAN ++++ how many words do you know?\nBOT ++++ i ' d do it .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ whatever\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ you're boring.\nBOT ++++ i ' m not .\nHUMAN ++++ yes you are.\nBOT ++++ and you are my friend .\nHUMAN ++++ no i'm not.\nBOT ++++ please .\nHUMAN ++++ okay fine. 
i'm your friend.\nBOT ++++ you ' re my mother ?\nHUMAN ++++ i have no son like you.\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ i have no son.\nBOT ++++ i see you .\nHUMAN ++++ i have no children.\nBOT ++++ you are .\nHUMAN ++++ omg you're so dumb.\nBOT ++++ i ' m not .\nHUMAN ++++ you're stupid!\nBOT ++++ i ' m not .\nHUMAN ++++ do you always talk like this?\nBOT ++++ of course .\nHUMAN ++++ how old are you?\nBOT ++++ eighteen .\nHUMAN ++++ you're big enough.\nBOT ++++ i ' m not .\nHUMAN ++++ are you an adult?\nBOT ++++ i ' m going .\nHUMAN ++++ you're going to be an adult?\nBOT ++++ i ' m looking for you .\nHUMAN ++++ i'm in my room.\nBOT ++++ and your mother ?\nHUMAN ++++ she's in vietnam.\nBOT ++++ you ' re here .\nHUMAN ++++ yes i'm here.\nBOT ++++ and what are you talking about ?\nHUMAN ++++ you asked me where i am.\nBOT ++++ and you are your friend .\nHUMAN ++++ i need to go now.\nBOT ++++ i will not .\nHUMAN ++++ that's fine. you can stay. i'm leaving.\nBOT ++++ oh . . .\nHUMAN ++++ bye\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ goodbye\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ bye bye\nBOT ++++ okay ?\nHUMAN ++++ see you later\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . what ' s your name ?\nHUMAN ++++ i'm maria\nBOT ++++ please .\nHUMAN ++++ i'm jackson\nBOT ++++ you are .\nHUMAN ++++ i'm jack\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ i have no go\nBOT ++++ you are ?\nHUMAN ++++ i have to go\nBOT ++++ and what do you mean ?\nHUMAN ++++ i'm leaving.\nBOT ++++ what ?\nHUMAN ++++ yeah, bye\nBOT ++++ bye . 
have a nice day !\n=============================================\nHUMAN ++++ that's very bad.\nBOT ++++ i ' ve got to work .\nHUMAN ++++ what do you do?\nBOT ++++ i ' ll be there .\nHUMAN ++++ where do you work?\nBOT ++++ in the trunk .\nHUMAN ++++ are you a machenics?\nBOT ++++ i ' m not .\nHUMAN ++++ what are you?\nBOT ++++ no idea .\n=============================================\n"
  },
  {
    "path": "2017/assignments/exercises/e01.py",
    "content": "\"\"\"\nSimple exercises to get used to TensorFlow API\nYou should thoroughly test your code\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\nsess = tf.InteractiveSession()\n###############################################################################\n# 1a: Create two random 0-d tensors x and y of any distribution.\n# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.\n# Hint: look up tf.cond()\n# I do the first problem for you\n###############################################################################\n\nx = tf.random_uniform([])  # Empty array as shape creates a scalar.\ny = tf.random_uniform([])\nout = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))\nprint(sess.run(out))\n\n###############################################################################\n# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).\n# Return x + y if x < y, x - y if x > y, 0 otherwise.\n# Hint: Look up tf.case().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] \n# and y as a tensor of zeros with the same shape as x.\n# Return a boolean tensor that yields Trues if x equals y element-wise.\n# Hint: Look up tf.equal().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1d: Create the tensor x of value \n# [29.05088806,  27.61298943,  31.19073486,  29.35532951,\n#  30.97266006,  26.67541885,  38.08450317,  20.74983215,\n#  34.94445419,  34.45999146,  29.06485367,  36.01657104,\n#  27.88236427,  20.56035233,  30.20379066,  29.51215172,\n#  33.71149445,  28.59134293,  36.05556488,  28.66994858].\n# Get the indices of elements in x 
whose values are greater than 30.\n# Hint: Use tf.where().\n# Then extract elements whose values are greater than 30.\n# Hint: Use tf.gather().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,\n# 2, ..., 6\n# Hint: Use tf.range() and tf.diag().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.\n# Calculate its determinant.\n# Hint: Look at tf.matrix_determinant().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].\n# Return the unique elements in x\n# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1h: Create two tensors x and y of shape 300 from any normal distribution,\n# as long as they are from the same distribution.\n# Use tf.cond() to return:\n# - The mean squared error of (x - y) if the average of all elements in (x - y)\n#   is negative, or\n# - The sum of absolute value of all elements in the tensor (x - y) otherwise.\n# Hint: see the Huber loss function in the lecture slides 3.\n###############################################################################\n\n# YOUR CODE"
  },
  {
    "path": "2017/assignments/exercises/e01_sol.py",
    "content": "\"\"\"\nSolution to simple TensorFlow exercises\nFor the problems \n\"\"\"\nimport tensorflow as tf\n\n###############################################################################\n# 1a: Create two random 0-d tensors x and y of any distribution.\n# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.\n# Hint: look up tf.cond()\n# I do the first problem for you\n###############################################################################\n\nx = tf.random_uniform([])  # Empty array as shape creates a scalar.\ny = tf.random_uniform([])\nout = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))\n\n###############################################################################\n# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).\n# Return x + y if x < y, x - y if x > y, 0 otherwise.\n# Hint: Look up tf.case().\n###############################################################################\n\nx = tf.random_uniform([], -1, 1, dtype=tf.float32)\ny = tf.random_uniform([], -1, 1, dtype=tf.float32)\nout = tf.case({tf.less(x, y): lambda: tf.add(x, y), \n\t\t\ttf.greater(x, y): lambda: tf.subtract(x, y)}, \n\t\t\tdefault=lambda: tf.constant(0.0), exclusive=True)\nprint(x)\nsess = tf.InteractiveSession()\nprint(sess.run(x))\n\n###############################################################################\n# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] \n# and y as a tensor of zeros with the same shape as x.\n# Return a boolean tensor that yields Trues if x equals y element-wise.\n# Hint: Look up tf.equal().\n###############################################################################\n\nx = tf.constant([[0, -2, -1], [0, 1, 2]])\ny = tf.zeros_like(x)\nout = tf.equal(x, y)\n\n###############################################################################\n# 1d: Create the tensor x of value \n# [29.05088806,  27.61298943,  31.19073486,  29.35532951,\n#  30.97266006, 
 26.67541885,  38.08450317,  20.74983215,\n#  34.94445419,  34.45999146,  29.06485367,  36.01657104,\n#  27.88236427,  20.56035233,  30.20379066,  29.51215172,\n#  33.71149445,  28.59134293,  36.05556488,  28.66994858].\n# Get the indices of elements in x whose values are greater than 30.\n# Hint: Use tf.where().\n# Then extract elements whose values are greater than 30.\n# Hint: Use tf.gather().\n###############################################################################\n\nx = tf.constant([29.05088806,  27.61298943,  31.19073486,  29.35532951,\n\t\t        30.97266006,  26.67541885,  38.08450317,  20.74983215,\n\t\t        34.94445419,  34.45999146,  29.06485367,  36.01657104,\n\t\t        27.88236427,  20.56035233,  30.20379066,  29.51215172,\n\t\t        33.71149445,  28.59134293,  36.05556488,  28.66994858])\nindices = tf.where(x > 30)\nout = tf.gather(x, indices)\n\n###############################################################################\n# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,\n# 2, ..., 6\n# Hint: Use tf.range() and tf.diag().\n###############################################################################\n\nvalues = tf.range(1, 7)\nout = tf.diag(values)\n\n###############################################################################\n# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.\n# Calculate its determinant.\n# Hint: Look at tf.matrix_determinant().\n###############################################################################\n\nm = tf.random_normal([10, 10], mean=10, stddev=1)\nout = tf.matrix_determinant(m)\n\n###############################################################################\n# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].\n# Return the unique elements in x\n# Hint: use tf.unique(). 
Keep in mind that tf.unique() returns a tuple.\n###############################################################################\n\nx = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9])\nunique_values, indices = tf.unique(x)\n\n###############################################################################\n# 1h: Create two tensors x and y of shape 300 from any normal distribution,\n# as long as they are from the same distribution.\n# Use tf.cond() to return:\n# - The mean squared error of (x - y) if the average of all elements in (x - y)\n#   is negative, or\n# - The sum of absolute value of all elements in the tensor (x - y) otherwise.\n# Hint: see the Huber loss function in the lecture slides 3.\n###############################################################################\n\nx = tf.random_normal([300], mean=5, stddev=1)\ny = tf.random_normal([300], mean=5, stddev=1)\naverage = tf.reduce_mean(x - y)\ndef f1(): return tf.reduce_mean(tf.square(x - y))\ndef f2(): return tf.reduce_sum(tf.abs(x - y))\nout = tf.cond(average < 0, f1, f2)"
  },
  {
    "path": "2017/assignments/style_transfer/readme.md",
    "content": "For detailed instruction, you should read the assignment handout on the course website: http://web.stanford.edu/class/cs20si/assignments/a2.pdf\n"
  },
  {
    "path": "2017/assignments/style_transfer/style_transfer.py",
    "content": "\"\"\" An implementation of the paper \"A Neural Algorithm of Artistic Style\"\nby Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport vgg_model\nimport utils\n\n# parameters to manage experiments\nSTYLE = 'guernica'\nCONTENT = 'deadpool'\nSTYLE_IMAGE = 'styles/' + STYLE + '.jpg'\nCONTENT_IMAGE = 'content/' + CONTENT + '.jpg'\nIMAGE_HEIGHT = 250\nIMAGE_WIDTH = 333\nNOISE_RATIO = 0.6 # percentage of weight of the noise for intermixing with the content image\n\nCONTENT_WEIGHT = 0.01\nSTYLE_WEIGHT = 1\n\n# Layers used for style features. You can change this.\nSTYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']\nW = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weights to deeper layers.\n\n# Layer used for content features. You can change this.\nCONTENT_LAYER = 'conv4_2'\n\nITERS = 300\nLR = 2.0\n\nMEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))\n\"\"\" MEAN_PIXELS is defined according to description on their github:\nhttps://gist.github.com/ksimonyan/211839e770f7b538e2d8\n'In the paper, the model is denoted as the configuration D trained with scale jittering. \nThe input images should be zero-centered by mean pixel (rather than mean image) subtraction. 
\nNamely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].'\n\"\"\"\n\n# VGG-19 parameters file\nVGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'\nVGG_MODEL = 'imagenet-vgg-verydeep-19.mat'\nEXPECTED_BYTES = 534904783\n\ndef _create_content_loss(p, f):\n    \"\"\" Calculate the loss between the feature representation of the\n    content image and the generated image.\n    \n    Inputs: \n        p, f are just P, F in the paper \n        (read the assignment handout if you're confused)\n        Note: we won't use the coefficient 0.5 as defined in the paper\n        but the coefficient as defined in the assignment handout.\n    Output:\n        the content loss\n\n    \"\"\"\n    return tf.reduce_sum((f - p) ** 2) / (4.0 * p.size)\n\ndef _gram_matrix(F, N, M):\n    \"\"\" Create and return the gram matrix for tensor F\n        Hint: you'll first have to reshape F\n    \"\"\"\n    F = tf.reshape(F, (M, N))\n    return tf.matmul(tf.transpose(F), F)\n\ndef _single_style_loss(a, g):\n    \"\"\" Calculate the style loss at a certain layer\n    Inputs:\n        a is the feature representation of the real image\n        g is the feature representation of the generated image\n    Output:\n        the style loss at a certain layer (which is E_l in the paper)\n\n    Hint: 1. you'll have to use the function _gram_matrix()\n        2. we'll use the same coefficient for style loss as in the paper\n        3. 
a and g are feature representation, not gram matrices\n    \"\"\"\n    N = a.shape[3] # number of filters\n    M = a.shape[1] * a.shape[2] # height times width of the feature map\n    A = _gram_matrix(a, N, M)\n    G = _gram_matrix(g, N, M)\n    return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2))\n\ndef _create_style_loss(A, model):\n    \"\"\" Return the total style loss\n    \"\"\"\n    n_layers = len(STYLE_LAYERS)\n    E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)]\n    \n    ###############################\n    ## TO DO: return total style loss\n    return sum([W[i] * E[i] for i in range(n_layers)])\n    ###############################\n\ndef _create_losses(model, input_image, content_image, style_image):\n    with tf.variable_scope('loss') as scope:\n        with tf.Session() as sess:\n            sess.run(input_image.assign(content_image)) # assign content image to the input variable\n            p = sess.run(model[CONTENT_LAYER])\n        content_loss = _create_content_loss(p, model[CONTENT_LAYER])\n\n        with tf.Session() as sess:\n            sess.run(input_image.assign(style_image))\n            A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS])                              \n        style_loss = _create_style_loss(A, model)\n\n        ##########################################\n        ## TO DO: create total loss. 
\n        ## Hint: don't forget the content loss and style loss weights\n        total_loss = CONTENT_WEIGHT * content_loss + STYLE_WEIGHT * style_loss\n        ##########################################\n\n    return content_loss, style_loss, total_loss\n\ndef _create_summary(model):\n    \"\"\" Create summary ops necessary\n        Hint: don't forget to merge them\n    \"\"\"\n    with tf.name_scope('summaries'):\n        tf.summary.scalar('content loss', model['content_loss'])\n        tf.summary.scalar('style loss', model['style_loss'])\n        tf.summary.scalar('total loss', model['total_loss'])\n        tf.summary.histogram('histogram content loss', model['content_loss'])\n        tf.summary.histogram('histogram style loss', model['style_loss'])\n        tf.summary.histogram('histogram total loss', model['total_loss'])\n        return tf.summary.merge_all()\n\ndef train(model, generated_image, initial_image):\n    \"\"\" Train your model.\n    Don't forget to create folders for checkpoints and outputs.\n    \"\"\"\n    skip_step = 1\n    with tf.Session() as sess:\n        saver = tf.train.Saver()\n        ###############################\n        ## TO DO: \n        ## 1. initialize your variables\n        ## 2. 
create writer to write your graph\n        sess.run(tf.global_variables_initializer())\n        writer = tf.summary.FileWriter('graphs', sess.graph)\n        ###############################\n        sess.run(generated_image.assign(initial_image))\n        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))\n        if ckpt and ckpt.model_checkpoint_path:\n            saver.restore(sess, ckpt.model_checkpoint_path)\n        initial_step = model['global_step'].eval()\n        \n        start_time = time.time()\n        for index in range(initial_step, ITERS):\n            if index >= 5 and index < 20:\n                skip_step = 10\n            elif index >= 20:\n                skip_step = 20\n            \n            sess.run(model['optimizer'])\n            if (index + 1) % skip_step == 0:\n                ###############################\n                ## TO DO: obtain generated image and loss\n                gen_image, total_loss, summary = sess.run([generated_image, model['total_loss'], \n                                                             model['summary_op']])\n\n                ###############################\n                gen_image = gen_image + MEAN_PIXELS\n                writer.add_summary(summary, global_step=index)\n                print('Step {}\\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))\n                print('   Loss: {:5.1f}'.format(total_loss))\n                print('   Time: {}'.format(time.time() - start_time))\n                start_time = time.time()\n\n                filename = 'outputs/%d.png' % (index)\n                utils.save_image(filename, gen_image)\n\n                if (index + 1) % 20 == 0:\n                    saver.save(sess, 'checkpoints/style_transfer', index)\n\ndef main():\n    with tf.variable_scope('input') as scope:\n        # use variable instead of placeholder because we're training the initial image to make it\n        # look 
like both the content image and the style image\n        input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32)\n    \n    utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES)\n    utils.make_dir('checkpoints')\n    utils.make_dir('outputs')\n    model = vgg_model.load_vgg(VGG_MODEL, input_image)\n    model['global_step'] = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\n    content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)\n    content_image = content_image - MEAN_PIXELS\n    style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)\n    style_image = style_image - MEAN_PIXELS\n\n    model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(model, \n                                                    input_image, content_image, style_image)\n    ###############################\n    ## TO DO: create optimizer\n    model['optimizer'] = tf.train.AdamOptimizer(LR).minimize(model['total_loss'], \n                                                            global_step=model['global_step'])\n    ###############################\n    model['summary_op'] = _create_summary(model)\n\n    initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO)\n    train(model, input_image, initial_image)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "2017/assignments/style_transfer/utils.py",
    "content": "\"\"\" Utils needed for the implementation of the paper \"A Neural Algorithm of Artistic Style\"\nby Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\nfrom __future__ import print_function\n\nimport os\n\nfrom PIL import Image, ImageOps\nimport numpy as np\nimport scipy.misc\nfrom six.moves import urllib\n\ndef download(download_link, file_name, expected_bytes):\n    \"\"\" Download the pretrained VGG-19 model if it's not already downloaded \"\"\"\n    if os.path.exists(file_name):\n        print(\"VGG-19 pre-trained model ready\")\n        return\n    print(\"Downloading the VGG pre-trained model. This might take a while ...\")\n    file_name, _ = urllib.request.urlretrieve(download_link, file_name)\n    file_stat = os.stat(file_name)\n    if file_stat.st_size == expected_bytes:\n        print('Successfully downloaded VGG-19 pre-trained model', file_name)\n    else:\n        raise Exception('File ' + file_name +\n                        ' might be corrupted. 
You should try downloading it with a browser.')\n\ndef get_resized_image(img_path, height, width, save=True):\n    image = Image.open(img_path)\n    # it's because PIL is column major so you have to change place of width and height\n    # this is stupid, i know\n    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)\n    if save:\n        image_dirs = img_path.split('/')\n        image_dirs[-1] = 'resized_' + image_dirs[-1]\n        out_path = '/'.join(image_dirs)\n        if not os.path.exists(out_path):\n            image.save(out_path)\n    image = np.asarray(image, np.float32)\n    return np.expand_dims(image, 0)\n\ndef generate_noise_image(content_image, height, width, noise_ratio=0.6):\n    noise_image = np.random.uniform(-20, 20, \n                                    (1, height, width, 3)).astype(np.float32)\n    return noise_image * noise_ratio + content_image * (1 - noise_ratio)\n\ndef save_image(path, image):\n    # Output should add back the mean pixels we subtracted at the beginning\n    image = image[0] # the image\n    image = np.clip(image, 0, 255).astype('uint8')\n    scipy.misc.imsave(path, image)\n\ndef make_dir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass"
  },
  {
    "path": "2017/assignments/style_transfer/vgg_model.py",
    "content": "\"\"\" Load VGGNet weights needed for the implementation of the paper \n\"A Neural Algorithm of Artistic Style\" by Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\nimport scipy.io\n\ndef _weights(vgg_layers, layer, expected_layer_name):\n    \"\"\" Return the weights and biases already trained by VGG\n    \"\"\"\n    W = vgg_layers[0][layer][0][0][2][0][0]\n    b = vgg_layers[0][layer][0][0][2][0][1]\n    layer_name = vgg_layers[0][layer][0][0][0][0]\n    assert layer_name == expected_layer_name\n    return W, b.reshape(b.size)\n\ndef _conv2d_relu(vgg_layers, prev_layer, layer, layer_name):\n    \"\"\" Return the Conv2D layer with RELU using the weights, biases from the VGG\n    model at 'layer'.\n    Inputs:\n        vgg_layers: holding all the layers of VGGNet\n        prev_layer: the output tensor from the previous layer\n        layer: the index to current layer in vgg_layers\n        layer_name: the string that is the name of the current layer.\n                    It's used to specify variable_scope.\n\n    Output:\n        relu applied on the convolution.\n\n    Note that you first need to obtain W and b from vgg-layers using the function\n    _weights() defined above.\n    W and b returned from _weights() are numpy arrays, so you have\n    to convert them to TF tensors using tf.constant.\n    Note that you'll have to do apply relu on the convolution.\n    Hint for choosing strides size: \n        for small images, you probably don't want to skip any pixel\n    \"\"\"\n    with tf.variable_scope(layer_name) as scope:\n        W, b = _weights(vgg_layers, layer, layer_name)\n        W = tf.constant(W, name='weights')\n        b = tf.constant(b, name='bias')\n        conv2d = 
tf.nn.conv2d(prev_layer, filter=W, strides=[1, 1, 1, 1], padding='SAME')\n    return tf.nn.relu(conv2d + b)\n\ndef _avgpool(prev_layer):\n    \"\"\" Return the average pooling layer. The paper suggests that average pooling\n    actually works better than max pooling.\n    Input:\n        prev_layer: the output tensor from the previous layer\n\n    Output:\n        the output of the tf.nn.avg_pool() function.\n    Hint for choosing strides and ksize: choose what you feel appropriate\n    \"\"\"\n    return tf.nn.avg_pool(prev_layer, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], \n                          padding='SAME', name='avg_pool_')\n\ndef load_vgg(path, input_image):\n    \"\"\" Load VGG into a TensorFlow model.\n    Use a dictionary to hold the model instead of using a Python class\n    \"\"\"\n    vgg = scipy.io.loadmat(path)\n    vgg_layers = vgg['layers']\n\n    graph = {} \n    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')\n    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')\n    graph['avgpool1'] = _avgpool(graph['conv1_2'])\n    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')\n    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')\n    graph['avgpool2'] = _avgpool(graph['conv2_2'])\n    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')\n    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')\n    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')\n    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')\n    graph['avgpool3'] = _avgpool(graph['conv3_4'])\n    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')\n    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')\n    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')\n    graph['conv4_4']  = 
_conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')\n    graph['avgpool4'] = _avgpool(graph['conv4_4'])\n    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')\n    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')\n    graph['conv5_3']  = _conv2d_relu(vgg_layers, graph['conv5_2'], 32, 'conv5_3')\n    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')\n    graph['avgpool5'] = _avgpool(graph['conv5_4'])\n    \n    return graph"
  },
  {
    "path": "2017/assignments/style_transfer_starter/readme.md",
    "content": "For detailed instructions, read the assignment handout on the course website: http://web.stanford.edu/class/cs20si/assignments/a2.pdf\n"
  },
  {
    "path": "2017/assignments/style_transfer_starter/style_transfer.py",
    "content": "\"\"\" An implementation of the paper \"A Neural Algorithm of Artistic Style\"\nby Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport vgg_model\nimport utils\n\n# parameters to manage experiments\nSTYLE = 'guernica'\nCONTENT = 'deadpool'\nSTYLE_IMAGE = 'styles/' + STYLE + '.jpg'\nCONTENT_IMAGE = 'content/' + CONTENT + '.jpg'\nIMAGE_HEIGHT = 250\nIMAGE_WIDTH = 333\nNOISE_RATIO = 0.6 # percentage of weight of the noise for intermixing with the content image\n\n# Layers used for style features. You can change this.\nSTYLE_LAYERS = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']\nW = [0.5, 1.0, 1.5, 3.0, 4.0] # give more weights to deeper layers.\n\n# Layer used for content features. You can change this.\nCONTENT_LAYER = 'conv4_2'\n\nITERS = 300\nLR = 2.0\n\nSAVE_EVERY = 20\n\nMEAN_PIXELS = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))\n\"\"\" MEAN_PIXELS is defined according to description on their github:\nhttps://gist.github.com/ksimonyan/211839e770f7b538e2d8\n'In the paper, the model is denoted as the configuration D trained with scale jittering. \nThe input images should be zero-centered by mean pixel (rather than mean image) subtraction. 
\nNamely, the following BGR values should be subtracted: [103.939, 116.779, 123.68].'\n\"\"\"\n\n# VGG-19 parameters file\nVGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'\nVGG_MODEL = 'imagenet-vgg-verydeep-19.mat'\nEXPECTED_BYTES = 534904783\n\ndef _create_content_loss(p, f):\n    \"\"\" Calculate the loss between the feature representation of the\n    content image and the generated image.\n    \n    Inputs: \n        p, f are just P, F in the paper \n        (read the assignment handout if you're confused)\n        Note: we won't use the coefficient 0.5 as defined in the paper\n        but the coefficient as defined in the assignment handout.\n    Output:\n        the content loss\n\n    \"\"\"\n    pass\n\ndef _gram_matrix(F, N, M):\n    \"\"\" Create and return the gram matrix for tensor F\n        Hint: you'll first have to reshape F\n    \"\"\"\n    pass\n\ndef _single_style_loss(a, g):\n    \"\"\" Calculate the style loss at a certain layer\n    Inputs:\n        a is the feature representation of the real image\n        g is the feature representation of the generated image\n    Output:\n        the style loss at a certain layer (which is E_l in the paper)\n\n    Hint: 1. you'll have to use the function _gram_matrix()\n        2. we'll use the same coefficient for style loss as in the paper\n        3. 
a and g are feature representation, not gram matrices\n    \"\"\"\n    pass\n\ndef _create_style_loss(A, model):\n    \"\"\" Return the total style loss\n    \"\"\"\n    n_layers = len(STYLE_LAYERS)\n    E = [_single_style_loss(A[i], model[STYLE_LAYERS[i]]) for i in range(n_layers)]\n    \n    ###############################\n    ## TO DO: return total style loss\n    pass\n    ###############################\n\ndef _create_losses(model, input_image, content_image, style_image):\n    with tf.variable_scope('loss') as scope:\n        with tf.Session() as sess:\n            sess.run(input_image.assign(content_image)) # assign content image to the input variable\n            p = sess.run(model[CONTENT_LAYER])\n        content_loss = _create_content_loss(p, model[CONTENT_LAYER])\n\n        with tf.Session() as sess:\n            sess.run(input_image.assign(style_image))\n            A = sess.run([model[layer_name] for layer_name in STYLE_LAYERS])                              \n        style_loss = _create_style_loss(A, model)\n\n        ##########################################\n        ## TO DO: create total loss. \n        ## Hint: don't forget the content loss and style loss weights\n        \n        ##########################################\n\n    return content_loss, style_loss, total_loss\n\ndef _create_summary(model):\n    \"\"\" Create summary ops necessary\n        Hint: don't forget to merge them\n    \"\"\"\n    pass\n\ndef train(model, generated_image, initial_image):\n    \"\"\" Train your model.\n    Don't forget to create folders for checkpoints and outputs.\n    \"\"\"\n    skip_step = 1\n    with tf.Session() as sess:\n        saver = tf.train.Saver()\n        ###############################\n        ## TO DO: \n        ## 1. initialize your variables\n        ## 2. 
create writer to write your graph\n        ###############################\n        sess.run(generated_image.assign(initial_image))\n        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))\n        if ckpt and ckpt.model_checkpoint_path:\n            saver.restore(sess, ckpt.model_checkpoint_path)\n        initial_step = model['global_step'].eval()\n        \n        start_time = time.time()\n        for index in range(initial_step, ITERS):\n            if index >= 5 and index < 20:\n                skip_step = 10\n            elif index >= 20:\n                skip_step = 20\n            \n            sess.run(model['optimizer'])\n            if (index + 1) % skip_step == 0:\n                ###############################\n                ## TO DO: obtain generated image and loss\n\n                ###############################\n                gen_image = gen_image + MEAN_PIXELS\n                writer.add_summary(summary, global_step=index)\n                print('Step {}\\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))\n                print('   Loss: {:5.1f}'.format(total_loss))\n                print('   Time: {}'.format(time.time() - start_time))\n                start_time = time.time()\n\n                filename = 'outputs/%d.png' % (index)\n                utils.save_image(filename, gen_image)\n\n                if (index + 1) % SAVE_EVERY == 0:\n                    saver.save(sess, 'checkpoints/style_transfer', index)\n\ndef main():\n    with tf.variable_scope('input') as scope:\n        # use variable instead of placeholder because we're training the intial image to make it\n        # look like both the content image and the style image\n        input_image = tf.Variable(np.zeros([1, IMAGE_HEIGHT, IMAGE_WIDTH, 3]), dtype=tf.float32)\n    \n    utils.download(VGG_DOWNLOAD_LINK, VGG_MODEL, EXPECTED_BYTES)\n    utils.make_dir('checkpoints')\n    utils.make_dir('outputs')\n    model = vgg_model.load_vgg(VGG_MODEL, 
input_image)\n    model['global_step'] = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n    \n    content_image = utils.get_resized_image(CONTENT_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)\n    content_image = content_image - MEAN_PIXELS\n    style_image = utils.get_resized_image(STYLE_IMAGE, IMAGE_HEIGHT, IMAGE_WIDTH)\n    style_image = style_image - MEAN_PIXELS\n\n    model['content_loss'], model['style_loss'], model['total_loss'] = _create_losses(model, \n                                                    input_image, content_image, style_image)\n    ###############################\n    ## TO DO: create optimizer\n    ## model['optimizer'] = ...\n    ###############################\n    model['summary_op'] = _create_summary(model)\n\n    initial_image = utils.generate_noise_image(content_image, IMAGE_HEIGHT, IMAGE_WIDTH, NOISE_RATIO)\n    train(model, input_image, initial_image)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "2017/assignments/style_transfer_starter/utils.py",
    "content": "\"\"\" Utils needed for the implementation of the paper \"A Neural Algorithm of Artistic Style\"\nby Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\nfrom __future__ import print_function\n\nimport os\n\nfrom PIL import Image, ImageOps\nimport numpy as np\nimport scipy.misc\nfrom six.moves import urllib\n\ndef download(download_link, file_name, expected_bytes):\n    \"\"\" Download the pretrained VGG-19 model if it's not already downloaded \"\"\"\n    if os.path.exists(file_name):\n        print(\"VGG-19 pre-trained model ready\")\n        return\n    print(\"Downloading the VGG pre-trained model. This might take a while ...\")\n    file_name, _ = urllib.request.urlretrieve(download_link, file_name)\n    file_stat = os.stat(file_name)\n    if file_stat.st_size == expected_bytes:\n        print('Successfully downloaded VGG-19 pre-trained model', file_name)\n    else:\n        raise Exception('File ' + file_name +\n                        ' might be corrupted. 
You should try downloading it with a browser.')\n\ndef get_resized_image(img_path, height, width, save=True):\n    image = Image.open(img_path)\n    # it's because PIL is column major so you have to change place of width and height\n    # this is stupid, i know\n    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)\n    if save:\n        image_dirs = img_path.split('/')\n        image_dirs[-1] = 'resized_' + image_dirs[-1]\n        out_path = '/'.join(image_dirs)\n        if not os.path.exists(out_path):\n            image.save(out_path)\n    image = np.asarray(image, np.float32)\n    return np.expand_dims(image, 0)\n\ndef generate_noise_image(content_image, height, width, noise_ratio=0.6):\n    noise_image = np.random.uniform(-20, 20, \n                                    (1, height, width, 3)).astype(np.float32)\n    return noise_image * noise_ratio + content_image * (1 - noise_ratio)\n\ndef save_image(path, image):\n    # Output should add back the mean pixels we subtracted at the beginning\n    image = image[0] # the image\n    image = np.clip(image, 0, 255).astype('uint8')\n    scipy.misc.imsave(path, image)\n\ndef make_dir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass"
  },
  {
    "path": "2017/assignments/style_transfer_starter/vgg_model.py",
    "content": "\"\"\" Load VGGNet weights needed for the implementation of the paper \n\"A Neural Algorithm of Artistic Style\" by Gatys et al. in TensorFlow.\n\nAuthor: Chip Huyen (huyenn@stanford.edu)\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\nFor more details, please read the assignment handout:\nhttp://web.stanford.edu/class/cs20si/assignments/a2.pdf\n\"\"\"\n\nimport numpy as np\nimport tensorflow as tf\nimport scipy.io\n\ndef _weights(vgg_layers, layer, expected_layer_name):\n    \"\"\" Return the weights and biases already trained by VGG\n    \"\"\"\n    W = vgg_layers[0][layer][0][0][2][0][0]\n    b = vgg_layers[0][layer][0][0][2][0][1]\n    layer_name = vgg_layers[0][layer][0][0][0][0]\n    assert layer_name == expected_layer_name\n    return W, b.reshape(b.size)\n\ndef _conv2d_relu(vgg_layers, prev_layer, layer, layer_name):\n    \"\"\" Return the Conv2D layer with RELU using the weights, biases from the VGG\n    model at 'layer'.\n    Inputs:\n        vgg_layers: holding all the layers of VGGNet\n        prev_layer: the output tensor from the previous layer\n        layer: the index to current layer in vgg_layers\n        layer_name: the string that is the name of the current layer.\n                    It's used to specify variable_scope.\n\n    Output:\n        relu applied on the convolution.\n\n    Note that you first need to obtain W and b from vgg-layers using the function\n    _weights() defined above.\n    W and b returned from _weights() are numpy arrays, so you have\n    to convert them to TF tensors using tf.constant.\n    Note that you'll have to do apply relu on the convolution.\n    Hint for choosing strides size: \n        for small images, you probably don't want to skip any pixel\n    \"\"\"\n    pass\n\ndef _avgpool(prev_layer):\n    \"\"\" Return the average pooling layer. 
The paper suggests that average pooling\n    actually works better than max pooling.\n    Input:\n        prev_layer: the output tensor from the previous layer\n\n    Output:\n        the output of the tf.nn.avg_pool() function.\n    Hint for choosing strides and ksize: choose what you feel appropriate\n    \"\"\"\n    pass\n\ndef load_vgg(path, input_image):\n    \"\"\" Load VGG into a TensorFlow model.\n    Use a dictionary to hold the model instead of using a Python class\n    \"\"\"\n    vgg = scipy.io.loadmat(path)\n    vgg_layers = vgg['layers']\n\n    graph = {} \n    graph['conv1_1']  = _conv2d_relu(vgg_layers, input_image, 0, 'conv1_1')\n    graph['conv1_2']  = _conv2d_relu(vgg_layers, graph['conv1_1'], 2, 'conv1_2')\n    graph['avgpool1'] = _avgpool(graph['conv1_2'])\n    graph['conv2_1']  = _conv2d_relu(vgg_layers, graph['avgpool1'], 5, 'conv2_1')\n    graph['conv2_2']  = _conv2d_relu(vgg_layers, graph['conv2_1'], 7, 'conv2_2')\n    graph['avgpool2'] = _avgpool(graph['conv2_2'])\n    graph['conv3_1']  = _conv2d_relu(vgg_layers, graph['avgpool2'], 10, 'conv3_1')\n    graph['conv3_2']  = _conv2d_relu(vgg_layers, graph['conv3_1'], 12, 'conv3_2')\n    graph['conv3_3']  = _conv2d_relu(vgg_layers, graph['conv3_2'], 14, 'conv3_3')\n    graph['conv3_4']  = _conv2d_relu(vgg_layers, graph['conv3_3'], 16, 'conv3_4')\n    graph['avgpool3'] = _avgpool(graph['conv3_4'])\n    graph['conv4_1']  = _conv2d_relu(vgg_layers, graph['avgpool3'], 19, 'conv4_1')\n    graph['conv4_2']  = _conv2d_relu(vgg_layers, graph['conv4_1'], 21, 'conv4_2')\n    graph['conv4_3']  = _conv2d_relu(vgg_layers, graph['conv4_2'], 23, 'conv4_3')\n    graph['conv4_4']  = _conv2d_relu(vgg_layers, graph['conv4_3'], 25, 'conv4_4')\n    graph['avgpool4'] = _avgpool(graph['conv4_4'])\n    graph['conv5_1']  = _conv2d_relu(vgg_layers, graph['avgpool4'], 28, 'conv5_1')\n    graph['conv5_2']  = _conv2d_relu(vgg_layers, graph['conv5_1'], 30, 'conv5_2')\n    graph['conv5_3']  = _conv2d_relu(vgg_layers, 
graph['conv5_2'], 32, 'conv5_3')\n    graph['conv5_4']  = _conv2d_relu(vgg_layers, graph['conv5_3'], 34, 'conv5_4')\n    graph['avgpool5'] = _avgpool(graph['conv5_4'])\n    \n    return graph"
  },
  {
    "path": "2017/data/arvix_abstracts.txt",
    "content": "In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second or- der optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. 
These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe that the rounding scheme plays a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. 
We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on the dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. 
We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. 
We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. 
Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. 
By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) the input-to-hidden function, (2) the hidden-to-hidden transition and (3) the hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. 
Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. 
We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. 
Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power were available, we might be able to exploit much larger datasets in order to improve generalization ability. 
Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. 
We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. 
We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. 
We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. 
In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. 
We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g., MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited-precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that, apart from a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to those of state-of-the-art HMM-based decoders on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state of the art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activations. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, yielding an unsupervised learner that is both effective and efficient.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit on a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) of traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called the \"channel-out\" network, which takes much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small-scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small-sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation reduces the gradients used in adversarial sample creation by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter-server-based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm, we study the impact of synchronization protocol, stale gradient updates, mini-batch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly used image classification benchmarks: CIFAR-10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation reduces the gradients used in adversarial sample creation by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM-based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state-of-the-art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both the feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and a deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for the descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the acoustic signal sequence, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation into which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a WER of 3.93 on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processing units (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation reduces the gradients used in adversarial sample creation by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to compute a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of the computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points.
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for the descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity.
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure.
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains.
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting.
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly.
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization.
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training.
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory-constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$.
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited-precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as an effective initializer for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a slight increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach improves on the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction that each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks . In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We proof convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (upto an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus, preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction that each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks . In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggests that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\n"
  },
  {
    "path": "2017/data/heart.csv",
    "content": "sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd\n160,12,5.73,23.11,Present,49,25.3,97.2,52,1\n144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1\n118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0\n170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1\n134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1\n132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0\n142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0\n114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1\n114,0,3.83,19.4,Present,49,24.86,2.49,29,0\n132,0,5.8,30.96,Present,69,30.11,0,53,1\n206,6,2.95,32.27,Absent,72,26.81,56.06,60,1\n134,14.1,4.44,22.39,Present,65,23.09,0,40,1\n118,0,1.88,10.05,Absent,59,21.57,0,17,0\n132,0,1.87,17.21,Absent,49,23.63,0.97,15,0\n112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0\n117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0\n120,7.5,15.33,22,Absent,60,25.31,34.49,49,0\n146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1\n158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1\n124,14,6.23,35.96,Present,45,30.09,0,59,1\n106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1\n132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0\n150,0.3,6.38,33.99,Present,62,24.64,0,50,0\n138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0\n142,18.2,4.34,24.38,Absent,61,26.19,0,50,0\n124,4,12.42,31.29,Present,54,23.23,2.06,42,1\n118,6,9.65,33.91,Absent,60,38.8,0,48,0\n145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1\n144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0\n146,0,6.62,25.69,Absent,60,28.07,8.23,63,1\n136,2.52,3.95,25.63,Absent,51,21.86,0,45,1\n158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1\n122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1\n126,8.75,6.53,34.02,Absent,49,30.25,0,41,1\n148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0\n122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1\n140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0\n110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0\n130,0,2.82,19.63,Present,70,24.86,0,29,0\n136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1\n118,0.28,5.8,33.7,Present,60,30.98,0,41,1\n144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0\n
120,0,1.07,16.02,Absent,47,22.15,0,15,0\n130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1\n114,0,2.99,9.74,Absent,54,46.58,0,17,0\n128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0\n162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1\n116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1\n114,0,1.94,11.02,Absent,54,20.17,38.98,16,0\n126,3.8,3.88,31.79,Absent,57,30.53,0,30,0\n122,0,5.75,30.9,Present,46,29.01,4.11,42,0\n134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0\n152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1\n134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1\n156,3,1.82,27.55,Absent,60,23.91,54,53,0\n152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0\n118,0,2.99,16.17,Absent,49,23.83,3.22,28,0\n126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1\n103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0\n121,0.8,5.29,18.95,Present,47,22.51,0,61,0\n142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0\n138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0\n152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0\n140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0\n130,0,1.82,10.45,Absent,57,22.07,2.06,17,0\n136,7.36,2.19,28.11,Present,61,25,61.71,54,0\n124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0\n112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0\n118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0\n122,0,3.37,16.1,Absent,67,21.06,0,32,1\n118,0,3.67,12.13,Absent,51,19.15,0.6,15,0\n130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0\n130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0\n126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0\n128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0\n136,0,4.12,17.42,Absent,52,21.66,12.86,40,0\n134,0,5.9,30.84,Absent,49,29.16,0,55,0\n140,0.6,5.56,33.39,Present,58,27.19,0,55,1\n168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1\n108,0.4,5.91,22.92,Present,57,25.72,72,39,0\n114,3,7.04,22.64,Present,55,22.59,0,45,1\n140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1\n148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1\n148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1\n128,0,2.43,13.15,Present,63,20.75,0,17,0\n130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0
\n126,10.5,4.49,17.33,Absent,67,19.37,0,49,1\n140,0,5.08,27.33,Present,41,27.83,1.25,38,0\n126,0.9,5.64,17.78,Present,55,21.94,0,41,0\n122,0.72,4.04,32.38,Absent,34,28.34,0,55,0\n116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0\n120,3.7,4.02,39.66,Absent,61,30.57,0,64,1\n143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0\n118,4,3.95,18.96,Absent,54,25.15,8.33,49,1\n194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0\n134,3,4.37,23.07,Absent,56,20.54,9.65,62,0\n138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0\n136,0,5,27.58,Present,49,27.59,1.47,39,0\n122,3.2,11.32,35.36,Present,55,27.07,0,51,1\n164,12,3.91,19.59,Absent,51,23.44,19.75,39,0\n136,8,7.85,23.81,Present,51,22.69,2.78,50,0\n166,0.07,4.03,29.29,Absent,53,28.37,0,27,0\n118,0,4.34,30.12,Present,52,32.18,3.91,46,0\n128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0\n118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0\n158,3.6,2.97,30.11,Absent,63,26.64,108,64,0\n108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1\n170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1\n118,1,5.76,22.1,Absent,62,23.48,7.71,42,0\n124,0,3.04,17.33,Absent,49,22.04,0,18,0\n114,0,8.01,21.64,Absent,66,25.51,2.49,16,0\n168,9,8.53,24.48,Present,69,26.18,4.63,54,1\n134,2,3.66,14.69,Absent,52,21.03,2.06,37,0\n174,0,8.46,35.1,Present,35,25.27,0,61,1\n116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1\n128,0,10.58,31.81,Present,46,28.41,14.66,48,0\n140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1\n154,0.7,5.91,25,Absent,13,20.6,0,42,0\n150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1\n130,0,3.92,25.55,Absent,68,28.02,0.68,27,0\n128,2,6.13,21.31,Absent,66,22.86,11.83,60,0\n120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0\n120,0,5.01,26.13,Absent,64,26.21,12.24,33,0\n138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1\n153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0\n123,8.6,11.17,35.28,Present,70,33.14,0,59,1\n148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0\n136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0\n134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1\n152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0\n158,13.5
,5.04,30.79,Absent,54,24.79,21.5,62,0\n132,2,3.08,35.39,Absent,45,31.44,79.82,58,1\n134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1\n142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0\n134,6,3.3,28.45,Absent,65,26.09,58.11,40,0\n122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1\n116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0\n128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0\n120,0,3.68,12.24,Absent,51,20.52,0.51,20,0\n124,0,3.95,36.35,Present,59,32.83,9.59,54,0\n160,14,5.9,37.12,Absent,58,33.87,3.52,54,1\n130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1\n128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0\n130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0\n109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0\n144,0,3.84,18.72,Absent,56,22.1,4.8,40,0\n118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0\n136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1\n136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1\n124,15.5,5.05,24.06,Absent,46,23.22,0,61,1\n148,6,6.49,26.47,Absent,48,24.7,0,55,0\n128,6.6,3.58,20.71,Absent,55,24.15,0,52,0\n122,0.28,4.19,19.97,Absent,61,25.63,0,24,0\n108,0,2.74,11.17,Absent,53,22.61,0.95,20,0\n124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1\n138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1\n127,0,2.81,15.7,Absent,42,22.03,1.03,17,0\n174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0\n122,0,3.05,23.51,Absent,46,25.81,0,38,0\n144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1\n126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0\n208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1\n138,0,2.68,17.04,Absent,42,22.16,0,16,0\n148,0,3.84,17.26,Absent,70,20,0,21,0\n122,0,3.08,16.3,Absent,43,22.13,0,16,0\n132,7,3.2,23.26,Absent,77,23.64,23.14,49,0\n110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1\n160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1\n126,0.54,4.39,21.13,Present,45,25.99,0,25,0\n162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0\n194,2.55,6.89,33.88,Present,69,29.33,0,41,0\n118,0.75,2.58,20.25,Absent,59,24.46,0,32,0\n124,0,4.79,34.71,Absent,49,26.09,9.26,47,0\n160,0,2.42,34.46,Absent,48,29.83,1.03,61,0\n128,0,2.51,29.35
,Present,53,22.05,1.37,62,0\n122,4,5.24,27.89,Present,45,26.52,0,61,1\n132,2,2.7,21.57,Present,50,27.95,9.26,37,0\n120,0,2.42,16.66,Absent,46,20.16,0,17,0\n128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0\n108,15,4.91,34.65,Absent,41,27.96,14.4,56,0\n166,0,4.31,34.27,Absent,45,30.14,13.27,56,0\n152,0,6.06,41.05,Present,51,40.34,0,51,0\n170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1\n156,4,2.05,19.48,Present,50,21.48,27.77,39,1\n116,8,6.73,28.81,Present,41,26.74,40.94,48,1\n122,4.4,3.18,11.59,Present,59,21.94,0,33,1\n150,20,6.4,35.04,Absent,53,28.88,8.33,63,0\n129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0\n134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0\n126,0,5.98,29.06,Present,56,25.39,11.52,64,1\n142,0,3.72,25.68,Absent,48,24.37,5.25,40,1\n128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1\n102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1\n130,0,4.89,25.98,Absent,72,30.42,14.71,23,0\n138,0.05,2.79,10.35,Absent,46,21.62,0,18,0\n138,0,1.96,11.82,Present,54,22.01,8.13,21,0\n128,0,3.09,20.57,Absent,54,25.63,0.51,17,0\n162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0\n160,3,9.19,26.47,Present,39,28.25,14.4,54,1\n148,0,4.66,24.39,Absent,50,25.26,4.03,27,0\n124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0\n136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1\n134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0\n128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0\n122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0\n152,3,4.64,31.29,Absent,41,29.34,4.53,40,0\n162,0,5.09,24.6,Present,64,26.71,3.81,18,0\n124,4,6.65,30.84,Present,54,28.4,33.51,60,0\n136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0\n136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0\n134,0.05,8.03,27.95,Absent,48,26.88,0,60,0\n122,1,5.88,34.81,Present,69,31.27,15.94,40,1\n116,3,3.05,30.31,Absent,41,23.63,0.86,44,0\n132,0,0.98,21.39,Absent,62,26.75,0,53,0\n134,0,2.4,21.11,Absent,57,22.45,1.37,18,0\n160,7.77,8.07,34.8,Absent,64,31.15,0,62,1\n180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1\n124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0\n114,0,4.97,9.69,Absent
,26,22.6,0,25,0\n208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0\n138,0,3.14,12,Absent,54,20.28,0,16,0\n164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1\n144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0\n136,7.5,7.39,28.04,Present,50,25.01,0,45,1\n132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0\n143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0\n112,4.46,7.18,26.25,Present,69,27.29,0,32,1\n134,10,3.79,34.72,Absent,42,28.33,28.8,52,1\n138,2,5.11,31.4,Present,49,27.25,2.06,64,1\n188,0,5.47,32.44,Present,71,28.99,7.41,50,1\n110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1\n136,13.2,7.18,35.95,Absent,48,29.19,0,62,0\n130,1.75,5.46,34.34,Absent,53,29.42,0,58,1\n122,0,3.76,24.59,Absent,56,24.36,0,30,0\n138,0,3.24,27.68,Absent,60,25.7,88.66,29,0\n130,18,4.13,27.43,Absent,54,27.44,0,51,1\n126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0\n176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0\n122,0,5.49,19.56,Absent,57,23.12,14.02,27,0\n124,0,3.23,9.64,Absent,59,22.7,0,16,0\n140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1\n128,6,4.37,22.98,Present,50,26.01,0,47,0\n190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0\n144,0.76,10.53,35.66,Absent,63,34.35,0,55,1\n126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1\n128,0,2.63,23.88,Absent,45,21.59,6.54,57,0\n136,0.4,3.91,21.1,Present,63,22.3,0,56,1\n158,4,4.18,28.61,Present,42,25.11,0,60,0\n160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0\n124,6,5.21,33.02,Present,64,29.37,7.61,58,1\n158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0\n128,0,6.34,11.87,Absent,57,23.14,0,17,0\n166,3,3.82,26.75,Absent,45,20.86,0,63,1\n146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0\n161,9,4.65,15.16,Present,58,23.76,43.2,46,0\n164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1\n146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1\n142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0\n138,12,5.13,28.34,Absent,59,24.49,32.81,58,1\n154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0\n118,0,2.39,12.13,Absent,49,18.46,0.26,17,1\n124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0\n124,1.04,2.84,16.42,Present,46,20.17,0,61,0
\n136,5,4.19,23.99,Present,68,27.8,25.86,35,0\n132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1\n118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0\n118,0.12,4.16,9.37,Absent,57,19.61,0,17,0\n134,12,4.96,29.79,Absent,53,24.86,8.23,57,0\n114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0\n136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1\n130,0,4.16,39.43,Present,46,30.01,0,55,1\n136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1\n136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0\n154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0\n108,0.8,2.47,17.53,Absent,47,22.18,0,55,1\n136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1\n174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1\n124,4.25,8.22,30.77,Absent,56,25.8,0,43,0\n114,0,2.63,9.69,Absent,45,17.89,0,16,0\n118,0.12,3.26,12.26,Absent,55,22.65,0,16,0\n106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1\n146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0\n206,0,4.17,33.23,Absent,69,27.36,6.17,50,1\n134,3,3.17,17.91,Absent,35,26.37,15.12,27,0\n148,15,4.98,36.94,Present,72,31.83,66.27,41,1\n126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0\n134,0,3.69,13.92,Absent,43,27.66,0,19,0\n134,0.02,2.8,18.84,Absent,45,24.82,0,17,0\n123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0\n112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1\n112,0,1.71,15.96,Absent,42,22.03,3.5,16,0\n101,0.48,7.26,13,Absent,50,19.82,5.19,16,0\n150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0\n170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1\n134,0,5.63,29.12,Absent,68,32.33,2.02,34,0\n142,0,4.19,18.04,Absent,56,23.65,20.78,42,1\n132,0.1,3.28,10.73,Absent,73,20.42,0,17,0\n136,0,2.28,18.14,Absent,55,22.59,0,17,0\n132,12,4.51,21.93,Absent,61,26.07,64.8,46,1\n166,4.1,4,34.3,Present,32,29.51,8.23,53,0\n138,0,3.96,24.7,Present,53,23.8,0,45,0\n138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1\n170,0,3.12,37.15,Absent,47,35.42,0,53,0\n128,0,8.41,28.82,Present,60,26.86,0,59,1\n136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0\n128,0,3.22,26.55,Present,39,26.59,16.71,49,0\n150,14.4,5.04,26.52,Present,60,28.84,0,45,0\n132,8.4,3.57,13.68,Absent,
42,18.75,15.43,59,1\n142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0\n130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0\n174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1\n114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0\n162,1.5,2.46,19.39,Present,49,24.32,0,59,1\n174,0,3.27,35.4,Absent,58,37.71,24.95,44,0\n190,5.15,6.03,36.59,Absent,42,30.31,72,50,0\n154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0\n124,0,2.28,24.86,Present,50,22.24,8.26,38,0\n114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0\n168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1\n142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0\n154,0,4.81,28.11,Present,56,25.67,75.77,59,0\n146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0\n166,6,3.02,29.3,Absent,35,24.38,38.06,61,0\n140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1\n136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0\n156,0,3.47,21.1,Absent,73,28.4,0,36,1\n132,0,6.63,29.58,Present,37,29.41,2.57,62,0\n128,0,2.98,12.59,Absent,65,20.74,2.06,19,0\n106,5.6,3.2,12.3,Absent,49,20.29,0,39,0\n144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0\n154,0.31,2.33,16.48,Absent,33,24,11.83,17,0\n126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0\n134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1\n152,19.45,4.22,29.81,Absent,28,23.95,0,59,1\n146,1.35,6.39,34.21,Absent,51,26.43,0,59,1\n162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0\n130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1\n138,6,7.24,37.05,Absent,38,28.69,0,59,0\n148,0,5.32,26.71,Present,52,32.21,32.78,27,0\n124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0\n118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0\n116,4.28,7.02,19.99,Present,68,23.31,0,52,1\n162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1\n138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0\n137,1.2,3.14,23.87,Absent,66,24.13,45,37,0\n198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1\n154,4.5,4.75,23.52,Present,43,25.76,0,53,1\n128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0\n130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1\n162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0\n120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0\n136,3
.99,2.58,16.38,Present,53,22.41,27.67,36,0\n176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1\n134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1\n122,1.7,5.28,32.23,Present,51,24.08,0,54,0\n134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1\n134,0,2.43,22.24,Absent,52,26.49,41.66,24,0\n136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0\n132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0\n152,1.68,3.58,25.43,Absent,50,27.03,0,32,0\n132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1\n124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0\n140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0\n166,0.6,2.42,34.03,Present,53,26.96,54,60,0\n156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1\n132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0\n150,0,4.99,27.73,Absent,57,30.92,8.33,24,0\n134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0\n126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0\n148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0\n148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1\n132,6,5.97,25.73,Present,66,24.18,145.29,41,0\n128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0\n128,5.16,4.9,31.35,Present,57,26.42,0,64,0\n140,0,2.4,27.89,Present,70,30.74,144,29,0\n126,0,5.29,27.64,Absent,25,27.62,2.06,45,0\n114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0\n118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0\n126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0\n154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1\n112,1.44,2.71,22.92,Absent,59,24.81,0,52,0\n140,8,4.42,33.15,Present,47,32.77,66.86,44,0\n140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1\n128,2.6,4.94,21.36,Absent,61,21.3,0,31,0\n126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0\n160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1\n144,0,4.17,29.63,Present,52,21.83,0,59,0\n148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1\n146,0,4.92,18.53,Absent,57,24.2,34.97,26,0\n164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1\n130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1\n154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1\n178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0\n180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0\n134,12.
5,2.73,39.35,Absent,48,35.58,0,48,0\n142,0,3.54,16.64,Absent,58,25.97,8.36,27,0\n162,7,7.67,34.34,Present,33,30.77,0,62,0\n218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1\n126,8.75,6.06,32.72,Present,33,27,62.43,55,1\n126,0,3.57,26.01,Absent,61,26.3,7.97,47,0\n134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0\n132,0,4.17,36.57,Absent,57,30.61,18,49,0\n178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1\n208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1\n160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0\n116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0\n180,25.01,3.7,38.11,Present,57,30.54,0,61,1\n200,19.2,4.43,40.6,Present,55,32.04,36,60,1\n112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0\n120,0,3.1,26.97,Absent,41,24.8,0,16,0\n178,20,9.78,33.55,Absent,37,27.29,2.88,62,1\n166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0\n164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1\n216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1\n146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0\n134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1\n158,16,5.56,29.35,Absent,36,25.92,58.32,60,0\n176,0,3.14,31.04,Present,45,30.18,4.63,45,0\n132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0\n126,0,4.55,29.18,Absent,48,24.94,36,41,0\n120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0\n174,0,3.86,21.73,Absent,42,23.37,0,63,0\n150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1\n176,6,3.98,17.2,Present,52,21.07,4.11,61,1\n142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1\n132,0,3.3,21.61,Absent,42,24.92,32.61,33,0\n142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0\n146,1.16,2.28,34.53,Absent,50,28.71,45,49,0\n132,7.2,3.65,17.16,Present,56,23.25,0,34,0\n120,0,3.57,23.22,Absent,58,27.2,0,32,0\n118,0,3.89,15.96,Absent,65,20.18,0,16,0\n108,0,1.43,26.26,Absent,42,19.38,0,16,0\n136,0,4,19.06,Absent,40,21.94,2.06,16,0\n120,0,2.46,13.39,Absent,47,22.01,0.51,18,0\n132,0,3.55,8.66,Present,61,18.5,3.87,16,0\n136,0,1.77,20.37,Absent,45,21.51,2.06,16,0\n138,0,1.86,18.35,Present,59,25.38,6.51,17,0\n138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0\n130,1.22,3.3,13.65,Absent,50,
21.4,3.81,31,0\n130,4,2.4,17.42,Absent,60,22.05,0,40,0\n110,0,7.14,28.28,Absent,57,29,0,32,0\n120,0,3.98,13.19,Present,47,21.89,0,16,0\n166,6,8.8,37.89,Absent,39,28.7,43.2,52,0\n134,0.57,4.75,23.07,Absent,67,26.33,0,37,0\n142,3,3.69,25.1,Absent,60,30.08,38.88,27,0\n136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0\n142,0,4.32,25.22,Absent,47,28.92,6.53,34,1\n130,0,1.88,12.51,Present,52,20.28,0,17,0\n124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0\n144,4,5.03,25.78,Present,57,27.55,90,48,1\n136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0\n120,0,2.77,13.35,Absent,67,23.37,1.03,18,0\n154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0\n124,1.6,7.22,39.68,Present,36,31.5,0,51,1\n146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1\n128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1\n170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0\n214,0.4,5.98,31.72,Absent,64,28.45,0,58,0\n182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1\n108,3,1.59,15.23,Absent,40,20.09,26.64,55,0\n118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0\n132,0,4.82,33.41,Present,62,14.7,0,46,1"
  },
  {
    "path": "2017/data/heart.txt",
    "content": "\"sbp\"\t\"tobacco\"\t\"ldl\"\t\"adiposity\"\t\"famhist\"\t\"typea\"\t\"obesity\"\t\"alcohol\"\t\"age\"\t\"chd\"\n160\t12\t5.73\t23.11\t\"Present\"\t49\t25.3\t97.2\t52\t1\n144\t0.01\t4.41\t28.61\t\"Absent\"\t55\t28.87\t2.06\t63\t1\n118\t0.08\t3.48\t32.28\t\"Present\"\t52\t29.14\t3.81\t46\t0\n170\t7.5\t6.41\t38.03\t\"Present\"\t51\t31.99\t24.26\t58\t1\n134\t13.6\t3.5\t27.78\t\"Present\"\t60\t25.99\t57.34\t49\t1\n132\t6.2\t6.47\t36.21\t\"Present\"\t62\t30.77\t14.14\t45\t0\n142\t4.05\t3.38\t16.2\t\"Absent\"\t59\t20.81\t2.62\t38\t0\n114\t4.08\t4.59\t14.6\t\"Present\"\t62\t23.11\t6.72\t58\t1\n114\t0\t3.83\t19.4\t\"Present\"\t49\t24.86\t2.49\t29\t0\n132\t0\t5.8\t30.96\t\"Present\"\t69\t30.11\t0\t53\t1\n206\t6\t2.95\t32.27\t\"Absent\"\t72\t26.81\t56.06\t60\t1\n134\t14.1\t4.44\t22.39\t\"Present\"\t65\t23.09\t0\t40\t1\n118\t0\t1.88\t10.05\t\"Absent\"\t59\t21.57\t0\t17\t0\n132\t0\t1.87\t17.21\t\"Absent\"\t49\t23.63\t0.97\t15\t0\n112\t9.65\t2.29\t17.2\t\"Present\"\t54\t23.53\t0.68\t53\t0\n117\t1.53\t2.44\t28.95\t\"Present\"\t35\t25.89\t30.03\t46\t0\n120\t7.5\t15.33\t22\t\"Absent\"\t60\t25.31\t34.49\t49\t0\n146\t10.5\t8.29\t35.36\t\"Present\"\t78\t32.73\t13.89\t53\t1\n158\t2.6\t7.46\t34.07\t\"Present\"\t61\t29.3\t53.28\t62\t1\n124\t14\t6.23\t35.96\t\"Present\"\t45\t30.09\t0\t59\t1\n106\t1.61\t1.74\t12.32\t\"Absent\"\t74\t20.92\t13.37\t20\t1\n132\t7.9\t2.85\t26.5\t\"Present\"\t51\t26.16\t25.71\t44\t0\n150\t0.3\t6.38\t33.99\t\"Present\"\t62\t24.64\t0\t50\t0\n138\t0.6\t3.81\t28.66\t\"Absent\"\t54\t28.7\t1.46\t58\t0\n142\t18.2\t4.34\t24.38\t\"Absent\"\t61\t26.19\t0\t50\t0\n124\t4\t12.42\t31.29\t\"Present\"\t54\t23.23\t2.06\t42\t1\n118\t6\t9.65\t33.91\t\"Absent\"\t60\t38.8\t0\t48\t0\n145\t9.1\t5.24\t27.55\t\"Absent\"\t59\t20.96\t21.6\t61\t1\n144\t4.09\t5.55\t31.4\t\"Present\"\t60\t29.43\t5.55\t56\t0\n146\t0\t6.62\t25.69\t\"Absent\"\t60\t28.07\t8.23\t63\t1\n136\t2.52\t3.95\t25.63\t\"Absent\"\t51\t21.86\t0\t45\t1\n158\t1.02\t6.33\t23.88\t\"Absent\"\t66\t22.13\t24.99\t4
6\t1\n122\t6.6\t5.58\t35.95\t\"Present\"\t53\t28.07\t12.55\t59\t1\n126\t8.75\t6.53\t34.02\t\"Absent\"\t49\t30.25\t0\t41\t1\n148\t5.5\t7.1\t25.31\t\"Absent\"\t56\t29.84\t3.6\t48\t0\n122\t4.26\t4.44\t13.04\t\"Absent\"\t57\t19.49\t48.99\t28\t1\n140\t3.9\t7.32\t25.05\t\"Absent\"\t47\t27.36\t36.77\t32\t0\n110\t4.64\t4.55\t30.46\t\"Absent\"\t48\t30.9\t15.22\t46\t0\n130\t0\t2.82\t19.63\t\"Present\"\t70\t24.86\t0\t29\t0\n136\t11.2\t5.81\t31.85\t\"Present\"\t75\t27.68\t22.94\t58\t1\n118\t0.28\t5.8\t33.7\t\"Present\"\t60\t30.98\t0\t41\t1\n144\t0.04\t3.38\t23.61\t\"Absent\"\t30\t23.75\t4.66\t30\t0\n120\t0\t1.07\t16.02\t\"Absent\"\t47\t22.15\t0\t15\t0\n130\t2.61\t2.72\t22.99\t\"Present\"\t51\t26.29\t13.37\t51\t1\n114\t0\t2.99\t9.74\t\"Absent\"\t54\t46.58\t0\t17\t0\n128\t4.65\t3.31\t22.74\t\"Absent\"\t62\t22.95\t0.51\t48\t0\n162\t7.4\t8.55\t24.65\t\"Present\"\t64\t25.71\t5.86\t58\t1\n116\t1.91\t7.56\t26.45\t\"Present\"\t52\t30.01\t3.6\t33\t1\n114\t0\t1.94\t11.02\t\"Absent\"\t54\t20.17\t38.98\t16\t0\n126\t3.8\t3.88\t31.79\t\"Absent\"\t57\t30.53\t0\t30\t0\n122\t0\t5.75\t30.9\t\"Present\"\t46\t29.01\t4.11\t42\t0\n134\t2.5\t3.66\t30.9\t\"Absent\"\t52\t27.19\t23.66\t49\t0\n152\t0.9\t9.12\t30.23\t\"Absent\"\t56\t28.64\t0.37\t42\t1\n134\t8.08\t1.55\t17.5\t\"Present\"\t56\t22.65\t66.65\t31\t1\n156\t3\t1.82\t27.55\t\"Absent\"\t60\t23.91\t54\t53\t0\n152\t5.99\t7.99\t32.48\t\"Absent\"\t45\t26.57\t100.32\t48\t0\n118\t0\t2.99\t16.17\t\"Absent\"\t49\t23.83\t3.22\t28\t0\n126\t5.1\t2.96\t26.5\t\"Absent\"\t55\t25.52\t12.34\t38\t1\n103\t0.03\t4.21\t18.96\t\"Absent\"\t48\t22.94\t2.62\t18\t0\n121\t0.8\t5.29\t18.95\t\"Present\"\t47\t22.51\t0\t61\t0\n142\t0.28\t1.8\t21.03\t\"Absent\"\t57\t23.65\t2.93\t33\t0\n138\t1.15\t5.09\t27.87\t\"Present\"\t61\t25.65\t2.34\t44\t0\n152\t10.1\t4.71\t24.65\t\"Present\"\t65\t26.21\t24.53\t57\t0\n140\t0.45\t4.3\t24.33\t\"Absent\"\t41\t27.23\t10.08\t38\t0\n130\t0\t1.82\t10.45\t\"Absent\"\t57\t22.07\t2.06\t17\t0\n136\t7.36\t2.19\t28.11\t\"Present\"\t61\t25\t61.71\t54\t0
\n124\t4.82\t3.24\t21.1\t\"Present\"\t48\t28.49\t8.42\t30\t0\n112\t0.41\t1.88\t10.29\t\"Absent\"\t39\t22.08\t20.98\t27\t0\n118\t4.46\t7.27\t29.13\t\"Present\"\t48\t29.01\t11.11\t33\t0\n122\t0\t3.37\t16.1\t\"Absent\"\t67\t21.06\t0\t32\t1\n118\t0\t3.67\t12.13\t\"Absent\"\t51\t19.15\t0.6\t15\t0\n130\t1.72\t2.66\t10.38\t\"Absent\"\t68\t17.81\t11.1\t26\t0\n130\t5.6\t3.37\t24.8\t\"Absent\"\t58\t25.76\t43.2\t36\t0\n126\t0.09\t5.03\t13.27\t\"Present\"\t50\t17.75\t4.63\t20\t0\n128\t0.4\t6.17\t26.35\t\"Absent\"\t64\t27.86\t11.11\t34\t0\n136\t0\t4.12\t17.42\t\"Absent\"\t52\t21.66\t12.86\t40\t0\n134\t0\t5.9\t30.84\t\"Absent\"\t49\t29.16\t0\t55\t0\n140\t0.6\t5.56\t33.39\t\"Present\"\t58\t27.19\t0\t55\t1\n168\t4.5\t6.68\t28.47\t\"Absent\"\t43\t24.25\t24.38\t56\t1\n108\t0.4\t5.91\t22.92\t\"Present\"\t57\t25.72\t72\t39\t0\n114\t3\t7.04\t22.64\t\"Present\"\t55\t22.59\t0\t45\t1\n140\t8.14\t4.93\t42.49\t\"Absent\"\t53\t45.72\t6.43\t53\t1\n148\t4.8\t6.09\t36.55\t\"Present\"\t63\t25.44\t0.88\t55\t1\n148\t12.2\t3.79\t34.15\t\"Absent\"\t57\t26.38\t14.4\t57\t1\n128\t0\t2.43\t13.15\t\"Present\"\t63\t20.75\t0\t17\t0\n130\t0.56\t3.3\t30.86\t\"Absent\"\t49\t27.52\t33.33\t45\t0\n126\t10.5\t4.49\t17.33\t\"Absent\"\t67\t19.37\t0\t49\t1\n140\t0\t5.08\t27.33\t\"Present\"\t41\t27.83\t1.25\t38\t0\n126\t0.9\t5.64\t17.78\t\"Present\"\t55\t21.94\t0\t41\t0\n122\t0.72\t4.04\t32.38\t\"Absent\"\t34\t28.34\t0\t55\t0\n116\t1.03\t2.83\t10.85\t\"Absent\"\t45\t21.59\t1.75\t21\t0\n120\t3.7\t4.02\t39.66\t\"Absent\"\t61\t30.57\t0\t64\t1\n143\t0.46\t2.4\t22.87\t\"Absent\"\t62\t29.17\t15.43\t29\t0\n118\t4\t3.95\t18.96\t\"Absent\"\t54\t25.15\t8.33\t49\t1\n194\t1.7\t6.32\t33.67\t\"Absent\"\t47\t30.16\t0.19\t56\t0\n134\t3\t4.37\t23.07\t\"Absent\"\t56\t20.54\t9.65\t62\t0\n138\t2.16\t4.9\t24.83\t\"Present\"\t39\t26.06\t28.29\t29\t0\n136\t0\t5\t27.58\t\"Present\"\t49\t27.59\t1.47\t39\t0\n122\t3.2\t11.32\t35.36\t\"Present\"\t55\t27.07\t0\t51\t1\n164\t12\t3.91\t19.59\t\"Absent\"\t51\t23.44\t19.75\t39\t0\n136\t8\t7.85\t23.81\
t\"Present\"\t51\t22.69\t2.78\t50\t0\n166\t0.07\t4.03\t29.29\t\"Absent\"\t53\t28.37\t0\t27\t0\n118\t0\t4.34\t30.12\t\"Present\"\t52\t32.18\t3.91\t46\t0\n128\t0.42\t4.6\t26.68\t\"Absent\"\t41\t30.97\t10.33\t31\t0\n118\t1.5\t5.38\t25.84\t\"Absent\"\t64\t28.63\t3.89\t29\t0\n158\t3.6\t2.97\t30.11\t\"Absent\"\t63\t26.64\t108\t64\t0\n108\t1.5\t4.33\t24.99\t\"Absent\"\t66\t22.29\t21.6\t61\t1\n170\t7.6\t5.5\t37.83\t\"Present\"\t42\t37.41\t6.17\t54\t1\n118\t1\t5.76\t22.1\t\"Absent\"\t62\t23.48\t7.71\t42\t0\n124\t0\t3.04\t17.33\t\"Absent\"\t49\t22.04\t0\t18\t0\n114\t0\t8.01\t21.64\t\"Absent\"\t66\t25.51\t2.49\t16\t0\n168\t9\t8.53\t24.48\t\"Present\"\t69\t26.18\t4.63\t54\t1\n134\t2\t3.66\t14.69\t\"Absent\"\t52\t21.03\t2.06\t37\t0\n174\t0\t8.46\t35.1\t\"Present\"\t35\t25.27\t0\t61\t1\n116\t31.2\t3.17\t14.99\t\"Absent\"\t47\t19.4\t49.06\t59\t1\n128\t0\t10.58\t31.81\t\"Present\"\t46\t28.41\t14.66\t48\t0\n140\t4.5\t4.59\t18.01\t\"Absent\"\t63\t21.91\t22.09\t32\t1\n154\t0.7\t5.91\t25\t\"Absent\"\t13\t20.6\t0\t42\t0\n150\t3.5\t6.99\t25.39\t\"Present\"\t50\t23.35\t23.48\t61\t1\n130\t0\t3.92\t25.55\t\"Absent\"\t68\t28.02\t0.68\t27\t0\n128\t2\t6.13\t21.31\t\"Absent\"\t66\t22.86\t11.83\t60\t0\n120\t1.4\t6.25\t20.47\t\"Absent\"\t60\t25.85\t8.51\t28\t0\n120\t0\t5.01\t26.13\t\"Absent\"\t64\t26.21\t12.24\t33\t0\n138\t4.5\t2.85\t30.11\t\"Absent\"\t55\t24.78\t24.89\t56\t1\n153\t7.8\t3.96\t25.73\t\"Absent\"\t54\t25.91\t27.03\t45\t0\n123\t8.6\t11.17\t35.28\t\"Present\"\t70\t33.14\t0\t59\t1\n148\t4.04\t3.99\t20.69\t\"Absent\"\t60\t27.78\t1.75\t28\t0\n136\t3.96\t2.76\t30.28\t\"Present\"\t50\t34.42\t18.51\t38\t0\n134\t8.8\t7.41\t26.84\t\"Absent\"\t35\t29.44\t29.52\t60\t1\n152\t12.18\t4.04\t37.83\t\"Present\"\t63\t34.57\t4.17\t64\t0\n158\t13.5\t5.04\t30.79\t\"Absent\"\t54\t24.79\t21.5\t62\t0\n132\t2\t3.08\t35.39\t\"Absent\"\t45\t31.44\t79.82\t58\t1\n134\t1.5\t3.73\t21.53\t\"Absent\"\t41\t24.7\t11.11\t30\t1\n142\t7.44\t5.52\t33.97\t\"Absent\"\t47\t29.29\t24.27\t54\t0\n134\t6\t3.3\t28.45\t\"Absent\"\
t65\t26.09\t58.11\t40\t0\n122\t4.18\t9.05\t29.27\t\"Present\"\t44\t24.05\t19.34\t52\t1\n116\t2.7\t3.69\t13.52\t\"Absent\"\t55\t21.13\t18.51\t32\t0\n128\t0.5\t3.7\t12.81\t\"Present\"\t66\t21.25\t22.73\t28\t0\n120\t0\t3.68\t12.24\t\"Absent\"\t51\t20.52\t0.51\t20\t0\n124\t0\t3.95\t36.35\t\"Present\"\t59\t32.83\t9.59\t54\t0\n160\t14\t5.9\t37.12\t\"Absent\"\t58\t33.87\t3.52\t54\t1\n130\t2.78\t4.89\t9.39\t\"Present\"\t63\t19.3\t17.47\t25\t1\n128\t2.8\t5.53\t14.29\t\"Absent\"\t64\t24.97\t0.51\t38\t0\n130\t4.5\t5.86\t37.43\t\"Absent\"\t61\t31.21\t32.3\t58\t0\n109\t1.2\t6.14\t29.26\t\"Absent\"\t47\t24.72\t10.46\t40\t0\n144\t0\t3.84\t18.72\t\"Absent\"\t56\t22.1\t4.8\t40\t0\n118\t1.05\t3.16\t12.98\t\"Present\"\t46\t22.09\t16.35\t31\t0\n136\t3.46\t6.38\t32.25\t\"Present\"\t43\t28.73\t3.13\t43\t1\n136\t1.5\t6.06\t26.54\t\"Absent\"\t54\t29.38\t14.5\t33\t1\n124\t15.5\t5.05\t24.06\t\"Absent\"\t46\t23.22\t0\t61\t1\n148\t6\t6.49\t26.47\t\"Absent\"\t48\t24.7\t0\t55\t0\n128\t6.6\t3.58\t20.71\t\"Absent\"\t55\t24.15\t0\t52\t0\n122\t0.28\t4.19\t19.97\t\"Absent\"\t61\t25.63\t0\t24\t0\n108\t0\t2.74\t11.17\t\"Absent\"\t53\t22.61\t0.95\t20\t0\n124\t3.04\t4.8\t19.52\t\"Present\"\t60\t21.78\t147.19\t41\t1\n138\t8.8\t3.12\t22.41\t\"Present\"\t63\t23.33\t120.03\t55\t1\n127\t0\t2.81\t15.7\t\"Absent\"\t42\t22.03\t1.03\t17\t0\n174\t9.45\t5.13\t35.54\t\"Absent\"\t55\t30.71\t59.79\t53\t0\n122\t0\t3.05\t23.51\t\"Absent\"\t46\t25.81\t0\t38\t0\n144\t6.75\t5.45\t29.81\t\"Absent\"\t53\t25.62\t26.23\t43\t1\n126\t1.8\t6.22\t19.71\t\"Absent\"\t65\t24.81\t0.69\t31\t0\n208\t27.4\t3.12\t26.63\t\"Absent\"\t66\t27.45\t33.07\t62\t1\n138\t0\t2.68\t17.04\t\"Absent\"\t42\t22.16\t0\t16\t0\n148\t0\t3.84\t17.26\t\"Absent\"\t70\t20\t0\t21\t0\n122\t0\t3.08\t16.3\t\"Absent\"\t43\t22.13\t0\t16\t0\n132\t7\t3.2\t23.26\t\"Absent\"\t77\t23.64\t23.14\t49\t0\n110\t12.16\t4.99\t28.56\t\"Absent\"\t44\t27.14\t21.6\t55\t1\n160\t1.52\t8.12\t29.3\t\"Present\"\t54\t25.87\t12.86\t43\t1\n126\t0.54\t4.39\t21.13\t\"Present\"\t45\t25.99\t0\t2
5\t0\n162\t5.3\t7.95\t33.58\t\"Present\"\t58\t36.06\t8.23\t48\t0\n194\t2.55\t6.89\t33.88\t\"Present\"\t69\t29.33\t0\t41\t0\n118\t0.75\t2.58\t20.25\t\"Absent\"\t59\t24.46\t0\t32\t0\n124\t0\t4.79\t34.71\t\"Absent\"\t49\t26.09\t9.26\t47\t0\n160\t0\t2.42\t34.46\t\"Absent\"\t48\t29.83\t1.03\t61\t0\n128\t0\t2.51\t29.35\t\"Present\"\t53\t22.05\t1.37\t62\t0\n122\t4\t5.24\t27.89\t\"Present\"\t45\t26.52\t0\t61\t1\n132\t2\t2.7\t21.57\t\"Present\"\t50\t27.95\t9.26\t37\t0\n120\t0\t2.42\t16.66\t\"Absent\"\t46\t20.16\t0\t17\t0\n128\t0.04\t8.22\t28.17\t\"Absent\"\t65\t26.24\t11.73\t24\t0\n108\t15\t4.91\t34.65\t\"Absent\"\t41\t27.96\t14.4\t56\t0\n166\t0\t4.31\t34.27\t\"Absent\"\t45\t30.14\t13.27\t56\t0\n152\t0\t6.06\t41.05\t\"Present\"\t51\t40.34\t0\t51\t0\n170\t4.2\t4.67\t35.45\t\"Present\"\t50\t27.14\t7.92\t60\t1\n156\t4\t2.05\t19.48\t\"Present\"\t50\t21.48\t27.77\t39\t1\n116\t8\t6.73\t28.81\t\"Present\"\t41\t26.74\t40.94\t48\t1\n122\t4.4\t3.18\t11.59\t\"Present\"\t59\t21.94\t0\t33\t1\n150\t20\t6.4\t35.04\t\"Absent\"\t53\t28.88\t8.33\t63\t0\n129\t2.15\t5.17\t27.57\t\"Absent\"\t52\t25.42\t2.06\t39\t0\n134\t4.8\t6.58\t29.89\t\"Present\"\t55\t24.73\t23.66\t63\t0\n126\t0\t5.98\t29.06\t\"Present\"\t56\t25.39\t11.52\t64\t1\n142\t0\t3.72\t25.68\t\"Absent\"\t48\t24.37\t5.25\t40\t1\n128\t0.7\t4.9\t37.42\t\"Present\"\t72\t35.94\t3.09\t49\t1\n102\t0.4\t3.41\t17.22\t\"Present\"\t56\t23.59\t2.06\t39\t1\n130\t0\t4.89\t25.98\t\"Absent\"\t72\t30.42\t14.71\t23\t0\n138\t0.05\t2.79\t10.35\t\"Absent\"\t46\t21.62\t0\t18\t0\n138\t0\t1.96\t11.82\t\"Present\"\t54\t22.01\t8.13\t21\t0\n128\t0\t3.09\t20.57\t\"Absent\"\t54\t25.63\t0.51\t17\t0\n162\t2.92\t3.63\t31.33\t\"Absent\"\t62\t31.59\t18.51\t42\t0\n160\t3\t9.19\t26.47\t\"Present\"\t39\t28.25\t14.4\t54\t1\n148\t0\t4.66\t24.39\t\"Absent\"\t50\t25.26\t4.03\t27\t0\n124\t0.16\t2.44\t16.67\t\"Absent\"\t65\t24.58\t74.91\t23\t0\n136\t3.15\t4.37\t20.22\t\"Present\"\t59\t25.12\t47.16\t31\t1\n134\t2.75\t5.51\t26.17\t\"Absent\"\t57\t29.87\t8.33\t33\t0\n128\t0.73\t3
.97\t23.52\t\"Absent\"\t54\t23.81\t19.2\t64\t0\n122\t3.2\t3.59\t22.49\t\"Present\"\t45\t24.96\t36.17\t58\t0\n152\t3\t4.64\t31.29\t\"Absent\"\t41\t29.34\t4.53\t40\t0\n162\t0\t5.09\t24.6\t\"Present\"\t64\t26.71\t3.81\t18\t0\n124\t4\t6.65\t30.84\t\"Present\"\t54\t28.4\t33.51\t60\t0\n136\t5.8\t5.9\t27.55\t\"Absent\"\t65\t25.71\t14.4\t59\t0\n136\t8.8\t4.26\t32.03\t\"Present\"\t52\t31.44\t34.35\t60\t0\n134\t0.05\t8.03\t27.95\t\"Absent\"\t48\t26.88\t0\t60\t0\n122\t1\t5.88\t34.81\t\"Present\"\t69\t31.27\t15.94\t40\t1\n116\t3\t3.05\t30.31\t\"Absent\"\t41\t23.63\t0.86\t44\t0\n132\t0\t0.98\t21.39\t\"Absent\"\t62\t26.75\t0\t53\t0\n134\t0\t2.4\t21.11\t\"Absent\"\t57\t22.45\t1.37\t18\t0\n160\t7.77\t8.07\t34.8\t\"Absent\"\t64\t31.15\t0\t62\t1\n180\t0.52\t4.23\t16.38\t\"Absent\"\t55\t22.56\t14.77\t45\t1\n124\t0.81\t6.16\t11.61\t\"Absent\"\t35\t21.47\t10.49\t26\t0\n114\t0\t4.97\t9.69\t\"Absent\"\t26\t22.6\t0\t25\t0\n208\t7.4\t7.41\t32.03\t\"Absent\"\t50\t27.62\t7.85\t57\t0\n138\t0\t3.14\t12\t\"Absent\"\t54\t20.28\t0\t16\t0\n164\t0.5\t6.95\t39.64\t\"Present\"\t47\t41.76\t3.81\t46\t1\n144\t2.4\t8.13\t35.61\t\"Absent\"\t46\t27.38\t13.37\t60\t0\n136\t7.5\t7.39\t28.04\t\"Present\"\t50\t25.01\t0\t45\t1\n132\t7.28\t3.52\t12.33\t\"Absent\"\t60\t19.48\t2.06\t56\t0\n143\t5.04\t4.86\t23.59\t\"Absent\"\t58\t24.69\t18.72\t42\t0\n112\t4.46\t7.18\t26.25\t\"Present\"\t69\t27.29\t0\t32\t1\n134\t10\t3.79\t34.72\t\"Absent\"\t42\t28.33\t28.8\t52\t1\n138\t2\t5.11\t31.4\t\"Present\"\t49\t27.25\t2.06\t64\t1\n188\t0\t5.47\t32.44\t\"Present\"\t71\t28.99\t7.41\t50\t1\n110\t2.35\t3.36\t26.72\t\"Present\"\t54\t26.08\t109.8\t58\t1\n136\t13.2\t7.18\t35.95\t\"Absent\"\t48\t29.19\t0\t62\t0\n130\t1.75\t5.46\t34.34\t\"Absent\"\t53\t29.42\t0\t58\t1\n122\t0\t3.76\t24.59\t\"Absent\"\t56\t24.36\t0\t30\t0\n138\t0\t3.24\t27.68\t\"Absent\"\t60\t25.7\t88.66\t29\t0\n130\t18\t4.13\t27.43\t\"Absent\"\t54\t27.44\t0\t51\t1\n126\t5.5\t3.78\t34.15\t\"Absent\"\t55\t28.85\t3.18\t61\t0\n176\t5.76\t4.89\t26.1\t\"Present\"\t46\t27.3\t1
9.44\t57\t0\n122\t0\t5.49\t19.56\t\"Absent\"\t57\t23.12\t14.02\t27\t0\n124\t0\t3.23\t9.64\t\"Absent\"\t59\t22.7\t0\t16\t0\n140\t5.2\t3.58\t29.26\t\"Absent\"\t70\t27.29\t20.17\t45\t1\n128\t6\t4.37\t22.98\t\"Present\"\t50\t26.01\t0\t47\t0\n190\t4.18\t5.05\t24.83\t\"Absent\"\t45\t26.09\t82.85\t41\t0\n144\t0.76\t10.53\t35.66\t\"Absent\"\t63\t34.35\t0\t55\t1\n126\t4.6\t7.4\t31.99\t\"Present\"\t57\t28.67\t0.37\t60\t1\n128\t0\t2.63\t23.88\t\"Absent\"\t45\t21.59\t6.54\t57\t0\n136\t0.4\t3.91\t21.1\t\"Present\"\t63\t22.3\t0\t56\t1\n158\t4\t4.18\t28.61\t\"Present\"\t42\t25.11\t0\t60\t0\n160\t0.6\t6.94\t30.53\t\"Absent\"\t36\t25.68\t1.42\t64\t0\n124\t6\t5.21\t33.02\t\"Present\"\t64\t29.37\t7.61\t58\t1\n158\t6.17\t8.12\t30.75\t\"Absent\"\t46\t27.84\t92.62\t48\t0\n128\t0\t6.34\t11.87\t\"Absent\"\t57\t23.14\t0\t17\t0\n166\t3\t3.82\t26.75\t\"Absent\"\t45\t20.86\t0\t63\t1\n146\t7.5\t7.21\t25.93\t\"Present\"\t55\t22.51\t0.51\t42\t0\n161\t9\t4.65\t15.16\t\"Present\"\t58\t23.76\t43.2\t46\t0\n164\t13.02\t6.26\t29.38\t\"Present\"\t47\t22.75\t37.03\t54\t1\n146\t5.08\t7.03\t27.41\t\"Present\"\t63\t36.46\t24.48\t37\t1\n142\t4.48\t3.57\t19.75\t\"Present\"\t51\t23.54\t3.29\t49\t0\n138\t12\t5.13\t28.34\t\"Absent\"\t59\t24.49\t32.81\t58\t1\n154\t1.8\t7.13\t34.04\t\"Present\"\t52\t35.51\t39.36\t44\t0\n118\t0\t2.39\t12.13\t\"Absent\"\t49\t18.46\t0.26\t17\t1\n124\t0.61\t2.69\t17.15\t\"Present\"\t61\t22.76\t11.55\t20\t0\n124\t1.04\t2.84\t16.42\t\"Present\"\t46\t20.17\t0\t61\t0\n136\t5\t4.19\t23.99\t\"Present\"\t68\t27.8\t25.86\t35\t0\n132\t9.9\t4.63\t27.86\t\"Present\"\t46\t23.39\t0.51\t52\t1\n118\t0.12\t1.96\t20.31\t\"Absent\"\t37\t20.01\t2.42\t18\t0\n118\t0.12\t4.16\t9.37\t\"Absent\"\t57\t19.61\t0\t17\t0\n134\t12\t4.96\t29.79\t\"Absent\"\t53\t24.86\t8.23\t57\t0\n114\t0.1\t3.95\t15.89\t\"Present\"\t57\t20.31\t17.14\t16\t0\n136\t6.8\t7.84\t30.74\t\"Present\"\t58\t26.2\t23.66\t45\t1\n130\t0\t4.16\t39.43\t\"Present\"\t46\t30.01\t0\t55\t1\n136\t2.2\t4.16\t38.02\t\"Absent\"\t65\t37.24\t4.11\t41\t1\n136
\t1.36\t3.16\t14.97\t\"Present\"\t56\t24.98\t7.3\t24\t0\n154\t4.2\t5.59\t25.02\t\"Absent\"\t58\t25.02\t1.54\t43\t0\n108\t0.8\t2.47\t17.53\t\"Absent\"\t47\t22.18\t0\t55\t1\n136\t8.8\t4.69\t36.07\t\"Present\"\t38\t26.56\t2.78\t63\t1\n174\t2.02\t6.57\t31.9\t\"Present\"\t50\t28.75\t11.83\t64\t1\n124\t4.25\t8.22\t30.77\t\"Absent\"\t56\t25.8\t0\t43\t0\n114\t0\t2.63\t9.69\t\"Absent\"\t45\t17.89\t0\t16\t0\n118\t0.12\t3.26\t12.26\t\"Absent\"\t55\t22.65\t0\t16\t0\n106\t1.08\t4.37\t26.08\t\"Absent\"\t67\t24.07\t17.74\t28\t1\n146\t3.6\t3.51\t22.67\t\"Absent\"\t51\t22.29\t43.71\t42\t0\n206\t0\t4.17\t33.23\t\"Absent\"\t69\t27.36\t6.17\t50\t1\n134\t3\t3.17\t17.91\t\"Absent\"\t35\t26.37\t15.12\t27\t0\n148\t15\t4.98\t36.94\t\"Present\"\t72\t31.83\t66.27\t41\t1\n126\t0.21\t3.95\t15.11\t\"Absent\"\t61\t22.17\t2.42\t17\t0\n134\t0\t3.69\t13.92\t\"Absent\"\t43\t27.66\t0\t19\t0\n134\t0.02\t2.8\t18.84\t\"Absent\"\t45\t24.82\t0\t17\t0\n123\t0.05\t4.61\t13.69\t\"Absent\"\t51\t23.23\t2.78\t16\t0\n112\t0.6\t5.28\t25.71\t\"Absent\"\t55\t27.02\t27.77\t38\t1\n112\t0\t1.71\t15.96\t\"Absent\"\t42\t22.03\t3.5\t16\t0\n101\t0.48\t7.26\t13\t\"Absent\"\t50\t19.82\t5.19\t16\t0\n150\t0.18\t4.14\t14.4\t\"Absent\"\t53\t23.43\t7.71\t44\t0\n170\t2.6\t7.22\t28.69\t\"Present\"\t71\t27.87\t37.65\t56\t1\n134\t0\t5.63\t29.12\t\"Absent\"\t68\t32.33\t2.02\t34\t0\n142\t0\t4.19\t18.04\t\"Absent\"\t56\t23.65\t20.78\t42\t1\n132\t0.1\t3.28\t10.73\t\"Absent\"\t73\t20.42\t0\t17\t0\n136\t0\t2.28\t18.14\t\"Absent\"\t55\t22.59\t0\t17\t0\n132\t12\t4.51\t21.93\t\"Absent\"\t61\t26.07\t64.8\t46\t1\n166\t4.1\t4\t34.3\t\"Present\"\t32\t29.51\t8.23\t53\t0\n138\t0\t3.96\t24.7\t\"Present\"\t53\t23.8\t0\t45\t0\n138\t2.27\t6.41\t29.07\t\"Absent\"\t58\t30.22\t2.93\t32\t1\n170\t0\t3.12\t37.15\t\"Absent\"\t47\t35.42\t0\t53\t0\n128\t0\t8.41\t28.82\t\"Present\"\t60\t26.86\t0\t59\t1\n136\t1.2\t2.78\t7.12\t\"Absent\"\t52\t22.51\t3.41\t27\t0\n128\t0\t3.22\t26.55\t\"Present\"\t39\t26.59\t16.71\t49\t0\n150\t14.4\t5.04\t26.52\t\"Present\"\t60\t28.
84\t0\t45\t0\n132\t8.4\t3.57\t13.68\t\"Absent\"\t42\t18.75\t15.43\t59\t1\n142\t2.4\t2.55\t23.89\t\"Absent\"\t54\t26.09\t59.14\t37\t0\n130\t0.05\t2.44\t28.25\t\"Present\"\t67\t30.86\t40.32\t34\t0\n174\t3.5\t5.26\t21.97\t\"Present\"\t36\t22.04\t8.33\t59\t1\n114\t9.6\t2.51\t29.18\t\"Absent\"\t49\t25.67\t40.63\t46\t0\n162\t1.5\t2.46\t19.39\t\"Present\"\t49\t24.32\t0\t59\t1\n174\t0\t3.27\t35.4\t\"Absent\"\t58\t37.71\t24.95\t44\t0\n190\t5.15\t6.03\t36.59\t\"Absent\"\t42\t30.31\t72\t50\t0\n154\t1.4\t1.72\t18.86\t\"Absent\"\t58\t22.67\t43.2\t59\t0\n124\t0\t2.28\t24.86\t\"Present\"\t50\t22.24\t8.26\t38\t0\n114\t1.2\t3.98\t14.9\t\"Absent\"\t49\t23.79\t25.82\t26\t0\n168\t11.4\t5.08\t26.66\t\"Present\"\t56\t27.04\t2.61\t59\t1\n142\t3.72\t4.24\t32.57\t\"Absent\"\t52\t24.98\t7.61\t51\t0\n154\t0\t4.81\t28.11\t\"Present\"\t56\t25.67\t75.77\t59\t0\n146\t4.36\t4.31\t18.44\t\"Present\"\t47\t24.72\t10.8\t38\t0\n166\t6\t3.02\t29.3\t\"Absent\"\t35\t24.38\t38.06\t61\t0\n140\t8.6\t3.9\t32.16\t\"Present\"\t52\t28.51\t11.11\t64\t1\n136\t1.7\t3.53\t20.13\t\"Absent\"\t56\t19.44\t14.4\t55\t0\n156\t0\t3.47\t21.1\t\"Absent\"\t73\t28.4\t0\t36\t1\n132\t0\t6.63\t29.58\t\"Present\"\t37\t29.41\t2.57\t62\t0\n128\t0\t2.98\t12.59\t\"Absent\"\t65\t20.74\t2.06\t19\t0\n106\t5.6\t3.2\t12.3\t\"Absent\"\t49\t20.29\t0\t39\t0\n144\t0.4\t4.64\t30.09\t\"Absent\"\t30\t27.39\t0.74\t55\t0\n154\t0.31\t2.33\t16.48\t\"Absent\"\t33\t24\t11.83\t17\t0\n126\t3.1\t2.01\t32.97\t\"Present\"\t56\t28.63\t26.74\t45\t0\n134\t6.4\t8.49\t37.25\t\"Present\"\t56\t28.94\t10.49\t51\t1\n152\t19.45\t4.22\t29.81\t\"Absent\"\t28\t23.95\t0\t59\t1\n146\t1.35\t6.39\t34.21\t\"Absent\"\t51\t26.43\t0\t59\t1\n162\t6.94\t4.55\t33.36\t\"Present\"\t52\t27.09\t32.06\t43\t0\n130\t7.28\t3.56\t23.29\t\"Present\"\t20\t26.8\t51.87\t58\t1\n138\t6\t7.24\t37.05\t\"Absent\"\t38\t28.69\t0\t59\t0\n148\t0\t5.32\t26.71\t\"Present\"\t52\t32.21\t32.78\t27\t0\n124\t4.2\t2.94\t27.59\t\"Absent\"\t50\t30.31\t85.06\t30\t0\n118\t1.62\t9.01\t21.7\t\"Absent\"\t59\t25.89\t21
.19\t40\t0\n116\t4.28\t7.02\t19.99\t\"Present\"\t68\t23.31\t0\t52\t1\n162\t6.3\t5.73\t22.61\t\"Present\"\t46\t20.43\t62.54\t53\t1\n138\t0.87\t1.87\t15.89\t\"Absent\"\t44\t26.76\t42.99\t31\t0\n137\t1.2\t3.14\t23.87\t\"Absent\"\t66\t24.13\t45\t37\t0\n198\t0.52\t11.89\t27.68\t\"Present\"\t48\t28.4\t78.99\t26\t1\n154\t4.5\t4.75\t23.52\t\"Present\"\t43\t25.76\t0\t53\t1\n128\t5.4\t2.36\t12.98\t\"Absent\"\t51\t18.36\t6.69\t61\t0\n130\t0.08\t5.59\t25.42\t\"Present\"\t50\t24.98\t6.27\t43\t1\n162\t5.6\t4.24\t22.53\t\"Absent\"\t29\t22.91\t5.66\t60\t0\n120\t10.5\t2.7\t29.87\t\"Present\"\t54\t24.5\t16.46\t49\t0\n136\t3.99\t2.58\t16.38\t\"Present\"\t53\t22.41\t27.67\t36\t0\n176\t1.2\t8.28\t36.16\t\"Present\"\t42\t27.81\t11.6\t58\t1\n134\t11.79\t4.01\t26.57\t\"Present\"\t38\t21.79\t38.88\t61\t1\n122\t1.7\t5.28\t32.23\t\"Present\"\t51\t24.08\t0\t54\t0\n134\t0.9\t3.18\t23.66\t\"Present\"\t52\t23.26\t27.36\t58\t1\n134\t0\t2.43\t22.24\t\"Absent\"\t52\t26.49\t41.66\t24\t0\n136\t6.6\t6.08\t32.74\t\"Absent\"\t64\t33.28\t2.72\t49\t0\n132\t4.05\t5.15\t26.51\t\"Present\"\t31\t26.67\t16.3\t50\t0\n152\t1.68\t3.58\t25.43\t\"Absent\"\t50\t27.03\t0\t32\t0\n132\t12.3\t5.96\t32.79\t\"Present\"\t57\t30.12\t21.5\t62\t1\n124\t0.4\t3.67\t25.76\t\"Absent\"\t43\t28.08\t20.57\t34\t0\n140\t4.2\t2.91\t28.83\t\"Present\"\t43\t24.7\t47.52\t48\t0\n166\t0.6\t2.42\t34.03\t\"Present\"\t53\t26.96\t54\t60\t0\n156\t3.02\t5.35\t25.72\t\"Present\"\t53\t25.22\t28.11\t52\t1\n132\t0.72\t4.37\t19.54\t\"Absent\"\t48\t26.11\t49.37\t28\t0\n150\t0\t4.99\t27.73\t\"Absent\"\t57\t30.92\t8.33\t24\t0\n134\t0.12\t3.4\t21.18\t\"Present\"\t33\t26.27\t14.21\t30\t0\n126\t3.4\t4.87\t15.16\t\"Present\"\t65\t22.01\t11.11\t38\t0\n148\t0.5\t5.97\t32.88\t\"Absent\"\t54\t29.27\t6.43\t42\t0\n148\t8.2\t7.75\t34.46\t\"Present\"\t46\t26.53\t6.04\t64\t1\n132\t6\t5.97\t25.73\t\"Present\"\t66\t24.18\t145.29\t41\t0\n128\t1.6\t5.41\t29.3\t\"Absent\"\t68\t29.38\t23.97\t32\t0\n128\t5.16\t4.9\t31.35\t\"Present\"\t57\t26.42\t0\t64\t0\n140\t0\t2.4\t27.89\
t\"Present\"\t70\t30.74\t144\t29\t0\n126\t0\t5.29\t27.64\t\"Absent\"\t25\t27.62\t2.06\t45\t0\n114\t3.6\t4.16\t22.58\t\"Absent\"\t60\t24.49\t65.31\t31\t0\n118\t1.25\t4.69\t31.58\t\"Present\"\t52\t27.16\t4.11\t53\t0\n126\t0.96\t4.99\t29.74\t\"Absent\"\t66\t33.35\t58.32\t38\t0\n154\t4.5\t4.68\t39.97\t\"Absent\"\t61\t33.17\t1.54\t64\t1\n112\t1.44\t2.71\t22.92\t\"Absent\"\t59\t24.81\t0\t52\t0\n140\t8\t4.42\t33.15\t\"Present\"\t47\t32.77\t66.86\t44\t0\n140\t1.68\t11.41\t29.54\t\"Present\"\t74\t30.75\t2.06\t38\t1\n128\t2.6\t4.94\t21.36\t\"Absent\"\t61\t21.3\t0\t31\t0\n126\t19.6\t6.03\t34.99\t\"Absent\"\t49\t26.99\t55.89\t44\t0\n160\t4.2\t6.76\t37.99\t\"Present\"\t61\t32.91\t3.09\t54\t1\n144\t0\t4.17\t29.63\t\"Present\"\t52\t21.83\t0\t59\t0\n148\t4.5\t10.49\t33.27\t\"Absent\"\t50\t25.92\t2.06\t53\t1\n146\t0\t4.92\t18.53\t\"Absent\"\t57\t24.2\t34.97\t26\t0\n164\t5.6\t3.17\t30.98\t\"Present\"\t44\t25.99\t43.2\t53\t1\n130\t0.54\t3.63\t22.03\t\"Present\"\t69\t24.34\t12.86\t39\t1\n154\t2.4\t5.63\t42.17\t\"Present\"\t59\t35.07\t12.86\t50\t1\n178\t0.95\t4.75\t21.06\t\"Absent\"\t49\t23.74\t24.69\t61\t0\n180\t3.57\t3.57\t36.1\t\"Absent\"\t36\t26.7\t19.95\t64\t0\n134\t12.5\t2.73\t39.35\t\"Absent\"\t48\t35.58\t0\t48\t0\n142\t0\t3.54\t16.64\t\"Absent\"\t58\t25.97\t8.36\t27\t0\n162\t7\t7.67\t34.34\t\"Present\"\t33\t30.77\t0\t62\t0\n218\t11.2\t2.77\t30.79\t\"Absent\"\t38\t24.86\t90.93\t48\t1\n126\t8.75\t6.06\t32.72\t\"Present\"\t33\t27\t62.43\t55\t1\n126\t0\t3.57\t26.01\t\"Absent\"\t61\t26.3\t7.97\t47\t0\n134\t6.1\t4.77\t26.08\t\"Absent\"\t47\t23.82\t1.03\t49\t0\n132\t0\t4.17\t36.57\t\"Absent\"\t57\t30.61\t18\t49\t0\n178\t5.5\t3.79\t23.92\t\"Present\"\t45\t21.26\t6.17\t62\t1\n208\t5.04\t5.19\t20.71\t\"Present\"\t52\t25.12\t24.27\t58\t1\n160\t1.15\t10.19\t39.71\t\"Absent\"\t31\t31.65\t20.52\t57\t0\n116\t2.38\t5.67\t29.01\t\"Present\"\t54\t27.26\t15.77\t51\t0\n180\t25.01\t3.7\t38.11\t\"Present\"\t57\t30.54\t0\t61\t1\n200\t19.2\t4.43\t40.6\t\"Present\"\t55\t32.04\t36\t60\t1\n112\t4.2\t3.58\
t27.14\t\"Absent\"\t52\t26.83\t2.06\t40\t0\n120\t0\t3.1\t26.97\t\"Absent\"\t41\t24.8\t0\t16\t0\n178\t20\t9.78\t33.55\t\"Absent\"\t37\t27.29\t2.88\t62\t1\n166\t0.8\t5.63\t36.21\t\"Absent\"\t50\t34.72\t28.8\t60\t0\n164\t8.2\t14.16\t36.85\t\"Absent\"\t52\t28.5\t17.02\t55\t1\n216\t0.92\t2.66\t19.85\t\"Present\"\t49\t20.58\t0.51\t63\t1\n146\t6.4\t5.62\t33.05\t\"Present\"\t57\t31.03\t0.74\t46\t0\n134\t1.1\t3.54\t20.41\t\"Present\"\t58\t24.54\t39.91\t39\t1\n158\t16\t5.56\t29.35\t\"Absent\"\t36\t25.92\t58.32\t60\t0\n176\t0\t3.14\t31.04\t\"Present\"\t45\t30.18\t4.63\t45\t0\n132\t2.8\t4.79\t20.47\t\"Present\"\t50\t22.15\t11.73\t48\t0\n126\t0\t4.55\t29.18\t\"Absent\"\t48\t24.94\t36\t41\t0\n120\t5.5\t3.51\t23.23\t\"Absent\"\t46\t22.4\t90.31\t43\t0\n174\t0\t3.86\t21.73\t\"Absent\"\t42\t23.37\t0\t63\t0\n150\t13.8\t5.1\t29.45\t\"Present\"\t52\t27.92\t77.76\t55\t1\n176\t6\t3.98\t17.2\t\"Present\"\t52\t21.07\t4.11\t61\t1\n142\t2.2\t3.29\t22.7\t\"Absent\"\t44\t23.66\t5.66\t42\t1\n132\t0\t3.3\t21.61\t\"Absent\"\t42\t24.92\t32.61\t33\t0\n142\t1.32\t7.63\t29.98\t\"Present\"\t57\t31.16\t72.93\t33\t0\n146\t1.16\t2.28\t34.53\t\"Absent\"\t50\t28.71\t45\t49\t0\n132\t7.2\t3.65\t17.16\t\"Present\"\t56\t23.25\t0\t34\t0\n120\t0\t3.57\t23.22\t\"Absent\"\t58\t27.2\t0\t32\t0\n118\t0\t3.89\t15.96\t\"Absent\"\t65\t20.18\t0\t16\t0\n108\t0\t1.43\t26.26\t\"Absent\"\t42\t19.38\t0\t16\t0\n136\t0\t4\t19.06\t\"Absent\"\t40\t21.94\t2.06\t16\t0\n120\t0\t2.46\t13.39\t\"Absent\"\t47\t22.01\t0.51\t18\t0\n132\t0\t3.55\t8.66\t\"Present\"\t61\t18.5\t3.87\t16\t0\n136\t0\t1.77\t20.37\t\"Absent\"\t45\t21.51\t2.06\t16\t0\n138\t0\t1.86\t18.35\t\"Present\"\t59\t25.38\t6.51\t17\t0\n138\t0.06\t4.15\t20.66\t\"Absent\"\t49\t22.59\t2.49\t16\t0\n130\t1.22\t3.3\t13.65\t\"Absent\"\t50\t21.4\t3.81\t31\t0\n130\t4\t2.4\t17.42\t\"Absent\"\t60\t22.05\t0\t40\t0\n110\t0\t7.14\t28.28\t\"Absent\"\t57\t29\t0\t32\t0\n120\t0\t3.98\t13.19\t\"Present\"\t47\t21.89\t0\t16\t0\n166\t6\t8.8\t37.89\t\"Absent\"\t39\t28.7\t43.2\t52\t0\n134\t0.57\t4.7
5\t23.07\t\"Absent\"\t67\t26.33\t0\t37\t0\n142\t3\t3.69\t25.1\t\"Absent\"\t60\t30.08\t38.88\t27\t0\n136\t2.8\t2.53\t9.28\t\"Present\"\t61\t20.7\t4.55\t25\t0\n142\t0\t4.32\t25.22\t\"Absent\"\t47\t28.92\t6.53\t34\t1\n130\t0\t1.88\t12.51\t\"Present\"\t52\t20.28\t0\t17\t0\n124\t1.8\t3.74\t16.64\t\"Present\"\t42\t22.26\t10.49\t20\t0\n144\t4\t5.03\t25.78\t\"Present\"\t57\t27.55\t90\t48\t1\n136\t1.81\t3.31\t6.74\t\"Absent\"\t63\t19.57\t24.94\t24\t0\n120\t0\t2.77\t13.35\t\"Absent\"\t67\t23.37\t1.03\t18\t0\n154\t5.53\t3.2\t28.81\t\"Present\"\t61\t26.15\t42.79\t42\t0\n124\t1.6\t7.22\t39.68\t\"Present\"\t36\t31.5\t0\t51\t1\n146\t0.64\t4.82\t28.02\t\"Absent\"\t60\t28.11\t8.23\t39\t1\n128\t2.24\t2.83\t26.48\t\"Absent\"\t48\t23.96\t47.42\t27\t1\n170\t0.4\t4.11\t42.06\t\"Present\"\t56\t33.1\t2.06\t57\t0\n214\t0.4\t5.98\t31.72\t\"Absent\"\t64\t28.45\t0\t58\t0\n182\t4.2\t4.41\t32.1\t\"Absent\"\t52\t28.61\t18.72\t52\t1\n108\t3\t1.59\t15.23\t\"Absent\"\t40\t20.09\t26.64\t55\t0\n118\t5.4\t11.61\t30.79\t\"Absent\"\t64\t27.35\t23.97\t40\t0\n132\t0\t4.82\t33.41\t\"Present\"\t62\t14.7\t0\t46\t1"
  },
  {
    "path": "2017/examples/02_feed_dict.py",
    "content": "\"\"\" Example to demonstrate the use of feed_dict\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\n# Example 1: feed_dict with placeholder\n# create a placeholder of type float 32-bit, value is a vector of 3 elements\na = tf.placeholder(tf.float32, shape=[3])\n\n# create a constant of type float 32-bit, value is a vector of 3 elements\nb = tf.constant([5, 5, 5], tf.float32)\n\n# use the placeholder as you would a constant\nc = a + b  # short for tf.add(a, b)\n\nwith tf.Session() as sess:\n\t# print(sess.run(c)) # InvalidArgumentError because a doesn’t have any value\n\n\t# feed [1, 2, 3] to placeholder a via the dict {a: [1, 2, 3]}\n\t# fetch value of c\n\tprint(sess.run(c, {a: [1, 2, 3]})) # >> [6. 7. 8.]\n\n\n# Example 2: feed_dict with an ordinary tensor\n# a is the output of an add op, not a placeholder; any feedable tensor can be overridden\na = tf.add(2, 5)\nb = tf.multiply(a, 3)\n\nwith tf.Session() as sess:\n\t# define a dictionary that says to replace the value of 'a' with 15\n\treplace_dict = {a: 15}\n\n\t# Run the session, passing in 'replace_dict' as the value to 'feed_dict'\n\tprint(sess.run(b, feed_dict=replace_dict)) # >> 45"
  },
  {
    "path": "2017/examples/02_lazy_loading.py",
    "content": "\"\"\" Example to demonstrate how the graph definition gets\nbloated because of lazy loading\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf \n\n######################################## \n## NORMAL LOADING   \t\t\t      ##\n## print out a graph with 1 Add node  ## \n########################################\n\nx = tf.Variable(10, name='x')\ny = tf.Variable(20, name='y')\nz = tf.add(x, y)\n\nwith tf.Session() as sess:\n\tsess.run(tf.global_variables_initializer())\n\twriter = tf.summary.FileWriter('./graphs/l2', sess.graph)\n\tfor _ in range(10):\n\t\tsess.run(z)\n\tprint(tf.get_default_graph().as_graph_def())\n\twriter.close()\n\n######################################## \n## LAZY LOADING   \t\t\t\t\t  ##\n## print out a graph with 10 Add nodes## \n########################################\n\nx = tf.Variable(10, name='x')\ny = tf.Variable(20, name='y')\n\nwith tf.Session() as sess:\n\tsess.run(tf.global_variables_initializer())\n\twriter = tf.summary.FileWriter('./graphs/l2', sess.graph)\n\tfor _ in range(10):\n\t\tsess.run(tf.add(x, y))\n\tprint(tf.get_default_graph().as_graph_def()) \n\twriter.close()"
  },
  {
    "path": "2017/examples/02_simple_tf.py",
    "content": "\"\"\" Some simple TensorFlow ops\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\n\n\na = tf.constant(2)\nb = tf.constant(3)\nx = tf.add(a, b)\nwith tf.Session() as sess:\n\twriter = tf.summary.FileWriter('./graphs', sess.graph) \n\tprint(sess.run(x))\nwriter.close() # close the writer when you’re done using it\n\n\na = tf.constant([2, 2], name='a')\nb = tf.constant([[0, 1], [2, 3]], name='b')\n# element-wise product with broadcasting (not a matrix product, despite the op name)\nx = tf.multiply(a, b, name='dot_product')\nwith tf.Session() as sess:\n\tprint(sess.run(x))\n# >> [[0 2]\n#\t [4 6]]\n\n# signature: tf.zeros(shape, dtype=tf.float32, name=None)\n# creates a tensor of the given shape with all elements set to zero (when run in a session)\n\nx = tf.zeros([2, 3], tf.int32) \ny = tf.zeros_like(x, optimize=True)\nprint(y)\nprint(tf.get_default_graph().as_graph_def())\nwith tf.Session() as sess:\n\ty = sess.run(y)\n\n\nwith tf.Session() as sess:\n\tprint(sess.run(tf.linspace(10.0, 13.0, 4)))\n\tprint(sess.run(tf.range(5)))\n\tfor i in np.arange(5):\n\t\tprint(i)\n\nsamples = tf.multinomial(tf.constant([[1., 3., 1]]), 5)\n\nwith tf.Session() as sess:\n\tfor _ in range(10):\n\t\tprint(sess.run(samples))\n\nt_0 = 19 \nx = tf.zeros_like(t_0) # ==> 0\ny = tf.ones_like(t_0) # ==> 1\n\nwith tf.Session() as sess:\n\tprint(sess.run([x, y]))\n\nt_1 = ['apple', 'peach', 'banana']\nx = tf.zeros_like(t_1) # ==> ['' '' '']\n# y = tf.ones_like(t_1) # ==> TypeError: Expected string, got 1 of type 'int' instead.\n\nt_2 = [[True, False, False],\n       [False, False, True],\n       [False, True, False]] \nx = tf.zeros_like(t_2) # ==> 3x3 tensor, all elements are False\ny = tf.ones_like(t_2) # ==> 3x3 tensor, all elements are True\nwith tf.Session() as sess:\n\tprint(sess.run([x, y]))\n\nwith tf.variable_scope('meh') as scope:\n\ta = tf.get_variable('a', [10])\n\tb = tf.get_variable('b', 
[100])\n\nwriter = tf.summary.FileWriter('test', tf.get_default_graph())\nwriter.close() # close the writer so the graph is flushed to disk\n\n\nx = tf.Variable(2.0)\ny = 2.0 * (x ** 3)\nz = 3.0 + y ** 2\ngrad_z = tf.gradients(z, [x, y])\nwith tf.Session() as sess:\n\tsess.run(x.initializer)\n\tprint(sess.run(grad_z))\n"
  },
  {
    "path": "2017/examples/02_variables.py",
    "content": "\"\"\"\nExample to demonstrate the ops of tf.Variables()\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\n# Example 1: how to run assign op\nW = tf.Variable(10)\nassign_op = W.assign(100)\n\nwith tf.Session() as sess:\n\tsess.run(W.initializer)\n\tprint(W.eval()) # >> 10\n\tprint(sess.run(assign_op)) # >> 100\n\n# Example 2: tricky example\n# create a variable whose original value is 2\nmy_var = tf.Variable(2, name=\"my_var\") \n\n# assign 2 * my_var to my_var and run the op my_var_times_two\nmy_var_times_two = my_var.assign(2 * my_var)\n\nwith tf.Session() as sess:\n\tsess.run(tf.global_variables_initializer())\n\tprint(sess.run(my_var_times_two)) # >> 4\n\tprint(sess.run(my_var_times_two)) # >> 8\n\tprint(sess.run(my_var_times_two)) # >> 16\n\n# Example 3: each session maintains its own copy of variables\nW = tf.Variable(10)\nsess1 = tf.Session()\nsess2 = tf.Session()\n\n# You have to initialize W at each session\nsess1.run(W.initializer)\nsess2.run(W.initializer)\n\nprint(sess1.run(W.assign_add(10))) # >> 20\nprint(sess2.run(W.assign_sub(2))) # >> 8\n\nprint(sess1.run(W.assign_add(100))) # >> 120\nprint(sess2.run(W.assign_sub(50))) # >> -42\n\nsess1.close()\nsess2.close()"
  },
  {
    "path": "2017/examples/03_linear_regression_sol.py",
    "content": "\"\"\" Simple linear regression example in TensorFlow\nThis program tries to predict the number of thefts from \nthe number of fires in the city of Chicago\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimport xlrd\n\nimport utils\n\nDATA_FILE = 'data/fire_theft.xls'\n\n# Step 1: read in data from the .xls file\nbook = xlrd.open_workbook(DATA_FILE, encoding_override=\"utf-8\")\nsheet = book.sheet_by_index(0)\ndata = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])\nn_samples = sheet.nrows - 1\n\n# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)\nX = tf.placeholder(tf.float32, name='X')\nY = tf.placeholder(tf.float32, name='Y')\n\n# Step 3: create weight and bias, initialized to 0\nw = tf.Variable(0.0, name='weights')\nb = tf.Variable(0.0, name='bias')\n\n# Step 4: build model to predict Y\nY_predicted = X * w + b \n\n# Step 5: use the square error as the loss function\nloss = tf.square(Y - Y_predicted, name='loss')\n# loss = utils.huber_loss(Y, Y_predicted)\n\n# Step 6: using gradient descent with learning rate of 0.001 to minimize loss\noptimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)\n\nwith tf.Session() as sess:\n\t# Step 7: initialize the necessary variables, in this case, w and b\n\tsess.run(tf.global_variables_initializer()) \n\t\n\twriter = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)\n\t\n\t# Step 8: train the model\n\tfor i in range(50): # train the model for 50 epochs\n\t\ttotal_loss = 0\n\t\tfor x, y in data:\n\t\t\t# Session runs the optimizer op and fetches the value of loss\n\t\t\t_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y}) \n\t\t\ttotal_loss += l\n\t\tprint('Epoch {0}: {1}'.format(i, total_loss/n_samples))\n\n\t# close the writer when 
you're done using it\n\twriter.close() \n\t\n\t# Step 9: output the values of w and b\n\tw, b = sess.run([w, b]) \n\n# plot the results\nX, Y = data.T[0], data.T[1]\nplt.plot(X, Y, 'bo', label='Real data')\nplt.plot(X, X * w + b, 'r', label='Predicted data')\nplt.legend()\nplt.show()"
  },
  {
    "path": "2017/examples/03_linear_regression_starter.py",
    "content": "\"\"\" Simple linear regression example in TensorFlow\nThis program tries to predict the number of thefts from \nthe number of fires in the city of Chicago\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimport xlrd\n\nimport utils\n\nDATA_FILE = 'data/fire_theft.xls'\n\n# Phase 1: Assemble the graph\n# Step 1: read in data from the .xls file\nbook = xlrd.open_workbook(DATA_FILE, encoding_override='utf-8')\nsheet = book.sheet_by_index(0)\ndata = np.asarray([sheet.row_values(i) for i in range(1, sheet.nrows)])\nn_samples = sheet.nrows - 1\n\n# Step 2: create placeholders for input X (number of fires) and label Y (number of thefts)\n# Both have the type float32\n\n\n# Step 3: create weight and bias, initialized to 0\n# name your variables w and b\n\n\n# Step 4: predict Y (number of thefts) from the number of fires\n# name your variable Y_predicted\n\n\n# Step 5: use the square error as the loss function\n# name your variable loss\n\n\n# Step 6: using gradient descent with learning rate of 0.001 to minimize loss\n \n# Phase 2: Train our model\nwith tf.Session() as sess:\n\t# Step 7: initialize the necessary variables, in this case, w and b\n\t# TO - DO\t\n\n\n\t# Step 8: train the model\n\tfor i in range(50): # run 50 epochs\n\t\ttotal_loss = 0\n\t\tfor x, y in data:\n\t\t\t# Session runs optimizer to minimize loss and fetch the value of loss. Name the received value as l\n\t\t\t# TO DO: write sess.run()\n\n\t\t\ttotal_loss += l\n\t\tprint(\"Epoch {0}: {1}\".format(i, total_loss/n_samples))\n\t\n# plot the results\n# X, Y = data.T[0], data.T[1]\n# plt.plot(X, Y, 'bo', label='Real data')\n# plt.plot(X, X * w + b, 'r', label='Predicted data')\n# plt.legend()\n# plt.show()"
  },
  {
    "path": "2017/examples/03_logistic_regression_mnist_sol.py",
    "content": "\"\"\" Simple logistic regression model to solve OCR task \nwith MNIST in TensorFlow\nMNIST dataset: yann.lecun.com/exdb/mnist/\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.examples.tutorials.mnist import input_data\nimport time\n\n# Define parameters for the model\nlearning_rate = 0.01\nbatch_size = 128\nn_epochs = 30\n\n# Step 1: Read in data\n# using TF Learn's built in function to load MNIST data to the folder data/mnist\nmnist = input_data.read_data_sets('/data/mnist', one_hot=True) \n\n# Step 2: create placeholders for features and labels\n# each image in the MNIST data is of shape 28*28 = 784\n# therefore, each image is represented with a 1x784 tensor\n# there are 10 classes for each image, corresponding to digits 0 - 9. \n# each label is a one-hot vector.\nX = tf.placeholder(tf.float32, [batch_size, 784], name='X_placeholder') \nY = tf.placeholder(tf.int32, [batch_size, 10], name='Y_placeholder')\n\n# Step 3: create weights and bias\n# w is initialized to random variables with mean of 0, stddev of 0.01\n# b is initialized to 0\n# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)\n# shape of b depends on Y\nw = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name='weights')\nb = tf.Variable(tf.zeros([1, 10]), name=\"bias\")\n\n# Step 4: build model\n# the model that returns the logits.\n# these logits will be later passed through softmax layer\nlogits = tf.matmul(X, w) + b \n\n# Step 5: define loss function\n# use cross entropy of softmax of logits as the loss function\nentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')\nloss = tf.reduce_mean(entropy) # computes the mean over all the examples in the batch\n\n# Step 6: define training op\n# using the Adam optimizer with learning rate 
of 0.01 to minimize loss\noptimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)\n\nwith tf.Session() as sess:\n\t# to visualize using TensorBoard\n\twriter = tf.summary.FileWriter('./graphs/logistic_reg', sess.graph)\n\n\tstart_time = time.time()\n\tsess.run(tf.global_variables_initializer())\t\n\tn_batches = int(mnist.train.num_examples/batch_size)\n\tfor i in range(n_epochs): # train the model n_epochs times\n\t\ttotal_loss = 0\n\n\t\tfor _ in range(n_batches):\n\t\t\tX_batch, Y_batch = mnist.train.next_batch(batch_size)\n\t\t\t_, loss_batch = sess.run([optimizer, loss], feed_dict={X: X_batch, Y:Y_batch}) \n\t\t\ttotal_loss += loss_batch\n\t\tprint('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))\n\n\tprint('Total time: {0} seconds'.format(time.time() - start_time))\n\n\tprint('Optimization Finished!') # should be around 0.35 after 25 epochs\n\n\t# test the model\n\t\n\tpreds = tf.nn.softmax(logits)\n\tcorrect_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))\n\taccuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(\n\t\n\tn_batches = int(mnist.test.num_examples/batch_size)\n\ttotal_correct_preds = 0\n\t\n\tfor i in range(n_batches):\n\t\tX_batch, Y_batch = mnist.test.next_batch(batch_size)\n\t\t# fetch accuracy directly (not as a one-element list) so it can be added to the running total\n\t\taccuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y:Y_batch}) \n\t\ttotal_correct_preds += accuracy_batch\t\n\t\n\tprint('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))\n\n\twriter.close()\n"
  },
  {
    "path": "2017/examples/03_logistic_regression_mnist_starter.py",
    "content": "\"\"\" Starter code for logistic regression model to solve OCR task \nwith MNIST in TensorFlow\nMNIST dataset: yann.lecun.com/exdb/mnist/\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.examples.tutorials.mnist import input_data\nimport time\n\n# Define parameters for the model\nlearning_rate = 0.01\nbatch_size = 128\nn_epochs = 10\n\n# Step 1: Read in data\n# using TF Learn's built in function to load MNIST data to the folder data/mnist\nmnist = input_data.read_data_sets('/data/mnist', one_hot=True) \n\n# Step 2: create placeholders for features and labels\n# each image in the MNIST data is of shape 28*28 = 784\n# therefore, each image is represented with a 1x784 tensor\n# there are 10 classes for each image, corresponding to digits 0 - 9. \n# Features are of the type float, and labels are of the type int\n\n\n# Step 3: create weights and bias\n# weights and biases are initialized to 0\n# shape of w depends on the dimension of X and Y so that Y = X * w + b\n# shape of b depends on Y\n\n\n# Step 4: build model\n# the model that returns the logits.\n# these logits will be later passed through softmax layer\n# to get the probability distribution of possible label of the image\n# DO NOT DO SOFTMAX HERE\n\n\n# Step 5: define loss function\n# use cross entropy loss of the real labels with the softmax of logits\n# use the method (arguments must be passed by keyword):\n# tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)\n# then use tf.reduce_mean to get the mean loss of the batch\n\n\n# Step 6: define training op\n# using gradient descent to minimize loss\n\n\nwith tf.Session() as sess:\n\tstart_time = time.time()\n\tsess.run(tf.global_variables_initializer())\t\n\tn_batches = int(mnist.train.num_examples/batch_size)\n\tfor i in range(n_epochs): # train the model n_epochs times\n\t\ttotal_loss = 
0\n\n\t\tfor _ in range(n_batches):\n\t\t\tX_batch, Y_batch = mnist.train.next_batch(batch_size)\n\t\t\t# TO-DO: run optimizer + fetch loss_batch\n\t\t\t# \n\t\t\t# \n\t\t\ttotal_loss += loss_batch\n\t\tprint('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))\n\n\tprint('Total time: {0} seconds'.format(time.time() - start_time))\n\n\tprint('Optimization Finished!') # should be around 0.35 after 25 epochs\n\n\t# test the model\n\tpreds = tf.nn.softmax(logits)\n\tcorrect_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))\n\taccuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32)) # need numpy.count_nonzero(boolarr) :(\n\t\n\tn_batches = int(mnist.test.num_examples/batch_size)\n\ttotal_correct_preds = 0\n\t\n\tfor i in range(n_batches):\n\t\tX_batch, Y_batch = mnist.test.next_batch(batch_size)\n\t\t# fetch accuracy directly (not as a one-element list) so it can be added to the running total\n\t\taccuracy_batch = sess.run(accuracy, feed_dict={X: X_batch, Y:Y_batch}) \n\t\ttotal_correct_preds += accuracy_batch\t\n\t\n\tprint('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))\n"
  },
  {
    "path": "2017/examples/04_word2vec_no_frills.py",
    "content": "\"\"\" The no frills implementation of word2vec skip-gram model using NCE loss.\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.contrib.tensorboard.plugins import projector\n\nfrom process_data import process_data\n\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128 # dimension of the word embedding vectors\nSKIP_WINDOW = 1 # the context window\nNUM_SAMPLED = 64    # Number of negative examples to sample.\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 10000\nSKIP_STEP = 2000 # how many steps to skip before reporting the loss\n\ndef word2vec(batch_gen):\n    \"\"\" Build the graph for word2vec model and train it \"\"\"\n    # Step 1: define the placeholders for input and output\n    with tf.name_scope('data'):\n        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')\n        target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')\n\n    # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU\n    # Step 2: define weights. 
In word2vec, it's actually the weights that we care about\n\n    with tf.name_scope('embedding_matrix'):\n        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), \n                            name='embed_matrix')\n\n    # Step 3: define the inference\n    with tf.name_scope('loss'):\n        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')\n\n        # Step 4: construct variables for NCE loss\n        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],\n                                                    stddev=1.0 / (EMBED_SIZE ** 0.5)), \n                                                    name='nce_weight')\n        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')\n\n        # define loss function to be NCE loss function\n        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, \n                                            biases=nce_bias, \n                                            labels=target_words, \n                                            inputs=embed, \n                                            num_sampled=NUM_SAMPLED, \n                                            num_classes=VOCAB_SIZE), name='loss')\n\n    # Step 5: define optimizer\n    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)\n    \n    with tf.Session() as sess:\n        sess.run(tf.global_variables_initializer())\n\n        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps\n        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)\n        for index in range(NUM_TRAIN_STEPS):\n            centers, targets = next(batch_gen)\n            loss_batch, _ = sess.run([loss, optimizer], \n                                    feed_dict={center_words: centers, target_words: targets})\n            total_loss += loss_batch\n            if (index + 1) % SKIP_STEP == 0:\n                print('Average loss at step {}: 
{:5.1f}'.format(index, total_loss / SKIP_STEP))\n                total_loss = 0.0\n        writer.close()\n\ndef main():\n    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)\n    word2vec(batch_gen)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "2017/examples/04_word2vec_starter.py",
    "content": "\"\"\" The no frills implementation of word2vec skip-gram model using NCE loss. \nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.contrib.tensorboard.plugins import projector\n\nfrom process_data import process_data\n\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128 # dimension of the word embedding vectors\nSKIP_WINDOW = 1 # the context window\nNUM_SAMPLED = 64    # Number of negative examples to sample.\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 20000\nSKIP_STEP = 2000 # how many steps to skip before reporting the loss\n\ndef word2vec(batch_gen):\n    \"\"\" Build the graph for word2vec model and train it \"\"\"\n    # Step 1: define the placeholders for input and output\n    # center_words have to be int to work on embedding lookup\n\n    # TO DO\n\n\n    # Step 2: define weights. 
In word2vec, it's actually the weights that we care about\n    # vocab size x embed size\n    # initialized to random uniform -1 to 1\n\n    # TO DO\n\n\n    # Step 3: define the inference\n    # get the embed of input words using tf.nn.embedding_lookup\n    # embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')\n\n    # TO DO\n\n\n        # Step 4: construct variables for NCE loss\n        # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)\n        # nce_weight (vocab size x embed size), initialized to truncated_normal stddev=1.0 / (EMBED_SIZE ** 0.5)\n        # bias: vocab size, initialized to 0\n\n        # TO DO\n\n\n        # define loss function to be NCE loss function\n        # tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)\n        # need to get the mean across the batch\n        # note: you should use embedding of center words for inputs, not center words themselves\n\n        # TO DO\n\n        \n    # Step 5: define optimizer\n    \n    # TO DO\n\n\n\n    with tf.Session() as sess:\n        # TO DO: initialize variables\n\n\n        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps\n        writer = tf.summary.FileWriter('./graphs/no_frills/', sess.graph)\n        for index in range(NUM_TRAIN_STEPS):\n            centers, targets = next(batch_gen)\n            # TO DO: create feed_dict, run optimizer, fetch loss_batch\n\n            total_loss += loss_batch\n            if (index + 1) % SKIP_STEP == 0:\n                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))\n                total_loss = 0.0\n        writer.close()\n\ndef main():\n    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)\n    word2vec(batch_gen)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "2017/examples/04_word2vec_visualize.py",
    "content": "\"\"\" word2vec with NCE loss and code to visualize the embeddings on TensorBoard\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nfrom tensorflow.contrib.tensorboard.plugins import projector\nimport tensorflow as tf\n\nfrom process_data import process_data\nimport utils\n\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128 # dimension of the word embedding vectors\nSKIP_WINDOW = 1 # the context window\nNUM_SAMPLED = 64    # Number of negative examples to sample.\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 100000\nWEIGHTS_FLD = 'processed/'\nSKIP_STEP = 2000\n\nclass SkipGramModel:\n    \"\"\" Build the graph for word2vec model \"\"\"\n    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):\n        self.vocab_size = vocab_size\n        self.embed_size = embed_size\n        self.batch_size = batch_size\n        self.num_sampled = num_sampled\n        self.lr = learning_rate\n        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\n    def _create_placeholders(self):\n        \"\"\" Step 1: define the placeholders for input and output \"\"\"\n        with tf.name_scope(\"data\"):\n            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words')\n            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words')\n\n    def _create_embedding(self):\n        \"\"\" Step 2: define weights. In word2vec, it's actually the weights that we care about \"\"\"\n        # Assemble this part of the graph on the CPU. 
You can change it to GPU if you have GPU\n        with tf.device('/cpu:0'):\n            with tf.name_scope(\"embed\"):\n                self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size, \n                                                                    self.embed_size], -1.0, 1.0), \n                                                                    name='embed_matrix')\n\n    def _create_loss(self):\n        \"\"\" Step 3 + 4: define the model + the loss function \"\"\"\n        with tf.device('/cpu:0'):\n            with tf.name_scope(\"loss\"):\n                # Step 3: define the inference\n                embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')\n\n                # Step 4: define loss function\n                # construct variables for NCE loss\n                nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size],\n                                                            stddev=1.0 / (self.embed_size ** 0.5)), \n                                                            name='nce_weight')\n                nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')\n\n                # define loss function to be NCE loss function\n                self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, \n                                                    biases=nce_bias, \n                                                    labels=self.target_words, \n                                                    inputs=embed, \n                                                    num_sampled=self.num_sampled, \n                                                    num_classes=self.vocab_size), name='loss')\n    def _create_optimizer(self):\n        \"\"\" Step 5: define optimizer \"\"\"\n        with tf.device('/cpu:0'):\n            self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, \n                                                       
       global_step=self.global_step)\n\n    def _create_summaries(self):\n        with tf.name_scope(\"summaries\"):\n            tf.summary.scalar(\"loss\", self.loss)\n            tf.summary.histogram(\"histogram loss\", self.loss)\n            # because we have several summaries, we merge them all\n            # into one op to make them easier to manage\n            self.summary_op = tf.summary.merge_all()\n\n    def build_graph(self):\n        \"\"\" Build the graph for our model \"\"\"\n        self._create_placeholders()\n        self._create_embedding()\n        self._create_loss()\n        self._create_optimizer()\n        self._create_summaries()\n\ndef train_model(model, batch_gen, num_train_steps, weights_fld):\n    saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias\n\n    initial_step = 0\n    utils.make_dir('checkpoints')\n    with tf.Session() as sess:\n        sess.run(tf.global_variables_initializer())\n        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))\n        # if that checkpoint exists, restore from checkpoint\n        if ckpt and ckpt.model_checkpoint_path:\n            saver.restore(sess, ckpt.model_checkpoint_path)\n\n        total_loss = 0.0 # we use this to calculate the average loss in the last SKIP_STEP steps\n        writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph)\n        initial_step = model.global_step.eval()\n        for index in range(initial_step, initial_step + num_train_steps):\n            centers, targets = next(batch_gen)\n            feed_dict={model.center_words: centers, model.target_words: targets}\n            loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op], \n                                              feed_dict=feed_dict)\n            writer.add_summary(summary, global_step=index)\n            total_loss += loss_batch\n            if (index + 1) % 
SKIP_STEP == 0:\n                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))\n                total_loss = 0.0\n                saver.save(sess, 'checkpoints/skip-gram', index)\n        \n        ####################\n        # code to visualize the embeddings. uncomment the below to visualize embeddings\n        # run \"tensorboard --logdir='processed'\" to see the embeddings\n        # final_embed_matrix = sess.run(model.embed_matrix)\n        \n        # # it has to be a variable; constants don't work here. you can't reuse model.embed_matrix\n        # embedding_var = tf.Variable(final_embed_matrix[:1000], name='embedding')\n        # sess.run(embedding_var.initializer)\n\n        # config = projector.ProjectorConfig()\n        # summary_writer = tf.summary.FileWriter('processed')\n\n        # # add embedding to the config file\n        # embedding = config.embeddings.add()\n        # embedding.tensor_name = embedding_var.name\n        \n        # # link this tensor to its metadata file, in this case the first 1000 words of vocab\n        # embedding.metadata_path = 'processed/vocab_1000.tsv'\n\n        # # saves a configuration file that TensorBoard will read during startup.\n        # projector.visualize_embeddings(summary_writer, config)\n        # saver_embed = tf.train.Saver([embedding_var])\n        # saver_embed.save(sess, 'processed/model3.ckpt', 1)\n\ndef main():\n    model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)\n    model.build_graph()\n    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)\n    train_model(model, batch_gen, NUM_TRAIN_STEPS, WEIGHTS_FLD)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "2017/examples/05_csv_reader.py",
    "content": "\"\"\" Some people tried to use TextLineReader for assignment 1\nbut seem to have problems getting it to work, so here is a short \nscript demonstrating the use of CSV reader on the heart dataset.\nNote that the heart dataset is originally in txt so I first\nconverted it to csv to take advantage of the already laid out columns.\n\nYou can download heart.csv in the data folder.\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport sys\nsys.path.append('..')\n\nimport tensorflow as tf\n\nDATA_PATH = 'data/heart.csv'\nBATCH_SIZE = 2\nN_FEATURES = 9\n\ndef batch_generator(filenames):\n    \"\"\" filenames is the list of files you want to read from. \n    In this case, it contains only heart.csv\n    \"\"\"\n    filename_queue = tf.train.string_input_producer(filenames)\n    reader = tf.TextLineReader(skip_header_lines=1) # skip the first line in the file\n    _, value = reader.read(filename_queue)\n\n    # record_defaults are the default values in case some of our columns are empty\n    # This is also to tell tensorflow the format of our data (the type of the decode result)\n    # for this dataset, out of 9 feature columns, \n    # 8 of them are floats (some are integers, but to make our features homogeneous, \n    # we consider them floats), and 1 is string (at position 5)\n    # the last column, which corresponds to the label, is an integer\n\n    record_defaults = [[1.0] for _ in range(N_FEATURES)]\n    record_defaults[4] = ['']\n    record_defaults.append([1])\n\n    # decode one row of data (one line of the csv file) at a time\n    content = tf.decode_csv(value, record_defaults=record_defaults) \n\n    # convert the 5th column (present/absent) to the binary value 0 and 1\n    content[4] = tf.cond(tf.equal(content[4], tf.constant('Present')), lambda: tf.constant(1.0), lambda: tf.constant(0.0))\n\n    # pack all 9 features into a tensor\n    features = 
tf.stack(content[:N_FEATURES])\n\n    # assign the last column to label\n    label = content[-1]\n\n    # minimum number elements in the queue after a dequeue, used to ensure \n    # that the samples are sufficiently mixed\n    # I think 10 times the BATCH_SIZE is sufficient\n    min_after_dequeue = 10 * BATCH_SIZE\n\n    # the maximum number of elements in the queue\n    capacity = 20 * BATCH_SIZE\n\n    # shuffle the data to generate BATCH_SIZE sample pairs\n    data_batch, label_batch = tf.train.shuffle_batch([features, label], batch_size=BATCH_SIZE, \n                                        capacity=capacity, min_after_dequeue=min_after_dequeue)\n\n    return data_batch, label_batch\n\ndef generate_batches(data_batch, label_batch):\n    with tf.Session() as sess:\n        coord = tf.train.Coordinator()\n        threads = tf.train.start_queue_runners(coord=coord)\n        for _ in range(10): # generate 10 batches\n            features, labels = sess.run([data_batch, label_batch])\n            print(features)\n        coord.request_stop()\n        coord.join(threads)\n\ndef main():\n    data_batch, label_batch = batch_generator([DATA_PATH])\n    generate_batches(data_batch, label_batch)\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "2017/examples/05_randomization.py",
    "content": "\"\"\" Examples to demonstrate ops level randomization\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\n# Example 1: session is the thing that keeps track of random state\nc = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.57493\n    print(sess.run(c)) # >> -5.97319\n\n# Example 2: each new session will start the random state all over again.\nc = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.57493\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.57493\n\n# Example 3: with operation level random seed, each op keeps its own seed.\nc = tf.random_uniform([], -10, 10, seed=2)\nd = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.57493\n    print(sess.run(d)) # >> 3.57493\n\n# Example 4: graph level random seed\ntf.set_random_seed(2)\nc = tf.random_uniform([], -10, 10)\nd = tf.random_uniform([], -10, 10)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 9.12393\n    print(sess.run(d)) # >> -4.53404\n    "
  },
  {
    "path": "2017/examples/07_basic_filters.py",
    "content": "\"\"\"\nSimple examples of convolution to do some basic filters\nAlso demonstrates the use of TensorFlow data readers.\n\nWe will use some popular filters for our image.\nIt seems to be working with grayscale images, but not with rgb images.\nIt's probably because I didn't choose the right kernels for rgb images.\n\nkernels for rgb images have dimensions 3 x 3 x 3 x 3\nkernels for grayscale images have dimensions 3 x 3 x 1 x 1\n\nNote:\nWhen you call tf.train.string_input_producer,\na tf.train.QueueRunner is added to the graph, which must be run using\ne.g. tf.train.start_queue_runners(), or else your session will deadlock\nwaiting on an empty queue.\n\nAnd to run QueueRunner, you need a coordinator to close your queue for you.\nWithout a coordinator, your threads will keep on running outside the session and you will have the error:\nERROR:tensorflow:Exception in QueueRunner: Attempted to use a closed Session.\n\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport sys\nsys.path.append('..')\n\nfrom matplotlib import gridspec as gridspec\nfrom matplotlib import pyplot as plt\nimport tensorflow as tf\n\nimport kernels\n\nFILENAME = 'data/friday.jpg'\n\ndef read_one_image(filename):\n    \"\"\" This is just to demonstrate how to open an image in TensorFlow,\n    but it's actually a lot easier to use Pillow \n    \"\"\"\n    filename_queue = tf.train.string_input_producer([filename])\n    image_reader = tf.WholeFileReader()\n    _, image_file = image_reader.read(filename_queue)\n    image = tf.image.decode_jpeg(image_file, channels=3)\n    image = tf.cast(image, tf.float32) / 256.0 # cast to float to make conv2d work\n    return image\n\ndef convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'):\n    images = [image[0]]\n    for i, kernel in enumerate(kernels):\n        filtered_image = 
tf.nn.conv2d(image, kernel, strides=strides, padding=padding)[0]\n        if i == 2:\n            filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255)\n        images.append(filtered_image)\n    return images\n\ndef get_real_images(images):\n    with tf.Session() as sess:\n        coord = tf.train.Coordinator()\n        threads = tf.train.start_queue_runners(coord=coord)\n        images = sess.run(images)\n        coord.request_stop()\n        coord.join(threads)\n    return images\n\ndef show_images(images, rgb=True):\n    gs = gridspec.GridSpec(1, len(images))\n    for i, image in enumerate(images):\n        plt.subplot(gs[0, i])\n        if rgb:\n            plt.imshow(image)\n        else: \n            image = image.reshape(image.shape[0], image.shape[1])\n            plt.imshow(image, cmap='gray')\n        plt.axis('off')\n    plt.show()\n\ndef main():\n    rgb = False\n    if rgb:\n        kernels_list = [kernels.BLUR_FILTER_RGB, kernels.SHARPEN_FILTER_RGB, kernels.EDGE_FILTER_RGB, \n                    kernels.TOP_SOBEL_RGB, kernels.EMBOSS_FILTER_RGB]\n    else:\n        kernels_list = [kernels.BLUR_FILTER, kernels.SHARPEN_FILTER, kernels.EDGE_FILTER, \n                    kernels.TOP_SOBEL, kernels.EMBOSS_FILTER]\n\n    image = read_one_image(FILENAME)\n    if not rgb:\n        image = tf.image.rgb_to_grayscale(image)\n    image = tf.expand_dims(image, 0) # to make it into a batch of 1 element\n    images = convolve(image, kernels_list, rgb)\n    images = get_real_images(images)\n    show_images(images, rgb)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "2017/examples/07_convnet_mnist.py",
    "content": "\"\"\" Using convolutional net on MNIST dataset of handwritten digit\n(http://yann.lecun.com/exdb/mnist/)\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport time \n\nimport tensorflow as tf\nimport tensorflow.contrib.layers as layers\nfrom tensorflow.examples.tutorials.mnist import input_data\n\nimport utils\n\nN_CLASSES = 10\n\n# Step 1: Read in data\n# using TF Learn's built in function to load MNIST data to the folder data/mnist\nmnist = input_data.read_data_sets(\"/data/mnist\", one_hot=True)\n\n# Step 2: Define parameters for the model\nLEARNING_RATE = 0.001\nBATCH_SIZE = 128\nSKIP_STEP = 10\nDROPOUT = 0.75\nN_EPOCHS = 1\n\n# Step 3: create placeholders for features and labels\n# each image in the MNIST data is of shape 28*28 = 784\n# therefore, each image is represented with a 1x784 tensor\n# We'll be doing dropout for hidden layer so we'll need a placeholder\n# for the dropout probability too\n# Use None for shape so we can change the batch_size once we've built the graph\nwith tf.name_scope('data'):\n    X = tf.placeholder(tf.float32, [None, 784], name=\"X_placeholder\")\n    Y = tf.placeholder(tf.float32, [None, 10], name=\"Y_placeholder\")\n\ndropout = tf.placeholder(tf.float32, name='dropout')\n\n# Step 4 + 5: create weights + do inference\n# the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax\n\nglobal_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\nwith tf.variable_scope('conv1') as scope:\n    # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d\n    images = tf.reshape(X, shape=[-1, 28, 28, 1]) \n    kernel = tf.get_variable('kernel', [5, 5, 1, 32], \n                            
initializer=tf.truncated_normal_initializer())\n    biases = tf.get_variable('biases', [32],\n                        initializer=tf.random_normal_initializer())\n    conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')\n    conv1 = tf.nn.relu(conv + biases, name=scope.name)\n\n    # output is of dimension BATCH_SIZE x 28 x 28 x 32\n    # the layers equivalent is left commented out so it doesn't overwrite conv1 above\n    # conv1 = layers.conv2d(images, 32, 5, 1, activation_fn=tf.nn.relu, padding='SAME')\n\nwith tf.variable_scope('pool1') as scope:\n    pool1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],\n                           padding='SAME')\n\n    # output is of dimension BATCH_SIZE x 14 x 14 x 32\n\nwith tf.variable_scope('conv2') as scope:\n    # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64\n    kernel = tf.get_variable('kernels', [5, 5, 32, 64], \n                        initializer=tf.truncated_normal_initializer())\n    biases = tf.get_variable('biases', [64],\n                        initializer=tf.random_normal_initializer())\n    conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME')\n    conv2 = tf.nn.relu(conv + biases, name=scope.name)\n\n    # output is of dimension BATCH_SIZE x 14 x 14 x 64\n    # conv2 = layers.conv2d(pool1, 64, 5, 1, activation_fn=tf.nn.relu, padding='SAME')\n\nwith tf.variable_scope('pool2') as scope:\n    # similar to pool1\n    pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],\n                            padding='SAME')\n\n    # output is of dimension BATCH_SIZE x 7 x 7 x 64\n\nwith tf.variable_scope('fc') as scope:\n    # use weight of dimension 7 * 7 * 64 x 1024\n    input_features = 7 * 7 * 64\n    w = tf.get_variable('weights', [input_features, 1024],\n                        initializer=tf.truncated_normal_initializer())\n    b = tf.get_variable('biases', [1024],\n                        initializer=tf.constant_initializer(0.0))\n\n    # reshape pool2 to 2 dimensional\n    pool2 = tf.reshape(pool2, [-1, 
input_features])\n    fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu')\n    \n    # pool2 = layers.flatten(pool2)\n    # fc = layers.fully_connected(pool2, 1024, tf.nn.relu)\n\n    fc = tf.nn.dropout(fc, dropout, name='relu_dropout')\n\nwith tf.variable_scope('softmax_linear') as scope:\n    w = tf.get_variable('weights', [1024, N_CLASSES],\n                        initializer=tf.truncated_normal_initializer())\n    b = tf.get_variable('biases', [N_CLASSES],\n                        initializer=tf.random_normal_initializer())\n    logits = tf.matmul(fc, w) + b\n\n    \n\n\n# Step 6: define loss function\n# use softmax cross entropy with logits as the loss function\n# compute mean cross entropy, softmax is applied internally\nwith tf.name_scope('loss'):\n    entropy = tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits)\n    loss = tf.reduce_mean(entropy, name='loss')\n\nwith tf.name_scope('summaries'):\n    tf.summary.scalar('loss', loss)\n    tf.summary.histogram('histogram loss', loss)\n    summary_op = tf.summary.merge_all()\n\n# Step 7: define training op\n# using the Adam optimizer with learning rate of LEARNING_RATE to minimize cost\noptimizer = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss, \n                                        global_step=global_step)\n\nutils.make_dir('checkpoints')\nutils.make_dir('checkpoints/convnet_mnist')\n\nwith tf.Session() as sess:\n    sess.run(tf.global_variables_initializer())\n    saver = tf.train.Saver()\n    # to visualize using TensorBoard\n    writer = tf.summary.FileWriter('./graphs/convnet', sess.graph)\n    ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))\n    # if that checkpoint exists, restore from checkpoint\n    if ckpt and ckpt.model_checkpoint_path:\n        saver.restore(sess, ckpt.model_checkpoint_path)\n    \n    initial_step = global_step.eval()\n\n    start_time = time.time()\n    n_batches = int(mnist.train.num_examples / BATCH_SIZE)\n\n   
 total_loss = 0.0\n    for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times\n        X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE)\n        _, loss_batch, summary = sess.run([optimizer, loss, summary_op], \n                                feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) \n        writer.add_summary(summary, global_step=index)\n        total_loss += loss_batch\n        if (index + 1) % SKIP_STEP == 0:\n            print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP))\n            total_loss = 0.0\n            saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index)\n    \n    print(\"Optimization Finished!\") # should be around 0.35 after 25 epochs\n    print(\"Total time: {0} seconds\".format(time.time() - start_time))\n    \n    # test the model\n    n_batches = int(mnist.test.num_examples/BATCH_SIZE)\n    total_correct_preds = 0\n    for i in range(n_batches):\n        X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE)\n        # do not run the optimizer here: that would train on the test set\n        loss_batch, logits_batch = sess.run([loss, logits], \n                                        feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) \n        preds = tf.nn.softmax(logits_batch)\n        correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1))\n        accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n        total_correct_preds += sess.run(accuracy)   \n    \n    print(\"Accuracy {0}\".format(total_correct_preds/mnist.test.num_examples))"
  },
  {
    "path": "2017/examples/07_convnet_mnist_starter.py",
    "content": "\"\"\" Using convolutional net on MNIST dataset of handwritten digit\n(http://yann.lecun.com/exdb/mnist/)\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport time \n\nimport tensorflow as tf\nfrom tensorflow.examples.tutorials.mnist import input_data\n\nimport utils\n\nN_CLASSES = 10\n\n# Step 1: Read in data\n# using TF Learn's built in function to load MNIST data to the folder data/mnist\nmnist = input_data.read_data_sets(\"/data/mnist\", one_hot=True)\n\n# Step 2: Define parameters for the model\nLEARNING_RATE = 0.001\nBATCH_SIZE = 128\nSKIP_STEP = 10\nDROPOUT = 0.75\nN_EPOCHS = 1\n\n# Step 3: create placeholders for features and labels\n# each image in the MNIST data is of shape 28*28 = 784\n# therefore, each image is represented with a 1x784 tensor\n# We'll be doing dropout for hidden layer so we'll need a placeholder\n# for the dropout probability too\n# Use None for shape so we can change the batch_size once we've built the graph\nwith tf.name_scope('data'):\n    X = tf.placeholder(tf.float32, [None, 784], name=\"X_placeholder\")\n    Y = tf.placeholder(tf.float32, [None, 10], name=\"Y_placeholder\")\n\ndropout = tf.placeholder(tf.float32, name='dropout')\n\n# Step 4 + 5: create weights + do inference\n# the model is conv -> relu -> pool -> conv -> relu -> pool -> fully connected -> softmax\n\nglobal_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\nutils.make_dir('checkpoints')\nutils.make_dir('checkpoints/convnet_mnist')\n\nwith tf.variable_scope('conv1') as scope:\n    # first, reshape the image to [BATCH_SIZE, 28, 28, 1] to make it work with tf.nn.conv2d\n    # use the dynamic dimension -1\n    images = tf.reshape(X, shape=[-1, 28, 28, 1])\n    \n    # TO DO\n\n    # 
create kernel variable of dimension [5, 5, 1, 32]\n    # use tf.truncated_normal_initializer()\n    \n    # TO DO\n\n    # create biases variable of dimension [32]\n    # use tf.constant_initializer(0.0)\n    \n    # TO DO \n\n    # apply tf.nn.conv2d. strides [1, 1, 1, 1], padding is 'SAME'\n    \n    # TO DO\n\n    # apply relu on the sum of convolution output and biases\n    \n    # TO DO \n\n    # output is of dimension BATCH_SIZE x 28 x 28 x 32\n\nwith tf.variable_scope('pool1') as scope:\n    # apply max pool with ksize [1, 2, 2, 1], and strides [1, 2, 2, 1], padding 'SAME'\n    \n    # TO DO\n\n    # output is of dimension BATCH_SIZE x 14 x 14 x 32\n\nwith tf.variable_scope('conv2') as scope:\n    # similar to conv1, except kernel now is of the size 5 x 5 x 32 x 64\n    kernel = tf.get_variable('kernels', [5, 5, 32, 64], \n                        initializer=tf.truncated_normal_initializer())\n    biases = tf.get_variable('biases', [64],\n                        initializer=tf.random_normal_initializer())\n    conv = tf.nn.conv2d(pool1, kernel, strides=[1, 1, 1, 1], padding='SAME')\n    conv2 = tf.nn.relu(conv + biases, name=scope.name)\n\n    # output is of dimension BATCH_SIZE x 14 x 14 x 64\n\nwith tf.variable_scope('pool2') as scope:\n    # similar to pool1\n    pool2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],\n                            padding='SAME')\n\n    # output is of dimension BATCH_SIZE x 7 x 7 x 64\n\nwith tf.variable_scope('fc') as scope:\n    # use weight of dimension 7 * 7 * 64 x 1024\n    input_features = 7 * 7 * 64\n    \n    # create weights and biases\n\n    # TO DO\n\n    # reshape pool2 to 2 dimensional\n    pool2 = tf.reshape(pool2, [-1, input_features])\n\n    # apply relu on matmul of pool2 and w + b\n    fc = tf.nn.relu(tf.matmul(pool2, w) + b, name='relu')\n    \n    # TO DO\n\n    # apply dropout\n    fc = tf.nn.dropout(fc, dropout, name='relu_dropout')\n\nwith tf.variable_scope('softmax_linear') as 
scope:\n    # this you should know. get logits without softmax\n    # you need to create weights and biases\n\n    # TO DO\n\n# Step 6: define loss function\n# use softmax cross entropy with logits as the loss function\n# compute mean cross entropy, softmax is applied internally\nwith tf.name_scope('loss'):\n    # you should know how to do this too\n    \n    # TO DO\n\n# Step 7: define training op\n# using gradient descent with learning rate of LEARNING_RATE to minimize cost\n# don't forget to pass in global_step\n\n# TO DO\n\nwith tf.Session() as sess:\n    sess.run(tf.global_variables_initializer())\n    saver = tf.train.Saver()\n    # to visualize using TensorBoard\n    writer = tf.summary.FileWriter('./my_graph/mnist', sess.graph)\n    ##### You have to create folders to store checkpoints\n    ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))\n    # if that checkpoint exists, restore from checkpoint\n    if ckpt and ckpt.model_checkpoint_path:\n        saver.restore(sess, ckpt.model_checkpoint_path)\n    \n    initial_step = global_step.eval()\n\n    start_time = time.time()\n    n_batches = int(mnist.train.num_examples / BATCH_SIZE)\n\n    total_loss = 0.0\n    for index in range(initial_step, n_batches * N_EPOCHS): # train the model n_epochs times\n        X_batch, Y_batch = mnist.train.next_batch(BATCH_SIZE)\n        _, loss_batch = sess.run([optimizer, loss], \n                                feed_dict={X: X_batch, Y:Y_batch, dropout: DROPOUT}) \n        total_loss += loss_batch\n        if (index + 1) % SKIP_STEP == 0:\n            print('Average loss at step {}: {:5.1f}'.format(index + 1, total_loss / SKIP_STEP))\n            total_loss = 0.0\n            saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', index)\n    \n    print(\"Optimization Finished!\") # should be around 0.35 after 25 epochs\n    print(\"Total time: {0} seconds\".format(time.time() - start_time))\n    \n    # test the model\n    
n_batches = int(mnist.test.num_examples/BATCH_SIZE)\n    total_correct_preds = 0\n    for i in range(n_batches):\n        X_batch, Y_batch = mnist.test.next_batch(BATCH_SIZE)\n        # do not run the optimizer here (that would train on the test set),\n        # and disable dropout at test time by keeping every unit\n        loss_batch, logits_batch = sess.run([loss, logits], \n                                        feed_dict={X: X_batch, Y:Y_batch, dropout: 1.0}) \n        preds = tf.nn.softmax(logits_batch)\n        correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y_batch, 1))\n        accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n        total_correct_preds += sess.run(accuracy)   \n    \n    print(\"Accuracy {0}\".format(total_correct_preds/mnist.test.num_examples))"
  },
  {
    "path": "2017/examples/09_queue_example.py",
    "content": "\"\"\" Example to demonstrate how to use queues\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\n\nN_SAMPLES = 1000\nNUM_THREADS = 4\n# Generating some simple data\n# create 1000 random samples, each a 1D array of 4 values drawn from\n# a normal distribution with mean 1 and standard deviation 10\ndata = 10 * np.random.randn(N_SAMPLES, 4) + 1 \n# create 1000 random labels of 0 and 1\ntarget = np.random.randint(0, 2, size=N_SAMPLES) \n\nqueue = tf.FIFOQueue(capacity=50, dtypes=[tf.float32, tf.int32], shapes=[[4], []])\n\nenqueue_op = queue.enqueue_many([data, target])\ndata_sample, label_sample = queue.dequeue()\n\n# create ops that do something with data_sample and label_sample\n\n# create NUM_THREADS to do enqueue\nqr = tf.train.QueueRunner(queue, [enqueue_op] * NUM_THREADS)\nwith tf.Session() as sess:\n\t# create a coordinator, launch the queue runner threads.\n\tcoord = tf.train.Coordinator()\n\tenqueue_threads = qr.create_threads(sess, coord=coord, start=True)\n\ttry:\n\t\tfor step in range(100): # do 100 iterations\n\t\t\tif coord.should_stop():\n\t\t\t\tbreak\n\t\t\tdata_batch, label_batch = sess.run([data_sample, label_sample])\n\t\t\tprint(data_batch)\n\t\t\tprint(label_batch)\n\texcept Exception as e:\n\t\tcoord.request_stop(e)\n\tfinally:\n\t\tcoord.request_stop()\n\t\tcoord.join(enqueue_threads)"
  },
  {
    "path": "2017/examples/09_tfrecord_example.py",
    "content": "\"\"\" Examples to demonstrate how to write an image file to a TFRecord,\nand how to read a TFRecord file using TFRecordReader.\nAuthor: Chip Huyen\nPrepared for the class CS 20SI: \"TensorFlow for Deep Learning Research\"\ncs20si.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport sys\nsys.path.append('..')\n\nfrom PIL import Image\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\n# image supposed to have shape: 480 x 640 x 3 = 921600\nIMAGE_PATH = 'data/'\n\ndef _int64_feature(value):\n  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))\n\ndef _bytes_feature(value):\n  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))\n\ndef get_image_binary(filename):\n    \"\"\" You can read in the image using tensorflow too, but it's a drag\n        since you have to create graphs. It's much easier using Pillow and NumPy\n    \"\"\"\n    image = Image.open(filename)\n    image = np.asarray(image, np.uint8)\n    shape = np.array(image.shape, np.int32)\n    return shape.tobytes(), image.tobytes() # convert image to raw data bytes in the array.\n\ndef write_to_tfrecord(label, shape, binary_image, tfrecord_file):\n    \"\"\" This example is to write a sample to TFRecord file. 
If you want to write\n    more samples, just use a loop.\n    \"\"\"\n    writer = tf.python_io.TFRecordWriter(tfrecord_file)\n    # write label, shape, and image content to the TFRecord file\n    example = tf.train.Example(features=tf.train.Features(feature={\n                'label': _int64_feature(label),\n                'shape': _bytes_feature(shape),\n                'image': _bytes_feature(binary_image)\n                }))\n    writer.write(example.SerializeToString())\n    writer.close()\n\ndef write_tfrecord(label, image_file, tfrecord_file):\n    shape, binary_image = get_image_binary(image_file)\n    write_to_tfrecord(label, shape, binary_image, tfrecord_file)\n\ndef read_from_tfrecord(filenames):\n    tfrecord_file_queue = tf.train.string_input_producer(filenames, name='queue')\n    reader = tf.TFRecordReader()\n    _, tfrecord_serialized = reader.read(tfrecord_file_queue)\n\n    # label and image are stored as bytes but could be stored as \n    # int64 or float64 values in a serialized tf.Example protobuf.\n    tfrecord_features = tf.parse_single_example(tfrecord_serialized,\n                        features={\n                            'label': tf.FixedLenFeature([], tf.int64),\n                            'shape': tf.FixedLenFeature([], tf.string),\n                            'image': tf.FixedLenFeature([], tf.string),\n                        }, name='features')\n    # image was saved as uint8, so we have to decode as uint8.\n    image = tf.decode_raw(tfrecord_features['image'], tf.uint8)\n    shape = tf.decode_raw(tfrecord_features['shape'], tf.int32)\n    # the image tensor is flattened out, so we have to reconstruct the shape\n    image = tf.reshape(image, shape)\n    label = tfrecord_features['label']\n    return label, shape, image\n\ndef read_tfrecord(tfrecord_file):\n    label, shape, image = read_from_tfrecord([tfrecord_file])\n\n    with tf.Session() as sess:\n        coord = tf.train.Coordinator()\n        threads = 
tf.train.start_queue_runners(coord=coord)\n        label, image, shape = sess.run([label, image, shape])\n        coord.request_stop()\n        coord.join(threads)\n    print(label)\n    print(shape)\n    plt.imshow(image)\n    plt.show() \n\ndef main():\n    # assume the image has the label Chihuahua, which corresponds to class number 1\n    label = 1 \n    image_file = IMAGE_PATH + 'friday.jpg'\n    tfrecord_file = IMAGE_PATH + 'friday.tfrecord'\n    write_tfrecord(label, image_file, tfrecord_file)\n    read_tfrecord(tfrecord_file)\n\nif __name__ == '__main__':\n    main()\n\n"
  },
  {
    "path": "2017/examples/11_char_rnn_gist.py",
    "content": "\"\"\" A clean, no-frills character-level generative language model.\nCreated by Danijar Hafner (danijar.com), edited by Chip Huyen\nfor the class CS 20SI: \"TensorFlow for Deep Learning Research\"\n\nBased on Andrej Karpathy's blog: \nhttp://karpathy.github.io/2015/05/21/rnn-effectiveness/\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport sys\nsys.path.append('..')\n\nimport time\n\nimport tensorflow as tf\n\nimport utils\n\nDATA_PATH = 'data/arvix_abstracts.txt'\nHIDDEN_SIZE = 200\nBATCH_SIZE = 64\nNUM_STEPS = 50\nSKIP_STEP = 40\nTEMPRATURE = 0.7\nLR = 0.003\nLEN_GENERATED = 300\n\ndef vocab_encode(text, vocab):\n    return [vocab.index(x) + 1 for x in text if x in vocab]\n\ndef vocab_decode(array, vocab):\n    return ''.join([vocab[x - 1] for x in array])\n\ndef read_data(filename, vocab, window=NUM_STEPS, overlap=NUM_STEPS//2):\n    for text in open(filename):\n        text = vocab_encode(text, vocab)\n        for start in range(0, len(text) - window, overlap):\n            chunk = text[start: start + window]\n            chunk += [0] * (window - len(chunk))\n            yield chunk\n\ndef read_batch(stream, batch_size=BATCH_SIZE):\n    batch = []\n    for element in stream:\n        batch.append(element)\n        if len(batch) == batch_size:\n            yield batch\n            batch = []\n    # yield the last, partially filled batch, but never an empty one\n    if batch:\n        yield batch\n\ndef create_rnn(seq, hidden_size=HIDDEN_SIZE):\n    cell = tf.contrib.rnn.GRUCell(hidden_size)\n    in_state = tf.placeholder_with_default(\n            cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size])\n    # this line to calculate the real length of seq\n    # all seq are padded to be of the same length which is NUM_STEPS\n    length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)\n    output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state)\n    return output, in_state, out_state\n\ndef create_model(seq, temp, vocab, hidden=HIDDEN_SIZE):\n    seq = tf.one_hot(seq, len(vocab))\n    output, 
in_state, out_state = create_rnn(seq, hidden)\n    # fully_connected is syntactic sugar for tf.matmul(w, output) + b\n    # it will create w and b for us\n    logits = tf.contrib.layers.fully_connected(output, len(vocab), None)\n    loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits[:, :-1], labels=seq[:, 1:]))\n    # sample the next character from Maxwell-Boltzmann Distribution with temperature temp\n    # it works equally well without tf.exp\n    sample = tf.multinomial(tf.exp(logits[:, -1] / temp), 1)[:, 0] \n    return loss, sample, in_state, out_state\n\ndef training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state):\n    saver = tf.train.Saver()\n    start = time.time()\n    with tf.Session() as sess:\n        writer = tf.summary.FileWriter('graphs/gist', sess.graph)\n        sess.run(tf.global_variables_initializer())\n        \n        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/arvix/checkpoint'))\n        if ckpt and ckpt.model_checkpoint_path:\n            saver.restore(sess, ckpt.model_checkpoint_path)\n        \n        iteration = global_step.eval()\n        for batch in read_batch(read_data(DATA_PATH, vocab)):\n            batch_loss, _ = sess.run([loss, optimizer], {seq: batch})\n            if (iteration + 1) % SKIP_STEP == 0:\n                print('Iter {}. \\n    Loss {}. 
Time {}'.format(iteration, batch_loss, time.time() - start))\n                online_inference(sess, vocab, seq, sample, temp, in_state, out_state)\n                start = time.time()\n                saver.save(sess, 'checkpoints/arvix/char-rnn', iteration)\n            iteration += 1\n\ndef online_inference(sess, vocab, seq, sample, temp, in_state, out_state, seed='T'):\n    \"\"\" Generate sequence one character at a time, based on the previous character\n    \"\"\"\n    sentence = seed\n    state = None\n    for _ in range(LEN_GENERATED):\n        batch = [vocab_encode(sentence[-1], vocab)]\n        feed = {seq: batch, temp: TEMPRATURE}\n        # for the first decoder step, the state is None\n        if state is not None:\n            feed.update({in_state: state})\n        index, state = sess.run([sample, out_state], feed)\n        sentence += vocab_decode(index, vocab)\n    print(sentence)\n\ndef main():\n    vocab = (\n            \" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n            \"\\\\^_abcdefghijklmnopqrstuvwxyz{|}\")\n    seq = tf.placeholder(tf.int32, [None, None])\n    temp = tf.placeholder(tf.float32)\n    loss, sample, in_state, out_state = create_model(seq, temp, vocab)\n    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n    optimizer = tf.train.AdamOptimizer(LR).minimize(loss, global_step=global_step)\n    utils.make_dir('checkpoints')\n    utils.make_dir('checkpoints/arvix')\n    training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)\n    \nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "2017/examples/autoencoder/autoencoder.py",
    "content": "import tensorflow as tf\n\nfrom layers import *\n\ndef encoder(input):\n    # Create a conv network with 3 conv layers and 1 FC layer\n    # Conv 1: filter: [3, 3, 1], stride: [2, 2], relu\n    \n    # Conv 2: filter: [3, 3, 8], stride: [2, 2], relu\n    \n    # Conv 3: filter: [3, 3, 8], stride: [2, 2], relu\n    \n    # FC: output_dim: 100, no non-linearity\n    raise NotImplementedError\n\ndef decoder(input):\n    # Create a deconv network with 1 FC layer and 3 deconv layers\n    # FC: output dim: 128, relu\n    \n    # Reshape to [batch_size, 4, 4, 8]\n    \n    # Deconv 1: filter: [3, 3, 8], stride: [2, 2], relu\n    \n    # Deconv 2: filter: [8, 8, 1], stride: [2, 2], padding: valid, relu\n    \n    # Deconv 3: filter: [7, 7, 1], stride: [1, 1], padding: valid, sigmoid\n    raise NotImplementedError\n\ndef autoencoder(input_shape):\n    # Define place holder with input shape\n\n    # Define variable scope for autoencoder\n    with tf.variable_scope('autoencoder') as scope:\n        # Pass input to encoder to obtain encoding\n        \n        # Pass encoding into decoder to obtain reconstructed image\n        \n        # Return input image (placeholder) and reconstructed image\n        pass\n"
  },
  {
    "path": "2017/examples/autoencoder/layer_utils.py",
    "content": "import tensorflow as tf\n\ndef get_deconv2d_output_dims(input_dims, filter_dims, stride_dims, padding):\n    # Returns the height and width of the output of a deconvolution layer.\n    batch_size, input_h, input_w, num_channels_in = input_dims\n    filter_h, filter_w, num_channels_out  = filter_dims\n    stride_h, stride_w = stride_dims\n\n    # Compute the height in the output, based on the padding.\n    if padding == 'SAME':\n      out_h = input_h * stride_h\n    elif padding == 'VALID':\n      out_h = (input_h - 1) * stride_h + filter_h\n\n    # Compute the width in the output, based on the padding.\n    if padding == 'SAME':\n      out_w = input_w * stride_w\n    elif padding == 'VALID':\n      out_w = (input_w - 1) * stride_w + filter_w\n\n    return [batch_size, out_h, out_w, num_channels_out]\n"
  },
  {
    "path": "2017/examples/autoencoder/layers.py",
    "content": "import tensorflow as tf\n\nfrom layer_utils import get_deconv2d_output_dims\n\ndef conv(input, name, filter_dims, stride_dims, padding='SAME',\n         non_linear_fn=tf.nn.relu):\n    input_dims = input.get_shape().as_list()\n    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in\n    assert(len(filter_dims) == 3) # height, width and num_channels out\n    assert(len(stride_dims) == 2) # stride height and width\n\n    num_channels_in = input_dims[-1]\n    filter_h, filter_w, num_channels_out = filter_dims\n    stride_h, stride_w = stride_dims\n\n    # Define a variable scope for the conv layer\n    with tf.variable_scope(name) as scope:\n        # Create filter weight variable\n        \n        # Create bias variable\n        \n        # Define the convolution flow graph\n        \n        # Add bias to conv output\n        \n        # Apply non-linearity (if asked) and return output\n        pass\n\ndef deconv(input, name, filter_dims, stride_dims, padding='SAME',\n           non_linear_fn=tf.nn.relu):\n    input_dims = input.get_shape().as_list()\n    assert(len(input_dims) == 4) # batch_size, height, width, num_channels_in\n    assert(len(filter_dims) == 3) # height, width and num_channels out\n    assert(len(stride_dims) == 2) # stride height and width\n\n    num_channels_in = input_dims[-1]\n    filter_h, filter_w, num_channels_out = filter_dims\n    stride_h, stride_w = stride_dims\n    # Let's step into this function\n    output_dims = get_deconv2d_output_dims(input_dims,\n                                           filter_dims,\n                                           stride_dims,\n                                           padding)\n\n    # Define a variable scope for the deconv layer\n    with tf.variable_scope(name) as scope:\n        # Create filter weight variable\n        # Note that num_channels_out and in positions are flipped for deconv.\n        \n        # Create bias variable\n        \n        # Define 
the deconv flow graph\n        \n        # Add bias to deconv output\n        \n        # Apply non-linearity (if asked) and return output\n        pass\n\ndef max_pool(input, name, filter_dims, stride_dims, padding='SAME'):\n    assert(len(filter_dims) == 2) # filter height and width\n    assert(len(stride_dims) == 2) # stride height and width\n\n    filter_h, filter_w = filter_dims\n    stride_h, stride_w = stride_dims\n    \n    # Define the max pool flow graph and return output\n    pass\n\ndef fc(input, name, out_dim, non_linear_fn=tf.nn.relu):\n    assert(type(out_dim) == int)\n\n    # Define a variable scope for the FC layer\n    with tf.variable_scope(name) as scope:\n        input_dims = input.get_shape().as_list()\n        # the input to the fc layer should be flattened\n        if len(input_dims) == 4:\n            # for eg. the output of a conv layer\n            batch_size, input_h, input_w, num_channels = input_dims\n            # ignore the batch dimension\n            in_dim = input_h * input_w * num_channels\n            flat_input = tf.reshape(input, [batch_size, in_dim])\n        else:\n            in_dim = input_dims[-1]\n            flat_input = input\n\n        # Create weight variable\n        \n        # Create bias variable\n        \n        # Define FC flow graph\n        \n        # Apply non-linearity (if asked) and return output\n        pass\n"
  },
  {
    "path": "2017/examples/autoencoder/train.py",
    "content": "import tensorflow as tf\n\nfrom utils import *\nfrom autoencoder import *\n\nbatch_size = 100\nbatch_shape = (batch_size, 28, 28, 1)\nnum_visualize = 10\n\nlr = 0.01\nnum_epochs = 50\n\ndef calculate_loss(original, reconstructed):\n    return tf.div(tf.reduce_sum(tf.square(tf.sub(reconstructed,\n                                                 original))), \n                  tf.constant(float(batch_size)))\n\ndef train(dataset):\n    input_image, reconstructed_image = autoencoder(batch_shape)\n    loss = calculate_loss(input_image, reconstructed_image)\n    optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)\n\n    init = tf.global_variables_initializer()\n    with tf.Session() as session:\n        session.run(init)\n\n        dataset_size = len(dataset.train.images)\n        print \"Dataset size:\", dataset_size\n        num_iters = (num_epochs * dataset_size)/batch_size\n        print \"Num iters:\", num_iters\n        for step in xrange(num_iters):\n            input_batch  = get_next_batch(dataset.train, batch_size)\n            loss_val,  _ = session.run([loss, optimizer], \n                                       feed_dict={input_image: input_batch})\n            if step % 1000 == 0:\n                print \"Loss at step\", step, \":\", loss_val\n\n        test_batch = get_next_batch(dataset.test, batch_size)\n        reconstruction = session.run(reconstructed_image,\n                                     feed_dict={input_image: test_batch})\n        visualize(test_batch, reconstruction, num_visualize)\n\nif __name__ == '__main__':\n    dataset = load_dataset()\n    train(dataset)\n    \n"
  },
  {
    "path": "2017/examples/autoencoder/utils.py",
    "content": "import os\nimport sys\nimport tensorflow\nimport numpy as np\n\nimport matplotlib\nmatplotlib.use('TKAgg')\nfrom matplotlib import pyplot as plt\n\nfrom tensorflow.examples.tutorials.mnist import input_data\n\nmnist_image_shape = [28, 28, 1]\n\ndef load_dataset():\n    return input_data.read_data_sets('MNIST_data')\n\ndef get_next_batch(dataset, batch_size):\n    # dataset should be mnist.(train/val/test)\n    batch, _ = dataset.next_batch(batch_size)\n    batch_shape = [batch_size] + mnist_image_shape\n    return np.reshape(batch, batch_shape)\n\ndef visualize(_original, _reconstructions, num_visualize):\n    vis_folder = './vis/'\n    if not os.path.exists(vis_folder):\n          os.makedirs(vis_folder)\n\n    original = _original[:num_visualize]\n    reconstructions = _reconstructions[:num_visualize]\n    \n    count = 1\n    for (orig, rec) in zip(original, reconstructions):\n        orig = np.reshape(orig, (mnist_image_shape[0],\n                                 mnist_image_shape[1]))\n        rec = np.reshape(rec, (mnist_image_shape[0],\n                               mnist_image_shape[1]))\n        f, ax = plt.subplots(1,2)\n        ax[0].imshow(orig, cmap='gray')\n        ax[1].imshow(rec, cmap='gray')\n        plt.savefig(vis_folder + \"test_%d.png\" % count)\n        count += 1\n"
  },
  {
    "path": "2017/examples/cgru/README.md",
    "content": "This is the files used to explain convolutional GRU (CGRU) by Lukasz Kaiser at Google Brain. The accompanied slides can be found at http://web.stanford.edu/class/cs20si/lectures/slides_12.pdf\n"
  },
  {
    "path": "2017/examples/cgru/custom_getter.py",
    "content": "# From [github]/tensorflow/python/kernel_tests/variable_scope_test.py\n  def testGetterThatCreatesTwoVariablesAndSumsThem(self):\n\n    def custom_getter(getter, name, *args, **kwargs):\n      g_0 = getter(\"%s/0\" % name, *args, **kwargs)\n      g_1 = getter(\"%s/1\" % name, *args, **kwargs)\n      with tf.name_scope(\"custom_getter\"):\n        return g_0 + g_1  # or g_0 * const / ||g_0|| or anything you want\n\n    with variable_scope.variable_scope(\"scope\", custom_getter=custom_getter):\n      v = variable_scope.get_variable(\"v\", [1, 2, 3])\n      # Or a full model if you wish. OO layers are ok.\n\n    self.assertEqual([1, 2, 3], v.get_shape())\n    true_vars = variables_lib.trainable_variables()\n    self.assertEqual(2, len(true_vars))\n    self.assertEqual(\"scope/v/0:0\", true_vars[0].name)\n    self.assertEqual(\"scope/v/1:0\", true_vars[1].name)\n    self.assertEqual(\"custom_getter/add:0\", v.name)\n    with self.test_session() as sess:\n      variables_lib.global_variables_initializer().run()\n      np_vars, np_v = sess.run([true_vars, v])\n      self.assertAllClose(np_v, sum(np_vars))\n"
  },
  {
    "path": "2017/examples/cgru/data_reader.py",
    "content": "def examples_queue(data_sources, data_fields_to_features, training,\n                   data_items_to_decoders=None, data_items_to_decode=None):\n  \"\"\"Contruct a queue of training or evaluation examples.\n\n  This function will create a reader from files given by data_sources,\n  then enqueue the tf.Examples from these files, shuffling if training\n  is true, and finally parse these tf.Examples to tensors.\n\n  The dictionary data_fields_to_features for an image dataset can be this:\n\n  data_fields_to_features = {\n    'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),\n    'image/format': tf.FixedLenFeature((), tf.string, default_value='raw'),\n    'image/class/label': tf.FixedLenFeature(\n        [1], tf.int64, default_value=tf.zeros([1], dtype=tf.int64)),\n  }\n\n  and for a simple algorithmic dataset with variable-length data it is this:\n\n  data_fields_to_features = {\n    'inputs': tf.VarLenFeature(tf.int64),\n    'targets': tf.VarLenFeature(tf.int64),\n  }\n\n  The data_items_to_decoders dictionary argument can be left as None if there\n  is no decoding to be performed. But, e.g. 
for images, it should be set so that\n  the images are decoded from the features, e.g., like this for MNIST:\n\n  data_items_to_decoders = {\n    'image': tfexample_decoder.Image(\n      image_key = 'image/encoded',\n      format_key = 'image/format',\n      shape=[28, 28],\n      channels=1),\n    'label': tfexample_decoder.Tensor('image/class/label'),\n  }\n\n  These arguments are compatible with the use of tf.contrib.slim.data module,\n  see there for more documentation.\n\n  Args:\n    data_sources: a list or tuple of sources from which the data will be read,\n      for example [/path/to/train@128, /path/to/train2*, /tmp/.../train3*]\n    data_fields_to_features: a dictionary from data fields in the data sources\n      to features, such as tf.VarLenFeature(tf.int64), see above for examples.\n    training: a Boolean, whether to read for training or evaluation.\n    data_items_to_decoders: a dictionary mapping data items (that will be\n      in the returned result) to decoders that will decode them using features\n      defined in data_fields_to_features; see above for examples. 
By default\n      (if this is None), we grab the tensor from every feature.\n    data_items_to_decode: a subset of data items that will be decoded;\n      by default (if this is None), we decode all items.\n\n  Returns:\n    A dictionary mapping each data_field to a corresponding 1D int64 tensor\n    read from the created queue.\n\n  Raises:\n    ValueError: if no files are found with the provided data_prefix or no data\n      fields were provided.\n  \"\"\"\n  with tf.name_scope(\"examples_queue\"):\n    # Read serialized examples using slim parallel_reader.\n    _, example_serialized = tf.contrib.slim.parallel_reader.parallel_read(\n        data_sources, tf.TFRecordReader, shuffle=training,\n        num_readers=4 if training else 1)\n\n    if data_items_to_decoders is None:\n      data_items_to_decoders = {\n          field: tf.contrib.slim.tfexample_decoder.Tensor(field)\n          for field in data_fields_to_features\n      }\n\n    decoder = tf.contrib.slim.tfexample_decoder.TFExampleDecoder(\n        data_fields_to_features, data_items_to_decoders)\n\n    if data_items_to_decode is None:\n      data_items_to_decode = data_items_to_decoders.keys()\n\n    decoded = decoder.decode(example_serialized, items=data_items_to_decode)\n    return {field: tensor\n            for (field, tensor) in zip(data_items_to_decode, decoded)}\n\n\ndef batch_examples(examples, batch_size, bucket_boundaries=None):\n  \"\"\"Given a queue of examples, create batches of examples with similar lengths.\n\n  We assume that examples is a dictionary with string keys and tensor values,\n  possibly coming from a queue, e.g., constructed by examples_queue above.\n  Each tensor in examples is assumed to be 1D. We will put tensors of similar\n  length into batches together. We return a dictionary with the same keys as\n  examples, and with values being batches of size batch_size. If elements have\n  different lengths, they are padded with 0s. 
This function is based on\n  tf.contrib.training.bucket_by_sequence_length so see there for details.\n\n  For example, if examples is a queue containing [1, 2, 3] and [4], then\n  this function with batch_size=2 will return a batch [[1, 2, 3], [4, 0, 0]].\n\n  Args:\n    examples: a dictionary with string keys and 1D tensor values.\n    batch_size: a python integer or a scalar int32 tensor.\n    bucket_boundaries: a list of integers for the boundaries that will be\n      used for bucketing; see tf.contrib.training.bucket_by_sequence_length\n      for more details; if None, we create a default set of buckets.\n\n  Returns:\n    A dictionary with the same keys as examples and with values being batches\n    of examples padded with 0s, i.e., [batch_size x length] tensors.\n  \"\"\"\n  # Create default buckets if none were provided.\n  if bucket_boundaries is None:\n    # Small buckets -- go in steps of 8 until 64.\n    small_buckets = [8 * (i + 1) for i in xrange(8)]\n    # Medium buckets -- go in steps of 32 until 256.\n    medium_buckets = [32 * (i + 3) for i in xrange(6)]\n    # Large buckets -- go in steps of 128 until maximum of 1024.\n    large_buckets = [128 * (i + 3) for i in xrange(6)]\n    # By default use the above 20 bucket boundaries (21 queues in total).\n    bucket_boundaries = small_buckets + medium_buckets + large_buckets\n  with tf.name_scope(\"batch_examples\"):\n    # The queue to bucket on will be chosen based on maximum length.\n    max_length = 0\n    for v in examples.values():  # We assume 0-th dimension is the length.\n      max_length = tf.maximum(max_length, tf.shape(v)[0])\n    (_, outputs) = tf.contrib.training.bucket_by_sequence_length(\n        max_length, examples, batch_size, bucket_boundaries,\n        capacity=2 * batch_size, dynamic_pad=True)\n    return outputs\n"
  },
  {
    "path": "2017/examples/cgru/my_layers.py",
    "content": "def saturating_sigmoid(x):\n  \"\"\"Saturating sigmoid: 1.2 * sigmoid(x) - 0.1 cut to [0, 1].\"\"\"\n  with tf.name_scope(\"saturating_sigmoid\", [x]):\n    y = tf.sigmoid(x)\n    return tf.minimum(1.0, tf.maximum(0.0, 1.2 * y - 0.1))\n\n\ndef embedding(x, vocab_size, dense_size, name=None, reuse=None):\n  \"\"\"Embed x of type int64 into dense vectors, reducing to max 4 dimensions.\"\"\"\n  with tf.variable_scope(name, default_name=\"embedding\",\n                         values=[x], reuse=reuse):\n    embedding_var = tf.get_variable(\"kernel\", [vocab_size, dense_size])\n    return tf.gather(embedding_var, x)\n\n\ndef conv_gru(x, kernel_size, filters, padding=\"same\", dilation_rate=1,\n             name=None, reuse=None):\n  \"\"\"Convolutional GRU in 1 dimension.\"\"\"\n  # Let's make a shorthand for conv call first.\n  def do_conv(args, name, bias_start, padding):\n    return tf.layers.conv1d(args, filters, kernel_size,\n                padding=padding, dilation_rate=dilation_rate,\n                bias_initializer=tf.constant_initializer(bias_start), name=name)\n  # Here comes the GRU gate.\n  with tf.variable_scope(name, default_name=\"conv_gru\",\n                         values=[x], reuse=reuse):\n    reset = saturating_sigmoid(do_conv(x, \"reset\", 1.0, padding))\n    gate = saturating_sigmoid(do_conv(x, \"gate\", 1.0, padding))\n    candidate = tf.tanh(do_conv(reset * x, \"candidate\", 0.0, padding))\n    return gate * x + (1 - gate) * candidate\n"
  },
  {
    "path": "2017/examples/cgru/neural_gpu_v3.py",
    "content": "def neural_gpu(features, hparams, name=None):\n  \"\"\"The core Neural GPU.\"\"\"\n  with tf.variable_scope(name, \"neural_gpu\"):\n    inputs = features[\"inputs\"]\n    emb_inputs = common_layers.embedding(\n        inputs, hparams.vocab_size, hparams.hidden_size)\n\n    def step(state, inp):\n      x = tf.nn.dropout(state, 1.0 - hparams.dropout)\n      for layer in xrange(hparams.num_hidden_layers):\n        x = common_layers.conv_gru(\n            x, hparams.kernel_size, hparams.hidden_size, name=\"cgru_%d\" % layer)\n      return tf.where(inp == 0, state, x)  # No-op where inp is just padding=0.\n\n    final = tf.foldl(step, tf.transpose(inputs, [1, 0]),\n                     initializer=emb_inputs,\n                     parallel_iterations=1, swap_memory=True)\n    return common_layers.conv(final, hparams.vocab_size, 3, padding=\"same\")\n\n\ndef mixed_curriculum(inputs, hparams):\n  \"\"\"Mixed curriculum: skip short sequences, but only with some probability.\"\"\"\n  with tf.name_scope(\"mixed_curriculum\"):\n    inputs_length = tf.to_float(tf.shape(inputs)[1])\n    used_length = tf.cond(tf.less(tf.random_uniform([]),\n                                  hparams.curriculum_mixing_probability),\n                          lambda: tf.constant(0.0),\n                          lambda: inputs_length)\n    step = tf.to_float(tf.contrib.framework.get_global_step())\n    relative_step = step / hparams.curriculum_lengths_per_step\n    return used_length - hparams.curriculum_min_length > relative_step\n\n\ndef neural_gpu_curriculum(features, hparams, mode):\n  \"\"\"The Neural GPU model with curriculum.\"\"\"\n  with tf.name_scope(\"neural_gpu_with_curriculum\"):\n    inputs = features[\"inputs\"]\n    is_training = mode == tf.contrib.learn.ModeKeys.TRAIN\n    should_skip = tf.logical_and(is_training, mixed_curriculum(inputs, hparams))\n    final_shape = tf.concat([tf.shape(inputs),\n                             tf.constant([hparams.vocab_size])], 
axis=0)\n    outputs = tf.cond(should_skip,\n                      lambda: tf.zeros(final_shape),\n                      lambda: neural_gpu(features, hparams))\n    return outputs, should_skip\n\n\ndef basic_params1():\n  \"\"\"A set of basic hyperparameters.\"\"\"\n  return tf.HParams(batch_size=32,\n                    num_hidden_layers=4,\n                    kernel_size=3,\n                    hidden_size=64,\n                    vocab_size=256,\n                    dropout=0.2,\n                    clip_grad_norm=2.0,\n                    initializer=\"orthogonal\",\n                    initializer_gain=1.5,\n                    label_smoothing=0.1,\n                    optimizer=\"Adam\",\n                    optimizer_adam_epsilon=1e-4,\n                    optimizer_momentum_momentum=0.9,\n                    max_train_length=512,\n                    learning_rate_decay_scheme=\"none\",\n                    learning_rate_warmup_steps=100,\n                    learning_rate=0.1)\n\n\ndef curriculum_params1():\n  \"\"\"Set of hyperparameters with curriculum settings.\"\"\"\n  hparams = common_hparams.basic_params1()\n  hparams.add_hparam(\"curriculum_mixing_probability\", 0.1)\n  hparams.add_hparam(\"curriculum_lengths_per_step\", 1000.0)\n  hparams.add_hparam(\"curriculum_min_length\", 10)\n  return hparams\n"
  },
  {
    "path": "2017/examples/data/arvix_abstracts.txt",
    "content": "In science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. 
In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. 
We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. 
Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second or- der optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. 
These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. 
Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. 
ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. 
We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. 
We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. 
Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. 
We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. 
We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. 
Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. 
By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. 
Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. 
We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. 
We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. 
While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. 
Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. 
Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. 
We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. 
We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. 
We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. 
In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. 
We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network with a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism and low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe that the rounding scheme plays a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed-point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth- and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training, leading to near-linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network with a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit on a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain those for full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sums of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via two strategies for combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout, that is, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It also depends, however, on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction that each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient for unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each individual training point while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the multilayer bootstrap network with a neural network, trained in a pseudo-supervised way, for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in an deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure,e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of each clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found a strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second-order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves unknown object position, orientation, and scale, while speech recognition involves unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework also enables the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as the learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate a stacked autoencoder using an NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of \"shadow\" groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the \"simplest\". 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and the integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that use multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPUs). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing full exploitation of the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
This explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation of which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a WER of 3.93 on the Wall Street Journal (WSJ) eval92 task, compared to a baseline WER of 4.54, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that compressive MBN combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNNs on supervised learning, yielding an unsupervised learner that is both effective and efficient.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied to hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art results on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision, or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists of binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisted of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that is impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and the integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique combines the effectiveness of MBN on unsupervised learning with the effectiveness and efficiency of DNN on supervised learning, making compressive MBN both effective and efficient on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but is also thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today like, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and fully exploiting the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to efficiently select only the nodes with the highest activation. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm requires only 5% of the computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution the TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task, or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNNs) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach, which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not depend on the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop, however, encounters difficulties in the context of deep neural networks, as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special dropout technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems, demonstrating improved learning speed and accuracy.\nReal-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with the corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for the descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline of 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, to deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in a significant reduction in the total storage size, providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale, while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning and what is learnable by it, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, the multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two-dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper: (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental results support our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving the label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variable models. Learning combines denoising autoencoder and denoising source separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, and phonetic labeling of acoustic frames. However, in contrast to deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimized a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that, in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from massive numbers of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaption algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments use the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to drastically reduce the amount of computation needed to train and test deep networks. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian-free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character-level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing it to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy, respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network algorithm for domain adaptation has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBNs, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$.   We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution as a TC-DNN-BLSTM-DNN model, the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. For example, dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy-to-evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example, natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, integration and analysis of multimodal sensory data flows from massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping from the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or a speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model; the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU)\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations.   We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some success stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER, a relative improvement of more than 13%.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, e.g. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that respectively estimate the value function, estimate its gradient, and determine the actor's policy. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden-layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that, surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for a descriptor learning task in the context of a person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1.\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo-supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent, which converges in $O(1/\\varepsilon^2)$ iterations for objectives that are a sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are a sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain why learning $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a previously trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of Socher et al. (2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments. The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees). Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce the effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state of the art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding/vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols selected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and allowing the locality and the parallelism of the training algorithm to be fully exploited. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. The performance of the algorithm is then evaluated experimentally. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor. On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above-mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction in which each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of the multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).\nIn science and engineering, intelligent processing of complex signals such as images, sound or language is often performed by a parameterized hierarchy of nonlinear processing layers, sometimes biologically inspired. Hierarchical systems (or, more generally, nested systems) offer a way to generate complex mappings using simple stages. Each layer performs a different operation and achieves an ever more sophisticated representation of the input, as, for example, in a deep artificial neural network, an object recognition cascade in computer vision or speech front-end processing. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). This replaces the original problem involving a deeply nested function with a constrained problem involving a different function in an augmented space without nesting. 
The constrained problem may be solved with penalty-based methods using alternating optimization over the parameters and the auxiliary coordinates. MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable, and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.\nPoor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.\nHeuristic optimisers which search for an optimal configuration of variables relative to an objective function often get stuck in local optima where the algorithm is unable to find further improvement. The standard approach to circumvent this problem involves periodically restarting the algorithm from random initial configurations when no further improvement can be found. We propose a method of partial reinitialization, whereby, in an attempt to find a better solution, only sub-sets of variables are re-initialised rather than the whole configuration. Much of the information gained from previous runs is hence retained. 
This leads to significant improvements in the quality of the solution found in a given time for a variety of optimisation problems in machine learning.\nWe study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.\nDeep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. 
On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.\nTraining deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.\nHessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training, as well as through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. 
Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas, on a 300-hr Switchboard task, these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected, as problems scale and complexity grows.\nUnsupervised pretraining and dropout have been well studied, especially with respect to regularization and output consistency. However, our understanding about the explicit convergence rates of the parameter estimates, and their dependence on the learning (like denoising and dropout rate) and structural (like depth and layer lengths) aspects of the network is less mature. An interesting question in this context is to ask if the network structure could \"guide\" the choices of such learning parameters. In this work, we explore these gaps between network structure, the learning mechanisms and their interaction with parameter convergence rates. We present a way to address these issues based on the backpropagation convergence rates for general nonconvex objectives using first-order information. We then incorporate two learning mechanisms into this general framework -- denoising autoencoder and dropout, and subsequently derive the convergence rates of deep networks. Building upon these bounds, we provide insights into the choices of learning parameters and network sizes that achieve certain levels of convergence accuracy. 
The results derived here support existing empirical observations, and we also conduct a set of experiments to evaluate them.\nSolving inverse problems with iterative algorithms such as stochastic gradient descent is a popular technique, especially for large data. In applications, due to time constraints, the number of iterations one may apply is usually limited, consequently limiting the accuracy achievable by certain methods. Given a reconstruction error one is willing to tolerate, an important question is whether it is possible to modify the original iterations to obtain a faster convergence to a minimizer with the allowed error. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence to an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA solution by neural networks with layers representing iterations.\nMany powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. 
With large-scale problems, or when distributing the computation is necessary for faster training, the dataset may not fit in a single machine. It is then essential to limit the amount of communication between machines so it does not obliterate the benefit of parallelism. We describe a general way to achieve this, ParMAC. ParMAC works on a cluster of processing machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and propose a theoretical model of its runtime and parallel speedup. We develop ParMAC to learn binary autoencoders for fast, approximate image retrieval. We implement it in MPI in a distributed system and demonstrate nearly perfect speedups in a 128-processor cluster with a training set of 100 million high-dimensional points.\nWe consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\\varepsilon^2)$ iterations for objectives that are sum of smooth functions.   We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. 
For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\\varepsilon)$ rate, and is faster than full gradient descent by $\\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.\nWe consider the ability of deep neural networks to represent data that lies near a low-dimensional manifold in a high-dimensional space. We show that deep networks can efficiently extract the intrinsic, low-dimensional coordinates of such data. We first show that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space. Remarkably, the network can do this using an almost optimal number of parameters. We also show that this network projects nearby points onto the manifold and then embeds them with little error. We then extend these results to more general manifolds.\nWe present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model; the model combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor to our model, the BLSTM then generates a context from the sequence acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a 3.47 WER on the Wall Street Journal (WSJ) eval92 task or more than 8% relative improvement over the baseline DNN models.\nTraining of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. 
Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.\nThe fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices.   In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.\nWe discuss approximation of functions using deep neural nets. Given a function $f$ on a $d$-dimensional manifold $\\Gamma \\subset \\mathbb{R}^m$, we construct a sparsely-connected depth-4 neural network and bound its error in approximating $f$. The size of the network depends on dimension and curvature of the manifold $\\Gamma$, the complexity of $f$, in terms of its wavelet description, and only weakly on the ambient dimension $m$. 
Essentially, our network computes wavelet functions, which are computed from Rectified Linear Units (ReLU).\nDeep convolutional neural networks comprise a subclass of deep neural networks (DNN) with a constrained architecture that leverages the spatial and temporal structure of the domain they model. Convolutional networks achieve the best predictive performance in areas such as speech and image recognition by hierarchically composing simple local features into complex models. Although DNNs have been used in drug discovery for QSAR and ligand-based bioactivity predictions, none of these models have benefited from this powerful convolutional architecture. This paper introduces AtomNet, the first structure-based, deep convolutional neural network designed to predict the bioactivity of small molecules for drug discovery applications. We demonstrate how to apply the convolutional concepts of feature locality and hierarchical composition to the modeling of bioactivity and chemical interactions. In further contrast to existing DNN techniques, we show that AtomNet's application of local convolutional filters to structural target information successfully predicts new active molecules for targets with no previously known modulators. Finally, we show that AtomNet outperforms previous docking approaches on a diverse set of benchmarks by a large margin, achieving an AUC greater than 0.9 on 57.8% of the targets in the DUDE benchmark.\nWe study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. 
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.\nThe generalization error of deep neural networks via their classification margin is studied in this work, providing novel generalization error bounds that are independent of the network depth, thereby avoiding the common exponential depth-dependency which is unrealistic for current networks with hundreds of layers. We show that a large margin linear classifier operating at the output of a deep neural network induces a large classification margin at the input of the network, provided that the network preserves distances in directions normal to the decision boundary. The distance preservation is characterized by the average behaviour of the network's Jacobian matrix in the neighbourhood of the training samples. The introduced theory also leads to a margin preservation regularization scheme that outperforms weight decay both theoretically and empirically.\nCustomer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer. Unsupervised learning was conducted using autoencoders to better understand the reasons for customer churn. 
Images that maximally activate the hidden units of an autoencoder trained with churned customers reveal ample opportunities for action to be taken to prevent churn among strong data, no voice users.\nWe revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.\nThis paper proposes a set of new error criteria and learning approaches, Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex optimization problem in training deep neural networks (DNNs). Theoretically, we demonstrate its effectiveness on global and local convexity lower-bounded by the standard $L_p$-norm error. By analyzing the gradient on the convexity index $\\lambda$, we explain the reason why to learn $\\lambda$ adaptively using gradient descent works. In practice, we show how this method improves training of deep neural networks to solve visual recognition tasks on the MNIST and CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results comparable or superior to those reported in recent literature on the same tasks using standard ConvNets + MSE/cross entropy. Performance on deep/shallow multilayer perceptrons and Denoised Auto-encoders is also explored. ANRAT can be combined with other quasi-Newton training methods, innovative network variants, regularization techniques and other specific tricks in DNNs. 
Other than unsupervised pretraining, it provides a new perspective to address the non-convex optimization problem in DNNs.\nWe introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.\nModel-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \\emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \\emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. 
Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.\nIn this paper, we explore different ways to extend a recurrent neural network (RNN) to a \\textit{deep} RNN. We start by arguing that the concept of depth in an RNN is not as clear as it is in feedforward neural networks. By carefully analyzing and understanding the architecture of an RNN, however, we find three points of an RNN which may be made deeper; (1) input-to-hidden function, (2) hidden-to-hidden transition and (3) hidden-to-output function. Based on this observation, we propose two novel architectures of a deep RNN which are orthogonal to an earlier attempt of stacking multiple recurrent layers to build a deep RNN (Schmidhuber, 1992; El Hihi and Bengio, 1996). We provide an alternative interpretation of these deep RNNs using a novel framework based on neural operators. The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs.\nIt has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. 
An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.\nPre-training is crucial for learning deep neural networks. Most existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such a method to pre-train models without a clear multi-layer structure, e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so it can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. 
Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.\nThe Resilient Propagation (Rprop) algorithm has been very popular for backpropagation training of multilayer feed-forward neural networks in various applications. The standard Rprop however encounters difficulties in the context of deep neural networks as typically happens with gradient-based learning algorithms. In this paper, we propose a modification of the Rprop that combines standard Rprop steps with a special drop out technique. We apply the method for training Deep Neural Networks as standalone components and in ensemble formulations. Results on the MNIST dataset show that the proposed modification alleviates standard Rprop's problems demonstrating improved learning speed and accuracy.\nReal time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. 
While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.\nWe provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.\nA grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks that are complicated by nuisance variation. For instance, visual object recognition involves the unknown object position, orientation, and scale in object recognition while speech recognition involves the unknown voice pronunciation, pitch, and speed. Recently, a new breed of deep learning algorithms have emerged for high-nuisance inference tasks that routinely yield pattern recognition systems with near- or super-human capabilities. But a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. 
We answer this question by developing a new probabilistic framework for deep learning based on the Deep Rendering Model: a generative probabilistic model that explicitly captures latent nuisance variation. By relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks and random decision forests, providing insights into their successes and shortcomings, as well as a principled route to their improvement.\nA network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.\nDeep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. 
In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.\nDeep learning has recently led to great successes in tasks such as image recognition (e.g., Krizhevsky et al., 2012). However, deep networks are still outmatched by the power and versatility of the brain, perhaps in part due to the richer neuronal computations available to cortical circuits. The challenge is to identify which neuronal mechanisms are relevant, and to find suitable abstractions to model them. Here, we show how aspects of spike timing, long hypothesized to play a crucial role in cortical information processing, could be incorporated into deep networks to build richer, versatile representations. We introduce a neural network formulation based on complex-valued neuronal units that is not only biologically meaningful but also amenable to a variety of deep learning frameworks. Here, units are attributed both a firing rate and a phase, the latter indicating properties of spike timing. We show how this formulation qualitatively captures several aspects thought to be related to neuronal synchrony, including gating of information processing and dynamic binding of distributed object representations. 
Focusing on the latter, we demonstrate the potential of the approach in several simple experiments. Thus, neuronal synchrony could be a flexible mechanism that fulfills multiple functional roles in deep networks.\nMany state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation \"on-demand\", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.\nWe seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue a careful exploration of approaches to allow pooling to learn and to adapt to complex and variable patterns. 
The two primary directions lie in (1) learning a pooling function via (two strategies of) combining of max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments every generalized pooling operation we explore improves performance when used in place of average or max pooling. We experimentally demonstrate that the proposed pooling operations provide a boost in invariance properties relative to conventional pooling and set the state of the art on several widely adopted benchmark datasets; they are also easy to implement, and can be applied within various deep neural network architectures. These benefits come with only a light increase in computational overhead during training and a very modest increase in the number of model parameters.\nMethods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout. In other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that capture the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. 
Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.\nWe introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.\nArtificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.\nWe have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). 
In this paper, we present techniques that further improve performance of LSTM RNN acoustic models for large vocabulary speech recognition. We show that frame stacking and reduced frame rate lead to more accurate models and faster decoding. CD phone modeling leads to further improvements. We also present initial results for LSTM RNN models outputting words directly.\nThree important properties of a classification machinery are: (i) the system preserves the core information of the input data; (ii) the training examples convey information about unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are satisfied by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have a similar output. The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure, as well as providing bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.\nLong Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. 
They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the deep neural networks, the use of RNNs in speech recognition has been limited to phone recognition in small scale tasks. In this paper, we present novel LSTM based RNN architectures which make more effective use of model parameters to train acoustic models for large vocabulary speech recognition. We train and compare LSTM, RNN and DNN models at various numbers of parameters and configurations. We show that LSTM models converge quickly and give state of the art speech recognition performance for relatively small sized models.\nRegularization is essential when training large neural networks. As deep neural networks can be mathematically interpreted as universal function approximators, they are effective at memorizing sampling noise in the training data. This results in poor generalization to unseen data. Therefore, it is no surprise that a new regularization technique, Dropout, was partially responsible for the now-ubiquitous winning entry to ImageNet 2012 by the University of Toronto. Currently, Dropout (and related methods such as DropConnect) are the most effective means of regularizing large neural networks. These amount to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time. The proposed FaMe model aims to apply a similar strategy, yet learns a factorization of each weight matrix such that the factors are robust to noise.\nWe combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. 
Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.\nWe formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et al., 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et al., 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.\nRestricted Boltzmann machines are undirected neural networks which have been shown to be effective in many applications, including serving as initializations for training deep multi-layer neural networks. One of the main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive divergence, while also providing a clear and easy to evaluate objective function. 
We believe that this strategy can be easily generalized to other models as well as to more accurate higher-order approximations, paving the way for systematic improvements in training Boltzmann machines with hidden units.\nTop-down information plays a central role in human perception, but plays relatively little role in many current state-of-the-art deep networks, such as Convolutional Neural Networks (CNNs). This work seeks to explore a path by which top-down information can have a direct impact within current deep networks. We explore this path by learning and using \"generators\" corresponding to the network internal effects of three types of transformation (each a restriction of a general affine transformation): rotation, scaling, and translation. We demonstrate how these learned generators can be used to transfer top-down information to novel settings, as mediated by the \"feature flows\" that the transformations (and the associated generators) correspond to inside the network. Specifically, we explore three aspects: 1) using generators as part of a method for synthesizing transformed images --- given a previously unseen image, produce versions of that image corresponding to one or more specified transformations, 2) \"zero-shot learning\" --- when provided with a feature flow corresponding to the effect of a transformation of unknown amount, leverage learned generators as part of a method by which to perform an accurate categorization of the amount of transformation, even for amounts never observed during training, and 3) (inside-CNN) \"data augmentation\" --- improve the classification performance of an existing network by using the learned generators to directly provide additional training \"inside the CNN\".\nSeveral popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. 
The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.\nMotivated by an important insight from neural science, we propose a new framework for understanding the success of the recently proposed \"maxout\" networks. The framework is based on encoding information on sparse pathways and recognizing the correct pathway at inference time. Elaborating further on this insight, we propose a novel deep network architecture, called \"channel-out\" network, which takes a much better advantage of sparse pathway encoding. In channel-out networks, pathways are not only formed a posteriori, but they are also actively selected according to the inference outputs from the lower layers. From a mathematical perspective, channel-out networks can represent a wider class of piece-wise continuous functions, thereby endowing the network with more expressive power than that of maxout networks. 
We test our channel-out networks on several well-known image classification benchmarks, setting new state-of-the-art performance on CIFAR-100 and STL-10, which represent some of the \"harder\" image classification benchmarks.\nIn a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules must first define the nature of the local variables, and then the functional form that ties them together into each learning rule. We consider polynomial local learning rules and analyze their behavior and capabilities in both linear and non-linear networks. As a byproduct, this framework enables also the discovery of new learning rules as well as important relationships between learning rules and group symmetries. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is propagated to the deep layers through a backward channel. The nature of the propagated information about the targets, and the backward channel through which this information is propagated, partition the space of learning algorithms. For any learning algorithm, the capacity of the backward channel can be defined as the number of bits provided about the gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them and achieves the maximum possible capacity. 
The theory clarifies the concept of Hebbian learning, what is learnable by Hebbian learning, and explains the sparsity of the space of learning rules discovered so far.\nDeep learning is currently the subject of intensive study. However, fundamental concepts such as representations are not formally defined -- researchers \"know them when they see them\" -- and there is no common language for describing and analyzing algorithms. This essay proposes an abstract framework that identifies the essential features of current practice and may provide a foundation for future developments.   The backbone of almost all deep learning algorithms is backpropagation, which is simply a gradient computation distributed over a neural network. The main ingredients of the framework are thus, unsurprisingly: (i) game theory, to formalize distributed optimization; and (ii) communication protocols, to track the flow of zeroth and first-order information. The framework allows natural definitions of semantics (as the meaning encoded in functions), representations (as functions whose semantics is chosen to optimize a criterion) and grammars (as communication protocols equipped with first-order convergence guarantees).   Much of the essay is spent discussing examples taken from the literature. The ultimate aim is to develop a graphical language for describing the structure of deep learning algorithms that backgrounds the details of the optimization procedure and foregrounds how the components interact. Inspiration is taken from probabilistic graphical models and factor graphs, which capture the essential structural features of multivariate distributions.\nDeep learning algorithms have been shown to perform extremely well on many classical machine learning problems. However, recent studies have shown that deep learning, like other machine learning techniques, is vulnerable to adversarial samples: inputs crafted to force a deep neural network (DNN) to provide adversary-selected outputs. 
Such attacks can seriously undermine the security of the system supported by the DNN, sometimes with devastating consequences. For example, autonomous vehicles can be crashed, illicit or illegal content can bypass content filters, or biometric authentication systems can be manipulated to allow improper access. In this work, we introduce a defensive mechanism called defensive distillation to reduce the effectiveness of adversarial samples on DNNs. We analytically investigate the generalizability and robustness properties granted by the use of defensive distillation when training DNNs. We also empirically study the effectiveness of our defense mechanisms on two DNNs placed in adversarial settings. The study shows that defensive distillation can reduce effectiveness of sample creation from 95% to less than 0.5% on a studied DNN. Such dramatic gains can be explained by the fact that distillation leads gradients used in adversarial sample creation to be reduced by a factor of 10^30. We also find that distillation increases the average minimum number of features that need to be modified to create adversarial samples by about 800% on one of the DNNs we tested.\nWe present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. It however also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: Can the desirable properties of maxout units be preserved while improving their invariance properties ? We argue that our probabilistic maxout (probout) units successfully achieve this balance. 
We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).\nThis paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.\nThe backpropagation algorithm for calculating gradients has been widely used in computation of weights for deep neural networks (DNNs). This method requires derivatives of objective functions and has some difficulties finding appropriate parameters such as learning rate. In this paper, we propose a novel approach for computing weight matrices of fully-connected DNNs by using two types of semi-nonnegative matrix factorizations (semi-NMFs). In this method, optimization processes are performed by calculating weight matrices alternately, and backpropagation (BP) is not used. We also present a method to calculate stacked autoencoder using a NMF. The output results of the autoencoder are used as pre-training data for DNNs. 
The experimental results show that our method using three types of NMFs attains similar error rates to the conventional DNNs with BP.\nMultidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.\nWhy does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. 
Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.\nIn this paper we propose and investigate a novel nonlinear unit, called $L_p$ unit, for deep neural networks. The proposed $L_p$ unit receives signals from several projections of a subset of units in the layer below and computes a normalized $L_p$ norm. We notice two interesting interpretations of the $L_p$ unit. 
First, the proposed unit can be understood as a generalization of a number of conventional pooling operators such as average, root-mean-square and max pooling widely used in, for instance, convolutional neural networks (CNN), HMAX models and neocognitrons. Furthermore, the $L_p$ unit is, to a certain degree, similar to the recently proposed maxout unit (Goodfellow et al., 2013) which achieved the state-of-the-art object recognition results on a number of benchmark datasets. Secondly, we provide a geometrical interpretation of the activation function based on which we argue that the $L_p$ unit is more efficient at representing complex, nonlinear separating boundaries. Each $L_p$ unit defines a superelliptic boundary, with its exact shape defined by the order $p$. We claim that this makes it possible to model arbitrarily shaped, curved boundaries more efficiently by combining a few $L_p$ units of different orders. This insight justifies the need for learning different orders for each unit in the model. We empirically evaluate the proposed $L_p$ units on a number of datasets and show that multilayer perceptrons (MLP) consisting of the $L_p$ units achieve the state-of-the-art results on a number of benchmark datasets. Furthermore, we evaluate the proposed $L_p$ unit on the recently proposed deep recurrent neural networks (RNN).\nDeep learning methods have shown great promise in many practical applications, ranging from speech recognition, visual object recognition, to text processing. However, most of the current deep learning methods suffer from scalability problems for large-scale applications, forcing researchers or users to focus on small-scale problems with fewer parameters.   In this paper, we consider a well-known machine learning model, deep belief networks (DBNs) that have yielded impressive classification performance on a large number of benchmark machine learning tasks. 
To scale up DBN, we propose an approach that can use the computing clusters in a distributed environment to train large models, while the dense matrix computations within a single machine are sped up using graphics processors (GPU). When training a DBN, each machine randomly drops out a portion of neurons in each hidden layer, for each training case, making the remaining neurons only learn to detect features that are generally helpful for producing the correct answer. Within our approach, we have developed four methods to combine outcomes from each machine to form a unified model. Our preliminary experiment on the MNIST handwritten digit database demonstrates that our approach outperforms the state-of-the-art test error rate.\nDeep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and compare DNNs using several metrics to quantify factors influencing differences in task performance. Our first set of experiments uses the standard Switchboard benchmark corpus, which contains approximately 300 hours of conversational telephone speech. We compare standard DNNs to convolutional networks, and present the first experiments using locally-connected, untied neural networks for acoustic modeling. We additionally build systems on a corpus of 2,100 hours of training data by combining the Switchboard and Fisher corpora. This larger corpus allows us to more thoroughly examine performance of large DNN models -- with up to ten times more parameters than those typically used in speech recognition systems. 
Our results suggest that a relatively simple DNN architecture and optimization technique produces strong results. These findings, along with previous work, help establish a set of best practices for building DNN hybrid speech recognition systems with maximum likelihood training. Our experiments in DNN optimization additionally serve as a case study for training DNNs with discriminative loss functions for speech tasks, as well as DNN classifiers more generally.\nWe demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.\nThere has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder's parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product, the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. 
At test time, the method is significantly faster and scales better than other autoregressive estimators.\nWe introduce two Python frameworks to train neural networks on large datasets: Blocks and Fuel. Blocks is based on Theano, a linear algebra compiler with CUDA-support. It facilitates the training of complex neural network models by providing parametrized Theano operations, attaching metadata to Theano's symbolic computational graph, and providing an extensive set of utilities to assist training the networks, e.g. training algorithms, logging, monitoring, visualization, and serialization. Fuel provides a standard format for machine learning datasets. It allows the user to easily iterate over large datasets, performing many types of pre-processing on the fly.\nOne of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant and scale-normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.\nRecurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. 
This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.\nTraining very deep networks is an important open problem in machine learning. One of many difficulties is that the norm of the back-propagated error gradient can grow or decay exponentially. Here we show that training very deep feed-forward networks (FFNs) is not as difficult as previously thought. Unlike when back-propagation is applied to a recurrent network, application to an FFN amounts to multiplying the error gradient by a different random matrix at each layer. We show that the successive application of correctly scaled random matrices to an initial vector results in a random walk of the log of the norm of the resulting vectors, and we compute the scaling that makes this walk unbiased. The variance of the random walk grows only linearly with network depth and is inversely proportional to the size of each layer. Practically, this implies a gradient whose log-norm scales with the square root of the network depth and shows that the vanishing gradient problem can be mitigated by increasing the width of the layers. Mathematical analyses and experimental results using stochastic gradient descent to optimize tasks related to the MNIST and TIMIT datasets are provided to support these claims. 
Equations for the optimal matrix scaling are provided for the linear and ReLU cases.\nHessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.\nCurrent deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost of both the feed-forward pass and backpropagation by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. 
We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.\nWe replace the Hidden Markov Model (HMM) which is traditionally used in continuous speech recognition with a bi-directional recurrent neural network encoder coupled to a recurrent neural network decoder that directly emits a stream of phonemes. The alignment between the input and output sequences is established using an attention mechanism: the decoder emits each symbol based on a context created with a subset of input symbols elected by the attention mechanism. We report initial results demonstrating that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.\nIn this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. 
We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.\nDeep Neural Network (DNN) acoustic models have yielded many state-of-the-art results in Automatic Speech Recognition (ASR) tasks. More recently, Recurrent Neural Network (RNN) models have been shown to outperform their DNN counterparts. However, state-of-the-art DNN and RNN models tend to be impractical to deploy on embedded systems with limited computational capacity. Traditionally, the approach for embedded platforms is to either train a small DNN directly, or to train a small DNN that learns the output distribution of a large DNN. In this paper, we utilize a state-of-the-art RNN to transfer knowledge to a small DNN. We use the RNN model to generate soft alignments and minimize the Kullback-Leibler divergence against the small DNN. The small DNN trained on the soft RNN alignments achieved a 3.93 WER on the Wall Street Journal (WSJ) eval92 task compared to a baseline 4.54 WER or more than 13% relative improvement.\nRecurrent Neural Networks (RNNs) have long been recognized for their potential to model complex time series. However, it remains to be determined what optimization techniques and recurrent architectures can be used to best realize this potential. The experiments presented take a deep look into Hessian free optimization, a powerful second order optimization method that has shown promising results, but still does not enjoy widespread use. This algorithm was used to train a number of RNN architectures including standard RNNs, long short-term memory, multiplicative RNNs, and stacked RNNs on the task of character prediction. The insights from these experiments led to the creation of a new multiplicative LSTM hybrid architecture that outperformed both LSTM and multiplicative RNNs. 
When tested on a larger scale, multiplicative LSTM achieved character level modelling results competitive with the state of the art for RNNs using very different methodology.\nIn recent years, deep neural networks (DNN) have demonstrated significant business impact in large scale analysis and classification tasks such as speech recognition, visual object detection, pattern extraction, etc. Training of large DNNs, however, is universally considered as time consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally thus minimizing data movement during training and allowing to fully exploit the locality and the parallelism of the training algorithm. We identify the RPU device and system specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000X compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. 
A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.\nRecently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.\nThis paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. 
Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.\nWe address the problem of acoustic source separation in a deep learning framework we call \"deep clustering.\" Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for partition labels given in training data. Previous deep network approaches provide great advantages in terms of learning power and speed, but previously it has been unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it has been unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we use an objective function to train embeddings that yield a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent way. This avoids the high cost of spectral factorization and instead produces compact clusters that are amenable to simple clustering methods. The segmentations are therefore implicitly encoded in the embeddings, and can be \"decoded\" by clustering. 
Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features containing mixtures of two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by around 6dB. We show that the model can generalize to three-speaker mixtures despite training only on two-speaker mixtures. The framework can be used without class labels, and therefore has the potential to be trained on a diverse set of sound types, and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.\nA very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.\nDeep Belief Networks (DBN) have been successfully applied on popular machine learning tasks. 
Specifically, when applied on hand-written digit recognition, DBNs have achieved approximate accuracy rates of 98.8%. In an effort to optimize the data representation achieved by the DBN and maximize their descriptive power, recent advances have focused on inducing sparse constraints at each layer of the DBN. In this paper we present a theoretical approach for sparse constraints in the DBN using the mixed norm for both non-overlapping and overlapping groups. We explore how these constraints affect the classification accuracy for digit recognition in three different datasets (MNIST, USPS, RIMES) and provide initial estimations of their usefulness by altering different parameters such as the group size and overlap percentage.\nOur proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce \"companion objective\" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).\nResidual networks (ResNets) have recently achieved state-of-the-art on challenging computer vision tasks. 
We introduce Resnet in Resnet (RiR): a deep dual-stream architecture that generalizes ResNets and standard CNNs and is easily implemented with no computational overhead. RiR consistently improves performance over ResNets, outperforms architectures with similar amounts of augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100.\nWe propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.\nAlthough artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. 
In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.\nStochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.\nInspired by recent successes of deep learning in computer vision, we propose a novel framework for encoding time series as different types of images, namely, Gramian Angular Summation/Difference Fields (GASF/GADF) and Markov Transition Fields (MTF). This enables the use of techniques from computer vision for time series classification and imputation. We used Tiled Convolutional Neural Networks (tiled CNNs) on 20 standard datasets to learn high-level features from the individual and compound GASF-GADF-MTF images. 
Our approaches achieve highly competitive results when compared to nine of the current best time series classification approaches. Inspired by the bijection property of GASF on 0/1 rescaled data, we train Denoised Auto-encoders (DA) on the GASF images of four standard and one synthesized compound dataset. The imputation MSE on test data is reduced by 12.18%-48.02% when compared to using the raw data. An analysis of the features and weights learned via tiled CNNs and DAs explains why the approaches work.\nTypical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cram\\'{e}r-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. 
Both CD-CIF and IP are studied in a series of density estimation experiments.\nFor discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x | H = f(x)) P(H = f(x))$ if $P(X | H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x')\\neq f(x)$, where $f(\\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\\cdot)$ as the encoder and $P(X|H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h=f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\\log p(x)$. The objective is to learn an encoder $f(\\cdot)$ that maps $X$ to $f(X)$ that has a much simpler distribution than $X$ itself, estimated by $P(H)$. This \"flattens the manifold\" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.\nIn this paper, we present an infinite hierarchical non-parametric Bayesian model to extract the hidden factors over observed data, where the number of hidden factors for each layer is unknown and can be potentially infinite. Moreover, the number of layers can also be infinite. 
We construct the model structure that allows continuous values for the hidden factors and weights, which makes the model suitable for various applications. We use the Metropolis-Hastings method to infer the model structure. Then the performance of the algorithm is evaluated by the experiments. Simulation results show that the model fits the underlying structure of simulated data.\nTraining deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (upto an approximation error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a certain approximation error, decreases by a multiplicative factor.   On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as a \"stabilizer\" or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the above mentioned stability properties of dropout, we design dropout based differentially private algorithms for solving ERMs. The learned GLM thus, preserves privacy of each of the individual training points while providing accurate predictions for new test points. 
Finally, we empirically validate our stability assertions for dropout in the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction accuracy) the L2 regularization based methods for several benchmark datasets.\nWe introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. 
We also validate the approach for descriptor learning task in the context of person re-identification application.\nWe investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.\nArtificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Universit\\'e Paris 1\nTraining neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as input dimension and number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions. It consists of simple embarrassingly parallel linear and multi-linear operations, and is competitive with standard stochastic gradient descent (SGD), in terms of computational complexity. 
Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.\nWe describe a simple multilayer bootstrap network for unsupervised dimensionality reduction that each layer of the network is a group of mutually independent k-centers clusterings, and the centers of a clustering are randomly sampled data points. We further compress the network size of multilayer bootstrap network by a neural network in a pseudo supervised way for prediction. We report comparison results in data visualization, clustering, and document retrieval.\nTraining neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.\nRecurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks . In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. 
We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.\nDeep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.\nWe introduce a convolutional neural network that operates directly on graphs. These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape. The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints. 
We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.\nDeep learning is a broad set of techniques that uses multiple layers of representation to automatically learn relevant features directly from structured data. Recently, such techniques have yielded record-breaking results on a diverse set of difficult machine learning tasks in computer vision, speech recognition, and natural language processing. Despite the enormous success of deep learning, relatively little is understood theoretically about why these techniques are so successful at feature learning and compression. Here, we show that deep learning is intimately related to one of the most important and successful techniques in theoretical physics, the renormalization group (RG). RG is an iterative coarse-graining scheme that allows for the extraction of relevant features (i.e. operators) as a physical system is examined at different length scales. We construct an exact mapping between the variational renormalization group, first introduced by Kadanoff, and deep learning architectures based on Restricted Boltzmann Machines (RBMs). We illustrate these ideas using the nearest-neighbor Ising Model in one and two dimensions. Our results suggest that deep learning algorithms may be employing a generalized RG-like scheme to learn relevant features from data.\nWe introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. 
We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation performs better than either a standard neural network or an SVM, even if trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012)."
  },
  {
    "path": "2017/examples/data/heart.csv",
    "content": "sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd\n160,12,5.73,23.11,Present,49,25.3,97.2,52,1\n144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1\n118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0\n170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1\n134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1\n132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0\n142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0\n114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1\n114,0,3.83,19.4,Present,49,24.86,2.49,29,0\n132,0,5.8,30.96,Present,69,30.11,0,53,1\n206,6,2.95,32.27,Absent,72,26.81,56.06,60,1\n134,14.1,4.44,22.39,Present,65,23.09,0,40,1\n118,0,1.88,10.05,Absent,59,21.57,0,17,0\n132,0,1.87,17.21,Absent,49,23.63,0.97,15,0\n112,9.65,2.29,17.2,Present,54,23.53,0.68,53,0\n117,1.53,2.44,28.95,Present,35,25.89,30.03,46,0\n120,7.5,15.33,22,Absent,60,25.31,34.49,49,0\n146,10.5,8.29,35.36,Present,78,32.73,13.89,53,1\n158,2.6,7.46,34.07,Present,61,29.3,53.28,62,1\n124,14,6.23,35.96,Present,45,30.09,0,59,1\n106,1.61,1.74,12.32,Absent,74,20.92,13.37,20,1\n132,7.9,2.85,26.5,Present,51,26.16,25.71,44,0\n150,0.3,6.38,33.99,Present,62,24.64,0,50,0\n138,0.6,3.81,28.66,Absent,54,28.7,1.46,58,0\n142,18.2,4.34,24.38,Absent,61,26.19,0,50,0\n124,4,12.42,31.29,Present,54,23.23,2.06,42,1\n118,6,9.65,33.91,Absent,60,38.8,0,48,0\n145,9.1,5.24,27.55,Absent,59,20.96,21.6,61,1\n144,4.09,5.55,31.4,Present,60,29.43,5.55,56,0\n146,0,6.62,25.69,Absent,60,28.07,8.23,63,1\n136,2.52,3.95,25.63,Absent,51,21.86,0,45,1\n158,1.02,6.33,23.88,Absent,66,22.13,24.99,46,1\n122,6.6,5.58,35.95,Present,53,28.07,12.55,59,1\n126,8.75,6.53,34.02,Absent,49,30.25,0,41,1\n148,5.5,7.1,25.31,Absent,56,29.84,3.6,48,0\n122,4.26,4.44,13.04,Absent,57,19.49,48.99,28,1\n140,3.9,7.32,25.05,Absent,47,27.36,36.77,32,0\n110,4.64,4.55,30.46,Absent,48,30.9,15.22,46,0\n130,0,2.82,19.63,Present,70,24.86,0,29,0\n136,11.2,5.81,31.85,Present,75,27.68,22.94,58,1\n118,0.28,5.8,33.7,Present,60,30.98,0,41,1\n144,0.04,3.38,23.61,Absent,30,23.75,4.66,30,0\n
120,0,1.07,16.02,Absent,47,22.15,0,15,0\n130,2.61,2.72,22.99,Present,51,26.29,13.37,51,1\n114,0,2.99,9.74,Absent,54,46.58,0,17,0\n128,4.65,3.31,22.74,Absent,62,22.95,0.51,48,0\n162,7.4,8.55,24.65,Present,64,25.71,5.86,58,1\n116,1.91,7.56,26.45,Present,52,30.01,3.6,33,1\n114,0,1.94,11.02,Absent,54,20.17,38.98,16,0\n126,3.8,3.88,31.79,Absent,57,30.53,0,30,0\n122,0,5.75,30.9,Present,46,29.01,4.11,42,0\n134,2.5,3.66,30.9,Absent,52,27.19,23.66,49,0\n152,0.9,9.12,30.23,Absent,56,28.64,0.37,42,1\n134,8.08,1.55,17.5,Present,56,22.65,66.65,31,1\n156,3,1.82,27.55,Absent,60,23.91,54,53,0\n152,5.99,7.99,32.48,Absent,45,26.57,100.32,48,0\n118,0,2.99,16.17,Absent,49,23.83,3.22,28,0\n126,5.1,2.96,26.5,Absent,55,25.52,12.34,38,1\n103,0.03,4.21,18.96,Absent,48,22.94,2.62,18,0\n121,0.8,5.29,18.95,Present,47,22.51,0,61,0\n142,0.28,1.8,21.03,Absent,57,23.65,2.93,33,0\n138,1.15,5.09,27.87,Present,61,25.65,2.34,44,0\n152,10.1,4.71,24.65,Present,65,26.21,24.53,57,0\n140,0.45,4.3,24.33,Absent,41,27.23,10.08,38,0\n130,0,1.82,10.45,Absent,57,22.07,2.06,17,0\n136,7.36,2.19,28.11,Present,61,25,61.71,54,0\n124,4.82,3.24,21.1,Present,48,28.49,8.42,30,0\n112,0.41,1.88,10.29,Absent,39,22.08,20.98,27,0\n118,4.46,7.27,29.13,Present,48,29.01,11.11,33,0\n122,0,3.37,16.1,Absent,67,21.06,0,32,1\n118,0,3.67,12.13,Absent,51,19.15,0.6,15,0\n130,1.72,2.66,10.38,Absent,68,17.81,11.1,26,0\n130,5.6,3.37,24.8,Absent,58,25.76,43.2,36,0\n126,0.09,5.03,13.27,Present,50,17.75,4.63,20,0\n128,0.4,6.17,26.35,Absent,64,27.86,11.11,34,0\n136,0,4.12,17.42,Absent,52,21.66,12.86,40,0\n134,0,5.9,30.84,Absent,49,29.16,0,55,0\n140,0.6,5.56,33.39,Present,58,27.19,0,55,1\n168,4.5,6.68,28.47,Absent,43,24.25,24.38,56,1\n108,0.4,5.91,22.92,Present,57,25.72,72,39,0\n114,3,7.04,22.64,Present,55,22.59,0,45,1\n140,8.14,4.93,42.49,Absent,53,45.72,6.43,53,1\n148,4.8,6.09,36.55,Present,63,25.44,0.88,55,1\n148,12.2,3.79,34.15,Absent,57,26.38,14.4,57,1\n128,0,2.43,13.15,Present,63,20.75,0,17,0\n130,0.56,3.3,30.86,Absent,49,27.52,33.33,45,0
\n126,10.5,4.49,17.33,Absent,67,19.37,0,49,1\n140,0,5.08,27.33,Present,41,27.83,1.25,38,0\n126,0.9,5.64,17.78,Present,55,21.94,0,41,0\n122,0.72,4.04,32.38,Absent,34,28.34,0,55,0\n116,1.03,2.83,10.85,Absent,45,21.59,1.75,21,0\n120,3.7,4.02,39.66,Absent,61,30.57,0,64,1\n143,0.46,2.4,22.87,Absent,62,29.17,15.43,29,0\n118,4,3.95,18.96,Absent,54,25.15,8.33,49,1\n194,1.7,6.32,33.67,Absent,47,30.16,0.19,56,0\n134,3,4.37,23.07,Absent,56,20.54,9.65,62,0\n138,2.16,4.9,24.83,Present,39,26.06,28.29,29,0\n136,0,5,27.58,Present,49,27.59,1.47,39,0\n122,3.2,11.32,35.36,Present,55,27.07,0,51,1\n164,12,3.91,19.59,Absent,51,23.44,19.75,39,0\n136,8,7.85,23.81,Present,51,22.69,2.78,50,0\n166,0.07,4.03,29.29,Absent,53,28.37,0,27,0\n118,0,4.34,30.12,Present,52,32.18,3.91,46,0\n128,0.42,4.6,26.68,Absent,41,30.97,10.33,31,0\n118,1.5,5.38,25.84,Absent,64,28.63,3.89,29,0\n158,3.6,2.97,30.11,Absent,63,26.64,108,64,0\n108,1.5,4.33,24.99,Absent,66,22.29,21.6,61,1\n170,7.6,5.5,37.83,Present,42,37.41,6.17,54,1\n118,1,5.76,22.1,Absent,62,23.48,7.71,42,0\n124,0,3.04,17.33,Absent,49,22.04,0,18,0\n114,0,8.01,21.64,Absent,66,25.51,2.49,16,0\n168,9,8.53,24.48,Present,69,26.18,4.63,54,1\n134,2,3.66,14.69,Absent,52,21.03,2.06,37,0\n174,0,8.46,35.1,Present,35,25.27,0,61,1\n116,31.2,3.17,14.99,Absent,47,19.4,49.06,59,1\n128,0,10.58,31.81,Present,46,28.41,14.66,48,0\n140,4.5,4.59,18.01,Absent,63,21.91,22.09,32,1\n154,0.7,5.91,25,Absent,13,20.6,0,42,0\n150,3.5,6.99,25.39,Present,50,23.35,23.48,61,1\n130,0,3.92,25.55,Absent,68,28.02,0.68,27,0\n128,2,6.13,21.31,Absent,66,22.86,11.83,60,0\n120,1.4,6.25,20.47,Absent,60,25.85,8.51,28,0\n120,0,5.01,26.13,Absent,64,26.21,12.24,33,0\n138,4.5,2.85,30.11,Absent,55,24.78,24.89,56,1\n153,7.8,3.96,25.73,Absent,54,25.91,27.03,45,0\n123,8.6,11.17,35.28,Present,70,33.14,0,59,1\n148,4.04,3.99,20.69,Absent,60,27.78,1.75,28,0\n136,3.96,2.76,30.28,Present,50,34.42,18.51,38,0\n134,8.8,7.41,26.84,Absent,35,29.44,29.52,60,1\n152,12.18,4.04,37.83,Present,63,34.57,4.17,64,0\n158,13.5
,5.04,30.79,Absent,54,24.79,21.5,62,0\n132,2,3.08,35.39,Absent,45,31.44,79.82,58,1\n134,1.5,3.73,21.53,Absent,41,24.7,11.11,30,1\n142,7.44,5.52,33.97,Absent,47,29.29,24.27,54,0\n134,6,3.3,28.45,Absent,65,26.09,58.11,40,0\n122,4.18,9.05,29.27,Present,44,24.05,19.34,52,1\n116,2.7,3.69,13.52,Absent,55,21.13,18.51,32,0\n128,0.5,3.7,12.81,Present,66,21.25,22.73,28,0\n120,0,3.68,12.24,Absent,51,20.52,0.51,20,0\n124,0,3.95,36.35,Present,59,32.83,9.59,54,0\n160,14,5.9,37.12,Absent,58,33.87,3.52,54,1\n130,2.78,4.89,9.39,Present,63,19.3,17.47,25,1\n128,2.8,5.53,14.29,Absent,64,24.97,0.51,38,0\n130,4.5,5.86,37.43,Absent,61,31.21,32.3,58,0\n109,1.2,6.14,29.26,Absent,47,24.72,10.46,40,0\n144,0,3.84,18.72,Absent,56,22.1,4.8,40,0\n118,1.05,3.16,12.98,Present,46,22.09,16.35,31,0\n136,3.46,6.38,32.25,Present,43,28.73,3.13,43,1\n136,1.5,6.06,26.54,Absent,54,29.38,14.5,33,1\n124,15.5,5.05,24.06,Absent,46,23.22,0,61,1\n148,6,6.49,26.47,Absent,48,24.7,0,55,0\n128,6.6,3.58,20.71,Absent,55,24.15,0,52,0\n122,0.28,4.19,19.97,Absent,61,25.63,0,24,0\n108,0,2.74,11.17,Absent,53,22.61,0.95,20,0\n124,3.04,4.8,19.52,Present,60,21.78,147.19,41,1\n138,8.8,3.12,22.41,Present,63,23.33,120.03,55,1\n127,0,2.81,15.7,Absent,42,22.03,1.03,17,0\n174,9.45,5.13,35.54,Absent,55,30.71,59.79,53,0\n122,0,3.05,23.51,Absent,46,25.81,0,38,0\n144,6.75,5.45,29.81,Absent,53,25.62,26.23,43,1\n126,1.8,6.22,19.71,Absent,65,24.81,0.69,31,0\n208,27.4,3.12,26.63,Absent,66,27.45,33.07,62,1\n138,0,2.68,17.04,Absent,42,22.16,0,16,0\n148,0,3.84,17.26,Absent,70,20,0,21,0\n122,0,3.08,16.3,Absent,43,22.13,0,16,0\n132,7,3.2,23.26,Absent,77,23.64,23.14,49,0\n110,12.16,4.99,28.56,Absent,44,27.14,21.6,55,1\n160,1.52,8.12,29.3,Present,54,25.87,12.86,43,1\n126,0.54,4.39,21.13,Present,45,25.99,0,25,0\n162,5.3,7.95,33.58,Present,58,36.06,8.23,48,0\n194,2.55,6.89,33.88,Present,69,29.33,0,41,0\n118,0.75,2.58,20.25,Absent,59,24.46,0,32,0\n124,0,4.79,34.71,Absent,49,26.09,9.26,47,0\n160,0,2.42,34.46,Absent,48,29.83,1.03,61,0\n128,0,2.51,29.35
,Present,53,22.05,1.37,62,0\n122,4,5.24,27.89,Present,45,26.52,0,61,1\n132,2,2.7,21.57,Present,50,27.95,9.26,37,0\n120,0,2.42,16.66,Absent,46,20.16,0,17,0\n128,0.04,8.22,28.17,Absent,65,26.24,11.73,24,0\n108,15,4.91,34.65,Absent,41,27.96,14.4,56,0\n166,0,4.31,34.27,Absent,45,30.14,13.27,56,0\n152,0,6.06,41.05,Present,51,40.34,0,51,0\n170,4.2,4.67,35.45,Present,50,27.14,7.92,60,1\n156,4,2.05,19.48,Present,50,21.48,27.77,39,1\n116,8,6.73,28.81,Present,41,26.74,40.94,48,1\n122,4.4,3.18,11.59,Present,59,21.94,0,33,1\n150,20,6.4,35.04,Absent,53,28.88,8.33,63,0\n129,2.15,5.17,27.57,Absent,52,25.42,2.06,39,0\n134,4.8,6.58,29.89,Present,55,24.73,23.66,63,0\n126,0,5.98,29.06,Present,56,25.39,11.52,64,1\n142,0,3.72,25.68,Absent,48,24.37,5.25,40,1\n128,0.7,4.9,37.42,Present,72,35.94,3.09,49,1\n102,0.4,3.41,17.22,Present,56,23.59,2.06,39,1\n130,0,4.89,25.98,Absent,72,30.42,14.71,23,0\n138,0.05,2.79,10.35,Absent,46,21.62,0,18,0\n138,0,1.96,11.82,Present,54,22.01,8.13,21,0\n128,0,3.09,20.57,Absent,54,25.63,0.51,17,0\n162,2.92,3.63,31.33,Absent,62,31.59,18.51,42,0\n160,3,9.19,26.47,Present,39,28.25,14.4,54,1\n148,0,4.66,24.39,Absent,50,25.26,4.03,27,0\n124,0.16,2.44,16.67,Absent,65,24.58,74.91,23,0\n136,3.15,4.37,20.22,Present,59,25.12,47.16,31,1\n134,2.75,5.51,26.17,Absent,57,29.87,8.33,33,0\n128,0.73,3.97,23.52,Absent,54,23.81,19.2,64,0\n122,3.2,3.59,22.49,Present,45,24.96,36.17,58,0\n152,3,4.64,31.29,Absent,41,29.34,4.53,40,0\n162,0,5.09,24.6,Present,64,26.71,3.81,18,0\n124,4,6.65,30.84,Present,54,28.4,33.51,60,0\n136,5.8,5.9,27.55,Absent,65,25.71,14.4,59,0\n136,8.8,4.26,32.03,Present,52,31.44,34.35,60,0\n134,0.05,8.03,27.95,Absent,48,26.88,0,60,0\n122,1,5.88,34.81,Present,69,31.27,15.94,40,1\n116,3,3.05,30.31,Absent,41,23.63,0.86,44,0\n132,0,0.98,21.39,Absent,62,26.75,0,53,0\n134,0,2.4,21.11,Absent,57,22.45,1.37,18,0\n160,7.77,8.07,34.8,Absent,64,31.15,0,62,1\n180,0.52,4.23,16.38,Absent,55,22.56,14.77,45,1\n124,0.81,6.16,11.61,Absent,35,21.47,10.49,26,0\n114,0,4.97,9.69,Absent
,26,22.6,0,25,0\n208,7.4,7.41,32.03,Absent,50,27.62,7.85,57,0\n138,0,3.14,12,Absent,54,20.28,0,16,0\n164,0.5,6.95,39.64,Present,47,41.76,3.81,46,1\n144,2.4,8.13,35.61,Absent,46,27.38,13.37,60,0\n136,7.5,7.39,28.04,Present,50,25.01,0,45,1\n132,7.28,3.52,12.33,Absent,60,19.48,2.06,56,0\n143,5.04,4.86,23.59,Absent,58,24.69,18.72,42,0\n112,4.46,7.18,26.25,Present,69,27.29,0,32,1\n134,10,3.79,34.72,Absent,42,28.33,28.8,52,1\n138,2,5.11,31.4,Present,49,27.25,2.06,64,1\n188,0,5.47,32.44,Present,71,28.99,7.41,50,1\n110,2.35,3.36,26.72,Present,54,26.08,109.8,58,1\n136,13.2,7.18,35.95,Absent,48,29.19,0,62,0\n130,1.75,5.46,34.34,Absent,53,29.42,0,58,1\n122,0,3.76,24.59,Absent,56,24.36,0,30,0\n138,0,3.24,27.68,Absent,60,25.7,88.66,29,0\n130,18,4.13,27.43,Absent,54,27.44,0,51,1\n126,5.5,3.78,34.15,Absent,55,28.85,3.18,61,0\n176,5.76,4.89,26.1,Present,46,27.3,19.44,57,0\n122,0,5.49,19.56,Absent,57,23.12,14.02,27,0\n124,0,3.23,9.64,Absent,59,22.7,0,16,0\n140,5.2,3.58,29.26,Absent,70,27.29,20.17,45,1\n128,6,4.37,22.98,Present,50,26.01,0,47,0\n190,4.18,5.05,24.83,Absent,45,26.09,82.85,41,0\n144,0.76,10.53,35.66,Absent,63,34.35,0,55,1\n126,4.6,7.4,31.99,Present,57,28.67,0.37,60,1\n128,0,2.63,23.88,Absent,45,21.59,6.54,57,0\n136,0.4,3.91,21.1,Present,63,22.3,0,56,1\n158,4,4.18,28.61,Present,42,25.11,0,60,0\n160,0.6,6.94,30.53,Absent,36,25.68,1.42,64,0\n124,6,5.21,33.02,Present,64,29.37,7.61,58,1\n158,6.17,8.12,30.75,Absent,46,27.84,92.62,48,0\n128,0,6.34,11.87,Absent,57,23.14,0,17,0\n166,3,3.82,26.75,Absent,45,20.86,0,63,1\n146,7.5,7.21,25.93,Present,55,22.51,0.51,42,0\n161,9,4.65,15.16,Present,58,23.76,43.2,46,0\n164,13.02,6.26,29.38,Present,47,22.75,37.03,54,1\n146,5.08,7.03,27.41,Present,63,36.46,24.48,37,1\n142,4.48,3.57,19.75,Present,51,23.54,3.29,49,0\n138,12,5.13,28.34,Absent,59,24.49,32.81,58,1\n154,1.8,7.13,34.04,Present,52,35.51,39.36,44,0\n118,0,2.39,12.13,Absent,49,18.46,0.26,17,1\n124,0.61,2.69,17.15,Present,61,22.76,11.55,20,0\n124,1.04,2.84,16.42,Present,46,20.17,0,61,0
\n136,5,4.19,23.99,Present,68,27.8,25.86,35,0\n132,9.9,4.63,27.86,Present,46,23.39,0.51,52,1\n118,0.12,1.96,20.31,Absent,37,20.01,2.42,18,0\n118,0.12,4.16,9.37,Absent,57,19.61,0,17,0\n134,12,4.96,29.79,Absent,53,24.86,8.23,57,0\n114,0.1,3.95,15.89,Present,57,20.31,17.14,16,0\n136,6.8,7.84,30.74,Present,58,26.2,23.66,45,1\n130,0,4.16,39.43,Present,46,30.01,0,55,1\n136,2.2,4.16,38.02,Absent,65,37.24,4.11,41,1\n136,1.36,3.16,14.97,Present,56,24.98,7.3,24,0\n154,4.2,5.59,25.02,Absent,58,25.02,1.54,43,0\n108,0.8,2.47,17.53,Absent,47,22.18,0,55,1\n136,8.8,4.69,36.07,Present,38,26.56,2.78,63,1\n174,2.02,6.57,31.9,Present,50,28.75,11.83,64,1\n124,4.25,8.22,30.77,Absent,56,25.8,0,43,0\n114,0,2.63,9.69,Absent,45,17.89,0,16,0\n118,0.12,3.26,12.26,Absent,55,22.65,0,16,0\n106,1.08,4.37,26.08,Absent,67,24.07,17.74,28,1\n146,3.6,3.51,22.67,Absent,51,22.29,43.71,42,0\n206,0,4.17,33.23,Absent,69,27.36,6.17,50,1\n134,3,3.17,17.91,Absent,35,26.37,15.12,27,0\n148,15,4.98,36.94,Present,72,31.83,66.27,41,1\n126,0.21,3.95,15.11,Absent,61,22.17,2.42,17,0\n134,0,3.69,13.92,Absent,43,27.66,0,19,0\n134,0.02,2.8,18.84,Absent,45,24.82,0,17,0\n123,0.05,4.61,13.69,Absent,51,23.23,2.78,16,0\n112,0.6,5.28,25.71,Absent,55,27.02,27.77,38,1\n112,0,1.71,15.96,Absent,42,22.03,3.5,16,0\n101,0.48,7.26,13,Absent,50,19.82,5.19,16,0\n150,0.18,4.14,14.4,Absent,53,23.43,7.71,44,0\n170,2.6,7.22,28.69,Present,71,27.87,37.65,56,1\n134,0,5.63,29.12,Absent,68,32.33,2.02,34,0\n142,0,4.19,18.04,Absent,56,23.65,20.78,42,1\n132,0.1,3.28,10.73,Absent,73,20.42,0,17,0\n136,0,2.28,18.14,Absent,55,22.59,0,17,0\n132,12,4.51,21.93,Absent,61,26.07,64.8,46,1\n166,4.1,4,34.3,Present,32,29.51,8.23,53,0\n138,0,3.96,24.7,Present,53,23.8,0,45,0\n138,2.27,6.41,29.07,Absent,58,30.22,2.93,32,1\n170,0,3.12,37.15,Absent,47,35.42,0,53,0\n128,0,8.41,28.82,Present,60,26.86,0,59,1\n136,1.2,2.78,7.12,Absent,52,22.51,3.41,27,0\n128,0,3.22,26.55,Present,39,26.59,16.71,49,0\n150,14.4,5.04,26.52,Present,60,28.84,0,45,0\n132,8.4,3.57,13.68,Absent,
42,18.75,15.43,59,1\n142,2.4,2.55,23.89,Absent,54,26.09,59.14,37,0\n130,0.05,2.44,28.25,Present,67,30.86,40.32,34,0\n174,3.5,5.26,21.97,Present,36,22.04,8.33,59,1\n114,9.6,2.51,29.18,Absent,49,25.67,40.63,46,0\n162,1.5,2.46,19.39,Present,49,24.32,0,59,1\n174,0,3.27,35.4,Absent,58,37.71,24.95,44,0\n190,5.15,6.03,36.59,Absent,42,30.31,72,50,0\n154,1.4,1.72,18.86,Absent,58,22.67,43.2,59,0\n124,0,2.28,24.86,Present,50,22.24,8.26,38,0\n114,1.2,3.98,14.9,Absent,49,23.79,25.82,26,0\n168,11.4,5.08,26.66,Present,56,27.04,2.61,59,1\n142,3.72,4.24,32.57,Absent,52,24.98,7.61,51,0\n154,0,4.81,28.11,Present,56,25.67,75.77,59,0\n146,4.36,4.31,18.44,Present,47,24.72,10.8,38,0\n166,6,3.02,29.3,Absent,35,24.38,38.06,61,0\n140,8.6,3.9,32.16,Present,52,28.51,11.11,64,1\n136,1.7,3.53,20.13,Absent,56,19.44,14.4,55,0\n156,0,3.47,21.1,Absent,73,28.4,0,36,1\n132,0,6.63,29.58,Present,37,29.41,2.57,62,0\n128,0,2.98,12.59,Absent,65,20.74,2.06,19,0\n106,5.6,3.2,12.3,Absent,49,20.29,0,39,0\n144,0.4,4.64,30.09,Absent,30,27.39,0.74,55,0\n154,0.31,2.33,16.48,Absent,33,24,11.83,17,0\n126,3.1,2.01,32.97,Present,56,28.63,26.74,45,0\n134,6.4,8.49,37.25,Present,56,28.94,10.49,51,1\n152,19.45,4.22,29.81,Absent,28,23.95,0,59,1\n146,1.35,6.39,34.21,Absent,51,26.43,0,59,1\n162,6.94,4.55,33.36,Present,52,27.09,32.06,43,0\n130,7.28,3.56,23.29,Present,20,26.8,51.87,58,1\n138,6,7.24,37.05,Absent,38,28.69,0,59,0\n148,0,5.32,26.71,Present,52,32.21,32.78,27,0\n124,4.2,2.94,27.59,Absent,50,30.31,85.06,30,0\n118,1.62,9.01,21.7,Absent,59,25.89,21.19,40,0\n116,4.28,7.02,19.99,Present,68,23.31,0,52,1\n162,6.3,5.73,22.61,Present,46,20.43,62.54,53,1\n138,0.87,1.87,15.89,Absent,44,26.76,42.99,31,0\n137,1.2,3.14,23.87,Absent,66,24.13,45,37,0\n198,0.52,11.89,27.68,Present,48,28.4,78.99,26,1\n154,4.5,4.75,23.52,Present,43,25.76,0,53,1\n128,5.4,2.36,12.98,Absent,51,18.36,6.69,61,0\n130,0.08,5.59,25.42,Present,50,24.98,6.27,43,1\n162,5.6,4.24,22.53,Absent,29,22.91,5.66,60,0\n120,10.5,2.7,29.87,Present,54,24.5,16.46,49,0\n136,3
.99,2.58,16.38,Present,53,22.41,27.67,36,0\n176,1.2,8.28,36.16,Present,42,27.81,11.6,58,1\n134,11.79,4.01,26.57,Present,38,21.79,38.88,61,1\n122,1.7,5.28,32.23,Present,51,24.08,0,54,0\n134,0.9,3.18,23.66,Present,52,23.26,27.36,58,1\n134,0,2.43,22.24,Absent,52,26.49,41.66,24,0\n136,6.6,6.08,32.74,Absent,64,33.28,2.72,49,0\n132,4.05,5.15,26.51,Present,31,26.67,16.3,50,0\n152,1.68,3.58,25.43,Absent,50,27.03,0,32,0\n132,12.3,5.96,32.79,Present,57,30.12,21.5,62,1\n124,0.4,3.67,25.76,Absent,43,28.08,20.57,34,0\n140,4.2,2.91,28.83,Present,43,24.7,47.52,48,0\n166,0.6,2.42,34.03,Present,53,26.96,54,60,0\n156,3.02,5.35,25.72,Present,53,25.22,28.11,52,1\n132,0.72,4.37,19.54,Absent,48,26.11,49.37,28,0\n150,0,4.99,27.73,Absent,57,30.92,8.33,24,0\n134,0.12,3.4,21.18,Present,33,26.27,14.21,30,0\n126,3.4,4.87,15.16,Present,65,22.01,11.11,38,0\n148,0.5,5.97,32.88,Absent,54,29.27,6.43,42,0\n148,8.2,7.75,34.46,Present,46,26.53,6.04,64,1\n132,6,5.97,25.73,Present,66,24.18,145.29,41,0\n128,1.6,5.41,29.3,Absent,68,29.38,23.97,32,0\n128,5.16,4.9,31.35,Present,57,26.42,0,64,0\n140,0,2.4,27.89,Present,70,30.74,144,29,0\n126,0,5.29,27.64,Absent,25,27.62,2.06,45,0\n114,3.6,4.16,22.58,Absent,60,24.49,65.31,31,0\n118,1.25,4.69,31.58,Present,52,27.16,4.11,53,0\n126,0.96,4.99,29.74,Absent,66,33.35,58.32,38,0\n154,4.5,4.68,39.97,Absent,61,33.17,1.54,64,1\n112,1.44,2.71,22.92,Absent,59,24.81,0,52,0\n140,8,4.42,33.15,Present,47,32.77,66.86,44,0\n140,1.68,11.41,29.54,Present,74,30.75,2.06,38,1\n128,2.6,4.94,21.36,Absent,61,21.3,0,31,0\n126,19.6,6.03,34.99,Absent,49,26.99,55.89,44,0\n160,4.2,6.76,37.99,Present,61,32.91,3.09,54,1\n144,0,4.17,29.63,Present,52,21.83,0,59,0\n148,4.5,10.49,33.27,Absent,50,25.92,2.06,53,1\n146,0,4.92,18.53,Absent,57,24.2,34.97,26,0\n164,5.6,3.17,30.98,Present,44,25.99,43.2,53,1\n130,0.54,3.63,22.03,Present,69,24.34,12.86,39,1\n154,2.4,5.63,42.17,Present,59,35.07,12.86,50,1\n178,0.95,4.75,21.06,Absent,49,23.74,24.69,61,0\n180,3.57,3.57,36.1,Absent,36,26.7,19.95,64,0\n134,12.
5,2.73,39.35,Absent,48,35.58,0,48,0\n142,0,3.54,16.64,Absent,58,25.97,8.36,27,0\n162,7,7.67,34.34,Present,33,30.77,0,62,0\n218,11.2,2.77,30.79,Absent,38,24.86,90.93,48,1\n126,8.75,6.06,32.72,Present,33,27,62.43,55,1\n126,0,3.57,26.01,Absent,61,26.3,7.97,47,0\n134,6.1,4.77,26.08,Absent,47,23.82,1.03,49,0\n132,0,4.17,36.57,Absent,57,30.61,18,49,0\n178,5.5,3.79,23.92,Present,45,21.26,6.17,62,1\n208,5.04,5.19,20.71,Present,52,25.12,24.27,58,1\n160,1.15,10.19,39.71,Absent,31,31.65,20.52,57,0\n116,2.38,5.67,29.01,Present,54,27.26,15.77,51,0\n180,25.01,3.7,38.11,Present,57,30.54,0,61,1\n200,19.2,4.43,40.6,Present,55,32.04,36,60,1\n112,4.2,3.58,27.14,Absent,52,26.83,2.06,40,0\n120,0,3.1,26.97,Absent,41,24.8,0,16,0\n178,20,9.78,33.55,Absent,37,27.29,2.88,62,1\n166,0.8,5.63,36.21,Absent,50,34.72,28.8,60,0\n164,8.2,14.16,36.85,Absent,52,28.5,17.02,55,1\n216,0.92,2.66,19.85,Present,49,20.58,0.51,63,1\n146,6.4,5.62,33.05,Present,57,31.03,0.74,46,0\n134,1.1,3.54,20.41,Present,58,24.54,39.91,39,1\n158,16,5.56,29.35,Absent,36,25.92,58.32,60,0\n176,0,3.14,31.04,Present,45,30.18,4.63,45,0\n132,2.8,4.79,20.47,Present,50,22.15,11.73,48,0\n126,0,4.55,29.18,Absent,48,24.94,36,41,0\n120,5.5,3.51,23.23,Absent,46,22.4,90.31,43,0\n174,0,3.86,21.73,Absent,42,23.37,0,63,0\n150,13.8,5.1,29.45,Present,52,27.92,77.76,55,1\n176,6,3.98,17.2,Present,52,21.07,4.11,61,1\n142,2.2,3.29,22.7,Absent,44,23.66,5.66,42,1\n132,0,3.3,21.61,Absent,42,24.92,32.61,33,0\n142,1.32,7.63,29.98,Present,57,31.16,72.93,33,0\n146,1.16,2.28,34.53,Absent,50,28.71,45,49,0\n132,7.2,3.65,17.16,Present,56,23.25,0,34,0\n120,0,3.57,23.22,Absent,58,27.2,0,32,0\n118,0,3.89,15.96,Absent,65,20.18,0,16,0\n108,0,1.43,26.26,Absent,42,19.38,0,16,0\n136,0,4,19.06,Absent,40,21.94,2.06,16,0\n120,0,2.46,13.39,Absent,47,22.01,0.51,18,0\n132,0,3.55,8.66,Present,61,18.5,3.87,16,0\n136,0,1.77,20.37,Absent,45,21.51,2.06,16,0\n138,0,1.86,18.35,Present,59,25.38,6.51,17,0\n138,0.06,4.15,20.66,Absent,49,22.59,2.49,16,0\n130,1.22,3.3,13.65,Absent,50,
21.4,3.81,31,0\n130,4,2.4,17.42,Absent,60,22.05,0,40,0\n110,0,7.14,28.28,Absent,57,29,0,32,0\n120,0,3.98,13.19,Present,47,21.89,0,16,0\n166,6,8.8,37.89,Absent,39,28.7,43.2,52,0\n134,0.57,4.75,23.07,Absent,67,26.33,0,37,0\n142,3,3.69,25.1,Absent,60,30.08,38.88,27,0\n136,2.8,2.53,9.28,Present,61,20.7,4.55,25,0\n142,0,4.32,25.22,Absent,47,28.92,6.53,34,1\n130,0,1.88,12.51,Present,52,20.28,0,17,0\n124,1.8,3.74,16.64,Present,42,22.26,10.49,20,0\n144,4,5.03,25.78,Present,57,27.55,90,48,1\n136,1.81,3.31,6.74,Absent,63,19.57,24.94,24,0\n120,0,2.77,13.35,Absent,67,23.37,1.03,18,0\n154,5.53,3.2,28.81,Present,61,26.15,42.79,42,0\n124,1.6,7.22,39.68,Present,36,31.5,0,51,1\n146,0.64,4.82,28.02,Absent,60,28.11,8.23,39,1\n128,2.24,2.83,26.48,Absent,48,23.96,47.42,27,1\n170,0.4,4.11,42.06,Present,56,33.1,2.06,57,0\n214,0.4,5.98,31.72,Absent,64,28.45,0,58,0\n182,4.2,4.41,32.1,Absent,52,28.61,18.72,52,1\n108,3,1.59,15.23,Absent,40,20.09,26.64,55,0\n118,5.4,11.61,30.79,Absent,64,27.35,23.97,40,0\n132,0,4.82,33.41,Present,62,14.7,0,46,1"
  },
  {
    "path": "2017/examples/data/heart.txt",
    "content": "\"sbp\"\t\"tobacco\"\t\"ldl\"\t\"adiposity\"\t\"famhist\"\t\"typea\"\t\"obesity\"\t\"alcohol\"\t\"age\"\t\"chd\"\n160\t12\t5.73\t23.11\t\"Present\"\t49\t25.3\t97.2\t52\t1\n144\t0.01\t4.41\t28.61\t\"Absent\"\t55\t28.87\t2.06\t63\t1\n118\t0.08\t3.48\t32.28\t\"Present\"\t52\t29.14\t3.81\t46\t0\n170\t7.5\t6.41\t38.03\t\"Present\"\t51\t31.99\t24.26\t58\t1\n134\t13.6\t3.5\t27.78\t\"Present\"\t60\t25.99\t57.34\t49\t1\n132\t6.2\t6.47\t36.21\t\"Present\"\t62\t30.77\t14.14\t45\t0\n142\t4.05\t3.38\t16.2\t\"Absent\"\t59\t20.81\t2.62\t38\t0\n114\t4.08\t4.59\t14.6\t\"Present\"\t62\t23.11\t6.72\t58\t1\n114\t0\t3.83\t19.4\t\"Present\"\t49\t24.86\t2.49\t29\t0\n132\t0\t5.8\t30.96\t\"Present\"\t69\t30.11\t0\t53\t1\n206\t6\t2.95\t32.27\t\"Absent\"\t72\t26.81\t56.06\t60\t1\n134\t14.1\t4.44\t22.39\t\"Present\"\t65\t23.09\t0\t40\t1\n118\t0\t1.88\t10.05\t\"Absent\"\t59\t21.57\t0\t17\t0\n132\t0\t1.87\t17.21\t\"Absent\"\t49\t23.63\t0.97\t15\t0\n112\t9.65\t2.29\t17.2\t\"Present\"\t54\t23.53\t0.68\t53\t0\n117\t1.53\t2.44\t28.95\t\"Present\"\t35\t25.89\t30.03\t46\t0\n120\t7.5\t15.33\t22\t\"Absent\"\t60\t25.31\t34.49\t49\t0\n146\t10.5\t8.29\t35.36\t\"Present\"\t78\t32.73\t13.89\t53\t1\n158\t2.6\t7.46\t34.07\t\"Present\"\t61\t29.3\t53.28\t62\t1\n124\t14\t6.23\t35.96\t\"Present\"\t45\t30.09\t0\t59\t1\n106\t1.61\t1.74\t12.32\t\"Absent\"\t74\t20.92\t13.37\t20\t1\n132\t7.9\t2.85\t26.5\t\"Present\"\t51\t26.16\t25.71\t44\t0\n150\t0.3\t6.38\t33.99\t\"Present\"\t62\t24.64\t0\t50\t0\n138\t0.6\t3.81\t28.66\t\"Absent\"\t54\t28.7\t1.46\t58\t0\n142\t18.2\t4.34\t24.38\t\"Absent\"\t61\t26.19\t0\t50\t0\n124\t4\t12.42\t31.29\t\"Present\"\t54\t23.23\t2.06\t42\t1\n118\t6\t9.65\t33.91\t\"Absent\"\t60\t38.8\t0\t48\t0\n145\t9.1\t5.24\t27.55\t\"Absent\"\t59\t20.96\t21.6\t61\t1\n144\t4.09\t5.55\t31.4\t\"Present\"\t60\t29.43\t5.55\t56\t0\n146\t0\t6.62\t25.69\t\"Absent\"\t60\t28.07\t8.23\t63\t1\n136\t2.52\t3.95\t25.63\t\"Absent\"\t51\t21.86\t0\t45\t1\n158\t1.02\t6.33\t23.88\t\"Absent\"\t66\t22.13\t24.99\t4
6\t1\n122\t6.6\t5.58\t35.95\t\"Present\"\t53\t28.07\t12.55\t59\t1\n126\t8.75\t6.53\t34.02\t\"Absent\"\t49\t30.25\t0\t41\t1\n148\t5.5\t7.1\t25.31\t\"Absent\"\t56\t29.84\t3.6\t48\t0\n122\t4.26\t4.44\t13.04\t\"Absent\"\t57\t19.49\t48.99\t28\t1\n140\t3.9\t7.32\t25.05\t\"Absent\"\t47\t27.36\t36.77\t32\t0\n110\t4.64\t4.55\t30.46\t\"Absent\"\t48\t30.9\t15.22\t46\t0\n130\t0\t2.82\t19.63\t\"Present\"\t70\t24.86\t0\t29\t0\n136\t11.2\t5.81\t31.85\t\"Present\"\t75\t27.68\t22.94\t58\t1\n118\t0.28\t5.8\t33.7\t\"Present\"\t60\t30.98\t0\t41\t1\n144\t0.04\t3.38\t23.61\t\"Absent\"\t30\t23.75\t4.66\t30\t0\n120\t0\t1.07\t16.02\t\"Absent\"\t47\t22.15\t0\t15\t0\n130\t2.61\t2.72\t22.99\t\"Present\"\t51\t26.29\t13.37\t51\t1\n114\t0\t2.99\t9.74\t\"Absent\"\t54\t46.58\t0\t17\t0\n128\t4.65\t3.31\t22.74\t\"Absent\"\t62\t22.95\t0.51\t48\t0\n162\t7.4\t8.55\t24.65\t\"Present\"\t64\t25.71\t5.86\t58\t1\n116\t1.91\t7.56\t26.45\t\"Present\"\t52\t30.01\t3.6\t33\t1\n114\t0\t1.94\t11.02\t\"Absent\"\t54\t20.17\t38.98\t16\t0\n126\t3.8\t3.88\t31.79\t\"Absent\"\t57\t30.53\t0\t30\t0\n122\t0\t5.75\t30.9\t\"Present\"\t46\t29.01\t4.11\t42\t0\n134\t2.5\t3.66\t30.9\t\"Absent\"\t52\t27.19\t23.66\t49\t0\n152\t0.9\t9.12\t30.23\t\"Absent\"\t56\t28.64\t0.37\t42\t1\n134\t8.08\t1.55\t17.5\t\"Present\"\t56\t22.65\t66.65\t31\t1\n156\t3\t1.82\t27.55\t\"Absent\"\t60\t23.91\t54\t53\t0\n152\t5.99\t7.99\t32.48\t\"Absent\"\t45\t26.57\t100.32\t48\t0\n118\t0\t2.99\t16.17\t\"Absent\"\t49\t23.83\t3.22\t28\t0\n126\t5.1\t2.96\t26.5\t\"Absent\"\t55\t25.52\t12.34\t38\t1\n103\t0.03\t4.21\t18.96\t\"Absent\"\t48\t22.94\t2.62\t18\t0\n121\t0.8\t5.29\t18.95\t\"Present\"\t47\t22.51\t0\t61\t0\n142\t0.28\t1.8\t21.03\t\"Absent\"\t57\t23.65\t2.93\t33\t0\n138\t1.15\t5.09\t27.87\t\"Present\"\t61\t25.65\t2.34\t44\t0\n152\t10.1\t4.71\t24.65\t\"Present\"\t65\t26.21\t24.53\t57\t0\n140\t0.45\t4.3\t24.33\t\"Absent\"\t41\t27.23\t10.08\t38\t0\n130\t0\t1.82\t10.45\t\"Absent\"\t57\t22.07\t2.06\t17\t0\n136\t7.36\t2.19\t28.11\t\"Present\"\t61\t25\t61.71\t54\t0
\n124\t4.82\t3.24\t21.1\t\"Present\"\t48\t28.49\t8.42\t30\t0\n112\t0.41\t1.88\t10.29\t\"Absent\"\t39\t22.08\t20.98\t27\t0\n118\t4.46\t7.27\t29.13\t\"Present\"\t48\t29.01\t11.11\t33\t0\n122\t0\t3.37\t16.1\t\"Absent\"\t67\t21.06\t0\t32\t1\n118\t0\t3.67\t12.13\t\"Absent\"\t51\t19.15\t0.6\t15\t0\n130\t1.72\t2.66\t10.38\t\"Absent\"\t68\t17.81\t11.1\t26\t0\n130\t5.6\t3.37\t24.8\t\"Absent\"\t58\t25.76\t43.2\t36\t0\n126\t0.09\t5.03\t13.27\t\"Present\"\t50\t17.75\t4.63\t20\t0\n128\t0.4\t6.17\t26.35\t\"Absent\"\t64\t27.86\t11.11\t34\t0\n136\t0\t4.12\t17.42\t\"Absent\"\t52\t21.66\t12.86\t40\t0\n134\t0\t5.9\t30.84\t\"Absent\"\t49\t29.16\t0\t55\t0\n140\t0.6\t5.56\t33.39\t\"Present\"\t58\t27.19\t0\t55\t1\n168\t4.5\t6.68\t28.47\t\"Absent\"\t43\t24.25\t24.38\t56\t1\n108\t0.4\t5.91\t22.92\t\"Present\"\t57\t25.72\t72\t39\t0\n114\t3\t7.04\t22.64\t\"Present\"\t55\t22.59\t0\t45\t1\n140\t8.14\t4.93\t42.49\t\"Absent\"\t53\t45.72\t6.43\t53\t1\n148\t4.8\t6.09\t36.55\t\"Present\"\t63\t25.44\t0.88\t55\t1\n148\t12.2\t3.79\t34.15\t\"Absent\"\t57\t26.38\t14.4\t57\t1\n128\t0\t2.43\t13.15\t\"Present\"\t63\t20.75\t0\t17\t0\n130\t0.56\t3.3\t30.86\t\"Absent\"\t49\t27.52\t33.33\t45\t0\n126\t10.5\t4.49\t17.33\t\"Absent\"\t67\t19.37\t0\t49\t1\n140\t0\t5.08\t27.33\t\"Present\"\t41\t27.83\t1.25\t38\t0\n126\t0.9\t5.64\t17.78\t\"Present\"\t55\t21.94\t0\t41\t0\n122\t0.72\t4.04\t32.38\t\"Absent\"\t34\t28.34\t0\t55\t0\n116\t1.03\t2.83\t10.85\t\"Absent\"\t45\t21.59\t1.75\t21\t0\n120\t3.7\t4.02\t39.66\t\"Absent\"\t61\t30.57\t0\t64\t1\n143\t0.46\t2.4\t22.87\t\"Absent\"\t62\t29.17\t15.43\t29\t0\n118\t4\t3.95\t18.96\t\"Absent\"\t54\t25.15\t8.33\t49\t1\n194\t1.7\t6.32\t33.67\t\"Absent\"\t47\t30.16\t0.19\t56\t0\n134\t3\t4.37\t23.07\t\"Absent\"\t56\t20.54\t9.65\t62\t0\n138\t2.16\t4.9\t24.83\t\"Present\"\t39\t26.06\t28.29\t29\t0\n136\t0\t5\t27.58\t\"Present\"\t49\t27.59\t1.47\t39\t0\n122\t3.2\t11.32\t35.36\t\"Present\"\t55\t27.07\t0\t51\t1\n164\t12\t3.91\t19.59\t\"Absent\"\t51\t23.44\t19.75\t39\t0\n136\t8\t7.85\t23.81\
t\"Present\"\t51\t22.69\t2.78\t50\t0\n166\t0.07\t4.03\t29.29\t\"Absent\"\t53\t28.37\t0\t27\t0\n118\t0\t4.34\t30.12\t\"Present\"\t52\t32.18\t3.91\t46\t0\n128\t0.42\t4.6\t26.68\t\"Absent\"\t41\t30.97\t10.33\t31\t0\n118\t1.5\t5.38\t25.84\t\"Absent\"\t64\t28.63\t3.89\t29\t0\n158\t3.6\t2.97\t30.11\t\"Absent\"\t63\t26.64\t108\t64\t0\n108\t1.5\t4.33\t24.99\t\"Absent\"\t66\t22.29\t21.6\t61\t1\n170\t7.6\t5.5\t37.83\t\"Present\"\t42\t37.41\t6.17\t54\t1\n118\t1\t5.76\t22.1\t\"Absent\"\t62\t23.48\t7.71\t42\t0\n124\t0\t3.04\t17.33\t\"Absent\"\t49\t22.04\t0\t18\t0\n114\t0\t8.01\t21.64\t\"Absent\"\t66\t25.51\t2.49\t16\t0\n168\t9\t8.53\t24.48\t\"Present\"\t69\t26.18\t4.63\t54\t1\n134\t2\t3.66\t14.69\t\"Absent\"\t52\t21.03\t2.06\t37\t0\n174\t0\t8.46\t35.1\t\"Present\"\t35\t25.27\t0\t61\t1\n116\t31.2\t3.17\t14.99\t\"Absent\"\t47\t19.4\t49.06\t59\t1\n128\t0\t10.58\t31.81\t\"Present\"\t46\t28.41\t14.66\t48\t0\n140\t4.5\t4.59\t18.01\t\"Absent\"\t63\t21.91\t22.09\t32\t1\n154\t0.7\t5.91\t25\t\"Absent\"\t13\t20.6\t0\t42\t0\n150\t3.5\t6.99\t25.39\t\"Present\"\t50\t23.35\t23.48\t61\t1\n130\t0\t3.92\t25.55\t\"Absent\"\t68\t28.02\t0.68\t27\t0\n128\t2\t6.13\t21.31\t\"Absent\"\t66\t22.86\t11.83\t60\t0\n120\t1.4\t6.25\t20.47\t\"Absent\"\t60\t25.85\t8.51\t28\t0\n120\t0\t5.01\t26.13\t\"Absent\"\t64\t26.21\t12.24\t33\t0\n138\t4.5\t2.85\t30.11\t\"Absent\"\t55\t24.78\t24.89\t56\t1\n153\t7.8\t3.96\t25.73\t\"Absent\"\t54\t25.91\t27.03\t45\t0\n123\t8.6\t11.17\t35.28\t\"Present\"\t70\t33.14\t0\t59\t1\n148\t4.04\t3.99\t20.69\t\"Absent\"\t60\t27.78\t1.75\t28\t0\n136\t3.96\t2.76\t30.28\t\"Present\"\t50\t34.42\t18.51\t38\t0\n134\t8.8\t7.41\t26.84\t\"Absent\"\t35\t29.44\t29.52\t60\t1\n152\t12.18\t4.04\t37.83\t\"Present\"\t63\t34.57\t4.17\t64\t0\n158\t13.5\t5.04\t30.79\t\"Absent\"\t54\t24.79\t21.5\t62\t0\n132\t2\t3.08\t35.39\t\"Absent\"\t45\t31.44\t79.82\t58\t1\n134\t1.5\t3.73\t21.53\t\"Absent\"\t41\t24.7\t11.11\t30\t1\n142\t7.44\t5.52\t33.97\t\"Absent\"\t47\t29.29\t24.27\t54\t0\n134\t6\t3.3\t28.45\t\"Absent\"\
t65\t26.09\t58.11\t40\t0\n122\t4.18\t9.05\t29.27\t\"Present\"\t44\t24.05\t19.34\t52\t1\n116\t2.7\t3.69\t13.52\t\"Absent\"\t55\t21.13\t18.51\t32\t0\n128\t0.5\t3.7\t12.81\t\"Present\"\t66\t21.25\t22.73\t28\t0\n120\t0\t3.68\t12.24\t\"Absent\"\t51\t20.52\t0.51\t20\t0\n124\t0\t3.95\t36.35\t\"Present\"\t59\t32.83\t9.59\t54\t0\n160\t14\t5.9\t37.12\t\"Absent\"\t58\t33.87\t3.52\t54\t1\n130\t2.78\t4.89\t9.39\t\"Present\"\t63\t19.3\t17.47\t25\t1\n128\t2.8\t5.53\t14.29\t\"Absent\"\t64\t24.97\t0.51\t38\t0\n130\t4.5\t5.86\t37.43\t\"Absent\"\t61\t31.21\t32.3\t58\t0\n109\t1.2\t6.14\t29.26\t\"Absent\"\t47\t24.72\t10.46\t40\t0\n144\t0\t3.84\t18.72\t\"Absent\"\t56\t22.1\t4.8\t40\t0\n118\t1.05\t3.16\t12.98\t\"Present\"\t46\t22.09\t16.35\t31\t0\n136\t3.46\t6.38\t32.25\t\"Present\"\t43\t28.73\t3.13\t43\t1\n136\t1.5\t6.06\t26.54\t\"Absent\"\t54\t29.38\t14.5\t33\t1\n124\t15.5\t5.05\t24.06\t\"Absent\"\t46\t23.22\t0\t61\t1\n148\t6\t6.49\t26.47\t\"Absent\"\t48\t24.7\t0\t55\t0\n128\t6.6\t3.58\t20.71\t\"Absent\"\t55\t24.15\t0\t52\t0\n122\t0.28\t4.19\t19.97\t\"Absent\"\t61\t25.63\t0\t24\t0\n108\t0\t2.74\t11.17\t\"Absent\"\t53\t22.61\t0.95\t20\t0\n124\t3.04\t4.8\t19.52\t\"Present\"\t60\t21.78\t147.19\t41\t1\n138\t8.8\t3.12\t22.41\t\"Present\"\t63\t23.33\t120.03\t55\t1\n127\t0\t2.81\t15.7\t\"Absent\"\t42\t22.03\t1.03\t17\t0\n174\t9.45\t5.13\t35.54\t\"Absent\"\t55\t30.71\t59.79\t53\t0\n122\t0\t3.05\t23.51\t\"Absent\"\t46\t25.81\t0\t38\t0\n144\t6.75\t5.45\t29.81\t\"Absent\"\t53\t25.62\t26.23\t43\t1\n126\t1.8\t6.22\t19.71\t\"Absent\"\t65\t24.81\t0.69\t31\t0\n208\t27.4\t3.12\t26.63\t\"Absent\"\t66\t27.45\t33.07\t62\t1\n138\t0\t2.68\t17.04\t\"Absent\"\t42\t22.16\t0\t16\t0\n148\t0\t3.84\t17.26\t\"Absent\"\t70\t20\t0\t21\t0\n122\t0\t3.08\t16.3\t\"Absent\"\t43\t22.13\t0\t16\t0\n132\t7\t3.2\t23.26\t\"Absent\"\t77\t23.64\t23.14\t49\t0\n110\t12.16\t4.99\t28.56\t\"Absent\"\t44\t27.14\t21.6\t55\t1\n160\t1.52\t8.12\t29.3\t\"Present\"\t54\t25.87\t12.86\t43\t1\n126\t0.54\t4.39\t21.13\t\"Present\"\t45\t25.99\t0\t2
5\t0\n162\t5.3\t7.95\t33.58\t\"Present\"\t58\t36.06\t8.23\t48\t0\n194\t2.55\t6.89\t33.88\t\"Present\"\t69\t29.33\t0\t41\t0\n118\t0.75\t2.58\t20.25\t\"Absent\"\t59\t24.46\t0\t32\t0\n124\t0\t4.79\t34.71\t\"Absent\"\t49\t26.09\t9.26\t47\t0\n160\t0\t2.42\t34.46\t\"Absent\"\t48\t29.83\t1.03\t61\t0\n128\t0\t2.51\t29.35\t\"Present\"\t53\t22.05\t1.37\t62\t0\n122\t4\t5.24\t27.89\t\"Present\"\t45\t26.52\t0\t61\t1\n132\t2\t2.7\t21.57\t\"Present\"\t50\t27.95\t9.26\t37\t0\n120\t0\t2.42\t16.66\t\"Absent\"\t46\t20.16\t0\t17\t0\n128\t0.04\t8.22\t28.17\t\"Absent\"\t65\t26.24\t11.73\t24\t0\n108\t15\t4.91\t34.65\t\"Absent\"\t41\t27.96\t14.4\t56\t0\n166\t0\t4.31\t34.27\t\"Absent\"\t45\t30.14\t13.27\t56\t0\n152\t0\t6.06\t41.05\t\"Present\"\t51\t40.34\t0\t51\t0\n170\t4.2\t4.67\t35.45\t\"Present\"\t50\t27.14\t7.92\t60\t1\n156\t4\t2.05\t19.48\t\"Present\"\t50\t21.48\t27.77\t39\t1\n116\t8\t6.73\t28.81\t\"Present\"\t41\t26.74\t40.94\t48\t1\n122\t4.4\t3.18\t11.59\t\"Present\"\t59\t21.94\t0\t33\t1\n150\t20\t6.4\t35.04\t\"Absent\"\t53\t28.88\t8.33\t63\t0\n129\t2.15\t5.17\t27.57\t\"Absent\"\t52\t25.42\t2.06\t39\t0\n134\t4.8\t6.58\t29.89\t\"Present\"\t55\t24.73\t23.66\t63\t0\n126\t0\t5.98\t29.06\t\"Present\"\t56\t25.39\t11.52\t64\t1\n142\t0\t3.72\t25.68\t\"Absent\"\t48\t24.37\t5.25\t40\t1\n128\t0.7\t4.9\t37.42\t\"Present\"\t72\t35.94\t3.09\t49\t1\n102\t0.4\t3.41\t17.22\t\"Present\"\t56\t23.59\t2.06\t39\t1\n130\t0\t4.89\t25.98\t\"Absent\"\t72\t30.42\t14.71\t23\t0\n138\t0.05\t2.79\t10.35\t\"Absent\"\t46\t21.62\t0\t18\t0\n138\t0\t1.96\t11.82\t\"Present\"\t54\t22.01\t8.13\t21\t0\n128\t0\t3.09\t20.57\t\"Absent\"\t54\t25.63\t0.51\t17\t0\n162\t2.92\t3.63\t31.33\t\"Absent\"\t62\t31.59\t18.51\t42\t0\n160\t3\t9.19\t26.47\t\"Present\"\t39\t28.25\t14.4\t54\t1\n148\t0\t4.66\t24.39\t\"Absent\"\t50\t25.26\t4.03\t27\t0\n124\t0.16\t2.44\t16.67\t\"Absent\"\t65\t24.58\t74.91\t23\t0\n136\t3.15\t4.37\t20.22\t\"Present\"\t59\t25.12\t47.16\t31\t1\n134\t2.75\t5.51\t26.17\t\"Absent\"\t57\t29.87\t8.33\t33\t0\n128\t0.73\t3
.97\t23.52\t\"Absent\"\t54\t23.81\t19.2\t64\t0\n122\t3.2\t3.59\t22.49\t\"Present\"\t45\t24.96\t36.17\t58\t0\n152\t3\t4.64\t31.29\t\"Absent\"\t41\t29.34\t4.53\t40\t0\n162\t0\t5.09\t24.6\t\"Present\"\t64\t26.71\t3.81\t18\t0\n124\t4\t6.65\t30.84\t\"Present\"\t54\t28.4\t33.51\t60\t0\n136\t5.8\t5.9\t27.55\t\"Absent\"\t65\t25.71\t14.4\t59\t0\n136\t8.8\t4.26\t32.03\t\"Present\"\t52\t31.44\t34.35\t60\t0\n134\t0.05\t8.03\t27.95\t\"Absent\"\t48\t26.88\t0\t60\t0\n122\t1\t5.88\t34.81\t\"Present\"\t69\t31.27\t15.94\t40\t1\n116\t3\t3.05\t30.31\t\"Absent\"\t41\t23.63\t0.86\t44\t0\n132\t0\t0.98\t21.39\t\"Absent\"\t62\t26.75\t0\t53\t0\n134\t0\t2.4\t21.11\t\"Absent\"\t57\t22.45\t1.37\t18\t0\n160\t7.77\t8.07\t34.8\t\"Absent\"\t64\t31.15\t0\t62\t1\n180\t0.52\t4.23\t16.38\t\"Absent\"\t55\t22.56\t14.77\t45\t1\n124\t0.81\t6.16\t11.61\t\"Absent\"\t35\t21.47\t10.49\t26\t0\n114\t0\t4.97\t9.69\t\"Absent\"\t26\t22.6\t0\t25\t0\n208\t7.4\t7.41\t32.03\t\"Absent\"\t50\t27.62\t7.85\t57\t0\n138\t0\t3.14\t12\t\"Absent\"\t54\t20.28\t0\t16\t0\n164\t0.5\t6.95\t39.64\t\"Present\"\t47\t41.76\t3.81\t46\t1\n144\t2.4\t8.13\t35.61\t\"Absent\"\t46\t27.38\t13.37\t60\t0\n136\t7.5\t7.39\t28.04\t\"Present\"\t50\t25.01\t0\t45\t1\n132\t7.28\t3.52\t12.33\t\"Absent\"\t60\t19.48\t2.06\t56\t0\n143\t5.04\t4.86\t23.59\t\"Absent\"\t58\t24.69\t18.72\t42\t0\n112\t4.46\t7.18\t26.25\t\"Present\"\t69\t27.29\t0\t32\t1\n134\t10\t3.79\t34.72\t\"Absent\"\t42\t28.33\t28.8\t52\t1\n138\t2\t5.11\t31.4\t\"Present\"\t49\t27.25\t2.06\t64\t1\n188\t0\t5.47\t32.44\t\"Present\"\t71\t28.99\t7.41\t50\t1\n110\t2.35\t3.36\t26.72\t\"Present\"\t54\t26.08\t109.8\t58\t1\n136\t13.2\t7.18\t35.95\t\"Absent\"\t48\t29.19\t0\t62\t0\n130\t1.75\t5.46\t34.34\t\"Absent\"\t53\t29.42\t0\t58\t1\n122\t0\t3.76\t24.59\t\"Absent\"\t56\t24.36\t0\t30\t0\n138\t0\t3.24\t27.68\t\"Absent\"\t60\t25.7\t88.66\t29\t0\n130\t18\t4.13\t27.43\t\"Absent\"\t54\t27.44\t0\t51\t1\n126\t5.5\t3.78\t34.15\t\"Absent\"\t55\t28.85\t3.18\t61\t0\n176\t5.76\t4.89\t26.1\t\"Present\"\t46\t27.3\t1
9.44\t57\t0\n122\t0\t5.49\t19.56\t\"Absent\"\t57\t23.12\t14.02\t27\t0\n124\t0\t3.23\t9.64\t\"Absent\"\t59\t22.7\t0\t16\t0\n140\t5.2\t3.58\t29.26\t\"Absent\"\t70\t27.29\t20.17\t45\t1\n128\t6\t4.37\t22.98\t\"Present\"\t50\t26.01\t0\t47\t0\n190\t4.18\t5.05\t24.83\t\"Absent\"\t45\t26.09\t82.85\t41\t0\n144\t0.76\t10.53\t35.66\t\"Absent\"\t63\t34.35\t0\t55\t1\n126\t4.6\t7.4\t31.99\t\"Present\"\t57\t28.67\t0.37\t60\t1\n128\t0\t2.63\t23.88\t\"Absent\"\t45\t21.59\t6.54\t57\t0\n136\t0.4\t3.91\t21.1\t\"Present\"\t63\t22.3\t0\t56\t1\n158\t4\t4.18\t28.61\t\"Present\"\t42\t25.11\t0\t60\t0\n160\t0.6\t6.94\t30.53\t\"Absent\"\t36\t25.68\t1.42\t64\t0\n124\t6\t5.21\t33.02\t\"Present\"\t64\t29.37\t7.61\t58\t1\n158\t6.17\t8.12\t30.75\t\"Absent\"\t46\t27.84\t92.62\t48\t0\n128\t0\t6.34\t11.87\t\"Absent\"\t57\t23.14\t0\t17\t0\n166\t3\t3.82\t26.75\t\"Absent\"\t45\t20.86\t0\t63\t1\n146\t7.5\t7.21\t25.93\t\"Present\"\t55\t22.51\t0.51\t42\t0\n161\t9\t4.65\t15.16\t\"Present\"\t58\t23.76\t43.2\t46\t0\n164\t13.02\t6.26\t29.38\t\"Present\"\t47\t22.75\t37.03\t54\t1\n146\t5.08\t7.03\t27.41\t\"Present\"\t63\t36.46\t24.48\t37\t1\n142\t4.48\t3.57\t19.75\t\"Present\"\t51\t23.54\t3.29\t49\t0\n138\t12\t5.13\t28.34\t\"Absent\"\t59\t24.49\t32.81\t58\t1\n154\t1.8\t7.13\t34.04\t\"Present\"\t52\t35.51\t39.36\t44\t0\n118\t0\t2.39\t12.13\t\"Absent\"\t49\t18.46\t0.26\t17\t1\n124\t0.61\t2.69\t17.15\t\"Present\"\t61\t22.76\t11.55\t20\t0\n124\t1.04\t2.84\t16.42\t\"Present\"\t46\t20.17\t0\t61\t0\n136\t5\t4.19\t23.99\t\"Present\"\t68\t27.8\t25.86\t35\t0\n132\t9.9\t4.63\t27.86\t\"Present\"\t46\t23.39\t0.51\t52\t1\n118\t0.12\t1.96\t20.31\t\"Absent\"\t37\t20.01\t2.42\t18\t0\n118\t0.12\t4.16\t9.37\t\"Absent\"\t57\t19.61\t0\t17\t0\n134\t12\t4.96\t29.79\t\"Absent\"\t53\t24.86\t8.23\t57\t0\n114\t0.1\t3.95\t15.89\t\"Present\"\t57\t20.31\t17.14\t16\t0\n136\t6.8\t7.84\t30.74\t\"Present\"\t58\t26.2\t23.66\t45\t1\n130\t0\t4.16\t39.43\t\"Present\"\t46\t30.01\t0\t55\t1\n136\t2.2\t4.16\t38.02\t\"Absent\"\t65\t37.24\t4.11\t41\t1\n136
\t1.36\t3.16\t14.97\t\"Present\"\t56\t24.98\t7.3\t24\t0\n154\t4.2\t5.59\t25.02\t\"Absent\"\t58\t25.02\t1.54\t43\t0\n108\t0.8\t2.47\t17.53\t\"Absent\"\t47\t22.18\t0\t55\t1\n136\t8.8\t4.69\t36.07\t\"Present\"\t38\t26.56\t2.78\t63\t1\n174\t2.02\t6.57\t31.9\t\"Present\"\t50\t28.75\t11.83\t64\t1\n124\t4.25\t8.22\t30.77\t\"Absent\"\t56\t25.8\t0\t43\t0\n114\t0\t2.63\t9.69\t\"Absent\"\t45\t17.89\t0\t16\t0\n118\t0.12\t3.26\t12.26\t\"Absent\"\t55\t22.65\t0\t16\t0\n106\t1.08\t4.37\t26.08\t\"Absent\"\t67\t24.07\t17.74\t28\t1\n146\t3.6\t3.51\t22.67\t\"Absent\"\t51\t22.29\t43.71\t42\t0\n206\t0\t4.17\t33.23\t\"Absent\"\t69\t27.36\t6.17\t50\t1\n134\t3\t3.17\t17.91\t\"Absent\"\t35\t26.37\t15.12\t27\t0\n148\t15\t4.98\t36.94\t\"Present\"\t72\t31.83\t66.27\t41\t1\n126\t0.21\t3.95\t15.11\t\"Absent\"\t61\t22.17\t2.42\t17\t0\n134\t0\t3.69\t13.92\t\"Absent\"\t43\t27.66\t0\t19\t0\n134\t0.02\t2.8\t18.84\t\"Absent\"\t45\t24.82\t0\t17\t0\n123\t0.05\t4.61\t13.69\t\"Absent\"\t51\t23.23\t2.78\t16\t0\n112\t0.6\t5.28\t25.71\t\"Absent\"\t55\t27.02\t27.77\t38\t1\n112\t0\t1.71\t15.96\t\"Absent\"\t42\t22.03\t3.5\t16\t0\n101\t0.48\t7.26\t13\t\"Absent\"\t50\t19.82\t5.19\t16\t0\n150\t0.18\t4.14\t14.4\t\"Absent\"\t53\t23.43\t7.71\t44\t0\n170\t2.6\t7.22\t28.69\t\"Present\"\t71\t27.87\t37.65\t56\t1\n134\t0\t5.63\t29.12\t\"Absent\"\t68\t32.33\t2.02\t34\t0\n142\t0\t4.19\t18.04\t\"Absent\"\t56\t23.65\t20.78\t42\t1\n132\t0.1\t3.28\t10.73\t\"Absent\"\t73\t20.42\t0\t17\t0\n136\t0\t2.28\t18.14\t\"Absent\"\t55\t22.59\t0\t17\t0\n132\t12\t4.51\t21.93\t\"Absent\"\t61\t26.07\t64.8\t46\t1\n166\t4.1\t4\t34.3\t\"Present\"\t32\t29.51\t8.23\t53\t0\n138\t0\t3.96\t24.7\t\"Present\"\t53\t23.8\t0\t45\t0\n138\t2.27\t6.41\t29.07\t\"Absent\"\t58\t30.22\t2.93\t32\t1\n170\t0\t3.12\t37.15\t\"Absent\"\t47\t35.42\t0\t53\t0\n128\t0\t8.41\t28.82\t\"Present\"\t60\t26.86\t0\t59\t1\n136\t1.2\t2.78\t7.12\t\"Absent\"\t52\t22.51\t3.41\t27\t0\n128\t0\t3.22\t26.55\t\"Present\"\t39\t26.59\t16.71\t49\t0\n150\t14.4\t5.04\t26.52\t\"Present\"\t60\t28.
84\t0\t45\t0\n132\t8.4\t3.57\t13.68\t\"Absent\"\t42\t18.75\t15.43\t59\t1\n142\t2.4\t2.55\t23.89\t\"Absent\"\t54\t26.09\t59.14\t37\t0\n130\t0.05\t2.44\t28.25\t\"Present\"\t67\t30.86\t40.32\t34\t0\n174\t3.5\t5.26\t21.97\t\"Present\"\t36\t22.04\t8.33\t59\t1\n114\t9.6\t2.51\t29.18\t\"Absent\"\t49\t25.67\t40.63\t46\t0\n162\t1.5\t2.46\t19.39\t\"Present\"\t49\t24.32\t0\t59\t1\n174\t0\t3.27\t35.4\t\"Absent\"\t58\t37.71\t24.95\t44\t0\n190\t5.15\t6.03\t36.59\t\"Absent\"\t42\t30.31\t72\t50\t0\n154\t1.4\t1.72\t18.86\t\"Absent\"\t58\t22.67\t43.2\t59\t0\n124\t0\t2.28\t24.86\t\"Present\"\t50\t22.24\t8.26\t38\t0\n114\t1.2\t3.98\t14.9\t\"Absent\"\t49\t23.79\t25.82\t26\t0\n168\t11.4\t5.08\t26.66\t\"Present\"\t56\t27.04\t2.61\t59\t1\n142\t3.72\t4.24\t32.57\t\"Absent\"\t52\t24.98\t7.61\t51\t0\n154\t0\t4.81\t28.11\t\"Present\"\t56\t25.67\t75.77\t59\t0\n146\t4.36\t4.31\t18.44\t\"Present\"\t47\t24.72\t10.8\t38\t0\n166\t6\t3.02\t29.3\t\"Absent\"\t35\t24.38\t38.06\t61\t0\n140\t8.6\t3.9\t32.16\t\"Present\"\t52\t28.51\t11.11\t64\t1\n136\t1.7\t3.53\t20.13\t\"Absent\"\t56\t19.44\t14.4\t55\t0\n156\t0\t3.47\t21.1\t\"Absent\"\t73\t28.4\t0\t36\t1\n132\t0\t6.63\t29.58\t\"Present\"\t37\t29.41\t2.57\t62\t0\n128\t0\t2.98\t12.59\t\"Absent\"\t65\t20.74\t2.06\t19\t0\n106\t5.6\t3.2\t12.3\t\"Absent\"\t49\t20.29\t0\t39\t0\n144\t0.4\t4.64\t30.09\t\"Absent\"\t30\t27.39\t0.74\t55\t0\n154\t0.31\t2.33\t16.48\t\"Absent\"\t33\t24\t11.83\t17\t0\n126\t3.1\t2.01\t32.97\t\"Present\"\t56\t28.63\t26.74\t45\t0\n134\t6.4\t8.49\t37.25\t\"Present\"\t56\t28.94\t10.49\t51\t1\n152\t19.45\t4.22\t29.81\t\"Absent\"\t28\t23.95\t0\t59\t1\n146\t1.35\t6.39\t34.21\t\"Absent\"\t51\t26.43\t0\t59\t1\n162\t6.94\t4.55\t33.36\t\"Present\"\t52\t27.09\t32.06\t43\t0\n130\t7.28\t3.56\t23.29\t\"Present\"\t20\t26.8\t51.87\t58\t1\n138\t6\t7.24\t37.05\t\"Absent\"\t38\t28.69\t0\t59\t0\n148\t0\t5.32\t26.71\t\"Present\"\t52\t32.21\t32.78\t27\t0\n124\t4.2\t2.94\t27.59\t\"Absent\"\t50\t30.31\t85.06\t30\t0\n118\t1.62\t9.01\t21.7\t\"Absent\"\t59\t25.89\t21
.19\t40\t0\n116\t4.28\t7.02\t19.99\t\"Present\"\t68\t23.31\t0\t52\t1\n162\t6.3\t5.73\t22.61\t\"Present\"\t46\t20.43\t62.54\t53\t1\n138\t0.87\t1.87\t15.89\t\"Absent\"\t44\t26.76\t42.99\t31\t0\n137\t1.2\t3.14\t23.87\t\"Absent\"\t66\t24.13\t45\t37\t0\n198\t0.52\t11.89\t27.68\t\"Present\"\t48\t28.4\t78.99\t26\t1\n154\t4.5\t4.75\t23.52\t\"Present\"\t43\t25.76\t0\t53\t1\n128\t5.4\t2.36\t12.98\t\"Absent\"\t51\t18.36\t6.69\t61\t0\n130\t0.08\t5.59\t25.42\t\"Present\"\t50\t24.98\t6.27\t43\t1\n162\t5.6\t4.24\t22.53\t\"Absent\"\t29\t22.91\t5.66\t60\t0\n120\t10.5\t2.7\t29.87\t\"Present\"\t54\t24.5\t16.46\t49\t0\n136\t3.99\t2.58\t16.38\t\"Present\"\t53\t22.41\t27.67\t36\t0\n176\t1.2\t8.28\t36.16\t\"Present\"\t42\t27.81\t11.6\t58\t1\n134\t11.79\t4.01\t26.57\t\"Present\"\t38\t21.79\t38.88\t61\t1\n122\t1.7\t5.28\t32.23\t\"Present\"\t51\t24.08\t0\t54\t0\n134\t0.9\t3.18\t23.66\t\"Present\"\t52\t23.26\t27.36\t58\t1\n134\t0\t2.43\t22.24\t\"Absent\"\t52\t26.49\t41.66\t24\t0\n136\t6.6\t6.08\t32.74\t\"Absent\"\t64\t33.28\t2.72\t49\t0\n132\t4.05\t5.15\t26.51\t\"Present\"\t31\t26.67\t16.3\t50\t0\n152\t1.68\t3.58\t25.43\t\"Absent\"\t50\t27.03\t0\t32\t0\n132\t12.3\t5.96\t32.79\t\"Present\"\t57\t30.12\t21.5\t62\t1\n124\t0.4\t3.67\t25.76\t\"Absent\"\t43\t28.08\t20.57\t34\t0\n140\t4.2\t2.91\t28.83\t\"Present\"\t43\t24.7\t47.52\t48\t0\n166\t0.6\t2.42\t34.03\t\"Present\"\t53\t26.96\t54\t60\t0\n156\t3.02\t5.35\t25.72\t\"Present\"\t53\t25.22\t28.11\t52\t1\n132\t0.72\t4.37\t19.54\t\"Absent\"\t48\t26.11\t49.37\t28\t0\n150\t0\t4.99\t27.73\t\"Absent\"\t57\t30.92\t8.33\t24\t0\n134\t0.12\t3.4\t21.18\t\"Present\"\t33\t26.27\t14.21\t30\t0\n126\t3.4\t4.87\t15.16\t\"Present\"\t65\t22.01\t11.11\t38\t0\n148\t0.5\t5.97\t32.88\t\"Absent\"\t54\t29.27\t6.43\t42\t0\n148\t8.2\t7.75\t34.46\t\"Present\"\t46\t26.53\t6.04\t64\t1\n132\t6\t5.97\t25.73\t\"Present\"\t66\t24.18\t145.29\t41\t0\n128\t1.6\t5.41\t29.3\t\"Absent\"\t68\t29.38\t23.97\t32\t0\n128\t5.16\t4.9\t31.35\t\"Present\"\t57\t26.42\t0\t64\t0\n140\t0\t2.4\t27.89\
t\"Present\"\t70\t30.74\t144\t29\t0\n126\t0\t5.29\t27.64\t\"Absent\"\t25\t27.62\t2.06\t45\t0\n114\t3.6\t4.16\t22.58\t\"Absent\"\t60\t24.49\t65.31\t31\t0\n118\t1.25\t4.69\t31.58\t\"Present\"\t52\t27.16\t4.11\t53\t0\n126\t0.96\t4.99\t29.74\t\"Absent\"\t66\t33.35\t58.32\t38\t0\n154\t4.5\t4.68\t39.97\t\"Absent\"\t61\t33.17\t1.54\t64\t1\n112\t1.44\t2.71\t22.92\t\"Absent\"\t59\t24.81\t0\t52\t0\n140\t8\t4.42\t33.15\t\"Present\"\t47\t32.77\t66.86\t44\t0\n140\t1.68\t11.41\t29.54\t\"Present\"\t74\t30.75\t2.06\t38\t1\n128\t2.6\t4.94\t21.36\t\"Absent\"\t61\t21.3\t0\t31\t0\n126\t19.6\t6.03\t34.99\t\"Absent\"\t49\t26.99\t55.89\t44\t0\n160\t4.2\t6.76\t37.99\t\"Present\"\t61\t32.91\t3.09\t54\t1\n144\t0\t4.17\t29.63\t\"Present\"\t52\t21.83\t0\t59\t0\n148\t4.5\t10.49\t33.27\t\"Absent\"\t50\t25.92\t2.06\t53\t1\n146\t0\t4.92\t18.53\t\"Absent\"\t57\t24.2\t34.97\t26\t0\n164\t5.6\t3.17\t30.98\t\"Present\"\t44\t25.99\t43.2\t53\t1\n130\t0.54\t3.63\t22.03\t\"Present\"\t69\t24.34\t12.86\t39\t1\n154\t2.4\t5.63\t42.17\t\"Present\"\t59\t35.07\t12.86\t50\t1\n178\t0.95\t4.75\t21.06\t\"Absent\"\t49\t23.74\t24.69\t61\t0\n180\t3.57\t3.57\t36.1\t\"Absent\"\t36\t26.7\t19.95\t64\t0\n134\t12.5\t2.73\t39.35\t\"Absent\"\t48\t35.58\t0\t48\t0\n142\t0\t3.54\t16.64\t\"Absent\"\t58\t25.97\t8.36\t27\t0\n162\t7\t7.67\t34.34\t\"Present\"\t33\t30.77\t0\t62\t0\n218\t11.2\t2.77\t30.79\t\"Absent\"\t38\t24.86\t90.93\t48\t1\n126\t8.75\t6.06\t32.72\t\"Present\"\t33\t27\t62.43\t55\t1\n126\t0\t3.57\t26.01\t\"Absent\"\t61\t26.3\t7.97\t47\t0\n134\t6.1\t4.77\t26.08\t\"Absent\"\t47\t23.82\t1.03\t49\t0\n132\t0\t4.17\t36.57\t\"Absent\"\t57\t30.61\t18\t49\t0\n178\t5.5\t3.79\t23.92\t\"Present\"\t45\t21.26\t6.17\t62\t1\n208\t5.04\t5.19\t20.71\t\"Present\"\t52\t25.12\t24.27\t58\t1\n160\t1.15\t10.19\t39.71\t\"Absent\"\t31\t31.65\t20.52\t57\t0\n116\t2.38\t5.67\t29.01\t\"Present\"\t54\t27.26\t15.77\t51\t0\n180\t25.01\t3.7\t38.11\t\"Present\"\t57\t30.54\t0\t61\t1\n200\t19.2\t4.43\t40.6\t\"Present\"\t55\t32.04\t36\t60\t1\n112\t4.2\t3.58\
t27.14\t\"Absent\"\t52\t26.83\t2.06\t40\t0\n120\t0\t3.1\t26.97\t\"Absent\"\t41\t24.8\t0\t16\t0\n178\t20\t9.78\t33.55\t\"Absent\"\t37\t27.29\t2.88\t62\t1\n166\t0.8\t5.63\t36.21\t\"Absent\"\t50\t34.72\t28.8\t60\t0\n164\t8.2\t14.16\t36.85\t\"Absent\"\t52\t28.5\t17.02\t55\t1\n216\t0.92\t2.66\t19.85\t\"Present\"\t49\t20.58\t0.51\t63\t1\n146\t6.4\t5.62\t33.05\t\"Present\"\t57\t31.03\t0.74\t46\t0\n134\t1.1\t3.54\t20.41\t\"Present\"\t58\t24.54\t39.91\t39\t1\n158\t16\t5.56\t29.35\t\"Absent\"\t36\t25.92\t58.32\t60\t0\n176\t0\t3.14\t31.04\t\"Present\"\t45\t30.18\t4.63\t45\t0\n132\t2.8\t4.79\t20.47\t\"Present\"\t50\t22.15\t11.73\t48\t0\n126\t0\t4.55\t29.18\t\"Absent\"\t48\t24.94\t36\t41\t0\n120\t5.5\t3.51\t23.23\t\"Absent\"\t46\t22.4\t90.31\t43\t0\n174\t0\t3.86\t21.73\t\"Absent\"\t42\t23.37\t0\t63\t0\n150\t13.8\t5.1\t29.45\t\"Present\"\t52\t27.92\t77.76\t55\t1\n176\t6\t3.98\t17.2\t\"Present\"\t52\t21.07\t4.11\t61\t1\n142\t2.2\t3.29\t22.7\t\"Absent\"\t44\t23.66\t5.66\t42\t1\n132\t0\t3.3\t21.61\t\"Absent\"\t42\t24.92\t32.61\t33\t0\n142\t1.32\t7.63\t29.98\t\"Present\"\t57\t31.16\t72.93\t33\t0\n146\t1.16\t2.28\t34.53\t\"Absent\"\t50\t28.71\t45\t49\t0\n132\t7.2\t3.65\t17.16\t\"Present\"\t56\t23.25\t0\t34\t0\n120\t0\t3.57\t23.22\t\"Absent\"\t58\t27.2\t0\t32\t0\n118\t0\t3.89\t15.96\t\"Absent\"\t65\t20.18\t0\t16\t0\n108\t0\t1.43\t26.26\t\"Absent\"\t42\t19.38\t0\t16\t0\n136\t0\t4\t19.06\t\"Absent\"\t40\t21.94\t2.06\t16\t0\n120\t0\t2.46\t13.39\t\"Absent\"\t47\t22.01\t0.51\t18\t0\n132\t0\t3.55\t8.66\t\"Present\"\t61\t18.5\t3.87\t16\t0\n136\t0\t1.77\t20.37\t\"Absent\"\t45\t21.51\t2.06\t16\t0\n138\t0\t1.86\t18.35\t\"Present\"\t59\t25.38\t6.51\t17\t0\n138\t0.06\t4.15\t20.66\t\"Absent\"\t49\t22.59\t2.49\t16\t0\n130\t1.22\t3.3\t13.65\t\"Absent\"\t50\t21.4\t3.81\t31\t0\n130\t4\t2.4\t17.42\t\"Absent\"\t60\t22.05\t0\t40\t0\n110\t0\t7.14\t28.28\t\"Absent\"\t57\t29\t0\t32\t0\n120\t0\t3.98\t13.19\t\"Present\"\t47\t21.89\t0\t16\t0\n166\t6\t8.8\t37.89\t\"Absent\"\t39\t28.7\t43.2\t52\t0\n134\t0.57\t4.7
5\t23.07\t\"Absent\"\t67\t26.33\t0\t37\t0\n142\t3\t3.69\t25.1\t\"Absent\"\t60\t30.08\t38.88\t27\t0\n136\t2.8\t2.53\t9.28\t\"Present\"\t61\t20.7\t4.55\t25\t0\n142\t0\t4.32\t25.22\t\"Absent\"\t47\t28.92\t6.53\t34\t1\n130\t0\t1.88\t12.51\t\"Present\"\t52\t20.28\t0\t17\t0\n124\t1.8\t3.74\t16.64\t\"Present\"\t42\t22.26\t10.49\t20\t0\n144\t4\t5.03\t25.78\t\"Present\"\t57\t27.55\t90\t48\t1\n136\t1.81\t3.31\t6.74\t\"Absent\"\t63\t19.57\t24.94\t24\t0\n120\t0\t2.77\t13.35\t\"Absent\"\t67\t23.37\t1.03\t18\t0\n154\t5.53\t3.2\t28.81\t\"Present\"\t61\t26.15\t42.79\t42\t0\n124\t1.6\t7.22\t39.68\t\"Present\"\t36\t31.5\t0\t51\t1\n146\t0.64\t4.82\t28.02\t\"Absent\"\t60\t28.11\t8.23\t39\t1\n128\t2.24\t2.83\t26.48\t\"Absent\"\t48\t23.96\t47.42\t27\t1\n170\t0.4\t4.11\t42.06\t\"Present\"\t56\t33.1\t2.06\t57\t0\n214\t0.4\t5.98\t31.72\t\"Absent\"\t64\t28.45\t0\t58\t0\n182\t4.2\t4.41\t32.1\t\"Absent\"\t52\t28.61\t18.72\t52\t1\n108\t3\t1.59\t15.23\t\"Absent\"\t40\t20.09\t26.64\t55\t0\n118\t5.4\t11.61\t30.79\t\"Absent\"\t64\t27.35\t23.97\t40\t0\n132\t0\t4.82\t33.41\t\"Present\"\t62\t14.7\t0\t46\t1"
  },
  {
    "path": "2017/examples/deepdream/deepdream_exercise.py",
    "content": "\"\"\"DeepDream.\n\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os.path\nimport zipfile\n\nimport numpy as np\nimport PIL.Image\n\nimport tensorflow as tf\n\nFLAGS = tf.app.flags.FLAGS\n\n\ntf.app.flags.DEFINE_string('data_dir',\n                           '/tmp/inception/',\n                           'Directory for storing Inception network.')\n\ntf.app.flags.DEFINE_string('jpeg_file',\n                           'output.jpg',\n                           'Where to save the resulting JPEG.')\n\n\ndef get_layer(layer):\n  \"\"\"Helper for getting layer output Tensor in model Graph.\n\n  Args:\n   layer: string, layer name\n\n  Returns:\n    Tensor for that layer.\n  \"\"\"\n  graph = tf.get_default_graph()\n  return graph.get_tensor_by_name('import/%s:0' % layer)\n\n\ndef maybe_download(data_dir):\n  \"\"\"Maybe download pretrained Inception network.\n\n  Args:\n    data_dir: string, path to data\n  \"\"\"\n  url = ('https://storage.googleapis.com/download.tensorflow.org/models/'\n         'inception5h.zip')\n  basename = 'inception5h.zip'\n  local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download(\n      basename, data_dir, url)\n\n  # Uncompress the pretrained Inception network.\n  print('Extracting', local_file)\n  zip_ref = zipfile.ZipFile(local_file, 'r')\n  zip_ref.extractall(FLAGS.data_dir)\n  zip_ref.close()\n\n\ndef normalize_image(image):\n  \"\"\"Stretch the range and prepare the image for saving as a JPEG.\n\n  Args:\n    image: numpy array\n\n  Returns:\n    numpy array of image in uint8\n  \"\"\"\n  # Clip to [0, 1] and then convert to uint8.\n  image = np.clip(image, 0, 1)\n  image = np.uint8(image * 255)\n  return image\n\n\ndef save_jpeg(jpeg_file, image):\n  pil_image = PIL.Image.fromarray(image)\n  pil_image.save(jpeg_file)\n  print('Saved to file: ', jpeg_file)\n\n\ndef main(unused_argv):\n  # Maybe download and uncompress 
pretrained Inception network.\n  maybe_download(FLAGS.data_dir)\n\n  model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb')\n\n  # Load the pretrained Inception model as a GraphDef.\n  with tf.gfile.FastGFile(model_fn, 'rb') as f:\n    graph_def = tf.GraphDef()\n    graph_def.ParseFromString(f.read())\n\n  with tf.Graph().as_default():\n    # Input for the network.\n    input_image = tf.placeholder(np.float32, name='input')\n    pixel_mean = 117.0\n    input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0)\n    tf.import_graph_def(graph_def, {'input': input_preprocessed})\n\n    # Grab a list of the names of Tensor's that are the output of convolutions.\n    graph = tf.get_default_graph()\n    layers = [op.name for op in graph.get_operations()\n              if op.type == 'Conv2D' and 'import/' in op.name]\n    feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1])\n                    for name in layers]\n    # print('Layers available: %s' % ','.join(layers))\n    print('Number of layers', len(layers))\n    print('Number of features:', sum(feature_nums))\n\n    # Pick an internal layer and node to visualize.\n    # Note that we use outputs before applying the ReLU nonlinearity to\n    # have non-zero gradients for features with negative initial activations.\n    layer = 'mixed4d_3x3_bottleneck_pre_relu'\n    channel = 139\n    layer_channel = get_layer(layer)[:, :, :, channel]\n    print('layer %s, channel %d: %s' % (layer, channel, layer_channel))\n\n    # Define the optimization as the average across all spatial locations.\n    score = tf.reduce_mean(layer_channel)\n\n    # Automatic differentiation with TensorFlow. 
Magic!\n    input_gradient = tf.gradients(score, input_image)[0]\n\n    # Employ random noise as an image.\n    noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0\n    image = noise_image.copy()\n\n    ################################################################\n    # EXERCISE: Implement the Deep Dream algorithm here!\n    ################################################################\n\n  # Save the image.\n  stddev = 0.1\n  image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5\n  image = normalize_image(image)\n  save_jpeg(FLAGS.jpeg_file, image)\n\n\nif __name__ == '__main__':\n  tf.app.run()"
  },
  {
    "path": "2017/examples/deepdream/deepdream_solution.py",
    "content": "\"\"\"DeepDream.\n\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os.path\nimport zipfile\n\nimport sys\nsys.path.extend(['', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/Users/shlens/Desktop/Neural-Art/homebrew/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages', '/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages/gtk-2.0', '/Users/shlens/Desktop/Neural-Art/homebrew/lib/python2.7/site-packages/gtk-2.0'])\n\n\nimport numpy as np\nimport PIL.Image\n\nimport tensorflow as tf\n\nFLAGS = tf.app.flags.FLAGS\n\n\ntf.app.flags.DEFINE_string('data_dir',\n                           '/tmp/inception/',\n                           'Directory for storing Inception network.')\n\ntf.app.flags.DEFINE_string('jpeg_file',\n                           'output.jpg',\n                           'Where to save the resulting JPEG.')\n\n\ndef get_layer(layer):\n  \"\"\"Helper for getting layer output Tensor in 
model Graph.\n\n  Args:\n   layer: string, layer name\n\n  Returns:\n    Tensor for that layer.\n  \"\"\"\n  graph = tf.get_default_graph()\n  return graph.get_tensor_by_name('import/%s:0' % layer)\n\n\ndef maybe_download(data_dir):\n  \"\"\"Maybe download pretrained Inception network.\n\n  Args:\n    data_dir: string, path to data\n  \"\"\"\n  url = ('https://storage.googleapis.com/download.tensorflow.org/models/'\n         'inception5h.zip')\n  basename = 'inception5h.zip'\n  local_file = tf.contrib.learn.python.learn.datasets.base.maybe_download(\n      basename, data_dir, url)\n\n  # Uncompress the pretrained Inception network.\n  print('Extracting', local_file)\n  zip_ref = zipfile.ZipFile(local_file, 'r')\n  zip_ref.extractall(FLAGS.data_dir)\n  zip_ref.close()\n\n\ndef normalize_image(image):\n  \"\"\"Stretch the range and prepare the image for saving as a JPEG.\n\n  Args:\n    image: numpy array\n\n  Returns:\n    numpy array of image in uint8\n  \"\"\"\n  # Clip to [0, 1] and then convert to uint8.\n  image = np.clip(image, 0, 1)\n  image = np.uint8(image * 255)\n  return image\n\n\ndef save_jpeg(jpeg_file, image):\n  pil_image = PIL.Image.fromarray(image)\n  pil_image.save(jpeg_file)\n  print('Saved to file: ', jpeg_file)\n\n\ndef main(unused_argv):\n  # Maybe download and uncompress pretrained Inception network.\n  maybe_download(FLAGS.data_dir)\n\n  model_fn = os.path.join(FLAGS.data_dir, 'tensorflow_inception_graph.pb')\n\n  # Load the pretrained Inception model as a GraphDef.\n  with tf.gfile.FastGFile(model_fn, 'rb') as f:\n    graph_def = tf.GraphDef()\n    graph_def.ParseFromString(f.read())\n\n  with tf.Graph().as_default():\n    # Input for the network.\n    input_image = tf.placeholder(np.float32, name='input')\n    pixel_mean = 117.0\n    input_preprocessed = tf.expand_dims(input_image - pixel_mean, 0)\n    tf.import_graph_def(graph_def, {'input': input_preprocessed})\n\n    # Grab a list of the names of Tensor's that are the output of 
convolutions.\n    graph = tf.get_default_graph()\n    layers = [op.name for op in graph.get_operations()\n              if op.type == 'Conv2D' and 'import/' in op.name]\n    feature_nums = [int(graph.get_tensor_by_name(name+':0').get_shape()[-1])\n                    for name in layers]\n    # print('Layers available: %s' % ','.join(layers))\n    print('Number of layers', len(layers))\n    print('Number of features:', sum(feature_nums))\n\n    # Pick an internal layer and node to visualize.\n    # Note that we use outputs before applying the ReLU nonlinearity to\n    # have non-zero gradients for features with negative initial activations.\n    layer = 'mixed4d_3x3_bottleneck_pre_relu'\n    channel = 139\n    layer_channel = get_layer(layer)[:, :, :, channel]\n    print('layer %s, channel %d: %s' % (layer, channel, layer_channel))\n\n    # Define the optimization as the average across all spatial locations.\n    score = tf.reduce_mean(layer_channel)\n\n    # Automatic differentiation with TensorFlow. 
Magic!\n    input_gradient = tf.gradients(score, input_image)[0]\n\n    # Employ random noise as an image.\n    noise_image = np.random.uniform(size=(224, 224, 3)) + 100.0\n    image = noise_image.copy()\n\n    ################################################################\n    ### BEGIN SOLUTION #####\n    ################################################################\n    step_scale = 1.0\n    num_iter = 20\n    with tf.Session() as sess:\n      # Gradient ascent: repeatedly step the image in the direction that\n      # increases the score. range() works in both Python 2 and Python 3.\n      for i in range(num_iter):\n        image_gradient, score_value = sess.run([input_gradient, score], {input_image: image})\n        # Normalize the gradient so the same step size works regardless of its magnitude.\n        image_gradient /= image_gradient.std() + 1e-8\n        image += image_gradient * step_scale\n        print('At step = %d, score = %.3f' % (i, score_value))\n\n  # Save the image.\n  stddev = 0.1\n  image = (image - image.mean()) / max(image.std(), 1e-4) * stddev + 0.5\n  image = normalize_image(image)\n  save_jpeg(FLAGS.jpeg_file, image)\n  ##################################################################\n  ### END SOLUTION #####\n  ##################################################################\n\n\nif __name__ == '__main__':\n  tf.app.run()"
  },
  {
    "path": "2017/examples/kernels.py",
    "content": "import numpy as np\nimport tensorflow as tf\n\na = np.zeros([3, 3, 3, 3])\na[1, 1, :, :] = 0.25\na[0, 1, :, :] = 0.125\na[1, 0, :, :] = 0.125\na[2, 1, :, :] = 0.125\na[1, 2, :, :] = 0.125\na[0, 0, :, :] = 0.0625\na[0, 2, :, :] = 0.0625\na[2, 0, :, :] = 0.0625\na[2, 2, :, :] = 0.0625\n\nBLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\n# a[1, 1, :, :] = 0.25\n# a[0, 1, :, :] = 0.125\n# a[1, 0, :, :] = 0.125\n# a[2, 1, :, :] = 0.125\n# a[1, 2, :, :] = 0.125\n# a[0, 0, :, :] = 0.0625\n# a[0, 2, :, :] = 0.0625\n# a[2, 0, :, :] = 0.0625\n# a[2, 2, :, :] = 0.0625\na[1, 1, :, :] = 1.0\na[0, 1, :, :] = 1.0\na[1, 0, :, :] = 1.0\na[2, 1, :, :] = 1.0\na[1, 2, :, :] = 1.0\na[0, 0, :, :] = 1.0\na[0, 2, :, :] = 1.0\na[2, 0, :, :] = 1.0\na[2, 2, :, :] = 1.0\nBLUR_FILTER = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[1, 1, :, :] = 5\na[0, 1, :, :] = -1\na[1, 0, :, :] = -1\na[2, 1, :, :] = -1\na[1, 2, :, :] = -1\n\nSHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[1, 1, :, :] = 5\na[0, 1, :, :] = -1\na[1, 0, :, :] = -1\na[2, 1, :, :] = -1\na[1, 2, :, :] = -1\n\nSHARPEN_FILTER = tf.constant(a, dtype=tf.float32)\n\n# a = np.zeros([3, 3, 3, 3])\n# a[:, :, :, :] = -1\n# a[1, 1, :, :] = 8\n\n# EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\nEDGE_FILTER_RGB = tf.constant([\n\t\t\t[[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]],\n            [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]],\n\t\t\t[[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]]\n])\n\na = np.zeros([3, 3, 1, 1])\n# a[:, :, :, :] = -1\n# a[1, 1, :, :] = 8\na[0, 
1, :, :] = -1\na[1, 0, :, :] = -1\na[1, 2, :, :] = -1\na[2, 1, :, :] = -1\na[1, 1, :, :] = 4\n\nEDGE_FILTER = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[0, :, :, :] = 1\na[0, 1, :, :] = 2 # originally 2\na[2, :, :, :] = -1\na[2, 1, :, :] = -2\n\nTOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[0, :, :, :] = 1\na[0, 1, :, :] = 2 # originally 2\na[2, :, :, :] = -1\na[2, 1, :, :] = -2\n\nTOP_SOBEL = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[0, 0, :, :] = -2\na[0, 1, :, :] = -1 \na[1, 0, :, :] = -1\na[1, 1, :, :] = 1\na[1, 2, :, :] = 1\na[2, 1, :, :] = 1\na[2, 2, :, :] = 2\n\nEMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[0, 0, :, :] = -2\na[0, 1, :, :] = -1 \na[1, 0, :, :] = -1\na[1, 1, :, :] = 1\na[1, 2, :, :] = 1\na[2, 1, :, :] = 1\na[2, 2, :, :] = 2\nEMBOSS_FILTER = tf.constant(a, dtype=tf.float32)"
  },
  {
    "path": "2017/examples/process_data.py",
    "content": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nfrom collections import Counter\nimport random\nimport os\nimport sys\nsys.path.append('..')\nimport zipfile\n\nimport numpy as np\nfrom six.moves import urllib\nimport tensorflow as tf\n\nimport utils\n\n# Parameters for downloading data\nDOWNLOAD_URL = 'http://mattmahoney.net/dc/'\nEXPECTED_BYTES = 31344016\nDATA_FOLDER = 'data/'\nFILE_NAME = 'text8.zip'\n\ndef download(file_name, expected_bytes):\n    \"\"\" Download the dataset text8 if it's not already downloaded \"\"\"\n    file_path = DATA_FOLDER + file_name\n    if os.path.exists(file_path):\n        print(\"Dataset ready\")\n        return file_path\n    file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path)\n    file_stat = os.stat(file_path)\n    if file_stat.st_size == expected_bytes:\n        print('Successfully downloaded the file', file_name)\n    else:\n        raise Exception('File ' + file_name +\n                        ' might be corrupted. 
You should try downloading it with a browser.')\n    return file_path\n\ndef read_data(file_path):\n    \"\"\" Read data into a list of tokens \n    There should be 17,005,207 tokens\n    \"\"\"\n    with zipfile.ZipFile(file_path) as f:\n        words = tf.compat.as_str(f.read(f.namelist()[0])).split() \n        # tf.compat.as_str() converts the input to a str\n    return words\n\ndef build_vocab(words, vocab_size):\n    \"\"\" Build vocabulary of VOCAB_SIZE most frequent words \"\"\"\n    dictionary = dict()\n    count = [('UNK', -1)]\n    count.extend(Counter(words).most_common(vocab_size - 1))\n    index = 0\n    utils.make_dir('processed')\n    with open('processed/vocab_1000.tsv', \"w\") as f:\n        for word, _ in count:\n            dictionary[word] = index\n            if index < 1000:\n                f.write(word + \"\\n\")\n            index += 1\n    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n    return dictionary, index_dictionary\n\ndef convert_words_to_index(words, dictionary):\n    \"\"\" Replace each word in the dataset with its index in the dictionary \"\"\"\n    return [dictionary[word] if word in dictionary else 0 for word in words]\n\ndef generate_sample(index_words, context_window_size):\n    \"\"\" Form training pairs according to the skip-gram model. \"\"\"\n    for index, center in enumerate(index_words):\n        context = random.randint(1, context_window_size)\n        # yield every target in a randomly sized window before the center word\n        for target in index_words[max(0, index - context): index]:\n            yield center, target\n        # yield every target in the same sized window after the center word\n        for target in index_words[index + 1: index + context + 1]:\n            yield center, target\n\ndef get_batch(iterator, batch_size):\n    \"\"\" Group a numerical stream into batches and yield them as Numpy arrays. 
\"\"\"\n    while True:\n        center_batch = np.zeros(batch_size, dtype=np.int32)\n        target_batch = np.zeros([batch_size, 1])\n        for index in range(batch_size):\n            center_batch[index], target_batch[index] = next(iterator)\n        yield center_batch, target_batch\n\ndef process_data(vocab_size, batch_size, skip_window):\n    file_path = download(FILE_NAME, EXPECTED_BYTES)\n    words = read_data(file_path)\n    dictionary, _ = build_vocab(words, vocab_size)\n    index_words = convert_words_to_index(words, dictionary)\n    del words # to save memory\n    single_gen = generate_sample(index_words, skip_window)\n    return get_batch(single_gen, batch_size)\n\ndef get_index_vocab(vocab_size):\n    file_path = download(FILE_NAME, EXPECTED_BYTES)\n    words = read_data(file_path)\n    return build_vocab(words, vocab_size)\n"
  },
  {
    "path": "2017/examples/utils.py",
    "content": "import os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport tensorflow as tf\n\ndef huber_loss(labels, predictions, delta=1.0):\n    # Note: tf.cond requires a scalar predicate, so this works for scalar inputs only.\n    residual = tf.abs(predictions - labels)\n    def f1(): return 0.5 * tf.square(residual)\n    def f2(): return delta * residual - 0.5 * tf.square(delta)\n    return tf.cond(residual < delta, f1, f2)\n\ndef make_dir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass"
  },
  {
    "path": "2017/setup/requirements.txt",
    "content": "tensorflow==1.2.1\r\nscipy==0.19.1\r\nscikit-learn==0.18.2\r\nmatplotlib==2.0.2\r\nxlrd==1.0.0\r\nipdb==0.10.1\r\nPillow==4.2.1\r\nlxml==3.8.0\r\n"
  },
  {
    "path": "2017/setup/setup_instruction.md",
    "content": "TensorFlow supports both Python 2.7 and Python 3.3+. <b>Note that for Windows, TensorFlow supports only 64-bit Python 3.5.</b>\nFor this course, I will use Python 2.7. But you’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 2.7.\n\nGoogle has pretty detailed instructions on how to download and set up TensorFlow. You can follow them here: https://www.tensorflow.org/get_started/os_setup\n\nUnless your computer has a GPU, you should install TensorFlow without GPU support. My recommendation is to always set up TensorFlow using virtualenv. For the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses.\n\nBelow are simpler instructions on how to install TensorFlow for people using Mac OS. If you have any problem installing TensorFlow, feel free to post it on Piazza: piazza.com/stanford/winter2017/cs20si\n\n## Install TensorFlow<br>\n### For Mac OS\n\nIf you get a “permission denied” error in any command, use “sudo” in front of that command.\n\nYou will need pip (or pip3 if you use Python 3), and virtualenv.\n\nStep 1: set up pip and virtual environment\n```bash\n$ sudo easy_install pip \n$ sudo easy_install --upgrade six\n$ pip install virtualenv\n```\n\nStep 2: set up a project directory. You will do all work for this class in this directory\n```bash\n$ mkdir [my project]\n```\n\nStep 3: set up a virtual environment for the project directory. \n```bash\n$ cd [my project]\n$ virtualenv venv --distribute\n```\nThese commands create a venv subdirectory in your project where everything is installed.\n\nStep 4: activate the virtual environment \n```bash\n$ source venv/bin/activate\n```\n\nIf you type:\n```bash\n$ pip freeze\n```\n\nYou will see that nothing is shown, which means no package is installed in your virtual environment. So you have to install all packages that you need. 
For the list of packages you need for this class, refer to requirements.txt.\n\nStep 5: Install TensorFlow and other dependencies\n```bash\n$ pip install tensorflow\n$ pip freeze > requirements.txt\n```\n\nStep n: \nTo exit the virtual environment, use:\n```bash\n$ deactivate\n```\n\nIf you want your virtual environment to inherit globally installed packages (not recommended), use:\n```bash\n$ virtualenv venv --distribute --system-site-packages\n```\n### For Ubuntu\n\n\n### For Windows\n\n\n### On the cloud\nIf you don't want to install TensorFlow, you can use TensorFlow over the web.\n\n#### SageMath\nYou can use TensorFlow over the web at https://cloud.sagemath.com/\nSimply click on the link, create an account (or log in with your GitHub), and create a TensorFlow project.\n\n#### Jupyter\nYou can also use Jupyter notebook to write TensorFlow programs.\n\n# Possible setup problems\n## Matplotlib\nIf you have problems using Matplotlib in a virtual environment, here is a simple fix. <br>\nIf you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib.\nCreate the file ~/.matplotlib/matplotlibrc and add the following line: ```backend: TkAgg```\n\nOr you can simply add this after importing matplotlib: ```matplotlib.use(\"TkAgg\")```\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) 2017 Huyen Nguyen (Chip Huyen)\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE."
  },
  {
    "path": "README.md",
    "content": "[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)\n[![Join the https://gitter.im/stanford-tensorflow-tutorials](https://badges.gitter.im/tflearn/tflearn.svg)](https://gitter.im/stanford-tensorflow-tutorials)\n\n# stanford-tensorflow-tutorials\nThis repository contains code examples for the course CS 20: TensorFlow for Deep Learning Research. <br>\nIt will be updated as the class progresses. <br>\nDetailed syllabus and lecture notes can be found [here](http://cs20.stanford.edu).<br>\nFor this course, I use Python 3.6 and TensorFlow 1.4.1.\n\nFor the code and notes of the previous year's course, please see the folder 2017 and the website https://web.stanford.edu/class/cs20si/2017\n\nFor setup instructions and the list of dependencies, please see the setup folder of this repository."
  },
  {
    "path": "assignments/01/q1.py",
    "content": "\"\"\"\nSimple exercises to get used to TensorFlow API\nYou should thoroughly test your code.\nTensorFlow's official documentation should be your best friend here\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\nsess = tf.InteractiveSession()\n###############################################################################\n# 1a: Create two random 0-d tensors x and y of any distribution.\n# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.\n# Hint: look up tf.cond()\n# I do the first problem for you\n###############################################################################\n\nx = tf.random_uniform([])  # Empty array as shape creates a scalar.\ny = tf.random_uniform([])\nout = tf.cond(tf.greater(x, y), lambda: x + y, lambda: x - y)\nprint(sess.run(out))\n\n###############################################################################\n# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).\n# Return x + y if x < y, x - y if x > y, 0 otherwise.\n# Hint: Look up tf.case().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] \n# and y as a tensor of zeros with the same shape as x.\n# Return a boolean tensor that yields Trues if x equals y element-wise.\n# Hint: Look up tf.equal().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1d: Create the tensor x of value \n# [29.05088806,  27.61298943,  31.19073486,  29.35532951,\n#  30.97266006,  26.67541885,  38.08450317,  20.74983215,\n#  34.94445419,  34.45999146,  29.06485367,  
36.01657104,\n#  27.88236427,  20.56035233,  30.20379066,  29.51215172,\n#  33.71149445,  28.59134293,  36.05556488,  28.66994858].\n# Get the indices of elements in x whose values are greater than 30.\n# Hint: Use tf.where().\n# Then extract elements whose values are greater than 30.\n# Hint: Use tf.gather().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,\n# 2, ..., 6\n# Hint: Use tf.range() and tf.diag().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.\n# Calculate its determinant.\n# Hint: Look at tf.matrix_determinant().\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1g: Create tensor x with value [5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].\n# Return the unique elements in x\n# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.\n###############################################################################\n\n# YOUR CODE\n\n###############################################################################\n# 1h: Create two tensors x and y of shape 300 from any normal distribution,\n# as long as they are from the same distribution.\n# Use tf.cond() to return:\n# - The mean squared error of (x - y) if the average of all elements in (x - y)\n#   is negative, or\n# - The sum of absolute value of all elements in the tensor (x - y) otherwise.\n# Hint: see the Huber loss function in the slides for lecture 3.\n###############################################################################\n\n# YOUR CODE"
  },
  {
    "path": "assignments/01/q1_sol.py",
    "content": "\"\"\"\nSolution to simple exercises to get used to TensorFlow API\nYou should thoroughly test your code.\nTensorFlow's official documentation should be your best friend here\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\nsess = tf.InteractiveSession()\n###############################################################################\n# 1a: Create two random 0-d tensors x and y of any distribution.\n# Create a TensorFlow object that returns x + y if x > y, and x - y otherwise.\n# Hint: look up tf.cond()\n# I do the first problem for you\n###############################################################################\n\nx = tf.random_uniform([])  # Empty array as shape creates a scalar.\ny = tf.random_uniform([])\nout = tf.cond(tf.greater(x, y), lambda: tf.add(x, y), lambda: tf.subtract(x, y))\n\n###############################################################################\n# 1b: Create two 0-d tensors x and y randomly selected from the range [-1, 1).\n# Return x + y if x < y, x - y if x > y, 0 otherwise.\n# Hint: Look up tf.case().\n###############################################################################\n\nx = tf.random_uniform([], -1, 1, dtype=tf.float32)\ny = tf.random_uniform([], -1, 1, dtype=tf.float32)\nout = tf.case({tf.less(x, y): lambda: tf.add(x, y), \n\t\t\ttf.greater(x, y): lambda: tf.subtract(x, y)}, \n\t\t\tdefault=lambda: tf.constant(0.0), exclusive=True)\n\n\n###############################################################################\n# 1c: Create the tensor x of the value [[0, -2, -1], [0, 1, 2]] \n# and y as a tensor of zeros with the same shape as x.\n# Return a boolean tensor that yields Trues if x equals y element-wise.\n# Hint: Look up tf.equal().\n###############################################################################\n\nx = tf.constant([[0, -2, -1], 
[0, 1, 2]])\ny = tf.zeros_like(x)\nout = tf.equal(x, y)\n\n###############################################################################\n# 1d: Create the tensor x of value \n# [29.05088806,  27.61298943,  31.19073486,  29.35532951,\n#  30.97266006,  26.67541885,  38.08450317,  20.74983215,\n#  34.94445419,  34.45999146,  29.06485367,  36.01657104,\n#  27.88236427,  20.56035233,  30.20379066,  29.51215172,\n#  33.71149445,  28.59134293,  36.05556488,  28.66994858].\n# Get the indices of elements in x whose values are greater than 30.\n# Hint: Use tf.where().\n# Then extract elements whose values are greater than 30.\n# Hint: Use tf.gather().\n###############################################################################\n\nx = tf.constant([29.05088806,  27.61298943,  31.19073486,  29.35532951,\n\t\t        30.97266006,  26.67541885,  38.08450317,  20.74983215,\n\t\t        34.94445419,  34.45999146,  29.06485367,  36.01657104,\n\t\t        27.88236427,  20.56035233,  30.20379066,  29.51215172,\n\t\t        33.71149445,  28.59134293,  36.05556488,  28.66994858])\nindices = tf.where(x > 30)\nout = tf.gather(x, indices)\n\n###############################################################################\n# 1e: Create a diagonal 2-d tensor of size 6 x 6 with the diagonal values of 1,\n# 2, ..., 6\n# Hint: Use tf.range() and tf.diag().\n###############################################################################\n\nvalues = tf.range(1, 7)\nout = tf.diag(values)\n\n###############################################################################\n# 1f: Create a random 2-d tensor of size 10 x 10 from any distribution.\n# Calculate its determinant.\n# Hint: Look at tf.matrix_determinant().\n###############################################################################\n\nm = tf.random_normal([10, 10], mean=10, stddev=1)\nout = tf.matrix_determinant(m)\n\n###############################################################################\n# 1g: Create tensor x with value [5, 
2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9].\n# Return the unique elements in x\n# Hint: use tf.unique(). Keep in mind that tf.unique() returns a tuple.\n###############################################################################\n\nx = tf.constant([5, 2, 3, 5, 10, 6, 2, 3, 4, 2, 1, 1, 0, 9])\nunique_values, indices = tf.unique(x)\n\n###############################################################################\n# 1h: Create two tensors x and y of shape 300 from any normal distribution,\n# as long as they are from the same distribution.\n# Use tf.cond() to return:\n# - The mean squared error of (x - y) if the average of all elements in (x - y)\n#   is negative, or\n# - The sum of absolute value of all elements in the tensor (x - y) otherwise.\n# Hint: see the Huber loss function in the lecture slides 3.\n###############################################################################\n\nx = tf.random_normal([300], mean=5, stddev=1)\ny = tf.random_normal([300], mean=5, stddev=1)\naverage = tf.reduce_mean(x - y)\ndef f1(): return tf.reduce_mean(tf.square(x - y))\ndef f2(): return tf.reduce_sum(tf.abs(x - y))\nout = tf.cond(average < 0, f1, f2)"
  },
  {
    "path": "assignments/02_style_transfer/load_vgg.py",
    "content": "\"\"\" Load VGGNet weights needed for the implementation in TensorFlow\nof the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) \n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nFor more details, please read the assignment handout:\nhttps://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing\n\n\"\"\"\nimport numpy as np\nimport scipy.io\nimport tensorflow as tf\n\nimport utils\n\n# VGG-19 parameters file\nVGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'\nVGG_FILENAME = 'imagenet-vgg-verydeep-19.mat'\nEXPECTED_BYTES = 534904783\n\nclass VGG(object):\n    def __init__(self, input_img):\n        utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES)\n        self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers']\n        self.input_img = input_img\n        self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))\n\n    def _weights(self, layer_idx, expected_layer_name):\n        \"\"\" Return the weights and biases at layer_idx already trained by VGG\n        \"\"\"\n        W = self.vgg_layers[0][layer_idx][0][0][2][0][0]\n        b = self.vgg_layers[0][layer_idx][0][0][2][0][1]\n        layer_name = self.vgg_layers[0][layer_idx][0][0][0][0]\n        assert layer_name == expected_layer_name\n        return W, b.reshape(b.size)\n\n    def conv2d_relu(self, prev_layer, layer_idx, layer_name):\n        \"\"\" Create a convolution layer with RELU using the weights and\n        biases extracted from the VGG model at 'layer_idx'. 
You should use\n        the function _weights() defined above to extract weights and biases.\n\n        _weights() returns numpy arrays, so you have to convert them to TF tensors.\n\n        Don't forget to apply relu to the output from the convolution.\n        Inputs:\n            prev_layer: the output tensor from the previous layer\n            layer_idx: the index to current layer in vgg_layers\n            layer_name: the string that is the name of the current layer.\n                        It's used to specify variable_scope.\n        Hint for choosing strides size: \n            for small images, you probably don't want to skip any pixel\n        \"\"\"\n        ###############################\n        ## TO DO\n        out = None\n        ###############################\n        setattr(self, layer_name, out)\n\n    def avgpool(self, prev_layer, layer_name):\n        \"\"\" Create the average pooling layer. The paper suggests that \n        average pooling works better than max pooling.\n        \n        Input:\n            prev_layer: the output tensor from the previous layer\n            layer_name: the string that you want to name the layer.\n                        It's used to specify variable_scope.\n\n        Hint for choosing strides and kszie: choose what you feel appropriate\n        \"\"\"\n        ###############################\n        ## TO DO\n        out = None\n        ###############################\n        setattr(self, layer_name, out)\n\n    def load(self):\n        self.conv2d_relu(self.input_img, 0, 'conv1_1')\n        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')\n        self.avgpool(self.conv1_2, 'avgpool1')\n        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')\n        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')\n        self.avgpool(self.conv2_2, 'avgpool2')\n        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')\n        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')\n        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')\n  
      self.conv2d_relu(self.conv3_3, 16, 'conv3_4')\n        self.avgpool(self.conv3_4, 'avgpool3')\n        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')\n        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')\n        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')\n        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')\n        self.avgpool(self.conv4_4, 'avgpool4')\n        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')\n        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')\n        self.conv2d_relu(self.conv5_2, 32, 'conv5_3')\n        self.conv2d_relu(self.conv5_3, 34, 'conv5_4')\n        self.avgpool(self.conv5_4, 'avgpool5')"
  },
  {
    "path": "assignments/02_style_transfer/load_vgg_sol.py",
    "content": "\"\"\" Load VGGNet weights needed for the implementation in TensorFlow\nof the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) \n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nFor more details, please read the assignment handout:\n\n\"\"\"\nimport numpy as np\nimport scipy.io\nimport tensorflow as tf\n\nimport utils\n\n# VGG-19 parameters file\nVGG_DOWNLOAD_LINK = 'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-verydeep-19.mat'\nVGG_FILENAME = 'imagenet-vgg-verydeep-19.mat'\nEXPECTED_BYTES = 534904783\n\nclass VGG(object):\n    def __init__(self, input_img):\n        utils.download(VGG_DOWNLOAD_LINK, VGG_FILENAME, EXPECTED_BYTES)\n        self.vgg_layers = scipy.io.loadmat(VGG_FILENAME)['layers']\n        self.input_img = input_img\n        self.mean_pixels = np.array([123.68, 116.779, 103.939]).reshape((1,1,1,3))\n\n    def _weights(self, layer_idx, expected_layer_name):\n        \"\"\" Return the weights and biases at layer_idx already trained by VGG\n        \"\"\"\n        W = self.vgg_layers[0][layer_idx][0][0][2][0][0]\n        b = self.vgg_layers[0][layer_idx][0][0][2][0][1]\n        layer_name = self.vgg_layers[0][layer_idx][0][0][0][0]\n        assert layer_name == expected_layer_name\n        return W, b.reshape(b.size)\n\n    def conv2d_relu(self, prev_layer, layer_idx, layer_name):\n        \"\"\" Return the Conv2D layer with RELU using the weights, \n        biases from the VGG model at 'layer_idx'.\n        Don't forget to apply relu to the output from the convolution.\n        Inputs:\n            prev_layer: the output tensor from the previous layer\n            layer_idx: the index to current layer in vgg_layers\n            layer_name: the string that is the name of the current layer.\n                        It's used to specify variable_scope.\n\n\n        Note that you first need to obtain W and b from from the corresponding VGG's 
layer \n        using the function _weights() defined above.\n        W and b returned from _weights() are numpy arrays, so you have\n        to convert them to TF tensors. One way to do it is with tf.constant.\n\n        Hint for choosing strides size: \n            for small images, you probably don't want to skip any pixel\n        \"\"\"\n        ###############################\n        ## TO DO\n        with tf.variable_scope(layer_name) as scope:\n            W, b = self._weights(layer_idx, layer_name)\n            W = tf.constant(W, name='weights')\n            b = tf.constant(b, name='bias')\n            conv2d = tf.nn.conv2d(prev_layer, \n                                filter=W, \n                                strides=[1, 1, 1, 1], \n                                padding='SAME')\n            out = tf.nn.relu(conv2d + b)\n        ###############################\n        setattr(self, layer_name, out)\n\n    def avgpool(self, prev_layer, layer_name):\n        \"\"\" Return the average pooling layer. 
The paper suggests that \n        average pooling works better than max pooling.\n        Input:\n            prev_layer: the output tensor from the previous layer\n            layer_name: the string that you want to name the layer.\n                        It's used to specify variable_scope.\n\n        Hint for choosing strides and ksize: choose what you feel appropriate\n        \"\"\"\n        ###############################\n        ## TO DO\n        with tf.variable_scope(layer_name):\n            out = tf.nn.avg_pool(prev_layer, \n                                ksize=[1, 2, 2, 1], \n                                strides=[1, 2, 2, 1],\n                                padding='SAME')\n        ###############################\n        setattr(self, layer_name, out)\n\n    def load(self):\n        self.conv2d_relu(self.input_img, 0, 'conv1_1')\n        self.conv2d_relu(self.conv1_1, 2, 'conv1_2')\n        self.avgpool(self.conv1_2, 'avgpool1')\n        self.conv2d_relu(self.avgpool1, 5, 'conv2_1')\n        self.conv2d_relu(self.conv2_1, 7, 'conv2_2')\n        self.avgpool(self.conv2_2, 'avgpool2')\n        self.conv2d_relu(self.avgpool2, 10, 'conv3_1')\n        self.conv2d_relu(self.conv3_1, 12, 'conv3_2')\n        self.conv2d_relu(self.conv3_2, 14, 'conv3_3')\n        self.conv2d_relu(self.conv3_3, 16, 'conv3_4')\n        self.avgpool(self.conv3_4, 'avgpool3')\n        self.conv2d_relu(self.avgpool3, 19, 'conv4_1')\n        self.conv2d_relu(self.conv4_1, 21, 'conv4_2')\n        self.conv2d_relu(self.conv4_2, 23, 'conv4_3')\n        self.conv2d_relu(self.conv4_3, 25, 'conv4_4')\n        self.avgpool(self.conv4_4, 'avgpool4')\n        self.conv2d_relu(self.avgpool4, 28, 'conv5_1')\n        self.conv2d_relu(self.conv5_1, 30, 'conv5_2')\n        self.conv2d_relu(self.conv5_2, 32, 'conv5_3')\n        self.conv2d_relu(self.conv5_3, 34, 'conv5_4')\n        self.avgpool(self.conv5_4, 'avgpool5')"
  },
  {
    "path": "assignments/02_style_transfer/style_transfer.py",
    "content": "\"\"\" Implementation in TensorFlow of the paper \nA Neural Algorithm of Artistic Style (Gatys et al., 2016) \n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nFor more details, please read the assignment handout:\nhttps://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport load_vgg\nimport utils\n\ndef setup():\n    utils.safe_mkdir('checkpoints')\n    utils.safe_mkdir('outputs')\n\nclass StyleTransfer(object):\n    def __init__(self, content_img, style_img, img_width, img_height):\n        '''\n        img_width and img_height are the dimensions we expect from the generated image.\n        We will resize input content image and input style image to match this dimension.\n        Feel free to alter any hyperparameter here and see how it affects your training.\n        '''\n        self.img_width = img_width\n        self.img_height = img_height\n        self.content_img = utils.get_resized_image(content_img, img_width, img_height)\n        self.style_img = utils.get_resized_image(style_img, img_width, img_height)\n        self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height)\n\n        ###############################\n        ## TO DO\n        ## create global step (gstep) and hyperparameters for the model\n        self.content_layer = 'conv4_2'\n        self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']\n        # content_w, style_w: corresponding weights for content loss and style loss\n        self.content_w = None\n        self.style_w = None\n        # style_layer_w: weights for different style layers. 
deep layers have more weights\n        self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] \n        self.gstep = None # global step\n        self.lr = None\n        ###############################\n\n    def create_input(self):\n        '''\n        We will use one input_img as a placeholder for the content image, \n        style image, and generated image, because:\n            1. they have the same dimension\n            2. we have to extract the same set of features from them\n        We use a variable instead of a placeholder because we're, at the same time, \n        training the generated image to get the desirable result.\n\n        Note: image height corresponds to number of rows, not columns.\n        '''\n        with tf.variable_scope('input') as scope:\n            self.input_img = tf.get_variable('in_img', \n                                        shape=([1, self.img_height, self.img_width, 3]),\n                                        dtype=tf.float32,\n                                        initializer=tf.zeros_initializer())\n    def load_vgg(self):\n        '''\n        Load the saved model parameters of VGG-19, using the input_img\n        as the input to compute the output at each layer of vgg.\n\n        During training, VGG-19 mean-centered all images and found the mean pixels\n        to be [123.68, 116.779, 103.939] along RGB dimensions. 
We have to subtract\n        this mean from our images.\n\n        '''\n        self.vgg = load_vgg.VGG(self.input_img)\n        self.vgg.load()\n        self.content_img -= self.vgg.mean_pixels\n        self.style_img -= self.vgg.mean_pixels\n\n    def _content_loss(self, P, F):\n        ''' Calculate the loss between the feature representation of the\n        content image and the generated image.\n        \n        Inputs: \n            P: content representation of the content image\n            F: content representation of the generated image\n            Read the assignment handout for more details\n\n            Note: Don't use the coefficient 0.5 as defined in the paper.\n            Use the coefficient defined in the assignment handout.\n        '''\n        ###############################\n        ## TO DO\n        self.content_loss = None\n        ###############################\n        \n    def _gram_matrix(self, F, N, M):\n        \"\"\" Create and return the gram matrix for tensor F\n            Hint: you'll first have to reshape F\n        \"\"\"\n        ###############################\n        ## TO DO\n        return None\n        ###############################\n\n    def _single_style_loss(self, a, g):\n        \"\"\" Calculate the style loss at a certain layer\n        Inputs:\n            a is the feature representation of the style image at that layer\n            g is the feature representation of the generated image at that layer\n        Output:\n            the style loss at a certain layer (which is E_l in the paper)\n\n        Hint: 1. you'll have to use the function _gram_matrix()\n            2. we'll use the same coefficient for style loss as in the paper\n            3. 
a and g are feature representation, not gram matrices\n        \"\"\"\n        ###############################\n        ## TO DO\n        return None\n        ###############################\n\n    def _style_loss(self, A):\n        \"\"\" Calculate the total style loss as a weighted sum \n        of style losses at all style layers\n        Hint: you'll have to use _single_style_loss()\n        \"\"\"\n        ###############################\n        ## TO DO\n        self.style_loss = None\n        ###############################\n\n    def losses(self):\n        with tf.variable_scope('losses') as scope:\n            with tf.Session() as sess:\n                # assign content image to the input variable\n                sess.run(self.input_img.assign(self.content_img)) \n                gen_img_content = getattr(self.vgg, self.content_layer)\n                content_img_content = sess.run(gen_img_content)\n            self._content_loss(content_img_content, gen_img_content)\n\n            with tf.Session() as sess:\n                sess.run(self.input_img.assign(self.style_img))\n                style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers])                              \n            self._style_loss(style_layers)\n\n            ##########################################\n            ## TO DO: create total loss. 
\n            ## Hint: don't forget the weights for the content loss and style loss\n            self.total_loss = None\n            ##########################################\n\n    def optimize(self):\n        ###############################\n        ## TO DO: create optimizer\n        self.opt = None\n        ###############################\n\n    def create_summary(self):\n        ###############################\n        ## TO DO: create summaries for all the losses\n        ## Hint: don't forget to merge them\n        self.summary_op = None\n        ###############################\n\n\n    def build(self):\n        self.create_input()\n        self.load_vgg()\n        self.losses()\n        self.optimize()\n        self.create_summary()\n\n    def train(self, n_iters):\n        skip_step = 1\n        with tf.Session() as sess:\n            \n            ###############################\n            ## TO DO: \n            ## 1. initialize your variables\n            ## 2. create writer to write your graph\n            ###############################\n            \n            sess.run(self.input_img.assign(self.initial_img))\n\n            ###############################\n            ## TO DO: \n            ## 1. create a saver object\n            ## 2. 
check if a checkpoint exists, restore the variables\n            ##############################\n\n            initial_step = self.gstep.eval()\n            \n            start_time = time.time()\n            for index in range(initial_step, n_iters):\n                if index >= 5 and index < 20:\n                    skip_step = 10\n                elif index >= 20:\n                    skip_step = 20\n                \n                sess.run(self.opt)\n                if (index + 1) % skip_step == 0:\n                    ###############################\n                    ## TO DO: obtain generated image, loss, and summary\n                    gen_image, total_loss, summary = None, None, None\n                    ###############################\n                    \n                    # add back the mean pixels we subtracted before\n                    gen_image = gen_image + self.vgg.mean_pixels \n                    writer.add_summary(summary, global_step=index)\n                    print('Step {}\\n   Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))\n                    print('   Loss: {:5.1f}'.format(total_loss))\n                    print('   Took: {} seconds'.format(time.time() - start_time))\n                    start_time = time.time()\n\n                    filename = 'outputs/%d.png' % (index)\n                    utils.save_image(filename, gen_image)\n\n                    if (index + 1) % 20 == 0:\n                        ###############################\n                        ## TO DO: save the variables into a checkpoint\n                        ###############################\n                        pass\n\nif __name__ == '__main__':\n    setup()\n    machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250)\n    machine.build()\n    machine.train(300)"
  },
  {
    "path": "assignments/02_style_transfer/style_transfer_sol.py",
    "content": "import os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport load_vgg_sol\nimport utils\n\ndef setup():\n    utils.safe_mkdir('checkpoints')\n    utils.safe_mkdir('outputs')\n\nclass StyleTransfer(object):\n    def __init__(self, content_img, style_img, img_width, img_height):\n        '''\n        img_width and img_height are the dimensions we expect from the generated image.\n        We will resize input content image and input style image to match this dimension.\n        Feel free to alter any hyperparameter here and see how it affects your training.\n        '''\n        self.img_width = img_width\n        self.img_height = img_height\n        self.content_img = utils.get_resized_image(content_img, img_width, img_height)\n        self.style_img = utils.get_resized_image(style_img, img_width, img_height)\n        self.initial_img = utils.generate_noise_image(self.content_img, img_width, img_height)\n\n        ###############################\n        ## TO DO\n        ## create global step (gstep) and hyperparameters for the model\n        self.content_layer = 'conv4_2'\n        self.style_layers = ['conv1_1', 'conv2_1', 'conv3_1', 'conv4_1', 'conv5_1']\n        self.content_w = 0.01\n        self.style_w = 1\n        self.style_layer_w = [0.5, 1.0, 1.5, 3.0, 4.0] \n        self.gstep = tf.Variable(0, dtype=tf.int32, \n                                trainable=False, name='global_step')\n        self.lr = 2.0\n        ###############################\n\n    def create_input(self):\n        '''\n        We will use one input_img as a placeholder for the content image, \n        style image, and generated image, because:\n            1. they have the same dimension\n            2. 
we have to extract the same set of features from them\n        We use a variable instead of a placeholder because we're, at the same time, \n        training the generated image to get the desirable result.\n\n        Note: image height corresponds to number of rows, not columns.\n        '''\n        with tf.variable_scope('input') as scope:\n            self.input_img = tf.get_variable('in_img', \n                                        shape=([1, self.img_height, self.img_width, 3]),\n                                        dtype=tf.float32,\n                                        initializer=tf.zeros_initializer())\n    def load_vgg(self):\n        '''\n        Load the saved model parameters of VGG-19, using the input_img\n        as the input to compute the output at each layer of vgg.\n\n        During training, VGG-19 mean-centered all images and found the mean pixels\n        to be [123.68, 116.779, 103.939] along RGB dimensions. We have to subtract\n        this mean from our images.\n\n        '''\n        self.vgg = load_vgg_sol.VGG(self.input_img)\n        self.vgg.load()\n        self.content_img -= self.vgg.mean_pixels\n        self.style_img -= self.vgg.mean_pixels\n\n    def _content_loss(self, P, F):\n        ''' Calculate the loss between the feature representation of the\n        content image and the generated image.\n        \n        Inputs: \n            P: content representation of the content image\n            F: content representation of the generated image\n            Read the assignment handout for more details\n\n            Note: Don't use the coefficient 0.5 as defined in the paper.\n            Use the coefficient defined in the assignment handout.\n        '''\n        # self.content_loss = None\n        ###############################\n        ## TO DO\n        self.content_loss = tf.reduce_sum((F - P) ** 2) / (4.0 * P.size)\n        ###############################\n    \n    def _gram_matrix(self, F, N, M):\n        \"\"\" 
Create and return the gram matrix for tensor F\n            Hint: you'll first have to reshape F\n        \"\"\"\n        ###############################\n        ## TO DO\n        F = tf.reshape(F, (M, N))\n        return tf.matmul(tf.transpose(F), F)\n        ###############################\n\n    def _single_style_loss(self, a, g):\n        \"\"\" Calculate the style loss at a certain layer\n        Inputs:\n            a is the feature representation of the style image at that layer\n            g is the feature representation of the generated image at that layer\n        Output:\n            the style loss at a certain layer (which is E_l in the paper)\n\n        Hint: 1. you'll have to use the function _gram_matrix()\n            2. we'll use the same coefficient for style loss as in the paper\n            3. a and g are feature representation, not gram matrices\n        \"\"\"\n        ###############################\n        ## TO DO\n        N = a.shape[3] # number of filters\n        M = a.shape[1] * a.shape[2] # height times width of the feature map\n        A = self._gram_matrix(a, N, M)\n        G = self._gram_matrix(g, N, M)\n        return tf.reduce_sum((G - A) ** 2 / ((2 * N * M) ** 2))\n        ###############################\n\n    def _style_loss(self, A):\n        \"\"\" Calculate the total style loss as a weighted sum \n        of style losses at all style layers\n        Hint: you'll have to use _single_style_loss()\n        \"\"\"\n        n_layers = len(A)\n        E = [self._single_style_loss(A[i], getattr(self.vgg, self.style_layers[i])) for i in range(n_layers)]\n        \n        ###############################\n        ## TO DO\n        self.style_loss = sum([self.style_layer_w[i] * E[i] for i in range(n_layers)])\n        ###############################\n\n    def losses(self):\n        with tf.variable_scope('losses') as scope:\n            with tf.Session() as sess:\n                # assign content image to the input variable\n      
          sess.run(self.input_img.assign(self.content_img)) \n                gen_img_content = getattr(self.vgg, self.content_layer)\n                content_img_content = sess.run(gen_img_content)\n            self._content_loss(content_img_content, gen_img_content)\n\n            with tf.Session() as sess:\n                sess.run(self.input_img.assign(self.style_img))\n                style_layers = sess.run([getattr(self.vgg, layer) for layer in self.style_layers])                              \n            self._style_loss(style_layers)\n\n            ##########################################\n            ## TO DO: create total loss. \n            ## Hint: don't forget the weights for the content loss and style loss\n            self.total_loss = self.content_w * self.content_loss + self.style_w * self.style_loss\n            ##########################################\n\n    def optimize(self):\n        ###############################\n        ## TO DO: create optimizer\n        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.total_loss,\n                                                            global_step=self.gstep)\n        ###############################\n\n    def create_summary(self):\n        ###############################\n        ## TO DO: create summaries for all the losses\n        ## Hint: don't forget to merge them\n        with tf.name_scope('summaries'):\n            tf.summary.scalar('content loss', self.content_loss)\n            tf.summary.scalar('style loss', self.style_loss)\n            tf.summary.scalar('total loss', self.total_loss)\n            self.summary_op = tf.summary.merge_all()\n        ###############################\n\n\n    def build(self):\n        self.create_input()\n        self.load_vgg()\n        self.losses()\n        self.optimize()\n        self.create_summary()\n\n    def train(self, n_iters):\n        skip_step = 1\n        with tf.Session() as sess:\n            \n            
###############################\n            ## TO DO: \n            ## 1. initialize your variables\n            ## 2. create writer to write your graph\n            sess.run(tf.global_variables_initializer())\n            writer = tf.summary.FileWriter('graphs/style_stranfer', sess.graph)\n            ###############################\n            sess.run(self.input_img.assign(self.initial_img))\n\n\n            ###############################\n            ## TO DO: \n            ## 1. create a saver object\n            ## 2. check if a checkpoint exists, restore the variables\n            saver = tf.train.Saver()\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/style_transfer/checkpoint'))\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n            ##############################\n\n            initial_step = self.gstep.eval()\n            \n            start_time = time.time()\n            for index in range(initial_step, n_iters):\n                if index >= 5 and index < 20:\n                    skip_step = 10\n                elif index >= 20:\n                    skip_step = 20\n                \n                sess.run(self.opt)\n                if (index + 1) % skip_step == 0:\n                    ###############################\n                    ## TO DO: obtain generated image, loss, and summary\n                    gen_image, total_loss, summary = sess.run([self.input_img,\n                                                                self.total_loss,\n                                                                self.summary_op])\n\n                    ###############################\n                    \n                    # add back the mean pixels we subtracted before\n                    gen_image = gen_image + self.vgg.mean_pixels \n                    writer.add_summary(summary, global_step=index)\n                    print('Step {}\\n   
Sum: {:5.1f}'.format(index + 1, np.sum(gen_image)))\n                    print('   Loss: {:5.1f}'.format(total_loss))\n                    print('   Took: {} seconds'.format(time.time() - start_time))\n                    start_time = time.time()\n\n                    filename = 'outputs/%d.png' % (index)\n                    utils.save_image(filename, gen_image)\n\n                    if (index + 1) % 20 == 0:\n                        ###############################\n                        ## TO DO: save the variables into a checkpoint\n                        saver.save(sess, 'checkpoints/style_transfer/style_transfer', index)\n                        ###############################\n\nif __name__ == '__main__':\n    setup()\n    machine = StyleTransfer('content/deadpool.jpg', 'styles/guernica.jpg', 333, 250)\n    machine.build()\n    machine.train(300)"
  },
  {
    "path": "assignments/02_style_transfer/utils.py",
    "content": "\"\"\" Utils needed for the implementation in TensorFlow\nof the paper A Neural Algorithm of Artistic Style (Gatys et al., 2016) \n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nFor more details, please read the assignment handout:\nhttps://docs.google.com/document/d/1FpueD-3mScnD0SJQDtwmOb1FrSwo1NGowkXzMwPoLH4/edit?usp=sharing\n\n\"\"\"\n\nimport os\n\nfrom PIL import Image, ImageOps\nimport numpy as np\nimport scipy.misc\nfrom six.moves import urllib\n\ndef download(download_link, file_name, expected_bytes):\n    \"\"\" Download the pretrained VGG-19 model if it's not already downloaded \"\"\"\n    if os.path.exists(file_name):\n        print(\"VGG-19 pre-trained model is ready\")\n        return\n    print(\"Downloading the VGG pre-trained model. This might take a while ...\")\n    file_name, _ = urllib.request.urlretrieve(download_link, file_name)\n    file_stat = os.stat(file_name)\n    if file_stat.st_size == expected_bytes:\n        print('Successfully downloaded VGG-19 pre-trained model', file_name)\n    else:\n        raise Exception('File ' + file_name +\n                        ' might be corrupted. 
You should try downloading it with a browser.')\n\ndef get_resized_image(img_path, width, height, save=True):\n    image = Image.open(img_path)\n    # PIL is column major so you have to swap the places of width and height\n    image = ImageOps.fit(image, (width, height), Image.ANTIALIAS)\n    if save:\n        image_dirs = img_path.split('/')\n        image_dirs[-1] = 'resized_' + image_dirs[-1]\n        out_path = '/'.join(image_dirs)\n        if not os.path.exists(out_path):\n            image.save(out_path)\n    image = np.asarray(image, np.float32)\n    return np.expand_dims(image, 0)\n\ndef generate_noise_image(content_image, width, height, noise_ratio=0.6):\n    noise_image = np.random.uniform(-20, 20, (1, height, width, 3)).astype(np.float32)\n    return noise_image * noise_ratio + content_image * (1 - noise_ratio)\n\ndef save_image(path, image):\n    image = image[0]\n    image = np.clip(image, 0, 255).astype('uint8')\n    scipy.misc.imsave(path, image)\n\ndef safe_mkdir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass"
  },
  {
    "path": "assignments/chatbot/README.md",
    "content": "A neural chatbot using sequence to sequence model with\nattentional decoder. This is a fully functional chatbot.\n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nThe detailed assignment handout can be found at:\nhttps://docs.google.com/document/d/1GJfn2B6EI8JueDiBwzTAdD34d6pC99BSt6vldOmUCPQ/edit?usp=sharing\n\n\n<h2>Sample conversation</h2>\nHUMAN ++++ hi <br>\nBOT ++++ hi . what ' s your name ?<br>\nHUMAN ++++ i'm chip<br>\nBOT ++++ please .<br>\nHUMAN ++++ please what?<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ where are you from?<br>\nBOT ++++ california .<br>\nHUMAN ++++ do you like california?<br>\nBOT ++++ i ' m okay .<br>\nHUMAN ++++ i like california a lot<br>\nBOT ++++ of what ?<br>\nHUMAN ++++ i live here<br>\nBOT ++++ and what is it ?<br>\nHUMAN ++++ california<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ i'm a student here<br>\nBOT ++++ and get it out of here .<br>\nHUMAN ++++ what do you like to do in your free time?<br>\nBOT ++++ and what about the gun ?<br>\nHUMAN ++++ do you have a gun?<br>\nBOT ++++ yes .<br>\nHUMAN ++++ you're dangerous!<br>\nBOT ++++ i ' m not .<br>\nHUMAN ++++ do you go hunting?<br>\nBOT ++++ i can do .<br>\nHUMAN ++++ what do you hunt?<br>\nBOT ++++ virgin .<br>\nHUMAN ++++ that's very bad<br>\nBOT ++++ and you in love ?<br>\nHUMAN ++++ no i'm not<br>\nBOT ++++ please - -<br>\nHUMAN ++++ okay, i am<br>\nBOT ++++ and you are good .<br>\nHUMAN ++++ yeah i am good too.<br>\nBOT ++++ good good to you .<br>\nHUMAN ++++ are you in love?<br>\nBOT ++++ i ' m ready .<br>\nHUMAN ++++ ready for what?<br>\nBOT ++++ for what ?<br>\nHUMAN ++++ for what?<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ tell me<br>\nBOT ++++ i can do it 
.<br>\nHUMAN ++++ okay, tell me what you're ready for<br>\nBOT ++++ i ' m doing you .<br>\nHUMAN ++++ no you're not<br>\nBOT ++++ please !<br>\nHUMAN ++++ this is a strange conversation<br>\nBOT ++++ i don ' t know what to say .<br>\nHUMAN ++++ are you a guy?<br>\nBOT ++++ of what ?<br>\nHUMAN ++++ are you a boy?<br>\nBOT ++++ i ' m fine .<br>\nHUMAN ++++ are you a man?<br>\nBOT ++++ i ' m a little .<br>\nHUMAN ++++ what's your name?<br>\nBOT ++++ chris -<br>\nHUMAN ++++ chris who?<br>\nBOT ++++ cornelius .<br>\nHUMAN ++++ so i guess you're a guy.<br>\nBOT ++++ i am .<br>\nHUMAN ++++ okay<br>\nBOT ++++ okay ?<br>\nHUMAN ++++ do you have a girlfriend?<br>\nBOT ++++ i ' m sure .<br>\nHUMAN ++++ what's her name?<br>\nBOT ++++ let ' s talk about something else .<br>\n\nSee output_convo.txt for more sample conversations.\n\n<h2>Usage</h2>\n\nStep 1: create a data folder in your project directory, download\nthe Cornell Movie-Dialogs Corpus from \nhttps://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html\nUnzip it\n\nStep 2: update config.py file<br>\nChange DATA_PATH to where you store your data\n\nStep 3: python3 data.py<br>\nThis will do all the pre-processing for the Cornell dataset.\n\nStep 4:\npython3 chatbot.py --mode [train/chat] <br>\nIf mode is train, then you train the chatbot. By default, the model will\nrestore the previously trained weights (if there is any) and continue\ntraining up on that.\n\nIf you want to start training from scratch, please delete all the checkpoints\nin the checkpoints folder.\n\nIf the mode is chat, you'll go into the interaction mode with the bot.\n\nBy default, all the conversations you have with the chatbot will be written\ninto the file output_convo.txt in the processed folder. If you run this chatbot,\nI kindly ask you to send me the output_convo.txt so that I can improve\nthe chatbot.\n\n\nThank you very much!\n"
  },
  {
    "path": "assignments/chatbot/chatbot.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nThis file contains the code to run the model.\n\nSee README.md for instruction on how to run the starter code.\n\"\"\"\nimport argparse\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport random\nimport sys\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nfrom model import ChatBotModel\nimport config\nimport data\n\ndef _get_random_bucket(train_buckets_scale):\n    \"\"\" Get a random bucket from which to choose a training sample \"\"\"\n    rand = random.random()\n    return min([i for i in range(len(train_buckets_scale))\n                if train_buckets_scale[i] > rand])\n\ndef _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks):\n    \"\"\" Assert that the encoder inputs, decoder inputs, and decoder masks are\n    of the expected lengths \"\"\"\n    if len(encoder_inputs) != encoder_size:\n        raise ValueError(\"Encoder length must be equal to the one in bucket,\"\n                        \" %d != %d.\" % (len(encoder_inputs), encoder_size))\n    if len(decoder_inputs) != decoder_size:\n        raise ValueError(\"Decoder length must be equal to the one in bucket,\"\n                       \" %d != %d.\" % (len(decoder_inputs), decoder_size))\n    if len(decoder_masks) != decoder_size:\n        raise ValueError(\"Weights length must be equal to the one in bucket,\"\n                       \" %d != %d.\" % (len(decoder_masks), decoder_size))\n\ndef run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, forward_only):\n    \"\"\" Run one step in training.\n    @forward_only: 
boolean value to decide whether a backward path should be created\n    forward_only is set to True when you just want to evaluate on the test set,\n    or when you want the bot to be in chat mode. \"\"\"\n    encoder_size, decoder_size = config.BUCKETS[bucket_id]\n    _assert_lengths(encoder_size, decoder_size, encoder_inputs, decoder_inputs, decoder_masks)\n\n    # input feed: encoder inputs, decoder inputs, target_weights, as provided.\n    input_feed = {}\n    for step in range(encoder_size):\n        input_feed[model.encoder_inputs[step].name] = encoder_inputs[step]\n    for step in range(decoder_size):\n        input_feed[model.decoder_inputs[step].name] = decoder_inputs[step]\n        input_feed[model.decoder_masks[step].name] = decoder_masks[step]\n\n    last_target = model.decoder_inputs[decoder_size].name\n    input_feed[last_target] = np.zeros([model.batch_size], dtype=np.int32)\n\n    # output feed: depends on whether we do a backward step or not.\n    if not forward_only:\n        output_feed = [model.train_ops[bucket_id],  # update op that does SGD.\n                       model.gradient_norms[bucket_id],  # gradient norm.\n                       model.losses[bucket_id]]  # loss for this batch.\n    else:\n        output_feed = [model.losses[bucket_id]]  # loss for this batch.\n        for step in range(decoder_size):  # output logits.\n            output_feed.append(model.outputs[bucket_id][step])\n\n    outputs = sess.run(output_feed, input_feed)\n    if not forward_only:\n        return outputs[1], outputs[2], None  # Gradient norm, loss, no outputs.\n    else:\n        return None, outputs[0], outputs[1:]  # No gradient norm, loss, outputs.\n\ndef _get_buckets():\n    \"\"\" Load the dataset into buckets based on their lengths.\n    train_buckets_scale is the interval that'll help us \n    choose a random bucket later on.\n    \"\"\"\n    test_buckets = data.load_data('test_ids.enc', 'test_ids.dec')\n    data_buckets = 
data.load_data('train_ids.enc', 'train_ids.dec')\n    train_bucket_sizes = [len(data_buckets[b]) for b in range(len(config.BUCKETS))]\n    print(\"Number of samples in each bucket:\\n\", train_bucket_sizes)\n    train_total_size = sum(train_bucket_sizes)\n    # list of increasing numbers from 0 to 1 that we'll use to select a bucket.\n    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size\n                           for i in range(len(train_bucket_sizes))]\n    print(\"Bucket scale:\\n\", train_buckets_scale)\n    return test_buckets, data_buckets, train_buckets_scale\n\ndef _get_skip_step(iteration):\n    \"\"\" How many steps should the model train before it saves all the weights. \"\"\"\n    if iteration < 100:\n        return 30\n    return 100\n\ndef _check_restore_parameters(sess, saver):\n    \"\"\" Restore the previously trained parameters if there are any. \"\"\"\n    ckpt = tf.train.get_checkpoint_state(os.path.dirname(config.CPT_PATH + '/checkpoint'))\n    if ckpt and ckpt.model_checkpoint_path:\n        print(\"Loading parameters for the Chatbot\")\n        saver.restore(sess, ckpt.model_checkpoint_path)\n    else:\n        print(\"Initializing fresh parameters for the Chatbot\")\n\ndef _eval_test_set(sess, model, test_buckets):\n    \"\"\" Evaluate on the test set. 
\"\"\"\n    for bucket_id in range(len(config.BUCKETS)):\n        if len(test_buckets[bucket_id]) == 0:\n            print(\"  Test: empty bucket %d\" % (bucket_id))\n            continue\n        start = time.time()\n        encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(test_buckets[bucket_id], \n                                                                        bucket_id,\n                                                                        batch_size=config.BATCH_SIZE)\n        _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, \n                                   decoder_masks, bucket_id, True)\n        print('Test bucket {}: loss {}, time {}'.format(bucket_id, step_loss, time.time() - start))\n\ndef train():\n    \"\"\" Train the bot \"\"\"\n    test_buckets, data_buckets, train_buckets_scale = _get_buckets()\n    # in train mode, we need to create the backward path, so forwrad_only is False\n    model = ChatBotModel(False, config.BATCH_SIZE)\n    model.build_graph()\n\n    saver = tf.train.Saver()\n\n    with tf.Session() as sess:\n        print('Running session')\n        sess.run(tf.global_variables_initializer())\n        _check_restore_parameters(sess, saver)\n\n        iteration = model.global_step.eval()\n        total_loss = 0\n        while True:\n            skip_step = _get_skip_step(iteration)\n            bucket_id = _get_random_bucket(train_buckets_scale)\n            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch(data_buckets[bucket_id], \n                                                                           bucket_id,\n                                                                           batch_size=config.BATCH_SIZE)\n            start = time.time()\n            _, step_loss, _ = run_step(sess, model, encoder_inputs, decoder_inputs, decoder_masks, bucket_id, False)\n            total_loss += step_loss\n            iteration += 1\n\n            if iteration % skip_step == 
0:\n                print('Iter {}: loss {}, time {}'.format(iteration, total_loss/skip_step, time.time() - start))\n                start = time.time()\n                total_loss = 0\n                saver.save(sess, os.path.join(config.CPT_PATH, 'chatbot'), global_step=model.global_step)\n                if iteration % (10 * skip_step) == 0:\n                    # Run evals on development set and print their loss\n                    _eval_test_set(sess, model, test_buckets)\n                    start = time.time()\n                sys.stdout.flush()\n\ndef _get_user_input():\n    \"\"\" Get user's input, which will be transformed into encoder input later \"\"\"\n    print(\"> \", end=\"\")\n    sys.stdout.flush()\n    return sys.stdin.readline()\n\ndef _find_right_bucket(length):\n    \"\"\" Find the proper bucket for an encoder input based on its length \"\"\"\n    return min([b for b in range(len(config.BUCKETS))\n                if config.BUCKETS[b][0] >= length])\n\ndef _construct_response(output_logits, inv_dec_vocab):\n    \"\"\" Construct a response to the user's encoder input.\n    @output_logits: the outputs from sequence to sequence wrapper.\n    output_logits is decoder_size np array, each of dim 1 x DEC_VOCAB\n    \n    This is a greedy decoder - outputs are just argmaxes of output_logits.\n    \"\"\"\n    print(output_logits[0])\n    outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]\n    # If there is an EOS symbol in outputs, cut them at that point.\n    if config.EOS_ID in outputs:\n        outputs = outputs[:outputs.index(config.EOS_ID)]\n    # Print out sentence corresponding to outputs.\n    return \" \".join([tf.compat.as_str(inv_dec_vocab[output]) for output in outputs])\n\ndef chat():\n    \"\"\" In chat mode, we don't need to create the backward path\n    \"\"\"\n    _, enc_vocab = data.load_vocab(os.path.join(config.PROCESSED_PATH, 'vocab.enc'))\n    inv_dec_vocab, _ = data.load_vocab(os.path.join(config.PROCESSED_PATH, 
'vocab.dec'))\n\n    model = ChatBotModel(True, batch_size=1)\n    model.build_graph()\n\n    saver = tf.train.Saver()\n\n    with tf.Session() as sess:\n        sess.run(tf.global_variables_initializer())\n        _check_restore_parameters(sess, saver)\n        output_file = open(os.path.join(config.PROCESSED_PATH, config.OUTPUT_FILE), 'a+')\n        # Decode from standard input.\n        max_length = config.BUCKETS[-1][0]\n        print('Welcome to TensorBro. Say something. Enter to exit. Max length is', max_length)\n        while True:\n            line = _get_user_input()\n            if len(line) > 0 and line[-1] == '\\n':\n                line = line[:-1]\n            if line == '':\n                break\n            output_file.write('HUMAN ++++ ' + line + '\\n')\n            # Get token-ids for the input sentence.\n            token_ids = data.sentence2id(enc_vocab, str(line))\n            if (len(token_ids) > max_length):\n                print('Max length I can handle is:', max_length)\n                # The top of the loop reads the next line; reading here too would drop it.\n                continue\n            # Which bucket does it belong to?\n            bucket_id = _find_right_bucket(len(token_ids))\n            # Get a 1-element batch to feed the sentence to the model.\n            encoder_inputs, decoder_inputs, decoder_masks = data.get_batch([(token_ids, [])], \n                                                                            bucket_id,\n                                                                            batch_size=1)\n            # Get output logits for the sentence.\n            _, _, output_logits = run_step(sess, model, encoder_inputs, decoder_inputs,\n                                           decoder_masks, bucket_id, True)\n            response = _construct_response(output_logits, inv_dec_vocab)\n            print(response)\n            output_file.write('BOT ++++ ' + response + '\\n')\n        output_file.write('=============================================\\n')\n    
    output_file.close()\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--mode', choices={'train', 'chat'},\n                        default='train', help=\"mode. if not specified, it's in the train mode\")\n    args = parser.parse_args()\n\n    if not os.path.isdir(config.PROCESSED_PATH):\n        data.prepare_raw_data()\n        data.process_data()\n    print('Data ready!')\n    # create checkpoints folder if there isn't one already\n    data.make_dir(config.CPT_PATH)\n\n    if args.mode == 'train':\n        train()\n    elif args.mode == 'chat':\n        chat()\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "assignments/chatbot/config.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nThis file contains the hyperparameters for the model.\n\nSee README.md for instruction on how to run the starter code.\n\"\"\"\n\n# parameters for processing the dataset\nDATA_PATH = 'data/cornell movie-dialogs corpus'\nCONVO_FILE = 'movie_conversations.txt'\nLINE_FILE = 'movie_lines.txt'\nOUTPUT_FILE = 'output_convo.txt'\nPROCESSED_PATH = 'processed'\nCPT_PATH = 'checkpoints'\n\nTHRESHOLD = 2\n\nPAD_ID = 0\nUNK_ID = 1\nSTART_ID = 2\nEOS_ID = 3\n\nTESTSET_SIZE = 25000\n\nBUCKETS = [(19, 19), (28, 28), (33, 33), (40, 43), (50, 53), (60, 63)]\n\n\nCONTRACTIONS = [(\"i ' m \", \"i 'm \"), (\"' d \", \"'d \"), (\"' s \", \"'s \"), \n\t\t\t\t(\"don ' t \", \"do n't \"), (\"didn ' t \", \"did n't \"), (\"doesn ' t \", \"does n't \"),\n\t\t\t\t(\"can ' t \", \"ca n't \"), (\"shouldn ' t \", \"should n't \"), (\"wouldn ' t \", \"would n't \"),\n\t\t\t\t(\"' ve \", \"'ve \"), (\"' re \", \"'re \"), (\"in ' \", \"in' \")]\n\nNUM_LAYERS = 3\nHIDDEN_SIZE = 256\nBATCH_SIZE = 64\n\nLR = 0.5\nMAX_GRAD_NORM = 5.0\n\nNUM_SAMPLES = 512\n"
  },
  {
    "path": "assignments/chatbot/data.py",
    "content": "\"\"\" A neural chatbot using sequence to sequence model with\nattentional decoder. \n\nThis is based on Google Translate Tensorflow model \nhttps://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/\n\nSequence to sequence model by Cho et al.(2014)\n\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\nThis file contains the code to do the pre-processing for the\nCornell Movie-Dialogs Corpus.\n\nSee readme.md for instruction on how to run the starter code.\n\"\"\"\nimport os\nimport random\nimport re\n\nimport numpy as np\n\nimport config\n\ndef get_lines():\n    id2line = {}\n    file_path = os.path.join(config.DATA_PATH, config.LINE_FILE)\n    print(config.LINE_FILE)\n    with open(file_path, 'r', errors='ignore') as f:\n        # lines = f.readlines()\n        # for line in lines:\n        i = 0\n        try:\n            for line in f:\n                parts = line.split(' +++$+++ ')\n                if len(parts) == 5:\n                    if parts[4][-1] == '\\n':\n                        parts[4] = parts[4][:-1]\n                    id2line[parts[0]] = parts[4]\n                i += 1\n        except UnicodeDecodeError:\n            print(i, line)\n    return id2line\n\ndef get_convos():\n    \"\"\" Get conversations from the raw data \"\"\"\n    file_path = os.path.join(config.DATA_PATH, config.CONVO_FILE)\n    convos = []\n    with open(file_path, 'r') as f:\n        for line in f.readlines():\n            parts = line.split(' +++$+++ ')\n            if len(parts) == 4:\n                convo = []\n                for line in parts[3][1:-2].split(', '):\n                    convo.append(line[1:-1])\n                convos.append(convo)\n\n    return convos\n\ndef question_answers(id2line, convos):\n    \"\"\" Divide the dataset into two sets: questions and answers. 
\"\"\"\n    questions, answers = [], []\n    for convo in convos:\n        for index, line in enumerate(convo[:-1]):\n            questions.append(id2line[convo[index]])\n            answers.append(id2line[convo[index + 1]])\n    assert len(questions) == len(answers)\n    return questions, answers\n\ndef prepare_dataset(questions, answers):\n    # create path to store all the train & test encoder & decoder\n    make_dir(config.PROCESSED_PATH)\n    \n    # random convos to create the test set\n    test_ids = random.sample([i for i in range(len(questions))],config.TESTSET_SIZE)\n    \n    filenames = ['train.enc', 'train.dec', 'test.enc', 'test.dec']\n    files = []\n    for filename in filenames:\n        files.append(open(os.path.join(config.PROCESSED_PATH, filename),'w'))\n\n    for i in range(len(questions)):\n        if i in test_ids:\n            files[2].write(questions[i] + '\\n')\n            files[3].write(answers[i] + '\\n')\n        else:\n            files[0].write(questions[i] + '\\n')\n            files[1].write(answers[i] + '\\n')\n\n    for file in files:\n        file.close()\n\ndef make_dir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass\n\ndef basic_tokenizer(line, normalize_digits=True):\n    \"\"\" A basic tokenizer to tokenize text into tokens.\n    Feel free to change this to suit your need. 
\"\"\"\n    line = re.sub('<u>', '', line)\n    line = re.sub('</u>', '', line)\n    line = re.sub('\\[', '', line)\n    line = re.sub('\\]', '', line)\n    words = []\n    _WORD_SPLIT = re.compile(\"([.,!?\\\"'-<>:;)(])\")\n    _DIGIT_RE = re.compile(r\"\\d\")\n    for fragment in line.strip().lower().split():\n        for token in re.split(_WORD_SPLIT, fragment):\n            if not token:\n                continue\n            if normalize_digits:\n                token = re.sub(_DIGIT_RE, '#', token)\n            words.append(token)\n    return words\n\ndef build_vocab(filename, normalize_digits=True):\n    in_path = os.path.join(config.PROCESSED_PATH, filename)\n    out_path = os.path.join(config.PROCESSED_PATH, 'vocab.{}'.format(filename[-3:]))\n\n    vocab = {}\n    with open(in_path, 'r') as f:\n        for line in f.readlines():\n            for token in basic_tokenizer(line):\n                if not token in vocab:\n                    vocab[token] = 0\n                vocab[token] += 1\n\n    sorted_vocab = sorted(vocab, key=vocab.get, reverse=True)\n    with open(out_path, 'w') as f:\n        f.write('<pad>' + '\\n')\n        f.write('<unk>' + '\\n')\n        f.write('<s>' + '\\n')\n        f.write('<\\s>' + '\\n') \n        index = 4\n        for word in sorted_vocab:\n            if vocab[word] < config.THRESHOLD:\n                break\n            f.write(word + '\\n')\n            index += 1\n        with open('config.py', 'a') as cf:\n            if filename[-3:] == 'enc':\n                cf.write('ENC_VOCAB = ' + str(index) + '\\n')\n            else:\n                cf.write('DEC_VOCAB = ' + str(index) + '\\n')\n\ndef load_vocab(vocab_path):\n    with open(vocab_path, 'r') as f:\n        words = f.read().splitlines()\n    return words, {words[i]: i for i in range(len(words))}\n\ndef sentence2id(vocab, line):\n    return [vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)]\n\ndef token2id(data, mode):\n    \"\"\" Convert all 
the tokens in the data into their corresponding\n    index in the vocabulary. \"\"\"\n    vocab_path = 'vocab.' + mode\n    in_path = data + '.' + mode\n    out_path = data + '_ids.' + mode\n\n    _, vocab = load_vocab(os.path.join(config.PROCESSED_PATH, vocab_path))\n    in_file = open(os.path.join(config.PROCESSED_PATH, in_path), 'r')\n    out_file = open(os.path.join(config.PROCESSED_PATH, out_path), 'w')\n    \n    lines = in_file.read().splitlines()\n    for line in lines:\n        if mode == 'dec': # we only care about '<s>' and </s> in decoder\n            ids = [vocab['<s>']]\n        else:\n            ids = []\n        ids.extend(sentence2id(vocab, line))\n        # ids.extend([vocab.get(token, vocab['<unk>']) for token in basic_tokenizer(line)])\n        if mode == 'dec':\n            ids.append(vocab['<\s>'])\n        out_file.write(' '.join(str(id_) for id_ in ids) + '\n')\n\ndef prepare_raw_data():\n    print('Preparing raw data into train set and test set ...')\n    id2line = get_lines()\n    convos = get_convos()\n    questions, answers = question_answers(id2line, convos)\n    prepare_dataset(questions, answers)\n\ndef process_data():\n    print('Preparing data to be model-ready ...')\n    build_vocab('train.enc')\n    build_vocab('train.dec')\n    token2id('train', 'enc')\n    token2id('train', 'dec')\n    token2id('test', 'enc')\n    token2id('test', 'dec')\n\ndef load_data(enc_filename, dec_filename, max_training_size=None):\n    encode_file = open(os.path.join(config.PROCESSED_PATH, enc_filename), 'r')\n    decode_file = open(os.path.join(config.PROCESSED_PATH, dec_filename), 'r')\n    encode, decode = encode_file.readline(), decode_file.readline()\n    data_buckets = [[] for _ in config.BUCKETS]\n    i = 0\n    while encode and decode:\n        if (i + 1) % 10000 == 0:\n            print(\"Bucketing conversation number\", i)\n        encode_ids = [int(id_) for id_ in encode.split()]\n        decode_ids = [int(id_) for id_ in decode.split()]\n 
       for bucket_id, (encode_max_size, decode_max_size) in enumerate(config.BUCKETS):\n            if len(encode_ids) <= encode_max_size and len(decode_ids) <= decode_max_size:\n                data_buckets[bucket_id].append([encode_ids, decode_ids])\n                break\n        encode, decode = encode_file.readline(), decode_file.readline()\n        i += 1\n    return data_buckets\n\ndef _pad_input(input_, size):\n    return input_ + [config.PAD_ID] * (size - len(input_))\n\ndef _reshape_batch(inputs, size, batch_size):\n    \"\"\" Create batch-major inputs. Batch inputs are just re-indexed inputs\n    \"\"\"\n    batch_inputs = []\n    for length_id in range(size):\n        batch_inputs.append(np.array([inputs[batch_id][length_id]\n                                    for batch_id in range(batch_size)], dtype=np.int32))\n    return batch_inputs\n\n\ndef get_batch(data_bucket, bucket_id, batch_size=1):\n    \"\"\" Return one batch to feed into the model \"\"\"\n    # only pad to the max length of the bucket\n    encoder_size, decoder_size = config.BUCKETS[bucket_id]\n    encoder_inputs, decoder_inputs = [], []\n\n    for _ in range(batch_size):\n        encoder_input, decoder_input = random.choice(data_bucket)\n        # pad both encoder and decoder, reverse the encoder\n        encoder_inputs.append(list(reversed(_pad_input(encoder_input, encoder_size))))\n        decoder_inputs.append(_pad_input(decoder_input, decoder_size))\n\n    # now we create batch-major vectors from the data selected above.\n    batch_encoder_inputs = _reshape_batch(encoder_inputs, encoder_size, batch_size)\n    batch_decoder_inputs = _reshape_batch(decoder_inputs, decoder_size, batch_size)\n\n    # create decoder_masks to be 0 for decoders that are padding.\n    batch_masks = []\n    for length_id in range(decoder_size):\n        batch_mask = np.ones(batch_size, dtype=np.float32)\n        for batch_id in range(batch_size):\n            # we set mask to 0 if the corresponding target is 
a PAD symbol.\n            # the corresponding decoder is decoder_input shifted by 1 forward.\n            if length_id < decoder_size - 1:\n                target = decoder_inputs[batch_id][length_id + 1]\n            if length_id == decoder_size - 1 or target == config.PAD_ID:\n                batch_mask[batch_id] = 0.0\n        batch_masks.append(batch_mask)\n    return batch_encoder_inputs, batch_decoder_inputs, batch_masks\n\nif __name__ == '__main__':\n    prepare_raw_data()\n    process_data()"
  },
  {
    "path": "assignments/chatbot/model.py",
    "content": "import time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport config\n\nclass ChatBotModel:\n    def __init__(self, forward_only, batch_size):\n        \"\"\"forward_only: if set, we do not construct the backward pass in the model.\n        \"\"\"\n        print('Initialize new model')\n        self.fw_only = forward_only\n        self.batch_size = batch_size\n\n    def _create_placeholders(self):\n        # Feeds for inputs. It's a list of placeholders\n        print('Create placeholders')\n        self.encoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='encoder{}'.format(i))\n                               for i in range(config.BUCKETS[-1][0])]\n        self.decoder_inputs = [tf.placeholder(tf.int32, shape=[None], name='decoder{}'.format(i))\n                               for i in range(config.BUCKETS[-1][1] + 1)]\n        self.decoder_masks = [tf.placeholder(tf.float32, shape=[None], name='mask{}'.format(i))\n                              for i in range(config.BUCKETS[-1][1] + 1)]\n\n        # Our targets are decoder inputs shifted by one (to ignore <GO> symbol)\n        self.targets = self.decoder_inputs[1:]\n\n    def _inference(self):\n        print('Create inference')\n        # If we use sampled softmax, we need an output projection.\n        # Sampled softmax only makes sense if we sample less than vocabulary size.\n        if config.NUM_SAMPLES > 0 and config.NUM_SAMPLES < config.DEC_VOCAB:\n            w = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])\n            b = tf.get_variable('proj_b', [config.DEC_VOCAB])\n            self.output_projection = (w, b)\n\n        def sampled_loss(logits, labels):\n            labels = tf.reshape(labels, [-1, 1])\n            return tf.nn.sampled_softmax_loss(weights=tf.transpose(w), \n                                              biases=b, \n                                              inputs=logits, \n                                              
labels=labels, \n                                              num_sampled=config.NUM_SAMPLES, \n                                              num_classes=config.DEC_VOCAB)\n        self.softmax_loss_function = sampled_loss\n\n        single_cell = tf.contrib.rnn.GRUCell(config.HIDDEN_SIZE)\n        self.cell = tf.contrib.rnn.MultiRNNCell([single_cell for _ in range(config.NUM_LAYERS)])\n\n    def _create_loss(self):\n        print('Creating loss... \\nIt might take a couple of minutes depending on how many buckets you have.')\n        start = time.time()\n        def _seq2seq_f(encoder_inputs, decoder_inputs, do_decode):\n            setattr(tf.contrib.rnn.GRUCell, '__deepcopy__', lambda self, _: self)\n            setattr(tf.contrib.rnn.MultiRNNCell, '__deepcopy__', lambda self, _: self)\n            return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(\n                    encoder_inputs, decoder_inputs, self.cell,\n                    num_encoder_symbols=config.ENC_VOCAB,\n                    num_decoder_symbols=config.DEC_VOCAB,\n                    embedding_size=config.HIDDEN_SIZE,\n                    output_projection=self.output_projection,\n                    feed_previous=do_decode)\n\n        if self.fw_only:\n            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(\n                                        self.encoder_inputs, \n                                        self.decoder_inputs, \n                                        self.targets,\n                                        self.decoder_masks, \n                                        config.BUCKETS, \n                                        lambda x, y: _seq2seq_f(x, y, True),\n                                        softmax_loss_function=self.softmax_loss_function)\n            # If we use output projection, we need to project outputs for decoding.\n            if self.output_projection:\n                for bucket in range(len(config.BUCKETS)):\n       
             self.outputs[bucket] = [tf.matmul(output, \n                                            self.output_projection[0]) + self.output_projection[1]\n                                            for output in self.outputs[bucket]]\n        else:\n            self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(\n                                        self.encoder_inputs, \n                                        self.decoder_inputs, \n                                        self.targets,\n                                        self.decoder_masks,\n                                        config.BUCKETS,\n                                        lambda x, y: _seq2seq_f(x, y, False),\n                                        softmax_loss_function=self.softmax_loss_function)\n        print('Time:', time.time() - start)\n\n    def _creat_optimizer(self):\n        print('Create optimizer... \\nIt might take a couple of minutes depending on how many buckets you have.')\n        with tf.variable_scope('training') as scope:\n            self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\n            if not self.fw_only:\n                self.optimizer = tf.train.GradientDescentOptimizer(config.LR)\n                trainables = tf.trainable_variables()\n                self.gradient_norms = []\n                self.train_ops = []\n                start = time.time()\n                for bucket in range(len(config.BUCKETS)):\n                    \n                    clipped_grads, norm = tf.clip_by_global_norm(tf.gradients(self.losses[bucket], \n                                                                 trainables),\n                                                                 config.MAX_GRAD_NORM)\n                    self.gradient_norms.append(norm)\n                    self.train_ops.append(self.optimizer.apply_gradients(zip(clipped_grads, trainables), \n                                              
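              global_step=self.global_step))\n                    print('Creating opt for bucket {} took {} seconds'.format(bucket, time.time() - start))\n                    start = time.time()\n\n\n    def _create_summary(self):\n        pass\n\n    def build_graph(self):\n        self._create_placeholders()\n        self._inference()\n        # _inference() only sets output_projection when sampled softmax is enabled;\n        # otherwise fall back to no projection and the default softmax loss.\n        if not hasattr(self, 'output_projection'):\n            self.output_projection = None\n            self.softmax_loss_function = None\n        self._create_loss()\n        self._creat_optimizer()\n        self._create_summary()"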
  },
  {
    "path": "assignments/chatbot/output_convo.txt",
    "content": "HUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ really?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ how're you?\nBOT ++++ i ' m fine .\nHUMAN ++++ where're you from?\nBOT ++++ california .\nHUMAN ++++ which city?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ you don't like california?\nBOT ++++ oh . . .\nHUMAN ++++ why not?\nBOT ++++ i can ' t .\nHUMAN ++++ how old are you?\nBOT ++++ eighteen .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how are you?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's your name?\nBOT ++++ what ?\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what's your name?\nBOT ++++ prisoner .\nHUMAN ++++ ha ha you really don't want to tell your name\nBOT ++++ i ' m not going to do it .\nHUMAN ++++ how old are you?\nBOT ++++ can you guess ?\nHUMAN ++++ eighteen?\nBOT ++++ in the bathroom .\nHUMAN ++++ in the bathroom?\nBOT ++++ in the next time .\nHUMAN ++++ why are you talking so funny?\nBOT ++++ i ' m looking .\nHUMAN ++++ where are you from?\nBOT ++++ general .\nHUMAN ++++ where are you from?\nBOT ++++ how ?\nHUMAN ++++ do you like the us?\nBOT ++++ i ' m sure .\nHUMAN ++++ do you vote?\nBOT ++++ i ' m afraid .\nHUMAN ++++ what are you afraid of?\nBOT ++++ i ' m looking .\nHUMAN ++++ what are you looking for?\nBOT ++++ can you guess ?\nHUMAN ++++ no\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . 
what ' s your name ?\nHUMAN ++++ i'm chip\nBOT ++++ please .\nHUMAN ++++ please what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ where are you from?\nBOT ++++ california .\nHUMAN ++++ do you like california?\nBOT ++++ i ' m okay .\nHUMAN ++++ i like california a lot\nBOT ++++ of what ?\nHUMAN ++++ i live here\nBOT ++++ and what is it ?\nHUMAN ++++ california\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ i'm a student here\nBOT ++++ and get it out of here .\nHUMAN ++++ what do you like to do in your free time?\nBOT ++++ and what about the gun ?\nHUMAN ++++ do you have a gun?\nBOT ++++ yes .\nHUMAN ++++ you're dangerous!\nBOT ++++ i ' m not .\nHUMAN ++++ do you go hunting?\nBOT ++++ i can do .\nHUMAN ++++ what do you hunt?\nBOT ++++ virgin .\nHUMAN ++++ that's very bad\nBOT ++++ and you in love ?\nHUMAN ++++ no i'm not\nBOT ++++ please - -\nHUMAN ++++ okay, i am\nBOT ++++ and you are good .\nHUMAN ++++ yeah i am good too.\nBOT ++++ good good to you .\nHUMAN ++++ are you in love?\nBOT ++++ i ' m ready .\nHUMAN ++++ ready for what?\nBOT ++++ for what ?\nHUMAN ++++ for what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ tell me\nBOT ++++ i can do it .\nHUMAN ++++ okay, tell me what you're ready for\nBOT ++++ i ' m doing you .\nHUMAN ++++ no you're not\nBOT ++++ please !\nHUMAN ++++ this is a strange conversation\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ are you a guy?\nBOT ++++ of what ?\nHUMAN ++++ are you a boy?\nBOT ++++ i ' m fine .\nHUMAN ++++ are you a man?\nBOT ++++ i ' m a little .\nHUMAN ++++ what's your name?\nBOT ++++ chris -\nHUMAN ++++ chris who?\nBOT ++++ cornelius .\nHUMAN ++++ so i guess you're a guy.\nBOT ++++ i am .\nHUMAN ++++ okay\nBOT ++++ okay ?\nHUMAN ++++ do you have a girlfriend?\nBOT ++++ i ' m sure .\nHUMAN ++++ what's her name?\nBOT ++++ let ' s talk about something else .\n=============================================\nHUMAN ++++ hi chris\nBOT ++++ hi\nHUMAN ++++ how're you today?\nBOT ++++ okay .\nHUMAN ++++ the 
weather is nice today\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ just say the weather is nice\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi\nHUMAN ++++ how're you today?\nBOT ++++ okay .\nHUMAN ++++ just okay?\nBOT ++++ i ' m okay .\nHUMAN ++++ what's on your mind?\nBOT ++++ i ' m in my room .\nHUMAN ++++ doing what?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ you're boring.\nBOT ++++ i ' m not .\nHUMAN ++++ prove it\nBOT ++++ and what ?\nHUMAN ++++ what what?\nBOT ++++ sorry , i ' m still very stupid .\nHUMAN ++++ it's okay. we're all stupid sometimes.\nBOT ++++ ' s good to see .\nHUMAN ++++ what's good to see?\nBOT ++++ i ' m in .\nHUMAN ++++ you're in your room?\nBOT ++++ i ' m here .\nHUMAN ++++ where's your room?\nBOT ++++ i ' m in my room .\nHUMAN ++++ are you alone?\nBOT ++++ i ' m fine .\nHUMAN ++++ are you by yourself?\nBOT ++++ in the morning .\nHUMAN ++++ what time is it?\nBOT ++++ # # # # # # # .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . 
what ' s your name ?\nHUMAN ++++ i'm marie\nBOT ++++ you are .\nHUMAN ++++ yes i am\nBOT ++++ and you are my friend .\nHUMAN ++++ yes you are \nBOT ++++ and you are my best .\nHUMAN ++++ that's so sweet\nBOT ++++ please .\nHUMAN ++++ do you know what time it is?\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ what time is it?\nBOT ++++ 5 9 6 9 1 4 5 .\nHUMAN ++++ it doesn't make any sense\nBOT ++++ are you ?\nHUMAN ++++ i do \nBOT ++++ and you are ?\nHUMAN ++++ i'm a human.\nBOT ++++ you in .\nHUMAN ++++ are you in trouble?\nBOT ++++ i ' m in .\nHUMAN ++++ what trouble?\nBOT ++++ i ' m in love .\nHUMAN ++++ wow that's nice.\nBOT ++++ like that ?\nHUMAN ++++ yeah\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ who are you?\nBOT ++++ i don ' t understand .\nHUMAN ++++ who are you?\nBOT ++++ no idea .\nHUMAN ++++ what's your name?\nBOT ++++ chris -\nHUMAN ++++ chris who?\nBOT ++++ cornelius .\nHUMAN ++++ where are you from?\nBOT ++++ california .\nHUMAN ++++ are you american?\nBOT ++++ i ' m fine .\nHUMAN ++++ of course you're fine.\nBOT ++++ and you ?\nHUMAN ++++ i'm okay.\nBOT ++++ you are .\nHUMAN ++++ yes, i am.\nBOT ++++ and you are my friend .\nHUMAN ++++ yes, i am.\nBOT ++++ and you are my friend .\nHUMAN ++++ that's correct.\nBOT ++++ and you ' re in trouble ?\nHUMAN ++++ no, i'm not. 
why did you think so?\nBOT ++++ because it ' s a good idea .\nHUMAN ++++ what is your good idea?\nBOT ++++ i ' ll be fine .\n=============================================\nHUMAN ++++ hello\nBOT ++++ and you are in trouble ?\nHUMAN ++++ what did you say that?\nBOT ++++ ' s life .\nHUMAN ++++ i'm not in trouble.\nBOT ++++ what the hell is that ?\nHUMAN ++++ okay, what trouble?\nBOT ++++ i ' m looking .\nHUMAN ++++ why did you say i'm in trouble?\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ what do you want to talk about?\nBOT ++++ i can ' t .\nHUMAN ++++ what can't?\nBOT ++++ i ' ll have a drink .\nHUMAN ++++ okay, me too.\nBOT ++++ you were here ?\nHUMAN ++++ i'm still here.\nBOT ++++ what is it ?\nHUMAN ++++ i thought you wanted a drink.\nBOT ++++ i ' m not .\nHUMAN ++++ you just said so!\nBOT ++++ i was wrong .\nHUMAN ++++ it's fine.\nBOT ++++ okay ?\nHUMAN ++++ yeah\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ do you drink a lot?\nBOT ++++ i can do it .\nHUMAN ++++ do you drink often?\nBOT ++++ in the time .\nHUMAN ++++ that doesn't make any sense.\nBOT ++++ are you ?\nHUMAN ++++ what do you drink?\nBOT ++++ ' s in a drink .\nHUMAN ++++ do you drink alcohol?\nBOT ++++ in time .\nHUMAN ++++ i like beer.\nBOT ++++ oh . . .\nHUMAN ++++ don't you like beer?\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ where do you live?\nBOT ++++ up .\nHUMAN ++++ up where?\nBOT ++++ in the hospital .\nHUMAN ++++ i'm sorry. are you okay?\nBOT ++++ i ' m fine .\nHUMAN ++++ why are you in the hospital?\nBOT ++++ and what ?\nHUMAN ++++ how many words do you know?\nBOT ++++ i ' d do it .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ how many?\nBOT ++++ i ' m not .\nHUMAN ++++ whatever\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ you're boring.\nBOT ++++ i ' m not .\nHUMAN ++++ yes you are.\nBOT ++++ and you are my friend .\nHUMAN ++++ no i'm not.\nBOT ++++ please .\nHUMAN ++++ okay fine. 
i'm your friend.\nBOT ++++ you ' re my mother ?\nHUMAN ++++ i have no son like you.\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ i have no son.\nBOT ++++ i see you .\nHUMAN ++++ i have no children.\nBOT ++++ you are .\nHUMAN ++++ omg you're so dumb.\nBOT ++++ i ' m not .\nHUMAN ++++ you're stupid!\nBOT ++++ i ' m not .\nHUMAN ++++ do you always talk like this?\nBOT ++++ of course .\nHUMAN ++++ how old are you?\nBOT ++++ eighteen .\nHUMAN ++++ you're big enough.\nBOT ++++ i ' m not .\nHUMAN ++++ are you an adult?\nBOT ++++ i ' m going .\nHUMAN ++++ you're going to be an adult?\nBOT ++++ i ' m looking for you .\nHUMAN ++++ i'm in my room.\nBOT ++++ and your mother ?\nHUMAN ++++ she's in vietnam.\nBOT ++++ you ' re here .\nHUMAN ++++ yes i'm here.\nBOT ++++ and what are you talking about ?\nHUMAN ++++ you asked me where i am.\nBOT ++++ and you are your friend .\nHUMAN ++++ i need to go now.\nBOT ++++ i will not .\nHUMAN ++++ that's fine. you can stay. i'm leaving.\nBOT ++++ oh . . .\nHUMAN ++++ bye\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ goodbye\nBOT ++++ i don ' t know what to say .\nHUMAN ++++ bye bye\nBOT ++++ okay ?\nHUMAN ++++ see you later\nBOT ++++ i don ' t know what to say .\n=============================================\nHUMAN ++++ hi\nBOT ++++ hi . what ' s your name ?\nHUMAN ++++ i'm maria\nBOT ++++ please .\nHUMAN ++++ i'm jackson\nBOT ++++ you are .\nHUMAN ++++ i'm jack\nBOT ++++ let ' s talk about something else .\nHUMAN ++++ i have no go\nBOT ++++ you are ?\nHUMAN ++++ i have to go\nBOT ++++ and what do you mean ?\nHUMAN ++++ i'm leaving.\nBOT ++++ what ?\nHUMAN ++++ yeah, bye\nBOT ++++ bye . 
have a nice day !\n=============================================\nHUMAN ++++ that's very bad.\nBOT ++++ i ' ve got to work .\nHUMAN ++++ what do you do?\nBOT ++++ i ' ll be there .\nHUMAN ++++ where do you work?\nBOT ++++ in the trunk .\nHUMAN ++++ are you a machenics?\nBOT ++++ i ' m not .\nHUMAN ++++ what are you?\nBOT ++++ no idea .\n=============================================\n"
  },
  {
    "path": "assignments/trump_bot/trump_tweets.txt",
    "content": "'State works hard and illegally for Clinton' #DrainTheSwamp __HTTP__ _E_\nRT @IvankaTrump: Touched by the warm hospitality of Prime Minister Abe and the Japanese people. ありがとうございます [Thank you]! Until next time ... _E_\nSince Congress can't get its act together on HealthCare I will be using the power of the pen to give great HealthCare to many people FAST _E_\nI always said @BarackObama will attack Iran in some form prior to the election. _E_\nToday I am working on my 'big surprise' for the @RNC convention. Everyone will love it. _E_\nWhat a shock! The U.S. Capitol Christmas tree pays homage @BarackObama but failed to mention Jesus. _E_\nMaking America Safe is my number one priority. We will not admit those into our country we cannot safely vet. __HTTP__ _E_\nRepubs must not allow Pres Obama to subvert the Constitution of the US for his own benefit & because he is unable to negotiate w/ Congress. _E_\nTell Iran to let our Christian Pastor go and I mean right now. If they don't there will be hell to pay. _E_\nMan shot inside Paris police station. Just announced that terror threat is at highest level. Germany is a total mess big crime. GET SMART! _E_\nThank you! __HTTP__ _E_\nI am now inspecting the Old Post Office on Pennsylvania Avenue will be a great hotel. Soon off to the Oklahoma State Fair! _E_\nJust a few more days until the 13th season of All Star @CelebApprentice premieres. Be sure to tune in this Sunday at 9PM on @nbc. Big! _E_\nLook forward to being in Phoenix tomorrow at 2:00 P.M. Hottest ticket in entire country. Was supposed to be 500 people now many thousands! _E_\nThe Al Frankenstien picture is really bad speaks a thousand words. Where do his hands go in pictures 2 3 4 5 & 6 while she sleeps? ..... _E_\nKarl Rove's strategy and commercials were the worst I have ever seen. _E_\n.@lindseygraham who had zero in his presidential run before dropping out in disgrace saying the most horrible things about me on @FoxNews. 
_E_\nGreat Concert at 4:00 P.M. today at Lincoln Memorial. Enjoy! _E_\n\"Donald Trump: Mitt Romney 'Blew It' Shouldn't Run Again\" __HTTP__ via @Newsmax_Media by @OwenTew _E_\n.@HillaryClinton talking about jobs? Remember what she promised upstate New York. #BigLeagueTruth#Debates __HTTP__ _E_\n.@IsraeliPM @netanyahu is a resolute leader. When he sets a red line it stands! _E_\nThank you Idaho! I love your potatoes nobody grows them better. As President I will protect your market. __HTTP__ _E_\nA bad thing finally happened to Derek Jeter he is a great champion. _E_\nObama told his donors this past week \"public opinion\" is on his side. Don't believe that one either. _E_\nJoin me this Friday in Pensacola Florida at the Pensacola Bay Center! Tickets: __HTTP__ __HTTP__ _E_\nWow record setting cold temperatures throughout large parts of the country. Must be global warming I mean climate change! _E_\n...Colin Powell thought Iraq has weapons of mass destruction. _E_\nReview your work habits & make sure they are taking you in the right direction. Don't tread water get out there and go for it. _E_\nPresident said we would never leave a soldier behind. How about the 4 who died in Benghazi? _E_\nThe point is: the Chinese are smart they respond to economic pressure and they know they're not going to get (cont) __HTTP__ _E_\nLook forward to introducing Governor Mike Pence (who has done a spectacular job in the great State of Indiana). My first choice from start! _E_\nIf Stuart Stevens' book is as bad as his horrible political advice to Mitt Romney don't waste your money. Arrogant guy but a zero! _E_\nSnowboarder/Skateboarder @Shaun_White stopped by to visit this week.... __HTTP__ _E_\n#MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_\n#TBT With @britneyspears __HTTP__ _E_\nWhy is @RandPaul allowed to take advantage of the people of Kentucky by running for Senator and Pres. Why should Kentucky be back up plan? 
_E_\nWe are about to have a record $500B trade deficit with the Chinese this year. That money should be back here financing jobs in America. _E_\nTogether we are MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nLittle @MacMiller I'm now going to teach you a big boy lesson about lawsuits and finance. You ungrateful dog! _E_\n\"The way to get started is to quit talking and begin doing.\" Walt Disney _E_\nIn the coming months and years ahead I look forward to building an even STRONGER relationship between the United States and China. __HTTP__ _E_\nWho was it that secretly said to Russian President Tell Vladimir that after the election I'll have more flexibility? @foxandfriends _E_\nI will be interviewed on @meetthepress this morning. Enjoy! _E_\nWacko pervert @AnthonyWeiner's idea of Hispanic outreach is using Carlos Danger as his sexting. He's an insensitive racist. _E_\nRepublicans should not negotiate against themselves again with @BarackObama in today's debt talks First and foremost CUTCAP and BALANCE. _E_\n'Small business optimism soars after Trump election' __HTTP__ _E_\nWe need a great leader now! __HTTP__ _E_\nI am going to Trump National Doral in Miami this week to check out the $250 million renovation. In construction always watch the money! _E_\nA vote for Hillary Clinton is a vote for another generation of poverty high crime & lost opportunities. #ImWithYou __HTTP__ _E_\n.@MattGinellaGC Matt the statement about Pinehurst looking like a local community golf course awful was not made by me but tweeted to me _E_\nOur hearts are with all affected by the wildfires in California. God bless our brave First Responders and @FEMA team. We support you! __HTTP__ _E_\nHopefully others will follow suit. Our country needs & should demand security. It is time to get tough & be smart! _E_\nHeading to North Carolina for two big rallies. Will be there soon. We will bring jobs back where they belong! 
_E_\n.@ThrillistChi named @SixteenChicago @TrumpChicago one of the \"best value Michelin starred restaurants in Chicago\" __HTTP__ _E_\nPresident Obama wants to change the name of Mt. McKinley to Denali after more than 100 years. Great insult to Ohio. I will change back! _E_\nObama is looking rhetorical and weak. @MittRomney is looking strong and sharp. _E_\nGreat to see the construction of the Old Post Office on Penn Ave. Going fast under budget ahead of schedule! _E_\nThank you New York and Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nJoe McQuaid (@deucecrew) is desperately trying to sell the @UnionLeader. It's a loser and my comments haven't helped him much. _E_\nWe know who did the hoax of James Gandolfini and ObamaCare. Be careful Mister. _E_\nA message of condolences and support regarding the terrorist attacks in Tel Aviv: __HTTP__ _E_\n...One point I made sure to stress at @LibertyU is to be sure to get even with anyone who crosses you... _E_\nObamaCare Story of the Day: \"Florida Cancer Patient Loses Insurance During Treatment B/C of ObamaCare\" __HTTP__ _E_\nNow that Ken Frazier of Merck Pharma has resigned from President's Manufacturing Councilhe will have more time to LOWER RIPOFF DRUG PRICES! _E_\nWishing everyone a safe and Happy Halloween!#Halloween2017 __HTTP__ _E_\nSo many tweets & stories on Stewart/Pattinson Look it doesn't matter the relationship will never be the same. It is permanently broken. _E_\nIt is finally sinking through. 46% OF PEOPLE BELIEVE MAJOR NATIONAL NEWS ORGS FABRICATE STORIES ABOUT ME. FAKE NEWS even worse! Lost cred. _E_\nDrew Peterson just got 36 years for killing his wife bring back the death penalty! _E_\nFake News CNN and NBC are going out of their way to disparage our great First Responders as a way to get Trump. Not fair to FR or effort! _E_\nDon't forget to tune in tonight for the two hour premiere of The Apprentice. 9 pm EST on NBC. We're all in for a fantastic new season! 
_E_\n#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_\nJoin me this weekend! #NYPrimary4/16: SYRACUSE NOON __HTTP__ WATERTOWN 3pm __HTTP__ #Trump2016 _E_\nI met some really great Air Force GENERALS and Navy ADMIRALS today talking about airplane capability and pricing. Very impressive people! _E_\nTake your work seriously take yourself less seriously. It's a great recipe for some good times & great memories. _E_\nLightweight A.G. Eric Schneiderman who has been a total failure in office failed to report the 98% approval rating of students for courses _E_\nEffective today my administration officially declared the #OpioidCrisis a NATIONAL PUBLIC HEALTH EMERGENCY under federal law. __HTTP__ _E_\nWill be doing @foxandfriends live tomorrow at 7AM ET from Europe. _E_\nRT @DonaldJTrumpJr: Happy new year everyone. #newyear #family #vacation #familytime __HTTP__ _E_\nIf our healthcare plan is approved you will see real healthcare and premiums will start tumbling down. ObamaCare is in a death spiral! _E_\nWishing a Happy Father's Day to all the Dad's out there YOU are a champion today and everyday! __HTTP__ _E_\nCan you imagine with all of the talk about ObamaCare technical breakdowns made it a disastrous day. Our government is badly broken! _E_\nTrump Nat'l Golf Club Philadelphia 360 beautiful acres as designed by Tom Fazio with views of the Philly skyline. __HTTP__ _E_\nVia @IBDeditorials: \"Most Americans Label Obama Presidency A Failure\" __HTTP__ _E_\n#WeeklyAddress __HTTP__ __HTTP__ _E_\nAfter climbing a great hill one only finds that there are many more hills to climb. Nelson Mandela _E_\nI'll be on @americanowradio with Andy Dean at 6:30 ET today talking about last night's @FoxNews debate. __HTTP__ _E_\nVia @EveningExpress: Images of Donald Trump's 2nd North east golf course released: Public have say on images __HTTP__ _E_\nMany people are saying it was wonderful that Mrs. 
Obama refused to wear a scarf in Saudi Arabia but they were insulted.We have enuf enemies _E_\nEntrepreneurs: Identify your goals. Know precisely what you want to achieve. Have your own vision and stick with it! _E_\nExperience is a hard teacher because she gives the test first the lesson afterwards. Vernon Sanders Law _E_\n...big unnecessary regulation cuts made it all possible\" (among many other things). \"President Trump reversed the policies of President Obama and reversed our economic decline.\" Thank you Stuart Varney. @foxandfriends _E_\nWill be interviewed on @oreillyfactor tonight at 8:00 P.M. _E_\n#ICYMI OHIO RALLY!Watch here: __HTTP__ __HTTP__ _E_\nDopey Sugar @Lord_Sugar Bad ratings come on keep making me money remember I own your show. _E_\nChina hacked the U.S. Chamber of Commerce and now has the information of all 3 million members. China keeps (cont) __HTTP__ _E_\nCheck out the recent Editorial in the Wall Street Journal @WSJ about what a complete disaster the @CFPB has been under its leader from previous Administration who just quit! _E_\n\"You have to set higher and higher goals. You have to want more or you will start slipping backwards fast.\" – Think BIG _E_\nThe only @Forbes Five Star & @fivediamond hotel in NYC @TrumpNewYork is the definition of luxury __HTTP__ The Best! _E_\nThis is really unfair and a conflict for all the other candidates. I said it should not be allowed and ABC agreed. _E_\nEliot better have a great pre nup—I want to help Silda in her negotiation. _E_\nCongratulations to Gretchen Carlson on her big move to hosting an afternoon solo show this fall on @FoxNews. _E_\nThe Clinton News Network sometimes referred to as @CNN is getting more and more biased.They act so indignant hear them behind closed doors _E_\n.@DonaldJTrumpJr and @EricTrump with @HulkHogan Great shot! __HTTP__ _E_\nI predicted Apple's stock fall based on their dumb refusal to give the option of a larger iPhone screen like Samsung. 
I sold my Apple stock _E_\n.@PapaJohns CEO John Schnatte has told shareholders that ObamaCare will force him to raise pizza prices __HTTP__ REPEAL! _E_\nImportant day spent at Camp David with our very talented Generals and military leaders. Many decisions made including on Afghanistan. _E_\nThe unforgivable crime is soft hitting. Do not hit at all if it can be avoided but never hit softly. Theodore Roosevelt _E_\nThank you Las Vegas Nevada!#Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_\nThe entire country is FREEZING we desperately need a heavy dose of global warming and fast! Ice caps size reaches all time high. _E_\nMy appearances on @todayshow __HTTP__ and @gma __HTTP__ _E_\nRecord cold temperatures in July 20 to 30 degrees colder than normal. What the hell happened to GLOBAL WARMING? _E_\nThe politicians of the U.K. should watch Katie Hopkins of Daily __HTTP__ on @FoxNews. Many people in the U.K. agree with me! _E_\nCongrats to Miss Universe 2011 @RealLeilaLopes & @Giant great @OsiUmenyiora on their engagement! I am very happy for you both. _E_\nObama deserves much less credit for the killing of Bin Laden. The praise goes to our brave military and intelligence officers. _E_\nFlashback – Jeb Bush says illegal immigrants breaking our laws is an \"act of love\" __HTTP__ He will never secure the border. _E_\nJoin @AmerIcan32 founded by Hall of Fame legend @JimBrownNFL32 on 1/19/2017 in Washington D.C.... __HTTP__ _E_\nRE: Michael Jackson: He was a great friend and a spectacular entertainer. It's a devastating loss! Donald J. Trump _E_\nVia @AP's: ObamaCare is a tax __HTTP__ @BarackObama gave the largest tax increase in history on the middle class. Shameful! _E_\nI'm honored to be presented the award of Doctor of Business Administration Honoris Causa from Robert Gordon University in Aberdeen Scotland _E_\nGreat job by the FBI Boston Police and all others involved start the trial tonight! 
_E_\nRising premium costs from Obamacare will cost businesses billions __HTTP__ Guess where these new costs get passed to – you. _E_\nOur spa @TrumpSoHo gets a nice write up in @DETAILS: #gotmilk _E_\nMore hysterical DSRL videos featuring Donald Trump and Double Trump plus enter Golden Lick Race Sweepstakes: __HTTP__ _E_\nSo proud of @FEMA Military and First Responders! Thank you! __HTTP__ _E_\nWow! Thank you Louisville Kentucky! #VoteTrump on 3/5/2016! Lets #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_\nBy @kwrcrow: \"NY Post caught 'LYING' Again!\" __HTTP__ ... The Donald\" should go far. Actually if I run I'll win. _E_\nHappy and proud to help @MittRomney win Ohio with robo calls in pivotal Cuyahoga County _E_\n...fired. This story is totally made up by the dishonest media.The Chief is doing a FANTASTIC job for me and more importantly for the USA! _E_\nNow China is helping Iran smuggle nuclear parts __HTTP__ . China is not an ally but our country's greatest threat & rival. _E_\nVia @TheTodaysGolfer \"@TrumpScotland gets new clubhouse\" __HTTP__ _E_\nJust departing La Crosse Wisconsin. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_\nThousands of US warplanes ships and missiles contain fake electronic components from China leaving them open (cont) __HTTP__ _E_\nI will be interviewed on @CNN at 7:00 A.M. _E_\nThe biggest business people have used the bankruptcy laws to their advantage Warren B Icahn Kravis and this week John Paulson for haters! _E_\nMy exclusive @WSOC_TV interview with @BlairMiller9 discussing Trump National North Carolina & future deals __HTTP__ _E_\nHappy New Year to all my Jewish friends. _E_\n51 Million American to travel this weekend highest number in twelve years (AAA). Traffic and airports are running very smoothly! @FoxNews _E_\nLetterman @Late_Show was great last night. I had a lot of fun. You could see his audience really wanted Obama to take the $ for charity. _E_\nJust interviewed by @LouDobbs. 
Will be aired tonight at 7pmE on @FoxBusiness. #Dobbs _E_\nLooking forward to being hosted & interviewed next Monday by David Rubenstein at the @TheEconomicClub __HTTP__ _E_\nSuccess comes with hard work focus and luck. The luck comes to those who seek it out. If you are not in the game you cannot get lucky. _E_\nWill be interviewed on @Morning_Joe at 7:3O. Enjoy! _E_\nWe should look to China where big time pollution takes place as they manufacture inefficient and costly wind turbines for Scotland! _E_\nStock Market hits another all time high on Friday. 5.3 trillion dollars up since Election. Fake News doesn't spent much time on this! _E_\nThe Failing @nytimes in a story by Peter Baker should have mentioned the rapid terminations by me of TPP & The Paris Accord & the fast.... _E_\nVia Business Insider: Donald Trump's Poll Dominance in 2 Key States is Mind Blowing __HTTP__ _E_\nThe American economy would grow if Washington didn't keep threatening higher taxes and more regulations. Government is not the solution. _E_\nThis year's Trump Miss Universe Pageant is comprised of truly beautiful women.Will be simulcast live December 19th on @nbc and @Univision. _E_\nThe Democrats don't want money from budget going to border wall despite the fact that it will stop drugs and very bad MS 13 gang members. _E_\nGod bless all the brave souls who perished 12 years ago today. You will never be forgotten! _E_\nThe people of Buffalo should be happy Terry Pegula got the team but I hope he does better w/the Bills than he has w/the Sabres. Good luck! _E_\nJust left Liberty University. Chancellor Jerry Falwell Jr.& his father have done an amazing job...great school & the students were fantastic _E_\nCan you believe that the corrupt and pathetic South Africa police force has yet to arrest the sign language guy. Such danger give 10 years! _E_\nToday we honor the fallen at #PearlHarbor 74 years ago today. If you see a vet today thank them! 
#RememberOurVets __HTTP__ _E_\nRT @foxandfriends: STILL AHEAD: @realDonaldTrump joins us at 7am/et! #RNCinCLE __HTTP__ _E_\nThe Emmys are all politics that's why despite nominations The Apprentice never won even though it should have many times over. _E_\nRT @PChowka: Fox News With Hannity's Help Regains Its Ratings Dominance By Peter Barry Chowka at The Hagmann report __HTTP__ _E_\nWe must never bend too much. Yitzhak Shamir (1915 2012) __HTTP__ _E_\nI will be doing @colbertlateshow at 11:30 on CBS. Enjoy! __HTTP__ _E_\nObama's new campaign ad defends Solyndra __HTTP__ I guess losing $500M is a cause for celebration for @BarackObama. _E_\nHow much longer will the failing nytimes with its big losses and massive unfunded liability (and non existent sources) remain in business? _E_\nRT @VP: Went to the Senate today to say @POTUS & I fully support Graham Cassidy plan to repeal/replace Obamacare. Let's get this done. __HTTP__ _E_\n...@Lord_Sugar You need the income from the show to keep going hope it doesn't hurt. _E_\nGod never takes away something from your life without replacing it with something better. Rev. @BillyGraham _E_\nUnder Obama Iran has taken over Iraq Al Qaeda has taken over Libya the Muslim Brotherhood now controls Egypt. Worst foreign policy ever. _E_\nGlad everyone could see Mar a Lago last night on @datelinenbc. It is the crown jewel of Palm Beach. _E_\n.@MissUniverse visited my office tall and beautiful! __HTTP__ _E_\nRemember when @BarackObama promised you could keep your coverage? Study shows 1 in 10 employers will drop health care __HTTP__ _E_\nHillary said that guns don't keep you safe. If she really believes that she should demand that her heavily armed bodyguards quickly disarm! _E_\nThe 9/11 trials at Gitmo over the weekend were a disaster. Can you imagine how much worse it would be if @BarackObama tried them in NYC? _E_\nJOBS JOBS JOBS! 
__HTTP__ _E_\nNPR's @NealConan said schlonged to WaPo re: 1984 Mondale/Ferraro campaign: That ticket went on to get schlonged at the polls. #Hypocrisy _E_\nWhich campaign is possibly on the trajectory towards insolvency? __HTTP__ At least @BarackObama is consistent. _E_\nThe invisible hand of the market always moves faster and better than the heavy hand of government. @MittRomney _E_\nThe new reality. China's economy 'underpins' global demand __HTTP__ Our leaders just watched as China took full control. _E_\nDid you know that one of seven Americans is now on food stamps? Think of it. In the United States the most pr... (cont) __HTTP__ _E_\nI still can't get over how the Republicans—my friends—spent hundreds of millions of dollars on such terrible & ineffective ads. _E_\nThank you! #MakeAmericaGreatAgain __HTTP__ _E_\nWow 30000 e mails were deleted by Crooked Hillary Clinton. She said they had to do with a wedding reception. Liar! How can she run? _E_\nDo you notice that nobody is talking about the many scandals of the Obama administration anymore The Teflon President! _E_\nCrooked Hillary wants to get rid of all guns and yet she is surrounded by bodyguards who are fully armed. No more guns to protect Hillary! _E_\nMay the Festival of Lights bring our Jewish friends from around the world health & happiness! Happy Hanukkah! __HTTP__ _E_\nFox & Friends at 7.00 _E_\nIt is time to take care of OUR people to rebuild OUR NATION and to fight for OUR GREAT AMERICAN WORKERS! #TaxReform #USA __HTTP__ _E_\nIn his entire political career @BarackObama has never had a tough @GOP opponent before @MittRomney. He is a paper tiger. #GOMITT _E_\nBill Clinton wants to #MakeAmericaGreatAgain __HTTP__ _E_\nFailure defeats losers failure inspires winners. Robert T. Kiyosaki@theRealKiyosaki _E_\nI have never met a successful person that was a quitter. Successful people never ever give up! _E_\nOur debt is about to top $17T. ObamaCare and China (& others) are killing American business. 
_E_\nWill be interviewed on @NewDay on @CNN at 7:15 A.M. _E_\nMy @foxandfriends interview re: Muslim Brotherhood taking over Egypt our vast natural gas resources & US tax system __HTTP__ _E_\nFor all of those who have been asking a big cast announcement coming soon for @ApprenticeNBC! _E_\nSometimes there is justice. A Chinese military newspaper was hacked. __HTTP__ _E_\nVia @UnionLeader BY Bill Smith: \"GOP rally in Manchester fires up party faithful\" __HTTP__ _E_\nA tough week was had by@MittRomney but he's come back from adversity before. _E_\nCUT CAP AND BALANCE. TAXED ENOUGH ALREADY! _E_\n.@GOP has leverage. Must stay united & on message. _E_\nVia @PPDNews: Donald Trump: 'I Am Not Doing This For Fun' We Can't Fix U.S. 'Unless We Put Right Person' In WH __HTTP__ _E_\nTrump Golf Links at Ferry Point in the Bronx NY will open soon. A Jack Nicklaus Signature Design. Beautiful. __HTTP__ _E_\nUnder @BarackObama the Iranian nuclear program has rapidly grown. __HTTP__ _E_\nVia Int'l Business Times: Jeb Bush Got $1.3M Job at Lehman After Florida Shifted Pension Cash To Bank. __HTTP__ _E_\nChina is complaining about 2500 marines being placed in Australia. Meanwhile they are building bases across Latin America. #TimeToGetTough _E_\nDonald Trump trademarked Reagan slogan & would like to stop other Republicans from using it __HTTP__ via @businessinsider _E_\nIs Roger Simon @politicoroger ever right about anything? Now he's attacking @BillClinton in defense of (cont) __HTTP__ _E_\nGO VOTE FROM NOW TO 8:30 P.M. NEVADA. I WILL BE AT VARIOUS CAUCUS SITES. MAKE AMERICA GREAT AGAIN! _E_\nRT @TheFive: @POTUS being unpredictable is a big asset North Korea knew exactly what President Obama was going to do. @jessebwatters _E_\nA pessimist is one who makes difficulties of his opportunities... _E_\nGlad to hear @seanhannity supports my offer to Obama. As Sean says \"it is an easy $5 million to charity. 
What does Obama have to lose?\" _E_\nMy interview on @WOR710 with Jon Gambling discussing #TimeToGetTough meeting @NewtGingrich and the 2012 election __HTTP__ _E_\nNext year will be an interesting one. I look forward to running against Hillary Clinton a totally flawed candidate and beating her soundly _E_\n.@AustinKaiser52 The 2 people I am most excited to hear speak on Thrursday at @CPACnews is @GovChristie & @realDonaldTrump #DCBound Thanks. _E_\nNow is the time to buy housing before values have fully recovered. In 5 years remember I told you so. _E_\nThank you to the men and women of Fort Myer and every member of the U.S. Military at home and abroad. #USA __HTTP__ _E_\nOur country should be worried about nuclear control far more than gun control & that one's not even close! _E_\nFrom Donald Trump: \"I'm so proud of my wife Melania and the launch of her new jewelry line to debut on QVC on April 30th at 9 p.m.\" _E_\nInternal polling shows that I would swamp @RobAstorino in a NY Republican primary 77% to 23%. But won't run if party is not unified. _E_\nVia @swan_investor by @Forbes: \"The Trump Card: Make America Great Again\" __HTTP__ _E_\nJoin me in Roanoke Virginia tomorrow at the Berglund Center Coliseum ~ 6pm! Tickets available at:... __HTTP__ _E_\nGreat to see Sec. Clinton leaving the hospital yesterday with @ChelseaClinton and Pres. Clinton. Glad she is recuperating. _E_\nBernie Sanders is continuing his quest because he believes that Crooked Hillary Clinton will be forced out of the race e mail scandal! _E_\nI will be interviewed on @oreillyfactor tonight at 8:00. Will be talking about the poor treatment of our veterans illegal immigration etc. _E_\nI very much appreciate all of the great reviews & comments on my speech in Michigan the people were great. _E_\nSee when I said NATO was obsolete because of no terrorism protection they made the change without giving me credit. 
__HTTP__ _E_\nMany Red State Democrats sticking with Obama on deficit spending on the ObamaCare monstrosity will be defeated in 2014. _E_\nHappy 4th of July to everyone including the haters and losers! _E_\nCNN is the worst fortunately they have bad ratings because everyone knows they are biased. __HTTP__ _E_\nThank you @Todayshow for the wonderful and honest poll results on Chicago sign. People love it! __HTTP__ @TrumpChicago _E_\nRemember I predicted that New York Magazine would fold and people scoffed? Just announced (N.Y.Post) it lost big $'s & is cutting way back! _E_\nHappy birthday to @garyplayer a truly great Champion and Person! _E_\nAny increase in ObamaCare premiums is the fault of the Democrats for giving us a product that never had a chance of working. _E_\n\"Mold yourself into the person who can do big things.\" – Think Big _E_\nRon DeSantis Iraq vet Navy hero bronze star Yale Harvard Law running for Congress in Fla. Very impressive. __HTTP__ _E_\n70% of the Chinese say they are better off than they were 4 years ago __HTTP__ At least someone has done well under Obama. _E_\nDo you know how many years @TheRealMarilu starred on Taxi? #CelebApprentice _E_\n\"Interested is interesting. If you remember that simple rule you will have no trouble making conversation.\" Think Like a Billionaire _E_\nI stand ready to lead us down a new path where we are lifted up by our desire to succeed not by a resentment of success. @MittRomney _E_\nObamaCare is causing such grief and tragedy for so many. It is being dismantled but in the meantime premiums & deductibles are way up! _E_\nI will be in Indiana on Sunday and Monday at four MAKE AMERICA GREAT AGAIN rallies. See you there! _E_\nCongrats @TrumpWaikiki for winning @AmericanExpress Fine Hotels & Resorts 'Hotel Partner of The Year for 2014' award! _E_\nIs business success a natural talent? I think it's a combination of aptitude work and luck. 
Think Like a Champion _E_\nGreat ruling on wind farm in Scotland—very smart judge! Front page article. __HTTP__ _E_\nThank you @BrentBozell As you know I have been saying this for a long time __HTTP__ _E_\nFlashback @FoxNewsInsider July '14:\"Trump: Bergdahl Swap Another Mistake By 'Gang That Couldn't Shoot Straight'\" __HTTP__ _E_\nEveryone should boycott Italy if Amanda Knox is not freed she is totally innocent. _E_\nThe ALS #IceBucketChallenge that Trumps them all __HTTP__ _E_\nGreat job @MariaTCardona on @ThisWeekABC. You made kooky Cokie Roberts and @BillKristol look even dumber than they are. You will be right! _E_\nGreat to be on @andersoncooper tonight with my wonderful family. Will be rebroadcast at 12:00 A.M. (EASTERN). _E_\nVirtually all Presidents and candidates including John McCain Bill Clinton George H.W. Bush and George W. Bush... __HTTP__ _E_\nThank you @JakeTapper for giving me credit for my vision on bombing the oil fields. Should have been done long ago. #Trump2016 _E_\n...intentional. This whole narrative is a way of saving face for Democrats losing an election that everyone thought they were supposed..... _E_\n\"Some people spend an entire lifetime wondering if they made a difference in the world.The Marines don't have that problem.\" Ronald Reagan _E_\nRev.@BillyGraham is doing tremendous work this election cycle educating the Christian community on @MittRomney. _E_\n\"Read the Bible. Work hard and honestly. And don't complain.\" – Rev. @BillyGraham _E_\nRobert Bryce @NYPost Congrats on your great opinion piece on terrible wind turbines & how destructive they are. Windmills are a disaster. _E_\nI will be making a speech at 12:00 in Fort Worth Texas. Really big crowd expected. Will be talking about the debate last night plus plus! _E_\nAfter many years of failurecountries are coming together to finally address the dangers posed by North Korea. We must be tough & decisive! _E_\nShow me someone without an ego and I'll show you a loser. 
How To Get Rich _E_\nI like John McCain but we have to start rebuilding the United States instead of countries who hate us and want us to fail be smart! _E_\nHappy Birthday @TheLeeGreenwood!#FlashbackFriday __HTTP__ _E_\nThe media is spending more time doing a forensic analysis of Melania's speech than the FBI spent on Hillary's emails. _E_\nAny senator who votes against starting debate is telling America that you are fine w/ the #OCareNightmare! Remarks: __HTTP__ __HTTP__ _E_\nThank you America! Get out & VOTE tomorrow! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nI will be speaking at the #StopIranDeal rally shortly watch live here __HTTP__ _E_\nNegotiation tip #2: I always go into the deal anticipating the worst... _E_\nCan you imagine if I had the small crowds that Hillary is drawing today in Pennsylvania. It would be a major media event! @CNN @FoxNews _E_\nMy ties shirts and cufflinks have never been more beautiful THE BEST available at Macy's! _E_\nJFK Files are released long ahead of schedule! _E_\nWho is the moron who decided to release the Ferguson grand jury findings after 9:00 o'clock in the evening. What were they thinking? _E_\nTHANK YOU! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nMy @jrg710 interview discussing building a cemetery next to Trump National the FL primary @ApprenticeNBC and OPEC __HTTP__ _E_\nISIS is making big threats today no respect for U.S.A. or our leader If I win it will be a very different storywith very fast results _E_\nCommissioner Adam Silver made a strong and very wise decision concerning Donald Sterling. _E_\nSissy Graydon Carter of failing Vanity Fair Magazine and owner of bad food restaurants has a problem his V.F. Oscar party is no longer hot _E_\nLove that Patriots won Brady is best ever! Seahawks pass was DUMBEST play in the history of football! Great going COACH B! _E_\nOil has been over $33/gallon for 34 months. A new record. 
And now with Obama's war on coal American families will be hit even harder. _E_\nI'm really glad that @MittRomney no longer says what a nice guy @BarackObama is. _E_\nMy daughter Ivanka did great tonight in New Hampshire. The sold out crowd loved her and she loved them. Thanks Ivanka! _E_\n.@ChuckGrassley got your message loud and clear. We have fantastic people on the ground got there long before #Harvey. So far so good! _E_\n.@GeraldoRivera Thank you Geraldo for your nice words on @oreillyfactor tonight. You are a true champion! Thank @ericbolling great guy! _E_\nAiring live from Baton Rouge at 8PM ET on @nbc 2014 @MissUSA Competition will be a tremendous event __HTTP__ _E_\nThe secret of getting ahead is getting started. Mark Twain _E_\nWhy is President Obama allowed to use Air Force One on the campaign trail with Crooked Hillary? She is flying with him tomorrow. Who pays? _E_\nWill be interviewed on @foxandfriends at 8:00 A.M. _E_\n\"Having an ego and acknowledging it is a healthy choice. Our ego gives us a sense of purpose.\" – Think Like a Champion _E_\nOmikronDreamer @realDonaldTrump do you wear your own ties? Yes. _E_\nDo your homework. Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. _E_\nI will be on @oreillyfactor tonight interview with Bill O'Reilly on @FoxNews at 8 p.m. repeated at 11 p.m. _E_\nMy @FoxNews interview on @TeamCavuto discussing why debt commission should be discussed in debates & @RNC convention __HTTP__ _E_\nVia @HPCaTravel by @alau2: \"Trump Hotel Reflects Youthful Luxurious Vancouver: Ivanka Trump\" __HTTP__ _E_\nRT @Scavino45: .@POTUS @realDonaldTrump and @FLOTUS Melania visit with @UMCSN patient Tiffany Huizarin Las Vegas earlier today. #VegasStron... _E_\nThe toughest thing about success is that you've got to keep on being a success. Irving Berlin _E_\nC SPAN/Conversation with Donald Trump/Economic Club of Washington DC __HTTP__ _E_\nPresident Obama played golf yesterday??? 
_E_\nNow the UN is attacking @Redskins franchise __HTTP__ With all the world's problems is this really a top priority? _E_\nVia @GravisMarketing: \"New Hampshire Poll: Trump into top tier status\" __HTTP__ _E_\nI am growing the Republican Party tremendously just look at the numbers way up! Democrats numbers are significantly down from years past. _E_\nAt 96 stories above Michigan Avenue if you're not staying at the 5 star @TrumpChicago then you're in its shadow __HTTP__ _E_\nNext time Marco Rubio should drink his water from a glass as opposed to a bottle—would have much less negative impact. _E_\nObama lied 100% about Libya and the killings emails are absolute. He must release his records on Wednesday and stop the lies. _E_\nCan you believe that President Obama still hasn't stopped the flights and people pouring into the U.S. from West Africa. TERRIBLE PRESIDENT! _E_\nRT @Scavino45: 20295 miles later #POTUSinAsia has successfully concluded as @POTUS @realDonaldTrump lands on the South Lawn of @WhiteHouse... _E_\n'Immigration Ban Is One Of Trump's Most Popular Orders So Far' __HTTP__ _E_\nRT @VP: All Americans in harms way need to be prepared and should continue visiting __HTTP__ for critical updates on #Hurric... _E_\nWhy doesn't phony @bobvanderplaats tell his followers all the times he asked for him and his family to stay at my hotels didn't like paying _E_\nRobert Slater who just passed away was a terrific writer who wrote a very fair book about me. He will be missed. __HTTP__ _E_\nI believe in spending what you have to. But I also believe in not spending more than you should. The Art of The Deal _E_\nMy @FoxNews interview with @megynkelly discussing the 2012 election and the Newsmax @iontv debate __HTTP__ _E_\n.@HillaryClinton is NOT above the law!#Debates2016 __HTTP__ _E_\nThis chart from AEI's @JimPethokoukis shows how terrible @BarackObama's 'recovery' really is: __HTTP__ Disaster. 
_E_\nIn '08 @BarackObama called Jerusalem Israel's capital __HTTP__ Now he attacks @MittRomney on Jerusalem __HTTP__ _E_\nI won't be doing Fox & Friends tomorrow morning in that I have a big breakfast meeting on a deal. I will be back next week at 7. Thank you! _E_\nOn June 22 I will be going to Scotland to celebrate the opening of the newly renovated @TrumpTurnberry Resort the worlds best. _E_\nI know Rand Paul and I think he may find a way to get there for the good of the Party! _E_\nDC has shrunk our military and exploded our country with debt. We can't send another politician to the White House __HTTP__ _E_\nJust heard Foreign Minister of North Korea speak at U.N. If he echoes thoughts of Little Rocket Man they won't be around much longer! _E_\nWill be traveling to the Great State of Ohio tonight. Big crowd expected. See you there! _E_\nI will be interviewed by @oreillyfactor at 4:00 P.M. (prior to the #SuperBowl Pre game Show) on Fox Network. Enjoy! _E_\n.@Jetsetterdotcom in Hong Kong featured 8 pages on my great hometown of New York City including @TrumpSoHo __HTTP__ _E_\n.@claudiajordan's judgment wasn't the best in who she chose to come back to the boardroom—that was her demise. #CelebApprentice _E_\nNext Tuesday remember how our president has not lifted a finger for USMC Tahmooressi. He only wants illegals to cross our border. _E_\nIt was really strange when Hillary was missing from the podium last night. Not very presidential! _E_\nMy @CNN interview with @PiersTonight discussing the Newsmax @iontv debate #TimeToGetTough the GOP and the economy __HTTP__ _E_\nNice article on Trump Links at Ferry Point in today's New York Post the construction is going really well! _E_\nWE LOVE YOU LAS VEGAS! __HTTP__ _E_\nRT @JaniceTaylor912: @DonaldJTrumpJr @Reryan08 @IvankaTrump @EricTrump obvious to all that he raised some GREAT responsible patriotic kid... 
_E_\nPoll: Trump Leads GOP Field Among Hispanics Records 34% Favorability __HTTP__ _E_\nWith the World hating us and wanting to destroy the U.S. we have just cut the hell out of the military budget making it smallest since '39 _E_\nI am pleased to announce that I have chosen Governor Mike Pence as my Vice Presidential running mate. News conference tomorrow at 11:00 A.M. _E_\n#ObamacareFail __HTTP__ _E_\nSo Obama used to tell classmates that he was Kenyan royalty and an Indonesian prince __HTTP__ Sounds like his book bio! _E_\nWe are going to make this a government of the people once again!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nIraq told us to get out Iraq is now falling and Iraq now wants us to come back! Don't do it unless we get the OIL and I mean ALL OF IT! _E_\nInteresting studies show that wind farms have a warming effect on the climate _E_\nI am so proud of our great Country. God bless America! _E_\nThe Celebrity Apprentice Sunday night at 9 PM on NBC. Another great episode! __HTTP__ _E_\nSadly firing can be an essential and responsible business decision. It isn't pleasant but lopping off a branch can save a tree. _E_\nAll of this Russia talk right when the Republicans are making their big push for historic Tax Cuts & Reform. Is this coincidental? NOT! _E_\n... It is all about incorporating a sense of optimism into everything you do while also acknowledging the negative.\" – Think Big _E_\n.@MittRomney scored last night on both substance and style. _E_\nJoin me in congratulating @NASA's @AstroPeggy by using the hashtag #CongratsPeggy! Earlier today:... __HTTP__ _E_\nAn individual whose whole career is trying to take down successful celebrities with nonsense campaigns has turned his attention to me..... _E_\nThank you America! #Trump2016 __HTTP__ _E_\nEntrepreneurs must have vision plus the power of focus... to see the future and turn their vision into a profitable reality. 
#MidasTouch _E_\nI'll bet if I didn't harass Apple for the last 2 years about the large screen iPhone they wouldn't have done it—but it bends & breaks! _E_\nPresident Obama's inaugural had record low ratings. What does that portend? _E_\nMike Leach's lessons his takeaways from Geronimo's life are fascinating & useful whether in boardroom or locker room __HTTP__ _E_\nRasmussen just announced that my approval rating jumped to 49% a far better number than I had in winning the Election and higher than certain \"sacred cows.\" Other Trump polls are way up also. So why does the media refuse to write this? Oh well someday! _E_\n\"President Donald J. Trump Proclaims October 24 2017 as United Nations Day\" Read more: __HTTP__ __HTTP__ _E_\nEntrepreneurs: Be passionate you have to love what you're doing to be successful at it. _E_\nI am committed to keeping our air and water clean but always remember that economic growth enhances environmental protection. Jobs matter! _E_\nWashington will continue to run record deficits into the election. We are borrowing at a rate of $1.40 from China. Truly unsustainable. _E_\nWho is rooting for Obama more tonight his campaign advisors or the press? _E_\nIf I win the Presidency we will swamp Justice Ginsburg with real judges and real legal opinions! _E_\n.@KarlRove who spent $430 million in the last cycle and didn't win one race said I'm not a candidate until I file papers. Next week Karl! _E_\nVia @CarrGaz: \"Trump to jet in to unveil Trump @TurnberryBuzz clubhouse\" __HTTP__ _E_\nTom Brady would have won if he was throwing a soccer ball. He is my friend and a total winner! @Patriots _E_\nIran will convince our incompetent President that they are trying to help us with Iraq take over the country & oil and O will say thanks _E_\nForeigners slashed the purchase of US debt late last year the first time in over 2years. We must control spending. __HTTP__ _E_\n....the 2016 election with interviews speeches and social media. 
I had to beat #FakeNews and did. We will continue to WIN! _E_\n.@genesimmons Keep up the great work and congrats we are proud of you! _E_\nDespite the phony Witch Hunt going on in America the economic & jobs numbers are great. Regulations way down jobs and enthusiasm way up! _E_\n\"Talent wins games but teamwork and intelligence wins championships.\" Michael Jordan _E_\nIf the morons who killed all of those people at Charlie Hebdo would have just waited the magazine would have folded no money no success! _E_\nEntrepreneurs: What is the standard for which you want to be known? Identify that standard and then establish it. _E_\nChina watched Obama's press conference yesterday salivating. We will be borrowing trillions more from them. _E_\nThank you Jacksonville Florida!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nThank you to @piersmorgan for your nice statement about me in the @HollywoodReporter __HTTP__ _E_\nThe failing New York Daily News knowingly incorrectly reported that I wanted to speak at the Republican National Convention wrong! _E_\n\"Problems setbacks mistakes & losses are all part of life. We shouldn't be shocked if and when they happen.\" – Think Like a Champion _E_\nThank you for you support Virginia! In ONE DAY get out and #VoteTrumpPence16! #ICYMI: __HTTP__ __HTTP__ _E_\nWe're coming up on the NEW YEAR It is really important that despite so many stupid decisions being made in Washington we make it BEST EVER _E_\nRT @foxandfriends: POTUS the predictor? President Trump foretold housing upswing in 2012 __HTTP__ _E_\nThe Chinese are mistreating Hillary Clinton on her trip __HTTP__ They have zero respect for us. Outrageous! _E_\nNew CBS National Poll just out massive lead for Trump. The Wall Street Journal/NBC Poll is a total joke. No wonder WSJ is doing so badly! _E_\nThe Roger Stone report on @CNN is false Fake News. Have not spoken to Roger in a long time had nothing to do with my decision. 
_E_\nHaters and losers say I wear a wig (I don't) say I went bankrupt (I didn't) say I'm worth $3.9 billion (much more). They know the truth! _E_\nWonderful to be in North Dakota with the incredible hardworking men & women @ the Andeavor Refinery. Full remarks: __HTTP__ __HTTP__ _E_\nJust left Trump National Doral in Miami under massive construction The Blue Monster will be one of the greatest courses ever built! _E_\nDeparting Pittsburgh now where it was my great honor to stand with our incredible workers and to show the world that AMERICA is back and we are coming back bigger and better and stronger than ever before! __HTTP__ _E_\nIf you are lucky enough to catch a knockout assaulter before getting slugged and you carry a gun shoot the bastard (teach them a lesson)! _E_\nRT @Scavino45: \"Utilities cutting rates cite benefits of Trump tax reform\" __HTTP__ _E_\nWhen you do your Christmas shopping remember how disloyal @Macys was to the subject of illegal immigration. #BoycottMacys #DumpMacys _E_\nWe are making tremendous progress with the V. A. There has never been so much done so quickly and we have just started. We love our VETS! _E_\nJOBS JOBS JOBS! __HTTP__ _E_\nMexico's court system is a dishonest joke. I am owed a lot of money & nothing happens. _E_\nIn Massachusetts the place is packed! #MakeAmericaGreatAgain _E_\nEntrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_\nRT @DRUDGE_REPORT: 10 SCANDALS ON DIRECTOR'S WATCH... __HTTP__ _E_\nSenior United States District Judge Robert E. Payne today ruled in favor of Trump campaign delegates who had argued.. __HTTP__ _E_\nThe Zimmerman trial is over. It is time to move on. While Zimmerman is no angel he was acquitted and should be able to move on. _E_\nEntrepreneurs: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_\nAudience chanting RUN TRUMP RUN! during my my @SRQRepublicans speech! They are going to be very happy... 
_E_\nFiring @lisalampanelli may have come as a surprise. She's a strong player. But there are no losers at this late point. #sweepstweet _E_\nVia @necn by @KatherineNECN: \"Trump Waiting to See Who Runs in 2016\" __HTTP__ _E_\nJustice Ginsburg of the U.S. Supreme Court has embarrassed all by making very dumb political statements about me. Her mind is shot resign! _E_\nWhy does a failed magazine like @Forbes constantly seek out trivial nonsense? Their circulation way down. @Clare_OC _E_\nPremiering on January 4th the 14th season of @ApprenticeNBC will have major fireworks every episode. The Board Room is electric! _E_\nThese crimes won't be happening if I'm elected POTUS. Killer should have never been here. #AmericaFirst __HTTP__ _E_\nAnother freezing day in the Spring what is going on with global warming ? Good move changing the name to climate change sad! _E_\nCyprus is seizing private bank accounts as collateral for €10bn bail out. We owe $17T. Think it can't happen here? _E_\nThe terrorist who killed so many people in Germany said just before crime by God's will we will slaughter you pigs I swear we will...... _E_\nScary America would have had to pay all its GDP to the government to cover @BarackObama's real 2011 budget deficit __HTTP__ _E_\nTomorrow will be a really big day for America. MAKE AMERICA GREAT AGAIN! _E_\n.@BreitbartNews: DONALD TRUMP: CANTOR'S DEFEAT SHOWS 'EVERYBODY' IN CONGRESS VULNERABLE IF THEY SUPPORT AMNESTY __HTTP__ _E_\nThanks. __HTTP__ _E_\nCongratulations to my son Eric for making the Forbes 30 under 30 list. He's done a great job! __HTTP__ _E_\nVia @UnionLeader by @tuohy: Trump hires Lewandowski as presidential run eyed __HTTP__ #FITN #MakeAmericaGreatAgain _E_\nVia @BreitbartNews by mboyle1: EXCLUSIVE DONALD TRUMP CONFIRMED TO SPEAK AT #CPAC2014 __HTTP__ @ACUConservative @CPACnews _E_\nAmazingly @AnthonyWeiner is going to run. The cure rate for his problem is 0. Lots of other things will come out. 
_E_\nI look forward to attending & speaking at the Iowa Land Investment Expo—total sellout crowd __HTTP__ @PeoplesCompany _E_\nAmerica should not be pressuring @Israel to show restraint against Iran. We should be working to stop Iran's nuclear drive. _E_\nAmazing! AG Schneiderman sues a school w/ a 98% approval rating but doesn't go after billion $ fraudsters all over Wall St. _E_\nThe hatchet job in @NYMag about Roger Ailes is total bullshit. He is the ultimate winner who is surrounded by a great team. @FoxNews _E_\nThank you for your continued support!#MakeAmericaGreatAgain __HTTP__ _E_\nObama's planned tax hike will hit over 1 million small businesses __HTTP__ Expect more massive unemployment and stagnant growth _E_\nJoin me live now in Las Vegas Nevada! We will MAKE AMERICA SAFE & GREAT AGAIN! #VoteTrumpNV #NevadaCaucus __HTTP__ _E_\n\"Donald Trump to name golf course after mother\" __HTTP__ via @scotsmandotcom _E_\nIf you can't see it you can't make it happen. Entrepreneurs chase your dreams with resolute focus & determination. Be positive! _E_\nA Lion's List of Democrats are not attending @BarackObama's DNC Convention. The Democratic Party is in turmoil. __HTTP__ _E_\nThe Holiday Season in New York City is a very special time. I love seeing and meeting the many tourists who visit the #TRUMP Tower atrium. _E_\nJust watched Jon Stewart(?) jumping up and down and screaming like a madman nothing funny or smart just loud and obnoxious a pushy dope! _E_\nPlease tweet me your questions to answer in my #trumpvlog. _E_\nTPP does not stop Japan's currency manipulation & China has a backdoor to join. It must be stopped. We need to protect the American worker! _E_\nI have nothing to do with the Plaza Casino in Atlantic City. I have not been involved with Atlantic City for many years. Used to love A.C.! _E_\nIt's boardroom time! Does anyone miss @OMAROSA? #CelebApprentice _E_\n.@HillaryClinton Sneers At Millions Of Average Americans. 
__HTTP__ #VPDebate #BigLeagueTruth _E_\nBlatant and rampant property destruction in Baltimore as the police stand by and watch. Should be a lesson on how NOT to handle riots. SAD! _E_\nThank you Piers they don't know what they're getting into. __HTTP__ _E_\nI applaud @netanyahu for announcing that he will show up at the UN to defend @Israel. A true US friend and great leader. _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nAmazing crowd last night in Dallas more spirit and passion than ever before. Today all over the great State of Texas! _E_\nWas not mentioned that we built one of the great golf courses in the world bringing tremendous business to Scotland. __HTTP__ _E_\n\"Simply take a big goal and mold yourself to become the person who can accomplish that goal.\" – Think Big _E_\nTonight's #CelebrityApprentice will continue to impress. Be sure to tune in tonight at 9PM ET @NBC. It will be amazing. _E_\n.@CelebApprentice Flashback: \"What @bretmichaels Learned from the 'Rock Star of Real Estate'\" __HTTP__ _E_\nJust left Virginia where I unveiled my healthcare and other plans for our great Veterans! They will be very happy! __HTTP__ _E_\nDonald Trump To Mitt Romney: 'You're Fired' __HTTP__ via @fitsnews _E_\nTrump International Hotel & Tower New York winner of the Forbes Five Star Hotel Award in 2009 through 2012. __HTTP__ _E_\nIf I would have done the last debate a record would have been set (instead of the poor ratings recieved). Also VETS got $6000000. _E_\nScary. President Obama told Boehner that the government doesn't have a spending problem __HTTP__ _E_\nDonald Trump: 'Monkey business' on jobs __HTTP__ via @politico _E_\nI have an idea for @JebBush whose campaign is a disaster. Try using your last name and don't be ashamed of it! _E_\nCredible Source on 9 11 Muslim Celebrations: FBI __HTTP__ _E_\nHost of the 2017 U.S. 
Women's Open Trump Bedminster has been rated one of America's best golf courses. _E_\nSOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_\nSnowden should come back to America and face justice. Instead he is begging for clemency from Moscow. Treat him as a spy. _E_\nIran has warned the US not to send an aircraft carrier back into the Strait of Hormuz. We should send three as a (cont) __HTTP__ _E_\nIt will be interesting to see what happens to Eliot Spitzer if he loses the election for Comptroller to very capable @scottmstringer. _E_\n.@foxandfriends int. on gov. collecting data whistle blower hiding in China & no bikinis in Miss World pageant __HTTP__ _E_\nHuge Townhall tomorrow at 5PM in the NH Barrington Middle School! Thanks to @straffordnhgop for hosting! Let's Make America Great Again! _E_\nWatch me on @Hannityshow tonight at 9PM ET on @FoxNews. _E_\nRT @DanScavino: Jesse Jackson on @realDonaldTrump when he donated space for the Rainbow/Push Coalition. #DebateNight __HTTP__ _E_\nChina will extract much from Secretary Kerry and the U:S. in order for them to help us with the North Korea problem don't let this happen! _E_\nI don't like seeing the Pope standing at the checkout counter (front desk) of a hotel in order to pay his bill. It's not Pope like! _E_\nDisappointed the @NewYorkObserver article on @AGSchneiderman did not bring up his dealings w/ Shirley Huntley. __HTTP__ _E_\nMy @foxandfriends interview discussing Pres. Obama's inauguration @GOP debt plan & @CelebApprentice #1 branding __HTTP__ _E_\nI will be interviewed on @foxandfriends at 9:00 A.M. I will be talking about the rigged and boss controlled Republican primaries! _E_\nIt's time to let Pete Rose the all time hits leader into the Baseball Hall of Fame. Enough already!!!!! _E_\nIn trade military and EVERYTHING else it will be AMERICA FIRST! This will quickly lead to our ultimate goal: MAKE AMERICA GREAT AGAIN! _E_\n.@SabrinaSiddiqui Re: Taylor and Conor great news for Taylor! 
_E_\nOur greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_\nMake sure to follow me on @periscopeco. I will be streaming my announcement at 11AM. _E_\nWith Dr. Dror Paley & Dr. Ben Carson with two wonderful children at Mar a Lago. __HTTP__ _E_\n\"Here's the truth the gov't doesn't shutdown\" __HTTP__ via @AP. All essential services continue. Don't believe lies. _E_\nLook how small the pages have become @WSJ. Looks like a tabloid—saving money I assume! _E_\n\"To keep momentum keep challenging yourself.\" – Think Big _E_\nDon't forget to watch Celebrity Apprentice this Sunday night at 9 pm on NBC. You're in for a great show. __HTTP__ _E_\nThe legendary Barbara Walters interviews Melania Trump and me on a special this Friday night at 10:00 on ABC. Don't miss it! _E_\n.@JebBush like it or not our country needs more energy and spirit than you can provide! #MakeAmericaGreatAgain _E_\nI told you the Oscars were terrible—bad look bad talent—and among the lowest ratings in show's history. __HTTP__ ... _E_\nI will be on @foxandfriends at 7.30 A.M. _E_\nThe Emmys were horrendous...the absolute worst show! _E_\nWatch me tonight on Late Night with Jimmy Fallon.Photo: Lloyd Bishop/NBC __HTTP__ _E_\nGov.Kasich of Ohio just stated on a morning show that he doesn't watch politics or anything on television he only watches the @GolfChannel _E_\nDone by a real fan! #TRUMP __HTTP__ _E_\nI will be tweeting live tonight during Celebrity Apprentice 9 o'clock on NBC! _E_\nVery dangerous pattern developing across country by Obama supporters. Detroit poll watcher was threatened with gun __HTTP__ _E_\nIn my book @Joan_Rivers had a lousy doctor shoving a camera down her throat at her age. Something went really wrong that should not have! _E_\nCan you imagine if Bush's administration drafted a memo legalizing the killing of Americans?! Democrats are such hypocrites. _E_\nI use both iPhone & Samsung. 
If Apple doesn't give info to authorities on the terrorists I'll only be using Samsung until they give info. _E_\nThe Apprentice on the other hand has been a MAJOR television hit often times finishing #1. Even now after 13 seasons it wins its slot! _E_\n.@TheBrodyFile Exclusive: @realDonaldTrump Says He Will Protect Evangelicals Better Than @tedcruz __HTTP__ #CBNNews #2016 _E_\n\"Diligence is the mother of good luck.\" Benjamin Franklin _E_\nRiley Rone was a great young man. We will miss him dearly. __HTTP__ _E_\nNo surprise. @Rosie is failing on @TheView.Terrible ratings.\"Malcontent & another season is out of the question __HTTP__ _E_\nThe only people who don't like the Tax Cut Bill are the people that don't understand it or the Obstructionist Democrats that know how really good it is and do not want the credit and success to go to the Republicans! _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nEntrepreneurs: Resolve to be bigger than your problems. Who's the boss? _E_\nOur country is facing a major threat from radical Islamic terrorism. We better get very smart and very tough FAST before it is too late! _E_\nA president either is constantly on top of events or if he hesitates events will soon be on top of him... _E_\nRT @realDonaldTrump: So much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They prom... _E_\nOnce again Obama is going to lose on another prospective nomination. Chuck Hagel will not be named Sec. of Defense & probably shouldn't be. _E_\nBloomberg: Trump leads GOP field __HTTP__ _E_\nLook forward to Governor Mike Pence V.P. introduction tomorrow in New York City. _E_\nSorry I will miss the CPAC gathering in Orlando there in spirit Obama must go. 
_E_\nMy #GOPDebate @facebook question for the other candidates __HTTP__ _E_\nWith 50 days until the election it is #TimeToGetTough for @MittRomney & @GOP _E_\nCongratulations to @BretBaier on the immediate & tremendous success of his book 'Special Heart.' Already in its third printing! _E_\n.@TrumpChicago's exceptional dining w/equally exceptional views of the city are exclusive world class experiences __HTTP__ _E_\nGreat memory @TheRealMarilu! #CelebApprentice _E_\nOn my way to Iowa. Will be landing in Des Moines in two hours. See ya! _E_\nMy contract with the American voter will restore honesty accountability & CHANGE to Washington! #DrainTheSwamp __HTTP__ _E_\nWe have many problems in our house (country!) and we need to fix them before we let visitors come over and stay. MAKE AMERICA GREAT AGAIN! _E_\nSuccess tip: Don't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_\nSeems hard to believe that @Facebook could be worth that much be careful if you invest. And Mark Zuckerberg get a pre nup. _E_\nTo be completed this yearTrump Int'l Golf Club Dubai will feature a 7205 yard par 71 & double sided driving range __HTTP__ _E_\nThe US government's foreign debt is at a record $5.29T __HTTP__ China is laughing all the way to the bank. _E_\nRT @foxandfriends: Jared Kushner didn't suggest Russian communications channel in meeting source says __HTTP__ _E_\nThe Blue Monster Golf Course officially opens tomorrow at Trump National Doral with a ribbon cutting ceremony. GREAT COURSE GREAT REVIEWS! _E_\nAmerica's hearts & prayers are with the people of #PuertoRico & the #USVI. We will get through this and we will get through this TOGETHER! __HTTP__ _E_\nThank you Readers' Choice: Trump Int'l Hotel Las Vegas has been nominated by 10 Best for Best Pet Friendly Hotel __HTTP__ _E_\nWill be heading over shortly to make remarks at The National Prayer Breakfast in Washington. 
Great religious and political leaders and many friends including T.V. producer Mark Burnett of our wonderful 14 season Apprentice triumph will be there. Looking forward to seeing all! _E_\nGovernment dependency has surged over 23% since @BarackObama has taken office. __HTTP__ He is creating an entitlement culture. _E_\nDemocrats refusal to give even one vote for massive Tax Cuts is why we need Republican Roy Moore to win in Alabama. We need his vote on stopping crime illegal immigration Border Wall Military Pro Life V.A. Judges 2nd Amendment and more. No to Jones a Pelosi/Schumer Puppet! _E_\nThe Green Party scam to fill up their coffers by asking for impossible recounts is now being joined by the badly defeated & demoralized Dems _E_\nPictures of my beautiful mother amazing father and family hanging @MontesKitchen in upstate New York. __HTTP__ _E_\nWhy did @DanaPerino beg me for a tweet (endorsement) when her book was launched? _E_\nI was disappointed that Ted Cruz would speak behind my back get caught and then deny it. Well welcome to the wonderful world of politics! _E_\nCongrats to @mboyle1 of @BreitbartNews for exposing Jason Linkins of @HuffingtonPost as a lightweight dope who gives false information. _E_\nStarting next week and by popular demand (plus good ratings) NBC will broadcast only two hour episodes of Celebrity Apprentice at 9 P.M. _E_\nBarack Obama is hard at work today on his highest priority his reelection. @BarackObama has 5 fundraisers in 2 cities. __HTTP__ _E_\nThe lunatics in Congress banned the word 'lunatic' from Congress last week __HTTP__ Busy doing the peoples' work! _E_\nWhen someone can discourage you you probably aren't determined enough. Be resolute. That's what it takes to get things done. _E_\nWhy does the liberal media think Bill O'Reilly (@oreillyfactor) is a complete and total vulgarian? I don't think so! 
_E_\n.@alexsalmond @pressjournal RT @djkevritch im proud to be scottish but bonnie scotland will soon be a thing of the past w/ these windmills _E_\nJoin me live for the #SOTU __HTTP__ _E_\nBig win today in the House for GOP Tax Cuts and Reform 227 205. Zero Dems they want to raise taxes much higher but not for our military! _E_\nFake News CNN made a vicious and purposeful mistake yesterday. They were caught red handed just like lonely Brian Ross at ABC News (who should be immediately fired for his \"mistake\"). Watch to see if @CNN fires those responsible or was it just gross incompetence? _E_\nSSE slashes offshore wind investment—wants British government to pay for its losses on these monstrosities __HTTP__ _E_\nMore waste fraud and abuse over $460M in food stamps went to ineligible households __HTTP__ Where's the accountability? _E_\nUnder Trump gains against #ISIS have dramatically accelerated __HTTP__ _E_\nNow Obama's campaign is guaranteeing 12 million new jobs during a 2nd term __HTTP__ More like $12T in new debt if he wins. _E_\nIn response to @Lawrence my net worth is substantially more than 7 billion dollars very low debt great as... (cont) __HTTP__ _E_\nYoung entrepreneurs – be patient and continue to work with determination. With hard work success will follow. Keep your focus! _E_\nThe reason I put up approximately $50 million for my successful primary campaign is very simple I want to MAKE AMERICA GREAT AGAIN! _E_\nI cancelled today's meeting with the failing @nytimes when the terms and conditions of the meeting were changed at the last moment. Not nice _E_\nIf the ban were announced with a one week notice the bad would rush into our country during that week. A lot of bad dudes out there! _E_\nI will be watching the great Governor @Mike_Pence and live tweeting the VP debate tonight starting at 8:30pm est! Enjoy! _E_\nWill be on @OreillyFactor tonight at 8:30pm @FoxNews prior to Melania's speech at the #GOPConvention. Tune in she will do great! 
#RNCinCLE _E_\nYou can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of The Deal _E_\nI will be campaigning in Indiana all day. Things are looking great and the support of Bobby Knight has been so amazing. Today will be fun! _E_\nI am on David Letterman tonight. _E_\nWill be interviewed tonight on @seanhannity at 10:00. There is so much to talk about! _E_\nCongratulations to Georgina Bloomberg on winning the inaugural Central Park Grand Prix CSI 3* @MikeBloomberg _E_\n#HappyNewYearAmerica! __HTTP__ _E_\nCrooked Hillary Clinton looks presidential? I don't think so! Four more years of Obama and our country will never come back. ISIS LAUGHS! _E_\n...Whether you are a Republican or Democrat we should hope that Pres. @BarackObama does a great job for the country. _E_\nWhen the American People speak ALL OF US should listen. Just over one year ago you spoke loud and clear. On November 8 2016 you voted to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nAs the world watches we are days away from passing HISTORIC TAX CUTS for American families and businesses. It will be the BIGGEST TAX CUT and TAX REFORM in the HISTORY of our country! __HTTP__ _E_\nGreat video of tonights crowd reacting to my latest proposal in SC. #Trump2016 __HTTP__ __HTTP__ _E_\nJoe McQuaid (@deucecrew) of the dying Union Leader wanted ads lunches donations speeches from me and tweets very unethical. _E_\nNice to see Obama released a situation room photo from Sandy. How about releasing the photo taken during Benghazi? _E_\n.@BarackObama economic gloom: jobless claims have surged __HTTP__ while factory activity is (cont) __HTTP__ _E_\nDon't believe the biased and phony media quoting people who work for my campaign. The only quote that matters is a quote from me! _E_\nWhy did failing A.G. Eric Schneiderman after years of looking file his pathetic lawsuit on a SATURDAY afternoon (unheard of)? No case! _E_\nWe don't need another stimulus. 
The first one was a complete failure. Why repeat the same mistake? _E_\n09 19 2011 17:54:28 _E_\nThx to all the people who called to say they are cutting their @Macys credit card as a protest against illegal immigrants pouring into US _E_\n#TrumpVine Weiner is a joke.... __HTTP__ _E_\nThe ObamaCare disaster is in full swing. Websites are down people can't sign up and elderly can't understand the lingo. _E_\nTexas & Louisiana: We are w/ you today we are w/ you tomorrow & we will be w/ you EVERY SINGLE DAY AFTER to restore recover & REBUILD! __HTTP__ _E_\n100% fabricated and made up charges pushed strongly by the media and the Clinton Campaign may poison the minds of the American Voter. FIX! _E_\nGet ready for the Apprentice tonight TWO AMAZING EPISODES. I will be live tweeting! _E_\nMy @CNBCClosingBell interview discussing QE3 the housing market my stock picks and the 2012 election __HTTP__ _E_\nVia @financialpost: \"Climate changing for global warming journalists\" by Lawrence Solomon (cont) __HTTP__ _E_\nWatch the clip from @Late_Show where the crowd cheers after I explain that my offer is about transparency __HTTP__ _E_\nI am greatly honored by the results of the CNN poll in Iowa. In the end I believe the final results will be even better than that! _E_\nWatching @CNN and consider @secupp to be one of the least talented people on television. Boring and biased! _E_\nMake our borders strong and stop illegal immigration. Even President Obama agrees __HTTP__ _E_\nHow does failed writer and pundit like @stephenfhayes with no success and little talent get away with criticizing candidates. _E_\nWhile the next season of @CelebApprentice is packed w/ All Stars ours fans will be happy to see @Joan_Rivers in the board room.She is back! _E_\nCan you believe the Republicans are studying the Democrats on how to win an election? _E_\nDebate showed these guys really hate each other. At one point it looked like they would come to blows. 
_E_\nIf Obama was smart he would cancel the Muslim Brotherhood's WH visit later this month. He won't. _E_\nIsn't it amazing that @Macys paid a massive fine for profiling African Americans & then criticized me for discussing illegal immigration! _E_\nJeb Bush is desperate strongly in favor of #CommonCore and very weak on illegal immigration. _E_\nNegotiation is persuasion more than power. Negotiation includes a lot of fine lines and that's what makes it an art. _E_\nWas @foxandfriends just named the most influential show in news? You deserve it three great people! The many Fake News Hate Shows should study your formula for success! _E_\nWith one Yes vote in hospital & very positive signs from Alaska and two others (McCain is out) we have the HCare Vote but not for Friday! _E_\nTrump approval rebounds to 45% surges among Hispanics union homes men __HTTP__ _E_\nObama is about to destroy the mililtary through the sequester. The Middle East is a mess. Yet Colin Powell still endorses him. Wonder why? _E_\nAMERICA FIRST! _E_\nMust read piece by @DanielPipes: \"Obama's Diplomatic Acrobatics\" __HTTP__ _E_\nWow the Republican Convention went so smoothly compared to the Dems total mess. But fear not the dishonest media will find a good spinnnn! _E_\nDespite the ever increasing Ebola disaster Obama refuses to stop flights from West Africa.It's almost like he's saying F you to U.S. public _E_\nIvanka caught up with Bret and Holly backstage. Both Bret and Holly were champions all the way. __HTTP__ _E_\nWH claims it lied about Pres. Obama living with his uncle b/c \"wasn't mentioned in his book.\" I guess Bill Ayers never knew about it! _E_\nWelcome to the @WhiteHouse Amir Sabah al Ahmed al Jaber al Sabah of Kuwait! Joint press conference coming up soon: __HTTP__ __HTTP__ _E_\n\"WHAT HAPPENED\"\"How Team Hillary played the press for fools on Russia\" __HTTP__ WE KNOW! 
__HTTP__ _E_\nCongratulations to @TrumpNewYork and @TrumpToronto for the @WSJ coverage on perks in luxury hotels: __HTTP__ _E_\nWow! This might be my highest # yet! Thank you to my opposition you are totally ineffective & have been for years! __HTTP__ _E_\n.@Macys was one of the worst performing stocks on the S&P last year plunging 46%. Very disloyal company. Another win for Trump! Boycott. _E_\nfind the leakers within the FBI itself. Classified information is being given to media that could have a devastating effect on U.S. FIND NOW _E_\nArmy training slide lists Hillary Clinton as insider threat: __HTTP__ _E_\nNew Zogby poll— highly respected— but the media won't report it because it gives me an even bigger lead! __HTTP__ _E_\nThe American US Airways merger will create even worse service and much higher fares. _E_\nJust like I have been able to spend far less money than others on the campaign and finish #1 so too should our country. We can be great! _E_\nIf the decision by the grand jury in Ferguson was the exact opposite you would still be having the riots right now! _E_\nWhatever happened to Obama's 'independent investigation' into national security leaks from his administration? Where's the media? _E_\nIn '09 Obama released the ISIS chief. The terrorist gloated \"I'll see you in New York\" __HTTP__ Historic nat'l sec. error _E_\n.@bwilliams wouldn't you love to have my ratings? _E_\nA poll of the Miami Dade was conclusively in favor of gambling in Miami. @willweatherford @FLGovScott __HTTP__ _E_\nAn honor to be endorsed by the New England Police Benevolent Association. Thank you! __HTTP__ __HTTP__ _E_\nMy Scotland course is receiving accolades from all over the world a great honor for me. _E_\nI was so happy when I heard that @Politico one of the most dishonest political outlets is losing a fortune. Pure scum! _E_\nMy wonderful son Eric will no longer be allowed to raise money for children with cancer because of a possible conflict of interest with... 
_E_\nDo you think the 14 African nations that are banning West Africans from coming into their nations are racist? _E_\nWhat a great evening we had. So interesting that Sanders beat Crooked Hillary. The dysfunctional system is totally rigged against him! _E_\nNew York Republican leader @EdwardFCox is pushing my friend @RobAstorino into political suicide. Results won't be pleasant! _E_\nIt's Thursday and only 26 days until the election. How many illegal donations from China and Saudi Arabia did Obama collect today? _E_\nPeople like doing deals with me because they know it will be profitable that I work quickly and that they will be treated fairly. _E_\nFrom my family to yours...I want to wish you all a very merry Christmas! _E_\nThe habitual vacationer @BarackObama has sacrificed so much. He is delaying his 17 day Hawaii vacation a couple of hours. _E_\nThank you for the warm welcome to Brussels Belgium this afternoon! __HTTP__ _E_\n.#CelebrityApprentice Two hour live show on Monday night will determine who will become the winner of Celebrity Apprentice.Full cast returns _E_\nDoes President Obama ever discuss the sneak attack on Pearl Harbor while he's in Japan? Thousands of American lives lost. #MDW _E_\nWith 46 stories and 391 beautiful rooms @TrumpSoHo offers a wide array of AAA Five Diamond luxury options __HTTP__ _E_\nTax experts throughout the media agree that no sane person would give their tax returns during an audit. After the audit no problem! _E_\n.@BetteMidler talks about my hair but I'm not allowed to talk about her ugly face or body so I won't. Is this a double standard? _E_\nNew @OANN national poll released. Thank you America! #Trump2016 __HTTP__ _E_\nI'll be in London on Sunday at the ExCel Centre to talk about success. It will be a great time for everyone! __HTTP__ _E_\n\"One reason many people do not do well in business is because they do not do well with people.\" – Midas Touch _E_\nWow @CNN has nothing but my opponents on their shows. 
Really one sided and unfair reporting. Maybe I shouldn't do their town hall tonight! _E_\nMarco Rubio couldn't even respond properly to President Obama's State of the Union Speech without pouring sweat & chugging water. He choked! _E_\nTrump National Golf Club Bedminster New Jersey has courses designed by Tom Fazio & 16 acres of practice facilities. __HTTP__ _E_\nThe evening news broadcasts must stop talking about weather—boring and too many other topics. _E_\nI will be on Piers Morgan Live tonight at 9 p.m. on CNN. Tune in! _E_\nAmazing @VanityFair survived one more day without folding. The clock is ticking... _E_\nToday's third stop Londonderry New Hampshire! Thank you!#FITN #VoteTrumpNH __HTTP__ _E_\nRomance or Adventure what do you prefer? #CelebApprentice _E_\nLooking forward to tonight's Ayrshire Chamber of Commerce Annual Dinner 2015 @AyrshireChamber _E_\nHeading to Scotland to check out Turnberry & Trump Int'l Golf Links Scotland. Then heading to Dubai @DamacOfficial a great company. _E_\nIf you don't do your part don't blame God. Billy Sunday _E_\nGlad to hear @BarackObama's attack ad featuring my plane is playing in North Carolina. Free ad time for Trump National in Charlotte! _E_\nThe opening of Trump Turnberry in Scotland was a big success. Good timing I was here for BREXIT. Very exciting news conference today! _E_\nSpeaking to great patriots @MCC_CT. My first visit to Granite State since declaring my candidacy! #FITN __HTTP__ _E_\nThe Ultimate Merger: __HTTP__ 06 17_omarosa_is_back_and_this_time_its_personal.html _E_\n\"The entrepreneur's ability to dream to win lose and win again and again is often called the entrepreneurial spirit.\" – Midas Touch _E_\nNo matter what Bill Clinton says and no matter how well he says it the phony media will exclaim it to be incredible. Highly overrated! _E_\nVia @Newsmax_Media by @wandacarruthers: Trump: 'Inconceivable' Obama didn't know about ISIS threat __HTTP__ _E_\nGet out and VOTE tomorrow! 
We will MAKE AMERICA GREAT AGAIN! #CTPrimary #DEPrimary #MDPrimary #PAPrimary #RIPrimary __HTTP__ _E_\nWhy isn't anyone using the @CNN Iowa Poll with me having a big lead. They only want to use the one negative poll (2nd place).Dishonest press _E_\n.@bobvanderplaats is a total phony and dishonest guy. Asked me for expensive hotel rooms free (and more). I said pay and he endorsed Cruz! _E_\nThank you Council Bluffs Iowa! Will be back soon. Remember everything you need to know about Hillary just... __HTTP__ _E_\nVia @WSJ: \"The ObamaCare Awakening: Americans are losing their coverage by political design.\" __HTTP__ _E_\nI was interviewed by Greta Van Susteren today here at Trump Tower. Tune in tonight on Fox News at 10 p.m.... (cont) __HTTP__ _E_\nInterview w/Melanie Batley via Newsmax __HTTP__ _E_\nWe should have taken the oil in Iraq and now our mortal enemies have got it and with no opposition. Really dumb U.S. pols! I'm so angry! _E_\nCan you believe they are blaming @MittRomney for Egypt. _E_\nWith all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special counsel appointed! _E_\n'BuzzFeed Runs Unverifiable Trump Russia Claims' #FakeNews __HTTP__ _E_\nLooking forward to meeting with @SenBobCorker in a little while. We will be traveling to North Carolina together today. _E_\n.@JudgeJeanine Tonight at 9 P.M. on @FoxNews ENJOY! _E_\nFor America to be strong again the ways of politicians must be put in the past. Let's Make America Great Again! __HTTP__ _E_\nFilming of the record 13th season of @CelebApprentice has started. Be sure to be on the lookout for future updates. _E_\n2012 is the most important election of my lifetime. @BarackObama must be defeated. _E_\nWe allow Japan to sell us millions of cars with zero import tax and we can't make a trade deal with them our country is in big trouble! 
_E_\nMy new club on the Atlantic Ocean in Ireland will soon be one of the best in the World and no one will be looking into ugly wind turbines! _E_\nMovie producer Harvey Weinstein who lost his company to Colony Capital is against guns but makes movies w/ major gun violence really! _E_\nBREAKING Border security rally in Phoenix AZ at 2PM MST has been moved to @PhoenixConvCtr! Build a wall! Let's Make America Great Again! _E_\nThe Democrat Governor.of Minnesota said The Affordable Care Act (ObamaCare) is no longer affordable! And it is lousy healthcare. _E_\nIs the Boston killer eligible for Obama Care to bring him back to health? _E_\nPresident Xi of China has stated that he is upping the sanctions against #NoKo. Said he wants them to denuclearize. Progress is being made. _E_\nFBI Director Comey was the best thing that ever happened to Hillary Clinton in that he gave her a free pass for many bad deeds! The phony... _E_\n.@ScottWalker despite your coming to my office to give me an award your very dumb fundraiser hit me very hard not smart! _E_\nFt. Hood Jihadi Nidal Hassan has been paid over $300g in Army salary while on trial. His victims are deprived of any benefits... _E_\nSexting Pervert @anthonyweiner has returned to twitter. Parents of all underage girls should BLOCK him immediately! _E_\nObama is not a leader he's just a campaigner! _E_\nGreat job by MichaelCaputo on @foxandfriends. _E_\nCall it any way you like but Snowden is a traitor. When our country was great do you know what we did to traitors? _E_\nMemorial Day is a time to honor our nation's finest who made the ultimate sacrifice for our freedom. God bless them all. _E_\nPresident Obama missed the deadline! _E_\nThe 250 million dollar construction of Trump Nationsl Doral is coming along great. Just left Miami where I toured entire project.AMAZING! _E_\nJust read that Trump has the largest (and I add most enthusiastic) crowds. Tonight I will be in New Hampshire the place will be packed! 
_E_\nRefloating the Costa Concordia for many hundreds of millions of $'s is ridiculous. Should have taken it apart in small pieces save fortune _E_\nOur $16T national debt is now bigger than our $15T GDP. If Obama is re elected watch for an economic meltdown in 2013. _E_\nWe have done a great job with the almost impossible situation in Puerto Rico. Outside of the Fake News or politically motivated ingrates... _E_\nHillary Clinton answered email questions differently last night than she has in the past. She is totally confused. Unfit to serve as #POTUS. _E_\nHere are Hillary Clinton's accomplishments at the State Department.#Debates2016 #RattledHillary __HTTP__ _E_\nI will be in Palm Beach Jupiter and Miami today checking on big construction projects. I love Florida and love on time and on budget const _E_\nI'll be on @seanhannity tonight at 10 PM and look forward to it. Lots to discuss! Enjoy. _E_\nNow America knows the Emperor has no clothes. Why would Obama do better in a 2nd debate? #Debate #Obama _E_\n... I never felt that I could let up for a moment. Harry S. Truman _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nI just want to know how much is Saudi Arabia and others who we are helping willing to pay for our saving from total extinction. Pay up now! _E_\nLightweight Attorney General Eric Schneiderman will be next to lose. He goes after a school with a 98% approval ratingleaves biggies alone _E_\nGreat new campaign ad just released by @MittRomney __HTTP__ _E_\nGreat evening in San Jose other than the thugs. My supporters are far tougher if they want to be but fortunately they are not hostile. _E_\nDummies left Iraq without the oil not believable! _E_\nCapitalism doesn't guarantee success only a chance to succeed. The community organizer @BarackObama doesn't (cont) __HTTP__ _E_\nISIS is operating a training camp 8 miles outside our Southern border __HTTP__ We need a wall. Deduct costs from Mexico! _E_\n#NeverTrump is never more. 
They were crushed last night in Cleveland at Rules Committee by a vote of 87 12. MAKE AMERICA GREAT AGAIN! _E_\nIf Snowden was such a hero then he would be in America. He is escaping justice! _E_\nLooks like plane may have been found in the Indian Ocean off the coast of Australia. _E_\nI just had a great victory against lightweight A.G. Eric Schneiderman. Most of his case re Trump U. was thrown out or gutted. Little remains _E_\nRemember when the failing @nytimes apologized to its subscribers right after the election because their coverage was so wrong. Now worse! _E_\nThe Republicans will get zero credit for passing immigration reform—and I said zero! _E_\nI am not just running against Crooked Hillary Clinton I am running against the very dishonest and totally biased media but I will win! _E_\nWatch commodity prices soar because of the freezing cold. Will be bad for the economy. We could use some global warming. _E_\nWow my campaign is hearing from more and more Bernie supporters that they will NEVER support Crooked Hillary. She sold them out V.P. pick! _E_\nIn Austin Texas with some of our amazing Border Patrol Agents. I will not let them down! __HTTP__ __HTTP__ _E_\nLIVE FACT CHECK: Trump's RIGHT. The Clinton Foundation has taken MILLIONS from the Middle East. #DrainTheSwamp __HTTP__ _E_\nI will be on @CNNSitRoom with @wolfblitzer from 5 7pm est. on @CNN. _E_\nThank you Piers. __HTTP__ _E_\nVia @australian: Trump empire planning to build a presence in Sydney __HTTP__ _E_\nEntrepreneurs: Practice positive thinking with a lot of reality checks. Know that goals come with obstacles. _E_\nMany people are saying that my challenge to Obama is having a huge negative effect on his poll numbers I agree. _E_\nNational Pearl Harbor Remembrance Day \"A day that will live in infamy!\" December 7 1941 _E_\nMy heart & prayers go out to all of the victims of the terrible #Brussels tragedy. This madness must be stopped and I will stop it. 
_E_\nSpoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenMajLeader @SpeakerRyan. Th... __HTTP__ _E_\nI think having Jeb's endorsement hurts Lyin' Ted. Jeb spent more than $150000000 and got nothing. I spent a fraction of that and am first! _E_\n.@JebBush just took millions of $'s in special interest money to look like a tough guy. Will never work! _E_\nVia @DMRegister by @JoelAschbrenner: Trump to speak at @LandExpo in West Des Moines __HTTP__ _E_\nWind farms are killing many thousands of birds. They make hunters look like nice people! _E_\n#CelebrityApprentice contestant @LouFerrigno stopped by to visit today __HTTP__ _E_\nAnother company that the DOE has given money to just filed for bankruptcy. This is how the money we borrow at 40% from China is wasted. _E_\nWho do you want negotiating for us? __HTTP__ _E_\nThe American gymnastic team was great our country should take their lead. _E_\nObama lied to the public about the Al Qaeda attack on our consulate in Libya. He should be held accountable. _E_\nWhy does @mcuban continue to embarrass the 31 35 & 11TH place @dallasmavs with childish behavior? Really unprofessional! _E_\n.@RepTomMarino Great job on television this morning. Glad to have you on my side! _E_\nJohn Kerry is openly celebrating the tenuous nuclear deal with Iran. Great dealmakers do not celebrate dealsthey just go on to the next one _E_\nWill be interviewed on @FaceTheNation with @JDickerson tomorrow at 10:30am EST. Enjoy! _E_\nICYMI @IvankaTrump's int. on @TODAYshow discussing @Joan_Rivers & contestant rivalries on @ApprenticeNBC __HTTP__ _E_\nJoin my team over on my Facebook page live now! #Debates __HTTP__ __HTTP__ _E_\nHappy Veterans Day to ALL in particular to the haters and losers who have no idea how lucky they are!!! _E_\nTrump National Golf Club Washington D.C. is situated on 600 acres overlooking the Potomac River. Beautiful! __HTTP__ ... 
_E_\nHow much longer are we expected to put up with the world's most incompetent leader ObamaCare Iran Syria bads deals. JUST NEVER ENDS _E_\nRT @DanScavino: .@realDonaldTrump stops by overflow room in Mechanicsburg Pennsylvania prior to main rally. #TrumpMovement #MAGA __HTTP__ _E_\nIsrael is being barraged by rockets from Gaza recently. They must respond accordingly in defense of their citizens. _E_\nWe are going to defend our industry & create a level playing field for the American worker. It is time to put... __HTTP__ _E_\nOil is rising back over $100 barrel. OPEC loves to rip us off. Why shouldn't they they always get away with it. _E_\n.@AROD is back on the DL. The coming suspension will be announced soon by @MLB. _E_\nI've realized that success requires 100% effort and 100% focus. Nothing less. _E_\nThere is nothing nice about searching for terrorists before they can enter our country. This was a big part of my campaign. Study the world! _E_\nOne of the best produced including the incredible stage & set in the history of conventions. Great unity! Big T.V. ratings! @KarlRove _E_\nNoisy windfarm driving community crazy! __HTTP__ @AlexSalmond @AberdeenCC @AberdeenshireCC _E_\nDiscipline is a key ingredient for success. It will build character motivation and bring opportunity. _E_\nThe new NBC POLL has me in first place but said I was third in the debate I demand a recount (just kidding!). EVERY other poll had me #1. _E_\nDow dives more than 500 points down 9% from high. Be careful! _E_\nThe CBO has predicted that unemployment will rise to 8.8% this next year. __HTTP__ This is @BarackObama's economic recovery. _E_\nWhat is our country coming to when a judge can halt a Homeland Security travel ban and anyone even with bad intentions can come into U.S.? 
_E_\n30 million Americans are unemployed yet Obama has set up workshops across the country for illegals to get Amnesty __HTTP__ _E_\nThe @ForbesInspector & @AAAnews 5 star restaurant @TrumpNewYork's @Jean_GeorgesNYC is NYC's top destination __HTTP__ _E_\nRT @EricTrump: Tune into @GMA right now to catch a great interview with my father & the entire family! #VoteTrumpPence16 __HTTP__ _E_\nHillary Clinton spokesperson admitted that their was no ISIS video of me. Therefore Hillary LIED at the debate last night. SAD! _E_\nOur economy is struggling and OPEC continues to rip us off. Output is low and the price is too high. They ar... (cont) __HTTP__ _E_\nMarco Rubio is being crucified by the media for drinking water during speech! _E_\n\"If it's worth doing it's worth fighting for. You'll have lots of people and obstacles in your way. Work & fight to get beyond them. _E_\nResidential Capital a company in which Warren Buffett is involved went bankrupt but that doesn't mean that Warren Buffett went bankrupt! _E_\nChampions aren't made in the gyms. Champions are made from something they have deep inside them a desire a dream a vision. Muhammad Ali _E_\nThank you Sparks Nevada!#VoteTrumpNV #NevadaCaucus Finder: __HTTP__ __HTTP__ _E_\nHeading over to the @UN to meet with Ambassador @NikkiHaley and all of her great representatives! #USA _E_\nDo not underestimate yourself and know you are able to handle what comes your way by increasing your leverage. _E_\n#CelebrityApprentice @arsenioofficial \"trying to be invisible\"? No way that's going to happen. #sweepstweet _E_\nThank you Nashua New Hampshire! #MakeAmericaGreatAgain #Trump2016 #NHPolitics #FITN __HTTP__ __HTTP__ _E_\nVia @Suntimes: Trump wins at trial calls woman suing him 'horrible human being' __HTTP__ _E_\nWhen I said in an interview that Putin is not going into Ukraine you can mark it down I am saying if I am President. Already in Crimea! _E_\nI gave a woman named Barbara Res a top N.Y. 
construction job when that was unheard of and now she is nasty. So much for a nice thank you! _E_\nObama's 2014 budget \"eyes $1 trillion hike in tax revenue\" __HTTP__ He loves taxes. T E A. Taxed Enough Already. _E_\nWe have to bring back and cherish the middle class once the backbone and true strength of the U.S.A. It can happen! _E_\nA regular part of your day should be devoted to expanding your horizons. Learning is a new beginning. _E_\nI had a fantastic time with @jacknicklaus at the grand opening of the great @TrumpFerryPoint. Watch the video __HTTP__ _E_\nWill Smith did a great job by smacking the guy reporter who kissed him on the lips at a red carpet event. (cont) __HTTP__ _E_\nSorry losers and haters but my I.Q. is one of the highest and you all know it! Please don't feel so stupid or insecureit's not your fault _E_\nBIG @MittRomney is preferred to handle the economy over @BarackObama by 63% 29% in a @gallupnews poll __HTTP__ _E_\nRT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_\nGreat meeting with @HouseGOP and @SenateGOP leaders including @SpeakerRyan @SenateMajLdr @GOPLeader @JohnCornyn... __HTTP__ _E_\n\"Tomorrow hopes we have learned something from yesterday.\" John Wayne _E_\nChina is getting minerals from Afghanistan __HTTP__ We are getting our troops killed by the Afghani govt't. Time to get out. _E_\nI will be doing #GDNY Good Day N.Y. with Rosanna &Greg live at 8.30 A.M. I will be giving money to a great guy who lost his son in the WTC. _E_\nCaptain Khan killed 12 years ago was a hero but this is about RADICAL ISLAMIC TERROR and the weakness of our leaders to eradicate it! _E_\nCrooked Hillary Clinton got Brexit wrong. I said LEAVE will win. She has no sense of markets and such bad judgement. Only a question of time _E_\nRalph Northam will allow crime to be rampant in Virginia. He's weak on crime weak on our GREAT VETS Anti Second Amendment.... _E_\nPeaceful protests are a hallmark of our democracy. 
Even if I don't always agree I recognize the rights of people to express their views. _E_\nDefinitely watch @Carl_C_Icahn 's 'Danger Ahead'. Very insightful particularly on how corp inversions hurt America: __HTTP__ _E_\nSee story in Fusion and Huff. Post about rape at the border. Beyond terrible! Isn't Fusion owned by Univision? _E_\nTrump Tower Punta Del Este features the Trump Organization's signature superior quality detail & perfection __HTTP__ _E_\nI will not be commenting on boardroom specifics would be unfair to the different time zones. #CelebApprentice _E_\nAttitude is a little thing that makes a big difference. Winston Churchill _E_\nNasty for the middle class electricity prices surged to an all time high this past March __HTTP__ FRACK NOW _E_\nOnce John Kasich announced he was running for president and opened his mouth people realized he was a complete & total dud! _E_\nIt is now commonly agreed after many months of COSTLY looking that there was NO collusion between Russia and Trump. Was collusion with HC! _E_\n.@NYMag is a piece of garbage but I think it is very nice & charitable that they employ the no talent illiterate hack @jonathanchait. _E_\nGovernor Cuomo is right about one thing Attorney General Eric Schneiderman does wear eyeliner! What the hell is up with him? _E_\nHeading to Joint Base Andrews on #MarineOne with Prime Minister Shinzō earlier today. __HTTP__ _E_\nRT @RealJamesWoods: Only now with a #RealPresident do we see the scope of destruction engineered by #Obama and the #Democrat cabal. @realD... _E_\nWill be on Meet the Press with @ChuckTodd tomorrow morning. Enjoy! _E_\nWhy aren't people looking at this reporters earliest statement as to what happened that is before she found out the episode was on tape? _E_\nThank you to all of my Twitter followers for helping to defeat Weiner and Spitzer. Remember in the beginning they said it couldn't be done! _E_\nCongratulations to Linda McMahon on her victory in the Connecticut Senate primary. 
She is an amazing woman smart as you get! @Linda_McMahon _E_\nPresident Obama is the best thing that ever happened to Jimmy Carter! _E_\nMy Administration Governor @RicardoRossello and many others are working together to help the people of Puerto Rico in every way... _E_\nRT @seanspicer: .@timkaine wants to tough on crime fails to talk about defending rapists and murders #VPDebate _E_\nVia @chicagotribune by @bob_writes: \"@TrumpChicago tower unit sets resale record at $3.99M\" __HTTP__ _E_\nCongratulations to Boston on the @RedSox World Series victory. Earned and deserved. _E_\nMar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_\nOne of the most expensive projects ever in Miami @TrumpDoral's $200M of renovations are right on schedule. When completed will be elite! _E_\nI will be interviewed by Chris Wallace on Fox tomorrow morning. Tune in! _E_\n.@MiamiHerald discusses our @TrumpCollection #TrumpPets program @TrumpDoral: __HTTP__ _E_\nColumbia University stated there was a computer error in their system concerning @BarackObama's attendance. (cont) __HTTP__ _E_\nPerhaps a new meeting will be set up with the @nytimes. In the meantime they continue to cover me inaccurately and with a nasty tone! _E_\nUnder President Obama do you think America will become a THIRD WORLD COUNTRY? _E_\nIn 2010 alone our trade deficit with China cost over 566000 jobs __HTTP__ This is unsustainable for the American worker. _E_\nThe government will spend over $3.8T this year. The sequester is a pittance of the outlays less than 2%. Where's the problem? _E_\nThank you @NFIB together we will #MakeAmericaGreatAgain! __HTTP__ _E_\nOur debt is about to reach $17T. Iraq has $20T in oil reserves. Interesting. 
_E_\nMy @FoxNews interview @seanhannity discussing Obama's failed presidency Ebola DC Post Office midterms & 2016 __HTTP__ _E_\nEveryone is laughing at the @nytimes for the lame hit piece they did on me and women.I gave them many names of women I helped refused to use _E_\nBoth candidates are looking sharp now it's up to the mouth and the mind. #VPDebate _E_\nWe had all the leverage in our nuclear negotiations with Iran and our leaders foolishly decided to let them out of the trap. WHY? _E_\nLoser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must cut off & use better! _E_\nKeep testing your limits. Never become complacent. Always think big! _E_\nThe $1B failed website is the tip of the iceberg on the ObamaCare. Over 90 million estimated will lose their plans next year. _E_\nICYMI @IvankaTrump's @waytooearly int. w/@ThomasARoberts on @ApprenticeNBC's firingsTrump Int'l DC & @MissUniverse __HTTP__ _E_\n...confidence that President Al Sisi will handle situation properly. _E_\nAn honor to join the @FaithandFreedom Coalition yesterday. In America we don't worship government. We worship God.... __HTTP__ _E_\nGood move by @MSNBC in downgrading @WeGotEd to a dead weekend spot. This is truly a guy who shouldn't be on tv. _E_\nA lot changed when David Letterman said he was probably born in this country the word probably is a total disaster for Obama. _E_\n.@mike_pence is doing a great job so far no contest! _E_\nSeal the deal! Hold your business meeting at the luxurious @TrumpNewYork Executive Board Room __HTTP__ _E_\nCanada's PM was in China last week brokering a deal to sell the oil @BarackObama rejected in Keystone. __HTTP__ Unbelievable! _E_\nPresident Andrew Jackson who died 16 years before the Civil War started saw it coming and was angry. Would never have let it happen! _E_\nDonald Trump tops Franklin Pierce/Herald poll at 28 percent in N.H. __HTTP__ _E_\nWhat do you think so far? 
#CelebApprentice _E_\nToday is National Prescription Drug Take Back Day. Everyone can help fight the #OpioidEpidemic by participating! __HTTP__ __HTTP__ _E_\nSenator Landrieu If you are a Senator representing Louisiana then you SHOULD own a home in the state. Send @BillCassidy to the Senate! _E_\nWhat a foolish move by @davidaxelrod to speak in Boston yesterday! Completely outmaneuvered by the @MittRomney campaign. _E_\nMexican leadership has been laughing at us for many years but now it's no longer laughter—it's disbelief... _E_\n\"Runaway Obamacare Spending Will Cost Democrats\" __HTTP__ via @BloombergView by @lanheechen _E_\nHow much money is the extremely unattractive (both inside and out) Arianna Huffington paying her poor ex hubby for the use of his name? _E_\nI LIVE IN NEW JERSEY & @realDonaldTrump IS RIGHT: MUSLIMS DID CELEBRATE ON 9/11 HERE! WE SAW IT! __HTTP__ _E_\nIt's Thursday. How much time did Washington waste today trying to find a solution on the so called fiscal cliff? _E_\nWE WILL ONLY BE THE LAND OF THE FREE AS LONG AS WE ARE HOME OF THE BRAVE! _E_\nI will be on Fox & Friends tomorrow morning at 7. Will be discussing basic stupidity and incompetence of which our leaders have plenty! _E_\nGet on Trump's List email from the RNC was not authorized. I am self funding my campaign! Do not pay. Email: __HTTP__ _E_\nVia @PeoplesCompany: Real Estate Magnate Donald J. Trump to Headline 2015 @LandExpo in West Des Moines Iowa __HTTP__ _E_\nNobody could have done what I've done for #PuertoRico with so little appreciation. So much work! __HTTP__ _E_\nI wonder if @megynkelly and her flunkies have written their scripts yet about my debate performance tonight. No matter how well I do bad! _E_\nMy honor thank you. __HTTP__ _E_\nJeb Bush just got contact lenses and got rid of the glasses. He wants to look cool but it's far too late. 1% in Nevada! _E_\nRT @ColumbiaBugle: @realDonaldTrump @FLOTUS President Trump greeting families affected by Hurricane Harvey. 
#TexasStrong __HTTP__ _E_\n.@DarrellIssa is a very good man. Help him win his congressional seat in California. _E_\nDon't believe the manipulated job numbers. Walmart has just cut orders with suppliers because of rising inventory. _E_\nProduct placement is a definite prerequisite. #sweepstweet _E_\nI love the Mexican people but Mexico is not our friend. They're killing us at the border and they're killing us on jobs and trade. FIGHT! _E_\nThe animal who beheaded the woman in Oklahoma should be given a very fast trial and then the death penalty. The same fate beheading? _E_\n...Americans do what we do best: we pull together. We join hands. We lock arms and through the tears and the sadness we stand strong... __HTTP__ _E_\n.@MonicaCrowley you were GREAT on @seanhannity tonight. Thank you for the nice words! _E_\nJoin me in Florida on Wednesday! Daytona & Jacksonville:Daytona \\ 3pm __HTTP__ | 7pm __HTTP__ _E_\nThank you Cadillac Michigan! #VoteTrumpMI on 3/8/2016. We will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\n\"The best luck of all is the luck you make for yourself.\" – General Douglas MacArthur _E_\nHillary Clinton's short speech is pandering to the worst instincts in our society. She should be ashamed of herself! _E_\nJust completed purchase of magnificent Ritz Carlton in Jupiter Florida. Will be renamed Trump National Golf Club & be tremendous success. _E_\nGabriel Aubry should learn how to fight—he became a punching bag. Always drama with Halle B! _E_\nInside 'Bill Clinton Inc.': Hacked memo reveals intersection of charity and personal income. #DrainTheSwamp! __HTTP__ _E_\nOn immigration I'm consulting with our immigration officers& our wage earners. Hillary Clinton is consulting with Wall Street. _E_\nWhile Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him & his dumb clown humor. Too bad! _E_\n.@tomhanks was fabulous in @LuckyGuyPlay last night—as was the entire cast. _E_\nThe U.S. 
has enough problems without publicity seekers going out and openly mocking religion in order to provoke attacks and death. BE SMART _E_\nThe failing @nytimes has gone nuts that Crooked Hillary is doing so badly. They are willing to say anything has become a laughingstock rag! _E_\n\"Everyone's dream can come true if you just stick to it and work hard.\" @serenawilliams _E_\n150 Clinton E mails still contain classified information. More sensitive when she was Sec.of State. This is a very big deal. _E_\nTHANK YOU Atlanta Georgia! Leaving for Nevada now. Lets MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ __HTTP__ _E_\nJoin me in Pensacola Florida this Friday at 7pm! #VoteTrump __HTTP__ __HTTP__ _E_\nWhy is oil at a record high? OPEC & the oil speculators continue to rip us off. _E_\nThe train accident that just occurred in DuPont WA shows more than ever why our soon to be submitted infrastructure plan must be approved quickly. Seven trillion dollars spent in the Middle East while our roads bridges tunnels railways (and more) crumble! Not for long! _E_\nMonday morning 7:30 AM I'll be on @foxandfriends. Tune in! _E_\nMike & Mike in one minute! _E_\nDid the poor but smart to leave ex husband of @ariannahuff get any of the dollars she got for the use of his name in really stupid AOL deal? _E_\nMy @foxandfriends interview re: @IvankaTrump's pregnancy my grandchildren Obama's 18% tax rate & Obamacare __HTTP__ _E_\nGood @marcorubio is trying to eliminate the tax on Olympic medals __HTTP__ Our athletes should not be taxed on their wins. _E_\nRT @seanhannity: Graph: @RealDonaldTrump's Historic 13 Million Primary Votes Compared To Every GOP Nominee Since 1908 __HTTP__ _E_\nReal estate taxes are far too high @BriarcliffManor Westchester. A total joke how they waste money! Replace Mayor Vescio. 
_E_\n70 stories above Panama Bay @TrumpPanama the majestic sail design is Central America's architectural icon __HTTP__ _E_\nThe Rust Belt was created by politicians like the Clintons who allowed our jobs to be stolen from us by other countries like Mexico. END! _E_\nThank you @TIME readers a great honor! __HTTP__ _E_\nDemocrat Congresswoman totally fabricated what I said to the wife of a soldier who died in action (and I have proof). Sad! _E_\nSaudi Arabia was vehemently against the Iran nuclear deal. Then today they embraced it. What happened? What did we give them to endorse? _E_\nCongrats @SixteenChicago's @ChefLents on your Chef of the Year nom in @EaterChicago Annual Eater Awards Vote now! __HTTP__ _E_\nLooking forward to speaking at Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. Second visit to New Hampshire this year. _E_\nAs I always said the Birthers were after the truth. Thanks to @RealSheriffJoe @BarackObama can't hide anymore. _E_\nWOW! __HTTP__ _E_\nI'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nCongress must repeal ObamaCare. Obama will veto while Americans continue to lose their doctors & pay rising premiums. _E_\nJOIN ME TOMORROW!MINNESOTA 2pm __HTTP__ 6pm __HTTP__ 9:30p... __HTTP__ _E_\nFAKE NEWS A TOTAL POLITICAL WITCH HUNT! _E_\nMy @foxandfriends interview from yesterday __HTTP__ _E_\nRemember this @BarackObama told @GStephanopoulos in 09 that it is not true that the individual mandate is a tax __HTTP__ _E_\nThank you America! #Trump2016 __HTTP__ __HTTP__ _E_\nThe losing team is now back in boardroom. I can't discuss the team members or what's going on or what happens from here on out. _E_\nOur deficit spending is China's gain. @BarackObama is bankrupting our country. _E_\nThank you @JCLayfield will get even better as my Administration continues to put #AmericaFirst __HTTP__ _E_\nBenghazi. Obama lied. Our people died. 
_E_\nA lovely letter from the daughter of the late great John Wayne. Our country could use a John Wayne right now. __HTTP__ _E_\nBernie Sanders was right when he said that Crooked Hillary Clinton was not qualified to be president because she suffers from BAD judgement! _E_\nCongratulations to Jim Herman my ass't golf pro at Trump Nat'l Golf Club/Bedminster NJ for qualifying for the U.S. Open! @usopengolf _E_\nGreat shots of @TrumpTowerNY #CelebApprentice _E_\nHad a record crowd in Boone Iowa. A fantastic day we will #MakeAmericaGreatAgain __HTTP__ _E_\nI really like Nelson Mandela but South Africa is a crime ridden mess that is just waiting to explode not a good situation for the people! _E_\nAlways enjoy appearing on @extratv. @MarioLopezExtra & @mariamenounos were terrific yesterday. _E_\nHillary Clinton has been working on solving the terrorism problem for years. TIME FOR A CHANGE I WILL SOLVE AND FAST! _E_\nIf you want to be a success you have to get used to frequently hearing the word no and ignoring it. Think Big _E_\nJust bought Doral Hotel & Country Club in Miami within two years it will be the best resort in the country. _E_\nTHe Art of the Deal The best thing you can do is deal from strength and leverage is the biggest strength you have. CUT CAP and BALANCE. _E_\nI just got off the phone with the great people of Guam! Thank you for your support! #VoteTrump today! #Trump2016 _E_\nExclusive interview w/ my wife @MELANIATRUMP tomorrow morning @ 8amE on @Morning_Joe w/ @morningmika @MSNBC. Enjoy! __HTTP__ _E_\nRT @USCGSoutheast: .@USCG crews worked together with the @RedCross @fema and members of local #police #fire and #government to distribut... _E_\nGlad to hear Mariano Rivera is going to make a comeback in 2013. He is a true sportsman and a great competitor. __HTTP__ _E_\n.@canoetravel Also the very obsolete ugly and expensive wind turbines will never be build in Aberdeen. No longer works. 
@GolfChannel _E_\nICYMI @MELANIATRUMP Reading newspapers and see... #BillyGraham95 #happybirthday @BillyGraham __HTTP__ _E_\nMy team of deplorables will be managing my Twitter account for this evenings debate. Tune in!#DebateNight #TrumpPence16 _E_\nSummer's almost here update your business wardrobe with Trump Signature Collection exclusively available @Macys __HTTP__ _E_\nHeading to Baton Rouge Louisiana for a speech. Expecting a very large crowd! See you soon. #Trump2016 #MakeAmericaGreatAgain _E_\nHillary Clinton is weak on illegal immigration among many other things. She is strong on corruption corruption is what she's best at! _E_\nThanks to all for the wonderful congratulation sent to me on the birth of Ivanka's little boy so nice! _E_\nLightweight @AGSchneiderman the worst attorney general in the US is in a tough election with John Cahill @CahillForAG _E_\nI like Russell Brand but Katy Perry made a big mistake when she married him. Let's see if I'm right I hope not. _E_\nYou can take the smartest kid at Wharton the one who gets straight A's and has a 170 IQ and if he doesn't (cont) __HTTP__ _E_\nRealize that being an entrepreneur is not a group effort. You're in charge everything starts with you. _E_\n\"Know from the inside out that you have the power to succeed and you will.\" – Think Like a Champion _E_\nHappy Veterans Day. To those who have served thank you for your special work. _E_\nDopey @chicagotribune critic fails to mention the ugly Sun Times sign. _E_\nAnother attack in London by a loser terrorist.These are sick and demented people who were in the sights of Scotland Yard. Must be proactive! _E_\nI guess they don't have freedom of the press in Scotland. We created this ad and the ASA would not allow us to (cont) __HTTP__ _E_\nGreat interview in @postedtoronto of @DonaldJTrumpJr: He makes me proud. 
__HTTP__ _E_\n\"90% of Trump 2017 news coverage was negative\" and much of it contrived!@foxandfriends _E_\nDoing interview today with Maria Bartiromo at 10:00 A.M. on @FoxNews ENJOY! _E_\nHillary & Obama's Broken Promises. #RepealObamacare __HTTP__ _E_\nand yet another ...all of them are spectacular. __HTTP__ _E_\nWelcome to the new reality! Moody's just downgraded the entire US health insurance industry because of ObamaCare. _E_\nWill be speaking to President Recep Tayyip Erdogan of Turkey this morning about bringing peace to the mess that I inherited in the Middle East. I will get it all done but what a mistake in lives and dollars (6 trillion) to be there in the first place! _E_\nThanks @renee2i for hosting me tomorrow at the Two International Group! Looking forward to making new friends & discussing #FITN topics. _E_\nMy interview with @gretawire on Fox News for those who missed it 'Obama's Constantly on Vacation' __HTTP__ _E_\nMy @FoxNews interview w/@seanhannity __HTTP__ _E_\n.@SenTedCruz Ted free legal advice on how to pre empt the Dems on citizen issue. Go to court now & seek Declaratory Judgment you will win! _E_\nSugar: @Lord_Sugar Keep working hard so I make plenty of $ with your show... _E_\nObama won't send troops to fight jihadists yet sends them to Liberia to contract Ebola. He is a delusional failure. _E_\n.@BlairKamin Blair you may be the worst architectural critic in the business but thanks for your nice reviews about Trump Chicago & sign PR _E_\nTrump @DoralResort is hosting the WGC @cadillacchamp from March 6th – 10th. Join me I will be there all four days. _E_\nGreat purchase in Ireland will be a top spot! __HTTP__ _E_\nThe three political disasters could lead to a major and complete political meltdown! _E_\nVia @postandcourier: The Donald at @TheCitadelOEA __HTTP__ _E_\nRT @paultdove: @FoxBusiness Republican Senators who are opposing the President look at the great economic news: Americans Are Noticing! 
_E_\nEnjoyed my visit to Trump Doral in Miami yesterday. Looking forward to returning for the WGC @Cadillac Championship on March 6th 10th... _E_\nBy not doing the failed poorly rated debate I was able to make the point of not allowing unfairness while raising $6000000 for VETS. _E_\nLoved being in Manassas VA last night. Such incredible spirit! Now in DC for a speech will then visit Old Post Office under construction. _E_\nYou can find your polling locations at: __HTTP__ #FITN #NHPrimary #VoteTrumpNH __HTTP__ _E_\nThanks Go Angelo people are now really aware of my ties shirts and cuff links at Macy's _E_\nWhen your secretary of defense tells you that your proposed cuts will erode America's military capability you (cont) __HTTP__ _E_\nMy honor. __HTTP__ _E_\nWhy is @AlexSalmond pursuing environment destroying windmills when @VattenfallGroup quit because of no (cont) __HTTP__ _E_\nAmerica's debt crisis is our country's greatest challenge. Spending must be curbed for our long term fiscal future. _E_\nDetroit is going through very hard times right now.. If they are smart brighter days are ahead. _E_\nOPEC will use yesterday's attacks on our embassies to raise the price of gas. They are always ripping us off. _E_\nChina is so brazen that they now give us economic advice they tell us what to do much like a strong stockh... (cont) __HTTP__ _E_\nWill be on @foxandfriends now. Enjoy! _E_\nJack Welch thinks Sam Palmisano retired CEO of IBM should be the next CEO of MICROSOFT. Interesting! _E_\nWhat did you think of the boardroom? #CelebApprentice _E_\nThe world is noticing thanks! __HTTP__ _E_\n.@NeneLeakes seeks my advice on prenups tonight at 9 PM on Bravo _E_\nTomorrow on the @MissUniverse Facebook page submit your final question for the contestants __HTTP__ _E_\nEverybody tells me not to hit back at the lowlifes that go after me for PR sorry but I must. It's my nature. _E_\nWe stand in absolute solidarity with the people of the United Kingdom. 
__HTTP__ _E_\n\"Donald Trump offers political advice to Palm Beach Republicans\" __HTTP__ via @SunSentinel _E_\nWill be doing a live Thanksgiving Video Teleconference with Members of the Military at 9:00 A.M. Afghanistan Iraq USS Monterey Turkey & Bahrain. Then going to Coast Guard Quarters Florida. _E_\nWatch Obama push major global warming legislation early in his second term... _E_\nI am interviewed on the @oreillyfactor tonight at 8:00. Then at 10:00 I am interviewed by @donlemon on @CNN. Enjoy! _E_\nA 40mph gust of wind wrecked a wind turbine in Scotland __HTTP__ Any turbine in close proximity to a school must go! _E_\nWhat is a bit appealing about this idea of Trump hosting a debate is consider the diverse audience that perh... (cont) __HTTP__ _E_\nWhat did you think of my decision? #CelebApprentice _E_\nConvenient David Plouffe collected $100G fee from Iranian affiliate only a month before joining @whitehouse __HTTP__ _E_\nMake sure you're registered to vote! Let's #MakeAmericaGreatAgain! We can't afford more years of FAILURE! All info:... __HTTP__ _E_\nGet ready for tonight! _E_\nRemember when comedian Bill Maher openly praised the disgusting terrorists who destroyed the World Trade Center then got canned by ABC? _E_\nI cannot believe that Apple didn't come out with a larger screen IPhone. Samsung is stealing their business. STEVE JOBS IS SPINNING IN GRAVE _E_\nThank you Hilton Head South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIs it legal for a sitting President to be wire tapping a race for president prior to an election? Turned down by court earlier. A NEW LOW! _E_\nJury was unanimous after hearing the made up case against my co. Filed many years ago she.and her pathetic lawyer should pay me big damages _E_\nThe @EricTrumpFdn event featured a performance by #CelebApprentice @JohnRich a great event for a great cause! 
Watch __HTTP__ _E_\nVia @BreitbartNews by @mboyle1: Exclusive: Trump To Address South Carolina Tea Party Convention __HTTP__ _E_\nBoth Washington D.C. and DALLAS are turning out to be really big events. D.C. is protest of incompetent Iran deal and Dallas is big speech! _E_\nA single Ebola carrier infects 2 others at a minimum. STOP THE FLIGHTS! NO VISAS FROM EBOLA STRICKEN COUNTRIES! _E_\nI will be on Fox & Friends at 7.00 (20 minutes). Plenty of terrible and tragic news to talk about! Too bad. _E_\nThank you @foxandfriends great show! _E_\nI agree @MMFlint To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp w/ me on 11/8. __HTTP__ _E_\nRT @TXMilitary: #PhotosFromTheField: Aerial photos from our rescue crews earlier today. #Harvey #TMDHarvey @USNationalGuard __HTTP__ _E_\nLeaving for North Carolina. Big crowd will be fun! _E_\nGreat evening last night in New Hampshire. Got the endorsement from the New England Police Union big territory great people! Thank you. _E_\n\"I don't see the point of being politically correct if that means actually being incorrect.\" – Donald J. Trump 'Midas Touch' _E_\nSnowden is a traitor and a disgrace. Make no mistake he is no hero. In fact he is a coward who should come back & face justice. _E_\n\"Recognize that the world needs more entrepreneurs. Everyone is counting on you.\" – Midas Touch _E_\n.@bobschieffer did an excellent job as debate moderator last night. I only wish Mitt was more aggressive! _E_\nObama and Republicans are hollowing out our military. Now want to cut troop levels. Lowest level in over 20 years. _E_\nRT @foxandfriends: .@carriesheffield: The mainstream media is neglecting their duty to represent the public. They've failed to represent ha... _E_\nCongratulations to @drewbrees on setting the @NFL record with 48 consecutive games with a TD pass. He is a great guy and player. _E_\n#CelebApprentice Selfies yes or no? 
_E_\nGetting rdy to leave for tonight's Celebrate Freedom Concert honoring our GREAT VETERANS w/ so many of my evangelic... __HTTP__ _E_\nThank you. __HTTP__ _E_\nCan u believe that Jeb Bush's campaign manager is in Berlin Germany looking for money? What's he giving to Germany? __HTTP__ _E_\nHow can NYS allow lightweight @AGSchneiderman to remain in office? What are JCOPE & Moreland Commissions waiting on? __HTTP__ _E_\nVia @DrudgeReport: __HTTP__ _E_\nOur new Miss USA Alyssa Campanella came up to my office today for a visit. We're proud to have her as our new title holder. _E_\nBuy at the point of maximum pessimism sell at the point of maximum optimism. Sir John Templeton _E_\nHappy to have just passed 1.5M followers on twitter. We picked up over 14000 yesterday alone. It's great to speak to everyone daily. _E_\nWant jobs? Slash corporate tax rate. Tax incentives for companies that create jobs in US. America will boom. _E_\nNo matter the mission the brave men & women of our @USCG proudly answer the call to serve 24/7/365. THANK YOU and HAPPY BIRTHDAY! #CG227 __HTTP__ _E_\nFailure for all of @BarackObama's talk of engaging the world U.S. favorability has dropped around the world __HTTP__ _E_\nI will be doing a major sit down interview on State of the Union With Jake Tapper at 9:00 A.M. on @CNN. Enjoy! _E_\nI will be in PR on Tues. to further ensure we continue doing everything possible to assist & support the people in their time of great need. _E_\nRT @Reince: Happy New Year + God's blessings to you all. Looking forward to incredible things in 2017! @realDonaldTrump will Make America... _E_\nGeorge Will is a political moron. Last month he said Romney couldn't win. _E_\nGreat to talk jobs with #NABTU2017. Tremendous spirit & optimism we will deliver! __HTTP__ _E_\nWe are no longer silent. We are energized & ready to take our country back. Let's Make America Great Again! 
__HTTP__ _E_\n.@NBCNews purposely left out this part of my nuclear qoute: until such time as the world comes to its senses regarding nukes. Dishonest! _E_\n...2nd Amendment Strong Military ISIS historic VA improvement Supreme Court Justice Record Stock Market lowest unemployment in 17 yrs! _E_\nToday's ceremony is a day for both remembrance and resolve.#NATOMeeting #NATO __HTTP__ _E_\nYet another terrorist attack today in Israel a father shot at by a Palestinian terrorist was killed while: __HTTP__ _E_\n.@BarackObama reported over $269710 of foreign income out of his gross $894520 and paid $5841 in foreign taxes __HTTP__ _E_\nWhat a waste of time being interviewed by @andersoncooper when he puts on really stupid talking heads likeTim O'Brien dumb guy with no clue! _E_\nI appeared on David Letterman last night. And don't forget Sunday night the first episode of Celebrity Apprentice will be on NBC at 9 pm. _E_\n.@BrandenRoderick I was pleased to see the wonderful statements you made about me to the media.I'm not surprised you're a special person _E_\nMy shirts ties and fragrance are doing great at @Macys try them! Make fantastic gifts. _E_\n64 stories of golden glass over the strip @TrumpLasVegas' elite hotel rooms feature floor to ceiling windows __HTTP__ _E_\nLibya is selling its oil to China I notice the Chinese Ambassador is very safe. _E_\nHow low has President Obama gone to tapp my phones during the very sacred election process. This is Nixon/Watergate. Bad (or sick) guy! _E_\nThe only way to spread economic growth is to lower taxes and end unfriendly regulatory practices. _E_\nPresident Obama said that he thinks he would have won against me. He should say that but I say NO WAY! jobs leaving ISIS OCare etc. _E_\nThe Wilson family should thank me. Pegula overpaid for the @buffalobills because of me! _E_\nA woman who got fired after two days of working with Scott Walker a wacko now trying to raise funds to fight me. 
_E_\nRT @williebosshog: Make America Great Again! #Trump2016 __HTTP__ _E_\n#ConfirmGorsuch #SCOTUS __HTTP__ _E_\nAmerican incomes have fallen $3040 per household in the last 38 months __HTTP__ _E_\nHe is working hard and for that he must be given credit! _E_\nPetraeus is already negotiating a book deal. __HTTP__ Smart. Always negotiate when you are a hot commodity! _E_\nTop 50 Facts About Crooked Hillary Clinton From Trump 'Stakes Of The Election' Address: __HTTP__ _E_\nEntrepreneurs: Be cautiously optimistic. Call it positive thinking with a lot of reality checks. _E_\nThe Tonight Show begins in 5 minutes. Enjoy! _E_\nGetting ready to pay final respect to GREAT LADY Joan Rivers. She could light up a room like no other! She will be greatly missed. _E_\nI will be interviewed on @TODAY Show at 7:00 A.M. and on Morning Joe at 7:20. _E_\nLance Armstrong is now being sued by Fed Govt what was he thinkking? _E_\nBernie sanders has abandoned his supporters by endorsing pro war pro TPP pro Wall Street Crooked Hillary Clinton. _E_\nI am watching the New York mayoral race very closely... _E_\nRT @foxandfriends: .@JudgeJeanine: There will be an uproar in this country if they end up with an indictment against a Trump family member... _E_\nThis country cannot take four more years of Barack Obama! #Debate _E_\nSorry couldn't do @foxandfriends this morning big meeting. Will double up next week at 7. _E_\nI am the BEST builder just look at what I've built. Hillary can't build. Republican candidates can't build. They don't have a clue! _E_\nI knew last year that @TIME Magazine lost all credibility when they didn't include me in their Top 100... _E_\nI actually enjoyed the piece re sign @TheDailyShow. Could it be that I'm starting to like Jon Stewart? _E_\nWow Hillary and Bill are in deep trouble but don't worry my fellow Republicans will let them off the hook. All talk no action. 
_E_\nHear me on @kiss925toronto now!#rozandmocha _E_\nVia HuffPost Pollster #1 __HTTP__ _E_\nRT @seanhannity: #Hannity Starts in 30 minutes with @newtgingrich and my monologue on the Deep State's allies in the media _E_\nThr coverage about me in the @nytimes and the @washingtonpost gas been so false and angry that the times actually apologized to its..... _E_\nThe pinnacle of the luxury public golf experience @TrumpGolfLA overlooking the Pacific Ocean in Palos Verdes __HTTP__ _E_\nSleepy eyes @chucktodd when looking at my financial filings should've said \"Great job Mr. Trump Sir.\" _E_\nGood advice from my father Fred C. Trump: Know everything you can about what you're doing. _E_\nWell as predicted the 9th Circuit did it again Ruled against the TRAVEL BAN at such a dangerous time in the history of our country. S.C. _E_\nRT @DanScavino: Great interview on @foxandfriends by @SteveDoocy w/ Carrier employee who has a message for #PEOTUS @realDonaldTrump & #VPE... _E_\nWow every poll said I won the debate last night. Great honor! _E_\nIt is also amazing how comments can be edited to provide statements that are used in a knowingly incorrect manner. _E_\nDiet Coke tweet had a monster response dammit I wish the stuff worked. _E_\nVia @BreitbartNews by Katie McHugh: POLL: DONALD TRUMP LEADS THE PACK AS GOP FRONTRUNNER __HTTP__ _E_\nIs Gov. @BobbyJindal the stupid one for using the phrase \"the stupid party\" when referring to the Republicans? _E_\nFormer Navy SEAL Questions @BarackObama's birthplace __HTTP__ _E_\nCrooked Hillary Clinton discussing the #SecondAmendment at a private event. #2A cc: @NRA __HTTP__ _E_\nGoofy Senator Elizabeth Warren @elizabethforma has done less in the U.S. Senate than practically any other senator. All talk no action! _E_\nTHANK YOU AMERICA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n.@SethMacFarlane will be a great Oscar host. He did an amazing job at my @ComedyCentral roast. _E_\nAmerica needs to rebuild our infrastructure. 
Why are we sending trillions overseas when our own roads bridge... (cont) __HTTP__ _E_\nWhy won't President Obama use the term Islamic Terrorism? Isn't it now after all of this time and so much death about time! _E_\nJames Clapper called me yesterday to denounce the false and fictitious report that was illegally circulated. Made up phony facts.Too bad! _E_\nThe failing @nytimes is truly one of the worst newspapers. They knowingly write lies and never even call to fact check. Really bad people! _E_\n..not associated with Russia. Trump team spied on before he was nominated. If this is true does not get much bigger. Would be sad for U.S. _E_\nMany of Bernie's supporters have left the arena. Did Bernie go home and go to sleep? _E_\nGet the big picture but be prepared for the picture to change. Where there's a will there's a win. Think positively! _E_\nSorry folks but if I would have relied on the Fake News of CNN NBC ABC CBS washpost or nytimes I would have had ZERO chance winning WH _E_\nWork hard play hard and live to the hilt. Think Like a Billionaire _E_\nI will be on @foxandfriends in ten minutes enjoy! _E_\nI will be live tweeting @megynkelly Show in 10 minutes. Should be interesting. Will be on Fox Network! ENJOY! _E_\nThank you for your support!#AmericaFirst #LeadRight2016 __HTTP__ _E_\nSecy. Sebelius who was responsible for the horrendous ObamaCare rollout should resign or be fired.Refuses to go before Congress to explain _E_\nRT @austinroneil: @realDonaldTrump Thanks for all the inspirational quotes. Helping encourage this young entrepreneur. :) _E_\nI am going to give @Rosie a pass. @Rosie is desperate to get back on TV so she can be on yet another show that can be quickly canceled. _E_\nI will make our Military so big powerful & strong that no one will mess with us. #Trump2016 __HTTP__ __HTTP__ _E_\n#EndCommonCore #Trump2016Video: __HTTP__ __HTTP__ _E_\nGreat new poll Florida thank you! #MakeAmericaGreatAgain __HTTP__ _E_\nWow the economy is really bad! 
GROSS DOMESTIC PRODUCT down 0.7% in 1st. quarter and getting worse. I TOLD YOU SO! Only I can fix. _E_\nTell Congress to straighten out the many problems of our country before trying to be the policemen to the world. Make America great again! _E_\nVia @njdotcom by Eugene R. Dunn Medford: Donald Trump towers over GOP field __HTTP__ They hate us because they ain't us. _E_\nCongratulations to @BarackObama yesterday marked the 1 YR anniversary of our country's credit being downgraded __HTTP__ _E_\nWhat do you think of water boarding the Boston killer sometime prior to allowing our doctors to make him well? I suspect he may talk! _E_\nThoughts and prayers to the great people of Indiana. You will prevail! _E_\nThe new course at Trump International Scotland will be a par 72 layout with five sets of tees ranging from 7540 yards to 5630. _E_\nSee @IvankaTrump on the cover of @HudsonMOD? View the digital edition: __HTTP__ _E_\nHillary Clinton is the only candidate on stage who voted for the Iraq War. #Debates2016 #MAGA __HTTP__ _E_\nDishonest @nytimes reporter Jonathan Martin refused to acknowledge massive crowd surge forward... __HTTP__ _E_\n.@alexsalmond RT @islandbluenose You'll be doing us in scotland a great service if you win. Good Luck. _E_\n.@HillaryClinton is weak on illegal immigration & totally incompetent as a manager and leader no strength or stamina to be #POTUS! _E_\nGreat job once again by law enforcement! We are proud of them and should embrace them without them we don't have a country! _E_\nI will be on @OutFrontCNN with @ErinBurnett at 7PM. Tune in!#Trump2016 _E_\nginrnnr2 @realDonaldTrump ...is China economy in a bubble ? Only if we want it to be! _E_\nI have decided to add a caveat to my offer. Obama can't decide to send my $5M to Rev. Wright if he releases his records. _E_\nThis is about the money I gave to charity and in response to your comments about Gadhafi... 
__HTTP__ _E_\nA beautiful view from my office today __HTTP__ _E_\n#TrumpTower is one of the country's top tourist destinations. _E_\nThank you @IngrahamAngle! #AmericaFirst __HTTP__ _E_\nBeing an entrepreneur is not a group effort. You have to trust yourself and your instincts. _E_\nI can't resist hitting lightweight @DannyZuker verbally when he starts up because he is just.so pathetic and easy (stupid)! _E_\nThank you. __HTTP__ _E_\nChina is very much the economic lifeline to North Korea so while nothing is easy if they want to solve the North Korean problem they will _E_\nWe need another Bush in office about as much as we need Obama to have a 3rd term. No more Bushes! _E_\nThank you West Virginia. Let's keep it going. Go out and vote on Tuesday we will win big. #Trump2016 _E_\nSebelius didn't test $635M (probably $1B) ObamaCare website until \"a couple of days leading up to the launch.\" __HTTP__ _E_\nVia @Mediaite: Trump to @gretawire: Sequester Cuts Don't Go Far Enough' __HTTP__ _E_\nThe upcoming record 13th season of @CelebApprentice is going to be very special. Our production team's ingenuity is amazing. _E_\nThank you! __HTTP__ _E_\nI don't like bullies. I am not going to stand around and watch @KarlRove target the Tea Party. Karl Rove gave us Barack Obama. Loser. _E_\nGreat Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY! Congratulations! _E_\nBill O'Reilly doing a major special on @OreillyFactor tonight @FoxNews at 8pmE. Watch it should be good! #Trump2016 _E_\nA great night in West Allis Wisconsin! Thank you! #VoteTrumpWI #WIPrimary __HTTP__ __HTTP__ _E_\nWill soon be heading to Davos Switzerland to tell the world how great America is and is doing. Our economy is now booming and with all I am doing will only get better...Our country is finally WINNING again! _E_\nDrew Brees is having a great game a fantastic quarterback and really good guy! _E_\nTo aspiring entrepreneurs: Be tenacious. 
Once you've decided on your goals remain fixed on them. Set the bar high! _E_\nEntrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_\nVia Newsmax. Nice article thank you so much. __HTTP__ _E_\nThe real scandal here is that classified information is illegally given out by intelligence like candy. Very un American! _E_\n#MakeAmericaWorkAgain#TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_\nOur great African American President hasn't exactly had a positive impact on the thugs who are so happily and openly destroying Baltimore! _E_\nVia @MiamiHerald: \"@IvankaTrump talks family and business\" __HTTP__ _E_\nDid you ever not do something that had you done it would have turned out to be a disaster. Never look back just learn from your experience! _E_\nKatherine Webb gets a Donald Trump job offer says she's 'shocked' about the attention __HTTP__ via @Zap2it _E_\nDon't miss me on @foxandfriends Monday at 7:30 AM _E_\nWriting my inaugural address at the Winter White House Mar a Lago three weeks ago. Looking forward to Friday.... __HTTP__ _E_\nThank you to @foxandfriends for exposing the truth. Perhaps that's why your ratings are soooo much better than your untruthful competition! _E_\nMade in America? @BarackObama called his 'birthplace' Hawaii here in Asia. __HTTP__ _E_\n.@MittRomney is trying to hit back at me because I'm saying that he let the Repub Party down w/ his loss to Obama. Should've won—he choked! _E_\nOne of the simplest joys of life is golf. A great game to both play and watch. _E_\nWith the impending crisis in Korea is it a big confidence builder that Chuck Hagel is Sec. of Defense? Elections have consequences. _E_\n.@melaniatrump on @QVC tonight at 7PM EST. Tune in! _E_\n..But the people were Pro Trump! Virtually no President has accomplished what we have accomplished in the first 9 months and economy roaring _E_\nThe Chinese are now hacking White House computers. Why not? They already own the place. 
_E_\nMust read @nypost editorial on $40M NYC taxpayer settlement to Central Park Thugs Wilding for Profit __HTTP__ _E_\nWith our weakened dollar gas will continue to rise. Fracking is an answer to lowering energy costs. _E_\nIt was a very wise move that Ted Cruz renounced his Canadian citizenship 18 months ago. Senator John McCain is certainly no friend of Ted! _E_\nActually Putin doesn't want Alaska because the Environmental Protection Agency will make it impossible for him to drill for oil! _E_\n\"Being an entrepreneur is a big task. So what can you do to prepare? First and foremost expand your focus.\" Midas Touch _E_\nGreat new poll Iowa thank you!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nWe had a great News Conference at Trump Tower today. A couple of FAKE NEWS organizations were there but the people truly get what's going on _E_\nJoin me tonight in Cedar Rapids Iowa at 7pm: __HTTP__ Arizona tomorrow night at 3pm: __HTTP__ _E_\nInsurance companies are fleeing ObamaCare it is dead. Our healthcare plan will lower premiums & deductibles and be great healthcare! _E_\nCan you believe that Ted Cruz who has been killing our country on trade for so long just put out a Wisconsin ad talking about trade? _E_\nEntrepreneurs: Be totally focused. Know everything you can about what you're doing. Give your work 100% of your concentrated effort. _E_\nCongratulations to @newtgingrich on a stunning win in South Carolina. All eyes are on Florida now. _E_\nPaul Ryan should spend more time on balancing the budget jobs and illegal immigration and not waste his time on fighting Republican nominee _E_\nMy @FoxNews @gretawire int. on the border crisis #BringBackOurMarine & Obama's ineptitude & the economy __HTTP__ _E_\nWill be at Fort Worth (Texas) Convention Center at 11:30 A.M. Big crowd get there early! Big announcement to be made! _E_\n\"'THE DONALD' GOT A MUSKET\" __HTTP__ via @fitsnews _E_\nThank you Toledo Ohio! 
It is so important for you to get out and VOTE on November 8 2016! Lets MAKE AMERICA SAFE... __HTTP__ _E_\nVia @InverurieHerald: Trump's new course plans on display __HTTP__ _E_\nWatch as I humiliate a dais full of talent. #TrumpRoast airs tonight at 10:30/9:30c on Comedy Central __HTTP__ _E_\nAccounting firm Ernst & Young and the celebrity judges are insulted by Miss Pennsylvania's made up PR. _E_\n...approvals of The Keystone XL & Dakota Access pipelines. Also look at the recent EPA cancelations & our great new Supreme Court Justice! _E_\nI will bring our jobs back to America fix our military and take care of our vets end Common Core and ObamaCare protect 2nd A build WALL _E_\nMy @SquawkCNBC interview from earlier in the week discussing the GOP primary and @newtgingrich's electability __HTTP__ _E_\nThe Mayor of San Jose did a terrible job of ordering the protection of innocent people. The thugs were lucky supporters remained peaceful! _E_\nIn the center of Ireland's rugged west coast @Trump_Ireland offers a beautiful golf course top dining and a Spa __HTTP__ _E_\nBy Obama mentioning Manhattan yesterday in his response he has singlehandedly made it target #1. How totally stupid is this guy? _E_\nDisloyal R's are far more difficult than Crooked Hillary. They come at you from all sides. They don't know how to win I will teach them! _E_\nToday I delivered remarks at the 36th Annual National Peace Officers' Memorial Service. #NationalPoliceWeekWatch... __HTTP__ _E_\n.@pennjillette is an extraordinary entertainer & magician whose star on the Hollywood Walk of Fame is long overdue. Very proud of him. _E_\nVia @ArabianBusiness: \"@IvankaTrump eyes new projects in Abu Dhabi\" for Trump Organization __HTTP__ _E_\nAs long as we open our eyes to God's grace and open our hearts to God's love then America will forever be the land of the free the home of the brave and a light unto all nations. 
#NationalPrayerBreakfast __HTTP__ _E_\nMy daughter Ivanka thinks I should run for President. Maybe I should listen. __HTTP__ _E_\nDiscover your true self and surround yourself with people who complement your gifts and modes of operation. Midas Touch _E_\n.@AS_ScienceGuy @realDonaldTrump Thank you for all your support of @autismspeaks Great new breakthroughs. Fantastic! _E_\nIt is terrible that neither Obama Biden nor Kerry attended Lady Thatcher's funeral. They would all run to Muslim Brotherhood Morsi's. _E_\nThe only reason President Obama wants to attack Syria is to save face over his very dumb RED LINE statement. Do NOT attack Syriafix U.S.A. _E_\nRT @foxandfriends: FOX NEWS ALERT: North Korea responds to U.S. with Guam attack plan as Secretary Mattis warns Kim Jung Un \"he is grossly... _E_\n.@tedcruz you were terrific on @seanhannity tonight. I am going to the border tomorrow. _E_\n#CNNDebate __HTTP__ _E_\nIt's Thursday how much $ has @BarackObama wasted today? _E_\nDr. Ben Carson I concur. I believe in God who can change people he can make any of us better. @RealBenCarson _E_\nFrom Bloomberg: \"Chrysler's Jeep expects China production agreement soon.\" I told you so. _E_\nNo wonder Afghanistan is a mess! @BarackObama is releasing high level insurgents in exchange for pledges of peace. __HTTP__ _E_\n.@TrumpSoHo New York has interiors by celebrated design house Fendi Casa and 360 degree views of the city skyline. __HTTP__ _E_\nWow three top MICROSOFT investors want Bill Gates out as Chairman. Do not like job he is doing! _E_\nWhat an amazing comeback and win by the Patriots. Tom Brady Bob Kraft and Coach B are total winners. Wow! _E_\n\"Learning is a new beginning we can give ourselves every day.\" – Trump: How to Get Rich _E_\nJohnny Miller—Great job this weekend. Most insightful and tough. See you at Doral. _E_\nI will be making a major speech on ILLEGAL IMMIGRATION on Wednesday in the GREAT State of Arizona. Big crowds looking for a larger venue. 
_E_\n.@myfoxny discussing NYPD Chief Kelly's great record & the launch of the crowdfunding site __HTTP__ __HTTP__ _E_\nLet's take a closer look at that birth certificate. @BarackObama was described in 2003 as being born in Kenya. __HTTP__ _E_\nWish Obama would say ISIS like almost everyone else rather than ISIL. _E_\nAfghan Leader Karzai has received tens of millions of dollars IN CASH from the U.S. Government how stupidly is our Country being run? _E_\nWill be on @jimmykimmel tonight at 11:35pmE on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_\nI'm just so tired of listening to the same old rhetoric and words day after day from our President. It's time to stop talking WORK! _E_\nOver 100M are now receiving some form of welfare __HTTP__ We must do better. @MittRomney has the vision to get America working. _E_\nJust to show you how unfair Republican primary politics can be I won the State of Louisiana and get less delegates than Cruz Lawsuit coming _E_\nJoin us in Toledo Ohio tomorrow night at 8pm! #TrumpPence16 #MAGATickets: __HTTP__ __HTTP__ _E_\nAmerica became a powerhouse because of our deep belief in the virtue of self reliance. #TimeToGetTough (cont) __HTTP__ _E_\nHelp fund @Dratzenberger's new show 'American Made' on @fundanything __HTTP__ John is on @teamcavuto today re project. _E_\nWill be on @seanhannity tonight at 10pm hosted by @GovMikeHuckabee. Enjoy! _E_\nOur online store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_\nA true honor to receive the endorsement of John Wayne's daughter....read: __HTTP__ __HTTP__ _E_\nBayer AG has pledged to add U.S. jobs and investments after meeting with President elect Donald Trump the latest in a string... @WSJ _E_\n#trumpvlog The song Donald Trump hits 54 million views. @MacMiller Where's my money? 
__HTTP__ _E_\n\"@TrumpFerryPoint A Brand New Championship Golf Course In NYC Developed By Donald Trump And Anyone Can Play It\" __HTTP__ _E_\nVia @TheOaklandPress Donald Trump speaks in Novi(Michigan) draws record breaking crowd __HTTP__ _E_\nOur country under President Obama is on life support! Great leaders must bring people together. _E_\nVia @theobserver: Donald Trump: Lake Norman golf course 'one of the hottest places around' __HTTP__ _E_\nAfter being ripped off for years Obama finally figured out that China is taking advantage of us. He's finally listening to me. _E_\nGreat #Thanksgiving travel and parade watching tips by @NYTimesTravel including an option from @TrumpNewYork: __HTTP__ _E_\nCrooked's camp incited violence at my rallies. These incidents weren't spontaneous like she claimed in Benghazi! __HTTP__ _E_\nGo to Trump Doral in Miami and watch the World Golf Championship! On NOW! _E_\nWe commend SG @AntonioGuterres & his call for the UN to focus more on people & less on bureaucracy. #USAatUNGA #UNGA __HTTP__ __HTTP__ _E_\n.@FoxNewsSunday _E_\nEntrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_\nThe failing @HuffingtonPost and dopey @ariannahuff are writing so much false junk about me they just can't get enough! BE CAREFUL. _E_\nJust a reminder that Ted Cruz supported liberal Justice John Roberts who gave us #Obamacare. __HTTP__ _E_\nLies and incompetence the two words that are most closely associated with ObamaCare! _E_\nSpitzer and Weiner lost lightweight Eric Schneiderman will be next he will be challenged in the PRIMARY. He has done a really poor job! _E_\nEach of the 176 magnificent luxury suites and guestrooms at @TrumpNewYork provide a sophisticated urban appeal __HTTP__ _E_\nChris McDaniel looks like he will win in Mississippi GREAT NEWS and big victory for Tea Party! _E_\n\"Push yourself again and again. 
Don't give an inch until the final buzzer sounds.\" Larry Bird _E_\nHighly respected PUBLIC POLICY POLLING (PPP) just announced that I am number one in IOWA. Thank you! _E_\nBush was called unpatriotic by @BarackObama in '07 for adding $4T to debt __HTTP__ @BarackObama increased it $6T in 3 years. _E_\nDonna Summer performed for me many times she was great and will be missed. @TheDonnaSummer _E_\n.@BMP_Music_Event Read 'Midas Touch' great book for entrepreneurs. Good luck! _E_\n#TheView Lots of fun on @TheViewTV with @JennyMcCarthy and @SherriEShepherd __HTTP__ _E_\nJust completed call with President Moon of South Korea. Very happy and impressed with 15 0 United Nations vote on North Korea sanctions. _E_\n.@mystikangel Bring @johnrich back? He is back! _E_\nYes the BP oil spill was bad but it was no reason to put tighter clamps on domestic drilling. That showed no (cont) __HTTP__ _E_\nThe habitual vacationer @BarackObama is now in Hawaii. This vacation is costing taxpayers $4 milion +++ while there is 20% unemployment. _E_\n#VoteTrump video: __HTTP__ #ArizonaPrimary #UtahCaucus #UTCaucus #AmericanSamoa __HTTP__ _E_\nSo we can spy on our ally's leaders but can't water board terrorists? _E_\nJust got back to New York from California. Will be on Fox & Friends tomorrow morning at 7.00. ObamaCare and other disasters to be discussed _E_\nThe Fake News Is going all out in order to demean and denigrate! Such hatred! _E_\nToday's EO established a commission on combating drug addiction and the opioid crisis. Watch listening session... __HTTP__ _E_\nCarly Fiorina I agree! Ted Cruz is just another politician. All talk no action! __HTTP__ _E_\nIt was @BarackObama who promised if you like your plan you can keep your plan. Now ObamaCare is causing (cont) __HTTP__ _E_\n#UtahCaucus message from @IvankaTrump! #UTCaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nVote for your favorite @MissUSA contestant the 2013 #MissUSA Fan Vote at __HTTP__ ! 
_E_\nThe @PGATOUR comes to Miami on March 6th when the @CadillacChamp returns to @TrumpDoral __HTTP__ See you there! _E_\nWhy aren't the lawyers looking at and using the Federal Court decision in Boston which is at conflict with ridiculous lift ban decision? _E_\nYesterday in front of Rockefeller Center __HTTP__ _E_\nThe judge in the Oscar Pistorious case is a total moron. She said he didn't act like a killer. This is another O.J. disaster! _E_\nWacko @glennbeck is a sad answer to the @SarahPalinUSA endorsement that Cruz so desperately wanted. Glenn is a failing crying lost soul! _E_\nThanks to our loyal viewers & fans last night's @ApprenticeNBC topped all the demos & grew 24% in our regular slot premiere. _E_\nCongratulations @ElonMusk and @SpaceX on the successful #FalconHeavy launch. This achievement along with @NASA's commercial and international partners continues to show American ingenuity at its best! __HTTP__ _E_\nYesterday the White House claimed its ISIS strategy is a 'success.' Tell that to the Christians being beheaded. We need to hit ISIS hard! _E_\nImposing dunes on the Aberdeenshire coastline @TrumpScotland's Championship course is a classical Scottish links __HTTP__ _E_\nWith @BarackObama listing himself as Born in Kenya in 1999 __HTTP__ HI laws allowed him to produce a fake certificate. #SCAM _E_\n\"30000 MACY'S CUSTOMERS RETALIATE IN SUPPORT OF DONALD TRUMP\" __HTTP__ via @BreitbartNews by @ASwoyer _E_\nJust arrived in Youngstown Ohio with @FLOTUS Melania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nI will be going to New Hampshire today home of my first primary victory to discuss terror and the horrible events of yesterday. 2:30 P.M. _E_\nThank you to all of our law enforcement officers across America! #LESM #MAGA __HTTP__ __HTTP__ _E_\nWow just saw the really bad @CNN ratings. People don't want to watch bad product that only builds up Crooked Hillary. _E_\nWhy should ObamaCare be delayed for businesses and not working families? 
With premiums rising at record levels it is not equitable. _E_\nNext Saturday night I will be holding a BIG rally in Pennsylvania. Look forward to it! _E_\nBig election tomorrow in the Great State of Alabama. Vote for Senator Luther Strange tough on crime & border will never let you down! _E_\nJoin me in Clive Iowa tomorrow at noon! #AmericaFirst #MAGATickets: __HTTP__ __HTTP__ _E_\n...At the same time go through a worst case scenario but keep it short. Focus on your goal—look at the solution not the problem. _E_\nIt was my great honor to pay tribute to a VET who went above & beyond the call of duty to PROTECT our COMRADES our COUNTRY & OUR FREEDOM! __HTTP__ _E_\nLithium ion batteries should not be allowed to be used in aircraft. I won't fly on the Boeing 787 Dreamliner it uses those batteries. _E_\nNot only does Obama spy on German leaders he criticizes their trade surplus __HTTP__ We should have a trade surplus! _E_\nHillary took money and did favors for regimes that enslave women and murder gays. _E_\n.@PatrickBuchanan was great on @TeamCavuto @FoxNews. Thank you Pat! #Tump2016 _E_\n.@GOP HOUSE LEADERSHIP – ESTABLISH SELECT COMMITTEE ON BENGHAZI. THERE IS A MASSIVE COVERUP. _E_\nRT @namusca: #VoteTrump2016 a real leader that truly cares about America & our values. He wants to bring prosperity back 2 USA __HTTP__ _E_\nI am very disappointed in China. Our foolish past leaders have allowed them to make hundreds of billions of dollars a year in trade yet... _E_\nRemember anything you read about Atlantic City has nothing to do with me. I sold years ago and left. Good timing but very sad! _E_\nLooking forward to honoring the great Dogan family & the success of the Trump Towers project in Istanbul @FollowTurkey Annual Gala Dinner _E_\nTime to end the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_\n.@Peggynoonannyc An election between Hillary and myself will be the biggest voter turnout in U.S. history. 
Just like the debates 24 M vs 2M. _E_\nJust said at #NCGOPCon that I'm not beholden to lobbyists and donors! No special interest would control me if I were in office. _E_\nExcited to be travelling to New Hampshire on Monday. The Granite State is a model for the country. Live Free or Die! _E_\nSpanish version of ObamaCare website delayed __HTTP__ Hitting google translate apparently too complicated. #MakeDCListen _E_\nUSMC Sgt. Tahmooressi sacrificed for our country. While Obama is welcoming illegals our Marine is locked in a Mexican jail. #FreeOurMarine _E_\nTaxpayers are paying a fortune for the use of Air Force One on the campaign trail by President Obama and Crooked Hillary. A total disgrace! _E_\nWow Record ratings for WGC Cadillac Championship at Trump National Doral's Blue Monster Most watched in seven years. CONGRATS to@Tiger Woods _E_\nI have a proven track record supporting our Veterans. Veterans deserve universal access to care. VA scandal proves politicians are inept. _E_\nLaunching the Trump Home by Dorya Furniture Collection today. It looks amazing! @HPMARKETNEWS @DoryaInteriors __HTTP__ _E_\n#2017Jambo Remember your duty. Honor your history. Take care of the people God puts into your life – and LOVE & CHERISH your country! __HTTP__ _E_\nIt was the childishly written & taunting PR statement by Fox that made me not do the debate more so than lightweight reporter @megynkelly. _E_\n.@PiersMorgan is right he won the show because \"I know how to play the game.\" #CelebApprentice _E_\nCNN'S slogan is CNN THE MOST TRUSTED NAME IN NEWS. Everyone knows this is not true that this could in fact be a fraud on the American Public. There are many outlets that are far more trusted than Fake News CNN. Their slogan should be CNN THE LEAST TRUSTED NAME IN NEWS! _E_\n#ISIS is making $400M/year on oil. I have been saying it for years. We need to bomb the oil! __HTTP__ __HTTP__ _E_\nGreat Live Signing last nite! Over 25k views. I am signing books for next two weeks. 
Order yours for holiday gifts: __HTTP__ _E_\nEndorsements for Lyin' Ted Cruz __HTTP__ _E_\n.@IvankaTrump's Favorite Miami Hot Spots @TrumpGolf @TrumpDoral __HTTP__ _E_\nThe United States must greatly strengthen and expand its nuclear capability until such time as the world comes to its senses regarding nukes _E_\nAn updated POLL tracker (with all polls thru the weekend) reveals I maintained a double digit lead at... __HTTP__ _E_\nLook if we can make chopsticks in America and sell them to the Chinese we can compete on hundreds of other fronts as well. TimeToGetTough _E_\nRemember Obama limped across the finish line he should have lost to Hillary. Be careful! _E_\nIf Stop & Frisk is struck down by the pandering NYC politicians increases in crime & eventual terrorist attacks will be on them. _E_\nBullshit Pop gave me knowledge and a relatively small amount of money (split between brothers and sisters) and I built it into over 9 bill. _E_\nThank you working hard! __HTTP__ _E_\nWill be doing interview on @GolfChannel at 8.00 this morning. Will be talking about getting the great PGA Championship & Senior PGA etc. _E_\nVia @BreitbartNews by @rwildewrites: \"Donald Trump: I Can Make America Great Again\" __HTTP__ (Hyperlinked on @DRUDGE_REPORT) _E_\nMust see video Obama's criticism of @MittRomney is identical to Carter's on Reagan __HTTP__ _E_\nWhy did we spend billions of our money on Libya if we are not going to get any of the country's oil? What do we get out of this? _E_\nChina is buying gas fields in Texas __HTTP__ & stealing our corporate secrets... _E_\nWill be interviewed by @SeanHannity on @foxnews at 10PM tonight. Enjoy! _E_\nLooking forward to watching the legendary @BarbaraJWalters interview my family (and me) tonight on @ABC at 10:00. Many things to talk about! 
_E_\nWith two champion style courses @TrumpGolfDC graces 600 rolling acres along the peaceful and scenic Potomac River __HTTP__ _E_\nSenator Chuck Schumer helping to import Europes problems said Col.Tony Shaffer. We will stop this craziness! @foxandfriends _E_\nIf President Obama was going to attack Syria he should've done it a long time ago as a surprise & not after (cont) __HTTP__ _E_\nTO ALL AMERICANS __HTTP__ _E_\nI'm at Trump Doral right now Tiger will tee off shortly. _E_\nAs Iran began the process of taking over Iraq many people wanted me to say that \"I told you so!\" – so I told you so. _E_\nMUST READ! My @chicagotribune editorial: I love Chicago ... and my sign! __HTTP__ _E_\n#TBT At the US Open Tennis Tournament with @EricTrump see same hairstyle! __HTTP__ _E_\nVia @DMRegister :\"@brentroske on Politics: Trump Talks Iowa\" __HTTP__ _E_\nEnd the Democrats Obstruction! __HTTP__ _E_\nTRUMP APPROVAL HITS 50% __HTTP__ _E_\nMany of Hillary's donors are the same donors as Jeb Bush's—all rich will have total control—know them well. _E_\nMy @HollywoodLife interview w/ @MELANIATRUMP discussing her debut on @ApprenticeNBC & her skin care line __HTTP__ _E_\n#MakeAmericaSafeAgain __HTTP__ _E_\nCrude is at $100/Barrel. With the current state of the world economy how is that possible? OPEC is ripping of... (cont) __HTTP__ _E_\nNo wonder boxing is close to dead! _E_\nThank you @billoreilly & @KarlRove. Ted Cruz should be immediately disqualified in Iowa with each candidate moving up one notch. _E_\nOur biggest problems are solved by growth. We need a President who is a job creator. Let's Make America Great Again! __HTTP__ _E_\nJodi thought she outsmarted the system it didn't work! Congratulations to the jury on a job well done! Now will it be life or death? _E_\nWell now they're saying that I not only won the NBC Presidential Forum but last night the big debate. Nice! 
_E_\nYoung Entrepreneurs – the Holiday season is here but that is no excuse not to stay on top of your business prospects. Focus! _E_\nConsumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_\nBy @kwrcrow: \"NY Post caught 'LYING' Again!\" __HTTP__ The Donald\" should go far. Actually if I run I'll win. _E_\nThe resolution being considered at the United Nations Security Council regarding Israel should be vetoed....cont: __HTTP__ _E_\nObama attacks the CIA for waterboarding while routinely droning civilians caught in the Islamist crosshairs. _E_\nIf dopey Mark Cuban of failed Benefactor fame wants to sit in the front row perhaps I will put Gennifer Flowers right alongside of him! _E_\nObama can release 5 senior Taliban for a deserter but can't make Mexico release decorated Marine Sgt. Andrew Tahmooressi. Pathetic _E_\nRomney's campaign is being put on the defensive. He cannot let this happen. Stop pandering. Must get tougher (cont) __HTTP__ _E_\nDo people notice Hillary is copying my airplane rallies she puts the plane behind her like I have been doing from the beginning. _E_\nWe need much tougher much smarter leadership and we need it NOW! _E_\nJohn Cahill is highly respected in all circles—really nice to see that he's running for New York State Attorney General. @CahillForAG _E_\nI won every poll from last nights Presidential Debate except for the little watched @CNN poll. _E_\nBORDER WALL prototypes underway! __HTTP__ _E_\n#CelebrityApprentice It's good to have Jack back too with @marleematlin. He's become a star. #sweepstweet _E_\n\"Always remember that the future comes one day at a time.\" Dean Acheson _E_\nDenzel Washington gave a wonderful commencement speech over the weekend. From the heart! _E_\nShow me someone without an ego and I'll show you a loser having a healthy ego or high opinion of yourself is a real positive in life! _E_\nI love America. And when you love something you protect it passionately fiercely even. 
#TimeToGetTough (cont) __HTTP__ _E_\nMy use of social media is not Presidential it's MODERN DAY PRESIDENTIAL. Make America Great Again! _E_\n#MakeAmericaGreatAgain #Trump2016Video: __HTTP__ __HTTP__ _E_\n.@JebBush is totally lost he spends too much time managing the bloated staff of his campaign & not enough talking about America's future. _E_\nThe Oscars were a great night for Mexico & why not—they are ripping off the US more than almost any other nation. _E_\n\"Every strike brings me closer to the next home run.\" Babe Ruth _E_\nLearn from yesterday live for today hope for tomorrow. The important thing is not to stop questioning. Albert Einstein _E_\nWow Obama really put it to Israel by canceling flights there. This puts them at a tremendous disadvantage. Tourism and more will just stop. _E_\nI want to help our miners while the Democrats are blocking their healthcare. _E_\n.@oreillyfactor why don't you have some knowledgeable talking heads on your show for a change instead of the same old Trump haters. Boring! _E_\nTom Brady is a good friend of mine a great player a great guy and a total winner! Fantastic comeback win this is what our country needs! _E_\nWe have given Syria so much time and information there has never been such an instance in wartime history. Syria is now fully prepared! _E_\nVery sad that Republican donors were targeted by Obama's IRS. _E_\nVia @Reuters by @sumeet_chat: \"Donald Trump plans investment in India betting on Modi government\" __HTTP__ _E_\n.@CelebApprentice having \"top brand impact 2012\" ahead of Idol Survivor X Factor & all others has caused quite a stir no surprise! _E_\nIn other words Russia was against Trump in the 2016 Election and why not I want strong military & low oil prices. Witch Hunt! __HTTP__ _E_\nI have always had a good relationship with Chuck Schumer. He is far smarter than Harry R and has the ability to get things done. Good news! _E_\nThank you Arlene! We will MAKE AMERICA SAFE AND GREAT AGAIN! 
#ImWithYou #DrainTheSwamp __HTTP__ _E_\nWith Luis Mexico and the United States would have made wonderful deals together where both Mexico and the US would have benefitted. _E_\nJoin me tomorrow in Dubuque Iowa! #IACaucus #Trump2016 __HTTP__ _E_\nEverybody is laughing at Jeb Bush spent $100 million and is at bottom of pack. A pathetic figure! _E_\n.@TrumpNationalNY is NY's best golf club. A 5 Star Diamond Award winner w/ an elite golf course & top facilities __HTTP__ _E_\nHeading to Alabama now big crowd! _E_\nJust out according to @CNN: Utah officials report voting machine problems across entire country _E_\nWhy did Clinton supporter @AlisonForKY declare Crooked Hillary winner in KY when AP hasn't even called the race? _E_\nThank you Michigan! #VoteTrumpMITrump 35%Kasich 17%Cruz 12%Rubio 12%Carson 9% Via: ARG _E_\nThank you. __HTTP__ _E_\nThank you Fort Wayne Indiana!#Trump2016 #INPrimary __HTTP__ _E_\nNewsstand sales for @VanityFair run by sleepy Graydon Carter are down almost 20%. All he cares about are his bad food restaurants! _E_\nJust returned from Asia after 12 very successful days. Great to be home! _E_\nWow even I didn't realize we did so much. Wish the Fake News would report! Thank you. __HTTP__ _E_\nRT @WhiteHouse: FACT: when #Obamacare was signed CBO estimated that 23M would be covered in 2017. They were off by 100%. Only 10.3M people... _E_\nPeople who have the ability to work should. But with the government happy to send checks too many of them don't. #TimeToGetTough _E_\nVattenfall CEO stated that the company needed to prepare itself for falling electricity demands in coming years a changing market. _E_\nCongratulations to @MichaelPhelps on concluding the greatest Olympic career ever. You have made us all very proud. _E_\nThe new winner of the @MissTeenUSA pageant K. Lee Graham __HTTP__ _E_\nUS job cuts jumped 53% in May from April __HTTP__ This is the Obama recovery? 
_E_\nWith all of the words President Obama just dispensed at his press conference he didn't say what we all want to hear I'LL STOP THE FLIGHTS _E_\nAmong the lowest temperatures EVER in much of the United States. Ice caps at record size. Changed name from GLOBAL WARMING to CLIMATE CHANGE _E_\nAmnesty is suicide for Republicans.Not one of those 12 million who broke our laws will vote Republican.Obama is laughing at @GOP. _E_\nUNBELIEVABLE!Clinton campaign contractor caught in voter fraud video is a felon who visited White House 342 times: __HTTP__ _E_\nCrime is out of control and rapidly getting worse. Look what is going on in Chicago and our inner cities. Not good! _E_\nProcter and Gamble is relocating its beauty headquarters from Cincinnati to Asia what are we doing?! _E_\n95% of Americans will pay less or at worst the same amount of taxes (mostly far less). The Dems only want to raise your taxes! _E_\nToday is referendum on ObamaCare Amnesty slow growth having your healthcare dropped & all the other lies. _E_\nJoin me live in Waukesha Wisconsin for an 8pmE rally! #AmericaFirst #MAGA __HTTP__ _E_\nPresident Obama should stay out of the Hong Kong protests we have enough problems in our own country!Can't even properly police White House _E_\nAll the haters and losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_\nFLORIDA Just like TX WE are w/you today we are w/you tomorrow & we will be w/you EVERY SINGLE DAY AFTER to RESTORE RECOVER & REBUILD! __HTTP__ _E_\nWhether @RepPaulRyan's plan is sound fiscal policy is not the relevant issue the issue is strategic timing. Why release it now? _E_\nI applaud Columbia South Carolina for cleaning up biz center __HTTP__ Will cut crime & advance commerce. _E_\nWhether you like it or not Bush also gave us Obama! _E_\nIt was 25 years ago today that Pan Am flight 103 was downed by a terrorist killing 270 innocent people. @AlexSalmond released the terrorist! 
_E_\nTHE HILL'S TWITTER ROOM: Trump: Spitzer Weiner turning New York into 'pervert central' __HTTP__ _E_\nRT @DanScavino: Join @realDonaldTrump on his official social media platforms during tonight's debate ~ as @TeamTrump manages rapid response... _E_\nBring in 2014 @TrumpSoHo's NYE soireé NYC's most exclusive New Year's Eve Party w/SoHi & @VeuveClicquot __HTTP__ _E_\nI have a surprise for a really special kid on Thursday's episode of @KatieShow with @KatieCouric: __HTTP__ _E_\nThe Wall is the Wall it has never changed or evolved from the first day I conceived of it. Parts will be of necessity see through and it was never intended to be built in areas where there is natural protection such as mountains wastelands or tough rivers or water..... _E_\nCrude is about to pass $90/barrel. The OPEC monopoly must be broken. They are robbing our country blind. _E_\nTed Cruz along with Jeb Bush pushed Justice John Roberts onto the Supreme Court. Roberts could have killed ObamaCare twice but didn't! _E_\nMy thoughts and prayers go out to the @PhillyPolice & @Penn police officers in Philadelphia. __HTTP__ _E_\n....because he doesn't even live there! He wants to raise taxes and kill healthcare. On Tuesday #VoteKarenHandel. _E_\nThank you! #AmericaFirst __HTTP__ _E_\nVery little discussion of all the purposely false and defamatory stories put out this week by the Fake News Media. They are out of control correct reporting means nothing to them. Major lies written then forced to be withdrawn after they are exposed...a stain on America! _E_\nRosie O'Donnell went after me again on The View in order to stir up her failing ratings. Nothing will help her @Rosie always fails. _E_\nObama's VA Secretary just said we shouldn't measurewait times. Hillary says VA problems are not 'widespread.' I will take care ofour vets! _E_\nRemember I'll see you in D.C. at the Capitol Building on Wednesday at 1:00 o'clock. Then Dallas on Sept.14 at 6:00 P.M. 
American Air Center _E_\nEntrepreneurs: Another question to ask yourself—\"What am I pretending not to see?\" There may be great opportunities right around you. _E_\nThank you Clive Iowa! __HTTP__ _E_\nWeekly jobless claims soared to 21.5% a 6 month high __HTTP__ ObamaCare the greatest job killer in US history. _E_\nJeb is spending millions of dollars on \"hit\" ads funded by lobbyists & special interests. Bad system. _E_\nI find that @Reuters is a far more professional operation than @AP. _E_\nFord is MOVING jobs from Michigan to Mexico AGAIN! __HTTP__ As President this will stop on Day One! Jobs will stay here. _E_\nDespite some very corrupt and dishonest media coverage there are many great reporters I respect and lots of GOOD NEWS for the American people to be proud of! _E_\nCongratulations to @NewYorkObserver on celebrating its 25 year anniversary. Great paper under amazing management! _E_\nHey @realjeffreyross @whitney cummings @lisalampanelli: you call yourselves comedians? #TrumpRoast tonight 10:30/9:30c on @ComedyCentral. _E_\nAlert US jobless claims up 46000 to 388000. Really bad news. 7.8% is now a fraud not possible! _E_\nThe new Congress must restore military spending & stop Obama budget cuts. Also hold Obama accountable on the VA. _E_\nSo why aren't the Committees and investigators and of course our beleaguered A.G. looking into Crooked Hillarys crimes & Russia relations? _E_\nCongratulations to @CNN for having the wisdom to pick TRUMP! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nMy review of #TheDarkKnightRises and more in today's #trumpvlog __HTTP__ _E_\nWhy is it that the Fake News rarely reports Ocare is on its last legs and that insurance companies are fleeing for their lives? It's dead! _E_\nSee you at 7:00 P.M. tonight Phoenix Arizona! #MAGATickets: __HTTP__ __HTTP__ _E_\nI hope Newtown CT can now start to heal—but it won't be easy! 
_E_\nI don't hate Obama at all I just think he is an absolutely terrible president maybe the worst in our history! _E_\nThe situations in Tulsa and Charlotte are tragic. We must come together to make America safe again. _E_\nWithout passion you don't have energy and without energy you have nothing! Just one more of my totally brilliant quotes use it well. _E_\n.@BillKristol Bill your small and slightly failing magazine will be a giant success when you finally back Trump. Country will soar! _E_\nDopey @billmaher still owes me $5M for charity. I hope he pays up before @hbo fires him which will happen! _E_\nBut why shouldn't I speak out? Don't you speak out in this country? George Steinbrenner _E_\nTHANK YOU to all of the great volunteers helping out with #HurricaneHarvey relief in Texas! __HTTP__ _E_\nBe careful of an Obama bomb to win election! Would be a horrible thing to do. _E_\nWe cannot continue to let Israel be treated with such total disdain and disrespect. They used to have a great friend in the U.S. but....... _E_\nHighly respected economist @Larry_Kudlow is a big fan of my tax plan—thank you Larry. __HTTP__ _E_\nTHE LAST THING THIS COUNTRY NEEDS IS ANOTHER BUSH! _E_\nDoesn't help Kasich to do negative ads on me because he still has to go through everyone else he's almost last. _E_\n.@FLGovScott can create tens of thousands of jobs by approving casinos in Miami it's time. @willweatherford _E_\nA story in the @washingtonpost that I was close to \"rescinding\" the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources don't exist! _E_\nIf Alison Grimes can't admit she voted for Obama even if she is embarrassed then you can't trust her! Vote @Team_Mitch! _E_\nThe Mullahs laughed when @BarackObama asked Iran to return our drone they will show it to China first. 
_E_\n.@danawhite You have done an amazing job I am proud to have been there at the very beginning! _E_\nTo the people of Puerto Rico:Do not believe the #FakeNews!#PRStrong _E_\nIsrael Saudi Arabia and the Middle East were great. Trying hard for PEACE. Doing well. Heading to Vatican & Pope then #G7 and #NATO. _E_\n.@MittRomney looks much calmer and Obama should stop nodding his head backwards and forward. _E_\nCongressional Black Caucus Chairman Emanuel Cleaver is right. @BarackObama's budget is a nervous breakdown on paper. __HTTP__ _E_\nAs an honorary Buckeye I want to thank the OH GOP primary voters for putting @MittRomney over the top. It was a crucial win. _E_\nWe will bring back our jobs. We will bring back our borders. We will bring back our wealth and we will bring back our dreams! _E_\n.@JordanSpieth Great playing at the Masters and don't get down Jordan you will win many tournaments and many MAJORS! Keep working hard. _E_\nBill Clinton is right: Obamacare is 'crazy' 'doesn't work' and 'doesn't make sense'. Thanks Bill for telling the truth. _E_\nBernie Sanders on HRC: Bad Judgement. John Podesta on HRC: Bad Instincts. #BigLeagueTruth #Debate _E_\nWatched Gennady Golovkin @gggboxing at MSG on Saturday night. He was fantastic should fight @FloydMayweather! _E_\nHe @BarackObama invited his top campaign bundlers and donors to the British State Dinner __HTTP__ So corrupt! _E_\nThehas great strength & patience but if it is forced to defend itself or its allies we will have no choice but to totally destroy #NoKo. __HTTP__ _E_\nThe Democrats have no message not on economics not on taxes not on jobs not on failing #Obamacare. They are only OBSTRUCTIONISTS! _E_\nVia @gulf_news by @JoeHeim: \"@IvankaTrump: Giving back is a priority for me\" __HTTP__ _E_\nThe Blue Monster is celebrated in June issue of Robb Report as the Best of the Best winner in Golf Course Category. __HTTP__ _E_\nNew Day on CNN treats me very badly. @AlisynCamerota is a disaster. 
Not going to watch anymore. _E_\nI bet the terrorists in Libya used weapons we supplied them during their so called 'revolution' to attack our embassy in Benghazi. _E_\nGlad the Trans Pacific Partnership failed in the Senate. Bad deal for American worker & economy! We need SMART TRADE! __HTTP__ _E_\nIn America we don't worship government we worship God. #ValuesVotersSummit __HTTP__ _E_\nHillary's wars in the Middle East have unleashed destruction terrorism and ISIS across the world. _E_\nThis was the reporters statement when she found out there was tape from my facility she changed her tune. __HTTP__ _E_\n#ChrisWallace who interviewed me on Sunday had his highest ratings since Feb of '09. Congratulations! __HTTP__ _E_\nNew Poll Shows Donald Trump Blowing Everyone Else Out of the Water. __HTTP__ _E_\nThank you Tucson Arizona! A great afternoon with 6000 supporters! #VoteTrump on Tuesday!#MakeAmericaGreatAgain __HTTP__ _E_\nThe FBI is totally unable to stop the national security leakers that have permeated our government for a long time. They can't even...... _E_\nTo be in charge you have to take responsibility you have to instill confidence. It's like being a conductor set the tempo. _E_\n\"The harder you work the luckier you get.\" Gary Player _E_\nListen and learn from others but make your own decisions. Use your instincts you alone know where you want to go. _E_\nMy successful acquisition of the Kluge estate was a fantastic deal which is already being studied in business schools. _E_\nWhy doesn't Obama let our marines who are guarding the embassies in Egypt have live ammunition? They need it fast. _E_\nRT @NRCC: Good to hear @realDonaldTrump is on board.GOP is the party of free enterprise.Join us as we innovate: __HTTP__ _E_\nThe Yuan hit another record high against the Dollar. China is laughing at our expense. _E_\nI want all Americans to succeed together. President Obama's illegal executive amnesty undermines job prospects for... 
__HTTP__ _E_\nIowa Congressman @SteveKingIA has endorsed the Newsmax @iontv debate. He has been doing great work in the House. _E_\nI see Marco Rubio just landed another billionaire to give big money to his Superpac which are total scams. Marco must address him as SIR ! _E_\nAll the governors are already backing off of the Ebola quarantines. Bad decision that will lead to more mayhem. _E_\n92 year old registers to vote for first time says will vote for Trump __HTTP__ _E_\nSo wrong! @BarackObama is hosting China's VP Xi Jinping today at the Pentagon with a full honor ceremony with music and cannons... _E_\n\"Trump: 'Never Give Up' on Farmland Value Rally\" __HTTP__ @TerryBranstad @KimReynoldsIA @ChuckGrassley @SenJoniErnst @BNorthey _E_\nWatched protests yesterday but was under the impression that we just had an election! Why didn't these people vote? Celebs hurt cause badly. _E_\nLet's not get too excited about Monday's U.S. Supreme Court oral argument on #ObamaCare before the decision. No (cont) __HTTP__ _E_\nVia @RoyalOakPatch: Oakland County High Schoolers Have Chance to Win $1000 Scholarship & Meet Donald Trump __HTTP__ _E_\nWow 15 policemen hurt in Baltimore some badly! Where is the National Guard. Police must get tough and fast! Thugs must be stopped. _E_\nThank you Concord North Carolina! When WE win on November 8th we are going to Washington D.C. and we are going t... __HTTP__ _E_\n...Mexico cannot believe what they are getting away with and have absolutely no respect for our leader. _E_\n.@bigstack19 @realDonaldTrump Does anyone actually read Rolling Stone anymore? Guess they had to create (cont) __HTTP__ _E_\nLance Armstrong was given veryvery bad advice! _E_\nPresident Obama should have gone to Louisiana days ago instead of golfing. Too little too late! _E_\nWhile @BarackObama spends recklessly on domestic projects he is hollowing out our military with over $487B in cuts __HTTP__ _E_\nNBC News just called it the great freeze coldest weather in years. 
Is our country still spending money on the GLOBAL WARMING HOAX? _E_\nUnlike crooked Hillary Clinton who wants to destroy all miners I want wages to go up in America. We will do so by bringing back jobs! _E_\n\"Peace is not absence of conflict it is the ability to handle conflict by peaceful means.\" – Pres. Ronald Reagan _E_\nJust arrived in Cleveland Ohio join Governor @Mike_Pence and I now LIVE via: __HTTP__ _E_\nTrump Puerto Rico is 1st development in Puerto Rico to combine lavish residences world class golf & a beach __HTTP__ _E_\nResolve to be bigger than your problems. Who's the boss? Realize that fear is the exact opposite of faith. _E_\nI am so proud of my daughter Ivanka. To be abused and treated so badly by the media and to still hold her head so high is truly wonderful! _E_\nGovernor @Mike_Pence and I will be in Cleveland Ohio tomorrow night at 7pm join us! #MAGATickets:... __HTTP__ _E_\nI will be on @meetthepress this morning at various times across the U.S. @NBCNews Enjoy! _E_\nTrump Doral's renovations are right on schedule __HTTP__ Once completed it will be the top resort in the U.S. _E_\nWhy does @BarackObama support the radical Islamists in Egypt protests yet has such a high disregard for the Tea Party? _E_\nRT @DonaldJTrumpJr: Ironic since Hillary has gotten a lot more of that dark unaccountable money into her campaign. #debates _E_\nWatch CNN tomorrow at 2 pm & 5 pm and on Friday at 7 pm & 11 pm for a Thanksgiving Special hosted by John King. I'll be a featured guest. _E_\nWhy are some more concerned with granting terrorist rights than protecting innocent Americans? _E_\nIt has been great to meet so many wonderful people in my #TimeToGetTough book signings. Anyone who wants to be Prez should read! _E_\nHow bad is the New York Times—the most inaccurate coverage constantly. Always trying to belittle. Paper has lost its way! _E_\n.@billmaher: Bill you are really beginning to understand what is going on with Trump actually you always knew! 
_E_\nWow it's now official. ObamaCare website has topped $1B __HTTP__ Will soon be up to $1.5B _E_\nICYMI my speech from this past Saturday at the @NHGOP @FITNsummit via @cspan __HTTP__ _E_\nI predicted Rosie O'Donnell would fail at the View and was right. Now I predict Rosie will take over for Brian Williams! _E_\nI will be the greatest job producing president in American history. #Trump2016 #VoteTrump __HTTP__ __HTTP__ _E_\nHeading to Manassas Virginia for a rally. Will have a moment of silence for the victims of the California shootings. So sad! _E_\nLive Free or Die: A motto for the whole country to follow. #NewHampshire #FITN #VoteTrumpNH __HTTP__ _E_\nIraq has granted Iran full air rights to fly over and arm Syria. What did America accomplish with the Iraq war? And now Syria?! _E_\nA testament to American ingenuity @TrumpTowerNY shines over Fifth Avenue as one of NYC's most iconic buildings __HTTP__ _E_\nYouth unemployment is at a record high. ObamaCare is a job destroyer which is ruining aspiring careers. It must be repealed. _E_\nNew PPP poll just released in Iowa up 6 points from last poll. Leading w/ 28%! Don't worry media won't report it! __HTTP__ _E_\nI wish good luck to all of the Republican candidates that traveled to California to beg for money etc. from the Koch Brothers. Puppets? _E_\nSpitzer never made 10 cents on his own he worked for his very rich father (a friend of mine who never thought much of Eliot as a businessman _E_\nThanks many are saying I'm the best 140 character writer in the world. It's easy when it's fun. _E_\n#trumpvlog My thoughts on @RickSantorum in today's video blog... __HTTP__ _E_\nThere's nothing \"compassionate\" about allowing welfare dependency to be passed from generation to generation. Time To Get Tough _E_\nHonored to meet w/ Pres Abbas from the Palestinian Authority & his delegation who have been working hard w/everybody involved toward peace. __HTTP__ _E_\nLooking forward to visiting Mason City Iowa tomorrow. 
Will be my 8th day in the Hawkeye State this year! __HTTP__ _E_\nWhat Barbara Res does not say is that she would call my company endlessly and for years trying to come back. I said no. _E_\nNobody cares about the Iowa straw poll is what @JonHuntsman said yesterday. His problem is that nobody cares about his campaign (or him). _E_\nHope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man respected by all. _E_\n.@WineEnthusiast just awarded Trump Vineyard's Sparkling Reserve 91 points the highest rated wine in Virginia... __HTTP__ _E_\nMerry Christmas have an amazing day! _E_\n.@DineshDSouza had to give $1000 to @BarackObama's brother for his child's hospital bill __HTTP__ Isn't that disgraceful? _E_\nRussia talk is FAKE NEWS put out by the Dems and played up by the media in order to mask the big election defeat and the illegal leaks! _E_\nRemember while @BarackObama is lauding himself tonight with self indulgent compliments we have our brave soldiers fighting in Afghanistan. _E_\n.@CNN should listen. Ana Navarro has no talent no TV persona and works for Bush—a total conflict of interest. __HTTP__ _E_\nJust as I have been saying for MANY years and while they phony negotiate with the U.S. over nuclear Iran is taking over Iraq. Really sad! _E_\nAt some point the Fake News will be forced to discuss our great jobs numbers strong economy success with ISIS the border & so much else! _E_\nHe @MittRomney is a successful entrepreneur. @BarackObama successfuly ruined America's credit. Easy choice in November. _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\n.@Chrysler disputes my statement but watch Chrysler move @Jeep jobs to China after the election. _E_\nNothing is so permanent as a temporary government program. Milton Friedman _E_\nOur Native American Senator goofy Elizabeth Warren couldn't care less about the American worker...does nothing to help! _E_\nThe American worker is being victimized by our trade policies. 
We need smart trade which can only be accomplished by smart dealmakers. _E_\nIncredible handheld video of the Las Vegas Strip in 1969. The skyline looks better with @TrumpLasVegas! __HTTP__ _E_\nSenator Ted Cruz has been MATHEMATICALLY ELIMINATED from race. He said Kasich should get out for same reason. I think both should get out! _E_\nI have no greater privilege than to serve as your Commander in Chief. HAPPY BIRTHDAY to the incredible men and women @USNavy!#242NavyBday __HTTP__ _E_\nBarney Frank admited that ObamaCare does have 'death panels' yesterday. Obamacare must be fully repealed or healthcare will be destroyed. _E_\nRT @DRUDGE_REPORT: TRUMP APPROVAL HITS 50% __HTTP__ _E_\nRemember if you don't pat yourself on the back nobody else will. Take credit for your successes and don't let others forget!!!!!! _E_\nWhy isn't the @GOP congress doing everything possible to defund and cut ObamaCare? _E_\nThese are something I just can't buy. Excited for the @usopen __HTTP__ _E_\nWe must stand firm against the UN's ploy to sabotage Israel if the UN grants the PA statehood then we must immediately defund it. _E_\nThe state of Virginia economy under Democrat rule has been terrible. If you vote Ed Gillespie tomorrow it will come roaring back! _E_\nI've learned that mistakes can often be as good a teacher as success. Jack Welch _E_\nWhy does @oreillyfactor and @FoxNews always have Karl Rove on. He spent $430 million and lost ALL races. A dope who said Romney won election _E_\nKey Obamacare premiums to jump 25% next year: __HTTP__ _E_\nThe Republican Party needs strong and committed leaders not weak people such as @JeffFlake if it is going to stop illegal immigration. _E_\nThe @washingtonpost which is the lobbyist (power) for not imposing taxes on #Amazon today did a nasty cartoon attacking @tedcruz kids. 
Bad _E_\nUnfortunately with some men when the poison kicks in (not me of course) there are no rules or guidelines in the military that will stop them _E_\nThank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day I'll never forget. __HTTP__ _E_\nI am now in Palm Beach Florida and will be going to church tonight. MAKE AMERICA GREAT AGAIN! _E_\nGood morning. I will be on Fox and Friends at 7.00 (30 minutes). Enjoy! @foxandfriends _E_\nThis is a great time for @RickSantorum to bow out with dignity. _E_\n. @OMAROSA is smart and strategic. People should cut her some slack and respect the way she works on @ApprenticeNBC. _E_\nMy @SquawkCNBC interview discussing last night's presidential debate my stock picks and tomorrow's big announcement __HTTP__ _E_\nGreat day for America's future Security and Safety courtesy of the U.S. Supreme Court. I will keep fighting for the American people & WIN! _E_\nI am happy to have started #ObamasFavoriteCharity. Really enjoying reading everyone's tweets. _E_\nEntrepreneurs: Realize that fear is the exact opposite of faith. Resolve to be bigger than your problems. Who's the boss? _E_\nMy interview with @RealMichaelKay discussing why A Rod should be fired from @yankees & how to terminate his contract __HTTP__ _E_\nAll seven on line polls including Drudge and Time with thousands of respondents said I won the debate. @krauthammer said I was so so. _E_\nThe ultra liberal and seriously failing Des Moines Register is BEGGING my team for press credentials to my event in Iowa today but they lie! _E_\nHow did you like Michelle Obama's bangs last night? _E_\nIMPORTANT @RepMattSalmon & @RepEdRoyce will hold a hearing on Oct. 1w/USMC Sgt. Tahmooressi's mother & wife __HTTP__ _E_\nWe should be able to negotiate a deal with Iran because they know we could blow them away to the Stone Age.They just don't believe we would. _E_\nA big POLL will be announced this morning on @CBSNews Face The Nation. 
I wonder if I do well if the press will report the results? Doubt it _E_\nI really enjoyed the debate tonight even though the @FoxNews trio especially @megynkelly was not very good or professional! _E_\nRalph Northamwho is running for Governor of Virginiais fighting for the violent MS 13 killer gangs & sanctuary cities. Vote Ed Gillespie! _E_\nThe Trump base is far bigger & stronger than ever before (despite some phony Fake News polling). Look at rallies in Penn Iowa Ohio....... _E_\nVia @scj by @rodboshart: \"Trump: Next president has to be 'great one'\" __HTTP__ _E_\nSurprise @oreillyfactor used my name big league in pre ads to promote the show—then talked about everyone else but me! _E_\nMeet the 'Trumpocrats': Lifelong Democrats Breaking w/ Party Over Hillary to Support Donald Trump for President: __HTTP__ _E_\nYesterday 15 @GOP senators sided with people who got into this country by breaking our laws. _E_\nVery proud of our incredible First Lady (@FLOTUS.) She is a truly great representative for our country! __HTTP__ _E_\nPeople in our country want borders and without them the old line pols like Crooked Hillary will not win. It is time for CHANGE and JOBS! _E_\n#TrumpAdvice __HTTP__ _E_\nWe're all proud of @erictrump for being on @Forbes 30 Under 30 list. __HTTP__ _E_\n.@SenJohnMcCain Thank you for coming to D.C. for such a vital vote. Congrats to all Rep. We can now deliver grt healthcare to all Americans! _E_\nGeneral Keith Kellogg who I have known for a long time is very much in play for NSA as are three others. _E_\nduring a general election. I for one am appalled that somebody that is the nominee of one of our two major parties would take that kind _E_\nP.S. There is also something really good to say about humility. Being confident and humble is a great combination maybe the best of all! _E_\nI just arrived in Barcelona. I make a big speech tomorrow and then off to Ireland and Scotland. 
_E_\nVia The Hill: Trump Tops National Poll for Second Straight Week __HTTP__ _E_\nWill be on Howard Stern at 6.45 A.M. and the Today Show at 8.00 A.M. _E_\nHow is it possible that the people of the great State of Colorado never got to vote in the Republican Primary? Great anger totally unfair! _E_\nWow Matt Lauer was just fired from NBC for \"inappropriate sexual behavior in the workplace.\" But when will the top executives at NBC & Comcast be fired for putting out so much Fake News. Check out Andy Lack's past! _E_\nVia @GolfDigest by @LukeKerrDineen: \"@MichaelBreed to open golf academy at Donald Trump's @TrumpFerryPoint\" __HTTP__ _E_\nCrooked Hillary Clinton is totally unfit to be our president really bad judgement and a temperament according to new book which is a mess! _E_\nHow can the economy ever recover when @BarackObama keeps threatening the private sector with more taxes. This is no way to spur growth. _E_\nWatch Face The Nation will be on now! _E_\nFor the truth about job creation in America go to __HTTP__ A great site for employers to get the tools & information they need! _E_\n\"Successful people keep moving. They make mistakes but they don't quit.\" – Conrad Hilton _E_\n#TimeToGetTough The crowd at the book signing at Trump Tower in NYC right now... __HTTP__ _E_\nIsn't it amazing that Obama \"never knew\" about the IRS scandals until he saw it in the news?! _E_\nHas AG Schneiderman been extorting his targets and their lawyers for contributions? We will find out. _E_\nReminder: The Miss Universe competition will be LIVE from the Bahamas Tonight @ 9pm (EST) on NBC: __HTTP__ _E_\nCongratulations to @GatewayPundit on being named the #ROL15 @BreitbartNews award. Well earned & well deserved! _E_\nThe bigger problem with Ebola is all of the people coming into the U.S. from West Africa who may be infected with the disease. STOP FLIGHTS! _E_\nI want to negotiate my own and much better trade deals for our country. MUST INCLUDE CURRENCY MANIPULATION (and more). 
DO NOT LET PASS! _E_\nI find it offensive that Goofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be Native American to get in Harvard. _E_\nWith championship links @TrumpScotland's world class amenities also include dining & luxury accommodations __HTTP__ _E_\n\"If you know the enemy and know yourself you need not fear the results of a hundred battles.\" Sun Tzu _E_\nFear defeats more people than any other one thing in the world. Ralph Waldo Emerson _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe best way to build a successful business is by results. In the end that is what counts. _E_\nProduct integration is very important. #CelebApprentice _E_\nI only wish my wonderful daughter Tiffany could have been with us at Mar a Lago for our great election victory. She is a winner! _E_\nOn schedule for 2016 completion @trumpvancouver's 57 story twisting tower will be the icon of Vancouver's skyline __HTTP__ _E_\nMy interview with @seanhannity discussing this season's @ApprenticeNBC #TimeToGetTough the economy and GOP primary. __HTTP__ _E_\nIcahn Kravis Zell Buffett have all used the bankrutcy law to their benefit. Many of the top business people do. _E_\nThe EU just dropped their self imposed carbon tax. I bet they wish they had all that money back! _E_\nWow @FoxNews just reporting big news. Source: Official behind unmasking is high up. Known Intel official is responsible. Some unmasked.... _E_\nRT @IvankaTrump: I have long respected India's accomplished and charismatic Foreign Minister @SushmaSwaraj and it was an honor to meet her... _E_\nTrump Golf Links at Ferry Point: Grand Opening next Tuesday May 26th at 11 AM. Jack Nicklaus will be joining me. __HTTP__ _E_\nGood news for those that want to Make America Great Again I am winning every poll in every STATE and NATIONAL and by big numbers! Thanks _E_\nOver the past 11 months I have travelled tens of thousands of miles to visit 13 countries. 
I have met with more than 100 world leaders and everywhere I traveled it was my highest privilege and greatest honor to represent the AMERICAN PEOPLE! __HTTP__ _E_\nThe perfect getaway @Trump_Ireland is Europe's most elite 5 star destination perfected with old world luxury __HTTP__ _E_\nControl your own destiny or someone else will. @jack_welch _E_\nOnly three weeks until the new season of @CelebApprentice begins filming great all star cast. _E_\nThe Trump Organization is honored to have been awarded the redevelopment of The Old Post Office. Will be DC's finest hotel. _E_\nObama is a disaster at foreign policy. Never had the experience or knowledge. He is not capable of doing the job. _E_\nRT @AdrianaCohen16: Carly Fiorina no lifeboat for a fast sinking @tedcruz campaign __HTTP__ via @bostonherald @realdonaldtru... _E_\nVia @WSJ: \"A New Direction For America\" by @MittRomney _E_\n\"The risk of a wrong decision is preferable to the terror of indecision.\" – Maimonides _E_\nDo you think John Kerry is aware of the fact that they are building nuclear weapons in Iran and North Korea and Pakistan already has them!! _E_\nFellow inductee @SammartinoBruno and me. #WWEHOF __HTTP__ _E_\nWhile @BarackObama is slashing the military he is also negotiating with our sworn enemy the Taliban who facilitated 9/11. _E_\nCongratulations to two great and hardworking guys Corey Lewandowski and David Bossie on the success of their just out book \"Let Trump Be Trump.\" Finally people with real knowledge are writing about our wonderful and exciting campaign! _E_\nWhy would the people of Florida vote for Marco Rubio when he defrauded them by agreeing to represent them as their Senator and then quit! _E_\nLooking forward to addressing the record setting crowd tonight at the New York County Lincoln Day Dinner. Lots to talk about! _E_\nThank you @davidaxelrod for your nice words this morning on @CNN. It was a good night! 
_E_\nBeautiful @MissUSA in @NewYorkPost tomorrow as Audrey Hepburn in front of Tiffany's. _E_\n...Trump/Russia story was an excuse used by the Democrats as justification for losing the election. Perhaps Trump just ran a great campaign? _E_\nWOW @foxandfrlends \"Dossier is bogus. Clinton Campaign DNC funded Dossier. FBI CANNOT (after all of this time) VERIFY CLAIMS IN DOSSIER OF RUSSIA/TRUMP COLLUSION. FBI TAINTED.\" And they used this Crooked Hillary pile of garbage as the basis for going after the Trump Campaign! _E_\nDopey Sugar @Lord_Sugar—you are the worst kind of loser—a total fool. _E_\nSo @BarackObama's campaign is calling @MittRomney a potential criminal __HTTP__ How about Obama's Tony Rezko land deal! _E_\nIsn't it funny when a failed Senator like goofy Elizabeth Warren can spend a whole day tweeting about Trump & gets nothing done in Senate? _E_\nExcited to be returning to the @NCGOP State Convention as the Keynote of Saturday's dinner! @NCGOP is a strong Conservative state party! _E_\nRe Miss Universe Pageant we've spoken w/the LGBT community in Russia who asked \"please don't leave it would send the wrong signal.\" _E_\n.ccolvinj @AP is one of the truly bad reporters working for an organization that has totally lost its way. Stories are fictional garbage. _E_\nVia @theFAMiLYLEADER: \"Donald Trump to Speak at The Family Leadership Summit\" __HTTP__ Get tix __HTTP__ _E_\nOur very stupidly run Country better stop being so politically correct or we won't have a Country to run anymore! _E_\nI told you so a long time ago: Iraq just lost second largest city as their soldiers drop their guns and run. Only the beginning! OIL. _E_\nDid you ever see a situation so ridiculous as our President explaining what when and where to Congress about a Syrian attack. Far too late! 
_E_\n\"Donald Trump on VA woes: 'I'd fire everybody' 'you fix it by getting Trump elected'\" __HTTP__ via @washtimes by @dsherfinski _E_\nVia @HeraldWeekly by Lauren Odomirok: Trump Norman play renovated golf course __HTTP__ _E_\nWhen nobody wanted the UFC I opened the way by letting them fight at the Trump Taj Mahal in Atlantic City. Dana White has done a great job! _E_\nThank you California! #Trump2016 __HTTP__ __HTTP__ _E_\nHave you seen the new #TRUMP line of clothing apparel and fragrances @Macy's? Selling like hotcakes. Great for Christmas gifts etc. _E_\nChina is happy to learn that @BarackObama plans to borrow another $300 Billion. @BarackObama is their favorite client. _E_\nAn amazing article by Kevin Gabriel __HTTP__ A must read by friends and foes of President Obama. End date is tomorrow at noon. _E_\nYoung entrepreneurs: Your success is measured by results. Be productive in the face of challenges. Setbacks are not fatal. _E_\nThe @SenTedCruz endorsement was a wonderful surprise. I greatly appreciate his support! We will have a tremendous victory on November 8th. _E_\nI was at @FoxNews and met Juan Williams in passing. He asked if he could have pictures taken with me. I said fine. He then trashes on air! _E_\nI watched @BarackObama at the National Prayer Breakfast and he looked totally uncomfortable with his words. (cont) __HTTP__ _E_\nTotally false reporting on my call with @Reince Priebus. He called me ten minutes said I hit a \"nerve doing well end! _E_\nI agree Mike thank you to all of our law enforcement officers! #VPDebate Police officers are the best of us... @Mike_Pence _E_\nHappy #NationalFarmersDay!📸 __HTTP__ __HTTP__ _E_\nI wonder who @ArsenioHall's first guest will be his show will be great! _E_\nThank you New Hampshire! #FITN #NHPrimary #VoteTrumpNH Voting questions? __HTTP__ __HTTP__ _E_\n.@joycefinance #asktrump __HTTP__ _E_\nBig interview tonight by Henry Kravis at The Business Council of Washington. Looking forward to it! 
_E_\nI'll be signing copies of my new book Time To Get Tough tomorrow in Trump Tower 11 am to 2 pm. Hope to see you there. _E_\nMy wife @MELANIATRUMP will be #OnTheRecord w/ @greta tonight at 7pmE on @FoxNews. Enjoy! __HTTP__ __HTTP__ _E_\nGovernor @RicardoRossello We are with you and the people of Puerto Rico. Stay safe! #PRStrong _E_\n#noratings @Lawrence will soon be off tv bad ratings he has a face made for radio. _E_\nI will be interviewed on @foxandfriends at 7:30 A.M. Enjoy! _E_\nWhen terrorists are beheading and executing American citizens in such a brutal waythe report on torture should be the least of our concerns _E_\nSometimes by losing a battle you find a new way to win the war. _E_\nWith the number of tweets sad sack @Rosie has done she has totally lost control of herself hopefully not a breakdown. _E_\nThank you. __HTTP__ _E_\nThank you to Shawn Steel for the nice words on @FoxNews. _E_\nThanks. __HTTP__ _E_\nMore dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_\nHYPOCRITE! Long before @BarackObama called the Tea Party 'teabaggers' he dressed as a revolutionary in a Hyde Park rally __HTTP__ _E_\nChina is about to acquire a unit of AIG which we bailed out for $5.5B __HTTP__ China is making great deals on our backs. _E_\nRising 70 stories over Panama Bay @TrumpPanama offers our elite amenities in Latin Americas tallest building __HTTP__ _E_\nHope everyone enjoyed their Thanksgiving. But get ready our country is in big trouble! _E_\nSpent time with Indiana Governor Mike Pence and family yesterday. Very impressed great people! _E_\nDemocrats try so hard to mock & belittle Republicans—& the Republicans just don't fight back—no energy! _E_\nWe must keep the pressure on @BarackObama's administration to make sure Chen comes to the US. It would be a tragedy to abandon him in China. 
_E_\nFor more information on tonight's two hour telethon 8 to 10 p.m.: __HTTP__ _E_\n...The fact is that Puerto Rico has been destroyed by two hurricanes. Big decisions will have to be made as to the cost of its rebuilding! _E_\nI wonder if @BarackObama ever applied to Occidental Columbia or Harvard as a foreign student. When can we see (cont) __HTTP__ _E_\n... in order to occupy space in a truly ugly office building in a much worse location! _E_\nFor all of those that think life is easy & don't want to work remember: HOPE IS THE POOR MAN'S BREAD. _E_\n.@KellyannePolls Kellyanne you were fantastic on @meetthepress today. Keep going I will win for the people. MAKE AMERICA GREAT AGAIN! _E_\nA great day in both Spencer & Davenport Iowa! THANK YOU for the support! #Trump2016 #FITN #IAPolitics __HTTP__ _E_\nThank you Mississippi! #Trump2016 _E_\nHead on over to my Facebook page to have your questions answered in the next #AskTheDonald __HTTP__ _E_\nThank you Anthony @Scaramucci @WSJ The Entrepreneur's Case for Trump __HTTP__ _E_\nNO MERCY TO TERRORISTS you dumb bastards! _E_\nThank you to respected columnist Katie Hopkins of Daily __HTTP__ for her powerful writing on the U.K.'s Muslim problems. _E_\nTwo great people! __HTTP__ _E_\nWith a record deficit and $15 trillion in debt @BarackObama is spending $4 million of our money on his Hawaii vacation. Just plain wrong. _E_\nI'm very proud of the work my son @EricTrump has been doing with the @EricTrumpFDN take a look... __HTTP__ _E_\nHe @RickSantorum has as much chance of being the GOP nominee as @Rosie does of ever having a successful (cont) __HTTP__ _E_\nVia @DailyCaller by @alweaver22:\"Trump: Obama One Of 'The Worst Things That's Ever Happened To Israel'\" __HTTP__ _E_\nPleasure in the job puts perfection in the work. Aristotle _E_\nTechnology has shown we have tremendous energy resources right under our feet that we didn't know about 5 years ago. _E_\nSo they caught Fake News CNN cold but what about NBC CBS & ABC? 
What about the failing @nytimes & @washingtonpost? They are all Fake News! _E_\nSuccess is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_\nEntrepreneurs: Success is good. Success with significance is even better. Make your work count. _E_\n...well into our 4th week of shooting the record 13th season of @CelebApprentice. The 'All Stars' are hard at work... _E_\nEvery poll Time Drudge Slate and others said I won both debates but heard Megyn Kelly had her two puppets say bad stuff. I don't watch _E_\nRT @AnnCoulter: Anyone who plans to talk about Trump ever again has to see this speech. Your opinion is irrelevant unless you listened to... _E_\nRT @NFIB: .@NFIB encouraged by @realDonaldTrump's #taxplan says #smallbiz would benefit from lower tax rate: __HTTP__ _E_\nDeparting Farmers Round Table in Boynton Beach Florida. Get out & VOTE lets #MAGA! EARLY VOTING BY FL. COUNTY:... __HTTP__ _E_\nVia @BreitbartNews by @mboyle1: DONALD TRUMP: MSM INVESTIGATION INTO SCOTT WALKER'S COLLEGE A 'DOUBLE STANDARD' __HTTP__ _E_\nWoody Johnson owner of the NYJets is @JebBush's finance chairman. If Woody would've been w/me he would've been in the playoffs at least! _E_\nRT @realDonaldTrump: Happy Birthday @DonaldJTrumpJr! __HTTP__ _E_\nThe Fed is considering issuing even more US bond debt into the market. Not good! _E_\nThe United Nations has such great potential but right now it is just a club for people to get together talk and have a good time. So sad! _E_\nMy @marklevinshow interview discussing Obama's SOTU Rove's attack on the Tea Party & All Star @ApprenticeNBC __HTTP__ _E_\n.@mcuban says he is a member of Dallas National but doesn't play golf. Who is a member of a golf club that doesn't play?? No talent! @TMZ _E_\nWhy doesn't President Obama simply apologize for telling a big fat lie announce that ObamaCare was a mistake and deal a really great plan! 
_E_\nWHAT THEY ARE SAYING ABOUT THE CLINTON CAMPAIGN'S ANTI CATHOLIC BIGOTRY: __HTTP__ _E_\n.@brithume I am in first place by a lot in all polls tied for first place with Ben Carson in one Iowa poll. I thought you knew this thanks _E_\nO.K. Christmas is over now we can all go back to the wars of life. Focus focus focus never accept defeat push hard for total victory! _E_\nI will be on Face the Nation with John Dickerson on CBS this morning. Enjoy! _E_\nCrooked Hillary Clinton Tops Middle East Forum's 'Islamist Money List' __HTTP__ _E_\nTrump Int'l Palm Beach offers a spectacular course with hill vistas bunkers and incredible water features. __HTTP__ _E_\nI'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring' or 'insourcing?' We need (cont) __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWonder if Obama will ever say RADICAL ISLAMIC TERRORIST? _E_\nLots of response that Obama should give the $5M to the families of our great heroes who were murdered in Benghazi. _E_\nBefore Kids Can Go Places They Need a Place To Go the motto of The Police Athletic League an organization I'm very proud to support _E_\nTHE DONALD J. TRUMP PRESIDENTIAL EXPLORATORY COMMITTEE __HTTP__ _E_\nWhat can you learn today that you didn't know before? Set the bar high do the best you possibly can. _E_\nVia @golf_com by @joepassov: \"@TrumpFerryPoint Will Be One of Nation's Best Public Courses\" __HTTP__ _E_\n'U.S. Industrial Production Surged in April' __HTTP__ _E_\nMy complaint against @AGSchneiderman is a \"case study\" for JCOPE & Moreland Commissions on everything that is wrong with NYS politics. _E_\nThank you Washington! Honored to say on behalf of our great movement we have broken the all time record for votes in GOP primary history. _E_\nNorth Korea can't survive or even eat without the help of China. China could solve this problem with one phone call they love taunting us! 
_E_\nGet Snowden back from Russia—he has done tremendous damage to the US & should pay a very heavy price. _E_\nThank you Newt! __HTTP__ _E_\nAs soon as John Kasich is hit with negative ads he will drop like a rock in the polls against Crooked Hillary Clinton. I will win! _E_\nGreat Town Hall tonight at 10:00 P.M. (Eastern) conducted by @seanhannity on @FoxNews _E_\nHad a great time on @gretawire's inaugural 7PM show. Congrats to Greta on the new spot! _E_\nPervert alert. @RepWeiner is back on twitter. All girls under the age of 18 block him immediately. _E_\n...lottery continues deadly catch and release and bars enforcement even for FUTURE illegal immigrants. Voting for this amendment would be a vote AGAINST law enforcement and a vote FOR open borders. If Dems are actually serious about DACA they should support the Grassley bill! _E_\nObama must now FOCUS get his mind off March.Madness and LEAD! Watch Russia closely work hard on the economy and get rid of ObamaCare! _E_\nOnce ObamaCare is fully enacted in NY conveniently after 2014 expect higher premiums bigger deductibles & worse care. Job killer! _E_\nWell Obama refused to say (he just can't say it) that we are at WAR with RADICAL ISLAMIC TERRORISTS. _E_\nI don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1 in terror no problem! _E_\nFor what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_\nDoing a commercial for @NFLONFOX lots of fun! __HTTP__ _E_\nCongress get ready to do your job DACA! _E_\n.@latoyajackson is once again at the top of her game in the upcoming All Star season of @CelebApprentice. Amazing in the boardroom... _E_\nMAKE AMERICA GREAT AGAIN! MAKE AMERICA SAFE AGAIN! _E_\nI am honored that Texas supporters have filed papers in Texas to create Make America Great Party on my behalf. 
__HTTP__ _E_\n\"The Constitution is the guide which I never will abandon\" George Washington _E_\n.@MichelleMalkin would be nothing without being on the @seanhannity show. I don't see what Sean sees in her—loser! _E_\nEntrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_\nObama & Clinton should stop meeting with special interests & start meeting with the victims of illegal immigration. _E_\nGolf Odyssey one of golf's most respected publications just named Trump International Golf Links Scotland golf course of the year _E_\nJust landed in Iowa speaking soon! _E_\nIt all comes down to one simple question: How much money can you stand to lose? That's how much risk you should assume. _E_\nWhat do you think of Gary's definition of f u n? _E_\nTrump International Hotel & Tower Toronto continues to receive accolades. Great city great hotel. __HTTP__ #TrumpToronto _E_\nRT @detroitnews: .@IvankaTrump in Michigan: 'This is your movement' __HTTP__ @realDonaldTrump __HTTP__ _E_\n.@TPNNtweets Donald Trump Tells A Fascinating Inside Story About His Dealings w/ The Obama WH __HTTP__ @johnhawkinsrwn _E_\nIf Chelsea Clinton were asked to hold the seat for her motheras her mother gave our country away the Fake News would say CHELSEA FOR PRES! _E_\nSadly this kind of stuff even happened to Ronald Reagan. There is nothing nice about it! #MakeAmericaGreatAgain __HTTP__ _E_\nOn Sunday Jerome Bettis 'the bus' from the Pittsburgh Steelers will play at Trump Int'l Golf Club/Palm Beach against Julius Erving 'Dr J' _E_\nI will be interviewed on The O'Reilly Factor this evening at 8 pm on the Fox News Channel. @oreillyfactor _E_\nOur legal system is broken! 77% of refugees allowed into U.S. since travel reprieve hail from seven suspect countries. (WT) SO DANGEROUS! _E_\nNY Jets center Nick Mangold interns for Trump. 
Watch Trump's Fabulous World of Golf tonight 9PM ET on Golf Channel __HTTP__ _E_\nWith 49 days until the election @MittRomney needs to stay on offense. He should not be apologizing. Deflect onto Obama's record. _E_\nObama's motto: If I don't go on tax payer funded vacations & constantly fundraise then the terrorists win. _E_\nDoes everyone remember @MittRomney and his famous remarks about self deportation and 47% . He was done. I don't need his angry advice! _E_\nObama told Medvedev after the '12 reelect he would \"have more flexibility.\" It was music to Putin's ears. _E_\nAmazing. @CelebApprentice has started filming our record 13th season this week thanks to our big and very loyal fan base. _E_\nThe results are in on the final debate and it is almost unanimous I WON! Thank you these are very exciting times. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nVia @Newsmax_Media: \"Donald Trump 2016: 8 Facts About Personal Life of GOP Presidential Hopeful\" __HTTP__ _E_\n#CrookedHillary __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe Costa Concordia shipwreck is a MONUMENT TO STUPIDITY but the uprighting of the ship is a MONUMENT TO GENIUS! _E_\nWealth comes from big goals and sustained action toward those goals every day. Think Big _E_\nWow you are all correct about @FoxNews totally biased and disgusting reporting. _E_\nMy @971FMTalk int. with @DLoesch on #HandsOffMyGun 2014 election results stopping Obamacare new Senate & 2016 __HTTP__ _E_\nVia @CarolinaLive by @JoelAllenWPDE:\"Big names wrap up largest ever SC Tea Party Coalition Convention\" __HTTP__ _E_\nSee story in The Scotsman re: wind turbines __HTTP__ _E_\nI answered your questions in today's video... watch at __HTTP__ _E_\nWill be interviewed by @andersoncooper on @CNN tonight. Let's see if he treats me fairly—enjoy! _E_\nA wonderful evening in South Carolina big crowd amazing energy! 
_E_\nThe truly great Phyllis Schlafly who honored me with her strong endorsement for president has passed away at 92. She was very special! _E_\nPresident Obama spoke last night about a world that doesn't exist. 70% of the people think our country is going in the wrong direction. #DNC _E_\nVia cnsnews by @SJonesCNS: \"Trump Explains His Appeal: 'People Are Tired...Of These Incompetent Politicians'\" __HTTP__ _E_\nGlad to hear @SethMacFarlane will be hosting this year's Oscars. Something new that should be fun. _E_\nMy thoughts on @barackobama's campaign.... __HTTP__ _E_\nSee Sanders backed Hillary on E mails at the debate hurting himself and then she threw him under the bus (but failed). Disloyal person! _E_\n\"You measure your people and you take action on those that don't measure up.\" @jack_welch _E_\nWhy gas prices will rise Miss Canada/Miss Universe and #CelebApprentice in today's #trumpvlog... __HTTP__ _E_\nWe must leave stop and frisk for A Rod and Anthony Weiner! _E_\nRT @Scavino45: .@POTUS & @FLOTUS w/ @LVMPD Officer Cook 2nd day on job received gunshot wound to the right chest & right arm saving live... _E_\nI had a great time in D.C. yesterday at the Trump International Hotel OPO groundbreaking ceremony. Watch __HTTP__ _E_\nMike Pence won big. We should all be proud of Mike! _E_\nGoing to the White House is considered a great honor for a championship team.Stephen Curry is hesitatingtherefore invitation is withdrawn! _E_\nWhy would smart voters want to put Democrats in Congress in 2018 Election when their policies will totally kill the great wealth created during the months since the Election. People are much better off now not to mention ISIS VA Judges Strong Border 2nd A Tax Cuts & more? _E_\nDon't let Obama buy the election by handing out unlimited free money to states. _E_\nThank you New Hampshire! Departing with my amazing family now! 
#FITN #NHPrimary __HTTP__ __HTTP__ _E_\n\"Trump signs lease for a NH office returns Monday\" __HTTP__ via @UnionLeader by @tuohy _E_\nHillaryClinton can illegally get the questions to the Debate & delete 33000 emails but my son Don is being scorned by the Fake News Media? _E_\n#FlashbackFriday @kimkardashian on the set of @ApprenticeNBC __HTTP__ _E_\n\"Fortunately for a quarterback you can play for a long time because you don't get hit very often.\" – Tom Brady @SuperBowl @Patriots _E_\nIt should be mandatory that all haters and losers use their real name or identification when tweeting they will no longer be so brave! _E_\n.@RudyGiuliani one of the finest people I know and a former GREAT Mayor of N.Y.C. just took himself out of consideration for State . _E_\nWow @megynkelly really bombed tonight. People are going wild on twitter! Funny to watch. _E_\nMitt Romney gave a masterful speech this weekend at Liberty University with a wonderful introduction by Mark DeMoss. Well done. @MittRomney _E_\nIf you don't believe in yourself no one else will. _E_\nWhen will Washington stand up to China. China is manipulating its currency and stealing our jobs. Washington should move on legislation. _E_\nI am in Colorado big day planned but nothing can be as big as yesterday! _E_\nTed Cruz has now apologized to Marco Rubio and Ben Carson for fraud and dirty tricks. No wonder he has lost Evangelical support! _E_\n.@CNN Why is somebody (Beck) I beat so soundly all of a sudden an expert on Donald Trump (all over television). She knows nothing about me. _E_\nWe may get out of ObamaCare because the train wreck is impossible to implement __HTTP__ It is a disaster. _E_\nI will be interviewed on @foxandfriends at 7:00 this morning. Plenty to talk about! _E_\nThank you for a great evening Laconia New Hampshire will be back soon! #AmericaFirst __HTTP__ __HTTP__ _E_\nHillary Clinton's open borders immigration policies will drive down wages for all Americans and make everyone less safe. 
_E_\nI fought hard against Spitzer and Weiner and both lost. For a while when Spitzer was way up it seemed that I was a lone voice! Good power _E_\nThank you America!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nA great great honor to welcome & recognize the National Teacher of the Year as well as the Teacher of the Year fro... __HTTP__ _E_\nBelieve and act as if it were impossible to fail. Charles F. Kettering _E_\nI will and I agree! RT @ZacharyQuinto@realdonaldtrump you can't possibly make any more money. so why don't you make a difference instead?! _E_\nThank you to President Moon of South Korea for the beautiful welcoming ceremony. It will always be remembered. __HTTP__ _E_\nI can't believe that Prime Minister @David_Cameron is giving massive subsidy to Scotland to destroy itself with windfarms. _E_\nTRAIN WRECK just the beginning. Our roads airports tunnels bridges electric grid all falling apart.I can fix for 20% of pols & better _E_\nWow such sacrfices for his re election. @BarackObama will not vacation in Martha's Vineyard this summer. __HTTP__ _E_\nLightweight @AGSchneiderman is driving business & jobs out of NY. Only wants self publicity—a total loser! _E_\nI (we) broke the all time record for most votes gotten in a Republican Primary by a lot and with many states left to go! Thank you. _E_\nA strong military will stop wars. Peace through Strength! Let's Make America Great Again! __HTTP__ _E_\nIf a conservative Republican made the mistake that Mrs. Obama just made by calling Braley by the wrong name it would be the biggest story! _E_\n.@usgsa A momentous day. Great job on Old Post Office we will make you proud! _E_\nCan you believe that the Afghan war is our \"longest war\" ever—bring our troops home rebuild the U.S. make America great again. _E_\nEveryone here is talking about why John Podesta refused to give the DNC server to the FBI and the CIA. Disgraceful! 
_E_\n#TBT With James Lipton on the set of @ApprenticeNBC __HTTP__ _E_\nThousands of e mails from folks urging me to seek the Americans Elect Presidential nomination. _E_\n.@dubephnx If we didn't remove incredibly powerful fire retardant asbestos & replace it with junk that doesn't (cont) __HTTP__ _E_\nScotland does not have free press even when you are just stating the facts it's crazy! _E_\nThe original Apprentice is coming back do you have what it takes to be the next Apprentice? For casting details: __HTTP__ _E_\nWe did it! Thank you to all of my great supporters we just officially won the election (despite all of the distorted and inaccurate media). _E_\nNothing conservative about the Club for Growth coming into my office and demanding a $1M contribution which naturally they did not get. _E_\nWhat other country tells the enemy when we are going to attack like Obama is doing with ISIS. Whatever happened to the element of surprise? _E_\n...NFL attendance and ratings are WAY DOWN. Boring games yes but many stay away because they love our country. League should back U.S. _E_\nI'm going to the @Yankees game tonight to root them on they always win when I am there. _E_\nAs China is building an air and naval force @BarackObama is cutting ours. __HTTP__ He is weakening our national security. _E_\nGet out tomorrow and vote so that we can all finally say those magic words __HTTP__ _E_\nIt was a true honor to be at Yokota Air Base with our GREAT @USForcesJapan! __HTTP__ _E_\nIraq is more dangerous today than any time under Saddam. War was a mistake as I said from the very beginning. Bush & Obama should apologize _E_\nI guess they have Lance Armstrong cold. Brutal report. A waste of taxpayer money to take down an American hero. _E_\nHeading to Iowa join me today at noon! #MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_\nPutin & I discussed forming an impenetrable Cyber Security unit so that election hacking & many other negative things will be guarded.. 
_E_\nRT @foxandfriends: President Trump to sign an executive order on religious liberty today the National Day of Prayer | @kevincorke __HTTP__ _E_\n.@FoxNews is changing their theme from fair and balanced to unfair and unbalanced. But dying @WSJ is worse.Their phony poll is a joke! _E_\n.@JohnLegere @TMobile John focus on running your company I think the service is terrible! Try hiring some good managers. _E_\nBe sure to get a copy of @williebosshog's new book American Hunter. _E_\nThe Arab Spring is not working out so well nice name bad results! _E_\nJoin me in Redding California tomorrow at 1:00pm. #Trump2016Tickets: __HTTP__ _E_\nThe next generation of luxury @TrumpVancouver will be the icon of the Vancouver skyline __HTTP__ _E_\nJust landed in Iowa. See everyone soon! #MAGA _E_\nWe are taking action to #RepealANDReplace #Obamacare! Contact your Rep & tell them you support #AHCA. #PassTheBill... __HTTP__ _E_\nBay Bridge in San Fransisco built in China keeps getting worse. Cost overruns are out of control China is having a field day with us! _E_\nMany reports of peaceful protests by Iranian citizens fed up with regime's corruption & its squandering of the nation's wealth to fund terrorism abroad. Iranian govt should respect their people's rights including right to express themselves. The world is watching! #IranProtests _E_\n.@KatyTurNBC 3rd rate reporter & @SopanDeb @ CBS lied. Finished in normal manner&signed autos for 20min. Dishonest! __HTTP__ _E_\nWe cannot take four more years of Barack Obama and that's what you'll get if you vote for Hillary. #BigLeagueTruth _E_\nRemember when Obama promised \"you can keep your health care plan?\" Not in these 10 states. __HTTP__ Another lie. _E_\nMy plan will lower taxes for our country not raise them. Phony @club4growth says I will raise taxes—just another lie. _E_\nCongratulations to new Congressman @leezeldin being named to House Foreign Affairs Comm. and co chair the House Republican Israel Caucus. 
_E_\nWelcome to the new reality. Goldman Sachs just based their new Asia Pacific chairman not in Tokyo but Beijing. __HTTP__ _E_\nDoing Fox and Friends at 7.00 A.M. Hope you loved Apprentice last night. _E_\nWe must stop being politically correct and get down to the business of security for our people. If we don't get smart it will only get worse _E_\nToday I announced our strategy to confront the Iranian regime's hostile actions and to ensure that they never acquire a nuclear weapon. __HTTP__ _E_\nThe Tax Cut Bill is coming along very well great support. With just a few changes some mathematical the middle class and job producers can get even more in actual dollars and savings and the pass through provision becomes simpler and really works well! _E_\nHow crazy 7.5% of all births in U.S. are to illegal immigrants over 300000 babies per year. This must stop. Unaffordable and not right! _E_\nEntrepreneurs: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_\nCan anyone imagine Chafee as president? No way. _E_\nRT @foxandfriends: President Trump officially nominates former Assistant Attorney General Christopher Wray to head the FBI __HTTP__ _E_\n....it is very possible that those sources don't exsist but are made up by fake news writers. #FakeNews is the enemy! _E_\nTHE U.S.G.A. Boy's Junior Champion at Trump National Golf Club Bedminster just won The Australian Open. We are proud of you @JordanSpieth _E_\nIraq is being ravaged by Al Qaeda. Country in utter chaos & all oil is going to Iran & China __HTTP__ Terrible mistake! _E_\nLetterman @Late_Show had Brian Williams @NBCNightlyNews as guest last night I was on last Thursday _E_\n.@lightjzup Industrial turbines are destroying our land. _E_\nThanks. __HTTP__ _E_\nBiggest story in politics is now happening in the great State of Colorado where over one million people have been precluded from voting! _E_\nA pessimist is one who makes difficulties of his opportunities... 
_E_\nTo all of those who asked I predicted two weeks ago and again last night that Dwight Howard would go to Houston.Do I get congrats insight? _E_\nIncredibly proud of my son @EricTrump & his efforts on behalf of @StJude in Memphis TN. __HTTP__ __HTTP__ _E_\nIf Christian Bale turned down $50M to return as Batman he should have his head examined. What was he thinking?! _E_\nThey should have rebuilt the two buildings of the World Trade Center exactly as they were except taller and stronger. A better statement! _E_\nCan you imagine trading five really bad enemies of the U.S. for the freedom of traitor Bergdahl. Just another bad deal! _E_\nRT @PChowka: Sean Hannity's Big Week Top Ratings Probing Reporting and Let There Be Light at American Thinker __HTTP__ h... _E_\nWhether you think you can or think you can't you're right. Henry Ford _E_\nExcited and honored to be addressing @theFAMiLYLEADER summit in Iowa this August. __HTTP__ _E_\nGoing to New Hampshire in a little while. Big crowds! #MakeAmericaGreatAgain! _E_\nIn real estate all locations can be enhanced through good marketing. Be smart! _E_\nWhy does @CNN bore their audience with people like @secupp a totally biased loser who doesn't have a clue. I hear she will soon be gone! _E_\nMy @foxandfriends int. @FoxNewsInsider \"'Once a Choker Always a Choker': DJT Takes Credit for Romney Dropping Out\" __HTTP__ _E_\nThe media has been speculating that I fired Rex Tillerson or that he would be leaving soon FAKE NEWS! He's not leaving and while we disagree on certain subjects (I call the final shots) we work well together and America is highly respected again! __HTTP__ _E_\nLet's not start celebrating over Libya until we see who takes over. _E_\n\"@NMoralesNBC @ThomasARoberts to Host 63rd Annual @MissUniverse\" __HTTP__ via @TheWrap by @AnthonyMaglio _E_\nMy @FoxBusiness int. 
w/Don Imus on not drinking alcohol politicians being all talk and no action & the border __HTTP__ _E_\n.@GretchenCarlson's memoir is a powerful example of perseverance & hope. \"Getting Real\" is as real as it gets. Get it & enjoy! #GettingReal _E_\nToday we gathered in the Roosevelt Room for one single reason: to CUT THE RED TAPE! For many decades an ever growing maze of regs rules and restrictions has cost our country trillions of dollars millions of jobs countless American factories & devastated entire industries. __HTTP__ _E_\nI will start reviewing various political reporters etc & websites as to their professionalism & fairness—many people asking for this. _E_\n\"If you are passionate about your endeavors it will be reflected back to you in your end result.\" – Trump Never Give Up _E_\nThere's a lot going on at the Eric Trump Foundation ... __HTTP__ _E_\n#GOPDebate #GoogleTrends __HTTP__ _E_\nGreat day for Tax Cuts and the Republican Party. But the biggest Winner will be our great Country! _E_\nAsk: Is there anyone else who can do this better than I can?That's just another way of saying know yourself & know your competition. _E_\nAssad will never give up his chemical weapons. He has spent years and billions accumulating them. This is all a ruse. _E_\n\"When you expect things to happen strangely enough they do happen.\" J. P. Morgan _E_\n'U.S. Murders Increased 10.8% in 2015' via @WSJ: __HTTP__ _E_\n.@DennisRodman re @Omarosa is right she's becoming predictable. _E_\nI know you will enjoy reading my tax plan __HTTP__ #MakeAmericaGreatAgain _E_\n.@TrumpGolfLA is @theknot's pick for the Best of Weddings with our Vista Terrace looking over the Pacific Ocean __HTTP__ _E_\nHillary said such nasty things about me read directly off her teleprompter...but there was no emotion no truth. Just can't read speeches! _E_\nRT @jessebwatters: Thanks for watching!! __HTTP__ _E_\nCruz said Kasich should leave because he couldn't get to 1237. Now he can't get to 1237. 
Drop out LYIN' Ted. _E_\nWise words from my mother: \"Trust in God and be true to yourself.\" Mary MacLeod Trump _E_\nI pick the best locations @Trump_Charlotte has incredible views of beautiful Lake Norman. __HTTP__ _E_\nJon Stewart is the most overrated joke on television. A wiseguy with no talent. Not smart but convinces dopes he is! Fading out fast. _E_\nI am self funding my campaign and am therefore not controlled by the lobbyists and special interests like lightweight Rubio or Ted Cruz! _E_\nThe Justice Department's investigation into the national security leaks is not independent. This is a very grave situation. _E_\nThank you America! #Trump2016 __HTTP__ _E_\nVery excited to be returning to Iowa tomorrow to campaign for my friend & strong Conservative leader @SteveKingIA! _E_\nClear winner of the #GOPDebate. Thank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThank you! #GOPDebate __HTTP__ _E_\n\"Donald Trump hosts first ever 'Trump Invitational' at Mar a Lago\" __HTTP__ via @WPTV _E_\nRT @charliekirk11: ISIS getting slaughtered: Square miles liberated from ISISTrump: 26000 Obama: 13200Total Square miles held by... _E_\nWhen it comes to money finance and even life PROTECT THE DOWNSIDE AND THE UPSIDE WILL TAKE CARE OF ITSELF! _E_\nYoung entrepreneurs – always remember in negotiations that sometimes the best deal you make is the one you walk away from. _E_\n#MakeAmericaSafeAgain __HTTP__ _E_\n.@AlexSalmond is making a truly stupid mistake by forcing ugly industrial wind turbines down Scotland's throat –he's hated for it. _E_\nVia @theblaze by @BillyHallowell:\"DONALD TRUMP BLASTS OBAMA FOR FAILING TO SECURE CHRISTIAN PASTOR'S FREEDOM IN IRAN\" __HTTP__ _E_\n.@alexsalmond RT @RichWaugaman This time I agree 100% I never knew how useless a wind turbine was until I (cont) __HTTP__ _E_\nThe stage is set for the real debate it will be very interesting! _E_\nBlackdog Scotland started a petition against @VattenfallGroup. 
__HTTP__ _E_\nThank you Arizona! See you soon!#MakeAmericaGreatAgain __HTTP__ _E_\nVenezuela should allow Leopoldo Lopez a political prisoner & husband of @liliantintori (just met w/ @marcorubio) o... __HTTP__ _E_\nHillary said she was under sniper fire (while surrounded by USSS.) Turned out to be a total lie. She is not fit to... __HTTP__ _E_\nThe Republicans who want to cut SS & Medicaid are wrong. A robust economy will Make America Great Again! __HTTP__ _E_\nIn August 2012 Obama said the so called Arab Spring sprung from 'joyful longing for human freedom' __HTTP__ Good call! _E_\nRT @DonaldJTrumpJr: Not surprising at all! Father Of Otto Warmbier: Obama Admin Told Us To Keep Quiet Trump Admin Brought Him Home __HTTP__ _E_\nOver 2 million people have lost their jobs since @BarackObama became POTUS. How many of them still have healthcare? _E_\nLittle Mac Miller's next album may bomb. He can't use my name again for sales. _E_\nI took some heat a long time ago when I said that George Zimmerman was a sicko and bad news. I know people and this guy is no good trouble! _E_\n#ThankYouTour2016 12/6 North Carolina __HTTP__ Iowa __HTTP__ Michiga... __HTTP__ _E_\nBe sure to download my new The Celebrity Apprentice app to begin interacting with this Sunday's episode __HTTP__ _E_\nWhile Hillary and I both won South Carolina by big margins Repubs got far more votes with a massive increase from past cycles.GROWING PARTY _E_\nI would love to be at the Cadillac World Golf Championship @TrumpDoral in Miami but even more so in Orlando with the #TrumpTrain! _E_\nThe Fed's actions these past 3 years could bring record high inflation in the near future. That would be (cont) __HTTP__ _E_\nNo surprise that China was caught cheating in the Olympics. That's the Chinese M.O. Lie Cheat & Steal in all international dealings. _E_\nProud of @IvankaTrump for her leadership on these important issues. Looking forward to hearing her speak at the W20! __HTTP__ _E_\nThank you Nashua New Hampshire! 
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nGoing over to @ABC to do LIVE at 9:00. _E_\nThank you to everyone for all of the nice comments by Twitter pundits and otherwise for my speech last night. _E_\nObamacare puts poor people on a form of government run single payer health insurance that many doctors don't take @Avik _E_\n.@FLGovScott Gaming states are laughing at stupidity of not approving gaming in FL—they're afraid of Miami—can't believe their luck! _E_\nTrump University has a 98% approval rating. I could have settled but won't out of principle! _E_\n.@Toure I felt very sorry for you during your meltdown on @PiersMorgan. He drove you insane but of course Piers is a lot smarter than you _E_\nRumor has it that the grubby head of failing @VanityFair Magazine Sloppy Graydon Carter is going to be fired or replaced very soon? _E_\nVia @wmbfnews: Donald Trump puts Tea Party on map for 2016 __HTTP__ _E_\nBad news for @BarackObama. @gallupnews reports that the economy (71%) and gas prices (65%) are Americans' top (cont) __HTTP__ _E_\n#CaucusForTrump #Trump2016 __HTTP__ _E_\nThe five Taliban leaders released for a deserter must really be laughing and having a good time right now. They are saying how dumb U.S. is! _E_\nMy @FoxNews interview with @gretawire discussing the #CNNDebate and how to deal with Iran without using force __HTTP__ _E_\n#trumpvlog China is laughing.... __HTTP__ _E_\nI answered my @Facebook fans questions via video watch __HTTP__ _E_\nTed Nugent was obviously using a figure of speech unfortunate as it was. It just shows the anger people have towards @BarackObama. _E_\nMy thoughts on the Geico ad and more in today's video blog.... __HTTP__ _E_\nMost of the world's great riders are at Mar a Lago today for the Trump Invitational one of the most important equestrian events of the year _E_\nDo as I say not as I do. Obama just granted a special ObamaCare exemption for all Congress __HTTP__ All are hypocrites! 
_E_\n\"Developing your talent requires work and work creates luck.\" – Trump Never Give Up _E_\nThe US should not give a penny of foreign aid to Egypt if the Muslim Brotherhood takes over the country. We (cont) __HTTP__ _E_\nDrudge Poll on who won the 3rd #GOPDebate. Thank you! __HTTP__ _E_\nThank you to @jdickerson and @FaceTheNation for a very fair and professional interview this morning. No wonder you are #1 in the ratings! _E_\nFrack now and frack fast unless we want to continue to be dependent on countries that hate us. _E_\nWhen will anyone be held accountable for the VA scandal? The politicians are experts in never facing any consequence. _E_\nVia @AmSpec by Jeffrey Lord: \"New Obama Scandal Erupts: Trump Targeted\" __HTTP__ _E_\nSo I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_\nJoe Scarborough initially endorsed Jeb Bush and Jeb crashed then John Kasich and that didn't work. Not much power or insight! _E_\nI invite you to join my campaign to Make America Great Again! Sign up to Volunteer! __HTTP__ _E_\nBarack Obama said absolutely not 3 times before he agreed to go after Bin Laden now he wants all of the credit! _E_\nThanks @JamersonHayes they are all total losers with nothing going for them! _E_\nCheck out this great story from the @WSJ... __HTTP__ _E_\nIn calling my tweets 'obnoxious' @AOL says \"I sure know how to keep them wanting more.\" They are welcome. I just tell it like it is. _E_\n.@PhilMickels0n_ is right—California taxes are far too high. It's ridiculous. _E_\nThe Audacity of @BarackObama the Federal Reserve purchased 61% of all debt issued by Treasury in 2011. Killing our children's future. _E_\nRT @realDonaldTrump: At the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of go... _E_\nObama should play golf with Republicans & opponents rather than his small group of friends. 
That way maybe the terrible gridlock would end. _E_\nBetween a terrible press conference mishandled prisoner swap & Taliban attacks Hagel's 1st trip as SOD was a disaster. No surprise. _E_\nOur vets are treated like 3rd class citizens. Enough! Join me & @V4SA on @USSIOWA at LA Waterfront to hear my plan for vets & the military! _E_\nI look forward to playing golf with President @BarackObama someday. _E_\nBig response to my Tea Party statement remember they were never fully energized by Romney campaign and will have far more power with time. _E_\nEntrepreneurs: Gain and use information to your advantage see every day as an opportunity to learn. _E_\nI guess Rupert Murdoch and the @nypost don't like Donald Trump. Such false reporting about my big hit in Iowa. Even my enemies said bull. _E_\nDon't believe Kay Hagan on Ebola travel ban. She also promised that you would keep your healthcare plan under ObamaCare. Vote @ThomTillis! _E_\nPutin re Snowden issue \"it is like shearing a pig: there's lots of squealing and little fleece.\" _E_\nCheck out today's video blog __HTTP__ I want to answer more of your questions tweet me..... _E_\nThe only reason irrelevant @GlennBeck doesn't like me is I refused to do his failing show asked many times. Very few listeners sad! _E_\nGive your goals substance make them count on as many levels as you can. Remember that passion can be the catalyst for great achievement. _E_\n.@ScotGolfPodcast Work has not yet begun. We're in the approval phase. It will be amazing. You will love the final result. _E_\nWatch the WH spokesman try to spin @BarackObama's rationale for using exec. priv. on Fast & Furious __HTTP__ _E_\nA very interesting read. Unfortunately so much is true. __HTTP__ _E_\nPresident Obama and our negotiators are failed checker players playing against Grand Master Chess champions. Very sad to watch! _E_\nChinese oil trader just bought \"record number\" of Mideast crude __HTTP__ China gains while we fight ISIS. What are we doing? 
_E_\nI have built so many great & complicated projects– creating tens of thousands of jobs video: __HTTP__ __HTTP__ _E_\nVia @GolfweekMag by @GolfweekBRomine: \"@TigerWoods to design Trump course in Dubai\" __HTTP__ _E_\nIf JP Morgan took their case through the courts for 15 years nobody would be suing them—easy target. _E_\n'Trump is right about violent crime: It's on the rise in major cities' __HTTP__ _E_\nAs #HurricaneHarvey intensifies remember to #PlanAhead. __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_\nthings they did and said (like giving the questions to the debate to H). A total double standard! Media as usual gave them a pass. _E_\nCongratulations to @newtgingrich‎ on being signed to co host @CNN Crossfire. Great move by Jeff Zucker. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nNew York should Frack. Thousands of jobs and millions in revenue. NY would be a truly rich state. _E_\nThe worst thing Hillary could do is have her husband campaign for her. Just watch. _E_\nOn our YouTube channel the opening of the incredible Trump Ocean Club in Panama.... __HTTP__ _E_\nI just saw the movie Unbroken very good except I thought the ending was weak no retribution! And we complain about waterboarding. _E_\nAfter 1 year of investigation with Zero evidence being found Chuck Schumer just stated that Democrats should blame ourselvesnot Russia. _E_\nWatching the Ryder Cup on @GolfChannel. Very interesting and tough matches. Amazing sport my favorite! _E_\nKasich only looks O.K. in polls against Hillary because nobody views him as a threat and therefore have placed ZERO negative ads against him _E_\nWith Ben Carson wanting to hit his mother on head with a hammer stab a friend and Pyramids built for grain storage don't people get it? _E_\nRT @mitchellvii: EXACTLY AS I SAID House Intel Chair: We Cannot Rule Out Sr. Obama Officials Were Involved in Trump Surveillance __HTTP__ _E_\nThe USC made a terrible decision today. 
How can a requirement to buy private health insurance logically be a Government tax?! _E_\nCarly Fiorina did such a horrible job at Lucent and HP virtually destroying both companies that she never got another CEO job offer! Pres. _E_\nThe Football program at Penn State should be suspended. _E_\nWill be speaking with Germany and France this morning. _E_\nI just left @trumpwinery in CharlottesvilleVirginia it is the finest in the country really incredible! _E_\nAt least 3.5M fellow Americans are going to lose their healthcare plans because of ObamaCare. Defund then repeal! _E_\nI liked The Kelly File much better without @megynkelly. Perhaps she could take another eleven day unscheduled vacation! _E_\nAs a candidate I promised we would pass a massive TAX CUT for the everyday working American families who are the backbone and the heartbeat of our country. Now we are just days away... __HTTP__ _E_\nThank you @TrumpWomensTour!#MakeAmericaGreatAgain __HTTP__ _E_\nThe historic $250M renovations at @TrumpDoral are moving on pace. Once complete @TrumpDoral will be South Florida's premiere resort. _E_\nThey laughed at me when I said to bomb the ISIS controlled oil fields. Now they are not laughing and doing what I said. #Trump2016 _E_\nThe NYPost reports @VanityFair Magazine dropped 18% to only 283938 newsstand copies sold. Very sad & their bloggers are doing even worse! _E_\nDonald Trump will keynote Oakland County Republicans' Lincoln Day dinner __HTTP__ via @MLive Record crowd expected. _E_\nMy interview with @ASavageNation discussing #TimeToGetTough my 2012 plans and Iraq __HTTP__ __HTTP__ _E_\nLightweight Senator Marco Rubio features Trump Univ. students in FL. attack ads who submitted excellent reviews. __HTTP__ _E_\nThank you Ohio see you tonight! __HTTP__ _E_\n.@BarackObama bowed to the Saudi King in public yet the Dems are questioning @MittRomney's diplomatic skills. 
_E_\n.@DonaldJTrumpJr and I on the 18th hole at Trump International Golf Links Scotland __HTTP__ _E_\nDid you agree with my decision? #CelebApprentice _E_\nI never met former Defense Secretary Robert Gates. He knows nothing about me. But look at the results under his guidance a total disaster! _E_\nVia @FootwearNews by @kristenmhenning: \"@IvankaTrump Works to Beat Breast Cancer\" __HTTP__ _E_\nSteven Spielberg is a great filmmaker. Go see Lincoln. _E_\nI am astonished that the media continues to lie. @BarackObama gutted welfare reform. It is a fact! _E_\nAs a big job creator I was greatly honored to have been mentioned twice tonight during the debate. _E_\nJust announced that Iraq (U.S.) is preparing for battle to reclaim Mosul. Why do they have to announce this? Makes mission much harder! _E_\nFailed candidate Mitt Romneywho ran one of the worst races in presidential historyis working with the establishment to bury a big R win! _E_\nOur gov't should immediately stop sending $'s to Mexico no friend until they release Marine & stop allowing immigrant inflow into U.S. _E_\nDonald Trump visits Doral resort says he's allaying neighbors' concerns __HTTP__ via @MiamiHerald _E_\nI hope the boycott of @Macys continues forever. So many people are cutting up their cards. Macy's stores suck and they are bad for U.S.A. _E_\nThe blatant waste of taxpayers' dollars doesn't bother Obama because it's all part of his broader nanny state (cont) __HTTP__ _E_\nA tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_\n.@Omarosa is not winning points being called \"the wicked witch of the Mid West\" and most certainly other things. #CelebApprentice _E_\nRT @FoxNews: .@POTUS: Our infrastructure will again be the best in the world. We used to have the greatest infrastructure anywhere in the... _E_\nI have a dream that our country will be great again! #DreamDay _E_\nBenghazi is bigger than Watergate. 
Don't let Obama get away with allowing Americans to die. Kick him out of office tomorrow. _E_\nVia @dcexaminer by @eScarry: \"Donald Trump: @HuffingtonPost 'a very dishonest organization'\" __HTTP__ _E_\nGreat coordination between agencies at all levels of government. Continuing rains and flash floods are being dealt with. Thousands rescued. _E_\nDow S&P 500 and Nasdaq all finished the day at new RECORD HIGHS! __HTTP__ _E_\nAt least @TheTinaBeast is consistent. She takes over a magazine and it ends up in the gutter. _E_\n'Trump signs bill undoing Obama coal mining rule' __HTTP__ _E_\nThank you New Hampshire!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nJust arrived in Mississippi for the rally. Word is that the crowd is overflowing and massive. Will be an amazing evening! _E_\n\"A man always has two reasons for doing anything: a good reason and the real reason.\" J. P. Morgan _E_\nHappy Passover to everyone celebrating in the United States of America Israel and around the world. #ChagSameach _E_\nThank you Eric! __HTTP__ _E_\nRT @foxnation: .@realDonaldTrump's First Full Month in Office Sees Biggest Jobs Gain 'In Years': Report: __HTTP__ _E_\nBecause of President Obama's failed leadership we have put Vladimir Putin & Russia back on the world stage! No reason for this. _E_\nJobless claims rose yet again last week __HTTP__ @BarackObama's economic record is abysmal we can do much better. _E_\nObama doesn't know what he's doing. His foreign policy is a disaster. Libya Egypt Iraq Afghanistan all (cont) __HTTP__ _E_\nThank you! __HTTP__ _E_\nCurtis Sliwa doing tv commentary on 9/13/2001. Good job Curtis. Please send your apologies to @realDonaldTrump. __HTTP__ _E_\nRepublicans are always worried about their general approval. With proposing to 'ignore the debt ceiling' they are ignoring their base. _E_\nMAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_\nThank you! 
Facebook: __HTTP__ __HTTP__ __HTTP__ _E_\nToday we remember the men and women who made the ultimate sacrifice in serving. Thank you God bless your families & God bless the USA! _E_\nCongratulations to @SteveKingIA and his team on running a great campaign. Steve is a strong leader in the House. _E_\nThank you for your endorsement @paulteutulsr! #BikersForTrump #VoteTrumpNV Video: __HTTP__ __HTTP__ _E_\nIf you have a hard time communicating one way to overcome it is to turn your focus onto your audience. Midas Touch _E_\nThanks for all the nice comments about the @Late_Show last night. I enjoyed it and David enjoyed the ratings. __HTTP__ _E_\nA great @The Masters. The course looks so beautiful. Fantastic for golf and television ratings! _E_\nStop saying I went bankrupt. I never went bankrupt but like many great business people have used the laws to corporate advantage—smart! _E_\nAmazing! Watch @NHLBruins fans take over National Anthem during pregame ceremonies __HTTP__ _E_\nShould not raise taxes in Wisconsin but massive budget deficit. Education roads etc suffering. @DanHenninger lies. @WSJ _E_\nVia @BreitbartNews: \"EXCLUSIVE: TRUMP SMACKS BACK AGAINST MEDIA ATTACKS ON CPAC SPEECH\" __HTTP__ by @mboyle1 _E_\nRemember our six brave heroes who died searching for Bergdahl after he deserted __HTTP__ (h/t @Military_News) _E_\n\"Leverage: don't make deals without it.\" – The Art of the Deal _E_\nMy @FoxNews interview on @gretawire discussing the @RNC convention @BarackObama's sealed records & real estate advice __HTTP__ _E_\nCongratulations @TrumpSoHo on being named a \"Great East Coast Hotels for Teens and Families\" by @ParadeMagazine __HTTP__ _E_\nDerek Jeter's rehab assignment is progressing on schedule. He's a true @Yankees captain. Look forward to seeing him back on the field _E_\nWhen I say I would end Obamacare I would also come up with a plan that would be far better much easier to understand and cost less! 
_E_\nHappy #MothersDay to all the great mothers out there! __HTTP__ _E_\n\"The worst thing you can possibly do in a deal is seem desperate to make it.\" The Art of the Deal _E_\nEntrepreneurs: See each day as an opportunity to show what you can do at the highest level. _E_\nIf the Prez wants to create jobs talk to some business people not liberal intellectuals. _E_\nWatching Pyongyang terrorize Asia today is just amazing! _E_\nPay to play. Collusion. Cover ups. And now bribery? So CROOKED. I will #DrainTheSwamp. __HTTP__ _E_\nRT @AnnCoulter: GREATEST FOREIGN POLICY SPEECH SINCE WASHINGTON'S FAREWELL ADDRESS. _E_\nToday @BarackObama will borrow 40 cents on every dollar he spends from China. Just another day at the office. _E_\nAlways know you could be on the precipice of something great. Donald J. Trump __HTTP__ _E_\nPhony Rubio commercial. I could have settled but won't out of principle! See student surveys. __HTTP__ _E_\nThe @TheView @ABC once great when headed by @BarbaraJWalters is now in total freefall. Whoopi Goldberg is terrible. Very sad! _E_\nI will be meeting with the NRA who has endorsed me about not allowing people on the terrorist watch list or the no fly list to buy guns. _E_\nI'll be on @greta ON THE RECORD tonight at 7 PM _E_\nBy the way New York State MUST LOWER TAXES (and fast) and must start going after all of the energy that lies just below our feet (now)! _E_\nHALF of Americans don't pay income tax despite crippling govt debt... __HTTP__ _E_\n'Majority in Leading EU Nations Support Trump Style Travel Ban' Poll of more than 10000 people in 10 countries... __HTTP__ _E_\nRT @DiamondandSilk: When the President says You're Fired That means: Pack Yo Stuff and Go Not Say You Refuse to Go! #DrainTheSwam... _E_\nRush Limbaugh is great tells it as he sees it really honorable guy! Thanks Rush! #Trump2016 _E_\nRT @FoxNews: .@KellyannePolls: Since @POTUS took office 863000 new jobs were filled by women. Over half a million American women have en... 
_E_\nAn important part of my (or anybody's) success is the ability to judge people. I believe that @MileyCyrus is a really good person. _E_\nIf FM @AlexSalmond needs to litter Scotland w/ ugly industrial wind turbines to gain independence he will lose! __HTTP__ _E_\nI will be interviewed on @CNN @NewDay at 7:30 A.M. Enjoy! _E_\nLet's Trump the Establishment! We are no longer silent. We will Make America Great Again! __HTTP__ _E_\nOutrageous @BarackObama is trying to unilaterally gut welfare reform __HTTP__ He doesn't believe in a strong work ethic. _E_\nStamps are going up once again. Now the US postal service will lose even more money. _E_\nMy @extratv interview discussing @Rosie's new baby my acceptance of @billmaher's $5M offer & hiring @_KatherineWebb __HTTP__ _E_\n.@bobvanderplaats asked me to do an event. The people holding the event called me to say he wanted $100000 for himself.Phony @foxandfriends _E_\nTrust your instincts. They are there for a reason. Without instincts you'll have a hard time getting to and staying at the top. _E_\nWhen I jokingly said bring back Steve Jobs to run Apple because Apple has not been doing well the haters & losers had a field day! Sad. _E_\nLyin' Ted Cruz just used a picture of Melania from a G.Q. shoot in his ad. Be careful Lyin' Ted or I will spill the beans on your wife! _E_\nStock market hits new high with longest winning streak in decades. Great level of confidence and optimism even before tax plan rollout! _E_\n\"Donald Trump to address SC Tea Party Coalition at Myrtle Beach event\" __HTTP__ via @CarolinaLive by @timmcginniswpde _E_\nWith Terry McAuliffe Gov of Virginia at the Trump Winery in Charlottesville VA largest on East Coast. @GovernorVA __HTTP__ _E_\nYou have no idea what my strategy on ISIS is and neither does ISIS (a good thing). Please get your facts straight thanks. @megynkelly _E_\nJoin my team tonight at 8:30pmE! __HTTP__ __HTTP__ _E_\n.@TimTebow has tremendous talent and a proven ability to lead. 
He deserves to be in the @nfl. _E_\nAmerica's Labor Market Continues to Boom JOBS JOBS JOBS! __HTTP__ _E_\nHow Trump Won And How The Media Missed It __HTTP__ _E_\nChina is driving the price of gold up in order to ease pressure against Iranian sanctions. __HTTP__ _E_\n.@FoxNews is the only network that does not even mention my very successful event last night. $6000000 raised in one hour for our VETS. _E_\nPractice positive thinking this will keep you focused while weeding out anything that is unnecessary negative or detrimental. _E_\nThe cheap 12 inch sq. marble tiles behind speaker at UN always bothered me. I will replace with beautiful large marble slabs if they ask me. _E_\n.@Linda_McMahon is an elite businesswoman who will bring a great outlook to DC. Support her campaign here __HTTP__ _E_\nThe debate tonight will be a total disaster low ratings with advertisers and advertising rates dropping like a rock. I hate to see this. _E_\nThank you @rushlimbaugh for your wonderful words. We will #MakeAmericaGreatAgain _E_\n.@Lord_Sugar If you didn't say the iPod would be gone in a year you might have been really rich instead of the peanut money you have. _E_\n\"Protect the downside and the upside will take care of itself. – The Art of the Deal _E_\nCongrats to @TimTebow on making @Patriots' first cut. Stay strong and positive! We are all rooting for you. _E_\nNew on our YouTube channel today is a brand new #trumpdocumentary giving you a look inside the world of Trump Golf... __HTTP__ _E_\nIf any candidate believes that with what we know today we still should have invaded Iraq then they are unqualified to be Commander in Chief. _E_\nThe dying @VanityFair's circulation has \"dropped\" & its newsstand sales have \"plummeted by 20.1 percent\" __HTTP__ _E_\nI am truly honored to have been chosen Statesman of the Year by the Republican Party of Sarasota County. The (cont) __HTTP__ _E_\nLooked at plans for Trump Doral Country Club today. It will be amazing! 
Glad to be in Miami. _E_\nIf elected I will undo all of Obama's executive orders. I will deliver. Let's Make America Great Again! __HTTP__ _E_\nEntrepreneurs: Resolve to be bigger than your problems. Who's the boss? Don't negate your own power. _E_\nI've done the largest house sale in U.S. history by selling a Palm Beach mansion for $100M $60M more than I paid. I love real estate. _E_\nMar a Lago in Palm Beach is one of the great palazzos of the world with a fantastic history. __HTTP__ _E_\nWatch the latest From The Desk Of Donald Trump at __HTTP__ and read this article __HTTP__ _E_\nTrump urges GOP to be 'mean as hell' __HTTP__ Via @CNNPolitics _E_\nLooks like two time failed candidate Mitt Romney is going to be telling Republicans how to get elected. Not a good messenger! _E_\nI truly understood the appeal of Ron Paul but his son @RandPaul didn't get the right gene. _E_\nMost people think small because most people are afraid of success afraid of making decisions afraid of winning. The Art of the Deal _E_\n.@oreillyfactor The people of Iowa love the fact that I stuck up for my rights as I will do for the U.S. Also got $6000000 for our VETS! _E_\nResolve never to quit never to give up no matter what the situation. @jacknicklaus _E_\nDonald Trump backs 'Apprentice' Randal Pinkett for N.J. Lieutenant Governor: __HTTP__ _E_\nApparently @MartinBashir said something about me on his show yesterday. I was surprised to find out he is on TV. Who knew?! _E_\nThey don't like Rubio in Florida he left them high & dry. Doesn't even show up for votes! _E_\nRepublicans sorry but I've been hearing about Repeal & Replace for 7 years didn't happen! Even worse the Senate Filibuster Rule will.... _E_\nThe Euro is going to collapse soon. Cross border lending is already down and banks are stopping their Euro investments. _E_\nHorrible killing of a 13 year old American girl at her home in Israel by a Palestinian terrorist. We must get tough. 
__HTTP__ _E_\nI am working hard even on Thanksgiving trying to get Carrier A.C. Company to stay in the U.S. (Indiana). MAKING PROGRESS Will know soon! _E_\n#TrumpVine A message for @AnthonyWeiner __HTTP__ _E_\nOn the 13th tee box @TrumpScotland with my grand daughter Kai! @DonaldJTrumpJr __HTTP__ _E_\nThe Democrats will only vote for Tax Increases. Hopefully all Senate Republicans will vote for the largest Tax Cuts in U.S. history. _E_\nRT @JohnStossel: I can skate here ONLY b/c @realdonaldtrump fixed this rink after NYC gov't spent $13M but FAILED! Good for Trump! __HTTP__ _E_\nVia @BPolitics by @Griffin Aboard Donald Trump's 757 at the South Carolina Tea Party Convention __HTTP__ _E_\nOn behalf of an entire nation Happy 242nd Birthday to the men and women of the United States Marines!#USMC242 #SemperFi __HTTP__ _E_\nAnn Romney is a fantastic lady. She was great in thanking people last night! __HTTP__ _E_\nOur nation is a once great nation divided! _E_\nBe sure to watch my wonderful wife Melania Trump tonight on @QVC at 1AM EST _E_\n..... I wonder if Angelo has a job or is on assistance. In any event I'm sure he is a nice guy! _E_\nCongratulations to @FoxNews for winning November in the cable news rating race with 9 of 10 top shows __HTTP__ _E_\nMy @greta int. on @FoxNews on how to defeat ISIS Obama losing ground to ISIS & Making America Great Again! __HTTP__ _E_\nMiss USA pageant had a 4 to 1 vote in favor but it won't be in Miami Doral in 2014 Mayor Boria voted against it. I want total support! _E_\nNew study shows 80% of Congress have no business experience it shows! _E_\nAmyMek Amen! @realDonaldTrump has drawn more attention to Veterans issues in 1 week than these politicians have in decades! _E_\nGreat poll numbers for @MittRomney just out he is leading substantially in swing states. _E_\nNEW POLL: Trump Blue Collar Support highest since FDR in 1930s WOW! __HTTP__ _E_\nWe have to repeal & replace #Obamacare! Look at what is doing to people! 
#DrainTheSwamp __HTTP__ _E_\nNice story from @businessinsider __HTTP__ _E_\nWe have tremendous economic power over China if our leaders knew how to use it which they don't! China's economy would collapse without us. _E_\nEveryone's favorite frontman Twisted Sister lead singer @deesnider returns to this year's All Star @ApprenticeNBC. Dee does great! _E_\nAnother good poll result in the great state of SC. Trump at 30%. Carson at 15% and Bush at 9%. __HTTP__ _E_\n\"We have a president who has a vendetta against businesspeople and considers them the enemy. He's also (cont) __HTTP__ _E_\n\"Donald Trump launches new men's fragrance Empire @Macys Because every man has his own empire to build'\" __HTTP__ _E_\nAs bad as they were I don't remember our embassies being attacked when Mubarak and Gaddafi were in power. _E_\nThe Failing @nytimes the pipe organ for the Democrat Party has become a virtual lobbyist for them with regard to our massive Tax Cut Bill. They are wrong so often that now I know we have a winner! _E_\nThug Politics. Lightweight hack Schneiderman meets with Obama on Thursday then brings frivolous suit on Saturday. _E_\nWashington should have brought in Strasburg to relieve they would have won. _E_\nA wonderful story on Iowa voters by @arappeport of the @NYTimes. __HTTP__ _E_\nThe speakers slots at the Republican Convention are totally filled with a long waiting list of those that want to speak Wednesday release _E_\nWith gas prices rising and the economy failing @BarackObama seeks to have his EPA raise energy prices by $109B __HTTP__ _E_\nRT @DRUDGE_REPORT: REUTERS ROLLING: TRUMP 39% CRUZ 14.5% BUSH 10.6% CARSON 9.6% RUBIO 6.7%... MORE... __HTTP__ _E_\n.@EricTrump did an amazing job raising money for @StJude with his @EricTrumpFDN event featuring @LisaLampanelli. Watch __HTTP__ _E_\nA big salute to Jerry Jones owner of the Dallas Cowboys who will BENCH players who disrespect our Flag. Stand for Anthem or sit for game! 
_E_\nand stay at the fantastic Trump International Hotel Las Vegas ... __HTTP__ _E_\nThank you Fort Lauderdale Florida. #MakeAmericaGreatAgain __HTTP__ _E_\nStop and frisk works. Instead of criticizing @NY_POLICE Chief Ray Kelly New Yorkers should be thanking him for keeping NY safe. _E_\nRT @foxandfriends: Insurers seeking huge premium hikes on ObamaCare plans __HTTP__ _E_\nPaul Ryan a man who doesn't know how to win (including failed run four years ago) must start focusing on the budget military vets etc. _E_\nDon't believe the media stories. OPEC and the Saudis have not been doing us any favors recently with oil outputs. Oil should be $30/barrel. _E_\nI will defeat Crooked Hillary Clinton on 11/8/2016. #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\n\"Palin's brand among evangelicals is as gold as the faucets in Trump tower\" said Ralph Reed the chairman of the Faith & Freedom Coalition. _E_\n\"Fans like winners. They come to watch stars – great exciting players who do great exciting things.\" The Art of The Deal _E_\nGreat day in Colorado & Arizona. Will be in Nevada Colorado and New Mexico tomorrow join me!Tickets:... __HTTP__ _E_\nHave passion drive and enthusiasm? You can check out the @TrumpCollection careers here: __HTTP__ _E_\n\"The most important thing in communication is hearing what isn't said.\" Peter Drucker _E_\n\"Sometimes life hits you in the head with a brick. Don't lose faith.\" Steve Jobs _E_\nThe 13th season of All Star @CelebApprentice is unique. We really pushed the envelope here. Our great and loyal fans will love it. _E_\nIn any business there will be ups and downs. If you can weather the rough times your success will be even greater during high times. _E_\nCLINTON CORRUPTION AND HER SABOTAGE OF THE INNER CITIES. Full speech transcript: __HTTP__ _E_\nI will be going to Atlanta Georgia tomorrow—here's the info: __HTTP__ Hope to see you there! #MakeAmericaGreatAgain! 
_E_\n.@BillMaher didn't come through with his promised $5 million for charity so today I will sue him. _E_\nOPEC is better off than they were 4 years ago. Gas has more than doubled during @BarackObama's term. Outrageous! _E_\nThe Republican Party has to be smart & strong if it wants to win in November. Can't allow lightweights to set up a spoiler Indie candidate! _E_\n....This now allows for the passage of large scale Tax Cuts (and Reform) which will be the biggest in the history of our country! _E_\nWith @ivankatrump and the Chairman of DAMAC in Dubai. __HTTP__ _E_\n.@hardball_chris says he's \"glad\" we had a hurricane! With many people dying and thousands hurting MSNBC (cont) __HTTP__ _E_\nGreat Gravis Poll on the great state of NH. Also watch @FaceTheNation on CBS & @HowardKurtz #mediabuzz both on Sunday. _E_\nThe so called bipartisan DACA deal presented yesterday to myself and a group of Republican Senators and Congressmen was a big step backwards. Wall was not properly funded Chain & Lottery were made worse and USA would be forced to take large numbers of people from high crime..... _E_\nOur country needs strong borders and extreme vetting NOW. Look what is happening all over Europe and indeed the world a horrible mess! _E_\nCongrats @NBCInvestigates on revealing that Obama knew millions of Americans would lose their healthcare plans __HTTP__ _E_\nCongrats to @TrumpWaikiki celebrating 51 consecutive months as the #1 Honolulu Hotel on @TripAdvisor! _E_\nThe U.S. cannot negotiate with terrorists. It is a sad and terrible situation for the family involved but this can only lead to disaster. _E_\n.... to help McConnell who spoke right after him.\"@BreitbartNews _E_\nIf Republican Senators are unable to pass what they are working on now they should immediately REPEAL and then REPLACE at a later date! _E_\n\"You may have to try a lot of things to get just one thing to work. 
That's tenacity and it's critical to success.\" – Trump Never Give Up _E_\nIf the people so violently shot down in Paris had guns at least they would have had a fighting chance. _E_\nWow! Senator Mark Warner got caught having extensive contact with a lobbyist for a Russian oligarch. Warner did not want a \"paper trail\" on a \"private\" meeting (in London) he requested with Steele of fraudulent Dossier fame. All tied into Crooked Hillary. _E_\nRT @foxandfriends: .@GeraldoRivera: Chances of impeachment went from 3% to 0% with Comey's testimony __HTTP__ _E_\nJust landed in Da Nang Vietnam to deliver a speech at #APEC2017 _E_\nThank you Richmond Virginia! #Trump2016 __HTTP__ _E_\nThank you Iowa! #Trump2016 __HTTP__ _E_\nWhether I choose him or not for State Rex Tillerson the Chairman & CEO of ExxonMobil is a world class player and dealmaker. Stay tuned! _E_\nI will be live tweeting during the Celebrity Apprentice at 9 P.M. Also will be hosting Dateline just prior to Apprentice at 8 P.M. _E_\nMy interview from last night with @piersmorgan discussing OWS __HTTP__ _E_\nI just returned from Iowa what a beautiful state. The people are amazing and the event for Congressman Steve King was a great success! _E_\nWill be on @foxandfriends at 7.00. (30 minutes). A great deal to talk about including Ebola quarantine. _E_\nThe Trans Pacific Partnership will increase our trade deficits & send even more jobs overseas. This is a bad deal. Time for smart trade! _E_\nFace The Nation's interview of me was the highest rated show that they have had in 15 years. Congratulations and WOW! @CBSNews @jdickerson _E_\nJoin us live in the Oval Office for the swearing in of our new Attorney General @SenatorSessions!LIVE:... __HTTP__ _E_\nBusiness is an art in itself & powerful negotiation skills are one of the techniques necessary to facilitate success. Think Like a Champion _E_\nElections have consequences. 
Obama just published \"final regulations for ObamaCare's individual mandate\" __HTTP__ Enjoy! _E_\nRead a great interview with Donald Trump that appeared in The New York Times Magazine: __HTTP__ _E_\n.@EWErickson got fired like a dog from RedStateand now he is the one leading opposition against me. _E_\nDon't let the GLOBAL WARMING wiseguys get away with changing the name to CLIMATE CHANGE because the FACTS do not let GW tag to work anymore! _E_\n.@Yankees should get rid of A Rod ASAP I can't watch this guy anymore! _E_\nIt's Tuesday how much will the media continue to cover up the embassy attacks for Obama? _E_\nIt is great to meet fellow patriots at the #TimeToGetTough book signings. Can't wait to meet more today at Trump Tower from 12PM to 2PM _E_\nAccording to new employment numbers 296000 Americans have dropped out of the work force & gave up looking for work. _E_\nWow it's snowing in Isreal and on the pyramids in Egypt. Are we still wasting billions on the global warming con? MAKE U.S. COMPETITIVE! _E_\nTonight I will be on @FoxNews with @SeanHannity at 10pm and @CNN w/ @AndersonCooper at 10:10pm. Enjoy! #VoteTrumpSC #Trump2016 _E_\nDonald Trump shocked by 'stupid decision' about @OMAROSA on '@ApprenticeNBC' __HTTP__ @TODAY_Clicker _E_\nObama's war on coal is killing American jobs making us more energy dependent on our enemies & creating a great business disadvantage. _E_\nWith @StephenBaldwin7 earlier today at @ApprenticeNBC press conference in @TrumpTowerNY. __HTTP__ _E_\nBy Scotland officials canceling my local ad about how damaging wind turbines are it became a much bigger story around the world. Great! _E_\nWhy does @BarackObama always have to rely on teleprompters? _E_\nI will be on @foxandfriends at 8:30 A.M. Will be talking about lightweight Marco Rubio and lying Ted Cruz! _E_\nObama will be trying very hard at next debate he doesn't want to lose the Boeing. _E_\nWe need a real President! 
__HTTP__ _E_\nWhen your life flashes before your eyes make sure you've got plenty to watch. Anonymous _E_\nHow stressed are @lisarinna and @pennjillette already? #CelebApprentice _E_\nLooking forward to giving keynote speech tonight @ChesterfieldGOP Lincoln Reagan dinner in Virginia. _E_\nVia @bostonherald by Eugene R. Dunn: \"Iran a clear danger\" __HTTP__ _E_\nI hear this moron @billmaher said nasty things about me (hair etc—boring) on the terminated @jayleno show. Stupid guy/bad ratings! _E_\nUnbelievable evening in Melbourne Florida w/ 15000 supporters and an additional 12000 who could not get in. Tha... __HTTP__ _E_\nOn behalf of @FLOTUS Melania and I THANK YOU for an unforgettable afternoon and evening at the Forbidden City in Beijing President Xi and Madame Peng Liyuan. We are looking forward to rejoining you tomorrow morning! __HTTP__ _E_\n...allegations of unmasking Trump transition officials. Not good! _E_\nWow NY Observer story about @AGSchneiderman really exposes him as a sleazebag & crook. He's bad for New York. __HTTP__ _E_\nDoes Madonna know something we all don't about Barack? At a concert she said we have a black Muslim in the White House. _E_\nHow do you fight millions of dollars of fraudulent commercials pushing for crooked politicians? I will be using Facebook & Twitter. Watch! _E_\nThank you for a great night at the Verizon Wireless Arena New Hampshire! #VoteTrumpNH#MakeAmericaGreatAgain #FITN __HTTP__ _E_\nThere is no better place in the world to spend Christmas than Mar a Lago __HTTP__ in Palm Beach Florida. _E_\nJust finished another week of filming @ApprenticeNBC. This season a record 14th is shaping up to be the best yet. _E_\nLeaving soon after a great time in New Hampshire a truly special place! _E_\nFrom ABC News: In Demand: Washington's Highest (and lowest) Speaking Fees by Scott Wilson __HTTP__ _E_\nDon't forget to watch Celebrity Apprentice tonight at 9pm...you will love it! _E_\nLanding in Pennsylvania now. 
Great new poll this morning thank you. Lets #DrainTheSwamp and #MakeAmericaGreatAgain... __HTTP__ _E_\nWill be playing golf today with Rand Paul at Trump International in Palm Beach. Will be both interesting and fun! _E_\nCan you believe that our very stupid politicians released the leader of ISIS and now we are spending billions trying to get him back! _E_\nVia the Washington Post: Inside the World of Donald Trump's Super Fans: __HTTP__ _E_\nRT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_\nHurricane looks like largest ever recorded in the Atlantic! _E_\nGreat news that @ehasselbeck will be joining @foxandfriends. Elisabeth is a tremendous person and will be missed on @theviewtv. _E_\nThank you to the @nydailynews for a very nice story __HTTP__ _E_\nThank you to @exxonmobil for your $20 billion investment that is creating more than 45000 manufacturing & construction jobs in the USA! _E_\nLike her or not Hillary did what she had to do in the debate last night—get through it. Her opponents were very gentle and soft! _E_\nObama's nuclear deal with the Iranians will lead to a nuclear arms race in the Middle East. It has to be stopped. _E_\nMiss Universe contestants are amazing—the most beautiful ever! _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nVia @politicalwire: Tweet of the Day __HTTP__ _E_\nNow that Obama's poll numbers are in tailspin – watch for him to launch a strike in Libya or Iran. He is desperate. _E_\nOf course there is large scale voter fraud happening on and before election day. Why do Republican leaders deny what is going on? So naive! _E_\nto make up their own minds as to the truth. The media lies to make it look like I am against Intelligence when in fact I am a big fan! _E_\nCongratulations to @seanhannity on his tremendous increase in television ratings. Speaking of ratings I will be on his show tonight @ 10pE. 
_E_\nMy interview with @NYDNGatecrasher discussing @BarackObama's #WHCD and my endorsement of @MittRomney __HTTP__ _E_\nHappy #SmallBusinessSaturday!A great day to support your community and America's JOB creators by shopping locally at a #SmallBiz. #ShopSmall __HTTP__ _E_\nVia @TODAY_Clicker: Donald Trump promises 'tough and mean and nasty' 'Celebrity Apprentice' __HTTP__ _E_\n.@GovernorPataki did a terrible job as Governor of New York. If he ran again he would have lost in a landslide. He and Graham ZERO in polls _E_\nThis week the Senate can join the House & take a strong stand for the Middle Class families who are the backbone of America. Together we will give the American people a big beautiful Christmas present a massive tax cut that lets Americans keep more of their HARD EARNED MONEY! __HTTP__ _E_\nObama spoke to the Mexican president last week & did not mention UMC Sgt. Tahmooressi. Sad! _E_\nI find the photos of these children killed in Newtown in the New York Post heartbreaking.#Angels _E_\nIt's Monday. How much will premiums rise today because of ObamaCare? REPEAL! _E_\nThank you Michael Harrison @Talkersmagazine for your kind words greatly appreciated! _E_\nA fact golfers don't get aches & pains like others who don't golf. It is amazingly remedial. _E_\n...popular vote. ABC News/Washington Post Poll (wrong big on election) said almost all stand by their vote on me & 53% said strong leader. _E_\nRT @DonnaWR8: @realDonaldTrump Thank you @POTUS for believing in Us like we believed in you! #MAGA __HTTP__ _E_\nIf the presidential election were held today according to this @surveyusa poll Donald Trump would defeat any Dem: __HTTP__ _E_\nJust read @PiersMorgan's book \"Shooting Straight\" and whether you love him or hate him (I'm in the first category) it is terrific. _E_\n\"Miss Universe Ratings 6.1 Million Viewers Best Since 2008\" __HTTP__ _E_\nWe will have the votes for Healthcare but not for the reconciliation deadline of Friday after which we need 60. 
Get rid of Filibuster Rule! _E_\n#CrookedHillary gives Obama an \"A\" for an economic recovery that's the slowest since WWII... #BigLeagueTruth... __HTTP__ _E_\nHow foolish did @davidaxelrod look yesterday trying to rationalize why @BarackObama accepts donations from Bain? __HTTP__ _E_\nThe problem with agreeing to a policy on immigration is that the Democrats don't want secure bordersthey don't care about safety for U.S.A. _E_\nICYMI via @foxnewsinsider my @foxandfriends from yesterday on Obama's dangerous disconnect __HTTP__ _E_\nWhy is @BarackObama letting the Taliban know when our troops are leaving? __HTTP__ This is dangerous for our soldiers. _E_\nI will be holding a major news conference in New York City with my children on December 15 to discuss the fact that I will be leaving my ... _E_\nI will be interviewed on @jaketapper @CNN at 9:00 A.M. and Fox News Sunday with Chris Wallace at 10:O0 A.M. CNN Iowa Poll 13 point lead! _E_\nMy interview on @gretawire discussing the economy and @TheHermanCain Witch Hunt __HTTP__ _E_\nSpend your last day of 2013 contemplating the moves you will make in 2014 to make it your best year ever! _E_\nConsidering Obama hasn't proposed anything concrete if he wins he won't have a mandate. Another 4 years of legislative stalemate. _E_\nThe @nfl ratings continue to fall every week and will keep dropping. Boring games too many flags too soft! _E_\nToday's #trumpvlog answers your tweets about my thoughts on the Republican candidates... __HTTP__ _E_\nLimited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_\nI guess @edshow is a lot smarter than dopes like @JonahNRO & @stephenfhayes. Oh well both mags are dying anyway. __HTTP__ _E_\nTo the brave men and women past and present in our armed services best wishes on Veterans Day. _E_\nPresident Obama you are a complete and total disaster but you have a chance to do something great and important: STOP THE FLIGHTS! 
_E_\nI've been watching very little @CNBC lately—the good news is I'm switching over to @BloombergNews and @FoxNews. _E_\nMobile Alabama today at 3:00 P.M. Last rally of the year THANK YOU ALABAMA AND THE SOUTH Biggest of all crowds expected see you there! _E_\nRT @TeamTrump: \"Police officers are the BEST of us. Law enforcement in this country is a force for GOOD. @mike_pence #VPDebate #BigLeagu... _E_\nCongratulations to Eric & Lara on the birth of their son Eric Luke Trump this morning! __HTTP__ _E_\nEntrepreneurs: Don't be confined by expectations. There are no exact rules for negotiation try to remain flexible and open to new ideas. _E_\nI ask again how much is very wealthy South Korea paying the United States for protecting it against North Korea? _E_\nVia @CBNNews by @TheBrodyFile: \"Donald Trump: 'We Must Make America Great Again'\" __HTTP__ _E_\n.@FoxNews FBI's Andrew McCabe \"in addition to his wife getting all of this money from M (Clinton Puppet) he was using allegedly his FBI Official Email Account to promote her campaign. You obviously cannot do this. These were the people who were investigating Hillary Clinton.\" _E_\nOn the way to the great state of Rhode Island big rally. Then to Pennsylvania for rest of day and night! _E_\nReceived a standing applause at #NCGOPcon when I said to have free trade be fair for the US we need really intelligent negotiators. _E_\nThe entire world understands that the good people of Iran want change and other than the vast military power of the United States that Iran's people are what their leaders fear the most.... __HTTP__ _E_\nMy @CNN interview with @TVAshleigh discussing @MittRomney's electability and @RickSantorum's Senate loss. __HTTP__ _E_\nDonald Trump Ed Koch and the Ice Skating Rink: A Tale of Bureaucracy __HTTP__ @ActonInstitute _E_\nPeople ask why do you tweet and re tweet to millions about @JebBush when he is so low in the polls? Because of his big $ hit ads on me! 
_E_\nVia @BreitbartNews by @TheTonyLee: @Citizens_United sues @AGSchneiderman for violating 1st Amendment __HTTP__ _E_\nDanger Weiner is a free man at 12:01AM. He will be back sexting with a vengeance. All women remain on alert. _E_\nGood luck to @joniernst. You will make a wonderful Senator. _E_\nWe have a president who has a vendetta against businesspeople and considers them the enemy. #TimeToGetTough (cont) __HTTP__ _E_\nIf we do not protect the rule of law then we can expect even more illegals to cross the border. Obama's executive amnesty is dangerous. _E_\n....it is very possible that those sources don't exist but are made up by fake news writers. #FakeNews is the enemy! _E_\nAlmost every T.V. show is asking me to go on especially the @Late_Show. It's simple I get the ratings! _E_\nWhy does the federal government send foreign aid to China? Unbelievable! Washington is financing America's de... (cont) __HTTP__ _E_\nFailed Presidential Candidate Mitt Romney was campaigning with John Kasich & Marco Rubio and now he is endorsing Ted Cruz. 1/2 _E_\nHere's a great video of the official launch of my new fragrance #Success @Macys Herald Square __HTTP__ _E_\nDenver Minnesota and others are bracing for some of the coldest weather on record. What are the global warming geniuses saying about this? _E_\nManiac Sergeant who went on a killing spree in Afghanistan must be punished big time and quickly. _E_\nLimited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ __HTTP__ _E_\nThe pathetic new hit ad against me misrepresents the final line. You can tell them to go BLANK themselves was about China NOT WOMEN! _E_\nI asked @VP Pence to leave stadium if any players kneeled disrespecting our country. I am proud of him and @SecondLady Karen. _E_\nBill Clinton's meeting was a total secret. Nobody was to know about it but he was caught by a local reporter. _E_\nGreat @nytimes story about our conversion of the Old Post Office building in D.C. 
to luxury hotel __HTTP__ _E_\nLittle Marco Rubio the lightweight no show Senator from Florida is just another Washington politician. __HTTP__ _E_\nSee Newsmax story re Republican National Convention __HTTP__ _E_\nTo be successful your focus has to be broad enough to think big at the same time. 'Midas Touch' with @theRealKiyosaki _E_\nI remained strong for @TigerWoods during his difficult period. He rewarded me (and himself) by winning at Trump National Doral. _E_\nLooking forward to @David_Bossie & @RepJeffDuncan's @Citizens_United Freed Summit in Greenville SC this Saturday! _E_\nSoon to be the greatest hotel in U.S. don_trump_jr @ivankatrump @erictrump #OldPostOffice __HTTP__ _E_\nThe so called 'moderate' Syrian rebels pledged their allegiance to ISIS after Obama's address. We should not be arming them! _E_\nReplay of Fox News Sunday With Chris Wallace at 2:00 P.M. on @FoxNews. Big statement made by Chris! _E_\nI pay millions of $'s a year to Florida Power & Light & they can't give us what we want. Maybe a major class action suit against them? _E_\nThe police in London say I'm right. Major article in Daily Mail. \"We can't wear uniform in our own cars.\" __HTTP__ _E_\nI am now in Iowa getting ready to speak. People are always amazed to find out that I am Protestant (Presbyterian). GREAT. _E_\nRoger Ailes just called. He is a great guy & assures me that \"Trump\" will be treated fairly on @FoxNews. His word is always good! _E_\n.@AndrewKreig Thank you Andrew so correct! _E_\nMark Begich votes with Obama 97%. He opposes drilling & supports Amnesty for illegals. Next Tuesday vote @DanSullivan2014! _E_\nGet ready for two amazing episodes of Celebrity Apprentice tomorrow night (Monday) at 8:00. Some incredible things happen! _E_\nObama better than last time but again @MittRomney wins. Good night. #debate _E_\nI don't believe you have to be better than everybody else. I believe you have to better than you ever thought you could be. 
Ken Venturi _E_\nStuart Stevens the failed campaign manager of Mitt Romney's historic loss is now telling the Republican Party what to do with Trump. Sad! _E_\nRemember all these 'freedom fighters' in Syria want to fly planes into our buildings. _E_\nGreat column by David Bossie at @BreitbartNews: \"A Battle Won but the War Continues to Defund ObamaCare\" __HTTP__ _E_\nAfter reading the false reporting and even ferocious anger in some dying magazines it makes me wonder WHY? All I want to do is #MAGA! _E_\nMet with President Putin of Russia who was at #APEC meetings. Good discussions on Syria. Hope for his help to solve along with China the dangerous North Korea crisis. Progress being made. _E_\nRECORD HIGH FOR S & P 500! _E_\n\"Trumps Are Giving @TrumpDoral A Makeover\" __HTTP__ via @CBSMiami _E_\nThe Trans Pacific Partnership is an attack on America's business. It does not stop Japan's currency manipulation. This is a bad deal. _E_\nMy @gretawire interview re: the dismal job report getting ripped off by South Korea 2016 election & #WWEHOF __HTTP__ _E_\n.@FoxNews treats me so badly. Using old Quinnipiac Poll where I have a much smaller lead than the just out @CNN Poll. All negative! _E_\nAddress to the NationFull Video & Transcript: __HTTP__ __HTTP__ _E_\nWhy are we letting the three girls who left the U.S. to join ISIS back into the country? How stupid has our once respected country become! _E_\nVia @ TheScotsman: \"Donald Trump to lay out new golf course plan\" __HTTP__ _E_\nVia @townhallcom by @MattTowery: \"Why Trump Should Run\" __HTTP__ _E_\nNew poll thank you! #Trump2016 __HTTP__ __HTTP__ _E_\n\"If you don't have problems you're pretending or you don't run your own business.\" –Donald J. Trump __HTTP__ _E_\nI will be interviewed by @TuckerCarlson tonight at 9:00 P.M. on @FoxNews. Enjoy! 
_E_\n\"Trump to campaign for @SteveKingIA\" __HTTP__ via @kscj1360 _E_\nWith ZERO Democrats to help and a failed expensive and dangerous ObamaCare as the Dems legacy the Republican Senators are working hard! _E_\nEntrepreneurs: Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_\nJust took off for ceremony @ Pearl Harbor. Will then be heading to Japan SKorea China Vietnam & the Philippines. Will never let you down! _E_\nThe NFL has all sorts of rules and regulations. The only way out for them is to set a rule that you can't kneel during our National Anthem! _E_\n.@EWErickson is a total low life read his past tweets. A dummy with no \"it\" factor. Will fade fast. _E_\nThe New York Giants are looking really bad so far tonight. Does not get much worse than this! _E_\nCrooked Hillary said loudly and for the world to see that she SHORT CIRCUITED when answering a question on her e mails. Very dangerous! _E_\n#SuccessByTrump Here's a photo from my appearance at @Macy's Herald Square with @ximenanr __HTTP__ _E_\nCelebrity Apprentice starts in 15 minutes on NBC. ENJOY! _E_\nLooking forward to being at the @RyderCupUSA announcement tonight. _E_\nThe Miss U.S.A. pageant will be amazing tonight. To be politically incorrect the girls (women) are REALLY BEAUTIFUL. NBC at 8 PM. _E_\nGood move by Aubrey to be the red headed model they didn't have. #sweepstweet _E_\n.@EdRendell's book A Nation of Wusses is an excellent read especially page 10. Go get it! _E_\nDoes any Republican have the ability to negotiate? _E_\nWe need jobs & we need them fast. I am a job creator. None of the pols can or will. Let's Make America Great Again! __HTTP__ _E_\nIf Mitch McConnell wants to win his election he'd better get rid of jinxed Karl Rove and fast... _E_\nSo nice of @Cher greatly appreciated! __HTTP__ _E_\n#CelebApprentice Time for the first firing of the night. 
_E_\n.@marcthiessen is a failed Bush speechwriter whose work was so bad that he has never been able to make a comeback. A third rate talent! _E_\nThe Fannie and Freddie execs should not get million dollar bonuses with our tax dollars. They were bailed out with $169B of our money. _E_\n\"The only source of knowledge is experience.\" – Albert Einstein _E_\nWatch #MissUSA 2012 live tonight on @NBC at 9PM EST! _E_\nWhy is the @GOP being asked to do a debate that is so much longer than the just aired and very boring #DemDebate? _E_\nIt was a great honor to welcome Atlanta's heroic first responders to the White House this afternoon! __HTTP__ _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nWhen Strasburg leaves @Nationals for another team for more money will Washington still like the decision to shut him down for his good? _E_\nStatement on International Holocaust Remembrance Day: __HTTP__ _E_\nGood luck to the US Men's National Team in tomorrow's CONCACAF Cup vs. Mexico! It should be a great game! __HTTP__ _E_\nNo such meeting or conversation ever happened a made up story by low ratings @CNN. _E_\nCongratulations to Brandy as our new Apprentice and to Clint for being a great player. It's been a terrific season! _E_\nI am hearing that @NRCC Digital Director @lansing is doing great work expanding and modernizing @GOP social media. Good – we need it. _E_\nRT @jayMAGA45: NFLplayer PatTillman joined U.S. Army in 2002. He was killed in action 2004. He fought 4our country/freedom. #StandForOurAnt... _E_\nWe blow up the famous Blue Monster at Trump National Doral on.Monday in order to build a spectacular new bigger and better Blue Monster! _E_\nWith the world's top amenities @TrumpTO's luxury residential condominiums provide the ultimate Toronto lifestyle __HTTP__ _E_\nFind out who and what is the best in your field. Identify the trendsetters leaders and authorities. Learn the standards they follow. 
_E_\nRemember to think big by expanding your horizons at the same time you're expanding your net worth. _E_\nRT @EricTrump: Join my family in this incredible movement to #MakeAmericaGreatAgain!! Now it is up to you! Please #VOTE for America! __HTTP__ _E_\nYea NBC has increased all remaining Celebrity Apprentice episodes to two hours starting at 9 P.M. on Sunday! Amazing show. _E_\nTo all young college graduates – stick in there keep your head up and make sure you don't miss any opportunities. They are out there. _E_\nTo be called Trump Links at Ferry Point course will be GREAT and over the years hold many tournaments and major championships $'s to NYC. _E_\n#TrumpAdvice __HTTP__ _E_\nI have instructed Homeland Security to check people coming into our country VERY CAREFULLY. The courts are making the job very difficult! _E_\nRT @Morning_Joe: VIDEO: @realDonaldTrump announces 'a very powerful endorsement' will be coming today. __HTTP__ _E_\nGreat Twitter poll and I wasn't even there. Thank you! #GOPDebate __HTTP__ _E_\nMitt did the right thing—not because he had to but because he never would have been given a second chance after his first fiasco _E_\nJust did final purchase on fabulous @LodgeatDoonbeg in Ireland. Will become Trump International Hotel & Golf Links Ireland. Very exciting! _E_\nVery exciting. I will be at Macy's Herald Square this Wednesday at 5:30pm to celebrate the launch of Trump Home crystal! _E_\nThank you @ShopFloorNAM. An honor to be with you today. Great news! Manufacturers report record high economic optimism in 2017. #TaxReform __HTTP__ _E_\nOur clubhouse facility & suites in Ireland @LodgeatDoonbeg #TrumpIreland __HTTP__ __HTTP__ _E_\n46% of Americans think the Media is inventing stories about Trump & his Administration. @FoxNews It is actually much worse than this! _E_\nIf Russia or any other country or person has Hillary Clinton's 33000 illegally deleted emails perhaps they should share them with the FBI! 
_E_\nRT @Scavino45: Under POTUS' @realDonaldTrump S&P 500 38th📈Record High NASDAQ 44th📈Record High#MakeAmericaGreatAgain __HTTP__ _E_\nHas anyone seen the financials of @Univision. They are doing really badly. Too much debt and not enough viewers. Need money fast. Funny! _E_\nMike Flynn should ask for immunity in that this is a witch hunt (excuse for big election loss) by media & Dems of historic proportion! _E_\nThe @Lakers should have an amazing team next year with Kobe Nash and Howard. Will be fun to watch. _E_\nVia @GolfweekMag: Major makeover: Trump has big vision for Doral __HTTP__ by @BKleinGolfweek _E_\nThank you! #AmericaFirst __HTTP__ _E_\nJoin me on #FacebookLive as I conclude my final #debate preparations. __HTTP__ _E_\n#sweepstweet Teresa seems to underestimate the power of observance—that of the client as well as her team but she's a wonderful person _E_\nUnderstand that difficulties mistakes & setbacks are an inevitable part of business and life. Don't allow them to knock you off your feet. _E_\n\"Relax & clear your mind if someone is speaking so that you're receptive to what they're saying.\" – Roger Ailes You are the Message _E_\nWith 18 beautiful holes each boasting unique characteristics Trump Nat'l Philadelphia is a Golf treasure __HTTP__ _E_\nJeb Bush just announced he raised over $100M. Everyone of those people who contributed are getting something to the detriment of America! _E_\nCan you believe thatwith all of the problems and difficulties facing the U.S. President Obama spent the day playing golf.Worse than Carter _E_\nThe CIA deserves our praise for taking the fight to the enemy in the dark corners of the world. The CIA perseveres the politicians whine! _E_\nWake up Jeb supporters! __HTTP__ _E_\nTHE APPRENTICE. 10 years 182 shows many at number one for week or night Amazing! @NBC _E_\nI am honored to be receiving the American Spectator Foundation Award for excellence in entrepreneurialism in Washington DC this fall. 
_E_\nJust landed in North Carolina heading to the J.S. Dorton Arena. See you all soon! Lets #MakeAmericaGreatAgain! __HTTP__ _E_\nHeading into the 12 days with great negotiating strength because of our tremendous economy. __HTTP__ _E_\n\"I also protect myself by being flexible. I never get too attached to one deal or one approach.\" – THE ART OF THE DEAL _E_\n...Maybe the best thing to do would be to cancel all future press briefings and hand out written responses for the sake of accuracy??? _E_\nThe @WSJ Wall Street Journal loves to write badly about me. They better be careful or I will unleash big time on them. Look forward to it! _E_\nMy @nbcdfw int. by @EricKingNBC5 w/@IvankaTrump discussing the Sunday @nbc premiere of @ApprenticeNBC's 14th season __HTTP__ _E_\nTrump Int'l Hotel & Tower Chicago has received accolades for design service & our signature restaurant Sixteen __HTTP__ _E_\nI was referring to a backstop for pre existing conditions. I will eliminate the law in its entirety & replace it w/ something much better. _E_\nThis Russian connection non sense is merely an attempt to cover up the many mistakes made in Hillary Clinton's losing campaign. _E_\nOn 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_\nLets #MakeAmericaGreatAgain Maryland! #VoteTrump __HTTP__ _E_\n. @Newsmax__Media is one of the top media outlets in the country. @ChrisRuddyNMX has revolutionized political commentary and reporting. _E_\nThe deplorables came back to haunt Hillary.They expressed their feelings loud and clear. She spent big money but in the end had no game! _E_\nAxl Rose should take his #rockhall2012 honors and be happy. Stop the no induction nonsense. Do it for your fans @axlrose. _E_\nPocahontas is at it again! Goofy Elizabeth Warren one of the least productive U.S. Senators has a nasty mouth. Hope she is V.P. choice. 
_E_\n'The Clinton Foundation's Most Questionable Foreign Donations'#PayToPlay #DrainTheSwamp __HTTP__ _E_\nBig announcement coming soon regarding South Carolina... _E_\nHope & Change. Millions are losing their healthcare plans & ObamaCare is taking cancer patients' doctors away __HTTP__ _E_\nThe White House should stop publicly pressuring Israel on Iran. Iran's nuclear program is the threat not Israel's right to self defense. _E_\n... the ratings of Shark Tank. Everyone was hitting on me until the numbers came in—and now—dead silence! _E_\nWork has begun ahead of schedule to build the greatest golf course in history: Trump International – Scotland. _E_\nMy statement as to what's happening in Sweden was in reference to a story that was broadcast on @FoxNews concerning immigrants & Sweden. _E_\nI am not trying to get top level security clearance for my children. This was a typically false news story. _E_\nI will be on The Situation Room with @wolfblitzer from 5 7pm est on CNN _E_\n.@CoachDanMullen Great to have you and your GREAT team at Trump National Doral. Go out and finish your fantastic season in style! _E_\nGreat meeting with Governor Mapp of the #USVI. He is very thankful for the great job done by @FEMA and First Responders. __HTTP__ _E_\n#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nRemember @JebBush wants COMMON CORE (education from D.C.) and is very weak on ILLEGAL IMMIGRATION ( come as act of love ). Not a leader! _E_\nAn honor to welcome the Taoiseach of Ireland @EndaKennyTD to the @WhiteHouse today with @VP Pence. __HTTP__ _E_\nLooking forward to being interviewed on the @marklevinshow tonight at 6:30 PM EST. Be sure to listen! _E_\nCrooked Hillary Clinton was not at all loyal to the person in her rigged system that pushed her over the top DWS. Too bad Bernie flamed out _E_\nIran is going to buy 114 jetliners with a small part of the $150 billion we are giving them...but they won't buy from U.S. rather Airbus! 
_E_\nSad just 16% of American parents think their children will be better off than them __HTTP__ We can do much better! _E_\nPlaying politics with the Keystone decision? @BarackObama vetos 20000 jobs and cheaper oil. _E_\nVia @gatewaypundit: \"Mother of Murdered Teen Thanks Donald Trump During Senate Hearing\" __HTTP__ _E_\nI will be interviewed on @seanhannity tonight at 10pmE on @FoxNews. Enjoy! _E_\nAmerica wasted billions and precious lives in Iraq and Iran will soon take control very very sad. _E_\nThe Dems want to stop tax cuts good healthcare and Border Security.Their ObamaCare is dead with 100% increases in P's. Vote now for Karen H _E_\nIf you really want to succeed you'll have to go for it every day. The big time isn't for slackers. Keep up your stamina and remain curious. _E_\nRT @AnnCoulter: I hear Churchill had a nice turn of phrase but Trump's immigration speech is the most magnificent speech ever given. _E_\nWindmills are the greatest threat in the US to both bald and golden eagles. Media claims fictional 'global warming' is worse. _E_\nThank you America! #Trump2016 __HTTP__ _E_\n...never allow the Republicans to pass even great legislation. 8 Dems control will rarely get 60 (vs. 51) votes. It is a Repub Death Wish! _E_\nMust read article by @boonepickens & @AmbJohnBolton: \"America's Untapped Energy Weapon\" __HTTP__ We don't need foreign oil! _E_\nCrooked @HillaryClinton's foundation is a CRIMINAL ENTERPRISE. Time to #DrainTheSwamp! __HTTP__ #BigLeagueTruth #Debate _E_\n\"Government's first duty is to protect the people not run their lives.\" – Ronald Reagan _E_\nThe @MissUSA 2012 contestants standing outside of Trump Tower in New York City __HTTP__ @MissUSA 2012 tomorrow at 9PM ET NBC. _E_\n.@TrumpChicago's Spa has an array of 5 star services12 treatment rooms & 53 spa guestrooms w/great views __HTTP__ _E_\nThe independent watchdog who exonerated @BarackObama for the failed green energy loans just donated $52500 to Obama's campaign. 
_E_\nImagination is more important than knowledge. Albert Einstein _E_\nThe Pope should not have resigned—he should have lived it out. It hurts him it hurts the church... _E_\nYou are right more like the opening of the Tonys. _E_\nI can't believe we are not asking South Korea for anything. They make a fortune on us while we spend a fortune defending them how stupid! _E_\nThe U.S. Senate should switch to 51 votes immediately and get Healthcare and TAX CUTS approved fast and easy. Dems would do it no doubt! _E_\nRT @JackPosobiec: Dick Durbin called Trump racist for wanting to end chain migration. Here's a video of Dick Durbin calling for an end to... _E_\nIt has been 1000 days since @BarackObama has passed a budget. He continues to spend this country into the ground without any control. _E_\nPresidents and their administrations have been talking to North Korea for 25 years agreements made and massive amounts of money paid...... _E_\nVia @EllonTimesKenny: Trump course sparks international interest __HTTP__ _E_\nThe Great State of Arizona where I just had a massive rally (amazing people) has a very weak and ineffective Senator Jeff Flake. Sad! _E_\n.@JebBushAt the debate you said your brother kept us safe I wanted to be nice & did not mention the WTC came down during his watch 9/11. _E_\nThank you to our fantastic veterans. The reviews and polls from almost everyone of my Commander in Chief presentation were great. Nice! _E_\nObama has zero credibility on oil and coal. If we do not win energy as a country we just do not win period! _E_\nFrom the Wall Street Journal: Google Steps Into Autism Research re @autismspeaks __HTTP__ _E_\nVia @CBNNews' @TheBrodyFile: \"Poll: Donald Trump in GOP Top Tier for President\" __HTTP__ _E_\nGreat and we should boycott Fake News CNN. Dealing with them is a total waste of time! __HTTP__ _E_\nIf the gov't shuts down it is because Obama wants to make working Americans buy ObamaCare while businesses and gov't are exempt. 
_E_\n.@MissUniverse ratings were great! A big win and a wonderful night! __HTTP__ _E_\nNegotiation is a true talent. It is an art. And our politicians are killing our country b/c they don't have it. @SRQRepublicans speech _E_\nRT @DanScavino: .@POTUS @realDonaldTrump signs executive orders on trade that will set the stage for revival in American manufacturing. #Am... _E_\nSurprise – China has spies throughout NASA stealing our R&D __HTTP__ When will we ever make them pay for espionage? _E_\nI am confident when American public gets to know @MittRomney the race will go his way. He's honorable & successful man polls looking good. _E_\nCBO estimates over 2.3M jobs will be lost due to ObamaCare __HTTP__ Elections have consequences. _E_\nJust landed from Paris France. It was an incredible visit with President @EmmanuelMacron. A lot discussed and accomplished in two days! _E_\nPenn Jillette shows his dark side in new crowdfunded film Director's Cut __HTTP__ @pennjillette @bradwyman _E_\nExcellent story on @MittRomney very good moment for Ryan. #VPDebate _E_\nI'll be signing copies of my new book @TimeToGetTough tomorrow at Trump Tower (5th Avenue between 56 and 57) from noon to 2pm. _E_\nStuart Stevens is a dumb guy who fails @ virtually everything he touches. Romney campaignhis booketc. Why does @andersoncooper put him on? _E_\nI agree! The headline says it all. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nOur economy is in trouble. The unemployed are more likely to drop out of the workforce than find a job. We need growth and now! _E_\nTom Brady just did it again. He is not only a great guy he is without question the BEST quarterback! _E_\nIt's Thursday. How much money did Barack Obama waste today on crony green energy projects? _E_\nJeb Bush just said about Marco Rubio he's my friend! Pure political speak. Why can't he be truthful and say disloyal guy no friend! _E_\nGreat job @IvankaTrump! 
#RNCinCLE __HTTP__ _E_\nDon't attack Syria an attack that will bring nothing but trouble for the U.S. Focus on making our country strong and great again! _E_\n500 of the most vicious prisoners escaped from an Iraq prison today. That country is a time bomb waiting to happen a total corrupt mess! _E_\nI would absolutely consider investing in Atlantic City again great and hard working people but much would have to change taxes regs. etc _E_\nWow now leading in @ABC /@washingtonpost Poll 46 to 45. Gone up 12 points in two weeks mostly before the Crooked Hillary blow up! _E_\nThank you for your support. Together we will MAKE AMERICA SAFE AND GREAT AGAIN!#POTUSAbroad #USA __HTTP__ _E_\nA Rod should do the Yankees a favor and never play again. _E_\n#HappyIndependenceDay #July4 #USA __HTTP__ _E_\nIt was a great privilege to meet with President Moon of South Korea.Stay tuned! 🇰 #UNGA __HTTP__ _E_\n#1. Keep the big picture in mind. There are always opportunities and possibilities and thinking too small can negate a lot of them. _E_\nWe continue to lose our nation's finest in Afghanistan almost daily. The Rules of Engagement are costing lives. _E_\nConcerns over the national debt are stopping businesses from hiring and expanding __HTTP__ Obama's policies are unsustainable _E_\nHow can @BarackObama invoke Richard Nixon against @MittRomney when Obama just used Executive Privilege on Fast & Furious?! _E_\nTHANK YOU! #Trump2016 __HTTP__ _E_\nRT @JaydaBF: VIDEO: Muslim migrant beats up Dutch boy on crutches! __HTTP__ _E_\nThank you @TheTodaysGolfer for the wonderful statement that the new par 3 9th hole @Trump Turnberry could be the most dramatic in Britain. _E_\nRT @kevcirilli: CEDAR RAPIDS TRUMP'S DAUGHT IVANKA: I can just say without equivocation my father will make America great again. _E_\nBoy is this guy @ShepNewsTeam tough on me. So totally biased. As a reporter he should be ashamed of himself! #Trump2016 _E_\nWe are going to have a great time in Cleveland. 
Will lead to special results for our country. We will Make America Great Again! _E_\nThe Fed's reckless monetary policies will cause problems in the years to come. The Fed has to be reined in or we will soon be Greece. _E_\nCutting taxes and simplifying regulations makes America the place to invest! Great news as Toyota and Mazda announce they are bringing 4000 JOBS and investing $1.6 BILLION in Alabama helping to further grow our economy! __HTTP__ _E_\nAnybody that believes in strong borders and stopping illegal immigration cannot vote for Marco Rubio READ THIS: __HTTP__ _E_\nAmericans already believe that @PaulRyanVP is better qualified to serve as President over @JoeBiden __HTTP__ No surprise. _E_\nOil is under $50/barrel. Now is the time to increase sanctions against Iran not lift them. No deal is better than a bad deal. #ArtOfTheDeal _E_\nDoctors have already died treating Ebola __HTTP__ We should not be importing the disease to our homeland. _E_\nIf you can't run your own house you certainly can't run the White House A statement made by Mrs. Obama about Crooked Hillary Clinton _E_\nI have been saying for weeks for President Obama to stop the flights from West Africa. So simple but he refused. A TOTAL incompetent! _E_\nWow little Mac Miller has almost 100 million views on his song Donald Trump. Keep pushing Mac and come up with another hit just do it! _E_\nWhy is the NFL getting massive tax breaks while at the same time disrespecting our Anthem Flag and Country? Change tax law! _E_\nMy @FoxNews with @gretawire discussing the Keystone pipeline Re election is more important than 20000 jobs and (cont) __HTTP__ _E_\n#TrumpVlog @Rosie wasn't even a short term fix at The View. __HTTP__ _E_\nThe Wall Street Journal stated falsely that I said to them \"I have a good relationship with Kim Jong Un\" (of N. Korea). Obviously I didn't say that. I said \"I'd have a good relationship with Kim Jong Un\" a big difference. 
Fortunately we now record conversations with reporters... _E_\nThe first 90 days of my presidency has exposed the total failure of the last eight years of foreign policy! So true. @foxandfriends _E_\nHealth insurance premiums are rising by double digits __HTTP__ Another tax to the consumer by Obama Care. Enjoy! _E_\nThank you on my way! __HTTP__ _E_\nToday in Florida I pledged to stand with the people of Cuba and Venezuela in their fight against oppression cont: __HTTP__ _E_\nI have gotten to know many Spanish speaking people as the owner of Trump National Doral in Miami. They are smart hard working and great _E_\nMy interview with @Newsmax_Media where I explain that gas is headed to $5 $6 and why @RickSantorum can't win __HTTP__ _E_\nMy experience in Iowa was a great one. I started out with all of the experts saying I couldn't do well there and ended up in 2nd place. Nice _E_\nWith Sen. Elizabeth Dole & @DoleFoundation Caregiver Fellows. Tremendous people caring for our military & veterans! __HTTP__ _E_\nMust read via @IowaGOP by @shanevanderhart: \"Congress Should Vote No on Syria\" __HTTP__ _E_\nWind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @alexsalmond @Aberdeenshire _E_\nWith almost 1.3 million followers and rising really fast everyone is asking me to critique things(and people). Finally I will be a critic. _E_\nI want to thank evangelical Christians for the warm embrace I've received on the campaign trail. Video: __HTTP__ _E_\nLiberty University speech by DJT was biggest by far in school's history. Standing ovations...great young people! _E_\nHow many more billions of dollars will @BarackObama continue to waste in these solar companies? _E_\nBeginning today the United States of America gets back control of its borders. Full speech from today @DHSgov:... __HTTP__ _E_\n\"Keep a good attitude and do the right thing even when it's hard. 
When you do that you are passing the test.\" @JoelOsteen _E_\nYou know what is the worst part of @BarackObama's Tuesday speech playing class warfare we paid for it with our tax dollars. _E_\nPeople have got to stop working to be so politically correct and focus all of their energy on finding solutions to very complex problems! _E_\nIt is so great to be back home! Looking forward to a great rally tonight in Bethpage Long Island! _E_\nWith my friends at the great @Adidas Boost event at the @cadillacchamp at @trumpdoral __HTTP__ _E_\nNot only did the $1B ObamaCare website not work it can't even protect your personal information __HTTP__ A disaster. _E_\nSmart move by the Democrats to have Pres. @billclinton play a key role in their convention. _E_\nI think the @NewYorkObserver was far too nice to sleazebag @AGSchneiderman. He's got plenty more to worry about!. _E_\nFind something for everyone on your list with this Holiday Gift Guide from @TrumpSoHo on @TrumpCollection's Tumblr: __HTTP__ _E_\nThe important thing is not to stop questioning. Curiosity has its own reason for existing. Albert Einstein _E_\nTickets are now available for the 2015 @CadillacChamp at @TrumpDoral March 4 8: __HTTP__ _E_\nIn his own words @BarackObama was born in Kenya and raised in Indonesia and Hawaii. This statement was made (cont) __HTTP__ _E_\nNew ad concerning lightweight Senator Marco Rubio: __HTTP__ _E_\nCatch me on Fox News right now my interview with Neil Cavuto __HTTP__ _E_\nFrance was just stripped of its AAA bond rating. With the PMs radical tax rates... _E_\nSo now that Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the \"unsolved mystery\" that took place in Florida years ago? Investigate! 
_E_\nMust read article by @EmilyMiller: \"Anthony Weiner is a twit who treats women like dirt\" __HTTP__ _E_\nVia @CNNPolitics by @JDiamond1: \"Trump: RNC call was 'congratulatory'\" __HTTP__ _E_\nOur soldiers can't even have any more joint exercises with Afghan soldiers because they are getting shot in the (cont) __HTTP__ _E_\nWith @VanityFair circulation and advertising revenue doing so badly rumor has it that dopey Graydon Carter is going to resign? He should. _E_\nVery much enjoyed my tour of the Smithsonian's National Museum of African American History and Culture...A great job done by amazing people! _E_\nThe media is unrelenting. They will only go with and report a story in a negative light. I called Brexit (Hillary was wrong) watch November _E_\nYoung entrepreneurs should always remember that if you do not promote yourself no one else will! _E_\nObama vacationing in West Palm Beach starting tomorrow. He should play a round at Trump Int'l Golf Club #1 rated course in Florida. _E_\nVia @PJMedia_com by @NicholasBallasy: \"Trump Calls Election a 'Big Blow to Obama... I Think He's in Denial'\" __HTTP__ _E_\nWell maintained real estate is always going to be worth a lot more than poorly maintained real estate. The Art of the Deal _E_\nIt is a great honor for me to be inducted into the @WWE Hall of Fame. This will take place on April 6... _E_\nTONIGHT! NORTH CAROLINA: __HTTP__ GEORGIA: __HTTP__ NEVADA: __HTTP__ _E_\n.@BernardGoldberg was not good tonight on @oreillyfactor. He just doesn't know about winning! But he is a nice guy. _E_\nOur President must be very careful with the 28 year old wack job in North Korea. At some point we may have to get very tough blatant threats _E_\nLast October on @meetthepress @chucktodd attacked @jack_welch and I for saying Obama cooked the job number. Will he apologize? 
_E_\nIf you are a young entrepreneur just entering the business world I highly recommend that you read The Art of (cont) __HTTP__ _E_\nIt's Wednesday how many more of our embassies will be stormed by Islamists? _E_\nObama is tougher on WWII vets wanting to visit a DC memorial than Iran. He needs to show respect to our vets and not play games. _E_\nWhile I am a critic of President Obama I hate it when someone (Robert Gates) writes a self serving negative book about his boss. _E_\nYou never know when the tide is going to turn in your favor. It's important to never give up on yourself. Think Like a Champion _E_\nThank you @scottienhughes for your powerful words on @FoxNews. I am with the Evangelicals and Tea Party big time. We will all WIN together! _E_\nThe Trump Tower restaurant Trump Grill just received the highest sanitary inspection grade possible \"A\" – the food is also great! _E_\nVia @MiamiHerald: Donald Trump to be inducted into WWE Hall of Fame __HTTP__ _E_\nCrooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person! _E_\nHank Greenberg formerly of AIG gave $10 million to the @JebBush campaign 3 months ago. He is not happy a total waste of money! _E_\nKate Middleton is great but she shouldn't be sunbathing in the nude only herself to blame. _E_\n'President Donald J. Trump Approves Emergency Declarations' __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_\nA bad manager such as @BarackObama will continually be plagued by scandals. __HTTP__ Leadership starts at the top. _E_\nChina attempted to sell embargoed computers to Iran __HTTP__ China loves these deals! _E_\nDue to the holiday I will NOT be doing Fox & Friends this morning. Next Monday at 7. _E_\nIt is a shame Keystone wasn't powered by solar panels and wind because then @BarackObama would have wasted billions on it. _E_\nOur trade deficit just jumped in May to \"the second highest level on record\" __HTTP__ FAIR trade not free trade. I TOLD YOU. 
_E_\nThere is ZERO margin for error on Ebola. Are we confident in Obama when he can't even make a website for $5 Billion? _E_\nTHANK YOU INDIANA! #Trump2016 __HTTP__ _E_\nRecord snowfall & freezing temps throughout the country. Where is Global Warming when you need it?! _E_\nRe: Success Don't put blinders on and do not limit yourself reach out seek and explore. Think big at all times. _E_\nScary. Our military is a using a Chinese made satellite for North Africa command communications __HTTP__ _E_\nRead Donald Trump's Top Ten Tips for Success: __HTTP__ _E_\nI'd like to call JEB a liar but the truth is he has no clue & never revealed that he used Eminent Domain when criticizing me! (1/2) _E_\nThank you to all of the supporters who far out numbered the protesters yesterday at the Women's U.S. Open. Very cool! _E_\nCongrats to Pres.Obama and Dems. CBO has TRIPLED its estimate of working hours lost due to ObamaCare __HTTP__ Job Killer _E_\nIt was just announced by sources that no charges will be brought against Crooked Hillary Clinton. Like I said the system is totally rigged! _E_\nRT @gatewaypundit: Democrat Fire Marshal Turns THOUSANDS of Trump Supporters Away at Columbus Rally __HTTP__ via @gatewaypun... _E_\nI really enjoyed being at the Iowa State Fair. The crowds love and enthusiasm is something I will never forget. _E_\nDespite winning the second debate in a landslide (every poll) it is hard to do well when Paul Ryan and others give zero support! _E_\n\"Remember people's names and small details about them. Use both in conversation... _E_\nThank you for inviting me to the Western Conservative Summit in Colorado! #ImWithYou #WCS16 __HTTP__ __HTTP__ _E_\nAdrian also gives autographs if you stop by the lobby of @TrumpTowerNY. #CelebApprentice _E_\nI talk about Obamacare in today's #TrumpVlog __HTTP__ _E_\nThe new job figures don't include 315000 people who have given up looking for jobs. 
_E_\nWe are building China's wealth by buying all their products even though we make better products in America. _E_\n'Presidential Executive Order on Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure'... __HTTP__ _E_\nEach day that Iran delays the deal if that is what you call it we must add another sanction and make them progressively tough. _E_\nVia @HeraldBusiness by @hannahbsampson: \"@TrumpDoral looking to hire hundreds\" __HTTP__ _E_\nThe @nytimes is so poorly run and managed that other family members are looking to take over control. With unfunded liabilities big trouble! _E_\nMy thoughts on last night's meeting with @SarahPalinUSA in today's #trumpvlog... __HTTP__ _E_\nHeading for Atlanta tomorrow morning for noon speech at North Atlanta Trade Center. Big crowds great people! _E_\nGeorgetown should not host @KathleenSebelius for the graduation ceremony. Her policies abuse Catholics. _E_\n.@PolitiTrends @realdonaldtrump is dominating the discussion on Twitter with 79352 mentions today (via __HTTP__ ) _E_\nRT @GovAbbott: To ensure your safety ahead of #Harvey heed warnings from local officials & review important safety information. __HTTP__ _E_\n83% of the government is still running during the shutdown while 41% of nondefense federal workers are furloughed. Room for cuts. _E_\nI am in Baton Rouge where the Miss USA Pageant will be shown live on NBC on Sunday night for 3 hours starting at 8 P.M. INCREDIBLE SHOW! _E_\nMy @Newsmax_Media interview discussing OPEC US gas resources @MittRomney and running a campaign against @BarackObama __HTTP__ _E_\nIrrelevant clown @KarlRove sweats and shakes nervously on @FoxNews as he talks bull about me. Has zero cred. Made fool of himself in '12. _E_\nRemember after this new episode starts 5 MINUTES! _E_\nThe Republicans have been played into a trap by the President they forgot the 14th amendment..... _E_\nThe convention in Cleveland will be amazing! 
__HTTP__ _E_\nI just arrived in Miami where I will be checking out construction of the brand new Trump National Doral always closely watch construction! _E_\nLast night in his SOTU @BarackObama claimed that he is a friend of Israel. Does anyone really believe that. _E_\nWe need a #POTUS with great strength & stamina. Hillary does not have that.#Trump2016 __HTTP__ __HTTP__ _E_\nMy Monday @foxandfriends interview discussing the fiscal cliff negotiations making the big deal and who has the cards __HTTP__ _E_\nI will be interviewed from Cleveland Ohio on @seanhannity Tonight at 10:00 P.M. Enjoy! _E_\nClinton made a false ad about me where I was imitating a reporter GROVELING after he changed his story. I would NEVER mock disabled. Shame! _E_\nWhen I made the Apprentice the #1 show in the US that was a good day for you... _E_\nIf Obama wins it is the end of the Republican party. @limbaugh _E_\nMainstream (FAKE) media refuses to state our long list of achievements including 28 legislative signings strong borders & great optimism! _E_\n.@TrumpCollection's @DoralResort renovations are revitalizing Miami. The new course will be a great challenge __HTTP__ _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nCelebrity Apprentice is rebroadcasting last weeks episode at 9 P.M. WITH A GREAT NEW EPISODE FEATURING @MELANIA TRUMP AT 10 P.M. AMAZING! _E_\nMy interview last week with Greta van Susteren is available here in slightly abridged form. __HTTP__ Good info to know about. _E_\nObama is looking like an incompetent fool in the handling of the war against.ISIS! Why isn't China and Russia helping they gain so much! _E_\nThe negative television commercials about me paid for by the politicians bosses are a total #Mediafraud. When you watch remember! _E_\nIf @MittRomney has a good debate tomorrow night Obama is finished! _E_\nI am on @foxandfriends at 7:00 A.M. ENJOY! 
_E_\nHeading to the great state of Mississippi at the invitation of their popular and respected Governor @PhilBryantMS. Look forward to seeing the new Civil Rights Museum! _E_\nWill be landing in Knoxville Tennessee shortly tremendous crowd expected. It's all very simple we want to #MakeAmericaGreatAgain! _E_\nHonored to receive an endorsement from @SJSOPIO thank you! Together we are going to MAKE AMERICA SAFE & GREAT AG... __HTTP__ _E_\nThank you Indiana! #Trump2016 __HTTP__ _E_\nToday we are going to win the great state of MICHIGAN and we are going to WIN back the White House! Thank you MI!... __HTTP__ _E_\nMariano Rivera is greatest closer of all time. A leader in the club house & an exceptional man. One of the best @Yankees in history. _E_\nWhy would anyone think Obama would attack Syria the day of his speech in Washington. He doesn't want to detract from his press & glory. _E_\nIf the GOP will have any chance to beat @BarackObama in November the great people of Michigan need to support @MittRomney's candidacy. _E_\nI always said the people we fought for in Libya were bad news. Once again I was right. _E_\nMaking my speech. #WWEHOF __HTTP__ _E_\nDon't forget the three hour episode of Celebrity Apprentice this Sunday night 8pm 11pm on NBC. You're in for a (cont) __HTTP__ _E_\nThe windfarm approval in Scotland is subject to many conditions that can never be met will be tied up in courts for years! #EOWDC _E_\nI never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_\n#HappyMothersDay! __HTTP__ __HTTP__ _E_\nHAPPY THANKSGIVING to everyone I love you all even my many enemies (sometimes!). _E_\nWe must protect our veterans. #MakeAmericaGreatAgain __HTTP__ _E_\nHow do you like the boardroom so far? _E_\nWow l just found out that A.G. Schneiderman met with President Obama in Syracuse on Thursday and sued me on Saturday! Same as IRS etc. _E_\nBureaucratic red tape and overregulation are discouraging the American dream. 
It's time for a bold new direction! __HTTP__ _E_\nWhen somebody challenges you unfairly fight back be brutal be tough don't take it. It is always important to WIN! _E_\nGet out to VOTE on 11/8/2016 and we will #DrainTheSwamp!RASMUSSEN NATIONAL Trump 43%Clinton 41% __HTTP__ _E_\nDopey Sugar @Lord_Sugar I hear your ratings last week were at an all time low you better get them up or you'll be fired. _E_\nThis show was taped just before the terrible Bill Cosby revelations came to light.She still should have asked him for money goes to charity. _E_\nLook what happened to the autism rate from 1983 2008 since one time massive shots were given to children __HTTP__ _E_\nJust got home watching the news and every story is bad about the U.S. Someday we will return to being great again but we need leadership! _E_\nCHILD CARE REFORMS THAT WILL MAKE AMERICA GREAT AGAIN!Transcript: __HTTP__ __HTTP__ _E_\nOur incredible U.S. Coast Guard saved more than 15000 lives last week with Harvey. Irma could be even tougher. We love our Coast Guard! _E_\nVia @ABCPolitics by @ajdukakis & @rickklein: Mr. Trump Goes to Washington And Talks 2016 __HTTP__ _E_\nFranklin such a great photo. HAPPY 99th BIRTHDAY to your father @BillyGraham! __HTTP__ _E_\nWill Team Power be able to withstand Omarosa as PM? Smooth sailing is not expected. _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\nDon't talk to me about Bush I was never a defender or a fan! _E_\nMexico's court system corrupt.I want nothing to do with Mexico other than to build an impenetrable WALL and stop them from ripping off U.S. _E_\nWhat's incredible is that Obamacare hasn't even kicked in yet and already it's doing tremendous damage. (cont) __HTTP__ _E_\n.@KirstenPowers New book is excellent and so true! Congrats! _E_\nThe ever dwindling @WSJ which is worth about 1/10 of what it was purchased for is always hitting me politically. Who cares! _E_\nI hope @MittRomney now starts asking for any & all of @BarackObama's sealed records it's time. 
_E_\nI am very proud of @StephenBaldwin7's performance in the record 13th season of All Star @CelebApprentice. Watch. _E_\nVery nice @HuffingtonPost @pollsterpolls has me in first place at 18% and Bush second at 14% __HTTP__ _E_\nMy @greta int. on @FoxNews with @MELANIATRUMP at OPO discussing my potential candidacy & making America great again __HTTP__ _E_\nChina demanded that we raise our debt ceiling and then their rating agency downgraded us. Our leaders are hope... (cont) __HTTP__ _E_\nThe Amateur! First @BarackObama was caught bowing to the Saudi King but now the President of Mexico! __HTTP__ _E_\nAlways protect against the downside the upside will take care of itself. Donald J. Trump _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nWhether you have someone managing your finances or you're doing it yourself money like anything takes maintenance & planning to grow. _E_\nElite Traveler & the 12 Best Hotel Room Views in the World __HTTP__ #TrumpChicago _E_\nTeam Power+@LilJon= Spielberg? Let's find out. #CelebApprentice _E_\nCongratulations to David Wright on signing a long term extension with the @Mets. David is an exceptional player and person. _E_\nThe @Washingtonpost reported about the closing hotels in Atlantic City but knowingly failed to report that I am not involved left years ago _E_\nWatch ET tonight to find out what my beautiful wife will be wearing at the Met Gala! __HTTP__ _E_\n...Re: China I told you that a long time ago. __HTTP__ _E_\nThank you to Tom Brady Coach Ditka Coach Bobby Knight and all of the many champions that have been so supportive! _E_\nYesterday Obama campaigned with JayZ & Springsteen while Hurricane Sandy victims across NY & NJ are still decimated by Sandy. Wrong! 
_E_\nVia @TIME by @ZekeJMiller: \"Trump Talks Politics at His Virginia Winery\" __HTTP__ _E_\nBe sure to check @fundanything to see my picks __HTTP__ _E_\nJust out: The same Russian Ambassador that met Jeff Sessions visited the Obama White House 22 times and 4 times last year alone. _E_\nRT @ErinBurnett: Sat down w/ @EricTrump @DonaldJTrumpJr here in Iowa. Talked God @realDonaldTrump late night tweets __HTTP__ _E_\nSeth Myers is so unnatural and uncomfortable doing his show that you have to feel sorry for him. Bad interviewer marbles in his mouth! _E_\n.@davidaxelrod David Thank you my great honor for a very worthy cause! _E_\nThis will prove to be a great time in the lives of ALL Americans. We will unite and we will win win win! _E_\nWill be on Fox & Friends tomorrow morning at 7.00. Will be discussing the disgusting and wasteful $635 million website rollout and more! _E_\nRise high in affordable luxury. Trump Parc Stamford offers gracious living with entertainment spaces __HTTP__ _E_\nI have a gift for my loyal viewers of All Star @ApprenticeNBC Mrs. @MELANIATRUMP debut on this week's episode __HTTP__ _E_\nWe boarded the helicopter for Sarasota earlier & will be landing soon! See you there. #Trump2016 __HTTP__ _E_\nThank you Macomb County Michigan! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nReceived a #HurricaneHarvey briefing this morning from Acting @DHSgov Secretary Elaine Duke @FEMA_Brock @TomBossert45 and COS John Kelly. __HTTP__ _E_\nCNN will soon be the least trusted name in news if they continue to be the press shop for Hillary Clinton. _E_\nFLASHBACK: Donald Trump Answers Boy's Prayer for New Bike __HTTP__ via @FoxNewsInsider _E_\nMark it on your calendar: Comedy Central Roast March 15th at 10:30 pm for the Roast of Trump __HTTP__ _E_\nEnough about my ties etc. @Macys but they are doing really big numbers people love them (and @Macys loves Trump)! 
_E_\nWhen everything seems to be going against you remember that the airplane takes off against the wind not with it. Henry Ford _E_\n#TrumpVlog China is laughing at U.S. __HTTP__ _E_\nEntrepreneurs: Success is good. Success with significance is even better. _E_\nEvery poll shows high approval of the new sign on @TrumpChicago. I am honored by the great support. _E_\nMy @foxandfriends interview discussing Pres. Obama playing golf w/@TigerWoods US Airways American merger & oil __HTTP__ _E_\nNegotiation tip: Be reasonable & flexible. Being open to change could lead you into a fortunate situation and open the door to innovation. _E_\nSince @BarackObama is on such a transparency kick how about releasing Fast & Furious info to Brian Terry's family? __HTTP__ _E_\nRT @foxandfriends: President Trump vows America will respond to North Korean threats with fire & fury in a warning to the rogue nation ht... _E_\nMany Syrian 'rebels' are radical Jihadis. Not our friends & supporting them doesn't serve our national interest. Stay out of Syria! _E_\nWhen true golfers see what I do at Doral it will be the hottest club in the country. #sayfie #newsmax _E_\nI salute all Tea Party Patriots for marching on DC today. Stand strong! _E_\nICYMI The ALS #IceBucketChallenge that Trumps them all __HTTP__ @MissUSA @MissUniverse @DonaldJTrumpJr @EricTrump _E_\nPolling shows nearly 7 in 10 Americans support an immigration reform package that includes DACA fully secures the border ends chain migration & cancels the visa lottery. If D's oppose this deal they aren't serious about DACA they just want open borders. __HTTP__ _E_\n\"Some people dream of great accomplishments while others stay awake and do them.\" Anonymous _E_\nGood news @RasmussenPoll has @MittRomney beating @BarackObama 49% 44% __HTTP__ Obama was up by 5% at same point in '08. _E_\nLooks like Plan B is stuck with the mechanical dog. @THEGaryBusey has latched on and won't let go. 
#CelebApprentice _E_\nA resort in Arizona is using sewage to make snow. Environmentalists are going crazy I won't be skiing in that snow. _E_\nTomorrow the House votes on #KatesLaw & No Sanctuary For Criminals Act. Lawmakers must vote to put American safety... __HTTP__ _E_\n\"The best entrepreneurs believe the true measure of success has to do with the number of jobs their business creates.\" – Midas Touch _E_\nWe will continue to follow developments in Charlottesville and will provide whatever assistance is needed. We are ready willing and able. __HTTP__ _E_\nOur country will soon be relegated to THIRD WORLD status if proper decisions are not made by our president. He was never qualified for job! _E_\nIt's sad to see once decent newspapers like @USAToday failing so badly. I just don't know if they can be saved. _E_\nAetna CEO: Obamacare in 'Death Spiral' #RepealAndReplace __HTTP__ _E_\n\"Donald Trump: 'I Will Take Full Credit' for Romney Dropping Out\" __HTTP__ via @Newsmax_Media by @ssfitzgerald _E_\n....agencies not just the FBI & DOJ now the State Department to dig up dirt on him in the days leading up to the Election. Comey had conversations with Donald Trump which I don't believe were accurate...he leaked information (corrupt).\" Tom Fitton of Judicial Watch on @FoxNews _E_\nDeparting for #GOPDebate. Let's #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_\nToday I signed the Global War on Terrorism War Memorial Act (#HR873.) The bill authorizes....cont __HTTP__ __HTTP__ _E_\nDurst is a disaster at operating the new World Trade Center. It takes forever for workers or visitors to get in with impossible security. _E_\nOthers claim they can make America great again but only one knows The Art of The Deal. It's time for an outsider __HTTP__ _E_\nA great new poll 33%! __HTTP__ _E_\nWhen will @BarackObama release an actual budget? _E_\n.@MittRomney should have been more aggressive last night. Yet some polls have him winning the debate. 
_E_\ngreat business in total in order to fully focus on running the country in order to MAKE AMERICA GREAT AGAIN! While I am not mandated to .... _E_\nGlad to see that @RondaRousey lost her championship fight last night. Was soundly beaten not a nice person! _E_\nHappy birthday to my friend the great @jacknicklaus a totally special guy! _E_\nBusiness is easy. Dealing with people is hard. If you are an entrepreneur your most important job is to choose who works with you. _E_\nThe media is so dishonest. If I make a statement they twist it and turn it to make it sound bad or foolish.They think the public is stupid! _E_\nToday I signed the Veterans (OUR HEROES) Choice Program Extension & Improvement Act @ the @WhiteHouse. #S544 Watch... __HTTP__ _E_\nIf my many supporters acted and threatened people like those who lost the election are doing they would be scorned & called terrible names! _E_\nLatin America's tallest building @TrumpPanama is the perfect getaway location to celebrate the New Year in luxury __HTTP__ _E_\nGeneral Petraeus should stop apologising and get on with his life. He is a good man and should have a great future. _E_\nWhen Obama tried to tweak his previous statement on ObamaCare he made it an even greater lie even the Senate Democrats are angry with him! _E_\n.@SenRandPaul's Tea Party rebuttal to Obama's SOTU explained why limited government promotes freedom. Well done! _E_\nIf you want to kill any idea in the world get a committee working on it. Charles Kettering _E_\nIf you're sitting in an office working in a job you hate then it's time to THINK BIG and plan your next step... _E_\nI've been saying it for a long long time. #NoKo __HTTP__ _E_\nBoth are looking good! Now we begin! _E_\n.@ronsirak Thank you for being so fair this morning on @GolfChannel—greatly appreciated. _E_\n.@hardball_chris is a really dumb guy(and I know him well)—that's why he works swimmingly with our leaders in Washington. 
_E_\nWe are building our future with American hands American labor American iron aluminum and steel. Happy #LaborDay! __HTTP__ _E_\n..under a magnifying glass they have zero tapes of T people colluding. There is no collusion & no obstruction. I should be given apology! _E_\nCrimea was TAKEN by Russia during the Obama Administration. Was Obama too soft on Russia? _E_\nCongratulation to Jane Timken on her major upset victory in becoming the Ohio Republican Party Chair. Jane is a loyal Trump supporter & star _E_\nOh the wonders of the Arab Spring. Our new allies in Egypt the Muslim Brotherhood just called the Holocaust a myth __HTTP__ _E_\nJeb Bush spent more than $40000000 in New Hampshire to come in 4 or 5 I spent $3000000 to come in 1st. Big difference in capability! _E_\nNew York we will make America great again! __HTTP__ _E_\nVia @clarechampion by @DanDanaherNews: \"Wind Farm Proposal Near @Trump_Ireland Rejected\" __HTTP__ _E_\nSouth Carolina and the audience were GREAT THANKS! _E_\nThe World as we know it is falling apart. Much of the blame can be attributed to the fact that the United States is no longer respected! _E_\nThe @Broncos had a truly bad day my advice is to go home forget about it and come back tough next year. _E_\n\"You want to compete and you want to compete at the highest level.\" @boonepickens _E_\nRepublicans must be careful with immigration—don't give our country away. _E_\nUnemployment rate only dropped because more people are out of labor force & have stopped looking for work.Not a real recovery phony numbers _E_\nI'm going to do what @MittRomney was totally unable to do WIN! _E_\nCan you believe that Sony chief Amy Pascal wants to meet with Al Sharpton to seek forgiveness for her racial slurs. Al is laughing at her! _E_\n.@serenawilliams had a flawless @usopen quarterfinal win last night. She's a great player and a wonderful person. _E_\n\"But if someone has a gun and is trying to kill you... 
it would be reasonable to shoot back with your own gun.\" @DalaiLama _E_\n.@BrandenRoderick did a great job on All Star Celebrity @ApprenticeNBC. Raised a lot of money for charity while looking great. _E_\nWill be covering President Obama's speech at 9.00 on Twitter you are all so lucky! _E_\nGreat shot by @KingJames yesterday. Lebron is a tough competitor who delivers under pressure. _E_\nWe will confront ANY challenge no matter how strong the winds or high the water. I'm proud to stand with Presidents for #OneAmericaAppeal. _E_\nI had 15000 people in Phoenix but @politico said the rooms capacity is just over 2000. But said Bernie Sanders had 11000 in same room. _E_\nMy prayers and condolences to the victims and families of the terrible tragedy in Nice France. We are with you in every way! _E_\nEntrepreneurs: Having an ego and acknowledging it is a healthy choice. There's nothing wrong with bringing your talents to the surface. _E_\nREPEAL AND REPLACE!!! #ObamaCareInThreeWords _E_\nHAPPY BIRTHDAY to the United States Air Force!! __HTTP__ _E_\nI sure hope the sexting pervert Anthony Weiner runs for mayor. Will be great fun watching him both lose and be humiliated. _E_\n\"If it's worth doing it's worth fighting for. You'll have lots of people & obstacles in your way. Fight to get beyond them.\"–Midas Touch _E_\nToday I was honored to be joined by Republicans and Democrats from both the House and Senate as well as members of my Cabinet to discuss the urgent need to rebuild and restore America's depleted infrastructure. __HTTP__ __HTTP__ _E_\nI am in Miami at Trump National Doral. Just gave out contract to build a new ballroom and luxury suites. Blue Monster complete opens Dec 14. _E_\nA drug free A Rod is just an average baseball player.@Yankees will soon move him down in the batting order & should renegotiate his contract _E_\nMassive record setting snowstorm and freezing temperatures in U.S. Smart that GLOBAL WARMING hoaxsters changed name to CLIMATE CHANGE! 
$$$$ _E_\nHouse Democrats want a SHUTDOWN for the holidays in order to distract from the very popular just passed Tax Cuts. House Republicans don't let this happen. Pass the C.R. TODAY and keep our Government OPEN! _E_\nObama just appointed an Ebola Czar with zero experience in the medical area and zero experience in infectious disease control. A TOTAL JOKE! _E_\nRT @JoeNBC: Remarkable how cost effective Post says Trump campaign was per vote and stunning how much Jeb spent per vote. __HTTP__ _E_\nNo matter how good the replacement refs do they will be soundly criticized they can't win! _E_\n\"I pride myself on being obstinate stubborn & tough. I think those are important qualities found in successful people.\" – Think Big _E_\nRT @WhiteHouse: Dr. King's dream is our dream. It is the American Dream. It's the promise stitched into the fabric of our Nation etched i... _E_\nJust spoke to @JohnKasich to express condolences and prayers to all for the horrible shooting of two great police officers from @WestervillePD. This is a true tragedy! _E_\nWith proper thinking and leadership we can have a much better plan than Obamacare something that works for the people and costs much less _E_\nPublicity seeking Lindsey Graham falsely stated that I said there is moral equivalency between the KKK neo Nazis & white supremacists...... _E_\n.@JoseCanseco who I got to know very well during #CelebApprentice can't carry @SHAQ's jock. _E_\nAs a former host of Saturday Night Live I look forward to attending tonight! _E_\n\"Go for the jugular so that people watching will not want to mess with you.\" – Think Big _E_\nRT @TuckerCarlson: .@RichardGrenell : @realDonaldTrump told Tillerson he had the full support of the U.S. Gov't to bring #OttoWarmbier home... _E_\nA strong America creates opportunity and growth. We just need to change Washington. Let's Make America Great Again! __HTTP__ _E_\nDo you ever notice that @CNN gives me very little proper representation on my policies. 
Just watched nobody knew anything about my foreign P _E_\nThe dishonest media does not report that any money spent on building the Great Wall (for sake of speed) will be paid back by Mexico later! _E_\nRT @TeamTrump: RT if you agree @realDonaldTrump WON the #Debate BIG LEAGUE! #MAGA __HTTP__ _E_\nNow we have a once in a lifetime opportunity to RESTORE AMERICAN PROSPERITY – and RECLAIM AMERICA'S DESTINY.But in order to achieve this bright and glowing future the SENATE MUST PASS TAX CUTS – and bring Main Street roaring back to life! __HTTP__ __HTTP__ _E_\nA feature on the progress of the course @ #Trump Int'l #Golf Club will feature on @CNNLivingGolf Thurs 8 May 2014 @ 0930 & 1630 GMT #DAMAC _E_\n.@hardball_chris Did you forget about Bill Ayers & so many others? You should apologize to all the people you offended yesterday. _E_\nSo much interest in my visit to Scotland! I greatly look forward to attending the opening event @TrumpTurnberry taking place on June 24th. _E_\nGreat job on @Greta @DonaldJTrumpJr. Nobody could have done it better! _E_\nRT @GOP: .@IvankaTrump: This administration is committed to keeping working families at the forefront of our agenda. __HTTP__ _E_\nVia @CNNMoney by @AaronSmithCNN: The Donald wins. Trump name coming off casino __HTTP__ _E_\nIf the GOP Establishment really wants to defeat @BarackObama then they should read #TimeToGetTough. _E_\nSen.Richard Blumenthal who never fought in Vietnam when he said for years he had (major lie)now misrepresents what Judge Gorsuch told him? _E_\nThey must be kidding can this be happening #Oscars _E_\nWill be interviewed on @Morning_Joe at 7:00 A.M. So much to talk about! _E_\nGive the public a break The FAKE NEWS media is trying to say that large scale immigration in Sweden is working out just beautifully. NOT! _E_\nThe only global warming that people should be concerned with is the global warming caused by nuclear weapons because of our weak U.S. 
leader _E_\nI will be making a big surprise announcement to the massive crowd assembled in Huntsville/Madison Alabama! Landing now! #Trump2016 _E_\nTake a look at __HTTP__ and __HTTP__ to see these beautiful hotels. _E_\nYou're just not getting there @DanaPerino. Sometimes things just don't work out but don't worry no problem! _E_\nI will be interviewed on @foxandfriends at 7:30. Things are looking good had a great Easter look forward to spending the week in Wisconsin! _E_\nThen on June 25th back to the USA to MAKE AMERICA GREAT AGAIN! _E_\nIn the very least Congress must defund Obama's unconstitutional amnesty order. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nLooking forward to promoting a pro growth & positive message at this Saturday's @Citizens_United @AFPhq Freedom Summit in Manchester. _E_\nSmall crowds at @RedState today in Atlanta. People were very angry at EWErickson a major sleaze and buffoon who has saved me time and money _E_\nI will be interviewed on Face The Nation @CBSNews at 10:30 A.M. Should be interesting ENJOY! _E_\nFocus on your goals not your problems. And remember problems are a mind exercise so enjoy the challenge. _E_\nRexnord of Indiana is moving to Mexico and rather viciously firing all of its 300 workers. This is happening all over our country. No more! _E_\n.@TrumpChicago is Chicago's sole destination showcasing a Five Star @Forbes rating for both hotel & restaurant __HTTP__ _E_\nFitch has downgraded our credit outlook to negative. Why? @BarackObama's failure to lead with the Super Committee. __HTTP__ _E_\nThank you Colorado! An honor to win @NBC @9News #GOPDebate Poll. __HTTP__ _E_\n.@DineshDSouza's '2016: Obama's America' is expanding to over 1000 theaters this weekend. Will be highest grossing documentary in 2012. !! _E_\nBusy day—working on buying a major property—and creating lots of jobs. _E_\nI agree with Marco Rubio that Ted Cruz is a liar! 
_E_\nGreat meetings will take place today at Trump Tower concerning the formation of the people who will run our government for the next 8 years. _E_\nGlad to see that Jamie Dimon passed yesterday's shareholder vote. The JP Morgan stock holders understand that a good CEO is worth keeping. _E_\nHillary Clinton is taking the day off again she needs the rest. Sleep well Hillary see you at the debate! _E_\n.@CNBC is pushing the @GOP around by asking for extra time (and no criteria) in order to sell more commercials. _E_\nThank you Pennsylvania! #Trump2016 __HTTP__ _E_\n.@Macy's is a big contributor to @PPFA . Anybody against Planned Parenthood should boycott racial profiling Macy's. _E_\nSleepy Chuck Todd of NBC falls far short of the late great Tim Russert. _E_\nFormer President Jimmy Carter is so happy that he is no longer considered the worst President in the history of the United States! _E_\nTHe Westminster Dog Show asked if I'd be interested in meeting Hickory a Scottish Deerhound who won Best in Show. She came to visit today! _E_\nPresident Obama should ask the DNC about how they rigged the election against Bernie. _E_\nRichard Mourdock a very good man running for the Senate in Indiana. Hopefully he will win! @richardmourdock _E_\nTed Cruz is a nervous wreck. He is making reckless charges not caring for the truth! His poll #'s are way down! _E_\nTake the time to be thorough in whatever you undertake. Remain open to new ideas. Remain fluid not fixed in your expectations. _E_\nI'll be on @foxandfriends on Monday at 7:30 AM tune in! _E_\nWe pay for Obama's travel so he can fundraise millions so Democrats can run on lies. Then we pay for his golf. _E_\nWhen will @BarackObama release his transcripts? What is he hiding? _E_\nI will win the election against Crooked Hillary despite the people in the Republican Party that are currently and selfishly opposed to me! _E_\nCongratulations to #TeamUSA🏆on your great @PresidentsCup victory! 
__HTTP__ _E_\nFor what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_\n#trumpvlog My thoughts on Afghanistan @RickSantorum and why I fired two people on this week's #CelebApprentice... __HTTP__ _E_\nPocahontas just stated that the Democrats lead by the legendary Crooked Hillary Clinton rigged the Primaries! Lets go FBI & Justice Dept. _E_\n'16 Fake News Stories Reporters Have Run Since Trump Won' __HTTP__ _E_\nRepublicans have a last chance to do the right thing on Repeal & Replace after years of talking & campaigning on it. _E_\nA photo delivered yesterday that will be displayed in the upper/lower press hall. Thank you Abbas! __HTTP__ _E_\nReverend Wright must have great hatred for Obama and the manner in which he was shunted aside. _E_\nRight now 4000 U.S. troops are stupidly heading to West Africa to help fight Ebola.No help from China Russia or wealthy African oil nations _E_\nFirst segment of my @seanhannity @FoxNews interview discussing @GOP are terrible negotiators & lost all their cards __HTTP__ _E_\nNo deal was made last night on DACA. Massive border security would have to be agreed to in exchange for consent. Would be subject to vote. _E_\n....likewise billions of dollars gets brought into Mexico through the border. We get the killers drugs & crime they get the money! _E_\nI am enjoying my travels across Europe but home is where the heart is. Looking forward to coming back to the family in New York very soon. _E_\nObama is a terrible negotiator. He bails out Chrysler and now Chrysler wants to send all Jeep manufacturing to China and will! _E_\nGDP was revised upward to 3.1 for last quarter. Many people thought it would be years before that happened. We have just begun! _E_\nMy condolences and prayers to the victims of the terrorist attack in Paris. 
_E_\nGreat article a must read by Peter Ferrara at @Forbes about The Biggest Government Spender in World History __HTTP__ _E_\nEveryone should go see @HatingBreitbart. Great documentary showcasing @AndrewBreitbart's legacy. _E_\nEverybody is asking about my announcement this Wednesday concerning Barack Obama just wait and see! _E_\nIf @TedCruz doesn't clean up his act stop cheating & doing negative ads I have standing to sue him for not being a natural born citizen. _E_\n.@JustinRose99 Great playing today in the Scottish Open. I see our practice facility is helping—use it a lot! _E_\n\"Our side needs Donald Trump.\" @AnnCoulter on @seanhannity's show last night. Thanks Ann. _E_\nJust leaving Akron Ohio after a packed rally. Amazing people! Going now to Texas. _E_\nWow just came out on secret tape that Crooked Hillary wants to take in as many Syrians as possible. We cannot let this happen ISIS! _E_\nLyin' Ted Cruz who can never beat Hillary Clinton and has NO path to victory has chosen a V.P.candidate who failed badly in her own effort _E_\n.@PennJillette and @StephenBaldwin7's arguments are making the edit room look like the boardroom. #CelebApprentice _E_\n\"Is business success a natural talent? I think it's a combination of aptitude work and luck.\" – Think Like a Champion _E_\nIf I were President I would push for proper vaccinations but would not allow one time massive shots that a small child cannot take AUTISM. _E_\nSo true Ivanka! __HTTP__ _E_\nPretty audacious for Obama to call @MittRomney a BSer when he has lied about so much we don't have room to write. _E_\nA great day in Wisconsin many stops many great people! Melania is joining me on Monday. Big crowds. MAKE AMERICA GREAT AGAIN! _E_\nThank you for your continued support!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n#trumpvlog Same last name same bad ratings @lawrence and @rosie..... __HTTP__ _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! 
__HTTP__ _E_\nMy official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_\nLooks like we will have a pervert running for mayor after all just what New York City needs and he will revert back to form always do! _E_\nIn '08 @BarackObama hit Bush for secrecy __HTTP__ When will Obama release all his sealed college records?! _E_\n.@AlexSalmond suffered a huge defeat by the people of Blackdog. Communities all over Scotland are fighting this loser. _E_\nGoofy Elizabeth Warren sometimes referred to as Pocahontas pretended to be a Native American in order to advance her career. Very racist! _E_\nI am deeply disturbed by what I have read in the case of @TrayvonMartin. I support a full investigation and justice. _E_\nMore great news as a result of historical Tax Cuts and Reform: Fiat Chrysler announces plan to invest more than $1 BILLION in Michigan plant relocating their heavy truck production from Mexico to Michigan adding 2500 new jobs and paying $2000 bonus to U.S. employees! __HTTP__ _E_\nWill be interviewed by @SeanHannity on @FoxNews tonight at 10pm from Pennsylvania. Enjoy! #Trump2016 __HTTP__ _E_\nOur new allies in Egypt the Muslim Brotherhood have close relations with Iran __HTTP__ We never should have abandoned Mubarak. _E_\nThe U.S. has a 60 billion dollar trade deficit with Mexico. It has been a one sided deal from the beginning of NAFTA with massive numbers... _E_\nThe acclaimed @TrumpChicago soars 92 stories high. You're either staying in @TrumpChicago or in its shadow. __HTTP__ _E_\nIncreasing America's debt weakens us domestically and internationally. US Senator @BarackObama 2007 _E_\nPrediction: Rand Paul has been driven out of the race by my statements about him he will announce soon. 1%! _E_\nThe Democrats are delaying my cabinet picks for purely political reasons. They have nothing going but to obstruct. Now have an Obama A.G. _E_\n.@HuffingtonPost actually gave me a positive story yesterday! 
_E_\nRemember our brave men & women who have fallen protecting our country this Memorial Day! _E_\nLaw enforcement & military did a spectacular job in Hamburg. Everybody felt totally safe despite the anarchists. @PolizeiHamburg #G20Summit _E_\n'What I Like About Trump ... and Why You Need to Vote for Him' __HTTP__ _E_\nRT @EricTrump: __HTTP__ _E_\nIf ObamaCare is such a wonderful law then why does Obama summarily change the law before an election? _E_\nSocial media has changed the news & communication landscape for good. Everything must be up to date by the second instead of the hour or day _E_\nMechanical dog is going to be trending tonight. #MechanicalDog #CelebApprentice _E_\nCrooked Hillary's V.P. pick said this morning that I was not aware that Russia took over Crimea. A total lie and taken over during O term! _E_\nEntrepreneurs: Seek opportunity and see opportunity as a perk. You never know what will evolve. Keep an open mind! _E_\nThank you Roanoke Virginia this a MOVEMENT join us today!Sign up: __HTTP__ __HTTP__ _E_\nWow I have had so many calls from high ranking people laughing at the stupidity of the failing @nytimes piece. Massive front page for that! _E_\nGreat listening session with CEO's of the Retail Industry Leaders Association this morning! __HTTP__ _E_\nTrump National Doral will have big crowds this weekend for the WGC. THE BLUE MONSTER IS READY FOR THE WORLD'S TOP FIFTY PLAYERS! _E_\nDems have been complaining for months & months about Dir. Comey. Now that he has been fired they PRETEND to be aggrieved. Phony hypocrites! _E_\nWhen a complex website is broken the best thing to do is blow it up and start all over again then sue the culprits and use the proper team! _E_\nThe Letterman show really turned things around people finally understand my $5 million dollar offer to charity.... __HTTP__ _E_\nHillary says this election is about judgment. She's right. Her judgement has killed thousands unleashed ISIS and wrecked the economy. _E_\nBernie caved! 
__HTTP__ _E_\nI have proven to be far more correct about terrorism than anybody and it's not even close. Hopefully AZ and UT will be voting for me today! _E_\nMelania and I look forward to being with President Xi & Madame Peng Liyuan in China in two weeks for what will hopefully be a historic trip! __HTTP__ _E_\nTracking 149 polls from 29 pollsters nationwide/HuffPost Pollster #GOP __HTTP__ _E_\n.@MattGinellaGC It's true Matt the NEW Blue Monster is better than Pinehurst so is Bedminster. Turnberry & Trump Aberdeen blow it away! _E_\nThere can never be a sharp economic recovery until @BarackObama is out of the White House. _E_\nToday proves what I have always known that @Reince Priebus is the tough one and the smart one not Debbie Wasserman Shultz (@DWStweets.) _E_\nI LOVE NEW YORK! #NewYorkValues __HTTP__ _E_\nPeople do not assume this but more than anything else I like helping people. Be at Trump Tower at 11 AM today. _E_\nIt's sad that the WH is punishing children from across the country by closing all tours. Doesn't have to be. WH should take my offer. _E_\nMy @6abc int. with @Jim_Gardner on Atlantic City Philadelphia's real estate market & 2014 2016 elections __HTTP__ _E_\nCruz lies are almost as bad as Jeb's. These politicians will do anything to stay at the trough! _E_\nThe ties shirts and suits at Macy's are doing fantastically well check out the new designs and low prices nothing better! _E_\nTonight on @ExtraTV I'm talking #CelebApprentice. Tune in! _E_\nReceiving the Algemeiner Liberty Award a great honor. __HTTP__ _E_\nHow could Michael Forbes get Scot of the Year when he lost—badly—to me & Andy Murray a true Scot who won the U.S. Open & Olympic gold? _E_\nJust like Jonathan Gruber viciously lied & called Americans \"stupid\" on ObamaCare many consultants are doing the same on Global Warming. _E_\nWow even lowly Rand Paul has just past @JebBush in the new @CNN Poll. Jeb is at 3% I'm at 39%. Stop throwing your money down the drain! 
_E_\nI will be on On The Record @gretawire tonight at 7 PM _E_\nUnemployment is now 7.9%. Four years and $6.5T later that is really bad! _E_\nMake sure you take some time to enjoy the weekend. Important for your mind and will help you be productive next week. _E_\nNew Reuters poll! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nRecord of Health: __HTTP__ #Trump2016 _E_\nWill be at the Women's U.S. Open today! _E_\nWill be doing Fox & Friends in a few minutes hope you enjoy! _E_\nCome on MLB do the right thing! Let @PeteRose_14 into the Hall. No drugs—just hard work and talent! _E_\n\"Good communicators control space.\" – Roger Ailes 'You Are The Message' @FoxNews _E_\nAn honor to welcome PM of Australia @TurnbullMalcolm to America & join him in marking the 75th Anniversary of the... __HTTP__ _E_\nInnovation distinguishes between a leader and a follower. Steve Jobs _E_\nWhere is the progress in the state of New York over the last three years? There is none only backwards. _E_\nThe government is borrowing 46 cents on every dollar it spends __HTTP__ Dangerous for us but great news for China. _E_\nWhat a time we all had in Iowa yesterday massive overflow crowd. Love them! _E_\nHeading now to Pella Iowa. Big crowd! Remember Trump is a big buyer of Pella windows. See you soon! _E_\nAnderson Silva just got knocked out by new champion Chris Weidman! Congrats to Chris. _E_\n.@MarkSteynOnline Thank you and great job on @seanhannity tonight! _E_\nThe big loss yesterday for Israel in the United Nations will make it much harder to negotiate peace.Too bad but we will get it done anyway! _E_\n#GOPDebate #Trump2016 __HTTP__ _E_\nSo much for creating American jobs @BarackObama gave $529 Million to a Green car company so they can be manufactured in Finland. _E_\nPresident Obama if it is important to you I will substantially increase the $5M offer! _E_\nShe's baaack! @Rosie needs me to salvage her dying career. But it won't help she's got no talent & no persona. 
Too many tv cancellations! _E_\nWow what a nice honor! __HTTP__ _E_\nU.S. Stock Market up almost 20% since Election! _E_\nLooking forward to being hosted by @NickLangworthy's Erie County Lincoln Leadership Reception tonight. Record crowd! Can't wait. _E_\nAn aerial shot of Jacksonville crowd yesterday! I may as well show you because the media won't. #Trump2016 __HTTP__ _E_\nThe very outdated filibuster rule must go. Budget reconciliation is killing R's in Senate. Mitch M go to 51 Votes NOW and WIN. IT'S TIME! _E_\nAmazing that Derek Jeter played with an injury throughout most of last night's @yankees game and did so well. _E_\nHow do you spend over $635M on websites and they don't work? _E_\nWhy did @MittRomney give his tax returns without demanding that Obama release his college records & applications in return? _E_\nDoes anyone believe that @BarackObama did not fully write or review the 1991 publisher booklet? _E_\nAttention all hackers: You are hacking everything else so please hack Obama's college records (destroyed?) and check place of birth _E_\nThe spotlight has finally been put on the low life leakers! They will be caught! _E_\n.@TrumpToronto was just voted the #1 hotel in Canada in Conde Nast Traveler's prestigious Reader's Choice Awards __HTTP__ _E_\nMoney was never a big motivation for me except as a way to keep score. The real excitement is playing the game. The Art of the Deal _E_\nRT @DanScavino: Join @realDonaldTrump LIVE in Denver Colorado via his #Facebook page we are here!!#MakeAmericaGreatAgain __HTTP__ _E_\nCampaigning to win the Electoral College is much more difficult & sophisticated than the popular vote. Hillary focused on the wrong states! _E_\n25 days to go until fiscal cliff (bad name)—it is only a fiscal curb! Debt ceiling is real fiscal cliff...and that will be interesting! 
_E_\nThe United States is experiencing the coldest weather in decades with vast amounts of snow blanketing many states.Pendulum has swung to cool _E_\n..North Korea is a rogue nation which has become a great threat and embarrassment to China which is trying to help but with little success. _E_\nThank you Michigan! #Trump2016 _E_\nCrooked Hillary has never created a job in her life. We will create 25 million jobs. Think she can do that? Not a c... __HTTP__ _E_\nWe're missing a lot of information on autism. Support @AutismSpeaks' project by visiting mss.ng #MSSNG _E_\nIt's @BarackObama who wants to raise all our taxes who applauds China for cutting their taxes! (cont) __HTTP__ _E_\nWhen will all the haters and fools out there realize that having a good relationship with Russia is a good thing not a bad thing. There always playing politics bad for our country. I want to solve North Korea Syria Ukraine terrorism and Russia can greatly help! _E_\nDebate was somewhat hard to watch last night. Viewership will be way down. _E_\nObamacare is a disaster! Time to repeal & replace! #ObamacareFail __HTTP__ _E_\nUS tourists threaten to boycott Scotland over windfarms' __HTTP__ Write to Alex Salmond: firstminister@scotland.gsi.gov.uk _E_\nRT @DonnaWR8: @realDonaldTrump You can boycott our anthem WE CAN BOYCOTT YOU! #NFL #MAGA __HTTP__ _E_\nWho do you want negotiating for us? #MakeAmericaGreatAgain __HTTP__ _E_\nI will be interviewed by @BretBaier @SpecialReport at 6pm ET tonight @FoxNews _E_\nCongratulations to @JamesOKeefeIII on exposing more Democrat voter fraud. @DNC was caught red handed telling people to vote twice. _E_\nLittle Adam Schiff who is desperate to run for higher office is one of the biggest liars and leakers in Washington right up there with Comey Warner Brennan and Clapper! Adam leaves closed committee hearings to illegally leak confidential information. Must be stopped! 
_E_\nMy speech at @AmSpec Bartlet Gala Dinner where I received @boonepickens Entrepreneur Award __HTTP__ _E_\nWatch this video for a look at our great course in Los Angeles Rancho Palos Verdes __HTTP__ @TrumpGolfLA _E_\n.@MarkHalperin's and John Heilemann's book Double Down is an excellent read on the just passed election. Great book congrats! @jheil _E_\nCongratulations Secretary Mattis! __HTTP__ _E_\nRe my hair Should I change it? What do you think? _E_\n.@VattenfallGroup doesn't have the finances or financial statement to build the hated windfarm in Aberdeen. _E_\nIt was a great honor to have President Xi Jinping and Madame Peng Liyuan of China as our guests in the United States. Tremendous... _E_\n\"Leverage: don't make deals without it.\" – The Art of The Deal _E_\nThe North Korean regime has pursued its nuclear & ballistic missile programs in defiance of every assurance agreement & commmitment it has made to the U.S. and its allies. It's broken all of those commitments... __HTTP__ _E_\nIt is time to send someone from the outside to fix DC from the inside. Let's Make America Great Again! __HTTP__ _E_\n'Everything in Dubai': Learn from Emirate's rebound says @DonaldTrumpJr __HTTP__ via @Emirates247 by @Parag1301 _E_\nMoney was never a big motivation for me except as a way to keep score. The real excitement is playing the game. #TheArtofTheDeal _E_\nLIVE on #Periscope: Watch major press conference live from @TrumpTowerNY now! #MakeAmericaGreatAgain __HTTP__ _E_\nNew PPP Poll just out Trump up big Cruz Rubio and Bush down. The debate results even with a stacked RNC audience were wonderful! _E_\nAlabama people are saying their team has real football & real girlfriends—not good for Notre Dame—but they'll be back! _E_\nMy new book Midas Touch in stores now.... __HTTP__ #trumpvlog _E_\nThe president of the pathetic Club For Growth came to my office in N.Y.C. and asked for a ridiculous $1000000 contribution. I said no way! 
_E_\nVia @thehill by @martinmatishak: \"Trump: 'We look like we're beggars' in Iran nuclear talks\" __HTTP__ _E_\nBack in NY from Scotland and fighting for our country to get better. Trump International Golf Links Scotland opened to rave reviews. _E_\nLeaving for Jacksonville now. See you there! Miami was great. _E_\nThere are so many blatant lies coming out of the ADMINISTRATION healthcare spying NSA IRS brutally killed Americans WILL IT EVER END? _E_\nJailed USMC Sgt Andrew Tahmooressi should be released immediately. Since when does Mexico care about border security?#BringBackOurMarine _E_\nRT @FoxBusiness: #BreakingNews: U.S. employers added 209000 jobs in July unemployment rate down to 4.3% #JobsReport __HTTP__ _E_\nCrooked Hillary Clinton mentioned me 22 times in her very long and very boring speech. Many of her statements were lies and fabrications! _E_\nI am very surprised that @lancearmstrong gave up. I never thought he was a quitter... _E_\nI really enjoyed being in New Hampshire & speaking for Joe McQuaid @deucecrew & the Nackey Loeb School @LoebSchool honoring James Foley. _E_\nShe is so sad and pathetic that I almost feel sorry for Sec.Sebelius. She has done great harm to many people and must be fired. Incompetent! _E_\nJoin me at 7:00 P.M. on Tuesday August 22nd in Phoenix Arizona at the Phoenix Convention Center! Tickets at: __HTTP__ __HTTP__ _E_\n\"Donald Trump Congratulated on @foxandfriends for Receiving the @Algemeiner's 'Liberty Award'\" __HTTP__ via @Algemeiner _E_\nWatching these politicians trying to get a deal done is truly painful Republicans are in a much stronger position than they think. _E_\nIn 1960 there were approximately 20000 pages in the Code of Federal Regulations. Today there are over 185000 pages as seen in the Roosevelt Room.Today we CUT THE RED TAPE! It is time to SET FREE OUR DREAMS and MAKE AMERICA GREAT AGAIN! 
__HTTP__ _E_\nUnivision wants to back out of signed @MissUniverse contract because I exposed the terrible trade deals that the U.S. makes with Mexico. _E_\nThanks @AndreaTantaros for all of your kind words and thoughts. Big progress is being made. Keep up the great work! _E_\nIt was an honor to have the amazing Root family join me in Iowa. I have been so inspired by their courage & bravery. __HTTP__ _E_\nHere is a letter I received yesterday from someone who has had personal experience with our health care situation. __HTTP__ _E_\nMy interview with @gretawire last night on @FoxNews: @BarackObama 'Missed His Opportunity' __HTTP__ _E_\nHillary Clinton only knows how to make a speech when it is a hit on me. No policy and always very short (stamina). Media gives her a pass! _E_\n#badratings @Lawrence's show failed at 8pm and is failing(even worse) at 10pm not long for tv..... _E_\nOh really check out innocent @megynkelly discussion on @HowardStern show 5 years ago I am the innocent (pure) one! __HTTP__ _E_\n\"Winners never quit and quitters never win.\" Vince Lombardi _E_\nRT @seanhannity: Tonight the truth about how despicable the media and the left are in America today. We will name names. 9 est Hannity Fox... _E_\nIs this really America? Terrible! __HTTP__ _E_\nRT @accesshollywood: @realDonald Trump: 'Celebrity Apprentice' Season 5 is 'Tough Nasty & Smart.' Watch: __HTTP__ _E_\nI would absolutely kill Jon Stewart(?) in a debate it would be no contest he's not fast enough or smart enough (only obnoxious enough!). _E_\nHillary Clinton surged the trade deficit with China 40% asSecretary of State costing Americans millions of jobs. _E_\nWith a world renowned open air lobby w/ ocean views & top restaurants @TrumpWaikiki is Honolulu's premier hotel __HTTP__ _E_\nThanks to everyone for your support on @CNBC's \"Top Leaders Icons and Rebels\" vv __HTTP__ Thanks for voting Trump! _E_\n. @WWE's @WrestleMania XXIX less than 3 weeks away. 
Looking forward to being inducted into the Hall of Fame! _E_\nJohn Kasich was never asked by me to be V.P. Just arrived in Cleveland will be a great two days! _E_\nVia @NYDailyNews by Rich Schapiro: Donald Trump slams Mitt Romney Jeb Bush __HTTP__ _E_\nThe real story turns out to be SURVEILLANCE and LEAKING! Find the leakers. _E_\n#FlashbackFriday Many big movies have filmed in my buildings. Here is @TrumpChicago in #Transformers 3. __HTTP__ _E_\ngroveling when he totally changed a 16 year old story that he had written in order to make me look bad. Just more very dishonest media! _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nUSA has the greatest business people in the world but we let political hacks negotiate our deals. We need change! #BigLeagueTruth #Debate _E_\nIt must have been President Obama that called in what will go down as the DUMBEST PLAY IN THE HISTORY OF FOOTBALL. Same thought process! _E_\nNew business start ups at the lowest level in 30 years and the EPA is now the Employment Prevention Agency. @bobmcdonnell _E_\nI am on @foxandfriends Enjoy! _E_\nThank you Indiana! Was great seeing everyone on Wednesday! I will be back soon! #Trump2016 __HTTP__ __HTTP__ _E_\nALL SAFE IN ORANGE COUNTY NORTH CAROLINA. With you all the way will never forget. Now we have to win. Proud of you all! @NCGOP _E_\nVia @DailyCaller by @AlexPappas: \"Donald Trump Headed To Iowa Says Ebola Is Further Proof Of Obama's Incompetence\" __HTTP__ _E_\nVia @washtimes: Donald Trump warns of 'dangerous precedent' in Cyprus bank skimming __HTTP__ _E_\nWhat a great couple... Katherine Webb and AJ McCarron. They are both winners! _E_\nPolls close but can you believe I lost large numbers of women voters based on made up events THAT NEVER HAPPENED. Media rigging election! _E_\nThank you for having me this morning @AmericanLegion. I enjoyed my time with everyone! 
#ALConvention2016 __HTTP__ _E_\nI have never been a fan of John Edwards but it is time for the gov't to focus on more important things. @johnedwards _E_\nWhile in politics it is often smart to send out false messages one thing is clear: That Hillary does not want to run against TRUMP. _E_\nJoan Rivers @Joan_Rivers was an amazing woman and a great friend. Her energy and talent were boundless. She will be greatly missed. _E_\nNot believable that Manti Te'o was in love for one year with a girl he never met she then died. He is either very stupid.... _E_\nThank you Indiana! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\nObamaCare is a disaster. Americans will see record increases in their premiums and inferior care services. _E_\nA fantastic day in D.C. Met with President Obama for first time. Really good meeting great chemistry. Melania liked Mrs. O a lot! _E_\nSuccess is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_\nRT @foxandfriends: Sec. Mattis: If North Korea fires missile at US it's 'game on' __HTTP__ _E_\n\"Tone it down? No way! Donald Trump needs to crank up the volume\" __HTTP__ via @FoxNews by @toddstarnes _E_\nA robust growing economy is how to fix Social Security and Medicare not cuts on Seniors. _E_\nSo true! __HTTP__ _E_\nSC has kept us safe from exec amnesty for now. But Hillary has pledged to expand it taking jobs from Hispanic & African American workers. _E_\nEric Cantor's concession speech was ridiculous acted like nothing had just happened. WE NEED REAL LEADERS! _E_\nWe should cut off all aid to every country that does not respect our border. Why are we giving them money in the first place? _E_\nThe ill conceived windfarm that @AlexSalmond is pushing for Aberdeen will lose $50 million a year. Only a fool would build it or want it! _E_\n\"Keep focusing on doing what you love even if times are tough.\" – Think Big _E_\nRubio is totally owned by the lobbyists and special interests. 
A lightweight senator with the worst voting record in Senate. Lazy! _E_\nHappy New Year from #MarALago! Thank you to my great family for all of their support. __HTTP__ _E_\nPacked with holiday celebrations members & staff are enjoying the first Christmas season at @Trump_Charlotte __HTTP__ _E_\nRT @TeamTrump: She put the office of Sec of State up for sale. If she ever got the chance she'd put the Oval Office up for sale too. #Fo... _E_\nI have asked the reigning Miss Universe and Miss USA to do the honors. At least I will not have to wash my hair this morning! Enjoy. _E_\nThe NYPD has been doing a fantastic job protecting NYC. I hope Chief Ray Kelly is strongly considering running for mayor. _E_\nIn today's all new #TrumpVlog I discuss what a great honor it was to be inducted into the WWE Hall of Fame. __HTTP__ _E_\nVia @Suntimes' @CSTearlyoften by @FSPIELMAN: \"Council sign rules mean Trump name will loom large on river\" __HTTP__ _E_\nEverybody should boycott the @megynkelly show. Never worth watching. Always a hit on Trump! She is sick & the most overrated person on tv. _E_\n.@CNN Will be interviewed by Jake Tapper at 9:00 A.M. Enjoy. _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nThanks for all the great comments on all my recent interviews. Much appreciated. _E_\nI will be on@gretawire tonight at 10 P.M. Now I know she will get great ratings! _E_\nOn my way to Des Moines Iowa will see you soon with @mike_pence. Join us! Tickets: __HTTP__ #ThankYouTour2016 _E_\nSen. @MaxBaucus has announced his retirement. A major proponent of ObamaCare Baucus now says it's a 'huge train wreck.' _E_\nFact – Obama still has not fixed the backend of the ObamaCare website. This could be the greatest internet boondoggle in history. _E_\nWill be in Bangor Maine today at 3pm join me! #MAGATickets: __HTTP__ __HTTP__ _E_\nGeorge Ross and I have done some great real estate deals together. 
He's a tough negotiator. #CelebApprentice _E_\nVote for Mar a Lago __HTTP__ _E_\nCongratulations to @serenawilliams on her superb @usopen win. She is terrific! _E_\nI've never seen anything like it everything he touches turns to gold! So nice a quote by Fred C.Trump about his son Donald (me!). _E_\nMichelle Nunn supports Amnesty a weak border & ObamaCare. She is an Obama liberal. Send DC an independent voice. Vote @Perduesenate! _E_\n...to stop drugs they want to take money away from our military which we cannot do.\" My standard is very simple AMERICA FIRST & MAKE AMERICA GREAT AGAIN! _E_\nThank you to Linda Bean of L.L.Bean for your great support and courage. People will support you even more now. Buy L.L.Bean. @LBPerfectMaine _E_\nThank you! An honor to be the first candidate ever endorsed by the @NRA prior to @GOPconvention! #Trump2016 #2A __HTTP__ _E_\nGeorge Will was critical of @MittRomney throughout the primary. Maybe it is because his wife was turned down for (cont) __HTTP__ _E_\nBrooklyn Nets have the worst uniform ever Boring won't matter if they win( Winning solves all problems (cont) __HTTP__ _E_\nManufacturing is now less than 9% of US GDP. The Rust Belt heart of our country's factory sector has been destroyed by our leaders. _E_\n#DrainTheSwamp #PhoenixRally __HTTP__ _E_\nWhy would Republican candidates want the support of Mitt Romney. He lost an election against Obama that should NEVER have been lost! _E_\nWho should star in a reboot of Liar Liar Hillary Clinton or Ted Cruz? Let me know. __HTTP__ _E_\nThe true question for the @UN... __HTTP__ _E_\n.@WSJ is bad at math. The good news is nobody cares what they say in their editorials anymore especially me! _E_\nGreat to see @Yankees Captain Derek Jeter back on the field. He will have another great season and make NYC proud again. _E_\n\"The Conference Board said that consumer sentiment was at its highest level in nearly 17 years in November. 
The Consumer Confidence Index rose from 126.2 in October to 129.5 notching its best reading since December 2000...\" __HTTP__ _E_\nVia @Reuters_Biz Trump flies into ex Soviet Georgia for tower project project __HTTP__ _E_\nI will be interviewed today on Fox News Sunday with Chris Wallace at 10:00 (Eastern) Network. ENJOY! _E_\nOur country is not going to have a comeback with any politician. my @SRQRepublicans speech _E_\nThank you to the GREAT NYPD First Responders and all govt officials for having handled the terrible West Side attack so professionally! __HTTP__ _E_\nOn my way! #Inauguration2017 __HTTP__ _E_\n\"Playing golf with business associates creates a relaxing atmosphere where everyone has fun... _E_\nFact Obama does not read his intelligence briefings nor does he get briefed in person by the CIA or DOD. Too busy I guess! _E_\nPresident Obama said over and over again if you like your plan you can keep your plan PERIOD! This turned out to be a total lie 90 mill. _E_\nIf history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_\nWith our record debt & trillion $ deficits our $ is now at an all time low against the Chinese Yuan. Time for our gov't to work together. _E_\nCan you believe the head of Iran refused to meet with our great President?—Zero respect! _E_\nvia American Spectator @AmSpec Trump Card by Jeffrey Lord __HTTP__ _E_\nThank you Missouri! #Trump2016 __HTTP__ _E_\nOpening for the 2014 season soon the National Register landmark Mar a Lago Club is the crown jewel of Palm Beach __HTTP__ _E_\nUse adverse events and monumental challenges to make you stronger. Think Big _E_\nThoughts and prayers to the families of the four great Marines killed today. _E_\nSome of you were asking about the All Star line up for Celebrity Apprentice __HTTP__ _E_\nCrooked Hillary Clinton says that she got more primary votes than Donald Trump. But I had 17 people to beat—she had one! 
_E_\nKeep stimulating your mind with big ideas. Fill your mind with new information & use this information to spawn new ideas. Think Big _E_\nMajor Mexican cartel boss El Diego was just arrested with weapons provided to him through Fast and Furious __HTTP__ Media?? _E_\nThis is a tragedy. The real unemployment rate is 14.8% with over 23.2 million unemployed Americans. We can do much better. _E_\nDemocrats would do much better as a party if they got together with Republicans on HealthcareTax CutsSecurity. Obstruction doesn't work! _E_\nThank you OHIO! #TrumpPence16 __HTTP__ __HTTP__ _E_\nLooking forward to meeting everyone at the North Carolina GOP this Friday where I will be the keynote speaker at the dinner. #GOP _E_\nThis Sunday's LIVE FINALE of @ApprenticeNBC will be tough & nasty. Be sure to watch @pennjillette & @TraceAdkins fight to the finish! _E_\nNow @BarackObama is planning to have we the taxpayers pay off mortgages he will spend this country into the ground. __HTTP__ _E_\nTrump Offers To Donate $5 Million To Charity If Obama Releases College Transcripts __HTTP__ via @rcpvideo _E_\n#badratings @Rosie you will never make it. You are not funny or talented. _E_\nDue to popular demand @lisarinna returns to the 13th season of All Star @CelebApprentice. Lisa's fans won't be disappointed! _E_\nObama is planning on attacking Romney on Bain in tomorrow's debate __HTTP__ Mitt should bring up college applications & records _E_\nToday at 1:30PM CT I will be addressing @RepLeadConf in New Orleans __HTTP__ Will focus on how to fix our great country. _E_\nChina just put a tariff on US cars and trucks 22% China is laughing at our inept leaders. @BarackObama _E_\nThank you Mr. & Mrs. @TomBarrackJr for the wonderful and magical evening last night. It will not be forgotten. #Trump2016 _E_\nThe stock market and US dollar are both plunging today. Welcome to @BarackObama's second term. 
_E_\n46 stories in the center of downtown New York @TrumpSoHo's 391 spacious rooms each have floor to ceiling windows __HTTP__ _E_\nThe #FakeNews MSM doesn't report the great economic news since Election Day. #DOW up 16%. #NASDAQ up 19.5%. Drilling & energy sector... _E_\nVia @Newsmax_Media by @wandacarruthers: \"Trump: GOP on Edge of Winning 'Big' and Forcing Obama to Act\" __HTTP__ _E_\nWe should be focused on clean and beautiful air not expensive and business closing GLOBAL WARMING a total hoax! _E_\nVia @WashTimes by @harperbulletin: \"Donald Trump Goes to Washington\" __HTTP__ _E_\n\"Christmas waves a magic wand over this world and behold everything is softer and more beautiful.\" Norman Vincent Peale _E_\n\"The difference between stupidity and genius is that genius has its limits.\" Albert Einstein _E_\nMy @foxandfriends interview discussing my friend Whitney Houston @SarahPalinUSA's CPAC speechthe economy and primary __HTTP__ _E_\nHappy #FirstRespondersDay to all of our HEROES out there. We are forever grateful to you for your service sacrifice and courage 24/7/365! __HTTP__ _E_\nNancy Reagan the wife of a truly great President was an amazing woman. She will be missed! _E_\nDoing my best to disregard the many inflammatory President O statements and roadblocks.Thought it was going to be a smooth transition NOT! _E_\nLegendary basketball coach Bobby Knight who has 900+ wins many championships and a gold medal will be introducing... __HTTP__ _E_\nWhile our wonderful president was out playing golf all day the TSA is falling apart just like our government! Airports a total disaster! _E_\nA country that does not control or respect its own borders is a country destined for failure. Secure our borders! _E_\nIf I win the presidency my judicial appointments will do the right thing unlike Bush's appointee John Roberts on ObamaCare. 
_E_\n\"After every setback start thinking big as soon as possible.\" Think Big _E_\nMy @FoxNews interview with @TeamCavuto discussing my endorsement of @MittRomney and how I came to my decision __HTTP__ _E_\n'Moderate' Repubs plotting against @GOP strategy have short term memories. Tea Party gave them majority in House & primaries aren't fun. _E_\n\"Each excellent thing once learned serves for a measure of all other knowledge.\" Philip Sidney _E_\nFlashback: Donald Trump would fire A Rod __HTTP__ via @espn 10.17.12 _E_\nHe @BarackObama claims he does not want higher gas prices. That's not what he said in 2008: __HTTP__ _E_\nAlison Grimes will protect the 'sanctity' of her Obama ballot yet admits she voted for Hillary in primary. Hypocrite. Vote @Team_Mitch! _E_\nHonestly whether you're for or against ObamaCare the 635 million dollar website fiasco is bad for the U.S. It makes us look totally inept! _E_\nI said this was happening long ago I will stop this immediately! __HTTP__ _E_\nBuilding a personal brand? Then focus on being great. Focus on being the best at what you do. Excellent article: __HTTP__ _E_\nA message to the great people of New Hampshire on this important day! #VoteTrumpNH Video: __HTTP__ __HTTP__ _E_\nMost people do not know what Presient Obama is going to do to save his legacy. I do! He's got to get back to basics.Forget Syria FIX THE USA _E_\nGlobal warming has been proven to be a canard repeatedly over and over again. __HTTP__ The left needs a dose of reality. _E_\nEntrepreneurs: What is the standard for which you want to be known? Identify that standard & then establish it. Simple but not easy. Focus! _E_\nMy performance from last week on David Letterman @Late_Show will be re aired tonight at 11:35 PM on CBS. _E_\nMany generals and military leaders are now saying I told you so! They say this will have big impact on military strength & national sec. _E_\nFocus on your goals not on fixed patterns. 
Do what's necessary and what's unnecessary will be made clear. _E_\nTwo of the best ever episodes of Celebrity Apprentice tonight at 8. Totally vicious and crazy! I will live tweet. _E_\nThank you! __HTTP__ _E_\nIt wasn't Matt Lauer that hurt Hillary last night. It was her very dumb answer about emails & the veteran who said she should be in jail. _E_\nMany people would like to see @Nigel_Farage represent Great Britain as their Ambassador to the United States. He would do a great job! _E_\nRT @DiamondandSilk: The Media Says: The President Should Stop Tweeting about Russia. Well Why Don't the Media Take Their Own Advice & S... _E_\nRT @statedeptspox: #GES2017 highlights the important role of women #entrepreneurs & demonstrates the importance of #innovation & partnershi... _E_\nTake a sneak peek into one of Trump Park Avenue's most exclusive residences on the market __HTTP__ _E_\nJoin me live in Wilmington Ohio! __HTTP__ _E_\nWhy the hell did we help the Libyan rebels in the first place. That is the real scandal. _E_\nRalph Norman who is running for Congress in SC's 5th District will be a fantastic help to me in cutting taxes and.... _E_\nA gallon of gas has more than doubled while @BarackObama has been POTUS and he still won't approve Keystone. _E_\nGreat! Last night @CelebApprentice winner @johnrich & alumni @RealMeatLoaf packed OH stadium rallying w/ @MittRomney __HTTP__ _E_\nInteresting article by @MattTowery @townhallcom:\"It Is Time to Use 'The Trump Card'\" __HTTP__ Thanks Matt for the nice mention _E_\nOh no they are worried that they didn't read the Boston killer his rights and he may have a good legal argument. 12 year case to finish? t _E_\nI was just given a great tour of Moscow fantastic hard working people. CITY IS REALLY ENERGIZED! The World will be watching tonight! _E_\nHopefully the Republican Party can come together and have a big WIN in November paving the way for many great Supreme Court Justices! 
_E_\n.@Matt_Berry87 Piers did a great job the interview was very important. _E_\nTogether we are going to MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_\nOpening in 2016 Trump Tower Punta del Este will bring our signature luxury living to the sands of Playa Brava __HTTP__ _E_\nSome really dumb blogger for failing @VanityFair a magazine whose ads are down almost 18% this year said I wear a hairpiece I DON'T! _E_\nI never did give anybody hell. I just told the truth and they thought it was hell. Harry S. Truman _E_\nThe Democrats dropped all references to God from their platform. Not good! _E_\nJames Gandolfini was a remarkable talent. He was also a decent man. We will all miss him. _E_\nRe Negotiation: Know exactly what you want & focus on that. View conflict as an opportunity this will expand your mind and your horizons. _E_\n.@AGSchneiderman Why is Douglas Durst allowed to use the Freedom Tower to get out of a lease with Conde Nast? _E_\nBad performance by Crooked Hillary Clinton! Reading poorly from the telepromter! She doesn't even look presidential! _E_\n.@thehill discussing my @foxandfriends interview: Trump: 'Clamor for @MittRomney's tax returns has died down' __HTTP__ _E_\nGreat night in WI. I'm going to fight for every person in this country who believes government should serve the PEO... __HTTP__ _E_\nChris @hardball_chris Matthews ratings are at new historic lows. He is single handedly destroying the entire @msnbc channel. _E_\nRT @JoeNBC: Pope Francis tear down that wall! #vaticanwalls __HTTP__ _E_\n.@HeyTammyBruce Thank you for your nice words on Fox today. They never use my full statements on nuclear which you would agree with! _E_\nOf course I don't think Jimmy Carter is dead saw him today on T.V. Just being sarcastic but never thought he was alive as President stiff! _E_\nFor those that don't think a wall (fence) works why don't they suggest taking down the fence around the White House? Foolish people! 
_E_\n.@gretawire Greta—you're wrong Kirsten Powers is a dummy—wasn't she Anthony Weiner's girlfriend? _E_\nVladimir Putin said today about Hillary and Dems: In my opinion it is humiliating. One must be able to lose with dignity. So true! _E_\nMy visit to Japan and friendship with PM Abe will yield many benefits for our great Country. Massive military & energy orders happening+++! _E_\nThank you Michigan! #Trump2016 __HTTP__ _E_\nHowever beautiful the strategy you should occasionally look at the results. Winston Churchill _E_\nTHANK YOU St. Augustine Florida! Get out and VOTE! Join the MOVEMENT and lets #DrainTheSwamp! Off to Tampa now!... __HTTP__ _E_\nCongratulations to @TrumpWaikiki for being selected as Best of +VIP Access 2014 by @Expedia! _E_\nWatch my @oreillyfactor appearance from this week discussing nuclear negotiations with Iran __HTTP__ _E_\nWhile Obama is denying it he did receive intelligence about the attacks 3 days before __HTTP__ Too busy campaigning? _E_\nWeekly AddressJoin me here: __HTTP__ __HTTP__ _E_\nWhat a coincidence?! @BarackObama's campaign logo uses the same font as Cuban communist propaganda posters. __HTTP__ _E_\nThe world was gloomy before I won there was no hope. Now the market is up nearly 10% and Christmas spending is over a trillion dollars! _E_\nUnder @BarackObama 1 out of every 7 Americans is on food stamps. _E_\n36 hrs Central Park as seen in @nytimes including a stop @TrumpNewYork for a bite in @Nougatine_NYC. Full article __HTTP__ _E_\nWith our amazing All Star cast @Joan_Rivers @johnrich @ArsenioOFFICIAL & @piersmorgan are also returning as boardroom advisors. _E_\nIt is simply immoral for the government to encourage able bodied Americans to think that a life on welfare of (cont) __HTTP__ _E_\n\"We would accomplish many more things if we did not think of them as impossible.\" Vince Lombardi _E_\nRepublicans have once again capitulated to Obama. This time on the Iran nuclear treaty. When will it end? 
_E_\nThank you Novi Michigan! Get out and VOTE #TrumpPence16 on 11/8. Together WE WILL MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_\nI was on the TODAY Show this morning and then visited Regis & Kelly. The Celebrity Apprentice starts this Sunday night—don't miss it! _E_\nCan you imagine if the election results were the opposite and WE tried to play the Russia/CIA card. It would be called conspiracy theory! _E_\nGoofy Elizabeth Warren is now using the woman's card like her friend crooked Hillary. See her dumb tweet \"when a woman stands up to you...\" _E_\nA Rod is a less than average baseball player now that he is unable to use drugs. A Rod misrepresented to th... (cont) __HTTP__ _E_\nDeparting @JBA_NAFW for St. Charles Missouri to help push our plan for HISTORIC TAX CUTS across the finish line.A successful vote in the Senate this week will bring us one giant step closer to delivering an incredible victory for the American people! __HTTP__ __HTTP__ _E_\nToday is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! _E_\n.@HBO should fire @BillMaher and bring back @DennisDMZ someone that is actually funny. _E_\nAs election looms some bad news for Clinton Democrats: __HTTP__ _E_\nThese last 4 years have not had a single quarter over 4% GDP. Obama has overseen the weakest economic recovery in American history. _E_\nWho is the dumbest man on TV? @Lawrence of MSNBC... __HTTP__ _E_\nThe Dollar is at an all time WWII low against the Yen. The Fed's recklessness is going to lead to record inflation. _E_\nVia @HamptonsMag: @IvankaTrump Talks Hamptons Lifestyle with Emmy Rossum __HTTP__ _E_\nIt's true... Dennis is really into this very animated. I have never seen him this way before. _E_\nVOTE TODAY! Go to __HTTP__ to find your polling location. We are going to Make America Great Again!... __HTTP__ _E_\nMany people booed the players who kneeled yesterday (which was a small percentage of total). These are fans who demand respect for our Flag! 
_E_\nThe liberal media is focusing on @MittRomney's bank records. How about reviewing @BarackObama's illegal land deal contracts with Tony Rezko? _E_\nA great honor to receive polling numbers like these. Record setting African American (25%) & Hispanic numbers (31%). __HTTP__ _E_\nThe problem with the U.S. is that our leadership has no knowledge or ability to negotiate or see into the future. Every nation beats us! _E_\nI discuss yesterday's tragedy at the Boston Marathon in today's video blog. __HTTP__ _E_\nThe new reality – China's demand for oil now controls the market __HTTP__ And OPEC gets away with ripping us off at $105! _E_\nIf Goofy Elizabeth Warren a very weak Senator didn't lie about her heritage (being Native American) she would be nothing today. Pick her H _E_\nThe fact is you're not going to see real growth or create real jobs until we get these exorbitant energy costs (cont) __HTTP__ _E_\nRT @NRA: But there IS something we will do on #ElectionDay: Show up and vote for the #2A! #DefendtheSecond #NeverHillary _E_\nGood luck and best wishes to my dear friend the wonderful and very talented Joan Rivers! Winner of Celebrity Apprentice amazing woman. _E_\nKellyanne Conway went to @MeetThePress this morning for an interview with @chucktodd. Dishonest media cut out 9 of her 10 minutes. Terrible! _E_\nPreliminary talks have begun for next season's #CelebrityApprentice. As usual we will have another great season. _E_\nWhy does @BarackObama continue to defend radical Islam? He is calling the Ft. Hood massacre workplace violence. _E_\nAlso appearing on the Miss USA Pageant will be Country Superstar Trace Adkins and Pop Rock Sensation Boys Like Girls... _E_\nI will be in Washington D.C. tomorrow to receive the 2014 Joseph Wharton Award at the Wharton Club of D.C.—a great honor! @Wharton _E_\n....John McCain has failed miserably to fix the situation and to make it possible for Veterans to successfully manage their lives. 
_E_\nThe gorgeous contestants of Trump Miss Universe are so excited to be simulcast on both @nbc and @Telemundo. Will be a beautiful show! _E_\nCall me old school but I believe in the old warrior's credo that to the victor go the spoils. In other word... (cont) __HTTP__ _E_\nVia @CBS19: Trump Winery President Nominated for Award by Wine Enthusiast Magazine __HTTP__ Congrats @EricTrump! _E_\nWatch this amazing ad from @autismspeaks and learn the signs... __HTTP__ _E_\nN.Y. City is paying FORTY MILLION DOLLARS to five men that many think are guilty as hell. So many facts should have been trial. Politics! _E_\nMy team of deplorables will be taking over my Twitter account for tonight's #debate#MakeAmericaGreatAgain _E_\nIt's finally happening Fiat Chrysler just announced plans to invest $1BILLION in Michigan and Ohio plants adding 2000 jobs. This after... _E_\n... People love to hear their names and their stories said out loud.\" – Think Like a Billionaire _E_\nMy interview with @PaulWTalk on @wjrradio on behalf of @MittRomney discussing why Michigan needs to go for Romney. __HTTP__ _E_\nFiring Bret was a tough one for me but Omarosa doesn't seem to mind. _E_\nSpoke to Roy Moore of Alabama last night for the first time. Sounds like a really great guy who ran a fantastic race. He will help to #MAGA! _E_\nRussia has never tried to use leverage over me. I HAVE NOTHING TO DO WITH RUSSIA NO DEALS NO LOANS NO NOTHING! _E_\nThe final part of restoring fiscal sanity to America is the most obvious and that's to control Obama style (cont) __HTTP__ _E_\nAgain don't forget to watch @hannityshow tonight on Fox at 9 o'clock EST. _E_\nWe crushed the original goal! I will write a $2 MILLION check to our campaign if we hit our end of month goal! __HTTP__ _E_\n.@IvankaTrump @EricTrump & @DonaldJTrumpJr take no prisoners in boardroom of 'All Star' @CelebApprentice. Where do they get it from? _E_\nYou don't necessarily need the best location. What you need is the best deal. 
The Art of the Deal _E_\nTrump Nears 100 days on Top via The Hill __HTTP__ _E_\nThe NYPD Surveillance Program kept NYC safe since 9/11. There will be tragic consequences for ending it. _E_\nA GREAT HONOR to spend time with our BRAVE HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United States of America! __HTTP__ _E_\n.@tedcruz must be doing something right if @cher sadly rated \"the 4th ugliest celebrity\" according to @listverse is attacking him. _E_\n.@WashTimes states Democrats have willfully used Moscow disinformation to influence the presidential election against Donald Trump. _E_\n.#Celebrityapprentice will be live tomorrow night. Entire cast will be there. Who do you like to win? _E_\n.@MittRomney much better on Libya and Middle East problems. Obama has no answer. _E_\nThe tax scam Washington Post does among the most inaccurate stories of all. Really dishonest reporting. _E_\nVarious media outlets and pundits say that I thought I was going to lose the election. Wrong it all came together in the last week and..... _E_\nMy @gretawire interview discussing @billlmaher's comments attacks on @MittRomney and @CNN & @msnbctv's low ratings __HTTP__ _E_\nI will be interviewed by @DavidMuir tonight at 10 o'clock on @ABC. Will be my first interview from the White House.... __HTTP__ _E_\nIf we keep on this path if we reelect @BarackObama the America we leave our kids and grandkids won't look (cont) __HTTP__ _E_\nThe Democrats just aren't calling about DACA. Nancy Pelosi and Chuck Schumer have to get moving fast or they'll disappoint you again. We have a great chance to make a deal or blame the Dems! March 5th is coming up fast. _E_\nOn 800 beautiful acres in Miami @TrumpDoral boasts 100000 sq. ft. in meeting space with event planning services __HTTP__ _E_\n....8 Dems totally control the U.S. Senate. Many great Republican bills will never pass like Kate's Law and complete Healthcare. Get smart! _E_\n.@BoonePickens Thank you for the T. 
Boone Pickens Entrepreneur Award—a great honor for me from a fantastic man. _E_\nToday's assignment: read chapter three of Think Big \"Basic Instincts.\" Focus on my acquisition of 40 Wall Street. _E_\nIraq is no longer our problem. We never should have been there in the first place! _E_\nI couldn't make the Faith and Freedom confab in Orlando so I sent a video... __HTTP__ _E_\nThank you! Four new #DebateNight polls with the MOVEMENT winning. Together we will MAKE AMERICA SAFE & GREAT AGAIN... __HTTP__ _E_\nHad dinner last night at Megu 845 United Nations Plaza fabulous food beautiful restaurant. _E_\nRT @EricTrump: Friends: Remember to VOTE tomorrow if you live in Louisiana Maine Kentucky or Kansas! #MakeAmericaGreatAgain __HTTP__ _E_\nBased on new oil prices the ugly windfarms being built in Scotland will quickly die! What a mess! _E_\nObamaCare not only has brought higher premiums decreased care & loss of jobs but now .1% Q1 growth. REPEAL BEFORE IT IS TOO LATE! _E_\nThe 5 star @Trump_Ireland graces over 500 acres fronting 2.5 miles on the Atlantic Ocean in County Clare Ireland __HTTP__ _E_\n\"Ice Skaters Invade Mar a Lago as Snow Falls on Palm Beach Salvation Army Ball!\" __HTTP__ via @GossipExtra _E_\n.@WendyWilliams Thanks for the nice statement especially about my wife and kids very much appreciated. _E_\nRead my full statement here on the Supreme Court's executive amnesty decision #imwithyou __HTTP__ _E_\nOur NOBEL PRIZE FOR PEACE president said I'm really good at killing people according to just out book Double Down. Can Oslo retract prize? _E_\nI look forward to watching @megynkelly tonight 8 PM ET. It will be interesting to see how she treats me—I think she will be very fair. _E_\nJoin me in Delaware Ohio tomorrow at 12:30pm! #DrainTheSwamp Tickets: __HTTP__ __HTTP__ _E_\nRT @DanScavino: Join #PEOTUS Trump & #VPEOTUS Pence live in West Allis Wisconsin! #ThankYouTour2016 #MAGA __HTTP__ __HTTP__ _E_\nToday @MittRomney addressed the NAACP. 
@BarackObama takes their vote for granted which is why there is such high Black unemployment. _E_\nThank you @GOPLeader Kevin McCarthy! Couldn't agree w/you more. TOGETHER we are #MAGA __HTTP__ _E_\nNo matter how much I accomplish during the ridiculous standard of the first 100 days & it has been a lot (including S.C.) media will kill! _E_\nTrump Int'l Hotel & Tower Vancouver will be a new landmark in a fantastic city __HTTP__ _E_\nLyin' Ted Cruz consistently said that he will and must win Indiana. If he doesn't he should drop out of the race stop wasting time & money _E_\nRT @theRealKiyosaki: Donald Trump coined the phrase 'multilevel focusing' I love it. It is when two ideas intersect & form a new innovation _E_\nCrooked Hillary said that I want guns brought into the school classroom. Wrong! _E_\nOne by one we are keeping our promises on the border on energy on jobs on regulations. Big changes are happening! _E_\nThank you! #GOPDebate Polls #MakeAmericaGreatAgain __HTTP__ _E_\n\"Leadership is perhaps the key to getting any job done.\" – The Art of The Deal _E_\nBay Bridge in California made in China for $1.8 billion. $300 million in cost overruns. Are we stupid? _E_\nOscar Pistorious the blade runner is as guilty as O.J. I wonder if the result will be the same? _E_\nThank you! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_\nThis guy @sethmeyers can't do a simple interview—saw him the other night stumbling & mumbling while trying to interview a guest. _E_\nWith autism being way up what do we have to lose by having doctors give small dose vaccines vs. big pump doses into those tiny bodies? _E_\nRT @Scavino45: Time lapse video of the border wall prototypes when they were being built in San Diego. Next phase underway: testing and ev... _E_\nVattenfall the company behind a proposed asinine windfarm off the coast of Aberdeen Scotland is having serious financial difficulty. 
_E_\nKeystone pipeline would create 20000 direct jobs another 50000 jobs servicing the pipeline. 700000 barrels a (cont) __HTTP__ _E_\nThe last person that Hillary or Bernie want to run against is Donald Trump and that is fact! _E_\nPutin has no respect for our President really bad body language. _E_\nWhy is Obama's auto bailout now creating jobs in China? He is ruining American industry. _E_\nGeneral Flynn was given the highest security clearance by the Obama Administration but the Fake News seldom likes talking about that. _E_\nOur ally Canada is 'frustrated' by @BarackObama's radical anti gas policies __HTTP__ BHO is forcing Canada to send gas to China. _E_\nEntrepreneurs: Stay focused and be tenacious. Remain fixed on your goals. _E_\nI am asking all citizens to believe in yourselves believe in your future and believe once more in America. #AmericaFirst __HTTP__ _E_\nWant access to Crooked Hillary? Don't forget it's going to cost you!#DrainTheSwamp #PayToPlay __HTTP__ _E_\nCrooked H is nasty to Sanders supporters behind closed doors. Owned by Wall St and Politicians HRC is not with you. __HTTP__ _E_\nCruz came to Mississippi there was nobody there he left the state. I had a rally in Madison MS with 10000! Thank you! _E_\nJust left $259 million rebuilding of Doral in Miami. Amazing Trump National Doral will be a masterpiece (if I do say so myself)! _E_\nThe big problem for little @MacMiller is that he's going to have to have another hit song not just his Donald Trump bonanza. _E_\nOn Bill O'Reilly in 5 minutes! _E_\nJust left the best golf course in the State of California @trumpgolfla. When in the LA area check it out even (cont) __HTTP__ _E_\nFBI director said Crooked Hillary compromised our national security. No charges. Wow! #RiggedSystem _E_\nAmerica needs strong leadership. Politicians can talk but they don't get things done. Video: __HTTP__ __HTTP__ _E_\nGreat American heroes who averted an attack in France. THANK YOU! 
Spencer Stone Anthony Sadler & Alex Skarlatos. __HTTP__ _E_\nRT @EricTrump: Aloha Hawaii: We would be honored to have your vote! Find your caucus __HTTP__ #TrumpWaikiki #Mahalo __HTTP__ _E_\nAny American who fights w/ ISIS in Iraq or Syria should have their passport revoked. If they try to come back in send them to Gitmo. _E_\nHave a GREAT EASTER I love you all! _E_\nJust got back from Asheville North Carolina where we had a massive rally. The spirit of the crowd was unbelievable. Thank you! #MAGA _E_\nTrue America is rapidly losing it's SPIRIT and when that's gone we will only be going in one direction and that direction is down! _E_\nTERRORISM IMMIGRATION AND NATIONAL SECURITY SPEECH TRANSCRIPT: __HTTP__ __HTTP__ _E_\nLeaders at Trump National Doral are only one under par. The great Ben Hogan said I've never seen a great course that was easy! _E_\n\"I don't measure a man's success by how high he climbs but how high he bounces when he hits bottom.\" George S. Patton _E_\nA friend is one who has the same enemies as you have. Abraham Lincoln _E_\nLightweight @AGSchneiderman's phony lawsuit against Trump U was decimated by the court—he's a loser! _E_\nPeople rarely say that many conservatives didn't vote for Mitt Romney. If I can get them to vote for me we win in a landslide. _E_\nVery important that NFL players STAND tomorrow and always for the playing of our National Anthem. Respect our Flag and our Country! _E_\nThe country of Georgia is a small wonder. Performing well economically under the leadership of @SaakashviliM. A great American ally. _E_\nCan't believe we are less than three weeks away from the election. Time certainly flies! _E_\nObama said in his SOTU that \"global warming is a fact.\" Sure about as factual as \"if you like your healthcare you can keep it.\" _E_\nSuper PACs should be disavowed by anyone running for President. They are a total scam on our system and country! I am self funding. 
_E_\nFor every CEO that drops out of the Manufacturing Council I have many to take their place. Grandstanders should not have gone on. JOBS! _E_\nA GREAT DAY IN WISCONSIN!Thank you #Racine & #Wausau! Just arrived in #EauClaire! #Trump2016#WIPrimary #TrumpTrain __HTTP__ _E_\nLeaving the great people of North Carolina. Amazing event. Heading to Tampa now! #VoteTrump _E_\nUnited States looks more and more like a paper tiger. Won't be that way if I win! _E_\nTHANK YOU ASIA! #USA __HTTP__ _E_\n.@PennyPritzker Really important to cover currency manipulation in trade agreements that's where China and others are beating us. Best! _E_\nThank you. __HTTP__ _E_\nGood luck to the people of Scotland whatever their decision may be on Thursday. The whole world is watching—really exciting! _E_\nThe @GOP should not agree to the ridiculous debate terms that @CNBC is asking unless there is a major benefit to the party. _E_\n#ICYMI: @KarlRove & @oreillyfactor discuss what Ted Cruz did to the great people of Iowa as they went to vote. __HTTP__ _E_\nAs promised my @SuperBowl pick is the San Francisco @49ers. _E_\nMy Twitter has been seriously hacked and we are looking for the perpetrators. _E_\nI look forward to my press conference on Weds of next week @TrumpTurnberry to discuss changes & big investment I'll make. Very exciting! _E_\nBecause Obama was so pathetic in the first debate tonight's audience will be humongous people want to see if he is for real. _E_\nTomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM _E_\nIt was just announced that @MacMiller's song \"DonaldTrump\" went platinum—tell Mac Miller to kiss my ass! _E_\nToday we are thrilled to welcome @Broadcom CEO Hock Tan to the WH to announce he is moving their HQ's from Singapore back to the U.S.A..... 
__HTTP__ _E_\nLooking at the figures and plans behind @Disney's acquisition of Lucas Film makes you realize how stupid @AOL (cont) __HTTP__ _E_\nObama's $1T+ deficit budget expanded welfare & green cronyism & it cut domestic bomb prevention in half __HTTP__ _E_\nRestoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_\nThank you America! #MAGARasmussen National PollDonald Trump 43%Hillary Clinton 40% __HTTP__ _E_\nThe so called A list celebrities are all wanting tixs to the inauguration but look what they did for Hillary NOTHING. I want the PEOPLE! _E_\nW/a newly expanded 27 holes of golfing Trump Intl.Palm Beach is ranked by Florida Golf Magazine as FL's #1 course __HTTP__ _E_\nCheck @billmaher's background & you will find he is not a smart guy—he just wants people to think he is just call him dummy. _E_\nGovernment's first duty is to protect the people not run their lives.\" – President Ronald Reagan _E_\nAn honor having the National Sheriffs' Assoc. join me at the @WhiteHouse. Incredible men & women who protect & serv... __HTTP__ _E_\nMy @SquawkCNBC interview discussing the @GOP convention @BarackObama's sealed records & @SenatorReid's tax claim __HTTP__ _E_\nInterview w/ @AndreaTantaros discussing my WH tour offer @KarlRove's terrible ads & Ashley Judd's candidacy __HTTP__ _E_\nAmericans understand that the US has a spending problem not a revenue problem. #TimeToGetTough __HTTP__ __HTTP__ _E_\nThe trade deal is a disaster she was always for it! #DemDebate _E_\nThe seriously failing @nytimes despite so much winning and poll numbers that will soon put me in first place only writes dishonest hits! _E_\nA bite from last night's @piersmorgan interview discussing Rev. Wright's Ed Klein interview and the 2012 campaign __HTTP__ _E_\nOur not very bright Vice President Joe Biden just stated that I wanted to carpet bomb the enemy. Sorry Joe that was Ted Cruz! 
_E_\nJust shows that you can have all the cards and lose if you don't know what you're doing. _E_\n#TrumpVine from D.C. __HTTP__ _E_\nWith a @SharkGregNorman designed course directly along the water @Trump_Charlotte is North Carolina's elite club __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nHad a great time hosting the Palm Beach County Republican at Mar a Lago. @IngrahamAngle gave a strong speech. She's great! _E_\n\"You have to be positive every single day. Positive stamina is a necessary ingredient for success.\" – Think Like a Champion _E_\nPlease @21Club go back to your original menu and preparation. Believe me it was much better. Let me know when the change is made! _E_\n.@pennjillette has received a star on the Hollywood Walk of Fame— about time! #CelebApprentice _E_\nBe aware of things that seem inexplicable because they can be a big step towards innovation. Donald J. Trump __HTTP__ _E_\nI am not only fighting Crooked Hillary I am fighting the dishonest and corrupt media and her government protection process. People get it! _E_\nDid the Boston terrorists register their guns? No. Another example of why gun control legislation is not the answer! _E_\nA Clinton economy = more taxes and more spending! #DebateNight __HTTP__ _E_\nI can't wait to donate @billmaher's $5 million to charity. Just waiting on @billmaher to send me the money. _E_\nPeoples lives are being shattered and destroyed by a mere allegation. Some are true and some are false. Some are old and some are new. There is no recovery for someone falsely accused life and career are gone. Is there no such thing any longer as Due Process? _E_\nSo sad that @CNN and many others refused to show the massive crowd at the arena yesterday in Oklahoma. Dishonest reporting! _E_\nThis story is not about Mr. Khan who is all over the place doing interviews but rather RADICAL ISLAMIC TERRORISM and the U.S. Get smart! 
_E_\nGetting back to the nicer and more normal parts of life Celebrity Apprentice is great tonight on NBC at 9. It will be a full two hour show! _E_\nMy twitter account is now reaching more people than the New York Times not bad. And we're only going to get better! _E_\nSenator (Doctor) Bill Cassidy is a class act who really cares about people and their Health(care) he doesn't lie just wants to help people! _E_\nThe story with Hillary will never change. __HTTP__ _E_\nThe big story is the unmasking and surveillance of people that took place during the Obama Administration. _E_\n.@TMobile gives terrible service and has many complaints just check. _E_\nIt's 10 AM: Two hours to go for Obama to easily pick up millions for charity! _E_\nGotta hand it to @IvankaTrump she loved Doral from the time we looked at it. The Trump Doral will be an Icon. #sayfie #newsmax _E_\nRubio is weak on illegal immigration with the worst voting record in the U.S. Senate in many years. He will never MAKE AMERICA GREAT AGAIN! _E_\nCongratulations to @SpeakerBoehner on standing strong and tying government shutdown to defunding ObamaCare. _E_\nVia @USATODAYsports: \"Last year it was Tiger Woods with the walk off\" __HTTP__ @CadillacChamp @DoralResort #TrumpDoral _E_\n.@Lord_Sugar....but you wouldn't notice because you have no vision and you are a total loser. _E_\nThank you to all Americans who participated in Nat'l Rx Drug Take Back Day. A record amount of drugs collected & disposed. We can do this! _E_\n...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_\nThe #AmazonWashingtonPost sometimes referred to as the guardian of Amazon not paying internet taxes (which they should) is FAKE NEWS! _E_\nSome of the Fake News Media likes to say that I am not totally engaged in healthcare. Wrong I know the subject well & want victory for U.S. _E_\nThank you! 
\"Trump's Defining Speech\" WSJ Editorial: __HTTP__ __HTTP__ _E_\nVery strange why do database records contradict @BarackObama and show he was only at Columbia 1 year? __HTTP__ _E_\nCrooked Hillary can't even close the deal with Bernie and the Dems have it rigged in favor of Hillary. Four more years of this? No way! _E_\nBig announcement tomorrow morning concerning the great Turnberry Resort in Scotland! _E_\nRT @TeamTrump: .@HillaryClinton had her chance and she BLEW IT. #BigLeagueTruth #Debates __HTTP__ _E_\nObama's war on women has lead to the biggest decline in female employment in 40 years. 4 more years?? _E_\nAfter Turkey call I will be heading over to Trump National Golf Club Jupiter to play golf (quickly) with Tiger Woods and Dustin Johnson. Then back to Mar a Lago for talks on bringing even more jobs and companies back to the USA! _E_\nLooking forward to my Iowa visit at @bobvanderplaats' @theFAMiLYLEADER Summit __HTTP__ Big crowd! _E_\nPresident Obama's Arab Spring is not looking so good right now! _E_\nGuests are raving about our exclusive hotel mattress and so we've made it available for purchase! __HTTP__ _E_\nKeystone must be approved. Oil is at a record high. We need to use our resources and support allies like Canada. _E_\n.@ximenaNR Great job we are all proud of you one of our all time BEST! _E_\nRT @foxandfriends: Getting the job done! Sen. Mitch McConnell delays August recess to work on health care bill __HTTP__ _E_\nMarshawn Lynch of the NFL's Oakland Raiders stands for the Mexican Anthem and sits down to boos for our National Anthem. Great disrespect! Next time NFL should suspend him for remainder of season. Attendance and ratings way down. _E_\nI just left the Trump Tower atrium it is packed with great people. #1 tourist attraction in NYC Fun! 
#TrumpTower _E_\nVia @LuxuryDaily by Joe McCarthy: \"Trump Collection leverages 2016 election frenzy for Washington debut\" __HTTP__ _E_\n\"TRUMP: IMMIGRATION BILL A REPUBLICAN 'DEATH WISH'\" __HTTP__ via @BreitbartNews by @mboyle1 _E_\nCongressman Ron DeSantis is a brilliant young leader Yale and then Harvard Law who would make a GREAT Governor of Florida. He loves our Country and is a true FIGHTER! _E_\nThe new reality. 'China Daily' is sold in street newspaper vending machines across DC. Why not? They own the place. _E_\nThe @TrumpChicago Spa offers 5 star services12 treatment rooms & 53 spa guestrooms overlooking Chicago skyline __HTTP__ _E_\nRepublicans and Democrats have both created our economic problems. _E_\nI am self funding my campaign so I do not owe anything to lobbyists & special interests. __HTTP__ __HTTP__ _E_\nJerry Finkelstein passed away last night a great New York mover & shaker & a really great guy! _E_\nSpeaking at the Red White and Blue Dinner in Maryland __HTTP__ _E_\nObama is laughing at Karl Rove & all the losers who spent hundreds of millions of dollars and didn't win one race including the big one! _E_\n\"Donald Trump to Build Trump Towers Complex in Rio de Janeiro\" __HTTP__ via Hispanically Speaking News _E_\nSecretary of Defense Chuck Hagel seems so lost and frankly dumb. He can't even speak properly. Poor leader in these very dangerous times! _E_\nThis is my last election. After my election I have more flexibility. Obama to @MedvedevRussiaE discussing our nuclear arsenal. _E_\nWhy I would not have approved the deal... __HTTP__ #trumpvlog _E_\nRT @foxandfriends: VIDEO: Rep. Scalise — GOP agrees on over 85 percent of health care bill __HTTP__ _E_\nYou have to believe in what you want. Keep your focus keep your momentum and remain patient and persistent. _E_\nObama keeps namedropping Bill Clinton he is no Bill Clinton. _E_\nBe tenacious. Being tenacious means you're tough and patient at once a formidable combination. 
_E_\nChina is the biggest environmental polluter in the World by far. They do nothing to clean up their factories and laugh at our stupidity! _E_\nThe NFL image is really tarnished! Now if the sponsors start leaving and the ratings go down the NFL will be in big trouble. Boring games! _E_\nEmmy Awards show was terrible last night. Same shows winning over and over again (politics). Amazing race a joke. Host Seth Meyers bombed! _E_\nMy @foxandfriends interview on risk for @GOP on immigration wasting money in Middle East & firing @OMAROSA __HTTP__ _E_\nI'll be on @foxandfriends on Monday at 7:30 AM...be sure to tune in. _E_\nMy prayers and best wishes are with the family of Edwin Jackson a wonderful young man whose life was so senselessly taken. @Colts _E_\nLooking forward to being awarded the '2015 Statesman of the Year' by @SRQRepublicans this Thursday. A record 2000+ attendees Can't wait! _E_\nHillary has called for 550% more Syrian immigrants but won't even mention \"radical Islamic terrorists.\" #Debate... __HTTP__ _E_\n....victory and cannot be burdened with the tremendous medical costs and disruption that transgender in the military would entail. Thank you _E_\n#MakeAmericaGreatAgain#TrumpPence16 __HTTP__ _E_\nThere is no excuse for riots in Ferguson regardless of the grand jury outcome. _E_\nI am the king of debt. That has been great for me as a businessman but is bad for the country. I made a fortune off of debt will fix U.S. _E_\n\"There can be no liberty unless there is economic liberty.\" Margaret Thatcher _E_\nThank you Brian Krzanich CEO of @Intel. A great investment ($7 BILLION) in American INNOVATION and JOBS!... __HTTP__ _E_\nYou haven't seen fireworks until you see @OMAROSA & @piersmorgan go at it again! Let's just say it's no happy reunion... _E_\nSteve Jobs is spinning in his grave Apple has lost both vision and momentum must move fast to get magic back! _E_\nGreat job @EricTrump! 
Proud of you!#AmericaFirst #RNCinCLE __HTTP__ _E_\nRT @FLOTUS: Thank you to all who participated in today's discussion on opioid abuse. By talking about it we can start to make a real diffe... _E_\nVia @BreitbartSports by @warnerthuston: \"Donald Trump Buys Four Time British @The_Open Golf Course\" __HTTP__ _E_\nThank you so much to __HTTP__ for naming me the 2015 Man of the Year. This is indeed a great honor for me! _E_\nWhat apology didn't they go around beating the crap out of people and robbing them? Why did they all confess? Aren't police convinced? _E_\nToday it was my great honor to welcome Prime Minister Erna Solberg of Norway to the @WhiteHouse a great friend and ally of the United States! Joint press conference: __HTTP__ __HTTP__ _E_\nIt is time to take care of OUR COUNTRY to rebuild OUR COMMUNITIES and to protect our GREAT AMERICAN WORKERS! #TaxReform __HTTP__ _E_\nCrooked Hillary Clinton put out an ad where I am misquoted on women. Can't believe she would misrepresent the facts! My hit was on China _E_\nMy @foxandfriends interview discussing the @nyjets acquisition of @TimTebow and the timing of @RepPaulRyan's plan __HTTP__ _E_\nWe have to make the U.S.A. RICH again so that we can afford to pay Social Security Medicareand Medicaid and STRONG to keep our enemies out _E_\nOur Marines are sent to kill the Taliban not coddle them. USMC should be praised not investigated. Semper Fi ! _E_\nWhy has nobody asked Kaine about the horrible views emanated on WikiLeaks about Catholics? Media in the tank for Clinton but Trump will win! _E_\n.@TraceAdkins isn't excited about their ideas. Are you? #CelebApprentice _E_\n\"Give me a smart idiot over a stupid genius any day.\" Samuel Goldwyn _E_\nBaseball player Ryan Braun turned out to be a total con man after so vociferously proclaiming his innocence only to be guilty as.hell! _E_\nCan't believe Major League Baseball just rejected @PeteRose_14 for the Hall of Fame. He's paid the price. So ridiculous let him in! 
_E_\n#CrookedHillary __HTTP__ _E_\nRT @FoxNews: .@jessebwatters on @DonaldJTrumpJr meeting with Russian attorney: I believe Don Jr. is the victim here. #TheFive __HTTP__ _E_\nDESPERATE @BarackObama is already asking supporters to 'find dirt' on @MittRomney's VP picks __HTTP__ Dirty tactics. _E_\nRandy Moss said he was the greatest receiver of all time—no way—it was @JerryRice! _E_\nA review of @MikeTyson's show great press on Trump International Golf Links Scotland and more in today's #trumpvlog __HTTP__ _E_\nElizabeth Warren often referred to as Pocahontas just misrepresented me and spoke glowingly about Crooked Hillary who she always hated! _E_\nWill be on Fox & Friends at 7 (10 minutes). ENJOY! _E_\nWow the final ratings for the Miss Universe Pageant show that it won in all key demos number one on Sunday. I have a winner! _E_\nRT @EricTrump: #ThrowbackThursdays @realDonaldTrump __HTTP__ _E_\nWe are inspired by the stories of everyday heroes who pull their communities from the depths of despair through leadership and love. __HTTP__ _E_\nMegyn Kelly has two really dumb puppets Chris Stirewalt & Marc Threaten (a Bushy) who do exactly what she says. All polls say I won debates _E_\nCongratulations to my son Eric on the fantastic job he has done in rebuilding Turnberry and its great Ailsa Course. Always support kids! _E_\n.@Modern_Do_Good #asktrump __HTTP__ _E_\n.@rushlimbaugh Rush I am in LA inspecting property (big job creator) & listening to you. You are truly fantastic thanks! _E_\nIowa was fantastic last night amazing crowd and people. I'm now in Florida getting ready to go to South Carolina. Big crowd very exciting _E_\n.@BarackObama was caught telling Russian PM @MedvedevRussiaE that he can be more 'flexible' in his second term. Russia thinks he's weak. 
_E_\nVia @Newsmax_Media by @OwenTew: \"Donald Trump: 'Last Thing We Need Is Another Bush'\" __HTTP__ _E_\nRT @dmartosko: 'Duck Dynasty' star Phil Robertson says he'll back Trump for president __HTTP__ via @MailOnline _E_\nI love taking lawsuits all the way when I'm right. @AGSchneiderman is finding that out the hard way! _E_\nISIS has infiltrated countries all over Europe by posing as refugees and @HillaryClinton will allow it to happen h... __HTTP__ _E_\nRemember that I am self funding my campaign. Hillary Jeb and the rest are spending special interest and lobbyist money.100% CONTROLLED _E_\nJoin Governor Mike Pence in Reno Nevada tonight at 7pm! Tickets available at: __HTTP__ _E_\nFact: without Texas and states reaping the fracking boom Obama's job record would go from bad to worse! _E_\nA great gift idea is my new book #TimeToGetTough easy to order on Amazon __HTTP__ _E_\nJoin me live from the @WhiteHouse. __HTTP__ _E_\nI will be on Morning's with Maria on the Fox Business Network tomorrow during the 7am and 8am ET hours. _E_\nMajor League Baseball was really smart when they wouldn't let Mark Cuban buy a team. Was it his financials or the fact that he's an asshole? _E_\nBrian if I'm well past the last exit to relevance how come you spent so much time reading my tweets last night? @NBCNightlyNews _E_\nMy two wonderful sons Don and Eric will be on @foxandfriends at 7:02 now! Enjoy. _E_\nVia @Newsmax_Media:  Trump: I'd Be Better 'Meet the Press' Host Than 'Moron' Chuck Todd __HTTP__ _E_\nI'm on @ETonlineAlert tonight to talk about what the Yankees should have done about A Rod long ago __HTTP__ _E_\nAttorney General Jeff Sessions has taken a VERY weak position on Hillary Clinton crimes (where are E mails & DNC server) & Intel leakers! _E_\n\"You have to be patient as well as enthusiastic when it comes to your goals. Think big but be realistic.\" – Think Big _E_\nI was putting together my early deals in New York & I was advised by many that I was too young. 
Believe in yourself & you can do anything. _E_\nGreat honor to have @GOP General Counsel #JohnRyder as a Trump delegate in TN. RNC meeting well worth it! Unifying the party! _E_\nArena was packed totally electric! _E_\nMelania will be interviewed by @morningmika on @Morning_Joe now (8:30 A.M.). ENJOY! _E_\nYesterday @BarackObama actually spent a full day in Washington. He didn't campaign fund raise or play golf. Shocking. _E_\nJon Huntsman called to see me. I said no he gave away our country to China! @JonHuntsman _E_\nRT @foxandfriends: FOX NEWS ALERT: 2 US drone strikes in Somalia target Al Qaeda and Al Shabaab __HTTP__ _E_\nFlorida Ethics Commission Advocate comes down hard on Rubio. So do two people who worked with him. Said he used the wrong credit card! Sure. _E_\nIt will now start to cool down concerning Sterling and the Clippers. This mess will start to fade after litigation into the murky past! _E_\n. @deesnider is a great guy & a total winner! He understood he did not leave me any other choice. Look forward to keeping in touch. _E_\nNobody understands politicians like I do all talk and no action. They will never get our country where it needs to be truly great again! _E_\nMrs. Goldberg who filed the Chicago case many years ago is a vicious and conniving woman loved beating her. _E_\nMUST READ It's time people listened to Trump' says mother of gunned down teenage football star __HTTP__ SECURE THE BORDER! _E_\nWow China's growth accelerated 7.8% in third quarter. If the U.S. had half that number we would be the talk of the World need leadership _E_\n.@JTimberlake It was great having you play The Blue Monster. Thanks for your nice statements many agree that it is best they've seen! _E_\nMy rallies are not covered properly by the media. They never discuss the real message and never show crowd size or enthusiasm. _E_\nI will be doing Greta Van Susteren @gretawire tonight at 10 PM on Fox News talking about China & Mitt's failed campaign team. 
_E_\nGlad to hear Clint Eastwood endorsed @MittRomney. He understands that America needs a big boost to be strong again. _E_\nJust got back to the White House from the Great States of Texas and Louisiana where things are going well. Such cooperation & coordination! _E_\nNorth Korea is looking for trouble. If China decides to help that would be great. If not we will solve the problem without them! U.S.A. _E_\n\"Definiteness of purpose is the starting point of all achievement.\" W. Clement Stone _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nIf you can't see it it will never happen. Bring your vision to fruition through perseverance and hard work. That will build momentum. _E_\nI know Shia LaBeouf @thecampaignbook and when sober a really nice guy. Must get act together fast before too late. _E_\nThe economy is in terrible shape. @BarackObama is manipulating the job numbers to hide the truth. __HTTP__ _E_\nWhen it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it'... (cont) __HTTP__ _E_\nBest speech in #GoldenGlobes history __HTTP__ _E_\nWatching the madness in Cyprus? If our government keeps spending trillion dollar deficits that could happen here. _E_\n#CelebApprentice #TeamVortex or #TeamInfinity? _E_\nBig day on Thursday for Indiana and the great workers of that wonderful state.We will keep our companies and jobs in the U.S. Thanks Carrier _E_\nIt's still exciting after all these years and this cast is special! _E_\nOne of the greatest tributes to a father I have ever witnessed given to the great @jacknicklaus by his wonderful son __HTTP__ _E_\nHow will raising taxes create jobs? Washington is all out of answers. New leadership is needed. _E_\nAccording to a @gallupnews poll over 60% think ObamaCare will make things worse for taxpayers __HTTP__ ObamaCare is a T A X. 
_E_\nI will be holding a major briefing on the Opioid crisis a major problem for our country today at 3:00 P.M. in Bedminster N.J. _E_\n.@robertjeffress I greatly appreciate your kind words last night on @FoxNews. Have great love for the evangelicals great respect for you. _E_\nMake sure to grab your copy of this month's @Newsmax_Media detailing The Trump Effect __HTTP__ _E_\nMusic cues audience participation sounds like a very active Team Power. #CelebApprentice _E_\nNATIONAL DEBT January 2009 = $10.6 TRILLIONAugust 2016 = $19.4 TRILLION __HTTP__ _E_\nDem Gov. of MN. just announced that the Affordable Care Act (Obamacare) is no longer affordable. I've been saying this for years disaster! _E_\n\"When you can't make them see the light make them feel the heat.\" – President Ronald Reagan _E_\nOver 35 CIA operatives were on the ground in Benghazi the night of the 9.11 attack __HTTP__ Still a phony scandal ? _E_\nThe people of Ireland are very smart—they just killed an ugly windfarm which would've hurt tourism @AlexSalmond __HTTP__ _E_\n... to build a wind farm and destroy this view! _E_\nI will be on @seanhannity tonight from Las Vegas Nevada at 10pmE. Enjoy! #Hannity #Trump2016 __HTTP__ _E_\nThe great State of Nebraska can do much better than @BenSasse as your Senator. Saw him on @greta totally ineffective. Wants paid for pols. _E_\nIran's continued public threats of annihilating @Israel are unacceptable. Iran's nuclear drive must be stopped. #TimeToGetTough _E_\nVery little pick up by the dishonest media of incredible information provided by WikiLeaks. So dishonest! Rigged system! _E_\nEntrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. _E_\nIt was an honor to host our American heroes from the @WWP #SoldierRideDC at the @WhiteHouse today with @FLOTUS @VP... __HTTP__ _E_\nThis was sent out from Ted Cruz as Iowans arrived at their caucus sites to vote. 
#CruzFraud __HTTP__ _E_\nRT @GOP: Reminder: last year Clinton pledged she had turned over all work related email under penalty of perjury __HTTP__ _E_\nWith the labor participation rate at a 36 yr. low over 92M Americans are out of the work force. _E_\nChristians in the Middle East have been executed in large numbers. We cannot allow this horror to continue! _E_\nJust arrived at Camp David where I am monitoring the path and doings of Hurricane Harvey (as it strengthens to a Class 3). 125 MPH winds! _E_\n'Democratic operative caught on camera: Hillary PERSONALLY ordered 'Donald Duck' troll campaign that broke the law' __HTTP__ _E_\nWhen Warren Buffett & others play w/ bankruptcy nobody cares—when Trump plays the game it becomes a big deal! __HTTP__ _E_\nHeading back from a very exciting two days in Davos Switzerland. Speech on America's economic revival was well received. Many of the people I met will be investing in the U.S.A.! #MAGA _E_\nAsk yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if hard times hit. _E_\nI always said that @lancearmstrong had to keep fighting the charges. By stopping he gave his enemies an opening. _E_\nLanding in New Hampshire soon to talk about the massive drug problem there and all over the country. _E_\nDonald J. Trump Ethics Reform Plan For Washington D.C. __HTTP__ _E_\n\"Destiny has a part to play in your life and in your business – so give it a chance to work.\" – Think Like a Champion _E_\nMy wife Melania will be on @QVC today @ 5 PM selling really beautiful jewelry at a very low price. Perfect for Mother's Day—call in! _E_\nEntrepreneurs: Keep your momentum. See yourself as victorious and leading a winning team. Keep everyone moving forward. _E_\nRT @foxandfriends: Mark Levin: The collusion is among the Democrats __HTTP__ _E_\n.@marklevinshow has written a great book Plunder and Deceit. He powerfully analyzes issues that are crucial to us today. Read it! 
_E_\nWhy is the Pentagon wasting precious dollars on going 'green.' Complete waste. We need the best & easiest fuel for our military. _E_\nIt was great being with Luther Strange last night in Alabama. What great people what a crowd! Vote Luther on Tuesday. _E_\nTV's darling @TheRealMarilu is back in this year's \"All Star\" @CelebApprentice. Marilu is a fierce competitor. _E_\n.@CNN & @CNNPolitics did not say that lawyer Beck lost the case and I got legal fees. Also she wanted to breast pump in front of me at dep. _E_\n\"Keep your brand standard in mind and your expansion will seem possible as well as gratifying.\" – Midas Touch _E_\nI will be handing over my Twitter account to my team of deplorables for tonight's #debate#MakeAmericaGreatAgain _E_\nAnti Morsi protests are 10 times larger than 2011 anti Mubarek protests. Interesting. _E_\n\"Trump: Illegal Immigrants Are Getting Treated Better than Vets\" __HTTP__ via @nro by @AndrewE_Johnson _E_\nI will be live tweeting during the @ApprenticeNBC tonight at 9PM ET. _E_\nIf you fail once twice three times it doesn't matter. Learn from your mistakes and push forward to VICTORY the sweetest feeling there is! _E_\nThank you Laura! __HTTP__ _E_\n.@oreillyfactor was very negative to me in refusing to to post the great polls that came out today including NBC. @FoxNews not good for me! _E_\nHAPPY NEW YEAR & THANK YOU! __HTTP__ __HTTP__ _E_\nMillions of dollars being spent on false TV ads by special interest groups who own Rubio & Cruz.When you see them think of your puppet POLS _E_\nIt's Friday. How many bald eagles did wind turbines kill today? They are an environmental & aesthetic disaster. _E_\nJust won the highest rated sanitary award in NY—an A & the food is great also. Trump Grill/ 57th & 5th. _E_\nEvery dollar @BarackObama spends costs $1.40 with interest borrowed from China on our children and grandchildren's backs. CUT CAP BALANCE! 
_E_\nLooking forward to being honored with the prestigious 'Friend of Israel' award at the @Algemeiner Gala Dinner __HTTP__ _E_\nObama's plan to have Russia stand up to Iran was a horrible failure that turned America into a laughingstock. #TimeToGetTough _E_\n\"Success is dependent on effort.\" Sophocles _E_\nUse adverse events and monumental challenges to make you strong Think Big _E_\nThis Sunday's LIVE FINALE of @ApprenticeNBC puts @pennjillette against @TraceAdkins. Watch two great competitors battle to win! _E_\nWhat you dream about is what you do. If you cannot even dream of doing big things you will never do anything big. Think Big _E_\nWe have just begun! __HTTP__ _E_\nGreat poll Florida! Thank you! __HTTP__ _E_\n.@WhiteHouse Briefing with Director Marc Short and Director Mick Mulvaney... __HTTP__ _E_\nCelebrate 2013 @TrumpSoHo with downtown's nicest #NYE party. Get your tickets now: __HTTP__ _E_\nInspiration exists but it has to find us working. Pablo Picasso _E_\n20 Most Anticipated Hotel Openings of 2016: Trump International Hotel Washington D.C. __HTTP__ _E_\nI said gas prices would sky rocket after election Opec payback! _E_\nOil prices just went over $100 per barrel for first time in nine months! _E_\nObama is on yet another two day West Coast fundraising swing. Has to fit it in before his 15 day tax payer funded vacation. _E_\nSnowden is a spy who has caused great damage to the U.S. A spy in the old days when our country was respected and strong would be executed _E_\nI just answered my Facebook fan's questions in the latest #AskTheDonald watch the video __HTTP__ _E_\nWho says Obama will do better in the next debate has he gotten smarter in 2 weeks! _E_\n\"The belief that security can be obtained by throwing a small state to the wolves is a fatal delusion.\" Winston Churchill _E_\nThank you @Morning_Joe & @morningmika a great show! 
#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nI wonder how officials @TexasTech feel now after treating Coach Mike Leach with so little respect after their loss to @TCUFootball 82 27? _E_\nInteresting article by @newtgingrich @HumanEvents: \"WHY ROVE AND STEVENS ARE PLAIN WRONG\" __HTTP__ _E_\nVia @AP on @washingtonpost: Trumps look at building 18 hole golf course on former Kluge estate in rural Virginia __HTTP__ _E_\nWill be in South Bend Indiana in a short while big rally! See you soon! _E_\nI have just ordered Homeland Security to step up our already Extreme Vetting Program. Being politically correct is fine but not for this! _E_\n.@KAThomas212 Congratulations on joining the finest and fastest growing group of very talented people in the City. You will be GREAT! _E_\nObama must now start focusing on OUR COUNTRY jobs healthcare and all of our many problems. Forget Syria and make America great again! _E_\nHeading down to D.C. __HTTP__ _E_\nDOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER MAKE AMERICA GREAT AGAIN! _E_\n10000 people in South Carolina unbelievable evening! Will be in New Hampshire tomorrow love it. __HTTP__ _E_\nGlenfiddich is a joke—should have chosen Andy Murray—U.S. Open & Olympic gold winner—as Top Scot instead of a total loser! _E_\nThe Ebola patient who came into our country knew exactly what he was doing. Came into contact with over 100 people.Here we go I told you so! _E_\nThank you! __HTTP__ _E_\nDemocrats refused to vote down their ObamaCare subsidy. While Americans will be hit w/ rising premiums Washington won't feel any pain _E_\n47M on food stamps. Over 23M Americans unemployed. 50% of college grads unemployed. And Obama wants us talking about Big Bird. _E_\nI will be speaking Monday September 24 (10 A.M.) at Liberty University to a record setting student body. I look forward to it! _E_\nBreaking news negotiations with Iranians broke down because Obama insisted that they use ObamaCare. _E_\nAs stated here is the press release. 
__HTTP__ _E_\nObamaCare is a complete disaster. Many of my friends have to scale down their businesses because they can't afford it. Terrible. _E_\nGood luck to @Joy_Villa on her decision to enter the wonderful world of politics. She has many fans! _E_\nDeparting now thank you Cedar Rapids Iowa. This is a MOVEMENT! __HTTP__ _E_\nIf I would have challenged the man the media would have accused me of interfering with that man's right of free speech. A no win situation! _E_\nWhat a shame that Kobe Bryant was so badly injured last night a truly great champion who brought the Lakers back from oblivion this year! _E_\nWill be interviewed tonight by @seanhannity on @FoxNews at 10 PM. Enjoy! _E_\nNew episode starting now! _E_\nJust arrived in New Hampshire. Another packed venue! Will be fun. _E_\nIn debate @MittRomney should ask Obama why autobiography states born in Kenya raised in Indonesia. _E_\nAfter Poland had a great meeting with Chancellor Merkel and then with PM Shinzō Abe of Japan & President Moon of South Korea. _E_\nCongratulations Chuck. Must be wonderful to have Donald Trump as your guest #BeCool! #Trump2016 __HTTP__ _E_\nRT @Scavino45: Manufacturer Optimism Hits Record High After #TaxReform Plan Revealed __HTTP__ _E_\nI promise to do a new #trumpvlog when I get back next week lots of requests. Thanks! _E_\n.@MikeTyson and @SpikeLee I gave a great review of your show in my #trumpvlog __HTTP__ _E_\nI will be making some very big campaign stops next week big crowds and tremendous energy! MAKE AMERICA GREAT AGAIN _E_\nI dictate my tweets to my executive assistant and she posts them. Time is money The Art of the Deal. _E_\nAs usual the ObamaCare premiums will be up (the Dems own it) but we will Repeal & Replace and have great Healthcare soon after Tax Cuts! _E_\nBlack Lives Matter protesters totally disrupt Hillary Clinton event. She looked lost. This is not what we need with ISIS CHINA RUSSIA etc. _E_\nThe failing @UnionLeader newspaper in N.H. 
just sent The Trump Organization a letter asking that we take ads. How stupid how desperate! _E_\n.@AC360 Has the absolutely worst anti Trump talking heads on his show. Dopey writer O'brian knows nothing about me or my wealth. A waste! _E_\n.@BradSteinle Thank you for yr wonderful tweet of July 4. I wanted a little time to go by before calling. Your sister & family are amazing. _E_\nEgypt is turning into a hot bed of radical Islam. The current protest is another coup attempt. We should never have abandoned Mubarak. _E_\nThe jury in the Jodi Arias trial is believe it or not still out. You never know but such a long deliberation could be good for the defense _E_\n#TextTrump88022 for exclusive @realDonaldTrump updates! We will Make America Great Again! _E_\nGreat read: \"Hollywood can kiss Adam Corolla's ass he's going Trump funding\" __HTTP__ via @upstartbusiness _E_\nI hope people will start to focus on our Massive Tax Cuts for Business (jobs) and the Middle Class (in addition to Democrat corruption)! _E_\nChina is neither an ally or a friend they want to beat us and own our country. _E_\n72% of refugees admitted into U.S. (2/3 2/11) during COURT BREAKDOWN are from 7 countries: SYRIA IRAQ SOMALIA IRAN SUDAN LIBYA & YEMEN _E_\n\"Obama doesn't respect the fact that the money he wastes belongs to us. He thinks that the wealth you create (cont) __HTTP__ _E_\nOpen for the 2014 season Mar a Lago Club is an architectural masterpiece offering the finest amenities in the world __HTTP__ _E_\nLIVE on #Periscope: Good morning Iowa! Let's #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nDennis Rodman is a project manager tonight on Celebrity Apprentice watch Dennis in full action! _E_\n#TrumpVlog South African justice __HTTP__ _E_\nDo you believe it? The Obama Administration agreed to take thousands of illegal immigrants from Australia. Why? I will study this dumb deal! 
_E_\nJust finished a press conference in Trump Tower wherein I gave information on which VETERANS groups got the $5600000 that I raised/gave! _E_\nStarting to develop a much better relationship with Pakistan and its leaders. I want to thank them for their cooperation on many fronts. _E_\nAnother Dishonest Politician #LightweightSenatorMarcoRubio __HTTP__ _E_\nFL KS ME MD MN NJ OR & WV! It's the LAST DAY to mail in voter reg forms. Get the forms at... __HTTP__ _E_\nWho would you rather have negotiating for the U.S. against Putin Iran China etc. Donald Trump or Hillary? Is there even a little doubt? _E_\nWill be doing Fox & Friends at 7 A.M. It never ends (hopefully)! _E_\nGreat victory for people of Blackdog Scotland. They defeated substation stopping inefficient & ugly wind turbines.@AlexSalmond _E_\nI will be interviewed on @GMA Good Morning America tomorrow at 7:00 A.M. Big new ABC poll coming out I hope I do well! _E_\nWithout more Republicans in Congress we were forced to increase spending on things we do not like or want in order to finally after many years of depletion take care of our Military. Sadly we needed some Dem votes for passage. Must elect more Republicans in 2018 Election! _E_\nThe polls have been really amazing we are all tired of incompetent politicians and bad deals! __HTTP__ _E_\n.@Omarosa admitting she's a threat in the boardroom that's not revelation knowledge. #CelebApprentice _E_\nGive your goals substance. Imbue them with a value that exceeds the monetary. Make them count on as many levels as you can. _E_\nCongrats @TrumpChicago for being named #3 Best Business Hotel in Chicago in @TravlandLeisure's 2014 World's Best __HTTP__ _E_\nBe prepared there is a small chance that our horrendous leadership could unknowingly lead us into World War III. _E_\nRT @TheFive: Trump just won on law & order and now he's delivering the goods. 
@jessebwatters #thefive _E_\nA former classmate Roy Eaton has published a great book \"Makers Shakers & Takers\" – check it out __HTTP__ _E_\n.@TraceAdkins great job on FOX this morning. Keep up the good work! _E_\nThe two dumbest interviews in history may go down as Lance Armstrong who is being sued by everyone in the world & Michael Douglas. _E_\nI will be going to Trump Links at Ferry Point for the official opening of this long delayed (but future NYC treasure) course. Great job D _E_\nMy @foxandfriends int. destroying Schneiderman's frivolous suit which he brought after meeting Obama on Thurs. __HTTP__ _E_\nHaving a vision for something can be a very powerful force for accomplishment. Midas Touch _E_\nThe Republican Party of New York has been conditioned to lose and there is no excuse for this. Leadership must move fast and decisively! _E_\nIf their highly unethical behavior including begging me for ads isn't questionable enough they have endorsed a candidate who can't win. _E_\n... than his destruction of Scotland's magnificent lands.@AlexSalmond _E_\nMY POSITION ON VISAS#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nCongrats to @FLGovScott on today's inauguration and having done a great job! _E_\nWashington is in total gridlock—no trust no leadership—very interesting! _E_\nNorth Korea is behaving very badly. They have been playing the United States for years. China has done little to help! _E_\nShock! Obamacare's high risk pool spending DOUBLED government estimates __HTTP__ @BarackObama is bankrupting this country! _E_\n\"Once you know you love your job never stop and never give up.\" – Think Like a Billionaire _E_\nThink big. Stay focused. Be passionate. Don't ever give up _E_\nIf you can count the amount of time you put into a project on your fingers then you haven't spent enough time on it. _E_\nDid A Rod really try to buy the papers that would implicate him re. drugs wow that would be the end a disaster! _E_\nP.S. 
42 in Queens is getting a truckload of food and much needed supplies for Rockaway residents #HurricaneSandyRelief _E_\nRalph Norman ran a fantastic race to win in the Great State of South Carolina's 5th District. We are all honored by your success tonight! _E_\nJoin me this Thursday in Wilmington Ohio at noon! #ImWithYouTickets: __HTTP__ __HTTP__ _E_\nIt was an honor to be with @MittRomney the night he clinched the nomination. He will defeat @BarackObama and be a tremendous POTUS. _E_\nJust out: Neera Tanden Hillary Clinton adviser said \"Israel is depressing.\" I think Israel is inspiring! _E_\nIf NFL fans refuse to go to games until players stop disrespecting our Flag & Country you will see change take place fast. Fire or suspend! _E_\nUPCOMING RALLIES JOIN ME!TOMORROWFletcher NC @ 12pm. __HTTP__ OH @ 7pm. __HTTP__ _E_\nHis spending is reckless: @BarackObama will set a record fourth year of a $1 trillion budget deficit. __HTTP__ _E_\nJoin me LIVE from the Rose Garden at 1:30pmE with Prime Minister Alexis Tsipras of Greece. __HTTP__ __HTTP__ _E_\nThe exclusive home of @PGATOUR's @CadillacChamp @TrumpDoral sits on 800 beautiful acres in the center of Miami __HTTP__ _E_\nOrder my book CRIPPLED AMERICA for your holiday gifts. I will be signing books for the next two weeks! __HTTP__ _E_\n\"Donald Trump to crown @FIU as Miss Universe venue\" __HTTP__ via @MiamiHerald _E_\nA really nice article about the Blue Monster from \"The Street.\" __HTTP__ _E_\nDonald Trump promises 'world class' Crandon Park golf course __HTTP__ via @WPLGLocal10 by @GlennaOn10 _E_\nRT @DRUDGE_REPORT: 'Win lose deal that benefits Iran and hurts United States'... __HTTP__ _E_\nFailed Presidential Candidate Mitt Romney is having a news conference tomorrow to criticize me. (1/2) _E_\nWe've gone from $10 trillion that the president inherited from all prior presidents to $16 trillion @MittRomney _E_\nWill be on Fox & Friends in five minutes enjoy and good morning! 
_E_\nWe have made more progress in the last nine months against ISIS than the Obama Administration has made in 8 years.Must be proactive & nasty! _E_\nSorry I never went bankrupt and don't wear a wig (it's all mine)! _E_\nIt's Thursday. How many people have lost their healthcare today? _E_\n\"One man with courage is a majority.\" Thomas Jefferson _E_\nRemember when I recently said that Brussels is a hell hole and a mess and the failing @nytimes wrote a critical article. I was so right! _E_\nOil is double the price now compared to last year OPEC is laughing at @BarackObama. _E_\nCongratulations to the Miss USA Pageant it was the #1 telecast of the night among ABC CBS NBC and Fox. A great show and a huge success. _E_\nThank you. __HTTP__ _E_\nGas prices are way too high. With an economy contracting and lower demand how do OPEC & the speculators get away with this?! _E_\nOn the shores of the Lake Norman @Trump_Charlotte features a world class course designed by @SharkGregNorman __HTTP__ _E_\nIt was recently reported that 3rd rate $ losing @Politico is a foil for the Clintons. Questions given to Clinton in advance. No credibility. _E_\nHillary was involved in the e mail scandal because she is the only one with judgement so bad that such a thing could have happened! _E_\nWow! New National Zogby Poll just out:.TRUMP 45. CRUZ 13. RUBIO 8. Big numbers. _E_\nBefore I or anyone saw the classified and/or highly confidential hacking intelligence report it was leaked out to @NBCNews. So serious! _E_\nRetail sales are at record numbers. We've got the economy going better than anyone ever dreamt and you haven't seen anything yet! _E_\nI met a Trump Twitter hater last night (well known). As he came near me he nervously said Mr. Trump it is an honor to meet you sir! Nice _E_\nThe Iraqi Army is useless. President Obama stay the hell out of Iraq (we should never have been there in the first place). 
_E_\n.@tedcruz should not make statements behind closed doors to his bosses he should bring them out into the open more fun that way! _E_\nObama's goal of 1 million electric car sales is a little off by over 910000 __HTTP__ $100B of our money wasted! _E_\nWhen I was 18 people called me Donald Trump. When he was 18 @BarackObama was Barry Soweto. Weird. _E_\nThank you Miami! In 6 days we are going to WIN the GREAT STATE of FLORIDA and we are going to win back the White... __HTTP__ _E_\nTweet me your questions for the next #trumpvlog.... _E_\nI just bought stock in Tiffany & Company and McDonald's. Two ends of the spectrum but I like both companies. _E_\nMiss Alabama @_KatherineWebb stopped by to say hello today. __HTTP__ _E_\nAm in Bedminster for meetings & press conference on V.A. & all that we have done and are doing to make it better but Charlottesville sad! _E_\nThe new hot term that they have recently invented is POLAR VORTEX give me a break! _E_\nThank you for your support on my way now! See you soon. #TrumpTrain __HTTP__ _E_\nIf we can help little #CharlieGard as per our friends in the U.K. and the Pope we would be delighted to do so. _E_\nVia @fitsnews:\"Donald Trump Surges In New Hampshire Poll: MOGUL REALITY STAR EMERGES AS GRANITE STATE'S 'ANTI BUSH' __HTTP__ _E_\nCastro Chavez and Ahmadinejad are all anxiously awaiting our election results. They are praying Obama wins. _E_\nBorder agent: We might as well abolish our immigration laws altogether __HTTP__ _E_\nIn the East it could be the COLDEST New Year's Eve on record. Perhaps we could use a little bit of that good old Global Warming that our Country but not other countries was going to pay TRILLIONS OF DOLLARS to protect against. Bundle up! _E_\nGo as far as you can see when you get there you'll be able to see farther. J.P. Morgan _E_\nCongratulations to @WWERaw on passing 1000 episodes. @WWE is still going strong after all these years @VinceMcMahon is great! 
_E_\nFake News CNN is looking at big management changes now that they got caught falsely pushing their phony Russian stories. Ratings way down! _E_\nThe media is going crazy. They totally distort so many things on purpose. Crimea nuclear the baby and so much more. Very dishonest! _E_\nEntrepreneurs: Keep the big picture in mind. There are always opportunities & possibilities and thinking too small can negate a lot of them _E_\nGovernor Rick Scott of Florida did really poorly on television this morning. I hope he is O.K. _E_\nTrump promises special session to repeal Obamacare: __HTTP__ _E_\nTweet me your New Year's resolution to make America great again! #TrumpNewYearsRes __HTTP__ _E_\nThe U.S. has appealed ro Russia not to intervene in Ukraine Russia tells U.S. they will not become involved and then laughs loudly! _E_\nI will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_\nWill be doing Fox & Friends at 7 A.M. (1 hour). ENJOY! _E_\nOn my way! __HTTP__ _E_\n\"A brand is not a logo. A brand is the promise you put out there and the experience you deliver.\" – Midas Touch _E_\nOne of the best moves I made early in my career was buying the air rights from Tiffany's flagship. Trump Tower gleams over Fifth Avenue. _E_\nWHAT THEY ARE SAYING ABOUT MIKE PENCE \"DOMINATING\" THE DEBATE: __HTTP__ #VPDebate _E_\nSomebody with aptitude and conviction should buy the FAKE NEWS and failing @nytimes and either run it correctly or let it fold with dignity! _E_\nFiscal mismanagement of cash costing US Taxpayer billions cut fraud and waste before cutting funding for Seniors. _E_\nStock market hit yet another all time record high yesterday. There is great confidence in the moves that my Administration.... _E_\nI want to end the day by saying there is no check I would rather write than that to a good charity designated by our President. 
_E_\nDestiny has a part to play in your life and in your business so give it a chance to work. Think Like a Champion _E_\nWhy doesn't @JebBush in his ads show my answer to his statement in the debate? _E_\nRaleigh North Carolina was fantastic last night. Such incredible spirit. We all want to and will MAKE AMERICA GREAT AGAIN! _E_\nRT @EricTrump: Very proud of what my father has accomplished in the past 7 months Wishing him amazing luck and success tonight! #NVcaucus ... _E_\nOur country has a big heart. And it's a point of national pride that we take care of our own. #TimeToGetTough (cont) __HTTP__ _E_\n.@AndreaTantaros's radio show is a great addition to talk radio. She is sharp talented & great sense of humor. Congratulations. _E_\nAttorney General Bill Shuette will be a fantastic Governor for the great State of Michigan. I am bringing back your jobs and Bill will help! _E_\nOnce again #MSM is dishonest. Schlonged is not vulgar. When I said Hillary got schlonged that meant beaten badly. _E_\nFailed presidential candidate Mitt Romney the man who choked and let us all down is now endorsing Lyin' Ted Cruz. This is good for me! _E_\nSpoiler @dennisrodman has really got his act together so far on the upcoming season of @CelebApprentice... _E_\nInstead of driving jobs and wealth away AMERICA will become the world's great magnet for innovation and job creati... __HTTP__ _E_\nPoor @JebBush spent $50 million on his campaign I spent almost nothing. He's bottom (and gone) I'm top (by a lot). That's what U.S. needs! _E_\nRT @realDonaldTrump: \"President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers... _E_\nRT @MarkHalperin: Utah Speaker of the House announces endorsement of @realDonaldTrump. Says @DonaldJTrumpJr played a big role _E_\nNow that the ObamaCare website contractor has been terminated for obvious incompetence is the person who hired them going to be fired? 
_E_\nGreat article by @WayneRoot @theblaze Obama's College Classmate: 'The Obama Scandal Is at Columbia' __HTTP__ _E_\nAwarded 5 stars from @ForbesInspector @TrumpTO offers 261 rooms & 115 suites in the center of downtown Toronto __HTTP__ _E_\nObama has now become the weakest POTUS against China yuan just hit record high against dollar __HTTP__ Very sad! _E_\nNYC terrorist was happy as he asked to hang ISIS flag in his hospital room. He killed 8 people badly injured 12. SHOULD GET DEATH PENALTY! _E_\nLyin' Ted Cruz is now trying to convince prople that his problems with The National Enq.were caused by me. I had NOTHING to do with story! _E_\nEvery day Pastor Saeed is imprisoned by Iran is an indictment on Obama's 'diplomacy.' #SaveSaeed _E_\nWhile Derek Jeter is training every day in the off season reports come out that A Rod is partying all over the country. Go Derek. @Yankees _E_\nHomeland Security and law enforcement are on alert & closely watching for any sign of trouble. Our borders are far tougher than ever before! _E_\nThe new amnesty bill is over 1000 pages. It is another monstrosity a la ObamaCare. _E_\nJust did theToday Show to announce that Baton Rouge Louisiana will host the Miss USA Pageant on Sunday June 8th. @Miss USA. _E_\nLooking forward to addressing @TheEconomicClub on December 15th at the Marriot Marquis Washington DC. _E_\nLook I have always liked Lance Armstrong I just hated what he did to himself including recently. His life will now be hell. _E_\nI am lowering taxes far more than any other candidate. Any negotiated increase by Congress to my proposal would still be lower than current! _E_\n\"Get in. Get it done. Get it done right. Get out.\" – My father Fred C. Trump _E_\n\"Yesterday's home runs don't win today's games.\" Babe Ruth _E_\nOn at 9:00A.M. or 10:00 A.M. (depending on your location) on Fox is a tough but really good interview with Chris Wallace. Enjoy! _E_\nTed is the ultimate hypocrite. 
Says one thing for money does another for votes. __HTTP__ _E_\nSports fans should never condone players that do not stand proud for their National Anthem or their Country. NFL should change policy! _E_\nI'll be on @gretawire On the Record tonight to talk about the ObamaCare fiasco 7 pm on Fox News _E_\nDonald Trump: GOP Has 'Nuclear Weapon' In Fiscal Cliff Negotiation But They Don't Know It __HTTP__ via @mediaite _E_\nl still think @Boeing should just bite the bullet & get rid of the new batteries in the 787. Those batteries will always be a problem! _E_\nGreat meeting with @THEHermanCain yesterday in Trump Tower. Great guy! _E_\nInstead of creating new jobs Obamacare is destroying jobs. And the worst part is yet to come since the truly (cont) __HTTP__ _E_\n.@lisarinna is at the top of her game in the upcoming season of @CelebApprentice All Stars. Our fans love her. _E_\nCome on Republican Senators you can do it on Healthcare. After 7 years this is your chance to shine! Don't let the American people down! _E_\nThank you Foxconn for investing $10 BILLION DOLLARS with the potential for up to 13K new jobs in Wisconsin! MadeInTheUSA __HTTP__ _E_\nMiami's top destination @TrumpDoral's remodeled Royal Palm Pool offers 18 luxurious cabanas __HTTP__ _E_\nBe sure to watch Oprah today (4 pm on Channel 7) I'll be on with my entire family and it will be an entertaining hour.. __HTTP__ _E_\nThe Navy Yard shooting is a horrible disaster. If we don't clean up OUR COUNTRY of the garbage soon we are just going to do a death spiral! _E_\nI don't know what will happen with the lawsuit against dummy @billmaher but have an obligation to charity to bring it. _E_\nGreat Strategic & Policy CEO Forum today with my Cabinet Secretaries and top CEO's from around the United States.... __HTTP__ _E_\nIt doesn't matter who you vote for it matters who is counting the votes. Be careful of voter fraud! _E_\nJoin me on Monday April 4th in Milwaukee! 
#WIPrimary #Trump2016Tickets: __HTTP__ __HTTP__ _E_\nPeople love gossip. It's the biggest thing that keeps the entertainment industry going. @TheEllenShow _E_\nIf @OMAROSA is not in the Board Room I can't fire her. @latoyajackson made a strategic mistake. _E_\nSneak peek of Trump's trio of spectacular new seaside holes on the famed Ailsa course/@TrumpTurnberry __HTTP__ _E_\nJobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut and it will only get better. MUCH MORE TO COME! _E_\nAnother cover up. Obama won't disclose how many illegal immigrants he has released into our country __HTTP__ No surprise. _E_\nIt was my great honor to welcome Mayor's from across America to the WH. My Administration will always support local government and listen to the leaders who know their communities best. Together we will usher in a bold new era of Peace and Prosperity! __HTTP__ __HTTP__ _E_\n...and job losses. American companies must be prepared to look at other alternatives. _E_\nCNN which is totally biased in favor of Clinton should apologize. They knew they were wrong. __HTTP__ _E_\nMy interview with @jheil & @MarkHalperin at @WollmanRink airing at 5PM on @bpolitics. __HTTP__ _E_\nLook forward to seeing final results of VoteStand. Gregg Phillips and crew say at least 3000000 votes were illegal. We must do better! _E_\nGeneral James Mad Dog Mattis who is being considered for Secretary of Defense was very impressive yesterday. A true General's General! _E_\n....The Wall will be paid for directly or indirectly or through longer term reimbursement by Mexico which has a ridiculous $71 billion dollar trade surplus with the U.S. The $20 billion dollar Wall is \"peanuts\" compared to what Mexico makes from the U.S. NAFTA is a bad joke! _E_\nWhat a dumb mistake AOL made buying the @huffingtonpost. How much longer will Arianna last I predict not much. 
_E_\nI was just told by one of the top @PGATOUR players that my golf courses are the most elite in the country. Very nice compliment I agree. _E_\nLast Friday's gaffe by @BarackObama claiming that the private sector is doing fine is illustrative.Everything to him revolves around gov't _E_\n'President Elect Donald J. Trump Nominates Elaine Chao as Secretary of the Department of Transportation' __HTTP__ _E_\nVia @TheScotsman: \"Donald Trump's @TrumpTurnberry plan gets go ahead\" __HTTP__ _E_\nWhy don't we ask the Navy SEALs who killed Bin Laden? They don't seem to be happy with Obama claiming credit. All he did is say O.K. _E_\n.@ColinCowherd said such nice things about me during the debate that I thought I'd do his show @TheHerd on Monday (2:30pm EST). _E_\nHeading to Boston to see another huge crowd! My friend Tom Brady is a great competitor and golf partner. __HTTP__ _E_\nThe World Economic Forum now ranks the US the fifth most competitive economy in the world. We have fallen from first under @BarackObama. _E_\nThe Country is being run just like the stadium. _E_\nWhen will people and the media start to apologize to me for my statement Mexico is sending.... which turned out to be true? El Chapo _E_\nHow can a dummy dope like Harry Hurt who wrote a failed book about me but doesn't know me or anything about me be on TV discussing Trump? _E_\nEntrepreneurs: Cover your bases. Know everything you can about what you're doing. _E_\nThe House Republicans and Democrats are finally unanimous! Yesterday they voted down @BarackObama's $3.6T budget (cont) __HTTP__ _E_\nHillary Clinton needs to address the racist undertones of her 2008 campaign. #FlashbackFriday __HTTP__ _E_\nOur GREAT VETERANS can now connect w/ their VA healthcare team from anywhere using #VAVideoConnect available at: __HTTP__ __HTTP__ _E_\nTom marbles in his mouth Brokaw once thanked me for the great success of the Apprentice for NBC. 
Now he calls (cont) __HTTP__ _E_\nRT @DanScavino: Last nights winner was clear & it will be proven time & time again lets #MAGA!! Lets WIN!! #TrumpTrain __HTTP__ _E_\nSugar @Lord_Sugar—you should say thank you Donald like a good little boy... ... _E_\nRT @AlanDersh: We should stop talking about obstruction of justice. No plausible case. We must distinguish crimes from pol sins __HTTP__ _E_\nObama's speech indicates he wants to change this country as we know it wow he really feels emboldened. _E_\nGeneral says that the Armed Forces will be severely weakened if the large scale rape and sexual abuse problem is not brought under control. _E_\nPolls show that the hurricane had a huge positive effect for Obama on his win isn't that ridiculous? _E_\n.@IvankaTrump will lead the U.S. delegation to India this fall supporting women's entrepreneurship globally.#GES2017 @narendramodi _E_\nThe Russia Trump collusion story is a total hoax when will this taxpayer funded charade end? _E_\nWith all the talk of fiscal responsibility at the @DNC convention yesterday it was ironic that the debt passed $16T. _E_\nGreat honor to be inducted into the NJ Boxing Hall of Fame last night. Thank you! Timing could not have been better! __HTTP__ _E_\nCongratulations to my daughter Ivanka and her husband Jared on the birth of their daughter Arabella Rose yesterday. _E_\nAlways bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_\nLooking forward to hosting our heroes from the Wounded Warrior Project (@WWP) Soldier Ride to the @WhiteHouse on Th... __HTTP__ _E_\nStop The China Curse Pass the Chinese Currency Bill! _E_\n...... Circulation is way down and all he thinks about are his bad food restaurants. @CondeNastCorp _E_\nThis is the One Year Anniversary of my Presidency and the Democrats wanted to give me a nice present. 
#DemocratShutdown _E_\nCongratulations to Sung Hyun Park on winning the 2017 @USGA #USWomensOpen _E_\nVery exciting—tomorrow night at Madison Square Garden I get inducted into the @WWE Hall of Fame. _E_\nLooking forward to hosting the @FloridaGOP \"House Majority 2014 Golf Tournament\" at Trump Int'l West Palm Beach on Jan. 27th. _E_\nNew Yorkers will get a chance to see a film for free this summer from @attnyc and @tribecafilmfest. My choice? Citizen Kane #FilmForAll _E_\n#ICYMI: Announcement of Air Traffic Control Initiative Watch __HTTP__ _E_\nJust watched Facebook COO Sheryl Sandberg on 60 Minutes. She should spend more time trying to get the F stock price up & less on her ego! _E_\nWith all my Administration has done on Legislative Approvals (broke Harry Truman's Record) Regulation Cutting Judicial Appointments Building Military VA TAX CUTS & REFORM Record Economy/Stock Market and so much more I am sure great credit will be given by mainstream news? _E_\nFans shouldn't worry. We have adjusted the filming schedule of the upcoming 13th season of @CelebApprentice appropriately due to the storm. _E_\nJust watched Brian Williams on @TODAYshow very sad! Brian should get on with a new life and not start all over at @msnbc. Stop apologizing _E_\nThe many losers and haters never have the brains or stamina to become truly successful! _E_\n...when they have no environmental restrictions! America' s workers need us. __HTTP__ _E_\nThank you Houston Texas! #AmericaFirst #Trump2016 __HTTP__ _E_\nWhy Isn't the Senate Intel Committee looking into the Fake News Networks in OUR country to see why so much of our news is just made up FAKE! _E_\n... Doesn't seem like they have a coherent strategy right now. _E_\nThe FAKE & FRAUDULENT NEWS MEDIA is working hard to convince Republicans and others I should not use social media but remember I won.... _E_\nYes I will be live tweeting during the final debate this coming Monday. 
_E_\nGood luck #TeamUSA#OpeningCeremony #Rio2016 __HTTP__ _E_\nThis was a great evening I would like to thank everyone for their wonderful support. _E_\nOur leaders are terrible. The government spends over $50B a day. It can't find cuts for less than 2 days of spending?! Sad! _E_\nCareer Advice from Donald Trump __HTTP__ via @BNDarticles by @brittneyplz _E_\nThis may be the worst football game ever played by one team Denver! Hard to watch. _E_\nOur enemy China is illegally buying oil from our enemy Iran __HTTP__ China loves it! _E_\nShock Obama WH given three pinocchios for lying about Benghazi emails __HTTP__ _E_\nThe pessimist sees the difficulty in every opportunity and the optimist sees the opportunity in every difficulty. Pres. Lincoln _E_\n.@Lawrence is the poor man's left wing @oreillyfactor(with no ratings)! _E_\nRT @SheriffClarke: Happy Father's Day to all dads. My dad. Like father like son @realDonaldTrump supporters to the end. He an Airborne Ra... _E_\nPaula Deen made a big mistake in using a forbidden word but must be given some credit fot admitting her mistake. She will be back! _E_\nThe NRA in Nashville today was amazing. Packed house and standing ovation for Trump. THANKS! _E_\nBecause I was told I could not do well in Iowa I spent very little there a fraction of Cruz & Rubio. Came in a strong second. Great honor _E_\nI will be on Face The Nation (CBS) today at 10:30 A.M. and Media Buzz (Fox News) at 11:00 A.M. Enjoy! _E_\nIvanka is now on Twitter You can follow her @IvankaTrump Have a terrific weekend! _E_\nToday we gathered in the East Room to pay tribute to the HEROES whose courageous actions under fire saved so many lives in Alexandria VA. __HTTP__ _E_\nSee you tomorrow Michigan!Grand Rapids MI tomorrow at noon: __HTTP__ MI tomorrow at 3pm:... __HTTP__ _E_\n.@KatrinaCampins Thank you so much for the wonderful statements you made about me on TV. Also keep up the great work! 
_E_\nCongratulations to @TrumpChicago and @SixteenChicago for receiving the @AAANews Five Diamond Award again this year! _E_\nChina is filling the vacuum left by Obama at the UN on the world stage. _E_\nWashington is wasting over $2 billion this year on Solyndra type loans. Yet they want to cut military spending. _E_\nVia @DailyCaller: Donald Trump: Obama should golf w/ Republicans not his 'local friends' __HTTP__ by @NicholasBallasy _E_\nLook at the way Crooked Hillary is handling the e mail case and the total mess she is in. She is unfit to be president. Bad judgement! _E_\nTexas @GovAbbott & Lt. Gov. @DanPatrickThank you for todays briefing on hurricane recovery efforts here in TX. Keep up the great work! __HTTP__ _E_\nRT @charliekirk11: 3 big wins in 2017 you won't hear:Trump confirmed the most circuit court judges ever in a President's 1st year (all co... _E_\nChina OPEC and Russia laugh at us. But now thanks to Obama so does Syria. Very sad! _E_\nWho is more believable on the state of employment the great @jack_welch or some government bureaucrat who is voting for Obama? _E_\nI turned down a meeting with Charles and David Koch. Much better for them to meet with the puppets of politics they will do much better! _E_\nThe NRA strongly endorses Luther Strange for Senator of Alabama.That means all gun owners should vote for Big Luther. He won't let you down! _E_\nOur country needs leadership now. There is total dysfunction in Washington. _E_\nRemember this Sunday I am also featured on @datelinenbc at 8PM right before the premiere of All Star @CelebApprentice @nbc likes me! _E_\nI have created tens of thousands of jobs and will bring back great American prosperity. Hillary has only created jobs at the FBI and DOJ! _E_\nCan you believe that the builder of the failed ObamaCare website was just given a new government contract how stupid is that CLUELESS!!! 
_E_\n#VoteTrumpKS #Trump2016March 5 2016 | Wichita Kansas: __HTTP__ __HTTP__ _E_\nI wonder what the rest of the world is thinking about the United States as they watch the disgusting and out of control Baltimore riots? _E_\nIf Crooked Hillary Clinton can't close the deal on Crazy Bernie how is she going to take on China Russia ISIS and all of the others? _E_\nHow could Obama leave those American heroes out to die in Benghazi? And he continues to lie to the public! _E_\nVia @PressClubDC by @snlyngaas: \"Trump Says U.S. Brand Has Lost Its Luster\" __HTTP__ _E_\nNo more Clintons or Bushes! __HTTP__ _E_\nI will be interviewed by @MariaBartiromo at 6:00 A.M. @FoxBusiness. Enjoy! _E_\n.@Team_Mitch Fantastic win we are all proud of you! Your victory speech last night was very gracious to an opponent whose speech was not. _E_\nDonald Trump Defends His Big Obama Bombshell: 'It's Not a Publicity Stunt' __HTTP__ via @eonline _E_\nTerrible! Just found out that Obama had my wires tapped in Trump Tower just before the victory. Nothing found. This is McCarthyism! _E_\nDummy @KarlRove continues to make and write false statements. He still thinks Romney won he should get a life! _E_\nRT @TeamTrump: .@HillaryClinton & @timkaine think you're #Deplorables & #BasementDwellers. @realDonaldTrump & @mike_pence think you're PATR... _E_\n08 02 2011 19:56:31 _E_\nTrump Int'l Hotel & Tower Vancouver's original twisting design gives every unit a distinct view __HTTP__ A landmark! _E_\nCongratulations to our great Women's Olympic Soccer team @ussoccer on their gold medal. They made us all proud! _E_\nI appreciate the GOP candidates who remain strong on border security. They know I am right. A nation without borders cannot survive. _E_\nRT @The_Trump_Train: @realDonaldTrump Make no mistake we are going to put the interest of AMERICAN CITIZENS FIRST! The forgotten men & w... 
_E_\nThank you to Eli Lake of The Bloomberg View The NSA & FBI...should not interfere in our politics...and is Very serious situation for USA _E_\nObama sadly has no business or private sector background and it shows. _E_\nWatch You've Got Donald Trump at __HTTP__ _E_\nThe DJT Foundation unlike most foundations never paid fees rent salaries or any expenses. 100% of money goes to wonderful charities! _E_\nIraq is falling apart fast two trillion dollars and so many deaths Bush got us in and Obama took far too long to get us out! _E_\nSuch a serious problem for Ted & the GOP. Great doubt Dems will sue! Let's all work together to solve this problem. __HTTP__ _E_\nRepublican Senators will not let the American people down! ObamaCare premiums and deductibles are way up it was a lie and it is dead! _E_\nTune in tonight at 10 pm on NBC for another exciting episode of The Apprentice and see the Dog Whisperer make an appearance. _E_\nRT @mike_pence: There's one clear choice in this election to create jobs and grow the American economy. #VPDebate __HTTP__ _E_\nNow Chinese state run companies are taking over our coal market __HTTP__ China wants to deplete our resources here at home. _E_\nOne of the reasons I assume I was inducted into the @WWE Hall of Fame is that Vince McMahon and I have the all time highest ratings... _E_\nClinton Foundation's Fundraisers Pressed Donors to Steer Business to Former President __HTTP__ _E_\n.@katyperry Katy what the hell were you thinking when you married loser Russell Brand. There is a guy who has got nothing going a waste! _E_\nI wouldn't use @Richard_Meier to design a doghouse let alone a house or building! _E_\nThe real outsourcer @BarackObama is funding German automakers with the GM bailout money __HTTP__ How does that help us? _E_\nThank you for your support! Being #PoliticallyCorrect will NOT #MakeAmericaGreatAgain! 
__HTTP__ __HTTP__ _E_\nVia @Newsmax_Media: \"@RepMattSalmon: Obama 'Didn't Lift a Finger' to Help Free Marine in Mexican Prison\" __HTTP__ _E_\nSad thing is Rolling Stone was (is) a dead magazine with big downward circulation and now for them at last people are talking about it! _E_\nI never thought I'd be saying this but I've really enjoyed @RichLowry on television lately and he was terrific hosting @seanhannity _E_\nJust tried watching Saturday Night Live unwatchable! Totally biased not funny and the Baldwin impersonation just can't get any worse. Sad _E_\nWe don't need a Secretary of Business to understand business we need a president who understands business and I do @MittRomney _E_\nVia @CarrGaz: \"Trump's grand plans for @TrumpTurnberry resort get the green light\" __HTTP__ _E_\n\"You can always become better.\" @TigerWoods _E_\nI hope the NY tax payer appreciates the millions Schneiderman is about to waste on a small case. I will litigate to victory. _E_\nViolent crime is rising across the United States yet the DNC convention ignored it. Crime reduction will be one of my top priorities. _E_\nJames Comey better hope that there are no tapes of our conversations before he starts leaking to the press! _E_\nThe thing I like best about Rex Tillerson is that he has vast experience at dealing successfully with all types of foreign governments. _E_\nSpeech transcript at Arab Islamic American Summit __HTTP__ __HTTP__ #POTUSAbroad _E_\nMy @msnbc int w/ @krystalball at #WHCD on my 2016 timetable saving Social Security & Making America Great Again! __HTTP__ _E_\n\"God's word is the same yesterday and today and a million years from now.\" @Franklin_Graham _E_\nThere is no challenge too great no dream outside of our reach! Thank you Selma North Carolina!#ICYMI watch here... __HTTP__ _E_\nGreat news I'm now leading in most polls w/ new CNN poll also having me #1. NBC I am #1 in NH by a lot #2 in Iowa close & gaining. _E_\nRead this about @lawrence...... 
__HTTP__ _E_\nThe United States needs great deals and fast. We have to make our country rich again in order to MAKE OUR COUNTRY GREAT AGAIN! _E_\nPresident Obama seems so fawning and desperate to make a deal with Iran that lots of bad results can occur. Be cool and be careful! _E_\nPeople buy deals & immediately put them into bankruptcy in order to make better deals. It's a very effective & commonly used business tool. _E_\nIran humiliated the United States with the capture of our 10 sailors. Horrible pictures & images. We are weak. I will NOT forget! _E_\nThe Wall Street Journal has reported that Obama's food stamp policies are ushering in a massive 'food stamp crime wave.' #TimeToGet Tough _E_\nI will be traveling to Florida tomorrow to meet with our great Coast Guard FEMA and many of the brave first responders & others. _E_\nOn @foxandfriends in two minutes! _E_\nThank you! I miss my father. __HTTP__ _E_\n.@MichaelPhelps you are the greatest Olympic champion of them all. Fantastic job! _E_\nIt is a miracle how fast the Las Vegas Metropolitan Police were able to find the demented shooter and stop him from even more killing! _E_\nI have decided to postpone my trip to Israel and to schedule my meeting with @Netanyahu at a later date after I become President of the U.S. _E_\nIt's crunch time. This Sunday's All Star Celebrity @ApprenticeNBC's task will separate the winners from the losers. _E_\nGetting ready to leave for Cincinnati in the GREAT STATE of OHIO to meet with ObamaCare victims and talk Healthcare & also Infrastructure! _E_\nOn my way to Charleston/Mount Pleasant South Carolina. Big crowd. Look forward to it! #USSYorktown __HTTP__ _E_\nFor the first time in the history of military operations a country has broadcast what when and where they will be doing in a future attack! 
_E_\nOne of the keys to thinking big is total focus.\" – THE ART OF THE DEAL _E_\nWhen will our country stop wasting money on global warming and so many other truly STUPID things and begin to focus on lower taxes? _E_\nVets mistreated NO border security? I'm with @V4SA this Tuesday 9/15 to #MakeAmericasMilitaryGreatAgain! Join us! __HTTP__ _E_\nAfter 7 months of investigations & committee hearings about my collusion with the Russians nobody has been able to show any proof. Sad! _E_\n.@MittRomney's poll numbers are looking really good. One more great debate performance and it will be a total knockout. _E_\nObama just endorsed Crooked Hillary. He wants four more years of Obama—but nobody else does! _E_\nTo all journalists look into the financial dealings of Scottish Parliament members with Vattenfall...Follow the money. _E_\nObama just had another trillion dollar budget deficit for the fourth year in a row. At least he is consistent. _E_\nJust landed in Ohio. Thank you America I am honored to win the final debate for our MOVEMENT. It is time to... __HTTP__ _E_\n.@jack_welch is correct these reporters would not have been so brave while Jack was running GE. _E_\nCongratulations to @CharlieCrist who has now lost a statewide election in Florida as a Republican Independent & Democrat. _E_\n\"To be a visionary and to be a billionaire you have to chase impossibilities. Few ever get rich easily.\" – Think Like a Billionaire _E_\n.@AnnCoulter's new book Adios America! The Left's Plan to Turn Our Country into a Third World Hellhole is a great read. Good job! _E_\nNew Iowa poll. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n\"Let your passion for your work carry you through all the setbacks they can throw at you.\" – Trump Never Give Up _E_\nHigher Taxes kill job creation cut wild government spending and waste. _E_\n.@FoxNews you should be ashamed of yourself. I got you the highest debate ratings in your history & you say nothing but bad... 
_E_\nThe only candidate who can get 1145 delegates is @MittRomney. The primary is over. _E_\nSet your sights and aim high. You never know what you can achieve until you focus on achieving it. Midas Touch _E_\nCelebrity Apprentice in 15 minutes don't miss it! _E_\n.@nyrangers did a great job of winning tonight played like champions! _E_\nHillary Clinton didn't go to Louisiana and now she didn't go to Mexico. She doesn't have the drive or stamina to MAKE AMERICA GREAT AGAIN! _E_\nObama claims that he needs an extra $4B to secure the border. Well then he should not have wasted $5B on the ObamaCare website. _E_\nAnother broken promise by @BarackObama: @ObamaCare actually increases income inequality __HTTP__ It must be fully repealed! _E_\nIf you're going to think think big The Art of the Deal _E_\nNobody is watching @Morning_Joe anymore. Gone off the deep end bad ratings. You won't believe what I am watching now! _E_\nCongress must protect our borders first. Amnesty should be done only if the border is secure and illegal immigration has stopped. _E_\nCongrats @TrumpDoral for being named one of the Most Notable Openings of 2014 from @BizBash: __HTTP__ _E_\nObamaCare must be completely repealed. A recent report from UBS shows that it is the number one reason employers are not hiring. _E_\nThank you @greta. #ImWithYou __HTTP__ _E_\nDoing David Letterman @Late_Show tonight at 11:30. 1st nite of Sweeps.Going into the lion's den but I've been there many times before. Enjoy _E_\nThank you Appleton Wisconsin!#WIPrimary #Trump2016 __HTTP__ __HTTP__ _E_\nA great victory in Scotland ... __HTTP__ __HTTP__ _E_\nHere's @Joan_Rivers. She & @IvankaTrump make a terrific team as my advisors. #CelebApprentice _E_\nThere are no buyers for the worthless @NYDailyNews but little Mort Zuckerman is frantically looking. It is bleeding red ink a total loser! _E_\nLooking forward to being hosted by @saintanselm for Politics & Eggs next Tuesday. See you in Manchester! 
#NHPolitics _E_\nI'll be making a major announcement on President Obama next week stay tuned! _E_\nThe reason that President Obama did NOTHING about Russia after being notified by the CIA of meddling is that he expected Clinton would win.. _E_\nCongratulations to @TrumpNewYork @TrumpChicago @TrumpWaikiki @TrumpToronto on your Forbes Five Star ratings @ForbesInspector _E_\nNick Adams new book Green Card Warrior is a must read. The merit based system is the way to go. Canada Australia! @foxandfriends _E_\nThank you @Morning_Joe for throwing the pathetic reporter from the failing and money losing Daily Beast off the air. Really cool! _E_\nEveryone is now saying how right I was with illegal immigration & the wall. After Paris they're all on the bandwagon. _E_\nPlease only respond by tweet @lawrence because like everyone else I don't watch your show. _E_\nReady to lead. Ready to Make America Great Again. #Debate #MAGA _E_\nI just retained Sir Nick Faldo to be the architect of the Red Course at Doral he will do a tremendous job! @NickFaldo006 _E_\nThank you South Bend Indiana! Everyone get out & #VoteTrump tomorrow! #INPrimary __HTTP__ __HTTP__ _E_\nHe that is good for making excuses is seldom good for anything else. Benjamin Franklin _E_\nThe Baldwin family is well represented in the 13th season of All Star @CelebApprentice with @StephenBaldwin. Stephen does great. _E_\nWishing everyone a very Happy Holiday season! _E_\nRT @LindseyGrahamSC: I support President Trump's desire to re enter the Paris Accord after the agreement becomes a better deal for America... _E_\nNow they say obese women may cause Autism in children nonsense they use any excuse. The FDA should immediately (cont) __HTTP__ _E_\nBus crash in Tennessee so sad & so terrible. Condolences to all family members and loved ones. These beautiful children will be remembered! _E_\nToday I welcomed the Victory Christian Center School. Good luck @ the Team America Rocketry Challenge! #TARC Watch... 
__HTTP__ _E_\nRT @WhiteHouse: Happy Father's Day! __HTTP__ _E_\nI've dealt w/politicians throughout the world. My deals are multi faceted transactions which involve many issues. I know the process & win! _E_\nPassing what was once a vibrant manufacturing area in Pennsylvania. So sad! #MakeAmericaGreatAgain __HTTP__ _E_\nSorry @Rosie is a mentally sick woman a bully a dummy and above all a loser. Other than that she is just wonderful! _E_\nThe USC should be ruling any day now on @ObamaCare. Hopefully we will get the right result. _E_\nA good example of how our country wastes money... __HTTP__ #trumpvlog _E_\nHope & Change! China now controls a record number of our debt __HTTP__ _E_\nChina is an international pariah. They are now harassing Japan over its purchase of 3 uninhabited islands __HTTP__ _E_\nLet's properly check goofy Elizabeth Warren's records to see if she is Native American. I say she's a fraud! _E_\nSuccess tip: Be ready for problems and be patient there are very few cases of instant gratification. _E_\nChina is openly sailing warships in our waters & arming countries in our hemisphere including Mexico __HTTP__ Ally? _E_\nI will be in Wisconsin until the election. Jobs trade and immigration will be big factors. I will bring jobs back home make great deals! _E_\nHeading to beautiful West Virginia to be with great members of the Republican Party. Will be planning Infrastructure and discussing Immigration and DACA not easy when we have no support from the Democrats. NOT ONE DEM VOTED FOR OUR TAX CUT BILL! Need more Republicans in '18. _E_\nTrump Tuesday @SquawkCNBC tomorrow at 7:38 AM. _E_\nNetwork news has become so partisan distorted and fake that licenses must be challenged and if appropriate revoked. Not fair to public! _E_\nMiss Alabama Katherine Webb has been a truly great representative of the Ms. USA Organization ..We are proud of her! _E_\nWhen foreigners attend our great colleges & want to stay in the U.S. 
they should not be thrown out of our country. _E_\nThe @AmSpec interview by Jeffrey Lord: A TRUMP CARD The Donald talks politics and parenting. __HTTP__ _E_\nThe Freedom Caucus will hurt the entire Republican agenda if they don't get on the team & fast. We must fight them & Dems in 2018! _E_\nRe: Ashley Judd: Keep @KarlRove away. He already made her a viable candidate. _E_\nOur gov't is so pathetic that some of the billions being wasted in Afghanistan are ending up with terrorists __HTTP__ _E_\nGreat going to Bob Kraft & Bill Belichick of the @Patriots on @TimTebow. Tim is a winner just like them! _E_\nThe 2013 Trump @MissUniverse Pageant comes to Moscow on November 9th. Airing from Crocus City Hall on @nbc! _E_\nI'll be on @gretawire tonight on @foxnews at 10 pm. _E_\nThe so called angry crowds in home districts of some Republicans are actually in numerous cases planned out by liberal activists. Sad! _E_\nRT @EricTrump: We should all take a moment to say a prayer for those who paid the ultimate price — Their bravery and sacrifice allows us t... _E_\nOn Friday @VPBiden said that China has better cities and airports than the US. Well what has @BarackObama done about it the last 3 years?! _E_\nPlease tune in January 15th at 6:00AM EST and 6:00PM EST to the QVC network to watch my wife @MELANIATRUMP... _E_\nPlan a perfect weekend for the holidays in NYC's hottest neighborhood using @TrumpSoHo's 20% offer __HTTP__ _E_\nThe Village @Trump_Charlotte offers a variety of 5 Star dining experiences for everyday dining & catered affairs __HTTP__ _E_\nWow! What a great night. Thank you to all of the viewers and congratulations to @StephenAtHome __HTTP__ @colbertlateshow _E_\nRT @FoxNews: New Poll Shows @POTUS Approval at 50 Percent __HTTP__ _E_\nWe must do everything possible to keep this horrible terrorism outside the United States. 
_E_\nInteresting article from highly respected Wayne Allyn Root __HTTP__ _E_\nWith an award winning course designed by Tom Fazio Trump National Philadelphia is a 360 acre exclusive jewel __HTTP__ _E_\nThe President Changed. So Has Small Businesses' Confidence __HTTP__ _E_\nWow so many Fake News stories today. No matter what I do or say they will not write or speak truth. The Fake News Media is out of control! _E_\nA few of the many clips of John McCain talking about Repealing & Replacing O'Care. My oh my has he changed complete turn from years of talk! __HTTP__ _E_\nThere will be no amnesty!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nDo you believe that Hillary Clinton now wants Obamacare for illegal immigrants? She should spend more time taking care of our great Vets! _E_\nWe need a tax system that is fair and smart one that encourages growth savings and investment. It's time to (cont) __HTTP__ _E_\nI believe this book will rock a lot of people. Don't just read #TImeToGetTough but share it with your friends and family! RushLimbaugh _E_\nRT @IvankaTrump: We must reform our tax code so that all Americans can succeed in our modern economy & achieve the American Dream! #TaxRefo... _E_\nMy @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney has to get tough real unemployment & bias press __HTTP__ _E_\nGet ready for the fireworks between @OMAROSA & @latoyajackson in 13th season of All Star @CelebApprentice! Neither one will back down. _E_\n.@MelaniaTrump looks amazing in 2000 @SInow! __HTTP__ _E_\nHuma Abedin the top aide to Hillary Clinton and the wife of perv sleazebag Anthony Wiener was a major security risk as a collector of info _E_\nNewly released emails prove that scientists have manipulated data on global warming. The data is unreliable. __HTTP__ _E_\nTremendous cold wave hits large part of U.S. Lucky they changed the name from global warming to climate change G.W. just doesn't work! _E_\nWelcome to the new reality be careful. 
Retirement ages will be pushed to 80 due to the incompetence of our leaders. __HTTP__ _E_\nJust released @CNN Poll gives me a big 13 point lead in Iowa. Change your false story failing @nytimes. Thank you Iowa! _E_\nIt's easy to see why Americans are sick of career politicians and both parties. _E_\nWhy has Obama let China and others take our jobs? _E_\nMexico's biggest drug lord escapes from jail. Unbelievable corruption and USA is paying the price. I told you so! _E_\nIsn't it funny that I am now #1 in the money losing @HuffingtonPost (poll) and by a big margin. Dummy @ariannahuff must be thrilled! _E_\nMy son Don and his wife Vanessa just had a beautiful baby boy named Spencer Frederick very thrilling. _E_\nExcited to be heading home to see the House pass a GREAT Tax Bill with the middle class getting big TAX CUTS!#MakeAmericaGreatAgain _E_\n3. You should tweet your pick for MVP using the celebrity's name followed by the hashtag #CelebApprenticeMVP. _E_\n.@VanityFair Magazine is doing really poorly. It has gotten worse and worse over the years and has lost almost all of it's former allure! _E_\nTed Cruz only talks tough on immigration now because he did so badly in S.C. He is in favor of amnesty and weak on illegal immigration. _E_\nI hope you all are looking at the Donald J. Trump Signature Collection of ties shirts & cufflinks @Macys—great for Christmas & holidays. _E_\nGreat day yesterday at @TrumpDoral unveiling the new Gary Player Villa __HTTP__ Gary is a champion and a great guy. _E_\nWatch this video to see how bad wind turbines are for the environment __HTTP__ _E_\nThe @TuckerCarlson opening statement about our once cherished and great FBI was so sad to watch. James Comey's leadership was a disaster! _E_\nI just learned that @politico has no credibility total phonies that don't report the truth. A puppet of Obama? _E_\nVia @TVbytheNumbers: 'Celebrity Apprentice' is Number 1 among ABC CBS & NBC for its Second Hour from 10 11 p.m. 
__HTTP__ _E_\nKarl Rove is now making excuses for his total wasting of $400M—not one win—(the Republicans better get smart next time)... _E_\n.@HillaryClinton ITS CALLED EXTREME VETTING! #Debates2016 __HTTP__ _E_\nThis is the first time in my life that I have caused controversy by NOT saying something. _E_\nThe people are really smart in cancelling subscriptions to the Dallas & Arizona papers & now USA Today will lose readers! The people get it! _E_\nJoin me live for the commissioning ceremony of the USS Gerald R. Ford! __HTTP__ #USA __HTTP__ _E_\nDopey @billmaher is in for a lot of trouble—I hope he has $5 million (for charity). _E_\nGetting ready for @nbcsnl commercial. __HTTP__ _E_\nI will be On The Record with Greta Van Susteren @gretawire tonight at 10 PM on Fox News. _E_\nThe people of Ireland have been so great about my purchase of Doonbeg I'll be there soon. @LodgeatDoonbeg _E_\nAssad hit the jackpot! _E_\nHere we go with the Oscars! _E_\nBy the way folks @billmaher is not a smart guy (just look at his past)—he just pretends he is! _E_\nDoes anybody notice that Atlantic City lost its magic after I left years ago. I had the big boxing introduced UFC (ask Dana)the best shows _E_\nWhere's the electability? Jeb is losing to HRC by 13 points. A Bush will never beat a Clinton. Wake up @GOP! _E_\nObama should stop running down the stairs when getting off Air Force One. Doesn't look presidential and at some point he will take a fall. _E_\nIt's Tuesday. How many more non stories will the liberal media try to manufacture so everyone ignores Obama's record? _E_\nThank you to all law enforcement agencies for a fabulous job!#LEO #LESM #Trump2016 __HTTP__ _E_\nCelebrity Apprentice will be LIVE on Sunday at 9 PM (from New York City).Casting has already begun for next season. _E_\nFor first time the failing @nytimes will take an ad (a bad one) to help save its failing reputation. Try reporting accurately & fairly! _E_\nWill be interviewed on @Morning_Joe at 7:40. ENJOY! 
_E_\nHow is Chris Christie running the state of NJ which is deeply troubled when he is spending all of his time in NH? New Jerseyans not happy! _E_\nHappy Father's Day to all even the haters and losers! _E_\nThank you for the kind words tonight @OMAROSA. You were great! See you soon! _E_\nLast night Melania and I attended the Skating with the Stars Gala at Wollman Rink in Central Park it was fantastic. Stay tuned for Part 2.. _E_\nWe create success or failure on the course primarily by our thoughts. Gary Player _E_\nVia @ConcordNHPatch by @politizine: \"Trump: 'We'll Make America Great Again'\" __HTTP__ _E_\nHillary there is nothing to laugh about __HTTP__ _E_\n.@sethmeyers Seth can't help it he is really trying hard but just doesn't have what it takes. Very awkward and insecure! _E_\nSee you tomorrow w/ Gov. @Mike_Pence Iowa & Wisconsin! 3pm __HTTP__ __HTTP__ __HTTP__ _E_\nWhy does US doping agency destroy an American icon @lancearmstrong for events that took place years ago in France? _E_\nDeparting Golden CO. for Arizona now after an unbelievable rally. Watch here: __HTTP__ __HTTP__ _E_\nA wonderful article by a writer who truly gets it. I am for the people and the people are for me. #Trump2016 __HTTP__ _E_\nAssociated Press knowingly and inaccurately wrote about Liberty University speech. Shameful reporting...no credibility. _E_\nLooks like @tedcruz is getting ready to attack. I am leading by so much he must. I hope so he will fall like all others. Will be easy! _E_\nDespite having a black president the racial divide seems greater than it has in decades.If Obama were a leader this would not be the case _E_\nVia @BreitbartNews by @NolteNC: DONALD TRUMP SURGES TO COMMANDING LEAD IN POST MCCAIN BACKLASH POLL __HTTP__ _E_\nHe is destroying our country:@BarackObama has requested to raise our debt limit to over $16.4Trillion by the end (cont) __HTTP__ _E_\nJust looked at new selection of Donald J. Trump Signature Collection ties & shirts @Macys fantastic! 
Would make great gifts! _E_\nJeb used Eminent Domain & took advantage of a disabled vet in the process. (2/2) __HTTP__ _E_\nThe fans are going to love the tasks in the upcoming 13th season of All Star @CelebApprentice. The biggest yet! _E_\nI have accepted the invitation of President Enrique Pena Nieto of Mexico and look very much forward to meeting him tomorrow. _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nWhere are @RepMarkMeadows @Jim_Jordan and @Raul_Labrador?#RepealANDReplace #Obamacare _E_\nIn NYC looks like another attack by a very sick and deranged person. Law enforcement is following this closely. NOT IN THE U.S.A.! _E_\nMichael Forbes lives in a pigsty and bad liquor company Glenfiddich gave him Scot of the Year award... _E_\nListening to @rushlimbaugh on way back to Jury Duty. Fantastic show terrific guy! _E_\n.@MagicJohnson Good luck with the Dodgers this season if they were like you they would never lose a game! _E_\nOne of the worst and most boring political pundits on television is @krauthammer. A totally overrated clown who speaks without knowing facts _E_\nUnivision apologized to me but I will not accept their apology. I will be suing them for a lot of money. Miss U.S.A. contestants are hurt! _E_\nLightweight @AGSchneiderman will probably win only because he is a Dem in NY but what a loser! _E_\nI have watched sloppy Graydon Carter fail and close Spy Magazine and now am watching him fail at @VanityFair Magazine. He is a total loser! _E_\nI'll be on @foxandfriends Monday at 7:30 AM don't miss it. _E_\nRatings way down show irrelevant. Why haven't they learned? @Rosie always fails. _E_\nDemocrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our dangerous Southern Border. They could have easily made a deal but decided to play Shutdown politics instead. #WeNeedMoreRepublicansIn18 in order to power through mess! 
_E_\n#HappyIndependenceDay #USA __HTTP__ _E_\n'Manufacturing openings hires rise to highest levels of the recovery' __HTTP__ _E_\nI bought Tim Tebow's jersey and helmet at auction for a good cause fighting breast cancer __HTTP__ _E_\nFor those of you in trouble—(in these troubled times)—never ever give up! _E_\nIf my people said the things about me that Podesta & Hillary's people said about her I would fire them out of self respect. Bad instincts _E_\nCongress should be worried about American workers not people who came into our country by breaking our laws. _E_\nRT @IvankaTrump: 2016 has been one of the most eventful and exciting years of my life. I wish you peace joy love and laughter. Happy New... _E_\n.@CNN is the worst.They go to their dumb one sided panels when a podium speaker is for Trump! VAST MAJORITY want: Make America Great Again! _E_\nMAKE AMERICA GREAT AGAIN! #IACaucus #CaucusForTrump __HTTP__ __HTTP__ _E_\nMy Twitter account was taken down for 11 minutes by a rogue employee. I guess the word must finally be getting out and having an impact. _E_\n.@FrankLuntz is a low class slob who came to my office looking for consulting work and I had zero interest. Now he picks anti Trump panels! _E_\nThe CDC chief just said Ebola is spreading faster than Aids. Marines are preparing for a pandemic drill. Stop all flights from West Africa! _E_\nWhy the Rust Belt just gave Donald Trump a hero's welcome __HTTP__ _E_\nFrom an amazing day on the border in Laredo. __HTTP__ _E_\nCondolences to the family of the young woman killed today and best regards to all of those injured in Charlottesville Virginia. So sad! _E_\nI keep getting great feedback on new #TRUMP cologne 'Success.' Exclusively available at @Macy's __HTTP__ And best shirts & ties _E_\nJust read @marklevinshow's bestseller book—really great! _E_\nThe failing @nytimes does not mention the new @CNN Poll that has me leading Iowa by a massive 13 points I am at 33%. Maggie Haberman sad! 
_E_\nGive me clean beautiful and healthy air not the same old climate change (global warming) bullshit! I am tired of hearing this nonsense. _E_\n...Save your energy Rex we'll do what has to be done! _E_\nJoin me in Colorado at 12pm tomorrow or Arizona at 3pm!TICKETS:Golden: __HTTP__ __HTTP__ _E_\nThe best luck of all is the luck you make for yourself. Douglas MacArthur _E_\nGetting ready to do the David Letterman @Late_Show tonight—I hope you all will watch—I think! _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nHe was quick to issue an apology on behalf of America to Karzai. Why won't he release the letter? @BarackObama __HTTP__ _E_\nWe should immediately close all tax loopholes that favor foreign investments and taking our jobs overseas t... (cont) __HTTP__ _E_\nI told you @TIME Magazine would never pick me as person of the year despite being the big favorite They picked person who is ruining Germany _E_\nThere are just so many penalties and such long commercials in these NFL games that they are no longer worth watching. Soft hitting & boring! _E_\nThank you America! #Trump2016 __HTTP__ _E_\nGovernor Cuomo only cut the Verrazano Bridge tolls because I made it a major point in speeches. I love the people of Staten Island! _E_\nVia @Newsmax_Media by Alana Marie Burke: \"Donald Trump 2016: 7 Key Political Positions\" __HTTP__ _E_\nPoliticians are all talk and no action. Washington can only be fixed by an outsider. Let's make America great again! __HTTP__ _E_\n...and West Virginia. The fact is the Fake News Russian collusion story record Stock Market border security military strength jobs..... _E_\nI have spoken w/ @GovAbbott of Texas and @LouisianaGov Edwards. Closely monitoring #HurricaneHarvey developments & here to assist as needed. _E_\nThe Ebola doctor who just flew to N.Y. from West Africa and went on the subway bowling and dining is a very SELFISH man should have known! 
_E_\nTogether we will prevail in the GREAT state of Texas. We love you!GOD BLESS TEXAS & GOD BLESS THE USA __HTTP__ _E_\nEven liberals & Democrats think Eric Schneiderman's use of the Atty General's office is unfair & unethical. __HTTP__ _E_\nSpeaker John Boehner who I like should never have agreed to raise taxes because the Republicans got absolutely nothing for it! _E_\n...extremism and all reference was pointing to Qatar. Perhaps this will be the beginning of the end to the horror of terrorism! _E_\nVia @EW by @DaltonRoss: \"recap: 'Nobody Out Thinks Donald Trump'\" __HTTP__ _E_\nThe S&P downgrade is a direct result of @BarackObama's increased reckless budget spending and Obama Care. He owns this. _E_\nI'm on @CNN's @AC360 tonight @8pm & @FoxNews' @seanhannity @ 10PM discussing immigration and lots of other things.#LetsMakeAmericaGreatAgain _E_\nWonderful coordination between Federal State and Local Governments in the Great State of Texas TEAMWORK! Record setting rainfall. _E_\nWhat a sad thing that the memory of Nelson Mandela will be stained by the phoney sign language moron who is in every picture at funeral! _E_\nLanding in Phoenix now. Tomorrow's events will be amazing! #Trump2016 _E_\nRT @foxandfriends: FOX NEWS EXCLUSIVE: President Trump 'seriously considering' a pardon for ex Sheriff Joe Arpaio __HTTP__ _E_\nLeaving now for New Hampshire. Big crowd looking forward to it! #FITN _E_\nNewsmax is a great news organization and its pres debate in IA on Dec 27 will be fair balanced and informative. @ralphreed _E_\nA horrible day for Newtown CT and our country yesterday. My condolences to all of the families so tragically affected. _E_\nIsn't it ironic that China is going all in nuclear for energy while at the same time making wind turbines for others. @alexsalmond _E_\n.@ScottWalker is a nice guy but not presidential material. Wisconsin is in turmoil borrowing to the hilt and doing poorly in jobs etc. _E_\nVia @MailOnline by @dmartosko: \"President Trump? 
Says 'there's a very substantial chance' he'll run in 2016\" __HTTP__ _E_\n...Even though parts of healthcare could pass at 51 some really good things need 60. So many great future bills & budgets need 60 votes.... _E_\n.@alexsalmond @pressjournal RT @JohnDuthie1 just sitting here looking out over Aberdeen bay. These clowns cannot be allowed... _E_\nWord is that crying @GlennBeck left the GOP and doesn't have the right to vote in the Republican primary. Dumb as a rock. _E_\nHealth Insurance stocks which have gone through the roof during the ObamaCare years plunged yesterday after I ended their Dems windfall! _E_\nRemember it was the Republican Party with the help of Conservatives that made so many promises to their base BUT DIDN'T KEEP THEM! Hi DT _E_\nIf the working proud and productive people of our country don't start exerting their authority and views the U.S. as we know it is doomed! _E_\nOne of the saddest things in journalism is what happened to the formerly great @AP. They have lost their way and are no longer credible. _E_\nChange is not a destination just as hope is not a strategy. Rudy Giuliani _E_\nOff to Indiana! #Trump2016 __HTTP__ _E_\nRosie O'Donnell should leave Lindsay Lohan alone @Rosie has bigger problems than Lindsay. Lindsay's mother called my office for help _E_\nOne of the many reasons that @VattenfallGroup dropped out of windfarm project—they couldn't solve military radar defense problems _E_\n\"Don't let the fear of striking out hold you back.\" – Babe Ruth _E_\nWhen will the Fake Media ask about the Dems dealings with Russia & why the DNC wouldn't allow the FBI to check their server or investigate? _E_\nObamaCare gives free insurance to illegal immigrants. Yet @BarackObama is cutting our troops healthcare. (cont) __HTTP__ _E_\nIt is wonderful to be in beautiful Doonbeg touring @Trump_Ireland. I'm truly honored by the wonderful welcome to my family & organization _E_\nIf you're going to be thinking you may as well think big. 
_E_\nWatch the game really good. _E_\n\"Success isn't permanent and failure isn't fatal.\" – Mike Ditka _E_\nI will be in Iowa all day and until Tuesday morning. Finally after all these years of watching stupidity we will MAKE AMERICA GREAT AGAIN! _E_\nTrue @THEGaryBusey is a scene stealer without trying. He's got a gift. #CelebApprentice _E_\nIf I become the next POTUS they will not be ignoring! #AmericaFirst __HTTP__ _E_\nIs that all there is? We need a new President FAST! _E_\nInvincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_\nInnovation distinguishes between a leader and a follower. Steve Jobs _E_\nLoved doing #NCGOPConvention keynote speech last night! Unbelievable reception. Had the biggest crowds by far of any of the GOP candidates. _E_\nThe talks between the U.S. and Iran are going on forever WORLD'S LONGEST NEGOTIATION. Obama has no idea what he is doing incompetent! _E_\nYou're all wrong—check the facts! UK is massively subsidizing Scotland's wind turbines & the people don't want them. _E_\n... Rove's ad campaign has made Ashley Judd a totally credible candidate. Be careful Mitch! _E_\nTonight I trade places with Larry King @kingsthings and interview him on the 25th anniversary of his show. 9PM on CNN featuring best clips. _E_\nMy @foxandfriends interview discussing the Benghazi cover up Hostess' closing & celebrating Thanksgiving with family __HTTP__ _E_\nCrooked took MILLIONS from oppressive ME countries. Will she give the $$$ back? Probably not. Don't forget her slog... __HTTP__ _E_\nI am having 600 Thanksgiving dinners sent to the Rockaways prepared by my wonderful Trump Grill/Trump Tower staff. #SandyRelief _E_\nSpolier alert...the record setting 13th season of All Star @CelebApprentice also features the return of previous winners in the boardroom. _E_\nVenezuelan leader Hugo Chavez said in a television interview that aired on Sunday If I were American I'd vote for Obama. _E_\nJOBS JOBS JOBS! 
#MAGA __HTTP__ _E_\nOn my way to Pensacola Florida. See everyone soon! #MAGA __HTTP__ _E_\nCrooked Hillary Clinton has destroyed jobs and manufacturing in Pennsylvania. Against steelworkers and miners. Husband signed NAFTA. _E_\n...@BarackObama is hiding plenty of bad things. _E_\nWith the coming forward today of the woman central to the failing @nytimes hit piece on me we have exposed the article as a fraud! _E_\nThis is the right TAX CUT @ the RIGHT TIME. We will ALL succeed & grow TOGETHER – as one team one people & one American family. #TaxReform __HTTP__ _E_\nIf you want to conquer fear don't sit home and think about it. Go out and get busy. Dale Carnegie _E_\nShould have gone after the oil years ago (like I have been saying). _E_\nAn 'extremely credible source' has called my office & told me that @BarackObama applied to Occidental as a foreign student think about it! _E_\nCrooked Hillary Clinton blames everybody (and every thing) but herself for her election loss. She lost the debates and lost her direction! _E_\nLooking forward to addressing @ralphreed's @FaithandFreedom 'Road to Majority Conference' on June 13th __HTTP__ _E_\nPresident Obama spends so much time speaking of the so called Carbon footprint and yet he flies all the way to Hawaii on a massive old 747. _E_\nAs a tribute to the late great Phyllis Schlafly I hope everybody can go out and get her latest book THE CONSERVATIVE CASE FOR TRUMP. _E_\nThe economy is bad and getting worse almost ZERO growth this quarter. Nobody can beat me on the economy (and jobs). MAKE AMERICA GREAT AGAIN _E_\nWe are excited to announce Trump Estates at Akoya by DAMAC luxury villas situated byTrump Int'l Golf Links Dubai __HTTP__ _E_\nBreaking news The Washington Redskins have just announced that they will be removing the name Washington from their name! 
_E_\n\"@DamacOfficial Announces @TigerWoods to Create Golf Course for Trump World Golf Club Dubai\" __HTTP__ via @BusinessWire _E_\nRT @DanScavino: LOUISIANA GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nGlad to hear @GovChristie will be delivering the Keynote for the @RNC convention. He will deliver a strong message. _E_\nIt's Tuesday. How much money will Karl Rove waste today trying to push amnesty through the House? _E_\nNo @JebBush you're pathetic for saying nothing happened during your brother's term when the World Trade Center was attacked and came down. _E_\nKasich voted for NAFTA a disaster for Ohio and now wants the even worse TPP approved. Vote Trump and end this madness! _E_\nWe must all be united in offering assistance to everyone suffering in Puerto Rico and elsewhere in the wake of this terrible disaster. _E_\nA clip of my upcoming interview with @DavidBrody discussing #TimeToGetTough @Israel and the Islamist winter __HTTP__ _E_\nRT @DanScavino: Join @realDonaldTrump LIVE in Wisconsin with Gov. @ScottWalker @MayorRGiuliani @Reince & Coach Bobby Knight! LIVE: __HTTP__ _E_\nIt is amazing how often I am right only to be criticized by the media. Illegal immigration take the oil build the wall Muslims NATO! _E_\nCongratulations to Aberdeen and Scotland for just having our great golf course named Best New Course In World by The Robb Report. _E_\nI love that thousands of people are boycotting @Macys and cutting up credit cards. No guts no glory. This really backfired love it! _E_\nFacebook was always anti Trump.The Networks were always anti Trump henceFake News @nytimes(apologized) & @WaPo were anti Trump. Collusion? _E_\nI hope Bill Clinton and NEWSMAX's Chris Ruddy are enjoying their mission to Africa. Two great people. 
_E_\nCelebrity Apprentice continues to be a top ten trend on twitter this morning __HTTP__ _E_\nNo I'm saying that the World is paying the price for China's pollution while they make a fortune with their dirty factories! Very sad. _E_\nThe Club For Growthwhich asked me for $1000000 in an extortion attempt just put up a Wisconsin ad with incorrect math.What a dumb group! _E_\nVia \"TRUMP: HILLARY PRESIDENCY WILL CAUSE 'CRIME WAVE LIKE YOU'VE NEVER SEEN'\" __HTTP__ via @BreitbartNews _E_\n.@MattGinellaGC Don't forget to watch Matt tomorrow on Morning Drive talking about The Blue Monster and Trump Doral. @GolfChannel _E_\nA quote was read from a parody account last night on MSNBC re: Jeb. __HTTP__ _E_\nRT @Corrynmb: @realDonaldTrump Liberals have an agenda and it's not in America's best interest. Keep fighting the good fight! We stand with... _E_\nToday it was my great honor to sign a new Executive Order to ensure Veterans have the resources they need as they transition back to civilian life. We must ensure that our HEROES are given the care and support they so richly deserve! __HTTP__ __HTTP__ _E_\nLeaving now I'm spending the entire day in Iowa great people great state! _E_\n#MakeAmericaSafeAgain #ImWithYou __HTTP__ _E_\nRambling and stumbling @hardball_chris is as dumb as a rock! _E_\nMy interview yesterday with @TeamCavuto discussing Europe's debt deal and the GOP primary __HTTP__ _E_\nGreat New Poll __HTTP__ _E_\nThe Justice Dept. should have stayed with the original Travel Ban not the watered down politically correct version they submitted to S.C. _E_\nThe best vision is insight. Malcolm Forbes _E_\nBusiness is looking better than ever with business enthusiasm at record levels. Stock Market at an all time high. That doesn't just happen! _E_\nThe Senate must NOT pass TPA! Any Senator who votes for it is disqualified for being POTUS. Protect the American worker and manufacturer! _E_\nYet another terrorist attack this time in Turkey. 
Willthe world ever realize what is going on? So sad. _E_\nA MUST WATCH TRULY BEAUTIFUL! @PrivateCaddie: Amazing Turnberry Ailsa course changes from @realDonaldTrump #Golf __HTTP__ _E_\nI am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_\nComing together is a beginning keeping together is progress working together is success. Henry Ford _E_\nIt's important to remain open to new ideas and new information. Keep your door open every day to something innovative and energizing. _E_\nLocated in Palm Beach FL historic Mar a Lago features 20 exquisite acres filled w/ world class amenities __HTTP__ _E_\nVia @businessinsider by @hunterw: \"TRUMP: 'I'm going to surprise a lot of people' in 2016\" __HTTP__ _E_\nResults are what matter. The bottom line is clearly the bottom line. Think Like a Champion _E_\nI will be live tweeting my interview with @megynkelly on the Fox Network tonight at 8! Enjoy! __HTTP__ _E_\nI will be interviewed by @MariaBartiromo on @MorningsMaria @FoxBusiness at 7:30 A.M. Enjoy. _E_\nBush and Rubio are finally attacking each other as I knew they would in order to be the last establishment man standing against me.Great _E_\nThis is what REAL PRIDE in our COUNTRY is all about! #USA __HTTP__ _E_\nBernie Sanders is being treated very badly by the Democrats the system is rigged against him. Many of his disenfranchised fans are for me! _E_\nRT @axios: The DOJ is opening a civil rights investigation on the car attack in Charlottesville __HTTP__ _E_\nThank you New Hampshire! #FITN __HTTP__ _E_\nMeeting with Iowa State Senate Leaders __HTTP__ _E_\nMust read f/@ weeklystandard by @JayCostTWS: \"Obamacare Myth Making Five phony success stories.\" __HTTP__ _E_\n\"Each life is made up of mistakes and learning waiting and growing practicing patience and being persistent.\" – Rev. 
@BillyGraham _E_\nBeing good in business is the most fascinating kind of art.Making money is art & working is art & good business is the best art. A. Warhol _E_\nRepublicans must be careful in that the Dems own the failed ObamaCare disaster with its poor coverage and massive premium increases...... _E_\nHillary's vision is a borderless world where working people have no power no jobs no safety. _E_\nAlways bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_\nIvanka Trump will be interviewed on @foxandfriends. _E_\nCrude has skyrocketed since @BarackObama delayed the Keystone Pipeline. Not only are 20000 jobs gone but family budgets are tightening. _E_\nOur founders invoked our Creator four times in the Declaration of Independence. Our currency declares \"IN GOD WE TRUST.\" And we place our hands on our hearts as we recite the Pledge of Allegiance and proclaim that we are \"One Nation Under God.\" #NationalPrayerBreakfast __HTTP__ _E_\nSugar @Lord_Sugar Why don't you tell the public what you're really worth they would be very disappointed. _E_\nClosely monitoring #HurricaneHarvey from Camp David. We are leaving nothing to chance. City State and Federal Govs. working great together! _E_\nThe #Hyperlapse app in @TrumpTowerNY __HTTP__ _E_\nI will be interviewed by @GStephanopoulos on @GMA at 7:00 A.M. There is much to talk about! _E_\nThank you for your support in Biloxi MS! Let's ALL get out & VOTE in 2016 so we can #MakeAmericaGreatAgain! __HTTP__ _E_\nCrooked Hillary promised 200k jobs in NY and FAILED. We'll create 25M jobs when I'm president and I will DELIVER! __HTTP__ _E_\nThey finally let our Marine out of a Mexican prison no thanks to Obama. Way too long. Such an event should never be allowed to happen again _E_\nIt was just announced that @ErinBurnett won't be going to mornings on CNN. @OutFrontCNN just made a wise decision. 
_E_\nNo wonder the Today Show on biased @NBC is doing so badly compared to its glorious past. Little credibility! _E_\nJoin me in Mobile Alabama on Sat. at 3pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_\nI will be interviewed by @MarthaMaccallum on @FoxNews tonight at 7pm. Enjoy! _E_\nIt is about time that Roger Goodell of the NFL is finally demanding that all players STAND for our great National Anthem RESPECT OUR COUNTRY _E_\nI'm sick of always reading about outsourcing. Why aren't we talking about 'onshoring'? (cont) __HTTP__ _E_\nThank you to @foxandfriends for the nice reviews of last night. _E_\nNasty Ted Cruz is at it again same dirty tricks he used w/ @RealBenCarson saying I may not be on ballot & I hold liberal positions. LIES! _E_\n\"He who knows when he can fight and when he cannot will be victorious.\" Sun Tzu _E_\nMy Trump Home Mattress Collection by Serta is setting records they are really phenomenal. You can order them at __HTTP__ _E_\nPigs get slaughtered ... again. Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_\nDo you believe this singing? #Oscars _E_\n.@antbaxter Anthony—did you illegally take clips from the Letterman @Late_Show show and @GolfChannel without their approval? _E_\nRemember this the worst doctors (by far) are celebrity doctors. If you see their names or read about them in the newspapers stay away! _E_\nVia @espn: @dallasmavs \"most likely scenario remains finishing a frustrating ninth in the West\" __HTTP__ _E_\nBusy week planned with a heavy focus on jobs and national security. Top executives coming in at 9:00 A.M. to talk manufacturing in America. _E_\n.@KellyandMichael are both wonderful people. Their show is terrific. #CelebApprentice _E_\nDon't take vacations. What's the point? If you're not enjoying your work you're in the wrong job. 
Think Like A Billionaire _E_\nBig ratings getter @seanhannity and Apprentice Champion John Rich are right now going on stage in Las Vegas for #VegasStrong. Great Show! _E_\nAmerican must now get very tough very smart and very vigilant. We cannot admit people into our country without extraordinary screening. _E_\n#RiyadhSummit #POTUSAbroad __HTTP__ _E_\nReally bad article about me in the dying (or dead) Esquire Magazine. Totally false lots of hatred. When will this boring magazine close? _E_\nPeople that have read it tell me that @KarlRove book is terrible (and boring). Save your money! @FoxNews should can him no credibility! _E_\nSee the attack very possibly could have been stopped. We need real leadership and vision. __HTTP__ _E_\nWhat's more important for the American public to have? @MittRomney's tax returns or @BarackObama's sealed records? _E_\nIran with all of the money and all else given to them by Obama has wanted a way to take over Saudi Arabia & their oil. THEY JUST FOUND IT! _E_\nYou have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_\nTo aspiring entrepreneurs: Trust your instincts. They are there for a reason. _E_\nCOMING UP @GenFlynn @newtgingrich on @foxandfriends _E_\nWho would you like to see on next season of #CelebrityApprentice? Let us know everyone wants to be on it. _E_\nTrump defends campaign manager charged for bruising a reporter: __HTTP__ _E_\nStock Market up 5 months in a row! _E_\nAt 9:00 P.M. @CNN of all places is doing a Special Report on my daughter Ivanka. Considering it is CNN can't imagine it will be great! _E_\nHILLARY'S BAD TAX HABIT! __HTTP__ _E_\nSuch long rhetorical and boring answers from Obama. No wonder nothing gets done. _E_\n.@DannyZuker I hear your filmography is stacked with failures. 
_E_\nI hope you buy my shirts and ties at @Macys _E_\nA huge honor for @TrumpToronto for being named #1 Luxury Hotel in Canada by @TripAdvisor's #TravelersChoice Awards __HTTP__ _E_\nRT @DonaldJTrumpJr: Great group at our Victory Office in Columbus Ohio. I'm incredibly grateful to have so many... __HTTP__ _E_\nWow new @ABCnews/@WashingtonPost @GOP preference poll has DonaldTrump 11 points up! Thank you. _E_\nThe delegates at the @DNC convention keep shouting Four More Years. Four more years of 18% real unemployment and another $6T in debt? _E_\nThe #CelebApprentice post @OMAROSA. Will it ever be the same? _E_\nReferees are destroying the enjoyment of NFL games. Slowing down the fun. Big shots. Jets game is ridiculous! _E_\nI hope Derek Jeter's recovery is going well. He is a very special player and a great guy. New York loves him. @yankees _E_\nThank you Pennsylvania I am forever grateful for your amazing support. Lets MAKE AMERICA GREAT AGAIN! #MAGA... __HTTP__ _E_\nMy @SquawkCNBC interview. __HTTP__ _E_\n.@TheRealMarilu is impressing the All Star Celebrity @ApprenticeNBC viewers with her continued success on Team Power. _E_\nAn unbelievable night in Iowa with our great Veterans! We raised $6000000.00 while the politicians talked! #GOPDebate _E_\nI missed the PGA Championship because it was not broadcast by TimeWarner @TWC. Why aren't they giving subscribers major discounts? _E_\nIf I only had 1 person running against me in the primaries like Hillary Clinton I would have gotten 10 million more votes than she did! _E_\nVia @Newsmax_Media: \"Trump to Speak at CPAC\" __HTTP__ @CPACnews #CPAC13 _E_\nThe hatred that clown @krauthammer has for me is unbelievable – causes him to lie when many others say Trump easily won debate. _E_\nThank you Mahoning County Ohio! See you soon! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_\n.@GovernorPataki was a terrible governor of NY one of the worst would've been swamped if he ran again! 
_E_\nThe new Red Tiger course at @TrumpDoral __HTTP__ Follow @TrumpGolf for more great photos. _E_\nMore and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the one sided coverage? _E_\nCanadian PM Harper immediately called the Ottawa attack terrorism. At least North America has a strong leader who lives in reality. _E_\nMy first order as President was to renovate and modernize our nuclear arsenal. It is now far stronger and more powerful than ever before.... _E_\nThe failing @nytimes does major FAKE NEWS China story saying Mr.Xi has not spoken to Mr. Trump since Nov.14. We spoke at length yesterday! _E_\nMedicare payments have become so unpredictable that record amount of doctors are now leaving __HTTP__ Bad for long term. _E_\nArnold Schwarzenegger isn't voluntarily leaving the Apprentice he was fired by his bad (pathetic) ratings not by me. Sad end to great show _E_\nThank you Virginia! #Trump2016#SuperTuesday _E_\nGreat Live Signing last nite! Over 25k views. I am signing books for next two weeks. Order yours for holiday gifts. __HTTP__ _E_\nExcited to see @SixteenChicago's \"elevated fine dining\" explored by @USAToday @10Best! __HTTP__ _E_\nIt's Friday. How many people have been forced off their plans and lost their doctors today because of ObamaCare? _E_\nMiss Israel and Miss Lebanon no more fighting! #TrumpVlog #MissUniverse __HTTP__ _E_\nRobert I'm getting a lot of heat for saying you should dump Kristen but I'm right. If you saw the Miss Universe girls you would reconsider. _E_\nTwo people fired very early on Celebrity Apprentice tonight at 9 leading up to next weeks live Finale. Don't get angry at me tonight! _E_\nIt's important to promote an image of yourself each and every day. It's part of having a sense of self and a sense of purpose. _E_\nNow Chinese agents are smuggling our military weapons through rogue US soldiers __HTTP__ China loves to cheat! 
_E_\nRe: Negotiation: View any conflict as an opportunity. Be a diplomat as much as possible. _E_\nI will be landing in Las Vegas shortly to pay my respects with @FLOTUS Melania. Everyone remains in our thoughts and prayers. _E_\nOur billion dollar website __HTTP__ _E_\nVia @BleacherReport: \"Donald Trump to Be Inducted into WWE Hall of Fame\" __HTTP__ _E_\nRT @DonaldJTrumpJr: An Honor to be in #Indiana w @realDonaldTrump @greta & the legend Bobby Knight! I like our secret weapon better!!! __HTTP__ _E_\nGoing over to @TodayShow now to introduce @ApprenticeNBC cast etc. watch. _E_\nAlaska Arizona Maine and Kentucky are big winners in the Healthcare proposal. 7 years of Repeal & Replace and some Senators not there. _E_\nIf I run I will be in all the primary debates and you will see why I am the only one who can Make America Great Again! _E_\nHappy Thanksgiving to all even the haters and losers! _E_\nApprentice ratings doing great easily won the 10 o'clock hour over other networks! _E_\nCan you imagine the anger and disgust when the heads of other countries found out that their cell phones were being tapped by NSA.Obama mess _E_\nIt's been great making so many new friends at Trump @DoralResort for the @CadillacChamp. Good luck to everyone! _E_\n.@EricShawnonFox Highest rated Saturday Night Live in four years. 47% higher than their opening night with Hillary & Miley Cyrus. Nice words _E_\n.@GStephanopoulos just announced that I am leading BIG in the new @ABC Poll which will be shown on This Week at 9:00 A.M. I will be on show _E_\nTHANK YOU California Maryland New York and Pennsylvania! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nOut of our very big country with many choices does everyone notice that both the ban case and now the sanctuary case is brought in ... _E_\nRemember if you don't sell yourself no one else will. Make sure the public friends & the business community hears about your success. 
_E_\nMy interview with @gretawire last night Everything Obama Does is a 'Campaign Speech' __HTTP__ _E_\nBernie Sanders totally sold out to Crooked Hillary Clinton. All of that work energy and money and nothing to show for it! Waste of time. _E_\nJust received applause at #NCGOPcon when I said People ask me why I may run for President I might so we can Make America Great Again! _E_\nCarson now admits his friend named Bob who he tried to stab (Bob was saved by his belt buckle!) no longer exists as Bob. Wrong name! _E_\nThe United States mourns for the victims of Nice France. We pledge our solidarity with France against terror. __HTTP__ _E_\nRepublican Tax Cuts are looking very good. All are working hard. In the meantime the Stock Market hit another record high! _E_\nI saw from my window just before accident that the crane was not properly anchored for the storm. _E_\nIf we did all the things we are capable of we would literally astound ourselves. Thomas A. Edison _E_\n.@MajorCBS Major Garrett of @CBSNews covers me very inaccurately. Total agenda bad reporter! _E_\nPiers truly hates Omarosa! _E_\nCertainly has been an interesting 24 hours! _E_\nWhen will @BarackObama present an actual budget? Enough with the games. _E_\nFailing @nytimes which has been calling me wrong for two years just got caught in a big lie concerning New England Patriots visit to W.H. _E_\nWhat recovery? JP Morgan has readjusted Q2 growth down from 1.7% to 1.4% and Q3 to 1.5% with 2012 on a whole at 1.7% __HTTP__ _E_\nChina wouldn't provide a red carpet stairway from Air Force One and then Philippines President calls Obama the son of a whore. Terrible! _E_\n\"Presidential Proclamation Commemorating the 50th Anniversary of the Vietnam War\" __HTTP__ __HTTP__ __HTTP__ _E_\nRT @PressSec: .@POTUS and @FLOTUS meet w/ some of America's finest on the USS Kearsarge off the coast of PR. 
__HTTP__ _E_\nAlmost all reporters falsely report that I had a bad time at last year's White House Correspondents' Dinner. (cont) __HTTP__ _E_\nBernie Sanders is lying when he says his disruptors aren't told to go to my events. Be careful Bernie or my supporters will go to yours! _E_\nWe are going to ask Katherine Webb to be a judge at the Miss USA Pageant coming up in Las Vegas. _E_\nOn this wonderful Veterans Day I want to express the incredible gratitude of the entire American Nation to our GREAT VETERANS. Thank you! __HTTP__ _E_\nA great night in Iowa! __HTTP__ _E_\nI still don't know who I'm going to choose. @GeraldoRivera or @LeezaGibbons? Who do you like? @ApprenticeNBC _E_\nHe @BarackObama is incapable of admitting that he is a complete and utter failure. He is 100% responsible for Solyndra. __HTTP__ _E_\nWhy do people listen to clown @KarlRove on @FoxNews? Spent $430M & lost all races—a Bushy! _E_\nThanks you for all of the Trump Rallies today. Amazing support. We will all MAKE AMERICA GREAT AGAIN! _E_\nThoughts & prayers with everyone in Lafayette Louisiana this evening. _E_\n.@TheEconomist Poll one of the most highly respected was just released. Wow wait until the media digests these numbers won't be happy! _E_\nThank you America! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\nMy great honor to host the 2017 back to back #StanleyCup Champion Pittsburgh Penguins at the WH with @FLOTUS today! __HTTP__ __HTTP__ _E_\nI have a tip that can take 5 strokes off anyone's golf game. It's called an eraser. Arnold Palmer _E_\nWith unempoyment over 10% in 2009 @BarackObama held an extravagant Alice in Wonderland party. He is a man of the people! _E_\nIf you are steadfast in your efforts and self respect critics will be harmless. Keep your focus! _E_\nIt is being reported by virtually everyone and is a fact that the media pile on against me is the worst in American political history! 
_E_\nVia @MoscowTimes: Donald Trump in New @eminofficial Video __HTTP__ Emin & family are wonderful people. _E_\n#Imwithyou __HTTP__ _E_\nPrice of corn has jumped over 50%. This will cause a jump in food prices perhaps beyond what we've ever seen. Nasty for the economy. _E_\nWhen confronted @RickSantorum can't defend his ridiculous attacks on @MittRomney __HTTP__ _E_\n... So if you want to aim high you have to have the guts to handle the inevitable bumps in the road. Think BIG _E_\nMillions protesting in Egypt for Morsi's ouster __HTTP__ When will Obama demand Morsi's resignation as he did to Mubarak _E_\n#TBT As a young man when I proposed the Convention Center in New York City. __HTTP__ _E_\nRT @DRUDGE_REPORT: WSJ: Grifters in Chief... __HTTP__ _E_\nPutin has become a big hero in Russia with an all time high popularity. Obama on the other hand has fallen to his lowest ever numbers. SAD _E_\nThis story is no longer about John McCain it's about our horribly treated vets. Illegals are treated better than our wonderful veterans. _E_\n...they are costly inefficient bird killing community destroying machines. They are obsolete! @maddow _E_\nThe last person corrupt Hillary Clinton wants to run against is Donald J. Trump. I'll end up beating her in every state. New Fox Poll Trump! _E_\n#trumpvlog NY Area Two book signings Tonight and Thursday.... __HTTP__ _E_\nCourageous Patriots have fought and died for our great American Flag we MUST honor and respect it! MAKE AMERICA GREAT AGAIN! _E_\nKnowledge requires patience action requires courage. Put patience and courage together and you'll be a winner. _E_\nMy childcare plan makes a difference for working families more money more freedom. #AmericaFirst means... __HTTP__ _E_\nWould seem that plane landed short of runway in San Francisco! _E_\nMy son Donald openly gave his e mails to the media & authorities whereas Crooked Hillary Clinton deleted (& acid washed) her 33000 e mails! 
_E_\nHillary Clinton is using race baiting to try to get African American voters but they know she is all talk and NO ACTION! _E_\nToday's open call drew thousands of eager applicants. It was an impressive group I enjoyed meeting them. We've got some great candidates! _E_\nBecause I will be busy doing anything other than being in the movie #RoadHard. __HTTP__ _E_\nI hear @pennjillette show on Broadway is terrible. Not surprised boring guy (Penn). Without The Apprentice show would have died long ago. _E_\nIsn't it sad that on a day of national tragedy Hillary Clinton is answering softball questions about her email lies on @CNN? _E_\nJust for your info tax returns have 0 to do w/ someone's net worth. I have already filed my financial statements w/ FEC. They are great! _E_\nISIS is taking credit for the terrible stabbing attack at Ohio State University by a Somali refugee who should not have been in our country. _E_\nNow even @BarackObama's old professors are coming out in opposition to his re election. __HTTP__ He has embarrassed them. _E_\nOnly a fool would buy the @NYDailyNews. Loses fortune & has zero gravitas. Let it die! _E_\nEntrepreneurs: Review your work habits regularly and make sure they are taking you in the right direction. Keep your focus intact. _E_\nAdvice from my mother Mary MacLeod Trump: Trust in God and be true to yourself. _E_\nTed Cruz was born in Canada and was a Canadian citizen until 15 months ago. Lawsuits have just been filed with more to follow. I told you so _E_\n..my endorsement). He also wanted to be Secretary of State I said NO THANKS. He is also largely responsible for the horrendous Iran Deal! _E_\n.@kilmeade It was great being with you on @foxandfriends this morning. So many people saw and loved the piece. Great work! 
_E_\nJoin us Saturday night for the South Carolina Primary Watch Party!#SCPrimary #Trump2016 __HTTP__ _E_\nPictures of @melaniatrump and me from the Men In Black III premiere in New York City __HTTP__ We loved the movie! _E_\nThe United States under President Obama has truly become the gang that couldn't shoot straight. Everything he touches turns to garbage! _E_\nThank you @DennisRodman. It's time to #MakeAmericaGreatAgain! I hope you are doing well! __HTTP__ _E_\nRT @JackPosobiec: Meanwhile: 39 shootings in Chicago this weekend 9 deaths. No national media outrage. Why is that? __HTTP__ _E_\nThe Ebola nurse should NEVER have been allowed to fly to Cleveland and (amazing) back again. Nothing works in our once great country anymore _E_\nJust signed 702 Bill to reauthorize foreign intelligence collection. This is NOT the same FISA law that was so wrongly abused during the election. I will always do the right thing for our country and put the safety of the American people first! _E_\nI look forward to Saturday night and being inducted into the @WWE Hall of Fame. _E_\nClips from tax speech and @seanhannity on @foxandfriends now. Have a great day! _E_\nVia @TPInsidr __HTTP__ _E_\nThe biggest doers often suffer the biggest setbacks in life... _E_\nThe legendary @BarbaraJWalters interviews my family and me tonight at 10:00 on @ABC2020 . Don't miss it! __HTTP__ _E_\nCNN Poll just out on South Carolina – great #'s __HTTP__ _E_\n'Obama Warned Of Rigged Elections In 2008.' Time to #DrainTheSwamp __HTTP__ __HTTP__ _E_\nSouth Carolina was so great last night. Will be back soon! _E_\nI will be interviewed on @FacetheNation Sunday 10AM on CBS. @johndickerson is a true pro! _E_\nThe #CNBCGOPDebate poll closed with #Trump2016 declared the official winner. Thank you! __HTTP__ __HTTP__ _E_\nFor those of you defending Bret and saying Omarosa should go remember Bret chose O which could also be considered a big mistake! 
_E_\nVia @realitytvworld: La Toya Jackson fired from 'All Star Celebrity Apprentice' by Donald Trump __HTTP__ _E_\nI am so glad @Rosie got fired by @Oprah. Rosie is a bully and it's always nice to see bullies go down! _E_\nTina Brown could finally be over. @thedailybeast is a total failure. She just got fired great! _E_\nIt is a great honor to have helped the community so much. __HTTP__ _E_\nRT @foxandfriends: OPIOID CRISIS: Worse than we thought with a new study showing overdose deaths were under reported __HTTP__ _E_\nVia @AP by @kronayne & @colvinj: Disavowed by GOP leaders Trump has supporters cheering __HTTP__ _E_\nTrump @DoralResort's renovations are on schedule. With such a massive project underway I am watching closely. _E_\nCrooked Hillary Makes History! #ImWithYou #AmericaFirst __HTTP__ _E_\nThank you the very dishonest Fake News Media is out of control! __HTTP__ _E_\nThank you for your support! TOGETHER we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nWrong used to be called global warming and when that name didn't work they deftly changed it to climate change because it's freezing! _E_\n.@ewanshearer Happy Birthday _E_\nBeijing had a bigger celebration than Chicago last night. The Chinese are happier with the election than we are. _E_\nNegotiation is an art. Treat it like one. _E_\nPeople don't understand that I left The Apprentice to run for Pres—the Apprentice DID NOT leave me. Bob Greenblatt & folks @NBC were GREAT! _E_\nAt the request of many I will be doing live tweets during the next presidential debate. _E_\nI loved being at Liberty University today! Record setting crowd unbelievable people! Thank you Jerry and Becki! __HTTP__ _E_\n.@HillaryClinton #ICYMI WE ARE NOT IN A NARRATIVE FIGHT. @Mike_Pence #MAGA __HTTP__ _E_\nGoing to New Hampshire all sold out crowds. People want real change POLS WILL NEVER MAKE OUR COUNTRY GREAT AGAIN! _E_\nRep. 
Stephen Lynch (D Ma) said There's all of these taxes and fees that are the tough medicine..it's going to hit the fan 're ObamaCare. _E_\nWe are experiencing the coldest weather in more than two decades most people never remember anything like this. GLOBAL WARMING anyone? _E_\nIs Cruz honest? He is in bed w/ Wall St. & is funded by Goldman Sachs/Citi low interest loans. No legal disclosure & never sold off assets. _E_\nA lot of the @Yankees should be ashamed of their play in the post season. They are lucky they don't have to deal with George Steinbrenner. _E_\nI have millions more votes/hundreds more dels than Cruz or Kasich and yet am not being treated properly by the Republican Party or the RNC. _E_\nLive on the edge no complacency is allowed and keep an open mind. Business is a creative endeavor. _E_\nVia @BreitbartNews' @biggovt: \"WAR! TRUMP LEVIN PUMMEL ROVE AS CONSERVATIVE BATTLE ESCALATES\" __HTTP__ _E_\n.@genesimmons is terrific congratulations on Hall of Fame. _E_\n...They have been in our country for many years through no fault of their own brought in by parents at young age. Plus BIG border security _E_\nTrump Int'l Hotel Washington D.C.: The iconic Old Post Office Building will be one of the world's great hotels. __HTTP__ _E_\nFor those who missed my chat with @hannityshow on radio here it is on TV. Sean is terrific. __HTTP__ _E_\nThe media is really on a witch hunt against me. False reporting and plenty of it but we will prevail! _E_\nReadout of my meeting with Israeli Prime Minister Benjamin Netanyahu: __HTTP__ __HTTP__ _E_\nOn the whole the teams seem to be working well together. No wars...yet. _E_\nIn all of television the only one who said anything bad about last nights landslide victory was dopey @KarlRove. He should be fired! _E_\nIt is time Republicans stop attacking each other and focus on @BarackObama. America cannot survive a second term. _E_\nThe Trump Signature Collection exclusively available at @Macys tops all menswear styles. 
Dress to impress! __HTTP__ _E_\nVia @DMRegister by @JenniferJJacobs: Trump to hand out Trump memorabilia at Iowa summit __HTTP__ _E_\n.@Yankees are in trouble without Derek. Try A Rod at short get him some confidence. _E_\n.@TrumpWaikiki is Hawaii's top luxury hotel & destination. Each room features stunning views & superb amenities __HTTP__ _E_\nWhy did lightweight A.G. Eric Schneiderman come to my office on numerous occasions begging for campaign contributions? Also recent asks? _E_\nI hate @USAToday's redesign the logo is terrible. Lightweight Al Neuharth must've had something to do with this No wonder paper is failing. _E_\nTHANK YOU NEVADA!#Trump2016 #MakeAmericaGreatAgain@Snapchat! Username: realdonaldtrump __HTTP__ __HTTP__ _E_\nWatching the #GOPConvention#AmericaFirst #RNCinCLE _E_\nJoin me in California or Montana!5/25/16: Anaheim California __HTTP__ Billings Montana __HTTP__ _E_\nIf speeches and memoirs created jobs then @BarackObama would be Ronald Reagan. _E_\nGood news is that my campaign has perhaps more cash than any campaign in the history of politics b/c I stand 100% behind everything we do. _E_\nThank you to @Franklin_Graham. I have always appreciated your courage but now more so than ever! _E_\nSame CDC which is bringing Ebola to US misplaced samples of anthrax earlier this year __HTTP__ Be careful. _E_\n.@mcuban is so short off the tee he can't have much of a punch. He's just a weak man with a big mouth! _E_\n#trumpvlog @BarackObama is very inconsiderate... __HTTP__ _E_\nRT @foxandfriends: Millions of gallons of Mexican waste threaten Border Patrol agents __HTTP__ _E_\nIt was great spending time with @joniernst yesterday. She has done a fantastic job for the people of Iowa and U.S. Will see her again! _E_\nThe arrogant young woman who questioned me in such a nasty fashion at No Labels yesterday was a Jeb staffer! HOW CAN HE BEAT RUSSIA & CHINA? _E_\nDon't forget to watch Larry King tonight CNN at 9 pm. 
He's a television legend and a great friend. It's going to be a fantastic farewell. _E_\nVia @BreitbartNews by @THESHARKTANK1: DONALD TRUMP FIRES ENTIRE 2016 GOP FIELD __HTTP__ _E_\n. @foxandfriends interview discussing a budget deal my #CPAC2013 speech @RealBenCarson & firing @latoyajackson __HTTP__ _E_\n#TBT On the stage during the Emmys performing Green Acres with Megan Mullally __HTTP__ _E_\nThank you to the Governor of Florida Rick Scott for your endorsement. I greatly appreciate your support! _E_\n#CrookedHillary has FAILED all over the world! #BigLeagueTruth #Debates2016 __HTTP__ _E_\nThe movie may be garbage but we can't let a foreign country dictate to us what to watch. @SonyPictures _E_\nIt's Wednesday. I wonder how much money @BarackObama borrowed from China today? _E_\nJust returned home from the great state of New Hampshire. Have made so many friends there special place! _E_\n\"@TrumpFerryPoint was something we've been working on for years and Donald Trump got it to the finish line.\" @rubendiazjr _E_\n.@peachespulliam at @TrumpTowerNY this afternoon a wonderful woman. It was an honor to donate $25K to her charity. __HTTP__ _E_\nAmerica deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_\nThis is no surprise. Constant phony reporting from failing @CNN turns everyone off. The American people get it! __HTTP__ _E_\nNew job numbers once again show no growth or recovery. Unemployment has been over 8% for 41 straight months now up to 8.3% _E_\nDon't wait for dire circumstances to test your quick thinking ability. Be on alert at all times. _E_\nReal unemployment is at over 21%. Businesses won't hire until @BarackObama is defeated in 2012. #TimeToGetTough _E_\nCan you imagine what the outcry would be if @SnoopDogg failing career and all had aimed and fired the gun at President Obama? Jail time! _E_\nThe truth continues to come out after 14 years. 
A truth that many in the media did not want to tell. #Trump2016 __HTTP__ _E_\nTomorrow's election will have historic repercussions for our country. Make America strong again. Vote for @MittRomney. _E_\nWe are going to WIN and MAKE AMERICA GREAT AGAIN maybe better than ever before! _E_\nOur prayers are with Rev. @BillyGraham for a speedy recovery. His faith continues to inspire us all. _E_\nJust leaving Mechanicsburg PA. Incredible crowd so enthusiastic! Will be back soon. #MAGA __HTTP__ _E_\nRT @paulsperry_: BREAKING: top FBI investigator for Mueller PETER STRZOK busted sending political text messages bashing Trump & praising... _E_\nResponse to the Des Moines Register __HTTP__ _E_\n.@Zagat named Christmas Day Brunch @TrumpChicago @SixteenChicago one of the best in the city! #TrumpHolidays __HTTP__ _E_\nThank you Michigan! This is a MOVEMENT that will never be seen again it's our last chance to #DrainTheSwamp! Watch... __HTTP__ _E_\nI am on @FoxNewsSunday with Chris Wallace his 20th year anniversary with #FNS throughout the day. Enjoy! __HTTP__ _E_\nPress Conference Following National Security Briefing in Bedminster New Jersey. __HTTP__ __HTTP__ _E_\n$6 gas is coming sooner than later. America must become energy independent with our own resources and fast.Also (cont) __HTTP__ _E_\nIt's Thursday and again I ask how much money is China stealing from us? _E_\nBelieve you can and you're halfway there. Pres. Theodore Roosevelt _E_\nI believe in #AmericaFirst and that means FAMILY FIRST! My childcare plan reflects the needs of modern working clas... __HTTP__ _E_\nIn case you missed it my @gretawire interview on Obama's IRA rate cut hurting savings & economic growth __HTTP__ _E_\nTrump Making GOP Speech — Is 2016 in the Cards? __HTTP__ via @Newsmax_Media _E_\nAll signs are that business is looking really good for next year only to be helped further by our Tax Cut Bill. Will be a great year for Companies and JOBS! Stock Market is poised for another year of SUCCESS! 
_E_\nI wonder if the Rutgers coach who had the audacity to yell at the player is a proponent of global warming? _E_\nThank you @BillyJoel many friends just told me you gave a very kind shoutout at MSG. Appreciate it love your music! _E_\n#TeamTrump. Police and law enforcement seem to have killed one of the California shooters and are in a shootout with the others. Go police _E_\nI loved firing goofball atheist Penn @pennjillette on The Apprentice. He never had a chance. Wrote letter to me begging for forgiveness. _E_\nThank you Wilmington North Carolina!#MakeAmericaGreatAgain __HTTP__ _E_\nWe spent TWO TRILLION DOLLARS in Iraq and got NOTHING. Now we are going back and will again get NOTHING because our leaders are clueless! _E_\nWill be on Hannity tonight. Rebroadcast of town hall from Pittsburgh PA. 8:00pm on FOX. Enjoy! #Trump2016 __HTTP__ _E_\nI have self funded my winning primary campaign with an approx. $50 million loan. I have totally terminated the loan! _E_\nInteresting read from Peggy Noonan. __HTTP__ _E_\nSince stop & frisk was struck down gun shootings & victims have spiked while gun seizures have decreased. __HTTP__ _E_\nRT @DoralResort: Thanks! RT @gem3wood: @DonaldJTrumpJr You guys @DoralResort have one hell of a leaderboard. Love this Tournament. _E_\nTed Cruz is incensed that I want to refocus NATO on terrorism as well as current mission but also want others to PAY FAIR SHARE a must! _E_\nModerator: Hillary plan calls for more regulation and more government spending. #Debate #BigLeagueTruth _E_\nJoseph Kennedy is really being used by Venezuela and Hugo C. in oil commercial! _E_\nSo many lives and two trillion dollars wasted and our worst enemies will get the 2nd largest oil reserves in the World. Such stupid leaders _E_\nLeaving for New Hampshire now. Making a speech—packed house. Love it! _E_\nThe Fake News is now complaining about my different types of back to back speeches. Well there was Afghanistan (somber) the big Rally..... 
_E_\nA budget that puts #AmericaFirst must make safety its no. 1 priority—without safety there can be no prosperity: __HTTP__ _E_\nNow that George Bush is campaigning for Jeb(!) is he fair game for questions about World Trade Center Iraq War and eco collapse? Careful! _E_\nRT @EricTrump: Nevada remember you can Vote and Go walk in vote and walk out! Caucus locator: __HTTP__ #TrumpLV __HTTP__ _E_\nBig news to share in New Hampshire tonight! Polls looking great! See you soon. _E_\nDopey Arianna @huffingtonpost is really after me boring story after boring story...but I hear she is in big trouble! _E_\nChina is a threat to America. They are not our friend. _E_\nThe Budget passed late last night 51 to 49. We got ZERO Democrat votes with only Rand Paul (he will vote for Tax Cuts) voting against..... _E_\nThe rally in Lowell Massachusetts was amazing. 10000 people going wild. MAKE AMERICA GREAT AGAIN! _E_\nChina keeps manipulating its currency at our financial expense. Why do our leaders continually let China run all over us? _E_\nPuerto Rico survived the Hurricanes now a financial crisis looms largely of their own making. says Sharyl Attkisson. A total lack of..... _E_\nDying @GQMagazine just named me to a list. Too bad GQ is no longer relevant—won't be around long! _E_\nGreat meeting with CEOs of leading U.S. health insurance companies who provide great healthcare to the American peo... __HTTP__ _E_\nThis boardroom gets CRAZY! These people are wild _E_\nUnemployment is plaguing both Black and Hispanic youths. Very troubling. _E_\nI am happy to announce that the @PGAGrandSlam will be held at @TrumpGolfLA this year! __HTTP__ Follow @TrumpGolf for more! _E_\nAwarded the renowned 5 Star @ForbesInspector rating the 65 story @TrumpTO brings style luxury & impeccable service __HTTP__ _E_\nI'm protesting the @UnionLeader from having anything to do w/ ABC debate. Their unethical record doesn't give them the right to be involved! _E_\nSo nice thank you Laura. 
__HTTP__ _E_\n.@antbaxter—Your documentary works better than any sleeping pill—in fact that may be your only way to make money with this recycled garbage! _E_\nDerek get well soon the @Yankees need youl. _E_\nNow @BarackObama is issuing regulatory demands to states ordering no firings in November __HTTP__ _E_\nThank you Delaware County Ohio! Remember either we WIN this election or we are going to LOSE this country!... __HTTP__ _E_\nThank you South Carolina! Everyone has to get out and VOTE on 11/8/16. #MakeAmericaGreatAgain... __HTTP__ _E_\nGreat day in D.C. with @SpeakerRyan and Republican leadership. Things working out really well! #Trump2016 __HTTP__ _E_\nObama has now had two record & historic midterm losses. There is Hope & Change for America. _E_\nPoll numbers have nosedived for pervert NYC mayoral candidate Anthony Weiner good news for New York! _E_\nLightweight Schneiderman's suit was filed on a Saturday (unheard of) against a school with a 98% approval rating right after Obama meeting. _E_\nCheck out today's #trumpvlog about the upcoming episode of @ApprenticeNBC.... __HTTP__ #celebrityapprenticefinale _E_\nTed Cruz said he didn't know that he was a Canadian Citizen. He also FORGOT to file his Goldman Sachs Million $ loan papers.Not believable _E_\nDemocrats have shut down our government in the interests of their far left base. They don't want to do it but are powerless! _E_\n.@Jimmyv3 @WWE Greatly appreciate your nice words re WrestleMania. That's why you are such a respected writer. _E_\nIn standing by @dennisrodman I was also representing many people who have addiction problems & are working hard to come back. _E_\nIran was on its last legs and ready to collapse until the U.S. came along and gave it a life line in the form of the Iran Deal: $150 billion _E_\nThe successful man will profit from his mistakes and try again in a different way. Dale Carnegie _E_\nThank you to the LGBT community! 
I will fight for you while Hillary brings in more people that will threaten your freedoms and beliefs. _E_\nFollow me on Instagram __HTTP__ _E_\nWith that being said I have personally directed the fix to the unmasking process since taking office and today's vote is about foreign surveillance of foreign bad guys on foreign land. We need it! Get smart! _E_\n#CelebApprentice Another exciting episode tune in next Monday at 8pm for 2 more new episodes! _E_\nDONALD TRUMP BLASTS THE OSCARS __HTTP__ via @theblaze _E_\nEarly on Ted Cruz said that if he didn't win South Carolina it's over. He didn't win and lost to me in a landslide! _E_\nIt is a joke the amount of time that network news spends talking about the weather. No wonder their ratings are way down! Enough already. _E_\nThe Golden Rule of Negotiating: He who has the gold makes the rules. _E_\nThe dishonest media is fawning over the Democratic Convention. I wonder why then my speech had millions of more viewers than Crooked H? _E_\nLike your current health care plan? Too bad you're going to lose it under ObamaCare. Hope Change & a 300% Increase in Your Premium. _E_\n#AskTrump Getting ready to answer your questions. __HTTP__ _E_\nHonored to host a luncheon for African leaders this afternoon. Great discussions on the challenges & opportunities facing our nations today. __HTTP__ _E_\n.@THEGaryBusey and one of his Busey isms: \"Art is only the search it is not the final form.\" #CelebApprentice _E_\nGet it straight: Pakistan is not our friend. When our tremendous Navy SEALS took out Osama bin Laden they did... (cont) __HTTP__ _E_\nWho did the House Task Force onUrgent Fiscal Issues call when America needed HELP? __HTTP__ _E_\nI hope @billmaher pays quickly so that this money can immediately be given to the charities. _E_\nCongratulations Kevin Gabriel on your amazing article. If I were a journalist this would be the next Watergate and I would be a star. _E_\nHe ruins the brand: @Robertgbeckel doesn't belong on @FoxNews . 
As CM for Mondale in '84 you lost 49 states. Sad! _E_\nPresident @EmmanuelMacronThank you for the beautiful welcome ceremony at Les Invalides today! __HTTP__ _E_\nMany many people are disappointed I didn't run third party but I won't risk @BarackObama benefiting from a split in the anti Obama vote. _E_\n\"America is the experiment that works.\" – President Ronald Reagan _E_\nStay on message is the chant. I always do trade jobs military vets 2nd A repeal Ocare borders etc but media misrepresents! _E_\nIt pays to have friends in high places like the Justice Department. Clearly the Clintons do. #DrainTheSwamp! __HTTP__ _E_\nEntrepreneurs: There are no guarantees. But being ready sure beats being taken by surprise. Do your due diligence! _E_\nWow the two highest apartment rentals in all of 2013 were at Trump Park Avenue—each one = $100000 per month __HTTP__ _E_\nI just sent @THEGaryBusey a check of $20000 for his charity Children's Kawasaki Disease . He worked hard and deserves it. _E_\nI will be interviewed by @JudgeJeanine tonight on @FoxNews Enjoy! _E_\n\"Trump: Rove Gave Us Obama\" __HTTP__ via @cnsnews _E_\n\"Integrity is the essence of everything successful.\" – Richard Buckminster Fuller _E_\nThank you @foxandfriends. Really great job and show! _E_\nMy @foxandfriends interview discussing how @BarackObama is running a hateful campaign & the @RNC convention 'Surprise' __HTTP__ _E_\nCongratulations to @MariaBartiromo on her big move to @FoxBusiness. She is a total winner! _E_\nPut this on your calendar: The Celebrity Apprentice live finale is this Sunday at 9 p.m. on NBC. Who will be the next Celebrity Apprentice? _E_\nHappy 241st birthday to the U.S. Marine Corps! Thank you for your service!! __HTTP__ _E_\nRT @hughhewitt: @realDonaldTrump I spoke to a group of influential CA GOPers tonight long time activists bundlers influencers. Support f... _E_\nGreat success in Iowa today. Fantastic sold out crowd. Will be back soon! _E_\nHappy Easter to everyone! 
_E_\nDummy writer @tonyschwartz who wanted to do a second book with me for years (I said no) is now a hostile basket case who feels jilted! _E_\nMy @gretawire interview from last Friday discussing the unemployment numbers gas prices and acquiring the Doral __HTTP__ _E_\nObamaCare is a total disaster. Hillary Clinton wants to save it by making it even more expensive. Doesn't work I will REPEAL AND REPLACE! _E_\nSurprise? 1970's global cooling alarmists were pushing same no growth liberal agenda as today's global warming __HTTP__ _E_\nComing up in March: The Comedy Central Roast of Donald Trump. March 15 mark your calendars. __HTTP__ _E_\nBoth Ted Cruz and John Kasich have no path to victory. They should both drop out of the race so that the Republican Party can unify! _E_\nWe will always take care of our GREAT VETERANS. You have shed your blood poured your love and bared your soul in... __HTTP__ _E_\nLooking forward to speaking at #sparknb next week in Atlantic Canada my first time ever. _E_\n.....but that's what I've been saying. Very unfair treatment by the media! _E_\n.@JebBush has embarrassed himself & his family with his incompetent campaign for President. He should remain true to himself. _E_\nIt's disgraceful that the Obama Administration's first response was not to condemn attacks on our diplomatic (cont) __HTTP__ _E_\nI will be signing copies of my new book TIME TO GET TOUGH tomorrow Dec 9th in Trump Tower from 11 a.m. to ... (cont) __HTTP__ _E_\n... It is time to get out and rebuild our own nation. _E_\nWe must repeal Obamacare and replace it with a much more competitive comprehensive affordable system. #debate #MAGA _E_\nThe Huffington Post is such a loser it will die just as AOL is dying What a stupid deal AOL made to buy it! _E_\nA Rod is now being investigated for continued doping __HTTP__ @yankees have a great opportunity to dump him now. Go for it! 
_E_\n.@DonaldJTrumpJr & his wife @MrsVanessaTrump attended the #SnowflakeGardenBrunch here w/ Governor @TerryBranstad. __HTTP__ _E_\nLet @PeteRose in the HOF it's time! _E_\nThe opening of the @TigerWoods Villa at trumpdoral __HTTP__ _E_\nWow China exports rise 15% in September. They are laughing at USA! _E_\nThe mark of a great player is in his ability to come back. The great champions have all come back from defeat. Sam Snead _E_\n#MakeAmericaGreatAgain! __HTTP__ _E_\nRand Paul or whoever votes against Hcare Bill will forever (future political campaigns) be known as the Republican who saved ObamaCare. _E_\nConsumer prices rose in June due to OPEC __HTTP__ OPEC continues to rip off hard working American families daily. _E_\nGreat jobs report today It is all beginning to work! _E_\n.@AndreaTantaros You are a true journalistic professional. I so agree with what you say. Keep up the great work! #MakeAmericaGreatAgain _E_\nHow quickly people forget that Crooked Hillary called African American youth SUPER PREDATORS Has she apologized? _E_\nDon't believe the @FoxNews Polls they are just another phony hit job on me. I will beat Hillary Clinton easily in the General Election. _E_\nModels! Remember to register for the Trump Model Search. Check out the info here: __HTTP__ @CadillacChamp _E_\nIt won't stay a buyer's market forever. If you can take advantage and buy property asap. You'll thank me! _E_\nSmall bright spot in lackluster economy travel industry added 81000 jobs in 2012 __HTTP__ Trump Org had a record year. _E_\nSadly when it comes to using the energy industry to create American jobs @BarackObama has been a total (cont) __HTTP__ _E_\n.@MittRomney should continue to stay on offense on the embassy issue. Obama who put these radicals in power deserves blame. _E_\nThank you to a #Trump2016 supporter for this video of my campaign over the past 6 months. 
Video: __HTTP__ _E_\nCan you believe Crooked Hillary said We are going to put a whole lot of coal miners&coal companies out of business. She then apologized. _E_\nGreat new poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n.@whitehouse continues to defend the billions it pissed away on 'green energy' failures __HTTP__ Your money was wasted. _E_\nThe debate was pretty even but I thought Mitt should have been much more aggressive on Obama's failed foreign policy and I mean much more. _E_\nStanding on what will be the greatest golf course in the world. Opens July 10th. __HTTP__ _E_\nSo much for 'global warming.' Earth is cooling at a record pace __HTTP__ _E_\nOnly you can #SavetheQueen during the LIVE telecast of #MissUSA on June 8 at 8/7c on NBC. Click for more info: __HTTP__ _E_\nAn 'extremely credible source' has called my office and told me that @BarackObama bought his house with the help of Tony Rezko. _E_\n#TBT My confirmation picture at First Presbyterian Church in Jamaica NY. __HTTP__ _E_\nJoin me today in Wilmington Ohio at 4pm: __HTTP__ Tampa Florida at 10am: __HTTP__ _E_\nThe #CNNDebate was amazing so much fun! __HTTP__ _E_\nBuy American & hire American are the principals at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_\nObama has called @GOP terrorists during this showdown. It's a shame he really doesn't think it because then he would meet all @GOP demands. _E_\nI loved being a surrogate on behalf of @MittRomney. I am glad I was able to help him win. _E_\n.@JuliInkster Congratulations on your great win what a captain what a champion! _E_\nEntrepreneurs: Don't put blinders on or limit yourself. Reach out seek and explore. The opportunities are always there. _E_\nEverybody knows why Obama would not show his college applications they are just not willing to say! _E_\nRT @EWErickson: Personally I think it is awesome that @realDonaldTrump listens to @DLoesch on the radio. She's awesome. 
_E_\nWatch my appearance on @foxandfriends... __HTTP__ _E_\nMake sure you realize that this 'deal' is only a stop gap measure.Obama will be looking to raise even more taxes in the coming negotiation.. _E_\nLIVE on #Periscope: Major announcement! #MakeAmericaGreatAgain __HTTP__ _E_\nWhy would the people of Kentucky want a rookie Senator– they have Sen. Mitch @McConnellPress who may be next Leader & bring $'s to KY _E_\nI've been visiting Trump Int'l Golf Links Scotland and the course will be unmatched anywhere in the world. Spectacular! __HTTP__ _E_\nWatch my endorsement of @MittRomney. __HTTP__ _E_\nHave a great Memorial Day and remember that we will soon MAKE AMERICA GREAT AGAIN! _E_\nChuck Hagel showed gross incompetence before yesterdays Senate panel...our new Secretary of Defense. _E_\nHillary Clinton is unfit to be president. She has bad judgement poor leadership skills and a very bad and destructive track record. Change! _E_\nSee the new sizzle reel for The Apprentice __HTTP__ _E_\n.....Has worst attendance record in Senate rarely there to vote on a bill! @marcorubio _E_\nThe great GENERALS MacArthur and Patton real leaders and fighters are spinning in their graves as we give Syria info & time to prepare. _E_\nA great honor to host the @SuperBowl Champion New England @Patriots at the White House today. Congratulations!... __HTTP__ _E_\nGetting China to stop playing its currency charades can begin whenever we elect a president ready to take decisive action. #TimeToGetTough _E_\nSad sack @JebBush has just done another ad on me with special interest money saying I won't beat Hillary I WILL. But he can't beat me. _E_\nI will be interviewed on @oreillyfactor at 8:00 P.M. Enjoy! _E_\nDo you think the 14 African nations that are banning West Africans from coming into their nations are being called racists? Perhaps not! _E_\nMedia desperate to distract from Clinton's anti 2A stance. I said pro 2A citizens must organize and get out vote to save our Constitution! 
_E_\n.@McIlroyRory Way to go Rory fantastic victory! _E_\nThe amazing Trump National Golf Club Los Angeles. __HTTP__ _E_\nThat was really exciting. Made all of my points. MAKE AMERICA GREAT AGAIN! _E_\nDerek Jeter @yankees wants to rent an apartment. Derek only in a Trump building Trump is lucky for you. _E_\nTweet me your questions to answer. #trumpvlog _E_\nI will be on the Mike & Mike Show on radio and ESPN at about 6 to 7 A.M. We will be talking Super Bowl and sports no Obama Care! _E_\nJoined the @HouseGOP Conference this morning at the U.S. Capitol. __HTTP__ #PassTheBill #MAGA... __HTTP__ _E_\nVolunteer to be a Trump Election Observer. Sign up today!#MakeAmericaGreatAgain __HTTP__ _E_\n.@AnnCoulter's new book 'In Trump We Trust comes out tomorrow. People are saying it's terrific knowing Ann I am sure it is! _E_\nI like Rob Astorino. He's a friend and really good guy. Sadly he has ZERO chance of beating Cuomo and the 2 to 1 Dems for governor! _E_\nI will miss @Letterman & doing his show. He was always intriguing & smart. You never knew what would happen but he was fair! _E_\nWas going to do a phoner this morning with @jaketapper on @CNN but they could not get their phone equipment to hook in. Will do next week. _E_\nHelp those affected by #Sandy. @TrumpSoHo is giving $10 per booking made by 11/23 to @RedCross for #sandyrelief. __HTTP__ _E_\nUse your intelligence and your education to execute what your imagination presents to you. This is one step to becoming an entrepreneur. _E_\n.@washingtonpost thinks @IvankaTrump is What Washington's Social Scene Needs __HTTP__ Truth is she's amazing. _E_\nEvery penny of the $7 billion going to Africa as per Obama will be stolen corruption is rampant! _E_\nObama's Def. Sec. just said US Asia focus 'not aimed to contain China' __HTTP__ China is hoping that Obama is re elected. _E_\nVia @politico: Donald Trump to get more CPAC time than Marco Rubio __HTTP__ @CPACnews knows how to prioritize! 
_E_\nHow can an Attorney General ask for campaign contributions during his evaluation of a case a total sleazebag! _E_\nVia @freep: Trump to speak to GOP __HTTP__ _E_\nRep. Steve Scalise of Louisiana a true friend and patriot was badly injured but will fully recover. Our thoughts and prayers are with him. _E_\nGreat write up on @thedailymeal about our new Executive Sous Chef Sydney Jones @TrumpLasVegas: __HTTP__ _E_\nLook how bad it is getting! How much more crime how many more shootings will it take for African Americans and Latinos to vote Trump=SAFE! _E_\nThe same people who did the phony election polls and were so wrong are now doing approval rating polls. They are rigged just like before. _E_\n.@willweatherford @FLGovScott Gaming in Miami will be incredible—best in world and create lots of jobs and revenue. _E_\n\"Be flexibly focused. Focus does not mean being narrow minded or rigid.\" – Think Big _E_\nIt's going to take an outsider to clean up after Clinton Bush and Obama. Let's Make America Great Again! __HTTP__ _E_\nCan you imagine if Obama had to give today's press conference before the election? He would have lost. @GOP really blew it. _E_\nEveryone is saying the bad news is that Donald Trump is going to take credit & they are right—Mitt wouldn't have won anyway. _E_\n#ICYMI: #Trump2016 closing speech inBuffalo New York!#VoteTrumpNY  __HTTP__ _E_\nI want to #MakeAmericaGreatAgain __HTTP__ _E_\nIPSOS/REUTERS POLLThank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nIt is so imperative that we have the right justices. #DrainTheSwamp #Debates #BigLeagueTruth __HTTP__ _E_\nWhen are we going to wake up and realize that we are funding our enemies? #TimeToGetTough _E_\n.@kevinolearytv Great job on @foxandfriends this morning. You tell it like it is! Also thx for the nice mention. Your book sounds great! _E_\nJust watched the totally biased and fake news reports of the so called Russia story on NBC and ABC. Such dishonesty! 
_E_\nEntrepreneurs: It's often to your advantage to be underestimated. _E_\nGreat news on the 2018 budget @SenateMajLdr McConnell first step toward delivering MASSIVE tax cuts for the American people! #TaxReform __HTTP__ _E_\n.@FrankLuntz knows nothing about me or my religion. Came to my office looking for work. I had NO interest. I will save the vets! _E_\nToday we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet __HTTP__ _E_\nCPAC 2013: Donald Trump: Immigration reform is a 'suicide mission' for GOP __HTTP__ by @SethMcLaughlin1 _E_\nJust as I predicted immigration reform will increase the cost of ObamaCare over $300B __HTTP__ More money borrowed from China. _E_\nIf anybody else but Coore and Crenshaw designed Pinehurst they would be run out of town—(and the turtleback greens are totally unfair)! _E_\nJohn Kasich was managing director of Lehman Brothers when it crashed bringing down the world and ruining people's lives. A total failure! _E_\nThe military and Navy Seals should be given more credit for Bin Laden's death not Obama who works hard to take (cont) __HTTP__ _E_\n#MakeAmericaGreatAgain! __HTTP__ _E_\nThe 18th hole at the Blue Monster @Doral in Miami is considered the toughest finishing hole in golf... __HTTP__ _E_\nVia @washingtonpost: The Donald's video should have trumped Eastwood by @CapehartJ __HTTP__ _E_\nand fair elections. We've accepted the outcomes when we may not have liked them and that is what must be expected of anyone standing on a _E_\nSo since the people at the @nytimes have made all bad decisions over the last decade why do people care what they write. Incompetent! _E_\n.@youngmman @realDonaldTrump Conrad Hilton was a great man but Barron Hilton is a dope. Wrong on Barron! _E_\nTed Cruz talks about the Constitution but doesn't say that if the Dems win the Presidency the new JUSTICES appointed will destroy us all! _E_\nI believe America can be great only with proper leadership. 
_E_\n\"Chalk failure up to experience don't take it personally and go find your next challenge.\" – Trump: Never Give Up _E_\nAubrey has a lot of self confidence—but will it be warranted? #sweepstweet _E_\nFor entrepreneurs ignorance is not bliss. It's fatal. It's costly. And it's for losers. You either get organized or get crushed. _E_\nThe Arab Spring has turned into the Islamist Winter. Our ally @Israel is in a perilous position. We must stand behind @Israel. _E_\nThe Mullahs are laughing at what they think is a very stupid president@BarackObama has asked for Iran to return the drone #TimeToGetTough _E_\nNegotiation is an art. Treat it like one. Be open to change it's another word for innovation. _E_\nThe new selection of ties shirts and suits at Macy's is amazing also available in Trump Tower lobby. _E_\nMelania and I extend our warmest greetings to those observing Rosh Hashanah here in the United States in Israel and around the world. _E_\nIn order to save Medicare and stop record premium increases we must repeal ObamaCare. _E_\nGetting ready to visit Walter Reed Medical Center with Melania. Looking forward to seeing our bravest and greatest Americans! _E_\nThe Budget Agreement today is so important for our great Military. It ends the dangerous sequester and gives Secretary Mattis what he needs to keep America Great. Republicans and Democrats must support our troops and support this Bill! _E_\nAny deal on DACA that does not include STRONG border security and the desperately needed WALL is a total waste of time. March 5th is rapidly approaching and the Dems seem not to care about DACA. Make a deal! _E_\nWhile under no obligation to do so I have raised between 5 & 6 million dollars including 1million dollars from me for our VETERANS. Nice! _E_\nSenator Luther Strange who is doing a great job for the people of Alabama will be on @foxandfriends at 7:15. Tough on crime borders etc. _E_\nGreat new polls! Thank you Nevada North Carolina & Ohio. 
Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_\nWatching @trishstratuscom get inducted from the sold out crowd. #WWEHOF. __HTTP__ _E_\nI'll be on @foxandfriends on Monday at 7:30 AM. Always interesting. Tune in! _E_\nRT @WhiteHouse: Today @POTUS will welcome the Prime Minister of India @narendramodi to the White House. __HTTP__ _E_\nVery honored: Trump Is Tops As Clinton Drops In Connecticut Primaries Quinnipiac University Poll Finds __HTTP__ _E_\n\"We are fully supportive of @Israel's right to defend itself.\" @BarackObama Very good I like it. _E_\nCongratulations to @TheSlyStallone and Arnold @Schwarzenegger on 'Expendables 2' #1 box office opening. Still going strong! _E_\nThank you Kentucky! #Trump2016#SuperSaturday _E_\nWhat a statesman! @BarackObama made sure to quickly call the Muslim Brotherhood victor to congratulate him on (cont) __HTTP__ _E_\nI wonder if traitor Edward Snowden will be attending the Miss Universe Pageant in Moscow on November 9th. _E_\nSo @ReutersPolitics claims that @MittRomney's birth certificate evokes 'controversy' __HTTP__ Where (cont) __HTTP__ _E_\nThe failing @nytimes writes false story after false story about me. They don't even call to verify the facts of a story. A Fake News Joke! _E_\nI hope everyone or rather almost everyone had a GREAT EASTER! We need our leaders to make great and wise decisions in these troubled times _E_\nWhile @BarackObama seeks to further destroy our credit our economy continues to hemorrhage jobs. Such a total failure as a President. _E_\nBig day planned on NATIONAL SECURITY tomorrow. Among many other things we will build the wall! _E_\nSuccess seems to be connected with action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_\nCrooked Hillary will NEVER be able to solve the problems of poverty education and safety within the African American & Hispanic communities _E_\n\"@HoganSeaisle129: @realDonaldTrump who who who ... 
Say it just say it #CelebApprentice\" Watch and see what happens! _E_\nThank you to Carmen Yulin Cruz the Mayor of San Juan for your kind words on FEMA etc.We are working hard. Much food and water there/on way _E_\n#FoxNews Poll THANK YOU!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nTerrible tragedy at the Empire State Building today. Must have fast trials and death penalty for the animals. _E_\nHappy 30th Birthday #Ghostbusters! It was great to have @TrumpTowerNY be a part of the series. __HTTP__ _E_\nIf you're passionate about your work you will never give up. _E_\nOur online campaign store is open! Visit __HTTP__ for #MakeAmericaGreatAgain merchandise including my signature hat! _E_\nWith today's struggling job numbers it is clear that there is one choice this November. @MittRomney can turn the economy around. _E_\nSeth Meyers was terrible co hosting with Kelly. Marbles in his mouth & he must stop picking at his hands insult to the great Regis Philbin! _E_\nDrop A Rod in the order and cut his salary based on unreported drug use. Also not a pressure player. _E_\n.@limbaugh is right. Watergate is much different than Benghazi. No one died in Watergate. _E_\n\"If the freedom of speech is taken away then dumb and silent we may be led like sheep to the slaughter.\" George Washington _E_\nThe Ted Cruz wiseguy apology to the people of New York is a disgrace. Remember his wife's employer and his lender is located there! _E_\nWe are using the absolute wrong negotiating technique with respect to the Iran nuclear talks. Strengthen sanctions until GREAT deal is made! _E_\n\"Talent wins games but teamwork and intelligence wins championships.\" – Michael Jordan @Jumpman23 _E_\nAt least 7 dead and 48 wounded in terror attack and Mayor of London says there is no reason to be alarmed! _E_\nCongratulations to @joniernst on delivering a strong conservative message in her #SOTU response. Joni will be a great senator. _E_\nAmazing... @VattenfallGroup tried to destroy Aberdeen. 
_E_\nEntrepreneurs: Winners see problems as just another way to prove themselves. Remember to focus on the solution not the problem. _E_\nGreat meeting with the @RepublicanStudy Committee this morning at the @WhiteHouse! __HTTP__ _E_\nLyin' Ted Cruz can't get votes (I am millions ahead of him) so he has to get his delegates from the Republican bosses. It won't work! _E_\nThe United States needs the security of the Wall on the Southern Border which must be part of any DACA approval. The safety and security of our country is #1! __HTTP__ _E_\nJohn Lewis said about my inauguration It will be the first one that I've missed. WRONG (or lie)! He boycotted Bush 43 also because he... _E_\nFor those that constantly say that \"global warming\" is now \"climate change\"—they changed the name. The name global warming wasn't working _E_\nWhen I broadly proclaimed Mitt \"choked\" – and would do it again—everybody said yeah he's right. _E_\n'True blue collar billionaire Donald Trump shows Hillary Clinton is out of touch' __HTTP__ _E_\nRemember what I previously said Obama will someday attack Iran in order to show how tough he is. _E_\nThe failing @nytimes talks about anonymous sources and meetings that never happened. Their reporting is fiction. The media protects Hillary! _E_\nI will be interviewed by @TheBrodyFile on @CBNNews tonight at 11pm. Enjoy! _E_\nI have been a guest on The View many times when it was successful show. Now the show is dying for lack of ratings. Too bad! _E_\nI'm not going to be watching much NFL football anymore. Too time consuming too boring too many flags and too soft. Focus on other things! _E_\nI just had to fire someone he didn't have a clue he reminded me of Obama on Wednesday night. _E_\nPut the glamour beauty & mystery back in the Oscars and the ratings will zoom. Also & most importantly the Oscars need credibility. _E_\nNew York Yankees President Randy Levine: 'End of the Republican Party' If Donald Trump Not Nominated. 
__HTTP__ _E_\nThank you for your support last night Iowa! #VoteTrump #Trump2016 #IACaucus #FITN #IAPolitics __HTTP__ _E_\n#TrumpVine Where is the money @MacMiller? __HTTP__ _E_\nPrior to the end of the year I will be traveling to Israel. I am very much looking forward to it. _E_\nKirsten Powers: Anti Trump Operative was Aggressively Shopping Cruz Story via the Gateway Pundit: __HTTP__ _E_\nWow just in John Beale the top person in government on climate change (EPA) is a total fraud and just admitted it! What can they say now _E_\nJeb Bush just talked about my border proposal to build a fence. It's not a fence Jeb it's a WALL and there's a BIG difference! _E_\nIt appears that @THEGaryBusey is entranced with @MELANIATRUMP and rightly so! #CelebApprentice _E_\nOn June 1st. near Washington D.C. I will be opening the greatest championship golf course in the U.S. All holes front on the Potomac River _E_\n#ICYMI: John Podesta's Brother Pocketed $180000 from Putin's Uranium Company: __HTTP__ __HTTP__ _E_\nThe dying NY Daily News put out a false report about my kids not wanting me to criticize Obama...totally false! _E_\n.@NikWallenda #Skywire As much credit as he's been given he wasn't given enough credit for his incredible feat over Grand Canyon. _E_\nOur next President must stop China's Rip off of America. _E_\n.@NBCNews is bad but Saturday Night Live is the worst of NBC. Not funny cast is terrible always a complete hit job. Really bad television! _E_\nNew The Next Generation videos @donaldjtrumpjr __HTTP__ @ivankatrump __HTTP__ @erictrump __HTTP__ _E_\nVia @DMRegister by @newsmanone: \"'Moon cycle' can't defeat @ShawnJohnson on @ApprenticeNBC\" __HTTP__ _E_\nHappy Anniversary to my wonderful wife @MELANIATRUMP a truly great decision by me! __HTTP__ _E_\nVia @globegazette by @GGMaryP: \"Trump: We'll bring American dream back\" __HTTP__ _E_\nRT @realDonaldTrump: #USA #Japan __HTTP__ _E_\nJust returned to Bedminster NJ from Camp David. 
GREAT meeting on National Security the Border and the Military! #MAGA __HTTP__ _E_\nGet rich quick! Crooked Hillary Clinton's pay to play guide: __HTTP__ _E_\nOnly a fool would believe that the meeting between Bill Clinton and the U.S.A.G. was not arranged or that Crooked Hillary did not know. _E_\nA political commentator for @cnn which I no longer watch said Trump showed some weakness in the Repub Primaries. I set all time record! _E_\nLoved Dallas and the tremendous crowd last night. Will be back! _E_\nVia @CNNMoney: Donald Trump gets into crowdfunding __HTTP__ #FundAnything _E_\n.@megynkelly Will be on Fox now. Watch and enjoy! _E_\nDo you think @THEGaryBusey will be able to \"step up\" as PM? I know @lisarinna is hoping so. #CelebApprentice _E_\nNo surprise Assad is not destroying his chemical weapons. He never intended to in the first place. _E_\n....And it will get even better with Tax Cuts! __HTTP__ _E_\nTRUMP'S BIG ANNOUNCEMENT: HE'LL GIVE $5 MILLION TO CHARITY OF OBAMA'S CHOICE IF... __HTTP__ By @billyhallowell @theblaze _E_\nI will be interviewed by @jessebwatters on @oreillyfactor tonight at 8pm. Enjoy! _E_\nPutin is not feeling too nervous or scared. #DemDebate _E_\nThe media has not reported that the National Debt in my first month went down by $12 billion vs a $200 billion increase in Obama first mo. _E_\nWhen I left Conference Room for short meetings with Japan and other countries I asked Ivanka to hold seat. Very standard. Angela M agrees! _E_\nWill be joining @GovMikeHuckabee tonight at 8pmE on @TBN. Enjoy! __HTTP__ _E_\nToday's groundbreaking at the Old Post Office Building in D.C. was amazing. Great people great dedication. @usgsa __HTTP__ _E_\nAs I have said the Tea Party is alive and well and fighting hard for the USA. BIG WIN TODAY! _E_\nI will be doing the @TODAYshow with my wife Melania and the rest of my family in a major Town Hall. Hopefully it will be fun! Enjoy.7A.M. _E_\nThank you. 
__HTTP__ _E_\nWill be on Bill O'Reilly @oreillyfactor tonight at 8 PM. Enjoy! _E_\nI hope that Derek Jeter has such a fantastic year with @Yankees that he changes his mind about retiring. Great guy! _E_\nWith fellow inductees in front of the sold out crowd at MSG. #WWEHOF __HTTP__ _E_\nRT @RealEagleBites: @realDonaldTrump It is the height of hypocrisy. Obama and Clinton in effect gave nuclear weapons to North Korea by thei... _E_\nEntrepreneurs: Be prepared and be tough. Cover your bases! There are a lot of ups and downs but you can ride them out if you're prepared. _E_\n... and many others. Drop to your knees Sugar and say thank you Mr. Trump. _E_\n.@BretBaier I will be interviewed by Bret (on Fox) tonight at 6:00. Watch it will be good! _E_\n.@BreitbartNews is much smarter than sleepy eyes @chucktodd @nbc __HTTP__ Thanks to Steve Bannon & real reporters. _E_\nI wonder what the great generals like Patton the big M or Robert E. LEE would have thought about our stupid broadcasting of an attack? _E_\nJodi Arias jury is having a hard time with the death penalty judge just sent them back for further deliberatuon. _E_\nThe unbiased reporters and attendees said mine was the best and most well received speech at CPAC THANK YOU! _E_\nCongratulations to Mitt Romney. He was not only good he was absolutely fantastic tonight! _E_\nGovernment needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_\nOur Southern border is unsecure. I am the only one that can fix it nobody else has the guts to even talk about it. __HTTP__ _E_\nHe @FLGovScott handled the Zimmerman matter very well. I am glad to see there will be a trial. Justice. Now let's wait for a fair trial. _E_\nI do not understand how so many of my Jewish friends backed Obama in the last election. He is a TOTAL DISASTER FOR ISRAEL AND ALWAYS WILL BE _E_\nTrue! __HTTP__ _E_\nI had a great day campaigning in Connecticut. Looking for a big vote on Tuesday! 
_E_\nWrong! Under @BarackObama's watch @Israel is not being invited to NATO summit in Chicago this month __HTTP__ _E_\nMy @FoxNews interview with @gretawire discussing the 2012 GOP primary and ObamaCare's attack on the Catholic Church. __HTTP__ _E_\nMinorities Line Up Behind Donald Trump #Trump2016 __HTTP__ _E_\nDoes everyone see that the Democrats and President Obama are now because of me starting to deport people who are here illegally. Politics! _E_\nIllegal immigrant children non Mexicans surge across border at record rate __HTTP__ _E_\nThe stimulus is a net negative effect on the growth of GDP over 10 years as admitted by @BarackObama's own CBO __HTTP__ _E_\nVia @si_golf: \"Donald Trump Rory McIlroy and Matt Kuchar are guys to watch at @DoralResort\" __HTTP__ @CadillacChamp _E_\nDon't find fault. Find a remedy. Henry Ford _E_\nWhat a great night. Thank you South Carolina a special place with truly amazing people! LOVE _E_\nWill be back soon Virginia. We are going to MAKE AMERICA GREAT AGAIN! #TrumpPence16 __HTTP__ _E_\n#IACaucus #CaucusForTrump#iCaucused #iVoted __HTTP__ _E_\nThere is nothing @TrumpSoHo did not think about for the holidays @RobbReport dives in: __HTTP__ _E_\nWhy does@ Bill O'Reilly keep putting Karl Rove on his show a total waste of time. Rove spent $400 000 000 and didn't win a race pathetic! _E_\nRT @CPACnews: ACU Announces @realDonaldTrump will be a featured speaker at #CPAC2013! Get tickets today at __HTTP__ _E_\nI will be interviewed by @LouDobbs tonight on @FoxBusiness 7pm ET _E_\n.@JebBush is a sad case. A total embarrassment to both himself and his family he just announced he will continue to spend on Trump hit ads! _E_\nCan @pennjillette @lisarinna and @THEGaryBusey continue to co exist? Find out on this Sunday's Celebrity All Star @ApprenticeNBC. _E_\nObamaCare is an absolute disaster which will destroy 16% of the economy and ultimately more! 
_E_\nIt is actually hard to believe how naive (or dumb) the Failing @nytimes is when it comes to foreign policy...weak and ineffective! _E_\nWow @CNN Town Hall questions were given to Crooked Hillary Clinton in advance of big debates against Bernie Sanders. Hillary & CNN FRAUD! _E_\n#NeverForget __HTTP__ _E_\nI look forward to my press conference @TrumpTurnberry Scotland this Wednesday lots of great people attending. _E_\nHappy Easter to all have a great day! _E_\nPeople ask me every day to pose for pictures but the camera never works the first time they are never prepared or maybe just very nervous! _E_\nDon't worry THE UNITED STATES WILL BE GREAT AGAIN! _E_\nFines and penalties against Wells Fargo Bank for their bad acts against their customers and others will not be dropped as has incorrectly been reported but will be pursued and if anything substantially increased. I will cut Regs but make penalties severe when caught cheating! _E_\nCongratulations to the 7 @TrumpCollection properties who made @USNewsTravel's Best Hotels List: __HTTP__ _E_\nLife always presents new opportunities you would never expect. I hosted @WrestleMania & then I starred in one which sold most PPVs. _E_\nWatch @BarackObama admit Obamacare is a TAX __HTTP__ The GOP must continue to Disrupt Dismantle & Repeal! _E_\nThe U.S. under my administration is completely rebuilding its military and they're spending hundreds of billions of dollars to the newest and finest military equipment anywhere in the world being built right now. I want peace through strength! __HTTP__ _E_\nRomney Ryan Slam Obama Administration on China Currency Manipulation __HTTP__ via @ABC _E_\nJust learned that Jon @Ossoff who is running for Congress in Georgia doesn't even live in the district. Republicans get out and vote! _E_\nI settled the Trump University lawsuit for a small fraction of the potential award because as President I have to focus on our country. _E_\nAre Republicans suicidal? 
Now they want to push amnesty through Congress. Allowing Democrats into the country. _E_\nPresident Obama campaigned hard (and personally) in the very important swing states and lost.The voters wanted to MAKE AMERICA GREAT AGAIN! _E_\nRepublicans should just REPEAL failing ObamaCare now & work on a new Healthcare Plan that will start from a clean slate. Dems will join in! _E_\nIf Jon Stewart is so above it all & legit why did he change his name from Jonathan Leibowitz? He should be proud of his heritage! _E_\nI'll be on THe Willis Report @GerriWillisFBN today at 5 pm EST _E_\nVia @fitsnews by @TaylahhKane: Donald Trump's Refreshing Lack Of A Filter __HTTP__ _E_\nRemember Celebrity Apprentice tonight on CNBC at 9. Amazing episode watch Omarosa get the boot! _E_\nEntrepreneurs: Set your mind on winning and losing won't have a chance. See yourself as victorious! _E_\nIt's about time Italy recognized the innocence of @AmandaKnox great news! _E_\nJournalists shower Hillary Clinton with campaign cash __HTTP__ __HTTP__ _E_\nUnder @BarackObama 5 major banks now control 56% of economy from 43% in 2007 __HTTP__ Another catastrophe is brewing. _E_\nJust watched recap of #CrookedHillary's speech. Very short and lies. She is the only one fear mongering! _E_\nWOW! I just heard that the previously unknown singer Mac Miller has received over 67 million hits on his song Donald Trump. _E_\nVia @washingtonpost by @OConnellPostbiz:\"Bidding to stay at Trump's hotel for '17 inauguration?Pick the next POTUS. __HTTP__ _E_\nThank you @RepLouBarletta! __HTTP__ __HTTP__ _E_\nThis is the definition of ransom ⬇ __HTTP__ _E_\nMy representatives had a great meeting w/ the Hispanic Chamber of Commerce at the WH today. Look forward to tremendous growth & future mtgs! _E_\nVideo of my day at The Old Post Office soon to be the most fabulous hotel! 
__HTTP__ _E_\n\"When you're at a meeting monitor your behavior and work at being an observer – of yourself and of others.\" – Think Like a Billionaire _E_\nSome great quotes from the legendary and courageous Winston Churchill: Never never never give up. ... _E_\n#SuperBowl Vote for me and @CENTURY21 __HTTP__ _E_\nDangerous The USC ObamaCare ruling means the government can now tax you for inactivity. _E_\nSometimes by losing a battle you find a new way to win the war. Don't ever get down on yourself just keep fighting in the end you WIN! _E_\nYesterday was @BarackObama's favorite day of the year he collects our taxes to redistribute. _E_\nGoing to North Carolina to make keynote speech sold out crowd! _E_\nWow I hear @Morning_Joe has gone really hostile ever since I said I won't do or watch the show anymore.They misrepresent my positions! _E_\nWhile all agree the U. S. President has the complete power to pardon why think of that when only crime so far is LEAKS against us.FAKE NEWS _E_\nPhylis Schlafly: 'Marco Rubio Betrayed Us All' __HTTP__ _E_\nGetting rdy to leave for France @ the invitation of President Macron to celebrate & honor Bastille Day and 100yrs since U.S. entry into WWI. _E_\nSix hours left to #VoteTrump Connecticut! __HTTP__ _E_\nPatience is the greatest of all virtues. Cato _E_\nOne of the best moves I ever made was staying out of last decade's artificial real estate boom. But I used the downturn to my advantage. _E_\nTrue! __HTTP__ _E_\nI'll be interviewed by Greta Van Susteren @Gretawire tonight at 10 pm ET on Fox. _E_\nMust read @AmSpec article by Jeffrey Lord: \"The Ruling Class Liberty Medal\" __HTTP__ _E_\nEntrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_\nReuters just announced that Secret Service never spoke to me or my campaign. Made up story by @CNN is a hoax. Totally dishonest. _E_\nDonald E. Ballard on behalf of the people of the United States THANK YOU for your courageous service. 
YOU INSPIRE US ALL! #ALConv2017 __HTTP__ _E_\nThanks. __HTTP__ _E_\nNow the Chinese are hacking @nytimes __HTTP__ & Twitter __HTTP__ When will we hold these thieves accountable? _E_\n.@Richard_Meier a highly overrated architect has had many problems with buildings he has designed. _E_\nWith great patriots in Mason City who also want to bring the American Dream back! We can Make America Great Again! __HTTP__ _E_\nObama talks about what he is going to do why the hell didn't he just do it especially in the first 2 years when he had all votes necessary _E_\nThank you Las Vegas Review Journal! EDITORIAL: 'Donald Trump for president' __HTTP__ via @reviewjournal _E_\nPat Buchanan gave a fantastic interview this morning on @CNN way to go Pat way ahead of your time! _E_\nWow interview released by Wikileakes shows quid pro quo in Crooked Hillary e mail probe.Such a dishonest person & Paul Ryan does zilch! _E_\n.@BarbaraJWalters called my office to ask me to do election night coverage with her sadly I won't be able to do it. _E_\nThe US is always getting ripped off! China gets cheap oil from Iran and Iraq as US pays for Hormuz Patrols to (cont) __HTTP__ _E_\nWhy is @MittRomney the only guy who talks about getting tough with China and their currency manipulation? _E_\nEntrepreneurs: Keep your focus and keep your momentum. Believe in yourself if you don't no one else will either. _E_\nA spectacular lake front club w/ dramatic course designed by @SharkGregNorman @Trump_Charlotte is NC's top club __HTTP__ _E_\nThank you to @AmSpec & Jeffrey Lord for the lovely article \"Governor Trump? The conservative Nelson Rockefeller.\" __HTTP__ _E_\nCongratulations to Martin Kaymer for winning the 2014 #USOpen. #USGA Great playing from beginning to end! 
_E_\nMy @CNN interview with @wolfblitzercnn yesterday discussing by meeting with @MittRomney __HTTP__ _E_\n\"As you go through life you've got to see the valleys as well as the peaks.\" – Neil Young _E_\nThe SECRET meeting between Bill Clinton and the U.S.A.G. in back of closed plane was heightened with FBI shouting go away no pictures. _E_\nMarco Rubio is a member of the Gang Of Eight or very weak on stopping illegal immigration. Only changed when poll numbers crashed. _E_\nThis is a time for big ideas. This is a time for real reform for a real recovery. @PaulRyanVP _E_\nJust received a standing ovation at #NCGOPCon when I said We need to bring the American Dream back better and stronger than ever before! _E_\nThe Trump Organization continues to expand internationally at a record pace. Many new announcements to come soon. _E_\nThank you Louisville Kentucky. Together we will MAKE AMERICA SAFE AND GREAT AGAIN! __HTTP__ _E_\nOne of the big problems facing Atlantic City are the ridiculously high real estate taxes which I fought for years before leaving.Corruption! _E_\nRT @benshapiro: Pope on Trump: A person who thinks only about building walls...is not Christian. This is Vatican City. __HTTP__ _E_\n.@Mayor_Nutter of Philadelphia who is doing a terrible job should be ashamed for using such a disgusting word in referring to me.Low life! _E_\nI am supportive of Lamar as a person & also of the process but I can never support bailing out ins co's who have made a fortune w/ O'Care. _E_\nThe only Forbes 5 Star & 5 Diamond hotel with a 5 Star & 5 Diamond restaurant @TrumpNewYork offers elite luxury __HTTP__ _E_\nAge wrinkles the body. Quitting wrinkles the soul. General Douglas MacArthur _E_\nStopped by @TrumpDC to thank all of the tremendous men & women for their hard work! __HTTP__ _E_\nHillary just said that she will not use the term radical Islamic but was incapable of saying why. She is afraid of Obama & the e mails! 
_E_\nObama said he never met his uncle Oscar who was arrested for whatever. Turns out he lived with his uncle in Boston. SO MANY LIES! _E_\nWatch a powerful and frank interview with Donald Trump about the economy on Greta Van Susteren's On The Record: __HTTP__ _E_\nAmerica is proud to stand shoulder to shoulder w/a free & ind UK. We stand together as friends as allies & as a people w/a shared history. _E_\nThe invention of email has proven to be a very bad thing for Crooked Hillary in that it has proven her to be both incompetent and a liar! _E_\nAt the debate the President kept talking of what he is going to do. I kept saying why didn't he do it? He lost me a long time ago. _E_\nI didn't start the fight with Lyin'Ted Cruz over the GQ cover pic of Melania he did. He knew the PAC was putting it out hence Lyin' Ted! _E_\nPres. O a bump in the road in reference to our Ambassador's (and others) killing in Libya _E_\n'Trump lays out policies for first 100 days in White House' __HTTP__ _E_\nToday both @BarackObama and @MittRomney are giving speeches on their economic policies in Ohio. The choice is (cont) __HTTP__ _E_\n...owed to Wall Street and the banks which sadly must be dealt with. Food water and medical are top priorities and doing well. #FEMA _E_\nWhy did the failing @nytimes refuse to use any of the names given to them that I was so proud to have helped with their careers. DISHONEST! _E_\nI love @LibertyUniversity such great people! _E_\n...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong).To bad the Dems have no one who can change tones! _E_\n'U.S. Consumer Comfort Just Reached Its Highest Level in a Decade' __HTTP__ __HTTP__ _E_\nTrump Nat'l Jupiter's @jacknicklaus designed course is a challenging & innovative 7531 yds w/special features __HTTP__ _E_\nDoes anybody think that @CNBC will get their fictitious polling numbers corrected sometime prior to the start of the debate. Sad! 
_E_\nHappy to have @ralphreed and the FFC's endorsement of the Newsmax @iontv debate. FFC is a great organization. _E_\nIf the Senate Democrats ever got the chance they would switch to a 51 majority vote in first minute. They are laughing at R's. MAKE CHANGE! _E_\nI will bring jobs back and get wages up. People haven't had a real wage increase in almost twenty years. Clinton killed jobs! _E_\nWill be interviewed by @StephenAtHome tonight by phone a late show first @CBS @colbertlateshow. Enjoy! #Colbert #LSSC _E_\nHow dumb is our president to send thousands of poorly trained and ill equipped soldiers over to West Africa to fight Ebola. Stop all flights _E_\nThank you Attorney General Gonzales so many people feel this way. __HTTP__ _E_\nWhile I'm beating my opponents in the polls I'm also beating lobbyists special interests & donors that are supporting them with billions. _E_\nI told you whenever I go to a @Yankees game the @Yankees win. _E_\nMust read article: \"Conservative Fury at Rove Erupts\" __HTTP__ By Jeffrey Lord @AmSpec _E_\nGov. Gary Johnson pulling votes from @MittRomney Don't waste your vote. Obama must go! _E_\n\"We have a system that increasingly taxes work and subsidizes nonwork.\" Milton Friedman _E_\nVia @BreitbartNews: \"EXCLUSIVE TRUMP COUNSEL 'CANNOT CONFIRM OR DENY' INTEREST IN BUYING NEW YORK TIMES\" __HTTP__ _E_\n62 years ago this week a brave seamstress in Montgomery Alabama uttered one word that changed history... __HTTP__ _E_\nMy friend Ronald Kessler explains in @washingtonpost that Secret Service problems are much bigger than prostitutes __HTTP__ _E_\nBig poll comes out today on Face The Nation at 10:30 on @CBSNews. _E_\n26000 sexual assaults in the military last year way up from previous years. Armed Forces are in total turmoil! _E_\nJust landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! _E_\nKeep up the GREAT work. I am with you 100%! ISIS is losing its grip... 
Army Colonel Ryan DillonCJTF–OIR __HTTP__ __HTTP__ _E_\nPresident Obama I have an idea! Pretend that West Africa is Israel and then you will be able to stop the Ebola area flights. _E_\nDon't tread water. Get out there and go for it. There's nothing wrong with bringing your talents to the surface. _E_\nMy @KWWL int. from @WartburgCollege discussing how politicians have failed us & Making America Great Again! __HTTP__ _E_\n'Hillary Clinton Had Gun Control Supporters Planted In Town Hall Audience' __HTTP__ _E_\nBernie Sanders gave Hillary the Dem nomination when he gave up on the e mails. That issue has only gotten bigger! _E_\nWith Spitzer & Anthony Weiner running for office New York is pervert central! Pathetic _E_\nJoin me tomorrow in Sanford or Tallahassee Florida!Sanford at 3pm: __HTTP__ at 6pm: __HTTP__ _E_\nJoin me tonight in Fayetteville North Carolina at 7pm! #ThankYouTour2016 Tickets: __HTTP__ __HTTP__ _E_\nMy warmest condolences and sympathies to the victims and families of the terrible Las Vegas shooting. God bless you! _E_\nEd Gillespie will be a great Governor of Virginia. His opponent doesn't even show up to meetings/work and will be VERY weak on crime! _E_\nMy @foxandfriends interview duscussing my meeting with @newtgingrich the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_\nUnemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Market ever up $5.4 trill _E_\nRemember when you hear the words sources say from the Fake Media often times those sources are made up and do not exist. _E_\nAlso tomorrow night I will be going to Boone and Ames. Really look forward to seeing all of my friends in Iowa. _E_\nRT @EricTrump: .@LaraLeaTrump and I look forward to being on @JudgeJeanine tonight at 9pm! @FoxNews #MakeAmericaGreatAgain __HTTP__ _E_\n\"Actions are the seed of fate deeds grow into destiny.\" – Harry S. 
Truman _E_\nA message for Hollywood __HTTP__ _E_\nA friend is one who has the same enemies as you have. Pres. Abraham Lincoln _E_\nOn Taxes: \"This is the biggest corporate rate cut ever going back to the corporate income tax rate of roughly 80 years ago.This is a huge pro growth stimulus for the economy. Every year the Obama WH overstated how the economy would grow. Now real economics and jobs.\" @WSJ Report _E_\nAnother radical Islamic attack this time in Pakistan targeting Christian women & children. At least 67 dead400 injured. I alone can solve _E_\nVia @SteveKingIA's Steve King for Congress Facebook Page: \"Donald Trump has a special announcement!\" __HTTP__ _E_\nIf you are interested in balancing work and pleasure you will never succeed! _E_\nWhy would college graduates want Crooked Hillary as their President? She will destroy them! __HTTP__ _E_\nBrits spent $57.8M on the royal family. Obamas cost us $1.4B in expenses including entertainment __HTTP__ Living large on us. _E_\nReally looking forward to watching The Masters this weekend one of THE GREATEST SHOWS ON EARTH! _E_\nCrooked Hillary refuses to say that she will be raising taxes beyond belief! She will be a disaster for jobs and the economy! _E_\nI was viciously attacked by Mr. Khan at the Democratic Convention. Am I not allowed to respond? Hillary voted for the Iraq war not me! _E_\nBaltimore had a really tough night only great leadership can solve the many inner city problems facing our country. Jobs jobs jobs! _E_\nCrooked Hillary Clinton overregulates overtaxes and doesn't care about jobs. Most importantly she suffers from plain old bad judgement! _E_\nIt's Friday. How many millions has the White House wasted on the ObamaCare website today? _E_\nWatch me on Late Night with Jimmy Fallon tomorrow night at 12:35 a.m. on NBC I'll be making a big announcement! _E_\nVia @USATODAY Amateur hour with the Iran nuclear deal __HTTP__ _E_\n#WeeklyAddress __HTTP__ _E_\nI'm very proud of my daughter Ivanka. 
Great interview. __HTTP__ _E_\nIsis terror group has now fully taken over large sections of Iraq and will soon have control of massive oil reserves. I told you so. _E_\nDon't underestimate yourself or your possibilities keep your focus intact and focus on the positives. _E_\nNot looking good for our great Military or Safety & Security on the very dangerous Southern Border. Dems want a Shutdown in order to help diminish the great success of the Tax Cuts and what they are doing for our booming economy. _E_\nWe are being embarrassed by Russia and China on Snowden (and much more) yet Obama is talking about global warming on Tuesday. _E_\nGlobal warming is based on faulty science and manipulated data which is proven by the emails that were leaked __HTTP__ _E_\nWhat I am saying is that we never should have been in Iraq in the first place. Bush was terrible Obama is worse! Make America GREAT again. _E_\nVisited some very beautiful golf courses this weekend...this is one... __HTTP__ _E_\nSteve Bannon will be a tough and smart new voice at @BreitbartNews...maybe even better than ever before. Fake News needs the competition! _E_\n\"All Star Celebrity Apprentice\" ranked #1 for the 10 o'clock hour among ABC CBS and NBC with a season high 19% margin. _E_\nThank you @kayleighmcenany for your nice words great knowledge and style! We are doing really well in South Carolina. @CNN @donlemon _E_\nJust got to the #USWomensOpen in Bedminster New Jersey. People are really happy with record high stock market up over 17% since election! _E_\nMade in America? @BarackObama argues that his long form birth certificate is irrelevant in court. __HTTP__ _E_\nEven if you're on the right track you'll get run over if you just sit there. Will Rogers _E_\n#CelebrityApprentice Who will win? __HTTP__ Find out tonight live Season Finale at 9PM ET on NBC. _E_\nCongratulations to @TrumpCollection's @TrumpPanama for receiving the Certificate of Excellence & Top 10 Hotels in Panama from @TripAdvisor! 
_E_\nGas prices are the lowest in the U.S. in over ten years! I would like to see them go even lower. _E_\nVia @kcautv: \"Donald Trump Coming to Sioux City in May\" __HTTP__ _E_\nRemarks from the Roosevelt Room with @SenateMajLdr Mitch McConnell @SpeakerRyan and Secretary of Defense General James Mattis. __HTTP__ _E_\nLooking forward to being with @SenTedCruz at our big rally in D.C. on Wednesday (1:00 P.M. at the Capitol) to protest insane Iran nuke deal! _E_\n\"The worst thing you can possibly do in a deal is seem desperate to make it.\" – The Art of The Deal. _E_\nI'm sick of always reading about outsourcing. Why aren't we talking about onshoring ? We need to bring manufa... (cont) __HTTP__ _E_\n#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_\nWH refused a meeting with the Israeli Defense Minister. If only Obama hated Iran as much as he dislikes Israel. _E_\nThe only American who has met with the North Korean man child is Dennis Rodman. Isn't that frightening and sad? _E_\n#ObamacareFail __HTTP__ _E_\nRT @TeamTrump: A @realDonaldTrump Administration will bring JOBS BACK! #Debates2016 __HTTP__ _E_\nSomeone just wrote that \"you predicted every single major event that's now happening—and they knock you instead of giving you credit.\" _E_\nThe only one to fix the infrastructure of our country is me roads airports bridges. I know how to build pols only know how to talk! _E_\nRussia must be laughing up their sleeves watching as the U.S. tears itself apart over a Democrat EXCUSE for losing the election. _E_\nRepresentative Devin Nunes a man of tremendous courage and grit may someday be recognized as a Great American Hero for what he has exposed and what he has had to endure! _E_\n.@BarackObama sent over 100000 jobs and Canadian oil to China all because he would not approve Keystone XL. _E_\nInteresting reading re September 11th __HTTP__ _E_\nRead Ivanka's blog about last night's Apprentice on Entertainment Weekly ... 
__HTTP__ _E_\nThis is the summer of box office bombs. Who is green lighting this garbage? The scripts are terrible. _E_\nThank you Greta. __HTTP__ _E_\n08 09 2011 19:33:31 _E_\nTrump shows complete domination of Facebook conversation __HTTP__ _E_\nFree enterprise is still the greatest force for upward mobility economic security and the expansion of the middle class. @MittRomney _E_\nObamaCare Tragedy Primed to Further Explode the Deficit __HTTP__ And @Obama transferred $500 billion from Medicare to fund it! _E_\nA great honor to welcome President Juan Manuel Santos of Colombia to the White House today! Joint Press Conf... __HTTP__ _E_\nMy sons Don and Eric are in Ireland looking at my new club. It will be phenomenal! @LodgeatDoonbeg _E_\nTogether we are MAKING AMERICA GREAT AGAIN! __HTTP__ _E_\nDennis—Thank you for being honest. Somebody put words in your mouth & you wouldn't take it. Great! @dennisrodman _E_\n.@JebBush today said he didn't want to be the front runner he would rather be where he is now 2%. That is the talk of a loser can't win! _E_\nOn @FoxNews at 7:00 P.M. Special: Meet the Trumps Hope you enjoy! _E_\nTony Romo just made a great play Giants are getting killed! _E_\nIraq was one of our biggest mistakes. We got absolutely nothing for our sacrifices.The country will collapse (cont) __HTTP__ _E_\nHillary Clinton just had her 47% moment. What a terrible thing she said about so many great Americans! _E_\nVia @usweekly: \"Donald Trump Sounds Off on Joan Rivers' Death: 'I Think The Doctors Made a Terrible Mistake'\" __HTTP__ _E_\n.@davidaxelrod use Buffet Icahn Sam Zell Leon Black Kravis Caesars and many more when talking about using the bankruptcy laws not me! _E_\nVia @ WSOC_TV: \"Blair Miller talks with Donald Trump about Charlotte ventures\" __HTTP__ _E_\nDerek must move back into one of my buildings immediately. It will be lucky for him like in past. 
_E_\nI'm in Scotland getting ready for a major news conference on the Great Dunes of Scotland announcing the second North Sea course amazing! _E_\nGet ready for @Oreillyfactor tonight at 8 always interesting! _E_\nAmazing. People are sending letters of support for @TrumpChicago's sign to my other properties including even @TrumpScotland. Thank you! _E_\nGet your ballots in Colorado I will see you soon and we will win!#MakeAmericaGreatAgain __HTTP__ _E_\nAfter 200 days rarely has any Administration achieved what we have achieved..not even close! Don't believe the Fake News Suppression Polls! _E_\nAny deal completed before the fiscal curb must have tangible cuts on expenditures in baseline spending so we can get our credit back. _E_\nChina has so much of our debt that they can't put us in default w/o killing themselves US needs our toughest negotiator and fast! _E_\nMelania and I will be appearing on The View tomorrow at 11 a.m. on CBS. Tune in for some great fun! _E_\nRT @IvankaTrump: Very proud of Arabella and Joseph for their performance in honor of President Xi Jinping and Madame Peng Liyuan's official... _E_\nEven with lower profit projections American firms are still throwing money into China __HTTP__ Obama is killing investment. _E_\nGreat meeting w/ coal miners & leaders from the Virginia coal industry thank you! #MAGA __HTTP__ __HTTP__ _E_\nLook when it comes to China America better stop messing around. China sees us as a naive gullible foolish (cont) __HTTP__ _E_\nWith long gas lines & total disarray from storm the hurricane may yet be a negative for Obama. _E_\nAs a favor to my friends at EXTRA I am co hosting tonight at 7 p.m. on @nbc _E_\nThe American dream is back. We're going to create an environment for small business like we haven't had in many ma... __HTTP__ _E_\nThanks to @SteveKingIA for the kind introduction at the IA Freedom Summit & congrats to @David_Bossie & @Citizens_United on a great success! 
_E_\nBe sure to watch \"The History of WrestleMania\" on @netflix. My interview explains how I supported the event early on. I'm proud of it. _E_\nThe failing @NYDailyNews which just raised its prices because it's dying said I wear a \"wig\" when they know I don't. Dishonest. _E_\nEven though I have the legal right to use Steven Tyler's song he asked me not to. Have better one to take its place! _E_\nThx Mark I appreciate your words about the school. You sound like you're doing well happy for you. @businessinsider __HTTP__ _E_\nThe Establishment and special interests are absolutely killing our country. Stop them now: __HTTP__ _E_\nThank you for joining me this afternoon New Hampshire! Will be back soon. #FollowTheMoneySpeech transcript:... __HTTP__ _E_\nThank you New Orleans Louisiana!#MakeAmericaGreatAgain #VoteTrump __HTTP__ __HTTP__ _E_\nWeak and low energy @JebBush whose campaign is a disaster is now doing ads against me where he tries to look like a tough guy. _E_\nIf you like having the world collapse and being told America is leading from behind vote Obama. _E_\nI will be on @foxandfriends at 7:00 A.M. So much to talk about but not much good news for the U.S.A. MAKE AMERICA GREAT AGAIN! _E_\nI told Rex Tillerson our wonderful Secretary of State that he is wasting his time trying to negotiate with Little Rocket Man... _E_\nThe ring announcers are working hard to justify the Mayweather victory. They should be ashamed of themselves! A TOTAL JOKE. _E_\nMany of the Syrian rebels are radical jihadi Islamists who are murdering Christians. Why would we ever fight with them? _E_\nJust arrived at Camp David where I am closely watching the path and doings of Hurricane Harvey as it strengthens to a Category 3. BE SAFE! _E_\nIt is amazing how rude much of the media is to my very hard working representatives. Be nice you will do much better! _E_\nAmerica's relationship with China is at a crossroads. 
We only have a short window of time to make the tough (cont) __HTTP__ _E_\nDevelop your gut instincts and act on them. You will have your biggest successes when you go with your gut but be very smart & careful. _E_\nI will be interviewed by @oreillyfactor tonight on @FoxNews at 8pm. Enjoy! _E_\nMy interview with @IngrahamAngle discussing the real unemployment number and how the 7.8% number is a fraud __HTTP__ _E_\nNegotiations on DACA have begun. Republicans want to make a deal and Democrats say they want to make a deal. Wouldn't it be great if we could finally after so many years solve the DACA puzzle. This will be our last chance there will never be another opportunity! March 5th. _E_\nCongratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Dec! _E_\n#AmericaFirst #RNCinCLE __HTTP__ _E_\n.@AScottPGA Really solid playing keep going! _E_\nI always enjoy watching young entrepreneurs enter the business world. I can tell who reads my books and who doesn't. #MidasTouch _E_\nTHe 2012 election is the most important in my lifetime. We must nominate a candidate who will win and will roll back @BarackObama's damage. _E_\nSo many false and phony T.V. commercials being broadcast in Indiana. Reminds me of Florida where thousands were put up I won in a landslide! _E_\nMitt's got it right: @RickSantorum's attacks on @MittRomney's pro growth tax cut proposal are foolish. _E_\nI am the only one who can beat Hillary Clinton. I am not a Mitt Romney who doesn't know how to win. Hillary wants no part of Trump _E_\n\"The world is changing very fast. Big will not beat small anymore. It will be the fast beating the slow.\" @rupertmurdoch _E_\nSuch a great honor! __HTTP__ _E_\nWhy should we have any defense cuts in any deal? America must remain strong. _E_\nRT @seanhannity: HRC mishandles and destroys classified info NO PROBLEM! Pay/play on Uranium one NO PROBLEM! Lynch BC tarmac: it's a matte... _E_\nWow! 
What a great honor from @DRUDGE_REPORT __HTTP__ _E_\nCongratulations to our new #VASecretary Dr. David Shulkin. Time to take care of Veterans who have fought to protect... __HTTP__ _E_\nVia @BreitbartNews TRUMP WINS NASHVILLE GRASSROOTS STRAW POLL WITH 52 PERCENT __HTTP__ _E_\nGreat thanks. __HTTP__ _E_\n...If you plan for the worst—if you can live with the worst—the good will always take care of itself. _E_\nLooking forward to speaking @acnnews International Convention tomorrow morning in Charlotte NC __HTTP__ _E_\nEnjoy the #SuperBowl and then we continue: MAKE AMERICA GREAT AGAIN! _E_\nYou can listen to my interview today with Jay Sekulow Live and the @JordanSekulow show here __HTTP__ @12PM EST. _E_\nThank you Ohio! VOTE so we can replace Obamacare and save healthcare for every family in the United States! Watch:... __HTTP__ _E_\nI let @pennjillette come back on the record 13th season of 'All Star' @CelebApprentice after he relentlessly begged me to good t.v. _E_\nWhen a country is no longer able to say who can and who cannot come in & out especially for reasons of safety &.security big trouble! _E_\n#MakeAmericaGreatAgain! __HTTP__ _E_\nMy @TMZ interview with @HarveyLevinTMZ discussing how I will see my $5M lawsuit against @billmaher to the end __HTTP__ _E_\n#MakeAmericaGreatAgain! __HTTP__ _E_\nWe should tell China that we don't want the drone they stole back. let them keep it! _E_\nThe President of the U.S. is the leader of the Free World. He should dress like it at all times. Wear a suit and a tie for major interviews. _E_\nI wonder how much our leaders have promised or given Russia in order for them to behave and not make the U.S. look even worse? _E_\nMagician extraordinaire @pennjillette is back in the All Star @ApprenticeNBC. This time he has even more tricks up his sleeve. _E_\nAmericans who can afford to buy enough food is now at a 3 year low. Is this @BarackObama's 'recovery'? 
__HTTP__ _E_\nEntrepreneurs: Realize that success requires 100% effort and 100% focus. Nothing less. _E_\nI fully support the @NYPD @MayorBloomberg and @CommissionerKelly. They should all be honored for protecting us since 9/11 not demonized. _E_\nThe bus driver who saved the woman from jumping off the bridge was really cool great guy. I'm going to send him $10 000 he deserves it! _E_\nHillary Clinton failed all over the world. LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_\nWow Hillary Clinton was SO INSULTING to my supporters millions of amazing hard working people. I think it will cost her at the Polls! _E_\nI will be in California this weekend making a speech for Clint Eastwood. Then to Arizona and Vegas. Big crowds. Discussing illegals & more! _E_\n\"There are no environments where you're only going to win because life just isn't like that.\" Bobby Orr _E_\nIt is my great honor to send $25000 to Sgt. Andrew Tahmooressi. #marinefreed _E_\nDespite major outside money FAKE media support and eleven Republican candidates BIG R win with runoff in Georgia. Glad to be of help! _E_\nSix months in it is the hope of GROWTH📈that is making AmericaFOUR TRILLION DOLLARS RICHER. Stuart @VarneyCo __HTTP__ __HTTP__ _E_\nAnother solar company @BarackObama funded with our money has filed for bankruptcy __HTTP__ One (cont) __HTTP__ _E_\nMy @gretawire interview re: how the debt ceiling is key point the fiscal curb & why we must & can make a great deal. __HTTP__ _E_\nThe biggest winner of Obama's '08 win Vladimir Putin. Ultimately he could be tied with Iran after Tehran becomes a nuclear power. _E_\nOnce Iran has nuclear weapons they will shut down the Strait of Hormuz. Oil will be over $300/Barrel. Iran'... (cont) __HTTP__ _E_\nPeople of Ohio are fantastic. Thank you so much. What an evening! __HTTP__ _E_\n2 million more people just dropped out of ObamaCare. It is in a death spiral. Obstructionist Democrats gave up have no answer = resist! 
_E_\nVia @NewsInTheBurg: \"@chefjoseandres to open restaurant in Trump Int'l Washington D.C.\" __HTTP__ _E_\nMerry Christmas & Happy Holidays!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nOnce again under@BarackObama the US has fallen down the ranks of global competitiveness __HTTP__ We must do better. _E_\nNo American should be separated from their loved ones because of preventable crime committed by those illegally in our country. Our cities should be Sanctuaries for Americans – not for criminal aliens! __HTTP__ _E_\nMy thoughts on Gadhafi's death @BarackObama and the misery index... __HTTP__ #trumpvlog _E_\nBig meetings today at the United Nations. So many interesting leaders. America First will MAKE AMERICA GREAT AGAIN! _E_\nMariano Rivera is one of top @Yankees of all time. Greatest closer of all time. A true warrior. Last night's MVP award well deserved. _E_\nCongratulations to my children Don and Tiffany on having done a fantastic job last night. I am very proud of you! _E_\n#CNNDebate __HTTP__ _E_\nLook Snowden is bad done tremendous damage to our country and standing but we have far worse in our government (guess who?). _E_\nThank you to all of the men and women who protect & serve our communities 24/7/365! #LawEnforcementAppreciationDay... __HTTP__ _E_\nWhy does HI Revised Statute 338 17.8 allow an HI resident who doesn't have to be US citizen to procure an official Hawaii birth certificate? _E_\nTrump: Rove 'Made a Fool Out of Himself' __HTTP__ via @cnsnews _E_\nHeading to New Hampshire. Will be talking about the disaster known as ObamaCare! _E_\nRussia beat the United States in the Olympics another Obama embarrassment! Isn't it time that we turn things around and start kicking ass? _E_\nKim Jong Un of North Korea who is obviously a madman who doesn't mind starving or killing his people will be tested like never before! _E_\nThe racial divide in our country is almost at an all time high and getting worse every time you turn on the television. 
_E_\nMake the Boston killer talk before our doctors make him better. Once he is well he will say speak to my lawyers. _E_\nLast night's @extratv 's interview by @MarioLopezExtra of gorgeous 2012 @MissUniverse @oliviaculpo __HTTP__ Great job! _E_\nOnce again someone we were told is ok turns out to be a terrorist who wants to destroy our country & its people how did he get thru system? _E_\nI just left the Doral in Miami it is going to be amazing! __HTTP__ _E_\nThe Clinton's are the real predators... __HTTP__ _E_\nFor all of today's voters please remember that I am the only candidate that is self funding my campaign I am not bought and paid for! _E_\n.@MarkHalperin showed a focus group on @Morning_Joe me using a very bad word. I never said the word left an open blank. Please apologize! _E_\nWatching @loudobbsnews fantastic show! Has very interesting take on Paul Ryan. _E_\nI am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ _E_\nWhen I intelligently turned down The Club For Growth crazy request for $1000000 they got nasty.What a waste of money that would have been _E_\nMy joint @seanhannity int. on @FoxNews with @GeraldoRivera recapping @ApprenticeNBC & discussing the 2016 election __HTTP__ _E_\nDine With The Donald and Mitt: __HTTP__ _E_\nWill be participating in a Town Hall tonight on @SeanHannity at 10pmE from Austin Texas. Enjoy! __HTTP__ _E_\nWeiner says many more pictures may be out there—this is just what NYC needs a pervert Mayor. _E_\nWow two candidates called last night and said they want to go to my event tonight at Drake University. _E_\nLet's be honest if Obama thought he could get away with campaigning during the storm then he would have been in Ohio on Monday. _E_\n38 stories high @TrumpWaikiki's 462 luxury guest rooms & suites offer exceptional services __HTTP__ _E_\n.@VanityFair's 2013 dwindling sales continue to sink at an even faster record rate under Graydon Carter __HTTP__ Disaster! 
_E_\nWall Street paid for ad is a fraud just like Crooked Hillary! Their main line had nothing to do with women and they knew it. Apologize? _E_\nI'm getting ready to be inducted tonight into the WWE Hall of Fame at Madison Square Garden a great honor for me and the Trump family! _E_\nAs a presidential candidate I have instructed my long time doctor to issue within two weeks a full medical report it will show perfection _E_\nTRUMP: GOP MUST DUMP 'USELESS' ROVE TO WIN PRESIDENTIAL ELECTIONS __HTTP__ by @mboyle1 @BreitbartNews _E_\nAnother new post debate poll. THANK YOU! #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nPaul Ryan is far from my first choice but a very nice guy. The Republicans should go for tough and (very) smart this time no games! _E_\nRT @markets: What Is Trump worth to Twitter? One analyst estimates $2 billion __HTTP__ __HTTP__ _E_\nSue them Tom! #TrumpVlog __HTTP__ _E_\nToday is a big day for us and for Toronto: Trump International Hotel & Tower Toronto opens today. (cont) __HTTP__ _E_\nFor all those sick degenerates contemplating a knockout attack please remember the late great Charles Bronson no more crime! _E_\nI will be interviewed by Anderson Cooper at 8pm on @CNN from New Hampshire. Should be very interesting! _E_\nThe Republicans never discuss how good their healthcare bill is & it will get even better at lunchtime.The Dems scream death as OCare dies! _E_\nJust leaving Florida. Big crowds of enthusiastic supporters lining the road that the FAKE NEWS media refuses to mention. Very dishonest! _E_\nAs our Country rapidly grows stronger and smarter I want to wish all of my friends supporters enemies haters and even the very dishonest Fake News Media a Happy and Healthy New Year. 2018 will be a great year for America! _E_\nHillary Clinton conceded the election when she called me just prior to the victory speech and after the results were in. Nothing will change _E_\nToo many people rely on auto correct...an assistant of mine apologizes! 
_E_\nHAPPY 4TH OF JULY TO EVERYONE! MAKE AMERICA GREAT AGAIN! _E_\nI will implement effective missile defenses to protect against threats. On this there will be no flexibility with Vladimir Putin. Mitt _E_\nHappy Birthday President Reagan #FlashbackFriday __HTTP__ _E_\nBuy American & hire American are the principles at the core of my agenda which is: JOBS JOBS JOBS! Thank you @exxonmobil. _E_\nThe reality is that no gun bill will ever stop tragedies. And as we have learned from ObamaCare Washington only makes things worse! _E_\nI WILL DEFEAT ISIS. THEY HAVE BEEN AROUND TOO LONG! What has our leadership been doing?#DrainTheSwamp __HTTP__ _E_\nCan you imagine the Boston killer being lovingly tended to in a hospital room right next to his victims who lost their arms legs and worse! _E_\nThe single greatest Witch Hunt in American history continues. There was no collusion everybody including the Dems knows there was no collusion & yet on and on it goes. Russia & the world is laughing at the stupidity they are witnessing. Republicans should finally take control! _E_\nA detainee released from Gitmo has killed an American. When will our so called leaders ever learn! _E_\nI don't want to be the only billionaire in America I want all Americans to be rich. _E_\nFrank \"FX\" Giaccio On behalf of @FLOTUS Melania & myself THANK YOU for doing a GREAT job this morning! @NatlParkService gives you an A+! __HTTP__ _E_\nTalking with @SammartinoBruno backstage __HTTP__ #WWEHOF _E_\n.@realDonaldTrump will PROTECT and DEFEND the Constitution #Debate #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nMy son @EricTrump is in Memphis at St. Jude Children's Research Hospital... _E_\nFlashback – Jeb Bush received a $4M tax payer bailout in 1990 __HTTP__ Guess who was POTUS then? _E_\n\"Donald Trump to headline SC Tea Party Convention\" __HTTP__ via @wyffnews4 _E_\nEl Chapo comes to the U.S. often thru our border—it's been revealed he has CA drivers license. 
__HTTP__ _E_\nMembers from Obama's own job council are endorsing @MittRomney __HTTP__ Not surprising _E_\nWatch Coach Mike Ditka a great guy and supporter tonight at 8pmE on #WattersWorld with @jessebwatters @FoxNews. _E_\nIt never ends! __HTTP__ _E_\nIt was only after I informed NBC that I wouldn't do the Apprentice that they became upset w/ me. They couldn't care less about \"inclusion _E_\nMalfeasance at Fannie Mae and Freddie Mac helped cause our current financial meltdown. _E_\nDoes anybody remember when Bill Clinton in 2008 worked long and hard for Hillary? She LOST! Now Bill is at it again. Just watch. _E_\nThe greatest commodity to own is land. It is finite. God is not making any more of it. _E_\nGlad to hear my @foxandfriends' Monday interview continues to get big ratings. Great way to start your week _E_\nI will be interviewed on @FaceTheNation this morning at 10:00 A.M. Have a great day! _E_\n.@KarlRove is a failed Jeb Bushy. Never says anything good & never will even after I beat Hillary. Shouldn't be on the air! _E_\nLittle Marco Rubio treated America's ICE officers like absolute trash in order to pass Obama's amnesty. __HTTP__ _E_\nThe Iranians are sure happy with Obama's nomination of Hagel. Already praising Hagel as 'Anti Israel' __HTTP__ _E_\nHow long will it take for chants bring back the replacement refs when a bad call is made? _E_\nWhen will @davidaxelrod realize he is on a fool's errand trying to defend @BarackObama's ineptitude? _E_\nHypocrite! @HillaryClinton claims she needs a \"public and a private stance\" in discussions with Wall Street banks. #Debate _E_\n\"The road to Easy Street goes through the sewer.\" – John Madden _E_\nHow can Hillary run the economy when she can't even send emails without putting entire nation at risk? _E_\nSo what will happen to the Big O on Celebrity Apprentice tonight. Remember I only fire people when it is deserved not for other reasons! _E_\nWe will fight the #FakeNews with you! 
__HTTP__ _E_\nDiligence is the mother of good luck. Benjamin Franklin _E_\nMeeting with \"Chuck and Nancy\" today about keeping government open and working. Problem is they want illegal immigrants flooding into our Country unchecked are weak on Crime and want to substantially RAISE Taxes. I don't see a deal! _E_\n'Presidential Executive Order on the Establishment of Presidential Advisory Commission on Election Integrity'... __HTTP__ _E_\nIf ObamaCare should not be repealed then why has Obama & Congress exempted their staffs? _E_\nWe ALL must be united & condemn all that hate stands for. There is no place for this kind of violence in America. Lets come together as one! _E_\nI wonder why @BarackObama is not going to the NAACP Convention. Is it because he can't answer questions about 14.7% Black unemployment? _E_\n.@FoxNews is much better and far more truthful than @CNN which is all negative. Guests are stacked for Crooked Hillary! I don't watch. _E_\nJust had a great legal victory in Ft. Lauderdale won trial now will receive tremendous $ in legal fees from losers. Love it! _E_\nWow just announced that Lyin' Ted and Kasich are going to collude in order to keep me from getting the Republican nomination. DESPERATION! _E_\nVia @NOLAnews by @DaveWalkerTV: Donald Trump praises @Joan_Rivers as 'strong' 'vibrant' in @ApprenticeNBC return __HTTP__ _E_\nNational Review is a failing publication that has lost it's way. It's circulation is way down w its influence being at an all time low. Sad! _E_\nI will be interviewed on @megynkelly's The Kelly File tonight. Be sure to watch on @FoxNews! _E_\n\"Donald Trump to visit metro Detroit in May\" __HTTP__ via @wxyzdetroit _E_\nI am continuing to get rid of costly and unnecessary regulations. Much work left to do but effect will be great! Business & jobs will grow. _E_\nAfter watching all about the horror story that is A Rod I realized again that it is time to let Pete Rose into the Baseball Hall of Fame! 
_E_\nI will be doing Fox and Friends at 7 A.M. this morning. _E_\nMy #TrumpTuesday @SquawkCNBC interview discussing golf VP choices the real estate market & healthcare reform __HTTP__ _E_\nLooking forward to tonight's conversation w/ David Rubenstein @TheEconomicClub. Airing live on @cspan at 7PM EST __HTTP__ _E_\nWow @Macys shares are down more than 40% this year. I never knew my ties & shirts not being sold there would have such a big impact! _E_\nRT @DonnaWR8: @realDonaldTrump I wonder what this BRAVE American would give to stand on his OWN two legs just ONCE MORE for our #Anthem?... _E_\nWatch me explain on the @Late_Show how my charitable offer to Obama changes the election and is about transparency __HTTP__ _E_\nHe @BarackObama believes that the War on Terror is over __HTTP__ Who does he think won? _E_\nboth countries will perhaps work together to solve some of the many great and pressing problems and issues of the WORLD! _E_\nJoin me for my #WeeklyAddress __HTTP__ __HTTP__ _E_\nPocahontas bombed last night! Sad to watch. _E_\nThank you for your incredible support Wisconsin and Governor @ScottWalker! It is time to #DrainTheSwamp & #MAGA!... __HTTP__ _E_\n.@JebBush's opening and closing in the debate were said by all to be terrible fumbled around incoherent. _E_\n.@SenJohnMcCain should be defeated in the primaries. Graduated last in his class at Annapolis dummy! _E_\nThe Fed's pumping is great news in the short term but it can't last forever. Be prudent in your market investing. _E_\nLooks like my work here is done bringing a close to the first ever #NBC #SweepsTweet. Keep watching @ApprenticeNBC every Sunday 9/8c. _E_\n.@Israel could very well be close to attacking Iran. Could be this election's big October surprise... _E_\nThank you to Bob Woodward who said That is a garbage document...it never should have been presented...Trump's right to be upset (angry)... 
_E_\nVia @BreitbartNews by @j_strong: \"Obama Administration Quietly Prepares 'Surge of Millions' of New Immigrant Ids\" __HTTP__ _E_\nVia @FoxSportsGolf: Trump's protégé earns US Open spot __HTTP__ _E_\nSome jerk fraudulently tweeted that his parents said I was a big inspiration to them + pls RT—out of kindness I retweeted. Maybe I'll sue. _E_\nI am going to repeal and replace ObamaCare! Read more about my positions on healthcare reform here: __HTTP__ _E_\nThis is one of the COLDEST WINTERS ever freezing all over the country for long periods of time! So much for GLOBAL WARMING. _E_\nDiscussing the 9/11 attack and coverage with @kingsthings while hosting the 25th anniversary of his @CNN show __HTTP__ _E_\nDoing Fox and Friends in two minutes! _E_\nWill be doing Fox & Friends at 7 2 minutes. _E_\nThe Trans Pacific Partnership will lead to even greater unemployment. Do not pass it. _E_\nI guess I have reached yet another ceiling 49.7% with four people. My highest Reuters poll yet! Thank you! __HTTP__ _E_\nEven the @NYTimes and @WashingtonPost Editorial Boards condemned Justice Ginsburg for her ethical and legal breach. What was she thinking? _E_\nVia @ConMonitorNews by @CMonitor_JVF: Donald Trump guest speaker at event honoring James Foley __HTTP__ _E_\nDonald Trump returns to the 'Apprentice' boardroom __HTTP__ via @BW _E_\nI campaigned on creating a merit based immigration system that protects U.S. workers & taxpayers. Watch: __HTTP__ #RAISEAct __HTTP__ _E_\nNation's Immigration And Customs Enforcement Officers (ICE) Make First Ever Presidential Endorsement: __HTTP__ _E_\nCongratulations to @PiersMorgan on winning @BritishGQ TV Personality Of The Year. Piers deserves his success! _E_\n.@DannyZuker Danny—Let your bosses on Modern Family lend you the money to play the game. Show courage! _E_\nI never want someone working for me who doesn't want to be there and in the same way you shouldn't want to be there either. 
_E_\nIf elected POTUS I will stop RADICAL ISLAMIC TERRORISM in this country! In order to do this we need to... __HTTP__ _E_\nGetting ready for the big news conference in Dubai. It should all be happening in the U.S. but it isn't SAD! _E_\nPresident Obama has a personal responsibility to visit & embrace all people in the US who contract Ebola! _E_\nMelania and I just had interview with the legendary @BarbaraJWalters. Watch #abc2020 this Friday. Tonight we talk ISIS @WNTonight _E_\nWaPo attack on alleged high school incidents by @MittRomney is a hit job to me. Where are @BarackObama's high school and college records? _E_\nObama should meet with Putin snd convince him to do what is good for the U.S. It's called good dealmaking or simply leadership! Cajole. _E_\nGreat leaders listen to and support law enforcement officials. Police discuss no go areas: __HTTP__ __HTTP__ _E_\nDiane Black of Tennessee the highly respected House Budget Committee Chairwoman did a GREAT job in passing Budget setting up big Tax Cuts _E_\nHopefully Republican Senators good people all can quickly get together and pass a new (repeal & replace) HEALTHCARE bill. Add saved $'s. _E_\nSince November 8th Election Day the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs! _E_\nNo cuts to welfare no cuts to food stamps & NOT A SINGLE CUT TO OBAMACARE yet the new budget cuts military benefits. Sad! _E_\n\"Build your reputation on intelligence responsibility and results. That's building the right way.\" – Think Like a Champion _E_\nIran's nuclear program must be stopped – by any and all means necessary. _E_\nBefore I bought the site the Sun Times had the biggest ugliest sign Chicago has ever seen. Mine is magnificent and popular. _E_\nTHANK YOU SYRACUSE! #NYPrimary __HTTP__ __HTTP__ _E_\nSet the example. You can motivate others as well as yourself by remembering that you are setting the example. 
_E_\nNot since Watergate have we been going thru a time like this Benghazi IRS wiretapping of @AP... _E_\nMy @CNN interview with @wolfblitzercnn where I discuss @BarackObama's 'birth certificate' and why @CNN has low ratings __HTTP__ _E_\nThere must be a higher standard of accuracy in the media. Incredible that some so called journalists can make up lies and get away with it _E_\nRT @TrumpInaugural: Counting down the days until the swearing in of @realDonaldTrump & @mike_pence. Check in here for the latest updates. #... _E_\nRowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday joined @foxandfriends. __HTTP__ _E_\nWhy do we always know how the four liberals are going to rule but have to think about which side the Republican judges will go. _E_\nIn the span of two months @BarackObama the habitual vacationer has called America soft and lazy. He loves to criticize America. _E_\nMust read @IBDinvestors editorial: \"Child Alien Crisis Obama's Fault But GOP Won't Pounce\" __HTTP__ _E_\n.@CPACnews had its largest ever ticket sales the day of my announcement. Really an honor. Can't wait to see everyone. _E_\nCongrats to the new Gov. of Texas @GregAbbott_TX for taking a tough & bold stance at the border. Should have been done long ago by Perry. _E_\nSadly there is no way that Ted Cruz can continue running in the Republican Primary unless he can erase doubt on eligibility. Dems will sue! _E_\nEntrepreneurs: Keep an open mind! Business is a creative endeavor. There are always opportunities and possibilities. _E_\nLyin'Ted Cruz is weak & losing big so now he wants to debate again. 
But according to DrudgeTime and on line polls I have won all debates _E_\nMy father Fred Trump left me a relatively small amount of money (compared to where I am today over $10 billion) but vast amount of knowledge _E_\nWhy isn't Hillary Clinton 50 points ahead?#DebateNight __HTTP__ _E_\nDespite the false @nytimes story about Jeb Bush being happy with the Trump surge he fell more than anybody & is miserable. _E_\nEmails prove WH knew ObamaCare website wouldn't work in October why didn't they delay the launch? __HTTP__ _E_\nI look forward to attending Saturday Night Live on Sunday night. I am sure it will be a great show. @nbcsnl __HTTP__ _E_\n.@megynkelly used this poll (nobody else did) when I was down—wonder if she'll use it now that I'm up? __HTTP__ _E_\nMy @FoxNews interview last night with @Gretawire __HTTP__ _E_\nTo all the Bernie voters who want to stop bad trade deals & global special interests we welcome you with open arms. People first. _E_\nIntrinsic means basic inborn elemental. If you have an intrinsic value it cannot be taken away. Think Like a Champion _E_\nThe Chinese must still be laughing at Kerry's trip to China. He got nothing gave them everything and promised even more. _E_\nThe Donald J. Trump Signature Collection available @Macys offers this fall's top styles in ties shirts & suits __HTTP__ _E_\nI am in IstanbulTurkey. Just opened magnificent #TrumpTowers a big hit. _E_\nRT @DonaldJTrumpJr: Someone please fact check her coal comments. Give me a break. #debates _E_\nRT @netanyahu: Ever Strongerחזקים תמיד 🇱 __HTTP__ _E_\nThank you Hilton Head South Carolina! @SCTeamTrump #Trump2016 __HTTP__ __HTTP__ _E_\nUpstate New York needs jobs. Frack Now & Frack Fast! Pay off NY State debt. _E_\nVia Politico: Trump Extends Lead in New Hampshire Poll __HTTP__ _E_\nRT @JaydaBF: VIDEO: Islamist mob pushes teenage boy off roof and beats him to death! 
__HTTP__ _E_\nChuck Jones who is President of United Steelworkers 1999 has done a terrible job representing workers. No wonder companies flee country! _E_\nDespite the Fake News Media in conjunction with the Dems an amazing job is being done in Puerto Rico. Great people! _E_\nEverything you can imagine is real. —Pablo Picasso _E_\nIsn't it interesting that now that I'm #1 in the polls the networks show polls that are a month old! _E_\nMore radical Islam attacks today it never ends! Strengthen the borders we must be vigilant and smart. No more being politically correct. _E_\nI am at the Saturday Night Live Studio electricity all over the place. We will be doing a tweeting skit so stay tuned! _E_\nTrump National Golf Club Los Angeles on the Palos Verdes Peninsula overlooking the Pacific Ocean spectacular! __HTTP__ _E_\nWe should start an immediate investigation into @SenSchumer and his ties to Russia and Putin. A total hypocrite! __HTTP__ _E_\nWill be doing a joint press conference in Hanoi Vietnam then heading for final destination of trip the Phillipines. _E_\nI know some of you may think l'm tough and harsh but actually I'm a very compassionate person (with a very high IQ) with strong common sense _E_\n.@WSJ Editorial says Clinton primary vote total is 8646551.Trump's is 7533692 a knock. But she had only 3 opponents I had 16.Apologize _E_\nBob Schieffer will do a great job tonight. Always treated me fairly. _E_\nRT @foxandfriends: Yesterday's hearings provided zero evidence of collusion between our campaign and the Russians because there wasn't any... _E_\nWith all of the Fake News coming out of NBC and the Networks at what point is it appropriate to challenge their License? Bad for country! _E_\nHow quality a woman is Rowanne Brewer Lane to have exposed the @nytimes as a disgusting fraud? Thank you Rowanne. _E_\nMaybe Boehner will stop this one sided deal in the House...I hope so! _E_\nNEW FBI TEXTS ARE BOMBSHELLS! _E_\nGreat news! 
Just out the highly respected USA Today/Suffolk University Poll. Enjoy! __HTTP__ _E_\nRT @MollyCBraswell: WHAT?! @realDonaldTrump is speaking at #CPAC2013? This conference just became like a hundred times more awesome! _E_\nThank you so nice. _E_\nThe Supercommittee is a disaster. The Republicans made a crucial mistake agreeing to this debt deal. They hat... (cont) __HTTP__ _E_\nLooking over New York City with luxurious 5 Star hotel rooms @TrumpNewYork top dining & amenities __HTTP__ _E_\nWow just in ObamaCare projected to cause large scale drop in jobs even Dems are shocked by 2.5 million number. DISASTER! _E_\nThe U.S. needs to protect our intelligence assets especially in China. If the Chinese want to spy on us then we need to return the favor. _E_\nPaula Broadwell's book on Gen. Petreus is titled All In. Did she know something? _E_\nIt's time for politicians to be reminded they work for us! We can get it done. Let's Make America Great Again! __HTTP__ _E_\nRT @RSBNetwork: LIVE Stream now: Donald Trump press conference #TrumpTrain #Trump2016 __HTTP__ _E_\nInteresting polls on who won the GOP debate. __HTTP__ _E_\nAt the Old Post Office __HTTP__ _E_\nCongratulations to Miss Mexico Jimena Navarrete our new Miss Universe 2010 and congratulations to everyone for a fantastic show. _E_\nDonald Trump Leads Polls in Florida __HTTP__ _E_\nEven though I beat him in the first six debates especially the last one Ted Cruz wants to debate me again. Can we do it in Canada? _E_\n.@StephenBaldwin7 and me at a press event for All Star @ApprenticeNBC earlier today at @TrumpTowerNY.... __HTTP__ _E_\nObamacare premiums continue to rise and bend up the cross curve. And the back end of the website does not even work. _E_\nI was on CNN last night with @ErinBurnett. _E_\n.@WhoopiGoldberg had better surround herself with better hosts than Nicole Wallace who doesn't have a clue. The show is close to death! 
_E_\nIf Trump became president he would do an amazing job if Obama took over Celebrity Apprentice he'd fail. What's your opinion? I agree! _E_\nLooking forward to a big rally in Nashville Tennessee tonight. Big crowd of great people expected. Will be fun! _E_\nEconomic growth can save Social Security Medicare and America. _E_\nTHANK YOU Council Bluffs Iowa! The silent majority is silent no more!#Trump2016 #FITN __HTTP__ __HTTP__ _E_\nThe Council is concerned over the health & safety for the village of Blackdog w/ placement of sub station. @AlexSalmond @pressjournal _E_\n.@MarkHalperin I totaly won the RJC meeting yesterday. Know many members who said not even close. Only FULL standing O. But don't want $'s _E_\n\"I don't think you should ever run from history. You should learn from it and embrace it.\" @LAClippers Coach Doc Rivers _E_\nThe failing @NYTimes would do much better if they were honest! __HTTP__ _E_\nYesterday there was yet another massive intelligence leak by the @BarackObama administration. __HTTP__ _E_\nMILITARY LIVES MATTER! END GUN FREE ZONES! OUR SOLDIERS MUST BE ABLE TO PROTECT THEMSELVES! THIS HAS TO STOP! _E_\nNew and great selection of ties shirts and cufflinks@Macy's check them out! _E_\nI was right—TV ratings for US Open are way down from last year. People don't want to look at a burned out ugly course! _E_\nHe @RickSantorum should get out of the race so Republicans can focus on @BarackObama. _E_\nPlease send a psychiatrist to help @Rosie she's in a bad state. To @Rosie's girlfriend's parents get (cont) __HTTP__ _E_\nI will be on @foxandfriends this morning at 8:30. Enjoy! _E_\nThe ruling @GOP consultant class of losers like @KarlRove have no respect for the Tea Party. They do this at their own peril! _E_\n.@Yankees are making a big mistake sending the doping @AROD to rehab assignment. Should suspend him until investigation is over. _E_\n.@MittRomney will create 2 million new jobs if elected POTUS. 
If reelected @BarackObama will create over $12T in new debt. Easy choice. _E_\nBrian Williams who is not the nice guy that people think he is has now become totally irrelevant. He will never again hold court! _E_\nMy @greta interview discussing why we do not need another Bush __HTTP__ _E_\nAbsentee Governor Kasich voted for NAFTA and NAFTA devastated Ohio a disaster from which it never recovered. Kasich is good for Mexico! _E_\nThis season's cast of @ApprenticeNBC brings excitement to the Board Room. Lots of surprises & great tasks. Enjoy – Jan. 4th! _E_\nCelebrating New Year's Eve in the Windy City? Join @TrumpChicago for the chic & elegant Cirque Soiree Celebration __HTTP__ _E_\nVia @eonline: Donald Trump wants Katherine Webb for Miss USA judge __HTTP__ _E_\n.@MacMiller has over 79M hits on YouTube & just hit platinum with his Donald Trump song—screw you Mac! _E_\nI want to thank my @Cabinet for working tirelessly on behalf of our country. 2017 was a year of monumental achievement and we look forward to the year ahead. Together we are delivering results and MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nGreat news Chinese companies who were fixing prices and accounting are leaving the US stock market __HTTP__ #TimeToGetTough _E_\nGood news is Melania's speech got more publicity than any in the history of politics especially if you believe that all press is good press! _E_\nIn game 7 of the World Series tonight the Giants are making a big mistake in not starting their ace against K.C. even with two days rest. _E_\n.@GMA at 7:00 A.M. _E_\nIf your actions inspire others to dream more learn more do more and become more you are a leader. – John Quincy Adams _E_\nSo nice thank you very much. __HTTP__ _E_\n.@AlexSalmond Heatwave in Scotland makes wind turbines useless. Big problem expensive mess. _E_\n.@THR The Donald Trump Ratings Bump: Who's Benefiting Most? __HTTP__ _E_\nMy major hotel conversion of The Old Post Office on Pennsylvania Avenue in D.C. 
is under budget and ahead of schedule. Should be U.S.A. _E_\nI visited our Trump Tower campaign headquarters last night after returning from Ohio and Arizona and it was packed with great pros WIN! _E_\nVia @washingtonpost's @goingoutguide by @timcarman: \" @gzchef open the National at the Old Post Office Pavilion\" __HTTP__ _E_\nGreat day today in South Carolina. Fantastic capacity crowd amazing people! _E_\nToday Judge St. Eve ruled in my favor on the two remaining claims brought by Goldberg in Chicago. The case is now officially over... _E_\nMy @Newsmax_Media int. with @SteveMTalk on my Iowa @theFAMiLYLEADER speech @jonkarl 2016 & Benghazi __HTTP__ _E_\nRT @shawgerald4: @realDonaldTrump Thank you President TRUMP!! __HTTP__ _E_\nLeaving today for California to inspect my fantastic golf course & club on the Palos Verdes peninsula. Big success. __HTTP__ _E_\nDave Brubeck was great and will be missed! _E_\nJeb Bush should stop trying to defend his brother and focus on his own shortcomings and how to fix them. Also Rubio is hitting him hard! _E_\nVia @Newsmax_Media by \"Donald Trump: Don't Give Obama Fast Track Trade Authority\" __HTTP__ _E_\nA strong Poland is a blessing to the nations of Europe and a strong Europe is a blessing to the West and to the world. __HTTP__ _E_\nThe measure of who we are is what we do with what we have. Vince Lombardi _E_\nWow Jeb Bush just lost three of his top fundraisers they quit! _E_\nI've helped pass and signed 38 Legislative Bills mostly with no Democratic support and gotten rid of massive amounts of regulations. Nice! _E_\nThank you Indiana we were just projected to be the winner. We have won in every category. You are very special people I will never forget! 
_E_\nMiss Universe Paulina Vega criticized me for telling the truth about illegal immigration but then said she would keep the crown Hypocrite _E_\n\"Strive for wholeness and keep your sense of wonder intact & you will find yourself ready for a grand slam.\" Think Like A Champion _E_\nAll eyes are on Florida today. I will be watching the GOP primary results very closely. We need the right candidate to beat @BarackObama. _E_\nWhy do you need a photo ID to buy a drain cleaner __HTTP__ not to vote? _E_\n\"Trump Tiger Team Up to Create 'Stunning' Golf Course in Dubai\" __HTTP__ via @Newsmax_Media by @Jlorenz _E_\nVia @fitsnews by Will Folks: \"'THE DONALD' REBUKES OBAMATRADE\" __HTTP__ _E_\nTHANK YOU! #AmericaFirst __HTTP__ _E_\nClinton campaign & DNC paid for research that led to the anti Trump Fake News Dossier. The victim here is the President. @FoxNews _E_\nArriving to check out the border. __HTTP__ _E_\nLeaking and even illegal classified leaking has been a big problem in Washington for years. Failing @nytimes (and others) must apologize! _E_\nWe have got to take our country back. It's time! _E_\nMajor League Baseball: The best thing you can do is let @PeteRose_14 your all time hits leader into the Hall of Fame. It's time! _E_\nEric Trump on @foxandfriends now! _E_\nThe jury was not told the killer of Kate was a 7 time felon. The Schumer/Pelosi Democrats are so weak on Crime that they will pay a big price in the 2018 and 2020 Elections. _E_\nBeyond eliminating the wasteful spending we need to get tough in cracking down on the hundreds of billions of (cont) __HTTP__ _E_\nI have been saying it for sometime now!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nExclusive Davi: Trump The Lion We Need __HTTP__ _E_\n.@TraceAdkins the winner of @ApprenticeNBC after last night's victory __HTTP__ _E_\n\"Go as far as you can see when you get there you'll be able to see farther.\" J. P. 
Morgan _E_\nRT @rupertmurdoch: As predicted Trump reaching out to make peace with Republican establishment . If he becomes inevitable party would be... _E_\n.@seanhannity should have corrected Jeb Bush when he said that I ran for president twice. Never ran merely considered running! _E_\n\"Always remember: Dress for the job you want not the job you have.\" – Think Like a Billionaire _E_\nThank you to everyone who came out & joined us @TrumpTurnberry yesterday! @EricTrump @IvankaTrump @DonaldJTrumpJr __HTTP__ _E_\nLosers and haters are invited to watch Celebrity Apprentice along with the many great and productive people in the hope that you will learn. _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nBowe Bergdahl walked off the base after he was told not to. Solders died looking for him. U.S. should NEVER have made the deal! PUNISHMENT? _E_\nObama's promise to build an international coalition against ISIS is already broken. No one trusts him at home or abroad. _E_\n...It's old electrical grid which was in terrible shape was devastated. Much of the Island was destroyed with billions of dollars.... _E_\nMy sense is that people are far angrier at the President than they are at Congress re the shutdown—an interesting turn! _E_\nThanks @SherriEShepherd 4 your nice comments today on The View. U were terrific! _E_\nHe's saddled our children with more debt than we accumulated in 225 years in America. @BarackObama has done an (cont) __HTTP__ _E_\nFrom The Desk Of Donald Trump two new videos up at __HTTP__ and __HTTP__ _E_\nYou're hired! The @CENTURY21 ad is airing during the #SuperBowl and you need to get voting! Vote for me & @CENTURY21: __HTTP__ _E_\nCongrats to Team USA & Capt. @AllenWronowski for retaining the PGA Cup! Well done and well deserved! _E_\nMet a big fan today! __HTTP__ _E_\nTalks on Repealing and Replacing ObamaCare are and have been going on and will continue until such time as a deal is hopefully struck. 
_E_\nKeep lightweight Marco and his friends out of the White House. #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIf we want to renew our PROSPERITY restore OPPORTUNITY & re establish our economic DOMINANCE then we need tax reform that is pro growth.. __HTTP__ _E_\nThank you @DallasPD! __HTTP__ _E_\n'Dem Operative Who Oversaw Trump Rally Agitators Visited White House 342 Times' #DrainTheSwamp __HTTP__ _E_\nThanks. __HTTP__ _E_\nGetting ready to watch the debate as they say let's get ready to rumble ! _E_\nRemember the harder you work the luckier you get! _E_\nSome people dream of great accomplishments while others stay awake and do them.\" Anonymous _E_\nObamaCare is a broken mess. Piece by piece we will now begin the process of giving America the great HealthCare it deserves! _E_\nLegal immigrants want border security. It is common sense. We must build a wall! Let's Make America Great Again! __HTTP__ _E_\nOur country is totally divided and our enemies are watching. We are not looking good we are not looking smart we are not looking tough! _E_\nThank you to Prime Minister of Australia for telling the truth about our very civil conversation that FAKE NEWS media lied about. Very nice! _E_\nTogether we will show the world that the forces of destruction and extremism are NO MATCH for the BLESSINGS of PROSPERITY and PEACE! __HTTP__ _E_\nWill be joining @jimmyfallon on @FallonTonight at 11:35pmE tonight. Enjoy! _E_\nMy @amtalker int. on @whoradio w/@SteveKingIA discussing my upcoming campaign visit for Steve this Sat. in Iowa __HTTP__ _E_\nVia @TV3Xpose: \"@IvankaTrump: Think pink in the boardroom.\" __HTTP__ _E_\nBoston incident is terrible. We need energy and passion but we must treat each other with respect. I would never condone violence. _E_\nLast night William Shatner had more airtime than any winner. It should have been called the William Shatner show... _E_\nMassive crowds already forming in Jacksonville will be and incredible day 12 noon! 
MAKE AMERICA GREAT AGAIN! _E_\nHope we all enjoy @60Minutes tomorrow night. I do believe they will treat me fairly! _E_\nThank you to Sue Kruczek who lost her wonderful and talented son Nick to the Opioid scourge for your kind words while on @foxandfriends. We are fighting this terrible epidemic hard Nick will not have died in vain! _E_\nStill time to get out and VOTE!#WIPrimary #Trump2016 #MAGA __HTTP__ _E_\nI told the Republicans the debt ceiling talks should come before election & we would have a Republican president—they wouldn't listen. _E_\nQ/A @stalkinpeople Yes I'd give the real numbers. _E_\nI'll be in one of my favorite places this morning Staten Island. Big crowd will be fun! _E_\nJoin us in Salt Lake City Utah tonight!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIn getting the endorsement of the 16500 Border Patrol Agents (thank you) the statement was made that the WALL was very necessary! _E_\nVia @Newsmax_Media by @wandacarruthers: Donald Trump: US Defeating ISIS Only in John Kerry's Imagination __HTTP__ _E_\nI hope the @GOP realizes that if they blow this election the Tea Party won't be with them next time. _E_\nWhen candidate John Kasich on the @oreillyfactor talked about dismantling Medicare and Medicaid he was referring to Ben Carson. _E_\nAmazing rally in Florida this is a MOVEMENT! Join us today at __HTTP__ __HTTP__ _E_\nFascinating to watch people writing books and major articles about me and yet they know nothing about me & have zero access. #FAKE NEWS! _E_\nToday @FLOTUS hosted a Military Mother's Day Event in the East Room of the WH. It was an honor to stop by say hel... __HTTP__ _E_\n\"A leader does not deserve the name unless he is willing occasionally to stand alone.\" Henry A. Kissinger _E_\nWe need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_\nFailing @GlennBeck lost all credibility. Not only was he fired @ FOX he would have voted for Clinton over McCain. 
__HTTP__ _E_\nWill be on @Morning_Joe in 5 minutes at 7:00. Enjoy! _E_\nThe @CelebApprentice will be broadcast tonight on @CNBC at 9 PM. _E_\nTomorrow is #TrumpTuesday on @SquawkCNBC 7:30 AM. Tune in! _E_\nAshley Judd has just thanked Karl Rove for all the attention he has given her—unreal!—how stupid can we get? _E_\nWaste! With a $16T debt and $1T budget deficit @BarackObama is sending $770M overseas to fight global warming __HTTP__ _E_\nSeems to be the next election must be about jobs and gas prices not birth control. _E_\nVote for your favorite TRUMP HOTEL COLLECTION hotels in Travel + Leisure's 2012 World's Best Awards Survey __HTTP__ _E_\nHow long did it take your staff of 823 people to think that up and where are your 33000 emails that you deleted? __HTTP__ _E_\nI truly LOVE all of the millions of people who are sticking with me despite so many media lies. There is a great SILENT MAJORITY looming! _E_\nNow @BarackObama has decided there are 5 million Palestinian refugees __HTTP__ He always goes against @Israel's interest. _E_\nMarco Rubio lost big last night. I even beat him in Virginia where he spent so much time and money. Now his bosses are desperate and angry! _E_\nDo you notice that because of Ebola ISIS etc. ObamaCare has gone to the back burner despite horrible results coming out. A disaster! _E_\nThe massive TAX CUTS/REFORM that I have submitted is moving along in the process very well actually ahead of schedule. Big benefits to all! _E_\nThe Democrats in the Southwest part of Virginia have been abandoned by their Party. Republican Ed Gillespie will never let you down! _E_\nGas prices are at crazy levels fire Obama! _E_\nI hope everybody is having a FANTASTIC Christmas! No matter how tough things may seem remember that you will ride it out & go on to victory! _E_\nIs everyone seeing how incompetently our country is being run by watching the mess with Syria? Our leaders don't know what they are doing! 
_E_\n.@AlexSalmond –the man who let terrorist (Pan Am Flight 103) al Megrahi go lost another battle over ugly wind turbines in Blackdog. _E_\nVia @FOXSports: Trump 'blowing up' @DoralResort after WGC @CadillacChamp __HTTP__ by @AP _E_\nA fantastic day and evening in Washington D.C.Thank you to @FoxNews and so many other news outlets for the GREAT reviews of the speech! _E_\nI employ many people in Hawaii at my great hotel in Honolulu. I'll be there very soon. Vote for me Hawaii! _E_\nIt is my great honor to support our Veterans with you! You can join me now. Thank you! #Trump4Vets __HTTP__ _E_\nTune in tonight at 1 AM EST to the QVC network to watch Melania Trump debut her first 2012 Melania Timepieces & (cont) __HTTP__ _E_\nVOTE 4 @mariamenounos & derekhough#01 tonight! She's doing a great job on Dancing with the Stars #DWTS (& a good person). 1 800 868 3401 _E_\nThe Yankees really have to be embarrassed losing all four games to the Mets my great friend George Steinbrenner would be going nuts! _E_\nGetting ready to lift off for Laredo. Will land at 1:OO P.M. Should be exciting and informative! _E_\nIt was a great honor to represent the United States at the magnificent #BastilleDay parade. Congratulations President @EmmanuelMacron! __HTTP__ _E_\n.@MELANIATRUMP @IvankaTrump @EricTrump @DonaldJTrumpJr & I thank our loyal fans for another great season of @ApprenticeNBC! _E_\nRT @WhiteHouse: #Obamacare has led to higher costs and fewer health insurance options for millions of Americans. It has failed the American... _E_\nLater today I'm being honored at the Park Hyatt in Washington D.C. by the Wharton Club. The Joseph Wharton Award Dinner. A great honor. _E_\nMy son Donald will be interviewed by @seanhannity tonight at 10:00 P.M. He is a great person who loves our country! _E_\nNo wonder Sony is doing so badly. Really stupid leadership that wants Al Sharpton to help. Watch him turn the tables on chief Amy Pascal. 
_E_\nRT @charliekirk11: Incredible video: @CBS does a special on the GOP tax plan The result?Every middle class family they sat down with SA... _E_\nVerlander is great but very beatable. Does not have a good ERA in playoff games _E_\nHeading to D.C. to speak at Faith and Freedom Coalition and visit OPO. _E_\n\"No matter how good you get you can always get better and that's the exciting part.\" @TigerWoods _E_\nYesterday was amazing—5 victories. Lyin' Ted Cruzhad zero. Things are going very well! _E_\nThank you for your support!#AmericaFirst #ImWithYou __HTTP__ _E_\nThank you Louisiana! #Trump2016#SuperSaturday _E_\nI am greatly honored to receive Sarah Palin's endorsement tonight. Video: __HTTP__ __HTTP__ _E_\nObamaCare is one of the worst political disasters of all time 4992343 AMERICANS LOSING COVERAGE LESS THAN 50OOO NEW SIGNUPS. _E_\n.... but you only want to talk about 10 years later when I still win 10PM in all key demos.@DannyZuker _E_\nVia @todayshow by @ReeHines: \"Donald Trump reveals new @ApprenticeNBC cast talks Joan Rivers' role on show\" __HTTP__ _E_\nSometimes your best investments are the ones you don't make. _E_\nI will be visiting Trump Int'l Golf Links in Scotland tomorrow. Always great to see the Great Dunes of Scotland. __HTTP__ _E_\nI'll soon be leaving for Washington where @AmSpec will give me the T. Boone Pickens Entrepreneur Award. Very exciting! _E_\nCrooked Hillary Attacks Foreign Government Donations While Ignoring Her Own: __HTTP__ _E_\nI think that both candidates Crooked Hillary and myself should release detailed medical records. I have no problem in doing so! Hillary? _E_\nYogi Berra was not only a great baseball player he was a great guy. Yogi will be missed. __HTTP__ _E_\nRT @piersmorgan: Trump makes a funny obvious joke about Russia going after Hillary's emails & U.S. media goes insane with fury.He plays t... _E_\nBack by popular demand @latoyajackson returns to the 13th season of All Star @CelebApprentice. 
She is fierce in the Board Room! _E_\nAnyone reading this profile of Marco Rubio would never vote for him. Never made ten cents & is totally controlled! __HTTP__ _E_\nJoin me in Sacramento California tomorrow evening @ 7pm! #Trump2016 __HTTP__ __HTTP__ _E_\nLIVE on #Periscope __HTTP__ _E_\nThe three great essentials to achieve anything worth while are: Hard work stick to itiveness and common sense. Thomas A. Edison _E_\nA test: tweet me the reason @billmaher was fired from @ABC (other than his bad ratings). _E_\nThe @USNavy is conducting search and rescue following aircraft crash. We are monitoring the situation. Prayers for all involved. __HTTP__ _E_\nMeryl Streep one of the most over rated actresses in Hollywood doesn't know me but attacked last night at the Golden Globes. She is a..... _E_\nCan you believe that the U.S. will be sending 3000 troops to Africa to help with Ebola.They will come home infected? We have enough problems _E_\nHonored to welcome Georgia Prime Minister Giorgi Kvirikashvili to the @WhiteHouse today with @VP Mike Pence.... __HTTP__ _E_\n.@BernieSanders who blew his campaign when he gave Hillary a pass on her e mail crime said that I feel wages in America are too high. Lie! _E_\nThe failing @politico news outlet which I hear is losing lots of money is really dishonest! _E_\nI am happy to announce that theoriginal Apprentice which will offer job opportunities to those in need is coming back. _E_\nIf Obama keeps pushing wind turbines our country will go down the tubes economically environmentally & aesthetically. _E_\nTrust yourself. Create the kind of self that you will be happy to live with all your life. Golda Meir _E_\nRT @IvankaTrump: Beautiful article about @realDonaldTrump written by my friend the incredibly talented golfer Natalie Gulbis: __HTTP__ _E_\nGuess what folks the ObamaCare website just went down again. What a disaster. _E_\nAs we are learning the hard way both domestically & internationally hope is not a strategy. 
_E_\nMy @gretawire int. on Obama scandals not resonating no retribution on Benghazi Obama not being engaged __HTTP__ _E_\nJerry Buss was a great guy and friend. He will be missed! _E_\nDeportations are now at a record low. Obama manipulated the numbers to lie to the public that they were at a record high. Secure the border! _E_\n.@AGSchneiderman has never once said that he didn't ask for campaign contributions during the investigation. _E_\n...and will be very embarrassed unless they get smart fast. _E_\nThe US is stupidly closing all of its coal fired plants while at the same time we're selling our coal to (cont) __HTTP__ _E_\nDemocrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that happen! _E_\n#timetogettough The White House Correspondents' Dinner in my new book Time To Get Tough .....watch the #trumpvlog __HTTP__ _E_\nTim Kaine is and always has been owned by the banks. Bernie supporters are outraged was their last choice. Bernie fought for nothing! _E_\nAlways nice to see the terrific @mariamenounos at the #WWEHOF. __HTTP__ _E_\nObama told @NBC that Egypt is no longer an ally. They used to be until he pushed out Mubarak. _E_\nHer instincts are suboptimal. __HTTP__ _E_\nAfter 14 years U.S. beef hits Chinese market. Trade deal an exciting opportunity for agriculture. __HTTP__ _E_\nDonald Trump appeared on the final episode of The Jay Leno Show to deliver a very special message: __HTTP__ _E_\n.@AndreBauer Great job and advice on @CNN @jaketapper Thank you! _E_\n.@johnboehner—if you can't make a great deal go over the cliff & negotiate new deal along with debt ceiling in February!—Trump 101. _E_\nBen Bradlee was truly one of the greats. What an amazing life he led. My warmest condolences to Dino & the whole family. #BenBradlee _E_\n.@RealDonaldTrump wants a SAFE America w/ stronger borders no amnesty and an END to sanctuary cities. He is... 
__HTTP__ _E_\nIf the @yankees can somehow beat Verlander tonight then they can still salvage the series. And I will go to games 6& 7 so they will win! _E_\nThe replacement refs are getting blamed for everything. I've seen many bad sports calls over the years. _E_\nBernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them and run as an Independent! _E_\n\"A list golfing buddy! @Tegan__Martin enjoys golf w/Donald Trump ahead of@MissUniverse\" __HTTP__ via @DailyMailCeleb _E_\nIt's Thursday how many more bias press reports will be released against @MittRomney? _E_\nCongratulations to Bill O'Brien on being named the Republican Speaker of the NH House. Well earned & well deserved. A great guy. _E_\nStatement Regarding Recent Executive Order Concerning Extreme Vetting: __HTTP__ _E_\nThe electric power grid in Puerto Rico is totally shot. Large numbers of generators are now on Island. Food and water on site. _E_\nWe fully support @SaveCulzean in Turnberry great for beauty & tourism. Wind turbines are death to environment. __HTTP__ _E_\nIt is that time of the year. The Trump Wollman Skating Rink is open to the public in Central Park. The greatest ice rink in the country. _E_\nFailed presidential candidate @MittRomney was made to look like a fool by Senator Harry Reid & didn't release his tax returns until 9/21/12. _E_\n.@Deadspin will never make it—they don't understand graciousness or money—and best guy is leaving? _E_\nWill be on @foxandfriends now. _E_\n\"You have to have a good reason for doing what you're doing because people connect with the why.\" – Midas Touch _E_\n.@tuckercarlson is doing a really good job on Fox especially when talking politics. He has come a long way fast! _E_\nThank you James Freeman of the @WSJ for the very nice words. All polls said I won the debate except NBC (3rd). Explain to Daniel Henninger! 
_E_\nWhy would a very low ratings radio talk show host like Hugh Hewitt be doing the next debate on @CNN. He is just a 3rd rate gotcha guy! _E_\nObama thinks he can just laugh off the fact that he refuses to release his records to the American public. He can't. _E_\nBy folding Penn State leadership made things worse. The deal is ridiculous & punishes the wrong people. I hope the alumni sue to overturn. _E_\nWhat does.Obama know about the VA or business nothing just look at the five billion dollar ObamaCare website. We need a real leader! _E_\nTrump Chicago was featured in Transformers 3. Trump Tower was featured in Dark Knight Rises. Both are summer blockbusters. #MidasTouch _E_\n#TrumpVlog Be careful with Iran. __HTTP__ _E_\nTHe WH should not have hosted the Muslim Brotherhood. @BarackObama's friends are enemies of the US and @Israel. The Islamist winter is here. _E_\nMake your NYC getaway memorable @TrumpNewYork provides both true luxury and top access to Midtown West __HTTP__ _E_\nUnemployment is up in 44 states showing July's unemployment numbers to be broad based __HTTP__ @BarackObama is a job killer. _E_\nBrian @kilmeade wrote a wonderful book called George Washington's Secret Six that is truly worth reading. __HTTP__ _E_\n\"If you strike out nobody is going to help you not your friends not the government. You have to look to look out for yourself.\" Think Big _E_\nI will be interviewed on @NewDay @CNN at 7:00 A.M. _E_\nIllegal immigration is a wrecking ball aimed at US taxpayers. Washington needs to get tough and fight for W... (cont) __HTTP__ _E_\nEntrepreneurs: Difficulties mistakes & setbacks are an inevitable part of business & life. Remember to keep your equilibrium intact. _E_\nHow can Ted Cruz be an Evangelical Christian when he lies so much and is so dishonest? _E_\nFact – while Jeb was governor & Rubio was House Majority Leader Florida's debt more than doubled. Conservatives? _E_\n\"I love it when people doubt me. 
It makes me work harder to prove them wrong.\" – Derek Jeter _E_\nNEVADA! Tomorrow is the deadline to register Republican.Visit: __HTTP__ from @IvankaTrump: __HTTP__ _E_\nJohn Sununu was more right than he even knew yesterday @BarackObama indeed needs to learn how to be an American. _E_\nMust read @WSJ column by Senator Phil Gramm \"The Multiple Distortions of Wind Subsidies\" __HTTP__ _E_\nPlayed golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_\nSuch a great experience in New Hampshire amazing people! Will be leaving for a big event in South Carolina today. _E_\nAN AMERICA FIRST ENERGY PLAN#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nAmazing NH poll released! We are getting ready to Make America Great Again! __HTTP__ _E_\nA general is just as good or just as bad as the troops under his command make him.\" Gen. Douglas MacArthur _E_\nWelcome to the United States @IsraeliPM Benjamin & Sara!#ICYMI 🇱Joint Press Conference: __HTTP__ __HTTP__ _E_\nWould love to send the NYC terrorist to Guantanamo but statistically that process takes much longer than going through the Federal system... _E_\nThis is what we can expect from #CrookedHillary. More Taxes. More Spending. #BigLeageTruth #DrainTheSwamp #Debates __HTTP__ _E_\nMore $ thrown away @BarackObama gave $20M to Amonix and praised its success in '10. It just filed for bankruptcy __HTTP__ _E_\n19 firefighters killed in Arizona terrible tragedy! _E_\nTrump Links will be a great championship golf course that will host many major tournaments and bring tremendous $'s & prestige to N.Y.C.! _E_\nPrinting money is neither a short or long term solution to our country's economic woes. The Fed is destroyin... (cont) __HTTP__ _E_\nRepublicans and Democrats must come together now to make America great again! _E_\nThank you! #MAGA #AmericaFirst __HTTP__ _E_\nI will rebuild the military take care of vets and make the world respect the US again! 
Join me today. Info: __HTTP__ _E_\n\"Listen to others but never negate your own instincts.\" – Trump Never Give Up _E_\nDo you believe this Iran wants to trade our 3 prisoners (not 4) for 19 prisoners held by the U.S. Should have been let go with last deal! _E_\nLow energy candidate @JebBush has wasted $80 million on his failed presidential campaign. Millions spent on me. He should go home and relax! _E_\n#NEPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_\n.@SecShulkin's decision is one of the biggest wins for our VETERANS in decades. Our HEROES deserve the best! ... __HTTP__ _E_\nJoin me in Oklahoma tomorrow night!#MakeYoutubeGreatAgain #Trump2016 __HTTP__ _E_\nGo to Trump National Doral Miami and watch Tiger Phil Ernie Rory and all of the other great players compete in The WGC Cadillac Champ! _E_\nIf Hillary thinks she can unleash her husband with his terrible record of women abuse while playing the women's card on me she's wrong! _E_\nRT @IsraelUSAforevr: @realDonaldTrump __HTTP__ _E_\nThe Herschel Walker interview on The Tim McCarver Show was fantastic much can be learned from watching. Congrats to Herschel and Tim! _E_\nI have been drawing very big and enthusiastic crowds but the media refuses to show or discuss them. Something very big is happening! _E_\nEntrepreneurs: See yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_\nSadly because president Obama has done such a poor job as president you won't see another black president for generations! _E_\nEntrepreneurs: See yourself as victorious. Look at the solution not the problem. Keep your focus positive. _E_\nAn investment in knowledge pays the best interest. Benjamin Franklin _E_\nMeeting w/ Washington D.C. @MayorBowser and Metro GM Paul Wiedefeld about incoming winter storm preparations here... 
__HTTP__ _E_\nVeterans please call 855 VETS 352 or email address veterans@donaldtrump.com to share your stories about the need to reform the VA. _E_\n#CelebApprentice What do you think of the new teams/PMs? _E_\nJust got back from Las Vegas. @TrumpLasVegas Hotel was fantastic in every way but the fight was a total waste of time. The aggressor lost? _E_\nWWII vs. Now! During the 3 1/2 years of World War II that started with the Japanese bombing of Pearl Harbor (cont) __HTTP__ _E_\nThank you Denver Colorado! #MakeAmericaGreatAgain! __HTTP__ _E_\n... The Republicans just didn't resonate with the people—but they will have better days. _E_\nThe White House is continuing to be openly uncooperative with the Fast and Furious investigation. American lives were lost. We need answers. _E_\nObamaCare premiums could jump as high as 51% __HTTP__ Terrible for economy. Repeal & Replace with free market solution! _E_\nJust arrived at @trumpdoral for the @cadillacchamp starting tomorrow __HTTP__ _E_\nIron Mike Tyson was not asked to speak at the Convention though I'm sure he would do a good job if he was. The media makes everything up! _E_\nKeep an open mind business is a creative endeavor. Strive for innovative ideas. _E_\nRT @CLewandowski_: Please watch @foxandfriends today at 7:30 AM to watch me discuss @realDonaldTrump. _E_\nThank you. __HTTP__ _E_\n\"The more predictable the business the more valuable it is. Predictability also means consistency of brand experience.\" Midas Touch _E_\nThe failing @nytimes hates the fact that I have developed a great relationship with World leaders like Xi Jinping President of China..... _E_\nCrooked Hillary Clinton has zero natural talent she should not be president. Her temperament is bad and her decision making ability zilch! _E_\nPres. Obama was touting Yemen as a great success story it just fell. Obama doesn't know what he is doing. Saudi Arabia is in big trouble. 
_E_\nEliot Spitzer has failed at everything he has ever done and now he wants to be comptroller. Thrown out of politics and off of TV CRAZY! _E_\nIt is so important to audit The Federal Reserve and yet Ted Cruz missed the vote on the bill that would allow this to be done. _E_\n.@Gracematters Thank you a very wise bet! Best wishes. _E_\nVia @trdmiami: \"@TrumpDoral project will boast 800 hotel rooms\" __HTTP__ $250M renovation on 800 acres in sunny Miami. _E_\nMichelle Obama made a terrible mistake in Iowa. When endorsing Bruce Braley before a large crowd she called him Bruce Bailey seven times. _E_\n.@CarlyFiorina Carly not just you I also told Gov. Kasich to \"let Jeb talk give him a chance\" because Kasich was constantly cutting in. _E_\nIt was great seeing @Schwarzenegger at the #WWEHOF. __HTTP__ _E_\nHow come the @TODAYshow & @chucktodd show the new @NBCNews Poll for Hillary vs Bernie but do not show the SAME poll where I am killing Cruz? _E_\nJFK Files are being carefully released. In the end there will be great transparency. It is my hope to get just about everything to public! _E_\nLooks like @bwilliams is having some problems with his Rock Center with Brian Williams show I hate to see such bad ratings for @NBC. _E_\n.@CNN is #FakeNews. Just reported COS (John Kelly) was opposed to my stance on NFL players disrespecting FLAG ANTHEM COUNTRY. Total lie! _E_\nChina's Financial Institutions are expanding overseas. __HTTP__ They will own everything if we don't stop them now. _E_\nI will be going to Trump National Doral in Miami early today to check on the construction of the hotel and the new Blue Monster. AMAZING! _E_\nI can't believe that President Obama isn't able or willing to make just one phone call to the family of Kate Steinle.Come on Pres MAKE CALL! _E_\nOn Tuesday I visited with the incredible men & women of @ICEgov & @DHSgov Border Patrol in Yuma AZ. Thank you. We respect & cherish you! 
__HTTP__ _E_\nStock Market at all time high unemployment at lowest level in years (wages will start going up) and our base has never been stronger! _E_\nMy prayers are with the victims and hostages in the horrible Paris attacks. May God be with you all. _E_\nI cannot imagine that Congress would dare to leave Washington without a beautiful new HealthCare bill fully approved and ready to go! _E_\nFor the haters out of hundreds of deals or transactions I have used the bankruptcy laws 4 times in order to cut better deals. _E_\nWatch video of Ivanka Trump sharing business advice with 4 entrepreneurial women on GMA: __HTTP__ _E_\nThe people of Scotland love the golf course I have built it is now considered perhaps the greatest ever built! Thank you also to Robb Report _E_\nSo nice to get an endorsement from the founder and owner of Pizza Ranch in Iowa! A great guy and great places! #CaucusForTrump _E_\nHave we ever had a POTUS before @BarackObama who earned over 1/3 of his income from foreign sources and paid taxes to another country? _E_\nWho's the outsourcer? @BarackObama's campaign is using a travel company with outsourced jobs in China and India. __HTTP__ _E_\nthis election. That is a direct threat to our democracy. She then said We have to accept the results and look to the future Donald _E_\nNever let them see you sweat! __HTTP__ _E_\nPakistani intelligence had full knowledge that Bin Laden was living in Abbottabad. They were sheltering him. _E_\nThe @NBCNews story has just been totally refuted by Sec. Tillerson and @VP Pence. It is #FakeNews. They should issue an apology to AMERICA! _E_\nIn case you missed it last week's @extratv interview with @AJCalloway discussing Tiger Woods & much more __HTTP__ _E_\nObama can open the Mall for illegals to protest our country yet he continues to barricade WWII memorial. That's an absolute disgrace. 
_E_\nPrior to the election it was well known that I have interests in properties all over the world.Only the crooked media makes this a big deal! _E_\nRight now we are running a massive $300 billion trade deficit with China. That means every year. China is (cont) __HTTP__ _E_\nRe: immigration. Do the Republicans not realize that Dems will get 100% of 11 million votes no matter what they do? _E_\nAfter the way I beat Gov. Scott Walker (and Jeb Rand Marco and all others) in the Presidential Primaries no way he would ever endorse me! _E_\nWELCOME HOME AYA!#GodBlessTheUSA __HTTP__ _E_\nRT @foxandfriends: Wall Street hits record highs after Trump pulls out of Climate pact __HTTP__ _E_\nObama should stop talking about wind turbines they are a disaster for a country or community & are very expensive & unreliable. _E_\nI always felt I would be running and winning against Bernie Sanders not Crooked H without cheating I was right. _E_\nI will be making the announcement of my Vice Presidential pick on Friday at 11am in Manhattan. Details to follow. _E_\n#TBT Here I am with @gwenstefani and @donaldjtrumpjr __HTTP__ _E_\nSubject to the receipt of further information I will be allowing as President the long blocked and classified JFK FILES to be opened. _E_\nWorking in Bedminster N.J. as long planned construction is being done at the White House. This is not a vacation meetings and calls! _E_\n.@kimguilfoyle just watched you on @OutnumberedFNC thank you! _E_\nThis week we saw what Obama Care actually does when implemented. It is a losing issue for @BarackObama and must be repealed. _E_\nI'll be on @foxandfriends this morning at 7:00. So much to talk about! _E_\nHigh above the city @TrumpLasVegas' pool deck mixes business & pleasure over a soaring bar of sky bound gold __HTTP__ _E_\nObama called Reverend Wright his friend counselor & great leader then dumped him like a dog! _E_\neven those registered to vote who are dead (and many for a long time). 
Depending on results we will strengthen up voting procedures! _E_\nWhy did @BarackObama let Iran keep our drone? Now it is going straight to the Chinese. He should have taken it out. _E_\nMajor rescue operations underway! _E_\nCan't believe I finally got a good story in the @washingtonpost. It discusses the enthusiasm of Trump voters through campaign.... _E_\nTruly honored to receive the first ever presidential endorsement from the Bay of Pigs Veterans Association. #MAGA... __HTTP__ _E_\nTreasury has refused to name China a currency manipulator even though the yuan \"remains significantly undervalued\" __HTTP__ _E_\nWhy would I call China a currency manipulator when they are working with us on the North Korean problem? We will see what happens! _E_\nTrump Virginia Office Announces Statewide TV Ad Strategy and Leadership Team: __HTTP__ __HTTP__ _E_\nHe @RickSantorum is now losing in the latest @ppppolls to @MittRomney in Pennsylvania __HTTP__ Rick is wasting everyone's time. _E_\nVan Jones: 'There Is A Crack in the Blue Wall' — It Has to Do With Trade: __HTTP__ _E_\nVia @BreitbartNews @biggovt: DONALD TRUMP TO SPEAK AT CPAC __HTTP__ by @michaelpleahy _E_\nThe United Nations Security Council just voted 15 0 in favor of additional Sanctions on North Korea. The World wants Peace not Death! _E_\nWife Huma wants @RepWeiner to pull a @billclinton by giving a tell all interview. Unlike Clinton Anthony is a sick puppy. _E_\nUnder his administration oil and gas production on public land is down over 10% __HTTP__ Obama did not tell truth last night. _E_\nVia @nypost by @GeoffEarle: \"Polls show 'President Trump' may not be so far fetched\" __HTTP__ _E_\nObama's own top donor is now laying employees off and lowering hours in anticipation of Obama Care __HTTP__ The new reality. _E_\nEntrepreneurs: You have to have passion. If you love your work success will follow. _E_\nBig things going on today at Trump National Westchester! 
_E_\nTogether we will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_\nWith 3.5 million Americans receiving bonuses or other benefits from their employers as a result of TAX CUTS 2018 is off to great start!✅Unemployment rate at 4.1%.✅Average earnings up 2.9% in the last year.✅200000 new American jobs.✅#MAGA __HTTP__ _E_\nThe polls are really looking good—#1 everywhere despite all lobbyist & special interest $ being spent against me. I'm turning down millions. _E_\nThe White House Correspondents' dinner was so boring this year I guess that's because I didn't attend(even... __HTTP__ _E_\n\"Never think of learning as a burden. It may require some discipline but it prepares you for a new beginning.\"– Think Like a Champion _E_\nSo General Flynn lies to the FBI and his life is destroyed while Crooked Hillary Clinton on that now famous FBI holiday \"interrogation\" with no swearing in and no recording lies many times...and nothing happens to her? Rigged system or just a double standard? _E_\nI want to thank Elizabeth Steve Brian and all of the great folks of @foxandfriends for the long and successful run we had together. NICE! _E_\nIt's Wednesday how much money is China stealing from us today? _E_\nObamaCare has cut workers' pay by over $22B & eliminated 350000+ small business jobs __HTTP__ Repeal before it's too late! _E_\nGreat new poll thank you!#MakeAmericaGreatAgain __HTTP__ _E_\nI try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is. ~Donald Trump _E_\nThe Fed should not bail out the EU. Europe's financial mess is their problem not our problem! _E_\nDemocrats are laughingly saying that McCain had a moment of courage. Tell that to the people of Arizona who were deceived. 116% increase! _E_\nI knew disgusting and unwanted porn star @REPWEINER was a sleazebag the first time I met him. Thank goodness he was revealed (so to speak). 
_E_\nIf @amazon ever had to pay fair taxes its stock would crash and it would crumble like a paper bag. The @washingtonpost scam is saving it! _E_\nChina is raising its defense budget by 11% __HTTP__ @BarackObama wants to cut ours by over $1Trillion. Wrong policy. _E_\n\"Big jobs usually go to the men who prove their ability to outgrow small ones.\" Theodore Roosevelt _E_\nWas in Iowa yesterday great people. Record crowds at both speeches. Something big is happening. Pols are all talk. Make America great again! _E_\nWow! Does Eliot Spitzer have a girlfriend? This is getting exciting. _E_\nFirst Minister of Scotland released bomber of Pan Am flight #103 on compassionate grounds. Do you believe? _E_\nThank you Kevin. With unification of the party Republican wins will be massive! __HTTP__ _E_\nYOU NEED BOTH A PUBLIC AND A PRIVATE POSITION @HillaryClinton #Debates2016 __HTTP__ _E_\n.@MRbelzer is a stone cold loser with no talent why did they ever put him on Law and Order? _E_\nWeekly jobless claims are up once again. The economy cannot recover with Obama in office. _E_\nThank you Springfield Ohio. Get out and #VoteTrumpPence16!#ICYMI watch here: __HTTP__ __HTTP__ _E_\nI have long stated that Brian Williams was not a very smart guy all you have to do is look at his past. Now he has proven me correct! _E_\nI will be on ON THE RECORD @gretawire tonight at 10 pm _E_\nDonald Trump's back with 14 'Apprentice' All Stars __HTTP__ via @AP _E_\nThank you for your support! Together we can #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_\n#MakeAmericaWorkAgain #TrumpPence16 #RNCinCLE __HTTP__ __HTTP__ _E_\nIn Charlottesville VA @trumpwinery is Virginia's largest winery with 200 acres of French vinifera varieties __HTTP__ _E_\ncountry and with the massive cost reductions I have negotiated on military purchases and more I believe the people are seeing big stuff. _E_\nDerek Jeter's baseball and more in today's #trumpvlog... 
__HTTP__ _E_\nI believe @BarackObama is manipulating the jobless numbers __HTTP__ _E_\nWhile Hillary said horrible things about my supporters and while many of her supporters will never vote for me I still respect them all! _E_\nObamaCare does indeed ration care. Seniors are now restricted to comfort care instead of brain surgery. Repeal now! __HTTP__ _E_\nI am doing On the Record With Greta Van Susteren at 10 P.M. on Fox. We will be talking about the bad economy and other subjects of interest! _E_\nHow ironic that @BarackObama's campaign would call me a charlatan. Have they looked at their boss's record? _E_\nAnd the FAKE NEWS winners are... __HTTP__ _E_\nLightweight Senator Marco Rubio is VERY weak on immigration knows nothing about finance and would be incapable of making great trade deals! _E_\nI have determined that it is time to officially recognize Jerusalem as the capital of Israel. I am also directing the State Department to begin preparation to move the American Embassy from Tel Aviv to Jerusalem... __HTTP__ _E_\n.@NRO @JonahNRO Wow just looked at the stats for National Review. Dying fast doing very little business. Save this conservative voice! _E_\nRobin Williams was a truly wonderful actor & comedian. One of the few people who could make me laugh. Very tragic. _E_\nGreat going to all of Dubai in winning what will be a fantastic #Expo2020 we will all be there! _E_\n.@redcross CEO's salary in 2011 was $951957. Where is the outrage? _E_\nWhere is the outrage for this Disney book? Is this the 'Star of David' also? Dishonest media! #Frozen __HTTP__ _E_\n.@JonahNRO You should be totally focused on trying to save the badly failing National Review instead of focusing on me. Work hard! @NRO _E_\nVia @WBJonline by @WBJHolan: \"Donald Trump hints at presidential run promises 'great luxury hotel' for D.C.\" __HTTP__ _E_\nThe Chinese are smart. They bought up over $7B in US housing last year __HTTP__ U.S. is busy making China even richer. 
_E_\nA general is just as good or just as bad as the troops under his command make him. Douglas MacArthur _E_\nHave you ever seen our country look weaker or more pathetic: Snowden ObamaCare VA Russia jobs decimated military debt and so much more _E_\nThe now $1.2B ObamaCare website is as bad as ever insurers not getting the proper data. __HTTP__ _E_\nHave a happy successful and healthy New Year! _E_\nCongratulations to the 2016 @ClemsonFB Tigers!Full ceremony: __HTTP__ __HTTP__ _E_\nCongratulations to @AmericansElect for winning a spot on the California 2012 ballot. A major feat! __HTTP__ _E_\n\"See yourself as victorious! That will focus you in the right direction.\" – Trump Never Give Up _E_\nWill be in Alabama tonight. Luther Strange has gained mightily since my endorsement but will be very close. He loves Alabama and so do I! _E_\nMichelle Nunn will be a solid vote for Obama. She supports ObamaCare & opposes 2nd Amendment. Vote for @Perduesenate to change things! _E_\nPresident Obama is losing on so many fronts in fact all fronts that I am concerned he will do something totally irrational. He can't lead! _E_\nWe just have to get tough get smart and get a president willing to stand up for America and stick it to the (cont) __HTTP__ _E_\nMichelle Nunn will be a rubber stamp for Barack Obama. @Perduesenate. GOTV for David this Tuesday! _E_\nI will be in Maryland this afternoon for a major rally. Things are looking good for Tuesday! _E_\nI will be interviewed on @greta at 7:00 P.M. @FoxNews _E_\nRe build the United.States not places that hate our country and everything we stand for! _E_\nThe Obama's Spain vacation cost taxpayers over $476K __HTTP__ They love to spend money. _E_\nI watched the last two minutes of the @dallasmavs game last night I just loved watching them lose. _E_\nNot only does ObamaCare have at least 21 new taxes but it will lead to a tremendous doctor shortfall. 
_E_\nI'm a star maker Adrian has continued to receive many fans in @TrumpTowerNY and @AmandaTMiller is definitely on the map! #CelebApprentice _E_\nChina's new AND ADVANCED currency manipulation is killing the U.S. Help! _E_\nWhitney Houston was a great friend and an amazing talent. We will all miss her and send our prayers to her family. _E_\nWe have millions in our country unemployed yet we are wasting millions arming Syrian 'rebels.' What is wrong with Washington?! _E_\nGood luck to my new friends on your testimony in DC. You are amazing people doing something so important stopping illegal immigration! _E_\nVia @worldnetdaily: JAILED U.S. PASTOR'S WIFE PRAISES TRUMP: 'I hope more people like him will speak out' __HTTP__ _E_\nCrooked Hillary says we must call on Saudi Arabia and other countries to stop funding hate. I am calling on cont'd: __HTTP__ _E_\nI've always been a fan of Steve Jobs especially after watching Apple stock collapse w/out him – but the yacht he built is truly ugly. _E_\nHave to go now to sign a great and job producing deal! Good night. _E_\nHope to see you tomorrow in Trump Tower (5th Ave betw 56 and 57) I'll be signing copies of my book #TimeToGetTough from noon until 2 pm _E_\nNobody will protect our Nation like Donald J. Trump. Our military will be greatly strengthened and our borders will be strong. Illegals out! _E_\nOld Post Office Building in DC will be a world class Trump property. Honored to be doing this historic building Washington will be proud. _E_\nMove slowly carefully and then strike like the fastest animal on the planet! _E_\nThank you to our amazing law enforcement officers! #MAGA __HTTP__ _E_\nAmerican steel & American hands have constructed a 100000 ton message to the world: American MIGHT IS SECOND TO NONE!#USSGeraldRFord #USA __HTTP__ _E_\nHurricane is good luck for Obama again he will buy the election by handing out billions of dollars. _E_\nWe will never cut spending until we actually work off of a budget. 
The Democrats haven't passed one in over 3 years. What a joke. _E_\nThe Better Business Bureau report with an A rating for Trump University. #GOPDebate __HTTP__ __HTTP__ _E_\nRT @DRUDGE_REPORT: RICE ORDERED SPY DOCS ON TRUMP? __HTTP__ _E_\n.@RNC leadership should not be afraid of a government shutdown. They should be afraid of not defunding ObamaCare. _E_\nTrump's Campaign Hat Becomes an Ironic Summer Accessory The New York Times. __HTTP__ _E_\nEven though I have a very biased and unfair judge in the Trump U civil case in San Diego I have thousands of great reviews & will win case! _E_\nWill be speaking with Italy this morning! _E_\nTremendous investment by companies from all over the world being made in America. There has never been anything like it. Now Disney J.P. Morgan Chase and many others. Massive Regulation Reduction and Tax Cuts are making us a powerhouse again. Long way to go! Jobs Jobs Jobs! _E_\nThere's only one candidate who cut medicare and that's Barack Obama. Cut over $700M to move into ObamaCare. _E_\nMy @gretawire interview where I discuss fixing the economy killing Bin Laden the John Edwards trial and fair trade. __HTTP__ _E_\nIsn't the WORLD tired of hearing President Obama say he knew nothing about anything time to take responsibility for all of your mistakes! _E_\nI am happy to see the majority of the GOP candidates agree with me that the tax code must be simplified and the rates dropped. _E_\nWill be doing Fox and Friends in two minutes! _E_\nThe only reason I am critical of the Pinehurst look is because I'm a lover of golf—and that look on TV hurts golf badly. _E_\nVegas' top destination @TrumpLasVegas is a 64 story tower of golden glass __HTTP__ What goes on there stays there! _E_\n.@ThisWeekABC with @GStephanopoulos had fantastic numbers last Sunday Trump interview. Nice! _E_\n#TBT Taking piano lessons from my friend Elton John. __HTTP__ _E_\nJust got a great new selection of ties & shirts @Macys. 
Go buy them now for Father's Day—they're beautiful! _E_\nGoing to Charlotte NC to speak before more than 20000 people on Saturday morning—total sellout crowd—will be great! _E_\nI hear @billmaher really bombed in Springfield people were leaving show way early stupid guy! _E_\nGreat POLL numbers are coming out all over. People don't want another four years of Obama and Crooked Hillary would be even worse. #MAGA _E_\nI'll be on @foxandfriends on Monday at 7:30 AM. _E_\n\"Invincibility lies in the defence the possibility of victory in the attack.\" Sun Tzu _E_\nThey found Jessica in Colorado body was mutilated death to the pervert killer. _E_\nLooking forward to being the special guest at tonight's Dutchess County #GOP dinner to a SOLD OUT crowd. It will be great fun. _E_\nJust landed in the Philippines after a great day of meetings and events in Hanoi Vietnam! __HTTP__ _E_\nRespected Morning Consult poll just out. I lead all Republicans and beat Hillary head to head by a wide margin 45 to 40! _E_\n'Presidential Executive Order on Promoting Agriculture and Rural Prosperity in America'Executive Order:... __HTTP__ _E_\nI received calls from the President of Mexico and the Prime Minister of Canada asking to renegotiate NAFTA rather than terminate. I agreed.. _E_\n'Scandals surround Clinton's gatekeeper at State'#DrainTheSwamp __HTTP__ _E_\nYesterday Obama compared Nelson Mandela to George Washington in Africa. Do you think he really believes it? _E_\nHad a meeting with the terrific @GovPenceIN of Indiana. So excited to campaign in his wonderful state! __HTTP__ _E_\nI will be in Cincinnati Ohio tomorrow night at 7:30pm join me! #OhioVotesEarly #VoteTrumpPence16 Tickets:... __HTTP__ _E_\nI will be on @oreillyfactor at 8:00 P.M. Enjoy! _E_\nMy comments on a larger screen iPhone were in addition to existing unit not a replacement. Screen should be 10% larger than Samsung. _E_\nWhy is the UN condemning @Israel and doing nothing about Syria? What a disgrace. 
_E_\nThe podium in the Oval Office looks odd! Not good but the words will be the key. _E_\nMajority of Independents want Obamacare overturned __HTTP__ The best way to do it is by voting out @BarackObama _E_\nHe made a great contribution to the press @AndrewBreitbart will be missed. _E_\nOur President should stop trying to be an economist to the world and start fighting for our economy. Instead (cont) __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nI am honored that the great men and women of the @Teamsters have created a movement from within called Teamsters for Trump! Thank you. _E_\nI knew Chris Matthews when he was sane and quite honestly wonderful. Now he's gone off the deep end as an Obama surrogate. @hardball_chris _E_\nLooking forward to visiting the Trump Vineyard Estates today in Charlottesville VA for a press conference and the grand opening. _E_\nExclusively @Macys The Donald J. Trump Signature Collection features the best ties & shirts at the best prices. __HTTP__ _E_\n... Is a third party coming? I hope not. _E_\nMy conversation from ON THE RECORD @Gretawire __HTTP__ _E_\n.@GovernorPerry stopped by to say hello. __HTTP__ _E_\nIf @BarackObama had to use the same labor participation he had when he entered office then the unemployment number would be 11.2% _E_\n\"Competitive golf is played mainly on a five and a half inch course... the space between your ears.\" Bobby Jones _E_\nFormer Obama White House economic adviser @Austan_Goolsbee gave his old boss a 'C' on the economy __HTTP__ Pretty generous! _E_\nThe @MissUSA 2012 contestants pose for a picture with me at Trump Tower in New York City __HTTP__ _E_\nIt's Friday. How much money has been wasted on defunct ObamaCare website today? _E_\n.@KevinHart4real joined @woodmank104 @katek104 @K1047 & was asked about his thoughts on @realDonaldTrump #Trump2016 Thanks Kevin so nice! _E_\nThere is no question who will handle the threat of terrorism best as #POTUS. 
#Trump2016 __HTTP__ __HTTP__ _E_\nPresident Obama has a major meeting on the N.Y.C. Ebola outbreak with people flying in from all over the country but decided to play golf! _E_\nFake News is at an all time high. Where is their apology to me for all of the incorrect stories??? _E_\nI will be on @SeanHannity tonight at 10pmE talking about my new book #CrippledAmerica and much more! #MakeAmericaGreatAgain #Trump2016 _E_\nFurther proof that Gang of Eight member Marco Rubio is weak on illegal immigration is Paul Singer's Mr. Amnesty endorsement.Rubs can't win _E_\n.@DanaPerino & @BradThorThank you so much for the wonderful compliment. Working hard! #MAGA __HTTP__ _E_\nFor what is the best choice for each individual is the highest it is possible for him to achieve. Aristotle _E_\nNow the world is looking to China for an economic 'lift' __HTTP__ @BarackObama has ruined our economic hegemony. _E_\nEconomics behind ugly bird killing wind turbines do not work will destroy Scotland's beautiful coastline. (cont) __HTTP__ _E_\n.@aaronschock Aaron it was great to meet you at Trump Tower. Also really good job on television! _E_\nThe Greater Miami area and numerous others are fighting hard to get the Miss Universe Pageant. A decision will be made very soon! _E_\nRT @MichaelCohen212: I have never been to Prague in my life. #fakenews __HTTP__ _E_\nShould not pass bad deal! __HTTP__ _E_\nThe only thing that can stop this corrupt machine is YOU. The only force strong enough to save our country is US.... __HTTP__ _E_\nPeople forget it was Club for Growth that asked me for $1 million. I said no & they went negative. Extortion! __HTTP__ _E_\nI will be on the @colbertlateshow tonight at 11:30 __HTTP__ _E_\nNegotiation is persuasion more than power. Be reasonable and flexible and never let anyone know exactly where you're coming from. _E_\nThank you @TheFix Chris Cillizza. It is a true person of character that can change his opinion & do what is right. 
__HTTP__ _E_\nI think somebody should pick Johnny Football he will be a star. _E_\nAs long as we have faith in each other and confidence in our values then there is no challenge too great for us to conquer! #ALConv2017 __HTTP__ _E_\nLooking forward to receiving the T. Boone Pickens Entrepreneur Award at tomorrow's @AmSpec Robert L. Bartley Gala dinner. _E_\nBeautiful weather all over our great country a perfect day for all Women to March. Get out there now to celebrate the historic milestones and unprecedented economic success and wealth creation that has taken place over the last 12 months. Lowest female unemployment in 18 years! _E_\n.@AlexSalmond Ireland just ended the bird killing wind farm near my great resort on the Atlantic Ocean. The reason would hurt tourism! _E_\n...they do NOTHING for us with North Korea just talk. We will no longer allow this to continue. China could easily solve this problem! _E_\nPhoto from @IvankaTrump of Trump International Golf Links & Hotel Ireland __HTTP__ _E_\nLet's get out of Afghanistan. Our troops are being killed by the Afghanis we train and we waste billions there. Nonsense! Rebuild the USA. _E_\nThe best vision is insight. Malcolm Forbes _E_\nReporter @AlHunt is one boring and low vision guy! _E_\nWe all have the capability to read or sense what's happening with others. It can often give you the edge (cont) __HTTP__ _E_\n.@GiulianaRancic & @nickjonas are co hosting Miss USA 2013 Sunday night at 9 PM ET on NBC. @JonasBrothers will be performing. Tune in! _E_\n.@THEGaryBusey is definitely different. #CelebApprentice _E_\nDemocrats are smiling in D.C. that the Freedom Caucus with the help of Club For Growth and Heritage have saved Planned Parenthood & Ocare! _E_\nGOP Voters Trust Donald Trump to Keep Our Country Safe __HTTP__ _E_\nI am the only potential owner of the @buffalobills who will keep the team in Buffalo where it belongs! 
_E_\nCongratulations to @Yankees Derek Jeter on passing Eddie Murray last night to become the 11th all time @MLB hit leader. _E_\nDoes anyone really believe that President Obama found out about Petraeus immediately after the election? _E_\n... among ABC CBS and NBC in the key news demo of adults.... _E_\nI'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nSleazy Adam Schiff the totally biased Congressman looking into Russia spends all of his time on television pushing the Dem loss excuse! _E_\nPolitical strategist Stuart Stevenswho led Romney down the tubes in what should have been an easy victoryhas terrible political instincts! _E_\nMy interview yesterday with @foxandfriends discussing the failure of the Super Committee and GOP 2012.... __HTTP__ _E_\nBiggest Tax Bill and Tax Cuts in history just passed in the Senate. Now these great Republicans will be going for final passage. Thank you to House and Senate Republicans for your hard work and commitment! _E_\nIn times of tragedy the bonds that sustain us are those of family faith community and country. These bonds are stronger than the forces of hatred and evil and these bonds grow even stronger in the hours of our greatest need. __HTTP__ __HTTP__ _E_\nNBC Wall St Journal Poll of African American voters: 94% @BarackObama 0% @MittRomney.Even worse than Hillary's old numbers. Is that racism? _E_\nI am impressed with how clearly @PaulRyanVP explains the challenges we face and the solutions @MittRomney will bring as President. _E_\nHave a great game today @USArmy and @USNavy I will be watching. We love our U.S. Military. On behalf of an entire Nation THANK YOU for your sacrifice and service! #ArmyNavyGame #USA __HTTP__ _E_\n\"Golfer bids $130000 for round with Donald Trump\" in Scotland for charity __HTTP__ via Evening Express _E_\nVia @Reuters: Donald Trump takes steps toward 2016 presidential run __HTTP__ _E_\nEntrepreneurs: Keep your eyes on your ideals as well as reality. 
Accentuate the positive without being blind to the negative. _E_\nChinese demand is raising the price of oil to$123/Barrel __HTTP__ We need to use our own energy resources. _E_\n2011 #CelebrityApprentice winner @JohnRich and @MarleeMatlin interviewed the final four in this week's episode __HTTP__ _E_\n.@HuffingtonPost is doing very badly. Also very inaccurate stories. Like AOL when will they fail? _E_\nWhy is @BarackObama delaying the sale of F 16 aircraft to Taiwan? Wrong message to send to China. #TimeToGetTough _E_\nI am offering the chance for Barack Obama to redistribute $5M to any charity of his choice. Everyone wins. Take the deal. _E_\nChina is pushing North Korea! _E_\n#WVPrimary #VoteTrump #Trump2016 __HTTP__ __HTTP__ _E_\nObamaCare is a major threat to America's entrepreneurial spirit and competitiveness. Small businesses will b... (cont) __HTTP__ _E_\nAn Iranian nuclear scientist's car exploded in Tehran yesterday lots of problems to come @BarackObama we need real leadership. _E_\nIf our border is not secure we can expect another attack. A country with open borders is open to the terrorists. _E_\nDo not allow our very stupid leaders to sign a deal that keeps us in Afghanistan through 2024 with all costs by U.S.A. MAKE AMERICA GREAT! _E_\nYesterday was Veterans Day. I hope our armed service members felt appropriately honored. This nation loves and respects all of you. _E_\nJust had a very nice meeting with @Reince Priebus and the @GOP. Looking forward to bringing the Party together and it will happen! _E_\nCrooked's top aides were MIRED in massive conflicts of interests at the State Dept. We MUST #DrainTheSwamp __HTTP__ #Debate _E_\nSomeone unknown tweeted incorrectly that I'm for Sen. Mitch @McConnellPress for speaker. I'm supporting him for Senate Majority Leader _E_\nTime Magazine called to say that I was PROBABLY going to be named \"Man (Person) of the Year\" like last year but I would have to agree to an interview and a major photo shoot. 
I said probably is no good and took a pass. Thanks anyway! _E_\nDonald Trump plans return to Iowa __HTTP__ via @KCCINews _E_\n\"Iowa hirings suggest Donald Trump serious about 2016 White House bid\" __HTTP__ via @WashTimes by @SethMcLaughlin1 _E_\n...It's called intellectual property rights something they know nothing about. _E_\nThe ObamaCare website still is not complete. $5 billion and no progress. Scary and sad! _E_\nDavid Brooks of the New York Times is closing in on being the dumbest of them all. He doesn't have a clue. _E_\nThese Tsarnaev brothers did not work alone. They had help and assistance from other cell members. Be vigilant and on the lookout. _E_\nThe tragedy in Newtown really makes you understand how life is so fragile. Must appreciate every minute! _E_\nTrump vows to fight 'epidemic' of human trafficking __HTTP__ _E_\nIt all begins today! I will see you at 11:00 A.M. for the swearing in. THE MOVEMENT CONTINUES THE WORK BEGINS! _E_\nHonor to have been interviewed by the very wonderful @bishopwtjackson in Detroit last week tune in at 9pmE. Enjoy! __HTTP__ _E_\nThe New Black Panthers are back at the same Philly polling station from '08 __HTTP__ Don't let them intimidate you! _E_\nTime to start building in our country with American workers & with American iron aluminum & steel. It is time to... __HTTP__ _E_\n\"To keep your momentum going you must have intrinsic values as well as monetary values. Know when to give back.\" – Think Big _E_\nI will be going to the funeral of my friend Joan Rivers today. I got to know her really well when she became the winner of The Apprentice! _E_\nWouldn't it be nice if our government could build a wall on the border under budget and ahead of schedule?! my @SRQRepublicans speech. _E_\nExcited that @OurCountryPAC's @Amy Kremer has endorsed the Newsmax @iontv debate. The Tea Party Express is a great group. _E_\nWhy is the @GOP congress focusing on amnesty when so many Americans are unemployed? 
_E_\nThank you Andrew Jackson! #POTUS7 #USA __HTTP__ _E_\nHillary Clinton made a speech today using the biggest teleprompter I have ever seen. In fact it wasn't even see through glass it was black _E_\nRT @Reince45: Promise kept. @POTUS exits flawed #ParisAccord to seek better deal for U.S. workers & economy. This WH will always put #Ameri... _E_\nMany people walked out on Madonna's concert when she told them to vote for Obama. Years ago I walked out because the concert was terrible! _E_\nIt was an honor to welcome President Al Sisi of Egypt to the @WhiteHouse as we renew the historic partnership betwe... __HTTP__ _E_\nWatch to see the new cast of @ApprenticeNBC __HTTP__ _E_\nA big day for the U.S. at the United Nations! _E_\nJust out: TRUMP GOP DEBATE 18000000. CLINTON DEMOCRAT DEBATE 6700000. And they were on major network vs. cable! _E_\nBe weak on immigration and ensure Democratic victory. _E_\nEveryone should cancel HBO until they fire low life dummy Bill Maher! Get going now and feel good about yourself! _E_\nRickie Fowler @therealrickiefowler Instagram photos | Websta __HTTP__ via @websta _E_\nEven NY Democrats are avoiding @BarackObama's convention __HTTP__ He is dragging his own party down with him _E_\nGetting ready to take off for Nashua New Hampshire. Big crowd will be there soon. Fun! _E_\n.@cyndilauper Condolences on the passing of your uncle and best wishes. _E_\n...really hard to help but many have lost their homes. Military is now on site and I will be there Tuesday. Wish press would treat fairly! _E_\nJeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately but it was clearly not.... _E_\nI have many great people but also an amazing number of haters and losers responding to my tweets why do these lowlifes follow nothing to do! _E_\nLess than ten days until I keynote @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tix going fast. 
__HTTP__ _E_\nEverybody is arguing whether or not it is a BAN. Call it what you want it is about keeping bad people (with bad intentions) out of country! _E_\nIn the last 2 weeks I had $35M of negative ads against me in Florida & I won in a massive landslide.The establishment should save their $$! _E_\nI really liked everyone at the @WWE Hall of Fame ceremony fantastic people! _E_\nAmazing various celebrities were far harsher than me with political statements but media doesn't care about... __HTTP__ _E_\nRT @realDonaldTrump: I will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_\nPutin is having such a good time. Our President is making him look like the genius of all geniuses. Do not fearwe are a NATION OF POTENTIAL _E_\nRT @EricTrump: Friends: If you live in AL AK AR CO GA MA MN OK TN TX VT or VA get out and VOTE on Tuesday! #Trump2016 __HTTP__ _E_\nHere I am with Whitney Houston at a party at Mar a Lago. __HTTP__ _E_\nThank you! #GOPDebate MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nHere I am with @IvankaTrump and erictrump presenting the WGC @CadillacChamp Trophy to Tiger Woods at... __HTTP__ _E_\nRated Toronto's #1 hotel the 65 story 5 Star @TrumpTO is located in the heart of the city's finest attractions __HTTP__ _E_\nThank you for having me! I enjoyed the tour and spending time with everyone. See you soon. #MAGA __HTTP__ _E_\nDishonest reporters knowingly write lies that I said \"children should not get vaccinated.\" I believe fully (cont) __HTTP__ _E_\nRT @MoskowitzEva: .@BetsyDeVos has the talent commitment and leadership capacity to revitalize our public schools and deliver the promise... _E_\nIf @RepMarkMeadows @Jim_Jordan and @Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_\nHome ownership is at a 19 year low. If you can buy now. You will thank me later. 
_E_\nVia @AmSpec by Jeffrey Lord: \"Donald Trump: America's Entrepreneur\" __HTTP__ Wow thank you to Jeffrey Lord & @AmSpec! _E_\nWhile @BarackObama tries to push gun control __HTTP__ He still has not answered for Project Gun Runner __HTTP__ _E_\nFight is over Mayweather lost big but lets see what judges say! _E_\nIf Obama mentions Mitt's tax returns in tomorrow's debate then Mitt should immediately ask for Obama's college records & applications _E_\nThe White House never looked more beautiful than it did returning last night. Important meetings taking place today. Big tax cuts & reform. _E_\nCongratulations to @DanaPerino on your book going to number one on Amazon. Great book Great job! _E_\nCan someone explain to me how a Chechnyan permanent resident non citizen in our country is planning Jihad while on welfare? _E_\nEd Gillespie worked hard but did not embrace me or what I stand for. Don't forget Republicans won 4 out of 4 House seats and with the economy doing record numbers we will continue to win even bigger than before! _E_\nWatch my live book signing now! __HTTP__ _E_\nJust watched lightweight Marco Rubio lying to a small crowd about my past record. He is not as smart as Cruz and may be an even bigger liar _E_\nThank you @BillKristol. I am going to Make America Great Again! _E_\nAmerican corporations and entrepreneurs are masters of technological and business innovation but the Chinese (cont) __HTTP__ _E_\nClive Davis gave a great eulogy at my friend Whitney Houston's funeral absolutely amazing! _E_\nRT @DonaldJTrumpJr: Thanks New Hampshire!!! #NH #NewHampshire #MAGA __HTTP__ _E_\nWhich team do you think has the edge in this interactive photo experience task assignment? _E_\nVia @EW: \"@CelebApprentice All Stars' first trailer\" __HTTP__ _E_\nEveryone join me tomorrow at 11 AM in Trump Tower atrium. _E_\nGreat Tax Cut rollout today. The lobbyists are storming Capital Hill but the Republicans will hold strong and do what is right for America! 
_E_\nRT @DRUDGE_REPORT: GREAT AGAIN: FEDS ARREST MURDER SUSPECT IN 'FAST AND FURIOUS' SCANDAL... __HTTP__ _E_\nSince the Democrats decided to kill the filibuster they now own it.Republicans should keep the new rule when they're in the majority. _E_\nBroken promises. A broken billion dollar website. ObamaCare can't be fixed. Repeal! _E_\nSerious stuff IRS Commissioner visited White House 157 times far more than Sec. of State or Defense. What a big story this is! _E_\nTaking a photo with my family on the opening day of Trump International Golf Links Scotland __HTTP__ _E_\nThe U.S. is now begging Russia to give back Edward Snowden. In a letter they promised no death penalty for the traitor. No respect! _E_\nSo great to have the endorsement and support of Paul Ryan. We will both be working very hard to Make America Great Again! _E_\nWe should be focused on magnificently clean and healthy air and not distracted by the expensive hoax that is global warming! _E_\nAccording to new WPOST ABC poll Obama has just lost 14 points on public trust with economy _E_\nOn my way to South Carolina. Big Crowd look forward to it! _E_\nDee Dee Sorvino @deedeegop I am betting on Trump _E_\nIf someone made a nasty or controversial statement about me to the president do you really think he would come to my rescue? No chance! _E_\nVia @ABCPolitics by @rickklein: Trump Blasts Romney Bush Says GOP Has 'Nobody Like Trump' __HTTP__ _E_\nI have a judge in the Trump University civil case Gonzalo Curiel (San Diego) who is very unfair. An Obama pick. Totally biased hates Trump _E_\nApple must make the IPhone screen bigger. Losing major market share. _E_\nMany think that the Championship Course at Turnberry home of The Duel In The Sun will be the worlds best after the renovation. _E_\nTo @TigerWoods He is truly a great champion and we were honored to have him at Trump National Doral. 
@DoralResort #Trump _E_\nWhile the Republicans and Democrats in Congress are working hard to come up with a solution to DACA they should be strongly considering a system of Merit Based Immigration so that we will have the people ready willing and able to help all of those companies moving into the USA! _E_\nMUST READ @IBDeditorials: \"President Obama's Amnesty At Any Price\" __HTTP__ Congress Use the Power of Purse! Defund Amnesty! _E_\nMy speech at yesterday's @SteveKingIA @Citizens_United Iowa Freedom Summit __HTTP__ via @FoxNews _E_\nJust the beginning & it is going to get worse. Rates & deductibles are so high nobody is going to be able to use it. __HTTP__ _E_\nToday Obama will give another speech on the economy. Tomorrow our country will still be $17T+ in debt with 18% real unemployment. _E_\nWent to the Yankees game last night with Bill O'Reilly we had a great time watching the Yankees win! _E_\nWow Kasich didn't qualify to run in the state of Pennsylvania not enough signatures. Big problem! _E_\nOn @seanhannity show @FoxNews now. ENJOY! _E_\nJoin me on Wednesday May 25th at the Anaheim Convention Center!#Trump2016 #MAGA Tickets: __HTTP__ __HTTP__ _E_\nThe failing @nytimes reporters don't even call us anymore they just write whatever they want to write making up sources along the way! _E_\nThank you @FLGovScott. __HTTP__ _E_\nI will be interviewed on @foxandfriends at 8:30 A.M. ENJOY! _E_\nBusiness is an art in itself and powerful negotiation skills are one of the techniques necessary to facilitate success. _E_\nPut big game trophy decision on hold until such time as I review all conservation facts. Under study for years. Will update soon with Secretary Zinke. Thank you! _E_\nAs the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Democrats and Russians! _E_\n.@TrumpNewYork on CPW in NYC is the home of the globe that has become an icon in the city. 
#CelebApprentice _E_\n90 stories over midtown New York Trump World Tower's glass curtain wall is a true landmark __HTTP__ _E_\nRealize that an entrepreneur's most important gift to the world is jobs security and well being for others. Midas Touch _E_\nThe #MissUniverse Pageant is the biggest pageant of them all—by far! _E_\nBehind the scenes photo of @Gretawire and I filming an interview __HTTP__ Watch tonight at 10PM ET on @FoxNews. _E_\nWhen the economy is bad @BarackObama wants to raise taxes. When the economy is good @BarackObama wants to raise taxes. Notice a trend? _E_\nIf these scandals happened before the election Obama could not have won. _E_\nRT @billoreilly: Hannity crushing MSNBC at 9. Good for him! Check the No Spin News on __HTTP__ Killing England a huge bests... _E_\nA great honor to visit the 9/11 Memorial Museum with my wife @MELANIATRUMP today. #NewYorkValues __HTTP__ _E_\nI am surprised that Hugo Chavez can keep power in his weak physical condition! _E_\nLeaving Hamburg for Washington D.C. and the WH. Just left China's President Xi where we had an excellent meeting on trade & North Korea. _E_\nHouse of Representatives shouldn't give anything to Obama unless he terminates Obamacare. _E_\nThe money losing @politico is considered by many in the world of politics to be the dumbest and most slanted of the political sites. Losers! _E_\nMichelle Obama likes to be addressed as Your Excellency. __HTTP__ She is an excellent spender of taxpayer money on herself. _E_\nAll new @ApprenticeNBC starts right now! __HTTP__ _E_\nWho should win Celebrity Apprentice on Monday night? Show will be telecast LIVE! _E_\nIf a player wants the privilege of making millions of dollars in the NFLor other leagues he or she should not be allowed to disrespect.... _E_\nIt's a shame the ruling class of Republicans don't attack Obama and the Democrats the way they hit Senators Cruz & Lee. _E_\n.@DavidGregory got thrown off of TV by NBC fired like a dog! 
Now he is on @CNN being nasty to me. Not nice! _E_\nMake sure to catch @history's season finale of \"The Men Who Built America\" on Sun November 11th. Great show. _E_\nThank you Dan I agree! Best wishes. __HTTP__ _E_\nRemember don't believe sources said by the VERY dishonest media. If they don't name the sources the sources don't exist. _E_\nThank you for the incredible support Melania Barron Ivanka Jared Tiffany Don Vanessa Eric and Lara! __HTTP__ _E_\nI don't blame China I blame the incompetence of past Admins for allowing China to take advantage of the U.S. on trade leading up to a point where the U.S. is losing $100's of billions. How can you blame China for taking advantage of people that had no clue? I would've done same! _E_\nWhen @BarackObama is not vacationing he is hosting his top donors in the White House __HTTP__ Always having a good time! _E_\nThere is no way my friend Bob Kraft agreed not to appeal the NFL decision without making a deal to at least get something. We love Tom Brady _E_\nIf a person is #1 at Harvard and comes from Europe or Asia they can't get into the U.S. From Mexico etc. with a criminal record no problem _E_\n\"What you dream about is what you will do. If you cannot even dream of doing big things you will never do anything big in life.\" Think Big _E_\nOur thoughts and prayers are with everyone in the path of California's wildfires. I encourage everyone to heed the advice and orders of local and state officials. THANK YOU to all First Responders for your incredible work! __HTTP__ _E_\nObama has not passed a single budget in 4 years. Democrats don't even vote them in Congress. He has failed to lead! _E_\nRT @DonaldJTrumpJr: Great pic from a friend on @CBPflorida @CustomsBorder who have been helping with #harvey recovery and now with #irma. T... _E_\nThank you Governor @Mike_Pence!Lets MAKE AMERICA SAFE AND GREAT AGAIN with the American people. #AmericaFirst... __HTTP__ _E_\nLeaving for New Hampshire now. 
Will be doing the @TODAYshow there live at 7:00 A.M. New @CBSNews Poll of New Hampshire: Trump 38 Carson 12! _E_\nHuge crowd expected tomorrow night! VT Police say first come first serve. Arrive early! _E_\nI have offered DACA a wonderful deal including a doubling in the number of recipients & a twelve year pathway to citizenship for two reasons: (1) Because the Republicans want to fix a long time terrible problem. (2) To show that Democrats do not want to solve DACA only use it! _E_\nThe #G20Summit was a wonderful success and carried out beautifully by Chancellor Angela Merkel. Thank you! _E_\nRT @DRUDGE_REPORT: Fears of new terror attack after van 'mows down 20 people' on London Bridge... _E_\nTogether we dream of a Korea that is free a peninsula that is safe and families that are reunited once again! __HTTP__ _E_\nA NEW ERA IN AMERICAN ENERGY! #MadeInTheUSAWatch here: __HTTP__ __HTTP__ _E_\nBe sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight! _E_\n#ICYMI: Governor @mike_pence and I were in Valley Forge Pennsylvania today. You can watch it here:... __HTTP__ _E_\nVia @AP: \"Donald Ivanka Trump say DC's Old Post Office Pavilion will be 1 of country's finest hotels\" __HTTP__ _E_\nDonald Trump bids to buy the Oreo Double Stuf Racing League. Check it out: __HTTP__ _E_\nDo not go back into Iraq unless they agree in a signed formal instrument to give the U.S. 50% of their oil reserves.Make the deal dummies! _E_\nWhat is the standard for which you want to be known? Identify that standard and follow it. _E_\nRecently opened @TrumpToronto it's beautiful and here is a video of the ribbon cutting ceremony.. __HTTP__ _E_\nVia @beforeitsnews: \"WATCH: See How Trump Just Torched Obama Biden Kerry For Snubbing Paris Anti Terror March\" __HTTP__ _E_\nCrooked Hillary is wheeling out one of the least productive senators in the U.S. Senate goofy Elizabeth Warren who lied on heritage. 
_E_\nEvery day Mexico continues to hold Sgt. Tahmooressi is an insult to our country. _E_\nOf course @hardball_chris attacked 'birthers' in praising @CondoleezzaRice's speech. Chris has completely lost it. _E_\nHonor of a lifetime to meet His Holiness Pope Francis. I leave the Vatican more determined than ever to pursue PEAC... __HTTP__ _E_\nRead about my @LibertyU speech in @jameshohmann's @politico Morning Score __HTTP__ _E_\n\"Don't find fault. Find a remedy.\" – Henry Ford _E_\nObamaCare premiums are going up up up just as I have been predicting for two years. ObamaCare is OWNED by the Democrats and it is a disaster. But do not worry. Even though the Dems want to Obstruct we will Repeal & Replace right after Tax Cuts! _E_\n\"Trump hails liberation of Raqqa as critical breakthrough in anti ISIS campaign\" __HTTP__ _E_\nI am pleased to inform you that I have just named General/Secretary John F Kelly as White House Chief of Staff. He is a Great American.... _E_\nBig news just out NEW @CNN POLL TRUMP 39 and leads in every major category. Likeability way up. CRUZ 18 CARSON 10 RUBIO 10 _E_\n'Must Act Immediately': Clinton Charity Lawyer Told Execs They Were Breaking The Law __HTTP__ _E_\nWatch me tonight at 9PM ET on @CNN full hour. @Piersmorgan won @ApprenticeNBC before taking over Larry King's slot should be interesting. _E_\n.@BarbaraJWalters Barbara—get better fast & stay healthy forever. _E_\nDespite all the statements to the contrary Obama's policies will increase taxes on everyone __HTTP__ Enjoy! _E_\nAnother great charity that the $5M could go to just a recommendation to the Pres. the Wounded Warriors represented so well by @TraceAdkins _E_\nCrooked Hillary Clintons foreign interventions unleashed ISIS in Syria Iraq and Libya. She is reckless and dangerous! _E_\nIt's hard to believe that we are rationing gas in NYC. OPEC is laughing all the way to the bank. 
_E_\nI will be asking for a major investigation into VOTER FRAUD including those registered to vote in two states those who are illegal and.... _E_\nA big part of the country even the southern states is under massive attack from snow and freezing cold. Global warming anyone? _E_\n\"Image is important and speaks more than the words or fine print that goes along with the product.\" – Midas Touch _E_\nIran is rapidly taking over more and more of Iraq even after the U.S. has squandered three trillion dollars there. Obvious long ago! _E_\nLast night's live show was so much fun. Congrats to the entire cast they are all winners! From beginning over $13 million for charity. _E_\nVia @ConroeCourier by @StephenGreen91:\"Trump talks 2016 run jobs at @TXPatriotsPAC\" __HTTP__ _E_\nICYMI via @PageSix by @Mohris: \"Donald Trump honored at Marine Corps charity gala\" __HTTP__ _E_\nThank you #Biloxi #Mississippi! Remember this night & spread the word to get out & #VoteTrump2016! __HTTP__ _E_\nDemocrat Jon Ossoff would be a disaster in Congress. VERY weak on crime and illegal immigration bad for jobs and wants higher taxes. Say NO _E_\nStaff at Trump Park Avenue disliked A Rod to put it mildly The staff at Trump World Tower loves Derek Jeter. _E_\nKarl Rove is a total loser. Money given to him might as well be thrown down the drain. _E_\nBy the way if Russia was working so hard on the 2016 Election it all took place during the Obama Admin. Why didn't they stop them? _E_\nThanks. __HTTP__ _E_\nIran's threats are no excuse for the 9 month high price of oil. OPEC is ripping us off while @BarackObama watches. __HTTP__ _E_\nEntrepreneurs: Success is good. Success with significance is even better. Work on what you will be proud to be associated with. _E_\nMy thoughts and prayers are with those affected by the tragic storms and tornadoes in the Southeastern United States. Stay safe! _E_\nEntrepreneurs: Having a product requires something very important you have to think about the market. 
Do your due diligence. _E_\nI just saw my new tie & shirt collection—it's fantastic—unbelievable look. Go to Macy's now to buy! _E_\nChina owes us money.... __HTTP__ #trumpvlog _E_\nNow AP is banning the term illegal immigrants What should we call them? 'Americans'?! This country's political press is amazing! _E_\n.@HillaryClinton's Careless Use Of A Secret Server Put National Security At Risk: __HTTP__ #VPDebate#BigLeagueTruth _E_\nJust landed in D.C. __HTTP__ _E_\nGlad to see no charges against Greg Kelly. His accusers' charges never made sense! _E_\nWill go back on for a final question now! _E_\nWe are going to bring steel and manufacturing back to Indiana! _E_\nObamacare is far toooo expensive far toooo complicated (thousands of pages) and most importantly doesn't work. WE CAN DO MUCH BETTER! _E_\nBe yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_\nWhy didn't the writer of the twelve year old article in People Magazine mention the incident in her story. Because it did not happen! _E_\nThank you for a wonderful evening in Washington D.C. #Inauguration __HTTP__ _E_\nLeightweight @Lord Sugar virtually begged my reps to have me stop mocking him. Every time this dope goes on Apprentice I make money too easy _E_\nThank you Wilkes Barre Pennsylvania! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nMore of your questions answered in today's video at __HTTP__ here is my appearance on Neil Cavuto __HTTP__ _E_\nChurches in Texas should be entitled to reimbursement from FEMA Relief Funds for helping victims of Hurricane Harvey (just like others). _E_\n.@Lord_Sugar If you think ugly windmills are good for Scotland you are an even worse businessman than I thought... _E_\nMeeting with African American Pastors at Trump Tower was amazing. Wonderful news conference followed. Now off to Georgia for big speech! _E_\nYou have enemies? Good. That means you've stood up for something sometime in your life. 
Winston Churchill _E_\nOccupy Wall Street is at it again go out and get a job. It's actually easier work and far more rewarding. _E_\nPerforming live on the Miss Universe Pageant from the Mandalay Bay Resort & Casino will be Telemundo Orianthi John Legend and The Roots. _E_\nSouth Korea must in some form pay for our help the U.S. must stop being stupid! _E_\n...and people like Ms. Heyer. Such a disgusting lie. He just can't forget his election trouncing.The people of South Carolina will remember! _E_\nDummy @mcuban made up a story about a visit to Mar a Lago last night on Leno. It never happened—I don't talk that way. _E_\nRT @KellyannePolls: more media #polls showing @realDonaldTrump ahead in states Pres Obama won twice. __HTTP__ _E_\n9 million fewer people voted for Obama this election than last & yet the Republicans lost—do you think they might be doing something wrong! _E_\nReceiving @AmericanCancer Lifetime Achievement Award & chairing @FollowLola debut @CarnegieHall on Jan.19 __HTTP__ _E_\nI'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\n.@JonahNRO You stated that I started \"relentlessly tweeting like a 14 year old girl...\" Horrible insult to women. Resign now or later! _E_\nThe S&P are losers. They did this for personal publicity in order to straighten out their terrible reputatio... (cont) __HTTP__ _E_\nWill be on @megynkelly tonight at 9:00. WILL BE TALKING ABOUT EVERYTHING! _E_\nEbola patient Duncan lied on his exit papers by saying he never came into contact with a person with Ebola. He knew he did and person died. _E_\nAll the contestants have arrived to compete in Trump Miss Universe Pageant in Las Vegas. Today's welcoming ceremony will be terrific! _E_\n.@MittRomney must ask for Obama's college records & applications why is he not doing this? 
_E_\nIf Americans understood just how many hidden government fees and taxes are absorbed into the prices of the (cont) __HTTP__ _E_\nThere is an incredible spirit of optimism sweeping the country right now—we're bringing back the JOBS! __HTTP__ _E_\nGraydon Carter is laughing at the stupidity of Chuck Townsend on his contract renewal even he doesn't believe it! @CondeNastCorp _E_\nThe @nytimes states today that DJT believes more countries should acquire nuclear weapons. How dishonest are they. I never said this! _E_\nBest Apprentice episode EVER tonight at 8:00. _E_\nExperience knowledge and prescience are a formidable combination of powers. Do not underestimate them. Think Like a Champion _E_\nNBC news is #FakeNews and more dishonest than even CNN. They are a disgrace to good reporting. No wonder their news ratings are way down! _E_\nThank you Bobby Bowden for the intro tonight and your support! I hope I can do as well for Florida as you have done! __HTTP__ _E_\n.@_KatherineWebb with some of my memorabilia. __HTTP__ _E_\nRT @EricTrump: Thank you to @GolfDigest for this incredible feature! Golfer in Chief @RealDonaldTrump __HTTP__ __HTTP__ _E_\nWas there another loan that Ted Cruz FORGOT to file. Goldman Sachs owns him he will do anything they demand. Not much of a reformer! _E_\nRick Perry a good man a great family and a patriot. _E_\n.@ForbesInspector 5 Star & @TripAdvisor #1 Luxury Hotel @TrumpToronto offers style luxury & impeccable service __HTTP__ _E_\nWill be on Fox & Friends at 7.00 Enjoy! _E_\n\"Whether you realize it or not your brand can be many times more valuable than your business.\" – Midas Touch _E_\nI'll be on @foxandfriends at 7:30 AM Monday _E_\n.@CNN just doesn't get it and that's why their ratings are so low and getting worse. Boring anti Trump panelists mostly losers in life! _E_\nRick Perry failed at the border. Now he is critical of me. He needs a new pair of glasses to see the crimes committed by illegal immigrants. 
_E_\nWe _E_\nRT @IvankaTrump: Thank you Angie Phillips for inviting me to tour your plant Middletown Tube Works. #Ohio __HTTP__ _E_\nI will be speaking at the NRA event today in Nashville. Many friends will be there. _E_\nI guess @BillMaher saw my ratings on the @Late_Show the other night where Letterman beat Leno. Bill you are no Letterman. _E_\nFrom my first day in office we've taken swift action to lift the crushing restrictions on American energy. Remarks... __HTTP__ _E_\nI explained to the President of China that a trade deal with the U.S. will be far better for them if they solve the North Korean problem! _E_\nIf Scotland would have gone independent predicated on $100 $150 oil they would now be bust! _E_\n\"ACU ANNOUNCES DONALD TRUMP TO ADDRESS CPAC 2013\" __HTTP__ via @CPACnews _E_\nBriarcliff Manor should get a better town manager. Philip Zegarelli has no clue—bad roads a total puppet of the mayor? @westchestergov _E_\nAmazing that Crooked Hillary can do a hit ad on me concerning women when her husband was the WORST abuser of woman in U.S. political history _E_\nWill be signing the biggest ever Tax Cut and Reform Bill in 30 minutes in Oval Office. Will also be signing a much needed 4 billion dollar missile defense bill. _E_\nAutism Speaks' Bob and Suzanne Wright will address the Pontifical Council on Health Care Workers at the Vatican in Rome. November 20 22 _E_\nMarco Rubio is totally weak on illegal immigration & in favor of easy amnesty. A lightweight choker bad for #USA! _E_\nVia @WashTimes by @dsherfinski __HTTP__ _E_\nPresident Obama strongly considering a plan to bring non U.S. citizens with Ebola to the United States for treatment. Now I know he's nuts! 
_E_\nRT @DanScavino: 'Trump as Commander in Chief Making the Hard Decisions' by LTG (Ret) Kellogg a highly decorated Vietnam War Vet: __HTTP__ _E_\nHonolulu's best @TrumpWaikiki features a dozen distinct tropically decorated Hawaii hotel rooms and suite layouts __HTTP__ _E_\nThe Tea Party delivered the House for @GOP so they could be fiscally responsible. Instead they have been irresponsible! _E_\n.@JebBush had a tiny 300 person crowd at Senator Tim Scott's forum. I had thousands and they had real passion! __HTTP__ _E_\nThank you Arkansas! #Trump2016#SuperTuesday _E_\nI will take full credit for Mitt Romney dropping out of the race—looks like he won't be endorsing Trump any time soon. _E_\nRT @foxandfriends: FOX NEWS ALERT: U.S. flexes its defense muscles destroys incoming test missile off coast of Alaska __HTTP__ _E_\nVia @todayshow: Trump: Attorney general behind lawsuit a 'total lightweight' __HTTP__ _E_\nJoin me in Colorado Springs Colorado tomorrow at 1:00pm! #MAGA Tickets: __HTTP__ _E_\nHappy Lá Fheile Phadraig to all of my great Irish friends! _E_\nWill CNN send its cameras to the border to show the massive unreported crisis now unfolding or are they worried it will hurt Hillary? _E_\nHow can General Martin Dempsey tell Obama that delaying the Syria bombardment will have no consequences? He is no Patton or MacArthur. _E_\nGreat to see @RandPaul looking well and back on the Senate floor. He will help us with TAX CUTS and REFORM! _E_\nThank you America! #Trump2016 __HTTP__ __HTTP__ _E_\nTune in at __HTTP__ and get the word out #BigLeagueTruth #Debate Help us spread the TRUTH stop the... __HTTP__ _E_\nWhile Bernie has totally given up on his fight for the people we welcome all voters who want a better future for our workers. _E_\nSally Yates made the fake media extremely unhappy today she said nothing but old news! 
_E_\nIf only the morons @AP were as concerned with Obama's inconsistent statements on the Embassy attacks as they are (cont) __HTTP__ _E_\n#AmericaFirst! __HTTP__ _E_\nThank you to Joe Passov (Travelin' Joe) of Golf Magazine for the great article... __HTTP__ __HTTP__ _E_\nPhilly FOP Chief On Presidential Endorsement: Clinton 'Blew The Police Off' __HTTP__ _E_\nCan't wait to meet patriotic small business owners next week in Sarasota and Tampa! Hey @BarackObama We Did Build It! _E_\nSo what did you think of my decision? What would you have done? #CelebApprentice _E_\nHappy Birthday @IvankaTrump! You are an amazing daughter! _E_\nOur views trump the rest for the #Thanksgiving #MacysParade. Stay @TrumpNewYork for exclusive parade access __HTTP__ _E_\nSee Charles Gasparino's article in today's NYPost about Eric Schneiderman's witch hunt against Republicans __HTTP__ _E_\nTime for Sebelius to be fired. She has admitted that the Administration did not vet the ObamaCare website __HTTP__ _E_\nThe Democrats sent a very political and long response memo which they knew because of sources and methods (and more) would have to be heavily redacted whereupon they would blame the White House for lack of transparency. Told them to re do and send back in proper form! _E_\nMy ties and shirts are doing very big numbers @Macy's beyond my wildest thoughts! Thanks @GoAngelo and the rest of the losers for mentions! _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nSyrian ceasefire seems to be holding. Many lives can be saved. Came out of meeting. Good! _E_\nThank you to Matt Boyle @BreitbartNews for analytical & well written piece on sleazebag blogger @mckaycoppins & irrelevant @BuzzFeed _E_\n27 days until America's greatest test since our founding. In this election we decide whether we become great again. _E_\n...didn't do it so now we have a big deal with Dems holding them up (as usual) on Debt Ceiling approval. 
Could have been so easy now a mess! _E_\nI will be working late into the evening closing a big real estate deal—soon to be announced. Happy Easter and/or Holiday to all. _E_\nLet Pete Rose into the Hall of Fame now 35 years is enough! _E_\nGovernor John Kasich of the GREAT GREAT GREAT State of Ohio called to congratulate me on the win. The people of Ohio were incredible! _E_\nVia @limbaugh: \"Trump Doubled Down and It Worked\" __HTTP__ _E_\nYou've got something unique to offer find out what it is. Ask yourself: What can I provide that does not yet exist? Innovation can follow.. _E_\nWelcome to the new ObamaCare reality – Doctor spent 2 hours on hold w/insurance company to get approval for surgery __HTTP__ _E_\nCapital isn't scarce vision is. Sam Walton _E_\nI wonder how @JoeBiden feels after last night's love fest between Obama and Hilary on @60Minutes. Can't be too happy. _E_\n.@MileyCyrus – don't worry about Liam. You can do much better and you have plenty of time—remain strong! _E_\n.@BrandenRoderick returns in All Star @ApprenticeNBC 2001 Playmate of the Year is a determined competitor. She is terrific! _E_\nI will be interviewed at 7:00 A.M. on @foxandfriends Enjoy. _E_\nMy new book #TimeToGetTough is the best present of the holiday season. A great gift for anyone who cares about this country. _E_\nWill be interviewed by @ainsleyearhardt on @foxandfriends Enjoy! _E_\nThank you. __HTTP__ _E_\n.@DeeSnider @StephenBaldwin7 and the rest of your favorites are back! All Star @ApprenticeNBC premieres Sunday... __HTTP__ _E_\nJeb's policies in Florida helped lead to its almost total collapse. Right after he left he went to work for Lehman Brothers—wow! _E_\nBe sure to watch my interview on @Gretawire tonight! _E_\nRT @Carl_C_Icahn: 2/2 How many of our presidents even our great presidents would have handled the antics that went on in that auditorium... 
_E_\n\"Intellectuals solve problems geniuses prevent them.\" – Albert Einstein _E_\nI laugh when I see Marco Rubio and Jeb Bush pretending to love each other with each talking of their great friendship. Typical phony pols _E_\nI received such a nice letter today from someone who took refuge in Trump Tower during Sandy. It was my pleasure to help. _E_\n... Also if they're at home who the hell knows what they're doing (a second job maybe). _E_\nNo matter what happens in the election @davidaxelrod deserves a lot of credit. He has kept Obama in it even with his terrible record. _E_\nMaybe Derek Jeter should ask A Rod about renting his apartment next year. Very soon A Rod won't need a place in NYC. _E_\nUL has lost all credibility under Joe McQuaid w circulation dropping to record lows. They aren't worthy of representing the great people NH. _E_\nMet with @RepCummings today at the @WhiteHouse. Great discussion! _E_\nTwo years ago I told everybody to start looking & buying houses—I hope you listened! (but there is still time). _E_\nHAPPY THANKSGIVING your Country is starting to do really well. Jobs coming back highest Stock Market EVER Military getting really strong we will build the WALL V.A. taking care of our Vets great Supreme Court Justice RECORD CUT IN REGS lowest unemployment in 17 years....! _E_\nFor the sake of transparency @BarackObama should release all his college applications and transcripts both from Occidental and Columbia. _E_\n#ArizonaPrimary message from @IvankaTrump! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThis is a terrible deal for the country and an embarrassment for Republicans! _E_\nCrooked Hillary has been fighting ISIS or whatever she has been doing for years. Now she has new ideas. It is time for change. _E_\nNew poll WOW 53% say President Obama is not honest & trustworthy. What took them so long. 
Go back and look at his house purchase in Chicago _E_\nLots of people are asking whether or not I should have run for President—stay tuned for the answer. _E_\nDo you think I will get credit for keeping Ford in U.S. Who cares my supporters know the truth. Think what can be done as president! _E_\nThis will be the biggest TAX CUT in the history of our country and we need it! #TaxReform Read more: __HTTP__ __HTTP__ _E_\n\"Sure the home field is an advantage but so is having a lot of talent.\" @DanMarino _E_\nWow Trump International Hotel & Tower Toronto was just ranked #1 out of 138 hotels in Toronto! @TrumpToronto _E_\nThe Clintons spend millions on negative ads on me & I can't tell the truth about her husband? Don't feel sorry for crooked Hillary! _E_\nMy thoughts on The O'Reilly Factor and more here... __HTTP__ _E_\nVery proud to announce that Mar a Lago was awarded top Historic building in the state by the illustrious (cont) __HTTP__ _E_\nA letter written to one of my many critics! __HTTP__ _E_\nN.Y.C. has the worst Mayor in the United States. I hate watching what is happening with the dirty streets the homeless and crime! Disgrace _E_\nWhile Jon Stewart is a joke not very bright and totally overrated some losers and haters will miss him and his dumb clown humor. Too bad! _E_\nJoin me in Phoenix Arizona tomorrow at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_\nSo nice thank you! __HTTP__ _E_\nThousands of fans have been sending letters to Trump Tower in anticipation of @CelebApprentice. Really good show. _E_\nMorning Consult poll: Trump Leads __HTTP__ _E_\n#trumpvlog My thoughts on the State of the Union address Apple and a great @WSJ article.... __HTTP__ _E_\nOutrageous @BarackObama has spent over $2.7B on implementing @ObamaCare since the oral arguments at SCOTUS __HTTP__ _E_\nYou mean the fact that my father left me some money (as a good father will) and I multiplied it many many times to over $10 billion is bad? _E_\nDon't be fooled. 
In 2008 @BarackObama promised immigration reform in his 1st yr of his 1st term. Now promising (cont) __HTTP__ _E_\nI really like Chelsea Clinton an amazing young woman. She got the best of both parents. (@IvankaTrump agrees) _E_\nTrade between China and North Korea grew almost 40% in the first quarter. So much for China working with us but we had to give it a try! _E_\nWhen Obama took office in 2009 employer provided premiums cost $13375. Today they are $18142. Thanks Obama. _E_\nWatch my interview with @ericbolling on @FoxNews today at 11:30AM ET _E_\n#sweepstweet @johnrich and @marleematlin were on #CelebrityApprentice—and they're back! _E_\nCheck out the last webisode in our 3 part series featuring me with Serta. Which one was your favorite? www.youtube.com/user/mattressserta _E_\nRemember Republicans are 5 0 in Congressional Races this year. The media refuses to mention this. I said Gillespie and Moore would lose (for very different reasons) and they did. I also predicted \"I\" would win. Republicans will do well in 2018 very well! @foxandfriends _E_\nThe Carson story is either a total fabrication or if true even worse trying to hit mother over the head with a hammer or stabbing friend! _E_\nI'm glad President Obama followed my lead and lowered the flags half staff. It's about time! _E_\nRE: FB Vanity URLs: SF Chronicle David Beckham was one of the first along with Britney Spears & Donald Trump. __HTTP__ _E_\n__HTTP__ _E_\nFoxNewsInsider with comments on my speech at CPAC in Washington DC __HTTP__ _E_\nIt was a great honor to be with President @EmmanuelMacron of France this afternoon with his delegation. Great bilateral meeting! #UNGA __HTTP__ _E_\nJoin me in Cedar Rapids Iowa tomorrow at 7:00pm! #MAGA __HTTP__ __HTTP__ _E_\nGreat poll numbers out of @UMassAmherst. Thank you! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nStrong debate by @PerdueSenate. No question he won. We need more business leaders with bold vision to fix Washington. 
#GASen _E_\nHad dinner this week at @MEGUNYC (at Trump World Tower) opposite the United Nations—fantastic food! 212.964.7777 _E_\nMassive crowds expected in Mississippi tomorrow night. Look forward to it! 2015 IN PHOTOS: __HTTP__ __HTTP__ _E_\nObama once again just missed a self imposed deadline with Iran. Our leadership is weak & ineffective. Double the sanctions! _E_\nCongrats @LindseyGrahamSC. You just got 4 points in your home state of SC—far better than zero nationally. You're only 26 pts behind me. _E_\n#CrookedHillary = Obama's third term which would be terrible news for our economic growth seen below. __HTTP__ _E_\nJust arrived in Syracuse NY. Big crowd great place! We will bring back the desperately needed jobs. #NYPrimary __HTTP__ _E_\nYou get what you vote for. 21% of small business owners planning to cut their workforce in 2013 __HTTP__ _E_\nWatched Crooked Hillary Clinton and Tim Kaine on 60 Minutes. No way they are going to fix America's problems. ISIS & all others laughing! _E_\nYou have all been waiting the response has been amazing! Watch my announcement now press release to follow at 12:15. __HTTP__ _E_\nSpoke with Governor @PatMcCroryNC of North Carolina today. He is doing a tremendous job under tough circumstances. _E_\nMy thoughts on last night's Celebrity Apprentice __HTTP__ also an observation I made recently __HTTP__ _E_\nToday I hosted an immigration roundtable ahead of two votes taking place in Congress tomorrow. Watch and read more... __HTTP__ _E_\nObamacare will bankrupt our country and lead to socialized medicine. We must all focus now on electing @MittRomney this November. _E_\nDon't talk about Rolling Stone Magazine but most importantly don't buy it. This degenerate killed and maimed so many wonderful people! _E_\n\"Home Prices Reach New All Time Highs in August\" Read more: __HTTP__ __HTTP__ _E_\nWhich team is your favorite? _E_\n.@jamieaydt Happy Birthday Jamie! _E_\nOn my way to Cedar Falls Iowa now. 
Will be great I love the people of Iowa! _E_\nRemember Russia still has Snowden. When are we going to bring that piece of human garbage back home to stand trial? He caused great damage! _E_\n.@TigerWoods is playing like his old self in the Farmers Insurance Open. He will have a great year. _E_\n...We cannot keep FEMA the Military & the First Responders who have been amazing (under the most difficult circumstances) in P.R. forever! _E_\nRepublicans seem intent on negotiating against themselves. Many senior Senators are doing Obama's bidding. Can't win this way. _E_\nReally dumb @CheriJacobus. Begged my people for a job. Turned her down twice and she went hostile. Major loser zero credibility! _E_\nThe #SOTU speech is really boring slow lethargic very hard to watch! _E_\nTomorrow!Las Vegas NV 11a: __HTTP__ CO 4p: __HTTP__ NM 7p: __HTTP__ _E_\nGreat to see Tony La Russa manage one last game last night. Congratulations to the National League on winning the @MLB All Star Game. _E_\nChina is worried. The polls are trending for @MittRomney. They won't be able to steal from us anymore. _E_\nDeportations are \"plummeting\" __HTTP__ while Obama continues to grant amnesty. _E_\nGreat reporting by @foxandfriends and so many others. Thank you! _E_\nIt's record cold all over the country and world where the hell is global warming we need some fast! _E_\nUnbelievable how he gets away with it: @BarackObama is flying around on Air Force One laughing at everybod... (cont) __HTTP__ _E_\nJuly U.S. construction had biggest drop in 12months. Bad indicator on economic numbers for rest of the year. _E_\nI am so honored by all the great NY State Repubs who came to my office called & wrote for me to run for Governor. If I do I will win. _E_\nCongress must defund ObamaCare. It is destroying Medicare and breaking promises to our Seniors including veterans. 
_E_\nVia @dallasnews' @neighborsgo by Heather Noel: Shelton School graduate receives handwritten note from Donald Trump __HTTP__ _E_\nATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_\n.@TrumpChicago's river lake and skyline views in each of its deluxe 5 Star guestrooms __HTTP__ _E_\nThank you for all of the nice compliments and reviews on the State of the Union speech. 45.6 million people watched the highest number in history. @FoxNews beat every other Network for the first time ever with 11.7 million people tuning in. Delivered from the heart! _E_\nIf I were president Sgt. Andrew Tahmooressi would be let out of jail with one phone call. If notMexico would pay a price like never before! _E_\nMorning Joe Panel is stealing many of my statements and ideas to better America without giving credit the story of my life! _E_\nObama & the Democrats want this shutdown. They think it helps their electoral prospects for 2014. Don't believe! _E_\nAnybody who watched all of Ted Cruz's far too long rambling overly flamboyant speech last nite would say that was his Howard Dean moment! _E_\nVia @wbtwnews13 by @elizabethk_wbtw: \"Donald Trump will deliver keynote address to the SC Tea Party Convention\" __HTTP__ _E_\nOur athletes in the Olympics are proving once again to be the greatest competitors in the world. Makes us proud to be Americans. _E_\nI will be releasing the full interview with a guy named Baxter @antbaxter only to show the bias and stupidity of him and @BBCWorld. Clowns! _E_\n.@LaToyaJackson & @Omarosa are not likely to become friends –ever! #CelebApprentice _E_\nSorry for such silence—spent weekend at closing of Ritz Carlton in Jupiter Florida—just bought it will be great! _E_\nSo how and why are they so sure about hacking if they never even requested an examination of the computer servers? What is going on? _E_\nHonoring the men and women who made the ultimate sacrifice in service to America. 
Home of the free because of the brave. #MemorialDay _E_\n...Based on that the Military has hit ISIS much harder over the last two days. They will pay a big price for every attack on us! _E_\nI am thrilled to nominate Dr. @RealBenCarson as our next Secretary of the US Dept. of Housing and Urban Development... __HTTP__ _E_\nSenator @lisamurkowski of the Great State of Alaska really let the Republicans and our country down yesterday. Too bad! _E_\nCongratulations to @ScottKWalker of Wisconsin a great victory. A smart and tough guy. Great going. _E_\nA general is just as good or just as bad as the troops under his command make him. General Douglas MacArthur _E_\n\"Take calculated risks. That is quite different from being rash.\" George S. Patton _E_\nNew Reuters Poll just came out and has me at 32% highest number yet.The silent majority is back and we will MAKE AMERICA GREAT AGAIN! _E_\nPervert alert serial sexter @repweiner is polling to test the waters for NYC political run. __HTTP__ _E_\nRT @FLOTUS: Had a wonderful visit from @JBA_NAFW children today at the @whitehouse! #WhiteHouseChristmas __HTTP__ _E_\nThe rescue icebreaker trying to free the ship of the GLOBAL WARMING scientists has turned back the ice is massive (a record). IRONIC! _E_\nA government shutdown will be devastating to our military...something the Dems care very little about! _E_\nNo surprise Saudis turned down spot on UN Security Council. They don't want responsibility. Just have us do their heavy lifting. _E_\nThe brand new Blue Monster Golf Course at Trump National Doral is doing fantastic business. Also the new driving range is open at night! _E_\nThis will be a big week for Infrastructure. After so stupidly spending $7 trillion in the Middle East it is now time to start investing in OUR Country! _E_\nWith the run on our dollar about to take place commodity prices will rise. Gold silver & timber will spike also certain real estate. 
_E_\nSomeone should look into who paid for the small organized rallies yesterday. The election is over! _E_\nI am on my way! See you all soon! __HTTP__ _E_\nWow great news from Wisconsin. Just made two speeches there with a big one coming tonight. Thank you! __HTTP__ _E_\nRT @SpeakerRyan: For individuals and families the final Tax Cuts & Jobs Act:✔lowers individual taxes✔nearly doubles the standard deducti... _E_\nI had a great time answering your questions in the latest #AskTheDonald. Watch and see if your question made it in __HTTP__ _E_\nCongrats to Roger Clemens he showed great courage. This case never should have been brought to trial. Andy Pettitte did the right thing. _E_\nPress conference at the opening of the @GaryPlayer Villa at @TrumpDoral . __HTTP__ _E_\n.@FoxNews is MUCH more important in the United States than CNN but outside of the U.S. CNN International is still a major source of (Fake) news and they represent our Nation to the WORLD very poorly. The outside world does not see the truth from them! _E_\nInteresting...the last time a Democrat succeeded a two term Democratic pres. was in 1836 when Martin Van Buren succeeded Andrew Jackson. _E_\nThank you Louisiana! #Trump2016 __HTTP__ _E_\nPeople having a great time in the Trump Tower atrium unlike others I stayed open. __HTTP__ _E_\nWhat has happened in Orlando is just the beginning. Our leadership is weak and ineffective. I called it and asked for the ban. Must be tough _E_\nThe meeting with Republican Senators yesterday outside of Flake and Corker was a love fest with standing ovations and great ideas for USA! _E_\nNot so smart after all ... Man with name on Duke law library must pay me legal fees after Trump trial victory. _E_\nI'm watching Knicks game I'd bet all of those guys with the terrible tattoos wish they never got them too bad too late! _E_\nMy @gretawire interview discussing @BarackObama's economic failures attack on capitalism and playing class warfare. 
__HTTP__ _E_\nStatement on John McCain __HTTP__ _E_\n.... Do I get the credit for this? Thank you! __HTTP__ _E_\nI agreed to take the worst spot at CPAC because nobody else wanted it and it was the only time I could be there it was great fun! _E_\nOur country needs to reestablish the work ethic. In NY welfare pays better than jobs __HTTP__ Zero incentive. _E_\nTo all @MittRomney supporters make sure you have taken advantage of early voting now so you can GOTV on election day. _E_\nThe UK is seriously thinking about halting wind turbine subsidies. Good news killing country. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nMy thoughts are with all those observing Yom Kippur the holiest day of the Jewish year. __HTTP__ _E_\nJoin me LIVE on my Facebook page in St. Augustine Florida! Lets #DrainTheSwamp & MAKE AMERICA GREAT AGAIN!... __HTTP__ _E_\nThere will be NO change to your 401(k). This has always been a great and popular middle class tax break that works and it stays! _E_\nRT @WhiteHouse: President Trump proclaims today as #WorldAIDSDay: __HTTP__ __HTTP__ _E_\nWHO IS GOING TO GET IRAQ'S OIL??????? _E_\nTogether we will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nWe are making great progress with healthcare. ObamaCare is imploding and will only get worse. Republicans coming together to get job done! _E_\nIt was my great honor to join our wonderful Veterans at AMVETS Post 44 in Youngstown Ohio this evening. A grateful nation salutes you! __HTTP__ _E_\n#WeeklyAddress __HTTP__ __HTTP__ _E_\n.@NicolleDWallace Your father is a brilliant man with wonderful sense therefore you must be good! _E_\nLooking forward to receiving 2015 Statesman of the Year award tonight by @SRQRepublicans. A record 2000+ sell out __HTTP__ _E_\nMy @foxandfriends interview discussing @BarackObama's #WHCD lowering tax rates Republic of Georgia & (cont) __HTTP__ _E_\n.@DJohnsonPGA We are so proud of you Dustin. Your reaction under pressure was amazing. First of many Majors. 
You are a true CHAMPION! _E_\nI will be on @foxandfriends at 7.00 45 minutes. Talking about Ebola Obama and other strange U.S. happenings! _E_\nMy @gretawire interview discussing @MittRomney debate responses Obama's hidden records my tweets and unemployment __HTTP__ _E_\nIs President Obama going to finally mention the words radical Islamic terrorism? If he doesn't he should immediately resign in disgrace! _E_\nDoes anybody really believe that a reporter who nobody ever heard of went to his mailbox and found my tax returns? @NBCNews FAKE NEWS! _E_\n.@IsraeliPM @netanyahu delivered an excellent speech yesterday at the UN. Too bad @AmbassadorRice wasn't there. _E_\nMy @SquawkCNBC interview discussing today's primary contests @MittRomney's lead and my stock picks __HTTP__ _E_\nI will be the best by far in fighting terror. I'm the only one that was right from the beginning & now Lyin' Ted & others are copying me. _E_\nWatched @davidaxelrod on @oreillyfactor and the dog hit me even after I made a big contribution to his charity. I never went bankrupt! _E_\nI really enjoy doing @foxandfriends every Monday at 7 AM. @sdoocy @ehasselbeck and @kilmeade are great people. _E_\n\"Manufacturing Optimism Rose to Another All Time High in the Latest @ShopFloorNAM Outlook Survey\" __HTTP__ _E_\nVery nice article from Daily Mail __HTTP__ _E_\nThe statement about leaving the base came directly from CBS Evening News. _E_\nGallup poll proves that @BarackObama's regulation and Obamacare are stopping small business owners from hiring __HTTP__ SHOCK! _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\nThen ask: What am I pretending not to see? These two simple questions can pave the way for some very clear answers. _E_\nThe 9.11.12 attack on the Benghazi consulate was a sophisticated multi prong wave attack. When will all the 50+ fighters face justice? _E_\nLyin' Ted Cruz can't win with the voters so he has to sell himself to the bosses I am millions of VOTES ahead! 
Hillary would destroy him & K _E_\n\"Never judge someone by their job title. You'd be surprised at the talents people can have.\" – Midas Touch _E_\nThe plane I saw on television was the hostage plane in Geneva Switzerland not the plane carrying $400 million in cash going to Iran! _E_\nWhatever you are doing right now make sure to stop for a minute focus and ask yourself \"Am I thinking BIG?\" _E_\n.@David_Cameron Why do you give Scotland so much money to destroy their magnificent land with wind turbines causing massive taxes & E bills _E_\nWe can make Washington work for us. It's time for real leadership. Let's #MakeAmericaGreatAgain! __HTTP__ _E_\nIran has been formally PUT ON NOTICE for firing a ballistic missile.Should have been thankful for the terrible deal the U.S. made with them! _E_\nHow will Mitt Romney defend his record on jobs and Romneycare in tonight's debate? _E_\nThank you Las Vegas Nevada I love you! Departing for Greeley Colorado now. Get out & VOTE! #ICYMI watch here:... __HTTP__ _E_\nMy @foxandfriends interview re: @SuperBowl blackout @BobbyJindal's stupid comment & suing @billmaher f/$5M __HTTP__ _E_\nObama is trying to block sequester layoff notices in Virginia __HTTP__ Another example of sleazy politics! _E_\nAs dishonest as @RollingStone is I say @HuffingtonPost is worse. Neither has much money sue them and put them out of business! _E_\nPresident Obama's literary agent (in 1991) promoted a book about the first African American president of the (cont) __HTTP__ _E_\nEntrepreneurs – be tough resolute & trustworthy. The most crucial time to build your reputation is when you start making deals. _E_\nThe NFL has decided that it will not force players to stand for the playing of our National Anthem. Total disrespect for our great country! _E_\nCountries charge U.S. companies taxes or tariffs while the U.S. charges them nothing or little.We should charge them SAME as they charge us! 
_E_\nThe lights went out at the White House today __HTTP__ Symbolic of the Obama presidency. _E_\nI wonder why somebody doesn't do something about the clowns @politico and their totally dishonest reporting. _E_\nPress release. Video response to follow. __HTTP__ _E_\nThe banks were bailed out by us. They should start lending to private entrepreneurs. The banks are slowing American growth. _E_\nUnlike the other Republican candidates I will be in Nevada all day and night I won't be fleeing in and out. I love & invest in Nevada! _E_\nWill be in Bangor Maine today! Join me 4pmE at the Cross Insurance Center! __HTTP__ __HTTP__ _E_\nI hope the Republicans are happy. Just as I predicted that stupid deal they voted for only whetted Obama's appetite for more taxes. _E_\n\"President Trump?\" __HTTP__ via @MiamiHerald by Wayne E. Williams _E_\nVia @HorsetalkNZ: \"Florida's Trump Invitational to kick off showjumping year\" __HTTP__ Mar a Lago's 3rd ann. Trump Grand Prix! _E_\nThe Government spends 30% more than it admits __HTTP__ @BarackObama is out of control with his deficit spending. _E_\nObama administration fails to screen Syrian refugees' social media accounts: __HTTP__ _E_\nVia @DailyCaller: Trump on Obama and Congress: 'Lock them up' in a room like Vatican conclave __HTTP__ by @NicholasBallasy _E_\nVery tacky set! _E_\nWith Boston terrorist cell widening in suspects it's now clear that it was a mistake to read the bomber the Miranda warning so early. _E_\nI love the Lakers and when you love the Lakers you want them to win so badly that you will work tirelessly. Dr. Jerry Buss _E_\n#JointSession #MAGA __HTTP__ _E_\nWith taxes set to go up and Obama about to cut the mortgage deduction now is the time to buy a house if you can. Can get a great deal. _E_\nDoes anyone know that Crooked Hillary who tried so hard was unable to pass the Bar Exams in Washington D.C. She was forced to go elsewhere _E_\nBill Clinton has been Obama's most effective surrogate out on the trail. 
_E_\nIt doesn't cost any money to think bigger. The Art of the Deal _E_\n\"Did you know that with the natural gas reserves we have in the United States we could power America's (cont) __HTTP__ _E_\nVia @MarketWatch: \"@TrumpSoHo New York Unveils $50 Million Presidential Penthouse\" __HTTP__ _E_\nThank you. __HTTP__ _E_\n.@redstate I miss you all and thanks for all of your support. Political correctness is killing our country. weakness. _E_\nRT @ricardorossello: Briefed @POTUS @realDonaldTrump in #SituationRoom and thanked him for his leadership quick response & commitment to o... _E_\nFEMA & First Responders are doing a GREAT job in Puerto Rico. Massive food & water delivered. Docks & electric grid dead. Locals trying.... _E_\nRT @ScottAdamsSays: Trump's speech today is the best persuasion I have ever seen. Game over. Now running unopposed: __HTTP__ _E_\n.@mkhammer a Fox contributor isn't smart enough to know what is going on at the border. @TheJuanWilliams made the point far better! _E_\nSuch a great honor to be the Republican Nominee for President of the United States. I will work hard and never let you down! AMERICA FIRST! _E_\nRT @EricTrump: Who has voted today??? Feedback from the polls? I'm like a kid on Christmas! #SuperTuesday #MakeAmericaGreatAgain __HTTP__ _E_\n.@timkaine is wrong for defense: __HTTP__ #BigLeagueTruth #VPDebate _E_\nLeaving for Mobile Alabama right now can't be late! _E_\nIt is going to be a long and tough road to turn around CNN they are looking at the wrong people! _E_\nGreat read by @VDHanson: \"Mexico's Hypocrisy Is Evident In Its Own Strict Policy Toward Immigrants\" __HTTP__ _E_\nVia @DailyCaller by @rpollockDC: \"NYC Mayor Action Against Donald Trump Is 'Not the American Way'\" __HTTP__ _E_\nDo you believe Obama just said that America would be less safe with a travel ban from West Africa.This is the thinking of a total mad man! _E_\nThe Solyndra Scandal @BarackObama's $500Million photo op. He loves wasting our money. 
_E_\nIf you love it own it. @TrumpCondosLV bring unparalleled style elegance and world class amenities to Las Vegas __HTTP__ _E_\nJoin me at 2:00pmEST today live from Trump Tower via Facebook & Periscope! __HTTP__ _E_\nThe Boston Bomber got immediate emergency surgery for a gunshot yet our vets die on waiting lines at the VA. We must do better! _E_\nWhat team would you choose to win? #CelebApprentice _E_\n.@AllenWest Great seeing you last night at record setting Mar a Lago Republican event. The crowd loved you! _E_\nPresidential Memorandum for the @CommerceGov @SecretaryRoss re: Aluminum Imports and Threats to National Security:... __HTTP__ _E_\nThank you Minnesota! It is time to #DrainTheSwamp & #MAGA! #ICYMI watch: __HTTP__ __HTTP__ _E_\nWow Lyin' Ted Cruz really went wacko today. Made all sorts of crazy charges. Can't function under pressure not very presidential. Sad! _E_\nI feel sure that my friend @RandPaul will come along with the new and great health care program because he knows Obamacare is a disaster! _E_\nI strongly pressed President Putin twice about Russian meddling in our election. He vehemently denied it. I've already given my opinion..... _E_\nI will be doing Fox & Friends tomorrow morning at 7.00. Will be discussing all sorts of current disasters! _E_\nWow @CNN ratings are up 75% because it's all Trump all the time. The networks are making a fortune off of me! MAKE AMERICA GREAT AGAIN! _E_\nAl Qaeda taking over Libya after we made it possible really amazing. _E_\nObama has unilaterally & unconstitutionally drawn 4 ObamaCare exemptions for his friends. All @GOP wants is (cont) __HTTP__ _E_\nRT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_\nThank you Marco I agree! __HTTP__ _E_\nIn 2008 @BarackObama campaigned against $3.50 gas __HTTP__ It is now $6 in Florida and on the rise. He is a disaster! _E_\nHappy Canada Day to all of the great people of Canada and to your Prime Minister and my new found friend @JustinTrudeau. 
#Canada150 _E_\nvia Bloomberg: Fox News Couldn't Kill Trump's Momentum Made Him Stronger @FoxNews @business __HTTP__ _E_\nIf the Republicans need a chief negotiator I am always available or can recommend some really good ones! _E_\n.@BillRancic Bill fantastic job this morning on @foxandfriends you are a total winner and I am proud of you as first Apprentice CHAMP! _E_\nIsn't it sad the way Putin is toying with Obama regarding Snowden. We look weak and pathetic. Could not happen with.a strong leader! _E_\nGreat players in sports make the game fun to watch. @DerekJeter has continued to impress with another amazing season. Absolute professional. _E_\nOur ally Canada wants to send their oil down south to us. @BarackObama is forcing Canada to send it west to China. _E_\nSorry folks I'm just not a fan of sharks and don't worry they will be around long after we are gone. _E_\nI will be interviewed by Chris Wallace at 2:00 P.M. on @FoxNews Turn off the football for 15 minutes Make America Great Again! _E_\nVia @pbpost: \"Faldo calls team up for golf course with Trump 'entertaining'\" __HTTP__ _E_\nGreat job on the Larry King Live Gulf Telethon last night $1.3 million was raised in 2 hours. _E_\nI wish everyone including the haters and losers a very happy Easter! _E_\nRT @TheFive: Media bias is not just about what they report it's also about what they don't report. @jessebwatters #thefive _E_\nWe need to fix our broken education system! #StopCommonCore #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_\nWOW great new poll New York! Thank you for your support! #Trump2016#NewYorkValues __HTTP__ __HTTP__ _E_\nIf dummy Bill Kristol actually does get a spoiler to run as an Independent say good bye to the Supreme Court! _E_\nThe national debt continues to rise at record levels and today @BarackObama is in Disney World. He lives in a fantasy. _E_\nPhoto from yesterday's USGA announcement that Trump National Golf Club Bedminster will host the 2017 U.S. 
Women's Open __HTTP__ _E_\nThe special interests and people who control our politicians (puppets) are spending $25 million on misleading and fraudulent T.V. ads on me. _E_\nNot only is @Toure a racist (and boring) he's a really dumb guy! _E_\nJoin me in San Jose California tomorrow evening at 7pm! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThanks for your support! __HTTP__ _E_\nI hear the very ungrateful @ArsenioHall has a show that is absolutely dying in the ratings. Really too bad! _E_\nBest wishes to the Republic of Korea on hosting the @Olympics! What a wonderful opportunity to show everyone that you are a truly GREAT NATION! __HTTP__ _E_\nI don't know Putin have no deals in Russia and the haters are going crazy yet Obama can make a deal with Iran #1in terror no problem! _E_\n.@ArsenioHall just got \"fired\"—the people spoke ratings were terrible. The Apprentice brought him back from the dead but he blew it! _E_\nGreat conversations with President @EmmanuelMacron and his representatives on trade military and security. _E_\nJust out the new nationwide @FoxNews poll has me alone in 2nd place closely behind Jeb Bush but Bush will NEVER Make America Great Again! _E_\nHillary's 33000 deleted emails about her daughter's wedding. That's a lot of wedding emails. #debate _E_\nI can't believe he would choose @OMAROSA as his first choice! She is hard to handle. #CelebApprentice _E_\nCrowd is booing the hell out of that phony decision place is angry and going wild. Fight was not even close! DISGUSTING. _E_\nUnfortunately@BarackObama's continued attack on the US $ will lead to ever rising gas prices at the pump and lots of other really bad things _E_\nGas prices are up 30 cents this month rising 21 days in a row __HTTP__ Don't worry @BarackObama has a solution ALGAE! _E_\nHeading to Pennsylvania for a big rally tonight. We will MAKE AMERICA GREAT AGAIN! 
_E_\nMiddle Eastern countries must participate militarily (no running away) and big league financially in order for us to go in and save them! _E_\nRT @TeamTrump: ONLY @realDonaldTrump will end what even @BillClinton called a CRAZY SYSTEM. #BigLeagueTruth #Debate __HTTP__ _E_\nHAPPY BIRTHDAY to our 40th President of the United States of America Ronald Reagan! __HTTP__ _E_\n\"In the end you're measured not by how much you undertake but what you finally accomplish. The Art of the Deal _E_\nI have NOTHING to do with The Apprentice except for fact that I conceived it with Mark B & have a big stake in it. Will devote ZERO TIME! _E_\nDonald Trump Partners with TV1 on New Reality Series Entitled Omarosa's Ultimate Merger: __HTTP__ _E_\nToday we remember and honor our fellow Americans and NYPD and FDNY who fell 11 years ago. _E_\n.@chucktodd is so dishonest in his reporting...and to think he was going off the air until I came along no ratings. I will beat Hillary! _E_\nI hope everyone (especially the haters and losers) goes to Macy's today and buys some DJT ties shirts and suits and SUCCESS Fragrance love! _E_\nHas anyone looked at the really poor numbers of @VanityFair Magazine. Way down big trouble dead! Graydon Carter no talent will be out! _E_\n.#RogerStone was just banned by @CNN their loss! Tough loyal guy. _E_\nI just had an amazing day in Mumbai India. Building an almost 80 story building super luxury which is doing great! Press is going wild. _E_\nGoing to South Carolina now great place SRO crowd. Iowa was amazing yesterday! _E_\nJoin me in Wilmington Ohio tomorrow at 4:00pm! It is time to #DrainTheSwamp! Tickets: __HTTP__ __HTTP__ _E_\nWhere's the press? 1484: 72% of Afghan Casualties Have Occurred Under Obama __HTTP__ _E_\nHillary can never win over Bernie supporters. Her foreign wars NAFTA/TPP support & Wall Street ties are driving away millions of votes. 
_E_\nGreat day of bilateral meetings at #ASEANSummit on trade which we are turning around to be great deals for our country! __HTTP__ _E_\nI wonder if Apple is upset with me for hounding them to produce a large screen iPhone. I hear they will be doing it soon—long overdue. _E_\nRT @SLandinSoCal: When you kneel for our #NationalAnthem you aren't protesting a specific issue you are protesting our Nation and EVERYTH... _E_\nWhen will the unemployment numbers be corrected? Sadly after the election! _E_\nMelania and I will be appearing on Larry King Live tonight 9 p.m. on CNN. Be sure to tune in for some great conversation! _E_\nDo you believe that Secy. KERRY just went to Egypt to talk about human rights problems and this as everything is being blown up around him _E_\nAre you a young professional getting ready for a big meeting? Pick up a #Trump suit @Macys __HTTP__ Look your best! _E_\nTHANK YOU! #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_\nTweet me tonight who your favorite is during the live telecast of the Miss Universe Pageant. _E_\n\"60 Minutes\" treats President Obama with kid gloves Mike Wallace is spinning in his grave! _E_\nProud to welcome our great Cabinet this afternoon for our first meeting. Unfortunately 4 seats were empty because S... __HTTP__ _E_\n.@CarlyFiorina Ben Carson said in his own book that he has a pathological temper & pathological disease. I didn't say it he did. Apology? _E_\nI will be announcing my decision on Paris Accord Thursday at 3:00 P.M. The White House Rose Garden. MAKE AMERICA GREAT AGAIN! _E_\n.@TheAlabamaBand was great last night in D.C. playing for 147 Diplomats and Ambassadors from countries around the world. Thanks Alabama! _E_\nIf @rihanna is dating @chrisbrown again then she has a death wish. A beater is always a beater just watch! _E_\nVia @KingwoodNews by @JayRJordan:: \"@TXPatriotsPAC gives public a chance to hear Trump speak\" __HTTP__ _E_\n.@oreillyfactor Please correct I WON Virginia! 
_E_\nNow @BarackObama wants us to believe the Republicans cancelled Keystone and are responsible for $4 gas. He (cont) __HTTP__ _E_\nHow badly will the Country be hurt by the three scandals and the very poor implementation and cost of Obama Care? _E_\nBut @mcuban is physically weak he has no clubhead speed or game! _E_\nThe Democrats' solution is the same solution they have for everything tax tax tax. Just one problem: it doesn't work #TimeToGetTough _E_\nI turned down going to the debate tonight so that I could do live tweets to my many followers. _E_\nI'm very proud of my new crystal collection. Here's a sneak peak of my favorite collection Elmsford __HTTP__ (cont) __HTTP__ _E_\nMy shirts ties & suits (and fragrance Success) are doing great go over & check out Macy's now—beautiful new selection! _E_\nObamaCare is on LIFE SUPPORT it will soon be DEAD ON ARRIVAL A bad concept that was imcompetently administered! _E_\nRemember I am the only candidate who is self funding. While I am given little credit for this by the voters I am not bought like others! _E_\nThe only people who are not interested in being the V.P. pick are the people who have not been asked! _E_\nHappy 94th birthday to Nelson Mandela! _E_\nAsk yourself Is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. _E_\nFunny if you listen to @FoxNews the Democrats did not have a good day. If you listen to the other two they are fawning. What a difference _E_\n\"Trump on Obama: 'I never thought it would be this bad'\" __HTTP__ via @breitbarttv _E_\nObama Care's taxes vest in 2014. Conveniently after the 2012 election. Coincidence? _E_\nWishing everyone a wonderful Independence Day weekend. We have a lot to be thankful for. _E_\nLast Saturday A Rod was 0 3 and left 6 stranded. But he was still hitting on girls from the dugout __HTTP__ He is very selfish! _E_\nDespite my great respect for King Abdullah II I will not be visiting Jordan at this time. 
This is in response to the false @AP report. _E_\nThe W.H. is functioning perfectly focused on HealthCare Tax Cuts/Reform & many other things. I have very little time for watching T.V. _E_\nOn Fifth Avenue the iconic @TrumpTowerNY is one of NYC's most heavily visited tourist attractions __HTTP__ _E_\nAwarded the prestigious 2014 @ForbesInspector 5 Star Guide @TrumpToronto is located in beautiful downtown Toronto __HTTP__ _E_\nI wish the 23 million who are unemployed were able to celebrate like the Democrats in Charlotte right now. _E_\nGreat first day with world leaders at the #G20Summit here in Hamburg Germany. Looking forward to day two! #USA __HTTP__ _E_\nMy opponents big bosses lobbyists and donors are trying to do damage. They will fail! Money down the drain! _E_\n.@TomLlamasABC cannot report the news truthfully. Why not apologize for your fraudulent story on World News Tonight.Gang members & criminals _E_\nCadillac Championship at Doral a great success I just bought Doral it will be amazing. Cadillac a great American car. _E_\nNo president in history has lied to the American people more than President Obama in fact it is not even close! _E_\nInnovation distinguishes between a leader and a follower. Steve Jobs _E_\nGreat poll out of Nevada thank you! See you soon. #MAGA #AmericaFirst __HTTP__ __HTTP__ _E_\n.@Rosie get better fast. I'm starting to miss you! _E_\nCongrats @marklevinshow on fantastic article when \"B\" writes so nicely about you it really means something. __HTTP__ _E_\nI will be on Face The Nation this morning at various times across the U.S. @CBSNews Enjoy! _E_\n.@NBCNews is so knowingly inaccurate with their reporting. The good news is that the PEOPLE get it which is really all that matters! Not #1 _E_\nWhat controversy? 2 'active' @BarackObama supporters at Bain have confirmed that @MittRomney left in '99 __HTTP__ No story here. _E_\nWhy Donald Trump Won't Touch Your Entitlements: DES MOINES Iowa—Donald Trump says if he runs for p... 
__HTTP__ _E_\nThank you! #ImWithYou __HTTP__ _E_\nYoung entrepreneurs – Remember that your first deals are the most important of your career. Win & gain confidence. _E_\nWhy does Obama continue to release the worst of the worst from Gitmo?! Look at Paris and wake up! _E_\nI know of no more encouraging fact than the unquestionable ability of man to elevate his life by conscious endeavor. Henry David Thoreau _E_\nWaste! The CBO now estimates that @BarackObama's stimulus cost $831B and a ridicuous $4.1M per job created __HTTP__ _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nBehind the scenes video with \"Uncle Sam\" (eagle's name) and me. __HTTP__ _E_\nGreat night in Denver Colorado thank you! Together we will MAKE AMERICA GREAT AGAIN! #ICYMI watch rally here:... __HTTP__ _E_\nIraq is far worse and of more danger to the U.S. now than it ever was under Saddam Hussein and this after $2 trillion and so many lives! _E_\nI had a great time at @TwitterNYC #AskTrump __HTTP__ _E_\nIf Bernie Sanders after seeing the just released e mails continues to look exhausted and done then his legacy will never be the same. _E_\nThat said the rich Arab countries should get involved with the Syrian mess not us.We should start rebuilding our own country & military. _E_\nLooking like my 5 victories on Tuesday will be just as good as if I won Ohio. Two more days and Ohio was mine! _E_\nThe Democrats want to shut down the Government over Amnesty for all and Border Security. The biggest loser will be our rapidly rebuilding Military at a time we need it more than ever. We need a merit based system of immigration and we need it now! No more dangerous Lottery. _E_\nThis is the time for the United States to be strengthening all important military components not rolling over and dealing from weakness! _E_\nNotice the first word in my Think Big credo: Think = the 1st step. Use everything in your power to utilize & develop that capacity. 
_E_\nWhy doesn't Fake News talk about Podesta ties to Russia as covered by @FoxNews or money from Russia to Clinton sale of Uranium? _E_\n.@BLTPrimeMiami @TrumpDoral's signature restaurant has set the standard for today's modern steakhouse __HTTP__ _E_\nThe Mar a Lago Club has the best meatloaf in America. Tasty. __HTTP__ _E_\nRT @ProgressPolls: Who is a better President of the United States? #ObamaDay _E_\n.@Graeme_McDowell Great playing this weekend. You are a true winner. We look forward to having you back to Trump National Doral. _E_\nRT @RealJamesWoods: I've never witnessed such hatred for a man who is willing to work for free to make his beloved country a better place.... _E_\nWhy isn't @chucktodd using the much newer @CNN Poll when discussing how well I am doing instead of the older Q Poll? CNN even better! _E_\nThey just called this the biggest scandal in U.S. sports history (GMA). Roger looks really weak and indecisive must put on a better front! _E_\nThe failing @nytimes was forced to apologize to its subscribers for the poor reporting it did on my election win. Now they are worse! _E_\nGov. @BobbyJindal referred to the Republicans as \"the stupid party\". Now he has given Dems a phrase to use. _E_\n#WheresHillary? Sleeping!!!!! _E_\nChuck Hagel's nomination has been held up for at least 12 more days. A lot can happen. _E_\nRestoring American wealth will require that we get tough. The next president must understand that America's (cont) __HTTP__ _E_\nSpent full day with contractors at Trump National Doral it will be amazing! __HTTP__ _E_\nDouble digit premium increases because of ObamaCare. Dems trying to delay showing numbers until after election but news is spreading fast! _E_\n\"TRUMP HITS BACK AT CHRIS MATTHEWS' BIRTHER RANT: 'HE USED TO BE A MUCH MORE INTELLIGENT MAN' __HTTP__ @MadeleineBlaze _E_\nThe media must denigrate ISIS at all levels or youth will continue to be drawn to it. These are low level degenerates NOT masterminds! _E_\nThank you! 
__HTTP__ _E_\nOn rugged Aberdeenshire coastline@TrumpScotland's Par 72 course is 7400 picturesque yds. threaded through dunes __HTTP__ _E_\nRT @ericbolling: Good morning friends! The Swamp out today. President Trump has a copy... get yours here __HTTP__ #maga... _E_\nDonna Brazile Shreds Obama Economy Acting DNC chair says 'people are more in despair about how things are' __HTTP__ _E_\nWe cannot solve our problems with the same thinking we used when we created them. Albert Einstein _E_\nMy @SteveDeaceShow int. on China the economy and my upcoming keynote at @theFAMiLYLEADER Leadership Summit __HTTP__ _E_\nThe number of unemployed Americans has increased over 60% during Obama's term. The economy can't survive another 4 years. _E_\nKnow when to walk away from the table. The Art of the Deal _E_\nMany people have commented that my fragrance \"Success\" is the best scent & lasts the longest. Try it & let me know what you think! _E_\nAnother horrific attack this time in Nice France. Many dead and injured. When will we learn? It is only getting worse. _E_\nI am a Republican...but the Republicans may be the worst negotiators in history! _E_\nJeff Flake with an 18% approval rating in Arizona said a lot of my colleagues have spoken out. Really they just gave me a standing O! _E_\nWho is paying for that tedious Smokey Bear commercial that is on all the time enough already! _E_\nStory will be released today at 12 noon EST on Twitter and Facebook. _E_\nI can't wait to read this...RT @Newsmax_Media: SEAL Book Explodes Obama Furious __HTTP__ _E_\nAchievers move forward at all times. Achievement is not a plateau it's a beginning. _E_\n.@MittRomney was a disaster candidate who had no guts and choked! Romney is a total joke and everyone knows it! _E_\nTomorrow we celebrate Independence Day America's 236th birthday. 
Here is America's actual birth certificate __HTTP__ _E_\nMy interview with @BarbaraJWalters in her @ABC special 'Most Fascinating People of 2011' __HTTP__ _E_\nWelcome to Twitter @melaniatrump! _E_\nRT @TrumpNV: #NVcaucus locator &gt __HTTP__ _E_\nVia @NRO by @JOELMENTUM: \"Matchless Name Recognition and Deep Pockets Make Trump a Threat in Iowa\" __HTTP__ _E_\nMuch as it pays to emphasize the positive there are times when the only choice is confrontation. #TheArtofTheDeal _E_\nJust got back from New Hampshire. Great crowd great people! Will be back soon! _E_\nRT @LouDobbs: #AmericaFirst @KellyannePolls: The Middle Class & businesses will benefit from @POTUS' historic tax revolution. #Dobbs #MAGA... _E_\nMy interview on @ASavageNation discussing why @MittRomney will defeat @BarackObama with a strong campaign. __HTTP__ _E_\nVia @espn: Donald Trump would fire A Rod __HTTP__ _E_\nEnter the contest.... __HTTP__ stay at Trump International Hotel Las Vegas _E_\nHuma calls it a MESS the rest of us call it CORRUPT! WikiLeaks catches Crooked in the act again.#DrainTheSwamp __HTTP__ _E_\nRT @TomOdell: .@FoxNews Pope who lives in a Vatican city fortified with huge walls thinks it's wrong to build walls? Really? __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nAll Star Celebrity @ApprenticeNBC continues to surprise our loyal viewers each and every week. More and bigger coming... _E_\nFor someone who demanded 20 years of Mitt's tax returns you would think my offer to donate $5M to charity for his records is an easy go. _E_\n\"Don't emphasize the problem so much emphasize the solution. It's a mindset that works.\" – Think Like a Champion _E_\nJust finished my second speech. 20K in Dayton & 25K in Cleveland perfectly behaved crowd. Thanks I love you Ohio! __HTTP__ _E_\nA number of months ago I was not expected to win South CarolinaTed Cruz was and yet I won in a landslide every group and category. WIN! 
_E_\nAsking why my dislike of A Rod dishonorable dealings with me on an apartment deal _E_\nThe day after @BarackObama blocks a Texas voter photo ID law @JamesOkeefeIII exposes more dead people getting ballots __HTTP__ _E_\nNothing ever happened with any of these women. Totally made up nonsense to steal the election. Nobody has more respect for women than me! _E_\nI look forward to going to the Land Investment Expo in Iowa on Jan. 23. Record crowd—sold out venue. @LandExpo @PeoplesCompany _E_\nGetting ready to go to Las Vegas (Freedom Fest) great crowd. Then on to amazing Phoenix that will be a total happening! Love America. _E_\n#TrumpVlog Quarantine the nurse! __HTTP__ _E_\nStorm turned Hurricane is getting much bigger and more powerful than projected. Federal Government is on site and ready to respond. Be safe! _E_\nBe sure to watch @IvankaTrump's @FoxBusiness @FBNATB interview from the @NYU #HospitalityConference __HTTP__ _E_\nVia @Entrepreneur by @MDMSEO : \"8 Lessons for Entrepreneurs That @ApprenticeNBC Emphasizes\" __HTTP__ _E_\nDress your best. Trump Signature Collection exclusively available @Macys tops all male business attire __HTTP__ _E_\nI am promising you a new legacy for America. We're going to create a new American future. Thank you OHIO! #ImWithYou __HTTP__ _E_\n.@FinancialTimes writes that @BarackObama should pray that China overtakes US __HTTP__ Don't worry he is making it happen. _E_\n.@ariannahuff is unattractive both inside and out. I fully understand why her former husband left her for a man he made a good decision. _E_\nThe WALL which is already under construction in the form of new renovation of old and existing fences and walls will continue to be built. _E_\nRemember when guns are outlawed only outlaws will have guns! _E_\n.@CelebApprentice wins 10 11 o'clock hour in all key ratings demographics including most importantly the 18 49 age group. 
_E_\nAsians are very offended that JEB said that anchor babies applies to them as a way to be more politically correct to hispanics. A mess! _E_\nMar a Lago my club in Palm Beach and one of the greatest mansions ever built has been nominated as one of (cont) __HTTP__ _E_\nObama will send troops back into Iraq combat zone. Don't believe anything he says. Just covering for his mistakes. _E_\nWith America's debt topping $21T by the end of his presidency Obama will have effectively bankrupted our country. @davidaxelrod _E_\nRT @Scavino45: #WeThePeople#USAatUNGA #UNGA __HTTP__ _E_\nTrump International Golf Links was just rated one of the greatest courses in the world. Virtually all reviews are saying the same thing. _E_\nVia @eonline: @IvankaTrump Wears @MissUniverse 2014's $300000 Crown Nails Beauty Pageant Winner Look __HTTP__ _E_\nMexico sent USMC Andrew Tahmooressi back to jail after court hearing. Mexico does not respect our border or U.S. Boycott! #FreeOurMarine _E_\nSigning The Facebook Wall __HTTP__ _E_\nTed Cruz Was For Welcoming Syrian Refugees Before He Was Against It __HTTP__ _E_\nYou know the world is crazy when New York gets hit by a hurricane and Florida doesn't. _E_\nWe will have to see what Russia's next move will be. They may have given him an out of an embarrassing situation or drove into deeper mess! _E_\n\"In this game knowledge is the key to power.\" Think Big _E_\nMy @SquawkCNBC #TRUMPTUESDAY interview discussing the upcoming debates the real state of unemployment & bias media __HTTP__ _E_\nI highly recommend the just out book THE FIELD OF FIGHT by General Michael Flynn. How to defeat radical Islam. _E_\n.@megynkelly the most overrated anchor at @FoxNews worked hard to explain away the new Monmouth poll 41 to 14 or 27 pt lead. She said 15! 
_E_\nMy @SquawkCNBC interview discussing Jamie Dimon the Doral Miami purchase OPEC's output & @PGATOUR Open __HTTP__ _E_\nThe Democrats are turning down services and security for citizens in favor of services and security for non citizens. Not good! _E_\nDonald Trump could again defy the conventional wisdom of the chattering class in November. @Newsmax_Media's cover The Trump Effect _E_\nHow come Snowden and ObamaCare have access to all records and information but don't have even the smallest tidbits on President Obama? _E_\nThis Sunday at 9/8C the real playoffs begin with the premiere of @apprenticenbc! Game on in the Boardroom... __HTTP__ _E_\nMarco Rubio had no idea what he was doing on Chris Wallace show. Said Iraq was not a mistake. He looked clueless! _E_\nA great evening in Iowa! Thank you Des Moines Area Community College for a great forum! #Trump2016 #IAForums __HTTP__ _E_\n'President Donald J. Trump Proclaims September 3 2017 as a National Day of Prayer' #HurricaneHarvey #PrayForTexas __HTTP__ __HTTP__ _E_\nWe need to secure our borders ASAP. No games we must be smart tough and vigilant. MAKE AMERICA GREAT AGAIN & MAKE AMERICA STRONG AGAIN! _E_\nMy @gretawire int. discussing business difficulties with ObamaCare & how it is stopping businesses from hiring __HTTP__ _E_\nThey changed the name from \"global warming\" to \"climate change\" after the term global warming just wasn't working (it was too cold)! _E_\nTed Cruz is totally unelectable if he even gets to run (born in Canada). Will loose big to Hillary. Polls show I beat Hillary easily! WIN! _E_\nCongratulations to @IvankaTrump and Jared on the big news. I will have yet another grandchild this fall! _E_\nThe reason lyin' Ted Cruz has lost so much of the evangelical vote is that they are very smart and just don't tolerate liars a big problem! _E_\nDo you think @SenTedCruz knows about @bobvanderplaats dealings? Actually I doubt it! _E_\n....and Japan will put up with this much longer. 
Perhaps China will put a heavy move on North Korea and end this nonsense once and for all! _E_\nRecord crowd in Tampa Florida thank you! We will WIN FLORIDA #DrainTheSwamp in Washington D.C. and MAKE AMERICA... __HTTP__ _E_\nNothing great was ever achieved without enthusiasm. Ralph Waldo Emerson _E_\n\"Stay confident even when something bad happens. It is just a bump in the road. It will pass.\" – Think Big _E_\nChina is already preparing to benefit economically from this mess. They will pick up the pieces and make yet another fortune & laugh at us! _E_\nOlympic Gold Medalist Evan Lysacek just left my office. He is in town and wanted to meet me he's a fanastic guy and a true champion. _E_\nThank you for all of your support South Carolina! #Trump2016 __HTTP__ _E_\nWatch @DonaldJTrumpJr and @EricTrump accept my #ALSIceBucketChallenge __HTTP__ _E_\nOnce the Bloomberg administration selected Trump to take over the very expensive and years late project I kicked ass and got it done fast _E_\nTruth will ultimately prevail where there is pains to bring it to light. George Washington _E_\nGoofy Elizabeth Warren has been one of the least effective Senators in the entire U.S. Senate. She has done nothing! _E_\nTed Cruz lifts the Bible high into the air and then lies like a dog over and over again! The Evangelicals in S.C. figured him out & said no! _E_\nVia @FoxNews: Donald Trump sends Bill Maher birth certificate awaits $5 million __HTTP__ _E_\nThank you Maryland! #Trump2016 __HTTP__ _E_\nObamaCare is clearly unconstitutional. Hopefully the USC rules correctly but in the end repealing ObamaCare requires a political solution. _E_\nToday we continued a wonderful American Tradition at the White House. Drumstick and Wishbone will live out their days in the beautiful Blue Ridge Mountains at Gobbler's Rest... __HTTP__ _E_\nHISTORIC rainfall in Houston and all over Texas. Floods are unprecedented and more rain coming. Spirit of the people is incredible.Thanks! 
_E_\nThank you Miss Katie's Diner!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThe Democrats are in a total meltdown but the biased media will say how great they are doing! E mails say the rigged system is alive & well! _E_\nCreativity and control can go hand in hand. Brainpower is the ultimate leverage. _E_\n\"Never Ignore Donald Trump\" __HTTP__ by Jeffrey Lord @AmSpec _E_\nNow the @BarackObama campaign is fundraising off of me. I should get a tax rebate! __HTTP__ _E_\n.@chelseahandler stop calling my office for me to do your rather gross show I have less interest in you than Andre. _E_\nPervert Weiner is dead in his race for mayor of NYC but WOW Eliot Spitzer has dropped way down in recent poll for comptroller. SLEAZE! _E_\nBig wins against ISIS! _E_\nWill be interviewed on @SeanHannity on @FoxNews from #Wisconsin tonight. My wife Melania will join me for the entire show. _E_\nI suggest that we add more dollars to Healthcare and make it the best anywhere. ObamaCare is dead the Republicans will do much better! _E_\nIn a clumsy move to get out of his anchor babies dilemma where he signed that he would not use the term and now uses it he blamed ASIANS _E_\nMy deepest condolences to the victims of the terrible shooting in Douglas County @DCSheriff and their families. We love our police and law enforcement God Bless them all! #LESM _E_\nI will be interviewed on @foxandfriends at 7:00 30 minutes. Some very interesting topics. _E_\n.@acuconservative's #CPACCO kept up the momentum from the debate. @MittRomney even made a surprise appearance. Now go win CO! _E_\nWhy did @BarackObama liberate Libya and do nothing for the Iranian protestors? Iran is a threat to our national security. _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! 
__HTTP__ _E_\n\"I've found that people work harder when they are held accountable and their confidence rises along with that.\" – Midas Touch _E_\nIt is now a FACT that President Obama lied in order to get ObamaCare passed that is fraud and the legislation should be recinded INTERESTING _E_\nFlying to New Hampshire to keynote the @LoebSchool First Amendment Award Ceremony. Always great to visit the Granite State! _E_\nThe RNC which is probably not on my side just illegally put out a fundraising notice saying Trump wants you to contribute to the RNC. _E_\nOur country has become so politically correct that it has lost all sense of direction or purpose. We are unable to move forward painlessly. _E_\nI predict that President Obama will at some point attack Iran in order to save face! _E_\n.@GOP congress needs to actually defund ObamaCare not waste time passing non binding resolutions. _E_\nI have raised/given a tremendous amount of money to our great VETERANS and have got nothing but bad publicity for doing so. Watch! _E_\nSo sad that Burt Reynolds has lost all of his money. I wish he came to me for advice he would be rich as hell! _E_\n#CrookedHillary sending U.S. intelligence info. to Podesta's hacked email is 'unquestionably an OPSEC violation' __HTTP__ _E_\n...do the typical political thing and BLAME. The fact is ObamaCare was a lie from the beginning. Keep you doctor keep your plan! It is.... _E_\nThe House of Representatives seeks contempt citations(?) against the JusticeDepartment and the FBI for withholding key documents and an FBI witness which could shed light on surveillance of associates of Donald Trump. Big stuff. Deep State. Give this information NOW! @FoxNews _E_\nThe constant interruptions last night by Tim Kaine should not have been allowed. Mike Pence won big! _E_\nEntrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_\nWon't be a buyer's market for long. 
If you can purchase a home but remember I told you this three years ago. _E_\nIt was my great honor to welcome President @JC_Varela & Mrs. Varela from Panama this afternoon. ... __HTTP__ _E_\nGreat news that @FoxNews has cancelled the additional debate. How many times can the same people ask the same question? I beat Cruz debating _E_\n...hasn't worked agreements violated before the ink was dry makings fools of U.S. negotiators. Sorry but only one thing will work! _E_\n\"@MissUSA Nia Sanchez of Nevada is ready for @missuniverse\" __HTTP__ via @lasvegassun by @Robin_Leach _E_\nThe weak illegal immigration policies of the Obama Admin. allowed bad MS 13 gangs to form in cities across U.S. We are removing them fast! _E_\nRT @SecretService: Our thoughts & prayers are with the families friends & colleagues of #Virginia's @VSPPIO Lt Cullen & Tpr Bates #Charlot... _E_\nSadly Vanity Fair is a rapidly dying magazine. Needs new blood and fast! Going the way of SPY Magazine. _E_\nA nation without borders is no nation at all. We must build a wall. Let's Make America Great Again! __HTTP__ _E_\nEntrepreneurs: Work on what you will be proud to be associated with. Make your work count. _E_\nTrump National Golf Club Los Angeles will be the host in October for the @PGAGrandSlam. __HTTP__ _E_\nWe deserve all the answers on Benghazi! __HTTP__ @RepWOLFPress _E_\nNice mention by Brian Kelly __HTTP__ Conservative Action Alerts _E_\nWhy didn't A.G. Sessions replace Acting FBI Director Andrew McCabe a Comey friend who was in charge of Clinton investigation but got.... _E_\n\"My advice to you regarding momentum is definitive: Get yours going!\" – Think Like a Champion _E_\nI love watching these poor pathetic people (pundits) on television working so hard and so seriously to try and figure me out. They can't! _E_\nRevisionist history. Now Obama claims he never told us that everyone could keep their healthcare plans. Crazy! _E_\nNever never never give up. 
Winston Churchill _E_\nMy friend Larry King @kingsthings asked me to do an interview with him—he was always great to me—& I agreed. Watch tonight 9 PM on RTV. _E_\nThere is a good possibility that a person who treated patients in West Africa and who FLEW into New York has Ebola. Touched many bedlam! _E_\nI am self funding my campaign & don't owe anybody anything! I only owe it to the American people! #Trump2016Watch: __HTTP__ _E_\n.@BarackObama said he doesn't take the Navy Seals campaigning against him too seriously. _E_\nThe failing @nytimes has been wrong about me from the very beginning. Said I would lose the primaries then the general election. FAKE NEWS! _E_\nWill Barack Obama personally read the Boston terrorist his Miranda Rights? _E_\nAbout to begin a rally here in Henderson Nevada. New Reuters poll just out thank you! Join the MOVEMENT:... __HTTP__ _E_\nOur worst threat to unemployment is @ObamaCare. It will also destroy our country's basic standards. _E_\nWhere's the leadership? Obama only met with Sebelius ONCE since ObamaCare passed __HTTP__ His signature legislation... _E_\nPeople don't know that Eliot's father is very rich. Eliot likes to pretend he's poor to appeal to voters. _E_\nI will make my final decision on the Paris Accord next week! _E_\nSun Sentinel says: Rubio lacks the experience work ethic and gravitas needed to be president. HE HAS NOT EARNED YOUR VOTE! _E_\nGood luck to Derek on his operation. I know it will be a success he is a great champion. _E_\nLife is very fragile and success doesn't change that. If anything success makes it more fragile. The Art of the Deal _E_\nThe Green Party just dropped its recount suit in Pennsylvania and is losing votes in Wisconsin recount. Just a Stein scam to raise money! _E_\nDo not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Take control and move forward every day. 
_E_\n\"The object of war is not to die for your country but to make the other bastard die for his.\" Gen. George S. Patton _E_\nWe better be vigilant careful and strong. __HTTP__ _E_\nRT @NRA: .@RealDonaldTrump is right. If @HillaryClinton gets to pick her anti #2A #SCOTUS judges there's nothing we can do. #NeverHillary _E_\n.@foxandfriends at 7:00 A.M. _E_\nIn January '12 3 turbines were wrecked in rough weather ... __HTTP__ _E_\n.@FoxNews legal analyst & former prosecutor @kimguilfoyle destroyed hack Schneiderman's suit on @FNTheFive yesterday.She's very sharp! _E_\nThe EPA is caught saying that their philosophy is to crucify oil companies __HTTP__ That will sure lower the price of gas. _E_\nWill be doing @seanhannity at 10 PM on @FoxNews. As always with Sean will be interesting! _E_\nRT @realDonaldTrump: Unemployment is down to 4.1% lowest in 17 years. 1.5 million new jobs created since I took office. Highest stock Mark... _E_\nSouth Carolina rally last night was so unbelievably exciting (and fun). I am now off to Iowa for two big rallies packed houses. Love it! _E_\nThank you very much for the nice story I greatly appreciate it __HTTP__ _E_\nWe need a tax system that is FAIR to working FAMILIES & that encourages companies to STAY in AMERICA GROW in AMERICA and HIRE in AMERICA! __HTTP__ _E_\nSarah Jessica Parker voted \"unsexiest woman alive\" – I agree. She said \"it's beneath me to comment on the... __HTTP__ _E_\nThank you Naples Florida! Get out and VOTE #TrumpPence16 on 11/8. Lets #MakeAmericaGreatAgain! Full Naples rally... __HTTP__ _E_\nGreece's financial calamity should serve as a warning. @BarackObama's massive deficit spending is unsustainable. _E_\nHome of the 2022 @PGAChampionship Trump Nat'l Bedminster features 36 holes designed by famed architect Tom Fazio __HTTP__ _E_\n#ImWithYou #AmericaFirst __HTTP__ _E_\nIf Vera Coking had taken my millions of $'s like she should have she would have lived for many years in Palm Beach Florida. 
_E_\nOur Q1 GDP was 2.9%. Worst in memory ObamaCare killing jobs stopping growth and making small business insecure. _E_\nJust arrived in Italy for the G7. Trip has been very successful. We made and saved the USA many billions of dollars and millions of jobs. _E_\n.@KieranLalor I created far more jobs and success in Dutchess than you you should be Fired. _E_\nIn other words our military has a very big problem! _E_\nWhy is lightweight A.G. Eric Schneiderman allowed to ask for campaign contributions from my people during settlement negotiations? _E_\nScary Obama's budget deficits are so out of control that he has to borrow 40 cents on every dollar he spends. _E_\nVia @KCRG by @markwcarlson: \"Donald Trump stops in Coralville\" __HTTP__ _E_\nLightweight @JebBush is spending a fortune of special interest against me in SC. False advertising desperate and sad! _E_\nHow amazing the State Health Director who verified copies of Obama's \"birth certificate\" died in plane crash today. All others lived _E_\nAmerican Exceptionalism and the Navy Yard shooting do not go hand in hand. Foreign countries in particular Russia are mocking the U.S. _E_\nOil should not cost more than $40 a barrel. Ideally it should be $25. Cheap to produce and we protect the OPEC countries. _E_\nWing bangers the name given to wind turbines by bird lovers for the thousands of birds they kill in the U.S. _E_\nGood Morning America weather headline for U.S. NEVER ENDING COLD _E_\nSpoke to Jerry Jones of the Dallas Cowboys yesterday. Jerry is a winner who knows how to get things done. Players will stand for Country! _E_\nJust landed in Paris France with @FLOTUS Melania. __HTTP__ _E_\nI love Mexico but not the unfair trade deals that the US so stupidly makes with them. Really bad for US jobs only good for Mexico. _E_\nHope everyone is watching the Finale rerun of Celebrity Apprentice on CNBC especially the haters and losers! It is on right now. _E_\nI am in Iowa. Will be making two speeches today. 
Good luck to all of the great folks on the East coast. Enjoy the beauty of the storm! _E_\nMy @SquawkCNBC interview discussing why the Fed shouldn't do a QE3 @BarackObama's college records & 2012 election __HTTP__ _E_\nMy condolences to those involved in today's horrible accident in NJ and my deepest gratitude to all of the amazing first responders. _E_\nWas the brother of John Podesta paid big money to get the sanctions on Russia lifted? Did Hillary know? _E_\nGood response on jobs by @MittRomney. _E_\nIn addition to doing a lousy job in taking care of our Vets John McCain let us down by losing to Barack Obama in his run for President! _E_\nLightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately—because he is being looked at! _E_\nThe ObamaCare website is in the news again it is turning out to cost even more than previously thought AND IT DOESN'T WORK! Big trouble! _E_\nIt's Monday. How many fundraisers will Obama hold today? _E_\n\"The only place success comes before work is in the dictionary.\" – Vince Lombardi _E_\nThe person that Hillary Clinton least wants to run against is by far me. It will be the largest voter turnout ever she will be swamped! _E_\nI thought and felt I would win big easily over the fabled 270 (306). When they cancelled fireworks they knew and so did I. _E_\nHillary Clinton strongly stated that there was absolutely no connection between her private work and that of The State Department. LIE! _E_\nThe polls are now showing that I am the best to win the GENERAL ELECTION. States that are never in play for Repubs will be won by me. Great! _E_\nI look forward to tonight's debate but look far more forward to making America great again. It can happen! _E_\nTHEBillMcGee @realDonaldTrump after a year of wear your shirts still look great! Glad I made the purchase! Thank you. _E_\nThe Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. It's absolutely... 
__HTTP__ _E_\nTHANK YOU Baton Rouge Louisiana! WE will #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_\nWe are now at 1001 delegates. We will win on the first ballot and are not wasting time and effort on other ballots because system is rigged! _E_\nThe Consumer Financial Protection Bureau or CFPB has been a total disaster as run by the previous Administrations pick. Financial Institutions have been devastated and unable to properly serve the public. We will bring it back to life! _E_\nThe Unaffordable Care Act will soon be history! _E_\n...big dollars ($700000) for his wife's political run from Hillary Clinton and her representatives. Drain the Swamp! _E_\nCan you envision Jeb Bush or Hillary Clinton negotiating with 'El Chapo' the Mexican drug lord who escaped from prison? .... _E_\n.@timkaine oversaw unemployment INCREASE by 179249 while @mike_pence DECREASED unemployment in Indiana by 113826.... __HTTP__ _E_\nBoth of our New York hotels are on the Top Ten list of the most luxurious hotels in NYC... __HTTP__ Congrats to all! _E_\nWSJ covers Ride of Fame __HTTP__ _E_\nJust made the point at #NCGOPcon that we have to protect our border & I think everyone here knows nobody can build a wall like Trump! _E_\nObama was very disloyal to Wisconsin Democrats. @BarackObama never showed up to help them even though he (cont) __HTTP__ _E_\nI find it really hard to listen to @BarackObama's speeches. He doesn't have a clue. _E_\nIn the ridiculous @JebBush ad about me Jeb is speaking to me during the debate but doesn't allow my answer which destroys him SO SAD! _E_\n.@mcuban Mark okay with me but don't start your bullshit again! _E_\nChange before you have to. Jack Welch _E_\nMy interview yesterday with @MyFoxNY __HTTP__ _E_\nThe Miss Universe Pageant will be on August 23 (9 11 p.m. on NBC ET) with Bret Michaels and Natalie Morales to co host live from Las Vegas _E_\nLook great for Thanksgiving. 
Trump Signature Collection exclusively available @Macys offers top men's styles __HTTP__ _E_\nI'm in Moscow for Miss Universe tonight picking a winner is very hard they are all winners. Total sellout of arena. Big night in Russia! _E_\n.@THEGaryBusey as Project Manager... is Team Power in trouble?? #CelebApprentice _E_\nFirst Obama says Egypt is not an ally. Then he promises to keep handing over aid __HTTP__ Incompetent and unqualified. _E_\n#MakeAmericaGreatAgain I will be in Cedar RapidsIA this Saturday. Get your tickets __HTTP__ _E_\nThere's definitely no love lost between Piers and Omarosa. _E_\n.@KyleStephens30 #asktrump __HTTP__ _E_\n.@MannyPacquiao and friends at @TrumpDoral __HTTP__ _E_\nOur country is in a major crisis of incompetent leadership. We cannot continue to go on with these politicians who do nothing but talk. _E_\nGov. Cuomo's Moreland Comm should be looking at AG Schneiderman shaking down those under investigation/ in litigation for campaign $$$ _E_\n.@RollingStone admitted their scam. Phony @HuffingtonPost and others are no better total joke! _E_\nThe Fed is destroying the dollar. When inflation hits the economy then even more jobs will go overseas. _E_\nEveryone is talking about how Trump Tower is the exterior for Wayne Enterprises in Dark Knight Rises it's true. __HTTP__ _E_\nMark Udall was the deciding vote for ObamaCare & now 250000 Coloradans were dropped from their plans. Vote @CoryGardner! _E_\n#ObamacareFail #HillarycareFail __HTTP__ _E_\nJeb why did your brother attack and destabalize the Middle East by attacking Iraq when there were no weapons of mass destruction? Bad info? _E_\nEntrepreneurs: Be a cautious optimist. I call it positive thinking with a lot of reality checks. _E_\nThe only place where our border is protected is from Europeans. We educate them in our finest institutions & then have them deported. _E_\nThank you Vermont! #VoteTrumpVT __HTTP__ _E_\nThank you High Point NC! 
I will fight for every neglected part of this nation & I will fight to bring us together... __HTTP__ _E_\nA tough negotiator can make the Chinese back off. We've done it before. #TimeToGetTough __HTTP__ __HTTP__ _E_\nContinued success is built on building a brand people know will deliver. Unless you're @KarlRove. Then you just blame the Tea Party. _E_\nVia @Newsmax_Media by @JAGERFILE: \"Donald Trump: 'Morally Unfair' to Use Soldiers in Ebola Fight\" __HTTP__ _E_\nTHANK YOU! #VoteTrump __HTTP__ _E_\nEntrepreneurs: See yourself as victorious. Look at the solution not the problem. Be tough be strong be tenacious. _E_\nCharlie Hebdo reminds me of the satirical rag magazine Spy that was very dishonest and nasty and went bankrupt. Charlie was also broke! _E_\nChina is stealing our jobs. We need to demand China stop manipulating its currency and end its rampant corporate espionage. _E_\nOur country does not feel 'great already' to the millions of wonderful people living in poverty violence and despair. _E_\nDoesn't want to remove Assad worries what comes next. _E_\n.@DiamondandSilk Just watched you on #WattersWorld with a large group of people. Everybody loves you two amazing people! #Trump2016 _E_\nWatch this great behind the scenes video of @IvankaTrump's Spring 2013 photo shoot __HTTP__ _E_\nJoin me! #Trump20166/10: Richmond __HTTP__ Tampa __HTTP__ Pittsburgh __HTTP__ _E_\nThe reporter that called Kevin Durant Mr. Unreliable should be fired or at least apologize. He is a truly great player and a winner! _E_\nAnytime you see someone talking about celebrity weight loss on my twitter it is a total scam! _E_\nI appreciate the kind words of Mike Huckabee a fine American __HTTP__ _E_\nThe #2A to our Constitution is clear. The right of the people to keep & bear Arms shall not be infringed upon. __HTTP__ _E_\nThose who lack courage will always find a philosophy to justify it. Albert Camus _E_\nToday I signed an Executive Order @ the U.S. Dept. 
of @Interior: 'Review of Designations Under the Antiquities Act... __HTTP__ _E_\nIs President Obama trying to destroy Israel with all his bad moves? Think about it and let me know! _E_\nWatch the 63rd Annual @MissUniverse Pageant tomorrow on NBC at 8PM! __HTTP__ _E_\nThank you Wayne Root we will #MakeAmericaGreatAgain! __HTTP__ _E_\nCongratulations to Gabby Douglas on winning the Gold for the USA in gymnastics. She is terrific! _E_\nVery sad & dangerous that soon to be ex Intelligence Chair Dianne Feinstein released the CIA report. Glad she is losing her Comm. Chair. _E_\nThe National Border Patrol Council (NBPC) said that our open border is the biggest physical & economic threat facing the American people! _E_\nThe 2013 @NJPGA Course of the Year Trump Nat'l Bedminster is honored to be hosting the 2022 @PGAChampionship __HTTP__ _E_\nI just don't know why some of these NFL teams with lousy quarterbacks don't give Tim Tebow a chance what do they have to lose? _E_\nWhen you're \"hot\" the lowlifes really shoot at you... and they try hitting from every angle! Never let the bastards get you down. _E_\nUK is freezing through longest & coldest winter in over 50 years __HTTP__ Where's the global warming? @gatewaypundit _E_\nWatching the show. #WWEHOF __HTTP__ _E_\nOn top of the disrespect shown by Russia don't forget they still have Snowden who has given them (& everyone) massive US secrets. _E_\nWe.signed our deal to take over the historic Old Post Office on Pennsylvania Ave. from the U.S. and convert it into super luxury hotel jobs! _E_\n... I will soon start naming magazines that I think will fold I predicted Newsweek. _E_\nLibya is adopting a more radical form of Sharia Law now under their new leadership. Is this what @BarackObama wanted? 
_E_\nLucky for New York highly respected John Cahill is running for NY State AG against incumbent lightweight dope @AGSchneiderman @CahillForAG _E_\nIf you read my last number of tweets only one opinion can be formed that our President and therefore leader is grossly incompetent! _E_\nThe golden rule of negotiation: He who has the gold makes the rules. _E_\nWhen will Sleepy Eyes Chuck Todd and @NBCNews start talking about the Obama SURVEILLANCE SCANDAL and stop with the Fake Trump/Russia story? _E_\n.@KirschnerDavid @realDonaldTrump Congrats Mr. Trump on making @Forbes list of wealthiest in the world. Thanks! _E_\n...expect the country to be further downgraded in the future. The rich are all leaving! _E_\nI am being investigated for firing the FBI Director by the man who told me to fire the FBI Director! Witch Hunt _E_\nThe WGC @CadillacChamp leadership board is available here: __HTTP__ @DoralResort _E_\nAll those politicians in Washington and not one good negotiator. _E_\nChina is not our friend. They are not our ally. They want to overtake us and if we don't get smart and tough soon they will. _E_\nAs bad as Qaddafi was what comes next in Libya will be worse just watch. _E_\nBill Kristol has been wrong for 2yrs an embarrassed loser but if the GOP can't control their own then they are not a party. Be tough R's! _E_\nThank you Maine New Hampshire and Iowa. The waiting is OVER! The time for change is NOW! We are going to... __HTTP__ _E_\n.@SharkGregNorman @Trump_Charlotte Looking great love the improvements to the buildings and grounds! not to mention course Thank you. _E_\nThe Islamists are taking over Egypt through the election. __HTTP__ Why did @BarackObama force Mubarak out? He was an ally. _E_\nWhen someone attacks me I always attack back...except 100x more. This has nothing to do with a tirade but rather a way of life! _E_\nGraydon Carter whose reign over failing @VanityFair has been a disaster has acted in two movies both bombed & got bad reviews. 
_E_\nThe @BarackObama recovery US unemployment is 9.1% US underemployment is 19.1% __HTTP__ Businesses won't hire under Obama. _E_\nFor many years our country has been divided angry and untrusting. Many say it will never change the hatred is too deep. IT WILL CHANGE!!!! _E_\nLibyan Rebels should have given us 50% of the oil in return for our military support we don't even ask! _E_\n#CelebApprentice I will be live tweeting(no spoilers) during tonight's all new @ApprenticeNBC at 8PM ET. _E_\nAlabama will shine tomorrow. It will be a big and glorious day! _E_\n.@stuartpstevens made some of the dumbest political decisions of all time in helping Romney to get destroyed by Obama. Should have won! _E_\n.@Toure when you are fired from MSNBC for your bad ratings and racist coverage stop by and say hello. _E_\nThank you @SenatorSessions!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nI see where Mayor Stephanie Rawlings Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! _E_\nI see @FLGovScott poll numbers are improving. Good man doing a good job. _E_\nWe must stop outsourcing our jobs overseas and end our multi billion dollar trade deficits. _E_\nWow in the new CBS Poll I went way up into the forties! Thank you! _E_\nI will be making a big speech tomorrow to discuss the failed policies and bad judgment of Crooked Hillary Clinton. _E_\nIncredible progress at @trumptowerpde – Punta del Este Uruguay the views are going to be fantastic! __HTTP__ _E_\nVia @WalidShoebat: \"Watch Donald Trump: He Is Patriotic And He Can Fix America\" __HTTP__ _E_\nRemember as a senator Obama did not vote for increasing the debt ceiling __HTTP__ I guess things change when President?! _E_\nCalifornia gas prices going thru the roof others to follow. An election losing event for Obama. _E_\n...a real loser named Tim O'Brien and it's never recovered. 
_E_\nNaghmeh Abedini the lovely wife of the Christian Pastor Saeed being held in an Iranian jail just left my office. #savesaeed _E_\nWill be on @jimmykimmel in 20 minutes on @ABC. #Kimmel #Trump2016 #MakeAmericaGreatAgain _E_\nThe United States must immediately institute strong travel restrictions or Ebola will be all over the United States a plague like no other! _E_\nDemocrats jeopardizing the safety of our troops to bail out their donors from insurance companies. It is time to put #AmericaFirst _E_\nDo you think the three UCLA Basketball Players will say thank you President Trump? They were headed for 10 years in jail! _E_\n#MadeInAmerica📸 __HTTP__ __HTTP__ _E_\nA great day in Puerto Rico yesterday. While some of the news coverage is Fake most showed great warmth and friendship. _E_\nA lot of comments re @MELANIATRUMP vs. Milania last week. I think spelling has taken on a new significance. #CelebApprentice _E_\nI won the debate if you decide without watching the totally one sided spin that followed. This despite the really bad microphone. _E_\nIt's amazing that people can say such bad things about me but if I say bad things about them it becomes a national incident. _E_\nMy thoughts and prayers are with the @KissimmeePolice and their loved ones. We are with you!#LESM _E_\n#PeaceOfficersMemorialDay and#PoliceWeek Proclamation: __HTTP__ __HTTP__ _E_\nTrump right: Illegal families crossing border set to double 51152 so far __HTTP__ _E_\nJoe thanks for not running! __HTTP__ _E_\nA sneak peek at Sunday's episode of The Celebrity Apprentice... __HTTP__ #trumpvlog _E_\n.@KarlRove is a biased dope who wrote falsely about me re China and TPP. This moron wasted $430 million on political campaigns and lost 100% _E_\nVia @peoplemag by @amandamichl: \"@IvankaTrump: @Joan_RiversWas 'Very Warm' During Appearance on @ApprenticeNBC\" __HTTP__ _E_\n\"Don't toss off your problems and don't dwell on them either. 
Deal with them!\" – Think Like a Champion _E_\nVia David Ebner re Stanley Cup & Trump poster: \"If you're going to be thinking anything you might as well think big\" __HTTP__ _E_\nOne of @GolfWorldUS top private clubs @TrumpNationalNY features a Jim Fazio designed 7291 yd par 72 course __HTTP__ _E_\n.@ritter1025 Wishing your wife a Happy Birthday _E_\nStanding room only in Mason City Iowa! Thanks to the record crowd of over 400 supporters! __HTTP__ _E_\n.@karlrove's ad is the best thing that ever happened to Ashley Judd—simply increases her profile. _E_\nI was on @SquawkBox this morning __HTTP__ _E_\nI would like to wish everyone including all haters and losers (of which sadly there are many) a truly happy and enjoyable Memorial Day! _E_\nIt's important that we help poor people to become independent self sufficient individuals who gain the benefits of work. #TimeToGetTough _E_\nHe @newtgingrich is sounding more and more like a real team player...he is a really good guy! _E_\n.... I only respond to people that register more than 1% in the polls. I never thought he had a chance and I've been proven right. _E_\nJust got back from Wisconsin. Great day great people! _E_\nKarzai of Afghanistan is not sticking with our signed agreement. They are dropping us like dopes. Get out now and re build U.S.! _E_\nChina's top academics are working w/ PLA in cyber espionage of our state secrets & R&D __HTTP__ They are laughing at us! _E_\nCongratulations to @DavidWright of the #Mets. What a great season he is having batting over 400 and clutch hitting. Also a fantastic guy. _E_\nWhy the nation's debt keeps growing a Dept of Agriculture employee made over $242K with a $63K bonus __HTTP__ Ridiculous. _E_\nIf Karl Rove & @GOP Establishment continue to attack the Tea Party who delivered in 2010 then there will be a 3rd Party in 2016. _E_\nWill be on Fox & Friends at 7.00 this morning ENJOY! _E_\nOn Jimmy Fallon tonight. _E_\nRemember when you vote Obamacare is a disaster! 
_E_\nGet ready this should be informative and fun! #VPDebate _E_\nTrump Int'l Golf Club Turnberry Scotland. A legendary course ... and rightly so. __HTTP__ _E_\nGolf is a brain game & is a great way to improve your business skills. Concentrationassessment technique & passion...it's all there. _E_\nThe Fed continues to recklessly flood the market with dollars. This will eventually create record inflation. It has to stop. #TimeToGetTough _E_\nWhat a night! 10000 amazing supporters in Greenville South Carolina! THANK YOU!VOTE on Saturday! #VoteTrumpSC __HTTP__ _E_\nHad @SenScottBrown asked me to do a robo call for him I would have done it and he would have won. _E_\nHaim Saban: Hillary Clinton's Top Hollywood Donor Demands Racial Profiling of Muslims __HTTP__ _E_\nA letter to @CNN President Jeff Zucker __HTTP__ _E_\nIt would be really nice if the Fake News Media would report the virtually unprecedented Stock Market growth since the election.Need tax cuts _E_\nThanks Dave! __HTTP__ _E_\nJay Sekulow on @foxandfriends now. _E_\nThrilled to hear that @RakutenTravelJP has awarded @TrumpWaikiki the 'Rakuten Diamond Award' for the 4th consecutive year! Congrats! _E_\nHere's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_\nEntrepreneurs: Remember the golden rule of negotiating he who has the gold makes the rules. _E_\nRT @realDonaldTrump: Thank you to our GREAT Military/Veterans and @PacificCommand.Remember #PearlHarbor. Remember the @USSArizona!A day... _E_\nIt's springtime and it just started snowing in NYC. What is going on with global warming? _E_\nMany NATO countries have agreed to step up payments considerably as they should. Money is beginning to pour in NATO will be much stronger. _E_\nMade additional remarks on Charlottesville and realize once again that the #Fake News Media will never be satisfied...truly bad people! 
_E_\nThe lobbyist and political hack that President Obama just appointed as the Ebola Czar just missed his first major meeting on Ebola A joke _E_\nI was on The View this morning. We talked about The Apprentice. Tonight's episode is a great one tough exciting and surprising. 10 pm/NBC _E_\nToday it was my pleasure and great honor to announce my nomination of Jerome Powell to be the next Chairman of the @FederalReserve. __HTTP__ _E_\nVia @TheStreet by @swan_investor: Trump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_\nGreat defense by the @nyjets this weekend—congratulations to @woodyjohnson4—only 6 points allowed! _E_\nwanting to sell their product cars A.C. units etc. back across the border. This tax will make leaving financially difficult but..... _E_\nSome day when things calm down I'll tell the real story of @JoeNBC and his very insecure long time girlfriend @morningmika. Two clowns! _E_\n.@SheriffClarke Great insight in dealing with the media today. You are a wonderful representative of calm and reason a real pro! _E_\nDonald Trump's commercial free WWE Raw does big rating: __HTTP__ _E_\n.@antbaxter I tried watching but fell asleep. _E_\nThere are many editorial writers that are good some great & some bad. But the least talented of all is frumpy Gail Collins of NYTimes. _E_\nHillary Clinton may be the most corrupt person ever to seek the presidency. Donald J. Trump _E_\nGreat news! Thank you Governor Ralph DLG Torres! #Trump2016 __HTTP__ _E_\nThank you to our amazing Wounded Warriors for their service. It was an honor to be with them tonight in D.C.... __HTTP__ _E_\nShirley B did a very good job singing Goldfinger! Not easy. _E_\nSo I have spent almost nothing on my run for president and am in 1st place. Jeb Bush has spent $59 million & done. Run country my way! _E_\nOne of the most accurate polls last time around. But #FakeNews likes to say we're in the 30's. They are wrong. Some people think numbers could be in the 50's. 
Together WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nDo you think I made the right decision? #CelebApprentice _E_\n\"To succeed one must be creative and persistent.\" John H. Johnson _E_\nYou can be an @nfl player with murder charges and not be suspended. Yet with NO EVIDENCE @nfl targeted Tom Brady. B.S.! _E_\nObama is finally stopping the Chinese from buying something in America – windfarms __HTTP__ What a joke! _E_\nMany meetings today in Bedminster including with Secretary Linda M and Small Business. Job numbers are looking great! _E_\nLearn work and think in equal proportions and you'll be going in the right direction. _E_\nHere we go again via @timesunion.com __HTTP__ ... another bad deal. _E_\nThe first ever All Star Celebrity @ApprenticeNBC premieres Sunday March 3rd! __HTTP__ _E_\nThis is a crossroads in the history of our civilization that will determine whether or not We The People reclaim c... __HTTP__ _E_\nI am deeply committed to preserving our strong relationship & to strengthening America's long standing support for... __HTTP__ _E_\nTo be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_\nTrump Tower at Century City brings luxury to Makati the financial & social capital of Philippines __HTTP__ _E_\nIt is finally happening for our great clean coal miners! __HTTP__ _E_\nThings will work out fine between the U.S.A. and Russia. At the right time everyone will come to their senses & there will be lasting peace! _E_\nAmazing that while I lead by big numbers in the new Q and and USA Today polls the the press only wants to report on the phony WSJ/NBC poll. _E_\n... will happen when you go against the tide when you take a risk and it works. Think Big _E_\n\"Failure is simply the opportunity to begin again this time more intelligently.\" Henry Ford _E_\n.... to do The Apprentice but I approved you anyway. Without my show you'd be nothing! _E_\n...design or negotiations yet. 
When I do just like with the F 35 FighterJet or the Air Force One Program price will come WAY DOWN! _E_\nThank you @FrankLuntz for saying I was a winner tonight. It is my great honor. #Trump2016 _E_\nJeb has been confused for forty years __HTTP__ _E_\nGood news. Voters give @MittRomney the edge over @BarackObama on handling the economy according to @gallupnews __HTTP__ _E_\nI will be on Fox & Friends at 7 A.M. 10 minutes. Much to talk about enjoy! _E_\nTHANK YOU @MayorGimenez for following the RULE OF LAW! Sanctuary cities make our country LESS SAFE! Full remarks: __HTTP__ __HTTP__ _E_\nRT @TeamTrump: RT if you agree @HillaryClinton & @timkaine are WRONG for America! #VPDebate #MAGA __HTTP__ _E_\nI am going to keep our jobs in the U.S. and totally rebuild our crumbling infrastructure. Crooked Hillary has no clue! @Teamsters _E_\nNew polls are good because the media has deceived the public by putting women front and center with made up stories and lies and got caught _E_\n#FlashbackFriday #CrookedHillary __HTTP__ _E_\nWhy would anyone in Kentucky listen to failed presidential candidate Rand Paul re: caucus. Made a fool of himself (1%.)KY his 2nd choice! _E_\nThe Donald J. Trump Signature Collection exclusively available @Macys offers the finest style in menswear __HTTP__ _E_\nThe shale boom is saving our economy __HTTP__ Good for jobs national security & trade balance. Frack Now & Frack Fast! _E_\n...In other words Secy John Kerry is so out of his element... _E_\n'Presidential Executive Order on Identifying and Reducing Tax Regulatory Burdens' Executive Order:... __HTTP__ _E_\nTHE HARDER YOU WORK THE LUCKIER YOU GET! _E_\nI am really beginning to respect Mark Halperin and John Heilemann as political reporters they truly get why Trump poll numbers are high. _E_\nIncompetent Hillary despite the horrible attack in Brussels today wants borders to be weak and open and let the Muslims flow in. No way! _E_\nChina has a business tax rate of 15%. 
We should do everything possible to match them in order to win with our economy. Jobs and wages! _E_\nThe $85B sequester is just 2% of Obama's $3.5T record deficit spending budget. Our leaders are ruining our children's future. _E_\nShould I do the #GOPdebate? __HTTP__ _E_\n\"Success depends...on how effectively you learn to manage the game's two ultimate adversaries: the course and yourself.\" @jacknicklaus _E_\nAll because of me people don't care about you Cher. @cher My week on twitter 1k retweets 29 new listings 15k new followers 2k mentions. _E_\nIn analyzing the Alabama Primary race Fake News always fails to mention that the candidate I endorsed went up MANY points after Election! _E_\nBarack Obama has everything to gain. Why would anyone ever deny $5M to charity? _E_\nSorry there is no STAR on the stage tonight! _E_\n\"I succeeded by saying what everyone else is thinking.\" @Joan_Rivers _E_\nOur law enforcement officers deserve our appreciation for the incredible job they do. Video: __HTTP__ __HTTP__ _E_\n\"It's not that I'm so smart it's just that I stay with problems longer.\" Albert Einstein _E_\nI am a registered Republican. __HTTP__ With @MittRomney as the nominee we can defeat @BarackObama. _E_\nDemocrats don't want massive tax cuts how does that win elections? Great reviews for Tax Cut and Reform Bill. _E_\nOur country's debt crisis cannot be solved by tax increases. We must cut government spending. _E_\nCentral America's tallest building @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_\nDon't let them build a wind turbine in your backyard (or near your house). It will destroy your property value. _E_\nMore poll results from last nights Commander in Chief Forum. #AmericaFirst #TrumpTrain __HTTP__ _E_\nAppreciate the congrats for being right on radical Islamic terrorism I don't want congrats I want toughness & vigilance. We must be smart! 
_E_\nGreat to see @MittRomney being well received in Poland __HTTP__ The Poles understand the value of freedom through strength _E_\nDems had a very good and professional convention. The Republicans must be smart and tough and fast! _E_\nSo many people who know nothing about me are commenting all over T.V. and the media as though they have great D.J.T. insight. Know NOTHING! _E_\nCalifornia shooting looks very bad. Good luck to law enforcement and God bless. This is when our police are so appreciated! _E_\n\"If you like your healthcare plan you can keep it.\" = \"I was born in Hawaii.\" _E_\nAnother false story this time in the Failing @nytimes that I watch 4 8 hours of television a day Wrong! Also I seldom if ever watch CNN or MSNBC both of which I consider Fake News. I never watch Don Lemon who I once called the \"dumbest man on television!\" Bad Reporting. _E_\nI win awards for speaking but the enemies either won't comment or will say only bad...leave Clint alone! _E_\nThank you @EricTrump! __HTTP__ _E_\nNo investor would be stupid enough to pour their money into the bottomless Vattenfall pit. They totally gave up __HTTP__ _E_\n'It's just a 2 point race Clinton 38% Trump 36%' __HTTP__ _E_\nThe American work ethic is what led generations of Americans to create our once prosperous nation. (cont) __HTTP__ _E_\nHe @MittRomney had another impressive win last night in Illinois. His delegate lead is insurmountable. It is (cont) __HTTP__ _E_\n.@TrumpChicago is the Windy City's sole skyscraper to feature a 4 star hotel 4 star restaurant & spa __HTTP__ _E_\nRadical Islamic Terrorism must be stopped by whatever means necessary! The courts must give us back our protective rights. Have to be tough! _E_\nOver $1T in annual deficit spending and adding over $6T to the debt for what? May jobless numbers are horrendous. The great Obama recovery. _E_\nA working dinner tonight with Prime Minister Abe of Japan and his representatives at the Winter White House (Mar a Lago). 
Very good talks! _E_\nWhere were all the @VanityFair exposes on When Rev. Wright disciples go to Washington? Sad! _E_\nJonah Goldberg @JonahNRO of the once great @NRO #National Review is truly dumb as a rock. Why does @BretBaier put this dummy on his show? _E_\nShoplifting is a very big deal in China as it should be (5 10 years in jail) but not to father LaVar. Should have gotten his son out during my next trip to China instead. China told them why they were released. Very ungrateful! _E_\nI hope everyone had a great Memorial Day! _E_\nAmericans never quit. General Douglas MacArthur _E_\nI'll be going to the Old Post Office Building on Pennsylvania Avenue in D.C. today. Will create one of world's great hotels. Lots of jobs! _E_\n7 of 10 Americans prefer 'Merry Christmas' over 'Happy Holidays' __HTTP__ No surprise. _E_\nI wish tonight's debate would cover more than foreign policy. _E_\nRT @RightlyNews: What's a high priced Clinton attorney doing representing a low level IT staffer for the Democrats? @jessebwatters on t... _E_\nCheck out Trump International Hotel & Tower New York spectacular! __HTTP__ _E_\nIf you're interested in 'balancing' work and pleasure stop trying to balance them. Instead make your work more pleasurable. _E_\nThe Miss Universe Pageant raked in some great ratings! A great job by everyone. _E_\n...Senators should focus their energies on ISIS illegal immigration and border security instead of always looking to start World War III. _E_\nOne of @GolfWorldUS' top public courses @TrumpGolfLA's course stands as a testament to the greatness of golf __HTTP__ _E_\nRead this @BarackObama's birth certificate cannot survive judicial scrutiny because of phantom numbers __HTTP__ _E_\nLooks like yet another terrorist attack. Airplane departed from Paris. When will we get tough smart and vigilant? Great hate and sickness! _E_\nIt's often necessary to boast but it's even better if others do it for you.\" – Think Like A Billionaire _E_\nGreat new poll. 
Thank you Texas! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_\n\"On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military estimates the remaining 1000 or so fighters occupy roughly 1900 square miles...\" via @jamiejmcintyre __HTTP__ _E_\nStatement on House Passage of Kate's Law and No Sanctuary for Criminals Act. __HTTP__ _E_\nICYMI you can watch my full press conference with @SteveKingIA on @shanevanderhart's @CaffThoughts __HTTP__ _E_\n\"Even such traits as who makes the most eye contact in conversation can be an indication of who seeks to dominate.\" Think Like A Billionaire _E_\nWhat a great day it was yesterday showing the public Trump Links at Ferry Point. I took over a disaster and made it GREAT! Good job to all! _E_\nVia @thestate by @andyshain: \"Donald Trump joins other 2016 prospects speaking at SC Tea Party Convention\" __HTTP__ _E_\n...Bad decisions can be devastating. _E_\n\"In N.H. Trump says his business experience would play well in government\" __HTTP__ via @ConMonitorNews by @AP _E_\nMy prayers and condolences to the families of the victims of the terrible Florida shooting. No child teacher or anyone else should ever feel unsafe in an American school. _E_\nA vote to CUT TAXES is a vote to PUT AMERICA FIRST. It is time to take care of OUR WORKERS to protect OUR COMMUNITIES and to REBUILD OUR GREAT COUNTRY! __HTTP__ __HTTP__ _E_\nUnder @MittRomney Bain had an 80% success rate with annual returns of over 50%. Under @BarackObama America has added over $6T in debt. _E_\nObamacare has to be killed now before it grows into an even bigger mess as it inevitably will. #TimeToGetTough _E_\nTrump: Something 'mentally wrong' with Weiner __HTTP__ via @hilltube by @DanielStrauss4 _E_\nThey changed the name global warming to climate change because the concept of global warming just wasn't working! 
_E_\nNew National Rasmussen Poll: __HTTP__ _E_\n\"Winning takes talent to repeat takes character.\" John Wooden _E_\nJust as I have been predicting for years Iraq will fall to the people that hate the U.S. the most just outside of Baghdad. Keep the oil _E_\nCrazy Joe Scarborough and dumb as a rock Mika are not bad people but their low rated show is dominated by their NBC bosses. Too bad! _E_\nWhat people don't know about Kasich he was a managing partner of the horrendous Lehman Brothers when it totally destroyed the economy! _E_\nI have not heard any of the pundits or commentators discussing the fact that I spent FAR LESS MONEY on the win than Hillary on the loss! _E_\nGood morning Wisconsin! The polls are now open! #VoteTrump today & we will MakeAmericaGreatAgain! __HTTP__ _E_\nGreat decision by @SpeakerBoehner in placing @TGowdySC as chairman of the Benghazi select committee. Gowdy is a seasoned prosecutor. _E_\n\"Discovery breeds discovery as in success breeds success. Questions are thoughts with a quest.\" – Think Like a Champion _E_\nICYMI my speech this past Monday at the South Carolina Tea Party Convention in Myrtle Beach __HTTP__ #SCTeaParty15 _E_\nBenghazi is now a full blown training center for jihadists __HTTP__ Congratulations to the Obama administration. _E_\nDo you think Crooked Hillary will finally close the deal? If she can't win Kentucky she should drop out of race. System rigged! _E_\nCongratulation to Adam Scott and all of the folks at Trump National Doral on producing a really great WGC Tournament. Amazing finish! _E_\nNew York State's lightweight A.G. is driving business & jobs out of N.Y. Look into his past he shouldn't even be allowed to hold office! _E_\n\"He who defends everywhere defends nowhere.\" – Sun Tzu _E_\nWhen people find out how bad a job Scott Walker has done in WI they won't be voting for him. Massive deficit bad jobs forecast a mess. _E_\nJoin me in Manheim Pennsylvania on Saturday at 7pm! 
#TrumpRallyTickets: __HTTP__ __HTTP__ _E_\nWhen you have exhausted all possibilities remember this you haven't. Thomas A. Edison _E_\nReport raises questions about 'Clinton Cash' from Russians during 'reset' __HTTP__ _E_\n#CelebApprentice fans watch today's #trumpvlog __HTTP__ to find out about our new App __HTTP__ _E_\nA family in Las Vegas just stopped a violent home invasion by shooting one of the perpetrators the other fled and will be captured. Great! _E_\n.....Guy in front asked for picture said he was the biggest fan never saw the guy in back. _E_\nI spoke with Fox and Friends today watch here __HTTP__ _E_\nThe Trump Signature Collection exclusively available @Macys offers high end fashion for men. Dress your best. __HTTP__ _E_\n#sweepstweet @lisalampanelli wins $100000 for her charity and that's a nice gift. _E_\nThe Prayer Breakfast was used by @BarackObama to say that the Bible commands higher income taxes. That's not the way it is! _E_\nToday is April 15th Obama's favorite day of the year. T E A. TAXED ENOUGH ALREADY! _E_\nThe attack on Mosul is turning out to be a total disaster. We gave them months of notice. U.S. is looking so dumb. VOTE TRUMP and WIN AGAIN! _E_\nA woman is suing one of my businesses despite the fact that she loved her classes. Our legal system is a mess. Watch __HTTP__ _E_\nMy daughter @IvankaTrump will be on @Greta tonight at 7pm. Enjoy! __HTTP__ _E_\nGreat story on @TrumpToronto in @globeandmail about our new Sky and Wellness Suites: __HTTP__ _E_\nMr. President tell Iran to immediately free the CHRISTIAN PASTOR as a sign of good faith & if they refuse break off talks big sanctions _E_\nRe Negotiation: Know what you want & think about what the other side wants. Know where they're coming from & do not underestimate them. _E_\nDonald Trump CPAC Speech: U.S. 
Is Run By 'Very Stupid People' __HTTP__ via @HuffPostPol by @elisefoley _E_\nFor all of the morons who have been complaining about my comment on sexual assault & rape in the military (cont) __HTTP__ _E_\nCelebrity Apprentice on in 5 minutes on CNBC it's great! _E_\nI think Joe Biden made correct decision for him & his family. Personally I would rather run against Hillary because her record is so bad. _E_\nNo complaints but how many people would be watching these really dumb but record setting debates if I wasn't in them? Interesting question! _E_\nThank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe news about our beautiful Miss Venezuela Monica Spear is devastating to all who knew her. A spectacular woman she will be missed. _E_\nBe tough be smart be personable but don't take things personally. That's good business. _E_\n.@BenFergusonShow just watched you on @CNN. Thank you for your nice comments. _E_\nCongratulations to Bret Michaels the new Celebrity Apprentice. Bret's a true champion all of us were happy to see him and to see him win! _E_\nMy @foxandfriends int. on Benghazi cover up the ObamaCare mess & firing @TheRealMarilu on @ApprenticeNBC __HTTP__ _E_\nEnthusiasm is a vital element in individual success. ― Conrad Hilton _E_\nWhile everyone is waiting and prepared for us to attack Syria maybe we should knock the hell out of Iran and their nuclear capabilities? _E_\nHAPPY THANKSGIVING! __HTTP__ _E_\nI will be doing @foxandfriends at 8.00 a.m. _E_\nVia @TheYBF: \"@msvivicafox Attends A Private Screening + Donald Trump DONATES $25K To @peachespulliam'S Kamp Kizzy\" __HTTP__ _E_\n.@CNBC Titans: Donald Trump' is available to live stream on @netflix and @hulu. Watch! _E_\nMain Street is BACK! Strongest Holiday Sales bump since the Great Recession beating forecasts by BILLIONS OF DOLLARS. __HTTP__ _E_\nToday it was my honor to welcome President Nursultan Nazarbayev of Kazakhstan to the @WhiteHouse! 
__HTTP__ _E_\nJust returned from Ireland Scotland and Dubai. Amazing trip great places but always good to be back. _E_\n.@BarackObama's Super PAC has continually called @MittRomney a murderer __HTTP__ Ironic since Obama is destroying Medicare. _E_\nJoy Behar who was fired from her last show for lack of ratings is even worse on @TheView. We love Barbara! _E_\nMUST READ – via @IBDinvestors: \"VA Scandal Grows As Bonuses Went To Worst Hospitals\" __HTTP__ _E_\nI believe @BarackObama made a deal with the Saudis to increase oil production until after the election. Then (cont) __HTTP__ _E_\nChina is expanding its military bases abroad. We must expand our naval fleet. Now is no time for defense cuts. (cont) __HTTP__ _E_\nA great honor to spend time with our brave HEROES at the @USMC Air Station Yuma. THANK YOU for your service to the United Staes of America! __HTTP__ _E_\nBig win in Montana for Republicans! _E_\nJust as I predicted ObamaCare is a complete disaster which is failing on its own. May never be fully implemented. _E_\n.@TrumpNationalNY a great place! __HTTP__ @TrumpGolf _E_\nI never gave anybody hell! I just told the truth and they thought it was hell. Harry S. Truman _E_\nIt's Tuesday. How many terrible predictions and advice will Karl 1.6% Rove make today? _E_\nEntrepreneurs: Ask yourself: What can I provide that does not yet exist? Be open to new ideas. Be innovative! _E_\nBig announcement by Ford today. Major investment to be made in three Michigan plants. Car companies coming back to U.S. JOBS! JOBS! JOBS! _E_\n...Never let yourself be pushed around but treat the good folks great. _E_\nLoved doing the debate last night on @CNBC. Check out all of the polls! Everyone agrees that Harwood bombed! _E_\nWar on the families. Price of electricity hit record high in October __HTTP__ Terrible especially during holiday season. _E_\n$5 a gallon gas and we have yet to approve the Keystone XL Pipeline. OPEC is laughing at us. 
_E_\nFlashback from October 2013: \"Donald Trump demands larger iPhone screen\" __HTTP__ You're welcome! Apple listened. _E_\n.@HillaryClinton has been a foreign policy DISASTER for the American people. I will #MakeAmericaStrongAgain #Debate... __HTTP__ _E_\nLooks like another great day for the Stock Market. Consumer Confidence is at Record High. I guess somebody likes me (my policies)! _E_\nThe Blue Monster at Trump National Doral. __HTTP__ _E_\nThe U.S. is going to substantialy reduce taxes and regulations on businesses but any business that leaves our country for another country _E_\nI WILL BE ON @foxandfriends AT 7:30 NOW! _E_\nThey should have got Darrell Hammond as the Donald Trump impersonator. #CelebApprentice _E_\nGreat article by Chris Ruddy @Newsmax_Media: @AnnDRomney and Jackie's Example __HTTP__ _E_\nWatch my speech at CPAC in Washington DC yesterday ... __HTTP__ _E_\nRead about how this hotel came into being in my book \"Never Give Up\"—it's quite a story. #CelebApprentice @TrumpNewYork _E_\nObama's speech on climate change was scary. It will lower our standard of living and raise costs of fuel & food for everyone. _E_\nOne hit wonder @DannyZuker I notice you are not disputing all of the failures that I said you had. Let's talk about it! _E_\n#CrookedHillary \"was at center of negotiating $12M commitment from King Mohammed VI of Morocco\" to Clinton Fdn. __HTTP__ _E_\nMy @USATOpinion piece: Trump: I don't need to be lectured __HTTP__ _E_\nWe should be building up our military and our missile defense systems to their highest levels ever. Must be very strong to prosper & survive _E_\nAs hard as it is to believe sexting pervert Anthony Weiner is leading in some polls for Mayor of NYC. _E_\n.@JordanSpieth Great job you are a true champion! See you soon. _E_\nwill only get higher. Car companies and others if they want to do business in our country have to start making things here again. WIN! 
_E_\nThe time has come to take action to IMPROVE access INCREASE choices and LOWER COSTS for HEALTHCARE! __HTTP__ __HTTP__ _E_\nPrayers and condolences to all of the families who are so thoroughly devastated by the horrors we are all watching take place in our country _E_\nRT @billoreilly: FNC dominated ratings last night. MSNBC disaster demonstrating folks don't trust the network. __HTTP__ toni... _E_\nWhen Ted Cruz quits the race and the field begins to clear I will get most of his votes no problem! _E_\nPerhaps this is the kind of thinking we need in Washington ... __HTTP__ _E_\nWow! Letterman show @Late_Show won the ratings last night big time and guess who was his guest? DJT _E_\nCongratulations to Obama and the @DNC. The federal deficit has topped $1T for a fourth year in a row __HTTP__ Nice work! _E_\nWinning isn't everything but the will to win is everything. Vince Lombardi _E_\nToday will be a great day at work have only one word in mind VICTORY! _E_\nLook here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and destroyed jobs #TimeToGetTough _E_\nDine with The Donald and Mitt __HTTP__ _E_\nThank you @RepReneeEllmers! __HTTP__ __HTTP__ _E_\nWill be on @oreillyfactor tonight. Signing a copy of Crippled America for Bill! __HTTP__ _E_\nThe evidence continues to mount against lightweight @AGSchneiderman. It is time for JCOPE and Moreland Commissions to act. _E_\n\"Being true to yourself...will give you a lot of power over any negatives thrown your way.\" – Midas Touch _E_\nApril is Autism Awareness Month join me in raising awareness get your \"Light It Up Blue\" sign here! #LIUB __HTTP__ _E_\nHoward Stern will do a great job on @America'sGotTalent. He's very smart and really gets what talent is. 
@HowardStern _E_\nVia @AmSpec by @JeffJlpa1: Exclusive: Trump Says Obama Shows 'Total Desperation' on Iran __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n.@antbaxter Only the stupid @BBC would air your garbage—no wonder they are in such deep trouble. _E_\nA lot of people have imagination but can't execute you have to execute with the imagination. Donald J. Trump __HTTP__ _E_\nNumerous polls have me beating Hillary Clinton. In a race with her voter turnout will be the highest in U.S. history I get most new voters! _E_\nRT @mike_pence: History teaches us that weakness arouses evil. America needs to be strong on the world stage. #VPDebate __HTTP__ _E_\n.@Morning_Joe is so off on Iowa which I am leading big in new @CNN poll. I will win Iowa. Also I beat Hillary easily! _E_\nWell the year has officially begun. I have many stops planned and will be working very hard to win so that we can turn our country around! _E_\n.@jessebwatters Watching your show from Arizona where we just had a big rally. It is fantastic everybody loves it!#MakeAmericaGreatAgain _E_\nObama and the Democrats are laughing at the deal they just made...the Republicans got nothing! _E_\nObama's budget spends $2B making our navy ships algae powered __HTTP__ The strong world is laughing at us. _E_\n\"My view is that not only has Trump been vindicated in the last several weeks about the mishandling of the Dossier and the lies about the Clinton/DNC Dossier it shows that he's been victimized. He's been victimized by the Obama Administration who were using all sorts of....... _E_\nI've just released my position papers on The Second Amendment. __HTTP__ _E_\nFast and Furious put semi automatics in the hands of Mexican drug lords that killed Americans @BarackObama should answer all questions. _E_\nChina has been unfairly subsidizing the export of cars & auto parts. I've been saying this for 3 years... 
_E_\n\"@Algemeiner Honors @Joan_Rivers Donald Trump @YuliEdelstein at Second Annual 'Jewish 100′ Gala\" __HTTP__ via @Algemeiner _E_\nDon Jr. will present the Keynote Address in South Africa on Dec. 1st @TheInvestShow _E_\nI look forward to being in Lowell Massachusetts today. I hear a very big crowd is expected we will have lots of fun! _E_\nOnly 15 days until ObamaCare is implemented. Congress must waive the monstrosity for regular Americans. Why should they be punished? _E_\nMy sons Don and Eric are on @foxandfriends now 7:35. Great kids enjoy! _E_\nGetting the support of @DanaWhite of UFC means a lot. A total winner who has done an amazing job. Just ordered his fight to watch tonight! _E_\nShining over Fifth Avenue @TrumpTowerNY (a NY icon) offers a full service restaurant bar cafe ice cream parlor and Gucci. _E_\n\"It's a tough game and you never want to take that aspect out of the game.\" – @NYRangers Stanley Cup Champion Mark Messier _E_\nBetter off? The $16T US debt works out to $136260 per household a 50% increase since @BarackObama took office. _E_\nTune in tonight at 9 pm on TV One for The Ultimate Merger starring the one and only Omarosa and twelve brave bachelors ... _E_\nTune in for #TrumpTuesday on @SquawkCNBC tomorrow morning. _E_\n.@foxandfriends in five minutes. Enjoy! _E_\nSgt.Thamooressi has been held in Mexico for 115 Days. Mexico has zero respect for our border & our servicemen. Boycott! #freeourmarine _E_\nThe Veterans of our country have been treated like third class citizens for many years... _E_\nMany people have said I'm the world's greatest writer of 140 character sentences. _E_\n\"Build up your weaknesses until they become your strong points.\" Knute Rockne _E_\nMy @foxandfriends interview discussing Chuck Hagel nomination Republicans terrible deal making & where we go next __HTTP__ _E_\nInterview with David Muir of @ABC News in 10 minutes. Enjoy! 
_E_\nVia @BreitbartNews: GAME ON: TRUMP RESPONDS TO JEB __HTTP__ _E_\nRemember save your evening to watch Celebrity Apprentice tonight at 9 increased to a full two hours great episode watch Gary B. _E_\n\"Age is whatever you think it is. You are as old as you think you are.\" @MuhammadAli _E_\n40 days until the election. Crunch time. @MittRomney must stay on offense and take the fight to Obama. _E_\nAnnounced 3 years ago that Scottish course would close in winter like Kingsbarns and others too cold. _E_\nThe shirts and ties at Macy's are so good beautiful and do so well that guys like the one that sued me wrongly want a piece l kicked his ass _E_\nRT @APCampaign:Trump to Obama: $5 million donation to charity if you release passport and college records __HTTP__ #Election2012 _E_\nSee you in Arizona on Friday and Saturday. __HTTP__ _E_\nLearn more about @TrumpIntRealty's @Mgriffithnyc and some of our spectacular real estate in NYC __HTTP__ _E_\nHow many more of our soldiers have to be shot by the Afghanis they are training? Let's get the hell out of there and focus on U.S. _E_\nBernie Madoff and Tony La Russa in today's #trumpvlog... __HTTP__ _E_\nRT @RealBHorowitz: @VinceMcMahon @realDonaldTrump @WWE My two favorite billionaires! _E_\nBy @BarackObama's design the middle class will be hit with record taxes under ObamaCare through inflation __HTTP__ REPEAL! _E_\nChina is going to complete 59 new theme parks by 2020 over $23B in expansion. That would take over 100 years in our country. _E_\nCongratulations to @IvankaTrump on being named @FoxNewsSunday Power Player of the Week. Ivanka is doing a great job w/ DC Post Office. _E_\nFlorida Power & Light has disgusting rotting utility poles outside Doral in Miami. They should put in new ones or will be sued. _E_\nYet another weak hit by a candidate with a failing campaign. Will Jeb sink as low in the polls as the others who have gone after me? _E_\nThe people of Scotland have spoken—a great decision. 
I wish @AlexSalmond well & look forward to playing golf with him at Aberdeen! _E_\n.@EdGoeas thank you for your support tonight on @JudgeJeanine. _E_\n.@FrankLuntz is a total clown. Has zero credibility! @FoxNews @megynkelly _E_\nCongratulation to Roy Moore and Luther Strange for being the final two and heading into a September runoff in Alabama. Exciting race! _E_\n.@marthamaccallum Martha great interview with my son @EricTrump smart tough & professional. Thank you! @FoxNews _E_\nMy golf club @TrumpNationalNY in Westchester a great place! __HTTP__ _E_\n#sweepstweet @3nVMusic I very much rely on my own 'take' of the situation and people involved. My instincts (cont) __HTTP__ _E_\nCongressman John Lewis should spend more time on fixing and helping his district which is in horrible shape and falling apart (not to...... _E_\nVery few people read the National Review because it only knows how to criticize but not how to lead. _E_\nBy the end of this year China will be the number one economic power on earth and the U.S. will owe 20 trillion dollars much of it to China! _E_\nAmtrak crash near Philadelphia train derails many hurt some badly. Our country has horrible infrastructure problems. Pols can't solve! _E_\nGreat job Kevin we are all proud of you! __HTTP__ _E_\nDon't miss my Fabulous World of Golf now in its second season on Golf Channel beginning tonight at 9 pm ET __HTTP__ _E_\n#MerryChristmas __HTTP__ _E_\nThe GOP primary schedule is a disaster. Not enough time. _E_\nWhat do we get from our economic competitor South Korea for the tremendous cost of protecting them from North Korea? NOTHING! _E_\nBe focused be disciplined be patient there are very few cases of instant gratification. _E_\nFreedom is never more than one generation away from extinction. Ronald Reagan #MakeDCListen #DefundObamaCare _E_\nIf the U.S. does not win this case as it so obviously should we can never have the security and safety to which we are entitled. Politics! 
_E_\nNominating Chuck Hagel for SOD is the wrong move for Obama. He doesn't need the fight. Too much political capital will be wasted. _E_\nEntrepreneurs: Believe in yourself. If you don't no one else will either. _E_\n.@ericbolling did a fantastic job on O'Reilly tonight. Way to go Eric! _E_\nEntrepreneurs: Set the example and you'll be a magnet for the right people. Great leaders determine the teams they assemble. _E_\nWhen I look at all of the money the special interests and lobbyists are giving to candidates beware the candidates are mere puppets $$$$! _E_\nWow really nice and unexpected from Ed Schultz. Thank you Ed! @edshow __HTTP__ _E_\nI would like to wish everyone A HAPPY AND HEALTHY NEW YEAR. WE MUST ALL WORK TOGETHER TO FINALLY MAKE AMERICA SAFE AGAIN AND GREAT AGAIN! _E_\nSEAL who shot Bin Laden is unemployed & can't feed his family __HTTP__ Everyone can get welfare but this SEAL can't eat! _E_\nHeading over to the Miss USA Pageant. The young women participating are amazing and accomplished. Competition is very tough. ENJOY THE SHOW! _E_\nGreat meeting with @NaghmehAbedini the wonderful wife of Christian Pastor Saeed who is in Iranian prison. #savesaeed __HTTP__ _E_\nJust arrived in Las Vegas for a packed house speech tomorrow. Big poll results today Leading big everywhere. MAKE AMERICA GREAT AGAIN! _E_\nI'll bet Obama goes down just like Washington because he doesn't use our(this country's) best people to win. _E_\nHeading to South Carolina really big crowd! Will be back in New Hampshire tomorrow.#MakeAmericaGreatAgain _E_\n.@seanhannity Carly whose campaign is dead is making false statements about me in order to salvage hope! Sad. _E_\n\"There is a point in every contest when sitting on the sidelines is not an option.\" Dean Smith _E_\nI got George Zimmerman right watch __HTTP__ _E_\nRomney campaign used me in 6 primary states and won every one they should have used me in Florida and Ohio & he would be President. 
_E_\n.@AP and @HuffingtonPost should change their fraudulent story to say THAT I DROPPED @NBC & The Apprentice to run for President! _E_\nFor the sake of New York City all recent sexting victims of Anthony 'Carlos Danger' Weiner should come forward. _E_\nLucky to have been chosen for the purchase of the magnificent The Point Lake and Golf Club on Lake Norman in (cont) __HTTP__ _E_\n\"President Trump is not getting the credit he deserves for the economy. Tax Cut bonuses to more than 2000000 workers. Most explosive Stock Market rally that we've seen in modern times. 18000 to 26000 from Election and grounded in profitability and growth. All Trump not 0... _E_\nRT @Team_Trump45: @realDonaldTrump We won. Move on. __HTTP__ _E_\nTHANK YOU for another wonderful evening in Washington D.C. TOGETHER we will MAKE AMERICA GREAT AGAIN __HTTP__ _E_\n\"Failure has a thousand explanations. Success doesn't need one.\" Alec Guinness _E_\nI can't believe that the judge in the Oscar Pistorious case has found him not guilty of murder. No one has been more guilty since O.J.! _E_\nI've always defended @jayleno but he never defends me. He's not a loyal person & I now understand why everybody dumped him. Jay sucks! _E_\nThe Republican Establishment has been pushing for lightweight Senator Marco Rubio to say anything to hit Trump.I signed the pledge careful _E_\nSave your time @rosie and focus on your horrible ratings and don't mention my name on talk shows anymore or you will get more of the same. _E_\nI do not know the reporter for the @nytimes or what he looks like. I was showing a person groveling to take back a statement made long ago! _E_\n...is all of the illegal leaks of classified and other information. It is a total witch hunt! _E_\nSleepy eyes Chuck Todd a man with so little touch for politics is at it again.He could not have watched my standing ovation speech in N.C. _E_\nThank you New Hampshire! 
#MakeAmericaGreatAgain #FITN __HTTP__ _E_\nWhy did @BarackObama and his family travel separately to Martha's Vineyard? They love to extravagantly spend on the taxpayers' dime. _E_\n...the entire World WAS laughing and taking advantage of us. People like liddle' Bob Corker have set the U.S. way back. Now we move forward! _E_\nThe Fake News is working overtime. As Paul Manaforts lawyer said there was no collusion and events mentioned took place long before he... _E_\nThe protesters blocked a major highway yesterday delaying entry to my RALLY in Arizona by hours and the media blames my supporters! _E_\nCongratulations to Emmanuel Macron on his big win today as the next President of France. I look very much forward to working with him! _E_\nNot only did Egypt destroy its civil society w/ the Muslim Brotherhood now it is a complete economic mess __HTTP__ _E_\nMajor grudge match this weekend between @nyjets & @Patriots. I have a dilemma I am good friends w/ both Woody (cont) __HTTP__ _E_\nBuilding a brand is like building a skyscraper the foundation comes first. The bigger the building the deeper the foundation needs to be _E_\nThe final Wisconsin vote is in and guess what we just picked up an additional 131 votes. The Dems and Green Party can now rest. Scam! _E_\n.@CNN Poll just came out amazing numbers for those who want to MAKE AMERICA GREAT AGAIN! TRUMP 36% a 20 point lead over 2nd place. Thanks. _E_\nJoin me at Clemson University on Wednesday February 10th! #MakeAmericaGreatAgain __HTTP__ _E_\nThe YouTube of the 2012 Miss USA contestants @GiulianaRancic and me singing Call Me Maybe __HTTP__ has over 2M views. _E_\nTrump Invitational at Mar a Lago was a huge success. Raised millions for charity and was the 1st equestrian event held in Palm Beach. _E_\nThank you Henderson NV. This is a MOVEMENT like never seen before! Watch some of the rally via my Facebook page:... __HTTP__ _E_\nEntrepreneurs: Everything starts with you. Realize that you're in charge. 
Whatever happens you're responsible. _E_\nThe Federal government has increased its employment by 12% since 2007. We need to stop replacing retired workers unless position is needed. _E_\nDuring the GOP convention CNN cut away from the victims of illegal immigrant violence. They don't want them heard. __HTTP__ _E_\nBig progress being made in ridding our country of MS 13 gang members and gang members in general. MAKE AMERICA SAFE AGAIN! _E_\nSomebody got rich building the ObamaCare website which doesn't even come close to working where has the money gone? _E_\nWorking hard to get the Olympics for the United States (L.A.). Stay tuned! _E_\nJust out but lightly reported: Fewest jobless claims since 1973 show firm U.S. Job Market Lowest since March 1973. @bpolitics _E_\nPeople do business with those people they like and trust. Ralph J. Roberts Founder of Comcast _E_\nSo nice when media properly polices media. Thank you @BreitbartNews. __HTTP__ _E_\nWeiner and Spitzer are on top of the latest polls. A sad day for the greatest city on earth! They will spend lots of time together. _E_\nWhile I greatly appreciate the efforts of President Xi & China to help with North Korea it has not worked out. At least I know China tried! _E_\nWe need strong tough and brilliant leadership now more than ever! MAKE AMERICA GREAT AGAIN! _E_\nTogether we are going to MAKE AMERICA GREAT AGAIN!#AmericaFirst __HTTP__ _E_\nHad a fantastic dinner last night at Quattro in the Trump SoHo Hotel. It's already one of the hottest new restaurants in the city. _E_\nRaising the capital gains tax in this fragile economic time is the dumbest thing Washington could do. So they will probably do it. _E_\nJust heard that crazy and very dumb @morningmika had a mental breakdown while talking about me on the low ratings @Morning_Joe. Joe a mess! _E_\nThank you Kansas! Thousands of people inside and thousands outside who couldn't get into the hall. Really amazing! 
#CaucusForTrump _E_\nBack by popular demand TV personality @TheRealMarilu returns in the record 13th season of 'All Star' @CelebApprentice. Marilu does great! _E_\nFor those of you that have conveniently fotgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_\nOur heartfelt prayers go out to our fellow Americans suffering from the storms & tornadoes. _E_\nMy @foxandfriends interview discussing how @BarackObama should release his college applications & records __HTTP__ _E_\nReally bad ratings for Lawrence O'Donnell on MSNBC O'Reilly is killing him! _E_\nHope you enjoy the story in the highly respected Real Estate Weekly __HTTP__ _E_\nSee you in D.C. tomorrow at 1:00 P.M. at the Capitol to protest the horribly negotiated deal with Iran. Really sad! _E_\nIs it even slightly possible that Jodi Arias could be set free wow what a miscarriage of justice that would be! _E_\nFive people killed in Washington State by a Middle Eastern immigrant. Many people died this weekend in Ohio from drug overdoses. N.C. riots! _E_\nI just arrived at Trump National Doral in Miami where I'll spend the day checking work just completed by contractors. This place is amazing! _E_\nMitt Romney who was one of the dumbest and worst candidates in the history of Republican politics is now pushing me on tax returns. Dope! _E_\nabout that...Those Intelligence chiefs made a mistake here & when people make mistakes they should APOLOGIZE. Media should also apologize _E_\nThanks to @SenateMajLdr McConnell and the @SenateGOP we are appointing high quality Federal District... _E_\n2. The celebrity with the highest totals by Tuesday noon ET gets an extra donation to his or her charity... _E_\nThe Democrats want to shut government if we don't bail out Puerto Rico and give billions to their insurance companies for OCare failure. NO! _E_\nJust arrived in New Hampshire. 
Thank you to all of my supporters!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nCongratulations to the dedicated professionals of the USSS as they celebrate their 152nd anniversary. Thank you! __HTTP__ __HTTP__ _E_\nI made a fortune in Atlantic City got out years ago (great timing) and havn't been back in many years. I have NOTHING to do with A.C. _E_\nThank you Rep. @MarshaBlackburn! __HTTP__ __HTTP__ _E_\nI would gain a whole new respect for President Obama if he would say look we made a big mistake sorry! No more lies or deception. _E_\nRT @ABCNewsRadio: Global fund championed by Ivanka Trump to help women entrepreneurs begins operations __HTTP__ __HTTP__ _E_\nThank you to our U.S. Navy for protecting our country both in times of peace & war. Together WE WILL MAKE AMERICA... __HTTP__ _E_\nWith the high prices of corn to continue expect even more inflation on the price of food. _E_\nHappy Birthday to my wonderful daughter @IvankaTrump. _E_\nIn politics and in life ignorance is not a virtue. This is a primary reason that President Obama is the worst president in U.S. history! _E_\nWatch Gary B tonight on Celebrity Apprentice some really crazy things happen! _E_\nBecause of me the Republican Party has taken in millions of new voters a record. If they are not careful they will all leave. Sad! _E_\nCrooked Hillary Clinton made up facts about me and forgot to mention the many problems of our country in her very average scream! _E_\nIt's amazing how many people still come up to me to thank me for 'The Art of The Deal.' The book has changed a lot of lives. _E_\nObama has blocked ICE officers and BP from doing their jobs. That ends when I am President! _E_\nWhat did you think of @THEGaryBusey's mechanical dog idea? _E_\nMy kids never negatively discussed my criticism of President Obama with me or anyone...it's not in their nature! _E_\nThe least number of hurricanes in the U.S. in decades. 
So they change global warming (too cold) to climate change now what will they call it _E_\nTonight's episode of The Apprentice has a big surprise at the top of the show don't miss it! 10 p.m. on NBC. _E_\nIf ObamaCare is so amazing then why is Obama delaying significant parts of the bill before the election? #MakeDCListen _E_\nThe voting booth process was a total disaster—it could and should be much better and more efficient—tremendous room for error! _E_\nWithin the heart of beautiful Somerset County Trump Nat'l Bedminster is the proud host of the 2022 @PGAChampionship __HTTP__ _E_\nVery proud of Trump Int'l Golf Links in Aberdeen Scotland. Just got the five star award from @VisitScotNews __HTTP__ _E_\nThank you North Carolina! #MAGA __HTTP__ _E_\nPM Sarah Westcot Williams incompetence should not be rewarded. You should vote for anyone who runs against her—loser! @PrimeMinisterSX _E_\nGlad to see RomneyCare/ObamaCare architect Gruber being eviscerated on the Hill today. He should return all taxpayer money he was paid. _E_\n#MakeAmericaGreatAgainVideo: __HTTP__ __HTTP__ _E_\nNorth Carolina's most exclusive club @Trump_Charlotte's features @SharkGregNorman designed golf course which fronts the biggest lake in NC _E_\nIf you accept the expectations of others especially negative ones then you will never change the outcome. Michael Jordan _E_\nWhen will we see @BarackObama's passport records (sealed)? _E_\nThank you for all of the great comments on the debate last night. Very exciting! _E_\n\"@BrandiGlanville @KenyaMoore Talk @ApprenticeNBC Feud\" __HTTP__ via @ChristianPost by Virnelli Mercader _E_\nMy experience yesterday in Poland was a great one. Thank you to everyone including the haters for the great reviews of the speech! _E_\nThis very expensive GLOBAL WARMING bullshit has got to stop. 
Our planet is freezing record low tempsand our GW scientists are stuck in ice _E_\nThe developer of the Scottish wind monstrosities Vattenfall just laid off 2500 people & has serious financial difficulties. _E_\nMy lawyers want to sue the failing @nytimes so badly for irresponsible intent. I said no (for now) but they are watching. Really disgusting _E_\n.@GiulianaRancic & @nickjonas both did a wonderful job hosting @MissUSA! Everyone loved @JonasBrothers & @DJPaulyD's performances! _E_\n.@SteveRattner While I think you should have gone to prison for what you did I guess Obama saved you. But watch – I will win! _E_\nHas everyone forgotten our marine who now sits in a Mexican prison because we have a president too incompetent or too lazy to make a call? _E_\nThank you to Donald Rumsfeld for the endorsement. Very much appreciated. Clinton's conduct has been disqualifying. _E_\nBack by popular demand @GiulianaRancic and @BravoAndy are co hosting tonight's #MissUniverse pageant. They are great! _E_\nA great afternoon in Tampa Florida. Thank you! #TrumpPence16 __HTTP__ _E_\nLet me sum this up for you... __HTTP__ _E_\nPeople are struggling to get gasoline for their cars we are like a third world country. _E_\nAll I heard in the SOTU was proposals for more govt more spending and more bureaucrats. Very bad! _E_\nCongrats to Jim Lipton and Inside the Actors Studio for winning the Emmy Award for the 250th Episode. I was honored to appear in it. _E_\nA Rod's forgery defense is blown __HTTP__ The more he lies the worse it's going to get. @yankees want out of his contract _E_\nI will be live on all of the major morning talk shows. Enjoy! _E_\nRick Santorum making a strong point on the Newsmax @iontv debate: @RickSantorum. __HTTP__ _E_\nTrump Int'l Palm Beach offers an award winning par 72 Championship measuring 7326 yards. Florida's top course __HTTP__ _E_\nSenator Dicky Durbin totally misrepresented what was said at the DACA meeting. 
Deals can't get made when there is no trust! Durbin blew DACA and is hurting our Military. _E_\nThank you to the great crowd of supporters in Newtown Pennsylvania. Get out & VOTE on 11/8/16. Lets #MAGA! Watch:... __HTTP__ _E_\nThe World is falling apart around us but we don't have people who know how to play the game. The U.S. is in big trouble no leadership! _E_\nGreat line from @TheGaryBusey: \"I am an angel in an earth suit.\" Do you agree? #CelebApprentice _E_\nAlways remember I was the one who got Obama to release his birth certificate or whatever that was! Hilary couldn't McCain couldn't. _E_\nDonald Trumps Speech Is a Game Changer. __HTTP__ __HTTP__ _E_\nNow that African Americans are seeing what a bad job Hillary type policy and management has done to the inner cities they want TRUMP! _E_\nToday's announcement by @BarackObama on immigration was done for reelection. He is using the office of the presidency as a campaign tool. _E_\nThe cast and producers of Hamilton which I hear is highly overrated should immediately apologize to Mike Pence for their terrible behavior _E_\nA legitimate article about me... __HTTP__ _E_\nFraud lightweight Marco made a TV ad on TrumpU featuring 2 people who signed these letters: __HTTP__ _E_\nI will hold a press conference in the near future to discuss the business Cabinet picks and all other topics of interest. Busy times! _E_\nHow does Ben Carson survive this problem – really big. Similar story on front page of New York Times. __HTTP__ _E_\nAlso the more desperate you are to close a deal the less likely it will happen. Stay calm and focused on your ultimate goals. Be smart! _E_\n#VoteTrump #SuperTuesday✅Florida✅Illinois✅Missouri✅North Carolina✅Ohio #TrumpTrain __HTTP__ __HTTP__ _E_\nAll former Bush administration officials should have zero standing on Syria. Iraq was a waste of blood & treasure. 
_E_\nThe fact is right now and for the foreseeable future the planet runs on oil and that means we need to get (cont) __HTTP__ _E_\nTonight be sure to watch Melania and Ivanka on Larry King Live for a Celebrity Relief Telethon __HTTP__ _E_\nSo I speak badly of China but I speak the truth and what do the consumers in China want? They want Trump. (cont) __HTTP__ _E_\nJackie Evancho's album sales have skyrocketed after announcing her Inauguration performance.Some people just don't understand the Movement _E_\nJoin @mike_pence at the University of Northwestern Ohio tonight at 7pm. Tickets: __HTTP__ _E_\nWhen it comes to Iran's nuclear weapons program here's my advice: Distrust dismantle and verify. @IsraeliPM @netanyahu _E_\nLooking forward to keynoting the Nackey S. Loeb School of Communications First Amendment Awards event tomorrow in New Hampshire. _E_\nStrange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy! _E_\nThe only place where success comes before work is in the dictionary. Vidal Sassoon _E_\nVia @BBCNews: \"Donald Trump golf clubhouse at Menie approved\" __HTTP__ _E_\nThank you to all of the television viewers that made my speech at the Republican National Convention #1 over Crooked Hillary and DEMS. _E_\nThe Keystone pipeline will create 20000 jobs and lower gas prices. But Obama says No. Dumb. _E_\nOur trade deficit continues to rise at record rates __HTTP__ The US manufacturing sector is being (cont) __HTTP__ _E_\nTrading Shots with Donald Trump a great article in the Wall Street Journal __HTTP__ _E_\nI feel sorry for the 4000 soldiers who are being forced to go the West Africa to fight Ebola. Their families are up in arms. Not trained. _E_\nThe Fed should not do another 'stimulus.' We can't keep spending our children's future away on waste. _E_\nI am having a great time in Iowa at Jack Trice Stadium! Unbelievable people. _E_\n.@mike_pence and I will defeat #ISIS. 
__HTTP__ #VPDebate _E_\nRT @OCChoppers: Bike we built for @realDonaldTrump. The gold flakes in the paint out in the sunlight looked amazing! __HTTP__ _E_\nLines for my @CPACnews address start at 7:00AM outside the Potomac Ballroom. ACU has asked that you get there early. #CPAC2013 _E_\nRex Tillerson never threatened to resign. This is Fake News put out by @NBCNews. Low news and reporting standards. No verification from me. _E_\nFor those on TV defending my use of the word schlonged bc #MSM is giving it false meaning tell them it means beaten badly. Dishonest #MSM _E_\nWhile not at all presidential I must point out that the Sloppy Michael Moore Show on Broadway was a TOTAL BOMB and was forced to close. Sad! _E_\nVia @reviewjournal \"Event offers glimpse of Trump high life\" by Holly Ivy Dore __HTTP__ Great interview @EricTrump! _E_\nDoes anyone remember this @BillMaher clip when he got fired from ABC in fact fired like a dog! __HTTP__ _E_\n.@THEGaryBusey doesn't need instructions. Couch time is more fun. #CelebApprentice _E_\nIce storm rolls from Texas to Tennessee I'm in Los Angeles and it's freezing. Global warming is a total and very expensive hoax! _E_\nWatch @JudgeJeanine on @FoxNews tonight at 9:00 P.M. _E_\nRemembering the fallen heroes on #DDay June 6 1944. __HTTP__ _E_\nThank you! WE will MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\n\"You can't put a limit on anything. The more you dream the farther you get.\" @MichaelPhelps _E_\nWow despite the switch to Monday night @ApprenticeNBC ratings were higher than even the Sunday night show. _E_\n\"America is too great for small dreams.\" — Pres. Ronald Reagan _E_\nMany of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_\nWe need leaders who can negotiate great deals for Americans. It is common sense. Let's Make America Great Again! __HTTP__ _E_\nHe @BarackObama made a deal with Saudi Arabia to pump the hell out of oil until after the election. 
Watch what (cont) __HTTP__ _E_\nDonald Trump retains national lead in new ABC News/WaPo poll with 37%: __HTTP__ __HTTP__ _E_\nI promoted the hell out of Trump Tower but I also had a great product. The Art of the Deal _E_\nWe need a President who isn't a laughing stock to the entire World. We need a truly great leader a genius at strategy and winning. Respect! _E_\nPeople are smart. They know you can't be for jobs but against those who create them. It doesn't work. (cont) __HTTP__ _E_\nStatement on Clinton Foundation: __HTTP__ _E_\nThese politicians like Cruz and Graham who have watched ISIS and many other problems develop for years do nothing to make things better! _E_\nLawyer Elizabeth Beck was easy for me to beat. Ask her clients if they are happy with her results against me. Got total win and legal fees. _E_\nIn business you make decisions that are in your best interests. Time for the US gov't to do the same. Let's Make America Great Again! _E_\nI will be on @piersmorganlive tonight at 9PM. __HTTP__ _E_\nJust watched the very incompetent Mitt Romney Campaign Strategist Stuart Stevens. Now I know why Mitt lost so badly. Stevens is a clown! _E_\nLooking forward to being interviewed by Sam Clovis tomorrow at @MorningsideEdu in Sioux City at 10AM CT! Let's Make America Great Again! _E_\nVia @scotsmandotcom: Via Donald Trump makes plans for Menie Estate marquee __HTTP__ _E_\n.@OMAROSA You were fantastic on television this weekend. Thank you so much – you are a loyal friend! _E_\nThank you Charlotte North Carolina! We are going to have an AMAZING victory on November 8th...because this is all... __HTTP__ _E_\nGreat article in @torontodotcom @DonaldJTrumpJr: the original apprentice __HTTP__ _E_\nCongratulations to @NYCParks on quickly repairing the Lasker Rink. Record skaters this past Thanksgiving! _E_\nCongratulations on the GREAT job done by POLICE and law enforcement on the California shootings. Give credit where credit is due. 
_E_\nHey @POTUS WE AGREE!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nIt's more important to be smart than tough. I know businessmen who are brutally tough but they're not smart.\" – Think Like A Billionaire _E_\nWhy do losers & haters always say I wear a \"wig\" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. _E_\nOur thoughts and prayers are w/ the families of the 19 brave firefighters who died fighting the Arizona wildfire. God bless them. _E_\nNow Sebelius is \"'urging' insurers to cover people who haven't paid\" __HTTP__ Complete mess. Enrollment Numbers are a sham. _E_\n.@billmaher says that the Iraelis are controlling our government __HTTP__ @HBO. Let's fire him a second time. _E_\nThank you for all of your support! Most importantly we need to get everyone out to VOTE! #VoteTrump2016 __HTTP__ _E_\nArriving at Joint Base Andrews with @SecretaryPerry @SecretaryZinke and @SecPriceMD..... __HTTP__ _E_\nIt's amazing @hardball_chris has completely lost all connections to reality. He is a complete shill for Obama. _E_\nJohn McCain couldn't get him to release \"it\" and neither could Hillary Clinton—but Donald did! _E_\nWill be doing Fox and Friends in 10 minutes at 7.05 enjoy! _E_\nUSA should take oil from Iraq in repayment for their liberation. __HTTP__ _E_\nThank you to Brad Blakeman on @FoxNews for grading year one of my presidency with an \"A\" and likewise to Doug Schoen for the very good grade and statements. Working hard! _E_\nSo if Iran is going to take over the oil I say we take over the oil first by hammering out a cost sharing plan with Iraq. #TimeToGetTough _E_\nWhat is better advice The Art of the Deal or Rules for Radicals ? I know which one @BarackObama prefers. _E_\n.@WayneDupreeShow A fantastic guy! _E_\nEx Presidential Pollster Pat Cadell says most voters sick of both parties and their failure. 
_E_\nI would have done even better in the election if that is possible if the winner was based on popular vote but would campaign differently _E_\nHILLARY FAILED ALL OVER THE WORLD. #BigLeagueTruth LIBYA SYRIA IRAN IRAQ ASIA PIVOT RUSSIAN RESET BENGHAZI... __HTTP__ _E_\nIt is really too bad that the scientists studying GLOBAL WARMING in Antarctica got stuck on their icebreaker because of massive ice and cold _E_\nHe's back! @THEGaryBusey returns to cause even more trouble in the13th season of All Star @CelebApprentice. _E_\nTrump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban Habitat __HTTP__ _E_\nThank you Iowa! #Trump2016#MakeAmericaGreatAgain #FITN __HTTP__ _E_\nVia @dcexaminer: @realDonaldTrump to speak at @LibertyU __HTTP__ _E_\nPresident Obama has made one mistake after another for a very long time and the people of the United States are just plain tired of it! _E_\nISIS is advancing even against Obama's airstrikes. Obama is disengaged and making the Middle East even more dangerous. _E_\nJust got to listen to Rush Limbaugh the guy is fantastic! _E_\nI hope voters in Mississippi cast their ballot for @senatormcdaniel. He is strong he is smart & he wants things to change in Washington. _E_\nThis is just not the right time for Jeb Bush. His campaign is in total disarray too much staff being paid way too much money = U.S. GOVT. _E_\nWhat do you think of my suing @billmaher for $5M for charity? He made an offer I accepted. _E_\nDo you really believe our once great country can continue to survive with incompetent leadership. The answer is no and we better move fast! _E_\nTell Saudi Arabia and others that we want (demand!) free oil for the next ten years or we will not protect their private Boeing 747s.Pay up! _E_\nHow is ABC Television allowed to have a show entitled Blackish ? Can you imagine the furor of a show Whiteish ! Racism at highest level? 
_E_\nRe Kerry admitting to \"working\" for Pastor Abedini's release why has US already released Iranian spies & nuclear scientist? Dumb! _E_\nA GREAT day in South Carolina. Record crowd and fantastic enthusiasm. This is now a movement to MAKE AMERICA GREAT AGAIN! _E_\nTHANK YOU Arkansas! Get out & #VoteTrump on Tuesday. We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ _E_\nThe Trump Tower atrium is such a great place & kept thousands of people warm & safe during the storm thanks staff! _E_\n2016 Republican Primary Morning Consult Poll was just released. TRUMP 32 CARSON 12 BUSH 11 FIORINA 6 RUBIO 5 CRUZ 5. Taken after debate _E_\nA winning attitude will put everything in perspective. Keep negative thoughts and people where they belong out of the big picture. _E_\nWhere's the transparency? Despite Obama's denial @sfchronicle stands by report he just talked with Jeremiah Wright. _E_\nPhoenix Convention Center officials did not want to have thousands of people standing outside in the heat so they let them in. A GREAT day! _E_\nI would bet that we have many great American technology companies that would build and fix the pathetic ObamaCare website for ZERO dollars! _E_\n...Hence I would fully expect Corker to be a negative voice and stand in the way of our great agenda. Didn't have the guts to run! _E_\nJust put in ad for a real estate executive: \"Hard work low pay mean boss!\" _E_\nGreat time last night in Louisiana. Big and energetic crowd. Go out and vote now polls open. MAKE AMERICA GREAT AGAIN! _E_\nDr. Ben Carson blasted Ted Cruz for deceit and dirty tricks and lies. _E_\nSpoke at the Congressional @GOP Retreat in Philadelphia PA. this afternoon w/ @VP @SenateMajLdr @SpeakerRyan. Th... __HTTP__ _E_\nEXCLUSIVE: FBI Agents Say Comey 'Stood In The Way' Of Clinton Email Investigation: __HTTP__ _E_\nLeft Paris for U.S.A. Will be heading to New Jersey and attending the#USWomensOpen their most important tournament this afternoon. 
_E_\nI don't know if President Obama isn't stopping the flights from Ebola torn West Africa because he is stubborn stupid or just doesn't care! _E_\n\"When you can't make them see the light make them feel the heat.\" – Ronald Reagan _E_\n\"Concentration comes out of a combination of confidence and hunger.\" Arnold Palmer _E_\nChina is buying our shale and gas fields __HTTP__ & Obama still won't approve Keystone __HTTP__ Pathetic! _E_\nThanks to @pnehlen for your kind words very much appreciated. _E_\nDummy @GoAngelo who had 11 people show up for 15 min. at his \"massive\" rally at Macy's is trying to get publicity for self by using me _E_\nWRONG!@BarackObama capitulated to China by releasing Chen Guangcheng out of the US Embassy __HTTP__ China really has our number _E_\nIn terms of energy we need to be exploring and developing numerous approaches...and I also include in that (cont) __HTTP__ _E_\nOn Saturday a great man Elie Wiesel passed away.The world is a better place because of him and his belief that good can triumph over evil! _E_\nMost people can learn from their own experiences quite well but many ignore the experiences and lessons of others. The Way To The Top _E_\nTo all young entrepreneurs entering the business world stay positive focused and remember everything has its ups and downs. _E_\n.@FoxNews Outgoing CIA Chief John Brennan blasts Pres Elect Trump on Russia threat. Does not fully understand. Oh really couldn't do... _E_\nUnion Leader refuses to comment as to why they were kicked out of the ABC News debate like a dog. For starters try getting a new publisher! _E_\nJust terrible! #Oscars _E_\nCelebrity Apprentice tonight at 9 on NBC some amazing things happen! _E_\n.@KarlRove is far more to blame for Obama's victory than the Tea Party. _E_\nDid @BarackObama try to bribe Rev. Wright with $150K? __HTTP__ I am sure the media will be all over this. 
_E_\n.@realDonaldTrump will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate... __HTTP__ _E_\nCan you imagine if @billmaher said about Obama what he said about me (orangutan etc)—the press would run him out of the country... _E_\nMust watch @IvankaTrump interview on @gma discussing #Girlpower __HTTP__ _E_\nWill be on @foxandfriends at 7:00 A.M. Enjoy! _E_\nIt was an honor to welcome @GLFOP to the @WhiteHouse today with @VP Pence & Attorney General Sessions. THANK YOU fo... __HTTP__ _E_\nThe U.S. has gained more than 5.2 trillion dollars in Stock Market Value since Election Day! Also record business enthusiasm. _E_\nThe only place success comes before work is in the dictionary. Vince Lombardi _E_\nIf you can't say great things about yourself who do you think will? Think Like a Champion _E_\nLIVE on #Periscope: Tax Plan Press Conference#Trump2016 __HTTP__ _E_\n.@ashleycam2883 Re: Libya Hillary took the blame for Obama. _E_\nW/ signature Trump amenities 5 star rooms & world class restaurants @TrumpWaikiki brings excellence to Hawaii __HTTP__ _E_\nIt's the Democrats' total weakness & incompetence that gave rise to ISIS not a tape of Donald Trump that was an admitted Hillary lie! _E_\nSTATEMENT ON MELANIA SPEECH __HTTP__ _E_\nIn 1999 @BarackObama said that he didn't support Welfare Reform __HTTP__ He just gutted the entire program. _E_\nDo you think Putin will be going to The Miss Universe Pageant in November in Moscow if so will he become my new best friend? _E_\nUS Army Reserve @leezeldin will bring Conservative solutions to DC. Next Tuesday vote for Lee in the NY 1 primary. #zeldinforcongress _E_\n.@ralphreed is doing a great job! _E_\nFor the record I have ZERO investments in Russia. _E_\nJust arrived in Scotland. Place is going wild over the vote. They took their country back just like we will take America back. No games! _E_\nChelsea Clinton will be very successful in the world of politics. 
She's always been a great person a winner. (cont) __HTTP__ _E_\nThank you Arizona! #Trump2016#MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_\nThe $200M in renovations of Trump Int'l Washington DC are on track. The Old Post Office is being transformed into true luxury. _E_\nTo be a winner think like a winner. Practice positive thinking with reality checks. _E_\nMany of life's failures are people who did not realize how close they were to success when they gave up. Thomas A. Edison _E_\nJoin us at 10pmE on @ABC2020 @ABC with @BarbaraJWalters! #MeetTheTrumps #ABC2020 __HTTP__ _E_\nToday it was my honor to join the great men and women of @DHSgov @CustomsBorder @ICEgov and @USCIS at the U.S. Customs and Border Protection National Targeting Center in Sterling Virginia. Fact sheet: __HTTP__ __HTTP__ _E_\n.@IvankaTrump and I are looking forward to visiting Vancouver next week. Big announcement... _E_\nRT @TeamTrump: .@realDonaldTrump is here to talk about the REAL issues #BigLeagueTruth #Debates2016 __HTTP__ _E_\n#FullRepeal: Stopping Obamacare is now up to the American people. We must elect @MittRomney this November. _E_\nLooking forward to RALLY in the Great State of Pennsylvania tonight at 7:30. Big crowd big energy! _E_\ntheir country (the U.S. doesn't tax them) or to build a massive military complex in the middle of the South China Sea? I don't think so! _E_\nMy @SquawkCNBC interview discussing the GOP primary gas prices the Doral purchase and my outlook on the economy. __HTTP__ _E_\nShock @BarackObama's DNC Convention has a $27M deficit and events are starting to be canceled. __HTTP__ _E_\nGreat going. _E_\nRT @IvankaTrump: It was an honor to meet with you Prime Minister Modi. Thank you for co hosting the 8th annual Global Entrepreneurship Summ... _E_\nDonald Trump Jr. Ivanka Trump Eric Trump and myself in front of The Old Post Office D.C. on Pennsylvania... __HTTP__ _E_\nGlad to hear @ehasselbeck will be staying on @theviewtv. 
Elizabeth has great presence & doesn't back down from sharing her views. _E_\nOil would be $25 a barrel if our government would let us drill. Our country would be rich again who needs OPEC. _E_\nMy fragrance Success is flying off the shelves @Macys. The perfect Christmas gift! _E_\n\"Successful people don't have fewer problems.They have determined that nothing will stop them from going forward.\" Dr. Benjamin Carson _E_\nAlways be prepared to start.\" @JoeMontana _E_\nClinton betrayed Bernie voters. Kaine supports TPP is in pocket of Wall Street and backed Iraq War. _E_\nEntrepreneurs: always remember that deals are fluid. Terms are always negotiable and time can be the best option for success. _E_\nThe Miss USA Pageant #MissUSA was a big ratings hit for @nbc NBC won the evening. Thank you Donald. _E_\nFirst candidate in Virginia with over 16000 validated signatures for the ballot. An honor thank you! #Trump2016 #MakeAmericaGreatAgain _E_\nPassion gives great momentum and can be the catalyst for great achievement. _E_\nA TRULY GREAT CHAMPION WILL SELDOM FAIL AND ALWAYS COME BACK. NEVER UNDERESTIMATE THE POWER OF GREATNESS! _E_\nWaterboarding KSM gave us the intelligence that lead to Bin Laden. _E_\nTune in to The Marriage Ref onThursday night at 10 p.m. on NBC I'm on the panel of experts along with Gloria Estefan & Adam Carolla. _E_\nIn order to get elected @BarackObama will start a war with Iran. _E_\nVia @EWErickson: \"Stop Complaining About Donald Trump\" __HTTP__ _E_\n'S&P 500 Edges Higher After Trump Renews Jobs Pledge' __HTTP__ _E_\nI just got Mike Leach's new book Swing Your Sword. He's a great coach and he's written a great book. It's definitely worth reading. _E_\nLyin' Ted Cruz steals foreign policy from me and lines from Michael Douglas— just another dishonest politician. _E_\nTrump National Golf Club Jupiter is close to Palm Beach and designed by Jack Nicklaus a masterpiece of a course. 
__HTTP__ _E_\nHow did Obama go to a Las Vegas fundraiser on 9.12 the day after he refused to send help to Americans in Benghazi? _E_\nIt's good to see that @FLGovScott is protecting the sanctity of this November's elections Voter fraud must be broken. _E_\nEverybody loves @bretmichaels! He's a great champion and this is where he should be. He agrees! _E_\nWonderful weekend at Camp David. A very special place. A lot of very important work done. Heading back to the @WhiteHouse now. __HTTP__ _E_\nWe will stop heroin and other drugs from coming into New Hampshire from our open southern border. We will build a WALL and have security. _E_\nOur American comeback story begins 11/8/16. Together we will MAKE AMERICA SAFE & GREAT again for everyone! Watch:... __HTTP__ _E_\nThank you Illinois! Great news! #VoteTrumpIL on 3/15!Trump 28%Cruz 15%Rubio 14%Kasich 13%Bush 8%Carson 6%Simon Poll/SIU _E_\nIt's amazing how people can talk about me but I'm not allowed to talk about them. _E_\nIf you look at the horrible picture on the front page of the NY Times of the rebels executing prisoners you would say forget the rebels! _E_\nAmerica's primary goal with Iran must be to destroy its nuclear ambitions. Let me put this as plainly as I know (cont) __HTTP__ _E_\nDon't miss my fabulous World of Golf now in its second season on Golf Channel beginning January 31 at 9 pm ET. Celebrity matches and more... _E_\nGot the endorsement of Brian France and @NASCAR yesterday in Georgia. Also many of the sports great drivers. Thank you Nascar and Georgia! _E_\nGeneral Motors is sending Mexican made model of Chevy Cruze to U.S. car dealers tax free across border. Make in U.S.A.or pay big border tax! _E_\nIt seems there is never a problem for which @BarackObama cannot find a reason for another speech and another tax. _E_\n#TBT With Tommy Lee Jones at Mar a Lago. 
__HTTP__ _E_\nBig increase in traffic into our country from certain areas while our people are far more vulnerable as we wait for what should be EASY D! _E_\n\"Failure isn't fatal but failure to change might be\" – John Wooden _E_\nDo not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_\nHappy Birthday to my friend @garyplayer... __HTTP__ _E_\nDon't forget next Friday December 9th: I'll be signing my new book @HowToGetTough in Trump Tower from 11 a.m... (cont) __HTTP__ _E_\nArmy officer who led a sexual abuse prevention unit was just fired after being charged with violently going after his wife.What is going on? _E_\nRemember get TIME magazine! I am on the cover. Take it out in 4 years and read it again! Just watch... _E_\nStill waiting to hear from @billmaher. Every day he dodges me is one less day that $5M is being used for charity. _E_\nAll civilized nations must join together to protect human life and the sacred right of our citizens to live in safety and in peace. _E_\nA big fat hit job on @oreillyfactor tonight. A total waste of time to watch boring and biased. @brithume said I would never run a dope! _E_\nLeaving now for a one night trip to Scotland in order to be at the Grand Opening of my great Turnberry Resort. Will be back on Sat. night! _E_\nCelebrity Apprentice on CNBC tonight at 9. _E_\nBoycott all Apple products until such time as Apple gives cellphone info to authorities regarding radical Islamic terrorist couple from Cal _E_\nCover your bases know everything you can about what you're doing. Keep your focus by being well informed on a daily basis. _E_\n\"If we ever forget that we are One Nation Under God then we will be a nation gone under.\" Ronald Reagan (Feb. 6 1911–June 5 2004) _E_\nCongratulations to my brother Robert & Ann Marie on the success of @MontesKitchen in Dutchess County New York (Amenia.) Great food! 
_E_\nEntrepreneurs: See yourself as victorious look at the solution not the problem. _E_\nThe @nytimes sent a letter to their subscribers apologizing for their BAD coverage of me. I wonder if it will change doubt it? _E_\nThe Chicago machine is scared. @PaulRyanVP shows that @MittRomney will run on a conservative & coherent platform. 85 days until victory! _E_\nQ1 GDP has just been revised down to 1.9% __HTTP__ The economy is in deep trouble. _E_\nISIS is in retreat our economy is booming investments and jobs are pouring back into the country and so much more! Together there is nothing we can't overcome even a very biased media. We ARE Making America Great Again! _E_\nUnsustainable. With our $17T debt & $90T in unfunded liabilities government \"blatantly\" wasted $30B this year __HTTP__ _E_\nRT @LouDobbs: We are Watching A Leader Who for the First Time in Three Presidencies Will Put America and Americans First! @realDonaldTrump... _E_\nThe highly respected Suffolk University poll just announced that I am alone in 2nd place in New Hampshire with Jeb Bust (Bush) in first. _E_\nFidel Castro is dead! _E_\nI hope everyone that read @DanAmira's reprehensible statement will cancel their subscription to @NYMag in protest. Let me know. _E_\nfrom Donald Trump: I saw Lady Gaga last night and she was fantastic! _E_\n.@lisarinna is the last lady standing in All Star Celebrity @ApprenticeNBC. Watch out men she's sharp and tough. _E_\nThe legendary @BarbaraJWalters will be asking me questions about the Presidential campaign on @WNTonight at 6:30 PM. _E_\n#Trump360 Watch this 360 video of my speech last night at Trump Tower __HTTP__ _E_\nEntrepreneurs: In the best negotiations everyone wins. This is a possibility and it's the ideal situation to strive for. _E_\nPresident Obama & Putin fail to reach deal on Syria so what else is new? Obama is not a natural deal maker. Only makes bad deals! 
_E_\nVia @PVPatch by Paige Austin: \"Trump to Donate 12 Acres for Conservation in Palos Verdes\" __HTTP__ _E_\nRT @SecretarySonny: Serious @Cabinet meeting today called by @POTUS at Camp David. Reports on #Irma's track potential impact fed & state... _E_\nA Rod is now looking for an expensive home in Beverly Hills why aren't the @Yankees terminating his contract for misrepresentation? _E_\nVia @DMRegister by @KObradovich: \"Donald Trump: Next president needs to be 'a great one'\" __HTTP__ _E_\nThird quarter GDP was lowered to 2% . There won't be any economic recovery until @BarackObama is defeated. _E_\nGeorge Will one of the most overrated political pundits (who lost his way long ago) has left the Republican Party.He's made many bad calls _E_\nWow  I never saw the Petraeus thing coming. A straight laced guy! Very sad for him and his family. _E_\nJoin me live in Toledo Ohio. Time to #DrainTheSwamp & #MAGA! __HTTP__ _E_\nGo to the website for the Judge's full decision re Trump University: __HTTP__ _E_\nVera Coking made a big mistake in Atlantic City by turning down many millions of $'s years ago for property that just sold for $530000. _E_\nI will be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_\nAs I predicted 1 year ago gasoline prices hit a record high today...OPEC is having a ball at our expense. _E_\nThe President has until tomorrow at 12 noon to pick up $5M for his favorite charity. Looking like he won't be doing it. What is he hiding? _E_\nI will be talking about my wonderful experience in Iowa and the simultaneous unfair treatment by the media later in New Hampshire. Big crowd _E_\nHope he won't spend too much time ripping apart the 2nd. Amendment! _E_\nToday we heard the experiences of law enforcement professionals and community leaders working to combat the threat of MS 13 and the reforms we need from Congress to defeat it. Watch here: __HTTP__ __HTTP__ _E_\nRT @MikeHolden42: @foxandfriends @realDonaldTrump He's a fascist so not unusual. 
_E_\nThank you Albany New York!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\n.@MelindaDC Don't misrepresent in order to make a point. I was always tough on ISIS as you'll find out after I get elected. _E_\nThe United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_\nI was never a fan of Bush 2 FOR MANY REASONS including the fact that we should never have gone into Iraq but once there kept the oil! DUMB _E_\nFact – the tighter the gun laws the more violence. The criminals will always have guns. _E_\nI hope everybody goes to Macy's today to get Donald J. Trump shirts ties suits and cufflinks they are really beautiful at low price _E_\nHeroin overdoses are taking over our children and others in the MIDWEST. Coming in from our southern border. We need strong border & WALL! _E_\nGas prices are still too high. We really need to pressure OPEC to lower the price of oil. _E_\nTexas is lucky to have him @GovernorPerry is a great guy! _E_\nWhy does @FoxNews give @KarlRove so much airtime. He (and other Fox pundits) is so biased. Still thinks Romney won. Unfair coverage of Trump _E_\n.@IvankaTrump and @PiersMorgan will be wonderful advisors. #CelebApprentice _E_\nObamaCare will cost 3 times as much as Obama promised – $2.6T __HTTP__ It is not sustainable. (h/t @gatewaypundit) _E_\nThank you Pennsylvania! #Trump2016 __HTTP__ __HTTP__ _E_\nWatching the returns at 9:45pm. #ElectionNight #MAGA __HTTP__ _E_\n\"If you're still in school pay attention. Education is a money machine.\" – Think Like a Billionaire _E_\nIsn't it time that Obama release his college records and applications? Boy would that create a mess! He is not who you think. _E_\nRe run of O'Reilly on Fox NOW! _E_\nWe're singlehandedly transferring hundreds of billions of dollars a year... 
_E_\nLast night in Orlando Florida was incredible massive crowd THANK YOU FLORIDA! Today at 3:00 P.M. I will be in Alabama for last rally! _E_\nNow Obama is having our army coordinate with Iran against ISIS. What's next? _E_\nOh wow lightweight Governor @BobbyJindal who is registered at less than 1 percent in the polls just mocked my hair. So original! _E_\nGood messaging and staying on point. @MittRomney called @BarackObama anti investment anti business anti jobs __HTTP__ _E_\nCongrats everyone we topped 4 million today on Twitter and heading up fast! _E_\nHappy #VeteransDay to all. And it is nice to have Sgt. Andrew Tahmooressi back home. _E_\nI was nice to loser @rosie and she attacked me it just shows never let up with a bully. They only fade when you hit them hard! _E_\nLooking forward to seeing the World Champion Yankees today on opening day! _E_\nOnly a grossly incompetent government led by an equally incompetent president could have made the terrible trade for Bergdahl. #OrangeRoom _E_\nA must watch: Legal Scholar Alan Dershowitz was just on @foxandfriends talking of what is going on with respect to the greatest Witch Hunt in U.S. political history. Enjoy! _E_\nFew people know that @FortuneMagazine is still in business. Tell your writer Alisa Soloman that I left The Apprentice to run for president _E_\nIf America unlocked its energy potential we would once again be the most powerful country in the world. Washington is holding us back. _E_\nWelcome to the new reality. 23116928 US households on food stamps __HTTP__ Obama's Hope & Change. _E_\nIn '08 @BarackObama said that Bush adding $4T to the debt was unpatriotic. __HTTP__ @BarackObama has already added $6T. _E_\nThank you Peter if elected I will think big for our country & never let the American people down! #AmericaFirst __HTTP__ _E_\nJustice Roberts did the Republican Party and @MittRomney a great favor. 
He essentially said ObamaCare is a tax (cont) __HTTP__ _E_\n...Such poor leadership ability by the Mayor of San Juan and others in Puerto Rico who are not able to get their workers to help. They.... _E_\nJack Nicklaus II gave the best tribute to a parent I have ever heard at yesterday's Congressional Gold Medal Ceremony honoring @jacknicklaus _E_\nAsk Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Counsel. _E_\nChina taxing imports from the US 22% why aren't we taxing China? _E_\nEgypt is going the exact opposite of what it was. They will soon be very strongly against Israel. Thanks President Obama. @BarackObama _E_\nJust met with General Petraeus was very impressed! _E_\nI am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... _E_\nTell me which is \"cooler\"—my induction into the @WWE Hall of Fame or my Star on the Hollywood Walk of Fame? _E_\nThe Republican Party is racking up record amounts of small dollar donations fueled by Trump supporters..... @nypost Thank you! _E_\nOur major airports are decaying. It's embarrassing. We need to have them renovated by competent professionals and fast. _E_\n\"You don't necessarily need the best location. What you need is the best deal.\" – The Art of The Deal _E_\n.@NBA Hall of Famer @dennisrodman rebounds for a tremendous performance in his return to this year's All Star @ApprenticeNBC! Great guy! _E_\nBig day in Texas tomorrow! Having a rally in Fort Worth. Tremendous crowd. Will be exciting! #Trump2016 __HTTP__ _E_\nBloomberg News Spain's renewable projects lead by money losing wind turbines facing bankruptcy. Hopefully Scotland is watching! _E_\nI was just told by a television pro thay @DannyZucker is one of the truly dumbest guys in the business he's obsessed with T so many flops! _E_\nReports say #ISIS now has a passport machine to have its believers infiltrate our country. I told you so. 
__HTTP__ _E_\n\"Faldo to rework two Doral courses\" __HTTP__ via @FOXSports _E_\nNew Q poll out we are going to win the whole deal and MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_\nVia @AFP: Trump tees off on new golf course in Scotland __HTTP__ _E_\nBiden @VP Spends $1 Million Annually for Weekend Trips __HTTP__ _E_\nMy wife Melania will be interviewed tonight at 8:00pm by Anderson Cooper on @CNN. I have no doubt she will do very well. Enjoy! _E_\nI am no fan of President Obama but to show you how dishonest the phony Washington Post is: __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nCongrats to the Senate for taking the first step to #RepealObamacare now it's onto the House! _E_\nMust read column by Bob Woodward explaining how Obama pushed for sequestration & promised no tax increase __HTTP__ _E_\nWow I just had two very good Iowa polls and a phenomenal just out National Poll from @ABC @washingtonpost 38%. MAKE AMERICA GREAT AGAIN! _E_\nWEST VIRGINIA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThank you Texas! 10000 amazing supporters! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nI hope @Official1MCD is recuperating well in LA. Get better! @OMAROSA _E_\n.@montgomeriefdn Your commentary this weekend was fantastic. People love what you say and how you say it. _E_\nIs it legal for @BarackObama to make campaign donor calls from Air Force One? __HTTP__ Obama is always fundraising on our dime. _E_\nVia @DMRegister by @brianneDMR:\"Trump: Bring back jobs from overseas\" __HTTP__ Let's Make America Great Again! _E_\nMoney pouring into Insurance Companies profits under the guise of ObamaCare is over. They have made a fortune.Dems must get smart & deal! _E_\nI will be on The Tonight Show with Jimmy Fallon tonight at 11:30. Should be fun! 
@jimmyfallon _E_\nVia RealClear Politics __HTTP__ _E_\n\"Do more be more give more and everyone will benefit.\" – Think Like a Champion _E_\nIt was my great honor to celebrate the opening of two extraordinary museums the Mississippi State History Museum & the Mississippi Civil Rights Museum. We pay solemn tribute to our heroes of the past & dedicate ourselves to building a future of freedom equality justice & peace. __HTTP__ _E_\nI know you don't like to hear this @DannyZuker but the biggest nights of The Apprentice were far bigger than the biggest nights of Mod Fam _E_\nEntrepreneurs: See yourself as victorious. Look at the solution not the problem. _E_\nThe Democrats in Congress don't want ObamaCare for themselves or big businesses. So why are they forcing it on the American people? _E_\nGas prices have doubled under Obama. Over $5/gallon now in California. We must start drilling from our own resources to become independent. _E_\nPassion motivates. Passionate people don't give up their zeal eliminates fear. Passion can also create business opportunities. _E_\n...then he who continues the attack wins.\" Ulysses S. Grant _E_\nThe Scottish windfarm was conceived by the same mind that released terrorist al Megrahi for humanitarian reasons. .. _E_\nEveryone loves @AmandaTMiller here she is with @Joan_Rivers and me. __HTTP__ _E_\nOprah will end up doing just fine with her network she knows how to win. @Oprah _E_\nHow do you take care of our people if you don't make anything? We don't make anything. We are rapidly losing our manufacturing to China etc. _E_\nObama now just wants to save face Russia is now telling him don't do it . He waited too long and the other side is much better prepared. _E_\nWow a really great review of my golf club in Scotland @TrumpScotland in todaysgolferco.uk. Thank you! __HTTP__ _E_\nNegotiation tip: Think about what the other side wants. Know where they're coming from. Try to create a win/win situation. 
_E_\nPut Kathleen Sebelius out of her misery and lovingly say YOU'RE FIRED! Let her go home to her family and rest. BRING IN TOP FLIGHT PEOPLE! _E_\nThe United Kingdom is trying hard to disguise their massive Muslim problem. Everybody is wise to what is happening very sad! Be honest. _E_\nTo be a big success in any field you need to build momentum. Momentum is all about energy and timing. Think BIG _E_\nLike it or not haters and losers everybody is talking about Miss U.S.A. and Miss Utah. By the way she is a fine young woman unfair to her. _E_\nTrue! __HTTP__ _E_\nWhite House relaxes penalty for canceled health policies a major blow to the sustainability (and concept) of ObamaCare! They are desperate _E_\nWe will repeal & replace #Obamacare which has caused soaring double digit premium increases. It is a disaster! __HTTP__ _E_\nThe incompetence of our current administration is beyond comprehension. TPP is a terrible deal. _E_\nMY PRO GROWTH Econ Plan:✅Eliminate excessive regulations! ✅Lean government!✅Lower taxes!#Debates ... __HTTP__ _E_\nBased on the shoots which silent film do you think will be better? #CelebApprentice _E_\nJust left news conference at @TrumpTowerNY with @TheGaryBusey people love @TheGaryBusey! __HTTP__ _E_\nThoughts and prayers with the victims and their families along with everyone at the Berrien County Courthouse in St. Joseph Michigan. _E_\nThe record 13th season of 'All Star' @CelebApprentice features the return of the beautiful @BrandenRoderick. The fans love her! _E_\nThe Republicans should use everything against @BarackObama just as @BarackObama is going to use everything (cont) __HTTP__ _E_\nRT @IvankaTrump: We're working to make tax cuts & the expanded Child Tax Credit a reality for American families. The time is now! #TaxRefor... _E_\nWith multiple space options @TrumpChicago is the ideal venue to hold your dream wedding __HTTP__ _E_\nPuerto Rico is devastated. Phone system electric grid many roads gone. 
FEMA and First Responders are amazing. Governor said great job! _E_\nCrooked Hillary Clinton who I would love to call Lyin' Hillary is getting ready to totally misrepresent my foreign policy positions. _E_\nOn behalf of @FLOTUS Melania and myself thank you Poland!🇱#ICYMI watch here __HTTP__ #POTUSinPoland __HTTP__ _E_\nCrooked Hillary Clinton likes to talk about the things she will do but she has been there for 30 years why didn't she do them? _E_\n...are now fighting back like never before. There is so much GUILT by Democrats/Clinton and now the facts are pouring out. DO SOMETHING! _E_\nI am in Toronto checking the great Trump International Hotel highest rated hotel in Canada. It is a beauty! _E_\n.@BradPaisley came up to see me. A really nice and talented guy. __HTTP__ _E_\nWow @GolfMagazine just rated the renovation of The Blue Monster the best of the year. Even better they stated it may be best of all time! _E_\nCongrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. __HTTP__ _E_\nI have had the pleasure of getting to know @AnnDRomney & @MittRomney this past year. They love America. Let's push them over the top today. _E_\nVia @baltimoresun by @ErinatTheSun: \"Maryland GOP book Trump for major fundraiser\" __HTTP__ _E_\nThank you to our wonderful team @USUN and their families. Keep up the GREAT work! #USA __HTTP__ _E_\nI hate hearing after all of the hard work that @MittRomney never wanted to become President. _E_\nIf The Art of the Deal is a must read then #TimeToGetTough is my opus. It is available Dec 5th! _E_\nThank you Pennsylvania! Together we are going to MAKE AMERICA GREAT AGAIN! Watch here: __HTTP__ __HTTP__ _E_\nThank you Kirkwood Community College. Heading to the U.S. Cellular Center now for an 8pmE MAKE AMERICA GREAT AGAIN... __HTTP__ _E_\nCrooked's State Dept gave special attention to Friends of Bill after the Haiti Earthquake. Unbelievable! __HTTP__ _E_\nBig new @ABC Poll to be announced at 9:00 A.M. 
on This Week with @GStephanopoulos. I will be interviewed on show! _E_\nCredit the Bloomberg administration for having the foresight and courage to get this decades old project finished will be BIG for NY. _E_\nJoin us tomorrow in Kiawah South Carolina! #SCPrimary #VoteTrumpSC#Trump2016 __HTTP__ _E_\nThank you for all of your support Iowa!#MakeAmericaGreatAgain #Trump2016#IACaucus finder: __HTTP__ __HTTP__ _E_\nI feel sorry for Rosie 's new partner in love whose parents are devastated at the thought of their daughter being with @Rosie a true loser. _E_\n.@CNN is looking at Jeff Zucker to lead them out of the forest Jeff would be a great choice. _E_\nTop searched candidate by state as seen in the #GOPDebate media filing center. WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nRT @realDonaldTrump: Democrats are holding our Military hostage over their desire to have unchecked illegal immigration. Can't let that hap... _E_\nTrump Int. Hotel & Tower Vancouver will transform the skyline w/ its 616 ft twisting & beautiful tower __HTTP__ _E_\nDid Hillary Clinton ever apologize for receiving the answers to the debate? Just asking! _E_\nEven though every poll Time Drudge etc. has me winning the debate by a lot @FoxNews only puts negative people on. Biased a total joke! _E_\nMoney is really cheap so this is a great time to buy a house but be sure to lock in long term financing (without which don't buy). _E_\nA suggestion for the dishonest media __HTTP__ _E_\n\"If you can accept losing you can't win.\" Vince Lombardi _E_\nChina has been taking out massive amounts of money & wealth from the U.S. in totally one sided trade but won't help with North Korea. Nice! _E_\nMelania and I were thrilled to join the dedicated men and women of the @USEmbassyFrance members of the U.S. Military and their families. __HTTP__ _E_\nI said don't invade Iraq from the very beginning. 
my @SRQRepublicans speech _E_\nfires its employees builds a new factory or plant in the other country and then thinks it will sell its product back into the U.S. ...... _E_\nRT @GMA: WATCH: @IvankaTrump on women who work empowering campaign celebrates modern women. __HTTP__ _E_\nGreat column by @howardfineman on @HuffPostPol: Karl Rove Is Done __HTTP__ _E_\nVia @ChristianToday: \"Donald Trump vows to be the 'greatest representative of Christians' if he wins White House\" __HTTP__ _E_\nRemember we don't get any oil from Iraq China gets whatever ISIS hasn't already taken. So why isn't China sending the troops? Too smart! _E_\nNew National GOP Zogby Poll#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nSo disgraceful that a person illegally in our country killed @Colts linebacker Edwin Jackson. This is just one of many such preventable tragedies. We must get the Dems to get tough on the Border and with illegal immigration FAST! _E_\nAMAZING how the press protected President Obama when he did the so called comedy routine with Zach G. He looked like a fool they said cute _E_\nThe failing @nytimes story is so totally wrong on transition. It is going so smoothly. Also I have spoken to many foreign leaders. _E_\nI hear the Rickets family who own the Chicago Cubs are secretly spending $'s against me. They better be careful they have a lot to hide! _E_\nRemember politicians are all talk and no action they will never be able to MAKE OUR COUNTRY GREAT AGAIN! Controlled by lobbyists & donors _E_\nWe are going through contentious primaries now but the GOP must unite. Let's take the Senate and stop Obama's dangerous agenda. _E_\nBut while Dallas dropped to its knees as a team they all stood up for our National Anthem. Big progress being made we all love our country! _E_\nSo many people are angry at my comments on Mexico—but face it—Mexico is totally ripping off the US. Our politicians are dummies! _E_\n...instead of biting the hand that feeds you! 
Don't bother just keep making me money! _E_\nWow! Ted Cruz received $487K in campaign contributions $11M from a NY hedge fund mogul & $1M low int. loan from Goldman Sachs. Hypocrite _E_\nWe need to worry about the American worker first! _E_\nDirect view of crane from apartment window. Crane was never properly secured blowing in the breeze. __HTTP__ _E_\nWith the fantastic ratings last weekend @meetthepress & @ThisWeekABC I think it's only fair that I go on @FoxNewsSunday w/ Chris Wallace. _E_\nI was sorry to decline headlining the Reagan Dinner last Saturday due to a prior business commitment. Pres. Reagan was one of the greats. _E_\nRT @ColumbiaBugle: @realDonaldTrump Love our @FLOTUS! __HTTP__ _E_\nVote for the next Miss USA... __HTTP__ #VEGASusa11 #MissUSA _E_\nAmerica deserves a commander in chief who respects the challenges and realities our Armed Forces face in our (cont) __HTTP__ _E_\nThank you to Jack Morgan Tamara Neo Cheryl Ann Kraft and all of my friends and supporters in Virginia. GREAT JOB! _E_\nThe NFL is now thinking about a new idea keeping teams in the Locker Room during the National Anthem next season. That's almost as bad as kneeling! When will the highly paid Commissioner finally get tough and smart? This issue is killing your league!..... _E_\nIf you don't have passion everything you do will ultimately fizzle out or at best be mediocre. Is that how (cont) __HTTP__ _E_\nFox News Sunday With Chris Wallace will be re broadcast on @FoxNews at 6:00 P.M. _E_\nThe FDA must immediately stop allowing massive dose vaccinations in babies. It is mind boggling that they allow this practice to continue. _E_\nChina is ripping wealth out Africa and yet as usual refuses to put anything back to help with Ebola. Let the stupid Americans do it! SAD _E_\nThx and from a better quotation source: You miss 100% of the shots you don't take. Wayne Gretzky _E_\nRT @foxandfriends: Former President Obama's $400K Wall Street speech stuns liberal base Sen. 
Warren saying she was troubled by that __HTTP__ _E_\nThe top leadership of the New York State Republican Party is totally dysfunctional they haven't won a major election in many years. _E_\n\"Donald Trump on 'Brutal' New Season of @ApprenticeNBC\" __HTTP__ via @YahooTV _E_\nGreat job today by the NYPD in protecting the people and saving the climber. _E_\nCNBC poll: Trump won #GOPDebate #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nGreat job @AdamScott you deserve it! _E_\nWhy would the people of Texas support Ted Cruz when he has accomplished absolutely nothing for them. He is another all talk no action pol! _E_\nLeaving now for Tennessee. Big crowd! _E_\nDeparting for Long Island now. An area under siege from #MS13 gang members. We will not rest until #MS13 is eradicated. #LESM __HTTP__ _E_\nIf Obama resigns from office NOW thereby doing a great service to the country—I will give him free lifetime golf at any one of my courses! _E_\nA wonderful afternoon in Iowa! Great people! Heading now to Florida tomorrow South Carolina! #MakeAmericaGreatAgain #Trump2016 _E_\nCongrats to Charles @krauthammer for his statements on climate change formerly known as global warming! _E_\nI'm leading big in every poll and we are going to WIN! Remember Trump NEVER gives up! _E_\nGreat meeting with a wonderful woman today former Secretary of State Condoleezza Rice! #USA __HTTP__ _E_\nEntrepreneurs: Trust your instincts even after you've honed your skills. They're there for a reason. _E_\nA country must enforce its borders. Respect for the rule of law is at our country's core. We must build a wall! __HTTP__ _E_\nThe new Ebola czar will report to the WH & NSA adviser Susan Rice. More mismanagement & duplicity with CDC. Obama is terrible executive. _E_\nI hope Tom Brady sues the hell out of the @nfl for incompetence & defamation. They will drop the case against him and he will win. _E_\nCongrats to @KarlRove on blowing $400 million this cycle. 
Every race @CrossroadsGPS ran ads in the Republicans lost. What a waste of money. _E_\nRT @charliekirk11: 100 days ago a new message leader & movement took the Oval Office! A government FOR the people BY the people. This is... _E_\nJeb Bush is weak on illegal immigration in favor of common core bad on women's health issues and thinks the Iraq war was a good thing. _E_\nDo you believe that highly overrated political pundit @krauthammer said this is the best Republican field in 35 years. What a dope! _E_\nTo the people of Kentucky Rand Paul didn't want you. Now he runs back due to his presidential failure. #VoteTrump #MakeAmericaGreatAgain _E_\nA MUST READ! @AndrewBreitbart's last article The Vetting Part I: @BarackObama's Love Song to Alinsky __HTTP__ _E_\nHillary's staff thought her email scandal might just blow over. Who would trust these people with national security? __HTTP__ _E_\nIt's very sad that the administration isn't sending anyone to Margaret Thatcher's funeral. She was a big U.S. supporter. _E_\nWith Jemele Hill at the mike it is no wonder ESPN ratings have tanked in fact tanked so badly it is the talk of the industry! _E_\n.@Graeme_McDowell Great playing Graeme you are a true champion! _E_\nCanada has made business for our dairy farmers in Wisconsin and other border states very difficult. We will not stand for this. Watch! _E_\nBlack politicians are in prison based on Shirley Huntley's statements but not white @AGSchneiderman RACISM! __HTTP__ _E_\n.@USATODAY Poll and @QuinnipiacPoll say that I beat both Hillary and Bernie and I havn't even started on them yet! _E_\nCruz did not renounce his Canadian citizenship as a US Senator only when he started to run for #POTUS. He could be Canadian Prime Minister. _E_\nNegotiation: It is persuasion more than power. _E_\nWill be doing Fox & Friends at 7 A.M. 20 minutes. ENJOY! _E_\nThank you Colorado Springs. If I'm elected President I am going to keep Radical Islamic Terrorists out of our count... 
__HTTP__ _E_\nRT @TeamTrump: .@HillaryClinton just claimed she has a positive optimistic view for America. #Debates __HTTP__ _E_\nJust as I predicted people are going to be shocked by the rise in premium prices thanks to Obama Care __HTTP__ Enjoy! _E_\nPart 2 of my @jimmyfallon interview giving away some @CelebApprentice spoilers & discussing 2012 Miss Universe Pageant __HTTP__ _E_\nThank you Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_\nand knew they were in big trouble which is why they cancelled their big fireworks at the last minute.THEY SAW A MOVEMENT LIKE NEVER BEFORE _E_\n.@Jrprotalker Thanks Judy for the wonderful statements on @TrumpTurnberry. Great seeing you there & you did a fabulous job on commentary. _E_\nFive U.S. soldiers killed in Afghanistan by so called friendly fire. What are we doing? _E_\nInflation is here. Record beef prices are hitting consumers pockets __HTTP__ Bad for family grills. _E_\nToday I am standing with patriots in Arizona for border security! Build a wall! Let's Make America Great Again! __HTTP__ _E_\nThe harder you work the harder it is to surrender. @ProFootballHOF @buffalobills Head Coach Marv Levy _E_\nWow Macy's numbers just in Trump is doing better than ever thanks for your great support! _E_\nI am asking the chairs of the House and Senate committees to investigate top secret intelligence shared with NBC prior to me seeing it. _E_\nWhy is it that when Warren Buffett uses the bankruptcy laws to his benefit nobody cares but with me they go nuts! _E_\nLetterman @Late_Show begging me to go back on his low rated show calls lots must apologize for racist comment. _E_\nThank you @FoxNews Huge win for President Trump and GOP in Georgia Congressional Special Election. _E_\nThe World Bank is tying poverty to 'climate change' __HTTP__ And we wonder why international organizations are ineffective. _E_\nGoofy Elizabeth Warren who may be the least productive Senator in the U.S. Senate must prove she is not a fraud. 
Without the con it's over _E_\n{Crooked Hillary Clinton} created this mess and she knows it. #DrainTheSwamp __HTTP__ __HTTP__ _E_\n.@secupp who can't believe that her candidate has bombed so badly is one of the dumber pundits on T.V. Hard to watch zero talent! @CNN _E_\nFormer President Vicente Fox who is railing against my visit to Mexico today also invited me when he apologized for using the f bomb. _E_\nIt seems @BarackObama had our tax dollars buy guns for Mexican drug lords that were used to kill Americans. We need answers now. _E_\nI have been leading big in all polls with two more today @nbc and @CNN. The NBC poll is more than double next at 29%. Fiorina has 11%. _E_\nPres. Obama should leave the baseball game in Cuba immediately & get home to Washington where a #POTUS under a serious emergency belongs! _E_\nRead about my victory against sleazebag @AGSchneiderman. More people should fight when they're right! __HTTP__ _E_\nYoung entrepreneurs – in an economic climate like this only the strong survive. You can do it. Think Big! _E_\nBoycott @Macys no guts no glory. Besides there are far better stores! _E_\n.@Graeme_McDowell You are the toughest guy there is. If you were a boxer you'd be the champ. Great going! _E_\nVia @scotsmandotcom: \"Awards for Trump's golf course\" __HTTP__ _E_\nVisit the highly acclaimed Trump International Hotel & Tower Chicago and its exceptional 'Sixteen' restaurant __HTTP__ _E_\nWill be interviewed on @foxandfriends at 7:00 5 minutes. Then I head to New Hampshire great people! _E_\nAfter destroying the Middle East & our economy the Bushes last gift was having Justice Roberts legalize ObamaCare. No more Bushes! _E_\nGet respect and do not give a damn if people like you. Think Big _E_\nThe worst thing you can possibly do in a deal is seem desperate to make it. #TheArtofTheDeal _E_\nA new INTELLIGENCE LEAK from the Amazon Washington Postthis time against A.G. Jeff Sessions.These illegal leaks like Comey's must stop! 
_E_\n\"TRUMP TO CPAC: BUILD A GREAT ECONOMY\" __HTTP__ via @BreitbartVideo _E_\n#TBT Filming an Oreo commercial with Eli Manning Peyton Manning and Darrell Hammond __HTTP__ _E_\nHad a great time in Myrtle Beach and Charleston this past Saturday and Monday. Looking forward to going back soon. _E_\nThe majority of Americans agree with @MittRomney's comments on @Israel and Iran. _E_\nLinkedIn Workforce Report: January and February were the strongest consecutive months for hiring since August and September 2015 _E_\n#FlashbackFriday Just after I did my renovation in Central Park of @TrumpRink __HTTP__ _E_\nThe dishonest media likes saying that I am in Agreement with Julian Assange wrong. I simply state what he states it is for the people.... _E_\nVia @NorthvillePatch: Donald Trump to Speak in Novi This May __HTTP__ _E_\nThe GOP doesn't waste an opportunity to waste an opportunity. Defunding Obamacare should be central to any deal. _E_\nJust found out I won the Rockingham County Republican Booth Straw Poll at the Deerfield Fair in New Hampshire this past weekend. 39% Wow! _E_\nMy thoughts and prayers are with the two police officers shot in Sebastian County Arkansas. #LESM _E_\nThe Central Park Five documentary was a one sided piece of garbage that didn't explain the.horrific crimes of these young men while in park _E_\nSince the first day I took office all you hear is the phony Democrat excuse for losing the election Russia RussiaRussia. Despite this I have the economy booming and have possibly done more than any 10 month President. MAKE AMERICA GREAT AGAIN! _E_\nMy father's 4 step formula for success: Get in get it done get it done right and get out. Fred C. Trump _E_\nWatch @AC360 on NOW! @CNN _E_\nCrooked Hillary Clinton is spending a fortune on ads against me. I am the one person she doesn't want to run against. Will be such fun! _E_\nJoin me in Reno Nevada on Wednesday at 3:30pm at the Reno Sparks Convention Center! #MAGATickets:... 
__HTTP__ _E_\nCongrats to @greggutfeld on his new @FoxNews show! Greg makes great TV and is a terrific guy. _E_\nW/state of the art Clubhouse & our signature amenities @Trump_Charlotte brings true luxury to The Tar Heel State __HTTP__ _E_\nLooking forward to speaking at @Citizens_United & @SteveKingIA's \"Iowa's Freedom Summit\" on January 24th __HTTP__ _E_\nPaulina @MissUniverse Vega will be introduced tonight at the Finale of Celebrity Apprentice.She is a great beauty and a monster star in S.A. _E_\n.@TrumpDoral offers multiple award winning dining options in our all new signature restaurant and lounges __HTTP__ _E_\nFracking poses ZERO health risks __HTTP__ In fact it increases our national security by making us energy independent. _E_\nMy speech is right now on C SPAN 1 _E_\nI will be meeting with Henry Kissinger at 1:45pm. Will be discussing North Korea China and the Middle East. _E_\nWatch @extratv's spot covering the first annual Trump Invitational at Mar a Lago __HTTP__ _E_\nCongratulations to my head pro of Trump International Golf Club (Florida) John Nieporte for qualifying for the U.S. Open! @usopengolf _E_\nObama stop the flights to and from West Africa NOW before it is too late! Can't you see what's happening? Can you be that thick (stupid)? _E_\nDoes @BarackObama ever work? He is constantly campaigning and fundraising on both the taxpayer's dime and time not fair! _E_\nWhen it comes to the future of America's energy needs we will FIND IT we will DREAM IT and we will BUILD IT.... __HTTP__ _E_\nThank you Roseanne very much appreciated. __HTTP__ _E_\n.@politico has no power but so dishonest! _E_\nEntrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_\nGovt. collapsing in Iraq only 2 weeks after withdrawal of our troops. Sadly I called this one and please remember I alone called it. _E_\nI think @megynkelly should take another eleven day unscheduled vacation. _E_\nIs Jon Stewart a racist? See video that includes clip... 
__HTTP__ #thedailyshow _E_\n.@kevinjonas was great but he brought the wrong person into the boardroom. Had he brought Lorenzo in he would not have been fired. _E_\n.@antbaxter should really be ashamed about his massive box office disaster. Take a hint and get out of the film (cont) __HTTP__ _E_\nStatement Regarding British Referendum on E.U. Membership __HTTP__ _E_\nIn three years people won't be building wind turbines anymore they are obsolete & totally destroy the environment in which they sit. _E_\nGet ready for fireworks...@Joan_Rivers & @THEGaryBusey face off in the Board Room this Sunday on All Star Celebrity @ApprenticeNBC. _E_\nLots of autism and vaccine response. Stop these massive doses immediately. Go back to single spread out shots! What do we have to lose. _E_\nStill looking to give away a RECORD $1M reward on @fundanything for a crowd funding campaign __HTTP__ _E_\nIn addition to those without health coverage those that have disastrous #Obamacare are seeing MASSIVE PREMIUM INCR... __HTTP__ _E_\n.@Neilyoung A few months ago Neil Young came to my office looking for $$ on an audio deal & called me last week to go to his concert. Wow! _E_\nWhy won't Obama release his college applications? Is there something 'foreign' about them? _E_\nAn HR solutions company polled 1000 employed adults to find out who would make ideal bosses... __HTTP__ _E_\nIf the Palestinians want statehood then why are they run by the terrorist group Hamas? _E_\nGreat time in Burlington Vermont. Crowd was amazing. _E_\nMy heartfelt condolences to the family of Kathryn Steinle. Very very sad! _E_\nWatch this behind the scenes video of @IvankaTrump's Fall 2012 collection photo shoot __HTTP__ _E_\nBe a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_\nFailing @NYTimes will always take a good story about me and make it bad. Every article is unfair and biased. Very sad! _E_\nRT @FoxNews: .@POTUS: I'm not against the media. 
I'm against the FAKE media. #CashinIn __HTTP__ _E_\nMet @newtgingrich at Trump Tower today. He's a big thinker. _E_\nOutrageous @BarackObama is suing to suppress the military vote in Ohio __HTTP__ Our Commander in Chief should be ashamed. _E_\nBe sure to watch #MissUniverse tonight at 8PM on @nbc with its first simulcast on @Telemundo! _E_\nI'd bet the horrible look of Pinehurst translates into poor television ratings. This is not what golf is about! _E_\nRight To Play uses the power of play to educate and empower children facing adversity. A great cause check it out. __HTTP__ _E_\nThe Mar a Lago Club was amazing tonight. Everybody was there the biggest and the hottest. Palm Beach is so lucky to have best club in world _E_\nMullet Bay Golf Course looks like a slum on the beautiful island of St. Maarten. @PrimeMinisterSX should be ashamed for allowing this. _E_\nMy @IngrahamAngle interview discussing @JebBush's comments a united 2012 GOP #CelebApprentice & Trump#Miss Universe __HTTP__ _E_\n\"If you can't explain it simply you don't understand it well enough.\" Albert Einstein _E_\nI spoke with President Moon of South Korea last night. Asked him how Rocket Man is doing. Long gas lines forming in North Korea. Too bad! _E_\nJust as I predicted today Obama called for even more tax increases. The Republicans played right into his hands and blew their cards. _E_\n\"Always strive to outdo yourself.\" – Think Big _E_\nNo wonder the @nytimes is failing—who can believe what they write after the false malicious & libelous story they did on me. _E_\n\"Attitude is a little thing that makes a big difference.\" Winston Churchill _E_\nDo you believe that Obama is giving weapons to moderate rebels in Syria.Isn't sure who they are. What the hell is he doing.Will turn on us _E_\nI really enjoyed doing the show circuit this AM discussing lightweight AG Eric Schneiderman & the terrible job he has done for NY. 
_E_\nOne of the reasons I am no fan of John McCain is that our Vets are being treated so badly by him and the politicians. I will fix VA quickly. _E_\nThe banks need to start lending again otherwise the economy will continue its downturn. This is why we bailed the banks out! _E_\nPuerto Rico Governor Ricardo Rossello just stated: The Administration and the President every time we've spoken they've delivered...... _E_\nLightweight @AGSchneiderman is driving business out of New York for his own public relations benefit. A real dope! _E_\nWhy does @FoxNews keep George Will as a talking head? Wrong on so many subjects! _E_\nSo Obama's top people responsible for ObamaCare think the American Public is stupid! All based on lies and deception! Repubs should sue. _E_\nWe should not attack Syria but if they make the stupid move to do so the Arab Leaguewhose members are laughing at us should pay! _E_\nThanks to everyone who has waited in the long lines at the #TimeToGetTough book signings. It is great to meet fellow patriots. _E_\nConde Nast Traveler Readers' Choice Awards Best Resorts in Europe: Trump Int'l Hotel & Golf Links Doonbeg voted #1. __HTTP__ _E_\nThomas Jefferson wrote the Senate filibuster rule. Harry Reid & Obama killed it yesterday. Rule was in effect for over 200 years. _E_\n.@ericbolling you can do much better than you did tonight on @oreillyfactor. Better luck tomorrow! _E_\n.@BarackObama Hood: Rob our children's future by borrowing from the Chinese to pay for socialist programs that will bankrupt us. _E_\nThe Church is yet another victim to his liberal agenda: @BarackObama lied to his Catholic supporters to pass ObamaCare. _E_\n...the beauty that is being taken out of our cities towns and parks will be greatly missed and never able to be comparably replaced! _E_\nHorrible and cowardly terrorist attack on innocent and defenseless worshipers in Egypt. 
The world cannot tolerate terrorism we must defeat them militarily and discredit the extremist ideology that forms the basis of their existence! _E_\nWithout passion you don't have energy without energy you have nothing! _E_\nThe Dallas event in two weeks at the American Airlines Center is filling up fast. Get your tickets fast before it is too late! _E_\nSo impt Rep Senators under leadership of @SenateMajLdr McConnell get healthcare plan approved. After 7yrs of O'Care disaster must happen! _E_\nThe @BarackObama administration is far more enthusiastic about boosting food stamp enrollment than about preventing fraud. #TimeToGetTough _E_\nVERY IRONIC: In 2010 video Clinton lectured underlings on cybersecurity and guarding 'sensitive information' __HTTP__ _E_\n.@VattenfallGroup will never solve the issues with the Ministry of Defense. Besides they smartly just left the project. _E_\nHighly overrated & crazy @megynkelly is always complaining about Trump and yet she devotes her shows to me. Focus on others Megyn! _E_\nHillary whose decisions have led to the deaths of many accepted $ from a business linked to ISIS. Silence at CNN. __HTTP__ _E_\nThe U.S. Consumer Confidence Index for December surged nearly four points to 113.7 THE HIGHEST LEVEL IN MORE THAN 15 YEARS! Thanks Donald! _E_\n.@CNN is so negative getting even worse as I get closer. Just had two anti Trump losers with zero rebuttal from my team. Turning off! _E_\nJoin me this Saturday at Ladd–Peebles Stadium in Mobile Alabama! #ThankYouTour2016 Tickets:... __HTTP__ _E_\n.@chucktodd said today on @meetthepress that attacking Bill to get to Hillary has never worked before. Wrong attacked him in '08 & won! _E_\n...long he doesn't know how to win anymore just look at the mess our country is in bogged down in conflict all over the place. Our hero.. _E_\nWe are one step closer to delivering MASSIVE tax cuts for working families across America. 
Special thanks to @SenateMajLdr Mitch McConnell and Chairman @SenOrrinHatch for shepherding our bill through the Senate. Look forward to signing a final bill before Christmas! __HTTP__ _E_\nEntrepreneurs: Set the example. You can motivate others as well as yourself by remembering that you are setting the example. _E_\nThank you Colorado! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_\nJust as I predicted @Joe_Biden was a complete disaster in China. He condoned the Chinese one child policy an... (cont) __HTTP__ _E_\nVia @DC_Decoder: Donald Trump to 'surprise' GOP convention. What might he do? __HTTP__ Answer: Something major! _E_\nWhat is happening in Atlantic City casino closures is very sad but does anybody give me credit for getting out before its demise? Timing _E_\nHow can Jeb Bush expect to deal with China Russia + Iran if he gets caught doing a \"plant\" during my speech yesterday in NH? _E_\nThe best thing you can do is deal from strength and leverage is the biggest strength you have. Leverage is (cont) __HTTP__ _E_\nWatching my beautiful wife Melania speak about our love of country and family. We will make you all very proud.... __HTTP__ _E_\nToday there were terror attacks in Turkey Switzerland and Germany and it is only getting worse. The civilized world must change thinking! _E_\nTomorrow I will be tweeting on only one subject! _E_\nLooking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone and Natalie Morales. _E_\nMy interview on @gretawire last night Our Leaders Are Leading Us Into 'Oblivion' __HTTP__ _E_\nWho is @Macys to pretend innocence when they \"racial profile\" all over the place? Paid big fine! _E_\nI will be commenting LIVE on Sunday night (9 to 11) on TWITTER Celebrity Apprentice will be great this season amazing cast! _E_\nI'll be turning the table on Larry King this Saturday night. I'll be interviewing him in honor of the 25th Anniversary of his show. 
_E_\nChina is primed to continue to rob us and steal our jobs through their exports __HTTP__ We need @MittRomney to rein them in. _E_\nTake a chance! All life is a chance. The man who goes farthest is generally the one who is willing to do and dare. Dale Carnegie _E_\nThe public is learning (even more so) how dishonest the Fake News is. They totally misrepresent what I say about hate bigotry etc. Shame! _E_\nA Rod has disgraced the blessed @Yankees organization lied to the fans & embarrassed NYC. He does not deserve to wear the pinstripes. _E_\nTHANK YOU IOWA!#ThankYouTour2016 __HTTP__ _E_\n.@EricTrump was FANTASTIC on @foxandfriends this morning. He may be my son but he is a special guy! _E_\nBeing politically correct takes too much time. We have too much to get done! #Trump2016 __HTTP__ __HTTP__ _E_\nRT @JerryTravone: @realDonaldTrump __HTTP__ _E_\nGetting ready to leave for Michigan will be an amazing evening! See you there. _E_\nJoin me at 4pm over at the Lincoln Memorial with my family!#Inauguration2017 __HTTP__ _E_\nThe Fake News is at it again this time trying to hurt one of the finest people I know General John Kelly by saying he will soon be..... _E_\ndo this under the law I feel it is visually important as President to in no way have a conflict of interest with my various businesses.. _E_\nLIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_\nWhen the Super Committee fails @BarackObama will get exactly what he really wants automatic cuts in defense spending. This is his plan. _E_\nRT @dcexaminer: EXCLUSIVE: How Donald Trump's 30 million followers are crashing the Internet __HTTP__ __HTTP__ _E_\nCharles McCullough the respected fmr Intel Comm Inspector General said public was misled on Crooked Hillary Emails. \"Emails endangered National Security.\" Why aren't our deep State authorities looking at this? Rigged & corrupt? 
@TuckerCarlson @seanhannity _E_\nJust finished speaking in Jacksonville Florida. Incredible crowd fantastic people. Thank you! _E_\nRio de Janeiro joins the @TrumpCollection in 2016. It's going to be a spectacular hotel! __HTTP__ _E_\nThis George Zimmerman is really a mess he really has to just disappear! (He attacked his wife last night). _E_\nCongratulations to Woody Johnson and @nyjets on acquiring @TimTebow.@TimTebow is not only a winner but a leader. (cont) __HTTP__ _E_\nLooking forward to the 2010 Miss USA Pageant Sunday May 16 on NBC 7 p.m. ET hosted by Curtis Stone & Natalie Morales live from Las Vegas. _E_\nThank you Oregon! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nVia CNET: Donald Trump Bests Jeb Bush in Website Performance Experts Say __HTTP__ _E_\nOnly 109 people out of 325000 were detained and held for questioning. Big problems at airports were caused by Delta computer outage..... _E_\nObamaCare is dead and the Democrats are obstructionists no ideas or votes only obstruction. It is solely up to the 52 Republican Senators! _E_\nApple is finally considering a large screen for the I Phone they better get moving fast. When I told them to do this last year they scoffed _E_\nThoughts and prayers for those in the floods affecting the great people of South Carolina. _E_\nWill be on @foxandfriends at 7:02 A.M. Enjoy. _E_\nI certainly hope the Democrats do not force Nancy P out. That would be very bad for the Republican Party and please let Cryin' Chuck stay! _E_\nN.Y.Times headline states Obama suffers setbacks in Japan trade deal. Can somebody please tell him that with all they sell us WE HAVE CARDS _E_\n.@mcuban has less TV persona than any other person I can think of. He's an arrogant crude dope who met some very stupid people... _E_\nWith few exceptions only really smart people are able to make a lot of money. Hard work is also important but brains will supersede. _E_\nA terrible deal with Iran! 
__HTTP__ _E_\nEgypt's Muslim Brotherhood just made its first visit to Hamas led Gaza. Why did @BarackObama promote the Arab Spring ? _E_\nMy @gretawire int. on Leon Panetta's critique of Obama Ebola rise of ISIS Obama's lack of common sense & 2016 __HTTP__ _E_\n\"Positive thinking is not merely wishful thinking... _E_\nDress for success. The Donald J. Trump Signature Collection exclusively available @Macys.com __HTTP__ _E_\nInterestingly the hurricane may now be a disaster for Obama's reelection because of his grandstanding. _E_\nA great honor from somebody that knows how to win! __HTTP__ _E_\nGreat to hear that @nfl legend and hall of famer John Elway has endorsed @MittRomney in Colorado. CO is a must win state for Mitt. _E_\n.@TheJuanWilliams you never speak well of me & yet when I saw you at Fox you ran over like a child and wanted a picture. Please share pic! _E_\nCongratulations to our great resident of Chicago Trump Tower Patrick Kane @88PKane for the #StanleyCup win & winning MVP of series. _E_\nWith @stuartpstevens expected to represent @GovChristie in the Presidential race Chris will have a very hard time winning. _E_\n....your release possible and HAVE A GREAT LIFE! Be careful there are many pitfalls on the long and winding road of life! _E_\nWe need your vote. Go to the POLLS! Let's continue this MOVEMENT! Find your poll location: __HTTP__ __HTTP__ _E_\n.@serenawilliams is a special player. After winning the Gold for the US in the Olympics it looks like she will (cont) __HTTP__ _E_\nThey are great people! __HTTP__ _E_\nCongratulations to @spurs on their @NBA championship. Well deserved. _E_\nI will be interviewed by @donlemon tonight on @CNN at 10PM. _E_\n.@bwilliams knows that I think his newscast has become totally boring so he took a shot at me last night. _E_\nThe passage of the @DeptVetAffairs Accountability and Whistleblower Protection Act is GREAT news for veterans! I lo... 
__HTTP__ _E_\nTrump: Obama is 'Unlucky President' __HTTP__ via @Newsmax_Media _E_\nIf traveling to the Windy City to celebrate 100th anniversary of Wrigley Field @TrumpChicago is Chicago's #1 hotel __HTTP__ _E_\n.@SenMikeLee refuted every point Karl 1.6% Rove made on the need to defund ObamaCare.Must listen __HTTP__ @TheRightScoop _E_\nWas with great people last night in Fort Myer Virginia. The future of our country is strong! _E_\n#Trump2016 #TrumpInstagram: __HTTP__ __HTTP__ _E_\nDo you notice that the polling establishment doesn't put me in polls but put in folks who hardly register. MAKE AMERICA GREAT AGAIN! _E_\nOften times being 'innovative' is simply putting together pre existing elements into something new. Be resourceful & expect success. _E_\nSTATEMENT IN RESPONSE TO PRESIDENT OBAMA'S FAILED LEADERSHIP: __HTTP__ _E_\nThe economy cannot take four more years of these same failed policies.#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nThank you Willie Robertson! #VoteTrump #MakeAmericaGreatAgain __HTTP__ _E_\nYou miss 100% of the shots you don't take. Wayne Gretzky _E_\nPerhaps Miss USA can lure Snowden back? _E_\nDACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed money away from our Military. _E_\nI never quit trying. I never felt that I didn't have a chance to win. Arnold Palmer @KingdomMag _E_\nMy @Gretawire interview where I discuss why @BarackObama is an economic ignoramus and how OPEC is inflating gas prices. __HTTP__ _E_\nVia @digitaljournal: Donald Trump tweets Obama is 'an incompetent President' __HTTP__ _E_\nGreat evening with the @AmSpec & the T. Boone Pickens Entrepreneur Award. Amazing crowd—thank you! _E_\n.@MissUSA Erin Brady is doing a fantastic job representing Trump Miss USA. Smart gorgeous a really positive force! _E_\nAny and all weather events are used by the GLOBAL WARMING HOAXSTERS to justify higher taxes to save our planet! They don't believe it $$$$! 
_E_\nI hope you can go to @oreillyfactor and vote for Donald Trump in order to Make America Great Again! Thanks. _E_\nThank you Rep. Collins! #Trump2016 __HTTP__ _E_\nTime Warner cable out AGAIN in Manhattan no television. They have a real problem! _E_\nVia @ConservReview by @JeffJlpa1: Why Donald Trump is Right __HTTP__ _E_\n\"Polling strong Donald Trump starting to get serious\" __HTTP__ via @bostonherald by @JaclynCashman _E_\nVia @lohud by @hoopsmbd: \"Buzz builds for @TrumpFerryPoint\" __HTTP__ _E_\nTrump Towers Istanbul Sisli will be one of the country's top landmarks __HTTP__ _E_\nDopey @GeorgeWill the most overrated political pundit in the business continues to downgrade the Republican (cont) __HTTP__ _E_\nRepublicans had all the cards but not the guts to make a great deal! _E_\nHow does this cast look to you? Pretty amazing. #CelebApprentice _E_\nThe ratings for the Republican National Convention were very good but for the final night my speech great. Thank you! _E_\nI'm going to be live with @ericbolling and @kimguilfoyle to ring in the New Year 2016. Everybody should tune in to @foxnews tomorrow night! _E_\nUnprecedented success for our Country in so many ways since the Election. Record Stock Market Strong on Military Crime Borders & ISIS Judicial Strength & Numbers Lowest Unemployment for Women & ALL Massive Tax Cuts end of Individual Mandate and so much more. Big 2018! _E_\nPrime Minister @David_Cameron is very foolish in giving @AlexSalmond so much money to build wind turbines which r destroying Scotland. _E_\nWatch my interview on @CBSNews Face The Nation now and also the new CBS POLLS which if good for me the media won't report! _E_\nThe Wall is a very important tool in stopping drugs from pouring into our country and poisoning our youth (and many others)! If _E_\nWhile Hillary profits off the rigged system I am fighting for you! Remember the simple phrase: #FollowTheMoney... 
__HTTP__ _E_\nDo you believe @algore is blaming global warming for the hurricane? _E_\nCongrats @JanineTurner on new book A Little Bit Vulnerable you're a breath of fresh air in the political forum __HTTP__ _E_\nThe hedge fund guys (gals) have to pay higher taxes ASAP. They are paying practically nothing. We must reduce taxes for the middle class! _E_\nWhat's with this rap stuff with me and Ebenezer Scrooge? __HTTP__ _E_\nThe Chinese want to steal our jobs and technology that includes so called green energy which they make but (cont) __HTTP__ _E_\nJoin @TeamTrump on Facebook & watch tonight's rally from Geneva Ohio our 3rd rally of the day. #AmericaFirst #MAGA __HTTP__ _E_\nWith the $635 million dollar website fiasco getting caught tapping phones of WORLD LEADERS and so much more U.S. is looking really stupid! _E_\nWe are taking care of hundreds of people in the Trump Tower atrium they are seeking refuge. Free coffee and food. _E_\n.@MacMiller's Donald Trump just hit 60 million hits. Maybe I should go into a new business. _E_\nThoughts & prayers with the millions of people in the path of Hurricane Matthew. Look out for neighbors and listen... __HTTP__ _E_\nTalent is cheaper than table salt. What separates the talented individual from the successful one is a lot of hard work. Stephen King _E_\nMAKE AMERICA GREAT AGAIN!#AmericaFirst #Trump2016 __HTTP__ _E_\n.@TrumpGolfLA has panoramic Pacific Ocean views features a 7242 yard public course designed by Pete Dye __HTTP__ _E_\nCongratulations to @nyknicks on winning their first Atlantic Division title since 1994. @carmeloanthony is a great New Yorker and Knick! _E_\n.@KarlRove stated clearly that he wants to repeal the 2nd Amendment. I thought @FoxNews was going to fire that jerk after his Romney fiasco? 
_E_\n.@GolfMonthly re: my Scottish course \"Quite simply this is not the best new links course in the UK it is the best links course full stop _E_\nToday I signed the Holocaust Remembrance Proclamation: __HTTP__ #ICYMI My statement last night at... __HTTP__ _E_\nMy parents: Trust in God and be true to yourself. Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_\nGreat poll numbers all over and beating Hillary Clinton one on one. Thank you! _E_\nIt is time for Iran to face serious consequences. This regime is a threat to our national security. _E_\n.@CNBC has just agreed that the debate will be TWO HOURS. Fantastic news for all especially the millions of people who will be watching! _E_\nDon't like @SamuelLJackson's golf swing. Not athletic. I've won many club championships. Play him for charity! _E_\nThe Democratic National Committee would not allow the FBI to study or see its computer info after it was supposedly hacked by Russia...... _E_\nNEW @MittRomney TV AD Dream For these small businesses hope and change was not so kind: __HTTP__ #tcot _E_\nWhy are Democrats fighting massive tax cuts for the middle class and business (jobs)? The reason: Obstruction and Delay! _E_\nI hope we never find life on other planets because there's no doubt that the U.S. Government will start sending them money! _E_\nThe ratings for the Celebrity Apprentice were fantastic and everyone had a great time. It was a terrific season congrats to everyone! _E_\nA Rod @Yankees had hip surgery & will be out 6 months. Do you notice all the \"druggies\" have bad hips. _E_\nTurnberry one of the most beautiful places in the world.... soon to be Trump Turnberry a Luxury... 
__HTTP__ _E_\nRT @DRUDGE_REPORT: GREAT AGAIN: +235000 __HTTP__ _E_\nThank you @JeffJlpa1 and @AmSpec for the wonderful and very true article \"Total Desperation on Iran\" __HTTP__ _E_\nI hear that dopey political pundit Lawrence O'Donnell one of the dumber people on television is about to lose his show no ratings?Too bad _E_\nI'd bet the lawyers for the Central Park 5 are laughing at the stupidity of N.Y.C. when there was such a strong case against their clients _E_\nScary – in the past 90 days Obama has set over 6125 regulatory burdens __HTTP__ Terrible for the economy. _E_\nThanks. __HTTP__ _E_\nTed Cruz attacked New Yorkers and New York values we don't forget! __HTTP__ _E_\nStanding with Jamiel Shaw Sabine Durdin Don Rosenberg Lupe Moreno Brenda Sparks Robin Hvidston & their spouses. __HTTP__ _E_\nCrooked Hillary Clinton will be a disaster on jobs the economy trade healthcare the military guns and just about all else. Obama plus! _E_\nVia @USNewsTravel: \"Best New York City Hotels: @TrumpNewYork\" __HTTP__ _E_\nMuhammad Ali is dead at 74! A truly great champion and a wonderful guy. He will be missed by all! _E_\nNegotiation tip: View any conflict as an opportunity this will expand your mind as well as your horizons. Persistence can go a long way. _E_\nWord is spreading that I got a tattoo no way I am not a fan! _E_\nObama sent weapons through Benghazi to ISIS yet he is holding up shipments to Israel. What is he thinking? _E_\nThe Patch a total loser for @AOL will be a good deal compared to @HuffingtonPost. @ariannahuff laughs at \"stupid\" Armstrong! _E_\nAs I have always said let ObamaCare fail and then come together and do a great healthcare plan. Stay tuned! _E_\nThank you Ohio! #AmericaFirst __HTTP__ _E_\nRemember THE HARDER YOU WORK THE LUCKIER YOU GET! _E_\nHappy New Year to all including to my many enemies and those who have fought me and lost so badly they just don't know what to do. Love! 
_E_\nSaudis just cut oil supplymaking prices rise \"immediately\" while we are fighting ISIS for them __HTTP__ What are we doing! _E_\n\"It is hard to fail but it is worse never to have tried to succeed.\" Theodore Roosevelt _E_\nThe U.S. once again condemns the brutality of the North Korean regime as we mourn its latest victim. Video: __HTTP__ _E_\nThank you so many people have given me credit for winning the debate last night. All polls agree. It was fun and interesting! _E_\nThe latest update on Bret Michaels is that he's making every effort to attend the live finale of Celebrity Apprentice on Sunday so tune in! _E_\nGreat piece by @EWErickson @RedState exposing how Karl 1.6% Rove cooked a poll in support of ObamaCare __HTTP__ _E_\nThank you! #VoteTrump __HTTP__ _E_\n.@CNN @jaketapper at 9:00 A.M. _E_\nKeep talking about me: use #TrumpRoast to tweet about how good I look on @ComedyCentral tonight at 10:30/9:30c __HTTP__ _E_\nTonight I will be signing copies of #TimeToGetTough in Westbury at Costco 1250 Old Country Rd from 6 pm to 8 pm _E_\nThank you Iowa! #ImWithYou __HTTP__ _E_\n.@FoxNews will be re running Objectified: Donald Trump the ratings hit produced by the great Harvey Levin of TMZ at 8:00 P.M. Enjoy! _E_\nCongratulations to @MittRomney on Tuesday night's sweep. He also delivered a 'Killer Speech' __HTTP__ _E_\nI love that in addition to everything else so much money is raised for such great causes on Celebrity Apprentice all proud of that! _E_\nMy @gretawire interview discussing @BarackObama's USC comments insurance premiums @SarahPalinUSA on the (cont) __HTTP__ _E_\nWhen you think big you will automatically trigger more details because details are the major component of making anything big. _E_\nVia @eonline by @BrettMalec: \"2014 @MissUniverse Contestants\" __HTTP__ _E_\nRT @TeamTrump: RT if you believe @HillaryClinton is the one who owes America an apology! 
#BigLeagueTruth #Debates __HTTP__ _E_\nLying traitor Snowden now claims that he did not give any information to the Russians or Chinese. Why doesn't he come home then? _E_\nI can't believe that Mitt Romney would run for president again. He had his chance and blew it in the last weeks of the race. _E_\nWhy is Douglas Durst allowed to use the World Trade Center to get out of a lease with Conde Nast? _E_\nAnother new poll. Thank you for your support! Join the MOVEMENT today! #ImWithYou __HTTP__ __HTTP__ _E_\nnot anymore. The beginning of the end was the horrible Iran deal and now this (U.N.)! Stay strong Israel January 20th is fast approaching! _E_\nWe can create jobs in the American economy by protecting our own manufacturing sector. _E_\nBudget that just passed is a really big deal especially in terms of what will be the biggest tax cut in U.S. history MSM barely covered! _E_\nThank you @JerryJrFalwell will see you soon. #TrumpPence16 __HTTP__ _E_\n.@HillaryClinton channels John Kerry on trade: she was for bad trade deals before she was against them. #TPP #Debates2016 _E_\nGoofy Elizabeth Warren and her phony Native American heritage are on a Twitter rant. She is too easy! I'm driving her nuts. _E_\nWhy is it that the horrendous protesters who scream curse punch shut down roads/doors during my RALLIES are never blamed by media? SAD! _E_\n.kimguilfoyle great job tonight! _E_\nMore of my #TRUMPTUESDAY @SquawkCNBC interview discussing how the US gets killed negotiating with other countries __HTTP__ _E_\nTrump Int'l Hotel & Tower New York includes Central Park views & our signature restaurant Jean Georges. Perfection! __HTTP__ _E_\nLance Armstrong just got sued by the Federal Government they want their money back I told you so! What was he thinking when he did that int? _E_\nEntrepreneurs: See yourself as victorious. Look at the solution not the problem. And never give up! _E_\nRT @FoxNews: Jobs added during @POTUS' time in office. 
__HTTP__ _E_\n#CelebApprentice Photo from last night's boardroom. __HTTP__ _E_\nRT @Reince: With a strong candidate in @POTUS & @GOP revolutionary data program Republicans carried WI for 1st time in 30 years __HTTP__ _E_\nMarine Plane crash in Mississippi is heartbreaking. Melania and I send our deepest condolences to all! _E_\nIt is my opinion that many of the leaks coming out of the White House are fabricated lies made up by the #FakeNews media. _E_\nRT @Team_Trump45: @realDonaldTrump __HTTP__ _E_\nMy robocall on behalf of @MittRomney playing across the state of Michigan __HTTP__ _E_\nPresident Obama our great leader wants to declare martial law in New York City as a means of helping out with the massive storm. _E_\nHagel's performance yesterday was the worst I have ever seen before a committee of any kind! _E_\nOur great project in South America Trump Tower Punta Del Este in Uruguay will have spectacular views and the... __HTTP__ _E_\nCongratulations to @MittRomney on getting the @DMRegister @NewYorkPost @NewYorkObserver & @NashuaTelegraph endorsements! _E_\nHuckabee is a nice guy but will never be able to bring in the funds so as not to cut Social Security Medicare & Medicaid. I will. _E_\nWith an elite course designed by @SharkGregNorman @Trump_Charlotte is North Carolina's most desirable club __HTTP__ _E_\nWRONG: A China court ordered @apple to pay $60M to a Chinese company that registered iPad before @apple __HTTP__ _E_\nOne of the things that has been lost in the politics of this situation is that the Russians collected and spread negative information..... _E_\nThe Senate should be more concerned about actually passing a budget than spreading lies about @MittRomney's taxes. _E_\nYesterday was a referendum on ObamaCare & all other Obama fiascos. Republicans can now rein him in. _E_\nLogic will get you from A to B. Imagination will take you everywhere. Albert Einstein\" _E_\nHonored to have received the endorsement of Lou Holtz a great guy! 
#INPrimary #Trump2016 __HTTP__ _E_\nTed Cruz went down big in just released Reuters poll what's going on? Is it Goldman Sachs/Citi loans or Canada? _E_\nRush is right. @limbaugh and I have both created more jobs than @BarackObama...in fact far more jobs! _E_\nTuesday will be a big day for our country to do a complete turnaround. MAKE AMERICA GREAT AGAIN! _E_\nA message to my fellow Americans#IrmaHurricane2017 __HTTP__ __HTTP__ __HTTP__ _E_\nThe ONLY bad thing about winning the Presidency is that I did not have the time to go through a long but winning trial on Trump U. Too bad! _E_\nWhy is that Hillary Clintons family and Dems dealings with Russia are not looked at but my non dealings are? _E_\n.@politico covers me more inaccurately than any other media source and that is saying something. They go out of their way to distort truth! _E_\nMy @FoxNews @megynkelly int. on why I am considering running for POTUS negotiations & making America great again __HTTP__ _E_\nWith the great vote on Cutting Taxes this could be a big day for the Stock Market and YOU! _E_\nI am no fan of Bill Cosby but never the less some free advice if you are innocent do not remain silent. You look guilty as hell! _E_\nImmigration reform is fine—but don't rush to give away our country! Sounds like that's what's happening. _E_\nThere's nothing like fall in #NewYorkCity. See where @TrumpCollection recommends you take in the season's beauty: __HTTP__ _E_\nwithout retribution or consequence is WRONG! There will be a tax on our soon to be strong border of 35% for these companies ...... _E_\nWe were let down by all of the Democrats and a few Republicans. Most Republicans were loyal terrific & worked really hard. We will return! _E_\nDon't let the fake media tell you that I have changed my position on the WALL. It will get built and help stop drugs human trafficking etc. _E_\nObama said not optimal to Ambassador & embassy killings bad word usage for a Harvard graduate. 
_E_\nSee you tonight Huntington West Virginia!#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_\n.@Franklin_Graham: Great job on @foxandfriends this morning. You beautifully stated what most people are thinking! Say hi to all. _E_\nVia CNN: Trump now leads in odds to win GOP nomination __HTTP__ _E_\nNancy Pelosi and Fake Tears Chuck Schumer held a rally at the steps of The Supreme Court and mic did not work (a mess) just like Dem party! _E_\nWill be interviewed on the @oreillyfactor tonight at 8:00 P.M. Will be talking about the debate and more! _E_\nThe writer of the now proven false story in the @nytimes Michael Barbaro who was interviewed on CBS this morning was unable to respond. _E_\nNo matter how far down a path you go if it's the wrong path turn around and go back home before it is too late. _E_\nMar a Lago is Florida's most lavish and exclusive private club and spa with world class amenities __HTTP__ _E_\nDeparting NH now great morning with record crowd in Portsmouth in a snow storm! Thank you! __HTTP__ __HTTP__ _E_\nYou must be kidding zero chance he is innocent! _E_\nBe sure to tune in for Melania's second QVC show for Melania Timepieces & Jewelry tonight live from 9 10 pm on QVC __HTTP__ _E_\nRT @IvankaTrump: Check out my May Redbook magazine cover. Very exciting! #Redbook __HTTP__ _E_\nA rough night for Hillary Clinton ABC News. _E_\nPay attention to global news and developments in today's world that is a requirement not an elective. _E_\nJoin me in San Jose California tonight!#MakeAmericaGreatAgain #Trump2016Tickets: __HTTP__ __HTTP__ _E_\nExplain to @brithume and @megynkelly who know nothing that I will beat Hillary and win states (and dem indie votes) that no other R can! _E_\n#CrookedHillary #PayToPlay __HTTP__ _E_\nIf the U.S. Government doesn't give the money necessary for the burials of our military personnel I will.The U.S. under Obama's leadership! _E_\nShe's back! 
Champion @Joan_Rivers returns to the boardroom in this year's All Star @ApprenticeNBC. Joan is ferocious. _E_\nA clip from last night's @Late_Show where I detail my charitable offer to Obama and Dave describes his terrible grades __HTTP__ _E_\nIn making any decision you need all the facts. But after exhausting all due diligence in the end you have to go with your gut! _E_\nThank you @CarlHigbie. Great work on @CNN. #Trump2016 _E_\nPresident Obama and other world leaders don't know how close they were to being seriously injured (or worse) standing next to psycho in SA. _E_\nSorry folks but Donald Trump is far richer and much better looking than dopey @mcuban! _E_\n#ICYMI: Joint Statement with Prime Minister Shinzo Abe on North Korea. __HTTP__ _E_\nRemember but for Conservatives Bush would have given us not only Roberts but also Harriet Miers. Face it Bush was terrible! _E_\nIt would have been much easier for me to win the so called popular vote than the Electoral College in that I would only campaign in 3 or 4 _E_\n\"@DonaldJTrumpJr: 'We want to build everything in Dubai\" __HTTP__ via @CWO_dotcom _E_\nThank you for your support at this mornings Town Hall in Salem New Hampshire. #FITN #NHPrimary __HTTP__ _E_\nThe Spa @TrumpWaikiki offers unique treatments that use traditional Hawaiian botanicals & healing techniques __HTTP__ _E_\nMy @SquawkCNBC interview discussing interest rates the deficit @RepPaulRyan's timing @TimTebow and the Doral __HTTP__ _E_\nThat's what I find so morally offensive about welfare dependency: it robs people of the chance to improve. Work (cont) __HTTP__ _E_\nTIME #DebateNight poll over 800000 votes. Thank you! #AmericaFirst #MAGA __HTTP__ _E_\nRT @AnnCoulter: RUMSFELD: Trump has a touched a nerve in our country...in a way that most politicians have not been able to do. __HTTP__ _E_\nJohn @CahillForAG is one of the most respected people in politics. Dopey @AGSchneiderman is one of the least respected! _E_\nGreat op ed from @RepKenBuck. 
Looks like some in the Freedom Caucus are helping me end #Obamacare. __HTTP__ _E_\nLots of great new polls big leads! __HTTP__ __HTTP__ __HTTP__ _E_\nObama our Welfare & Food Stamp President is praising himself for expanding welfare __HTTP__ He doesn't believe in work. _E_\nI am very supportive of the Senate #HealthcareBill. Look forward to making it really special! Remember ObamaCare is dead. _E_\nRT @Realjmannarino: @realDonaldTrump The ungratefulness is something I've never seen before. If you get someone's son out of prison he sho... _E_\nEntrepreneurs: Set the example. You can motivate others as well as yourself by remembering you are setting the example. _E_\nFor all of those that were hoping I was wrong and this is a very unimportant subject to me Dwight Howard just officially announced Houston _E_\nThe only place success comes before work is in the dictionary. Vince Lombardi _E_\nPeople must remember that ObamaCare just doesn't work and it is not affordable 116% increases (Arizona). Bill Clinton called it CRAZY _E_\n.@CharlesGKoch is looking for a new puppet after Governor Walker and Jeb Bush cratered. He now likes Rubio next fail. _E_\nLooking forward to speaking at the @ARGOP Reagan Rockefeller Dinner tonight! Record crowd. We are no longer silent! #MAKEAMERICAGREATAGAIN! _E_\nRT @hughhewitt: #NeverTrumpers elite MSMers and virtue signalers are persuading themselves that @realDonaldTrump supporters are deserting.... _E_\n\"Learn work and think in equal proportions and you'll be going in the right direction.\" – Think Like a Champion _E_\nDo executives at @msnbc know that the business of TV centers on viewers & ratings? @msnbc is #19 on cable __HTTP__ Sad. _E_\nI had a great time today visiting Facebook NY. __HTTP__ _E_\nIt's Monday how many more excuses will Obama make today about the economy? _E_\n.@SouthJerseyMag \"According to the Pros\" just named Trump National Golf Club Philadelphia the #1 private club. Thanks! 
_E_\nI'm really saddened to see that @Cher was voted \"the 4th ugliest celebrity\" according to @listverse.... _E_\nWe are leaving Iraq after expending a tremendous amount of blood and treasure. We should be reimbursed with oil! Don't give it to Iran. _E_\nWatch Late Night with Jimmy Fallon on NBC at 12:35 EST tonight I'll be bringing a couple of surprises with me. _E_\nBusiness is no place for stream of consciousness babbling. Keep it short fast and direct. Think Like a Champion _E_\nNY State Republican Party must unify or November will be another disaster. _E_\n.@MichaelRCaputo Thank you for all of your support you have been amazing! _E_\nPeople who lost money when the Stock Market went down 350 points based on the False and Dishonest reporting of Brian Ross of @ABC News (he has been suspended) should consider hiring a lawyer and suing ABC for the damages this bad reporting has caused many millions of dollars! _E_\nAfter settling for a ridicilous 13 billion dollars J.P.Morgan's lawyer is critical of the amount of the fine why did they settle then DUMB! _E_\nTerrific response to my previous tweet: I'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. __HTTP__ ... _E_\nHypocrite! in '06 @BarackObama called private equity the best opportunity for long term economic vitality __HTTP__ _E_\nEmmys telecast is way down & lowest telecast on record among young adults. Emmys have no credibility Should have nominated Apprentice again! _E_\n#CelebrityApprentice ranked #1 among ABC CBS and NBC in all key demos from 10 11PM. It won the 10PM hour by a 53% margin in 18 49 rating. _E_\nIf crazy @megynkelly didn't cover me so much on her terrible show her ratings would totally tank. She is so average in so many ways! _E_\nWest Virginia was incredible last night. Crowds and enthusiasm were beyond GDP at 3% wow!Dem Governor became a Republican last night. 
_E_\n: @realDonaldTrump @HelpUServe When we have people eating out of trash cans in this country we have no business helping any other country _E_\nI will be going to Sarasota Florida today for a big rally with amazing people! I have one goal on mind: MAKE AMERICA GREAT AGAIN! _E_\nMade a speech in Arkansas last night before a record GOP crowd. Great spirit and amazing people. MAKE AMERICA GREAT AGAIN! _E_\nJeffrey Lord @AmSpec—Thank you for the presentation—terrific job! _E_\nI am making a major speech in West Palm Beach Florida at noon. Tune in! _E_\nWe just had the worst jobs report since 2010. _E_\nThe ObamaCare enrollment numbers are a lie.They will be 'readjusted' by the White House at an opportune time probably after '14 election _E_\nWe are one nation. When one state hurts we all hurt. We must all work together to lift each other up. __HTTP__ _E_\nHAPPY BIRTHDAY to my son @DonaldJTrumpJr! Very proud of you! #TBT __HTTP__ __HTTP__ _E_\nThere is only one way to avoid criticism: do nothing say nothing and be nothing. – Aristotle _E_\nWin a dinner with @MittRomney and me in New York this June 28th. It's selling like hotcakes! __HTTP__ _E_\nBob Corker who helped President O give us the bad Iran Deal & couldn't get elected dog catcher in Tennessee is now fighting Tax Cuts.... _E_\nLeaving South Korea now heading to China. Looking very much forward to meeting and being with President Xi! _E_\nMitt Romney had his chance and blew it. Lindsey Graham ran for president got ZERO and quit! Why are they now spokesmen against me? Sad! _E_\nCrooked Hillary Clinton knew everything that her servant was doing at the DNC they just got caught that's all! They laughed at Bernie. _E_\nAs families prepare for summer vacations in our National Parks Democrats threaten to close them and shut down the government. Terrible! _E_\n.CNN & @CNNPolitics Lawyer Elizabeth Beck did a terrible job against me she lost (I even got legal fees). 
I loved beating hershe was easy _E_\nRemember when Jeb gave Hillary a medal on the 1 year anniversary of Benghazi?! __HTTP__ Guess he would have invaded Libya too! _E_\nCongratulations to the House of Representatives for passing the #TaxCutsandJobsAct — a big step toward fulfilling our promise to deliver historic TAX CUTS for the American people by the end of the year! __HTTP__ _E_\nThe Mexican legal system is corrupt as is much of Mexico. Pay me the money that is owed me now and stop sending criminals over our border _E_\nIf Michael Bloomberg ran again for Mayor of New York he wouldn't get 10% of the vote they would run him out of town! #NeverHillary _E_\nChina is building 50 brand new airports while our country continues to rott! Very sad. _E_\nCongratulations to @KingJames on winning Athlete of the Year in last night's @ESPYS. LeBron is also a great guy! _E_\nI'm thrilled to announce that my new tailored clothing line has officially launched at Macy's. In business it'... (cont) __HTTP__ _E_\nI will be at the Cadillac World Golf Championship @TrumpDoral in Miami tomorrow! Rory Phil Bubba Adam and Dustin all at the top! _E_\nIf ObamaCare is hurting people & it is why shouldn't it hurt the insurance companies & why should Congress not be paying what public pays? _E_\nThank you! #Trump2016 __HTTP__ _E_\nRT @DeptofDefense: VIDEO: Elements of the #DoD and @FEMA are providing humanitarian relief for #PuertoRico and #USVI 🇻 . __HTTP__ _E_\nPlacing the ball in the right position for the next shot is eighty percent of winning golf. Ben Hogan _E_\nMy @CNBCClosingBell interview discussing America's financial uncertainty due to @BarackObama and the job report __HTTP__ _E_\nPolicy towards our enemies: Hit them hard hit them fast hit them often & then tell them it was because they are the enemy! _E_\nYou must promise that you will never cheat off Manti Te'o's test papers. _E_\n...goodwill and friendship was formed but only time will tell on trade. 
_E_\nThe FAKE NEWS media (failing @nytimes @NBCNews @ABC @CBS @CNN) is not my enemy it is the enemy of the American People! _E_\nObamaCare is imploding. It is a disaster and 2017 will be the worst year yet by far! Republicans will come together and save the day. _E_\nTwitter is on @BarackObama's enemies list __HTTP__ _E_\nWhile Putin is scheming and beaming on how to take over the World President Obama is watching March Madness (basketball)! _E_\nRT @Team_Trump45: @realDonaldTrump __HTTP__ _E_\nWow big lines in Kansas. _E_\nThe Republicans should NOT give @BarackObama the authority to raise the debt another $1.2Trillion (cont) __HTTP__ _E_\n.@lancearmstrong really blew it went down in flames too bad! _E_\nGood advice from my mother: Trust in God and be true to yourself. Mary Trump _E_\nIt was a great honor to welcome Prime Minister Najib Abdul Razak of Malaysia and his distinguished delegation to the @WhiteHouse today! __HTTP__ _E_\nWhy aren't the same standards placed on the Democrats. Look what Hillary Clinton may have gotten away with. Disgraceful! _E_\n.@CNN should stop apologizing for the mistake they made the other day & get back to reporting! _E_\nMy @FoxNews interview with @gretawire discussing how @BarackObama is delusional and how a 3rd party candidate can win. __HTTP__ _E_\nBeing at the Army Navy Game was fantastic. There is nothing like the spirit in that stadium. A wonderful experience and congrats to Army! _E_\nJust did Howard Stern Show great time. Now doing The Today Show with Ivanka. ENJOY! _E_\nWhat are Hillary Clinton's people complaining about with respect to the F.B.I. Based on the information they had she should never..... _E_\nWhich National Costume do you think should win? __HTTP__ _E_\nCongratulations to our Olympic team for by far winning the most medals including first place gold. _E_\nIt's freezing and snowing in New York we need global warming! _E_\nPolls close at 6pm! 
#INPrimary #Trump2016 #VoteTrump __HTTP__ _E_\n.@foxandfriends in five minutes! _E_\nAfter consultation with my Generals and military experts please be advised that the United States Government will not accept or allow...... _E_\n.@FLOTUS & I were honored to host our first WH Congressional Picnic. A wonderful evening & tradition. @MarineBand:... __HTTP__ _E_\nThe Trump Signature Collection exclusively available at @Macys is the pinnacle of style and prestige __HTTP__ _E_\nPathetic @BarackObama is 'sweetening' his offer to the Taliban __HTTP__ Read 'The Art of The Deal.' _E_\nGreat going @themichellewie –you showed the world that all of that amazing talent is for real. We love you at Trump Jupiter @TNGCJ _E_\n.@DonaldJTrumpJr's @CNBC interview discussing the starving demand that is fueling high end luxury __HTTP__ _E_\nThe safest way to preserve Medicare is with a robust and vibrant economy. We should lower corporate and capital gain taxes immediately. _E_\nClinton is trying to wash away her bad judgement call on BREXIT with big dollar ads. Disgraceful! _E_\nI had a great time doing press interviews with @LisaLampanelli and @Teresa_Giudice earlier today __HTTP__ _E_\n70 years ago today the National Security Council met for the first time. Great history of advising Presidents then & now! Thanks NSC Staff! _E_\nCongratulations to the 2016 #StanleyCup Champions Pittsburgh @Penguins! _E_\nI will be in beautiful Burlington Vermont tonight for a rally. Will be great fun. MAKE AMERICA GREAT AGAIN! _E_\nCanada will now sell its oil to China because @BarackObama rejected Keystone. At least China knows a good deal when they see it. _E_\nVery good speech by @MichelleObama and under great pressure Dems should be proud! _E_\nWhy was @BarackObama selling guns to Mexican drug dealers? _E_\nTrump Turnberry is a spectacular place and home to four of the greatest Open Championships of all time. __HTTP__ _E_\nJoin us via our new #AmericaFirst APP! 
#TrumpPence16 __HTTP__ __HTTP__ _E_\nThank you Pennsylvania! Going to New Hampshire now and on to Michigan. Watch PA rally here: __HTTP__ __HTTP__ _E_\nPaul Ryan said that I inherited something very special the Republican Party. Wrong I didn't inherit it I won it with millions of voters! _E_\nCongratulations to @Yankees Derek Jeter on being named to 2014 @MLB @AllStarGame! _E_\nI don't like the opening even a little bit! _E_\nToday is Donald Trump's Birthday! Send him your B'day wishes here: __HTTP__ _E_\nDo your homework. Wasting other people's time due to poor planning or thoughtlessness leaves a bad impression. – Think Like a Billionaire _E_\nCongress now has 6 months to legalize DACA (something the Obama Administration was unable to do). If they can't I will revisit this issue! _E_\nWill be interviewed on @foxandfriends at 8:30 A.M. Eastern. ENJOY! _E_\nWow one of the all time greats in fashion OSCAR DE LA RENTA has just died at 82. Great fashion achievements but also a really nice guy! _E_\nVia @dcexaminer by @rebeccagberg: \"Trump: 'I'm the only one who can beat' Hillary\" __HTTP__ _E_\nLast week's episode of the Celebrity Apprentice set the stage for a great new season. Tune in this Sunday on NBC for even more excitement. _E_\nI have over seven million hits on social media re Crooked Hillary Clinton. Check it out Sleepy Eyes @MarkHalperin @NBCPolitics _E_\nObama has no problem leaking national security secrets. Why can't he release his records? Especially when $5M is going to charity. _E_\nVia @reason: Donald Trump: I Can Fix America __HTTP__ _E_\n....we need to keep America safe including moving away from a random chain migration and lottery system to one that is merit based. __HTTP__ _E_\nTo be really successful it is always good to have A COOL HEAD WARM HEART AND BEAUTIFUL COMMON TOUCH! 
_E_\nMy @FoxNews interview on @gretawire discussing The China Curse __HTTP__ _E_\nYes Arnold Schwarzenegger did a really bad job as Governor of California and even worse on the Apprentice...but at least he tried hard! _E_\nJoin me in Indianapolis Indiana tomorrow at 3pm! #Trump2016#MakeAmericaGreatAgainTickets: __HTTP__ __HTTP__ _E_\nRemember official campaign merchandise (hats apparel etc.) can only be bought at __HTTP__ Be careful don't get ripped off _E_\nMarco Rubio would keep Barack Obama's executive order on amnesty intact. See article. Cannot be President. __HTTP__ _E_\nBig game trophy decision will be announced next week but will be very hard pressed to change my mind that this horror show in any way helps conservation of Elephants or any other animal. _E_\n#TrumpVlog @Rosie needs to rest and relax. It's not working. __HTTP__ _E_\nHappy Halloween! __HTTP__ _E_\nBy failing to prepare you are preparing to fail. Benjamin Franklin _E_\nFirst there was the Declaration of Independence then there was the Constitution. Now there is #TimeToGetTough. Available today. _E_\nWho would have thought that an @ApprenticeNBC champion would return to compete? @bretmichaels returns to All Star @CelebApprentice... _E_\nMy new radio ad airing today in Wisconsin! See you soon!#WIPrimary #Trump2016 __HTTP__ _E_\nCome on goAngelo don't give up now just because your rally at Macy's drew only eleven people for twenty minutes! I love@ Macy's. _E_\nSolyndra's government loan and subsequent bankruptcy prove that @BarackObama is both corrupt and inept. _E_\nTrump Int'l Golf Links Ireland in County Clare fronts the Atlantic Ocean & is #1 Resort in Europe/Conde Nast Traveler __HTTP__ _E_\nThe Democrats have been told and fully understand that there can be no DACA without the desperately needed WALL at the Southern Border and an END to the horrible Chain Migration & ridiculous Lottery System of Immigration etc. We must protect our Country at all cost! 
_E_\nI've sent a 10 wheeler filled with 358 master cases of food and supplies to my hometown of Queens today #TrumpCares _E_\n.@RalphGilles of Chrysler should focus on design rather than filthy language not very professional. _E_\nYoung Entrepeneurs: Think Big Stay Motivated & Always Remain Confident. The Sky is the Limit. _E_\nHouse GOP better get its act together.Defund ObamaCare. Out negotiate on debt ceiling. Form commissions on Benghazi & IRS. No excuses! _E_\nIt's freezing in New York—where the hell is global warming? _E_\nCongratulations to @sethmeyers on \"Emmy's Rating Tumble\" __HTTP__ Just as I predicted Seth bombed! . _E_\nIs this true about Univision and Fusion? Wow!?! __HTTP__ _E_\nVia @thehill by @timdevaney: Donald Trump: GOP nominee 'can't be Mitt can't be Bush' __HTTP__ _E_\nMy @Live5News int. with @WilliamLive5 in South Carolina with @citadelgop cadets on my 757 discussing 2016 __HTTP__ _E_\nToday the House votes on two crucial bills:#NoSanctuaryForCriminalsAct #KatesLaw Pass these bills & lets... __HTTP__ _E_\nHad dinner with @RickPerry last night great guy straight shooter impressive record. _E_\nIs it the Neil Patrick Harris show or the Emmy Awards?How was he ever put in this position to start with? CRAZY! _E_\nEverytime someone tweets that I wear a wig realize to yourself that you are dealing with them just another sad & lonely hater and loser! _E_\nDo you notice that Hillary spews out Jeb's name as often as possible in order to give him status? She knows Trump is her worst nightmare. _E_\nVoting for @GovGaryJohnson is voting for Obama don't waste your vote! _E_\nOur country must get very strong and very tough and fast before it is too late. We have zero leadership and never WIN! We want victory. _E_\n\"How much money can you stand to lose? That's how much risk you should assume.\" – Think Like a Billionaire _E_\n.@BarackObama wants to see 10 yrs of @MittRomney's tax returns tell him ok but we want to see your college applications first.' 
_E_\nI hope @TGowdySC does better for Rubio than he did at the #Benghazi hearings which were a total disaster for Republicans & America! _E_\nCanada's legal immigration plan starts with a simple and smart question: How will any immigrant applying fo... (cont) __HTTP__ _E_\nRemember victims of Hurricane Sandy during Thanksgiving. Many will not be celebrating the holiday in comfort.Their lives are in turmoil! _E_\nThank you Northern Mariana Islands!#SuperTuesday #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nJoin us Monday February 8th @ the Verizon Wireless Arena in Manchester New Hampshire! #FITN #NHPolitics #Trump2016 __HTTP__ _E_\nVia @fitsnews: \"Donald Trump: John McCain Is 'A Loser'\" __HTTP__ _E_\nSigning orders to move forward with the construction of the Keystone XL and Dakota Access pipelines in the Oval Off... __HTTP__ _E_\nThank you Indiana! Will be back soon!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nIf I am elected President I will immediately approve the Keystone XL pipeline. No impact on environment & lots of jobs for U.S. _E_\nWatch What's America Worth? hosted and narrated by me this Sunday at 9PM on @Discovery __HTTP__ _E_\nThink. That's the first step. Use all your power to utilize and develop that capability Donald J. Trump __HTTP__ _E_\nCrooked Hillary no longer has credibility too much failure in office. People will not allow another four years of incompetence! _E_\nRT @SLandinSoCal: @foxandfriends @realDonaldTrump Nothing can stop the #TrumpTrain __HTTP__ _E_\nGoofy political pundit George Will spoke at Mar a Lago years ago. I didn't attend because he's boring & often wrong—a total dope! _E_\nMitt Romney had his chance to beat a failed president but he choked like a dog. Now he calls me racist but I am least racist person there is _E_\nCoincidence? More than half of @BarackObama's 47 biggest fundraisers have been given administration jobs. 
__HTTP__ _E_\nThe Celebrity Apprentice delivers the goods and the puppets Sunday at 9 pm on NBC __HTTP__ _E_\n\"Trump Gives 'Em Hell\" __HTTP__ via @limbaugh _E_\nThe dummies left Iraq (and Libya) without the oil! _E_\nDid you know Donald Trump is on Facebook? __HTTP__ Become a fan today! _E_\nWe should immediately stop sending our beautiful American tax dollars to countries that hate us and laugh at our President's stupidity! _E_\nAlways remember that as your success grows you will be asked for more favors. Learn how to say 'No.' It is critical. _E_\nDeath spiral!'Aetna will exit Obamacare markets in VA in 2018 citing expected losses on INDV plans this year' __HTTP__ _E_\n84% of US troops wounded & 70% of our brave men & women killed in Afghanistan have all come under Obama. Time to get out of there. _E_\nWow just saw an ad Cruz is lying on so many levels. There is nobody more against ObamaCare than me will repeal & replace. He lies! _E_\n\"You have to have confidence in yourself and confidence to know that what you are doing is right.\" – Think Big _E_\nI will not let you down! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\n\"If you want the best you'd better be the best – in all aspects of business.\" – Think Like a Billionaire _E_\nObama still will keep all military recruitment centers & bases Gun Free Zones! It has to stop. MILITARY LIVES MATTER! _E_\nHillary Clinton is a major national security risk. Not presidential material! _E_\nWell this is it the final debate let's see how it goes. I'll be tweeting live. _E_\nRT @EricTrump: #MakeAmericaGreatAgain!!! __HTTP__ _E_\nCheck out Serta's Counting Sheep (and me) at the Trump International Hotel New York __HTTP__ _E_\nMy thoughts and prayers are with the victims and families of those affected by two powerful earthquakes in Italy and Myanmar. _E_\nSen. 
@DavidVitter & @David_Bossie w/@seanhannity __HTTP__ demand 'Congress Live By Your Laws' __HTTP__ _E_\nToday's assignment: read Chapter 7 'Trump Tower: The Tiffany Location' of The Art of the Deal. Focus on how I marketed the property. _E_\nObama lied when he said \"you can keep your plan\" so why would anyone believe his bogus ObamaCare enrollment numbers?! _E_\nBy popular request I will also be tweeting live during the Vice Presidential debate Thursday night. It will be very interesting I promise. _E_\nThe United States is considering in addition to other options stopping all trade with any country doing business with North Korea. _E_\nOf the 9 battleground states we only carried North Carolina. I'm proud of @NCGOP & glad I delivered keynote at their state convention. _E_\nYankees should have dropped A Rod long ago not even bothered with arbitration. They would have saved a fortune! _E_\nGoing to D.C. for big groundbreaking on Old Post Office site. Will be spectacular new hotel. Lots of jobs! _E_\nDisproven and paid for by Democrats \"Dossier used to spy on Trump Campaign. Did FBI use Intel tool to influence the Election?\" @foxandfriends Did Dems or Clinton also pay Russians? Where are hidden and smashed DNC servers? Where are Crooked Hillary Emails? What a mess! _E_\nI like Mexico and love the spirit of Mexican people but we must protect our borders from people from all over pouring into the U.S. _E_\nOn @FallonTonight with @jimmyfallon at 11:30 PM. Enjoy! _E_\nFather's Day is Sunday. Find the perfect gift.Trump Signature Collection is exclusively available @Macys __HTTP__ _E_\nI'm sending lots of bottled water out to Staten Island & Long Island. _E_\nLittle Marco Rubio the lightweight no show Senator from Florida is set to be the puppet of the special interest Koch brothers. WATCH! _E_\nAttorney General Bill Schuette will be a fantastic Governor for the great State of Michigan. 
I am bringing back your jobs and Bill will help _E_\nI'm turning down millions of dollars of campaign contributions—feel totally stupid doing so but hope it is appreciated by the voters. _E_\nTHANK YOU IOWA! Highly respected @OANN @GravisMarketing poll just released. #VoteTrump #IowaCaucus __HTTP__ _E_\nThe Trump Hotel Collection is currently nominated for Conde Nast Traveler Readers Choice Awards Travel & Leisure and World Travel Awards. _E_\nGetting ready to land in Hawaii. Looking so much forward to meeting with our great Military/Veterans at Pearl Harbor! _E_\nI spent Friday campaigning with John Kennedy of the Great State of Louisiana for the U.S.Senate. The election is over JOHN WON! _E_\nThe great workers who just completed the skylight at Trump International Hotel D.C. (Old Post Office) __HTTP__ _E_\nObama should work on a ceasefire in Chicago as well as Gaza. _E_\nCanadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_\nWhen I think big which is often you can be sure I'm aware of the enormous amount of little things that we will have to account for. _E_\nI'm saying that the Tea Party perhaps by another name will soon have another big moment and will be a major factor in victory! _E_\nThe fastest way we can start saving Social Security is to get Americans back to work. #TimeToGetTough (cont) __HTTP__ _E_\nTo Jamie Dimon—I love kicking lightweight @AGSchneiderman's ass. Stop settling and fight! _E_\n\"There is no worse feeling than being trapped in a job you do not enjoy. You have to love what you do.\" Think Big _E_\nMy @SquawkCNBC #TrumpTuesday interview discussing how @MittRomney can win the first debate & the last 35 days __HTTP__ _E_\nThank you @FaithandFreedom Coalition! 
An honor joining you today to discuss our shared values.#RTM2016 #Trump2016 __HTTP__ _E_\n\"Partner with people who share your values attitude and drive.\" – Midas Touch with @theRealKiyosaki _E_\n\"House votes on controversial FISA ACT today.\" This is the act that may have been used with the help of the discredited and phony Dossier to so badly surveil and abuse the Trump Campaign by the previous administration and others? _E_\nInterview with @LouDobbs coming up at 7pmE on @FoxBusiness. Enjoy! __HTTP__ _E_\nRT @foxandfriends: Head of the NYPD union slams Mayor de Blasio for skipping vigil for assassinated cop Miosotis Familia __HTTP__ _E_\n...So far he has been a complete failure at doing so. He should read The Art of the Deal and use his energy to focus on a new career. _E_\n.@GovernorPerry in my office last cycle playing nice and begging for my support and money. Hypocrite! __HTTP__ _E_\nIt's not enough that we do our best sometimes we have to do what's required. Winston Churchill _E_\nCrude is at $85 right now – isn't even worth half that. OPEC is ripping us off. _E_\nCrooked Hillary Clinton lied to the FBI and to the people of our country. She is sooooo guilty. But watch her time will come! _E_\nI've realized that success requires 100% effort and 100% focus. Nothing less. Get out there and go for it. _E_\nOnly very stupid people think that the United States is making good trade deals with Mexico.Mexico is killing us at the border and at trade! _E_\nCheck out my most recent interview with CNN... __HTTP__ _E_\nCoach W to his basketball players BE QUICK BUT DON'T HURRY! _E_\nThe Democrats are all talk and no action. They are doing nothing to fix DACA. Great opportunity missed. Too bad! _E_\nStop flights into the U.S. from West Africa immediately! _E_\nTom Ridge should be focused on trying to bring the party together rather than ripping it apart w/ your faulty thought process. I will win! 
_E_\nThe Miss Universe Pageant will be broadcast live from MOSCOW RUSSIA on November 9th. A big deal that will bring our countries together! _E_\nTwo dozen NFL players continue to kneel during the National Anthem showing total disrespect to our Flag & Country. No leadership in NFL! _E_\nGreat honor to receive today's endorsement of @RickSantorum. Really nice! #Trump2016 _E_\n.@DanaPerino wrote a wonderful book \"And the Good News is.. Dana has a fabulous perspective on life & politics—go get it! _E_\nThe upcoming All Star season of @CelebApprentice has @lisarinna returning to compete. She doesn't disappoint! _E_\nIf we could force Russia China and other competitors to use ObamaCare we would be able to instantly destroy their great economic success! _E_\nHeading to Myrtle Beach South Carolina. Really big crowd—so much to talk about! _E_\nAgreed @piersmorgan says he and @OMAROSA have a \"communication malfunction.\" #CelebApprentice _E_\nOur trade deficit is still on pace to be over $500B. This is killing our manufacturing sector and sending jobs overseas. _E_\nEbola's spread is 'unprecedented' says CDC chief __HTTP__ _E_\nCentral Park's top locale @TrumpRink is open throughout the holidays. Our Skating School is excellent & acclaimed __HTTP__ _E_\nAlmost every television network wants me badly—but I stay loyal to @NBC. _E_\nThank you @CharlesHurt for the nice words on @seanhannity. I will win and Make America Great Again! _E_\nThe real war on women over 175000 fewer held jobs in July & 94000 dropped out of labor force __HTTP__ We must do better. _E_\nThank you @JerryJrFalwell! __HTTP__ _E_\nLance Armstrong fought for 7 years & then just ran out of energy. Very sad story although they caught him red handed.He definitely cheated! _E_\nIn one of the biggest stories in a long time the FBI now says it is missing five months worth of lovers Strzok Page texts perhaps 50000 and all in prime time. Wow! _E_\nThe failing @nytimes wrote yet another hit piece on me. 
All are impressed with how nicely I have treated women they found nothing. A joke! _E_\nNever quit and always hit back The Art of the Comeback _E_\nRemember I said Derek don't sell your Trump World Tower apartment...its been lucky for you. The day after he sold it he broke his foot. _E_\nSigning a recent tax return isn't this ridiculous? __HTTP__ _E_\nThe electoral college is a disaster for a democracy. _E_\nI think it was terrible that Tim Cook of Apple apologized to China. What the hell is he apologizing for? Steve Jobs wouldn't. _E_\nThe 48000 sq. ft. Spa @TrumpDoral boasts 33 treatment rooms and over 100 signature spa services and treatments __HTTP__ _E_\nDuring the campaign I promised to MAKE AMERICA GREAT AGAIN by bringing businesses and jobs back to our country. I am very proud to see companies like Chrysler moving operations from Mexico to Michigan where there are so many great American workers! __HTTP__ _E_\nThank you Farmington New Hampshire! #FITN #Trump2016 __HTTP__ _E_\nWhere are the 50000 important text messages between FBI lovers Lisa Page and Peter Strzok? Blaming Samsung! _E_\nFAKE NEWS media knowingly doesn't tell the truth. A great danger to our country. The failing @nytimes has become a joke. Likewise @CNN. Sad! _E_\nTo be a visionary you have to chase impossibilities. Few ever get rich easily. Think Like a Billionaire _E_\nI will be doing @foxandfriends this morning at 8 (not 7). _E_\nI want to see people make lots of $$ and live better lives. I really think they can do that through TheTrumpNetwork __HTTP__ _E_\nToday we remember the crew of the Space Shuttle Challenger 31 years later. #NeverForget __HTTP__ _E_\nThe cast for next season looks really good! _E_\nAll recent Presidents have released their transcripts. What is @BarackObama hiding? _E_\nCongratulations to @SenScottBrown on running an aggressive & fair campaign. Vote for Scott today New Hampshire! _E_\nRemember new environment friendly lightbulbs can cause cancer. 
Be careful the idiots who came up with this stuff don't care. _E_\nTrump Tuesday: I'll be on @SquawkCNBC tomorrow morning at 7:30 AM. Be sure to tune in. _E_\n....getting great border security and healthcare. #VoteRalphNorman tomorrow! _E_\nThe new line of Trump ties shirts and cufflinks are out at Macy's and are really beautiful at a really reasonable.price. Go check them out! _E_\nI will be going to Aberdeen Scotland today to help my team celebrate the great success of Trump International Golf Links press conference. _E_\n#TrumpVlog Obama stop chewing gum! __HTTP__ _E_\nIran looks like it is toying with John Kerry on nuclear talks he is begging for a deal to save face. Negotiation is just not his thing! _E_\nAttending Chief Ryan Owens' Dignified Transfer yesterday with my daughter Ivanka was my great honor. To a great and brave man thank you! _E_\nTraitor Snowden has requested asylum in Russia. Why would Russia grant it? Snowden already gave them all the intel he stole! _E_\nWhile I hear the Koch brothers are in big financial trouble (oil) word is they have chosen little Marco Rubio the lightweight from Florida _E_\nObama was beaten but not knocked out. He lives to fight another day. But in the real world presidents are not given a second chance... _E_\n.@oreillyfactor called me a master marketeer last night I am not. I am a great builder I build great things & people come. _E_\nRT @FLOTUS: I had a wonderful time with the students at the American International School #Riyadh today. #SaudiaArabia __HTTP__ _E_\n\"Exclusive: Donald Trump wants to build a luxury hotel in Dubai\" __HTTP__ via @itp_ab by @ctrenwith _E_\nPolitician @SenatorCardin didn't like that I said Baltimore needs jobs & spirit. It's politicians like Cardin that have destroyed Baltimore. _E_\nWorkers of firm involved with the discredited and Fake Dossier take the 5th. Who paid for it Russia the FBI or the Dems (or all)? _E_\nDiscovery breeds discovery as in success breeds success. 
Questions are thoughts with a quest. Think Like a Champion _E_\n#TrumpVlog Hagel quits __HTTP__ _E_\nWhile I was in Moscow I see that President Obsma apologized for his lie I mean statement on ObamaCare! How nice of him to be so forthright _E_\nSo @BarackObama is celebrating his 'birthday' with a fundraiser in his home he bought with the help of Rezko __HTTP__ _E_\nGreat to see @SarahPalinUSA back on @FoxNews. She's a wonderful woman and commentator. _E_\nHow much money are the lawyers for the Central Park Five getting out of the 40 million dollars or are they paid by the City (or both)? _E_\nRT @DanScavino: .@NikkiHaley in 2012 w/ Romney on tax returns🤔(political ploy.) Fast forward..2016 w/ Robot Rubio🤖#FAIL👎#Politician __HTTP__ _E_\nRT @JacobAWohl: @realDonaldTrump The #MAGA great again movement is WINNING and the left wing media can't stand it! _E_\nWe need economic growth and jobs not blue ribbon panels to study the problem. _E_\nLooks like @OMAROSA is up to the challenge. #CelebApprentice _E_\nFlorida Power & Light did a fantastic job of providing service & energy during the big storm in Palm Beach. @insideFPL _E_\nEverybody is asking why the Justice Department (and FBI) isn't looking into all of the dishonesty going on with Crooked Hillary & the Dems.. _E_\nThank you @SenatorDole very kind! __HTTP__ _E_\nJust in—all efforts to stop sexual abuse in the military have totally failed—in fact the stoppers have become the abusers. _E_\nDopey Sugar @Lord_Sugar I'm worth more than $8 billion acknowledged almost no debt ... _E_\nRT @CLewandowski_: The Scrum: Video Emerges to Suggest WaPo Reporter Ben Terris Misidentifies Lewandowski in Fields Incident Breitbart __HTTP__ _E_\nI can't believe @VanityFair would renew Graydon Carter's contract...... _E_\n\"Trump: 'Very much inclined' to enter GOP White House race\" __HTTP__ via @McClatchyDC by @LightmanDavid _E_\nA great day in New Hampshire and Maine. Fantastic crowds and energy! 
#MAGA _E_\nFBI Deputy Director Andrew McCabe is racing the clock to retire with full benefits. 90 days to go?!!! _E_\nGreat news @BarbaraJWalters has fully recovered and will be back on @theviewtv this coming Monday. Barbara is wonderful! _E_\nWill be having many meetings this weekend at The Southern White House. Big 5:00 P.M. speech in Melbourne Florida. A lot to talk about! _E_\n.@_Just_Mads_ #asktrump __HTTP__ _E_\nTerrible attacks in NY NJ and MN this weekend. Thinking of victims their families and all Americans! We need to be strong! _E_\n#trumpvlog @BarackObama's dismal record in today's video blog.... __HTTP__ _E_\nGov Mike Pence has just stated that Donald Trump has taken a strong stance on Hoosier jobs and he thanks me! I will bring back jobs to USA. _E_\nHad a great time with @MittRomney last night. He is focused and ready for the battle ahead. Lots of money was raised. _E_\nLightweight @AGSchneiderman is fighting with @NYGovCuomo –Cuomo wins that one easily. Schneiderman is a total loser. _E_\nI was referring to the fact that Jeb Bush wants to keep common core. _E_\nThe countdown is on. The 13th season of All Star @ApprenticeNBC premieres this Sunday March 3rd at 9PM EST on @nbc. Big! _E_\nI'll be speaking at the first ever National Achievers Congress at the San Jose Convention Center (San Jose CA) (cont) __HTTP__ _E_\nVery little reporting about the GREAT GDP numbers announced yesterday (3.0 despite the big hurricane hits). Best consecutive Q's in years! _E_\nEven Usain Bolt from Jamaica one of the greatest runners and athletes of all time showed RESPECT for our National Anthem! 🇲 __HTTP__ _E_\nBillions of dollars in investments & thousands of new jobs in America! An initiative via Corning Merck & Pfizer: __HTTP__ __HTTP__ _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nI will bring our jobs back to the U.S. and keep our companies from leaving. Nobody else can do it. Our economy will sing again. 
_E_\nHappy belated birthday wishes to @BarbaraJWalters. Barbara is terrific! _E_\nGov. John Kasich has really failed on the campaign trail. I thought he would have been far more talented. He is just wasting time & money! _E_\n\"If you want to succeed you should strike out on new paths rather than travel worn paths of accepted success.\" John D. Rockefeller _E_\nVery interesting election currently taking place in France. _E_\nA great honor to host PM Paolo Gentiloni of Italy at the White House this afternoon! #ICYMI Joint Press Conference... __HTTP__ _E_\nAnnounced w/ @pgaofamerica that we will bring @seniorpgachamp to @TrumpGolfDC & @pgachampionship to Trump Bedminster _E_\nAttention to detail is critical choose scents that exude sophistication & confidence. Find out more 4/18 5:30 pm @Macys Herald Square. _E_\nGreat event last night @trumpwinery with @GovernorVA to support @TheVFoundation @UVA @VCU __HTTP__ _E_\nA beautiful article by @IvankaTrump on my newly opened golf course in NYC Trump Links Ferry Point __HTTP__ _E_\nI will be having lunch at the White House today with Republican Senators concerning healthcare. They MUST keep their promise to America! _E_\nAdversity is a fact of life. Be bigger than the problems be ready to fight for your rights & all will be well – Trump Never Give Up _E_\nIf you have any doubt that @BarackObama must be defeated see @DineshDSouza's '2016: Obama's America.' Amazing film! _E_\n.@MittRomney if Obama gets wise tonight just ask for his college records & transcripts he will quiet down quickly. _E_\nWhile I have never met @nytdavidbrooks of the NY Times I consider him one of the dumbest of all pundits he has no sense of the real world! _E_\nOur debt finances China's military. It's time to get tough – we hold all the cards. Let's Make America Great Again! __HTTP__ _E_\nThere's only only one person who has defunded Medicare. His name is @BarackObama. _E_\n....People are angry. 
At some point the Justice Department and the FBI must do what is right and proper. The American public deserves it! _E_\nVia @DailyCaller by @AlexPappas: \"Donald Trump To Blast Obama Trade Pact In Radio Ads: 'A Bad Bad Deal'\" __HTTP__ _E_\nWithout momentum there's a lack of energy that can lead the best of ideas to nowhere. Get your momentum going and keep it going. _E_\nComing up soon: The two hour premiere of The Apprentice. Next Thursday September 16th at 9 pm on NBC. __HTTP__ _E_\nWho handed Iraq over to Iran yesterday? @BarackObama. We have gotten nothing from the Iraqis we should have them pay us back with oil. _E_\nA @senatormcdaniel win is a victory for our country. Chris is a Constitutional Conservative who'll make a difference in Washington. _E_\nPresident Obama you have a big job to do. Go to Baltimore and bring both sides together. With proper leadership it can be done! Do it. _E_\nToday I announced an Air Traffic Control Initiative to take American air travel into the future finally!... __HTTP__ _E_\nBe sure to read my column in @cnni \"Europe is terrific place for investment\" __HTTP__ _E_\n...whether there are tapes or recordings of my conversations with James Comey but I did not make and do not have any such recordings. _E_\nEven @PiersMorgan is impressed by @THEGaryBusey. #CelebApprentice _E_\nJust received from @PeteRose_14. Thank you Pete! #VoteTrump on Tuesday Ohio! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\nA sad day for America with Snowden being granted asylum in Russia. Putin is laughing at Obama. _E_\nRT @Jim_Jordan: President Trump did the right thing by withdrawing us from Paris treaty it would hurt American companies and American wor... _E_\nAll successful people are high energy people who are passionate about what they do. Find a passion that energizes you. Think Big _E_\nTed Cruz said on @oreillyfactor that illegals sent out of country by my administration would come right back as citizens. Another lie crazy! 
_E_\nThe onus of the Chicago teachers' strike falls squarely on the teachers & their union. Inexcusable to leave children without school. _E_\nGet the big picture but be prepared for the picture to change. Be persistent and alert every single day. _E_\nGreat honor to be endorsed by popular & successful @gov_gilmore of VA. A state that I very much want to win THX Jim! __HTTP__ _E_\nLeaving Superior Wisconsin now. Thank you! #Trump2016 #WIPrimary __HTTP__ __HTTP__ _E_\nKasich has helped decimate the coal and steel industries in Ohio. I will bring them back! #MakeAmericaGreatAgain _E_\nRussia just said the unverified report paid for by political opponents is A COMPLETE AND TOTAL FABRICATION UTTER NONSENSE. Very unfair! _E_\nNewsmax is a great news org and and its pres debate in IA on 12 27 will be fair balanced and informative. @ralphreed _E_\nI'll be on @foxandfriends Monday at 7:30 AM. Tune in! _E_\n\"@Letterman to Donald Trump: 'Fire @geraldorivera'\" __HTTP__ via @Mediaite by @TheMattWilstein _E_\nA very good NBC/Wall Street Journal Poll was just released wherein I went up from last month and am in the lead. Nice! _E_\n11AM #MakeAmericaGreatAgain __HTTP__ _E_\nA true piece about the standing ovations I got yesterday __HTTP__ _E_\nAfter 13 seasons @ApprenticeNBC easily beat Shark Tank in ratings last year better demos as well. _E_\nMust watch for all Georgians @Perduesenate's new ad \"Secure Our Border __HTTP__ Michelle Nunn supports amnesty & ObamaCare _E_\nToday I spoke @LibertyU Convocation a great crowd... __HTTP__ _E_\nI attended @Aerosmith concert last night in Newark NJ. Doesn't get any better than that. @IamStevenT was fantastic great energy! _E_\nDope Frank Bruni said I called many people including Karl Rove losers true! I never called my friend @HowardStern a loser he's a winner! _E_\nIt's Wednesday. How many times will A Rod sue the @Yankees today? A Rod has no one to blame but himself for his predicament. 
_E_\nWatch @MissUSA Olivia Culpo crowned as @MissUniverse 2012 in the Trump #MissUniverse Pageant __HTTP__ _E_\nI will be interviewed on @foxandfriends at 6:00 A.M. Enjoy! _E_\nYou have to set higher and higher goals. You have to want more or you will start slipping backwards fast. Think Big _E_\nCongratulations to @FLGovScott on winning access to federal database __HTTP__ He is making FL a safe & legal election for 2012 _E_\n.@lancearmstrong revise your decision to quit go back and fight. _E_\nEntrepreneurs: Be ready for problems you'll have them every day. Keep open to new ideas that's where innovation begins. _E_\n#DemDebate was really boring but had a lot of fun live tweeting and picked up by far the most followers. _E_\nSecretary Kerry cannot get other nations to join us in fighting ISIS. They are afraid and he is a poor salesman who reps a pathetic leader! _E_\nAfter Solyndra @BarackObama is stil intent on wasting our tax dollars on unproven technologies and risky companies. He must be accountable. _E_\nPresidential Proclamation Honoring the Victims of the Tragedy in Parkland Florida: __HTTP__ __HTTP__ _E_\nFrankly for a writer I don't think @DannyZuker's stuff is good. In fact it's terrible. _E_\nWhether we like it or not oil is the axis on which the world's economies spin. It just is. When the price o... (cont) __HTTP__ _E_\nThe only problem I have with Mitch McConnell is that after hearing Repeal & Replace for 7 years he failed!That should NEVER have happened! _E_\nThe new reality China and Japan are warning us not to default __HTTP__ Reckless government spending has made us weak. _E_\nStrong leader: @IsraeliPM Netanyahu explained at AIPAC the threat Israel faces from Iran's nuclear drive. He is (cont) __HTTP__ _E_\nVia @Law360: \"Trump's $200M Old Post Office Project Gets Early Approval\" __HTTP__ _E_\n.@alexsalmond @pressjournal RT @GailLorene Ask our Canadian neighbors who abhor the windfarms. 
And poor Scotland _E_\nI'll be on The Late Show with David Letterman tonight be sure to tune in for a great show. 11:30 pm on CBS. _E_\nIsn't it interesting that immediately after September 11th everybody was asking for and indeed demanding torture of any kind. No reports! _E_\nRT @foxandfriends: .@jasoninthehouse: Comey went silent when I asked him about his memos which raised a lot of eyebrows. __HTTP__ _E_\nGreat night in Iowa special people. Thank you! _E_\nMy heart goes out to the people of Boston on this terrible day! _E_\nDonald Trump Has Given Millions To Pro Romney SuperPACs and His Whole Family Is Cutting Checks to Mitt's Campaign __HTTP__ _E_\nI will be interviewed on @seanhannity tonight at 10:00. Many things mostly bad to talk about! _E_\nOhio Gov.Kasich voted for NAFTA from which Ohio has never recovered. Now he wants TPP which will be even worse. Ohio steel and coal dying! _E_\n.@VattenfallGroup wants out of their Aberdeen windfarm fiasco so badly but @AlexSalmond won't let them—he's (cont) __HTTP__ _E_\nWith all of its phony unnamed sources & highly slanted & even fraudulent reporting #Fake News is DISTORTING DEMOCRACY in our country! _E_\nCongratulations to @PGA_JohnDaly on his big win yesterday. John is a great guy who never gave up and now a winner again! _E_\nThe era of division is coming to an end. We will create a new future of #AmericanUnity. First we need to... __HTTP__ _E_\nInteresting that @Macys criticized me but just paid $650000 in fines for racial profiling. Are they racists? _E_\nNew York Times Apologizes to Donald TrumpA recent story in the New York Times incorrectly stated that Donald (cont) __HTTP__ _E_\nIn that @TimeWarner has @HBO with really dumb racist Bryant Gumbel(and I mean dumb) and no CBS (which fired Bryant) I am switching bldgs. _E_\nHappy 102nd birthday to President Ronald Reagan. Every day that passes Reagan's presidency looks better and better. 
_E_\nWhy did @AGSchneiderman have to fill out 3 successive ballots on Election Day? And this is our A.G. _E_\n#ThankYouTour2016 Tue: West Allis WI. Thur: Hershey PA. Fri: Orlando FL. Sat: Mobile AL. Tickets:... __HTTP__ _E_\nSteps away from Waikiki's famous beaches @TrumpWaikiki is Hawaii's top destination w/our signature amenities __HTTP__ _E_\nThe Republican platform is most pro Israel of all time! _E_\nMy @FoxNews interview last night on @hannityshow discussing OWS and @BarackObama's incompetent leadership. __HTTP__ _E_\nArriving @TrumpScotland with @DonaldJTrumpJr & @EricTrump. Back to New York tonight. Video: __HTTP__ _E_\n...massive regulation cuts 36 new legislative bills signed great new S.C.Justice and Infrastructure Healthcare and Tax Cuts in works! _E_\nHave some fun with this __HTTP__ _E_\nCheck out my speech from last Friday __HTTP__ as well as my appearance this morning on @foxandfriends __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_\nThe success of Shark Tank over the years is a total joke compared to the success of The Apprentice one of the biggest hits in T.V. history. _E_\nTo the geniuses at 'Americans United for Change': the more you tax me the less people I employ. Get it? _E_\nMy friend @AriEmanuel of @IMG bought the Miss Universe pageants from me and they are on tonight on #Fox! Tune in! _E_\nThe past 4 years have seen the weakest multiyear recovery since WWII __HTTP__ Need to loosen regulations and lower taxes. _E_\nThank you Pittsburgh Pennsylvania! Will be back soon! #AmericaFirst __HTTP__ _E_\nIraq should be paying us while we fight ISIS. Give the money to the families of our brave soldiers. _E_\nEveryone makes mistakes but it's what you do with them and what you learn from them that matters. Midas Touch _E_\nThank you South Dakota! #Trump2016 __HTTP__ __HTTP__ _E_\nEntrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. 
_E_\nWoody Johnson's comments that he would rather have @MittRomney win the election than his @nyjets win games shows real patriotism. _E_\nAfghani soldiers those on our side killed 7 Marines last month. __HTTP__ They don't want us what (cont) __HTTP__ _E_\nMy thoughts and prayers are with all of the victims involved in this mornings train collision in South Carolina. Thank you to our incredible First Responders for the work they've done! _E_\nToo bad @morningmika did not allow her interview with @SpitzerForNYC to go on another few minutes...would have been interesting... _E_\nPlans to build wind farm near Trump Turnberry in Scotland have been dropped. GREAT! @GolfDigest @GolfweekMag @GolfChannel @ESPNGolf _E_\nVia @BreitbartNews: TRUMP TO REPUBLICANS: 'PLAY THE DEBT CEILING CARD' __HTTP__ by @joelpollak _E_\nNot only does the media give a platform to hate groups but the media turns a blind eye to the gang violence on our streets! __HTTP__ _E_\nAutism rates through the roof why doesn't the Obama administration do something about doctor inflicted autism. We lose nothing to try. _E_\n.@AlexSalmond sought my support after he released terrorist Al Megrahi who blew up Pan Am #103 killing all aboard. I said \"no way!\" _E_\nThe Iranians are having 'difficulties' with their nuclear program __HTTP__ But no thanks to us! _E_\nI am proud to announce our newest project Trump Tower Mumbai. Together with the Lodha Group it will be incredible! __HTTP__ _E_\nWill be interviewed on Media Buzz with Howie Kurtz on Fox Sunday at 11:00 A.M. _E_\nWay to go @serenawilliams you are a true champion. Proud of you! _E_\n#MakeAmericaGreatAgain From my speech in South Carolina yesterday __HTTP__ _E_\nGreat boardroom! What did you think? #CelebApprentice _E_\nBig win by @Yankees last night to take control of AL East. Jeter & company now control their destiny. _E_\nThe Republican House Freedom Caucus was able to snatch defeat from the jaws of victory. 
After so many bad years they were ready for a win! _E_\nKaren Handel's opponent in #GA06 can't even vote in the district he wants to represent.... _E_\nI'll bet Lance Armstrong wishes he didn't do the interview with Oprah he's saying to himself what was I thinking? _E_\n.@NBC really happy with how well the #MissUniverse pageant went. _E_\nThe Fake Media is working overtime today! _E_\nNot under my watch __HTTP__ _E_\nThere are great campaigns on @fundanything __HTTP__ Be sure to take a look and donate to one today. _E_\nHeading to rally with Bobby now! See you soon! __HTTP__ _E_\nRT @IvankaTrump: Thank you New Hampshire! __HTTP__ _E_\nRT @ChuckGrassley: Jerusalem Embassy Act of '95 (Senate vote 93 5 & I voted for it) states embassy should be in Jerusalem by 5/31/99. For 1... _E_\nToday is the day that ObamaCare website was supposed to be up and working. WRONG website is closed down a total disaster! 90 million doomed _E_\nThat's right we need a TRAVEL BAN for certain DANGEROUS countries not some politically correct term that won't help us protect our people! _E_\nI absolutely support Kate's Law—in honor of the beautiful Kate Steinle who was gunned down in SF by an illegal immigrant. _E_\nCrooked Hillary Clinton deleted 33000 e mails AFTER they were subpoenaed by the United States Congress. Guilty cannot run. Rigged system! _E_\nAt your request I will be doing live tweeting during tonight's @ApprenticeNBC. #CelebApprentice _E_\nI just beat a lawyer from Yale and a lawyer from Harvard who teamed up against me in a major case worth millions ($). They were so dumb! _E_\n\" Pennies don't fall from heaven they have to be earned here on earth. – PM Margaret Thatcher (October 13 1925 – April 8 2013) _E_\nLuther Strange has been shooting up in the Alabama polls since my endorsement. Finish the job vote today for Big Luther. _E_\nBorder Patrol Officer killed at Southern Border another badly hurt. We will seek out and bring to justice those responsible. 
We will and must build the Wall! _E_\n\"In every battle there comes a time when both sides consider themselves beaten... _E_\nThe rules DID CHANGE in Colorado shortly after I entered the race in June because the pols and their bosses knew I would win with the voters _E_\nSorry won't be doing Fox & Friends this morning will be in India on a couple of major business deals! _E_\nTrump volunteers were out early today to offload cases of food and supplies for hard hit Rockaways residents #Sandy _E_\n#FlashbackFriday At Military Academy second from left. __HTTP__ _E_\nIt's Tuesday. How many jobs has ObamaCare cost the economy today? _E_\nJust saw the phony ad by Cruz totally false more dirty tricks. He got caught in so many lies is this man crazy? _E_\nChina's domestic economic and political problems prove how pathetic our leadership is in allowing China to rip us off __HTTP__ _E_\nAs we come together to celebrate the extraordinary contributions of African Americans to our nation our thoughts turn to the heroes of the civil rights movement whose courage and sacrifice have inspired us all. Proclamation: __HTTP__ __HTTP__ _E_\n#JFKFiles __HTTP__ _E_\nThank you for your support! __HTTP__ _E_\nCertain people are ruining their reputations tonight really sad! #Oscars _E_\nGood news @AFPhq is going to fight back against Rove's attack on the Tea Party __HTTP__ Go get em! @marklevinshow _E_\nCongratulations to my friend @seanhannity on @hannityshow 1000th show consecutively #1 in his time slot! Great going! _E_\nJust spoke to President XI JINPING of China concerning the provocative actions of North Korea. Additional major sanctions will be imposed on North Korea today. This situation will be handled! _E_\nThe @EricTrumpFDN is doing amazing work helping the children... _E_\nespecially how to get people even with an unlimited budget out to vote in the vital swing states ( and more). 
They focused on wrong states _E_\nFrom 1954 to 1960 there were 10 major hurricanes that hit the East Coast. _E_\nThis morning I will be going to the Commissioning Ceremony for the largest aircraft carrier in the world The Gerald R. Ford. Norfolk Va. _E_\nWe spend billions of dollars helping nations all over the World but with hurricane Sandy and Oklahoma tornado not one nation helped us! _E_\nThank you Washington! Together WE will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_\n.@Cher attacked @MittRomney. She is an average talent who is out of touch with reality. Like @Rosie O'Donnell a total loser! _E_\nWhat could be better than dinner with @MittRomney and me? __HTTP__ _E_\n#IceBucketChallenge For those of you who wanted a picture here it is __HTTP__ _E_\n.@VattenfallGroup has topped Carbon Data's rankings of the most carbon intensive companies in the EU's emissions trading scheme. _E_\nVia @worldnetdaily by @MichaelCarl7: \"Trump: Obama blew chance to free U.S. pastor\" __HTTP__ _E_\nCrooked Hillary Clinton wants to flood our country with Syrian immigrants that we know little or nothing about. The danger is massive. NO! _E_\nI was saddened to see how bad the ratings were on the Emmys last night the worst ever. Smartest people of them all are the DEPLORABLES. _E_\n... and pay per view records with \"Battle of the Billionaires\" in Detroit. It was a wild day! _E_\nBeing the best requires full time attention and application.\" – Midas Touch _E_\n.@AJDelgado13 Thank you so much for the nice words and support really enjoy listening to your ideas and thoughts. _E_\n\"Deals are my art form. I like making deals preferably big deals.\" – The Art of The Deal _E_\nThe FAKE NEWS media (failing @nytimes @CNN @NBCNews and many more) is not my enemy it is the enemy of the American people. SICK! _E_\nMariano Rivera Yankee pitcher is the greatest ever. Get well fast. 
_E_\nWith magnificent views @TrumpChicago is the perfect venue to host impact events & business meetings __HTTP__ _E_\nThe Obama Administration has a very important duty to provide a budget and then negotiate! OUR COUNTRY is a laughingstock! _E_\nAre you ready for the All Star @CelebApprentice? @TraceAdkins is back in the upcoming season...which is the best yet! _E_\nThank you! #Trump2016 #WIPrimary __HTTP__ _E_\nI was speaking with Don Imus this morning.... __HTTP__ _E_\nThe Democrats have said some of the worst things about James Comey including the fact that he should be fired but now they play so sad! _E_\nThey call it climate change now because the words global warming didn't work anymore. Same people fighting hard to keep it all going! _E_\nAmazingly with all of the money I have raised for the vets I have got nothing but bad publicity from the dishonest and disgusting media. _E_\nScary. Over 8332000 Americans left the work force during Obama's first term __HTTP__ How did Romney lose that election? _E_\n.@ARealSuperMan #asktrump __HTTP__ _E_\nWe should stop talking stay out of Syria and other countries that hate us rebuild our own country and make it strong and great again USA! _E_\nThe Cruz campaign issued a dishonest and deceptive get out the vote ad calling voters in violation. They are now under investigation. Bad! _E_\n\"Trump to build second Scottish course\" __HTTP__ via @UPI _E_\nStaff Sgt. Salvatore A. Giunta received the Medal of Honor from Pres. Obama this month. It was a great honor to have him visit me today. _E_\nAmazing story in @BreitbartNews about the sleazebag blogger Coppins who fabricated nonsense about me for irrelevant @BuzzFeed. CONGRATS! _E_\nMerry Christmas to all. Have a great day and have a really amazing year. Together we will MAKE AMERICA GREAT AGAIN! It will be done! _E_\nSorry to hear of yesterday's passing of General Norman Schwarzkopf. He was a terrific general and leader we could use more like him. 
_E_\nReally great numbers on jobs & the economy! Things are starting to kick in now and we have just begun! Don't like steel & aluminum dumping! _E_\nThank you New Jersey! #Trump2016 __HTTP__ __HTTP__ _E_\nI am being proven right about massive vaccinations—the doctors lied. Save our children & their future. _E_\nMy shirts ties & cufflinks @Macys have never been better or more beautiful. Great holiday gifts great price. _E_\nColorado was amazing yesterday! So much support. Our tax trade and energy reforms will bring great jobs to Colorado and the whole country. _E_\nThe Fake Media (not Real Media) has gotten even worse since the election. Every story is badly slanted. We have to hold them to the truth! _E_\nThe deficits under @BarackObama are the highest in America's history. Why is he bankrupting our country? _E_\nRumor has it Pataki Kasich & Senator Lindsey Graham are dropping out of the race very soon. Hope it's not true they're so easy to beat! _E_\n#Trump2016 #MakeAmericaGreatAgain #ECONOMY VIDEO: __HTTP__ __HTTP__ _E_\nLincoln never sounded like that! _E_\nThe Countryside Party just formed in Scotland to fight ugly wind turbines & @AlexSalmond. Congrats to Jim Crawford & Countryside Party. _E_\nTrump National Hudson Valley's 7693 yd par 72 course features one of the country's great golf courses. __HTTP__ _E_\n#MakeAmericaGreatAgain #NYPrimary __HTTP__ _E_\nVia @MailOnline: \"But did his hair survive? @MissUniverse & @MissUSA dump water over Donald Trump\" __HTTP__ _E_\nTo all young (and old) entrepreneurs: Believe in yourself talk yourself up! Energize yourself and you'll energize others. _E_\nWow a really nice lead in New Hampshire an increase since my last poll! __HTTP__ _E_\nDo not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_\nMike Bloomberg is doing a great job as Mayor of New York City. Ray Kelly is a great Police Commissioner. @MikeBloomberg _E_\nReally bad shooting in Orlando. 
Police investigating possible terrorism. Many people dead and wounded. _E_\nRe Negotiation: Patience is an enormous virtue & needs to be cultivated for successful negotiations on any level. _E_\n...have it. Fake News said 17 intel agencies when actually 4 (had to apologize). Why did Obama do NOTHING when he had info before election? _E_\nWhile @BarackObama watches China is trying to have the yuan overtake our dollar as the international (cont) __HTTP__ _E_\nHave a great and peaceful Memorial Day but remember there are people out there who don't want us to have peace. WE MUST BE STRONG!!!! _E_\nThe press is so totally biased that we have no choice but to take our tough but fair and smart message directly to the people! _E_\nThe only reason Obama gave a speech last night was because it was on the schedule Putin is laughing and the reviews have been really bad! _E_\nBrian Thanks dummy I picked up 70000 twitter followers yesterday alone. Cable News just passed you in the ratings. @NBCNightlyNews _E_\nPathetic excuse by London Mayor Sadiq Khan who had to think fast on his no reason to be alarmed statement. MSM is working hard to sell it! _E_\nAttended last night's @Yankees game Derek Jeter is both a great player and a great guy. _E_\nRoger Goodell of NFL just put out a statement trying to justify the total disrespect certain players show to our country.Tell them to stand! _E_\nEntrepreneurs: Learn to trust yourself. Being an entrepreneur is not a group effort. _E_\nTHANK YOU Daytona Beach Florida!#MakeAmericaGreatAgain __HTTP__ _E_\nThank you Florida Ohio and Pennsylvania! #CrookedHillary is not qualified. #ImWithYou __HTTP__ _E_\n\"@WestJournalism Exclusive – We Asked Donald Trump What Jobs He Would Offer ISIS\" __HTTP__ _E_\nThe brand new season of @CelebApprentice starts filming in less than 5 weeks. The 'All Star' cast will be announced very soon. _E_\nWill be working all weekend in choosing the great men and women who will be helping to MAKE AMERICA GREAT AGAIN! 
_E_\nIf Democrats do not start opposing ObamaCare and fast Republicans will have a massive victory in 2014 far greater than any predictions! _E_\nCRIPPLED AMERICA is the perfect gift for friends & family. Order signed copy & join me at 7:30pm live streaming! __HTTP__ _E_\nCrazy Maureen Dowd the wacky columnist for the failing @nytimes pretends she knows me well wrong! _E_\nI only wish my wonderful father Fred gave me $200 million to start my business like lightweight Rubio says. He didn't total fabrication! _E_\nBush is pretending that the Trump surge is great for him and the @nytimesworld is reporting Bush delight con job a Bush nightmare! _E_\nThe @BarackObama campaign keeps highlighting a web video of John McCain being nice & respectful. I'll bet John (cont) __HTTP__ _E_\nI look forward to Tuesday night's presidential debate I wonder if Obama will use my name again. _E_\nBobby Jindal did not make the debate stage and therefore I have never met him.... _E_\nJUST IN: A jury awarded a complete and total victory in buyer's remorse lawsuit against me in Ft. Lauderdale. _E_\nJay Carney won't answer reporters questions of Why Obama won't release his college transcripts Come on Jay! _E_\nWe should stay the hell out of Syria the rebels are just as bad as the current regime. WHAT WILL WE GET FOR OUR LIVES AND $ BILLIONS?ZERO _E_\n.@lancearmstrong should immediately reconsider or his legacy is ruined. _E_\n\"I believe anybody who is not afraid to fail is a winner.\" @JoeTorre _E_\n'Better Be Careful':Donald Trump Warns GOP On Immigration Creating '12 Million' New Dem Voters __HTTP__ via @Mediaite _E_\nThank you for the support South Carolina! #USSYorktown #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nSpain's government is closing down wind turbines the maintenance is higher than the income. _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nCheck out OAN and compare to what you are watching now! 
_E_\nHave a GREAT weekend everybody enjoy yourself but always keep your goals and aspirations in mind. Never lose sight of the victory ahead! _E_\nThe Democrats have zero intention of coming to any deal on the fiscal cliff. They will raise taxes and blame it on the Republicans. _E_\nTHANK YOU WISCONSIN! #VoteTrump next Tuesday April 5th! #WIPrimary __HTTP__ __HTTP__ _E_\nWow thank you Pensacola FL. See you Friday at 7pm join me! __HTTP__ __HTTP__ _E_\nCelebrated doctor @BillCassidy will be a tremendous Senator. Louisiana – send a Conservative to the Senate vote for Bill this November! _E_\nCorrupt @BarackObama's largest bundlers are fundraisers linked to the Obama Solyndra boondoggle __HTTP__ Chicago cronyism _E_\nThe Fake News Media has never been so wrong or so dirty. Purposely incorrect stories and phony sources to meet their agenda of hate. Sad! _E_\n\"Advertising is totally unnecessary. Unless you hope to make money.\" Jef I. Richards _E_\n.@TrumpCollection continues to deliver the goods __HTTP__ _E_\nRT @FLOTUS: The decorations are up! @WhiteHouse is ready to celebrate! Wishing you a Merry Christmas & joyous holiday season! __HTTP__ _E_\n.@MikeAndMike I will be on the Mike & Mike Show at 7.05 a.m. (ESPN) 10 minutes. Will be fun great guys! Radio and T.V. _E_\nWhat is wrong with the @GOP? Now they want to give all authority on the sequester cuts to Obama __HTTP__ Pathetic. _E_\nMy interview with Michael Patrick Shiels on WJIM in Lansing on behalf of @MittRomney __HTTP__ _E_\nPlease read __HTTP__ and watch a recent trip made to Trump Vineyard Estates by @EricTrump __HTTP__ _E_\nyears as a pol in Connecticut Blumenthal would talk of his great bravery and conquests in Vietnam except he was never there. When.... _E_\nThe reason I originally endorsed Luther Strange (and his numbers went up mightily) is that I said Roy Moore will not be able to win the General Election. I was right! Roy worked hard but the deck was stacked against him! _E_\nSo generous and pious! 
After spending millions of our tax dollars on his campaign through travel @BarackObama donated to himself. _E_\nRemarks by President Trump on the Policy of the U.S.A. Towards Cuba Video: __HTTP__ __HTTP__ _E_\n.@foxandfriends in 15 minutes! _E_\nVia @Inc by @steelwire: \"Donald Trump – To Micromanage or Not To Micromanage?\" __HTTP__ _E_\nThis is a buyers' market. Buy now. You will thank me in 3 years. _E_\nCongratulations to @foxandfriends on its unbelievable ratings hike. _E_\nLightweight A.G. Eric Schneiderman sued school with a 98% approval rating while billions in corruption goes unpunished. A total crook? _E_\nBest book ever on dealmaking (or so they say) TRUMP: THE ART OF THE DEAL. Go get it and others Washington you really can do better! _E_\nVia @scotsmandotcom: Trump joins with Chandler in bid to attract events __HTTP__ _E_\nThank you Georgia! See you soon!#Trump2016 __HTTP__ _E_\nSad.@BarackObama has already exempted major oil importers on Iranian sanctions and is negotiating a waiver with China. __HTTP__ _E_\nA really bad night for President Obama. Now the Republicans have to get together and get the job done! _E_\nStocks rose yesterday during the first day of government shutdown. Markets like being left alone for a day. _E_\nWhich is worse and which is more dishonest the #Oscars or the Emmys? _E_\nToday's @WSJ Editorial is WRONG again. I know that China is not in the new T.P.P. trade deal but would come in latter through a back door. _E_\nJohn Kasich fell right into President Obama's trap on ObamaCare and the people of Ohio are suffering for it. Shame! _E_\n#TrumpAdvice __HTTP__ _E_\nBenghazi was a massive cover up. _E_\n\"Lifestyle unveils Trump Home brand in GCC\" __HTTP__ via @TradeArabia _E_\nIn order to stop the Ebola outbreak in Africa perhaps the President should put all Africans on ObamaCare rather than sending the troops! _E_\nRon Estes is running TODAY for Congress in the Great State of Kansas. 
A wonderful guy I need his help on Healthcare & Tax Cuts (Reform). _E_\nTHANK YOU PORTLAND Maine!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThe United States troops which were sent to West Africa have only gotten 4 hours of Ebola training very unfair to them and their families! _E_\nWow—Golf Magazine just named Trump Scotland \"best new course.\" __HTTP__ _E_\nCongratulations to the @NYRangers on bringing the series home last night. _E_\nKarl Rove's ads are the worst in political history! _E_\nI am calling on Congress to TERMINATE the diversity visa lottery program that presents significant vulnerabilities to our national security. __HTTP__ _E_\nWill be interviewed on @SquawkCNBC by @JoeSquawk coming up at 6:00amE from Davos Switzerland. Enjoy! #WEF18 __HTTP__ _E_\nThank you Georgia! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nTurnberry in Scotland is a far superior golf course to Pinehurst and it isn't even close! Likewise the Blue Monster at Doral. _E_\nTrump SoHo opens this Friday and it is fantastic! Check out the Trump Hotel Collection... __HTTP__ _E_\n.@WestwoodLee Great going this weekend. You are a true champion! _E_\nTrump Int'l Hotel & Tower Chicago has won many awards & accolades as has Sixteen its signature restaurant. __HTTP__ _E_\nIs Anthony Weiner a jerk or what! _E_\nIn the general course of human nature a power over a man's subsistence amounts to a power over his will. Alexander Hamilton _E_\nDuring my trip to Saudi Arabia I spoke to the leaders of more than 50 Arab & Muslim nations about the need to confront our shared enemies.. __HTTP__ _E_\nAn attack on our Embassy is an attack on our soil. We have been attacked by Libya. Go into Libya & take the (cont) __HTTP__ _E_\n.@HillaryClinton loves to lie. America has had enough of the CLINTON'S! It is time to #DrainTheSwamp! 
Debates __HTTP__ _E_\n\"Learn to think continentally.\" Alexander Hamilton _E_\nWhile in Charlotte this weekend will visit my Trump National Golf Club on Lake Norman—a magnificent place & doing really well! _E_\nJust received a copy of @SarahPalinUSA new book a great read! Sarah is a terrific person. _E_\nI spoke with other candidates to a Jewish group many friends in D.C. I said I'm a negotiator like you Got standing O rated best of day! _E_\nThe price of greatness is responsibility. Winston Churchill _E_\nBritish Prime Minister May was very angry that the info the U.K. gave to U.S. about Manchester was leaked. Gave me full details! _E_\nGreat meeting with automobile industry leaders at the @WhiteHouse this morning. Together we will #MAGA! __HTTP__ _E_\nI wonder what the late great Vince Lombardi would say about the Rutgers football player who says he is being bullied because coach yelled? _E_\nA hurricane will be coming to Tampa. My @RNC convention surprise hits Monday night! _E_\nEstablishment flunky @KarlRove is going crazy with the just released CBS poll that has me way ahead. New Fox poll has me beating Hillary. _E_\nThere is nothing I would be happier to do than to donate the $5M to a charity of Obama's choice once he releases all of his records. _E_\nSo wonderful to be in Las Vegas yesterday and meet with people from police to doctors to the victims themselves who I will never forget! _E_\nCurrent @NYMag really sad not only boring but highly inaccurate. Use better paper product looks like a death march (which it is!). _E_\nHouse Republicans should be doing everything possible to defund ObamaCare. Instead Leadership is funding it __HTTP__ _E_\nGreatly dishonest of @TedCruz to file a financial disclosure form & not list his lending banks then pretend he is going to clean up Wall St _E_\nThere's no bigger name in America than Donald Trump political or nonpolitical. Sarasota GOP Chair Joe Gruters _E_\nNo surprise welfare spending is up over 30% under Obama. 
__HTTP__ He is the food stamp & welfare king _E_\nObama now wants to deny due process to the police. He'll give all constitutional rights to the terrorists but not our cops. _E_\nLast week was a first in #CelebApprentice when I fired 2 celebrities at once. Wish I could FIRE @RickSantorum! (cont) __HTTP__ _E_\nWho else could take 16 vacations play over 100 rounds of golf and hold over 300 fundraisers while serving as (cont) __HTTP__ _E_\nIf Obama is concerned about the border he should stop vacationing. Gov't will save millions which it can use to stop illegal migration. _E_\nGreat now Supreme Court Justices are talking about a constitutional right to a cell phone __HTTP__ Obama just stop already. _E_\nTake responsibility for yourself it's a very empowering attitude. _E_\nMitt Romney did great in the debate last night. _E_\nTucson killer Loughner should be given the death penalty not his plea bargained life in prison which will cost (cont) __HTTP__ _E_\nBillions of dollars spent on Baltimore and it's still a total mess. Leadership is needed not dollars. Our whole country is going to hell! _E_\nThis is the simple fact about @HillaryClinton: she is a typical politician all talk no action. #Debates2016 _E_\nThank you! #MakeAmericaGreatAgain __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nAgain more dead people voted in the last election than enrolled in ObamaCare. Congratulations America! _E_\nWe're going to use American steel we're going to use American labor we are going to come first in all deals. ... __HTTP__ _E_\nObama's second term is going to very tough for the Republicans. The Republicans must pick their battles wisely and play smart. _E_\nLetter to @Univision Re: @TrumpDoral __HTTP__ _E_\nWill be on @CNBC at @7:22. Enjoy! _E_\nWill be another Sean success! __HTTP__ _E_\nWow...NYT reports @celebrityapprentice was the number 1 show in branding on television for all of 2012. 
_E_\nThoughts and prayers are with everyone in West Virginia dealing with the devastating floods. #ImWithYou _E_\nThe $25 Billion settlement with the banks on mortgages will slow the housing market down even more and create higher user fees. Stupid! _E_\nJob tip: If you were the employer what kind of person would you most desire as an employee? Be that person. _E_\nI will be on @letterman tonight. Be sure to watch! Always a great time. @LateShow _E_\nWe are now at a time perhaps more than ever before when the World needs GREAT leadership! _E_\nEntrepreneurs: See yourself as victorious and the best way to be victorious is to be passionate. Find something you love doing! _E_\nThe radar defense shipping and civil aviation problems will stop the ugly windfarm. #EOWDC _E_\nThe media coverage this morning of the very average Clinton speech and Convention is a joke. @CNN and the little watched @Morning_Joe = SAD! _E_\nIt is a shame that the biased media is able to so incorrectly define a word for the public when they know that the definition is wrong. Sad! _E_\nA clip from @KatieShow where I take @katiecouric's audience on the Katie Coach __HTTP__ _E_\nWhen the achiever achieves it's not a plateau it's a beginning. Donald J. Trump __HTTP__ _E_\nEvery business has surprises hidden dangers beneath the surface and little known opportunities that can lead to huge success. _E_\nFans like winners. They come to watch stars great exciting players who do great exciting things. #TheArtofTheDeal _E_\nImagine how much money the average American would save if we busted the OPEC cartel. (cont) __HTTP__ _E_\nHere we go! I stated long ago that we should cancel all flights from West Africa. Now we have Ebola in U.S. AND IT WILL ONLY GET WORSE! _E_\nWill be cutting ribbon at 10 A.M. with Mayor Bloomberg and Jack Nicklaus for the opening of TRUMP LINKS at FERRY POINT. 
_E_\n\"Spend your time enjoying your big dreams.\" Think Big _E_\nDo you ever notice that lightweight @megynkelly constantly goes after me but when I hit back it is totally sexist. She is highly overrated! _E_\nWow new polls just came out from @CNN Great numbers especially after total media hit job. Leading Ohio 48 44. _E_\nMitt Romney matches sitting President in fundraising for April not an easy thing to do. Bad news for (cont) __HTTP__ _E_\nDesigned by @IvankaTrump @TrumpDoral's New Villa Deluxe Guestrooms include vintage artwork of golf legends __HTTP__ _E_\nTomorrow at 11AM #MakeAmericaGreatAgain __HTTP__ _E_\nWe need the Wall for the safety and security of our country. We need the Wall to help stop the massive inflow of drugs from Mexico now rated the number one most dangerous country in the world. If there is no Wall there is no Deal! _E_\nYesterday on the same day I had meetings with Russian Foreign Minister Sergei Lavrov and the FM of Ukraine Pavlo... __HTTP__ _E_\nTurn to QVC now to watch Melania really good stuff! _E_\nWith eleven Republican candidates running in Georgia (on Tuesday) for Congress a runoff will be a win. Vote R for lower taxes & safety! _E_\nThe Failing New York Times foiled U.S. attempt to kill the single most wanted terroristAl Baghdadi.Their sick agenda over National Security _E_\nWill be doing Fox and Friends at 7 A.M. (in 20 minutes). _E_\nWhat a shame that @msnbc's ratings have sunk even lower in 2013. Prime time down 50%. @TheRevAl's are (cont) __HTTP__ _E_\n\"Surround yourself with people who are smarter than you.\" @UncleRUSH _E_\nThe Democrats are most angry that so many Obama Democrats voted for me. With all of the jobs I am bringing back to our Nation that number.. _E_\nMust read editorial today about lightweight New York State Attorney General Eric Schneiderman. Is he a crook? __HTTP__ _E_\nThank you West Virginia!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe polling numbers show a close race. 
@MittRomney needs all of our support. _E_\n.@BarbaraJWalters @theviewtv will apologize to me just like she did when I was right about @Rosie. Besides I get great ratings on The View. _E_\nGreat evening with President @EmmanuelMacron & Mrs. Macron. Went to Eiffel Tower for dinner. Relationship with France stronger than ever. __HTTP__ _E_\nThe Obama administration gives better medical care to Al Qaeda at Gitmo than to our vets. _E_\nThe great @MarianoRivera in my office with my son @EricTrump __HTTP__ _E_\nResponse to @LindseyGrahamSC: __HTTP__ _E_\nDo you notice the silence lately on wind turbine monstrosities? The people of Scotland & many other countries are fighting back. _E_\n\"You owe it to yourself and to your community to make your property the best it can be.\" – Think Like a Billionaire _E_\nThe MOVEMENT in Lakeland Florida. Voter registration extended to 10/18. REGISTER ASAP @ __HTTP__ &... __HTTP__ _E_\nThis is not a media event or about Donald J. Trump this is about the United States of America. I will be (cont) __HTTP__ _E_\nIf @BarackObama wanted the Super Committee to succeed he would have lead. Instead he has been campaigning. Where is the leadership? _E_\nIt is important to think positively. Negative thinking will kill your focus and destroy any chance you have of being successful. _E_\nLittle @MacMiller you illegally used my name for your song \"Donald Trump\" which now has over 75 million hits. _E_\nNow that Mitt is gone all we have to do is get Bush to drop out and Trump to run—and we will win! _E_\nI will be in Washington D.C. on Wednesday1 P.M.in front of the Capitol to protest the horrible and incompetent deal being made with Iran. _E_\nDACA has been made increasingly difficult by the fact that Cryin' Chuck Schumer took such a beating over the shutdown that he is unable to act on immigration! _E_\nI am happy to hear how badly the @nytimes is doing. It is a seriously failing paper with readership which is way down. Becoming irrelevant! 
_E_\nRapidly failing @VanityFair magazine hits me for my strong stance against Obama's brilliant 5 killers for 1 deserter trade. Amazing! _E_\nThe Iran deal is terrible. Why didn't we get the uranium stockpile it was sent to Russia. #SOTU _E_\nLooking forward to the Florida rally tomorrow. Big crowd expected! _E_\nRatings starved @CNN and @CNNPolitics does not cover me accurately. Why can't they get it right it's really not that hard! _E_\nW/ the ransom Obama paid for deserter Bergdhal getting Mexico to release USMC Sgt Andrew Tahmooressi is much harder. #BringBackOurMarine _E_\n.@katyperry must have been drunk when she married Russell Brand @rustyrockets – but he did send me a really nice letter of apology! _E_\nJusr arrived at the studio the place is going wild! LIVE AT 8 P.M. #CELEBRITYAPPRENTICE _E_\nNow China is threatening our allies who share defense pacts with us the latest is the Philippines __HTTP__ Very aggressive _E_\n\"If it doesn't sell it isn't creative.\" David Ogilvy _E_\nMy @FoxNews interview from yesterday with @TeamCavuto discussing the economy my trip to Australia @MittRomne... (cont) __HTTP__ _E_\nThose five hotels includeTrump International Hotel & Tower New York Trump Soho New York Trump International Hotel & Tower Chicago... _E_\nMore reports of voting machines switching Romney votes to Obama. Pay close attention to the machines don't let your vote be stolen _E_\nMoving forward f/tonight's competitive primaries it is crucial that the Tea Party & @GOP remain united towards November. Take the Senate! _E_\n#trumpvlog Why I cancelled the great debate ..... __HTTP__ _E_\n\"The problem is that no government can create real jobs. Only entrepreneurs can do that.\" – Midas Touch _E_\nSpeech on Veterans' Reform: __HTTP__ _E_\nI am tired of @BarackObama talking about @MittRomney's father. Why don't we discuss Barack Obama Sr.! _E_\nThank you to @GolfweekMag for naming Trump International Golf Links Scotland #1 GB&I Best Modern Course A great honor! 
_E_\nPete Rose should now be allowed in The Baseball Hall of Fame. The all time hits leader has paid the price already! _E_\nLooking forward to keynoting the @NCGOP #NCGOPcon dinner tomorrow night! @NCGOP is a top state party! _E_\nThe woman who is the Secret Service Director looks like she is way over her head.Why can't the president appoint the best and the brightest? _E_\nThe only way for Medicare and Social Security to remain solvent is if our economy is healthy. @BarackObama doesn't get it. _E_\nWow great Ohio poll. Shows me leading by 5 points beating K! _E_\nI find hope in the darkest of days and focus in the brightest. Dalai Lama _E_\nSooner or later those who win are those who think they can. Paul Tournier _E_\nMy interview yesterday on the S&P downgrade with Wolf Blitzer on CNN __HTTP__ _E_\n\"Courageous people do not fear forgiving for the sake of peace.\" – Nelson Mandela _E_\nOn behalf of @FLOTUS Melania & myself THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief @RedCross & @SalvationArmyUS __HTTP__ _E_\nThe Honolulu accommodations of @TrumpWaikiki are the perfect merger of beauty and function __HTTP__ _E_\nJust as I predicted @Rosie would fail on The View __HTTP__ _E_\n.@BillMaher's so called show on HBO must be the cheapest special produced in the history of television it sucks! _E_\nvia __HTTP__ Donald Trump announces launch of his first Indian project in Pune __HTTP__ _E_\nIt was great being on @MikeAndMike in the Morning (ESPN)—two great guys fantastic show! _E_\nIt will be interesting to see how Jenna Talackova does as Miss Universe Canada. We all wish her luck. _E_\nMake sure to enjoy your time with your family during the holiday. It is a special time. Love and appreciate your family. 
_E_\nGetting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_\nNow @RonWyden is also \"concerned\" about ObamaCare along with @MaxBaucus __HTTP__ Program may fold through its own doing. _E_\nAGAIN TO OUR VERY FOOLISH LEADER DO NOT ATTACK SYRIA IF YOU DO MANY VERY BAD THINGS WILL HAPPEN & FROM THAT FIGHT THE U.S. GETS NOTHING! _E_\nJust arrived in Indianapolis Indiana to make an announcement on #TaxReform! Together we are going to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nBig WIN today for building the wall. It will secure the border & save lives. Now the full House & Senate must act! __HTTP__ __HTTP__ _E_\nOther than a small group of people who have suffered massive and embarrassing losses the party is VERY united. Great love in the arena! _E_\nAs President of the United States of America I will ALWAYS put #AmericaFirst#UNGAFull remarks: __HTTP__ __HTTP__ _E_\nI commend Roger Ailes for publicly supporting @FoxNews' employees against the Obama administration's intimidation of its reporters. _E_\nCongratulations to John Rich and to Marlee Matlin for a terrific job throughout the season. You are both great! __HTTP__ _E_\nEntrepreneurs: Cover your bases. Know everything you can about what you're doing. Then go with your gut. Your instincts r there for a reason _E_\nPrayers go out to the victims of the terrible fire in New Jersey. Stay strong and remember it will soon get better. _E_\nProblem with @GOP is not their message it's that they are incapable of controlling the message. _E_\nJimmy Fallon show will be great tonight I'm on! _E_\nFewer Americans are now insured through their employers due to higher premiums. Obamacare must be fully repealed. __HTTP__ _E_\nThe Washington Post calls out #CrookedHillary for what she REALLY is. A PATHOLOGICAL LIAR! Watch that nose grow! __HTTP__ _E_\nI promise you that I'm much smarter than Jonathan Leibowitz I mean Jon Stewart @TheDailyShow. 
Who by the way is totally overrated. _E_\nOnce again @Cher tweets nonsense about @MittRomney. She needs to stop tweeting & start worrying about some of her many problems. _E_\nJoint Press Conference with Prime Minister Saad Hariri of Lebanon beginning shortly. Join us live! __HTTP__ __HTTP__ _E_\nSpent the full day at meetings and a major rally yesterday in South Carolina. Great people and spirit. Today will be more of the same. _E_\nI don't know whether I will win or lose the @billmaher lawsuit but had an obligation to sue for charity. _E_\nWe grieve for the officers killed in Baton Rouge today. How many law enforcement and people have to... __HTTP__ _E_\n.@nbc did a great job last night with the @GoldenGlobes! _E_\nVia @nytpolitics by @AshleyRParker: \"Strong Showings for Donald Trump in Iowa and New Hampshire Polls\" __HTTP__ _E_\nWindmills are destroying every country they touch and the energy is unreliable and terrible. __HTTP__ _E_\nWatch my wife Melania Trump tonight on @QVC at 1 a.m. So proud of her! _E_\nVia @bpolitics by @emtitus: \"Defying Doubters Donald Trump Makes Presidential Bid Official\" __HTTP__ _E_\nVia @newhampshirecom:\"Tickets on sale for Loeb School Event featuring Donald Trump\" __HTTP__ _E_\nI have accepted @billmaher's $5 million offer paid to me for charity (made on the @jayleno show). _E_\nThat was a great football game. _E_\nAre you a Democrat running in a race you should lose? Get @KarlRove to run an ad against you and you will win. _E_\nLooking forward to speaking at @ralphreed's @FaithandFreedom Gala Dinner on Friday in D.C. His staff has been great! _E_\nMelania and I saw American Idiot on Broadway last night and it was great. An amazing theatrical experience! _E_\nCongrats to @BarbaraJWalters on winning the @MadeinNY Mayor's Award for Lifetime Achievement! I love Barbara! _E_\nHappy 5th Anniversary to @TrumpWaikiki&lt __HTTP__ ! Can't believe it's already been 5 years.. 
_E_\nOn this Memorial Day holiday we honor our fallen soldiers who have made the greatest sacrifice for freedom. They are our country's finest. _E_\n#trumpvlog Windfarms in today's video blog... __HTTP__ _E_\nWow some new and even greater polls thank you! _E_\nThe U.S. cannot allow EBOLA infected people back. People that go to far away places to help out are great but must suffer the consequences! _E_\nThe Council was shocked by the exuberance of the demonstration in Blackdog. @AlexSalmond @pressjournal _E_\nVia @Newsmax_Media: \"Trump: @KarlRove 'The Most Over rated Man in Politics'\" __HTTP__ _E_\n.@VattenfallGroup had no answers at demonstration last night. It's a failing company. Aberdeen windmills will destroy it. _E_\nHillary Clinton has been involved in corruption for most of her professional life! _E_\nEntrepreneurs: Get and keep your momentum going. Without momentum a lot of great ideas go nowhere. _E_\nGreat article in the @guardian Donald Trump opens £100m golf course __HTTP__ _E_\nIn one of the biggest stories in a long time the FBI says it is now missing five months worth of lovers Strzok Page texts perhaps 50000 all in prime time. Wow! _E_\nJoin me in Henderson Nevada on Wednesday at 11:30am! #MAGA Tickets: __HTTP__ _E_\nWind Farms are not only disgusting to look at but also cause tremendous damage to their local ecosystems. __HTTP__ _E_\nVia @Slate: Who won the #GOPDebate? __HTTP__ _E_\nFracking will lead to American energy independence. With price of natural gas continuing to drop we can be at a tremendous advantage. _E_\nThe Voter Violation certificate gave poor marks to the unsuspecting voter(grade of F) and told them to clear it up by voting for Cruz. Fraud _E_\nThe Saudis are taking credit for a meager 2% drop on crude __HTTP__ They always play this game (cont) __HTTP__ _E_\nI enjoyed meeting with @MattBlunt @TrumpTowerNY to discuss why our government must address currency manipulation. Many US jobs are at stake. 
_E_\nThe signature restaurant of @TrumpNewYork @jeangeorges is both Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_\nVia @BreitbartNews by @mboyle1: Donald Trump Slams Liberals In 'Dishonest Press': 'I'm Going To Start Naming Names' __HTTP__ _E_\nThank you for joining us at the Lincoln Memorial tonight a very special evening! Together we are going to MAKE AM... __HTTP__ _E_\nA signed copy of CRIPPLED AMERICA makes a great gift. Order & join my live streaming book signing event on 12/3 __HTTP__ _E_\nCongratulations to @BarackObama he is the first POTUS to run trillion dollar deficits in all four years of his term! _E_\nA mediocre person tells. A good person explains. A superior person demonstrates.... _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nThank you to my great supporters in Wisconsin. I heard that the crowd and enthusiasm was unreal! _E_\nDummy I'm asking a question look at the question mark at the end of the sentence! Use your head. _E_\nJohn Heilemann the lightweight reporter begging to be on@morning joe looks like a timebomb waiting to explode he's a nervous and sad mess! _E_\nGary as the Cat in the Hat? He can work it out. _E_\nHow much is New York State spending on that obnoxious T.V. commercial that is being played endlessly for a tax incentive that doesn't work? _E_\nI have directed that U.S. Cyber Command be elevated to the status of a Unified Combatant Command focused on....cont: __HTTP__ _E_\nUSSS did an excellent job stopping the maniac running to the stage. He has ties to ISIS. Should be in jail! __HTTP__ _E_\nVia The Brody File: The Lesson Evangelicals Can Learn From Donald Trump Thank you David & CBN News so nice. __HTTP__ _E_\nIf the Wall Street protesters are upset about the economy then they should really be protesting @BarackObama at the White House. _E_\nNew CNN/WMUR New Hampshire poll just released. Thank you! #FITN #Trump2016 __HTTP__ _E_\nto the U.S. 
but had nothing to do with TRUMP is more FAKE NEWS. Ask top CEO's of those companies for real facts. Came back because of me! _E_\nVirtually no one has spent more money in helping the American people with disabilities than me. Will discuss today at my speech in Sarasota _E_\nGreat to see that Dr. Kelli Ward is running against Flake Jeff Flake who is WEAK on borders crime and a non factor in Senate. He's toxic! _E_\nThe failing @nytimes should focus on fair and balanced reporting rather than constant hit jobs on me. Yesterday 3 boring articles today2! _E_\nThe Chinese are illegally dumping bird killing wind turbines on our shores. Only one of many grievances we should act. _E_\nHappy Thanksgiving to everyone. We will together MAKE AMERICA GREAT AGAIN! _E_\nPhoto from a recent episode of @ApprenticeNBC saying those two famous words! __HTTP__ _E_\n...Spread shots out over long period and watch positive result. _E_\nFeaturing @BLTPrime & Palm Grill @TrumpDoral offers a wide array of acclaimed top dining options __HTTP__ _E_\nThis morning @nbc @todayshow played some of the @RNC video I filmed for the Tampa Convention __HTTP__ _E_\nRaised a lot of money for the Republican Party. There will be a big gasp when the figures are announced in the morning. Lots of support! Win _E_\nSee you tonight in North Carolina. Making keynote for the Republican party will be fun. _E_\nObamaCare will increase individual market premiums by 99% for men and 62% for women __HTTP__ DEFUND!! #MakeDCListen _E_\n#AmericaFirst #ImWithYou __HTTP__ _E_\nRT @Carl_C_Icahn: 1/2 Believe Trump gave a great speech. _E_\nRussia and the world has already started to respect us again! __HTTP__ _E_\nBig win for Republicans as Democrats cave on Shutdown. Now I want a big win for everyone including Republicans Democrats and DACA but especially for our Great Military and Border Security. Should be able to get there. See you at the negotiating table! 
_E_\nWill be spending the day campaigning in Connecticut another state where jobs are being stolen by other countries. I will stop this fast! _E_\nRegardless of the USC's ruling ObamaCare can only be defeated politically. It must be legislatively repealed or America will go bankrupt. _E_\nVia @HotelierME: Olympic golf course designer named by Trump Damac __HTTP__ _E_\n...al Megrahi was the man who blew up Pan Am Flight 103 over Lockerbie Scotland. _E_\n.@oreillyfactor please explain to the very dumb and failing @glennbeck that I supported John McCain big league in 2008 not Obama! _E_\nAs I predicted long ago the war in Iraq was a disaster for the U.S. Heading for civil war there are bombings all over the place.Iran happy _E_\nWithout passion you don't have energy without energy you have nothing. Find work that you love and the energy will be there. _E_\nEntrepreneurs: Never give up. Be tough. Apply your skills and talent but above all be tenacious. _E_\nGreat to be back in Arizona!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nHillary Clinton wants to create the most liberal Supreme Court in history #debate #DrainTheSwamp __HTTP__ _E_\nI try to learn from the past but I plan for the future by focusing exclusively on the present. That's where the fun is! _E_\nThe event with me and @V4SA in L.A on 9/15 is turning out to be huge. Get your tickets before they're gone __HTTP__ _E_\nWhile @JoeBIden is a gaffe machine yesterday's comments that @MittRomney will put y'all back in chains was not at all proper. _E_\nThe people of Cuba have struggled too long. Will reverse Obama's Executive Orders and concessions towards Cuba until freedoms are restored. _E_\nAll these polls released by news outlets are oversampling Democrats. They want to influence public perception of the race. _E_\nGreat event in Columbus taking off for Cincinnati now. 
Great new Ohio poll out thank you!OHIO NBC/WSJ/MARIST POLLTrump 42% Clinton 41% _E_\nBad move @BarackObama released $147M in aid to the Palestinians __HTTP__ That money is going to Hamas. _E_\nWow with all this talk @MissUniverse is going to Russia on November 9th __HTTP__ _E_\nSoaring 92 stories @TrumpChicago boasts a @ForbesInspector 5 Star rating for both its hotel & restaurant __HTTP__ _E_\nI always said Obama is lucky for himself but unlucky for the country. The storm could be very good for him as he (cont) __HTTP__ _E_\nSomebody please inform Jay Z that because of my policies Black Unemployment has just been reported to be at the LOWEST RATE EVER RECORDED! _E_\nThe decision on Sergeant Bergdahl is a complete and total disgrace to our Country and to our Military. _E_\nPresident Obama has absolutely no control (or respect) over the African American community they have fared so poorly under his presidency. _E_\nI totally respect that Angelina Jolie has shown such great bravery in the face of danger she has really come a long and positive way! _E_\nGood news @MittRomney is now leading in North Carolina according to @ppppolls. The NC GOP is united after their (cont) __HTTP__ _E_\nThere is incredible progress on the site of Trump Tower Punta del Este Uruguay situated on the sands of Playa Brava __HTTP__ _E_\nInteresting that certain Middle Eastern countries agree with the ban. They know if certain people are allowed in it's death & destruction! _E_\n\"Keep your focus global and you may very well find yourself ahead of the game.\" – Trump Never Give Up _E_\nJoin me in Bedford New Hampshire tomorrow at 3:00pm. Can't wait to see everyone! #AmericaFirst #MAGA... __HTTP__ _E_\nAnother example of the destruction caused by wind turbines. Unnecessary waste horrible! __HTTP__ _E_\n....because he doesn't live there! He wants to raise taxes & kill healthcare. On Tuesday #VoteKarenHandel. 
_E_\nWill be participating in a town hall event hosted by @SeanHannity tonight at 10pmE on @FoxNews. Enjoy! __HTTP__ _E_\nInstead of driving jobs and wealth away AMERICA will become the world's great magnet for INNOVATION & JOB CREATION. __HTTP__ _E_\nI can't believe that @CNN would allow the very nice Jeffrey Lord to be savaged by a panel of seven Trump haters. 7 to 1 Don't watch CNN! _E_\nThe election is still close but trending toward @MittRomney. He leads all national polls and Obama's likeability is imploding. VOTE! _E_\nNewly released documents show Geithner to be laughing as the financial crisis loomed __HTTP__ _E_\nIf once you forfeit the confidence of your fellow citizens you can never regain their respect and their esteem. Abraham Lincoln _E_\nTo all haters and losers: I am NOT anti vaccine but I am against shooting massive doses into tiny children. Spread shots out over time. _E_\nIn Nashville Tennessee! Lets MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nHas Charles @krauthammer ever apologized for being so totally wrong on Iraq? I called it right in every way—Make America Great Again! _E_\nA great morning with everyone @LibertyU! Thank you! Off to New Hampshire now. #Trump2016 __HTTP__ __HTTP__ _E_\nWill be interviewed on @ThisWeekABC this morning. Enjoy! _E_\n'The goal is to be the winner': Donald Trump's campaign is for real. Via The Guardian __HTTP__ _E_\n.@AGSchneiderman must take a drug test immediately—make results public. NY Attorney General cannot be a cokehead. _E_\nRubio was very disloyal to Bush his mentor when he decided to run against him. Both said they love each other.They don't word is hate! _E_\nToyota Motor said will build a new plant in Baja Mexico to build Corolla cars for U.S. NO WAY! Build plant in U.S. or pay big border tax. _E_\nOnce again @RickSantorum proves he can't run a professional campaign. He is ineligible in large section of (cont) __HTTP__ _E_\nMany people are saying that the Iranians killed the scientist who helped the U.S. 
because of Hillary Clinton's hacked emails. _E_\nRT @Scavino45: Florida Governor Rick @FLGovScott. #HurricaineIrma __HTTP__ _E_\nObamaCare enrollment lie: Obama counts an enrollee as a web user putting a plan in \"their online shopping carts\" __HTTP__ _E_\n\"There are 2 things I've found I'm very good at: overcoming obstacles and motivating good people to do their best work.\"–The Art of The Deal _E_\nIran admits to aiding the Libyan Rebels and Ahmadinejad received a letter of thanks when will Washington learn? __HTTP__ _E_\nThe Great State of Michigan was just certified as a Trump WIN giving all of our MAKE AMERICA GREAT AGAIN supporters another victory 306! _E_\nDavid Wright of the NY Mets should have been on the 1st Team All Stars. He's having a great year. _E_\nI always enjoy speaking to young aspiring entrepreneurs. They are hungry motivated and eager to learn. Proves America can still be great. _E_\nI had a great time in Texas yesterday. A tremendous crowd of wonderful and enthusiastic people. Will be back soon! _E_\nTogether we will Make America Great Again!#AmericaFirst __HTTP__ _E_\nWeakness of attitude becomes weakness of character. Albert Einstein _E_\nAfter four years of getting the run around America needs a turnaround and the man for the job is Governor Mitt Romney. @PaulRyanVP _E_\n....This is real collusion and dishonesty. Major violation of Campaign Finance Laws and Money Laundering where is our Justice Department? _E_\nCheck out Donald Trump's new iGoogle Showcase page: __HTTP__ _E_\nHAPPY THANKSGIVING EVERYONE ENJOY YOUR DAY! _E_\nAnother health insurer is pulling back due to 'persistent financial losses on #Obamacare plans.' Only the beginning! __HTTP__ _E_\nI like thinking big. I always have. To me it's very simple: if you're going to be thinking anyway you might as (cont) __HTTP__ _E_\nGreat visit to Detroit church fantastic reception and all @CNN talks about is a small protest outside. Inside a large and wonderful crowd! 
_E_\n.@AlexSalmond's insane release of the terrorist—for humanitarian reasons will go down as a better decision.. _E_\nTonight despite everything put A Rod in the lineup. _E_\nEntrepreneurship is engine of American success. I bring it to crowdfunding w/ @fundanything's $1M RECORD reward __HTTP__ _E_\nJust read the nice remarks by President Jimmy Carter about me and how badly I am treated by the press (Fake News). Thank you Mr. President! _E_\nVia @THESHARKTANK1: Donald Trump's Controversial Mexican Comments Are Accurate __HTTP__ _E_\nNo person who is enthusiastic about his work has anything to fear from life. Samuel Goldwyn _E_\nWord is that despite a record amount spent on negative and phony ads I had a massive victory in Florida. Numbers out soon! _E_\nIsn't it amazing that the U.S. and NSA can listen to the highly protected phone conversations of world leaders but can't get O's records! _E_\n#MakeAmericaGreatAgain #GOPdebate __HTTP__ _E_\nSenator Bob Corker begged me to endorse him for re election in Tennessee. I said NO and he dropped out (said he could not win without... _E_\nGreat article by @AmSpec's Jeffrey Lord: \"The Reagan Revolution. And now... the @realDonaldTrump Revolution?\" __HTTP__ _E_\nNice interview in the @The Atlantic of Sarasota GOP Chair Joe Gruters on my 2012 'Statesman of the Year' award __HTTP__ _E_\nThe Arab League stated that it wants nothing to do with an attack on Syria but they want us to attack.Are our leaders insane or just stupid _E_\n\"You have to keep going and moving forward no matter what is happening around you or to you.\" – Think Like a Champion _E_\nRT @DailyCaller: Guam Governor To Trump: I've Never Felt Safer Than 'With You At The Helm' __HTTP__ __HTTP__ _E_\nAlong with two championship courses on the Potomac River @TrumpGolfDC's also offers limitless social events __HTTP__ _E_\nI will be having a general news conference on JANUARY ELEVENTH in N.Y.C. Thank you. 
_E_\nRT @FoxNews: .@EricTrump: My father was elected for one reason and that's because he actually believes in putting America first which is... _E_\nHonored to sign S.442 today. With this legislation we support @NASA's scientists engineers and astronauts in the... __HTTP__ _E_\nVia @gatewaypundit: \"Please Pray for Me... I Am Losing My Insurance\" __HTTP__ Just one of the millions of cases like this... _E_\nMore Bush cronyism – \"Jeb Bush and the Common Core Money Trail\" __HTTP__ It's the Bush way! _E_\nThat would mean that Eliot Spitzer has failed at everything he's done politics TV & even real (cont) __HTTP__ _E_\nGood timing: @TraceAdkins won big for American Red Cross last night on @ApprenticeNBC. Now the Red Cross is in Oklahoma doing a great job. _E_\nRemember the old saying The more you learn the more you realize you don't know it's true. Learning is a daily challenge. _E_\nMy heartfelt thoughts and prayers are with the 7 @USNavy sailors of the #USSFitzgerald and their families. ... __HTTP__ _E_\nA suicide bomber has just killed U.S. troops in Afghanistan. When will our leaders get tough and smart. We are being led to slaughter! _E_\nGreat job to Missy Franklin. She's got a smile that can take over the world. She's also a major talent. Great going Missy! _E_\nAmerica will have record growth and prosperity during his adminstration: @MittRomney's success in the private sector is a tremendous asset. _E_\nJust spoke to President Macri of Argentina about the five proud and wonderful men killed in the West Side terror attack. God be with them! _E_\nToday I announced another historic breakthrough for the VA. We are working tirelessly to keep our promises to our GREAT VETERANS! #USA __HTTP__ _E_\nRT @EricTrump: So proud to be out on the campaign trail with @realDonaldTrump thanks for an amazing night #Biloxi #Trump2016 __HTTP__ _E_\nI am watching the NFL DRAFT will be interesting! A lot of talent but only a few will become STARS. 
_E_\n\"What America Needs: The Case for Trump\" Great new book by the esteemed Jeffrey Lord @JeffJlpa1 Available now. __HTTP__ _E_\nEntrepreneurs: Remember to think big by expanding your horizons at the same time you're expanding your net worth. _E_\nThe ultimate Golf experience @TrumpTurnberry is a unique destination located on the beautiful Ayrshire coastline __HTTP__ _E_\n'President Elect Donald J. Trump Intends to Nominate Congressman Tom Price and Seema Verma.' __HTTP__ __HTTP__ _E_\nBook on Bin Laden is a terrible violation of code makes @BarackObama's story a big lie. _E_\nLots of comments—Do you really believe these two brothers operated alone without influence of others? _E_\nCongratulations to @Likud_Party MK @dannydanon on being offered Deputy Defense Minister of IDF by @IsraeliPM @netanyahu. _E_\nTHANK YOU AMERICA! #Trump2016#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nHeading for Ohio really big crowd of amazing people! Much to talk about! _E_\nRupert Murdoch is a great guy who likes me much better as a very successful candidate than he ever did as a very successful developer! _E_\nHeading to South Carolina now meeting with fantastic people! _E_\nObama says a WALL at our southern border won't enhance our security (wrong) and yet he now wants to build a much bigger wall (fence) at W.H. _E_\nRick Perry is a good guy who had a really tough evening. @RickPerry _E_\nBetween Iraq war monger @krauthammer dummy @KarlRove deadpan @GeorgezWill highly overrated @megynkelly among others @FoxNews not fair! _E_\nPeter Navarro: 'Trump the Bull vs. Clinton the Bear' #DrainTheSwamp __HTTP__ _E_\nThe man made climate change that our great president should be focused on is of the NUCLEAR variety brought upon us because of weakness! _E_\nI love watching dummy @mcuban promote on ok show named Shark Tank—but he is just a small part of that show. _E_\nThe Architect @KarlRove is directly responsible for losing both houses & @BarackObama becoming President. Ignore him. 
_E_\nRT @IvankaTrump: Thank you for the warm welcome. I'm excited to be in Hyderabad India for #GES2017. __HTTP__ _E_\nI will be on State of the Union @CNN with @jaketapper at 9am. Enjoy! _E_\nScary & Unsustainable: On Monday the US added more debt than from 1776 through Pearl Harbor __HTTP__ _E_\nTrump International Hotel & Tower Vancouver will include Vancouver's first pool bar nightclub & Trump Spa __HTTP__ _E_\nJoe Biden called America the Problem vis a vis Iran __HTTP__ He never wastes an opportunity to say something stupid.@JoeBiden _E_\nAlso great comeback by the New York Jets. That game was over until a really dumb defensive play by Tampa. Amazing. _E_\nBrainpower is the ultimate leverage. Keep your focus intact! _E_\nTo vote for me and CENTURY 21 for the best #Superbowl commercial click the following link and \"Like\" the page. __HTTP__ _E_\nOur thoughts are with the forces fighting ISIS in Iraq. We must never back down against this extreme radical Islami... __HTTP__ _E_\nTried watching low rated @Morning_Joe this morning unwatchable! @morningmika is off the wall a neurotic and not very bright mess! _E_\nScenes from last night's episode of @OCChoppers where @DonaldJTrumpJr and I visit the OCC HQ __HTTP__ _E_\nRosechem1 One of the reasons that I like you is because I feel that old American greatness in your mentality. It makes me feel hope! Thx. _E_\nImportant editorial by John Faso in @nydailynews: \"Spitzer's reckless leadership\" __HTTP__ _E_\nDopey @Lord_Sugar I'm worth $8 billion and you're worth peanuts...without my show nobody would even know who you are. _E_\n.@MissUSA Olivia Culpo has been a star a young Audrey Hepburn. _E_\nA true honor. @PressSec considers asking for @BarackObama's college transcripts a Donald Trump question. __HTTP__ Release it! _E_\nWhy do we always try to destroy our true champions and winners in this country while at the same time leaving the losers alone? STUPID! 
_E_\nThe damage that Democrats weak Repubicans and this disaster of a president have inflicted on America has put (cont) __HTTP__ _E_\nI will once again write a $1 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_\nThe #IranDeal is a catastrophe that must be stopped. Will lead to at least partial world destruction & make Iran a force like never before. _E_\nWe need to bring manufacturing jobs back home where they belong. #TimeToGetTough __HTTP__ _E_\nTHE ROLLOUT OF OBAMACARE IS A TOTAL DISASTER AND AN EMBARRASSMENT TO OUR COUNTRY. THE WORLD IS WATCHING AND LAUGHING.$635000000 WEBSITE! _E_\nConcentration is a fine antidote to anxiety. Jack Nicklaus _E_\nI will end common core. It's a disaster. __HTTP__ #Trump2016 __HTTP__ _E_\nThank you Pastor Robert Jeffress! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIt is amazing that after lambasting Donald Sterling on @foxandfriends some DISHONEST press only reported my GIRLFRIEND FROM HELL statement! _E_\nMy @SquawkCNBC interview discussing the close election the enthusiasm gap between Mitt & Obama & the fiscal cliff __HTTP__ _E_\nWhy won't @BarackObama repeal the Defense of Marriage Act if he supports gay marriage? __HTTP__ He is gaming the issue. _E_\nI never heard of @DannyZucker until his very dumb and endless tweets started pouring out of insecure mind but I have a great deal for him! _E_\nLikewise the primary victims of violent crimes are in the African American and Hispanic communities. These people want LAW AND ORDER now! _E_\nToday's job report is dismal. Now a record 88921000 Americans are no longer in the work force. _E_\nAgain immigration reform is fine—but don't rush to give away our country. That's what's happening! _E_\nJust finished reading a poorly written & very boring book on the General Motors Building by Vicky Ward. Waste of time! @WileyBiz _E_\nCEO's most optimistic since 2009. 
It will only get better as we continue to slash unnecessary regulations and when we begin our big tax cut! _E_\n.@BarackObama's dismal job record is reason alone that he must be defeated this November. _E_\nOn Anthony Wiener I TOLD YOU SO! _E_\nI am getting great credit for my press conference today. Crooked Hillary should be admonished for not having a press conference in 179 days. _E_\nI am always on the front page of the failing @nytimes but when I won the GOP nomination I'm in the back of the paper. Very dishonest! _E_\nVia @nypost's @PageSix: \"Trump researching 2016 run\" __HTTP__ _E_\nIs Chris Jackson as dumb as I hear but I still like that he follows me like a good little soldier! _E_\nHere we go! _E_\nLosers and haterseven you as low and dumb as you are can learn from watching Apprentice and checking out my tweets you can still succeed! _E_\nThe Christmas Story begins 2000 years ago with a mother a father their baby son and the most extraordinary gift of all—the gift of God's love for all of humanity.Whatever our beliefs we know that the birth of Jesus Christ and the story of his life... __HTTP__ _E_\nIf Scotland doesn't stop insane policy of obsolete bird killing wind turbines country will be destroyed. @AlexSalmond @AberdeenCC _E_\nLook what the President of NBC sent me recently about his stay in my Las Vegas hotel. Very loyal guy. __HTTP__ _E_\nI've just started blocking out some of the repetitive and boring (& dumb) haters and losers. They are a waste of time and energy! _E_\nObamaCare Horror Story: \"Navigators Tell Applicants To Lie Like Administration\" __HTTP__ @JamesOKeefeIII strikes again! _E_\n#VoteTrumpMI! #Trump2016 __HTTP__ _E_\nTom Brady has done a great job tonight amazing New England comeback. Good game not over yet! _E_\nAfter foolishly spending two trillion $'s and losing so many great young people the U.S. will be the only one who won't get the oil in Iraq _E_\nHere's a sneak peek at the @DNC convention theme: It's not our fault. 
Blame Bush. Oh and government built it. _E_\nAt least 24 players kneeling this weekend at NFL stadiums that are now having a very hard time filling up. The American public is fed up with the disrespect the NFL is paying to our Country our Flag and our National Anthem. Weak and out of control! _E_\nTo entrepreneurs: Watching you could be the motivation for your employees so make it an example that will best serve your success. _E_\nThe lightweight @JonHuntsman used my name in a debate for gravitas it didn't work. Sad! _E_\nEconomists on the TAX CUTS and JOBS ACT:\"The enactment of a comprehensive overhaul complete with a lower corporate tax rate will IGNITE our ECONOMY with levels of GROWTH not SEEN IN GENERATIONS...\" __HTTP__ _E_\nCongratulations to Paul Ryan Kevin McCarthy Kevin Brady Steve Scalise Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes! _E_\nThe Democrats lead by head clown Chuck Schumer know how bad ObamaCare is and what a mess they are in. Instead of working to fix it they.. _E_\nICYMI: PENCE: I RAN A STATE THAT WORKED KAINE RAN A STATE THAT FAILED. __HTTP__ _E_\nI wonder if President Obama would have attended the funeral of Justice Scalia if it were held in a Mosque? Very sad that he did not go! _E_\nCheck out a list of Donald Trump's books for summer reading at the Trump University Blog: __HTTP__ _E_\nCongratulations to @BillCassidy on a decisive win this past Saturday. Bill will be a pro growth & pro energy Senator. _E_\nMelania will be on QVC tomorrow night at 9 p.m. ET to introduce her beautiful and inspiring Melania Timepieces & Fashion Jewelry collection. _E_\nTwisted Sister frontman @deesnider shines in the record 13th season of 'All Star' @CelebApprentice. The Iron Man of Rock and Roll is great! _E_\nBackstage with @jimmyfallon before opening skit great fun! 
@fallontonight __HTTP__ _E_\nMy interview with WMUR's @JoshMcElveen at #NHFreedomSummit __HTTP__ _E_\nHonored to meet this years @SenateYouth delegates w/ @VP Pence in the East Room of the @WhiteHouse. Congratulations... __HTTP__ _E_\nBe tenacious. Being tenacious means you're tough and patient at once so it's a formidable combination. _E_\nTed Cruz should be disqualified from his fraudulent win in Iowa. Weak RNC and Republican leadership probably won't let this happen! Sad. _E_\nFLASHBACK October 9 2012: \"Donald Trump: Jobs Numbers Are 'A Lot Of Monkey Business'\" __HTTP__ Proven right again! _E_\nThank you to FEMA our great Military & all First Responders who are working so hardagainst terrible oddsin Puerto Rico. See you Tuesday! _E_\nJoin me LIVE with @VP @SecretaryPerry @SecretaryZinke and @EPAScottPruitt. #UnleashingAmericanEnergy __HTTP__ _E_\nThe Iran nuclear deal is a terrible one for the United States and the world. It does nothing but make Iran rich and will lead to catastrophe _E_\nBig day tomorrow in Georgia and South Carolina. ObamaCare is dead. Dems want to raise taxes big! They can only obstruct no ideas. Vote R _E_\nThe terrorists in Syria are calling themselves REBELS and getting away with it because our leaders are so completely stupid! _E_\nEvery poll done on debate last night from Drudge to Newsmax to Time Magazine had me winning in a landslide. #MakeAmericaGreatAgain! _E_\nFor all of those (DACA) that are concerned about your status during the 6 month period you have nothing to worry about No action! _E_\nI'm on @foxandfriends every Monday morning at 7:30... _E_\n.@Omarosa's new name via @DennisRodman: \"Ms. Saboteur\" sounds rather elegant. #CelebApprentice _E_\nJust arrived in Arizona! #ImWithYou __HTTP__ _E_\nThe entire village of Blackdog in Scotland protested to the Council last night about the ugly windmills. 
@AlexSalmond @pressjournal _E_\nMy interview in @politico with @pwgavin discussing being awarded the 2012 Statesman of the Year by Sarasota GOP __HTTP__ _E_\nThere's a reason @mcuban's partners can't stand him and on top of that the team sucks! _E_\nWhenever you see the words 'sources say' in the fake news media and they don't mention names.... _E_\n\"Donald Trump: $200 Million D.C. Hotel Will Be Among World's Best\" __HTTP__ via @WNEW _E_\nSo @BarackObama will attack @MittRomney's career at Bain Capital but won't return donations from Bain executives __HTTP__ _E_\nRT @realDonaldTrump: Consumer confidence soars to highest level since 2004 📈 __HTTP__ __HTTP__ _E_\nThank you! We are at 35% in new Reuters poll with #2 coming in at 12%. Time to #MakeAmericaGreatAgain!#Trump2016 __HTTP__ _E_\nI made a great deal of money in Atlantic City but left years ago when I saw so many political mistakes being made. I have ZERO involvement! _E_\nJust got back from Georgia. The crowds and love for U.S. was so amazing! We all had a great day together will be back soon! _E_\n\"The biggest doers often suffer the biggest setbacks in life. So if you want to aim high you have be able to handle the bumps.\"–Think Big _E_\nIf Republicans are going to pass great future legislation in the Senate they must immediately go to a 51 vote majority not senseless 60... _E_\nWhy isn't Mexico releasing our Marine. U.S. should come down really hard on them. They have ZERO respect for our so called leader _E_\nVia @Ammoland by Fredy Riehl: \"Donald Trump Talks: Gun Control Assault Weapons Gun Free Zones & Self Defense\" __HTTP__ _E_\nThank you! Vote in 2016! #MakeAmericaGreatAgain __HTTP__ _E_\nThe VA scandal shows the fatal ineptitude of big central planning government. When will we learn? _E_\n#FraudNewsCNN #FNN __HTTP__ _E_\n.@Morning_Joe: Marco only won the debate in the minds of desperate people. I won every on line poll even crazy @CNBC. Marco good looking? 
_E_\nHe @BarackObama wants to release 5 senior Taliban detainees back to the Taliban. __HTTP__ The Taliban out negotiates him! _E_\nThe Club For Growth said in their ad that 465 delegates (Cruz) plus 143 delegates (Kasich) is more than my 739 delegates. Try again! _E_\nOnce again we will have a government of by and for the people. Join the MOVEMENT today! __HTTP__ __HTTP__ _E_\nKeep it fast short and direct whatever it is. Donald J. Trump __HTTP__ _E_\nThe middle class has become the new poor in this country and our incompetent politicians are unable to do anything about it.They don't care! _E_\n\"Consider the fact that for every gallon of gas you put in your car you pay 45.8 cents in state local and federal taxes.\" #TimeToGetTough _E_\nThank you for such a beautiful welcome Hawaii. My great honor to visit @PacificCommand upon arrival. Heading to Pearl Harbor w/ @FLOTUS now. __HTTP__ _E_\nJust arrived at the Pensacola Bay Center. Join me LIVE on @FoxNews in 10 minutes! #MAGA __HTTP__ _E_\nHAPPY EASTER HAVE A GREAT DAY! _E_\n.@katyperry I watched Russell Brand and I think his mind is fried he looks really bad. Russell is a total joke a dummy who is lost! _E_\nWith the complete Ft. Lauderdale victory I will now sue for millions of $'s in attorney fees for which plaintiffs are liable. _E_\nRemember @dannyzuker you are not even the real boss of Modern Family no big $$$$$$'s for you! _E_\nCheck out a picture of the custom made Trump Bike that Paul Teutul Sr. presented to me today in Trump Tower __HTTP__ _E_\nSuch a wonderful statement from the great @LouDobbs. We take up what may be the most accomplished presidency in modern American history. _E_\nYoung entrepreneurs across the US are trying to make deals & build businesses daily. Stay positive think big & big things will happen _E_\nThe @nfl games are so boring now that actually I'm glad I didn't get the Bills. Boring games too many flags too soft! 
_E_\nPlease remember I am the ONLY candidate who is self funding his campaign. Kasich Rubio and Cruz are all bought and paid for by lobbyists! _E_\nYou are right the media is always offending Donald Trump they have no limits but they will do anything not to offend the Boston killer! _E_\nErnie Els and myself at Trump National Doral. __HTTP__ _E_\nWe must fix our education system for our kids to Make America Great Again. Wonderful day at Saint Andrew in Orlando. __HTTP__ _E_\nSpoke yesterday with the King of Saudi Arabia about peace in the Middle East. Interesting things are happening! _E_\nEntrepreneurs: Use your imagination. Use your intelligence to execute what your imagination presents to you. _E_\nWhy isn't the House Intelligence Committee looking into the Bill & Hillary deal that allowed big Uranium to go to Russia Russian speech.... _E_\n.@JRubinBlogger one of the dumber bloggers @washingtonpost only writes purposely inaccurate pieces on me. She is in love with Marco Rubio? _E_\nI am now in Texas doing a big fundraiser for the Republican Party and a @FoxNews Special on the BORDER and with victims of border crime! _E_\n#TBT With my friend @muhammadali __HTTP__ _E_\nJust won the lawsuit on leadership of Consumer Financial Protection Bureau CFPB. A big win for the Consumer! _E_\nCongratulations to @SixteenChicago @TrumpChicago for being honored with a @MichelinGuideChi two star rating again this year! _E_\nJohn Menard of Menards home improvement stores in Midwest treats employees horribly should they form a union? __HTTP__ _E_\nIt's a national embarrassment that an illegal immigrant can walk across the border and receive free health care and one of our Veterans..... _E_\n#ObamacareFail __HTTP__ _E_\nHonored to serve as Commander in Chief to the courageous men and women of our U.S. Armed Forces. A grateful nation thanks you! __HTTP__ _E_\nVideo: Trump Golf Links at Ferry Point @TrumpFerryPoint __HTTP__ _E_\nWeak JEB getting thrown out by management during speech. 
Do you think he will be this tough on Putin & others? __HTTP__ _E_\nRT @paulsperry_: Fusion GPS firm behind disputed Russia dossier retracts its claim of FBI mole in Trump camp __HTTP__ _E_\nJust got back from Iowa. Fantastic evening with truly fabulous people. Will be back again soon. Thanks! _E_\n.@katyperry will do much better __HTTP__ _E_\nEntrepreneurs: We win in our daily lives by being careful with every day every moment. _E_\nGo with your gut. Take chances. If you think you have the ingredients that you need take chances because your biggest successes... _E_\nVia @nypost: Trump's links getting green __HTTP__ _E_\nICYMI This week we hosted a #MadeInAmerica event right here at the @WhiteHouse! If it is MADE IN AMERICA it is the BEST! USA __HTTP__ _E_\nThe Answer to both Social Security and Medicare is a robust growing economy not cuts on the elderly. _E_\nLearning to expect problems saved me from a lot of wasted energy. Winners see problems as just another way to prove themselves. _E_\nThe premier landmark in midtown NYC Trump Tower features our signature amenities w/a magnificent waterfall __HTTP__ _E_\nWill be interviewed on @Morning_Joe at 7:20. Great crowd in Las Vegas yesterday! _E_\nNational Review @NRO may be going out of business because of the really pathetic job being done by @JonahNRO. No talent means death sad! _E_\nRT @TeamTrump: .@HillaryClinton is RAISING your taxes to a disastrous level. @realDonaldTrump is going to LOWER your taxes BIG LEAGUE! #D... _E_\n\"Obama's promises on the Iran deal are like him promising 'if you like your healthcare plan you can keep it'\" @marklevinshow _E_\nIf Mitt Romney were in the private sector & he suffered the horrendous loss of 2012 do you think he'd rehire himself for 2016?—I don't! _E_\nIf other countries benefit from our armed forces protecting them those countries should pay for the protection. #TimeToGetTough _E_\nRe CRIPPLED AMERICA I am signing books for the next two weeks. 
Order yours for holiday gifts! __HTTP__ _E_\n$1B down another $1B to go. ObamaCare website is 40% unfinished. This is beyond pathetic. _E_\nTrump locks down Delaware GOP delegates. #Trump2016 #MAGA __HTTP__ _E_\nMoney was never a big motivation for me except as a way to keep score. The real excitement is playing the game! _E_\nThis Sunday's @CelebApprentice will shock you! Big Development...Be sure to tune in on @NBC this Sunday at 9PM EST! _E_\nRepublicans Senators are working hard to get their failed ObamaCare replacement approved. I will be at my desk pen in hand! _E_\n#TimeToGetTough: Making America #1 Again my new book available today. The book both China and OPEC do NOT want you to read. _E_\nThink positively. There are always opportunities. Keep your focus and don't give up! _E_\nI still don't get how @KarlRove spent $400 million & lost all. _E_\nI had a fun time doing the #CallMeMaybe video featuring  the @MissUSA contestants @BravoAndy and @GiulianaRancic __HTTP__ _E_\nWhat is vital now is a swift restoration of law and order and the protection of innocent lives.#Charlottesville __HTTP__ _E_\nIn light of Newtown our country has to pull together. _E_\nHow can Crooked Hillary put her husband in charge of the economy when he was responsible for NAFTA the worst economic deal in U.S. history? _E_\n\"I can accept failure everyone fails at something. But I can't accept not trying.\" – Michael Jordan _E_\nHome of the iconic Ailsa a four time @The_Open course @TrumpTurnberry is a landmark on the Ayrshire coastline __HTTP__ _E_\n...the Uranium to Russia deal the 33000 plus deleted Emails the Comey fix and so much more. Instead they look at phony Trump/Russia.... _E_\nEurope and the U.S. must immediately stop taking in people from Syria. This will be the destruction of civilization as we know it! So sad! _E_\nWill be on @FallonTonight with @JimmyFallon on @NBC at 11:35pmE. Enjoy! 
#Trump2016 __HTTP__ _E_\n.@Rosie If America's Got Talent uses you the show will fail like all your others! _E_\nCongrats to Congress on their 112 'gold tier' healthcare plans __HTTP__ Why should they suffer like regular Americans? _E_\nLife is very fragile and success doesn't change that. If anything success makes it more fragile. Anything can (cont) __HTTP__ _E_\nIn 2016 the Old Post Office will be fully transformed into an iconic destination Trump Int'l Washington DC __HTTP__ _E_\nBernie Sanders supporters have every right to be apoplectic of the complete theft of the Dem primary by Crooked Hillary! _E_\nThank you Redding California!#MakeAmericaGreatAgain #CAPrimary __HTTP__ _E_\nThe last thing our country needs is another BUSH! Dumb as a rock! _E_\nLyin' Ted and Kasich are mathematically dead and totally desperate. Their donors & special interest groups are not happy with them. Sad! _E_\nCrooked Hillary Clinton who called BREXIT 100% wrong (along with Obama) is now spending Wall Street money on an ad on my correct call. _E_\nObama's wind turbines kill \"13 39 million birds and bats every year!\" __HTTP__ Save our bald eagles symbol of our nation! _E_\nEntrepreneurs: Have your own vision and stick with it. Don't be afraid to be unique. Don't tread water get out there and go for it. _E_\n\"Winning is habit. Unfortunately so is losing.\" Vince Lombardi _E_\nI can't believe @Denver_Broncos allowed final touchdown—dumbest defensive play I have ever seen in football. _E_\nThe Al Qaeda flag is now flying over Benghazi. @BarackObama spent over $3Billion of our money for this? _E_\nBeautiful rally in Albuquerque New Mexico this evening thank you. Get out & VOTE! #DrainTheSwampWatch rally:... __HTTP__ _E_\n\"You can attack defend counterattack sell or ignore. \" Roger Ailes to Pres. Reagan during prep for 2nd Mondale debate/ '84 election _E_\nThe CIA report should not be released. Puts our agents & military overseas in danger. A propaganda tool for our enemies. 
_E_\nEntrepreneurs: Vision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_\nI will be meeting General Kelly General Mattis and other military leaders at the White House to discuss North Korea. Thank you. _E_\nThen we attended the Scottish fashion show that benefits veterans Dressed to Kilt 2010 which I co hosted with Sir Sean and Lady Connery. _E_\nGod bless the people of Mexico City. We are with you and will be there for you. _E_\nHere we go A healthcare worker who treated Thomas Duncan the man who flew into the U.S. from West Africa infected with Ebola caught it! _E_\nLocated in beautiful Briarcliff NY @TrumpNationalNY features a 7291 yard course just 25 minutes outside NYC __HTTP__ _E_\nReality TV's #1 Bad Girl @OMAROSA is back on the upcoming 13th season of All Star @CelebApprentice. She is great as always. _E_\nThe Boston killer applying today for ObamaCare. He demands that medical bills be taken care of immediately. Does this include dental? _E_\nI am attracting the biggest crowds by far and the best poll numbers also by far. Much of the media is totally dishonest. So sad! _E_\nSnowden is a liar.and a fraud! _E_\nRT @TeamTrump: When @realDonaldTrump is POTUS families are going to be safe and secure. Law and order will be RESTORED! #MAGA #Debates #De... _E_\nThanks Geraldo you're a champion. __HTTP__ _E_\nSECURE THE BORDER! BUILD A WALL! _E_\nHope you liked it. Tune in tomorrow night at 8:00 and 9:00 for two episodes and two boardrooms! Will be a great evening of television! _E_\nThe LIVE FINALE of @ApprenticeNBC is this Sunday at 9/8C. Watch and see who will be the first ever All Star Celebrity Apprentice. _E_\nI'll be discussing a variety of topics tonight with Greta Van Susteren 10 p.m. on Fox News. It will be the first of a two part series. _E_\nIowa was amazing today. Great crowd great people. Thanks will be back soon! _E_\nThere is no comparison between @ApprenticeNBC and Shark Tank in the ratings. 
The Apprentice beats Shark Tank hands... __HTTP__ _E_\nNew report from DOJ & DHS shows that nearly 3 in 4 individuals convicted of terrorism related charges are foreign born. We have submitted to Congress a list of resources and reforms.... _E_\nThe top course on the west coast @TrumpGolfLA overlooks Pacific Ocean & offers a luxurious public golf experience __HTTP__ _E_\nPresident Obama do not attack Syria. There is no upside and tremendous downside. Save your powder for another (and more important) day! _E_\nWe will always ENFORCE our laws PROTECT our borders and SUPPORT our police! #LESMHarrisburg Pennsylvania #FlashbackFriday #MS13 __HTTP__ _E_\nIn presidential voting so far John Kasich is ZERO for 22. So why would he be a good candidate? Hillary would beat him I will beat Hillary! _E_\nAlways great to see the wonderful people of South Carolina. Thank you for the beautiful welcome at Greenville Spartanburg Int'l Airport! __HTTP__ _E_\nTens of millions of dollars in airstrikes had no impact because key leaders fled after hearing ON NEWS REPORTS the strikes were coming. DUMB _E_\nThis memo totally vindicates \"Trump\" in probe. But the Russian Witch Hunt goes on and on. Their was no Collusion and there was no Obstruction (the word now used because after one year of looking endlessly and finding NOTHING collusion is dead). This is an American disgrace! _E_\nI am truly honored and grateful for receiving SO much support from our American heroes... __HTTP__ __HTTP__ _E_\nSince Election Day on November 8 the Stock Market is up more than 25% unemployment is at a 17 year low & companies are coming back to U.S. _E_\nEntrepreneurs: Successful negotiation means knowing what the other side wants. You've got to know where they're coming from. Pay attention! _E_\nmuch worse just look at Syria (red line) Crimea Ukraine and the build up of Russian nukes. Not good! Was this the leaker of Fake News? _E_\nI'm doing The David Letterman Show tonight should be interesting! 
_E_\nWhat will happen to Omarosa tonight? One of our all time great episodes! _E_\nAnother terrorist attack in Paris. The people of France will not take much more of this. Will have a big effect on presidential election! _E_\nThe new Dark Knight Rises Trailer is great __HTTP__ The movie filmed scenes in Trump Tower last October. _E_\nHey Missouri let's defeat Crooked Hillary & @koster4missouri! Koster supports Obamacare & amnesty! Vote outsider Navy SEAL @EricGreitens! _E_\nVanity Fair Magazine which used to be one of my favorites is failing badly. Newsstand sales are plummeting (cont) __HTTP__ _E_\nMy @todayshow show interview with @IvankaTrump discussing the fierce competition in All Star @CelebApprentice __HTTP__ _E_\nUsing Alicia M in the debate as a paragon of virtue just shows that Crooked Hillary suffers from BAD JUDGEMENT! Hillary was set up by a con. _E_\nI want to express our support and extend our prayers to all those affected by the vile terror attack in Spain last month. __HTTP__ _E_\n\"When everyone works with the same energy loyalty and focus it makes for smooth sailing all around.\" – Midas Touch _E_\nFact – Amnesty lowers wages and invites more lawlessness. Obama has unilaterally cancelled any chance of immigration reform. _E_\nThe winner of Best in Show at the Westminster Kennel Club Show Miss P will be coming to my office this morning. _E_\nRemember that Marco Rubio is very weak on illegal immigration. South Carolina needs strength as illegals and Syrians pour in. Don't allow it _E_\nHave the right mindset for the job. See your work as an art form which means paying attention to every detail. _E_\nCould this be my newest apprentice? __HTTP__ ...Enter the contest .. . __HTTP__ _E_\nEnjoy the Super Bowl! _E_\nGovernment is shut down yet Obama is now harassing the privately owned @Redskins to change its name.He needs to focus on his job! _E_\nLooking forward to being in Council Bluffs Iowa later today. Despite weather rally is on will be fantastic! 
#MakeAmericaGreatAgain! _E_\nPeople are pouring into Washington in record numbers. Bikers for Trump are on their way. It will be a great Thursday Friday and Saturday! _E_\nOur country and our leaders are getting dumber all the time. Now they are about to release full documentation on torture. Will destroy CIA _E_\nBob Dole Warns of 'Cataclysmic' Losses With Ted Cruz and Says Donald Trump Would Do Better via New York Times: __HTTP__ _E_\nPresident Obama spoke for me and every American in his remarks in #Newtown Connecticut. _E_\nRT @AnnCoulter: Trump's speech today was Churchillian only better. You can tell by the spluttering hysteria on TV about @realDonaldTrump. _E_\nEntrepreneurs: Ask yourself is this a blip or is it a catastrophe? and your equilibrium will be kept in check if/when hard times hit. _E_\nIf you want to be successful two important considerations are passion and efficiency. Think Like a Champion _E_\nToday on Earth Day we celebrate our beautiful forests lakes and land. We stand committed to preserving the natural beauty of our nation. _E_\nHeading to Washington this morning. Much work to do. Focus on trade and military. #MAGA _E_\nI just filed a major ethics complaint against crooked New York State Attorney General Eric Schneiderman he should resign from office! _E_\nWe are winning and the press is refusing to report it. Don't let them fool you get out and vote! #DrainTheSwamp on November 8th! _E_\n.@alexsalmond @pressjournal RT @rdowns @realdonaldtrump Margaret Thatcher NEVER would have allowed those wind mill monstrosities. _E_\n.@williebosshog watched you on @foxandfriends. You were great and I appreciate the nice statements. I'm sending out for your new book now! _E_\nAnother @BarackObama investment triumph the $500Billion American funded Finnish plug in cars are all being recalled __HTTP__ _E_\nObama and all others have been so weak and so politically correct that terror groups are forming and getting stronger! Shame. 
_E_\n.@washingtonpost by @OConnellPostbiz:\"Donald Trump lands @chefjoseandres for Old Post Office flagship restaurant\" __HTTP__ _E_\nI am in Dubai with Damac. PLACE IS BOOMING AMAZING! Major news conference in two hours. Announcing luxury villas and major golf course. _E_\nIf you want to be successful in business you must take risks. Make sure each risk is calculated and can have a positive fallback. _E_\nCongrats @TrumpToronto for being ranked #1 on @TripAdvisor and a Travellers' Choice 2013 Winner! _E_\nBig week coming up! _E_\nThere usually is an easy solution to every problem. For instance a lot of our country's problems can be solved in next year's election. _E_\nOur greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_\n.@MattBevin: As someone well versed in job creation and the Private Sector if you lie on your resume You're Fired! _E_\nJudge Jeanine Slams GOP Establishment: __HTTP__ _E_\nMarch 5th is rapidly approaching and the Democrats are doing nothing about DACA. They Resist Blame Complain and Obstruct and do nothing. Start pushing Nancy Pelosi and the Dems to work out a DACA fix NOW! _E_\nBad break for @TigerWoods hits a great shot which hits the pin and kicks into the water gets a bogey on hole with another great shot Champ! _E_\nJust spoke with @NYGovCuomo and @NYCMayor de Blasio to let them know that the federal government... _E_\nIt begins Republican Party of Virginia controlled by the RNC is working hard to disallow independent unaffiliated and new voters. BAD! _E_\nRe Lance Armstrong—not only was it a big lie but a big lie that lasted too long! _E_\nThe primary plaintiff in the phony Trump University suit wants to abandon the case. Disgraceful! _E_\nDems failed in Kansas and are now failing in Georgia. Great job Karen Handel! It is now Hollywood vs. Georgia on June 20th. _E_\nRecord crowd and standing ovation at Simpson College in Iowa lots of fun wonderful audience! 
_E_\nA great night in Raleigh North Carolina! THANK YOU! #Trump2016 __HTTP__ _E_\nI'm a skeptical guy but I don't believe Petraeus used this to get out of the Benghazi hearings. _E_\nIt's Thursday. I wonder how much money @BarackObama drained from Medicare today to finance ObamaCare. _E_\n.@billmaher has not yet sent me the $5M he owes which I am giving to various charities. Come on Bill—you made a deal. _E_\nWow I'm at 2200000 followers but I'd love to get rid of the haters & losers—they're such a waste of time! _E_\nThird Gun Linked to 'Fast and Furious' Identified at Border Agent's Murder Scene. When will the White House come clean? _E_\nVanity Fair which looks like it is on its last legs is bending over backwards in apologizing for the minor hit they took at Crooked H. Anna Wintour who was all set to be Amb to Court of St James's & a big fundraiser for CH is beside herself in grief & begging for forgiveness! _E_\nNow China is publicly supporting the OWS protests __HTTP__ It's time for the protesters to go home. _E_\nEntrepreneurs: Brainpower is the ultimate leverage. Don't underestimate yourself or your possibilities. _E_\nTrump International Golf Club Turnberry Scotland has been home to four of the greatest Open Championships in history __HTTP__ _E_\nCongratulations to my Catholic friends on the selection of Pope Francis I to lead the Catholic Church. People that know him love him! _E_\nBrent Musburger did himself a great favor by saying what everyone was thinking he is much more popular now than before. _E_\nI am a handwriting analyst. Jack Lew's handwriting shows while strange that he is very secretive—not necessarily a bad thing. _E_\nVia @mrctv by Ben Graham: Border Reports Back Up Trump's 'Rapists' Claim __HTTP__ _E_\nRT @realDonaldTrump: As the phony Russian Witch Hunt continues two groups are laughing at this excuse for a lost election taking hold Dem... _E_\nA rare case where the U.S. 
should help __HTTP__ _E_\nI hate when the news media so afraid to offend anyone always refers to the BOSTON KILLER as the suspect . _E_\nEmpty pockets never held anyone back. Only empty heads and empty hearts can do that. Norman Vincent Peale _E_\n\"He who is not courageous enough to take risks will accomplish nothing in life.\" Muhammad Ali _E_\nMaking speech tonight in New Hampshire leaving now. Fantastic people fantastic crowd! _E_\nThe San Fran crash was totally the pilot's fault may be too late for drug testing RIDICULOUS! _E_\nThe Mayweather decision is a disgrace! _E_\nThank you Nevada! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWe need someone with experience to rebuild America. #MakeAmericaGreatAgain __HTTP__ _E_\nWhat did our very stupid & ineffective A.G. Eric Schneidean during his trips to MY office tell me about President Obama & Governor Cuomo? _E_\nThank you Wisconsin! Tuesday was a great success for #WorkforceWeek at @WCTC w/ @IvankaTrump & @GovWalker. Remarks... __HTTP__ _E_\nJosh8J4 @realDonaldTrump I have a dream that you will be president to make this country great again. #USA Thank you. _E_\nWhen will Mayor Vescio and manager Zegarelli repave Pine Road in @BriarcliffManor? It is a disgrace! _E_\nNobody knows for sure that the Republicans & Democrats will be able to reach a deal on DACA by February 8 but everyone will be trying....with a big additional focus put on Military Strength and Border Security. The Dems have just learned that a Shutdown is not the answer! _E_\nHe @MittRomney gets the China problem why don't the others? _E_\nPatrick Reed—We are proud to have you as our champion at Doral. Love the attitude & the play. See you in March at the Cadillac WGC. _E_\nI'll be at Liberty University Monday 10 AM for speech. Looking forward to meeting students all sold out! 
_E_\nRT @foxnation: .@SenTedCruz: I want to Get to a 'Yes' Vote: __HTTP__ _E_\n\"The best luck of all is the luck you make for yourself.\" – General Douglas MacArthur _E_\nMr. President it is time to lead on the Korean crisis. Make a statement from the Rose Garden and send a strong message to the man child! _E_\nToday is the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_\nWhy is Obama playing basketball today? That is why our country is in trouble! _E_\nConsumer confidence is at a 16 year high....and for good reason. Much more regulation busting to come. Working hard on tax cuts & reform! _E_\nThe Bernie Sanders supporters are furious with the choice of Tim Kaine who represents the opposite of what Bernie stands for. Philly fight? _E_\nInformation is being illegally given to the failing @nytimes & @washingtonpost by the intelligence community (NSA and FBI?).Just like Russia _E_\nA Rod must be dropped in the Yankees line up tonight if they want to win. He simply can't perform without drugs. _E_\nOctober has a 7% foreclosure increase last month. Is this @BarackObama's economic recovery? _E_\nWise words from my father: \"Know everything you can about what you're doing.\" Fred C. Trump _E_\nA @aahs5star Diamond & Green Star Diamond Award winner @TrumpGolfLA is the nation's top public course __HTTP__ _E_\n70 stores above Punta Pacifica's pristine peninsula @TrumpPanama offers fine dining five pools & luxury rooms __HTTP__ _E_\nI will be doing @foxandfriends at 7.00 (15 minutes). _E_\nAnother great cause Obama could send my $5M donation to is a charity for 9/11 First Responders. They are American heroes. _E_\nSuch bad reporting: A puff piece on Ben Carson in the @nytimes states that Carson is trying to solidify his lead. But I am #1 easily! Sad _E_\nOur next Vice President of the United States of America Gov. 
@Mike_Pence!#GOPinCLE #GOPConvention#AmericaFirst __HTTP__ _E_\nI look forward to paying my respects to our brave men and women on this Memorial Day at Arlington National Cemetery later this morning. _E_\nThey are saying that tickets to tonight's Saturday Night Live are the hardest to get in the history of this great show! Off to a good start! _E_\nI've gotten many letters from people fighting autism thanking me for stating how dangerous 38 vaccines on a (cont) __HTTP__ _E_\nWow Ted Cruz falsely suggested Marco Rubio mocked the Bible and was just forced to fire his Communications Director. More dirty tricks! _E_\nI enjoy meeting tourists in #TrumpTower. People travel from across the world to see the five level Atrium & waterfall. _E_\nI will be going to Indiana on Thursday to make a major announcement concerning Carrier A.C. staying in Indianapolis. Great deal for workers! _E_\nI have a feeling the emphasis by @johnrich and @marleematlin will be on the charities and the money raised. (cont) __HTTP__ _E_\nSnow and freezing weather all over mid section of Country. Global warming specialists better start thinking fast! _E_\nObama:\"I will destroy ISIS\" = Obama: \"If you like your healthcare plan you can keep your plan.\" _E_\nWatch me get inducted into the #WWEHOF tonight at 10PM on USA. I will be posting exclusive behind the... __HTTP__ _E_\nScotland is beautiful. I spent several years looking for the right place visiting over 200 sites and this is absolutely the right place! _E_\nGeorge Will was pushing for @JonHuntsman for the GOP nomination in December...said he was going to win. (cont) __HTTP__ _E_\nThank you. __HTTP__ _E_\nChina is closing a massive oil deal w/ Russia taking advantage of the Ukraine conflict __HTTP__ Smart unlike our leaders. _E_\nI havn't seen @tonyschwartz in many years he hardly knows me. Never liked his style. Super lib Crooked H supporter. Irrelevant dope! 
_E_\nIt was my great honor to welcome Prime Minister Alexis Tsipras of Greece to the WH today! __HTTP__ 📸 __HTTP__ __HTTP__ _E_\nCrooked Hillary Clinton perhaps the most dishonest person to have ever run for the presidency is also one of the all time great enablers! _E_\nJoin me LIVE in South Korea🇰 #NationalAssembly #POTUSinAsia __HTTP__ __HTTP__ _E_\nMy @ WCNC News interview w/ @DianneG touring the magnificent Trump National Charlotte course & facilities __HTTP__ _E_\nThank you Arizona. Beautiful turnout of 15000 in Phoenix tonight! Full coverage of rally via my Facebook at: __HTTP__ __HTTP__ _E_\n#TheArsenioHallShow Well it had to happen. People that are disloyal in the long run never make it. Arsenio was just cancelled! _E_\n... at St. Jude Children's Research Hospital __HTTP__ I am proud of you Eric. _E_\nHillary Clinton has bad judgment and is unfit to serve as President. __HTTP__ _E_\nThe Middle East is blowing up we didn't back Egypt and now they riot against us. Iran is using Iraqi airspace (cont) __HTTP__ _E_\nFor those asking the Republicans only have 51 votes in the Senate and they need 60. That is why we need to win more Republicans in 2018 Election! We can then be even tougher on Crime (and Border) and even better to our Military & Veterans! _E_\nJoin me live as we recognize the first responders to the June 14th shooting involving @SteveScalise. #TeamScalise __HTTP__ __HTTP__ _E_\n.@DannyZuker Don't lie @ApprenticeNBC was #1 in all major demos at 10. Do not lie! _E_\nRT @Team_Trump45: @realDonaldTrump __HTTP__ _E_\nSadly the overwhelming amount of violent crime in our major cities is committed by blacks and hispanics a tough subject must be discussed. _E_\n#DrainTheSwamp! __HTTP__ _E_\nWith Hillary and Obama the terrorist attacks will only get worse. Politically correct fools won't even call it what it is RADICAL ISLAM! _E_\nThank you so much for the wonderful article Robert Davi. __HTTP__ _E_\nSen. 
McCain should not be talking about the success or failure of a mission to the media. Only emboldens the enemy! He's been losing so.... _E_\nI just released my financial disclosure forms the largest numbers in the history of the F.E.C. Even the dishonest media thinks great! _E_\nSo professional of @ABC news to throw out the failing @UnionLeader newspaper from their debate. Paper won't survive highly unethical! _E_\nA great afternoon. Thank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nToday's job report is not a good sign & we could be facing another recession. No real job growth. We need over 300K new jobs a month. _E_\nWe should have gotten more of the oil in Syria and we should have gotten more of the oil in Iraq. Dumb leaders. _E_\nThank you to all for your wonderful comments on my speech. I could feel the electricity in thr air. Great reviews most votes ever recieved _E_\nWhat do you think @amandatmiller is writing? #CelebApprentice _E_\nOur country is being torn apart from the inside it's getting nasty out there. _E_\nIt was a pleasure to have President Ashraf Ghani of Afghanistan with us this morning! #USAatUNGA #UNGA __HTTP__ _E_\nRT @NYPDnews: Many supported NYers when Sandy hit. Now our NY Task Force 1 can be there to help others during Harvey & #HurricaneIrma. Here... _E_\nBased on the ovation last night from the Letterman @Late_Show audience I believe it will be hard for Obama to throw $5M down the drain.... _E_\nWhy would the USChamber be upset by the fact that I want to negotiate better and stronger trade deals or that I want penalties for cheaters? _E_\nI am on @greta now! _E_\nIn one hour I will be making a major announcement from Trump Tower. Watch it live on Periscope! __HTTP__ _E_\nDo you agree with the client's decision? #CelebApprentice _E_\nWow Bernie Sanders just admitted that the real unemployment rate is 10% (it is actually over 20%) and for African American youth 51%. 
_E_\nDopey Mort Zuckerman owner of the worthless @NYDailyNews has a major inferiority complex. Paper will close soon! _E_\n92 stories above North Michigan Avenue @TrumpChicago's 5 Star @Forbes rated rooms have the best views of Chicago __HTTP__ _E_\nDon't believe Chrysler (if Obama wins) see how fast @Jeep production will be moved to China and I'll be watching! _E_\nVia @wsoctv by @BlairWSOC9: EXCLUSIVE: Donald Trump talks possible presidential run __HTTP__ _E_\nI AM PLEASED TO INFORM YOU THAT CELEBRITY APPRENTICE HAS BEEN RENEWED FOR ANOTHER SEASON BY NBC. SEE YOU AT THE NBC UPFRONTS TOMORROW. _E_\nIt was Rosie O'Donnell who ate the cake in the vicious Hillary commercial about me not Crooked Hillary! @marthamaccallum _E_\nToday I will meet with Canadian PM Trudeau and a group of leading business women to discuss women in the workforce. __HTTP__ _E_\nHillary Clinton is not qualified to be president because her judgement has been proven to be so bad! Would be four more years of stupidity! _E_\nThe middle class has worked so hard are not getting the kind of jobs that they have long dreamed of and no effective raise in years. BAD _E_\nJust announced that because of Trump advertising rates for debate on @CNN are going from $5000 to $200000 a 4000% increase.PAY CHARITY? _E_\nState Senator Shirley Huntley ratted on black politicians & was believed when she ratted on @AGSchneiderman nobody listened. Racism! _E_\nResponse to Huffington Post __HTTP__ _E_\nCrowd gathers to hear Trump speech in Las Vegas __HTTP__ _E_\nThank you Des Moines Iowa! Governor @Mike_Pence and I appreciate your support! #MAGA #TrumpTrain __HTTP__ _E_\nFEMA and first responders are working hard (yet again) on Hurricane Nate. Military helping. Very much under control! _E_\nExciting news—After massive construction the Blue Monster at Trump National Doral is open for business today. __HTTP__ _E_\nIran and the United States just pushed deadline back SEVEN MONTHS on working out a nuclear deal. 
Iran is tapping along our bad negotiators! _E_\nThank you South Carolina! #Trump2016 __HTTP__ _E_\n.@VP Mike Pence is working hard on HealthCare and getting our wonderful Republican Senators to do what is right for the people. _E_\nEntrepreneurs: Apply your skills and talent but above all be tenacious. See yourself as victorious which means never giving up. _E_\nHagel has been endorsed by China __HTTP__ & Iran __HTTP__ for SOD. Welcome to Obama's second term! _E_\nJohnny Miller correctly very critical of greens at Pinehurst. Said they should be redone _E_\n.@Peggynoonannyc Interesting article but I will beat Hillary easily. People that have given up on the system will come out to vote for me! _E_\nThe GOP primary is getting very nasty. The candidates need to remember that @BarackObama is the main target. He must not be reelected. _E_\nVia @DMRegister by @BylineAndyDavis: \"Donald Trump speaks to veterans residents in Coralville\" __HTTP__ _E_\nLittle @MacMiller sent me an expensive plaque for making his song \"Donald Trump\" such a big hit. Mac you still... __HTTP__ _E_\nThe debates are going to have a big impact on the election. @MittRomney has proved in Florida he delivers under pressure. _E_\nEven Crazy Jim Acosta of Fake News CNN agrees: \"Trump World and WH sources dancing in end zone: Trump wins again...Schumer and Dems caved...gambled and lost.\" Thank you for your honesty Jim! _E_\nThe @erictrumpfdn Golf Invitational featuring a performance by @BretMichaels was a great event. Enjoy the video.... __HTTP__ _E_\n#TrumpVine Opinion on Egypt __HTTP__ _E_\nOnly a Reagan or a Trump like figure in the White House will achieve this goal. __HTTP__ _E_\nThis election is a choice between law order & safety or chaos crime & violence. I will make America safe again for everyone. #ImWithYou _E_\n.@Ed_Klein's book 'The Amateur' is out in paper back. Lots of insights. _E_\nIf you have passion confidence resilience & vision you could become an entrepreneur. 
Add focus to the list & you're off to a good start _E_\nRussia is sending a fleet of ships to the Mediterranean. Obama's war in Syria has the potential to widen into a worldwide conflict. _E_\nWe will never forget the 241 American service members killed by Hizballah in Beirut. They died in service to our nation. __HTTP__ _E_\nChina is taking the oil from Iraq after we spent 1.5 trillion dollars and thousands of lives for their freedom . Our leaders are so stupid! _E_\nI will be on Fox and Friends at.7.00 A.M. Enjoy! _E_\nLook forward to being in Tampa this afternoon. Wonderful crowds. Thank you Florida! _E_\nNYC's top cop acted wisely and legally to monitor activities of some in the Muslim community. Vigilance keeps us (cont) __HTTP__ _E_\nDo you believe what is going on in Washington with respect to Syria these people don't have a clue! _E_\nJust left Sioux Center Iowa. My speech was very well received. Truly great people! Packed house overflow! _E_\nIf the great Si Newhouse were still running @CondeNastCorp he would fire Graydon Carter immediately circulation tanking. _E_\nRT @Scavino45: Hurricane force winds hit Florida Keys. 390 shelters have been opened in Florida. Shelters near you __HTTP__ _E_\nBe sure to keep following announcements on the development of Trump International Golf Club Dubai. Will be spectacular. _E_\nIt is impossible for the FBI not to recommend criminal charges against Hillary Clinton. What she did was wrong! What Bill did was stupid! _E_\nGreat @Esquiremag piece '@DonaldJTrumpJr: What I've Learned' __HTTP__ _E_\nHighly respected Constitutional law professor Mary Brigid McManamon has just stated Ted Cruz is not eligible to be President. Big problem _E_\nWhen will @BarackObama release his college and law school transcripts? __HTTP__ _E_\nThe last thing we need in Alabama and the U.S. 
Senate is a Schumer/Pelosi puppet who is WEAK on Crime WEAK on the Border Bad for our Military and our great Vets Bad for our 2nd Amendment AND WANTS TO RAISES TAXES TO THE SKY. Jones would be a disaster! _E_\nCall @MELANIATRUMP today on @QVC at 5 PM EST say hello and buy buy buy! _E_\n...can't change history but you can learn from it. Robert E Lee Stonewall Jackson who's next Washington Jefferson? So foolish! Also... _E_\nAmazing comeback by The Heat your friends at your favorite golf club Trump National Doral are proud of you. NOW for game 7! _E_\nDeparting New York with General James 'Mad Dog' Mattis for tonight's rally in Fayetteville North Carolina! See you... __HTTP__ _E_\nDon't ever forget we will together MAKE AMERICA GREAT AGAIN! _E_\n....Also there is NO COLLUSION! _E_\nIf the disgusting and corrupt media covered me honestly and didn't put false meaning into the words I say I would be beating Hillary by 20% _E_\nEbola patient will be brought to the U.S. in a few days now I know for sure that our leaders are incompetent. KEEP THEM OUT OF HERE! _E_\nThe winner of Best In Show of the 139th @WKCDOGS Miss P visited @TrumpTowerNY today __HTTP__ _E_\nWatch @PaulRyanVP explain how 'It's irrefutable' that President Obama is damaging Medicare' __HTTP__ _E_\nWhat the hell is going on with GLOBAL WARMING. The planet is freezing the ice is building and the G.W. scientists are stuck a total con job _E_\nLooking forward to Friday night in the Great State of Alabama. I am supporting Big Luther Strange because he was so loyal & helpful to me! _E_\nToday I was thrilled to announce a commitment of $25 BILLION & 20K AMERICAN JOBS over the next 4 years. THANK YOU... __HTTP__ _E_\nObama wanted Putin to reset. Instead Putin laughed at him and reloaded. _E_\nMexico doesn't respect our border hourly __HTTP__ Release USMC Tahmooressi NOW! Time for a boycott? 
#SaveOurMarine _E_\nI don't know @SamuelLJackson to best of my knowledge haven't played golf w/him & think he does too many TV commercials—boring. Not a fan. _E_\nThank you @SeanHannity & @BoDiet! #MakeAmericaGreatAgain _E_\nJune 16th __HTTP__ _E_\nThe election is absolutely being rigged by the dishonest and distorted media pushing Crooked Hillary but also at many polling places SAD _E_\nThis Tweet from @realDonaldTrump has been withheld in response to a report from the copyright holder. _E_\nAmerica's Olympic uniforms are manufactured in China. Burn the uniforms!#U.S.OlympicCommittee _E_\nSome low life journalist claims that I made a pass at her 29 years ago. Never happened! Like the @nytimes story which has become a joke! _E_\n\"@IvankaTrump: 'Trump Estates Dubai unlike anything else in the region'\" __HTTP__ via @aawsat_eng by Musaid Al Zayani _E_\nThe wimps that run Penn State should be forced to resign (and be sued) for the pathetic settlement they made and destruction of great legacy _E_\nWorst ever issue of @VanityFair magazine—bad food Graydon Carter should be fired! _E_\n1. Each week you the audience can choose an MVP among the celebrities @CelebApprentice using Twitter...... _E_\nDonald Trump Announcement: $5 Million for Obama College Records __HTTP__ via @Newsmax_Media _E_\n.@bobbyjindal watched you on @TeamCavuto. Made some excellent points. Best Wishes. _E_\nVia @HorsetalkNZ: \"NY's Central Park Horse Show a huge success\" __HTTP__ _E_\nIt is time to #DrainTheSwamp in Washington D.C! Vote Nov. 8th to take down the #RIGGED system! __HTTP__ _E_\nI loved watching Clint Eastwood last night he was terrific! _E_\nObama's attack on the internet is another top down power grab. Net neutrality is the Fairness Doctrine. Will target conservative media. _E_\nAnother great poll result! Thank you! __HTTP__ _E_\nObama is the most profligate deficit & debt spender in our nation's history. 
Doubled debt (cont) __HTTP__ _E_\nNow a small country like Sudan tells Obama he can't send any more Marines __HTTP__ We are a laughing stock. _E_\n15K in OK! Had to turn away 5k but we are coming back soon to take care of them! So much love in the crowd! Thanks! __HTTP__ _E_\nIf only the illegals were Tea Party members then Obama would get them out of the country immediately. _E_\nVia @Newsmax_Media by @melaniebatley: Donald Trump: France's Strict Gun Laws Enabled Attack __HTTP__ _E_\nLook here's the deal: @BarackObama has been a total disaster. He has spent this country into the ground and (cont) __HTTP__ _E_\n.@MarissaMayer is right to expect Yahoo employees to come to the workplace vs. working at home. She is doing a great job! _E_\nLeaving for Liberty University. I'll be speaking today in front of a record crowd. #Trump2016 _E_\nBy continuing to give massive subsidies to Scotland's ugly wind turbines @David_Cameron is playing right into @AlexSalmond's hands. _E_\nChina has 5 oil projects in Iraq and we didn't get anything from the Iraqis except asked to leave. Iraq is going (cont) __HTTP__ _E_\nCrooked Hillary Clinton is unfit to serve as President of the U.S. Her temperament is weak and her opponents are strong. BAD JUDGEMENT! _E_\nThis week we came one step closer to reaching the goal of aligning the skills taught in our nation's classrooms with the jobs of the future. __HTTP__ _E_\nAmerica needs @MittRomney and @PaulRyanVP and we need them right now. @GovChristie _E_\n...these days...we could all use a little of the power of Trumpative thinking. –BarnesandNoble.com __HTTP__ _E_\nHow come nobody mentions that the Nielsen Ratings of the Apprentice after 12 seasons as shown by Howard Stern totally blow away... _E_\nThe phony Club For Growth which asked me in writing for $1000000 (I said no) is now wanting to do negative ads on me. Total hypocrites! 
_E_\nNew York City's iconic architectural masterpiece @TrumpTowerNY houses prime commercial residential & retail space __HTTP__ _E_\n.@IvankaTrump's @FoxNewsSunday \"Power Player of the Week\" interview with Chris Wallace __HTTP__ _E_\nSenator Mitch McConnell said I had excessive expectations but I don't think so. After 7 years of hearing Repeal & Replace why not done? _E_\nVia @BreitbartNews by @mboyle1: \"EXCLUSIVE — DONALD TRUMP TO SPEAK AT CPAC\" __HTTP__ @CPACnews _E_\nWatching other networks and local news. Really good night! Crazy @megynkelly is unwatchable. _E_\nThis is more than a campaign it is a movement. #MakeAmericaGreatAgainSIGN UP TODAY & WE WILL WIN! __HTTP__ _E_\nJoin me in Pueblo Colorado on Monday afternoon at 3pm! #TrumpRally __HTTP__ _E_\nFACT – the reason why Americans have to worry about a government shutdown is because Obama refuses to pass a budget. _E_\nBe sure to stop by Trump Tower today I'll be signing copies of my new book Time To Get Tough from 11 am to 2 pm. _E_\nMy warmest condolences to the families of the horrible Roseburg Oregon shootings. _E_\nTrump Int'l Golf Links Scotland awarded 5 star status by Scottish Tourism chiefs. Via MailOnline __HTTP__ _E_\nI know our complex tax laws better than anyone who has ever run for president and am the only one who can fix them. #failing@nytimes _E_\nThe POLICE in Paris did a fantastic job. Very brave not easy! _E_\nHave confidence work hard and keep your focus on the small things that matter while keeping the big picture in mind. _E_\nMany of the released Guantanamo detainees are now fighting for ISIS and other enemy groups.We need proper leadership before it is too late! _E_\nAccording to @pewresearch 2/3 of Mexican LEGAL immigrants do not pursue citizenship because of 'no interest' __HTTP__ _E_\nUS interest payments on the debt have already passed $375B this year __HTTP__ China is laughing at us as usual. _E_\nNow A Rod doesn't even show up to his single A rehab games. 
Maybe the @Yankees will get lucky and @MLB will suspend A Rod. _E_\n1/5 households is on food stamps __HTTP__ We must do better. Americans need to have a work ethic. _E_\nCrooked Hillary's brainpower is highly overrated.Probably why her decision making is so bad or as stated by Bernie S she has BAD JUDGEMENT _E_\nVia @LatinoVoices by @CaritoJuliette: \"Meet The Latina 2014 @MissUniverse Candidates\" __HTTP__ _E_\nPresident Obama wants @MittRomney to hand over even more past tax returns he should when @BarackObama reveals his college applications. _E_\nMy @CNN interview with @piersmorgan explaining why Mitt should not apologize __HTTP__ _E_\nAll the haters & losers must admit that unlike others I never attacked dopey Jon Stewart for his phony last name. Would never do that! _E_\nDisaster! The @BarackObama tax hikes set for 2013 are going to throw us back into a recession according to the CBO __HTTP__ _E_\nLooking forward to speaking at prestigious @TheEconomicClub on December 15th __HTTP__ _E_\nThank you Illinois! #SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIt's almost like the United States has no President we are a rudderless ship heading for a major disaster. Good luck everyone! _E_\nI can't believe Republican leadership allowed such a stupid deal to be made. They are rapidly giving up all of their cards. _E_\n.@foxandfriends will be showing much of our successful trip to Asia and the friendships & benefits that will endure for years to come! _E_\nDemocrat Jon Ossoff who wants to raise your taxes to the highest level and is weak on crime and security doesn't even live in district. _E_\nMust read editorial via @IBDeditorials: ObamaCare's Bitter Irony: It May Increase Number Of Uninsured __HTTP__ _E_\nSexual assault and rape in the Armed Forces is a Massive problem that nobody wants to talk about or do anything about the big dark secret! _E_\nHerman Cain handled the pressure of the debate really well. 
@THEHermancain _E_\nRosie is back on the View which tells you how desperate they must be. It is the standard short term fix and long term disaster. _E_\nDiligence is the mother of good luck. Benjamin Franklin _E_\nI delivered a speech in Charlotte North Carolina yesterday. I appreciate all of the feedback & support. Lets #MAGA... __HTTP__ _E_\nRev. @BillyGraham is a great man and so is his son Franklin Graham. _E_\n.@ByronYork Great numbers from @CBSNews Poll. Also from ABC Washington Post Poll. Thank you! @CNN _E_\nThe Failing @nytimes has totally gone against the Social Media Guidelines that they installed to preserve some credibility after many of their biased reporters went Rogue! @foxandfriends _E_\nThe Daily Snooze publishes lies about me. They should be ashamed but it will die very soon. _E_\nThink positively. Zap negativity immediately. Focus on the solution not the problem. Be persistent and alert every single day. Momentum! _E_\nDon't think my statement on @ariannahuff was harsh if you knew her and the phony Huffington Post you would understand more to follow. _E_\nIt was an honor to welcome the Prime Minister of Vietnam Nguyễn Xuân Phúc to the @WhiteHouse this afternoon. __HTTP__ _E_\nThe @MissUniverse contestants review their amazing stay at @TrumpDoral __HTTP__ _E_\n.@ErinBurnett should have stayed at CNBC—she was never smart but people liked her. @OutFrontCNN Jeff Zucker's got problems! _E_\nI have been hitting Obama and Crooked Hillary hard on not using the term Radical Islamic Terror. Hillary just broke said she would now use! _E_\nWas President Obama in charge of this years Academy Awards they remind me of the ObamaCare website! #Oscars. _E_\n.@MattGinellaGC Have you ever seen Trump National/Bedminster or Trump International Golf links in Scotland. Both far better than Pinehurst! _E_\nDemand by China continues to raise the price of oil __HTTP__ We must become energy independent through our vast resources. 
_E_\n.@danawhite Great job last night very exciting! You have come a long way from those difficult early days I am proud of you. _E_\nEvery economic climate whether an uptick or downturn presents new opportunities and challenges. _E_\nJoin me! 6/10: Richmond VA 8pm6/11: Tampa FL 11am6/11: Pittsburgh PA 3pm6/13: Portsmouth NH 2:30pm __HTTP__ _E_\nCongratulations to @David_Bossie & his team @Citizens_United on their important court win for the First Amendment! __HTTP__ _E_\n.@BretBaier Why do you have George Will on your show he's exhausted boring and not even a little relevant! Waste of good air time! _E_\nBernie Sanders is being treated very badly by the Dems. The system is rigged against him. He should run as an independent! Run Bernie run. _E_\n.@nypost: \"Dozens of key staffers fleeing @AGSchneiderman's office\" __HTTP__ _E_\nThere's nothing wrong with bringing your talents to the surface. Having an ego and acknowledging it is a healthy choice. _E_\nI don't think Ted Cruz can even run for President until he can assure Republican voters that being born in Canada is not a problem. Doubt! _E_\nLooking forward to meeting with Prime Minister @Netanyahu shortly. Peace in the Middle East would be a truly great legacy for ALL people! _E_\nCongratulations to THE MOVEMENT we have just won THE GREAT STATE OF OREGON. The vote percentage is even higher than anticipated! Thank you. _E_\nIn last night's #CNNDebate @MittRomney proved once again why he is the steady conservative who can restore America's future. _E_\n\"Arrests of MS 13 Members Associates Up 83% Under Trump\" __HTTP__ _E_\nIn making big money knowledge is far more important than any other ingredient including money itself! _E_\nWonderful @pastormarkburns was attacked viciously and unfairly on @MSNBC by crazy @morningmika on low ratings @Morning_Joe. Apologize! _E_\nMayor Bill Vescio of Briarcliff Manor Westchester is doing a terrible job. Horrible roads high taxes housing down. 
@westchestergov _E_\nIf you can't handle the hard times that come with business then you will never be able to celebrate the successes. Focus & Stay Positive. _E_\nWacky @NYTimesDowd who hardly knows me makes up things that I never said for her boring interviews and column. A neurotic dope! _E_\nBen Carson has never created a job in his life (well maybe a nurse). I have created tens of thousands of jobs it's what I do. _E_\nChicago don't forget tix for @EricTrumpFdn Wine Tasting Fundraiser @TrumpChicago 11/22. Proceeds benefit @StJude __HTTP__ _E_\nWow television ratings just out: 31 million people watched the Inauguration 11 million more than the very good ratings from 4 years ago! _E_\nThe American people have waited long enough. There has been enough talk and no action for seven years. Now is the time for action! __HTTP__ _E_\n\"Take the time to move yourself forward. In other words think work and be lucky.\" – Think Like a Champion _E_\nVia @shinysheet: Mar a Lago to host top equestrian jumpers: Trump Invitational will benefit 90 area charities. __HTTP__ _E_\nCongrats to @AlCardenasACU and @CPACnews. I really enjoyed being there—the response was so terrific! _E_\nRT @Scavino45: #USNSComfort en route to #PuertoRico from Norfolk Virginia to support Hurricane Maria relief efforts. __HTTP__ _E_\nThank you Florida can't wait to see you Friday in Miami! Join me: __HTTP__ __HTTP__ _E_\nPresident Obama created a VERY BAD precedent by handing over five Taliban prisoners in exchange for Sgt. Bowe Bergdahl. Another U.S. loss! _E_\nEntrepreneurs: Be curious. Discovery breeds discovery just as success breeds success. Don't sell yourself short. 
_E_\nMy interview with @EWErickson of @RedState discussing #TimeToGetTough GOP primary and my 2012 options __HTTP__ _E_\nVia @BreitbartNews: \"DONALD TRUMP AT SUMMIT: OBAMACARE A 'FILTHY LIE' CAN BUILD 'A BEAUTY' OF A BORDER FENCE\" __HTTP__ _E_\nConstitutional law expert #Laurence Tribe of Harvard says wrong to say it (natural born citizen) is a settled matter it isn't settled). _E_\nThank you Governor @TerryBranstad! #AmericaFirst #Debates2016 __HTTP__ _E_\n\"No one remembers who came in second.\" – Walter Hagen _E_\nWith so many scandals plaguing Obama it seems that they all hit him at the right time. Could help him get away w/ all of them. _E_\nHe @BarackObama is using the IRS to sabotage the Tea Party __HTTP__ What about the Occupy Wall Street groups? _E_\nCongratulations to Roy Moore on his Republican Primary win in Alabama. Luther Strange started way back & ran a good race. Roy WIN in Nov! _E_\nPlayed the Trump International Golf Club in Palm Beach last weekend. One of the best golf courses in the country. Perfect weather. _E_\nCongratulations to the Rolling Stones on marking their 50th anniversary in London. _E_\nThere is no instance of a nation benefitting from prolonged warfare. Sun Tzu _E_\nI will also be going to a wonderful state Missouri that I won by a lot in '16. Dem C.M. is opposed to big tax cuts. Republican will win S! _E_\nSuch great support in New Hampshire. So many people are working so hard to #MakeAmericaGreatAgain! _E_\nVia @BreitbartNews by @LarryOConnor: TRUMP: NY MAG AILES STORY 'TOTAL BULLS**T' __HTTP__ It was total bullshit! _E_\nThe @USCHAMBER must fight harder for the American worker. China and many others are taking advantage of U.S. with our terrible trade pacts _E_\nJust watched Hillary deliver a prepackaged speech on terror. She's been in office fighting terror for 20 years and look where we are! 
_E_\nFLASHBACK – \"Donald Trump Blasts Obama for Failing to Secure Christian Pastor's Freedom in Iran __HTTP__ via @theblaze' _E_\n\"You're never a loser until you quit trying.\" Mike Ditka _E_\nDemocrats are trying to bail out insurance companies from disastrous #ObamaCare and Puerto Rico with your tax dollars. Sad! _E_\n#CrookedHillary __HTTP__ _E_\n\"The President has accomplished some absolutely historic things during this past year.\" Thank you Charlie Kirk of Turning Points USA. Sadly the Fake Mainstream Media will NEVER talk about our accomplishments in their end of year reviews. We are compiling a long & beautiful list. _E_\nSnowden is sitting in China and taunting the U.S. He is mocking us as a Country. Great time to place a tax on China trade if not turned over _E_\nWhat's more important? Rebuilding our military or bailing out insurance companies? Ask the Democrats. _E_\nThank you Geneva Ohio. If I am elected President I am going to keep RADICAL ISLAMIC TERRORISTS OUT of our countr... __HTTP__ _E_\nI will do far more for women than Hillary and I will keep our country safe something which she will not be able to do no strength/stamina! _E_\nI put @DonnyDeutsch on Apprentice at his request I did his failed cable show as a favor to him then he knocks me for my Obama announcement. _E_\nHow can George Osborne reduce UK debt while spending billions to subsidize Scotland's garbage wind turbines that are destroying the country? _E_\n'Remarks by President Trump at Signing of H.J. Resolution 41' __HTTP__ __HTTP__ _E_\nWe pause today to remember the 2403 American heroes who selflessly gave their lives at Pearl Harbor 75 years ago... __HTTP__ _E_\nOur wonderful new Healthcare Bill is now out for review and negotiation. ObamaCare is a complete and total disaster is imploding fast! _E_\nThanks to @TheRealMarilu a great woman for her wonderful defense of the Miss USA pageant. 
_E_\nWe need a dealmaker in the White House who knows how to think innovatively and make smart deals. #TimeToGetTough. _E_\nWow honored to just pass 2.5M followers on @twitter. Thanks to all my followers. We are going to have a great year together. _E_\nCrooked Hillary Clinton said she is used to dealing with men who get off the reservation. Actually she has done poorly with such men! _E_\nCongratulations to Michelle and Barack Obama on their 20th anniversary. _E_\nI will be doing @hannityshow tonight on Fox at 9 o'clock. Will be interesting and tough! _E_\nWe call for the full restoration of democracy and political freedoms in Venezuela and we want it to happen very very soon! __HTTP__ _E_\nLet Pete into the Hall of Fame __HTTP__ @PeteRose_14 _E_\nRon Fournier: Clinton Used Secret Server To Protect #CircleOfEnrichment\" __HTTP__ _E_\nBig day for healthcare. Working hard! _E_\nGreat job by @EricTrump on interview with @BillHemmer on @FoxNews. #ImWithYou #TrumpTrain _E_\nEverybody is talking about the protesters burning the American flags and proudly waving Mexican flags. I want America First so do voters! _E_\nTrump Nat'l Golf Club Philadelphia is a 360 acre beauty and an award winning Tom Fazio designed course fantastic! __HTTP__ _E_\nIn order to preserve my options and guarantee that @BarackObama is defeated I changed my voter registration to independent. _E_\nTonight at 8:00 is a really big one for a double episode of Celebrity Apprentice. Watch you won't believe what happens! _E_\nStatement by me last night in Florida: \"Honestly I don't think the Democrats want to make a deal. They talk about DACA but they don't want to help..We are ready willing and able to make a deal but they don't want to. They don't want security at the border they don't want..... _E_\nLooking forward to speaking at 1:30PM tomorrow in Nashua at @NHGOP @FITNsummit!. Let's Make America Great Again! 
#FITN _E_\nOn beautiful Lake Norman @Trump_Charlotte offers a state of the art Clubhouse to complement its championship course __HTTP__ _E_\nGreat poll numbers! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWow the ObamaCare website which President Obama said would be working TODAY is a total mess with many functions not even thought about! _E_\nOur greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time. Thomas A. Edison _E_\nBe sure to set exceptional goals for your 2015 resolutions. Push yourself you can do it. Think Big! _E_\nThe unemployment numbers are tragic. We are letting the world take our jobs. It has to stop! _E_\nHad a great time on the @HowardStern show this morning—he will and should never change! _E_\nWhy doesn't the failing @nytimes write the real story on the Clintons and women? The media is TOTALLY dishonest! _E_\nHillary Clinton's weakness while she was Secretary of State has emboldened terrorists all over the world..cont: __HTTP__ _E_\n#SweepsTweet @clayaiken might get some use out of the Chi Touch digital hairdryer. Not the same for @arsenioofficial. _E_\nWow! @FoxNews poll just came out. #1 with 26%! Almost as importantly I am the strongest on economic issues by far! #Trump2016 _E_\nHRC is using the oldest play in the Dem playbook when their policies fail they are left w/this one tired argument! __HTTP__ _E_\nNever let the fear of striking out get in your way. Babe Ruth _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n.@IamStevenT gave one of the greatest endings to a show ever @MissUniverse. Standing ovation! _E_\nFast trial and death penalty for maniac in Colorado immediately pass speed up legislation. _E_\nLast week's boardroom was truly epic and the dust hasn't settled yet. #CelebApprentice _E_\nThank you Jonathan. Greatly appreciated! __HTTP__ _E_\nMy twitter has become so powerful that I can actually make my enemies tell the truth. _E_\nMy great honor! 
__HTTP__ _E_\nChina's leadership is sneaky and underhanded they significantly underreport their actual defense budget and (cont) __HTTP__ _E_\nHe's back and causing more trouble than ever before! @THEGaryBusey returns in the record 13th season of 'All Star' @CelebApprentice. _E_\nOff to Nashville and the NRA. _E_\n... It is very effective and a commonly used business tool. _E_\nObama's economic policies are causing inflation on hard working families. The price of corn alone has risen over 200% since he was elected. _E_\nI cannot believe the Republicans are extending the debt ceiling—I am a Republican & I am embarrassed! _E_\nSee you soon Arizona! #Trump2016 __HTTP__ _E_\nCongratulations to @FLGovScott the state is really making progress and fast! _E_\nPart 1 of my @jimmyfallon interview discussing my $5M offer to Obama #TRUMP Tower atrium my tweets & 57th st. crane __HTTP__ _E_\nThink of it 20% of our country is essentially unemployed. _E_\nQuote of the Day: Donald Trump Decrees Boycott on Glenfiddich Scotch __HTTP__ via @Zagat _E_\nGreat news that the New York Stock Exchange won't be owned by a German company. European regulators turned the (cont) __HTTP__ _E_\nAre we talking about the same cyberattack where it was revealed that head of the DNC illegally gave Hillary the questions to the debate? _E_\nHillary's Aides Urged Her to Take Foreign Lobbyist Donation And Deal With Attacks: __HTTP__ _E_\nA storied franchise with a loyal fanbase @buffalobills should remain in Buffalo. _E_\nVideo from Michigan last night. After asking for months the media panned their cameras! __HTTP__ __HTTP__ _E_\nIt's hard to read the Failing New York Times or the Amazon Washington Post because every story/opinion even if should be positive is bad! _E_\nRT @MittRomney: For nearly 4 years Barack Obama has refused to crack down on China's cheating & American workers have paid the price. _E_\nThank you Windham New Hampshire! 
#TrumpPence16 #MAGA __HTTP__ _E_\nProblems are never truly hardships to winners & if you haven't got any then you must not have a business to run. _E_\nTrump: Zimmerman Trial 'Traumatic Period for Country' __HTTP__ @Newsmax_Media _E_\n.@NY_POLICE Commissioner Ray Kelly has done a top job keeping NYC safe. Stop & Frisk has been a critical tool for the NYPD. _E_\nWhy does Conde Nast allow dopey Graydon Carter to run bad food restaurants while running failing @VanityFair magazine? _E_\nA commander in chief has to possess the right instincts. That's one of the biggest problems with @BarackObama: (cont) __HTTP__ _E_\nThank you Iowa! I appreciate all of your support @IowaCentral & @ethanolbyPOET this evening! #Trump2016 #IACaucus __HTTP__ _E_\nTune in for my interview with @gretawire tonight at 10 pm @FoxNews _E_\nLooking forward to the GOP debate and the outcome of the Ames straw poll. We must get a real leader. _E_\nI am interviewed on This Week on @ABC this morning. Enjoy! _E_\nIn answer to your questions about my favorite impersonator the answer is Darrell Hammond. _E_\n...(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). To bad the Dems have no one who can change tones! _E_\nLess than one week away from implementation ObamaCare's small business exchanges are not ready! __HTTP__ A disaster! _E_\nI am so disappointed that the Yankeed haven't terminatrd A Rod's contract. There is no way they would not win in court! Hard to believe. _E_\nThank you Kenansville North Carolina! Remember on November 8th that special interest gravy train is coming to a... __HTTP__ _E_\nObamaCare strikes again. Major insurer announced that over 53000 New Yorkers will be dropped from their plans __HTTP__ _E_\nMy thoughts on the Republican Party in today's #trumpvlog... __HTTP__ _E_\nWow. Unbelievable. __HTTP__ _E_\nSome dope said I deleted a tweet about James G. There was no tweet and there was no delete a totally fabricated story (nobody saw tweet). 
_E_\nThe current tax code is a burden on American taxpayers & harmful to job creators. Americans need #TaxReform! More: __HTTP__ __HTTP__ _E_\nVia @Newsmax_Media: \"Trump Iowa Visit Raises 2016 Speculation\" __HTTP__ _E_\nLet's go! #CelebApprentice _E_\nVia @NYDailyNews: Joan Rivers' last work for @ApprenticeNBC will run on two shows next season says Donald Trump __HTTP__ _E_\nCongratulations to Tom Brady @Patriots he is a great quarterback and a great champion! _E_\nChina just agreed that the U.S. will be allowed to sell beef and other major products into China once again. This is REAL news! _E_\nThe lawyer I just beat in Chicago was a buffoon but was a lot smarter and sharper than @DannyZuker. Come on Danny make the bet! _E_\n.@TheRevAl came to my Trump Tower office to apologize for calling me a racist very nice apology accepted! _E_\nInteresting that Roberts said it was a tax in order to come out with his good public relations decision when (cont) __HTTP__ _E_\nPresident Obama will go down as perhaps the worst president in the history of the United States! _E_\nCongrats @GretchenCarlson's new Fox show debuts w/ very strong ratings __HTTP__ Guess who her first guest was? Donald Trump. _E_\nTune into the legendary @BarbaraJWalters at 10pmE on@ABC2020 tonight. #MeetTheTrumps for a full hour @ABC #ABC2020! __HTTP__ _E_\nGood advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_\nUnbelievable evening. Just made a speech in front 17000 amazing New Yorkers in Bethpage Long Island great to be home! _E_\nObama asked a 7 yr old for his birth certificate. He's in your face because the Republicans dropped the ball. (cont) __HTTP__ _E_\nWE ARE WITH YOU FLORIDA!Emergency Information 1 800 342 3557 __HTTP__ 1 800 FL HELP 1 __HTTP__ __HTTP__ _E_\nWhat a 'nice guy' 97% of @BarackObama's campaign ads have been negative attacks on @MittRomney __HTTP__ Give it back Mitt! 
_E_\n.@MissTeenUSA visited today __HTTP__ _E_\n'Good Chance' Trump Will Run for President __HTTP__ via @Newsmax_Media by @melaniebatley _E_\n.@CNN is in a total meltdown with their FAKE NEWS because their ratings are tanking since election and their credibility will soon be gone! _E_\nIt is an exciting time for our country!#WeeklyAddress #ConfirmGorsuch __HTTP__ _E_\nThe threat from radical Islamic terrorism is very real just look at what is happening in Europe and the Middle East. Courts must act fast! _E_\nYes All Star @ApprenticeNBC contestant @THEGaryBusey is a little out there. But he uses his 'uniqueness' to his advantage. _E_\nVia @STVAberdeen: Donald Trump reveals first image of his new Aberdeenshire hotel __HTTP__ _E_\nNew jobs report: 432000 left workforce manufacturing & durable goods go __HTTP__ We need leaders who understand business. _E_\nOur big and very popular Tax Cut and Reform Bill has taken on an unexpected new source of \"love\" that is big companies and corporations showering their workers with bonuses. This is a phenomenon that nobody even thought of and now it is the rage. Merry Christmas! _E_\nObama just stated It's always good to ignore Donald Trump. I state he is right especially when the truth is against him. _E_\nFirst responders have been doing heroic work. Their courage & devotion has saved countless lives – they represent the very best of America! __HTTP__ _E_\nHe says he will spend $1 B to get re elected: @BarackObama. I can match him preserving my options. _E_\nJustice Kennedy should be proud of himself for sticking to his principles in light of Justice Roberts' bullshit! _E_\nGoing to The Citadel tonight getting The Nathan Hale Patriot Award. Very nice! _E_\nNegotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. 
_E_\nWinner of the 5 Star Diamond Award @TrumpGolfLA brings luxury & elite amenities to LA's top public golf course __HTTP__ _E_\nJohn McCain called thousands of people crazies when they came to seek help on illegal immigration last week in Phoenix. He owes apology! _E_\nTrump International Hotel & Tower New York has received great acclaim as has our signature restaurant Jean Georges __HTTP__ _E_\nSheena Monnin acted terribly...she got what she deserved! _E_\n.@EliseChristine #asktrump __HTTP__ _E_\nMy @SquawkCNBC interview from this morning discussing the price of oil windfarmsDoral Hotel & Country Club and more... __HTTP__ _E_\nArt Laffer just said that he doesn't know how a Democrat could vote against the big tax cut/reform bill and live with themselves! @FoxNews _E_\nI'm not proud of my locker room talk. But this world has serious problems. We need serious leaders. #debate #BigLeagueTruth _E_\nVia @nypost by @JonathonTrugman: Donald Trump's resume backs his run for president __HTTP__ _E_\nThank you Lake Worth Florida. @foxandfriends _E_\nAll Star @ApprenticeNBC has done the impossible. TV's greatest villain @OMAROSA & @THEGaryBusey are in competition. Fireworks! _E_\nHawaii: __HTTP__ __HTTP__ __HTTP__ __HTTP__ _E_\nPlease to inform that the Champion Pittsburgh Penguins of the NHL will be joining me at the White House for Ceremony. Great team! _E_\nI hope the Fake News Media keeps talking about Wacky Congresswoman Wilson in that she as a representative is killing the Democrat Party! _E_\nStock Market at new all time high! Working on new trade deals that will be great for U.S. and its workers! _E_\nVery excited to be addressing the @RepLeadConf next Friday in New Orleans. There is much to discuss! _E_\nVia @BreitbartNews by @ASwoyer: Exclusive: Trump Slams Obamatrade Stands Up For American Jobs __HTTP__ _E_\nAmerica will never be destroyed from the outside.If we falter and lose our freedomsit will be because we destroyed ourselves. A. 
Lincoln _E_\nRomneyCare/ObamaCare architect Gruber apologized for his comments. He should apologize for the $2T monstrosity & return all taxpayer money. _E_\nIt was great to appear on Piers Morgan Tonight last night as his first live guest. Piers won the Celebrity Apprentice and he's fantastic. _E_\nJay Leno and his people are constantly calling me to go on his show. My answer is always no because his show sucks. They love my ratings! _E_\nCongrats to @Reince Priebus a really good and talented man. We're proud of you Reince! __HTTP__ _E_\nThousands of great people showed up from Liberty University yesterday. I love standing ovations! __HTTP__ _E_\nI'm helping the Serta Counting Sheep get back to work. Enter the contest __HTTP__ and win a trip to Las Vegas.. _E_\nHurricane Irma is raging but we have great teams of talented and brave people already in place and ready to help. Be careful be safe! #FEMA _E_\nObamaCare will explode and we will all get together and piece together a great healthcare plan for THE PEOPLE. Do not worry! _E_\n.@HillaryClinton Obama #ISIS Strategy Has Allowed It To Expand To Become A Global Threat #DebateNight __HTTP__ _E_\nWhat is never said is that people take a big risk with their money and can lose it all. We should be given credit for taking this risk. _E_\n\"You had Hillary Clinton and the Democratic Party try to hide the fact that they gave money to GPS Fusion to create a Dossier which was used by their allies in the Obama Administration to convince a Court misleadingly by all accounts to spy on the Trump Team.\" Tom Fitton JW _E_\n....the wall is not built which it will be the drug situation will NEVER be fixed the way it should be!#BuildTheWall _E_\nDoes Bush's library have a wing featuring Supreme Court Justice Jon Robert's ObamaCare ruling? Roberts was his prize appointee! _E_\nIf taxes are raised to avoid the fiscal cliff then they must be accompanied by tangible hard cuts on spending everywhere. 
_E_\nRapper Mac Miller's song Donald Trump has reached close to 72 million hits. He owes me big! _E_\nWhy doesn't somebody study the horrible charges brought against @Macys for racial profiling? Terrible hypocrites! _E_\nChris Ruddy is always on point: Trump Opens 'Greatest Golf Course In the World' __HTTP__ via @Newsmax_Media _E_\nThe biggest story yesterday the one that has the Dems in a dither is Podesta running from his firm. What he know about Crooked Dems is.... _E_\nSHOCK Hugo Chavez endorses @BarackObama __HTTP__ Will he be in Chicago on election night too? _E_\nRepublicans are always saying Obama is such a nice guy. When will they learn that he is not? _E_\nAll are very scripted and rehearsed two (at least) should not be on the stage. _E_\nRT @FoxNews: TUNE IN: @EricTrump joins @seanhannity TONIGHT at 9p ET on @FoxNews Channel! #Hannityat9 __HTTP__ _E_\n.@Joan_Rivers Get well soon Joan keep fighting! _E_\nVia @bostonherald by @ ChrisCassidy_BH: Donald Trump says Jeb Bush is wrong about Iraq __HTTP__ _E_\nCongratulations to @RealSheriffJoe on his successful Cold Case Posse investigation which claims @BarackObama's 'birth certificate' is fake _E_\nI am happy that The Job on CBS the 16th. knockoff of the Apprentice was just cancelled. I love to see my opponents lose (not nice)! _E_\nI always said that Debbie Wasserman Schultz was overrated. The Dems Convention is cracking up and Bernie is exhausted no energy left! _E_\nCrooked Hillary colluded w/FBI and DOJ and media is covering up to protect her. It's a #RiggedSystem! Our country d... __HTTP__ _E_\nTake a tour of this amazing residence at Trump World Tower..... __HTTP__ _E_\nLyin' Hillary Clinton told the FBI that she did not know the C markings on documents stood for CLASSIFIED. How can this be happening? _E_\n13 BILLION 4.5 BILLION these are the stupid settlements that J.P.Morgan just made. Why don't they FIGHT? No wonder they keep getting sued. 
_E_\nWord is that little Morty Zuckerman's @NYDailyNews loses more than $50 million per year can that be possible? _E_\nJoin me in Wisconsin tomorrow or Colorado on Tuesday!Green Bay 6pm __HTTP__ Springs 1pm... __HTTP__ _E_\nThere is no way that Carly Fiorina can become the Republican Nominee or win against the Dems. Boxer killed her for Senate in California! _E_\nRT @gatewaypundit: The Trump Hotel Waikiki looks like a lovely resort @realDonaldTrump #Hawaii _E_\nRT @foxandfriends: Sen. Ted Cruz: Trump's air traffic control plan is a 'win win' for Democrats and Republicans __HTTP__ _E_\n\"Do your homework before you invest. A dumb investor is a dead investor.\" – Think Like a Billionaire _E_\nGreat new poll from NH. Thank you! We need to keep this country safe! #Trump2016 __HTTP__ __HTTP__ _E_\nObama friend got a no bid $635M contract to build website __HTTP__ And now she will get more to fix it. _E_\nI had thousands join me in New Hampshire last night! @HillaryClinton had 68. The #SilentMajority is fed up with what is going on in America! _E_\nJust out the POLAR ICE CAPS are at an all time high the POLAR BEAR population has never been stronger. Where the hell is global warming? _E_\nRT @brunelldonald: I thought about jobs that went overseas failing schools open borders not my skin color when I voted @realDonaldTrump! I... _E_\nI always believed @BretMichaels was making a mistake in coming back as a competitor. I disagree with him but... __HTTP__ _E_\nWill be on @Morning_Joe live from New Hampshire 7:00 A.M. Talking about the debate and more! _E_\nWe have wasted an enormous amount of blood and treasure in Afghanistan. Their government has zero appreciation. Let's get out! _E_\nMillions of $'s of false ads paid for by lobbyists special interests of cheater @SenTedCruz and sleepy @JebBush are now running in S.C. _E_\nWhat is Mitch McConnell thinking?...make the big deal! _E_\nWow CNN had to retract big story on Russia with 3 employees forced to resign. 
What about all the other phony stories they do? FAKE NEWS! _E_\nAnother one of my predictions just came true Iraq is a total disaster with government losing all control—so sad. _E_\nRT @GovChristie: .@POTUS has done more to combat the addiction crisis than any other President. __HTTP__ _E_\nWe should not cut any aid to Egypt. Their country is in chaos and now they must form a normal civil government. _E_\nI have clearly stated that if the New York State Republican Party is able to unify I would run for Governor and win. They can't unify SAD! _E_\nMy appearance this morning on Good Morning America... __HTTP__ _E_\nRefugees from Syria are now pouring into our great country. Who knows who they are some could be ISIS. Is our president insane? _E_\nThe women played great today at the @USGA #USWomensOpen I look forward to being there tomorrow for the final round! __HTTP__ _E_\nHeading to New Hampshire will be talking about Hillary saying her brain SHORT CIRCUITED and other things! _E_\n26000 unreported sexual assults in the military only 238 convictions. What did these geniuses expect when they put men & women together? _E_\nThat Seth Meyers is hosting the Emmy Awards is a total joke. He is very awkward with almost no talent. Marbles in his mouth! _E_\nThe one positive from the plunge in household wealth is that we are in a buyer's market. This is the time to buy! _E_\nPaul Begala the dopey @CNN flunky and head of the Pro Hillary Clinton Super PAC has knowingly committed fraud in his first ad against me. _E_\nDem Senator Schumer hated the Iran deal made by President Obama but now that I am involved he is OK with it. Tell that to Israel Chuck! _E_\nI am so happy that I was able to do something really good for the Bronx and lots of jobs! _E_\nWatch yesterday Obama continued to evade questions on his security failures in the Benghazi consulate attack. __HTTP__ _E_\nTotally unauthorized do not pay. I am self funding my campaign! Notice has just been withdrawn. 
#Trump2016#MakeAmericaGreatAgain _E_\nJust leaving for @LandExpo in Iowa standing room only. My great honor. @PeoplesCompany __HTTP__ _E_\nReally enjoyed discussing @yankees yesterday with @RealMicihaelKay. I am a long time Yankee fan. _E_\nWow 25000 in San Diego California!Thank you!! #Trump2016 __HTTP__ _E_\nThe virtually incompetent Republican Strategist who has had a failed career Cheri Jacobus is incoherent with anger that her puppets died! _E_\nRT @seanhannity: BOOM!! Tick Tock __HTTP__ _E_\nThank you to all of those who gave me such wonderful reviews for my performance on @nbcsnl Saturday Night Live. Best ratings in 4 years! _E_\nAs your President I have no higher duty than to protect the lives of the American people. __HTTP__ _E_\nThe Republicans must get Virgil Goode out of the race in Virginia. He will take votes away from @MittRomney. _E_\nI am proud of the Rep. House & Senate for working so hard on cutting taxes {& reform.} We're getting close! Now how about ending the unfair & highly unpopular Indiv Mandate in OCare & reducing taxes even further? Cut top rate to 35% w/all of the rest going to middle income cuts? _E_\nObamaCare/RomneyCare architect Gruber was paid over $6M with our tax dollars yet Obama only claims he 'was some adviser.' _E_\nMy @gretawire interview discussing @IvankaTrump wanting me to run for POTUS @BarackObama's SOTU and his China policy __HTTP__ _E_\n.@Omarosa has another meltdown ... while giving a check for $40000 to Michael's charity the Sue Duncan Center. #CelebApprentice _E_\nAccording to @RasmussenPoll @MittRomney has a 12 point advantage over @BarackObama on the economy __HTTP__ Look for it to grow. _E_\nBig day for HealthCare. After 7 years of talking we will soon see whether or not Republicans are willing to step up to the plate! _E_\n.@GovernorPerry is a terrific guy and I wish him well I know he will have a great future! _E_\nSecure your place at the National Achievers Congress in London. 
It will be an amazing event with a great surprise. __HTTP__ _E_\nThe United States condemns the terror attack in Barcelona Spain and will do whatever is necessary to help. Be tough & strong we love you! _E_\nGen. Petraeus has agreed to testify in the Senate on Benghazi. I will be watching. _E_\nAustralia New Zealand and more. I am always available to them. @nytimes is just upset that they looked like fools in their coverage of me. _E_\nA smart negotiator would use the leverage of our dollars our laws and our armed forces to get a better deal (cont) __HTTP__ _E_\nLooking forward to keynoting the South Carolina Tea Party Convention in Myrtle Beach on Monday at 3:20PM! __HTTP__ _E_\n.@Omarosa on the cover of Soap Opera Digest? That's a credential... #CelebApprentice _E_\nIt was an honor to welcome President @MarianoraJoy of Spain. Thank you for standing w/ us in our efforts to isolate the brutal #NoKo regime. __HTTP__ _E_\n\"Most entrepreneurs do not realize that wealth does not come from work but from the assets they build.\" – Midas Touch _E_\nRT @realDonaldTrump: DACA is probably dead because the Democrats don't really want it they just want to talk and take desperately needed m... _E_\n.@KatyTurNBC & @DebSopan should be fired for dishonest reporting. Thank you @GatewayPundit for reporting the truth. #Trump2016 _E_\n\"Trump on Romney: 'You Just Can't Give Him Another Chance':Some golfers can't sink the 3 ft. putt.\" __HTTP__ via @PJMedia_com _E_\nHopefully the House of Representatives can hold our country together for four more years...stay strong and never give up! _E_\nWe are TRYING to fight ISIS and now our own people are killing our police. Our country is divided and out of control. The world is watching _E_\nLolo Jones our beautiful Olympic athlete wants to remain a virgin until she gets married she is great. @Followlolo _E_\nThanks. 
__HTTP__ __HTTP__ _E_\n\"Appreciate your property and your property will appreciate for you.\" – Think Like a Billionaire _E_\nWhat people don't know about @BillMaher is that he was a terrible student and not considered smart in his early (cont) __HTTP__ _E_\nIf the people of Massachusetts found out what an ineffective Senator goofy Elizabeth Warren has been she would lose! _E_\nEntrepreneurs: Identify your goals and see each day as an opportunity to show what you can do at the highest level. _E_\nEliot Spitzer was a horrible Governor and A.G. who ruined many good people and cost the Country billions of dollars in losses (and jobs). _E_\n\"The greatest discovery of all time is that a person can change his future by merely changing his attitude.\" @Oprah _E_\nCheck out my new book Time To Get Tough: Making America #1 Again __HTTP__ _E_\nIn @oreillyfactor's No Spin Zone re: ObamaCare causing unemployment negotiating with China & my $5M court win __HTTP__ _E_\nErin Burnett who has no ratings on CNN in prime time now wants more money to move to the morning slot. @CNN should say no way . _E_\nGeneral John Kelly is doing a great job as Chief of Staff. I could not be happier or more impressed and this Administration continues to.. _E_\nImpossible is a word to be found only in the dictionary of fools. Napoleon Bonaparte _E_\nPeople very unhappy with Crooked Hillary and Obama on JOBS and SAFETY! Biggest trade deficit in many years! More attacks will follow Orlando _E_\nThe Stock Market is setting record after record and unemployment is at a 17 year low. So many things accomplished by the Trump Administration perhaps more than any other President in first year. Sadly will never be reported correctly by the Fake News Media! _E_\nWSJ/NBC Poll: Donald Trump Widens His Lead in Republican Presidential Race. #Trump2016 __HTTP__ _E_\nWe have spent over $1 Billion on the Libya operation. What are we getting back? 
_E_\nTrump organisation backs community battle against substation __HTTP__ via @STVNews _E_\nThe more predictable the business the more valuable it is. Predictability also means consistency of brand experience. Midas Touch _E_\nThank you to everyone for the wonderful reviews of my speech on Thursday night. From the heart! _E_\n.@antbaxter Your documentary died many deaths. You have in my opinion zero talent. _E_\nWow Senator Luther Strange picked up a lot of additional support since my endorsement. Now in September runoff. Strong on Wall & Crime! _E_\nHeading to a packed house in Waterloo Iowa! Will celebrate today's great poll numbers together. See you soon! _E_\nShouldn't there have been increased security at our embassies on the anniversary of 9/11? _E_\n\"Study: Insurance costs to soar under Obamacare\" __HTTP__ Men in NC get 305% hike. Women in NE suffer an average 237% hike. _E_\n#BuyAmericanHireAmericanWatch __HTTP__ __HTTP__ _E_\n.@antbaxter—Heard your documentary cost you less than $3000 to make—where did you get that kind of money? _E_\nGive great credit to @GeorgeClooney for exposing the atrocities taking place in Sudan. _E_\nMy support of Anna Wintour for Ambassador got a lot of coverage. She is smart and will be a strong advocate for the US. _E_\nPresident Obama Gruber and all of the other Obama cronies got ObamaCare passed by lies and fraudulent statements. Courts should overturn! _E_\nThank you @IvankaTrump for the kind words. I am very proud of the role model you are for so many. NH & IA radio ad: __HTTP__ _E_\nA new radical Islamic terrorist has just attacked in Louvre Museum in Paris. Tourists were locked down. France on edge again. GET SMART U.S. _E_\nSad. Our food stamp rolls now surpass the entire population of Spain __HTTP__ We must do better or we will be Greece. _E_\nBoardroom time which team do you think had the best presentation? #CelebApprentice _E_\n.@BretBaier's newly released book 'Special Heart' brings a message of hope. 
All sales donated to heart charities __HTTP__ _E_\n... Icahn Kravis Apollo and most others but nobody says they went bankrupt! _E_\nDear @kimguilfoyle Thank you so much for your nice words today on @TheFive. Will not be forgotten! In Iowa now. Packed house! _E_\nUS Gov't is on the hook for more than a third of the world's entire debt & we wonder why China & OPEC are laughing all the way to the bank! _E_\nObamaCare must be fully repealed or it will destroy America's small businesses. _E_\nIn Las Vegas getting ready to speak! _E_\nCountry music star @TraceAdkins returns to All Star @CelebApprentice. Competing for @RedCross Trace is great! _E_\nCongratulations are in order! @TrumpPanama ranks #5 Top Hotel in Panama by @TripAdvisor's #TravelersChoice Awards! __HTTP__ _E_\nVia @bizjournals by @BrandonSawalich: 3 lessons about loyalty that I learned from Donald Trump __HTTP__ _E_\nMy thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_\nISIS threatens us today because of the decisions Hillary Clinton has made along with President Obama. Donald J. Trump _E_\nAmerica has lost its AAA rating and gained over $6T in debt under @BarackObama and now he wants to raise the debt ceiling SCARY! _E_\nWorking hard on the biggest tax cut in U.S. history. Great support from so many sides. Big winners will be the middle class business & JOBS _E_\nLast night's horrific execution style shootings of 12 Dallas law enforcement officers... __HTTP__ _E_\nAmerica's debt is greater than our GDP. Time for new thinking. _E_\n#TBT @DonaldJTrumpJr @IvankaTrump @EricTrump and I 20 years ago __HTTP__ _E_\nThank you Florida. My Administration will follow two simple rules: BUY AMERICAN and HIRE AMERICAN! #ICYMI Watch:... __HTTP__ _E_\nI believe the James Comey leaks will be far more prevalent than anyone ever thought possible. Totally illegal? Very 'cowardly!' 
_E_\nThank you Bridgeport Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThe President must get Congressional approval before attacking Syria big mistake if he does not! _E_\nNo surprise serial sexter Anthony continues to be a sick pervert. He was sexting a 'young' girl last summer __HTTP__ _E_\nWill be in Missouri today with Melania for the funeral of a wonderful and truly respected woman Phyllis S! _E_\nGood investors are good students. It's as simple as that. Think Like a Billionaire _E_\nHouston TX: __HTTP__ Vegas NV __HTTP__ AZ: __HTTP__ __HTTP__ _E_\nCongress use the power of the purse. STOP AMNESTY! _E_\nThe failing @nytimes which has made every wrong prediction about me including my big election win (apologized) is totally inept! _E_\n\"Donald Trump on Fiscal Cliff and Obama\" __HTTP__ via @Livetradingnews _E_\nThe @HuffingtonPost is a total joke & laughing stock of journalism as is gross Arianna Huffington. They don't report the facts! _E_\nTom Ridge is a failed 'Bushy' & PA Governor. Him & his friend @KarlRove shouldn't be allowed to do their bias commentary nobody listens! _E_\nIf @BarackObama really loved this country he wouldn't be destroying it. He has ruined our credit and killed jobs with ObamaCare. _E_\nI took a failed club in Dutchess County & made it a great success plus many jobs. @KieranLalor should be thankful. _E_\nToo bad I don't get this for political speeches they cost me a fortune! __HTTP__ _E_\nClub for Growth is the group that came to my office seeking $1 million dollars. I told them no and now they are doing negative ads. _E_\nNo deal is better than a bad deal. America out negotiated again. #Iran _E_\nJoin me live from the @WhiteHouse via #Periscope __HTTP__ _E_\nJust got back from Tampa. It was an amazing evening with an even more amazing crowd fantastic people! Will be in South Carolina tomorrow. _E_\nUS government's foreign indebtedness has grown over 72% under @BarackObama. He is bleeding us dry to China. 
_E_\nDAMAC & #Trump Organization are developing a 2nd Trump #golf course Trump World Golf Club #Dubai at AKOYA Oxygen! __HTTP__ _E_\nWe must stop Common Core from controlling state & local curriculums. It is a federal grab of education. Keep education local! _E_\nIf @BarackObama's policies are so advantageous then why is he constantly invoking Ronald Reagan on the Stump? __HTTP__ _E_\nDuring my recent trip to the Middle East I stated that there can no longer be funding of Radical Ideology. Leaders pointed to Qatar look! _E_\nOur relationship with Russia is at an all time & very dangerous low. You can thank Congress the same people that can't even give us HCare! _E_\nTrump Vineyard Estates is a breathtaking location to hold special events for all occasions. Watch the video for a look __HTTP__ _E_\nMy int. on @FoxNews' @oreillyfactor: \"Donald Trump presidential politics and 'The Factor'\" __HTTP__ _E_\nWhy do we keep broadcasting when we are going to attack Syria. Why can't we just be quiet and if we attack at all catch them by surprise? _E_\n#TrumpVlog Trouble in paradise for Clintons __HTTP__ _E_\nRT @greta: Thank you @realDonaldTrump this is important to so many of us __HTTP__ _E_\nDoes he look sharp smart and presidential his hands keep hitting the podium making a loud and distracting noise microphone too sensitive. _E_\nUgly industrial wind turbines are ruining the beauty of parts of the country and have inefficient unreliable energy to boot. _E_\nDaily Caller: Trump Surpasses Field Flirts With 40 Percent in Alabama Poll __HTTP__ _E_\nJust arrived for the #GOPdebate #MakeAmericaGreatAgain __HTTP__ _E_\nHad a great time on @gretawire last night. Greta always does great interviews. _E_\nRT @FLOTUS: Looking forward to hosting the annual Easter Egg Roll at the @WhiteHouse on Monday! __HTTP__ _E_\n\"Donald Trump: The View Will be Better without Joy Behar (Video)\" __HTTP__ via @gatewaypundit _E_\nHonored to be attending Rev. @BillyGraham's 95th birthday. 
His life & work has brought hope & faith to millions worldwide. _E_\nAlways good to have @ArsenioHall back as advisor as well as @DonaldJTrumpJr. They have their own fan clubs at this point. #CelebApprentice _E_\nIn '08 America voted for Hope & Change. Instead we got incompetency. Now it is time to put a real job creator in office. Vote 4 Mitt! _E_\nRT @PressSec: .@POTUS historic tax cuts + doubling of the child tax credit will do infinitely more to empower working moms than liberals' p... _E_\nOur hearts & prayers go out to the people of London who suffered a vicious terrorist attack.... __HTTP__ _E_\nThank you Diamond and Silk! __HTTP__ _E_\nWow @GeorgeWill said some very nice things about me today on @FoxNewsSunday with Chris Wallace. I am making progress thanks George! _E_\nDelusional @BarackObama claims that his economic plan worked __HTTP__ Is the 16% real unemployment part of the plan? _E_\n\"If we get tough and make the hard choices we can make America a rich nation—and respected—once again.\" – Time to Get Tough _E_\nA top firm like Cooley will only submit a case they believe in and can win. _E_\nAs one of Miamii's largest landowners I am pulling for the @MiamiHEAT in the @NBA finals. Lebron's time is now! @KingJames _E_\nSo excited to have @SantanaCarlos performing at the 2015 #CadillacChampionship at @TrumpDoral: __HTTP__ _E_\nGetting ready to go to the great State of Michigan. Big crowd tonight. Make America Great Again! _E_\nI still can't believe we left Iraq without the oil. _E_\n\"Money was never a big motivation for me except as a way to keep score.The excitement is playing the game.\"–The Art of The Deal _E_\nGreat optimism for future of U.S. business AND JOBS with the DOW having an 11th straight record close. Big tax & regulation cuts coming! 
_E_\n\"Get to know yourself.You can't improve upon something you don't understand.The more you ask the better you'll know.\" Vince Lombardi _E_\nVia @paramuspost: \"@TrumpSoHo New York Debuts Sizzling Summer Offerings\" __HTTP__ _E_\nObama & his people did a brilliant job of delaying these scandals until after the election. Mitt must be going wild thinking about it! _E_\n\"Design your business from the start so that it is leverageable expandable predictable and financeable.\" – Midas Touch _E_\nObama weak on immigration. All words no action. He's been Prez 4 years. _E_\n.@washtimes @BrettMDecker: Five Questions w/ @realDonaldTrump 'Lack of Leadership is the biggest threat to America' __HTTP__ _E_\nTaking risks & making mistakes is the best way to learn something new. Most of the time you will surprise yourself Trump Never Give Up _E_\nWho wants the endorsement of a guy (@EricCantor) who lost in perhaps the greatest upset in the history of Congress? _E_\n.@MittRomney and I are working out a great dinner for someone I hope it's you! __HTTP__ _E_\nCLINTON REFUGEE PLAN COULD BRING IN 620000 REFUGEES IN FIRST TERM AT LIFETIME COST OF OVER $400 BILLION. __HTTP__ _E_\nBeing nice to Rocket Man hasn't worked in 25 years why would it work now? Clinton failed Bush failed and Obama failed. I won't fail. _E_\nThe bend in the road is not the end of the road unless you refuse to take the turn. – Anonymous _E_\nZogby Poll: Trump Widens Lead After GOP Debate __HTTP__ _E_\nFormerly of the New York Times @frankrichny was a poor theatre critic who was forced out. Sadly he is an even (cont) __HTTP__ _E_\nWe are delivering HISTORIC TAX RELIEF for the American people!#TaxCutsandJobsAct __HTTP__ _E_\nI am convinced that if @AlexSalmond had not pushed ugly wind turbines all over Scotland the vote would have been much better for him! _E_\nThank you to Ford for scrapping a new plant in Mexico and creating 700 new jobs in the U.S. 
This is just the beginning much more to follow _E_\nDeals are my art form. Other people paint beautifully or write poetry. I like making deals preferably big deals. That's how I get my kicks. _E_\nIf @DannyZuker competed against me and.won (which not too many people do) he could win millions of $'s for himself or his charity! _E_\nIt's really cold outside they are calling it a major freeze weeks ahead of normal. Man we could use a big fat dose of global warming! _E_\n.@NRO Not much is as dead or irrelevant as National Review thanks to guidance of Goldberg a total loser! Get some real talent or fold! _E_\nThank you @GolfMagazine for putting my Scotland course on your cover and a Top 100 course in the world. __HTTP__ _E_\nJoin the MOVEMENT to #MAGA! __HTTP__ __HTTP__ _E_\nThe dopes at the @nytimes bought the Boston Globe for $1.3 billion and sold it for $1.00. Their great old headquarters gave it away! So dumb _E_\n#VoteTrump at clerk's offices & 185 ballot drop boxes in #ORPrimary!Closes at 8pm! __HTTP__ _E_\nHappy 4th of July! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_\nVia @Newsmax_Media: 14 Reasons Donald Trump Is Really Running — and Doing Well __HTTP__ _E_\nTo put on your calendar for May: Miss USA 2010 live from Las Vegas on May 16th 7 p.m. ET on NBC. I'll be there tune in for a great show! _E_\nHad a great meeting at CIA Headquarters yesterday packed house paid great respect to Wall long standing ovations amazing people. WIN! _E_\nRT @DRUDGE_REPORT: DEAD HEAT: CLINTON VS TRUMP __HTTP__ _E_\nChina is now given preference to buy US debt by going directly to Treasury. I don't believe @BarackObama knows that he selling us out. _E_\nToday's final round of the WGC Cadillac Championship will be amazing. A lot of pressure on leader who has played great. Big names hunting! _E_\nTrump Tycoon App for iPhone & iPod Touch It's $2.99 but the advice is priceless! __HTTP__ _E_\nObama will grant amnesty to millions of illegals yet he has not lifted a finger for USMC Sgt. 
Tahmooressi! . #BringBackOurMarine _E_\nVia @Zawya: \"Trump home partners with lifestyle to launch an exclusive collection of home décor\" __HTTP__ _E_\nRT @foxnation: . @TuckerCarlson : #Dems Don't Really Believe #Trump Is a Pawn of #Russia That's Just Their Political Tool __HTTP__ _E_\nOur FIFTH 1K milestone of 2017!#DOW24K #MAGA __HTTP__ _E_\n.@stephenfhayes: I heard you were a joke on the media panel this weekend in New Hampshire. You just don't have what it takes! @JoeNBC _E_\nMy son @EricTrump will be interviewed by @SeanHannity tonight at 10pm on @FoxNews. Enjoy! _E_\nImagine how much stronger economic shape we would be in if we made the Iraqi government agree to a cost sharing (cont) __HTTP__ _E_\nbeing a movie star and that was season 1 compared to season 14. Now compare him to my season 1. But who cares he supported Kasich & Hillary _E_\nVia @UnionLeader by @tuohy: \"Trump says he will decide on a presidential run by June\" __HTTP__ _E_\nI am getting bad marks from certain pundits because I have a small campaign staff. But small is good flexible save money and number one! _E_\nTed Cruz has been playing an ad about me that is so ridiculously false no basis in fact. Take ad down Ted. Biggest liar in politics! _E_\nObamaCare is one of the greatest threats our country faces. It is unsustainable and will lead America into complete insolvency. _E_\nUkrainian efforts to sabotage Trump campaign quietly working to boost Clinton. So where is the investigation A.G. @seanhannity _E_\n.@Univision cares far more about Mexico than it does about the U.S. Are they controlled by the Mexican government? _E_\nClinton Aides: 'Definitely' Not Releasing Some HRC Emails: __HTTP__ _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_\nCNN anchors are completely out of touch with everyday people worried about rising crime failing schools and vanishing jobs. 
_E_\n.@morning_joe Wow Ticket sales go through the roof after Trump asked to speak at CPAC _E_\nThe New York Times/Bill Carter/Sept.26 2011: On MSNBC meanwhile Lawrence O'Donnell has lost 100000 viewers (cont) __HTTP__ _E_\nThank you New Hampshire!#Trump2016 __HTTP__ _E_\nRepublicans want to fix DACA far more than the Democrats do. The Dems had all three branches of government back in 2008 2011 and they decided not to do anything about DACA. They only want to use it as a campaign issue. Vote Republican! _E_\nA house divided against itself cannot stand. Abraham Lincoln _E_\nChance favors the prepared mind. Louis Pasteur _E_\n.@MeghanMcCain was terrible on @TheFive yesterday. Angry and obnoxious she will never make it on T.V. @FoxNews can do so much better! _E_\nThanks! __HTTP__ _E_\n\"The true competitors are the ones who always play to win.\" – Tom Brady @Patriots _E_\nPeople believe CNN these days almost as little as they believe Hillary....that's really saying something! _E_\n.@alexsalmond @pressjournal @BBCNews RT @DanScavino one would think the photo & caption says it all.... __HTTP__ _E_\nVia @washingtonpost: Donald Trump will speak at CPAC by @rachelweinerwp __HTTP__ @CPACnews @AlCardenasACU @RGreggKeller _E_\nTrump Organization's first project in India Trump Towers Pune will epitomize inspired living and timeless elegance __HTTP__ _E_\nOrder a signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm __HTTP__ _E_\nWe are not retreating we are advancing in another direction. Douglas MacArthur _E_\nMissouri just confirmed #Trump2016 as the official winner with an additional 12 delegates. #MakeAmericaGreatAgain __HTTP__ _E_\nThank you New Hampshire! Great people see you next week! __HTTP__ _E_\nHere's to a safe and happy Independence Day for one and all Enjoy it! Donald J. Trump _E_\nIf the wind will not serve take to the oars. Latin Proverb _E_\nBusy day planned in New York. 
Will soon be making some very important decisions on the people who will be running our government! _E_\nGreat honor Rev. Jerry Falwell Jr. of Liberty University one of the most respected religious leaders in our nation has just endorsed me! _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nThe Republicans must face reality & create a strong & positive immigration policy if not they will continue to lose elections. _E_\nWe are rebuilding other countries while our own country is going to HELL. Time to rebuild the U.S.A.! Tell our stupid politicians ENOUGH _E_\nMust read @ConservReview article by @JeffJlpa1: \"Jeb Bush and the Outsiders\" __HTTP__ _E_\nMy transition team which is working long hours and doing a fantastic job will be seeing many great candidates today. #MAGA _E_\nNorth Korea just stated that it is in the final stages of developing a nuclear weapon capable of reaching parts of the U.S. It won't happen! _E_\nThank you Travis County Texas!#MakeAmericaGreatAgain __HTTP__ _E_\nChina is advocating on behalf of Iran's nuclear program the Chinese oppose both sanctions and any militar... (cont) __HTTP__ _E_\nThe United States made some of the worst Trade Deals in world history.Why should we continue these deals with countries that do not help us? _E_\nHillary Clinton lied when she said that ISIS is using video of Donald Trump as a recruiting tool. This was fact checked by @FoxNews: FALSE _E_\nPeople have been forced to resign positions for far less than @JonahNRO's \"tweeting like a 14 year old girl\" _E_\nVia @BW: Donald Trump Vows to Fight Scottish Wind Farm Plan in Courts __HTTP__ _E_\nThank you South Carolina! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nVia @inventorspot by Myra Per Lee: \"Got A Great Idea? Get Donald Trump To Fund It\" __HTTP__ _E_\nIn the end you're measured not by how much you undertake but by what you finally accomplish! _E_\nNew Hampshire vote today MAKE AMERICA GREAT AGAIN! 
_E_\nJust sat down for a great interview with @PHussionWYFF in Greenville today. Watch at 5pm. An amazing day in South Carolina! #VoteTrumpSC _E_\nMake it special! No better place to celebrate St. Patrick's Day in the Windy City than @TrumpChicago __HTTP__ _E_\nGlad to hear Bella Santorum is recovering. @RickSantorum has a beautiful family. _E_\nI was invited to be with Mitt Romney tonight win lose or draw I'll be there! _E_\nOur very weak and ineffective leader Paul Ryan had a bad conference call where his members went wild at his disloyalty. _E_\nAn: Media fell all over themselves criticizing what DonaldTrump may have insinuated about @POTUS. But he's right: __HTTP__ _E_\nCelebrity Apprentice is nearing the end of a wonderful and very successful season. Watch tonight at 8:00. _E_\nJust received a wonderful letter from a new father who bought his son his first book The Art of the Deal. Great parent! _E_\nThank you Vermont! #Trump2016#SuperTuesday _E_\nAn incredible honor to receive the endorsement of a person Ihave such tremendous respect for. Thank you Sheldon! __HTTP__ _E_\n.@bobvanderplaats is a total phony and con man. When I wouldn't give him free hotel rooms and much more he endorsed Cruz. @foxandfriends _E_\nIt is hard to believe I am winning by so much when I am treated so badly by the media. New @CNN Poll amazing in ALL categories. 21 pt. Lead _E_\nGreat afternoon in Ohio & a great evening in Pennsylvania departing now. See you tomorrow Virginia! __HTTP__ _E_\nFinal #'s just announced in the GREAT State of MO. TRUMP WINS! New certified #'s show a 365 vote increase for me @ least 12 more delegates! _E_\nRecord setting cold and snow ice caps massive! The only global warming we should fear is that caused by nuclear weapons incompetent pols. _E_\nCan you believe we still have not gotten our Marine out of Mexico. He sits in prison while our PRESIDENT plays golf and makes bad decisions! _E_\n.@Morning_Joe just went off the rails. 
I will beat Hillary easily she does not want to run against me. I am tuning them out waste of time _E_\nChina controls North Korea. So now besides cyber hacking us all day they are using the Norks to taunt us. China is a major threat. _E_\nWhy would they announce a finding of the grand jury in Ferguson at 9:00 in the evening a prime time for riots! Not smart. _E_\nRexnord of Indiana made a deal during the Obama Administration to move to Mexico. Fired their employees. Tax product big that's sold in U.S. _E_\nAfter all of these years of suffering thru ObamaCare Republican Senators must come through as they have promised! _E_\nPervert alert! Sexter Anthony Weiner will be running for Mayor of New York City. _E_\nMy @foxandfriends interview discussing my possible GOP endorsement @MittRomney's taxes and the Florida primary. __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nThe weak jokers who so badly hurt great Penn State University should have fought the NCAA instead of making a deal __HTTP__ _E_\nObamaCare is such a national treasure that @BarackObama has waived over 1200 companies from the law __HTTP__ _E_\nSee problems as a mind exercise. Enjoy the challenge and remember to keep focused on your goals. _E_\nNewt attacks on @MittRomney record at Bain an attack on free enterprise and entrepreneurship. Mistake! _E_\nGovernment waste fraud and abuse should be immediately addressed. This will help solve our deficit crisis both short and long term. _E_\nChina steals United States Navy research drone in international waters rips it out of water and takes it to China in unprecedented act. _E_\nWe are what we repeatedly do. Excellence therefore is not an act but a habit. Aristotle _E_\nThank you for your support of my candidacy! #MAGA #ImWithYou __HTTP__ _E_\nWow! I hear you Warren Michigan. Streaming live join us America. 
It is time to DRAIN THE SWAMP!Watch: __HTTP__ _E_\nWe are getting reports from many voters that the Cruz people are back to doing very sleazy and dishonest pushpolls on me. We are watching! _E_\nIn Texas now leaving soon for BIG rally in Florida! _E_\nWith @C_Soules from #TheBachelor in Iowa __HTTP__ _E_\nThe new unemployment numbers are terrible. 522000 more people are out of the labor force to 88419000. __HTTP__ _E_\nHappy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to last Sunday's tragedy. _E_\nI will be making a major statement from the @WhiteHouse upon my return to D.C. Time and date to be set. _E_\nWhen will Pakistan apologize to us for providing safe sanctuary to Osama Bin Laden for 6 years?! Some ally. _E_\nVia @BW: Thomas Jefferson Donald Trump Share Love of Grapes in Virginia __HTTP__ @trumpwinery @EricTrump _E_\nEditorial by @DonaldJTrumpJr in the DailyCaller: Defending Innovation in America __HTTP__ _E_\nI promise that our administration will ALWAYS have your back. We will ALWAYS be with you! __HTTP__ _E_\nRemember go vote we need real change this time. _E_\nRT @FoxBusiness: .@JerryJrFalwell: I was so impressed by [@realDonaldTrump's] speech yesterday. He was the best I've ever seen him. __HTTP__ _E_\n#CrookedHillary is nothing more than a Wall Street PUPPET! #BigLeagueTruth #Debate __HTTP__ _E_\nObama projected a 2012 budget deficit of $557B. It is actually double that at $1.1T __HTTP__ We can't afford four more years. _E_\nAll the guys that said @MittRomney would lose are rapidly coming on board. Mitt will remember the early helpers. _E_\nVia @theblaze: Falwell on Trump: He 'was willing to say publicly' what conservatives said 'privately' __HTTP__ _E_\n\"Had the information (Crooked Hillary's emails) been released there would have been harm to National Security.... Charles McCulloughFmr Intel Comm Inspector General __HTTP__ _E_\nFind out what Success smells like. 
I'll be @Macys Herald Square April 18 5:30pm to sign my new fragrance first (cont) __HTTP__ _E_\n\"Winners embrace hard work.\" @ESPNDrLou _E_\nOn my way to @TrumpSoHo to receive the AAA Five Diamond Award. _E_\nIt's very sad that Republicans even some that were carried over the line on my back do very little to protect their President. _E_\nAll 50 of the WORLD'S TOP 50 PLAYERS will be at TRUMP NATIONAL DORAL on Thursday Sunday for the Cadillac World Golf Championship. _E_\nRT @NWSHouston: Historic flooding is still ongoing across the area. If evacuated please DO NOT return home until authorities indicate it i... _E_\nMy daughter Ivanka is being honored by the Wharton School of Finance with the 2012 Young Leadership Award. Also (cont) __HTTP__ _E_\nRT @FoxNews: Geraldo Blasts 'Fake News' Reports About Trump's Visit to Puerto Rico __HTTP__ _E_\nDo not underestimate the UNITY within the Republican Party! _E_\n'Hillary Clinton Deleted Emails With Her Email Server Technician' __HTTP__ _E_\nChina is buying so many of our companies it's really getting bad. _E_\nAnother historic first under Obama businesses are collapsing faster than they're being formed __HTTP__ New leadership now! _E_\nThe contract to build the ObamaCare website was given to a CANADIAN company for $55 744 081. It then bloated to $292 071067 INCOMPETENCE _E_\nVia @Newsmax_Media by @OwenTew: \"Trump on 2016 Run: I Would Self Fund Appoint Wall Street Experts\" __HTTP__ _E_\nOne of the dumber and least respected of the political pundits is Chris Cillizza of the Washington Post @TheFix. Moron hates my poll numbers _E_\nBaltimore just set a record for the coldest day in March in a long recorded history 4 degrees. Other places likewise. Global warming con! _E_\nLive tweeting during tonight's VP debate...should be a great time _E_\nThank you Faith and Freedom Forum & @UrbandaleSchool. I had a great time in Iowa today! 
__HTTP__ _E_\nWe want our companies to hire & grow in AMERICA to raise wages for AMERICAN workers & to help rebuild our AMERICAN cities & towns! #USA __HTTP__ _E_\nArrived in Palm Beach drove by a gas staion $4.50 a gallon. Result of failed @BarackObama leadership. _E_\nSmall business owners are the DREAMERS & INNOVATORS who are powering us into the future!Read more and watch here: __HTTP__ __HTTP__ _E_\nThank you Colorado Springs. Get out & VOTE #TrumpPence16 in November! __HTTP__ _E_\nTo show you how shallow politicians can be many are jealous of my @CPACnews speaking slot & also their fellow Republicans! Not good! _E_\nRT @DRUDGE_REPORT: Trump: 'Is the Boston Killer Eligible for Obama Care to Bring Him Back to Health?' __HTTP__ _E_\nIt is very sad to see what @BarackObama has done with NASA. He has gutted the program and made us dependent on the Russians. _E_\nIt was great having @ArsenioHall back on this week's @ApprenticeNBC! __HTTP__ _E_\nObama will be going on @theviewtv & fundraising while in NYC for the UN Assembly... _E_\nNot good or smart for Obama to be calling Russia a regional power or to mention the concept of a nuclear weapon going off in NYC. _E_\nRT @realDonaldTrump: ATTN: @HillaryClinton Why did five of your staffers need FBI IMMUNITY?! #BigLeagueTruth #Debates _E_\nI'm leaving now for Ireland Spain Scotland and elsewhere crazy life! _E_\nPeople are LOVING the Trump sign on the Chicago building. Big league tweets letters and calls... _E_\nJust leaving Virginia really big crowd great enthusiasm! _E_\n#2. Be totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_\n\"Destiny has a part to play in your life and in your business so give it a chance to work.\" – Think Like a Champion _E_\nI beat Hillary in the new @FoxNews Poll head to head. SHE HAS NO STRENGTH OR STAMINA both of which are needed to MAKE AMERICA GREAT AGAIN! 
_E_\nThank you to the Robb Report The Best of the Best issue for just naming Trump International Golf Links the Best New Golf Course In World! _E_\nSad to see the history and culture of our great country being ripped apart with the removal of our beautiful statues and monuments. You..... _E_\nThank you for all of the really nice comments and reviews concerning my speech today at the National Press Club. It was my great honor! _E_\nChinese spies stole our F 35 Joint Strike Fighter design __HTTP__ We should offset the cost from our Chinese debt _E_\nNever seen such Republican ANGER & UNITY as I have concerning the lack of investigation on Clinton made Fake Dossier (now $12000000?).... _E_\nCrooked Hillary has zero imagination and even less stamina. ISIS China Russia and all would love for her to be president. 4 more years! _E_\nBernie should pull his endorsement of Crooked Hillary after she decieved him and then attacked him and his supporters. _E_\nJust at a news conference from Trump Turnberry in Scotland. Everybody was there & will be all over television tonite. Back on trail Saturday _E_\n.@StephenBaldwin7 You were fabulous on CNN last night I greatly appreciate your support. Best wishes. _E_\n\"Don't bunt. Aim out of the ball park. Aim for the company of immortals.\" David Ogilvy _E_\nAs usual Hillary & the Dems are trying to rig the debates so 2 are up against major NFL games. Same as last time w/ Bernie. Unacceptable! _E_\nPoliticians are all talk and no action. Bush and Rubio couldn't answer simple question on Iraq. They will NEVER make America great again! _E_\nMore and more reporters are using the word TRUMP when referring to winning just used on Bloomberg News. Gee I wonder why? _E_\n#LawandOrder #ImWithYouVideo: __HTTP__ __HTTP__ _E_\nDowntown Manhattan's trendiest hotel @TrumpSoHo 46 stories of luxurious rooms fine dining & The Spa __HTTP__ _E_\nI don't know if Hillary will be able to run she is a walking time bomb! 
_E_\nOur country and it's leadership has to be so careful and so smart these are treacherous times like no other. The world is a crazy place! _E_\nIt was an honor to be @GretchenCarlson's inaugural guest on her new show 'The Real Story.' Gretchen will be a big success! _E_\nAs Bernie Sanders said Hillary Clinton has bad judgement. Bill's meeting was probably initiated and demanded by Hillary! _E_\n\"I'm a great believer in asking everyone for an opinion before I make a decision. It's a natural reflex.\" – The Art of The Deal _E_\nMy @foxandfriends interview discussing #MissUSA Olivia Culpo the job numbers & the waste of the Obama stimulus __HTTP__ _E_\n.@genesimmons really great job handling the wise guys so easy for you such talent! I won't forget. _E_\n.@NJPGA Club of the Year Trump Nat'l Bedminster is NJ's top family country club with two award winning courses __HTTP__ _E_\nThank you for a great afternoon South Carolina! See you next Tuesday! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nYou mean George Bush sends our soldiers into combat they are severely wounded and then he wants $120000 to make a boring speech to them? _E_\n\"Do not allow fear to settle into place in any part of your life. It is a defeating attitude & a negative emotion Think Like a Champion _E_\nWill Plan B miss Trace? _E_\n.... It is a very effective & commonly used business tool. _E_\nWelcome to the new reality. @BarackObama is now letting China buy US banks __HTTP__ The US government is selling us out. _E_\nCrooked Hillary should not be allowed to run for president. She deleted 33000 e mails AFTER getting a subpoena from U.S. Congress. RIGGED! _E_\nI never did a day's work in my life. It was all fun. Thomas A. Edison _E_\nThe @nytimes purposely covers me so inaccurately. I want other nations to pay the U.S. for our defense of them. We are the suckers no more! 
_E_\nGeorge Steinbrenner would have done a major number on A Rod there is no way he would have gotten paid even with the help of the union! _E_\nLabor Unions Giving Serious Thought to Endorsing Trump via Washington Examiner __HTTP__ _E_\nI will be interviewed on @oreillyfactor tonight at 8:00 P.M. Enjoy! _E_\nNow the Chinese are planning a war game w/ the IraniansSyrians & Russians along Syrian coast. __HTTP__ Laughing at @BarackObama _E_\nWhy doesn't OPEC lower the price of crude to help avert the European crisis? Crude keeps rising during the dow... (cont) __HTTP__ _E_\nMy @FoxNews @TeamCavuto interview discussing the @RNC Convention businesses making products in China and unemployment __HTTP__ _E_\nToday in Bedminster I signed the Harry W. Colmery Veterans Educational Assistance Act of 2017 joined by @DeptVetAffairs @SecShulkin. __HTTP__ _E_\nThe TIME Magazine cover showing late age breast feeding is disgusting sad what TIME did to get noticed. @TIME _E_\nAlways make a total effort even when the odds are against you. Arnold Palmer @KingdomMag _E_\nAs President I will bring jobs back and get wages up for Americans who need it most. __HTTP__ _E_\nBernie Sanders says that Hillary Clinton is unqualified to be president. Based on her decision making ability I can go along with that! _E_\nWind energy is a complete economic disaster.... __HTTP__ @AlexSalmond @AberdeenCC @David_Cameron @Aberdeenshire @ScotParl _E_\nWeakness cow towing and not standing firm is provocative. We are getting pushed around and robbed under this President. _E_\nJust found out that @tedcruz is spending a fortune on Iowa push polls negative to me. Not nice but OK! New polls are great. _E_\nVisiting LA? Be sure to make a reservation at Trump National Golf Club __HTTP__ The #1 public course in the country! _E_\nCongratulations to @bostonpolice on yesterday's successful and safe @bostonmarathon. The entire country is proud. _E_\nThank you Pueblo Colorado! 
#TrumpRally #AmericaFirst __HTTP__ __HTTP__ _E_\nGreat seeing @TheLeeGreenwood and Kimberly at this evenings VP dinner! #GodBlessTheUSA __HTTP__ _E_\nThe @washingtonpost which loses a fortune is owned by @JeffBezos for purposes of keeping taxes down at his no profit company @amazon. _E_\nAlways leave your ego at the door during negotiations. Remember it's only business and there will always be another day! _E_\n.@meetthepress and @chucktodd did a 1 hour hit job on me today – totally biased and mostly false. Dishonest media! _E_\nIs it the same Kaine that took hundreds of thousands of dollars in gifts while Governor of Virginia and didn't get indicted while Bob M did? _E_\nWe are now leading in many polls and many of these were taken before the criminal investigation announcement on Friday great in states! _E_\n.@megynkelly recently said that she can't be wooed by Trump. She is so average in every way who the hell wants to woo her! _E_\nI told everybody the Oscars were no good—Nielsen ratings confirmed one of the lowest ratings in history. _E_\nEntrepreneurs: A winning attitude will put things in perspective. Keep negative thoughts & people where they belong out of the big picture. _E_\nRussian officials must be laughing at the U.S. & how a lame excuse for why the Dems lost the election has taken over the Fake News. _E_\nOur GDP has been growing less than 2% for the last 5 years. ObamaCare will slow us down even more. Has to be repealed. _E_\nMy thoughts and prayers are with the two police officers their families and everybody at the @WestervillePD. __HTTP__ _E_\nThe Oscar Pistorius disaster is a really interesting story to me—a very sad situation for everyone! _E_\nRT @IvankaTrump: The Administration is committed to supporting military spouses in the workforce. Thanks Kim for sharing your story! __HTTP__ _E_\nAmerica needs a tough negotiator not a community organizer. _E_\nWow new polls just out have Trump up and Cruz down he is a nervous wreck! 
_E_\nI am seriously considering Dr. Ben Carson as the head of HUD. I've gotten to know him well he's a greatly talented person who loves people! _E_\nCan you imagine a Canadian company developing our website? Terrible way to put Americans back to work. _E_\nRT @MeetThePress: Watch our interview with @KellyannePolls: Russia did not succeed in attempts to sway election __HTTP__ #... _E_\nIf we did all the things we are capable of we would literally astound ourselves. Thomas Edison _E_\nRT @TeamTrump: .@timkaine's Abortion Flip Flops: From Valuing The Sanctity of Life &gt Pro Abortion Demagogue #VPdebate __HTTP__ _E_\nWhy does Barack Obama's ring have an arabic inscription? __HTTP__ Who is this guy? _E_\nMitt Romney called to congratulate me on the win. Very nice! _E_\nMake sure to follow me on @periscopeco #MakeAmericaGreatAgain _E_\nJust arrived at #ASEAN50 in the Philippines for my final stop with World Leaders. Will lead to FAIR TRADE DEALS unlike the horror shows from past Administrations. Will then be leaving for D.C. Made many good friends! _E_\nBoeing is building a brand new 747 Air Force One for future presidents but costs are out of control more than $4 billion. Cancel order! _E_\nCrooked Hillary wants a radical 500% increase in Syrian refugees. We can't allow this. Time to get smart and protect America! _E_\nWhat will we get for bombing Syria besides more debt and a possible long term conflict? Obama needs Congressional approval. _E_\nVia @SunSentinel by @JoanieCox: \"In Palm Beach nothing trumps the Trump Invitational\" __HTTP__ _E_\nI believe Lance Armstrong had death wish when he did interview w/Oprah—as I predicted everybody is suing him he'll have nothing left _E_\nPeople should be proud of the fact that I got Obama to release his birth certificate which in a recent book he \"miraculously\" found. 
_E_\n\"I have a very strict gun control policy: if there's a gun around I want to be in control of it.\" Clint Eastwood _E_\n.@pastormarkburns You were great last night and we all very much appreciate it! Thank you! _E_\n.@foxandfriends in 5 minutes. _E_\nAfter decades of our leaders allowing China to steal our jobs & R&D the Chinese will 'overtake America' in 2016 ... _E_\nSadly I will no longer be doing @foxandfriends at 7:00 A.M. on Mondays. This is because I am running for president and law prohibits. LOVE! _E_\nIran is threatening to shut the Strait of Hormuz and @BarackObama won't approve the Keystone pipeline. His energy policy makes America weak. _E_\nGolf match? I've won 18 Club Championships including this weekend. @mcuban swings like a little girl with no power or talent. Mark's a loser _E_\nWhen is South Korea going to start paying us for the massive amounts of money we are spending to protect them from the North? _E_\nBought @JohnDeere stock a year ago for old fashioned reason—I love their product and service. _E_\nWill be on @foxandfriends at 7:00 15 minutes! Enjoy. _E_\n\"When you're at a meeting monitor your behavior and work at being an observer – of yourself and others.\" – Think Like a Billionaire _E_\nWhy are we giving China foreign aid? Couldn't the Super Committee have agreed to at least cut that outlay? #TimeToGetTough _E_\nI'm going to the BORDER tomorrow. Will be seeing some really brave people. Look forward to a big day! _E_\nThe reason that Ted Cruz lost the Evangelicals in S.C. is because he is a world class LIAR and Evangelicals do not like liars! _E_\nTaking a helicopter to New Hampshire boarding now. Amazing activity planned. New UMASS poll very nice! __HTTP__ _E_\nOnly 1 mill. dollars @mcuban? Offer me real money and I'd consider it. Your team and networks lose so much money I doubt you have much left! _E_\nJust like @Yankee organization I can't wait for @MLB to suspend A Rod. Will be a great day for the sport. 
_E_\nHow come every time I show anger disgust or impatience enemies say I had a tantrum or meltdown—stupid or dishonest people? _E_\nIraq in political turmoil one day after we leave I told you so. _E_\nMy interview on @theviewtv discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate(starts at 23:00) __HTTP__ _E_\nThe U.S. manufacturing sector has suffered its greatest order losses under @BarackObama. He has stood idle while China steals our jobs. _E_\nBe prepared for a sensational episode of The Apprentice tomorrow night 10 pm on NBC. _E_\nThe new e mail release is a disaster for Hillary Clinton. At a minimum how can someone with such bad judgement be our next president? _E_\nWhat a great group! __HTTP__ With @Schwarzenegger @SammartinoBruno and @TripleH. #WWEHOF _E_\nWikiLeaks: 'Clinton Kaine Even Lied About Timing of Veep Pick' __HTTP__ _E_\nThe rigged Dem Primary one of the biggest political stories in years got ZERO coverage on Fake News Network TV last night. Disgraceful! _E_\n...and says something is seriously wrong. He will never go down as great! _E_\nMichele Bachmann just dropped out of prez race when she didn't do the Newsmax debate it showed great disloyalty and people rejected her. _E_\n.@mcuban Baseball commissioner and owners were smart when they didn't want you to buy a team but I don't think you have the money anyway. _E_\nHe @BarackObama wants 23 years of @MittRomney's tax returns __HTTP__ Let's see BHO's school (cont) __HTTP__ _E_\nISIS exploded on Hillary Clinton's watch she's done nothing about it and never will. Not capable! _E_\nI am encouraged by President Moon's assurances that he will work to level the playing field for American workers b... __HTTP__ _E_\nWho knew this innocent kid would grow into a monster? #TBT #Trump __HTTP__ _E_\nSee what I have to say about the Occupy Wall Street protestors in today's #trumpvlog.... 
__HTTP__ _E_\nI will be ON THE RECORD with Greta Van Susteren @gretawire tonight at 7 pm eastern/FOX News Channel _E_\nDemocracy cannot succeed unless those who express their choice are prepared to choose wisely... _E_\nTAX CUTS will increase investment in the American economy and in U.S. workers leading to higher growth higher wages and more JOBS! __HTTP__ _E_\nNational Black Republican Association Endorses Donald J. Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nObama Clinton inherited $10T in debt and turned it into nearly $20T. They have bankrupted... __HTTP__ _E_\nAmericans by & large hate ObamaCare. They see Obama lied to get it passed. They see big business & gov't got waivers. Defund! _E_\nThank you for the great rallies all across the country. Tremendous support. Make America Great Again! _E_\nSo many people who have children with autism have thanked me—amazing response. They know far better than fudged up reports! _E_\nThank you Sanford Florida. Get out & VOTE #TrumpPence16! #ICYMI watch this afternoons rally here:... __HTTP__ _E_\nAn investment in knowledge pays the best interest. Benjamin Franklin _E_\nI am happy to donate $5 million to a charity Barack Obama chooses. All I am asking is that he is transparent with the American people _E_\nIt is not freedom of the press when newspapers and others are allowed to say and write whatever they want even if it is completely false! _E_\nThank you! #TrumpWon #MAGA __HTTP__ _E_\nAnother example of @BarackObama's diplomatic triumphs he gave the Queen of England an iIPod filled with his speeches. _E_\nWow the respected Monmouth University poll has me ahead of most Republican candidates nationwide and most people don't think I'm running! _E_\nCongratulations to @TrumpSoHo for once again receiving the AAA Five Diamond Award for another year! _E_\nAfter many years of LEAKS going on in Washington it is great to see the A.G. taking action! For National Security the tougher the better! 
_E_\nJust watched @meetthepress and how totally biased against me Chuck Todd and the entire show is against me.The good news the people get it! _E_\nAfghanistan's so called leader Karzai is toying with the U.S. _E_\nThe same people that built the ObamaCare website used as the face of the website someone who is not a US citizen. Incompetent. _E_\nWhen will @CNN get some real political talent rather than political commentators like Errol Louis who doesn't have a clue! Others bad also. _E_\nVery resource rich Canada our neighbor is looking to China for its growth. Just another sad commentary on the U.S. __HTTP__ _E_\nJust got back from Colorado. The love and enthusiasm at two rallies was incredible. Big crowds! _E_\nLet the Arab League take care of Syria. Why are these rich Arab countries not paying us for the tremendous cost of such an attack? _E_\nFull transcript of economic plan delivered to the Economic Club of New York. #MAGA __HTTP__ __HTTP__ _E_\nOrders for U.S. factory goods in March record biggest decline in 3 years __HTTP__ China is eroding the US manufacturing sector. _E_\nWhy haven't they released the final Missouri victory for us yet? Could it be because Cruz's guy runs Missouri? _E_\nWatch – Obama will not fix the illegal immigrant loophole. Instead he will sign another executive action giving more amnesty. _E_\nThank you Iowa! #FITN #IACaucus#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nMy wife Melania will be speaking in Pennsylvania this afternoon. So exciting big crowds! I will be watching from North Carolina. _E_\nSome @OWS protesters are sincere people frustrated with the system others just in for the party. _E_\nAfter seven horrible years of ObamaCare (skyrocketing premiums & deductibles bad healthcare) this is finally your chance for a great plan! _E_\nWhere is the President? It is time for him to come on TV and show strength against the repeated threats from North Korea and others. _E_\nThe F 35 program and cost is out of control. 
Billions of dollars can and will be saved on military (and other) purchases after January 20th. _E_\nThe United States has been reminded time and again in recent years that economic security is not merely RELATED to national security economic security IS national security. It is vital to our national strength. #APEC2017 __HTTP__ _E_\nHonor Memorial Day by thinking of and respecting all of the great men and women that gave their lives for us and our country! We love them. _E_\nWow no longer Saturday delivery from U.S. Postal Service no money our poor poor Country! _E_\nI can't believe that 60 Minutes is right now showing our nuclear facilities for the world to see (at request of U.S. leadership). STUPID! _E_\n..... He knows I don't respect him. _E_\nI would like to offer Vice President Biden my warmest condolences on the loss of his wonderful son Beau. Met him once great guy! _E_\nJoin me on @greta from Indianapolis Indiana at 7pmE! Enjoy! #Trump2016 __HTTP__ _E_\nHost of the 2022 @PGAChampionship & 2017 #USWomensOpen Trump Nat'l Bedminster offers 36 holes of world class golf __HTTP__ _E_\nRT @IvankaTrump: Ivanka is joining @realDonaldTrump to outline an innovative new child care policy to support American families. Tune in to... _E_\nWill @anthonyweiner be fully clothed in his mayoral ads? _E_\nDavid Letterman's show has become so boring and mundane. Somehow every time I look I can't help thinking of (cont) __HTTP__ _E_\nFlattering. Over 500 upset people called Mar a Lago disappointed I am not running for President but Mitt Romney will do a great job. _E_\nI never equated wind farms to the Pan Am Lockerbie disaster only stated that @AlexSalmond should never have released the terrorist BAD! _E_\nMan did JEB throw his brother under the bus last night on @colbertlateshow . Probably true but not nice! _E_\nEnough is Enough no more Bushes! __HTTP__ _E_\n.@DLoesch played great audio from my @CPACnews press conference on her radio show. Glad she made it! 
_E_\nThe Oscars are a sad joke very much like our President. So many things are wrong! _E_\nThank you! __HTTP__ _E_\nThe press is going out of their way to convince people that I do not like or respect women when they know that it is just the opposite! _E_\nFor the NY State Repubs to waste time energy and money on a primary—then go against 3 1 Dems—is insane. _E_\nWell we all did it together! I hope the MOVEMENT fans will go to D.C. on Jan 20th for the swearing in. Let's set the all time record! _E_\nHow did Snowden with not even a high school education get access to top secret U.S. records. He then gave or sold those records traitor! _E_\nMAKE AMERICA GREAT AGAIN!#INPrimary #VoteTrump __HTTP__ _E_\nOn my way to New Hampshire expecting a big and spirited crowd! #FITN #Trump2016 __HTTP__ __HTTP__ _E_\nCrooked Hillary will NEVER be able to handle the complexities and danger of ISIS it will just go on forever. We need change! _E_\nIt is time to #DrainTheSwamp! __HTTP__ _E_\nIs this the New York that Ted Cruz is talking about & demeaning? __HTTP__ _E_\nWow Now experts are calling #Harvey a once in 500 year flood! We have an all out effort going and going well! _E_\n#TrumpTODAY Watch my appearance on the @TODAYshow from this morning __HTTP__ _E_\nThe first book signing at Trump Tower for #TimeToGetTough was so popular that I'm doing another one today from noon to 2pm/Trump Tower _E_\nPersonally I think Douglas Durst's brother got screwed by Douglas—no wonder he's angry! _E_\nFar more killed than anticipated in radical Islamic terror attack yesterday. Get tough and smart U.S. or we won't have a country anymore! _E_\nPlease explain to the dummies at the @WSJ Editorial Board that I love to debate and have won according to Drudge etc. all 11 of them! _E_\nCrazy @megynkelly is now complaining that @oreillyfactor did not defend her against me yet her bad show is a total hit piece on me.Tough! 
_E_\nMy @SteveDeaceShow interview discussing Ebola Obama's incompetence & my trip to Iowa for @SteveKingIA on Sat. __HTTP__ _E_\nJeb Bush and Ted Cruz are not electable presidential candidates Hillary would destroy them. Ted may not be eligible to run born in Canada _E_\nThank you Cleveland. We love you and will be back many times! _E_\nMy @foxandfriends re: the sequestration failure of leadership in DC China playing us & taking over in 2016 __HTTP__ _E_\nIowa was amazing last night. The event could not have worked out better. We raised $6000000 for our great vets. They were so happy & proud _E_\nObamacare is a disaster as I've been saying from the beginning. Time to repeal & replace! #ObamacareFail __HTTP__ _E_\nI see my friend @FlaGovScott is speaking at CPAC. Solid guy wonderful job. #sayfie @marcaputo _E_\nRT @EricTrump: #MakeAmericaGreatAgain __HTTP__ _E_\n\"Be ready for problems you'll have them every day. Keep your focus and be as big as your daily challenges.\" – Trump Never Give Up _E_\nThe U.S. recorded its slowest economic growth in five years (2016). GDP up only 1.6%. Trade deficits hurt the economy very badly. _E_\nAre all the illegals pouring into our country vaccinated? I don't think so. Great danger to U.S. _E_\nI was on CNN yesterday..... __HTTP__ _E_\nI will be interviewed on @foxandfriends at 8:00 A.M. So much to talk about! _E_\nFunny that Jeb(!) didn't want help from his family in his failed campaign and didn't even want to use his last name.Then mommy now brother! _E_\nA very interesting piece by a very good writer @KirstenPowers of @USATODAY and @FoxNews. __HTTP__ _E_\nThanks. __HTTP__ _E_\nJust won IOWA @CNN Poll BIG: Trump 33% Cruz 20% Rubio 11% but @WSJ reported Cruz momentum but nothing about the fact that I easily won! _E_\nRubio lied about my meeting w/ Hispanic activists. I didn't change my opinion but treated them w/ respect. Shame! __HTTP__ _E_\nA simplified tax code would spur economic growth and help create jobs. 
Unfortunately Washington is incapable of simplifying anything. _E_\nVia @Newsmax_Media by \"Poll: Trump Surges Among GOP Hopefuls in NH\" __HTTP__ _E_\nJust got back from South Carolina. Going to Alabama tomorrow! _E_\nJust did an interview with my friend @MarkSimoneNY. Congratulations to Mark on his new show on @WOR710. _E_\nMy sons Don and Eric are right now at Doonbeg in Ireland. There will be nothing like it! _E_\nEntrepreneurs: Keep the big picture in mind. There are always opportunites and possibilities & thinking too small can negate a lot of them. _E_\n.@scottienhughes you were fantastic on CNN. Thank you for the nice words. See you at the #GOPDebate. _E_\nI want talented people to come into this country—to work hard and to become citizens. Silicon Valley needs engineers etc. _E_\nTHE SYSTEM IS RIGGED! _E_\nMichigan Mississippi Idaho & Hawaii: Get out to VOTE and join the movement today! Video: __HTTP__ __HTTP__ _E_\nI worked hard with Bill Ford to keep the Lincoln plant in Kentucky. I owed it to the great State of Kentucky for their confidence in me! _E_\nA 34 story luxury highrise @TrumpParc offers elite amenities with residences that maximize every inch of space __HTTP__ _E_\nEveryone is asking me to cover The Apprentice LIVE on twitter. I will do so. Tonight 9 to 11. IT WILL BE A GREAT EVENING OF TELEVISION! _E_\nFun to watch the Democrats working so hard to win the great State of South Carolina when I just won the Republican version amazing people! _E_\nFox & Friends going on now enjoy! _E_\nSleepy eyes @chucktodd whenever you mention me unfairly I will likewise mention you. _E_\nAdopt the Arts campaign at @fundanything ensures that an underfunded public school has music and arts programs __HTTP__ _E_\nRT @DanScavino: Join President elect Trump LIVE from Mobile Alabama via his #Facebook page! #ThankYouTour2016 Watch: __HTTP__ _E_\nThe outer boroughs of Manhattan are still devasted by Sandy. How would the press cover this if a Republican was President. 
_E_\n\"If you put the federal government in charge of the Sahara Desert in 5 years there'd be a shortage of sand.\" – Milton Friedman _E_\nVia @thehill: Trump warns GOP moving too fast on immigration reform __HTTP__ by @JonEasley _E_\nBetween Libya the national security leaks and Fast & Furious Obama has had more national security scandals than any other President. _E_\nThe Miami Heat looked great tonight congratulations from all of your friends at your favorite place in Miami Trump National Doral. _E_\nWe should have a contest as to which of the Networks plus CNN and not including Fox is the most dishonest corrupt and/or distorted in its political coverage of your favorite President (me). They are all bad. Winner to receive the FAKE NEWS TROPHY! _E_\nThank you @DailyMail for setting the failing @NYTimes story straight. This is what the NYT's should have written! __HTTP__ _E_\nAs a show of support for our Armed Forces I will be going to The Army Navy Game today. Looking forward to it should be fun! _E_\nJust got back from Iowa had a great time with amazing people. Will be back soon! _E_\nDespite the upcoming election the cover of paper thin Time Magazine looks like an ad for the movie Lincoln sad! _E_\nThanks @MickyArison for your nice statement @BLTPrimeMiami @TrumpDoral. I just want to do as well as you have with @MiamiHEAT. See u soon _E_\nNever make a concession during negotiations that could lead to more demands. Be prudent. It's best to have your concessions predetermined _E_\nRT @GregAbbott_TX: Spoke with Pres. Trump & heads of Homeland Security & FEMA. They're helping Texas respond to #HurricaneHarvey. __HTTP__ _E_\nTemperature at record lows in many parts of the country. 50 degrees below zero with wind chill in large area. Global warming folks iced in! _E_\nI believe that Crooked Hillary sent Bill to have the meeting with the U.S.A.G. So Bill is not in trouble with H except that he got caught! 
_E_\nTruly weird Senator Rand Paul of Kentucky reminds me of a spoiled brat without a properly functioning brain. He was terrible at DEBATE! _E_\nWe must change the laws of our land and seek fair but rapid trials for the perpetrators of terrorist acts (Boston) with harsh punishment! _E_\nI've been warning about China since as early as the 80's. No one wanted to listen. Now our country is in real trouble. #TimetoGetTough _E_\nMy daughter Ivanka will be representing me today at the opening of our campaign office in Manchester NH #MakeAmericaGreatAgain! _E_\nThe public is about to learn a lot more information on Barack Obama and his true background in the coming weeks... _E_\nRT @EricTrump: I will be always be incredibly proud of my work for @StJude raising $16.3+ million dollars over the last 10 years at a 9.2%... _E_\nCorey Lewandowski Senior Political Adviser: Mr.Trump has the vision and leadership skills to bring our country back to greatness. _E_\nVia @UrbanTurf_DC: Trump Releases Renderings For Old Post Office Building __HTTP__ _E_\nVia @nydailynews: @IvankaTrump oversees new healthy room service menu at Trump Hotels __HTTP__ _E_\nI will be interviewed on @60Minutes tonight after the NFL game 7:00 P.M. Enjoy! _E_\nPresident Donald J. Trump and @FLOTUS Melania Participate in the Pardoning of the National Thanksgiving Turkey at the White House. __HTTP__ _E_\nWhen do we sue the company for billions that robbed us in creating the hapless ObamaCare website? _E_\nI never made the ridiculous comment about James G. and Obama Care somebody else put it out and attributed it to me. Not my style! _E_\nIf last night's election proved anything it proved that we need to put up GREAT Republican candidates to increase the razor thin margins in both the House and Senate. _E_\nThe polls show that I picked up many Jeb Bush supporters. That is how I got to 46%. When others drop out I will pick up more. Sad but true _E_\nSnowden is showing how weak the U.S. has become. 
_E_\n.@TraceAdkins says @Joan_Rivers is a gem. I agree. We all agree. #CelebApprentice _E_\nExcited to host two great championships at two of our best properties @seniorpgachamp at Trump DC & @pgachampionship at Trump Bedminster _E_\nGreat optimism in America – and the results will be even better! __HTTP__ _E_\n#CrookedHillary is unfit to serve. __HTTP__ _E_\nNow Obama is keeping our soldiers in Afghanistan for at least another year. He is losing two wars simultaneously. _E_\n.@mcuban Mark—nice picture thanks for the invite to the Mavs/Nets game. Next time I'll go and you'll win! _E_\n.@TrumpSoho has just been awarded the AAA Five Diamond Award. Congratulations to the team for this great recognition of their amazing work. _E_\nThank you!! #Trump2016 __HTTP__ _E_\nI'm at Trump Int'l Hotel in Las Vegas tallest/most beautiful building in town. Speaking to another great crowd at Treasure Island (12 noon) _E_\nIdiot @billmaher always forgets to mention that I am suing him to collect the $5M for charity that he expressly offered. _E_\nThe Trump Spa @TrumpNewYork is a serene sanctuary featuring luxurious spa treatment rooms saunas and steam rooms __HTTP__ _E_\nThank you Ted. __HTTP__ _E_\nThank you Sean McGarvey & the entire Governing Board of Presidents for honoring me w/an invite to speak. #NABTU2017... __HTTP__ _E_\nDow Passes 23000 for the First Time Fueled by Strong Earnings #Dow23K📈 __HTTP__ __HTTP__ _E_\nVia @starpulse: Donald Trump Calls Barack Obama 'Incompetent' __HTTP__ _E_\nMeet the amazing mother whose letter I read during my speech. She lost her son to policies supported by Clinton. __HTTP__ _E_\nJust saw Crooked Hillary and Tim Kaine together. ISIS and our other enemies are drooling. They don't look presidential to me! _E_\nAn honor to meet with the Polish American Congress in Chicago this morning! #ImWithYou Video:... __HTTP__ _E_\nJOBS JOBS JOBS! 
__HTTP__ _E_\nVia @IBTimes: Under Fire From Donald Trump Jeb Bush Focuses On 9/11 Even Though Hijackers Got Florida Licenses __HTTP__ _E_\nI am thrilled to share that the Trump Home furniture collection by @doryainteriors just opened a new... __HTTP__ _E_\nTHANK YOU AMERICA!#MakeAmericaGreatAgain __HTTP__ _E_\nI will be interviewed on @foxandfriends at 7:00 A.M. Enjoy! _E_\nMy @foxandfriends interview discussing ObamaCare the Romney Trump fundraiser & my plans for Jones Beach __HTTP__ _E_\nMy beautiful wife Melania will be appearing on QVC this evening from 8 to 9 pm. _E_\nMay God have mercy upon my enemies because I won't General George S. Patton _E_\nMy @TeamCavuto interview re: 2016 the need for leadership in our country Syria & China hacking our military __HTTP__ _E_\nPresidency. Two of my children Don and Eric plus executives will manage them. No new deals will be done during my term(s) in office. _E_\nWow the ALIS just nominated my purchase of Doral in Miami as Transaction of the Year—thanks! _E_\nRe Real Estate: You don't necessarily need the best location. What you need is the best deal... _E_\nJoin me in Westfield Indiana tomorrow night at 7:30pm! #Trump2016 Tickets: __HTTP__ __HTTP__ _E_\nForty seven million now on food stamps. When he came to office there were 32 million. He's added 15 million people. @MittRomney _E_\nBrought to you by @HillaryClinton & her campaign in Chicago Illinois. #BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nThe Tax Cut/Reform Bill including Massive Alaska Drilling and the Repeal of the highly unpopular Individual Mandate brought it all together as to what an incredible year we had. Don't let the Fake News convince you otherwise...and our insider Polls are strong! _E_\nSaying goodbye to some of my great workers at @TrumpDoral in Miami. __HTTP__ _E_\n.@PiersMorgan and @OMAROSA really hate each other. #CelebApprentice _E_\nWe are getting rid of all Glenfiddich garbage alcohol from Trump properties. 
_E_\nControl your own destiny or someone else will. Jack Welch _E_\nA great evening in Springfield Illinois. Thank you for all of the support! #Trump2016 __HTTP__ _E_\nOur thoughts and prayers remain with Bret Michaels and his family and for his speedy recovery. _E_\nBecause of Rodolfo Rosas Moya who owes me lots of money Mexico will never again host the Miss Universe Pageant. _E_\nChina talks about the so called carbon footprint and then behind our leaders backs they laugh. They could (cont) __HTTP__ _E_\nIt has just been confirmed by the City of Mobile Alabama that there were 30000 people at last nights event making it #1for pol season. _E_\nJames Clapper and others stated that there is no evidence Potus colluded with Russia. This story is FAKE NEWS and everyone knows it! _E_\nThroughout my travels I've had the pleasure of sharing the good news from America. I've had the honor of sharing our vision for a free & open Indo Pacific a place where sovereign & independent nations w/diverse cultures & many different dreams can all prosper side by side. __HTTP__ _E_\n\"As someone once put it 'Marriage is the greatest 'anti poverty' program God ever created.'\" #TimeToGetTough _E_\nThe joint statement of former presidential candidates John McCain & Lindsey Graham is wrong they are sadly weak on immigration. The two... _E_\nTHANK YOU America! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIf you are planning to visit the world famous Trump Tower Atrium be sure to come early. During the holiday season it is packed by 10AM _E_\nSenator Tom Cotton was great on Meet the Press yesterday. Despite a totally one sided interview by Chuck Todd the end result was solid! _E_\nI have long given the order to help Argentina with the Search and Rescue mission of their missing submarine. 45 people aboard and not much time left. May God be with them and the people of Argentina! _E_\nI am returning to the Pensacola Bay Center in Florida Friday 9/9/16 at 7pm. Join me! 
__HTTP__ __HTTP__ _E_\nVia @WSJ by @SenAlexander: Wind Power Tax Credits Need to be Blown Away __HTTP__ @alexsalmond _E_\nCaught RED HANDED very disappointed that China is allowing oil to go into North Korea. There will never be a friendly solution to the North Korea problem if this continues to happen! _E_\nThey had no definitive proof against Tom Brady or #patriots. If Hillary doesn't have to produce Emails why should Tom? Very unfair! _E_\nWhat's incredible is that @Obamacare hasn't even kicked in yet and aleady it's doing tremendous damage. (cont) __HTTP__ _E_\nDonald Trump's __HTTP__ Breaks $1M for Comedian Adam Carolla: New crowdfunding site sets record __HTTP__ _E_\nRT @EricTrump: #JournalismIsDead __HTTP__ _E_\nI love the White House one of the most beautiful buildings (homes) I have ever seen. But Fake News said I called it a dump TOTALLY UNTRUE _E_\nThe @nytimes is so dishonest. Their hit piece cover story on me yesterday was just blown up by Rowanne Brewer who said it was a lie! _E_\nIsn't it ridiculous starting today new Ebola screenings go into effect for people coming from West Africa. Just stop the flights dummies! _E_\nWatch #CelebApprentice this Sunday at 9PM ESTon @NBC it has received many 4 star reviews. _E_\nMore questions answered... __HTTP__ #trumpvlog _E_\n.@GStephanopoulos stupidly believes that Hillary wants to run against me because she said so. She says that so people believe it opposite! _E_\nSome of your most popular questions answered in today's video __HTTP__ _E_\nBob & Suzanne Wright co founders of @autismspeaks have done an absolutely fantastic job—two real winners. __HTTP__ _E_\nAfter hearing the news that they would not be able to extort $1M from me they went hostile w/ a series of incorrect & ill informed ads. _E_\nAutism WAY UP I believe in vaccinations but not massive all at once shots. Too much for small child to handle. Govt. should stop NOW! 
_E_\nOn International Women's Day join me in honoring the critical role of women here in America & around the world. _E_\nIsn't it ironic that President Obama of all people is pushing for 'universal background checks?!' _E_\nJust returned from Mississippi a great evening. _E_\nThrowing out the first pitch a few years ago at Fenway in Boston Boston will be better than ever. __HTTP__ _E_\nIn the 1920's people were worried about global cooling it never happened. Now it's global warming. Give me a break! _E_\nPresident Obama must remember that the worst thing you can do in a deal is seem desperate to make it. Be cool move slowly and think! IRAN _E_\n.@GeorgeTakei is doing really well & soon coming to Broadway. _E_\nI will be on @LateNightJimmy tonight. Always have a good time with @jimmyfallon. Now we know he will get high ratings tonight. _E_\nMaybe the millions of people who voted to MAKE AMERICA GREAT AGAIN should have their own rally. It would be the biggest of them all! _E_\nThe Veterans Administration is in shambles and our veterans are suffering greatly. John McCain has done nothing to help them but talk. _E_\nFootball coaches are no longer allowed to scream and yell at their players because it is discriminatoryracist and can be viewed as bullying _E_\nJoin me tomorrow in Plymouth New Hampshire! #FITN #NHPrimary __HTTP__ _E_\nInner city crime is reaching record levels. African Americans will vote for Trump because they know I will stop the slaughter going on! _E_\nIn analyzing the Alabama Primary raceFAKE NEWS always fails to mention that the candidate I endorsed went up MANY points after endorsement! _E_\nGreat news Former Mayor of Dallas Tom Leppert has just endorsed me! Thank you! Tomorrow is a big day VOTE! #VoteTrump #SuperTuesday _E_\nAlmost universal support that Trump won the debate. Only @FoxNews is consistantly fighting the Trump win and I got them the ratings! _E_\nMcAllen Texas 8 miles from U.S. Mexico border. 
#Trump2016 Video: __HTTP__ __HTTP__ _E_\nExperience is the teacher of all things. Julius Caesar _E_\n.@deedeesorvino was GREAT today on @FoxNews She gets what is going on in politics and sees it very clearly. Have her on more! _E_\nAs I stated at the press conference on Friday regarding David Duke I disavow. __HTTP__ _E_\n#LawandOrder #ImWithYouTranscript: __HTTP__ _E_\nOne of the hardest jobs in politics must be cleaning up after @JoeBiden gaffes. I feel sorry for his spokespeople. _E_\nHad a great time going over renovations for Trump National Doral this past weekend. It is going to be amazing. __HTTP__ _E_\nMy interview from yesterday with @seanhannity __HTTP__ _E_\nHAPPY BIRTHDAY to our @FLOTUS Melania! __HTTP__ __HTTP__ _E_\nVery good news—the new Quinnipiac poll just came out—I am #1 in Iowa. _E_\nBy raiding the defense budget to pay for his failed social programs @BarackObama continues to weaken our (cont) __HTTP__ _E_\nMy @gretawire interview discussing the economy unemployment numbers China Charles BarkleyFrance and the election __HTTP__ _E_\nAs I predicted Obama already caught lying on Ocare enrollment # by CBO who's sticking w/ \"6 million enrollments\" __HTTP__ _E_\nRT @billoreilly: A free press is vital to protecting all Americans. A corrupt press damages the Republic. _E_\nRT @EricTrump: #Wisconsin: To find your voting location visit __HTTP__ #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_\n.@VinceMcMahon @MikeTyson @HomerJSimpson I think I'm going to accept the #IceBucketChallenge stay tuned to my Twitter tomorrow.... _E_\nInvincibility lies in the defence the possibility of victory in the attack. Sun Tzu _E_\nThe State Of The Union speech was one of the most boring rambling and non substantive I have heard in a long time. New leadership fast! _E_\nPriorities. While Obama wastes billions on a broken website he is going to cut military pay __HTTP__ No surprise. _E_\nWill be heading over to the debate soon. 
Can you believe @CNN is milking it for almost 3 hours? Too long too many people on stage! _E_\nI am speaking today at the National Press Club totally sold out and will then be inspecting The Old Post Office on Pennsylvania Avenue! _E_\nMy @SquawkCNBC interview discussing @MittRomney's pick of @PaulRyanVP how to frame Medicare debate & @RNC convention __HTTP__ _E_\nI am very proud of Ivanka! _E_\nEntrepreneurs: Business is a creative endeavor. Being innovative = being open to new ideas. Keep an open mind! _E_\nWe're all thinking of you @SteveScalise! #TeamScalise __HTTP__ _E_\nGreat to be back in Iowa! #TBT with @JerryJrFalwell joining me in Davenport this past winter. #MAGA __HTTP__ _E_\nI will be interviewed on @GMA Good Morning America at 7:00 A.M. @ABC will be announcing new poll numbers. MAKE AMERICA GREAT AGAIN! _E_\nToday I introduced my Contract with the American Voter our economy will be STRONG & our people will be SAFE.... __HTTP__ _E_\n\"Success in golf depends less on strength of body than upon strength of mind and character.\" Arnold Palmer _E_\nCheck out Ivanka's new FaceBook page and keep up with what's happening from The Celebrity Apprentice to jewlery to free tickets and more.. _E_\nRT @SarahPalinUSA: Trading in the beautiful snow of Iowa for the red dirt of Oklahoma as planned despite what the media is try's no... __HTTP__ _E_\n...healthcare plan is on its way. Will have much lower premiums & deductibles while at the same time taking care of pre existing conditions! _E_\nThe @BarackObama campaign took in $39M in May but spent $44.6M. Sound familiar! _E_\nWhen will Obama next go on vacation if he wins the election? The day after. _E_\nIt's a shame to hear that the @dcexaminer is failing. No one wants the paper even if it is being handed out for free. _E_\nPresident Obama we need to protect our closest ally Israel. The situation in the Middle East is at a tipping point. _E_\nThe owner of California Gold just made a jerk (fool) out of himself. 
Just smile and congrat the winner. His wife was visibly embarrassed! _E_\nVia @SaintPetersblog by @MitchEPerry: \"Shock poll: Donald Trump leads Jeb Bush 26 20% ... in Florida\" __HTTP__ _E_\nMy acquisition of the Doral in Maimi will be a major success for the Trump Organization. The re building is on schedule. _E_\nFirst Minister @AlexSalmond will be destroying the beauty of Scotland with his insane desire for bird killing wind turbines. _E_\nContractors can blame Obama admin all day for their $600M failure but both parties are at fault pay taxpayers back. _E_\n'Donald Trump leads Hillary Clinton by 19 points among military veteran voters: poll' #AmericaFirst #MAGA __HTTP__ _E_\nThe U.S. accidentally air dropped a large shipment of military weapons and supplies right into the middle of ISIS as enemy laughs! Very sad! _E_\nWe traveled the world to strengthen long standing alliances and to form new partnerships. See more at:... __HTTP__ _E_\nThank you Hershey Pennsylvania. Get out & VOTE on November 8th & we will #MAGA! #RallyForRiley #ICYMI watch here... __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nOverlooking Central Park @TrumpNewYork brings both glamor and prestige to your Five Diamond hotel stay __HTTP__ _E_\nI will be interviewed tonight on @seanhannity Enjoy! 10:00 P.M. _E_\nWhen written in Chinese the word 'crisis' is composed of two characters. One represents danger and the other represents opportunity. JFK _E_\nMy interview with @ThisWeekABC w/@GStephanopoulos destroyed all Sunday competition w/ 2.52M total viewers...that's why they want me on! _E_\n... while Tom Brady is guilty because he REPLACED his LEGAL cellphone? _E_\nRT @JoeNBC: Trump +15 on Cruz in 2 weeks. Cruz may look back and ask why he ever attacked Trump. DT has killed him ever since. __HTTP__ _E_\nI think @TheRevAl should take this challenge. Axelrod was too scared. RT: @RonKaufmanIntrn: Kaufstache vs. Sharpstache. _E_\nWaste @BarackObama's Dep. 
of Energy was warned in advance by Treasury that it wasn't loaning $ out in good deals __HTTP__ _E_\nI hope Oprah gives Lance Armstrong 100 million dollars because that's what that ridiculous interview will cost him! _E_\nWinners never quit and quitters never win. Vince Lombardi _E_\n.@mcuban Shark Tank was shoved to Friday evening Friday is considered \"dead television.\" Besides you are not the star (& never will be). _E_\nGold just set another record high on price with the largest physical gold sales on record __HTTP__ Inflation is coming... _E_\n.@JustinRose99 Great playing we are proud of you! _E_\nObama keeps saying that he will do something but why hasn't he done it? It's all talk. _E_\nSuccess breeds success. The best way to impress people is through results. Think Like a Billionaire _E_\nJust watched Senator John Barrasso on @FoxNews He was great! Thank you John. _E_\nCongratulations to @Boston_Police @FBIBoston & all emergency first responders & doctors for their excellent work under fire yesterday _E_\nThe Republicans can absolutely win if they stick together but they are NOT sticking together. Sen. McCain just said we can't win .Very bad! _E_\nGoofy Elizabeth Warren Hillary Clinton's flunky has a career that is totally based on a lie. She is not Native American. _E_\n__HTTP__ Countdown to @AmericaNowRadio as my former _E_\n _E_\nSenate concludes \"Benghazi could have been prevented\" __HTTP__ _E_\nThe Democrats in the Super Committee want to raise taxes first in deficit talks. Huge mistake. Cut wasteful spending first. _E_\nI commend @DrZuhdiJasser for defending the NYPD and Commissioner Kelly. The NYPD has done outstanding work in defending NYC from attacks. _E_\nChicago murder rate is record setting 4331 shooting victims with 762 murders in 2016. If Mayor can't do it he must ask for Federal help! _E_\n... at Madison Square Garden followed by a ceremony with 80000 people at MetLife Stadium Wrestlemania. 
_E_\nWatch Celebrity Apprentice on Sunday at 9 pm on NBC we're winding up for a terrific finale. What a season! __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n\"Know from inside out that you have the power to succeed and you will. That's taking control.\" – Think Like a Champion _E_\n#TrumpVine on ObamaCare website __HTTP__ _E_\nMy @foxandfriends int. on IRS targeting Tea Party the Benghazi death scandal & @TraceAdkins v. @pennjillette finale __HTTP__ _E_\nThe people that gave you global warming are the same people that gave you ObamaCare! _E_\nTogether we're going to restore safety to our streets and peace to our communities and we're going to destroy the vile criminal cartel #MS13 and many other gangs...'Hundreds arrested in MS 13 crackdown' __HTTP__ __HTTP__ _E_\nIran toys with U.S. days before we pay them ridiculously billions of dollars. Don't release money. We want our hostages back NOW! _E_\nMy @foxandfriends interview discussing the Make America Great Again Texas filing and the Iowa caucus __HTTP__ _E_\n.@greta: Look forward to watching Greta's interview tonight at 7.00 p.m. with Marine Andrew Tahmooressi. #Marinefreed _E_\nThank you Greta. __HTTP__ _E_\nGain and use information to your advantage see every day as an opportunity to learn. _E_\nURGENT: we've just announced a $2 million fundraising goal tonight. Please stand with us! __HTTP__ __HTTP__ _E_\nBig defeat last night in Nevada for Ted Cruz and Marco Rubio. @KarlRove on @FoxNews is working hard to belittle my victory. Rove is sick! _E_\n.@AnnCoulter has been amazing. We will win and establish strong borders we will build a WALL and Mexico will pay. We will be great again! _E_\nThe US Air Force won the war in Libya to clear the way for Islamic Extremist control of Libya. _E_\nNeeded: Leaders who negotiate smart trade deals.Only one knows The Art of The Deal. Let's Make America Great Again! __HTTP__ _E_\nHope and Change? Job numbers down. 
Time for @MittRomney _E_\nI'll be doing Fox and Friends this morning at seven. _E_\n\"Sometimes you have to take a half step back to take two forward.\" @VinceMcMahon _E_\nMany people talking with much agreement on my Iran speech today. Participants in the deal are making lots of money on trade with Iran! _E_\nVia @CraveOnline: Donald Trump is NOT A Rod fan __HTTP__ _E_\nNever bet against Bob Kraft Bill Belichick or Tom Brady! @Patriots _E_\nRT @RSBNetwork: We are ALREADY LIVE in Everett WA for the Trump Rally. Come join us our cameras tonight! #TrumpinEverett __HTTP__ _E_\nALWAYS TRY TO LEARN FROM OTHER PEOPLES MISTAKES NOT YOUR OWN IT IS MUCH CHEAPER THAT WAY! _E_\nIf I'm the third most envied man in America the small group of haters and losers must be nauseas. _E_\nVia the New York Times __HTTP__ _E_\nOn 9/11 we pray for the victims and their families of the attack and give thanks to all who have sacrificed for justice & our freedom. _E_\nI will be on @foxandfriends at 7:00 A.M. ENJOY! _E_\nOther networks are begging me to do a show I can't because I'm doing the Apprentice! _E_\nThe Chinese talk of climate change and carbon footprint but don't clean up their factories but they sell us the equipment to clean up ours! _E_\nRosie O'Donnell just said she felt shame at being fat not politically correct! She killed Star Jones for weight loss surgery just had it! _E_\nThe tournament at Trump National Doral was much more exciting than what is going on now! _E_\nThe polls have shown that DEAD PEOPLE voted for President Obama overwhelmingly and without hesitation he must be doing something right! _E_\nShocking two of @BarackObama's largest campaign bundlers are directly linked to Solyndra __HTTP__ What a coincidence! _E_\nI will be on @seanhannity tonight at 10 PM @FoxNews. #Hannity _E_\nTerrible CBO forecast for 2013 1.4% GDP growth and 7.5%+ unemployment (really 17%+) __HTTP__ You get what you vote for! 
_E_\nConsumer spending is continuing to fall with weak June numbers. @BarackObama's policies have created a climate (cont) __HTTP__ _E_\nAfter being forced to apologize for its bad and inaccurate coverage of me after winning the election the FAKE NEWS @nytimes is still lost! _E_\nRT @RepKristiNoem: A lot of tough decisions got us to this point but we're closer than we've been in 30+ years to a fairer tax code that k... _E_\n#TrumpVine A message for my hotel guest @MileyCyrus __HTTP__ _E_\nI just realized that if you listen to Carly Fiorina for more than ten minutes straight you develop a massive headache. She has zero chance! _E_\n.@FoxNews has been treating me very unfairly & I have therefore decided that I won't be doing any more Fox shows for the foreseeable future. _E_\nShock even more @BarackObama solar corruption. @VPBiden's chief of staff's firm got biggest DOE loan. __HTTP__ _E_\nIn November I think the people of Ohio will remember that the Republicans picked Cleveland instead of going to another state. Jobs! _E_\nBy the way where is @Oprah? Good question. 4 years ago she strongly supported Obama now she is silent. Anyway who cares I adore Oprah. _E_\nTom Brady played great today. He is a total champ and a really nice guy a rare combination! _E_\n.@nbcnightlynews (Brian Williams anyone?) says women warriors are every bit as tough as the guys. Just think about that statement! _E_\nI will be making my Supreme Court pick on Thursday of next week.Thank you! _E_\nI have a lawsuit in Mexico's corrupt court system that I won but so far can't collect. Don't do business with Mexico! _E_\nHe @BarackObama said it would be 'unprecedented' if the USC rules that ObamaCare is unconstitutional. It was (cont) __HTTP__ _E_\nMy @NewsRadio610 int. w/@JackHeathRadio discussing Nickey S. Loeb 1st Amendment Awards Dinner & @SenScottBrown __HTTP__ _E_\nRT @Team_Trump45: @realDonaldTrump __HTTP__ _E_\nWhat about all of the contact with the Clinton campaign and the Russians? 
Also is it true that the DNC would not let the FBI in to look? _E_\n.@scottienhughes Keep up the great work Scottie. Polls are best ever! _E_\nCongratulations to all the Trump 2012 #MissUniverse contestants who came from across the world. You did great and made us all proud! _E_\nStop the flights! __HTTP__ _E_\nThank you Green Bay Wisconsin! Governor @Mike_Pence and I will be back soon. #TrumpPence16 #MAGA __HTTP__ _E_\nCongratulations to @AnnDRomney on delivering a knock out speech last night. America can't wait to call her our First Lady. _E_\nEveryone should calm down. @BenAffleck is going to do a great job as Batman. _E_\nCongratulations to @TrumpPanama for being named one of the \"Best of +VIP Access\" hotels for 2014 by @Expedia! __HTTP__ _E_\n$25 Million+ raised online in just one week! RECORD WEEK. #DrainTheSwamp Today we set a bigger record. Contribute &gt __HTTP__ _E_\nIt is now clear that the Embassy attack in Libya was a coordinated Al Qaeda operation and not based on some video. _E_\nSo many people are agreeing with me on not creating a highway for Ebola to the U.S. Started in small area of Africa and now spreading fast _E_\n\"Our Constitution was made only for a moral & religious people. It is wholly inadequate to the government of any other.\" John Adams _E_\nBe sure to check out the new projects @fundanything __HTTP__ Giving away money! _E_\nAfter these spirited primaries are over @GOP must be fully united for November. If we take the Senate we stop Obama's agenda. _E_\n.@WayneNewtonMrLV Wayne such a pleasant surprise so nice. Thank you very much. _E_\nAfter raising w/ no obligation almost $6M for Vets I couldn't believe protesters formed @ Trump Tower. JUST OUT SENT BY CROOKED HILLARY! _E_\nAs gas prices keep rising @BarackObama won't approve Keystone. Instead he is pushing algae yes algae as an (cont) __HTTP__ _E_\nIt's Tuesday. @AGSchneiderman is wearing Revlon eyeliner today. Governor Cuomo alerted all to this. 
_E_\nExcited to speak at tomorrow night's @ocrp Lincoln Day dinner in Michigan \"All time sales record over 2000.\" __HTTP__ _E_\nGreat book just out A Place Called Heaven by Dr. Robert Jeffress A wonderful man! _E_\nGlad to hear @EWErickson has moved over to @FoxNews. Erick is a sharp political analyst. _E_\nI never asked Comey to stop investigating Flynn. Just more Fake News covering another Comey lie! _E_\nGreat Army Navy Game. Army wins 14 to 13 and brings home the COMMANDER IN CHIEF'S TROPHY (last time was 1996). Wow! Congratulations! _E_\nCongratulations John! __HTTP__ _E_\nThe big golf course project on the water by the Whitestone Bridge in NYC that has been under construction for many years now complete GREAT! _E_\nObama wants taxes to go up so he can take credit for lowering them next year. _E_\nRT @narendramodi: Had a wonderful meeting with @IvankaTrump advisor to @POTUS and leader of the US delegation at the @GES2017. __HTTP__ _E_\nGetting ready for my big foreign trip. Will be strongly protecting American interests that's what I like to do! _E_\nI will be interviewed on the @TODAYshow at 7:00 A.M. this morning. Enjoy! _E_\nSpitzer failed as A.G. failed as Governor in disgrace and was fired on all T.V. shows (boring and zero ratings) and he's at it again! _E_\n...you can enhance location through promotion and work. _E_\nThe @ABC poll sample is heavy on Democrats. Very dishonest why would they do that? Other polls good! _E_\nRT @DanScavino: Back to Cincinnati Ohio this Thursday (12/1/16) at 7pm for #PEOTUS @realDonaldTrump's #ThankYouTour2016! Join us! __HTTP__ _E_\nTO ALL AMERICANS #HappyNewYear & many blessings to you all! Looking forward to a wonderful & prosperous 2017 as we... __HTTP__ _E_\nDoes anyone notice how the Montana Congressional race was such a big deal to Dems & Fake News until the Republican won? V was poorly covered _E_\nBe careful – sexting pervert Anthony Weiner is upping his campaigning. When will new pictures be released? 
_E_\nWhy are people giving money to Karl Rove when he just wasted $400M without any victories? Use your head. _E_\nVia The Hill No Tickets Left for Trump's Dallas Rally __HTTP__ _E_\nWill be interviewed on @GMA this morning at 7:00. Thanks for the GREAT poll results! _E_\n.@BarackObama's college application would be very very very very interesting! _E_\n. @chrislhayes replaced @edshow on @msnbc to increase ratings. It's a shame Chris' are even worse. Sad to see. _E_\nChampion @bretmichaels triumphantly returns to 13th season of All Star @CelebApprentice. Spoiler – Bret is back to his winning ways. _E_\n#LasVegasStrong #USA __HTTP__ _E_\nFor the great people of Iowa find your #IACaucus location at __HTTP__ so important to vote! #MakeAmericaGreatAgain _E_\nLooking forward to a press conference today about @adamcarolla on @fundanything movie project #roadhard __HTTP__ _E_\nThe @BarackObama administration is pressuring contractors to fix job loss estimates from environmental regulations __HTTP__ _E_\n.@StephenBaldwin7 shines in the record 13th season of 'All Star' @CelebApprentice. The Baldwin clan will be proud of Stephen. _E_\nAberdeen tourism is booming because of my great Scottish golf club. _E_\nIs PM Cameron a dummy? With monumental cuts in UK spending how come he continues to spend billions of pounds ... _E_\nGET READY!! The #TrumpFerryPoint tee sheet opens TODAY @ 10am EST on our website for April 1st 30th! @TrumpFerryPoint _E_\nVia @fitsnews: Donald Trump Knows How To (Tea) Party THE DONALD PLANS SPLASHY LANDING IN MYRTLE BEACH S.C. __HTTP__ _E_\nAs ISIS and Ebola spread like wildfire the Obama administration just submitted a paper on how to stop climate change (aka global warming). _E_\nJust spoke w/ Governors Rick Scott of Florida Kenneth Mapp of the U.S. Virgin Islands & Ricardo Rosselló of Puerto Rico. WE ARE W/ YOU ALL! __HTTP__ _E_\nGeorge Steinbrenner was a great friend and a true legend. There will never be anyone like him in New York. 
We've lost a truly great man. _E_\nLightweight Senator Marco Rubio is polling very poorly in Florida. The people can't stand him for missing so many votes poor work ethic! _E_\nCongratulations to @MikeTyson on the success of his new book Undisputed Truth & @HBO special and thanks for the nice words Mike. _E_\nWinner of the 5 Star Diamond Award @TrumpGolfDC's two courses grace over 600 acres on the Potomac River __HTTP__ _E_\nShould be interesting but too bad the three guys at《1% will be taking up so much time but who knows maybe a star will be born (unlikely) _E_\nFollow @TrumpNH for all the updates on my New Hampshire political activities. Looking forward to returning to the Granite State on May 14! _E_\n.@SenScottBrown is the most competitive GOP option against Obama's amnesty loving @SenatorSheehan. He can win! _E_\n.@TraceAdkins presents the NJ Coast Red Cross a $40000 check for Sandy Relief. You can tell he's very pleased about that & rightly so. _E_\nWe should be concerned about the American worker & invest here. Not grant amnesty to illegals or waste $7B in Africa. _E_\nVia @BreitbartNews by Steve Bannon: \"'TIME TO GET TOUGH': TRUMP'S BLOCKBUSTER POLICY MANIFESTO\" __HTTP__ _E_\nEntrepreneurs: Believe in yourself. If you don't no one else will either. _E_\nShouldn't George Will have to give a disclaimer every time he is on Fox that his wife works for Scott Walker? _E_\n.@BillMaher needs to cut back on the pot and maybe he will stop making offers he can't afford. _E_\nThank you. __HTTP__ _E_\nClub for Growth letter trying to extort $1000000.00 from me. Remember I said NO! __HTTP__ _E_\nMy list of potential U.S. Supreme Court Justices was very well recieved. During the next number of weeks I may be adding to the list! _E_\nInteresting to watch Senator Richard Blumenthal of Connecticut talking about hoax Russian collusion when he was a phony Vietnam con artist! 
_E_\nI believe in the America that never gives up never stops striving never ceases believing in itself. @MittRomney 11.2.12 _E_\nLightweight @JebBush said tonight he didn't know his family used private eminent domain in Texas Lie! #GOPDebate _E_\n... collusion which doesn't exist. The Dems are using this terrible (and bad for our country) Witch Hunt for evil politics but the R's... _E_\nToday will be a Super Tuesday for @MittRomney he will win over 220 delegates from states across every region. He will be the nominee. _E_\nWow @CNN got caught fixing their focus group in order to make Crooked Hillary look better. Really pathetic and totally dishonest! _E_\nJoin me live in Springfield Ohio! __HTTP__ _E_\nI hate what has happened to the once great @CNN. _E_\nNelson Mandela and myself had a wonderful relationship he was a special man and will be missed. __HTTP__ _E_\nI'm urging my friends in Brooklyn to vote for Bob Turner tomorrow send @barackobama a message. _E_\nSuccess seems to be connected w/ action. Successful people keep moving. They make mistakes but they don't quit. Conrad Hilton _E_\nRT @EricTrump: What a scary statistic! Americans are working harder and making less! We need competent leadership! __HTTP__ _E_\nWithout passion you don't have energy without energy you have nothing. Nothing great in the world has been accomplished without passion! _E_\nEverything comes to him who hustles while he waits. Thomas Edison _E_\n#IACaucus 2/1/2016 6:30pm#MakeAmericaGreatAgain!Iowa caucus finder: __HTTP__ #GOPDebate __HTTP__ _E_\nGreat list of spring travel ideas from our @TrumpCollection properties: __HTTP__ _E_\nWhy doesn't President Obama just get the people from Google to fix the failed website. In fact why didn't he use them in the first place! _E_\nA beautiful funeral today for a real NYC hero Detective Steven McDonald. Our law enforcement community has my complete and total support. 
_E_\nGlad to see 9 more Iraq and Afghan war veterans joining the next Congress __HTTP__ They deserve to be there! _E_\nWatch @IvankaTrump show you how easy it is to #CaucusForTrump in Iowa! #IACaucus Video: __HTTP__ __HTTP__ _E_\nI hope @boyscouts of America handle their problems a lot better than the board at Penn State did. You can't do any worse! _E_\nJoin me in Houston Texas tomorrow night at 7pm! Tickets: __HTTP__ __HTTP__ _E_\nLooking forward to being the guest of honor at @ralphreed's @FaithandFreedom Patriot's Award Gala Dinner in Washington DC _E_\nWe're going to cut taxes BIG LEAGUE for the middle class. She's raising your taxes and I'm lowering your taxes! __HTTP__ _E_\nWhy do the losers & haters always say I wear a \"wig\" when they know I don't. Like it or not it's all mine—just ask Barbara Walters. _E_\nMitt Romney didn't show his tax return until SEPTEMBER 21 2012 and then only after being humiliated by Harry R! A bad messenger for estab! _E_\nThank you Georgia!#SuperTuesday #Trump2016 _E_\nCongratulations to our new Miss USA the beautiful Rima Fakih. Rima will represent us well at Miss Universe and be a wonderful Miss USA . _E_\nHillary when you complain about a penchant for sexism who are you referring to. I have great respect for women. BE CAREFUL! _E_\n.@CNN poll just hit 49% for Trump. Interesting how my numbers have gone so far up since lightweight Marco Rubio has turned nasty. Love it! _E_\nSo with all of the Obama tough talk on Russia and the Ukraine they have already taken Crimea and continue to push. That's what I said! _E_\nThank you Arizona! #Trump2016 #WesternTuesday #TrumpTrain __HTTP__ _E_\n\"Donald Trump 2016: 7 Political Stances of GOP Presidential Hopeful\" __HTTP__ via @Newsmax_Media _E_\n2nd segment of my @seanhannity @FoxNews interview discussing @billmaher's insult of parents and sending him $5M bill __HTTP__ _E_\nThank you for your strong testimony when welcoming me to Liberty University yesterday @JerryJrFalwell. 
__HTTP__ _E_\nOne year ago I started calling President Obama INCOMPETENT and people thought it was too tough. Tonight everyone is using that word! _E_\nWow did the @nytimes fall into the Bush trap where his people convinced them how happy he was that I was hurting other candidates & not him _E_\nThis is all about American weakness and an incompetent President. _E_\nJust met with courageous family of Sarah Root in Nebraska. Sarah was horribly killed by illegal immigrant but leaves behind amazing legacy. _E_\nA great event in Las Vegas Nevada! __HTTP__ _E_\nRemember when failed candidate @JebBush said that illegals came across the border as AN ACT OF LOVE? He's spent $59 million and is at 3%. _E_\nGoodnight everyone sleep tight! _E_\nEntrepreneurs: Believe in yourself! If you don't no one else will either. _E_\nBe passionate. If you love what you're doing success will follow. _E_\nThe @MittRomney fundraiser last night was a tremendous success. _E_\nVia @BreitbartNews by @pamkeyNEN: \"Trump: ObamaCare Not Working for Business Going to Collapse\" __HTTP__ _E_\nThe reason Ed Schultz said nice things about me is that I'm the only Repub who won't cut Social Security etc. I'll make America rich again! _E_\nI just won a big Court decision (N.Y. Post) against some character who claimed I owed him licensing fees on success of my shirts and ties. _E_\n\"Live Free or Die.\" – motto of New Hampshire _E_\nDems warn not to underestimate Trump's potential win __HTTP__ _E_\n30000 illegal immigrants with CRIMINAL RECORDS were released last year by our wonderful though highly incompetent government. So stupid! _E_\nWhat would you choose Vampires or Cavemen? #CelebApprentice _E_\n\"Circumstances are beyond human control but our conduct is in our own power.\" Benjamin Disraeli _E_\nMany on the team and staff of Bernie Sanders have been treated badly by the Hillary Clinton campaign and they like Trump on trade a lot! 
_E_\nDopey @Lord_Sugar People are calling in saying you are being beaten badly w/ the tweets... _E_\nIf you don't treat yourself like royalty no one else will. @TrumpWaikiki is Honolulu's most luxurious hotel __HTTP__ _E_\nAfter seven years of talking Repeal & Replace the people of our great country are still being forced to live with imploding ObamaCare! _E_\nVia @newsobserver by @RaleighReporter: In Raleigh Donald Trump all but announces presidential bid __HTTP__ _E_\n#CelebApprentice With three wonderful but fired contestants __HTTP__ _E_\nThe Supreme Art of war is to subdue the enemy without fighting. Sun Tzu _E_\nI don't believe the Democrats really want to see a deal on DACA. They are all talk and no action. This is the time but day by day they are blowing the one great opportunity they have. Too bad! _E_\nNever confuse a single defeat with a final defeat. ― F. Scott Fitzgerald _E_\n.@HollySandersGC. Remember it was Martin K who sank the big ten footer to win the Ryder Cup. He can handle the pressure! _E_\nStrange but I see wacko Bernie Sanders allies coming over to me because I'm lowering taxes while he will double & triple them a disaster! _E_\nLance Armstrong's liability & lawsuits against him have just increased tenfold—his lawyers will be very happy—lots of fees! _E_\nThere is only one fix for ObamaCare REPEAL & REPLACE with a free market oriented alternative! _E_\nI will be making a major announcement tomorrow (Thursday February 2) at 12:30 pm at Trump International Hotel & (cont) __HTTP__ _E_\nJerry Falwell of Liberty University was fantastic on @foxandfriends. The Fake News should listen to what he had to say. Thanks Jerry! _E_\nThinking small when you could think big limits you in all aspects of your life. _E_\nIn the end you're measured not by how much you undertake but by what you finally accomplish. _E_\nA great guy (with great ratings)! __HTTP__ _E_\nToday I filed my Statement of Candidacy with the FEC. 
Let's #MakeAmericaGreatAgain __HTTP__ _E_\nThe boardroom and @WrestleMania I'm watching great entertainment tonight! #CelebApprentice _E_\nRoadway steel on beautiful Verrazano Narrows Bridge is rusting and rotting away. Scrape and paint before too late. _E_\nThank you John Nolte for wonderful analysis & reporting. _E_\nVia @haaretzcom: \"Donald Trump calls Obama Israel's greatest enemy\" __HTTP__ _E_\nI am so happy that people are boycotting Macy's __HTTP__ _E_\nIt's 4.35 a.m. and I am working on a very exciting (and hopefully very good) deal a major resort. THE HARDER I WORK THE LUCKIER I GET! _E_\nEntrepreneurs: See each day as an opportunity to show what you can do at the highest level. Take responsibility for yourself! _E_\nI am landing shortly. Can't wait to be with our GREAT MILITARY. See you soon! __HTTP__ _E_\nEven though I am not mandated by law to do so I will be leaving my busineses before January 20th so that I can focus full time on the...... _E_\nIran is closing the Strait of Hormuz for a military exercise. Imagine what they will do with nukes?! _E_\nFrom Donald Trump: Andrea Bocelli @ Mar a Lago Many say best night of entertainment in long history of Palm Beach __HTTP__ _E_\nThe immigration crisis is a horrible mess made worse by an incompetent president who doesn't have a clue. We need new leadership FAST! _E_\nBecause of the hurricane I am extending my 5 million dollar offer for President Obama's favorite charity until 12PM on Thursday. _E_\nWhen times are difficult you must be even more focused. That's when you will find profitable opportunities. _E_\nNo surprise. @BarackObama is letting the Muslim Brotherhood in Egypt default on their US loans __HTTP__ Big mistake! _E_\nVia @BreitbartNews @biggovt by @mboyle1: \"EXCLUSIVE: NEVER AIRED 'APPRENTICE' PARODY OF TRUMP FIRING OBAMA\" __HTTP__ _E_\nA great night in Fayetteville North Carolina. Thank you! #ICYMI watch here: __HTTP__ __HTTP__ _E_\n...that it was hard not to end up rooting for Trump... 
_E_\nI don't know why the @yankees keep paying A Rod—they have a perfect out. _E_\nAdditionally two executives @VattenfallGroup are under major investigation & they are unable to get the many permits necessary. _E_\nLightweight choker Marco Rubio looks like a little boy on stage. Not presidential material! _E_\nWow Putin is really taking advantage of President Obama. It is important that Obama responds with strength and determination be smart cool! _E_\nOff shore windfarms being abandoned in droves throughout world—too expensive to build & operate—don't work. (cont) __HTTP__ _E_\nKeep an eye on Anthony Weiner. Weasels are hard to get rid of. _E_\nThis is such a special time to be in New York City. No better city in the world to celebrate Christmas! _E_\nObamacare continues to fail. Humana to pull out in 2018. Will repeal replace & save healthcare for ALL Americans. __HTTP__ _E_\nWith the debt limit approaching @GOP has even more leverage. If they stay united and on message they can win. _E_\nThank you Greensboro North Carolina! Will be back soon! #AmericaFirst __HTTP__ _E_\nThe difference between a successful person and others is not a lack of strength not a lack of knowledge but (cont) __HTTP__ _E_\nIn interview I told @AP that my taxes are under routine audit and I would release my tax returns when audit is complete not after election! _E_\nPat Caddell on Neil Cavuto tonight: I've watched Donald Trump take on the issues of energy and he ties it to (cont) __HTTP__ _E_\nWatch my appearance on @Letterman from last night __HTTP__ _E_\nVia @DMRegister: \"@ShawnJohnson returns to reality TV with Donald Trump\" __HTTP__ _E_\nJohn McCain had a really hard time with his town hall meeting on immigration. They really went after him! _E_\nWe must restore the entrepreneurial spirit of our country. A small business boom. Let's Make America Great Again! 
__HTTP__ _E_\nSo now tha Matt Lauer is gone when will the Fake News practitioners at NBC be terminating the contract of Phil Griffin? And will they terminate low ratings Joe Scarborough based on the \"unsolved mystery\" that took place in Florida years ago? Investigate! _E_\nToday we remember the courage and bravery of our troops that stormed the beaches of Normandy 73 years ago. #DDay... __HTTP__ _E_\nRT @foxandfriends: Never give up....that's the worst thing you could do. There's always a chance. Kyle Coddington's message to those als... _E_\nWe have to combat the welfare mentality that says individuals are entitled to live off taxpayers. (cont) __HTTP__ _E_\n#MakeAmericaGreatAgain#Trump2016  __HTTP__ _E_\nTrump Int'l Hotel & Tower Chicago is one of very few hotels in No. America w/ a 5 Star 5 Diamond Hotel & a 5 Star 5 Diamond Restaurant... _E_\nAddressing record crowd @ Madison County Iowa GOP Dinner. We can bring common sense to DC & Make America Great Again! __HTTP__ _E_\nMy job as President is to do everything within my power to give America a level playing field. #AmericaFirst... __HTTP__ _E_\nPolls are starting to look really bad for Obama. Looks like he'll have to start a war or major conflict to win. Don't put it past him! _E_\nEntrepreneurs: Believe in yourself. If you don't no one else will either. Realize that becoming an entrepreneur is not a group effort. _E_\nWeekly jobless claims are now at an astronomical 365000. Manufacturing sector is suffering badly. We must do better. __HTTP__ _E_\nHas the media picked up the new Zogby poll that was just put out? I doubt it! __HTTP__ _E_\nWho's the flip flopper? @MittRomney has never flip flopped on gay marriage. _E_\nAmazing five days developments in Aberdeen Turnberry (Scotland) and Ireland are fantastic the best anywhere in the WORLD. A lot of fun! _E_\n.@TigerWoods has made a truly great comeback he is number one again! Give him credit comebacks are tough to do. Way to go Tiger. 
_E_\nThe Maryland Democrat Party attacked me with a racist flyer. @Hogan4Governor won 2nd GOP governor elected in 40 years. _E_\nBig crowd expected today in Pensacola Florida for a Make America Great Again speech. We have done so much in so short a period of time...and yet are planning to do so much more! See you there! _E_\nSo lets get this right. Steve Jobs dies and leaves his wife everything billions of dollars. Now his wife has a boyfriend (lover). Oh Steve! _E_\nGRETA IN A FEW MINUTES on Fox. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe new @BarackObama motto You Own Nothing Not Even Your Own Success ... _E_\nThe great people of New Hampshire who I love are not properly served by the dying Union Leader newspaper. _E_\nCRIPPLED AMERICA is perfect gift for friends & family. Order signed copy & join me tonite live streaming 7:30 __HTTP__ _E_\n.@dallasmavs is 1 12 against the Western Conference's top four seeds after Sunday's loss & @okcthunder swept the season series. _E_\nWorst graphics and stage backdrop ever at the Oscars. Show is terrible really BORING! _E_\nMy @WMUR9 Commitment 2016 Conversation with @JoshMcElveen discussing leadership China healthcare & veterans __HTTP__ _E_\nA Rod is just not making it. We want to give him a chance but it was only drugs that made him great. _E_\n...addresses of any mentioned person who is still living. I am doing this for reasons of full disclosure transparency and... _E_\nI am convinced that sleepy eyes Chuck Todd was only a placeholder for someone else at Meet the Press. He bombed franchise in ruins! @nbc _E_\nExcited to announce that @GiulianaRancic & @BravoAndy will be hosting the 2012 Miss Universe Pageant. Great ratings for Miss Universe. _E_\nA week after Biden says that the Taliban is not our enemy the Taliban demand that we pay Iraq for a 9 year occupation. 
__HTTP__ _E_\n.@tracegallagher and @FredTecce discussing my case on @FoxNews __HTTP__ _E_\n.@MittRomney Op Ed Culture Does Matter : __HTTP__ _E_\nDow hit a new intraday all time high! I wonder whether or not the Fake News Media will so report? _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nSo far the hurricane is being handled very well in NY not nearly as bad as stated on news. Let's see what happens later. _E_\nThe Democrats made up and pushed the Russian story as an excuse for running a terrible campaign. Big advantage in Electoral College & lost! _E_\n.@krauthammer pretends to be a smart guy but if you look at his record he isn't. A dummy who is on too many Fox shows. An overrated clown! _E_\n...to terrorism and airline flight safety. Humanitarian reasons plus I want Russia to greatly step up their fight against ISIS & terrorism. _E_\nNetworks are all wanting me to do shows—like it or not a \"ratings machine\"! –but time I run a really big company! _E_\nDopey @Rosie I never went bankrupt ABC already apologized to me for your stupid statement in the past they didn't want a lawsuit. _E_\nCongratulations to @rushlimbaugh on his recent 26th year anniversary. Rush has revolutionized talk radio! Sorry haters and losers! _E_\nWho would be stupid enough to invest in @VattenfallGroup's ill conceived windfarm when it will lose £25M yearly? _E_\nDespite so many false statements and lies total and complete vindication...and WOW Comey is a leaker! _E_\n.@alexsalmond RT @NOBLE74 I live in Aberdeenshire & I'm with you you have made a big difference to that bit (cont) __HTTP__ _E_\nThank you America! __HTTP__ _E_\nWhat?! LaToya is saying Omarosa is one of the nicest people she's met? _E_\nEntrepreneurs: Do your best to your utmost ability every day. Make that your standard. _E_\nGlad to see that @PeteRose_14 has been hired by @FOXSports as an analyst. Pete should be around baseball and in the Hall of Fame! 
_E_\nExcited to be keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa this Saturday __HTTP__ _E_\nShocker: study reveals that @msnbc is completely biased while @FoxNews is factual __HTTP__ What a surprise! _E_\nWe need a strong leader and fast! __HTTP__ _E_\nThe House's failure to pass the Balance Budget Amendment was another unforced error by the GOP. Very disappointing. _E_\nGreat job tonight on @FoxNews Tony. I am with you all the way! Make America Great Again @tperkins _E_\n....earth shattering. He and his brother could Drain The Swamp which would be yet another campaign promise fulfilled. Fake News weak! _E_\nThe Benghazi terrorist is getting speedier care than our Vets at the VA. Obama has his priorities. _E_\nTo the African American community: The Democrats have failed you for fifty years high crime poor schools no jobs. I will fix it VOTE T _E_\nMust read article via @fitsnews: DONALD TRUMP VERSUS MEXICO __HTTP__ _E_\nTrump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean & is host to the 2014 Great Irish Links Challenge __HTTP__ _E_\nAnother great shot from the beginning of construction at @DoralResort. __HTTP__ _E_\nIn moments like thiswe are all just Americans. I join with the President religious and civic leaders and encourage all to pray today. _E_\nIf victorious Republicans will be having a big press conference at the beautiful Rose Garden of the White House immediately after vote! _E_\nUnbelievable evening in New Hampshire THANK YOU! Flying to Grand Rapids Michigan now. Watch NH rally here:... __HTTP__ _E_\n...in order to put any and all conspiracy theories to rest. _E_\nThank you Newt! __HTTP__ _E_\nIs @billmaher the dumbest man on television?—I think so. _E_\nI will be on the @todayshow tomorrow morning to make a major announcement about a television show. Stay tuned! 
_E_\nMy H 1B reform plan will transform program so it delivers for country not lobbyists & will have bipartisan support: __HTTP__ _E_\nI am sure the Chinese are getting anxious. They watch the polls. @MittRomney won't let them cheat us anymore. _E_\nI am very proud of @IvankaTrump for her work with @Cookies4kids. @Cookies4kids is a great cause helping children __HTTP__ _E_\nRT @GeraldoRivera: #NewYork tromps #Jonas. Day after storm of the century the big city is up and running unlike others in the northeast. Mu... _E_\nMy interview with Greta last night on Fox News Nation Has Become All Talk No Action' __HTTP__ _E_\nNow there is talk of A Rod being shipped to @Marlins. If A Rod is not a @yankee next year the fans will be happy. _E_\nLETS MAKE AMERICA GREAT AGAIN!Schedule & tickets: __HTTP__ __HTTP__ _E_\nI was never a fan of Bush in fact he was so bad he gave us Obama! But Obama is truly a pathetic excuse of a president can't get any worse _E_\nCrowd was amazing tonight at Trump National Doral in Miami. Love and excitement in the ballroom. Tomorrow at noon in Jacksonville! _E_\nGood luck @RoccoMediate and nice hat! __HTTP__ _E_\n.@CharlesHurt You were great on @seanhannity last night. Thanks for the nice words. MAKE AMERICA GREAT AGAIN! _E_\nOur politically correct country will read the ISIS terrorists who beheaded the reporter their Miranda Rights prior to good food & care! _E_\n....came to the campaign. Few people knew the young low level volunteer named George who has already proven to be a liar. Check the DEMS! _E_\n.@CNN Kayleigh McEnany was great on you network today. You should have her on more often! Thank you Kayleigh for your nice words. _E_\nEveryone is starting to feel the new tax hikes. You get what you vote for! _E_\nPresident Obama is finally getting hammered even by his most loyal supporters and the press I guess they can only take so much! _E_\n.@VanityFair looks like a dying magazine! Really really boring really really thin! 
_E_\nThe new President of OPEC is Mahmoud Ahmadinejad's confidant Rostam Ghasemi a commander of the Revolutionar... (cont) __HTTP__ _E_\n\"Courage is contagious. When a brave man takes a stand the spines of others are often stiffened.\" – Rev. @BillyGraham _E_\nRepublicans remember—debt ceiling debt ceiling debt ceiling—be smart and you will win! _E_\nJust be tough be strong be willing to learn – and you will learn. Don't be afraid of mistakes or setbacks. Think Like a Champion _E_\n.@BreitbartNews continues to do great work in exposing the left wing financing behind amnesty __HTTP__ _E_\nThe best deals are good for everyone which creates a win win situation. Negotiation is persuasion more than power. _E_\n#VoteTrumpID! #Trump2016 __HTTP__ _E_\nOne of the dumbest political pundits on television is Chris Stirewalt of @FoxNews. Wrong facts check Fox debate rankings Trump #1. Dope! _E_\nThank you Michigan!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nUnemployment has been over 8% for a record 40 straight months. @MittRomney's election will end the @BarackObama downturn. _E_\nThe party of the year in Palm Beach was the New Year's Eve celebration at the Mar a Lago Club it was amazing. __HTTP__ _E_\n\"@ApprenticeNBC: Donald Trump Talks Joan Rivers\" __HTTP__ via @TVGrapevine by @TVG_Sammi _E_\nCelebrate Thanksgiving @TrumpNewYork with exclusive viewing access to the 88th Annual @Macys Thanksgiving Parade® __HTTP__ _E_\nSouth African Tourism North America will unveil its new ad campaign \"What's Your BIG 5?\" on All Star @ApprenticeNBC this Sunday. _E_\nMitt Romneywho totally blew an election that should have been won and whose tax returns made him look like a fool is now playing tough guy _E_\nYesterday I explained to @wolfblitzercnn on @CNNSitRoom why @BarackObama doesn't deserve credit for killing Bin Laden __HTTP__ _E_\nDemocrat Dianne Feinstein should never have released secret committee testimony to the public without authorization. 
Very disrespectful to committee members and possibly illegal. She blamed her poor decision on the fact she had a cold a first! _E_\nSanctions were not discussed at my meeting with President Putin. Nothing will be done until the Ukrainian & Syrian problems are solved! _E_\nChina has announced it is \"fully prepared\" for a currency war __HTTP__ Outrageous they have no fear of our leaders. _E_\nMany people have asked recently when do you sleep? The answer is not much. _E_\nVia @examinercom: The Miss Universe contestants glow with elegance during the Trump Holiday Party __HTTP__ _E_\nExcellent preliminary meeting in Oval with @SenSchumer working on solutions for Security and our great Military together with @SenateMajLdr McConnell and @SpeakerRyan. Making progress four week extension would be best! _E_\nICYMI Via @nypost by Post Editorial Board: \"@TrumpFerryPoint: New Gem in the Bronx\" __HTTP__ _E_\nThe people of Tennessee yesterday were amazing. Thank you! _E_\nThe five fingers represent the five key factors every entrepreneur dreaming of success must master. (cont) __HTTP__ _E_\nGreat progress on healthcare. Improvements being made Republicans coming together! _E_\nIf you never want to be criticized for goodness' sake don't do anything new. Jeff Bezos _E_\nAre people really afraid of @OMAROSA Would you be? #CelebApprentice _E_\nWe're stuck with the worst mayor in the United States. Too bad but New York City will survive! _E_\nJoin me in Denver Colorado tomorrow at 9:30pm!Tickets: __HTTP__ _E_\nVia@politicalwire: Trump Offers to Fund White House Tours __HTTP__ _E_\nFormer winner @bretmichaels returns to All Star @ApprenticeNBC March 3rd on @NBC. Bret shows once again why he is a champion! _E_\nI will be interviewed on This Week with George S this morning. Enjoy! _E_\nIt was an honor to stop by a #SchoolChoice event hosted by @VP Pence and @usedgov Secretary @BetsyDeVosED at the... __HTTP__ _E_\nGas prices are about to hit a record high during the Labor Day weekend. 
@BarackObama could have stopped this. _E_\nCongratulations @TrumpSoHo for making @CNTraveler's #GoldList2015! __HTTP__ _E_\nYour civil liberties mean nothing if you're dead. That's why the single most important function of the federal (cont) __HTTP__ _E_\n.@ChuckTodd just informed us that my interview last week on @MeetthePress was their highest rated show in 4 years. Congrats! _E_\nI'll be honored at the Family Business Dynasties Gala in NYC on December 5th. It will be a great event for a great cause. _E_\n\"I believe in spending what you have to. But I also believe in not spending more than you should.\" The Art of the Deal _E_\nThank you Illinois! #Trump2016 __HTTP__ _E_\n\"The idea flow from the human spirit is absolutely unlimited. All you have to do is tap into that well.\" @jack_welch _E_\nThe very dishonest @NBCNews refuses to accept the fact that I have forgiven my $50 million loan to my campaign. Done deal! _E_\nJeb's big ad buy against me paid for by lobbyists shows my face but doesn't have me answering Jeb's statements. He is really pathetic! _E_\nWind farms are now being paid to shut down __HTTP__ A complete waste. _E_\nMark your calenders for August 23rd: __HTTP__ _E_\nTHANK YOU ALABAMA! 32000 supporters tonight. Get out & VOTE on Tuesday! WE WILL MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nArgyll grandmother takes UK and EU to the United Nations over plans to turn Scotland into windfarm 'hedgehog' __HTTP__ _E_\nWe just passed 1.9M followers & gained over 250000 followers in the last month.Thank you let's have fun and do business. _E_\nRT @IvankaTrump: Ivanka penned an Op Ed that ran in the @WSJ this afternoon read it here. __HTTP__ @realDonaldTrump __HTTP__ _E_\nV.P.....really! __HTTP__ _E_\nCrooked Hillary said her husband is going to be in charge of the economy.If so he should runnot her.Will he bring the energizer to D.C.? 
_E_\nAccording to @BarackObama the War on Terror is over __HTTP__ global warming is a national (cont) __HTTP__ _E_\nIt was announced this morning that unemployment rose this can't be good for Obama. _E_\nSuccess is having to worry about every damn thing in the world except money. Johnny Cash _E_\nIn all my years in business and participating in politics I've never seen the country as divided as it is right (cont) __HTTP__ _E_\nHappy belated birthday to Chris Wallace! Chris does a great job every week on @FoxNewsSunday. Like father like son. _E_\nRT @foxandfriends: Anthem announces it will withdraw from ObamaCare Exchange in Nevada __HTTP__ _E_\n\"I also plan to keep making deals big deals and right around the clock.\" – The Art of The Deal _E_\nTickets for future debates should be put out to the general public instead of being given to the lobbyists & special interests the bosses! _E_\nI would like to thank a great writer and person @JPappasPR. of REAL ESTATE WEEKLY for the wonderful story on me. Very much appreciated! _E_\nCan't watch Crazy Megyn anymore. Talks about me at 43% but never mentions that there are four people in race. With two people big & over! _E_\nCongratulations to my friend @TheSlyStallone on winning a #GoldenGlobe. A wonderful guy who has created something special well deserved! _E_\nKeep hearing about tiny amount of money spent on Facebook ads. What about the billions of dollars of Fake News on CNN ABC NBC & CBS? _E_\nMajor story that the Dems are making up phony polls in order to suppress the the Trump . We are going to WIN! _E_\nWe cannot let the failing REPUBLICAN ESTABLISHMENT who could not stop Obama (twice) ruin the MOVEMENT with millions of $'s in false ads! _E_\nLittle @MacMiller I want the money not the plaque you gave me! _E_\n46 stories above downtown New York @TrumpSoHo features loft inspired interiors designed by Fendi Casa __HTTP__ _E_\nGeneral John Kelly is doing a fantastic job as Chief of Staff. 
There is tremendous spirit and talent in the W.H. Don't believe the Fake News _E_\nAgain for all of the haters and losers I have NOTHING to do with Atlantic City got out a long time ago! _E_\nObama once said he \"would be ignoring the law\" by granting amnesty through executive action. Now he's about to do it. What will Congress do? _E_\nAmazing rally in Reno Nevada thank you. Make sure you get out on 11/8 & VOTE #TrumpPence16. Together we will put... __HTTP__ _E_\nLA Times USC Dornsife Sunday Poll: Donald Trump Retains 2 Point Lead Over Hillary: __HTTP__ _E_\nFunny that the Democrats would have their convention in Pennsylvania where her husband and her killed so many jobs. I will bring jobs back! _E_\nWrong Policy: @BarackObama wants to cut defense spending by $487B while China is building their navy in the Pacific. __HTTP__ _E_\nLightweight NYS Attorney General Eric Schneiderman is trying to extort me with a civil law suit. See website __HTTP__ _E_\n\"Donald Trump—The Disrupter\" will air on @FoxNews Saturday night and Sunday night at 8 PM ET. Anchored by @BretBaier. @johnrobertsFox _E_\nVia @CNNPolitics by @mj_lee: Father of murder victim to introduce Trump in Phoenix __HTTP__ _E_\nI am thinking about changing the name #FakeNews CNN to #FraudNewsCNN! _E_\nJust to show you how dishonest certain reporters are here is my @foxandfriends interview __HTTP__ _E_\n.@Mitt Romney strongly stated in one of the debates with Pres. OBAMA that Russia is the big problem. Obama scoffed. Mitt was 100% correct! _E_\nWind turbines are totally destroying the areas in which they are located—all for unreliable bad & expensive energy! _E_\nVia @Newsmax_Media: Trump Says He'll Foot Bill for White House Tours __HTTP__ _E_\nI am very excited about hosting @MittRomney today for a fundraiser. Looking forward to seeing @newtgingrich and many other friends. 
_E_\nFor a country like China being able to steal our military designs represents hundreds of billions in savings (cont) __HTTP__ _E_\nBeing true to yourself equals being true to your brand. That's the solid foundation that stands the test of time. Midas Touch _E_\nObama told the UN that \"the world is more stable than it was 5 years ago.\" Is he delusional? _E_\nRT @atensnut: How many times must it be said? Actions speak louder than words. DT said bad things!HRC threatened me after BC raped me. _E_\n.@BlairKamin Sorry sucker as usual you lose again. You couldn't work for me for 10 seconds. Bad critic great sign. __HTTP__ _E_\nI am increasingly concerned with the UN's ploy against @Israel this coming week and will monitor all events closely from Australia. _E_\nCriminal deportations in the U.S. are the lowest number in many years. We are letting criminals knowingly stay in our country. MUST CHANGE! _E_\nSee June 2007 speech is Obama a total racist? _E_\nWEEKLY ADDRESS __HTTP__ _E_\nAll Star Celebrity @ApprenticeNBC is down to the five final contestants. Getting fired now is when it really hurts! _E_\nI will be interviewed on @Morning_Joe at 6:15 A.M. Enjoy! _E_\nSelf determination is the sacred right of all free people's and the people of the UK have exercised that right for all the world to see. _E_\nPlease don't pay attention to all of those phony tweets that mention my twitter handle relative to \"diet\" it is a total scam. _E_\nWhy is this reporter touching me as I leave news conference? What is in her hand?? __HTTP__ _E_\nAs it has turned out James Comey lied and leaked and totally protected Hillary Clinton. He was the best thing that ever happened to her! _E_\nObama's coal regulations will destroy the coal industry put Americans out of work raise electricity prices & lead to blackouts. _E_\n.@DannyZuker Danny You're a total loser! _E_\nGetting ready to leave for Washington D.C. 
The journey begins and I will be working and fighting very hard to make it a great journey for.. _E_\n.@mcuban Letterman @Late_Show had his best ratings with me and you bombed. People don't care about Mark Cuban. _E_\nA great honor to host and welcome leaders from around America to the @WhiteHouse Infrastructure Summit.... __HTTP__ _E_\nRT @TravelGov: Continue to notify us of US citizens overseas impacted by #HurricaneIrma & #HurricaneJose. __HTTP__ __HTTP__ _E_\nTerrorists are engaged in a war against civilization it is up to all who value life to confront & defeat this evil __HTTP__ _E_\nHaving a truly great imagination is often far more important than having even massive knowledge but still never underestimate knowledge! _E_\n\"Our country is the greatest force for freedom the world has ever known. We have big hearts big brains and (cont) __HTTP__ _E_\nBottom line I don't think President changed people's minds must hope for a lifeline from Putin a very dangerous lifeline at that! _E_\nHistory lesson: There's a big difference between Hillary Clinton and Abraham Lincoln. For one his nickname is Hone... __HTTP__ _E_\n.@MittRomney did a great job last night. Watch the clip! __HTTP__ _E_\nCongratulations to @DLoesch on the release of her great new book #HandsOffMyGun! Check out @TheBlazeBooks excerpt __HTTP__ _E_\nAnthony Hopkins is a truly great actor I love everything he does! _E_\nWatch my latest video blog.... __HTTP__ _E_\nWhy doesn't @FoxNews quote the new Iowa @CNN Poll where I have a 33% to 20% lead over Ted Cruz and all others. Think about it! _E_\nLIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_\nThe more you learn about the debt deal the worse it gets. _E_\nDummies @Deadspin had their big payday taken from them by others in the media. _E_\nI told you in speeches months ago that Jeb and Marco do not like each other. Marco is too ambitious and very disloyal to Jeb as his mentor! 
_E_\nOnce the tragic mistake of going into Iraq was made we should have at least taken the oil (or at least some of it). Now Iran & China get it _E_\nRemember what I said about @BarackObama attacking Iran before the election I hope the Iranians are not so (cont) __HTTP__ _E_\nNation's infrastructure is collapsing MAKE AMERICA GREAT AGAIN! _E_\nRT @TeamTrump: #RattledHillary wants to talk about her 30 years in service. How about her 30 years of FLOPSFLOPS?! #BigLeagueTruth #Debat... _E_\nJoin me live in Reno Nevada! __HTTP__ __HTTP__ _E_\nThank you Warwick Rhode Island!#RIPrimary #VoteTrump __HTTP__ __HTTP__ _E_\nWant to know why China is growing? They can build the world's tallest building in 90 days __HTTP__ Red tape would kill it here. _E_\nThank you Nevada! #AmericaFirst#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nIf Cory Booker is the future of the Democratic Party they have no future! I know more about Cory than he knows about himself. _E_\nof jobs and companies lost. If Mexico is unwilling to pay for the badly needed wall then it would be better to cancel the upcoming meeting. _E_\nIf Sheena Monnin apologized for her mistake as she should have I would have treated her very nicely. _E_\nEgypt's Muslim Brotherhood President is visiting us next month. @BarackObama is so excited. _E_\nRT @Scavino45: .@POTUS @realDonaldTrump @IvankaTrump Jared Kushner & Dina Powell in the Oval Office today w/ Aya & her brother Basel.#W... _E_\nPoliticians are ALL TALK NO ACTION! just look at our country. _E_\nPresident Obama should bring Secretaty Sebelius into his office look right into her beautiful blue eyes and saywith emotion YOU'RE FIRED! _E_\nAs promised our campaign against the MS 13 gang continues. @ICEgov Busts 39 MS 13 Members in New York Operation __HTTP__ _E_\nThank you Georgia! I had a great afternoon with all of you! I will be back soon. #MakeAmerciaGreatAgain __HTTP__ _E_\nRT @Heritage: We had a special visitor yesterday. 
@IvankaTrump thank you for meeting with @KayColesJames spending time with our team and... _E_\nHow can any Senator vote for Hagel as Sec. of Defense after that horrific hearing? He is not up for the job but will probably get it. _E_\n.@TrumpGolfLA is proud to be hosting @PGAGrandSlam where all 4 Major Champions will square off. October 2015. __HTTP__ _E_\nHad a great time yesterday on @theviewtv with @WhoopiGoldberg @JennyMcCarthy @SherriEShepherd & guest host @MrJerryOC! _E_\nIran is playing with fire they don't appreciate how kind President Obama was to them. Not me! _E_\nA good question for would be entrepreneurs to ask themselves: What am I pretending not to see? There are a lot of opportunities out there. _E_\nI hope all of the many thousands of people who are asking me to give up so much and RUN FOR PRESIDENT will fight hard for victory if I do! _E_\nRepublican Senate must get rid of 60 vote NOW! It is killing the R Party allows 8 Dems to control country. 200 Bills sit in Senate. A JOKE! _E_\nWow looks like James Comey exonerated Hillary Clinton long before the investigation was over...and so much more. A rigged system! _E_\nThe entrepreneur builds an enterprise the technician builds a job. Michael Gerber _E_\nMany people still out of power in Staten Island. Absolutely ridiculous. Why can't they get service? _E_\nI'll be on Greta Van Susteren's show tonight at 10 PM on FoxNews. Tune in. _E_\nCORRUPTION CONFIRMED: FBI confirms State Dept. offered 'quid pro quo' to cover up classified emails __HTTP__ _E_\nThank you Tallahassee Florida! A beautiful evening with the MOVEMENT! Get out & VOTE!#ICYMI watch here: __HTTP__ _E_\nWhen you're in a fight with a bully always throw the first punch—and don't telegraph it—hit hard & hit fast! _E_\n...subject to the fact that if we do not reach a fair deal for all we will then terminate NAFTA. Relationships are good deal very possible! _E_\nThe Islamists have won. 
Just as I predicted the Muslim Brotherhood has taken over Egypt. @BarackObama never should have abandoned Mubarek. _E_\nIf Syria was forced to use Obamacare they would self destruct without a shot being fired. Obama should sell them that idea! _E_\nA segment from last night's @piersmorgan interview discussing @CoryBooker and fighting fire with fire in a campaign __HTTP__ _E_\n#TrumpVlog Free our Marine! __HTTP__ _E_\nVia Breitbart Riding High in Polls Donald Trump Storms the American South to Overflow Crowds in Georgia __HTTP__ _E_\nThe economy is expected to slow down once again at the end of the year __HTTP__ The price of gas has to be lowered. _E_\nA state legislator w/ a true record of accomplishments military vet @joniernst will make a tremendous US Senator. Iowa send Joni to DC! _E_\nNobody but Donald Trump will save Israel. You are wasting your time with these politicians and political clowns. Best! #SheldonAdelson _E_\nWow—Family Feud said I am the third most envied man in America. I respectfully disagree—I am very modest. _E_\nWhile millions are being spent against me in attack ads they are paid for by the \"bosses\" and \"owners\" of candidates. I am self funding. _E_\n\"Donald Trump was proven right on another one of his top issues Thursday: 'gun free zones' at military bases.\" __HTTP__ _E_\nMore than a century after conquering flight the #WrightBrothers continue to motivate & inspire Americans who never tire of exploration & innovation. This GREAT AMERICAN SPIRIT can be found in the design of every new supersonic jet and next generation: __HTTP__ __HTTP__ _E_\nThe reason you don't generally hit runways is that they are easy and inexpensive to quickly fix (fill in and top)! _E_\nHere I am with @RodStewart at Mar a Lago. __HTTP__ _E_\nI look forward to working w/ D's + R's in Congress to address immigration reform in a way that puts hardworking citizens of our country 1st. 
_E_\nBen Carson was speaking in general terms as to what he would do if confronted with a gunman and was not criticizing the victims. Not fair! _E_\nChina has a backdoor into the Trans Pacific Partnership. This deal does not address currency manipulation. China is laughing at us. _E_\nWe are one nation. When one hurts we all hurt. We must all work together to lift each other up.#StandWithLouisiana __HTTP__ _E_\nCheck out the #trumpvlog to see the answers to your questions... __HTTP__ _E_\nI will be going to Asheville North Carolina tonight for the 95th birthday party of the GREAT Billy Graham such a wonderful man! _E_\n.And to think that just last week he was lecturing anyone who would listen about sexual harassment and respect for women. Lesley Stahl tape? _E_\nLooks like I was right about NATO. I had no doubt. __HTTP__ _E_\nPresident Obama just fired the ObamaCare website builder. My question is why were they hired in the first place? Sue them for damages! _E_\n.@somelikeitlar hope you enjoyed the premiere of All Star Celebrity @ApprenticeNBC. Make sure @marklevinshow watches! _E_\nJust started building one of the great hotels of the World in Washington D.C. the site of the Old Post Office. Will be amazing JOBS! _E_\nIf Obama goes after Mitt's private sector experience in the next debate then Mitt should ask for Obama's college records all of them. _E_\nVia @postandcourier by @skropf47: \"Donald Trump: Don't politicize Walter Scott shooting\" __HTTP__ _E_\n.@KarlRove still thinks Romney won! He doesn't have a clue! @FoxNews _E_\nWhat an evening in Las Vegas Nevada! THANK YOU for your continued support. #Trump2016 __HTTP__ __HTTP__ _E_\nBig vote tomorrow in the House. Tax cuts are getting close! _E_\nRT @MELANIATRUMP \"@ApprenticeNBC: Her beauty lives 5000 miles past Heaven. __HTTP__ \" Thank u @THEGaryBusey! _E_\nEverybody wants me to talk about Robert Pattinson and not Brian Williams—I guess people just don't care about Brian! 
_E_\nI will be doing @SquawkCNBC at 7:30. _E_\nWork often becomes problem solving. Problems come with the territory and they should never surprise you. Think Like a Champion _E_\nCrazy @megynkelly supposedly had lyin' Ted Cruz on her show last night. Ted is desperate and his lying is getting worse. Ted can't win! _E_\nGood night everyone sleep well and tomorrow have many victories! _E_\nThat was some episode last week we've got a great cast! _E_\nVia @LINKSMagazine: \"Only The Donald __HTTP__ _E_\nStanding strong for his people @GovWalker is ignoring the Feds and keeping all Wisconsin parks open. Great! _E_\nAmazing Race winning an Emmy again is a total joke. The Emmys have no credibility no wonder the ratings are at record lows. _E_\n#AmericasMerkel __HTTP__ _E_\nDangerous. While Obama is cutting down our military China has announced plans to build more aircraft carriers __HTTP__ _E_\nCongratulations to @MittRomney for an impressive win in Florida. He performed well under pressure. _E_\nMy supporters are the best! $18 million from hard working people who KNOW what we can be again! Shatter the record: __HTTP__ _E_\nI have been very consistent and always said that Iraq would fall as soon as the U.S. left. What a terrible waste of lives and money! _E_\nMake no mistake Fast and Furious goes ALL the way to the White House. _E_\nThe main stream media wants to surrender constitutional rights I believe #ISIS needs to surrender! _E_\nThe Senate Democrats have only confirmed 48 of 197 Presidential Nominees. They can't win so all they do is slow things down & obstruct! _E_\nIf Obama worked as hard on straightening out our country as he has trying to protect and elect Hillary we would all be much better off! _E_\nDirect foreign investments continue to flow into China at over $100B a year __HTTP__ That's money that could be spent here. _E_\nWith Irma and Harvey devastation Tax Cuts and Tax Reform is needed more than ever before. Go Congress go! 
_E_\nGoofy Elizabeth Warren didn't have the guts to run for POTUS. Her phony Native American heritage stops that and VP cold. _E_\nIt just shows everyone how broken and unfair our Court System is when the opposing side in a case (such as DACA) always runs to the 9th Circuit and almost always wins before being reversed by higher courts. _E_\nLooks like the line has started be sure to join me for book signing @TImeToGetTough starting at 11am to 2 pm here in Trump Tower. _E_\nLYIN' TED __HTTP__ _E_\nI rarely agree with President Obama however he is 100% correct about Crooked Hillary Clinton. Great ad! __HTTP__ _E_\n'Donald Trump: A President for All Americans' __HTTP__ _E_\nMy friend @eminofficial was fantastic on the @TODAYshow this morning—a star! _E_\n.@LilJon once again made it to the Final Four. A true talent and great friend to #CelebApprentice @ApprenticeNBC. Great job! _E_\nIf Obama was willing to lie about ObamaCare then what else has he lied to us about... _E_\n.@CNN is so negative it is impossible to watch. Terrible panel angry haters. Bill O @oreillyfactor said such an amazing thing about me! _E_\nHillary could lose to Trump in Democratic New York #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nRe: CPAC \"The crowd in the main room filled to capacity by the end of Trump's address something his operative said he planned to do... _E_\nJoin me live in Cincinnati Ohio!#TrumpRally #MAGA __HTTP__ _E_\nRepublicans Senators are working hard to pass the biggest Tax Cuts in the history of our Country. The Bill is getting better and better. This is a once in a generation chance. Obstructionist Dems trying to block because they think it is too good and will not be given the credit! _E_\nHave your own vision & stick with it. Don't be afraid to be unique.Every day is an opportunity to show what you can do at the highest level. _E_\n#TBT With Steven Spielberg in the old days a great guy! __HTTP__ _E_\nIt's Thursday. How much did OPEC steal from all of us today? 
_E_\nThe Fake News refuses to talk about how Big and how Strong our BASE is. They show Fake Polls just like they report Fake News. Despite only negative reporting we are doing well nobody is going to beat us. MAKE AMERICA GREAT AGAIN! _E_\nTo the @BarackObama administration saving money isn't the point expanding government and spending more (cont) __HTTP__ _E_\nVia @ArabianBusiness: \"Trump eyes PGA tour for Dubai golf course\" __HTTP__ _E_\nA great book for your reading enjoyment: REASONS TO VOTE FOR DEMOCRATS by Michael J. Knowles. _E_\nFake @NBCNews made up a story that I wanted a tenfold increase in our U.S. nuclear arsenal. Pure fiction made up to demean. NBC = CNN! _E_\nSomeone should ask @BarackObama in today's press conference how he accumulated more debt in 3 years than the first 42 presidents combined. _E_\nJust landed in New York a one night stay in Scotland. Turnberry came out magnificently. My son Eric did a great job under budget! _E_\nCongratulations to @TrumpDoral's #BlueMonster course for being named one of the 10 Toughest courses on Tour This Year __HTTP__ _E_\nEven though Bernie Sanders has lost his energy and his strength I don't believe that his supporters will let Crooked Hillary off the hook! _E_\nGreat solidarity for our National Anthem and for our Country. Standing with locked arms is good kneeling is not acceptable. Bad ratings! _E_\nGood news: Toyota and Mazda announce giant new Huntsville Alabama plant which will produce over 300000 cars and SUV's a year and employ 4000 people. Companies are coming back to the U.S. in a very big way. Congratulations Alabama! _E_\n...We negotiated a ceasefire in parts of Syria which will save lives. Now it is time to move forward in working constructively with Russia! _E_\nRT @FLOTUS: Thank u to Queen Fabiola University Hospital! Enjoyed creating paper flowers with amazing patients & getting a tour. #Brussels... 
_E_\nAs the days and weeks go by we see what a total mess our country (and world) is in Crooked Hillary Clinton led Obama into bad decisions! _E_\nThank you! #Trump2016 __HTTP__ _E_\nObama is under a great of pressure to perform well in the next debate. Let's see how he reacts under pressure. _E_\nWhen people treat me badly or unfairly or try to take advantage of me my general attitude all my life has (cont) __HTTP__ _E_\nRT @foxandfriends: SEN. CRUZ: It's crazy to go an August recess without having Obamacare repealed. We should work every day until it is don... _E_\nSyria has been given so much time that much of the things we were going to bomb have been moved into civilian areas! A polititian's war. _E_\n.@RinglingBros is retiring their elephants the circus will never be the same. _E_\n'WikiLeaks Drip Drop Releases Prove One Thing: There's No Nov. 8 Deadline on Clinton's Dishonesty and Scandals' __HTTP__ _E_\nSaudi Arabia should be paying the United States many billions of dollars for our defense of them. Without us gone! @AlWaleedbinT _E_\nMy wife @MELANIATRUMP will be joining @andersoncooper @AC360 tonight at 8pmE on @CNN. Enjoy! __HTTP__ _E_\nPeople are so jealous of Tom Brady and the Patriots. No court could convict based on the evidence.They can't beat him on the field so this! _E_\n\"Here's something about Donald Trump he's got a top rated show on TV and everything he says becomes a headline.\" @DLoesch All true! _E_\nObama's speech in Las Vegas yesterday cost the taxpayer $520 per word and over $1.6M __HTTP__ More money borrowed from China. _E_\nA $1.5B website that can only handle 50K users at a time is sad but no surprise! _E_\nEd Gillespie will turn the really bad Virginia economy #'s around and fast. Strong on crime he might even save our great statues/heritage! _E_\nThe booing at the NFL football game last night when the entire Dallas team dropped to its knees was loudest I have ever heard. Great anger _E_\nJust out wonderful poll in North Carolina. 
#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\n...people are now starting to recognize the amazing work that has been done by FEMA and our great Military. All buildings now inspected..... _E_\nVia @myrbeachonline by @TSN_MPrabhu: Donald Trump states case for becoming POTUS at SC Tea Party convention __HTTP__ _E_\nWhat separates the winners from the losers is how a person reacts to each new twist of fate. _E_\n.@jessebwatters You did a great job hosting @oreillyfactor. Everybody loved it! Thank you for the nice words. _E_\nRepublicans in the Senate will NEVER win if they don't go to a 51 vote majority NOW. They look like fools and are just wasting time...... _E_\n\"Most of the time you will need to work hard and stay focused to get to the top – and then work even harder to stay there.\" Think Big _E_\nToo many people on stage for debate. @RandPaul at 11th with 2% in @RealClearNews shouldn't be allowed to participate. _E_\nSo now it is reported that the Democrats who have excoriated Carter Page about Russia don't want him to testify. He blows away their.... _E_\n\"History does not long entrust the care of freedom to the weak or the timid.\" Dwight D. Eisenhower _E_\nMore than 70M people watched the Presidential Debate. A new record. See what happens when I am so prominently mentioned (just kidding)! _E_\nDon't forget Benghazi. _E_\n#CelebApprentice We always make sure to have great NYC locations for the task delivery. _E_\nLast night in Phoenix I read the things from my statements on Charlottesville that the Fake News Media didn't cover fairly. People got it! _E_\nThe Federal government has $2.7T in assets & $17.5T in total liabilities plus another $4.7T in intergovernmental debt. Have a nice day. _E_\nWho can figure out the true meaning of covfefe ??? Enjoy! _E_\nMAKE AMERICA GREAT AGAIN! 
__HTTP__ _E_\nDonald Trump praises @LilJon and welcomes him back to All Star @CelebApprentice __HTTP__ via @HipHopNews24x7 _E_\nHopefully all supporters and those who want to MAKE AMERICA GREAT AGAIN will go to D.C. on January 20th. It will be a GREAT SHOW! _E_\nI will issue a lifetime ban against senior executive branch officials lobbying on behalf of a FOREIGN GOVERNMENT!... __HTTP__ _E_\nExcited to keynote of the sold out Pottawattamie County Republican Party Lincoln/Reagan Dinner tonight! __HTTP__ Leaving now! _E_\nLance Armstrong made a really big mistake by opening up to Oprah. I'll bet he wishes he had the chance to do it over again. _E_\nTo have a government we can afford we need to eliminate the tremendous waste clogging the system. Almost every (cont) __HTTP__ _E_\nI will be at @Macys Herald Square April 18 to sign my new fragrance #Success by Trump. First 100 customers receive a copy of my new book. _E_\nHillary just gave a disastrous news conference on the tarmac to make up for poor performance last night. She's being decimated by the media! _E_\nWind turbines threaten the migration of birds __HTTP__ Where's the outcry? _E_\nVia @ScotlandNow: \"Donald Trump starts £250million overhaul on @TrumpTurnberry golf resort\" __HTTP__ _E_\nThis is Amateur Night who the hell is in charge of this production? #Oscars _E_\nGreat photo with @IvankaTrump and @Joan_Rivers from this week's @ApprenticeNBC __HTTP__ _E_\nVia @Newsmax_Media by @OwenTew: \"Trump: 'Maybe Something Miraculous Happens' and Obama Will Succeed\" __HTTP__ _E_\nThank you Louisville Kentucky on my way! #MAGA __HTTP__ _E_\nThe Donald Goes to CPAC: TV star and hotel magnate gives his thoughts on the state of America __HTTP__ by @Kredo0 _E_\nSugar is nowhere near being a billionaire and I know he works for me! 
_E_\nOn the shores of Lake Norman @Trump_Charlotte presents the true luxury lifestyle and an elite golf course __HTTP__ _E_\nRT @Scavino45: Today @POTUS @realDonaldTrump and @FLOTUS Melania visit the @USCG at the Lake Worth Inlet Station in Riviera Beach Florid... _E_\nWith our border not being secure Obama is giving a pathway to terrorists to enter our country. An attack is on him. _E_\nEntrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. _E_\nStill waiting for an explanation about why @GiulianaRancic & @BillRancic did not name their son Donald. Unbelievable. _E_\nRemarks at the United States Holocaust Memorial Museum's National Days of Remembrance. Full remarks:... __HTTP__ _E_\nMy @FoxNews interview from yesterday discussing my recent meetings in Trump Tower and also @GovChristie __HTTP__ _E_\nHeading now for Reno Nevada for a big rally. Good poll numberd all over! _E_\nIt is time to bring competence to Washington. It is time get results. Let's Make America Great Again! __HTTP__ _E_\nKeystone XL should be approved but more importantly we should be drilling & fracking our own resources. Would be an economic windfall. _E_\nAsk China if their rapidly expanding (with our money) Navy or Armed Forces are going green They would laugh in your face! _E_\nWith all of the illegal acts that took place in the Clinton campaign & Obama Administration there was never a special councel appointed! _E_\nI am especially grateful for the tremendous support I have received from the Evangelicals in the just out Iowa CNN poll. Thank you! _E_\nLeading by 13 over Landrieu in a @FoxNews poll @BillCassidy will beat her in November. _E_\nJust leaving Salt Lake City Utah fantastic crowd with no interruptions. Love Utah will be back! _E_\nMan we had a great day today at Trump Tower lots of money was given to many people who really needed it good feelings and happiness! _E_\nBack in D.C. 
big week for Tax Cuts and many other things of great importance to our Country. Senate Republicans will hopefully come through for all of us. The Tax Cut Bill is getting better and better. The end result will be great for ALL! _E_\nAwarded 5 Stars by @ForbesInspector @TrumpChicago's @SixteenChicago offers Executive Chef @cheflents's new menu __HTTP__ _E_\nDo your homework. Wasting other people's time due to poor planning will only leave a bad impression. Think Like a Billionaire _E_\nWhen you can't say it or see it you can't fix it. We will MAKE AMERICA SAFE AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_\nIran continues to delay the nuclear deal while doing many bad things behind our backs. Time to WALK and double the sanctions. Stop payments! _E_\n#TheRemembranceProject __HTTP__ __HTTP__ _E_\nOur many loyal viewers should expect a major announcement very soon on next season's @CelebApprentice. Our fans will be pleased. _E_\n\"A failure or setback is not a defeat. Defeat is a state of mind. You are defeated only when you accept defeat.\" – Think Big _E_\nIf you live in a state with early voting you should be voting as soon as possible. Bring your friends and family with you. _E_\nI give Secretary of State John Kerry credit for working and trying hard but he has zero negotiating ability! _E_\nJust watched @NBCNightlyNews So biased inaccurate and bad point after point. Just can't get much worse although @CNN is right up there! _E_\nMy wife Melania will be on @Morning_Joe tomorrow morning at 8:00. Interviewed by @morningmika Enjoy! _E_\nConsumer Confidence Hits Highest Level Since December 2000 Read more: __HTTP__ __HTTP__ _E_\nObama looks exhausted and beaten. He was never made or prepared for the job. Like it or not he doesn't have it _E_\nWe're all very happy to hear of Bret Michael's progress and send our best wishes for his full recovery. _E_\n\"I like thinking big. 
To me it's very simple: if you're going to be thinking anyway you might as well think big.\" – The Art of The Deal _E_\n'America must decide between failed policies or fresh perspective a corrupt system or an outsider' __HTTP__ _E_\nHey @Rosie how is your recovery going? I hope you are doing well so we can start fighting again soon! _E_\nThe Oscar broadcast is really boring where is the glamour and beauty? _E_\n#ICYMI: Weekly Address __HTTP__ __HTTP__ _E_\nThe misery of Obama's economic policies. US households with unemployed parent was at record high in 2011 __HTTP__ _E_\nMy interview yesterday with @TeamCavuto where I discuss Dick Cheney and China __HTTP__ _E_\nThe ratings for The View are really low. Nicole Wallace and Molly Sims are a disaster. Get new cast or just put it to sleep. Dead T.V. _E_\n\"Trump: 'No way' Bush Romney would win in 2016\" __HTTP__ via @FoxNews by Barnini Chakraborty _E_\n#FlashbackFriday With Mickey Rooney @Regis and @itstonybennett __HTTP__ _E_\nStill a buyer's market but somewhat fragile. Be sure to calculate the risk of rising rates coming sooner than you think! _E_\nThe State of Iowa should disqualify Ted Cruz from the most recent election on the basis that he cheated a total fraud! _E_\nRT @Reince: Flying to Dallas now with @realDonaldTrump...Reports of discord are pure fiction. Great events lined up all over Texas. Rs wil... _E_\nTonight in his SOTU @BarackObama won't talk about Keystone. He will continue to dissemble about his record and play class warfare. _E_\nEntrepreneurs: Is the problem a blip or a catastrophe? Keep things in perspective. Learn to expect problems and keep moving forward. _E_\nObama's deal vs. Trump's deals __HTTP__ _E_\nThey should seriously look into the moron George Zimmerman who shot and killed the 17 year old kid Trayvon (cont) __HTTP__ _E_\nJust landed in New Hampshire. Will be at the venue shortly. 
#FITN _E_\nThe Fake News Media works hard at disparaging & demeaning my use of social media because they don't want America to hear the real story! _E_\nThe voters the Republican Party of Virginia are excluding will doom any chance of victory. The Dems LOVE IT! Be smart and win for a change! _E_\nExcited to announce Trump Rio de Janeiro our first South American @TrumpCollection hotel set to open in 2016 __HTTP__ _E_\nJust cancelled my subscription to @USATODAY. Boring newspaper with no mojo must be losing a fortune. Founder (cont) __HTTP__ _E_\nThis Ebola patient Thomas Duncan who fraudulently entered the U.S. by signing false papers is causing havoc. If he lives prosecute! _E_\nScary now China's Development Bank is looking to buy U.S. homes and developments __HTTP__ They will own our country soon. _E_\nI can't stress strongly enough that we are currently in a buyer's residential market.Try to buy directly from a bank. _E_\nThank you America! Together we will all #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_\nVia @businessinsider by @BKcolin: \"Donald Trump called the White House and offered to help fix the BP oil spill\" __HTTP__ _E_\nResilience is part of the survival of the fittest formula make sure you remain adaptable. _E_\nWill be calling the President of Egypt in a short while to discuss the tragic terrorist attack with so much loss of life. We have to get TOUGHER AND SMARTER than ever before and we will. Need the WALL need the BAN! God bless the people of Egypt. _E_\n.@DanaPerino Have you released a copy of the beautiful thank you card you sent me? Would you like to see it? @ericbolling @kimguilfoyle _E_\nMichigan has made great progress under Snyder Calley. @MIGOP is out early energizing the grassroots. Keep it up! #JoinMITeam _E_\nThe PGA tour just extended my Trump Doral contract for WGC for ten years. _E_\nWatch me on the @oreillyfactor tonight at 8PM. _E_\nThank you Roanoke Virginia be back soon! 
#TrumpPence16 __HTTP__ __HTTP__ _E_\nThe $200M renovations at Trump @DoralResort are right on target. When completed the course will be as good as it gets. _E_\nLike the worthless @NYDailyNews looks like @politico will be going out of business. Bad reporting no money no cred! _E_\nWho would really believe I would say such a thing about a guy I truly liked James Gandolfini. Sadly very sick people use my name. _E_\nDoes the Fake News Media remember when Crooked Hillary Clinton as Secretary of State was begging Russia to be our friend with the misspelled reset button? Obama tried also but he had zero chemistry with Putin. _E_\nI will be on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_\nWisconsin has suffered a great loss of jobs and trade but if I win all of the bad things happening in the U.S. will be rapidly reversed! _E_\nGetting to the point is appreciated by everyone. Here's some advice for public speaking: Be sincere be brief be seated. F.D. Roosevelt _E_\nDonald Trump: Yahoo Marissa Mayer Are Right Employees Should Not Work From Home __HTTP__ via @HuffPostSmBiz _E_\nI will be interviewed on @foxandfriends at 7:00 A.M. _E_\nGovernor Kasich whose failed campaign & debating skills have brought him way down in the polls is going to spend $2.5 million against me. _E_\nI am very impressed by @dennisrodman. His return to this season's @ApprenticeNBC showed who Dennis really is which is very good. _E_\nSame failing @nytimes reporter who wrote discredited women's story last week wrote another terrible story on me today will never learn! _E_\nDonate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_\nThe polls & momentum are trending towards @MittRomney. Don't let the hurricane change your thinking! _E_\nEliot had a terrible debate performance this morning against Scott Stringer. He can't spin his failing and contemptible public record. _E_\nMeeting Former Speaker Newt Gingrich next week. 
On the Agenda defeating @BarackObama. _E_\nRT @FoxBusiness: #StockAlert: U.S. markets since the election __HTTP__ _E_\nObama's foreign policy is a complete and total disaster the worst President we have ever had. _E_\nIf only @Obama was as focused on balancing the budget as he is on weakening Israel's borders then America would be on the path to solvency. _E_\nAfter 5 SB victories since 2002 it was my honor to give Bob Kraft Coach Belichick and the players their first to... __HTTP__ _E_\n#CrookedHillary's plan will add $1.15 TRILLION in new taxes. We cannot afford her! #DrainTheSwamp #Debate __HTTP__ _E_\nArizona had a 116% increase in ObamaCare premiums last year with deductibles very high. Chuck Schumer sold John McCain a bill of goods. Sad _E_\nWe fight to free Libya and they kill our Ambassador and other Americans. Obama's foreign policy is a joke. _E_\nI told you so @politico just lost it's top person. Poor results and no money to pay him. If they were legit they would be doing far better! _E_\nYou pick it! #1. Anybody that says anything derogatory about @BarackObama is labeled stupid insane or (cont) __HTTP__ _E_\nThe Schumer Rounds Collins immigration bill would be a total catastrophe. @DHSgov says it would be \"the end of immigration enforcement in America.\" It creates a giant amnesty (including for dangerous criminals) doesn't build the wall expands chain migration keeps the visa... _E_\nOur deepest sympathies and most heartfelt prayers are with the victims of the train derailment in Washington State. We are closely monitoring the situation and coordinating with local authorities... __HTTP__ _E_\nThird rate reporters Amy Chozick and Maggie Haberman of the failing @nytimes are totally in the Hillary circle of bias. Think about Bill! _E_\nThe commodity market is extremely fragile. Be wary of investing right now. The futures are way too dependent on the Fed. 
_E_\nRadio interview w/ @seanhannity discussing @PhilMickels0n_ why NY must start fracking & staying in @GOP primary __HTTP__ _E_\nGreat comeback by Tom Brady New England! _E_\n...our Great American Flag (or Country) and should stand for the National Anthem. If not YOU'RE FIRED. Find something else to do! _E_\nPoll data shows that @marcorubio does by far the best in holding onto his Senate seat in Florida. Important to keep the MAJORITY. Run Marco! _E_\nOur country has been unsuccessfully dealing with North Korea for 25 years giving billions of dollars & getting nothing. Policy didn't work! _E_\nTonight's episode of @ApprenticeNBC is not only the best episode ever it has a great lesson in life. Don't miss it! _E_\nI am the only one that knows how to build cities pols are all talk and no action. Our cities need help and fast. They are crumbling! _E_\nThe election was a major setback for economy. All young entrepreneurs should be sure to calculate Obama's policies into their investments. _E_\nGood article: What Happened to American Men from @Newsmax by Michael Cohen __HTTP__ _E_\nIn the end Andy Pettitte did not rat out his friend Roger Clemens. I like him again a lot. _E_\nI will make this right for our great Vets! _E_\nJoin me tomorrow! #MAGA 10am Baton Rouge LA. Tickets: __HTTP__ Grand Rapids MI.Tickets: __HTTP__ _E_\nVia @latimes' @LATshowtracker:\"Monday's TV Highlights:@ApprenticeNBCon @nbc\" __HTTP__ _E_\nWow Jeb Bush whose campaign is a total disaster had to bring in mommy to take a slap at me. Not nice! _E_\n__HTTP__ _E_\n _E_\nRT @TeamTrump: WATCH: @realDonaldTrump on the stakes in this election #Debates2016 __HTTP__ _E_\nVia @thestate by @AP: \"Donald Trump: Giving 'serious thought' to presidential run\" __HTTP__ _E_\nTweet me more of your questions to answer in the next video.... 
_E_\nVia @theblaze: Donald Trump on how Rubio should have drank his water __HTTP__ _E_\nRead what Donald Trump has to say about daughter Ivanka's upcoming new book The Trump Card: __HTTP__ _E_\nEarlier today I spoke with @GovMattBevin of Kentucky regarding yesterday's shooting at Marshall County High School. My thoughts and prayers are with Bailey Holt Preston Cope their families and all of the wounded victims who are in recovery. We are with you! _E_\nDonate Today To Help Make America Great Again! You Can Help Stop Crooked Hillary Clinton! __HTTP__ __HTTP__ _E_\nPervert alert–serial sexter @RepWeiner is making another step towards a comeback __HTTP__ All girls under 18 should block him. _E_\nThe Fake News refuses to report the success of the first 6 months: S.C. surging economy & jobsborder & military securityISIS & MS 13 etc. _E_\nWith the ridiculous Filibuster Rule in the Senate Republicans need 60 votes to pass legislation rather than 51. Can't get votes END NOW! _E_\nAs I said on @foxandfriends this a.m. you have to give Obama credit—he won! ... _E_\nThe debates especially the second and third plus speeches and intensity of the large rallies plus OUR GREAT SUPPORTERS gave us the win! _E_\nI will end illegal immigration and protect our borders! We need to MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 __HTTP__ _E_\n#3. Cover your bases. Know everything you can about what you're doing. _E_\n.@LindseyGrahamSC and Lyin' Ted Cruz are two politicians who are very much alike ALL TALK AND NO ACTION! Both talk about ISIS do nothing! _E_\nGive a lot of credit to Carlos Beltran for developing into a terrific baseball player and total winner for the Cardinals great going Carlos! _E_\nThe UK has run out of money and can't afford to borrow. __HTTP__ Neither can we but that doesn't stop @BarackObama. _E_\nThank you @AmSpec __HTTP__ _E_\nRT @Inspire_Us: No color no religion no nationality should come between us we are all children of God. 
Mother Teresa _E_\nNow he has made his Busey ism into a song. #CelebApprentice _E_\nRT @IvankaTrump: Working families need #TaxReform & the time is now. This Administration is committed to ensuring all Americans can thrive... _E_\nAnother sign that @jack_welch is right. New government labor report casts even more doubt on the September jobs data __HTTP__ _E_\nIt wasn't only that Obama saluted a Marine with a cup of coffee in his hand but why the hell does he have to exit a heli holding coffee? _E_\n5 year old Trey has terminal cancer. I'm helping him go to Disney won't you? __HTTP__ _E_\nMAKE AMERICA SAFE AND GREAT AGAIN! #RNCinCLE __HTTP__ _E_\n\"Dem candidates are all folks who vote with me.\" – Barack Obama describing ALL Democrat Senate candidates _E_\nRT @Scavino45: POTUS' @realDonaldTrump on Hurricane Response Efforts in #PuertoRico on Instagram part of 9/29/17 Weekly Address. __HTTP__ _E_\nGreat seeing @MarianoRivera w/@realDonaldTrump at @TrumpTowerNY for @EricTrumpFdn! __HTTP__ __HTTP__ _E_\nBill Clinton stated that I called him after the election. Wrong he called me (with a very nice congratulations). He doesn't know much ... _E_\nAn 'extremely credible source' has called my office and told me that @BarackObama's birth certificate is a fraud. _E_\nDisappointed in GOP and Dems Giving Obama power to raise the debt limit next year is a mistake. _E_\nRT @TwitterData: These are the 10 most Tweeted about world leaders during the first day of #UNGA General Debate __HTTP__ _E_\nIn just out book Secret Service Agent Gary Byrne doesn't believe that Crooked Hillary has the temperament or integrity to be the president! _E_\n\"A big key to winning is knowing where the other side is coming from.\" – Think Like a Champion _E_\nIt would be nice if our commander in chief was as concerned for our Veterans health as he is for illegal immigrants becoming citizens. 
_E_\nObama's disastrous judgment gave us ISIS rise of Iran and the worst economic numbers since the Great Depression! _E_\nRanked nationally in @GolfMagazine's top 100 Trump Int'l Golf Club in Palm Beach is a 27 hole masterpiece __HTTP__ _E_\nIn my speech on protecting America I spoke about a temporary ban which includes suspending immigration from nations tied to Islamic terror. _E_\nThink of yourself like a one man army. You're not only the commander in chief you're the soldier as well. – Think Like A Billionaire _E_\n...and a Great Leader. John has also done a spectacular job at Homeland Security. He has been a true star of my Administration _E_\nThe big and highly respected Cooley LLP is handling the @billmaher case for me. _E_\nAfter strict consultation with General Kelly the CIA and other Agencies I will be releasing ALL #JFKFiles other than the names and... _E_\nRT @LouDobbs: The stock market has gained an incredible 7.8 Trillion dollars in market value since @POTUS was elected! Looks like 4% econom... _E_\nNow @BarackObama is telling @MittRomney how to control his own assets. __HTTP__ Obama is consumed by class warfare. _E_\nLyin' Ted Cruz will never be able to beat Hillary. Despite a rigged delegate system I am hundreds of delegates ahead of him. _E_\nI cannot imagine that these very fine Republican Senators would allow the American people to suffer a broken ObamaCare any longer! _E_\nIf the press would cover me accurately & honorably I would have far less reason to tweet. Sadly I don't know if that will ever happen! _E_\nThe Mayor of San Juan who was very complimentary only a few days ago has now been told by the Democrats that you must be nasty to Trump. _E_\nFor political purposes only Obama is planning to hit Libya for the Benghazi embassy attack right before the election? _E_\nYesterday's failing @NYTimes fraudulently shows an empty room prior to my speech when in fact it was packed! 
__HTTP__ _E_\nObama is community organizing from the Oval Office on Ferguson today. More riots sure to follow. _E_\nPresident Obama thinks the nation is not as divided as people think. He is living in a world of the make believe! _E_\nGood news House just passed #KatesLaw. Hopefully Senate will follow. _E_\nU.S. Senator Bob Corker (R Tenn.) issued the following statement today regarding the 2016 presidential election: __HTTP__ _E_\nMany journalists are honest and great but some are knowingly dishonest and basic scum. They should.be weeded out! _E_\nI am going to save Social Security without any cuts. I know where to get the money from. Nobody else does. my @SRQRepublicans speech _E_\nGreat evening in Canton Ohio thank you! We are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ __HTTP__ _E_\nMitt's proposed tax cuts for the middle class will spur record economic growth. _E_\nIs this Hope & Change? A record 46.7M Americans were on food stamps this past June. We must do better. _E_\nBe sure to visit the world renowned Trump Tower Atrium to see our holiday decorations. __HTTP__ _E_\nThe greatest threat to our security is our debt. It is already past 100% GDP. We need to make real budget cuts. _E_\nThe real war on women. Under @BarackObama 766000 more women are unemployed from when he took office __HTTP__ _E_\nPresident Obama wants to change the name of the White House because it is highly discriminating and not at all politically correct! _E_\nM.M. is a good choice also nice guy! #Oscars _E_\nObamaCare will continue to stop entrepreneurship slow growth and halt research & development. Defund Repeal & Replace! _E_\nVia @politico: Donald Trump claims Barack Obama bombshell __HTTP__ _E_\nJoin us! #CaucusForTrump11am WATERLOO: __HTTP__ CEDER RAPIDS: __HTTP__ __HTTP__ _E_\nThe deal with Iran will go down as one of the most incompetent ever made. The U.S. lost on virtually every point. We just don't win anymore! _E_\nThe road to success is always under construction. 
Arnold Palmer _E_\nMaking a big speech in Alabama today. So many people we had to move to a football stadium! Come and join us! _E_\nJust signed Bill. Our Military will now be stronger than ever before. We love and need our Military and gave them everything — and more. First time this has happened in a long time. Also means JOBS JOBS JOBS! _E_\nGetting ready to meet President al Sisi of Egypt. On behalf of the United States I look forward to a long and wonderful relationship. _E_\nDon't assume you have to accept the hand you were dealt. – Think Like A Billionaire _E_\nTrump National Golf Club Los Angeles is situated on the Palos Verdes Peninsula overlooking the Pacific Ocean... __HTTP__ _E_\nMe by a lot! _E_\nSpoke to U.K. Prime Minister Theresa May today to offer condolences on the terrorist attack in London. She is strong and doing very well. _E_\nLocated in South Ayrshire Scotland @TrumpTurnberry offers diverse dining options suitable for any occasion __HTTP__ _E_\nJust announced that as many as 5000 ISIS fighters have infiltrated Europe. Also many in U.S. I TOLD YOU SO! I alone can fix this problem! _E_\nMy @FoxNews interview with @TeamCavuto discussing the national housing market unemployment numbers and the FL (cont) __HTTP__ _E_\nThe same people that said I wouldn't run or that I wouldn't lead or do well (1st place and leading by 21%) now say I won't beat Hillary. _E_\nVia @PoliticalTicker: \"TRENDING: Trump a right leaning tower at CPAC\" __HTTP__ by @KilloughCNN _E_\nI will be addressing a fantastic Ames crowd at tomorrow's @bobvanderplaats' @theFAMiLYLEADER Leadership Summit __HTTP__ _E_\nEntrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_\nRT @JaydaBF: VIDEO: Muslim Destroys a Statue of Virgin Mary! 
__HTTP__ _E_\nRT @GOP: In @timkaine's own words #Debates2016 __HTTP__ _E_\nRT @greta: Prob w/ all pundits saying last fall @realDonaldTrump had no chance is that shows media so out of touch w/ Americans _E_\nScottish government having huge backlash on wind turbines. @AlexSalmond is becoming very unpopular. _E_\n'Trump won the third debate' __HTTP__ _E_\nChina is our enemy they want to destroy us Redstate Interview _E_\n\"Trust your instincts especially if they are well honed.\" – Midas Touch _E_\nOn 59th & Park Avenue Trump Park Avenue transformed the legendary Hotel Delmonico into 120 luxury residences __HTTP__ _E_\nVery organized process taking place as I decide on Cabinet and many other positions. I am the only one who knows who the finalists are! _E_\nCan you believe that the Chinese would not give Obama the proper stairway to get off his plane fight on tarmac! __HTTP__ _E_\nLightweight shakedown artist AG Eric Schneiderman was exposed in today's New York Post editorial __HTTP__ _E_\nRT @EricTrump: Wow! I am speechless! Thank you to my sidekick @LynnePatton who keeps me & the @EricTrumpFdn in line! __HTTP__ _E_\nLittle Barry Diller who lost a fortune on Newsweek and Daily Beast only writes badly about me. He is a sad and pathetic figure. Lives lie! _E_\nAn extended interview from the Super Bowl with @oreillyfactor airs tonight at 8:00 P.M. Enjoy! __HTTP__ _E_\nSusan Rice the former National Security Advisor to President Obama is refusing to testify before a Senate Subcommittee next week on..... _E_\nDo you believe that @FoxNews is still playing up the old Iowa poll numbers and no mention of the ABCWashington Post or just out CBS results? _E_\nI've got news for President @BarackObama: America is not what's wrong with the world. I don't believe we need (cont) __HTTP__ _E_\nCosts on non military lines will never come down if we do not elect more Republicans in the 2018 Election and beyond. 
This Bill is a BIG VICTORY for our Military but much waste in order to get Dem votes. Fortunately DACA not included in this Bill negotiations to start now! _E_\nDon't forget to watch @ApprenticeNBC tonight—you will love it! 8 PM on NBC. #CelebApprentice _E_\n\"Be up front and direct with people and they will return the favor.\" – Think Like a Billionaire _E_\nRepublicans must start the Tax Reform/Tax Cut legislation ASAP. Don't wait until the end of September. Needed now more than ever. Hurry! _E_\nEntrepreneurs: Look at the solution not the problem. Learn to focus on what will give results. _E_\n.@THEGaryBusey returns to @CelebApprentice All Stars this season. His streak of chaos and havoc continues! _E_\n... A great person inspires others to see for themselves. – Harvey Mackay _E_\nWhy has Barack Obama repeatedly told inconsistent stories about his religious background? __HTTP__ Who is he? _E_\nCelebrated for its room views by @LuxTravelExpert @TrumpChicago soars a luxurious 92 stories over the Windy City __HTTP__ _E_\nHeading to Colorado for a big rally. Massive crowd great people! Will be there soon the polls are looking good. _E_\n.@newsbusters Thank you for a great and very accurate story well done! _E_\nWill be in Nashville Tennessee tomorrow (Saturday) at 2:30 P.M. So much to talk about see you there! _E_\nWhy is Senator John McCain in Syria visiting with the rebels MAKE AMERICA GREAT AGAIN! _E_\nDespite the long delays by the Democrats in finally approving Dr. Tom Price the repeal and replacement of ObamaCare is moving fast! _E_\n.@EmilyMiller's book Emily Gets Her Gun exposes the attack on our Second Amendment __HTTP__ A must read! _E_\nSo the highly overrated anchor @megynkelly is allowed to constantly say bad things about me on her show but I can't fight back? Wrong! 
_E_\nVia @DailyMail by @chriskitching: \"Luxury penthouse at @TrumpChicago skyscraper sells for record $17M\" __HTTP__ _E_\nDo you believe Barack Hussein Obama (aka Barry Soetoro) looked like a president last night? I don't! _E_\nJust leaving D.C. Had great meetings with Republicans in the House and Senate. Very interesting day! These are people who love our country! _E_\nAmericans Elect on track to put an Indy Presidential candidate on the ballot in all 50 states. _E_\nStephanie Cutter Attended WH Meetings With IRS Chief __HTTP__ Great investigative work by Jim Hoft @gatewaypundit _E_\nMERRY CHRISTMAS!! __HTTP__ _E_\n\"Trump: I created tens of thousands of jobs\" __HTTP__ via @thehill by @SmiloTweets _E_\nI'm not sure about @teresa_giudice as Project Manager. @lisalampanelli can be formidable. But let's see what happens #sweepstweet _E_\nMany of @TigerWoods' 'friends' were quick to abandon him in his time of crisis. Now Tiger knows who he can count on. _E_\nOne of the most obvious lessons on @ApprenticeNBC is for the candidates to learn to think quickly. Think Like a Champion _E_\nA 7242 yd. masterpiece @TrumpGolfLA's $250 million course features 18 challenging holes with incredible views __HTTP__ _E_\nI have fun I love what I do. You should too. Find out how at the National Achievers Conference this October. __HTTP__ _E_\nRT @fundanything In case you missed it check out @washingtonpost story about @realDonaldTrump & @fundanything __HTTP__ _E_\nThe Chinese are planning on going to the Moon...I hope they stop and take a look at our flag that was put there 43 years ago. @MittRomney _E_\nVia @MoscowTimes Donald Trump Planning Skyscraper in Moscow __HTTP__ _E_\nOnce again the Bush appointed Supreme Court Justice John Roberts has let us down. Jeb pushed him hard! Remember! _E_\nMilitary solutions are now fully in placelocked and loadedshould North Korea act unwisely. Hopefully Kim Jong Un will find another path! 
_E_\nRe Megyn Kelly quote: you could see there was blood coming out of her eyes blood coming out of her wherever (NOSE). Just got on w/thought _E_\nIf everyone is thinking alike then somebody isn't thinking. George S. Patton _E_\nA must read for any country or community considering wind turbines. __HTTP__ _E_\nI think the @yankees will win today. Unlike A Rod CC is good under pressure. I hope A Rod plays however. _E_\nCongratulations to @ehasselbeck on her successful first day as co host of @foxandfriends! Great to be in studio today for Elisabeth. _E_\nBringing hundreds of billions of dollars back to the U.S.A. from the Middle East which will mean JOBS JOBS JOBS! _E_\nTheresa @theresamay don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_\nThank you Waterbury Connecticut!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nObama is now standing in a puddle acting like a President give me a break. _E_\nWhen James Clapper himself and virtually everyone else with knowledge of the witch hunt says there is no collusion when does it end? _E_\n.@sternshow My interview with Howard Stern this morning! __HTTP__ __HTTP__ _E_\n#TBT With my family growing up I'm on the left. __HTTP__ _E_\nWow pres. candidate Ben Carson who is very weak on illegal Immigration just said he likes amnesty and a pathway to citizenship. _E_\nNick Adams Retaking America Best things of this presidency aren't reported about. Convinced this will be perhaps best presidency ever. _E_\nSanders said only black lives matter wow! Hillary did not answer question! _E_\nWe have all been following the Wisconsin recall election. @ScottKWalker's victory tonight will be well earned. A Governor who gets results. _E_\nYesterday China VP Xi stressed the benefits of trade with China to Congress __HTTP__ We need FAIR TRADE with China! 
_E_\nObamaCare is a disaster and Snowden is a spy who should be executed but if it and he could reveal Obama's recordsI might become a major fan _E_\nThe rally in Cincinnati is ON. Media put out false reports that it was cancelled. Will be great love you Ohio! _E_\nI'm proud to accept the 2010 HollyRod Foundation Humanitarian Award from Holly Robinson Peete who raised $700000 on Celebrity Apprentice _E_\nVia @FoxNews: \"Trump: Politicians are all talk no action I'm the opposite\" __HTTP__ _E_\nR.P.Virginia has lost statewide 7 times in a row. Will now not allow desperately needed new voters. Suicidal mistake. RNC MUST ACT NOW! _E_\nOn Holocaust Remembrance Day we mourn and grieve the murder of 6 million innocent Jewish men women and children and the millions of others who perished in the evil Nazi Genocide. We pledge with all of our might and resolve: Never Again! __HTTP__ __HTTP__ _E_\nDespite the establishment and the media's best efforts the people are speaking loudly and clearly. Thank you to my amazing supporters! _E_\nDespite what you hear in the press healthcare is coming along great. We are talking to many groups and it will end in a beautiful picture! _E_\nFormer Prosecutor: The Clintons Are So Corrupt Everything 'They Touch Turns To Molten Lead' __HTTP__ _E_\nThank you Ohio! Just landed in Canton for a rally at the Civic Center. Join me at 7pm: __HTTP__ __HTTP__ _E_\nRT @FoxNews: .@KellyannePolls on Harvey recovery: We hope when it comes to basic Hurricane Harvey funding that we can rely upon a nonpartis... _E_\nBe sure to enjoy the '50th Anniversary Chicago International Film Festival' at @TrumpChicago the Windy City's top hotel! _E_\n\"When your brand begins to build you too will be faced with opportunities for greater recognition.\" – Midas Touch _E_\nLots of response to my Pattinson/Kristen Stewart reunion. She will cheat again 100 certain am I ever wrong? _E_\n.@WSJ Editorial Board should review my debate statement re China and T.P.P. and apologize. 
China not part but will get their way in later. _E_\nDo these very stupid politicians who got us involved in Iraq look bad or what? Everybody wants their oil only made possible by U.S.! _E_\n.@lisarinna did amazing on #CelebApprentice @ApprenticeNBC. Raising over $505K for @StJude she made it to the Final Four. Congrats Lisa. _E_\nJoin us in Sparks Nevada today! #NevadaCaucus #VoteTrumpNV __HTTP__ _E_\nThe Republicans never should have agreed to this past summer's debt deal. Military cuts will now come along with tax increases. _E_\nWow the MSM is really going after me. 12000 in Sarasota a love fest hardly a mention. Only one negativity they only want negatives! _E_\nJoin me for a 3pm rally tomorrow at the Mid America Center in Council Bluffs Iowa! Tickets:... __HTTP__ _E_\nRT @AbeShinzo: トランプ大統領による、初の、歴史的な日本訪問は、間違いなく、日米同盟の揺るぎない絆を世界に示すことができました。本当にありがとう、ドナルド。そして、アジア歴訪の大成功をお祈りしています。@realDonaldTrump __HTTP__ _E_\nLet's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy! _E_\n.@FrankLuntz I won every poll of the debate tonight by massive margins @DRUDGE_REPORT & @TIME so where did you find that dumb panel. _E_\nObama admin. called @netanyahu chickenshit. Ironic since Bibi was an IDF Special Forces commando while Obama was a community organizer. _E_\n\"@jacknicklaus elated at official grand opening of @TrumpFerryPoint\" __HTTP__ via @nypost by @NYPost_Willis _E_\n'U.S. Small Business Optimism Index Surges by Most Since 1980' __HTTP__ _E_\nThanks everybody for the Happy Birthday greetings but it's actually the 10th birthday of The Apprentice. My birthday is June 14th.... _E_\nToday marks the one year anniversary of @AndrewBreitbart's passing. Andrew's mission & legacy still lives on. 
@BreitbartNews _E_\nGreat @UnionLeader piece by @jdistaso on my visit to @saintanselm for @NECouncil & @nhiop Politics & Eggs __HTTP__ _E_\nI'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_\n\"TRUMP BATTLES THE NEW TOTALITARIANS: GOP elites join with leftists at Media Matters in targeting threat to both\" __HTTP__ _E_\nVia @FoxNewsInsider as seen on @foxandfriends: \"Trump: Iran Nuke Talks Should Have Taken One Day\" __HTTP__ _E_\nEvery strike brings me closer to the next home run. – Babe Ruth _E_\nI will be on @foxandfriends at 7:00 in 10 minutes. HAVE A GREAT DAY ALL! _E_\n\"Concentration and mental toughness are the margins of victory.\" Bill Russell _E_\nThe freezing cold weather across the country is brutal. Must be all that global warming. _E_\nRT @DonaldJTrumpJr: Last chance #Wisconsin: Find your polling location for today's primary & go vote! Visit __HTTP__ #T... _E_\nA third rate architecture critic who I thought got fired—for the failing @chicagotribune likes the building but doesn't like the Trump sign _E_\nThe sequester is less than 2% of total 2013 budget. Why can't the WH re allocate funds and keep the tours open for children? #OpenOurWH _E_\nThe Democrats without a leader have become the party of obstruction.They are only interested in themselves and not in what's best for U.S. _E_\nIsn't it sad that Weiner's first press conference with wife Huma was yesterday admitting to a sext he made post resignation & apology! _E_\n#MakeAmericaGreatAgain #6Days __HTTP__ _E_\nWe must stop the crime and killing machine that is illegal immigration. Rampant problems will only get worse. Take back our country! _E_\nThe NY SAFE Act is an unconstitutional attack on 2nd Amendment rights. Will also increase crime. 
_E_\n.@guardian_sport by @mrewanmurray:\"Donald Trump's transformation will make @TrumpTurnberry Open worth the wait\" __HTTP__ _E_\nGreat jobs numbers and finally after many years rising wages and nobody even talks about them. Only Russia Russia Russia despite the fact that after a year of looking there is No Collusion! _E_\nKeep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_\nPresident Obama is under pressure from Democrats to undo his lie on ObamaCare. His problem is that such a move would end ObamaCare. _E_\nThanks for all the nice words on my keeping the Trump Tower atrium accessible to stranded victims of #Sandy. My honor. _E_\n. #LaskerRink. We do not do the maintenance on Lasker Rink that is done by NEW YORK CITY. _E_\nFacebook billionaire gives up his U.S. citizenship in order to save taxes. I guess 3.8 billion isn't enough for (cont) __HTTP__ _E_\nSo @JLin7 had another game winning shot last night. Looks like the Knicks have not only found a new point guard (cont) __HTTP__ _E_\nEXCLUSIVE — Video Interview: Bill Clinton Accuser Juanita Broaddrick Relives Brutal Rapes: __HTTP__ _E_\nJust 30 minutes from Manhattan @TrumpNationalNY is Westchester's most elite club offering a 7291 yard course __HTTP__ _E_\nPeople are proud to be saying Merry Christmas again. I am proud to have led the charge against the assault of our cherished and beautiful phrase. MERRY CHRISTMAS!!!!! _E_\nThe media is pathetic. Our embassies are savaged by radicals while Obama does nothing and all they can do is criticize @MittRomney. _E_\nGetting ready to land in Charlottesville Virginia at Trump Vineyards another job producing development that I bought and made AMAZING! _E_\nI will be at the @USGA #USWomensOpen in Bedminster NJ tomorrow. Big crowds expected & the women are playing great should be very exciting! _E_\n.@evaemery Thanks you sound great! 
_E_\n69 Democrats voted in favor of the Keystone pipeline in the House this week __HTTP__ A major defeat for @BarackObama _E_\nGreat sign: We built this business without government help. Obama can kiss our a ! __HTTP__ Commonly heard now across America! _E_\nThe full video of my @LibertyU speech __HTTP__ Liberty's largest ever Convocation crowd. _E_\nMerry Christmas to all have a fantastic day year and life! The World with great leadership will become a much more beautiful place! _E_\nSo happy about my daughter @IvankaTrump's announcement that she will be having a baby this spring. Congratulations! _E_\n.@ArsenioOFFICIAL Thx for the good wishes you are going to have a really big year! _E_\n#Obamacare premiums are about to SKYROCKET again. Crooked H will only make it worse. We will repeal & replace! __HTTP__ _E_\nSo many politically correct fools in our country. We have to all get back to work and stop wasting time and energy on nonsense! _E_\nThe FAKE MSM is working so hard trying to get me not to use Social Media. They hate that I can get the honest and unfiltered message out. _E_\nWhat we are watching on our TV screens is the unraveling of the Obama foreign policy. @PaulRyanVP _E_\nFor China of all nations to search the massive Indian Ocean and pick up the ping from the black box of flight 370 sounds a bit far fetched _E_\nThe race for DNC Chairman was of course totally rigged. Bernie's guy like Bernie himself never had a chance. Clinton demanded Perez! _E_\nNo surprise that @BBC is in a major scandal for shoddy journalism. Any network that air's @antbaxter's garbage has zero credibility. _E_\nWhy isn't Obama protecting us from ridiculous gas prices? _E_\nWe all know that chess is a game of strategy. So is business. Think about that and develop a strategy starting today. _E_\nIran is desperate to develop nukes. Congress must increase sanctions against Iran. _E_\n\"Out of clutter find Simplicity. From discord find Harmony. 
In the middle of difficulty lies Opportunity.\"–Albert Einstein _E_\nLightweight @AGSchneiderman just got his ass kicked by Trump! _E_\nRT @DRUDGE_REPORT: Obama Refers to Himself 119 Times During Hillary Nominating Speech... __HTTP__ _E_\nMy @foxandfriends int.on @FoxNewsInsider:\"'We Have No Leadership': Trump Slams Obama for Skipping Paris Unity Rally\" __HTTP__ _E_\nBack by popular demand the record 13th season of 'All Star' @CelebApprentice features the return of @bretmichaels. Our fans will be happy. _E_\nHe @BarackObama received an early endorsement from the Soviet newspaper Pravda over @MittRomney (cont) __HTTP__ _E_\nJust found out that at a charity auction of celebrity portraits in E. Hampton my portrait by artist William Quigley topped list at $60K _E_\nI will be interviewed on @oreillyfactor tonight at 11pmE @FoxNews. Enjoy! _E_\nBut maybe my biggest beef with Obama is his view that there's nothing special or exceptional about America. #TimeToGetTough _E_\nI don't watch or do @Morning_Joe anymore. Small audience low ratings! I hear Mika has gone wild with hate. Joe is Joe. They lost their way! _E_\nThe Boston killer will soon be asking for a Presidential pardon—don't give it to him Mr. President—hang tough! _E_\nWhen will we see stories from CNN on Clinton Foundation corruption and Hillary's pay for play at State Department? _E_\nPathetic! Since @GovWalker is going to win the recall @BarackObama is trying to disown the endorsement of Tom Barrett __HTTP__ _E_\nThe new Libyan Government should turn over the Lockerbie bomber now. _E_\nI will be interviewed on Fox News Sunday With Chris Wallace at 9:00 A.M. or 10:00 A.M. (depending on location). Will be tough but good! _E_\nJoin me tomorrow! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\nA Veteran & true Conservative @leezeldin will make a real difference in Washington. NY 1 GOP GOTV for Lee tomorrow! _E_\nexpensive mistake! 
THE UNITED STATES IS OPEN FOR BUSINESS _E_\nJOIN ME TOMORROW IN FLORIDA!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_\nI told you so. Our country totally lost control of illegal immigration even with criminals. __HTTP__ _E_\nThank you to NC for last evenings great reception. The speech was a great success. Heading now to Louisiana & another speech tonight in MI. _E_\nHow can the NY Times show an empty room hours before my speech even started when they knew it was going to be packed? So totally dishonest! _E_\nOur great VETERANS are being treated very badly because of corruption and incompetence at the V.A. That will stop I will fix this quickly! _E_\n.@TrumpLasVegas was just rated \"Best Room Service\" in LV by The Daily Meal. Congrats to my Las Vegas staff! __HTTP__ _E_\nJoin me Tuesday in Everett Washington at the Xfinity Arena! Tickets: __HTTP__ __HTTP__ _E_\nRT @TeamTrump: .@realDonaldTrump calling out @HillaryClinton's support for NAFTA = most searched moment during tonight's debate. #Debates20... _E_\nThe five prisoners our government so stupidly released for one pathetic traitor are now fighting and killing for ISIS BAD DEAL! Courtmarshal _E_\nCongratulations to @RickSantorum for coming out of Iowa a winner! _E_\nWith @shawnjohnson and @lorenzolamas from @apprenticenbc two great people! __HTTP__ _E_\nBIG NIGHT ON TWITTER TONIGHT. I WILL BE LIVE TWEETING PRESIDENT OBAMA'S SPEECH AT 7:50 P.M. ( EASTERN). MUST TALK RADICAL ISLAMIC TERRORISM! _E_\nMoney may not grow from trees but it does grow from talent hard work and brains. Think Like a Billionaire _E_\nThanks for the tremendous support for my shirts ties and suits at Macy's. They do great because of really high quality at a low price. _E_\nPresident Obama just had a news conference but he doesn't have a clue. Our country is a divided crime scene and it will only get worse! _E_\nRon Paul is right when he says we are wasting lives and money in Iraq and Afghanistan. 
_E_\nIt's amazing that some of the dumbest people on television work for the Wall Street Journal in particular a real dope named Charles Lane! _E_\nI can't believe no one has been fired over the ObamaCare website fiasco! _E_\nMy @foxandfriends interview from this morning __HTTP__ _E_\nLegendary Illusionist v. Country Music Star. This Sunday's LIVE Finale of @ApprenticeNBC is a historic matchup. MUST SEE TV! _E_\nIsn't it amazing that @CNN paid a fortune for an Iowa Poll which shows me in first place over Cruz by 13% 33% to 20% then doesn't use it _E_\nWhere's the accountability for the $635M website fiasco in the Obama administration? Heads should roll and officials should be fired _E_\nGreat new poll thank you! __HTTP__ _E_\nJoin me in Naples Florida this evening at 6:00pm! Tickets: __HTTP__ __HTTP__ _E_\nIsn't it terrible that @megynkelly used a poll not used before (I.B.D.) when I was down but refuses to use it now when I am up? _E_\nIf the Dems (Crooked Hillary) got elected your stocks would be down 50% from values on Election Day. Now they have a great future and just beginning! __HTTP__ _E_\nI wonder how much money dumb @BuzzFeed and even dumber Ben Smith loooose each year? They have zero credibility totally irrelevant and sad! _E_\nA level will be reached where ObamaCare will be so out of control expensive and unwieldy that the biggest supporters will abandon ship. _E_\nBarack Obama's delivery on Saturday night was excellent cute mention of Trump and I am flattered to be mentioned. @BarackObama _E_\nI am watching @CNN very little lately because they are so biased against me. Shows are predictable garbage! CNN and MSM is one big lie! _E_\nThe long anticipated release of the #JFKFiles will take place tomorrow. So interesting! _E_\nRemember @foxandfriends at 7:00 A.M. and Celebrity Apprentice at 8:00 P.M. Enjoy! _E_\n.@mdamelincourt Thanks M you are doing a great job at Trump Toronto! _E_\nCongratulations to @netanyahu on his electoral victory. 
He will now be the longest serving @IsraeliPM. A great leader. _E_\nTHE WEST WILL NEVER BE BROKEN. Our values will PREVAIL. Our people will THRIVE and our civilization will TRIUMPH! __HTTP__ _E_\nThank you Charlotte North Carolina. Great afternoon! #ICYMI I delivered a speech on urban renewal. Full speech:... __HTTP__ _E_\nNew reality. Yuan just passed the Euro as 2nd most traded finance currency __HTTP__ Our leaders better get smart fast. _E_\nGreat meeting with @SenateMajLdr Mitch McConnell and Republican leaders in D.C. #Trump2016 __HTTP__ _E_\nI will not be able to attend the Miss USA pageant tomorrow night because I am campaigning in Phoenix. Wishing all well! _E_\nTaliban targeted innocent Afghans brave police in Kabul today. Our thoughts and prayers go to the victims and first responders. We will not allow the Taliban to win! _E_\nMy interview with Don Imus on @77WABCradio discussing my @RNC convention surprise & @MittRomney's China policy __HTTP__ _E_\nA great new book has been written about Crooked Hillary. Read it & you will never be able to vote for her. @Ed_Klein __HTTP__ _E_\nYesterday I was in Washington D.C. visiting the #TRUMP Old Post Office renovation. It will be magnificent. _E_\nAs I have long stated we are so tied in with China and Asia that their markets are now taking the U.S. market down. Get smart U.S.A. _E_\nThe new Rasmussen Poll one of the most accurate in the 2016 Election just out with a Trump 50% Approval Rating.That's higher than O's #'s! _E_\nAfter today Crooked Hillary can officially be called Lyin' Crooked Hillary. _E_\n#TBT A picture of my fantastic father and myself. Best teacher in the world! A great Father's Day... __HTTP__ _E_\nJudge Gorsuch will be sworn in at the Rose Garden of the White House on Monday at 11:00 A.M. He will be a great Justice. Very proud of him! 
_E_\nHave a great weekend everyone and for those of you that are young entrepreneurs have fun but never stop thinking of the task ahead victory _E_\nI will be on Fox & Friends (@foxandfriends) at 7.00. Fighting Ebola will be a topic! _E_\nThe Republican House members are working hard (and late) toward the Massive Tax Cuts that they know you deserve. These will be biggest ever! _E_\n.@THEGaryBusey is making no attempt to help. Is he in BuseyLand? Their team is short on help already... #CelebApprentice _E_\nOur president could not make a proper website with $5B. The website still does not work. How can we feel safe about Ebola?! _E_\nIncredible crowd in Richmond Virginia tonight! So much spirit and energy! #makeamericagreatagain __HTTP__ _E_\nstates instead of the 15 states that I visited. I would have won even more easily and convincingly (but smaller states are forgotten)! _E_\n.@EricTrump unbelievable job on #FoxNews with @greta. That was better than I could do! #Trump2016 _E_\nTomorrow in DC: 1 PM West Front Lawn of the Capitol. Not even believable that we would do this deal with Iran. _E_\nAll raising taxes on businesses does is force business owners to lay off employees they can no longer afford. (cont) __HTTP__ _E_\nMy @gretawire interview discussing @BarackObama's misleading political ad @MittRomney's response and @Cher & @Rosie __HTTP__ _E_\nAngelina and Sidney had a really strange vibe going! #Oscars _E_\nWe need a tax system that is fair and smart one that encourages growth savings and investment. #TimeToGetTough _E_\nThe Crooked Hillary V.P. choice is VERY disrespectful to Bernie Sanders and all of his supporters. Just another case of BAD JUDGEMENT by H! _E_\nHealthcare listening session w/ @VP & @SecPriceMD. Watch: __HTTP__ #ReadTheBill:... __HTTP__ _E_\nBack to work for the President to try and keep some dignity for the office and himself. The so called rebels must be thoroughly confused! _E_\nCoincidence? 
Obama and Ahmadinejad each describe @Israel's warning over the Iranian nuclear program as just 'noise' __HTTP__ _E_\nI'm self funding and I am going to take care of the people – not the special interests and insurance companies like the other candidates. _E_\nMy @nbc @todayshow interview discussing my @RNC video & why @MittRomney should not apologize __HTTP__ _E_\nSorry folks but Bernie Sanders is exhausted just can't go on any longer. He is trying to dismiss the new e mails and DNC disrespect. SAD! _E_\n.@DavidLetterman @Late_Show fully apologized last night for calling me a racist. Thank you David we are again friends. _E_\nTrump International Golf Links and Hotel Ireland is located on the Atlantic Ocean in County Clare. Spectacular! __HTTP__ _E_\nGolf bookings for next season on Scottish course are already double our projections for April opening—great news... __HTTP__ _E_\n#TrumpAdvice __HTTP__ _E_\nThis 'deal' @RNC voted for has $41 in tax increases for every $1 in spending cuts. It is pathetic. Obama is laughing at them. _E_\nWill be back in Virginia tonight for a 6pm rally at the Berglund Center in Roanoke. Join me! Tickets:... __HTTP__ _E_\nThe recession was made worse by @BarackObama. A $900Billion deficit is not getting better. _E_\nSuccess tip: See yourself as victorious. This will focus you in the right direction. Apply your skills and talent and be tenacious. _E_\nAmazing that Ted Cruz can't even get a Senator like @BenSasse who is easy to endorse him. Not one Senator is endorsing Canada Ted! _E_\nWatch – Obama in 2006: \"I've stolen ideas from Jonathan Gruber\" __HTTP__ And now Obama claims he is 'just some adviser.' _E_\nRead this about @Lawrence.... __HTTP__ _E_\nOne who fears failure limits his activities. Failure is only the opportunity to more intelligently begin again. Henry Ford _E_\nThe Mayor of Baltimore said she wanted to give the rioters space to destroy another real genius! 
_E_\nExclusive–Donald Trump: Obama 'Totally Out Negotiated' by Iran Taliban 'Virtually Every Country in the World' __HTTP__ _E_\nLance Armstrong is having a breakdown. What is he doing—his life is now officially over! _E_\nMy @gretawire interview __HTTP__ _E_\nCan't believe these totally phoney stories 100% made up by women (many already proven false) and pushed big time by press have impact! _E_\nIf the election were based on total popular vote I would have campaigned in N.Y. Florida and California and won even bigger and more easily _E_\nBob Turner great guy great businessman will be a great Congressman. Was happy to help him win. _E_\nIs this boring or is it just me? #Oscars _E_\nThe reason for the plan negotiated between the Republicans and Democrats is that we need 60 votes in the Senate which are not there! We.... _E_\nVia The Political Insider: \"Donald Trump Just Received The Best News Possible!\" __HTTP__ _E_\nEntrepreneurs: Stay focused and be tenacious. Pay attention to people who know what they're talking about. Stay fixed on your goals! _E_\nHousing prices will be going up big league a great time to buy good luck! _E_\nI will be on Meet the Press with Chuck Todd on NBC this morning. Enjoy! __HTTP__ _E_\nMy new book tells some harsh truths and lays out some bold plans. Time for America to be #1 again. #TimeToGetTough _E_\nThank you Anaheim California!#Trump2016 __HTTP__ _E_\nWe agree @POTUS SHE'LL (Hillary Clinton) SAY ANYTHING & CHANGE NOTHING. IT'S TIME TO TURN THE PAGE President Obama _E_\nWhy has @BarackObama allowed the Muslim Brotherhood to visit the @whitehouse? What Hope & Change! _E_\nThe real unemployment rate according to the CBO is 15% __HTTP__ @BarackObama's economic recovery is all Hope _E_\nObama called August's job report progress. Overall 96K new jobs & over 173K new people on food stamps __HTTP__ _E_\nI am leaving China for #APEC2017 in Vietnam. 
@FLOTUS Melania is staying behind to see the zoo and of course the Great WALL of China before going to Alaska to greet our AMAZING troops. _E_\nI'm a conservative but the weakness of conservatives is that they destroy each other whereas liberals unite to win. _E_\nRT @joshrogin: Pence is right. Clinton & Obama tried to negotiate an Iraq troop extension but failed. Bush admin always anticipated such an... _E_\nLIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_\nTonight's official count 7943. An all time record for the Anderson Civic Center in SC! Thanks! #Trump2016 __HTTP__ _E_\nThanks to everyone for your kind birthday wishes very nice! _E_\nThe Fed's reckless monetary policy is going to create record inflation. _E_\nRepublicans better start listening to and respecting the Tea Party! _E_\n.@Kstupples Thanks for the nice comments on Trump National Doral. I've long been your fan—now am an even bigger fan! @TrumpDoral _E_\nI encourage EVERYONE in the path of #HurricaneIrma to heed the advice and orders of local & state officials! __HTTP__ _E_\nPeople are anxiously awaiting my decision as to who the next head of the Fed will be.... __HTTP__ _E_\nGo confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_\nThe irony is that the Freedom Caucus which is very pro life and against Planned Parenthood allows P.P. to continue if they stop this plan! _E_\n3 Republicans and 48 Democrats let the American people down. As I said from the beginning let ObamaCare implode then deal. Watch! _E_\nA message from @IvankaTrump! #SCPrimary #VoteTrumpSC #MakeAmericaGreatAgain Video: __HTTP__ __HTTP__ _E_\nCongratulations to @TrumpIntRealty for the two top rentals in 2013! __HTTP__ #TIRNYC _E_\nWill the Keystone XL pipeline finally be approved? Will create over 100000 jobs and make us more energy independent. 
_E_\n\"Sometimes people spend too much time focusing on problems instead of focusing on opportunities.\" – Think Like A Champion _E_\nI don't consider writing books a small venture...writing books is essentially a sharing experience. @MidasTouch @theRealKiyosaki _E_\nMy @SquawkCNBC #TrumpTuesday interview discussing QE3 @MittRomney's leaked comments Middle East & US oil capability __HTTP__ _E_\nWe will all have fun and hopefully learn something tonight. I will shoot straight and call it as I see it both the good and the bad. Enjoy! _E_\n\"Any political leader who won't face the future head on is putting the American Dream at risk.\" – The America We Deserve _E_\nThis is the best deal the Republicans could get? _E_\nWith 15% US real unemployment and a 16T debt @Michelle Obama's luxurious Aspen vacation her 16th cost us over $1M __HTTP__ _E_\n\"Do what you can with what you have where you are.\" Theodore Roosevelt _E_\nTake action every day and stay focused for the long haul.\" Think Big _E_\nStop the assault on American values. Stand w/ Trump to #MakeAmericaGreatAgain!#VotersSpeak: __HTTP__ __HTTP__ _E_\nDefund it or own it. If you fund it you're for it. @SenMikeLee _E_\nYou can have the best product in the world but if people don't know about it it's not going to be worth much. The Art of the Deal _E_\nCrime and killings in Chicago have reached such epidemic proportions that I am sending in Federal help. 1714 shootings in Chicago this year! _E_\nVia @LasVegasSun by Eugene R. Dunn: \"Impeach Obama and elect Trump\" __HTTP__ _E_\nCORRUPT with the national security leaks and Fast & Furious there are clearly at least two cover ups in @BarackObama's White House. _E_\nTed Cruz is mathematically out of winning the race. Now all he can do is be a spoiler never a nice thing to do. I will beat Hillary! _E_\nThe Fake News is now complaining about my different types of back to back speeches. Well their was Afghanistan (somber) the big Rally..... 
_E_\nJoin me in Dallas Texas on Thursday!#AmericaFirst #Trump2016 __HTTP__ __HTTP__ _E_\nI will be live tweeting! _E_\nObama just bought the Afghan Police $288M in ammo __HTTP__ Make no mistake some of these will be shot at our troops. _E_\n#SecondAmendment #2A#Debates __HTTP__ _E_\nYesterday was another big day for jobs and the Stock Market. Chrysler coming back to U.S. (Michigan) from Mexico and many more companies paying out Tax Cut money to employees. If Dems won in November Market would have TANKED! It was headed for disaster. _E_\nTotally made up facts by sleazebag political operatives both Democrats and Republicans FAKE NEWS! Russia says nothing exists. Probably... _E_\n'Huma Abedin told Clinton her secret email account caused problems' __HTTP__ _E_\nRT @Team_Trump45: @realDonaldTrump __HTTP__ _E_\nOnly by enlisting the full potential of women in our society will we be truly able to #MakeAmericaGreatAgain... __HTTP__ _E_\n.@EdWGillespie will totally turn around the high crime and poor economic performance of VA. MS 13 and crime will be gone. Vote today ASAP! _E_\nToday it was my great honor to proclaim January 15 2018 as Martin Luther King Jr. Federal Holiday. I encourage all Americans to observe this day with appropriate civic community and service activities in honor of Dr. King's life and legacy. __HTTP__ _E_\nWhether you like Obama or not Bob Gates turned out to be one disloyal dude! Personally I hate rats. _E_\nLooking forward to being @TrumpSoHo this evening for Corporate Meeting Planners reception for Trump National Doral @TrumpDoral _E_\n.@antbaxter Thanks for helping promote & make Trump International Golf Links Scotland so successful you stupid fool! _E_\nThe ISIS thug who murdered American journalist James Foley may have been Gitmo detainee __HTTP__ If so why was he released? _E_\n\"You miss 100% of the shots you don't take.\" Wayne Gretzky _E_\nJoe Girardi did a great job of managing the Yankees this series. 
_E_\nBarackObama set a record deficit last February $229 billion while borrowing 42 cents of every dollar it spent. @BarackObama is reckless. _E_\nKasich has already spent $6 million on ads in New Hampshire and his numbers have gone down. People from NH are smart! _E_\nGreat article on so called climate change formerly known as global warming. __HTTP__ _E_\nISIS just claimed the Degenerate Animal who killed and so badly wounded the wonderful people on the West Side was their soldier. ..... _E_\nThank you Evansville Indiana! #MakeAmericaGreatAgain __HTTP__ _E_\nCrooked Hillary Clinton is 100% owned by her donors. #ImWithYou #MAGA __HTTP__ _E_\nPutin says Russia can't allow a weakening of its nuclear deterrent—U.S. wants to reduce—are we crazy? _E_\nWe have to combat the welfare mentality that says individuals are entitled to live off taxpayers. #TimeToGetTough _E_\nRemember how @ObamaCare did not have any tort reform? Now the trial lawyers are getting ready for even more lawsuits __HTTP__ _E_\nGee @meetthepress with @chucktodd was getting terrible ratings then with me he set records I saved his job but Chuck still not nice! _E_\nWill be on @bloombergtv tomorrow with @sruhle. Enjoy! _E_\n.@VattenfallGroup couldn't sell its money losing Aberdeen windfarm—so @AlexSalmond forced phony extension. @AberdeenCC @Aberdeenshire _E_\nWork is fun deals are fun life is fun but love of a great family makes it all come together. Go out there and make your family proud. _E_\n__HTTP__ _E_\nLooks like a very good World Series game! _E_\nRT @TeamTrump: We need STRONG BROAD SHOULDERED leadership like @mike_pence & @realDonaldTrump in the White House! #VPDebate #BigLeagueTrut... _E_\n#2. Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_\nAt this point the legacy of the Obama Administration will be sadly that of THE GANG THAT COULDN'T SHOOT STRAIGHT what a pathetic mess! _E_\n.@StephenBaldwin7 thinks @TheRealMarilu is ping ponging all over the place. Do you agree? 
#CelebApprentice _E_\nMake no mistake Obamacare is the first step towards changing our health system into single payer. Just a disaster. _E_\nA lot of complaints from people saying my name is not on the ballot in various places in Florida? Hope this is false. _E_\nIf Jeb Bush were more competent he could not have lost the skirmish with Marco in the debate. BAD facts for Marco if properly delivered! _E_\nVia @BBCScotland: \"Donald Trump's name 'will boost @TrumpTurnberry '\" __HTTP__ _E_\nWith the whacko pervert Weiner about to be embarrassed all women need to be on the lookout. Sexting begins 9.11 @ 12:01 AM _E_\nObama's spending and borrowing is burying America and destroying our children's future. Does he even care? _E_\n#TrumpVlog Make our country great again! __HTTP__ _E_\nAmazing how fast all of Joe Paterno's friends abandoned him. They ran for the hills. _E_\nWhat did we get for fighting in Libya besides a dead Ambassador. Demand their oil. _E_\nIn Hudson Valley @TrumpNationalNY's course has pristine fairways tour caliber greens & 64 strategic sand bunkers __HTTP__ _E_\nHouse GOP wants to cut Medicare Obama took $500 billion from Medicare for Obamacare. Both Wrong! _E_\nVia Union Leader: Trump leads tribute for slain journalist James Foley | New Hampshire First Amendment Awards __HTTP__ _E_\nFunny how the failing @nytimes is pushing Dems narrative that Russia is working for me because Putin said Trump is a genius. America 1st! _E_\nIs everything ok over there @Salon? I actually got some good press from them today. _E_\nIran has never had a better friend than Obama. _E_\nRemember the golden rule of negotiating: He who has the gold makes the rules. _E_\nThe Fed must be reined in. In 2011 the Fed bought 61% percent of US debt even more than 2008. Unsustainable! __HTTP__ _E_\nBen Carson wants to abolish Medicare I want to save it and Social Security. 
_E_\nNIELSEN RATINGS: 1.@ThisWeekABC 2.52 viewers 6 SHR1.91RTG .55 25 54 2.@meetthepress 2.24 total viewers 5 SHR1.61RTG .47 25 54 _E_\nBarney Frank looked disgusting nipples protruding in his blue shirt before Congress. Very very disrespectful. _E_\nWonderful Frank Gifford has just passed away at age 84. He was my friend and a truly great guy! Warmest condolences to family. _E_\nOnly 10 more days until the premiere of All Star @ApprenticeNBC. On March 3rd at 9PM EST @NBC the fireworks return to the Board Room! _E_\nTo Tom Brady @patriots and Gisele Best wishes on the birth of your daughter. Tom is a great player and great friend. _E_\n\"The aesthetic the quality has to be carried all the way through.\" Steve Jobs _E_\nMiss USA Tara Conner will not be fired I've always been a believer in second chances. says Donald Trump _E_\nLittle Andy Lassner who lives his life through Ellen and has nothing else going for himself is having a really bad night! #Oscars _E_\nPlease tell me what is going on with the Republicans? _E_\nThe U.S. has been talking to North Korea and paying them extortion money for 25 years. Talking is not the answer! _E_\n.@FoxNews Objectified tonight at 10:00 P.M. Enjoy! _E_\nSerious doubt in Illinois as to whether or not Cruz can run for President. First of many challenges. __HTTP__ _E_\nRT @HeyTammyBruce: Coming up at 720a ET on @foxandfriends! See you there! #maga _E_\nThe reporting at the failing @nytimes gets worse and worse by the day. Fortunately it is a dying newspaper. _E_\nBy popular demand I will be tweeting during tomorrow's record 14th season premiere of @ApprenticeNBC on @nbc at 9/8c __HTTP__ _E_\n.@jimmyfallon regularly features @ApprenticeNBC contestants on his show. We love his support & he's a terrific host.Tonight: Omarosa. _E_\nWhat a shock – higher taxes are slowing retail spending __HTTP__ Wait until 2014 when Obama Care is fully implemented. 
_E_\nI'm surprised that Gabriel Aubry has settled so quickly and easily with Halle—in the long run it was a wise decision. _E_\nNow is no time to cut military spending. We must remain strong. Our enemies are looking for weakness. I'm i... (cont) __HTTP__ _E_\nCan you imagine we spend billions of dollars protecting Saudi Arabia and now the King refuses to even meet with Obama. Great leadership! _E_\nHeading to D.C. to see and hear ROLLING THUNDER. Amazing people that LOVE OUR COUNTRY. Great spirit! _E_\n\"Do whatever it takes to improve your public speaking skills. You'll absolutely need them.\" – Midas Touch _E_\nIf you're going through hell keep going. Winston Churchill _E_\nObama opposes sanctions on Iran __HTTP__ They are laughing at Kerry & Obama! _E_\nI had a very respectful conversation with the widow of Sgt. La David Johnson and spoke his name from beginning without hesitation! _E_\nWill be doing a sit down interview with @JakeTapper @CNN on Sunday morning at 9:00. Tough questions and hopefully very good answers! _E_\nBernie Sanders started off strong but with the selection of Kaine for V.P. is ending really weak. So much for a movement! TOTAL DISRESPECT _E_\nIf China had a tenth of the natural resources we do then they would already be energy independent. Instead we continue to buy oil from OPEC. _E_\nWashington (D.C.) is such a mess nothing works! I will MAKE AMERICA GREAT AGAIN! It's not going to happen with anyone else. _E_\nLooks like Anthony Weiner Is through most recent poll has him deeply in last place. GOOD NEWS _E_\nThe trip by @VP Pence was long planned. He is receiving great praise for leaving game after the players showed such disrespect for country! _E_\nI am officially running for President of the United States. #MakeAmericaGreatAgain __HTTP__ _E_\nLooking forward to live tweeting during the rest of the debates. Will be a lot of fun. _E_\nJeb Bush never uses his last name on advertising signage materials etc. Is he ashamed of the name BUSH? 
A pretty sad situation. Go Jeb! _E_\n\"Trump Rally: Stocks put 2017 in the record books\" __HTTP__ _E_\nWith the strategy that I announced today we are declaring that AMERICA is in the game and AMERICA is DETERMINED to WIN!OUR FOUR PILLARS OF NATIONAL SECURITY STRATEGY: __HTTP__ _E_\nI look very much forward to meeting Prime Minister Theresa May in Washington in the Spring. Britain a longtime U.S. ally is very special! _E_\n\"Do you want to know who you are? Don't ask. Act! Action will delineate and define you.\" Thomas Jefferson _E_\nIn the 10:30 PM ET lead in to local news @ApprenticeNBC delivered a 31 percent margin of victory... _E_\nA lot of call ins about vote flipping at the voting booths in Texas. People are not happy. BIG lines. What is going on? _E_\nNow @BarackObama is telling donors he will need to 'revisit' healthcare in his 2nd term __HTTP__ _E_\nFAKE NEWS media which makes up stories and sources is far more effective than the discredited Democrats but they are fading fast! _E_\nI love show Law and Order but the @MRbelzer casting is the worst ever. No talent unwatchable! _E_\nI can't believe Mitch McConnell isn't way up in the Kentucky polls. Massive seniority brings so much power and status to State. Brings K.$'s _E_\nIn war the elememt of surprise is sooooo important.What the hell is Obama doing. _E_\nGreat decision by Donald Graham @Newsweek to sell. I'll now have to take my newsweek covers off the wall. _E_\nGreat – we are sending even more F 16's to the Muslim Brotherhood in Egypt __HTTP__ This is a total disaster. _E_\nVOTER REGISTRATION DEADLINES TODAY. You can register now at: __HTTP__ and get out to... __HTTP__ _E_\nWhen Mitt Romney asked me for my endorsement last time around he was so awkward and goofy that we all should have known he could not win! _E_\nJust out Nevada poll shows Jeb Bush at 1% he should take his dumb mouthpiece @LindseyGrahamSC and just go home. _E_\nWho's your pick @bretmichaels or @hollyrpeete ? 
Vote now on Ivanka's new Facebook page! __HTTP__ _E_\nWhy didn't Hillary Clinton announce that she was inappropriately given the debate questions she secretly used them! Crooked Hillary. _E_\nOnce the ISIS thug who beheaded Foley is identified 100% he should be bunker busted to hell. _E_\nLook at the editorial I was just sent from the NY Post on 9/14/01 3 days after collapse of WTC. Any apologies? __HTTP__ _E_\nYou get what you vote for. US credit rating is about to be downgraded once again __HTTP__ _E_\nIn the spirit of transparency Obama should immediately release the 9.11 tape of Tyrone Woods pleading for military support in Benghazi. _E_\nNegotiation: Think about what the other side wants. Know where they're coming from. Don't underestimate them. Create a win/win situation. _E_\nWe must immediately stop all air traffic coming from the Ebola infected areas of Africa—before it is too late. _E_\nChina Russia and Iran are laughing at us. We have weak leaders who are threatening our national security. Dangerous times. _E_\n.@robbreport Best 2013 Golf Courses: Trump Int'l Golf Links Scotland. Great honor great magazine—thanks! __HTTP__ _E_\nWow did you see how badly @CNN (Clinton News Network) is doing in the ratings. With people like @donlemon who could expect any more? _E_\nWill be on #Hannity @ 10pE @FoxNews discussing various subjects including immigration if elected we will #BuildTheWall & enforce our laws! _E_\nJoin me Monday in Columbus Ohio & Harrisburg Pennsylvania! #MAGA3pm in OH: __HTTP__ in PA: __HTTP__ _E_\nThe Great Irish Links Challenge @Trump_Ireland & Lahinch Golf Club is coming this June. Don't miss it. __HTTP__ #Doonbeg _E_\nI have helped many friends and colleagues in their business ventures. They always thank me after they succeed. #MIDASTOUCH _E_\nThank you Indiana! #Trump2016 __HTTP__ _E_\nMay God be w/ the people of Sutherland Springs Texas. The FBI & law enforcement are on the scene. I am monitoring the situation from Japan. 
_E_\n...get things done at a record clip. Many big decisions to be made over the coming days and weeks. AMERICA FIRST! _E_\nI will have set the all time record in primary votes in the Republican party despite having to compete against 17 other people! _E_\nBeing successful requires nothing less than 100% of your concentrated effort. Be totally focused. _E_\nThe United States cannot continue to make such bad one sided trade deals. There are only so many jobs we can give up. No more! _E_\nCrooked Hillary is spending big Wall Street money on ads saying I don't have foreign policy experience yet look what her policies have done _E_\nAll predictions re: my 12 o'clock release are totally incorrect. Stay tuned! _E_\n....is making. Working very hard on TAX CUTS for the middle class companies and jobs! _E_\nVia @RadioIowa by @okayhenderson: \"Trump touts business career but not TV show during Iowa speech\" __HTTP__ _E_\nRT @GOPChairwoman: The Trump Inaugural Committee is donating $3 million in surplus funds to victims of the latest hurricanes. __HTTP__ _E_\nVia @NYDailyNews by @klnynews: \"Donald Trump wins lawsuit against Joint Commission on Public Ethics\" __HTTP__ _E_\nOffering true luxury @Trump_Charlotte has spectacular restaurants Olympic pools & six professional tennis courts __HTTP__ _E_\nJeb Bush gave five different answers in four days on whether or not we should have invaded Iraq.He is so confused.Not presidential material! _E_\nThe Blue Monster @TrumpDoral was a sensation over the weekend. Really tough but players & critics alike loved it. _E_\nHappy #CincoDeMayo! The best taco bowls are made in Trump Tower Grill. I love Hispanics! __HTTP__ __HTTP__ _E_\nJared Kushner did very well yesterday in proving he did not collude with the Russians. Witch Hunt. Next up 11 year old Barron Trump! 
_E_\nTowering over trendy Bay Street @TrumpTO offers 118 stunning condominiums w/ multi angle views & elite amenities __HTTP__ _E_\nWhy do shows have @ananavarro—Ntl Hispanic Chair for the losing McCain '08 & Huntsman '12. She's a loser who doesn't deliver votes. _E_\nRepublicans don't extend the debt ceiling—make the great deal now! _E_\nHow much BAD JUDGEMENT was on display by the people in DNC in writing those really dumb e mails using even religion against Bernie! _E_\n#ImWithYou __HTTP__ _E_\nThanks for all of the great support but I just don't see myself wanting to run for Governor of New York I have something else in mind! _E_\nSomething very important and indeed society changing may come out of the Ebola epidemic that will be a very good thing: NO SHAKING HANDS! _E_\nThe Fake News is going crazy with wacky Congresswoman Wilson(D) who was SECRETLY on a very personal call and gave a total lie on content! _E_\nI am in Iowa today great STATE fantastic PEOPLE! Many speeches big crowds all sold out! MAKE AMERICA GREAT AGAIN! _E_\nHonest Omarosa: she won't backstab she'll come at you from the front. _E_\nCongresswoman Jennifer Gonzalez Colon of Puerto Rico has been wonderful to deal with and a great representative of the people. Thank you! _E_\nJust arrived at Trump National Doral saying hello to all the great players. This place is amazing.Come Thursday & see for yourselves! _E_\nToday I officially declared my candidacy for President of the United States. Watch the video of my full speech __HTTP__ _E_\nMaybe some of the dead voters who helped get President Obama elected can be brought back to life after signing up for ObamaCare. _E_\nIn my opinion one of the worst utility companies in the country is Florida Power and Light. _E_\nKatie Couric the third rate reporter who has been largely forgotten should be ashamed of herself for the fraudulent editing of her doc. _E_\n.@GlennBeck got fired like a dog by #Fox. 
The Blaze is failing and he wanted to have me on his show. I said no because he is irrelevant. _E_\nMy people caught the person who committed forgery of the James Gandolfini Obama Care phoney quote attributed to me fraud. Arrest coming? _E_\nSouth Carolina voters have the future of our country in their hands. Vote now (today) and MAKE AMERICA GREAT AGAIN! _E_\nJust arrived in West Virginia for a MAKE AMERICA GREAT AGAIN rally in Huntington at 7:00pmE. Massive crowd expected tune in! #MAGA _E_\nNow Assad is demanding that Obama stop supporting the rebels before he turns over his chemical weapons. What a mess! _E_\nThank you. __HTTP__ _E_\nCongratulations to Alyssa Campanella Miss California our new MIss USA! __HTTP__ _E_\nMy @todayshow int. with @MLauer announcing the January 4th premiere & cast of the 14th season of @ApprenticeNBC __HTTP__ _E_\nThank you Piers for the wonderful article and also great writing. @piersmorgan __HTTP__ _E_\nRed line statement was a disaster for President Obama. _E_\n...come down hard tax the hell out of their imports and reduce our deficit fast. _E_\nVia @AP: Donald Trump returns to the 'Apprentice' boardroom __HTTP__ _E_\nThe Democrats only want to increase taxes and obstruct. That's all they are good at! _E_\nOne of the world's tallest buildings @TrumpChicago is not only a 5 star hotel but has 5 star dining options __HTTP__ _E_\nAgreed! __HTTP__ _E_\nMexico will pay for the wall! _E_\nlike the 116% hike in Arizona. Also deductibles are so high that it is practically useless. Don't let the Schumer clowns out of this web... _E_\nAnd finally Cruz strongly told thousands of caucusgoers (voters) that Trump was strongly in favor of ObamaCare and choice a total lie! _E_\nThe issue of kneeling has nothing to do with race. It is about respect for our Country Flag and National Anthem. NFL must respect this! 
_E_\nHillary Clinton Dominates the Pack in Fake Twitter Followers __HTTP__ _E_\nAnother Obama disaster __HTTP__ _E_\nRepublicans should not be giving Obama fast track authority on trade. The Trans Pacific Partnership will squeeze our manufacturing sector _E_\nEdddie24 Mr. Trump is a real American patriot. You have my vote if you ever ran. 👍 Thank you. _E_\nHillary Clinton reaches new low. #TrumpVlog __HTTP__ _E_\n...Overall the Academy Awards were very average at best. _E_\nWith all that Congress has to work on do they really have to make the weakening of the Independent Ethics Watchdog as unfair as it _E_\nI want to thank all my friends in Macon for the special evening and great reception. What a crowd of incredible people! _E_\nMy friend Derek is a special athlete and special person there is nobody like him. @Yankees _E_\nThe @BarackObama administration now claims to have done everything to reduce gas prices __HTTP__ What about Keystone? _E_\nThings are looking great for Karen H! _E_\nCrooked Hillary launched her political career by letting terrorists off the hook. #DrainTheSwamp... __HTTP__ _E_\nThe Mar a Lago club in Palm Beach is one of the most successful places on earth in raising money for charity a great feeling! _E_\nI will be interviewed by @jdickerson on @FaceTheNation tomorrow morning. Enjoy! #Trump2016 _E_\nGreat sportscaster Al Michaels a friend of mine played golf with me on Saturday morning at Trump National LA. He was in perfect shape! _E_\n.@Mark_Sanchez shouldn't be too upset over @EvaLongoria. He will always do great! _E_\nTRUMP TUESDAY @SquawkCNBC tomorrow at 7:30 am Tune in! _E_\nCongratulations to our new Attorney General @SenatorSessions! __HTTP__ _E_\nThe Miami Heat is getting it's ass kicked they better start playing or it will be a long Summer for them. _E_\nIs Fake News Washington Post being used as a lobbyist weapon against Congress to keep Politicians from looking into Amazon no tax monopoly? 
_E_\nHome values have sunk a record 15% under Obama. _E_\nFor America to be great again we must have a President who has been successful and Americans can learn from on how to succeed. _E_\nDonald Trump Will Be on Pennsylvania Avenue in 2016 & There's Nothing You Can Do About It __HTTP__ by @lilsarg _E_\nThe rolling average of jobless claims is the highest in 5 months __HTTP__ ObamaCare continues to slow growth and cost jobs. _E_\nThe cast of the new season of apprenticenbc. Premieres January 4th on NBC. __HTTP__ _E_\nJoin me in Washington today!Spokane tickets: __HTTP__ tickets: __HTTP__ __HTTP__ _E_\nCongratulations to @AllenWest on winning last night's primary! _E_\nWH counsel met with IRS lawyer 3x in 2012 once in September __HTTP__ But Obama just learned through news reports? _E_\nThe Democrats ObamaCare is imploding. Massive subsidy payments to their pet insurance companies has stopped. Dems should call me to fix! _E_\nDoes anyone really believe that Chuck Hagel is sorry for any of his past comments or supports Israel? _E_\nOne of the tallest office buildings in downtown NYC 40 Wall Street is a classic Art Deco building __HTTP__ _E_\nA reader just sent me the following: I wanted to share with you something rather startling. On page 103 of (cont) __HTTP__ _E_\nBreitbart gets it! Vote now @BarackObama should release his college application records and grades. He says he (cont) __HTTP__ _E_\nThe phony story in the failing @nytimes is a TOTAL FABRICATION. Written by same people as last discredited story on women. WATCH! _E_\nGlad to hear North Carolina is solid for @MittRomney. It started trending for Mitt solidly after my speech at the @NCGOP convention. _E_\nWhat lies behind us and what lies before us are tiny matters compared to what lies within us. Ralph Waldo Emerson _E_\nBig excitement last night in the Great State of Pennsylvania! Fantastic crowd and people. MAKE AMERICA GREAT AGAIN! 
_E_\nI now see John Kasich from Ohio who is desperate to run is using my line \"Make America Great Again\". Typical pol no imagination! _E_\nThank you for the nice words @ktmcfarland. The debate was interesting and fun. Keep up the great work! _E_\nStarting tomorrow it's going to be #AmericaFirst! Thank you for a great morning Sarasota Florida!Watch here:... __HTTP__ _E_\nExcellent Jobs Numbers just released and I have only just begun. Many job stifling regulations continue to fall. Movement back to USA! _E_\nFour more years of weakness with a Crooked Hillary Administration is not acceptable. Look what has happened to the world with O & Hillary! _E_\nSorry losers and haters but I LOVED the great energy in Madison Square Garden during my speech. The WWE thought it was incredible it was! _E_\nBernie's exhausted he just wants to shut down and go home to bed! _E_\nI am honored to be chosen by Gray Line for their NY Ride of Fame Campaign. Today we had the ribbon cutting ceremony in front of Trump Tower. _E_\nTrue. __HTTP__ _E_\nI was proud to be one of Ronald Reagan's earliest supporters. Like Reagan it's time to Make America Great Again! __HTTP__ _E_\nVia @JNSworldnews by @JacobKamarasJNS: Donald Trump says he is no apprentice when it comes to Israel __HTTP__ _E_\n#HasJustineLandedYet Justine what the hell are you doing are you crazy? Not nice or fair! I will support @AidForAfrica. Justine is FIRED! _E_\nCAMPAIGN STATEMENT: __HTTP__ _E_\nMy @foxandfriends int. re: Tiger's victory at Trump @DoralResort 's @CadillacChamp my WH tour offer and CPAC __HTTP__ _E_\nAmerica is proud to stand shoulder to shoulder with Poland in the fight to eradicate the evils of terrorism and extremism. #POTUSinPoland __HTTP__ _E_\nThank you to teachers across America! When I become POTUS we will make education a far more important component of our life than it is now. 
_E_\nDespite Mexico's interest in again hosting the Miss Universe Pageant it will be because of Rodolfo Rosas Moya that it will never happen. _E_\nFor too long we've been pushed around used by other countries and ill served by politicians in Washington who (cont) __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN!#AmericaFirst #ImWithYou __HTTP__ _E_\nPres. Obama is meeting with China's Pres. this week __HTTP__ He will get zero deliverables. China laughs at us. _E_\nThe Republicans must use the debt ceiling as leverage to make a great deal! _E_\nE mails show that the AmazonWashingtonPost and the FailingNewYorkTimes were reluctant to cover the Clinton/Lynch secret meeting in plane. _E_\nJust returned from Trump Doral in Miami. Massive construction job. When completed will be the best resort in U.S. Blue Monster is amazing! _E_\nWhy would the great people of Florida vote for a guy who as a Senator never even shows up to vote worst record. Marco Rubio is a joke! _E_\nThey now say using the word thug is like so many other words not politically correct (even though Obama uses it). It is racist. BULL! _E_\n\"US tycoon Donald Trump in talks with Ryanair to bring more flights back to Prestwick Airport\" __HTTP__ via @Daily_Record _E_\nJoin our next Vice President @Mike_Pence in Wisconsin tonight & Michigan Thursday!MI: __HTTP__ __HTTP__ _E_\nRepublicans have the right approach to ObamaCare – let it fail. Free market solutions will be embraced by Americans in 2016. _E_\nDon't let Obama play the Iran card in order to start a war in order to get elected be careful Republicans! _E_\n.@ICEgov HSI agents and ERO officers on behalf of an entire Nation THANK YOU for what you are doing 24/7/365 to keep fellow American's SAFE. Everyone is so grateful!#LawEnforcementAppreciationDayPresident @realDonaldTrump __HTTP__ _E_\n'Small business says Trump is their pick for president' __HTTP__ _E_\nAmerica needs a President who can negotiate better deals for the American People. 
_E_\nMy interview on 9/13/01 with a German reporter after visiting Ground Zero __HTTP__ _E_\nTrump Tuesday on @SquawkCNBC 7:30 AM is getting very good ratings as is @Foxandfriends on Mondays 7:30 AM. _E_\nVia @Newsmax_Media: Robb Report: Trump Scotland Best Golf Course in the World __HTTP__ _E_\nAgain I have nothing to do with the Atlantic City closing I have not even been there in many years. Some press was accurate some not! _E_\nVia @BloombergNews by Peter Millard: Trump Helps Rio Builders After Olympics: Corporate Brazil __HTTP__ _E_\nOur country is now in serious and unprecedented trouble...like never before. _E_\nIf history teaches us anything it's that strong nations require strong leaders with clearly defined national (cont) __HTTP__ _E_\nGreat meeting w/ NATO Sec. Gen. We agreed on the importance of getting countries to pay their fair share & focus on... __HTTP__ _E_\nMitt Romney is right about the Chinese rip off of America. _E_\nSo Obama can host the Muslim Brotherhood Pres. Morsi in the White House __HTTP__ but doesn't have time for @netanyahu? _E_\nTexas & Florida are doing great but Puerto Rico which was already suffering from broken infrastructure & massive debt is in deep trouble.. _E_\nGreat work being done by @FEMA @DHSgov w/state & local leaders to prepare for hurricane season. Preparedness is an investment in our future! __HTTP__ _E_\nWhoever wins today remember that tomorrow we still have a country struggling. Our work is not done until America is strong again. _E_\n\"Experience knowledge & prescience are a formidable combination of powers. Do not underestimate any of them.\" Think Like a Champion _E_\nObama has missed 58% of his intelligence briefings. But our president does make 100% of his fundraisers. _E_\nTop brand impact is what television is all about from the commercial standpoint—a big deal for @CelebApprentice. 
_E_\nThe charities I have designated for @billmaher's donations are: Police Athletic League New York March of Dimes Hurricane Sandy victims.... _E_\nMust see morning clip: Donald Trump addresses Lil Wayne tweet and 'Celebrity Apprentice' __HTTP__ via @Salon _E_\nMy son Don will be giving the Keynote Address at The Investment Show in Sandton South Africa on Dec. 1. He's an (cont) __HTTP__ _E_\nMuslim Brotherhood head of Egypt Morsi is already making demands on Obama before the WH visit. Obama's foreign policy is a complete failure. _E_\nHow come there are no protests in favor of the two young police officers gunned down in Mississippi by two deranged animals. DEATH PENALTY! _E_\nWikiLeaks reveals Clinton camp's work with 'VERY friendly and malleable reporters' #DrainTheSwamp #CrookedHillary __HTTP__ _E_\nChicago is a shooting disaster they should immediately go to STOP AND FRISK. They have no choice hundreds of lives would be saved! _E_\nVast numbers of manufacturing jobs in Pennsylvania have moved to Mexico and other countries. That will end when I win! _E_\nRemember when I said when Saddam Hussein fell the new leader of Iraq will be meaner and tougher and hate the U.S. even more. Welcome ISIS! _E_\nThanks. __HTTP__ _E_\nKeystone: @johnboehner MUST pass Keystone by linking it to another bill. __HTTP__ _E_\nTop suspect in Paris massacre Salah Abdeslam who also knew of the Brussels attack is no longer talking. Weak leaders ridiculous laws! _E_\nJust out report: United Kingdom crime rises 13% annually amid spread of Radical Islamic terror. Not good we must keep America safe! _E_\nAs the nuclear crisis with Iran shows America needs to import oil from a reliable region. Keystone XL Pipeline (cont) __HTTP__ _E_\nWord is that Sleepy Eyes Chuck Todd who has failed so badly with Meet the Press will be taking over for now irrelevant Brian Williams! 
_E_\nVia @feminamissindia: \"@MannyPacquiao among @MissUniverse 2015 judges\" __HTTP__ _E_\nThe Club for Growth is a very dishonest group. They represent conservative values terribly & are bad for America. __HTTP__ _E_\nBig G7 meetings today. Lots of very important matters under discussion. First on the list of course is terrorism. #G7Taormina _E_\n...Why did Democratic National Committee turn down the DHS offer to protect against hacks (long prior to election). It's all a big Dem HOAX! _E_\nAs always & due to popular demand@TrumpRink will be open Christmas eve & day as well as New Year's eve & day __HTTP__ _E_\nBe sure to watch the Larry King Show tomorrow night on CNN 9 p.m. I'll be the host Larry the guest. __HTTP__ _E_\nRT @WhiteHouse: Do not allow anyone to tell you that it cannot be done. No challenge can match the HEART and FIGHT and SPIRIT of America. ... _E_\nIN AMERICA WE DON'T WORSHIP GOVERNMENT WE WORSHIP GOD! __HTTP__ _E_\nWill be interviewed on @foxandfriends tomorrow morning Monday at 8:00. Much to talk about! _E_\nNO GAMES! HOUSE @GOP MUST DEFUND OBAMACARE! IF THEY DON'T THEN THEY OWN IT! _E_\nWhat a coincidence Michelle Obama called Kenya @BarackObama's homeland in 2008 __HTTP__ _E_\nGreat news @TPPatriots are starting their own Super PAC to fight @KarlRove __HTTP__ (via @thehill) Go get em! _E_\nPennsylvania is in play @MittRomney. All undecideds in Philly suburbs should ask themselves who do you trust most on @Israel? _E_\nWatch listen and learn. You can't know it all yourself. Anyone who thinks they do is destined for mediocrity. Donald Trump _E_\nWill be interviewed on the @TODAYshow this morning at 7:00. Talking about politics polls and whatever. Enjoy! _E_\nAfter the litigation is disposed of and the case won I have instructed my execs to open Trump U(?) so much interest in it! I will be pres. _E_\nRe negotiation: Know exactly what you want and keep it to yourself. Think about what the other side wants and where they're coming from. 
_E_\nSuccess breeds success. The best way to impress people is through results. Think Like a Billionaire. _E_\n.@MittRomney can only speak negatively about my presidential chances because I have been openly hard on his terrible choke loss to Obama! _E_\n.@Borisep was great on @JudgeJeanine tonight. Very smart commentary that will prove to be correct! _E_\nprotesters and the tears of Senator Schumer. Secretary Kelly said that all is going well with very few problems. MAKE AMERICA SAFE AGAIN! _E_\nEverything comes to him who hustles while he waits. Thomas A. Edison _E_\nSuccess is not final failure is not fatal: it is the courage to continue that counts. Winston Churchill _E_\nGlad to see that Sacha Baron Cohen's new movie is not only a dud but not too good at the box office. He is talentless. @Sacha_B_Cohen _E_\nGoing to CPAC! _E_\nObama can kill Americans at will with drones but waterboarding is not allowed—only in America! _E_\nWho do you think is going home? #CelebApprentice _E_\nStay confident even when something bad happens. It is just a bump in the road. It will pass. Think Big _E_\n\"@BarackObama may have been a good 'community organizer' but the man is a lousy international dealmaker.\" #TimeToGetTough _E_\nWe need a tax system that is FAIR to working families & that encourages companies to STAY in America GROW in America and HIRE in America __HTTP__ _E_\nA penny saved is a penny earned. Benjamin Franklin _E_\nI will be doing Fox & Friends at 7.00 will be discussing the the Donald Sterling (Clippers) MESS! _E_\nMaybe Obama should donate my $5M to the families of the 17 who have lost loved ones during the storm? _E_\nEven Barbara Bush agrees with me __HTTP__ _E_\nCheck out my interview on @GMA __HTTP__ _E_\nObamaCare has 21 tax hikes __HTTP__ There's now only one solution defeat @BarackObama this November! #GOMITT _E_\n.@FoxNews You shouldn't have @KarlRove on the air—he's a clown with zero credibility—a Bushy! 
_E_\nHappy birthday to the great @TheLeeGreenwood. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_\nRe: hiring contractors remember the cheapest isn't always the best. Their work may have to be redone & they may not be reliable. _E_\nCheck out the last webisode www.youtube.com/user/mattressserta in our 3 part series featuring me with Serta. Which one was your favorite? _E_\nWhere are the other candidates now that this tragic murder has taken place b/c of our unsafe border __HTTP__ We need a wall! _E_\nIf these guys have any integrity they'd say no to MSNBC a network that few watch and is very negative. @AndrewBreitbart re debate. _E_\nI made my decision to allow Jenna Talackova to participate in Miss Universe Canada two days before Gloria Allred (cont) __HTTP__ _E_\nI would rather run against Crooked Hillary Clinton than Bernie Sanders and that will happen because the books are cooked against Bernie! _E_\nThank you Orlando Florida! We are just six days away from delivering justice for every forgotten man woman and ch... __HTTP__ _E_\nBy the way Hillary & the MSM forgot to mention that Hillary is in the Al Shabaab terror video. __HTTP__ _E_\nA clip of my @LibertyU speech talking about the importance of the election & our country's potential __HTTP__ via@washingtonpost _E_\n.@LouDobbs just stated that President Trump's successes are unmatched in recent presidential history Thank you Lou! _E_\nThe failing @nytimes is greatly embarrassed by the totally dishonest story they did on my relationship with women. _E_\nLet me put this as plainly as I know how: Iran's nuclear program must be stopped by any and all means necessary. Period. #TimeToGetTough _E_\nI will be on @oreillyfactor tonight at 8:00. Enjoy! _E_\nGetting closer and closer on the Tax Cut Bill. Shaping up even better than projected. House and Senate working very hard and smart. End result will be not only important but SPECIAL! 
_E_\nAct as if what you do makes a difference. It does. William James _E_\nThank you. __HTTP__ _E_\nWatch @ApprenticeNBC episode 2 online again via @nbc: \"Nobody Out Thinks Donald Trump __HTTP__ _E_\nIf @RepMarkMeadows @Jim_Jordan and Raul_Labrador would get on board we would have both great healthcare and massive tax cuts & reform. _E_\nDesigned by @IvankaTrump @TrumpDoral's Deluxe Guestrooms feature impeccable furnishing and details __HTTP__ _E_\n\"Get to the essence immediately. Learn to economize. People appreciate brevity in today's world.\" – Think Like a Champion _E_\n'President elect Donald J. Trump today announced his intent to nominate Steven Mnuchin Wilbur Ross & Todd Ricketts... __HTTP__ _E_\nJust like its website ObamaCare is a disaster.Maybe all those who are fighting it are wasting their time it will fail on its own! _E_\nOctober 2015 thanks Chris Wallace @FoxNewsSunday! __HTTP__ _E_\nI said that Eliot Spitzer was going to lose when he was way up in the polls. I fought him when others retreated out of fear. NEVER GIVE UP! _E_\nVia @NRO: Palin Trump Get Longer Speaking Slots at CPAC by @KatrinoTrinko __HTTP__ _E_\nThe rallies in Utah and Arizona were great! Tremendous crowds and spirit. Just returned but will be going back soon. _E_\nI wonder if I run for PRESIDENT will the haters and losers vote for me knowing that I will MAKE AMERICA GREAT AGAIN? I say they will! _E_\nTrending story on Miss Utah is very unfair. She simply lost her train of thought—could happen to anyone! @MissUSA @MissUniverse _E_\nOpening in 2016 Trump Hotel Rio de Janeiro will be a 13 story 171 guestroom masterpiece with a beachside view __HTTP__ _E_\nObama and Clinton told the same lie to sell #ObamaCare. #Debates2016 __HTTP__ _E_\nThank you to all of our amazing military families service members and veterans. #ImWithYou __HTTP__ _E_\nYou wouldn't believe how tall and beautiful @_KatherineWebb is 6'5 in heels. She is also a total winner in... 
__HTTP__ _E_\nThe crackdown on illegal criminals is merely the keeping of my campaign promise. Gang members drug dealers & others are being removed! _E_\nMAKE AMERICA GREAT AGAIN! _E_\nRemember Cruz and Bush gave us Roberts who upheld #ObamaCare twice! I am the only one who will #MAKEAMERICAGREATAGAIN! _E_\nDeepest condolences to the families & fellow officers of the VA State Police who died today. You're all among the best this nation produces. _E_\nThank you to our great Police Chiefs & Sheriffs for your leadership & service. You have a true friend in the... __HTTP__ _E_\nRT @DineshDSouza: Finally as if by accident the @washingtonpost breaks down & admits the truth about where the violence is coming from ht... _E_\nSo many great polls like Reuters big leads everywhere. New Hampshire really special! We will win big and MAKE AMERICA GREAT AGAIN! _E_\nArticle: More illegals enter than people born in state each week. __HTTP__ _E_\nRT @DeptofDefense: #HappyThanksgiving from @USArmy and @USNationalGuard #soldiers serving with Task Force Marauder in #Afghanistan. 🦃 __HTTP__ _E_\nCan't wait for @DylanByers' follow up @politico piece discussing my large Sunday news shows ratings win because of my interview! _E_\nLooking for an excuse not to cook for Thanksgiving? Many NYC outlets will delivery a full meal including @TrumpSoHo __HTTP__ _E_\n...New Donna B book says she paid for and stole the Dem Primary. What about the deleted E mails Uranium Podesta the Server plus plus... _E_\nJust finished two major speeches in South Carolina. Big crowds great people. Going for a third now! _E_\nMy thoughts on Dick Cheney and his new book... __HTTP__ #trumpvlog _E_\nJoin me in Carmel Indiana tomorrow at 4pm! #INPrimary __HTTP__ __HTTP__ _E_\nJust leaving Nashville Tennessee. Had a great time with a fabulous crowd of people! Love Nashville back soon! __HTTP__ _E_\nUnder the leadership of Obama & Clinton Americans have experienced more attacks at home than victories abroad. 
Time to change the playbook! _E_\nDick Clark was a friend of mine he lived in one of my buildings on East 61st Street. Everybody loved him. He will be missed. _E_\nGeneral Kelly is doing a great job at the border. Numbers are way down. Many are not even trying to come in anymore. _E_\nSadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. And (cont) __HTTP__ _E_\n.@dixierhilton #asktrump __HTTP__ _E_\nI have never liked the media term 'mass deportation' but we must enforce the laws of the land! _E_\nMy interview with @parademagazine from the Olympics 100 Day Countdown in Times Square __HTTP__ _E_\n.#IranDeal will go down as one of the dumbest & most dangerous misjudgments ever entered into in history of our country—incompetent leader! _E_\n.@CNN is all negative when it comes to me. I don't watch it anymore. _E_\nWith allies like Egypt and Libya who needs enemies?! _E_\nRT @DRUDGE_REPORT: TRUMP STUMPS... __HTTP__ _E_\nMake your life as groundbreaking as possible while also minding the tides and riptides around you. Think Like a Champion _E_\nNice guy @pennjillette needs your help to make his bad guy movie Directors Cut &gt __HTTP__ @fundanything _E_\nThank you @morningmika and @JoeNBC for all of your nice words and comments on the debate! _E_\nRT @paulsperry_: Wray needs to clean house. Now we know the politicization even worse than McCabe's ties to McAuliffe/Clinton. It also infe... _E_\nI love watching the dishonest writers @NYMag suffer the magazine's failure. _E_\nWe will never have great national security in the age of computers too many brilliant nerds can break codes (the old days were better). _E_\nWATCH – WH official says that ObamaCare/RomneyCare architect Gruber was 'an important figure' in crafting the law __HTTP__ _E_\nWeekly Address __HTTP__ __HTTP__ _E_\nI hear @JoeNBC of rapidly fading @Morning_Joe is pushing hard for a third party candidate to run. This will guarantee a Crooked Hillary win. 
_E_\n\"@AP Interview: @MissUniverse Gabriela Isler reflects as her reign winds down\" __HTTP__ via @YahooNews _E_\nThe US GDP in 2010 was 4.1% down to 2% in 2011 & now 1.5%. I guess @BarackObama's plan is not working! _E_\n#CrookedHillary #ThrowbackThursday __HTTP__ _E_\n11000 inside venue tonight in Tampa! Broke record set by Elton John in 1988 w/out musical instruments! Another 5000 outside. Will be back! _E_\nFor beauty and flight I'll take the @Boeing 757 over the @Boeing 787 any day! _E_\nWow just heard really bad stuff about the failing @politico. How much longer will they be around? Some very untalented reporters. _E_\nIf Cuba is unwilling to make a better deal for the Cuban people the Cuban/American people and the U.S. as a whole I will terminate deal. _E_\nFor the disciples of global warming in 150 summers (years) there have been 20 heat waves as bad or worse than current this has happened b4! _E_\nI will be interviewed on @GMA at 7:00 A.M. and @foxandfriends at 7:50. Talking about my new book out today Crippled America. _E_\nLoved the debate last night and almost everyone said I won but the RNC did a terrible job of ticket distrbution. All donors & special ints _E_\nRT @DonaldJTrumpJr: Nevada: Here is a quick video @IvankaTrump created on How to Caucus very quick and simple! __HTTP__ ... _E_\nLooking forward to attending the GREAT Rev. @BillyGraham's birthday party tonight there's nobody like him! _E_\nMy interview with @IngrahamAngle discussing @THEHermanCain @BarackObama's mistreatment of Israel and GOP 2012. __HTTP__ _E_\nThe entire cast will be back for the live finale of @ApprenticeNBC Monday night at 8 PM _E_\n.@JebBush is slashing campaign salaries people making millions. If he can't manage his campaign how can he manage our countries finances? _E_\nDonald Trump appearing today on CNN International's 'Connect the World' as 'Connector of the Day'. Submit questions: __HTTP__ _E_\nMasa said he would never do this had we (Trump) not won the election! 
_E_\nI wonder when we will be able to see @BarackObama's college and law school applications and transcripts. Why the long wait? _E_\nI'll be co hosting @extratv tonight. Be sure to tune in! _E_\nI watched Russell Brand @rustyrockets on the @jimmyfallon show the other night—what the hell do people see in Russell—a major loser! _E_\nAlmost every major dealmaker has used the bankruptcy laws as a business tool... _E_\nThat trip would be to the Trump International Hotel Las Vegas... __HTTP__ _E_\nDonald J. Trump's History Of Empowering Women #BigLeagueTruth __HTTP__ _E_\nA fine man Dr. Paul F. Crouch has just passed away. All Christians are grateful for his wonderful life and work. @TBN _E_\nTrump National Golf Club Charlotte is the premiere club in North Carolina. __HTTP__ Will visit tomorrow. _E_\nLyin' Ted Cruz denied that he had anything to do with the G.Q. model photo post of Melania. That's why we call him Lyin' Ted! _E_\n.@billmaher has continually degraded Catholic Church on the joke he calls a show __HTTP__ Catholics should boycott HBO. _E_\nMy daughter Ivanka will be on @foxandfriends tomorrow morning. Enjoy! _E_\nThanks Piers. Greatly appreciated. @piersmorgan __HTTP__ _E_\n.@LilJon's take on @piersmorgan seems to be a classic love hate combo. Piers can be tough and everyone knows it. #CelebApprentice _E_\nWhat ever happened to the good old days of The Academy Awards. This show is an insult to the past just plain bad! _E_\n.@TrumpDoral will be featured on @GolfChannel this morning (now). _E_\nWhy isn't Hillary 50 points ahead? Maybe it's the email scandal policies that spread ISIS or calling millions of... __HTTP__ _E_\nRemember that I predicted a long time ago that President Obama will attack Iran because of his inability to negotiate properly not skilled! _E_\nI am running against the Washington insiders just like I did in the Republican Primaries. These are the people that have made U.S. a mess! 
_E_\nBoston's Mayor Walsh wasted a lot of time and money on going for the Olympics and then he gave up. I don't want him negotiating for me! _E_\n10 yrs ago today the Iraq war began. 4485 of our nation's finest have not returned home alive. Iran will soon control Iraq & its oil. _E_\nObama was guest at VP debate moderator Martha Raddatz's wedding __HTTP__ Do people think this is fair? _E_\nVia @trscoop: WHOA: Trump changing venues for Saturday rally in Arizona due to OVERWHELMING RESPONSE __HTTP__ _E_\nHillary is the most corrupt person to ever run for the presidency of the United States. #DrainTheSwamp __HTTP__ _E_\nThey just arrested pol Shelly Silver in New York. Why aren't they arresting a far bigger crook @AGSchneiderman? _E_\nObama killed over 100k jobs by not approving Keystone XL pipeline and Canada is now selling the oil to China very dumb! _E_\nBig news Budget just passed! _E_\n.@ABCPolitics #GOPDebate#MakeAmericaGreatAgain #FITN __HTTP__ _E_\nTHANK YOU NEW YORK!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThe last time I visited China I couldn't believe all the construction. You can go up with a project in a week no red tape. _E_\nGlad to hear Derek Jeter just removed his boot and is practicing on the field for @yankees. Derek is a true champion. _E_\nWhy doesn't @MittRomney just endorse @marcorubio already.Should have done it before NH or Nevada where he had a little sway. Too latenow! _E_\nRather than putting pressure on the businesspeople of the Manufacturing Council & Strategy & Policy Forum I am ending both. Thank you all! _E_\nNew Fox News PollThank you Iowa! #Trump2016 #IACaucus __HTTP__ _E_\nOur new American Energy Policy will unlock MILLIONS of jobs & TRILLIONS in wealth. We are on the cusp of a true ene... __HTTP__ _E_\nAfter spending $89 million @JebBush is at the bottom of the barrel in polls. He is ashamed to use the name Bush in ads. Low energy guy! 
_E_\nThere is no world problem which cannot be solved if people of good will & intelligence want it to be. _E_\nThank you Delaware! #Trump2016 #MakeAmericaGreatAgain #TrumpTrain __HTTP__ __HTTP__ _E_\n.@bretmichaels and George Ross are back as advisors. Good to see them! #CelebApprentice _E_\nThe illegal immigrant crime problem is far more serious and threatening than most people understand. Along our (cont) __HTTP__ _E_\nCapitalism requires capital. When government robs capital from investors through high taxes it takes away the (cont) __HTTP__ _E_\nOur great team at @FEMA is prepared for #HurricaneNate. Everyone in LA MS AL and FL please listen to your local authorities & be safe! _E_\nI left Atlantic City years ago good timing. Now I may buy back in at much lower price to save Plaza & Taj. They were run badly by funds! _E_\nBusiness is a creative endeavor. Cultivate a sense of discovery and start thinking big. _E_\nThe Audacity of Ineptitude – ObamaCare website will cost over $1B __HTTP__ When will someone finally be held accountable? _E_\nOffshore Wind in Europe: Lessons for the U.S. __HTTP__ via @HuffPostGreen The lesson should be that it's a lousy idea!!! _E_\n.@oreillyfactor @KarlRove as per the show an even more serious Cruz charge is the fraudulent voter violation certificate sent to everyone. _E_\nMy @FoxNews interview from last night with @gretawire discussing yesterday's meeting with @MittRomney __HTTP__ _E_\nIn Bangladesh hostages were immediately killed by ISIS terrorists if they were unable to cite a verse from the Koran. 20 were killed! _E_\nI will be interviewed by @SeanHannity tonight at 10pm on FOX! Enjoy! _E_\nI had to fire General Flynn because he lied to the Vice President and the FBI. He has pled guilty to those lies. It is a shame because his actions during the transition were lawful. There was nothing to hide! _E_\nSee I told you so __HTTP__ _E_\nWow! I hear that thousands of people are cutting up their @Macys credit card. That's great. 
#MakeAmericaGreatAgain! _E_\nGreat basketball game going on right now! _E_\n.@MonicaCrowley you were great with @SeanHannity on @FoxNews tonight. Thank you for your kind words. We will keep Americans safe. _E_\n.@TrumpDoral's Red Course redesign is underway. Will be completed in September. Follow all the developments __HTTP__ _E_\nFinally in the new ABC News/Washington Post Poll Hillary Clinton is down 11 points with WOMEN VOTERS and the election is close at 47 43! _E_\nPrice gouging at many gas stations $10 a gallon welcome to the new world. _E_\nHousing prices are up in Feb over last Feb 9.3 per cent remember I told everyone two years ago to buy (but they will be going much higher) _E_\n\"Life is difficult no matter what but hard work and perseverance make it a lot easier.\" – Think Like a Billionaire _E_\nA.G. Lynch made law enforcement decisions for political purposes...gave Hillary Clinton a free pass and protection. Totally illegal! _E_\nRT @EricTrump: Congratulations @SeanHannity! Looking forward to being on the show tonight at 9pmET Hannity beats Maddow POLITICO __HTTP__ _E_\nAll I can say is that if I were President Snowden would have already been returned to the U.S. (by their fastest jet) and with an apology! _E_\nGreat works are performed not by strength but by perseverance. Samuel Johnson _E_\nJust left Istanbul Turkey yesterday where #TrumpTowers was just opened magnificent! _E_\nFor the great people of Iowa find your #IACaucus location at __HTTP__ So important to vote! #MakeAmericaGreatAgain _E_\nMake sure you get on the Trump line and are not mislead by the Cruz people. They are bad! BE CAREFUL. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n\"If you want to be successful at anything in life you have to be able to handle pressure.\" – Think Big _E_\nOscar Pistorious will likely only serve 10 months for the cold blooded murder of his girlfriend. Another O.J. travesty.The judge is a moron! 
_E_\nThey asked me to dress as Santa Claus to open Miss Universe tonight—I'm thinking about it! _E_\nI don't cheat at golf but @SamuelLJackson cheats—with his game he has no choice—and stop doing commercials! _E_\nI very much look forward to tomorrow's debate in New Hampshire—so many things to say so much at stake. It will be an incredible evening! _E_\nCyberattack on White House what's next? __HTTP__ _E_\nStudy what General Pershing of the United States did to terrorists when caught. There was no more Radical Islamic Terror for 35 years! _E_\nBoycott @Macys and @Univision. MAKE AMERICA GREAT AGAIN! _E_\nThe stock market is having a horrendous day bad employment numbers. _E_\nThank you Iowa Get out & #VoteTrumpPence16! __HTTP__ __HTTP__ _E_\nIt was my great HONOR to present our nation's highest award for a public safety officer THE MEDAL OF VALOR to FIVE AMERICAN HEROES! __HTTP__ _E_\nModerator: \"Respectfully you won't answer the pay to play question.\" #Debate #BigLeagueTruth _E_\nVia @PostSports @barrysvrluga Donald Trump has major aspirations for his Trump National Golf Club in Virginia __HTTP__ _E_\nTrade with China has killed over 29% of US manufacturing jobs in the US __HTTP__ China is robbing us blind! _E_\nModerator: Hillary paid $225000 by a Brazilian bank for a speech that called for \"open borders.\" That's a quote! #Debate #BigLeagueTruth _E_\nMy just filed lawsuit against Univision. Always fight back when right. #MakeAmericaGreatAgain __HTTP__ _E_\nI just started construction of The Old Post Office on Pennsylvania Avenue in D.C. Many jobs. Will be finest hotel in U.S. Watch it happen! _E_\n.@LaurenScruggs who was badly injured by an airplane was great on The Today Show! _E_\nThank you @DonaldJTrumpJr. Proud of you! #RNCinCLE #TrumpPence2016 __HTTP__ _E_\nAll Star @ApprenticeNBC premiering March 3rd on @NBC features terrific TV stars competing in the toughest tasks yet. Will be great. _E_\nThe dirty poll done by @ABC @washingtonpost is a disgrace. 
Even they admit that many more Democrats were polled. Other polls were good. _E_\nChina's currency manipulation is one of our nation's greatest sovereign threats. The yuan has appreciated 40% against our dollar since 2005. _E_\nThanks Piers. __HTTP__ _E_\nDespite all of China's cheating they are not doing that well we can beat them our country has great potential! _E_\nSomeone incorrectly stated that the phrase DRAIN THE SWAMP was no longer being used by me. Actually we will always be trying to DTS. _E_\nNegotiations 101: The best deals you can make are the ones you walk away from...and then get them with better terms. _E_\nMy interview on @ThisWeekABC with @GStephanopoulos had a 40%+ ratings increase over same Sunday last year. 20% over last week. _E_\nMany many people are thanking me for what I said about @autism & vaccinations. Something must be done immediately. _E_\nThe military generals are fuming at Obama. He has boxed them in against ISIS with a strategy that is destined to fail. Sad! _E_\nIn beautiful Pine Hill Trump Nat'l Philadelphia's award winning course provides amazing views of Philly skyline __HTTP__ _E_\nIf it were up to goofy Elizabeth Warren we'd have no jobs in America—she doesn't have a clue. _E_\nPresident Obama has just reached an ALL TIME low approval rating! Is anybody surprised? The happiest person is former President Jimmy Carter _E_\nRumor has it that @politico is going out of business. Losing too much money. Great news! Likewise dopey Mort Zuckerman's @NYDailyNews _E_\nWe need a PRESIDENT with strength stamina heart and incredible deal making skill if our country is ever going to be able to prosper again! _E_\nWhy would Ohio listen to Bruce Springsteen reading his lines? Be careful or I will go to Ohio and @MittRomney will win it! _E_\n.@FreeJesseJames Just read your complete statement. You are an amazing guy & I really appreciate your words & support. I will see you soon! 
_E_\nThis whole Super PAC scam is very unfair to a person like me who has disavowed all PAC's & is self funding. _E_\nAgain illegal immigrant is charged with the fatal bludgeoning of a wonderful and loved 64 year old woman. Get them out and build a WALL! _E_\nRT @FiIibuster: @realDonaldTrump We have a President that is putting the security and prosperity of America first. Thank you President Tru... _E_\nCarter Banned Iranians From Coming To U.S. During Hostage Crisis __HTTP__ _E_\nRemember tonight (Monday) the second and third episodes of The Apprentice are on at 8:00 & 9:00. Great ratings last night 18 49. FUN! _E_\nI look forward to being in South Carolina tomorrow a total sellout crowd! _E_\n.@dbongino You were fantastic in defending both the Second Amendment and me last night on @CNN. Don Lemon is a lightweight dumb as a rock _E_\n.@jasondhorowitz I am very proud of my sister your story was terrific. Thank you so much. _E_\nVia @411mania: Donald Trump Comments on a Return to Wrestling __HTTP__ _E_\nFrom Donald Trump: Wishing everyone a wonderful holiday & a happy healthy prosperous New Year. Let's think like champions in 2010! _E_\nAlways fun to read the @NewYorkObserver investigative piece re @AGSchneiderman his mascara and more! __HTTP__ _E_\nWhy would anybody listen to @MittRomney? He lost an election that should have easily been won against Obama. By the wayso did John McCain! _E_\nProfessional anarchists thugs and paid protesters are proving the point of the millions of people who voted to MAKE AMERICA GREAT AGAIN! _E_\nOil is starting to rise again despite the horrible times. OPEC continues to rip us off. Not worth $30. New leadership needed. 
_E_\nVia @LinkedInPulse by @nicholas_wyman: \"What All Hiring Managers Can Learn from Donald Trump\" __HTTP__ _E_\nTreat yourself to the pinnacle of luxury public golf at @TrumpGolfLA's white sand $250M premiere course __HTTP__ _E_\nAn excerpt of my @TheBrodyFile interview at the Sarasota 'Statesman of the Year' dinner discussing the Tea Party __HTTP__ _E_\nRT @AmericaFirstPol: MAJOR IMPACT: @POTUS Trump is 50 Days in and moving swiftly to get America back on the right track. #MAGA __HTTP__ _E_\nHe makes a mistake every hour every day admits @BarackObama. __HTTP__ The problem is that we are paying for them. _E_\nI am going to repeal and replace ObamaCare. We will have MUCH less expensive and MUCH better healthcare. With Hillary costs will triple! _E_\nRT @FoxNewsSunday: Sunday our exclusive interview with President elect @realDonaldTrump Watch on @FoxNews at 2p/10p ET Check your local... _E_\n70 stories over Panama Bay @TrumpPanama's deluxe rooms feature private balconies to enjoy the ocean views __HTTP__ _E_\nWhy aren't we getting any oil from Iraq before we leave? We are leaving the country wide open for Iran. Big mistake. _E_\nI am in Scotland checking on my developments in Aberdeen and Turnberry. Just left Ireland property will be great. ALWAYS CHECKING! _E_\nGo to @Macys now to see the incredible new selection of Trump Signature Collection ties shirts and suits. _E_\nSometimes we do things to build up experience and stamina to prepare but it's to prepare us for something bigger. _E_\nWhy would @BarackObama be spending millions of dollars to hide his records if there was nothing to hide? _E_\n\"Effective leadership is putting first things first. Effective management is discipline carrying it out.\" Stephen Covey _E_\nThe perfect Hawaiian getaway @TrumpWaikiki's 462 luxury guest rooms and suites each have spectacular views __HTTP__ _E_\nMy @TheBrodyFile int. 
discussing the persecution of Christians in the Middle East & Religious Liberty & Freedom __HTTP__ _E_\nWith our $250M in renovations @TrumpDoral offers a wide array of courses restored to perfection __HTTP__ _E_\nTrump Victorious in Fort Lauderdale Litigation __HTTP__ _E_\n\"To keep your goals alive you must take action every day. No one should care about your money and success more than you do.\" Think Big _E_\n#1. Be passionate you have to love what you're doing to be successful at it. _E_\nWe've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_\nRonald Kessler's new book The Secrets of the FBI is a great book that should be read by everyone. _E_\n(2/2) David brilliantly tells it like it is the real deal! Read it! __HTTP__ _E_\nThe @nyjets are going to have a terrific season. @Mark_Sanchez & @TimTebow will do great things on the field. _E_\nI'd bet a good lawyer could make a great case out of the fact that President Obama was tapping my phones in October just prior to Election! _E_\nFocus on your goals not your problems. Don't tread water. Get out there and go for it. _E_\nAn idealist is a person who helps other people to be prosperous. Henry Ford _E_\nWhen will the Democrats give us our Attorney General and rest of Cabinet! They should be ashamed of themselves! No wonder D.C. doesn't work! _E_\nGreat parent teacher listening session this morning with @VP Pence & @usedgov Secretary @BetsyDeVos. Watch:... __HTTP__ _E_\nI will be doing @oreillyfactor tonight at 8:00pmE from Mesa Arizona will be talking about the #GOPDebate & more. __HTTP__ _E_\nMy @SquawkCNBC interview discussing why I don't own Facebook stock and running a tough campaign against @BarackObama __HTTP__ _E_\nI am soooo proud of my children Don Eric and Tiffany their speeches under enormous pressure were incredible. Ivanka intros me tonight! _E_\nBasically nothing Hillary has said about her secret server has been true. 
#CrookedHillary _E_\n.@megynkelly the @FoxNews poll said very plainly I came in second in the debate. All others Time Drudge Slate etc. said I came in 1st. _E_\nGreta in a few minutes will be interesting! _E_\nThe ratings at @FoxNews blow away the ratings of @CNN not even close. That's because CNN is the Clinton News Network and people don't like _E_\nI am very worried that if @BarackObama is re elected then Medicare will be destroyed. We must take care of our seniors. _E_\nI was on CNBC this morning talking about the market and America's financial future __HTTP__ _E_\nIf you can't focus with unyielding resolve then you will never be successful. Believe in yourself and you can accomplish your goals. _E_\nI have great confidence in King Salman and the Crown Prince of Saudi Arabia they know exactly what they are doing.... _E_\nBernie Sanders must really dislike Crooked Hillary after the way she played him. Many of his supporters because of trade will come to me. _E_\nStop congratulating Obama for killing Bin Laden. The Navy Seals killed Bin Laden. #debate _E_\nThe crowning moment – Conneticut's Erin Brady winning @MissUSA 2013 __HTTP__ _E_\nCPAC attendees & fellow patriots lines for my @CPACnews start at 7:00AM outside the Potomac Ballroom. Make sure to get there early! _E_\nHenry McMaster Lt. Governor of South Carolina who endorsed me beat failed @CNN announcer Bakari Sellers so badly. Funny! _E_\nLightweight @AGSchneiderman is pushing for the Moreland Commission to be disbanded immediately because he is being looked at! _E_\nI was standing with @SHAQ when a young high school star Kevin Garnett @Celtics said to a crowd Forget Shaq I want to meet Donald Trump. _E_\nGlad that @MittRomney is hitting @BarackObama on ending work requirements for welfare. Obama attacks the American work ethic. _E_\nThank you to General John Kelly who is doing a fantastic job and all of the Staff and others in the White House for a job well done. 
Long hours and Fake reporting makes your job more difficult but it is always great to WIN and few have won more than us! _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_\nSec of State Kerry said we would not go back to Iraq. We shouldn't but he should not have said that. So stupid! _E_\nA nurse in Dallas who treated Ebola patient Thomas Duncan was allowed to fly to Cleveland.She should never have been so allowed! The real JV _E_\nDress your best! The Trump Signature Collection exclusively available @Macys offers the tops style in menswear __HTTP__ _E_\nScary while @BarackObama has been POTUS for 1.6% of America's history he has amassed 33.3% of the total debt. _E_\nObama betrays Israel yet again our strongest ally in the Middle East. He will recognize Hamas breaking long standing US policy. _E_\nLooking forward to next week's unveiling of the Red Tiger @TrumpDoral. An 18 hole masterpiece w/two island greens __HTTP__ _E_\nI am proud of the Tea Party. These great patriots have accomplished so much in strengthening our country in only 3 short years. _E_\nPresident Donald J. Trump Proclaims October 9 2017 as #ColumbusDay __HTTP__ _E_\nThe best thing you can do is deal from strength and leverage is the biggest strength you have.\" – THE ART OF THE DEAL _E_\nThank you Newt! __HTTP__ _E_\nVince McMahon @WWE and I hold the all time ratings & pay per view record in the history of wrestling. _E_\nTrust in God and be true to yourself. Mary MacLeod Trump Know everything you can about what you're doing. Fred C. Trump _E_\nIn order to be successful especially to be very successful you must have the ability to be able to handle pressure! _E_\nCelebrity Apprentice will be rebroadcast tonight at 9 on CNBC. 
_E_\nFlashbk – \"Trump: 'I would build a border fence like you have never seen before'\" __HTTP__ via @BreitbartNews by @rwildewrites _E_\nThe golden rule for every businessman is this: 'Put yourself in your customer's place.' Orison Swett Marden _E_\nRomney was the architect of ObamaCare. Bush's Chief Justice legalized the monstrosity. Notice a trend? _E_\nWho thinks that President Obama is totally incompetent? _E_\nOur $17T national debt and $1T yearly budget deficits are a national security risk of the highest order. _E_\nJust left the set of The Apprentice the live show tonight will be fantastic and something very big and very different is going to happen _E_\nAn appeaser is one who feeds a crocodile hoping it will eat him last. Winston Churchill _E_\nHow do third rate talents with no smarts like @ron_fournier get so much time on television news. Boring guy really bad for ratings! _E_\n.@RogerJStoneJr was great on @TheKudlowReport last night. Roger and Larry are good friends! _E_\nI'll be on @foxandfriends on Monday at 7:30 a.m. Always a great time. _E_\nThanks Matthew! _E_\nObama will go down as the worst President in history on many topics but especially foreign policy. _E_\nVia @WashTimes by @EmilyMiller: Donald Trump says 'This country is going to hell in a handbasket' __HTTP__ _E_\n.@IamStevenT stopped by my office to say hello a great guy! __HTTP__ _E_\nI got to know @johnboehner very well—he is a great guy who will do the right thing for the country! _E_\n...to Mar a Lago 3 nights in a row around New Year's Eve and insisted on joining me. She was bleeding badly from a face lift. I said no! _E_\n.@brithume thinks that when Republicans drop out of the race someone will pick up ALL of that vote. The fact is I will get much of it! _E_\nVia @UnionLeader by @tuohy: \"Trump inches closer to a decision\" __HTTP__ _E_\nAnticipate change and embrace it. Recognize new developments that you can capitalize on and use to open new doors. 
_E_\nVia @NewHampJournal by @jdistaso: \"In NH 'The Donald' hammers Mitt Jeb as he again weighs a run for President\" __HTTP__ _E_\nI will be speaking about our great journey to the Republican nomination at 9:00 P.M. The movement toward a country that WINS again continues _E_\nWith oil below $50 the blighted views by windfarms of historic @CulzeanCastle will be very sad. #SaveCulzean __HTTP__ _E_\nHappy Thanksgiving I hope everyone can get together to MAKE AMERICA GREAT AGAIN! It won't be easy nothing is but it can be done. _E_\n.@BretBaier Thank you for the very fair and highly professional segment on me tonight. Many people watched and commented. _E_\nObama is not working. US Manufacturing orders fell a record 13.9% in August. Where's the recovery? __HTTP__ _E_\nAspirin gets the best press of almost anything I can think of fact or great PR? _E_\nVery sad that a person who has made so many mistakes Crooked Hillary Clinton can put out such false and vicious ads with her phony money! _E_\nCheck out today's From The Desk Of Donald Trump at __HTTP__ I'm willing to answer your questions tweet me.... _E_\nWorking hard from New Jersey while White House goes through long planned renovation. Going to New York next week for more meetings. _E_\nWill be leaving Trump Turnberry tomorrow place & Women's British Open are great. Will be back hitting hard tomorrow. @Turnberrybuzz _E_\nVia @BreitbartNews: DONALD TRUMP: EXEC AMNESTY WILL MAKE ILLEGAL IMMIGRATION 'WORSE THAN IT'S EVER BEEN __HTTP__ _E_\nDon't worry West Coast etc. we are not going to tweet who was fired or give any indication there of until after it airs. #CelebApprentice _E_\nToday we are not merely transferring power from one Administration to another or from one party to another – but we are transferring... _E_\nNew York City hosted over 52 million visitors in 2012. __HTTP__ Record amount visited Trump Tower. 
_E_\nA doctor on NBC Nightly News agreed with me we should not bring Ebola into our country through two patients but should bring docs to them. _E_\nWhen the military informed Obama that they had Bin Laden is there anyone with a brain that would not have said Ok go get him ? _E_\nEntrepreneurs: Put everything you've got into what you're doing. Know exactly what you want and go for it. Nothing should be haphazard. _E_\nBoy did Pharrell & Robin Thicke get screwed. The Marvin Gaye song sounds nothing like theirs. Get new lawyers fast! _E_\nI will be interviewed by @SeanHannity tonight at 10pm EST on @FoxNews! Enjoy! _E_\nRT @Scavino45: LIVE Joint Statement by President Trump and Prime Minister Shinzo Abe: __HTTP__ _E_\nCongrats to @cheflents of TrumpCollection's #TrumpChicago on being a James Beard semifinalist: __HTTP__ via @CrainsChicago _E_\nHappy Birthday to the great @BillyGraham. He's done so many wonderful things not the least of which is his fantastic family. I love Billy! _E_\nJoin me in Cincinnati Ohio tomorrow evening at 7:00pm. I am grateful for all of your support. THANK YOU!Tickets:... __HTTP__ _E_\nHillary Advisers Wanted Her To Avoid Supporting Israel When Talking To Democrats: __HTTP__ _E_\nOur campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_\nAs I have been saying. Only the beginning: ISIS Suspects Arrested in Turkey 150 European Passports Seized. __HTTP__ _E_\nGreat work Ivanka! __HTTP__ _E_\nA clip from guest hosting @extratv yesterday on @nbc discussing Halle Angus and Gen. Petraeus __HTTP__ _E_\n.@gerardtbaker Gerard—wonderful job last night as moderator of the debate. I told many \"really smart and elegant.\" _E_\nI am getting worried about Chris @hardball_chris Matthews. Is he drinking again? _E_\nJoin me live from Fort Myer in Arlington Virginia. __HTTP__ _E_\nI win an election easily a great movement is verified and crooked opponents try to belittle our victory with FAKE NEWS. 
A sorry state! _E_\nI will sign the first bill to repeal #Obamacare and give Americans many choices and much lower rates! _E_\nWith the very dangerous carjacking epidemic going on especially in New York and New Jersey you would be lucky to have a gun for protection _E_\nPocahontas wanted V.P. slot so badly but wasn't chosen because she has done nothing in the Senate. Also Crooked Hillary hates her! _E_\nMerry Christmas and a very very very very Happy New Year to everyone! _E_\nVia @BreitbartNews __HTTP__ _E_\nVia WSOC_TV: Donald Trump's son says family thinking about expanding in uptown Charlotte __HTTP__ Great job @EricTrump _E_\nCarly Fiorina is terrible at business the last thing our country needs! __HTTP__ _E_\n.@BarackObama's assault on coal and gas and oil will send energy and manufacturing jobs to China. @MittRomney _E_\nThank you. __HTTP__ _E_\n.@ConradMBlack what an honor to read your piece. As one of the truly great intellects & my friend I won't forget! __HTTP__ _E_\nToday it was my privilege to welcome survivors of the #USSArizona to the @WhiteHouse. #HonorThemRemarks: __HTTP__ __HTTP__ _E_\n.@FoxNews should be ashamed for allowing experts to explain how to make a nuclear attack! _E_\n...way up. Regulations way down. 600000+ new jobs added. Unemployment down to 4.3%. Business and economic enthusiasm way up record levels! _E_\nYou talk tough Mr. President but have done nothing about China killing our jobs and economy. _E_\nWatch Celebrity Apprentice on NOW! _E_\nRT @LouDobbs: Making America Great Again @Kellyannepolls: After #Irma @POTUS is focused on saving lives not swamp shenanigans. #Dobbs #MA... _E_\nI'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_\nCongratulations to Thomas Perez who has just been named Chairman of the DNC. I could not be happier for him or for the Republican Party! 
_E_\n...Corker dropped out of the race in Tennesse when I refused to endorse him and now is only negative on anything Trump. Look at his record! _E_\nPAY TO PLAY POLITICS. #CrookedHillary __HTTP__ _E_\nThe Democrats are pushing for Universal HealthCare while thousands of people are marching in the UK because their U system is going broke and not working. Dems want to greatly raise taxes for really bad and non personal medical care. No thanks! _E_\nBird killing windfarm that I oppose in Aberdeen just got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_\n...You have little persona but The Apprentice concept is great and lucky for you! _E_\nHappy to have just passed 1.3M Twitter followers. Love communicating with everyone daily. _E_\nThis shows what a complete & total liar Ted Cruz is he said he wouldn't have nominated John Roberts. Really? __HTTP__ _E_\n\"Americans are hungry to feel once again a sense of mission and greatness.\" – Pres. Ronald Reagan _E_\nJeb Bush will never secure our border or negotiate great trade deals for American workers. Jeb doesn't see & can't solve the problems. _E_\nRima Fakih our beautiful Miss USA rode with me on the Gray Line Ride of Fame yesterday... __HTTP__ _E_\nRemember if you don't promote yourself then no one else will! Likewise believe in yourself or no one else will either. _E_\nALWAYS BORROW MONEY FROM A PESSIMIST BECAUSE HE WILL NEVER EXPECT IT TO BE PAID BACK! _E_\nGreat news. We are only just beginning. Together we are going to #MAGA! __HTTP__ __HTTP__ _E_\nWill be on @CNN at 7:00 A.M. _E_\nWowthe Fake News media did everything in its power to make the Republican Healthcare victory look as bad as possible.Far better than Ocare! _E_\n\"Shutting down the government is a very serious thing. People die accidents happen. I don't know how I would vote right now on a CR OK?\"Sen. Dianne Feinstein (D Calif) __HTTP__ _E_\nThe Afghan Security Forces who we are training have killed 52 U.S. 
soldiers __HTTP__ Time to get out of there! _E_\nWith China beating us like a punching bag daily OPEC vacuuming our wallets clean and jobs nowhere in sight (cont) __HTTP__ _E_\nBoth @BarackObama and China have embraced OWS. All want the decline of America. Time for the protesters to go home. _E_\nMelania and I are honored to light up the @WhiteHouse this evening for #WorldAutismAwarenessDay. Join us & #LIUB.... __HTTP__ _E_\nObama's rollout of his ISIS war plan is another unmitigated disaster. The Generals must be furious. _E_\nI will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting! _E_\nFor a president who likes to showcase how hip and tech savvy he is Obama also appears surprisingly clueless (cont) __HTTP__ _E_\nWill be in New Hampshire and then on @CNN Special at 9 PM tonight. _E_\nUnder a Trump administration it's called #AmericaFirst! #ImWithYou __HTTP__ _E_\nAmazing crowd outside @FallonTonight. Tune in tonight at 11:30. __HTTP__ _E_\nThank you @TeamTrump Florida. Keep me updated and lets get those 100000 registered voters!#MakeAmericaGreatAgain __HTTP__ _E_\nFormer Weather Underground radical Kathy Boudin spent 22 yrs in prison for armored car robbery that killed 2 cops & a Brinks guard... _E_\nSo I raised/gave $5600000 for the veterans and the media makes me look bad! They do anything to belittle totally biased. _E_\n\"Going with your instincts requires tuning in to everything around your decision.\" – Think Big _E_\nJust put out a very important policy statement on the extraordinary influx of hatred & danger coming into our country. We must be vigilant! _E_\nThe last thing we need is another Bush in the White House. Would be the same old thing (remember read my lips no more taxes ). GREATNESS! _E_\nThank you for a great afternoon Birmingham Alabama! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nObama said in his speech that Muslims are our sports heroes. What sport is he talking about and who? 
Is Obama profiling? _E_\nRidiculous that they gave the 14 year old golfer from China a one stroke penalty for slow play at The Masters(see I can stick up for China) _E_\nThe real story on Collusion is in Donna B's new book. Crooked Hillary bought the DNC & then stole the Democratic Primary from Crazy Bernie! _E_\nStock Market hits an ALL TIME high! Unemployment lowest in 16 years! Business and manufacturing enthusiasm at highest level in decades! _E_\nI hear they are very unhappy w/ Arianna and @huffingtonpost at @AOL. I'll bet she won't be there for long! _E_\nRemember oftentimes the best deal you make is the deal you don't make! _E_\nRT @KellyannePolls: Love and prayers for friends Adrienne & Eric Bolling. May Eric Chase know eternal peace. __HTTP__ _E_\nPennsylvania: Cast your vote for Trump for POTUS & ALSO vote for the TRUMP DELEGATES in your congressional district! __HTTP__ _E_\nWhy is no one talking about the horrible murder of Ana Charle by ex con thug West Spruill. Gunned down on street naked. Why no riots here? _E_\n....instead of giving to a wonderful charitable cause. _E_\nVia @BreitbartNews by @AWRHawkins: TRUMP PREACHES PEACE THROUGH STRENGTH IN PHOENIX __HTTP__ _E_\nThe Federal government spent over $3.7 trillion last year. This is unsustainable and a true danger. The American dream is being destroyed. _E_\nIt has been a pleasure to make so many friends and meet so many great people on the trail this past cycle. We will fight on! _E_\nOnce again Obama fails to classify China as a currency manipulator. He just helped China steal even more jobs and money from us. _E_\nThe pressure on the debt ceiling is on @BarackObama.... __HTTP__ #trumpvlog _E_\nThank you for all of the nice statements on the Press Conference yesterday. Rush Limbaugh said one of greatest ever. Fake media not happy! _E_\nIf you don't have a competitive advantage don't compete. Jack Welch _E_\nPres. Bill Clinton 5.31.12: @MittRomney had a sterling business career. 
_E_\nWill be having meetings and working the phones from the Winter White House in Florida (Mar a Lago). Stock Market hit new Record High yesterday $5.5 trillion gain since E. Many companies coming back to the U.S. Military building up and getting very strong. _E_\nWow new Reuters Poll just out. Big lead if you want to MAKE AMERICA GREAT AGAIN! TRUMP 37 CRUZ 11 This is at the top of Drudge! _E_\nLiberals can hardly belileve it they can't understand how health care costs could have risen so much when (cont) __HTTP__ _E_\nCrooked H destroyed phones w/ hammer 'bleached' emails & had husband meet w/AG days before she was cleared & they talk about obstruction? _E_\n\"Learn know and show. It's a proven formula. Put it to use starting today.\" – Think Like a Champion _E_\n#Imwithyou __HTTP__ __HTTP__ _E_\nToday will be a big day @Team_Mitch for you in many ways. The country is lucky. _E_\nGreat read: \"How New York's Veterans Day Parade Became 'America's Parade'\" __HTTP__ _E_\nCongrats to @JoeTorre @TonyLaRussa & Bobby Cox on all being unanimously elected to @MLB's @BaseballHall! Great leaders & managers. _E_\nAt the National Achievers Congress in London this October I'm going to talk about success and how to avoid failure __HTTP__ _E_\nI have great respect for the people that represent China. What I don't respect is the way that we negotiate and (cont) __HTTP__ _E_\nIf Justice Roberts had done the right thing and voted against ObamaCare our country would be in a lot better shape right now! TOTAL TURMOIL _E_\n\"NBC FIRES TRUMP KEEPS SHARPTON: The bigots of the NBC executive suite look the other way\" __HTTP__ via @AmSpec by @JeffJlpa1 _E_\nWow Rowanne Brewer the most prominently depicted woman in the failing @nytimes story yesterday was on @foxandfriends saying Times lied _E_\nFeaturing private living spaces oversized bathrooms & stunning views @TrumpSoHo = downtown NYC's premiere hotel __HTTP__ _E_\nRT @DRUDGE_REPORT: MEXICO 2ND DEADLIEST COUNTRY TOPS AFGHAN IRAQ... 
__HTTP__ _E_\nToday I signed an Executive Order on Improving Accountability and Whistleblower Protection at the @DeptVetAffairs:... __HTTP__ _E_\nI'll bet Obama now uses the amendment for the debt ceiling. _E_\n.@BillClinton was very nice to me as I am to him on the Piers Morgan Show (CNN). He is loyal to his friends. @piersmorgan _E_\nI will be interviewed on @greta at 7:00 P.M. Enjoy! @FoxNews _E_\nHe @johnedwards is bad but @andrewyoung is worse not only is he a rat but it turns out he stole much of the money for himself. _E_\nPutin has shown the world what happens when America has weak leaders. Peace Through Strength! _E_\nMy thoughts and prayers are with the great people of Tennessee during these terrible wildfires. Stay safe! _E_\nThe thousands of people that showed up for me in Phoenix were amazing Americans. @SenJohnMcCain called them crazies must apologize! _E_\n1988 with Oprah discussing why I would never rule out a run for #POTUS.#Trump2016 #VoteTrumpNY #PrimaryDay __HTTP__ _E_\nI don't know how much longer I can take this bullshit so terrible! #Oscars _E_\nI feel so badly for Mark Cuban the Dallas Mavericks were just eliminated from the playoffs and his partners are pissed. Very sad! _E_\nA great book by a great guy highly recommended! __HTTP__ _E_\nNo surprise Obama's Deputy Campaign Manager tweeted link from Chinese propaganda outlet __HTTP__ Did she also write it? _E_\nFor those of you that have conveniently forgotten dummy Jon Stewart is a bad filmmaker. His last effort was a real bomb (in all ways)! _E_\nCHAIN MIGRATION must end now! Some people come in and they bring their whole family with them who can be truly evil. NOT ACCEPTABLE! __HTTP__ _E_\nLightweight A.G. Eric Schneiderman meets with President Obama (who he told me sucks as a president) and quickly files a suit against me! _E_\nRT @DanScavino: Doesn't fit the MSM narrative so they wont share what @realDonaldTrump did for Jesse Jackson in 1999 so I will! 
__HTTP__ _E_\nThere are huge opportunities for profits if you can think big & create big solutions for the human needs brought by trends. Think Big _E_\nLooking forward to my @theFAMiLYLEADER summit visit and speech. _E_\nThe April jobs report is terrible. If the labor forces didn't shrink under @BarackObama then real unemployment (cont) __HTTP__ _E_\nI'll bet Jimmy Fallon gets great ratings tonight! _E_\nGreat interview on @foxandfriends with the parents of Otto Warmbier: 1994 2017. Otto was tortured beyond belief by North Korea. _E_\nYesterday was a big day for the stock market. Jobs are coming back to America. Chrysler is coming back to the USA from Mexico and many others will follow. Tax cut money to employees is pouring into our economy with many more companies announcing. American business is hot again! _E_\nAs an addition Apple must go to a larger screen now asap! They're losing their standing in the market! _E_\n#NYCStrong #USA __HTTP__ _E_\nRemember Anthony Wiener continued sending sick pics. long after his resignation from Congress and his apology zero control over himself! _E_\nGreat job on Fox this morning @KatiePavlich. I am sending out for your book immediately. Thank you very much! _E_\n.@GovChristie is going to do a fantastic job tonight explaining why @MittRomney should be elected and @BarackObama has to go. _E_\nHonored to be named as one of business's \"Top Leaders Icons and Rebels\" by @CNBC __HTTP__ Vote Trump! _E_\nGetting ready to deliver a VERY IMPORTANT DECISION! 8:00 P.M. _E_\nDestroying the world's finest health care system so that @BarackObama can have his socialized medicine program (cont) __HTTP__ _E_\nDespite what you have heard from the FAKE NEWS I had a GREAT meeting with German Chancellor Angela Merkel. Nevertheless Germany owes..... _E_\nThe United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. 
If approved there will be a News Conference at The White House at approximately 1:00 P.M. _E_\nWhile Obama is obsessed with green collar jobs blue collar workers aren't buying it. (cont) __HTTP__ _E_\nWe should never have gone into Iraq but once in should have gotten out a lot faster. MAKE AMERICA GREAT AGAIN! _E_\nChina will never go to war with us because if they won they would only take over property they already own! _E_\nNow is the time to buy a house if you can DIRECTLY from a bank. They want to get rid of all their foreclosures. _E_\nWatch the clip from my #C21 Super Bowl spot on @AccessHollywood tonight. _E_\n...yet not one meeting with an ally (or an enemy!) Where's the media? _E_\nThank you West Virginia! All across the country Americans of every kind are coming together w/one simple goal: to MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nEntrepreneurs: Pay attention to your negotiation skills. It's all about persuasion and persuasion is power. _E_\nI will be speaking at 9:00 A.M. today to Police Chiefs and Sheriffs and will be discussing the horrible dangerous and wrong decision....... _E_\nOnly reason the hacking of the poorly defended DNC is discussed is that the loss by the Dems was so big that they are totally embarrassed! _E_\nThey're going to riot in Ferguson no matter what. _E_\nTHANK YOU Phoenix Arizona! Time for new POWERFUL leadership. Just imagine what WE can accomplish in our first 100... __HTTP__ _E_\nSomerset County New Jersey SWAT Team really fantastic people! __HTTP__ _E_\nA lot of undecided and independent voters have had enough with Obama's lack of transparency. I don't blame them. _E_\n'Clinton Charity Got Up To $56 Million From Nations That Are Anti Women Gays' #CrookedHillary __HTTP__ _E_\nAs expected the media is very much against me. Their dishonesty is amazing but just like our big wins in the primaries we will win! _E_\n.@TrumpGolfLA is ranked the top course in the West __HTTP__ If you're in the area book a round today. 
_E_\nWhile @BarackObama continues to defend ObamaCare in the courts he is also granting companies waivers. Eve... (cont) __HTTP__ _E_\nvia __HTTP__ Only one man up for the job of president __HTTP__ _E_\nCan you conquer the Blue Monster? Book a tee time @TrumpDoral right here __HTTP__ _E_\nVia @AP March2013: Jeb said \"he was open to...pathway for citizenship for illegal immigrants\" __HTTP__ Lying on campaign trail! _E_\nCongrats to @rushlimbaugh on the release of his new book \"Rush Revere and the Brave Pilgrims.\" #1 on @amazon and @bnbooks. Must read! _E_\nObama just stated he didn't take school seriously made bad choices and GOT HIGH then how the hell did he get into Columbia & Harvard? _E_\nObama's complaints about Republicans stopping his agenda are BS since he had full control for two years. He can never take responsibility. _E_\nA top rated NY course by @GolfDigestMag @TrumpNationalNY provides award winning services and exceptional facilities __HTTP__ _E_\n...What is wrong with this story? Isn't this just ridiculous? Terrible! #KathyBoudin _E_\nLeaving the White House for the Great State of North Carolina. Big progress being made on many fronts! _E_\nMexico has taken advantage of the U.S. for long enough. Massive trade deficits & little help on the very weak border must change NOW! _E_\nRemember Sunday is National Prayer Day (by Presidential Proclamation)! _E_\n#TBT For all who have been asking my mother was a great beauty and a wonderful person. Here we are with my father __HTTP__ _E_\nAt the Univision forum Obama continued to make excuses for Fast and Furious __HTTP__ His operation killed innocent Americans. _E_\nThank you New Hampshire! Together we will Make America Great Again! __HTTP__ _E_\nWouldn't it be great to Repeal the very unfair and unpopular Individual Mandate in ObamaCare and use those savings for further Tax Cuts..... _E_\nIn Iran deal we get 4 prisoners. They get $150 billion 7 most wanted and many off watch list. 
This will create great incentive for others! _E_\n.@BarackObama is begging the Eurozone to keep Greece in until after 11.6.12. He thinks the world revolves around his re election. _E_\nThank you! #Trump2016 __HTTP__ __HTTP__ _E_\nLook forward to going to Indiana tomorrow in order to be with the great workers of Carrier. They will sell many air conditioners! _E_\nRT @JoeNBC: Explosive Trump attack on HRC Bill Monica Cosby and Weiner. Trump camp just upped the ante on women's rights __HTTP__ _E_\nFLASHBACK: \"Hiding evidence of global cooling\" __HTTP__ @washtimes \"Scientific data\" is cooked! _E_\nThis is what @BarackObama thinks: that America would be better off if we acted more like European socialist (cont) __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016UNIFYING THE NATIONVideo: __HTTP__ __HTTP__ _E_\nAm I morally obligated to defend the president every time somebody says something bad or controversial about him? I don't think so! _E_\nRemember Bill Maher praised the animals who took down the World Trade Center and was fired by ABC. DROP@HBO until dopey Bill is canned! _E_\nWe have the Final Six—and @LilJon is the last remaining member of Team Power. He's done a great job. #CelebApprentice _E_\nJust received the new Fox poll.Thank you America! #Trump2016 __HTTP__ _E_\nCadillac has made amazing strides in the beauty and quality of their cars. Great management team congratulations! @Cadillac _E_\nVisit @Fund_Anything at __HTTP__ to see my picks! #FundAnything _E_\nOne of my many Twitter followers suggested Obama should take my offer & give $1250000 to each family of the four... __HTTP__ _E_\nU.S. COAL PRODUCTIONUp📈7.8% past year. Down📉31.5% last 10 years. #EndingWarOnCoal __HTTP__ _E_\nGreat meeting with military spouses in Virginia joined by @IvankaTrump @LaraLeaTrump @GenFlynn & @MayorRGiuliani. __HTTP__ _E_\nVery exciting week for @TrumpDoral. I will be in Miami opening what will soon be best resort in U.S. World Golf Championship this week! 
_E_\nAmerica's relationship with China is at a crossroads. We only have a short window of time to make the tough (cont) __HTTP__ _E_\nWe will push onward to victory w/hope in our hearts courage in our souls & everlasting pride in each & every one of you. God Bless America. __HTTP__ _E_\nIt was a great honor to be on @MikeAndMike on @espn. Wow the response was amazing! _E_\nTrump Was Right: 'Obama's America' Tops 2012 Documentaries __HTTP__ via @Newsmax_Media _E_\nLooking forward to being guest of honor at @ralphreed's @FFCoalition Patriot Gala Dinner on June 14th in DC. Flag day and my birthday. _E_\nThe U.S. should not be giving away our strategy & tactics to the enemy so they can prepare. Just go and do what you have to do! _E_\nFOX debate advertising rates falling like a rock! Tune into my special event for the Veterans at 9pm EST! _E_\nSecret Service members on break from Obama's $4M vacation are more than welcomed to relax at Hawaii's top hotel @TrumpWaikiki. _E_\nPrime Minister @Netanyahu and @PresidentRuvi on behalf of @FLOTUS Melania and myself thank you for the invitation... __HTTP__ _E_\nCentral American presidents are blaming us for the influx of illegal immigration __HTTP__ Obama will soon apologize. _E_\nThe world is most peaceful and most prosperous when America is strongest. __HTTP__ _E_\nHistoric Change! Obama has spent over $44M of our money on travel expenses the most for any president __HTTP__ _E_\nCruz going down fast in recent polls dropping like a rock. Lies never work! _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nCan you believe it—the model who mysteriously disappeared from the ObamaCare website is not a US citizen—she's from Colombia. _E_\nI use Social Media not because I like to but because it is the only way to fight a VERY dishonest and unfair \"press\" now often referred to as Fake News Media. Phony and non existent \"sources\" are being used more often than ever. Many stories & reports a pure fiction! 
_E_\nObama's new excuse for his failures is that you can't change Washington from the inside. Not what he said in '09. __HTTP__ _E_\nU.S. tuitions are completely out of control. In the last 4 years the average price has gone up by 15%. __HTTP__ Unsustainable! _E_\nI believe in free markets but allowing a merger of US Air & American Airlines is totally ridiculous! Will control most of US market. _E_\nHILLARY'S HEALTH CARE POLICIES#DrainTheSwamp #Debate __HTTP__ _E_\nTomorrow is #TrumpTuesday on @squawkCNBC 7:30 AM EST. Always interesting. _E_\nIn even the darkest moments the light of our people has shown through their goodness their courage and their love. #USA __HTTP__ _E_\nThe @GOP primary voter spoke last night in VA 7 & @DaveBratVA7th won going away. Now the party MUST stand behind him! Unity Unity Unity! _E_\nDepartment of Homeland Security has spent $3.5 billion dollars building their new headquarters and is years late and billions over budget! _E_\nTea Party takes down Eric Cantor REALLY BIG WIN! _E_\nBashar Assad is stronger today than he was before Obama threatened military action. Obama really bungled this. _E_\nPhoenix crowd last night was amazing a packed house. I love the Great State of Arizona. Not a fan of Jeff Flake weak on crime & border! _E_\nRT @realDonaldTrump: National Pearl Harbor Remembrance Day \"A day that will live in infamy!\" December 7 1941 _E_\nRT @townhallcom: ABC NBC And CBS Pretty Much Bury IT Scandal Engulfing Debbie Wasserman Schultz's Office __HTTP__ _E_\nI will be on @60Minutes tonight at 7:00 P.M. with Mike Pence talking about LAW AND ORDER and many other subjects! Bad times for divided USA! _E_\nCongratulations to my friend David Wright of the @mets who is now their all time hitting leader. _E_\nThe just out USA Today National Poll where I lead by big numbers shows that in a head to head matchup I beat both Hillary and Bernie. 
_E_\nPoor @JohnKasich doesn't have what it takes __HTTP__ _E_\nThank you @SarahPalinUSA for your amazing help and support. Big win leaving now for Atlanta and Nevada.The people of South Carolina got it! _E_\nTomorrow is #TrumpTuesday on @squawkboxCNBC 7:30 AM don't miss it! _E_\nI will be going to Mississippi tomorrow night hear the crowds are going to be massive! Look forward to it. _E_\nEntrepreneurs: Don't expect anyone to be on your side. Sometimes we're all in this alone. So believing in yourself is mandatory. _E_\nSirius National News at 7:30 A.M. Steve Bannon. @BreitbartNews _E_\nObama has destroyed the middle class. In '09 median household income was $55198. Now it is $50678. Four more years? _E_\nSHOCK! While attacking @MittRomney's private equity experience @BarackObama raises $2M from private equity bankers __HTTP__ _E_\nWhen and how are the dummies at the @WSJ going to apologize to me for their totally incorrect Editorial on me. I want smart trade deals. _E_\nI will be on @Morning_Joe live from New Hampshire tomorrow at 7am. #Trump2016 #MakeAmericaGreatAgain _E_\nWashington must come together on a deal to avoid a fiscal cliff. If taxes are raised they must come with real hard cuts. _E_\nI should have easily won the Trump University case on summary judgement but have a judge Gonzalo Curiel who is totally biased against me. _E_\n\"Much as it pays to emphasize the positive there are times when the only choice is confrontation.\" – The Art of the Deal _E_\nThanks @greggutfeld. Really nice! I'm glad I did your show. @GregGutfeldShow _E_\nElection is being rigged by the media in a coordinated effort with the Clinton campaign by putting stories that never happened into news! _E_\n.@ShawnJohnson Congratulations on your engagement he is a lucky guy. You are  a true winner and will be an amazing couple. _E_\nOnce you consent to some concession you can never cancel it and put things back the way they are. 
Howard Hughes _E_\nDid President Obama have a rough day yesterday or what? He has got to start telling the truth NO MORE LIES OR DECEPTION! _E_\nBrian I hope @NBCNightlyNews isn't paying you too much look at what's happening to nightly news. _E_\nDo the people of Ohio know that John Kasich is STRONGLY in favor of Common Core! In other words education of your children from D.C. No way _E_\nUnlike U.S. China taxes things made in the U.S. and sold in China. China demands plants we don't. Stupid! _E_\n.@HillaryClinton's 2008 Campaign And Supporters Trafficked In Rumors About Obama's Heritage #DebateNight __HTTP__ _E_\nKeystone must be approved through Congress. @BarackObama is costing America over 20000 jobs and driving the price of gas high. _E_\nHouse of Representatives needs to pass Government Funding Bill tonight. So important for our country our Military needs it! _E_\n.@ErraticSLK Shout out = work hard! _E_\nEverybody should contribute & fight in the long haul battle against autism. @autismspeaks _E_\nGOPers eye Donald Trump for governor run __HTTP__ via @nypost by @fud31 _E_\nIf people knew how hard I worked to get my mastery it wouldn't seem so wonderful at all. Michelangelo _E_\nThe lady in Chicago that I'm fighting owes me $500 000 and is sophisticated & vicious. She made up a story & plays the age card bad! _E_\nCongrats to people of Scotland on the Judge's ruling concerning bird killing land destroying environmentally disastrous windmills. _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nO'Malley as former Mayor of Baltimore has very little chance. _E_\nI'm eagerly awaiting the next polls. The debate performance could be devastating to the Obama team. Let's see what happens. 
_E_\nAdmitted:@BarackObama's Treasury Secretary admitted that their 2013 budget does nothing to address America's (cont) __HTTP__ _E_\nIn this book our second together we share what gives us the Midas Touch the ability to turn things we touch (cont) __HTTP__ _E_\nRussians are playing @CNN and @NBCNews for such fools funny to watch they don't have a clue! @FoxNews totally gets it! _E_\nMy friend @TheSlyStallone lost his wonderful son Sage this weekend. We all send Sly our love and warmest wishes. (cont) __HTTP__ _E_\nRT @CLewandowski_: Trump winning over Latino Republicans poll says | New York Post __HTTP__ _E_\nA guy named @BobBeckel on FOX their resident liberal was not born with much of a brain. _E_\nFor the Republicans to have any success these next two years they must have a long game plan... _E_\nIt's important to listen to what people say. \"Horrible\" and \"disgusting\" are the words I used in response to Sterling's comments. _E_\nThe most important truth our FOUNDERS understood was: FREEDOM is NOT a gift from Govt. FREEDOM is a GIFT from GOD. __HTTP__ __HTTP__ _E_\nRandy Moss should not be bragging about himself—I'm the only one who is allowed to do that! _E_\nIt is time for DC to protect the American worker not grant amnesty to illegals. Let's Make America Great Again! __HTTP__ _E_\nI had amazing time in Charlotte. Great people & many new friends. I look forward to coming back very soon. Congrats to Gavin & Staff. _E_\nDo you think that very dumb reporter(blogger) McKay Coppins has apologized to his wife for his very inappropriate behavior while in Florida? _E_\n'Americans overwhelmingly oppose sanctuary cities' __HTTP__ _E_\nInteresting how the U.S. sells Taiwan billions of dollars of military equipment but I should not accept a congratulatory call. _E_\nVia @Newsmax_Media by @dpatten32: \"Trump's Brand Gives Him 2016 Mojo\" __HTTP__ _E_\nDid a shoot in front of the Metropolitian Museum on 5th Ave for the 13th season of the Apprentice... 
_E_\nExxon donated $250g to Obama's inaugural __HTTP__ I guess the Democrats have no problem accepting money from 'big oil.' _E_\nThanks & I won't let you down. __HTTP__ _E_\nThere are many ways of going forward but only one way of standing still. Pres. Franklin D. Roosevelt _E_\nVia @successmagazine by @MikeSeemuth: \"Trump Power\" __HTTP__ _E_\nToday I announced a new Executive Order with re: to North Korea. We must all do our part to ensure the complete denuclearization of #NoKo. __HTTP__ _E_\n\"If you like your plan you keep it.\" = \"Gruber is just some adviser.\" Two of Obama's greatest lies told to the American public. _E_\nExperience is not what happens to you it's what you do with what happens to you. Aldous Huxley _E_\nRT @IvankaTrump: 3/4: This Administration is deeply committed to those who serve & their families who make it possible through their love a... _E_\nJoin me live in Louisiana! Tomorrow we need you to go to the polls & send John Kennedy to the U.S. Senate. __HTTP__ _E_\n.@BarbaraJWalters @theviewtv Barbara unfortunately you've missed the entire point of my announcement you just don't get it! _E_\n\"@OMAROSA is a bit toxic\" per @BrandenRoderick. Being a bit PC? #CelebApprentice _E_\nThe Fed should not do QE3. Neither the economy nor the dollar can withstand another round of artificial liquidity. _E_\n\"Donald Trump: Karl Rove Has Done Ashley Judd A Favor\" __HTTP__ via @SheKnows _E_\nWill be interviewed by @GStephanopoulos on @ABC at 10:00 A.M. _E_\n$716 Billion from Medicare by @BarackObama. When will it end? _E_\nKaren Handle's opponent in #GA06 can't even vote in the district he wants to represent.... _E_\nObama still refuses to stop the flights. Is he stubborn or just plain incompetent I say both! _E_\nRobert Pattinson should not take back Kristen Stewart. She cheated on him like a dog & will do it again just watch. He can do much better! _E_\nThank you @MikeOzanian for the nice comments on @FoxNews today. Great job! 
_E_\nRT @foxandfriends: U.S. spy satellites detect North Korea moving anti ship cruise missiles to patrol boat __HTTP__ _E_\nToday it was a tremendous honor for me to sign the #VAaccountability Act into law delivering my campaign promise... __HTTP__ _E_\nIt is the same Fake News Media that said there is no path to victory for Trump that is now pushing the phony Russia story. A total scam! _E_\nStill time to #VoteTrump! #iVoted #ElectionNight __HTTP__ _E_\nHow can @JebBush beat Hillary Clinton if he can't beat anyone else on the #GOPDebate stage with $150M? I am the only one who can! _E_\nA good head and a good heart are always a formidable combination. Nelson Mandela _E_\nIf you voted for Obama in 2008 to prove you were not a racist then vote for Romney in 2012 to prove you are not stupid. Thanks Walter D! _E_\nWhere's the global warming? 2013 was one of the least extreme years in weather on record __HTTP__ _E_\nJust got back from the Iowa State Fair. Record crowds phenomenal people. Thank you IOWA I will never let you down! _E_\nRe negotiation: Trust your instincts even after you've honed your skills. They're there for a reason. _E_\nRand Paul is a friend of mine but he is such a negative force when it comes to fixing healthcare. Graham Cassidy Bill is GREAT! Ends Ocare! _E_\nOPEC has just raised oil to over $102/Barrel. And @BarackObama still won't approve the Keystone Pipeline. Does he want high gas prices? _E_\nJoin me in Council Bluffs Iowa today at 3pm! #MakeAmericaGreatAgain Tickets: __HTTP__ _E_\n.@AlexSalmond If a country wants to rapidly destroy its economy I have an idea just put up subsidized wind (cont) __HTTP__ _E_\nVia @ACLJ: Pastor Saeed's Wife Expresses Gratitude to Donald Trump for Raising Her Husband's Plight __HTTP__ _E_\nI'll be speaking tomorrow at the San Jose Convention Center (CA) for the first ever National Achievers Congress __HTTP__ _E_\nAngieApon I think you should try wearing your hair combed back. 
It looked good when you slicked it back Mr. Trump ) #ALS May happen thx _E_\nDonald Trump's Speech Is a Game Changer. #Trump2016 __HTTP__ __HTTP__ _E_\nBuild your reputation on intelligence responsibility and results. That's building the right way. Think Like a Champion _E_\nI am going to save Medicare and Medicaid Carson wants to abolish and failing candidate Gov. John Kasich doesn't have a clue weak! _E_\nI can't believe that in New York we can't watch the PGA Championsip on CBS. How .much discount is Time Warner giving its customers? _E_\nHappy 8th Anniversary to @MELANIATRUMP. __HTTP__ _E_\nRT @IngrahamAngle: Trump Int'l Golf Club West Palm Beach is spectacular. Almost makes me wish I had time to play/learn/like golf. _E_\nI have to admit @AlexSalmond is a tough smart guy. He is formidable by any standard! _E_\nWhy is Washington ready to spend billions on care for illegals while our VA is still in shambles? Vets should be the priority. _E_\nTHANK YOU to all of the great men and women at the U.S. Customs and Border Protection facility in Yuma Arizona & around the United States! __HTTP__ _E_\nLook forward to being in DC tomorrow—big crowd expected for our protest against the truly stupid nuclear deal we are making with Iran. _E_\nWhen is the media going to talk about Hillary's policies that have gotten people killed like Libya open borders and maybe her emails? _E_\nWelcome to @BarackObama's America 8.74 million workers on 'Federal Disability __HTTP__ Where are the jobs?! _E_\nI own Turnberry in Scotland one of great resorts in world. Women's British Open there this week. I'll go for two days & back on trail. _E_\nA simplified tax code will help promote growth in the private sector. 
_E_\nMy @SquawkCNBC #TrumpTuesday interview with Ken Langone & Dick Grasso discussing the Chicago teachers' strike @ 2012 __HTTP__ _E_\nWhy would Texans vote for liar Ted Cruz when he was born in Canada lived there for 4 years and remained a Canadian citizen until recently _E_\nWhy can't @Politico get better reporters than Ben Schreckenger? Guy is a major lightweight with no credibility. So dishonest! _E_\nEntrepreneurs: Ignorance is not bliss it's fatal. It's costly. Pay attention or get crushed. Watch listen and learn. _E_\nI am self funding my campaign and only work for YOU the American people!#Trump2016 Video: __HTTP__ __HTTP__ _E_\n\"Patriotism is supporting your country all the time and your government when it deserves it.\" – Mark Twain _E_\nIf the Republican Convention had blown up with e mails resignation of boss and the beat down of a big player. (Bernie) media would go wild _E_\n#BigLeagueTruth __HTTP__ _E_\nThis week's All Star Celebrity @ApprenticeNBC features another memorable Board Room rumble between @piersmorgan & @OMAROSA. _E_\nSaudi Arabia and many of the countries that gave vast amounts of money to the Clinton Foundation cont'd: __HTTP__ _E_\n...American Cancer Society and the Dana Farber Cancer Center. _E_\nThank you @loudobbsnews I will be trying very hard to prove you right great show! _E_\nI hope A Rod has a great night for the Yankees he owes it to them especially with Derek hurt. _E_\nThe President's speech tonight will largely focus on class warfare. The Republicans don't know how to handle that—I do. _E_\nFailing host @glennbeck a mental basketcase loves SUPERPACS in other words he wants your politicians totally controlled by lobbyists! _E_\nFrom day one I said that I was going to build a great wall on the SOUTHERN BORDER and much more. Stop illegal immigration. Watch Wednesday! 
_E_\n#CrookedHillary Job Application __HTTP__ _E_\nVia @cnsnews by @CraigBBannister: \"Poll: Hispanics Blacks Call for Tighter Borders Access to Illegals' Jobs\" __HTTP__ _E_\nGood move by Bernie S. _E_\nThe statement put out yesterday by @FoxNews was a disgrace to good broadcasting and journalism. Who would ever say something so nasty & dumb _E_\nSnow and ice freezing weather in Texas Arizona and Oklahoma what the hell is going on with GLOBAL WARMING? _E_\nBoycott Mexico until they release our Marine. With all the money they get from the U.S. this should be an easy one. NO RESPECT! _E_\nThey will soon be calling me MR. BREXIT! _E_\nBIG NIGHT on Celebrity Apprentice tonight. IMPORTANT starts at 10 P.M. as scheduled but NBC just increased all future episodes to 2 hours! _E_\n....Dopey @krauthammer should be fired. @FoxNews _E_\nEveryone knows I am right that Robert Pattinson should dump Kristen Stewart. In a couple of years he will thank me. Be smart Robert. _E_\n...vast sums of money to NATO & the United States must be paid more for the powerful and very expensive defense it provides to Germany! _E_\nWhere was all the outrage from Democrats and the opposition party (the media) when our jobs were fleeing our country? _E_\nMy recent statement re: @macys We must have strong borders & stop illegal immigration now!... __HTTP__ _E_\nI'm self funding my campaign but lobbyists & special interests for Jeb & others are starting to do big ads—desperate! Don't believe them. _E_\nVia @bluegreentweet: Scottish wind farm opposed by Donald Trump delayed __HTTP__ _E_\nWe will bring America together as ONE country again – united as Americans in common purpose and common dreams. #MAGA _E_\nThank you Senator David Perdue! __HTTP__ __HTTP__ _E_\nWith Mexico being one of the highest crime Nations in the world we must have THE WALL. Mexico will pay for it through reimbursement/other. _E_\nPastor #Nadarkhani must be released by Iran immediately. 
I applaud the @WhiteHouse & @StateDept for issuing (cont) __HTTP__ _E_\n.@loudobbsnews did a fantastic interview with syndicated columnist Michelle Malkin. Congrats to both! _E_\nI am more concerned about Biden in the debate than I am about Obama. Be careful on Thursday night! _E_\nRT @USHCC: USHCC was delighted to host @IvankaTrump for a roundtable discussion w/ Hispanic women biz owners today in Washington #USHCCLegi... _E_\nOne of my first acts as President will be to deport the drug lords and then secure the border. #Debate #MAGA _E_\n.@rushlimbaugh played 3 separate audio bites (the most of anyone) of my CPAC speech. Hour 3 in Friday's show. _E_\nCrooked Hillary Clinton wants completely open borders. Millions of Democrats will run from her over this and support me. _E_\nThe Tonight Show @nbc will be amazing 11:30 P.M. ENJOY! _E_\nRush Limbaugh: Trump Has Changed the Entire Debate on Immigration __HTTP__ _E_\nSgt. Bowe Bergdahl should face the death penalty for desertion five brave soldiers died trying to bring him back. U.S. has to get tough! _E_\nChina's business interests reach far and wide even domestically within our borders. We need to reassess our relationship. _E_\nCongratulations to @Mets @RADickey43 on becoming the first knuckleball pitcher to ever win the CY Young award! _E_\n#TBT Do you believe once upon a time Jon Stewart really liked me? From 2004. __HTTP__ _E_\nVia @AP: Miss USA Olivia Culpo is crowned Miss Universe Ratings increase 15% over last year. __HTTP__ _E_\nMy offer to Obama is about transparency. In 2008 American people were sold on hope and change. This our last chance to get the full record. _E_\nA great night in Macon Georgia! Thank you for all of the support. Together we will #MakeAmericaGreatAgain! __HTTP__ _E_\nYou have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_\nYou have to feel bad for the Democrat Senators. They don't want Hagel either. Just following Obama's orders. 
_E_\nSo far the Super Bowl is very boring not nearly as exciting as politics MAKE AMERICA GREAT AGAIN! _E_\nI am in New Hampshire having a great time! Loved the #GOPDebate last night! Everybody enjoy the Super Bowl. #SuperBowlSunday #SB50 _E_\nWell Iran has done it again. Taken two of our people and asking for a fortune for their release. This doesn't happen if I'm president! _E_\nVery low ratings radio host Hugh Hewitt asked me about Suleiman Abu Bake al Baghdad Hassan Nasrallah and more typical gotcha questions _E_\nHypocrite. Watch Senator Obama defend democratic debate' of Senate filibuster rules in 2005 __HTTP__ _E_\nNorth Carolina is a fantastic state with wonderful people. I enjoy my time there when I visit Trump National Charlotte. _E_\nI'm having a real hard time watching the Academy Awards (so far). The last song was terrible! Kim should sue her plastic surgeon! #Oscars _E_\n...But while Dallas dropped to it knees as a team they ALL stood up for our National Anthem. Big progress being made we love our country! _E_\nRated \"#1 Resort in Europe\" by @CNTraveler @Trump_Ireland offers breathtaking golf & the 5 Star Lodge at Doonbeg __HTTP__ _E_\nA 60% increase in Texas Blue Cross/Blue Shield through ObamaCare. I told you so there is panic and anger as healthcare costs explode! _E_\n13 states have voter registration deadlines TODAY: FL OH PA MI GA TX NM IN LA TN AR KY SC.Register: __HTTP__ _E_\nCan't fool Americans. 57% of uninsured hate ObamaCare __HTTP__ Reality is less will be insured b/c of this monstrosity. _E_\nI'm impressed both teams have produced very entertaining silent films. #CelebApprentice _E_\nBad. @gallupnews survey shows 30% of businesses not hiring they are worried they won't be around in a year. 
__HTTP__ _E_\nMy @FoxNews interview with @gretawire discussing why I endorsed @MittRomney and why he will make a great President __HTTP__ _E_\nIntelligence stated very strongly there was absolutely no evidence that hacking affected the election results. Voting machines not touched! _E_\nIf US Air and American Airlines are allowed to merge we are back to the days of \"monopoly.\" _E_\nThe townhall question segment of my @WMUR9 Commitment 2016 Conversation @JoshMcElveen __HTTP__ Great questions/people #FITN _E_\nJames Clapper who famously got caught lying to Congress is now an authority on Donald Trump. Will he show you his beautiful letter to me? _E_\nJust left Trump Golf Links at Ferry Point. Ribbon cutting w/@MayorBloomberg & @jacknicklaus was spectacular. Lots of people & jobs! _E_\nEver see @bluemangroup in performance? They're fantastic. And so are Penn & Teller. Don't miss them. #CelebApprentice _E_\nThe MAKE AMERICA GREAT AGAIN agenda is doing very well despite the distraction of the Witch Hunt. Many new jobs high business enthusiasm.. _E_\nI hope @billmaher comes through with his $5 million offer which I fully accepted or I will be forced to sue him. All goes to charity! _E_\nThis Sunday's All Star @ApprenticeNBC features some of the biggest fireworks of the entire season. Get ready. _E_\nVia @washingtonpost 9/18/01. I want an apology! Many people have tweeted that I am right! __HTTP__ __HTTP__ _E_\nHappy 70th Birthday @USAirForce! __HTTP__ _E_\nSuccess is good. Success with significance is even better. Work on what you will be proud to be associated with make your work count. _E_\nGood news @RickSantorum did the right thing. I congratulate him on running a very good race. Now it's onto @BarackObama go get him Mitt! _E_\nI unfairly get audited by the I.R.S. almost every single year. I have rich friends who never get audited. I wonder why? _E_\nGetting ready to leave for the Great State of Indiana and meet the hard working and wonderful people of Carrier A.C. 
_E_\nVince McMahon shows the crowd one of the greatest moments in WWE History. #WWEHOF __HTTP__ _E_\nThe time has come. THEGaryBusey will be project mgr on this Sunday's All Star Celebrity @ApprenticeNBC. MUST SEE TV!!! Back to 2 hrs. _E_\nSee ungrateful Little @MacMiller's statement to me a year ago— __HTTP__ he was kissing my ass! _E_\n....on ruining Scotland's beauty with ugly & costly wind turbines? _E_\nThe U 6 Unemployment Rate is over 14.9%. ObamaCare is stopping businesses from both hiring and expanding. _E_\nRatings challenged @CNN reports so seriously that I call President Obama (and Clinton) the founder of ISIS & MVP. THEY DON'T GET SARCASM? _E_\nWill be in Terre Haute Indiana in a short while big rally! See you soon! _E_\nColin Montgomerie @montgomeriefdn You are not only a great golfer you are doing a great job of commentary @GolfChannel _E_\nMAKE AMERICA SAFE AGAIN!#NoSanctuaryForCriminalsAct #KatesLaw #SaveAmericanLives __HTTP__ _E_\nMy Doral Country Club purchase was made just before Miami real estate market went through the roof—good timing! _E_\n.@megynkelly must have had a terrible vacation she is really off her game. Was afraid to confront Dr. Cornel West. No clue on immigration! _E_\nMy @FoxNews int with @TeamCavuto on the state of world affairs economy the Bushes etc. __HTTP__ _E_\nInterview with @oreillyfactor on Fox Network 4:00 P.M. (prior to Super Bowl). Enjoy! _E_\nOn the sands of Playa Brava waves will reflect on walls & circular architecture of Trump Tower Punta del Este __HTTP__ _E_\nThank you @SenOrrinHatch. Let's continue MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nIt's not that I'm so smart it's just that I stay with problems longer. Some good words from Albert Einstein. It pays to be tenacious. _E_\nCrooked Hillary who embarrassed herself and the country with her e mail lies has been a DISASTER on foreign policy. Look what's happening! _E_\nBe sure to watch The Celebrity Apprentice on Sunday at 9 pm on NBC. 
It's an episode you'll want to see and one you won't forget! _E_\nIt's time for Mountain State to have a Senator who will stop Obama's war on coal. This November send DC a message vote for @CapitoforWV! _E_\nOhio had the biggest budget increase in the U.S. If it were not for striking oil they would be bust! Governor Kasich in favor of TPP fraud! _E_\nThe Clinton Campaign at Obama Justice #DrainTheSwamp __HTTP__ _E_\nRatings for NFL football are way down except before game starts when people tune in to see whether or not our country will be disrespected! _E_\nOur Founding fathers got it. They understood that nothing good in life religious freedom economic freedom (cont) __HTTP__ _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_\nI am signing copies of my book CRIPPLED AMERICA. Makes a great holiday gift. Order yours now! __HTTP__ ... ... _E_\nMy @foxandfriends interview discussing @newsday's endorsement of @MittRomney tomorrow's election and Sandy's victims __HTTP__ _E_\nA clip from my interview with @jimmyfallon discussing the cast of @ApprenticeNBC Season 5 __HTTP__ _E_\nI play golf to relax. My company is in great shape. @BarackObama plays golf to escape work while America goes down the drain. _E_\nEntrepreneurs always remember that every business relationship can lead to greater deals in the future. Be sure to cultivate relationships _E_\nI would triple the sanctions on Iran if the American pastor is not released. my @SRQRepublicans speech _E_\n$ave your $. Don't invest in @KarlRove. He doesn't have a clue. __HTTP__ _E_\nCongratulations to @FLGovScott on getting an A grade from @CatoInstitute on his fiscal policy. Rick is a fantastic governor. _E_\nThank you @LuisRiveraMarin! __HTTP__ _E_\nThe Yankees are absolutely terrible what happened to this team? _E_\nDid you ever think our country would become an economic basket case? So much for Hope & Change. 
_E_\nThe endorsement of me by the 16500 Border Patrol Agents was the first time that they ever endorsed a presidential candidate. Nice! _E_\n.@TrumpDoral's record $200M renovations are on schedule. The hotel remains open for guests events and conferences. __HTTP__ _E_\nRT @greta: interesting poll results so far (and go vote on __HTTP__ __HTTP__ _E_\nI'm on Bill @oreillyfactor tonight at 8 PM. It will be another lively interview about how to #MakeAmericaGreatAgain! _E_\nTremendous backlash against the NFL and its players for disrespect of our Country.#StandForOurAnthem _E_\nObamaCare is an attack on our country's identity. The latest victim is the Catholic church. It must be full repealed. @BarackObama _E_\nThe signature restaurant of @TrumpNewYork @jeangeorges is both a Forbes Five Star & AAA Five Diamond restaurant __HTTP__ _E_\nObama is totally \"tweaking\" the Republicans because he doesn't respect them—they've got to change their ways. _E_\nI hope when Rand Paul gets out of the race—he is at 1% his supporters come over to me. I will do a much better job for them. _E_\nI was #1 on Twitter and so positive. Thank you! __HTTP__ _E_\n\"TEA TALK: Highlights from Monday convention speech from Donald Trump\" __HTTP__ via @myrbeachonline by @TSN_MPrabhu _E_\nCheck out ShouldTrumpRun.... __HTTP__ _E_\nWork begins on the Old Post Office in Washington D.C. in 3 months. It will soon become one of the great hotels of the world. _E_\nAt least ObamaCare/RomneyCare architect Gruber admitted albeit privately that we were lied to by Obama. Gang of Liars. _E_\nJoin me in Colorado Springs at 2pm or in Denver tonight at 7pm!Colorado Springs: __HTTP__ __HTTP__ _E_\nOur nation has a duty to care for our vets & their families. It's time to do it! Let's Make America Great Again! __HTTP__ _E_\nI am really happy that Hillary made her speech right under Trump World Tower! _E_\nMy @WMUR9 'Close Up' int. 
with @JoshMcElveen discussing the midterms the new Congress travelling to NH & 2016 __HTTP__ _E_\nMy new book Time To Get Tough comes out on December 5th. Pre order on Amazon.com. It's the best book I've ever written. _E_\nRT @Scavino45: The Iran deal was one of the worst & most one sided transactions the United States has EVER entered into. @POTUS @realDona... _E_\nTime Warner Cable went out on 5th Avenue for 2 plus days. They are a disaster. I think I'm going to switch. _E_\nRT @DanScavino: Congratulations to the 2017 @PinstripeBowl (Yankee Stadium) Champions Iowa @HawkeyeFootball! __HTTP__ _E_\nEntrepreneurs: Learn to be succinct. Can you tell someone your idea in three minutes or less? Be clear and concise. _E_\nTo show you how politicians act Bobby Jindal spent $1000 to register in New Hampshire & dropped out the next day. Such a waste! _E_\nGreat quote from the late Steve Jobs: Innovation distinguishes between a leader and a follower. _E_\nCongratulations to @BarackObama for being reckless. In his first 38 months in office the debt has grown at a rate that is unthinkable. _E_\nWhether you think you can or think you can't you are right. Henry Ford _E_\nIt is crucial for Republicans to remain united during this shutdown _E_\nSenator Luther Strange has done a great job representing the people of the Great State of Alabama. He has my complete and total endorsement! _E_\nGreat piece on Extra tonight re. Celebrity Apprentice! _E_\n\"Your money should be at work at all times. Even in the worst economy there is no excuse. Think Like a Billionaire _E_\nThe average family has spent $4155 this year filling up the car on $3.50/gallon average. Both record highs. (cont) __HTTP__ _E_\n\"When mistakes are made and they will be the entrepreneur's true character emerges and further growth takes place.\" – The Midas Touch. _E_\nI can't get over after all of the buildup what a terrible game that was the worst Super Bowl in history. The advertisers must be furious! 
_E_\nEmployees of @NYMag should have their resumes updated. It is very boring & will die in the near future. How much are they losing now? _E_\nJust left a great event in Pella. Going to church tomorrow in Muscatine Iowa. _E_\nEvery sports fan is treated to an All Star game. The loyal and growing fan base of @CelebApprentice will be getting a much bigger treat! _E_\nJoin me live at the 2018 World Economic Forum in Davos Switzerland! #WEF18 __HTTP__ __HTTP__ _E_\n.@club4growth should release the letter they sent me asking for $1000000. When I said no they came out against me. A scam operation? _E_\nThe most luxurious hotel in downtown Manhattan @TrumpSoHo is a top destination __HTTP__ _E_\nMany Republicans support TPP. They are stupid. We have stupid Republicans too. We need to keep jobs here! my @SRQRepublicans speech _E_\nRemember that in 2006 then Senator Obama voted NOT TO INCREASE THE DEBT CEILING. Now he acts in disbelief as others plan to do the same! _E_\nThank you Florida! #Trump2016 __HTTP__ _E_\nI want to thank the people of Iowa for an unbelievable day. The crowds were amazing. Will be back Tuesday! _E_\nAll of the phony T.V. commercials against me are bought and payed for by SPECIAL INTEREST GROUPS the bandits that tell your pols what to do _E_\nThank you @Forbes for showing the @WSJ was wrong. So dishonest! __HTTP__ _E_\nThank you California! See you soon!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nNo taxes in Boehner or Reid Plan important victory for America. _E_\nFrom the great author of Rich Dad Poor Dad Robert Kiyosaki here is a very nice article. __HTTP__ _E_\nSometimes understanding other people's problems is the key to finding opportunities. Midas Touch w/@atheRealKiyosaki _E_\nThe Republican Party must spend its money wisely and do incredible television commercials. They must be tough and smart. _E_\nThe people of Colorado had their vote taken away from them by the phony politicians. Biggest story in politics. 
This will not be allowed! _E_\nHave you seen the new #Trump fall collection exclusively available @Macys? Top selling brand nationwide.Ties shirts fragrance great gifts. _E_\nWASTE HUD is spending $70M to teach grant recipients how to spend the money from their grants __HTTP__ Does it get any dumber? _E_\nI am the best builder but if that were my building with the crane mishap I would have been lambasted from coast to coast. _E_\nI love Twitter.... it's like owning your own newspaper without the losses. _E_\nThe WH yesterday defended Biden's comments that the Taliban aren't our enemy. When did the American people decide this? __HTTP__ _E_\nRT @IvankaTrump: Thank you to the amazing men and women working tirelessly to bring relief to those in need. #PuertoRico #HurricaneMaria ht... _E_\nThank you Carl Higbie (former Navy Seal) for you support of my plan to straighten out the Veterans Administration a mess!Great job @kilmeade _E_\nLuther Strange of the Great State of Alabama has my endorsement. He is strong on Border & Wall the military tax cuts & law enforcement. _E_\nWow @CNN is really working hard to make me look as bad as possible. Very unprofessional. Hurting in ratings bad television! _E_\nOur military is building and is rapidly becoming stronger than ever before. Frankly we have no choice! _E_\nMy interview with @IngrahamAngle discussing @MittRomney's Super Tuesday and why @BarackObama must be defeated. __HTTP__ _E_\nWhen will people realize that @billmaher is not an intellectual but actually a rather dumb guy—just look at his past. _E_\nOur new @MissUniverse Olivia Culpo is not only beautiful but intelligent and accomplished. She is a wonderful role model. _E_\nIt was truly an honor to introduce my wife Melania. Her speech and demeanor were absolutely incredible. Very proud! #GOPConvention _E_\nIt was an honor to welcome the Teachers of the Year to the WH last month. Today we honor and thank all teachers!... 
__HTTP__ _E_\nNo member of Congress should be eligible for re election if our country's budget is not balanced deficits not allowed! _E_\nSarasota was an unbelievable success. We expected 5000 a record but 12000 showed up! Great love in the air! __HTTP__ _E_\nWe better get tough with RADICAL ISLAMIC TERRORISTS and get tough now or the life and safety of our wonderful country will be in jeopardy! _E_\nI hope Arnold S. does well with the Apprentice because he is a nice guy and also because I get a big percentage of the profits! _E_\nI am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_\nOn the way to the #GOPDebate with my wonderful wife @MelaniaTrump. __HTTP__ __HTTP__ _E_\nEntrepreneurs: View any conflict as an opportunity. Being positive could lead you into a fortunate situation. _E_\nCongratulations to Billy Payne and @AugustaNational on doing the right thing. _E_\nReporter should resign __HTTP__ _E_\nThe road to success is always under construction. Arnold Palmer _E_\nDo we still want a President who bows to the Saudis and lets OPEC rip us off? Make America strong vote for @MittRomney. _E_\nWhy doesn't @CNN use the #CNN Iowa poll? @andersoncooper @andydean2014 _E_\n... & all Obama is concerned about stopping them doing is buying wind farms __HTTP__ _E_\nI don't know why but I feel so sorry for dummy reporter John Heilemann when I watch him on television. _E_\nI look forward to the debate on Thursday night & it is certainly my intention to be very nice & highly respectful of the other candidates. _E_\nIt was my honor. THANK YOU! __HTTP__ _E_\nMy interview with @Jay_Severin on behalf of @MittRomney discussing why the GOP must nominate @MittRomney __HTTP__ _E_\nCongratulations to our new National Security Advisor General H.R. McMaster. Video: __HTTP__ __HTTP__ _E_\nMy induction last night at Madison Square Garden into the WWE Hall of Fame was amazing I met some great people including Bruno. 
_E_\nWhy hasn't Obama created jobs? _E_\nUS froze $8B in Iranian assets during '79 Hostage Crisis. Now Obama is giving it back to Iran while Christian Pastor is jailed. Don't do it! _E_\nWind farms are ugly not cost effective and don't produce worthwhile returns or energy. No wonder governments are giving up on them. _E_\nKim Jong Un of North Korea made a very wise and well reasoned decision. The alternative would have been both catastrophic and unacceptable! _E_\nA total refutation of the disgraceful David Brooks column in the failing @NYTimes by the @WashingtonPost: __HTTP__ _E_\nAnother four years not good for the country but we'll have to live with it! _E_\nTrump is already delivering the jobs he promised America __HTTP__ _E_\nObama hasn't released a budget in over 2 years & for the 1st time House & Senate delivered budgets before him __HTTP__ _E_\nTrump Turnberry news conference tomorrow at noon Scotland time. The place is amazing! _E_\n\"The vast majority felt she should be prosecuted... even senior FBI officials thought Crooked was guilty. __HTTP__ _E_\nIt is truly amateur hour at the White House and this is why we should not be doing the war thing right now! _E_\nMay God Forever Bless the United States of America. #NeverForget911 __HTTP__ _E_\nWe need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_\nKeep Wednesday morning free. You will want to see this! _E_\nMy condolences to Dwyane Wade and his family on the loss of Nykea Aldridge. They are in my thoughts and prayers. _E_\nAmerica's debt officially became 100% of our GDP on @BarackObama's 50th birthday coincidence? _E_\nObama Care is already having a devastating impact on our economy. _E_\nWill do thanks. __HTTP__ _E_\nLet me say it as clearly as possible that the attack on my Catholic brothers and sisters is an attack on me. @GovMike Huckabee _E_\nScary thought is the sexual pervert Anthony Weiner now in Charlotte? 
Did he bring his phone with him? _E_\n....and has been horrible on Virginia economy. Vote @EdWGillespie today! _E_\nWe still have not learned the full truth on Benghazi. Four Americans were killed. Congress must act! _E_\nEli Manning staged a great comeback in 4th quarter an elite quarterback. _E_\n.@CarlyFiorina had to inject herself into my factual statements concerning Ben Carson in order to breathe life into her failing campaign! _E_\n\"Donald Trump's Miss USA Pageant Scores $5 Million Legal Victory Following Rigged Claims\" __HTTP__ via @eonline _E_\nVia @WashTimes by @SethMcLaughlin1: \"Donald Trump: I want to run for president 'so badly'\" __HTTP__ _E_\nRT @joegooding: What's happening in our country isn't just an assault on our @POTUS @realDonaldTrump it's an assault on the American people... _E_\n.@AP continues to do extremely dishonest reporting. Always looking for a hit to bring them back into relevancy—ain't working! _E_\nThanks Lou. __HTTP__ _E_\nI called Chuck Schumer yesterday to see if the Dems want to do a great HealthCare Bill. ObamaCare is badly broken big premiums. Who knows! _E_\nRT @FoxNews: U.S. Markets since election. __HTTP__ _E_\nGreat. Just reported on @FoxNews that many people who supported @JebBush are now supporting me. I knew that would happen pundits didn't! _E_\n\"The most important political office is that of the private citizen.\" Justice Louis D. Brandeis _E_\nGlad to hear patriotic Americans are organizing a movement this August to boycott Chinese products __HTTP__ People get it! _E_\nReport: \"ANTI TRUMP FBI AGENT LED CLINTON EMAIL PROBE\" Now it all starts to make sense! _E_\n.@CNBC continues to report fictious poll numbers. Number one based on every statistic is Trump (by a wide margin). They just can't say it! _E_\nThank you to the 2500+ in North Augusta South Carolina. Lines down the block! Don't forget to VOTE on Saturday! 
__HTTP__ _E_\n.@MarketMavensInc #asktrump __HTTP__ _E_\nEnvironmental regulations stop Border Patrol from protecting 40% of the border __HTTP__ A coup for the migrant Democrats. _E_\nReceived a beautiful letter from Joe Paterno's son Jay. He really loved and respected his father. _E_\nWow did you just hear Bill Clinton's statement on how bad ObamaCare is. Hillary not happy. As I have been saying REPEAL AND REPLACE! _E_\nWithout focus it's just impossible to be successful at anything. Midas Touch _E_\nAfghanistan leaders want the U.S. to keep 20 000 troops there for many more years fully paid for by the U.S. but first they want apology. _E_\nESPN is paying a really big price for its politics (and bad programming). People are dumping it in RECORD numbers. Apologize for untruth! _E_\nLearning never exhausts the mind. Leonardo da Vinci _E_\nGross negligence by the Democratic National Committee allowed hacking to take place.The Republican National Committee had strong defense! _E_\nThe Art of the Deal = #1 business book. Over 3 million copies sold. Forbes Article from Oct. 20 2014. __HTTP__ _E_\nAfter Super Tuesday every GOP candidate should take a long hard look at their prospects and drop out if they can't get the nomination. _E_\n..Ryan died on a winning mission ( according to General Mattis) not a failure. Time for the U.S. to get smart and start winning again! _E_\nEvery sport evolves. Every sport gets bigger and more athletic and you have to keep up. Tiger Woods _E_\nRT @Scavino45: .@POTUS @realDonaldTrump and @UN Secretary General @AntonioGuterres pose for📸prior to their expanded bilateral meeting. #USA... _E_\nVanity Fair is failing. Newstand sales are down 20 percent 2nd most for major magazines and the magazine has (cont) __HTTP__ _E_\nWhich National Costume do you think should win? __HTTP__ _E_\nOne of the reasons Hillary hid her emails was so the public wouldn't see how she got rich selling out America. 
__HTTP__ _E_\nReally sad news: The great Arnold Palmer the King has died. There was no one like him a true champion! He will be truly missed. _E_\nYom Kippur blessings to all of my friends in Israel and around the world. #YomKippur _E_\nJust met with David Perdue @Perduesenate. He's a fantastic guy who will fight hard against ObamaCare. He will win! _E_\nI will write a $2 MILLION check to our campaign if we hit our million dollar end of month goal! __HTTP__ _E_\n\"The cheapest natural gas in the world is in the United States.\" @boonepickens _E_\nGreat job on @CNN tonight @heytana. We are all proud of you! Also congrats on a great son he is going places. _E_\nWow! Such nice words from Robert Redford on my running for President. Thank you Robert. __HTTP__ _E_\n.@DennisRodman must be thinking of North Korea. #CelebApprentice _E_\nThis is a terrific day for downtown New York. Trump SoHo is unlike anything else. Be sure to visit this fantastic hotel soon! _E_\n.@davidaxelrod I hope your book is better than the Obama second book but it is inaccurate as it pertains to me but no big deal boring! _E_\nI think Senator Blumenthal should take a nice long vacation in Vietnam where he lied about his service so he can at least say he was there _E_\nThank you Wilmington North Carolina. We are 3 days away from the CHANGE you've been waiting for your entire life!... __HTTP__ _E_\n\"It's a good idea to take your own pulse once in a while instead of focusing on what the masses are doing.\" – Think Like a Champion _E_\nI want to see @BarackObama's college records to see how he listed his place of birth in the application. _E_\nGreat news here comes the Tea Party! @MittRomney has received 42k donations online & raised over $4.2 million since the ObamaCare decision. _E_\nJust landed in Bedminster New Jersey. #MAGA __HTTP__ _E_\nQE3 is going to further sink the dollar into oblivion. Creates artificial numbers for short term market gains. 
(cont) __HTTP__ _E_\nYesterday our national debt topped a record $18T. Over 44% has accrued under Obama. A real mess. _E_\nReigning @ApprenticeNBC Champion @TraceAdkins does great work with @wwpinc. Donate to an Injured Warrior today __HTTP__ _E_\nThe Iran deal poses a direct national security threat. It must be stopped in Congress. Stand up Republicans! _E_\nGreat! __HTTP__ _E_\nUnbelievable support in Florida last night thank you! #MAGA __HTTP__ _E_\nObama is making the Ebola problem much worse than it needs to be in the U.S. by not halting flights from West Africa. Airport testing a joke _E_\nRanked a top course @GolfMagazine & 6 Star Diamond Award Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_\nDo not view any failure as the end. Learn your lessons quickly then move on. Do not dwell on failure. Start thinking big again. _E_\nEach and every new event space at @TrumpDoral looks stunning. See the transformation for yourself: __HTTP__ _E_\nYesterday was Matt Drudge's birthday Happy Birthday @DRUDGE and great job! _E_\nRT @EricTrump: Mathematically it is statistically impossible for Kasich to get to 1237 he would need 112% of the remaining delegates to b... _E_\nMy @SquawkCNBC interview discussing the Republic of Georgia taxes the fledgling economy and Facebook __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nBeing true to yourself equals being true to your brand.That's the solid foundation that will keep your brand flourishing. Midas Touch _E_\nThe Super Committee is finding ways to raise all our taxes without admitting it. The Republicans made a big mistake agreeing to this deal. _E_\nTrump arrives for SC Tea Party Convention in Myrtle Beach __HTTP__ via @WCBD _E_\nCrooked Hillary Clinton just can't close the deal with Bernie. I had to knock out 16 very good and smart candidates. 
Hillary doesn't have it _E_\nThis is a storm of enormous destructive power and I ask everyone in the storm's path to heed ALL instructions from government officials. __HTTP__ _E_\nI am sure the @NCGOP will do a great job bracketing the @DNC convention. They are a tremendous statewide organization. _E_\nNotice how @BarackObama failed to mention ObamaCare last night in his SOTU. Even he knows it is terrible. _E_\nA look at the Trump hotel planned for the Old Post Office pavilion __HTTP__ via @washingtonpost _E_\nConservative? Jeb Bush doubled Florida State debt! __HTTP__ _E_\nIgnorance is inexcusable it's the surest way to fail. No acceptable reason exists for not being well informed. _E_\nMy daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person always pushing me to do the right thing! Terrible! _E_\nWith panoramic views of Central Park & the Manhattan skyline 5 Star @TrumpNewYork offers 176 newly renovated rooms __HTTP__ _E_\n.@WhoopiGoldberg Don't let @Rosie speak badly of you or try to bring you down. She is rude crude & not smart. She is not in your league. _E_\nUnbelievable crowd of supporters in Virginia Beach Virginia. Thank you! Next stop Cleveland Ohio.... __HTTP__ _E_\nThank you California Connecticut Maryland and Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n#3. Look at the solution not the problem. Learn to focus on what will give results. _E_\nI'm leaving now for Burlington Vermont. It will be wild! _E_\nI will be interviewed on the @oreillyfactor tonight from Florida now. Enjoy! _E_\nDruggies drug dealers rapists and killers are coming across the southern border. When will the U.S. get smart and stop this travesty? _E_\nOur President is a great embarrassment to the U.S. How could anybody be so dumb or know so little as to make the very stupid 5 for 1 swap? _E_\nThanks Eric. __HTTP__ _E_\nChina does not negotiate from a position of strength we simply negotiate against ourselves. 
We have all the advantages but don't execute. _E_\nNew poll states that a record number of Americans have lost all faith in President Obama duh! _E_\nAnyway I'm all about jobs & the economy & making America great again. We're falling fast! _E_\nSorry to hear of the passing of Neil Armstrong over the weekend. He was an American hero. _E_\nThanks to Donald Trump __HTTP__ via @AmSpec. My pleasure Jeffrey! _E_\nBig cancer risk from new environmental light bulbs a big price to pay! _E_\nMy official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_\nI am on @FoxNews with @greta doing a town hall from Wisconsin now! Enjoy!#MakeAmericaGreatAgain #Trump2016 _E_\nMillions Could Get Surprise Tax Bills Under 'Obamacare' If They Don't Accurately Project Their Income __HTTP__ _E_\nStock Market could hit all time high (again) 22000 today. Was 18000 only 6 months ago on Election Day. Mainstream media seldom mentions! _E_\nBrowse Donald Trump's Summer Reading List for Business Success at the Trump University Blog: __HTTP__ _E_\nWill be interviewed tonight at 7 by @greta re Sony & Bush _E_\n\"If you want to be the best you'd better be the best – in all aspects of business.\" Think Like a Billionaire _E_\nA signed copy of CRIPPLED AMERICA is the ultimate gift. Order now & join my live streaming book signing on 12/3 __HTTP__ _E_\nThe Blue Monster is being torn up at Trump @DoralResort. On April 1 I go out & play it one more time until the new course opens. _E_\nMy @SquawkCNBC interview discussing housing prices the GDP numbers China spreading its wealth and my stock picks. __HTTP__ _E_\nRoger Goodell must stop apologizing to everyone who will listen and toughen up. His street smart players are laughing at him and the NFL! _E_\nIt was an honor to meet with Republic of Rwanda President Paul Kagame this morning in Davos Switzerland. Many great discussions! 
#WEF18 __HTTP__ _E_\nI will be developing the two tallest towers in the Republic of Georgia. __HTTP__ _E_\n.@washingtonpost @BretBaier Please thank Charles Lane for his new found confidence. He has made a very good bet! _E_\nThe Boston terrorist thugs' mother is also a radical. I am sure she will be granted citizenship shortly. _E_\nI'll be on @foxandfriends on Monday at 7:30 AM. Tune in! _E_\nWow great post debate poll: Trump Increases Lead via Breitbart __HTTP__ _E_\nWhile in the Philippines I was forced to watch @CNN which I have not done in months and again realized how bad and FAKE it is. Loser! _E_\nThank you!Mitchell FOX2 Michigan Poll finds Trump holds 3 1 lead over closest GOP opponents. Trump 47% Clinton 43% __HTTP__ _E_\n.@penn_state leadership has permanently scarred & perhaps destroyed a great university. They should have (cont) __HTTP__ _E_\nI am pleased to announce that I had the Union Leader removed from the upcoming debate. __HTTP__ _E_\nI am leaving for Sioux City Iowa great event (rally). _E_\n\"Donald Trump pledges to make Prestwick Airport 'really successful'\" __HTTP__ via @STVNews _E_\nThank you North Carolina! #Trump2016 #SuperTuesday  #MakeAmericaGreatAgain __HTTP__ _E_\nIf the U.S. attacks Syria and hits the wrong targets killing civilians there will be worldwide hell to pay. Stay away and fix broken U.S. _E_\nBroadcom's move to America=$20 BILLION of annual rev into U.S.A. $3+ BILLION/yr. in research/engineering & $6 BILLION/yr. in manufacturing. __HTTP__ _E_\nThe cast has been largely selected for next year's Celebrity Apprentice. Wait 'till you hear the names AMAZING! Season 14 many nights at #1 _E_\nWon $5000000 against Miss Pennsylvania Sheena Monnin for her terrible and untrue statements about Miss USA Pageant. Not a nice person! _E_\nWow @SharylAttkisson just wrote the definitive piece on what I said about John McCain __HTTP__ _E_\n\"Donald Trump To Be In Mason City June 4th\" __HTTP__ via @KCHA _E_\n.@megynkelly I am in Nevada. 
Sorry to inform you Kellyanne is in the audience. Better luck next time. _E_\nAmerican league wins! _E_\n... The NY Daily Snooze totally lied and never even called my kids! _E_\nExpecting a great crowd of amazing people. Questions will be live! #TrumpToday _E_\n\"Never give up on yourself.\" – Think Big _E_\nOh no just reported that Ted Cruz didn't report another loan this one from Citi. Wow no wonder banks do so well in the U.S. Senate. _E_\nCongratulations to Boys and Girls Nation. It was my great honor to welcome you to the WH today! Full Remarks: __HTTP__ __HTTP__ _E_\nSo much dishonest reporting (or non reporting) in political media—an amazing experience for me. @BretBaier _E_\nHaving a great time hosting Prime Minister Shinzo Abe in the United States! __HTTP__ __HTTP__ _E_\nThe @CNN panels are so one sided almost all against Trump. @FoxNews is so much better and the ratings are much higher. Don't watch CNN! _E_\n#VoteTrumpHI! #Trump2016 __HTTP__ _E_\nRT @DanScavino: #TrumpTrain🚂💨 __HTTP__ _E_\nI will not let the families of The Remembrance Project down! #MakeAmericaSafeAgain __HTTP__ __HTTP__ _E_\nAn interesting cartoon that is circulating. __HTTP__ _E_\nAll the online polls have me winning the debate. I really enjoyed the evening. Not easy but good. __HTTP__ _E_\nRe Negotiation: Persistence can go a long way. Being stubborn can be good. The key is to know when to loosen up. _E_\nDopey @ariannahuff should force her reporters to be accurate—if she has that power. _E_\nTEXAS: We are with you today we are with you tomorrow and we will be with you EVERY SINGLE DAY AFTER to restore recover and REBUILD! __HTTP__ _E_\nJust landed in Iowa to attend a great event in honor of wonderful Senator @JoniErnst. Look forward to being with all of my friends. _E_\n\"Donald Trump Takes on Apple @CPACnews\" __HTTP__ via @kmbznews _E_\nGo get the new book on Andrew Jackson by Brian Kilmeade...Really good. 
@foxandfriends _E_\nBe sure to watch highlights from the record setting 14th season of @ApprenticeNBC here __HTTP__ _E_\nThank you to the greatest heroes __HTTP__ #DDay70 #WWII _E_\nHypocrites! @JamesOKeefeIII's new video shows Journal News reporters refusing to designate their homes as 'gun free' __HTTP__ _E_\nMy @todayshow interview where I reveal the new cast of Celebrity Apprentice and discuss the GOP primary field __HTTP__ _E_\nEntrepreneurs: Achievers move forward at all times. Achievement is not a plateau it's a beginning. Get out there & go for it! _E_\nWith our national debt passing $16T during the @DNC convention @BarackObama has amassed more debt than the first 42 presidents. Scary. _E_\nHillary Clinton should not be given national security briefings in that she is a lose cannon with extraordinarily bad judgement & insticts. _E_\n.@realbobmassi who does a show called Bob Massi Is The Property Man on @FoxNews really knows his stuff a total pro! _E_\nSo funny Jeb Bush called me a highly gifted politician and a great entertainer I assume that is a compliment! _E_\nI hope everybody reads the @AmSpec article \"Shakedown Schneiderman\" – the AG of New York @AGSchneiderman __HTTP__ _E_\nRT @KellyannePolls: #Polls showing @realDonaldTrump surging @hillaryclinton #slipping have HER camp on defense/lowering expectations goi... _E_\nRT @DanScavino: OHIO GENERAL ELECTIONDonald Trump vs. Hillary Clinton#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nGeneral John Kelly totally agrees w/ my stance on NFL players and the fact that they should not be disrespecting our FLAG or GREAT COUNTRY! _E_\nInstinct has a lot to do with timing. You have to be patient & wait for your instincts to tell you the best time to make your move. _E_\nWacky @glennbeck who always seems to be crying (worse than Boehner) speaks badly of me only because I refuse to do his show a real nut job! _E_\n... and opened a full month ahead of schedule. Case is taught in Wharton. 
_E_\nWow I have always liked the @nypost but they have really lied when they covered me in Iowa. Packed house standing O best speech! Sad. _E_\nYou may want to watch David Letterman tonight I am on! _E_\nIsn't it ironic that a lot of the wealthy environmentalists use private jets and fight wind farms being placed near their property? _E_\nTrue! __HTTP__ _E_\n'Trump Helps Lift Small Business Confidence to 12 Yr. High' __HTTP__ __HTTP__ _E_\nWe must have Security at our VERY DANGEROUS SOUTHERN BORDER and we must have a great WALL to help protect us and to help stop the massive inflow of drugs pouring into our country! _E_\nOur @TrumpNewYork is really starting the summer on the right foot with their #wellness program as seen in @TandCmag: __HTTP__ _E_\nThe storied success of Bain in private entrepreneurship and equity is one reason @MittRomney will be a great POTUS. _E_\nDoes anybody like Lyin' Ted? __HTTP__ _E_\nI hereby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_\nThank you America! @FoxNews post debate poll with +/ from previous poll. #VoteTrump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nThank you for your wonderful endorsement today @TGowdySC. It means a great deal to me. We will not disappoint! #Trump2016 _E_\nThere are now 119000 fewer Americans employed than there were in July. The economy is still terrible. _E_\nKasich just announced that he wants the people of Indiana to vote for him. Typical politician can't make a deal work. _E_\nWith all that is happening with Ebola including the doctor who so easily came back to New York Obama still refuses to stop the flights! _E_\nBefore you are a leader success is all about growing yourself. When you become a leader success is all about growing others. Jack Welch _E_\nVia @DMRegister by @JenniferJJacobs: Trump adds events to his Iowa trip next month __HTTP__ _E_\nWow my poll numbers have just been announced and have gone through the roof! 
_E_\nDoes anyone else have two golf pros—John Nieporte & Jim Herman—who qualified for the U.S. Open? Could this be an all time record? _E_\n...or mentally troubled (or a con). _E_\nOffering two championship courses @TrumpGolfDC has been awarded the honor of hosting the 2017 @seniorpgachamp __HTTP__ _E_\nI will be live tweeting during tonight's #CelebrityApprentice 9 PM ET @NBC _E_\n26000 sexual assaults or rapes reported in military last year and that is just the number that is reported (many do not want to report). _E_\nWhich National Costume do you think should win? __HTTP__ _E_\nWatching John Kasich being interviewed acting so innocent and like such a nice guy. Remember him in second debate until I put him down. _E_\nWow 25 degrees below zero record cold and snow spell. Global warming anyone? _E_\nTotal misnomer to call ObamaCare 'The Affordable Care Act.' Affordable for whom besides big businesses & Congress w/their exemptions? _E_\nThe meeting with the @nytimes is back on at 12:30 today. Look forward to it! _E_\nThe more you know the more you realize how much you don't know. How can you possibly discover anything if you already know everything? _E_\nGreat meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. __HTTP__ _E_\n.@unicef Caryl M. Stern CEO is driving around in a Rolls Royce... _E_\nI am glad America is starting to get to know @MittRomney the way I know him. A wonderful & decent family man (cont) __HTTP__ _E_\nThey made up a phony collusion with the Russians story found zero proof so now they go for obstruction of justice on the phony story. Nice _E_\nThe New Hampshire drug epidemic must stop. If elected POTUS I will create borders & the drugs will stop pouring in. __HTTP__ _E_\nAfter my tour of Asia all Countries dealing with us on TRADE know that the rules have changed. The United States has to be treated fairly and in a reciprocal fashion. The massive TRADE deficits must go down quickly! 
_E_\n#VoteTrump2016 & together we will #MakeAmericaGreatAgain! THANK YOU for your support! __HTTP__ _E_\nAlways bear in mind that your own resolution to succeed is more important than any other. Abraham Lincoln _E_\nVia @AP by @splaisance: @realmissnvusa NIA SANCHEZ CROWNED AS 63RD @MissUSA __HTTP__ _E_\n.@History's wonderful The Men Who Built America with me on tonight at 9 bad timing I'll be live tweeting the debate _E_\nIn your planning know how much risk you can take. Evaluate whether the returns will be worth the risk. _E_\nThe Fake News Networks are working overtime in Puerto Rico doing their best to take the spirit away from our soldiers and first R's. Shame! _E_\nAfter @BarackObama's speech tonight which should be well delivered reality will hit Friday morning when the new jobs report is released. _E_\n.@BarackObama blocked Keystone. Now China is preparing a massive $1.5B oil deal with Canada. __HTTP__ A terrible deal for US! _E_\nI endorsed Luther Strange in the Alabama Primary. He shot way up in the polls but it wasn't enough. Can't let Schumer/Pelosi win this race. Liberal Jones would be BAD! _E_\nThank you @Samsung! We would love to have you! __HTTP__ _E_\n.@Theresa_May don't focus on me focus on the destructive Radical Islamic Terrorism that is taking place within the United Kingdom. We are doing just fine! _E_\nIvanka on @foxandfriends now! _E_\nVia @PRNewswire \"Streetsense Brings The National a Geoffrey Zakarian Restaurant to DC's New Trump Intl Hotel\" __HTTP__ _E_\n.@MELANIATRUMP just finished being on @theviewtv by any standard she was great! _E_\nThanks to all for your thoughtful birthday wishes – Donald Trump _E_\nWe should remember that during this entire Petraeus episodeover 50 of our nation's bravest have died in Afghanistan... _E_\nI will be going to church in Iowa this morning with my wife Melania. After church I will be making two speeches and touring the State! 
_E_\nAlex Rodriguez has played under 140 games in each of the last five seasons. He will miss half of next season. Really bad deal for @yankees. _E_\nFor last minute shopping my new book #TimeToGetTough is a great choice... __HTTP__ _E_\n\"You have to learn the rules of the game. And then you have to play better than anyone else.\" – Albert Einstein _E_\n#TrumpVlog Will our brave soldiers catch Ebola? __HTTP__ _E_\nLance Armstrong is now going to admit guilt—can that be possible after many years of denying? Just go away Lance. _E_\nEntrepreneurs: Don't sell yourself short. Don't ever think you've done it all already or that you've done your best. _E_\nObama's carbon tax plan will finance more windmills in America. More real estate depreciated wildlife killed incl. bald eagles _E_\nRT @USArmy333: @804StreetMedia @realDonaldTrump He's done more in 9 months then obama did in8 yrs _E_\nTHANK YOU NEVADA! WE WILL MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_\n#TrumpAdvice __HTTP__ _E_\nRT @AmericaFirstPol: .@POTUS Trump led a historic journey to the White House. 50 days in that historic journey continues. Take a look 👉 ht... _E_\nThank you to all for the wonderful reviews of my foreign policy speech. I will soon be speaking in great detail on numerous other topics! _E_\n.@RICKYMONEY I don't know a lot about failures. And as you know I never went bankrupt. _E_\nNow A Rod is claiming that @MLB and @yankees are out to get him' __HTTP__ He should just get the hell out of NYC already! _E_\n#1 for success: Find out what you love to do. Trust yourself enough to find out what is best for you and what you're best at doing... _E_\nI will be live tweeting President Obama's prime time speech tonight starting at 7:50 P.M. (Eastern).Will he finally state the real problems? _E_\nHeadline reads Rubio passes Bush in Florida poll Unfair because Trump destroys them both! Trump 31.5% Rubio 19.2% Bush 11.3% _E_\nThank you Tennessee! 
#Trump2016#SuperTuesday _E_\nGood news @MittRomney has pulled ahead in Wisconsin __HTTP__ WIth @PaulRyanVP on the ticket Wisconsin is in play. _E_\nThank you West Chester Pennsylvania!#PAPrimary #VoteTrump __HTTP__ __HTTP__ _E_\nWe must keep evil out of our country! _E_\nTHANK YOU LAS VEGAS NEVADA!#NevadaCaucus #VoteTrumpNV __HTTP__ __HTTP__ _E_\nJOIN ME IN OHIO TOMORROW!Springfield 1pm: __HTTP__ 4pm: __HTTP__ 7pm... __HTTP__ _E_\nWhy isn't President Obama working instead of campaigning for Hillary Clinton? _E_\nWhy did Vince and the WWE give my speech and segment the most time last night on USA Network because that's what people want to see! _E_\nTrump: 'Terrible traitor' Snowden embarrassing US __HTTP__ via @thehill by @JTSTheHill _E_\nAttention Arnold Palmer: Happy Birthday Arnold. There is no one like you The King! @KingdomMag __HTTP__ _E_\nLosers such as George Will and @Rosie use me to get publicity for themselves. They are strictly third rate. _E_\nTHANK YOU! _E_\nJoin me in Florida tomorrow!MIAMI 12pm __HTTP__ __HTTP__ __HTTP__ _E_\nThe Debt is our nation's greatest threat. @BarackObama is out of touch. _E_\nCOURT FINDS IN FAVOR OF TRUMP UNIVERSITY __HTTP__ _E_\nWell the Special Elections are over and those that want to MAKE AMERICA GREAT AGAIN are 5 and O! All the Fake News all the money spent = 0 _E_\nWinners I am convinced imagine their dreams first. They want it with all their heart and expect it to come true. Joe Montana _E_\nDummy Graydon Carter doesn't like me too much...great news. He is a real loser! @VanityFair _E_\nRemember negotiations are fluid. Remain calm and don't settle easily. If you have the goods you will ultimately win. _E_\nI heard because his show is unwatchable that @Lawrence has made many false statements last night about me. Maybe I should sue him? _E_\n\"You can't wear a blindfold in business. A regular part of your day should be devoted to expanding your horizons.\" – Trump: How to Get Rich _E_\nAlso The Donald J. 
Trump Signature mattress from SERTA is doing record business call Serta and see why! _E_\nRemember the Republicans are 5 0 in Congressional races this year. In Senate I said Roy M would lose in Alabama and supported Big Luther Strange and Roy lost. Virginia candidate was not a \"Trumper\" and he lost. Good Republican candidates will win BIG! _E_\nVia @politicalwire: \"Trump Not Happy with Republicans\" __HTTP__ _E_\nI have chosen one of the truly great business leaders of the world Rex Tillerson Chairman and CEO of ExxonMobil to be Secretary of State. _E_\nDishonest media says Mexico won't be paying for the wall if they pay a little later so the wall can be built more quickly. Media is fake! _E_\nThe White House has just admitted Al Qaeda was involved in Benghazi __HTTP__ What about the video tape? _E_\nTrump Finalizes Agreement For Trump International Hotel The Old Post Office Building Washington D.C. __HTTP__ _E_\nLittle Marco Rubio gave amnesty to criminal aliens guilty of sex offenses. DISGRACE! __HTTP__ _E_\nAct NOW for your chance to have a private lunch with Eric Trump & tour of campaign HQ at Trump Tower in NYC. __HTTP__ _E_\nThe brand new hotel at Trump National Doral has the most beautiful rooms and suites in Miami. Enjoy! _E_\n.@ZachJohnsonPGA You're one of the truly great competitiors. I've said it for years. Great going winning @OpenChampionship Not surprised! _E_\nAll polls have me winning debate big Drudge TIME etc. Dopey Charles Krauthammer still nasty. He has zero cred totally dishonest! _E_\nRussia has more warheads than ever N Korea is testing nukes and Iran got a sweetheart deal to keep theirs. Thanks @HillaryClinton. _E_\nPresident Obama be cool be smart be sharp and FOCUS (no more March Madness) and you can beat Putin at his own game. IT CAN BE DONE! _E_\nSorry but this is years ago before Paul Manafort was part of the Trump campaign. But why aren't Crooked Hillary & the Dems the focus????? 
_E_\nVia @JTAnews and Jason Greenblatt Donald Trump is a Visionary With Talents Our Country Needs @JasonDovEsq __HTTP__ _E_\nNever said anything derogatory about Haitians other than Haiti is obviously a very poor and troubled country. Never said \"take them out.\" Made up by Dems. I have a wonderful relationship with Haitians. Probably should record future meetings unfortunately no trust! _E_\nGlad to hear that @RobinRoberts is doing well. She is a terrific person. _E_\nVia @itp_ab by @ctrenwith: 'Trump effect' will see Dubai properties rise 50%\" __HTTP__ _E_\nObama has called Libya attack a bump in the road and not optimal. Just come clean already tell Americans the truth! _E_\nDemocrats purposely misstated Medicaid under new Senate bill actually goes up. __HTTP__ _E_\nI said simply that the Mexican leaders and negotiators are smarter than ours and that the Mexican gov't is pushing their hard core to U.S. _E_\n.@RGIII & @DangeRussWilson & Luck are very special players will be great playoff games. _E_\nI am somewhat surprised that Bernie Sanders was not true to himself and his supporters. They are not happy that he is selling out! _E_\nThank you @krauthammer for your nice comments on @oreillyfactor. A lot of progress is being made! _E_\nAs the @BarackObama's took their 16th vacation this month unemployment is back to 9% and underemployment at (cont) __HTTP__ _E_\nCruz caught cold in lie after denial of push polls like lies w/ @RealBenCarson. How can he preach Christian values? __HTTP__ _E_\nGot to know Senator @JohnKerry in Aspen Colorado years ago—a very solid and stand up guy. _E_\nI can't believe David Letterman has announced his retirement he is a great guy! @Letterman _E_\nHere's the deal: when your secretary of defense tells you that your proposed cuts will erode America's military (cont) __HTTP__ _E_\nGreat mtg w/ @Cabinet today. Tomorrow I will be announcing the new head of the Fed. I think you will be extremely impressed by this person! 
__HTTP__ _E_\nRated @GolfMagazine as 1 of the top courses in the country Trump Int'l Palm Beach has been expanded to 27 holes __HTTP__ _E_\nAlways pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. _E_\nIf you think big you will encounter big setbacks from time to time. What really matters is how you respond to them. Think Big _E_\nSuch a nice article in the New York Times about a wonderful developer Arthur Zeckendorf __HTTP__ _E_\nHighest Stock Market EVER best economic numbers in years unemployment lowest in 17 years wages raising border secure S.C.: No WH chaos! _E_\nWishing everyone a wonderful Independence Day holiday weekend a great celebration for a great country. _E_\nMiss Pennsylvania is just looking for free publicity at the expense of the real winner of Miss USA Olivia Culpo. _E_\nVia CBSWashDC: \"114 Year Old DC Building a Step Closer to Becoming Trump's Latest Hotel\" __HTTP__ _E_\nDeparting The Pentagon after meetings with @VP Pence Secretary James Mattis and our great teams. #MAGA __HTTP__ _E_\nGeneral Petreus and his family are paying a big price! _E_\nGreat numbers on Stocks and the Economy. If we get Tax Cuts and Reform we'll really see some great results! _E_\nThere's no love lost between @latoyajackson & @OMAROSA Disrespectful? Who is being disrespectful? #CelebApprentice _E_\nI had a great time answering as many questions as possible in sixty seconds at @facebook NY today __HTTP__ _E_\nPress conference at The Old Post Office in D.C. __HTTP__ _E_\nThank you Massachusetts! #Trump2016 #SuperTuesday _E_\nGet our Marine out of Mexico. __HTTP__ _E_\nPennsylvania poll just released. Two rallies there on Mon join me!Ambridge: __HTTP__ Barre:... __HTTP__ _E_\n\"How to travel like a billionaire! Inside Donald Trump's £63m private jet\" __HTTP__ via @travelmail by @AndreaMagrath _E_\nI don't know Dennis Kozlowski who made Tyco into a great company & then went to prison but he's up for parole—let him go! 
_E_\nSenator Marco amnesty Rubio who has worst voting record in Senate just hit me on national security but I said don't go into Iraq. VISION _E_\nEntrepreneurs: Keep your momentum going! It's a big factor in sustaining your success. Keep moving forward! _E_\nSleepy eyes @chucktodd—one of the dumbest voices in politics is angry that I'm doing @ThisWeekABC. _E_\nMika Brzezinski: Dem Criticism of Comey Reinforcing Idea 'There's Something There' __HTTP__ __HTTP__ _E_\nA year ago today a diplomat and 3 security operatives were abandoned by our government while they were under attack. Never forget! _E_\nLet's all take a moment to remember all of the heroes from a very tragic day that we cannot let happen again! _E_\nPresident Obama NOW bring our 4000 innocent and ill trained soldiers home from West Africa before it is too late AND STOP THE FLIGHTS! _E_\nLightweight Marco Rubio was working hard last night. The problem is he is a choker and once a choker always a choker! Mr. Meltdown. _E_\nEntrepreneurs: Keep the big picture in mind. There are always opportunities and thinking too small can negate a lot of them. _E_\nThe @washingtonpost loses money (a deduction) and gives owner @JeffBezos power to screw public on low taxation of @Amazon! Big tax shelter _E_\nWe finally agree on something Rosie. __HTTP__ _E_\nA Call for Unity by Jason Greenblatt @JasonDovEsq __HTTP__ _E_\n#CrookedHillary __HTTP__ _E_\nFor all of my millions of followers and at your request I will be tweeting tonight during President Obama's speech! 9pm ET _E_\n.@michellemalkin & @BuzzFeedAndrew: \"Vaccine court awards millions to two autistic children damaged by vaccine\" __HTTP__ _E_\nA special message to the staff of @TrumpWaikiki in celebration of the 2nd anniversary.... __HTTP__ _E_\nPersonally I'm glad the NYPD is monitoring the actions of certain extremists. New York's finest! I support them. _E_\nWhat is your favorite @THEGaryBusey film? Tonight's short film? Point Break? Lethal Weapon? 
#CelebApprentice _E_\nSerious voter fraud in Virginia New Hampshire and California so why isn't the media reporting on this? Serious bias big problem! _E_\nSanctions Relief From Clinton Obama Iran Nuclear Deal Likely Go to Terrorists: __HTTP__ #BigLeagueTruth #VPDebate _E_\nVia @CBSmiami by @LisaPetrillo: \"Trump Unveils Renovated @TrumpDoral Red Tiger Golf Course\" __HTTP__ _E_\nSen. Lindsey Graham embarrassed himself with his failed run for President and now further embarrasses himself with endorsement of Bush. _E_\nHeading back to Washington D.C. Much will be accomplished this week on trade the military and security! _E_\nCrooked Hillary has ZERO leadership ability. As Bernie Sanders says she has bad judgement. Constantly playing the women's card it is sad! _E_\nVia @Newsmax_Media: Trump at CPAC: What Really Happened __HTTP__ _E_\nWe spent over a billion on Libya and lead the way why is Europe getting the oil? _E_\nthilan_GolfSwag @realDonaldTrump Played Doral for the first time. absolutely great course! Fantastic job! Thanks. _E_\nSee Schneiderman admit he spoke with Obama about \"ongoing investigations. __HTTP__ _E_\nHonored to welcome Republican and Democrat members of the House Ways and Means Committee to the White House today! #USA __HTTP__ _E_\nJust said at #NCGOPcon that politicians are all talk and no action and we are all tired of it! We need action and results to move forward! _E_\nSee what I have to say about Iran and Iraq in today's #trumpvlog... __HTTP__ _E_\n.@Neilyoung one of my favorite musicians in my office. __HTTP__ _E_\nRT @VP: .@POTUS is committed to the health & well being of the US people & we are confident Dr. Jerome Adams will succeed as our new surgeo... _E_\nBecause Gov. Kasich cannot run in the state of Pennsylvania he cannot win the nomination & should not be allowed to compete in Ohio on Tue. 
_E_\nThis is good news: @MittRomney is now leading in Michigan by 6 points according to @RasmussenPoll __HTTP__ _E_\nGet ready to turn to NBC for CELEBRITY APPRENTICE TONIGHT'S SHOW IS GREAT! _E_\nSupport Coach Kennedy and his right together with his young players to pray on the football field. Liberty Institute just suspended him! _E_\nYankees can win today. Kuroda is a highly underrated pitcher. _E_\nThe protesters in California were thugs and criminals. Many are professionals. They should be dealt with strongly by law enforcement! _E_\n\"The longer you play the better chance the better player has of winning.\" @jacknicklaus _E_\nThere won't be any new gun legislation. No surprise. Americans support the 2nd amendment. _E_\nScary thought what is the pervert Anthony Weiner doing with all the free time he has. Does he collect unemployment? _E_\nAl Shabbab not ISIS just made a video on me they all will as front runner & if I speak out against them which I must. Hillary lied! _E_\nWhat the hell is Obama doing in allowing all of these potentially very sick people to continue entering the U.S.! Is he stupid or arrogant? _E_\nVia @limbaugh: \"See Trump Told You So\" __HTTP__ _E_\nMy video response to President Obama's lack of transparency. __HTTP__ _E_\nState Department has not revoked a single passport of ISIS Americans __HTTP__ We should send them to Gitmo for some R&R. _E_\nIran is moving troops into Iraq under the guise that it is helping out. Actually they will take over Iraq and all of their oil. Stupid U.S. _E_\nDon't worry getting rid of state lines which will promote competition will be in phase 2 & 3 of healthcare rollout. @foxandfriends _E_\n.@TrumpSoHo features a striking glass walled building w/ loft inspired interiors __HTTP__ NYC's trendiest luxury hotel _E_\nGreat article by @jameshohmann @politico explaining why @KarlRove was biggest loser @CPACnews __HTTP__ James is sharp. _E_\nI will be interviewed on the @TODAYshow at 7:30. Enjoy! 
_E_\nVia @MiamiHerald: Donald Trump aims to bring luxury to Doral Golf Resort & Spa __HTTP__ @DoralResort _E_\nPresident Reagan had it right: Social Security is here to stay. We must root out the fraud and make it more (cont) __HTTP__ _E_\nPeople the lawyers and the courts can call it whatever they want but I am calling it what we need and what it is a TRAVEL BAN! _E_\n\"Most people think small because most people are afraid of success afraid of making decisions afraid of winning\" The Art of the Deal _E_\nTrue thanks. __HTTP__ _E_\nI'll be on @foxandfriends Monday morning at 7:30 AM. Tune in! _E_\nThe failing @nytimes has disgraced the media world. Gotten me wrong for two solid years. Change libel laws? __HTTP__ _E_\nObama/Reid/Nunn's failed economic policies are not working. @PerdueSenate will bring fresh perspective to solving problems. #GASen _E_\nSpent the weekend in LA checking out Trump National Golf Club on the Pacific Ocean. An amazing place! __HTTP__ _E_\nRT @CBSNews: WATCH NOW: The @realDonaldTrump supporters you'd never expect __HTTP__ __HTTP__ _E_\nDoing an interview with @SteveDeaceShow. Discussing the ObamaCare web disaster. Be sure to listen __HTTP__ _E_\nSnowden is doing great damage to our relations with other countries and U.S.prestige. China is laughing at us as he continues illegal action _E_\nFinal poll results from NBC on last nights Commander in Chief Forum. Thank you! #ImWithYou #MAGA __HTTP__ _E_\nThe failing @NYDailyNews destroyed by little Morty Zuckerman is preparing to close and save face by going online. It's dead! _E_\n.@ApprenticeNBC Season 13 still #1 at 10PM in all key demos despite having to serve as our own lead in from 9 10. 11PM News loves Trump! _E_\nDo you believe that The State Department on NEW YEAR'S EVE just released more of Hillary's e mails. They just want it all to end. BAD! _E_\nMillions without electricity across NY & NJ. The media has covered for Obama's massive failure. Can you imagine if this was another Pres? 
_E_\nJust left Oklahoma the most amazing crowd and people! What a night! _E_\nMy @Newsmax_Media interview from Friday where I predicted that @newtgingrich in South Carolina would change the race. __HTTP__ _E_\nHighly respected author Christopher Bedford just came out with book The Art of the Donald Lessons from America's.... Really good book! _E_\nThe best vision is insight. Malcolm Forbes _E_\nFollow @MELANIATRUMP's jewelry line on @QVC site __HTTP__ _E_\nThe #CelebrityApprentice Sunday night on NBC at 9 PM. Another exciting episode is ready to go. __HTTP__ _E_\nHow is Bernie Sanders going to defend our country if he can't even defend his own microphone? Very sad! _E_\nRepublicans must unite to defund Obamacare it will drive our country into oblivion and by the way the healthcare is no good anyway! _E_\nWill be doing a big interview tonight with Bret Baier at 6:00 P.M. on Fox. Don't miss it! _E_\nThank you Waukesha Wisconsin! Full transcript of my speech #FollowTheMoney: __HTTP__ __HTTP__ _E_\nIt was an honor to welcome the Prime Minister of Denmark Lars Løkke Rasmussen {@larsloekke} to the @WhiteHouse yes... __HTTP__ _E_\nWishing @FLOTUS Melania and all of the great mothers out there a wonderful day ahead with family and friends! Happy #MothersDay _E_\nUSMC Sgt. Tahmooressi has now been held in Mexican jail for over 150 days. When will Obama call for his release? #FreeOurMarine _E_\nThe oil reserve is a strageic asset for a time of war and an embargo. @BarackObama should open more land for drilling not tap the reserve. _E_\nBriarcliff Manor Mayor Vescio is doing a terrible job. Taxes way too high roads in terrible condition—repave Pine Road. @BriarcliffManor _E_\nJournal News readership is already down 50 percent over the years. _E_\nRight now we have a president and a Treasury secretary who shrug while China tears away hundreds of thousands (cont) __HTTP__ _E_\n\"Happiness is not something ready made. 
It comes from your own actions.\" @DalaiLama _E_\nBest of luck to @chucktodd on his @meetthepress debut this Sunday. _E_\n.@McIlroyRory What a year it has been for you and this weekend topped it off. Fantastic job see u at Doral. _E_\nI'll be tweeting live tonight starting at 9PM ET re:@ApprenticeNBC. Don't worry other time zones I will give nothing away! _E_\nGreat reception in D.C. At the Values Voter Summit. Now checking on my job at the Old Post Office... _E_\nThe problem w/ the concept of global warming is that the U.S. is spending a fortune on fixing it while China & others do nothing! _E_\nVia @foxnewslatino: \"Donald Trump Plans Huge Towers In Rio For Post Olympic Building Boom\" __HTTP__ _E_\nThis is dangerous: @BarackObama is seeking to shrink Israeli military funding but gives $1.3Billion to Muslim (cont) __HTTP__ _E_\nLooks like the Bernie people will fight. If not their BLOOD SWEAT AND TEARS was a total waste of time. Kaine stands for opposite! _E_\nObama and Kerry are bungling Syria by the hour. They have set America's deterrence & stature back by years. Amateurs! _E_\nAccording to many ISIS was given so much time and so many signals as to when we would start bombing that they were able to prepare and hide _E_\n.@VattenfallGroup lead investor in Aberdeen windfarm fiasco has dropped out—project not economically viable & protestors hate it. _E_\nEli Wallach was a great actor and a great guy. My opinion his performance in The Good the Bad and the Ugly was his all time best! _E_\n...Get along & make deals for the good of the country! _E_\nDue diligence includes increasing your financial IQ daily. _E_\nRemember no one ever said success was easy.Good luck doesn't come overnight.But if u work hard & love it u will find success & luck. _E_\nObama is an easy target on foreign policy.@MittRomney has many openings to attack especially when Obama starts bragging about Bin Laden. _E_\nI answered some of your questions in today's video... 
__HTTP__ _E_\nAccording to many and while nominated I would have won the Emmy many times except for my politics. @PrimetimeEmmys _E_\nIt's not climate changeit's global warming.Don't let the dollar sucking wiseguys change names midstream because the first name didn't work _E_\nGetting ready to leave for my GREAT resort Turnberry in Scotland. Hosting The Women's British Open (biggest tournament). Will be back Sat. _E_\n...and safe. Questions were asked about why the CIA & FBI had to ask the DNC 13 times for their SERVER and were rejected still don't.... _E_\nWorthless @NYDailyNews which dopey Mort Zuckerman is desperately trying to sell has no buyer! Liabilities are massive! _E_\nA record high 6.7% of Americans are living in extreme poverty. This is tragic. We can do better. _E_\nGovernor Rick Perry said Donald Trump is one of the most talented people running for the Presidency I've ever seen. Thank you Rick! _E_\nRe Life: Life is very fragile and success doesn't change that. If anything success makes it more fragile. _E_\nThe mother of the Boston killers (not suspects) says her boys are totally innocent and were set up I can see the 14 year long defense now! _E_\nRT @DonaldJTrumpJr: Donald Trump Jr. On The Record: Why Trump International Hotels And Residences Are Still Winning via @forbes __HTTP__ _E_\n#CNNDebate Winning the @drudge_report poll __HTTP__ _E_\nI'm leaving for Iowa now will be great! _E_\nFan favorite @LilJon once again shines in the record 13th season of 'All Star' @CelebApprentice. He is an amazing & wonderful guy! _E_\nWatching Hurricane closely. My team which has done and is doing such a good job in Texas is already in Florida. No rest for the weary! _E_\nLIVE on #Periscope: Join me for a few minutes in Pennsylvania. Get out & VOTE tomorrow. LETS #MAGA!! __HTTP__ _E_\nWhether you think you can or think you can't you're right. Henry Ford _E_\nRT @piersmorgan: BOOM! Thank you Mr President. Trophy hunting is repellent. 
__HTTP__ _E_\n.@WSJ and dopey Karl Rove made a mistake and purposely mischaracterized my statement on the terrible TPP deal. __HTTP__ _E_\nDark Knight Rises is projected to gross over $180 million this weekend. Remember to watch for Trump Tower! _E_\n.@HillaryClinton's tax hikes will CRUSH our economy. I will cut taxes BIG LEAGUE. __HTTP__ __HTTP__ _E_\nBe ready for problems. You'll have them every day so keep things in perspective. Ask yourself: Is this a blip or is it a catastrophe? _E_\nThink of this: After we spent $2 trillion on Iraq Baghdad is about to be taken over by ISIS. _E_\n... debut her first 2013 \"Melania® Timepieces & Fashion Jewelry\" collection! _E_\nI made a lot of money in Atlantic City and left 7 years ago great timing (as all know). Pols made big mistakes now many bankruptcies. _E_\nBTW The Miss USA pageant was the highest rated non sports telecast on the Big 4 networks. Congrats to our newly crowned @Nia_Sanchez_! _E_\nI am following the Trayvon Martin case carefully. It's a terrible situation that should never have happened. (cont) __HTTP__ _E_\nBill Cosby is foolish stupid or getting bad advice in remaining silent if he is innocent. Probably guilty! Not a fan. _E_\nWe are suffering through the worst long term unemployment in the last 70 years. I want change Crooked Hillary Clinton does not. _E_\nThank you Delaware! #Trump2016 __HTTP__ _E_\nHappy Birthday @DonaldJTrumpJr! __HTTP__ _E_\nGreat crowd in Johnstown Pennsylvania thank you. Get out & VOTE on 11/8! Watch the MOVEMENT in PA. this afternoon... __HTTP__ _E_\nA great deal of good things happening for our country. Jobs and Stock Market at all time highs and I believe will be getting even better! _E_\n#USAatUNGA#UNGA __HTTP__ _E_\nWe will now be helping Syria and Iran by attacking ISIS ironic isn't it! 
_E_\nVia @WWE: Donald Trump announced for WWE Hall of Fame __HTTP__ _E_\nSituated in the heart of downtown Toronto the 65 story @TrumpTO offers an elegant and wonderful lifestyle __HTTP__ _E_\nThe @WSJ Editorial Board is so wrong so often. They got info from an incorrect story in another pub. Why not watch and listen to debate. _E_\nRemain open to new ideas. That's where innovation begins. _E_\nA record 46.68M Americans are now on food stamps __HTTP__ Four more years? _E_\nAsk Sally Yates under oath if she knows how classified information got into the newspapers soon after she explained it to W.H. Council. _E_\nI am self funding my campaign putting up my own money not controlled. Cruz is spending $millions on ads paid for by his N.Y. bosses. _E_\nChina is sending an Envoy and Delegation to North Korea A big move we'll see what happens! _E_\nIn memory of Joan Rivers watch when she became my Celebrity Apprentice which meant so much to her! __HTTP__ _E_\n.@TrumpChicago's award winning dining options also offer the best views of the city __HTTP__ _E_\nMy @CNNS interview with @wolfblitzercnn discussing my endorsement of @MittRomney and why he can beat @BarackObama __HTTP__ _E_\nThe opinion of this so called judge which essentially takes law enforcement away from our country is ridiculous and will be overturned! _E_\nEntrepreneurs: Brainpower is the ultimate leverage. _E_\nIf amnesty is so popular according to the DC ruling class then why is Obama delaying his executive action until after the election? _E_\n'Top Hillary Adviser Mocked Plotted Attacks on Pro Sanders Civil Rights Leader' #DrainTheSwamp __HTTP__ _E_\nNobody beats me on National Security. __HTTP__ _E_\nAs a stockholder in Apple they should get on with a larger screen iPhone as a supplement—immediately. _E_\nThe Coca Cola company is not happy with me that's okay I'll still keep drinking that garbage. _E_\nEntrepreneurs: Being stubborn is a big part of being a winner. Never give up! 
_E_\nMany of the great jobs that the people of our country want are long gone shipped to other countries. We now are part time sad! I WILL FIX! _E_\nThe most stringent gun laws in the U.S. happen to be in Chicago and look what is happening there! _E_\nTo all my fans sorry I couldn't do The Apprentice any longer—but equal time (presidential run) prohibits me from doing so. Love! _E_\nAmerican sanctions alone cannot stop Iran's nuclear drive and @BarackObama cannot get China and Russia to agree on new Iranian sanctions. _E_\nGreat article by @RichLowry on @POLITICOMag : \"Sorry Donald Trump Has A Point\" __HTTP__ _E_\nMy speech to @PressClubDC yesterday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_\nChina's Communist Party has now publicly praised Obama's reelection. They have never had it so good. Will own America soon. _E_\n#LaborDay #AmericaFirstVideo: __HTTP__ __HTTP__ _E_\nCongrats to winners from around the world who entered the Think Like A Champion signed book/keychain contest! __HTTP__ _E_\nSee yourself as an organization. Pay attention to every facet of your life. What's strong? What's weak? What's missing? _E_\nI am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_\nHeading over to @Kelly and Michael re. Apprentice! _E_\nConsumer Comfort Reaches 16 Year High on U.S. Economic Optimism via Bloomberg __HTTP__ _E_\nNational Security Presidential Memorandum on Strengthening the Policy of the United States Toward Cuba Memorandum... __HTTP__ _E_\nIf someone says \"I'll bet you ten dollars\" and loses the bet it's pay up time. _E_\nLIVE on #Periscope: Live with the Donald __HTTP__ _E_\nBoth Aberdeen and Turnberry in Scotland and the soon to open Doonbeg in Ireland blow Bandon Dunes away. Bandon is a toy by comparison! _E_\n59% of the United States by area is now covered in snow highest % in many years. The global warming name isn't working anymore SORRY! 
_E_\nWord is that they have far more evidence on A Rod than they have on Ryan Braun! Alex is over. _E_\nTed Cruz didn't win Iowa he stole it. That is why all of the polls were so wrong and why he got far more votes than anticipated. Bad! _E_\n#CelebrityApprentice Boardrooms—can anything be more intense? #sweepstweet _E_\nSenator Sessions will serve as the Chairman of my National Security Advisory Committee. __HTTP__ __HTTP__ _E_\nI don't mind that @BarackObama plays a lot of golf. I just wish he used it productively to make deals with Congress! _E_\nCongrats to @bubbawatson on winning the Masters. He did it without heavy reliance on coaches and the other hanger ons he just played golf. _E_\nICYMI \"Raw video: Donald Trump speaks at Rep. Steve Stepanek's Amherst reception\" __HTTP__ via @wmur9 _E_\nVia @pressjournal by Ann Marie Parry: Plans revealed for course named after Trump's mother __HTTP__ _E_\nBig protests in Iran. The people are finally getting wise as to how their money and wealth is being stolen and squandered on terrorism. Looks like they will not take it any longer. The USA is watching very closely for human rights violations! _E_\nRT @BrazoriaCounty: __HTTP__ _E_\nIf I would have offered Obama a billion dollars to show his records he would have refused. _E_\nEvery Poll has me winning BIG.If you listen to dopey Karl Rove a Trump hater on @oreillyfactor you would think I'm doing poorly. @FoxNews _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nBen Smith (is that really his last name?) of @BuzzFeed is a total mess who probably got his minion Coppins to do what he didn't want to do? _E_\nSo much Fake News is being reported. They don't even try to get it right or correct it when they are wrong. They promote the Fake Book of a mentally deranged author who knowingly writes false information. The Mainstream Media is crazed that WE won the election! _E_\nHow do you like Seth and Oscars so far? 
_E_\nSenator Lindsey Graham called me yesterday very much to my surprise and we had a very interesting talk about national security and more! _E_\nWind Power is proving to be very costly and unsightly. _E_\nVia @BostonDotCom by @lilsarg: \"Donald Trump on Snow Salt Vaccines and the Oval Office\" __HTTP__ _E_\nScary thought @JoeBiden is a heartbeat away from the Presidency. _E_\n2013 is the worst year ever for Hollywood. Garbage released after garbage. What is going on in these studios?! _E_\nBob Beckel a commentator for FOX is bad for the @FoxNews brand: @BobBeckel is close to incompetent. _E_\nThe Midas Touch hand is the ideal metaphor to represent the attributes critical to entrepreneurial success. (cont) __HTTP__ _E_\nGreat Governor @Mike_Pence is in Indiana to help lead the relief efforts after tornadoes struck. True leadership. _E_\n\"Winning is the most important thing in my life after breathing. Breathing first winning next.\" George Steinbrenner _E_\nThank you Senator @TedCruz!#Debates2016 #MAGA __HTTP__ _E_\nI watched POTUS speech from Europe same old tax and spend won't create jobs. _E_\nCongratulations to Patrick Reed for winning at Trump National Doral. He told me The Blue Monster is the best course I've ever played _E_\nHow about President Obama fixing the gasoline situation instead of taking photo ops in the destruction. _E_\nWatch me on @SeanHannity's show at 10PM tonight on @FoxNews _E_\nEntrepreneurs: There are no guarantees but being ready sure beats being taken by surprise. Know everything you can about what you're doing. _E_\nThank you Rand! __HTTP__ _E_\nW/ views of NYC's skyline Trump Stamford is Connecticut's most luxurious high rise featuring Trump amenities __HTTP__ _E_\n'Uniforms 4 Everyone' campaign @fundanything has a $3000 goal to buy underprivileged kids school uniforms __HTTP__ _E_\nGreat debate poll numbers I will be on @foxandfriends at 7:00 to discuss. Enjoy! _E_\nThe Hillary Clinton staged event yesterday was pathetic. 
Be careful Hillary as you play the war on women or women being degraded card. _E_\nCongratulations to @SpeakerRyan @GOPLeader @SteveScalise and to the Republican Party on Budget passage yesterday. Now for biggest Tax Cuts _E_\nGabriel Sherman's book on Roger Ailes is filled with falsehoods and inaccuracies. Publisher should be ashamed (and sued). _E_\nDon't worry when our country starts hurting bad enough from all of the mistakes that are being made we will start doing the right things. _E_\n\"Don't expect to build up the weak by pulling down the strong.\" Calvin Coolidge _E_\nThe road to success is always under construction. Arnold Palmer _E_\nThank you Jason Greenblatt @JasonDovEsq For Our Children: Let's Elect Donald Trump __HTTP__ _E_\nReceiving the @RobbReport trophy for best new golf course in the world Trump International Golf Links Scotland. __HTTP__ _E_\n.@KathieLGifford Melania and I send our deepest condolences. Frank was a special and amazing person. He will be missed by all! _E_\n.@FranksFight Keep fighting Frank! Never give up! _E_\n\"Always be prepared to start.\" Joe Montana _E_\nVia @nypost by @StarrMSS: \"Trump: @ApprenticeNBC contestants 'the meanest by far'\" __HTTP__ _E_\nWith these record high gas prices what does it say about Obama that he was trying to brag about his energy policy in the debate? _E_\nLooking forward to a full day of meetings with President Xi and our delegations tomorrow. THANK YOU for the beautiful welcome China! @FLOTUS Melania and I will never forget it! __HTTP__ _E_\nI am doing Greta tonight on Fox talking about Obama Care and pervert Anthony Wiener! 10 P.M. _E_\nCongratulations to Dubai on winning the rights to host Expo 2020! A great place winning a major global event.@damacofficial @dubaiexpo2020 _E_\nMany Super Pacs funded by groups that want total control over their candidate are being formed to \"attack\" Trump. 
Remember when u see them _E_\nThe Unaffordable Care Act sometimes referred to as ObamaCare is not working. Millions of people are losing their plans and doctors fraud! _E_\nEvery time I speak of the haters and losers I do so with great love and affection. They cannot help the fact that they were born fucked up! _E_\nChina has control over North Korea! _E_\nCongratulations to Barack Obama for having 2012's debt already surpass 2011 __HTTP__ _E_\nMy @WOR710 interview on The John Gambling Show discussing the 2012 election Trump real estate projects & our airports __HTTP__ _E_\nWow \"FBI lawyer James Baker reassigned\" according to @FoxNews. _E_\nThe Federal deficit crossed $15Trillion 100% of our GDP. Yet the Super Committee can't find $1.2Trillion i... (cont) __HTTP__ _E_\nSo funny Crooked Hillary called BREXIT so incorrectly and now she says that she is the one to deal with the U.K. All talk no action! _E_\nJoin me in Tampa Florida tomorrow at 1pmE! Tickets: __HTTP__ __HTTP__ _E_\nWe could make America great again by spreading ObamaCare throughout the World while at the same time dropping it from U.S.! _E_\nWe believe that every American should stand for the National Anthem and we proudly pledge allegiance to one NATION UNDER GOD! __HTTP__ _E_\nBe sure to watch the Celebrity Apprentice on Sunday night 9 pm on NBC. __HTTP__ _E_\n.@megynkelly is very bad at math. She was totally unable to figure out the difference between me and Cruz in the new Monmouth Poll 41to14. _E_\nKAREN HANDEL FOR CONGRESS. She will fight for lower taxes great healthcare strong security a hard worker who will never give up! VOTE TODAY _E_\nSo nice being with Republican Senators today. Multiple standing ovations! Most are great people who want big Tax Cuts and success for U.S. _E_\n.@GovMikeHuckabee Great job on @FoxNews tonight. Thanks for your nice words about my children. Class! 
_E_\nEnter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_\nthought it would be hypocritical to attend Bush's swearing in....he doesn't believe Bush is the true elected president. Sound familiar! WP _E_\nSpeaking at the City Club of Chicago. Sold out in minutes with thousands on the wait list!... __HTTP__ _E_\nStock market hits another high with spirit and enthusiasm so positive. Jobs outlook looking very good! #MAGA __HTTP__ _E_\n.@AC360 Anderson so amazing. Your mother is and always has been an incredible woman! _E_\nOver 90% of American workers could lose their healthcare by 2020 thanks to ObamaCare. Repeal before it is too late! _E_\nHow the hell does the Libyan government get off telling our embassy security they can't have loaded guns for protection?! _E_\nJane Fonda and Michael Douglas look great! _E_\nAfghanistan is a total disaster. We don't know what we are doing. They are in addition to everything else robbing us blind. _E_\nMy FoxBusiness interview with Don Imus discussing #TimeToGetTough the GOP primary and the Newsmax @iontv debate __HTTP__ _E_\nAs I anticipated Justice Roberts made the cover of Time Magazine etc. The liberal media now loves him he should be ashamed. _E_\nThank you @JoeTrippi for the nice and true words on #Media Buzz with terrific Howie Kurtz. Leading New Hampshire 30 to 12. @FoxNews _E_\nHillary has bad judgment! __HTTP__ _E_\nResponse to the Pope: __HTTP__ _E_\nTrump to Liberty U Students: 'The World is Laughing at Us' __HTTP__ Via @Newsmax_Media _E_\n\"Perception about India has changed says Donald Trump\" __HTTP__ via @EconomicTimes by Kailash Babar _E_\nWhy are the Republicans giving Obama fast track authority for TPP and the Iran agreement?! Obama gets more from the GOP than his own party. _E_\nSorry for all of the millions of people who long to hear my brilliant words of wisdom on Fox & Friends on Monday A.M. no go in Dubai. 
_E_\n#ObamacareFail __HTTP__ _E_\nWhen will President Obama issue the words RADICAL ISLAMIC TERRORISM? He can't say it and unless he will the problem will not be solved! _E_\n#NeverForget __HTTP__ _E_\nEntrepreneurs: Ask yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_\nHere's what I told @Gretawire on @FOX when it comes to singer @Cher's inappropriate attacks on @MittRomney __HTTP__ _E_\nSo terrible that Crooked didn't report she got the debate questions from Donna Brazile if that were me it would have been front page news! _E_\nThe real J.P.Morgan is spinning in his grave at the ridiculous settlements the bank is making to settle disputes. A settler is a soft target _E_\nObama: \"I will control Ebola.\" = Obama: \"If you like your health care plan you can keep your healthcare plan.\" _E_\nMarco Rubio is a total lightweight who I wouldn't hire to run one of my smaller companies a highly overrated politician! _E_\nMy persona will never be that of a wallflower I'd rather build walls than cling to them Donald J. Trump _E_\n\"The most important thing in communication is hearing what isn't said.\" Peter Drucker _E_\nI will beat Hillary easily but Lindsey Graham says I won't and yet he got zero against me no cred! Why does FOX put him on? _E_\nThe U.S. is spending fortunes at airports checking people coming in from West Africa with uncertain results. STOP THE FLIGHTS YOU DUMB B's! _E_\nTHANK YOU NEW YORK! #Trump2016 __HTTP__ _E_\nScotland is beautiful and Trump Internatonal Golf Links Scotland is progressing beautifully as well. __HTTP__ _E_\nJohn Roberts arrived in Malta yesterday. Maybe we will get lucky and he will stay there. _E_\nIn 2011 I said that Mubarak never should have been ousted because whoever replaces him will be worse. Obama made a mistake. _E_\nJOBS JOBS JOBS! 
__HTTP__ __HTTP__ _E_\nDon't believe @BarackObama's whining Pro Romney SuperPAC spending is on par with Pro Obama SuperPAC __HTTP__ _E_\nFast and Furious gun running goes all the way to the White House. We need answers now! _E_\nJust toured Baton Rouge Louisiana GREAT PEOPLE fantastic place doing really well. Miss USA Pageant totally sold out.Tomorrow night NBC _E_\nI will be interviewed by @seanhannity tonight at 10:00 on @FoxNews . Much much much to talk about! _E_\n.@Ynberg: Long term goal &gt &gt &gt to be the black @realDonaldTrump 4real .Great Dean and you will make it! _E_\nWhich National Costume do you think should win? __HTTP__ _E_\nHe @RickSantorum wants to decide what books people can read what movies they can see. #freespeech It doesn't work that way! _E_\nBig announcement in Ames Iowa on Tuesday! You will not want to miss this rally! #Trump2016 __HTTP__ __HTTP__ _E_\nWe are asking law enforcement to check for dishonest early voting in Florida on behalf of little Marco Rubio. No way to run a country! _E_\nWhat is our President doing? __HTTP__ _E_\nI was so looking forward to being in Virginia Beach Virginia today. The demand for tickets was amazing. Good luck with storm back soon! _E_\nthese companies are able to move between all 50 states with no tax or tariff being charged. Please be forewarned prior to making a very ... _E_\nBelated congratulations to @serenawilliams on winning the French Open. A great player & person! _E_\nTed Cruz poll numbers are down big. Because he was born in Canada and was until recently a Canadian citizen many believe he cannot run! _E_\nBecause of our terrible leaders it is now open season on every American throughout the world. Terrorists are thrilled. _E_\nSuccess tip: Keep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_\nThe Obstructionist Democrats make Security for our country very difficult. They use the courts and associated delay at all times. 
Must stop! _E_\nI aim very high and then just keep pushing and pushing to get what I'm after. The Art of the Deal _E_\nWatching @TigerWoods on NBC playing great golf. Tiger won The WGC Cadillac Championship at Trump National Doral this year. I love Tiger! _E_\nYou can't build a reputation on what you're doing to do. Great quote by Henry Ford. _E_\nTime to #DrainTheSwamp in Washington D.C. and VOTE #TrumpPence16 on 11/8/2016. Together we will MAKE AMERICA SAFE... __HTTP__ _E_\nThe Formula of Knowledge: The best way to learn is through studying the history of success and failures in your industry. _E_\nWOW SO NICE AND SO TRUE. THANK YOU! @not_that_actor: @realDonaldTrump #TRUMP2016 TIME TO RETHINK THE CHOICES __HTTP__ _E_\nI was always a big fan of Kim Novak and still am—a wonderful actress. _E_\nVia @NJcomsomerset BY @wobriensomerset: @TigerWoods brings charity golf playoffs toTrump Nat'l/Bedminster __HTTP__ _E_\nWhy did @oreillyfactor give @davidaxelrod so much time to sell his third rate book. Bill should have hit stammering David MUCH harder! Waste _E_\nSadly when it comes to using the energy industry to create American jobs Obama has been a total disaster. #TimeToGetTough _E_\nJust got back from Iowa great people! _E_\nWe are in the NAFTA (worst trade deal ever made) renegotiation process with Mexico & Canada.Both being very difficultmay have to terminate? _E_\nMore than anything else I think deal making is an ability you're born with. It's in the genes. #TheArtofTheDeal _E_\nRT @IvankaTrump: \"The Trump economy is booming.\" One thing @realDonaldTrump \"has done that has received little attention despite arguably d... _E_\n.@realDonaldTrump is PRO LIFE PRO FAMILY #BigLeagueTruth #Debates2016 __HTTP__ _E_\nI had fun appearing in the video for Carly Rae Jepsen's #CallMeMaybe for #MissUSA 2012 __HTTP__ _E_\nThat Saturday Night Live is able to joke about the Germanwings air tragedy is disgusting. They should apologize to all of those suffering! 
_E_\nWill be on @foxandfriends at 7:00 this morning enjoy! _E_\nCongratulations to @joniernst on her impressive @IowaGOP primary win last night. Now all should unite & defeat Bruce Braley this November _E_\nHost of the @PGATOUR & @CadillacChamp @TrumpDoral is home to 4 unique courses including the famous Blue Monster __HTTP__ _E_\nI will be on Greta @gretawire tonight at 10 PM on Fox News. _E_\nAutism Speaks head up by Bob & Suzanne Wright does a fantastic job—if only we had more people like them! To help: __HTTP__ _E_\nCongratulations to Obama on building a strong economy. There are 49500000 people on food stamps. A historic record! _E_\nAnimals representing Hillary Clinton and Dems in North Carolina just firebombed our office in Orange County because we are winning @NCGOP _E_\nVia @newsbusters: \"Donald Trump Issues Statement Regarding $5 Million Lawsuit Against Bill Maher\" __HTTP__ _E_\nThe Republican Senators must step up to the plate and after 7 years vote to Repeal and Replace. Next Tax Reform and Infrastructure. WIN! _E_\nJoin me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nPollster Trend National GOP Average223 national polls & 33 pollsters.#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nRT @EricTrump: Honored to speak at the RNC Summer Meeting in Nashville Tennessee this evening! @GOP #MAGA @GOPChairwoman __HTTP__ _E_\nVia @BreitbartNews: 'MAJOR COUP': DONALD TRUMP PICKS UP TOP IOWA GRASSROOTS OPERATIVE FOR POTENTIAL 2016 CAMPAIGN __HTTP__ _E_\nGreat to be in Riyadh Saudi Arabia. Looking forward to the afternoon and evening ahead. #POTUSAbroad __HTTP__ _E_\nSexual pervert Anthony Weiner has zero business holding public office. _E_\nImmigration reform is all risk for the @GOP. Their base doesn't want it and the 12M illegals will all vote Democrat. _E_\nGas is $6 already in California. 
Don' worry @BarackObama's Algae energy policy is going to pay major (cont) __HTTP__ _E_\nThe job plan by @BarackObama is nothing more than a second stimulus. The first failed and so will this one. _E_\nIn the latest poll Danger Weiner's numbers have sunk. I wonder how Carlos handled the stress? He is one whacko sicko sexter. _E_\n.@johnhawkinsrwn Great speaking to you today we will speak again soon. _E_\nI am working on a new system where there will be competition in the Drug Industry. Pricing for the American people will come way down! _E_\nKay Hagan profited off of the stimulus.She just skipped a debate. Kay supports amnesty weak border & __HTTP__ @ThomTillis! _E_\nThe @timestribune @EricTrump: Eyes are on Northeast Pa. with gas development __HTTP__ _E_\nStock Market hit another all time high yesterday despite the Russian hoax story! Also jobs numbers are starting to look very good! _E_\nHillary defrauded America as Secy of State. She used it as a personal hedge fund to get herself rich! Corrupt dangerous dishonest. _E_\nMy @foxandfriends interview from Monday discussing Obama's tone going over the curb and Republican debt ceiling card __HTTP__ _E_\nThank you @ATFD17! #ImWithYouVideo: __HTTP__ _E_\nPresident Obama was terrible on @60Minutes tonight. He said CLIMATE CHANGE is the most important thing not all of the current disasters! _E_\nDemocrats used to support border security — now they want illegals to pour through our borders. _E_\n\"Confidence is contagious. So is lack of confidence.\" Vince Lombardi _E_\nIf the Boston killer applies for Obama Care the paperwork will be too complicated for him to understand! _E_\nCongratulations to Chuck Hagel on one of the shortest tenures as Sec. of Defense. Another terrible appointee by Obama. _E_\nI got to know @ScottWalker well—he's a very nice person and has a great future. _E_\nI read @willweatherford's comments that \"the lights are dimming on gambling in Florida\"—nothing could be worse for the state. 
_E_\nCongrats to @EricTrump and @LaraLeaYunaska on a great five years! _E_\nThese are facts: In 2001 the US opened its markets to China & since then more than 2 million Americans can't (cont) __HTTP__ _E_\nIf Obama attacks Syria and innocent civilians are hurt and killed he and the U.S. will look very bad! _E_\nMeeting with Generals at Mar a Lago in Florida. Very interesting! _E_\n.@AGSchneiderman should remove his eyeliner as pointed out by Cuomo when he does his commercials! _E_\nCelebrity Apprentice returns to NBC Sunday 3/14 9 11PM ET/PT. Outstanding list of celebrities & season should be the best one yet! _E_\nVia @Newsmax_Media by @OwenTew: \"Donald Trump: Kerry Has to Walk If Iran Doesn't Make Deal\" __HTTP__ _E_\nDummy political pundit @krauthammer constantly pressed the crazy war in Iraq. Many lives and trillions of dollars wasted. U.S. got NOTHING! _E_\nDoes anyone remember the fight @mcuban had w/ the referee—he was weak & pathetic—a non athlete trying to live life thru his players. _E_\n.@NFL: Too much talk not enough action. Stand for the National Anthem. _E_\nWorking on major Trade Deal with the United Kingdom. Could be very big & exciting. JOBS! The E.U. is very protectionist with the U.S. STOP! _E_\nOur foreign policy decisions are dumbest in U.S. history _E_\nEllen was so awkward and insecure last night. The pizza skit was terrible. She should dump Andy Lassner a guy with no absolutely no talent! _E_\n\"Partnerships also require negotiation. It should be a win win setup. Otherwise it's not a partnership.\" – 'Midas Touch' _E_\nJoin me in Wichita Kansas tomorrow morning! Looking forward to it!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIt is only the people that were never asked to be VP that tell the press that they will not take the position. _E_\nHighly untalented Wash Post blogger Jennifer Rubin a real dummy never writes fairly about me. Why does Wash Post have low IQ people? 
_E_\nCheck out my interview from MSNBC at __HTTP__ _E_\nObama's ideas don't move us 'Forward' they take us 'Backwards.' These are ideas people come to America to get away from. @marcorubio _E_\nSaw Michael Jordan and Ray Allen today playing golf at Trump National Doral the Blue Monster. Great guys! _E_\nCapitalism requires capital. When government robs capital from investors it takes away the money that creates (cont) __HTTP__ _E_\nThanks everyone they all said I won the debate. Even won the @CNBC Poll! _E_\nSuccess requires 100% of your focus and 100% of your effort. Don't sell yourself short. _E_\nMy 757 is incredible I think the teams agree on that. _E_\nEconomy growing! Excluding hurricane effects CEA estimates that real GDP growth would have been 3.9% in Q3.Stock market at a new high unemployment at a low. We are winning and TAX CUTS will shift our economy into high gear! __HTTP__ _E_\nIf everything seems under control you're not going fast enough. Mario Andretti _E_\nWe must stop releasing hard core criminals all over the United States. Our country must be strong again! _E_\nDespite the fact that I have had great success with the words YOU'RE FIRED I do not like firing people. But ZERO on ObamaCare mess no way! _E_\n\"If you plan for the worst – if you can live with the worst – the good will always take care of itself.\" – The Art of the Deal _E_\nLeaving Puerto Rico now for D.C. Will be in Las Vegas early tomorrow to pay my respects. Everyone is in my thoughts and prayers. __HTTP__ _E_\nLate Night host are dealing with the Democrats for their very unfunny & repetitive material always anti Trump! Should we get Equal Time? _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\nMy @foxandfriends interview discussing the Super Bowl the real unemployment numbers Iran and @MittRomney's (cont) __HTTP__ _E_\nI wonder what the work atmosphere is like @VanityFair. It must be hard working at a dying institution. 
_E_\nBest ratings for the Dateline show were for six months not two months! _E_\nSenate passed the VA Accountability Act. The House should get this bill to my desk ASAP! We can't tolerate substandard care for our vets. _E_\nPost Debate via @OANN. Thank you!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\n\"Trump Dana Farber waiting on Bill Maher\" __HTTP__ via @BostonGlobe _E_\nWhy doesn't President Obama call upon the NSA to fix the badly broken website then they could spy on all of the many cheaters & arrest them! _E_\nWe need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_\nIf US Air & American Airlines are allowed to merge ticket prices will skyrocket—there will be no competition. _E_\nVia @trscoop: \"Mark Levin DEFENDS Trump: Hillary Clinton is a CROOK and a FRAUD and she's not treated this way!\" __HTTP__ _E_\nRT @RightlyNews: @realDonaldTrump @LouDobbs It is NOT a coincidence that the economy boomed immediately after the 2016 election. _E_\nThis is my pledge to the American people: __HTTP__ _E_\nThe reason I am staying in Bedminster N. J. a beautiful community is that staying in NYC is much more expensive and disruptive. Meetings! _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIt's going to get hotter in Las Vegas tonight! Watch the Miss Universe Pageant tonight on NBC at 9 p.m. I'm looking forward to being there! _E_\nBrian Williams was never a smart guy but always passes himself off as such. People will learn the truth! @NBCNightlyNews _E_\n.@bretbaier has a wonderful new book #specialheart and it's proving to be a great success already. Bret is a winner! _E_\nI never thought I'd say it in my lifetime but President Barack Hussein Obama aka Barry Sotoro is a far worse president than Jimmy Carter! _E_\nChina has just overtaken us as the world's largest economy. We are busy wasting $'s while China builds airports & skyscrapers. 
_E_\nInauguration Day is turning out to be even bigger than expected. January 20th Washington D.C. Have fun! _E_\nAwarded 5 Stars by @VisitScotland @TrumpScotland's MacLeod House & Lodge boutique hotel is an historic masterpiece __HTTP__ _E_\n\"What the mind can conceive and believe and the heart desire you can achieve.\" Norman Vincent Peale _E_\n....and don't forget that Foxconn will be spending up to 10 billion dollars on a top of the line plant/plants in Wisconsin. _E_\nIf everything seems under control you're not going fast enough. Mario Andretti _E_\nPeople are really liking my new book Crippled America. Check it out! _E_\nIt's about time for all Americans (Republicans & Democrats) to force our elected officials to start acting fiscally responsible! _E_\n.@TheBrodyFile great job on @AC360. Thank you for the very smart and kind words! _E_\nChris Cuomo in his interview with Sen. Blumenthal never asked him about his long term lie about his brave service in Vietnam. FAKE NEWS! _E_\nCongratulations to @TrumpPanama for winning the 2015 Traveler's Choice Award from @TripAdvisor __HTTP__ _E_\nI don't care what people say I like Tom Cruise. He works his ass off and never ever quits. He's one of the few true movie stars. _E_\nWe will NEVER FORGET the victims who lost their lives one year ago today in the horrific #PulseNightClub shooting.... __HTTP__ _E_\nUnder Mayor @MikeBloomberg and Police Commissioner @Ray Kelly all violent crime in NYC is down dramatically. That's leadership. _E_\nExpect the best from people. They will rise to the challenge and it's important to inspire confidence. _E_\nGreat @ANHQDC segment with @CharlesHurt: Breaking Down the Trump Factor __HTTP__ Let's Make America Great Again! _E_\nThe brand new Blue Monster just opened at Trump National Doral Miami. Also great new driving range which is open 'till midnight. GO SEE! _E_\nSuccess tip: Achievers move forward at all times. Achievement is not a plateau it's a beginning. 
_E_\nIn Hillary Clinton's America things get worse. #TrumpPence16 __HTTP__ _E_\n\"Get it straight: Pakistan is not our friend. We've given them billions and billions of dollars and what (cont) __HTTP__ _E_\nMessage to Obama re: Iran: \"The worst thing you can possibly do in a deal is seem desperate to make it.\" – The Art of the Deal _E_\nI own @DannyZuker but he has his friends & haters & losers tweeting that he beat me. He can't beat me at anything! _E_\nStill waiting for a response from @billmaher. Does he even have $5 million? _E_\nStupid George Will gave @MittRomney no chance 3 months ago. Take off his little spectacles and he's just another dummy. _E_\nJust watched Jeb's ad where he desperately needed mommy to help him. Jeb mom can't help you with ISIS the Chinese or with Putin. _E_\n.@TheBrodyFile was fantastic tonight on @CNN. Thank you we will MAKE AMERICA GREAT AGAIN! _E_\n#WeeklyAddress __HTTP__ _E_\nInstead of driving jobs and wealth away AMERICA will become the WORLD'S great magnet for innovation & job creation! __HTTP__ _E_\nSuccess consists of going from failure to failure without loss of enthusiasm. Winston Churchill _E_\n\"Donald Trump turns over 11.5 ac.in Rancho Palos Verdes for recreational open space\" __HTTP__ @DailyBreezeNews by @meg_barnes _E_\nRT @Franklin_Graham: Join me in praying for @POTUS. He reminded the world \"If the righteous many do not confront the wicked few then evil... _E_\nCNBC Titans: Donald Trump will be shown Friday Nov 19th at 9 pm and 1 am Sunday 11/21 at 9 pm and 11/24 at 7 pm __HTTP__ _E_\nSo many stories about me in the @washingtonpost are Fake News. They are as bad as ratings challenged @CNN. Lobbyist for Amazon and taxes? _E_\nThe Democrats should be ashamed. This is a disgrace!#DrainTheSwamp __HTTP__ _E_\nFeels good to be home after seven months but the White House is very special there is no place like it... and the U.S. is really my home! _E_\nI never fall for scams. 
I am the only person who immediately walked out of my 'Ali G' interview _E_\nGas prices are going up big league—I told you so—payback to OPEC! _E_\n.@KatrinaPierson you did a fantastic job tonight on @FoxNews. Thank you for your very tough and very smart representation! _E_\n.@BilldeBlasio should focus on running #NYC & all of the problems that he has caused with his ineptitude & not be so focused on me! _E_\nWith Miami's top #NewYearsEve vacation package @TrumpDoral is the perfect option to celebrate the start of 2015. __HTTP__ _E_\n\"Obamacare Data Mismatch Could Leave Thousands Uninsured\" __HTTP__ ObamaCare is not working and has missed all targets. _E_\nI hope Bill Clinton starts talking about women's issues so that voters can see what a hypocrite he is and how Hillary abused those women! _E_\nRT @foxandfriends: Senators learn the hard way about the fallout from turning on Trump __HTTP__ _E_\nCommon Core is a federal takeover of school curriculum. Department of Education should be disbanded not expanded. Focus on local education. _E_\nI was asked about healthcare by Anderson Cooper & have been consistent I will repeal all of #ObamaCare including the mandate period. _E_\nFood stamps up 45%. Federal handouts up 45%. Is @BarackObama happy? __HTTP__ _E_\nUltimately Trump Tower became much more than just another good deal. I work in it I live in it and I have a (cont) __HTTP__ _E_\nWith over 260 5 Star guest rooms & suites @TrumpTO is 65 stories of pure luxury in the center of downtown Toronto __HTTP__ _E_\nThe new joke in town is that Russia leaked the disastrous DNC e mails which should never have been written (stupid) because Putin likes me _E_\nFinally an accurate story from the Washington Post! __HTTP__ _E_\nMayweather is getting absolutely killed! _E_\nSpeaking of our very stupid war with Iraq it is totally disintegrating and Iran (with Russia) will walk in and take it over (lots of oil)! _E_\nBig dinner with Governors tonight at White House. 
Much to be discussed including healthcare. _E_\nI'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. _E_\nStock Market just hit another record high! Jobs looking very good. _E_\nNYC should hold a parade for returning Iraq and Afghanistan veterans. _E_\nRT @DonaldJTrumpJr: If you live in Louisiana Maine Kentucky or Kansas remember to vote today! Together let's #MakeAmericaGreatAgain __HTTP__ _E_\nCongratulations to @EmilyMiller @mboyle1 & @NolteNC on making @FishbowlDC's list of 10 Journos You Don't Want to Fight on Twitter. _E_\n.@MarkHalperin works so hard but just doesn't have a natural instinct for politics. Others do and those are the people you want to follow! _E_\nIsn't it crazy that people of little or no talent or success can be so critical of those whose accomplishments are great with no retribution _E_\nJodi if you're listening MAKE A DEAL! _E_\nToday I'm in Aberdeen Scotland preparing for the July 10th opening of perhaps the world's greatest golf course __HTTP__ _E_\n.@Larry_Kudlow 'Donald Trump Is the middle class growth candidate' __HTTP__ _E_\nI am watching @FoxNews and how fairly they are treating me and my words and @CNN and the total distortion of my words and what I am saying _E_\nThe right leadership can help economy while creating security around the world. Let's make America great again! __HTTP__ _E_\nThanks to @johnrich for putting on such a great concert fot @Stjude. John was a winner on Celebrity Apprentice and is a fantastic guy. _E_\nThank you Michigan. This is a MOVEMENT. We are going to MAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ _E_\n\"Watch what people are cynical about and one can often discover what they lack.\" General George S. Patton _E_\nIt's been stated that dopey NY @AGSchneiderman used cocaine while he was a state senator. __HTTP__ _E_\nToday @BarackObama is in Ohio on a bus tour. Tomorrow Pennsylvania. How about actually running the country? 
_E_\nSHOCK! ObamaCare will cost double what @BarackObama promised over $1.76 __HTTP__ and result (cont) __HTTP__ _E_\n\"Donald trump files statement of candidacy\" __HTTP__ via @CBSNews _E_\nVancouver's most anticipated hotel & residences @TrumpVancouver will unveil Canada's first Mar a Lago Spa __HTTP__ _E_\nJust left a great rally in Florida now heading to Ohio for two more. Will be there soon. _E_\nQE3 a political favor for Obama will cause record inflation on food and fuel. This hits low income families the hardest. Big mistake. _E_\n.@ABFAlecBaldwin P.S. Your brother @StephenBaldwin7 is doing very well on @ApprenticeNBC and he stated he adores you. _E_\nCongrats to @MiamiHEAT on winning @NBA championship. @MickyArison is a tremendous owner & has done wonders for (cont) __HTTP__ _E_\nEnviro friendly? AP IMPACT: Obama administration allows wind farms to kill eagles birds despite federal laws __HTTP__ _E_\nSen. Corker is the incompetent head of the Foreign Relations Committee & look how poorly the U.S. has done. He doesn't have a clue as..... _E_\nDon't be easily pleased with yourself or with anything else. Be tough & fight to keep your standards high. Think Like a Champion _E_\nIn light the Benghazi emails released last night it is apparent that Obama has no problem lying to the American public... _E_\nDon Butler and executives are doing a great job at @Cadillac the cars are fantastic. _E_\nHuma should dump the sicko Weiner. He is a calamity that is bringing her down with him. _E_\nHe @BarackObama promised to close Gitmo in his first year. It is still open 3 years later and about to get a (cont) __HTTP__ _E_\nAshley Judd Targeted by @karlrove's Super PAC in Ad (Video) __HTTP__ _E_\nVia @FurnitureToday by Cindy W. Hodnett: \"Dorya to introduce Trump Home high end furniture\" __HTTP__ _E_\nPeople love @LilJon! __HTTP__ #CelebApprentice _E_\nChina just hacked our federal government & stole gov. workers' information. Why do our leaders let China get away with this?! 
No respect. _E_\nI truly believe that our country has the worst and dumbest negotiators of virtually any country in the world. _E_\n#TrumpVine @arod sucks! __HTTP__ _E_\nOur inner cities have been left behind. We will never have the resources to support our people if we have an open border. _E_\nJudy Garland was much better to put it mildly! #Oscars _E_\nU.S. small businesses are truly worried about rising healthcare costs and taxes __HTTP__ I told you so! _E_\nTrue courage is being afraid and going ahead and doing your job anyhow that's what courage is. Gen. Norman Schwarzkopf (1934 2012) _E_\nThank you Virginia! 15000 amazing supporters! Everyone get out and #VoteTrump tomorrow! __HTTP__ _E_\nGreat afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou __HTTP__ _E_\nObama never consulted with Congress about a prisoner exchange. HE BROKE THE LAW AND SHOULD BE TRIED. OUR PRESIDENT IS A TOTAL DISASTER! _E_\nZegarelli and Vescio: Pine Road looks like hell. Must be re paved now—very bad for town. @BriarcliffManor _E_\nThank you for the wonderful welcome @WEF! #Davos2018 __HTTP__ _E_\nIf America was under the threat of imminent attack would Obama use torture or a kiss? _E_\n.@NRO Really important to save National Review from going out of business. We need a true conservative voice! _E_\nWith only a very small majority the Republicans in the House & Senate need more victories next year since Dems totally obstruct no votes! _E_\nLogic will get you from A to B. Imagination will take you everywhere. Albert Einstein _E_\n\"Develop success from failures. Discouragement and failure are two of the surest stepping stones to success.\" – Dale Carnegie _E_\nThe NYC casting call for The Apprentice is thisThursday April 1 at Trump Tower. For all the information you need go to NBC.com/casting. _E_\nPathetic attempt by @foxnews to try and build up ratings for the #GOPDebate. Without me they'd have no ratings! 
__HTTP__ _E_\nWe had a wonderful visit to Vietnam thank you President Tran Dai Quang! Heading to the #ASEANSummit 50th Anniv Gala in the Philippines now. __HTTP__ _E_\nExcited that @OurCountryPAC's Amy Kremer has endorsed the Newsmax iontv debate. The Tea Party Express is a great group. _E_\nYesterday Barack Obama said he wants wind turbines manufactured here in China __HTTP__ I don't think this was a gaffe. _E_\nGeorge Ross could be right—@THEGaryBusey would be better in the adventure task than the romance task. #CelebApprentice _E_\nThank you @GolfMagazine for your fantastic review of The Blue Monster at Trump National Doral BEST U.S. RESORT RENOVATION & ALL TIME _E_\nOn my way to San Diego to raise money for the Republican Party. I am spending a lot myself and also helping others. _E_\nI'll be doing @piersmorgan show tonight on CNN at 9 PM. Will be very interesting. (I hope!) _E_\nFocus on your goals not your problems. Problems are a mind exercise learn to play beyond your comfort zone. _E_\nThank you our great honor! __HTTP__ _E_\nGREAT EVENING last night in Pensacola Florida. Arena was packed to the rafters the crowd was loud loving and really smart. They definitely get what's going on. Thank you Pensacola! _E_\nRemember to think big by expanding your horizons at the same time you're expanding your net worth. _E_\nJust finished speaking in Sydney Australia in front of 20000 people and today I'm off to Melbourne for anot... (cont) __HTTP__ _E_\nGlad to hear that @FLGovScott will be speaking at the @RNC Convention. He is a true conservative and fantastic governor! _E_\nEntrepreneurs: Follow your own path—it will bring you to the places you were meant to be. _E_\nI bought the great Turnberry Resort today considered by many to have the greatest golf course in the World. I will take good care of it! _E_\nPlain & Simple: We should only admit into this country those who share our VALUES and RESPECT our people. 
__HTTP__ _E_\n.@StephenBaldwin7's mother thinks I'm very handsome. Now I see where Stephen and Alec get their smarts. #CelebApprentice _E_\nOne positive from last week for Lance was that everyone was focused on Manti Te'o! Why did Lance do that interview? _E_\n#TeamTrump is thinking of Captain Andrew Maitner. A true American hero. #MaitnerStrong __HTTP__ __HTTP__ _E_\nOur country has to come together. We have to start working with and really liking each other. The whole world is watching Baltimore. _E_\nAn analysis showed that Bernie Sanders would have won the Democratic nomination if it were not for the Super Delegates. _E_\nWatching biased Charles @krauthammer a @FoxNews flunky who didn't know that I won every debate in particular the last one. Check polls! _E_\nFor you newcomers George Ross was one of my first advisors on the original Apprentice. #CelebApprentice _E_\nSo many great things happening new poll numbers looking good! News conference at 11:00 A.M. today Trump Tower! _E_\nRT @DonaldJTrumpJr: Nice piece and video today in the Wall Street Journal: Trump's three eldest children jump into campaign __HTTP__ _E_\nThe State of Florida is so embarrassed by the antics of Crooked Hillary Clinton and Debbie Wasserman Schultz that they will vote for CHANGE! _E_\nI had a great day in D.C. even though the subject was an unpleasant one the horrible Iran Nuke deal. Amazing crowd and enthusiasm! _E_\nThe New York Times should never have moved out of their magnificent original home... _E_\nThe era of strategic patience with the North Korea regime has failed. That patience is over. We are working closely... __HTTP__ _E_\nI do what I do out of pure enjoyment. Hopefully nobody does it better. Theres a beauty to making a great deal. It's my canvas. _E_\nHey @KimKardashian I hear you are undecided in the election. I can explain why you should vote for @MittRomney. _E_\nChina is about to acquire 82800 net acres of a Texas shale oil and gas field __HTTP__ What are we doing! 
_E_\nKeep difficulties in perspective. Ask yourself is this a blip or is it a catastrophe? _E_\nJust arrived in Italy after having a very successful NATO meeting in Brussels. Told other nations they must pay more not fair to U.S. _E_\nThank you @chucktodd for your commentary last night on @NBCNightlyNews. Very fair we are making progress together! _E_\n.@alexsalmond @pressjournal @BBCNews RT ‏@DanScavino the photos that they don't show the public... __HTTP__ _E_\nThe purpose of China's massive military buildup on the Nork's border is to intimidate us. China attacked us during the Korean War. _E_\nThe people of South Carolina are embarrassed by Nikki Haley! _E_\nBe sure to tune in to another amazing episode of #CelebApprentice this Sunday on @nbc at 9PM EST! This Sunday's (cont) __HTTP__ _E_\nYou will love Celebrity Apprentice tonight 9 PM on NBC. Must watch from beginning two early firings! _E_\nHope & Change the number of 26 year olds living with parents has jumped 46% under Obama __HTTP__ Four more years? _E_\nThank you CBS & Breitbart total vindication! Will the mainstream media apologize? Many many witnesses. #Trump2016 __HTTP__ _E_\nRemember: Obama turned down $5M to charity which I said I would increase by 10X to $50M just to show simple records. He's hiding lots! _E_\nThank you Grand Rapids Michigan! #ICYMI watch: __HTTP__ __HTTP__ _E_\nVia @CBNNews: Exclusive: Backstage Interview w/ Donald Trump at CPAC __HTTP__ by @TheBrodyFile Great seeing you David! _E_\nI'm getting The Commandant's Leadership Award from the U.S.Marines tonight at The Waldorf Astoria a great honor! @BretBaier _E_\nI told @megynkelly that @oreillyfactor and I had identical views on a certain issue and she cut it out of the taped interview. Why? Too bad! _E_\nMy thoughts on @andyroddick in today's #trumpvlog.... __HTTP__ _E_\n.@FrankLuntz your so called focus groups are a total joke. Don't come to my office looking for business again. You are a clown! 
_E_\nThe only way to do great work is to love what you do. – Steve Jobs _E_\n.@Rosie—No offense and good luck on the new show but remember you started it! __HTTP__ _E_\nI love New Hampshire will be an exciting evening! _E_\nI would do same thing if I were China. They want Obama. __HTTP__ _E_\nJust won The Club Championship at Trump International Golf Club in Palm Beach lots of very good golfers never easy to win a C.C. _E_\nRT @ReutersPolitics: Trump to give $5 million to charity if Obama releases records __HTTP__ _E_\nIn a little reported event China has just overtaken the United States as the NUMBER ONE World economic power! Great going Washington! _E_\nA Rod's appeal will go nowhere. He will get a long suspension. Good for the @Yankees. And sends strong message to @MLB players. _E_\nThank you @elvisduran for dedicating your birthday today to the @EricTrumpFdn for @StJude! Click here to donate: __HTTP__ _E_\nPeople get what is going on! __HTTP__ _E_\n#ICYMI: Will Media Apologize to Trump? __HTTP__ _E_\nI don't get @billmaher and his terrible show he is dumb as a rock but tries so hard to pass himself off as a great intellect. Check past! _E_\nHe is a professional and true gentleman: @GeorgeTakei is one of my favorite contestants from #CelebApprentice. _E_\nTomorrow is the 10 year anniversary of the Apprentice one of the biggest hits in television history. How time flies! _E_\nI hear that @SenTedCruz's $$ man Robert Mercer a good man is very angry because Cruz lied to him about liquidating his (Ted's) holdings.? _E_\nDON'T LET HER FOOL US AGAIN. __HTTP__ _E_\nLooking forward to speaking at tonight's gala for @MittRomney supporters at the Intrepid. Mitt's doing well. _E_\nDonald Trump song is up to almost 60 million hits crazy! _E_\nNobamaCare won't work never will work and can't work it is a total waste of time and energy except that it is hurting people (& economy!) 
_E_\nRT @RealBenCarson: Please read my full endorsement of @realDonaldTrump for President of the United States: __HTTP__ _E_\nThe press has very inaccurately covered this event see for yourself! __HTTP__ _E_\nWill be doing @greta interview tomorrow. So much to talk about! _E_\nRT @IvankaTrump: Since @realDonaldTrump inauguration over 1 million net new jobs have been created in the American economy! #MAGA _E_\n...Why did the DNC REFUSE to turn over its Server to the FBI and still hasn't? It's all a big Dem scam and excuse for losing the election! _E_\nRT @usairforce: \"#AirForce relief efforts in #PuertoRico & #VirginIslands\" __HTTP__ _E_\nSugar @Lord_Sugar Unlike yours my financials are phenomenal. People don't know your real numbers & would not be impressed. _E_\nA Clinton already defeated a Bush. The definition of insanity is doing the same thing twice & expecting a different result. _E_\nWe don't have a country if we don't have borders. #VoteTrump Video: __HTTP__ __HTTP__ _E_\nNorth Korea has just launched another missile. Does this guy have anything better to do with his life? Hard to believe that South Korea..... _E_\nDonald Trump Plans To Continue GOPLegacy Of Leading On Women's Civil Rights Against Racist Sexist Democrats __HTTP__ _E_\nThe media is so in the tank for Obama that it is amazing—the funny thing is he can't stand them! _E_\nWhat would All Star @ApprenticeNBC be w/out a Baldwin? @StephenBaldwin7 is at the top of his game this season. Our fans will be happy. _E_\nDopey @BillKristol who has lost all credibility with so many dumb statements and picks said last week on @Morning_Joe that Biden was in. _E_\nCan you imagine not taking Snowden's passport away before he jetted happily away to foreign lands (where he gave away many U.S. secrets). _E_\nCatch the second part of my interview with Bill O'Reilly tonight at 8pm on Fox News.... _E_\n#CrookedHillary is not qualified! __HTTP__ _E_\nN.A.T.O. 
is obsolete and must be changed to additionally focus on terrorism as well as some of the things it is currently focused on! _E_\nThank you. __HTTP__ _E_\nThe approval process for the biggest Tax Cut & Tax Reform package in the history of our country will soon begin. Move fast Congress! _E_\nUnsustainable @BarackObama has increased total federal budget outlays by over 24% during his term __HTTP__ He loves debt. _E_\nRosie O'Donnell has failed again. Her ratings were abysmal and Oprah cancelled her on Friday night. When will (cont) __HTTP__ _E_\nGet rid of all of these commercials. #DemDebate _E_\nI hearby demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_\nHere is my statement. __HTTP__ _E_\nThe U.S. has 69 treaties with other countries where we would have to defend them and their borders. How nice but what do we get? NOT ENOUGH _E_\n.@PrimeMinisterSX has no clue what's going on in St. Maarten. Mullet Bay is a third world slum. _E_\nHeading to the Great State of Wisconsin to talk about JOBS JOBS JOBS! Big progress being made as the Real News is reporting. _E_\nNew GOP platform now includes language that supports the border wall. We will build the wall and MAKE AMERICA SAFE AGAIN! _E_\nHave a fantastic beautiful and happy Easter everyone and then when Easter is over have great wins and triumphs in life. Never give up! _E_\nTotally dishonest Donna Brazile chokes on the truth. Highly illegal! Watch: __HTTP__ __HTTP__ _E_\nThe failing @nytimes just announced that complaints about them are at a 15 year high. I can fully understand that but why announce? _E_\nDoral Tournament was great best 18th hole in golf and a wonderful winner in @JustinRose _E_\nRT @realDonaldTrump: Senator Dicky Durbin totally misrepresented what was said at the DACA meeting. Deals can't get made when there is no t... _E_\n.@HillaryClinton lists litany of ways she plans to restrict gun rights. 2A will not survive a Hillary presidency. 
#Debate #BigLeagueTruth _E_\nIt is Clinton and Sanders people who disrupted my rally in Chicago and then they say I must talk to my people. Phony politicians! _E_\n45000 construction & manufacturing jobs in the U.S. Gulf Coast region. $20 billion investment. We are already winning again America! _E_\nThank you Columbus Ohio! __HTTP__ _E_\nAmerica is at a great disadvantage. Putin is ex KGB Obama is a community organizer. Unfair. _E_\nIf you think you can do a thing or think you can't do a thing you're right. Henry Ford _E_\nBesides an award winning golf course @TrumpGolfLA features exquisite estates on top the Palos Verdes Peninsula __HTTP__ _E_\nThe failing @nytimes has become a newspaper of fiction. Their stories about me always quote non existent unnamed sources. Very dishonest! _E_\nEbola has been confirmed in N.Y.C. with officials frantically trying to find all of the people and things he had contact with.Obama's fault _E_\nRemember I am the only one who is self funding my campaign. All of the other candidates are bought and paid for by special interests! _E_\nIt does not cost anything to dream. Spend your time enjoying your big dreams. Think Big _E_\nPeople are really unhappy with the endless security checks at the new World Trade Center. Durst is a terrible manager. Tenants furious! _E_\nCongrats to Senator McConnell and @TheTeaParty_net's Kellen Guida on yesterday's successful Tea Party Caucus __HTTP__ _E_\n\"Trump Brand Expands To South America: The Donald Lends His Name To Luxury Tower In Uruguay\" __HTTP__ via @Forbes _E_\nBlue Ribbon Commission to find and agree to future spending cuts? Bad idea. _E_\nThank you Georgia! 15000 amazing supporters tonight! Everyone get out & #VoteTrump tomorrow! #SuperTuesday __HTTP__ _E_\n.@GovernorPerry failed on the border. He should be forced to take an IQ test before being allowed to enter the GOP debate. 
_E_\nFriends of mine who are driving Cadillacs it is becoming a very hot car are raving about what a great job @Cadillac has done. _E_\nIt means so much to me receiving an endorsement from Phyllis Schlafly. A truly great woman & conservative. __HTTP__ _E_\n.@VanityFair magazine is doing so poorly that they make even @NYMag look good. Graydon Carter should've been fired a long time ago. _E_\nI just gave lots of money away at Trump Tower to people who needed it...they were very happy and appreciative! _E_\nHave a good chance to win Texas on Tuesday. Cruz is a nasty guy not one Senate endorsement and despite talk gets nothing done. Loser! _E_\nIf the government doesn't start working together the media is right & we will hit a fiscal cliff. We need to avoid this. _E_\nThe lobbyists & special interests have just put out an ad for Jeb which hits me just a little but is very false! _E_\nInto our first week of filming @ApprenticeNBC the Celebrities are already turning up the heat. Major fireworks! _E_\nEveryone is talking about the incredible event we had in Dallas last night. Spectacular crowd & arena! Thank you @mcuban. _E_\nVia @CNNPolitics by @teddyschleifer: Trump: San Francisco killing shows perils of illegal immigration __HTTP__ _E_\n.@Lexi Great job in winning your first of many majors . We are proud of you at Trump International. Work hard be an all time great! _E_\nFantastic job on @CNN tonight. @kayleighmcenany is a winner! @donlemon _E_\n.@MELANIATRUMP and I are looking forward to watching @AnnDRomney's speech tonight. She is an amazing woman who will be a great First Lady! _E_\nOPEC is setting crude at $94/barrel on 'signs US economy is improving.' OPEC uses any excuse to rip us off and our leaders just watch. _E_\nA ship is only as good as the people who serve on it — and the AMERICAN SAILOR is the BEST in the world. @USNavy #USSGeraldRFord __HTTP__ _E_\nObama can sign an illegal executive action anytime for ObamaCare but he can't fix the illegal loophole. 
_E_\nLove seeing union & non union members alike are defecting to Trump. I will create jobs like no one else. Their #Dem leaders can't compete! _E_\nJoin us tomorrow night in Charleston South Carolina! #SCPrimary #Trump2016 __HTTP__ _E_\nThey should close down Rolling Stone Magazine after the phony rape charge story. University of Virginia should sue them for big bucks! _E_\nIt was so great being in Nebraska last week. Today is the big day get out and vote! _E_\nThe Republican establishment out of self preservation is concerned w/ my high poll #'s. More concerned are Dems—I beat Hillary heads up! _E_\nMeeting with biggest business leaders this morning. Good jobs are coming back to U.S. health care and tax bills are being crafted NOW! _E_\nAndy Williams has died. He was a friend of mine and a great guy. _E_\nThat's Adrian in the elevator— he works at @TrumpTowerNY & he's got a lot of stories. #CelebApprentice _E_\nSo many people have told me that I should host Meet the Press and replace the moron who is on now. Just too busy especially next 10 years! _E_\nToday the Democrats lose big. But tomorrow the Republicans must communicate a positive pro growth agenda. _E_\nThe scum that gets high on badly hurting old ladies and others through knockout assaults wouldn't feel that way with a gun at their head! _E_\nIt was a great honor to welcome President Petro Poroshenko of Ukraine to the @WhiteHouse today with @VP Pence.... __HTTP__ _E_\n#TimeToGetTough presents bold solutions on taxes national security the debt dealing with OPEC and China and defeating @BarackObama. _E_\nMy interview with @HowardKurtz on #MediaBuzz will air tomorrow on @Fox at 11am and 5pm. Great job Howie very insightful. _E_\nThe Obama Economy workers added to disability and individuals added to food stamps more than doubles net jobs created __HTTP__ _E_\nI met Prince on numerous occasions. He was an amazing talent and wonderful guy. He will be greatly missed! _E_\nGreat new numbers. Thank you! 
#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nBe sure to watch my #CPAC2015 speech with intro by @DLoesch and a Q&A with @seanhannity __HTTP__ _E_\n#ElectionDay __HTTP__ __HTTP__ _E_\nUnions who secure the border oppose the amnesty bill __HTTP__ Their expert opinions should at least be listened to. _E_\nSo Obama and Congress can waste billions in Iraq & Afghanistan building roads & schools but can't get money to the NJ & NY Sandy victims? _E_\nRemember tonight's 8 o'clock episode of Celebrity Apprentice is the best ever—you will see nothing like it on tv. @ApprenticeNBC _E_\nMexico will pay for the wall 100%!#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nOur economy has had worst recovery under Obama since the Depression. Results of his policies speak for themselves. No new taxes! _E_\nRussian leaders are publicly celebrating Obama's reelection. They can't wait to see how flexible Obama will be now. _E_\nBob Corker gave us the Iran Deal & that's about it. We need HealthCare we need Tax Cuts/Reform we need people that can get the job done! _E_\n.@Franklin_Graham so many people have tweeted about your amazing words to me thank you! Heading to big crowd in South Carolina! _E_\nWatch the Miss Universe competition LIVE from the Bahamas Sunday 8/23 @ 9pm (ET) on NBC: __HTTP__ _E_\nGreece should get out of the euro & go back to their own currency they are just wasting time. _E_\nWow Twitter Google and Facebook are burying the FBI criminal investigation of Clinton. Very dishonest media! _E_\nThe Governor of Puerto Rico Ricardo Rossello is a great guy and leader who is really working hard. Thank you Ricky! _E_\nI will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House. _E_\nIn the UK taxpayers are wasting £24 million on wind farms that don't even operate. __HTTP__ They (cont) __HTTP__ _E_\nThanks. __HTTP__ _E_\nMy thoughts on the situation in Norway and Amanda Knox... 
__HTTP__ #trumpvlog _E_\nRT @Scavino45: .@POTUS @realDonaldTrump in the Oval Office w/senior U.S. military leaders prior to dinner hosted by the President & First L... _E_\nPresident Obama's weakness and indecision may have saved us from doing a horrible and very costly (in more ways than money) attack on Syria! _E_\nINTELLIGENCE INSIDERS NOW CLAIM THE TRUMP DOSSIER IS A COMPLETE FRAUD! @OANN _E_\n... but @billmaher is allowed to say that about me. _E_\nThe reporter who pulled back from his 14 year old never retracted story is having fun. I don't know what he looks like and don't know him! _E_\nConde Nast made a big mistake going into the World Trade Center. The place is a total disaster and I feel this is only the beginning! _E_\nRT @T_Lineberger: Thanks @IvankaTrump for coming to help win Michigan! More people here than a Hillary rally with less than 24 hours notice... _E_\nFor accurate reporting of my @CPACnews speech read @PoliticalTicker @Newsmax_Media @politico @HuffPostPol.... _E_\nNot good news for Jeb Bush __HTTP__ _E_\nMy shirts ties and suits are selling great @Macy's because they are the best and most stylish at a really reasonable price thanks! _E_\nIt is a great victory for NYC that A Rod will never wear pinstripes again. _E_\nConsumer spending fell in September __HTTP__ Another indicator the 7.8% unemployment number is cooked. _E_\nThe @nytimes was very nice in reporting that @CelebApprentice was #1 on all television for \"top brand impact 2012.\" Thank you! _E_\nGreat poll Florida thank you! #ImWithYou #AmericaFirst __HTTP__ _E_\nEven the SEALS who killed Bin Laden don't like @JoeBiden __HTTP__ _E_\nHappy Birthday @EricTrump! __HTTP__ _E_\nThank you Alex! __HTTP__ _E_\nWho is winning the debate so far (just last name)? #DemDebate _E_\nOf course the Australians have better healthcare than we do everybody does. ObamaCare is dead! But our healthcare will soon be great. _E_\nHas Pres. 
Obama or the White House told the public what happened in Algeria yet? Where's the media? _E_\nIt is amazing how @LindseyGrahamSC gets on so many T.V. shows talking negatively about me when I beat him so badly (ZERO) in his pres run! _E_\n\"The future is always beginning now.\" Mark Strand former Poet Laureate _E_\n\"Be sure you put your feet in the right place then stand firm.\" Abraham Lincoln _E_\nUnless you catch hackers in the act it is very hard to determine who was doing the hacking. Why wasn't this brought up before election? _E_\nRemember that things are cyclical so be resilient be patient be creative and remain positive. Think Like a Champion _E_\nTrump Golf Links at Ferry Point an 18 hole public golf course in the Bronx New York is opening soon! __HTTP__ _E_\nJoin me tomorrow in Des Moines Iowa with Vice President Elect @mike_pence at 7:00pm!#ThankYouTour2016 #MAGA... __HTTP__ _E_\nHow can FBI Deputy Director Andrew McCabe the man in charge along with leakin' James Comey of the Phony Hillary Clinton investigation (including her 33000 illegally deleted emails) be given $700000 for wife's campaign by Clinton Puppets during investigation? _E_\nIn war there is no substitute for victory. Douglas MacArthur _E_\nMogul Donald Trump has many powerful friends. And it turns out one of them is Anna Wintour.\" __HTTP__ via @FoxNews _E_\nThank you for all of your support! Let's #MakeAmericaGreatAgain! #Trump2016 __HTTP__ _E_\nThe Old Post Office building in Washington (D.C.) will soon be transformed into one of the great hotels anywhere in the world lots of jobs! _E_\nI want to do negative ads on John Kasich but he is so irrelevant to the race that I don't want to waste my money. _E_\nThank you Sacramento California! #MakeAmericaGreatAgain __HTTP__ _E_\n\"Ability is nothing without opportunity.\" Napoleon Bonaparte _E_\nPresident Obama was able to fool the Americans by getting elected but not able to fool Vladimir Putin. Too bad for us! 
_E_\nWill be on @SeanHannity tonight at 10pmE delivering an important speech live from Wisconsin. #MakeAmericaGreatAgain _E_\nWhy has all time hits leader Pete Rose paid a 20 year price whrn A Rod gets 200 game penalty. It's time to let Pete into The Hall of Fame! _E_\nGood Morning America is thrilled @Rosie is working for the @todayshow that means almost guaranteed success for @GMA _E_\nOur great VPE @mike_pence is in Louisiana campaigning for John Kennedy for US Senate. John will be a tremendous help to us in Washington. _E_\nHad a great time on @IngrahamAngle this morning. _E_\nGreat new poll numbers! Thank you for your support! #Trump2016 __HTTP__ _E_\n.@MattGinellaGC Thx for the nice story @TrumpDoral. Look forward to showing you Trump Int'l in Aberdeen in the spring & Turnberry plans. _E_\n#sweepstweet @teresa_giudice definitely fell under @lisalampanelli's negotiation skills—an important business tool. _E_\nVia @Newsmax_Media: Trump @oreillyfactor Make Up After Digs at Each Other __HTTP__ _E_\nI will be on @oreillyfactor tonight on @FoxNews at 8 PM and 11 PM. _E_\nI build beautiful websites with very smart and imaginative people for almost NOTHING. OUR GOVERNMENT SPENT ALMOST $535 000 000 for NOTHING _E_\nI am the only candidate (in many years) who is self funding his campaign. Lobbyists and $ interests totally control all other candidates! _E_\nISIS is on the run & will soon be wiped out of Syria & Iraq illegal border crossings are way down (75%) & MS 13 gangs are being removed. _E_\n. @BBCNews' child molestation sex scandal is the latest in continued downward spiral of BBC.I know personally they do not check for accuracy _E_\nAs a candidate I promised we would pass a massive tax cut for the everyday working Americans. If you make your voices heard this moment will be forever remembered as a great new beginning – the dawn of a brilliant American future shining with PATRIOTISM PROSPERITY AND PRIDE! 
__HTTP__ _E_\nVia @TIME by @lullintheaction: #REALTIME: Donald Trump Weighs a 2016 Run At #CPAC2015 __HTTP__ _E_\n.@stuartpstevens did a horrible job for Mitt—is a refund in order? Sadly Stuart is a disaster! _E_\nThe Republicans look so weak and foolish—what the hell are they doing? _E_\nEntrepreneurs: Set the bar high. Do the best you possibly can. Apply your skills and talent but above all be tenacious. _E_\nThe Trump Doctrine: Peace Through Strength. #Trump2016 __HTTP__ __HTTP__ _E_\nGreat win last night by Peyton Manning & @Denver_Broncos in San Diego coming from 24 points behind on the road. Very impressive. _E_\nWhat a great time we just had in the atrium of Tump Tower for __HTTP__ The place was happy and packed! _E_\nAdmiral McRaven had full operational control of the Bin Laden mission __HTTP__ @BarackObama gave vague directions. _E_\nThank you for the massive turnout tonight Cleveland Ohio! Get out & VOTE #TrumpPence16 on 11/8.Watch rally here:... __HTTP__ _E_\n.@DottieandBogey Thanks for nice comments over weekend re Turnberry. You and your husband have fantastic taste! Also great commentary. _E_\nCongratulations to @BretBaier on his five year anniversary as the anchor @SpecialReport. Brett is great! _E_\nWhen the stupid people start feeling sorry for the Boston killer and want to release him and give him medals remember the killings maimings _E_\nVirginia's highest rated wine by @WineEnthusiast @trumpwinery is inspired by the regions of Bordeaux & Champagne __HTTP__ _E_\nThe United States needs to fix its own problems of which there are many first! _E_\nThe hardest thing Clinton has to do is defend her bad decision making including Iraq vote e mails etc. _E_\nGreat pick by Buffalo Sammy Watkins will be GREAT! _E_\nGeneral John Allen who I never met but spoke against me last night failed badly in his fight against ISIS. His record = BAD #NeverHillary _E_\nOnly 36 days until the election. @MittRomney needs to stay on offense. 
Make Obama's terrible record the issue. #TimeToGetTough _E_\n.@IamStevenT visited me at @TrumpTowerNY what a great guy! __HTTP__ _E_\nTake a look at what happened w/ Bill Clinton. The system is totally rigged. Does anybody really believe that meeting was just a coincidence? _E_\nTrump's Menie golf resort enjoys bumper first year __HTTP__ via @TheScotsman _E_\nI look forward to meeting @joniernst today in New Jersey. She has done a great job as Senator of Iowa! _E_\n... By releasing his records he can come clean with the American people and have $5 million go to a charity. _E_\nWikiLeaks emails reveal Podesta urging Clinton camp to 'dump' emails. Time to #DrainTheSwamp! __HTTP__ _E_\nOur Southern border is totally out of control. This is an absolutely disgraceful. situation. __HTTP__ We need border security! _E_\nSaw @mcuban try to hit a ball in Lake Tahoe while I played in tournament he's got no talent or strength!!!! @TMZ _E_\n.@MattGinellaGC @GCMorningDrive Matt will be talking about Trump National Doral tomorrow A.M. Terrific guy looking forward to it! _E_\n.@antbaxter I predict somebody is going to sue you! _E_\nMany people are now saying I won South Carolina because of the last debate. I showed anger and the people of our country are very angry! _E_\nWhy does @KarlRove lie about his Reagan credentials? __HTTP__ He's a Bushie through and through. _E_\nOne good aspect of the Obama depression is that it will separate the winners from the losers. If you can make it now you deserve it! _E_\n\"Strong men have sound ideas and the force to make these ideas effective.\" Andrew Mellon _E_\n.@danabrams editor of @mediaite explained on radio this morning that I am so widely covered because I draw high interest. True! _E_\nMy interview from yesterday on Fox and Friends GOP Crazy If They Don't Get Everything They Want __HTTP__ _E_\nThanks! __HTTP__ _E_\n\"Don't be afraid of mistakes. 
They can be learning tools on the way to building something great for yourself.\" Think Like a Champion _E_\nThank you ARIZONA! This is a MOVEMENT like nobody has ever seen before. Together we are going to MAKE AMERICA SAFE... __HTTP__ _E_\nLyin' Ted I have already beaten you in all debates and am way ahead of you in votes and delegates. You should focus on jobs & illegal imm! _E_\nGovernment needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. #TimeToGetTough _E_\nThank you @GeraldoRivera @FoxandFriends. Agree! __HTTP__ _E_\nSleepy eyed @chucktodd thinks Las Vegas is a state see @todayshow this morning. _E_\nI opposed going into Iraq. Hillary voted for it. As with everything else she's supported it was a DISASTER. __HTTP__ _E_\nThe Keystone pipeline will create 20000 jobs and make us less energy dependent from the Middle East. @BarackObama says No! _E_\nGolf Odyssey just named Trump Scotland \"Golf course of the year.\" __HTTP__ _E_\nMarco Rubio was a complete disaster today in an interview with Chris Wallace @FoxNews concerning our invading Iraq.He was as clueless as Jeb _E_\nCLINTON'S CLOSE TIES TO PUTIN DESERVE SCRUTINY: __HTTP__ #VPDebate _E_\nCongratulations to @STEPHENATHOME I will see you on the show! _E_\nThe very foul mouthed Sen. John McCain begged for my support during his primary (I gave he won) then dropped me over locker room remarks! _E_\nShark Tank is a dead Friday night filler compared to the Apprentice which has been number one show for week in the T. V. ratings! _E_\nToday as we Remember Pearl Harbor it was an incredible honor to be joined with surviving Veterans of the attack on 12/7/1941. They are HEROES and they are living witnesses to American History. All American hearts are filled with gratitude for their service and their sacrifice. __HTTP__ _E_\nApprentice will be amazing tomorrow night! 
_E_\nVia @HuffPostPol: \"Donald Trump: 'Republicans May Be The Worst Negotiators In History'\" __HTTP__ _E_\nHillary Clinton is weak and ineffective no strength no stamina. _E_\nOnly those who will risk going too far can possibly find out how far one can go. T. S. Eliot _E_\nThe trade deficit rose to a 7yr high thanks to horrible trade policies Clinton supports. I will fix it fast JOBS! __HTTP__ _E_\nI will be interviewed on @foxandfriends with the legendary Coach Bobby Knight tomorrow morning. Enjoy! #INDPrimary __HTTP__ _E_\nThe road to success is always under construction. Arnold Palmer _E_\nPresident Xi thank you for such an incredible welcome ceremony. It was a truly memorable and impressive display! 📸 __HTTP__ __HTTP__ _E_\nGreat new ad from @CmteForIsrael: 'Next Year...President @MittRomney in Jerusalem the Capital of Israel' __HTTP__ _E_\nDrugs are pouring into this country. If we have no border we have no country. That's why ICE endorsed me. #Debate #BigLeagueTruth _E_\nCBO now estimates that over 2.5M will lose jobs directly because of ObamaCare. REPEAL now before it is too late. _E_\nThank you @thefix for your very honest commentary. One thing we do have great teams in IA NH SC and beyond. __HTTP__ _E_\n.@KarlRove wasted $400 million + and didn't win one race—a total loser. @FoxNews _E_\n‏.@richardroeper Perhaps one of the worst replacements in showbiz once you went on it was over! Your taste sucks! _E_\nDo you believe that @UnionLeader in NH was demanding ads? Look at enclosed letter from them just received: __HTTP__ _E_\nThe lights went out in New Orleans...the Country's lights went out also. We are not the same place! _E_\nThank you for today's endorsement New York Veteran Police Association! #NewYorkValues __HTTP__ __HTTP__ _E_\nNew Gravis Poll in NH just out: Trump 32% Carson 13% __HTTP__ _E_\nThe silent majority is silent no more! 
Remember the importance of VOTING!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nHear Donald Trump discuss big gov spending banks & taxes on Your World w/Neil Cavuto: __HTTP__ _E_\nWeekly Address from @WhiteHouse: __HTTP__ __HTTP__ _E_\nDespite firing @StephenBaldwin7 in last night's All Star Celebrity @ApprenticeNBC Stephen had strong overall performance this season _E_\nWe will defend our country protect our communities and put the safety of the AMERICAN PEOPLE FIRST! Replay: __HTTP__ __HTTP__ _E_\nWe are going to make our country so strong again so great again. No more ripping off the United States. We will MAKE AMERICA GREAT AGAIN! _E_\nReporters say it's the Trump Bump I tell CNBC I am buying stocks and the market goes up. _E_\nCongratulations to @piersmorgan on his new position as Editor at Large for the United States of @MailOnline! My Apprentice champ! _E_\nTurn on @oreillyfactor now and enjoy true brilliance! _E_\nAfter a great evening and packed auditorium in Iowa I am now in Colorado looking forward to what I am sure will be a very unfair debate! _E_\n... where he raised 2 million dollars for the wonderful kids. Eric has a great heart! _E_\nPutin just sent a Russian nuclear sub to the Gulf of Mexico. @BarackObama can't be bothered he is too concerned with @MittRomney's taxes. _E_\nThe Trump Organization Finalizes Purchase of Legendary Turnberry Resort in Scotland. __HTTP__ _E_\n\"A vampire with a day pass?\" We are in @THEGaryBusey land. #CelebApprentice _E_\nWatched Saturday Night Live hit job on me.Time to retire the boring and unfunny show. Alec Baldwin portrayal stinks. Media rigging election! _E_\nThe @Yankees should break A Rod's contract immediately—he misrepresented. _E_\n#Trump2016 #IACaucus Finder: __HTTP__ __HTTP__ _E_\nVery sad what happened last night at the Miss Universe Pageant. I sold it 6 months ago for a record price. This would never have happened! 
_E_\nI'll be on @gretawire tonight at 10 PM Fox News _E_\nNow that the election is over watch Chrysler ship @Jeep production to China my prediction. _E_\nTwo policemen just shot in San Diego one dead. It is only getting worse. People want LAW AND ORDER! _E_\nThe #1 trend on Twitter right now is #TrumpWon thank you! _E_\nYou must admit that Bryant Gumbel is one of the dumbest racists around an arrogant dope with no talent. Failed at CBS etc why still on TV? _E_\n\"In order to build your wealth and improve your business smarts you need to know about real estate.\" Think Like a Billionaire _E_\nI am going to Iowa today sold out crowds. People don't want our country ripped off anymore. Must stop now! _E_\nAs we told the @nydailynews I was asked to speak at the RNC but said no because I will be doing something much bigger just watch! _E_\nRT @FLOTUS: Preparations are underway to celebrate the holidays at the @WhiteHouse! __HTTP__ _E_\nAmazing my tweets are covered across every spectrum from @espn to @politico to @WSJ. _E_\n\"Offshore wind is a dead duck in Scotland and it's time Alex Salmond manned up stopped blaming Westminster (cont) __HTTP__ _E_\n\"You just can't beat the person who never gives up.\" – Babe Ruth _E_\n.@McIlroyRory Thanks for your nice note they love you at Trump National Doral. You are looking good will have a GREAT year! _E_\nYou are always there for us – THE MEN AND WOMEN IN BLUE.Thank you to our police thank you to our sheriffs and thank you to our law enforcement families. God Bless you all and GOD BLESS AMERICA! #LESM __HTTP__ _E_\nThe best social program by far is a JOB! Our jobs are being taken away from us by China and many other countries incompetent leader. _E_\nOn behalf of @FLOTUS Melania and myself thank you for a wonderful dinner and evening President Sergio Mattarella.... __HTTP__ _E_\nA person who never made a mistake never tried anything new. 
Albert Einstein _E_\nCarl Cameron @FoxNews is the only reporter I know who consistently fumbles & misrepresents poll results. He has been so wrong & he hates it! _E_\nRT @TeamTrump: Law enforcement officers bring communities together & keep us safe. @mike_pence & @realDonaldTrump RESPECT & stand by them!... _E_\nWorired that the USC will strike down ObamaCare @BarackObama is trying to implement his debacle in public schools __HTTP__ _E_\nTrump Int'l Hotel & Tower Toronto. #1 in all of Canada. __HTTP__ _E_\n.@BarackObama is promoting ugly inefficient unreliable bird killing noisy neighborhood destroying wind turbines. Big mistake. _E_\nIs Anthony Weiner also delusional? Add him to NY Sex Offender list instead! _E_\nJanuary 20th 2017 will be remembered as the day the people became the rulers of this nation again. _E_\nI said that Crooked Hillary Clinton is not qualified to be president because she has very bad judgement Bernie said the same thing! _E_\nIranian Pastor #Nadarkhani has just been sentenced to death by the Mullahs because he is a Christian (cont) __HTTP__ _E_\nTHE CHOICE IS CLEAR!#BigLeagueTruth #DrainTheSwamp __HTTP__ _E_\nAfter tearing W Bush down for 12 years now the media loves him. Why not? He gave them Obama. _E_\nIn my office with Banana Joe who just won the @WKCDOGS at @MSGnyc. __HTTP__ _E_\nCongratulations to @seanhannity on his great ratings and ratings increase as reported by the @AP today. Amazing job! _E_\nAlert...The president knew that the ambassador was being attacked in Benghazi. He did nothing...he is no leader. _E_\nJamiel Shaw was incredible on @foxandfriends this morning. His son who was viciously killed by an illegal immigrant is so proud of pop! _E_\nTo EVERYONE including all haters and losers HAPPY NEW YEAR. Work hard be smart and always remember WINNING TAKES CARE OF EVERYTHING! 
_E_\nMy @gretawire interview where I discuss the #ObamaCare USC argument gas prices & @IvankaTrump's new clothing line __HTTP__ _E_\nVia @TMZ_Sports: \"Donald Trump: Don't Mess Up @terrellowens' Name. 'I've Seen Him Go Crazy At People'\" __HTTP__ _E_\nPres. Obama's steady support of @Israel throughout this crisis helped stop the war. He did a good job. _E_\nMiami Dade Mayor drops sanctuary policy. Right decision. Strong! __HTTP__ _E_\nMacy's was very disloyal to me bc of my strong stance on illegal immigration. Their stock has crashed! #BoycottMacys __HTTP__ _E_\nAmy Pascal of Sony was totally used by Rev. Al Sharpton. She should be fired for stupidity. _E_\nMelania our great and very hard working First Lady who truly loves what she is doing always thought that \"if you run you will win.\" She would tell everyone that \"no doubt he will win.\" I also felt I would win (or I would not have run) and Country is doing great! _E_\nI hope we never find life on another planet because if we do there's no doubt that the United States will start sending them money! _E_\nIt is a MOVEMENT not a campaign. Leaving the past behind changing our future. Together we will MAKE AMERICA SAF... __HTTP__ _E_\nAfter @TrumpScotland I will visit @TrumpDoonbeg in Ireland the magnificent resort fronting on the Atlantic Ocean. _E_\n\"Tomorrow is the first blank page of a 365 page book. Write a good one.\" — @BradPaisley _E_\n#MakeAmericaSafeAgain!#GOPConvention #RNCinCLE __HTTP__ __HTTP__ _E_\nNational GOP Presidential Poll via @OANN @realDonaldTrump 35.6% #Trump2016 __HTTP__ _E_\nIntelligence agencies should never have allowed this fake news to leak into the public. One last shot at me.Are we living in Nazi Germany? _E_\nI was the first & only potential GOP candidate to state there will be no cuts to Social Security Medicare & Medicaid. Huckabee copied me. _E_\nRumor has it Apple is going to release iPhones with bigger screens. That's good news. 
_E_\nWashington needs common sense conservative solutions. Let's make America great again! __HTTP__ _E_\nMy new book Time To Get Tough will be out Dec 5th. Solutions you won't hear from the politicians. The bes... (cont) __HTTP__ _E_\nThank you to @IvankaTrump for her wonderful acknowledgement this morning on @foxandfriends... _E_\nThere are many Jonathan Gruber types selling the global warming stuff and they really do believe the American public is stupid. _E_\n__HTTP__ _E_\nGreat day in Virginia. Crowd was fantastic! _E_\nExcited for tomorrow's Politics & Eggs @saintanselm co hosted by @NECouncil & @nhiop. Live stream here __HTTP__ _E_\nLooking forward to meeting the great folks of Sarasota GOP party when I am honored as 'Statesman of the Year.' Should be a wonderful time. _E_\nThis assignment has stretched not just the imaginations but the patience quotas of @lisarinna and @pennjillette. #CelebApprentice _E_\nI started to get very worried about Mitt's chances when I heard that A Rod donated to his campaign. Everything A Rod touches turns bad. _E_\nSorry I won't be able to do @foxandfriends at 7 AM on Monday—will be in India. _E_\nVia @thehill by @HenschOnTheHill: \"Trump says US roads are 'falling apart'\" __HTTP__ _E_\nI will be on @foxandfriends at 7:00 there is much to talk about (sadly)! Enjoy! _E_\nThank you @megynkelly for the nice things you said about Melania. You will like her great heart and smart always wanting to help people! _E_\nSurprise In a post election delayed release food stamp rolls surged to biggest monthly increase and an all time high __HTTP__ _E_\nNamed best golf course in the world by @RobbReport Trump Int'l Golf Links Scotland is a 7400 yd par 72 __HTTP__ _E_\nI really enjoyed last night's Tele Town Hall with @ralphreed's Faith and Freedom Coalition. Thanks to the thousands who joined. _E_\nCongratulations to Bernie Marcus & Herman Cain @JobCreatorsUSA on the #TruthTour2012 All employers need to check this out! 
_E_\nWhy are we sending thousands of ill trained soldiers into Ebola infested areas of Africa! Bring the plague back to U.S.? Obama is so stupid. _E_\nI'll be on @Foxandfriends Monday at 7:30 AM. _E_\nWe're not talking about religion we're talking about security. #GOPDebate __HTTP__ _E_\nLooks like Obama will not stop the very potentially dangerous flights to and from West Africa. What the hell is wrong with this guy? _E_\nTHANK YOU to everyone in Little Rock Arkansas tonight! A record crowd of 12K. #Trump2016 __HTTP__ __HTTP__ _E_\nOn the luxurious Palos Verdes Peninsula @TrumpGolfLA features @GolfWorldUS' top public course & elite restaurants __HTTP__ _E_\nVia @kmovnewsfeed: Photos: Tour Donald Trump's NC golf club __HTTP__ _E_\n32º in New York it's freezing! Where the hell is global warming when you need it? _E_\nI am the only Republican who will get large numbers of Dems and Indies (crossover). I will also get states that no other Republican can get. _E_\n.@IvankaTrump is right—Plan B has descended into a state of total chaos. #CelebApprentice _E_\n\"George has a real twinkle about him\" says @TheRealMarilu. Really? The shark should be scared. #CelebApprentice _E_\nJust landed in New Hampshire a very exciting morning planned! _E_\n#AmericaFirst #ImWithYou __HTTP__ _E_\nWho do you like hate so far? _E_\nreleased by Intelligence even knowing there is no proof and never will be. My people will have a full report on hacking within 90 days! _E_\nThank you to Time Magazine and Financial Times for naming me Person of the Year a great honor! _E_\nRomney's failed advisors like campaign mgr Stuart Stevens are all over TV telling people how to win. But they lost don't know how to win! _E_\nIf we let Crooked run the govt history will remember 2017 as the year America lost its independence. 
#DrainTheSwamp __HTTP__ _E_\nVia @DMRegister by @SharynJackson: \"Trump: @SteveKingIA has 'the right views' __HTTP__ _E_\nThe 'brunt' of ObamaCare will be shouldered by folks making under $120K __HTTP__ _E_\nI would like to thank @GolfMagazine for the really nice review of Trump National Doral Best Renovation of the Year (and maybe all time). _E_\nMy motto is: 'Never give up.' I follow this very strictly. I do not let problems and challenges stop me they are normal. _E_\nWow @Politico is in total disarray with almost everybody quitting. Goodnews bad dishonest journalists! __HTTP__ _E_\nA great American Kurt Cochran was killed in the London terror attack. My prayers and condolences are with his family and friends. _E_\nNew York Fashion Week is really bad and used to be so glamorous and exciting! No stars no fun just boring. They need serious help. #NYFW _E_\nGlad to hear @BrentBozell @marklevinshow @EWErickson & @TPPatriots are standing up to @KarlRove's attack on the Tea Party. _E_\nThank you America! #Trump2016Via @DRUDGE_REPORT __HTTP__ _E_\nOur VISA system is broken like so much else in our country. We better get it fixed really fast. MAKE AMERICA GREAT AGAIN! _E_\nOur wonderful future V.P. Mike Pence was harassed last night at the theater by the cast of Hamilton cameras blazing.This should not happen! _E_\nTo be yourself in a world that is constantly trying to make you something else is the greatest accomplishment. Ralph Waldo Emerson _E_\n'Trump administration seen as more truthful than news media' __HTTP__ _E_\nYes I won the right to have my name taken off Trump Plaza in A.C. because it was not operated up to a very high standard and NO involvement _E_\nI am in Las Vegas at the best hotel (by far) Trump International. I will be working with my wonderful teams and volunteers to WIN Nevada! _E_\nVia @Newsmax_Media by @wandacarruthers: \"Trump: Baghdad Likely to Fall to ISIS\" __HTTP__ _E_\nVOTE #TrumpPence16 on 11/8/16! 
__HTTP__ _E_\nThe media must immediately stop calling ISIS leaders MASTERMINDS. Call them instead thugs and losers. Young people must not go into ISIS! _E_\n...want everything to be done for them when it should be a community effort. 10000 Federal workers now on Island doing a fantastic job. _E_\nObama has no understanding of how to create jobs or opportunity. He believes in Government. _E_\nIt was great seeing @MissUniverse and @MissTeenUSA yesterday __HTTP__ _E_\n.@BillMaher's show is great for helping me get to sleep better than Sominex. _E_\nIn light of the horrible attack in Nice France I have postponed tomorrow's news conference concerning my Vice Presidential announcement. _E_\n\"It's sad—truly sad and disgraceful—the way Obama has allowed America to be abused and kicked around (cont) __HTTP__ _E_\nThe Tea Party is filled with great Americans. Despite being mistreated by everyone including @GOP they will continue to fight on _E_\nA very big thank you to Bill Donohue head of The Catholic League for the wonderful interview on @CNN and article in Newsmax! Great insight _E_\nI told you! Premiums are soaring! #RepealObamacare #Trump2016 __HTTP__ _E_\nRT @glamourizes: @realDonaldTrump Only true Americans can see that president Trump is making America great. He's the only person who can! H... _E_\nWhen somebody challenges you fight back be tough! _E_\nCrooked Hillary Clinton knew that her husband wanted to meet with the U.S.A.G. to work out a deal. The system is totally rigged & corrupt! _E_\nWill be in Novi Michigan this Friday at 5:00pm. Join the MOVEMENT! Tickets available at: __HTTP__ __HTTP__ _E_\nI hope corrupt Hillary Clinton chooses goofy Elizabeth Warren as her running mate. I will defeat them both. _E_\nI will be in Huntsville Alabama on Saturday night to support Luther Strange for Senate. Big Luther is a great guy who gets things done! _E_\nI hope Washington makes a good deal to avert the fiscal cliff. Both sides need to work together. 
_E_\nPeople ask about @AmandaTMiller. She is actually a VP of Marketing at the Trump Organization. #CelebApprentice _E_\nWill be leaving Palm Beach for the 11 A.M. ceremony opening the magnificent GARY PLAYER VILLA at Trump Nationak Doral Miami. GARY IS GREAT! _E_\nReally enjoyed my interview with @marklevinshow. He is terrific! _E_\nThe phony lawsuit against Trump U could have been easily settled by me but I want to go to court. 98% approval rating by students. Easy win _E_\nGreat making keynote speech at 2014 Lincoln Day Dinner hosted by Dan Isaacs & NY Republican County Committee. Wonderful people! _E_\n.@katyperry is no bargain but I don't like John Mayer he dates and tells be careful Katy (just watch!). _E_\nCrooked Hillary Clinton is soft on crime supports open borders and wants massive tax hikes. A formula for disaster! _E_\nJoin me in Reno Nevada tomorrow at 3:30pm! #AmericaFirst #MAGATickets: __HTTP__ _E_\nThe media and establishment want me out of the race so badly I WILL NEVER DROP OUT OF THE RACE WILL NEVER LET MY SUPPORTERS DOWN! #MAGA _E_\nFLORIDA: Do not miss this opportunity to #MakeAmericaGreatAgain! Thank you @IvankaTrump: __HTTP__ __HTTP__ _E_\nOregon is voting today. Keep the big numbers going VOTE TRUMP! MAKE AMERICA GREAT AGAIN! _E_\nAn iconic building and top tourist attraction @TrumpTowerNY sets New York City's luxury standard __HTTP__ & great food! _E_\nLaura Massive crowd had to move to Phoenix Convention Center. __HTTP__ _E_\nWow sleepy eyes @chucktodd is at it again. He is do totally biased. The things I am saying are correct. far better vision than the others _E_\nNow that Iran ripped us off by making one of the best deals of any kind in history they have just moved to block any imports from the U.S. _E_\nIn today's #trumpvlog I answer your questions about what you should be doing in this uncertain economy... 
__HTTP__ _E_\nThank you to the BRAVE servicemen & women who have served and continue to serve the United States our true HEROES... __HTTP__ _E_\nFollow Trump @DoralResort's WGC @CadillacChamp leadership board here at @nbc's @GolfChannel __HTTP__ _E_\nLeadership: Whatever happens you're responsible. If it doesn't happen you're responsible. _E_\nAmazing! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nAccording to @pewresearch illegal immigrants favor Dems 8:1 __HTTP__ @GOP pushing amnesty. Do they have death wish _E_\nPutin's letter is a masterpiece for Russia and a disaster for the U.S. He is lecturing to our President.Never has our Country looked to weak _E_\n...now it's the \"greatest pageant on earth\" broadcast in 190 countries to 1 billion people—\"hot!\" _E_\neither elect more Republican Senators in 2018 or change the rules now to 51%. Our country needs a good shutdown in September to fix mess! _E_\nArticle from The Street The Donald's Trump Card: Himself __HTTP__ _E_\nRepublicans have the cards because of the debt ceiling—but it doesn't seem that way! _E_\nEven the Left realizes that @BarackObama's policies have led to more jobs being outsourced out of this country. __HTTP__ _E_\nHillary says things can't change. I say they have to change. It's a choice between Americanism and her corrupt globalism. #Imwithyou _E_\nDennis Rodman was either drunk or on drugs (delusional) when he said I wanted to go to North Korea with him. Glad I fired him on Apprentice! _E_\nWatch Kasich squirm if he is not truthful in his negative ads I will sue him just for fun! _E_\nThe best investors are visionaries—they look beyond the present. _E_\nYoung entrepreneurs – remember quality and results are the key metrics to success. _E_\nNext she says she's being set up by Omarosa to fail....is somebody confused? _E_\nThank you Congressman Steven Palazzo! __HTTP__ __HTTP__ _E_\nIt was wonderful to have President Petro Poroshenko of Ukraine with us in New York City today. 
#UNGA __HTTP__ __HTTP__ _E_\nSenator @LindseyGrahamSC made horrible statements about @SenTedCruz – and then he endorsed him. No wonder nobody trusts politicians! _E_\nMy @SquawkCNBC #TrumpTuesday interview discussing the 2012 election OPEC ripping us off & @MittRomney's job policy __HTTP__ _E_\nToyota & Mazda to build a new $1.6B plant here in the U.S.A. and create 4K new American jobs. A great investment in American manufacturing! _E_\nNo wonder @NYMag is doing so poorly with an idiot Sr. Editor like @DanAmira it will only get worse! _E_\nNone of Romney's leaked comments change the fact that Obama is a complete disaster. 20% real unemployment and $6T in deficit spending. _E_\nHillary is too weak to lead on border security no solutions no ideas no credibility.She supported NAFTA worst deal in US history. #Debate _E_\nWow! Honored to be chosen by the highly respected + accurate Washington & Lee Mock Convention. I hope you are right I will make you proud! _E_\nPres @BarackObama expects @MittRomney to play nice like @SenJohnMcCain it's not going to happen & the result is going to be much different. _E_\nWEEKLY ADDRESS __HTTP__ _E_\nNo money wasted like bad ads—the Republicans spent more & got nothing for it. _E_\nWatch this tour by @TrumpIntRealty's @M_Griffith1 of this luxurious penthouse in Trump Park Avenue __HTTP__ _E_\nAnother attack this time in Germany. Many killed. God bless the people of Munich. _E_\nCrooked Hillary wants to take your 2nd Amendment rights away. Will guns be taken from her heavily armed Secret Service detail? Maybe not! _E_\nRT @newtgingrich: Seems out of touch w/ reality to announce a VP nominee before securing 1237 delegates. __HTTP__ __HTTP__ _E_\n#VoteTrumpNH #NHPrimary #FITN __HTTP__ _E_\nI will be interviewed on @greta tonight at 7pm. Enjoy! __HTTP__ _E_\n.@MittRomney shouldn't give additional tax returns until @BarackObama gives his passport records college records & applications... 
_E_\nEvery American needs to say 2 simple words to every Vet they meet: THANK YOU! John Wayne Walding __HTTP__ _E_\nJob openings are at a 4 year high but businesses aren't hiring __HTTP__ Why? ObamaCare US debt & @BarackObama's tax plan. _E_\nSeems like the teams are surprised when @THEGaryBusey comes back. #CelebApprentice _E_\n\"There can be no liberty unless there is economic liberty.\" – The Iron Lady Margaret Thatcher _E_\nMy appearance on The View... __HTTP__ and __HTTP__ _E_\nIt is almost time. I will be making a major announcement from @TrumpTowerNY at 11AM. Follow on social media! #MakeAmericaGreatAgain _E_\nThe federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. _E_\nStock Market hits new Record High. Confidence and enthusiasm abound. More great numbers coming out! _E_\nSometimes people spend too much time focusing on problems instead of focusing on opportunities Think Like a Champion _E_\nReally big crowd expected tomorrow morning at # CPAC2013. I look forward to it! _E_\n#MakeAmericaGreatAgain #Trump2016 Story: __HTTP__ __HTTP__ _E_\nIf only speeches could create jobs then @BarackObama wouldn't have such a dismal economic record. _E_\n.@BradSteinle Great talking to you and your parents—fantastic people. Keep your sister's very important memory alive—big impact! _E_\n\"A savvy investor is a sponge for information. You have to read the newspapers... _E_\nPriorities: @BarackObama wants to slash a Trillion dollars from military spending while raising the salaries of (cont) __HTTP__ _E_\nI give the President's speech a 7 on the scale of 0 to 10! Not bad but room for improvement! _E_\nWhy was the Hanukah celebration held in the White House two weeks early? @BarackObama wants to vacation in Hawaii in late December. Sad. 
_E_\nIran must immediately allow Christian #PastorSaeed out of prison or we should put back sanctions (which should never have been lifted) _E_\nI was never a fan of Colin Powell after his weak understanding of weapons of mass destruction in Iraq = disaster. We can do much better! _E_\nI am in New Hampshire. Just received great news from Reuters poll. Thank you for your support! __HTTP__ _E_\nThe @FBIPressOffice police & others are doing an amazing job. How genius was it putting together that tape? _E_\nDoes anybody really believe that Bill Clinton and the U.S.A.G. talked only about grandkids and golf for 37 minutes in plane on tarmac? _E_\nMiss Universe 2012 Pageant will be airing live on @nbc & @Telemundo december 19th. Open invite stands for Robert Pattinson. _E_\nI will replace it with private plans health savings accounts & allow purchasing across state lines. Maximum choice & freedom for consumer. _E_\nWord is that @NBCNews is firing sleepy eyes Chuck Todd in that his ratings on Meet the Press are setting record lows. He's a real loser! _E_\nNow that China's own economy is slowing __HTTP__ watch how they start doing even bigger numbers in (cont) __HTTP__ _E_\nInsurgents in Iraq show they can still mount horrifying attacks US wastes trillions. _E_\nMy meetings with President Xi Jinping were very productive on both trade and the subject of North Korea. He is a highly respected and powerful representative of his people. It was great being with him and Madame Peng Liyuan! _E_\nFlorida has been very good to me. I am really esxcited to give back at the Sarasota GOP event and @RNC convention. Will be fun! _E_\nAll weights are on crane's wrong side very precarious below move out! _E_\nWe should leave Afghanistan immediately. No more wasted lives. If we have to go back in we go in hard & quick. Rebuild the US first. _E_\nGovernment can be efficient with the right leadership. 
Let's Make America Great Again __HTTP__ _E_\nVia @GolfweekMag by @BKleinGolfweek: \"Donald Trump reopens Doral's Blue Monster\" __HTTP__ _E_\nEntrepreneurs: Don't tread water. Get out there and go for it. _E_\nHillary said I really deplore the tone and inflammatory rhetoric of his campaign. I deplore the death and destruction she caused stupidity _E_\nRT @foxandfriends: HAPPENING TODAY: House to vote on immigration bills including 'Kate's Law' and 'No Sanctuary for Criminals Act' __HTTP__ _E_\nIf you want to know about Hillary Clinton's honesty & judgment ask the family of Ambassador Stevens. _E_\nThank you Rhode Island! #Trump2016 __HTTP__ _E_\n#ICYMI On Saturday I signed two EO's to help keep jobs & wealth in our country.EO1: __HTTP__ EO2:... __HTTP__ _E_\nKentucky has a chance to have the Senate Majority Leader Mitch McConnell representing it in Washington. Big power for State. Don't blow it _E_\nmassive increases of ObamaCare will take place this year and Dems are to blame for the mess. It will fall of its own weight be careful! _E_\nWhat truly matters is not which party controls our government but whether our government is controlled by the people. _E_\nWill be on @foxandfriends at 8:00. Enjoy! _E_\nWill be in Orlando Florida this afternoon. 25000 people expected. This is a movement like our GREAT COUNTRY has never seen before! _E_\nMy @Shalom_TV interview discussing my video endorsement of @IsraeliPM @netanyahu and past visits to @Israel __HTTP__ _E_\n...money to Bill the Hillary Russian reset praise of Russia by Hillary or Podesta Russian Company. Trump Russia story is a hoax. #MAGA! _E_\nWatched low rated @Morning_Joe for first time in long time. FAKE NEWS. He called me to stop a National Enquirer article. I said no! Bad show _E_\nOn my way to Iowa just received new national poll numbers. Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nVery productive bilateral meeting with Prime Minister Benjamin @Netanyahu of Israel in Davos Switzerland! 
#WEF18 __HTTP__ _E_\nMy thoughts on the Emmys in today's #trumpvlog.... __HTTP__ _E_\nWhy are we still giving billions of dollars we don't have in foreign aid to the Muslim Brotherhood in Egypt? _E_\nPoliticians are trying to chip away at the 2nd Amendment. I won't let them take away our guns! #Trump2016Watch: __HTTP__ _E_\n\"Do not give in to anger. It destroys your focus on goals and ruins your concentration.\" – Think Big _E_\nThe upcoming All Star @CelebApprentice puts the celebrities under the hardest tasks we have ever given. We really pushed the envelope _E_\nI hope Mark Zuckerberg signs a prenup with his current girlfriend perhaps soon to be wife. Otherwise she can walk away with 9 billion. _E_\nMy @bostonherald interview on Tom Brady Hillary Clinton the Granite State & Making America Great Again! __HTTP__ _E_\nVia @washingtonpost by @costareports: \"Trump says he is serious about 2016 bid is hiring staff and delaying TV gig\" __HTTP__ _E_\nOnly 88000 jobs were added this past March. Prediction was 190000. Businesses can't expand with Obama Care & high taxes on horizon. _E_\nJust left Columbus rally of 14000 people a far bigger crowd than even I expected! Unbelievable evening incredible spirit in the arena! _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nThank you. __HTTP__ _E_\nConsumer Confidence is at an All Time High along with a Record High Stock Market. Unemployment is at a 17 year low. MAKE AMERICA GREAT AGAIN! Working to pass MASSIVE TAX CUTS (looking good). _E_\nJoan Rivers on The Apprentice tonight at 8:00. I will be live tweeting. JOAN WAS GREAT! _E_\nAre you allowed to impeach a president for gross incompetence? _E_\nSleep eyes @ChuckTodd is killing Meet The Press. Isn't he pathetic? Love watching him fail! _E_\nVia @thehill by @HugginsRachel: \"Trump looking 'very seriously' at 2016 run\" __HTTP__ _E_\n\"Representing your own brand yourself is the best way to go. 
If you can't sell it who will?\" – Midas Touch _E_\nFor eight years Russia ran over President Obama got stronger and stronger picked off Crimea and added missiles. Weak! @foxandfriends _E_\nThe GOP needs to learn how to get tough and outnegotiate @BarackObama and his big spending allies in (cont) __HTTP__ _E_\nGood luck to Bob Kraft Tom Brady and Coach Bill Belichick tonight. _E_\nInternationally recognized as an iconic landmark @TrumpTowerNY beams over Fifth Avenue __HTTP__ _E_\nA great day at the White House! _E_\nIt is fatal to enter any war without the will to win it. Douglas MacArthur _E_\nVia @Mediaite by @evanmcmurry: \"Trump Calls @AGSchneiderman a Cokehead\" __HTTP__ Schneiderman is by his own admission! _E_\nNorth Korea has conducted a major Nuclear Test. Their words and actions continue to be very hostile and dangerous to the United States..... _E_\nMassive crowd in VT tonight. Venue not big enough. Officials say NO to outside event and sound system. Arrive early! _E_\nWow @CNN is so negative. Their panel is a joke biased and very dumb. I'm turning to @FoxNews where we get a fair shake! Mike will do great _E_\nVia @MailOnline Trump still in the lead by a whopping 14 points after fluke survey had put Carson on top __HTTP__ _E_\nRT @realDonaldTrump: On #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_\nRT @realDonaldTrump: HAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_\nGood luck! Enjoy. __HTTP__ _E_\nWith all of the bad economic numbers and horrendous foreign policy Obama should be down by 12 points and he's not. _E_\nWEEKLY ADDRESS __HTTP__ _E_\nA great ad from @MittRomney showing A Few of the 23 Million unemployed who need economic change __HTTP__ Take it to him Mitt! _E_\nHis @BarackObama's budget: interest payments to China will exceed US defense spending by 2019 __HTTP__ @BarackObama's America! 
_E_\nAngela Merkel is doing a fantastic job as the Chancellor of Germany. Youth unemployment is at a record low & she has a budget surplus. _E_\nNow with the Danger Weiner campaign dead time to focus on crazy Eliot Spitzer. A man who has never earned 10 cents in his life. _E_\nWhen ISIS caught the soldiers do you think they read them their legal rights prior to executing them? _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nWatch Obama's favorability numbers drop even further if he doesn't accept my charitable offer. No one approves (cont) __HTTP__ _E_\nHurricane Irene and Libya in today's #trumpvlog.... __HTTP__ _E_\nWow more than 90% of Fake News Media coverage of me is negative with numerous forced retractions of untrue stories. Hence my use of Social Media the only way to get the truth out. Much of Mainstream Meadia has become a joke! @foxandfriends _E_\nIf you stop by Trump Tower (Fifth Avenue between 56th and 57th Streets) you can buy a pre signed copy of #TimeToGetTough. _E_\nThank you to the amazing law enforcement officers today in Daytona Beach Florida! #LESM #MAGA __HTTP__ _E_\nBy rejecting my ad on ugly windmills & @AlexSalmond's faulty thinking on the \"Lockerbie bomber\" the ad is now on worldwide newscasts. _E_\nRT @GOPChairwoman: .@realDonaldTrump is the Paycheck President. Learn how the tax bill will put more money in your pocket & how to contact... _E_\nWill be going to Pennsylvania today in order to give my total support to RICK SACCONE running for Congress in a Special Election (March 13). Rick is a great guy. We need more Republicans to continue our already successful agenda! _E_\nBack by popular demand @TraceAdkins delivers in the upcoming @CelebApprentice All Stars season. Yes he sings. _E_\nJohn Kasich should focus his special interest money on building up his failed image not negative ads on me. _E_\nThis Sunday's All Star Celebrity @ApprenticeNBC features the return of @Joan_Rivers. Sunday at 9 PM on @NBC full 2 hours. 
_E_\nWisconsin we will MAKE AMERICA GREAT AGAIN! _E_\nJoin me live from Bedminster New Jersey: __HTTP__ _E_\nGetting ready to leave for Melbourne Florida. See you all soon! _E_\nVia @11AliveNews by @JenniferJJacobs: \"Trump heads to Iowa as '16 speculation rises\" __HTTP__ _E_\nAnytime you see a story about me or my campaign saying sources said DO NOT believe it. There are no sources they are just made up lies! _E_\nJust heard that the great Golf Week Magazine named my Trump International Golf Course Scotland The Best Modern Day Golf Course In The World! _E_\nWhy do the Republicans keep apologizing on the so called birther issue? No more apologies take the offensive! _E_\nOmarosa always promises and delivers high drama... _E_\nDebate polls look great thank you!#MAGA #AmericaFirst __HTTP__ _E_\nNow China is trying to take over a U.S. airbase __HTTP__ This is only the beginning. They only understand toughness! _E_\nA massive tax increase will be necessary to fund Crooked Hillary Clinton's agenda. What a terrible (and boring) rollout that was yesterday! _E_\nRT @Fuctupmind: @realDonaldTrump Donald Trump's amazing golf swing #CrookedHillary __HTTP__ _E_\nThe @rydercup is currently going on and is one of the truly great sporting events. _E_\nWhy is crude oil priced at $86/Barrel? OPEC is ripping us off. Not worth $30/Barrel. America needs new leaders. _E_\nI will be announcing my decision on the Paris Accord over the next few days. MAKE AMERICA GREAT AGAIN! _E_\nLeadership: the art of getting someone else to do something you want done because he wants to do it. Dwight D. Eisenhower _E_\nWhat took investigators so long to interview the pilots of Asiana San Fran crash? WHY NO DRUG TESTS FOR PILOTS they were really off . _E_\n.@Apprenticenbc cast will be announced tomorrow at 7:30am ET on the @todayshow with @MLauer _E_\nGary Johnson is asking people to waste their vote on him. Make it count vote for @MittRomney. 
_E_\nTHANK YOU to everyone who joined me at the @WhiteHouse yesterday. Together we are MAKING AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\n.@TheRealMarilu was very impressive and is a great person. The All Star Celebrity @ApprenticeNBC viewers loved her. _E_\n.@HillaryClinton has been part of the rigged DC system for 30 years? Why would we take policy advice from her? #Debates2016 _E_\n.@pennjillette and @dennisrodman as PM's I'm proud of Dennis and his performance this season. #CelebApprentice _E_\nGood luck @MittRomney tonight have no doubt you will be great. _E_\nWhitey Bulger's prosecution starts today. Will be one of the most interesting and intriguing trials. _E_\n\"Donald Trump unveils vision for @TrumpTurnberry\" __HTTP__ via @BunkeredOnline by @MMcEwanBunkered _E_\nThere should be no further releases from Gitmo. These are extremely dangerous people and should not be allowed back onto the battlefield. _E_\n\"You should always feel comfortable bargaining for goods and services. I do it all the time.\" – Think Like a Billionaire _E_\nReason I canceled my trip to London is that I am not a big fan of the Obama Administration having sold perhaps the best located and finest embassy in London for \"peanuts\" only to build a new one in an off location for 1.2 billion dollars. Bad deal. Wanted me to cut ribbon NO! _E_\nGreat Kevin McCarthy drops out of SPEAKER race. We need a really smart and really tough person to take over this very important job! _E_\nAfter Crooked @HillaryClinton allowed ISIS to rise she now claims she'll defeat them? LAUGHABLE! Here's my plan: __HTTP__ _E_\nDonald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion! _E_\nMexican leaders and negotiators are much tougher and smarter than those of the U.S. Mexico is killing us on jobs and trade. WAKE UP! _E_\nFort Hood shooting should be declared a terror attack. Respect the wounded and dead. 
_E_\nWisdom comes as a result of both experience and knowledge. It's something you can't teach someone else you have to achieve it on your own. _E_\nMy @SquawkCNBC interview discussing Jamie Dimon banking regulations and Mark Zuckerberg's prenuptial __HTTP__ _E_\nTrump Tees Up Another 'Hole in One' in Scotland __HTTP__ _E_\nIn Vegas? Enjoy Thanksgiving in @TrumpLasVegas' DJT lounge where the @nfl games will be playing all day __HTTP__ _E_\nWould be really bad if columnist Mike Lupica left the @NYDailyNews. A wonderful and talented guy! _E_\nTune in & join me live in Albany New York! 7pmE start time! I love you New York! #Trump2016 #TrumpTrain __HTTP__ _E_\n#ICYMI: I joined #OnTheRecord with @kimguilfoyle on @FoxNews this evening. #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nWhy is someone like George Pataki who did a terrible job as Governor of N.Y. and registers ZERO in the polls allowed on the debate stage? _E_\nThe West Coast's most luxurious public course @TrumpGolfLA features spectacular panoramic Pacific Ocean views __HTTP__ _E_\nTrump Int'l Hotel & Golf Links Ireland (formerly The Lodge at Doonbeg) is a 5 star resort fronting the Atlantic Ocean __HTTP__ _E_\n.@BarackObama should be careful questioning @MittRomney on diplomacy how many times has Obama apologized for our country on foreign soil?! _E_\nSince the Obama Administration was told way before the 2016 Election that the Russians were meddling why no action? Focus on them not T! _E_\nBenghazi is just another Hillary Clinton failure. It justnever seems to work the way it's supposed to with Clinton. _E_\n.@LisaLampanelli You are terrific (always). Great job on the Apprentice. _E_\nThe opening of #TrumpScotland an exciting day on perhaps the world's best golf course watch the video __HTTP__ _E_\nNBC terminates The Chris Matthews Show __HTTP__ _E_\n\"Get some face time in The Spa at @TrumpLasVegas\" __HTTP__ via @Vegascom by Renée Libutti _E_\nApprentice = big hit. Miss Universe = Big hit. 
I always get big ratings. If I hosted Meet the Press instead of Sleepy Eyesa smash! @NBCNews _E_\nRecord low temperatures and massive amounts of snow. Where the hell is GLOBAL WARMING? _E_\nI have traveled the world. America is the most beautiful country on Earth. _E_\nOur country is being run by total amateurs. Let's just call it \"amateur hour.\" _E_\nMedia silent when @BarackObama called @MittRomney a murderer & felon. Mitt mentions 'birth certificate' and they go nuts. Double standard! _E_\nMy @extratv interview before Hurricane Sandy explaining that I would be staying in Trump Tower during the storm __HTTP__ _E_\nRT @JacobAWohl: @realDonaldTrump President Trump alone has succeeded in bringing the Stock Market Small Business Index and Consumer Comfor... _E_\nThe real story here is why are there so many illegal leaks coming out of Washington? Will these leaks be happening as I deal on N.Korea etc? _E_\n#USA #Japan __HTTP__ _E_\nI will be signing copies of my new book Time To Get Tough: Making America #1 Again in Trump Tower on Frida... (cont) __HTTP__ _E_\nListen – my Citizens United Political Victory Fund robo call for @leezeldin __HTTP__ #zeldinforcongress _E_\nThird rate @politico took every negative tweet or response they could find & put it out when in fact the response is incredibly positive. _E_\nThank you @WayneAllynRoot.Very nice! #Trump2016 __HTTP__ _E_\nHappy #MedalOfHonorDay to our heroes! __HTTP__ __HTTP__ _E_\nReally sad that Republicans would allow themselves to be used in a Clinton ad. Lindsey Graham Romney Flake Sass. SUPREME COURT REMEMBER! _E_\nRead my tweets you dopes of course he should get a trial but fast (not a 12 year disaster). _E_\nAll Presidential candidates should immediately disavow their Super PAC's. They're not only breaking the spirit of the law but the law itself _E_\nI am on @oreillyfactor tonight a big special. @FoxNews at 8:00 P.M. ENJOY! _E_\n.@VanityFair's terrible piece on Mitt's faith is a new low even for them. 
_E_\nI'll be on Piers Morgan Tonight this evening 9 pm on CNN. Be sure to tune in. @PiersTonight _E_\nToday is the 53rd anniversary of the March on Washington today we honor the enduring fight for justice equality and opportunity. _E_\nCan the relationship between the mayor of New York City and the police force ever be fixed? Tune in to @foxandfriends at 7:15. _E_\nMany people advised me not to buy the Miss Universe pageant. They were all wrong. The deal worked out to be a great one! _E_\nPhotos from the @ApprenticeNBC press conference __HTTP__ Premieres January 4th on @NBC. _E_\nBe sure to listen to my interview today w/@SteveMTalk on @Newsmax_Media __HTTP__ Congratulations to Steve on his new show! _E_\nIf Democrats were not such obstructionists and understood the power of lower taxes we would be able to get many of their ideas into Bill! _E_\nThe people of Scotland love Trump International Golf Links. _E_\nThe failure of the Super Committee shows Washington has truly incompetent leaders. #TimeToGetTough _E_\nPractice positive thinking—this will keep you focused while weeding out anything that is unnecessary negative or detrimental... _E_\nWithout passion you don't have energy and without energy you don't have anything! _E_\n\"Worry destroys focus.\" – Think Big _E_\nDateline NBC featuring yours truly just set a season high in households in the ratings—no wonder NBC likes me so much! @nbc _E_\nA wonderful place. __HTTP__ _E_\n.@MarkBurnettTV and his incredible wife @RealRomaDowney did a fabulous movie @SonofGodMovie see it! _E_\nPeople (pundits) gave me no chance in South Carolina. Now it looks like a possible win. I would be happy with a one vote victory! (HOPE) _E_\nWow just won Missouri! _E_\nCome join us at the Verizon Wireless Center Manchester New Hampshire on 2/8! Register now: __HTTP__ __HTTP__ _E_\n#CrookedHillary __HTTP__ _E_\nIt seems that Justice Scalia originally wrote the majority on ObamaCare and Roberts then switched his position. 
__HTTP__ _E_\nInteresting how President Obama so haltingly said I would never be president This from perhaps the worst president in U.S. history! _E_\nDo not settle for remaining in your comfort zone. Being complacent is a good way to get nowhere. Get your momentum going and keep it going. _E_\n...The ads made her look great and now she probably will run. _E_\nYour most popular tweet answered why I'm holding off on a Presidential bid... __HTTP__ #trumpvlog _E_\nBig crowd expected tomorrow night in Iowa. It will be interesting and fun great people! _E_\nWow the ratings for @60Minutes last night were their biggest in a year very nice! _E_\n.@lancearmstrong teammate is angry and jealous he is no Lance. _E_\nAnother one of me on stage. #WWEHOF __HTTP__ _E_\nWatch my appearances on Good Day NY... __HTTP__ and @FoxandFriends... __HTTP__ _E_\nI'll be signing copies of my new book Time To Get Tough today at Trump Tower 11 am to 2 pm. Hope to see you there. #TimeToGetTough _E_\nIt's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_\nDoes anyone agree with Marilu that Gary while 'adorable' is a distraction? _E_\nBased on the fraud committed by Senator Ted Cruz during the Iowa Caucus either a new election should take place or Cruz results nullified. _E_\nRT @realDonaldTrump: Happy to announce we are awarding $1M to Las Vegas in order to help local law enforcement working OT to respond to l... _E_\nHamas has warned Pres. Obama not to visit the Temple Mount during his trip to Israel __HTTP__ _E_\nToday The Blue Monster is torn up. The Trump National @DoralResort is being revolutionized with $200M of renovations. 
_E_\nRT @Newsmax_Media: Donald Trump: Mean Spirited GOP Won't Win Elections @REALDonaldTrump __HTTP__ via @Newsmax_Media _E_\nFallout from Iowa: Trump Speech Drew Greatest Response __HTTP__ via @Newsmax_Media by Jim Meyers __HTTP__ _E_\nFiscal cliff negotiations have officially begun between the President and Congress Washington must come together and make a deal. _E_\nCongratulations to the @thenyrangers on taking a 2 1 lead over the @washcaps. Great game last night! _E_\nThe prestigious 800 acre @TrumpDoral boast luxurious event spaces and 5 Star restaurants __HTTP__ _E_\nRealize that persistence can go a long way. Being stubborn is often an attribute. _E_\nChina is threatening Washington over the currency bill. We should pass it immediately. _E_\nI am giving away money. Check the crowdfunding site @fundanything __HTTP__ Raise money for anything! _E_\nEdward Snowden is absolutely killing the the U.S. with other countries! _E_\nHillary will never reform Wall Street. She is owned by Wall Street! _E_\nOur country is on the precipice. Washington is broken. Where is the leadership? _E_\nThank you @IngrahamAngle for your strength & wonderful words last night on @FoxNews but @KarlRove is easy to beat! _E_\nI know it has been many years since our country made great deals but isn't it about time we start right now. MAKE AMERICA GREAT AGAIN! _E_\nRT @foxandfriends: Jeb is a weak guy. @EricTrump __HTTP__ _E_\n.@Morning_Joe @mikebarnicle on @realDonaldTrump: He finished 2nd but he made the turn successfully like a pro _E_\nSadly it took a hit & run auto accident to make us aware of who our Secretary of Commerce is and such an important position! _E_\nMy @gretawire interview discussing why @BarackObama is not a nice guy and who will win the 2012 election __HTTP__ _E_\n\"Do not pray for easy lives. Pray to be stronger men.\" – Pres. John F. Kennedy _E_\nUndecideds in OHPA and WI will make the difference. 
All should ask themselves if they want $6/gallon gas because it will come under Obama. _E_\nThe meeting next week with China will be a very difficult one in that we can no longer have massive trade deficits... _E_\nShows how weak and desperate Lyin' Ted is when he has to team up with a guy who openly can't stand him and is only 1 win and 38 losses. _E_\nJobless claims have dropped to a 45 year low! _E_\nHe @MittRomney would do a great job on Saturday Night Live. @nbcsnl _E_\nOur economy is better than it has been in many decades. Businesses are coming back to America like never before. Chrysler as an example is leaving Mexico and coming back to the USA. Unemployment is nearing record lows. We are on the right track! _E_\nRT @GOP: .@POTUS: I want to work with Congress Republicans and Democrats on a plan that is pro growth pro jobs pro worker and pro Amer... _E_\nMy thoughts and prayers are with everyone involved in the train accident in DuPont Washington. Thank you to all of our wonderful First Responders who are on the scene. We are currently monitoring here at the White House. _E_\n........may be their number one act and priority. Focus on tax reform healthcare and so many other things of far greater importance! #DTS _E_\nMark They could use you. __HTTP__ _E_\nHere we go Enjoy! _E_\n\"@PGAChampionship @seniorpgachamp both headed to Trump courses\" __HTTP__ via @FoxNews _E_\nWant to take a quiz with me? Download the @millonseconds app and watch @RyanSeacrest on Monday at 8/7c on @NBC _E_\nDoing the @todayshow with @MLauer was great I really like Matt. _E_\nIt wasn't the White House it wasn't the State Department it wasn't father LaVar's so called people on the ground in China that got his son out of a long term prison sentence IT WAS ME. Too bad! LaVar is just a poor man's version of Don King but without the hair. Just think.. _E_\nThank you for all of the positive response on my Chicago lawsuit victory yesterday. Most of you saw through the phony age card ploy. 
_E_\nOver 150000 more of our fellow Americans dropped out of the workforce in July. @BarackObama is a disaster! _E_\nRT @EricTrump: Nevada we are on our way! #VoteTrumpNV #Trump2016Caucus locator: __HTTP__ __HTTP__ _E_\n.@politico which is not read or respected by many may be the most dishonest of the media outlets and that is saying something. _E_\nBernie Sanders who has lost most of his leverage has totally sold out to Crooked Hillary Clinton. He will endorse her today fans angry! _E_\nBombings all over Iraq today.That country is falling apart such a horrible waste of lives and 1.5 trillion dollars (and I told you so!). _E_\nDON'T LET HILLARY CLINTON DO IT AGAIN!#TrumpPence16 __HTTP__ _E_\nExcited to have @SarahPalinUSA's endorsement of the Newsmax @iontv debate. Sarah is terrific. _E_\nThanks. __HTTP__ _E_\nThis has to stop! @BarackObama loves accruing American debt he missed his budget deficit goal by over $500 billion. __HTTP__ _E_\nI will be interviewed on @oreillyfactor tonight at 8:00 P.M. (Eastern). Enjoy! _E_\nRT @TeamTrump: Obama Clinton FAILED foreign policy: Bad nuclear deal Ransom payment to leading state sponsor of terror Sharing classifie... _E_\nWe are way over the fiscal cliff. And with Obama Care being fully implemented in less than 14 months it may be too late. _E_\nJust leaving Las Vegas. Unbelievable crowd! Many Hispanics who love me and I love them! __HTTP__ _E_\n\"Revenge is sweet and not fattening.\" Alfred Hitchcock _E_\nObamaCare will destroy small business the backbone of America's economy. _E_\n.@jimmykimmel is terrific but for Obama to fly on Air Force One ($'s) to do the show in these bad times is ridiculous. _E_\nThank you. __HTTP__ _E_\nVia @CNNPolitics: Trump will have 'memorable' role at GOP convention __HTTP__ It's true just wait and see... 
_E_\nThank you Cedar Rapids Iowa!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\n\"You cannot push anyone up the ladder unless he is willing to climb.\" Andrew Carnegie _E_\nCan you believe that President Karzai of Afghanistan is holding out for more more more and refuses to sign deal. Tell him to go to hell! _E_\n\"Iowans Drawn to Donald Trump Praise His Antiestablishment Bent\" __HTTP__ via @WSJ by @heatherhaddon & @reidepstein _E_\nWhat America needs: @MittRomney follows in steps of Kemp and Reagan with pro growth tax cut. _E_\n\"Do not underestimate yourself and know you are able to handle what comes your way.\" – Think Like a Champion _E_\nRT @realDonaldTrump: The travel ban into the United States should be far larger tougher and more specific but stupidly that would not be... _E_\nVia @ChristianPost @NaghmehAbedini to Testify at New Congressional Hearing on Persecution of Pastor Saeed Abedini __HTTP__ _E_\nCongrats to @GovernorCorbett he's right to be suing @NCAA over the ridiculous deal made by the trustees of Penn State __HTTP__ _E_\nThe Iranians have just threatened to send warships to our coasts. They laugh at us. We can't allow them to develop nuclear weapons. _E_\nThe U.S. rocket that blew up and crashed yesterday is emblematic of the United States under Obama. Nothing works be it a rocket or website. _E_\nIf you include people who have left the work force unemployment rate is 15%. Labor participation rate is lowest in 70 yrs. _E_\n#IndianaJones and #Ghostbusters what's wrong??? __HTTP__ _E_\nI along with almost everyone else have so little confidence in President Obama. He has a horrible attitude a man who is resigned to defeat _E_\nThank you Maryland what a great way to conclude the day! Will be back soon. 
#Trump2016 __HTTP__ __HTTP__ _E_\nSantorum calls Trump debate skippers hypocrites __HTTP__ @RickSantorum _E_\nVia @Citizens_United: \"Donald Trump To Speak At The Iowa Freedom Summit in Des Moines on January 24th\" __HTTP__ _E_\nI have tremendous respect for women and the many roles they serve that are vital to the fabric of our society and our economy. _E_\nRT @JacobAWohl: @realDonaldTrump When Obama was President the #MSM LOVED talking about stock market rallies! Now they barely mention new a... _E_\n....Transgender individuals to serve in any capacity in the U.S. Military. Our military must be focused on decisive and overwhelming..... _E_\nThanks. __HTTP__ _E_\nWatch this video of my wonderful golf club @TrumpNationalCN in beautiful Colts NeckNJ __HTTP__ _E_\nAfter one of the great chokes in the history of sports it will be hard for the Spurs to beat the Heat but who knows. Good game on now! _E_\nI like doing this once a month for the haters & losers (and as they know) I don't wear a wig . Some may not like my hairstyle but all mine _E_\nThe talk in Albany is that JCOPE & Moreland Commissions are taking my complaint against lightweight (cont) __HTTP__ _E_\nThe @SuperCommittee will fail. The Republicans never should have agreed to the debt deal. _E_\nChina's the leading exporter of Iraqi oil yet they won't lift a finger against ISIS. Why should we do the heavy lifting for China's gain? _E_\nTo each member of the graduating class from the National Academy at Quantico CONGRATULATIONS! __HTTP__ _E_\nRT @PaulaReidCBS: .@CBSNews confirms FBI found emails on #AnthonyWeiner computer related to Hillary Clinton server that are new & not p... _E_\n\"Change can't be measured in speeches. It is measured in achievements.\" @MittRomney yesterday in Fairfax VA. _E_\nPaying attention is a cost effective way of protecting yourself and your interests. 
_E_\nThis will be a very interesting day for HealthCare.The Dems are obstructionists but the Republicans can have a great victory for the people! _E_\nWhy can't the leaders of the Republican Party see that I am bringing in new voters by the millions we are creating a larger stronger party! _E_\n#ICYMI on Monday I had the great honor of welcoming India's Prime Minister @narendramodi to the WH. Full Remarks:... __HTTP__ _E_\nThe debate last night proved that Hillary is running against the \"B\" team. She won't be so lucky when it comes to me! _E_\nRT @GOP: On National #VoterRegistrationDay make sure you're registered to vote so we can #MakeAmericaGreatAgain __HTTP__ ht... _E_\nWhy is @BarackObama constantly issuing executive orders that are major power grabs of authority? This is the latest __HTTP__ _E_\nIsn't it intetesting that anybody who attacks President Obama is considered a racist by the real racists out there! _E_\nIf ObamaCare is not repealed then we can expect stagnant growth long term unemployment and record high premiums. _E_\n.@FoxNews owes me an apology for allowing clueless pundit @RichLowry to use such foul language on TV. Unheard of! _E_\nLightweight Senator Kirsten Gillibrand a total flunky for Chuck Schumer and someone who would come to my office \"begging\" for campaign contributions not so long ago (and would do anything for them) is now in the ring fighting against Trump. Very disloyal to Bill & Crooked USED! _E_\nIn New York March was the coldest month in recorded history we could use some GLOBAL WARMING! _E_\nChina will now pass our economy this year way ahead of projections. Pres. Obama – China's greatest asset! _E_\nVia @RealClearNews by @rebeccagberg: \"Is the White House Big Enough for Donald Trump?\" __HTTP__ _E_\nIt's Jan. 2. President Obama should end his vacation early & get back to Washington to straighten out the ObamaCare catastrophe or end it. 
_E_\nWow so far everyone running for office who I did a ROBOCALL for has taken the lead in the polls the smart pols know this. GREAT! _E_\nChallenges present opportunities. Always keep your focus and stay calm. _E_\nVia @Newsmax_Media by @spiccoli: \"Donald Trump Taking 'Serious Look' at 2016 Presidential Run\" __HTTP__ _E_\nIn beautiful Miami inspecting the progress of @TrumpDoral's $250 million conversion into the country's #1 resort. _E_\nPathetic: @BarackObama did not want to veto Keystone himself so he lobbied the Democrats in the Senate to defeat it. __HTTP__ _E_\nThoughts & prayers are w/ our @USNavy sailors aboard the #USSJohnSMcCain where search & rescue efforts are underway. __HTTP__ _E_\nGreat news in Georgia! The just out Landmark poll shows me in first with 43%! Wow. __HTTP__ __HTTP__ _E_\nTune in Sunday June 3 to NBC at 9pm ET for the 2012 Miss USA competition coming from Planet Hollywood Resort & Casino in Las Vegas _E_\nThank you Georgia! I appreciate all of your support. #Trump2016 __HTTP__ _E_\nCongratulations to Evan Lysacek for being nominated SI sportsman of the year. He's a great guy and he has my vote! #EvanForSI _E_\nMexico was just ranked the second deadliest country in the world after only Syria. Drug trade is largely the cause. We will BUILD THE WALL! _E_\nChris Wallace @fox at 10:00 A.M. _E_\n\"@limbaugh: 'Trump Has Changed the Entire Debate on Immigration'\" __HTTP__ via @Newsmax_Media by Jason Devaney _E_\n#TBT Trump and Gekko __HTTP__ _E_\nLooking forward to IA & WI with Gov. Pence tomorrow. Join us! #MAGA __HTTP__ __HTTP__ __HTTP__ _E_\nPhyllis Schlafly's Eagle Forum: 'National Review Will Be Defunct In The Next Year' __HTTP__ _E_\nPriorities while fundraising and campaigning on our dime Obama has skipped over 50% of his intel briefings __HTTP__ _E_\nVia @foxnewslatino by @GeraldoRivera: \"@ApprenticeNBC Diary: And Now There Are Two\" __HTTP__ _E_\nLeaving Miami Trump National Doral will be GREAT! 
_E_\nThe leader and negotiators representing Mexico are far smarter and more cunning than the leader and negotiators representing the U.S.! _E_\nMy son @EricTrump has just done another great event and raised a lot of money for @StJude. He is a really good boy who loves helping kids. _E_\nWatch the first #TrumpVine re: Anthony Weiner __HTTP__ _E_\nPraying for the families of the two Iowa police who were ambushed this morning. An attack on those who keep us safe is an attack on us all. _E_\nHuff post gets it wrong re: Ferry Point...the only leakage of gas is from Arianna Huffington. _E_\nToday it was my privilege to welcome survivors of the #USSArizona to the WH. Remarks: __HTTP__ __HTTP__ _E_\nThe biggest problem with A Rod is he is bad for the chemistry of the Yankees he must go. _E_\nOnly in America can a Jihadi thug who murdered women and children be nursed back to health & then get a @RollingStone cover. _E_\nFailed show @DannyZuker season 1 of @apprenticenbc had 28 million viewers and 41.5 million watching..... _E_\nObama believes Benghazi is a \"phony scandal.\" Nothing phony about Americans being killed by Islamists. _E_\nThe global warming we should be worried about is the global warming caused by NUCLEAR WEAPONS in the hands of crazy or incompetent leaders! _E_\nThank you Connecticut! #Trump2016 __HTTP__ _E_\nNumerous states are refusing to give information to the very distinguished VOTER FRAUD PANEL. What are they trying to hide? _E_\n.@carlosbeltran15 is playing great for St. Louis Cardinals. They made a wise decision. _E_\nCrooked Hillary Clinton is bought and paid for by Wall Street lobbyists and special interests. She will sell our country down the tubes! _E_\nUncomfortable looking NBC reporter Willie Geist calls me to ask for favors and then mockingly smiles when he is told of my high poll numbers _E_\n\"Face reality as it is not as it was or as you wish it to be.\" @jack_welch _E_\nGood news disloyal @Macys stock is in a total free fall. 
Don't shop there for Christmas! __HTTP__ __HTTP__ _E_\nAMERICA'S FUTURE __HTTP__ _E_\nJames Comey will be replaced by someone who will do a far better job bringing back the spirit and prestige of the FBI. _E_\n#GodBlessTheUSA __HTTP__ _E_\n...Hopefully we will never have to use this power but there will never be a time that we are not the most powerful nation in the world! _E_\nRep.Tom Marino has informed me that he is withdrawing his name from consideration as drug czar. Tom is a fine man and a great Congressman! _E_\n.@Boeing stock went way down because of 787 so I just bought stock in @Boeing great company! _E_\nIt was my great honor to have lunch with our INCREDIBLE U.S. and ROK troops at Camp Humphreys in South Korea. 🇰 __HTTP__ __HTTP__ _E_\nAnybody (especially Fake News media) who thinks that Repeal & Replace of ObamaCare is dead does not know the love and strength in R Party! _E_\nI have captured the smell of success. Meet me and the new Success @Macys Herald Square April 18 5:30pm first (cont) __HTTP__ _E_\nI was on a tele townhall with @TeamBachmann and hosted her 4 times in Trump Tower yet she declined the Newsmax @iontv debate. No loyalty. _E_\nAt least 12 dead and 50 wounded in Colorado bring back fast trials & death penalty for mass murderers & terrorists. _E_\nLyin' Ted Cruz and 1 for 38 Kasich are unable to beat me on their own so they have to team up (collusion) in a two on one. Shows weakness! _E_\n1.5M have already lost their health care plans thanks to ObamaCare __HTTP__ Defund now and Repeal later! _E_\nThe innocent bystanders of American poverty are kids. Yet two thirds of childhood poverty in America is (cont) __HTTP__ _E_\nI look forward to reading the @CommerceGov 232 analysis of steel and aluminum to be released in June. Will take major action if necessary. _E_\nNo I wasn't at the @Yankees game yesterday can't go today either. When I go they win. 
_E_\nIn order to #DrainTheSwamp & create a new GOVERNMENT of by & for the PEOPLE I need your VOTE! Go to __HTTP__ LET'S #MAGA! _E_\nCongratulations to the Republicans in Congress. You are the only people Obama can out negotiate. #TimeToGetTough _E_\nRT @PacificCommand: #USAF B 1B Lancer #bombers on Guam stand ready to fulfill USFK's #FightTonight mission if called upon to do so __HTTP__ _E_\nObama Administration official said they choked when it came to acting on Russian meddling of election. They didn't want to hurt Hillary? _E_\nLet us never negotiate out of fear but let us never fear to negotiate. John F. Kennedy Inaugural Address January 1961 _E_\nThe roads and sidewalks airports and bridges are perfect in Dubai. Everything looks clean & strong. In U.S. everything is falling apart! _E_\nMe voting it really is my hair! __HTTP__ _E_\nMeet me at @TrumpTowerNY and get your copy of my new book CRIPPLED AMERICA signed on 11/3 at 12pm! __HTTP__ _E_\nGreat move to take A Rod out of game. Now terminate his contract based on misrepresentation (drugs). _E_\nI am signing copies of my book CRIPPLED AMERICA. Order yours now makes a great holiday gift! __HTTP__ ... ... _E_\nNever met but never liked dopey Robert Gates. Look at the mess the U.S. is in. Always speaks badly of his many bosses including Obama. _E_\nCan you believe that the disrespect for our Country our Flag our Anthem continues without penalty to the players. The Commissioner has lost control of the hemorrhaging league. Players are the boss! __HTTP__ _E_\n\"The team with the best players wins.\" @jack_welch _E_\nIt takes guts to be a brand. You cannot be all things to all people if you want to be a brand. Midas Touch _E_\nThe secret of getting ahead is getting started. Mark Twain _E_\nThank you Arizona I love you! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nRT @mike_pence: .@EdWGillespie is fighting to grow the economy & cut taxes! He's fighting for a safer VA. And he's is fighting for affordab... 
_E_\nNo wonder @BBC is in such big trouble & boss was just fired they are lost. _E_\nJust took a look at Time Magazine looks really flimsy like a free handout at a parking lot! The sad end is coming just like Newsweek! _E_\nWhy is the United States Post Office which is losing many billions of dollars a year while charging Amazon and others so little to deliver their packages making Amazon richer and the Post Office dumber and poorer? Should be charging MUCH MORE! _E_\n'President Trump Congratulates Exxon Mobil for Job Creating Investment Program' __HTTP__ _E_\nThe rally in Cincinnati is ON. Media put out false reports that it was cancelled! #MakeAmericaGreatAgain #Trump2016 _E_\nJust met with the incoming Speaker of the Florida House @SteveCrisafulli – a fantastic guy! He will be a truly great leader. _E_\nI endorsed a book on ObamaCare & it just went to #2 on the New York Times bestseller list! _E_\nDruggie @AROD is now scheming to sue the @Yankees. He will go down as the biggest sports embarrassment of all time. _E_\nWe are now sending thousands of additional troops to Iraq to teach them how to fight they will run billions wasted! WHAT DOES U.S. GET? _E_\nMy @WendyWilliams appearance re Sony Atlantic City @ApprenticeNBC & 2016 __HTTP__ Always love going on Wendy's show! _E_\nThank you to our law enforcement officers! #LESM #Trump2016 __HTTP__ _E_\nHeading back to Washington after working hard and watching some of the worst and most dishonest Fake News reporting I have ever seen! _E_\n... but like many other great business people have used the laws to corporate advantage. _E_\nOh the wonders of the Arab Spring. Our new 'ally' the Muslim Brotherhood hosted Ahmadinejad yesterday __HTTP__ No more aid. _E_\nVia @MiamiNewTimes by @Munzenrieder : \"Doral Mayor Declares Emergency to Give Donald Trump Key to the City\" __HTTP__ _E_\nDummy Bill Maher did an advertisement for the failing New York Times where the picture of him is very sad he looks pathetic bloated & gone! 
_E_\n\"I'm not afraid of failing. I don't like to fail. I hate to fail. But I'm not afraid of it.\" @VinceMcMahon _E_\nRosie O'Donnell's show is dead can't keep going for long with such poor ratings. @Rosie is a stone cold (cont) __HTTP__ _E_\nThere is great unity in my campaign perhaps greater than ever before. I want to thank everyone for your tremendous support. Beat Crooked H! _E_\nMexico is killing the United States economically because their leaders and negotiators are FAR smarter than ours. But nobody beats Trump! _E_\nNegotiation tip: Be patient be persistent be stubborn. Know exactly what you want and keep it to yourself. _E_\nDjango Unchained is the most racist movie I have ever seen it sucked! _E_\nThank you America great #CommanderInChiefForum polls! __HTTP__ _E_\n.@MarcoRubio is weak on illegal immigration and will allow anyone into the country..... _E_\nThank you Florida we are going to MAKE AMERICA GREAT AGAIN! Join us: __HTTP__ #AmericaFirst __HTTP__ _E_\nProblem is that the acting head of the FBI & the person in charge of the Hillary investigation Andrew McCabe got $700000 from H for wife! _E_\n...conquests how brave he was and it was all a lie. He cried like a baby and begged for forgiveness like a child. Now he judges collusion? _E_\nVia @rcpvideo: \"Donald Trump on Who He Likes For President: Donald Trump\" __HTTP__ _E_\nU.S. jobless claims are at a 2 month high. __HTTP__ @BarackObama's gas policy and ObamaCare are directly killing jobs. _E_\nGreat parade in The Villages I love you all. We will #MAGA. Thank you for the incredible support I will not forget! __HTTP__ _E_\nIt is so sad to see what has happened to Atlantic City. So many bad decisions by the pols over the years airport convention center etc. _E_\nTerrible economic numbers released today. US GDP only grew 0.4% during Oct Dec 2012 quarter __HTTP__ Great news for China. _E_\nCongrats @Jean_GeorgesNYC for being named the 6th best hotel restaurant in the world! 
__HTTP__ _E_\nA great article by @NolteNCspelling out the truth on Mexico trade the border & illegals. Thank you @BreitbartNews __HTTP__ _E_\nBig crowds standing ovations in South Carolina MAKE AMERICA GREAT AGAIN! _E_\n'Donald Trump is already helping the working class' __HTTP__ _E_\nRT @mike_pence: Join me in Colorado today! Look forward to seeing you!Denver 2pm __HTTP__ Springs 6pm __HTTP__ _E_\nHe should be ignored: @RonPaul's foreign policy is a dream come true for our enemies. He has zero chance to beat @BarackObama. _E_\nIt amazes me that other networks seem to treat me so much better than @FoxNews. I brought them the biggest ratings in history & I get zip! _E_\nHere we go again with another Clinton scandal and e mails yet (can you believe). Crooked Hillary knew the fix was in B never had a chance! _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nMy @seanhannity interview where I discuss @BarackObama's Job Council @RealSheriffJoe's investigation & 2012 election __HTTP__ _E_\nObama is our unlucky President. Everything he touches turns into a mess. Some people just don't have it! _E_\nHAPPY NEW YEAR! We are MAKING AMERICA GREAT AGAIN and much faster than anyone thought possible! _E_\nYou need to overcome the tug of people against you as you reach for high goals. General George S. Patton _E_\n.@Morning_Joe can you believe Kasie Hunt's poor and purposely inaccurate reporting on my great night and crowd in Iowa. @politico is a scam! _E_\n.@timkaine is the ANTI DEFENSE SENATOR. #VPDebate #BigLeagueTruth __HTTP__ _E_\n.@EWErickson ran @RedState into the ground. A change was necessary. Congratulations to @RedState and good luck in the future! _E_\nDwight Howard just signed with Houston. _E_\nDoesn't dummy @chucktodd realize that when I considered running for president I filed financial papers showing unbelievable numbers. _E_\nJust left West Palm Beach Fire & Rescue #2. Met with great men and women as representatives of those who do so much for all of us. 
Firefighters paramedics first responders what amazing people they are! _E_\n.@TrumpTurnberry's 149 award winning guest rooms offer a perfect blend of Edwardian tradition and timeless design __HTTP__ _E_\nJust another desperate move by the man who should have easily beaten Barrack Obama. (2/2) _E_\nMarble mouth @tombrokaw asks why do we think to have a successful eveving you have to have Donald Trump as your guest of honor? BORING TOM _E_\nDon't believe the main stream (fake news) media.The White House is running VERY WELL. I inherited a MESS and am in the process of fixing it. _E_\nGas prices are soaring. $4.12 in CA. OPEC is laughing at how stupid we are. _E_\nWe just finished shooting a new season of Celebrity Apprentice and happily for all Joan plays my advisor in two episodes. She was great! _E_\nJeb's new slogan Jeb can fix it . I never thought of Jeb as a crook! Stupid message the word fix is not a good one to use in politics! _E_\nI am a very calm person but love tweeting about both scum and positive subjects. Whenever I tweet some call it a tirade..totally dishonest! _E_\nThank you Carmel Indiana! Get out & #VoteTrump tomorrow! #INPrimary #MakeAmericaGreatAgain __HTTP__ _E_\nThank you to Brandon Judd of the National Border Patrol Council for his strong statement on @foxandfriends that we very badly NEED THE WALL. Must also end loophole of \"catch & release\" and clean up the legal and other procedures at the border NOW for Safety & Security reasons. _E_\nGreat bilateral meeting with President @Alain_Berset of the Swiss Confederation as we continue to strengthen our great friendship. Such an honor to be in Switzerland! #WEF18 __HTTP__ _E_\nGloria Allred is always talking about me. She needs publicity. She is by far a better PR agent than lawyer. _E_\nReporting that Orlando killer shouted Allah hu Akbar! as he slaughtered clubgoers. 2nd man arrested in LA with rifles near Gay parade. _E_\nI will be interviewed on @foxandfriends this morning at 7:30. 
So much to talk about! _E_\n.@AP has one of the worst reporters in the business @JeffHorwitz wouldn't know the truth if it hit him in the face. _E_\nslaughter you. This is a purely religious threat which turned into reality. Such hatred! When will the U.S. and all countries fight back? _E_\nAt some point Sgt. Bergdahl will have to explain his capture. In 2009 he simply wandered off his base without a weapon. Many questions! _E_\n.@latoyajackson informs @ArsenioHall that @Omarosa is a \"conniving witch\"—is he surprised? Are we surprised? #CelebApprentice _E_\nVia @amspec by Jeffrey Lord: Is Eric Schneiderman a Crook? What a great writer & researcher amazing story. __HTTP__ _E_\nThe Republicans once again hold all the cards with the debt ceiling. They can get everything they want. Focus! _E_\nGreat advice from my mother: \"Trust in God and be true to yourself.\" – Mary MacLeod Trump _E_\nAn architectural landmark @TrumpTowerNY offers sweeping panoramic views of Fifth Avenue __HTTP__ _E_\nTrump: I Love the Tea Party They Love Me __HTTP__ via @Newsmax_Media (cross posted on @foxnation __HTTP__ _E_\nIt snowed over 4 inches this past weekend in New York City. It is still October. So much for Global Warming. _E_\nI am not available to be in @adamcarolla's new movie #RoadHard.bit.ly/roadhardmovie _E_\nOffering top amenities along w/ award winning architectural design @TrumpChicago's condominiums are world class __HTTP__ _E_\nSomething really bad happened to the @Yankees psyche much like our President! _E_\nThank you @SenJohnMcCain for your kind remarks on the important issue of PTSD and the dishonest media. Great to be in Arizona yesterday! _E_\nThank you Jacob! __HTTP__ _E_\nDummy @Clare_OC @Forbes: Tiny fragrance deal with Parlux means nothing. Still sold at Trump Tower... _E_\n#StandForOurAnthem _E_\nSo many veterans groups are beyond happy with all of the money I raised/gave! It was my great honor they do an amazing job. 
_E_\nNow is the time for the @GOP to be united with the mission of electing @MittRomney this November. Stop with the public divisions. _E_\nI really enjoyed the debate last night.Crooked Hillary says she is going to do so many things.Why hasn't she done them in her last 30 years? _E_\nPrediction: The disaster known as ObamaCare will only get worse and Republicans will gain far greater power than they have had in years! _E_\nFMR PRES of Mexico Vicente Fox horribly used the F word when discussing the wall. He must apologize! If I did that there would be a uproar! _E_\nUnbelievable crowd in Dallas! __HTTP__ _E_\nBruce Willis wearing my hat on @FallonTonight last Friday __HTTP__ _E_\nVia @ BreitbartNews by @BobPriceBBTX: \"DONALD TRUMP HEADING TO TEXAS BORDER\" __HTTP__ _E_\nBeing good in business is the most fascinating kind of art. Making money is art & working is art & good business is the best artAndy Warhol _E_\n.@oreillyfactor bad and very deceptive journalism. Show must be heading in wrong direction too bad! @SarahPalinUSA _E_\nCongratulations! 'First New Coal Mine of Trump Era Opens in Pennsylvania' __HTTP__ _E_\nYou have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_\nIf you have built castles in the air your work need not be lost that is where they should be. Now put the foundations under them. Thoreau _E_\n\"Sixteen\" @TrumpChicago is winning accolades and is a destination point restaurant—don't miss it! _E_\nMy beautiful daughter Ivanka just had a healthy baby boy. Jared and Ivanka are very proud! _E_\nWow NATO's top commander just announced that he agrees with me that alliance members must PAY THEIR BILLS. This is a general I will like! _E_\n.@WSJ reports that @GOP getting ready to treat me unfairly—big spending planned against me. That wasn't the deal! 
_E_\n\"The four page memo released Friday reports the disturbing fact about how the FBI and FISA appear to have been used to influence the 2016 election and its aftermath....The FBI failed to inform the FISA court that the Clinton campaign had funded the dossier....the FBI became.... _E_\nIn all fairness to Anthony Scaramucci he wanted to endorse me 1st before the Republican Primaries started but didn't think I was running! _E_\nAMERICA USED TO BE THE LEADER OF THE WORLD. THANKS TO OBAMA AMERICA ISN'T EVEN LEADING FROM BEHIND. _E_\n.@SanDiegoPD Fantastic job on handling the thugs who tried to disrupt our very peaceful and well attended rally. Greatly appreciated! _E_\nThere has been a systematic targeting of the Tea Party by the Obama administration. Now Schneiderman goes after me. No coincidence. _E_\nOur country needs a president with great leadership skills and vision not someone like Hillary or Barack neither of which has a clue! _E_\n\"Think of yourself as a one man army. You're not only the commander in chief you're the soldier as well.\" – Think Like a Billionaire _E_\nI still love Derek he is a winner! _E_\nIt's true—@dennisrodman gets the comeback of the year award. I didn't like having to fire him. #CelebApprentice _E_\n... Apprentice was #1 among ABC CBS and NBC from 10:30 11 p.m. in all key demos (adults men and women 18 34 18 49 and 25 54) Nielsen. _E_\nJoin me at 11:00am:Watch here: __HTTP__ __HTTP__ _E_\nThank you for your support & friendship Governor @ChrisChristie!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nObama will quarantine all soldiers returning from Africa for 21 days. But he still allows all who contract Ebola into country? Hypocrite. _E_\n'Over 250000 to Lose Health Insurance in Battleground North Carolina Due to #Obamacare' __HTTP__ _E_\nNo one has done more for people with disabilities than me. I have spent many millions of dollars to help out and am happy to have done so! _E_\n...but interestingly the same people seem to be lucky. 
_E_\n.@RobinRoberts everyone adores you including me get well fast! _E_\nIt's Thursday. @billmaher is still a very dumb guy just look at his past. _E_\nCongratulations to Doug Jones on a hard fought victory. The write in votes played a very big factor but a win is a win. The people of Alabama are great and the Republicans will have another shot at this seat in a very short period of time. It never ends! _E_\nHe @BarackObama will lose a delegate in Oklahoma he only got 57% of the vote in the Democrat primary __HTTP__ _E_\nJoin me in Fayetteville North Carolina tomorrow evening at 6pm. Tickets now available at: __HTTP__ _E_\nIf Graydon Carter's very dumb bosses would fire him for his terrible circulation numbers at failing Vanity Fair his bad food restaurants die _E_\nWill be going to North Dakota today to discuss tax reform and tax cuts. We are the highest taxed nation in the world that will change. _E_\nJohn Podesta says nominee will be Cruz b/c last person Hillary wants to face is Trump! Use your head folks! 46 41! __HTTP__ _E_\nDon't forget to tune in to the Celebrity Apprentice this Sunday night 9 pm on NBC. The fireworks continue.... __HTTP__ _E_\nSo many major problems for the U.S. and no answers by our leaders. When will it all change? Many of our difficulties are so easy to solve! _E_\nNow that I started my war on illegal immigration and securing the border most other candidates are finally speaking up. Just politicians! _E_\nJust returned from Europe. Trip was a great success for America. Hard work but big results! _E_\nIran just test fired a Ballistic Missile capable of reaching Israel.They are also working with North Korea.Not much of an agreement we have! _E_\nCongratulations @Jean_GeorgesNYC for 10 years of 3 #MichelinStars! Visit the restaurant in @TrumpNewYork for a meal you'll never forget. _E_\nI will be using Facebook and Twitter to expose dishonest lightweight Senator Marco Rubio. 
A record no show in Senate he is scamming Florida _E_\nWho are your favorites on Team Power? Team Plan B? #CelebApprentice _E_\nMr. Pesident @BarackObama you cannot attack free enterprise and expect to have a healthy economy! _E_\nWhen will the Democrats and Hillary in particular say we must build a wall a great wall and Mexico is going to pay for it? Never! _E_\nJoin me Tuesday Nov. 3rd at 12pm in Trump Tower NYC. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_\nBig time in U.S. today MAKE AMERICA GREAT AGAIN! Politicians are all talk and no action they can never bring us back. _E_\nJust spent two days in Ireland at Trump International Golf Links & Hotel absolutely magnificent. __HTTP__ _E_\nBoth being optimistic and remembering the big picture have served me well throughout my life. You need to stay positive. _E_\nRT @Harlan: Watching MSM you would have no idea @realDonaldTrump clearly unambiguously & repeatedly condemned the bigotry & violence in Ch... _E_\n.@ErinBurnett's @OutFrontCNN ratings are so pathetic she even loses to @hardball_chris at 7PM which is replay of 5PM __HTTP__ _E_\nFour more days until the Miss Universe Pageant. Be sure to tune in on Monday night at 9 p.m. on NBC it will be an amazing show. _E_\nTo aspiring entrepreneurs always remember that if your enemies aren't talking about you then you aren't doing well...and must work harder! _E_\nMichele Bachmann will finish dead last tonight in Iowa because she is disloyal and a terrible boss. Sadly it is over for Michele. _E_\nOnce again @LilJon has competed at a very high level on Celebrity All Star @ApprenticeNBC. He is a great competitor. _E_\nWhile I am in OH & PA you can also join @Mike_Pence in Nevada on Mon!Carson City: __HTTP__ __HTTP__ _E_\nGreat! __HTTP__ _E_\nRT @IBDeditorials: Was Barack Obama A Foreign Exchange Student? 
__HTTP__ _E_\nNATO commander agrees members should pay up via @dcexaminer: __HTTP__ _E_\nDoesn't the US have better things to do than to destroy an American hero for the world to see? Now other (cont) __HTTP__ _E_\nJoin me live at the @WhiteHouse. __HTTP__ __HTTP__ _E_\nVia @TVGrapevine: \"@ApprenticeNBC: Premieres Sunday January 4 2015\" __HTTP__ _E_\nWith Obama and Bernanke destroying the value of the dollar gold and real estate should continue to rise in value. _E_\nChina's best friend @BarackObama wants to cut the US fleet down to 230 ships the lowest level since WWI. __HTTP__ _E_\nThey let Crooked & the Gang off the hook for the crime but it looks like the cover up is just as bad. Unbelievable! __HTTP__ _E_\nThe Federal Government is teaching citizens 'Financial Literacy' while it is running $16T in debt __HTTP__ Only in America! _E_\nEntrepreneurs: Paying attention can be a cost effective way of protecting yourself. _E_\nRT @RealBenCarson: I endorse @realDonaldTrump. It's time to unite behind the candidate who will beat Hillary Clinton and return government ... _E_\nGreat to hear that our loyal @CelebApprentice fans are happy with today's announcement of the new cast. This will be something special! _E_\nWe can't even stop the Norks from blasting a missile. China is laughing at us. It is really sad. _E_\nThanks for all of the accolades on my speech today it's all about the truth! _E_\nChange has to come from outside our very broken system. #MAGA __HTTP__ _E_\nThank you to our amazing law enforcement officers! #AmericaFirst __HTTP__ _E_\nJoin me in North Carolina tomorrow at 7:30pm! #ImWithYou Tickets: __HTTP__ _E_\nThe truth is that we could have much better healthcare in our country at a much more affordable price everyone in U.S. would benefit! _E_\nUh oh... @OMAROSA & @piersmorgan once again reunite in the Board Room in next week's 'All Star' @ApprenticeNBC. Fireworks! _E_\n6 @TrumpCollection hotels made @CNTraveler reader's choice! 
@TrumpNewYork @TrumpSoHo @TrumpChicago @TrumpToronto @TrumpPanama @Trump_Ireland _E_\n.@alexsalmond RT @King_Pepp Driving through Indiana and seeing tons of ugly windmills. Now I know what @realDonaldTrump is talking about _E_\nMy @gretawire interview discussing my $5M charitable offer to Obama his lack of transparency & my tremendous support __HTTP__ _E_\nEARLY VOTING: MN & IA already underway more states coming up in the next week: OH ME AZ IN — check w/local officials for details & VOTE! _E_\nSo many signs that the Florida shooter was mentally disturbed even expelled from school for bad and erratic behavior. Neighbors and classmates knew he was a big problem. Must always report such instances to authorities again and again! _E_\nMy comment last March \"Anthony Weiner is a sick pervert you think he will change? He will never change.\" __HTTP__ _E_\nAgain the story that there was collusion between the Russians & Trump campaign was fabricated by Dems as an excuse for losing the election. _E_\nVia @examinercom by @Mellyora13: \"Trump: Was Benghazi the result of incompetence or something more sinister?\" __HTTP__ _E_\nWith the economy still on a downward trajectory the best investment young people can make now is buying property... _E_\nAction speaks louder than words but not nearly as often. Mark Twain _E_\n...contributions. The RNC is taking in far more $'s than the Dems and much of it by my wonderful small donors. I am working hard for them! _E_\nNow that the three basketball players are out of China and saved from years in jail LaVar Ball the father of LiAngelo is unaccepting of what I did for his son and that shoplifting is no big deal. I should have left them in jail! _E_\nGetting ready to make a major speech to the National Assembly here in South Korea then will be headed to China where I very much look forward to meeting with President Xi who is just off his great political victory. 
_E_\nMy @BreitbartNews' @biggovt editorial: \"'A COUNTRY THAT CANNOT PROTECT ITS BORDERS WILL NOT LAST\" __HTTP__ _E_\nSnowing in Texas and Louisiana record setting freezing temperatures throughout the country and beyond. Global warming is an expensive hoax! _E_\nAn old picture with Nancy and Ronald Reagan. __HTTP__ _E_\nDopey Sugar.@Lord_Sugar ...Your net worth doesn't even qualify you to host the Apprentice. Keep making me money. _E_\nObama and the Democrats have no respect for WWII vets trying to get into the memorial. _E_\nLast night was the first time Obama said we instead of I in respect to Bin Laden's killing. _E_\n'Hillary's Two Official Favors To Morocco Resulted In $28 Million For Clinton Foundation' #DrainTheSwamp __HTTP__ _E_\n.@Andre_Reed83 Thanks for your nice words. You are a real champion. I'm pushing! _E_\nWhy does @CNN & @andersoncooper waste airtime by putting failed campaign strategist Stuart Stevens who lost BIG for Romney on the show? _E_\nThis will be one of the biggest and most beautiful Miss Universe events ever. _E_\nIf Senate Republicans don't get rid of the Filibuster Rule and go to a 51% majority few bills will be passed. 8 Dems control the Senate! _E_\nThe Penn State Board should resign based on the grossly incompetent way they handled the NCAA. They gave away (cont) __HTTP__ _E_\n#TedCruz eligibility to be President not settled law says Cruz' Constitutional Law Professor #LaurenceTribe __HTTP__ _E_\nI've just done a major Dateline for NBC March 3rd just ahead of Apprentice. _E_\nWhy Franklin Graham says Donald Trump is right about stopping Muslim immigration __HTTP__ _E_\nThank you @ScottWalker! #AmericaFirst #RNCinCLE __HTTP__ _E_\nI had a lot of fun answering your questions in the latest round of #AskTheDonald. See if your question made it __HTTP__ _E_\n.@marklevinshow has been saying very nice things about me on his show recently. He has a fantastic radio show that I always enjoy! 
_E_\nI look so forward to debating Crooked Hillary Clinton! Democrat Primaries are rigged e mail investigation is rigged so time to get it on! _E_\nJohn McCain never had any intention of voting for this Bill which his Governor loves. He campaigned on Repeal & Replace. Let Arizona down! _E_\nThank you St. Louis Missouri! #MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nThe Senate must go to a 51 vote majority instead of current 60 votes. Even parts of full Repeal need 60. 8 Dems control Senate. Crazy! _E_\nEli Manning. Great Athlete. Great Guy. @NYGiants great teamwork! _E_\nNo @DannyZuker just the opposite lots of money can go to charity if you have the guts to play the game (deal)! _E_\nAnybody whose mind SHORT CIRCUITS is not fit to be our president! Look up the word BRAINWASHED. _E_\nVia @TheTodaysGolfer \"Trump @TurnberryBuzz transformation on course\" __HTTP__ _E_\nThank you New Jersey! #Trump2016 __HTTP__ _E_\nDerek Jeter had a great career until 3 days ago when he sold his apartment at Trump World Tower I told him not to sell karma? _E_\n3 Chief of Staffs in less than 3 years of being President: Part of the reason why @BarackObama can't manage to pass his agenda. _E_\nRT @FoxBusiness: .@charliekirk11: What this president has done is truly historic and if a Democrat president achieved 1/10th of what @POT... _E_\nWhen renovations are completed Trump National Doral will be the finest resort in the U.S. _E_\nIt wasn't Donald Trump that divided this country this country has been divided for a long time! Stated today by Reverend Franklin Graham. _E_\nI agree with Pres. Obama on Afghanistan. We should have a speedy withdrawal. Why should we keep wasting our money rebuild the U.S.! _E_\nWho would you rather have negotiating with Iran President Obama or Toronto Mayor Ford? My money is on Ford. _E_\nPeople are really liking the new ties and shirts @Macy's they are amazing and selling great! 
_E_\nWe need a president who knows how to get things done who can keep America strong safe and free and who can (cont) __HTTP__ _E_\n.@DannyZuker You're starting up again because people have forgotten you. You wouldn't take my bet but it's (cont) __HTTP__ _E_\nMany people are now saying that this is the worst storm/hurricane they have ever seen. Good news is that we have great talent on the ground. _E_\nJust left Florida for D.C. The people and spirit in THAT GREAT STATE is unbelievable. Damage horrific but will be better than ever! _E_\n\"I also protect myself by being flexible. I never get too attached to one deal or one approach.\" – The Art of The Deal _E_\nThe interview with Oprah will cause Lance Armstrong huge legal and financial problems sometimes it is better to go into a corner and hide. _E_\nGreat news as a result of our TAX CUTS & JOBS ACT! __HTTP__ _E_\nIt's time for @PeteRose_14 to enter @MLB's @BaseballHall. All time hits leader has paid the price. _E_\nMy son @EricTrump and @LaraLeaYunaska just announced their engagement. Great news! A wonderful couple! _E_\nDonald Trump Returns For 'All Star Celebrity Apprentice' __HTTP__ via @HuffPostTV _E_\nNew poll by ABC News/Washington Post TRUMP 32 CARSON 22 RUBIO 10 BUSH 7 Wow how will the media put a negative spin on this one? _E_\nWhere is the main stream media reporting on Univision's new expose of Fast and Furious? Too busy looking at Mitt's taxes? _E_\n.@yankees are privately ecstatic over A Rod's latest doping bust. The evidence is damning __HTTP__ @yankees don't want him. _E_\nEveryone is telling me that @EliotSpitzer is going to run against lightweight @AGSchneiderman Spitzer would win! _E_\nTHANK YOU ILLINOIS! Let's not forget to get family & friends out to VOTE IN 2016! __HTTP__ __HTTP__ _E_\nSo to all Americans in every city near and far small and large from mountain to mountain... 
__HTTP__ _E_\nBarack Obama used to mock Bush's 300K monthly job reports __HTTP__ Now Obama wishes he could have a month half as good. _E_\nWhen an employee leaves me and begs to come back I never let them. Loyalty is very important. _E_\nGreat win in Kansas last night for Ron Estes easily winning the Congressional race against the Dems who spent heavily & predicted victory! _E_\nRT @WhiteHouse: The current tax code is a burden on American taxpayers and harmful to American job creators. Learn more: __HTTP__ _E_\nI can confirm the reports @BillRancic my first season winner will be returning to this All Star season of @CelebApprentice. _E_\nIran's attack on Israeli diplomats is an attack on the West _E_\nRT @SpoxDHS: Schumer Rounds Collins destroys the ability of @DHSgov to enforce immigration laws creating a mass amnesty for over 10 millio... _E_\nRising over Bay Street @TrumpTO brings opulent luxury along with our famous world class amenities to the Queen City __HTTP__ _E_\nThe Irish government is too smart to destroy their beautiful coastline w/ bird killing ugly wind turbines. @AlexSalmond @AberdeenCC _E_\nThank you so much. Earnest must have been a great person. __HTTP__ __HTTP__ _E_\nThe city of Buffalo is struggling. Moving the @buffalobills would be catastrophic. The Bills belong in Buffalo! _E_\nImmigration reform really changes the voting scales for the Republicans—for the worse! _E_\nTexas is heeling fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_\nI am in Trump International Hotel Las Vegas getting ready and waiting for the debate tonight. Look forward hope I get treated fairly! _E_\nSo great that John McCain is coming back to vote. Brave American hero! Thank you John. _E_\nPresident Obama just told President Putin how important the Russian air strikes against ISIS have been. I TOLD YOU SO! _E_\nWe don't want to have a recount in any of the battleground states. 
Obama will steal it. Make sure all your friends and family vote. _E_\nClass of 2013. #WWEHOF __HTTP__ _E_\nI would like to congratulate @SenateMajLdr on having done a fantastic job both strategically & politically on the passing in the Senate of the MASSIVE TAX CUT & Reform Bill. I could have not asked for a better or more talented partner. Our team will go onto many more VICTORIES! _E_\nThe global warming scientists don't want to be airlifted off the ship they are having too much fun and that is too simple a solution FAME! _E_\n\"The unemployment rate remains at a 17 year low of 4.1%. The unemployment rate in manufacturing dropped to 2.6% the lowest ever recorded. The unemployment rate among Hispanics dropped to 4.7% the lowest ever recorded...\"@SecretaryAcosta @USDOL __HTTP__ _E_\nMr. Trump removing the broken teleprompter in North Carolina in front of a massive crowd. He goes on&delivers the b... __HTTP__ _E_\nNext week the Senate is going to vote on legislation to save Americans from the ObamaCare DISASTER. #WeeklyAddress __HTTP__ _E_\nThe off shore Aberdeen wind farm site is \"experimental\" & has no track record delivering energy. __HTTP__ @guardian _E_\nVia @Golfmagic: Golden Bear and American business tycoon finish their unlikely masterpiece __HTTP__ _E_\nThe greatest influence over our election was the Fake News Media screaming for Crooked Hillary Clinton. Next she was a bad candidate! _E_\nRemember that Carson Bush and Rubio are VERY weak on illegal immigration. They will do NOTHING to stop it. Our country will be overrun! _E_\nHillary's refusal to mention Radical Islam as she pushes a 550% increase in refugees is more proof that she is unfit to lead the country. _E_\nObama promised 5.2% unemployment by October 2012. His promises are worthless! _E_\n...case against him & now wants to clear his name by showing the false or misleading testimony by James Comey John Brennan... Witch Hunt! 
_E_\nHow did the NCAA which is weak and becoming irrelevant extract such a big & reputation shattering settlement from Penn State. Others zero! _E_\nVia @CBSLA: Donald Trump Fights To Keep Large American Flag Flying At Southland Golf Course __HTTP__ _E_\nWhy does the failing @WSJ write a false editorial about me and let dummy @KarlRove make the same mistake in the same edition of the paper? _E_\nWatching Gates on @seanhannity looks like he got hit by a truck! Why didn't Obama get him and othersto sign a confidentiality agreement? _E_\nCongrats @adamcarolla on #RoadHard raising $1M on @fundanything a record. _E_\nGreat poll out of Illinois! Thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWhat a series the @nyrangers @NHLDevils is turning out to be! Tonight's game should be another close one. _E_\nRT @EricTrump: Join @TeamTrump on Saturday for National Day of Action as we work to #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_\nTed Cruz does not have the right temperment to be President. Look at the way he totally panicked in firing his director of comm. BAD! _E_\nJoin me on Tuesday in Greensboro North Carolina! #Trump2016 #AmericaFirst __HTTP__ _E_\nCertain Internet sites are like a bad epidemic that won't go away others are terrific _E_\nMake sure you get out and vote...most important election of our generation...go Romney! _E_\nVia @MiamiHerald by Bill Van Smith: @jacknicklaus reminisces amid honor at @TrumpDoral __HTTP__ _E_\nThe economy will come back but it will not be the same economy. The old economy of the Industrial Age is (cont) __HTTP__ _E_\nBe sure to look for my beautiful wife Melania Trump tonight on QVC at 9 pm ET where she will be debuting her fantastic jewelry collection. _E_\nI would have millions of votes more than Hillary except for the fact that I had 17 opponents and she just had a socialist named Bernie! _E_\nRobert Pattinson is putting on a good face for the release of Twilight. He took my advice on Kristen Stewart...I hope! 
_E_\n.@Joan_Rivers —I know you're watching what did you think of your impersonator? _E_\nYou can watch all the highlights of last night's record 14th season premiere of @ApprenticeNBC __HTTP__ _E_\n\"One thing I've learned about the press is they're always hungry for a good story the more sensational the better.\" Art of the Deal _E_\nJOIN ME! #MAGATODAY:Springfield OH Toledo OH Geneva OH FRIDAY:Manchester NH Lisbon ME Cedar Rapids IA __HTTP__ _E_\nLast Thursday Obama said investing in infrastructure would improve our economy for the long term The next day he again stopped Keystone _E_\nLook at the solution not the problem. Learn to focus on what will give results. _E_\nWe left Iraq and it is quickly falling apart what a waste of lives and money and so obvious. _E_\n. #JoeTheismann was great as a political analyst on @FoxNews. He knows far more than football. Thanks for the nice words Joe! _E_\nJoin me live at 9:00 P.M. #JointAddress __HTTP__ __HTTP__ _E_\nVia @AmSpec by Jeffrey Lord: \"Donald Trump Takes Ice Bucket Challenge – Dares Obama\" __HTTP__ _E_\n\"Success is having to worry about every damn thing in the world except money.\" Johnny Cash _E_\nYou can change your vote in six states. So now that you see that Hillary was a big mistake change your vote to MAKE AMERICA GREAT AGAIN! _E_\nMy @FoxBusiness interview on @Varneyco discussing @BarackObama's dirty tactics & how @MittRomney should respond __HTTP__ _E_\nVia @NYDailyNews by Eugene Dunn: \"Trump the Nation's Great Hope\" __HTTP__ _E_\nThe military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_\nWe have all the cards. Now is the time to make a great deal with Iran. _E_\nWhite House Press Sec. had a hard time explaining why @BarackObama supported tax breaks for oil companies in (cont) __HTTP__ _E_\n.@JohnLegere T Mobile service is terrible! Why can't you do something to improve it for your customers. I don't want it in my buildings. 
_E_\nGo to @greta show will be talking about OPO and plenty else ENJOY! _E_\n.@THEGaryBusey feels he's been abandoned by his team. Do you think so? #CelebApprentice _E_\nWith the signature services of Trump Attaché @TrumpWaikiki brings premiere luxury to the white sands of Waikiki __HTTP__ _E_\nWe as a country either have borders or we don't. IF WE DON'T HAVE BORDERS WE DON'T HAVE A COUNTRY! _E_\nHitting at home. Democrat Sen. Joe Donnelly's son had his healthcare plan dropped __HTTP__ _E_\nThe ObamaCare websites have cost over $5B & many still do not work __HTTP__ One of the greatest fiascos in modern history! _E_\n#CelebApprentice @apprenticenbc returns tonight at 9/8c on NBC __HTTP__ _E_\nWe have to get tough with China before they destroy us. _E_\n\"The great question is not whether you have failed but whether you are content with failure.\" Laurence J. Peter _E_\nBelieve you can and you're halfway there. Theodore Roosevelt _E_\nLyin' Ted Cruz lost all five races on Tuesday and he was just given the jinx a Lindsey Graham endorsement. Also backed Jeb. Lindsey got 0! _E_\nIs this what we want for a President? __HTTP__ _E_\nWHY CAN'T THE MEDIA TELL THE TRUTH WE WOULD ALL BE SO MUCH BETTER OFF! _E_\nTed Cruz is lying again. Polls are showing that I do beat Hillary Clinton head to head. Check out __HTTP__ Poll snd Q Poll. _E_\nJust been informed by @nbc they want to extend the run of the @ApprenticeNBC by two shows because it is doing so well. Two hours live. _E_\nOne of the most effective press conferences I've ever seen! says Rush Limbaugh. Many agree.Yet FAKE MEDIA calls it differently! Dishonest _E_\nChina 'scorns' US cyber espionage charges China does not respect us __HTTP__ and feels Obama is a dummy _E_\nJobs are returning illegal immigration is plummeting law order and justice are being restored. We are truly making America great again! _E_\nKathy Griffin should be ashamed of herself. 
My children especially my 11 year old son Barron are having a hard time with this. Sick! _E_\nThank you @Heritage! This is our once in a generation opportunity to revitalize our economy revive our industry & renew the AMERICAN DREAM! __HTTP__ _E_\nDon't forget to watch me tonight on Late Night with Jimmy Fallon 12:35 a.m. on NBC. I'll be making a big announcement! _E_\nA coincidence that the NSA leaker is living openly in Hong Kong?! At the same time the Chinese Pres. met with Obama in CA. _E_\nThe dying @NRO National Review has totally given up the fight against Barrack Obama. They have been losing for years. I will beat Hillary! _E_\nSteven Tyler got more publicity on his song request than he's gotten in ten years. Good for him! _E_\nVia @politico by \"Poll: Trump has twice the support of Bush in New Hampshire\" __HTTP__ _E_\nThe concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non competitive. _E_\nToday I will be rallying with with 15000 patriots in Arizona for border security! Let's Make America Great Again! __HTTP__ _E_\nAchievers move forward at all times. Achievement is not a plateau it's a beginning. Don't waste time treading water. _E_\nIranian officials say that the WH is misleading public about the details of an interim nuclear agreement __HTTP__ _E_\nI wonder what @JoeBiden was thinking last night as @PaulRyanVP delivered that knockout speech. Joe should call in sick for the VP debate. _E_\nA dishonest slob of a reporter who doesn't understand my sarcasm when talking about him or his wife wrote a foolish & boring Trump hit _E_\nCongratulations Stephen Miller on representing me this morning on the various Sunday morning shows. Great job! _E_\nCorporations have NEVER made as much money as they are making now. Thank you Stuart Varney @foxandfriends Jobs are starting to roarwatch! 
_E_\nVia @ShinySheet by @soapbox1: \"Show jumping grand prix returns to Mar a Lago Sunday __HTTP__ _E_\nJoin us tomorrow in Scranton Pennsylvania at 3pm!#TrumpPence16 #MAGA Tickets: __HTTP__ __HTTP__ _E_\nThe Palestinian terror attack today reminds the world of the grievous perils facing Israeli citizens....continued: __HTTP__ _E_\nI told you that the Giants starting Hudson was a mistake. Just got knocked out of the game. I love being right! _E_\nJust read about my friend @HulkHogan he was set up too bad he has to use the court system instead of his muscles. _E_\nWill miss @RealBenCarson tonight at the #GOPDebate. I hope all of Ben's followers will join the #TrumpTrain. We will never forget. _E_\nDuring @BarackObama's presidency median family income has fallen 4.8% __HTTP__ Terrible for the middle class. _E_\nWishing all of those celebrating #Hanukkah around the world a happy and healthy eight nights in the company of those they love. __HTTP__ __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nWow USA Today did todays cover story on my record in lawsuits. Verdict: 450 wins 38 losses. Isn't that what you want for your president? _E_\nMichael Morell the lightweight former Acting Director of C.I.A. and a man who has made serious bad calls is a total Clinton flunky! _E_\nThe @washingtonpost report on potential VP candidates is wrong. Marco Rubio and most others mentioned are NOT under consideration. _E_\nCruz just lied again I am and have been totally against #ObamaCare repeal and replace! _E_\nTrump National Golf Club Washington D.C. is on 600 beautiful acres fronting the Potomac River. A fantastic setting! __HTTP__ _E_\nGreat boardroom. #CelebApprentice _E_\nPhil Mickelson's final 66 round in @The_Open was amazing. Congrats on his well deserved win. Amazing competitor & a great guy! 
_E_\nIn more and more places throughout this region citizens of SOVEREIGN and INDEPENDENT nations have taken greater control of their destinies and unlocked the potential of their people. #APEC2017 __HTTP__ _E_\nThis election is being rigged by the media pushing false and unsubstantiated charges and outright lies in order to elect Crooked Hillary! _E_\nDonald Trump helped expose the silliness of the move by offering to pay for the White House tours. __HTTP__ _E_\nMy @gretawire interview discussing why the sequestration cuts are necessary our $17T national debt & 2016 election __HTTP__ _E_\nA market is never saturated with a good product but it is very quickly saturated with a bad one. Henry Ford _E_\nI have just lost my beautiful & elegant long time exec. assistant Norma Foerderer. She passed away yesterday – a truly magnificent woman. _E_\nvia WSJ. Wake up @AlexSalmond before you destroy Scotland. @David_Cameron @AberdeenCC @pressjournal __HTTP__ _E_\nSet high standards and meet them. The proof is in the doing: learn by doing and taking risks. _E_\nIn Nov. '11 Al Qaeda's flag flew over the 'birthplace' of Libya's revolution __HTTP__ In Sept. '12 it flew over our Embassy. _E_\nHillary Clinton does not have the STRENGTH or STAMINA to be President. We need strong and super smart for our next leader or trouble! _E_\nBased on the incredibly inaccurate coverage and reporting of the record setting Trump campaign we are hereby: __HTTP__ _E_\nObama's '07 speech which @DailyCaller just released not only shows that Obama is a racist but also how the press always covers for him. _E_\nNever allow your attitude to be a liability. Be positive and strong. Set your mind on winning and keep it there. _E_\nThank you Iowa see you soon!#Trump2016 #ImWithYou __HTTP__ __HTTP__ _E_\nI loved beating these two terrible human beings. I would never recommend that anyone use her lawyer he is a total loser! _E_\nWill be meeting at 9:00 with top automobile executives concerning jobs in America. 
I want new plants to be built here for cars sold here! _E_\nWhy are people upset w/ me over Pres Obama's birth certificate?I got him to release it or whatever it was when nobody else could! _E_\n.@FloydMayweather Good luck tonight Floyd. _E_\nFifth Avenue's most iconic building @TrumpTowerNY features Trump Grill nestled in the corner of the Atrium __HTTP__ _E_\nThe G 20 Summit was a great success for the U.S. Explained that the U.S. must fix the many bad trade deals it has made. Will get done! _E_\nChina's Olympic training program is abusive __HTTP__ It is modern day slavery & shameful. Their (cont) __HTTP__ _E_\nWe need a president who is smart and tough enough to recognize the national security threat China poses in the (cont) __HTTP__ _E_\nLess than two weeks until @WWE's @WrestleMania XXIX. @TheRock v. @JohnCena willbe epic! Excited to be inducted into the Hall of Fame. _E_\n.@Newsmax by @melaniebatley: Donald Trump Tells Why He's Eyeing the White House.I'll Tell You Why He Could Win. __HTTP__ _E_\nDon't forget to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_\n\"If winning isn't everything why do they keep score?\" Vince Lombardi _E_\n.@weeklystandard I know your business is failing but you should try to get writers far better than @stephenfhayes. _E_\nIt's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_\nSenator Luther Strange has gone up a lot in the polls since I endorsed him a month ago. Now a close runoff. He will be great in D.C. _E_\nI still can't believe we didn't t take the oil from Iraq. _E_\nFor reasons only they can explain the @USChamber wants to continue our bad trade deals rather than renegotiating and making them better. _E_\n\"Remember to keep going: if you stop your momentum will stop.\" – Think Big _E_\nCould be a fight over red heads with @lisalampanelli—this could be good. #sweepstweet _E_\n\"Statement from President Donald J. 
Trump on #GivingTuesday\" __HTTP__ _E_\nEven if @BarackObama stays in DC taxpayers will pay millions for his Hawaii vacation when Americans are struggling __HTTP__ _E_\nHe who demands little gets it. Ellen Glasgow _E_\nVia @MiamiHerald by Hannah Sampson: \"BLT Prime coming to Trump's Doral resort\" __HTTP__ _E_\n.@TIME Magazine should definitely pick David Pecker to run things over there he'd make it exciting and win awards! _E_\nThe only thing more boring than @bwilliams newscast is his show Rock Center which is totally dying in the ratings—a disaster! _E_\nThe secret of success in life is for a man to be ready for his opportunity when it comes. Benjamin Disraeli _E_\nI will be interviewed by @ericbolling tonight at 8pm on the @oreillyfactor. Enjoy! _E_\nTime magazine should name David Pecker of American Media to be its top guy...but they are not smart enough to do that! _E_\n\"Remember that fear can be conquered. Go full throttle and the odds will be on your side.\" – Trump Never Give Up _E_\nA special message for Martin Bashir __HTTP__ _E_\nJeff Zucker failed @NBC and he is now failing @CNN. _E_\nObama is in Texas but will not be visiting the border. He is too busy fundraising! _E_\nThis new Russian strategy guarantees victory for the Syrian government and makes Obama and U.S. look hopelessly bad. President in trouble! _E_\n.@CarlyFiorina I only said I was on @60Minutes four weeks ago with Putin—never said I was in Green Room. Separate pieces—great ratings! _E_\nThank you Tennessee! #MAGA __HTTP__ _E_\nMattis Says Trump's Warning Stopped Chemical Weapons Attack In Syria __HTTP__ _E_\nYour work will never be in vain if you work for a cause that is greater than yourself. _E_\nWow Corey Lewandowski my campaign manager and a very decent man was just charged with assaulting a reporter. Look at tapes nothing there! _E_\nLast week to enter the Think Like A Champion signed book and keychain contest: __HTTP__ _E_\nGreat @FOXSports art. 
by @jillpainter on Doc River's annual golf charity event @TrumpGolfLA. Doc is a great friend! __HTTP__ _E_\n__HTTP__ #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nHe who knows when he can fight and when he cannot will be victorious. Sun Tzu _E_\nThe reason great dealmakers do not OPENLY celebrate a deal especially one that is not complete is that it shows weakness to the other side _E_\nThe Ground Zero Mosque should not go up where planned. It is wrong. My offer still stands to buy the property. Good deal for everyone. _E_\n#TBT Saturday Night Live __HTTP__ _E_\nThe United Nations Security Council just voted 15 0 to sanction North Korea. China and Russia voted with us. Very big financial impact! _E_\nInitial reports say 2nd debate viewership dropped. See what happens when I am not mentioned. _E_\nW/ spectacular panoramic Pacific Ocean views @TrumpGolfLA is the top luxury public golf course in the country __HTTP__ _E_\n\"Get in. Get it done. Get it done right. Get out.\" – Fred C. Trump (My father!) _E_\nBe totally focused. Being successful requires nothing less than 100% of your concentrated effort. _E_\nWhen I renovated Wollman Rink in Central Park it came in $750000 under budget.. _E_\nI win a state in votes and then get non representative delegates because they are offered all sorts of goodies by Cruz campaign. Bad system! _E_\nAn ad hoc interview I filmed with a German journalist at Ground Zero hours after the attack __HTTP__ _E_\nHillary Clinton's open borders are tearing American families apart. I am going to make our country Safe Again for all Americans. #Imwithyou _E_\nJust landed in Baton Rouge Louisiana. Reports are out that lines are three quarters of a mile to get in. Wow! #MakeAmericaGreatAgain _E_\n.@DannyZuker I'm in front of the camera and behind the camera just looked at your picture you'll never be in front of the camera! _E_\nI always enjoy being interivewed on @WOR710 by John Gambling. My father Fred used to listen to his father's show. 
_E_\nDirector Clapper reiterated what everybody including the fake media already knows there is no evidence of collusion w/ Russia and Trump. _E_\nThings happen that make you question whether you should keep going. As long as you are enjoying what you are doing keep going. _E_\nMy interview last night with Greta on the GOP going El Foldo __HTTP__ _E_\nThanks Larry. Best wishes. __HTTP__ _E_\nAfter witnessing first hand the horror & devastation caused by Hurricane Harveymy heart goes out even more so to the great people of Texas! _E_\nYesterday the Christmas tree arrived at Rockefeller Plaza. An iconic event for New York! _E_\nThe President of Taiwan CALLED ME today to wish me congratulations on winning the Presidency. Thank you! _E_\n.@FoxNews is devastated that lightweight Senator Marco Rubio got trounced tonight and is the big loser. I won the two big states great! _E_\nMy @SquawkCNBC interview re: Europe's financial mess investing in Spain Germany's economy and the future of the Euro __HTTP__ _E_\nSorry the best and most beautiful ties and shirts made anywhere and at a really reasonable cost. Also fragrance is amazing. GO TO MACY'S. _E_\n.@BarackObama is now taking credit for changing party platform language but he reviewed it prior to the convention __HTTP__ _E_\nI should release the sad and totally apologetic letter that Penn @pennjillette hand delivered to me. Minds would be changed very fast! _E_\nThis is the Cruz voter violation certificate sent to everyone a misdemeanor at minimum. __HTTP__ _E_\nJust had a very open and successful presidential election. Now professional protesters incited by the media are protesting. Very unfair! _E_\nWhat would you do if a large group of Muslims had a very public meeting drawing horrible and mocking cartoons of Jesus? Oh really be cool! _E_\nCongratulations to @woodyjohnson4 and @nyjets on yesterday's very exciting game. _E_\nWow the ridiculous deal made between Lyin'Ted Cruz and 1 for 42 John Kasich has just blown up. 
What a dumb deal dead on arrival! _E_\nThe ROLL CALL is beginning at the Republican National Convention. Very exciting! _E_\nI thought that @CNN would get better after they failed so badly in their support of Hillary Clinton however since election they are worse! _E_\nGet ALL the info then quick trial then death penalty for the Boston killer of innocent children and people! Do not be kind. _E_\nDummy @mcuban is at it again trying to use me to get publicity for himself! _E_\nIran was planning to attack the Israeli and Saudi DC embassies. We should respond accordingly. The diplomatic window is closed. _E_\nBringing true luxury to the Windy City @TrumpChicago soars 92 levels over the Chicago River __HTTP__ _E_\nWould very much appreciate Saudi Arabia doing their IPO of Aramco with the New York Stock Exchange. Important to the United States! _E_\nThe episode of the Apprentice that everyone has been waiting for....Joan Rivers stars and she is and does GREAT! Next Monday night at 8:00 _E_\nCongrats @MittRomney on a huge NV victory. Let's make @BarackObama a one term president __HTTP__ #OneTermFund _E_\nHonest reporters stated that the Prayer Breakfast was going on during my CPAC speech and security was very slow to let people in long lines! _E_\nIt's amazing my weekly scheduled interviews on @foxnews and @CNBC draw the highest ratings. And they get bigger week by week thanks folks! _E_\n\"Donald Trump on 'cliff': 'Other countries are eating our lunch'\" __HTTP__ via @BIZPACReview _E_\nA great Christmas movie & perfect #TBT! #MakeAmericaGreatAgain Story: __HTTP__ __HTTP__ _E_\nEntrepreneurs: Paying attention is a cost effective way of protecting yourself. _E_\nI wonder if @BarackObama ever had an Indonesian passport. Did he become an Indonesian citizen when he lived there? _E_\nNow Michelle Nunn will not admit she voted for Obama. Of course she did. Nunn supports ObamaCare & is anti Second Amendment. 
_E_\nA total lightweight: @JonHuntsman continues to give the worst responses on China in the debates. I can see why (cont) __HTTP__ _E_\nNYPD Officer Larry DePrimo has made the entire city proud with a his generous act of kindness __HTTP__ NYC loves the NYPD. _E_\n.@BrookslawBrooks Thank you so much for your nice words. I will make you look very smart! _E_\nSet the bar high do the best you possibly can and believe in yourself—because if you don't no one else will either. _E_\nPeople will be very surprised by our ground game on Nov. 8. We have an army of volunteers and people with GREAT SPIRIT! They want to #MAGA! _E_\nWe're getting down to the wire on The Apprentice tune in tonight for some great action! 10 p.m. on NBC. _E_\n.@FrankLuntz works really hard but is a guy who just doesn't have it a total loser! _E_\nCheck out this photo shoot video of @IvankaTrump's Spring 2012 collection.... __HTTP__ _E_\nIs Supreme Court Justice Ruth Bader Ginsburg going to apologize to me for her misconduct? Big mistake by an incompetent judge! _E_\nLet's together Make America Great Again! Vote Trump at __HTTP__ _E_\nThe new winter menu @SixteenChicago @TrumpChicago explores the evolution of fine dining @RobbReport __HTTP__ _E_\nIt now turns out that the phony allegations against me were put together by my political opponents and a failed spy afraid of being sued.... _E_\nEntrepreneurs: Put everything you've got into what you're doing. Be totally focused nothing should be haphazard. _E_\nDisgusting @BarackObama's supporters are launching an anti Mormon whisper campaign __HTTP__ Shameful but no surprise. _E_\nBecause of the tornado tragedy I will not be doing @piersmorgan tonight. I wish everyone well! _E_\nMy @FoxNews interview with @TeamCavuto discussing why I will not be moderating the Newsmax @iontv debate __HTTP__ _E_\nOur country is blowing up and @BarackObama is out campaigning. 
_E_\nThe Republicans must be patient and smart ObamaCare could sweep them into office in far greater numbers than anyone ever thought possible! _E_\nConvention Center officials in Phoenix don't want to admit that they broke the fire code by allowing 12 15000 people in 4000 code room. _E_\nEveryone loves TV's darling @TheRealMarilu. But wait until you see her tough & competitive side in the upcoming @CelebApprentice! _E_\nWill be working with contractors at Trump National Doral in Miami today. _E_\nThanks. __HTTP__ _E_\nMy @FoxNews interview with @gretawire explaining that I am keeping all my options available for 2012 __HTTP__ _E_\nDruggie A Rod @MLB's biggest fraud is lucky George Steinbrenner is no longer with us. @Yankees would have voided his contract. _E_\nRT @realDonaldTrump: \"Arrests of MS 13 Members Associates Up 83% Under Trump\" __HTTP__ _E_\nLast weeks Dateline which I hosted was the highest rated Dateline since January! _E_\nA great Father's Day gift—a stay at my 5 star hotel @TrumpNewYork along with items from my signature collection __HTTP__ _E_\nStatement on Relationship with NBC __HTTP__ _E_\nI am going to Trump National Doral in Miami today to check out the brand new and just opened BLUE MONSTER and the spectacular driving range. _E_\nI hear by demand a second investigation after Schumer of Pelosi for her close ties to Russia and lying about it. __HTTP__ _E_\nRT @FoxNews: .@EricTrump: People have seen a year that's incredible that's been filled with nothing but the best for our country America... _E_\nThe polls are close so Crooked Hillary is getting out of bed and will campaign tomorrow.Why did she hammer 13 devices and acid wash e mails? _E_\nCBS's FACE THE NATION Posts Largest Audience Since 2001#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nWith Democrats Spitzer Danger Weiner & Filner which party really has the war on women? _E_\nThe latest book on Hillary—Wow a really tough one! 
__HTTP__ @RogerJStoneJr _E_\nIt's time for government to stop picking winners & losers. Let's make sure everyone can achieve the American dream! __HTTP__ _E_\n\"Set the example and you'll be a magnet for the right people. That's the best way to work with people you like.\" – Think Like a Champion _E_\nMaybe if Obama knew too much about the spying it would be worse than knowing nothing but either way it is just another disaster! _E_\nChina has done very well under Obama. Now they just released their first aircraft carrier. _E_\n.@JustinRose99 The display you put on this weekend was unprecedented. Even the best putters couldn't believe it. You're amazing. See u soon. _E_\nWe are stupidly paying Iran billions of dollars that we should not be paying. Why isn't this part of the nuclear negotiations? Really dumb! _E_\nThis past Sunday's All Star Celebrity @ApprenticeNBC continued to win the key demographic of adults 25 54. An amazing run! _E_\nThis summer is very tough for the nation's worst AG @AGSchneiderman. Moreland Commission is his disaster. _E_\nRT @seanhannity: @ericbolling To my dear friend please know we all love you will be here for you and your family. _E_\nPeople are just now starting to find out how dishonest and disgusting (FakeNews) @NBCNews is. Viewers beware. May be worse than even @CNN! _E_\nCongress must stop Obama's reckless deal with Iran. The framework is a pathway for Iran to develop nukes. _E_\nCongrats to my friend @Schwarzenegger who is doing next season's Celebrity Apprentice. He'll be great & will raise lots of $ for charity. _E_\nThe Justice Dept. should ask for an expedited hearing of the watered down Travel Ban before the Supreme Court & seek much tougher version! _E_\nHaving a good relationship with Russia is a good thing not a bad thing. Only stupid people or fools would think that it is bad! We..... _E_\nCongratulations to Miss Rhode Island on winning the Miss USA contest. She did an amazing job. 
_E_\nListen to my interview with @gretawire tonight at 10PM ET on @FoxNews. _E_\nThe Paley Center for Media is a great place to visit when you're in NYC. #CelebApprentice _E_\n...use subsidies to buy health plans. In other words Ocare is dead. Good things will happen however either with Republicans or Dems. _E_\nAn insightful article on @BarackObama __HTTP__ _E_\nI make no apologies for this country my pride in it or my desire to see us become strong and rich again. (cont) __HTTP__ _E_\nDonald Trump's Guns by @EmilyMiller @washtimes __HTTP__ _E_\n#CelebApprentice Who will hear those two famous words? @Apprenticenbc premieres tomorrow at 9/8c on NBC. __HTTP__ _E_\nWhen I said that if within the Orlando club you had some people with guns I was obviously talking about additional guards or employees _E_\nI was thrilled to be back @LibertyU. Congratulations to the Class of 2017! This is your day and you've earned it.... __HTTP__ _E_\nJust got back from New Hampshire. Amazing people we all had a great time together! _E_\nVia @zpolitics: \"Donald Trump Sends Message to @GaRepublicans\" __HTTP__ _E_\n....countries which are doing badly. I want a merit based system of immigration and people who will help take our country to the next level. I want safety and security for our people. I want to stop the massive inflow of drugs. I want to fund our military not do a Dem defund.... _E_\nThe Dow just broke 24000 for the first time (another all time Record). If the Dems had won the Presidential Election the Market would be down 50% from these levels and Consumer Confidence which is also at an all time high would be \"low and glum!\" _E_\nJeb Bush had a tough night at the debate. Now he'll probably take some of his special interest money he is their puppet and buy ad's. _E_\nDon't forget my book signing tonight at Costco on 1250 Old Country Road in Westbury NY from 6 8 pm. Hope to see you there. 
_E_\nRT @BarackObama: RT if you agree: We need a President who is fighting for all Americans not one who writes off nearly half the country. _E_\nGreat! __HTTP__ _E_\nWhat Bernie Sanders really thinks of Crooked Hillary Clinton. __HTTP__ _E_\n.@JerryLawler was terrific. #WWEHOF __HTTP__ _E_\nWatch this clip from earlier this year. Time & time again I have been right about terrorism. It's time to get tough! __HTTP__ _E_\nOver the years I've discovered that for a brand to build the people surrounding it have to work exceptionally well together. _E_\nThere is no substitute for private sector experience. _E_\nJeb Bush \"I am a conservative\" = Barack Obama \"If you like your healthcare plan you can keep your plan.\" _E_\n#CongratsPeggy! __HTTP__ _E_\nVia @CBSNewYork: \"@TrumpFerryPoint Opens In The Bronx\" __HTTP__ _E_\nThank you to the people of New Hampshire I love you! Now off to South Carolina. _E_\n.@cher should spend more time focusing on her family and dying career! _E_\nCan it just be new age that Manti Te'o fell in love with a girl he never met or is it a hoax? _E_\nCongratulations to the White House. For every 1 ObamaCare enrollment there are 44 cancellation notices. Very unfair! _E_\n'As Senator Clinton promised 200000 jobs in Upstate New York her efforts fell flat.' __HTTP__ __HTTP__ _E_\nThanks for all of the nice tweets re Sgt. Tahmooressi. Especially nice that the money will be sent today #VeteransDay. _E_\n.@maddow Standing in front of wind turbines is sad. Rachel windmills are terrible for the environment— _E_\nFlashback: Donald Trump: $200M plan for Doral __HTTP__ via @ESPNGolf. Trump Doral's @cadillacchamp is one week away! _E_\nToday is a day that I've been looking very much forward to ALL YEAR LONG. It is one that you have heard me speak about many times before. Now as President of the United States it is my tremendous honor to finally wish America and the world a very MERRY CHRISTMAS! 
__HTTP__ _E_\nReports by @CNN that I will be working on The Apprentice during my Presidency even part time are ridiculous & untrue FAKE NEWS! _E_\n#SuccessByTrump exclusively available @Macy's has set sale records for fastest selling cologne. Makes a great gift __HTTP__ _E_\nManufacturers' record high optimism reported in the 1st qtr has carried into the 2nd qtr of 2017 via @ShopFloorNAM: __HTTP__ __HTTP__ _E_\nTrump Signature mattress is from Serta the best there is! Thanks _E_\nThe Muslim Brotherhood dictator in Egypt is bad news. He will never be our true ally! _E_\n🚨BREAKING🚨: State Department's Kennedy pressured FBI to unclassify Clinton emails: FBI documents __HTTP__ _E_\nA world famous testament to architectural excellence @TrumpTowerNY features a 60 ft waterfall __HTTP__ _E_\nDon't be afraid of being unique it's like being afraid of your best self. Donald J. Trump __HTTP__ _E_\nThe Trump Organization is going revolutionize Rio de Janeiro's downtown port area with Trump Towers. Construction begins soon! _E_\nOn the cover of @TIME Magazine—a great honor! __HTTP__ _E_\n.@VP Mike Pence will be speaking at today's #MarchForLife You have our full support! __HTTP__ _E_\nVia @Newsmax_Media by Cathy Burke: \"Donald Trump on 2016 Bid: On Scale of 1 10 I'm 'Much More Than Five'\" __HTTP__ _E_\nhave enough problems around the world without yet another one. When I am President Russia will respect us far more than they do now and.... _E_\nNasty tactics being used by @BarackObama campaign against @MittRomney. Must stop saying Obama is a nice man he is not! _E_\n..and now holds an adjunct professorship at Columbia University. Boudin also received an academic laurel from NYU Law School... _E_\nVia HT Politics __HTTP__ _E_\n#DrainTheSwamp __HTTP__ _E_\nSnowden if you're such a hero then come back home and face justice. In reality you are just another wiseguy traitor. _E_\nSo sad to hear of the terrorist attack in Egypt. U.S. strongly condemns. I have great... 
_E_\nThe $9B that @BarackObama spent in 'Stimulus' for Solar Wind Projects created 910 total jobs costing $9.8M each. __HTTP__ _E_\nLawyers have sent @billmaher demand notice and necessary documentation. _E_\nOn #PurpleHeartDay💜I thank all the brave men and women who have sacrificed in battle for this GREAT NATION! #USA __HTTP__ _E_\nHorrific incident in FL. Praying for all the victims & their families. When will this stop? When will we get tough smart & vigilant? _E_\nComic @sethmeyers21 bombed at University of Texas at Arlington—crowd was dismal as was his performance—I told you so! _E_\nLooking forward to @THEGaryBusey's book of Buseyisms ! _E_\nWow. @nfl ratings are down big league. Glad I didn't get the Bills. Rather be lucky than good. _E_\nWhat you get by achieving your goals is not as important as what you become by achieving your goals. Goethe _E_\nWe pay a disproportionate share of the cost of N.A.T.O. Why? It is time to renegotiate and the time is now! _E_\n\"The Trumps pay tribute to the late @Joan_Rivers\" __HTTP__ via @azcentral _E_\nFor too many years our inner cities have been left behind. I am going to deliver jobs safety and protection for those in need. _E_\nGoing on Letterman now let me know what you think how did I do? Here we go! _E_\nMatt Harvey @Mets Don't let the @NYDailyNews get you down nobody reads it. Play well. _E_\nNew home sales reach a 10 year high. Stock Market has more record gains. Hopefully Republican Senators will give us the much needed Tax Cuts to keep it all going! Democrats want big Tax Increases. _E_\nWisconsin's economy is doing poorly and like everywhere else in U.S. jobs are leaving. I will make our economy strong again bring in jobs _E_\nToday it was my tremendous honor to visit Marine Helicopter Squadron One (HMX 1) at the Marine Corps Air Facility in Quantico Virginia. I am honored to serve as your Commander in Chief. On behalf of an entire Nation THANK YOU for your sacrifice and service. We love you! 
__HTTP__ _E_\nLooking forward to @VinceMcMahon inducting me into @WWE Hall of Fame this Saturday in @TheGarden. #WWEHOF #WrestleMania _E_\n.@AlexSalmond See attached article. Very frightening to people living around these monstrosities __HTTP__ _E_\n.@TrumpNewYork's 176 rooms have floor to ceiling windows providing unparalleled views of Central Park & NYC __HTTP__ _E_\nJoin me in Denver Colorado tonight at 9:30pm: __HTTP__ Scranton Pennsylvania Monday @ 5:30pm: __HTTP__ _E_\nSee my picks at @Fund_Anything at __HTTP__ and giving away money!!! #FundAnything _E_\nThank you to Governor @ScottWalker for such warm support. Great speech! _E_\nGreat! __HTTP__ _E_\nI still hold the all time attendance and pay per view record at @WWE. _E_\nAlabama was great last night amazing people. 30000 folks was largest crowd of political season. Nice! _E_\nThe Democrats have become nothing but OBSTRUCTIONISTS they have no policies or ideas. All they do is delay and complain.They own ObamaCare! _E_\nDonald Trump: If Bill Maher Does Not Pay Off His $5 Million Bet – 'Then I'll Sue Him' __HTTP__ via @gatewaypundit _E_\nPeople often ask me the secret to my success and the answer is simple: passion focus and hard work. Momentum keeps it all going. _E_\nWhy does @ThisWeekABC w/ @GStephanopoulos allow a hater & racist like @tavissmiley to waste good airtime? @ABC can do much better than him! _E_\nThe Trump Signature Collection is the best menswear design for young entrepreneurs. Great style & design exclusively available @Macys. _E_\nJennifer is a terrific person. __HTTP__ _E_\nStill a great time to buy residential property. The courts are holding up foreclosures. Buy directly from the banks. _E_\nDon't forget to tune in tonight at 10 p.m. on NBC for another action packed episode of The Apprentice. __HTTP__ _E_\nHow will the client react? They've got both Elle Magazine and Chi to please. #sweepstweet _E_\nIn less than 30 minutes watch the season premiere of @ApprenticeNBC on NBC. 
_E_\nMake sure to tune in to All Star Celebrity @ApprenticeNBC this Sunday at 9PM EST for another round of fireworks and surprises! _E_\nI only go on shows that get ratings that's why I do @oreillyfactor @hannityshow and @gretawire. Your sho... (cont) __HTTP__ _E_\nBack from Miami where my Cuban/American friends are very happy with what I signed today. Another campaign promise that I did not forget! _E_\nIn Britain more Muslims join ISIS than join the British army. __HTTP__ _E_\nPresident @EmmanuelMacronThank you for inviting Melania and myself to such a historic celebration in France. #BastilleDay #14juillet __HTTP__ _E_\nWe enjoy hosting tourists in @TrumpTowerNY. They come from all over the world to see the Atrium a NYC landmark. __HTTP__ _E_\nWow did great in the debate polls (except for @CNN which I don't watch). Thank you! _E_\nObama is angry frustrated and desperate. He said \"voting is the best revenge\" __HTTP__ He is divisive. _E_\nHence legal documents are being crafted which take me completely out of business operations. The Presidency is a far more important task! _E_\n.@FoxNews is so biased it is disgusting. They do not want Trump to win. All negative! _E_\nMessage to Edward Snowden you're banned from @MissUniverse. Unless you want me to take you back home to face justice! _E_\nThis was the Republicans election to win but they just blew it reasons why to follow. _E_\nDopey @Lawrence O'Donnell whose unwatchable show is dying in the ratings said that my Apprentice $ numbers were wrong. He is a fool! _E_\nAm now in L.A. Will be going to the U.S.S. IOWA at 5:30 P.M. to speak to our great VETERANS and other friends! _E_\nIt's Tuesday. How many more 'The View' Execs will leak that they want @rosie gone? Show is failing. _E_\nAmerican homeownership rate in Q2 2016 was 62.9% lowest rate in 51yrs. WE will bring back the 'American Dream!' __HTTP__ _E_\nLook where the world is today a total mess and ISIS is still running around wild. 
I can fix it fast Hillary has no chance! _E_\nI started this campaign to Make America Great Again. That's what I'm going to do. #MAGA #debate _E_\nIn Tampa Florida thank you to all of our outstanding volunteers who want to #MakeAmericaGreatAgain! __HTTP__ _E_\n.@HallieJackson Why didn't you report Hillary lying about the ISIS video. Bad reporting. Perhaps @NBC will do better next year but doubt it! _E_\nBig thanks to @David_Bossie @Citizens_United & @AFPhq for hosting me at #NHFreedomSummit. Will be back to the Granite State soon! _E_\nTremendous support (except for some Republican leadership ). Thank you. _E_\nCongratulations to @FoxNews for being number one in inauguration ratings. They were many times higher than FAKE NEWS @CNN public is smart! _E_\nGreat meeting with active & retired law enforcement officers at the Fraternal Order of Police lodge in Akron Ohio. __HTTP__ _E_\nGov. Scott Walker just left my office we had a really wonderful talk. Very interesting! @GovWalker _E_\nSources inside @AGSchneiderman's office are saying that they are very concerned with the allegations against their lightweight boss. _E_\n51% of @JonHuntsman's NH voters are satisfied with @BarackObama as president __HTTP__ So is @JonHuntsman! _E_\nExcited by my acquisition of Doral Hotel & Country Club in Miami already world class but will soon be The Best. _E_\nCrooked Hillary Clinton spent hundreds of millions of dollars more on Presidential Election than I did. Facebook was on her side not mine! _E_\nEntrepreneurs: Being stubborn is a big part of being a winner. Don't give in and don't give up! _E_\nJon Stewart @TheDailyShow is a total phony –he should cherish his past—not run from it. _E_\nObama can attend a fundraiser every day but can't be bothered to get briefed on national security. Commander in Chief?! _E_\nJust left Florida amazing how well State is doing jobs way up taxes down. Congrats to @FLGovScott _E_\n'How Trump Would Stimulate the U.S. 
Economy' __HTTP__ _E_\nNew Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: __HTTP__ _E_\nIt probably was not drugs that caused the San Fran crash but why aren't they testing who knows? _E_\nThe joke around town is that I freed El Chapo from the Mexican prison because the timing was so good w/ my statements on border security. _E_\nI will be interviewed by @IngrahamAngle on @FoxNews at 10:00. Enjoy! _E_\nCrazy Election officials saying that there is nothing stopping illegal immigrants from voting. This is very bad (unfair) for Republicans! _E_\nThank you Alabama! #Trump2016#SuperTuesday _E_\nJust out @ApprenticeNBC was in first place in all demos during the 10PM hour in the ratings. _E_\nWell back to the drawing board! _E_\nI will be doing a Town Hall tonight at 10:00 P.M. on @seanhannity @FoxNews _E_\nLooking forward to a speedy recovery for George and Barbara Bush both hospitalized. Thank you for your wonderful letter! _E_\nI really like Jay Z but there is trouble in paradise. When his wife's sister starts whacking him not good! No help from B leads to a mess. _E_\nI truly hope President Obama doesn't do something irrational and dangerous for our country in order to save face. He must sit back and chill _E_\nThe North Coast of Scotland is spectacular the sea the sand dunes the rolling bluffs we walked the course and it is fantastic. _E_\nMUST READ ARTICLE: \"Immigration reform could be bonanza for Democrats\" __HTTP__ Are the @RNC & @GOP suicidal? _E_\nCongratulations to @TrumpDoral for being named one of @LINKSMagazine's Great Destinations: __HTTP__ _E_\nFor all of those who have been asking about online sales the Donald J. Trump Signature Collection ties & shirts are sold @Macys.com _E_\nSmall Business Poll has highest approval numbers in the polls history. All business is just at the beginning of something really special! _E_\n.@williebosshog such an honor to get your endorsement. You are a fantastic guy! It will not be forgotten. 
Don and Eric say hello! _E_\nAmerica's men & women in uniform is the story of FREEDOM overcoming OPPRESSION the STRONG protecting the WEAK & G... __HTTP__ _E_\nVery different styles but each totally effective in his own way at the debate. _E_\nChoose your own path: It doesn't have to be the path less traveled...What matters is that it's the right one for you. Vince Lombardi _E_\nPresident Obama's approval rating at 38% is at an all time low. Gee I wonder why? _E_\nMy best wishes to everyone for a Happy Thanksgiving! _E_\nThe United States better address China's exchange rate before they steal our country and it is too late! China is laughing at us. _E_\nWhere the hell is global warming when you need it? _E_\nSo a woman in Chicago who never had a job has 9 kids with 7 different men (she is one of many). These kids will never work. Trouble! _E_\n.@MacMiller's 'Donald Trump' song is at 64.5M views on YouTube __HTTP__ You're welcome Mac! _E_\nRev. Graham made a critical point. @BarackObama has turned a blind eye to the Christians being persecuted in (cont) __HTTP__ _E_\nWelcome to Obama's America record high poverty and an 8% drop in median household family income __HTTP__ Four more years? _E_\nBusinesses have already started massive layoffs and reducing employees' hours due to Obama Care. Reality is setting in. _E_\nA Great 4th of July! America a great country who's brightest days with wise leadership lie ahead. _E_\nRT @JasonMillerinDC: Is @realDonaldTrump debating Crooked @HillaryClinton or the moderators @AC360 and @MarthaRaddatz? #rattledhillary _E_\nIn light of Boston immigration legislation will be much harder to get. _E_\nYou would think a paper like the Washington Post would be fair and objective. For the record almost all polls showed I won all debates. _E_\nHillary Clinton's Campaign Continues To Make False Claims About Foundation Disclosure: __HTTP__ _E_\n.@AlexSalmond of Scotland may be the dumbest leader of the free world. 
I can't imagine that anyone wants him in office. _E_\nThe dying @UnionLeader newspaper in NH is in turmoil over my comments about them like a bully that got knocked out! _E_\nJeanne Shaheen was the deciding vote for ObamaCare. Premiums have skyrocketed 90% for New Hampshire. Send @SenScottBrown to the Senate! _E_\nKevin Garnett's response to Ray Allen last night was that of a great competitor nothing wrong in fact it was terrific. A champion! _E_\nWhy is @BarackObama always campaigning or on vacation? _E_\nThe Trump Doral's @cadillacchamp is Florida's premiere golf tournament. I'll be there! Tickets available here: __HTTP__ _E_\n.@hardball_chris must have the lowest IQ on television—now telling people that domestic terrorists are from the right. _E_\nRT @DRUDGE_REPORT: 43 39 __HTTP__ _E_\nThat the Obama administration didn't know the facts about who Bergdahl was before making the stupid 5 killers for one trade is pathetic! _E_\n.@MittRomney is 100% right. The US Supreme Court should do the right thing & overturn ObamaCare or the country (cont) __HTTP__ _E_\nFor years even as a civilian I listened as Republicans pushed the Repeal and Replace of ObamaCare. Now they finally have their chance! _E_\n.@pgaofamerica A really great tournament congrats to Monty Pete B and Ted Bishop. FANTASTIC JOB! _E_\nCongratulations to Tom Brady on yet another great victory Tom is my friend and a total winner! _E_\nIn today's #trumpvlog @RepWeiner the Secret Service and Dick Clark..... __HTTP__ _E_\nNEVER forget our HEROES held prisoner or who have gone missing in action while serving their country.Proclamation: __HTTP__ __HTTP__ _E_\nSave Medicare. Vote for @MittRomney. He will repeal Obamacare on day one. _E_\nDeja vu I can remember a time when our embassies were stormed under another failed President. Obama=Carter. _E_\nI will be interviewed on @foxandfriends by @ainsleyearhardt starting at 6:00 A.M. Enjoy! _E_\nJoining @oreillyfactor from Waukesha Wisconsin now live! Enjoy! 
_E_\nSuccess requires 100% effort and 100% focus. Nothing less. _E_\nWhat is your thought as to why Obama refused millions for charity and did not show his records and applications? _E_\nTHANK YOU Clemson South Carolina! #MakeAmericaGreatAgain #SCPrimary __HTTP__ _E_\nPeople rarely succeed unless they have fun in what they are doing. Andrew Carnegie _E_\nImportant meetings and calls scheduled for today. Military and economy are getting stronger by the day and our enemies know it. #MAGA _E_\nCan you believe the worst Mayor in the U.S. & probably the worst Mayor in the history of #NYC @BilldeBlasio just called me a blow hard! _E_\nBased on the fact that Ted Cruz was born in Canada and is therefore a natural born Canadian did he borrow unreported loans from C banks? _E_\nJust left Family Leadership Summit in Iowa got a standing ovation from many wonderful people. I will be back soon. _E_\nI don't believe I have been given any credit by the voters for self funding my campaign the only one. I will keep doing but not worth it! _E_\nBird killing windfarm that I oppose in Aberdeen got delayed by at least two years.@AlexSalmond forced the failing developers to delay! _E_\nThis Man Is the Most Dangerous Political Operative in America via Bloomberg Politics __HTTP__ _E_\nHillary Clinton is not a change agent just the same old status quo! She is spending a fortune I am spending very little. Close in polls! _E_\nHeading to Sioux County Iowa where the crowd is amazing. Dr. Robert Jeffress will make the introduction. Make America Great Again! _E_\nThank you Pennsylvania!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\n\"The only way to do great work is to love what you do. If you haven't found it yet keep looking. Don't settle.\" – Steve Jobs _E_\n#TBT With the wonderful actor Jack Nicholson __HTTP__ _E_\nJoin the MOVEMENT! 
__HTTP__ __HTTP__ _E_\nVia @fitsnews: The Donald Trump Show Is Returning To SC: BILLIONAIRE MOGUL HEADS BACK TO PALMETTO STATE __HTTP__ _E_\nNow Syria is bombing Iraq and Secy. Kerry after we blew the hell out of the place says please don't do that. Syria is a front for Iran. _E_\nIs it possible for @megynkelly to cover anyone but Donald Trump on her terrible show. She totally misrepresents my words and positions! BAD. _E_\nEntrepreneurs: Do not view any failure as the final say for your efforts. Learn your lessons quickly then move on. _E_\n.@HillaryClinton's Nuclear Agreement Paved The Way For The $400 Million Ransom Payment #DebateNight __HTTP__ _E_\nJob numbers today terrible! So what else is new? _E_\nI have always done well with properties fronting on oceans lakes and rivers. If something works stay with it. _E_\nSean Spicer is a wonderful person who took tremendous abuse from the Fake News Media but his future is bright! _E_\nVia @TheBrodyFile: Iowa Evangelical Leader Says Donald Trump Is Bold And Transparent __HTTP__ _E_\nThe only reason I bid on @buffalobills was to make sure they stayed in Buffalo where they belong. Mission accomplished. _E_\nDon't forget! Sunday night at 9 pm EST on @nbc Celebrity Apprentice is back! Tune in for a great show. @ApprenticeNBC _E_\n70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow! _E_\nHe knows he won't have to spend much: @JonHuntsman has offered to match any donation dollar for dollar. _E_\nAlaska had a 200% plus increase in premiums under ObamaCare worst in the country. Deductibles high people angry! Lisa M comes through. _E_\nBoth Obama administration and House leadership staffs are exempt from ObamaCare. Why not the American people? #MakeDCListen _E_\n.@WWE: He's answered the call! @realDonaldTrump responds to @VinceMcMahon's #ALSIceBucketChallenge! __HTTP__ #SmackDownALS _E_\nI was recently asked if Crooked Hillary Clinton is going to run in 2020? 
My answer was I hope so! _E_\nThank you Abingdon Virginia! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nKnockout assaults are the new rage by sick and depraved youth. We better start getting tough in this country and they want to take our guns! _E_\n\"I always follow my own instincts but I am not going to kid you: it's also nice to get good reviews.\" The Art of the Deal _E_\nComing soon to Pennsylvania Avenue __HTTP__ _E_\nResults of recovery efforts will speak much louder than complaints by San Juan Mayor. Doing everything we can to help great people of PR! _E_\nWow just heard that that next Tuesday's @saintanselm Politics & Eggs is the largest crowd ever. Looking forward to making new friends. _E_\nJoining @SeanHannity tonight at 9pmE on @FoxNews. Enjoy! __HTTP__ _E_\nDoing Fox & Friends at 7.00 A.M. ENJOY! _E_\n.@BillBratton was a great choice for NYC Police Commissioner. He will make us proud and safe! _E_\nObama's own gun study proves gun control is ineffective __HTTP__ @BIZPACReview _E_\nIt's 46º (really cold) and snowing in New York on Memorial Day tell the so called scientists that we want global warming right now! _E_\nWhat a rotten deal we made with Iran. We get nothing (except laughter at our stupidity). They get everything including delay and big cash! _E_\nThe people of Alabama will do the right thing. Doug Jones is Pro Abortion weak on Crime Military and Illegal Immigration Bad for Gun Owners and Veterans and against the WALL. Jones is a Pelosi/Schumer Puppet. Roy Moore will always vote with us. VOTE ROY MOORE! _E_\nCongratulations to Justice Neil Gorsuch on his elevation to the United States Supreme Court. A great day for Americ... __HTTP__ _E_\nIt's Tuesday. How many fundraisers travelling on the taxpayer dime will Obama hold today? _E_\n...They should realize that these relationships are a good thing not a bad thing. The U.S. is being respected again. Watch Trade! _E_\n.@TMobile You service is absolutely terrible get on the ball! 
@JohnLegere _E_\n.@BarackObama has completely failed the American people. U.S. annual incomes have fallen over 5% during his term __HTTP__ _E_\nWaste. With 22 new taxes & $1.8T in added debt @BarackObama's disgraceful 'ObamaCare' will still leave 30M uninsured __HTTP__ _E_\nI spell out some of the differences between Ben Carson and myself at 9:00 A.M. on @CNN @jaketapper. Ben is very weak on illegal immigration. _E_\nSpent a beautiful weekend golfing at Trump National Golf Club Westchester and Trump National Golf Club Bedminster. _E_\nRT @foxandfriends: France vehicle attack leaves at least six soldiers injured __HTTP__ _E_\n... to OPEC countries that hate our guts. It's stupid policy.\" Time To Get Tough _E_\nWhen @crowleyCNN defended Obama on Benghazi in the presidential debate she was defending a complete lie __HTTP__ _E_\nI'm at Trump National DC @TrumpGolfDC watching the #2013JuniorPGA championship fantastic young players! @ThePGAofAmerica. _E_\n\"The way to get started is to quit talking and begin doing.\" – Walt Disney _E_\nWith our brand new Tennis Performance Center @TrumpGolfDC offers countless activities along with top courses __HTTP__ _E_\nFirst Titantic sunk on its maiden voyage.Next the Hindenburg explodes on its first flight to America.Now we suffer the ObamaCare rollout! _E_\n\"President Donald J. Trump Proclaims January 16 2018 as Religious Freedom Day\" __HTTP__ _E_\nI recorded robo calls for @Perduesenate @leezeldin & @SteveKingIA. All had record wins. #MidasTouch _E_\n\"Mastering others is strength. Mastering yourself is true power.\" – Lao Tzu _E_\nNo surprise with the talk of amnesty in DC illegal immigration is picking up in Arizona __HTTP__ _E_\nAs I told everyone once before Wiener is a sick puppy who will never change 100% of perverts go back to their ways. Sadly there is no cure _E_\nCertain Republicans who have lost to me would rather save face by fighting me than see the U.S.Supreme Court get proper appointments. Sad! 
_E_\nCrooked Hillary's bad judgement forced her to announce that she would go to Charlotte on Saturday to grandstand. Dem pols said no way dumb! _E_\nMy @foxandfriends interview discussing the 9/11 Trials at Gitmo @MittRomney the job numbers and @CelebApprentice __HTTP__ _E_\nFew if any Administrations have done more in just 7 months than the Trump A. Bills passed regulations killed border military ISIS SC! _E_\nGreat meeting all of you. This group knocked on 50K doors & counting here in Maine thank you! @MaineGOP __HTTP__ _E_\nCongratulations to @ABC News for suspending Brian Ross for his horrendously inaccurate and dishonest report on the Russia Russia Russia Witch Hunt. More Networks and \"papers\" should do the same with their Fake News! _E_\nNot only are wind farms disgusting looking but even worse they are bad for people's health __HTTP__ (cont) __HTTP__ _E_\n.@Oprah was great amazing that she got Lance Armstrong to totally destroy his life. Why did he ever do that interview? _E_\n#ThrowbackThursday #Trump2016 __HTTP__ _E_\nThank you New Hampshire! #FITN#Trump2016 #NHPolitics __HTTP__ _E_\nHey @SnoopDogg @ItstheSituation @SethMacFarlane: Oh I'm real scared. #TrumpRoast airs tonight at 10:30/9:30 on @Comedy Central. _E_\n#TBT With @DonaldJTrumpJr almost 35 years ago __HTTP__ _E_\nNot only giving out money but Obama will be seen today standing in water and rain like he is a real President don't fall for it. _E_\nGreat day in Kentucky with Wayne LaPierre Chris Cox & the @NRA! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nA classic China just signs massive oil and gas deal with Russia giving Russia plenty of ammo to continue laughing in U.S. face. _E_\nAmerica's top public course @TrumpGolfLA's greens on Palos Verdes Peninsula have been celebrated by @GolfMagazine __HTTP__ _E_\nbeepee2004 Thank you very much Donald. Here is another. __HTTP__ Thanks both do justice to a fantastic place. _E_\nRT @TeamTrump: Mrs. 
Saucier's son is in prison for having classified info on an unsecured device. @HillaryClinton did FAR WORSE & is runnin... _E_\nWhat message does it send when @BarackObama's campaign has to spin whether America is better off than it was 4 years ago? _E_\nand here's another.... __HTTP__ _E_\nCan you believe this fool Dr. Thomas Frieden of CDC just stated anyone with fever should be asked if they have been in West Africa DOPE _E_\nBain did not list @MittRomney as an Executive on its website in 2000 __HTTP__ @BarackObama's Saul Alinsky tactics won't work! _E_\nOur great country is respected again in Asia. You will see the fruits of our long but successful trip for many years to come! _E_\nChina's corporate espionage is a continued threat to the American economy. With the right leadership it can be stopped. _E_\nI am looking forward to being in New Hampshire tomorrow. The silent majority is taking our country back. We will MAKE AMERICA GREAT AGAIN! _E_\nWe will no longer be silent. We can take our country back! Let's Make America Great Again! __HTTP__ _E_\nSuch a great honor. Final debate polls are in and the MOVEMENT wins!#AmericaFirst #MAGA #ImWithYou... __HTTP__ _E_\nUnbelievable. __HTTP__ _E_\nI refuse to call Megyn Kelly a bimbo because that would not be politically correct. Instead I will only call her a lightweight reporter! _E_\nYour higher self is in direct opposition to your comfort zone. Donald J. Trump __HTTP__ _E_\nI watched parts of @nbcsnl Saturday Night Live last night. It is a totally one sided biased show nothing funny at all. Equal time for us? _E_\n.@LukeDonald You are so good and so talented that I have no doubt you will conquer the 18th hole at the New Blue Monster @DoralResort _E_\nThe polls have been consistently great. The silent majority is speaking. Politicians are failing. #MakeAmericaGreatAgain! _E_\nDianne Gallagher @DianneG is a great reporter for News Channel 36 in Charlotte NC. Fantastic interview thanks! 
_E_\nI am very proud of my friend @OMAROSA. Despite her recent lossshe gracefully performs in the upcoming All Star @ApprenticeNBC _E_\nWe must build a great wall between Mexico and the United States! __HTTP__ _E_\nDidn't the Boston killer even run over his own brother with a car in order to get away? We are not dealing with an innocent baby here DEATH _E_\nDavid Pecker would be a brilliant choice as CEO of TIME Magazine nobody could bring it back like David! __HTTP__ _E_\nBoasting @AAAFiveDiamond & @ForbesInspector 5 Star ratings @TrumpNewYork's @jeangeorges features a superb menu __HTTP__ _E_\nThe @whitehouse has 'clarified' that the unemployment is actually 8.254% not 8.3% __HTTP__ A little sensitive are we? _E_\nGreat investor John Paulson just sought bankruptcy protection for a unit of his hedge fund very smart but he didn't go bankrupt you morons! _E_\nWith a stupid guy like Jonah Goldberg who uses \"tweeting like a 14 year old girl\" to hit me no wonder the NRO is doing so poorly. @JonahNRO _E_\n.@DaveWeigel @WashingtonPost put out a phony photo of an empty arena hours before I arrived @ the venue w/ thousands of people outside on their way in. Real photos now shown as I spoke. Packed house many people unable to get in. Demand apology & retraction from FAKE NEWS WaPo! __HTTP__ _E_\nWow the failing @nytimes has not reported properly on Crooked's FBI release. They are at the back of the pack no longer a credible source _E_\nJust as I predicted Iraq is deteriorating into utter chaos __HTTP__ The war was a waste. China is taking all the oil. _E_\nVia @bpolitics by @BetBrod \"Trump sets an aggressive tone as he insisted he's serious about running for POTUS. __HTTP__ _E_\nLooking like a really big night for Republicans a tremendous refutation of President Obama and his failed policies! _E_\n.@genesimmons Amazing! Thank you.   
__HTTP__ _E_\n.@BarbaraJWalters @theviewtv Why did you choose me as one of the 10 Most Fascinating People of the Year last season (and more than once?) _E_\nI discuss South Korea in today's all new #TrumpVlog __HTTP__ _E_\nIt was my great honor to welcome the 2016 World Series Champion Chicago @Cubs to the @WhiteHouse this afternoon.... __HTTP__ _E_\nAfter North Korea missile launch it's more important than ever to fund our gov't & military! Dems shouldn't hold troop funding hostage for amnesty & illegal immigration. I ran on stopping illegal immigration and won big. They can't now threaten a shutdown to get their demands. _E_\n\"The object of golf is not just to win. It is to play like a gentleman and win.\" Phil Mickelson @MickelsonHat _E_\nThe ObamaCare website was hacked. $5B dollars later and the site can't even secure your personal information. _E_\nInstead of attacking me Ashish J. Thakkar should worry about the culture of corruption plaguing Uganda __HTTP__ _E_\n.@MittRomney and his campaign manager should not be critical of candidates after they blew an election that should never have been lost! _E_\nTe'o's imaginary girlfriend is one of the great cons of all time—or he's very stupid. _E_\nIn the just out @FoxNews Poll I easily beat Hillary Clinton and I havn't even focused on her yet. On our way: MAKE AMERICA GREAT AGAIN! _E_\nIf you've looked over the yearsI've been right on virtually every issue from Iraq (not going in but if so taking the oil) to jobs to China _E_\nWow Ted Cruz got booed off the stage didn't honor the pledge! I saw his speech two hours early but let him speak anyway. No big deal! _E_\nWind turbines are not only killing millions of birds they are killing the finances & environment of many countries & communities. _E_\nAs President I wanted to share with Russia (at an openly scheduled W.H. meeting) which I have the absolute right to do facts pertaining.... 
_E_\n\"Build confidence starting with small successes that lead to greater and greater successes there is nothing like winning. Think Big _E_\nAnd Trump SoHo New York is one of the hottest new hotels anywhere.... __HTTP__ _E_\nI've got news for President Obama: America is not what's wrong with the world. #TimeToGetTough __HTTP__ __HTTP__ _E_\nCan you imagine if @BarackObama had passed Cap and Trade?! Energy costs would be double from already record highs. _E_\nEntrepreneurs: Pay attention to details. If you don't know everything about what you're doing you'll be in for some big surprises. _E_\nNo one wants the government to shut down but if ObamaCare is fully implemented then our country will eventually shutdown anyway! _E_\nHaters stop saying I went bankrupt it is not so. I never went bankrupt... _E_\nTypical @BarackObama's Press Secretary deflects any criticism of Obama's constant celebrity visits by attacking me. My great honor. _E_\nA Rod's lawsuit trying to overturn a binding arbitration agreement is going nowhere. He should be banned from spring training. _E_\nMy book Midas Touch with Robert Kiyosaki (Rich Dad Poor Dad) will be in bookstores tomorrow it's a grea... (cont) __HTTP__ _E_\nHeading to New Hampshire. #MakeAmericaGreatAgain __HTTP__ _E_\nBernie Sanders has been treated terribly by the Democrats—both with delegates & otherwise. He should show them & run as an Independent. _E_\n\"All our dreams can come true if we have the courage to pursue them.\" – Walt Disney _E_\nCongratulations to the winners of the Commander in Chief's Trophy the great Air Force Falcons! Watch:... __HTTP__ _E_\nI have great confidence that China will properly deal with North Korea. If they are unable to do so the U.S. with its allies will! U.S.A. 
_E_\nThere's never been anyone more abusive to women in politics than Bill Clinton.My words were unfortunate the Clintons' actions were far worse _E_\nAnswer to your questions I will be voting at 10:30 AM at Lighthouse International 110 East 60th Street Manhattan _E_\n#CelebApprentice contestants @DeeSnider and @DebbieGibson joined me for interviews today __HTTP__ _E_\nCongrats to @nbc on the success of the new smash show @NBCBlacklist. Fantastic suspense. Great acting. Must see TV! _E_\n.@foxandfriends int. on how the Boston thug deserves death penalty @FBI's great work & firing Brande Roderick __HTTP__ _E_\nOkay I think I'm going to do it—I'll open the Miss Universe Pageant as Santa tonight at 8 pm on @NBC _E_\nDon't forget to watch The Tonight Show with the wonderful @jimmyfallon at 11:30 P.M. You will not be disappointed! @NBC _E_\nThe Eric Trump Foundation has raised over $1000000 towards St. Jude Children's Research Hospital. __HTTP__ _E_\nToday is #VeteransDay. Let us be thankful for our nation's finest who fight at all corners of the earth to protect our freedoms. _E_\nWatch my latest appearance on Squawk Box .... __HTTP__ _E_\nEXCLUSIVE: Newt Gingrich: 'The Country Is in Rebellion' Trump Can 'Kick Down the Doors' __HTTP__ _E_\nBiden's sarcastic smiling may or may not be effective depending on who is watching. #VPDebate _E_\n.@Natalie_Gulbis Thank you for the nice piece in @SInow / @Golf_com.Keep up the great work! __HTTP__ _E_\nLibya is being taken over by Islamic radicals with @BarackObama's open support. _E_\nHeading to Richmond Virginia now. Join me tonight! #Trump2016Tickets: __HTTP__ _E_\nBeautiful evening in Kinston North Carolina thank you! Get out and VOTE!! You can watch tonight's rally here:... __HTTP__ _E_\nThe world economy is under deep stress with growth slowing everywhere. Yet crude is over $87/barrel. Should be $25 at the most. 
_E_\n#CelebrityApprentice Listening to the advice from @johnrich and @marleematlin adds another insight into the Final 4. #sweepstweet _E_\nGo to Macy's today and buy Trump ties shirts suits and cufflinks as a Christmas or holiday present.Great style great price! ONLY THE BEST _E_\nThe Pledge #MakeAmericaGreatAgain __HTTP__ _E_\n....Because of the Democrats not being interested in life and safety DACA has now taken a big step backwards. The Dems will threaten \"shutdown\" but what they are really doing is shutting down our military at a time we need it most. Get smart MAKE AMERICA GREAT AGAIN! _E_\nInteresting case from UK re @stellacreasy and abusive troll __HTTP__ _E_\nLyin' Crooked Hillary's email stories all have one thing in common. __HTTP__ _E_\nI am in Ireland inspecting my great and very beautiful Atlantic Ocean property. It is one of the most spectacular hotels anywhere! DOONBEG _E_\nWhether you like it or not the Russians did a great job in hosting the Olympics! Remember when Obama went to Europe to get Olympics fourth. _E_\nWe will remain fully engaged w/ open lines of communication as #HurricaneHarvey makes landfall. America is w/ you! @GovAbbott @FEMA @DHSgov __HTTP__ _E_\nThe worst negotiators in history (otherwise known as Republicans) have just offered to suspend debt ceiling for four months. Pathetic! _E_\nVia @thehill by @HenschOnTheHill: \"Trump: 'I'm disappointed' in many Republicans\" __HTTP__ _E_\nEverybody is raving about the Trump Home Mattress by @SertaMattresses. If you are looking for a mattress go buy (cont) __HTTP__ _E_\nDuring primetime of the Iowa Caucus Cruz put out a release that @RealBenCarson was quitting the race and to caucus (or vote) for Cruz. _E_\nThe Eric Trump Foundation Golf Invitational benefiting St. Jude Children's Research Hospital is today and i... (cont) __HTTP__ _E_\nEventually but at a later date so we can get started early Mexico will be paying in some form for the badly needed border wall. 
_E_\nWill be delivering a major speech tonight live on @oreillyfactor at 8:10pm from Pensacola Florida. _E_\nFor all of those fools that want to attack Syria the U.S.has lost the vital element of surprise so stupid could be a disaster! _E_\nMy @Yahoo 'Power Players' interview with @jonkarl Inside Donald Trump's new digs on Pennsylvania Avenue\" __HTTP__ _E_\nLooking forward to being at the convention tonight to watch all of the wonderful speakers including my wife Melania. Place looks beautiful! _E_\nNext time you are waiting in an emergency room remember the Boston killer was rushed to intensive care within minutes of capture. _E_\nI will be interviewed by @kimguilfoyle at 7pm on @FoxNews. #Enjoy! _E_\nMy @showbiztonight interview on @KhloeKardashian @ApprenticeNBC & my surprising TV career __HTTP__ _E_\nICYMI via @DMRegister by @JenniferJJacobs: \"Donald Trump to give Iowa speech on education\" __HTTP__ _E_\nI had a great time in Des Moines Iowa tonight! Thank you for all of the support. #Trump2016 __HTTP__ __HTTP__ _E_\nThe Muslim brotherhood is sending tanks into the Sinai & saying it doesn't violate Camp David accord. _E_\nPraying for everyone in Florida. Hoping the hurricane dissipates but in any event please be careful. _E_\nToday is the first day of the rest of your life make the most of it! _E_\nThe Emmys are sooooo boring! Terrible show. I'm going to watch football! I already know the winners. Good night. _E_\nI gave out the Male Athlete of the Year Award last night to my friend @MichaelPhelps—22 Olympic medals—a record that will never be broken. _E_\nDebbie Wasserman Schultz is hard to watch or listen to no wonder our country is going to hell! _E_\nLet the Arab countries take care of Egypt they have more to gain and plenty of money..It's time for the U.S. 
to stop being stupid.NO DOLLARS _E_\nToday in history WrestleMania 23: I shave @VinceMcMahon's hair highest rated show in WWE history @WrestleFact __HTTP__ _E_\nHas Barack Obama been caught red handed laundering money into his campaign from illegal online foreign donations? Media? _E_\nMy interview with Andy Dean on @americanowradio I told him what I really thought about the @FoxNews debate. __HTTP__ _E_\nObamacare is a disaster. Rates going through the sky ready to explode. I will fix it. Hillary can't!#ObamacareFailed _E_\nLate last Friday @BarackObama announced his 2011 budget deficit was $1.299 trillion the second largest in US history. _E_\n.@FoxNews should not put @KarlRove on—he has no credibility a bush plant who called all races wrong. _E_\nThank you Ohio! Together we made history – and now the real work begins. America will start winning again!... __HTTP__ _E_\nSomeone must be fired at @AOL for that stupid deal they made buying Huffington Post. _E_\nThe failing @NRO National Review Magazine has just been informed by the Republican National Committee that they cannot participate in debate _E_\nHopefully there won't be any problems in Baltimore tonight. Be calm be cool do not let anybody get hurt.There is just too much to live for! _E_\n.@TrumpDoral. Thanks for the many nice statements and to the media and golf critics for the great reviews of the brand new BLUE MONSTER! _E_\nA nation WITHOUT BORDERS is not a nation at all. We must have a wall. The rule of law matters. Jeb just doesn't get it. _E_\n.@RichLowry is truly one of the dumbest of the talking heads he doesn't have a clue! _E_\nThose who refuse to draw red line to Iran don't have the moral right to put a red line to @Israel. @IsraeliPM @netanyahu _E_\n.@HillaryClinton is on the front page of the @nytimes waving to 200 people in New Hampshire. My crowd next door was 5000 people – no pic! _E_\n......@DailyCaller @BreitbartNews @DRUDGE_REPORT & @gatewaypundit. 
_E_\nCongratulations to @thomtillis on winning @NCGOP Senate primary. Time for the party to unite and defeat ObamaCare advocate Kay Hagan! _E_\n.@KeithUrban is excellent on American Idol—great touch solid guy! _E_\nJust spoke to Governor Rick Scott. We are working closely with law enforcement on the terrible Florida school shooting. _E_\n.@TheView T.V. show which is failing so badly that it will soon be taken off thr air is constantly asking me to go on. I TELL THEM NO _E_\nAll the hotels currently open in the Trump Hotel Collection have been nominated for Travel & Leisure's World's Best Awards 2011 ..... _E_\nMy interview this morning on Good Morning America with George Stephanopoulos __HTTP__ _E_\nCanadians kicked out the firm that the U.S. paid all that money to for the failed website. How stupid are our leaders ? This is a scandal! _E_\nNorth Korea is reliant on China. China could solve this problem easily if they wanted to but they have no respect for our leaders. _E_\nWhile I won't be running for Governor of New York State a race I would have won I have much bigger plans in mind stay tuned will happen! _E_\nTeams are making a big mistake not taking Johnny Manziel he is going to be really good (and exciting to watch). _E_\nThank you Sarah Let's have pizza in New York soon with you & your great family __HTTP__ _E_\nIt was a GREAT day for the United States of America! This is a great plan that is a repeal & replace of ObamaCare.... __HTTP__ _E_\nHe would be crazy to play in L.A. really bad coach who can't adjust to his players! _E_\nObama planted that @nytimes story on Iran so it will be discussed in tonight's debate. He wants Libya and China off the table. _E_\nGreat poll numbers just coming out of New Hampshire. BIG lead for Trump according to @CNN! _E_\nMichele Bachmann got less than 1200 more votes in the Caucus than she did in the Ames Straw Poll. Very sad for her a nice woman! _E_\nChecking out the course at TNGC Westchester and it is fantastic. 
Should be a great season. __HTTP__ _E_\nRT @foxandfriends: Sen. John McCain making his return to the Senate ahead of health care vote __HTTP__ _E_\nCongrats to @BreitbartNews' @mboyle1 on being awarded the prestigious 'Eagle Award for Amnesty Reporting' __HTTP__ _E_\nI was not scheduled to be on the @oreillyfactor. Pure fiction! _E_\nVia @AmSpec BY Jeffrey Lord: \"Donald Trump was right on Ebola\" __HTTP__ _E_\nWe can't destroy the competitiveness of our factories in order to prepare for nonexistent global warming. China is thrilled with us! _E_\nDopey Sugar @Lord_Sugar The wind turbines are ruining the beauty & majesty of Scotland... _E_\nThank you American Legion Post 610 for hosting @Mike_Pence & I for a roundtable with labor leaders. #LaborDay #MAGA __HTTP__ _E_\nGOP now viewed more favorably than Dems in Trump era (per NBC/WSJ poll) via @HotlineJosh: __HTTP__ _E_\nI have founded and run one of the largest real estate empires in the world. I employ thousands of people. Why am I the enemy? _E_\nIf the UN unilaterally grants the Palestinians statehood then the US should cut off all its funding. Actions have consequences. _E_\nBe a cautious optimist. Call it positive thinking with a lot of reality checks. _E_\nRepublicans gave Obama a free pass to the White House they just don't get it. _E_\n.@FLGovScott: Amazing race tremendous courage you deserved this win for a very old fashioned reason you have been a great governor! _E_\nMore thoughts on the debt ceiling in today's #trumpvlog... __HTTP__ _E_\nBill Clinton did a great job last night the Democrats are lucky to have him. Do you really believe he likes @BarackObama? _E_\nRT @KatiePavlich: Your boss pardoned a traitor who gave U.S. enemies state secrets he also pardoned a terrorist who killed Americans. Spar... _E_\nEven the once great Caesars is bankrupt in A.C. Others to follow. Ask the Democrat City Council what happened to Atlantic City. 
_E_\nI will be on @SeanHannity @FoxNews tonight at 10pmE w/ @MELANIATRUMP from Wisconsin. Enjoy! #WIPrimary #Trump2016 __HTTP__ _E_\n.@eagles should sit Michael Vick. He is a great athlete but less than average quarterback. _E_\n\"No government ever voluntarily reduces itself in size. So governments' programs once launched never disappear.\" – Ronald Reagan _E_\nWe need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_\nCrooked's stop in Johnstown Pennsylvania where jobs have been absolutely decimated by dumb politicians drew less than 200 with Bill VP _E_\nI will be doing the A.L.S. Ice Bucket Challenge this morning on twitter. It is not something I look forward to doing but is for a good cause _E_\nISIS is starting its own currency. May be stronger than the dollar if ObamaCare is fully implemented. _E_\nRT @transition2017: President elect Trump announces selections for Attorney General National Security Advisor CIA Director. More here: ht... _E_\nI picked seven Super Bowl winners in a row & would have been right last night had the refs thrown the flag. _E_\nCongrats to @Team_Mitch on winning a spirited primary. Great job Mitch. _E_\nVera Coking saved me \"mucho\" money by turning down my offer—thanks Vera! _E_\nTogether we can save American JOBS American LIVES and AMERICAN FUTURES! #Debates __HTTP__ _E_\n. #RepMikeKelly Great job on @foxandfriends this morning. Thank you for the nice words! _E_\nRT @DRUDGE_REPORT: REUTERS POLL: CLINTON TRUMP ALL TIED UP... __HTTP__ _E_\nWe are going to have a wild time in Alabama tonight! Finally the silent majority is back! __HTTP__ _E_\n.@Team_Mitch Congratulations Mitch! _E_\nI'll be playing golf tomorrow in Palm Beach at the number one rated golf course in the State of Florida Trump International Golf Club. _E_\n.@ESPN's apology(Brent Musburger) was a disgrace to broadcasting stop being so politically correct! 
_E_\nDon't let the FAKE NEWS tell you that there is big infighting in the Trump Admin. We are getting along great and getting major things done! _E_\nIt was great being in Michigan. Remember I am the only presidential candidate who will bring jobs back to the U.S.and protect car industry! _E_\nCrooked Hillary Clinton has not held a news conference in more than 7 months. Her record is so bad she is unable to answer tough questions! _E_\n.@AP is doing very badly. I can say from experience their reporting is terrible & highly inaccurate. Sadly they are now irrelevant! _E_\nCongratulations @TrumpNewYork for being named in @CNTraveler's Top 10 US Hotels for Business Travelers! __HTTP__ _E_\nRT @foxandfriends: Report accuses material James Comey leaked to a friend contained top secret information __HTTP__ _E_\nVideo game violence & glorification must be stopped—it is creating monsters! _E_\nGreat even in SC tonight! Fire Marshall would not let everyone in 5000 turned away. Thank you for coming! _E_\nIn today's #trumpvlog I talk about how well Will Smith handled the situation with the reporter __HTTP__ _E_\nBig wins in West Virginia and Nebraska. Get ready for November Crooked Hillary who is looking very bad against Crazy Bernie will lose! _E_\n.@MittRomney's entire life and career have built prosperity and growth. _E_\nExplain how the women on The View which is a total disaster since the great Barbara Walters left ever got their jobs. @abc is wasting time _E_\nThank you @ASavageNation and keep up the great work! _E_\nBased on John Sweeney's lousy reputation we are airing large parts of the interview that were not shown enjoy! __HTTP__ _E_\nWhy would @greta use @KarlRove as an election analyst when he has made so many mistakes. He still thinks Romney won. An establishment dope! _E_\n#CelebApprentice We had lots of fun last night with the live tweeting so I will do it again tonight from 8 10pm. 
_E_\nMy great honor to join our incredible men and women of the @USCG at the Lake Worth Inlet Station in Riviera Beach Florida today!#HappyThanksgiving __HTTP__ _E_\nBig speech tonight in South Carolina 7:00 P.M. Tremendous crowd! _E_\nRepublicans must stop listening to dopes like @KarlRove who still insists Mitt Romney won the last election. Think big & think strong! _E_\nObama did much better than he did last time but still lost decisively. _E_\nWhy is @BarackObama spending millions to try and hide his records? He is the least transparent President ever and he ran on transparency. _E_\nEntrepreneurs: Realize that becoming an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_\nLying Cruz put out a statement \"Trump & Rubio are w/Obama on gay marriage. Cruz is the worst liar crazy or very dishonest. Perhaps all 3? _E_\nWacky Congresswoman Wilson is the gift that keeps on giving for the Republican Party a disaster for Dems. You watch her in action & vote R! _E_\nGetting ready to leave @TrumpDoral and the brand new Blue Monster course it's unbelievable! _E_\n#CrookedHillary is outspending me by a combined 31 to 1 in Florida Ohio & Pennsylvania. I haven't started yet! __HTTP__ _E_\nBefore Star Jones begged me to put her on The Apprentice she was \"professionally dead.\" I saved her tiny... __HTTP__ _E_\n\"Yesterday's home runs don't win today's games.\" – Babe Ruth _E_\nVia @businessinsider by @hunterw: \"TRUMP UNLOADS: Hillary Clinton was 'the worst' and is 'extremely bad'\" __HTTP__ _E_\nA day after Greece burned @BarackObama released a $3.8 Trillion budget for 2013 with a $900 Billion deficit.He will turn America into Greece _E_\nI've had enough of this good night! _E_\nBiggest story today between Clapper & Yates is on surveillance. Why doesn't the media report on this? #FakeNews! _E_\nWho ever heard of a legal conviction statement \"more probable than not\" against Tom Brady? Sue them Tom and make lots of $. 
@nfl _E_\n.@ShawnJohnson have a great Easter you are a real champion! _E_\nGlad to see my interview with Ronald Kessler @Newsmax_Media. Hopefully the @GOP can get the message. _E_\nThis is the single greatest witch hunt of a politician in American history! _E_\nRT @DanScavino: WE LOVE OUR DEPLORABLES!!!#TrumpTrain #Debates2016 __HTTP__ _E_\nState Treasurer John Kennedy is my choice for US Senator from Louisiana. Early voting today election next Saturday. _E_\nHere is another CNN lie. The Clinton News Network is losing all credibility. I'm not watching it much anymore. __HTTP__ _E_\nWe will MAKE AMERICA SAFE & GREAT AGAIN! #Trump2016 #VoteTrumpSC __HTTP__ __HTTP__ _E_\nWatch Miss USA 2013 Sunday night at 9 PM ET. Live from Planet Hollywood Las Vegas. __HTTP__ _E_\nWhy did Mitt Romney BEG me for my endorsement four years ago? _E_\nHappy #Hanukkah __HTTP__ _E_\nVia @nbc6: \"@MissUniverse Pageant Coming to @TrumpDoral in 2015\" __HTTP__ _E_\nHow is @VanityFair editor Graydon Carter allowed to run bad food restaurant Beatrice Inn? Fire Graydon! _E_\n.@drmoore Russell Moore is truly a terrible representative of Evangelicals and all of the good they stand for. A nasty guy with no heart! _E_\nMy @CENTURY21 Super Bowl commercial __HTTP__ which aired during the third quarter. _E_\nEntrepreneurs: Money is not always the bottom line: it can be a score card not the final score. _E_\n.@TrumpLasVegas is Sin City's most elite destination. Treat yourself to Vegas' most luxurious hotel rooms __HTTP__ _E_\nWhat a surprise! Newly released audit proves that the IRS only targeted Tea Party groups __HTTP__ _E_\nClinton commented in Ohio today that @MittRomney is right the economy has not been fixed under Obama.I always said Bill was an honest man. _E_\n\"DONALD TRUMP TO BILL MAHER: PAY UP\" __HTTP__ via @BreitbartNews _E_\nIf @megynkelly stopped covering me on her show her ratings would drop like a rock! My h to h interview with @AC360 beat her by millions! 
_E_\nRepublicans must stop relying on losers like @KarlRove if they want to start winning presidential elections. Be tough and get smart! _E_\nI have a lot of @Apple stock and I miss Steve Jobs. Tim Cook must immediately increase the size of the screen... __HTTP__ _E_\nBest of luck to my good friend Derek Jeter on his first game today back at shortstop. @Yankees Captain is a warrior & winner. _E_\nALso coming up: The Celebrity Apprentice returns. Sunday night March 6 at 9 pm EST __HTTP__ apprentice/ _E_\nJust finished the wonderful event on the U.S.S. Iowa. VETERANS FOR A STRONG AMERICA endorsed me. Such a great honor thank you! _E_\nI will be doing The Howard Stern Show at 7 a.m. (10 minutes). Always fun and interesting talking to Howard! _E_\nRemember when @ariannahuff ran for Governor of California. She got 3 votes. _E_\nVia Huffington Post Congrats America! Donald Trump Is Now A 2016 Presidential Front runner __HTTP__ by Igor Bobic _E_\nThe object of golf is not just to win. It is to play like a gentleman and win. Phil Mickelson _E_\nHad a special visitor in my office yesterday for @TIME photo shoot. __HTTP__ _E_\n.@GOP's election loss and failed negotiations will serve as a case study in how third parties come about. _E_\nNew York Magazine just named the most influential tweeters in N.Y. and one Donald Trump was #2 after ESPN. Actually I'm easily #1! _E_\nThe worst employee in today's #trumpvlog... __HTTP__ _E_\nWhy would anyone in Florida vote for lightweight Senator Marco Rubio. Check out his credit card scam his house sale & his no show voting! _E_\nMyrtle Beach South Carolina #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWhy are armed drones being released over our homeland by the Government? __HTTP__ Seems excessive. _E_\nRing in 2015 in downtown New York's most elite 5 Star hotel. 
@TrumpSoHo offers 46 luxurious stories of excellence __HTTP__ _E_\nIt makes me feel so good to hit sleazebags back much better than seeing a psychiatrist (which I never have!) _E_\nOur online campaign store is officially open! Visit __HTTP__ to shop the latest #MakeAmericaGreatAgain merchandise. _E_\n.@janinegibson __HTTP__ _E_\nIt is snowing in Jerusalem and across Lebanon. Global warming! _E_\nCongratulations to Jeb Hensarling & Republicans on successful House vote to repeal major parts of the 2010 Dodd Frank financial law. GROWTH! _E_\nEvery day St. Maarten loses vital tourism dollars due to the incompetence of PM Sarah Westcot Williams. @PrimeMinisterSX _E_\nRT @PressSec: The Trump effect: \"The U.S. economy is running at its full potential for the first time in a decade\" WSJ __HTTP__ _E_\nCongratulations to @jdickerson of Face the Nation on his highest ratings in 15 years. 4.6 million people watched my interview! Thank you! _E_\nLooking forward to keynoting @ChesterfieldGOP Lincoln Reagan Gala this Friday at The Country Club at The Highlands. Sold out record crowd! _E_\nJust released that international gangs are all over our cities. This will end when I am President! _E_\nRealize that being an entrepreneur is not a group effort. You're in charge. Everything starts with you. _E_\nChampion @bretmichaels is back competing in the upcoming All Star @CelebApprentice. Premiere is March 3rd on @NBC at 9 p.m. EST. _E_\n.@BBC should never have played that piece of garbage documentary & yet the phones are ringing off the hook to play the course. _E_\nTrump National Golf Club Washington D.C. is located on 600 acres and fronts the Potomac River. Spectacular! __HTTP__ _E_\nI can't believe that @CNN would waste time and money with @smerconish he has got nothing going. Jeff Zucker must be losing his touch! 
_E_\nMore Anti Catholic Emails From Team Clinton: __HTTP__ __HTTP__ _E_\n.@hardball_chris' very small audience is shrinking rapidly because people finally understand that he is very very dumb! _E_\nThank you for the endorsement Coach Bobby Knight! I will never forget it! __HTTP__ __HTTP__ _E_\nMy Administration has identified three major priorities for creating a safe modern and lawful immigration system: fully securing the border ending chain migration and canceling the visa lottery. Congress must secure the immigration system and protect Americans. __HTTP__ _E_\nDon't believe the millions of dollars of phony television ads by lightweight Rubio and the R establishment. Dishonest people! _E_\nBe a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_\nObama has exempted businesses his staff and all of Congress from ObamaCare. Why is he still forcing the monstrosity on the U.S.? _E_\nThank you Buffalo! #NYPrimary __HTTP__ __HTTP__ _E_\n.@GovMikeHuckabee was great the other night. People love him. _E_\nPeople ask me what I do in my free time. The answer I don't have any. _E_\nI love you Arizona! Thank you!#Trump2016 #AmericaFirst __HTTP__ _E_\nYou are doing a great job the world is watching! Be safe. __HTTP__ _E_\nIt's amazing how badly the Knicks and Nets are playing. Everybody predicted they would be top teams with all of the money spent. Too bad! _E_\nIt all begins today WE WILL FINALLY TAKE OUR COUNTRY BACK AND MAKE AMERICA GREAT AGAIN! _E_\n.@TrumpLasVegas' 7th floor provides the most urbane feel in Las Vegas w/private air conditioned cabanas & a massive 110 ft. heated pool. _E_\n#CelebApprentice stay tuned for the 2nd half we have one more firing tonight! _E_\nNow I know that Yahoo is in good hands. It took great courage for @marissamayer to take away the right of employees to work at home. _E_\n.@KatrinaCampins You were absolutely great on @CNN! Thank you. 
_E_\n3rd rate writer Vicky Ward who begged me for help see her letters to me. __HTTP__ _E_\nWatch @ FoxNews' @ShannonBream @LisWiehl & former prosecutor Doug Burns destroy ridiculous lawsuit __HTTP__ _E_\nYou can watch 360 video live from the podium! __HTTP__ #RNCinCLE #TrumpIsWithYou #MakeAmericaGreatAgain _E_\nGreat to see @RedSox win big yesterday. Good for Boston and the country. Yesterday we were all @RedSox fans. _E_\nIf my offer is refused every undecided OH voter will be fully aware that Obama denied $5M to charity all because he is hiding something! _E_\nMany political pundits are using the term Art of the Deal .... they should thank me. That is my term and book title. _E_\nThe sad truth is some Republicans in Congress are clueless when it comes to negotiation. #TimeToGetTough _E_\nGood morning America! Thank you for all of your support in the latest Drudge poll! __HTTP__ __HTTP__ _E_\nWill be interviewed by @SarahPalinUSA tonight at 10:00 on OAN Network. Enjoy! _E_\nWow the ratings are in and Arnold Schwarzenegger got swamped (or destroyed) by comparison to the ratings machine DJT. So much for.... _E_\nI have nothing to do with Atlantic City sold years ago (great timing). For losers and haters I NEVER went bankrupt. Plus $10 billion sorry _E_\nDon't forget to tune in tonight for another exciting episode of The Apprentice 10 p.m. on NBC. _E_\n.@davidaxelrod I'm sending you a check to help find a cure. @IvankaTrump says hi. _E_\nLike Al Sharpton @DonnyDeutsch apologized to me for calling me a racist on @todayshow apology accepted! _E_\nLet this be the day you go for your dream. Focus don't give up and only accept total and complete victory. You can do it! _E_\nCan you imagine how embarrassing it would have been for the country if the candidates actually did get into a fist fight? _E_\nTexas Georgia & many more VOTE EARLY! This is a movement!#Trump2016 VOTE VIDEO: __HTTP__ __HTTP__ _E_\nDonald Trump Sends @FallonTonight to Highest Friday Rating in 18 Months. 
@JimmyFallon that is #HUGE! __HTTP__ _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nI'm with YOU! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nHopefully the Republican National Committee can straighten out the total mess that is taking place in Virginia's Republican Party. FAST! _E_\nWatch Eric at 9 am (EST) today on Fox 5 w/ @rosannascotto and David Price to discuss Eric Trump Foundation's $20 million donation to St Jude _E_\nBill Hemmer of @FoxNews was very nice in explaining the excitement and energy in the arena. More than in past years. _E_\nSorry folks got to go to work now but I'll be baaaaack ! _E_\nI would like to thank Reince Priebus for his service and dedication to his country. We accomplished a lot together and I am proud of him! _E_\n\"Trump the orator outlines the greatness of America to Democrats' disgust\" __HTTP__ _E_\nRT @VP: Our President is choosing to put American jobs American consumers American energy and American industry first. __HTTP__ _E_\nCongrats to @msnbc for firing Martin Bashir—don't feel badly he didn't get ratings anyway. @SarahPalinUSA _E_\nWhile the Pres. of Iran tweets sweet nothings to Obama he forbids the Iranians to use twitter. Very revealing. _E_\n\"Don't fight the problem decide it.\" – General George C. Marshall _E_\nToo busy playing golf? @BarackObama sends form letters with an electronic signature to the parents of fallen SEALs __HTTP__ _E_\nCongratulations to @Graeme_McDowell and @kristinstape. Your baby has seriously good genes will be a champ! _E_\nNobody should be allowed to burn the American flag if they do there must be consequences perhaps loss of citizenship or year in jail! _E_\nHow does @michellemalkin get a conservative platform? She is a dummy just look at her past. _E_\nof position. Then separately she stated He said something truly horrifying ... 
he refused to say that he would respect the results of _E_\nGo to __HTTP__ to help my friend Scott Brown take back our Senate. _E_\nSuch an honor to have my good friend Israel PM @Netanyahu join us w/ his delegation in NYC this afternoon. #UNGA __HTTP__ __HTTP__ _E_\nRT @RightlyNews: @realDonaldTrump @LouDobbs Trust in the media is at the lowest level in all of U.S. history. The American people see throu... _E_\nThank you America! #MAGA __HTTP__ __HTTP__ _E_\nIsn't it interesting that the tragedy in Paris took place in one of the toughest gun control countries in the world? _E_\nMy thoughts condolences and prayers to the victims and families of the New York City terrorist attack. God and your country are with you! _E_\nObama wanted to meet with the Iranian president yet the Iranians denied the request. So much for Hope & Change. _E_\nDear @MaraLiasson I greatly appreciate your fairness. My history shows I never disappoint. Looking forward to meeting you soon. _E_\nThe Republicans owe an apology for blowing the 2012 election. How could they lose to Obama?! _E_\nHe @BarackObama is caught on tape making election promises to @MedvedevRussiaE on missile defense and national security __HTTP__ _E_\n'Clinton Ally Aided Campaign of FBI Official's Wife' __HTTP__ _E_\nWe must build a wall to secure our border. It will save lives and help Make America Great Again! __HTTP__ _E_\nVia @globegazette by John Skipper: North Iowan says Trump serious about POTUS run but he'll have to prove it __HTTP__ _E_\nThe NFL has just barred ball carriers from using helmet as contact. What is happening to the sport? The beginning of the end. _E_\nWe've all wondered how Hillary avoided prosecution for her email scheme. Wikileaks may have found the answer. Obama! __HTTP__ _E_\n.@TrumpChicago's Spa at Trump® offers 12 treatment rooms & 53 spa guestrooms overlooking the Chicago skyline __HTTP__ _E_\nCongrats to R. Emmett Tyrrell Jr of @AmSpec for the fantastic piece on Benghazi. 
_E_\nObamaCare's tax credit is underperforming by over 95% creating an even bigger cost to the debt __HTTP__ It must be repealed! _E_\nThe truth is a beautiful weapon. __HTTP__ _E_\nNew book by @ericbolling is absolutely terrific and a must read! #WakeUpAmerica _E_\nThe Republicans are funding ObamaCare and Amnesty. Obama beats them. __HTTP__ _E_\nProviding backstage commentary at the Miss USA Pageant will be comedic mother daughter duo Joan and Melissa Rivers. A fantastic lineup! _E_\n.@CNN & @CNNPolitics Please thank Alisyn Camerota David Chalian and John King for the very professional reporting of the new CNN Poll. _E_\nPeople are always asking me about the very special word CONFIDENCE. The fact is there is (almost) nothing like it. Is derived from winning! _E_\n.@FLOTUS Melania and I were honored to welcome Argentina President @MauricioMacri and First Lady Juliana Awada to t... __HTTP__ _E_\nI hope Republican Senators will vote for Graham Cassidy and fulfill their promise to Repeal & Replace ObamaCare. Money direct to States! _E_\nPurchase your copy of CRIPPLED AMERICA now & be on potential call list for my live streaming signing event tonite. __HTTP__ _E_\nWishing you and yours a very Happy and Bountiful Thanksgiving! _E_\nA nation that cannot control its borders is not a nation. President Ronald Reagan _E_\nAny negative polls are fake news just like the CNN ABC NBC polls in the election. Sorry people want border security and extreme vetting. _E_\nGreat crowd in Fletcher North Carolina thank you! Heading to Johnstown Pennsylvania now! Get out on November 8th... __HTTP__ _E_\nLittle Michael Bloomberg who never had the guts to run for president knows nothing about me. His last term as Mayor was a disaster! _E_\nThe class warfare being played by @BarackObama is the only way he can get reelected. He can't have America focus on his horrendous record. 
_E_\nThe liberal clown @ariannahuff told her minions at the money losing @HuffingtonPost to cover me as enterainment. I am #1 in Huff Post Poll. _E_\nGreat poll! Thank you North Carolina! #VoteTrumpNC on 3/15!Trump 36%Cruz 18%Rubio 18%Carson 10%Kasich 7%Via @SurveyUSA _E_\nWow Crooked Hillary was duped and used by my worst Miss U. Hillary floated her as an angel without checking her past which is terrible! _E_\nHave you been to the @TrumpGrill in the Trump Tower Atrium? Best meatloaf in the City my mother's famous recipe. 212.836.3249 _E_\nEnjoyed watching @ericbolling $ @SarahPalinUSA's @FoxNews special #PainatthePump over the weekend. (cont) __HTTP__ _E_\nThose who believe in tight border security stopping illegal immigration & SMART trade deals w/other countries should boycott @Macys. _E_\nMy @morning_joe int. w/@morningmika @JoeNBC & @ThomasARoberts f/@trumpdoral on why Romney shouldn't be @GOP nominee __HTTP__ _E_\nPersistence is a key for success. Don't give up. Continue to Think Big and you will be able to close deals. _E_\nA very interesting take from @KatiePavlich: __HTTP__ _E_\nTop Clinton Aides Bemoan Campaign 'All Tactics' No Vision: __HTTP__ _E_\nThe Washington Establishment will never rein in government spending waste fraud and abuse. A great thinker and outsider is needed. _E_\nThank you @JebBush you finally get it! __HTTP__ _E_\nI will be on @SpecialReport with @BretBaier tonight at 6PM. __HTTP__ _E_\nWhile @BarackObama is obsessed with 'green collar jobs' blue collar workers aren't buying it. (cont) __HTTP__ _E_\nSpoke to President Xi of China to congratulate him on his extraordinary elevation. Also discussed NoKo & trade two very important subjects! _E_\nFrom rags to riches and back to rags! __HTTP__ _E_\n.@DianneG @WCNC To the \"news bigs\" elevate Dianne Gallagher immediately—she is terrific! _E_\nThe Intelligence briefing on so called Russian hacking was delayed until Friday perhaps more time needed to build a case. Very strange! 
_E_\nAmericans may no longer have access to their family doctors because of Obamacare. __HTTP__ via @Newsmax_Media _E_\nAmazon is doing great damage to tax paying retailers. Towns cities and states throughout the U.S. are being hurt many jobs being lost! _E_\nThanks to @BarackObama rejecting the Keystone XL pipeline China has become Canada's biggest oil consumer. China is laughing at us! _E_\n.@andersoncooper Anderson—Thank you for being so fair with your reporting & story last night. Greatly appreciated! _E_\n.@nbc has increased @ApprenticeNBC to 2 hours until the end of the season full 2 hour episodes starting at 9 PM EST _E_\n.@danabrams Dan of course stories on me do well. Glad you have found a medium you can actual do well on. TV was not your forte. _E_\nKeep the big picture in mind. There are always opportunites & possibilities & thinking too small can negate a lot of them. _E_\nI will be LIVE tweeting tomorrow (MONDAY) nights TWO shows starting at 8:00 P.M. They are both great. _E_\nRT @seanhannity: Watch: Donald Trump OWNS A Heckler Who Said Illegal Immigrants Are The Backbone Of America __HTTP__ _E_\nMany people are equating BREXIT and what is going on in Great Britain with what is happening in the U.S. People want their country back! _E_\nEntrepreneurs: Problems are a mind exercise. Enjoy the challenge. _E_\nI will be on @marklevinshow at 8PM tonight. Tune in! _E_\nThe #WomenWhoWork campaign from @IvankaTrump __HTTP__ ... _E_\nReading @nytdavidbrooks of the NY Times is a total waste of time he is a clown with no awareness of the world around him dummy! _E_\nI dream for a living. Steven Spielberg _E_\nMany people have been asking me to answer questions. You can ask me questions at any time. #TrumpQandA _E_\nIf @HillaryClinton is president she'll be all talk and nothing will get done. #Debate #BigLeagueTruth _E_\nStraighten out The Republican Party of Virginia before it is too late. Stupid! 
RNC _E_\nThe new edition of The Apprentice will be on Thursdays this fall at 10 pm ET I'm putting people back to work! _E_\n.@megynkelly Sorry there was only one breakout star this weekend in New Hampshire. Just check out the local New Hampshire media! _E_\nEverybody that loves the people of New York and all they have been thru should get hypocrites like Ted Cruz out of politics! _E_\nLoved being with my many friends in Tennessee. The crowd and enthusiasm was fantastic. I won the straw poll big! _E_\nRepublican Senators are working very hard to get there with no help from the Democrats. Not easy! Perhaps just let OCare crash & burn! _E_\nI am going to expand the definition of LOBBYIST so we close all the LOOPHOLES! #DrainTheSwamp __HTTP__ _E_\nWow the Supreme Court passed @ObamaCare. I guess @JusticeRoberts wanted to be a part of Georgetown society more than anyone knew. _E_\nWow I am ahead of the field with Evangelicals (am so proud of this) and virtually every other group and Ben Carson just took a swipe at me _E_\nLooking forward to touring the @sigsauerinc world headquarters tomorrow! One of the top gun manufacturers in the US! #GunRights #TCOT _E_\nOppressive regimes cannot endure forever and the day will come when the Iranian people will face a choice. The world is watching! __HTTP__ _E_\nWith President Obama it's all talk and no action. Our country is in desperate need of smart and decisive leadership before it is too late! _E_\nMany people in our Country are asking what the \"Justice\" Department is going to do about the fact that totally Crooked Hillary AFTER receiving a subpoena from the United States Congress deleted and \"acid washed\" 33000 Emails? No justice! _E_\nTo the three UCLA basketball players I say: You're welcome go out and give a big Thank You to President Xi Jinping of China who made..... _E_\nWhile on FAKE NEWS @CNN Bernie Sanders was cut off for using the term fake news to describe the network. They said technical difficulties! 
_E_\nRick Perry is right when he says we must stand by Israel in the UN. _E_\nIf China didn't play games with its currency and we played on a level economic playing field we could easily (cont) __HTTP__ _E_\nPart of Obama's new found confidence is that the Republicans aren't using their power of ideas properly or effectively. _E_\nMy interview last night with Greta on Fox News __HTTP__ _E_\nThank you Arizona! #VoteTrump __HTTP__ _E_\nThe con artists changed the name from GLOBAL WARMING to CLIMATE CHANGE when GLOBAL WARMING was no longer working and credibility was lost! _E_\nI predict that dying @UnionLeader newspaper which has been run into the ground by publisher Stinky Joe McQuaid will be dead in 2 years! _E_\nHillary Clinton: 'Architect of failure'#DrainTheSwamp #CrookedHillary __HTTP__ _E_\nCongress must pass a budget and hold Obama to it. No more continuing resolutions and no more excuses. Republicans soon hold both houses. _E_\nI am in Kansas. Will be an exciting day. Big speech this morning in Wichita and then go to caucus. Sorry CPAC (the format was fine!). _E_\nThe Time Magazine list of the 100 Most Influential People is a joke and stunt of a magazine that will like Newsweeksoon be dead. Bad list! _E_\nVia @BBCNews Trump begins renewables mission in Scotland __HTTP__ _E_\nWhat will be the response on Wednesday? If Obama doesn't take the 5 million dollars for charity. _E_\nJoin me in Pittsburgh tonight at 7pmE! #Trump2016 #TrumpTrainTickets: __HTTP__ _E_\nLook for good ideas outside of your own areas of expertise. Find innovations approaches and practices that you could adapt in your field. _E_\nTerrible jobs report just reported. Only 38000 jobs added. Bombshell! _E_\nThe military threat from China is gigantic and it's no surprise that the Communist Chinese government lies (cont) __HTTP__ _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\nCaptured or not all our soldiers are heroes! 
_E_\nI employ many people in the State of Virginia JOBS JOBS JOBS! Crooked Hillary will sell us out just like her husband did with NAFTA. _E_\nDespite previous tweet Dennis Rodman would do a better job than the current (cont) __HTTP__ _E_\nWill the Benghazi terrorist use the videotape as a defense? If so will Obama apologize to him? _E_\nIn some ways it is sad. We all wanted @BarackObama to succeed. It's not worked out that way. _E_\nMy two sons Eric & Don have long been expert hunters & marksmen @NRA. They go on safaris & give animals to the poor & starving villagers! _E_\nIt is time to get out of Afghanistan. We are building roads and schools for people that hate us. It is not in our national interests. _E_\n...What about all of the Clinton ties to Russia including Podesta Company Uranium deal Russian Reset big dollar speeches etc. _E_\nAfter Friday's Twilight release I hope Robert Pattinson will not be seen in public with Kristen she will cheat on him again! _E_\nI hear @glennbeck is in big trouble. Unlike me his viewers & ratings are way down & he has become irrelevant—glad I didn't do his show. _E_\n.@BillGates and @JimBrownNFL32 in my Trump Tower office yesterday two great guys! __HTTP__ _E_\nMy interview w/ @nbc6 re: @CadillacChamp & my $200M of future renovations invested in Trump @DoralResort __HTTP__ _E_\nSCARY $6T in debt and $1T annual budget deficits later @BarackObama is asking for more time to fix the economy __HTTP__ _E_\nWeakness of attitude becomes weakness of character. Albert Einstein _E_\nQ/A @saychowder I receive a great many requests for interviews nationally and internationally. _E_\nThank you @CrainsChicago for featuring @TrumpChicago in your list of Best Private Dining Rooms in Chicago. __HTTP__ _E_\n...Those stupid people bought @mcuban's company (of which he owned a piece). _E_\nThe travel ban into the United States should be far larger tougher and more specific but stupidly that would not be politically correct! 
_E_\nI consider my health stamina and strength one of my greatest assets.The world has watched me for many years and can so testify great genes! _E_\nComey lost the confidence of almost everyone in Washington Republican and Democrat alike. When things calm down they will be thanking me! _E_\nWhy would Obama ever nominate someone for Sec. of Defense who opposes sanctions against Iran when Obama claims to support them? _E_\nThe @MittRomney healthcare plan post ObamaCare relies on consumer choices with more options __HTTP__ The perfect remedy! _E_\nThe sexual abuse that is so rampant has according to generals greatly weakened our military. They have failed to stop it. _E_\nWhat do you think of @DennisRodman's Donald Trump head? The hair's not quite right for one thing. #CelebApprentice _E_\nThis is an incredible MOVEMENT WE are going to take our country BACK! #November8th #BigLeagueTruth #Debate __HTTP__ _E_\nDopey @Lord_Sugar—Look in the mirror and thank the real Lord that Donald Trump exists. You are nothing! _E_\nGetting ready to open the magnificent Turnberry in Scotland. What a great day especially when added to the brave & brilliant vote. _E_\nMitt's subsequent rise in the polls post debate shows that the American public can still spot a real winner. _E_\nObama met with Chinese Premier Wen yesterday __HTTP__ and talked trade. The Chinese are robbing us blind be tough! _E_\nWill @JebBush in his phony advertising campaign show himself asking me to apologize to his wife in the debate? _E_\nRT @TeamTrump: We are going to be THRIVING again. @realDonaldTrump #BigLeagueTruth #Debates2016 __HTTP__ _E_\nSpeaking to a record crowd of over 20000 people in Charlotte Arena this Saturday morning—look forward to it! _E_\n122 vicious prisoners released by the Obama Administration from Gitmo have returned to the battlefield. Just another terrible decision! 
_E_\nRT @DonnaWR8: .@POTUS #TRUMP & @FLOTUS🌺When ALL seemed HOPELESS...YOU brought HOPE!You INSPIRE us ALL!#MAGA #Harvey @Scavino45 #USA... _E_\nIMO Manti Te'o was involved in a hoax for sympathy to get the Heisman Trophy. _E_\nNYC's sole hammam The Spa at @TrumpSoHo offers classic treatments inspired by wellness rituals f/around the world __HTTP__ _E_\nThank you @LtStevenLRogers. We will respond to terrorism with strength in 2017! __HTTP__ _E_\nOur economy cannot stay competitive with policies like these: @BarackObama is proposing over $90 Billion in new regulations. _E_\nThe massive Blue Monster @TrumpDoral is getting rave reviews. I built it in one year—no easy feat! _E_\nWatch out. Champion @Joan_Rivers returns to the Boardroom as a judge in this week's All Star Celebrity @ApprenticeNBC. Don't cross her! _E_\n.@deneenborelli Thank you for your nice words greatly appreciated. _E_\nChina court: Apple pays $60M to settle iPad case. China is getting away with murder. __HTTP__ _E_\nDonald Trump: Jeb Bush's Support of Common Core 'a Disaster' __HTTP__ via @BreitbartNews by Dr. Susan Berry _E_\nI'm going to D.C. today to check on the hotel I'm building on Penn. AVE. and then being honored by the Wharton School of Finance the BEST! _E_\nNetworks other than low ratings @CNN have been very fair and exciting! _E_\nHypocrite: @HillaryClinton is the single biggest beneficiary of Citizens United in history by far. #debate #bigleaguetruth _E_\nSuch a total miscarriage of Justice in San Francisco! __HTTP__ _E_\nObama's offer to Iran will not stop Iran's breakout capability. It is a bad desperate deal negotiated from weakness. Pass sanctions! _E_\nGreat interview tonight @donlemon very professionally done. @CNN _E_\nMy latest Celebrity Apprentice video blog... __HTTP__ _E_\nSorry banks when we accused lightweight AG Eric Schneiderman of not going after banks he started going after banks—but years too late! 
_E_\nObamaCare could eat up your raise __HTTP__ Why isn't Congress defunding it? They're obsessed with amnesty. _E_\nMark Levin's @marklevinshow 'The Liberty Amendments: Restoring the American Republic is a truly great & important book. _E_\nThe addition of the iconic Doral Resort to the Trump portfolio is one of the most exciting transactions __HTTP__ _E_\nThe fact that we are taking the Ebola patients while others from the area are fleeing to the United States is absolutely CRAZY Stupid pols _E_\nThank you @Morning_Joe for explaining to @CNN and @andersoncooper and so many others that I am leading in almost all national & state polls. _E_\nLiberal press won't look into why Obama ignored security warnings for embassies but is obsessed with Romney's private comments. _E_\nLooking forward to seeing Joe McQuaid Curtis Barry and my many friends in the Granite State! _E_\nWhy does @BarackObama have such a fascination with my plane? He is more than welcomed to come for a ride. _E_\nLooking forward to press conference on taxes at 11AM at @TrumpTowerNY. _E_\nLets go America! Get out & #VoteTrump! #Trump2016#MakeAmericaGreatAgain!#SuperTuesday __HTTP__ __HTTP__ _E_\nObama will let Ebola fly into US & drugrunners cross our border daily. But he won't pressure Mexico on Sgt. Tahmooressi. #FreeOurMarine _E_\nChuck Hagel: Wrong For Defense __HTTP__ via @NewYorkObserver _E_\nOscar Pistorius only gets five years in prison for killing his girlfriend. Ridiculous decision! Judge couldn't even read her own writings. _E_\nWhen the New York Times sold their beautiful long time building for peanuts & the buyer flipped it for a massive profit—they lost me! _E_\n...accountability say the Governor. Electric and all infrastructure was disaster before hurricanes. Congress to decide how much to spend.... 
_E_\n.@TrumpNewYork is NYC's only @ForbesInspector 5 Star & @AAAnews 5 Diamond hotel w/a 5 Star & 5 Diamond restaurant __HTTP__ _E_\nRT @FieldofFight: We Can Do Better We Must Do Better We Will Do Better By LTG (R) Keith Kellogg and LTG (R) Michael Flynn @GenFlynn __HTTP__ _E_\n\"No one remembers who came in second.\" Walter Hagen _E_\nObamaCare is a failure. Costs are rising much faster under Obama than other Presidents. _E_\nWelcome to the @WhiteHouse Prime Minister @JustinTrudeau! __HTTP__ _E_\nLooks like a lawsuit against GoAngelo won't work—my ties & shirts doing too well at Macy's he's actually helping. I have no damages! _E_\nGlad to hear that @taylorswift13 will be co hosting the Grammy nominations special on 12.5. Taylor is terrific! _E_\n\"Mistakes are always forgivable if one has the courage to admit them.\" Bruce Lee _E_\nWow the Failing @nytimes said about @foxandfriends ....the most powerful T.V. show in America. _E_\nVia @UnionLeader by @tuohy: \"Trump: You're Hired\" __HTTP__ _E_\nGet out and vote West Virginia we will MAKE AMERICA GREAT AGAIN! _E_\nI look forward to @MittRomney hitting Obama hard tonight for lying about Benghazi. CIA told Obama it was a terrorist attack after 24 hrs. _E_\n#AMERICA FIRST! _E_\nSaturday Night Live has some incredible things in store tonight. The great thing about playing myself is that it will be authentic! Enjoy _E_\nWill be interviewed by @chucktodd on @meetthepress at 10:30 A.M. _E_\nTime is on your side things do not continue downward forever. Think Big _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nThere is no longer a Bernie Sanders political revolution. He is turning out to be a weak and somewhat pathetic figurewants it all to end! _E_\nCan you imagine what Putin and all of our friends and enemies throughout the world are saying about the U.S. 
as they watch the Ferguson riot _E_\n#TrumpAdvice __HTTP__ _E_\nRT @foxandfriends: NYT editor apologizes for misleading tweet about New England Patriots' visit to the White House (via @FoxFriendsFirst) h... _E_\n...and borrow cheap! You will thank me someday. _E_\nWord is that Ford Motor because of my constant badgering at packed events is going to cancel their deal to go to Mexico and stay in U.S. _E_\nIsn't it crazy I'm worth billions of dollars employ thousands of people and get libeled by moron bloggers who can't afford a suit! WILD. _E_\nI watched @todayshow this AM re: @MarthaStewart & dating. She looks terrific better than ever any guy would be lucky to be with her. _E_\nI hear that sleepy eyes @chucktodd will be fired like a dog from ratings starved Meet The Press? I can't imagine what is taking so long! _E_\nBoard Room finale of this week's All Star @ApprenticeNBC will leave viewers wondering where the rest of the season goes...It's great! _E_\nThe terrorist came into our country through what is called the Diversity Visa Lottery Program a Chuck Schumer beauty. I want merit based. _E_\nI don't think the voters will forget the rigged system that allowed Crooked Hillary to get away with murder. Come November 8 she's out! _E_\nTrace delivers check to hospital in NYC: American Red Cross must be grateful to Trace and his team for their tremendous work. _E_\nMy response to the failing Des Moines Register the ultra liberal paper that has no power in Iowa __HTTP__ _E_\nLeaving West Palm Beach Florida now heading to St. Augustine for a 3pm rally. Will be in Tampa at 7pm join me:... __HTTP__ _E_\nThank you @SahilKapur for the wonderful story. __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nJust got great national poll numbers double digit lead! Thank you we will all MAKE AMERICA GREAT AGAIN! 
_E_\nTrump at Tea Party __HTTP__ via @myrbeachonline _E_\nDonald Trump donates land to conservation group in Palos Verdes __HTTP__ via @MyNewsLA _E_\nNobody would fight harder for free speech than me but why taunt over and over again in order to provoke possible death to audience. DUMB! _E_\nTo aspiring entrepreneurs: Be ready for problems. You'll have them every day. So remember to look at the solution not the problem. _E_\nI will be interviewed on @MariaBartiromo @FoxBusiness at 7:30 _E_\nRemember if you do not promote yourself no one else will. When you have success let people know about it. _E_\nUnemployment for Black Americans is the lowest ever recorded. Trump approval ratings with Black Americans has doubled. Thank you and it will get even (much) better! @FoxNews _E_\nGeorge Will was a big Iraq fool. $2 trillion thousands of lives lost & we got nothing! Dummy. _E_\n\"Successful leaders see the opportunities in every difficulty rather than the difficulty in every opportunity.\" Reed Markham _E_\nRupert Murdoch Defends Trump: 'Complete Refugee Pause' Makes Sense' __HTTP__ _E_\nVia @CNNMoney by @jtotoole: \"U.S. taps Donald Trump to convert DC's Old Post Office into luxury hotel\" __HTTP__ _E_\nThe original Apprentice returns with a two hour premiere on Thursday September 16th. Looking forward to a fantastic season! _E_\nCrooked Hillary said that I couldn't handle the rough and tumble of a political campaign. ReallyI just beat 16 people and am beating her! _E_\nGoing to Scotland Ireland & other places in Europe to close up deals. Getting ready for the June 16th announcement @TrumpTowerNY! _E_\nRT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2... _E_\nAlternatives are important but first Repubs must repeal ObamaCare. It's an unsustainable monstrosity that's destroying our healthcare. 
_E_\nIn the heart of midtown New York @TrumpTowerNY is a landmark which hosts tourists from the around the world daily __HTTP__ _E_\nBig day at the United Nations many good things and some tricky ones happening. We have a great team. Big speech at 10:00 A.M. _E_\nRT @IvankaTrump: My next project is pretty amazing...!xx Ivanka __HTTP__ __HTTP__ _E_\nKern County CA has secured $1.2B for windfarms __HTTP__ They also just secured more eagle deaths & low property values. _E_\nPeople like lawyer Elizabeth Beck and failed writer Harry Hurt & others talk about me but know nothing about me—crazy! _E_\n#TBT With the cast of GoodFellas __HTTP__ _E_\n.@SarahPalinUSA did a great job @CPACnews. Much of what she said was plain old common sense. _E_\nMy @NewsRadio967 interview re Jeb Bush's absurd immigration comment & @Citizens_United @AFPhq Freedom Summit. __HTTP__ _E_\nThe horrible shooting that took place in San Bernardino was an absolute act of terror that many people knew about. Why didn't they report? _E_\nThank you Rep. @CynthiaLummis! __HTTP__ __HTTP__ _E_\nA Rod hit ball hard first at bat. Time for him to step up and leave. _E_\nSometimes the best thing you can do is just let things ride let time go by. Donald J. Trump _E_\nDummy goAngelo keep letting people know how great my shirts ties and cufflinks (also Success) are at Macy's.The BEST now everyone's aware! _E_\nI would have had many millions of votes more than Crooked Hillary Clinton except for the fact that I had 16 opponents she had one! _E_\nThank you for your nice words @MikeNeedham @Heritage for the nice words on @FoxNewsSunday with Chris Wallace. #FNS #Trump2016 _E_\nRemember to watch the series finale of The Men Who Built America this Sunday at 8/7c on @History _E_\nGo out and vote this will be the most important election of our time! _E_\nI hope the Mexican judge is more honest than the Mexican businessmen who used the court system to avoid paying me the money they owe me. 
_E_\nDo you believe this one Secretary of State John Kerry just stated that the most dangerous weapon of all today is climate change. Laughable _E_\nImitation is the sincerest form of flattery Huntsman goes Donald Trump __HTTP__ _E_\nPolls close in 3 hours! Everyone get out and VOTE!#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nThank you @SenatorFischer! #TrumpPence16 __HTTP__ _E_\nVia @advisorsource: Donald Trump speaks in Novi drawing largest crowd in Oakland County Republican Party's history __HTTP__ _E_\nTrump Nat'l Westchester is among the most highly regarded clubs in New York. A great place. __HTTP__ _E_\nThank you @oreillyfactor for your wonderful editorial as to why I should have been @TIME Magazine's Person of the Year. You should run Time! _E_\nNewsweek ending print edition sad. Now my Newsweek covers mean nothing they lost all credibility. TIME to follow? _E_\nEveryone's wondering what's wrong with A Rod. Not one sports writer blames it on his not being able to use drugs anymore the real reason. _E_\n.@HillaryClinton has been doing this for THIRTY YEARS....where has she been? #BigLeagueTruth _E_\nLarge Block Grants to States is a good thing to do. Better control & management. Great for Arizona. McCain let his best friend L.G. down! _E_\nLet Pete Rose into the Baseball Hall of Fame. It's time he has paid a big and very long price! _E_\nI am at the @USGA #USWomensOpen. An amateur player is co leading for the first time in many decades very exciting! _E_\nSomething must be done with dopey @KarlRove he is pushing Republicans down the same old path of defeat. Don't fall for it Karl is a loser _E_\n.@Megynkelly spent a big part of her show talking about other shows spending so much time on me. Really weird she's being driven crazy! _E_\nThank you Speaker @PRyan!#AmericaFirst #Trump2016 __HTTP__ _E_\n.@hardball_chris became a super liberal Obama fan only because he must need the money and on @MSNBC that's the way it is. 
_E_\n\"Expand your life every day.\" –Donald J. Trump __HTTP__ _E_\nTonight @FLOTUS Melania and I were thrilled to welcome so many wonderful friends to the @WhiteHouse – and wish them all a very #HappyHanukkah __HTTP__ __HTTP__ _E_\n\"Borrowing and spending is not the way to prosperity.\" @PaulRyanVP _E_\n.@AlexSalmond See photo __HTTP__ _E_\nWishing everyone a Happy Memorial Day Weekend with a special thought for all the veterans who have done so much for our freedom. _E_\nRe Omarosa: Nasty tough or smart...or all? _E_\nFrom: @Newsmax_Media: @realDonaldTrump: Public not Worried About @MittRomney's Tax Returns __HTTP__ _E_\nSitting at the foot of the Whitestone Bridge @TrumpFerryPoint is an 18 hole @jacknicklaus signature course __HTTP__ _E_\nThe new @DarKnightRises trailer is fantastic __HTTP__ Trump Tower stood in for Wayne Enterprises during filming. _E_\nHillary Clinton is being badly criticized for her poor performance in answering questions. Let us all see what happens! _E_\n.....Ahead of schedule and under budget! Will be in Oklahoma tonight! _E_\nLots of pressure on Obama tonight even more than A Rod. If he doesn't perform well it could be over. _E_\nBeauty arrives to Moscow's Crocus City Hall this 11.9.! On @nbc the world will watch @MissUniverse 2013 crowned __HTTP__ _E_\nSuccess is not the key to happiness. Happiness is the key to success. If you love what you are doing you'll be a success. A. Schweitzer _E_\nIt's a plain fact: free trade requires having fair rules that apply to everyone. (cont) __HTTP__ _E_\nThe totally unexpected loss of Supreme Court Justice Antonin Scalia is a massive setback for the Conservative movement and our COUNTRY! _E_\nToday on #NationalAgDay we honor our great American farmers & ranchers. Their hard work & dedication are ingrained... __HTTP__ _E_\nDo you believe the way Karzai talks down to the United States zero respect! 
_E_\nWhen Americans are free to thrive innovate & prosper there is no challenge too great no task too large & no goal beyond our reach. We are a nation of explorers pioneers innovators & inventors. We are nation of people who work hard dream big & who never ever give up... __HTTP__ _E_\nRT @EricTrump: 2016 was such an incredible year for our entire family! My beautiful wife @LaraLeaTrump made it even better! __HTTP__ _E_\nMany countries including allies already see China as world superpower __HTTP__ We have greatest military yet no respect _E_\nWatch the @nbc video where @realmissnvusa is crowned as the 63rd @MissUSA __HTTP__ The Crowning Moment! _E_\nIn the 1950's our climate was far more unstable than it has been over the last 5 years. _E_\nI loved Walter Cronkite one of the all time greats. He couldn't stand Dan Rather I agree with Walter. @DanRatherReport _E_\n.@AlexSalmond the Scottish politician who released the terrorist who blew up Pan Am flight 103 over Lockerbie... _E_\nIran will soon take all of the oil in Iraq...and Iraq itself Keep the oil. _E_\nWith all of the recently reported electronic surveillance intercepts unmasking and illegal leaking of information I have no idea... _E_\nGo Republican Senators Go! Get there after waiting for 7 years. Give America great healthcare! _E_\nWith @IvankaTrump and crew at the start of a new @DoralResort. __HTTP__ _E_\nThe US Navy wants to go green. Our Navy should use the best & most powerful fuel & not play games. Give me a break! _E_\nRosie is crude rude obnoxious and dumb other than that I like her very much! _E_\nWe could only get a small fraction of this 25k crowd in. The movement to Make America Great Again is unbelievable! __HTTP__ _E_\nLooking forward to being honored at @citadelgop's Patriot Dinner with @SenatorTimScott in Charleston SC this Sunday __HTTP__ _E_\nBig news—WOW—U.S. economy shrinks! _E_\nInvestors are visionaries in some respects they look beyond the present. 
_E_\nObama's statement that illegals \"can't stay\" = Obama's promise \"if you like your healthcare plan you can keep it.\" _E_\nToday I signed an Executive Order on Enforcing Statutory Prohibitions on Federal Control of Education. EO:... __HTTP__ _E_\nEntrepreneurs: Absorb assess and then act. Don't negate your own power. Whatever you've been dealt know you can deal with it. _E_\nMy @foxandfriends interview discussing Obama's failed and dangerous foreign policy and the real unemployment numbers __HTTP__ _E_\nI've known @hardball_chris for a long time & sadly he gets dumber each & every year & started from a very low base. _E_\nRT @mitchellvii: Trump always ends up being right. It's almost a little freaky. _E_\nWhat I would do on my first day in office. #MakeAmericaGreatAgainWatch: __HTTP__ __HTTP__ _E_\nReal estate is always a great asset to own but especially now. Try to take advantage if you can and buy (cont) __HTTP__ _E_\nBernie Sanders is pushing hard for a single payer healthcare plan a curse on the U.S. & its people... _E_\nPer @rushlimbaugh: Why does Hillary Clinton get the benefit of the doubt (after she DESTROYS her illegal email server) ... _E_\nSix days and counting until my offer to Barack Obama expires... _E_\nThe Apprentice will be very exciting and interesting tonight at 8:00. Joan Rivers puts on a great show! _E_\nThe recent Kansas election (Congress) was a really big media event until the Republicans won. Now they play the same game with Georgia BAD! _E_\nWake Up America! See article: Israeli Science: Obama Birth Certificate is a Fake __HTTP__ _E_\nTonight's episode of The Apprentice is one you won't want to miss! Be sure to tune in 10 p.m. on NBC. _E_\nThe devastation left by Hurricane Irma was far greater at least in certain locationsthan anyone thought but amazing people working hard! _E_\nIf only Obama would treat @IsraeliPM @netanyahu with the same respect he awards tyrants. Very strange & dangerous for our national security. 
_E_\nWho are our generals that are allowing this fiasco to happen right before our eyes. Call it the PLENTY OF NOTICE WAR _E_\nWould anyone in the music industry treat a Democrat like this? @RealMeatLoaf is being punished for his political views __HTTP__ _E_\nWhy are we building a $1Billion embassy in Iraq when the country kicked us out didn't give us any oil & is about to get taken over by Iran? _E_\nI'll be speaking on Thursday April 12 at the first ever National Achievers Congress at the San Jose Convention (cont) __HTTP__ _E_\nRT @GOPLeader: .@POTUS made the right call in leaving a deal that would have put an unnecessary burden on the United States. __HTTP__ _E_\nThank you! CNBC #DebateNight poll with over 400000 votes. Trump 61%Clinton 39%#AmericaFirst #ImWithYou... __HTTP__ _E_\nJust spoke to Governor Kenneth Mapp of the U.S. Virgin Islands who stated that #FEMA and Military are doing a GREAT job! Thank you Governor! _E_\nI wonder if @BarackObama has promised Iran and China that he can be more flexible after his last election? _E_\nGetting ready to leave for South Korea and meetings with President Moon a fine gentleman. We will figure it all out! _E_\nI am in Virginia @RegentU Presidential forum with Dr. Pat Robertson beginning now! Watch here: __HTTP__ _E_\nBig Republican Dinner tonight at Mar a Lago in Palm Beach. I will be there! _E_\nThe biggest thrill in the world is entertaining the public there is no bigger thrill than that. Vince McMahon @WWE _E_\nJust read in the failing @nytimes that I was not aware the event had to be held in Cleveland a total lie. These people are sick! _E_\nWonderful meeting with Canadian PM @JustinTrudeau and a group of leading CEO's & business women from Canada and th... __HTTP__ _E_\nEverybody wants to see and talk to Dennis Rodman he will be on Celebrity Apprentice tonight at 9. _E_\nTrump Int'l Hotel & Tower New York has the perfect Manhattan location & @jeangeorges is the signature restaurant. 
__HTTP__ _E_\nKeep the big picture in mind. There are always opportunities & possibilities & thinking too small can negate a lot of them. _E_\nRT @RealBenCarson: Many people fight for change in DC. @realDonaldTrump is a leader with an outsider's perspective & the vision guts & ene... _E_\nWe look forward to making the Old Post Office in DC one of the great hotels of the World. __HTTP__ _E_\nMany people have been asking to see my plane The Apprentice's @AmandaTMiller will give you a tour... __HTTP__ _E_\nHeading to Trump National Doral to check the progress prior to the start of the Cadillac Championship on Thursday. I'll be there all week _E_\nTrue. __HTTP__ _E_\nThe interview was great for @Oprah and terrible for Lance Armstrong! _E_\nChina's submarines will soon be carrying nukes __HTTP__ They will be sent to patrol our coasts Obama won't do anything. _E_\nI know the Governors and Jeb Bush who has gone nasty with lies is by far the weakest of the lot. His family used private eminent domain! _E_\nRobust Economic growth is the answer to the Medicare Problem not cuts on the elderly. _E_\nThank you New Mexico! #Trump2016 __HTTP__ __HTTP__ _E_\nHome Sales hit BEST numbers in 10 years! MAKE AMERICA GREAT AGAIN _E_\nMelania and I send our thoughts and prayers to Senator McCain Cindy and their entire family. Get well soon. __HTTP__ _E_\nWill be doing @oreillyfactor tonight at 8pm. Enjoy! _E_\nHillary Clinton will use American tax dollars to provide amnesty for thousands of illegals. I will put... __HTTP__ _E_\nI will be interviewed on @TODAYshow and Good Morning America at 7:00 A.M. _E_\nShirts and ties are doing great @Macys thanks! _E_\nI cannot believe how well certain areas are doing relative to the U.S. There is no reason for this other than poor leadership.WE SHOULD BE 1 _E_\nRT @Scavino45: President Trump pays respects and delivers #MemorialDay remarks at Arlington National Cemetery. 
__HTTP__ _E_\nIf you can't adapt to new situations then you will never be successful. Every change is a new opportunity to use your talent. _E_\n.@Andre_Reed83. Congratulations Andre you deserve it! _E_\nPublic Policy Polling (PPP) has just come out with a major poll putting me #1 with Hispanics leading all Republican candidates.Told you so _E_\nDid Crooked Hillary help disgusting (check out sex tape and past) Alicia M become a U.S. citizen so she could use her in the debate? _E_\nThe Fed's reckless policies of low interest and flooding the market with dollars needs to be stopped or we will face record inflation. _E_\nLiberal SD Dem candidate Rick Weiland wants to expand ObamaCare to single payer & opposes Ebola travel ban. Send @RoundsforSenate to Senate! _E_\nMainstream media never covered Hillary's massive \"hacking\"or coughing attack yet it is #1 trending. What's up? _E_\nPeople are finally beginning to hit China and OPEC. They never give me credit for being the first by far but that's okay! _E_\nSome good news for New York – Weiner has dropped 12 points in the polls & that is before more of the pervert's old texts are released. _E_\nPresident Obama close down the flights from Ebola infected areas right now before it is too late! What the hell is wrong with you? _E_\nPlane was carrying those terrible lithium ion batteries which are highly combustible as cargo. Fire could have started in cockpit. _E_\nWith the record $200M renovations on track & budget (a miracle in DC) Trump Int'l Washington DC is being built into a national marvel. _E_\n.@TheHill Trump on Boehner resignation: 'It's a good thing' __HTTP__ _E_\nHave a great Good Friday and a Happy Easter. _E_\nIn the just released SC poll I increased my lead by 4 points since last poll by same firm. Up by 14! Cruz dropped 3. __HTTP__ _E_\nSpeech in Dallas went really well. Big and wonderful crowd. Just arrived in L.A. Big day tomorrow! 
_E_\nThe basketball coach at Rutgers looks bad but I had a coach who made him look like a baby coaches can be tough! _E_\nMy @SquawkCNBC interview discussing @BarackObama's #WHCD my Scotland property & @BarackObama using Bin Laden's death __HTTP__ _E_\nThank you! #Trump2016 __HTTP__ __HTTP__ _E_\nThe wonderful people of Puerto Rico with their unmatched spirit know how bad things were before the H's. I will always be with them! _E_\nYesterday in Iowa was amazing two speeches in front of two great sold out crowds. They love that I am the only candidate self funding! _E_\n'Podesta urged Clinton team to hand over emails after use of private server emerged' __HTTP__ _E_\nStock Market has increased by 5.2 Trillion dollars since the election on November 8th a 25% increase. Lowest unemployment in 16 years and.. _E_\nThe next ObamaCare disaster will be doctors being dropped from plans. _E_\nBreitbart gets it! Vote now Obama should release his college application records & grades. He says he loves (cont) __HTTP__ _E_\nJodi Arias has stated that she follows me on twitter so I really hate to be saying that she is guilty but sadly she is as guilty as it gets _E_\n.@VanityFair could come back if Graydon Carter paid as much attention as he does to his bad food restaurants. @CondeNastCorp _E_\nCongratulations to @TrumpNewYork for being named #1 Best Business Hotel in NYC in @TravlandLeisure's 2014 World's Best Business Hotels. _E_\nThank you North Carolina get out & #VoteTrump on 11/8/2016!#MakeAmericaGreatAgain __HTTP__ _E_\nVia @G_Liberty_Voice by Melody Dareing: \"Donald Trump Wants to Build a Wall Between U.S. And Mexico\" __HTTP__ _E_\nObama Putin Moscow meeting on 9.3 4 __HTTP__ On the agenda 2013 Trump @MissUniverse Pageant in Moscow on 11.9 on @nbc! _E_\nWhen I bought the #MissUniverse pageant 13 years ago it was on life support... _E_\nIf everything seems under control you're just not going fast enough. Mario Andretti _E_\nJeb Bush George W and George H.W. 
all called to express their best wishes on the win. Very nice! _E_\n2004 VIDEO:Pocahontas describing Crooked Hillary Clinton as a Corporate Donor Puppet. Time for change! #Trump2016 __HTTP__ _E_\nThe French police are afraid to go into many communities. How did France let this all happen and how did the female terrorist ever escape? _E_\nWhere serenity meets luxury: Trump Nat'l Jupiter's Spa offers treatments which help restore youthful vitality __HTTP__ _E_\nVia @BET: \"Donald Trump Blasts Beyoncé for Suggestive Super Bowl Show\" __HTTP__ _E_\nLittle @MacMiller—I have more hair than you do and there's a slight age difference. _E_\nScotland is having a virtual revolt over obsolete wind turbines which are driving up energy costs and killing the bird population (and more) _E_\nDoes anybody really want to throw out good educated and accomplished young people who have jobs some serving in the military? Really!..... _E_\nBreaking ground shortly Trump Int'l Washington DC will bring the DC Post Office far beyond its original grandeur __HTTP__ _E_\nPresident @BarackObama's vacation is costing taxpayers millions of dollars Unbelievable! _E_\nEveryone is excited for @THEGaryBusey's return to All Star @CelebApprentice. Be warned this time Gary is even more insane! _E_\nThe State Department's 'shadow government' #DrainTheSwamp __HTTP__ _E_\nNew rule for @billmaher: check the law before you make a public absolute offer. _E_\nIf this doctor who so recklessly flew into New York from West Africahas Ebolathen Obama should apologize to the American people & resign! _E_\nSee you tomorrow Wisconsin!'Trump spurs small business optimism in Milwaukee area' __HTTP__ _E_\nCheck out Gray Line's site for the Donald Trump Ride of Fame... __HTTP__ _E_\nShows how dumb Joe McQuaid (@deucecrew) of the dying Union Leader is to put out the letter I wrote saying why I didn't do his failed debate! 
_E_\nNYC's sole hamman the bi level @TrumpSoHo features indoor & outdoor relaxation lounges with luxury services __HTTP__ _E_\nTerrible for the economy & middle class gas has now been over $3/gallon for a record 1245 days __HTTP__ FRACK NOW & FAST! _E_\nShocking over 92% of France who just elected a socialist for its new PM want @BarackObama re elected __HTTP__ _E_\nRay Kelly is the best Police Commissioner in NYC history. Keeping NYC safe thru vigilance. @RayKelly _E_\nBe sure to watch #CelebApprentice on Sunday night at 9 pm on NBC. Another great episode! __HTTP__ _E_\nHave you been watching how Saudi Arabia has been taunting our VERY dumb political leaders to protect them from ISIS. Why aren't they paying? _E_\n.@McLaughlinGroup Greatly appreciate yr wonderful comments this weekend. People of \"great accomplishment\" should easily quality for prez. _E_\nCome on @DannyZuker take the bet show your friends and family (& your bosses on Modern Family) that you're not chicken shit _E_\nPresident Obama please take the $5M check for charity tomorrow. It is so easy and could do so much good! _E_\nThank you Hawaii! #Trump2016 _E_\nWatched chief negotiator for Iran on @charlierose last night. He is far smarter than our reps—increase sanctions and walk! _E_\nThis is an outrage! Bias Free Language Guide claims the word 'American' is 'problematic' WHAT?! __HTTP__ _E_\nLets fight like hell and stop this great and disgusting injustice! The world is laughing at us. _E_\nLets go Pennsylvania! #VoteTrump __HTTP__ _E_\nCongrats to great golfer @Frostpga on his big win last week. Always been best putter. Frost Wins for Trump _E_\nPay attention to details. If you don't know every aspect of what you're doing you're setting yourself up for some big surprises. _E_\nWhy didn't movie Lincoln use Ford's Theater for big scene instead of the stage of an unrelated theater? _E_\nCan't wait for tonight's debate actually delayed my trip to Europe so I can watch. 
This is going to be a great night. _E_\nThank you @gawker! Call me on my cellphone 917.756.8000 and listen to my campaign message. _E_\nDid anyone notice that Obama failed to get a coalition of other countries to go along with us. He couldn't even get Britain! NO LEADERSHIP. _E_\nOur government now imports illegal immigrants and deadly diseases. Our leaders are inept. _E_\nUnemployment has risen today and some other very bad news has just been reported the stock market is way down. _E_\nMy @piersmorgan interview on Snowden the traitor national security and China hacking us __HTTP__ _E_\nWatch the 2011 #MissUniverse Pageant tonight at 9PM on NBC... __HTTP__ _E_\n\"Results are what matter...A series of efforts will add up to experience and achievement.\" Think Like a Champion _E_\n#MakeAmericaGreatAgain #ImWithYou __HTTP__ __HTTP__ _E_\nDrain the Swamp should be changed to Drain the Sewer it's actually much worse than anyone ever thought and it begins with the Fake News! _E_\n.@georgewillf is perhaps the most boring political pundit on television. Got thrown off ABC like a dog. At Mar a Lago he was a total bust! _E_\n\"@NBCApprentice: And the fired celebrities are...\" __HTTP__ via @ew by @DaltonRoss _E_\nThank you New York I will never forget! _E_\nPlease help @autismspeaks with their petition to the White House for a national strategy for the autism epidemic __HTTP__ _E_\nIf you entered our country illegally and are then granted amnesty why would you abide by other laws? No Amnesty! _E_\nRape is a huge problem in the U.S. military. Over 19000 rapes last year. _E_\nWho will be the next @TheRealTeenUSA? Find out this Saturday at 8PM ET on missteenusa.com #TeenUSA _E_\nThe Democrats had to come up with a story as to why they lost the election and so badly (306) so they made up a story RUSSIA. Fake news! _E_\nBad sign for Obama's campaign now publicly admitting they are focused on 4 states. Their internals must be horrendous. 
_E_\nSo many people think I will not run for President.Wow I wonder what the response will be if I do. Even the haters and losers will be happy! _E_\nFrom @FoxNews Bombshell: In 2016 Obama dismissed idea that anyone could rig an American election. Check out his statement Witch Hunt! _E_\n.@GovernorPerry just gave a pollster quote on me. He doesn't understand what the word demagoguery means. _E_\nThank you to all of the men and women who have served our country. You are our true heroes! #ArmedForcesDay __HTTP__ _E_\n#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nIt is so nice that the shackles have been taken off me and I can now fight for America the way I want to. _E_\nIt's clear to me that @teresa_giudice needs some lessons in negotiation #sweepstweet _E_\nMaybe @THEGaryBusey should stick to words... vs. barking. He's got a definite talent when he wants to use it. #CelebApprentice _E_\nReady to get mad?! We are sending foreign aid to China our greatest threat __HTTP__ We are financing our enemy. _E_\nThank you Worcester Massachusetts!#MakeAmericaGreatAgain #Trump2016 __HTTP__ __HTTP__ _E_\nJust arrived in Wisconsin to discuss JOBS JOBS JOBS! #MAGA __HTTP__ _E_\n'Jeff Sessions a Fitting Selection for Attorney General' __HTTP__ _E_\nIt is time for the airline pilots flight attendants and the airlines themselves to stop flights to and from West Africa. Do it right now! _E_\n.@GovernorPataki couldn't be elected dog catcher if he ran again—so he didn't! _E_\nThe dishonest media will NEVER keep us from accomplishing our objectives on behalf of our GREAT AMERICAN PEOPLE!... __HTTP__ _E_\nVia @TIMEPolitics by @zekejmiller: \"Trump To Visit New Hampshire\" __HTTP__ _E_\nLooking forward to speaking @nranews Convention in Nashville __HTTP__ The 2nd Amendment is a right not a privilege! _E_\nI will be on @foxandfriends tomorrow morning at 7:15 Hope you enjoy and agree! _E_\nTHANK YOU to the amazing staff and their families of the United States Embassy in the Philippines. 
Keep up the GREAT WORK! __HTTP__ _E_\nWill be leaving the Philippines tomorrow after many days of constant mtgs & work in order to #MAGA! My promises are rapidly being fulfilled. _E_\nWhen will CNN do a segment on Hillary's plan to increase Syrian refugees 550% and how much it will cost? _E_\nWe have to get tough on China. For every one American child there are four Chinese. China is out to steal our (cont) __HTTP__ _E_\nThere was a major diplomatic breakthrough yesterday w/the White House Iran & China. All celebrated Chuck Hagel being voted in as SOD. _E_\nBig win in the House very exciting! But when everything comes together with the inclusion of Phase 2 we will have truly great healthcare! _E_\nCryin' Chuck Schumer stated recently I do not have confidence in him (James Comey) any longer. Then acts so indignant. #draintheswamp _E_\n#USAatUNGA #UNGA __HTTP__ _E_\nTerrible story on front page of NYTimes about lightweight @AGSchneiderman __HTTP__ Does Eric wear eyeliner? _E_\nBeing tough doesn't mean being nasty difficult or unreasonable. It means being tenacious and refusing to give in or give up. _E_\nThe Russia hoax continues now it's ads on Facebook. What about the totally biased and dishonest Media coverage in favor of Crooked Hillary? _E_\nNo amnesty. Protect the rule of law! Let's Make America Great Again __HTTP__ _E_\nAll NYC needs is the mentally unstable Elliot Spitzer in office again. _E_\nThe Establishment and special interests are absolutely killing our country. We must put #AmericaFirst. __HTTP__ _E_\nUS trade deficit hit $64B+ in April 2 yr record high __HTTP__ We must do better. China is ripping us. Bring the jobs home! _E_\n.@Betsy_McCaughey Thanks so much. Really appreciate your comments. I will help the veterans like no one else. __HTTP__ _E_\nWhy aren't the Democrats speaking about ISIS bad trade deals broken borders police and law and order. 
The Republican Convention was great _E_\nRT @realDonaldTrump: Democrats are far more concerned with Illegal Immigrants than they are with our great Military or Safety at our danger... _E_\nWhat's the primary ingredient for success? Passion. You have to love what you're doing or you won't get too far. _E_\nWow Eliot Spitzer has lost great news for New York City! _E_\n.@BillMoyers is a liberal hack whose career is being laid to rest @PBS. Here Moyers coddles @JeremiahWright __HTTP__ _E_\nSpeaker @johnboehner seems to have gained strength in house—a good thing! _E_\nWhy do we continue to sit idly by while China steals our national security and corporate secrets? China is an enemy not a friend. _E_\nWork is expected to begin today on my golf course in Scotland. It will be spectacular! __HTTP__ _E_\nDefense Sec.Hagel has quit. Great news for our country. The guy didn't have a clue—grossly outmatched by our enemies. Couldn't even speak _E_\nSee the amazing views from @TrumpGolfLA located directly on the Pacific Ocean __HTTP__ _E_\nAlmost every major dealmaker has used the bankruptcy laws as a business tool. Icahn Black Zell—but nobody says they went bankrupt! _E_\n\"Always bear in mind that your own resolution to succeed is more important than any other.\" – Abraham Lincoln _E_\nMany people look at successful people & don't see anything but the end result. They don't see all the work that went into getting there. _E_\n#WeeklyAddress __HTTP__ _E_\nSouth Korea is absolutely killing us on trade deals. Their surplus vs U.S. is massive and we pay for their protection. WHO NEGOTIATES? _E_\nToday I am here to offer a renewed partnership with America to work together to strengthen the bonds of friendship and commerce between all of the nations of the Indo Pacific and together to promote our prosperity and security. #APEC2017 __HTTP__ _E_\nGood morning Ohio! Some additional information from my daughter @IvankaTrump! 
#VoteTrump #SuperTuesday __HTTP__ _E_\n.@MacMiller's \"Donald Trump\" __HTTP__ just crossed 73.5 million views on @YouTube. You're welcome Mac! _E_\nThank you Brian France Bill Elliott @chaseelliott @DavidRagan & @RyanJNewman! #NASCAR #Trump2016 #VoteTrump __HTTP__ _E_\nHow bad has our leader made us look on Syria. Stay out of Syria we don't have the leadership to win wars or even strategize. _E_\nHope & Change since @BarackObama has taken office the US debt has increased by an average of $64K per taxpayer. _E_\nVia @CBNNews by @TheBrodyFile: Brody File Exclusive: Donald Trump Comes Out In Support Of 20 Week Abortion Ban __HTTP__ _E_\nRT @foxandfriends: Another Dem 'queasy' over claim of Loretta Lynch meddling in Clinton case __HTTP__ _E_\nDesigned by @jacknicklaus Trump Golf Links at Ferry Point's 18 hole course sits by the Bronx's Whitestone Bridge __HTTP__ _E_\nHow incompetent are our leaders allowing these Ebola infected people to come into our country with all of the problems and danger entailed! _E_\nThis is why @TimTebow is a winner. He lays everything out on the field. He never quits and never gives up. That's why he is a success. _E_\nWhy would Kim Jong un insult me by calling me old when I would NEVER call him short and fat? Oh well I try so hard to be his friend and maybe someday that will happen! _E_\nDon't miss the #MissUniverse Pageant tonight at 8/7c with performances by @NickJonas @PrinceRoyce and @GavinDeGraw __HTTP__ _E_\nA quote from the late great golfer Sam Snead: Practice puts brains in your muscles! THIS IS TRUE ALSO IN LIFE. _E_\nThank you Costa Mesa California! 31000 people tonight with thousands turned away. I will be back! #Trump2016 __HTTP__ _E_\nMyself with mother and father at New York Military Academy. See I can be very military. High rank!... __HTTP__ _E_\nThe Russians are playing a very smart game. In the meantime they are buying lots of time for Syria and making U.S. look foolish. Dangerous! 
_E_\nI am now going to the brand new Trump International Hotel D.C. for a major statement. _E_\nThank you Graham Ledger of the Daily Ledger @OANN for your really fair coverage and your great interview with Peter Roff of U.S. NEWS & W.R. _E_\nMust read article in @washtimes: @RealSheriffJoe probe could dwarf Watergate __HTTP__ _E_\nObamaCare premiums rising 13.2% in 2015 __HTTP__ Elections have consequences! _E_\nWhat is Frank VanderSloot getting for agreeing to back Marco Rubio? Last victim was Mitt Romney see how that turned out. _E_\n\"UPDATE: Trump plans public event at @WartburgCollege\" __HTTP__ via @wcfcourier: _E_\nI will miss Mike Wallace. He did a major interview with me for 60 Minutes and it was totally fair and balanced. (cont) __HTTP__ _E_\nMy @foxandfriends int. on the Zimmerman trial & verdict courage of the jury and reactions! __HTTP__ _E_\nThe habitual vacationer @BarackObama spent 9 days before the critical Super Committee deadline traveling. He failed to lead again. _E_\nWhy do the networks continue to put dopey @BillKristol on panels when he has called every single shot about me wrong for 2 yrs? _E_\n.@CNN is unwatchable. Their news on me is fiction. Theyare a disgrace to the broadcasting industry and an arm of the Clinton campaign. _E_\nThis Sunday's All Star Celebrity @ApprenticeNBC has the most beautiful boardroom judges ever w/ @IvankaTrump & @MELANIATRUMP together! _E_\nWhen will @TedCruz give all the New York based campaign contributions back to the special interests that control him. _E_\nGermany is going through massive attacks to its people by the migrants allowed to enter the country. New Years Eve was a disaster. THINK! _E_\nGet smart on knockout assaults and crime we have to be slightly more vicious (and violent) than the assaulter and crime would end FAST! _E_\nWe can't let this happen. We should march on Washington and stop this travesty. Our nation is totally divided! 
_E_\nNYC politicians better stop pandering ending stop & frisk would be a disaster. __HTTP__ _E_\nI'll be in Dallas at the American Airlines Center on Sept 14th at 6 PM. Will be great to be back in Texas. __HTTP__ _E_\nRT @EricTrump: Nevada: Reminder that today is the LAST day to register to vote in the February 23rd caucus! __HTTP__ __HTTP__ _E_\nGet to the essence immediately. Learn to economize. People appreciate brevity in today's world. Think Like a Champion _E_\nRequired reading 4 success in politics & life read @kimguilfoyle's book #MakingTheCase. Brilliant Advice ! __HTTP__ _E_\nWow President Obama just landed in Cuba a big deal and Raul Castro wasn't even there to greet him. He greeted Pope and others. No respect _E_\nThank you Washington! #Trump2016#MakeAmericaGreatAgain __HTTP__ _E_\nCongratulations to John Roberts for making Americans hate the Supreme Court because of his BS __HTTP__ _E_\nEntrepreneurs: Focus on your goals not on fixed patterns. Do what's necessary and what's unnecessary will be made clear. _E_\nAt some point and for the good of the country I predict we will start working with the Democrats in a Bipartisan fashion. Infrastructure would be a perfect place to start. After having foolishly spent $7 trillion in the Middle East it is time to start rebuilding our country! _E_\nMy @MorningJoe interview with @JoeNBC & @morningmika discussing the Newsmax @iontv debate and #TimeToGetTough __HTTP__ _E_\nWe need your support to get to the White House and defeat #CrookedHillary. Let's Make America Great Again! __HTTP__ _E_\nVia @DMRegister by @JenniferJJacobs: Trump: 'I would've won the race against Obama' __HTTP__ _E_\nSadly Democrats want to stop paying our troops and government workers in order to give a sweetheart deal not a fair deal for DACA. Take care of our Military and our Country FIRST! _E_\nThe so called 87 year old lady was a vicious and skilled investor who was trying to rip me off with made up facts and a blowhard lawyer. 
_E_\nDave Letterman @Late_Show said during my interview that Obama was probably born in the US the word probably is a disaster for Obama. _E_\nThis morning Chris Wallace has the best political show on television but that's only because I'm on it (kidding)! Have fun. _E_\n.@CBSNews Poll WOW! New Hampshire TRUMP 38% CARSON 12% BUSH 8% South Carolina TRUMP 40% CARSON 23% CRUZ 8% Iowa TRUMP 27% CARSON 27% _E_\nGeorge Will may be the dumbest(and most overrated) political commentator of all time. If the Republicans listen to him they will lose. _E_\nMorning Joe's weakness is its low ratings. I don't watch anymore but I heard he went wild against Rudy Giuliani and #2A sad & irrelevant! _E_\nWow @CNBC ratings are really low worst in many years. I guess I'll have to start doing my Tuesday morning interviews with them again! _E_\nObama just said @MittRomney was a very successful investor big mistake for Obama to admit he has less and less credibility. _E_\nWhat's more dangerous for the country the Iranian nuclear threat or @BarackObama as President? _E_\nAmazing view of @TrumpGolfLA __HTTP__ _E_\nMajor article in New York Times today discusses the cost of environmental damage in China and how it is RAPIDLY GROWNG! Rest of World pays. _E_\nObama' ststement on Egypt was terrible and dumb now being used by military as a rallying cry our foreign policy is worst in U.S. history. _E_\nIt was an honor to be the Grand Marshall in the Salute to Israel Parade back in 2004. __HTTP__ _E_\nI'd like to wish all of my friends and even my many enemies a very Merry Christmas and Happy New Year. _E_\nMAKE AMERICA SAFE AGAIN! __HTTP__ __HTTP__ _E_\nWill be in Chicago tomorrow for a record setting (by far) luncheon. _E_\n.@Franklin_Graham @BillyNungesser @SamaritansPurse so humbled by my time w/ you. You are in our thoughts & prayers. __HTTP__ _E_\nCongratulations Jim Herman! We are all proud of you @TrumpGolf! 
__HTTP__ _E_\nHad a very good call last night with the President of China concerning the menace of North Korea. _E_\nA vote for Clinton Kaine is a vote for TPP NAFTA high taxes radical regulation and massive influx of refugees. _E_\nRally last night in San Jose was great. Tremendous love and enthusiasm in the hall. Big crowd. Outside small group of thugs burned Am flag! _E_\nWow was Ted Cruz disloyal to his very capable director of communication. He used him as a scape goat fired like a dog! Ted panicked. _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\nLAWFARE: Remarkably in the entire opinion the panel did not bother even to cite this (the) statute. A disgraceful decision! _E_\nReally disgusting that the failing New York Times allows dishonest writers to totally fabricate stories. _E_\nAshley Judd's candidacy was created by Karl Rove's terrible ads even before she thought seriously about running... _E_\nDonald Trump Tells @theblaze About His Obama Announcement: PASSPORT APPLICATIONS TELL YOU A LOT __HTTP__ by @BillyHallowell _E_\nI just wrapped up a Q&A @TwitterNYC. Thanks for all your questions! #AskTrump __HTTP__ _E_\nIn less than a week I'll be honored by Sarasota GOP as Statesman of the Year & then give my big surprise to @RNC convention. Will be fun! _E_\nThe Muslim Brotherhood @BarackObama's allies in Egypt will cancel the Camp David Agreement. __HTTP__ What a disaster! _E_\nI know Mark Cuban well. He backed me big time but I wasn't interested in taking all of his calls.He's not smart enough to run for president! _E_\nRe Super PAC scam: What the other candidates are doing is a disgrace. _E_\nFLASHBACK – \"Donald Trump Answers Boy's Prayer for New Bike\" __HTTP__ via @FoxNewsInsider _E_\nThank you Greeley CO! REAL change means restoring honesty to the govt. Our plan will END govt. corruption! Watch:... __HTTP__ _E_\nJust watched Cookie Roberts on @ABC. Her predictions have been so wrong for so long that she has lost all credibility. 
Just another sad case _E_\nEric did a great job with his Eric Trump Foundation annual charity outing. I'm proud of him. __HTTP__ _E_\nGreat speech by my good friend @GovChristie. He did something you won't hear at @BarackObama's convention tell the truth. _E_\nThe economy is broken. Entrepreneurship is being suppressed. See what I do Wednesday 11 AM at Trump Tower atrium. _E_\nGary Sinise is doing tremendous work for veterans through his foundation—check it out @GarySiniseFound _E_\nCongratulations to @IsraeliPM @netanyahu on forming his new unity government. A major political success for the Jewish State of Israel. _E_\nThe Blue Monster at Trump National Doral recieved rave reviews from both players and architectural critics following the Cadillac WGC.Thanks _E_\nThank you Columbus Ohio! I will be back soon. #ImWithYou #MAGA __HTTP__ _E_\nHypocrite @BarackObama has major investments in companies that are outsourcing jobs overseas __HTTP__ _E_\nI am at Trump National Doral best resort in U.S. Rory and Adam Scott are doing great! Watch on NBC at 3:00 P.M. MAKE AMERICA GREAT AGAIN! _E_\nI am having a really hard time watching @FoxNews. _E_\nBroken borders $18T debt ObamaCare failing & over budget. Don't worry our president is still fundraising __HTTP__ Priorities _E_\nI'm at @WrestleMania tonight but will be doing a few tweets. I know the episode well.... #CelebApprentice _E_\nRT @foxandfriends: Trump fires new warning shot at McConnell leaves door open on whether he should step down __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nIt's time for Ted Cruz to either settle his problem with the FACT that he was born in Canada and was a citizen of Canada or get out of race _E_\nVia @BreitbartNews by @mboyle1: \"Obama's Amnesty Will Give Illegal Aliens Public Benefits\" __HTTP__ _E_\nEntrepreneurs: Negotiation is an art. Treat it like one. _E_\nSpoke to President of Mexico to give condolences on terrible earthquake. 
Unable to reach for 3 days b/c of his cell phone reception at site. _E_\nA strong military makes us respected by our allies & feared by our enemies. Let's Make America Great Again! __HTTP__ _E_\nYou have to learn the rules of the game. And then you have to play better than anyone else. Albert Einstein _E_\nVia @BBCNews: \"US property tycoon Donald Trump confirms Turnberry buy\" __HTTP__ _E_\nMichael Vick of the Philadelphia @eagles is a great athlete but not a great quarterback. _E_\nI would love to see the Republican party and everyone get together and unify.Video: __HTTP__ __HTTP__ _E_\nContinuous effort not strength or intelligence is the key to unlocking our potential. Winston Churchill _E_\nHis @BarackObama's specialties? Vacations and campaigning. Jobs not so much! _E_\n'Food Groups' – Emails Show Clinton Campaign Organized Potential VPs By Race And Gender: __HTTP__ _E_\nA great day in New Jersey for Trump! __HTTP__ & __HTTP__ _E_\n\"Donald Trump on Mark Levin: Karl Rove is one of the most overrated people in politics\" __HTTP__ via @TheRightScoop _E_\nIn my new book #TimeToGetTough I make a full financial disclosure detailing my net worth. __HTTP__ _E_\nMaking his case in a nice and articulate manner. _E_\nAmazing how the haters & losers keep tweeting the name \"F**kface Von Clownstick\" like they are so original & like no one else is doing it... _E_\nBe sure to watch The Apprentice tonight 10 p.m. on NBC it's an episode you won't forget! _E_\nRemember the most hated part of ObamaCare is the Individual Mandate which is being terminated under our just signed Tax Cut Bill. _E_\nIt is my great honor to be speaking at CPAC 2013. They are all about what's good for America. _E_\nAcross the battlefields oceans and harrowing skies of Europe and the Pacific throughout the war one great battle cry could be heard by America's friends and foes alike:\"REMEMBER PEARL HARBOR.\" __HTTP__ _E_\nDummy @BillMaher forgot to say that he made an absolute offer which I accepted. 
Hopefully charity gets $5M dollars. _E_\nJoin me in Phoenix Arizona today at 4pm! #Trump2016 #AmericaFirst __HTTP__ __HTTP__ _E_\nMust read editorial co written by @weeklystandard editor William Kristol & @NRO editor @RichLowry 'Kill the Bill' __HTTP__ _E_\nUnder President @BarackObama China has experienced unusually fast gains and America unusually fast losses. #TimeToGetTough _E_\nThe MOVEMENT in Portsmouth New Hampshire w/ 7K supporters. THANK YOU! This is the biggest election of our lifetime... __HTTP__ _E_\nRemember this: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_\nI can't believe the great @wjcarter got canned by @nytimes. He was a fantastic reporter & really knew entertainment. He will be missed! _E_\nFLASHBACK: \"Alex Salmond pleaded with Donald Trump to back release of Lockerbie bomber\" __HTTP__ @telegraphnews ... _E_\nI thought people weren't celebrating? They were cheering all over even this savage from Orlando. I was right. __HTTP__ _E_\nWill be interviewed by @SeanHannity on @FoxNews at 10:00pm tonight. Enjoy! _E_\nBig poll just out by @TheEconomist has me in 1st. place by a lot. A great honor but we have a long way to go to MAKE AMERICA GREAT AGAIN! _E_\nMy @IngrahamAngle interview discussing healthcare monopolies @MittRomney oil prices and @AnnRomney's birthday __HTTP__ _E_\nOur way of life is under threat by Radical Islam and Hillary Clinton cannot even bring herself to say the words. _E_\n.@foxandfriends Dems are taking forever to approve my people including Ambassadors. They are nothing but OBSTRUCTIONISTS! Want approvals. _E_\nVia @OceanDriveMag by @SuzMcGeeNYC: Q&A: Ivanka Trump on the Business of Golf & the Championships __HTTP__ _E_\nChange is the law of life. And those who look only to the past or present are certain to miss the future. John F. Kennedy _E_\nFinally held our first full @Cabinet meeting today. With this great team we can restore American prosperity and br... 
__HTTP__ _E_\nWhat is he reading? #Oscars _E_\nPresident should not be telling the Washington Redskins to change their name our country has far bigger problems! FOCUS on themnot nonsense _E_\nEnjoy the ratings of President Obama. __HTTP__ _E_\nI am on @foxandfriends now! Tune in! _E_\nJust left the #G7Summit. Had great meetings on everything especially on trade where.... _E_\nThe numbers at the @nytimes are so dismal especially advertising revenue that big help will be needed fast. A once great institution SAD! _E_\nThank you for such a wonderful and unforgettable visit Prime Minister @Netanyahu and @PresidentRuvi. _E_\nObama's Amnesty Executive Order can now be stopped by Majority Leader McConnell with riders. That's one reason we needed the Senate. _E_\nThis is a once in a generation opportunity to offer historic tax relief to the American people! Join me today: __HTTP__ __HTTP__ _E_\nBob Tyrrell @AmSpec—Thank you and also for the great work you do. _E_\nLet us give thanks for all that we have and let us boldly face the exciting new frontiers that lie ahead. Happy Th... __HTTP__ _E_\nSusan Rice is a good woman but Pres. O should not taunt the Republicans by appointing her S of S... _E_\nAre you expanding your business? Interview returning soldiers. Give them strong consideration. Their sacrifices deserve it. _E_\nA list from @Heritage: Top 10 Most Expensive Obamacare Taxes and Fees __HTTP__ _E_\nJustice Roberts turned on his principles with absolutely irrational reasoning in order to get loving press from (cont) __HTTP__ _E_\nWhat a year it's been and we're just getting started. Together we are MAKING AMERICA GREAT AGAIN! Happy New Year!! __HTTP__ _E_\nWelcome back @SteveScalise!#TeamScalise __HTTP__ _E_\nI would feel sorry for @JebBush and how badly he is doing with his campaign other than for the fact he took millions of $'s of hit ads on me _E_\nRT @EricTrump: Friends in #FL #OH #NC #IL & #MO we would be honored to have your #VOTE! 
#SuperTuesday #LetsDoThis #MakeAmericaGreatAgain #T... _E_\nIsn't it sad that lightweight Senator Bob Corker who couldn't get re elected in the Great State of Tennessee will now fight Tax Cuts plus! _E_\nMexican gov doesn't want me talking about terrible border situation & horrible trade deals. Forcing Univision to get me to stop no way! _E_\nNext year I will be changing the name of 800 acre Doral to Trump National Doral. It will be the best resort in the country—Miami is hot! _E_\nCrazy Dennis Rodman is saying I wanted to go to North Korea with him. Never discussed no interest last place on Earth I want to go to. _E_\nThe first meeting Jeff Sessions had with the Russian Amb was set up by the Obama Administration under education program for 100 Ambs...... _E_\nThank you Tampa Florida!#AmericaFirst #TrumpTrain __HTTP__ _E_\nI will be on Fox & Friends tomorrow morning at 7.ºº _E_\nWhat a STUPID deal for Verizon to buy AOL for $4.4 billion. AOL has been bad luck for everyone who touched it. Worth less than $1 billion! _E_\nWhy are we giving away our entire strategy and tactics we will deploy against ISIS? It puts our troops at a disadvantage. _E_\nRT @Bet22325450ste: @FoxBusiness @foxandfriends Come on America. Get on the Trump Train. The winners already have boarded! The losers are w... _E_\nNow @BarackObama is praising China's cooperation in negotiations over Chen Guangcheng __HTTP__ This is a sad episode for us. _E_\nDon't ever think you've done it all already or that you've done your best. That's a shortcut to undermining your own potential. _E_\nThe Jets should have let them score to get the number one draft pick who will be really good. It will just never change for them! _E_\nHillary Clinton doesn't have the strength or the stamina to MAKE AMERICA GREAT AGAIN! #AmericaFirst __HTTP__ _E_\nUnder President Trump unemployment rate will drop below 4%. Analysts predict economic boom for 2018! 
@foxandfriends and @Varneyco _E_\nCongratulations to @arsenioofficial on his new late night show! He will do really well. (It pays to win #CelebrityApprentice) _E_\nToday it was my great honor to meet with the Crown Prince of Bahrain at the @WhiteHouse. Bahrain and the United States are important partners.During the Crown Prince's visit he is advancing $9 BILLION in commercial deals including finalizing the purchase of F 16's... __HTTP__ _E_\nGlad to hear @InsideEdition has hired @_KatherineWebb to cover @SuperBowl. She will be absolutely terrific! Miss USA pageant is proud. _E_\nSnowden has given serious information to China and Russia anyone who thinks otherwise is a dope! He is a traitor who fled he knew the crime! _E_\n.@GOP need to face reality – not one of the illegal immigrants granted amnesty will vote Republican. _E_\nHow much is South Korea paying the U.S. for protection against North Korea???? NOTHING! _E_\nFrance is losing its businesses and wealth rapidly and day by day. _E_\nTake a tour of this amazing penthouse in Trump Park Avenue.... __HTTP__ _E_\nTHANK YOU ARIZONA! 20000 amazing supporters! Get out and #VoteTrump on Tuesday. I love you!#MakeAmericaGreatAgain __HTTP__ _E_\nLast time lightweight @JebBush tried to knock off @marcorubio he made a total fool of himself. If he doesn't do better this time he is out! _E_\nHow does a dummy like @billmaher get a television show & his ratings stink. You'd think @HBO could do a lot better. _E_\nVia @TWtravelnews by Robert Silk: \"Renovations make Trump's Doral a showcase once again\" __HTTP__ _E_\n'ICE OFFICERS WARN HILLARY IMMIGRATION PLAN WILL UNLEASH GANGS CARTELS & DRUG VIOLENCE NATIONWIDE'... __HTTP__ _E_\nThe terrorists cut off the heads of Americans and laugh then want to sell us the bodies for $1000000. We fight over sleep deprivation! 
_E_\nVia @BBCNews: \"Donald Trump visits his newly purchased Turnberry golf resort\" __HTTP__ _E_\nHopefully the violence & unrest in Charlotte will come to an immediate end. To those injured get well soon. We need unity & leadership. _E_\nScotland will be so lucky if this monstrosity is not built—I will tie them up in courts for years if necessary. _E_\nThis is going to be a special season truly great characters and cast. You will soon see! _E_\nThe Lincoln Day Dinner last night in Michigan was fantastic. Record attendance and tremendous enthusiasm I loved it! _E_\nDo you notice the Fake News Mainstream Media never likes covering the great and record setting economic news but rather talks about anything negative or that can be turned into the negative. The Russian Collusion Hoax is dead except as it pertains to the Dems. Public gets it! _E_\nGreat rally in Iowa! Such wonderful people. Traveling now with @SarahPalinUSA to Tulsa massive crowd expected! __HTTP__ _E_\nThis afternoon I'll be speaking with Neil Cavuto on Your World with Neil Cavuto 4 p.m. on FOX News. _E_\nThe highly neurotic Debbie Wasserman Schultz is angry that after stealing and cheating her way to a Crooked Hillary victory she's out! _E_\nAfter years of Comey with the phony and dishonest Clinton investigation (and more) running the FBI its reputation is in Tatters worst in History! But fear not we will bring it back to greatness. _E_\nMany people will be surprised at what is about to be released concerning @BarackObama's background. I for one won't be. _E_\nJames Comey leaked CLASSIFIED INFORMATION to the media. That is so illegal! _E_\nThe GOP Debate Scorecard: Donald Trump and Energy by Wayne Allyn Root. __HTTP__ _E_\nIn order to try and deflect the horror and stupidity of the Wikileakes disaster the Dems said maybe it is Russia dealing with Trump. Crazy! _E_\nI will be interviewed tonight at 7pm ET by @greta #OnTheRecord _E_\nHillary flunky who lost big. 
For the 100th time I never mocked a disabled reporter (would never do that) but simply showed him....... _E_\nWhat is Obama thinking? __HTTP__ _E_\nMany countries are cutting back big time on ugly industrial wind turbines. The energy is very inefficient & (cont) __HTTP__ _E_\nI want to thank @RealSheriffJoe for all of his help in our historic Arizona win. Could not have done it without you Joe! _E_\nRT @FoxNews: TONIGHT on Justice @JudgeJeanine talks to special guests @EricTrump and @LaraLeaTrump Tune in at 9p ET on Fox News Channe... _E_\nI don't want to hit Crazy Bernie Sanders too hard yet because I love watching what he is doing to Crooked Hillary. His time will come! _E_\nThe dying @NYDailyNews asked me to do an Editorial on the Central Park 5 ripoff & then they pretend it was my idea. Loser newspaper! _E_\nIrresponsible! In the last 6 months @BarackObama has held over 100 fundraisers and not a single meeting with his Job Council. _E_\n.@DennisDMZ Thanks for the nice words. You are fantastic! _E_\n\"Be objective and strive to be your own counselor. Listen to others but know the final decision is yours.\" – Think Like a Champion _E_\nThank you Eau Claire Wisconsin. #VoteTrump on Tuesday April 5th!MAKE AMERICA GREAT AGAIN! __HTTP__ _E_\nYou're never a loser until you quit trying. Mike Ditka _E_\nInstead of trash talking @PMIsrael on the world stage @BarackObama should be defending @Israel. _E_\nWow Vanity Fair was totally shut out at the National Magazine Awards it got NOTHING. Graydon Carter is a loser with bad food restaurants! _E_\nRT @Newsmax_Media: Trumps Warns of Obama Tipping Point that May Destroy America __HTTP__ via @Newsmax_media _E_\nJeb's brother George insisted on a $100000 fee and $20000 for a private jet to speak at a charity for severely wounded vets. Not nice! _E_\nWelfare's purpose should be to eliminate as far as possible the need for its own existence. – Pres. Ronald Reagan _E_\nWhat I am saying is stay out of Syria. 
_E_\nBy the US winning the Olympic medal count we proved that both the American spirit & talent is greater than a 1.4B population. USA! _E_\nThis just in re: FundAnything and producer Brad Wyman __HTTP__ _E_\nIf you think we have a problem with Social Security and Medicare now try taking in millions of new citizens all at once. _E_\nObamaCare continues to increase insurance premiums & raise record deductibles. New Congress must use every tool to defund. _E_\nIf you've got some problems today that's a good sign that's life. So give them some thought and make the most of the situation. _E_\nThe U.S. Coast Guard FEMA and all Federal and State brave people are ready. Here comes Irma. God bless everyone! _E_\nRT @foxandfriends: U.S. Air Force jets take off from Guam for training ensuring they can 'fight tonight' __HTTP__ _E_\n\"Success breeds success. The best way to impress people is through results.\" – Think Like a Billionaire _E_\n\"Sometimes by losing a battle you find a new way to win the war. The Art of The Deal _E_\n.@piersmorgan is back! Did I see @OMAROSA wince? #CelebApprentice _E_\nWhy wouldn't the @WSJ call for comment or clarification before writing an editorial which is so totally wrong. No wonder it is doing poorly! _E_\nRT @AmbJohnBolton: Our country & civilians are vulnerable today because @BarackObama did not believe in national missile defense. Let's nev... _E_\nThey only changed the term to CLIMATE CHANGE when the words GLOBAL WARMING didn't work anymore. Come on people get smart! _E_\nWord is that Crooked Hillary has very small and unenthusiastic crowds in Pennsylvania. Perhaps it is because her husband signed NAFTA? _E_\nRory Tiger Phil and Ernie will be fun to watch this weekend at Trump National Doral. _E_\nI would like to wish all fathers even the haters and losers a very happy Fathers Day. _E_\nEntrepreneurs: Let your actions show that you're the best. See each day as an opportunity to show you can do business at the highest level. 
_E_\nHillary and Sanders are not doing well but what is the failed former Mayor of Baltimore doing on that stage? O'Malley is a clown. _E_\nOne thing I will say about Rep. Keith Ellison in his fight to lead the DNC is that he was the one who predicted early that I would win! _E_\nDespite thousands of hours wasted and many millions of dollars spent the Democrats have been unable to show any collusion with Russia so now they are moving on to the false accusations and fabricated stories of women who I don't know and/or have never met. FAKE NEWS! _E_\nThanks @PiersMorgan. You're great! _E_\nIf you want to know how to prevail through tough circumstances then read The Art of the Comeback. _E_\n\"True courage is being afraid and going ahead and doing your job anyhow!\" General Norman Schwarzkopf _E_\n#MakeAmericaGreatAgain #TrumpRallyAL __HTTP__ _E_\nTo aspiring entrepreneurs: Be focused! Know your goals. Put everything you've got into what you're doing every single day. _E_\nI think everyone will like my new and very successful book Crippled America. Go get it and let me know what you think! _E_\nJust tried watching Modern Family written by a moron really boring. Writer has the mind of a very dumb and backward child. Sorry Danny! _E_\nRT @gatewaypundit: BREAKING POLL: Trump Gains 11 Points on Clinton Since March=> Now Leads Crooked Hillary 46 44 __HTTP__ vi... _E_\nThe forgotten men and women of our country will be forgotten no longer. From this moment on it's going to be #AmericaFirst _E_\nObama promised premiums would lower $2500/yr for family of 4. In truth healthcare will increase by $7450 __HTTP__ _E_\nFrom Donald Trump: Ivanka and Jared's wedding was spectacular and they make a beautiful couple. I'm a very proud father. _E_\nThe federal gov. has handled Sandy worse than Katrina. There is no excuse why people don't have electricity or fuel yet. 
_E_\nI'm right TPM is wrong @BarackObama did not issue a special statement for Christmas however he issued one (cont) __HTTP__ _E_\n.@CNN is so embarrassed by their total (100%) support of Hillary Clinton and yet her loss in a landslide that they don't know what to do. _E_\nNow @BarackObama's Vice Chief of Joint Staff is defending China while they cheat __HTTP__ Wrong course of action. _E_\nCongratulations to @gretawire on the 11 year anniversary of @FoxNews 'On the Record.' Always enjoy being interviewed by Greta. She's great. _E_\nNewly minted diplomat @dennisrodman is a completely different competitor in All Star @CelebApprentice. Dennis is a legend! _E_\nPresident Donald J. Trump Proclaims 5/14/2017 through 5/20/2017 as #PoliceWeek Proclamation... __HTTP__ _E_\nThe Dunes here are amazing and they're how I learned about geomorphology which is the study of movement landforms. We've had a great trip _E_\nI will be making a major announcement today at 12:30 pm PST at Trump International Hotel & Tower Las Vegas (cont) __HTTP__ _E_\nLeft New Hampshire for Turnberry in Scotland which I am renovating. This place is incredible! @TrumpTurnberry _E_\n.@realDonaldTrump on ISIS&OIL FIELDS! Saying it for years! @AndersonCooper you should acknowledge this! #Trump2016 __HTTP__ _E_\nLightweight reporter Alex Pareene @pareene is known as a total joke in political circles. Hence he writes for Loser Salon. @Salon _E_\nWhy does the media with a strong push from Crooked Hillary keep pushing the false narrative that I want to raise taxes. Exactly opposite! _E_\nPresident Reagan put it best: Welfare's purpose should be to eliminate as far as possible the need for its own existence. _E_\nTed Cruz is a cheater! He holds the Bible high and then lies and misrepresents the facts! _E_\nJoin me in Florida this Saturday at 5pm for a rally at the Orlando Melbourne International Airport!Tickets:... __HTTP__ _E_\nVia @gazettedotcom by James Q. 
Lynch: \"Trump to run typical caucus campaign 'but bigger'\" __HTTP__ _E_\n.@TraceAdkins is back—good news for Plan B. #CelebApprentice _E_\nIn Miami tracking @TrumpDoral's $250M renovations. Will be America's top resort. @PGATOUR just signed for 10 yr ext. __HTTP__ _E_\nWhatever the United States can do to help out in London and the U. K. we will be there WE ARE WITH YOU. GOD BLESS! _E_\nBy popular request I will be live tweeting during Celebrity Apprentice (Sunday 9 P.M.). _E_\nListen to my interview with @KathieLGifford at @PodcastOne __HTTP__ _E_\nFor all of my many Jewish friends Happy Passover. _E_\nWatch my video blog to see if your questions from my Facebook page were answered __HTTP__ _E_\n _E_\n _E_\nAfter all is said and done more is said than done. Aesop _E_\n.@ArsenioHall How quickly people forget but not me! You told me that without The Apprentice you could never have gotten your show Sad! _E_\nWhy can't the pundits be honest? Hopefully we are all looking for a strong and great country again. I will make it strong and great! JOBS! _E_\nSen. Kay Hagan voted for Amnesty & ObamaCare. She is a proven liberal who recklessly goes along with Obama. Vote @ThomTillis in November! _E_\nEntrepreneurs: Keep an open mind. Business is a creative endeavor. _E_\n.@ArsenioHall The only thing you don't mention in the nice Esquire piece about you is The Apprentice without which you would be nowhere! _E_\nNew polls out today are very good considering that much of the media is FAKE and almost always negative. Would still beat Hillary in ..... _E_\nAmazing playing with an ankle injury @Yankees Captain Derek Jeter tied Willie Mays last night for #10 on (cont) __HTTP__ _E_\nThe Massive Tax Cuts which the Fake News Media is desperate to write badly about so as to please their Democrat bosses will soon be kicking in and will speak for themselves. Companies are already making big payments to workers. Dems want to raise taxes hate these big Cuts! 
_E_\nObama & Democrat leaders did a great disservice by releasing the papers on torture. The world is laughing at us—they think we are fools! _E_\nIf @BarackObama had such a wonderful academic record why wouldn't he want to show it? _E_\n.@Macys stock just dropped. Interesting. So many people calling to say they are cutting up their @Macys credit card. Thank you! _E_\nThe EPA official who wants to crucify gas companies resigned __HTTP__ Good but his attitude is endemic in the EPA _E_\nYou can't compare anything to ObamaCare because ObamaCare is dead. Dems want billions to go to Insurance Companies to bail out donors....New _E_\nWhile Jeb Bush is cutting staff and salaries after having paid ridiculous amounts of money why did he pay so much in the first place? _E_\n\"If you don't have time to do it right when will you have time to do it over?\" John Wooden _E_\nEach time I see one of Anthony Weiner's television ads for mayor I ask what the hell is he doing just wasting money & time go get a job! _E_\nGetting the strong endorsement of the great coach Bobby Knight has been a highlight of my stay in Indiana. Big speech tomorrow with Bobby! _E_\nWow the highly respected Governor of Iowa just stated that Ted Cruz must be defeated. Big shoker! People do not like Ted. _E_\nGetting rid of the mortgage interest deduction would be a disaster for homeowners who have suffered enough! _E_\nFlashback: \"NYers were grateful when Donald Trump finished ahead of schedule and under budget the Wollman Rink\" __HTTP__ _E_\nThe weather has been so cold for so long that the global warming HOAXSTERS were forced to change the name to climate change to keep $ flow! _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\n.@Merck Pharma is a leader in higher & higher drug prices while at the same time taking jobs out of the U.S. Bring jobs back & LOWER PRICES! _E_\nWORKING TOGETHER we will defeat this #OpioidEpidemic & free our nation from the terrible affliction of drug abuse. 
__HTTP__ __HTTP__ _E_\nI'm giving away money! 11AM Trump Tower. Be there or be left behind! _E_\nConservatives have to be smart in the way we speak. Using crazy language that terifies seniors accomplishes (cont) __HTTP__ _E_\nStupid Arianna @huffingtonpost hired the man who ruined the once great NYTimes Business Section... _E_\nThe response has been fantastic actually overwhelming! Thank you! _E_\nGreat job by all law enforcement officers and Boston Mayor @Marty_Walsh. _E_\nAlways remember SOMETIMES YOUR BEST INVESTMENTS ARE THE ONES YOU DON'T MAKE! _E_\nBrande would have been fired immediately if she didn't raise $132000 a really large sum. Bret on the other hand raised very little... _E_\nEntrepreneurs: Keep your momentum. Without momentum a lot of great ideas go nowhere. _E_\nJust returned from Colorado. Amazing crowd! _E_\nGetting ready to make my speech at #KansasCaucus. A great honor! #MakeAmericaGreatAgain #Trump2016 _E_\nThe word is that Lance Armstrong will now implicate officials and others but who knows if he's telling the truth _E_\nIran will only get stronger in Iraq with the latest civil war. We should have taken the oil immediately after the invasion. _E_\nLooking forward to the debate tonight and will be tweeting live with very honest assessment. _E_\nWatch my appearance on @Morning_Joe great interview! __HTTP__ _E_\nAlong with a soaring bar of sky bound gold @TrumpLasVegas' pool deck overlooks the City of Lights __HTTP__ _E_\nGetting China to stop playing its currency charades can begin whenever we elect a president ready to take (cont) __HTTP__ _E_\n.@IanJamesPoulter Great going and almost as importantly your clothing line is selling well! _E_\nHillary Clinton colluded with the Democratic Party in order to beat Crazy Bernie Sanders. Is she allowed to so collude? Unfair to Bernie! _E_\nYet more evidence of a media rigged election: __HTTP__ _E_\nObama's Secret Service catastrophe has openly revealed a great lack of respect for our President. 
If they (cont) __HTTP__ _E_\nOver 50 women were interviewed by the @nytimes yet they only wrote about 6. That's because there were so many positive statements. _E_\nSo so so important MAKE AMERICA GREAT AGAIN! _E_\n\"Failures are expected by losers ignored by winners.\" @CoachJoeGibbs _E_\nLots of response to my comment on Diet Coke  let's face it it doesn't work just makes you hungry. _E_\nPlaying golf with Prime Minister Abe and Hideki Matsuyama two wonderful people! __HTTP__ _E_\nMany Democrats up for reelection in 2012 are skipping the DNC convention in Charlotte __HTTP__ Smart politics! _E_\nThank you to the @washingtonpost for the accurate and very discriptive story on my speech in Alabama last night. It was a great evening! _E_\nFirst the Ninth Circuit rules against the ban & now it hits again on sanctuary cities both ridiculous rulings. See you in the Supreme Court! _E_\nI don't believe in government picking winners or in the case of (@BarackObama) picking losers @MittRomney _E_\nMust read column for all young people: Obama's war on young voters who elected him __HTTP__ _E_\nFormer Homeland Security Advisor Jeh Johnson is latest top intelligence official to state there was no grand scheme between Trump & Russia. _E_\nTogether our task is to strengthen our families to build up our communities to serve our citizens and to celebrate AMERICAN GREATNESS as a shining example to the world.... __HTTP__ _E_\nHelp fight autism go to __HTTP__ website for __HTTP__ donations & government activation. _E_\nThe ABC/Washington Post Poll even though almost 40% is not bad at this time was just about the most inaccurate poll around election time! _E_\nIn the heart of the city Trump International Toronto is the city's most elite property __HTTP__ True luxury at its finest. _E_\nJosh Brolin a friend of mine was terrific in Men in Black. Congrats! _E_\n.@JebBush was terrible on Face The Nation today. Being at 2% and falling seems to have totally affected his confidence. 
A basket case! _E_\nThere are only 22 days for @BarackObama to drop @JoeBiden. Obama is not a loyal guy. I think he is strongly considering it. _E_\nThe Republicans are always worried about the press they should just do what is right. _E_\nISIS gained tremendous strength during Hillary Clinton's term as Secretary of State. When will the dishonest media report the facts! _E_\nEntrepreneurs: Don't ever think you've done it all already or that you've done your best. Don't sell yourself short! _E_\nI wonder if Marshawn Lynch will now speak and call some coach a moron for not allowing him to run the ball three times for one yard? _E_\nPlayef golf today with Prime Minister Abe of Japan and @TheBig_Easy Ernie Els and had a great time. Japan is very well represented! _E_\nThe atrium of @TrumpTowerNY dressed up for Christmas __HTTP__ _E_\nGoing to Salt Lake City Utah for a big rally. Lyin' Ted Cruz should not be allowed to win there Mormons don't like LIARS! I beat Hillary _E_\nI'm leading by big margins in every poll but the press keeps asking would you ever get out? They are just troublemakers I'm going to win! _E_\nRev. Wright called @BarackObama on tape a liar. Why isn't this being looked into? It would be a great commercial for the republicans. _E_\nReally interesting President Obama was quick to shut down flights to Isreal but is totally unwilling to shut down flights from West Africa! _E_\nIf Saudi Arabia which has been making one billion dollars a day from oil wants our help and protection they must pay dearly! NO FREEBIES. _E_\nPaul Teutul is always good on the show. #CelebApprentice _E_\nDerek Jeter broke ankle one day after he sold his apartment in Trump World Tower. _E_\nWe have a sacred duty to care for our vets and their families. Veterans deserve universal access to care anywhere and anytime! _E_\nThe Fake News Media will not talk about the importance of the United Nations Security Council's 15 0 vote in favor of sanctions on N. Korea! 
_E_\nThank you Gettysburg Pennsylvania! #DrainTheSwamp __HTTP__ _E_\nRT @dmartosko: This is the #NYTimes. Can you understand why so many reporters are cautious about working for them? __HTTP__ _E_\nLove the people of South Carolina look very much forward to the debate tonight. _E_\nIf a new HealthCare Bill is not approved quickly BAILOUTS for Insurance Companies and BAILOUTS for Members of Congress will end very soon! _E_\nThe #USSJohnFinn will provide essential capabilities to keep America safe. Our sailors are the best anywhere in the world. Congratulations! __HTTP__ _E_\nI hope that Crooked Hillary picks Goofy Elizabeth Warren sometimes referred to as Pocahontas as her V.P. Then we can litigate her fraud! _E_\nI will unveil my first campaign ads on @Morning_Joe at 6:30am tomorrow. Enjoy! #MakeAmericaGreatAgain _E_\nThe new season of the Celebrity Apprentice begins Feb. 12 be prepared for the best season yet! __HTTP__ _E_\nDerek Jeter is playing phenomenal baseball. He is a total winner and also a great guy. @DerekJeter _E_\nI make good deals. That's what I do. I would make great deals for our country. my @SRQRepublicans speech _E_\nI have fun I love what I do. You should too. Find out how at the National Achievers Conference this October in London __HTTP__ _E_\nALABAMA get out and vote for Luther Strange he has proven to me that he will never let you down! #MAGA _E_\nHappy Thanksgiving to all. Have a great day and look forward to the future. We will MAKE AMERICA GREAT AGAIN! _E_\nI will be on with @BretBaier tonight at 6PM. #Trump2016 _E_\nThe Trumping of Turnberry via Links Magazine @TrumpTurnberry __HTTP__ _E_\nWatch Donald Trump's recent appearance on The Late Show with David Letterman: __HTTP__ _E_\nI will be interviewed by @oreillyfactor tonight on @FoxNews at 11pm. Enjoy! 
_E_\nGreat @foxbusiness interview with @EricTrump on @TeamCavuto discussing the real estate economy & 2016 __HTTP__ _E_\nTrump buys mansion adjacent to family winery __HTTP__ via @trdny _E_\nWhen will we stop wasting our money on rebuilding Afghanistan? We must rebuild our country first. _E_\nJoin me LIVE at 5:45pmE from Harrisburg Pennsylvania! #TaxReform #USA __HTTP__ __HTTP__ _E_\nWe only want to admit those who love our people and support our values. #AmericaFirst _E_\nWill be on @foxandfriends tomorrow morning at 7:00. _E_\nDon't forget episodes 2 and 3 of @ApprenticeNBC are on tonight at 8PM and 9PM on @NBC. _E_\nMore on Benghazi cover up: \"ATTORNEY FOR WHISTLEBLOWER: 400 U.S. MISSILES STOLEN IN BENGHAZI\" __HTTP__ Really bad. _E_\nWeiner is gone Spitzer is gone next will be lightweight A.G. Eric Schneiderman. Is he a crook? Wait and see worse than Spitzer or Weiner _E_\nObama wants to unilaterally put a no fly zone in Syria to protect Al Qaeda Islamists __HTTP__ Syria is NOT our problem. _E_\nThe Fake News is becoming more and more dishonest! Even a dinner arranged for top 20 leaders in Germany is made to look sinister! _E_\nIf you don't publicize your successes your competitors will be sure to belittle them. Get the word out! _E_\nObama is now warning North Korea on the Yongbyon nuclear reactor __HTTP__ After Syria our enemies are laughing! _E_\nGo out and buy CRIPPLED AMERICA: How to Make America Great Again. Doing really well. Great Thanksgiving or Christmas present! _E_\nCan't wait for @VanityFair to fold which under Graydon Carter will be sooner rather than later. _E_\nHmmm...can you imagine me speaking at the RNC Convention in Tampa? __HTTP__ That's a speech everyone would watch. _E_\nRT @realDonaldTrump: More and more people are suggesting that Republicans (and me) should be given Equal Time on T.V. when you look at the... _E_\nA great story in the New York Post really well written! 
__HTTP__ _E_\nYes it is true Carlos Slim the great businessman from Mexico called me about getting together for a meeting. We met HE IS A GREAT GUY! _E_\nToday we witnessed an incredible moment in history – the presentation of Congress' highest civilian honor to our friend and true AMERICAN HERO Bob Dole. #CongressionalGoldMedal __HTTP__ _E_\n.@FoxNews from multiple sources: There was electronic surveillance of Trump and people close to Trump. This is unprecedented. @FBI _E_\nCanadians: My ultra luxury private plane will be featured on Sunday's episode of #MightyPlanes on @DiscoveryCanada don't miss it at 8 ET! _E_\nHillary Clinton just lost every Republican she ever had including Never Trump all farmers & sm. biz by saying she'll tax estates at 65%. _E_\nThinking big is the driving force that has forged all the great achievements in modern life. Think Big _E_\nAll time hit leader Pete Rose should now be in the Baseball Hall Of Fame. He has paid his penalty! _E_\nMAKE AMERICA SAFE AND GREAT AGAIN! #TrumpPence16 __HTTP__ __HTTP__ _E_\nStrange why didn't @BarackObama hold any special event to celebrate the 2 year anniversary of ObamaCare? __HTTP__ _E_\nThank you! #MakeAmericaGreatAgain __HTTP__ _E_\nJerry Falwell Jr. stated speech was best in University's history...my great honor. _E_\nThe more I get to know @MittRomney the more I like him. He has the judgment and private sector experience America needs in the White House. _E_\nYou have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_\nThank you @DonaldJTrumpJr & @EricTrump. #Trump2016 __HTTP__ _E_\nThe failing @nytimes finally gets it In places where no insurance company offers plans there will be no way for ObamaCare customers to.. 
_E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nAIR FORCE TRUMP: AHEAD OF 2016 THE DONALD SLAMS ROMNEY BUSH IN SOUTH CAROLINA __HTTP__ via @BreitbartNews by @mboyle1 _E_\nNow that the ineffective Baltimore Police have allowed the city to be destroyed are the U.S. taxpayers expected to rebuild it (again)? _E_\nWow great ratings for @ApprenticeNBC __HTTP__ Don't forget watch 2 new episodes tonight at 8PM on @NBC. _E_\nReceived a standing ovation in packed house @MorningsideEdu after Sam Clovis intro! Let's Make America Great Again! __HTTP__ _E_\n\"Donald Trump: I'm Not Buying the @BrooklynNets\" __HTTP__ via @TMZ_Sports _E_\nBased on the fact that the very unfair and unpopular Individual Mandate has been terminated as part of our Tax Cut Bill which essentially Repeals (over time) ObamaCare the Democrats & Republicans will eventually come together and develop a great new HealthCare plan! _E_\nLas Vegas' most elite destination @TrumpLasVegas' has 64 stories of golden glass & offers ultimate luxury __HTTP__ _E_\nAchievers move forward at all times. Don't tread water. Get out there and go for it. _E_\nRT @ABC: Pres. Trump: We cannot be defined by the evil that threatens us or the violence that incites such terror. __HTTP__ _E_\nDoing Fox & Friends at 7 A.M. _E_\nI wish the @WSJ Wall Street Journal had reported the just out @CNN Iowa Poll correctly. I lead by a wide margin13 points going up big! _E_\nRead an excerpt from Think Like A Champion by Donald J. Trump: __HTTP__ _E_\nJoin me in Rome NY tomorrow!#Trump2016 #NYPrimaryTickets available: __HTTP__ _E_\nI wish President @BarackObama the best of luck in his second term... _E_\nGovernor Alejandro García Padilla said presidential hopeful Sen. Marco Rubio \"is no friend of Puerto Rico. __HTTP__ _E_\nOhio Senator @RobPortman: @MittRomney knows how to return prosperity: __HTTP__ #Mitt2012 #tcot _E_\nI believe that President Obama is so overwhelmed by what is happening in the U.S. 
and throughout the World that he has totally given up! _E_\n.@Yankees Kevin Youkilis is off to a terrific start. He's less than half the price and a much better player than a drug free A Rod. _E_\nVia @BreitbartNews by @mboyle1: TRUMP: OBAMA SHOULDN'T ATTACK AMERICANS OVERSEAS HILLARY'S EMAIL WAS 'CRIMINAL __HTTP__ _E_\nDon't let up keep getting out to vote this election is FAR FROM OVER! We are doing well but there is much time left. GO FLORIDA! _E_\nAnother Crooked Hillary Fan! __HTTP__ _E_\nEven more @BarackObama crony capitalism & corruption. We are guaranteeing a $105M loan to another Obama donor __HTTP__ _E_\n...Trump however would kick his ass! _E_\n...LaVar you could have spent the next 5 to 10 years during Thanksgiving with your son in China but no NBA contract to support you. But remember LaVar shoplifting is NOT a little thing. It's a really big deal especially in China. Ungrateful fool! _E_\n\"We need more grown ups in Washington people who will shoot straight and level with the American people.\" #TimeToGetTough _E_\nI'm sure the media will not report the highly respected new national poll that just came out via The Economist. 32%! __HTTP__ _E_\nGetting ready to go to Iowa today. Big crowd will be a great day! _E_\nThreatening phone calls from Obama supporters are being made to the Michigan GOP office __HTTP__ Don't be intimidated! _E_\n#AskTrump Send me your questions to answer live from @TwitterNYC later this afternoon. _E_\nWe create success or failure on the course primarily by our thoughts. Gary Player _E_\nThank you! #Trump2016 __HTTP__ _E_\nToday it was my great honor to sign the largest TAX CUTS and reform in the history of our country. Full remarks: __HTTP__ __HTTP__ _E_\nWill be on Fox & Friends tomorrow morning at 7.00 hope you enjoy! _E_\nColorado Trump Delegates Scratched from Ballots at GOP Convention __HTTP__ _E_\nAny American who fights with ISIS should have their passport revoked. Take them to Gitmo for interrogation. 
_E_\nWow Obama Care just got delayed by over a year because it is so complicated it cannot be understood the beginning of the end! _E_\nJEB is a hypocrite! Used massive private Eminent Domain Just another clueless politician! __HTTP__ _E_\nPenn State is doing a poor job in bringing its mess to a close.They should be ashamed for hiding Sandusky's crimes all these years... _E_\nRT @opinionsamerica: @realDonaldTrump Strong administration leads to a strong response. _E_\nThe law requires individuals pay 15% on carried interest. Why would a potential President pay more than he or she is supposed to? _E_\n.@Yankees manager Joe Girardi is a gritty leader who stands up for his players. Doing a great job! _E_\nObama is going to take away over 90M Americans' healthcare plans but he is letting Iran keep its nukes. Just think about that. _E_\nTrump National Golf Club Los Angeles offers 18 holes fronting the Pacific Ocean on the Palos Verdes Peninsula. __HTTP__ _E_\nThe Fake News Media hates when I use what has turned out to be my very powerful Social Media over 100 million people! I can go around them _E_\nSee you tomorrow Dutchess County New York! #NYPrimary #TrumpTrain __HTTP__ __HTTP__ _E_\nBig TAX REFORM AND TAX REDUCTION will be announced next Wednesday. _E_\nLifting off right now for U.S.S. Wisconsin in Norfolk. See ya' _E_\nNext year @TomBrokaw should be the comedian at the White House Correspondents' dinner. The only problem is that... __HTTP__ _E_\nToday we together won the Republican Nomination for President! __HTTP__ _E_\nObama has changed the Census so \"it will be difficult to measure the effects\" of O'Care __HTTP__ REAL data hidden _E_\nThe Inspector General's report on Crooked Hillary Clinton is a disaster. Such bad judgement and temperament cannot be allowed in the W.H. _E_\nI will be on Fox & Friends tomorrow morning at 7.00. Ebola and ISIS will be topics. 
_E_\n.@WhiteHouse #CEOTownHall __HTTP__ __HTTP__ _E_\nStatement on Preventing Muslim Immigration: __HTTP__ __HTTP__ _E_\n.@mcuban you were excellent on Howard Stern...thanks for the nice comments about my kids...yours are winners also! _E_\n#trumpvlog My thoughts on gasoline prices skyrocketing...... __HTTP__ _E_\nOnly the Obama WH can get away with attacking Bob Woodward. _E_\nTo all of my twitter followers please contribute whatever you can to the campaign. We must beat Crooked Hillary. __HTTP__ _E_\n.@TrumpDoral's golf courses The Red Tiger The Silver Fox & The Golden Palm are on track to open later this year __HTTP__ _E_\nThe attack on our Libyan consulate was the worst attack on the US since 9/11. Time for Obama to come clean. _E_\nWe have a MASSIVE trade deficit with Germany plus they pay FAR LESS than they should on NATO & military. Very bad for U.S. This will change _E_\nI'm loyal to people who've done good work for me. #TheArtofTheDeal _E_\nIs it a coincidence that the Middle East has blown up since Obama became president? _E_\nCongratulations to @IvankaTrump on being named the @FoxNewsSunday Power Player of the Week __HTTP__ _E_\nI started my business with very little and built it into a great company with some of the best real estate assets in the World. Amazing! _E_\nVia @IBTimes: Miss Universe 2013: Contestants Stun in Gorgeous Gowns at National Gift Auction Gala __HTTP__ _E_\nVia @USATODAY: \"Trump endorses Wintour for ambassadorship\" __HTTP__ _E_\nEnjoyed watching @MonicaCrowley's analysis of my @BillOreilly interview. Great points! Thank you Monica. _E_\nThe CPAC speech went really well this morning first speaker standing ovation. I really enjoyed it. _E_\nOhio is losing jobs to Mexico now losing Ford (and many others). Kasich is weak on illegal immigration. We need strong borders now! _E_\nThe media can track down @PaulRyan's old girlfriend and marathon time but can't find @BarackObama's college applications or other info. 
_E_\nRemember I am self funding my campaign the only one in either party. I'm not controlled by lobbyists or special interests only the U.S.A.! _E_\nAfter years of long stops then starts why did dopey Eric Scheiderman tell people in The Trump Org. this case is going awaywe have no case _E_\nLooking forward to visiting @SimpsonCollege on Wednesday to discuss education. Common Core is an attack on individual & local rights! _E_\nThe @Yankees should immediately stop paying A Rod—he signed his contract without telling them he was a druggie. _E_\nSeven people shot and killed yesterday in Chicago. What is going on there totally out of control. Chicago needs help! _E_\nEven Jimmy Carter just released a statement saying that Obama doesn't have a clue. That has to be a new low! _E_\nRT @IngrahamAngle: \"Far right\"? You mean \"right so far\" as in @realDonaldTrump has been right so far abt how to kick the economy into high... _E_\nTotal fool @KarlRove is part of the Republican Establishment problem. An all talk no action dummy! __HTTP__ _E_\nTHANK YOU Grand Rapids Michigan! Time to end political correctness & secure our homeland! __HTTP__ __HTTP__ _E_\n\"Great effort springs naturally from great attitude.\" Pat Riley _E_\nChina is robbing us blind in trade deficits and stealing our jobs yet our leaders are claiming 'progress' __HTTP__ SAD! _E_\nThank you to @foxandfriends for the great review of the speech on immigration last night. Thank you also to the great people of Arizona! _E_\nHey @glennbeck see how I beat your boy Ted in your own Blaze poll? Your endorsement means nothing! #GOPDebate _E_\nI am not angry at Russia (or China) because their leaders are far smarter than ours. We need real leadership and fastbefore it is too late _E_\nRT @TeamTrump: We agree with Bill ObamaCare is \"the craziest thing in the world.\" #BigLeagueTruth #Debates2016 __HTTP__ _E_\nThank you America! 
#Trump2016 __HTTP__ __HTTP__ _E_\nCryin' Chuck Schumer fully understands especially after his humiliating defeat that if there is no Wall there is no DACA. We must have safety and security together with a strong Military for our great people! _E_\nVia @NewYorkObserver by @Bshapiro91: \"Donald Trump @MelRivers Headline @Algemeiner Gala\" __HTTP__ _E_\nObama is making speeches excoriating the Republicans and they never answer back. Why aren't they fighting? _E_\nGreat job on @donlemon tonight @kayleighmcenany @cherijacobus begged us for a job. We said no and she went hostile. A real dummy! @CNN _E_\nI will be live tweeting the V.P. Debate. Very exciting! MAKE AMERICA GREAT AGAIN! _E_\nWow ISIS has just taken the City of Ramadi in Iraq. So many of our great soldiers died in originally going after it. Such a waste. _E_\nOur spectacular ballroom under construction at the great Turnberry resort in Scotland. __HTTP__ _E_\nClinton's Top Aides Were Mired In Conflict Of Interest At The State Department: __HTTP__ #BigLeagueTruth _E_\n.@yuSiddiqui @piersmorgan @rustyrockets I got much better—no contest—I got Melania! _E_\nMy thoughts on Joe Paterno and political analysts in today's #trumpvlog... __HTTP__ _E_\nYou should give the money back @HillaryClinton! #DrainTheSwamp __HTTP__ _E_\nWeird why did BarackObama Sr. fail to list @BarackObama as his son in his 1961 INS application? __HTTP__ _E_\nIllegal use of official Attorney General stationary by lightweight @AGSchneiderman. __HTTP__ _E_\nWelcome to the new Egypt Muslim Brotherhood representatives who won't take questions from Israeli journalists __HTTP__ _E_\nWhat my father really gave me is a good (great) brain motivation and the benefit of his experience unlike the haters and losers (lazy!). _E_\nThe tragedy in South Carolina is incomprehensible. My deepest condolences to all. _E_\nI'm not hearing much from Obama or his administration about my $5M offer to charity or to which charity the money will go. 
_E_\nWhat a convenient mistake: @BarackObama issued a statement for Kwanza but failed to issue one for Christmas. __HTTP__ _E_\nI will be on @foxandfriends tomorrow morning at 7.00. Will be talking about sleazebag Jonathan Gruber ( Americans are stupid ) & exec order _E_\nWhile the Fake News loves to talk about my so called low approval rating @foxandfriends just showed that my rating on Dec. 28 2017 was approximately the same as President Obama on Dec. 28 2009 which was 47%...and this despite massive negative Trump coverage & Russia hoax! _E_\nReally looking forward to my address @CPACnews this Friday morning at 8:30. Will stress jobs etc. Can't wait to see my many friends. _E_\nSanders says he wants to run against me because he doesn't want to run against me. He would be so easy to beat! _E_\nCongratulations to my friend @limbaugh on being named to the Hall of Famous Missourians. Rush is a great guy & a great character. _E_\nThanks Mark will be fun. __HTTP__ _E_\nA great and important day at the United Nations.Met with leaders of many nations who agree with much (or all) of what I stated in my speech! _E_\nMany people voted for Cruz over Carson because of this Cruz fraud. Also Cruz sent out a VOTER VIOLATION certificate to thousands of voters. _E_\nMy friend @ChristianJosi is making a very special LP. Follow him. Conservative leader by day likely 2015 GRAMMY winner by night. #LEGENDS _E_\nOh no another rapper doing a Trump song Young Jeezy Trump Lyrics. Why aren't these guys paying me? _E_\nWas Susan Rice told to lie about Bergdahl? Obama and his representatives lie about virtually everything from ObamaCare to a deserter. _E_\nUSMC Andrew Tahmooressi should be freed immediately. He never should have been jailed in the first place. Weak leaders. #FreeOurMarine _E_\nAdrian was recognized on a Disney cruise and has had many photo requests in @TrumpTowerNY. We have a new celebrity! 
#CelebApprentice _E_\nPageant people are really talking about Venezuela Brazil Mexico USA India Australia. _E_\nWhat a great four days in Cleveland. So proud of the great job done by the RNC and all. The police and Secret Service were fantastic! _E_\nOn 800 pristine Miami acres @TrumpDoral boasts luxurious accommodations world class dining & championship golf __HTTP__ _E_\nWatch @CNN at 9:00 A.M. @jaketapper. Then interviewed on @ABC @GStephanopoulos at 10:00 A.M. and then at 10:30 A.M. watch Face The Nation. _E_\nBecause the ban was lifted by a judge many very bad and dangerous people may be pouring into our country. A terrible decision _E_\nIt is time to remember that... __HTTP__ _E_\n.@natalie_gulbis Thank you for your support this morning on @GolfChannel. Even more importantly play well this week! Say hi to all. _E_\nToday we just passed 1.4 million twitter followers.. _E_\nI will renegotiate NAFTA. If I can't make a great deal we're going to tear it up. We're going to get this economy running again. #Debate _E_\nMy @eonline interview discussing @_KatherineWebb's stardom and why @espn's apology was unwarranted __HTTP__ _E_\nDegenerate former Congressman Anthony Weiner is trying to make a comeback. He is a sick & perverted man that New York does not want or need. _E_\nFor the nonbeliever here is a photo of @Neilyoung in my office and his $$ request—total hypocrite. __HTTP__ _E_\nThank you for your interest & support during last nights #GOPDebate! #IACaucus finder: __HTTP__ __HTTP__ _E_\n... ...Do your research before donating this holiday season! _E_\nMy wife Melania Trump's show was a tremendous success last night. In case you missed her you can see her again tonight on @QVC at 7 pm ET _E_\nRT @foxandfriends: Trump vows U.S. 'power' will meet North Korean threat __HTTP__ _E_\nI would like to express my warmest regards best wishes and condolences to all of the families and victims of the horrible bombing in NYC. 
_E_\nA very big poll is coming out at 6 PM in New Hampshire. Will be very interested in the results. _E_\nAlways great to speak with Veterans our nation's heroes. We will Make America Great Again! __HTTP__ _E_\n'CNBC Time magazine online polls say Donald Trump won the first presidential debate' via @WashTimes. #MAGA __HTTP__ _E_\nWow Huffington Post just stated that I am number 1 in the polls of Republican candidates. Thank you but the work has just begun! _E_\nThe Apprentice was the #1 show on television last season on Sunday from 10 to 11 congratulations Donald! _E_\nScots should boycott Glenfiddich garbage for not choosing great Olympic & U.S. Open champ Andy Murray over total loser Michael Forbes. _E_\nI will stand with police and protect ALL Americans! #Debates2016 #MAGA __HTTP__ _E_\nThank you Atlanta Georgia! Will be back soon! #AmericaFirst __HTTP__ _E_\nA massive blow to Obama's message only 38000 new jobs for month in just issued jobs report. That's REALLY bad! _E_\nTrue thanks. __HTTP__ _E_\nAMERICA will once again be a NATION that thinks big dreams bigger and always reaches for the stars. YOU are the ones who will shape America's destiny. YOU are the ones who will restore our prosperity. And YOU are the ones who are MAKING AMERICA GREAT AGAIN! #MAGA __HTTP__ _E_\nWe had a GREAT year @Macys with ties shirts and suits thanks! New selections just arrived they are amazing! _E_\nYoung entrepreneurs – keep positive. Don't let the ObamaCare disaster stop your endeavors. There are great opportunities out there. _E_\nSure @BarackObama's literary agent claims the 1991 booklet was a 'mistake' __HTTP__ Pretty convenient. _E_\nWhen will @AlexSalmond realize that he's destroying Scotland the most beautiful countryside in the world w/ his stupid wind turbines? _E_\nMike Huckebee a great guy said the President should appoint me Treasury Secretary. China and OPEC would not be happy. 
_E_\nForty six million Americans more than at any time ever in the history of this country now live under the poverty line. #TimeToGetTough _E_\nThis is no act of love as Jeb Bush said... __HTTP__ _E_\nI'll be on @foxandfriends Monday at 7:30 AM. Be sure to tune in. _E_\n\"Trump: 'I like North Carolina we are looking at another deal'\" __HTTP__ via @WSOC_TV _E_\nVia @WSJPolitics by @reidepstein: \"Trump Surges in Popularity in N.H.\" __HTTP__ _E_\n.@WineEnthusiast's highest rated wine in Virginia @trumpwinery is the premier name in sophistication and quality __HTTP__ _E_\nVia @Newsmax_Media by Courtney Coren: Trump: China Gets Iraq Oil US Gets Nothing __HTTP__ _E_\nmy presidency. Isn't this a ridiculous shame? He loves these kids has raised millions of dollars for them and now must stop. Wrong answer! _E_\n...Trump International Hotel Las Vegas and Trump International Hotel & Tower Waikiki Beach Walk. __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nLooking forward to the @CadillacChamp at Trump Doral next week 3.6. 3.10. Can't wait to meet the attendees. #WGCDoral _E_\nSo much SPIRIT in LA! Thank you to all of our HEREOS who saved many lives. An honor to spend time w/ @NationalGuard #LEOs & the #CajunNavy! __HTTP__ _E_\nVia @Newsmax_Media: Romney Said Nothing Wrong __HTTP__ _E_\nWatch Melania on QVC this morning from 10 a.m. to 11 a.m. with her third line of her Melania Timepieces & Jewelry collection... _E_\nIf Ebola is so non contagious how come an NBC cameraman caught it so quickly while over in West Africa? U.S. is behaving very foolishly! _E_\nOscar Pistorious is guilty as hell! _E_\n.@CarlyFiorina Carly—I did graduate from Wharton and did very well. Who is your fact checker? Will you apologize? _E_\nEric Trump on @JudgeJeanine on @FoxNews now! _E_\n...while her charity is getting less than 5 cents per donated dollar. She should be ashamed! 
_E_\nThe 2012 budget deficit is already $93 billion larger than earlier estimates __HTTP__ @BarackObama (cont) __HTTP__ _E_\nJodi should try but the Govt. should not make a deal no jury could be dumb enough to let her off (but you never know look at OJ & others) _E_\nIf you experience any harassment or heckling at the polling places from Obama supporters make sure you report it immediately. _E_\nMy @SquawkCNBC interview discussing why @MittRomney is a great nominee gas prices and why George Will is a loser. __HTTP__ _E_\nYou must be registered Republican by February 16th to vote TRUMP in the Florida primary. __HTTP__ _E_\nDon't forget to watch Celebrity Apprentice tonight at 9 on NBC GREAT EPISODE! _E_\nMy Administration will continue to work around the clock with Governor @RicardoRossello & his team. Great progress being made! #PRStrong __HTTP__ _E_\nJust got a call from my friend Bill Ford Chairman of Ford who advised me that he will be keeping the Lincoln plant in Kentucky no Mexico _E_\nWill be on @foxandfriends at 7:00 5 minutes. _E_\nThe Chinese Envoy who just returned from North Korea seems to have had no impact on Little Rocket Man. Hard to believe his people and the military put up with living in such horrible conditions. Russia and China condemned the launch. _E_\n#TrumpAdvice __HTTP__ _E_\nVia @WashTimes by @CharlesHurt: Donald Trump declares war on lying street hustlers of Congress\" __HTTP__ _E_\n.@SpeakerRyan Congratulations and good luck you will do a GREAT job for our wonderful U.S.A.! _E_\n.@bovanpelt. Bo I heard you were great at Trump National Westchester I am not at all surprised. Keep playing well you are a winner! _E_\nBrand new selection of Trump Signature Collection shirts and ties @Macys. Go check them out. 
_E_\nWhile the @Yankees look like they quit and are finished they won't quit for CC _E_\n\"Patriotism is supporting your country all the time and your government when it deserves it.\" Mark Twain _E_\nIt always seems impossible until it is done. Nelson Mandela _E_\nMy statement on NATO being obsolete and disproportionately too expensive (and unfair) for the U.S. are now finally receiving plaudits! _E_\n\"Labor disgraces no man unfortunately you occasionally find men who disgrace labor.\" Gen. Ulysses S. Grant _E_\nCongratulations to @secupp on joining @newtgingrich on @CNN's Crossfire. Show will be excellent! _E_\nEmin from Russia a very talented guy. All proceeds go to help the Philippines. @eminofficial #missuniverse __HTTP__ _E_\nWhich National Costume do you think should win? __HTTP__ _E_\nLooking forward to joining @V4SA Tuesday 9/15 in L.A. aboard the @USSIowa The Battleship of Presidents! Join us! __HTTP__ _E_\nThanks @WWE @VinceMcMahon is an amazing guy. _E_\n\"Be flexible enough to adjust to changing circumstances.\" – Think Big _E_\nWe now have confirmation as to one reason Crooked H wanted to be sure that nobody saw her e mails PAY FOR PLAY. How can she run for Pres. _E_\nDetroit's bankruptcy could just be the start __HTTP__ Many municipalities across US are over leveraged & losing citizens _E_\nGiving away money and revolutionizing crowdfunding. Follow @fundanything to see which causes are financed daily _E_\n#trumpvlog The Republicans must act now don't let @barackobama push you around.... __HTTP__ _E_\nVia @theinquisitr: \"Americans Agree With Donald Trump 58 Percent Want Flights Banned From Ebola Outbreak Countries\" __HTTP__ _E_\n__HTTP__ _E_\nObama administration is killing American industrial renaissance by stopping drilling and fracking. Terrible for economy. _E_\nRepublicans must get out today and VOTE in Georgia 6. Force runoff and easy win! Dem Ossoff will raise your taxes very bad on crime & 2nd A. 
_E_\nIt's too bad so few people showed up to @bobvanderplaats Family Leader dinner. Next year I'll try & be there and they'll have a huge crowd! _E_\nWelcome to the @BarackObama recovery the labor force participation rate is at a NEW 30 year low of 64.3% __HTTP__ _E_\nChina is now attacking Japan's economy for leverage __HTTP__ Soon they will try the same with us. #TimeToGetTough _E_\nDue to popular demand CNN will re broadcast the Larry King Live show I hosted in June in which I interview Larry. Monday July 5 9 pm CNN _E_\nThe late great William F. Buckley would be ashamed of what had happened to his prize the dying National Review! _E_\nBe tough be focused. There are a lot of ups and downs but you can ride them out if you're prepared for them. _E_\nLet's see whether or not Chuck Townsend @CondeNastCorp is smart enough to fire Graydon Carter who only cares about his bad food restaurants _E_\nLooks like many anti police agitators in Boston. Police are looking tough and smart! Thank you. _E_\nJohn Kasich despite being Governor of Ohio is losing to me in the Ohio polls. Pathetic! _E_\nA former Secret Service Agent for President Clinton excoriates Crooked Hillary describing her as ERRATIC & VIOLENT. Bad temperament for pres _E_\nNot his 'per se'? A Friday document dump shows @BarackObama all hands on deck as Solyndra collapsed __HTTP__ @BarackObama lies. _E_\nEverybody's talking about my doing twitter during the likely very boring debate tonight. @realDonaldTrump #DemDebate _E_\nIt's Tuesday how much has China stolen from us today through cyber espionage? _E_\nRemember to take time this weekend to relax and regroup. It will pay major dividends for the next week. _E_\nFor great success you need passion but make sure it's well directed. Learn everything you can about what you're doing. Be an expert. _E_\nWe have a sacred duty to care for our vets and their families. Our Vets are owed full access to healthcare anytime & anywhere! 
_E_\nIt was great to have @ApprenticeNBC veterans George Ross and @BretMichaels back in the boardroom. __HTTP__ #CelebApprentice _E_\nToday we lost a great pioneer of air and space in John Glenn. He was a hero and inspired generations of future explorers. He will be missed. _E_\nVia @FortuneMagazine by @mcasey1: \"Donald Trump plans to build a Trump Tower in Mumbai\" __HTTP__ _E_\nAn impromptu interview I did with German TV on 9/11 down by Ground Zero discussing the attack and WTC Towers __HTTP__ _E_\nGreat poll thank you America! Once we #DrainTheSwamp together we will #MAGA#Debate __HTTP__ _E_\nWhose artwork was your favorite— and what team do you think will win? #CelebApprentice _E_\nCHIP should be part of a long term solution not a 30 Day or short term extension! _E_\nRT @TeamTrump: She calls our people deplorable and irredeemable. I will be a president for ALL of our people. @RealDonaldTrump #BigLeag... _E_\n\"Our runaway judiciary is badly in need of restraint by Congress.\" Phyllis Schlafly _E_\nThe Dunes of @TrumpScotland are a world treasure threading thru @GolfWorld1's Scotland top Par 72 7400 yd course __HTTP__ _E_\nYes this is a large scale version of when I built and saved the ice skating rink in Central Park (which all should go to). Great course! _E_\n... That's why so many huge deals are closed on a golf course.\" – TRUMP 101 _E_\nNew @RNC report calls for embracing \"comprehensive immigration reform.\" __HTTP__ Does the @RNC have a death wish? _E_\n.@Omarosa's meltdown—was it for real? @DennisRodman thinks she could be an Oscar winner for that performance... #CelebApprentice _E_\nTHANK YOU Connecticut Delaware Maryland Pennsylvania and Rhode Island! #MakeAmericaGreatAgain __HTTP__ _E_\n...people not interviewed including Clinton herself. Comey stated under oath that he didn't do this obviously a fix? Where is Justice Dept? _E_\nJoan Rivers had great talent but also truly amazing stamina and drive she would never give up or quit. 
That is why she became a champion! _E_\nThe sex scandal at the CIA and Pentagon is rapidly unfolding getting more interesting by the minute! _E_\n#AskTrump @TwitterNYC __HTTP__ _E_\nA 'confidential source' has called my office and told me that @BarackObama has added over $6T to the new national debt & ruined US credit. _E_\nAnna Wintour came to my office at Trump Tower to ask me to meet with the editors of Conde Nast & Steven Newhouse a friend. Will go this AM. _E_\nMy @foxandfriends interview where I discuss @Rosie being canceled yet again and how she just can't make it on TV __HTTP__ _E_\nWow so nice! Thank you Wayne Allyn Root. __HTTP__ _E_\nFRACK NOW & FRACK FAST!!! American prosperity depends on it. Our economic renaissance is here. _E_\n.@BarackObama is bankrupting this country. His budget adds another $4.4T to the debt putting us over $20T in total debt by 2016. _E_\nToo bad about New York Magazine but there's a much bigger one out there currently doing a story on me to get even that I'll soon discuss! _E_\nGet ready for some excitement the live finale of the Celebrity Apprentice is on this Sunday night don't miss it! __HTTP__ _E_\n\"On 1/20 the day Trump was inaugurated an estimated 35000 ISIS fighters held approx 17500 square miles of territory in both Iraq and Syria. As of 12/21 the U.S. military est the remaining 1000 or so fighters occupy roughly 1900 square miles..\" @jamiejmcintyre @dcexaminer __HTTP__ _E_\nLimited opportunity to get your OFFICIAL Trump gear! Shop now! __HTTP__ _E_\nObama now wants to give another $450M to the Muslim Brotherhood. Money we don't have going to people that hate us. Moronic. _E_\nBeautiful morning thank you @ICLV! __HTTP__ _E_\n\"Remember the golden rule of negotiating: 'He who has the gold makes the rules.'\" – Midas Touch _E_\n7 million Americans are going to lose their jobs due to ObamaCare. 46 million face 300% premium increases. DEFUND! #MakeDCListen _E_\nHeading to Phoneix. Will be arriving soon. Tomorrow a big day. 
Tremendous crowds expected! #Trump2016 #MakeAmericaGreatAgain _E_\nEntrepreneurs: See yourself as victorious: Look at the solution not the problem. _E_\nEven a mistake may turn out to be the one thing necessary to a worthwhile achievement. Henry Ford _E_\n.@nbcsnl So much fun last night! _E_\nNow that the Mexican drug lord escaped from prison everyone is saying that most of the cocaine etc. coming into the U.S. comes over border! _E_\nThis cannot be the the Academy Awards #Oscars AWFUL!!!!!!!!!!!!!!! _E_\nWow just released that $67 million in negative ads was spent on me. How am I still number one by a lot? _E_\n.@meetthepress and @chucktodd very dishonest in not showing the new @CNN Poll where I am at 39% 21points higher than Cruz. Be honest Chuck! _E_\nJoe Biden said that the Taliban 'is not our enemy.' I wonder how our troops in Afghanistant that are under attack view Biden's statement. _E_\nI want to win for the people of this great country. The only people I will owe are the voters. #Trump2016 Video: __HTTP__ _E_\nRemember politicians are all talk and NO action. Our country is a laughing stock that is going to hell. The lobbyists & donors control all! _E_\nAlmost daily more discrepancies in @BarackObama's biography continue to arise. Who is this guy? _E_\nVia @FSMtweet: \"Trump is Right: Illegal Alien Crime is Staggering in Scope and Savagery\" __HTTP__ _E_\nWhat do African Americans and Hispanics have to lose by going with me. Look at the poverty crime and educational statistics. I will fix it! _E_\nMy @TeamCavuto int. on simplifying the tax code our incompetent leaders Iran and making America great again __HTTP__ _E_\n.@NBA hall of famer @dennisrodman brings his A game in the 13th season of All Star @CelebApprentice. This time Dennis is a star! _E_\nThe @SuperCommittee must cut spending not raise taxes. Washington has a spending problem not a revenue problem. 
_E_\nI am a defender of @MileyCyrus who I think is a good person (and not because she stays at my hotels) but last night's outfit must go! _E_\nFor all of my fantastic supporters and for the U.S.A. we are going to win and MAKE AMERICA GREAT AGAIN maybe greater than ever before! _E_\nMy @FoxNews interview last night on @gretawire On 2012: I'll Wait and See __HTTP__ _E_\nJoe Girardi @Yankees must play his starters even A Rod they got you there. _E_\n#BuyAmericanHireAmerican __HTTP__ _E_\n\"The problem is that we have a president who is more concerned with pursuing some sort of bizarre ideological (cont) __HTTP__ _E_\nThank you St. Louis Missouri!#MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nHillary Clinton Deleted Emails Using Program Intended To Prevent Recovery #CrookedHillary __HTTP__ _E_\nWill be in Cleveland Ohio w/ @mike_pence tonight join us: __HTTP__ Florida tomorrow @ 6pm: __HTTP__ _E_\nPart 1 of my @SpecialReport int. with @BretBaier discussing why I am strongly considering running for President __HTTP__ _E_\nThe Iraqi army has squandered the majority of the weapons & training we gave them for 10 long years. When will we learn? _E_\n#TBT WrestleMania 23 __HTTP__ _E_\n'Why Trump' __HTTP__ _E_\nThank you Maria B! __HTTP__ _E_\nWhile I own properties across the world I am very excited about my new acquisition of @Doral in Miami. (cont) __HTTP__ _E_\nFor those few people knocking me for tweeting at three o'clock in the morning at least you know I will be there awake to answer the call! _E_\nI'm in Scotland to open what we hope to be the greatest golf course in the world it's amazing. _E_\nWhen in doubt Obama fundraises. He has held 393 fundraisers in six years. Another record. _E_\nIt's hardly any wonder that our country's manufacturing dominance has evaporated. #TimeToGetTough (cont) __HTTP__ _E_\nI have met & spent a lot of time with families @ The Remembrance Project. I will fight for them everyday!... 
__HTTP__ _E_\n.@bubbawatson What a great player you have turned out to be but also what a great guy! Congratulations on another fantastic Masters win. _E_\nJoin me in Ohio & Maine!Cincinnati Ohio tonight @ 7:30pm: __HTTP__ Maine Saturday @ 3pm... __HTTP__ _E_\nPersonally I think Douglas Durst's brother got screwed by Douglas no wonder he's angry. _E_\n\"All the things I love is what my business is all about.\" @MarthaStewart _E_\nThey succeed because they think they can. Virgil _E_\nFernando thank you for the GREAT review of The Blue Monster in South Florida Golf especially top 10 in the WORLD. I love @SOFLAGOLF! _E_\nTrump Tower Punta del Este's cylindrical tower redefines the essence of luxury. On the sands of Playa Brava __HTTP__ _E_\n.@Deadspin guys are total losers—they had their story stolen right from under their bad complexions—other media capitalized! _E_\nAfter my meeting with the pastors it's off to Georgia for a big rally many thousands of great people will be there a beautiful movement! _E_\nFor the 1st time in American history America's 16500 border patrol agents have issue a presidential primary endorsement—me! Thank you. _E_\nThank you! #VoteTrump #ImWithYou __HTTP__ _E_\nWhen @mcuban had his own show The Benefactor it totally \"bombed!\" _E_\nTrump has big plans for improving @DoralResort __HTTP__ via @nbc's @GolfChannel @CadillacChamp _E_\nAustralia is a beautiful country with terrific people who love America. _E_\nWe must not allow ISIS to return or enter our country after defeating them in the Middle East and elsewhere. Enough! _E_\nEven the liberal CRS is now reporting Obama Care will cause 200% premium increases __HTTP__ Surprised? @Newsmax_Media _E_\nRT @DonaldJTrumpJr: A message from Donald J. Trump to NEW YORK! __HTTP__ _E_\nGolf Channel & Donald Trump's World of Golf host a Celebrity Match 1/25 @ TNGC LA CA Mark Wahlberg vs. Kevin Dillon __HTTP__ _E_\nSo biased: @TIME made 'The Protester' as the person of the year. 
@TIME celebrates OWS but vilified the Tea Party last year. _E_\nHappy 226th Birthday to the United States Coast Guard. Thank you @USCG! #CoastGuardDay __HTTP__ _E_\nThank you for joining me in Mandan ND Gov. @DougBurgum Lt. Gov. @BrentSanfordND @SenJohnHoeven @RepKevinCramer & @SenatorHeitkamp. __HTTP__ _E_\nVery proud of my Executive Order which will allow greatly expanded access and far lower costs for HealthCare. Millions of people benefit! _E_\nThe media is on a new phony kick about my management style. I spend much less money & get much better results! What we need as Prez! _E_\nMelania and I are hosting Japanese Prime Minister Shinzo Abe and Mrs. Abe at Mar a Lago in Palm Beach Fla. They are a wonderful couple! _E_\nRon Paul is right that we are wasting trillions of dollars in Iraq and Afghanistan. _E_\nCrooked Hillary Clinton is guilty as hell but the system is totally rigged and corrupt! Where are the 33000 missing e mails? _E_\nThe real story that Congress the FBI and all others should be looking into is the leaking of Classified information. Must find leaker now! _E_\nRT @TeamTrump: Quite simply @HillaryClinton mistreats women. #BigLeagueTruth #Debate2016 __HTTP__ __HTTP__ _E_\nDespite spending $500k a day on TV ads alone #CrookedHillary falls flat in nationwide @QuinnipiacPoll. Having ZERO impact. Sad!! _E_\nCelebrating 1237! #Trump2016 __HTTP__ _E_\nI feel bad for all @VanityFair employees. Every day at work they see circulation going down as Graydon runs his bad food restaurants. _E_\nAre NFL games getting boring or is it just my magnificent imagination? In any event I'm just not watching them much anymore! _E_\nBill O'Reilly calls Trump and campaign brilliant. In first place by 27 points. _E_\nCome celebrate Thanksgiving in the Windy City at @TrumpChicago's 5 Star 5 Diamond Sixteen restaurant __HTTP__ _E_\nEntrepreneurs keep this in mind: Great spirits have always encountered violent opposition from mediocre minds. 
Albert Einstein _E_\nRT @paulsperry_: __HTTP__ _E_\nResolve never to quit never to give up no matter what the situation. Jack Nicklaus _E_\nI heard that @Morning_Joe was very nice on Friday but that little Donny D a big failure in TV (& someone I helped) was nasty. Irrelevant! _E_\nChina's media is attacking @MittRomney while endorsing @BarackObama __HTTP__ Of course. Mitt knows it's Time To Get Tough. _E_\nThe stock of my shirt and tie maker just hit an all time high great going great product! _E_\nRemember the huge amount of money raised by @JohnRich and company... #sweepstweet _E_\nIf Republican Senate doesn't get rid of the Filibuster Rule & go to a simple majority which the Dems would do they are just wasting time! _E_\nYour questions about my desk answered in today's #trumpvlog... __HTTP__ _E_\n\"Once you learn to quit it becomes a habit.\" Vince Lombardi _E_\nJoin us today! Together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_\nVanity Fair circulation down 20 percent. My third rate stalker should start looking for a new job. _E_\nWe need to be smart vigilant and tough. We need the courts to give us back our rights. We need the Travel Ban as an extra level of safety! _E_\nEntrepreneurs who develop their Midas Touch do not work for money. They work to create or acquire assets. Focus on assets. Midas Touch _E_\nSo good to see the Saudi Arabia visit with the King and 50 countries already paying off. They said they would take a hard line on funding... _E_\nAs usual the weather people got it wrong in Tampa. They just look for headlines & ratings! _E_\n\"Faster And Cheaper Trump Finishes NYC Ice Rink @TrumpRink\" __HTTP__ Gov. can be efficient w/leadership & business acumen. _E_\nMy new book #TimeToGetTough out Dec 5th outlines how to make America rich again. Order now through Amazon __HTTP__ _E_\nJust purchased NBC's half of The Miss Universe Organization and settled all lawsuits against them. Now own 100% stay tuned! 
_E_\nI did what was an almost an impossible thing to do for a Republican easily won the Electoral College! Now Tax Returns are brought up again? _E_\nCrooked Hillary just can't close the deal with Bernie. It will be the same way with ISIS and China on trade and Mexico at the border. Bad! _E_\nTrump Int'l Golf Links & Hotel Ireland fronts the Atlantic Ocean in County Clare for 2.5 miles. Extraordinary! __HTTP__ _E_\nWhen we have big disasters no one comes to our aid or even suggests helping but we are always expected to come to the aid of others! _E_\nBy self funding my campaign I am not controlled by my donors special interests or lobbyists. I am only working for the people of the U.S.! _E_\nI heard poorly rated @Morning_Joe speaks badly of me (don't watch anymore). Then how come low I.Q. Crazy Mika along with Psycho Joe came.. _E_\nHad a fantastic time at yesterday's All Star @ApprenticeNBC press conference with @StephenBaldwin7 in @TrumpTowerNY. _E_\nJoin me tomorrow in Michigan!Grand Rapids at 12pm: __HTTP__ at 3pm: __HTTP__ __HTTP__ _E_\nWhy is the UN planning to attack @Israel's sovereignty and ignore Iran's nuclear program? The US should look at future funding. _E_\nRT @TODAY_Clicker: Get ready @ApprenticeNBC fans! @realDonaldTrump promises plenty of mean and nasty action.. __HTTP__ _E_\nGreat job First Lady Melania! __HTTP__ _E_\nInspiration exists but it must find you working. Pablo Picasso _E_\nWe have got to get our Marine out of that disgusting Mexican jail. Would be so easy if we had a real leader. One tough phone call & he's out _E_\nFox and Friends _E_\nStory written by a @HuffingtonPost reporter that the HuffPost refused to print. Total bias but we will prevail! __HTTP__ _E_\nHAPPY 70th BIRTHDAY to the @USAirForce! The American people are eternally grateful. Thank you for keeping America PROUD STRONG and FREE! 
__HTTP__ _E_\nPresident Obama is the greatest hoax ever perpetrated on the American people Clint Eastwood _E_\n#TrumpVlog Why are we the sad suckers? __HTTP__ _E_\n.@ericbolling Great job on The Five tonight and not only because you were so nice to The Apprentice. See you soon and thanks! _E_\n#CelebrityApprentice Paul Teutul Sr. joined me for a press event in Trump Tower last week __HTTP__ _E_\nYes I will give my @SuperBowl pick tomorrow. Watch @_KatherineWebb cover it on @InsideEdition. _E_\nThe failing @WSJ Wall Street Journal should fire both its pollster and its Editorial Board. Seldom has a paper been so wrong.Totally biased! _E_\nVia @GolfMonthly by @jake0reilly: \"Trump to build five new holes at @TurnberryBuzz\" __HTTP__ _E_\nI love being in South Carolina. We are leading big in all of the State polls Saturday is a BIG day. MAKE AMERICA GREAT AGAIN! _E_\nLETS GO AMERICA! Time to take backour country and #MakeAmericaGreatAgainWatch video & go#VoteTrump!  __HTTP__ _E_\nDon't believe the lies every budget @BarackObama has delivered to Congress raises the income tax on EVERYONE __HTTP__ _E_\nCongratulations to the Philadelphia Eagles on a great Super Bowl victory! _E_\nThe Amazon Washington Post fabricated the facts on my ending massive dangerous and wasteful payments to Syrian rebels fighting Assad..... _E_\nCongratulations to @gohermie for winning the @ShellHouOpen. We are all proud of you @TNGCBedminster & all @TrumpGolf clubs! Great going! _E_\nRussia took Crimea during the so called Obama years. Who wouldn't know this and why does Obama get a free pass? _E_\nThe Donald J.Trump Signature Collection exclusively available @Macys offers top styles in menswear. Dress your best __HTTP__ _E_\nBy popular(extremely) demand I will be live tweeting the #Oscars2014 on Sunday night. Tell all your friends I will not be pulling punches! _E_\nLIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! 
__HTTP__ _E_\nJeb is fighting to defend a catastrophic event. I am fighting to make sure it doesn't happen again.Jeb is too soft we need tougher & sharper _E_\nRemember Trump ties & shirts @Macys for Fathers Day your father will love you even more! _E_\nFilming for @CelebApprentice Season 13 is now into the 2nd week. The 'All Star' cast is already hard at work. _E_\n.@YoungDems4Trump Thank you! _E_\nWhile I believe I will clinch before Cleveland and get more than 1237 delegates it is unfair in that there have been so many in the race! _E_\nI will be in Evansville Indiana with the great Bobby Knight (who last night endorsed me) at 12:00 this afternoon. See you there! _E_\n.@jorgeramosnews Please send me your new number your old one's not working. Sincerely Donald J. Trump _E_\nRT @JeffTutorials: @realDonaldTrump __HTTP__ _E_\nThe good news is that their ratings are terrible nobody cares! __HTTP__ _E_\nGovernment needs to stop pick pocketing your wallet. Every time it does it slows growth and kills jobs. It's (cont) __HTTP__ _E_\nRT @foxandfriends: Israeli PM Netanyahu praises U.S. policy changes during meeting with Defense. Sec Mattis __HTTP__ _E_\nThank you for sharing Amy. __HTTP__ _E_\nThe real story is that President Obama did NOTHING after being informed in August about Russian meddling. With 4 months looking at Russia... _E_\nA great article about how ObamaCare has even further complicated the tax code and will hurt housing market __HTTP__ _E_\nVia __HTTP__ __HTTP__ _E_\nI will be interviewed on @foxandfriends at 8:40. A.M. Enjoy! _E_\nEnjoy Celebrity Apprentice tonight at 9 a really great episode! _E_\nMargaret Thatcher was the Iron Lady of the West. She promoted freedom & democracy a great leader & ally of America. _E_\nRepublicans have very strong hand in their fight against Obamacare lets see if they are willing and able to play it tuff ! _E_\nThe Sarasota Florida rally today was amazing. 12000 people chanting their love for our country. 
It's going to happen this is a MOVEMENT! _E_\nMy @TheBrodyFile int. from Iowa on how I would build a wall to secure our Southern Border & deduct costs from Mexico __HTTP__ _E_\nBe sure to buy this month's @AmSpec magazine. Read \"A Trump Card\" my interview with Jeffrey Lord. _E_\nLying #Ted Cruz just (on election day) came out with a sneak and sleazy Robocall. He holds up the Bible but in fact is a true lowlife pol! _E_\nHillary Lies to Benghazi Families#CrookedHillary __HTTP__ _E_\nRepublicans and @MittRomney must get tough very soon. _E_\n\"The minute that you're not learning I believe you're dead.\" – Jack Nicholson _E_\nI'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nWhy did Pres Obama remove sanctions against Iran prior to negotiating rather than completing successful negotiation & then remove sanctions? _E_\nRemember the worst thing you can do in a negotiation is seem desperate to make the deal. _E_\nClaims for unemployment are at a 3 month high __HTTP__ Where's the @BarackObama recovery? _E_\nHowever beautiful the strategy you should occasionally look at the results. Winston Churchill _E_\nMy @todayshow discussing the @CelebApprentice discussing the cast __HTTP__ _E_\n\"Do not view any failure as the end. Learn your lessons quickly then move on.\" – Think Big _E_\nI will be interviewed tonight on @FoxNews by @SeanHannity at 9pmE. Enjoy! _E_\n...and they knew exactly what I said and meant. They just wanted a story. FAKE NEWS! _E_\n..(enthusiastic dynamic and fun) and the American Legion V.A. (respectful and strong). Too bad the Dems have no one who can change tones! _E_\nSuperbowl Sunday is a great American tradition. The Colts and Saints are already champions but may the best team win! _E_\nHard to believe that Bernie Sanders has done such a complete fold. He got NOTHING for all of the time energy and money. The V.P. a joke! _E_\nJobs report is really bad beyond the worst projections.A bad day on Wall Street! 
_E_\n....that has served our country is put on a waiting list and gets no care. _E_\n\"Remember that some things are worth waiting for. Plans can change sometimes for good reason.\" – Trump Never Give Up _E_\n\"Never confuse a single defeat with a final defeat.\" F. Scott Fitzgerald _E_\n\"Courage is being scared to death but saddling up anyway.\" John Wayne _E_\nThe @WashingtonPost quickly put together a hit job book on me comprised of copies of some of their inaccurate stories. Don't buy boring! _E_\nNew South Carolina poll from PPP. Thank you! #VoteTrumpSC __HTTP__ _E_\n.@CNN is so disgusting in their bias but they are having a hard time promoting Crooked Hillary in light of the new e mail scandals. _E_\nBack by popular demand this year's All Star @ApprenticeNBC sees the return of @claudiajordan! Our fans love her. _E_\nThank you Omarosa for your service! I wish you continued success. _E_\nHillary and the Dems were never going to beat the PASSION of my voters. They saw what was happening in the last two weeks before the...... _E_\nThe just released Public Policy Polling (PPP national result) is the best yet. MAKE AMERICA GREAT AGAIN! _E_\nWhen will the U.S. stop sending $'s to our enemies i.e. Mexico and others. _E_\nRT @dcexaminer: Emails show Washington Post New York Times reporters unenthusiastic about covering Clinton Lynch meeting __HTTP__ _E_\nThank you @HauteLivingMag for naming @TrumpDoral the #1 golf course in Miami __HTTP__ _E_\nWe have to make America great again! _E_\nTomorrow night's episode of The Apprentice delivers excitement at QVC along with appearances by Isaac Mizrahi and Cathie Black. 10 pm on NBC _E_\n...Terrible for the economy and a job killer. China is laughing at us! _E_\nNot the world only your tiny group of viewers the world doesn't care about you. @lawrence You're too stupid to (cont) __HTTP__ _E_\nI will be doing @GMA @GStephanopoulos this morning at around 7:00. Likewise I will be doing @Morning_Joe at around 7:00. Figure it out! 
_E_\n.@rushlimbaugh is right—the Republicans lost because they weren't conservative enough—or tough enough. _E_\nIt was a great honor to welcome the President of Turkey Recep Tayyip Erdoğan to the @WhiteHouse today! __HTTP__ _E_\nThe rigged system may have helped Hillary Clinton escape criminal charges but... __HTTP__ __HTTP__ _E_\nLet's see what happens in the boardroom... #CelebApprentice _E_\nThank you! #Trump2016 __HTTP__ _E_\nI have so much admiration and respect for the 2.4 million men and women of our Armed Forces. #TimeToGetTough _E_\n....for the Middle Class. The House and Senate should consider ASAP as the process of final approval moves along. Push Biggest Tax Cuts EVER _E_\nMy @foxandfriends interview from yesterday discussing how @BarackObama failed to show any leadership on th... (cont) __HTTP__ _E_\nI believe Putin will continue to re build the Russian Empire. He has zero respect for Obama or the U.S.! _E_\nA country that cannot protect its borders is a country destined to fail. Another broken promise by our leaders in Washington. _E_\nIf I run and if I win our country will be great again. last line of my @SRQRepublicans speech _E_\nThank you Nebraska!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nWhy didn't President Obama just go inside when it started raining yesterday common sense! The two Marines looked very uncomfortable & wet. _E_\nI was in San Jose CA on Saturday for a sit down interview for the ACN national meeting which was attended by over 20000 people. Huge! _E_\nFor all of the haters and losers out there sorry I never went Bankrupt but I did build a world class company and employ many people! _E_\nThe ObamaCare website is unfixable & rumor has it that they will stop checks & balances—a free for all that will cost the country trillions _E_\nObama loves wasting our money. He just made another guarantee of $197M to a solar company __HTTP__ Cronyism! _E_\n#CrookedHillary __HTTP__ _E_\nThomas Kinkade died. 
I happen to love the beauty of his paintings. He took a lot of heat from art critics who (cont) __HTTP__ _E_\nLove the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud! _E_\nThis week will mark the 1 year anniversary of the attack in Benghazi that left 4 Americans dead. No answers! _E_\nWow what a day. So many foolish people that refuse to acknowledge the tremendous danger and uncertainty of certain people coming into U.S. _E_\nJust won a big federal lawsuit similar in certain ways to the Trump U case but the press refuses to write about it. If I lost monster story! _E_\nThe military and first responders despite no electric roads phones etc. have done an amazing job. Puerto Rico was totally destroyed. _E_\nOur vets are the pride of our nation. The VA scandal is a disgrace.If you can get food stamps so fast our vets should get immediate care _E_\nCongrats to @TrumpWaikiki for being named @Orbitz Best In Stay Elite Award Winner for Oahu for 2014! _E_\nI've been saying for three months that the bridge tolls to Staten Island are far too high and unfair just got lowered but not nearly enough _E_\nThe system is rigged. General Petraeus got in trouble for far less. Very very unfair! As usual bad judgment. _E_\nThis is not a media event or about Donald J. Trump this is about the United States of America. I will be... __HTTP__ _E_\nI will be going to Texas as soon as that trip can be made without causing disruption. The focus must be life and safety. _E_\nBased on the tremendous cost and cost overruns of the Lockheed Martin F 35 I have asked Boeing to price out a comparable F 18 Super Hornet! _E_\nGo to work today be smart think positively and WIN! _E_\nPres. Obama is about to embark on a 17 day vacation in his 'native' Hawaii putting Secret Service away from families on Christmas. Aloha! _E_\nWith Barry Diller & Tina Brown in charge did anyone doubt that @Newsweek would be a massive failure? 
_E_\nVerlander pitched great but @Yankees look truly defeated. _E_\nWe've just set a new goal: raise $4 million from our grassroots supporters by MIDNIGHT! __HTTP__ __HTTP__ _E_\nWhy should he? He's only the POTUS and @BarackObama has no opinion on whether the Senate should pass a budget. __HTTP__ _E_\nGreat deal we swap 5 killer terrorists for a U.S. military deserter. That's how the U.S. negotiates nowadays. _E_\nSome of the women on Celebrity Apprentice are absolutely crazy maybe the wildest thing ever on reality television. Watch tonight! _E_\nMy @FoxNews interview with @gretawire where I discuss my potential GOP endorsement and the NH primary __HTTP__ _E_\n\"Action is the foundational key to all success.\" Pablo Picasso _E_\nThank you Portland Maine! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nI will Make Our Government Honest Again believe me. But first I'm going to have to #DrainTheSwamp in DC. __HTTP__ _E_\nChampion @Joan_Rivers loves being on the other side of the table in the Boardroom. She leaves no punches out in @CelebApprentice! _E_\nDwyane Wade's cousin was just shot and killed walking her baby in Chicago. Just what I have been saying. African Americans will VOTE TRUMP! _E_\nJust returned from Pennsylvania where we will be bringing back their jobs. Amazing crowd. Will be going back tomorrow to Gettysburg! _E_\nHillary Clinton lied last week when she said ISIS made a D.T. video. The video that ISIS made was about her husband being a degenerate. _E_\nThe Chinese laugh at how weak and pathetic our government is in combating intellectual property theft. (cont) __HTTP__ _E_\nTime flies it's @TrumpTowerNY's 30th anniversary. To celebrate we made this video highlighting its amazing history __HTTP__ _E_\nMy @fox8news interview discussing the passing of my longtime friend Dick Clark. __HTTP__ A true TV legend who will be missed. _E_\nGreat bilateral meetings at Élysée Palace w/ President @EmmanuelMacron. 
The friendship between our two nations and ourselves is unbreakable. __HTTP__ _E_\nTim Kaine has been praising the Trans Pacific Partnership and has been pushing hard to get it approved. Job killer! _E_\nI could fix tv talk shows that are doing poorly—there is tremendous talent out there waiting to be tapped—and nobody sees it! _E_\nCongress is back.TIME TO CUT CAP AND BALANCE.There is no revenue problem.The Debt Limit cannot be raised until Obama spending is contained. _E_\nRT @IngrahamAngle: The #CruzCrew prevailed! Smart for @MarcoRubio to keep his speech short & sweet. Ditto for @realDonaldTrump who was brie... _E_\nHeading to Camp David for major meeting on National Security the Border and the Military (which we are rapidly building to strongest ever). _E_\nObamacare is a disaster. We must REPEAL & REPLACE. Tired of the lies and want to #DrainTheSwamp? Get out & VOTE... __HTTP__ _E_\nRT @DRUDGE_REPORT: FORMER HOSTAGE SAYS PLANE WAITED UNTIL MONEY ARRIVED... __HTTP__ _E_\nThe reason Flake and Corker dropped out of the Senate race is very simple they had zero chance of being elected. Now act so hurt & wounded! _E_\nThe ratings of The Cycle on MSNBC a sad and pathetic show are way down. If they fired racist moron @Toure a truly stupid guy they live! _E_\nJeanne Shaheen wants amnesty for illegals placed the deciding vote for ObamaCare & opposes the 2nd Amendment. Vote her out in November! _E_\nAre the Republicans going to blow their chance to take the Senate? Must focus on ObamaCare and amnesty. _E_\nMy @SquawkCNBC interview discussing the 57th St. crane damage from the storm and extending my $5M offer to Obama __HTTP__ _E_\nHillary's been failing for 30 years in not getting the job done it will never change. _E_\nIf you are steadfast in your efforts critics will be harmless. Achievers move forward and achievement is not a plateau it's a beginning. 
_E_\n\"Romney's $2 Billion Sacrifice for America\" By Chris Ruddy @Newsmax_Media __HTTP__ _E_\nRT @TeamTrump: 100% TRUE &gt @realDonaldTrump is right @HillaryClinton did call TPP 'the gold standard' #Debates2016 __HTTP__ _E_\n...dwindling subscribers and readers.They got me wrong right from the beginning and still have not changed course and never will. DISHONEST _E_\nGovernment Funding Bill past last night in the House of Representatives. Now Democrats are needed if it is to pass in the Senate but they want illegal immigration and weak borders. Shutdown coming? We need more Republican victories in 2018! _E_\nJoe Paterno's family should sue the idiots @PennState that made that ridiculous deal and commissioned the one sided report. _E_\nWebsite Exposing Marco 'Amnesty' Rubio Goes Live: A 'Donor Class Puppet'? Breitbart __HTTP__ _E_\nFor an advance preview of the Miss USA 2013 contestants as well as other show details go to __HTTP__ _E_\n#BARACKTAX QUOTE: If you have health insurance you're not getting hit with a tax. _E_\nSee the Ashley Judd ad by @karlrove and you will definitely vote for her and love Obama. _E_\nNoKo has interpreted America's past restraint as weakness. This would be a fatal miscalculation. Do not underestimate us. AND DO NOT TRY US. __HTTP__ _E_\nWIshing everyone a happy healthy and prosperous New Year! _E_\nWow I have just exceeded 2 million followers and in such a short time! _E_\nUnless the Republican Senators are total quitters Repeal & Replace is not dead! Demand another vote before voting on any other bill! _E_\nThe Comedy Central Roast of Donald Trump last week was the #1 highest rated Comedy Central Roast ever...it brought in 3.5 milion viewers _E_\nYou have to love what you do or you are never going to be successful no matter what you do in life. Think Big _E_\nOur airports are Third World horrible. Let's rebuild them by people who know how to do it inexpensively. _E_\nWhen you are in a war or even a battle losing is not an option! 
_E_\nThank you Jeffrey Lord for the great article discrediting third rate @BuzzFeed site & slimebag reporter McKay Coppins.@PiersMorgan @AmSpec _E_\nPresident Obama put himself in a very bad position when he talked about Syria crossing the RED LINE. Amazingly now he denies he said that! _E_\nThank you for the nice words this morning @KellyRiddell. Well delivered and totally logical! @CNN @FoxNews _E_\nRT @Jenniffer2012: Thank you @realDonaldTrump for all the help you are providing for Puerto Rico. We're are grateful and happy to welcome y... _E_\n\"45 year low in illegal immigration this year.\" @foxandfriends _E_\nWhat do you think about the push to put women into high intensity combat situations? _E_\nThe Washington Times Presidential Debate Poll:TRUMP 77% (18290)CLINTON 17% (4100)#DrainTheSwamp #Debate __HTTP__ _E_\nThe media tries so hard to make my move to the White House as it pertains to my business so complex when actually it isn't! _E_\nI love reading about all of the geniuses who were so instrumental in my election success. Problem is most don't exist. #Fake News! MAGA _E_\nI hope people are looking at the disgraceful behavior of Hillary Clinton as exposed by WikiLeaks. She is unfit to run. _E_\nTerrible. Wind farms are provided permits by the US government which causes the programmatic killing of bald eagles. _E_\n\"Pride yourself on your ability to find creative solutions to tough problems. Think Big _E_\nThe Electoral College is actually genius in that it brings all states including the smaller ones into play. Campaigning is much different! _E_\nWeekly Address #KatesLaw#NoSanctuaryForCriminalsActStatement: __HTTP__ __HTTP__ _E_\nIs everyone enjoying ObamaCare's 21 new 2014 taxes? __HTTP__ It's Obama's special gift added on to your rising premium. _E_\n\"Do your duty and a little more and the future will take care of itself.\" Andrew Carnegie _E_\nHead of Air Force's anti sexual assault unit arrested for sexual assault! 
It just seems that our Country is not what it used to be. _E_\nThe ultimate vacation destination @TrumpPanama's sleek design evokes a majestic sail fully deployed in the wind __HTTP__ _E_\nAt the end of the day Obama won the battleground states by less than 500000 votes. This was a winnable race. GOP needs to do better! _E_\nBy popular demand I will be tweeting on the very tainted Academy Awards tonight! _E_\nHappy New Year to all of my Jewish friends and supporters. Shana Tova. Hopefully it will be a great year! _E_\nGlad to see that the Egyptian Army is releasing Mubarek. As we see Obama never should have abandoned him. He was an ally. _E_\nCheck out the Trump Fabulous World of Golf site to meet the Fazio family master golf course designers.... __HTTP__ _E_\nThe Euro put in place to hurt the U.S. is done! will have less negative impact than most think. _E_\nGreat rally last night in Massachusetts. 2000 people at a house must be a record! Unbelievable spirit to MAKE AMERICA GREAT AGAIN. _E_\nSome dope tweeted my message to my friend Bill Belichick incorrectly they called him Bob. Sorry Bill! @Patriots _E_\nNewsmax article: 'Trump Declines Prime Time GOP Convention Speech' __HTTP__ _E_\nFact – every successful GOP Senate candidate just elected ran on repealing ObamaCare. In January it's time to move! _E_\nWe all know that chess is a game of strategy. So is business. Think Like a Champion _E_\nI could fix existing Tappan Zee Bridge for peanuts. Unfortunately Gov Cuomo will end up spending more than $10B on this project. $25 tolls? _E_\nI agree getting Tax Cuts approved is important (we will also get HealthCare) but perhaps no Administration has done more in its first..... _E_\nVia @DMRegister by @AP: \"Donald Trump talks economy with Republicans in Davenport\" __HTTP__ _E_\nCelebrate Martin Luther King Day and all of the many wonderful things that he stood for. Honor him for being the great man that he was! 
_E_\n'Economists say Trump delivered hope' __HTTP__ _E_\nWill be doing @OutFrontCNN with @ErinBurnett tonight at 7 pm re: tax reductions and various other topics. _E_\n\"The thing about high corporate tax rates is that in the end companies aren't the ones who foot the bill consumers do.\" #TimeToGetTough _E_\nYour tax dollars well spent. Over 1.295M ObamaCare enrollees will also be illegal immigrants __HTTP__ Are you surprised? _E_\n.@KarlRove Had my best day ever in the polls one had me at 41% Morning Consult. Boston Globe Monmouth NBC and CNN all great. More! _E_\nI had a great time in Iowa yesterday record crowds fantastic people! _E_\nWeakness is very dangerous: @BarackObama is going to unilaterally disarm our nuclear arsenal. America keeps the world safe! _E_\nI'm not a hunter and don't approve of killing animals. I strongly disagree with my sons who are hunters but (cont) __HTTP__ _E_\nI have made my decision on who I will nominate for The United States Supreme Court. It will be announced live on Tuesday at 8:00 P.M. (W.H.) _E_\nAfter allowing North Korea to research and build Nukes while Secretary of State (Bill C also) Crooked Hillary now criticizes. _E_\nChina is cooking up conspiracy theories that the Olympics are rigged. __HTTP__ They don't understand why they can't cheat. _E_\nI am impressed with the scam @BarackObama pulled but the truth will come out. _E_\n.@piersmorgan Russell has nothing going for himself except for energy & aggression. Without that he would be dead—a first class dummy! _E_\nCrooked Hillary can't close the deal with Bernie Sanders. Will be another bad day for her! _E_\n.@JohnKerry claims he has never stopped working\" f/Pastor Abedini's release through \"back channels. Where are the results? _E_\nVanity Fair party at Tribeca Film Festival was a bust. _E_\nAdam Moss editor in chief of @NYMag is quickly losing his reputation in that @NYMag has become so boring and so irrelevant. 
_E_\nLying Ted Cruz and lightweight choker Marco Rubio teamed up last night in a last ditch effort to stop our great movement. They failed! _E_\nThank you @hardball_chris for your nice words. They are very much appreciated. I fully understand that you really get it. _E_\nRep. Lou Barletta a Great Republican from Pennsylvania who was one of my very earliest supporters will make a FANTASTIC Senator. He is strong & smart loves Pennsylvania & loves our Country! Voted for Tax Cuts unlike Bob Casey who listened to Tax Hikers Pelosi and Schumer! _E_\nDo you think crooked @AGSchneiderman will ever challenge the NFL tax status? No—too many friends and contributors in @nfl? _E_\nHow can Crooked Hillary say she cares about women when she is silent on radical Islam which horribly oppresses women? _E_\nICYMI my @foxandfriends int. criticizing the GOP on ObamaCare the new Congress & 2016 __HTTP__ _E_\nMake sure to verify the voting machine does not switch your vote. If you have any problems notify the poll workers. _E_\nThe fact that President Putin and I discussed a Cyber Security unit doesn't mean I think it can happen. It can't but a ceasefire can& did! _E_\nGet rid of gun free zones. The four great marines who were just shot never had a chance. They were highly trained but helpless without guns. _E_\nStocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy and create many beautiful JOBS! _E_\nRT @foxandfriends: .@Suffolk_Sheriff praises President Trump for making gang eradication a priority __HTTP__ _E_\nThe only deal the Republicans should accept is a complete repeal of ObamaCare. You have them on the run don't fold go for it! _E_\nThe Ryder Cup will be amazing this week. _E_\nSo many people are asking why isn't the A.G. or Special Council looking at the many Hillary Clinton or Comey crimes. 33000 e mails deleted? 
_E_\nThe failing @nytimes which never spoke to me keeps saying that I am saying to advisers that I will change. False I am who I am never said _E_\nBREAKING NEWS: Obama has just made a trade with Russia. They get Florida California & our gold supply. We get borscht & a bottle of vodka. _E_\nToday I was honored and proud to address the 45th Annual @March_for_Life! You are living witnesses of this year's March for Life theme: #LoveSavesLives. __HTTP__ _E_\nMuch bigger win than anticipated in Arizona. Thank you I will never forget! _E_\nEntrepreneurs: Be passionate. You have to love what you're doing to be successful at it. _E_\nWe don't always think of our presidents as jobs and business negotiators but they are. Presidents are our (cont) __HTTP__ _E_\n#CongressionalBaseballGame __HTTP__ _E_\nWe MUST have strong borders and stop illegal immigration. Without that we do not have a country. Also Mexico is killing U.S. on trade. WIN! _E_\nI will be on @wolfblitzer for a @CNNSitRoom interview today. Please join us 5PM ET. _E_\nA Warren Buffett corp. is currently ensnared in a bankruptcy. Likewise Icahn Kravis Apollo and many others have played the game.Thanks! _E_\nShameful. After trading 5 senior Taliban for a deserter the White House is now attacking Bergdahl's platoon __HTTP__ _E_\nSUN newspaper/Scotland reports that Tourism jump is thanks to Trump. 8000 visitors in one month from 20 countries __HTTP__ _E_\nA great crowd at Trump Tower for #TimeToGetTough book signing! _E_\nOn behalf of the entire family we would truly be honored to have your vote! Let's #MakeAmericaGreatAgain #EarlyVote __HTTP__ _E_\nVery successful fund raising for @MittRomney yesterday. Good to see my friend Woody Johnson. _E_\nRT @FoxNews: Jobs created in February. __HTTP__ _E_\nFrom 2% to 27% in Texas quite a jump into first place! _E_\nNew orders for manufacturing down 9/10 months __HTTP__ Time for fair trade. Stop TPP! _E_\nGreat going Andy Roddick! Another victory for a fabulous player. 
Brooklyn Decker is good luck. _E_\n.@McIlroyRory Great job Rory you have the heart and talent of a great champion. Work hard and win many more! See you at Turnberry. _E_\nJust like I have warned from the beginning Crooked Hillary Clinton will betray you on the TPP. __HTTP__ _E_\nWill be interviewed on @GMA at 7:00 A.M. Big wins last night! _E_\n\"A very good way to pave your own way to success is simply to work hard and to be diligent\" – Think Like a Champion _E_\nAlabama is sooo lucky to have a candidate like Big Luther Strange. Smart tough on crime borders & trade loves Vets & Military. Tuesday! _E_\nI don't think Obama will do well in the second debate he is psyched out just like A Rod. _E_\nWhy are we continuing to train these Afghanis who then shoot our soldiers in the back? Afghanistan is a complete waste. Time to come home! _E_\n\"Always get even. When you are in business you need to get even with people who screw you.\" – Think Big _E_\nThe dishonest NY Daily News reporter advised my rep in writing story is dead and then put it out anyway. A total lie and she knew it! _E_\nProsperity is coming back to our shores because we are putting America WORKERS and FAMILIES first. #AmericaFirst __HTTP__ _E_\nI fulfilled my campaign promise others didn't! __HTTP__ _E_\n\"WATCH: @MissUniverse contestants golf with Donald Trump @TrumpDoral\" __HTTP__ via @KylePorterCBS by @CBSSports _E_\nMemorial service today for beautiful and incredible Heather Heyer a truly special young woman. She will be long remembered by all! _E_\nIt came out that Huma Abedin knows all about Hillary's private illegal emails. Huma's PR husband Anthony Weiner will tell the world. _E_\nYou are witnessing the single greatest WITCH HUNT in American political history led by some very bad and conflicted people! #MAGA _E_\nEntrepreneurs: Knowledge requires patience action requires courage. Put patience and courage together and you'll be a winner . _E_\nDon't underestimate yourself or your possibilities. 
There are always opportunities. _E_\nFeaturing top spa in New York AAA Five Diamond Award @TrumpSoHo is Soho's most elite hotel & destination spot __HTTP__ _E_\nMONDAY 11/7/2016Scranton Pennsylvania at 5:30pm. __HTTP__ Rapids Michigan at 11pm.... __HTTP__ _E_\nI promised that my policies would allow companies like Apple to bring massive amounts of money back to the United States. Great to see Apple follow through as a result of TAX CUTS. Huge win for American workers and the USA! __HTTP__ _E_\nAccording to the @nytimes a Russian sold phony secrets on \"Trump\" to the U.S. Asking price was $10 million brought down to $1 million to be paid over time. I hope people are now seeing & understanding what is going on here. It is all now starting to come out DRAIN THE SWAMP! _E_\nCongratulations to Tom Scocca and Timothy Burke of @Deadspin for exposing the Manti Te'o fiasco. _E_\nAs I made very clear today our country needs the security of the Wall on the Southern Border which must be part of any DACA approval. _E_\n.@GeraldoRivera Thanks my champion Geraldo and very true. _E_\nIn order to stay competitive in your industry it is imperative to keep up to date on all news. A great commodity is information. _E_\n.@THEGaryBusey survives another week of All Star Celebrity @ApprenticeNBC. Gary is shifty and playing to win. _E_\nAlso tune in to the @TodayShow at 7:00am. I will be on to discuss the campaign my new ads and #CrippledAmerica. _E_\nSomebody hacked the DNC but why did they not have hacking defense like the RNC has and why have they not responded to the terrible...... _E_\nA new terror warning was issued for European cties. At what point do we say we have had enough and get really tough and smart. Weak leaders! _E_\nWashington is simply incapable of any moderation because @BarackObama is such an extreme leftist. He must be defeated. #TImeToGetTough. _E_\nRT @mike_pence: Congrats to my running mate @realDonaldTrump on a big debate win! Proud to stand with you as we #MAGA. 
_E_\nWikiLeaks proves even the Clinton campaign knew Crooked mishandled classified info but no one gets charged? RIGGED! __HTTP__ _E_\nI hope everyone read the brilliant article in American Spectator about leightweight A.G. Eric Schneiderman. He should be run out of office! _E_\nObamaCare contains marriage penalty taxes. Why should married couples be penalized for having healthcare? _E_\nWill be on @ABC News tonight at 6:30. Interviewed by the legendary @BarbaraJWalters! Enjoy _E_\nHow did NBC get an exclusive look into the top secret report he (Obama) was presented? Who gave them this report and why? Politics! _E_\n#TrumpAdvice __HTTP__ _E_\nAn investment in life luxury & leisure a Trump Nat'l Bedminster membership offers top amenities & services __HTTP__ _E_\nI don't know why our allies are so surprised Obama is tapping their phones? Nothing changes! _E_\nToday Barack Obama is standing in water in NJ. Remember on election day that he has put the US underwater. _E_\nVia @LasVegasSun by Eugene Dunn: \"2016 is the year of Donald Trump\" __HTTP__ _E_\nOur country is totally split right now but someday it will come together! _E_\nMy wife the beautiful @MELANIATRUMP will be appearing... #CelebApprentice _E_\nWow just 1 day after my offer to fund all WH tours Obama backtracks on decision to cancel all White House tours\" ... _E_\nPerhaps @BarackObama's biggest shortcoming as President is he failed to unite the country. _E_\n$4 gasoline – wow—OPEC is very happy! _E_\n.@JonahNRO watched on @seanhannity and appreciate your statements I have been waiting for them for a long time. Thank you. _E_\nWe win in our lives by having a champion's view of each moment. Donald J. Trump __HTTP__ _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nUS News named the Top10 best hotels in the US and Trump Int'l Hotel & Tower NYC and Trump Int'l Hotel & Tower Chicago are on the list! _E_\nWord is early voting in FL is very dishonest. 
Little Marco his State Chairman & their minions are working overtime trying to rig the vote. _E_\nThe White House is predicting 9% unemployment throughout 2012 – and when Obama Care takes effect in 2014 expect it to go even higher. _E_\nAs Governor of Texas Rick Perry could have done far more to secure the border but that's O.K. I like him anyway! @GovernorPerry _E_\nIt's a plain fact: free trade requires having fair rules that apply to everyone. #TimeToGetTough _E_\nThank you Virginia! #ImWithYou __HTTP__ _E_\nGreat to see how hard Republicans are fighting for our Military and Safety at the Border. The Dems just want illegal immigrants to pour into our nation unchecked. If stalemate continues Republicans should go to 51% (Nuclear Option) and vote on real long term budget no C.R.'s! _E_\n'President elect Donald J. Trump's CIA Director Garners Praise' __HTTP__ __HTTP__ _E_\nWow I just found out that in a major poll of its readers the @NewYorkObserver voted me #1 on the power 100 list in NY...... _E_\nWhile China screws us with every turn of its currency is the biggest commercial espionage threat we face (cont) __HTTP__ _E_\nGolf is a game of respect and sportsmanship we have to respect its traditions and its rules. Jack Nicklaus _E_\n.@GOP must stay focused on defunding ObamaCare and the impending budget battle. Don't let Syria rule the agenda. _E_\nWeekly Address 11:00 A.M. at the @WhiteHouse! #MAGA __HTTP__ __HTTP__ _E_\nWhen I said that Hillary Clinton got schlonged by Obama it meant got beaten badly. The media knows this. Often used word in politics! _E_\nAdditionally @CelebApprentice ranked as the #1 program in the 9 11 pm time period with adults in the 25 54 age group. _E_\nVia @Mediaite: Donald Trump Trashes 'Tacky' 'Boring' Oscars Blasts 'Racist' Django Unchained __HTTP__ _E_\nFriday is the last day to enter the Counting Sheep for Hire contest. 
Click here www.youtube.com/user/mattressserta and you could win a trip _E_\nWhy do the Republicans always negotiate against themselves in public? Watching them operate these fiscal negotiations is painful. _E_\nLyin' Ted Cruz even voted against Superstorm Sandy aid and September 11th help. So many New Yorkers devastated. Cruz hates New York! _E_\nWith Joan Rivers and ivankatrump from last night's great boardroom! __HTTP__ _E_\nJust signed contract to purchase the Ritz Carlton in Jupiter Florida great land great location great future! _E_\nNorth Korea has shown great disrespect for their neighbor China by shooting off yet another ballistic missile...but China is trying hard! _E_\nEntrepreneurs: If you cannot handle the tough times you will never be successful in business. Stay positive & stay strong! _E_\nGreat trip to Mexico today wonderful leadership and high quality people! Look forward to our next meeting. _E_\nFind out where to #VoteTrump on caucus night in Iowa on 2/1/16!#IACaucus #FITN #Trump2016 __HTTP__ __HTTP__ _E_\nEven though I refused to pay a ridiculous price for the Buffalo Bills I would have produced a winner. Now that won't happen. _E_\nThank you for your service! __HTTP__ _E_\nIt is a disgrace that my full Cabinet is still not in place the longest such delay in the history of our country. Obstruction by Democrats! _E_\nEntrepreneurs: Watching you could be the motivation for your employees.Make it an example that will best serve the success of your business. _E_\nTotally biased @NBCNews went out of its way to say that the big announcement from Ford G.M. Lockheed & others that jobs are coming back... _E_\nMust read opinion piece by @Gallup CEO Jim Clifton: \"The Big Lie: 5.6% Unemployment\" __HTTP__ Just as I have long been saying... _E_\nBy failing to prepare you are preparing to fail. Benjamin Franklin _E_\n.@NBC just announced that all 1 hour @CelebApprentice episodes are being expanded to 2 hours—it's amazing what good ratings will do! 
_E_\nEntrepreneurs: Get a momentum going. Listen apply then move forward. Do not procrastinate. See opportunity as the perk that it is. _E_\nI just got back from Russia learned lots & lots. Moscow is a very interesting and amazing place! U.S. MUST BE VERY SMART AND VERY STRATEGIC. _E_\nI hope when the MSM runs its \"interruption counters\" they consider the # of times the moderators interrupted me com... __HTTP__ _E_\nMarkets are crashing all caused by poor planning and allowing China and Asia to dictate the agenda. This could get very messy! Vote Trump. _E_\nWhen we're talking about math that doesn't add up how about $5 trillion of deficits over the last four years. @MittRomney _E_\nNo matter how diligent you are in evaluating a business deal there is invariably one factor you have no control over luck... _E_\n...time for Republicans & Democrats to get together and come up with a healthcare plan that really works much less expensive & FAR BETTER! _E_\nThanks @LilJon for coming to my defense in Rolling Stone Magazine. As I have often said you are a terrific guy! _E_\nMoody's is out to make publicity. The bank downgrades from yesterday don't make up for @Moody's giving AAA (cont) __HTTP__ _E_\nOur incompetent Secretary of State Hillary Clinton was the one who started talks to give 400 million dollars in cash to Iran. Scandal! _E_\nCongrats to @Yankees on finishing 1st in the AL East. Derek Jeter is great good luck in the playoffs! _E_\nThank you! __HTTP__ _E_\nLightweight @DannyZuker is too stupid to see that China (and others) is destroying the U.S. economically and our leaders are helpless! SAD. _E_\nLike it or not Edward Snowden is a SPY and should be tried as a SPY! He has stolen invaluable information and damaged us with other nations _E_\nI am in Iowa watching all of these phony T.V. ads by the other candidates. All bull politicians are all talk and no action it won't happen! _E_\nRubio puts out ad that my pilot was a drug dealer not true not my pilot! 
Guy owned helicopter company don't think I ever even used. _E_\nThis is happening all over our country—great people being disenfranchised bypoliticians. Repub party is in trouble! __HTTP__ _E_\nToday the U.S. flag flies at half staff at the @WhiteHouse in honor of National Pearl Harbor Remembrance Day. __HTTP__ __HTTP__ _E_\nThe final two @ArsenioOFFICIAL and @ClayAiken visited yesterday __HTTP__ _E_\n.@LisaRinna looks better with her reduced lips. Good move Lisa. #CelebApprentice _E_\nDue to the horrific events taking place in our country I have decided to postpone my speech on economic opportunity today in Miami. _E_\nI'm glad that Mark Cuban won the ridiculous case with the S.E.C. It never should have been brought in the first place! _E_\nIf Obama doesn't accept my offer to be fully transparent what will he say? _E_\n...and an optimist is one who makes opportunities of his difficulties. Harry S. Truman _E_\nDumbass @BillMaher has still not given me the 5 million he committed to charity we just presented him with a demand notice. _E_\nComing together is a beginning. Keeping together is progress. Working together is success. Henry Ford _E_\nIf we reelect @BarackObama the America we leave our kids and grandkids won't look like the America we were (cont) __HTTP__ _E_\n13 Syrian refugees were caught trying to get into the U.S. through the Southern Border. How many made it? WE NEED THE WALL! _E_\nI think it would be a good idea—and fair—to include @GovChristie & @MikeHuckabeeGOP in the debate. Both solid & good guys. @FoxBusiness _E_\nGreat numbers on the economy. All of our work including the passage of many bills & regulation killing Executive Orders now kicking in! _E_\n.@DRUDGE_REPORT's First Presidential Debate Poll:Trump: 80%Clinton: 20%Join the MOVEMENT today & lets #MAGA!... __HTTP__ _E_\nVia @Investopedia by @swan_investor: The Irreplaceable Brand Of Donald Trump __HTTP__ _E_\n\"Talent hits a target no one else can hit. 
Genius hits a target no one else can see.\" – Arthur Schopenhauer _E_\n\"Donald Trump Wishes Kristen Stewart A Happy Birthday\" __HTTP__ via @HollywoodLife _E_\n'Trump rally disrupter was once on Clinton's payroll' __HTTP__ _E_\nInteresting how President Obama is flying around in a Boeing 747 on so called Earth Day! _E_\nThe protesters in New Mexico were thugs who were flying the Mexican flag. The rally inside was big and beautiful but outside criminals! _E_\nI will be interviewed on @seanhannity tonight at 10:00. You will find it very interesting (I hope). Enjoy! _E_\nMaybe I'm old fashioned but I don't like seeing women in combat. _E_\nDon't reward Mitt Romney who let us all down in the last presidential race by voting for Kasich (who voted for NAFTA open borders etc.). _E_\nDonna Brazile just stated the DNC RIGGED the system to illegally steal the Primary from Bernie Sanders. Bought and paid for by Crooked H.... _E_\nWell the New Year begins. We will together MAKE AMERICA GREAT AGAIN! _E_\nIf you want to succeed keep your edge. Staying on top of all new developments in your sector = major advantage that pays dividends. _E_\nWill be going to Richmond Virginia today. Big crowd! See you there. _E_\nFailing comedian Bill Maher who I got an accidental glimpse of the other night is really a dumb guy just look at his past! _E_\n\"Create your own visual style... let it be unique for yourself and yet identifiable for others.\" Orson Welles _E_\nI like Michael Douglas! _E_\nI hope you are watching the Apprentice...tonight's show is great and Brett Michaels is back! _E_\nUnderstand that difficulties mistakes and setbacks are an inevitable part of business and life...But always look for the opportunities. _E_\n...a tool of anti Trump political actors. 
This is unacceptable in a democracy and ought to alarm anyone who wants the FBI to be a nonpartisan enforcer of the law....The FBI wasn't straight with Congress as it hid most of these facts from investigators.\" Wall Street Journal _E_\nRe: Decisions: Cover your bases then ask yourself this question: What am I pretending not to see? This can save a lot of time & trouble. _E_\nThe Chinese are better off than they were 4 years ago. They have stolen even more from us in jobs & trade during @BarackObama's term. _E_\nThe fact that we are here today to debate raising America's debt limit is a sign of leadership failure. Sen. Obama 3/16/06 _E_\nLooking forward to speaking at the @NHGOP #FITN Republican Leadership Summit on Saturday at 12PM! Let's Make America Great Again! _E_\nThe Fake News media is officially out of control. They will do or say anything in order to get attention never been a time like this! _E_\nIt's amazing how celebrities such as @Cher can say horrible untrue things about Republican politicians and it's (cont) __HTTP__ _E_\nMay God be with the people of Sutherland Springs Texas. The FBI and Law Enforcement has arrived. _E_\nI don't know how Al Michaels could have been drunk and arrested on Friday night if he was totally sharp on Saturday morning. _E_\nWe must suspend immigration from regions linked with terrorism until a proven vetting method is in place. _E_\nJoin me live in Toledo Ohio!#MakeAmericaGreatAgain __HTTP__ _E_\nThe Democrats when they incorrectly thought they were going to win asked that the election night tabulation be accepted. Not so anymore! _E_\nMaking money is art and working is art and good business is the best art. Andy Warhol _E_\nHillary Clinton should have been prosecuted and should be in jail. Instead she is running for president in what looks like a rigged election _E_\nLooking forward to hosting @NaghmehAbedini next week @TrumpTowerNY. The White House has abandoned her husband Christian Pastor Abedini. 
_E_\nDonald Trump's birther event is the greatest trick he's ever pulled __HTTP__ _E_\nThank you America! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_\nWith one of the worst and most prolonged cold spells in history with Atlanta Texas and parts of Florida freezing Global Warming anyone? _E_\nThank you to Doug Parker and American Airlines for all of the help you have given to the U.S. with Hurricane flights. Fantastic job! _E_\nIf you like to work hard you will attract people with the same ethic. Think Like a Billionaire _E_\n70 stories over Panama Bay @TrumpPanama is the country's first five star development. A masterpiece __HTTP__ _E_\nWhat a coincidence that Obama's good friends in Libya and Egypt picked 9/11 to attack our embassies. _E_\nTrace and his team raised an amazing amount of $. Looks like a good season for charities. _E_\nThank you New Hampshire! #MakeAmericaGreatAgain __HTTP__ _E_\nSad case @USATODAY did article saying I don't pay bills false only don't pay when work is shoddy bad or not done! They should do same! _E_\nI will be interviewed on @foxandfriends tomorrow morning at 7:00. Enjoy! _E_\nOn 1300 acres in Charlottesville @trumpwinery's wine has been awarded the coveted Virginia Double Gold Medal __HTTP__ _E_\nGood.morning I'm going to work! _E_\nRT @TeamTrump: It's US vs. them! @realDonaldTrump will fight for you! #BigLeagueTruth #Debates _E_\nThe @CadillacChamp returns to @TrumpDoral on March 6th __HTTP__ Watch top golfers of the world battle the Blue Monster! _E_\nRising gas prices are causing a steep rise in consumer prices and will slow any future economic growth. It is a tax on all Americans. _E_\nThank you @mcuban for your nice words. I am rapidly becoming a @dallasmavs fan! __HTTP__ _E_\nRe Negotiation: Realize that persistence can go a long way. Being stubborn is often an attribute. _E_\nReceiving thousands of thank you letters from @LibertyU students for my convocation speech. The honor was all mine! Great people. _E_\nIrony! 
@BarackObama was in Florida yesterday fundraising. Gas also rose to $6/gallon for Florida drivers yesterday. __HTTP__ _E_\nOther worthy people were taken off the @CNBC list as well. Stupid poll should be canceled—no credibility. _E_\nYou have to love what you do or you are never going to be successful no matter what you do in life.\" Think Big _E_\nStay tuned for my big Obama announcement probably on Wednesday. _E_\nI will be on @MeetThePress with @ChuckTodd tomorrow morning at 10:30am ET on @NBC. Enjoy! _E_\nJoin me in Atlanta on Wednesday at noon! #Trump2016Tickets: __HTTP__ __HTTP__ _E_\nMichael Forbes is a loser who failed to stop what was just named \"the golf course of the year\" and which has brought ... _E_\nRT @DRUDGE_REPORT: CLINTON EMAIL LED TO EXECUTION IN IRAN? __HTTP__ _E_\n.@heytana great job we are all proud of you! _E_\nOn this solemn day of remembrance we can all take joy in the fact that Bin Laden's last sight was a Navy SEAL pulling the trigger. _E_\nStanding ovation after promising to bring the American Dream back and better than ever before! __HTTP__ _E_\nRemember NBC increased Celebrity Apprentice to 2 hours starting this Sunday night at 9 P.M. through end of season great news for App lovers _E_\nObamaCare is already done. HHS Sec. Sebelius is trying to force private companies to finance implementation __HTTP__ _E_\n.@TrumpPanama is Panama City's premiere hotel. 70 stories over Punta Pacifica excellence has arrived to So. America __HTTP__ _E_\nThe harder you work the luckier you get. Gary Player _E_\nDummy @Clare_OC from failing @Forbes magazine: NASCAR deal was 1 nite ballroom ESPN was small golf outing... _E_\nI have hired renowned golf course architect Gil Hanse to rebuild The Blue Monster at Doral. He designed the 2016 (cont) __HTTP__ _E_\nDeserter Bergdahl returns to active duty as parents of brave soldiers killed looking for him grieve. Obama trying to play this mistake down! 
_E_\nI wonder what the answer is on @BarackObama's college application to the question: place of birth? Maybe the (cont) __HTTP__ _E_\nRepublicans and Democrats should get back to work immediately to work on resolving downgrade. This is not a go... (cont) __HTTP__ _E_\nHillary Clinton doesn't have the strength or stamina to be president. Jeb Bush is a low energy individual but Hillary is not much better! _E_\n.@JoselynMartinez is a very brave woman who caught her father's killer __HTTP__ She visited Ivanka & me at Trump Tower today. _E_\nCredibility is important to me hence must admit that both candidates did really well last night. #VPDebate _E_\nTrump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ via @thehill _E_\nThank you Great Faith Ministries International Bishop Wayne T. Jackson and Detroit! __HTTP__ _E_\nAs I've said many times before Jon Stewart @TheDailyShow is highly overrated. _E_\nWill be interviewed by @seanhannity tonight for the full hour. Hope you enjoy it and more importantly hope you agree! _E_\nCongress must end chain migration so that we can have a system that is SECURITY BASED! We need to make AMERICA SAFE! #USA __HTTP__ _E_\nOn my way to see the great people of Maine. Will be landing in Portland in 2 hours. Look forward to it! #Trump2016 _E_\nA Rod's salary is more than the entire @astros. Half the players on @astros will have better seasons than him. A Rod is a joke! _E_\nIt's okay but why do the haters (& losers) want to follow me on twitter?? Get a life! _E_\n#TrumpAdvice __HTTP__ _E_\nThe Hostess closing did not have to happen should have been an easy deal to make. _E_\nVia @PatheosFamily by @BristolsBlog: Trump Weighs In on Saeed: Obama 'Didn't Even Ask' __HTTP__ Thanks Bristol! _E_\nHow many more times do we all have to watch and pay for that stupid and never ending #SmokeyBearHug commercial. How much is govt. spending? _E_\nTrump's Tax Plan: A Proposal Reagan Would Approve? 
by Jeff Bell __HTTP__ _E_\nThe Donald J. Trump Signature Collection's new line is out @Macys ties shirts accessories great & going fast! __HTTP__ _E_\n.@GolfMagazine is great thanks! _E_\nVia Hardball with Chris Matthews __HTTP__ _E_\nTrump Golf Links at Ferry Point will host many major championships over the years. Great thing for NYC—congratulations to all! _E_\nTrump lays out big plans for Doonbeg resort: Billionaire says investment shows Ireland's economy recovering __HTTP__ _E_\nRT @MittRomney: I am running for president to get us creating wealth again not to redistribute it. _E_\nThank you @FrankLuntz __HTTP__ _E_\nThe home of the boardroom @TrumpTowerNY __HTTP__ #CelebApprentice _E_\nJust watched Full Metal Jacket can't believe R. Lee Ermey didn't win the Academy Award as the drill sergeant. Political nominations! _E_\nBritish PM Cameron is making a fool of himself by wasting billions of pounds on unwanted & environment destroying Scottish windmills. _E_\nI wonder what the next scandal will be in D.C.? Can we handle yet another? _E_\nWill be in Phoenix Arizona on Wednesday. Changing venue to much larger one. Demand is unreal. Polls looking great! #ImWithYou _E_\nChina is our enemy. It's time we start acting like it...and if we do our job corectly China will gain a whole (cont) __HTTP__ _E_\nWE ARE MAKING AMERICA GREAT AGAIN! __HTTP__ _E_\nWe should be focusing on beautiful clean air & not on wasteful & very expensive GLOBAL WARMING bullshit! China & others are hurting our air _E_\n#TRUMP International Reality will be America's premiere real estate brokerage house __HTTP__ w/ the most distinctive services. _E_\nThank you Lexington South Carolina!#Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nGreat knockout on Saturday by Juan Manuel Marquez on Manny Pacquiao. A great fight! _E_\n\"The most terrifying words in the English language are: I'm from the government and I'm here to help.\" – Pres. Ronald Reagan _E_\nCongratulations Eric & Lara. 
Very proud and happy for the two of you! __HTTP__ _E_\n#MidasTouch is divided into five sections. The second is the index finger which represents Focus __HTTP__ _E_\nJoin me live in Hershey Pennsylvania! #MakeAmericaGreatAgain LIVE: __HTTP__ __HTTP__ _E_\n.@HighSock_Sunday #asktrump __HTTP__ _E_\nCongratulations to the Houston @Astros 2017 #WorldSeries Champions#HoustonStrong #EarnHistory __HTTP__ _E_\nForeign leaders are already requesting meetings with @MittRomney to warn that we are viewed as in decline __HTTP__ _E_\nSo many positive things going on for the U.S.A. and the Fake News Media just doesn't want to go there. Same negative stories over and over again! No wonder the People no longer trust the media whose approval ratings are correctly at their lowest levels in history! #MAGA _E_\nVia @BreitbartNews by @rwildewrites: \"TRUMP: 'I WOULD BUILD A BORDER FENCE LIKE YOU HAVE NEVER SEEN BEFORE'\" __HTTP__ _E_\nAnyone who doubts the strength or determination of the U.S. should look to our past....and you will doubt it no longer. __HTTP__ _E_\nIf I win I am going to instruct my AG to get a special prosecutor to look into your situation bc there's never been anything like your lies. _E_\nWill be interviewed by @MariaBartiromo on @FoxBizAlert at 7:30 A.M. Enjoy! _E_\nThank you! #Trump2016 __HTTP__ __HTTP__ _E_\nIn this time of economic turmoil where millions of Americans are unemployed our tax dollars are paying @BillMoyers' big @PBS salary! _E_\nYou can only smile when the losers of the world try so hard to put down successful people. Just remember they all want to be YOU! _E_\nWatch my interview with Greta Van Susteren on her show On the Record tonight on Fox News in the 10 p.m. hour. _E_\nA disgraceful verdict in the Kate Steinle case! No wonder the people of our Country are so angry with Illegal Immigration. _E_\nCNN/ORC Poll results just out for Nevada—WOW! Trump 38 Carson 22 Fiorina 8 Bush 6 Cruz 4 __HTTP__ _E_\n...Remember I told you so. 
_E_\nThank you Alabama! From now on it's going to be #AmericaFirst. Our goal is to bring back that wonderful phrase:... __HTTP__ _E_\nJust got final renderings of Trump National Doral in Miami there will be nothing like it in the Country will be the best! _E_\nChina has hacked another US government body. __HTTP__ will we learn? _E_\nThe failing @nytimes wrote a story about my management style & that I don't have many people. I have 73 Hillary has 800 & I'm beating her. _E_\nTed Cruz complains about my views on eminent domain but without it we wouldn't have roads highways airports schools or even pipelines. _E_\nObamaCare is torturing the American People.The Democrats have fooled the people long enough. Repeal or Repeal & Replace! I have pen in hand. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nAmazing both Transformers & Dark Knight Rises featured Trump properties and each grossed over $1B. Just coincidence. _E_\n....... I disagree but it's still cool. _E_\nPeople are loving the new line of Trump ties and shirts at Macy's. Check them out! _E_\nRT @IsraeliPM: PM Benjamin Netanyahu at weekly Cabinet meeting:In two weeks Israel will host @POTUS Trump on his first trip as President... _E_\nIt was an honor to welcome so many truckers and trucking industry leaders to the @WhiteHouse today! __HTTP__ _E_\nI hear a failing New York newspaper is going to publish one of my old cell phone numbers. So original just one of many! _E_\nHopefully the violent and vicious killing by ISIS of a beloved French priest is causing people to start thinking rationally. Get tough! _E_\nLooking forward to meeting the students of Urbandale High School tomorrow __HTTP__ _E_\nJulian Assange said a 14 year old could have hacked Podesta why was DNC so careless? Also said Russians did not give him the info! _E_\nWith @TraceAdkins on top of the truck the crowd definitely buzzed. #CelebApprentice _E_\nJoin me on Saturday in Syracuse New York! 
#NYPrimary #Trump2016 __HTTP__ __HTTP__ _E_\nMy interview yesterday from Newsmax Obama Is 'Now Totally Lost' Boehner Must Not Fold __HTTP__ _E_\nCongratulations to @GovMikeHuckabee on last night's tremendous speech. Mike united the party faithful and explained that we can do better. _E_\nWhy do so many people say I hate President Obama—I don't hate the President at all. I just disagree with his policies! _E_\nIf you treat people right they will treat you right...ninety percent of the time. Franklin D. Roosevelt _E_\nRepublicans are going for the big Budget approval today first step toward massive tax cuts. I think we have the votes but who knows? _E_\nPervert Anthony Wiener will never be able to get away from his perversion the cure rate is ZERO. _E_\nGreat! __HTTP__ _E_\n\"China presents three big threats to the United States in its outrageous currency manipulation its systematic (cont) __HTTP__ _E_\nPresident Obama looks and sounds so ridiculous making his speech in Cuba especially in the shadows of Brussels. He is being treated badly! _E_\nNegotiation tip: Know exactly what you want and focus on that. Trust your instincts even after you've honed your skills. _E_\nThe Jets just don't have it. Time for a quarterback change! _E_\nHow does Obama rationalize giving Iran $8B in sanction relief when a Christian pastor is being tortured in an Iranian prison? _E_\nVia @MailOnline @dmartosko Donald Trump says it's morally unfair of Obama to send soldiers into Ebola hot zone __HTTP__ _E_\nTrue thanks. __HTTP__ _E_\nUngrateful TRAITOR Chelsea Manning who should never have been released from prison is now calling President Obama a weak leader. Terrible! _E_\nWhy is @BarackObama continuing to lie? __HTTP__ has found that @MittRomney did not ship jobs overseas __HTTP__ _E_\nCLINTON'S FLAILING SYRIA POLICY WAS JUDGED A FAILURE: __HTTP__ #VPDebate _E_\nHow many illegal foreign donations will Obama collect this final week? Another scandal ignored by the liberal media. 
__HTTP__ _E_\n.@mike_pence was fantastic tonight. Will be a great V.P. _E_\nI am counting on your help to defeat Hillary Clinton and her cronies. Let's Make America Great Again! __HTTP__ _E_\nDiscussing #NewYorkValuesin Buffalo last night on the eve of the #NYPrimary.LETS GO NY! #VoteTrump __HTTP__ _E_\nI am making a big speech the night of the @FoxNews debate but I wish everyone well. Yesterday was a big day for me with 5 wins! _E_\nThe media has not covered my long shot great finish in Iowa fairly. Brought in record voters and got second highest vote total in history! _E_\nJust named General H.R. McMaster National Security Advisor. _E_\nThink big set your vision high and go for it. You'll be shocked by what you can accomplish when you do. Midas Touch _E_\nThank you! WE WILL MAKE AMERICA GREAT AGAIN! #Trump2016 __HTTP__ _E_\nAlways pretend that you're working for yourself. You'll do a wonderful job. It's simple but it works. Think Like a Billionaire _E_\nNew national Bloomberg poll just released thank you! Join the MOVEMENT: __HTTP__ #TrumpTrain... __HTTP__ _E_\n.@IvankaTrump and me at the @todayshow this morning. __HTTP__ _E_\nGreat news out of New Hampshire! DonaldTrump is pulling away from the pack w/ 2nd is 17% behind him! #Trump2016 __HTTP__ _E_\nI have an open door policy for my employees. I'm accessible because I like to know what's going on. The Midas Touch _E_\nOur country has the slowest growth since 1929. #BigLeagueTruth #debate _E_\nNorth Carolina lost 300000 manufacturing jobs and Ohio lost 400000 since 2000. Going to Mexico etc. NO MORE IF I WIN WE WILL BRING BACK! _E_\nIt's Tuesday. How many more customers has Glenfiddich lost today? _E_\nAfter thousands lost and spending two trillion dollars Iraq (I told you so) is imploding. Really dumb pols put us and kept us there so sad! _E_\nI have never seen a thin person drinking Diet Coke. _E_\nRick Perry did an absolutely horrible job of securing the border. He should be ashamed of himself. Gov. 
Abbott has since been terrific. _E_\nI have been watching and loving the United States for many years and have NEVER seen it look weaker or less effective! _E_\nMy speech from last Saturday's @Citizens_United @AFPhq #NHFreedomSummit __HTTP__ via @cspan _E_\nNo one will work harder. No one will move heaven and earth like Mitt Romney to make this country a better place to live! @AnnDRomney _E_\nDopey Prince @Alwaleed_Talal wants to control our U.S. politicians with daddy's money. Can't do it when I get elected. #Trump2016 _E_\nVictoria's Secret reps were nasty to @KateUpton and now she is doing great. _E_\nDOW S&P 500 and NASDAQ close at record highs! #MAGA __HTTP__ _E_\nRT @realDonaldTrump: ...big unnecessary regulation cuts made it all possible\" (among many other things). \"President Trump reversed the poli... _E_\nThe sub station in Blackdog is very dangerous on unregulated landfill—fire hazard! @AlexSalmond @pressjournal _E_\nExclusive Video–Broaddrick Willey Jones to Bill's Defenders: 'These Are Crimes' 'Terrified' of 'Enabler' Hillary __HTTP__ _E_\nThe Celebrity Apprentice has a two hour premiere this Sunday March 14th at 9 p.m. on NBC. This will be the best season yet see you then! _E_\nCNN: New GOP polls show Trump's favorability is up __HTTP__ _E_\nLocated in the beautiful countryside of Mooresville @Trump_Charlotte has a superb clubhouse & top amenities __HTTP__ _E_\nTHe Chinese military is already hacking our satellites __HTTP__ The Chinese government is not an American ally. _E_\nThe Theater must always be a safe and special place.The cast of Hamilton was very rude last night to a very good man Mike Pence. Apologize! _E_\n.@MarieLeff #asktrump __HTTP__ _E_\nObama is without question the WORST EVER president. I predict he will now do something really bad and totally stupid to show manhood! _E_\nLightweight A.G. Eric Schneiderman is perhaps the most incompetent and least respected A.G. in the U.S. He is a total joke! 
_E_\nMitch get back to work and put Repeal & Replace Tax Reform & Cuts and a great Infrastructure Bill on my desk for signing. You can do it! _E_\nHappy Friday the 13th __HTTP__ _E_\nTweet me back if u think we should start a petition to fire @hardball_chris for his comments on Sandy & the death & destruction it caused. _E_\nChina's economy is now projected to overtake the US as the world's largest economy by 2027 __HTTP__ #TimeToGetTough _E_\nTrump International Hotel & Tower Vancouver will be a fantastic addition to a spectacular city. __HTTP__ _E_\nDonald Trump Explains Why He Called Django Unchained 'Racist' In Tweet __HTTP__ via @accesshollywood _E_\nVia @Newsmax_Media by @melaniebatley: \"Trump Backed Candidate @leezeldin Wins NY GOP Primary\" __HTTP__ _E_\nThank you to Jeffrey Lord @AmSpec for his incredible & insightful article this weekend on failing & irrelevant @BuzzFeed _E_\nPacked venue of people who want to #MakeAmericaGreatAgain __HTTP__ _E_\nMy @LateNightJimmy interview with @jimmyfallon discussing the new season of All Star @CelebApprentice __HTTP__ _E_\nI will be speaking the night before the RNC in Sarasota FL when I receive the Statesman of the Year award. _E_\n.@MittRomney will make us energy independent by 2020 __HTTP__ @BarackObama will keep wasting money on Solyndra projects. _E_\nI'll be on with Larry Kudlow of the Kudlow Report tonight on CNBC at 7 p.m. We'll be discussing current affairs and politics. Tune in. _E_\nCongrats to @BarackObama he has now had over 40 months straight of over 8% unemployment while accruing over $6T (cont) __HTTP__ _E_\nVia @TVbytheNumbers:\"TV Ratings Sunday 'Family Guy' & 'The Simpsons' Down 'All Star Celebrity Apprentice' Up\" __HTTP__ _E_\nJoin me in Greensboro North Carolina tomorrow at 2:00pm! #TrumpRally __HTTP__ __HTTP__ _E_\nKate is donating a #kidney to her husband __HTTP__ . You can help! I did @fundanything #donate _E_\nMAKE AMERICA GREAT AGAIN! 
MAKE AMERICA SAFE AGAIN!#Trump2016 #AmericaFirst __HTTP__ _E_\nI have been consistent in my opposition to Common Core. Get rid of Common Core keep education local! _E_\n.@ApprenticeNBC season premiere this Sunday at 9/8c on @NBC __HTTP__ _E_\n\"It's always great to be in business with Donald Trump\" said @Telemundo president Emilio Romano. __HTTP__ _E_\nThe fact that Sneaky Dianne Feinstein who has on numerous occasions stated that collusion between Trump/Russia has not been found would release testimony in such an underhanded and possibly illegal way totally without authorization is a disgrace. Must have tough Primary! _E_\nI will be on @foxandfriends Monday morning at 7.00. A lot to talk about! _E_\nThank you to former campaign adviser Michael Caputo for saying so powerfully that there was no Russian collusion in our winning campaign. _E_\nRemember to keep going if you stop your momentum will stop. _E_\nI have an idea for A Rod buy a home at @TrumpGolfLA overlooking the Pacific will bring you better luck. _E_\nA letter from an amazing woman __HTTP__ _E_\nThe dollar always talks in the end although our pols are killing the dollar! _E_\nRemember that Bill Clinton was brought in to help Hillary against Obama in 2008. He was terrible failed badly and was called a racist! _E_\nThank you to General Motors and Walmart for starting the big jobs push back into the U.S.! _E_\nNo surprise. @DNC displayed Russian ships in tribute to vets __HTTP__ Did they mean to honor the Russians? _E_\n\"Trump's Championship #BlueMonster Course Opens To Rave Reviews\" __HTTP__ via @sacbee_news _E_\n.@AlexSalmond don't worry my ad will be shown across the world and it is highly accurate! _E_\nCLINTON IS WEAK ON NORTH KOREA: __HTTP__ #VPDebate _E_\nI have always been the same person remain true to self.The media wants me to change but it would be very dishonest to supporters to do so! _E_\nI hear @NBCNews / @WSJ came out with another one of their phony polls. 
While I am leading they are totally discredited after last S.C. poll _E_\nOpening in 2016 @TrumpVancouver's original twisting design will transform the skyline at 616 ft. & 63 stories __HTTP__ _E_\nDo as I say not as I do.The politicians who passed ObamaCare are now exempting themselves from the monstrosity __HTTP__ _E_\nThe Bay Bridge in San Francisco is being built by the Chinese tremendous cost overruns. A total mess. We should build our own bridges etc _E_\nMasa (SoftBank) of Japan has agreed to invest $50 billion in the U.S. toward businesses and 50000 new jobs.... _E_\nmention crime infested) rather than falsely complaining about the election results. All talk talk talk no action or results. Sad! _E_\nThe Obstructionist Democrats have given us (or not fixed) some of the worst trade deals in World History. I am changing that fast! _E_\nCleveland just made a very wise decision congrats! _E_\nThank you General. #Trump2016 __HTTP__ _E_\nTrump International Golf Club Turnberry Scotland home to four of the greatest Open Championships of all time.. __HTTP__ _E_\nDepression be careful of China! __HTTP__ _E_\nThe language used by me at the DACA meeting was tough but this was not the language used. What was really tough was the outlandish proposal made a big setback for DACA! _E_\nThank you for your support! We will MAKE AMERICA SAFE AND GREAT AGAIN! #ImWithYou #AmericaFirst __HTTP__ _E_\nIncompetent @RichLowry lost it tonight on @FoxNews. He should not be allowed on TV and the FCC should fine him! _E_\n...about then candidate Trump. Catherine Herridge @FoxNews. So why doesn't Fake News report this? Witch Hunt! Purposely phony reporting. _E_\nYou have enemies? Good. That means you've stood up for something sometime in your life. Winston Churchill _E_\nSyria has prepared for an attack based on all of our talk they have moved targeted ammunition and supplies to new locations.Amazing! _E_\nBecoming a US citizen is not a right it's a privilege. 
_E_\nDoes anybody really think that President Obama didn't know about our spying on the leaders of allies around the world not possible! _E_\n.@TrumpNationalHV features wide open pristine fairways tour caliber greens 64 strategically placed sand bunkers __HTTP__ _E_\nCongratulations to all of the \"DEPLORABLES\" and the millions of people who gave us a MASSIVE (304 227) Electoral College landslide victory! __HTTP__ _E_\nThe best way out is always through. Robert Frost _E_\nObstacles are those frightening things that become visible when we take our eyes off our goals. Henry Ford _E_\nNevada: A quick reminder that today is your last day to register to vote! __HTTP__ __HTTP__ _E_\n.@jacknicklaus has done a GREAT job as the architect of my new golf course at Ferry Point. NYC is very proud! _E_\nObama's deal raises taxes on 77% of national households. With Obama Care taxes kicking in now everyone will be paying for his 2nd term. _E_\n.@BrentBozell one of the National Review lightweights came to my office begging for money like a dog. Why doesn't he say that? _E_\nWe don't have the leadership including the Generals (who just said the element of surprise does not matter) to attack anyone! Cool it. _E_\nNow we will never know if @BarackObama would have been able to fill Bank of America Stadium. Pretty convenient. _E_\nWatching Senator Richard Blumenthal speak of Comey is a joke. Richie devised one of the greatest military frauds in U.S. history. For.... _E_\nAnne Hathaway is a good winner! _E_\nWe launched a new series of #Trump2016 videos via Facebook. A new topic everyday! Watch: __HTTP__ __HTTP__ _E_\nCongratulations to the $1B ObamaCare website on enrolling FOUR in Delaware. Cost to us $4M __HTTP__ _E_\nThank you! #AmericaFirst __HTTP__ _E_\nRussia is on the move in the Ukraine Iran is nuking up & Libya is run by Al Qaeda yet Obama is busy issuing 'climate change\" warnings. _E_\nJust letting China know in advance that the USA will win the medal count in the Olympics. 
Even with your cheating you can't beat us. _E_\nHard for Biden to justify Libya mess but doing best he can. #VPDebate _E_\nEliot Spitzer's illegal frivolous & over reaching harassment of Hank Greenberg at AIG played a major part in 2008 financial meltdown. _E_\nGetting ready for some big news with my friends at @pgaofamerica _E_\nWill be on @foxandfriends at 8:00 A.M. _E_\nHappy Birthday to my legendary friend Aretha Franklin. _E_\nGreat minds have purposes others have wishes. Washington Irving _E_\nCourage is being scared to death... and saddling up anyway. John Wayne _E_\nThe failing @nytimes writes total fiction concerning me. They have gotten it wrong for two years and now are making up stories & sources! _E_\n...Who says the death penalty is not a deterrent? _E_\nI will be heading to Dubai where I am doing a GREAT project with Damac will be a massive success! _E_\nRT @foxandfriends: Hannity: Russia allegations 'boomeranging back' on Democrats __HTTP__ _E_\nMexico is allowing many thousands to go thru their country & to our very stupid open door. The Mexicans are laughing at us as buses pass by. _E_\nFrom 10 11 pm @ApprenticeNBC ranks #1 in 18 49 among ABCCBS and NBC. #CelebApprentice _E_\n.@MittRomney should not give any other further information until @BarackObama releases the things that everyone wants to see _E_\nI just finished a great meeting with the Republican Senators concerning HealthCare. They really want to get it right unlike OCare! _E_\nThe American people are sick and tired of not being able to lead normal lives and to constantly be on the lookout for terror and terrorists! _E_\nEntrepreneurs are visionaries in some respects they look beyond the present. Keep that in mind when looking for opportunities. _E_\nWe want to make sure that we have the workforce development programs we need to ensure these jobs are.... __HTTP__ _E_\n\"It takes guts to win fortunately most people don't have guts! Donald J. Trump _E_\nI will be on @meetthepress at 10:30. 
@nbc will be releasing their new poll numbers. Based on the debate results I should do well who knows? _E_\n.@FoxNews Chris Wallace: \"More evidence of Dem collusion with Russia than GOP\" __HTTP__ _E_\nWill be interviewed on @JudgeJeanine at 9:00 P.M. Enjoy! _E_\nPresident Obama refuses to answer question about Iran terror funding. I won't dodge questions as your President. __HTTP__ _E_\nNew national poll released. Join the MOVEMENT & together we will #MakeAmericaGreatAgain! __HTTP__ __HTTP__ _E_\nRemember China is not a friend of the United States! _E_\nRingling Brothers is phasing out their elephants. Ifor one will never go again. They probably used the animal rights stuff to reduce costs _E_\nEgypt is a total mess. We should have backed Mubarak instead of dropping him like a dog. _E_\nPhony Club For Growth tried to shake me down for one million dollars & is now putting out nasty negative ads on me. They are total losers! _E_\nNo matter what you're managing don't assume you can glide by. You have to work to maintain your momentum. Trump: How to Get Rich _E_\nEconomic confidence is soaring as we unleash the power of private sector job creation and stand up for the American Workers. #AmericaFirst _E_\nI am a cautious optimist. Call it positive thinking with a lot of reality checks. _E_\nSEE YOU IN COURT THE SECURITY OF OUR NATION IS AT STAKE! _E_\n.@donlemon on @CNN at 10:00 P.M. _E_\nIt's driving @ariannahuff & the money losing @HuffingtonPost post crazy that I am #1 in their poll and they only write bad stories about me! _E_\nThe Misery Index is at a 28 year high. _E_\nMy economic policy speech will be carried live at 12:15 P.M. Enjoy! _E_\nWomen defy media narrative love Trump at packed Michigan rally.VIDEO: __HTTP__ __HTTP__ _E_\nThank you for your support! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nFor information on Trump University victory call Alan Garten Esquire at 212.836.3203 or Jeff Goldman Esquire at 212.867.4466. 
_E_\n_xx_justme I still can't believe Donald Trump responded to my tweet. Respect #Trump2016 He would be the best Pres for this Country. Thx _E_\nWind turbines are a scourge to communities and wildlife. They are environmental disasters. _E_\nIvanka and Joan Rivers will be working hard tonight at the Live Finale everybody must watch the OPENING at 9. _E_\nA new poll indicates that 68% of my supporters would vote for me if I departed the GOP & ran as an independent. __HTTP__ _E_\nVia @DMRegister by @WilliamPetroski: Trump: I can make America great again __HTTP__ _E_\nIf Russia or some other entity was hacking why did the White House wait so long to act? Why did they only complain after Hillary lost? _E_\nSleazebag @BashirLive has just been forced to resign from @msnbc. His pathetic apology wasn't enough to save his job. @SarahPalinUSA _E_\nJoin me in Charleston WV tomorrow! __HTTP__ _E_\nWow CNN just said that Donald Trump won the DEBATE connected best with audience. Also Time Drudge Newsmax N.Y.Times and more! _E_\nI should host the #Oscars just to shake things up this is not good! _E_\nFord said last week that it will expand in Michigan and U.S. instead of building a BILLION dollar plant in Mexico. Thank you Ford & Fiat C! _E_\nVia @WashTimes By Eugene Dunn: \"Trump could lead U.S. forward\" __HTTP__ _E_\nToday Americans everywhere remember the brave men and women of @NASA who lost their lives in our Nation's eternal quest to expand the boundaries of human potential. __HTTP__ __HTTP__ _E_\nIn '08 @PaulRyanVP predicted that US headed toward bankruptcy __HTTP__ @BarackObama has added over $6T in debt since. Scary. _E_\nMake sure to vote today. Vote for real change. Change that will deliver jobs and a free & strong America. Vote for @MittRomney. _E_\nI'll be on@SquawkCNBC tomorrow at 7:30 am #TrumpTuesday _E_\nUnder our President ISIS is gaining great strength __HTTP__ _E_\nIt was my great honor to deliver the #CGACommencement17 at the @USCGAcademy. 
CONGRATULATIONS to the Class of 2017!... __HTTP__ _E_\n\"Luck does not come around often. So when it does be sure to take full advantage of it even if it means working hard. Think Big _E_\nNow another Obama speech from 2002 with him talking about taking the rich's 'stuff' __HTTP__ Who is this guy? Where's the media? _E_\nJoin @autismspeaks and light the world blue on 4/2. #LIUB will raise awareness for millions with autism! _E_\nBig day in Alabama. Vote for Luther Strange he will be great! _E_\nI will be doing Fox & Friends at 7 (15 minutes). Enjoy it and your day! _E_\nJust announced that in the history of @CNN last night's debate was its highest rated ever. Will they send me flowers & a thank you note? _E_\nThank you Dallas Texas! __HTTP__ __HTTP__ _E_\nThe elites want Common Core so they can take education out of parental control. NO! Let's Make America Great Again! __HTTP__ _E_\nI look forward to all meetings today with world leaders including my meeting with Vladimir Putin. Much to discuss.#G20Summit #USA _E_\nAccording to Bill O'Reilly 80% of all the shootings in New York City are blacks if you add Hispanics that figure goes to 98%. 1% white. _E_\nThat was an amazing interview on @foxandfriends I hope the rest of the media picks it up to show how totally dishonest the @nytimes is! _E_\nNow that it's almost over I can't believe that unions & management couldn't save Twinkies etc & management just got a $1.75M bonus. _E_\nRecord setting gas prices in the U.S. we're really looking dumb. Lots of $'s being made on us. _E_\nHe @MittRomney wrote a great piece on China __HTTP__ @JonHuntsman criticized him (cont) __HTTP__ _E_\nAt the foot of Whitestone Bridge in the Bronx @TrumpFerryPoint offers fantastic views of the Manhattan skyline __HTTP__ _E_\nA note from the fabulous Mark Burnett: \"Donald congratulations again we are #1 in the 10:00pm hour. 
I am tweeting about it.\" _E_\nNew Reuters poll thank you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nSo don't forget to enter the Serta Counting Sheep for Hire contest! www.youtube.com/user/mattressserta _E_\n...if Congress gives us the massive tax cuts (and reform) I am asking for those numbers will grow by leaps and bounds. #MAGA _E_\nBeing in Detroit today was wonderful. Quick stop in Ohio to meet with some of our great supporters. Just got back home! _E_\nTed Cruz is in trouble for not reporting his bank borrowing in his very important Financial Disclosure Form. Very low interest loans scam! _E_\n\"@realDonaldTrump: I would like to extend my best wishes to all even the haters and losers on this special date September 11th.\" _E_\nVia @bpolitics by @tdopp: \"In Iowa Trump Promises to 'Surprise a Lot of People'\" __HTTP__ _E_\nTo show you how dishonest some of the press is they took my funny & (cont) __HTTP__ _E_\n2016 GOP Nomination Polls have me as #1 as seen on @SpecialReport with @BretBaier. __HTTP__ _E_\nIt's inconvenient and inconsiderate: @BarackObama is doing a fundraiser tonight making it almost impossibl... (cont) __HTTP__ _E_\nThe election result in France is very disappointing. The Europeans have to embrace austerity in order for their economy to fully recover. _E_\nI promise to rebuild our military and secure our border. Democrats want to shut down the government. Politics! _E_\nState Department official accused of offering 'quid pro quo' in Clinton email scandal __HTTP__ _E_\nI will be on FOX with the great @JudgeJeanine tonight at 9pm EST! Enjoy! #Trump2016 _E_\nHillary said at debate ISIS is going to people showing videos in order to recruit more radical jihadistst. She made up story want apology! _E_\nCrooked Hillary Clinton is a fraud who has put the public and country at risk by her illegal and very stupid use of e mails. Many missing! 
_E_\nI say we cannot continue to let Obama fly around on Air Force 1 at a cost of millions of dollars a day for the purpose of politics & play! _E_\nThank you to Chris Cox and Bikers for Trump Your support has been amazing. I will never forget. MAKE AMERICA GREAT AGAIN! _E_\nI am very happy to have the civilian version of The Apprentice back on the air this fall. There will be excitement as well as opportunity. _E_\nWisconsin and Pennsylvania have just certified my wins in those states. I actually picked up additional votes! _E_\nGreat news! #MAGA __HTTP__ _E_\nMichelle Obama's weekend ski trip toAspen makes it 16 times that Obamas have gone on vacation in 3 years. (cont) __HTTP__ _E_\nYesterday I was thrilled to be with so many WONDERFUL friends in Utah's MAGNIFICENT Capitol.It was my honor to sign two Presidential Proclamations that will modify the national monuments designations of both Bears Ears and Grand Staircase Escalante... __HTTP__ __HTTP__ _E_\nThank you @TrumpSoHo @TrumpNewYork for helping me celebrate #agreatcause @MarineCorpsLEF while accepting the Commandant's Leadership award! _E_\nNever get good #'s from failing Des Moines Register/Bloomberg. I think something's going on w/them. Up 13 in IA according to respected CNN. _E_\nObamaCare is in serious trouble. The Dems need big money to keep it going otherwise it dies far sooner than anyone would have thought. _E_\nThe United States is prepared to work with each of the leaders in this room today to achieve mutually beneficial commerce that is in the interests of both your countries and mine. That is the message I am here to deliver today. #APEC2017 __HTTP__ _E_\nDo you all remember how beautiful and safe a place Brussels was. Not anymore it is from a different world! U.S. must be vigilant and smart! _E_\nMust read article on Obama's illegal fundraising from abroad __HTTP__ Foreign candidate getting foreign donations. 
_E_\nICYMI @nypost's @LoisWeiss described my Monday @ICSC speech @javitscenter as one of my \"best and most riveting\" __HTTP__ _E_\nHillary and her friends! __HTTP__ _E_\n.@T Mobile has so many service complaints a total joke! _E_\nI am in Las Vegas for the @MissUSA 2012 pageant. Watch live tonight on @NBC at 9PM ET. __HTTP__ _E_\nNEBRASKA #VoteTrump TODAY!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nHappy 70th Birthday @CIA! __HTTP__ _E_\nWe are fighting hard for Merit Based immigration no more Democrat Lottery Systems. We must get MUCH tougher (and smarter). @foxandfriends _E_\nBig win in Montana for Republicans! We _E_\nRT @EricTrump: #Arizona: We made it easy to find your polling location for today's primary! Simply visit __HTTP__ __HTTP__ _E_\nTake a look at this amazing photo of the cast from the first ever All Star @CelebApprentice __HTTP__ _E_\nOut of hundreds of deals & transactions I have used the bankruptcy laws a few times to make deals better. Nothing personal just business. _E_\nNow an additional 600 700 jobs in America (2000) being eliminated for move to Mexico via Hartford Courant. __HTTP__ _E_\nOutright disgusting the Obama administration has continually stonewalled and lied to US Amb. Sean Smith's mother __HTTP__ _E_\nHillary won't call out radical Islam! She will be soundly defeated. _E_\nWhy isn't the Arab League paying for everything and sending troops? They want us to do their dirty work with no involvement by themselves! _E_\nThe DC press corps is obsessed with my @CPACnews speech which is scheduled tomorrow 8:45AM in the Potomac Ballroom. Can't blame them. _E_\nThe haters and losers that assume I was a non athlete and know nothing about coaches should look into my past unlike our President open book _E_\n.@MissUniverse final 3 on now. Great people great new owner @IMG. WATCH. _E_\nRT @realDonaldTrump: Loser terrorists must be dealt with in a much tougher manner.The internet is their main recruitment tool which we must... 
_E_\nMake sure to have fun and celebrate NYE with friends and family. Happy New Year everyone! _E_\nDonald Trump Hands Bill O'Reilly Cable TV Viewership Win @deadline __HTTP__ _E_\nVia @CBSNews by @ReenaJF: Donald Trump scolds Republicans: 'Toughen up' __HTTP__ _E_\nThe Fed continues to flood the market with US dollars. Wrong move. _E_\nMay jobless numbers have been readjusted to 8.2%. @BarackObama's economy is a disaster __HTTP__ New numbers tomorrow. _E_\nThere is only one person who should be crossing our southern border USMC Sgt. Tahmooressi. Boycott Mexico? #FreeOurMarine _E_\nThank you for the incredible support this morning Tampa Florida! #ICYMI watch here: __HTTP__ __HTTP__ _E_\nThe failing @nytimes should be focused on good reporting and the papers financial survival and not with constant hits on Donald Trump! _E_\nWake Up America China is eating our lunch. _E_\nGreat rally in New Mexico amazing crowd! Now in L.A. Big rally in Anaheim. _E_\nI am in Indiana where we just had a great rally. Fantastic people! Staying at a Holiday Inn Express new and clean not bad! _E_\nIt is really a shame that Barack Obama may stop $5M from being generously donated to charity all because he refuses to be transparent. _E_\nJust in big news I have been declared the winner of the CNMI Rep Caucus with 72.8% of the vote! Thank you! #SuperTuesday #VoteTrump _E_\nI heard that the underachieving John King of @CNN on Inside Politics was one hour of lies. Happily few people are watching dead network! _E_\nHere I am with @trishstratuscom #WWEHOF __HTTP__ _E_\nWhat will be @RickSantorum's excuse tomorrow after @MittRomney wins Wisconsin and Maryland? Time for Rick to face reality and drop out. _E_\nWhen it comes to China @BarackObama practices pretty please diplomacy. He begs and pleads and bows and it... (cont) __HTTP__ _E_\nGuess which POTUS has held more fundraisers than the previous 5 combined? 
__HTTP__ @BarackObama is (cont) __HTTP__ _E_\nBernie Sanders endorsing Crooked Hillary Clinton is like Occupy Wall Street endorsing Goldman Sachs. _E_\nDon King and so many other African Americans who know me well and endorsed me would not have done so if they thought I was a racist! _E_\nMonday night at 8:00 will be must see television. Our wonderful Joan Rivers plays a major role as my advisor on the Apprentice. AMAZING! _E_\nWhy does @Greta have a fired Bushy like dummy John Sununu on spewing false info? I will beat Hillary by a lot she wants no part of Trump. _E_\nThere have been 17 shutdowns since 1976 14 under Reagan and Bush with Democrat Congresses who wanted more spending. _E_\nEight Syrians were just caught on the southern border trying to get into the U.S. ISIS maybe? I told you so. WE NEED A BIG & BEAUTIFUL WALL! _E_\nThere are 11 more Solyndras in the @BarackObama energy program __HTTP__  He loves to waste our (cont) __HTTP__ _E_\nUranium deal to Russia with Clinton help and Obama Administration knowledge is the biggest story that Fake Media doesn't want to follow! _E_\nMilitary reps have attacked @BarackObama over Bin Laden leaks they believe he's just using this for his benefit. Not a big surprise... _E_\nI will be making my announcement on the next Secretary of State tomorrow morning. _E_\nEntrepreneurs are all unique. One way to build a business and turn it into a brand is to know who you are. Midas Touch _E_\n.@dennisrodman looks like he really cleaned up his act. _E_\nIf you're going through hell keep going. Winston Churchill _E_\nHillary Clinton raked in money from regimes that horribly oppress women and gays & refuses to speak out against Radical Islam. _E_\nTo be successful never give up. My secrets to success will be shared at the National Achievers Congress in London. __HTTP__ _E_\nPoll numbers way up making big progress! _E_\nAmerica's trade deficit with China is one of our greatest national security threats. Time for Fair Trade. 
We must produce our own products. _E_\nMy announcement is tomorrow! _E_\nSad to watch Bernie Sanders abandon his revolution. We welcome all voters who want to fix our rigged system and bring back our jobs. _E_\nTrump rails on Romney as possible 2016 contender __HTTP__ via @nypost by @GeoffEarle _E_\nThe Mar a Lago Club the crown jewel of Palm Beach is a landmark in the National Register of Historic Places __HTTP__ _E_\nVia @DailyCaller by @NeilMunroDC: \"Trump Wants Ebola Travel Ban\" __HTTP__ _E_\nHAPPY BIRTHDAY to my son @EricTrump! Very proud of you! __HTTP__ __HTTP__ _E_\nListen to an interview with Donald Trump discussing his new book Think Like A Champion: __HTTP__ _E_\nI as President want people coming into our Country who are going to help us become strong and great again people coming in through a system based on MERIT. No more Lotteries! #AMERICA FIRST _E_\nIf there is one more Ebola case in the U.S. a full travel ban will be instituted. This common sense move should have been done long ago! _E_\nAMAZING @BarackObama has actually found a government program he can cut in half the Defense Department...bad (cont) __HTTP__ _E_\nIran's quest for nuclear weapons is a major threat to our nation's national security interests. We can't allow Iran to go nuclear. _E_\nThe Dallas event on September 14 at 6:00 P.M. at the American Airlines Center looks like it will be a giant success. Tickets are going FAST! _E_\nJeffrey Robinson's #TrumpTower has it all. The ultra rich powerful and beautiful. It's your summer must read. __HTTP__ _E_\nThe upcoming season of @CelebApprentice will be terrific a great cast. _E_\nHave time to waste? Go to the ObamaCare website. _E_\nWith all of the jobs I am bringing back into the U.S. (even before taking office) with all of the new auto plants coming back into our..... _E_\nObama Spurns Trump Offer to Foot White House Tours __HTTP__ via @Newsmax_Media _E_\nThank you Florida! 
#SuperTuesday #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nI'll be on @foxandfriends on Monday at 7:30 AM... _E_\nI will be interviewed on Face The Nation with @jdickerson this morning. Enjoy! _E_\nGet out and vote Nebraska we will MAKE AMERICA GREAT AGAIN! _E_\nThe #CNBC 25 poll is a joke. I was in 9th place and taken off. (Politics?) No wonder @CNBC ratings are going down the tubes. _E_\nVia @CBNNews by @TheBrodyFile: \"Donald Trump To Brody File in 2011: People Send Me Bibles\" __HTTP__ _E_\nOur country is stagnant. We've lost jobs and business. We don't make things anymore b/c of the bill Hillary's husband signed and she blessed _E_\nThe two fake news polls released yesterday ABC & NBC while containing some very positive info were totally wrong in General E. Watch! _E_\nVia @NRO by @LovelaceRyanD: Trump Slams Bush: 'I Don't See Him Winning I Don't See There's Any Way' __HTTP__ _E_\nObama's proposed budget has another middle class tax hike __HTTP__ Enjoy! _E_\nGoing to Ohio home of one of the worst presidential candidates in history Kasich. Can't debate loves #ObamaCare dummy! _E_\nJust reported by CNN that the Trump halo effect caused a record shattering Democratic Debate rating of 15.3 million viewers. So true! _E_\nUnited Nations Resolution is the single largest economic sanctions package ever on North Korea. Over one billion dollars in cost to N.K. _E_\nTrump Will Make America GREAT!!!! #ChangeTheWorldIn5Words _E_\nRT @atensnut: Hillary calls Trump's remarks horrific while she lives with and protects a Rapist . Her actions are horrific. _E_\nAndy Roddick...a great tennis player is a fantastic guy with a wonderful wife. _E_\nThank you Michigan. We are going to bring back your jobs & together we will MAKE AMERICA GREAT AGAIN!Watch:... __HTTP__ _E_\nThe media is so after me on women Wow this is a tough business. Nobody has more respect for women than Donald Trump! 
_E_\n.@garyplayer you were great on @MikeAndMike this morning—& the Gary Player Villa at @TrumpDoral is a hot ticket. _E_\nHeading to Biloxi Mississippi. Massive crowds expected. Thank you for your support! #VoteTrump2016 __HTTP__ _E_\nObama's job approval is at 37% a record low. @GOP & @SpeakerBoehner have the leverage & momentum. Delay ObamaCare for all Americans! _E_\n#MakeAmericaGreatAgain __HTTP__ _E_\n.@foxandfriends interview re: North Korea firing @dennisrodman job report @MELANIATRUMP's debut & @WrestleMania __HTTP__ _E_\nFORMAL ACCEPTANCE OF THE NOMINATION! #TrumpPence16 __HTTP__ _E_\n.@Omarosa's emergency has put a new spin on Team Power's presentation—but it's not \"show time\" yet. #CelebApprentice _E_\nWill be on @foxandfriends. Enjoy! _E_\n.@CharlieRymerGC Charlie call me we'll set up a match with Gary and Damon. Doral now finished and doing great! _E_\nI believe that in addition to the 5 terrorist leaders President Obama gave up for Bergdahl a great deal of CASH was also given. So stupid! _E_\nI was relentless because more often than you would think sheer persistence is the difference between success and failure. NEVER GIVE UP! _E_\nGreat article on wind turbines by Robert Bryce in today's @NYPost __HTTP__ _E_\nI loved the day Paul Goldberger got fired (or left) as N.Y.Times architecture critic and has since faded into irrelevance. Kamin next! _E_\nI did interview with Chris Wallace of @FoxNews in order to be fair. He then puts on Rove Lane and Will three Trump bashers to discuss. _E_\nHave you heard? China just told Obama to jump. Obama asked how high. _E_\nLess than 1% of Obama's $4B immigration request will go towards immediate border security. A real scam. Enforce our laws now! _E_\nObama is addicted to spending America into insolvency. His record proves it. _E_\n14 African nations have totally banned West Africans from entering their nations. Likewise many other nations. But the U.S. = COME ON IN _E_\nGeorge Will said best debate he ever saw . 
If you ever heard George Will speak(boring) anything is exciting. _E_\nVictory press conference was over. Why is she allowed to grab me and shout questions? Can I press charges? __HTTP__ _E_\nI will be on Bill O'Reilly's show tonight at 8 PM talking about Iran and politics. @oreillyfactor _E_\nI look very much forward to meeting w/Paul Ryan & the GOP Party Leadership on Thurs in DC. Together we will beat the Dems at all levels! _E_\nI love Bluffton SC what a great place what great people. _E_\nCongrats to @JimmieJohnson a great guy on winning Daytona! _E_\nThank you Pennsylvania. This is a MOVEMENT like we have never seen before! #VoteTrumpPence16 on 11/8/16 together... __HTTP__ _E_\nPeople like @KatyTurNBC report on my campaign but have zero access. They say what they want without any knowledge.True of so much of media! _E_\nHow does frumpy & little read @nytimes editorial writer Gail Collins keep her job? She is totally irrelevant! @nytimescollins _E_\nHappy birthday to the great @leegreenwood83. You and your beautiful song have made such a difference. MAKE AMERICA GREAT AGAIN! _E_\nI watched Mark Cuban on Jay Leno last night what a jerk! _E_\nMichael Barbaro the author of the now discredited @nytimes hit piece on me with women has in past tweeted badly about me. He should resign _E_\nISIS made a big mistake with the beheading of the reporter. Even people against intervention want them blown into oblivion. LEADERSHIP! _E_\nTwitter will soon be irrelevant if lowlifes are so easily able to hack into accounts. _E_\nWow sexual assaults in the military have gone through the roof far worse than anybody could have predicted! _E_\nBret had a target on his back from the get go... _E_\nTrump Tower is located at 725 Fifth Avenue between 56th and 57th Streets... _E_\n... Time for the Republicans to find someone new—and better. _E_\nMAKE AMERICA GREAT AGAIN! 
__HTTP__ _E_\nI demand an apology from Hillary Clinton for the disgusting story she made up about me for purposes of the debate. There never was a video. _E_\nSacrificing our nation's bravest for ungrateful Iraqis = great for China. China is taking majority of the oil __HTTP__ _E_\nMy family and I just arrived in Scotland for the grand opening of Trump International Golf Links Scotland __HTTP__ _E_\nTo become a champion fight one more round. James J. Corbett long ago Heavyweight Champion _E_\nThanks to the historic TAX CUTS that I signed into law your paychecks are going way UP your taxes are going way DOWN and America is once again OPEN FOR BUSINESS! __HTTP__ _E_\nThe media refuses to talk about the three new national polls that have me in first place. Biggest crowds ever watch what happens! _E_\nAt 10:30 I will be interviewed on both @meetthepress by @chucktodd and @CBSNews Face The Nation by John Dickerson. This after long evening! _E_\n.@tedcruz Conflicting Stances on Birthright Citizenship [14th Amendment] Gives #TeamTrump credit. __HTTP__ _E_\nIf U.C. Berkeley does not allow free speech and practices violence on innocent people with a different point of view NO FEDERAL FUNDS? _E_\nMore and more Americans seem fed up with both Parties I agree. _E_\nDiscussing #SyrianRefugees with @EricBolling on @FoxNews back on 10/3/2015. #ISIS __HTTP__ _E_\nAlex Rodriguez should substantially reduce his salary from the Yankees in that he misrepresented his use of (cont) __HTTP__ _E_\nRT @CLewandowski_: Gov Nikki Haley just became a liability for Rubio after this was published to social media! __HTTP__ _E_\nThank you Erie Pennsylvania! Together we will #MakeAmericaGreatAgain! __HTTP__ _E_\n.@CharlesMBlow Why don't you use new polls instead of the single ancient national poll that was a tiny bit negative. Dishonest reporting! _E_\nMy ties & shirts at Macy's are doing great. Stupid @GoAngelo is making people aware of how good they are! 
_E_\nI'm not saying to not give vaccines I am just saying give them small doses over a long period of time not one massive dose for a child. _E_\n\"You can have the most wonderful product in the world but if people don't know about it it's not worth much.\" The Art of the Deal _E_\nJamie Dimon just gave away $13B to government in settlement. Terrible move & bad precedent. Could have done much better by fighting. _E_\nI love you North Carolina thank you for your amazing support! Get out and __HTTP__ tomorrow!Watch:... __HTTP__ _E_\nNew Bloomberg Poll: Trump Leads Big __HTTP__ _E_\nIt was a great honor to have spoken before the countries of the world at the United Nations.#USAatUNGA#UNGA __HTTP__ __HTTP__ _E_\nObama is giving Social Security & ObamaCare to illegals yet wants to cut military benefits __HTTP__ Disgrace! _E_\nEntrepreneurs: Everything starts with you. Leadership is not a group effort if you're in charge then be in charge. _E_\nUS should have told Libya Rebels give us 50% of your oil for our military support. _E_\nIf Republicans don't Repeal and Replace the disastrous ObamaCare the repercussions will be far greater than any of them understand! _E_\nChina is now deploying drones across ocean routes used for trade __HTTP__ They stole the technology from us. _E_\nIf the people of our great country could only see how viciously and inaccurately my administration is covered by certain media! _E_\nExceptional dining matched with exceptional views @Trumpchicago offers a unique array of 5 star dining options __HTTP__ _E_\n.@bobvanderplaats begged me to do an event while asking organizers for $100000 for himself—a bad guy! _E_\nRT @DRUDGE_REPORT: WSJ: The Cold Clinton Reality... __HTTP__ _E_\nEntrepreneurs: See yourself as victorious. This will focus you in the right direction. Put everything you've got into what you're doing. _E_\nThe dishonest media didn't mention that Bernie Sanders was very angry looking during Crooked's speech. 
He wishes he didn't make that deal! _E_\nBased on @MegynKelly's conflict of interest and bias she should not be allowed to be a moderator of the next debate. _E_\nIt's Friday how many advertisers dropped @HuffPost today? _E_\nVery excited for @LaraLeaYunaska and @EricTrump's wedding this weekend. _E_\n#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nHere are my thoughts on last night's episode of The Celebrity Apprentice... __HTTP__ _E_\nCould be the hurricane helps @MittRomney people are rioting in the streets over gasoline _E_\nYoung entrepreneurs – never back down. Take the hits and get up. That's what makes a winner. _E_\nMy interview w/ @BloombergTV's Peter Cook re the Old Post Office Bldg becoming Trump Int'l Hotel Washington D.C. __HTTP__ _E_\nWell that is it. Well done Megyn and they all lived happily ever after! Now let us all see how THE MOVEMENT does in Oregon tonight! _E_\n.@BarackObama is practically begging @MittRomney to disavow the place of birth movement he is afraid of it and (cont) __HTTP__ _E_\nA lot of people strongly advised me against doing @ApprenticeNBC. Next week we start filming the record 13th seasonHence go with your gut! _E_\nSorry to hear @msnbc was dead last in the gutter in their Boston bombing coverage __HTTP__ @hardball_chris @Lawrence _E_\nHillary Clinton has announced that she is letting her husband out to campaign but HE'S DEMONSTRATED A PENCHANT FOR SEXISM so inappropriate! _E_\nVia @PRNewswire: \"Central Park Horse Show To Make Inaugural Debut in NYC Sept 18 21\" __HTTP__ I am proud to be a sponsor! _E_\nThe same brilliant negotiators that gave up five Taliban leaders for one traitor are now making trade deals with China & others.No chance _E_\nThe debate was very interesting last night. There were numerous winners and Governor Romney did very well. 
_E_\n\"Some events will wipe out one person but will make another even more tenacious.\" – Think Like a Champion _E_\nHow come discredited reporter @mckaycoppins refused to write that the events in New Hampshire Buffalo and N.Y. were all record breakers! _E_\nRT @TeamTrump: It's hard to fight terrorism when you're making cash payments to the world's LARGEST state sponsor of TERROR. Under Trump: N... _E_\nWeak newscasters are asking is there a racial component to knockout attacks? Of course there is and weakness will only make it worse! _E_\nA Rod should donate his contract to charity. He doesn't make the @yankees any money and he doesn't perform. He is a $30M/yr rip off. _E_\npower from Washington D.C. and giving it back to you the American People. #InaugurationDay _E_\nGreat news from Ireland—Clare County Council turned down massive windfarm near my hotel & golf course in Doonbeg. __HTTP__ _E_\nThe only job @BarackObama cares about is his own. Everything he does is for his own reelection. _E_\nWe are getting ready to protect Saudi Arabia against Iran & others sending ships. How much are they going to pay us toward this protection. _E_\nFLORIDA Visit __HTTP__ to find shelters road closures & evacuation routes. Helpful Twitter list: __HTTP__ __HTTP__ _E_\nEl Chapo and the Mexican drug cartels use the border unimpeded like it was a vacuum cleaner sucking drugs and death right into the U.S. _E_\n.@lolojones given a raw deal in @nytimes story not fair. _E_\n(1/2) Time Magazine has me on the cover this week. David Von Drehle has written one of the best stories I have ever had. _E_\nThe Wikileaks e mail release today was so bad to Sanders that it will make it impossible for him to support her unless he is a fraud! _E_\nDemocrats slam GOP healthcare proposal as Obamacare premiums & deductibles increase by over 100%. Remember keep your doctor keep your plan? _E_\n... 
Supreme Court pick economic enthusiasm deregulation & so much more have driven the Trump base even closer together. Will never change! _E_\nJoin me tomorrow Nov. 3rd at 12pm in #TrumpTowerNY. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_\nCongrats to Obama & Democrats. CBO has just announced that ObamaCare missed its uninsured target by half & program costs extra $700B+. _E_\n\"Let other people talk. Any business conversation should be two sided.\" – Think Like a Billionaire _E_\n\"Trump could be great friend if 'Second Amendment' enthusiasm is real\" __HTTP__ via @SFLuxe _E_\nApple's iPhone sales fell way short they must go to a larger screen as alternative fast (as I said long ago)! Samsung's size much better. _E_\nWould be nice if @jmartNYT learned how to read the polls before writing his next story. Probably done on purpose but not good reporting! _E_\nVia @GolfweekMag: \"Trump reveals routing for second course in Scotland\" __HTTP__ _E_\nA great photo of @MittRomney and me __HTTP__ _E_\nRepublicans should have been much tougher on Obama. Just wait until you see what Obama does to Romney at the DNC! _E_\nWill be on @Morning_Joe at 6:30 A.M. _E_\nChristians need support in our country (and around the world) their religious liberty is at stake! Obama has been horrible I will be great _E_\nCongratulations to @NHGOP & @AFPFNH for winning control of the State House & Executive Council while holding State Senate. Strong results! _E_\nShe'll say anything and change NOTHING! #MAGA #BigLeagueTruth __HTTP__ _E_\nRated Toronto's #1 hotel @TrumpTO has 261 guest rooms & suites furnished in elegant cosmopolitan style. __HTTP__ _E_\nI am watching Crooked Hillary speak. Same old stuff our country needs change! _E_\nIran is toying with our president buying time and laughing at the stupidity of our leadership. Syria and now this! What's next? 
_E_\nRT @foxandfriends: Chicago approves new plan to hide illegal immigrants from the feds plus give them access to city services __HTTP__ _E_\nLocated in Central Park the iconic @TrumpRink is NYC's top skating rink. VIP sessions are available for booking __HTTP__ _E_\nA phony story that I am trying to buy a soccer team in Argentina is untrue. Never even heard of the team—no interest! __HTTP__ _E_\nHillary said with respect to ISIS we are finally where we need to be. Do we want 4 more years of incompetent leadership? MAGA! _E_\nI call Jeb Bush the reluctant warrior he just doesn't want to be doing this he is not having fun! _E_\nSocialists think profits are a vice I consider losses the real vice. Winston Churchill _E_\n... Will be there front & center along with the 70 greatest players in the world. WGC @Cadillac Championship _E_\nStatesman of the Year in Sarasota FL on Sunday night will be terrific a total sellout. _E_\n.@foxandfriends we are in record territory in all things having to do with our economy! __HTTP__ _E_\nNo Question' Violent Crime Will Rise If Program (Stop & Frisk) Is Stopped\" @NY_POLICE Commissioner Ray Kelly _E_\nHeading to Youngstown Ohio now some great polls. #AmericaFirst __HTTP__ _E_\nWhy didn't Obama as part of the negotiation free the Christian Pastor Saeed Abedini? __HTTP__ _E_\nCongratulations to the Republic of Korea on what will be a MAGNIFICENT Winter Olympics! What the South Korean people have built is truly an inspiration! __HTTP__ _E_\nThe 2nd Amendment is under siege. We need SCOTUS judges who will uphold the US Constitution. #Debate #BigLeagueTruth _E_\nWatch me on the @hannityshow tonight at 9pm. More thoughts on Anthony Weiner in today's #trumpvlog... __HTTP__ _E_\nIn the last 24 hrs. we have raised over $13M from online donations and National Call Day and we're still going! Thank you America! 
#MAGA _E_\nDems don't want to talk ISIS b/c Hillary's foreign interventions unleashed ISIS & her refugee plans make it easier for them to come here. _E_\nChina loved Obama's climate change speech yesterday. They laughed! It hastens their takeover of us as the leading world economy. _E_\nRECKLESS! @BarackObama has now increased the debt more than any other POTUS and the first 42 combined. __HTTP__ _E_\nOPEC is ripping us off on oil. We are ripping ourselves off by investing in unproven green energy. #Solyndra _E_\nFor all of those who want to #MakeAmericaGreatAgain boycott @Macys. They are weak on border security & stopping illegal immigration. _E_\n.@SkyscraperLive: Nick all of the folks at Trump International next door are wishing you well. We will block the strong winds! _E_\n#VoteTrumpMS! #Trump2016 __HTTP__ _E_\nRT @LouDobbs: Trump outlines new child care policy proposals via the @FoxNews App @realDonaldTrump seems a candidate of destiny __HTTP__ _E_\nAsk yourself: What am I pretending not to see? There may be some great opportunities right around you. _E_\nNegotiation tip #1: The worst thing you can possibly do in a deal is seem desperate to make it. @realDonaldTrump _E_\nPhoto of @Gretawire and me from yesterday's interview... __HTTP__ _E_\nHappy Father's Day to all! I had a wonderful and loving father. __HTTP__ _E_\nThank you to the amazing law enforcement officers in Colorado!#MakeAmericaGreatAgain #LESM __HTTP__ _E_\nWhy doesn't the media want to report that on the two Big Thursdays when Crooked Hillary and I made our speeches Republican's won ratings _E_\nMilitary has announced that China has successfully hacked our advanced weapon designs. China is our enemy.Should we offset this on our debt? _E_\nLive from New York November 7th! @nbcsnl __HTTP__ _E_\nGoing now to make a major speech before some of the world's biggest investors in Dubai! _E_\nNothing on emails. Nothing on the corrupt Clinton Foundation. And nothing on #Benghazi. 
#Debates2016 #debatenight _E_\nHard to believe that the Democrats who have gone so far LEFT that they are no longer recognizable are fighting so hard for Sanctuary crime _E_\n.@HillaryClinton : Bill \"clarified\" what he meant when calling Obamacare a \"disaster.\" Actually \"disaster\" is pretty clear. #Debate _E_\nI have no doubt that Mitt will do really well tonight. We'll all be watching @MittRomney. _E_\nIn a new poll a majority of people felt the president knowingly lied about health care pledge. Who are the fools who don't think he lied? _E_\nOur amazing golf course @TrumpScotland __HTTP__ _E_\nTrump: US Must Get Tougher Because China Is 'Eating Our Lunch' __HTTP__ via Moneynews @Newsmax_Media _E_\nRT @BretEastonEllis: Just back from a dinner in West Hollywood: shocked the majority of the table was voting for Trump but they would never... _E_\nWith $250M of renovations Trump Int'l DC's 250 expansive guest rooms will be DC's top offering of amenities & views __HTTP__ _E_\nDishonest @politico just called to say that none of the polls including Fox NBC CNN Zogby & Morning Consult matter. Serious haters. _E_\n.....you keep forgetting to mention the fragrance Success ! _E_\nIdentify your goals. Know precisely what you want to achieve study the best people in your fieldand then plan the best route for success. _E_\nActually I was very nice to Jimmy Carter during my standing room only (& standing ovation) speech for CPAC stated better Pres. than Obama! _E_\n#ThankYouTour2016 Tonight Orlando Florida Tickets: __HTTP__ Mobile AlabamaTickets:... __HTTP__ _E_\nTHANK YOU ARIZONA! Get out and #VoteTrump on Tuesday! #AZPrimary #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n.@washingtonpost is going out of its way to tell failing candidates how to beat Donald Trump.The Post doesn't get that I'm good at winning! _E_\nIn Crooked Hillary's telepromter speech yesterday she made up things that I said or believe but have no basis in fact. Not honest! 
_E_\nHe @BarackObama wants record high gas prices drilling permits on federal land are declining under his regime __HTTP__ _E_\nTainted (no very dishonest?) FBI \"agent's role in Clinton probe under review.\" Led Clinton Email probe. @foxandfriends Clinton money going to wife of another FBI agent in charge. _E_\nWith all of the Crooked Hillary Clinton's foreign policy experience she has made so many mistakes and I mean real monsters! No more HRC. _E_\nA lot of people are concerned about which charity my $5M will be donated to. The onus is on Obama to first release his records. _E_\nIt's go time! See you at Trump Tower. I'm giving money away! #FundAnything _E_\nJoin me Thursday in Florida & Ohio!West Palm Beach FL at noon: __HTTP__ OH this 7:30pm: __HTTP__ _E_\nMexico has lost a brilliant finance minister and wonderful man who I know is highly respected by President Peña Nieto. _E_\nWhat about the undocumented immigrant with a record who killed the beautiful young women (in front of her father) in San Fran. Get smart! _E_\nI will be in Milwaukee Wisconsin tomorrow at 7pmE with @MELANIATRUMP. Join us! #WIPrimary #Trump2016 __HTTP__ _E_\nAmerica's competitors love @BarackObama. @MedvedevRussiaE says @BarackObama has been the best 3 years for Russia __HTTP__ _E_\nJust as I predicted while Obama lifted sanctions 18 months ago Iran cheated & increased its nuclear fuel by 20%. We must DOUBLE sanctions! _E_\nSo many in the African American community are doing so badly poverty and crime way up employment and jobs way down: I will fix it promise _E_\nSome exciting news the newest acquisition of Trump Golf Trump National Golf Club Charlotte NC formerly The (cont) __HTTP__ _E_\nVia @ABC by @jonkarl & @JordynPhelps: Donald Trump Says Jeb Bush is the 'Last Thing We Need' __HTTP__ _E_\nOur economy is at a standstill. Some are even predicting a possible double dip. We need to elect @MittRomney in November. 
_E_\nVia @ArutzSheva_En by Moshe Cohen: \"Donald Trump: French Gun Control Allowed Terrorists to Succeed\" __HTTP__ _E_\nDo not look for approval except for the consciousness of doing your best. Andrew Carnegie _E_\nI never said that China was in the bad TPP trade deal but that China would come in the back door at a later date. @CNN @FoxBusiness _E_\nGoing to Columbus Ohio today for a tremendous rally of thousands. The silent majority is no longer silent! _E_\n.@TrumpNewYork is the only Forbes 5 Star & 5 Diamond hotel with a 5 Star & Five Diamond restaurant in NYC __HTTP__ _E_\nWhat took so long to catch only 1 of the Benghazi terrorists? Especially after the killer has been taunting the US in the press f/2 yrs. _E_\nThank you Wisconsin! My Administration will be focused on three very important words: jobs jobs jobs! Watch:... __HTTP__ _E_\nAs everybody knows but the haters & losers refuse to acknowledge I do not wear a \"wig.\" My hair may not be perfect but it's mine. _E_\nWhile @BetteMidler is an extremely unattractive woman I refuse to say that because I always insist on being politically correct. _E_\nRe Negotiation: Think about what the other side wants. Know where they're coming from. View any conflict as an opportunity. Be flexible. _E_\nGreat speech on China by @PaulRyanVP yesterday where he explains why China is treating @BarackObama like a Doormat __HTTP__ _E_\nThe ObamaCare disaster will increase the amount of uninsured __HTTP__ What is the point of this Trillion $ monstrosity? _E_\nExcited to see that @AnnDRomney has joined twitter. Melania and I are looking forward to hosting her next week (cont) __HTTP__ _E_\nVia @digitalspyus: Donald Trump to Lord Sugar: 'Drop to your knees and thank me' __HTTP__ _E_\nA person who never made a mistake never tried anything new. Albert Einstein _E_\nThank you @USNavy! 
#USA __HTTP__ _E_\nVia @NRO: Donald Trump Eyes 2016 by @woodruffbets __HTTP__ _E_\n.@AlexSalmond I hope you played well at Royal Aberdeen but u must admit the windmill hovering over hole 14 is disgusting & inappropriate. _E_\nIt's freezing outside where the hell is global warming ?? _E_\nMy Administration will follow two simple rules: __HTTP__ _E_\nGreat new poll thank you America!#Trump2016 #ImWithYou __HTTP__ _E_\nSecy John Kerry has a tough job but he looks so totally lost negotiating w/ those characters who are cleaning his clock. Sad to watch... _E_\nVia @qctimes by @EdTibbetts: \"Trump: U.S. getting beat up\" __HTTP__ _E_\nIt was great to have Governor @RicardoRossello of #PuertoRico with us at the @WhiteHouse today. We are with you! #PRStrong __HTTP__ _E_\nGot to do something about these missing chidlren grabbed by the perverts. Too many incidents fast trial death penalty. _E_\n.@IvankaTrump looks like a movie star from the days of glamour and beauty. #CelebApprentice _E_\nRT @foxnation: Grateful Syrians React To @realDonaldTrump Strike: 'I'll Name My Son Donald' __HTTP__ #SyrianStrikes _E_\nMy @CPACnews speech is scheduled Friday at 8:45AM in the Potomac Ballroom. Will also be telecast live on CSPAN & cable news networks. _E_\nDo not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_\nVia @BreitbartFeed why doesn't @BarackObama release his original book proposal which says he was born in Kenya? __HTTP__ _E_\nEric's Sept. 14th event will be held at Trump National Golf Club Westchester. __HTTP__ _E_\nMy @gretawire interview discussing my @MittRomney fundraiser in Trump Int'l Hotel Las Vegas and the state of the (cont) __HTTP__ _E_\n.@Mediaite: Donald Trump Trashes @michellemalkin On Twitter:You're A 'Dummy' & 'Were Born Stupid' __HTTP__ @AndrewKirell _E_\nSuch amazing reporting on unmasking and the crooked scheme against us by @foxandfriends. Spied on before nomination. The real story. 
_E_\nSomeone just asked me who is my favorite Donald Trump impersonator? __HTTP__ _E_\n.@foxandfriends We are not looking to fill all of those positions. Don't need many of them reduce size of government. @IngrahamAngle _E_\nThen how come gasoline is hitting record high prices? _E_\nThe @Yankees must re negotiate @AROD's contract. He is not the same player without drugs. _E_\nI am at Trump National Doral in Miami as the best golfers in the World start arriving for the World Golf Championship (Cadillac). A big week _E_\nJust hit a million on Facebook __HTTP__ _E_\nObama Care stole more then $500M from Medicare. _E_\nThank you @tweetbypremier for selecting the Ocean View Suite @Trump_Ireland as one of your top 10 suites __HTTP__ _E_\nCongratulations Treasury Secretary Steven Mnuchin! #ICYMI watch here: __HTTP__ __HTTP__ _E_\nThe @AmSpec article Shakedown Schneiderman about NY State lightweight @AGSchneiderman is amazing. __HTTP__ _E_\nI hope everyone enjoyed Palm Sunday! _E_\nWith the exception of cheating Bernie out of the nom the Dems have always proven to be far more loyal to each other than the Republicans! _E_\nThe Democrats will make a deal with me on healthcare as soon as ObamaCare folds not long. Do not worry we are in very good shape! _E_\nOn stunning Aberdeenshire coastline @TrumpScotland features a classic Scottish link threaded through the dunes __HTTP__ _E_\nOpportunity is missed by most people because it is dressed in overalls and looks like work. Thomas Edison _E_\nWe should not bail out any of the European countries or banks. _E_\nRT @KellyannePolls: After a decent first debate @HillaryClinton is back to form: pedantic lawyerly technocratic (woefully untruthful) r... _E_\n.@MittRomney's @RNC convention came in over $3M under budget. Barack's @DNC convention is over $10M in debt. What a surprise! _E_\nGive yourself a chance make every day a discovery. 
_E_\nMy interview which recently aired on CNBC's Squawk Box __HTTP__ _E_\n _E_\nOne of the country's dumbest newspapers—The Palm Beach Post should be put to sleep. It's dying. @pbpost _E_\nSuch amazing people in India. This trip is very enlightening! _E_\nEntrepreneurs: Take responsibility for yourself. It's a very empowering attitude. _E_\nKarl Rove lost GOP both Houses of Congress and the White House gave us Obama. _E_\nGo confidently in the direction of your dreams. Live the life you have imagined. Henry David Thoreau _E_\nSo many self righteous hypocrites. Watch their poll numbers and elections go down! _E_\nEntrepreneurs: There's nothing wrong with bringing your talents to the surface. Having an ego and acknowledging it is a healthy choice. _E_\nAs I have long been saying South Africa is a total and very dangerous mess. Just watch the evening news (when not talking weather). _E_\nCrooked Hillary called it totally wrong on BREXIT she went with Obama and now she is saying we need her to lead. She would be a disaster _E_\nSigning my tax return.... __HTTP__ _E_\nPeople buy deals & immediately put them into bankruptcy in order to make better deals... _E_\nJoin me today Nov 3rd in #TrumpTowerNYC at noon. I'll be signing copies of my new book CRIPPLED AMERICA. Don't miss it! _E_\nBe a yardstick of quality. Some people aren't used to an environment where excellence is expected. Steve Jobs _E_\nI won every debate so far according to all debate polls including @DRUDGE_REPORT @TIME @Slate and more. Too bad dopey @megynkelly lies! _E_\nDon't blindly pursue a career that others suggest or insist is right for you. It may be worth taking a pay cut for a job you love. _E_\n\"Donald Trump: I've made up my mind on 2016\" __HTTP__ via @msnbc by @janestreet _E_\nCongratulations to America's new Secretary of @HHSGov Alex Azar! __HTTP__ _E_\nJust as I predicted @BarackObama is preparing a possible attack on Iran right before November. 
__HTTP__ _E_\nCon Ed has won its suit against the Ground Zero Mosque developers __HTTP__ The mosque is never going up. _E_\nObama's policies have led to food stamp rolls growing 75X faster than job production __HTTP__ We can't afford 4 more years. _E_\nNo surprise that all the foreign countries are celebrating Obama's win. They love a weak America that they can rip off. _E_\nThere are no short cuts to any place worth going. Beverly Sills _E_\nAll of my Cabinet nominee are looking good and doing a great job. I want them to be themselves and express their own thoughts not mine! _E_\nMy wife @MELANIATRUMP and my children will be featured on @FoxNews with @Greta 7pmE. Enjoy!#MeetTheTrumps #Trump2016 _E_\nObamacare premiums increasing 33% in Pennsylvania a complete disaster. It must be repealed and replaced!... __HTTP__ _E_\n.@brandonhardest Love what you do and work hard. _E_\nJust as I said last October census workers cooked the job numbers for Obama right before the election __HTTP__ _E_\nAnd happy to welcome @ArsenioHall back as an advisor— he will have his own show and is doing great. #CelebApprentice _E_\nJust did @OReillyFactor. Will be back on at 11pm on @FoxNews. _E_\nSurprised @Eagles signed Michael Vick yesterday to be their 2013 QB. Vick is talented but brittle & probably won't last long. _E_\nJust Introduced at #NCGOPcon as the country's highest paid speaker. Told the record crowd of 650 I am to be speaking here for free! _E_\nOn my way to Dayton Ohio. Will be there soon! _E_\nHillary's debate answer on delay: That is horrifying. That is not the way our democracy works. Been around for 240 years. We've had free _E_\nHAPPY 241st BIRTHDAY to the @USArmy! THANK YOU! __HTTP__ _E_\nMy @gretawire interview on @FoxNewsInsider \"Trump: 'Last Person I'd Want Negotiating for Me Is Obama'\" __HTTP__ _E_\nChina continues to be on the move both technologically and militarily. Obama is sitting by and watching. 
_E_\nToo bad I'll Have Another out of Belmont Stakes interest now way down. _E_\nIs it true the DNC would not allow the FBI access to check server or other equipment after learning it was hacked? Can that be possible? _E_\nJoin me this Wednesday in Phoenix Arizona at 6pm! #ImWithYouTickets: __HTTP__ __HTTP__ _E_\n'Trump Celebrates American Manufacturing Survey Showing Highest Level of Optimism in 20 Years' ... __HTTP__ _E_\nWe need a 21st century MERIT BASED immigration system. Chain migration and the visa lottery are outdated programs that hurt our economic and national security. __HTTP__ _E_\n#TBT With Barbara Walters on my helicopter going somewhere. __HTTP__ _E_\nIt takes guts to win! _E_\nI will be going to Puerto Rico on Tuesday with Melania. Will hopefully be able to stop at the U.S. Virgin Islands (people working hard). _E_\nThe new season of the Celebrity Apprentice is off to a great start last night it swept the 10 p.m. hour in every key demographic. _E_\nYou have to know when to call it quits and when to keep moving forward. Donald J. Trump __HTTP__ _E_\nLynne Ryan just read your great story in the NY Times I am proud of you. Thanks! __HTTP__ _E_\nWhen it comes to violent crime and if we are going to solve the problem we must stop being so politically correct must tell it like it is! _E_\nVia @PRNewswire: TRUMP HOTEL COLLECTION™ Announces Trump® International Hotel & Tower Baku __HTTP__ _E_\n__HTTP__ Lights... Camera....You're Fired! All new @apprenticenbc tonight at 8PM ET on NBC! _E_\nFrumpy and very dumb Gail Collins an editorial writer at The New York Times is so lucky to even have a job. Check her out incompetent! 
_E_\nVattenfall the promoter of the money losing wind farm plan in Aberdeen Scotland just took a loss of $4.6 billion after dumb European move _E_\nYou can't know it all yourself anyone who thinks that they do is destined for mediocrity.\" The Way To The Top _E_\n740 Park Avenue is being robbed all over the place we come down hard on thieves at Trump buildings. _E_\nWe mourn the horrifying terrorist attack in NYC. All of America is praying and grieving for the families who lost their precious loved ones. __HTTP__ _E_\nA great honor to easily finish FIRST in the @FoxNews poll tabulation even though some of my best polls were not used in determining winner! _E_\nWe will defend our people our nations and our civilization from all who dare to threaten our way of life...cont: __HTTP__ __HTTP__ _E_\nI hate to say it but the Republican Convention was far more interesting (with a much more beautiful set) than the Democratic Convention! _E_\nI am honored that @BarackObama has featured my plane in one of his attack ads. It was made in America! _E_\nNow China 'calls in' US diplomats to lecture them on their illegal escapades. __HTTP__ The new reality. @BarackObama is weak. _E_\nThis month we celebrate the contributions of Asian Americans & Pacific Islanders that enrich our Nation. __HTTP__ _E_\nGreat advice from my father: Know everything you can about what you're doing. Fred C. Trump _E_\nHappy New Year to all my Jewish friends celebrating the holiday. _E_\nThe Oil Companies collude with OPEC to keep oil artificially overvalued. They need to be reigned in. _E_\nTHANK YOU Youngstown Ohio! I love you! Get out & #VoteTrump tomorrow. #Trump2016 __HTTP__ _E_\nThe Phoenix V.A. it has just been reported is in worse shape than ever before. The wait is horrendous and people are dying. I will fix it _E_\nRaffaele Sollecito was unfairly convicted. He didn't kill anyone. The Italian government should be ashamed. 
@Raffasolaries _E_\nThe Democrats want MASSIVE tax increases & soft crime producing borders.The Republicans want the biggest tax cut in history & the WALL! _E_\nLeading in the Bloomberg Iowa poll. Also my favorability numbers went up at a record almost unheard of clip. Thank you Iowa! _E_\nIf your enemies end up liking you it's because they beat you. You want their respect not their friendship. _E_\nI want to thank Steve Bannon for his service. He came to the campaign during my run against Crooked Hillary Clinton it was great! Thanks S _E_\nThe young intern who accidentally did a Retweet apologizes. _E_\nBig protest march in Colorado on Friday afternoon! Don't let the bosses take your vote! _E_\nCongress' greatest card against Obama is the power of the purse. Use it! _E_\nIt's not that I'm so smart it's just that I stay with problems longer. Albert Einstein _E_\nWhy gas prices will cost @BarackObama re election: pain at the pump not good for obama __HTTP__ _E_\nI gave away money. Go to __HTTP__ to see how I'm helping people. #FundAnything #Entrepreneurs #GiveBack _E_\nMitt Romney is a mixed up man who doesn't have a clue. No wonder he lost! _E_\nWatch me play both golf and baseball tonight on Donald J. Trump's Fabulous World of Golf 9PM ET on Golf Channel.. __HTTP__ _E_\nBecause of #FakeNews my people are not getting the credit they deserve for doing a great job. As seen here they are ALL doing a GREAT JOB! __HTTP__ _E_\nIt is time to take back our country and MAKE AMERICA GREAT AGAIN!#CaucusForTrump Video: __HTTP__ __HTTP__ _E_\nDonald Trump explains celebrity feuds: 'I speak the truth' __HTTP__ via @DigitalSpyUS _E_\nVia @WDesMoinesPatch by @DerekJ3031: \"@ShawnJohnson on @ApprenticeNBC\" __HTTP__ _E_\nWatched Sean Hannity last night a great guy. _E_\nHe thinks that the wealth you create belongs to the government @BarackObama doesn't respect the fact that the (cont) __HTTP__ _E_\nLooking for Father's Day gift? 
@Miamimagazine named the spa @TrumpDoral one of the best places for men to relax __HTTP__ _E_\nRT @marcorubio: Good #AfghanStrategy & excellent speech by @POTUS laying it out to the nation. _E_\nLIMITED EDITION signed copies of my book The Art of the Deal for your donation of $184 or more. Get YOURS today! __HTTP__ _E_\n.@BenSasse looks more like a gym rat than a U.S. Senator. How the hell did he ever get elected? @greta _E_\nWith respect to Iran we have all the cards they are scared stiff! I can't believe we aren't able to negotiate (cont) __HTTP__ _E_\nBig meeting today with Republican leadership concerning Tax Cuts and Healthcare. We are all pushing hard must get it right! _E_\nObama administration had 4 years to prepare for the ObamaCare rollout. And of course they failed miserably. _E_\nConvention speaker schedule to be released tomorrow. Let today be devoted to Crooked Hillary and the rigged system under which we live. _E_\nThe great Barbara Walters interviews Melania Trump and me on a Special Friday night at 10:00 on ABC.... __HTTP__ _E_\nThe President's speech was very combative toward Republicans—they have obviously not earned his respect! _E_\nOnly two weeks until we start shooting @CelebApprentice. We really have something amazing for the fans this year. _E_\nThe dealmaker is cunning secretive focused and never settles for less than he wants. The America We Deserve _E_\nEntrepreneurs: Identify your goals know precisely what you want to achieve. Then study the best people in your field and learn from them. _E_\nThank you Christian Broadcasting Network @TheBrodyFile @CBNNews __HTTP__ _E_\nThink BIG! You are going to be thinking anyway so you might as well think BIG! _E_\nThe Job on CBS the 15th copy of The Apprentice was just cancelled I love it! _E_\nLindsey Graham is all over T.V. much like failed 47% candidate Mitt Romney. These nasty angry jealous failures have ZERO credibility! _E_\nPeyton Manning should have passed on 3rd down! 
_E_\nStill a buyer's market. Residential home sales fall 7.1% in March. __HTTP__ Now is the time to buy property. _E_\nEvery on line poll Time Magazine Drudge etc. has me winning the debate. Thank you to Fox & Friends for so reporting! _E_\nIs Hillary really protecting women? __HTTP__ _E_\nMy interview from yesterday with #Apprentice Andy on @AmericaNowRadio __HTTP__ _E_\n.@antbaxter Dummythanks for increasing awareness of my big golf project in Aberdeen—sales are thru the roof & Aberdeen seeing big benefits. _E_\n.@MittRomney looks much stronger and much more Presidential! _E_\nI had a wonderful meeting with Likud Deputy Speaker of The Knesset @DannyDanon this past Friday in Trump Tower __HTTP__ #Israel _E_\nI am watching the Democrats trying to defend the you can keep you doctor you can keep your plan & premiums will go down ObamaCare lie. _E_\nIt was just determined that the woman who passed out at Obama's press conference had just seen what her new premiums would be! _E_\nSnowden is handing over to Russia a treasure trove of intel. Our politicians are incapable of dealing! _E_\nYou've got something unique to offer. Find out what it is. Ask yourself: What can I provide that does not yet exist? _E_\n'Kept me out of jail': Top DOJ official involved in Clinton probe represented her campaign chairman: __HTTP__ _E_\nToday is my birthday. My wish is for our country to be great and prosperous again. _E_\n.@thehill Your story about me & the carbon tax is absolutely incorrect—it is just the opposite. I will not support or endorse a carbon tax! _E_\nBefore you vote think: Obama wants to raise taxes @MittRomney wants to lower taxes need I say more! _E_\nTHANK YOU to all of the incredible HEROES in Texas. America is with you! #TexasStrong __HTTP__ _E_\nCongratulations to Connecticut's Erin Brady on being crowned the 2013 @MissUSA! America will be well represented in @MissUniverse! 
_E_\nWith millions of dollars of negative and phony ads against me by the establishment my numbers continue to go up. Can anyone explain this? _E_\nDeparting for Texas and Louisiana with @FLOTUS Melania right now @JBA_NAFW. We will see you soon. America is with you! __HTTP__ _E_\nOnly the Fake News Media and Trump enemies want me to stop using Social Media (110 million people). Only way for me to get the truth out! _E_\n\"Don't find fault find a remedy.\" Henry Ford _E_\nJoin me in Columbus Ohio tomorrow!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nPresident Obama said ISIL continues to shrink in an interview just hours before the horrible attack in Paris. He is just so bad! CHANGE. _E_\nPaul Teutul Sr. is a fantastic guy. Although I fired him on #CelebApprentice we will remain great friends. I love the bike he made for me. _E_\nFox News PollThank you New Hampshire! #FITN#Trump2016 __HTTP__ _E_\n.@foxandfriends Russia sent millions to Clinton Foundation _E_\nIn today's #trumpvlog I speak about Clint Eastwood the #DNC and Drew Peterson __HTTP__ _E_\n...There is also something appropriate about keeping him in the home of the horrible crime he committed. Should move fast. DEATH PENALTY! _E_\nJust as we won the Cold War in part by exposing the evils of communism and the virtues of free markets....Cont: __HTTP__ _E_\n...the Ninth Circuit which has a terrible record of being overturned (close to 80%). They used to call this judge shopping! Messy system. _E_\nSo much for Washington shutting down Strasburg they deserved to lose. _E_\nSurprising a future Nobel prize winner on today's @KatieShow: __HTTP__ _E_\n\"RUBIO'S GANG OF 8 BILL WOULD HAVE REWARDED SANCTUARY CITIES HARBORING ILLEGALS\" __HTTP__ Marco is a politician he flip flops! _E_\nBloggers like McKay Coppins & @BuzzFeed are true garbage with no credibility. Record setting crowds & speech not reported. @PiersMorgan _E_\nTo get momentum you must first focus on a specific goal with passion and intensity. 
_E_\nNew Gravis national poll just out 36%! Very nice! #MakeAmericaGreatAgain _E_\nHillary Clinton said that it is O.K. to ban Muslims from Israel by building a WALL but not O.K. to do so in the U.S. We must be vigilant! _E_\nHere's the solution on China: get tough. Slap a 25 percent tax on China's products if they don't set a real (cont) __HTTP__ _E_\nLightweight @AGSchneiderman is driving business out of NY so that he can get publicity for his failing political career. _E_\nLess than a week after we leave Iraq the country is already unraveling. We got nothing from the Iraqis and now (cont) __HTTP__ _E_\nMy campaign for president is $35000000 under budget I have spent very little (and am in 1st place).Now I will spend big in Iowa/N.H./S.C. _E_\nOur border is being breached daily by criminals. We must build a wall & deduct costs from Mexican foreign aid! __HTTP__ _E_\nMr. President you're entitled as the president to your own airplane and to your own house but not to your own facts. @MittRomney _E_\nMy @FoxNews interview with @TeamCavuto where I explain that we need to start using our own domestic energy resources. __HTTP__ _E_\n.@CelebApprentice was #1 on network TV last night in its time slot and easily won the 10 o'clock hour in all major demographics. _E_\nNow that Bush has wasted $120 million of special interest money on his failed campaign he says he would end super PACs. Sad! _E_\nThe great boxing promoter Don King just endorsed me. Nice! _E_\nVia @BreitbartNews: \"DONALD TRUMP: 'RICH PEOPLE DON'T LIKE ME'–POOR MIDDLE INCOME PEOPLE 'LIKE ME BEST'\" __HTTP__ _E_\nWow great news! I hear @EWErickson of Red State was fired like a dog. If you read his tweets you'll understand why. Just doesn't have IT! _E_\nI will be on @foxandfriends at 7:00 A.M. Will be talking about many things including The Apprentice! _E_\nDespite the constant negative press covfefe _E_\nGeorge also appeared on Saturday Night Live when I was guest host in 2004. A great time! 
#CelebApprentice _E_\nMelania and I offer our deepest condolences to the family of Otto Warmbier. Full statement: __HTTP__ __HTTP__ _E_\nIt is time to rebuild OUR country to bring back OUR jobs to restore OUR dreams & yes to put #AmericaFirst! TY O... __HTTP__ _E_\nKeep focused on your goals. Practice positive thinking. View any conflict as an opportunity look at the solution not the problem. _E_\nWatch my interview with Greta Van Susteren @Gretawire tonight at 10 p.m. on Fox News. _E_\nLooking forward to keynoting @bobvanderplaats' @theFAMiLYLEADER Leadership Summit. Tickets selling out __HTTP__ _E_\n.@OMAROSA as a cashier a big mistake by @BrandenRoderick. #CelebApprentice _E_\n#MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nMy @OraTV #Politicking interview w/@kingsthings on the govt. shutdown ObamaCare Putin 2016 and @TrumpDoral __HTTP__ _E_\nAnother great accolade for @TrumpGolf. Highly respected Golf Odyssey awarded @TrumpDoral Blue Monster with best redesign. Thank you! _E_\nRT @foxandfriends: FOX NEWS ALERT: Jihadis using religious visa to enter US experts warn (via @FoxFriendsFirst) __HTTP__ _E_\nMy @SquawkCNBC interview discussing #TRUMPTUESDAYS high ratings @ToddAkin's statement & @MittRomney's policies __HTTP__ _E_\nSet the bar high do the best you possibly can. Be focused disciplined and alert every single day. _E_\nJust a few days until I keynote at @bobvanderplaats' @theFAMiLYLEADER Leadership Summit in Iowa __HTTP__ Very exciting _E_\nAt the request of many and even though I expect it to be a very boring two hours I will be covering the Democrat Debate live on twitter! _E_\n'Clinton Campaign And Harry Reid Worked With New York Times To Smear State Dept Watchdog'Time to #DrainTheSwamp! __HTTP__ _E_\n.@realDonaldTrump is going to cut taxes BIG LEAGUE Crooked is going to raise taxes BIG LEAGUE! #DrainTheSwamp... 
__HTTP__ _E_\nMy @FoxNews interview with @gretawire discussing the GOP primary my 2012 options and why @BarackObama must lose __HTTP__ _E_\nThe Democrats have a corrupt political machine pushing crooked Hillary Clinton. We have Paul Ryan always fighting the Republican nominee! _E_\nI am happy to hear that Pres.Obama is considering giving Anna Wintour @voguemagazine an ambassadorship. She is a winner & really smart! _E_\nCrooked @club4growth has given up advertising in Iowa on me—remember they wanted my million dollars—I said no—total frauds! _E_\nMark my words a gallon of gas will be $5 during the summer. OPEC is ripping us off. There's nobody in our (cont) __HTTP__ _E_\nCrooked Hillary Clinton now blames everybody but herself refuses to say she was a terrible candidate. Hits Facebook & even Dems & DNC. _E_\nThank you Adam Levine The Federalist in interview on @foxandfriends \"Donald Trump is the greatest President our Country has ever seen.\" _E_\nRT @FoxNews: More than 1 million jobs added since @POTUS took office. __HTTP__ __HTTP__ _E_\nWord is I am doing very well in Michigan and Mississippi! Wow and with all that money spent against me! Will be going to Trump Jupiter now! _E_\nTo have a government we can afford we need to eliminate the tremendous waste clogging the system #TimeToGetTough _E_\nNo taxes the only good thing about DC Debt Deal. _E_\n\"Success is getting what you want. Happiness is wanting what you get.\" Dale Carnegie _E_\nRT @GovMikeHuckabee: Trump says the chaos in Chicago was a planned attack. But Hillary insists it was a spontaneous reaction to an internet... _E_\nI wonder why @BarackObama is now spending $8B to postpone Obamacare's Medicare Cuts until after the election? __HTTP__ _E_\nRT @DanScavino: On behalf of our next #POTUS & @TeamTrump #HappyNewYear AMERICA __HTTP__ __HTTP__ __HTTP__ _E_\nMost politicians would have gone to a meeting like the one Don jr attended in order to get info on an opponent. That's politics! 
_E_\nThe election is trending towards @MittRomney. Americans know we can't afford another 4 years of the Obama economic decline. _E_\nIt's that time of the year. @TrumpRink in Central Park is now open best rink in the world. __HTTP__ A landmark. _E_\nOur country is looking very bad right now! _E_\nOur deficits are caused by runaway spending not inadequate taxing. Washington does not have a revenue problem. _E_\nWednesday's debate is day one of the election. Over 70 million voters will be watching. _E_\nWhen will the US government finally classify China as a currency manipulator? China is robbing us blind and @BarackObama defends them. _E_\nThis is your land this is your home and it's your voice that matters the most. So speak up be heard and fight fight fight for the change you've been waiting for your entire life!MERRY CHRISTMAS and THANK YOU Pensacola Florida! __HTTP__ _E_\nMy son Donald did a good job last night. He was open transparent and innocent. This is the greatest Witch Hunt in political history. Sad! _E_\nThe $10 billion (net worth) is AFTER all debt and liabilities. So simple to understand but @CNN & @CNNPolitics is just plain dumb! _E_\nTo all struggling young entrepreneurs stay positive in this tough climate and keep looking for good deals. They are out there. _E_\n...Brande was also smart in not bringing Omarosa to the boardroom. _E_\nHappy to have passed 800000 followers. Looking forward to passing 1M sooner than later. _E_\nThank you Louisiana! Get out & vote for John Kennedy tomorrow. Electing Kennedy will help enact our agenda on behal... __HTTP__ _E_\n\"To be successful you must become very good at finding creative solutions to what appear to be impossible problems.\" – Think BIG _E_\nOn Greta 87% of the people said they would not watch the debate if I'm not in it. Wow what an honor! 
_E_\nMany people think that WM23 @WrestleMania \"the battle of the billionaires\" was the greatest of all time—set all records _E_\nThe crowd in Ohio was amazing last night broke all records. We all had a great time in a great State. Will be back soon! _E_\nOn behalf of all Americans I want to wish Jewish families many blessings in the New Year. __HTTP__ __HTTP__ _E_\n\"Hard work is my personal method for financial success. You can do it too.\" Think Big _E_\nI am the only one who can fix this. Very sad. Will not happen under my watch! #MakeAmericaGreatAgain __HTTP__ _E_\n‎In anticipation of ObamaCare part time jobs are surging & full time jobs are falling and becoming scarce __HTTP__ _E_\nNo better place to celebrate New Year's Eve than @TrumpSoHo the most elite hotel in downtown NYC __HTTP__ _E_\nRT @robertjeffress: Honored to pray for my friend @realDonaldTrump at tonight's Dallas rally. #TrumpDallas c: @DanScavino __HTTP__ _E_\nUpstate New York is suffering with record unemployment. Fracking is the answer. Frack now and Frack fast! _E_\nWow FBI confirms report that James Comey drafted letter exonerating Crooked Hillary Clinton long before investigation was complete. Many.. _E_\nResponse to Hillary Clinton __HTTP__ _E_\nLeaving for New York City and meetings on military purchases and trade. _E_\nIs @karlrove incompetent? 400 million dollars down the drain and not 1 victory! _E_\n#TrumpVlog Obama should be ashamed! __HTTP__ _E_\nChina and Saudi Arabia recently struck a deal which is the largest expansion by any oil company in the world (cont) __HTTP__ _E_\n.@GoAngelo—the next time you have a rally @Macy's try getting 12 people instead of 11—it would be much more effective! _E_\nStill a buyer's market. Home prices are dropping mortgages are low. Now is the time to take advantage for your gain. 
__HTTP__ _E_\nOnce again @BarackObama's speech at @AIPAC yesterday proved that he is more concerned about containing @Israel (cont) __HTTP__ _E_\nThank you to everybody for your wonderful comments on my debate performance it was a lot of fun! Today I will be speaking in Reno Nevada. _E_\nIf you love your work the difficulties will be balanced out by the enjoyment. Think Big _E_\nThe difference between @MittRomney and @BarackObama's campaign promises to @Israel is that Mitt will actually keep all of his. _E_\nThanks. __HTTP__ _E_\n.@RuthMarcus of the @washingtonpost was terrible today on Face The Nation.No focus poor level of concentration but correct on Hillary lying _E_\nI hope all workers demand that their @Teamsters reps endorse Donald J. Trump. Nobody knows jobs like I do! Don't let them sell you out! _E_\nCongratulations to @DianeSawyer on her big ratings win for the evening news. Diane is a spectacular person. _E_\nMexico's totally corrupt gov't looks horrible with El Chapo's escape—totally corrupt. U.S. paid them $3 billion. _E_\nGreat bilateral meeting with Prime Minister Theresa May of the United Kingdom affirming the special relationship and our commitment to work together on key national security challenges and economic opportunities. #WEF18 __HTTP__ _E_\nSheldon Adelson is looking to give big dollars to Rubio because he feels he can mold him into his perfect little puppet. I agree! _E_\nJust landing in Knoxville Tennessee! Massive crowd expected! Will all have a great time despite serious subject matter. _E_\nHealthy young child goes to doctor gets pumped with massive shot of many vaccines doesn't feel good and changes AUTISM. Many such cases! _E_\nIf Ted Cruz is so opposed to gay marriage why did he accept money from people who espouse gay marriage? _E_\nCowards die many times before their actual deaths. Caesar _E_\nDonald Trump reads Top Ten Financial Tips on Late Show with David Letterman: __HTTP__ Very funny! 
_E_\n...While I fully agree it is not politically correct! __HTTP__ _E_\nPM @David_Cameron should be run out of office for spending so much of England's money to subsidize windfarms in Scotland. _E_\nEbola is much easier to transmit than the CDC and government representatives are admitting. Spreading all over Africa and fast. Stop flights _E_\nPlease keep your thoughts & prayers with Melissa Young Miss Wisconsin 2005. __HTTP__ _E_\nGet out and vote! I am your voice and I will fight for you! We will make America great again! __HTTP__ _E_\nNotice that illegal immigrants will be given ObamaCare and free college tuition but nothing has been mentioned about our VETERANS #DemDebate _E_\nI will be on @foxandfriends at 7:00 A.M. Enjoy! _E_\nFACT on \"red line\" in Syria: HRC I wasn't there. Fact: line drawn in Aug '12. HRC Secy of State til Feb '13. __HTTP__ _E_\nRT @foxandfriends: FOX NEWS ALERT: ISIS claims responsibility for hostage siege in Melbourne Australia that killed 1 person and injured 3... _E_\nObama's administration is now openly admitting it expects US credit downgraded again __HTTP__ Thanks for letting us know now _E_\nI was invited by Caroline Wozniacki to sit with her family in her special box during her match at the U.S. Open yesterday. She's fantastic! _E_\nWho do you like of the final two? #CelebApprentice __HTTP__ _E_\nMy @IngrahamAngle interview on the border crisis USMC Tahmooressi & my fight for the American flag __HTTP__ (15:00 mark) _E_\nMy friend @GovChristie called it @MittRomney recast the race. _E_\nTry to develop a tempo when you're working momentum is something you have to work at to maintain & is an important element of success. _E_\n.@TheBrodyFile: Trump's appeal to evangelicals is real #Trump2016 __HTTP__ __HTTP__ _E_\nMy bestselling book from last April Think Like a Champion is now available in paperback. It's inspiring entertaining and a great read. 
_E_\n....Some of those they are harshly treating have been \"milking\" their country for years! _E_\nThank you for the incredible support Maryland! This is a movement!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nRemember I was the one who said attack the oil (ISIS source of wealth) a long time ago. Everyone scoffed now they're attacking the oil. _E_\nBe in Turnberry on Thurs AM for start of Women's British Open one of world's great golf tournaments. Back soon to #MakeAmericaGreatAgain! _E_\nTremendous day in Massachusetts and Maine. Thank you to everyone for making it so special! _E_\nThank you to @DailyTelegraph reviewer @NeilMidgley who stated 'You've Been Trumped' was so biased in favour of the protesters... _E_\nPuerto Rico being hit hard by new monster Hurricane. Be careful our hearts are with you will be there to help! _E_\nJohn Foust is a liberal who supports ObamaCare and opposes Ebola travel ban. Send Conservative @BarbaraComstock to Congress! _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_\nLittle Marco Rubio is just another Washington D.C. politician that is all talk and no action. #RobotRubio __HTTP__ _E_\nJust received huge applause when I said Berghdal should be sent back to Afghanistan! @SRQRepublicans speech is sold out with record crowd. _E_\nReport out that Obama Campaign paid $972000 to Fusion GPS. The firm also got $12400000 (really?) from DNC. Nobody knows who OK'd! _E_\nLeadership is the capacity to translate vision into reality. Warren G. Bennis _E_\n.@ABFalecbaldwin Alec it's not science it's a con read the e mails. _E_\nDon't sell yourself short on something that is important. Today is just the beginning. Think Like a Champion _E_\nI will represent our country well and fight for its interests! Fake News Media will never cover me accurately but who cares! We will #MAGA! _E_\nI wonder if when Secy. 
Kerry goes to Iraq and Afghanistan he pushes hard for them to look at GLOBAL WARMING and study the carbon footprint? _E_\nSo many problems in the U.S. and leadership that is hopeless...and now on top of everything else we just hit $18 trillion in debt! _E_\nIn politics and sometimes in life FRIENDS COME AND GO BUT ENEMIES ACCUMULATE! _E_\nSunday night at 9 PM EST will be re run of last week's episode of Celebrity @ApprenticeNBC followed by new episode at 10 PM. _E_\n.@serenawilliams we look forward to being with you a truly great champion tomorrow at Trump National D.C. for the Tennis Center dedication _E_\nOur infrastructure plan has been put forward and has received great reviews by everyone except of course the Democrats. After many years we have taken care of our Military now we have to fix our roads bridges tunnels airports and more. Bipartisan make deal Dems? _E_\nTexas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow! _E_\n.@danielhalper Great job on @CNN today. Very wise indeed! _E_\nLeaving for Albany New York now massive crowd expected. Very exciting! _E_\nDon't give up Republican Senators the World is watching: Repeal & Replace...and go to 51 votes (nuke option) get Cross State Lines & more. _E_\nthe American people. I have no doubt that we will together MAKE AMERICA GREAT AGAIN! _E_\nJoin @ericbolling to get @vanessariddle to 100k followers. Beautiful girl with stage 4 cancer. __HTTP__ _E_\nNew polls join the MOVEMENT today. __HTTP__ #ImWithYou __HTTP__ _E_\nThank you Florida a MOVEMENT that has never been seen before and will never be seen again. Lets get out &... __HTTP__ _E_\n...9 months than this Administration. Over 50 Legislation approvals massive regulation cuts energy freedom pipelines border security.... _E_\nNumerous patriots will be coming to Bedminster today as I continue to fill out the various positions necessary to MAKE AMERICA GREAT AGAIN! 
_E_\nRESPONSE TO THE LIES OF SENATOR CRUZ: __HTTP__ #VoteTrumpSC _E_\nShould have settled ... Ft Lauderdale plaintiffs must pay me close to $400k in legal fees after Trump trial victory. _E_\nThe border is wide open for cartels & terrorists. Secure our border now. Build a massive wall & deduct the costs from Mexican foreign aid! _E_\nGoing to a Cabinet Meeting (tele conference) at 11:00 A.M. on #Harvey. Even experts have said they've never seen one like this! _E_\nIf we are going to continue to be stupid and go into Syria (watch Russia) as they say in the movies SHOOT FIRST AND TALK LATER! _E_\nMy official #MakeAmericaGreatAgain hat is now available online. To shop please visit __HTTP__ it is selling fast! _E_\nAmerica is going to build again. Under budget and ahead of schedule. Time to put #AmericaFirst! #InfrastructureWeek... __HTTP__ _E_\nHeed the advice of @FLGovScott! If you're in an evacuation zone you need to get to a shelter...there's not many hours left. Gov. Scott __HTTP__ _E_\nVia World Tribune The elites' problem with Donald Trump: He's not for sale by Jeffrey T. Kuhner __HTTP__ _E_\nWOW! Thank you Massachusetts! See you soon. #VoteTrumpMA __HTTP__ _E_\nThank you Mark. #GOPDebate __HTTP__ _E_\nThe U.S. must immediately stop all flights from EBOLA infected countries or the plague will start and spread inside our borders. Act fast! _E_\nCongrats to Pres.Obama on having 3 of @washingtonpost's \"biggest Pinocchios of the year\" __HTTP__ Great accomplishment! _E_\nCredible Source on 9 11 Muslim Celebrations: FBI __HTTP__ via @WKRG _E_\nWhen Strasburg leaves in a couple of years under free agency Washington will say what were we doing . _E_\n.@chucktodd is a nice guy but just hopeless. He knows so little about politics and in particular winning! I fixed his rating problem. _E_\nSad only 36% think America's best days are ahead while 49% believe they are in the past __HTTP__ We can & must do better. 
_E_\nWatch my interview with Greta Van Susteren @gretawire tonight on Fox News at 10 p.m. _E_\nMERRY CHRISTMAS!!! __HTTP__ _E_\nDonald Trump: Anna Wintour Ambassadorship Would Be 'A Favor To The Country' __HTTP__ via @mediaite _E_\nThis is how it starts. Obama is now threatening to use an Executive Order for gun control __HTTP__ Welcome to his 2nd term. _E_\nWill be leaving for Missouri soon for a speech on tax cuts and tax reform so badly needed! _E_\nI'm giving away money go to __HTTP__ . Take it from me! Proud of the #FundAnything team. _E_\n\"Obstacles are those frightful things you see when you take your eyes off your goal.\" Henry Ford _E_\nWhat They Are Saying About @realDonaldTrump's GREAT Debate and @HillaryClinton's Bad Performance... __HTTP__ _E_\nCongrats to fantastic All Star @ApprenticeNBC celebrity & illusionist @pennjillette on being honored at 2013 Hollywood Walk of Fame! _E_\nSomeone should inform @CNN that despite spending millions of $'s on graphics it is not the Democratic Debate rather the Democrat (s) D! _E_\nI will be on @cbs @60minutes this Sunday. A great honor hope you enjoy it. _E_\nThe Republican Party must get tougher and smarter and fast or it will go down to a very big defeat just like the last two times! _E_\n.@Neilyoung's song \"Rockin' In The Free World\" was just one of 10 songs used as background music. Didn't love it anyway. _E_\nMust read quote by @EricTrump in @CNNMoney article \"Builders race to develop sky high condo buildings\" __HTTP__ _E_\nSwisher should have caught ball in right field last night. _E_\nThank you Iowa! Great night see you soon! #Trump2016 __HTTP__ _E_\nIt's snowing & freezing in NYC. What the hell ever happened to global warming? _E_\nPress Conference at Glasgow Prestwick Airport this Friday Nov. 14 at 11 AM with Donald J. Trump & Mr. 
Iain Cochrane __HTTP__ _E_\nWhat a foolish statement by @davidaxelrod he said that a @marcorubio VP pick would 'insult' Hispanics __HTTP__ _E_\nObama through his cronies said the Keysyone pipeline was not political how much can one man lie about even the most obvious things? _E_\nThank you for your support Greensboro North Carolina. Next stop Charlotte! #MAGA __HTTP__ __HTTP__ _E_\nAnother must read from Jeffrey Lord @amspec: \"Rove Email Leaks: Ideological War Opens in GOP\" __HTTP__ _E_\n.@DennisRodman is always hard to miss especially when dressed in silver finery. But not sure about the silver lipstick. #CelebApprentice _E_\nTonight is the Apprentice finale and it's a fantastic episode in every way with the great Liza Minnelli performing and a new Apprentice! _E_\nCrooked Hillary just took a major ad of me playing golf at Turnberry. Shows me hitting shot but I never did = lie! Was there to support son _E_\nThe economy won't fully recover until @ObamaCare is fully repealed. It is a job killer! _E_\nThere is. __HTTP__ _E_\nFact – all the countries complaining about us spying on them spy on us. They just don't get caught stupid! _E_\nI will take care of the Veterans who have served this country so bravely.#ThankAVet Video: __HTTP__ __HTTP__ _E_\nChina's military buildup is a major threat to the Free World. We must remain resolute and maintain our national defense at all costs. _E_\nThey say that if I participated in last night's Fox debate they would have had 12 million more & would have broken the all time record. _E_\nI would not sign Graham Cassidy if it did not include coverage of pre existing conditions. It does! A great Bill. Repeal & Replace. _E_\n.@NicolleDWallace is really hurting @TheView. She is boring predictable and has zero television it show no longer has ratings dying! _E_\nI want to applaud the many protestors in Boston who are speaking out against bigotry and hate. Our country will soon come together as one! 
_E_\nI am allowing Japan & South Korea to buy a substantially increased amount of highly sophisticated military equipment from the United States. _E_\nYou have to scratch your head when the president spends the last week talking about saving Big Bird. @MittRomney _E_\nIt was just announced that I will be hosting Saturday Night Live on Nov. 7th look forward to it! __HTTP__ _E_\nIn addition to winning the Electoral College in a landslide I won the popular vote if you deduct the millions of people who voted illegally _E_\n.@JebBush has spent $63000000 and is at the bottom of the polls. I have spent almost nothing and am at the top. WIN! @hughhewitt _E_\nLooking forward to Sunday's speech in the ExCel Centre. __HTTP__ _E_\nReckless @BarackObama is projecting $1.2T deficit from 2012 budget & a projected $25.4T debt in a decade __HTTP__ _E_\nIt is terrible that @BarackObama did not appoint an independent counsel to investigate the national security leaks. No accountability. _E_\n\"Pay attention to the small numbers in your finances such as percentages and cents... _E_\nThank you to @LOUDOBBS for giving the first six months of the Trump Administration an A+. S.C.reg cuttingStock M jobsborder etc. = TRUE! _E_\nSometimes when you innovate you make mistakes. It is best to admit them quickly and get on with other innovations. Steve Jobs _E_\nSmart move by @BarackObama having Pres. Bill Clinton deliver the @DNC convention keynote. _E_\nLAST thing the Make America Great Again Agenda needs is a Liberal Democrat in Senate where we have so little margin for victory already. The Pelosi/Schumer Puppet Jones would vote against us 100% of the time. He's bad on Crime Life Border Vets Guns & Military. VOTE ROY MOORE! _E_\nMy appearance on @foxandfriends from today.... __HTTP__ _E_\nReuters polling just out thank you!#MakeAmericaGreatAgain __HTTP__ _E_\n.@cher I don't wear a \"rug\"—it's mine. And I promise not to talk about your massive plastic surgeries that didn't work. 
_E_\n#TBT It is great being part of Home Alone 2 a holiday staple. __HTTP__ _E_\nAmerican professors were in Tehran for an Occupy Wall Street Conference __HTTP__ @BarackObama's diplomatic initiative?!?! _E_\nThe water damage to NYC is amazing. The winds were bad but the water was worse. _E_\n.@mercedesschlapp thank you so much for your kind words on television fantastic job and greatly appreciated! _E_\nEntrepreneurs: Set the bar high and resolve to be bigger than your problems. Who's the boss? _E_\nHe is out of real solutions @BarackObama's job bill is nothing more than a tax increase. _E_\nDo you notice we are not having a gun debate right now? That's because they used knives and a truck! _E_\n\"Golf is deceptively simple and endlessly complicated. It satisfies the soul and frustrates the intellect. (cont) __HTTP__ _E_\nThe U.S. will invite El Chapo the Mexican drug lord who just escaped prison to become a U.S. citizen because our leaders can't say no! _E_\nIf I'd started in business thinking I knew everything I'd have been sunk before I got started. Think Like a Champion _E_\nNew Government data by the Center for Immigration Studies shows more than 3M new legal & illegal immigrants settled.. __HTTP__ _E_\nSean's interview with Bob Woodward on @hannityshow was very interesting Woodward was great. __HTTP__ _E_\nIn the upcoming New Year we will focus like never before if we do that we will have complete and total VICTORY in all we do! _E_\nI loved beating John Kasich in the debates but it was easy—he came in dead last! _E_\nCheck it out 2nd video on Lying Crooked Hillary is now online! Watch it here: __HTTP__ #CrookedHillary #Trump2016 _E_\nEntrepreneurs: Keep your focus and keep your momentum. Listen apply and move forward. Set the standard! _E_\nWas with @jacknicklaus yesterday great golfer great architect great guy! _E_\nThank you Pittsburgh Pennsylvania!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThe Republicans better be careful. 
Obama is out to destroy them! _E_\nTelevision ratings for @nbcsnl Saturday Night Live just came out and they were great the best since 2011. Very few protesters! _E_\nFailure is simply the opportunity to begin again this time more intelligently. Henry Ford _E_\nSelective memory @BarackObama says that he forgets the recession __HTTP__ Maybe that's why he is forgetting to create jobs. _E_\nDo we really need another Bush in the White House we have had enough of them. __HTTP__ _E_\nIt's Tuesday. I wonder how much money @HuffPost lost today great purchase AOL _E_\nOne of Obama's greatest failures will be his legacy of making millions completely dependent on government handouts not work. _E_\nAm leaving now for Florida to see our GREAT first responders and to thank the U.S. Coast Guard FEMA etc. A real disaster much work to do! _E_\nStop the EBOLA patients from entering the U.S. Treat them at the highest level over there. THE UNITED STATES HAS ENOUGH PROBLEMS! _E_\nThe new Pope is a humble man very much like me which probably explains why I like him so much! _E_\nChina is our enemy. It's time we start acting like it and if we do our job correctly China will gain a whole (cont) __HTTP__ _E_\nAfter two days of very productive talks Prime Minister Abe is heading back to Japan. L _E_\nMitt Romney must start congratulating the Navy Seals and military on Bin Laden's killing not the President. _E_\nWe should not allow @Chrysler to move @Jeep jobs to China after they said they wouldn't stay tuned! _E_\n\"Score one for the Donald in his battle with @AGSchneiderman.\" __HTTP__ _E_\nI still say Te'o did this in order to get sympathy for the Heisman vote—thankfully he did not win. _E_\nJust returned from Pensacola Florida where the crowd was incredible. _E_\nThey should have allowed applause during the TRIBUTE to the departed Really bad production. Bette Midler sucked! #Oscars _E_\nAmerica will THINK BIG once again. 
We will inspire millions of children to carry on the proud tradition of American... __HTTP__ _E_\nOn Monday ObamaCare kicks in with all goodies of 300% increased premiums higher taxes and part time replacement employees. _E_\nHillary Clinton should ask why the Democrat pols in Atlantic City made all the wrong moves Convention Center Airport and destroyed City _E_\n.@VenueMagazine_ highlights the opening of @TrumpDoral's brand new #RedTiger course: __HTTP__ _E_\nGreat new ad from @MittRomney titled Nothing's Free __HTTP__ detailing both the high costs and taxes of ObamaCare. _E_\nGood news out of the House with the passing of 'No Sanctuary for Criminals Act.' Hopefully Senate will follow. _E_\nSadly I'm probably helping @billmaher's lowly rated show—but charity will benefit by $5 million so it's worth it. _E_\n#sweepstweet @DonaldJTrumpJr and @EricTrump have the eyes and ears for total surveillance I wonder where they got that from? _E_\nRT @DRUDGE_REPORT: LIMBAUGH: By not showing he's owning entire event... __HTTP__ _E_\nThe #MarchForLife is so important. To all of you marching you have my full support! _E_\nJust watched @Patriots Bill Belichick's news conference. He did a great job—smart concise truthful! _E_\n...I trounced him in ratings & Letterman beat @jayleno last Thursday. Brian—are you irrelevant? _E_\nSugar: @Lord_Sugar—unlike you I own The Apprentice. You were never successful enough... _E_\nEntrepreneurs: Be tough be smart be personable but don't take things personally. That's good business. _E_\nAberdeenshire coast is spectacular. Its historic value & wildlife will be tarnished if these wind turbines are built but they won't be! _E_\nI will be interviewed on @foxandfriends tomorrow at 7am. Enjoy! _E_\n\"If you have a crisis whether on a ship or wherever there are heroes who rise above it.\" Jerry Bruckheimer _E_\nThank you Montana! 
#Trump2016 __HTTP__ __HTTP__ _E_\n.@TrumpChicago's The Spa offers 5 star services w/ 12 treatment rooms & 53 spa guestrooms overlooking the skyline __HTTP__ _E_\nMarco Rubio will not win. Weak on illegal immigration strong on amnesty and has the appearance to killers of the world as a lightweight . _E_\n\"Compete with yourself to be the best you can be.\" – Think Like a Champion _E_\nAnother new Iowa poll just released. Thank you! #IACaucus #FITN __HTTP__ _E_\nJeb Bush has zero communication skills so he spent a fortune of special interest money on a Super Bowl ad. He is a weak candidate! _E_\n.@CGasparino Good seeing you. Keep up the great work never stop! _E_\nRodolfo Rosas Moya and his pals in Mexico owe me a lot of money. Disgusting & slow Mexico court system. Mexico is not a U.S. friend. _E_\nThe results are in. I killed Wolf Blitzer in our debate. I like Wolf but he went for an ambush! #wolfblitzercnn _E_\nMore than $500 million designated for Iraqi Army disappeared. Where is it? Our sad sad country what have we come to? _E_\nA gallon of gas is $3.523 today and has never before risen so high early in the year __HTTP__ The @BarackObama policy realized! _E_\nAs China is built on corporate espionage currency manipulation & cheap labor its economy is a ticking time bomb __HTTP__ _E_\nDopey Sugar @Lord_Sugar I never go silent. I was buying a major property in Florida a property worth more than you are! _E_\nGreat article in the @NewYorkPost by Ben Garrett Don't Blame Sandy on Global Warming __HTTP__ _E_\nWith @IvankaTrump and @EricTrump at the opening of the @GaryPlayer Villa at @TrumpDoral __HTTP__ _E_\nWith the two wacko perverts Spitzer and Weiner NYC politics has become a joke all over the world. _E_\nJames Holmes the Aurora Colorado guy who killed 12 people & injured 58 others is fighting hard to avoid the death penalty... _E_\nAmerica is mired in the longest job recession since the Great Depression. @MittRomney can get us out of it. 
(cont) __HTTP__ _E_\nLooking forward to a great weekend in Iowa! #IACaucus #CaucusForTrump Tickets: __HTTP__ __HTTP__ _E_\nVia @HuffPostPol by @_under_current: \"Donald Trump Will End Outsourcing If President\" __HTTP__ _E_\nJust gave a speech to the great men and women at Yokota Air Base in Tokyo Japan. Leaving to see Prime Minister Abe. __HTTP__ _E_\nThank you to @GaryVanSickle & Sports Illustrated @SInow for the really nice piece about me. March 17 2014 issue __HTTP__ _E_\nWill be at venue in wonderful South Carolina very soon. Big traffic back up tremendous crowd! Will be wild. _E_\n\"A true business only exists to solve a problem and to make life better.\" – Midas Touch _E_\nAmazing Obama speaks market goes DOWN Trump tells CNBC he's buying stock market goes UP should not be that way! _E_\nJoin me in Florida tomorrow! #MakeAmericaGreatAgain Daytona | 3pm __HTTP__ | 7pm __HTTP__ _E_\nMy book with @theRealKiyosaki Midas Touch is divided into five sections. The first is the thumb __HTTP__ _E_\nThe @TODAYshow refused to use their just in poll numbers where I have a massive lead but instead used @CNN numbers where my lead is smaller. _E_\nWe need a President who understands the economy @gallupnews has US unemployment at 8.2% in July up from 8% in June __HTTP__ _E_\nPeople are going crazy with my comments on Diet Coke (soda). Let's face it this stuff just doesn't work. It makes you hungry. _E_\n#ICYMI: I agree To all Americans I see you & I hear you. I am your voice. Vote to #DrainTheSwamp with me on 11/8.... __HTTP__ _E_\nWe need a balanced budget Amendment because Congress has no fiscal discipline. _E_\nOur national debt has grown by 30% and a gallon of gas has doubled so far under @BarackObama. He is a disaster. _E_\n.@EricTrumpFdn continues to do important work for @StJude Children's Research Hospital. I am very proud of @EricTrump's philanthropy. _E_\nI will be on CNN's State of the Union tomorrow morning at 9amE. 
__HTTP__ __HTTP__ _E_\nI have been asking Director Comey & others from the beginning of my administration to find the LEAKERS in the intelligence community..... _E_\nThe Trump Administration has terminated more UNNECESSARY Regulation in just twelve months than any other Administration has terminated during their full term in office no matter what the length. The good news is THERE IS MUCH MORE TO COME! _E_\nWindmills are a bigger safety hazard than either coal or oil __HTTP__ A 34% higher mortality rate than coal alone. Outrageous! _E_\nIs Jon Stewart a racist? See video __HTTP__ @thedailyshow _E_\nWhile @BarackObama criticizes the GOP budget his own party graded him with an F by voting down his budget in the House 414 0. _E_\nI will be doing Fox & Friends in 10 minutes at 7.00. Many things to talk about! ENJOY _E_\nPress conference after CPAC speech this morning was excellent lots of very professional reporters. _E_\nAlmost no news organizations are showing the satirical pictures. Gee I wonder why? The media is usually so brave! _E_\nI don't know what it is but I'm getting totally bored watching NFL football. Too many penalties and far too soft! T.V. off and back to work _E_\nThe habitual vacationer: @BarackObama has campaigned on our dime more than any previous president in history... (cont) __HTTP__ _E_\nPresident Obama looks absolutely exhausted in the Netherlands. He is not a natural leader was never ment to lead it is tough work for him _E_\nRemember Univision apologized! _E_\nI hope everyone is having a great Christmas then tomorrow it's back to work in order to Make America Great Again (which is happening faster than anyone anticipated)! _E_\nHitting the first ball at Trump International Dubai 272 right down the middle. __HTTP__ _E_\nToday our entire nation pauses to REMEMBER PEARL HARBOR—and the brave warriors who on that day stood tall and fought for America. God Bless our HEROES who wear the uniform and God Bless the United States of America. 
#PearlHarborRemembranceDay __HTTP__ _E_\n.@TrumpScotland provides luxury accommodations & a championship Par 72 7400 yd. course. Book your tee time now __HTTP__ _E_\nAs of September 30th we have a record trade deficit with China of over $217Billion. They are ripping us off. #TimeToGetTough _E_\n\"One of the keys to thinking big is total focus.\" – The Art of The Deal _E_\nI gave millions of dollars to DJT Foundation raised or recieved millions more ALL of which is given to charity and media won't report! _E_\nI'm on the David Letterman @LateShow tonight looking forward to it. 11:35 PM on CBS. _E_\nTrump Miss Universe simulcast on @nbc and @Telemundo on December 19th will once again deliver an entertaining and 'beautiful' show! _E_\nWe were led to believe that Jeep would manufacture in U.S. and sell to China—like China does to us. _E_\nWhy is the GOP establishment so threatened by the Newsmax @iontv debate? More debate is always better. _E_\nScary Americans private wealth fell 40% from 2007 2010 __HTTP__ But @BarackObama thinks the private economy is doing fine. _E_\nHurricane Irma is of epic proportion perhaps bigger than we have ever seen. Be safe and get out of its wayif possible. Federal G is ready! _E_\nA wonderfully written article concerning Israel by @JasonDovEsq __HTTP__ _E_\nSen. Jeff Flake(y) who is unelectable in the Great State of Arizona (quit race anemic polls) was caught (purposely) on \"mike\" saying bad things about your favorite President. He'll be a NO on tax cuts because his political career anyway is \"toast.\" _E_\n.@club4growth asked me for $1 million. I said no. Now falsely advertising that I will raise taxes. I'll lower big league for middle class. _E_\nChina must be worried that @MittRomney will win this November. They have never had such a pushover like @BarackObama. _E_\nI played football and baseball sorry but said to be the best bball player in N.Y. State ask coach Ted Dobias said best he ever coached. 
_E_\nTeachers in Chicago should go back to work immediately.Rahm Emanuel has offered them a fair deal. Now they're just acting for the cameras. _E_\nThe Manufacturing Index rose to 59% the highest level since early 2011 and we can do much better! _E_\nOur great country has been divided for decades. Sometimes you need protest in order to heel & we will heel & be stronger than ever before! _E_\nDonald Trump to Chris Christie: Don't hire @stuartpstevens __HTTP__ via @politico by @Hadas_Gold _E_\nA certain whack job Go Angelo who doesn't have a life spends his time hopelessly attacking me re: Macy's.... _E_\nBest thing my supporters can do if you don't like the way @megynkelly and her puppets unfairly treat us is don't watch her show! _E_\n.@transition2017 update and policy plans for the first 100 days. __HTTP__ _E_\nOn Mike and Mike @espn in two minutes! _E_\nIt's late in July and it is really cold outside in New York. Where the hell is GLOBAL WARMING??? We need some fast! It's now CLIMATE CHANGE _E_\nThank you California! Will see you soon! #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nWe must remember this truth: No matter our color creed religion or political party we are ALL AMERICANS FIRST. __HTTP__ _E_\n#TrumpVlog #TheInterviewMovie A sad day for freedom of speech __HTTP__ _E_\nThe Chinese are the biggest beneficiary of this post Saddam oil boom in Iraq __HTTP__ _E_\nEveryone is asking me to speak more on Robert & Kristen.I don't have time except to say Robert drop her she cheated on you & will again! _E_\nWhy are we fighting for the rebels that hate us only to save face for Obama! _E_\nReal unemployment is 20%. We must simplify the tax code and start making our own products again to bring our jobs back from overseas. _E_\nMAKE AMERICA GREAT AGAIN! __HTTP__ _E_\n??? @BarackObama held a raffle with donors for a lunch in the White House. The winners were conveniently all (cont) __HTTP__ _E_\nCongratulations to @JasonDufner on winning the PGA championship. 
Great job! _E_\nTHe people at shouldtrumprun.com have got it right! How are our factories supposed to compete with China and other countries... _E_\nAchievers go for the challenge so the next deal is what they're thinking about. They have an obligation to best themselves. _E_\nJust stated by a total pro: You are the only one who has the guts to say what we are all thinking. _E_\n.@David_Cameron As Prime Minister why are you spending vast amounts of money to subsidize ugly wind turbines in Scotland that nobody wants? _E_\nGallup finds Des Moines Iowa has the highest community pride (76.5) of any large city. Congrats and I agree I love the place! @DesMoines _E_\nIraq buying $200000000 worth of weapons from Iran. Despite so many killed and trillions spent Iraq dumps U.S. I TOLD YOU SO LONG AGO! _E_\nWatch @IvankaTrump's Ready To Wear Fashion Show at @LordandTaylor featuring @TrumpModels and @MissUSA..... __HTTP__ _E_\nNever ever quit never give up Donald J. Trump The Art of the Deal. _E_\nJust heard Fake News CNN is doing polls again despite the fact that their election polls were a WAY OFF disaster. Much higher ratings at Fox _E_\nRT @DonaldJTrumpJr: FINAL PUSH! Eric and I doing dozens of radio interviews. We can win this thing! GET OUT AND VOTE! #MAGA #ElectionDay ht... _E_\nIt does matter! __HTTP__ _E_\nLast night's All Star @ApprenticeNBC once again showed why the ultimate onus lies with the project manager. The buck stops there. _E_\nHe thinks that the wealth you create belongs to the gov't. @BarackObama doesn't respect the fact that the money he wastes belongs to us. _E_\nVia @DailyCaller by @samsondunn: \"Pastor To Hispanic Congregation Speaks Out On Trump Immigrant Crime Statement\" __HTTP__ _E_\nThe OWS protesters are doing nothing to advance the interests of the 99%. Time for them to go home! _E_\nIt's extremely cold in NY & NJ—not good for flood victims. Where is global warming? 
_E_\nFriends in NY 9 let @BarackObama know that you don't approve of his mistreatment of @Israel. Vote for @Bobturner9th tomorrow! _E_\nNobody wants wind turbines they are failing all over the world and need massive subsidy a disaster for taxpayers. _E_\nNo one has worse judgement than Hillary Clinton corruption and devastation follows her wherever she goes. _E_\nPoll numbers are starting to look very good. Leading in Florida @CNN Arizona and big jump in Utah. All numbers rising national way up. Wow! _E_\nAt the request of the Governor of Texas I have signed the Disaster Proclamation which unleashes the full force of government help! _E_\n.@TrumpGolfLA public golf course features spectacular panoramic Pacific Ocean views an elite attraction __HTTP__ _E_\nOpportunities only present themselves if you are out there looking for them. Be aggressive and seize them when they come. _E_\nFree enterprise is essentially a formula not just for wealth creation but for life satisfaction. Arthur C. Brooks _E_\nIt was an honor to welcome Republican and Democratic members of the Senate Finance Committee to the @WhiteHouse today. #TaxReform __HTTP__ _E_\n...for safety. Thank you to the Governor of P.R. and to all of those who are working so closely with our First Responders. Fantastic job! _E_\nJeb Bush signed memo saying not to use the term anchor babies offensive. Now he wants to use it because I use it. Stay true to yourself! _E_\nI suspect @JoeBiden could do well tonight. Don't be fooled by his gaffes. He is a seasoned and feisty debater. _E_\nThank you Kansas! The line going into the Orlando event is over a mile long. Massive crowd expected. Leaving Kansas now be there soon! _E_\nTrump to host #Oscars? __HTTP__ _E_\nEntrepreneurs: Listen and learn from others but make your own decisions. Take responsibility for yourself. It's a very empowering attitude! _E_\nHappy birthday to U.S. ARMY and our soldiers. Thank you for your bravery sacrifices & dedication. 
Proud to be your Commander in Chief! _E_\n7.8% unemployment number is a complete fraud as evidenced by the jobless claims number released yesterday.Real unemployment is at least 15% _E_\nJust left hospital. Rep. Steve Scalise one of the truly great people is in very tough shape but he is a real fighter. Pray for Steve! _E_\nRT @SecShulkin: Our Mobile Vet Center set up and ready to help #Veterans impacted by #HurricaneHarvey in Corpus Christi. __HTTP__ _E_\nMany reports that I will be attending the Alvarez/Khan fight this weekend in Vegas. Totally untrue! Unfortunately I have other plans. _E_\n\"Today's put off objectives reduce tomorrow's achievements.\" Henry Banks _E_\nI'm a former chief of police in a border town. I'm Hispanic I'm proud to be Hispanic and I'm 100% behind Trump. __HTTP__ _E_\nMar a Lago in Palm Beach is one of the most exclusive & elite clubs in the world w/award winning amenities __HTTP__ _E_\nGood advice from my mother Mary MacLeod Trump: \"Trust in God and be true to yourself.\" _E_\nIran the Number One State of Sponsored Terror with numerous violations of Human Rights occurring on an hourly basis has now closed down the Internet so that peaceful demonstrators cannot communicate. Not good! _E_\nThank you for your endorsement @GovernorSununu. #MAGA __HTTP__ _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nTomorrow's the day! Knock on doors and make calls with us on National Day of Action! #TrumpTrain #MAGA... __HTTP__ _E_\nMust watch – owner of a single restaurant anticipates that ObamaCare will cost over $1M for compliance __HTTP__ _E_\nQ/A @thecelidebiasio The secret behind my success is that I love what I'm doing. That gives me energy focus (cont) __HTTP__ _E_\nThings work out best for those who make the best of how things work out. John Wooden _E_\nIf the Republicans ever want to win a presidential election in the next 30 years they must get rid of @KarlRove. He is useless. 
_E_\nMore lies and deceptions @BarackObama is having his ex staffers write 'independent' studies for his reelection __HTTP__ _E_\nTrue thanks. __HTTP__ _E_\n\"We build too many walls and not enough bridges.\" Isaac Newton _E_\nI went to Wharton made over $8 billion employ thousands of people & get insulted by morons who can't get enough of me on twitter...! _E_\nYou have until 8pm to #VoteTrump Delaware! __HTTP__ _E_\nLocated in Tribeca each @TrumpSoHo hotel room features floor to window ceilings for a view of lower Manhattan __HTTP__ _E_\nCrooked Hillary Clinton wants to essentially abolish the 2nd Amendment. No gun owner can ever vote for Clinton! _E_\nWelcome to the 'Islamist Winter' the Muslim Brotherhood is now taking over the Egyptian military and possibly (cont) __HTTP__ _E_\nI take great pride watching skaters enjoy the #TRUMP Rink in Central Park from my office world's best skating rink __HTTP__ _E_\nSouth Korea is finding as I have told them that their talk of appeasement with North Korea will not work they only understand one thing! _E_\nI'm not against vaccinations for your children I'm against them in 1 massive dose.Spread them out over a period of time & autism will drop! _E_\nThe way President Obama runs down the stairs of Air Force 1 hopping & bobbing all the way is so inelegant and unpresidential. Do not fall! _E_\nDespite what the haters and losers like to say I never filed for bankruptcy but WOW the preeminent gaming company Caesars just did. _E_\n.@thehill John Oliver had his people call to ask me to be on his very boring and low rated show. I said NO THANKS Waste of time & energy! _E_\nI thought I was being nice to somebody re their parents. I guess this teaches you not to be nice or trusting. Sad! _E_\nOur country has tremendous potential. Together we can fix Washington. Let's Make America Great Again! __HTTP__ _E_\nChrysler is moving a massive plant from Mexico to Michigan reversing a years long opposite trend. 
Thank you Chrysler a very wise decision. The voters in Michigan are very happy they voted for Trump/Pence. Plenty of more to follow! _E_\nCrooked Hillary is being badly criticized (for a Wall Street paid for ad) by PolitiFact for a false ad on me on women. She is a total fraud! _E_\nOn behalf of a GRATEFUL NATION THANK YOU to all of the First Responders (HEROES) who saved countless lives in Las Vegas on Sunday night. __HTTP__ _E_\nThe fake news media is going crazy with their conspiracy theories and blind hatred. @MSNBC & @CNN are unwatchable. @foxandfriends is great! _E_\nRather than causing a big disruption in N.Y.C. I will be working out of my home in Bedminster N.J. this weekend. Also saves country money! _E_\nTHANK YOU to all of the incredible volunteers behind the scenes in Iowa! #CaucusForTrump __HTTP__ __HTTP__ _E_\nThank you South Carolina! Together WE WILL MAKE AMERICA GREAT AGAIN! #VoteTrumpSC __HTTP__ _E_\nWe will repeal and replace the horrible disaster known as #Obamacare! __HTTP__ _E_\n.@mcuban When Apprentice became the #1 show on tv you tried copying me with The Benefactor a complete and total ratings disaster for @ABC. _E_\nSharks are last on my list other than perhaps the losers and haters of the World! _E_\n\"My office is at Yankee stadium. Yes dreams do come true.\" @Yankees Captain Derek Jeter _E_\nReverend Wright was dumped like a dog by @BarackObama he can't be feeling too good. _E_\nThe Yankees are sure lucky George Steinbrenner is not around. A lot of people would be losing their jobs. _E_\nGov Kasich voted for NAFTA which devastated Ohio and is now pushing TPP hard bad for American workers! _E_\nHard to believe that with 24/7 #Fake News on CNN ABC NBC CBS NYTIMES & WAPO the Trump base is getting stronger! _E_\nThank you South Carolina! Everyone get out and vote tomorrow! We will #MakeAmericaGreatAgain! 
__HTTP__ _E_\nZimmerman is no angel but the lack of evidence and the concept of self defense especially in Florida law gave the jury little other choice _E_\nBarack Obama is not who you think he is. Most overrated politician in US history. _E_\nPeople are happy that I left the Trump Tower atrium open as opposed to taking the easy way out. __HTTP__ _E_\nI really like the Koch Brothers (members of my P.B. Club) but I don't want their money or anything else from them. Cannot influence Trump! _E_\n\"Be tough be smart be personable but don't take things personally. That's good business.\" – Think Like a Champion _E_\nPower Lunching next to the #BlueMonster: __HTTP__ via @UrbanDaddy cc @TrumpDoral _E_\nStill a buyer's market. Buy directly from a bank. They want to offload properties that have defaulted will give good prices & financing. _E_\nObama administration said that Saudi Arabia was on Syria's border __HTTP__ Wrong. These are the civilians planning the war. _E_\nThe Trump Organization is honored to be expanding our interests into Dubai. The golf course will be the top course in the Middle East. _E_\nTed Cruz purposely and illegally did not list on his personal disclosure form personally guaranteed loans from banks. They own him! _E_\nPervert Alert! Serial sexter @anthonyweiner has promised to use twitter as a \"tool.\" Parentsmake sure your children have him blocked. _E_\nThe third mass attack (slaughter) in days by ISIS. 200 dead in Baghdad worst in many years. We do not have leadership that can stop this! _E_\nCongratulations to Rex Tillerson on being sworn in as our new Secretary of State. He will be a star! _E_\nA great interview of @DonaldJTrumpJr in the @ globeandmail on Trump Tower Toronto __HTTP__ _E_\nAs China and the rest of the World continue to rip off the U.S. economically they laugh at us and our president over the riots in Ferguson! _E_\nThank you! 
#TrumpPence16 __HTTP__ _E_\nJust out: The Obama Administration knew far in advance of November 8th about election meddling by Russia. Did nothing about it. WHY? _E_\nDid China ask us if it was OK to devalue their currency (making it hard for our companies to compete) heavily tax our products going into.. _E_\nGetting ready to engage G7 leaders on many issues including economic growth terrorism and security. _E_\nCongrats to @leezeldin on a great victory. I hope my robocalls helped! #NY1 _E_\nCan you believe that Mitch McConnell who has screamed Repeal & Replace for 7 years couldn't get it done. Must Repeal & Replace ObamaCare! _E_\nCrazy @megynkelly says I don't (won't) go on her show and she still gets good ratings. But almost all of her shows are negative hits on me! _E_\nTennessee GOP Poll __HTTP__ 32.7%Cruz 16.5%Carson 6.6%Rubio 5.3%Christie 2.4%Jeb 1.6% _E_\nAs a very active President with lots of things happening it is not possible for my surrogates to stand at podium with perfect accuracy!.... _E_\nMy @SquawkCNBC interview discussing 2012 election polls @MittRomney's current trip & the US housing & land market __HTTP__ _E_\nOne point I made last night and will continue to push is that the @GOP can't be pollitically correct. We must fight fire with fire. _E_\n.@billmaher was so nervous talking about me on the @jayleno show—I've never seen him like that! _E_\nChina has just intervened to lower the yuan in other words they will continue to screw the U.S.! _E_\nI am pleased to inform you that I have just granted a full Pardon to 85 year old American patriot Sheriff Joe Arpaio. He kept Arizona safe! _E_\nHe may be the worst reporter in all of sports: @RickReilly of @ESPN. He gets away with murder and most people (cont) __HTTP__ _E_\nThe Kate Steinle killer came back and back over the weakly protected Obama border always committing crimes and being violent and yet this info was not used in court. His exoneration is a complete travesty of justice. BUILD THE WALL! 
_E_\nWe will soon be at a point with our incompetent politicians where we will be treating illegal immigrants better than our veterans. _E_\nVia __HTTP__ Interview with Donald Trump about Presidential Aspirations: It's all a deal __HTTP__ _E_\nWow @UnionLeader circulation in NH has dropped from 75000 to around 10—bad management. No wonder they begged me for ads. _E_\nIsn't it a shame that the person who will have by far the most delegates and many millions more votes than anyone else me still must fight _E_\nIf @VattenfallGroup dropped out of the economically unfeasible wind farm development in Aberdeen who is (cont) __HTTP__ _E_\nWe cannot let this evil continue! #Debates2016 __HTTP__ _E_\nThank you South Carolina! We will MAKE AMERICA SAFE & GREAT AGAIN! __HTTP__ __HTTP__ _E_\nTrump: If Republicans 'don't get tough they're not going to win this election' __HTTP__ Via @thehill _E_\nWill be on Fox & Friends in 3 minutes 7.00 A.M. _E_\nEntrepreneurs: Don't ever think you've done it all already or that you've done your best. You haven't so don't limit yourself! _E_\nI just passed a 10 block long gas line going to LGA airport a terrible situation! _E_\nOur country is totally fractured and with our weak leadership in Washington you can expect Ferguson type riots and looting in other places _E_\nAdvice from my father Fred C. Trump: Know everything you can about what you're doing. _E_\nHave passion for what you do and be efficient at the same time. Think Like a Champion _E_\nTed Cruz is falling in the polls. He is nervous. People are worried about his place of birth and his failure to report his loans from banks! _E_\nVideo in honor of the 100th Anniversary of the Anti Defamation League (ADL): \"Imagine a World Without Hate\" __HTTP__ _E_\nIf authorities need direct view from top of Trump Tower call office. _E_\nHe @BarackObama should not be trying to intimidate the USC justices on ObamaCare. 
He is worried because SG (cont) __HTTP__ _E_\n\"A people that values its privileges above its principles soon loses both.\" Dwight D. Eisenhower _E_\nVia @WTCommunities: Donald Trump to CPAC: Romney 'Didn't Talk Enough About Success' __HTTP__ by @HuizingaDanny _E_\nI can't believe Apple isn't moving faster to create a larger iPhone screen. Bring back Steve Jobs! _E_\nCongress should get back to Washington but @BarackObama doesn't want to interrupt his vacation in Martha's Vineyard. _E_\nEntrepreneurs: When negotiating don't be an open book. Know that the only person on your side might be yourself. _E_\nThe more time you spend feeling sorry for yourself the more time you waste after a setback. Move on and quickly embrace the next challenge! _E_\nSee dummy Danny Zuker who I never heard until this started something that he couldn't finish gutless and unwilling to take my bet! _E_\nThank you Tennessee! #Trump2016 __HTTP__ _E_\n\"Becoming an entrepreneur is a personal development program. If you grow personally your business will grow.\" – Midas Touch _E_\nThank you @DonaldJTrumpJr! #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nLeaving for the GREAT STATE OF SOUTH CAROLINA now to make a speech about how to MAKE OUR COUNTRY GREAT AGAIN! _E_\nGreat rally in Fresno California great crowd! Thank you! #Trump2016 __HTTP__ _E_\nWith @greta in Washington D.C. Old Post Office under construction. Tune in tonight at 7PM EST! __HTTP__ _E_\nSee Lyin' Ted even the @DailyBeast (no fan of mine) says this story came from Rubio not Trump! __HTTP__ _E_\nSuch a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before _E_\nIf Obama has to re fight this fight next year he loses Watch the fine details in every deal The Art of the Deal _E_\nVia @DMRegister BY @JenniferJJacobs: \"@SteveKingIA ramps up with first TV ad Trump event\" __HTTP__ _E_\nNew CNN Iowa poll Trump 33 Cruz 20. Everyone else way down! 
Don't trust Des Moines Register poll biased towards Trump! _E_\nToday we honored our true American heroes on the first ever National Vietnam War Veterans Day.#ThankAVeteran... __HTTP__ _E_\nWow the @nytimes is losing thousands of subscribers because of their very poor and highly inaccurate coverage of the Trump phenomena _E_\nDid my weekly phoner on Fox & Friends this morning...sounding off on issues of the day ... __HTTP__ _E_\nGreat job tonight @ericbolling _E_\nTired of being bullied by the economy? I'm going to help people. Wednesday 11 AM at Trump Tower _E_\nI will do more in the first 30 days in office than Hillary has done in the last 30 years! #Debate  #BigLeagueTruth __HTTP__ _E_\nThe Democrats seem intent on having people and drugs pour into our country from the Southern Border risking thousands of lives in the process. It is my duty to protect the lives and safety of all Americans. We must build a Great Wall think Merit and end Lottery & Chain. USA! _E_\nTrump International Hotel Washington D.C. will be one of the world's top luxury hotels __HTTP__ _E_\nThank you. __HTTP__ _E_\nMany great business campaigns at @fundanything __HTTP__ Great way to support small upstarts. _E_\nLance Armstrong did himself great harm last night. Lawsuits & failure will follow him! _E_\nOn the red carpet at the NYC premiere of Dark Knight Rises with @melaniatrump via @NewYorkObserver's @velvet_roper __HTTP__ _E_\n.@SarahPalinUSA was 100% correct when she stated that @oreillyfactor used us in day long tease to get people to watch but we were not on! _E_\nFeaturing five championship golf courses including the Blue Monster @TrumpDoral is South Miami's top destination __HTTP__ _E_\n\"All Star Celebrity Apprentice\" is #1 in the time period among ABC CBS and NBC in 18 49 and all other key demos—Nielsen Ratings _E_\nThe real estate market is slowly improving. Still a great time to buy. You will thank me in 5 years. 
_E_\n\"Money may not grow on trees but it does grow from talent hard work and brains.\" – Think Like a Billionaire _E_\nA former Miss New York is the designer behind the swimsuits featured in Sunday's Miss USA pageant—beautiful! __HTTP__ _E_\nIt was a great honor to be with King Abdullah II of Jordan and his delegation this morning. We had a GREAT bilateral meeting! __HTTP__ _E_\nEXCLUSIVE — DONALD TRUMP ON THE GOP PRIMARY: 'IF I WIN I WILL BEAT HILLARY' __HTTP__ via @BreitbartNews by Katie McHugh _E_\nWind turbine syndrome is affecting tremendous numbers of people in their wake—stop ugly turbines. _E_\nFloyd Mayweather is being beaten up badly through 10 rounds by Marcos Maidana but announcers say it is even. TWO ROUNDS LEFT. _E_\nWacky & totally unhinged Tom Steyer who has been fighting me and my Make America Great Again agenda from beginning never wins elections! _E_\nRT @markets: U.S. job openings surge to record __HTTP__ via @ShoChandra __HTTP__ _E_\nTrump Collection's summer line exclusively available @Macys is the pinnacle of style & prestige. Dress your best! __HTTP__ _E_\nOn December 19th the @MissUniverse pageant will be broadcast live in over 190 countries to one billion viewers. @nbc _E_\nThe situation with Russia is much more dangerous than most people may think and could lead to World War III. WE NEED GREAT LEADERSHIP FAST _E_\nClub For Growth tried to extort $1000000 from me. When I said NO they went hostile with negative ads. Disgraceful! _E_\nDishonest media is trying their absolute best to depict a star in a tweet as the Star of David rather than a Sheriff's Star or plain star! _E_\nRT @GregAbbott_TX: Thanks to the Texas National Guard for their help to rescue flooded Texans. #HurricaneHarvey __HTTP__ _E_\n\"When you have confidence you can have a lot of fun. 
And when you have fun you can do amazing things.\" @RealJoeNamath _E_\nGreat poll thank you Nevada!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\n\"Generals don't panic then the troops never panic.\" @SHAQ _E_\nYou can't tax business. Business doesn't pay taxes. It collects taxes. ― Ronald Reagan _E_\nThank you New York! I love you!#MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nI know about the \"rustic\" look on golf courses—but see photo of highly rated Trump National Philadelphia—a real gem. __HTTP__ _E_\nThank you to @NYPost's Robert Rorke for the really nice review of #SNL. So many enjoyed it very gratifying! __HTTP__ _E_\nThe TODAY Show should call me about who to put on the show— I know more about people who get ratings than anyone. _E_\nObama wants Americans to keep buying crude from OPEC who is ripping us off instead of our ally Canada through (cont) __HTTP__ _E_\nJust out new PPP NATIONAL POLL has me in first place by a wide margin at 29%. I wonder why only @FoxNews has not reported this? Too bad! _E_\nAs President I WILL fix this rigged system and only answer to YOU the American people! __HTTP__ _E_\nTrump National Golf Club Los Angeles fronts the Pacific Ocean and has an 18 hole Pete Dye course. Beautiful! __HTTP__ _E_\nTrump Int'l Washington D.C. is a historic building which our entire nation can take pride in & enjoy Opening 2016 __HTTP__ _E_\n.@rupertmurdoch is absolutely right it will be a nightmare for @Israel if Obama is re elected. _E_\nI will be on Fox & Friends @foxandfriends at 7.00 a.m. (30 minutes). Enjoy! _E_\nThe journey to #MAGA began @CPAC 2011 and the opportunity to reconnect with friends and supporters is something I look forward to every year. See you at #CPAC2018! _E_\nHe is delusional: @BarackObama believes that he is the 4th best POTUS ever. 
_E_\nMinorities line up behind.....Donald Trump #Trump2016 #MakeAmericaGreatAgain __HTTP__ __HTTP__ _E_\nWeak & ineffective @JebBush is doing ads where he shows his statement in the debate but not my response. False advertising! _E_\nThe UN is about to use its Assembly to attack @Israel. We should defund the UN entirely if they can't act resp... (cont) __HTTP__ _E_\nI must say that some of these college football games are great tonight very exciting I wish I had more time to watch! _E_\n.@seanhannity at 10:00. _E_\nI'm convinced that about half of what separates successful entrepreneurs from the non successful ones is pure perseverance. Steve Jobs _E_\nMy @FoxNews interview with @TeamCavuto discussing the Newsmax @iontv debate #TimeToGetTough and the 2012 race __HTTP__ _E_\nAnother electric car firm that @BarackObama gave $118M just went bankrupt. __HTTP__ He loves to waste our tax dollars. _E_\n\"TRUMP DECLARES VICTORY ON IMMIGRATION AS OBAMA ADMITS SOME ILLEGALS ARE 'GANG BANGERS'\" __HTTP__ via @BreitbartNews @ASwoyer _E_\n#ICYMI: @foxandfriends this morning. __HTTP__ _E_\nHe admits his presidency has been flawed...but @BarackObama claims economy is stronger. __HTTP__ _E_\nAilsa Course changes: #TrumpTurnberry What a beautiful place! __HTTP__ _E_\nRT @foxandfriends: White House calls out Senate Democrats for obstructing nominees __HTTP__ _E_\nThe judge opens up our country to potential terrorists and others that do not have our best interests at heart. Bad people are very happy! _E_\nWe will follow two simple rules: BUY AMERICAN & HIRE AMERICAN!#InaugurationDay #MAGA _E_\nCrooked Hillary Clinton and her team were extremely careless in their handling of very sensitive highly classified information. Not fit! _E_\nDon't forget the open call at Trump Tower tomorrow for The Apprentice. I look forward to seeing you there. _E_\nI'm a Republican but not a fan of the last George Bush he also was a lousy President (Iraq etc.). In fact he was so bad he gave us Obama! 
_E_\n...I told Republicans to approve healthcare fast or this would happen. But don't worry I will veto because I love our country & its people. _E_\nCommodity prices are beginning to drop as a result of the Euro crisis __HTTP__ _E_\nRussia should hand over Snowden to the U.S. but they are having too much fun taunting our leaders. _E_\nWhy is it that Eric Schneiderman is considered a lightweight by so many and has failed to go after Jon Corzine and big abusers for billions? _E_\nI requested that Mitch M & Paul R tie the Debt Ceiling legislation into the popular V.A. Bill (which just passed) for easy approval. They... _E_\nWhy does @megynkelly devote so much time on her shows to me almost always negative? Without me her ratings would tank. Get a life Megyn! _E_\nDo you think that Hillary Clinton will apologize to me for the lie she told about the video of me being used by ISIS. There is no video. _E_\nCongratulations to @AlabamaFTBL on winning the BCS championship last night! _E_\nJust arrived in Taormina with @FLOTUS Melania. #G7Summit #USA __HTTP__ _E_\nThe Amateur. On his trip to Afghanistan our commander in chief disclosed the CIA Chief's name. Unsafe disaster! __HTTP__ _E_\nWishing everyone a Happy Memorial Day and a thank you to all the soldiers who protected our great country. _E_\nHillary and the Dems loved and praised FBI Director Comey just a few days ago. Original evidence was overwhelming should not have delayed! _E_\nAs promised on the campaign trail we will provide opportunity for Americans to gain skills needed to succeed & thrive as the economy grows! __HTTP__ _E_\nTrump offers $5 million for Obama college passport records __HTTP__ By @AlexPappasDC @DailyCaller _E_\nWatch me tonight on The O'Reilly Factor at 8 pm and 11 pm EST FOX News _E_\nInteresting article about Atlantic City __HTTP__ _E_\nThe American people agree. No free pass for #CrookedHillary! 
__HTTP__ _E_\nVia @bostonherald by @ChrisCassidy_BH: \"Trump: `The last thing we need is another Bush'\" __HTTP__ _E_\nFirst of all you don't necessarily need the best location. What you need is the best deal. The Art of the Deal _E_\nCongratulations to @TrumpChicago @TrumpSoHo and @TrumpLasVegas all listed #1 on @TravelandLeisure World's Best Business Hotels _E_\nThe Blue Monster at Trump National Doral in Miami is doing record business everybody wants a piece of it. Great reviews. Thank you! _E_\nA Rod was a great player when he lived at Trump Park Avenue even though he was on the juice! _E_\nA friend of mine went to @CakeBossBuddy and sent me this beautiful cake which we put in the atrium of @TrumpTowerNY. __HTTP__ _E_\n.@Lord_Sugar nice call on predicting that the iPOD would be dead finished gone kaput __HTTP__ Great business foresight. _E_\nSo many great endorsements yesterday except for Paul Ryan! We must put America first and MAKE AMERICA GREAT AGAIN! _E_\nI hope the @RNC is ready for a Third Party if they blow this election because that is what they will face. They must fight hard. _E_\nWord is that @Greta Van Susteren was let go by her out of control bosses at @NBC & @Comcast because she refused to go along w/ 'Trump hate!' _E_\nHuffington Post is just upset that I said its purchase by AOL has been a disaster and that Arianna Huffington is ugly both inside and out! _E_\n.@MatthewJDowd thank you for the nice comments recently especially on @BarbaraJWalters. My family & I greatly appreciate your kind words. _E_\nGreat being in Cincinnati Ohio last night thank you! Off to Washington D.C. now. #Trump2016 #AmericaFirst __HTTP__ _E_\nThe Failing @nytimes set Liddle' Bob Corker up by recording his conversation. Was made to sound a fool and that's what I am dealing with! _E_\nIraq's government is treating us like fools. We should demand their oil. _E_\nAll this from a guy who lectured Americans about tightening their belts: @BarackObama bashes rich people an... 
(cont) __HTTP__ _E_\nWe are the greatest country the world has ever known. I make no apologies for this country my pride in it or (cont) __HTTP__ _E_\nDon't forget the Celebrity Apprentice Sunday night at 9 pm on NBC for another surprising and exciting episode __HTTP__ _E_\nTrump is going to be our President. We owe him an open mind and the chance to lead. So much time and money will be spent same result! Sad _E_\nThe so called Commission on Presidential Debates admitted to us that the DJT audio & sound level was very bad. So why didn't they fix it? _E_\nScary. Obama and the Democrat Senate have accrued over $5T worth of debt without passing a budget in the last 3 years. 4 more years? _E_\nBe on time. Wasting other people's time due to poor planning and thoughtlessness will only leave a bad impression. Think Like a Champion _E_\n.@NYDailyNews the dying tabloid owned by dopey clown Mort Zuckerman puts me on the cover daily because I sell. My honor but it is dead! _E_\nIt's amazing how different all of the polling results are not an exact science. _E_\nWe must not let #CrookedHillary take her CRIMINAL SCHEME into the Oval Office. #DrainTheSwamp __HTTP__ _E_\n.@andersoncooper did an excellent job of hosting the #DemDebate last night. Tough firm but fair. _E_\nIt is time to create jobs for Americans not D.C. We need a bold new direction. Let's Make America Great Again! __HTTP__ _E_\nI told all of the haters and losers long ago that Iraq would fall take the OIL or get out fast! Massive waste of lives and trillions of $'s _E_\nChina just called. They want to lend Obama another $1B for the ObamaCare web site. _E_\nRT @RSBNetwork: LIVE Stream: Donald Trump about to speak in Boca Raton FL. Protesters already before Trump speaks. #TrumpTrain __HTTP__ _E_\nHe has no respect for American exceptionalism. @BarackObama has outsourced our space program to the Russians __HTTP__ _E_\n$30M a year and A Rod is now relegated to the bench. 
@yankees would have lost if Girardi hadn't benched him in the 9th (see my prediction) _E_\nIt's Thursday. How much has OPEC ripped us off today? _E_\n.@reince is doing a fantastic job for the Republican Party hope he gets the credit he deserves. _E_\nThe VA scandal will only get worse over the time. Our vets deserve the best care possible. We must be open to private solutions. _E_\nThe super Liberal Democrat in the Georgia Congressioal race tomorrow wants to protect criminals allow illegal immigration and raise taxes! _E_\nWatch @foxandfriends now on Podesta and Russia! _E_\nThis assignment has been a challenge to both teams. #CelebApprentice _E_\nI will soon be releasing my response to the fact that President Obama refused to show his applications and records to the public. _E_\nThings turn out best for the people who make the best of the way things turn out. John Wooden _E_\nJust finished a very good meeting with the President of South Korea. Many subjects discussed including North Korea and new trade deal! _E_\nThe fight against ISIS starts at our border. 'At least' 10 ISIS have been caught crossing the Mexico border. Build a wall! _E_\nPalm Springs CA has been destroyed absolutely destroyed by the world's ugliest wind farm at the Gateway on Interstate 10. Very very sad! _E_\nI went to @MikeTyson's play. I will be doing a review in the next #trumpvlog. _E_\nIt's this simple. \"Make America Great Again.\" #debate #BigLeagueTruth _E_\nAs to the U.N. things will be different after Jan. 20th. _E_\n\"You can't con people at least not for long. If you don't deliver the goods people will eventually catch on.\" The Art of The Deal _E_\nWow I was just informed that I'm being inducted into the @WWE Hall of Fame a great honor 4/6/13 at @MSGnyc __HTTP__ _E_\n.@MittRomney & @PaulRyanVP get what needs to be done to reign in China. @BarackObama gets kicked around by the Chinese. _E_\nOur national security starts at the border. 
Do you think ISIS & al Qaeda are just in the Middle East? _E_\nTell 'Top Scot' Michael Forbes to clean up his property—it is an embarrassment to Scotland. _E_\nVia @washingtonpost by @jdelreal: About that Donald Trump speech at CPAC ... __HTTP__ _E_\nThank you @scottienhughes for the great job you did on @CNN. Great energy and smarts! I will not let you down. _E_\nDistressed real estate opportunities can make great investments. You need the foresight and instincts to know the property's true potential. _E_\nRapper @MacMiller's song Donald Trump now has 57 million hits I created another star where's my cut? _E_\nRT: @thedailybeast: Polling shows the @AmericansElect movement could still nominate a viable independent with a chance of victory... _E_\nOne 57 is one of the worst looking buildings I've seen in a long time in particular its very ugly skin. _E_\nFunny to hear the Democrats talking about the National Debt when President Obama doubled it in only 8 years! _E_\nhave been allowed to run guilty as hell. They were VERY nice to her. She lost because she campaigned in the wrong states no enthusiasm! _E_\nJust watched @marcorubio on television. Just another all talk no action politician. Truly doesn't have a clue! Worst voting record in Sen. _E_\nLast nights results in poll taken by NBC. #AmericaFirst #ImWithYou __HTTP__ _E_\nThis is a MOVEMENT! #RNCinCLE __HTTP__ _E_\nDon't forget the Miss USA Pageant live on Sunday night at 9 pm ET on NBC. And you can vote for your favorite beauty! __HTTP__ _E_\nDonald Trump is confident that Ireland is ready for a big comeback __HTTP__ via @independent_ie by @AnitaActually _E_\nWill be interviewed by @oreillyfactor tonight at 8 PM. _E_\nHelp save the lives of our troops.Our #vets suffering from TBI/PTS need treatment @makeitvisible Donate to __HTTP__ _E_\n#CelebApprentice who do you think won? _E_\nFake News story of secret dinner with Putin is sick. All G 20 leaders and spouses were invited by the Chancellor of Germany. 
Press knew! _E_\nWill be going to Detroit Michigan (love) today for a big meeting on bringing back car production to State & U.S. Already happening! _E_\nYoung entrepreneurs – be resolute in your drive for success. Gain momentum. Once you succeed promote yourself! _E_\nSo great to be in New York. Catching up on many things (remember I am still running a major business while I campaign) and loving it! _E_\nGreat that Pres. O is seeing @MittRomney today—lots of good things can happen. _E_\nHow does @HBO employ @BillMaher with a pathetic show that he does what kind of a special is that? Complete garbage! _E_\nIn Las Vegas for the Miss Universe Pageant—airing tonight on @nbc at 8 o'clock. _E_\nSleepy eyes @chucktodd is an absolute joke of a reporter. He is in the bag for Obama. He can't carry @jack_welch's jock. _E_\n.@marcorubio what do you say to the family of Kathryn Steinle in CA who was viciously killed b/c we can't secure our border? Stand up for US _E_\nThe death tax should be abolished the Government is simply taxing you twice. It is also a job killer. _E_\nHeading to Iowa to a packed house. Just released polls all first place are amazing. Thank you! _E_\nVia @ProgressIndex: \"Donald Trump to deliver keynote address at annual Chesterfield Republican Gala\" __HTTP__ _E_\nI will be interviewed on @FaceTheNation this morning. Enjoy! @jdickerson _E_\nAnother @BarackObama green car loan recipient is laying off staff. __HTTP__ How many billions of our money has he wasted? _E_\nWe are going to have a big event at the Verizon Wireless Arena in Manchester New Hampshire! 5K+! Join us tomorrow: __HTTP__ _E_\nWow Mitt Romney didn't know that Rand Paul was in the race for president. Very strange! 
@FoxNews _E_\nLawrence O'Donnell will soon have another cancelled show to go along with his three cancelled TV series Mister (cont) __HTTP__ _E_\nRT @FoxNews: .@AlanDersh: Trump Has 'More Credibility' Than Obama With North Korea __HTTP__ __HTTP__ _E_\nCrooked Hillary Clinton is being protected by the media. She is not a talented person or politician. The dishonest media refuses to expose! _E_\nHillary says take back Mosul? We would have NEVER lost Mosul if it wasn't for #CrookedHillary. #DrainTheSwamp __HTTP__ _E_\nVisiting New York City? Make sure to skate in the world famous Trump Rink in Central Park __HTTP__ Great for the whole family! _E_\nAnother clip from my @greta interview discussing why Sony should not have capitulated to the hackers __HTTP__ No Courage! _E_\nObama has admitted that he spends his mornings watching @ESPN. Then he plays golf fundraises & grants amnesty to illegals. _E_\nWrong @BarackObama's '08 campaign manager & current Senior WH Advisor collected $100G fee from Iranian affiliate __HTTP__ _E_\nWe're worried about waterboarding as our enemy ISIS is beheading people and burning people alive. Time for us to wake up. _E_\nGetting ready to leave for Poland after which I will travel to Germany for the G 20. Will be back on Saturday. _E_\nIran is flying supply planes to Syria through Iraqi airspace. Thank you United States for making this possible! _E_\nThank you Oklahoma & Virginia! #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nHonored to have passed 1 million twitter followers. We are making America #1 again. #TimeToGetTough _E_\nThe Trump Signature Collection available @Macys offers top new designs for your fall wardrobe. Dress your best! __HTTP__ _E_\nWhen employees are working at home they can never have the same cohesivness as working together as a group... _E_\n#Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nI will be nominating Christopher A. Wray a man of impeccable credentials to be the new Director of the FBI. Details to follow. 
_E_\nToday is armed forces day. Thank you  to our military service members! I love you all! _E_\nDopey Sugar—@Lord_Sugar Isn't it sad that my golf course in Scotland just got \"best new course in the world\"—it's worth more than you are! _E_\n.@JuddApatow I agree! _E_\nObama's convention bounce is gone. @MittRomney has retaken the lead in the latest @RasmussenPoll __HTTP__ _E_\nThank you Nashville Tennessee! __HTTP__ _E_\nWow it is unbelievable how distorted one sided and biased the media is against us. The failing @nytimes is a joke. @CNN is laughable! _E_\nMy interview w/ @WendyWilliams on @WendyShow discussing @MichelleObama's bangs & All Star @CelebApprentice __HTTP__ _E_\nGreat win by the @nyjets yesterday. If they run the table they will make the playoffs. _E_\nRemember when the two failed presidential candidates Lindsey Graham and Jeb Bush signed a binding PLEDGE? They broke the deal no honor! _E_\nMany of the thugs that attacked the peaceful Trump supporters in San Jose were illegals. They burned the American flag and laughed at police _E_\nWill be interviewed by @JudgeJeanine on @FoxNews at 9:00 P.M. (Saturday night). Enjoy! _E_\nCongressman John Lewis should finally focus on the burning and crime infested inner cities of the U.S. I can use all the help I can get! _E_\n.@andydean2014 Thank you you were great. You can defend me anytime. Amazing job. _E_\nCruz says I supported TARP which gave $25 million to Goldman Sachs the bank which loaned him the money he didn't disclose. Puppet! _E_\nIn the \"old days\" when good news was reported the Stock Market would go up. Today when good news is reported the Stock Market goes down. Big mistake and we have so much good (great) news about the economy! _E_\nBiden's statements on Medicare are very effective. Ryan must now come back and combat. 
#VPDebate _E_\nA clip from my @foxandfriends interview discussing how Newsmax @iontv debate is determining the GOP primary polls __HTTP__ _E_\nWill be interviewed on @FoxNews at 10:00 P.M. Enjoy! _E_\nI am in Iowa. Will be interviewed on This Week With @GStephanopoulos this morning. ENJOY! _E_\nBe sure to listen to my interview on tonight's @SteveDeaceShow. Steve is a terrific guy! _E_\nThe great Mike Wallace covered me in a much more professional manner than his son Chris Wallace of @FoxNews. Mike was a total pro! _E_\nHagel committee vote has been postponed as Hagel refuses to disclose all his finances __HTTP__ _E_\nNY should frack now. What's the hold up? Is Albany opposed to creating jobs and making gas cheaper for middle class? _E_\nWhen will lightweight hack Attorney General be investigated for his repeated prosecutorial misconduct? __HTTP__ _E_\nFigure out what really moves you. You've got to have the 'FIRE' in order to have the Midas Touch. Midas Touch _E_\nDon't negate your own power. Whatever you've been dealt know you can deal with it. Fear is the opposite of faith. _E_\nTrue. Thanks. __HTTP__ _E_\nI love seeing that Graydon Carter and @VanityFair are failing so badly. He's only focused on his bad food restaurants. _E_\nVery grateful for the 9 O decision from the U. S. Supreme Court. We must keep America SAFE! _E_\nNeed all on the UN Security Council to vote to renew the Joint Investigative Mechanism for Syria to ensure that Assad Regime does not commit mass murder with chemical weapons ever again. _E_\n\"What counts is not necessarily the size of the dog in the fight it's the size of the fight in the dog.\" Dwight D. Eisenhower _E_\nWill be interviewed on @foxandfriends now! _E_\nNew Hampshire has a major decision to make today. Hopefully we won't have to hear any more Mandarin spoken in future debates. _E_\nAmazing evening at Saturday Night Live! _E_\nThe harder I work the luckier I get. 
Samuel Goldwyn _E_\nNYC is under constant threat from Jihadists & violent criminals. Stop & Frisk keeps streets & subways safe.Stand strong Ray Kelly _E_\nWow reviews are in THANK YOU! _E_\nThe Gang of Six yet another unmitigated disaster. ANY DEAL NEEDS TO REPEAL OBAMACARE. T E A. _E_\nLightweight Senator @RandPaul should focus on trying to get elected in Kentucky a great state which is embarrassed by him. _E_\nBusy doing phoners this week with Neil Cavuto Wolf Blitzer Fox & Friends and Larry Kudlow....check out __HTTP__ _E_\nWhy didn't Gates resign if he was so unhappy about what he was being told by Obama? The fact is Iraq etc. have always been disasters! _E_\nMillions losing healthcare plans despite President Obama's promise that this WOULD NOT HAPPEN! What about a massive protest march on D.C. _E_\nMy family has the honor of being interviewed for a full hour by the legendary @BarbaraJWalters tonight @ABC 10pmE. __HTTP__ _E_\nYou can benefit from others' wisdom. Not just their mistakes but the good decisions and insight they have to offer.\" The Way To The Top _E_\nLooks like the U.S. will be having the coldest March since 1996 global warming anyone????????? _E_\nThank you Bangor Maine! Get out & #VoteTrumpPence16 on 11/8/16 and together we will MAKE AMERICA SAFE AND GREAT A... __HTTP__ _E_\nI will be the featured guest on the season opener of @60Minutes this Sunday. There certainly is plenty to talk about! _E_\nCHAIN MIGRATION cannot be allowed to be part of any legislation on Immigration! _E_\nFrank was a great guy married to an absolutely wonderful woman @KathieLGifford. What a couple! __HTTP__ _E_\nMy @gretawire int. on Obama's falling poll numbers Americans losing incentive to work and Weiner's sexting __HTTP__ _E_\nHeading to Birmingham Alabama and a massive crowd of incredible people! 12 noon will be wild. _E_\nLeaving Nevada now for Iowa. Things are looking good great new polls! _E_\nSo nice great Americans outside Trump Tower right now. Thank you! 
__HTTP__ _E_\nI am truly enjoying myself while running for president. The people of our country are amazing great numbers on November 8th! _E_\nIsn't it great that Obama had time yesterday to fundraise with Jay Z and do @Late_Show while there is a record 21% real unemployment! _E_\nThank you Dayton Ohio! 20000 supporters largest in airport history! #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nThank you South Carolina! __HTTP__ _E_\nPresident Obama could totally solve the problem with Putin by demanding that Russia sign on to ObamaCare thereby destroying their economy! _E_\n.@HillaryClinton you have failed failed and failed. #BigLeagueTruthTime to #DrainTheSwamp! __HTTP__ _E_\nWages in are country are too low good jobs are too few and people have lost faith in our leaders.We need smart and strong leadership now! _E_\nReports are out there that many CEOs of charities are getting overpaid while their causes are seeing very little... _E_\nHe (or she) who hesitates is lost: MAKE AMERICA GREAT AGAIN! _E_\n.@CNN has to do better reporting if it wants to keep up with the crowd.So totally one sided and biased against me that it is becoming boring _E_\nWas photo bombed yesterday by a wise guy when I left the set of @LateNightJimmy... _E_\nThe Senate should immediately vote on the Iranian sanctions bill. What is the delay? Iran is already breaking its agreement with Obama _E_\nDonald Trump ready to end @ApprenticeNBC for White House run __HTTP__ via via @dcexaminer by @eScarry _E_\nWe are what we repeatedly do. Excellence then is not an act but a habit. Aristotle _E_\nVia NYTimes What's Your Ideal Gadget? __HTTP__ _E_\nJeb failed as Jeb! He gave up and enlisted Mommy and his brother (who got us into the quicksand of Iraq). Spent $120 million.Weak no chance! _E_\nRT @IamVicky4Trump: TUNE IN: Maria Bartiromo Has an Exclusive Interview With President Trump __HTTP__ _E_\n.... we push for the removal of all trade distorting practices....to foster a truly level playing field. 
_E_\nIt is truly an honor that His Eminence Archbishop of New York @CardinalDolan will be delivering the benediction at the @RNC convention. _E_\n\"@joerepublic1 @mckaycoppins how nice of this punk with a pen to call a truce after he tries to show u up w/his bs! True thx _E_\nMarco Rubio should pick a location that has working air conditioning next time especially when in Miami proper plan. Sweating profusely! _E_\nPreview of Obama's SOTU: More taxes bigger government shrink the private sector end the Republicans & bankrupt the country. Enjoy! _E_\nThink of it the Arab League doesn't want to get involved with Syria but they want us to do their dirty work. How stupid! _E_\nI encourage everyone in the path of #HurricaneHarvey to heed the advice & orders of their local and state officials. __HTTP__ _E_\n...to win. The Democrats are overplaying their hand. They lost the election and now they have lost their grip on reality. The real story... _E_\nFirst Minister Salmond should stop his fruitless drive for obsolete wind turbines in Scotland he would become popular again! @alexsalmond _E_\nI watched Sen. Graham @FaceTheNation. Why don't they say that I ran him out of the race like a little boy and in the end he had no support? _E_\nBy @kwrcrow: Hey Washington Post 'Only You Hate Donald Trump' or 'Is it FEAR?' __HTTP__ _E_\nI will be on @meetthepress in an interview with @chucktodd on Sunday morning. So much to talk about! _E_\nWow just watching the news.ObamaCare and the website are TOTALLY OUT OF CONTROL. Costs are through the roof. This could be ruinous to U.S.! _E_\nOffering river lake & skyline views @TrumpChicago's 339 5 Star rooms range from Deluxe Suites to Spa Guestrooms __HTTP__ _E_\nThere is no substitute for hard work. Thomas Edison _E_\nJoin me in Roanoke Virginia on Saturday evening at 6pm! #MAGA __HTTP__ _E_\nFox and.Friends now! _E_\nObama will eventually approve the Keystone XL pipeline has to happen but it is very late! 
_E_\nThe boardroom has never been as intense as in the upcoming13th season of All Star @CelebApprentice. Premieres March 3rd on @NBC! _E_\nAll Star Celebrity @ApprenticeNBC continues to dominate the Sunday 10PM slot in every key demographic. Still hot after 13 seasons! _E_\nI will not be attending the White House Correspondents' Association Dinner this year. Please wish everyone well and have a great evening! _E_\nSo much Fake News being put in dying magazines and newspapers. Only place worse may be @NBCNews @CBSNews @ABC and @CNN. Fiction writers! _E_\nCongratulations to our great military men and women for representing the United States and the world so well in the Syria attack. _E_\nTODAY WE MAKE AMERICA GREAT AGAIN! _E_\nNielson Media Research final numbers on ACCEPTANCE SPEECH: TRUMP 32.2 MILLION. CLINTON 27.8 MILLION. Thank you! _E_\nI didn't suggest a database a reporter did. We must defeat Islamic terrorism & have surveillance including a watch list to protect America _E_\nLeaked e mails of DNC show plans to destroy Bernie Sanders. Mock his heritage and much more. On line from Wikileakes really vicious. RIGGED _E_\nI'm with you! I will work hard and never let you down. Make America Great Again! __HTTP__ __HTTP__ _E_\nWork Underway on First New Trump Course in Dubai Second Course in Planning __HTTP__ via @CybergolfNews _E_\nSmall businesses will have an ally in the White House with @MittRomney. Mitt gave a great interview yesterday __HTTP__ _E_\nFamily group shot. #WWEHOF __HTTP__ _E_\nJust cannot believe a judge would put our country in such peril. If something happens blame him and court system. People pouring in. Bad! _E_\nVia @DailyCaller @NeilMunroDC: \"Obama's Border Policy Fueled Epidemic Evidence Shows\" __HTTP__ _E_\nRT @mike_pence: Good morning! Join me in Lima Ohio tomorrow evening at 7pm. #MAGATickets: __HTTP__ _E_\nFLASHBACK via @Reuters from 2004: \"Donald Trump Would 'Fire' Bush Over Iraq Invasion\" It's called great vision. 
__HTTP__ _E_\nJust arrived in Texas have been informed two @fortworthpd officers have been shot. My thoughts and prayers are with them. _E_\nTrump Hotels are delivering lots of food to storm victims...we love doing it! _E_\nRT @TeamTrump: Hillary's policies have made America less safe that's why 200+ general and military leaders have endorsed @realDonaldTrump!... _E_\nThe worst show in Las Vegas in my opinion is @pennjillette. Hokey garbage. New York show even worse! _E_\nAwarded both @ForbesInspector Five Star & @AAAFiveDiamond ratings @TrumpNewYork's @Jean_GeorgesNYC is fantastic. __HTTP__ _E_\nThank you Charlotte North Carolina!#MakeAmericaGreatAgain __HTTP__ _E_\n\"Donald Trump dedicates second Scottish golf course to beloved mother Mary\" __HTTP__ via @MailOnline _E_\nMy fellow Tea Party friends in Ohio make sure you take advantage of early voting so you can GOTV election day. Know you can! Must win Ohio. _E_\nEveryone should watch the documentary 'Windfall' on @netflix. See an upstate NY town ruined by environmentalists & windfarms. _E_\nHAPPY PRESIDENTS DAY MAKE AMERICA GREAT AGAIN! _E_\n.@ArceePalabrica @realDonaldTrump Midas Touch is the manual for entrepreneurs who want to succeed. Thanks for sharing your knowledge _E_\nOne season ends and another starts. Already casting for the next @ApprenticeNBC. Great news for charity $13 million so far. _E_\nLooking forward to my meeting with Benjamin Netanyahu in Trump Tower at 10:00 A.M. _E_\nWow I love stimulating debate and driving certain people crazy the Generals were forced to do something they didn't want to do (not me). _E_\nThank you @NYPost! #Trump2016 __HTTP__ _E_\nThe lightweight hack Schneiderman told Ivanka that the \"case is weak and more. Meets with Obama & then files one day later. _E_\nSadly they and others are Fake News and the public is just beginning to figure it out! __HTTP__ _E_\nJeb Bush really blew his interview with @megynkelly should cost him big time. 
Said he would do the disastrous Iraq war all over again _E_\nSo Obama wants to bomb ISIS in Iraq & arm them in Syria? What is he doing! _E_\nIn my administration EVERY American will be treated equally protected equally and honored equally #Debate #BigLeagueTruth _E_\nInteresting.@BarackObama's 1981 transfer class to Columbia declined in quality according to the Columbia Spectator __HTTP__ _E_\nWhat's funny about the name \"F**kface Von Clownstick\" it was not coined by Jon Leibowitz he stole it from some moron on twitter. _E_\nSo many incredible friends said thanks for TT help I say thanks to you! __HTTP__ _E_\nWe threw our ally Mubarak overboard and Egypt is now our enemy. Great going Obama Israel is in trouble. _E_\nWill be meeting on Monday at Trump Tower with a large group of African American Pastors. Many I know wonderful people! Not a press event. _E_\nIt doesn't matter that Crooked Hillary has experience look at all of the bad decisions she has made. Bernie said she has bad judgement! _E_\nSo many people don't understand I am a big proponent of vaccines for children—just not in one massive dose—spread them out over time. _E_\nSorry but @piersmorgan is a good & smart man who is doing really well. That's why he won @ApprenticeNBC. _E_\nWow President Obama's brother Malik just announced that he is voting for me. Was probably treated badly by president like everybody else! _E_\nSupporters waiting to hear me speak in Oskaloosa Iowa. #MakeAmericaGreatAgain __HTTP__ _E_\nWith the record high February gas prices hurting the economy even more reason to start fracking. Will create jobs & lower prices. _E_\nMAKE AMERICA GREAT AGAIN! #Trump2016 #VoteTrump __HTTP__ _E_\nCongratulations to @marklevinshow on 'The Liberty Amendments' debuting at #1 on the NY Times' bestseller list. Must read! _E_\nLeaving now for Texas! _E_\nWow! Such a wonderful article from fantastic people my great honor! 
__HTTP__ _E_\nIf you love what you do you are going to work harder you are going to try harder and you will be better at it. Think Big _E_\nTrump Int'l Puerto Rico spreads luxury residences a world class golf resort & beach club across 1000 acres __HTTP__ _E_\nA great job by @RickieFowlerPGA in winning The Players yesterday. Finally your jealous critics can go to hell! Good luck at The U.S. Open. _E_\nThe unemployment numbers released later this week will show no job growth. We must start making our own products again. #TimeToGetTough. _E_\nDummy writer @Clare_OC from failing @Forbes magazine works so hard to make such trivial license deals look important... _E_\nJennifer Aniston is engaged she's a great person and I wish her well. _E_\nWe have all got to come together and win this election. We can't have four more years of Obama (or worse!). _E_\n#TrumpAdvice __HTTP__ _E_\nHappy to announce I am nominating Alex Azar to be the next HHS Secretary. He will be a star for better healthcare and lower drug prices! _E_\nVia @Newsmax_Media: Maher Being Sued by Trump Over Birth Certificate Bet on 'Tonight Show' __HTTP__ _E_\nIn the plane heading to Iowa State Fair. Will be great fun. Hopefully giving helicopter rides to some of the kids. _E_\n.@MannyPacquiao was robbed in his title fight on Saturday night. No wonder boxing is dying. Bring back the 15 round fights. _E_\nGood @FLGovScott is suing the Federal Government so he can protect the voter rolls __HTTP__ Florida must be a legal election. _E_\n#FlashbackFriday Trump family final week of @Oprah's show @Oprah is terrific! __HTTP__ _E_\nI would have had millions of votes more in the primaries (than Crooked Hillary) if I only had one opponent instead of sixteen. Broke record _E_\nWhether you love like or hate Donald Trump I will be on Bill O'Reilly (Fox) tonight at 8.00. Bill knows Trump is great for ratings! _E_\nMelania and I extend our deepest condolences to the family of Shimon Peres... 
__HTTP__ _E_\nNow every time Islamic militants attack they will use that movie as an excuse __HTTP__ What was the excuse before the movie? _E_\nWith imposing dunes on the rugged Aberdeenshire coastline @TrumpScotland's Championship Course is a masterpiece __HTTP__ _E_\nAl Sharpton said they are even making it more harder to register people to vote . Which is worse his grammar or his thoughts? _E_\nReally bad news just announced concerning jobs. Far fewer jobs created in August than anticipated. Interest rates therefore to remain low. _E_\n\"Donald Trump: 'Karl Rove Is A Total Loser' So Why Are People Still Giving Him Money?\" __HTTP__ via @Mediaite _E_\n\"Attitude is a little thing that makes a big difference.\" Winston Churchill _E_\nProud to see my friend Governor Chris Christie standing up for Israel on his visit. Standing tall! _E_\nWow because of the pressure put on by me ICE TO LAUNCH LARGE SCALE DEPORTATION RAIDS. It's about time! _E_\nVoter fraud! Crooked Hillary Clinton even got the questions to a debate and nobody says a word. Can you imagine if I got the questions? _E_\nThe Tax Cuts are so large and so meaningful and yet the Fake News is working overtime to follow the lead of their friends the defeated Dems and only demean. This is truly a case where the results will speak for themselves starting very soon. Jobs Jobs Jobs! _E_\n... while a 300ft turbine in Ardrossan North Ayrshire erupted in flames the previous month during gales of 165 mph __HTTP__ _E_\nA big day for New York and for our COUNTRY! MAKE AMERICA GREAT AGAIN! _E_\nThank you Governor @ScottWalker & @GOP Chairman @Reince Priebus. #MakeAmericaGreatAgain #ImWithYou __HTTP__ _E_\nMore people attend a @JonHuntsman rally than watch @Lawrence on @MSNBCtv all week. @Lawrence is very lonely. (cont) __HTTP__ _E_\n#CelebApprentice what do you think of the choices for project manager? _E_\nWeak and totally conflicted people like @TheRickWilson shouldn't be allowed on television unless given an I.Q. 
test. Dumb as a rock! @CNN _E_\nAs usual the storm of the century was not nearly as bad as forecast. What a waste of time energy and money! _E_\nVia @worldnetdaily by @jerome_corsi: \"Donald Trump: Obama's Jobless Figures 'Phony.' Economists agree.\" __HTTP__ _E_\nI can't believe the Yankees continue to pay A Rod they have a perfect right to stop paying (and should have stopped a long time ago). _E_\n\"The achievements of an organization are the results of the combined effort of each individual.\" – Vince Lombardi _E_\nTRUMP & CLINTON ON IMMIGRATION#Debate #BigLeagueTruth __HTTP__ _E_\n\"Donald Trump on Jeb Bush: 'The last thing we need is another Bush'\" __HTTP__ via @fox5newsdc by @EmilyMiller _E_\nObamaCare has brought skyrocketing premium increases & unaffordable deductibles which will lead to less care & job losses. _E_\n\"Face reality as it is not as it was or as you wish it to be.\" @jack_welch _E_\nSHOCK @BarackObama's people are sending paid political organizers to heckle at @MittRomney events __HTTP__ _E_\nNational Review Online: Kristin Davis's Libertarian 'Tough Love' __HTTP__ _E_\nCongratulations to @bobmcdonnell on leading Virginia to be in the black for a 3rd straight year. He is a fantastic governor. _E_\nWhen will our nation's sacrifices be respectfully appreciated? Iraq and Libya should reimburse us in oil. _E_\n.@Lord_Sugar How did you enjoy Mar a Lago? It was nice having you there my people thought you were terrific! _E_\nI remember when the Apprentice became the number one show on T.V. @tombrokow came up to me and thanked me on behalf of NBC (Yankee Stadium) _E_\nJusr watched #HarveyPitt on @TeamCavuto he was great! _E_\nVia @Hometownlife: Donald Trump to speak at Lincoln Day Dinner at The Showplace in Novi __HTTP__ _E_\nSelfishness ultimately begets only unhappiness. Unselfishness begets happiness. B.C. Forbes _E_\nEnjoying the Olympics. Great coverage by @NBC as well. GO TEAM USA! 
_E_\n.@MarthaRaddatz was so unprofessional and biased when discussing me on This Week. @GStephanopoulos should not allow this conduct! _E_\nREPEAL AND REPLACE OBAMACARE! _E_\nClinton camp fumed when surrogate told supporters Clinton planned to betray labor on TPP post election: __HTTP__ _E_\nAfter decades of lies and scandal Crooked Hillary's corruption is closing in. #DrainTheSwamp! __HTTP__ _E_\nI cannot believe how bad Jeb Bush looks with his insane answer on Iraq and then his numerous corrections which made him look even worse. _E_\nHome of @PGATOUR's @CadillacChamp @TrumpDoral represents all that is Miami: energy glamour innovation & luxury __HTTP__ _E_\nDrew Peterson a real sleaze just convicted of killing wife. Change the law so he gets death penalty. _E_\nSaudi Arabia should fight their own wars which they won't or pay us an absolute fortune to protect them and their great wealth $ trillion! _E_\nLoved doing the debate...won Drudge and all on line polls! Amazing evening moderators did an outstanding job. _E_\nRT @mike_pence: We are heading to Virginia. Looking forward to supporting my friend @EdWGillespie. He will make a great Governor for the Co... _E_\n.@GovernorSununu who couldn't get elected dog catcher in NH forgot to mention my phenomenal biz success rate: 99.2% __HTTP__ _E_\nAlison Grimes supports harsh restrictions to kill coal industry & supports Obama's anti gun legislation. Vote @Team_Mitch! _E_\nA country that Crooked Hillary says has funded ISIS also gave Wild Bill $1 million for his birthday? SO CORRUPT! __HTTP__ _E_\nI'll be appearing on Larry King Live for his final show Thursday night at 9 p.m. CNN. Larry's been on TV for 25 years... _E_\nHillary Clinton's Presidency would be catastrophic forthe future of our country. She is ill fit with bad judgment. _E_\nThe Generals and top military brass never wanted a mixer but were forced to do it by very dumb politicians who wanted to be politically C! 
_E_\n'Clinton Campaign Tried to Limit Damage From Classified Info on Email Server' #DrainTheSwamp __HTTP__ _E_\nBeautiful evening with Religious Leaders here at the WH last night. Join us now for a #NationalDayofPrayer LIVE:... __HTTP__ _E_\nI have brought millions of people into the Republican Party while the Dems are going down. Establishment wants to kill this movement! _E_\n#TBT With Darrell Hammond when I hosted SNL. __HTTP__ _E_\nI watched lightweight Senator Marco Rubio who is all talk and no action defend his WEAK position on illegal immigration. Pathetic! _E_\nRemember get out on November 8th & VOTE #TrumpPence16. It is time to #DrainTheSwamp this is our last chance! __HTTP__ _E_\nThe polling numbers for 2012 are very interesting will Americans ultimately want their leaders to be 'likeable' or 'competent'? _E_\nA real president should take pride in saving and spending your money wisely not funneling it to his cronies (cont) __HTTP__ _E_\n.@HillaryClinton and Obama policies increased debt by $9trillion over the last 8 years _E_\nRT @Scavino45: U.S. MARKETS FROM ELECTION DAY {Since 11/8/2016} 📈 __HTTP__ _E_\nDELUSIONAL Obama actually thought that he won the debate __HTTP__ What is he thinking? _E_\nCongratulations to my friend @RoccoMediate on winning the big golf tournament today! _E_\nIn any business venture remember that branding is one of the most crucial aspects of your enterprise. Fight hard for that brand of yours. _E_\nI endorsed @MittRomney not because I agree with him on every issue but because he will get tough with China. _E_\nRT @EricTrump: I look forward to being on @CNN with @ErinBurnett at 7:40pmET. @realDonaldTrump _E_\n.@SenTedCruz had a very good debate far better than Rand Paul. _E_\nCongratulations to Karen Handel on her big win in Georgia 6th. Fantastic job we are all very proud of you! _E_\nThe @WTA released a new #StrongisBeautiful celebrity campaign today. Amazing athletes. Proud to be a part of this. 
__HTTP__ _E_\nBeyond simple justice and beyond reducing our national debt another advanage of taking the oil is that it (cont) __HTTP__ _E_\nPremiering Jan. 4th the record 14th season's @ApprenticeNBC cast is the nastiest yet __HTTP__ Major Boardroom fireworks! _E_\nFor those asking my son @EricTrump makes zero $$ running his charity & raises a great deal of $$ all of it for @StJude @EricTrumpFdn _E_\nExcited to be speaking at @frankgaffney's @securefreedom Iowa National Security Action Summit tomorrow at 1:30PM! __HTTP__ _E_\nI am leaving for Norfolk Virginia the great battleship U.S.S. Wisconsin for a big rally and really big crowd. See you soon! _E_\n.@Playboy Playmate of the Year @BrandenRoderick returns to the 13th season of All Star @CelebApprentice she is smart & beautiful. _E_\n.@Disney's acquisition of Lucas Film is a smart deal for both sides. Disney just bought a great brand which will keep producing revenue. _E_\nNow Obama has set red line 2 with demand that Assad hands over Syria's chemical weapons or it will face an attack. _E_\nKeep your momentum. Without momentum a lot of great ideas go nowhere. _E_\nRT @EPAScottPruitt: Thoughts and prayers for those in Texas & Louisiana. I am closely monitoring #Harvey developments along with @fema & @E... _E_\nThank you Senator @ChuckGrassley! #TrumpPence16 __HTTP__ _E_\nWind Power Company Fined $1 Million for Killing Birds. Golden eagles among victims... __HTTP__ @RSPBScotland @Natures_Voice _E_\nI am very proud to have brought the subject of illegal immigration back into the discussion. Such a big problem for our country I will solve _E_\nVoters understand that Crooked Hillary's negative ads are not true just like her email lies and her other fraudulent activity. _E_\n\"Statement by President Trump on the Apprehension of Mustafa al Imam for His Alleged Role in Benghazi Attacks\" __HTTP__ _E_\nEntrepreneurs Always remember that every day counts. Stay focused. Stay positive and develop momentum. 
_E_\nRatings for #MissUniverse pageant were highest in 4 years. @NBC likes me (and I like them!) _E_\nStop calling my office to do your show I have more important things to do with my time nobody's watching you! @lawrence _E_\nMy thoughts on last night's Celebrity Apprentice __HTTP__ as well as my latest video blog at __HTTP__ _E_\nRT @TeamTrump: .@timkaine has a pay to play problem just like Crooked @HillaryClinton #VPDebates #BigLeagueTruth __HTTP__ _E_\nKarl Rove's stupid ad made Ashley Judd hot—now everybody is talking about her. _E_\nChina has done great under Obama. Increased private US holdings by 500%. Hacks our military & R&D. Robs us blind daily.#timetogettough _E_\nWhen little Morty Zuckerman closes his failing @NYDailyNews will I at least be given some credit? Will happen soon. _E_\nWhy does Obama believe he shouldn't comply with record releases that his predecessors did of their own volition? Hiding something? _E_\nVia The Washington Times Mr. Trump buzzes the presidential radar __HTTP__ _E_\n.@AlexSalmond Wind turbines are ripping your country apart and killing tourism.Electric bills in Scotland are skyrocketing stop the madness _E_\n. @BarbaraJWalters made a great decision in firing @JoyVBehar from @theviewtv. The show will be better without her! _E_\nRT @foxandfriends: .@DonaldJTrumpJr: Trump has had a lot more responsibility to deal with than any of the other GOP candidates __HTTP__ _E_\nThank you Nicole! __HTTP__ _E_\nThank you Reno Nevada. NOTHING will stop us in our quest to MAKE AMERICA SAFE AND GREAT AGAIN! #AmericaFirst... __HTTP__ _E_\n#TrumpAdvice __HTTP__ _E_\nWow they are really killing Jay Leno let him go out with dignity! _E_\nWill be doing @foxandfriends this morning at 7:00. ENJOY! _E_\nThe newly built Blue Monster at Trump National Doral is being considered a masterpiece by almost all who see it and play it THANK YOU! 
_E_\nCrooked Hillary is flooding the airwaves with false and misleading ads all paid for by her bosses on Wall Street. Media is protecting her! _E_\nThe organized group of people many of them thugs who shut down our First Amendment rights in Chicago have totally energized America! _E_\nFive Star @TrumpCondosLV are the most luxurious & elite residences in the Vegas market __HTTP__ \"If you love it own it\" _E_\nJust returned from New Hampshire where the crowd was great and got a beautiful standing ovation! Wonderful people who truly love the U.S.A. _E_\nFlags to be flown at Half Staff at all Trump Properties in Honor of the Five Fallen Soldiers __HTTP__ _E_\n... and in my opinion should not be doing The Apprentice. _E_\nRT @DarrenJJordan: CONSTRUCTIVE WINS! 💪 @realDonaldTrump @CLewandowski_ @DanScavino @MichaelCohen212 @KatrinaPierson @DefendingtheUSA __HTTP__ _E_\nJeb Bush who did poorly last night in the debate and whose chances of winning are zero just got Graham endorsement. Graham quit at O. _E_\nMonitoring the terrible situation in Florida. Just spoke to Governor Scott. Thoughts and prayers for all. Stay safe! _E_\nA good friend: @SarahPalinUSA. More importantly she is a tremendous voice for policies that would put America on (cont) __HTTP__ _E_\nPeople Magazine: Donald Trump Was Right: He Gave SNL Its Best Ratings in Nearly 4 Years Plus What You Didn't See __HTTP__ _E_\nLiving in denial only 15% of Democrats think that recent economic news is poor __HTTP__ _E_\nFrom Fox and Friends interview: Trump: We should not go back to Iraq __HTTP__ _E_\n.@RepChrisCollins Chris thank you so much for your wonderful endorsement. I will not let you down! @CNN _E_\nRepublican Senators are working very hard to get Tax Cuts and Tax Reform approved. Hopefully it will not be long and they do not want to disappoint the American public! _E_\nThe Debate @BarackObama's mic and my Endorsement in today's #trumpvlog __HTTP__ _E_\nNo more massive injections. 
Tiny children are not horses—one vaccine at a time over time. _E_\nThe most elite private club in the world Mar a Lago is Palm Beach's legendary landmark. __HTTP__ _E_\n\"The Conservative does not despise government. He despises tyranny. @marklevinshow _E_\n.@pennjillette doesn't like @StephenBaldwin7's cliché line and Stephen says Penn creeps him out. Do we sense conflict yet? #CelebApprentice _E_\nJust sit back and watch ObamaCare is such a disaster it will fall like a house of broken cards. The website is the best part of this mess! _E_\n\"No person who is enthusiastic about his work has anything to fear from life.\" – Samuel Goldwyn _E_\nAmerica's top Army general has warned of a crisis unless sexual abuse in the military is quickly brought undet control.Forces greatly hurt! _E_\nI will be on @FoxNewsSunday with Chris Wallace this morning. Enjoy! _E_\nBeautiful thank you. __HTTP__ _E_\nI'm looking forward to seeing you all this afternoon at Macy's Herald Square. 5:30 pm at the Crystal department on 8. _E_\nOrder signed copy of CRIPPLED AMERICA & submit a question for my live streaming book signing on 12/3 at 7:30 pm. __HTTP__ _E_\nFailed presidential candidate Lindsey Graham should respect me. I destroyed his run brought him from 7% to 0% when he got out. Now nasty! _E_\nThe ones who are crazy enough to think that they can change the world are the ones who do. Steve Jobs _E_\nHe's hired! Listen to my #Apprentice Andy launchhis radio show @AmericaNowRadio with me tomorrow 6PM ET __HTTP__ _E_\nVia @NRO:\"Trump @KarlRove 'Most Overrated Man in Politics'Responsible for Ashley Judd's Rise\" __HTTP__ @elianayjohnson _E_\nJust leaving Knoxville TN what a crowd what amazing people! #Trump2016 #MakeAmericaGreatAgain __HTTP__ _E_\nReckless! Why is @BarackObama wasting over $70 Billion on 'climate change activities?' Will he ever learn? 
__HTTP__ _E_\nCelebrity Apprentice on tonight CNBC at 9 _E_\nWhenever one of the morons say I wear a wig stop reading because they have no credibility & just hate. _E_\nFour brave Americans died in Benghazi. Administration is still covering up the truth. We deserve to know the full truth. _E_\nThe NFL should have its non profit status immediately revoked while at the same time ending the giant tax scam which makes teams so valuable _E_\nThe people who support Hillary sit behind CNN anchor chairs or headline fundraisers those disconnected from real life. _E_\n\"Winners see problems as just another way to prove themselves.\" – Think Like a Champion _E_\nThings are going really well for our economy a subject the Fake News spends as little time as possible discussing! Stock Market hit another RECORD HIGH unemployment is now at a 17 year low and companies are coming back into the USA. Really good news and much more to come! _E_\nGuess who is talking to @MissUniverse at @TrumpTowerNY? Not terrible hair! __HTTP__ _E_\nMuch of the money I have raised for our veterans has already been distributed with the rest to go shortly to various other veteran groups. _E_\ncaught he cried like a baby and begged for forgiveness...and now he is judge & jury. He should be the one who is investigated for his acts. _E_\nHere's my message to @BarackObama: America is a capitalistic country. Get over it and get on with it! #TimeToGetTough _E_\nMr. President take your campaign of division and anger and hate back to Chicago. @MittRomney _E_\nThe 2013 MISS UNIVERSE® Pageantwill take place in Russia for the very first time in the 62 year history of the contest. _E_\nA day after @BarackObama released a trillion dollar budget deficit he is hosting China's future leader VP XiJinping. America's new reality. _E_\nI developed the Wollman Rink under budget and in record time __HTTP__ If I hadn't gotten involved it would still be unused. 
_E_\nMy @foxandfriends interview discussing @BarackObama's reckless spending the Buffet Tax gimmick and #CelebApprentice __HTTP__ _E_\nGreat to meet everyone while having breakfast @ChezVachon this morning! #FITN #VoteTrumpNH __HTTP__ __HTTP__ _E_\n.@RobertGBeckel Please thank your brother for his nice words on television. Seems like a great guy and character! @CNN _E_\nRT @DonaldJTrumpJr: Thank you Elko County Nevada. So much amazing feedback from my forum today I really appreciate it #trump2016 #ICYMI ht... _E_\nPeople have been asking to hear my Howard Stern interview—you can access it on @HowardTV. _E_\nI am extremely pleased to see that @CNN has finally been exposed as #FakeNews and garbage journalism. It's about time! _E_\nRT @EricTrump: Please stay safe #Florida! You are in our thoughts and we are praying for you! __HTTP__ _E_\nTune in to see me on @ThisWeekABC with @GStephanopoulos at 10am ET. Enjoy! _E_\nGoing to Charleston South Carolina in order to spend time with Boeing and talk jobs! Look forward to it. _E_\nObama's war on women. \"Number of Unemployed Women Increased in July by 227000\" __HTTP__ _E_\nThe ObamaCare website will cost over $1.5B when all is said and done. Crazy! _E_\nMassive combined inoculations to small children is the cause for big increase in autism.... _E_\nDon't forget to tune in tonight to see another unpredictable and exciting episode of The Apprentice 10 pm on NBC _E_\nGetting ready to go on @KellyandMichael two great people! _E_\n.@chelseahandler—stop trying to get your hotelier boyfriend back—a lost cause—he can do much better! _E_\nOrder signed copy of CRIPPLED AMERICA & have opportunity to submit question for my live streaming book signing 12/3 __HTTP__ _E_\nThank you @AnnCoulter for your nice words. The U.S. is becoming a dumping ground for the world. Pols don't get it. Make America Great Again! _E_\nTremendous pressure on President Obama to institute a travel ban on Ebola stricken West Africa. 
At some point this stubborn dope will fold! _E_\nLow energy Jeb Bush just endorsed a man he truly hates Lyin' Ted Cruz. Honestly I can't blame Jeb in that I drove him into oblivion! _E_\nBig storm in New Hampshire. Moved my event to Monday. Will be there next four days. _E_\nTeam Trump with the recipients of our donations in the Rockaways. #Sandy __HTTP__ _E_\nBoth Barack and @MittRomney were excellent at the Al Smith dinner last night! _E_\nCongrats to Barack Obama on April's job report. Over 800000 left the work force w/average hourly wages & weekly hours staying flat. Bad! _E_\nI'm with YOU. I will work hard and never let you down. Make America Great Again! __HTTP__ _E_\nTo every action there is always opposed an equal reaction. Isaac Newton _E_\nVia @nypost by Editorial Board: \"New York's mute @AGSchneiderman\" __HTTP__ Schneiderman is feckless and corrupt. _E_\nJust leaving Miami for Houston Oklahoma and Colorado. Miami crowd was fantastic! _E_\nObama's China 'climate' deal binds America with language of 'will' curb emissions now while China only 'intends' to curb in 2030. Bad deal! _E_\nAside from having no ratings sleazy Ed Schultz lied about what I said. Thank you Scott Whitlock @ScottJW __HTTP__ _E_\nWill be in Louisiana for the Miss USA Pageant which will be on NBC on Sunday night. Watch Miss Pennsylvaniaan interesting and amazing story _E_\nWe're spending a fortune looking for the lost plane with mostly Chinese passengers and that's OK but how much are Russia & China spending? _E_\nRemain open to new ideas. That's where innovation comes from. _E_\nGoofy Elizabeth Warren lied when she says I want to abolish the Federal Minimum Wage. See media—asking for increase! _E_\nThe documentary of me that @CNN just aired is a total waste of time. I don't even know many of the people who spoke about me. A joke! _E_\nAsk yourself: What can I learn today that I didn't know before? Always be a student always be open to new ideas. 
_E_\nPeople buy deals & immediately put them into bankruptcy in order to make better deals.. _E_\nTo every PATRIOT who will serve on the #USSGeraldRFord:Keep the watchProtect herDefend herLOVE HERGood Luck & Godspeed! __HTTP__ _E_\nVia @StarsEntLive by Nick Ricko: \"@kevinjonas @IanZiering In Celebrity @ApprenticeNBC First Look\" __HTTP__ _E_\nNewly released NH poll has @MittRomney with a 1 point lead. Mitt will pull away next week. _E_\nRe Florida Power & Light—Most important is safety but they have to also cater to aesthetics & not ruin the beauty of Florida. _E_\nI will be re tweeting some of your better most imaginative and hopefully insightful tweets. Make them good (great)! Important stuff. _E_\nTomorrow we'll be going to Panama for the opening of our new hotel. It's a fantastic building in a fantastic location. __HTTP__ _E_\nIran hides behind its assertion of technical compliance w/the nuclear deal while it brazenly violates the other limits.. Amb. @NikkiHaley __HTTP__ _E_\n\"Failed show @DannyZuker\" I have never heard of you and was told you are a loser after reading your credits I have no questions about it! _E_\nI know a great deal about websites etc. but I am unable to understand how our government spent $635 million on the ObamaCare site & disaster _E_\nAnother nasty season premieres Sunday March 3rd at 9/8c on NBC! __HTTP__ _E_\nNew Virginia poll thank you! We are going to show the whole world that America is back – BIGGER and BETTER and S... __HTTP__ _E_\nA study says @Autism is out of control a 78% increase in 10 years. Stop giving monstrous combined vaccinations (cont) __HTTP__ _E_\nGreat job @EricTrump! Proud of you! #AmericaFirst #RNCinCLE __HTTP__ __HTTP__ _E_\n\"@TurnberryBuzz the jewel in Donald Trump golfing crown\" __HTTP__ via @TheScotsman by @DempsterMartin _E_\nWow tremendous victory in the Trump University case against lightweight @AGSchneiderman just got the news! 
_E_\nCasting sometimes is fate and destiny more than skill and talent from a director's point of view. Steven Spielberg _E_\nSo China is ordering us to raise the Debt Limit...How low have we as a nation sunk? _E_\nWe must bring the truth directly to hard working Americans who want to take our country back. #BigLeagueTruth... __HTTP__ _E_\nAs I have been saying Crooked Hillary will approve the job killing TPP after the election despite her statements to the contrary: top adv. _E_\nOmarosa is very confident that the execs loved her concept & presentation. _E_\nLooking forward to returning to the Hawkeye state this Saturday to support my friend and strong Conservative @SteveKingIA! _E_\nHeading to U.S. Bank Arena in Cincinnati Ohio for a 7pm rally. Join me! Tickets: __HTTP__ _E_\nVia @WPOffshore: \"Donald Trump's Blackdog victory\" __HTTP__ _E_\nToday it was an honor to have @UNSecretary General @AntonioGuterres at the @WhiteHouse. Speaking for the U.S.A. we appreciate all you do! __HTTP__ _E_\nEntrepreneurs: Review your work habits and make sure they are taking you in the right direction. Don't become complacent! _E_\nThe United States will be immediately implementing much tougher Extreme Vetting Procedures. The safety of our citizens comes first! _E_\nThank you! #MakeAmericaGreatAgain __HTTP__ _E_\nIt was my great honor to defend @dennisrodman on @ApprenticeNBC last night—he has come a long way and for the good! _E_\nVia @RedState by @EWErickson: \"Always Play On Offense\" __HTTP__ _E_\nMy speech to @PressClubDC on Tuesday at the #NPCLunch on the topic of building a business brand via @cspan __HTTP__ _E_\nIn 2008 @BarackObama warned that electricity rates will necessarily skyrocket during his term. Mission Accomplished! __HTTP__ _E_\nNo surprise. Woman being cited by Kerry & McCain on Syrian rebels is a paid consultant of the rebels __HTTP__ _E_\n.@MittRomney needs to make @BarackObama regret that he ever asked for his tax records. 
_E_\nLittle @MacMiller—I don't need your praise __HTTP__ just pay me the money you owe. _E_\nGreat meeting @GarySinise at @AmSpec dinner. Besides his great acting Gary does tremendous work for vets through his foundation. _E_\nJoin me Tuesday Nov. 3rd at 12 PM in #TrumpTower in NYC. I'll be signing copies of my book CRIPPLED AMERICA. Don't miss it! _E_\nI'll be in Iowa tonight making a speech to a record setting crowd. The word is getting out MAKE AMERICA GREAT AGAIN! _E_\nTune in tonight to Greta van Susteren's show On the Record which airs on Fox News at 9 p.m. _E_\nThe networks are all driving me crazy to do television shows—\"a ratings machine\"—but because of Apprentice have been loyal to NBC. _E_\nRT @realDonaldTrump: I as President want people coming into our Country who are going to help us become strong and great again people co... _E_\n.@PamelaGeller is a total whack job who doesn't have a clue. Don't provoke the enemy go get them and make them pay. No signals just do it! _E_\nIvanka Trump defends her dad __HTTP__ via @politico _E_\nHe ruins the brand: @bobbeckel doesn't belong on @FoxNews. As CM for Mondale in '84 you lost 49 states. Sad! _E_\nFM @AlexSalmond of Scotland spent more than $750000 of taxpayers $ to visit Ryder Cup in Chicago peanuts compared to his windmill folly. _E_\nPhyllis Schlafly: Trump is 'last hope for America' __HTTP__ __HTTP__ _E_\nThe #MissUniverse women totally blow away the Victoria's Secret women! _E_\nNo @DannyZuker it's making you crazy because you don't have the guts to play the game. Come on Danny you can do it! _E_\nEven Bill is tired of the lies SAD! __HTTP__ _E_\nAn honor to host President Mahmoud Abbas at the WH today. Hopefully something terrific could come out it between th... __HTTP__ _E_\nBased on very popular demand I will be live tweeting tomorrow night during the Presidential debate. _E_\nMiss Florida was great in her denial of Miss Pennsylvania's phoney statements. 
She blows Miss Pennsylvania away a different league . _E_\nCongratulations to @tedcruz on his Texas primary victory last night. He will be an outstanding Senator. _E_\nJust remember the birther movement was started by Hillary Clinton in 2008. She was all in! _E_\nShooting deaths of police officers up 78% this year. We must restore law and order and protect our great law enforcement officers! _E_\nWe pray for our fallen heroes who died while serving our country in the @USNavy aboard the #USSJohnSMcCain and their families. __HTTP__ _E_\nIt's not whether you get knocked down it's whether you get up. Vince Lombardi _E_\nThe premiere of Donald J. Trump's Fabulous World of Golf is tomorrow night at 9 p.m.ET on Golf Channel. Tune in for a great adventure! _E_\nObamaCare website fiasco was a SINGLE bid to a Canadian company terrible! _E_\nLook what is happening to our country under the WEAK leadership of Obama and people like Crooked Hillary Clinton. We are a divided nation! _E_\nAll he does is go on television is talk talk talk but incapable of doing anything. _E_\nIf Justice Roberts had made the correct decision on ObamaCare our country would not be in turmoil right now! _E_\nI will be in South Carolina all week. Saturday is BIG BIG BIG! Get out and vote MAKE AMERICA GREAT AGAIN _E_\nEither Miss Pennsylvania will pay her father will pay or her lawyers will pay. She hurt many people! _E_\nVia @BreitbartNews by @IanHanchett: \"Trump: Obama 'Treats Our Known Enemies Much Better' Than Israel\" __HTTP__ _E_\nVia @Mediaite by forza_desiderio: \"Donald Trump Blasts Obama on Ebola: Why Are You Sending Troops?\" __HTTP__ _E_\nFun fact for my 2M+ followers the 'Architect' Karl Rove blew $400M in the 2012 election with a success rate of 1.6%. _E_\nThank you for a great day yesterday Rhode Island! #VoteTrump __HTTP__ _E_\nWill be interviewed on @seanhannity tonight at 10pmE. Enjoy! 
#INPrimary _E_\nThe residential real estate market continues to provide opportunities for first time home owners. Buy now if you can! _E_\n#CrookedHillary is not fit to be our next president! #TrumpPence16 __HTTP__ _E_\nJust bought the Kluge Estate in Charlottesville Virginia (don't worry only business). See Washington Post article __HTTP__ _E_\nCongratulations to Michael Jordan on his marriage over the weekend. _E_\nRT @EricTrump: Debate ready!!! @realDonaldTrump #MakeAmericaGreatAgain #TrumpTrain __HTTP__ _E_\nA big contingent of very enthusiastic Roy Moore fans at the rally last night. We can't have a Pelosi/Schumer Liberal Democrat Jones in that important Alabama Senate seat. Need your vote to Make America Great Again! Jones will always vote against what we must do for our Country. _E_\nWatch Obama refuse to call Benghazi a terrorist attack on 9.12 __HTTP__ What took @CBS so long to release this footage? _E_\nLife brings you many surprises. As a child I used to vacation with my family at the Doral in Miami. Now I own it. __HTTP__ _E_\nTrump: Weiner a 'Sick Puppy' That NYC Doesn't Need __HTTP__ via @Newsmax_Media _E_\nThank you! __HTTP__ _E_\nThe Cruz Kasich pact is under great strain. This joke of a deal is falling apart not being honored and almost dead. Very dumb! _E_\nThe CBO has confirmed that @BarackObama's stimulus crowds out private investment while not creating any jobs. __HTTP__ _E_\nMy @FoxNews int. with @seanhannity on Obama being all talk & no action & making America Great Again! __HTTP__ _E_\nWe need a dealmaker in the White House who knows how to think innovatively and make smart (cont) __HTTP__ _E_\nCrooked Hillary Clinton is the worst (and biggest) loser of all time. She just can't stop which is so good for the Republican Party. Hillary get on with your life and give it another try in three years! _E_\nDAMAC Properties @DamacOfficial @realDonaldTrump Looking forward to welcoming you to Dubai! Have a great trip! Thank you! 
_E_\nRT @FoxNews: Poll: @realDonaldTrump vs. @HillaryClinton among white Evangelicals. __HTTP__ _E_\nIn today's #trumpvlog I speak about the chopper recently made for me by @occhoppers.... __HTTP__ #CelebApprentice _E_\nWatch @Seanhannity tonight on his show Hannity Fox News at 9 pm. I'll be on and we'll cover the Wall Stree... (cont) __HTTP__ _E_\nJoin us in Iowa tomorrow! #IACaucus #Trump2016 #MakeAmericaGreatAgain 3:00pm: __HTTP__ 7:30pm: __HTTP__ _E_\nI will be live tweeting during the debate tonight. _E_\nI have an idea for @JebBush whose campaign is a disaster. Try using your last name & don't be ashamed of it! _E_\nHow long did it take for Obama to call Hugo Chavez and congratulate him on his 'reelection?' Who do you think Chavez supports in ours? _E_\nIf you have a speech one that would put Winston Churchill to shame liberals would find a way to make it sound terrible! _E_\nCongratulations @Trump_Ireland for being named #12 resort in Europe by the @CNTraveler #ReadersChoice2014 awards! _E_\nI will be on @FoxNews live with members of my family at 11:50 P.M. We will ring in the New Year together! MAKE AMERICA GREAT AGAIN! _E_\nHappy to hear that @ralphreed's Faith and Freedom chapters are at the @RNC convention supporting @MittRomney. We must be united to win! _E_\nI am reading that the great border WALL will cost more than the government originally thought but I have not gotten involved in the..... _E_\nThe top Leadership and Investigators of the FBI and the Justice Department have politicized the sacred investigative process in favor of Democrats and against Republicans something which would have been unthinkable just a short time ago. Rank & File are great people! _E_\n.@FLOTUS Melania and I were honored to stop by the Women's Empowerment Panel this afternoon at the @WhiteHouse.... __HTTP__ _E_\nI'm looking forward to the Super Bowl but looking even more forward to Monday night at 8:00 best episode EVER of Celebrity Apprentice! 
_E_\nA top Clinton Foundation official said he could name \"500 different examples\" of conflicts of interest. __HTTP__ _E_\nAfter @TrumpTurnberry I will be visiting Aberdeen the oil capital of Europe to see my great club @TrumpScotland. _E_\nAll eyes are on @TigerWoods @The_Masters. He's in good position! _E_\nEllen is sadly having a hard time with her lines. #Oscars _E_\nThanks to ObamaCare's device tax Boston Scientific plans to cut 1500 jobs __HTTP__ ObamaCare will kill ingenuity. _E_\nJeb Bush has a photoshopped photo for an ad which gives him a black left hand and much different looking body. Jeb just can't get it right! _E_\nNew CBS poll. #Trump2016 __HTTP__ _E_\nThis has been a very difficult decision regarding the Presidential run and I want to thank all my twitter fans for your fantastic support. _E_\nTom Brokaw keeps calling Mitt Romney George (Mitt's father). Sadly time is up for Tom. _E_\n.@AnnCoulter U were great last nite @ericbolling on FOX. Our country has become a dumping ground for the world I'll get it to stop & fast! _E_\nThe mind that opens to a new idea never comes back to its original size. Einstein _E_\nA great honor to sign the Veterans Appeals Improvement & Modernization Act into law w/ @AmericanLegion @SecShulkin. __HTTP__ __HTTP__ _E_\nThe real estate market in Vietnam is booming. Growth is everywhere in the world except for the US. _E_\nRT @FoxNews: .@davidwebbshow: Let's look at the calendar. It's January 20th. DACA expires on March 5th. That means this was a construct of... _E_\nThank you Ohio. Together we will MAKE AMERICA GREAT AGAIN! __HTTP__ __HTTP__ _E_\nTrump's National Lead Increases to 35.6% Going into the Third GOP Debate it's Trump Carson and Rubio __HTTP__ ... _E_\nI predicted the 9/11 attack on America in my book The America We Deserve and the collapse of Iraq in @TimeToGetTough. _E_\nThanks @piersmorgan! Trump is the most unpredictable extraordinary entertaining&massively popular candidate this country has ever seen. 
_E_\n#MayThe4thBeWithYou here is when Darth Vader and I did some firing __HTTP__ _E_\nI would invite Edward Snowden to be a judge at the Miss Universe Pageant in Moscow but would be concerned that he would sell results early! _E_\nThe coolest story is that John Beale the man who headed up CLIMATE CHANGE for the government is a proven con man and total phoney.ARRESTED _E_\nGreat move on delay (by V. Putin) I always knew he was very smart! _E_\nVia The Hindu @businessline: Realty brand Donald Trump's India venture to sport desi tag __HTTP__ _E_\nJeffrey Lord former Reagan adviser has endorsed the Newsmax @iontv debate with a great article __HTTP__ _E_\nWe will immediately repeal and replace ObamaCare and nobody can do that like me. We will save $'s and have much better healthcare! _E_\nCamp David is a very special place. An honor to have spent the weekend there. Military runs it so well and are so proud of what they do! _E_\nSo proud of NASCAR and its supporters and fans. They won't put up with disrespecting our Country or our Flag they said it loud and clear! _E_\nDopey Sugar @Lord_Sugar You should thank me for having created the platform on which you became known The Apprentice. Say Thank you Donald _E_\nChina is heavily investing in building its own jet engine __HTTP__ They will end up stealing the design from us as usual. _E_\n'How Trump won over a bar full of undecideds and Democrats' __HTTP__ _E_\nIf @Barack Obama is really concerned about carbon emissions and air pollution then maybe he should have (cont) __HTTP__ _E_\nI am starting to think that there is something seriously wrong with President Obama's mental health. Why won't he stop the flights. Psycho! _E_\nJust out: Boston Herald/Franklin Pierce Poll N.H. TRUMP 28 (up 10) CARSON 16 BUSH 9 RUBIO 6 CRUZ 5 Press will say they are surging! _E_\nCrooked Hillary has once again been proven to be a person who is dishonest incompetent and of very bad judgement. _E_\n...and did not want to rock the boat. 
He didn't choke he colluded or obstructed and it did the Dems and Crooked Hillary no good. _E_\nAmericans nationwide have their premiums double and work hours decreased. @GOP must do the right thing stand strong & defund! _E_\nIt's Thursday. Which brand of eyeliner is the nation's worst AG @AGSchneiderman wearing today? _E_\n.@Deadspin's disgusting response will teach me & others not to be nice anymore—a sad lesson. _E_\nFeeling sorry for yourself is not only a waste of energy but the worst habit you could possibly have. Dale Carnegie _E_\nVia @scotsmandotcom: \"Donald Trump hires top lawyer for wind farm battle\" __HTTP__ _E_\nSweat equity is the most valuable equity there is. Know your business and industry better than anyone else in the world. @mcuban _E_\nIf Chicago doesn't fix the horrible carnage going on 228 shootings in 2017 with 42 killings (up 24% from 2016) I will send in the Feds! _E_\n\"Being true to yourself and your work is an asset. Remember that assets are worth protecting.\" – Think Like a Champion _E_\nAnyone who wants strong borders and good trade deals for the US should boycott @Univision. _E_\nEntrepreneurs: Set the example and you'll be a magnet for the right people. That's the best way to work with people you like. _E_\nThank you NH! We will end illegal immigration stop the drugs deport all criminal aliens&save American lives! Watc... __HTTP__ _E_\nDo you think Iran would have acted so tough if they were Russian sailors? Our country was humiliated. _E_\nIt is so pathetic that the Dems have still not approved my full Cabinet. _E_\nNew Sugar deal negotiated with Mexico is a very good one for both Mexico and the U.S. Had no deal for many years which hurt U.S. badly. _E_\nGoofy Elizabeth Warren is weak and ineffective. Does nothing. All talk no action maybe her Native American name? _E_\n.@ABFAlecBaldwin They were rising in the 1950's then went back down they will go up and down through eternity. 
_E_\n.@JebBush is a low energy stiff who should focus his special interest money on the many people ahead of him in the polls. Has no chance! _E_\nA record 1.2 million Americans have left the job force during @BarackObama's recovery __HTTP__ Don't trust the job numbers. _E_\nPretty even debate no knockouts. However Ryan's closing statement somewhat stronger. What do you think? #VPDebate _E_\n\"Real estate is at the core of almost every business and it's certainly at the core of most people's wealth.\" – Think Like a Billionaire _E_\nBack by popular demand the fabulous @LilJon returns to the record setting 13th season of All Star @CelebApprentice. The fans love him! _E_\nRe @TWC TimeWarner I am going to be switching many of my buildings to another service—this is ridiculous! _E_\n\"Success breeds success. The best way to impress people is through results.\" – Think Like a Billionaire _E_\n#trumpvlog The Republicans must defeat @BarackObama not themselves..... __HTTP__ _E_\nIf you want more you have to require more from yourself. Dr. Phil McGraw _E_\n20000📈21000📈22000📈23000📈this year...FOUR one thousand milestones this year... #Dow23K #MAGA __HTTP__ _E_\nSo much for Hope and Change. @BarackObama has already spent over $100M on attack ads across the swing states __HTTP__ _E_\nGlad to hear that @JimTalent has put some strong anti China referendums in the @GOP convention platform. _E_\nSexual pervert & deviant Anthony Weiner is polling to see if he can run for NYC Mayor... _E_\n\"To state the obvious if any business operated the way the government does it would go under.\" #TimeToGetTough _E_\n#MakeAmericaGreatAgain #Trump2016LIFE CHANGING EXPERIENCEVideo: __HTTP__ __HTTP__ _E_\nOn November 9th @MissUniverse comes to Moscow! Hosted by the wonderful duo of @OfficialMelB & @ThomasARoberts in Crocus City Hall! _E_\nI guess Obama's Cairo Speech really worked out. The Muslim Brotherhood stormed our embassy on 9.11. Imagine if Obama speaks in Beijing? 
_E_\n.@daveweigel of the Washington Post just admitted that his picture was a FAKE (fraud?) showing an almost empty arena last night for my speech in Pensacola when in fact he knew the arena was packed (as shown also on T.V.). FAKE NEWS he should be fired. _E_\nREMEMBER the terrible 5 for 1 trade whereby the Taliban got back leaders (killers) and we got back a NOTHING WILL COME BACK TO HAUNT U.S.! _E_\nI have always liked Ellen done her show numerous times but she was not good last night fumbling and stumbling! _E_\nIn new Quinnipiac Poll 66% of people feel the economy is \"Excellent or Good.\" That is the highest number ever recorded by this poll. _E_\nChina has copied our military's F 22 Raptor design __HTTP__ We should offset their theft from our debt. _E_\nOfficials behind the now discredited Dossier plead the Fifth. Justice Department and/or FBI should immediately release who paid for it. _E_\nGreat everyone is saying I did much better on @60Minutes last week than President Obama did tonight. I agree! _E_\nMedian household income is down for the middle class since Obama took office. It will only go further down under Clinton. _E_\nSometimes your best investments are the ones you don't make. The Art of The Deal _E_\nCongratulations to Bob Kraft and Coach Bill Belichick for having built an amazing team. @Patriots _E_\n\"You cannot escape the responsibility of tomorrow by evading it today.\" – Pres. Abraham Lincoln _E_\nBoring & failing @NYMag's 3rd rate political reporter @jheil had flunky @DanAmira write a totally false report about me today...... _E_\nMy message MAKE AMERICA GREAT AGAIN is beginning to take hold. Bring back our jobs strengthen our military and borders help our VETS! _E_\nWhat do you think Obama will do when Putin seizes Alaska? _E_\nBig day in Washington D.C. even though White House & Oval Office are being renovated. Great trade deals coming for American workers! 
_E_\nVia @WTOC11: Donald Trump headlines Tea Party Convention in Myrtle Beach __HTTP__ Looking forward to visiting SC on Monday! _E_\nStudy your area of business. All business involves risk but risk can be reduced when you learn everything you can about what you're doing. _E_\nMy @foxandfriends interview re: firing @bretmichaels on the premiere of All Star @ApprenticeNBC & politics __HTTP__ _E_\nRT @IvankaTrump: .@realDonaldTrump stock market rally is close to becoming the greatest in 85 years __HTTP__ _E_\nTrump at CPAC: 'We Have to Get the Momentum Back' __HTTP__ via @WSJ's @WSJVideo _E_\nI bet the dumbest political commentator on television @Lawrence will soon be thrown off the air for poor (cont) __HTTP__ _E_\nTonight's episode of The Apprentice is one of the best ever we're down to the final 3 and it's high excitement all the way. 10 pm on NBC. _E_\n\"Hook your career to a big trend. There are huge opportunities for profits if you can create big solutions.\" – Think Big _E_\nI call my own shots largely based on an accumulation of data and everyone knows it. Some FAKE NEWS media in order to marginalize lies! _E_\nI have recieved and taken calls from many foreign leaders despite what the failing @nytimes said. Russia U.K. China Saudi Arabia Japan _E_\nChina's stock market rose yesterday after 4 consecutive days of losses __HTTP__ Their market gains the day we are hit by storm _E_\nIt's Tuesday how much inflation has @BarackObama's spending caused today on the price of food and gas? _E_\nAl Qaeda terrorist Al Libi was immediately read his rights & is now being treated for 'pre existing' medical (cont) __HTTP__ _E_\n19000 RESPECTING our National Anthem! #StandForOurAnthem __HTTP__ _E_\nMy twitter followers will soon be over 2 million & all the biggies. It's like having your own newspaper. _E_\nTrump International in Dubai will be one of the great projects anywhere in the world. Congratulations to @damacofficial for their genius! 
_E_\nChina just landed a jet on an aircraft carrier stolen from a U.S. design. __HTTP__ We should offset the thievery from our debt.. _E_\nLooking at Air Force One @ MIA. Why is he campaigning instead of creating jobs & fixing Obamacare? Get back to work for the American people! _E_\nSome people dream of great accomplishments while others stay awake and do them! _E_\nEntrepreneurs: Do not go where the path may lead go instead where there is no path and leave a trail. Ralph Waldo Emerson _E_\nIt is outrageous and disgusting that families of U.S. MILITARY personnel killed in action will not be given money for burials. SAD! _E_\nLittle respected Club For Growth asked me for $1000000 I said NO . Now they are spending lobbyist and special interest money on ads! _E_\nDon't go around saying the world owes you a living. The world owes you nothing. It was here first. Mark Twain _E_\nGetting ready to celebrate the 4th of July with a big crowd at the White House. Happy 4th to everyone. Our country will grow and prosper! _E_\nJust watching NBC News where our potential attack is being detailed the exact ships the stealth bombers the destinations so ridiculous! _E_\nStrive for wholeness and keep your sense of wonder intact. Donald J. Trump __HTTP__ _E_\nTrump Int'l Golf Links & Hotel Ireland is on 400 beautiful acres & fronts the Atlantic Ocean for 2.5 miles. Spectac! __HTTP__ _E_\nToday I was pleased to announce the official approval of the presidential permit for the #KeystonePipeline. A grea... __HTTP__ _E_\nI am watching two clown announcers on @FoxNews as they try to build up failed presidential candidate #LittleMarco. Fox News is in the bag! _E_\nYesterday I signed the #INTERDICTAct (H.R. 2142) with bipartisan members of Congress to help end the flow of drugs into our country. Together we are committed to doing everything we can to combat the deadly scourge of drug addiction and overdose in the United States! 
__HTTP__ _E_\nI support K 9's for Warriors a wonderful organization that trains service dogs for veterans. Please contact __HTTP__ _E_\nFor Entrepreneurs: A good question to ask yourself –\"What can I provide that does not yet exist?\" _E_\n.@jessebwatters is terrific at hosting on @FoxNews he really gets it! _E_\nMr. Khan who does not know me viciously attacked me from the stage of the DNC and is now all over T.V. doing the same Nice! _E_\nMs. Goldberg & her blowhard lawyer should be ashamed for having brought this frivolous case. They should pay me damages! _E_\nEnglish taxpayers should stop subsidizing the destruction of Scotland by paying massive subsidies for ugly wind turbines. _E_\nSpoke to a capacity crowd at Horry County Republican event earlier today. __HTTP__ _E_\nDonald Trump Reviews Oscars: Django 'Racist' Ceremony 'Boring' Set 'Tacky'... __HTTP__ via @eonline _E_\nDemocrats are not interested in Border Safety & Security or in the funding and rebuilding of our Military. They are only interested in Obstruction! _E_\nLord grant that I may always desire more than I can accomplish. Michelangelo _E_\nVia @AmericanThinker by Malcolm Unwell: \"Taking Trump Seriously\" __HTTP__ _E_\nHow far has the United States gone down when we are reduced to accept the imbecilic deal just agreed to with Iran. Read THE ART OF THE DEAL! _E_\nHuff Post His early morning speech drew a large crowd far larger than remarks at the same time on Thursday and packed by end! The facts. _E_\nI can't believe my friend Derek Jeter is out for whole season injured day he left Trump World Tower. Lucky bldg. Move back fast! _E_\nMake sure to watch Celebrity Apprentice tonight at 9 on NBC. A GREAT SHOW JUST LIKE THE MASTERS. 9 _E_\nSo how did I do on Face The Nation? _E_\nSo sad that Obama rejected Keystone Pipeline. Thousands of jobs good for the environment no downside! _E_\nThank you! 
#Trump2016 __HTTP__ _E_\nNot one American flag on the massive stage at the Democratic National Convention until people started complaining then a small one. Pathetic _E_\nOur great country has been divided for decades. Sometimes you need protest in order to heal & we will heal & be stronger than ever before! _E_\nI'll always like @OMAROSA because she constantly defends me. #CelebApprentice _E_\n...and now Alex Salmond pushes ugly turbines! _E_\nI will be doing the @TodayShow live from New Hampshire at 7am on Monday morning. #TrumpToday _E_\n.@garyplayer As a true champion you must have enjoyed how difficult but fair The Blue Monster played last weekend. Gary Player Villa loved! _E_\n\"Inside Donald Trump's Scottish golf course\" __HTTP__ via @TelegraphSport _E_\nMy @greta int. discussing $25000 gift to USMC Tahmooressi Obama's trip to China & the 2014 election results __HTTP__ _E_\n.@newtgingrich just said a historic victory for Trump. NICE! _E_\nKeep stimulating your mind with big ideas. Be a collector of big ideas. Constantly fill your mind with new information. Think Big _E_\nCongratulations to @RobinRoberts on celebrating 100 days in her bone marrow transplant recovery. Robin is a special person. _E_\n.@stuartpstevens horrible advise to Mitt Romney made victory an impossibility. Don't blame Mitt! Now Stevens can't get a job! _E_\nNever in U.S.history has anyone lied or defrauded voters like Senator Richard Blumenthal. He told stories about his Vietnam battles and.... _E_\nThe Democratic Convention has paid ZERO respect to the great police and law enforcement professionals of our country. No recognition SAD! _E_\n.@BarackObama should release all his records (like other Presidents).... 
_E_\nThe first General killed in a combat zone since Vietnam it is a travesty that Obama did not attend Major General Harold Greene's funeral _E_\nThank you Terre Haute Indiana!#MakeAmericaGreatAgain __HTTP__ _E_\nCarl Icahn said this about me: I think at this moment in time he's the only candidate that speaks out about the country's problems. _E_\nIndividual commitment to a group effort is what makes a team work a company work a society work a civilization work. Vince Lombardi _E_\nMust read via @FoxNews by @JaySekulow: \"Mr. President: Will you bring home American pastor imprisoned in Iran?\" __HTTP__ _E_\nVia @Newsmax_Media by @ChrisRuddyNMX: Donald Trump and the End of Free Speech __HTTP__ _E_\n.@melaniatrump will be on @theviewtv today at 11am ET discussing @apprenticenbc #celebapprentice & her skin care collection. Tune in! _E_\n4.2 million hard working Americans have already received a large Bonus and/or Pay Increase because of our recently Passed Tax Cut & Jobs Bill....and it will only get better! We are far ahead of schedule. _E_\nPutting Pelosi/Schumer Liberal Puppet Jones into office in Alabama would hurt our great Republican Agenda of low on taxes tough on crime strong on military and borders...& so much more. Look at your 401 k's since Election. Highest Stock Market EVER! Jobs are roaring back! _E_\nIn any event we are EXTREME VETTING people coming into the U.S. in order to help keep our country safe. The courts are slow and political! _E_\nWhy do people give @KarlRove contributions when they know he is a loser who has no idea how to win? __HTTP__ _E_\nJoin me tomorrow! #Trump2016 #MakeAmericaGreatAgain Omaha Nebraska: __HTTP__ Oregon: __HTTP__ _E_\nEntrepreneurs: Follow your instincts and keep your focus intact. You alone know where you really want to go. _E_\n\"@marklevinshow: 'PLUNDER AND DECEIT'\" __HTTP__ via @AmSpec by @JeffJlpa1 _E_\nLightweight A.G. 
Eric Schneiderman asked us for political contributions DURING his investigation of usthen sued for $40 million.Dopey guy! _E_\nThe @Yankees acquisition of Ichiro was a smart move. I look forward to watching him play. _E_\nEveryone is asking if and when I will endorse a candidate in the NYC mayoral race. Doing my due diligence... _E_\nThe brass in #TRUMP Tower's atrium is polished twice a month like clockwork. I keep the atrium impeccable. Key to its success! _E_\nVision remains vision until you focus do the work and bring it down to earth where it will do some good. _E_\nNorth Korea disrespected the wishes of China & its highly respected President when it launched though unsuccessfully a missile today. Bad! _E_\nThis election is a total sham and a travesty. We are not a democracy! _E_\nThoughts and prayers with the sailors of USS Fitzgerald and their families. Thank you to our Japanese allies for th... __HTTP__ _E_\nI'm always amazed when I travel to my foreign properties.Seeing the Trump brand across 4 continents proves that excellence can be universal. _E_\nNO WAY JUDGES SAY MAYWEATHER WON. INVESTIGATION SHOULD TAKE PLACE. FIX? _E_\nToday it was an honor to celebrate the Collegiate National Champions of 2016/2017 at the @WhiteHouse! #NCAAChampions Photos: __HTTP__ __HTTP__ _E_\nThanks to Giovanni's Coal Fire Pizza of Florida for donating enough pizza to feed 750 Police Athletic League youngsters in NY this Friday. _E_\nMy interview yesterday with @IngrahamAngle __HTTP__ _E_\n\"Trump: 'Seriously Considering' a Presidential Bid\" __HTTP__ via @NBCNews _E_\nIf the press can report stories from @MittRomney's dorm years then why can't it find @BarackObama's college and law school transcripts? _E_\nWith terrific Steve Wynn at dinner last night. __HTTP__ _E_\nVia @Newsmax_Media: Trump: Americans 'Desperate for Leadership' __HTTP__ _E_\nIf everybody sued the Journal News for revealing their info (guns) paper would go out of business. 
_E_\nIf the Saudis are so concerned about Syria then they should go in themselves. Stop telling us to do their dirty work. _E_\n\"Every big thinker has had to start as a nobody. Just think big & that immediately distinguishes you from the majority.\" – Think Big _E_\nEst. in 1906 @TrumpTurnberry is home to the iconic Ailsa @The_Open Championship course four times over __HTTP__ _E_\nWhy isn't AG Schneiderman going after Democrat Jon Corzine and the $1.4 billion that is \"missing?\" _E_\nThese Islamists chop Americans' heads off and want to destroy us. We should be applauding the CIA not persecuting them. _E_\nThank you Texas! If you haven't registered to VOTE today is your last day. Go to: __HTTP__ & get ou... __HTTP__ _E_\nThe girlfriend of Lubitz the wacko co pilot who took down the plane knew he was insane and should have reported him. Put her through hell _E_\nTrump Golf Links at Ferry Point is a Jack Nicklaus Signature Design 18 hole course just minutes from Manhattan __HTTP__ _E_\nIf United Steelworkers 1999 was any good they would have kept those jobs in Indiana. Spend more time working less time talking. Reduce dues _E_\n.@RNC report was written by the ruling class of consultants who blew the election. Short on ideas. Just giving excuses to donors. _E_\nThe people of Scotland are really starting to fight the ugly industrial wind turbines. See Press and Journal __HTTP__ _E_\nGIVE AMERICA BACK ITS DREAM! Donald J. Trump _E_\n\"MSNBC'S TOURÉ HAS EPIC RACE BAITING MELTDOWN ON CNN\" __HTTP__ It's Toure's modus operandi. He is so angry. _E_\nSee yourself as having a lot already and keep your integrity intact. It's the best path to comprehensive success. Think Like a Champion _E_\nLove making correct predictions. National Review is over. __HTTP__ _E_\nMy nomination would increase voter turnout. 
#VoteTrump #MakeAmericaGreatAgain #Trump2016 __HTTP__ _E_\nCongress was elected last November to reign Obama in not to give him 'fast track' authority for bad trade deals for the American worker! _E_\n"
  },
  {
    "path": "assignments/word_transform/common.en.vocab",
    "content": ",\n.\nthe\n</s>\nof\n-\nin\nand\n'\n)\n(\nto\na\nis\nwas\non\ns\nfor\nas\nby\nthat\nit\nwith\nfrom\nat\nhe\nthis\nbe\ni\nan\nutc\nhis\nnot\n–\nare\nor\ntalk\nwhich\nalso\nhas\nwere\nbut\nhave\n#\none\nrd\nnew\nfirst\npage\nno\nyou\nthey\nhad\narticle\nt\nwho\n?\nall\ntheir\nthere\nbeen\nmade\nits\npeople\nmay\nafter\n%\nother\nshould\ntwo\nscore\nher\ncan\nwould\nmore\nif\nshe\nabout\nwhen\ntime\nteam\namerican\nsuch\nth\ndo\ndiscussion\nlinks\nonly\nsome\nup\nsee\nunited\nyears\ninto\n/\nschool\nso\nworld\nuniversity\nduring\nout\nstate\nstates\nnational\nwikipedia\nyear\nmost\ncity\nover\nused\nthen\nd\nthan\ncounty\nexternal\nm\nwhere\nwill\nde\nwhat\ndelete\nany\nthese\njanuary\nmarch\naugust\njuly\nbeing\nfilm\nhim\nmany\nsouth\nseptember\nlike\nbetween\noctober\nthree\njune\nwell\nuse\nwar\nunder\nthem\napril\nwe\nborn\ndecember\nlink\nwhile\nc\nlater\npart\nnovember\nfurther\nplayers\nlist\nplease\nfollowing\nmy\nfebruary\nknown\nsecond\nu\nname\ngroup\nhistory\nseries\njust\ne\nnorth\nwork\nbefore\nsince\nseason\nboth\nhigh\nst\nthrough\ndistrict\nnow\n!\ncomments\nbecause\nfootball\nmusic\nhowever\ndiff\ncentury\nleague\nedits\ndebate\ntitle\narticles\njohn\nsame\nincluding\ncould\nenglish\nalbum\nnumber\nagainst\nfamily\nuser\nbased\narea\nbecame\nyork\nb\nlife\nme\nbritish\ninternational\ngame\n\"\nabove\nclub\nyour\nuntil\nearly\nbest\nwest\nhouse\ncompany\ngeneral\nleft\nvery\nhere\ndon\nliving\nday\nseveral\nplace\nparty\ncollege\nresult\nkeep\nappropriate\nfour\nsubsequent\neven\nclass\ngovernment\nhow\ncalled\ndid\neach\nfound\ncenter\nper\nstyle\ncom\nlong\ncountry\nback\nway\ndoes\nwww\nmodify\nend\nmake\npublic\nplayed\np\nwon\nanother\nreleased\nadded\nf\nsupport\ngames\nformer\nthose\nfilms\nchurch\neast\nline\nmajor\nmembers\ngood\nmuch\nimage\nshow\nstill\nthink\nbelow\ntown\nlast\nsystem\nright\nsong\nnon\nnotable\nsection\nsingle\nincluded\nalign\nhome\nwomen\ntelevision\n—\nseed\nmember\ngoals\nsources\nbook\nstation\norder\
nold\ninformation\nset\nown\ntext\nband\npoint\nlocal\naround\nriver\ntop\nmain\nlanguage\nfrench\nhttps\nnamed\noff\nus\nnote\ncareer\noriginal\nage\nservice\nestablished\nlocated\nre\nsaid\nwebsite\npopulation\nair\ngerman\nlaw\nmilitary\n}\ngreat\nii\nwithin\nclubs\npublished\npresident\npark\nofficial\n$\nr\ncase\n>\nlondon\ntimes\nalthough\nsmall\nthird\ndifferent\ndue\nget\nvillage\nclosed\ng\nart\nplayer\nfinal\nl\ncommunity\nheld\nn\nagain\nbegan\narmy\naward\nwithout\ndeath\nbuilt\nmen\nlarge\nsite\n+\nusing\ndeletion\nwhite\nalong\nfive\ncentral\nroad\nchildren\nfree\ntook\nengland\ninclude\nassociation\ndown\nj\ngiven\nsource\nx\ncalifornia\nman\nversion\nwritten\ncreated\nmedia\nblack\nthough\nphp\nreport\nbuilding\nla\ntake\ndivision\ncomment\nhaving\nking\nedit\nstadium\ndied\nship\nresearch\nrecord\narchive\nplaces\nundo\ncup\nrecords\noften\nfew\nreceived\nside\npower\neducation\nknow\ncategory\nwater\npolitical\nspecies\nfield\nnear\n&\nco\naustralia\nvideo\nneed\ngo\nisland\nform\nfind\nserved\nplay\nproject\no\naccording\nradio\nam\nworks\nproposed\nevery\ndevelopment\nexample\nlive\nunion\nindia\nnext\nspecial\ncourt\nregion\nh\nlittle\nshort\nv\nwilliam\nprovince\nwestern\nson\nfrance\ncouncil\nothers\nroyal\ncurrent\nstreet\nfull\nred\ntoo\ndepartment\nw\nsan\nhelp\namong\nve\npreserved\njames\nopen\nforce\nposition\nhead\ndirector\nfather\ntrack\nhttp\ncanada\nnever\naustralian\nid\ngeorge\njpg\nlevel\nlate\nsummer\nsociety\nmoved\noffice\nperiod\nchampionship\nround\nstory\nsongs\nvarious\nfile\ndays\nland\nbusiness\ntv\nreason\namerica\nmillion\neuropean\nterm\nal\nsix\nuk\npost\nwhy\nproduced\nmaking\nsubject\nyoung\ntotal\ndavid\nscience\nrelated\nrock\narchived\nrailway\nbecome\nled\nstudents\nstarted\nnews\ndescribed\nrole\nelection\nalbums\npresent\nindian\nkingdom\nbooks\nimportant\nnorthern\nlove\nrun\ncanadian\npress\nrather\nk\ntype\nact\neditor\ncame\nschools\nprogram\nonce\nissue\nsocial\ngermany\nproduction\nmale\nmight\nawards\n
points\nsimilar\nprofessional\nsay\nbackground\nenough\nlead\neither\ncommon\noverlap\ndata\ncolor\nbetter\n•\nperson\nservices\nmuseum\nbattle\nwent\nsports\nalready\ncurrently\nhall\nbuildings\nhistoric\ndate\ndeleted\nconsidered\nchange\nlocation\nseems\nmust\nyes\nour\nsouthern\nleast\nlost\nsomething\nreview\ntogether\nrobert\nfact\nless\njapanese\ngroups\ncontent\ninvolved\nisbn\nboard\njapan\ncontrol\npolicy\nmodern\nhuman\nhalf\ndesign\nevent\nevents\navailable\ndone\nwashington\nreal\nstart\npersonal\naction\nspace\nareas\ndoesn\nnotability\nstar\nreally\nchina\npossible\npaul\nworking\ntaken\nfar\ngoing\nminister\nlake\nreported\npopular\nmarried\nfounded\neurope\nauthor\naway\nindependent\nprocess\nteams\ncharacter\nlow\nmichael\npages\nlight\nbig\nseen\nrelease\nwant\nepisode\nwrote\nrepublic\nthomas\ncompanies\nvia\nrussian\nthanks\nput\nrace\nworked\nroute\nrecorded\nsomeone\ncivil\npolice\ncharles\nlisted\nusers\ntemplate\ninstead\neastern\nbody\nquestion\nitalian\nfeatured\nweek\neditors\ntexas\nchief\nclose\nhimself\nupon\nmatch\nq\nroman\ncome\nopened\ntour\nsea\nactually\ncross\nplaying\nhealth\ninstitute\ncaps\nforces\ngreen\nrights\nevidence\noriginally\naircraft\narts\nrange\nprobably\nconsensus\nbar\nproblem\nlook\nissues\nalumni\naverage\nnetwork\nwin\nshows\nwife\nreturned\nnight\nmagazine\ncentre\njoined\nusually\nmiddle\ncompleted\nelected\nsignificant\nafrican\nable\ngoogle\nstage\naddition\nireland\ntoday\nacademy\nsaint\nself\nitself\ncontinued\nstations\nmother\nappeared\nafrica\nculture\nspanish\ngrand\ncommittee\nthings\nfire\nchanged\ngold\nfemale\ncourse\ndirected\nmonths\nwhether\nchinese\nprevious\ndeveloped\nsize\nmentioned\nadd\nfestival\npeter\nbasketball\nacross\nmove\nperformance\nstandard\nmeans\ngive\ntraining\nartist\nword\nblue\nprimary\nannounced\nvalue\nchristian\nprivate\ncatholic\nartists\nincludes\nview\nthus\nalmost\nbaseball\nseven\nappears\never\nprovide\ntechnology\nolympics\nfuture\nformed\ncensus\nnd\nimages\n
los\nresults\nreturn\nquality\nconstruction\nzealand\nfront\ncover\nmodel\ndespite\nread\nmaterial\nstrong\ncoach\nhenry\nfootballers\nmark\nrev\norganization\nstudies\nfederal\nrichard\nhtml\nvirginia\ncar\nattack\nconference\noutside\nstudy\nbrother\nnames\nthroughout\nwriter\ncharacters\nmusical\nnothing\nborder\nmedical\ncountries\npast\nwriting\nmakes\ninterest\nprovided\nkilled\nmedal\nsigned\ndr\nlargest\nlabel\nfair\nsearch\nbay\nreference\nespecially\nrefer\nremoved\nlibrary\neventually\nmanagement\nreferences\nfeatures\nnavy\nguitar\nhill\nsure\nhistorical\nlower\ndaughter\nappointed\nreading\nyet\nsystems\ndebut\nmovement\nfc\nspecific\nalways\nactor\nnatural\nclear\ncoast\nlet\ngot\nchicago\nchampionships\nll\npennsylvania\nten\nperformed\nindividual\ndesigned\nrule\netc\nlists\nparis\nthought\nbrown\nhand\nneeds\nreliable\nsmith\ngenerally\nbase\nsometimes\nflorida\ncapital\nvalley\nbank\ngave\nground\nreached\nitaly\nenergy\nbelieve\nleader\nactive\nonline\nblock\nbridge\nfamilies\nchanges\ny\nfollowed\nindustry\ncollection\nrequest\nsoon\nleading\nolympic\nsold\nwriters\nprofessor\nstudio\nmexico\ncompetition\ncampaign\norg\ntheatre\nanything\nparticular\nempire\nlength\nislands\nsinger\ncreate\nredirect\nadditional\nsoviet\nmarket\nwords\nproducer\nnotes\nhockey\nnovel\ncode\nreferee\nfourth\nsport\nvan\nmary\nairport\nsound\nstatus\nirish\nplaced\nchild\nperhaps\nidea\nforeign\nmunicipality\nisn\nregister\neight\nproblems\nnative\ncoverage\nchannel\nparliament\nusername\nedition\nminor\nsays\nwhose\nfoundation\nunits\nmovie\nruns\nice\nsimply\nlimited\nunit\nstudent\npreviously\nstated\ngovernor\ncomplete\ntest\nnominated\nbill\nparts\nvocals\ntheory\nregional\nkm\naccount\nvote\ncomputer\nnone\ncarolina\ntournament\npoland\nbehind\nwales\nwinning\nlot\nhospital\nmid\ntaking\nmountain\nhigher\ncases\nangeles\nediting\nreplaced\nfood\nmultiple\nlikely\nterms\nsir\nthing\nsquare\ntry\ntopic\nwoman\nofficer\ncategories\ngreek\nrecent\nsent\ncopyright\n
speed\ntemplates\nmoney\nsaw\nsenior\nselected\nintroduced\npolitician\ntrue\nrequired\nregular\nawarded\ncommercial\ncities\ncontains\ntrade\nmr\ndegree\nanti\nbirth\nsun\nfinished\nlonger\nrugby\nearth\naccess\nprior\nseasons\njournal\nbeginning\nsoftware\nfamous\nreligious\nappear\nmartin\nel\ngod\nbit\nhours\nrunning\nbrought\nmissing\neconomic\nstructure\nrural\nremained\ndecision\ncertain\nquite\nhit\nminutes\nspain\nplays\nwhole\njoseph\nlord\nweb\ndecided\noperations\nfunction\nlouis\nassembly\nqueen\nsecurity\nuses\nohio\nowned\njan\noperation\ncall\nsuccessful\nlegal\nrussia\nprince\nmean\njewish\nstaff\nestablishments\ngoal\ntowards\nagree\nbad\nattendance\npopulated\nnature\nallowed\ncaptain\nmount\ntownship\ncalculated\nstructures\nhard\nsaying\nmanager\nearlier\nelections\nmeet\nbox\nlines\ndemocratic\nsuccess\n·\nassociated\nsingles\ntraditional\nrest\nhighway\nmatter\nparticularly\nwide\nmonth\ncare\nadmin\ncultural\ncommission\ndidn\nplan\ntherefore\npractice\ncommand\nnomination\njersey\nparties\nmichigan\nentire\nen\nanyone\nseem\noverlaps\napproximately\nmaster\nnoted\nusa\nstop\ncannot\nfeature\nengine\nresponse\nneeded\nillinois\nafd\nexperience\nhighest\nengineering\nsilver\nseparate\ntakes\nsecretary\ndutch\nlee\nrecording\nprime\nle\nthemselves\nrules\nuploaded\ntrying\nyouth\nscotland\niii\nhouses\nheart\nroom\nstone\nshown\ndeal\ndrama\nscores\ndead\nkey\nshot\nturn\noccupation\nscottish\nexecutive\nplant\npromoted\nwhom\nvillages\nlanguages\ninternet\nleave\nfeel\ncovered\nmerge\nmostly\nnumerous\nancient\nattempt\nproperty\nprograms\npicture\nfinally\nships\nfiction\nlooking\nsecondary\nnations\nmajority\nedward\nannual\ndigital\nmission\nwp\nlived\nclaim\nseat\nbbc\nprofile\ndance\nprize\ndoing\ngeorgia\nport\npacific\ncastle\npass\ntransport\norganizations\nratio\nrecently\nfall\nglobal\nera\nwing\nopinion\ncommander\nfort\neffect\nopening\nfine\npurpose\nwinter\ngenus\ncongress\noverall\nactivities\nmet\nincome\nmassachusetts\ncomes\n
older\npeak\nlack\nbass\nsuper\ncomplex\nacademic\nstars\naccounts\nappearance\nasian\nasked\nfriends\nkind\net\ntill\nfinancial\nentry\nasia\nsense\nmeaning\nactress\nmap\n→\nintended\nbishop\nboston\nrate\nliterature\nforest\nvoice\njack\npre\njustice\nbritain\nchampion\ndouble\npolish\nnumbers\ncolumbia\ntemple\ndefeated\nadministration\nended\nclaims\nz\njones\nmm\nparish\nisrael\nactors\nsister\nnine\nscored\ntable\nattended\npop\ncd\nelse\nnewspaper\nfriend\nunknown\nwinner\nchart\ninitially\nloss\nsites\nstarting\narchitecture\nrelations\nupper\nsupported\ntracks\ncontract\nface\ndirectly\nspent\ngirl\nclearly\njunior\nfrancisco\npolitics\ngreater\npresented\nmar\ned\ncause\nvolume\ncaused\npagename\ntom\nflight\ncandidate\npassed\nmatches\nclaimed\nexcept\noil\nassistant\nsurface\nvictory\nregiment\nstories\nrepresented\ngets\nspeedy\nweeks\nlisting\nallow\njr\nbranch\nretired\ncommunities\ntrain\npaper\nadding\nprovides\nremains\nvictoria\nmetal\nwrong\nlarger\ndirect\nfrank\nmiles\nblocked\nlaunched\nmass\nchairman\ncomedy\nrelationship\nknowledge\nformat\ncreek\nmeeting\nfailed\nofficers\ndraft\ngoes\nfight\nfigure\nfaculty\ncamp\nran\nvariety\nowner\nstatistics\nraised\nheavy\nalexander\nalone\nunderstand\nepisodes\ngives\neducational\ndaily\nwilliams\nlatin\ncompletely\nproducts\ndark\nattention\nreligion\nreferred\nvon\nmind\noppose\ncorps\nadministrative\ncut\nscott\nbecoming\nfootballer\njean\nmayor\npro\nbeach\ndescent\nnearly\nlatter\nleaving\nhighly\ncast\nterritory\nwrite\ntowns\nforms\njoe\ninside\nwanted\nsolid\nindividuals\nauthority\nmention\nprojects\ndel\ncontinue\ncost\nvice\ndrive\nnotice\njohnson\nforced\nbasis\nlooks\nreasons\njob\nphoto\nhope\nlog\nparents\nentered\nmike\nbasic\nscientific\namount\nspring\noxford\nkong\nopera\ntried\ncritical\nsimple\nfounder\nhong\ntold\nhusband\nuseful\ntechnical\nnecessary\nbelieved\noperated\nmountains\nimportance\nmusicians\nhotel\ngirls\ncrew\nfeb\nboy\nontario\nnation\ndefense\nwiki\nchampions\n
golden\ndistricts\nfaith\nracing\nmainly\nauto\nunless\nlives\nswedish\nhot\nentertainment\nturned\nnet\nsoccer\ncreation\nproduct\ntower\nincreased\nvotes\nsquadron\npx\ncontemporary\nsubsequently\nregarding\nfocus\nmarriage\nquestions\nnaval\ndetails\nforward\nmemorial\npeace\nip\nkept\niran\nkorea\nanalysis\nwinners\npoor\ngrade\ncricket\njudge\nelectric\nbc\nexist\ncorporation\nhold\nfeaturing\ncampus\nbrazil\nchris\nbeyond\nfifth\nincrease\nsummary\nremaining\nstatement\nbroadcast\ngetting\npiano\ndes\nnovels\nserving\nhour\nmoving\nresolution\nconcept\nalternative\nbrothers\nattacks\nencyclopedia\nrepublican\nrepresentatives\npoliticians\ndifficult\nability\nstudied\nhost\nwall\nimmediately\nurban\npakistan\nbecomes\nmarine\nphysical\ndec\ntroops\ninterview\ncoming\nsemi\nsuggest\nemperor\nletter\ncouple\nfellow\nduke\ntell\ngallery\nfollow\nwindows\ntree\nhits\njazz\nprotection\nrelevant\ncount\nsituation\nreviews\ncontaining\nclassical\noffered\nlady\nnetherlands\nreports\ninfluence\naddress\nlinear\nconsider\nmachine\ndomain\nelements\nminnesota\ntypes\nnov\nserve\nsydney\nministry\nblood\ndistance\nbottom\ngiving\nboys\npotential\ntoronto\nedited\ninfantry\njun\nformerly\noct\nconflict\nworkers\nsteve\nphiladelphia\nhelped\nder\nnationality\ndispute\nscene\nmethod\ntitles\nberlin\nconditions\narms\nraces\nmaybe\ndiscovered\niron\nextended\nchurches\notherwise\npositive\nsanta\nnom\nimperial\ncomposed\nball\nwidth\nquickly\ncorrect\nresponsible\npossibly\nindiana\nsoldiers\nexamples\nkorean\ntimezone\ngenre\nfish\nsenate\neffects\ngun\ncheck\nappearances\nplans\nrenamed\nsign\nreporting\nsweden\nconsists\nheritage\ntag\nprimarily\ndoctor\nleaders\nlies\ninc\nrivers\ndu\ncrime\nliberal\nstand\nbob\nexisting\npublishing\nindustrial\nanswer\nsplit\napr\nsex\nmixed\nacting\npersonnel\nrail\ndie\npremier\napproach\nwisconsin\nsentence\nroot\nago\nstandards\ncomics\nearned\nmiss\nspecifically\nhorse\nactual\ncontributions\ncarried\nlieutenant\nwood\nplants\niniti
al\norigin\nenvironment\npretty\nrank\nbus\ngas\ndirection\nguide\nresources\naffairs\naccepted\nanimals\nnor\nactivity\nlevels\nlaws\njim\ncreating\ncambridge\nones\ncomposer\nremove\nagency\nreserve\natlantic\nsupreme\nweight\npp\nask\nfighting\njackson\nwidely\nrose\noperating\ntreatment\nlinked\nandrew\ntrial\nexpanded\ndaniel\ncertainly\ninfo\nvs\nsciences\nfame\neverything\navenue\ntravel\nscale\nbreak\noregon\nhousing\nproduce\ncapacity\nsmaller\nfictional\nexchange\nactions\ncited\ntypically\nsettlement\nagreement\ntranslation\nmales\nkansas\nmanaged\nbring\ncharge\nfails\ndedicated\nestate\nnearby\nresidents\npiece\ngrowth\ntrust\napplied\ndrums\nissued\nmurder\nnormal\ntwenty\ncommonly\navoid\ntony\nnorwegian\ncriteria\ncontext\nsuggested\nrevolution\nfully\nwars\nprominent\naug\nleaves\nadvanced\ndistribution\nmedicine\ngarden\nreach\nturkey\nfemales\npublications\nimpact\nhouseholds\nsurvey\nheight\nmorning\nhonor\ndeep\nargument\npublication\narthur\nelizabeth\ndisambiguation\nworth\ncolorado\nmedian\nmaryland\nfalls\nzone\nsolo\nlearning\npay\nresolves\nchoice\nvol\nflag\nengineer\ncars\nfarm\nwilson\nprincipal\nacquired\nconstructed\nsecret\npoet\nbuild\nremain\norchestra\nversions\nfollows\nfixed\nfm\nefforts\ndocumentary\nequipment\nray\nyellow\nguard\npressure\ngrant\nprison\nfreedom\nnorway\ntwice\nsportspeople\nstore\ntaylor\nquarter\ndesignated\nindependence\nplatform\nrome\nad\nteacher\ncopy\neffort\nnuclear\npictures\nmodels\nsep\neveryone\neasily\nthank\ndi\ndescription\nagreed\n£\ninstitutions\ncovers\nfacilities\ntarget\nstack\nrationale\nstat\ncombined\nbronze\nsort\nhosted\nprogramming\nsri\nrailroad\nunique\ndefined\nocean\ncell\nmissouri\nconcert\nimprove\nbiography\nloan\nshortly\ncontact\nholy\ntennessee\nsub\nsafety\ncompeted\nstephen\npolicies\npainting\nprice\nentirely\nmexican\nleadership\nflying\nmessage\nmunicipal\nserious\nheadquarters\nofficially\ncemetery\nmemory\n×\nfields\ngeneration\njoin\ncopies\nfinals\nfox\ncontinues\nr
epresentative\ndestroyed\nfeet\nguy\nphilippines\nrevealed\norganized\nserves\nconservative\nshare\nmaria\ndisease\nsections\nphilosophy\nways\narrived\ndivided\nfloor\nlabour\nlogo\nmeets\nyard\nlargely\ncancer\noffer\ntax\nexpected\ntraffic\nconcerns\ngraduated\nguest\njews\nformation\nmeant\neconomy\nstorm\ntells\nmile\nprotected\nbowl\nletters\nproviding\nbegins\nclassic\ndamage\nharry\noffers\ndavis\nchallenge\nviews\nmarked\nallows\ndensity\nliterary\nfa\nhtm\nben\ntransportation\nkentucky\nsales\nfleet\nsupporting\ncaptured\nextra\nrecognized\narizona\ncompared\ntheme\nfrancis\nmoscow\ninterested\nheard\nbehavior\ntransferred\nenvironmental\nblank\nmusician\nstarring\nassigned\nseats\ntennis\npercent\nlogs\ndisplay\nconvention\nring\njoint\nbrian\ndeputy\nplanned\nuniversities\nyards\ncommunist\nagent\ndifference\nanimal\nczech\npositions\nexactly\nstay\ntitled\ncombat\npalace\ncard\nordered\nopposition\nattempts\nunderstanding\nstub\nwrestling\ncritics\ngrowing\nestablish\nhands\nparticipated\nrevert\npoetry\nmaterials\nga\nturkish\npaid\npromotion\napparently\nbattalion\nmobile\nadditions\nrow\nmerged\nmetropolitan\nfigures\nexistence\neye\nlongs\nlouisiana\nlewis\nmelbourne\naustria\nbrigade\nscreen\nrisk\nconducted\nlats\nban\nda\nlabor\nlegislative\ndefinition\nindeed\ndraw\napplication\nun\nsteel\npresence\nexpansion\nearl\nmax\nwild\nplanning\ncomic\nadopted\neasy\nplus\nhappy\nacts\nclasses\niowa\ngrew\nsave\nwins\ntheater\nexists\nroles\nchance\nprevent\ncandidates\nobject\nfelt\npowers\nbirds\nspread\ndefeat\ncape\nidentified\ngained\nregions\nmine\nsides\njul\nshowing\nteaching\nguidelines\nsimon\ndepth\nlyrics\nchristmas\ndeclined\ngreece\nexpress\nfederation\njournalist\nintelligence\noccurred\nconnection\ndisplayed\nportuguese\ndeclared\nconstitution\npresidential\nstanding\nsons\nplot\ndates\nfirm\nproper\nends\npilot\nrelatively\nreceive\neducated\nopposed\nmanchester\nqueensland\namericans\nintroduction\ndirectors\nvehicle\nstock\nvehicles\ni
sraeli\nfrequently\nhills\nperforming\nnorthwest\ndrug\nvisit\nportion\nresidence\nwalter\npov\ninteresting\nmoon\nlimit\nminute\nbell\nathletics\nreduced\nwind\noklahoma\narchitect\nideas\nelectronic\ncrown\nyounger\nanderson\nstep\nweapons\nunable\nneutral\nconnected\nswitzerland\nexpatriate\narmed\nweekly\nrating\nprogramme\nsquad\nmedalists\nmulti\ndynasty\ncold\ngranted\nsocorro\nalliance\nmethods\nsr\nsam\nalabama\nalbert\ntropical\nvietnam\ndvd\nrefers\nheat\nfans\nsurrounding\npurposes\ncredit\ncommons\nboat\niv\nboxes\nethnic\nspeaking\nfell\narena\nroads\ncore\ndog\nkill\nathletic\noldest\nnegative\nconfirmed\nsixth\nedge\njesus\ntools\ncolonel\nweak\nchosen\nbrand\nresulting\nnfl\nrise\nsupply\ntradition\nelementary\nhousehold\nspirit\ntask\nslightly\nhoward\nincident\ndevelop\nsoutheast\nsunday\ndiscuss\nstats\nclimate\ntopics\npurchased\ncommunications\nchapter\nbroken\nsingapore\nalongside\nsituated\nca\nlicense\nhaven\ndeaths\npassing\ncitizens\nguns\ntrees\ngone\ngreatest\nimproved\nvisual\npope\nofficials\nsat\nglass\nmiller\nresulted\nposted\nestimated\ncontain\nbrazilian\nsexual\ndefence\nrespectively\nconcerning\nrich\nmyself\nfast\nproperties\ntaught\nextensive\nexhibition\nspeech\nles\nproposal\nstraight\nff\ninternal\neffective\nsolution\nfashion\nfoot\norange\nargentina\nbrief\nperformances\nadult\nallowing\nnewly\nidentity\nnominator\nsingers\ninspired\ndiscussed\nrequire\nex\nfacility\ntransfer\negypt\ncells\npatrick\nquebec\nconnecticut\nscoring\nanthony\npermanent\nphase\naudience\nmotion\nblues\nhungarian\narab\ntrains\nsets\nwasn\nranked\nunlike\nbegin\nsetting\neyes\ndatabase\nstudios\ncriminal\ncommonwealth\nfinish\ncommunication\nscope\naccused\ndivisions\naccept\nwarning\nalan\nobjects\ndiego\ncontest\nfighter\nfinds\ncoaches\nbeat\nextremely\nford\nswiss\nsorry\nhouston\nworldwide\nshowed\nholds\ncathedral\nlosing\nadvance\nreality\nbroadcasting\nadam\nvandalism\nenemy\nentitled\nyoutube\nassessed\nbillion\nburied\nbelgium\nrespect
\nrare\ndetroit\ngraduate\ncolleges\nexplain\nauthorities\nkilling\nmaximum\nneither\nfan\nnotify\npainter\nhamilton\nreturning\nattempted\nuniverse\npasses\nobvious\nsuffered\npieces\napply\nactresses\ncompetitions\naid\ndriver\nfolk\ndan\nkhan\nbaby\ndenmark\ntokyo\nbillboard\ncalling\nanne\nhappened\ndanish\nwants\nformula\ninterior\nkevin\nweather\npowerful\nmuslim\nregistered\npublisher\npreceding\nsounds\neric\napproved\nachieved\ndouglas\nprovincial\nfund\nportugal\nathletes\nbird\nbands\naudio\ncat\nbureau\ncenturies\nvalid\nchemical\nitems\nlane\nholding\ncounties\nupdate\nncaa\nspeak\nfinding\ndomestic\nali\nfalse\nequivalent\ncaught\nchrist\nending\ntoward\npuerto\nperform\npartner\nromania\naviation\nwouldn\nfailure\nward\nstrength\nonto\nknight\nnominations\nhungary\nconcern\nkeeping\nrecordings\nep\njuan\nfunctions\nmississippi\nok\ncalls\ncriticism\ninvolving\nmagic\ngordon\ntreaty\nantonio\nselection\nrear\ncolonial\nmotor\nobtained\ncircuit\nwish\ncompilation\nharvard\nislamic\ndetermined\ngeography\narkansas\nfuel\nartillery\nmedieval\nlocations\ninclusion\nrecognition\nnortheast\nchamber\nmoment\nsomewhat\ngrounds\nanyway\nsucceeded\nhistorian\ncondition\nphysics\nnewspapers\ninstance\nrepresent\nallen\nwatch\nkitt\nprotect\ngrey\nlaunch\ndave\nphilip\ndc\niraq\nchanging\nukraine\nmunicipalities\nmix\ntamil\nshift\nshared\naustrian\ndoor\ninvestigation\ninstitution\nprincess\ntrail\nultimately\nparks\napplications\nhundred\naired\nrequirements\ntalking\nkim\nltd\nmetres\ngray\nsector\ndean\nagricultural\nunincorporated\nincorporated\nescape\norders\ncorner\ncommissioned\nfounding\nmill\nmrs\nsubjects\ntemperature\nsettled\nremember\nmiami\npromote\nvalues\nspot\nprogress\nlearn\nplanet\noh\noccupied\nusage\nsouthwest\nrefused\nborough\ntruth\nclark\nsufficient\nequal\nadministrator\npersons\nfactory\nfought\nderived\noutstanding\nmagazines\nflow\npeer\nattacked\ngenerate\nshape\ncreator\nrequires\noption\nlincoln\nstarts\nstands\ncarry\nestablishm
ent\nselling\ncauses\nmp\nbudget\nbattles\nsky\nlegend\nsourced\narrested\nforum\nmetro\nbroke\nstrike\ninjury\nryan\nzero\nconverted\nviolence\nsignificantly\nstatements\ncontrolled\nwelsh\ndropped\nroger\npdf\ndistinguished\nsamuel\ntranslated\npapers\ndetail\nchapel\nfrederick\nthousands\nbanks\nherself\noffensive\nkings\nfactor\nrename\nreplace\nmuseums\nresistance\njunction\ntries\ntim\nengines\ncontributed\nmedium\ndevice\nprofit\ndream\nenter\ntwelve\nuniversal\ntypical\nseeing\nskills\nbought\npassenger\ncleveland\nfunding\nagriculture\nparent\ndecades\nreceiving\nsignal\nreform\norganisation\nprix\ncolumn\ndefunct\nutah\nmanagers\nqualified\nindicate\nukrainian\ngay\namateur\nobviously\nflora\ngene\nsoul\nop\nalt\ndiscussions\nmontreal\nturns\nwalker\nentrance\npath\nnice\nstring\ninfluenced\noccur\ndeveloping\nabandoned\nhumans\npair\nflat\nsample\ncontained\nbanned\nmoore\nstrongly\nvisited\npm\nincreasing\nattorney\narm\nmathematics\ncanal\ncharts\nthinking\ndublin\nsuggests\nwhatever\nsurname\nbrain\npittsburgh\nblog\neconomics\nseventh\nalex\nemployed\nheavily\nauthors\npaintings\nconcerned\nrecipients\nnavigational\nscholars\ncontroversial\ncontroversy\nreverted\nexpressed\njosé\nbodies\nconservation\nmaps\nahead\nmarie\narguments\nchain\nfocused\nreaders\ncarl\ncm\nviolation\noffices\nwave\ncircle\napart\ninvasion\njimmy\nopportunity\ndetermine\northodox\nvoted\nformal\ndescribes\nseconds\ncycle\ndoubt\ngolf\nwalls\nproductions\nconstituency\nclosely\noccurs\nhuge\nandy\nrepresenting\nindonesia\nsell\naren\nmon\ndrawn\ndiocese\ntank\nadvice\nsenator\nmanner\ngenerated\nmalaysia\nasking\nfinland\ncausing\nleads\nlawyer\nseattle\ngain\nindex\nsaints\nrunner\ncrisis\ncinema\nmatt\nhollywood\nreaction\nmedals\ndocuments\nreader\nlawrence\npattern\narchives\natlanta\nvoting\nreviewed\nlooked\nbear\nperfect\nrestored\nbruce\nbaltimore\nbaron\npan\ncommune\nfantasy\nduty\nchair\nscenes\nbroad\nopposite\nstuff\naged\nstreets\nnick\nanna\nbilly\nextension\nke
nt\nparliamentary\nkelly\nshooting\nready\npick\nma\nsongwriter\naware\njordan\ndictionary\ncomposition\nsalt\nstating\nbangladesh\nbot\nsuccessfully\nbenefit\nlands\ninterests\nscheduled\nteachers\nclosing\nadvertising\ncontribution\nmaine\nretirement\nscientists\ndam\nny\nblocks\nlas\nprint\ntechniques\nparticipate\nanniversary\nrequested\ndiscovery\nexplained\nexpedition\ncitation\nassist\nund\nmeanwhile\nhampshire\ncreative\nmaintain\npierre\ndetailed\nfacts\nframe\nfinance\nsocialist\nscript\ncamera\nreturns\nengaged\nassistance\nexperienced\nunderground\nsale\nbeautiful\njane\nabc\nsupposed\nsuccessor\nclassification\ntool\nmining\nproducing\ncabinet\nfr\nbytes\nross\nrussell\ncitations\nmaintained\nevening\nsinging\nfifa\ngender\nvenues\nlakes\nmail\njeff\nelectoral\nemergency\nmode\nchristopher\nheads\nproved\npriest\nfunds\ninvestment\nromanian\nsession\ncapture\naspects\nreduce\ntrophy\nabuse\nprefecture\nwalk\nfaced\nnormally\nregarded\nsnow\nshop\ndakota\nbush\ncoal\ninhabitants\nheaded\ngary\nemployees\nerror\ninvited\ncable\nprotein\naccident\ndecade\nmeasure\nwatched\npatients\ndowntown\nanimated\nsatellite\njohnny\ncombination\ncourts\nsequence\nhook\nclean\nwed\nowners\ntwin\ndistributed\ndescribe\n~\ndefensive\nislam\nphotos\nottoman\ntrained\naffected\nroutes\nministers\nwine\nelsewhere\nbiggest\nli\nlanka\ncarlos\nlanding\ncollected\nrevival\nrio\ncommunes\nsaturday\nmps\nguess\ndrop\nsarah\nlaid\nswimming\nmembership\nedinburgh\nfit\nharris\ndallas\ndegrees\nbachelor\npersonally\nbriefly\nfiles\nconduct\nextreme\ncourses\nhence\nhomes\nreaching\nna\nsought\nvision\ndemand\nvertical\nupdated\nmarketing\njason\nconsisted\nappeal\nplane\nquick\nvictor\ndyk\nsolar\nages\nneighborhood\nfairly\nwings\nacid\nscheme\n°\nmatters\nrfc\nconstant\nadditionally\nhip\nadmins\nnova\nceremony\nchile\ncomposers\nnazi\nscholar\nliverpool\nhero\ndesigner\nlearned\ninstruments\nwelcome\nhair\nconsecutive\nmovies\nadjacent\npool\ntue\nnorman\ncollections\nbelgian\nc
orporate\naustin\nensure\ndriving\nphone\nfly\nian\nwindow\ndocument\nadams\ncollaboration\nmargaret\nkennedy\nleg\nvideos\nassume\nattached\ndry\nexpand\nbible\nmatthew\ndepending\nserbian\ninstrument\ncovering\nrandom\nrepresents\nparticipants\nthorough\nmentions\nportrait\ndrivers\nairlines\nfranklin\nviewers\nfinnish\ndifferences\nvenue\nvocal\ncricketers\nelement\nregularly\nrejected\nrelative\nillegal\nstewart\nroof\nleagues\nargued\ncolour\nmorgan\nprisoners\nfacebook\nattend\nnelson\nsurvived\ninsurance\nexpert\nsteam\ncards\nmanufacturing\ntesting\ncoastal\nyorkshire\nrescue\nterritories\nthu\nthailand\nstruck\nchoose\nvienna\njourney\nstorage\ncosts\nsingh\ndistinct\nnotably\nsoldier\nil\ncolony\nevolution\ntaiwan\nhurricane\njudges\ngardens\npoems\nconsisting\nremoving\ndriven\nresponsibility\nsentences\nbirmingham\nengineers\nvisible\nft\nsubstantial\ngulf\ninstalled\nrevolutionary\ninner\ntrip\nrestaurant\ngraham\n\\\nstores\nrice\nhappen\nprove\nreasonable\nskin\ncommitted\nvolleyball\n_\nchose\nfactors\nhundreds\ninjured\ndevices\nphrase\nstanley\nlemmon\nthompson\nsuicide\nadvantage\nautomatically\ndisc\nminimum\ngoods\ncharges\nalfred\noperator\nmerely\nfinishing\nfred\nidentify\nproducers\nss\nann\ncampbell\nportland\nhelps\nlatest\nreleases\nvictims\nexplanation\noperate\nthreat\ncrossing\nslow\npoets\nstopped\nstrategy\nwayne\nranking\ndisney\nwright\nresidential\nassociate\nhi\nsignificance\nruled\nexcellent\nshouldn\nobserved\nthreatened\nfriendly\nredirects\ntemporary\nmasters\npeninsula\nnetworks\npassengers\nassumed\nartistic\nsafe\nearliest\nfestivals\ncompete\npng\nhunter\nmoths\nalaska\nmi\npartnership\nmaintenance\nmonitoring\nevil\nrelief\ncharlie\npoverty\nhop\ncc\nfri\nencyclopedic\nsuspected\nfilled\nnba\ndecide\nbreaking\nargentine\nresigned\noblast\nhanded\ndrew\nhawaii\nbrooklyn\nwhilst\nhistorians\npa\nspeaker\nmoth\npermission\nwounded\nracial\nmarshall\nkg\ngate\nsprings\nroy\nphotography\nhelping\nknights\nroll\nprogressive\nc
ontrast\ncontinuing\nprocesses\nterminal\nexecuted\nshall\nsvg\nspouse\ninfrastructure\nprinciple\npainters\npainted\nproperly\nfrequency\nshaped\njoining\nrobinson\nwaters\nridge\nbridges\nceo\nmonument\nmental\ncarter\nkarl\nmac\norleans\nportal\nparallel\nregardless\nthirty\ngiant\nqualifying\nmurray\nafghanistan\nassessment\ncounter\nbears\npurchase\nexpression\nbacking\nuefa\nimprovement\nmadrid\nclosure\nwheel\nambassador\ndesert\nbringing\niranian\nreign\nuncle\nsevere\nrain\nadmiral\nfishing\nexisted\nraise\nbroadway\nprinciples\ngrow\ntests\nroughly\ntech\ntrouble\nrico\nparagraph\nbat\nprepared\nmeasures\nrobin\nhired\nfear\nmerit\nparticipation\nmassive\ndesigns\nagencies\nwhereas\ntechnique\nalberta\negyptian\nclerk\nknew\nnarrow\nadapted\ncommissioner\nrapid\ncredited\ndating\nbusinesses\nbomb\ncapable\npoem\nstages\nhonorary\ndragon\ncharged\npropose\nmodified\nfired\nmlb\nsend\nproof\npractices\narabic\nattractions\ncarrying\nmouth\nfix\nlicensed\nsymbol\norgan\ndamaged\nwarren\nexception\ncosta\nunfortunately\njerusalem\nreplacement\nindians\nbesides\nsoundtrack\nvirgin\nthousand\nvancouver\nlegislation\nbeauty\ncredits\nbuy\norganisations\nserbia\nchristianity\nopinions\ncavalry\ntribe\nrichmond\nchess\nchannels\nclaiming\nexact\nbaker\nallied\ninvolvement\nanime\nreferring\ndonald\nsisters\nwilling\nrequests\nunusual\nyourself\nimpossible\ncolors\ncook\ndrawing\nwikimedia\njonathan\nremoval\nmoves\nindicates\nadmitted\nownership\nshore\nmonitored\nnebraska\nregulations\ncrash\nguitarist\nenforcement\nsupports\nabbey\ndeleting\nnevada\nbarry\ntone\noperates\nindigenous\npersonality\nreception\ntransit\nbuffalo\nflowers\nbond\njay\nadventure\ndefinitely\nguinea\nhorror\nrangers\npointed\napple\npopularity\noccasionally\ncoalition\nfranchise\nstarred\ncritic\njournals\nrolling\npercentage\nsilent\nlaboratory\nmicrosoft\nmovements\ncharter\nsuitable\nalternate\noffering\nmissions\nsc\nexperimental\nrooms\nconcluded\nreputation\naccurate\nversus\nwebsit
es\ninterpretation\ntagged\nendemic\nchemistry\nachieve\nknows\nmanga\njournalists\nforests\ncbs\ncomprehensive\nsymphony\npromotional\nelectrical\ntags\nmeters\njerry\ntigers\ncommerce\nremix\naddressed\nphil\nautomatic\ngang\nafterwards\nprinted\noak\nwarner\ntend\nms\nquote\nseparated\nbishops\nglasgow\nessentially\nwait\ninput\nbattery\nfavor\nbenjamin\napparent\nshopping\npatrol\neagle\nmainstream\npc\nangel\nmartial\nrestoration\ndelhi\nhans\nindicated\nmorris\nrailways\ncenters\nmills\nhelpful\ndelivered\ncomponents\nvictorian\nlegislature\ntourism\ntreated\nextent\nkids\nbarbara\nessay\ncircumstances\nrepeated\nplain\nsuperior\nstrategic\nsimilarly\nduties\neffectively\nblp\nconsidering\narranged\nken\ngrammar\namendment\nalleged\nrelation\nhabitat\nspoken\neu\nshell\nmounted\nentries\nconflicts\nphilippine\nmontana\nappearing\ntriple\nboundary\ncaribbean\nhosts\nsigns\nseriously\nbristol\nwarring\nmitchell\nindustries\ncolombia\ncomparison\nbasin\neleven\nill\nputting\npradesh\ncharity\noutput\ndna\ncarbon\nboats\ndesc\narchitectural\nrepresentation\ncommentary\nrising\nvisitors\nmarkets\nplate\ngiants\nprocessing\nlandscape\ndick\nhunt\nem\nsummit\nrr\npsychology\nride\ngreatly\nguardian\ncloser\nterminus\nlosses\nbalance\ndemocracy\nsubmarine\nnicholas\nunsourced\nusual\nperu\neighth\ninstrumental\nhindu\namongst\ndefender\nriding\narrival\nevans\nturning\nimply\nprose\ncargo\nhidden\nvolunteer\nbio\nholder\nsugar\ndaughters\nwildlife\nfun\nintegrated\npartners\nrates\ngrace\nfeed\nchildhood\naccompanied\nmilan\nphotographs\nhonour\nsoil\nserver\nmanual\nconcrete\npossibility\nghost\nconfused\ntunnel\nlarry\nstyles\nelevation\nmuhammad\nconsiderable\nstood\ninter\nlose\nphoenix\nsweet\nwaste\noperational\ntall\nongoing\nqualify\nconstitutional\nsporting\npeoples\nacceptable\nfruit\ndecisions\ndepression\nperspective\nlongest\nmidfielder\ncrystal\nmonastery\nresident\nseek\ncincinnati\ntied\nsurgery\nsteps\ncarrier\nstream\nalice\ndj\nkick\nfurthermore\nst
range\npredecessor\nbernard\nnigeria\npain\nph\ninfluential\npunk\nwooden\nsuggestion\ninteraction\nretained\nachievement\nmechanical\ndrugs\nmissed\nexpect\ntrinity\nclassified\nminority\nbusinessman\ngrown\ncoat\npowered\nalive\nnbc\nnhl\nkeith\nbobby\nharbor\nbehaviour\ncroatian\nmaritime\nterry\nvirtual\nindoor\nperiods\nspiritual\neasier\ncroatia\nlions\narchbishop\nluis\nmerchant\nazerbaijan\nlots\ncontested\neditorial\ninitiative\ncharlotte\npure\nborders\npersian\nmarks\narmenian\nromantic\nreplacing\ntalent\nunlikely\npanel\njump\nanimation\nagents\nemployment\ntrading\nparker\nstatue\nac\ndated\nwonder\nfiled\nprovinces\nfriday\njobs\ncuba\ncouldn\naside\nsão\nscientist\nschedule\nwaiting\nfamiliar\nsuspect\ndisagree\nsuggestions\nturner\nforming\nformally\nlocomotives\nbarcelona\nse\nconsistent\nrecommended\ndesire\nhappens\npatient\nbulgaria\nvi\nvincent\nouter\nhear\ntexts\nbelief\nvisitor\nvessels\nbasically\ncontinental\nhole\nfail\npassage\nsees\nwedding\narchaeological\nlayer\ndesignation\nclan\nrevenue\nseeking\ncouples\nentering\nsuit\nsoft\nweekend\napproval\ndemocrats\ncrimes\ncollins\nexpatriates\nhorses\nwear\nvisiting\novers\nsupporters\ncash\nsomewhere\ndennis\nresource\nsculpture\npractical\nharrison\npink\noliver\nlimits\ncooper\nillustrated\nsur\nhell\nstatistical\nreferenced\nwolf\nwarriors\nincidents\nfresh\neditions\nroots\nsignature\nclinical\npremiered\nvolumes\nworst\nadults\ncontribute\nnecessarily\nimmediate\nfeeling\ntheories\nessential\ncompletion\nconclusion\ntechnologies\nstrip\nbound\npraised\nstayed\nhull\n−\ndiamond\norigins\nempty\neliminated\nvaluable\ncite\ndoubles\nbranches\n@\nhonors\nbrick\nexperiences\nbeijing\ntie\nlgbt\nsa\nwickets\nliberty\nrepeatedly\nsiege\nbaptist\nron\nhebrew\naffect\nportrayed\ndecline\nwidespread\ncoaching\nalpha\nequipped\nidentical\nsubmitted\nenterprise\ntouch\ntransmission\nrs\nplatforms\ncave\nfilmed\ninch\ncool\nbulgarian\ndebuted\nliga\nmanhattan\ndestruction\nactivist\nweapon\nclay\n
keyboards\ndangerous\nviewed\nlp\nemail\nbiology\nincreasingly\nbold\nbowling\nos\ncompare\ntreaties\naffiliated\nsock\nassault\nregards\nmonthly\nfoster\ncousin\nurls\nhispanic\nlogic\ncraig\ntrivial\npioneer\nmuslims\nlay\nrated\nabsence\namsterdam\npublishers\ntribes\npercussion\nrunners\nthemes\nbenefits\nguards\nflows\nattributed\nstubs\nathens\nherbert\ncelebrated\nsponsored\nraf\nregard\nasks\ndelaware\nneil\npole\nref\nhistorically\ntail\ntours\nstable\ndecides\nvessel\nidentification\ndelta\ntelling\ndealing\nwrites\nmediterranean\nvolunteers\nreply\nattempting\nstuart\nmarvel\nluke\ngrave\nodd\nhearing\nuss\nmall\npenalty\nsolutions\nsecure\nhugh\nsteven\nsole\narchitects\ncharacteristics\nfalling\nspin\nclinton\nvilla\nselect\nmetric\ncriticized\nsurviving\nroberts\nstandings\nbiological\nlloyd\nmunich\nbelongs\nadelaide\nbelong\nharold\nnorfolk\nbutler\ncoi\nrival\nacoustic\nposts\nadaptation\ngreg\nreporter\nurl\nabsolutely\nnobody\nscholarship\nvast\nexit\nfacing\ninquiry\ndual\nbelt\nnoticed\npatent\nmathematical\nrelating\nrarely\nsubmission\ndemographics\ncrowd\nrick\ngovernments\nbonus\ntourist\nmystery\nclick\nsettlements\nwalking\nnevertheless\nvoters\nrifle\ncomponent\ncivilian\npartial\nencouraged\nbirthday\neddie\nchristians\ndenver\npetersburg\nresearchers\npartly\nphotographer\nruntime\njon\nobama\npicked\nseemed\nclock\nviolin\nhighways\nholiday\ndistinction\nartwork\nmakeup\ncatherine\nfont\nfarmers\noccasions\nau\nguideline\nphotograph\nstruggle\ntimestamp\nproduces\nyale\noptions\npen\nprocedure\njacob\nconvicted\ntouring\ntransition\nanglo\nlegacy\ndenied\nrelationships\nottawa\nderby\nsurrounded\nlibraries\ncompeting\nspeakers\ngrades\nhudson\nadministrators\nsacred\nsigning\nrob\ncitizen\ndogs\nargue\nbelieves\nannually\ncardinal\nnepal\nintersection\ndiscussing\nreveals\ndefeating\ndisputes\nbeam\noverseas\nperry\nnickname\nruling\nsyria\nwells\ncontributing\nultimate\nranks\ndanny\nretail\nfavorite\nvermont\nbegun\ndownload\ntrusted
\nappointment\nballet\njefferson\nanywhere\nsand\nangle\nsessions\nrecreation\nwearing\nkenya\naccessible\nralph\nthread\ndisruptive\nspend\nninth\narrest\nchoir\ntrials\nmines\ninjuries\nrapidly\nrounds\ncompetitive\nopportunities\nmeetings\ncommented\nwang\nwoods\nexercise\njacques\nobjective\ndemolished\npreferred\nresort\npedro\nrobot\nvenezuela\nsegment\nstudying\nedwards\naim\ndancing\neagles\ndemonstrated\ntribute\ncontinuous\nencourage\nspider\nacted\nconvinced\nheroes\ndescribing\nrocks\nbed\ngap\nreflect\nmars\nparticipating\ncooperation\nobtain\ngothic\nprotest\nhunting\nrfa\nfrequent\nconversion\nstress\nmanufacturers\nvoiced\ninnings\ntraditionally\njose\nadventures\ntiger\ntotally\nvoyage\nconcentration\nsing\nrocket\nelectricity\nshadow\nboxing\nsenators\ndoc\nstanford\nmachines\nvegas\nclearer\nsaved\njury\ncalendar\nnoble\ntommy\nguilty\nleo\naffair\nhandle\nextinct\nresponded\nshares\nscotia\nmanufacturer\ntales\nimplementation\ntruck\nspelling\nitem\nload\ncustomers\nadds\nspaces\ncap\norphaned\nferry\nprefer\npush\nlie\nberkeley\nlebanon\nmadison\nthrone\nattracted\nie\nlion\nretrieved\nmanor\npromoting\nsaudi\nserial\nabroad\nrogers\nlights\ngrandfather\ngauge\nconcerts\nelder\nreads\nrenaissance\nuniform\nchase\naka\ncomputers\nbrisbane\nsusan\nraymond\nflower\ncol\nthai\ndisaster\nsurvive\ninvolves\nclothing\nmurphy\nsharp\nbehalf\nexplains\nyugoslavia\nbuddhist\npublicly\nmeat\nliterally\nspam\ntelephone\nmoral\nsung\npartially\nlawyers\nciting\ninterviews\nbrunswick\nradar\nspending\ngrove\ntea\nap\nelite\nbright\nimproving\nsierra\nheaven\nathlete\naspect\nanswers\nted\nconsumer\nfunded\nexclusive\nibn\nmanuel\nallies\nreviewer\nmissile\nmechanism\nhelen\nwithdrawn\nintention\nmini\ncasualties\nestablishing\ndiseases\nrhythm\npat\ncatch\npoll\ndeck\nnewcastle\nantarctic\nleeds\nlasted\nranges\nlistings\nordinary\ninsects\nsuffering\nflash\nworship\nboundaries\nblind\npakistani\nassuming\ninterstate\npatterns\narrangement\nglobe\nhonours\ngr
oss\ngilbert\napplies\ngradually\nyoungest\nmanaging\nexperiment\nradical\ngov\nlegs\nopponent\ndiameter\nsupplies\npitch\nutility\ncleanup\nopponents\nregime\nrevised\nplenty\ngenera\ndiplomatic\ngermans\nseal\ngregory\ncorresponding\nconcepts\nsword\npurple\npending\nvirus\npopulations\nbull\ndrummer\npresents\nholland\ncongressional\nbias\nmerger\nremote\nsean\nmessages\nrebellion\npremiere\nphysician\nha\nvictim\ncon\ncloud\nangels\nnoise\nheading\nduo\nbeer\npalestinian\ncopper\njurisdiction\nimplemented\nimprovements\nski\npeaked\nhms\nloop\nrenaming\ndrum\ndramatic\nsaskatchewan\ntalks\nearthquake\nrhode\nhat\nrequirement\nden\ntanks\npresidents\nsocieties\nmin\ndefending\nalcohol\ndominated\nsang\neat\ngraphics\nconstituencies\nasp\ncoffee\nbatting\nchancellor\ndestroy\ntons\ncruz\nwarsaw\nexclusively\nconnections\nrush\nheights\nplaystation\noutcome\napartment\ncardinals\nfill\nrecipient\ncorrectly\ntraditions\nfundamental\ncopyrighted\nthin\nchan\nresolved\nmario\ndepartments\ndame\nthereafter\nshield\nlowest\nfighters\nivan\nwritings\nbosnia\nsentenced\nviolent\ncaption\nharbour\nmargin\nauckland\npostal\npirates\ncollective\ndiesel\nliberation\nconfederate\ndevil\nactivists\nsultan\nrider\namazon\nflorence\nmarc\nwider\narnold\nshah\nsi\nblogspot\nreduction\ncontents\ngenetic\nsomerset\nlocally\nmilk\nromance\nlacking\nintellectual\nlatino\nfailing\nmason\npete\nadvisory\nes\narbitration\ninterface\nhitler\ndefault\naccessed\nlifetime\nsheffield\ndeparture\nhindi\nanglican\nsuggesting\nmistake\nresiding\nremainder\nraising\nembassy\nmurdered\nsox\nsleep\nsuspended\nsum\nmythology\nbengal\nconfusion\nbakhsh\noscar\nprogrammes\ntherapy\noccasion\nexposed\nassisted\npossession\ndefend\ndevoted\ngraphic\nwarfare\nmilwaukee\ninformed\nanonymous\nreverse\nsoap\nterritorial\nlisa\npaulo\nnorthwestern\nplayoffs\nboss\nnasa\nsockpuppets\nquoted\nbyzantine\nidaho\nposter\ngeographic\nrebounds\nho\ncongo\nventure\ncricketer\nworse\nhoax\nrestricted\nadvocate\ndoors
\nnaming\nsituations\ninstructions\nsullivan\ntables\nleaf\nshoot\nsubstitute\nrestaurants\ncontributor\nspoke\nerrors\nenjoyed\nframework\nrocky\nkerala\nshakespeare\nquantum\nimmigration\nmirror\ncertified\nassets\npotentially\npresentation\ncotton\nsitting\ntournaments\nsyndrome\nchecked\nforty\naimed\nsourcing\njournalism\nupload\nunsuccessful\ntowers\nconductor\nhospitals\nbone\nessex\nrebuilt\nwellington\nideal\nraw\nsharing\nlabels\nleonard\nwatson\ngovernors\nposting\nharvey\nbases\nhello\nrabbi\nhardware\nensemble\nmonster\npitcher\nemphasis\nirst\nrecovery\nrespond\naaron\nlesser\nqualification\norganic\nexposure\npalestine\nthoughts\nneeding\ndrafted\nmaurice\nimmigrants\nvariant\nlap\nlegitimate\nautonomous\nwallace\n†\nsuccession\nthrow\nmonday\nreserves\ndonated\nincreases\nkid\nguidance\ndelivery\njoan\nfifty\nslave\nfeedback\ncolumbus\nstones\nmanage\ncgi\ninitiated\nfavour\nprinting\nvariable\ntheology\ntodd\nparameters\ntraveled\nmd\ncanton\nhan\nreed\nceltic\ncharacteristic\ncommanded\nsearching\ninappropriate\nswitch\nties\ntube\notto\ndebt\noutdoor\nnavigation\neligible\nexperts\nexpensive\ntier\ngospel\nnewton\nessays\nshanghai\nconventional\ncampaigns\nformations\nfeelings\nbath\nvenice\ncats\nvariations\nemerged\nsocks\nch\nconnecting\nflood\ndocumented\ncustom\ntouchdown\nprofession\nlayout\nacademics\nsettlers\nmerging\nsony\ncompetitors\nphillips\nhasn\ngrass\nreservoir\nartificial\nnovelist\ntip\nprague\nabu\nfaces\nguitars\nlaura\nfellows\ninternationally\nattacking\njohann\ndreams\nhughes\nsuburb\nunderstood\nspecialized\nwarned\npearl\nchorus\ndependent\nrestrictions\nkiller\noakland\nromanized\ntrio\ninfluences\nblocking\nmtv\nwikipedians\nà\ncattle\ngear\ngabriel\ntraded\nskating\nfifteen\npalm\nwikis\ntale\ndemonstrate\nvary\nliquid\ncycling\nprinceton\nrespective\nvoices\nfaster\nfriedrich\njet\nhorn\nerected\nburning\nworker\natmosphere\ncharacterized\nsyrian\nwelfare\njava\nmonitor\nye\ngraduating\ncolumns\nreportedly\nrepair\nbi
n\nstick\ndollars\norganised\nparameter\ntruly\nresolve\nbuenos\nparade\nbacked\nawareness\ndepends\ndefine\nspencer\nrepublicans\nconspiracy\ndies\nclarke\nrough\nengage\npine\nequation\nfeels\ndemocrat\npermitted\ncutting\nbutton\nattending\nbrands\nqueens\nabraham\nneck\nforever\nest\ndrink\nsheriff\nmiguel\naires\nirrelevant\npoorly\nmontgomery\nvanity\ngift\nriders\nfunctional\ncrossed\ndiverse\nru\nnumbered\nquotes\nslowly\nxi\nattitude\nmouse\njustin\nprotests\ngods\namounts\nvariation\nsmart\nprices\nprayer\nterrorism\nbeta\ndurham\nhouseholder\ncounts\niraqi\ndetective\njosh\nplacing\nsomebody\nlinking\ncompositions\noval\nextensively\nfilming\nperfectly\nmw\npr\nindianapolis\nfn\nfuneral\nrecovered\nsoutheastern\nfarmer\nprotestant\nlt\ncameron\nfocuses\nranging\nunclear\nindonesian\nmixing\nmumbai\nnashville\ndanger\nrally\nnarrative\ncamps\nsurprise\nmanufactured\ndeployed\nkate\nsolely\nmolecular\nunnecessary\nisle\ntheorem\nhomepage\ncolonies\ncyprus\nwake\nbrings\nwinds\nmagnetic\nconversation\nsussex\nab\nfl\nfastest\ngates\nram\nplastic\nelectronics\nrestore\nstockholm\ninn\nbuses\nconnect\nforth\nguests\nradiation\nreceives\nlancashire\nplayoff\ncork\ngenerals\nintermediate\nba\nverifiable\ncheers\nfilipino\nreaches\noriented\nhamburg\ncreates\norbit\nmassacre\ndialogue\nillness\nwc\ndress\ncodes\ndawn\nisolated\nnancy\nviolations\nperth\ntenure\nladies\nautumn\nratings\nincorrect\nscout\ndifficulty\npupils\nwealth\nhart\ntoured\nallegations\nregulation\nwatching\nlodge\neggs\ndisputed\ncitizenship\nspecialist\ntasks\nintent\ninstruction\nceased\npride\nbanner\nfriendship\npanama\ncorruption\nsunk\nharm\nernest\npilots\npursue\ntape\nemigrants\ncancelled\nrevenge\nrevision\ndominant\nfee\ncomputing\nexamination\nchen\nmatrix\ndas\nbiographical\nkiss\nnationalist\nluck\ncrosses\nheavyweight\nbid\nappreciate\nce\nenemies\nmercury\ninteractive\nmath\ndebuts\npreserve\nnobel\ngrande\nkeeps\nstructural\nmarry\nairports\nveterans\nairline\naxis\nexecutio
n\ncult\nreducing\nsp\ncolin\nchester\nticket\nbelonging\nim\nentity\njudicial\nexplicitly\nbombing\nrecognised\noriginated\napplicable\nfounders\nfitted\nwilhelm\nsuddenly\nparking\nabsolute\nfrançois\nlocomotive\npreparation\nnintendo\ndeclaration\npresumably\nburial\ngoverning\njamaica\nknowing\nvladimir\nbeating\navg\nmethodist\nutf\nchallenges\nkenneth\nevolved\ncelebration\ndiscipline\nbearing\nbelonged\nfauna\nmanuscript\nexperiments\nchiefs\ncompound\ntampa\narabia\nassociations\nequally\ndealt\nshut\ntargets\nalien\nwithdrew\ndepicted\nsergeant\ndiffs\nsubsidiary\nthirteen\nthick\nextend\ndismissed\nneo\nwire\nphd\nmeasured\nfat\nvisits\nlinux\nteach\nflights\nverse\nbennett\nwarm\ndynamic\nshaw\nbreaks\ngrandson\nmonuments\nlying\nlords\nmichel\ntreat\nraid\ncongregation\nshorter\ntemperatures\ntestament\ndrinking\ncompanion\nmanila\nkm²\npunjab\nimagine\nconsideration\nveteran\ndoctors\neldest\ncarries\nruler\nwise\nshipping\nafc\nworthy\nregistration\ndirectory\nwyoming\nmanitoba\nvietnamese\nronald\ncuban\nburns\njustify\ndivine\nsuppose\nsequel\nfate\nrovers\ncole\noral\ntrans\ndeemed\nboards\nspan\nbryan\nsantiago\nepiscopal\nterrorist\nokay\nwaves\ninvented\nlanded\nsandy\nacres\npaint\nactively\nindication\nstops\nexcellence\nintegration\nthinks\nbibliography\nfarming\nnonsense\nmarathon\nbeliefs\nredundant\nfreestyle\naerial\n€\npreservation\naltitude\nfreely\nlandforms\nsimultaneously\npsychological\nfernando\ncultures\ntaxes\nmarcus\nstakes\ndominican\nfranz\ncoins\noxygen\n^\nincumbent\ncivic\nhardly\nisaac\nspell\ncraft\ninspiration\npairs\nvector\narc\nprofessionals\nvii\ncontrary\naccusations\napproaches\nbaronet\nslaves\nmad\nspectrum\nclient\ndozen\ntravels\nsymbols\nplaza\nbanking\ninherited\nlegion\nsymptoms\nmosque\nguys\nlab\nsailing\norientation\nvirtually\ngeneric\nreasoning\nstroke\nunions\nefficient\nopens\nimpression\ndiscover\nrelocated\nnovelists\nroosevelt\ndancer\nphenomenon\npreliminary\nrecognize\nanchor\narguing\nabilities\n
procedures\nemotional\ntimber\nfisher\nprod\ncartoon\ndisorder\nfled\ndemands\nlithuania\ncontinent\nfellowship\nlock\nrelegated\nwarrant\npictured\nrecurring\noverview\nwealthy\nacquisition\neve\nfilter\naddresses\nindependently\nslovenia\nobservation\nroster\ndisestablished\nchallenged\nthreats\nfallen\nprotocol\njudgment\ngrammy\ncolours\ndistinctive\nnamely\nopposing\nlandmark\npackage\ncontrols\ncompleting\nsabha\nprisoner\nsignals\nowen\ncapita\ninaugural\nintervention\narriving\ncylinder\ntenth\nliu\ntested\nrenowned\nshops\ndome\nphilosopher\nepic\nstem\nspecified\nkinds\ndavies\ncollapse\nallan\nsight\nalbanian\ncanyon\nsamples\nperceived\ncelebrity\npriests\nlouise\nworkshop\nherzegovina\nclaude\nfortune\nbars\ncornwall\npalmer\npresidency\ntiny\nfk\nappeals\nistanbul\nhp\nrookie\nexpanding\ncalgary\nshock\npulled\nstevens\nemployee\nyang\nhoused\ntomb\nearning\ninnovation\nstreams\nunity\nlucas\ngrows\narmenia\ninterchange\nsized\nproteins\nproposals\nswimmers\nmainland\nseminary\nhamlet\ntimeline\nrealize\ncoup\nnewport\nnegotiations\nexhibitions\nmalta\nhate\nwestminster\ninstallation\nenters\ngoalkeeper\njulian\nmorocco\nefficiency\nchapters\naboard\nhelicopter\nfewer\nfortress\nani\nburned\ndisplays\ncompiled\nips\ncontributors\ntorpedo\ngiovanni\nchat\ncatholics\nherald\nchuck\npit\nsupplied\noptional\ndesk\ngarrison\nsprint\nexile\nsurprised\nachievements\nbiblical\nrebels\nte\ndenis\ngeographical\nsit\nalpine\nbills\nglacier\naa\nbinding\nindicating\nestonia\neating\nsaving\nchi\ndeveloper\nindie\ndifficulties\ndoctrine\nworn\nfork\nsimpson\nmoreover\nmaintaining\ntheological\nupcoming\nvocalist\ntemporarily\nhotels\nedmonton\ndevelopments\nliteracy\ncurrency\nmissionary\narrives\nhammer\ndollar\nambassadors\nreverts\ntwitter\ncentres\nsolomon\nrecommend\ndescendants\nruth\nhandling\ncustoms\ncollect\ngrid\nsecured\ncertificate\ndestination\nalbania\nindies\neuro\nconsumption\nfeat\npushing\nconstantly\nsurvivors\nmansion\ncardiff\ntemples\nblake\n
sheet\nlift\nconfidence\ncuisine\nfrankfurt\ngalaxy\necuador\nbreeding\noutbreak\nlegendary\nhandball\ngeorgian\ncopenhagen\ntrek\nignored\narch\nkeys\nproceedings\nenjoy\nquartet\naims\npropaganda\nwu\ndisk\nrealized\nne\nneat\nfunny\npunishment\naccuracy\nbusinesspeople\nmeter\ntheoretical\nsuspension\ngraduation\nflew\nseeds\nlighting\njennifer\nsmooth\nah\ncustomer\narmstrong\nsouthwestern\ninvolve\nphilosophical\nescaped\npowell\nkills\ntaste\nallmusic\nrequiring\nbros\nassertion\nboulevard\nnortheastern\nbrooks\nsending\natomic\nantarctica\nstrikes\nreconstruction\nchronicle\ntraveling\nleslie\nellis\ndevon\nghana\ngen\nrebel\nduncan\npianist\ncanon\nnc\nreformed\npack\niceland\nsolve\ncyclists\npayment\nsuburbs\nmilitia\npronounced\nexhibit\nmph\nglen\neugene\ncompromise\ntactical\ndiscovers\nswitched\nuganda\njail\nyeah\ntownships\nsomehow\nwithdraw\nholmes\npromise\ndeals\nconvert\ndos\nafternoon\nnoting\nrecall\narrive\nwarrior\nmammals\ndimensions\nsurrey\ngaming\nlutheran\nports\namy\nsurvival\nresponses\nbadly\ncollegiate\nscandal\nwidow\nswing\nnights\npolo\nlinda\nadr\nconsist\nprobability\nfarms\nconferences\nzhang\ncrazy\nwitness\nnephew\nsensitive\nmutual\nhd\ndiet\nclients\nfringe\npassion\nrings\nstronger\nmillions\ndialect\norlando\nundergraduate\nrelay\nwet\ncruise\nhenri\npublish\njoy\njulia\nkitchen\nabstract\nsnake\ncomedian\nmotorcycle\nnadu\nreverting\narsenal\nmillennium\nassists\nthereby\nbow\nandré\nserie\ndimensional\ntravelled\neurovision\nfiring\nsuite\ndoug\ngravity\nstored\ndeparted\noptical\nfrontier\nevaluation\ngraph\nhybrid\noslo\nearn\nmetre\nkeyboard\ninducted\nnearest\njamie\ndecorated\ncomplicated\nnathan\nslavery\ncircular\noperators\narmor\nmechanics\nbradford\nleon\nrachel\nfootage\nstrings\nheader\nhood\ninspector\nwarnings\nrelatives\nplains\ndefended\nwheels\ncriterion\nace\narrangements\npenn\napproached\njoke\nsailed\nreligions\nauthored\ngrants\nandrews\nmoderate\nstolen\ntributary\ncommanding\npin\ncarol\nowns\npr
ototype\ncopied\ncanterbury\nmidnight\nquarterback\nduchy\nbailey\narbitrators\nperformers\nhandled\nexploration\ndiversity\nsixteen\nfindings\nrepeat\nbrussels\nimdb\nplanets\ntheatrical\nreconnaissance\nshots\ncomplaint\nbatman\nexhibited\nespn\ninvestigate\nverify\ndiscontinued\nabsent\ngirlfriend\nresignation\nfossil\nexplaining\ntang\ninches\nproven\nyu\nfranco\ndying\ntribal\ntyler\nsurrender\nglenn\nsubstance\nfocusing\nluxembourg\ncolored\nscholarly\nadministered\nexplosion\npushed\ngenerations\nduck\nporter\npermanently\nmemphis\nsalvador\nemma\nmit\nzoo\ngibson\nwording\nemerging\nmere\nnotre\nportions\nmacedonia\nethics\ndepot\ncurtis\nrescued\ngaelic\nslovakia\nelevated\njeremy\nlisten\nimpressive\nbradley\nsurely\negg\nconquest\nrod\ncdp\nalgorithm\nburn\nthesis\nlover\ncapitol\ncomprises\nremembered\nferdinand\nmarshal\njudaism\nballs\nnacional\nwrestlers\nahmed\nsin\nholocaust\nedgar\nsaxophone\nretain\ncurriculum\nwishes\nprepare\nruins\nibm\nrochester\nnigerian\npitched\njesse\nmalaysian\natlas\ntelegraph\nperformer\ncannon\nencounter\nemily\ndissolved\ncatalogue\ndiscrimination\ner\nrefs\nmyspace\nreveal\nwizard\nteen\nspots\nbomber\nfoods\nquest\nconnor\nscreenplay\nmotors\nminimal\nmuscle\nprestigious\nsustainable\nchelsea\nstrict\nkingston\nsheep\nandrea\ncomplaints\nxs\nnée\nconnects\nnursing\ndefenders\nrichardson\ntriangle\nnato\nteeth\noccasional\nstrictly\nharper\nfluid\nbigger\nfed\nnewfoundland\ndisbanded\ncomparable\ndocumentation\nbrien\ncompounds\npointing\nedmund\ninstances\nnaturally\nforcing\nussr\nlaser\nlat\nsculptor\nguild\nobserver\nworlds\nimprisoned\nwrestler\npraise\nparishes\nbones\ncss\ncox\ncontracts\nconsequences\nprovisions\ncirculation\nbutterfly\nhugo\nabolished\nalgeria\nedu\nsufficiently\narmies\nseparation\nspy\ncliff\ntechnically\nreactions\nlithuanian\ntrick\ncurve\naccidents\nhorizontal\nuploader\nlegends\nenzyme\nfreight\nlacks\nhydrogen\nbroadcasts\nviii\ncaroline\npull\nplymouth\ntwentieth\ncuts\nmediation\nai
rfield\ncatalog\ndale\nsynthesis\nrape\nseoul\nengagement\ncoin\nlucy\nconsequently\nplatinum\ntwins\nmemories\nrobertson\nverified\nanthology\nmilton\ngeological\ndefining\ndinner\nhosting\nthriller\nretreat\nalbany\nabdul\nignore\nmigration\ncarefully\nmagnitude\nsudan\nclosest\nmanages\nduration\nhenderson\nexplorer\nmarco\nfusion\naids\ngathered\nprivately\nreflected\nafraid\npresbyterian\nautomobile\nestates\nfault\npound\nallegedly\ndelay\ndevelopers\nsemifinals\nbelfast\narctic\nps\nkurt\nmayors\nwindsor\nassumption\nplates\nfourteen\nnominee\ndisruption\nmonroe\nhearts\nbelgrade\nvictories\nextending\npale\npursuit\nglory\ndestroyer\ndeeply\nlectures\naffiliate\npreston\ndeceased\nspeaks\ngathering\nangry\nincomplete\nenrolled\nconfiguration\nbrad\nskill\nintense\ntasmania\ncommitment\nloved\nreforms\nrulers\nuruguay\nsustained\nnapoleon\nconfirm\nbreed\nauxiliary\nenabled\ndiscography\nlicence\nrefugees\nadrian\npipe\nkaren\naltered\nbudapest\ndesigners\nfe\nheir\nadvisor\nillustrate\nauthorized\nhide\nannouncement\ncompact\ntissue\nparticles\nrefuses\nreceiver\ncivilians\nmarsh\nvinyl\ndelayed\nunrelated\nencountered\nwednesday\nchecking\nchilean\nhey\nchambers\ndemo\nnationwide\nagrees\nahmad\nsantos\npaying\ninterpreted\nsubmit\ndesired\nfollowers\nobservatory\nproblematic\nspringfield\nkit\nremarks\nburton\nmo\ninns\ncoached\nmonarch\nobservations\nfootnotes\nbeetle\npromised\npalomar\ncream\npresenter\npotter\nsu\nfavourite\ntransformation\nmcdonald\nbavaria\nkumar\nnineteenth\nseverely\ngaining\nmixture\nbrowser\nendangered\nmate\neverybody\nlyon\nillustration\nkyle\nafl\nbrook\ngeometry\nping\nextends\naggregate\nvariants\nbaroque\niso\ncollapsed\nneighboring\nintegral\njake\nhopes\ncornell\nmodes\nservant\ngt\ntd\nkenny\nhurt\nmk\nmaker\ninline\ncarlo\nlynn\nstability\nhoping\nbeneath\nimposed\nconfusing\nmt\nsummaries\nbeetles\njoel\ncf\njets\nlogos\nvital\nmalcolm\nwinnipeg\nkilometers\nsongwriters\nbuddhism\nnose\nrespected\npace\nthunder\ncenter
ed\nphysicians\nbolivia\nforget\nimplies\ncrops\nhalifax\ntoll\nmonk\nextraordinary\nlessons\npub\nparalympics\nmonte\nmaría\nsegments\ndeer\nwireless\nwhenever\ncommenced\nmysterious\nconsultant\nfraser\nformats\njam\nchicken\nenable\nidol\nreid\nbirths\namazing\npet\nupset\nloves\nstretch\nnominate\nstriking\nstriker\naccidentally\nlouisville\nhopkins\neds\ngoddess\nburials\nresumed\nsatisfy\nnotion\nvoltage\nbetty\nmarion\ngeology\nconsistently\ncyclone\nexport\nlightning\nimpressed\nmaintains\nlogical\naggressive\njin\njulie\nfbi\nyankees\nludwig\nfi\npond\nsuburban\nenlisted\nmoments\nconjunction\ninterim\nargues\nlucky\ntargeted\nlon\nspeedway\nregiments\npicks\nprevented\ntoy\nbicycle\npurely\npd\ninteractions\nfraud\nlang\narcade\nlecture\nsanctuary\ndragons\ncopa\ncareful\nnurse\nrivals\nmodule\nsupplement\nlens\npatron\ncommands\ntrend\nsuperintendent\ngerald\nrap\ngeneva\nash\nblade\ndisappeared\narray\npatrolling\npredominantly\ncommittees\nloose\nboom\nsailors\nbeaten\nsmoke\nassassination\nlancaster\nreynolds\ndivorce\ndust\nsaxon\nhealthcare\nseparately\ngrain\nexecutives\ntranslations\nzimbabwe\nthrown\ncohen\nputs\ndiving\nneighbouring\ncarroll\naccounting\nmesa\nprussia\nintelligent\ncherry\nunderlying\ntobacco\ncleaned\nvarieties\nbench\ndirections\nellen\npadding\nmeasurement\nparadise\nalexandria\ncomplement\nwitch\nattraction\ndiana\npersonalities\ncolleagues\nbusy\ncia\nscreenwriter\nrankings\naboriginal\ncommanders\nsalem\nwagner\nfirms\nsanctions\namericas\nendings\ninstructor\nnobility\ndivorced\nvaries\ntomorrow\nmanuscripts\nunified\nclarify\nscouts\ninvestigations\nsilva\nderek\nagenda\nprovision\nhumanity\nadmit\nterror\ncontestants\ntrinidad\ndistant\nburke\ncircles\nassignment\nreleasing\nrecalled\nshrine\nsail\nwillie\nkarnataka\ncelebrate\nranch\njo\ncollaborated\nvampire\nunfree\nplaywright\nsick\nassociates\nheinrich\nethiopia\nflags\ntel\ndrove\nlearns\nshorts\ndrives\naccomplished\nautobiography\nrecruited\nuprising\nedwin\nvelo
city\nterminology\nraiders\ncoordinates\nbrighton\nviola\npara\nmorrison\npropulsion\nboxer\nfinale\nsh\nshoulder\ndisabled\njoins\ndiv\ntactics\nernst\ninnocent\nrapper\nsettle\nprivacy\nboeing\ncites\nbunch\nemmy\nindo\ndistinguish\nrosa\naccordance\nthermal\nflute\nmarines\nfeminist\ntrustees\nsculptures\nnationally\nbacteria\nintroduce\nlandmarks\ndisorders\nrivalry\nprevention\nhonored\nhealthy\ncircus\nspeculation\nburma\nsec\nka\nar\nquiet\nknee\ndeliver\nthrew\nhypothesis\nreferendum\ntravelling\nestonian\npastor\nsofia\ntribune\nlasting\npermit\npriority\npounds\ncent\nconsequence\nrica\nconducting\nfurniture\nmacdonald\nhonest\ninnovative\nestimate\natp\nrotation\nsyracuse\nlecturer\nautomated\nobscure\nkosovo\nclassics\njulius\nappreciated\nnaples\nsebastian\nactivated\nvaried\noffense\nadvised\nbarnes\nacknowledged\nexceptions\nmartha\nquarters\ndrawings\nbarely\nhitting\nrefuge\nmaharashtra\nconventions\nelliott\ndiplomat\nunused\nsearches\nbrigadier\nparticle\nmalayalam\nthursday\nicon\nulster\ngenes\ninfinite\nconsiderably\nvale\nportraits\npaste\nrandy\nec\nsaxony\nconvoy\nannie\nexcessive\nbelieving\nrhine\nmineral\nimplement\nsurgeon\nbadge\ncharleston\nclause\ninfection\nelectron\nwalt\ncnn\nlikewise\ntonight\nconfederation\naccommodate\ncasino\ndoctorate\naux\nguatemala\nsettings\nmask\nshelter\ndorothy\nethnicity\nhopefully\nelimination\nheath\npregnant\nrichards\ntheodore\ndelegates\nblair\nfac\nphrases\ncrashed\npreference\njaneiro\nconcerto\nheadquartered\nbits\nconstruct\ntune\nunofficial\nbulk\nlighthouse\nstan\nhighland\nmascot\nsquadrons\nacceptance\ntight\nconsiders\nai\nhub\nmess\nwilderness\nroutine\nreviewing\ndubbed\ndozens\nspotted\nharmony\nentrepreneur\nwwe\napollo\nrunway\nji\nnaked\nanton\nmoses\nlegally\nwa\nnominating\nfake\nbiased\ndivisión\nrevolt\nequity\nvarying\nprovidence\ninvestors\nreliability\ntenor\nfights\npocket\nsad\ntroy\ntreasure\nion\nrendered\ntransformed\nroberto\nnn\nadoption\ndecrease\nreserved\nforgotten\n
lok\ncrop\nlicensing\nadvocacy\ncollecting\ntreasury\ntrumpet\njohnston\nuncertain\nnorton\ncollector\ncluster\ndear\ngeorges\nroller\npt\nclothes\nsovereign\nenhanced\ncompensation\nconsent\noutline\nholdings\njorge\ndarkness\npenalties\nsk\nbombers\nhometown\nholes\nblow\ncooking\nvfd\naftermath\ntrainer\nrican\nmeasuring\nlawsuit\nretiring\nchip\nconsciousness\narchaeology\nlatvia\ntelugu\nblogs\nprotecting\nhardy\nnicknamed\nscorer\nstamp\nnat\nfur\nredirected\nestimates\nlit\nritual\nlocality\ntrace\nmarble\nfoundations\npolitically\nnottingham\nderivative\nboxers\ndimension\ntouchdowns\ncrawford\nbats\nyugoslav\ntanzania\nsucceed\nmotto\nstreak\nconcentrated\ndirty\nhayes\nxbox\nidentifying\nlikes\ngenres\ngalleries\nforbes\ncouncils\nadequate\nbrass\nbach\nalias\ninland\ncounsel\nwore\ncomprising\ntough\nadvertisement\nprotagonist\ntrails\ndemanded\nclaire\nmistakes\nbruno\ndylan\nbag\nthrowing\nchurchill\ntan\nspelled\nclimb\nwitnesses\nstoryline\nthames\nanybody\nkazakhstan\nbots\nconstitute\npresenting\nhighlights\njumping\nprof\nslovak\nskull\nmissionaries\nordained\neventual\nhoped\nmyth\nmandatory\nstern\nfees\nbet\nmonks\ndancers\nquantity\nendorse\ninventor\ncairo\ngraves\nproximity\nseemingly\nsue\narmament\nbarrier\ncreatures\nlogan\nje\nerik\nleicester\nds\nsilence\njessica\nplateau\nfinite\nprecedent\nstationed\nwalsh\nzones\nintensity\nexterior\nmurders\nparagraphs\ncostume\nbike\nneighborhoods\nimprisonment\nsuffolk\nforwards\nremarkable\nundelete\ndiffer\ntin\ngarcia\nmadagascar\ncameras\nammunition\nfires\nviewing\nexplore\nbuilder\nminneapolis\noccurring\nbullet\nkerry\nsubway\narrow\neconomist\nbread\nlou\nstrategies\nrubber\nprecise\nrifles\ncognitive\ngovernorate\nnest\nslam\nancestry\nportsmouth\nmiscellany\nconvince\naudiences\nboarding\nbonds\njoshua\ninhabited\ncasey\nattract\nnonetheless\neb\nkilometres\npump\nfeeding\nprey\nain\nmathematician\ndiary\nvulnerable\ninscription\ndubai\nmichelle\nlebanese\nproductive\nguided\nhappening\na
ccordingly\ngp\nresearcher\ndella\nbaden\nupgraded\ndemonstration\nequality\nphilosophers\nspacecraft\ntrap\ngb\nclara\ninvitation\nmarking\nexpertise\nadmission\nsacramento\ncertification\nprecisely\ncasting\nreassessed\nsubmarines\nprohibited\nsupposedly\ngovernance\nsometime\nfrog\nvague\ntackles\nmhz\nsecular\ntracking\nspa\npublicity\narmoured\ncleared\nwatts\ngibraltar\nrenewed\nreflects\nfever\nmelody\nsupporter\nelaborate\njeffrey\ndiscusses\nsurnames\nuseless\nswift\ntuesday\nsilly\nempress\ncapabilities\nnewman\nscales\nonwards\nbeatles\nko\nclergy\njacksonville\nsara\nlifestyle\nbee\nholders\nbaltic\nczechoslovakia\nbrandon\nloaded\nmaya\nevangelical\nenterprises\nimo\nmature\nphysically\nsequences\nbreast\nbeast\nraja\nfür\nanymore\neducator\nbang\ngriffin\nrhodes\npreparing\nproportion\nenrollment\nitv\nana\nceiling\nrainbow\ndemon\nprussian\nequations\nanswered\n←\nist\nperception\ndistributor\nentities\njackie\ndynamics\nfiji\ninsufficient\nalgebra\nhomer\nlarvae\nlimestone\njohns\nbce\nchaos\nchang\nlayers\ncrater\nki\nchad\ndb\nseized\nwebster\ndepicting\nexcess\nbombardment\nhurling\nashley\ndot\ngif\ntranslator\ncowboys\ncounted\nhanging\nsoprano\ninterviewed\ngovernmental\nworkshops\nterrain\nbelarus\nliked\nconsole\nnascar\nteenage\napplying\nvandal\ngraduates\njungle\nballot\nplacement\nfairy\ntourists\nreasonably\nperforms\nquarterly\nshifted\nromans\nrpm\ndiploma\ncirca\nenvironments\ncollaborative\nswan\ncarpenter\npetition\nboris\nberry\ninvention\nsouthampton\nprairie\nbend\napp\nfinalist\nquestioned\nexplicit\nbo\ndraws\ngoverned\nslight\ndrag\nmaxwell\nquarterfinals\nplanes\neveryday\noriental\nmanufacture\nairing\nacclaimed\ncoordinator\nbombs\nmohammad\nbassist\nsuperman\ncolombian\nphilippe\nfelix\nbengali\ngreene\nvoluntary\nfloating\nmontenegro\nsketch\nlo\nmann\nflooding\nescort\ndressed\nastronomy\nsudden\nvariables\narbitrary\nskiing\ntimothy\ncello\nrainfall\nrafael\nsphere\nought\nrewrite\ngeorg\ncinematography\ncanvas\nchest\n
krishna\nprovider\nva\nfrances\ncrowned\nwanting\ncarved\npoles\ncabin\ncivilization\nbroader\navoided\npl\nlisbon\ntongue\nendorsed\nnewer\neliminate\ngng\npanels\ndarwin\ncheese\neaster\nrat\npapua\ninsert\ndescriptions\ndebates\ninformal\ncastles\ncry\nloyal\nsurfaces\nnicolas\ninstitutes\nhumor\nmadonna\nworcester\ncooperative\nsubstantially\ntrips\nwinston\nintroducing\nllc\ndrake\nlunar\nfountain\nconsulting\ntends\npatricia\ners\ngarcía\nquit\nrid\nmaple\naberdeen\nverb\nillustrations\nmechanisms\nreds\nposters\nunderwent\nleipzig\nsanto\nengaging\nroland\nexpelled\nevident\nstaged\ntelecommunications\nmc\nstrait\nnationals\nunanimous\nmisleading\nsherman\nstefan\nsleeping\nalto\ncrucial\nthomson\ndirecting\nyo\nprehistoric\ncommunicate\nalphabet\nzagreb\nfu\nchef\nix\ntubes\ncarnegie\nhostile\nprizes\neighteen\nshirt\nweird\nnamespace\npursued\nloading\nedges\nplantation\nvernon\nacre\npracticed\nwonderful\nmissiles\nbr\nguides\nfinger\ngarage\nsavage\ntechnological\nrely\nrises\nshoes\nclimbing\nbarrel\nbiographies\nspite\nassembled\nham\nhon\ngeoffrey\nvalve\nhistories\ncommenting\nosaka\nbeck\nanalog\nmonkey\ntackle\nlistening\nintegrity\nsits\ndefinitions\ncritically\noperas\nbaldwin\ntroop\nobjections\nmarina\ndetection\nsixty\ndifferential\nlinguistic\nvenus\nsalmon\nmonarchy\nstade\ndepicts\nfrancesco\nabortion\nmonica\nelephant\npolar\nprompted\ntrademark\nton\nprovisional\ncounting\ntu\ncorn\ntrucks\nsank\nairborne\nlengthy\ndeutsche\nabsorbed\npar\ntension\npablo\naviv\ncleaning\njudgement\nphysicist\ncatalina\npreventing\ndisasters\nni\nlesbian\nrays\nwithdrawal\nwalks\nrealm\ntrailer\njanet\ntornado\naunt\ndistribute\ngenesis\nshallow\nmakers\nmentioning\nrequesting\nfloors\nviolate\nscattered\nboyfriend\nconsolidated\ndetermining\ncatholicism\nluther\npleasure\nseeks\nconstructive\nwebb\nbattalions\nalberto\nmagical\ndeliberately\nsacrifice\nfaction\nolive\ncorridor\ncompatible\nreceptor\nmolecules\nkiev\nlb\nvista\ncastro\npreceded\nwei\npleasa
nt\nmiscellaneous\nlineup\nultra\nnowhere\nbulletin\nclinic\nitunes\nbatted\npatriots\npollution\ntreasurer\n…\nherman\nquestionable\nsally\nrex\nmvp\nstuck\nproceeded\nopenly\nafghan\ninferior\nvatican\nfolklore\naffiliation\ncruiser\nprinces\nsilk\nagreements\ntrilogy\narmour\neng\nstamps\nharder\nspecimens\nconsumers\ndifferently\ntallest\ndefines\ncage\nvicinity\nferguson\ncorrespondence\nillustrates\ntehran\ncheshire\npageant\namanda\npos\nbye\nemployer\nbrowns\nstatic\ntibetan\nnixon\nsociology\nrecreational\ncapability\naddressing\npanthers\nobjection\nsystematic\nviolated\nnl\nchocolate\nhelsinki\nbedford\ncambodia\ngrandmother\nrounded\nfitness\nhalls\nmidlands\njung\nbrilliant\nlibya\ntransported\nangela\ncreature\ngather\nvolcanic\nandroid\nfilmography\ndeposits\nmentor\nfreeman\napartments\ncsd\nuci\npuppet\ncarey\nhiding\nsends\nmeyer\nstrongest\ngameplay\ntopped\nlima\nfreeway\nkirk\nbatteries\nchains\nucla\nrecover\nwong\near\nservants\ncombine\ndecent\ncubs\ndeputies\nhawk\ncontestant\ncumberland\nrational\naltar\nufc\norganize\nexhibits\nsubjective\ndetermination\nchennai\ndong\nthreshold\nmyanmar\nproud\ncustody\nsurveillance\nng\nthreatening\nallegiance\ndischarge\npenny\nra\nkannada\nastronomical\nrector\npetroleum\nbug\nproceed\nchapman\nmud\ndecree\nregulatory\nhoney\ndiagram\ninfringement\nbroadcaster\ncomplexity\nriley\ndisplacement\ncr\nsponsor\nira\ncondemned\ncareers\nhaiti\nwisdom\npierce\nstupid\nalma\nalexandra\ncontinuously\nshuttle\nhampton\neva\nmickey\natlético\ngastropod\nrelegation\nyuan\nsv\nlynch\njohannes\necclesiastical\ndestroying\ntrunk\nproclaimed\ngenuine\nadolf\nsharon\nacquire\nguitarists\nhung\ncorrected\ntheatres\ngardner\nartifacts\nhandful\nmandate\nlocked\nmohammed\ndish\nterrible\nrevived\nreagan\nrebecca\ntoledo\nindefinitely\ndivide\nrecommendations\ngymnastics\nalter\nideology\ncarriers\nconstantinople\nfatal\nmodeling\nmonsters\nnotified\nbulls\nharassment\ncensorship\nfavorable\nbore\ncommercially\ninserted\np
aramount\nluxury\nyahoo\necology\nreviewers\nemirates\nneutrality\nchronicles\ndeployment\nlepidoptera\nblast\njenkins\nomaha\ndetected\nattribution\nharsh\ncn\nbundesliga\novercome\nmonarchs\nfacilitate\nhermann\nmorton\nya\nsake\nconceived\nroma\naccredited\necho\nbangkok\nachieving\nbergen\nhawks\navailability\norganizing\nlung\nandreas\ncedar\nstake\nbankruptcy\nflagship\nlankan\nunassessed\nbasement\nspirits\ntorture\ntoyota\nwheat\nexamined\nmembrane\nknockout\ncoleman\nhomeland\nmaiden\ndhaka\ndental\nchristine\ncurious\ntucker\nevolutionary\nresponsibilities\nviscount\ncommunists\npavilion\nfantastic\nceremonies\naugustus\nloans\naltogether\njoey\ncorporations\ntagging\nmoss\nparody\nyield\nbeside\nsizes\narmored\nneighbourhood\ncheap\nstuttgart\ndialects\nbackup\nforbidden\nsanskrit\nneill\ncouncillor\nknife\nwade\nhumanities\npremiership\ngonzález\nyesterday\npercy\nexpressway\ntransform\nvegetation\ntickets\nsf\ndams\nadopt\nbraves\nthrows\nplural\nfilmmaker\npomeranian\nhabitats\nsicily\njustified\nspeeds\nzip\nupdates\nreplied\nrestriction\nlin\nsiblings\nmeasurements\nuttar\ncourthouse\ntide\nlaps\ndynamo\npredicted\nuncredited\ncharitable\nfeast\npoker\npromising\npo\nkane\ntrevor\nreprinted\nattractive\nsidney\nfactual\nholidays\nwondering\ninitiatives\ncomfortable\nsw\noutlets\nfletcher\nairways\nfloyd\nsnail\ngenocide\nsaga\nundertaken\nscots\nkingdoms\nviolating\nweren\narchdiocese\nlu\nfossils\nbuying\ncommissioners\nsmoking\nenormous\nsubdivision\nta\nleather\ndef\npatch\nunreliable\nporto\nvalencia\nsubfamily\ntracy\ndownloaded\nbreakfast\nsubspecies\nandhra\nmercy\nbotanical\npeaks\nore\nlone\nbottle\ntunisia\nkw\nangola\nhumanitarian\ndecorations\nslide\ninvalid\ngenerating\ntrapped\nrao\nguru\nfreshwater\ntelescope\ndense\nsatisfied\ncriticised\ncoral\nxavier\nobituary\nhelena\nrowing\nalbeit\nanatomy\nbinary\nmerchants\ncontrolling\nwound\nconviction\npublishes\ndixon\nkuwait\nridiculous\nsuffer\nnsw\nblacklisted\nnord\ncaves\nregent\nriver
side\nshane\nbalanced\nfitzgerald\nhighlight\npayments\ntendency\nlópez\njoyce\nbelle\nlifted\ndavidson\ndecorative\nrouge\nclue\nrockets\ncure\nbolton\nslope\nmanning\nboyd\norganisms\nparaguay\npreview\ndodgers\nqatar\nimport\nbuddy\niucn\ndock\nclare\npenguin\nrapids\nexceptional\npapal\nshed\ncologne\nproviders\nbeings\nnorwich\nchallenging\nanthem\nsupervision\ninvestigator\nming\nlease\nyours\ncecil\ntoys\nconrad\nbuck\nhang\nbulldogs\nitalia\nportable\ndell\nsd\nirving\nbells\ntibet\ncomposite\ncouncillors\nsalary\nblacks\nenhance\nstaying\nrealizes\nsysop\ndropping\nsmallest\nfragments\nobjectives\ncaesar\ntragedy\nmacedonian\nunesco\ninvestigated\ncameroon\ncarson\nslot\nrams\nicelandic\nsophie\ngranite\njenny\ndeciding\ncomparative\napostolic\nvenezuelan\npirate\nparalympic\nslovenian\naccompanying\nseasonal\ndelegation\ncombining\nportfolio\nmachinery\nminers\nembedded\nkeen\ncharted\ngandhi\nfiscal\ntestimony\nduchess\ncu\nreflection\nmiddlesex\ncarr\nfreshman\nbooth\nrudolf\nleón\ncourage\nmidland\nriots\ndevelops\npier\nisles\nformatting\nwoodland\ntoss\nauction\ncbc\nspecifications\nacclaim\nsecrets\ncrest\nsutton\ndecreased\nups\nrepairs\nque\nbypass\nprominence\nslopes\nphantom\nhartford\nalps\nliner\nease\ncet\ninvestments\nauburn\nflame\nexeter\ndin\nkay\nrent\nhighlands\ninduced\nverses\ncrane\nsimulation\nschemes\nlatitude\nlance\nnetworking\nkeeper\nwanderers\nfixing\nasset\nprospect\nknocked\nmerits\nprejudice\nhighlighted\nstance\ngymnasium\ngloria\ntrent\nchoosing\nhorizon\nyemen\nreadily\nriot\narose\nrenovation\nquinn\nexcluded\nelena\nsuccessive\nanger\nhierarchy\nml\nnm\ndecoration\nnerve\ndisability\ngiuseppe\nboot\nexceed\nscan\nrca\npioneers\ncanberra\nmb\ncoupled\nnewark\nflowing\nfungi\nsensors\nbattlefield\nshells\nashore\narguably\ncottage\nhonduras\nresides\nshepherd\nreservation\nelderly\npbs\nmapping\nmongolia\ncommentator\nbi\nnash\nsharks\nricky\nshirley\nstopping\nexercises\nstepped\nribbon\ninfant\ndestroyers\nmild\nvikings
\npeers\ncorrespondent\nrenovated\nproceeds\nreproduction\nsings\nmercedes\ncompliance\nsided\nbombay\njointly\nbarack\npremises\nhawaiian\npoly\nhandbook\nrené\nscreenshot\nimported\ntransactions\nstatute\ncongressman\njerome\nshark\nchoices\ndoom\nbare\nraids\ndiagnosis\nreferencing\nfirearms\nstoke\nbride\nprolific\nselective\ncriminals\nlindsay\nadvocates\nautomotive\nwartime\nworry\nvacuum\nheating\nrushing\njp\nlengths\nradius\njohan\ninactive\nadvances\nvalued\ncelebrities\nmeta\nestadio\nrehabilitation\nsynonym\nasserted\ncorp\nambiguous\nleone\nsailor\nracism\nnegro\ntears\ninspection\nancestors\nhonda\nbarn\ncontroller\nassert\nabbreviated\nsingular\nvacant\ndepend\nag\nburden\nencounters\nfruits\nruby\nlunch\nnationalism\nphotographers\nmick\nclassroom\nbtw\ndelegate\nclyde\nforums\ncups\nchallenger\nrecommendation\nshi\ninstant\nfalcon\nloses\nplanted\nnordic\nhannah\ncoordination\nhammond\ndances\nbarracks\nskilled\nqualifier\nsaves\nhardcore\nprecipitation\nbucharest\nranger\nlily\nsurveys\npeterson\nalbion\nwicket\ndestinations\ncalm\nphotographic\nschmidt\nsubtropical\ncontacts\nasylum\ndeserves\nhyderabad\nvisa\ntwelfth\ncontacted\nmatching\npregnancy\nblatant\nnoteworthy\ncommit\nstale\nexpectations\ncarmen\nerie\nharvest\nflies\nkashmir\ntheft\nrelates\noccupy\nflames\nwesley\npga\numpires\nbeats\narlington\nsterling\nchances\ndonna\npetty\nlets\nsonic\ntr\ntransparent\ndedication\nclarence\nkatherine\nnervous\nfacto\nexcluding\neverywhere\nsunset\nanthropology\nminds\nreward\nnarrator\nwingspan\nantoine\nally\nsteep\nprimitive\ngateway\nleigh\nslavic\nderbyshire\nlopez\nchronic\ndonations\nimaging\nsuited\neuropa\nduet\nignoring\nintel\nwinchester\nbroncos\nbuddha\npig\nwhale\nbeaches\napproaching\narchipelago\nattributes\nminerals\njustification\nfrozen\ninherently\nfischer\ngains\ndee\ncomparing\naccurately\nparticipant\npeaceful\ntyphoon\nhank\nmanually\nverlag\nshortened\ntram\naerospace\nalignment\narabian\nivory\nhastings\ntaxa\noz\nholly\n
traces\nconstellation\nallocated\nhealing\nframes\nrey\nlorenzo\nspecification\nrisks\nprofessors\nviking\natmospheric\npottery\nbloody\neden\ndd\nmormon\nnielsen\nconsecrated\nperuvian\ncoloured\nrelevance\ninterference\ncandy\nbitter\nflown\nlongtime\nstruggled\nmonetary\ntrends\nindividually\nfunk\nbasque\nbangalore\nsynagogue\nlatvian\nelect\npot\ninteract\ncrews\nyi\npaths\nvar\nguarantee\nintact\nlacrosse\nraj\nimmigrant\naus\npatriarch\nthemed\nblessed\ntailed\noversight\nservers\necological\nrabbit\nphi\nsectors\ncorners\npowder\nmega\nvfl\nconservatives\ntrivia\nclips\ncoventry\nwives\nplaque\ninstitutional\ninform\nvic\nobtaining\nsummers\ndetect\nprecision\nconservatory\nsunshine\nmotivated\ndiaspora\nlinguistics\ngm\nlc\naccepting\nflowering\nnazis\nsavings\nunderwater\ndeeper\nmccarthy\navoiding\ntoxic\ngeorgetown\nscreening\nunaware\nphenomena\ninaugurated\nrolls\nliver\naided\nbenedict\ncommodore\nhollow\nmob\ndawson\nemi\nbrett\nalert\nseventeen\nbounded\ninsisted\nloyalty\nreunion\npatterson\ncollision\nlighter\nlinnaeus\nzl\nfilling\nenacted\nmatthews\nwillis\nsecretly\nvowel\nwhatsoever\nencouraging\nfinest\ndrainage\ndisciplines\nshooters\nlambert\nseth\nkai\nemigrated\nbermuda\nwow\nnoun\nreversed\nfactories\nvalentine\nplasma\nvirtue\nheated\nphillies\nbryant\ncarnival\nnotices\nbates\npartisan\namended\nballad\ndairy\ninsight\nsorts\nwolves\nhay\nkolkata\ntrafficking\nmomentum\nalternatively\npeerage\nlabeled\nricardo\ndevils\nsolved\nendurance\nblame\npearson\ngaza\nhal\nteachings\nconvent\ndealer\nct\nneal\nfrost\nbout\nbugs\nconsistency\ndemanding\nwrecked\nsteady\nexplored\nfuller\nshannon\nwatershed\nspecially\nhansen\nimplied\nmg\nextant\nreverend\nlanes\nwhitney\nmistaken\npupil\nintend\nadequately\nlimitations\nlateral\nvarsity\nlovers\nintentions\ncomply\nsecurities\nprohibition\nacute\ncontainer\nemissions\nupgrade\nnotation\nomar\ntaipei\ntorres\ndining\nsurvivor\ncorpus\nadvancing\nshaft\nnicole\nturks\nmarker\nlocals\nobserve\ndis
hes\ncove\nseating\ndont\npour\nloud\nhu\nsuspicious\nexpense\nprosecution\nnoah\nhyde\neternal\nterrestrial\nethical\ncountess\ntender\nreflecting\nyearly\nwholly\npushpin\nmasses\nmodest\ndudley\nbarrett\nexplosive\nfathers\npas\nrevealing\njuvenile\nsimilarities\nconquered\nrelate\ncolts\nviolates\nsurrendered\ndenomination\nfeminine\ncontentious\nsoftball\ncompression\nmicro\nsovereignty\nreef\nbrady\nwalked\nscenic\neleventh\nshadows\nbreach\nbehavioral\nmonasteries\nrope\nowing\nspare\ncache\nwimbledon\npractitioners\nduplicate\nwards\nnotorious\nparkway\nronnie\nodds\nandre\n½\nmos\nmaternal\ntransmitter\nsteal\nidentifies\ntertiary\npension\nhc\nscenario\nconstantine\nspecimen\nmare\ndolphins\nderives\ngloucestershire\nmetals\nteaches\nincorporate\noceania\ngalway\nmighty\nlightweight\nprocessor\nprints\nsomalia\nunblock\nog\nswitching\npossessed\nhr\nphillip\nbilateral\ndescended\ngloucester\npyramid\neps\nemeritus\nnina\nval\nwolfgang\ngill\ncooling\nhiv\nelectorate\nsexuality\ndash\ndoctoral\nrita\noutright\nconstituted\nnam\ntiming\npeters\ntribunal\nstalin\nstadion\norganizational\nmemoirs\npitchers\npresumed\nmayer\nbeaver\nja\nsm\ngonna\ngustav\nmotorsport\nfacade\nspike\ntitans\npossess\napps\ncal\ncarlton\nklein\nhassan\ndundee\ntranslate\nflesh\nmedalist\nclayton\npistol\nmikhail\nexpired\nunnamed\nclube\nwatchlist\nvalleys\nbenson\nprofits\nacids\ntravis\ndistances\nenclosed\nasteroid\ncow\naccommodation\ncatalan\nwarwick\ntextile\nbrewery\nsanders\nclarification\ndeadly\nqualities\nmarching\nassess\ngotten\nwitnessed\nmarx\nchassis\nintensive\nusd\nphones\nmarcel\nweber\ndiane\ndetachment\noperative\ngazette\nthoroughly\nspecify\nfritz\nmarketed\ncloth\ndoyle\nstatues\nnigel\nqing\nsheikh\ncafe\nsandbox\nbreeds\norphan\nmonaco\nfolks\nmeaningful\ninvaded\nnorm\nsen\nberkshire\nillustrator\nsl\ncelebrations\nbeverly\nmacmillan\nstrain\nmotorway\ncontracted\nconstitutes\nwilliamson\npronunciation\ndisco\npresently\ntalented\ndeletions\nforestry\nro
dríguez\nexam\ncoined\nstruggling\nadvantages\nmood\nwheeler\nkarachi\nblackburn\nunsuccessfully\neducators\nbeth\nswimmer\ntunnels\nwagon\nmasculine\noffshore\ndarren\ngerard\nconsul\nexcuse\nunreleased\ninvestigating\nmohamed\nsalon\nchecks\ntrigger\nreinforced\nenabling\nshoots\npipes\natoms\ninvisible\nibrahim\nmemoir\nintro\ninception\nimplications\ncoronation\ntobago\nwerner\ngifts\nlivestock\ncompanions\ndiffers\nconfirmation\nvolcano\njupiter\nprophet\ntaxi\ndisagreement\ndresden\noriginating\nrefuse\ncyclist\ndeserve\ndetention\npersistent\nscripts\ndiagnosed\niris\ndad\nabbot\nelvis\ndorset\nmel\nlaboratories\nlocalities\nresponding\nlorraine\nfabric\nseas\nshin\nstarter\nclip\nvillain\nchemicals\nreplaceable\nmultiplayer\nnr\nastronomer\nimmune\ncomprised\nbologna\nflexible\nmodifications\nsamoa\ntai\ngamma\nreformation\nconvincing\ncherokee\nposthumously\nhomeless\ntriumph\nprefix\ncarriage\nobsolete\ncommunism\nunreferenced\nconsortium\nstark\nlocate\ngreeks\ninclined\neffectiveness\nadvocated\nsurgical\nsouls\nretire\ncs\nrenewable\nsafely\ncameo\ndissolution\nunemployment\nracist\ntorn\nbaghdad\nmalay\ndana\nclarinet\nsodium\nmoldova\nbacks\ncrow\ninterrupted\ncolumnist\nquantities\narise\npseudonym\neclipse\norgans\nphilharmonic\nchristie\nincorporating\nchâteau\nchristina\ncountryside\nconception\nangelo\ndir\nlincolnshire\nwished\nregina\nrectangular\ntransfers\nyacht\npicking\nsecondly\nalgorithms\nnassau\nconstituent\nsánchez\nfalcons\ntermed\nbart\ngenius\ndefeats\ninformational\npunch\nspecialty\nclash\ncontinuation\ncharities\nabdullah\nremedy\njudged\njosef\nloosely\nuninvolved\ncomfort\nbranded\nbroadly\ncadet\nhelicopters\nnave\nshooter\nlotus\nenlarged\ndwarf\nlogged\naffecting\npackers\nantwerp\nglad\nfold\nknox\nbilled\nterrace\nflanders\ntri\nsocialism\nsandstone\nrenault\nfleming\njudy\ninfected\ninformative\nhappiness\nwarehouse\nmiracle\nleisure\nkindergarten\nleopold\nbarker\nsynthetic\nwaterloo\nsage\nadviser\naccepts\ntrim\nbasili
ca\nsitcom\ncartoons\nirregular\natom\nnormandy\nfinancing\nadmits\nembarked\npassive\nmodification\nshiva\ntudor\nbreakdown\nalfonso\nbadminton\nfiber\nfiba\nmice\ncarlisle\nunveiled\nsubdivisions\ntired\nmu\nwage\nfingers\ncest\nloving\nworried\ngambling\nbasel\nrandolph\nallison\nmothers\ncapitals\nauditorium\nvault\nhaute\npersuaded\ndescendant\npoison\ngujarat\nroyals\nyan\ndamages\neleanor\nparsons\ninvested\ncalvin\ndover\nmafia\nresembles\noffs\nanon\ndeny\nshapes\nmadras\nwears\nprofessionally\nessence\nemotions\nmyers\nsaturn\ntransmitted\npaperback\nfreed\ndomains\nhanover\nmets\npike\nsketches\ncellular\naugusta\nsinclair\nexamine\nunexpected\nsponsorship\nae\ndab\nbmw\nuncommon\naria\nrealistic\nbarton\nmiranda\ndodge\nstint\noc\nsunderland\nplug\norbital\nspreading\npuzzle\nsymbolic\ndiamonds\njesuit\nviable\nflee\nlined\nroses\ncompetitor\nmunster\nmidway\ngermanic\nurdu\nrides\nscratch\nlahore\npipeline\nprotective\nlevy\nrodriguez\nseymour\naffects\npac\njules\narcher\nkicked\npromises\nremake\nforts\naccusation\nreopened\nphases\nrendering\nprogression\ncomplained\nliberals\nindefinite\nlacked\nclement\nbatsman\nmackenzie\njudith\nchartered\nsalisbury\ndiscoveries\npursuing\nprimera\npolls\nreactor\ncope\nbeds\nhowe\ncalcutta\nstems\ndubious\ndesperate\nlafayette\nuniforms\noutcomes\ntips\npartition\nministries\nconclusions\ndestiny\nsigma\nreporters\ndaytime\naurora\ngoodbye\nrelisted\nwebpage\ncoordinate\nsteering\npepper\ndrunk\nfolded\nseals\nscouting\nswim\ncunningham\nlexington\npersecution\nexpenses\ntheaters\ncontroversies\nagnes\nhiggins\nprojected\nfunctioning\nquery\nseine\nbyron\nnowadays\naccomplishments\npérez\naveraged\nfool\ndowns\nbrave\nmozart\ngil\ndirt\nmathematicians\naquatic\nnoel\ninventory\nethiopian\nevaluate\nmeanings\nreluctant\nrgb\nimagination\nbarber\nslogan\nreich\nhurricanes\ndesktop\ndemonstrates\namusement\nhunters\ndrops\nletting\nsept\ndemolition\npi\nhague\nbronx\ndukes\nnatives\nancestor\npreparatory\nfence\nre
taining\nrna\nstrikeouts\nturin\ncrush\nalgerian\naccusing\nexploring\nreunited\nwii\nnarrowly\ncone\nnz\nwürttemberg\nurged\ntaiwanese\ncapturing\ncontinuity\nenforce\nclouds\nverification\nowl\nstruggles\nnamibia\npratt\ntate\nromeo\nreduces\nregulated\nlonely\ndeaf\nshire\nstructured\nnewsletter\nink\ndwight\nswept\nmenu\nsteelers\npointless\ndeclare\npaolo\nsupernatural\nprivileges\npleased\ncommercials\nfaithful\nknock\nlaunching\njubilee\nfinishes\nlamb\nresemble\nhesse\nstorms\nsimpsons\nstorey\nflemish\nlogistics\nconsort\nsatellites\nkb\ndub\nelisabeth\napache\nantenna\nashes\nfin\nadvise\ntrump\nkidnapped\nenables\nwounds\nalternatives\nturtle\ngreenwich\ncake\ntributaries\nheather\nneologism\nhobart\npulse\ncamden\nmeyrick\neaten\nirrigation\nkyoto\nlabrador\nvera\nhonestly\nninja\nmidfielders\ntranscription\nvalidity\nemployers\nwinger\nredesignated\npays\ncurse\nbreakthrough\ncycles\nnode\neuropeans\nexpressing\nrecreated\nthirds\nprivy\nhabit\nviewer\nrussians\nshields\nanalyst\nexposition\ninterval\ndecommissioned\nigor\noath\nfiling\ngenerator\nwelcomed\npg\nbahrain\npressed\nstays\ntwist\ngriffith\naston\nbordered\nnicaragua\nbacon\nbury\nfrankly\nsmithsonian\npremium\nned\nmayo\nwalton\nstevenson\ncolleague\ncurves\nfraternity\nxii\nencoded\njacobs\nhire\nisolation\nraven\ndig\nintroduces\nverbal\nappropriately\nnoticeboard\npatents\nusc\nholt\ntx\nutilized\nsearched\nclifford\nsonata\ncb\nhemisphere\nsolving\nslang\nthrust\ntrout\naesthetic\npredators\npanic\ntulsa\naaa\nteresa\nwarming\nadvancement\nhunger\nears\ntract\nlaurent\nvandals\nresign\nburnt\nsplitting\ncourtesy\nrodney\nencyclopaedia\ntreating\nsubscription\nperspectives\naugustine\neighteenth\nmidwest\npasha\namber\ndayton\nuniversidad\nrevelation\ndesigning\novertime\nrows\nfunctionality\nave\npts\nmastering\ndeity\nshocked\nautonomy\nkuala\nconway\nfrequencies\nmagnus\nprotesters\ninflation\nacademia\nrc\nhancock\nunacceptable\npassport\ninterpretations\ncreators\nappealed\nbaba\nvi
ral\nclearing\nwoody\nventures\ntab\nmarginal\nmeantime\ninscriptions\nshame\ngenetics\nmotivation\nyoga\nrotterdam\nmarvin\nteenager\nnorthumberland\nvocational\nafterward\nlaurel\noverhead\nballoon\nsponsors\nfirstly\nwha\ngreenland\ndemographic\nrahman\nattained\nivy\neg\nscrew\ntalked\nwigan\naccreditation\nvicar\nsurprising\nsandra\noverwhelming\nrats\nmali\nbaku\nstranger\ninstallations\npayne\nmod\nprojection\nferrari\ncollectors\nsands\nsubtle\nstrips\nshoe\nbasket\nmoroccan\nfisheries\ngfdl\ntally\nhadn\nvoter\nresolutions\nsilicon\nrage\nbrent\naliens\ncao\nimagery\nnucleus\nceremonial\npractically\nsimmons\ncircuits\nstrand\ndefinitive\npromotes\ntooth\nuranium\ncomplications\nsega\nclusters\nzambia\nemploy\nky\nisabella\nabandon\nmarriages\ntunes\nprobable\namino\naccent\nincorporates\ngarnered\nplaywrights\nrobots\nslate\nweston\nparachute\nescapes\nlauren\nskiers\nbelarusian\nclearance\nsnails\nmartínez\nmercer\ncrimson\nincorrectly\nmeal\nislander\npharmaceutical\ndominion\nchemist\nmedley\nproves\nporn\nmirrors\nremnants\nexpressions\nvegetables\nteammate\nleningrad\npsychiatric\nlaos\ntraced\nwiltshire\ntense\nant\nhectares\nfears\ncurling\nfits\naf\ninning\nrotten\nfernández\nfury\nunderneath\nbihar\nkraków\nconfident\npioneering\nsuzuki\nmobility\nfascist\nmarquis\nregistry\nnumerical\ncatches\nolivier\nhalt\nchronology\npolytechnic\nrandall\naligned\ndunn\nscrapped\nalike\nleased\nchandler\nsophisticated\nspur\nacknowledge\nlaurence\ntextbook\narabs\nbother\nxx\nrewritten\nsealed\nverde\nlandscapes\ncurved\nreilly\nspatial\ncommissions\ndatabases\napproximate\ninheritance\ncounterpart\ncease\nugly\nfraction\ntraits\nrivera\ntransaction\ndecay\nlineage\nsubstances\nsued\nguam\nensuring\nsyndicated\nfencing\nzhou\nfinalists\nmir\nelectrons\n§\navant\nturbine\ninstall\nmajesty\nzürich\noutlet\neduardo\nvince\nmint\nbrotherhood\niconic\nkatie\nbizarre\nign\ngr\nswansea\nannounces\nwheelchair\nrefusing\nslip\nshri\ngore\nherb\nrep\nstripped\ncrescent\
npossibilities\nsyntax\ngenome\nhale\nrailroads\nsonny\nfavored\njonas\nhawkins\nsimilarity\nsheets\nsuspicion\nmolecule\nwinters\nlaunches\ntonnes\nnautical\npreferences\ncds\ncasa\ntricks\nburlington\npartnerships\nconfined\nterminated\nsalvation\nadjusted\nbb\nguaranteed\nswamp\nuc\nwwii\noverly\nhanna\ndove\njennings\nbeneficial\nannexed\nsutherland\ngym\npatriotic\nmollusk\nsuspects\nsquares\nads\ncollectively\nstaffordshire\neet\nrevenues\nmüller\nbremen\nlicenses\nseventy\ncultivation\ncredible\ningredients\nbenny\nstafford\nstephens\nshipped\nxiii\ndramatically\nracer\ntranslators\nbeef\npassword\nmodules\ncomeback\nprecious\ndei\ncomputational\napologize\nmoist\naging\ngarrett\ndinosaur\nemblem\nplague\nbroadcasters\nwan\nchristchurch\npie\nspiral\nmarcos\ntensions\nadmiralty\norchestral\nscreened\nev\ncement\nrefugee\ncontinually\nchaired\nmadame\nviewpoint\nminiseries\nluigi\nieee\ndonation\nnumbering\ncafé\nhare\ncorrection\njensen\ncrusade\ntalbot\ndevelopmental\nbrooke\nextensions\ncube\nmigrated\ngoa\nvacation\nmaggie\nboots\nsubmerged\nambulance\ndisabilities\nsanctioned\nligue\ntyne\nbyrne\nimportantly\nluna\nflint\nskeleton\nburmese\nkurdish\nlawn\nécole\nweaver\nmai\neinstein\nforgot\ngould\ncrosby\nlivingston\ntitular\nwreck\nniagara\ndemons\ntreatments\nsinking\ncapt\nvintage\nperkins\ngdp\npius\nstereo\ntroubles\ntina\nresume\nunlimited\nfootnote\nmolly\nrobbie\nattitudes\nthoroughbred\ndecisive\nmans\nboost\nnorse\nneighbors\nrestoring\nportrayal\nti\nafford\nrim\nprocedural\nassam\nabundant\nwendy\ngran\nfeud\njockey\nbreathing\nalison\nniger\nprone\nwhereby\nemmanuel\nbirthplace\ndoll\noaks\numbrella\nelliot\nbahá\nstatewide\nedith\ncampuses\nshowcase\nrotating\nangles\ntolerance\nsurvives\ninsignia\nberg\ncombines\nresistant\npaula\naffiliates\nmemorable\ntraders\nsink\nconcentrate\nprop\nfp\num\nhiking\nmansfield\nrefusal\nrama\npaved\nafb\nairplane\ndemonstrating\nviolet\nnorthampton\nblamed\nwordpress\nglobally\nconstable\nabbreviation\n
ku\ngentleman\ncosmic\nshawn\npiper\nbanker\nvillagers\ndefinite\nbullets\nholden\nlionel\ndemonstrations\ndaniels\ndrinks\nprosecutor\nphotographed\ninaccurate\nblown\nbollywood\nirene\nchamberlain\ndive\nsworn\nanita\napartheid\nada\nsimplified\nreject\nabbott\nstephanie\nlongitude\nmack\nlucia\nsmile\nclassrooms\nfeared\ninjection\ncalcium\nredskins\njill\nnurses\nsophia\nswords\npokémon\nprominently\nmultimedia\nbuchanan\ndeclaring\nproving\njudging\nsanction\nemploys\nsant\nhorace\ndarker\nemerson\nmandarin\ngarde\nipswich\nborrowed\nbred\nvaughan\ncriticisms\nambitious\nphilanthropist\ndiscourse\nrajasthan\ncollar\nhonolulu\ninherent\njamaican\nfcc\nadvent\npriory\ninability\nbrigades\nsunny\ntalents\njakarta\nnatalie\nchin\nlester\nmacau\nbarbados\nactivism\nconsumed\nspectators\nwages\nundertook\nsilesian\ninmates\nhomosexuality\nteamed\nrecruit\nsensor\ncp\nfloods\nthereof\nweakened\ntwilight\nlowe\nnitrogen\nassassinated\nharmful\ndenotes\nsergei\nkirby\ndisplaced\noccurrence\nfirefox\nstrengthen\namérica\nleonardo\nsettling\ncategorized\nirc\nbrittany\nlds\nrite\nproportional\nspeeches\nuna\ncontributes\ndigit\nlobby\nsupervisor\nminorities\nspectacular\nskaters\nbays\nbean\nbg\nblades\nnyc\npitt\nprinter\ngenoa\nbrandenburg\nresidences\nthumb\nwikipedian\nlublin\nisabel\ncivility\nsol\nluftwaffe\nseated\novernight\nmarty\nsegunda\nkerr\nbutterflies\ntelevised\nhostage\nlennon\nfowler\nlabs\ninstructed\ncasual\nrude\ncubic\ncommentators\nsoup\nloch\nbattleship\ndissertation\ndenominations\npython\nextract\npeasants\nthroat\ncelebrating\nfinn\nmound\nmozambique\nfortified\ninadequate\nuzbekistan\nstained\npunjabi\ncaste\nprisons\nstirling\nlimiting\nsenegal\nbeirut\ntouched\nrepeating\nsexually\nwildcats\ndeposit\nolivia\ncolombo\nobservers\nhubert\nbasil\nou\nbatch\nnh\ngreenwood\nbreath\nbaton\nspokesman\ncaution\nmeadows\nturnout\naluminum\nrevisions\nstein\njokes\nitalics\nbark\nazerbaijani\ndiabetes\nflickr\nframed\nlesson\ntenants\ncaptains\nfactions
\nconscious\nalley\nschneider\nbert\nsubset\nbowie\ngeo\nmeditation\nrue\nundergo\npromotions\ndiplomats\ncritique\nbrowne\nsucceeding\npurchasing\npastoral\nliz\nhorns\nemergence\ncanceled\nmarched\ndamaging\ntrustee\nresorts\ntaxation\nminiature\nconclude\ndartmouth\nclarity\nincredible\nresided\nforewings\nchevrolet\nqualifications\npoetic\nvoid\nbosnian\ncab\nhereditary\npassages\nmater\nprincipality\nincrement\ncontests\nomitted\ngentlemen\nnj\nunderway\ntumor\ntorah\ncatalonia\ncricinfo\nmemorials\nwarn\ngeoff\njimbo\nlocks\nbees\nvanderbilt\ncommemorate\nequilibrium\ndorsal\nrealism\nadaptations\njohannesburg\nbent\nnonprofit\nintentionally\nbecker\nbs\neponymous\nassumes\nadvertisements\nhumorous\nshores\nhoffman\nenforced\nchung\njuniors\noman\nrenewal\nflank\nokinawa\noutlined\nhelmet\nstraw\ngrouped\ndownstream\ntens\nanticipated\nincivility\nsteele\ngoodman\nnutrition\nhavana\nmessenger\nwi\nrobbery\ncultivated\nunusually\nducks\ncastile\nscreens\npeaking\nbowler\nhuang\nproposing\npackaging\nint\nnu\napology\ntrauma\ntt\nreyes\nsubjected\ncapped\ndraught\nwhites\nvols\nlaying\nextinction\ntroll\ndivisional\npulling\nattribute\nproto\nproxy\ncoding\ncanoe\ngasoline\npodcast\ncommuter\nbloc\nencompasses\ncpu\nborderline\ndisplaying\ngeometric\nchaplain\ntended\nchord\nbarrow\nangus\nleinster\ncostumes\ndomingo\ncowboy\njew\nge\npreparations\nqualifies\nsculptors\nkoch\nsts\nbedroom\nprevents\nminded\ndeclining\nchoral\ninterpret\ngrape\nexpeditions\nairs\nlimerick\nkhorasan\nhaunted\noffspring\nxp\nmarxist\naluminium\npermits\ntomatoes\nsavannah\ntapes\nquarry\naustralians\ndisappointed\ncola\ncrafts\nirvine\ncurator\nrbi\ndavenport\nreceivers\naccidental\nmarseille\nopted\ngibbs\nrand\ninsect\nsergio\ncombinations\nappreciation\nenzymes\nawkward\nspanning\npressing\ndioxide\ntheologian\nprogressed\nreceptors\nencourages\ncorrupt\ndonovan\nht\nhogan\nsg\nnominees\nlogging\nole\nevacuated\ndoubled\nconversations\nquestioning\ngavin\ntargeting\nimplementing
\nceylon\ndepiction\nconverts\nsymmetry\nexcavations\nreuters\nhercules\nmartín\ndevi\nsheridan\nunfair\npile\nmuscles\nreptiles\nrepertoire\nsorted\nsamurai\ncassette\nstripes\nak\njuice\nproprietary\nfierce\nnecessity\nstealing\nhilton\nblu\ngiles\ncompelling\nstomach\nrushed\nabusive\nbuttons\ngram\nhinduism\nfortifications\nreggae\nhomosexual\napprentice\npartnered\nhussein\nsells\ndisposal\npharmacy\nsynthesizer\nliability\ncam\ndinamo\nsci\ncontexts\ncannes\nwillow\nhbo\nstarr\nstatutory\nwinged\nconcurrency\nrecruiting\nperiodic\nslower\npagan\nmatched\ndefendant\nspecializing\ncrashes\npizza\nevan\nrwanda\nmls\navengers\nthornton\nghosts\nexploitation\nxml\nul\ncolt\nensuing\npulitzer\nsupervised\nflooded\nrepaired\ninteger\ninstantly\nmistress\nsecuring\nethnicities\nmortar\nplausible\ncop\nactivation\nlyric\ncypriot\nassumptions\nmethodology\nios\ngale\ndominic\nnominal\nbud\nmelissa\nskip\nhapoel\ninvite\nrt\nreside\nmartinez\naffiliations\nmedina\ndebris\nsuccesses\nfrustrated\nengineered\narranger\n»\ninfinity\ncorresponds\ncretaceous\nbite\nnodes\nbavarian\nmarilyn\nrumors\nneighbor\nsuffix\nfare\nordering\nlifelong\ncampeonato\nfuselage\nwanna\ncapitalism\nharlem\nwarwickshire\ndelivering\npigs\ncentennial\nrituals\njustices\njoão\nbolt\nwakefield\ninfamous\nautomobiles\ndamascus\nlisteners\nsimpler\norioles\nxv\ndried\nke\nhai\nassignments\nporch\nhymn\n⋅\nindirect\nverifiability\nuruguayan\nvhs\nsioux\nconvenience\nstatesman\ndetached\npracticing\nchu\nabs\nhiatus\nbanning\nsul\npaired\nrelied\nwatt\nkite\noffence\nsylvia\nfrigate\nconvey\ndramas\nshade\ntrades\nmarian\ncoastline\nclive\nlamp\ncamping\nchips\nearthquakes\nassisting\nindicator\nbloom\nemil\nprayers\nattorneys\nnapoleonic\nmortality\nbruins\nprocessed\npathway\ntreatise\nmalik\nres\ncliffs\ncaliber\nbordeaux\nconcludes\nlowell\natop\nrelies\nmarino\nscorecard\nrecognizes\nheavier\nprevalent\nruined\nkidney\nemphasized\nuncertainty\ndocumentaries\nteddy\nomega\ncheng\ndenial\nsunni\ngr
anting\noutreach\nordnance\nneville\nbrush\nrolled\nalain\nmama\nevaluated\nein\nstyled\nturbo\npietro\ndemise\nhydraulic\neminent\napologies\nbaxter\naddiction\nanders\ngubernatorial\nemerge\nzu\nsolidarity\nposed\nmaid\nconfluence\nphysiology\nalgebraic\nbrake\nloire\njacket\nspinning\nbrendan\nscarlet\nexplorers\nimpacts\ndante\ngerry\nregained\ngranada\ncemeteries\nroyalty\nins\nconfronted\nboulder\nannouncer\ninconsistent\namphibious\nspears\nmariners\nmutant\noracle\nslim\nerosion\nborneo\npositioned\ncoordinated\nexplanations\ntransformers\nevelyn\nprofiles\nbeethoven\ndigits\nusgs\nraises\nparole\nconstraints\nnotification\nvein\nfrankie\neconomists\nterra\nbahamas\nspark\nkappa\npotato\ninterred\nimpose\nacc\nkathleen\nunconscious\nfilmmakers\ncl\nendless\nabsurd\napex\nevacuation\nlil\nmortgage\nconsiderations\nzur\nsuits\nlagos\nhm\nrealizing\nquotation\nrepresentations\npilgrimage\ntruman\nashton\nyields\ntaliban\nconcise\nremarked\nolga\nprostitution\nflyers\nprestige\ncontractor\nxiv\nexpeditionary\nbubble\nmauritius\ndeliberate\nincoming\npanzer\nantiquity\nconvenient\nwines\ndirectorate\napprove\ngifted\nwilmington\ntempo\ndexter\nskipped\nupstream\nislanders\nbp\nperennial\nkerman\nrai\nfulfill\napi\nhatch\nunionist\nunfortunate\nimmunity\nsnakes\nnile\nrm\nduel\nassessments\nbern\ndaisy\nauthentic\nattachment\nxvi\nlawson\nlaureate\nrupert\nmccartney\nrelativity\nnikolai\ncord\nexclude\nfencers\nshipyard\nentirety\nemotion\ncobra\nsensitivity\nintimate\nsoils\nplots\nholstein\ninvestor\nhuntington\nneighbours\nretains\nbarons\nenrique\nroth\npsychologist\nkaiser\ndarling\njudiciary\nvicente\nstreaming\njeanne\nreplies\nconverting\nklaus\nanarchist\noxide\nremoves\noverlooking\noccupies\nstella\ndose\nzombie\ncanadians\ndwelling\nbackwards\nliberia\nblackpool\nmodeled\noffline\nnorris\nserbs\nharp\nrental\nrelieved\nbotanist\nguilt\nfried\nalexandre\nconcentrations\nmarkings\nverdict\ncyril\nmnm\nhertfordshire\ndiscovering\naces\nshropshire\nspiders
\ncompulsory\nwoo\nbonnie\ncrack\nincarnation\noldham\nliam\npositively\nmgm\neverett\nquincy\nfirmly\nrode\nutrecht\nvolcanoes\nbritannica\namendments\nproposition\nthreads\ntriggered\nhumphrey\nbust\nfitting\ndiscrete\npeterborough\nzurich\ndeliveries\nlumpur\nwolfe\nunfinished\ndrill\nexotic\nteenagers\nsauce\nprobation\nloaned\narchie\njavier\nexports\ncartridge\nclifton\nnagar\ndeluxe\ncommando\ndeportivo\nfeeds\noptimal\nrhythmic\nkernel\nspringer\naustro\ndramatists\nbans\ntragic\ncredibility\nresurrection\nconditional\nind\ngaa\nusb\nbelize\nresearched\nprosperity\nstandardized\ncomet\npools\ntroubled\nnegotiated\ntasked\nmarkers\nkenyan\narchibald\nlava\nvictorious\npromo\ncreativity\nprotocols\nbeloved\nasserts\nesther\nobjected\npeel\nheidelberg\nchooses\ngroove\nlantern\nsavoy\nsaunders\nfacial\nconfrontation\nladder\nbohemia\nthief\nguyana\nvanessa\nhalloween\nsentiment\nappointments\nlevi\nrodgers\nhomestead\nrealised\nplc\nsainte\npune\nsarajevo\nannouncing\ndinosaurs\ndominance\nprecursor\nlaugh\nfinanced\nlars\nviktor\nscrutiny\nsandwich\napparatus\nutterly\npornography\ntoulouse\ntap\nincredibly\nalarm\ncruisers\npreserving\nrover\ngranddaughter\nexams\nhalfway\nacceleration\nraced\nshelley\nadverse\ncompetes\nintervals\nnichols\ninduction\nprivilege\ntrombone\naforementioned\nairplay\nedison\neyed\nmagnet\nmartyrs\nsuffers\ncaptive\ncodex\nwool\nforensic\nexciting\nmajors\nsurprisingly\nvowels\nseller\nplatoon\ndia\nfog\nhonorable\ncornish\nbt\ntriangular\nyuri\nhiroshima\nselections\nwash\nfreddie\nnationale\nundrafted\nsignatures\ntreason\nbabies\npersia\nwilkinson\nwhoever\nsacked\nglaciers\nsustainability\nleaning\nrecognise\nlumber\nreceptions\nballads\npillars\nturret\nresidency\nreginald\ndoubts\nzhu\nune\nowens\nlately\nveterinary\nguggenheim\nreputable\nhector\nlounge\nundergoing\npresenters\nsacks\ntara\njumped\ncurry\ngoat\nnightmare\nburst\nyokohama\nbehaviors\nsecretariat\nspans\nillusion\nwta\nicons\nelectro\nteens\nromanesque\nshake
\nelias\nresist\nplanetary\npseudo\nalba\nbiodiversity\nshifting\nbluff\nanxiety\nzhao\nrogue\ndolls\npitching\nbarney\nsikh\nsuffrage\nfeathers\nrichie\nmarathi\nhabits\nblend\nextraction\ncourtyard\nturf\ndesirable\nexpo\nbhutan\nguerrilla\nesp\ndeadline\nea\nbon\nexchanges\nwhip\nfarewell\ncardiac\nsensible\ndeities\nreplica\nsmiles\nsympathetic\nreproductive\ncousins\nterrorists\nacquiring\nheinz\nsongwriting\nfinancially\nlizard\nzen\nremixes\nexiled\nnegotiate\naxe\nflour\njade\ninsane\npose\nfry\nida\ngoose\nscientology\nchiang\ndom\npact\ngarbage\nregency\nfulton\nreorganized\nsixteenth\nriga\nmom\nxu\naccompany\nnw\nhasan\nmosaic\nicc\nark\nsubcategories\njoachim\nbahn\nrutgers\ncomma\ncrude\ntaxonomy\nalcoholic\nrom\ncaucasus\ncharlton\nvillains\neliminating\neighty\noccupying\nsuperhero\nmao\nbaptiste\npaz\nsid\nloads\nlime\nraleigh\ncrossover\nkarate\nstrengthened\nbrig\ndickinson\ndrought\ndelays\nsocially\nmaccabi\najax\nchargers\nrejection\nschooling\ncaucus\ncounterparts\nreconstructed\ninvestigative\ncatcher\nprev\nthor\namnesty\ntracked\nsurfaced\nthirteenth\nbohemian\nfailures\nsoviets\nwichita\nbarriers\nlottery\ngrateful\noutskirts\nturtles\nmeadow\nelectromagnetic\nconan\nlikelihood\nendowment\nwiley\nformatted\npatriot\ndeacon\ninfrared\ndioceses\nspecialists\njulio\npaso\nkang\nbourbon\ntf\npromoter\nnineteen\nsurgeons\ngreco\nmilitant\ngable\nquoting\nthatcher\nweigh\nparma\nlibrarian\nchairs\nvc\ncomedians\nstressed\nwatches\nundefeated\ncommunal\ntablet\npalatinate\ncopying\nflip\ndescriptive\nshelf\nupright\nnursery\nsynod\npackages\nunsure\ndisclosure\nemission\nvisually\ncorrelation\nideals\nav\ndisappearance\ngong\nderivatives\nsammy\ntong\npossesses\nadobe\ndeficit\npremise\nidentities\navon\nvega\nrosario\ntko\nleap\ntransparency\neverton\nri\nlakers\nsic\ndebated\nruin\nsophomore\nretailers\nmalawi\npayload\nsubdivided\ngf\nsubstantive\ndevotion\npunches\naudit\nméxico\nsteadily\nnoon\nhist\ninequality\nlowland\nstrauss\ninforms\nco
nfession\nmba\nstunt\nmonopoly\nniece\nsurf\nspells\nnationalists\nnathaniel\ngentle\npatronage\ntransferring\nkc\nplayboy\npenguins\ntransgender\neisenhower\nopus\nlinebacker\nchatham\ncommunion\njewelry\nprobe\nsharma\ngases\nmalls\nro\norganist\npertaining\nintersections\nrusso\ncandidacy\nconcurrently\nlaguna\nelevator\ndiscs\nironically\nlenin\nbreton\npatience\npedestrian\npeggy\ntucson\ncares\nissuing\nnickel\nverbs\ncatching\norchid\ncalculate\ncomprise\nscreenwriters\nmaj\nmanned\nalexis\ninstituted\nnamesake\nlibyan\nmentally\nir\nliang\nquotations\noxfordshire\ntownsend\ncw\ntear\nunpublished\nsubordinate\nheroic\nshine\nregain\ndetermines\nonset\nsounding\nbranding\npf\nrecreate\nadjoining\npeasant\nrebuilding\ninfections\nviolinist\nmongol\nbotswana\nexclusion\nmagistrate\nunstable\nmurderer\ndeborah\nuncivil\nleicestershire\npromptly\narte\nalejandro\nkillings\norient\ngymnasts\nbalkan\npascal\nreadings\nvenetian\nvocabulary\ncum\ndestructive\njudo\nsignpost\ntranslates\ndiffering\neli\ncane\ninnocence\nnpr\narrows\ncalculation\ninnovations\ncroix\nabundance\ncyber\npauline\nego\nlitigation\nua\nhooks\nnwa\nbangladeshi\nmotives\ncoats\nmia\npossessions\nmeridian\nexaminations\nbayern\ncv\nsatirical\nreissued\nwrapped\nharriet\naudition\nenjoys\nsampling\nbrackets\ninsists\nnewest\namor\nhubbard\nmerry\nbackgrounds\nfragment\nnottinghamshire\nbeginnings\nbusch\nancestral\ndalton\nmagnificent\nlethal\nbanana\nguerrero\nusaf\nfriction\ncomparisons\nmadness\nroutledge\nmutations\nassassin\nhistoire\ncanals\nharding\nconceptual\ndaddy\noccupational\nguinness\ninscribed\nadler\nacronym\nforthcoming\ncarpet\nfelipe\nhamlets\nbuzz\nunfamiliar\nrecorder\nintake\nunhappy\nseventeenth\nfundraising\nladen\nbyrd\nhomage\nchiefly\nfuck\ncooke\nhulk\ngrandchildren\nsuppression\nmae\nestablishes\nantony\nnatal\ncontention\nunix\nconform\nernie\nchandra\nbeard\ncoca\nusable\nteammates\nconcord\nmacarthur\nmaltese\ndiscretion\nsorting\ntottenham\nbel\nworcestershire\nda
nube\ngarland\nbuilders\nwetlands\ncôte\nmapped\ncooled\nbas\ntemporal\nmisunderstanding\nboyle\nmoody\ndetained\nbeacon\ncoaster\nbaja\nblah\nnobles\noilers\narches\nexamining\nhazard\ntitan\ncables\npsychedelic\nqaeda\nforrest\nrealise\nobligations\nosborne\nsomali\nmma\nreminiscent\nrecruitment\nflats\nobligation\nlibertarian\nweiss\ncorrections\nwembley\ndebts\nanswering\nrigid\nflores\nenlightenment\nsect\nfocal\nfielding\nabolition\ngps\ncitadel\ngravel\nsecretaries\noswald\nnoir\nmartyr\ninstitut\nmyths\nanterior\nsticks\nnb\nsuppressed\nanalogy\ngolfers\nlabelled\nzinc\nbeans\nmclean\nshrewsbury\nturbines\nourselves\ntextbooks\nang\ntractor\ntyping\nborne\nsting\npic\ncents\nexcited\nspeedily\nscandinavian\natari\nunblocked\ninlet\nfairfield\npounder\nminimize\nsubstituted\nchronological\nsatisfaction\nremedies\npolynomial\nbutter\nfourteenth\nposterior\nfloat\npersuade\ncyrus\nwherever\nom\nlaptop\nspartak\nderry\nrhetoric\nsunrise\nequestrian\nrender\nnhs\nplantations\nenthusiasm\nrepository\npropeller\nmorse\nstadiums\nnk\nmaturity\noutfit\ninflammatory\nhabsburg\nbombings\nschwartz\ndrain\nnate\nstrasbourg\nlemon\nnorte\nbrennan\nions\nworkforce\nhonourable\npredict\nbullying\ngraveyard\nafro\nmortal\n±\nnuts\nvisions\npoisoning\ncombustion\ncommandant\nenduring\nmn\nexceeded\npor\nclans\ntuberculosis\nwarships\neddy\ncaldwell\neco\nfoul\nbentley\nphysicists\nankara\ngeelong\norganism\nbeaumont\ngorge\nmcgill\nretrospective\nnolan\nprocession\nrb\nweir\npanther\nforemost\naragon\npalermo\njaw\nexplores\nbaritone\nkilkenny\nannals\nceramic\npony\ncornelius\ndetainees\nneural\nmoor\nssr\nabbas\ncollaborations\ntidal\nhui\nannounce\ncalculations\ncongregations\nunification\ncartoonist\nimproper\npanorama\ndividing\nnt\ngrouping\nmural\ntorque\nhatred\nproductivity\ndans\nexempt\nsoundtracks\nfutsal\nmonumental\nanaheim\nspends\neconomically\nwolverhampton\nspire\ncentro\nbrakes\npredecessors\njays\nexpresses\nbasal\nversa\npacked\nlandings\ncategorization\n
accomplish\nwarden\nwholesale\ndial\nasphalt\nclarified\nblockade\nlaurie\nmiddleton\nflynn\ntoby\nmole\nnicholson\ncheaper\npiedmont\nrefrain\notters\npatrons\ncorporal\nsparks\nberger\njain\ntrolling\ncoliseum\naero\nturnpike\nhistoria\nofferings\nsmell\nmoreno\noversaw\nbamboo\nlockheed\nmeals\ncharging\ndal\nunchanged\nfoo\nobserving\nsetup\nmetallic\nrespiratory\nmilitants\ncorrespond\nrowers\nlean\nproposes\nsweep\nmeredith\npurdue\nnissan\ncalculus\nsteals\nsamantha\nconstructing\nbabylon\nhuddersfield\nrabbis\ndonor\nsmash\nputnam\ndrowned\nhut\nsalzburg\ndevised\ndillon\npressures\nmountainous\nrented\nmusée\nveronica\nbrock\ngalicia\npal\ngus\nabused\nfamed\ntiles\ndrift\nbrewing\ncanary\nhumour\nolympia\nradial\nbk\ncórdoba\nnude\ncurrents\nreservoirs\nfeminism\nresembling\nquébec\ntransitional\nstraightforward\nwaterford\ndivers\nshia\ninsee\naveraging\nfamine\nwilly\ngreens\nramsey\nhonoured\nguangzhou\nteatro\nunsuitable\nmetacritic\nling\nsummoned\nindirectly\nreflections\njurisdictions\nwyatt\ncfd\nmanifesto\nshan\ncadets\ndepictions\nclicking\nutilities\nfigured\nexplosives\nparadox\nminsk\nconferred\nchrome\nscroll\ntramway\nsolicitor\nniche\ncrap\nlifting\nexpecting\ndoncaster\nregulate\ndefenses\nexperiencing\nagf\nshirts\nts\nmarquess\nundue\nwax\nmotive\nhutchinson\noverturned\ntango\nlara\nstrokes\ninfectious\nreinstated\nmont\npigeon\nlyons\nstole\ndaylight\nfertile\nstairs\npatrols\nupdating\nslender\nut\nbotany\ndignity\nmadhya\nideological\ngrip\nshortage\nanalyses\nskater\nclone\nravens\ngu\nforeigners\ntakeover\nwestward\nrecognizing\nretrieve\ntraction\nbrewers\nhumboldt\nalternating\nlenses\nopposes\npulp\nsalle\nvisibility\nplata\npicnic\ndecks\nczechoslovak\nconcur\nworms\nboone\nlam\nlagoon\nsoo\ncruel\nthreatens\nallocation\nbuffy\nrecovering\nrío\naffordable\npresley\nrandomly\ntimor\nmackay\ntire\nchestnut\npillar\naccumulated\ndt\ndiagnostic\ndem\nimho\nunanimously\npopularly\nchoreographer\nsimone\nbernie\nbags\nchampagne\nnorm
s\ndüsseldorf\nmusicals\ncomplexes\nendorsement\nneighbourhoods\nconcurrent\nhydroelectric\ncarrie\nmughal\nmonmouth\nforested\nmccoy\nargs\nignorance\nsquash\nconductors\ninvasive\ncloses\nburgess\ntavern\nwmf\npetit\nreno\naz\nracehorse\njong\nkitty\nreinforcements\nseahawks\nworkplace\noffset\nbenz\ncha\nracecourse\nreissue\nrebuild\nmotorcycles\nsevens\ncovenant\nrobust\ndislike\nminus\nweimar\nhoover\ndolphin\nconditioning\nella\ninvestigators\nglasses\nbowen\nhindus\nhandicap\nware\njurassic\namphibians\nimplying\npostgraduate\nsiding\ntrench\nspi\nweighed\nredevelopment\nsanchez\ninactivated\nwishing\nimaginary\nrevue\nincidentally\nhs\nimam\nradioactive\nconsultation\ntipperary\ntonga\nadapt\nlovely\nerotic\nhg\nmanipulation\nbelmont\nfarrell\nthickness\ndischarged\ntorch\nlois\nramos\nfilters\ndamon\nmongolian\nemploying\npremature\npreacher\nballots\nrubin\npornographic\nkatrina\netymology\nattracting\nambient\nsubdistrict\nfeudal\nantagonist\ndare\ninsult\ndiplomacy\nclaudia\nneglected\nliteral\nmiddleweight\ncomplaining\ncrushed\nseniors\nbrunei\ndots\npostponed\nlowered\nvegetable\nsiberia\ncollects\nbirch\nsyndicate\ncrowds\nwwf\nscholarships\npolite\nconfirms\nstall\nshifts\nwired\ndirective\naide\ntheresa\nbiographer\ngma\ntissues\nbenton\nnos\nmarijuana\ncommemorative\ngnu\ntuition\nresemblance\nlsu\ngao\nsundays\nlac\nwatkins\npassionate\nwaterfall\ngenealogy\ndiscouraged\ncentenary\nempirical\ncharting\nbd\nhq\nreact\nsnyder\npsychiatry\nprescribed\neducate\nfairfax\ndevastated\nconfronts\ntestified\nrails\nwestphalia\nrouting\nexhaust\ntwisted\nvitamin\nalvin\nroutinely\nchromosome\nmecklenburg\nweakness\nweekends\npuppets\nnippon\njealous\nbrutal\nabsorption\nshaun\nkung\ncanonical\nworm\nakin\nviruses\nodi\nbutcher\nfarther\nlim\ndisagreed\nandersen\nseparating\nexcavated\neligibility\ncésar\nweasel\ngraphical\nitn\nmock\nagreeing\nkara\natlantis\ninductees\nfreak\namtrak\nwien\naccounted\ninclusive\neliot\nunrest\nspecials\nspeculative\nsemiti
c\nmla\ndismissal\nharmonica\noutlook\nelegant\nmast\ncrystals\nresting\nclimbed\ndug\nheirs\nprofound\nmitch\nuae\ndepict\nabel\ncolonists\ntemperate\nalexa\ndar\nenthusiastic\ncromwell\nannoying\niihf\nfrustration\nkathy\nkensington\nguiding\nsurroundings\nkidnapping\nplayground\npomerania\ndet\nincorporation\ncfm\nespaña\nexported\ntexture\nfancy\ntor\ndoris\ndigging\ncocaine\nrites\nbauer\nerich\nmainz\ndwellings\nspinal\nramp\nsocialists\nsemester\nche\nunwilling\nprediction\nrollback\nupheld\nsamsung\nalbuquerque\nó\nreconciliation\nfreelance\nstretching\ntopology\nneurons\nassertions\nretention\nwoodlands\nstandalone\ncobb\nhalo\ngraphs\ngrange\nmendoza\naquatics\nlip\nspeculated\nraphael\nunprecedented\nbaseman\nsadly\namherst\nbuilds\nresearching\nseeded\nlyrical\ncolchester\ngallagher\ngenerates\nwherein\namos\npitchfork\nadopting\nscarborough\nquasi\nnorthamptonshire\ncooked\noptimization\nvacancy\naggression\ndressing\ncontingent\nsympathy\nlea\njuliet\nemperors\nstaging\ndf\npaternal\nprincipally\nschleswig\nfresno\nclever\nsuzanne\nee\nuncovered\nprolonged\ndisappointing\nliaison\npolling\nabd\nsunlight\ntyrone\nsyed\ncompressed\nhumid\nassyrian\ntouching\ngravitational\naccession\ntutor\ndarlington\ntar\ncomplain\ngeologic\nsingaporean\nintegrate\ning\npioneered\nsar\nexecute\nsedan\nantique\nrf\nmorales\nchoi\ndisappear\nstocks\nsurplus\nfurious\nbuccaneers\nmutation\nghetto\nsatire\nvp\nvelvet\nastronaut\ngaps\nconcacaf\npunt\nljubljana\ncfl\narchaeologists\nirwin\npad\nautobiographical\nyukon\ninterceptions\ninstrumentation\nrockefeller\ninterception\ncaptained\nshining\nspokesperson\ntoilet\npol\norchard\nrutherford\nrfd\nkramer\nromney\nnas\nadvocating\npueblo\nnuremberg\nflavor\nhypothetical\n‎\nfinch\ngrammatical\nknots\nremotely\nutilize\ndivinity\nfixtures\ninvest\nstraits\njumps\nretreated\nbacterial\nthéâtre\nshy\nbuckinghamshire\nsai\nsino\namid\naiming\nsurveyed\nmisuse\ncontinents\nrefined\nsolitary\nspectral\ndesmond\nodyssey\nhiring\nm
ysteries\nphosphate\nbombed\nwesleyan\nimprint\ncaledonia\nexploded\nportals\ndarts\nanimalia\ncancellation\nautism\nknoxville\npeacock\nsyllable\npianists\ndepths\nmichele\nshipbuilding\nsleeve\ncumbria\nquo\ntheologians\nreigning\npamela\nmontevideo\nandrei\nassemblies\nstanton\ntones\nsaddle\ndisturbed\nblessing\ninevitable\nreprise\nselecting\niaaf\nportray\njasper\neaton\nfb\npits\nhanson\nmcmahon\nrobbins\nvine\nsparta\nespionage\nfifteenth\npoznań\nclown\npaddy\ntroupe\nrelying\nyankee\nvaccine\nwelch\ncaptivity\npacket\nreplay\nqi\nboiler\nbelly\niphone\nexcerpt\ncompetent\nnightclub\nsymposium\njewel\ngenerous\nstatutes\nentertaining\nodessa\ncockpit\nnets\nbucks\ndetailing\nheadline\ntremendous\nmailing\nhicks\nfiat\nalessandro\nalec\ncentred\nstretches\nclashes\nleiden\ndamn\nsurveyor\npaterson\nyong\naristotle\ndáil\ntent\nnouns\natkinson\npersona\nmig\ndistributions\nplayable\nnuns\nrotary\nangular\nfoley\nslaughter\nswitches\nrejoined\ndistress\nariel\ncorpse\nperipheral\naccelerated\nprasad\nfixture\nvoluntarily\naccord\nconscience\nass\ndaytona\naccountability\nnovi\nburnett\ncoconut\ngiorgio\ndrilling\nkhz\nanniversaries\ntravelers\ndominate\nlazy\nordinance\nsemifinal\npiston\ncody\ngómez\nbravo\ncrete\nbravery\ntheorists\nnovgorod\nanalytical\ninventions\nextracted\nmetabolism\nprovence\nstud\nstratford\nbella\nrecruits\ncountless\nposthumous\noriginates\namir\nmorality\nfife\ntombs\ncredentials\nproclamation\nsahara\npresided\npapa\nrus\nboring\noceans\nismail\nintercontinental\ncain\nchoke\nahl\ncompass\nfreeze\nprofitable\nhaifa\nsouthbound\nreeves\ngmt\nelaine\ncompton\nblonde\nsultanate\ncurtain\ndeposited\not\nroyce\ndispatched\nud\nsubmissions\ncrossings\noperatic\nbuckley\ngolfer\nvita\nmirza\nfra\ntermination\nhitter\nburkina\nreliance\nsuperseded\npropelled\nliquor\nblackwell\nciudad\nflexibility\narbor\nmonastic\npe\nadjective\nboer\nwicked\nhewitt\nbilingual\nconstance\nbleeding\nperez\nvilnius\nloser\nfond\nlasts\nstranded\nbottles\nm
onkeys\nsheila\nexchanged\nparticipates\nreel\nkicks\ninvites\nbureaucrat\nrelics\nwashed\nmx\new\njessie\nblunt\nolsen\nsims\nhk\nskinner\ncanoeists\nelm\nresonance\nfaso\ndeclares\nfranchises\nkurdistan\ncoffin\nsights\nitalians\nbothered\nrecipe\nalright\nelephants\ngreenhouse\nautomation\nhampson\ncascade\nforge\naquarium\nromero\ntsar\ndisciples\ndonnell\nspecialised\ncutter\nsustain\nscream\npavel\napprox\nratified\ngeneralized\nactivate\nprocessors\ngarner\nsatisfies\nnorthbound\nandes\nshareholders\nevergreen\nkicking\nkillers\npostseason\nmeteorological\ndigest\nhandles\nmclaren\nsubscribers\nsparrow\nmarin\ndynasties\nshankar\nmat\nwally\nprimetime\nsnowman\ngrapes\ncrusaders\nboroughs\nunderworld\nheadmaster\nravi\nsubstrate\ncheltenham\nmelodies\nmankind\nprompting\nspies\ntuning\ninsulting\ncreed\nsenses\nspecializes\nmona\nreorganization\nconfederacy\nstockton\naccessories\nsupportive\nprogrammer\nswami\ntorpedoes\nmotif\nitf\ncortex\nepidemic\nambrose\nza\nunsigned\nlyricist\ncourtney\nedo\nmustafa\nshrub\ngermain\nwhales\nfrançaise\nencoding\nconcluding\ncrossroads\nconsolidation\ncalcio\nwillem\ntelecom\ngoldman\nbriggs\nalonso\nsumatra\nanchored\nkapoor\nboycott\nmuseo\nforks\nconsulate\nfirearm\nbanjo\nfrogs\npork\ncontemporaries\nemphasize\narises\nkazan\nsurpassed\ninverse\nreddy\ncolonization\nassured\nobliged\neruption\nanalogous\nfriedman\nideally\nexits\nkeller\nremark\nnad\nfiddle\nhorrible\neconomies\nentrants\npasadena\nfungus\nescaping\nscanned\nlibretto\nbenin\necosystem\nnavies\nportrays\njoanna\nusda\ngraffiti\nmystic\nobstacles\nfda\nbing\nblanked\nants\nreddish\nbarlow\nlent\ndeeds\ndoe\nplugin\nfutures\nhorton\nbrasil\ncannabis\nserv\nentertainer\nglance\nsoloist\nrepetition\nsparked\nverona\nattracts\narmenians\ncoupe\nwit\nchrysler\nhobby\njaime\nmerlin\nsindh\nwight\ncropped\nlama\nconnector\nmccain\ntm\nmagician\nguangdong\nwizards\nadvertised\nmediator\nburger\nbernardino\ncatalyst\nglacial\nrhodesia\ntbilisi\nprotestants\nhin
dwings\nstretched\ngossip\nmetropolis\nbeatrice\nundercover\nauthoritative\ndíaz\ncannons\nnaturalist\nglider\nillegitimate\njuventus\nxxx\ndisciplinary\noccupations\nty\ninternally\nsheer\narithmetic\nspokane\nnewcomer\nsami\nearnings\nprogrammed\naba\nns\noverthrow\nallah\nrancho\ndump\nmerchandise\npatches\nhumble\nshelby\navery\ndenote\nworded\njavascript\npanchayat\npadres\nunverifiable\nrewarded\npresentations\nhurdles\nversailles\ngenerators\nhappily\ndungeons\nseville\nsystemic\ndaly\nhaitian\npatented\ngig\nrenovations\nstellar\nmed\nmates\nsans\nconvinces\nstrengthening\nporsche\nundertake\nskyscrapers\nbuckingham\ndiaries\narrests\nwilde\nmandated\nadjust\nimmense\nrot\nkv\nhungry\nfremantle\ntna\nmidst\nsgt\nwaterfront\ncelestial\nlevine\ncombatant\nnicola\nminh\nengined\nexceptionally\nsheldon\nhalted\nplayback\ngiro\nhee\npoe\nproponents\ninauguration\nbind\nneedle\ncourier\nexcavation\nspurs\nexodus\nquad\nclimax\npotassium\nascent\nvolkswagen\nlydia\nreprint\nconnell\nlattice\nunicode\ngodfrey\nrossi\ngonzalez\nprospects\ndecreasing\nrains\nhymns\nqf\nadmired\norion\npledge\nmodernist\nblacklist\nmonitors\nacademies\nderive\nshit\nundergone\ngarfield\nvishnu\nevidently\nfest\ncbn\noverhaul\nflawed\ncynthia\ndegradation\nbracket\npray\ntex\nutilizing\nabe\npam\nartery\nappalachian\nplagiarism\nleopard\npiers\nsensory\nevidenced\nbunker\nspherical\nregret\nulrich\nsocio\nvickers\nsupermarket\ncustomary\nmalone\nweights\nelders\ntornadoes\ncorey\nvacated\ncharm\npetrol\nblanc\nik\nsignaling\nbuffer\nmelting\nsensation\nsubcommittee\nfinances\ncaracas\nvernacular\nregimental\njudoka\npsychic\ngundam\ndenny\nfatalities\nzach\nju\nburgundy\nexemption\notago\nsimultaneous\nunite\neager\ncomposing\nrothschild\nbooker\nweighing\ncalais\nhint\npunished\ncrying\nbunny\nspl\nlibel\nantioch\ngangs\ncastillo\narrondissement\nappoint\nurge\npalestinians\nfavoured\nhernández\nbackward\nambiguity\napproximation\ngrocery\nrestrict\ncyrillic\nshoulders\nharley\ndealers
\ndiminished\nunopposed\nret\nsurge\nreservations\nbald\nseminars\nrudolph\nvijay\nwagons\ndevastating\nremind\nbn\ntallinn\npraising\ncampaigned\nnasty\npants\nfleeing\nanalyzed\napocalypse\narchaeologist\ngrief\ndispersed\nallegheny\nconsulted\nhydro\nlegislators\nstaircase\nbernstein\nbundle\ncommencement\ntextual\nprospective\nmoose\nchancel\nconsuming\nminas\nconsonant\nnun\nvariously\nmelodic\nknot\nnull\ntsunami\nadventist\ndefendants\nprotested\nvalves\nbrewer\nbarred\nruiz\nweekday\nordination\nrpg\nhillary\nspun\nracehorses\nerin\nbalkans\nprep\nville\nyiddish\nentrepreneurs\ncrimean\nsq\nintersects\nwelterweight\nbratislava\nmushroom\nmosques\nhumidity\nalicia\nemilio\ndixie\n©\nlen\ngradual\ntrash\nlitre\nchasing\nponds\ngreenville\nmidi\nconvex\nrejects\nseminar\ncart\nruss\ninsist\nstationary\ntoni\nlarsen\norchestras\nbandwidth\nseize\nplato\nauf\ntc\nfibers\ninterfere\npunctuation\nclair\ncollingwood\ntn\nportraying\nimports\ngradient\nrespects\ngregg\nphilips\nmegan\nquiz\nalterations\nhowell\nguardians\nhighlighting\ntasmanian\nmf\nsurround\nol\nloops\nsymphonic\nhospitality\nrae\nintellectuals\njunk\ncod\nbf\nwinding\nsb\nestuary\ndiscount\naxle\nreliably\nchun\nwillingness\nneoclassical\npainful\namelia\nhussain\nexhausted\nresponds\nprovost\nluca\numpire\nindiscriminate\nngo\nbehave\npodium\nquentin\njakob\npneumonia\nlao\nethan\ncommitting\neliza\ndeficiency\ncoherent\nrudy\nmantle\nwoodward\nsac\njulien\ncopyediting\nliberties\ntherapeutic\narising\nspill\nestádio\nsemantic\nchloride\nconfront\nvanguard\nvendors\nbaptism\nrv\nfamously\nplanting\nvalle\ní\nbonn\nclaus\nmono\nintends\npenal\nlips\nopt\neurasian\nramon\nmaxim\nzion\nuploading\nalfredo\nia\ngs\npoliceman\ntreasures\nceramics\nafricans\nembrace\nculminating\nbliss\nwonders\nbowls\nuniversally\ncharacterised\nplayhouse\ngoldberg\ncaretaker\nguadalajara\narchaic\nrisen\nranged\nterminals\ncampaigning\npedal\nwen\nmanagerial\nimmortal\nmarrying\nsuppress\ncambridgeshire\nrappers\ndepo
sed\nmistakenly\nrecycling\nintentional\nbei\nfishermen\nalloy\nmalmö\nassurance\nlan\nstevie\nmarne\ncontractors\nspine\nmaximilian\ngala\nglucose\nparallels\nawesome\nmigrants\nquaker\nzionist\ndetract\npalazzo\ndoping\nramsay\njs\nministerial\nblanche\nmoran\ncrab\nbutt\ncanadiens\nhanged\nelectrified\nburt\nambush\nflotilla\nequatorial\nmoderately\nminors\nsubsidiaries\nconflicting\nherd\nluc\nserb\nshea\ncollier\nnickelodeon\napostles\nsherwood\nconducts\narchery\ncyclones\noceanic\npotomac\nconversely\ncaptures\nshootout\neton\nloc\nadb\nblanket\nparaphrasing\nexpose\nschooner\ndeparting\nlbs\ndrv\nawaiting\ndisguise\nbrenda\nnora\nbeams\nconnie\ncomo\nhayden\nld\nwehrmacht\nwarbler\nrhineland\nadvert\ntoe\nastros\nditch\npolymer\nyorker\nsubsection\nthanksgiving\ntransverse\nhoughton\nneutron\nmorphology\nmythological\ncho\nlocke\nmodelling\nbois\nfaçade\nleafs\npiracy\nevolve\ncompliant\nfulham\nsuperstar\nmechanic\nperimeter\nexceeding\nhmm\nfranciscan\ndetector\narrange\nwires\nvertex\nbethlehem\nwharf\ngi\ncarmel\nmedication\ninfants\nauguste\nbroadband\nbali\nrift\nhenrik\ndelivers\nmitsubishi\nleak\nnme\nsharply\nformulation\nbisexual\nsichuan\nsincerely\nbricks\nkendall\ncountdown\nsupplier\nplea\naffinity\nhet\nfw\nreplaces\nfined\nwalters\nanalysts\nkyiv\nmarketplace\nkits\nlamps\nlviv\nboyer\npreserves\nexpulsion\nfavourable\nbiologist\ndebbie\nstephenson\ntanker\ndomination\nmargins\nskate\nherring\ndisrupt\nworthwhile\nffff\nsteward\nproceeding\njacqueline\ncindy\nwatford\ntheodor\nrestructuring\nmysore\nbaronets\ndiver\nphilipp\ndisguised\ngmbh\naccuse\nconvergence\nprophecy\nnuevo\nwills\noutfielder\nsanitation\ntortured\nluton\ngovt\nankle\nbacklog\ncoil\ncollaborate\ncinematographer\nundisclosed\ndemos\npredator\ntops\nlivery\ncoefficient\nsentinel\nrecalls\nalphabetical\ninserting\nponce\ndefences\nvolunteered\nwilkes\noverlooked\nvogue\nleaked\nmiddlesbrough\ntorpedoed\nsoyuz\njanata\nmilestone\nimposing\nshades\ndeed\nfreud\ncampo\nrodrigo\
nredesigned\ngwen\nmasonic\nsummarize\njstor\nmonterey\ndenise\nspear\nzoe\ngraf\ndev\nfertility\ncarla\nvertices\nsuccessors\npleaded\nventura\nsins\nmastered\nculminated\nexpectation\nasteroids\nwat\nprima\nserpent\nstepping\nfarmland\nfixes\nviaduct\nchristoph\ninitiate\nremixed\ndunedin\ngrenade\nao\nanalyze\nsatan\nfrançais\nfolding\nearls\nchristi\nflux\ninvaders\nnail\nmodular\nsquirrel\noffences\nsloan\nboilers\nliturgical\nballroom\nvida\nscenarios\ntablets\nmartins\nneon\ntrader\ntails\nsaxe\nlamar\nthessaloniki\ndictatorship\nsperm\ndifferentiate\nconjecture\ntaft\nmckay\nmelville\nkris\nmating\nridges\ntabloid\nnorthward\ndecreases\nbattleships\ndescending\npolk\nannouncements\nhara\nsupplying\nétat\nhears\notis\nmilano\nnikki\npearce\nproton\nweaker\nrainy\ndiffusion\nclarkson\nbordering\nhostilities\nawakening\nsherlock\ntyson\nvengeance\ndoi\npont\nslalom\ncomune\nbowman\nsack\nleroy\nelk\napplicants\nmister\nnobleman\nhamas\nvectors\ndisagreements\nfs\nthats\nfreezing\nmounting\nquintet\nbaronetage\ncounseling\nkhmer\nbeaux\nfascism\nreproduce\nandrés\nwalled\ncostly\njurist\nmalaya\ngerhard\nexecutions\nflagged\nfoil\njammu\nurgent\ncerebral\ntajikistan\ntypo\ncolorful\nwhig\ndeception\nmariana\nhooker\nakron\ncrimea\nneolithic\nnarrated\nviva\nfia\nkrai\nfeasible\nimmigrated\ncanvassing\nqualifiers\nawful\nhübner\nengraved\ncoke\nconquer\nintroductory\nraaf\nhazardous\ncertificates\ndirectorial\nhume\ndl\npractitioner\ndisused\nperiodically\nferries\npathways\nabuses\nscrap\nmeaningless\nanand\ndocks\nillustrating\nfalkland\nshale\nannex\nwhistle\nglamorgan\nisa\naft\ncreations\nsms\njaguar\nhazel\nshu\nsellers\nvaudeville\ntenant\nwillard\nboca\nuni\nnagoya\nransom\nstokes\nredirecting\ncuriosity\ndisqualified\nemerald\nfars\nshear\nnokia\ninterfaces\npereira\ngenuinely\nchalk\npest\nsteamer\nillegally\nguillaume\nmixtape\ncompelled\ndecimal\nascension\ntechnician\nwasted\ndenying\nmelanie\nmutiny\nhind\nimpaired\nunidentified\nopenings\nmichaels\
ndonetsk\nching\nconfirming\npresiding\nmotifs\ndefects\nurging\ncapsule\nbuyers\ntrailing\ngomez\nastronomers\nclues\ndisciple\njared\napostle\ngrossing\ncompiler\njackets\nobe\ngan\nacquitted\nmagna\nnan\npreface\nensign\nuh\ndracula\nmandolin\npatton\nturkic\nnaomi\nunmarried\nsnooker\nlena\nannexation\nwasting\nyen\ntying\ndull\nconcession\nvalerie\namiga\ndonors\npurge\nalgae\njesuits\nsinatra\ndisastrous\npathology\ncontainers\nairlift\njpeg\nblanco\nrory\nhandsome\ndvds\nkabul\narchbishops\nrip\nbrigham\nginger\nbangor\nregisters\npembroke\ndiagrams\ndisappointment\nchamp\nindy\nnicely\nunexpectedly\nuncomfortable\nkahn\ncaring\ncinemas\nsummarized\npostage\nnut\npeculiar\nloyola\nequals\nww\nauthorship\nobsessed\nveracruz\ntunisian\nescorted\nwavelength\nspawned\nrelocation\nheadlines\nsquads\ncolon\nfist\nfrigates\ninsults\nguillermo\navalanche\nunpopular\ndickens\ndeported\nasa\nfulfilled\nretaliation\nminer\nlol\ngrounded\nyin\nsettler\ncbe\nsegregation\nexercised\nsubstitution\nkelley\nincidence\nkinetic\nbernhard\nfearing\nblurb\nskeptical\nhereford\nzheng\nbenedictine\nhz\nnairobi\nsinai\ngypsy\nfalsely\noptics\ntouches\ntanner\nhitchcock\nmanifold\nnests\nmoe\ndependence\npixels\nyves\nprefers\nxiao\ndenounced\ngymnast\nmop\nhelm\neduard\nbis\nvie\npilgrims\nmerrill\nbail\nrigorous\nsha\ngem\ncirculated\nsaul\nduffy\ntotals\nfashioned\nlandfall\nramps\nhyun\noffended\nrockies\nwaltz\nmedicinal\nepa\nambition\ndisturbing\npardon\nmoot\nlinguist\nstrangers\ncamille\ntb\nuninhabited\nbeverages\nvila\nlend\ntandem\nsemiconductor\npalaces\nod\nnomenclature\nbrowning\nmuse\nsilesia\nantigua\nisis\ntires\nsimplicity\nfuels\ninterdisciplinary\nfluent\nbarony\nswindon\npadma\nbounds\nhostility\ngabon\ntheoretically\nbankrupt\nmasked\npoole\nmaud\nmohan\nritchie\ninsurgency\nmoisture\ncorrectional\nmckenzie\nburnley\nsermon\nvenom\ntha\nelton\ncapitalist\ndominique\ncompilations\nramón\npriesthood\nawb\ngeologist\nrevive\nwhitish\nveins\nsarawak\norganizer\nteh
sil\ngalloway\nbengals\nreferees\nsud\npatel\ntripoli\nprotects\ncantonese\nwr\nsulfur\neccentric\nqc\npba\nvivian\ndesires\npak\nfatty\nscorers\nfeng\nexpiry\nprotectorate\nottomans\ntobias\nadaptive\nfederico\nnike\nemanuel\nmanners\ntuscany\ndocumenting\ntao\nscripture\nrusty\nmediated\nshout\nstronghold\nspray\neastward\nrhythms\nrooted\npixel\ntile\nornamental\nintercepted\nsuns\nwebber\ncis\njosephine\npreaching\ndistortion\nroofs\nhail\nevaluating\nbayer\nculturally\nparadigm\ndewey\nindicators\ndistinguishing\nfg\nmarkup\nparanormal\ncompatibility\ngrasp\nkyrgyzstan\nhardcover\natoll\namalgamated\nensured\nmythical\nrufus\natheist\nwarns\nmasterpiece\nmis\nbooklet\nmontréal\npostwar\nîle\nheats\nnotoriety\nortiz\nlever\nsymmetric\ndoo\nblowing\nlobbying\nexploit\nib\nchoreography\nsaloon\nthieves\nsabbath\nuhf\nzeppelin\nbernardo\nmv\nmystical\nsociété\nfundamentally\nqb\nattested\nmetaphor\nchesapeake\nlokomotiv\nfaa\nsalvage\noasis\nbeverage\nsufi\nstefano\ngalaxies\nshelters\nanchorage\nupwards\nreminded\nsexy\nthreaten\nquantitative\nguessing\nparentheses\ninstituto\nhutton\nlai\ntreats\nrink\narid\ntrams\nhailed\nwashing\nstony\nskies\nbarrister\nflourished\nvampires\ngum\nbathroom\nbartlett\nbenfica\ncrowded\nharmonic\npsychiatrist\nguido\nfas\nqin\nethel\nglossary\ncavity\naziz\nforgive\nsardinia\ntransylvania\nstadio\ndai\nreggie\nrepetitive\nuncopyrighted\ndismantled\ncurrie\nmiracles\nroc\nfam\nmoines\nreassigned\npumps\nkindly\nsniper\npod\nintercollegiate\nkin\nobsession\nezra\nninety\nthy\ngertrude\nguthrie\nlola\nanthropologist\ngoodwin\nblanking\nhellenic\nhairs\nmutually\nharrington\nparkinson\nsums\nhormone\naudrey\ngut\narchers\ndrummond\naperture\ngoalie\ndigitally\nmisconduct\nmammal\nknowles\nspotlight\nseldom\nspice\ngalerie\nassistants\nfitzroy\nic\noutlaw\ncougars\nharald\ngenetically\nrotor\nmas\nsplits\npeabody\nencouragement\ninstability\ndrafts\nperiodical\nmultinational\nanhalt\nrayon\nsylvester\narchival\nmil\nclaudio\nwitches\n
onward\ntomas\ndestroys\napples\narenas\nmedallists\nsabah\nmotorsports\nnapier\nlucius\noxidation\nlighthouses\nrealms\nvargas\nheadings\npulls\ngrazing\ncommentaries\nresisted\nemails\ndictator\ncroydon\nenthusiasts\nmontenegrin\nperiodicals\ncommitments\nlaughing\nefficiently\ntk\nnegatively\names\nunavailable\nreluctantly\nusl\npredictions\npreferably\nprecedence\nclergyman\npotatoes\ndebating\ncostello\nlibre\nopener\nscreenplays\nfrederic\noffenders\nars\nannouncers\nlede\nreminds\nsweeping\nfore\npsi\nsooner\ntransports\nnil\nantrim\nkilda\npurchases\nstalking\nprotagonists\ncigarette\nacadémie\nstamford\nracers\nclinics\nupgrades\ntl\nsnap\ndunes\ngriffiths\nmca\nchick\nrecipes\nghanaian\ninitiation\nballistic\nappealing\neh\nbarrels\nroche\ninspire\nsatisfying\nattic\nattain\nconsult\ntuned\nala\nmatthias\nchesterfield\nviceroy\ndisturbance\nbesieged\ntau\nlauderdale\ndumb\nsawyer\ntacoma\nholloway\nmaldives\nvuelta\nlangley\nbarnett\nlightly\nslater\nliège\ncassidy\njaguars\nks\nexistent\ndart\nboiling\nferreira\ncullen\nbrowsers\ninsertion\ndortmund\nmacintosh\nundated\nlille\npacks\nuniversité\nchittagong\nresolving\nreproduced\nglover\nmillionaire\nsynonymous\ndion\norganizers\nurine\nsicilian\ninflux\npets\nnoticeable\nmer\nbeckett\nfukuoka\nnanjing\npledged\nwes\nbuyer\nmal\nstripe\nenvelope\nrosenberg\noverlapping\ntrenton\nbestowed\nfaber\nconsonants\nrichest\nneptune\nbarr\nkhuzestan\ncharacterization\ntolkien\nforged\nnero\ncecilia\nedible\ndice\nasserting\nbreeders\nseparates\nskier\nmausoleum\nmonty\nreelection\nyearbook\nshafts\nmasks\nfaculties\nencompassing\ndismiss\nsantana\nswallow\nclint\nprevailing\ntranscript\nenjoying\nmassacres\nensembles\nmalaria\noro\nstaple\ntelangana\nfender\ntrait\nlange\noutdated\ncontamination\ncska\ndifferentiation\nadvisors\ngilles\ndownloads\ngrains\npsychologists\ntow\naxel\nwt\ntattoo\nsiena\ndepressed\ncass\nrowland\nlund\nhearings\nrosemary\nparrot\nadhere\nlindsey\nkemp\nryder\npeninsular\nblaze\nlimbs\n
furnace\nsergey\nfools\nphelps\ndickson\nslovene\npretend\nerect\nrainforest\nenclosure\nanalogue\nlegitimacy\ntirana\nrecession\naffection\nska\nef\nshipwrecks\naesthetics\nhayward\naol\nwaited\nnp\ntito\ndjs\nmag\nperpetual\nswap\nadjustment\nbertrand\nnavigate\nfairs\nmourning\nmounts\nsteiner\nfanny\npostcode\ndraper\nfortunes\ncancel\nhides\nspartans\nsears\nfullback\nlal\nlex\nstimulus\ntactic\npresume\ncabaret\nthou\ntransforming\nconfiscated\nundertaking\ncanopy\ninverted\ngraeme\ndrained\nwithdrawing\ntitanic\nairfields\ngaston\nengraving\nwonderland\nspontaneous\nwarranted\nspirituality\ndharma\nying\npropagation\ntextiles\nolds\ngesture\nalumnus\nkamen\nscandinavia\nbonaparte\nrepeats\nundoubtedly\nknowledgeable\nreconsider\nmagnum\nrichter\nclemson\nparry\nnfc\ngrandparents\nmiriam\npontifical\ndiocesan\nharmless\ndictionaries\nmart\nfumble\ngettysburg\nbey\nshortest\ncylindrical\ntiffany\nphysiological\nsafari\nscreaming\ncentimeters\nfaults\nowed\nproliferation\nlimb\nalliances\nmalicious\nfarmhouse\nadmissions\ncommodity\nintending\nndp\ninputs\nabdomen\ndiscarded\nfélix\nimpulse\nstricken\ncrowley\njiang\npenned\nvineyard\nbusinessmen\nyielded\nrationales\nsaxophonist\nkobe\narbitrator\nlouisa\nadmirals\ntexans\northodoxy\ndirk\nchattanooga\ncreole\ndrafting\ngarry\nbloomberg\nfuji\ncummings\ngothenburg\npamphlet\npatty\nstiff\nmarries\nhonneur\nscheduling\ncheek\nbucket\nflaws\nvapor\noverturn\nbyu\nprotector\ncarleton\nwoodstock\nlastly\ngeographically\nfreiburg\nufo\nprelude\ncory\nlynx\nliechtenstein\nexaminer\nsharpe\natkins\nffa\nblogger\npriorities\nakbar\nmeg\nairbus\nisil\ntranslating\nhomicide\navid\nsanford\nheels\ndiscographies\nlevin\nlau\nshotgun\nemigration\nslated\nhomework\nfascinating\ncasualty\nguernsey\npopulous\nconcealed\njumper\ndiaz\nwaived\ntechno\nlending\ntheorist\ncompose\nlively\nrelieve\nmasonry\narmistice\ncamel\nrevolves\nedt\nwaterfalls\ntil\nmadeleine\ntitus\ncatering\ndelicate\nquietly\nglorious\nredemption\ninjunct
ion\nisfahan\nnana\nliturgy\ncosmos\ntogo\nbaroness\nexploited\nimproves\nfig\nchant\nquran\nderrick\nchairperson\ntrance\nelmer\nrespectable\ntrophies\nbari\ndangers\nharyana\ntaekwondo\nmicrowave\nmorrow\nallegation\nras\nassessing\ninsights\ngangster\nviewpoints\nyunnan\ndanielle\nmarshes\nio\nmsg\ndino\nbishopric\ndeserted\ninternationale\npricing\ncz\nheron\nunmanned\navatar\nyates\naleksandr\nwalnut\nmarguerite\nseneca\nnrl\nconfidential\ninterpreter\nnavarre\nremembrance\ngemini\ntorino\npfc\nchords\nfireworks\nacquisitions\nscaled\nscanning\ncompromised\npointer\npitches\ndye\noversee\nbetrayed\nserena\nreadable\nunreasonable\npetersen\ngdańsk\ngardiner\nconvictions\ncollaborators\ncraters\nentrusted\nsatisfactory\nemilia\ncoincidence\nsusceptible\nindustrialist\nlawsuits\nfeather\ncompetence\nnasal\nroe\nmetadata\nelevations\ndenoted\ndyer\ntransporting\ncoupling\nnorwood\nkiel\nelbow\nhats\nunderstands\nforecast\nample\ndispatch\ntraps\nfranks\nthistle\npb\npartisans\nduff\nbillie\nheavenly\nhuskies\nkatz\ninstructors\naccessibility\nrobotics\nlausanne\nperpendicular\nbrains\nplaster\nrumours\nknesset\nbuster\ntrusts\ntoken\nguantanamo\nbrest\ncoma\npreferable\nzeller\nopp\nsimplest\ncentralized\ngee\nfernandez\ngoalkeepers\nbarnard\nsubmitting\ncathy\nbelievers\nprototypes\npops\nraped\ncollaborating\ncheyenne\nheroine\ncitrus\ntimely\nempires\nsalford\nzoom\nencyclopaedic\nbilbao\ndissent\naground\ninclination\nintervene\ngail\ncairns\nmurdoch\ncommemorated\nvows\nslayer\ninteracting\nsiberian\nvinci\nrowan\noliveira\nbaylor\nwilder\nboise\nkeynes\nridden\ndragged\ncerro\nexcel\njeremiah\naddison\ninventors\nhuron\ncelebrates\ngators\nfrontal\nmurals\ndenies\nsharif\nharrisburg\nevolving\ninstallment\nwai\nenergetic\nbafta\ncraven\nprepares\npalais\nprovoked\npopularized\nmonsoon\nmara\nattendees\nberth\nassure\nsafer\nbismarck\nwhitman\nmatilda\nweed\nsails\nmellon\nsurfing\nbiotechnology\ncary\nqueue\nscattering\nhaul\nsubgroup\narturo\nconsume\nyeshiv
a\nerwin\ncoordinating\ncarolyn\nhartley\nbournemouth\nmata\ngenealogical\ntorre\nformulated\nauthorised\nsoda\nrendition\nsuriname\nprefect\ninsurgents\nagg\nsinister\nrec\nsim\nphyllis\nparental\nreminder\nrp\nfishes\nseaside\nmarlborough\ncy\nrhys\ndodd\nnails\ncylinders\nbrowse\nimmaculate\nsounded\nintensified\naccordion\nembraced\njoker\nhendrix\nelector\nvoyager\nita\njuno\nvirgil\ngrab\npilgrim\nstrawberry\nbounty\nvicious\ntowed\ncollaborator\nsabre\nceltics\nremarkably\ndisclosed\ndecca\nuniquely\nsynchronized\nmicrophone\nfang\ndhabi\nfracture\ncolliery\nbrethren\nmaze\ncomparatively\niberian\nasleep\nsucceeds\n‘\nsprinter\nconceded\nhidalgo\nhack\njohor\nhum\nunbeaten\nowls\ncongregational\npentagon\ncategorize\nrests\ncontinuum\ncomputation\nschumacher\nsas\nfreddy\nliable\nsomme\nkangaroo\nmlas\nqui\ntaxonomic\nyerevan\nbarnsley\naugmented\nwestwood\nari\ngalactic\nsuperiority\nioc\nheck\nfiona\nchiba\nemotionally\nilluminated\nkidd\ninterventions\nslade\nale\nabsorb\nvain\nrobotic\nstaten\nprevalence\nbraun\nselo\nshandong\nthorpe\nwolverines\nhints\ntug\nlied\ncambodian\ncommemorating\nknives\nzeus\nedmond\nbluegrass\nenrico\nzaragoza\naverages\nclerks\nsax\nconcordia\nappendix\nfamilia\nbaird\notter\nspw\ngillespie\nseminal\nnf\nrana\nwrap\nmead\ncasablanca\nqur\nbabe\ncoincide\npenis\nmckinley\nverdi\nseverity\ncomplementary\nsuperliga\nveto\naccountant\ntheo\ncharming\ncolbert\nineffective\nrushes\nsui\ncriticizing\ndonkey\nroadway\nencryption\npuzzles\nmisunderstood\nhokkaido\nworries\nhairy\nserra\nunemployed\njsp\nretailer\nhotspur\nanal\nhaynes\nbartholomew\nuntitled\nwooded\njudah\ndeco\nentropy\nhelens\nabnormal\nanalytic\nreinforce\nsonia\nromani\nvt\nblows\ncows\nclutch\ngupta\nbolivian\nramírez\nmanipulate\nknighted\nbahia\nsliding\nshower\nipa\neverest\naudi\nshetland\ncooler\noutlines\norbits\npurity\nhawthorn\nmarianne\nskyline\nignorant\nemerges\nimplication\npolicing\nmariano\nhoc\nturkmenistan\nlimitation\nprosecutors\nweaving\ntran
sforms\naubrey\npeck\nbusiest\nwikipedias\nprosperous\nrewards\nprecinct\nbu\nnovella\nwikia\ndiagonal\nbowled\nsbs\nauthenticity\njourneys\ndetectives\nkinda\nbasics\nputin\naviator\nchurchyard\nalderman\nculinary\nrosie\nluzon\nconnectivity\nava\npanda\nbankers\nprescott\nentrepreneurship\ndestined\ngoaltender\nbiomedical\ndoha\nmorley\nouting\nnewcomers\nschedules\nsire\nsamson\ncheryl\nliberalism\ncaucasian\ndolly\nflu\ntraverse\nceded\npieter\ncoasts\ngrossed\nfoothills\ncollided\ntricky\nwb\nenvoy\nseizure\nerupted\nsweeney\ncontra\ndisrupted\nmorale\nenhancing\ncaravan\nfortunately\nbarra\ndisappears\nbahadur\npresses\nindependents\nrack\nreactors\ndesignations\nprinters\nalgiers\nlehigh\ntam\nwexford\nfibre\ntory\nnavarro\nchimney\nyellowish\nromano\nmodernization\nschultz\nrobson\nlyndon\nwandering\nrowe\ntutorial\nignacio\niq\ndistributing\nvertically\nremastered\nalta\nstreetcar\ngloves\nimpressions\ninvade\nwarrants\nimminent\nreese\nrematch\nunitary\nmei\nsampled\ncinematic\nfederally\nvolvo\nkosmos\nhalle\nhernandez\nrefurbished\nineligible\nmayoral\nrhyme\nsuppliers\nesperanto\nwentworth\njavelin\nherrera\nlandowners\ncooperate\nheroin\ntalmud\nalsace\nely\nwee\nroyalist\nmelvin\nchico\ndivides\nincentive\nconstructions\nchili\nkern\nlandslide\ncochrane\ncompensate\ndeposition\nflyer\ngina\nterence\nslots\nweightlifting\nparaguayan\nvh\ndonate\nmarius\nfins\ncorbett\njihad\nate\nmarquette\norphans\nserge\nmecca\nwrongly\nmuller\nsap\npalatine\ngoats\nborussia\nhandy\nupward\nanalyzing\ncheung\nlandowner\nrefinery\nindexed\nmanson\nethanol\nrecognizable\nrunaway\ncorona\nsynopsis\ncebu\njohnstone\ntightly\nconsoles\ncrewe\nindicative\ninforming\njens\nvance\nscare\ncoincided\nmari\nbloomington\nscared\nnara\njargon\nscandals\nsaddam\nneglect\nwnba\nmoldovan\nmcgraw\nzoology\nspanned\nconfuse\nguerre\nslowed\nmetz\ndrowning\nbsc\npasted\nengagements\nconnolly\nedged\ncommunicating\nmcdonnell\nfonts\nelongated\npicasso\nnueva\ndisagrees\ncertifications\n
regeneration\nfutebol\ntristan\nmercenaries\ntelenovela\nnikola\nwrist\nmotions\nhornets\nawarding\nbeyoncé\ndreaming\ninevitably\nscar\nmein\ndora\nlombardy\nrochdale\nprostitute\ndk\npistols\nimplicit\nsaturdays\nhygiene\narmagh\nemphasizes\nmaclean\norléans\ncoptic\nalpes\nauthorization\namenities\nyun\nthorn\nowning\nbureaucracy\narticulated\ntimed\nbedfordshire\naleppo\ndeprived\ninvitational\nbanners\ndmitry\ntransmit\ncompartment\ngrimsby\nnatasha\nmassey\namc\nrunoff\nhighlanders\nweighted\nabbr\nendowed\nsabotage\nwasp\nwedge\nulysses\npins\nfir\niss\nevenings\naugsburg\nroach\ncarriages\nmold\nsaskatoon\nimprovised\nalzheimer\nshoreline\ncensored\noaxaca\nmünchen\nexcerpts\noriginate\ndelight\ngalileo\ntendencies\nexploits\nhaas\navenues\nflanked\nhostages\ninvasions\ndiscourage\npositioning\nsymbolism\nchased\ngaga\nforeman\ncontender\nbison\nencyclopedias\ndrummers\necumenical\nhazards\ndisregard\nballard\ninjuring\ndiscus\ncontaminated\nhilary\nbarangay\nbios\nclocks\nbandits\nstyling\nremembers\nmcgee\ndelist\nlego\nsystematically\ninitials\nlearnt\nastronauts\nmanfred\nsermons\nwestbound\nmessiah\nnationalities\ninvading\ncookie\nconfer\nrepertory\nlansing\npreseason\nkaplan\ncoated\nboogie\nbelts\nfx\nwrocław\ncurated\nepithet\njarvis\nimpacted\nmilford\njunta\npetra\ndrastically\nharcourt\nakira\ncalhoun\namour\nslash\nobstacle\nrepealed\ncoded\nsickness\nstm\nneumann\nexplosions\neileen\nsensing\nimagined\nproponent\ntrojan\ncher\nstockport\noricon\njewels\ninsisting\nneuroscience\nbids\ngotta\nduplication\ncondensed\nnegotiation\nrealization\ninviting\ncathedrals\ndoherty\nepoch\nsociological\ndépartement\npopulace\nseychelles\nprc\nlaundry\ncatchment\nbourne\nfragile\nnes\npickup\noblique\nyouths\ncabal\ninformally\nbreakup\nrye\ninvesting\norganising\nproportions\nrees\nbeg\nprompt\nanatolia\nkicker\nintercity\neastbound\nlegged\nadjutant\nelgin\nolson\nzeta\nblew\nprincely\nfern\nprohibit\nwarrington\ncristina\nblitz\nstrains\nexaggerated\ntruc
e\nrods\noclc\ninduce\nthee\nlegislator\nsuez\nmortimer\nvandalized\ncarver\nmetabolic\nong\nintegers\ncavaliers\npennant\nmarkus\nconfessed\nrevisited\nmeiji\nconveyed\nshareholder\nshaping\nhuffington\nmainline\nmemorandum\noverland\naccumulation\npeach\ncertainty\nmigrant\nefficacy\nlistener\nmussolini\ncongestion\nupside\namounted\noils\ncompares\nnyt\nkhalid\nchinatown\ninspiring\nremnant\nstriped\npumping\ngöttingen\nsumo\nwilkins\nmixes\nbroker\nnavajo\nfaroe\nvascular\nknocking\nadmittedly\ninflicted\nhua\npossessing\nunconstitutional\nue\nspruce\nenforcing\nfairness\nparamilitary\nyds\naccuses\nrepression\nyamaha\nabbreviations\njiangsu\nbackdrop\nshreveport\nfeeder\nstout\nbenito\nensg\ngareth\npivotal\nrolf\ndensely\npharaoh\ndynamite\nastrology\npredominant\nchichester\nitalianate\nrehearsal\nevasion\njing\nliberated\ndrains\nremembering\nmedicines\ndoubtful\nclassmates\nspheres\nrequiem\nstartup\ninsistence\nernesto\ngiacomo\ntopical\npretoria\nidiot\nlongitudinal\nratios\nhelium\nfargo\npenetration\nconserved\nsaharan\nsediment\nlicensee\nsundance\necosystems\nyvonne\nconvened\nlegislatures\nmaharaja\nrouted\nelectors\nbiochemistry\nrodeo\nexposing\nqueer\neaston\ndamian\ndiffered\nhandel\ncommemoration\nrevolutionaries\nlore\nur\neugène\nbicycles\nellington\nlookout\ninterpreting\nek\ngull\nsleeper\nmedici\ntrimmed\nreplication\nhyundai\nauspices\nmoto\nfacilitated\narroyo\nhacker\nsour\nob\nalfa\nidf\nbutte\nstarters\ndownloadable\ninverness\ntraveller\nsociologist\nmaureen\nandorra\nhunted\nheel\nsuites\nannapolis\ndamien\nisp\ngunn\nscarce\nproteam\nrosen\nconsultants\nkaty\nplagued\ncontrollers\nenergies\nlinguists\nbadges\nhouten\nness\ncruft\ngrassland\nvineyards\nbikes\nccc\nsmuggling\nnarration\nensures\ndane\npuebla\nmarta\nbully\nmartian\ndonegal\ndyke\nraided\ncalder\ncontradictory\nlenny\nexceeds\nsalesman\nartworks\nbetrayal\ndoubling\nolympiad\ndistributors\ncappella\nabruptly\nnegotiating\npencil\nprescription\nstatehood\nweekdays\nnove
lty\ndirectional\nbureaucrats\nwolverine\nheal\ncontempt\nreversion\nspectator\nger\nfairbanks\nanyways\nprofessions\nsoto\ngustavo\nrebranded\njoints\ngrandmaster\nbharatiya\nantiquities\nsacrifices\nrashid\nnutrients\nsudanese\njoaquin\nwitchcraft\nlithium\nironic\nnepalese\ndurban\ndefect\nconspicuous\nregents\nspaced\nmusa\nmaynard\nunwanted\nich\nmartina\npubs\nanchors\nsighted\nwarship\nchaplin\ndow\nmarjorie\nfoundry\njohansson\nmuir\nfeminists\ntravellers\nfuse\nnicknames\nopium\nguiana\nshrines\ngroundbreaking\nbenefited\namplifier\nmunro\nframing\nsupplemented\nchristensen\nbromwich\nnouveau\nforeseeable\nundid\nyusuf\nkildare\nwrath\nbaptized\nshepard\nsilla\ngoldsmith\ndrone\nnovo\nunderwood\nquota\nkarma\nvis\nmcc\nogden\ncigarettes\nkamal\ngeared\nconfessions\nlieu\njazeera\nskeletal\nyr\ncartridges\ngardening\nchecklist\nbog\ndietrich\nrhône\njudd\nglobalization\nsal\ndownward\npolly\nevangelist\ndiverted\nlauncher\neuros\nengages\nstereotypes\nhedge\nforster\njae\nnavigator\npolished\nfabricius\nimplicated\nxl\nmonde\ndelle\npk\nhenley\nkawasaki\ndomenico\nwhitehead\nuv\nhannover\ndisadvantage\ndietary\nmesh\nlucknow\npulmonary\ninadvertently\ncounselor\ncompiling\nrig\nfisherman\nkathryn\nstephan\nvojvodina\nanglia\nlouvre\ngregorian\nkat\nessendon\nforeground\ncant\nvu\nthom\nposing\nmarches\nbrentford\nmonterrey\nmeps\nlouie\nbidding\npensions\nscenery\narched\nalmanac\nsiemens\njewellery\nwallis\nbattling\nmateo\nspamming\nunspecified\ngibbons\npotsdam\ngladstone\ncaspian\nrevoked\nfatigue\nensued\ntoro\nhash\nhooper\nhopper\nquay\nsouza\ngovern\nselangor\nclaremont\nassign\nslipped\nbrandt\nbose\npluto\nbalancing\nhighschool\naristocratic\nguerra\ngoddard\nwindmill\niata\ndept\nhomo\narcheological\nwolff\nadjunct\nperceive\nprojecting\nmontane\nstylistic\ncarving\nhumanist\nsahib\nvanuatu\ngerais\nmcgrath\ncarlson\nalf\nsachs\nusher\nstrengths\nwaterways\nindications\nsesame\nleary\nmeritorious\nbreathe\nscribe\nvastly\nlinden\npalma\nclad\namid
st\nbarley\nsportsmen\nsloop\nspartan\nmeteor\nbalcony\nbored\ncute\ngospels\nayrshire\nneighbour\ngigs\npinto\njailed\nacacia\nretarget\nleach\nasiatic\nnantes\nreefs\nmandir\nrefueling\nglee\ncoefficients\nonion\nmaize\ngogh\npurse\nlsp\namman\nsequels\ncanons\nannotated\nlambda\nfortification\ntoxicity\ndependency\nconversions\nemir\nsirius\nwetland\nauditor\ngoethe\ncottages\nbaths\npunish\nabby\ndistinctions\narmand\necuadorian\nnorma\ndoctrines\nentrances\nsewage\naxes\npretending\nsubcategory\nidle\ngems\nnehru\npistons\nshocking\nnebula\nlaval\nlungs\nmanuals\naztec\nvendor\nhonoring\ncurb\nincentives\nmanifest\nwhiskey\ngallantry\nsalaries\ngraded\nbiennial\nmarcelo\nunitarian\nbakery\ntraveler\ntanaka\nwebs\ntrainers\nzombies\nbai\ncontradiction\ndude\neritrea\nsexes\nmanny\ndistorted\nsudbury\nretro\nrestart\nmanly\nköln\norphanage\njericho\nnewscast\naquino\nnesting\nribs\njumpers\nunderside\ndisclose\npenang\nsaratoga\nconvict\nrecommends\nnarratives\npurported\nstables\njürgen\nyao\nbrownish\nnair\ntakeoff\nchávez\ncomets\nheiress\nwo\nstargate\ngently\nadmitting\nkermanshah\nuppsala\ncypress\nzhejiang\nrestrictive\neconomical\nrepublics\nintegrating\nnico\ninstructional\noccupants\nsoc\npawn\nsecession\ntyrol\navoids\ntian\ngrady\nanarchism\nattends\ndownhill\nguarded\nnatalia\nsparse\nambitions\nsyllables\nkeyboardist\nhungarians\nignores\njc\nhated\nbathurst\nmacon\npsycho\nhanoi\nsuperb\nbuys\nhodges\nhl\ndisposition\nsuffice\nrooney\nchloe\nhearted\nlevant\nfoundered\nunto\nbala\nritter\nbanquet\nhaley\nclade\nkan\nturnover\nlazio\nventilation\ngears\nlifts\nshelton\nmccormick\nmcbride\njang\nsnout\nrotherham\nnemesis\nreconcile\ncoating\nteahouse\nlombard\nalvarez\nexpenditure\naviators\ntoes\nparc\ncomedic\nwalden\nouts\ncanucks\nsupervisors\ncheating\nharvesting\nolaf\nbreuning\ncarthage\nparaná\ngottfried\nexamines\nreputed\ndunbar\nelijah\nfordham\nidols\nbarge\nrihanna\ncartel\nangered\nantilles\nradcliffe\ncapacities\nfueled\nwoodrow\nmusta
ng\noverseeing\norganizes\nwah\nasturias\nbeaufort\npalo\nama\nquezon\ndarius\nmotown\nwestchester\ntrenches\ndcc\ncontraction\nshrimp\ndungeon\nsupremacy\ncrust\nscriptures\npostmaster\ntal\nroanoke\ncovert\nyeast\nmongols\nmaxi\nglove\ngilmore\nmuscular\nste\nweighs\ndeux\newing\ncommissioning\nclauses\ndune\nfictitious\nrazavi\nstorylines\ncoa\nfrankenstein\nburundi\nbearer\nraul\nstylized\npines\nprakash\nrani\ndm\ncleaner\nindicted\nkinase\ndupont\nguadalupe\nprogressively\nlandlord\nfleetwood\nmigratory\nrejecting\nguise\nbrewster\nnomadic\ncaliph\ncalculating\nconfigurations\njar\nreadiness\ninsulin\ndagger\ngoalscorers\nren\nsasha\nperfection\nursula\nnovak\nwrexham\nspaniards\norkney\ndonaldson\ntata\nauschwitz\nhalftime\ntransitions\ntestify\ntrolley\ntae\nearle\nkonstantin\noyster\nsevered\nimg\nreversal\nconcessions\nsampson\npisa\nmcleod\ntrojans\npaige\ntwinkle\nantisemitism\ncomp\npractised\ncaf\nlining\nsuicides\ncollateral\nxinjiang\nrugged\nfy\nblizzard\nesteem\nstimulation\npity\ninherit\nzack\ncadillac\nfrédéric\ntokugawa\ntwain\nshen\nphoton\nmunitions\nincompatible\ntrolls\ntoad\nrevolver\nprevailed\nsynonyms\ningredient\nenhancement\noverwhelmingly\nlégion\nals\nmounds\nregis\nlogin\nbogotá\nmsn\nfungal\naryan\nrevolutions\nbonding\nsven\nnicky\npacking\ntransformations\nesq\ntastes\nkeel\ninstalling\nbl\nfis\nvivid\nplaintiff\ndecomposition\ngroom\nphonetic\nsynth\ncensor\nüber\ntianjin\nhari\ntopographic\nbwv\ndavey\nexpands\noccurrences\nrelaxed\noutward\nreg\ncrafted\namericana\nimplementations\nexperimentation\nargyll\nburr\nkonrad\ngazetteer\nbasins\ndarrell\nlocalized\ndeploy\nlineman\nmessaging\nspence\nfei\nlocker\nbadger\ngearbox\npalin\ncumulative\nspellings\nloyalist\nwharton\npeoria\nknees\nmannheim\nyd\nser\nargent\ncradle\nmclaughlin\nconfesses\ndetrimental\nclockwise\nprosecuted\nlocus\nregulars\nwakes\nlinkedin\njordanian\nkaufman\ndomesday\ndisliked\nviet\nrust\nnotions\nporcelain\nsemantics\nsouthend\neclectic\ncared\nhorses
hoe\nci\ntopological\nmacleod\nbloomfield\nlongevity\nvariance\nhighness\nguo\nharrow\nebert\ninsignificant\ntransplant\nlib\nanticipation\nchao\nlg\nterre\nvoyages\nsumner\nrazor\nveil\nluciano\narcadia\nformidable\ntides\nargonauts\ntick\nnotwithstanding\nunpleasant\ncantata\ncolumbian\nnetball\nchevalier\npolicemen\nanimator\nweddings\nquartz\ndumont\nabdominal\nrenew\naffirmed\ncomputed\nprimaries\nlaureates\nfilmfare\nclones\nflashback\nutilizes\ntraumatic\noutdoors\nhoffmann\nconstrued\njesús\npushes\nriaa\nlocking\nsupplementary\nagrarian\nblossom\noctave\njude\nptolemy\nqueries\nropes\nalbrecht\nhaydn\ncdc\ngrenada\nfade\naspen\nroi\ndeteriorated\negyptians\nboasts\nsuárez\nlifeboat\ngroningen\nsevilla\nhybrids\nbabu\ndepart\nskins\nburrows\npaisley\nterminate\nghent\nreigned\nusernames\nlowering\ndesperately\nseismic\neastenders\npow\nmadden\ncrocodile\nabrams\ninteriors\ndent\nmarlins\nbetting\nhagen\nrepublished\nreelected\nchong\nfiesta\nprojections\nmuddy\ninvertebrates\npaleontology\nnovice\nrower\naccolades\nprologue\ncinderella\ncyclic\namalgamation\nberwick\nblatantly\nfreedoms\ntransmissions\norganise\nreflective\nstabbed\nsimulcast\nreformer\ndenton\noppression\nfoam\nmonograph\ngentry\nchemists\ngabrielle\ndresses\nlectured\nmaneuver\nnerves\nadulthood\nbray\nmilne\ndaring\nhamid\nutter\nherbs\nharness\ncleopatra\nbrno\nemancipation\nphoebe\nreactive\nchristy\nreset\nbianca\nimplements\ncorrecting\nfugitive\ncicero\nbono\nsynagogues\ninvariant\nmindanao\npleistocene\nattackers\nviz\nhebei\ndefamation\nrelocate\nsuperheroes\noptic\ndowager\nfuzzy\nspecifies\nanthologies\namin\npoultry\nhelmut\ncomedies\nbeech\nfivb\nlori\nnorthernmost\nsimulator\nding\nsouthernmost\napprenticeship\nmsc\nraceway\nkhyber\npensacola\nroyale\nmacro\ninsider\nrighteous\nmirage\nsweat\nbanda\nphrasing\nunderstandable\nrutland\ninuit\ntumors\ncavendish\nvoodoo\npun\ngambia\nthematic\naltering\npyrénées\nsaitama\nreacted\nsint\ntulane\nboiled\nregulating\nprogresses\nwarm
er\nusefulness\nintrinsic\nstainless\nlulu\nvegetarian\ntracts\nalam\nkissing\nannoyed\nraúl\nshattered\nnoaa\ndusty\nlilly\ninterestingly\nforeword\nives\nseo\nneurological\nvibration\nessayist\npoisoned\ninvoked\nfrontman\narchdeacon\nrenal\njiu\ntriassic\npriced\ninternacional\nfabian\ntrailers\nlillian\nposes\nuneven\nturrets\nharlan\ncruelty\nstorytelling\nvirtues\ngorilla\nrochelle\ngui\nauditions\noboe\nremarried\nbounce\nradicals\nincidental\nsonora\nguarantees\norchids\nmyrtle\ncharters\nsuperficial\npatti\nvolga\nhilda\nhamburger\nmultitude\ndire\naeronautical\nprogrammers\nquicker\nclinched\natt\ntnt\npackard\nbubbles\nstunning\ngarth\nburroughs\ncrypt\nrewriting\ngonzales\nrestless\noverwhelmed\nmiocene\nfused\ngranville\ninfluenza\ntomato\nnagasaki\nmedications\nengraver\ndistinctly\nwaist\nsalvatore\ntunis\npuppetry\nfremont\nsemitism\nrecycled\ncommendation\nscorpion\ndamned\nencore\ninhabit\nshapiro\nargyle\nbingham\ninterrogation\ngamble\nbridget\nregulator\nadvertise\nclassifications\nbutch\nthriving\nwiener\njalisco\nadminister\nhenan\nworkplaces\nnewbie\nlibrarians\nub\nfilmmaking\neasiest\nngos\nwelles\ndevote\nshrubs\nattendant\naerodrome\nfollower\nmethyl\ngrayson\nvaughn\nbodyguard\nbyte\nsplash\npeña\nclassify\nlieutenants\nbridgeport\ntechnicians\nmanufactures\nlennox\nswans\ncanoeing\nplanck\neastwood\nformulas\nstaffed\ndirects\ngérard\nenvisioned\nmusik\nrepeal\nmach\nmori\nrabbits\nprostitutes\ncompassion\nlabeling\nsynthesizers\nhen\ndelgado\nbosch\ncue\ndh\nkaye\nfielded\nhawker\nbarrie\nhawke\nithaca\ncornerstone\nincapable\neureka\nanastasia\ncayman\nbarnet\nfronts\nclippers\nnapoli\ndeviation\nquakers\nkeepers\nmutants\npeng\nrum\nhahn\nspacing\nbritannia\nmuñoz\nparasite\ntopography\nconglomerate\namusing\noutflow\noffender\nwaller\nmabel\nintercept\niroquois\nperceptions\nnic\nhonesty\nfaulkner\nmined\ncluj\nblazers\nabide\nlpga\npontiac\nabusing\nturmoil\nrhino\nkilometre\npackaged\ntrois\naspiring\ninhibitors\nbarrage\npiazza\n
truncated\ntrondheim\ncapitalized\nbusan\nphased\ndank\noutlaws\npronouns\nignition\nevade\nbuddhists\nkobayashi\nwoven\nmute\nfai\nirony\ncabinets\npersisted\ngc\npotent\nsubsidies\ngin\nnuclei\nprocurement\neintracht\npictorial\nmaroon\nprem\ninexperienced\nhid\ndesignate\neats\nmacquarie\nbooking\nadherents\nicf\nhove\ncaliphate\nox\ntolerant\naristocracy\nplumage\nclaw\nbackstroke\nmigrate\ntilt\nhillside\navalon\nwasps\ntemper\ncorvette\nedna\nchopin\nglendale\nchaotic\nassaulted\nmahmoud\ndevotees\npadua\nmatrices\ndilemma\nfide\neine\nplum\nattacker\ngoogling\npertinent\nbourgeois\nmani\ngraz\nmosquito\neuclidean\ncub\nechoes\nmisses\nassemble\nethernet\nbait\nscholastic\ndip\nschubert\nmauritania\nlev\ncrisp\ntotaling\nmultiplication\nlarson\nbreaststroke\nchefs\nsuspicions\nngc\ngroves\ningram\nadriatic\nknicks\noutpost\ndarmstadt\nrhymes\ncommodities\nfashionable\nsediments\npunitive\nléon\nskipper\nirina\ngrassroots\nsticking\nblaine\ncapitalization\npreached\nsheppard\nmagistrates\nnadia\nexplanatory\nmina\ntensor\nsignalling\neuroleague\nestimation\nivanov\nkeystone\nimitation\nbiennale\nsalamanca\nislamabad\nconnacht\nconverse\nbradshaw\nunseen\ndaryl\nmbe\nadolescent\nskyscraper\nmontpellier\ngag\ndormant\nvanished\npartizan\neastman\nnunavut\nattach\nvolatile\ncaleb\nmoniker\ncardiovascular\nnec\nreza\nmelt\ndisks\npri\nbroughton\ndx\nzoological\nbodied\nportage\nsupermarkets\nassassins\nrn\nearnest\ncosmology\namar\nseaman\nejected\nmandal\nscrub\nphylum\ntyre\nhavilland\ngotham\nbrabant\npremiers\nfay\nskopje\ndecker\nmermaid\noutspoken\ncomrades\nkarim\nclimbs\narchiving\nslain\namplitude\nappellate\nfishery\nbragg\nexcitement\nhorne\nsalute\ninflammation\nmálaga\nsurrounds\nfriars\nbackbone\npetals\npegasus\nmoselle\naisle\nhobbs\narmando\nkharkiv\ntrafford\nridley\nverge\naltitudes\nhates\nmidfield\ncontracting\ncocktail\noutsiders\nexperimented\nmaguire\nbard\nfaded\nalternately\nunlawful\neternity\nconvection\nkimberley\nlute\nhuntsville\ndarr
yl\nschizophrenia\nmcdowell\ngrasses\nugandan\npollard\nprophets\nquito\ntruss\noutsider\ncambrian\nnewmarket\nhound\nstaples\nnarayan\nillustrators\ndrury\nbarclay\npreferring\nfaust\nmaha\ngage\nalleging\nthence\ndowning\nelf\noctopus\ninterceptor\ndeserved\nfines\nsomeday\nhangar\nprohibits\nbeau\nebay\nalbans\ndeutsch\nlucien\ncontrasting\nhannibal\naegean\nmcguire\ndil\nfiltering\nhourly\njohanna\nstacy\nnaturalized\nerica\ncompute\nevenly\nterribly\npalms\nkickoff\nwithstand\nnaive\nkylie\nvase\nazt\ndominica\nazores\noutgoing\nrollins\ninternationals\nmcpherson\nbarre\njd\nbulb\ncrusader\nspines\nfielder\nmacy\nnakamura\ngreenberg\ngoldwyn\node\nguildford\naqueduct\nrubbish\nvasco\nsimulated\narboretum\noleg\nnotch\ngreeted\nchoirs\nlew\nbiologists\ntaller\ncentric\njamal\nhermitage\nfootsteps\nwiped\nbooked\ntrier\nparramatta\niain\nlakshmi\nmoravian\npeppers\nversatile\nmundo\npedersen\nwhaling\nbinds\ndim\nbazaar\nfeatherweight\nhalves\ntrieste\njedi\nluz\nfirefighters\ncauseway\nmagdalena\nmist\narranging\ngalveston\ncomte\nforcibly\nparasites\nisabelle\namend\nfijian\nfidelity\nsentencing\nshenzhen\noffenses\nbaked\nnea\nlure\nroundabout\nlistened\npointe\nparasitic\nsolvent\nvested\nmodifying\nkarabakh\nscotch\nvaluation\nsevern\ndiversion\ngoldstein\ncas\nmornings\nhunan\ndummy\nresembled\nhb\nkelvin\nbeavers\ncalvert\ninjected\nartifact\nmanipulated\nmusically\nitaliana\ndialog\nfluids\nslab\nwalsall\ngrams\nhillsborough\nlizards\nmoonlight\ncantor\ncarole\nsaigon\ntelecommunication\ngunner\nsj\nstray\nbrightness\nmolina\npseudoscience\nobey\nprism\nimpending\noctagonal\nuniversiade\nsorrow\njarrett\ndolores\nfronted\ngunpowder\nbabylonian\ncurvature\ncolonels\nvip\nborg\ntorquay\nantibodies\ncracks\nsinn\nheidi\nyep\nbergman\nchristophe\nmarko\nmavericks\nsiam\napologized\nunauthorized\ndaphne\nmozilla\njenna\nreplacements\nfrustrating\nfrancesca\ntakahashi\npassports\nclaudius\nscent\ncharley\nbmg\nsusquehanna\nscam\ndanzig\nstature\ngunfire\nrallie
s\nemory\ndependencies\nserials\ndrunken\nstalled\nclapton\ncompile\nhuber\nobesity\nfourier\nsn\ninfancy\nhyper\npalau\nsiegfried\ncandle\nallowance\nislamist\nstrikers\nprincipals\noversees\nstimuli\njai\nhodge\nmathews\nparcel\nwelcoming\nshouting\ngodfather\ncuckoo\nbreeze\nhrs\ndrying\nmitochondrial\nretrieval\nminogue\nduc\nsyriac\nsebastián\nhurley\ncms\nemery\nmadeira\nchihuahua\nkali\nbloggers\ncivilizations\nhezbollah\ncurrencies\nfrankish\nvibrant\ningrid\nsentiments\nindochina\nelectrification\ndanced\nacquainted\nchow\noriginals\nwren\nconvoys\nwaterway\nrotated\nphylogenetic\nwelding\nhusbands\nvigorous\ncongenital\nfulfilling\ntolerate\nmenace\neurobasket\nspectroscopy\nmarek\nmckenna\nshowdown\nshrew\nrebirth\ngujarati\nidentifiable\nunprotected\nstrained\nlyle\nbooster\nstealth\nfayette\nliking\nhodgson\ndecatur\nnewsweek\nmoshe\nmusique\ngreyhound\nkoreans\ncontacting\nzulu\ngalician\nferris\nripley\nmerseyside\nostensibly\nabducted\nfloral\nkilometer\nmazda\nsequential\nentertainers\ngaddafi\nyielding\nnarrower\nrivalries\ncroats\nczar\ncoinage\nter\nnewtown\nghz\nsousa\nlynne\nkepler\n●\nsheriffs\nreworded\nmohawk\nhawthorne\nlaude\ndoses\nape\nnazareth\ndoubleday\njess\nmagnesium\nsocietal\nintercourse\nmurdering\nyamaguchi\narmada\ngilan\nfunctioned\nchapels\nathena\ncanning\nsouthport\ntaunton\nfavorites\ngladys\nvincenzo\nclerical\ndisrupting\nbogus\ngatherings\nprivileged\nlst\npollock\nkant\nwarp\nwcw\ndomino\npockets\nops\ndisneyland\nresurrected\nradiant\nhenrietta\nmersin\nsonar\nacquaintance\nclancy\ncontrasts\nappliances\nsocket\nspiegel\nharmon\nsikhs\nmodelled\nbrasileiro\nmuzzle\nbitch\nsupervising\nrotate\nimpress\npantheon\nduran\nignatius\npv\nlcd\nastro\nclermont\nscary\ngenie\ndespair\nundermine\njoanne\nbuff\nbosses\njurisprudence\nandersson\ndialogues\nsabres\nenfield\ngastropods\ncutler\nrostov\ncongresses\njaws\ngrupo\ngoodwill\ngrim\ntrustworthy\natrocities\ncowan\nkassel\ntelenovelas\nsupplements\nsynthesized\nsteamship\n
proprietor\ngrimm\npublicized\nprops\nshortlisted\nsuperfamily\nerroneous\nrouen\ncheer\nbuena\namadeus\ncaledonian\nguatemalan\nchateau\ncommended\ndownfall\nmazandaran\nliar\npaddle\nopéra\nappropriations\nhostel\nmandy\nvedic\njaya\ngwynedd\ndepended\njosiah\ntheta\nmedial\nregimes\nsticky\nmallorca\nvent\nmotel\njena\nkarel\nturing\nsuperhuman\npsalm\nbal\nhellenistic\nexhibiting\nwinery\nbackstage\ntipped\nbromley\nprimate\nhistoriography\ndiscounted\nrave\nasst\ntaxon\nmcintyre\nbae\npause\nzee\ngarment\nnikolay\ncatfish\nreworked\nmarkham\nkisses\nbotanists\ndisclaimer\nsig\nsabine\ndefiance\naj\nlucrative\nsurveying\nrudd\noneself\nara\nbiz\ngreensboro\ncampos\nnguyen\ntopping\ncactus\nidentifier\nspaceflight\nunhelpful\nfirth\nmilky\nmule\nbeasts\nscrolls\nteller\nserum\nnevis\ntarzan\nkindness\ntempest\nswinging\nadministratively\namateurs\nblacksmith\nupstairs\ndeportation\nbland\nsincere\nempowerment\nmotivations\nmildred\nboating\n♦\nawake\nsubsp\nfaulty\nfran\ncloset\nsymphonies\nintuitive\nadmiration\npepsi\nalla\nmultiply\nreuse\nestrada\njunctions\nmanpower\nkei\nconverter\navoidance\nmarley\nnsa\nmassif\nfucking\nfederer\nbrawl\nredundancy\nerasmus\noffending\nfairchild\nuntrue\ndramatist\nthinkers\nresidues\nadvises\ntesla\nhousemates\nprotesting\ncirculating\nforerunner\ngalatasaray\nrodents\nsoluble\nzimbabwean\ncolloquially\nbrace\nnewborn\ntangent\nnominally\nruthless\nvalentin\ncorrosion\nneue\nregression\nequator\nict\nascending\nclassmate\nglam\ngroundwater\nmarylebone\nvulcan\nnih\nsibling\nclutter\nstacey\nblackout\nerskine\ndade\ntoulon\npoisonous\nderivation\nbeliever\neindhoven\naccompaniment\ntrajectory\napogee\nbatters\nfallout\nbeers\nauditioned\nbecky\nqu\nanarchy\narias\nhacking\ncalculator\nheraldry\ngama\nmarital\nscripted\nbender\nbritney\nleukemia\ncoyote\nhorizons\nadvising\nthru\nfitzpatrick\neun\nesoteric\nomnibus\nslice\ndrastic\nlaughter\nglands\nsamoan\nprofessorship\nsalts\ntchaikovsky\npoured\nausten\nspd\nleftist\ncha
rismatic\ndominating\nlesions\nsouthward\nsimeon\nhalfback\nabstraction\ncid\nmammoth\nslides\ngainesville\nhealy\nfallon\nvolta\nthorne\nassaults\nclemens\nmeme\nfabricated\ntemp\nmeath\nrumble\nmalacca\ntitanium\nosman\ngrover\nadele\noch\npinned\nelsa\nmoors\nsoho\naccelerate\ntun\nbissau\ncopyrights\nxm\nclaws\nbraga\nhomunculus\nconstantin\nmustard\nregal\nbestseller\nwessex\nbolshevik\nisla\nkarin\ngrenades\nblackhawks\nhun\nsteamboat\nleonid\nsecrecy\ngrading\nflock\nfukushima\nessen\nbreakout\nprostate\ntariff\nexponential\nyarmouth\nacknowledging\nupton\ndumped\ntrillion\nslant\nwilliamsburg\nanalytics\njudgments\ndescend\ninhibitor\nconservatism\nmerton\nshortages\nhierarchical\njanice\nreigns\nscuttled\nellie\ncontrasted\nimproperly\nwilfred\nsheltered\nlothian\nmethane\nabandoning\nthoughtful\nbending\nkeane\nmediate\nbattleground\npollen\nrevelations\nunambiguous\ndieter\nventral\nleah\nstresses\nyoon\nmaru\nchill\nsalazar\nexpos\nliszt\nabduction\nmorphological\nmurderers\nturnbull\ncater\njealousy\nui\nmandela\ndiva\nbethel\nmaximus\nfontaine\nmahmud\nspecifics\ncurtiss\nwilcox\nstakeholders\nexiles\nyellowstone\nradically\ncaptives\nmacbeth\nintricate\nhorseback\nthrash\nburgos\ncoe\nclarendon\namd\nlitter\nisraelis\nsfr\nsummits\nconstabulary\nresidue\norr\nhess\nkeating\nrecounts\npendleton\nwaits\nzelda\nlaird\nsikkim\nceasefire\ngonzaga\nstimulate\nseperate\ncrushing\nprimates\nemphasizing\ntyped\nequivalence\nspree\nmora\nincurred\npermissions\nunconventional\nbites\nsprinters\nmoi\nlegit\nvalidation\nrasmussen\nhumane\nstoring\ncosmetic\nshowtime\nbullock\neel\nhampered\nisotopes\nabandonment\nprovocative\nheraldic\nconcentrating\nmsnbc\nlazarus\nelastic\ngideon\nscully\nelves\nvulnerability\nrockwell\nrelisting\nlarkin\nhedgehog\nbryce\nadjustments\nauthentication\ncrook\nhearst\nanwar\nconsecration\nolympian\namassed\nina\ncamouflage\nweaknesses\ntemptation\nturk\ncisco\nfontana\nsql\nstuffed\nlobe\naden\nlays\nparted\npcs\nsquid\nmesopotamia
\ndogg\njamestown\nswamps\nseafood\nshortstop\nmilo\nrms\ncantons\ncurly\nryu\noverture\nkitts\nipad\ncrested\nwoodpecker\njimmie\nfabulous\nendeavour\nctv\nthroated\nnewell\npartridge\nevicted\njonah\nmcmillan\nindictment\nalton\nwellesley\noven\natm\ncaptions\ndormitory\nuniv\njolly\nclubhouse\nubuntu\ntemperament\nguarding\nencompass\ncharcoal\nindigo\nhavre\nmarxism\nhorizontally\nanonymously\npermitting\nleung\nlordship\nzionism\nlesotho\ntufts\nsalad\natheists\nnestor\nspared\nairplanes\nmagnolia\nwilloughby\npinch\nneedles\nduluth\nraoul\nbreaker\ntee\nexcelled\ndustin\nsocking\nmats\neskimos\nconfinement\ngal\nvaguely\nendured\nunnecessarily\nmessina\nhaired\nmons\nindus\njasmine\nmoons\nleyte\ncurt\nhaleakala\nepsilon\ngustave\nmanchu\npediatric\nopole\nconstituents\nyeomanry\nbrowsing\ninnsbruck\naachen\nolympique\neyre\nintervened\nsnowy\nbanknotes\nkarlsruhe\nhyperbolic\nexpenditures\norton\nmodulation\nsomerville\nreeve\nfrescoes\nfrederik\nisu\nkhalifa\ncops\nepsom\napplicant\nriddle\napocalyptic\nfoliage\ngünther\ntimetable\nyoruba\ngorman\ntherapist\npredictable\ngutenberg\naffluent\nhottest\ncasts\ntranscribed\nguineas\njameson\nmünster\npeyton\nfederalist\nkraft\nhaarlem\ninexpensive\npeugeot\nmontrose\nrestricting\nsustaining\nstandardization\nstrata\nrealises\nderogatory\nlocale\nashland\nmentoring\nabigail\nglow\nionic\nradios\nmassimo\njst\nchechen\nproofs\nalexei\npurcell\ndp\ncookies\ntelegram\nfips\nboarded\nbranched\nmic\ndeanery\nharassed\npathogens\nrene\ndemoted\nbreakers\ncriticize\nsportsman\nacknowledges\nlongstanding\nrestraint\nswear\noops\nnils\njournalistic\nprotestantism\nsous\neste\norbiting\nkk\noutrage\ncommandos\nflattened\nferrer\ncosting\nlace\nwilbur\nblackish\nveneto\nretreating\ndeciduous\narrogant\nelle\nmoderator\nnepali\nbrescia\nforgiveness\ncones\naspirations\nwilton\nmaitland\nkota\ntore\nellison\nmalabar\ncesar\nswaziland\ncollapsible\nwestfield\nfabrication\nseton\nrealities\ninjustice\nbarron\ncalabria\nmontagu\
ndenison\nsyndication\ncsa\ntp\ncoco\nseater\npraying\nirs\ncorrelated\naarhus\nboo\nkitchener\npei\nnightingale\ndentistry\nfujian\nfencer\nfresco\npairing\nlucha\nchopra\nvilliers\naccents\nspecialising\ninvestigates\nmodena\ncellar\ndubois\nhugely\nsligo\nhackney\npaints\nbellevue\nisaiah\nweaponry\nfoxes\nfortuna\ntanya\nmathieu\navail\nspices\nhangs\ngland\nadditive\nangie\nburnham\nnexus\nbrit\nheaders\nvivo\nls\neducating\nmodernism\nkr\nkimberly\noccult\nparodies\nharvested\nsykes\noverseen\nmenon\nimprovisation\nwainwright\noverride\npicturesque\nkathmandu\nredesign\nunión\nachilles\nbattled\ncabins\nthuringia\nannette\nahmedabad\nstir\nmerritt\ngershwin\nunbiased\nspoon\nthierry\nlasers\nmagyar\nbautista\noddly\nhelix\nworldcat\nsupplemental\ngrill\nbaseline\nono\ncautious\nperl\nbranching\nevacuate\nike\nreadership\nrockford\ntubular\npursuits\nimperialism\nhavoc\nmalley\ntolerated\nfaq\nexecuting\natheism\nhauled\nchicks\ngillian\nmanx\nwondered\nintern\nworthless\nsequencing\nallegro\nmerges\nshamrock\ninference\ncpc\nsneak\nmedallist\nantónio\narden\nexemplifies\nric\npeat\nbop\nasher\npharmaceuticals\nfestivities\nbarrington\nprincesses\nbargaining\nreuben\nbam\nringo\nrumored\nrendezvous\nreckless\nmargarita\nvenerable\nunpaid\ninferno\nberber\nomission\njudas\nseminole\nharassing\nsteamed\nfabio\nmercenary\ngrafton\ntempted\nnicholls\nwendell\nsable\nintermittent\nsvenska\nrag\ntheirs\ncastes\nazad\ngardener\nfederated\ncoloring\ndahl\nspikes\ngallo\nbilling\nhomogeneous\nenraged\ninca\nbulldog\ndeprecated\nwhereabouts\nresidual\nfountains\npleas\nhilbert\nhurst\nkatharine\nroyalties\nrumor\njerzy\nwhichever\ndissolve\nduane\nmoritz\nhemingway\nzum\napes\nspringsteen\nsouthland\njehovah\nappleton\nerika\npenelope\nregan\npuck\nmcgregor\nheartland\ndifferentiated\nfootprint\ndistinguishes\nscoreboard\neras\nblanchard\ncognition\nlowry\nmasjid\npanoramic\nkingsley\nbeforehand\nadjectives\nnumerals\namp\nyoo\nmicroscope\nballoons\naggressively\nkellogg
\nrespondents\ncolloquial\nserviced\nshowcased\numar\nsantander\nforehead\nconqueror\nhyphen\nemile\naugusto\nalameda\ncruises\nepstein\ndiscretionary\npetitioned\nfaye\ngnome\nkwan\ninstinct\nitu\nmechanized\npulpit\nflees\nhanding\nextracts\ntashkent\nviper\nquarterbacks\nvassal\ndeutschland\ncrowns\nlosers\nfoolish\nwheeled\nalmeida\nsubstrates\npenetrate\npau\nreused\ncorrespondents\ntownsville\nstallion\nws\nmichelin\nbrill\ndorchester\nhomme\ncolo\nplotting\nhorsepower\nvalladolid\nduplicated\nraiding\nhermit\ncues\ntung\nmergers\ncrashing\neredivisie\nadm\ndns\nchariot\npavement\nbraking\nbureaucratic\ngunnar\ngrinding\nregistrar\napa\nsogn\nimperative\nrankin\nnonlinear\norchards\ncharlemagne\ncombo\npentecostal\nrecaptured\nbeit\nnasser\nhimalayan\ntransmitting\narson\nsingled\nellsworth\nbreasted\ngreenfield\ntransatlantic\nhardin\nsakura\nordo\ntimbers\nmaison\npeptide\nrescues\nnawab\ntg\nconsultancy\nsway\ninvariably\ndescends\nprivateer\ntoilets\ntransient\npremio\nfrazier\nayr\nmaths\ntailor\nnietzsche\nsupper\npiotr\ngaul\nbotanic\nsmackdown\naw\nnocturnal\nmagdalene\ncontiguous\nreprised\nunilateral\nhilly\nclassed\nchampaign\nschloss\nfeasibility\nborrowing\ndarby\ntrough\ncured\nax\nalbanians\nsquire\nmeade\nsymmetrical\nheller\nlimburg\nsubcontinent\ndeportes\ntelescopes\nstarvation\nabdel\njeans\ngamer\nlid\nrajshahi\nterminating\ndunlop\ncarmichael\nembarrassing\nsnack\nrampage\njalan\nphotons\nnaia\ndisturbances\nazure\nmun\ncali\nvittorio\ntendentious\nbaronetcy\naiding\nsimulations\nfenton\ncontradict\ncj\nbaccalaureate\nhenson\ndeepest\nizmir\nassigns\nprohibiting\nnytimes\ndeng\nbarbarian\nmaori\nnur\npetitions\nenvironmentally\nish\nfamiliarity\nsulawesi\nmassively\nholm\nkgb\nindices\ndonnelly\nflair\npolynesia\nantagonists\nclover\nsoy\nscaling\nwingers\nvaliant\ncouch\nhines\nrubble\nwinslow\naramaic\npolynomials\nstrive\nconfess\nspreads\nsiegel\nandrey\naccomplishment\nlexicon\ndeclines\nbakersfield\nrant\ncolourful\nsopranos\nolympi
acos\nbump\nathenian\nparanoid\ngel\nslept\nrisky\nbathing\nnbl\nduly\nmixer\nluís\ninhibition\nnonfiction\ntracey\nconsolidate\nreasoned\nfission\nbarrio\nlear\nascended\neviction\nalteration\ncorinthians\ncoburg\nborrow\nbandar\nrightly\ntemperance\nfraudulent\ncatastrophic\nkanagawa\ngesellschaft\nsabrina\nscrolling\nfunky\nsato\nmusica\npronoun\nacidic\nmania\nmango\ncopeland\nrestarted\nconfederates\ntracing\nroberta\nsargent\nastra\nexcludes\nbanco\nadolph\nbiting\ndealings\nmustered\nteaming\npunta\nprank\naldershot\njelly\nfredrik\nturbulent\nreinforcement\narg\nclimates\noder\njana\nhires\ngodzilla\narbitrarily\nmaestro\ndavy\nlega\ntvb\neroded\nicao\nblanca\nambushed\nnicolás\nredistribution\ncapcom\ngomes\nbattista\nadvisers\neminem\nivoire\nscans\nmanifestation\nscreenshots\nmentality\norson\nsonoma\ndeletes\nblended\ncosmopolitan\ntriad\nfills\ncrises\nupgrading\npill\nyue\nmarge\njonny\ncrooked\ncbd\naqua\ndetecting\ncarbonate\nfjord\ngodwin\niqbal\nyuen\ngraders\nmotorola\nbetsy\nfolder\nbreadth\ncheney\ncores\nipod\nconscription\nbatter\nflush\nhiroshi\nvertigo\nincarcerated\nkeynote\narmory\njeopardy\npoppy\nshady\nerroneously\nhardest\nbeads\ncreditors\ncoward\nmimi\ngrimes\ntenerife\nnavigable\ndakar\nbanded\nslug\npromoters\nsquared\npots\ncompendium\nnazism\nmimic\npu\nendeavor\nanatomical\nmccann\nrunways\nexperimenting\nfaint\nandrzej\nnizhny\nscala\ninsulation\nhorned\naromatic\ndepleted\nregistering\ndouglass\nrodrigues\nhatfield\nmisery\nexplode\ntyres\nbreakaway\nchickens\nsporadic\nunilaterally\nintervening\nmika\ncuster\nperak\nvines\nlyman\nrss\nconditioned\nundesirable\ndoesnt\ncursed\nurbana\nconfigured\nhadith\ndisposed\naeronautics\nsmallpox\ncrusades\nlust\nlogically\nmaverick\ngratitude\ntentative\nbefriended\nnaga\nfallacy\nniels\nenigma\ninsanity\nplatt\nleighton\nworkings\nwhisky\nwoodbridge\nselector\nlycée\neocene\npartido\nucoz\nami\nparisian\npursuant\nlucie\ncommemorates\ncruising\nmort\nbaggage\nisotope\nconical\nkendrick
\nslick\nlaunchers\nbinghamton\nsigismund\naquitaine\nm²\nbodily\nhurts\nparliaments\nsaba\nweakening\npopes\nsadie\nbalfour\nhomecoming\nosborn\ngaius\npasture\njimi\nunionists\nbassett\nmira\n+,\nkia\nunsupported\nwaffen\nnadal\nalcoholism\npowerhouse\ngrasslands\noutrageous\ncontend\nmillwall\ntomás\nsegregated\nhepburn\naix\nrouter\ncentrally\nplastics\nfrancois\npremieres\nπ\ndistracted\nrelational\nstéphane\nfurnished\nairliner\ndockyard\nwebcast\nanarchists\nverbatim\ncollapses\ninquisition\npst\nwwi\nabkhazia\neuler\nshortcut\nletterman\naldo\nzimmerman\ntaped\nexhaustive\nshootings\njeep\nflorian\nramirez\nzachary\nfavors\nmauro\noptimistic\nsuccessively\ngadget\ncramer\njózef\nlacey\nligament\nspores\npresumption\nashford\npersist\nutopia\nsolids\nchoreographed\nagatha\nlad\nbouts\nkimball\ncg\nprojective\ndentist\nadi\nschwarz\nsulfate\naichi\ncorsica\nherefordshire\ncosmetics\nupland\ngoddesses\nskirt\njaipur\nesa\ncsi\ninorganic\nphosphorus\nfacilitating\naccelerator\nifk\nsmyth\ncompletes\nkettle\nfatally\nchar\nmccall\ntriangles\npills\nreflex\nnorbert\ndauphin\narundel\nhammersmith\nrectory\nelaborated\nasahi\nunicorn\ndiner\nperpetrators\njharkhand\nnilsson\nprimer\nbeatty\nmoray\nwrestlemania\nmurcia\ngrenadier\nreductions\njavanese\nsedimentary\nata\nours\ncrawley\nmammalian\nlinn\nscientifically\nsheng\nheresy\nlayered\nfrans\nfyi\nsejm\ntak\ninitiating\nlifespan\nprivatization\npumped\nbukit\naguilera\nirrational\nfinalized\nwidowed\nuta\nassassinate\nindexing\nklan\nanthropological\nrk\ndfb\nbrody\nseekers\nstalls\ncoincidentally\noldenburg\nvenezia\nchromosomes\nsteeplechase\ngaulle\nhuh\nenjoyable\ngigantic\nillumination\nhumber\n{\nproficiency\nmbc\nspit\nfianna\npads\nsolos\nvidal\napologise\nhornet\nkaunas\nexpects\npolydor\nporte\nluka\nmaximize\nchairmen\nlibertadores\nlegions\nlivingstone\ndunne\nnikita\ndod\njura\noriginality\npreschool\nenhances\nbjp\nlearners\ninfluencing\nmembranes\nforsyth\nlaborers\noskar\nshiny\nwidened\napt\nimm
ature\nvisualization\nfemme\ntwinned\nmcfarland\nfelony\nmarriott\nruben\ngl\nymca\nspouses\nyemeni\nmummy\nedict\ncries\nsima\nviolently\nflanagan\nflycatcher\nclimbers\npuget\nhoratio\nito\npointers\nuniformly\neugen\nhorst\ntds\ndeans\ninmate\nclashed\nhartlepool\nemitted\nlocating\ncharacterize\nfences\nréunion\nphilanthropic\nkowloon\nnirvana\nenchanted\ngough\nexemplary\nfalkirk\nbiplane\nconn\ntransliteration\nchavez\nmagdeburg\ndwyer\nfilippo\nundone\nenthusiast\ntabs\npronounce\nwrought\nultraviolet\nparades\nberliner\nremade\nferal\nfrancs\npolynesian\nemu\nlatina\nbydgoszcz\ncasinos\ntuna\nmushrooms\nespañol\nwhitaker\nabstain\nima\nzx\nrebellions\nsportswomen\nasimov\nresin\ngreer\njoseon\nvest\npagoda\nwal\nvendetta\nzamora\ngillingham\nhonduran\nwiesbaden\ncommits\nbayou\ncasimir\nnutritional\nkickboxing\ncochran\ntempered\nhumbug\nchongqing\ncholera\nsalsa\nfusiliers\nandalusia\nbraille\nsalman\njen\ncommence\nornate\nmalvern\nhallmark\nsitu\negan\narchduke\nwd\nbandit\nvans\ncircumstance\nafi\nbaptists\nfeldman\nmcnamara\naguilar\nbash\npamphlets\ncondor\nnecklace\nbellied\nperished\nitalic\nhormones\nliter\nhimachal\nnewscasts\nsewer\nspurious\nmulticultural\noutlying\nvanilla\ndoorway\nquartermaster\nconstraint\nfaisal\nsocrates\ndobson\nkun\nforefront\nrigged\nrx\nquake\nfulbright\nrobins\nnicosia\nselectively\ncalf\nloyalists\nenriched\nbillings\nlorestan\nantibiotics\nferrara\nselena\nheap\ncymru\nmaher\nobserves\nvertebrates\njoshi\nmicroscopic\nlei\nguerrillas\nsleepers\nvitro\nseaplane\ncoarse\nlister\npeshawar\nastor\nbookstore\nrelist\nharpercollins\ncigar\northogonal\nrampant\nhypotheses\nvitória\nsacrificed\naquinas\nfwiw\nhartman\nillnesses\nbless\nstandpoint\narterial\nrudder\ndiablo\npia\ncarly\nsilas\npatna\nmana\nindira\nmultilingual\ndeserving\noutputs\nblindness\nrevolving\nwanda\niar\ninternment\nembankment\nabba\nmojo\nconus\nasean\nenactment\nbellamy\nhou\ndumping\nrosenthal\nabsorbing\nvortex\nroughriders\nredwood\nanxious\ndal
ai\nreclaimed\nrizal\ncocoa\nkmh\nelectronica\nkazakh\nexpressive\nflaming\nyucatán\nexternally\ncarvings\nroadside\npains\nrebellious\nuniting\ncassandra\nkok\nkew\nbolívar\ndistracting\nammonia\nmahal\nrousseau\nvr\nvulture\ntrujillo\nwrit\npeking\nethnically\nequivalents\nfilly\nconvertible\naccountable\nnotts\nriff\nortega\nimpairment\nanthropomorphic\nsculpted\nalleviate\nmagnate\nspp\nsteer\nimpartial\nrojas\nent\nwadi\ntotaled\ntad\nmathew\njacobite\nestranged\nwebcomic\ncree\nsham\nchancellors\njoaquín\ncryptic\nspoiler\nnell\ncham\nmathias\nsaturated\nmartinique\npious\nresigning\nmoravia\ndiversified\nafp\nhospitalized\ndoomed\nnotebook\ncultivars\nadoptive\npendulum\ngonzalo\nhimalayas\nleuven\ninhibit\nhaig\nraft\nclements\ndisadvantages\nyamamoto\nboon\noccidental\nvogel\ndeserts\nwestport\nauthoritarian\nzodiac\ngoodness\nbain\nmarcia\ntransvaal\nmcconnell\nquan\nsixties\nmatteo\nbra\nbarbuda\ntopeka\ngrenoble\ngrabbed\ncesare\ncancers\nbraunschweig\ntraitor\nuphold\nconvicts\nhinted\nstarship\nproctor\ncomfortably\ncock\naccessory\nkrasnodar\nartefacts\ndebuting\nweightlifters\nivo\nadorned\nhaji\nextras\nexchequer\nfullerton\ncontinual\nrhetorical\nskepticism\ninternship\nsync\nphilanthropy\nmarshals\nspec\ninformatics\nkristen\ncuthbert\ngreeting\nhorticultural\nlame\nschuster\ndurable\nmemo\nddr\nclosures\nrg\noutnumbered\nwellness\nhermes\nresurgence\nponte\nyuki\norchestrated\ngamespot\nposse\nbeverley\nuploads\nslap\nwatanabe\nlesley\nfleets\ntrident\nsubunit\ntrousers\ncraftsman\northography\nwaldo\nwrestled\njiménez\nkunst\npackets\ntriathlon\nbcs\ncroft\nconsultative\nstratigraphic\nmolluscs\nfina\nplurality\nnaacp\nwarlord\nsaline\ncollisions\noutset\nmajestic\naloud\nspelt\nrhein\nvichy\nipc\nbastard\nlaunceston\nnfcc\nbryn\nfascinated\ncracked\nentertain\nbinomial\nsimulate\nafrikaans\ngustaf\ndebra\nbrandy\nknocks\nmessy\nratification\nbiota\ntweed\nchunk\niberia\nshipment\ncipher\npakhtunkhwa\nhectare\ntimelines\nmendes\nfellowships\nfro
ntiers\njj\nquartets\ntatar\ndissident\nnowy\nprefectures\nnegligible\nvalor\numm\ndelighted\nnyu\nalphonse\nhackett\nselby\nsapporo\nfenerbahçe\nnagano\nmobilization\ndjibouti\nvenerated\ndmitri\nprotections\naek\nrevered\nlana\nmacpherson\nelizabethan\nvhf\nfrivolous\ncos\nimmortality\nangolan\narrays\ntyrant\nfolds\nindoors\njia\naura\nzanzibar\nsine\nquadrangle\nimmersion\nspecialization\nsimplify\nthicker\nsava\nreunite\npinyin\npapacy\nescorts\ncfa\nnonsensical\nadherence\nparaphrase\nfactually\nspecifying\nscarecrow\nosprey\nincheon\npapyrus\nmaui\nquadratic\nmotorized\ngracie\nmercantile\nworshipped\namore\nkanye\nprojectile\nliaoning\nserialized\nrefit\nnightly\nbeautifully\nstagecoach\nféin\nantennas\nwynn\nadapting\nvillas\nwillamette\nviolins\ncausal\nadjustable\nmichelangelo\nextraterrestrial\nraju\nlodges\nwreckage\nconservancy\nlabourers\nassimilation\nzappa\nrenato\nthurston\naccessing\nentre\nlubbock\nunresolved\nlorenz\nstatistic\nintra\nrefurbishment\netiquette\ncreeks\nswiftly\nmetallica\nbertram\nplainly\nvisionary\npeanut\ntherein\npersians\nuncontroversial\nheroism\ntrey\nbudgets\nmiddletown\nhays\nxviii\ndegraded\nroutines\nchet\nnoisy\ndisappearing\npolygon\nova\nhampden\ndorado\nspitfire\nposture\neater\nluisa\nfo\nsamba\nbantamweight\ntransmitters\nozone\nmuster\nscreenings\ncolegio\ncara\ndefected\ndempsey\nwaikato\nbertha\nclicked\ndeficient\ngetty\nbrighter\nexcessively\ncelia\nschism\nlaredo\nmaternity\npenultimate\nassad\nrelax\naxiom\nevansville\nbloch\npickering\ndistillery\nupi\nairmen\naiv\ntres\nmueller\ntba\ntaxis\nlondonderry\ntrumpeter\ngutiérrez\nuri\nportico\nthrush\ndumfries\nhendrik\nhurdle\nallotted\nenrichment\nheterosexual\ntijuana\nbiographers\nfidel\nrenders\nmedicare\nhadley\nplaintiffs\nsplendid\nmcintosh\nmifflin\ngateshead\npsp\nhirsch\nballarat\npinnacle\ncatalytic\nunfounded\nmaneuvers\nbladder\napc\nservicemen\nprematurely\nsingleton\ndotted\nmandates\nascii\nmarsden\nferns\ndevonian\ncong\npiloted\nrepublika\n
livelihood\nselects\nbharat\nregulators\nobligatory\nsalim\nvibe\nhex\nsmoothly\nresponsive\nlux\nmortars\ninterned\nrespectfully\nfearless\ncabrera\nkanji\nopined\npandora\ncst\nsiu\nperiphery\nimperfect\nlush\nhooked\nmustangs\nharpsichord\nsawmill\ndanes\nchorale\nkochi\nnugent\nlobes\nanson\nscranton\nhurlers\nalot\nsalinas\nfrs\nlibby\ngallons\nparti\niteration\nquechua\nrainer\nconstants\npostmodern\nwerewolf\nclimatic\nfayetteville\ndissemination\ndiffuse\nelevators\neta\nquadrant\nburbank\nflavour\ntatiana\nantalya\nflyweight\nlourdes\ngibb\nseung\ngestapo\ntartu\nkf\nacton\nhuey\nbumper\ndusk\ncleric\ncommonplace\ntents\nanzac\nwindy\npaine\nunc\nillustrious\npowys\nbrink\nwildly\nwounding\nvandalizing\nrelaxation\nchai\ninversion\nglide\nreclamation\ncustomized\npsalms\npuppy\ngall\nprodded\nvicky\ndumbarton\nlehman\ntownland\npageants\ncrabs\nakademi\nburen\nplanner\njuveniles\ngarlic\ngar\nviolinists\nmehmet\nturku\noldies\nyak\nrediscovered\nafforded\nrobbed\nfitz\ntaboo\nleverage\nbm\nbananas\ngénéral\nbiosphere\npersistence\ntruths\nrecounted\ncereal\nenslaved\ncounterfeit\nliquids\nplovdiv\nantelope\npied\nmartyn\nroar\nrohan\nflex\nwidening\nweymouth\nwidows\nalerted\ntierra\nbao\nmildly\nreciprocal\nrotational\nkyung\nmariner\nchamps\npeacekeeping\nairship\ntombstone\nbrutality\nreims\nembassies\ntransitioned\nessayists\nstipulated\nelektra\nprequel\nhoneymoon\nsanitary\ngrind\nshalom\ncanto\ngifford\nballpark\nsébastien\nmanu\nalas\nnorsk\nphilanthropists\nclandestine\nnotifications\nflinders\nmidwestern\nported\nguadalcanal\ndistraction\navec\nwaking\nvaccines\nguadeloupe\nroderick\ntuvalu\nsaxons\nsilvio\nintersect\nlacy\nduplicates\nfractured\nregatta\ncracking\neared\nwhistler\ntamara\nhana\ncommuted\nascribed\nstampeders\nvarna\nelsie\ncorridors\nelemental\nwinfield\nheaviest\nquorum\nherzog\nmohd\ninaccessible\npersecuted\nhandheld\nutilization\nburgh\nbret\nalloys\nlowlands\ndelaney\nhers\ntheorems\ncombatants\ninterpersonal\nadjusting\ntra
nsistor\npico\nchaplains\nsealing\nsizable\nseizures\nspaceship\njerk\ncrows\nmoo\npianos\ntaj\nbarren\ndisable\nbegs\ndetectors\nbritten\nattaining\nprelate\neau\nzenit\npassions\nhydra\nkidnap\nszczecin\nrevise\nstampede\nhogg\nchalmers\nboulogne\nbochum\nreincarnation\nminnie\ngottlieb\ncommunicated\nstills\nmillennia\nbipolar\nanthropologists\nsentimental\nhf\ngrays\nbillions\ngrizzlies\nescalated\npedigree\nexpressly\nbanished\ntransformer\nhepatitis\nfd\ndutton\npeacefully\nmulder\nreborn\nlandau\nstimulated\nflavors\nembodied\nqualitative\nbacklash\ncolby\ngliding\nsouthwark\nkristin\nadvertiser\ndorsey\nzebra\nditto\ncoeducational\nbreasts\nstigma\nnudity\nleland\nberman\nstatistically\nolympus\nfares\nhumanoid\nfinely\nsuspense\ntownshend\nvalencian\ndemi\nbaking\nverizon\nmfa\ninfinitely\nsurfer\ncochin\nbootleg\nimpoverished\npredatory\nschoolhouse\nfokker\nbeak\nhurry\nprodigy\nauditory\nspecialize\ndorian\nbenchmark\nrang\npalette\nfortunate\ngables\nlaughs\nresearches\ncafeteria\nkts\ncornerback\nreclassified\nfluorescent\naccumulate\nswat\nplanners\nchildish\neels\ncontenders\nconstrained\ncarp\nperm\ncaller\ndictated\naugustin\ninterstellar\nencompassed\ncoyotes\npiercing\nblockbuster\ncatarina\ncrowe\nlark\nsochi\naccountants\ntj\nabolitionist\nolimpia\nexpansions\nretracted\nubiquitous\nmayfield\nenroll\nwycombe\ncatalunya\nlinz\nnoticing\nnellie\nleith\nfuchs\ndani\nheer\npolitely\npolymers\nhershey\nfernandes\nanticipate\nantonia\nstu\nrj\nhelmets\nextremist\nrobe\nhaut\nrescuing\nacm\nlag\ntripod\nwavelengths\nmerle\nnutrient\noverdose\ngiulio\ncahill\nbackyard\nheadwaters\ninterlude\nschulz\nkalamazoo\nsmartphone\nsnoop\npandit\nchew\nrehearsals\nmri\nintersecting\nlucille\ndelisted\nchristgau\nseparatist\nwilfrid\nintellect\nlarsson\nnecked\ncaterpillar\nvarma\nfacilitates\nbargain\njock\nreunification\nsarcastic\nfriar\nloeb\nsuspend\nkala\nconf\nhaskell\nantibody\njitsu\nasthma\nfloats\nmicroscopy\ncultivar\nleasing\nqld\nskye\ngregor\noise\
nhorde\nasha\nswarm\nandover\nfrey\nsusanna\ndizzy\nspirited\nrada\nuno\nspielberg\ngallipoli\ncandles\nhôtel\nexpire\npoorer\nchiapas\ncinnamon\nempowered\nmyriad\nanytime\nimpedance\nembryo\ncans\nhh\nrko\nsalvaged\nshang\nalerts\nbiking\nlabyrinth\nparochial\ncategorised\ncurate\nrefining\nmoderne\nsrpska\nimplicitly\nmetaphysics\nsuck\nfunnel\ndiscredit\nremington\nardabil\nnasdaq\nbrahmin\ntbs\nyachts\nparity\narjun\nante\ndavison\naxial\nbarring\ndivisive\nnasir\nunexplained\nmaratha\nssp\ncheat\nmarginally\nsherry\nboniface\nwheelbase\nsparsely\nexploding\nsilvia\ncoolidge\nbrahma\ngoalscorer\nyada\nfats\nepilepsy\naidan\nsly\nnath\nmukherjee\nholiness\nnectar\ncleansing\nflaw\nagra\nbowers\nmidtown\nspurred\npvt\npalmerston\nartemis\nraider\nnoël\ndarcy\naustralasian\nsausage\nmezzo\nkbs\nthermodynamics\nkilmarnock\nbrochure\ncardboard\nguzmán\nproficient\nbarangays\namt\ntallahassee\ndefenceman\nadolescents\nelliptical\nbreached\ndepartmental\nelectrode\nhampstead\nerased\nsistan\ntaxpayer\nminerva\nbonded\npusher\npastures\nstrands\nwhiting\ngreed\nclearwater\ncondemnation\nblackberry\nmh\ncolón\njure\npullman\ntariffs\nsitcoms\ngotha\ncartwright\ncorpses\ndeh\nkurds\nresentment\nbehavioural\nelects\nwarehouses\nargentinian\nsmashing\napparel\nmeteorology\nallahabad\nbute\namplifiers\nshipments\nmontague\nhuff\nbalochistan\nballets\nextermination\nstamped\npreach\nsignage\nunsafe\nroscoe\navignon\nreferendums\nemmett\ntxt\nlászló\nshaanxi\nbearings\narne\nmachado\nsana\ntildes\ndavao\nwary\nbonnet\nfantasia\npermian\nyoungstown\nrsa\nnhk\nmagdalen\nchecker\nhaines\npinball\nunicef\npatsy\nfrequented\nreclaim\nornaments\nmcqueen\ninsurrection\nazul\nelliptic\nsupernova\nfujiwara\ncrank\nfabrics\nrulings\nkt\nbenoit\nousted\ndeterioration\neclipses\nherbal\nliterate\nvowed\nexponent\nverve\ngeologists\nmaximal\nmotherwell\nlinen\nengel\nspectra\nenglishman\nguelph\npygmy\ncrucifixion\nextracurricular\ninterfering\nadultery\nuzbek\nesteban\nbacterium\nhandica
pped\nfiery\ncloak\nhéctor\naccra\nbiases\nrubens\nhickory\ngestures\nobstruction\nsheen\nclarifying\nivor\nerickson\nbéla\nsewing\nterrier\ndew\nay\nvito\nlough\nnv\npathogen\npearls\ntaxpayers\ncasper\ncorrupted\ngrabs\nimmensely\nmehta\njeong\ncheckpoint\nforte\nechl\nregulates\nartisans\nchola\ntahiti\ncassie\nsulphur\ngk\nsenegalese\ncpi\nmodi\nreversing\nwalkers\nsaône\neucalyptus\nthrace\ncoppa\nheadlined\nastoria\npaced\nashok\ngrumman\nledger\nfigurative\npropellers\nfundraiser\nnarcotics\nems\ncompetitiveness\nsects\nlambeth\nlooney\ncrustaceans\nbatavia\nmetrics\nconstructs\nbeale\nfragmentation\nvulgar\nhose\ncedric\ndirectories\nnagpur\ndawkins\nvox\nduos\nwürzburg\npea\necw\nnit\nhanley\nnano\nbundled\nmooney\nchengdu\nsimplex\ncatastrophe\nunpredictable\ninterspersed\npiero\nmarietta\ngladiators\nbrahms\nmuriel\npsychotherapy\nmortally\nsociedad\nacademically\nsmiley\nvilleneuve\nbiomass\ncarte\nmeyers\nreplicate\nbenevolent\ngaye\nwalther\ncad\nweakly\nfbs\nweaken\ncobalt\nbursts\ncranes\nrad\ndissipated\npigment\nkönigsberg\ntailored\nsuperiors\nlettres\ndea\ncourtroom\nravenna\nobverse\nwhereupon\nfalmouth\nresultant\npiped\nsociologists\nmanipur\nnagorno\npromenade\nknob\nrallied\ndiscoverer\nmastery\nboutique\nreykjavík\nrocco\ngeneralization\nclemente\nvíctor\nparenting\nmallory\nplight\nstrategically\nbien\nnai\nbosco\nhenning\ndanville\nloft\nshawnee\nskinned\nmicropolitan\nbandy\nillicit\ninstrumentalist\nlinkage\ndough\nsigh\nmicrobiology\ndemonstrators\npipelines\ninoue\nspectacle\nhype\nwv\nrelic\ntedious\nterraces\nmilitias\ncallaghan\ncomcast\nmins\ncessna\npriscilla\nabyss\nxxi\nurges\naris\ncoates\ngallon\nradha\nanimations\njody\nayala\nkampala\ncommencing\ncrystalline\npurification\nundertaker\nantoinette\nsoloists\ncoimbra\ndalmatia\ndorothea\nkl\nmonsieur\neruptions\nfavourites\nreopen\nmariah\ncraftsmen\ndomes\nhubei\nrecital\ndevout\nwoolwich\npressured\npros\ndispersion\nbmx\nhuts\nplanar\nfoe\nshanxi\nfascination\nsala\nscanner
\ncomprehension\ncoroner\ncovington\nfledged\nadapter\nprentice\nripe\nberries\npali\nflashes\nsanjay\nflipped\ncyborg\nrennes\nextremes\nmilestones\nfcs\nbreweries\nulm\nbanksia\nsummed\ndurga\ncookbook\nduarte\nganga\ndissatisfied\nhiram\ntintin\ndaley\nminiatures\nlawful\nproxies\nmala\nskulls\nconcertos\ndispersal\nsyn\nlanarkshire\nbrevet\nlovecraft\nanjou\nyangon\nadministrations\nindependiente\nalouettes\nrichland\nfreshmen\nspawn\nnostalgia\nnach\noverrun\nstochastic\nwheaton\nligand\nquarries\naddis\nyoko\nroxy\nunsolved\nslippery\ndowney\nfolio\nobedience\nardent\noakley\ndidier\nnapa\nimpetus\ngeek\nsummon\nhorticulture\nsidebar\nkampong\nlegality\npeg\nfalun\npolarization\nnaughty\nstab\ncolonialism\npanathinaikos\nvf\nsyrup\npatrolled\nheavens\nregnum\nmadurai\nflushing\ngael\nthrive\npejorative\nsigmund\nsalaam\naberdeenshire\nkenyon\nafterlife\nchemotherapy\nluggage\nadopts\ncatania\ntroublesome\nvariability\ntrafalgar\npulaski\nantiquarian\nverne\nundergoes\nstreamlined\nlavender\nvocalists\nfáil\nmountaineers\ninspirational\nbolts\nhorsemen\nislington\ncartoonists\nseniority\nthyroid\nplank\ncamogie\ntimer\nduets\nleaflets\ncristo\nspeculate\nstadt\nlodged\nservicing\ndiamondbacks\nexemplified\ngarments\nracetrack\npeanuts\neurasia\nunarmed\nsalamander\nnino\nhaunting\npuri\nríos\nkeefe\nreluctance\nenjoyment\nnewbury\ngunmen\ndraining\npraises\nomit\nlite\nbolsheviks\nslightest\nglue\nlavish\nsacrament\nisolate\nprovoke\namiens\nblur\naggregator\njg\nworrying\nrhinos\ncoasters\nstroud\nkremlin\nteaser\nlobbied\nrl\ngrad\ncounters\nexplodes\nwisden\nmysticism\nmillar\nsnowfall\nosama\nmodem\nsadler\nphilology\nfrancophone\nbello\nblending\nbombardier\nplutonium\npax\ncakes\nses\nbillionaire\nargus\npostings\nreceipt\nnelly\nnativity\nwhitehall\ncalligraphy\nandres\nselma\nei\nmalware\nsignifies\nisraelites\nwarhol\nrhapsody\nlasalle\nhussars\ntravers\ncommandments\nduct\nalexandru\nnarod\nclumsy\nmarrow\ntercera\nmeng\nrijeka\nsubtitles\nstr\nsaeed\
nstemming\ncossacks\nniigata\nconverge\nflanks\nbethesda\nalamo\nyamato\nreorganisation\nreacts\nalphabetically\ncoloration\nalistair\nschema\ngeese\nyosemite\nsweetheart\ninternazionale\ndaniela\nyeovil\nspongebob\nintimidation\ncontradicts\ntomé\nateneo\ndonnie\nreina\nnonstop\noprah\naffirmative\nscuba\ndre\ntweaked\nschumann\nobservances\nrufous\nluo\nlocalization\nbadgers\nperugia\nvow\nmor\nplotted\ntorneo\ndeva\nhaworth\neyewitness\nrealist\naccompanies\nwinthrop\nting\nconservatoire\nstunts\ncoles\nshelled\naquila\nmississauga\nbethany\nerp\nmisguided\nrecap\nnitrate\nmennonite\nmidsummer\nmahatma\nanomaly\ncai\nvibrations\ngaronne\nbenign\nthrill\nprado\nundisputed\nitaliano\npreventive\nmentors\nsectarian\ntrainee\ncaen\nmalibu\nbradbury\ndundas\ncheetah\nterminates\nschiller\nsy\nnac\nrivière\noperatives\nirrespective\nsubsistence\naerodynamic\npuff\nredmond\nodin\nfrançoise\nwardrobe\nsos\nworthington\neccentricity\nvicki\nlamont\npci\nkristina\nwm\ntw\nhubble\ntees\naet\nlyn\nalmond\nsymptom\nchildbirth\ncovent\nflashbacks\npostcards\ninquiries\ndragoons\ntrudeau\nmorally\nxiang\nappointing\ngrudge\nchabad\ncaptaincy\nalves\nblvd\ndocking\nrib\nbuxton\nshenandoah\nstricter\ngop\ntwenties\ngg\nrash\nyogi\ndivergence\npermissible\nconsolation\ncreep\ndébut\nboa\nalegre\nrodolfo\nrahul\naddicted\nsorcerer\nhabib\ndefends\nmeuse\nfloppy\nhoaxes\nowes\ncloudy\nhmmm\nshowcases\nfranc\noptimus\nhao\nfatima\ncollage\nbrampton\ntelecast\ncoli\nfermentation\nbiathlon\ndeformation\ntortoise\ndessert\nmoat\nreintroduced\ndissenting\nmetaphysical\nbasalt\nideologies\nlauded\nadventurer\nprobes\noptimized\nfavorably\nconceal\nsupervisory\ncarnivorous\njerseys\nforecasts\nfolly\nkaraoke\nsuperleague\ncervical\nimpeachment\nsmiling\ncato\nbribery\nwbc\nsaleh\ndevoid\nacquires\ntobin\nludlow\nmeier\nsmoky\nraptors\nrefloated\nchieftain\nfútbol\ndeparts\npee\nonstage\npredicting\nformative\ngunnery\nghulam\nporta\nartur\nmorals\nskirmish\nrosary\nmodal\nsectional\nlandfi
ll\nnuggets\nsanderson\nlinesman\nincompetent\nbw\nrebbe\nrbis\nchandigarh\ntrapping\nmodernized\naustrians\ncercle\nstrangely\nglamour\nrobbers\nolav\nawait\nchadwick\nfrightened\nboar\nenoch\nringing\nbreda\ninsulted\nstrife\nsorority\nflea\nslavs\nimpractical\nboomerang\nsheds\ncults\ncalvinist\ncorinth\ninvoke\ntranscripts\nvom\nperfume\nbony\nlakeland\npla\nbikini\npotts\npetite\nracine\npatriarchate\nshahid\nabolish\ngenital\ncircumcision\nogg\nrecommending\npelican\nplaques\nmemorabilia\nunearthed\ntractors\nglowing\nrelaunched\ndominates\nreformers\nshutdown\nslogans\ndashes\nmarlon\njian\nripper\ntern\nappropriation\narranges\nbullshit\nsqueeze\ninventing\nspade\ncooperatives\nlawton\nhighs\ndg\nfetus\nelectronically\ndisadvantaged\nwaco\nhesitate\nblaming\nyouthful\nharass\nartificially\ncalvary\nexercising\npillai\ncholesterol\nstabilization\nppp\nmartini\necm\nthiruvananthapuram\nnuestra\nnope\nimf\ngunboat\nbasing\nukrainians\npostcard\nmonographs\napertura\nregensburg\nrebelled\ndysfunction\nparked\nvertebrate\nconor\nmelted\nkodak\nmontserrat\ntriggers\nhoon\nblessings\nfarrar\nmojave\nsaab\nbenches\nrepairing\nengravings\nillusions\nbowlers\nasap\nstringent\nirregularities\nhuntingdon\nandean\nairdate\ncartagena\nuttarakhand\nassigning\nshaker\ndigestive\nafonso\nhike\nmosley\nkruger\nkreis\nstunned\nacorn\nbluetooth\niglesias\nine\ngreyish\nhumphreys\nvfb\nhopeless\nantigen\nlockhart\nsturgeon\njoo\nvigo\nteutonic\nlarva\ncrockett\nmonmouthshire\nstormed\nobs\ninsurgent\nbuckeyes\nganesh\nloco\nwalla\nbingo\nembarrassment\nsited\npeas\ndisqualification\nrested\nbst\nattire\nparalysis\ncalderón\nsao\nsenna\nmigrating\nromanians\nmansions\ninvent\nural\nfinancier\nvasily\nmayhem\ntammy\njacobson\nsom\nendeavors\nrensselaer\ntroopers\ndementia\nstaunch\nchop\nabrupt\nsplinter\nconnors\nsponsoring\nbagh\nthrissur\nendorsing\nkirsten\nindexes\nspawning\nstepfather\nnewbies\nvertebrae\nsuny\nsalient\nskeletons\nallentown\nsunflower\nchaim\ngallen\nneutron
s\ncommas\nhog\ndomestically\ngilmour\nhideout\nadamson\nfurness\nwhitley\nshui\nmassage\ngd\nmelrose\nsuresh\nreliefs\ndeficiencies\ncapo\ntübingen\nracially\nlakeside\nwipe\nmelon\nnakhon\ncarney\nrourke\nrostock\nsardar\nether\nfetal\nsash\ndaimler\nsarasota\nconnectors\nruhr\nentertained\nmartyrdom\nprometheus\nslough\nbridgewater\nearnhardt\nconfederations\nforgetting\nsubsidy\nceilings\nfreemasonry\nstacked\nericsson\nheadache\njb\nfarley\nmassa\nforaging\nnoor\nfrontline\nbeograd\ndegli\ngeothermal\nluiz\nzoologist\npenitentiary\nreforming\nsummons\ncontended\ngregorio\nflare\ncomb\nluxurious\nchlorine\nfooted\nbashir\nescorting\nformosa\narthritis\nfragmented\ntimberlake\nsurpassing\nnegligence\nbarbarossa\n¢\nyadav\nglossy\nmarcello\ndrumming\nlizzie\nonions\ntherapies\nshuffle\ncryptography\nabi\ncartier\nupbringing\nunlock\nescalating\nspecificity\nauvergne\nfalco\nalaskan\nrh\nmontage\nnod\nbjörn\nquercus\nkn\nselfish\nwhitesmoke\nrestrained\ntorso\navro\ndias\npiccadilly\nbertie\ntemplar\nboyce\nethnographic\nchildless\nantisemitic\ncolleen\ncorresponded\ncallahan\nexiting\nripped\numa\nprom\nseward\nfederations\nmileage\nshaking\nconcede\nmubarak\ncora\nanglesey\ncountered\nmcdermott\narsenic\nashanti\nprefectural\nreinhard\nsari\ngeraldine\ncoleridge\nleyton\nfluctuations\nantoni\npleasing\nscooby\nasheville\nhousewives\narun\nmahler\nhorrors\nlexical\nshowcasing\nkinship\norbiter\npancreatic\nmadre\npercival\navenge\nthrower\nlm\ngoblin\nleases\nbrodie\nfilipinos\nvillanova\ndons\nreopening\ncabot\ninspectors\norwell\nboyz\nembracing\nlaundering\narisen\noverlook\ntome\nbancroft\nopel\nrubio\ntagalog\nuncover\naustralasia\ncrawl\nbarbera\nyung\nsalvadoran\noberlin\nolya\nhathaway\nfractions\nscalar\nwhalers\nfables\nadair\nwestmoreland\nheightened\norissa\nfridays\nhibernian\nelisa\nduisburg\nsup\nelectrically\nsurreal\nlatex\ngsm\nsubstitutes\nsonatas\ncreighton\netienne\nstyria\nnarrows\ntrumpets\ndefective\nhuxley\nbroom\nmehmed\nmanifested\ndyson
\nerection\ndun\nbhopal\ngown\nbrutally\ncameroonian\nfalklands\nstubbs\nundead\ncounterattack\naruba\nchevron\nculver\nrecapture\nclasp\nriviera\nmanchuria\nrenumbered\ngeographer\nexpectancy\ncontractual\npetar\nachieves\nwiring\nbastion\npear\nmanipulating\nrejoin\nteal\npsychoanalysis\naur\nunofficially\nfledgling\nkibbutz\nexpansive\nexceedingly\naldrich\nimporting\nwildcat\nlockwood\nkingfisher\nangrily\ncurran\npulses\nnikon\nmonoplane\nearns\nkato\nadolphe\nmaharaj\ninferred\ndurant\ntuba\nlutz\nhesitant\ntre\nsupersonic\nshek\nannum\nutica\nbarking\nswanson\nmcneil\nreactivated\ntoast\norb\nrosters\ndumas\nsavanna\nrounder\nmckee\nscunthorpe\nbenefactor\nexpiration\nblackwood\nbegging\nfurlongs\nfreighter\nkhalil\nthrowers\nswedes\nrockingham\nprehistory\nlicences\ntoc\nsloane\nshutter\nminesweeper\nanc\ninterwar\novercoming\nclassis\nnicaraguan\ncba\nsightings\nvoltaire\nfewest\nbatista\nkilograms\npopulist\npercentages\nnicolae\nscot\nbefriends\nyoshida\nrhin\nmarconi\npigeons\nbarbarians\ngower\nroux\nforrester\npicard\nreversions\nsutra\nobservance\nrooster\nstereotype\nmangrove\nshoemaker\ndecrees\npoplar\nunimportant\ntakashi\nola\npuritan\nlinebackers\nembarrassed\nreptile\nons\noecd\nconcepción\nsignatories\npromulgated\ncádiz\npelham\ncavalier\naggregation\ninterchangeable\ncellist\nver\nautopsy\nhutt\nreis\nlancers\nmorrissey\nlettering\ndevotional\nenclave\ndunfermline\nspoof\ncategorizing\nreversible\nsadness\nobjectively\ninterrupt\ntriples\nbandleader\nlobos\ndisarmament\nparque\nwealthiest\nhewlett\nsankt\nunrestricted\nsteppe\nlui\nkuhn\nmckinney\nredeveloped\nzoos\npetr\nmarche\nshutout\ncutoff\nflap\nleyland\nmagma\nscorpions\nmollusks\nplume\nesperanza\ntelford\nbellator\nsleeves\nhajduk\nsampler\nrocker\nguan\nsleepy\nscotty\nblond\nfortresses\nwaldorf\noutraged\ncollapsing\nsorbonne\npadilla\nspearheaded\nwyndham\ndomesticated\nmaribor\ngaribaldi\nharbours\nashby\ncio\nvisuals\nbantu\ngóra\nkielce\ngaines\nfuturistic\nerr\nmeteorite\npa
namanian\naxioms\nnoticeably\nmacclesfield\ndenominational\ntaurus\nquarterfinal\ndix\nconstitutions\nmagicians\ndunkirk\nmultiplied\nriyadh\nowe\nwalpole\nrei\nbeheaded\neno\nvalence\nvalidated\njima\nrecalling\ngunners\ncine\nmelinda\nelegans\nrwandan\ndurango\nbhutto\nlatham\ndeane\nvolt\nreconciled\nfiltered\nbede\nsponge\nvizier\ngoran\nwrongdoing\ndeutscher\nexpires\nbaptised\nsprague\nnetflix\nsmells\nwheeling\nquadruple\nbong\nsaginaw\nliza\nmonaghan\npyramids\nhendricks\npembrokeshire\nblames\nrobb\ncarousel\ngrossman\nintroductions\nvanishing\nfraternal\nxin\nmotivational\nrui\npreachers\nmicronesia\nherr\njuliette\nlander\nardennes\naladdin\nboolean\nflamenco\ndeadliest\nneedless\nramakrishna\ncuraçao\nhar\nsuk\nshaded\nrenee\ndynastic\nfranck\nclique\nadolfo\nraging\nastrophysics\nmoffat\ndolan\nphonology\ndelicious\nusenet\niconography\ninductee\nendure\nnanny\nmasts\ntabernacle\nballast\nmidlothian\nfireplace\nclinch\nwestinghouse\nkhanate\ntransporter\nstain\ninflow\nfading\npaddington\npiccolo\nshinto\ngovernorship\nsindhi\nenquiry\ngreenish\ncomptroller\nlopes\ncons\nrds\nairspace\narnhem\ntay\npryor\ncourageous\nsalerno\nmeasurable\nstump\ngilded\narchangel\nadana\nalkaline\nasunción\nzoning\ngras\nvigorously\nmilling\ndwayne\ntoto\nhilltop\npsv\nacetate\nerase\nnumeric\nascot\nhilarious\npragmatic\npatriotism\nunverified\ndevonshire\nlayton\ndalian\nleverkusen\nkickstarter\nfreeing\nemirate\nconclusive\nwaugh\nmathematically\narteries\nmobilized\npedestal\nspoiled\ncarboniferous\narb\nkirkland\npropellant\ncossack\nrecurrent\nbulbs\ncdt\npolled\nmodernity\ngastrointestinal\ntracker\nstimulating\nmau\nhinton\nspecialties\nwhitby\nqian\nbedrock\ninflated\nencountering\ngrille\nverifying\nstalk\nalligator\nweld\nterminator\nsouvenir\nwitty\nphotoshop\nhebron\nyeung\nipv\ncategorisation\nfung\nincarnations\nmpeg\nliao\nkettering\nwabash\npows\nfairies\nmediocre\ntankers\nprecedents\nfillmore\nsapphire\nangers\ngianni\nhayley\nmarkedly\ndowned\nrowley\
nultrasound\naleksandar\nmarlene\ntarn\nhasty\ngalilee\nrelocating\nnorthwards\nanno\nreindeer\nchamplain\ncarrington\nfaroese\nwhitfield\nzimmer\nmyles\ncabbage\nstavanger\nkollam\npomona\nwastewater\nemergencies\nosbourne\nbrood\nétudes\nhexagonal\noran\nbarns\nmitt\nprefixes\nsmoked\ndissatisfaction\npty\nsaviour\ndismay\ngrowers\ncarnatic\ncrippled\nmanifestations\nballiol\nsmolensk\nlax\nexhaustion\nmargrave\nmarques\nuncut\npitted\ndeccan\nbreeder\nterri\noptimum\nhebrides\nasbestos\nsamara\nmisc\nhomosexuals\npillow\nneutrally\nmoro\ngino\naccademia\nhowie\nstitch\nnorthrop\nunfit\ndopamine\nastana\nobscene\ngamers\nrelays\ndubbing\nlyme\ngabriela\nloren\ninvincible\nexcursion\nenact\nfinlay\nghosh\nrevisit\nbyzantines\noverboard\nbequeathed\nheadlining\ntasman\nirb\nwaivers\nbreaches\namazed\ncosby\nuptown\nchallengers\nusn\nfrenchman\nultimatum\nunleashed\ndsm\nmedallion\nchromatic\ncultura\nheartbeat\noda\nnarayana\noutbreaks\nlipid\ntoxin\nsublime\ncurricular\njuniper\nlübeck\nstumbled\nundeveloped\nkickers\n♫\nsideways\nmongo\noutfits\nferenc\nscoreless\nunequal\nost\nsalted\nsomaliland\nenzo\nbea\nupstate\ninterviewing\ncontesting\ncombs\nbulgarians\nodor\ninnate\ntolstoy\npartitions\nbreslau\ndiscontent\nlucifer\ncena\nriches\ntuck\nrembrandt\nobsessive\nprimes\nsinhalese\nstarboard\nmadeline\nadministering\ncamilla\ncompressor\npj\nlogistical\ndio\nrelativistic\nslips\nwaltham\nframeworks\nstructurally\nnotifying\npiles\ninserts\nagitation\nwaverley\npersuasion\nrune\nnosed\nricci\nviability\nbaluchestan\nunorganized\nseventies\nhonorific\nrobyn\ninterchanges\nyamada\npassers\npacers\nkam\nguangxi\nroosters\niptv\nrespectful\nabnormalities\nhedges\nevading\ngdr\ndewitt\nchases\nincomes\nsharia\nkearney\nly\ncallsign\npedestrians\ndrifting\nhive\narcs\ngemma\nabbasid\nricher\npractise\nbun\nrepulsed\nwhaaat\ninconclusive\nsuffragan\nepistle\nperch\ngroupings\nlatent\npartnering\nzenith\ncanaan\nhastily\nhasidic\nlarouche\nshorthand\nordre\nphilatelic\n
swelling\nmcmanus\nunlocked\nnaruto\ncalle\nfinishers\nmangalore\nforecasting\nasians\neure\nspontaneously\npfr\nstockings\nadriana\nfries\nchampioned\nblink\nshortcomings\nstove\ncongolese\nreprints\nrustic\nyangtze\nguevara\nscreams\nvolunteering\ncatchy\nniles\npayroll\nsolitude\nobscurity\ntoothed\ndrones\nquotient\nsofter\nbir\nreliant\nstravinsky\nrebound\ncondemning\nentrenched\npalacio\ncascades\nlv\ntangible\noratory\nhowitzer\nbona\nhaus\nintriguing\nvous\nremodeled\nbabel\nburrell\ncaine\nvaccination\nshizuoka\nhorowitz\ndimitri\nanhui\naddict\nxia\npersists\nprowess\ndistract\nwuhan\nelites\nsendai\nhartmann\nhwa\nstabilize\nrefuted\ntropics\ntagore\nkenji\nscarlett\nrecorders\ncavan\ntravancore\ngrasshopper\nhanja\nlyceum\nmrna\nhohenzollern\ngrooves\nyeon\nprogressing\ndeter\ndandy\nphilosophies\nmcclellan\nenamel\nmagellan\npenrith\ntit\nreadability\njuilliard\ntapping\nanil\nthwarted\ngallant\nterrell\ntycoon\nfinley\nelbe\ntod\nlooted\nyaroslavl\ntula\nmoby\nbonner\nacta\nelective\ngrease\ncortés\ndigby\nfurnishings\ndemonic\nsignify\nlick\nthinker\nbearers\nkombat\nbegum\nliber\nexcelsior\nbrightly\nfarmbrough\nrobber\nweinberg\naitken\nfarc\nrarity\nrodent\nshipwreck\ndetonated\nantics\nbypassed\ndragging\nlodging\nequitable\nuniversität\nyom\nvodka\nzane\nmisrepresentation\nmantra\nthinner\ncaldera\nangelina\npadre\ncollège\noneida\ninquirer\nminted\ngangsta\nagile\nboasted\nhollis\nmalayan\nconductivity\nasymmetric\nemulate\nscars\nscripting\ndisparate\ninsular\nharmonies\naga\nsclerosis\nsusie\nalger\nantiques\nplenipotentiary\nappetite\ninvaluable\nrhodesian\nlandry\nkhaled\nsuperfluous\nroaring\nwrc\njutland\nenlargement\nhorner\ndevine\ngis\ndragoon\nriemann\ntimeless\npyrenees\nherbie\nhanuman\nshowers\nfray\nethos\njansen\nreggio\nsober\nfundamentals\nfriesland\ntentacles\ncarlyle\nliquidation\ninvoluntary\nburlesque\noutlawed\nrehab\né\nparnell\nmondays\nvolley\njuris\ndeclarations\nphnom\nhaggard\nirt\nbundles\npatterned\nnouvelle\nistvá
n\naccorded\nparr\nmails\nkaliningrad\nduval\npersuasive\ndnp\nsugarcane\ninhibits\nroofed\nheyday\nappliance\nspas\nmoderated\noncology\nkh\ndiminutive\nsiren\nrajput\nwashburn\nhospice\ngwr\niodine\ntemps\nmisrepresenting\nastonishing\npumpkin\nevaluations\ndivergent\ndeforestation\nusaaf\nrents\ninappropriately\ncynical\nswapped\nconfronting\ndivert\nonboard\noranges\nworkload\nlikeness\nmechanically\nlakewood\ncohn\nwynne\nebook\nswings\nmenzies\nfong\nramesh\nenlist\nmccormack\nreplying\nbobcats\nunblocking\nxix\nredding\ninfirmary\nchelmsford\nrenegade\nimaginative\ntac\nfumbles\nbests\ngreetings\nphotobucket\nspaghetti\ngamecube\ncoveted\nrealising\nkaran\nnorwalk\nwarcraft\npics\nbmi\nglastonbury\nrelentless\nsion\nlorentz\nclimber\nintensely\nincarceration\nslit\nhaq\nvo\nholistic\nmather\nlerner\nayp\nandrade\nester\nvenomous\nintestinal\ntranscendental\ninstitutionalized\narrivals\npalmas\nmaude\npastry\nhangzhou\nnamco\nshorten\nsunken\nsuisse\ndeferred\nsinha\nmiroslav\npríncipe\ndocked\nsubculture\ntori\npong\nstubborn\nbedrooms\njustine\nalden\npacker\ntomlinson\npuma\nbongo\nsubstituting\ntinker\nchristened\ndwellers\nchoctaw\namarillo\ntess\nexe\nsortable\npaulista\ngreenock\npolka\nsubdued\nspalding\nstabilized\nembargo\npilar\ninefficient\ndryden\nmaastricht\nunnoticed\nsta\ndada\nmaroons\nhorrified\nsmashed\nmitigate\ndwarfs\nnel\nlikened\nkorn\nremakes\nforbade\nperpetrator\nnader\ngaia\ngrosvenor\ndisrepair\nflanking\ngrosso\ncapacitor\ncostal\nmonoxide\ncolumnists\nrecited\nmcnally\nresisting\nartes\nlovin\nnationalistic\ncomanche\nwelt\nprovoking\nrevista\nyuma\nppm\ninconsistency\nechoed\nfalk\nepilogue\ncnbc\ngunshot\nwinnie\ndsc\ndario\nvisitation\nseamen\nliv\nkano\ntriton\ncalendars\npalladium\nolomouc\nstonewall\nsilhouette\nawaited\npretends\nsimons\nconcourse\nmountaineering\nimpressionist\nheathrow\nhobson\nmovable\nbarges\ncervantes\nalchemy\ntapestry\nyeh\nmantis\nschroeder\nhoop\npaving\nlineages\nembryonic\nwitnessing\ndiscrimina
tory\nfarmed\ndistal\nhinder\nnpa\nimplausible\ncessation\nglimpse\nconner\nbieber\nepitaph\ninspected\noffaly\nstratton\nconserve\nflutes\nsaipan\nlobster\nmuscat\nvikram\nlogistic\nrossini\nassent\nnauru\nsancho\ndoppler\ngladly\naltarpiece\npy\nsidi\nnoises\narticulation\nmartino\nramat\ncitroën\nthunderbolt\nkwazulu\nwindham\nmountaineer\nvomiting\nroh\ncourtenay\ndenoting\ndistributes\nbyzantium\ngigi\nriccardo\ninformant\nfelder\nmahesh\nanecdotes\nsubsets\ntriggering\nkeene\nencrypted\nbows\nloki\nlimassol\nquébécois\nrelinquished\nhelper\njacinto\nccf\ncheerleading\nprosper\nsteinberg\nnamibian\ntolls\nsuicidal\nordovician\npreviews\nretribution\nhardened\nmays\navian\nmanifolds\nmaidstone\nguinean\nkuwaiti\nfostering\nturbulence\nstreamed\nvaldez\nmendelssohn\npolitburo\npritchard\nsnowball\ncanned\nprevail\nchores\naborigines\nappropriated\njaffa\nwalkway\nsurprises\nparkland\ndso\nros\npliny\ntoxins\nsaber\nflo\nstrickland\noates\nmello\npère\nthrottle\nvalentino\ntowing\naugustinian\nlargo\ntrademarks\nholman\nnye\ncoimbatore\nstanza\ncougar\nhickman\nafaik\najay\nshogunate\npolonia\novert\nsweets\nsinks\ncornelis\nstormy\nnazionale\ncorrectness\nplacebo\nombudsman\npatriarchal\nhamlin\nconsular\nhakim\ngarter\ncummins\nprized\nmanslaughter\ncomplains\nmoldavia\nstandout\nsubgroups\ncremated\nsiva\nkirke\nnightmares\nleaks\ndat\ngladiator\nhunts\nbloomsbury\nintrusion\ntonal\ncontradicted\nundoing\nlullaby\ngrievances\norderly\nconquests\npdc\ntranmere\nlemma\nsleeps\nslavia\nbrant\ndina\nwitt\nsunil\nsocioeconomic\nclio\ngaussian\nskinny\nfitch\nromanticism\nconnotations\nfractional\ntonnage\nboucher\ndutt\nkari\nzeitung\nolsztyn\ndignitaries\neugenio\nfisk\npotosí\nbuick\ninspections\nmum\nthugs\nflorentine\nentomologist\namnesia\nordeal\nhwang\npsyche\nstockhausen\nmanganese\njepson\nilya\nmonza\ngills\nserrano\nvicksburg\nsena\nsaffron\ncompensated\nnagy\nmta\nmoncton\ntait\necstasy\nwreath\nwarhead\nashkenazi\nblyth\nruse\npivot\nlament\njagger\nfin
er\narista\ndamp\nbridging\nforbid\ngpl\ncagliari\nwhorls\ntroubling\nentrepreneurial\nfootwear\nseeming\nhoyt\nallergic\ninertia\nplumbing\ncourtship\nfractures\nbibliothèque\ngeorgi\nassimilated\nbela\nhadrian\ncornelia\ncomplies\ndishonest\nleben\nado\nkeenan\nfacets\nunregistered\nretractable\nfated\nforgery\nanonymity\ngabe\nshelly\nchand\nparanoia\nluge\nsaud\nelectorates\ndeprivation\nfarmington\naik\npractising\nmicrobial\nemit\nunproductive\nbunting\nabstracts\nbaum\nthumbnail\nearldom\ndeteriorating\nbateman\ndisapproval\nblackmail\nmikael\nmitigation\nminuscule\npriestley\nattila\nspying\ntamworth\nclerics\nollie\ndredd\ngoya\naisne\nmash\nrestructured\ninterscope\nboxed\nrichly\nbruges\nwoodruff\ndiffraction\ncoll\nopenness\ncouture\nwaitress\nespañola\nfutile\nsuperstructure\nhandler\noffshoot\ncarrera\nmoreau\naccelerating\nmiserable\nsher\nmahabharata\nsportscaster\nslowing\ndriscoll\ncolonia\nloretta\nconspirators\nlooting\ndjokovic\nsteaua\nconscientious\nknighthood\ngranger\npaxton\npir\nlig\njuárez\ncornwallis\numberto\ngillette\npokemon\noverlord\nfascists\nbessie\nsalomon\nmaurer\nobscured\nliners\ncodified\nhester\nsama\nclément\njoplin\nsafeguard\ninfielder\nskirts\nfertilizer\nsurat\nadolescence\nopaque\nsandman\nlevski\nhue\nneuron\nflourish\ngrassy\nmelancholy\nbonuses\ncor\ntat\nsolemn\nchants\nwadsworth\nshimizu\nkristian\nagustín\nexplorations\ndigitized\nboosted\nstrap\nbatsmen\nchancery\nthigh\nshepherds\nblasts\nnovosibirsk\ncrat\ncursory\nrenounced\npolaris\neffected\nlichfield\nshrink\ntcu\nbelo\nbrew\nbarbecue\npauli\nbiochemical\ntending\nreichstag\njfk\nbritons\nelectrodes\npharmacology\nmontoya\nradars\ngw\ndebatable\njingle\nrajya\nwidest\ncoronary\nsidekick\nvas\nsagar\ngliders\nsoares\nsporadically\npodcasts\nthunderbird\nrevamped\nbaskets\nzeit\nlesbians\ndevin\nworsened\ncomoros\nsexton\nstarling\nblount\nloughborough\nbowden\npaw\nelecting\nunfavorable\ntongues\nnihon\nalvarado\nlewes\ngaze\nregrets\nfearful\nseanad\nindra
\npdp\ndiminish\nyam\nvaranasi\nshrinking\nburnside\nadversary\nprotracted\nliberian\namur\ngheorghe\nsloping\nromances\nintuition\npunishments\nlobo\ncctv\nluminous\nscrum\nroommate\nftp\nmayan\nblum\npep\nthayer\npilasters\noverflow\nstacks\nalbright\nnegros\nvinegar\nembark\nundermined\nhelene\nforwarded\nelgar\ncrichton\nrea\nmiraculous\nmott\nlucian\naccustomed\nlucerne\nbronson\ndiy\nepidemiology\njurists\nmitra\noverthrown\ncarcinoma\narnaud\npugh\ngrunge\npls\nassemblyman\nescarpment\ncalibre\nrollers\nlago\nadministers\nshipyards\nyee\naggravated\nwesterly\nseasoned\nnazarene\nbassoon\nmagneto\nmentorship\nseizing\ncontextual\nbrightest\nravine\nsnp\ncapitalize\nshakti\nbleed\nsavior\nsensational\ndae\naccommodations\nfinder\nnuisance\npedagogical\nfamilial\neen\nsavoie\nenix\nantibiotic\nconquering\nderelict\ndismissing\nfarce\nscoop\ncollide\ntransnational\naustralis\nretract\nhoy\npublicist\nplatte\nfédération\nclaimant\nsurrealist\nbelleville\nflowed\npretext\nkosher\ngadgets\npavia\ngorgeous\nfps\ngurney\ncondemn\n<\nealing\nhopeful\nforeigner\ninfringing\nillustrative\ncelsius\nhobbies\nfaiths\nitalo\nkatowice\nnbsp\nheisman\nfraternities\nretitled\ndownloading\nriggs\ncamino\nassamese\nparliamentarian\ngrossly\nantennae\nhardship\ndiscrepancy\nstalker\ndrilled\natv\nlukas\npiping\ngleason\nshillings\nidiom\nkuomintang\nshaman\nbsa\nlsd\natrium\nnypd\ncrore\nterrence\nmisty\nequestrians\ngian\npyotr\ncompleteness\nkyushu\nepirus\ngrantham\ncivilisation\nsturm\nsorties\nlest\npunched\npastors\nmarston\nfirmware\ncistercian\nborden\nprofessionalism\namg\nsvetlana\njuliana\naccommodated\nmusicologist\nreboot\nvisas\nageing\nleandro\ninertial\ntextures\nmonash\npods\noutfield\nblurred\nmundane\ninnes\nharman\nids\nbilliards\nconstructors\nknopf\nagnostic\nconformity\nyup\nlymphoma\nppg\ninward\nstewardship\nxxiii\nbogdan\nmultidisciplinary\nthermodynamic\nenvy\ngirard\nprojector\nloneliness\nhubs\nbehest\ndetachments\nuw\nbritton\nnga\ncutaneous\ntrunks\n
lends\nmelton\nspins\nstarch\ncarvalho\nkessler\npatrice\nestado\nashamed\nfeyenoord\ncooks\nawakens\nontology\nhainan\nfairview\nltte\nhmong\nstreaks\nactivates\ndecreed\ndreyfus\nning\nstabbing\nslaughtered\nrobes\n‡\nclipper\nfirefly\nfansite\ndeceptive\ncircumvent\nsonnets\ngia\ndispatches\ngator\nmadman\ndime\ntanzanian\nrespecting\ncloning\nchisholm\nuniverses\nmatheson\ntranslucent\nobservable\nmalice\nsidewalk\nmover\ndag\nephraim\nsmokey\npsa\nnoms\nsired\navi\nfugue\nhautes\nshiraz\nremovable\ndall\nsuggestive\nvauxhall\ndisciplined\nmatsumoto\nmalignant\nhoboken\nminden\nbarbie\nvoicing\nkonami\nresigns\nconti\nhof\nentourage\nchemically\nstoreys\nmurat\nfiorentina\ntrios\nviewership\ntectonic\nshearer\nfable\ndeployments\nnps\nexporting\npatriarchs\nbehaving\nstooges\nalban\ngalleria\nprofoundly\nmato\nbrute\nkrzysztof\navenger\novershadowed\nrecreating\npositives\nunaffected\nscooter\nglazed\nnearer\npedagogy\ninterconnected\nshack\ntq\nrabbinical\nsubterranean\nhominem\nraman\nslack\nmirrored\nchopped\nprobabilities\nsanity\nkelsey\ngecko\nmultiplier\nshook\nchhattisgarh\neaters\nsupervise\nphylogeny\nschmitt\nrocha\nwastes\ncavite\ndaniele\nmcgovern\nscripps\nkedah\ncristian\njacobsen\nstevenage\nnecessitated\ninducing\nobjectionable\nsubaru\nreacting\nrazed\nbeware\npip\ngrail\ndehydrogenase\nbarrymore\nfergus\nlonghorns\nsubsections\nez\nvicenza\ntoole\neisner\nspindle\nast\nheineken\negregious\nscotsman\nossetia\nislet\nsmiths\nlehmann\ninteracts\ncleavage\nradford\nsubordinates\nclausura\ninconsistencies\nffd\ncx\ndisgusting\nbale\nwarszawa\ndisregarded\ncompartments\naiken\nnightclubs\npepe\ninstigated\ndeduction\nwoolf\nmcculloch\nisthmus\nsepta\nedmunds\ntearing\nmaxine\nlisp\noperetta\ntadeusz\nbaseless\nsenatorial\npacheco\n♠\nnotoriously\nleech\ninterpretive\noahu\nnitro\ntver\nnighttime\nbolster\nstereotypical\ncowell\naeroplane\nkipling\nhillman\nthorax\niwo\nharlow\ndelegations\npropositions\nmidday\ntributes\nzhong\nsummarizes\nsculptura
l\nlugano\nspeeding\nswine\nweeds\nmolten\nkanpur\ncritiques\nweinstein\nangelica\nstepmother\npredates\nganges\nadept\nley\nflourishing\ncrocker\nbutterfield\ndevlin\nkeaton\nlocust\nhain\npharmacist\ncordillera\nparodied\nallegory\nlair\nconveys\nafternoons\nhuston\nfigueroa\nmacgregor\nkml\ngreta\ninteracted\nfloated\nmoreton\nmargot\ndonating\nmalagasy\nquaternary\npdt\nbluffs\npurana\nordinances\nbudd\noi\nhangul\ninfiltration\nmcgowan\nanu\nfiddler\nexerted\ndissidents\ngaz\nandromeda\nmould\nig\nmauricio\nhitherto\nactionable\nhempstead\nmonique\nvat\nmilner\nnylon\nharriers\nsharqi\ngeophysical\nshogun\nwick\ncyclops\nmcclure\nkazimierz\ngeorgina\nindycar\nroque\npurportedly\nreceipts\nyokosuka\nalchemist\nrecoil\ntentatively\ncharente\npesticides\ngraphite\nsounders\nmantua\ntypeface\nlees\nliberator\nintergovernmental\ndepartures\ndefer\nshelves\ntricked\nparchment\nhindered\nsaturation\nkemal\nbritt\noed\norganisers\nmiyazaki\namr\nmiki\npaola\nvg\njive\nalastair\ngathers\nparcels\noriginator\nmedellín\nmouths\nuav\nmecha\nshun\nreaper\npneumatic\nmace\necole\nhijacked\nmelo\nmsu\npudding\nseasonally\nquark\nquintana\nrecovers\nfiller\nbungalow\nelusive\naqueous\nconsciously\nsubtitle\nnanotechnology\nbac\nzamboanga\ntiberius\nconvocation\nbarth\ncrc\nloaf\ndashboard\nkaiserslautern\nsquirrels\nakita\npens\ncarpets\nduquesne\nhama\npetrov\npentathlon\nfocussed\npoorest\nbowles\nbeauchamp\ntripura\nseinfeld\noblivion\njams\nsonnet\nnoodles\nfrieze\nconsequent\neastwards\ncharmed\nmrt\ndoomsday\nsynthpop\nhormozgan\nappoints\ntreble\nbraxton\nvin\ntreviso\nesquire\nbergamo\neighties\nwolfsburg\ndurand\nnormans\nfittings\nwaiter\nveiled\nsubscriptions\nmalt\nbrandeis\nclicks\nchanting\nstints\nsocialite\npickett\ntransplantation\nbrothel\nmeek\nkoi\nsurya\nwaterman\ndyes\nwittgenstein\nchatterjee\nkal\nstemmed\npashtun\nupscale\nkhrushchev\nroper\nexcise\nraza\nahmet\nfarah\neriksson\nmethodists\narad\nmadam\nseán\nmullen\nmidget\nfigaro\nnortherly\nsault\ny
orke\njaffna\ngrenadines\nbearded\npasserine\ndearborn\nconfucian\nzac\ntis\nroswell\nignited\nfascia\nhatton\nindustrialization\nrabbinic\ndangerously\nbitten\ngare\ndubrovnik\npuja\nhelms\nthani\npathological\nrichelieu\nexploiting\nrouse\ninfiltrate\nmexicana\nrectangle\nchn\ndarfur\nlorne\nzia\ncursor\nsupercup\nbrokers\nsmear\nspartacus\nhardness\nmirren\nsucks\nbeginners\nbleach\ncadre\nducal\nsulfide\nmillard\norganiser\nresumes\nchunks\nlina\nstare\narresting\nhumanism\ndeb\nyat\ndowry\ncultivate\nmegatron\noverlooks\ntotalling\nmeir\nalder\nwaterhouse\ngambit\nebenezer\nanomalies\ndole\nsuperstars\nboardwalk\nchippewa\nfandom\nconte\nandaman\nvalentina\ntraversed\nspacious\nconcussion\nsoleil\npests\nsubunits\nprosecute\neucharist\nelise\nhaze\npenetrating\nhaplogroup\ngutierrez\nmysteriously\nsynchronization\nrenown\nmarlowe\nbreech\nbandung\nbulge\ncrunch\nnürnberg\npredation\nholliday\nhypertension\npasta\ncrouch\nspacetime\nflick\nundergraduates\nbanu\ndrexel\nedmonds\nuntouched\nherds\ndisconnected\nunsubstantiated\ncarriageway\nvert\nsinaloa\neastbourne\nextradition\ntruro\npennington\nflashing\nunconditional\nsideline\nrockers\nkool\nimmersed\ndelphi\nfredericksburg\nsterile\nfours\niloilo\nzi\nretina\nbess\ncastilla\npeacetime\nmcmaster\naudubon\nwrestle\naaf\ngansu\nluce\nunremarkable\nexacerbated\ncfb\nstardust\nmishra\nsurfers\narticulate\nfuentes\nwelded\ntakeshi\nflaps\nburg\nkaohsiung\nzoran\nbelinda\nobjectivity\ndames\ngunboats\nchomsky\nlongford\nwaving\npskov\nhabitation\nrubén\nweil\nethic\nmessengers\ndisperse\ntort\nnpc\nanimators\neliminates\nblackstone\nconey\nsyd\ncampania\nvlad\ncommuters\nhawkeye\ngolan\ncircumference\nbv\noptionally\ncarmine\nrajendra\nconveniently\nborges\nkirkpatrick\nbakr\ncollin\njacks\nfundamentalist\nyap\nstewards\ncatalogues\ninspect\nmarlin\nlowercase\ndefinitively\ndevastation\nleander\nfrisian\nobelisk\nyarn\nfamer\nlangford\nvaults\nnasl\nknitting\ncohort\ngabled\ncrosse\nbirkenhead\nbartender\nmigratio
ns\ndialing\nprofiled\nunmarked\nswearing\nlandlords\nendorsements\nrtl\ngabriele\nsingularity\ncorinthian\nanthems\nlegitimately\nkuban\nplayful\narf\nthakur\ndn\ntoowoomba\nstair\nrewrote\nwhigs\nvere\nallusion\ngayle\ndemetrius\nhypothesized\nbandai\ngopal\ndurability\nmahoney\nmst\nreinhold\nmakoto\noffseason\nmains\nsteak\nvalidate\nexited\ndictate\nboomer\nduma\nwarped\nbypassing\nprofessed\nsx\ngimme\ngarza\ntonic\nrockin\npelvic\nwestland\nmormons\ndisillusioned\nvelasco\nscrews\nlivorno\nkimura\nbennet\nsetback\ngeffen\nknapp\ncdr\njoking\nath\nrealignment\ndensities\nlaptops\napis\ncoherence\nbrownsville\ninterviewer\nbh\ncanyons\nschuyler\nwmc\nappraisal\nsnowden\nprimal\npenh\nwhisper\nvoronezh\nreconstruct\nhauptmann\nshelved\nmoorish\nrawalpindi\nescalation\nkar\nportrayals\naida\nbette\nares\nlitres\ntranscontinental\nparlor\nleblanc\nlapse\nforage\nunjust\nchameleon\nevaporation\nsandoval\nbrownlow\nshocks\ndischarges\ncharlottesville\nmathis\nanalyse\nmasse\nmancini\ncornice\nsprung\nbethune\npang\nxvii\nsiamese\ndubuque\nchronicler\nembroidered\nsei\ncampground\nalp\nrajiv\nhaywood\nthanked\ncheerful\nmillet\nribeiro\nvet\nbeckham\nalbedo\nwylie\nlemur\nthug\nhenchmen\ndis\ndelft\ncourthouses\ncalories\nherodotus\ntame\nbribes\nmontfort\nbehaved\nbrecht\nisnt\ngatehouse\nshoals\nspiny\nharbin\nwarmth\nboleyn\npredicts\nfawcett\nwicklow\ntribunals\ndisgrace\ncasas\nashram\njiangxi\naroused\ndekalb\nceline\ntiers\nalmaty\nfff\nsphinx\nbrigitte\ninhabits\nbhp\ntalkin\ndisgust\ngambler\nfluoride\nfasting\nlovell\ncreationism\nobtains\nmeteorologist\ngreenway\nwelcomes\noffend\nadmire\nmonologue\njohnstown\nweller\npandemic\nromagna\nkidding\npascual\nlump\ndepots\narchitectures\nbazar\nmagee\nsymbolizes\nquarrel\naristocrat\ndefamatory\nembroidery\nkami\nplentiful\nflc\nuyghur\ncushing\ndecidedly\nschalke\nkriegsmarine\nutmost\ncolder\ntyranny\nastrid\ntrooper\nalienated\nperrin\nmondo\nballerina\ncastilian\netruscan\nrook\njermaine\nrightful\nsimms\ng
harbi\nronaldo\noffenbach\nhye\nconclave\nrecess\nshortening\nchevy\n¥\nvandalize\nshouted\nmediaeval\nculmination\npalgrave\nlotte\ncloister\ngeiger\ndaredevil\npacifist\nmoser\nmérida\npurposely\nsagan\nmisspelling\nrooftop\ntabriz\nthrones\ndanilo\nbobsleigh\nloma\nautobots\ntendon\ndegenerate\nfranca\nboil\ntempleton\ncorcoran\nsighting\nerratic\nlimbo\nsubscriber\nfiercely\nstanhope\narno\nvigilante\nacoustics\nangled\ndelegated\ncoltrane\nthunderbirds\nmaldonado\nbaer\nhenchman\nmj\nbantam\ndeus\nritz\nmakeshift\ndiligence\npouring\npiety\nsuitability\nleopards\ntwisting\nrambling\ntambourine\nbicentennial\nhsu\njug\nundivided\ngrizzly\namphibian\ncommute\nacknowledgement\nadrienne\nvirtuoso\nkarol\nyoungsters\nfantasies\nusability\ntheses\nrounding\nrefute\ndeploying\nexempted\nyell\ncdu\njustinian\npardoned\ngrandma\nouest\ncornet\ntompkins\nbrentwood\nenlightened\nananda\nmemberships\ncoruña\namplified\nstucco\nannan\ncathode\nsentient\nvoiceless\nplaylist\nhurting\nempathy\nwollongong\nchapelle\nmtr\nsender\ntopo\nweakest\nconjugate\nrequisite\ngalactica\nsuffixes\nfulfillment\nacb\nze\nsuppressing\niec\nhawley\nconflicted\nfarnham\nfavoring\nrahim\ncrewmen\nmonstrous\nsummarizing\nalia\ntapped\nmervyn\nterrific\norator\naggies\nligands\noviedo\nkoh\nmotte\ncaribou\nwildcard\nniall\nintellectually\naisles\nreassessment\naccords\nlandscaping\nbends\nhurler\nvb\ndisagreeing\nschoolboy\nfaux\ncoulter\nmersey\neberhard\nchewing\nmcrae\nmacfarlane\njunkers\nmarburg\nnathalie\nresorted\nfiancée\nnorthumbria\nlakota\nsubtitled\nprotons\nburying\nconde\nbro\nblends\nyew\nmisplaced\nflamengo\nneurology\noverlay\npence\nutilised\ndrayton\nnotables\nuneasy\nillawarra\ngenerosity\nrin\nfonseca\nmocking\norchestration\nbracelet\ntramways\nmurad\nacs\npercussionist\nwatercolor\nnasr\npahang\nkiribati\nhomemade\nextraordinarily\ndaegu\nmich\ngilberto\nbnp\nbarbed\nchaser\nblueprint\nalveolar\nsvalbard\nbarefoot\nmedford\nbounced\nkenton\ndaleks\ngangsters\natc\nfanfare\n
seam\ncaa\ncanine\ngeocities\nmak\npenetrated\nenhancements\nbelles\nhimmler\ngoth\nluthor\npetri\nchromium\ntamaulipas\nalluded\nhereby\nrajesh\nire\nrockland\ncabo\ngresham\nsteen\ntundra\nbribe\nkita\nkwok\nchimneys\nindifferent\nthinly\nfink\nharrier\nelsevier\nstapleton\ncoping\ntiling\nhardships\nenid\nwesternmost\nadvantageous\nexert\nstanislaus\niupac\nmissoula\ngoo\nerstwhile\noscillator\nneedham\nfriendships\npiraeus\nmonorail\ngranny\narthropods\nunorthodox\nsumerian\nfatherland\npostdoctoral\nmiley\nsilverman\napprenticed\noratorio\ndynamical\nkrause\nalianza\nsomers\ndijon\neditorials\nkhulna\ntossed\ngilchrist\nunbalanced\nreestablished\nnested\nirvin\nvolcanism\nbianchi\nstead\nduality\nrefinement\nquid\ncfr\ndives\nwander\nprosecuting\ntownspeople\nblofeld\namounting\ndiets\nadmirable\nisi\nsss\ncarpenters\nsuleiman\nsurmounted\ncomplied\nmajored\nhummingbird\nkagoshima\ncutters\nbisons\nrigging\nmarred\naia\nscarcity\nishikawa\nyahya\nrecognises\nsprang\nfil\nconsuls\npleading\nassembling\nnomads\nplummer\nundeleted\nunivision\nrupture\ngoode\nmullins\nnineties\nreferral\nexistential\nbullpen\natherton\nsanger\naborted\nvisconti\nharare\nsane\ngully\norpheus\nspotting\nassorted\nbelgaum\ncolgate\ntardis\natatürk\nnegatives\nreplicated\nstairway\nnumeral\nbum\npotentials\nwerder\nmagnets\nsewell\nlr\ndeficits\nkira\nbestselling\napplause\ntreatises\nrenomination\nanemia\nlila\nornament\nsidings\nentails\nwilhelmina\nserenade\nneustadt\ndiarrhea\npierced\nllewellyn\nvicariate\ntimmy\nunfairly\nunanswered\ntokens\nkazakhstani\nwhitewater\ngreedy\nhammerstein\ninactivity\nbendigo\ngöteborg\ngifu\ninfer\nwrapping\nabingdon\nshackleton\ndreamer\nhajj\nbushes\nbiscay\nhertford\nquarantine\nexchanging\nboldly\nconcensus\namen\ndacia\ndms\ngol\nklamath\nfax\nbondage\narable\npineapple\nconstituting\nreconstituted\nypres\ntsn\nfen\ntwickenham\nevanston\nminesweepers\nnetting\nutopian\nredistricting\ncui\nhindustani\ncareless\nstoppage\nrachael\nterengganu\nsc
issors\nsynchronous\nheywood\ncounsellor\nsps\nribbons\ncortez\nmelodrama\naleksander\nmatchday\nseeker\ninvertebrate\nbelongings\nbusinesswoman\nwordsworth\nnearing\npursues\nsteamers\nnada\nvillarreal\nrotunda\nshakespearean\nunwarranted\nshiv\nlombardi\nalonzo\nmulligan\nqasim\nmelee\nfinns\ncastor\nairframe\nsoaring\nswam\ninterruption\nzhi\npolarized\nwhence\nqazvin\ngoiás\nholotype\nlinemen\nregionally\nsuccumbed\nunderwear\nextremity\nmahdi\neur\nwittenberg\nscrewed\nsteroids\npasser\nigbo\nababa\nheater\nmissy\nchah\nboulders\nmotocross\ndrifted\nyehuda\nestero\nrecursive\ntasting\nmundi\ndiscomfort\nspinner\nhyatt\nvélez\nbumps\nlocator\nmanley\nwalsingham\npasswords\nvehicular\npriya\ntrimming\nbuildup\nmexicans\ncried\nbarisal\nsidelined\ncohesion\nlewiston\nmasterpieces\nbottled\ngalen\ntransliterated\nprofiling\nlind\nhwy\npervasive\nadvertisers\nhasbro\nbrahmins\ncirque\nshaken\nbernadette\ngoin\nrosh\napologizes\ncpr\nivorian\nfirewall\nbower\nmúsica\napse\nsimplistic\nlupus\nfelice\nschwarzenegger\njawaharlal\ncough\nmikey\namazonas\nkennel\nnakajima\ntimeslot\nreappeared\nverdun\noccupancy\nkayak\njos\nlps\nmille\nthroughput\nbriefing\nantitrust\naltman\nlyrically\nnui\nneapolitan\nundefined\nile\nkashmiri\nlouder\nretreats\nbuffaloes\nwinkler\nurinary\nptc\njohnnie\nuniqueness\nanglicans\nmichal\nextravagant\nsubstantiate\ngoliath\nlandon\niit\ninduces\nblinded\nupbeat\ndispose\nibis\nkošice\njnr\nsickle\nordinarily\nunnatural\ncontainment\nmccabe\ncampaigner\nmisspelled\nvivekananda\nanalysed\nreiterated\npossessive\nascertain\novercame\nbarkley\nmarkov\nbobo\npetrie\nswann\nplead\navn\nsandhurst\nminami\naegis\ncondensation\npicket\nventured\nderailed\nwifi\nspecialises\nreminding\nstork\namplification\nvacancies\nfurlong\nstaudinger\nemphasised\ndiscredited\nsarcasm\nelectra\ntho\nvoss\njiří\nmilligan\nkandahar\nsevastopol\nunintentionally\nhostess\ninvitations\nsloppy\noldfield\nmekong\nlevied\nswallows\nstalingrad\nbree\ntimur\nsouthwards\ncom
plemented\nparrish\nheinemann\nmong\nhegemony\ntabor\nwillingly\nintolerance\nimitate\nhoneycomb\ndewan\nlockout\nrallying\nnucleotide\nxd\ngaetano\npuzzled\nhacked\nisidro\nkart\nbabcock\nguillotine\nmentored\nalluvial\nmie\ngarnett\nkitchens\nimmoral\nwestbrook\nkor\ndinah\nearthly\noverkill\nmaa\nsmythe\noss\nrcaf\ncomprehensiveness\npredictive\nstrives\nnemo\nkabir\njardine\ncaveat\nceleste\napprehended\nexpel\nsizeable\ntenders\nalicante\nextracellular\npaget\nreplicas\ndisambiguate\ntamar\nwhitecaps\nnegeri\npalsy\ndelaying\nhens\nleger\nsamar\neuclid\njpn\nennis\nvinnie\nlibertad\nunconfirmed\nomer\nauctions\nofsted\nattendants\ndisproportionate\nassortment\ndrank\nsco\nlinus\njustifies\nforfeited\narmageddon\nrefresh\nmaury\nportfolios\npinus\nshutting\nnormalized\nsheehan\noptimism\npershing\noppressed\nweavers\nvagina\nkieran\ncajun\ndistressed\noutsourcing\nfreiherr\nseriousness\nannihilation\ncpa\nlagrange\nchic\nsupérieure\ngens\nstag\nstd\nunderage\nformulate\nescobar\ndiscourses\nparton\nnikos\nbrutus\npooh\nhyperion\nhollyoaks\nproclaiming\nmanors\nseaboard\nchechnya\nwinfrey\ntully\nschofield\nide\ngh\nrommel\npleasures\ncoupé\nmosul\nmbta\njérôme\nflop\nnúñez\nadhesion\nvidya\nuprisings\nezekiel\nnkvd\npsychiatrists\nwaite\nmerrick\ntyrrell\nmogadishu\njars\nmassacred\nprovo\nprecursors\nmuay\narp\nirma\nniño\nbenoît\nott\nmisconception\nfrenzy\ndecommissioning\nhandwriting\nsaad\neasterly\nstacking\nknoll\naguirre\natelier\nnascent\nbernd\npaler\nprofanity\nimposition\nintermittently\nimplant\nkafka\nstowe\nheist\npasting\nmsa\ncongratulations\nkwon\ntelephones\nsungai\nhydroxide\nviscosity\nscarcely\nvents\nqaleh\ngaspar\ncodename\nobligated\nspicy\nmainstay\nshahr\nwhelan\npathetic\nvoor\nthelma\ntrotsky\ncolumba\nladd\nupn\nhickey\ngenevieve\nrecollections\nhegel\njuneau\nrevere\nconfessor\nhousewife\nkitten\ndiscriminate\nestrella\namphitheatre\npolymerase\nbatu\nplugs\noppenheimer\nmya\nparting\ntomáš\nflavored\ncontingency\ninaccuracies\nful
fil\ntennyson\nsilica\npeirce\ngorbachev\nadil\ngoodyear\nhrh\nalmighty\ndread\nlonesome\nbjörk\nrewarding\ncheeses\ncartesian\nparapet\nsora\nbara\nplanters\nsegal\nbarclays\nthrilling\nseaport\nstara\ntutelage\nboswell\njoon\nchronologically\nleno\ntennant\nwis\npreparedness\nlandis\nkhanna\nlingua\ngorges\nfragrance\ncider\nawa\nmehdi\nnetted\nserotonin\nstew\nhofmann\neugenia\neros\nflaherty\ngiulia\nexcursions\ncompounded\nhardwood\nwye\ncausa\nemitting\ncaters\nacharya\nunbroken\ntomasz\npesos\nsoapbox\nintestine\ndao\nkeyword\ngent\nlethbridge\ntromsø\ndedicate\nindistinguishable\nams\nharmed\nbets\namish\ncorfu\nadmirer\nrhinoceros\ngpu\npoke\ncoinciding\nvigil\nprosecutions\napproving\nbrugge\nhinckley\nsiècle\nhades\ndogma\nzeal\nraspberry\nobjecting\nmisused\nnikolaus\nfetch\nleibniz\ngibbon\nspore\nsnare\nacer\noswego\nbahasa\nguilford\ndepletion\nsmartphones\nconduction\nshree\nyorktown\nrentals\nyamagata\nacosta\nsmyrna\nfullbacks\nirons\neid\ncounterpoint\neuthanasia\ndunk\nexposes\nrsc\nregia\ntungsten\ngroeneveld\njános\ngreatness\neasternmost\nhealed\nsubsidized\nbelvedere\nunplugged\nultralight\nwickham\ncooley\nbef\nmastermind\nepisodic\nbouncing\nobispo\nravaged\nplanter\nmesoamerican\nquill\ndislikes\nassisi\nfireman\ncivilized\njayne\nbedouin\ninsofar\nabbess\nfaire\nhandgun\ncordon\ndecorate\nalarmed\njosie\nshaikh\nwavy\nwikileaks\nintimacy\nhomology\nsynthase\nenvisaged\njag\nmio\nknut\njosip\nseren\ndolby\nscrooge\npompey\nlancelot\nbennington\ndiode\nslum\npresbytery\nsolves\nfriuli\nbalances\nseri\ntwists\nsus\ncalibration\nshutouts\nharbors\nlouth\nparalyzed\nharrogate\noutlining\nvallejo\npenthouse\nlippe\nafflicted\ncouncilor\nshines\nvieira\nthunderstorms\nbeginner\nlech\nensues\npai\nairstrip\nolfactory\nnoticeboards\ncorbin\nreinforcing\nammonium\nallegorical\ndowling\nforging\ngainsborough\neth\ndecider\ncoo\npigments\npaco\ncurators\ntransitive\nleaking\nwagga\nvolts\nparticipatory\nhansa\nmutated\nschoolteacher\ngays\nsymbolize\
nparishioners\ncoahuila\necoregions\npropagate\nasterisk\nundercarriage\ncrackdown\ncunning\nfringes\ncorrugated\nshaughnessy\nalexey\npeaches\ncohesive\nwig\nnamur\ndistilled\nsolstice\nadhesive\nsubscribe\ndunham\ntackling\nnephews\ncyanide\nintertitles\ndownwards\npolio\nhitters\nxian\nestimating\nmosaics\ncaspar\ncaricature\nmep\nfirepower\ncupola\nfriendlies\ncontour\ncriticizes\npersonalized\nayers\nbodybuilding\ntackled\nequip\nvienne\nanderlecht\nhumiliation\nusmc\nfunerals\nhawking\nprovenance\ndecisively\nwithers\nunjustified\njennie\nswell\nfreemasons\nsherbrooke\nmessed\nimp\nthebes\nimplants\nconceding\nwiggins\nviennese\nkeegan\nencode\ndalhousie\nslc\nplated\ncro\nconcave\nfruition\nskateboarding\ndiminishing\nmawr\nrabat\nlorna\npersepolis\nkoenig\nenshrined\nralf\njumbo\nslander\nchaparral\nreels\nimmanuel\nperpetrated\ncale\nworkman\nanatoly\nauctioned\nhealey\ncomrade\ndisparity\nufa\nbrides\nlittoral\nkangaroos\nawhile\nmou\nallusions\nextratropical\nimran\njenner\ntypos\nseq\nprogeny\ntilted\nmadsen\nwont\nschenectady\nglobo\nbarbour\nkelantan\nhochschule\nstripping\nmancha\nnecks\nmesozoic\ninfect\njeffries\nkiwi\nuncanny\nlutherans\ndisobedience\nvalery\naditya\nexemptions\ngrammatically\ndreamworks\nfilip\nkandy\ncoincides\nbrittle\nbangla\nnormative\nrecurrence\npn\nwindmills\nvanish\nxxxx\ngunman\nmixtures\ncamels\nyosef\nsearchable\nfreedman\nfoss\nusages\nrancher\nradiator\nrepelled\nhoard\nnizam\ncline\nimbalance\nretake\nblackbird\nweldon\nvisibly\nfonda\nshelling\nfab\npuccini\nrwy\nbullied\natypical\nrudimentary\nhiro\nbevan\nquail\nmoseley\nextortion\ncouncilman\nsalam\nxie\nwil\nremo\ncameraman\noa\npcr\nvio\nsnapped\nnee\nplaya\nleavitt\nneath\nintegrates\nbarrios\nblocker\ncasanova\nyar\ntatars\npanhandle\ncoppola\nnadine\nrepay\njanis\nfurry\nstallions\ngehrels\ndnipropetrovsk\nrudi\nallman\nsnapshot\nterriers\nadnan\nshakedown\ndomed\nhendrick\ndiptera\nlearner\nlatitudes\nsatanic\nluckily\ntajik\nsteamships\nweary\nsmu\ngated\n
organises\nlindbergh\ndesserts\ngodavari\nlarval\nelmo\nhowrah\nhandley\naceh\nwinifred\nhypocrisy\ncrocodiles\nroadways\nethyl\nspector\npim\nhindustan\nacp\neiner\norganisational\nchiropractic\nreinhardt\nfuego\nderiving\nyarra\nnum\ndope\nsabina\nsumter\ntubing\nrecognising\nhorribly\ncontroversially\ninteroperability\nseverin\nsurrogate\nvillanueva\nreformist\ntampere\nstaunton\nmontes\nbane\nthon\ndumps\nfoes\nfunctionally\nfleur\ndevonport\ncortes\nautosomal\ncooperated\nberne\nfrees\nestudiantes\narya\nessentials\npushkin\nwarred\npantomime\nthreaded\nans\nsurgeries\nlinton\nunheard\ncockburn\npavilions\nasin\nnominators\nchatter\nrumoured\nstoria\nhossein\nconduit\nwheatley\nkottayam\nhamiltonian\nweiner\nromantically\nstrategist\nplanetarium\nromana\ntsv\nlansdowne\nkumamoto\npersuades\nbanff\npemberton\nruskin\nstreisand\nbolded\ngenesee\nmaloney\nmisinterpreted\nmisinformation\ntbd\ngirona\narles\nmyra\nfixation\nsuitably\nerrol\nslew\nvodafone\nbanat\noverweight\nincline\nchamberlin\nuniformity\nhacienda\nreviving\nalienation\nionization\ngretchen\nbeagle\nbiscuit\ncolossal\ndum\nnok\nkinney\nbruckner\nexpressionist\nwolfram\nincremental\nbahraini\ncords\nibiza\ngulls\nluminosity\nneedy\nwsop\negerton\ncarex\nionian\nlua\nbleak\nwielding\ninfante\nawe\ntray\nvm\ntownhall\ntur\nintracellular\nlar\nramadan\nkeyes\nmidpoint\nmanic\ncemented\ndistillation\ntulip\nqatari\nfreeware\nrinpoche\nlms\naki\nwalmart\nintermediary\nattributable\nfelicity\npetro\nfootpath\nnaxos\nsnacks\ndues\nhoosiers\nhenceforth\nwoolly\nhutchison\npitman\nirresponsible\nsobre\nformalized\nimprobable\ndefaults\ndv\ngeorgios\nlengthened\nconroy\nmiyagi\nmcfadden\nrobo\nstylus\nempower\nmacaulay\nunsatisfactory\nbayonne\nmischief\nleila\nannoyance\nbuds\ntout\ncombating\nnonexistent\nchanel\nronan\nburrow\nfruitful\nbiscuits\nmicroprocessor\ndisgusted\nbottoms\nspock\nfer\nhabeas\nspeciality\ncantatas\nbradman\nclovis\njeanette\ntlc\nrefractive\nisd\nmorecambe\nkursk\njohansen\nmontic
ello\nconcentrates\nannunciation\nalkali\nfacsimile\npodgorica\ncomprehend\ncatalyzes\nacadia\nthanjavur\npreclude\nmeetups\ncsx\nestelle\nembroiled\nstearns\njie\nkinder\nstocked\nepistemology\nanyhow\nsidaway\nkamakura\nencodes\nrecommissioned\ngoodnight\ncasing\ncarrot\nescalate\nsumma\nmaia\nzvezda\nskelton\nimagining\nhertha\nfairest\ncronin\nrecollection\npaley\nleela\nabilene\narchiv\niranians\nbaiting\nplover\nkissed\nparallax\nprimo\npretender\nramayana\nminimalist\nrecessive\nalbatross\ngrandpa\nfrazer\nfootprints\nrebuttal\nstardom\nditches\nforfeit\njeffery\nseminoles\nunusable\nust\nbiblioteca\nsharjah\nforbids\noo\nvegan\ncanoes\nfostered\nzh\nabridged\nprayed\nbrenner\nsticker\nlivre\ninterfered\nskirmishes\nesteemed\nlabelling\nlaurier\nresettlement\ncomical\nwhittaker\nredirection\nmenus\ngilman\nfiberglass\nashe\ndisputing\nchilton\nherod\nrub\nnetworked\nindore\njainism\npetrel\ndoric\njosephus\nfdp\nspinoff\nkeywords\nincest\nbobsledders\nkaufmann\npropagated\nintrigued\nsakai\nbillionaires\nsaarbrücken\ngendarmerie\nexperimentally\ncatechism\ngarda\navril\ndistraught\ninvoking\nmuppet\ntoolbox\ngenghis\nadventurous\nmandalay\nshao\nnordland\nhippie\nagostino\nteri\nbani\ndynamically\ndowngraded\nbnei\ngrandstand\nsander\nbrahman\nunrealistic\nmás\npreventative\npoisson\nbock\nludicrous\nosijek\nvolgograd\nagility\nov\nthoroughfare\ntongan\ncalypso\nactivating\nsinhala\nshakira\nwatertown\nconspire\nnausea\nmackintosh\ngmail\nlitchfield\nzhen\nwoodbury\nmane\nconcerted\ngenomes\ncarts\nmansur\nolives\nrushers\nhomelessness\napoptosis\ncosmo\nfurnaces\nanas\nsteroid\ntama\nilliterate\nleakage\nterrified\nbos\npolluted\naquaculture\nazteca\ntuscaloosa\necoregion\ninformer\naerosmith\nromain\nexcuses\nvandalised\nadhered\nmahogany\nchaco\npopulate\ngrotto\nfootbridge\nbourke\nbamberg\nhmas\nextracting\nsandro\nlured\ncomédie\nastral\nadirondack\nsecretion\nlayouts\ndragonfly\ndruze\nthorns\npollutants\nafar\nbsd\nconsorts\ndowd\ninquest\nparatrooper
s\nsummertime\nschuylkill\nlata\nherbarium\nincense\nagony\naptitude\nsutter\ncoda\neinar\ngrenville\ndawes\nbirdlife\nmessing\nhenryk\nfermi\ndescartes\nfigurines\nreruns\nthoracic\netching\ntinged\njojo\norganists\nequinox\ncensuses\nvalea\nurn\nhitch\nyeats\nnicki\nfief\ntarot\ndeutsches\nvipers\nhug\nhiggs\nseminaries\nschwerin\ncarta\nheadland\nerfurt\numayyad\niona\ndooley\nuncles\nswore\nancona\nguizhou\nmönchengladbach\nvogt\nmotogp\nbmt\nohl\nbeggar\ncunha\nmoog\nwhitworth\nhanks\ncapacitors\ntattoos\nfindlay\naccrington\naung\nherschel\nmerthyr\nwebcomics\ntestosterone\nconfrontational\nebony\nskeptics\nminster\nwaged\nvos\nfyodor\ntangled\nsignifying\nmagi\nfoote\nvázquez\nlarissa\nsamir\nbranco\nchoreographers\ncorrective\nescuela\nhoops\nfortnight\nalludes\nascend\nsoaked\nhélène\ntutorials\nironclad\nsweeps\nhillsboro\nautistic\nallergy\nprout\nniches\nchinook\nmedusa\nnorthland\niu\nbalboa\njoss\npraha\nphilologist\ncyclo\nshortcuts\novertly\novation\ngoofy\nheterogeneous\nhotchkiss\nthrice\nroyalists\nvologda\ntoei\ntheatrically\ndecompression\nblender\nkimi\npaleolithic\nfiguring\nsed\njosep\nleprosy\nsuborder\nporcupine\nbugle\nunintended\ncomplimented\nrioting\nwidnes\npliocene\nlatency\nhaag\ncoils\nsuperbike\ndoña\ncatalogs\naudible\ntipton\nornamentation\nempties\npooja\ncarolingian\novarian\nnvidia\ngarnering\ngerd\nindifference\nricans\nmaeda\nsourceforge\nkhl\nmindset\nsmuggled\npayton\nbiggs\nsolicitors\nbhushan\nspectre\nharem\nkhartoum\nselectors\nproudly\ngerm\nchilds\nbaruch\nseconded\nhsbc\ninterfaith\nnijmegen\nlackawanna\nomitting\naltai\nconstructively\nven\nlifestyles\npyongyang\nstoker\nhonoré\nripon\nkiln\ngünter\nviktoria\nelst\nlonging\nrake\nlycoming\ntrax\ndara\nsita\namit\nposeidon\naxles\ntaranaki\n¤\nneuronal\nteh\nrevolts\nattica\népée\ntuttle\nblythe\ncastleford\nsalinity\nanecdotal\nhorrific\nfifties\npuppeteer\nmus\njma\noversized\nhush\nstriving\nperil\nazam\nhalen\niyer\nadriano\natalanta\ncuenca\niba\nfiancé\nrumour
\nmagpies\nclarion\ntonkin\nmetcalfe\nncr\nclassifying\ntransept\naya\nembedding\nidris\nharwood\nsledge\nstirring\ninterpolation\nnewington\nrein\nhumphries\nfetish\nstarlight\nwand\nmocked\ninglis\nvending\nfad\nsparkling\nbhai\nmcallister\nstationery\nsirens\nmormonism\ndreamed\nziegler\ndist\nantônio\ngila\nsree\nlayman\ncomputerized\nramona\ndesai\ntillman\nbattered\ncurricula\ntub\nhardback\npriestly\nperseus\nbombarded\njax\ninterchangeably\nswinton\nseawater\nlrt\npassaic\nfao\nmárquez\nfermented\nenvironmentalist\nhydrocarbons\nlecturing\nrecount\nverma\nunifying\nvader\nnicol\nemailed\ntuscan\nwatergate\npunishable\nmolar\nadc\nprecautions\ncharisma\nirregularly\nnxt\nhomeopathy\ngrands\ngenomic\ngentile\nnatchez\nconnelly\nanglophone\ncairn\nsse\nhyman\nembraces\nbowed\ncaffeine\npatently\nneu\nunhealthy\nsudamericana\ndarkest\nhamish\nmetallurgy\namon\nsore\nmallet\nburgeoning\nck\ntra\nmoira\ncalifornian\ntexan\nharz\nchou\njk\nelvira\noverload\npolity\nhomeowners\ntoi\nbehaviours\nchairmanship\nincubator\nkip\nfestive\nshivaji\ndwell\ncondé\nguilds\nfarr\nchernobyl\npore\nstefani\ncurl\nasl\nhipped\nfictionalized\ngunther\nmetaphors\npolygamy\ndeem\nmicroorganisms\nentomology\nlancet\nacademician\nuranus\nvaginal\nnpb\nbayreuth\ntaman\norienteering\nscribner\nhamadan\nmalleus\nmeetup\ncarinthia\nneale\nghats\ncharger\npaulus\nridership\nknit\nanimosity\nsayyid\nmethodologies\nfrye\nresilience\nfaraday\ncafes\nacrylic\ndictates\nretrospect\nheh\nwrigley\ndesks\nsympathies\ngauteng\nclustered\ndrills\ncristóbal\ngaya\noutback\nmcneill\nprogrammable\nhenrique\nredman\nprojectiles\ntainted\nberesford\nminimally\naye\nhindenburg\nsedgwick\nausterity\ncentaur\nmichał\nwiseman\nadrien\nrepatriation\nbora\npiet\nsia\ndeformed\ná\nfamicom\nplaymate\nverbally\ngrotesque\nvaulted\ndelia\nfernand\nexcommunicated\naif\nmercia\narras\nlonnie\nmedvedev\nzambian\nsandwiches\nelegance\nabortions\nwallpaper\nscottsdale\nleeward\nfuss\natwood\nplywood\ncasket\nlhasa\nbul
letins\nkingship\ntoolkit\nplenary\nunreal\nlieberman\nfractal\nsikorsky\ngrandes\nbowel\nkline\nhuguenot\nluv\ntransistors\nbibliographic\ncaitlin\nuniformed\ncarlin\ncupid\nzorn\nannotations\nseeding\ncognate\nsamaritan\nnarrowed\nscrapping\nemulator\nsakha\nkrishnan\nironworks\ndrown\njed\ngroupe\npup\nbunkers\nharshly\nwineries\nprospered\noutcry\ncatchers\nbeattie\nbeneficiaries\npom\nsackville\noscillation\nboron\ntutoring\nasbury\nreappears\nsensibility\nharden\npfeiffer\nrennie\ncollectible\nchapin\ndesperation\ncaro\npersistently\ngauntlet\ngob\nwoodford\nbigelow\nucl\nadversely\npedals\nchappell\npremière\nwordy\nxt\nvoip\nsalty\nslabs\novertaken\nmarkazi\nfrau\ndirectives\nchien\ncumulus\nafloat\ncreepy\ngenitive\nceres\nmalian\nclough\ncambria\nard\nhtc\nspringboard\nayatollah\ncontradictions\nrainier\nquestionnaire\nais\nseduce\nduckworth\nsooners\ncatchphrase\nunwin\nrockhampton\nuniversitario\nwhl\ndistrust\ntramp\nfaked\ngrabbing\ncondominium\ndundalk\nracks\nversed\ncubes\nexpressionism\ncompagnie\ntorrent\nhesitation\nnarendra\ninterprets\nminaj\nsoma\ngbr\nphrased\nquirky\nshabbat\nobservational\napollon\ndelano\nhusbandry\ndisposable\noutpatient\namis\nmaris\nregaining\nadoration\nakhtar\njw\nresonant\nvasa\nhomophobia\nmayflower\ngrundy\nslovan\nstv\ncay\nbeneficiary\ntechnicolor\nminimizing\nfifths\ntariq\ntrawler\nguildhall\nbranson\ncharleroi\ntrentino\nfared\ntopographical\normond\ncarew\namphitheater\nstandby\nimax\npredicate\nimpromptu\nfiltration\novens\nhemp\naspiration\nconventionally\npanned\ndemolish\nvladivostok\nhamad\nsavy\nruff\nrevitalization\nthérèse\nliebe\nmorten\nzara\nuxbridge\nmaronite\nheadlights\ntoed\nofc\nreorganised\nsnippet\nbernice\nashraf\nmontclair\ndressage\nasu\nbering\nissuance\ncantonment\nmusketeers\nagnew\nmidshipman\nrenée\nmétis\nforked\nhenriette\nattributing\nartisan\nopting\nmotorways\nbanerjee\nshay\nhitachi\nyeo\nnunatak\nmailed\nskeptic\nimportation\nchekhov\nsacrificing\nmultilateral\nvassar\nxe\nald
ridge\nmendel\neyesight\nvijaya\nmacedonians\ntweak\nhaider\nrepel\nfleece\nlivermore\ntypewriter\nharlequin\ndeletionist\nmasons\nduggan\nfeats\nosnabrück\ngermania\nlongman\ninge\nintrigue\nroasted\nmerriam\ntelevisa\nlanterns\nsucker\nenvironmentalists\ndisbanding\nshingle\njamming\nmaddox\nunveiling\naes\nmuhammed\narmin\ngreet\naoc\nbreaching\nosu\njn\nnewt\nultima\nebola\ncereals\nmarist\ndq\nhertz\nantlers\nobi\naberystwyth\nayres\nevoke\nhopping\nshiloh\nembryos\nattest\nupkeep\nhilal\nfanning\ntele\nccm\nfrith\nskit\ntidy\nhelpless\nalbemarle\ndeacons\ncryptographic\nlugo\nodense\ncentrifugal\nidiosyncratic\nfinned\nsubversive\nbehaves\ngoverns\nliabilities\nargo\nlittleton\nsieges\ndojo\nroaming\nmontessori\ngita\nbolivar\npitts\nbrice\nkirill\nslipping\nthanh\npasquale\nfoundational\nmorphine\ndisallowed\nbooty\nlaing\nchaucer\ndigger\nmisrepresented\nplatted\nrocking\nbeni\ndoran\ninert\ngarnet\nwanderer\nudp\nattaching\nahn\nzadar\nprotégé\nassyrians\nhardwick\nmoulton\nwalloon\nautograph\ninterrogated\njeddah\nhikers\nwilliamstown\nhawkes\njanus\nengels\nthrilled\nmcfarlane\nhounds\nbasemen\nserpentine\nagua\ngrêmio\nsliced\nspeculations\nderwent\nphipps\nslices\nartie\nrestores\nnandi\nstanislav\npolski\npreferential\nxing\nterracotta\nsilvery\nklang\nswahili\nfunerary\nreinstate\nhooded\nnavigating\naldermen\ncriss\nvosges\npenance\npaperwork\nkola\nmandel\ngoldie\ngourmet\nord\ndebian\npurged\nhandmade\nbuckland\niirc\nlucca\nsyntactic\nprofitability\nbudding\ntoolbar\ncolton\ncurzon\nvincennes\ninterplay\nlandslides\nanselm\nculprit\nhaunt\ndionysius\nfluorescence\naman\nroo\npatil\nphotovoltaic\nnecropolis\nlawler\nwba\nisps\nmichoacán\narcades\npertains\nè\nxy\nbanja\ncoordinators\nmedway\nassures\nwesterns\nmalo\nepiphany\nslavonic\nrandomized\nfirefighter\ncolossus\nurgency\nsultans\nassay\ndickey\nrefreshing\nkana\nmetamorphosis\nmartel\nbasra\nserbo\nchakra\ntheobald\nbouchard\nloudly\nfooting\nsho\nnag\nkraus\nsolvents\noverton\nomni\niglesi
a\nfathered\npassover\ndetonation\nportman\npap\nknowingly\nema\nsera\ncci\nparkes\ncompromising\nfootscray\neq\ndutchman\nintangible\nrecife\ngazetted\nschlesinger\naddendum\nprimus\nawakened\nhysteria\nplethora\nhauling\nklux\nleveled\ntouted\ngreenpeace\nhousekeeping\nuptake\nsampdoria\ngaon\ncartilage\npratap\nemblems\naztecs\npresbyterians\nbayard\nroulette\nroscommon\nemergent\ninequalities\nchau\ngloss\nkubrick\ndipole\nkidnaps\nraya\nmccullough\nholyoke\nreparations\nzapata\nforay\nenormously\nkeats\nsaltwater\nramsar\nperegrine\nniki\nbundestag\ncenturion\ninsured\nenglewood\nmulberry\nchez\nalamos\nmotorists\nphra\ntangential\nelectrostatic\nsteeply\nhalsey\npaganism\nrosalind\nprudent\nurs\napulia\nllp\nincubation\ncanis\nmarvelous\ngpa\npeckham\naac\nparser\ngreats\nmaurizio\nmink\nurbanization\nsled\nwidescreen\nsonja\nseok\ncamped\nmoderation\njanine\nsalah\nmotherboard\nzanjan\ncowley\nnumerically\nmediums\nmackey\ncensors\nthunderstorm\njalandhar\ngiordano\nradiohead\ndiscard\nmonet\nreligiously\nusurp\ntriumphant\ncompositional\nsolaris\noppressive\nfragmentary\nbellingham\nturbocharged\nmetalcore\nmultan\nokayama\nipo\ncurving\nkiefer\nultraman\noptimize\novid\ntempe\nrefine\ndung\nhopewell\nplo\nclockwork\npopped\nwpa\nmatriculated\ntrailed\nturkmen\njustifying\nbleu\nversatility\nperú\nalleges\nclapham\nbonham\nbrunner\nlê\nandretti\neustace\npunching\ntma\nbuchan\ngridiron\ngrounding\nbielefeld\nconstellations\nkirov\ncontractions\nfugitives\nunify\nvelocities\ndownstairs\narunachal\nlynchburg\nhamm\ncannonball\ncoeur\nsoutherly\ngard\ninstallments\ninstructs\nprocured\nmatador\nplundered\ndeterministic\ngerardo\nkershaw\nchrétien\nimprov\ngaol\nkata\nretinal\nabad\nsurveyors\nrabin\nhora\nsulu\nmashhad\nbarbary\nfpc\nwildfire\ngalbraith\npopping\nabram\nclamp\nlaughed\ndormitories\npsychotic\nremit\nserif\nconducive\nempowering\nheartbreak\ninnovator\nchr\nrotates\nsuv\nwac\ncorvettes\nluanda\ntsr\nsignatory\nencircled\nrawat\ncompassionate\nst
irred\nplough\ncastello\nwestmeath\nfoucault\nsis\nsephardic\nwrongful\nulf\nunneeded\nskipping\nlichtenstein\nreeds\ntanganyika\nanguilla\nmillimeters\narjuna\ndoctrinal\nmacedon\nteixeira\ngophers\nreaffirmed\nguts\napproves\ndagestan\npringle\npere\nkd\npeptides\npimp\ncinematographers\nhousekeeper\nshakhtar\nanthrax\ncenterpiece\npathologist\nboland\ninterpreters\ntum\nforbidding\npenrose\nute\nrubinstein\ndst\nlanger\ngarrisons\nintimidate\nlupin\neto\nblaise\npanasonic\nhappier\nmisunderstandings\ndebussy\ncapsules\ngerrard\nstruts\nburman\nbainbridge\npseudonyms\njahan\nwola\nkozhikode\nwarlords\nrapping\neos\nrosas\ngyörgy\nequine\ntia\ndielectric\ncagayan\nportuguesa\ntoned\nconcentric\neaves\nabbots\nmultiplex\nprecipitated\nagha\nhijacking\nhalton\nfrontage\nmillie\nseduction\nararat\nsaraswati\ndealership\nmamluk\nridicule\ntownlands\nvalois\nnicht\nmime\nhussey\njinnah\nquotas\nloomis\nunwillingness\nhoms\nprerequisite\nbile\nicy\nfreetown\nconfucius\nkuo\niggy\nreckoning\nlk\ndunno\nwie\npiacenza\nweave\ntenets\ndiplomas\nuwe\ndisestablishment\ndeo\nrenegades\nunborn\njurors\npetrus\ngraceful\nemo\nslums\nkiki\ndividend\ncounselling\nrescinded\nvfa\nblackwater\nbroadened\nhdtv\ndevise\nclaimants\nwhyte\nechelon\ndoubted\nbom\ngastric\nrobby\ngruber\ncultured\nfermanagh\nthirst\nalfie\nofficio\ntrolleybus\nhutchins\nbayesian\nconnaught\nairliners\nounce\nallele\ntheorized\nforewing\nwinnings\nmera\nfribourg\nsilt\nrations\noft\nserene\nblossoms\ncurie\ndansk\ntph\ndey\nstaggered\nmaarten\nwithheld\nitalicized\nhousemate\ncanaveral\nconley\nmedicaid\nberyl\nhopelessly\ndios\ncellulose\nhla\nhuntley\nequate\nbbs\nlatimer\ncreeping\ntrumps\nlazar\nvehemently\nblogging\nroast\nounces\njockeys\nconcubine\ndidnt\nbanca\nguwahati\nworldly\nmyron\npotters\ninterscholastic\ncarrick\ncymbals\nmicrophones\nyip\nbroome\nsully\nuphill\ngeometrical\nmarr\nquixote\ngippsland\ntesco\nconseil\nannales\ncriminology\nloot\nedvard\nselwyn\ncleo\nincompetence\ncaulfield\nli
berate\ntotalitarian\ninitiates\nstylist\nugo\narmbar\ntahoe\nyakima\nstandardised\ntherese\npino\ndownturn\nmortgages\npuy\nbint\nlangdon\ndeir\nelmira\nbinder\njuries\nsinus\nabode\nreword\napostrophe\nmoresby\nappellation\nmobster\nparliamentarians\nbuggy\nsamuels\nemptied\nvane\ncie\nlibelous\nrosetta\nrecitals\nreputedly\nchaney\nuntold\nspar\nbayonet\nprovocation\ntvs\nmisread\ntelephony\ngauss\nlesion\namritsar\nfabrizio\nbrokerage\nrichness\nsecretive\nfaithfully\ngillis\ngarvey\nkon\nseamus\nticino\nsubordinated\nthesaurus\nsyllabus\nsnowboarding\nzedong\nhips\nkaya\nfirefighting\nmeats\nvalour\nflocks\npathfinder\nphoenician\nllanelli\nexposures\nscramble\naltars\nparrots\nwinton\nnaturalists\nguanajuato\ndissection\nanesthesia\ncourtier\ngarrick\nsignings\npickups\npoltava\nworkable\ncarnage\nloader\njovan\nfatality\ngoss\nzimmermann\ncranial\ngrosse\nhydrocarbon\ntse\ngimmick\natf\nabies\nbasingstoke\nfahrenheit\nmedically\nwearer\ndeactivated\nsummarily\nsuwon\nbeasley\nmagically\nstoner\nkaur\nskid\nwim\nburgas\nturnaround\nackerman\nplutarch\nsybil\ndorm\nbec\nclaiborne\nrondo\nmargo\nrenfrewshire\nqingdao\nappended\neiffel\nhakka\nleif\nfram\nhallway\nrescheduled\ncoatings\nsuzhou\nmarcin\ncmt\nconfectionery\nheadaches\ntot\nsubiaco\nbarb\npermutation\nquintus\nsemen\nrecast\nscipio\nmineiro\nsycamore\nbothering\nproverbs\nfibres\njilin\nchee\nlager\njørgen\npye\nliddell\nhymenoptera\nfrankfort\nphenotype\nmalek\nkinshasa\npatria\nredacted\nwrecking\npernambuco\nascap\naron\npebbles\ncurtin\ninventive\nraffles\nbund\noxides\ntypefaces\nwali\ncastel\nstalemate\nbains\nthirdly\nnaturalised\nsearle\nsafavid\nprojectors\nhorizonte\nkidneys\nmcnair\ndieu\nrespiration\nheaton\ncounselors\ntempore\nlf\njekyll\nmarginalized\npoitiers\nharms\nclinically\naffectionately\naugment\nsuzy\nkrauss\nchap\naussie\nextinctions\nmascots\nenvirons\nncis\nrota\nhispanics\nxtreme\nextinguished\nengravers\nbrookfield\nremy\nnite\naps\n,,\nwhats\nicarus\nsugars\ndeclan\nedo
uard\nalmería\ntodo\nunison\nayn\njocelyn\nphosphorylation\nici\nduplex\nhails\nqr\nexporter\njackpot\ncompliment\nkalmar\nrephrase\nrem\nbachchan\nincomprehensible\nmaterialism\ndecoding\npatras\nneatly\nwarhammer\ninflatable\nholocene\nmasquerade\naire\nastounding\nleavenworth\npleasantries\npromontory\nvalletta\nmollusc\ndenys\npalmyra\nferro\nblooded\nemd\noblong\nsynaptic\nchieftains\ncollared\nrho\nkuznetsov\nbeecher\ncomplication\ngoto\nodis\ngirolamo\nassyria\nprotruding\npons\nbenghazi\nexquisite\nfertilization\nmcnulty\nretailing\nriffs\ncarrillo\ncanteen\nrajan\nopal\nmonochrome\nparke\nuplift\nunbelievable\nsummarised\ndisdain\ncheated\nwoking\nstressful\nwhistling\nsandpiper\nisidore\njudson\nzeno\nlili\ntragedies\nabdallah\npaok\nshenyang\ncassius\nairbase\npostulated\nbram\nolsson\nelevate\nheracles\ncarmarthenshire\nenigmatic\nmerkel\nforman\nhons\nhobbes\nmercado\nbelgrano\nnmr\ngiuliani\nwallachia\ngott\npresumptive\npeta\nobscenity\nkönig\ntraverses\nguessed\narabi\nkean\nconveying\nemulation\nzaire\nprophetic\njaroslav\noutposts\nkarlsson\nchowdhury\nacquaintances\nspilled\nsina\ncoronado\nreproductions\npatten\naylesbury\nsaladin\nmolded\nmerced\npew\ntcp\nkmt\nels\nlecce\naltercation\nobnoxious\nearthworks\nteton\nreceptive\nstoryteller\nmares\nmobsters\npinter\neverlasting\nsuitcase\nravel\nsubgenre\nlofty\nchute\ndonny\npeek\nassassinations\nvue\nlien\nkarst\nrestitution\nponies\nrefrigerator\nloudoun\nzine\ndoves\nkrasnoyarsk\nattaché\nparra\npixar\nstaffing\nrotations\narafat\nancillary\ndruce\nhumility\nundocumented\njelena\nphonological\nlh\nthursdays\nsuperstition\ndordrecht\ndismisses\nraccoon\nmoods\nheidegger\nsasaki\nauthoring\npao\nsouthside\nfodder\nkale\npebble\nobservatories\ntorsion\ntweet\nsaratov\ncorrea\nanimate\ncnet\nisthmian\ncaicos\nsarkar\ncruzeiro\ntapered\nlodi\nstubby\nwto\ncompilers\npuente\nshrike\nauthorize\nfolsom\nsaito\ncyp\nfsa\nselkirk\ncumbersome\nln\npahlavi\nrafvr\nfilename\nxyz\ndysfunctional\nmilt\nhadi\n
idealism\nisaacs\ncassini\nrearranged\nphotosynthesis\nglutamate\nforester\nchiu\nhumberto\nfitzwilliam\ncalvados\nrelapse\nmarquee\nswallowed\nmainframe\nsusceptibility\nmaoist\nsonya\nadidas\ndominions\nmex\nmazowiecki\nolympians\nconnotation\nbroaden\ntypography\nfiasco\nhabitable\ncroat\nstiles\ngalley\nperalta\njodi\nagonist\ngordy\nzeros\ntemptations\nadverts\njacobi\nmatured\nedifice\nkishore\nhearn\ntighter\njovi\noligocene\nuda\ncarbine\ngenders\nerick\nelo\naccumulating\ndulwich\nralston\nvedanta\nalston\nemigrate\nlarnaca\nmurdock\nbachelors\nhj\ndravidian\nseñora\njuana\ngoldfields\ncalmly\nrussel\nyogyakarta\nporous\nselim\nniccolò\nmcdaniel\nconcorde\nbogart\nrobles\nepithelial\ndiagnose\nbattlestar\nsavvy\nkiran\nstl\neverglades\nzak\nskype\nfeces\ndagenham\nphineas\npancho\ncomique\nargos\nledge\nmoldavian\nedwardian\nstraus\nannulled\nryukyu\nbeached\nsetbacks\njanitor\nnarrates\noru\nbullies\npectoral\nwillows\ncriticising\nmarys\nesque\nhackers\nys\nhogarth\nsil\nestrogen\nabrasive\naccomplice\natonement\ndragan\nhwan\nbok\nmessianic\nwestwards\nphantoms\nearp\niceberg\nbackstory\nsigurd\nornithologist\nchronicled\nwestlake\ndeviations\nrevolve\nadolphus\nconductive\naalborg\nbenn\nofficiated\nclowns\nprimordial\nbrabham\nsubscribed\nquits\nloreto\nibsen\ndodger\nheadless\nfollies\ncooperating\npinky\npaleontologist\nstipe\nmalays\njudgements\nreflector\njellyfish\ninfringe\nworldview\nswe\nmarxists\nrevising\nsosa\npelicans\nshortlist\ninstruct\ngroin\ninfested\nintl\nbetts\nrosalie\nvocation\nsurabaya\nblenheim\nsacraments\nvarese\nmihai\ncabral\nflamingo\nremorse\nphenomenal\narchivist\ncapone\nabner\nminstrel\nepistles\nastley\nconstables\npouch\nsmithfield\nriparian\nkiryat\npyle\ndegrade\nfashions\nsae\nresented\nxiu\nkofi\nmoored\nbioinformatics\nperoxide\npredetermined\narchbishopric\ntilly\nstarving\nreproducing\nkelso\nflor\nnunn\naoki\ncolman\nremovals\nextremism\nrath\nwhorl\naland\nrnas\nresupply\ndevolved\npressurized\nnuit\ngramopho
ne\nkendra\nusm\nhemorrhage\nvile\nsubtly\nmethuen\nuplands\nshaggy\nburglary\nwiser\nbiker\nivar\nacronyms\nbluish\nelia\nprohibitions\nenthusiastically\nmesserschmitt\nsummation\nextremists\npamplona\nnik\npittsburg\nwilli\nconsecutively\ntipping\narmaments\nunethical\nyankovic\nblankets\ncautioned\nrupees\nfrowned\ntaro\nwager\npediatrics\nsusanne\ncontemporaneous\nmisfortune\nrecombination\nconvoluted\nberlusconi\nqa\nvladislav\nallende\nrav\nmajid\nbowes\nundermining\nprimrose\naristocrats\njordi\nazeri\nputative\nshrapnel\nrump\nbellows\ncampers\ncantilever\nformulae\nobese\nbooths\ncarpathian\ntranssexual\naphrodite\ntumour\nraga\nnantucket\nintravenous\nauditing\nsubclass\npartitioned\ntorrance\ndeduced\nnotary\nudinese\ntimbaland\nwhitehouse\nhustle\nkirchner\nsimón\ntierney\ntoros\ngironde\nsmarter\ninhabiting\nplugged\nsengoku\nluoyang\nnous\nragtime\narrogance\ncertifying\ncapri\nhayashi\nraptor\nveneration\ndang\narcy\ntrainees\naq\njardin\nhandwritten\nfoy\nberths\nwestmorland\nrong\nherbaceous\nfelicia\nstride\njoaquim\nmes\npilipinas\ntú\nuncontrolled\nprimacy\nmagnetism\nstaring\nvw\nislets\ncavern\ntaranto\nkansai\nprotestors\nbrewed\nshaved\nivanovich\nshading\nsandusky\naloysius\ninsomnia\nalgarve\nconsolidating\ncmg\nroddy\ninsensitive\nlawmakers\nwyman\nayer\nrug\npendant\nconfines\nstoughton\nskewed\ntessa\nhazrat\nneilson\nparris\ncorning\nrollover\nyazd\nsaxophonists\nxuan\nhusayn\nfarina\nsexism\nrede\nwindward\nmchugh\nonondaga\nhotspot\nhoo\nairstrikes\nallowances\nmonsignor\nfoxx\nraquel\nstickers\nreins\njm\nmarlboro\nirkutsk\ndistinguishable\nesprit\nassemblage\nhornsby\nweeping\nyelling\neugenics\nneutrino\nfrightening\nirritation\nfairey\ntimo\nqt\ntweaks\nepping\nchained\nlolita\nrishi\ntora\nvk\ndreadnought\nmalfunction\ncoldplay\nshlomo\ngallows\nstrathclyde\nenlarge\nmurderous\nhelga\nbreen\ngervais\ngrote\njózsef\nmansell\ngerber\nassaulting\npopcorn\noceanography\nvermilion\ndoctorates\npraia\noverdrive\ncortical\nosage\nwag\nb
uffett\ncontemplated\nhaste\nsupergroup\nfatah\nanatolian\nbuddies\nsouvenirs\nconvergent\ncern\nstylish\neradicate\nvitamins\narriva\nrecourse\ngrit\nyolanda\nucf\nrehabilitated\nbiologically\nkyrgyz\nbayan\neradication\nergo\nembarking\nmaas\nartistes\nokanagan\nmoulin\nsorcery\ntuskegee\ndecepticons\njanssen\nyau\nrcd\nincoherent\ntivoli\nrowling\nprophecies\ntakeda\nvettel\nbundy\nintertwined\nphilly\nabstracted\nbaha\nmela\nethnography\nludhiana\nostrava\ncharacterizes\nperfected\ndreamcast\nattainment\ndavie\ntagline\ncodenamed\ngermantown\nremodeling\nlakeshore\napron\nvetoed\nhodder\nfranciscans\nlonsdale\nwatchdog\nignite\nmostar\nlaois\nophthalmology\nbookseller\nwoodlawn\nfenwick\ncrtc\ngypsies\nbrasília\ncosmological\nswollen\nmcdonough\ntyrosine\nandalusian\nclassically\npussy\nbitterly\ndrier\ntraditionalist\nplebiscite\nassociative\nmodernisation\nraton\nbutterworth\njat\nhowitzers\nbesar\ncirt\nstreamline\npranks\ncompromises\nmarcellus\nlichen\nnarnia\nfreaks\nsteeple\nmatriculation\neccles\narrowhead\nblaster\nridings\nshasta\nexploratory\nlawless\nsangeet\nnoam\nprogenitor\nfurman\ntoon\nnang\nacapulco\ndisseminated\nlavigne\nmathilde\nfrancia\nraster\npartick\njeju\nrepercussions\nfluff\nhumanistic\npaternity\nburner\nrepressed\nundecided\nstamens\nbursa\nlis\npentathletes\nbubba\nvor\nhover\nbeggars\nparenthood\ntrotter\nsparrows\ncrucible\nforcefully\ngetaway\npyramidal\ntougher\nsalas\nkavanagh\ngöring\nrococo\nmoriarty\nmusgrave\ncommodores\ndetour\nbasie\nmiloš\nvaleria\nlisboa\nafrika\nhath\nknott\nfacelift\ndecathlon\nlingering\ntommaso\nmiscarriage\nfates\narmitage\nalle\naggregates\nmaxima\ncracker\nnavratilova\nfelton\ncaruso\nmoratorium\nhitman\nunlicensed\nemirati\nenvoys\ngriswold\nothello\nsoe\nwaterline\nluxembourgian\nbeatified\nailing\ndonahue\ncray\ndunlap\njt\nluger\nsuperboy\nillini\nsusannah\npickens\nhallucinations\npsychoanalytic\nsuperimposed\nmateriel\ndato\ntriplets\nkampung\nhirst\n≠\nindispensable\npuppies\nnewberry\nm
arais\ngeographers\ndemeanor\ndreadful\nanvil\nsturdy\nbiden\npermutations\nsanti\ntherapists\ncentrist\natl\nintruder\nhorseman\nkauai\nrealistically\nsludge\nfrasier\nwinona\nmules\nholbrook\nschoolteachers\nrayner\nrenting\nstalks\nrioja\nlaramie\nfurs\njetty\nradiology\nwatercolour\ndoge\ncadence\nkumari\ncsu\nbasso\nstabs\naides\nrecognitions\ndeli\nagar\ntsai\ncharms\nwrapper\ndonizetti\ncbi\nmolotov\nproclaim\ncaballero\naffirm\ngenomics\npunter\ninterpol\navenida\navionics\nmayfair\nattrition\nrimini\ncentauri\npuberty\nwraps\nprompts\nlymph\nprivatisation\nindustrialized\npersson\ninsulated\nlala\nrok\ndividends\nstumps\nstahl\nkites\nnymph\naurelius\nwhois\ntikva\nlaterally\ntarragona\nrapes\ncapsized\nnerd\ntomsk\ndigestion\npesticide\nbougainville\npalisades\nmardi\nphys\nmods\nimmunology\ndaffy\ncopley\nteng\nbab\nmidrash\npowerless\nsanctuaries\nabruzzo\nsaracens\njamboree\nsurpass\nglyn\nflamboyant\nbosworth\nzola\nsakamoto\nojibwe\nloi\nmyung\nbasses\njustus\ndw\ntsui\nplacid\nprognosis\nunbeknownst\nvideotape\nheilongjiang\nakademie\nbhatt\nigneous\nlenders\nvv\nxue\nsubstantiated\nsettles\nmislead\ncleary\nsupervillains\npies\nesposito\ngrandiose\nclustering\nloftus\ninstincts\ndilapidated\nfavours\nkamehameha\nreworking\nkenney\nsatin\nrafi\ndomínguez\nnozzle\ntilde\nvantage\nunam\nhikaru\nvaud\nnowa\njeannie\npid\nirritating\ncordelia\nsilurian\nbroward\nstringer\nformulations\nheadlands\ncoyne\nrigby\ncpus\nkee\ndetriment\nsmelting\narthurian\nmalnutrition\nkodansha\ntenuous\nverandah\nquitting\nephemeral\noccitan\nhei\nquests\nviejo\nrockabilly\nmaki\nalternated\ncuring\ndetractors\ntaxed\nkirkwood\nsakhalin\nrab\nviacom\nacupuncture\nkinross\npunisher\ntroyes\ntatum\nsubchannel\nchopper\nbuyout\nblooming\nnicolai\ninflict\nquigley\nleona\npio\nlilian\ndilute\nsupplanted\nlobbyist\nenrolment\narif\nfades\nhaymarket\ncropping\ndiscern\napologised\nhomophobic\nslobodan\nquang\nsamaj\nlighted\nveracity\ndecaying\ntaoiseach\nswain\ntricycle\nbeatri
x\nsanatorium\npescara\nguimarães\nrefineries\nstatistician\nive\npapilio\ngurion\nblanks\nfeasts\nbiochemist\ntaoist\nleaved\nolt\nhaim\nconsultations\nidiots\nflak\nsimcoe\nmotivate\ncoercion\nriksdag\nhusky\nbailiff\ninfused\nnmi\nhorsham\ntrumbull\nomsk\nconmebol\nharming\nsikhism\nbrunel\nkde\ncompetency\nschaus\ndisapproved\ncaltech\nsheath\njodie\nvelázquez\nhypnosis\ncharlestown\nnome\nfulfills\nherrmann\nvalenciennes\ncauldron\nproc\ncushion\napg\nneuchâtel\nece\nkhomeini\njoni\ncapitulation\nrosso\nhuan\nshostakovich\nactivision\nincursions\nindre\npwi\ncrux\nsrc\nalphabets\nconstructor\ninvocation\nblazon\ncorals\ndarjeeling\nlynching\nquerétaro\nanode\njour\ngijón\ninterrupts\nadditives\ncolvin\nhumiliated\nkindle\nanticipating\nnormale\nthumbs\nsockets\nundersea\ninsolvency\nacadian\ngreeley\neukaryotic\ndermot\ncorrelate\nchia\nsprawling\nlunatic\nsilverstone\nvitality\nanus\nwhittier\nechoing\nkoo\nacropolis\nrockaway\nplunder\nkoblenz\nbalearic\npoincaré\nhittite\ntheravada\nvn\nittihad\nantonov\naudiovisual\nubc\ncontends\ncoldest\nskew\npresidio\nchilling\nkamikaze\nvalparaiso\ncornered\nturquoise\nrectified\ncomté\nfreshly\nwatering\nvirtualization\nwaring\nmillers\nstrung\npitcairn\ninsufficiently\nseverus\nvideogame\ndobbs\ngritty\naesthetically\nhalliday\nadversaries\ncamelot\nbodyguards\nsurrealism\noverarching\nglamorous\ngcse\naomori\nkashima\nzona\nvandenberg\nhispania\nanaerobic\nshellfish\nfinisher\nlyttelton\npediment\njozef\nqureshi\nconceive\nfrancine\nmajoring\nclears\npenzance\nenclosing\nbooming\nnewsletters\nsnell\ndarth\nkohl\nbacklogs\nbarbra\ndecentralized\neffigy\ndeterminant\narmée\nnormandie\ninbound\neskimo\npará\nsapiens\nmilošević\nhatchback\nukulele\nbremer\nharmonious\nautoimmune\nsncf\nmagenta\nacl\nneuter\nsincerity\nguzman\ndonner\nperón\nwallet\naddams\nbibliographies\npans\nuterus\nyoungster\nlieder\ndwarves\nknockouts\nhinged\ncardiology\nfoaled\naircrew\nguyanese\nabdication\nmetcalf\nrout\nwaiver\nrevolved\npumpk
ins\nparse\nntv\nglitch\ncompulsive\n♪\nbrixton\nblazing\nbartolomeo\ncollagen\nallowable\ncampion\nsmallville\nsubic\ncuria\nmano\ncombative\ndespatches\nfilament\ndeprive\nlga\nicd\ncorolla\nliquidity\nwits\nnsf\nkindred\ndianne\nlng\ncarmarthen\ninfrequently\ndeliberation\ntaos\nregretted\nstillwater\naccountancy\nelmore\nozzy\nbj\ndutta\ndegeneration\nencampment\nipl\nfreehold\njokingly\nhammers\nloom\nadamant\nhobbit\nhaiku\nracket\ncounteract\nabsorbs\nintimidating\nactin\nmowbray\ndusky\nlleida\npublius\nsolicited\ncma\nhumiliating\nhaile\npresidium\ncaw\npinoy\narlen\nsch\nchantal\nblindly\npetah\nquack\nlonghorn\nvalais\ngamba\ndeepak\nslump\nryo\ncavities\neusebius\nattachments\ngoh\nmythos\nsewers\nnav\nhippo\nlys\narco\nphysique\nmaserati\nworkflow\nsutcliffe\ndenouncing\nbonfire\npalmeiras\ntpb\nguayaquil\nlecturers\nrangoon\nmarque\ncittà\nbehold\ncronulla\nlandscaped\ngracilis\nsystematics\nmechelen\nroadrunner\nmultipurpose\nnahuatl\nantagonistic\ngenoese\nchoruses\nshevchenko\nwray\nfairgrounds\nmcenroe\nmena\nheadphones\nwomb\nreiterate\nwildcards\nexecutioner\nshaolin\ntransnistria\nméndez\nautobot\nhonoris\nspeyer\nstorming\ncircuitry\ncuriously\nsibley\ntunbridge\nváclav\nridiculed\nginsberg\ndivas\nlivonia\nfibrosis\nvoltages\nfte\nqom\nmellow\nhecht\npoughkeepsie\niliad\nxerox\nsoriano\nstoney\nschoenberg\nlucio\nincendiary\npanchayats\nhernando\nsurrendering\narr\nbellini\ntran\nmystics\noscars\nremediation\negon\ndeducted\neesti\nobituaries\nsem\nweill\ncur\nquetta\ncompostela\nhandedly\nrenovate\nkos\nsayings\nminions\nmchale\nglitter\nstamina\nmol\nsvensson\ngiuliano\ntaping\nreece\nuninterrupted\nintrusive\nslavonia\npretended\nyevgeny\nvestry\ndismal\noctavian\npardo\nnix\nvirtuous\neclipsed\nanecdote\nwatchers\npascoe\nrefrigeration\nunambiguously\ntonne\nvirginity\nbianco\ninconvenient\ngraft\nlocales\nricketts\nnoodle\nblooms\nvj\ntack\ngestation\nelisha\nlaotian\naerobic\nflax\nrw\nsneaks\ndisintegration\nsacking\nirreversible\nheale
r\nkierkegaard\nmethodological\nmizoram\nresorting\ncotta\nvillainous\nshoal\noats\nsheena\noutfielders\npell\nrenumbering\nchanning\ntwa\ncameos\nthurman\nrepainted\nevert\nrevolted\nnaismith\nfreeport\nhatching\ndiligent\nbader\nejection\nsceptre\nirreducible\nhanlon\noriel\nrevisionist\nnaylor\npleads\npatagonia\nprivateers\nlaporte\nodo\nbankstown\nmage\nwidower\npopper\nnegev\noshawa\nresuming\nnaïve\nchino\nsegovia\ntheodosius\nvassals\nnetanya\ngrainger\nolmsted\nparable\nmogul\nichi\ntutors\nberks\nimplanted\npentium\ndasht\nsvp\nmisconceptions\ncharterhouse\nmian\nmme\nreared\nblasphemy\ncurtains\ndickie\nespanyol\nhauser\ndodson\nsnl\nopioid\noctagon\nmemoriam\ncombinatorial\nflooring\nlothar\nfleshy\nforceful\neuphrates\nbroderick\nbelfry\nhulls\naan\nfreeways\nnaturalistic\nzorro\nshorty\nradu\nunrecognized\nbruxelles\ngwangju\nsoftly\nhilo\nfanatic\nnishi\ncorrigan\nplunkett\nwading\nblackmore\nbonanza\npandey\ntigre\nwiden\njewell\nalgonquin\npostman\noptioned\nprejudices\nsuperficially\nlombardo\ndlc\nsages\nmicah\nguerilla\ntalladega\nplatonic\nbuoyancy\nreverence\nkinks\nmah\nexponentially\ntsang\nslugging\nbrilliance\ndoolittle\nnanda\nclam\nvee\nanalyzes\nmartyred\nphage\nyogurt\nsexist\nwoke\nkimmel\nlauder\ndarreh\nbonaventure\nwarburton\nlangston\ntelemundo\nswamy\ntecumseh\nceases\nnicotine\ncertify\narkham\npasteur\nrecessed\ngeronimo\nnameless\npunic\nbiosynthesis\nnadir\nabidjan\nwalcott\nmajorca\nwoodside\nbohemians\nry\nshinji\niraqis\nparthian\nexecutable\nbroadside\njalal\ninfiltrated\nbuckle\ndonoghue\nunattached\nadhering\ndiscrepancies\nprincipalities\nosiris\ndaft\ndashed\nstrut\nsuffused\nbeyer\nsill\nevils\nbaloch\nipcc\nmorelos\nbalmain\nsind\ndisgruntled\nxxii\nrecite\ndegrading\nonslaught\ngustavus\ndemocracies\nroam\ngilligan\nadhd\nmatty\nwojciech\nrecognisable\nmurmansk\nrud\nreuniting\nallocate\navex\nkermit\namato\ncoburn\npurified\nterrifying\npropel\nwaterfowl\nkalimantan\nulsan\nindecent\nsteaming\nmodesto\nnotebooks\nc
atólica\nschreiber\nprecedes\ngre\namazingly\ncrackers\ndirac\nleaps\nfooter\ndine\npotion\nventricular\ntamils\nbanknote\narlene\nrapture\nsip\nanew\nveer\nappalling\nsnippets\nsibelius\nneologisms\nowain\ncarlow\nswastika\npawtucket\nsdp\nbernal\nmccallum\nhinge\ngiancarlo\ntimpani\nvultures\ncorse\nozark\neritrean\nmujer\nprerogative\ndelightful\nhamza\nioan\ntenors\noverdue\ncurtailed\njanusz\nspectrometry\navila\nconformation\njolla\nmatisse\nney\ndefiant\nwalid\nscalia\ntrajan\njustifiable\nforgiven\nrattlesnake\nagm\noxley\nfibrous\ngoaltenders\nnawaz\nhinds\nasc\nswifts\njörg\nfarmstead\nforensics\npencils\ntelevisions\ncle\nrockstar\nwoodville\nbotafogo\ninglewood\nsolitaire\nhimalaya\nmarisa\nkneeling\ntotem\nsein\nheartbroken\nlovett\nfemmes\njester\npronunciations\njaved\nclegg\nsportscar\nrevivals\ntransmembrane\ngora\nbenedetto\ndordogne\nrealty\nsacrificial\nunintentional\nrecitation\nstrikeout\nshaky\ntavares\ntok\npalate\nspades\nbohr\ncages\nyonkers\nsssi\nlimelight\nsparking\nproprietors\nanfield\noy\nuniversalist\nexp\nsommer\ndy\nwaltrip\naliases\nvoc\nsimplification\nmutilation\nkleine\ncentimetres\nlx\nstalinist\nriverfront\nwestside\ntaker\nhydrolysis\ndevotee\npointy\nlanguedoc\ndismantling\nanomalous\ntwente\ngaseous\nticker\ndunstan\ndecorating\nrobberies\nrel\ncrawling\nrearrangement\nwaverly\nunser\npocahontas\ngénérale\nbaillie\nloon\nbladed\nissa\ntoyama\nlomax\ntoluca\nmeena\ninterlocking\nobstetrics\nbequest\nschematic\nahli\ncaterpillars\nnewsworthy\nstalked\nprog\nterraced\nkilogram\nkonkani\nrags\npheasant\napplauded\naha\ngees\npolishing\ninhibiting\nhennessy\natchison\nsushi\nscarf\nprudence\nstatesmen\nneckar\nkippur\ngoodrich\nsunglasses\nmontreux\ninductive\nballantine\nyvette\nchimes\ntranscriptions\ncherbourg\ndiversification\nbrushes\nevidences\nnicobar\nria\nwirral\nvadim\ndecoy\nblackman\ndownes\nheian\nkapurthala\nhistorique\nsorrows\nagendas\nonscreen\nspreadsheet\nespinosa\nlorient\nlindley\nlaine\ntories\nkfar\nhendr
y\nrakyat\nmarcy\nfeuds\ngeorgie\nperceptual\njukebox\nnederland\nmakeover\nprost\namending\nairy\nmonika\ninhibited\naoi\ndepressions\nearthen\nhallam\nconjugation\nehf\nspitzer\nauthorizing\nrolex\ncrucifix\nadaption\njudea\nexpressways\nfillies\nconsumes\naves\napiece\nriba\nplunge\ngolestan\nmohr\ncamacho\nflanker\nlexus\nblondie\nhotter\nmanche\nadventurers\nmaxime\nmilieu\nvalkyrie\ntuff\ncretan\nparabolic\nnewfound\nskunk\niván\nattaches\npristina\nwednesdays\nvasquez\nkabbalah\ninsecure\npostscript\nindeterminate\nsubmachine\nfilaments\nbuffet\ngoths\nzeeland\nheralded\nestuaries\njigsaw\noberst\noccupant\nkidderminster\nsylvain\njuicy\nhough\nchas\nfooled\nbakshi\ninactivation\nyisrael\nunloading\nstrongholds\nzoned\ncocktails\nrainforests\nmahendra\ntaps\nrerouted\nstaffs\nbrookings\nuncontested\ntarsus\ngauges\nyitzhak\nica\ncre\ninstructing\ncheeks\ncatalysts\nasymmetrical\nironman\nskillful\nhuffman\ncontraception\ndachau\nrivières\nmisunderstand\nacme\nrefereed\nashoka\nbarristers\nncc\ncuneiform\nhypnotic\ndaisuke\nentangled\nrationality\ntyphoid\nbanter\nproximal\ntrademarked\nranches\nboyhood\nhomologous\nreactionary\nclassifies\namine\npoirot\ngrandeur\naortic\nunduly\nnorthfield\ngist\ntransitioning\nenclosures\nmcclelland\nleven\ncocker\nemmerdale\nenrolling\npaulina\nembodiment\nreiner\ninquire\nlender\nyakuza\nvail\nsheryl\nyugoslavian\ngoshen\nwestbury\nbriscoe\nbate\nfavre\nbirthname\nhurd\npolska\naffections\nrelinquish\ntorrens\npoul\nwoodwork\nlwów\nmeri\noutweigh\nspoil\ndrosophila\ndaw\nheine\njervis\nschmid\nreap\nnlm\ntreachery\nimperialist\nora\nelegy\npertain\nplaid\nklingon\nbarnabas\nkurtz\nfootpaths\nredoubt\nbeira\nmadan\nsucked\ncannibalism\ngoebbels\nelaboration\namalia\ndevereux\ncuriam\ncelts\nworkington\nlott\nreuss\nburnaby\nbecket\nvitae\noeuvre\nepp\nworkmen\nroundhouse\nlancer\nsheik\nastrological\nstele\ninfrequent\nvalera\nauxerre\nresurfaced\naversion\npathogenic\nhyphens\nainsworth\ntreacherous\ncef\nedn\nresettled\n
stool\nranga\nbofors\nscorsese\nolof\nirritated\ntestimonies\nwacker\nnomad\nmodulus\nfilthy\nharam\ngorky\nnoblemen\nkayaking\nromanov\ngmc\nluhansk\ninfidelity\ntoughest\ncolonized\ndesignating\nsnipe\nise\nrosenborg\ndismissive\nhaakon\ncollaborates\nweightlifter\nenclaves\nstartups\narca\naddictive\noxidative\ncolonisation\nahmadiyya\ntaco\nsetter\nupholding\nbuzzard\ndvb\noakwood\nmaids\nstanzas\nlistens\norf\nsitar\njackal\nconspiring\nconstitutionally\naustronesian\nimitated\npinocchio\nyp\ngroundwork\ntequila\norbitals\nworthing\nroxbury\nhedwig\ndetects\npowdered\ncranston\ncremona\nlenox\naleutian\nsst\nlilac\neparchy\nachille\ncleft\nlandes\nmises\ntoussaint\npastel\nfirestone\nesteghlal\ndeseret\ncreamy\ndeerfield\ntoshiba\ndillard\nbmp\nflagstaff\ncaverns\nparishad\nsteered\nportmanteau\nsentry\nsternberg\ncomm\nomen\nrts\nsymbolizing\npremierships\nseibu\nsecluded\nvries\ngaels\nendocrine\nmufti\naspire\nstomp\nstarcraft\nrarer\nces\nappleby\nchime\nbrazilians\nmendez\nansari\ncybertron\ndepopulated\ncompatriot\nlanier\nhemlock\nsparingly\nboils\nedie\ncandice\nnautilus\nlemurs\nreagent\nredeemer\npura\nchoo\nunharmed\nochoa\nhainaut\nbidder\nogre\nsquat\nfowl\nflares\nmagister\nifa\ncoop\npancras\nblurry\nrivas\ndisparaging\nsmokers\nzz\nlenient\nlewin\nevo\ncottonwood\nwilmot\nvir\nmosquitoes\nprefixed\nhydrophobic\nconsensual\ncondon\nlewisham\nrios\nfurthering\nboosting\nequivalently\nekaterina\ndenham\nsnaps\ngoldwater\nmétro\ndeteriorate\ndownside\naosta\ncassava\nbhubaneswar\ngrappling\nconfrontations\njah\nwatermark\ncremation\nplacer\nbenefiting\nforearm\nbak\nnonviolent\nashcroft\ngoodall\nhpa\nsidewalks\nburley\nphu\nboi\naspirated\nincl\narkhangelsk\nslr\nrevd\ngros\npushers\naries\nshrek\nwilma\nfoiled\ndiesels\ncolette\nasw\nbluebird\ntien\nrewording\nguntur\npierson\ntechnologically\nrevocation\nabercrombie\nbryson\ncontreras\nmiao\nbiathletes\nerode\nplaceholder\nfend\ninspirations\nesplanade\nmagda\ndanse\ndiverged\nflammable\nmcknight
\nchelyabinsk\nej\nrhondda\nmolde\ncadbury\ntahir\ncalumet\ncanonized\nsalter\nreq\nmcg\nshyam\ndiabetic\ntightened\ntimberwolves\nstarbucks\nronde\ngrained\nrelaunch\narchitecturally\nalois\nusurped\ndosage\nnostalgic\nghazi\neglinton\ndesist\nestes\nevaluates\nrenounce\nfarthest\nmahon\nfacades\nimams\nfredericton\ngilliam\ndemography\nbeowulf\nstave\npubmed\nveterinarian\nfrick\nchf\nirl\ngallup\ndoodle\nfarnborough\nhinterland\nbetray\ncusack\nalford\nsolano\nsibiu\nkamil\nvinod\nacademical\nsembilan\ndiagnostics\nheinlein\nproactive\nniven\nlaurens\ncannibal\next\ngrilled\njolie\nosmond\nwielkopolski\ngcc\nnitric\ndemocratically\nspills\nhrsg\nmerchandising\nunhcr\ncondominiums\nmultiplying\nfanbase\nantidote\ndisabling\npopeye\nwarranty\novercrowding\ngradients\nmakati\nlagoons\nsokol\nwasteland\ndebit\nprabhu\nmethanol\ntiered\njacky\nzealanders\nsusana\noutbuildings\ntucumán\nintrinsically\nnez\nimpede\nstreetcars\ngatineau\nedda\nchauhan\nvolatility\nmartens\nschoolmaster\neo\njethro\nchambered\nadsl\nnonpartisan\nsanctioning\namends\nhoa\nsearchlight\nairflow\nrockville\nbebop\nvibraphone\ndekker\naclu\ntoms\nopposers\nunavoidable\nchicano\nclueless\ntigres\nconfessional\nhangars\nrainwater\nmisdemeanor\nkissinger\nkhabarovsk\nmugabe\njamieson\nbowyer\ntelekom\nschaefer\ntouchscreen\nerdoğan\njpl\nresumption\nsedge\nfranche\nbode\nade\nloanwords\nnrc\ngait\noutcrops\ninstantaneous\nnoche\nvieux\nitc\nenforcer\nfh\nbatches\nsag\nilan\nlaplace\nhops\nsachin\nbackhand\nheed\nkoji\nalternates\nladders\ncures\nplacements\nnegation\nconfers\narchdeacons\ntrivandrum\nretelling\neminence\nlada\numass\ndelusion\nhindsight\nbor\nfanzine\ncallum\ngazelle\ntowson\nhallows\nirrigated\nthirties\nthaddeus\ndsp\nbystrica\nchampionnat\nbaikonur\nveda\nnin\nremarking\ncobham\nmarlow\ntimeframe\nvesey\ncraze\nvarela\ncuff\nshanti\nantebellum\nstartling\nguardia\ndiagonally\nmaulana\nhaaretz\npizarro\npluralism\ngrouse\ntripartite\nmoraine\nbarnum\nregenerative\nknuckles\nben
e\nfinney\ntoa\nlifeboats\nmaturation\nincumbents\nplunged\nprokofiev\nsoundcloud\nridiculously\ncorinne\nparentage\nshaftesbury\nflatwater\nfv\nteamwork\nconceivable\npq\ngwent\nspitfires\nhappenings\ngaunt\nnucleic\nentitlement\nbusby\nsputnik\nuniversidade\nbourgeoisie\nmircea\nibadan\ndalrymple\nhessian\npatrician\npriestess\npease\ncorazón\noblivious\nnationalized\nsilently\nunknowingly\nscorpio\nrecuse\nunloaded\nmanifests\ncounterproductive\nwiz\nati\ndiscounts\nswung\nazhar\npurgatory\nreddit\nvicarage\nsmugglers\ncentralised\nsola\naccommodating\ndfc\ndescendents\ncarbohydrates\npetrova\nneedlessly\nabdicated\nreciting\nbeatriz\ncontaminants\nramones\nnth\nbaptismal\ncongressmen\nnesbitt\nenumerated\nbeitar\ninfusion\nmoma\nscreenwriting\nprocter\nrattle\nhailing\nhashimoto\nunicameral\nbloodshed\ntarrant\ngoogled\nintimately\nepics\nnailed\nbusts\ncantabria\ntui\ndeepwater\nallgemeine\nspeer\nkatha\ncalicut\nfooty\ntuesdays\nmuni\nahly\ndefenseman\ndiscouraging\ndickerson\ndissolving\naster\ndisband\nripple\nthracian\novertones\npolarity\nblasting\ngannon\nmorin\nrideau\nryerson\nambulances\nspecs\nshaykh\nsightseeing\npaderborn\nhradec\nhalley\nkcb\nrhododendron\nunconnected\nbracken\nlino\nseong\nrenaud\nfenway\ngardeners\ncartographer\nsepals\nbuckeye\ndrc\nschrödinger\nmchenry\nscatter\nddt\nsándor\narchimedes\ncomposites\ntania\nucc\ncorral\nbigfoot\npoetics\nshetty\nseabirds\nsnuff\nshinkansen\nslovenes\ncapella\nbibi\nmasculinity\nchickasaw\nreminiscences\ndebtor\nadrenaline\nfahey\ninf\naram\ndominguez\nexaggeration\nahmadinejad\nthane\nyad\ncandid\ncommunicates\nhoracio\npinot\nphonemes\ndominicans\ndipping\nexponents\nlowers\nsculls\nswapping\nassurances\ncrt\nfirsts\ncorsair\nsolubility\nmultiples\nbasu\nvaldés\nfranconia\nsupersport\ngrids\nmilitarily\ngrimaldi\nrawlings\nnitra\nmusk\nelectrolyte\ntread\napplicability\nomg\nnightlife\nelapsed\nnetanyahu\nbanbury\nringed\nkerrang\ntekken\npruning\nhovering\nmanipulative\nbuckner\nsith\nzephyr\ni
fc\ngiraffe\nhispaniola\nmcmurray\nmitre\nkaz\nttc\nhagan\nturntable\nphilo\namused\nmackie\nunwritten\nepidemics\nbumblebee\nics\nfedex\nwomack\nbarbosa\nfrisco\nesper\nblight\nseeger\nimitating\nsubtype\nshaffer\ngilt\nthrived\nmalden\nlorry\nprocure\npw\nfujita\noverthrew\nthiele\nlimousine\ncasually\nbutts\ngondola\namorphous\npersuading\nshriver\nbudgetary\nnay\nrolfe\nburnet\ndispleasure\nsynergy\nsøren\nriverdale\ndiss\ndrogheda\ncomstock\northopedic\ncalculators\nboardman\ncadiz\nagonists\nbraces\nbrendon\nredefined\nlaced\nsmoother\ncomintern\nchainsaw\npte\nconceptions\nholla\nwhirlwind\nrelayed\nacetyl\nbogs\nhernán\nfreedmen\nishmael\ningenious\ndyed\noverhauled\nstraps\nnewtonian\nprimavera\nliars\nvalet\nsoledad\nbothers\nhashim\nevolves\nesc\nspiritualism\ntheophilus\nparticiple\nbyway\nstrabo\ncategorise\npodiums\ntranslational\nprocessions\npossum\nbrando\ncapacitance\nstupidity\ndisrespectful\ndichotomy\nleahy\nmunroe\nevoked\nromulus\nvallée\nrua\naffaires\nintoxicated\nsprite\nwoodson\nrocca\nmercier\nstandoff\nphobia\nanz\nzbigniew\npayable\ntas\ngli\ncrowning\nprobabilistic\ngul\namory\ntupolev\nrobbing\nstainton\nconvincingly\nkarla\nconsejo\njagged\nstitches\nmesopotamian\nghostly\nmetallurg\noberon\nfeynman\nislamists\nsow\niz\norally\nhispano\ncontradicting\ncurlers\ndeterrent\nlozano\ncredence\nbraid\nsquarepants\nhervey\nbaa\nwannabe\nelasticity\nvisayas\ncheckers\nunquestionably\nariane\nwaka\nasad\nfireball\nquell\norne\nbubbling\nganesha\nwhispers\nadel\nmediators\nrighteousness\nretroactively\nmetabolites\nreciprocity\nlimp\njv\nshouts\nkingpin\nkongo\nstately\nkino\ngrooming\nhordaland\ndecepticon\nmarv\ncarbohydrate\nsequoia\nunwieldy\nanglian\ngq\noutwards\ndit\ngrandsons\namps\nhydropower\nhessen\njama\nrepública\nsegmented\nrenfrew\ngwalior\nmuppets\nbrat\naau\nbackers\nboggs\ncroce\ndingle\ndenominator\nquaid\nuva\nkickboxers\ntigris\nquantify\naffirming\nhy\nkendal\nchanson\nabstinence\nsalome\nmemes\nseigneur\nexcitation\nkya\
ndisrespect\nworsening\nbouchet\nponder\nsucre\nmahmood\nrants\ncharlottetown\newart\nshomali\nclem\nlytton\nrosy\nkristiansand\ntorus\nkerosene\ncpl\npcc\narcheology\nfleischer\nstrangled\nnewsroom\ninconvenience\ndavide\nbarbadian\npunishing\nprairies\nschiff\nphalanx\nrhea\nconfucianism\nrojo\nranjit\nnoi\name\nstoried\nlessen\nlynda\nlilith\nhw\nmonti\ncarlsson\nlira\nmpa\nintelligible\ngaruda\nbengaluru\nworshiped\njef\nyea\ntonbridge\nstylistically\nharriman\nstrongman\ncowardly\npsychosis\nism\nprieto\ndodo\nsangha\nsanctum\ncárdenas\nqantas\nagni\nrhyming\nconfidentiality\nramifications\nlte\ngopher\niterations\nrectify\ncurfew\nnurture\nboulton\nbooby\ncondolences\nsnapper\nkarelia\nhanseatic\nmcguinness\nslime\npeso\nodysseus\nlocomotion\nbauhaus\ncatapult\nsmh\nexcalibur\ndominik\ngags\nmackinnon\nnovello\nreductase\nalleles\nbyers\npreoccupied\nsignaled\nparaphrased\ndruid\nscant\nvisakhapatnam\nnenad\napoel\nwarburg\nspoilers\noceanographic\nlanark\nscc\naliyah\nhabana\ndecency\npals\nvanishes\netudes\ndecimated\neliezer\nxun\nappropriateness\nmitochondria\nsod\nccd\ncunard\nholst\ndonne\nrajkumar\nslapped\naligarh\nkalyan\nhonky\ntomography\napprentices\nprinz\nsymbolically\nasif\nevgeny\ngert\nmillimeter\ngrozny\nhanau\nwarlock\nsophistication\ndentists\nmelayu\nactuality\nparson\nirreplaceable\nnarrowing\ngon\nuber\nsunda\nbooksellers\nnie\nqinghai\nanh\ncatharines\ntonk\ndsl\nartois\ncatharine\nchem\nwilliamsport\ncolouring\nchp\ncirculate\nscratching\noverseer\naslan\nhatched\nferret\nathos\nubisoft\nsinners\nglynn\ncristiano\nsongbook\nexpended\nitinerant\nvedas\nseparatists\nshilling\ndoria\nmsp\nmoreira\neunice\nsupervillain\npores\naffirmation\nmorristown\nadonis\nmaritimes\nwatchtower\npristine\nyomiuri\nmasterson\ncapitalisation\ntalon\nrayo\nfingerprints\nmaimonides\nflipping\nkuan\nnaidu\n★\nvick\naccolade\nnagaland\nnx\nsensei\ntacitus\ngeneticist\nburgers\nserenity\nshih\ncategorical\ngypsum\nstingray\ndrm\naragonese\nantiochus\nharvester
\nvirgo\newan\ndeceive\nsneaky\ncaron\nbelknap\neditorship\nphelan\nwingate\npashto\nvistula\noliva\nwhipped\nbenetton\nantigens\nclooney\npolymerization\nloup\nzeke\nhaldane\nbegged\nzoltán\nmelzer\nsturt\ncarmelo\nphotographing\nsainsbury\nbonneville\nbastille\nrollin\nratchet\noedipus\nreprising\nsowerby\nrizzo\nendlessly\noverruled\nmatron\nconclusively\nwalford\nkatarina\nparvati\nseparator\noakes\nskis\neulogy\nschott\nsich\naquifer\ncate\nesters\nsugiyama\nbataan\ncuyahoga\nbraddock\ncultivating\nalhambra\nupsetting\nimpulses\nforwarding\nfortran\nrobeson\ncthulhu\nshovel\ningestion\ncambrai\nurmia\nlegalized\ndocklands\nscandalous\ndalek\npaleozoic\njodhpur\ntruthful\nnuanced\ntangier\nallie\nscania\naikido\nsverdlovsk\ncommoners\ntus\nhighgate\nfairmont\nansar\nrestraining\ndike\nkrupp\npompeii\nomits\navraham\ndepaul\nsinner\nragged\nwtf\nmonarchies\nutilitarian\nhorus\nfournier\ngeorgians\ntrad\nmots\nmusket\nhulme\nrunes\npledges\nmerrimack\ncuomo\nreinstatement\ncensure\ndartford\ntrusting\nunderparts\naloha\nkfc\nhiroyuki\nshanks\ndjango\nbessarabia\nwithholding\ndacian\nlinearly\nmeltdown\nthornhill\nlj\nrenard\noka\nplano\nroshan\ndistrito\ndougherty\ncatered\nmysql\nenrich\nsinfonia\nnederlands\nparte\nshams\ndigs\nwallonia\nalum\ninterracial\novertook\nvela\nlw\nsuture\ncassell\ncarolinas\ncumming\nlamas\nshakes\ncabernet\npalme\nsarcophagus\nwoolley\nbrann\nancients\naff\nathenians\nmatti\ntowel\nbol\nshielding\nnovara\niwata\nmagpie\nsuva\nkiowa\nassociating\nzig\npermeability\ncesena\nshoppers\nmemoirists\njäger\nshroud\npoisons\ngallatin\nqajar\nmites\nturboprop\ndisturb\ncuttings\ncornhuskers\nlachlan\nmonteiro\nchorley\nintifada\nroos\ndeflection\nheinkel\npropensity\ndejan\nabolishing\nmambo\naquarius\nwaned\nhowl\ngolds\neagerly\ngulch\ncraiova\ncns\nvigilance\nhydrographic\nunião\nilam\nunlucky\nmanoj\nconnective\nentail\nrestricts\npentecost\ncollaboratively\nsolidly\nbead\ndredging\nbickering\nsaucer\nsoler\ngusts\nimpressionism\nentente
\nboreal\nascetic\nfannie\nnominative\nharvick\nrearing\ndisseminate\nrandi\nsondheim\nminimise\nsaarland\nrolle\ncontemplation\ndinners\nfluminense\nshinjuku\ncharacterizing\nsubfamilies\nolin\nvinson\ntomahawk\nmatte\nseabed\nkinsey\nfsv\nindica\nnuke\ndrawer\npsu\ncuevas\nsadat\nfirenze\ncarbide\nwebcam\nshave\nschilling\noceanian\nconforming\nrajah\ngotland\noverhears\nsynthesize\nthong\nmica\ntransmits\narran\nbdo\nauditors\ndisregarding\nvivaldi\nsubdue\nchakraborty\nmilburn\nnutshell\nperseverance\nmesoamerica\npopov\nmolding\nwesterners\nprosthetic\nastrakhan\nproclaims\nalbury\nsrb\nboycotted\nnou\nboast\nsofa\nseagull\nbrackish\noverpass\nleviathan\nbagan\nunter\ngerrit\nupheaval\nmississippian\nunderdog\nnanking\nsarkozy\nstorting\nhemoglobin\nbake\nmodus\nsolidified\nfagan\nzis\ncorea\nairway\nrenata\nspooky\njoyful\nrarities\ndiurnal\nsvt\nalte\nhearth\nenfant\nappalachia\nlisle\nmeticulous\napra\nedmondson\nradom\nanya\nsartre\nchiesa\ndessau\nreconstructions\nblasted\ngeary\ntotality\nzonal\nwir\nlebron\nroddick\nberklee\nppv\ntarantino\nbild\nagitated\nadalbert\nabiding\nusaid\nrespondent\nwilled\ndwindled\ncorrelations\nlimoges\nlongfellow\ntriptych\nespoused\npallas\nmccracken\nseamless\nnapoca\nlevinson\nnava\nmasque\ngai\nhera\ncrankshaft\nchangsha\nmoles\namer\nbonifacio\noscillations\nconiferous\nbadajoz\nstupa\nponsonby\ncopernicus\nschindler\nihl\nscalable\npawnee\nattendances\nvases\ncoffey\ngrips\nmaja\nelie\nbattersea\nmanageable\nartistry\ndulles\noni\nellesmere\nkohler\nautomaton\nmiddlebury\ncáceres\nhae\nwaldron\nfurthest\ninvader\nentailed\nwulf\nrolando\nsuri\nliquidated\ndefection\nchalice\nbandcamp\nrephrased\njørgensen\nbharati\nniko\nschule\noxen\ncountermeasures\nprecarious\ndyslexia\nkudos\nbeltway\nferdinando\nmcdougall\nwindshield\nselina\nprešov\nappease\npontus\ncustodian\nsanborn\nesports\npurplish\nmmorpg\ncfo\nbitterness\nexistance\nbloemfontein\nbriefs\ninclusionist\ninflicting\ntalkie\ncheerleader\nbrine\nshedding\nkoo
tenay\nnapoléon\nadvancements\nhumankind\nbusinessweek\nsyphilis\nhmcs\nimprovisational\nyonge\nmicrobes\nmarsha\ncoulson\nsepulchre\nallotment\nvalenzuela\n♥\nsxsw\ntransfusion\nwares\nmunch\nvyacheslav\nsewn\njakub\ndefinately\ntrending\nheston\nawaken\ncloseness\ngruesome\nbucky\ncalif\nwatcher\nartiste\nrambo\nheng\nmadrigal\nhaw\nmaddie\nbrod\nchrysalis\ncrossfire\npetre\npedophilia\ncharlene\nmeister\nintercultural\ncommissar\nalbin\nbalzac\nasquith\nlatinos\noverriding\nmcphee\nherding\nlucid\nmuskegon\niban\neights\nfunimation\nfruitless\nrainey\nidealized\nguesses\nmarbles\nloy\nbuller\ndomini\npresse\nrapist\nwrongs\nlancia\nelongate\nsaliva\nthanking\nmatchup\nwooster\nosteopathic\nistat\nkrueger\nsoulful\nsprinkled\nsodomy\nirresistible\ngiorgi\nlatif\nemmet\ninhibitory\nmeh\nconstitutionality\nrepublicanism\nkostas\nchiral\nslows\ncartographers\nwhitechapel\nailments\nsucking\ncarina\nthaw\nsula\nbbb\nbaal\nhildesheim\nrookies\nexcepting\nbenning\ndesignates\nuntimely\nchum\ncomplexities\nhiller\ncontemplating\nagro\ncartman\nswampy\nbrilliantly\nryazan\nslugs\nlucinda\nstrachan\ncruiserweight\njewry\nrehman\nfsb\nfoyer\ntilting\nroost\ngallic\nannotation\nnoire\ncta\ndisingenuous\ndesi\ndared\nwisely\nbotha\naroma\nsabbatical\nbeaulieu\nioannis\ngenitalia\ncookery\ncombe\nintentioned\nredford\nappalled\npelvis\nwarblers\ndarul\ndipped\n®\nisaf\npancake\nnazir\nimposes\nresurrect\nencarta\ncartography\nstripper\nxc\ngad\ncrooks\nsquires\nmusashi\nmears\nchairwoman\nchromatography\nasociación\ninfuriated\nmaterialized\nwoodhouse\ngraubünden\ncompel\nplasticity\nchartres\nsupercar\nemphasise\nlutheranism\nbridgend\ntheodora\nsemiconductors\nvenous\nbookshop\naerodynamics\ncaetano\nmattress\ncentrale\nwodehouse\naffective\ngravitation\nsubstation\nvx\nmaggiore\nsergeants\nbhakti\ncommuting\noxidized\nwarheads\nethylene\nmuses\nalibi\ncrucified\noutscored\njak\nimpossibility\niau\nshrubland\nalarms\nrebounded\nbiddle\nstiffness\nlahti\nparkin\nmissa\nkriege
r\nfinitely\ndirectv\nwaterbury\nfunniest\nbouquet\ninitiator\nfez\nshipley\nstoryboard\ndiluted\nfdr\npaes\nneonatal\nfim\nchagrin\nokada\nniue\ngaylord\ncoon\ngrist\nleaping\nalarming\nvulgaris\npadova\nreconsidered\nusf\nredistributed\nbrookline\nyost\nfuturama\ncnc\nnaturalism\nunlv\ndrago\necac\nooh\ninciting\ndisruptions\nparlophone\nperi\nhewitson\nmansour\ncoursework\nmodesty\nbabes\nsensations\nkinetics\nhv\nmarathons\npetter\nsororities\nendgame\nmarseilles\nligne\nlise\nstressing\nasi\nwhore\nmackinac\nkalam\nsharapova\nbathrooms\nosceola\nrepeater\nharford\niff\nsnatch\nsterilization\nwebpages\nmcewen\n™\nelam\npickles\nauerbach\napparition\nseaton\norientalis\ncoyle\norientalist\ncoups\nkintetsu\ndobro\nreflexive\noaths\ndiscontinue\naude\nntsb\nreload\nmammalia\nluang\ningolstadt\ngunpoint\nhatchet\npierrot\nquatre\nnola\nhmv\nmidshipmen\nroundtable\nliters\nyui\nxinhua\nstig\nfacet\nmezzanine\nriordan\nnuances\nbatangas\npacing\nferrell\nhilliard\ncla\nbundeswehr\ncrowther\nlizzy\nrif\npapyri\nbogie\nleed\nmaricopa\nenlists\nclothed\ngirlfriends\nroadster\nsuomi\ndiaphragm\nphilippa\nunger\npaddock\nnha\ninks\nmisfits\nemilie\nmaneuvering\ngurkha\netched\ndebrett\nsie\nslanted\nolimpija\nwinch\nntsc\nnecrosis\nendogenous\ntraversing\ncapitalists\nshrunk\ngiang\nétoile\nskink\nloring\nmurakami\nwhipple\nlampoon\nmckinnon\nmujahideen\ncoronel\npenchant\nbridgehead\nbenevento\ninhuman\nfloodplain\ndystopian\njuggling\ncanaria\nslur\nthwart\ninject\njahre\nkush\nunwittingly\nworkout\nclientele\nsauvignon\nthi\nrubbing\npreamble\ntwigs\nsecreted\nantioquia\nmisrepresent\nborealis\ncoalitions\nredress\ngábor\nmarquez\nconforms\nreusable\noutfitted\neerie\nfreezes\nmargate\nmessier\nsoups\nlettered\npostpone\nsportsperson\nteo\npaton\nkp\nauxiliaries\nmajlis\nprintmaking\ncyr\nnicks\naffixed\ngyula\nweathering\nrelieving\nsupersonics\nproposer\nblackjack\ndepressive\nstryker\nhulu\nregains\noe\nalb\nviewable\nrecon\nfalconer\nalva\ntremendously\nsling\nmorte
\ndwindling\nromsdal\nell\narbiter\ntalmudic\njoked\nbraganza\nrecoveries\nbenzene\ncommences\nrenzo\nsadr\nqueenstown\nprobate\njem\nginsburg\nporky\nehrlich\nkenosha\nschengen\nliguria\nspeckled\nbulacan\npeloponnese\npdl\nquarried\nindustrialists\ncsm\ndann\nabandons\nnymphs\nstrapped\ncomplements\nskates\ndisguises\nonslow\nsufism\ngoulburn\nbrevard\nmarissa\nsupermarine\njimenez\nchl\nroskilde\nunionism\nervin\nsadistic\ndimitrov\nradiating\nnightfall\namalgam\nnyse\npalawan\nfuelled\nherne\ncheckpoints\nstinson\nhustler\nhelpers\nbastia\namazons\nlimbaugh\npresidente\nbosniak\nwh\ndebrecen\nbowery\npampanga\nhcl\nrouters\nhydroxy\njuanita\nroch\nprecede\ncodec\nmademoiselle\ncit\nroewer\nnoonan\ninsightful\nlynette\norville\nmortuary\nftc\ncellars\nrégime\nfrying\nroofing\nmegalithic\nleopoldo\npolitico\ncaithness\nformalism\ngauthier\nautobiographies\nvitebsk\nidentifiers\ncondescending\nnap\nnoone\ncarmelite\nmaharishi\nsequenced\ndodds\nmcarthur\nvolodymyr\nräikkönen\nspoils\nshaheed\nsydenham\nmurong\neuropaea\nservitude\nequated\nstreptomyces\nmordechai\ncytochrome\nsantoro\nrac\nperjury\nangoulême\nkao\ncommercialization\nyoussef\nbrow\nranchers\npala\nrosberg\nkowalski\ncorby\nlyne\ntiled\nchautauqua\nkirsty\nmischievous\nruining\ncocoon\nchangchun\nweser\nmattel\nesmeralda\nturismo\nazalea\nsmt\nistria\ncosimo\nstandardize\njacobus\ncytoplasm\nallege\npolygons\nplaygrounds\nspammy\ncentering\nxenon\nspraying\ncentimeter\n¼\nbowdoin\nminimized\ndeems\nwaddell\ncrafting\nmalformed\ninundated\nmonolithic\nhabitual\nrelaxing\nsever\nats\nsaki\nhostels\nlevee\noccured\nnorthwich\nschweitzer\nstoned\nislami\nexcommunication\nmendocino\nlindy\norchestre\nroxas\nconning\nzap\nmockery\nredone\ntacklers\npalakkad\nranching\nbulky\ndetectable\nunwelcome\nmedic\nalbacete\nshad\ncountable\njaeger\nauspicious\ncomplimentary\nseams\nlevante\nzeiss\nrestorations\nfolders\nmodernize\ntelstra\nindependant\nchimpanzees\nbodywork\ndoubly\nriva\nfetched\nnuncio\narcane\nmo
mbasa\nkirche\nsrinagar\nleamington\nmodulo\ncca\nilocos\ntakers\nsylvie\ndarin\nalexios\nretrieving\nsaxophones\nornithology\ncusco\nrejoining\ncastelo\nmueang\nindistinct\nlouvain\ncobain\ncev\npde\nmelchior\ncriticise\ndocs\ntoxicology\ncollegium\nblas\npellets\nprintmaker\nbeaton\nsuarez\nbarros\ndutchess\nbfi\neldridge\nchilders\nmanon\nglebe\nfrankfurter\nmotley\nwhispering\nmeera\ninnumerable\nalgernon\nspectrometer\nbraced\ngrin\nattains\nrecherche\nmisled\nfenced\narrondissements\nalluding\nsubversion\nilluminate\noutcast\ngenève\nbellas\nharrell\nsidelines\nrote\nlambs\nlaughlin\nboosters\ncrag\nhud\ncircumstantial\nseng\njh\nrik\nnist\niaea\nkyu\nbranko\nmutt\ncurses\nsahel\nmontero\nmeningitis\ngravesend\nnaturalization\ncorman\nsuharto\npane\nshowbiz\nnittany\nrepentance\ndistortions\nwhitmore\nadagio\ngatwick\nkhazar\npur\nfey\nabdulaziz\nrestructure\ncoed\nwinn\nodeon\nmetrolink\nleeway\nreferential\nlatrobe\nlapsed\nnewburgh\nruptured\ngarnier\ntowering\ngeller\nreloaded\nwandsworth\nschröder\nwattle\nworkstation\nvole\nhingis\nleanne\nferocious\ncaptors\npomeroy\ncooney\ndehydration\nplainfield\nsatoshi\nproquest\nilluminating\ndestitute\ngurus\nrazorbacks\ncheque\nfarnsworth\nadkins\ngroton\npiloting\nrealigned\ntraffickers\nmatlock\nshima\nsportive\nnepenthes\ncobras\nlangton\nobsidian\nibaraki\ntryon\nnavarra\nparametric\ncroke\neunuch\ndill\nstow\nforeclosure\nnevermind\nlowery\ncaius\nhoskins\ndona\nathanasius\ntaskforce\nfright\npolemic\nmotherhood\nlefebvre\nrampart\ndavids\nlettuce\nmicky\nbaines\nimpurities\nterrapins\naube\ntem\nmens\nhassle\nexcellency\nevangelicals\nputney\nmethodism\noysters\ndissenters\ncaesarea\nfulda\nidentically\ndalit\nforza\narchetype\nforesters\ncassettes\narousal\nsemesters\nrugs\ncontinuo\nsilenced\nhotmail\nbumpers\nenacting\nhendon\nias\ntartan\ndeathbed\nlowndes\now\nmishnah\ncountering\ncantonal\nuproar\ninsecurity\nnimrod\ndroplets\nmarianas\nuntreated\ncrests\ndeptford\nemits\nwarne\nnorthwood\nbeacons\nf
lyover\nnationalisation\nwalrus\nibf\nsetlist\nmutilated\nredshift\nglaring\nchauncey\nscriptwriter\nklub\nlefty\nsmack\ntestimonial\nbriton\npollack\nsymbolized\nminot\nmirroring\nschoolchildren\nnuneaton\nbuttresses\ntawny\nwhim\nconveyor\ncarrollton\ntypepad\ndeadlock\nastonished\ngonçalves\nhelical\nhallelujah\nlia\npkk\nberet\nvestibule\nupsets\nepitome\nbalinese\ntenured\nbattlecruiser\ngibberish\nshockwave\nwinterthur\ndispleased\ncassel\njacoby\ndistanced\ntraitors\nangelic\nmanta\ntabla\nwield\nfahd\nlakeview\nunforgettable\nscilly\nshabab\nnsdap\nmacabre\nuconn\nmoulded\ndera\nagglomeration\nmolloy\nciel\npriori\nhijo\nwilberforce\npainfully\nscion\nubs\nakkadian\nepiscopate\numbria\ncristal\nslid\nbeet\nphan\nprogressives\nbiopic\nfells\ndalí\nlogarithm\ntumultuous\ninhalation\nmindy\nacquittal\ncentenarians\nwaved\ntalisman\nrammed\nkabuki\nrosewood\nfrei\nhamasaki\nreverses\nholger\nincandescent\npoignant\nrcmp\nogle\ndiocletian\nbrasileira\nspartanburg\nchak\njackman\nunfolding\naleksei\nazur\nwaldemar\nreckoned\ngoff\nunaffiliated\nwolfpack\nhatcher\nregistrations\ndivination\nsolicit\nadmonished\nalters\nrepatriated\ndisuse\nayumi\nrodman\nlindberg\nhornby\nroca\npadang\nsoaps\nsylhet\nbetis\nsulfuric\ncadmium\nimpart\nmolds\ntreehouse\nsubsystem\nbottling\nsaar\nlipstick\nprudential\ncayuga\njillian\nvagrant\nbipartisan\nberlioz\njails\ninexperience\nblois\ndukla\nsolis\nciara\neloquent\nbakker\nabt\nparietal\npunctuated\ndanbury\nlows\nsplicing\nimpatient\ndido\nbesançon\nmartine\nwhitlock\niaf\nfondness\nroxanne\nthea\nrestraints\nremedial\nwhedon\nchristos\nnîmes\ntenths\nhepatic\nfoto\npassionately\ncucumber\nwaldeck\nbicolor\nicing\nhelios\nguang\nredeem\narboreal\nredgrave\nhomeworld\nbadr\ncda\nsalvia\nlamborghini\nzhuang\nration\nhawes\nshakur\nfingerprint\nizumi\njeune\nrestrain\nvai\nhin\ngeopolitical\nmorelia\nceredigion\nrevoke\ngeodetic\ncloned\nlula\ncomme\nmaniac\nfeatherstone\nparsing\nlingerie\nwald\nmiklós\nbelligerent\ncandace\nst
umble\naltrincham\nbeige\nancien\ndalmatian\nnocturne\nbelcher\nraining\niota\nleds\nbaroda\nafghans\ninspires\nhippocampus\nimmortals\nalmagro\nlavinia\nkhalsa\nrutledge\nradiated\ndisproportionately\nbionic\nhafiz\nearths\ngermán\nhuntsman\nshing\nspaniard\nvisser\ninclusions\ndol\nsørensen\nsupercomputer\nludovico\nillogical\nretarded\npfa\npundits\nmordecai\nunproven\nantisubmarine\nleconte\nremodelled\nbdsm\ncité\nnewry\ncamberwell\nlightfoot\noutburst\nlibertarians\nprecincts\nflavia\nsnowboard\ncorrelates\nresolute\nasynchronous\nplastered\ndiscriminated\nansi\nkachin\nzealous\npenney\nnegotiator\nharuka\nneve\naretha\nince\nalgonquian\nsandford\nkingsbury\nemigrant\nmundial\nrepositories\nsilky\ntribeca\ndisorganized\nhotly\nthessaly\nlegalization\nary\nsprayed\nhumorist\nskateboard\nstains\nrosette\nbimonthly\ndietz\nmortensen\ntiki\nduan\ninjections\ntron\nmediacorp\nnegra\nresistor\nmarimba\nchimpanzee\ncapra\nsomatic\nsardinian\nmasturbation\nmandi\nputt\nantonin\nsynths\nrasmus\ndeakin\ncorvus\nconfiscation\nmorel\nwanganui\nvardar\nwec\nbahamian\ndaybreak\npeloton\ncrispin\nundetermined\nabv\ndieppe\nalberti\nchauffeur\nssh\nslo\njoliet\npickle\nregalia\nschrader\nradium\nintimidated\nnigra\nsteamboats\nsailboat\nbrt\ndisprove\nvives\npitbull\nmgr\nbhosle\nudine\nsedition\nbachmann\nbacchus\nroussillon\nocampo\nharwich\nswabia\nmindless\nfukui\ncollusion\nkarlheinz\ntrofeo\nhattie\nwrecks\nhuerta\ninjecting\nsade\nrework\ntaoism\nmartinsville\nredeemed\nphilosophie\nunworthy\nhowling\nosamu\nbrecon\nduvall\nutensils\ninfield\nsensual\ntyneside\ndespised\nbhattacharya\ntaiping\nmanus\ntaras\nbwf\npondicherry\nlinkin\nhighbury\nabreu\ngulag\npollination\nsforza\nkarzai\ngarrisoned\nunclassified\npave\nhaha\ndarnell\ndionne\ncris\nclarissa\ndulce\nllywelyn\nalbino\nosvaldo\nssl\ninga\nbighorn\nfeu\nselassie\nfederalism\nknudsen\nbrevity\nintruders\nlenape\ndoreen\noleksandr\nperpetuate\nghostbusters\nsangam\nxr\nnicer\ntouchstone\nparamedic\nbuda\nhoi\nna
rasimha\nsummarise\nsoliciting\nfora\nnewbery\nsingly\nhigashi\nugc\nwg\ndax\nphenomenology\nrudyard\nadmirers\nbfa\nradiocarbon\nnunataks\nhone\nrebranding\npakistanis\nconverges\nmusharraf\ntumblr\nhawkeyes\nnah\npears\nofficiating\nvaldemar\nanglicanism\nbjørn\nhoming\nkol\nukip\nnoc\ngretzky\njacopo\nconfuses\nbhavan\nlorca\nsolihull\nadv\nornamented\nponta\netymological\ntorchwood\nenlistment\nthresholds\nzaman\njahn\nclarifies\nrigor\nfirewood\nmarín\ngorica\nparenthetical\noriente\nambedkar\nabbé\nyourselves\nsephardi\nennio\nglaciation\npapuan\ndassault\noccidentalis\naerobatic\ncounterculture\ngdynia\nrationing\ndivya\nmadhu\nacrobatic\nkeita\ndecor\ngambino\ntic\nnir\ncordoba\nindividuality\nsqueak\nchariots\nbarak\nminoru\nhalting\nolimpico\nmacho\nhybridization\nlun\nbrazzaville\nacknowledgment\nmaghreb\npharrell\nbyrds\nkirkus\nedson\nlaila\ntiananmen\nlmp\ntweaking\ncin\nsecularism\ntaxable\ndefy\ndrawbacks\nhsieh\npituitary\nnorwegians\nbracelets\nhoare\noxidase\nquarks\nethnology\nhamstring\nplugins\npositional\nker\npoitou\ncapitalised\nentrant\nsayers\nnoh\nposey\nshoshone\nfrantic\ngyeonggi\npeebles\npaloma\nmoll\nneri\nhamster\ngalena\nderegulation\nnecessities\nbiopsy\nspiritually\nagassi\nargentino\nbullion\ndrawback\nbaring\nmug\nlash\naddington\nscanners\ndistort\nchristiansen\nthorium\nmolière\nreappear\ngisborne\nmayr\nwallingford\nanalysing\ncheetahs\nrefitted\nwada\neldorado\ngabby\nfaro\nmirko\nabuja\ndravida\ntrimble\nbudge\nchastity\ncenozoic\ntugs\nholby\nvandalising\nlivin\npurview\nasuka\nintermodal\nmanawatu\nchivalry\nvendée\nfedora\nchişinău\nhauptbahnhof\nmortem\nrabbitohs\nkreuz\nsantorum\nglenwood\narbs\nfertilizers\nexalted\npoaching\ncopious\nfortaleza\nsyfy\ncelje\nsupermodel\nshastri\ncausation\nindentured\nkarthik\nahern\nmiyamoto\ngranular\nfamagusta\ncoolant\nbaronies\nkot\nnicolaus\nghat\nsauces\nvaux\nbernoulli\nclutches\netchings\nripping\nbligh\nhusain\nbreckinridge\ngaby\nxander\nmetallurgical\npadded\nrioters\ninc
ursion\nseashore\ncosmonaut\nrhonda\ncoy\ncowper\nglobular\nsummoning\nmumford\nhales\nmilitiamen\nmenachem\nuncensored\noud\nparakeet\nseaweed\npredominately\nstarfish\nparadigms\npacifica\nlakh\nmauritian\nbeebe\ncyberspace\ningalls\nwhittington\njamison\ntaichung\nsalons\nbran\nssc\nconceptually\nenlisting\nmonsanto\ncheering\nmcmurdo\ncuisines\ncmc\nparagon\nmartí\noldsmobile\nbrookes\ninstrumentals\nstipulation\ncharing\nyeltsin\ndiario\ndnipro\ncovalent\nnutcracker\natr\npasco\nschulze\nbridal\nores\nunderdeveloped\nheadquarter\nlibro\nsto\nmultiverse\nclap\nsqn\nhermione\ndeduce\nzayed\ndonbass\nbrunt\nnong\nchua\nslashdot\nbamako\nbueno\nfresnel\nmottled\nsolace\nberthold\nsturgis\ncaravaggio\nregularity\neon\nselectivity\nlifecycle\nenthroned\nspinners\nswallowing\ndevolution\nsdn\nstonehenge\nmountbatten\npinkish\nrenoir\nrowed\ncigars\nsituational\naddicts\ntweets\nsds\noeste\nkearny\nbeauregard\nvisualize\nreplays\nradioactivity\nlittlefield\nwanders\nnewsday\ndesegregation\ntribesmen\ngorillas\nlotto\nlovejoy\nfrida\napocryphal\ninnocents\nindented\naffectionate\nitt\nmidwife\nringer\nscr\nargumentation\nlse\nathenaeum\nabul\ncamilo\noakville\nsafeguards\nmicroprocessors\nextradited\nafricana\njaponica\nmismanagement\ngulliver\nwarmly\ncote\nsitter\nsextet\nshowroom\nsaville\nturpin\ntrucking\nzionists\nrami\nmenstrual\ndogmatic\npima\ndelicacy\nnapalm\ntorment\nsawtooth\ndeceived\nsubduction\nappomattox\ntabasco\nghar\npoised\ncontrived\nmethylation\nsqueezed\njuventud\npagans\ndepressing\nfactbook\ndac\nhotline\ncreditor\ncraftsmanship\ngtp\nsalina\ngalleys\nguidebook\nmimics\nistituto\nbrothels\ncontours\nstandish\naltoona\nparas\nyakovlev\nrazak\nmangroves\nrotting\nthiago\ncq\nmargherita\ncaucuses\nrar\nkiosk\nrepayment\ncapillary\nchristiane\nkemper\ndancehall\nundisturbed\nfundamentalism\ngunter\nerm\nbushehr\nrascal\nfeinstein\nstaining\nnotched\nkamchatka\ncanoeist\nsufficiency\nalcatraz\nbilal\nstettin\nsimulators\napproximated\nbagley\nlads\n
brantford\nlubin\ncomprehensively\nueda\ncarthaginian\nanchoring\nfeeders\ninstrumentalists\ntriomphe\ngalle\nzim\npenobscot\nconcurrence\nhitchhiker\nkula\nmodernised\nmyer\nrepressive\nmadge\nbohol\nnorske\nexclaim\ngoldsmiths\nhummel\ncarlist\ndarlene\nfateh\nindiscriminately\nronny\njonson\nliberec\nmarten\nerotica\nczechs\noctane\nleaflet\nalbatros\nbaz\npinkerton\nheisenberg\nbloodline\nscrimmage\nhufnagel\nloo\nfils\ndurch\nsata\nafa\nconwy\nornithologists\nconifer\ngaiman\nsouthgate\nikeda\nraith\niwa\ngorham\ntensile\nseti\nrailings\ncolonna\ngreaves\npate\nregenerate\nstuyvesant\nbannister\nnorthridge\nmonrovia\nbrun\nmunson\ndummies\natlantique\nmavis\nseer\nbrough\nbartók\ndci\nglaze\nzahra\nfourths\nmailbox\nsilo\ndetonate\nhula\npatricio\nkev\ntalib\nachaemenid\ndilip\nimre\nnostra\nacidity\naec\nplating\nmahayana\nchaudhry\nmobilize\ncelta\nandante\nschooled\ncaterina\ndresser\nnene\nhala\nstigmata\ntzu\nrailcars\nsuperpowers\nframpton\npardubice\ndefying\ntweeted\nunderhill\ncoulthard\nairman\nheuristic\npersonification\nmasson\nshangri\neyebrows\nbarque\ngamblers\npreakness\nshabaab\nmahan\niridescent\nhervé\ncommoner\nsceptical\ndavos\nmeteorites\nbanque\nbatten\nsaif\nshaving\nbilliard\nfleury\nryde\ncordial\naimee\nvasile\nkelowna\nsalvageable\nisraelite\ndarn\ngiorgos\nfoggy\nabstracting\nautobahn\nsprints\ntms\ndetachable\naccommodates\nilo\nwildfires\nmckean\ngillies\nafoul\nsukarno\nharkness\ncustomize\nborgia\nreliever\npvc\nfleeting\nephesus\nproletarian\ngrasping\nelectrician\nmineralogy\ntallied\namstrad\nstances\npatio\nmeditations\nmts\ngoodwood\ntürk\nosce\nworshippers\nsars\nthrillers\neukaryotes\nhoff\nramparts\nnontrivial\nferrero\nbaez\nwacky\nhaddad\nsubjectivity\nhellas\nsauer\neicher\nrefund\nwindhoek\nchaldean\nretires\nhossain\nratu\nportraiture\nevangelism\noveruse\nronson\nafield\nmanuela\nwolfson\nbrompton\nkama\ndemarcation\nmeghalaya\nurea\nelicit\nfunicular\ngunma\nrisking\npravda\nmouthpiece\ncaesars\nsarmiento\npryce\n
imogen\nduque\nmillimetres\nmetaphorical\nthankfully\nonus\ngrammars\njuror\novergrown\ncausality\neurozone\nascoli\ncation\nchaves\nisolating\nattenborough\ntrampoline\nwaddington\nscriptural\nevangelists\nnumismatic\nwrt\nsolomons\nwop\nwests\nploy\nspooner\nviru\ncrappy\nabbotsford\nsacha\nhedley\ntiraspol\naloe\nfrankel\ncielo\npreis\nratify\nzoologists\nleninist\npedantic\ndiagnoses\nostrich\ngermaine\noutcrop\nspleen\nrigs\nparrott\ndismayed\nrendell\npanelist\nreba\ngleeson\nmeasles\nconceivably\ncarols\nsá\nwayside\nshingles\nbailout\nrefraction\nsheva\nstetson\nunqualified\nfiend\nfürth\nstarved\nautry\nsubsumed\nrevamp\noccupiers\nboasting\npropagating\nconfessing\nwingfield\nspacex\nticketing\nstringed\ntiempo\nrattlers\naisha\ncif\nclipped\nflorent\nbuoy\npikes\nlayla\nkamloops\ncomforts\nmandeville\npsychologically\nembellished\nmull\nparamedics\ncetera\ncotter\ncoker\nragnar\nudaipur\npolygonal\nreshuffle\nbattlecruisers\nkazhagam\nsasanian\npharmacists\ncollie\nange\nsenhora\nghazal\ntrnava\nparticulars\nmanhunt\nosa\nlivy\nflorets\niwate\nfundación\nurgently\nfosse\njeb\nhomburg\nvacations\ntithe\nsalta\nrossetti\npolyhedron\nlaity\nshelburne\ntisch\nfille\nchul\ntilden\nanemone\njudeo\npinochet\nmee\nloophole\nintervenes\nump\ndeletionists\nlassen\ndossier\nfeuding\nshearing\nkutch\ngeddes\nflirting\ncamillo\nroda\npawns\nevokes\nhamill\nlamented\nevaded\ncormorant\ntremblay\nbarrows\narbroath\nkagan\nbbr\nrickey\nplumber\nhoods\nintersex\nrah\nrowdy\nmulticulturalism\nsortie\nretardation\ncanvases\nastaire\ncafés\nsavile\nborja\ntobruk\ncolorless\nregimen\nemptying\nhagerstown\nbillboards\ngeophysics\ntrento\nantiquaries\nafs\nellipse\nelwood\npancreas\nsanz\nrégiment\nhopi\nopportunistic\nhayek\namulet\nambivalent\nnpp\nprek\nbelgians\ntenement\ningested\nmaidens\npixie\nplatz\nschaumburg\nkaoru\nperpignan\nlegume\nsooty\ndmc\nlovelace\nelon\ngödel\ntatra\nfavourably\ncir\ndunning\nstewie\nalvaro\nducts\nscratches\nlindgren\nvern\npowerplant\ndeli
neated\nmuerte\nnorah\nmongoose\nsik\npompidou\nextraneous\nnurturing\nbursting\nfastball\nocular\nbalconies\nfarnese\nfeathered\nsiddeley\nbretagne\nlichens\nhydroxyl\nconspired\nhammered\nhealthier\nbmc\nlogie\npublicised\nfieldwork\ncharacteristically\npars\nagustin\nexec\noverwhelm\nreunites\nprefects\nkievan\nstudded\ncamus\nnuno\nharland\nparalleling\ningham\neleonora\nconscripted\nslugger\nstabilizing\nbrushed\ntriumphs\nantiaircraft\ndisclosing\nbentham\nlobed\nboilerplate\ninterplanetary\nmomentarily\nkanawha\nevocative\ncaricatures\nshmuel\nturkestan\nyeoman\ntitleholder\nsnipers\ndignified\nmeager\ndou\nembodies\nautomata\nyarrow\nfermat\ncapricorn\nexcesses\nlatterly\nfieldhouse\nhypocritical\ninhabitant\nbakers\nlawns\nspeculating\nclipping\noshkosh\ncopland\nshibuya\nhaat\ndecapitated\nrung\nborrows\nstrikingly\nfawn\ngrodno\nnomen\nnuova\nflurry\naggregated\nstott\ncubans\nfuzhou\nrodin\ndoubs\nsuing\nbearcats\nharedi\nafrique\nfait\naso\nskidmore\noverran\ndisintegrated\nyoke\nreine\ntossing\nslider\nhaden\naleksandra\numeå\ntopps\nspinoza\nota\ncapuchin\nmetamorphic\nrapp\ndornier\ncaldas\ndiscernible\nscalp\nwilt\nexclamation\nfreemason\ncourant\nzoroastrian\nverkhovna\nscourge\nhaller\nimpacting\ntioga\nvirginian\nimpressing\nsayed\nsèvres\nhijackers\nshou\noerlikon\njuba\nagreeable\nshortwave\nplantings\nspitting\nscrambled\nharmonics\nadministrated\ncampsite\nretraction\nknack\npes\nbathtub\nmilitaire\nchesterton\nnra\njammed\nizzy\nshatner\nqiu\nalms\ngallimard\nhump\ncopyrightable\nuribe\npolytechnique\nperes\ngeneralizations\nfragrant\nbarnaby\nvl\npcb\nmccarty\nkunming\nbookstores\ncongested\ndepository\npilkington\nconsulates\nmew\noem\nmaranhão\nrequisitioned\nrajeev\nbreastfeeding\novercrowded\nvapour\nbobbie\nbhd\nhebrews\nrecounting\noulu\nbrion\npundit\npunts\njett\nsloth\ntroubadour\nholographic\nresilient\njaan\nimpresario\naltenburg\ndorn\namigos\nnetscape\nbossa\ndui\nveritas\nphysiologist\nargon\ncomplying\ngirder\nvolker\nvalli\n
commandment\nnicaea\nmite\nphotojournalist\ndahlia\nbeeching\nshimon\nweighting\negalitarian\nnagai\nmotherland\ngilgit\nfigs\npiedras\npathé\nmajorly\nuzbekistani\nwoodwind\nbebe\ndeference\nparenthesis\ndelimitation\ntutu\nloudspeaker\norsini\nwayback\nmarshy\nkwong\ngeneviève\nemanuele\nvladikavkaz\nmynetworktv\nquattro\ntangerine\ngelderland\npiss\nadvisable\nwinnebago\ndeceit\nsuccinct\ndissertations\ncuritiba\nsalih\nmonarchist\nreappointed\nquickest\naffidavit\ncordova\ntek\nrogues\nnis\nabound\namputation\nelkins\nhagar\nheretics\ncrowell\nratna\npalatal\nstumbles\ndamm\nmelaleuca\nhsc\nlowther\nadventists\nmelanoma\ncrump\nlinkages\ncensoring\nbight\nsilvers\nvaldivia\npotency\nfaisalabad\ndiddy\naragón\nudo\ndionysus\ndello\nfollowup\nreflexes\neased\nappel\nafrikaner\ndiscourages\nplankton\nmangeshkar\nendothelial\npaleocene\nmuseu\nmystique\nlufthansa\nimpersonating\nkatharina\nloudon\ncnr\ncrabtree\nmawson\npreludes\npovs\nbraintree\nsimulating\novertaking\nmclachlan\nmelts\nsnes\ncomin\nsupercharged\nnossa\nstifle\nfeline\nfederalists\nmythic\nswordsman\nmessrs\nspangled\nifd\nacceptor\ntuner\nayutthaya\nhoe\nsteph\nbunbury\nohrid\ngillard\nkohn\nfluency\ngalina\nführer\nwy\npalacios\ndissociation\nnadi\ngotti\nstocking\nastrologer\ndimitris\nsarthe\nnigger\nadrenal\nvivien\nchipping\nexperiential\nantivirus\naut\nestudios\ndetainee\nadheres\nbasset\nbau\ncouncilors\nmariam\ncondoms\nstarry\nabbeville\npresuming\nmot\nwarmest\nonyx\nsymonds\nneuro\ncalamity\nglobes\nsunnyside\nnabi\nfuturist\nintensify\nbabble\namador\nfootballing\npreposition\npenalized\ndinghy\nbras\nvolk\nmicrosystems\nreagents\nbusted\nlibrettist\nenvelopes\npaf\nucd\norca\narian\nchesney\ndrs\nsecurely\nhumanists\nflavius\nholme\nmedea\nfausto\nironi\nphilately\ntampering\nlabourer\nprofane\nherons\nrapport\nmayne\nnationalised\nbarnstable\ncallaway\ndisplacing\nverdean\ndefied\nwallabies\nposen\nbromide\nkristine\ngnostic\nacetic\ndeflected\ncarrasco\nclarinets\nkz\nbannon\nvisco
us\nunoccupied\nairwaves\nfilth\nkorda\nmailer\ngenovese\nquilt\ntitian\ntrish\nnothin\nligaments\nkeiko\nbakhtiari\nparlour\nimporter\npharaohs\npittsfield\nmonogram\npreeminent\nbattlefields\nsupergirl\ncongratulated\npartitioning\nreorganize\ncrm\nzwolle\nlipids\nvitale\nrebuffed\npredeceased\nsoar\nurbanized\ncygnus\ncolumbine\nseedlings\nvientiane\nreedy\nbirdie\nrigidity\nfurnish\nvostok\ncsc\nogilvie\npounders\ndownright\ncatalyzed\nendo\npelletier\ndonato\nrelentlessly\nmonogatari\nlineups\nnabokov\nlithography\ndaejeon\noperationally\nculminates\nuterine\ncortina\nela\nbetrothed\npéter\nhsv\nassuring\ncahn\ninfanta\nidioms\nenmity\nrotc\ncustomization\nupriver\npropane\nwoodcock\nrenditions\ndelusional\nlajos\nhirise\nsadiq\nrotax\nargento\nklondike\nhighlander\nbovine\nlbf\nramblers\nautoroute\nperera\ntopper\nbrandi\nwiping\nheerenveen\nangelou\nsangh\ntanning\nrepechage\nhabsburgs\ninsure\nstaffers\nrfid\ntrapper\nsura\nstagnation\nstrengthens\nregio\npuig\nmaier\npena\nsandown\nmanoeuvre\nobliquely\ndepeche\nslaying\nrethink\nhilaire\ndykes\nrachmaninoff\ndaria\ntainan\njafar\ndoubtless\nfla\ncarmona\ncached\nslurs\ncede\nmegadeth\nrochefort\ncreationist\npussycat\ninterurban\nallium\ndilation\nextremadura\nkurosawa\nflakes\nargonne\nentertainments\nmolybdenum\nroleplaying\nchildress\nhombre\ntelepathic\nlaughable\nbourg\nvinton\nfreitas\nwields\nffc\nvid\naltona\nguaranteeing\nmonkees\naleksey\npartake\nsharper\ncardigan\ntrotskyist\ninjures\nbia\ndatuk\nvedder\nshukla\namal\nthatched\noriole\ninfinitive\nbandstand\nfontainebleau\nepiscopalian\nmallard\nbusters\nrubles\ndisbelief\ngoldfish\npsc\ncornwell\nmiramar\nintoxication\njeter\nbotched\nsérgio\nregius\nisobel\namara\nwurlitzer\npärnu\njunius\njab\ndeliberations\ndyck\nsiri\nhungerford\nharmonium\ndifferentiating\ncerberus\nspillway\nmpc\nrhoda\nbons\nmingus\nseitz\nderrida\ndillinger\nrediff\ncmos\ntactile\nwitherspoon\nunattended\njac\nbackpack\nptolemaic\nkincaid\nwgn\ngoalkeeping\nwink\nhinge
s\nramesses\nmasking\nsurinamese\nfurthered\nshocker\nsnapping\nchipmunks\nect\noutta\nvalparaíso\nmonologues\nencroachment\namitabh\nabsurdity\nsmc\nitinerary\nrisked\ncarsten\nexoplanets\nshahi\ntrilobites\ndesoto\nwerke\nverifies\njulián\npulau\nkingman\ndissimilar\nlagrangian\nincapacitated\nbigotry\nsills\nslams\naliyev\nintonation\nstavropol\nvirgins\nspiro\npretenders\npaweł\ntana\npissed\nsuspensions\nkebangsaan\nblueberry\nconverters\nmatsui\nrogaland\nblob\nhodgkin\ncpt\ncaches\ngunfight\nmarymount\ncleaners\nproteus\nspectacles\npremiums\nkestrel\ntec\nparva\nblumenthal\nredd\nsandler\ntermini\nlowly\nturnovers\nkwang\ncrompton\nsignified\nstratigraphy\ngoulding\nsuperfund\nexpanse\nbrito\nheaps\nhomeowner\npurchaser\ndrip\nhildebrand\nserialization\ndelusions\nlegge\nprato\ntibor\nextraliga\nshameless\nqe\ncheapest\nshiga\ntireless\ntriumphal\nroadblock\nannoy\nbottleneck\ncompress\ncontagious\ndinesh\nemulated\nagios\ntrombonist\nctrl\nlibertarianism\nbrd\nbelievable\nmakhachkala\nwhistles\nlumen\nkalisz\nfeliciano\ncatalogued\nguarani\nnacht\nangst\ntsing\neverytime\ntff\nrestrooms\nwaning\nprecepts\nasterix\ntelecasts\nquestioner\nwhittle\nexecutes\nbondi\nsprites\npuducherry\nbicameral\nmusicology\npineda\nninh\nmayotte\nnoblewoman\nundulating\nccp\nhyperbole\nadp\npola\nlivejournal\nrosales\ncomplicity\nperceives\nsafest\nlimpopo\nconvair\nforgets\nbeal\nprussians\nbaumann\njahangir\ncarrots\ntandy\ncaddo\nsewerage\nbristow\nathlone\nstadia\nnieto\nsten\nlogarithmic\nlotta\nwarringah\naymara\ndela\njonsson\nkaluga\ninnovators\nflawless\ntohoku\nbunk\navondale\ncanvassed\nreggaeton\nemden\nlocates\nnablus\nmito\nweakens\nlithuanians\norgasm\npredate\nwintering\nbumped\nquicksilver\nvenetians\ncarbonyl\ncourteous\nshp\nangkor\nharps\nmillionaires\nbetrays\nexonerated\ngenitals\ngötaland\ngreig\nkrypton\nuntenable\ntilburg\nfuses\ntinted\npangasinan\nsagaing\nforel\nvibes\nheadingley\ncalibrated\ninheriting\naeneas\ncinder\nmorph\nkenilworth\ntête\nfum
es\nstepson\nprenatal\ncapel\naust\nborland\ntanager\ndnieper\nadrift\ncolm\nhinting\ndiem\nmisérables\nmotoring\npolygram\nheroines\nspaulding\nbhagat\ngautam\ncorso\nwhistleblower\nintermedia\nstrikeforce\npaulsen\nbeechcraft\nconservationist\npups\nworkhouse\nshikoku\ngrub\nrichey\nlyra\nleftover\ngraces\ndisconnect\ntrafficked\ncorsican\nspezia\nmerck\ntableau\nknuckle\ndandenong\ndisraeli\ntheoretic\nbalaji\nbischoff\nfuente\npersecutions\nsunbird\ndeterminism\nengulfed\npohl\nchadian\ncolliding\nyaw\npenske\ncoaxial\nkoko\nswabian\ncystic\nhanshin\ntoads\nmandala\nlasse\nslapping\nwielded\nlian\ntrombones\nmedan\nellery\nvirtus\nmaura\nendeavours\nlookup\nmoa\nmilder\nliberating\ndefencemen\nalcalde\nformality\nmarchand\nshari\nborisov\nnovellas\ntripp\nchevaliers\nsevier\nguglielmo\nueno\nunderlined\nustad\nsweetness\norientations\ngracious\npdb\nhounding\npythagorean\ntease\nbosniaks\nheyman\nmuda\nhyphenated\ninsiders\nhackensack\nstaggering\nophelia\ncoiled\ncarve\nnaa\nmanure\nresists\ngeorgy\nadelphi\nmodulated\nlanham\nsportsnet\nfurtado\nantiquary\naustere\nrucker\nreactivity\nhuesca\nramiro\noverheard\nedm\ngrinnell\nfronting\ndugan\nairships\nvaz\nwicca\nwaterproof\nmsx\nurawa\ncoadjutor\ncosgrove\njna\ntunneling\nlockdown\nmisaki\nlillehammer\nannuity\nedema\nnucleotides\njayhawks\nwellcome\nsuspecting\ndemonstrator\ntiff\nmockingbird\naca\nlugers\nrasputin\ngabriella\nluise\noverturning\nneurologist\ncrs\nnco\nael\njabal\nvigilant\nbeauties\nkwh\ncrusher\ngerda\ndiscord\napproximations\nsanremo\nintuitively\nrealisation\nnanoparticles\njacek\nfolkestone\nbibles\nsegundo\ncelibacy\nchivas\nwerewolves\nslasher\nroald\nvulnerabilities\nboredom\nmário\nfane\ncel\nthx\ncadres\nbyrnes\nbogies\nflathead\nnunes\nkda\nexegesis\nslipper\nformulating\nwidget\npowerpc\nmandible\nclarksville\ndodgy\ntibetans\nraping\nherein\ngutted\nbjörkman\nrepost\nsplendor\nbolted\nsotheby\ngazeta\naccrediting\npaypal\nnegroes\nkarina\nskits\nlohan\ndominick\njoystick\ndispe
nsary\ntull\ncrickets\nposits\nshearwater\nmadero\nils\nrenewing\nleclerc\nmossad\ngisela\noctavia\nhierarchies\nochre\npolyphonic\nmiraculously\ntriathletes\nlner\ncellphone\napo\nhayat\nseva\nindividualism\nprovisionally\nkeighley\npampa\nkoper\nsuction\n‑\nmetroid\ndarko\ncarrara\npocock\nakers\nsdk\nply\nloci\ngravely\ncarpentry\nreprimanded\ncheaply\ntimid\nrepaid\naffinis\nhetman\nzacatecas\nmorgue\nkwai\ncheerleaders\nbedding\nconrail\ntectonics\nbossier\nprods\nelms\nsymbiotic\nsagas\nabortive\nmoustache\nethno\nléger\nwandered\nenvironmentalism\ncardoso\nbhaskar\nevers\nmobil\njsc\nrajputs\ncaprice\nnestlé\ncuny\npeasantry\nherder\nsati\nsexiest\nflorin\nincite\nfuzz\nkhimki\nfriary\nmalloy\ncarillon\nxa\ninterment\nequalled\nbarbershop\nhaber\ntobu\ndma\nvodacom\nsignalled\ntotalled\nwatered\ndewar\nelba\nkauffman\neuphoria\ncrypto\nstyx\nairspeed\nwolfowitz\nrisc\nsensibilities\nacr\nmfk\npogrom\nalerting\nayub\ncloverleaf\nhazara\nchromosomal\ngingrich\narezzo\neri\nhippolyte\nfeeble\nsmit\nhowland\niced\nfielders\nchemin\novary\nvd\norozco\nnorthcote\nsheaf\nriddled\ndiogo\nhuguenots\nstabilizer\nblundell\nembezzlement\nmss\norenburg\nhokies\nkharkov\nbroadening\nwinless\nwilfried\naurangzeb\nparedes\ndevotes\nbabelfish\nmotorised\ndeflect\nherron\nconsented\nmetering\ntsu\ncoalfield\nbypasses\nassessor\nteague\nfluorine\nmerited\nsportswriter\nlinnean\nlanders\nsouthwell\nmga\nhofstra\nschaeffer\ncomputations\nberkley\nwatersheds\nniro\nandalus\ncru\nverso\nsubtypes\nato\ngris\nfabrice\nrife\nhoisted\npuritans\nyaakov\nisuzu\nldp\nbelfort\nbirdman\nperish\nfreer\nleek\nwillson\nerhard\nbridger\ncondom\nkojima\ngrevillea\nburglar\namenable\nuvf\ntibia\ntransferable\nbroth\nexecutor\nannihilated\ntighten\nnorthumbrian\ncarcass\nwestpac\nmidas\nsangre\nshotguns\ngroot\nochs\nlateran\ncomplicate\ndrysdale\ncosa\nspiff\nvevo\nduchamp\nsympathizers\ntiara\nspiked\nrsn\ndewsbury\nfishers\ncompletions\nborrowers\nmlc\nunconvinced\nrodger\nlegation\ngimp\nbuch\
nohm\nrawls\nwilkie\nbackstreet\nkikuchi\ndeafness\nterr\nludvig\nlilies\nrecharge\nspc\neisenach\npct\nminamoto\nsalix\nsyndromes\nabbaye\nmaryborough\nnarragansett\nbolstered\nmtdna\nmorbid\nkenseth\nangelique\nkernels\nouachita\nbuford\nthun\nhannes\nburch\nblazer\nsausages\nbolding\nandrej\nsmoker\nmelodifestivalen\nindemnity\nurbanism\nhargreaves\nvitoria\nnomura\nkeio\nruc\nkhel\nnid\ncostas\nammo\ntenn\nmughals\npetrograd\nmpg\nvga\ngalois\ncsp\nallard\nredhawks\ngravestone\nteleport\nbih\ngluten\nnumb\nashlar\nsoutherners\ninterferes\nomani\nutilising\nindulgence\nridgeway\nschwab\nsaddened\ndirectorship\ncalculates\ntheosophical\nrosenbaum\npolyethylene\nfinistère\nmetalurh\ntet\ndecipher\nisabela\nabort\nindebted\nkinsman\npharmacological\njagannath\ntps\necu\nzielona\nhennepin\narduous\nbayley\nnonzero\nultrasonic\nnemzeti\nmyocardial\nfacs\ncelle\naab\nsatu\npekka\nlista\namo\nscribes\nhugues\ntenacious\nhuns\ninfernal\njourneyman\npall\nruddy\nuncertainties\nbonita\nafricanus\ncentrum\nmistreatment\nmysterio\nimmaterial\nmdc\nthaksin\nruud\ninjure\nrena\nshareholding\nhavel\nanaconda\ninterns\nmuted\nfleshed\ntrang\nivano\nismael\nanjali\nleonean\ncordero\naspirin\nladakh\nthéodore\nsunbury\nkirkby\nparanaense\nodia\nawami\nslapstick\nlapd\nretaliatory\ntipu\ncaching\nnama\nruslan\namphetamine\ncashier\nsmacks\ncalves\nsequentially\npalmetto\npoetical\nusp\nmechanised\nmyotis\nabsentia\ncalvo\nauger\ntripled\ndroughts\nsaas\ndataset\nniklas\nneto\nboas\nidealistic\ngravy\ntantric\nanalyzer\nkalinga\neuphemism\nconspiracies\nsubplot\ndeen\ncrowdfunding\ngenealogies\niguana\nwrench\ndisbandment\nmcadams\nintegrative\ngopi\nuncovering\nnaka\ngaba\nnavigators\nscrubs\ndelinquent\neastside\nkilpatrick\neisenberg\nstefanie\ngrocer\ntfa\noireachtas\ncampsites\niterative\nmda\nlegrand\ngiza\nsatya\ncountrymen\nspires\nramone\nhalliwell\nhysterical\npopularised\ntragically\ndía\nencyclical\nherat\nsathya\ntorrey\ntaming\necologist\nmais\nbrie\nkittens\nreservist
s\nmultiplicity\nusefully\nhillcrest\npurposefully\nramachandran\nukr\nsatisfactorily\nyekaterinburg\nrove\nfarid\nmirnyi\nbedtime\npag\nbarcode\nsylvan\nstarfleet\nwhitlam\ndriveway\nbrier\nreckon\nyulia\nfinnmark\nvigor\nhoist\nhamer\ncruised\nmea\ndolce\ninfestation\nsupremes\nneutralize\nkarting\ncsr\nspecter\nninian\nagassiz\noverlaid\n¿\ncinta\nnoses\ndraped\nvishal\nhoudini\nappendages\nsuzerainty\njaffe\nlemonade\nhistone\ndeirdre\nreardon\nlevitt\nhillel\nneglecting\ngwyn\ngodard\nfudge\ngustafsson\ninspecting\nmoya\najaccio\npittman\nsteinbeck\nsemicircular\nlivonian\nchasers\nprotease\ngauguin\ntheres\nundiscovered\nundp\neyck\nexaminers\nchanted\nbarbican\nurquhart\noctaves\nupa\nalbano\nvisceral\ncharan\nfemur\nsio\neldon\ntomlin\nplinth\nelis\nabolitionists\nthales\ndenbighshire\ntrekking\ndiverting\nvegetative\nalcalá\nherrick\ntelemark\nfacie\namundsen\npublicize\npraxis\nbanach\nsubsurface\nshielded\nwootton\nskanderbeg\nmpaa\nleonora\njeannette\ncapua\nstrenuous\nbiodiesel\nvireo\ncatskill\ncautions\nnarva\nnab\nheretical\ncollectives\nlumley\nitis\ncalcareous\nmomo\nicebreaker\nstimulates\ndeductions\nbodleian\nopengl\nlevelled\nebb\nbehar\nemus\nwithhold\nmagnification\nmorningside\nfosters\nzvi\nrunic\nclams\npoli\nfructose\nlovato\npaís\ncsf\npaschal\nmonopolies\navenged\ndiameters\nsegmentation\nmila\nvoiceover\necstatic\nawaits\ngianluca\nridgway\npews\nangling\nlowestoft\nperched\ninman\nsubregion\nshigeru\ndisrupts\ntochigi\nkidnappers\nbachman\nrecieved\ngateways\nhideyoshi\nperforated\nphenol\ndemonstrably\nprunus\nloyalties\nbureaus\ncytoplasmic\nmanagua\neines\nnatwest\nbacharach\naip\nkostroma\nnrk\nworsley\nerlangen\nsentinels\nrethinking\nantonius\nmarathas\nandi\nwaterworks\naguinaldo\ngoings\ntelescopic\nkuznetsova\nunprepared\nostend\nchronicling\nanglicised\nsuspending\nshekhar\numbilical\nreprisal\npoking\ncour\npsd\nyumi\ncarburetor\nendpoint\ngx\nlem\nfracturing\ncollars\nunites\nreintroduction\nwafer\nsolon\nrtv\ncoen\ndivisi
ble\ncalloway\nalkaloids\naudits\nkom\nsoybean\nriverview\nfatimid\nplow\ndanforth\nsiddiqui\nhokkien\nyachting\nanastasio\ndenounce\nunderprivileged\nmulroney\nvictors\npurports\nmediating\ntou\nreceptionist\nidw\nmeer\nariadne\nauld\niraklis\nalcohols\nslay\nchipset\nsmelter\nlingual\njus\nkea\nswarthmore\nwonderfully\nbataillon\nwalkways\nshattering\nshem\nseparatism\nprickly\ngoody\nnormalization\nretrograde\nbraithwaite\nmuffin\nyokozuna\ncondemns\ndecayed\nmsv\nlinfield\nloa\nsponges\nseduced\nleste\nptsd\nbreakwater\ndevoting\nrabindranath\niva\nheretic\nundertakes\ntammany\ncyan\nfervent\ndoon\nzamalek\ncottbus\nconcurred\nemissary\nshameful\ndali\npolemical\nfoothold\nfiremen\nsorceress\nexcretion\nalyssa\naudiobook\nmontpelier\ntyrannosaurus\nwsj\nnarcissus\npiquet\nrohit\nlemieux\nspeculates\nainu\nthankful\nthunderbolts\nepithelium\npfizer\natta\ntouré\ndeluge\nmaputo\ntantra\ningenuity\nhandguns\nbrezhnev\ndisjoint\nbeckwith\nturban\natrophy\nqs\nmarauders\nnovices\nille\nllano\nuncharted\nchuan\nstosur\northopaedic\nnegate\nswimsuit\ndeuce\njeon\ntasty\ntrappers\nsteuben\nclinicians\nakershus\nkhitan\ndisarmed\nvinny\nleng\nleanings\nkidman\nrescuers\npicky\ngriffins\ninanimate\nonshore\nbx\nheartfelt\nhandicrafts\nhandlers\nnederlandse\ngoon\nsportivo\nfavouring\nmarti\ncoffins\nscathing\nsulaiman\nbarnstaple\nindulge\ntestifying\nrenton\nmessi\nchrono\nalphabetic\nfranconian\nwildwood\nhornbill\nhideo\noliphant\nallegany\nsheboygan\nflannery\nartistically\nplaymates\ncranbrook\nsakurai\ngac\njazzy\nhillsdale\nleveling\nacceded\nreprisals\nhardie\ngelder\ntinto\nplanks\nstipend\nhouseguests\nbushy\nlederer\nfairbairn\nputra\nnewcomb\nlooms\nhumbert\nwatermelon\nziggy\npinewood\nmegawatts\nlenz\nwikimania\nresponders\nmussels\nstela\nsachsen\nhibiscus\ndiscounting\ntoba\ngraphically\nchiltern\nrehnquist\nopinionated\nmcs\ninsurers\nburrowing\ntrabzon\nsweater\ncac\nticks\noverlapped\ngenerously\nuttered\nchaudhary\nsubotica\nsmuggler\nblackrock\nencase
d\nearmarked\nmimicry\nwollaston\nbeckman\ntyumen\npiggy\njerez\ngripping\nyule\nblackheath\ncorbet\nstockbridge\ngrigory\nfredrikstad\nbusta\nplateaus\njani\nhormonal\nkwame\nhohenlohe\nwithdraws\nthrasher\nstoll\npumas\nardenne\nsnider\nmichaela\nashfield\nmainichi\ntuxedo\nflicker\ntroupes\nplazas\nstiftung\nbuckets\norigen\nwedgwood\nbiannual\nsedentary\nphonograph\nreuter\nmontmartre\napathy\nmohamad\nflemming\nsatirist\nconcepcion\ndrags\ncarlsbad\ntransylvanian\nunauthorised\nmcclain\nhibbert\nfonds\nreplenishment\nbahru\nboarders\npdr\ndisplace\ntiebreaker\nuf\nnia\nlusaka\nlifeguard\nconfigure\ndroit\nartnet\nwelland\ndecorator\nhur\npaintball\ngazprom\ndic\navatars\npaleo\nxf\nstalling\nrhenish\nspammers\nminaret\ngluck\nintensification\ndenser\noba\nmacrae\nsampras\nont\ngarages\nhuelva\nladislaus\niga\ngucci\nacf\nmorita\nreinforces\ntaiga\nmoreland\nwbo\nfreezer\nextrasolar\nfists\nsebring\naccusative\nespresso\nbodybuilder\nsca\nceos\nranchi\nreassignment\nseddon\nivana\nmuskets\nkodiak\nborel\nsorghum\nelaborately\nrima\nmerv\nlambton\nharpers\nsefton\ndiwan\nmixtapes\nbandmate\npetrochemical\nstudi\nserine\nriverine\nlashes\ntetsuya\npreside\ndnf\nreceivership\ncorrie\ncec\nyunus\nequaliser\nceasing\nplessis\ndisparities\nsperry\nflavio\nburdens\njha\nzahir\nbourgogne\nbutchers\nkorsakov\nreplicating\nbandmates\nnoyes\nshelbourne\nswayed\nthoma\ncrores\n❤\ncongas\ninvalidate\nsaale\nleonidas\nfiennes\nnordiques\nsinan\ntumbling\ngrissom\nmimicking\nzealander\npickford\nleong\nrfcu\naccented\nethereal\nplatoons\nglenelg\nstructuring\nrelevent\ninvestiture\nimagines\nhomebuilt\necliptic\npritzker\nkragujevac\nsloops\ntopless\ntimeout\nrer\nariana\nkearns\nlundy\ncormac\naptly\nsideshow\nbolshoi\nhydrology\nvitesse\ninstigation\npregnancies\nfortescue\ntracer\npepperdine\nradicalism\nfling\npint\nrigorously\nalessandria\nheaters\ncircadian\ngratuitous\nsprawl\nwebmaster\nrebate\nchemnitz\nfolklorist\ncripple\nnayak\nwayland\nosgood\npesaro\nappend\ngath
erers\nhuntingdonshire\nunfold\ninept\nredondo\nquartered\ntransplanted\ngaspard\ncrosstown\nhommes\nnyasaland\noriya\nparalleled\ngv\ndrinker\nairdrie\nexhausting\nmanatee\nvaleri\ngta\nshayne\nbarksdale\nsopwith\nwaveform\nmpp\nmulholland\ngovinda\nsergius\ngenerational\ncapensis\nzag\nhalloran\nantietam\nlupe\npuzzling\necozone\ndarien\nmated\nmacomb\nsteyn\nherz\ngalt\ndeliverance\npsalter\nupc\nstags\nmalevolent\nbiloxi\nbarium\nnegligent\nrecycle\nunwise\naran\nsuppressor\nestimator\nrencontres\nhitomi\nhorta\nprotectors\nnashua\npinging\nogilvy\ndarter\nsinful\nhandover\nunstoppable\nfernanda\nchildcare\nstinging\nstillman\nharlequins\ndiodes\ntupac\ntightening\nindividualized\ntwister\nvacate\narchetypal\ntock\ngovernorates\nnetherland\nrak\nstitching\ngust\nprada\npps\nmaluku\nlaments\nkyo\nrenominate\nemphatic\nrafts\nnorrköping\nmorehead\ndistantly\ndigitised\nsuperpower\nplantagenet\nsubconscious\nphish\npresto\nsabin\nsecessionist\nomissions\nspammed\npsych\nroadshow\nhairstyle\nwicks\ncourtiers\nboathouse\nundeniable\nmako\ngearing\nfaking\nrafting\nbikers\narbuthnot\nasymmetry\ninformations\ncrematorium\npoo\ngoggles\nforking\nleitrim\nvalhalla\ncorneal\ncubism\ngoku\nforesight\ntgv\ndaewoo\nprobing\nunfolded\ngolding\nchoppy\nwealthier\ndov\nforfar\nuncompromising\nmitigating\nequates\nadrián\nbahr\nvitali\nconnery\nhem\napparitions\ncisneros\nretaliated\nscouted\nhatches\nretook\naristide\ncee\nhenk\ngrenfell\nbrash\nawfully\nchimera\nkhaki\ncataract\nmindful\nbryansk\nbarthélemy\nedd\nintertoto\nnewsreel\nchor\nfortitude\nmucus\nairstrike\nlaban\nstalwart\ngrapevine\necb\ncleaver\naviva\nguardianship\ngrahame\nbenítez\nheliport\nsportsmanship\ntcm\ndiverge\nlloyds\nbozeman\npharmacies\ndeepening\ngoverness\nhein\nfluke\ngrampus\nmanoeuvres\nneotropical\noffensives\nediciones\nvertebral\ntera\nmidwives\nmenéndez\ngoers\nfueling\nconfidant\nneurotransmitter\nnaik\ncontraband\noncoming\nwap\ndischarging\nyaroslav\nabrahams\ndeclassified\nhairless\natt
ractiveness\ninshore\nhavelock\npontoon\nmalhotra\nracketeering\ngunned\nmennonites\ncabs\nflung\nexhumed\nabundantly\nincrements\nchatsworth\nrefrained\nsutras\neuston\nnao\nmiura\ntowne\nhideous\nehud\nburgundian\nlice\neditore\nzuma\nmistral\nsarnia\nalamein\nstanislas\nlooping\nmouton\narellano\nsunbeam\nabteilung\nwatercolors\nbihari\ndissonance\nbreyer\ntunnelling\ndamping\ngrieving\nnatured\neriksen\ngoalless\nkapp\njesper\nunsettled\nbibliotheca\nemc\nqd\npinning\npocono\ncoos\noutages\nfibonacci\nmorrell\nrabies\nnotations\ndiscworld\ngreifswald\npearse\nmorgantown\nganguly\npepin\ncolville\nbaffled\nmacroscopic\ninfrastructures\nshipwrecked\ngogol\njuarez\ntoho\nbuffers\nkrebs\nkyi\napl\niodide\nethnologue\nnorthside\nvfr\ntelepathy\nmayagüez\nszabó\nfastened\nhomers\nmonteverdi\nreclaiming\npalearctic\nhaase\nbrazos\nviscounts\nflowered\noboes\niis\nagosto\npachuca\nsoria\nscholz\ncostumed\narchduchess\ncps\nmauser\nstockwell\ndilution\ntatarstan\ndalla\nbennie\ngimnasia\nmoyer\ntoma\nyamashita\nartemisia\nmota\nsourcebook\ncustard\ninsoluble\nirishman\nehime\nspearhead\najit\nmccord\nselenium\ntsonga\nsilicone\neiji\nbrits\nperce\nsharpness\nkrajicek\nfoodstuffs\nsuceava\ncarbondale\nnauvoo\nlinguistically\nalcoholics\nmetzger\ndescriptor\ngalapagos\ncalvinism\nwhitey\nsonoran\nmasahiro\nmodifier\ngory\nlasker\nwhimsical\ntapering\nlapland\nmachiavelli\nscofield\ncategorically\ntimetables\nbetterment\npcl\nnws\nfluffy\nshank\nbittorrent\nparadis\nisola\nyasser\nquinlan\nextremities\ntester\ntendulkar\nrevolutionized\njurong\nespírito\nfirebird\nvermeer\nproletariat\nimpala\npaulson\ntextured\nende\nalignments\novertake\nstratification\nrenames\ndürer\nliston\ngendered\ntranquility\ndecoder\nviborg\nhowarth\nglorified\nchrista\nnouvelles\ntranspired\nborrower\nhearsay\nrafter\nhounslow\newell\navellino\ncochabamba\nber\nantichrist\npecos\nautomate\nsasebo\nishii\nraffaele\nenclose\ngreyhawk\nswaps\ngaffney\njaén\nelinor\nlune\nfilho\nneva\ntranscriptional
\ntoddler\ncrate\nzell\noverloaded\noutro\ndaman\nroving\nsaban\nbales\nantonín\nbeatification\ncatalysis\nxb\nmonolith\nalgorithmic\nunconvincing\nhooghly\nxtra\ndelirium\nmeehan\ndimorphism\nescapees\nmurchison\nsimplifying\npharma\nworkstations\nkasparov\ncypriots\ndormer\njesu\nbuttocks\nsemper\ngurgaon\nando\ntegucigalpa\ncoelho\nwakayama\numno\nbrandywine\neunuchs\nsilverstein\nelectrochemical\ndelilah\nluxemburg\nmisinterpretation\nrhif\nobscura\nperthshire\npennies\nbronzes\nmistook\npilate\nkonin\nnei\ntaper\nnfpa\nmahindra\nforgives\nleal\nlikud\nmisnomer\nlyell\ngranby\nblume\nsandboxes\nhuggins\nofcom\nsz\nyum\nbacillus\nporno\njoubert\nhom\nclp\noutbound\nopie\nesl\nmagallanes\ndistinctively\nnatalya\nnecessitating\nlollipop\nmmm\nmuldoon\nsistema\nmanassas\nszeged\nclimactic\nkoala\nkenan\nrügen\nparadoxes\nposited\nmøller\nbhutanese\nscrabble\ndurrani\nagricola\nposterity\nheathcote\nconjectured\ncountryman\nanglers\nsorensen\nkentish\nims\nhau\ndraco\naurangabad\nbárbara\nannandale\nhergé\nrollo\naristotelian\ndonatello\nquarrying\nfalsified\nintensively\nmend\napprehension\nhells\ncocos\nundersecretary\nbustamante\nmycologist\ntusk\nhyung\npuffery\nanalogues\nnationalization\nvillers\nnatively\nromo\nburney\ncatenary\ncedars\nsauk\nrcm\ncleverly\nsimmonds\noryol\npicardy\nreciprocating\nfilings\nmindoro\nmakin\ngurdwara\nstatisticians\nbabbler\nfirestorm\ncircling\nsrinivasa\nmauna\ntropic\nundermines\ntrs\nhordes\nrasa\nprescriptions\nblockers\nsketched\nharpoon\nswede\ntosca\ntapestries\ndaemon\ncybernetics\nemin\nstoddard\ngarret\npatiala\nherndon\nunpaved\ndystrophy\ncanadensis\nruncorn\nccs\nteamsters\ngautier\nvivek\niverson\ngroceries\nallergies\ntere\ngillett\neglin\nbrews\nlivia\ncallers\nwoodworking\nminotaur\nkashiwa\nsotomayor\nstateless\nunderestimated\nreiss\nstakeholder\nerecting\nquieter\ncluttered\noutboard\nbirthdate\nturan\ncade\nfathom\nrestaurateur\naprès\nhalim\nsuzie\ncoronet\nleto\nunfaithful\nestoril\nbelgorod\neniwetok\nimp
ersonation\nbeersheba\nhangover\ndecode\nkj\nkatana\ncommandery\nnarcotic\npylon\nagrippa\nscotus\nshunned\namicable\ntransduction\nbronco\nanglicized\ndatta\nbolzano\nheber\nannular\nbulletproof\nsitka\nterminally\nslits\natwater\nmunnetra\nminutemen\nhydride\ngn\ndiversify\nrusk\naru\nkazuo\ngiroux\nslipknot\ndebugging\npiled\newald\nbeaked\nblackie\nloew\nusha\nwuppertal\nflintshire\ndefectors\ntalons\ngambier\nragusa\ngalápagos\npeep\ngwinnett\nmovers\nkiwis\nbreads\naccomplices\ntherapeutics\nlooming\nkilns\nupanishads\nremission\nsideman\ngriggs\naffleck\ngers\namity\namma\nstepan\nanatomist\nlz\nsefer\nencirclement\nintrepid\nkuching\namassing\nsetúbal\ntypified\npatrik\ncarlsen\nbacker\ncarrion\nwhitehaven\nrevolvers\nsupra\ntrot\nrood\nhawai\ncasale\nmasquerading\npeake\ncrease\nskåne\nsymmetries\nmarat\nalkyl\ncarranza\ncaligula\ndor\nfreda\nflavours\nasgard\ndealerships\nlifeline\nscratched\ncommunicative\nintercession\ngrinder\nspammer\njl\neeg\nvesicles\nmoltke\nphonetics\nrebounding\nendorses\nlocarno\nantecedent\nduplicating\nperlman\nunsuspecting\nscarred\nlae\npelt\nmikko\nendowments\nnadh\ngannett\nbou\nlicking\nyardley\nfrankston\ncardozo\ngartner\nptv\nruston\nnicolson\nflugelhorn\nwallaby\nalloa\nsiro\nkandi\nvarga\nbaie\ncoleraine\nwoodman\nfrome\nmachining\ndialysis\nabdi\ntyphoons\nmahony\nseaforth\nauthorisation\nanguish\nvoz\nmoussa\nchiswick\nminto\nmudd\nmotivating\nhistoricity\nunderstandably\ngosford\ntrina\ncuvier\nnagel\ntenses\npolyester\nquarto\ntelly\nsedans\nunfolds\nguetta\nfinkelstein\npowering\ncalorie\naesop\ncamshaft\nbiotech\newa\naraneta\norientated\ncorazon\ndenim\ndunmore\nkanal\nsuitor\neveryman\nleonhard\npaladin\nvalerius\ncolonials\ncovenants\nmaidenhead\nbrickwork\nshredder\nsidon\nbangs\ndeviate\nhyacinth\ngamecocks\nnarita\ndeg\nbambi\nunconstructive\nperipherals\nschweizer\ncolorectal\ngit\ncommissary\nlegacies\nslammed\ngrier\ndohc\ngrc\nmurali\nullman\nmitchel\nsynchronised\nheathen\nmaximizing\ndeuteronomy\nuti
lise\nbnsf\nbastards\nivanovic\ntheseus\nsubjunctive\nkyoko\nredfern\nnubian\ngeyer\navert\nsoundly\nseljuk\nsquatters\ntechnische\ndahomey\nrenters\nhutch\nstuds\ncul\nfamiliarize\nmizuki\nmura\nstockade\nemporia\ndissolves\ncoetzee\nimpediment\nposner\nsatsuma\nuns\nenveloped\nwestcott\ntransplants\nbah\nsalvo\njeffreys\nbulawayo\namplify\nbriefcase\nhx\nsixes\nningbo\ncephalopods\nstaircases\nrightfully\ncaramel\ndales\nbribed\nvmi\nhuracán\nplexus\ntownhouse\njenson\npretense\nyamuna\nbrechin\nepo\nkeppel\nnanak\nnunnery\nnus\ncams\ncbbc\nferrers\npalladian\nclemency\nbloated\noto\nstylised\nmeghan\ncramped\nmagnates\ndishonesty\nbenigno\nprecaution\nsens\ninseparable\njurisdictional\ninfecting\nsalting\nelmwood\nuplifting\nfrustrations\nunaltered\nulcers\nfoggia\nelvin\nplz\nmellor\ndissipation\nprr\nisoforms\nnasional\nstrawberries\ngeodesic\nadvaita\nesperance\ninnocuous\ningeborg\nkonstantinos\nkirtland\nhijack\nmultivariate\ncamara\nsemifinalist\nbashar\nmayhew\nredbridge\nnevsky\nwilshire\nenron\n​\nhavens\nneutrinos\nlaborer\nbaikal\nganz\namarna\ngreyhounds\nerasing\nusman\nlipton\nunequivocally\nfondly\ntydfil\nwitte\npym\ncampeche\nkenichi\nfreebsd\ngallardo\nisometric\nabbeys\ntezuka\nphonemic\nkbe\nhabilitation\ncharacterisation\nayurveda\nstuntman\ntrisha\nhensley\nsubhash\ncarpathians\nbeehive\ncandida\nregrouped\ncram\nbubblegum\nconversational\ncurley\nbeheading\nimperium\nclausen\nprintmakers\nshunting\njiao\nufos\nrhett\ningersoll\nushered\namalie\ncontending\nena\nskywalker\nfluently\nsundown\nshadowy\nremuneration\ndubstep\nibarra\naffords\nantimicrobial\nprospector\ncyst\ncilia\ndispensation\nunclean\nmythologies\ntimers\nhereafter\nove\nach\nradon\npeachtree\nlabelle\nwhitewash\nhulled\ndidactic\nbiff\nemigrating\nflirt\nbaptista\nchopping\naller\nsokolov\nnovell\nsialkot\nwatchman\nsyrians\ndesiring\ncima\ncyberpunk\ngilroy\nbiel\naneurysm\nxiamen\nrafters\ncrick\nsau\nwirth\nrunescape\nbasse\ntenet\npiling\nnatura\ndroid\ncrumb\nthoreau\n
salm\narie\ngeisha\nwhitefish\nzandt\nquirk\ncinéma\naltos\npsoe\nblige\ntorrential\ndevanagari\nvetted\nbashing\nconvective\nmikado\nmurthy\nraines\nrumi\nvestiges\nmarrakech\nwhiteman\nassailant\ncorrespondingly\nwicke\nmultiplicative\ngautama\nwps\nasperger\nloos\nhanford\nbeauvais\nceuta\nclementine\nrivero\nmilanese\nficus\ncatawba\nfeudalism\nthule\nphilippi\ndebacle\nqualcomm\nrecombinant\nrcn\nminoan\nastm\npotable\nandros\nmariposa\nfenian\nkasper\npca\nswears\ngeospatial\nasturian\npurging\ncashel\nnegara\nrifled\nodell\nicu\ncysteine\nfalstaff\nembraer\nespoo\nboro\nportia\nteasing\nsupremacist\nresent\npats\nkangxi\nontological\nanni\ncarboxylic\ntsarist\nwasserman\ntokushima\nharada\ndhs\nclassicist\nnfb\noca\nheadteacher\ngutter\naligning\naquaman\ngenerative\nllb\nnarbonne\nburnie\nvenu\ncleansed\ndeadpool\nvytautas\nheadstone\ntormented\nfrisia\npsychedelia\ngarra\ntendered\neelam\nrehabilitate\nrefuges\ncoercive\nbpm\nwicker\nliberalization\nfinalised\npayloads\ncorvallis\ncollegial\ntrigonometric\nmussel\nestefan\ndiggers\nnhc\npuns\nkeeler\nnyg\ndiscreet\ncôtes\ntakeo\nkoehler\nbaffin\nedging\nopossum\npsl\nwga\ndeepened\ninternships\nspoleto\nwranglers\nhoyle\nraghu\nlinares\nbukhara\nloveless\ngrt\nconscripts\nerrant\nglaser\neisteddfod\ngabor\nskytrain\ncranberry\ncastaways\nboers\nmutton\nmeticulously\nnlp\ndepp\nrizzoli\nsportswriters\ncenotaph\nwholeheartedly\nfrail\nborac\nbiometric\ncirrus\nmooring\npalos\nstuffing\nconverging\nmegapixel\ntoppled\nburghs\nreverb\nsanga\ntempting\nzwei\nworshipful\nazarenka\nncl\nmamma\nsuperfortress\npavlov\ndictators\npyar\nstarkey\npda\ncoexistence\nbirla\npatronymic\nmhc\nhares\nwhomever\nsyllabic\nstumbling\nfondation\ndass\ndispensed\nunimpressed\ngamboa\nslaughterhouse\nsheraton\ndissuade\nrewind\ncontre\ntattooed\nkontinental\nscaffold\nrandomness\nexxon\nfanciful\ndisallow\nsteampunk\nstretcher\npolje\nsteadfast\nlagging\ntemplars\nzeitgeist\ngamal\nhawkesbury\nproverb\nspicer\ncoldstream\nzapotec\
nthirsty\nsynapse\nconformal\nsimferopol\nhurdlers\ntaunts\nmismatch\nplacenames\ngetafe\nmargery\nunfavourable\nzoey\ninjunctions\nspares\nadaptable\nrobredo\npaleontologists\npedophile\nfiske\nmarinos\nkhamenei\nnsb\nivanhoe\nalcock\ntunku\nbiofuels\naugmentation\nmummies\ndiphosphate\nlocos\ntous\ncherished\nbernese\nunilever\nvignettes\nthorp\ngirdle\nvitaly\nforsythe\nbenthic\npayback\nperpetuated\npepys\nnla\nhundredth\nparapsychology\nnella\nhipparcos\nminkowski\ncorrosive\nlorena\ngreenbrier\nprecipitate\nsukhoi\ncuneo\ngeyser\nsuperintendents\nsuraj\nawkwardly\ncornea\ngolem\nvelasquez\nclassicism\ngeocentric\nseagulls\nsteppes\nmosses\ndemille\nadama\ncwt\nfarber\njoakim\nsek\nyaoundé\nbpi\nsiglo\nlamarck\nportuguês\nlennie\ntaunting\nwexler\ngts\nboavista\nslocum\ndefoe\ngonzo\nvengeful\netta\npurvis\nscsi\nmontmorency\nluminaries\nelbert\neoin\nczechoslovakian\nthrift\nborromeo\nsmirnov\nlobbyists\nfukuda\naunts\nhairdresser\nmethamphetamine\nferrand\nheadmistress\nstellenbosch\nrocked\nwraith\neducates\nlongview\nduels\nstagnant\nkalan\njinan\nsilencing\ncliché\ncorroborated\nbrainwashed\nxenophon\nhuai\nmoura\ncourting\nagk\nthrashers\nbanding\ntomislav\ndismantle\ncoerced\newe\nakiko\ntaxing\nsoles\ncommunicator\nbittersweet\ninsecta\nglyphs\nlocalised\nklm\nvasil\nlengthening\nmajorities\ncliffhanger\nieyasu\ndryer\nwoodley\nblotch\ncano\nsme\nhyder\ntwitch\nshimbun\nkeynesian\ndarpa\nremi\ntiago\nwhipping\ncubist\ngosling\nsemnan\nrailcar\ninvokes\ncrvena\napostasy\nettore\nbanked\nrimsky\ntabletop\noyo\nfairmount\nclippings\nvilas\nspokeswoman\ntropes\nnoriega\nhating\nashdod\nmeissen\nhoneywell\naffinities\nnance\nderailment\ncontemplative\nmatos\nsauron\nsweetwater\nronin\npuncture\nlomas\nmidwifery\nfarman\nprescribe\nschiavone\nakram\nallier\nrefering\nverity\nredlands\nmalin\ncancelling\ncongruent\nglued\nagnieszka\nbbl\nantecedents\ntsuen\npeacekeepers\neas\ncoining\nwaging\nopry\notero\nsela\nlillie\narequipa\nhoya\nfarquhar\npidgin\ngodmoth
er\nbabylonia\nkuiper\nbartoli\nwadham\nprincipe\nmadoff\nplasmodium\nfema\nbeloit\nsafeguarding\nsharkey\nannabel\ndisgraced\nwoodcut\nharsher\ncézanne\nbeset\nimpartiality\ngooch\nacadians\ngilgamesh\naloft\nakp\ncrates\nimpostor\nhieronymus\nenquirer\nlayne\nbrinkley\nribbed\npres\nluzerne\nmaginot\nmetra\ntiresome\nyer\nkuh\ntrumpeters\nmaung\nsymptomatic\nbucknell\nfrisch\nschmitz\nsilvestre\nlimousin\ntart\nsandringham\nmertens\ndôme\nreimbursement\nweathered\nmultnomah\nbashkortostan\nfecal\nxxiv\nallocations\nflake\nreputations\nchandos\ncorwin\nboko\nfiance\nmaître\npinpoint\ntabloids\nnowak\nlibri\nwargames\ncollieries\nnetto\nnudibranch\nfarrow\nosi\ninpatient\ncircled\namélie\nschafer\nproms\ntransformative\nnaturelle\npriceless\npredicament\nunsurprisingly\ngant\noxyrhynchus\nabnormally\nrediscovery\nheadgear\nsoapboxing\nrakesh\nnorthwestward\nhaircut\npuerta\nhellman\ncuatro\nrotors\nblau\npredated\nkiri\nhiss\nwhitehorse\nelitserien\nstudebaker\nslag\nranji\ntryout\nseeley\nhausa\nclarinetist\nrarest\ndah\negret\ncourted\nvolkov\nunedited\nbayside\npurdy\najmer\ndardanelles\nginny\nmusicale\npaws\nirani\nconvene\nenforceable\ninez\nfripp\ndux\ntrudy\nchristiania\nhippodrome\npounding\nbijapur\nabsentee\ngruppe\ncrumbling\ngowns\ngodoy\nintrigues\nreissues\nluster\ndysplasia\nchangi\nredox\ncrozier\naer\nstraddles\nyann\ninfarction\ngrandchild\nbhupathi\nnakano\nagora\nbrokered\nmidori\nsteed\nmenezes\nmellotron\njacobean\ntricolor\ncfc\nroundup\ntrajectories\nimpunity\ntits\ngretna\npeder\niia\ndupuis\nsahrawi\npostmodernism\nmcclintock\nxenia\nbalsam\nembed\nentomologists\nkona\ngetz\ngozo\nmassed\njonathon\nyuriy\nominous\nruthenian\napu\ncrayon\nwashingtonpost\nsumming\nnsc\nseñor\nhelmand\naeros\nvonnegut\nmorissette\nzygmunt\nfesta\ndermatitis\nmanchukuo\nmacapagal\nhérault\nriddles\nhaddon\nvella\nkeeffe\nglens\nflensburg\nfiftieth\nfocussing\nsuvorov\ndugout\nattenuation\ntenderness\ndisqualify\nrehearsing\nwellbeing\ninfertility\ngrenadiers\n
dismounted\ninaccuracy\nnanaimo\noccam\nperilous\naurelio\nfrontera\nchirac\n¾\nléo\nburi\nmanohar\ndesolate\norbis\nshred\nindividualist\nyama\nseclusion\nantagonism\naureus\ncog\ndaunting\nappointee\nmisha\npirelli\npraetorian\neuphorbia\nhalts\nnavel\ninherits\nuscg\nkilimanjaro\nfranken\naker\nshiite\nscherzo\nhurled\ntaggart\nmacintyre\npressings\nrevitalize\natrial\njustifications\nherpes\nbarks\nnikolaos\nempirically\ngeologically\nhacks\nshatter\ncaregivers\nrehearsed\nflamsteed\neucharistic\nmaldivian\nsimi\nrafferty\ncfs\nsubstitutions\nindent\nmarmara\nblockading\nmaccabiah\nfleck\ngrasshoppers\nhobo\nyoshino\nlookouts\nmobs\ndesertion\nnast\nlismore\nsveriges\nbarricades\nobenberger\nmornington\ngrievance\nmotta\noren\ndermatology\nmenlo\nintramural\nmultimillion\noromo\nconfine\nmovin\nemphatically\nnilsen\nbettina\ngallus\nheraclius\npubl\naccomplishing\noutwardly\nglock\nbioethics\ndisarm\nnorden\nnewham\nwatchlists\nrandle\nscrape\ndelights\ncamper\nhashtag\nmeigs\nseptic\ngloomy\nhaida\ncentenario\nmoores\nvalverde\nwollstonecraft\nplucked\nouttakes\nflared\nelche\ngranollers\nsaha\ngrieg\neda\nusurpation\nemblematic\nkinsella\nconsistory\nwycliffe\nstary\nbeaks\nticking\ncrossbow\nwakeman\nmohave\nintestines\nkatanga\nlangue\nthomsen\ngrubb\ncrystallography\nwettest\nbriar\nmaterialize\nemanating\nmoderators\nundetected\ntyphus\nchievo\nlassie\nteleportation\nwrists\nrfu\nyonne\nroommates\ninset\nprevails\ngandalf\nuninterested\ncircumnavigation\nunido\nscheming\nfalcone\ntavistock\nhairpin\ntfl\nascends\npraeger\nfearsome\npula\nkhao\nexorcist\nvero\nahmadi\nvoluminous\nyannick\ndeportations\ndammed\ncoolest\ndijk\nunset\nashworth\ndanvers\ntodos\nrainbows\nmang\nkilmer\ncura\nharker\nmukesh\nburleigh\ncharly\ndivisor\nnurseries\nlovech\nducati\njoyner\nsauber\ninvercargill\nbengt\nemptiness\ndonkeys\ncatacombs\ndalby\nsbc\nfoyle\nexpandable\nmuchmusic\npilgrimages\nninjas\nemiliano\napprovals\naerosol\nscarab\nopposites\nadjectival\nkillian\nbuzz
er\nalkmaar\npasig\nasm\nmozambican\nsamiti\nwayward\npippin\npacemaker\nneanderthal\nhansard\nthruway\nstonework\nboltzmann\nconcealing\nmihail\npld\nfirehouse\njungles\nwetter\nhomeport\nsprout\njúnior\nlaminated\nstamping\ngrunt\nconfig\nneely\nender\nqueues\nargentinos\npeacemaker\nshards\naberration\ndisarray\nmonson\nreconstructing\nmaciej\nchickamauga\nhymnal\nwcc\nboxscore\nconsulship\nkars\nmisusing\nspectroscopic\nquantified\nkayla\nvalerio\nbondarenko\nnadezhda\nrathbone\nquiero\ncherries\narming\nyucca\nolympiakos\nerc\neloquence\nirradiation\ncenturions\nraritan\nparley\nisc\nparachutes\ndispense\nnürburgring\namputated\nkeyboardists\nfrancisca\narn\npalencia\nkaori\nbetraying\nthrills\noper\nclinching\nlicht\nfaunal\nascertained\nathabasca\ngallium\nsfb\ndouala\nhamel\naxon\ngeordie\nsisterhood\ndz\nhideki\navis\nscares\ngiselle\nnanyang\nmeeker\nfairytale\nharo\nalasdair\naeg\nskole\ndene\namicus\naudley\nunderstudy\nmathura\nfigurehead\nscopus\ncongresswoman\nentomological\nsaleem\nboyne\nvaslui\nimpeccable\njoys\nrbc\nmitzvah\nseductive\nhunedoara\noration\nbenefactors\nperlis\nwatchmen\njingles\ntoda\nibc\nbivalve\nprovincia\nanselmo\ncongratulate\nkultur\ncarruthers\nmohsen\nmoynihan\nfossa\ngallop\nclergymen\nrectors\nlido\nmadhavan\namboy\nfridge\npsychoanalyst\nsook\nsamaria\nbyng\nmcghee\nholley\nshahin\nlumped\nlauper\nbeatings\nincur\naime\ntrolleybuses\nhitchens\nrevisionism\ndiscontinuous\nharte\nbermudian\nsundial\nmicrofilm\nrabaul\npulsar\nhonouring\nangelus\nlabonte\nfsu\nhogs\ncursive\ndupree\ncarré\nnepean\ncorollary\nholborn\nchantilly\ncompliments\ncoexist\nvividly\nbrindisi\nmnemonic\nsinbad\npessimistic\nlubrication\ndrei\nfutuna\nalessandra\naoyama\ncoachella\nleica\nquinton\nwidths\nvegetarianism\nwolseley\nrepudiated\nredrawn\nsurrenders\nkilburn\nahvaz\nmonongahela\ninfinitesimal\npanionios\nolney\nbins\ninexplicably\nruthven\nforgeries\nnantwich\ncontraceptive\nlaver\npetal\nusac\npelagic\ntalia\nrasheed\ngoodies\nmotörhead\
naxed\ndelany\nbacolod\nhieroglyphs\npastime\nstaley\ncrassus\nlumière\nteodoro\ncusp\nkerouac\nfiancee\ntacit\nrebroadcast\nabdur\nembody\nstrom\nflatter\ntunic\nmair\nbde\nleveraged\ndonohue\nalles\nepicenter\npardons\ninterrupting\nmeu\nroddenberry\nzug\nmanet\nwashes\nuavs\nhdi\negbert\nmcelroy\ndpp\nmoderna\nalix\ntuzla\nparlance\njabbar\nberhad\ndisembarked\npreto\nanak\nvers\nrou\ndari\nbahnhof\ndarkened\nugh\nrina\nredstone\nconfides\nrationally\nhoof\nbeastie\nbelgique\npella\nminuteman\nmistresses\nmcloughlin\ncompetencies\ncccc\nsufferers\nperturbation\netat\ncircassian\nauthenticate\nmoraes\nheadley\npeninsulas\nevangeline\ngarrard\nmoonshine\nrotorua\nhikari\nvijayawada\nmorehouse\nastragalus\nsheba\nshoten\noficial\njamaat\nzeng\ntatyana\ndinh\noverheating\nwenceslaus\ndreyer\naguascalientes\ntrickster\ninvests\nesbjerg\nsatish\nrupee\nsamarkand\npsg\narmadillo\nashikaga\nsherborne\ngrebe\ndurante\nappointees\netobicoke\nbarter\nfumbled\nbergh\ncysts\nkath\nrpgs\ndenning\nxxviii\nlenoir\ngander\norsay\ngrandis\nfortnightly\ncdma\nfoals\nglenda\nblagoevgrad\nastrophysicist\nlibero\nhauls\nailerons\nashgate\nmetabolite\nunsustainable\nbiophysics\nbenchmarks\ntypology\noberhausen\nleafy\nfocke\npevsner\nkoto\nnimbus\nscaffolding\ndisclosures\nsketchy\ndennison\ngangwon\nadvisories\ntrainings\nchowk\nunparalleled\nmegumi\ndehradun\nzafar\nvirgen\nmads\npalliative\nyeong\nmisreading\nterek\nhangman\nhardtop\nvibrating\ncassino\nalisa\nhyland\nasante\nganesan\ncappadocia\ndeterrence\ncockney\ninfra\nbiarritz\ntorrington\nbizet\nrtf\nbylaws\nwhalley\ntechnicality\nobtuse\nblackfoot\ntrove\ncomprehensible\nizak\nunruly\ntutored\nganglia\nbandera\nbtr\nurbino\nmatic\ntabitha\ndufferin\nbedside\ntermites\nglare\nparsley\nlawrie\nmacrophages\ndap\ntugboat\nelevating\njains\nlaymen\nmeryl\nmanmohan\nwhit\nblowout\nfusing\ndrab\ndior\nchiara\ntricia\nblinding\ngara\nforde\nvikas\nrecursion\nbabar\narchaea\nlymphocytes\nmacroeconomic\nnostrils\nswaminarayan\nbalan\n
caper\nhyena\nhibernation\namazonian\nsloboda\nflips\npylons\nastute\nweg\nlessened\nquadrilateral\nbenelux\nspoons\ninspectorate\nandries\nenglishmen\nasti\nwoolen\npowerpoint\ncustomizable\ndoorstep\nflavoured\nmathematica\nparticulate\nkrakow\ncausative\ngeeks\nbrookhaven\nebu\nconifers\nktm\neditable\namico\nstinger\noverpowered\nwoollen\nmadera\nblackadder\nsutures\nkinases\nmaddy\ndayan\npfalz\nremedied\nisley\nhomeopathic\nnadph\nyelena\nfarragut\ngipsy\nstoves\nsmuggle\nedelman\ngwp\ntsubasa\nwishbone\nbabbitt\nthrombosis\nmilitar\ncolfax\nkurgan\nwitwatersrand\nauden\nnapolitano\nstallone\nrawlinson\ngoby\ngau\nxhosa\ntambov\ncrayfish\nnieminen\nbartolomé\npsychoactive\nanima\nciphers\nrefutation\ncheadle\ncomité\nsaavedra\ntripping\nsovereigns\nprefabricated\nasker\ninterstitial\nsiobhan\nvixen\ndressings\ntut\nscaly\ntransfiguration\ndraughtsman\ntrivially\nmilos\nupfront\ndarrow\nsavages\nfactsheet\nclaudette\ndeathly\norientales\nyakov\nperverse\ngeforce\ndruids\nextrajudicial\ngöran\ncocteau\nuthman\nelixir\nenumeration\nentanglement\nkunsthalle\nscoreline\nsteffen\nyury\nmourners\nconjugated\nmémoire\narcana\ncana\neder\npmc\ndarwinism\nflagging\nalpert\nstreaked\npenicillin\nssa\nniu\narbuckle\nbinge\nlukewarm\ngagnon\nmek\ndimitar\ndbe\nagamemnon\nvittoria\narslan\ntint\nhumayun\nuab\ntelethon\nsped\nenergie\nyash\nmotorbike\ndrenthe\nderanged\nseptuagint\nwea\naberdare\nrefrigerated\nbaguio\njute\nfalsehood\ngermination\narendt\nfrantz\nkessel\ninquisitor\npickled\ncorrado\nstadler\nhuckabee\nkaram\nrawson\ntrackage\nventricle\npari\nrosenfeld\nwoodbine\ntanjung\ncoughlin\nyoshi\ndoughty\nleiber\nlumbar\nshulman\nshamanism\nresonator\ninsurer\ninterruptions\ndozier\nmémoires\ncrippling\nbroodmare\nleans\ncamarines\ncontemplate\nfellini\naargau\ndazzling\nwarmed\ndisinterested\ngovan\nako\neastleigh\ndeterminants\nproportionally\nbibliographical\nperceptible\nabl\nflagg\nfinchley\ncrib\npero\ncoupon\ncluny\nrefractory\nuncovers\nsubsidence\nidiotic\
nhus\nalina\nretroactive\nsubstandard\nseamlessly\nwoes\nklagenfurt\nuniversitatea\nraytheon\nlathe\nquesada\ngeostationary\nhandout\nbexley\nreo\ntiwari\nparganas\npixies\nnuma\nmsgr\nlithograph\nfroze\nisomer\npolanski\nflask\ncanard\nbelli\ncauca\ncoogan\nroan\nlendl\nideologically\nicty\nobie\ndafydd\nsanctity\ndeadlines\ntidewater\ncurler\nexhibitors\nbulbul\ntoleration\npseudonymous\ncriterium\ncaravans\nscavenger\ncompounding\nchargé\npowerlifting\nboson\nfreire\nbarrichello\ndemeter\n¦\nmorro\ncourland\nglyph\nvicario\nmariupol\nmasami\nqadir\nheraklion\nmackerel\nyl\nishida\nunscrupulous\njonesboro\ncrespo\nkravitz\njuncture\niww\nbodhisattva\ndegeneres\ntosh\nlysine\nraps\nmarija\nhajime\nprendergast\ncoworkers\ncochise\njudi\nkanyakumari\nprodigal\ngamaliel\ncrittenden\nfinanciers\nbakeries\ndurán\nacevedo\nkanda\nmeth\nconic\ninsulating\norcs\nporters\nrolland\nakon\nglas\ntilak\nstepney\ntofu\ninnuendo\nmarsalis\neurosport\nsab\nsymposia\nrmb\nveneer\nmetroplex\niskandar\nchuang\nmitterrand\ndaugherty\nmontezuma\nimpairments\ngadsden\nskoda\nlabem\ngeorgiana\ncarefree\nchambre\npensioners\nlyre\nmenard\nemotive\ngorton\ntif\nambrosio\nroebuck\nhexagon\ntupelo\nhamar\npantera\ndiscreetly\nclunky\nvagabond\nmontauk\ngiovanna\npentagonal\norbison\ngillan\ncarnivores\nanthropogenic\nmeerut\nmcafee\ncorbusier\nxxv\nsopron\ntru\nsilos\npog\nplacenta\nwcha\ndingo\nbarreto\nbrainer\nétude\npennetta\naqsa\narti\ntarmac\nyahweh\nchardonnay\nlejeune\nhuckleberry\nbundaberg\nxxl\ndescendent\ndrugged\nluft\ndonal\nsharpened\nkaleidoscope\nbertolt\neuripides\nribera\ngrayish\ntaiyuan\ndeconstruction\nlures\nheatseekers\nimpulsive\ntokyopop\ntva\nreentry\npopularize\nkailash\npreset\ngrau\nyvelines\nmatlab\nluncheon\nbarracuda\njamil\nracecar\nlifes\ncockerell\npostulate\nwindermere\nhellfire\ndegenerated\nlandlocked\nangell\nvelodrome\ncrenshaw\nfeeney\nxxvii\nscopes\ney\nmanna\nspotify\nlyricists\nhurdler\ncruces\nhurtado\newen\nwenger\nuso\nzf\noutlived\nbergmann\n
chansons\nrims\ngreys\ndeville\nsuitors\ntetra\nmustache\ngorizia\nrida\ncaptaining\ndendritic\nuncontrollable\ndisappearances\nsantosh\nteak\npaleontological\ntelevisión\ndoubting\njuxtaposition\nmonoclonal\ndomestication\nhowells\nmacs\nlunenburg\ndendrobium\ntulu\ntirunelveli\nsieve\nfacilitation\ninkscape\niwi\ndvorak\nspinach\nphuket\nintelligentsia\nhals\nspirals\ndeloitte\nharney\ncocks\nged\ncnrs\ncorte\nandriy\ngogo\nfireplaces\ninfects\nregistries\nbajo\nyasmin\nheredia\nspilling\ngj\nkickboxer\nalon\nmonies\nriel\nchit\nsaloons\nner\ngaynor\nallianz\nnaoki\npurges\nungulates\nnimitz\nprivatized\nmcnabb\nseu\ntrespass\npeppermint\nlombards\ndweller\ncompatriots\njuices\nrimmer\nconsummated\nsilicate\nhdmi\nmachete\ntinsley\nposh\nbrainiac\nfestschrift\nfallujah\nbushnell\nviana\nhausdorff\nkbo\nmurex\ndartmoor\nnugget\nunbreakable\nalfalfa\nmips\nbystanders\nauthorizes\nzdf\nglandular\ngiri\ndisulfide\nlousy\nmodestly\nlodgings\numl\naustral\nglimpses\nprecluded\nnoteable\ntoki\nanalogies\njochen\nudi\nmuammar\nmadigan\ntelemetry\nmacao\ntrabzonspor\neject\nchoudhury\nbletchley\nbagpipes\naap\nmulhouse\ncowen\nvercelli\nalejandra\naku\nsassoon\ncatalans\ncreatively\nranchos\nbrookside\nmishima\ncto\nsnub\nclaudine\nfundraisers\ncalico\ncaustic\nmellitus\nbooklets\nseamount\nbroadest\ngakuen\nburnout\nscraping\nkazuya\nsneaking\nvanier\nlucía\nsubsided\nradiological\nbloodstream\ngreville\njunkie\ntsa\nilyushin\nshakers\nnicklaus\nyuji\nshouldered\nfergusson\ntedder\nmainwaring\ncosworth\nclearances\nberetta\nintolerable\nkenshin\nassimilate\nisolates\nmeandering\ncleanliness\nforties\nespecial\nlcs\ngoswami\nshoreham\npicton\npolitique\ndispensing\nastrologers\nboney\nporches\nkain\nhasegawa\nequating\ninferences\ntheron\nbluntly\nmvc\nbogota\nemmanuelle\nplatelet\ndisorderly\ntiller\nisomers\nmpumalanga\nspeciation\nludacris\nnortheastward\nscuola\noutlandish\nlurking\nminna\nrushton\norigami\ntos\nmöller\nerasure\nherbivorous\npayoff\nkennett\nköhler\njy
oti\ncyndi\nwestgate\nstirlingshire\nisotopic\naccrued\nnewsreader\nsono\nhuntly\nfuze\nkagawa\nretrospectively\nbaudelaire\nlegia\nfished\ntimbre\nforfeiture\nengelbert\nkremer\nmorison\ninformants\ncallie\nbrochures\nulcer\npinellas\nparoled\nterrance\nbalustrade\nsparkle\nchaka\nformaldehyde\ninterdiction\nbannerman\nfoothill\nsubtract\nsnoopy\ntippett\ncommendable\ncommemorations\nwhoops\nlongstreet\nmarwan\nimplantation\ninvicta\nlittered\ntomo\nctc\npohang\nhatchery\nbnf\nutes\nili\nserfs\ngilda\nlachaise\npattison\nnoida\nglazing\nroyston\nneves\nglycol\nkareem\ncatholicos\nospreys\nspitz\nkjell\nrevisiting\nkui\nbowing\nbonne\nwilliston\ngrumpy\ntaganrog\nnatchitoches\nbernhardt\nsaf\nsenseless\npradhan\nboilermakers\nbledsoe\neducationist\nportillo\nraged\nlessing\ncryogenic\nbyproduct\nquinta\nhallett\ncatheter\nfamitsu\ndelisting\nunskilled\nordinate\namigo\nponting\nespinoza\nsmog\nimageshack\nstagg\nmungo\nzan\nfranciszek\njara\ninterregnum\nmasted\nnormand\naffirms\ntarantula\nverdes\nssi\npitting\nnuestro\nenlighten\nwarfield\nbata\nmitte\ntexarkana\nbowser\njacqui\ndfl\ndecays\nyalta\ncaptioned\nstaines\nelectropop\nusurper\npantry\nfenders\nqpr\nsaatchi\noutermost\nbodine\nbrainchild\nmaracaibo\npowhatan\nchests\ncherie\natmospheres\namaral\ntampico\nozzie\nsedimentation\nspringtime\npogroms\nelitist\nmaw\nadela\ndatu\ndevo\nporridge\ngeeta\nwellman\ntunstall\nriflemen\nnehemiah\npirie\npaperbacks\njourneyed\nscams\nbourges\netihad\npuno\nhagiography\nvicars\nmam\ndiogenes\ncvp\nfsc\ncurlew\npseudoscientific\nrangel\nchattahoochee\nkafr\nbrontë\ngrandmasters\nandrás\nclef\ntaoyuan\ninsolvent\ndeming\ngreenleaf\nhynes\ndares\nní\nobit\nwma\nbrava\nnussbaum\nmemoria\ntasha\nfailings\nimpasse\npetrovich\ndialectical\nunmasked\nkendo\nano\npma\njusto\nstoic\nmarianna\ndamir\nhues\ncathcart\ncastell\nkolhapur\nscum\nobstructed\ntaylors\nblakely\nflack\nhindrance\nbalanchine\nshowings\nbruton\ncours\ntriumvirate\nhanoverian\nconverged\nranders\nharrassmen
t\nahan\ngori\nmcewan\ncirculatory\nlitt\ncoombs\nsacra\nzed\nbento\nchurchmanship\naudacity\nunprofessional\nlarue\nmenacing\nblackhawk\nodes\nidiomatic\nmaktoum\nshirin\ncvs\nthetford\ngoblins\nbespoke\nramsgate\nchitty\nteas\njeunesse\nelbows\nhyo\nlarisa\nornithological\nmim\nnajib\ntheropod\niridium\nresidencies\ngrosjean\nstitched\ndesolation\npervez\nmorricone\ntamura\nironside\nlivres\nvive\nscheer\nshutters\ncovariance\nfrisbee\nenfants\netymologies\ndagmar\nhaddock\nenugu\nepoxy\ngenk\nyume\nclouded\nconferring\nheadliner\nspaniel\nkaspar\nconvents\ngamefaqs\nscrivener\nforma\nwik\nmerrie\nfiore\nhec\npredictor\nhospitalization\nduh\nhibs\nbyes\nmodifies\nmuhlenberg\nnuance\nbonny\ninflection\nkenyatta\nloris\nyarborough\npolis\nconstantius\nimtiaz\nyukio\nsubsystems\nfoal\nbarricade\nyiu\nabhishek\nmikoyan\ndiscontinuation\nreminders\nveena\nreread\ncleves\nsteels\napplaud\nbroadbent\nbatchelor\nhaugesund\nevict\neff\ncagney\nanorexia\nbiscayne\nringside\ndore\nsiret\nbouncer\ngagarin\nrémy\ndailies\ncortland\nrejoice\nhunchback\npreviewed\nyells\norme\neasing\nwardens\nascendancy\nrunaways\nmarika\nbruised\nscanlon\nnrg\ncroton\ncolima\ndictatorial\nlicentiate\nmoorhead\nadjourned\ngato\neilean\nmuncie\nstipulates\nmasato\noup\naliabad\nengl\nthelonious\nchrissie\nprolong\nouse\nmarketers\nmuskogee\nmaryam\nreproduces\ncelebratory\nmiró\nzo\nenergia\nboldface\nshafi\nkonya\nredshirt\npayout\narnie\nrabi\nisr\nelkhart\nchurchman\necc\nvilhelm\nxiong\nreconfigured\narchway\ncarapace\ndries\nloughlin\ngamespy\ndisassembled\nnematodes\nheide\nsmelling\nrekha\nrayleigh\nhydrothermal\ndoorways\nmicrofinance\ngabi\nmarysville\ndimaggio\nimprints\nbleachers\nmujeres\nuriah\ndecapitation\ndiemen\nbanfield\npires\nautonomic\nbanishment\nphony\nkiyoshi\nsascha\nqazi\nearhart\nconjoined\nconquerors\nbugatti\nuniversality\nalleys\nsupercars\nbrouwer\nunleash\ntaf\npapadopoulos\nkondo\ntithes\nrailing\nperils\negmont\nsuede\ncargill\ncoldwater\nfalla\nmeagher\npolicym
akers\nudupi\npontypridd\nmalla\njee\ncarteret\nsimplifies\nboden\ntenacity\ninfatuated\nwelker\nreenactment\nokinawan\ncomplainant\npaphos\nmartians\nrepetitions\nparsonage\nbastian\nmultidimensional\nstartled\ncanonization\nrambler\nsambo\nunaccompanied\ncarnaval\nresins\nmcmullen\nimprecise\nhartwell\nbesieging\nendicott\nsilverware\nbenefice\nchilled\ndarshan\nwaylon\nsagra\nneurosurgery\nprintable\nzagora\nchae\ninvalidated\ntucked\nsuspiciously\nsorel\nloudspeakers\nunfriendly\naudited\ntisserand\nsymbiosis\nseon\nverifiably\nmarrero\nspat\nseraphim\nhelsingborg\ngrime\nrepresentational\norphanages\nequipping\nacrobat\napprehend\nautres\nnicene\ndearly\nibid\nchalet\nweep\npradeep\nstarz\npenicillium\nmoretti\nait\nthankyou\nocd\nadmires\ncardona\ngranary\nsebastiano\nhani\nseyyed\nunreadable\nsandstones\njoyous\npolymath\nhabitually\nglycine\nfjords\nnotting\nwhips\ncromer\ncomplicating\nperpetuity\naspired\nexorcism\nlorain\nhardening\ntriplet\noust\nimelda\nassailants\nbrownlee\nseparable\ncolley\nardèche\nsubs\nspenser\nsams\nrépublique\ncladding\nmolson\ndissipating\nnpl\nbeulah\nchisel\nrostam\ncarton\nwushu\nlaguardia\nmykola\narxiv\nspyware\nvoigt\nnoxious\nayrton\nstrides\nbled\nwildstorm\nuic\nchancellery\npetrified\njhelum\nuppercase\nadenosine\nnighthawks\ncivilly\nmundy\nbirthdays\ntuam\nfallacies\nfreund\ngeoffroy\ndeanna\nlatour\nchikara\nrona\nrelented\nconfusingly\ntilbury\nlunches\nkilo\npartying\nsleigh\nstratum\nivanovo\ngenet\nnarayanan\ncastellón\npape\nfoxy\ndisseminating\nlcc\npuddle\nug\npasir\nalphanumeric\nunderpass\nnovelette\nhaskins\nexpositions\nguardsmen\ncheats\nclays\nbusinessperson\nsrebotnik\nvoorhees\nmohanlal\nencapsulated\nlevon\nhemispheres\ngloster\nppi\nintermediates\ndram\nstratified\nbastions\ncoven\ncools\nfrontenac\nbian\nesophageal\nmclennan\nabn\nvernal\ntls\nleones\nkinabalu\nrenominated\ntranscendence\ndespot\ndaf\ngwynne\njest\nrosebud\nbridgwater\nsandberg\ncautiously\nila\nrhone\nfishy\nadf\nmiyuki\ncmd\ntyp
ographical\nthighs\nlesnar\ndahlgren\nsculpting\npatrollers\nbravely\ntorches\nyana\nsolothurn\nsported\nglade\nrecurve\ncarmela\nallotments\nminefield\nfilipina\nhammett\nunbound\ncredo\ngreenhouses\nkelp\npennine\ninventories\ndropout\nsalamanders\nleyden\nfakes\nmower\nespace\nnematode\ncrossley\nnansen\nfaddle\ndarkly\nunchecked\nite\nanbar\nverdy\nrinehart\npho\naccuser\nhateful\ncuttack\neisenstein\ntrna\nkovács\nkomi\nshuts\nindonesians\ndecentralization\ncollider\nhauts\nberdych\nhylton\nteleplay\ncoincidental\nvander\nswinburne\nkamala\nauch\nabb\nblm\naxons\ncruciate\nganglion\ntuen\ncondo\noctavio\nconserving\nsheltering\ngrzegorz\nbelleza\ncession\nelicited\nmarple\ntrestle\nsandals\nbrodsky\npurchasers\nvpn\nmithridates\nwithdrawals\nconch\nhollister\nvigilantes\ninfringed\ncolouration\nrewrites\nduchesses\ngalvin\nwaziristan\nrobison\ndraftsman\nframingham\nalumnae\nbla\nirfan\nolde\nbbq\ndatum\npacquiao\npentax\nkilowatt\ngamelan\nstiller\nvilna\nearrings\nmetropolitans\nlucan\ncarradine\nansbach\ntimmins\nstink\nbhat\northographic\nasexual\ntoru\ninstructive\nprofessorships\nunchallenged\nsafi\nverbose\ncristobal\nfih\nlacroix\nenvision\narce\nworsen\nboynton\nhurried\npolymorphism\nkoran\nmannerisms\ndeviant\nprov\nremanded\ninwards\nadmissible\nhourglass\nleveson\noppenheim\nrenunciation\npusan\nshogakukan\nmccaffrey\nkure\nlundgren\nkeswick\nshaheen\nmoorland\ntz\ntransporters\nscapegoat\njorgensen\nlossless\npappas\nnuys\nmaterially\ntottori\ndiseased\nkirilenko\naeroflot\nwormhole\nrevel\ncracow\nperihelion\nplush\nwsu\njinx\nbaltistan\nkhwaja\nanachronistic\naiden\nharvests\nannabelle\ntalkies\nweblog\nluxembourgish\namasya\ndysentery\npál\nzn\nbuell\nretrieves\nmortals\nwhiteside\nslowest\nbunyan\nuga\nuninformed\nundertakings\nxo\nsoybeans\nunstressed\nauditioning\nparlement\nobservant\nevesham\nfullest\navia\ntrill\npuppetmaster\nnogueira\nterns\nordinator\npra\nsecunderabad\ngriffon\nlangs\nrupp\nsalinger\nnuman\nelan\nstourbridge\narmas\nl
inde\nincision\ngauls\nhammarby\nlenovo\noctavius\nkerrigan\nequalizer\nrumsfeld\npanicked\nplat\nblackfriars\ngarvin\nfret\nrespite\njanie\ngrosseto\ndelve\nsark\nliteratures\nbuf\narmchair\nheadway\ninstaller\nasshole\ndemobilized\nbradfield\ndiderot\nslovakian\nrecitative\nbaek\nleviticus\noxfam\nmachinations\nantimatter\neinem\nsacs\nntc\ncrossword\nviv\nnubia\nmelba\nreiter\norigine\nroadmap\npetrels\nsimba\neurodance\nturntables\nhunterdon\nogawa\nafca\nnuovo\nyanukovych\nkrajina\nhaverford\nraipur\necologically\nchristiana\nplacings\ndisengage\nmaclaren\nglenorchy\nchoking\nnakhchivan\nhsien\nreston\ncruciform\nmeine\nachievable\nhollander\nstrove\nimpeded\nlionsgate\nkrista\niman\nvented\ndialectic\nsamos\nbiao\ntranscendent\nltc\nmacdougall\nzohar\nqm\ninvades\nkish\nunseated\nenver\ncarnation\nviterbo\nvial\nlivelihoods\nfaiz\nviswanathan\nppl\ncalton\nbismuth\narguable\nswenson\nmfc\nthad\nnarvik\ncodecs\nserpents\nherbivores\nsemple\nhajji\ncimarron\ndrammen\npotawatomi\nmullet\neso\ntyndall\nlaserdisc\ntaran\nendanger\nthrowback\ngeometries\ncounteroffensive\nmores\ncommend\ntengku\nwhack\nlull\nargumentative\nishaq\nhildegard\neyewitnesses\nbindings\ndaggers\ncockatoo\ncéline\nlighten\nproust\nmór\nmephisto\nsuperstitious\nindentation\namaya\ntakao\ndefensible\nnurtured\nkuroda\nnorepinephrine\nfouled\nissf\njogging\nbelizean\nblackboard\nrifleman\nakshay\nnovelization\nsodom\ntristram\nbrigid\ndiez\nnutmeg\nboing\nschenker\nheilbronn\nguano\nhologram\npnc\naas\ncrusoe\nobliterated\nilluminati\nludovic\nenzymatic\nwrangler\nvenkatesh\nblower\nribosomal\nfigurine\ncorreia\nreportage\nnegotiable\namway\nyeti\nmaidan\ndependents\nsauna\nenforcers\nlévy\ncic\nfryer\nselva\nretaliate\ntenancy\nautónoma\nvashem\namun\ndubs\nmccauley\ncongregationalist\nbullard\nribble\nshillong\nkartli\ncenterville\nfossilized\ncordell\nmacneil\ndiners\nmelkite\nslideshow\nrodionova\ninfighting\ndecomposed\nculled\nsiddharth\ncockroaches\nawry\npretentious\nlindisfarne\nkenn
ebec\nheaney\ndiscriminating\nportability\npassageway\nstorefront\nerupts\ncustomarily\ntableaux\nswordfish\npoco\nmurky\nhoosier\nhime\nhisar\noutings\nmayall\nthematically\nkalyani\nkoda\nselves\nsbk\nnerds\nminton\nhiawatha\nteck\nshove\npriyanka\nbz\nsrebrenica\ncrain\nwildflowers\nsubtraction\ndaedalus\naly\nbukhari\nhightower\nprospectors\nincarnate\nvespers\ncaged\nmasaki\nboll\nstrathcona\nvladimirovich\ngwynn\ntristar\nhonorius\nprospero\noxidizing\nintrospective\ndisruptively\nlepage\nviic\nroja\nperpetually\ntsinghua\nhuw\npnp\ncapitan\ninna\nlubelski\nteased\nmurfreesboro\nolmec\nmotets\nengle\nenrolls\nradii\ncascading\nblueprints\nkillarney\nreeder\nthackeray\nkigali\ntuareg\ncerf\nstarbuck\nnairn\njanne\npremiering\nsemicolon\nadapters\nasparagus\nleduc\nstandstill\ncayley\nhyphae\nbabyface\nchanneled\nfaff\nfangs\n₤\nmiquelon\nhallmarks\nnajaf\nsacristy\nunlinked\ntrinidadian\nramallah\nundergrad\ncmu\nfanatical\nmanolo\npictish\neads\nlipetsk\njalil\nbolo\nrestroom\nmonocoque\nemitter\nlasso\napplegate\ncandlelight\nschell\nschenck\nstover\nimmersive\nscraps\nvdc\ngosh\nclumps\ntrusses\nedgbaston\nbraden\nmiri\ncypher\ndedicating\nantimony\ntappan\ncece\ninsistent\nsouthall\nkeogh\ncondense\nruffin\nthrusters\nsymbolist\ncompensatory\nsensed\ndipper\nmarg\nsixtus\ngestalt\nweevil\ngsa\nvesta\nwhatnot\ncricketing\nhaasan\npaoli\ntyrants\nchula\nhuon\nsdf\nprescriptive\ngroovy\nstasis\nlounges\nturkeys\ndreamers\npredating\narmani\nmommy\nberbers\ndonuts\ndcs\nuneducated\novertures\ngotthard\nanatole\nhappiest\nsancti\nbartley\ncyprian\nunderstandings\nreformatted\nberio\nnipple\nbalthasar\ndriest\nsou\nkarpaty\nferrous\nbanshee\nkoizumi\nmanhood\nbgc\nmoulding\neminently\nedicts\nslashed\ntaiko\nlinder\nbeaker\nsplice\nalef\nsilliness\nkanazawa\nstillborn\nryszard\noregonian\nchetniks\nbojan\nrickard\noradea\nbuckwheat\ngourd\nsawmills\nzé\nmindfulness\nlockyer\nairshow\nirgun\nfluctuating\nhowes\nbroccoli\nammonites\nlieut\nmalawian\nmillicent\nendp
oints\nmetrical\nelphinstone\naccelerators\nleu\nmagazin\nkampf\nyolk\nvillager\nsaddles\nscca\nfirsthand\nmoderates\ntriangulation\npratchett\nauntie\nmanoa\nmorena\ncheckmate\nionizing\nminima\ntelluride\nuid\nuwa\ndebtors\ncotabato\nsupercentenarians\nrechargeable\nlandsberg\nkarbala\ngranules\nyrs\nwhiskers\nperumal\ntranscend\ntreasurers\nadmittance\nnatak\namerindian\nfortis\ntriumphed\nfawkes\nidp\npuffin\nfraught\npromos\nscapa\ncolonize\nbinh\ngoan\nvindication\nprays\nmankato\nquebecers\ncoors\namicably\npeale\nthang\nnether\nfess\ndhc\nvlaanderen\nmarksman\nhomelands\nshona\ncts\nsiegen\nsoweto\noxnard\ntycho\nlaissez\ninfractions\nbudweiser\npaulie\nseaway\nquintessential\nscorched\nredefine\nhelmholtz\nmontagne\nnonconformist\nskated\nupsilon\nanorthosis\ncallao\nbolu\ngreensburg\nstreamer\ngrenier\nslovaks\nsweating\ninterconnection\nsalamis\nfls\nhine\nwaseda\nxn\nmultiracial\ncapote\nkeri\nrowdies\nifl\ncomer\ndidi\naccretion\ncanister\nfolkways\niro\nyelled\nseceded\nsacco\nsihanouk\nshredded\nplatnick\nstereoscopic\nichikawa\nbigot\nzambezi\nstubbed\nshanty\ndynamos\nternopil\nmexicano\nsienna\nrenner\ndemolishing\nmq\nraiser\nrosedale\ncady\nkwajalein\ndeclarative\nvsevolod\nkuk\nzora\nseiji\ntwig\nticonderoga\nmesolithic\ngondwana\nassertive\nhippopotamus\nets\ncohorts\nhypertext\nadversity\nmerovingian\nvotive\nbarrera\ncacti\nsti\nzander\nvomit\npago\nescola\ninflected\nazusa\ndimitrios\nkidnappings\nboardroom\nhalford\nastray\nbenazir\naños\nrbs\ntakedown\npuntland\ntisdale\narguement\nsalyut\nneared\nastrophysical\nhorrid\nneel\naño\nforges\nchitra\nchroniclers\nconspirator\nsaya\nsubmits\nderided\nnuri\nalborz\nona\nmeara\nswapo\nlita\ncapes\naggie\noverlying\nbobcat\nadderley\ncaveats\nmartindale\nfouls\ncampinas\nignaz\nfederación\nweald\nratcliffe\nderail\nencircling\nmemento\ntulare\ncolosseum\noas\nresubmit\nairfoil\nregionals\nsimulcasting\ncallan\njacksonian\nalstom\nfruiting\naether\ndissected\nmrc\neugenie\nahsan\nreconquista\nioann
ina\nacreage\nhdd\njulianne\nhdr\nforgiving\nserrated\nmolyneux\nkruse\nbuttress\nurchin\ncourtyards\nseuss\nspringbok\nbiome\nazimuth\npromiscuous\nsuperconducting\ngambian\naskew\nliberally\nvellore\nlevers\nlaxmi\naubin\nhermits\nmashed\nhaslam\nshipbuilders\nraphaël\nmurillo\nbradenton\nunscathed\ncomplexion\nblasters\nfrescos\npimlico\nbev\npoblacion\nchristiaan\nyuko\nthickened\nlandlady\nboleslav\ncoquitlam\nextensible\nretinue\nsuo\nreichenbach\nbrevis\nkarlovy\nlantz\nziegfeld\nhollins\nnjcaa\ntexaco\nopenstreetmap\nmaur\nmadrasa\nstrang\nacland\nasn\nalessio\nrenovating\nacne\npayouts\nagri\nsigner\nliege\ntov\nscorn\nkazi\ngamepro\nlongchamp\nconfidently\ncartels\ntrowbridge\ndostoyevsky\nroseanne\ndolomite\nvergara\njabalpur\nglottal\neliyahu\nsiem\ndnb\nlobsters\nnakayama\nskyway\ncarnot\nsinensis\nsolute\ngatorade\ndatasets\nautocratic\nsloped\nundeniably\nconferencing\nmuck\nrtp\ndislocation\nsoften\nassembler\nkaduna\ncybernetic\nalun\nribbentrop\nalabaster\nhemel\ndps\njanko\nnakagawa\nbinaries\napocrypha\nbenzodiazepines\ndifferentiates\nstun\ntiverton\nsafeties\njewett\nrishon\nfaris\nbouvier\nvivienne\nronstadt\nburgoyne\nyorkers\nwaukesha\ninbreeding\nstaffer\ncem\nmourn\nepiscopalians\niceman\nichihara\nlackluster\npaganini\ndún\nconchita\ninterconnect\ndouai\nyoungblood\nromán\nndtv\njal\naylmer\nflimsy\nunplaced\nattentions\nbridgeman\nfedor\nbassists\nnewlands\ndebutant\ntinge\nido\nunita\npigott\ndarya\nknobs\nbulleted\nnox\nsopa\nrinaldo\ncompetitively\nphysiotherapy\nreclusive\nkarlsruher\nkingswood\nverdasco\ntrainor\nschäfer\ncreationists\nlichtenberg\nchillum\nharun\npetitioner\napia\nparasol\nrashad\ndeliberative\ntwinning\nteruel\nsua\nmurugan\noutage\nosorio\ndandelion\nbenji\nintersected\nmsm\nscarlets\natolls\nblackface\nemp\ndrunkenness\nweis\nfiu\nenergiya\nhooke\nyamazaki\nwitold\nunsung\nfliers\nzar\nborgo\nwrongfully\ngenji\nbuccaneer\nwoodforde\ndonut\nkiosks\ngobind\nunholy\ncartographic\nrosita\nnikhil\nasd\nradiate\nlacto
se\nsor\npha\nscented\nwinemaking\nmuzik\npipers\nbonilla\nhandshake\nambulatory\nchih\nuniverselle\nhesperian\naggressor\nmolasses\nparadoxical\nsorrento\npiranha\nsnapshots\njamia\nbeveridge\nhaque\nnaam\nsovetov\notc\npater\nprovincetown\npacifism\npinnacles\ndisclaimers\nfatwa\nléopold\nhypoxia\nacetylcholine\nroadrunners\ncaxias\nclinging\nwittelsbach\norestes\ngroomed\npropositional\nbarbet\nlemons\nxanadu\ngovind\nbroadsheet\nradek\ncountywide\nmoy\nmints\nshiner\ngazing\nhunslet\nskylark\nmustapha\nprosser\nchampa\norc\ncnt\ntrapp\nmaynooth\nbrasenose\ndharwad\ndistasteful\nlarus\nintermarriage\nstingrays\nflagler\njarman\nmaida\nwhirlpool\ncollated\npragmatism\nsubcutaneous\nincised\nimpersonator\nchuuk\nsuis\nsameer\nrashtriya\nrumba\nhayman\nbem\nmerwe\nmistaking\nrivadavia\ndistractions\nmanipulations\ndaze\nchlorophyll\nfabre\nbrooding\ntaverns\nnani\ndiatonic\nsoftened\ncanby\npaducah\nshortland\nshales\nuppermost\nlinköping\ndeportiva\nworshipping\nworley\nindignation\nexpelling\ncea\nmayoralty\nconciliation\nretainers\nbjorn\nborderlands\nintents\nhussar\ntroms\nweezer\nhbc\nfitzmaurice\npiecemeal\nkitson\nstranraer\nhuygens\nyazoo\ncroquet\nadl\ncutie\nprodigious\nventuring\npatchwork\nmcentire\nozawa\nossetian\nsegura\ndlr\npoulsen\nduda\nwillful\nihr\nhumorously\nbohème\nuneventful\nbelém\ndelinquency\nprioritize\ndraconian\nintolerant\npz\nknud\noutsourced\naorta\nnashik\nbudgeting\ndoordarshan\nravana\nbishkek\nerste\njaco\nfredrick\nfingal\nrerun\nwoodard\nkrazy\nfinnegan\nseafaring\nswartz\nwyeth\nqmjhl\nadeline\nturbojet\nimpersonal\nsherwin\nmycenaean\ntetrahedral\nnalanda\nmasa\nhendrickson\nkumi\nnavi\ndian\nballoting\nmichèle\nknowle\nselene\ntrigonometry\nmingled\nharmonia\ncommedia\namaro\ncinco\nsdss\nconyers\nwinded\nursa\nyouzhny\nwilmer\nicp\nmiho\nsojourn\npetrarch\nebro\ndeadman\nsadar\nincessant\nbreisgau\nemporis\nbitumen\nzigzag\nroseville\naquariums\nkx\nservo\ncowdenbeath\nwozniacki\ncontingents\nplattsburgh\nitch\nkulkarni\n
sundar\nodom\nhoneyeater\npausanias\nkeck\ntriennial\npulsed\nyuwen\nretainer\ngrievous\nvimeo\nrichman\nseverance\nkinnear\ncoconuts\npant\nchanda\nmarquise\nfreya\nnys\nclaro\nscarface\nmannered\nichiro\naam\ndespicable\nbulldozer\nayurvedic\nsalish\ntvn\ngaro\nsolder\nusns\nnefarious\nhumpback\nmenagerie\ncommenters\nxmas\nthermometer\ndrydock\ngeosynchronous\nietf\nprs\neradicated\ngera\nbeda\nmestizo\nmitigated\ndative\naslam\nascents\nresignations\ntufted\nlimes\nschopenhauer\nwhoa\nvladimír\nairtime\nquicktime\nsiedlce\nmiliband\nsufferings\nvaluables\nardmore\nmobilised\nbridle\nzaidi\nextravaganza\n,and\npropriety\nheptathlon\nhn\nhandedness\nangra\nkut\nguaraní\nbeltrán\nife\ncorry\nkafelnikov\nforgo\nbeep\ntarc\nidi\nspeck\naileen\nriverbank\ngodolphin\ngehrig\nkarelian\nsteinway\ncymbal\nsuman\nvindictive\npairings\nflywheel\ncatwoman\nbracing\nhonoree\ncoulomb\ngor\nbada\npazar\nvices\nshallower\ncathay\ngrierson\nstal\nnzl\nmodalities\npurify\ncompiles\ncwa\npresides\nganja\nchromatin\nincriminating\ngiver\nhillingdon\nlanza\naaaa\ndefensor\nhospitaller\npubic\naverted\ndeserters\nbridgetown\nfaw\ngorkha\nhuck\nrudeness\nintertidal\nsilvestri\nsalto\nincitement\ncerezo\nwairarapa\ninterwoven\nepigenetic\nadherent\nmedallions\ndefies\nyuba\nhansson\npronouncing\npelé\npetroglyphs\nrau\nindestructible\nkeyser\ncaregiver\nsupercopa\nstockbroker\nhaverhill\nbogged\nwhitchurch\nabra\nprepaid\nmasala\nbodley\nmiz\nhaganah\nbyung\nberta\nakan\nnewscaster\nhazy\nblandford\nescherichia\nchats\nphenotypic\nquayle\nlyonnais\ntelegraphy\ngoblet\nbedi\naccesses\nnightjar\nalo\nnascimento\nmms\noglethorpe\nyardage\nedgewood\nruy\nsherrill\npreponderance\npsychopathic\nbitmap\noccipital\neller\ntheban\nresponsibly\nalight\ntymoshenko\nshueisha\ntravolta\npipit\nwarrick\nketchup\ndefuse\nmolars\nvisigothic\nlaziness\nmidge\nmethionine\nclydebank\nflagrant\nindustrialisation\ntortoises\nanse\ndrôme\nsquarely\nmikhailovich\nantofagasta\nlehrer\ncosine\nmakassar\nazov\nf
atih\nreebok\naltruism\npatiently\nherbst\ncounterfeiting\nmesquite\nheim\nembossed\nvindicated\nconklin\nmarcella\nneuilly\ngyeongsang\ndonn\nlomond\ncomposes\ncathal\navellaneda\namericanism\ngrieve\ncanvass\nchristo\ncastings\nhandbooks\nnewcombe\nhelge\nreznor\nrubus\nmayweather\nharu\nschaffer\nêtre\naranda\nnanotubes\npdfs\npcm\ntremor\nwidowers\nmiro\ntrott\nlockers\naccumulates\nconjectures\nballymena\ntakagi\nfujitsu\nellicott\nbowe\nnazaire\nschooners\nparadiso\nbiswas\nproviso\nkrieg\nunbearable\ntantamount\nkcmg\nwharves\ntumble\nreturner\nphotosynthetic\ndecadent\nanadolu\nheartache\ndegrassi\nkkk\nkublai\nzindagi\nsociedade\nesophagus\naline\nintelligencer\nnodules\nsynchrotron\ncelt\nmalpractice\nashburton\nexcite\nankles\npositron\nemmons\nsparring\nforesee\ndamnation\nvz\nyc\namidships\ndunbartonshire\npenza\nnida\nwoodpeckers\npulley\naar\npopularizing\nglitches\ncécile\nhefty\nbabol\nbadd\nbristles\nmayday\nkono\nboutiques\napricot\ngomel\ntiberias\nsocialization\nstatuary\nordinated\naldehyde\nwebbed\njf\ntoyo\nevangelista\ncanaries\neamonn\nhamada\nthc\nalmonds\nmccready\nacuña\nquarrels\nindio\nsemiotics\ncetaceans\ncoupons\nkristy\nprogressions\ndixit\nhypothetically\nseongnam\nfarkas\nailment\nextinguish\ndixieland\nmamoru\noscillating\nzr\nstopover\nsel\ngrammer\nbcc\nwinkle\nabstained\nchristendom\nsalmonella\ntogolese\nconquistador\nutterance\ntravelogue\nbuchenwald\naftermarket\nbelton\nsucculent\nlucchese\nabo\nevasive\nhanns\nunravel\nreunions\ncompulsion\nparam\nshiro\nmorey\ntokelau\namol\nsaprissa\nchanger\nedberg\nbiafra\ncarioca\nncos\npredominate\nprospecting\nantiquated\nalgal\ndiphthongs\neec\ndecorum\nbearcat\npécs\nsolidify\npreemptive\nenlarging\ngat\nclijsters\nleonor\noverrule\nslayers\ncovertly\nanacostia\nsidekicks\napaches\nsubvert\nproductively\nsurinam\nforeseen\nbabak\ncontrollable\nadverb\nhangout\nbadlands\nrecluse\ndanske\ncinque\nshalt\nbentinck\nzec\nsuu\nlennart\nanointed\npledging\nuninteresting\njindal\ndeepen
\ngraced\nhuo\nyamanashi\nkellie\nlode\nbracketed\nursus\nanja\nprewar\nelland\nsubtlety\nflashlight\nkpa\naqueducts\ndonington\nilliteracy\nmirabilis\nrafe\nsmallwood\ndrifter\nccr\nkel\nrowlands\ncme\njosefa\nnominates\nstrode\nsimona\nfrosty\npergamon\nrepulse\nwale\nfilipe\nberyllium\npetros\npreferentially\ngenotype\ncheckered\nwhiteley\ndama\neurogamer\ngarber\naraújo\nnuffield\nbiofuel\nacker\ngustafson\ntrebizond\nphonetically\nisham\nsubordination\nwoon\nduchesne\nshrewd\nshaka\nreb\ntipp\ncob\nprospectus\ndreaded\nclaes\ntrickle\nwasteful\nbuchholz\nstreep\nlacs\nallude\nclocked\nventer\ngermane\nbrougham\ncamber\nhalved\nsalads\nfut\ngemstone\nprejudiced\nshunt\ninked\nremixing\namx\nreplenish\ndownriver\neamon\ncommonality\njanaki\nuniversiteit\nholdsworth\nbeardsley\ngeri\nstanfield\nassays\nexclusivity\nbaru\ngerms\nym\nexcused\nunforeseen\ngpo\nspirou\nquantification\nwk\nmanipulates\ndicks\nyousef\narantxa\nabrahamic\nsmock\nwatercolours\nbombard\nzoroastrianism\nuscgc\nprovençal\nsophocles\natsushi\nkadokawa\ntauranga\napologizing\nvoix\nbecuase\nmithun\npowerfully\npickard\nkasai\nqasr\nbergeron\nforcible\nunsolicited\nlongwood\nesch\nsynonymy\nsparky\nmonro\ntyrannical\nkozlov\nlauda\nmontparnasse\nprizren\npzl\nleiria\norquesta\ndimethyl\nuru\nstasi\ncushman\nnevers\nnarcissistic\nhilde\ndesalination\nhollingsworth\nfamille\nobjectors\nree\nrajasthani\nimmunization\nprepositions\nmariachi\ndukedom\nfenn\nfaraone\ngrating\nchios\noverijssel\nblakey\nlevies\nbernini\nkilbride\nribeira\nmaliki\npontefract\nsamadhi\nhariri\nterme\ndislocated\npicardie\ncharacterizations\nfacilitator\nflue\nsheeran\npettit\ntaka\nqarah\nminter\nsiti\nhiroki\nselfless\nicbm\ngreenhill\ntogliatti\ndemotion\nmodems\namharic\nmarla\nbarometric\nbonsai\nfabius\ntorturing\nconservationists\ntransposition\nracked\ngreenwald\ndamning\nyeager\nshuster\nricard\nmagsaysay\npds\ndilemmas\nwidgets\nbreuer\nplagiarized\nsoden\ncahiers\nmomentary\nguilherme\njagiellonian\ngetter\nzi
pper\nslav\nbolger\nepithets\nheralds\nsingling\nnorad\ncrazed\noffa\nbodmin\nsomalian\noakdale\nosasuna\nflattering\nnegri\nrestarting\nwer\nempoli\nmastercard\noptimizing\njig\ndivorcing\nbrereton\ngielgud\nalexandrian\nsnowstorm\nclot\nemphasising\ngalli\nnar\nnacho\nfranchi\nchs\nobstruct\nesta\ngliese\nvukovar\nblockhouse\nprius\nreuptake\nscraped\npreoccupation\nfeelin\nhino\ncrewman\nplacekicker\nliberté\nwoogie\ngab\nanatoli\nroush\npremios\ndepriving\nsteely\nfemininity\nhexham\ndura\nmarshalling\nmerino\nconcubines\nhes\ncontravention\nminesweeping\ngreener\nkeeling\ngascoigne\nscrutinized\nsubdistricts\ngeneralize\nchoe\nscholl\nsrinivas\ncrandall\nevoking\ndex\nolivera\nrichfield\nboz\nsabotaged\nleitch\nbarroso\nllama\nruck\nrudra\ndif\nenda\nsatie\ncheong\ngraff\ninjustices\nuyghurs\naalto\nbahía\nhenriksen\nabdulla\npaseo\nseabird\nshura\ncantos\nzvereva\ndetracts\nstandardisation\nulterior\ntso\ntoth\ndeclension\npellet\ndonates\ncupboard\nexcised\nrectangles\ngennaro\nantonescu\nlavery\nfactorial\nscythian\nquantico\njari\nhock\nrabid\npreta\nibáñez\nmisgivings\ncapping\nmeher\nblurring\nkortrijk\nmaximise\nmarchant\nlibertas\nkahne\nfec\nstolberg\nburgesses\nfutility\nfishman\nrandal\ntartar\nsmurfs\nsalma\nconspicuously\nsilverbacks\nlifesaving\nislamophobia\nexporters\nmiddleware\neifel\nkalgoorlie\nbothwell\nbridged\nkeselowski\nshazam\noneness\nmabille\nsteiger\ndemocratization\nsummerslam\ndrava\nnuttall\njud\nsuffusion\nmorbihan\nqiang\nsurgically\ntraoré\ngroth\nleszek\nbefriend\ndecadence\nmoffett\nparatrooper\nconga\nphasing\nwinehouse\ntangentially\nkees\nori\nrmit\nocc\nparsi\ndetain\nnewsom\nseaford\nlumsden\nrdf\nredux\ninversely\nlum\nacademicians\ntaito\npastiche\nchatting\nutv\nhing\nkasey\nmansard\ncowardice\nperiscope\nanabolic\nsneakers\nheckler\ngosport\nmarquesses\ndolph\ndiploid\nwoolworths\nexif\nsla\nsolanum\nquintero\nprat\nfeuded\ntirelessly\ndikes\nkingsway\nrationalism\nhoned\npunks\naveyron\nphong\nstarve\ncanfield\nbre
athtaking\ngorgon\nmodality\nbayes\nsweeper\nkenora\nspectacularly\nobscuring\nleake\neltham\nunicorns\nlucretia\n✈\nkakheti\ngeraldton\nobeyed\nure\ncarling\nbasaltic\ngrader\nrearguard\nnimoy\nsufis\nemmys\nkleiner\nibanez\nepidemiological\nmarte\nsolent\nsandwiched\nhenin\nfissure\ndualism\nrips\nshifter\ncastaway\ncarotid\nkotor\ndisproved\nbroadleaf\nsotto\npauls\ndegas\nferraris\nstalag\nmethodius\nnonviolence\ncamargo\ndowner\nparaded\nbestow\nviagra\ndeuterium\nsrinivasan\ngazi\nbicycling\nexclaimed\neternally\ncouplets\nnutt\nnevill\naro\ntrailhead\ntakeuchi\nbrownie\npsychical\ndistorting\nhovercraft\nmitcham\npuss\ntwofold\ndistaste\nmutineers\nnullified\nnewnham\namina\ntamer\ninvents\nclichés\nsuccinctly\nij\nmegawatt\nbuddhas\ndushanbe\nchandelier\ndarwen\nfactional\nfaure\nmercator\nhyuk\nchipmunk\npatched\nbioavailability\ncolne\nzoot\nauthenticated\nsupercharger\nkoichi\ndiffused\nunattractive\nmattias\nexchanger\nalternation\njarring\nvejle\ndebug\nbathe\nappreciative\nloggia\ninés\nitchy\narai\nextramarital\noctet\nadcock\nyuk\ngalego\ntimorese\nbhi\nprune\ngenerically\nbenedictines\noily\nmarrakesh\nmizrahi\nbecca\ntupper\nirena\npanics\nlightest\nchidambaram\nmaksim\narabella\nballistics\nocala\nobstructing\ncsiro\ninyo\nlattices\novercomes\nfca\nintergalactic\nbegonia\nfiduciary\nwatercourse\ndempster\nresounding\npericles\nrepute\naharon\nfemina\nmigraine\ngrohl\nzhongshan\nrheumatoid\ntoughness\nsoot\npruned\nimbued\nquibble\nbrea\nsevering\njaume\nmami\ncolonist\nnarada\ngarb\nmejía\nirv\nneuroscientist\ndiscarding\nhippies\nbranford\njarosław\nunsatisfied\nmacaw\nprovident\ncarne\noic\nopm\ntooling\nmenominee\nhillbilly\nkaroo\npyruvate\nlinwood\nlld\ncyclical\nluleå\nsoa\ngish\ndavydenko\nih\nula\nwaldman\nsiempre\nketone\ndeniers\naccompanist\ncariboo\nhap\nmaradona\nmccollum\ncarnarvon\nbraided\nschlegel\ngaleria\nmagnitudes\nsudha\netheridge\neloise\nthrowaway\nvann\nteapot\nfutbol\ninlets\nshard\nalmanack\nadorn\nhawaiians\nyearning\nh
aunts\nrowman\ncampaigners\nprefrontal\npauses\nruggles\nactuarial\ngraphene\nnichiren\nhonorably\noscillators\nhives\ndanza\npacification\nhering\nbookings\nkham\nslotted\nilford\nnorge\nvillar\nprescribing\nadjoins\nsubprime\nsuborbital\nescalators\nbessemer\nraine\nkashi\ndisinformation\npicts\nleppard\nmetzinger\nshim\npersonified\nlahn\nepistemological\nxanthi\nchristen\nbooted\nwildflower\nboulevards\nchilly\ncollectibles\ndinar\nsteadman\nsagebrush\nmaturing\ngeer\nrochus\nbelenenses\nreggina\nvmware\nsteyr\nlegalize\ncasement\nelizabethtown\nini\ngregson\nminimizes\ngam\nwidens\noita\nbola\nhak\nttt\nescher\nnika\nlacquer\nbeadle\nroasting\nmmp\ndips\nfrenchmen\nmestre\nmoveable\nbrisk\ndementieva\nwtc\nmodoc\ncredential\nsmoothing\nschoolboys\npostulates\nnyman\nalfaro\ndevising\nyuka\nphilological\nmendip\nheffernan\ncancels\nashkelon\nkells\nrika\noutgrowth\norlov\ndebilitating\nrecep\nkirwan\nmci\nrapporteur\nfaerie\nanagram\nfirebirds\ncrowder\nwilhelmshaven\nmishap\njaber\nhisham\nabed\nmtn\nwook\nnaya\nbarranquilla\nboulanger\ntanja\nphonographic\nhalstead\ncommercialized\nventspils\nencephalitis\nreichsbahn\nwillett\nnameplate\ncytokines\ncotswold\nexterminate\nraisin\ntremors\nbuffs\nadder\ntyndale\ndangling\nfarsi\nkrusty\nbooms\npacifists\naest\npgs\nlimitless\nhumbly\ncranmer\nghani\nboe\nchildlike\nismaili\ntaunus\nsochaux\naamir\nponderosa\nserjeant\neverard\nhyacinthe\nmbs\ncottrell\ncoote\nrepubblica\nsurigao\ntejano\nsivan\nfirefight\nmakarova\ntremont\nreplayed\ndepreciation\nbeecham\nkumasi\nbulkhead\npreposterous\nclann\ndtv\nscientologists\nerrani\nbulger\ncharon\nallocating\natacama\nknuth\npais\nrepose\ntolentino\nlingo\nprotester\npassau\nmonogamous\nlora\nelven\nleash\ncot\ntyrell\nlongo\narthropod\nthorny\nsluice\nmcauliffe\nescalante\ncourtly\ntrespassing\nbur\nunderscore\nunlocking\nwilla\nhitmen\nfilibuster\nwawrinka\ncatharina\ntasted\ncondenser\nlevitation\nhermetic\ndiligently\nlehi\nsymons\nalanis\ncampgrounds\ncorleone\nhead
set\ndiction\nshabazz\npupa\ntopaz\ngaillard\nmoron\nmcdonalds\ntutti\nfallow\ndoin\ngoodison\nux\nsani\nshampoo\ncarnivals\nhorsley\nshimla\nevangelion\ncitigroup\noar\nregroup\nbayview\nhindmarsh\nrogan\nverein\nsavant\npythagoras\ngleaned\nwedlock\nyatra\npastoralist\nkeyhole\ngrimshaw\nmachinist\nenforces\nhanger\nvenkateswara\nbarbaric\ngulfstream\ngsp\neleni\nmasood\nbeavis\nmenzel\nredcliffe\nafm\nopenoffice\nalmirante\niffy\nculpeper\ncheeked\ninnermost\namedeo\ngollancz\nalania\ntheotokos\nradiative\nwaterville\nelstree\npathologists\nreclining\nriverboat\ncorky\nvaledictorian\nmakerere\namply\nrawlins\ndenali\nsplendour\nazevedo\nschoolgirl\ndpi\nrichthofen\npregame\nsportswear\nabdicate\nseaplanes\nradiance\nleaguer\nfluted\ncri\nuil\nsoared\nleichhardt\nwane\nrube\nthessalonica\nnieces\nwindscreen\nmarbled\nfogarty\ndiscoverers\nbungalows\narrangers\nhobbyists\nschnyder\nbabur\nnoe\nalpina\nimpassable\ndens\ncheckout\nplumes\nmobilisation\naubert\nedina\nevaporated\ngretel\nintermediaries\nmehr\nhonshu\ngalindo\nimpenetrable\nionized\nanyang\nnovorossiysk\nmakarov\nprins\natholl\namanita\ncichlid\nmarl\nlumberjacks\nkarloff\nultras\nataxia\nrothwell\nenquiries\nivey\nhazen\ninaction\nqutb\ncdf\nyellowknife\noffside\nbicarbonate\nnordisk\nhurwitz\ntrask\neben\npastries\nvestfold\nowings\nbetancourt\nlackey\ngianfranco\ndoane\ngabonese\nhondo\nhalal\ngreasy\nskips\npauly\nvallecano\nmischa\nfeller\nskimming\ngiraud\nhazzard\nskeet\nplump\nabellio\ncutthroat\nreinhart\nilona\nchubby\ndripping\nerzurum\ndyeing\nsinestro\nocr\nfaceted\nbards\ndevious\ncolumella\nlangham\narcheologist\nchara\nelectromagnetism\norinoco\nnll\npersephone\nmodo\nunconditionally\nmusicologists\ncowles\nbarneveld\nsecrete\nwelling\nwhaler\ncamila\npancakes\nmattie\ndredge\nemphasises\ntoothpaste\nucr\nallenby\nimpregnated\nbudgeted\ngreets\nunderstated\narvind\nbrunton\ngeist\nfurnishing\nanesthetic\ninfraction\nmahut\nimitations\nguin\nfoxe\nplumb\nfrères\ncamaro\nsecretions\nboler
o\nwnt\nnewborns\nluk\nfatale\ncataloging\nstavros\nprecession\nrequester\nbream\ninexplicable\nmachina\nkrone\ndufour\noutbursts\nsofía\nminolta\nalcoa\ninterrelated\nintermission\nisaak\npaparazzi\nbaht\nmatamoros\nintercepting\nlass\nblitzkrieg\nrebekah\npyaar\nplundering\ntabled\nlauri\nvadodara\nmeadowlands\nlázaro\nmannequin\nfcl\nnevins\nunregulated\nbana\nangeli\ntrendy\ngto\npopescu\ngoffin\nmcalpine\ngenova\nquine\ngynecology\nayesha\ncopacabana\namuse\narchetypes\ndeadwood\nleonards\nplagues\nviticulture\nmidler\npercussionists\nranting\nsnide\ncand\nrestituta\nstilts\nlansbury\nvillegas\nbrac\nmelodramatic\nbewitched\nhasse\nclarifications\nchasseurs\nmollie\ncogent\nsalo\nloach\nwilkerson\nbirgit\nmorphologically\nchâteaux\nnkrumah\narminia\nimus\nbhg\nstarfire\neventing\ncrass\ndiverging\nhydrochloric\nroslyn\nmaleeva\nhüseyin\nhugs\nhalcyon\nmardin\nzoë\nrationalist\nnovitiate\nmiramax\ndebunked\nulla\ncatalyses\nufl\nfleurs\ngympie\nlassiter\nnextel\ntei\nantidepressants\nhesitated\nfeist\nquintets\nsoir\nwolcott\nriverina\ncornerbacks\nharrowing\nviña\nawardee\ncoro\nslaps\nheaddress\nsteinbach\nhillsides\nalgoma\nscissor\nmilly\nmacaque\nvaucluse\nunjustly\ntala\nlirr\nopec\nsayre\nould\nstratosphere\npegs\nclackamas\nrosemont\ngoring\nexpiring\nmurrow\ncolonnade\ncorrientes\npolyphony\ntransactional\npippa\nhocking\nladislav\narora\nhamper\ngranth\ntralee\nilk\nespiritu\ncloseup\nfreudian\npatuxent\nmaxillary\npedagogue\nmycobacterium\napec\nnss\nparallelism\neffendi\nmullah\nminimalism\nbillington\nsverre\nnishimura\nhigginson\npomegranate\nphosphatase\nchaitanya\ntilley\ninterludes\ntrouser\nharju\nscrambling\ncwm\npompous\nkohli\nhutu\nrushmore\ncomatose\nspecialise\nmontgomerie\nlaszlo\ningo\nhetherington\nequalization\nsolís\ngelatin\nonsen\nptt\nchilli\nsivaji\nfunctionaries\ntrianon\nwatermill\nassange\nabou\nshute\nfalsetto\ngaeta\nsenecio\nhatter\nshamrocks\nferrol\nmatsuda\nsteuart\nsparing\nved\nnudes\nalderson\nkrug\nlippincott\nsepia\
naig\nabreast\nislas\nvarney\nsurry\nmorrill\nouted\nsé\ngoldschmidt\ncadogan\ncabell\nsnr\nyon\nwala\nnissen\nbahram\nthroats\ngout\nimpounded\nlpg\nantisocial\nlahiri\nhomeric\nnarrating\nslattery\nliqueur\nspotsylvania\nspringboks\ndoubleheader\ndanmark\ntaira\nedgy\nsketching\nspate\nconsequential\natticus\nbally\ncommotion\naural\nyay\neichmann\narnulf\nportadown\nlazare\nfives\nsweepstakes\nmatriarch\ncapitulated\ngatsby\nmullin\nvermillion\nmillington\nstralsund\njuli\ncleave\nworrell\nedgardo\nislamism\nrit\nsdtv\nelías\nwitten\nfitzhugh\ngarrigues\normonde\nquin\nunsettling\nkolb\nmidseason\ngamut\ngawain\nclades\ndevalue\ngascony\nbystander\nstepdaughter\nmocks\ncleese\nemeralds\nmetamorphoses\naurelia\nsirjan\nmeteorologists\nmeltzer\nrepulsive\nsewanee\nalco\nworden\nmulcahy\nswagger\nbinoculars\npillsbury\nmamie\nswallowtail\nvel\nophthalmologist\nlotteries\ncistern\nkell\nalfons\nhandcuffs\nbrubeck\ndoped\narticled\ncompaq\nsubtracted\nhotspots\nfondazione\nwretched\nhomegrown\ninternees\nihre\nupholstery\ntelstar\nysgol\ndimensionless\nstato\ncpp\nantananarivo\nkemble\ntuanku\nmatías\nundamaged\nbustling\nconquistadors\nhander\nwinans\nsecede\nwalthamstow\nrauch\nflatly\nasb\ncatholique\ncrossbar\narmen\nvaldosta\namrita\ndunams\nsynapses\nringwood\nlithgow\nrostrum\ncleanly\nmordovia\nflashpoint\nhorgan\nbullseye\nipoh\nmem\nketchum\ndisposing\nmarmaduke\nblunder\nseaward\nhalogen\ndouro\nblinds\nprovisioning\nclary\ncancerous\nchronically\nfreelancer\nifpi\nnabil\ndisjointed\nyutaka\nuptempo\nmtb\nbackups\nlegionnaires\nhazlitt\nsandefjord\ncaja\ngog\npercentile\nomonia\nrcs\nclave\nheriot\nmorello\nbarstow\nprecambrian\ngrammarian\ncavanaugh\navenir\ncircumscribed\nhernia\ntocantins\nmitford\ncourtesan\ntrawlers\naimé\nclapboard\nfolkloric\nthurn\nblurbs\nsorely\nbalmoral\nunknowns\nodette\npauling\ncraddock\nhasta\nwholesome\nnasi\ninflamed\nbosley\nqualms\napologist\ncaswell\nkilgore\nshingo\nyn\npeerages\nmarinus\nwrits\nhuta\ndorje\npredefined\
nmazar\nbrahmaputra\nhospitalised\ndestruct\npon\nexpedient\nbreakdowns\nsasuke\nsorenson\nmagnusson\nrubbed\nsuperdraft\naic\njenks\ngreenbelt\nlourenço\nsteffi\npankaj\nbsp\nsma\nssn\nmoms\nzanu\nvez\nbackdoor\ntoaster\njami\nnva\ncirrhosis\nfucked\nsleek\ntappeh\nura\ndearth\nmontebello\nmarton\nencylopedia\nstimulant\ndauphiné\ngilpin\ntranshumanist\nwigmore\nfajardo\nmalfunctioning\nrefreshed\novercast\nfringed\ngaiety\ntendons\nurbano\npogo\noha\ncockroach\ncaerphilly\nneurotic\nblockage\ntetrahedron\nbeatz\nefficiencies\nmapuche\ngwyneth\nsemifinalists\nphilharmonia\naffliction\noiler\ngio\nfelled\nwaffle\necclesia\nvioloncello\nprimeval\ndesignator\nmannerist\npheromones\ntransgression\nchops\nperceval\nprotectionist\nherkimer\njuggernaut\nwhalen\nlaverne\nbodybuilders\norig\nflightless\nrobespierre\nmeditative\npml\nflashy\nciv\nuaa\ncrake\nriddell\nassoc\nreticulum\nincited\nbrowed\nlusitania\nincensed\ngraze\nemissaries\nhércules\nfitzalan\nbrackett\nlolo\nmuirhead\ncheddar\nhellboy\nilie\nkrylia\nmeteors\nasaph\nbolivarian\nkryptonite\nharish\npsy\ncasals\nwerk\nentice\nwenzel\nberge\ncharcot\noddity\nglossop\ndory\nonsite\npenniless\nraglan\nsusa\nhamelin\nalbers\nsolver\nwozniak\njudaic\npeloponnesian\nlom\nghazni\nbernadotte\nluxor\nfes\nsidecar\nbistro\nraina\nmedica\narticleid\nmouvement\nkronos\nbahar\nleroux\nfrankford\nmissal\nearwig\nsanitarium\nsano\nvanadium\nfetishism\nrushdie\ngentrification\nfj\ncatwalk\npaymaster\nmoline\nkingstown\nmahadev\nlawrenceville\nwithstood\nprobationary\nimsa\nbracts\nkorg\nmayes\numi\ngalton\nincompatibility\nabbasi\ncoauthored\nstraddling\nwicketkeeper\ninterferometer\nmeenakshi\nsommers\nwenn\ncoutts\nedi\nbremerhaven\nstrewn\ncandies\ndoraemon\nliaisons\nunannounced\nhardwicke\ntaunt\ntechnologist\nandrogen\nstorrs\nhelmuth\nventuri\nveritable\nkaifeng\nedgerton\nmassoud\ndrifts\nrefuelling\ncav\nfaceless\ncanaanite\nepidermis\nclinician\nsuperstitions\nsomeplace\nvalerian\ncarpi\nyf\ndials\nagarwal\nbaidu\nc
onservatorium\nterse\nabril\nsnohomish\ngruppo\ntino\nwolfman\nfateful\nmehra\nxhtml\ncerebellum\nmarshland\nseema\nstedman\nvisigoths\npreexisting\nophthalmic\nsmk\nunintelligible\nhoran\nfiedler\nbizjournals\nhypothermia\nsnowboarders\nheadmasters\nermine\nmccook\nfila\ntrm\nazadi\nreconsideration\nlymphatic\nserfdom\nscepticism\ntamás\njours\ndena\npandas\nlisburn\narndt\nvang\niri\nrinaldi\nkelli\nsda\ncllr\ninvisibility\nhafez\ntatsuya\ncsn\nconant\nimpropriety\nholton\nmcm\nllandaff\nglazer\ncrediting\nsici\ndsi\nsowing\nwz\nwarlike\nviolas\nemulsion\nafterword\ndelirious\nsemarang\nsabatini\ntripped\ncontralto\ngasquet\nconsumerism\nstrontium\nkerber\noverground\npresque\nreiko\nattributions\nlectionary\nnebulae\nairforce\ncoren\ncosenza\nemmerich\nresistors\nilija\nbrees\norang\nbülow\nannika\nhachette\nfierro\npolaroid\ncalderon\nantti\nbolling\nkardzhali\ngmo\nobregón\ndramatized\nshamans\nlipscomb\nstjepan\nsavory\nincubate\nmetalist\nmilli\nrajinikanth\ngarbo\nisl\nnudge\nphraya\nidyllic\ncoxless\ncannibals\ncounterintelligence\nselectable\ninvariance\nimparted\nbusway\nves\nlalo\nunprofitable\npil\nprincipia\nvihar\ntraviata\nobedient\nexclave\nablaze\nathol\ncaesium\nkennington\nactes\nsafed\ncrüe\nanka\nkaterina\nquagmire\namerika\ngoons\ntaku\nilse\npantograph\ncastelli\ntaff\nunderlies\nmeru\nholcomb\nprimitives\nbrockton\nsandeep\nsteers\nunresponsive\nhillier\nwold\nmitrovica\nklee\nnecessitate\nwatchlisted\ncontador\nterje\nane\nimmortalized\nplayfair\nweizmann\nmesse\nmz\nconnolley\nbonin\ntownsfolk\naverse\nredeeming\nkenner\noperettas\noutweighs\nmunk\nadapts\nofficiers\nspatially\nmtk\nrustam\nschaffhausen\nretriever\nincognito\nsherpa\ntailoring\nankh\nalcorn\nformby\ntaz\nknvb\nmilliseconds\nbarham\nerna\nlatch\ncorus\nnestled\nobstructive\nconsummate\ndenominated\nassesses\novaries\njanson\nvelika\nlutyens\nmech\nanupam\nanubis\nstroll\ncritiqued\nguerin\nelucidated\nislay\nvenlo\nwpt\nzuid\nwoohoo\nbranca\ndonau\nequus\npta\nvelde\ncyrano
\nperplexed\ntsushima\nsweetest\ntirol\ncampfire\nvisalia\npercussive\nadsorption\nprecautionary\nmalwa\nhebert\ndemarco\nsarsfield\nstalinism\ngivens\nmks\nentablature\ncalcite\nbayt\nhst\nagence\npermeable\ntyner\nyancey\nexposé\nmucosa\nstalybridge\nflava\nputsch\nregulus\nchulalongkorn\ncylon\nchinensis\nthereupon\nhalides\nbosh\ndost\ndoj\nrauf\nbarratt\nkeble\ncalifornica\nclearfield\ncensured\nmorsi\ncsl\nrhs\ninferiority\nviridis\nworksop\nmessiaen\nplas\nholyrood\nrapists\nstoller\ndulcimer\nsigners\nfitchburg\nfictions\ngoldfinger\ntoshio\nsunnis\ntimbuktu\nmonomer\nrayyan\ncalligrapher\nragas\nrrna\nlta\nlightship\nreinstating\nraison\nschur\nsieradz\nbrushing\nperrier\njohnathan\nciti\nmassie\nhyeon\nplugging\ncacao\nbirkbeck\nentwistle\nforamen\nsabadell\nturki\naleksandrovich\nmulan\nperpetuating\nhamdan\ncuando\nmanitou\nsuffices\nxena\nzhukov\nurology\ntrincomalee\nnabisco\nestuarine\nwarehousing\nnineveh\ndup\nunreserved\nevacuating\nzemun\nkrüger\nmaule\ndermal\nchita\nneanderthals\nsrs\nfansites\nmies\nsaran\ndrawbridge\nbikaner\nmargie\ntailplane\nthrives\nswivel\nfarouk\nteenaged\ndissipate\nxerxes\nfamers\nenos\ncurtail\nriau\nnarrowest\nwasher\npoon\njhansi\nramen\ndependable\ndupage\nveitch\npreliminaries\nfredric\nmethyltransferase\natlante\nbouches\ncollages\nanalgesic\ncecily\ncarcasses\naxl\nshag\nrundown\nsmurf\nsoi\ngalilei\nmesses\ndisfigured\ndexterity\nshafer\nunsaturated\nbirger\nbethnal\ncastleton\nshortfall\nssd\napologetics\novas\nhomesteads\ndearest\naccelerates\nsmuts\nrothman\nkesha\ntahrir\npanelists\nsizing\nquilmes\nmerciful\nmasterful\nshopkeeper\nvests\nmorgenstern\nseger\npcp\namines\nbamford\nwalworth\ngauri\nqianlong\nvasili\npracticable\noswestry\njuma\nleda\namu\ntbc\ndisservice\nbeards\nbrooklands\nmarmalade\nnaft\ntracery\noperandi\nsaguenay\nbeaconsfield\nrandwick\nconsuelo\nsnelling\noverdubs\narmoury\nantithesis\nleesburg\nbatley\nkaro\nmatures\nlongueuil\nrealtime\npellegrini\nfogg\nimprovise\nmockumentary\nwic
can\nperverted\ntimbered\ngatherer\ncrevices\nsepp\neverly\ngaiden\neger\nmeow\nmcwilliams\nemacs\nscotts\ndiwali\npermissive\nmanilow\ncarman\nppd\nselden\nrickshaw\nkarsten\nhardworking\ntadpoles\nshone\ndawg\nrijksmuseum\nwort\ndiscontinuity\nbergerac\nkashgar\nlug\nhomemaker\nanglais\npurposeful\naddictions\nflintstones\nhandily\ngorda\nmontego\ncadastral\nquinto\nevie\nvertebra\nvendôme\nbathhouse\ngabba\nbloor\nhexadecimal\nmoulds\nddg\nphilologists\nbretton\nsmithers\nelectives\nfhm\nglengarry\neleazar\ninternationalist\nherzliya\njossi\ngwh\nreassembled\nserhiy\nglobalisation\nkarna\nigf\ntrobe\ncisticola\nkayseri\ncoagulation\nlapses\nmladen\nnoda\nkamel\nalten\ndungannon\nemcee\ninger\ndabs\ncantrell\nunrequited\noceanside\nmiu\nfacundo\ncrawler\ntrueman\npaulette\nusfl\ninstalls\natleast\ncfds\nvestments\nshanahan\nmatson\nkatakana\nmachi\npayer\naeroplanes\npowders\nmathers\nrigoletto\ngmelin\nreestablish\nsalcedo\nmelaka\nboydell\nskateboarder\nmorden\nlilley\ndiallo\nhaan\nhermeneutics\nwahid\ntmc\nkivu\nesso\nduk\ninfallible\nhermosa\nstuttering\nconcurring\nbreakthroughs\nbremerton\nsquaw\nuncalled\nthanos\nmarbella\nwinder\nlibra\nbleaching\nprocedurally\nkimble\nfreeview\nindictments\nclashing\nebrahim\nmarengo\nluzern\ndurrell\nskim\nmorgana\nsucrose\nelmhurst\nelks\ncastellano\nplenum\nnami\nmise\ngiannis\nvaulting\ntrackers\ngaurav\nsuzuka\nfinesse\nconceptualized\nlivius\nbrooker\nsemaphore\nfaria\ninhaled\nperf\ndrucker\nglan\ncodices\nmacbook\nleaky\nscooters\nvoce\npilsen\nfoix\nreconnect\ntrapeze\nhewn\nbooklist\nswinhoe\nlenore\nconurbation\ncertiorari\ndisparage\nglockenspiel\nlactic\nthrashing\nforêt\nglaucoma\nscone\ndaydream\nreyna\nlorries\nescalator\nbrahe\ncava\nyeshivas\npassively\nkrugman\ncontemporain\nhigham\nfairport\nreus\ninfantile\nstoltenberg\nfiume\nvespasian\nxfinity\nborghese\nschenk\nhansel\nstenosis\nalexia\nsymbian\norford\nkul\nfrieda\nmerdeka\nalene\nrepent\nbaca\nclapp\nlubricants\nsluggish\nvying\neckert\ndownton\
nsigrid\nlongitudinally\nshibata\nbarca\nlifeless\nldu\ntutte\nmiserably\ngoetz\nalexandros\nlitany\nproverbial\nlaurentian\nzvonareva\nmemorize\nmarvels\ncalming\nredevelop\nstash\ndemeaning\nstilwell\nprofess\ncasio\ndacre\nnegated\nsecures\nbonanno\nswims\nlq\npounded\nimmunoglobulin\nsapienza\nbakar\nunderrepresented\nfürstenberg\ndoggett\nbelgrave\ncongregate\nbitola\nmillionth\nlectureship\ncargoes\ngaulish\nslumber\nllodra\nhsiao\ndocg\npositivism\ngingerbread\nsingha\nsequestration\nmetalwork\nbulbous\nserna\nsponsorships\nné\nwasatch\nmut\ngcb\npohnpei\nchonburi\nbroz\nmosby\nvetting\ncastiglione\nhydraulics\nresponder\nbhagavad\nmasterchef\nruan\nkum\ngentiles\noars\nmadly\nkamp\nbrownstone\nseptum\ninadvertent\ndelos\ncbr\nmilliken\nimphal\nneuropathy\nsokoto\nfitzgibbon\nlayering\nntt\nbarnacle\nprogesterone\nuli\ngullies\nsutta\ninflate\nnafta\nrhizomes\ntoungoo\ndecoded\nverano\nstraightened\nimprovisations\nfemoral\nmarchetti\npellegrino\nghettos\npele\nbharti\nnéstor\nclerck\neffingham\ninconsequential\ntransponder\nsys\npodlaska\nnikolayevich\ncategorising\nlockport\naltair\nmyrna\nakiva\ncapacitive\nsamuelson\nsympathize\nkeiji\nannoys\nacumen\nsadd\nrappahannock\ndamme\nbisexuality\nscuffle\nloiret\nsaa\nmelina\nchasse\nuntrained\npontiff\nexemplify\ncompensating\ninadequately\nfso\nedirne\njehan\nkimchi\nkhun\ngwendolyn\nmonasticism\ncsv\nrialto\nsweetened\ntrope\nmistrust\nmouthed\nlusignan\nclos\nformulaic\ncalyx\nwhitefield\nnesmith\nspandau\norden\nseb\ngennady\nzelaya\nmatchbox\nemulating\nworf\narnaldo\nambushes\nmistreated\nunep\ntrollope\njoris\nhandset\nuntouchables\nmilitarism\nmasterworks\nasmara\nplácido\nmarchioness\nspliced\njarrod\nenc\nnitrous\ncarlsberg\nattentive\npigmentation\nainslie\ncofounder\ntsg\nnazar\nurals\nearthwork\nsteeped\ndredged\nartem\nrecreativo\nsandia\ncheered\nbaia\ncreoles\nauk\nspiritualist\nlaconia\npotash\ndetergent\nshrouded\nléonard\nshrews\nwhitbread\n⅓\nbodo\nklf\nlightness\nboulez\nleper\nsvoboda\nmu
nda\nstatuette\nsatyajit\nzor\ndived\nmirna\ncellos\nnoncommercial\ndenbigh\nhermon\nglycoprotein\nfairbank\ntimon\nplebeian\notsego\nloam\nhaj\nhoch\nformalised\npediatrician\nisolde\nedoardo\nroundel\nlikhovtseva\njanelle\nfol\nelongation\nsatoru\nchernivtsi\nanda\niago\npolychrome\nresponsiveness\nssb\nsais\naurobindo\nirishmen\nrepton\nferndale\nanker\nendangering\ndueling\nronda\naudacious\njerónimo\nflautist\nholtz\nmercilessly\nnevin\nsavona\nvcs\nchrysostom\npolitiques\ntiring\nytv\nestrellas\noutweighed\nhardman\nsynoptic\nmassenet\nvowing\ndeleterious\ninstill\nexistentialism\nmagnificat\nvitreous\nanalogs\nkennesaw\npessoa\ncatedral\nrels\nhomeostasis\nvouchers\ngrp\nsyringe\nloggins\nbeeston\nraisins\nsuccumbing\nflushed\nilyich\nelio\nvimy\nnda\ngregarious\nhsin\nhillsong\nferrier\namethyst\nantidepressant\nalvar\ndazed\noxo\nroms\ndewi\nschizophrenic\nmotagua\nsupervises\nnieves\narnett\ntiësto\nrephrasing\nadore\nmumtaz\nlicks\ndru\ngrammophon\nnaha\nscène\nsabu\nquasar\nsadc\nrobusta\ncoughing\nglycogen\ncomix\nashlee\nabkhaz\nadmixture\nhartland\ndeniz\nwitted\ndesportivo\nvirulent\nignace\ncaliente\nbagpipe\ndepose\ndateline\nabnormality\nlasky\nconnoisseur\nstrafford\nbridgestone\nsafa\nfuerza\ntortures\nkennedys\nhager\nseto\nconcealment\nsila\nprater\ndobie\nmetalworking\nolivet\nunderestimate\nriker\nwishart\nsextus\ntubercles\nernestine\nzacharias\nleaned\nantioxidant\nridgewood\nchancellorsville\nmeiosis\nbiju\ntourette\nnagle\nayodhya\norhan\ncommissariat\nheals\nolympiads\nexpounded\nlaud\nharis\nprentiss\ndelacroix\ngreenery\nbarisan\npollinated\nmunn\nporgy\nimpair\nbracknell\necotourism\naaj\nmanama\nbandicoot\narecibo\nraye\nsnows\nloopholes\nhelices\ndengue\nmpi\nsopot\ngoebel\narchon\nreptilian\ncrotalus\nxaver\nbosque\nhel\nschirmer\ncopperfield\nclaret\nraab\nkolar\ngy\ngalleon\nenright\nrobustness\njuridical\nlint\nkiko\nponzi\nwoodcuts\nweatherman\ndibiase\noam\nalphonso\nkirkcaldy\ncrossovers\nrhizome\ncognac\nwoe\nmoen\nkolbe\nt
achibana\ncmj\nthurmond\nbsi\nbadged\ngargoyles\nguantánamo\nrecreations\nreplete\nmalmesbury\noilfield\nqiao\njuniata\namstel\ngodley\nmarvellous\njunkyard\ndiop\nthier\nredhead\nzm\nmexicali\nboycotts\nspiel\npurporting\nrincón\ncarlotta\ntabular\npender\nmichiel\nrhee\nroslin\nohne\nmusicianship\nmillennial\npeculiarities\nannulment\nwham\nlunchtime\nradeon\nchrysanthemum\nevangelization\ncompressors\ncurled\nadversarial\ndurbin\ndoughnut\nwav\nunbounded\nspitsbergen\ntutsi\nnortheasterly\nnegativity\nupstart\ntani\ndistributive\nlacan\ncasal\nzaporizhia\ncavanagh\ngroucho\ntoyotomi\ness\nperrault\nesk\narl\nwaterside\n²\ninasmuch\nbendix\namhara\nencamped\nprojekt\ncandelaria\nmeetinghouse\nlakhs\nwipo\noffhand\nnoll\nmalachi\nwaxy\nsarek\nstorks\nsplitter\nbruni\npaiute\nneues\nconglomerates\naruna\nspars\nbizarro\nbeachhead\nalderney\nomnium\nnim\nnebulous\nbodhi\ndizziness\namado\nblush\nsassari\nbadakhshan\nwemyss\ntiffin\nholyhead\nmetaphorically\ntuscarora\naerodromes\nvisor\nzeballos\ncondensing\nburwell\nalcott\nnankai\nbeata\nmulticast\nniemeyer\ntether\nbhatti\nyasin\nglycerol\nmatsumura\nsultana\nislip\nworkspace\nsandhya\nbiogeography\ngreenlandic\nypsilanti\ngoldstone\nfarooq\nober\ntrophée\nrefuting\ndefensively\nfoch\nhallucination\nstorehouse\nsolvable\nprecocious\nsiphon\ncrags\ngunsmoke\ncampbelltown\nmoni\ntamaki\nnutty\nevaporate\nauctioneer\nbelvoir\ntalley\nsepsis\nheusen\nhvac\nstormwater\nworkaround\nhighfield\nquaint\ncaliphs\nsatori\nambiguities\nuaw\nshuttles\nhaney\nalito\nakash\nroberson\njunge\nrained\nensue\njayson\nmejor\nimpaled\naspersions\nvali\nvocally\nméliès\nnigh\nuday\npaywall\nbreckenridge\ngracia\nbollocks\nrefresher\nellwood\nglentoran\nminstrels\ndpr\natrocious\ngarten\ntickle\ndedham\noban\nsada\nwestfalen\nheartless\ntca\nazmi\nalgeciras\nferrying\nwinwood\ncrystallization\nbasques\nhiwish\nsaxton\ndesde\njuxtaposed\nencoder\nlod\nlooped\nunelected\nbisected\nclout\ntournai\nintractable\ngrapefruit\napothecary\nsohn\
nsalgado\nhollows\ndenman\nbackfield\nmesto\nraghavan\naum\nucsd\ngaray\npabst\nkeng\nvibrational\nverner\nwalkin\nfallacious\nmok\nrigidly\npelosi\nbernier\nleia\nfanatics\njeu\nexoplanet\nunnumbered\ndenier\nkankakee\namericano\negyptologist\nfactoring\nresonate\nwilford\ngiffard\npacino\nlucrezia\nhogwarts\nwenatchee\nthruster\ncarbonated\nyai\ncze\ncfg\nunderline\norgy\nloathing\nvasari\nstockpile\ninxs\ncaritas\ntid\nrutter\nmucous\nvandalise\nsema\narashi\nmotionless\nmidline\nsookie\nsarai\nsgi\nmandibles\nsubmersible\nabington\nchiles\ncleve\ncompost\nrundgren\nsoong\nscuderia\nusk\nprisms\notoh\nsevenoaks\nbailed\nimperator\nthreefold\nparthenon\nseafarers\ngnomes\nreykjavik\ncherish\nepi\ndoukas\nsombra\nidolatry\nunload\nretirees\nbrantley\nfrederica\nprofessionnelle\nmichener\ndumitru\nembankments\naiba\nreconciling\njinja\nmariko\nclaxton\nsmashes\nagave\npinckney\ngerson\nchica\nuchida\nsumatran\npietà\ncentipede\nvistas\ntzadik\nghoul\ncatlin\ndefaulted\nrollercoaster\nwahl\npgp\nbattaglia\nwoodblock\nswells\nhaigh\nvesnina\ngelsenkirchen\ntris\nhoot\nperceiving\nfloris\ncheval\nscimitar\nsetanta\nsoak\ngrapple\nslicing\ndeere\nnutritious\nbriefed\nstéphanie\nalexandrovich\ntechcrunch\nannecy\nsfc\nsummerfield\nrashi\nfête\npolluting\nxxvi\npalisade\norientale\nmermaids\nacad\nkovacs\nbramall\ncna\nartis\nsundanese\noverzealous\ndorman\nkennet\nmacroeconomics\nlemay\nforerunners\nthq\nkherson\nfreeland\nalmshouses\ndori\nracquetball\ntact\nsustenance\ntakumi\nravines\njansson\nbeckley\nveliko\nkader\nheadliners\ntestifies\nprofesional\nsaka\nmoshav\nsongbird\nplaywriting\nmedics\nmixers\nshires\nbobsledder\nproportionality\nparibas\ntyra\nkroll\ntransgressions\nnarrators\nrosanna\nkokomo\nseguin\nvecchio\nenslavement\nclementi\nkubota\nrushden\narin\ngrégoire\nelko\nethically\nhilt\nzsa\nmitosis\noffsets\nmiskolc\nramanujan\nhastened\nkirkuk\ndocu\nlegislated\nprecludes\nprelates\nfens\ncomunale\ncamouflaged\nroz\nkeele\naue\nikea\nchetnik\nlillywhite
\nthiel\nlevees\nomi\nwoburn\nappreciable\ncheol\npandavas\ndur\njourdan\nanthracite\ntremolo\nspud\nfundamentalists\nbarbieri\nlifetimes\nbingen\nsanhedrin\nprong\nlillooet\nracquet\nsown\nlorimer\npicker\nranjan\nmalick\ncheeky\nbanging\nsian\naloof\nplasmid\nverden\nvásquez\ntabulated\nerde\nhjalmar\nduress\nedf\nquirino\ndogged\nkempton\nrenate\nwolsey\ngoalscoring\nzuckerman\nkharagpur\nretaken\nvisualized\nbiweekly\norfeo\nomnia\nariège\nchippenham\nacetone\nchandran\nhalide\nbonaire\nethers\nmariya\ntesticles\nrasht\ncouplet\nchaff\nbassey\nokazaki\npenning\ndfa\nminefields\nsnowflake\nsuperposition\ngatekeeper\nrsl\nclapping\nurgell\ndsb\npyrmont\nautographs\ncodification\nreincarnated\npkp\nján\nkargil\nbabysitter\nsoissons\nromford\nunworkable\nignacy\nosc\nincisors\nestrela\ncommensurate\nredo\ncarcassonne\nquiroga\nbusoni\nperiyar\nmanukau\nunranked\nphilatelists\ninsectivorous\nelectrolysis\nholed\nunbuilt\ndomingue\ndayal\nmarcela\nbridegroom\nhyperlink\nebooks\nvought\nmahjong\nacrimonious\nprimogeniture\ncongruence\ntritium\nlegible\nbassano\nsohc\neffie\nmisbehavior\nblücher\ntilton\ntetris\nyvan\nrestorative\nhikes\nfouling\nfylde\npanthera\nresourceful\nirreverent\nlucero\npathan\nmoroder\ngalloping\nnemanja\ncowie\npui\npopulism\nhefner\nadriaan\nadria\nliquefied\neschatology\ntaki\nrecherches\nsnarky\ndien\ntauziat\npalustris\nimpeached\nbap\naeolian\ngarson\nflagpole\ntejada\nbaehr\nepileptic\nkatarzyna\ndelimited\ndisenfranchised\ndep\nepochs\nbayeux\narion\nmisa\nphobos\nmoir\nsubtracting\nfauré\njeeves\ngloom\ndisengagement\nunaired\nrta\ninstated\nqueenie\nvalse\nlope\nmatias\nvarley\ndisordered\nsearchers\nmanaus\nmöbius\nmita\ntomcat\nplacename\nhpv\ncolmar\nmsi\nfiqh\nbruises\nlci\ntlingit\nslippers\nmccloud\nfrente\nepps\nstorch\nwebby\noverjoyed\nrih\ntechnologists\nyanks\ncaldecott\nfreaky\npoppins\nutep\naurea\naishwarya\ntirpitz\nroussel\ngilead\nforecourt\njeux\nplatypus\nmbeki\nsoundscan\ndevotions\nkempe\ndba\nwoolworth\ncogan\nu
prooted\nblimp\nkantor\nmensa\nveloso\ntesta\namyloid\nlopsided\nkau\nsubjugated\nsls\nmerciless\nvouch\nhomicides\nthumbnails\nsagittarius\nwebbing\nbramble\nnewland\nbaboon\nbivalves\nfirstborn\nmukhtar\nhailey\nverna\njoost\nemplacements\nrubik\ncidade\nbarman\nantigone\nthreading\nspecious\ndeeming\nkloster\nyank\nliberators\nprogrammatic\nfounds\ncitibank\nrother\nneutralized\nsuomen\nauer\nparadoxically\nbromsgrove\nknopfler\nhaydon\npimentel\nunplanned\ntawi\nlietuvos\nchocolates\némigré\nbelediyespor\ncirce\nxiaoping\nrusher\nmino\nales\ngerlach\nadverbs\nbloke\nmerlot\nblok\ngunning\ngarrido\nrecursively\nmckeown\nsteeper\nhitchin\nmadrigals\nclearest\ninflexible\nsmitten\napportionment\nendocrinology\nimpure\nganj\nnona\ncuriosities\nwearable\ndiu\ntrovatore\nfajr\ndiarist\nnewsreaders\nimmorality\nboomers\nperfumes\ntân\netiology\nexpedite\nbollinger\ngirders\nsweeter\nembarks\nrebuked\nmötley\nsuburbia\nonlookers\nkaine\nƒ\ncabeza\nmicrobiologist\nnook\nerupt\nkoe\nridgefield\neames\nsemyon\nort\nvirginie\nlaidlaw\nprd\nkazuki\ncollett\ntewkesbury\namjad\navocado\nshareware\nexuberant\nwarangal\nmccurdy\nhasselt\nswirling\ncrum\nstrathmore\nene\nwhining\ngraceland\nmère\nsmartest\ntakayama\ngst\nstrindberg\nmobilizing\nnazim\nshaver\nrigg\nresale\nbil\ntriads\nautre\nrapa\nglencoe\ncreeper\nujjain\nsunfish\nxj\nexcreted\njenn\nskillfully\nshipbuilder\nworkmanship\nsaltire\nthermonuclear\nhep\ngoodreads\nhearne\nthundering\njenni\nattenuated\nmoloney\nberets\nmur\nwilley\nlek\ntorsten\nwillfully\ncharentes\nbabbage\nvitis\nmisadventures\nsemblance\nangelos\nhardline\nkroger\ngawler\nrundfunk\nrectum\nuz\ngirardeau\nokamoto\ndejean\ndts\ncng\ntdp\nalienate\ndistilleries\nhandicraft\nanakin\nlegendre\nkhans\nequalised\nswelled\nluttrell\nimplosion\nminnelli\ncontinuance\nrégional\nfaintly\nissuer\nswindle\nbroomfield\nrubicon\nmolluscan\nths\nintrusions\nbarrack\nblockaded\ndeering\nlamina\nsustainment\nabyssinia\nexcision\nalda\ninsulator\nselig\nrascals\n
turdus\ndashing\njolson\nappellant\nstraighten\nleniency\nvinay\nnrw\nsplc\nmaneuverability\nsubcultures\ntransjordan\nsaws\nftse\ngálvez\nstaunchly\npleasantly\nfromm\nmaes\ngordo\nmati\nelen\nairbags\nshimmer\nraccoons\navenging\nlexicographer\naja\nvuitton\nizz\nbataille\ncling\nstratovolcano\nhatteras\nulaanbaatar\nimpassioned\ninfanticide\nschweiz\nfingered\npirin\nkellner\ncynicism\nforeshore\ncooperates\nhaveli\noctahedral\ncse\nmckinsey\nconflated\ncueva\nfirebox\nmmr\naspires\naboriginals\ncozy\ngeneralised\noverbearing\nmanchurian\nmacros\nbushido\ninterstates\nindustrie\nmunir\nkavita\nvangelis\nmaga\nruggiero\nsuperannuation\nprejudicial\nchub\npontic\ndiehl\nnain\nrowell\nrefereeing\nriddick\nnaca\neuskara\nspiker\nvesper\noverhanging\nparabola\nconvolution\nproportionate\nequaled\nbarents\nshashi\nrensburg\nyavapai\nbhojpuri\npauper\nmond\nburdened\nsuperfast\npseudomonas\nverdicts\nmotet\nsavchenko\ncreamery\njas\ncitywide\nidiopathic\ndurst\nquashed\nadorno\ngiessen\nsicilia\nabyssinian\nsobieski\nablation\ndiverges\nlegumes\npsychosocial\nlaswell\nbuenaventura\nmatterhorn\npapi\nhoffenheim\nbassoons\nmanhunter\nnogales\ninhumane\ncantus\ncask\nconcordat\nexemplar\nessonne\ntarlac\ncorrespondences\njemima\nsarcastically\nicann\npatterning\nhydroelectricity\nfunnels\nrepulsion\nabstentions\nimpressively\nwied\nbosphorus\nputter\nrunnin\nbailiwick\nhypothalamus\ndarío\nalbee\ntaha\ndanielson\nheike\nlexi\ngraben\ncheesy\nmagus\ncrewed\nembolism\njackals\nlaker\ndeeded\nbittern\ntubers\nblom\nmonochromatic\nawoke\nabbottabad\nbuoyant\nwatertight\nnoguchi\nlipa\nproclamations\nstour\ntlaxcala\nzucker\nlibération\npathos\ntempera\nmotu\nstockman\nhants\noverthrowing\nvcu\nsuffragette\nrockport\npica\nbounding\nbaile\nenlightening\npennsylvanian\njón\nelectrolytic\ncowling\nimaged\njebel\nmetros\ngrayling\nsouter\nfreighters\ngallica\ntyr\nkossuth\npathogenesis\npettigrew\ndaugava\nstaphylococcus\nrcc\nwarts\nfactored\nmitsui\ncasco\nlevan\nmahadevan\nlabo
urs\nfairing\nsavoia\ncalmed\npilbara\nsickly\nsequencer\ndupri\nreachable\nimaginable\nkaneko\nrousing\nsafina\ninefficiency\nulmer\nfrederiksberg\nzavala\nmaldon\nvico\nlookin\nbayonets\ncumbia\ndhawan\nmusculoskeletal\nunlockable\nishq\nbarat\nniculescu\neventful\npoliteness\ndebunking\nayacucho\ngeneticists\nkavala\nprocurator\ncapoeira\nafon\npiney\nparables\nwhitcomb\nturbocharger\naudax\nmagog\nmeander\nancash\naaliyah\nsuperlative\nvalens\nfixable\nwertheim\nshaquille\nraz\ndomitian\nplummeted\nheydrich\nflatbush\nhannan\nemporium\njohnsen\nprichard\nwatling\ngrasse\nutada\njobim\npattaya\nhab\nnatale\nqwerty\npueblos\ndoré\nnsl\nillyria\ncraving\nmikel\necologists\nlurie\nwheelock\nfop\ncorrects\nbmo\nfae\nintensifying\n⁄\nchasm\nholbein\ngordie\nantonis\nrevitalized\npoulton\nsubpoena\nharbinger\naldous\nedgewater\ncarthaginians\nkomatsu\nedgeworth\nanuradhapura\nsassy\ntinian\ncomputable\nattlee\ncluttering\nyvon\nminibus\npalembang\nbatgirl\ncondone\nlabial\nunderdogs\nflirts\necija\ntoccata\nautopilot\n,the\nmulk\nkluwer\nmahathir\nscythians\nuddin\ngyrus\nnoa\njackass\nunlawfully\nrüdiger\nlarne\nrickenbacker\naryans\nhaye\nnighthawk\nkabaddi\nmodernizing\nakhenaten\ncollides\ncounterterrorism\nmeriden\nrejoins\nresentful\nabell\nabbie\nyoda\nfloodlights\ncliche\nchillicothe\nveterinarians\nmame\nlidia\nmetastasis\nredbirds\nbatang\nimperatore\nmobley\nwatchmaker\nmey\ngayatri\nblouse\nvolumetric\netna\nskids\nabbe\nsylar\ntaiji\nrickman\nadjudication\nstormont\nunflattering\nseduces\ncitizenry\ngottlob\naphasia\nlire\nhag\npostcolonial\ninterrogations\nlye\ndisaffected\nasteras\narthurs\nduffield\nsolicitation\nmcauley\nexerts\nnegotiators\nnervosa\ncyclonic\nveronika\nmarga\naleph\nferried\ntaboos\ncoastlines\npredicated\nfrancophonie\ntheremin\nxenophobia\nbelge\nrha\nsra\ntbm\ngargoyle\ndeterminations\nunp\nempresses\ngonville\nfergie\ngnosticism\njla\nshijiazhuang\ndwells\nsusumu\nvoldemort\nselfridge\nfrse\nsundry\nwiggle\nbelated\nredeployed\nsu
mp\ncontemplates\npollinators\ngbe\ndefaulting\nstoneham\nflyby\nalsatian\nlandless\nhesketh\nhindering\nmappings\nmikkelsen\nlithographs\nproscribed\nwiles\nferraro\ncosmonauts\nthinning\nginn\nsanjeev\nflipper\nqua\nseizes\nretold\ndeviated\ncrisco\npaix\nfranjo\nbauman\ntvnz\nmonckton\nkyrie\nfuad\nsocialites\npictou\nevacuations\nsayer\nroethlisberger\ntoggle\nunmodified\nubiquitin\nther\nhythe\nstockdale\nvuk\ngujrat\ndepauw\nsukumaran\nminos\nbankhead\ntrotting\nakane\nsinfonietta\naardvark\nmethodical\nanis\nemt\nroa\ndilated\nwabc\ntethered\nhoyas\nmónica\nlalit\noxbow\nalexandr\nmarksmanship\nbrunette\ndéjà\nmariusz\ndormers\nheyward\nstingers\nteardrop\nsew\nfenner\ndailey\nridder\nkarolina\ncarbonell\nholmenkollen\nakiyama\noftentimes\nleh\nfreestanding\nesau\nepidermal\nhumanoids\neac\nascribe\nmesser\nwarr\nholi\nfertilized\nsymantec\nkuru\ngrinstead\njeet\nreassure\ncsb\nloveland\nfain\nfittipaldi\nmanitowoc\ngharb\ndiaper\nnarain\ndimer\ntheosophy\nsveti\ncandidature\nrehash\ndss\nhonorees\nung\ncaernarfon\nveronese\nchandrasekhar\ncoritiba\ndistracts\nkress\nscholes\nkonkan\niam\nforegoing\nwatkin\ngermanium\nfinches\nwessel\nastronautics\nanza\nreprises\nguillén\nsharpening\noptically\nmorgen\nkirkman\nabomination\nrectal\ngruffydd\nroyle\neconometrics\ncrowding\nimmobile\nripening\nulyanovsk\nrepackaged\nnursed\nstax\nfeliz\nzinn\ncowes\nmisspellings\ntapia\noutcasts\nhandkerchief\nlaughton\neilat\nbrm\nmelancholic\ntransiting\nchaffee\nmiko\ntraumatized\nbenefitted\nrearrange\nhoses\nhezekiah\ngums\nalaric\npth\ngasol\nsacramental\ngyro\nrelativism\nnts\nsandinista\nqueried\ntizzle\nmountjoy\naeneid\ncandlestick\ntuan\nromer\nbucs\nveal\nthapa\nnitin\nwilber\nahh\nvitus\ndazzle\nacoustical\nalbi\npermafrost\ntruk\nsrt\ncursing\nkeir\njujuy\nmaugham\naristophanes\nmineralogist\nblackmailed\nemphysema\nentombed\nroughness\nradiotherapy\negil\nconformational\nbunko\nttn\nconsiderate\nswath\nmontt\nivanova\ntiber\nhectic\nruano\nskilful\nries\npix\nhe
nriques\nrtc\nheadlamps\nchuo\nbootlegs\nclerestory\nneurotransmitters\nsurged\nawakes\nmanzano\nblacksmiths\ntirupati\nnota\naronson\nolden\nquartzite\nmalkin\nwillingham\nuit\nbackfired\nlemuel\nbatty\nelly\nschiavo\nconstitutive\nekman\nkushner\nbackcountry\ndominus\nstockholders\nundertones\nstephane\ntypesetting\nhitoshi\narakan\nearthenware\nywca\nseifert\nlett\ndanica\ncoughlan\nnour\nkabhi\nneff\nmonarchists\ndragonflies\ndespise\nshowy\neluded\npronged\nhummingbirds\niaa\nquintanilla\nflamingos\nhamed\nandrena\nsatyr\nobstructions\nseria\nsantee\natrocity\ndodging\nsolberg\nindium\nfujimori\nliceo\neakins\nnetherlandish\nprawn\nroemer\npallida\nluxe\ndiminishes\nshapur\nrix\nscifi\ntol\nack\nsuffragist\nsankara\nethnographer\ngigabit\ndevaluation\npearly\nexacting\nrothstein\nmichell\nradley\nbba\ntransformational\nvagueness\njihadist\nforecastle\nleaderboard\nwestview\naccomplishes\nbebo\npatchy\nsundaram\nprototyping\nplatter\nweibo\nabstractions\njessup\nmelilla\nprocuring\nabergavenny\nmanos\nbushfires\npare\nreals\nlaure\nconsternation\nuntouchable\nhoxha\nvioletta\nhutchings\nmurs\nulu\nraiden\nvirtuosity\nremand\nkhash\nchoked\nundercut\nzhan\njussi\nsurfacing\nvoucher\nbushranger\nboku\nmonahan\nthanet\nnines\nrobocop\nkellerman\ncorroborate\nwsl\nstine\nsnape\ncyanobacteria\nhors\nuu\nbedlam\nstereotyping\nastonishment\nede\ngrose\nsacral\nmasthead\nabraxas\nskylight\nbagration\nprohibitive\nhunch\nsafin\nfluctuated\ndefinable\nsubmissive\npillaged\npontevedra\nvasconcelos\nsubgenres\nevita\ncriminally\nweidenfeld\nsoca\nache\ndmk\ngord\nbloods\ntvr\ncunt\nbornholm\nfifi\ninsufficiency\ngasp\nruf\nbragging\nbatt\ngreenaway\nsquamish\nsubliminal\nprimorsky\nprincesa\ntdi\ncapitalise\nlindner\nmarshfield\nkosovan\npersonas\nmorbidity\npurest\nacura\ntrickery\naveiro\norel\ninquired\ncatanzaro\nhodson\ngounod\npatriarchy\ntotnes\npitfalls\nblondes\nwigs\nrenwick\nkora\nparodying\nhuy\nimpersonate\ndreamland\nkirkham\nbolan\nstilt\nsprouts\nsturges\nch
olas\npredictably\ninsemination\nharingey\nlinger\nopendocument\nashdown\nrann\nplantain\nlibellous\nslurry\nsomethin\ntuft\nbestsellers\nmoti\ngalería\naníbal\nberea\nsubchannels\nbernardi\nsalar\nmandating\nmasterton\nsherri\nembattled\nfella\ngratification\ncomputationally\nparaphernalia\nfranziska\ncantwell\nunexplored\ndisrepute\nmultiplexed\njarrah\ntema\nfinsbury\nmose\nindisputable\nenriching\ninv\ntidying\nlamia\nheredity\nyt\ndirectx\nrooting\nbreezy\nlandshut\nwoodwinds\ndarrin\naotearoa\nalligators\njacobites\nrehoboth\nitching\nwoefully\nsebastião\nrayne\nanschluss\ntombstones\nsterne\ncerebellar\nfluctuation\ntesters\ncorvinus\nagate\npatrimony\ninsecticide\nsundsvall\ndissented\nsynods\ndéfense\nkleist\nhosni\ntraceable\nuttara\neurocopter\npita\nlyase\novw\nclarice\nbeauvoir\nmodifiers\nmcveigh\nanderton\nshamir\ntes\ngur\nsiskiyou\ncpm\nreposted\ntseng\ngaziantep\ncoopers\ncallisto\nsandys\nlinga\ninclement\nhejaz\nnodal\nshowrunner\ntribulations\nyazid\ndaigo\nangler\ntesticular\npours\nfara\nemmylou\nsignet\npriming\npanes\nrimbaud\nreprimand\nvalente\napologetic\nricochet\nleib\ninst\nmotels\nvirgilio\nkiva\ndarley\nannuals\nkook\nneverland\nelsinore\nfervor\ngarhwal\nmattered\nderecho\nbaritones\ncloisters\ncadena\njomo\nskynyrd\ncirencester\ngata\ndasa\nfallin\nintelsat\naeronautica\nroxburgh\narica\ndonned\nbohdan\npacer\nexterminated\nprismatic\ndollhouse\ninfertile\nblenny\nfaraway\nmargareta\nmingo\nemf\nasymptomatic\ncunliffe\nradhakrishnan\nclc\nmarlo\npiù\nallround\nintercepts\nfranke\nshirvan\nscribd\nsupposition\nashleigh\nschuman\nnoticias\ntriathlete\nsalesian\nconcours\nbanyan\nsupernovae\npiaget\nredfield\nmeaningfully\npge\nchamorro\ncannery\nmisiones\nzain\nreorganizing\nackermann\nosha\ncarronades\nmandrake\nnigerien\njezebel\nraitt\ndurbar\neis\nforays\ninnkeeper\nrnzaf\nspokes\nferb\njor\noverwritten\nrpi\nmenstruation\nunabridged\nwitham\nwipeout\nhippocrates\ntexte\npareto\nblindfolded\nplaylists\nbharathi\nwelle\nulan\nfrau
en\ncyde\nplotters\npredominance\npassable\npowertrain\nneruda\noligarchy\namenhotep\nkettler\nreps\noj\nmahoning\nwallach\nshipp\ndamper\nconquers\nsmithson\nvalidly\nhsp\nzootaxa\ninterrogate\nplein\nresistivity\nsynchronize\nsvein\nbarometer\nfleas\nmitchum\nsquatting\nchantry\nocclusion\nlegitimize\nstrasburg\nbelmonte\nprema\n―\nbcl\natletico\ncopulation\npakenham\ntimişoara\nccl\npalladio\ncancellations\nevacuees\nprebendary\npolyurethane\nscarring\ndarwinian\nlandwehr\nruta\nnand\ngrillo\nexcavating\ndedicates\nronne\nbirding\nriser\nolly\ngrassi\nmansoor\nzirconium\ntouristic\nandroids\ntanglewood\nusps\noakleigh\nwinningest\nmulatto\ngeriatric\ntangle\ncrammed\npata\nfredericks\nkomodo\norangutan\nbrosnan\nciro\nansel\nsikorski\nblister\ndeductive\ninstituting\nfrémont\nchitral\ninterferon\nbigg\nsatires\nresuscitation\nkenmore\nrochford\nnatures\nnewbridge\njuha\ncrescendo\ncloths\nbarthel\ndiversions\ncolumbo\npennell\nheo\ncobbled\ncarle\ntransposed\nfreemen\npapp\nhvdc\nosh\ngba\nbookmarks\nscherzinger\niwan\nmacdowell\nobtainable\nthurrock\noffbeat\nwordplay\nchagall\ninverter\nigg\nfürst\npoulenc\ndaggett\ndispel\nbca\nlawfully\ngalatea\narta\nserres\nwoolsey\niep\nbounces\nmorelli\nblackened\nandrée\nniebla\nclassifier\nconservapedia\nquays\nlashed\ngeraint\ninfact\nplatelets\nlyricism\ngoaltending\ntarleton\nbooed\npollutant\nemulators\ninaccurately\ntrevelyan\nfrodo\nfob\nflocked\nkrosno\nbua\nkuril\ncreme\nmorea\nbrenton\nwdr\nhenschel\nhenman\nforsaken\ndorms\nbibb\namba\nsulpice\ncen\nleftists\nallyn\nbein\ntranscends\nladysmith\ntotalitarianism\ncaptivating\npracticality\npashtuns\nkenai\nhumerus\npanay\nspacey\ndivested\ntonality\nworthiness\nmercian\namputees\nballon\nsatanism\nstanislaw\ngoldeneye\ngrandin\nkurukshetra\nprabhakar\naxiomatic\ndmz\nchatto\ntbn\nexon\nrubra\nskipton\nbackside\nbuckethead\nmorphed\nneuromuscular\ngascoyne\ncolle\nfreshness\noverrides\narmorial\nbrownian\nheil\npug\nglut\npallet\nagincourt\nchamois\nseder\nreprie
ve\ntio\ndicaprio\ndigitization\nconveyance\nigniting\nsculptured\npcf\njosephson\nerlich\npunto\nstreamlining\nbombshell\nstolle\ngarfunkel\nacuity\nposada\nradnor\nlard\ncert\nsalve\nmanas\nlumps\ntovar\nmetastatic\neliminations\nfiddlers\ndha\nahem\nseco\nmisdeeds\nkrohn\ngyan\ngalactus\nfutura\nbartow\nshowman\nedin\nmizuno\nisma\npretorius\nbrockman\nbriarcliff\nsoros\nhaka\nmisidentified\nfro\ntablelands\ntailings\njiro\nbauxite\nomnipotent\nleeuwarden\nannenberg\niti\ndaisies\nmaccarthy\nsobriety\nzhuo\nswp\nneuroscientists\nfabien\nfarhad\nwhitten\nfrauds\njarl\nincurable\nfuriously\narapaho\ncoromandel\nbattambang\namygdala\nmagnetization\nenterprising\ncompanionship\nouagadougou\nfon\nnikolas\nvenables\nopa\nsurtees\nlafontaine\nthera\nramana\nstung\nfriedlander\ndelphine\nsuter\nsanam\nbooting\nmerriman\nveera\ncra\nbrauer\nfarris\nwatchful\ngeneralitat\numpiring\nskimmed\nrefinements\nabramoff\nhashem\ncarat\nmarsupials\ngemstones\nhorrendous\natco\ncassia\nledesma\nfricke\nadj\nhydrologic\nstreatham\npaused\nnanchang\nlak\nbrainwashing\ntage\npegg\nflourishes\nsiouxsie\nstorytellers\nratchasima\nmemos\nhakeem\nobra\nneko\nhapless\nballade\nhorseracing\ntranslocation\nkuti\nnutter\nsont\ncriminality\nremus\nsanctus\nonassis\nqat\nincubated\nblacklisting\nsunnyvale\nviggo\nbumping\nbreeches\nlintel\nfranky\nwily\nefendi\npapaya\ndispossessed\nmaar\nmui\ngooding\ndemonstrative\ndomingos\npotro\nnonesuch\nhirohito\nweeknights\nduomo\nwaitangi\nhayakawa\nexoskeleton\njost\nlegislate\ntcr\nconcertmaster\nlupo\npavlovich\ngaborone\ncortlandt\nalana\napnea\ndprk\nasda\nteramo\npickwick\nsleepless\nmauritanian\nadjudged\nfantastical\ncaddy\nsaxena\nrupaul\nnavan\nconcordance\nnewby\nremodelling\npeeters\naxelrod\nnewsgroup\ndispatching\ntetrapods\nbina\ndougie\nbanquets\npolitehnica\nnadeem\narginine\nexxonmobil\ngrazed\nshuffled\nibero\nphenomenological\nnhra\nholon\narmidale\nquranic\nmayberry\nurns\nsophomores\ntermite\ndrifters\nsona\nperks\nalienating\nlegi
o\ndargah\nsprays\nzuni\njuke\ncae\ntiga\nunequivocal\nbidirectional\ndutra\nhattori\ndasgupta\nluciana\ndunstable\nalumna\nhema\nwuxi\nlapwing\nphenotypes\npottsville\nsemantically\ndraughts\ngenerality\nrajapaksa\nboni\nreaves\noverridden\nerred\nbalthazar\nsorin\ncoronal\ntonto\ngrainy\nkashyap\nhavok\ndiagnosing\ncarmina\nescondido\ncelluloid\nmallow\nlain\nhaarhuis\npoaceae\nstrada\napuestas\nhina\nassociazione\nwallenberg\nmartinus\ngoodson\nsheldrake\nvarnish\nscaring\nrehearse\nnoose\nsafeway\nhemphill\nseidel\nsoni\nflogging\ntokyu\ncontributory\nfarrington\nennobled\naquaria\nsieur\ndeformity\nwgbh\nburdon\nhoodoo\nlyudmila\ngettin\nscat\ncoxed\ngelding\nhayworth\ntraci\naleister\nyb\nabstaining\nmacleay\nbarone\ngirth\nunmatched\nburj\nsparc\nmahone\ninfantryman\nvizcaya\ncastellanos\ncrustacean\nhrt\nsaintes\ncapitalizing\ngravestones\nvets\npepsico\nsarcoma\nbadgering\nforefathers\ncyrenaica\nhollande\nbilingualism\nwmd\nmondadori\ngunnison\nthf\nfiefs\nbrom\nretrial\ndaud\nsandal\nhornblower\nschütz\nwaratahs\nsnowdon\nsixers\npathak\nautoblock\nsupranational\nmilking\nfoxtel\nspb\ncurragh\naiko\ncull\nppc\nchampioning\nserviceable\npoop\nellipsis\nvorarlberg\nblot\nkilla\nvenizelos\ndebs\nwf\ncreeps\nkristofferson\ncarcinogenic\nlobbies\nitaliani\nwein\nstraws\nfulani\nmiyako\nlamy\ngente\nsuffragists\nmagnified\nmandibular\ncropper\ncreuse\nadrianople\nquai\ncanopies\nkarpov\nchristus\nibrox\nprodding\nostia\nça\ncosplay\natms\namiable\nreliquary\nrayburn\nbenet\nraving\ndispositions\nflange\npentateuch\nese\ncooperstown\nzakaria\nwalleye\nkinky\nischemic\neconometric\noude\nunaccredited\ngaudio\nmatsuyama\ntranquil\nosteoporosis\nversace\nshenhua\nembarrass\nsreekumar\nsappers\nhardee\nwazir\nsoaking\nmaxie\nmodulator\nrecused\ntsuyoshi\nvesuvius\nrobben\ntunney\nstackpole\nvisayan\naggregating\ntreadwell\ndeon\nvolpe\nfart\ntubman\nauster\nkhon\nhillock\nrawa\nfabled\noverseers\nheft\ninlaid\nspina\napportioned\nemptive\nimperfections\nlubricant\na
rundell\nwelwyn\ninsertions\nunmistakable\nutley\ngolgi\nbuganda\ncoq\ncarswell\nrecruiter\ninfiltrating\ngeniuses\ngow\nfreeholders\nadenauer\nmander\nthyme\ncanute\njeroen\nporfirio\nthucydides\nnolte\nprintings\nkensal\nlevantine\ncleanse\ninquiring\npetitioners\nkilleen\ntallies\nlà\nleveraging\ndefaced\nredditch\nmarigold\nnonstandard\noromia\nnoddy\nblotches\njefferies\nagong\nrisa\nabscess\nantal\ndaycare\nkavi\nacclamation\nhandcuffed\nhydrological\nsaussure\nstrawman\nhasten\nperelman\npunting\nbehan\nplunging\nzetian\nhark\npequot\nbiffle\nvillars\nslingshot\nthalia\npec\nunstructured\naaas\nelectromechanical\nmashup\nbirthright\nmartell\nbitches\nnip\nramu\niana\nquirks\nabsinthe\nroyer\nrangefinder\nwatery\nheung\nvaduz\ndfs\nrind\npbl\ncustodial\nradiators\ntroublemaker\nconformed\nlevett\ncann\nstretton\nhindley\nlezion\nlindemann\nkonstanz\nlawlor\nculex\nconductance\ncanes\nenthalpy\npanoramas\nflops\nlooser\nhydrostatic\ncybermen\nplos\nhirschmann\nthefts\nhalberstadt\nmsdn\nhandyman\nabsolution\nvideogames\nremastering\ngrafting\nlavishly\narmee\nlilo\nlytle\niida\nosten\ngiurgiu\nvik\nconciliatory\ngroening\nfátima\nsidewinder\nwendel\nhattiesburg\nbaran\nkidder\nbellman\ncamellia\nnlcs\nphoned\nmuskingum\nrawhide\ncarpentier\nconsults\nmaddalena\ndottie\ncaster\nwaveguide\ncayetano\nfritsch\npakhtakor\nffe\nnusa\ngangnam\nlatakia\nmeanders\nshopper\nbelén\nnita\nprez\nwaive\nashish\nnotional\nyuichi\nyerba\nunscientific\nmasaryk\nwilco\nciaran\nconnexion\nhertogenbosch\nzuo\nbouzouki\nirregulars\nrackets\nchania\nvasu\npina\nnightwing\nwausau\nmcshane\nszabolcs\nmilitari\nvelez\ncorbyn\nlanao\ncaserta\ndetritus\neea\nkani\ndudes\ntdt\npoodle\nconcisely\ncastlevania\nflume\ndigested\nkemerovo\npolemics\ngana\npagasa\nhoshi\nenniskillen\nmisfortunes\nkimono\ncaras\nswamped\ncosmodrome\nrecoverable\ncormier\nknickerbocker\ncofactor\ntradesmen\nousting\nzahn\ninker\ntravail\nbottomed\nchela\nbiodegradable\nsoundwave\ncytokine\nbava\ninductance\nbraml
ey\nnagarjuna\nvibrato\nhammering\ndili\ncgs\ncategorizes\nfrantically\nheathland\nspringdale\nwatercourses\naroostook\nartefact\nnieuwe\nslp\ndemonstrable\nswazi\nonna\nbutters\ngeraldo\nclocking\ndivorces\nkrystal\nember\nthoroughfares\npunctured\nyuna\ndevas\nwinterbottom\nnorra\ndwelt\nstuttgarter\ninterleukin\nglades\ntheistic\nsuperscript\nsaheb\ntownes\nazar\nplacental\nrevels\nearners\naarau\npontius\nmoshi\noffensively\nenchantment\ngymnasiums\nmists\nintakes\ntubby\nchucky\nlyall\ngunung\nrapides\nnaps\nintrospection\nfta\ngrattan\ncayenne\nmohandas\nbalázs\nhalfbacks\nclawed\nbipartite\ncramps\narkady\ndelisle\ndisenchanted\nzhe\narcheologists\neasement\njoule\nventilated\nkimber\npossessor\nhomeward\nkura\nbidders\ngoc\ngliwice\ndakshina\nnaan\nbren\nstorybook\nplaneta\nrosina\nmmc\nbloodlines\nsepahan\nyusuke\nharburg\nvickery\nmaisie\nbeholder\ndegenerative\npolybius\nhoh\ntarawa\ncresswell\nfillings\nquds\nmush\nyousuf\nnayarit\narusha\ntelus\nwhiz\ncoulibaly\nfata\nmeriwether\nswanton\npomp\ngoble\ntroup\ndesecration\nillusory\nhopkinson\nrugrats\nkongsberg\ncaribe\nparamaribo\nallure\nudall\nrectifier\ngruffudd\nballerinas\nmilligrams\nkraftwerk\nsibir\nmachida\nbackfires\nlieber\nnichol\nmarauder\nchaste\nnarcissism\nnunez\nsimulates\nimc\ngormley\nruch\nscreech\ndanko\ndevito\nwilks\nlorde\nstarke\nelectrodynamics\ntestaments\ngainsbourg\nbolder\npipa\nhewson\nbeazley\ntreblinka\nzou\nfuturism\ndrinkers\noptometry\nawardees\nrepurposed\nschiffer\nfalsification\nbexar\npopularization\nmeza\narent\nkorps\njustly\ntimescale\nbelafonte\nmoh\ninstar\ncort\nthickening\nlarch\nmachen\naws\nhor\ndisapprove\ncombos\ncleland\ngomer\nkoster\nfondo\nchipped\nruckus\nsuetonius\nlabuan\nkanto\ndismembered\nfloss\nsudetenland\nmegafauna\nmoksha\nwanton\nimpressionists\ncaplan\nrecites\nstatham\ntailors\nsamsun\nmarduk\njülich\nselznick\nfuseaction\nhalil\nlinlithgow\neffluent\nkhas\ngolfing\nwendt\nkwun\nletitia\nwendover\nblaney\narrington\nnaim\ndeja\nhandouts
\nsegmental\nvalles\nreinserted\nwatters\nirrawaddy\nfrets\ncelso\nunspoken\ndownie\nbruin\nbacklogged\nchittenden\nrameau\nsamad\nkas\nhameed\nvalero\noverpopulation\nyusef\nbipedal\ndisgraceful\ncinq\nmiramichi\nbajaj\ntench\nbanal\ngid\nviewfinder\nrive\ngenocidal\nhooligans\nevangelistic\ninterdependence\nboutros\nemblazoned\njaques\nklezmer\navalanches\noverflowing\nfulcrum\nnya\nsofie\nulama\nerudite\ncautionary\njenin\nsauvage\nwilds\nfoundries\nhammerhead\nmartinelli\nmensch\nslut\nluring\nmourned\nnoli\nglarus\nlingayen\npontificate\neliminator\nargüello\nszékely\ndca\n√\nhackman\nflounder\nfairground\njuraj\nambrosia\nnifty\nasaf\nmenial\nmartineau\ncontraceptives\nsnowboarder\npolypeptide\ntiebreakers\ndiddley\nshrank\nstereotyped\ngreening\npegged\nunhappiness\nabusers\ndomaine\nflotation\nefes\nimporters\nburdett\nmorais\ndevizes\nfirework\nbhima\nquinnipiac\nunderwriting\nhijab\nsmoothed\ncavalli\nanare\ncaterham\nmbbs\nburners\numpired\nreductive\nmerengue\nlandfills\nchawla\nadored\nmetternich\ncooperatively\nfrontispiece\nmargarete\ntosses\ntownhouses\nstora\npopulus\nhiphop\ncañas\nkindergartens\nenergized\nisner\nmccool\ntannery\ndarla\nregenerated\numbilicus\nindecisive\njog\nbromine\nmuscovy\ncetinje\ncavalcade\nmemorized\npopulaire\nwalkover\nscrapbook\nlarvik\nkaj\nherbicides\nunderstory\nbicol\ngrunwald\nmcevoy\nmcginnis\nclwyd\nresnick\nbiathlete\nfujii\nbinocular\nbuoys\nshimane\naggravating\nmeo\nporting\nclubhouses\nhick\nsmederevo\nsoured\nhumming\ntarnished\ncates\nnecaxa\nmetrorail\nvictoire\nencircle\notway\nelfsborg\nmccay\neasley\ncgt\npredrag\npyroclastic\nbrasilia\nstrident\ncff\nwatercraft\nairdrieonians\nbixby\ndecompose\nsouthpaw\nflutter\ncookbooks\nganymede\nyarns\nfci\nblantyre\nderivations\nhôpital\npentagram\nbrianna\nobjector\nkou\nalcs\nelizondo\ndak\nechr\njoiner\nbrecker\nfundy\nrenderings\nisro\nsoundgarden\nazerbaijanis\nlaborious\nchoate\ndesjardins\nwidener\nmistletoe\nstraub\nprofited\nilp\nchahar\nstansted\nacute
ly\nchine\ncraigslist\nharewood\ngrammarians\nprocopius\nvento\nfreie\ncondensate\ntransom\nnootka\nbroussard\nroamed\nooty\nkronstadt\nberkman\nredefining\ncholmondeley\nquestionnaires\nbrienne\npori\nhydration\npropelling\nacorns\nquicken\ndugdale\nmaduro\noddities\nfluvial\ntropicana\nucsb\nosmosis\nazuma\nexogenous\ndiversionary\npodlaski\npichilemu\nvoided\nstc\nberners\nmotorist\nunderscores\nringling\nimpurity\ngrandview\nrelish\nlayoffs\npostponement\nlewisburg\niulia\nbreathless\nivica\npuffy\nscreamed\nremarry\nmilitaries\nallis\nildefonso\nspor\narchivists\nunderbelly\nhyenas\nineffectual\nexpendable\nabstention\nincurring\ntruckers\ntrays\nredneck\nretrofitted\nhubbell\nconnick\nemailing\nyasuda\nfábio\npopups\nactuators\nherzl\nservette\nbega\nhamsters\ntracklist\nercole\nacademie\neffortlessly\ninsidious\ncoc\nandreev\nconstructivist\nmarkt\nerectus\nfandango\nmetropolitana\nmargrethe\nassembles\nadjusts\nbelorussian\nbimbo\ndennett\novoid\nhann\ncpsu\nwile\nethnologist\nmolestation\ninfatuation\nsvend\nmaxx\nreade\nstateside\nwept\nmauresmo\nsupplementation\ncarola\ncelestine\nkohen\nxylophone\npham\nfronds\nperla\npetitioning\netude\nberserk\nété\nwieland\nblogosphere\ngurevich\njúlio\nbcr\ntempus\nrerum\nmoda\nmeissner\nstp\ntransited\nferia\ntrillium\nantiquarians\nlicensees\nspender\nreimer\nlarynx\nmagalhães\nchoy\nthr\nhss\ntegan\npasschendaele\ndiscos\ndecried\nkatja\nreconquest\ntumours\nsnipes\ncouplers\nparklands\ncañada\nhealers\nautodesk\ninnis\nhaight\nhewett\nmusings\nexperimenter\nloudness\nconst\nremodel\nourense\nteodor\nquizzes\nionia\nboccaccio\nconfigurable\nviareggio\nimprison\ncuesta\nsalivary\ntoscanini\nmelvyn\nflycatchers\nejaculation\nbountiful\nmcp\ncaveman\naxa\nwinchell\ndob\neyeball\nchemins\niyengar\nelectrophoresis\nsymphonie\nagribusiness\ntolerable\ninr\nwhaley\nsingularly\nandrus\nmarsupial\nspooks\nperdue\nweiser\nines\nhasharon\ncuracy\nwaal\ndench\nert\ndvr\noli\nreferrals\ngalvanized\ngec\ngraaf\nsteelworks\nlice
nsure\nberryman\ndescriptors\nbayliss\nnoto\nreparation\npotocki\nanheuser\nelegantly\ncoldfield\nrie\nnysa\nozarks\nbrophy\nresized\ngrahamstown\ncorrèze\npka\nslowdown\nestonians\nbagwell\nmáximo\ndevour\nmote\nwouldnt\nmaples\nvasas\nwhiplash\ncentrepiece\ntherefor\nmoeller\nnoord\naraki\noccasioned\npullen\nrecur\nhorrifying\ntricking\nwahlberg\nprophesied\nbaronial\ntympanum\nthorsten\nshadowed\nbaluchistan\ngaskell\nregrettable\nmirai\nturbofan\nchroma\nenderby\nmaroc\ncynon\ncharred\nhooves\nglittering\nstratocaster\ncôté\nnilo\nhyperactivity\nlucasfilm\npasay\nbax\nmatinee\ntrt\ntondo\norganically\npeary\npouches\njeevan\nsheung\nqp\nneha\nperot\nwashoe\nuniversiti\nbourget\nmishaps\ngerminate\nstopper\npaddling\ndeviantart\noso\nveria\narse\nblossomed\nmangled\nequilateral\nnettles\ninstantaneously\ncutlery\nwagoner\nparkersburg\nsurpasses\nbarbaro\nmonfils\ndrillers\nfamines\nrampur\ntambo\ndano\nwahab\nkher\nstitt\ndespatched\nkotaku\nneedn\nshl\nfortunato\ncounterclockwise\nbrimstone\ngiga\njoly\ndesh\nantigonus\nscythe\nreentered\nphoebus\npreyed\nkerk\npendragon\nbadu\ncaso\nleer\nlucasarts\nhortense\nmalang\nminting\nsouthey\ncuratorial\ngali\narak\nflammarion\ntowels\nchui\nfirs\ndmca\nwarnock\nmunoz\nligurian\nhaulage\nhrw\nanachronism\nelaborating\nagnès\nudf\nsnowmobile\nlapointe\nsheathed\nnovikov\nbenevolence\nautofocus\nlindwall\nshir\nfrege\narsène\njanez\ncivics\noffsite\nmataram\nsalmond\ngush\ndiller\ncorregidor\nrime\nmaquis\nretort\nmsl\nvazquez\nböhm\nsmalley\naddie\nbrel\ngyroscope\nbrazen\ncoals\nskagen\neducationalist\nrevives\nbehemoth\nmontgomeryshire\narrears\ngarmisch\nlatinized\ndemosthenes\npygmalion\nbasildon\ncoombe\nactuated\ngrower\nhusserl\npotok\nsebastien\nprt\nwareham\navars\nslaughtering\nmontauban\ngulden\ncanines\nconservator\nnewsprint\ndriftwood\nhamptons\nvideotaped\nlindström\ninconspicuous\ncrucially\nadrenergic\nmada\nwishful\ncapstone\ntarim\ncartooning\ndunhill\nresold\ndreadnoughts\nfelon\nmayu\nburgenland\nm
ajapahit\ntrefoil\nhakan\nbolshevism\nkanu\nbaka\nshuttleworth\nmelanesia\ndisintegrate\nmannix\nhedda\nbarberini\npeshwa\nbritpop\nbrasiliensis\npeterhouse\nvigour\nimi\nkenta\nmiasto\ncajon\ncdn\ntreadmill\nbandra\nabsorbers\nvirology\nmoroni\npsychotherapist\nsquall\nheikki\nhhs\njointed\nodour\nthurgood\ntwp\njasmin\nphilosophically\nsingularities\nblavatsky\ndannebrog\nzebras\ntanjong\nmostafa\ntalavera\nbossy\npandya\nintensities\nkinston\nhirschfeld\ngeorgette\nfano\nrfk\ndany\nkardashian\nurbe\nkiernan\nswarms\nbse\nmasashi\nsteppenwolf\nayu\nbruford\nsop\nhistorie\ntua\naeon\ngenerale\nransacked\navicenna\nsoule\nhollies\nkido\ntunings\nraynor\njagdgeschwader\nhammerheads\ndogwood\ncandela\ncenterline\ncounterattacks\ngeun\namok\ntweety\nrefuel\nplante\nuniversalism\ninfantrymen\nregrettably\npharyngeal\nmaan\nmoguls\ncortisol\nsurmised\nworshipers\nstrobe\ntimings\nantiviral\nmisinterpreting\nephemera\nbideford\nplausibly\nenergetically\ninterlaced\nsealand\nunsold\nkeung\nimprinted\nlucena\ncomforting\njeolla\npetaling\nbookmark\ngks\nbanc\nnotte\nfiler\narman\nlaminar\neck\nfixer\ncutscenes\nphrygian\ntero\njuku\ngoettingen\nied\ncallas\nbatak\nelfman\nmastodon\nrnli\ntikal\ntranslink\nmachinima\nlugar\nbasile\nwarping\nrenewables\nlogics\nwyo\ngaither\nfrideric\nlanglois\nnsu\nunattributed\ncashew\nbagged\nbiltmore\nwotton\ncorgan\npaucity\nnão\nmoyne\nexcels\ndaya\nsaarinen\nbenitez\nsheesh\nmetheny\nalternator\nhurriedly\nmadhav\nwormwood\nsieg\ntryptophan\nrien\njaap\ndarken\nnaas\ngans\nfrawley\ngertrud\nhywel\nmohs\npillows\ncranks\nobeying\nharuna\nforsberg\ndeauville\ngrogan\nberra\nappendices\nshuai\ntroopship\nunitarians\nexmouth\nmanoel\nsteinman\nseki\ndislodge\ntakht\nnandini\nhaphazard\nrogelio\nmizuho\ntabula\ngrosser\nrubric\nmotility\nsalaried\nmacht\nhindemith\nmetropolitano\nhammock\nhanukkah\ngeert\nvermin\nwsi\nscherer\nimparting\nident\nfunctionary\nmemorialized\nsissy\nmonotonous\narouse\nmendelsohn\njager\nwilhelmine\nreigate\nuo\
nloner\nkalle\naor\nadiabatic\njavan\nteaneck\nwhangarei\nciao\nmdt\nmallett\ngash\nharpist\njessop\nmateria\nies\nrza\nclassy\nxls\ncomplementing\nfatboy\nseasoning\ncrackpot\nmetabolized\nimplicate\ndetach\nbungie\nsoli\nabsences\nswf\nbewildered\nimposter\nmoiety\nricoh\nkrupa\ncomplicates\nsteelhead\nandalucía\ndnc\nmiya\nlamenting\nkapil\ntcs\nadipose\nlimo\nrepeaters\nblistering\nmobutu\nhaughton\nkalahari\nabeyance\niskra\nlionheart\nhumphry\nphilibert\nsessile\nrilke\ndialectal\ntoddlers\nbuono\nmotile\nconlon\njeffers\nidealist\npreah\ncrabb\nputters\nwiggles\nashtabula\naeschylus\nconger\nmorozov\ngrigor\nbenneteau\nsupplementing\nbongos\nnotepad\nrothbard\nphoenicians\nomicron\ngbs\nendearing\ntir\nsigna\nmantras\nblip\nmalton\narnott\ndhl\ntimmons\nwaza\nthirtieth\nshoddy\nultimo\nrepelling\ntarek\nbarda\ntakuya\nbouillon\ntanah\npsr\nhyperlinks\nfeldspar\nchilliwack\nsapper\nedsa\nadmirably\nweigel\nambience\nrebuke\nreprinting\nfincher\ncollegiately\nbissell\ndenzel\nkiddie\naccumulator\nmeeks\nvillard\nfrancesc\nzwickau\naggressiveness\nwylde\ninroads\nspeculators\nbahawalpur\nadmiring\nheinous\naquifers\nzong\nhersh\npetticoat\ngales\nrossa\ncaxton\nintransitive\nknowlton\nchested\nsadhu\nrooftops\nperturbations\nserre\nrefrigerators\nseca\nsaracen\noffends\nbooze\nmarnie\nmendenhall\nzverev\nburghers\nfloodlit\njaneway\nacceptability\nsharman\nrearranging\ndomus\nspout\nzs\narbitron\narbitrage\ntahitian\narianna\nnena\nkoninklijke\nbulwark\naudiencia\ndeduces\nskylab\nlorelei\nironclads\npromulgation\npolyps\nmoskva\nmatta\nmadhouse\nknowledgable\nmccreary\nscudder\ncontrabass\nsporty\nsuckers\nmedio\namide\nbetula\ngault\nsynthesizing\npiz\nledges\nfollicles\nargentines\nmarcelino\nvibrate\nanuradha\nchicane\ngulshan\nberat\nscm\numbrellas\nprithvi\nvoids\ncelery\nsoftcover\nstent\nreconstructive\nwaxman\nthais\nwartenberg\ndeschamps\nshimada\nunidos\nlexisnexis\naza\nmaggio\njoes\nevangelicalism\nkirin\nhabs\nchrister\nlatimes\ncoxswain\ninsurmoun
table\nintracranial\nkrantz\nkreuznach\neyelids\nfords\ntusks\nalnwick\ncdi\nkingsford\npheasants\nripples\ngyms\nbareilly\npssa\nprincipled\nartful\nhsinchu\nmckim\nzina\nporpoise\nnotated\ngerhardt\ngriff\nnathanael\nlarceny\noceana\nmeijer\ngaeltacht\nsoldering\nmvs\nbesson\norrin\noutnumber\nirvington\nnepotism\nveit\naud\ndebby\nwilno\njonestown\nscioto\nbloodless\nshana\nyano\nweaves\nkatya\nshusha\nwestlife\nmickiewicz\nzeroes\ninstilled\nangevin\nflemington\nmccloskey\ndeum\nprovokes\ncompacted\nkalinin\ntippecanoe\ndocket\nneptunes\nbleeker\ntrembling\nremaster\nlineal\ndistilling\nminced\nponiatowski\ndmx\nandreu\nprivatised\nguingamp\nskaggs\ncreams\ndervish\nunderserved\njudicious\nasghar\nliliana\nserialised\ntyrolean\nsengupta\nkalpana\nbure\nchol\nsupersede\nwarrnambool\nlabors\ndaltrey\nbade\nbeckford\ntimaru\npruitt\nfarmlands\nfogerty\ntantalum\nyek\nextrinsic\nmodulate\nseafloor\nemmaus\ncorneille\nneretva\nhassett\nroughnecks\nimprovising\nionescu\nrimes\ntiflis\nvelma\ntomboy\nvoyageurs\ncleans\nmilepost\ngilmer\nsearcy\nnoriko\nomagh\n♣\narmes\nencephalopathy\nhookers\ndisillusionment\nratiwatana\nbord\nmuscogee\nslumped\nwyn\neia\ningmar\nleathery\nequities\nlandline\ncamaraderie\nboehm\nparalysed\nstratus\nsociales\nsabri\nsubsidised\nenríquez\nmonotone\nzala\npoi\nzhengzhou\ntruex\nriki\nunsound\ncharlottenburg\nknightley\nmha\nolcott\nshel\nthatch\ncomers\nunchained\ndisapproves\ncbo\npisces\nhabermas\nspengler\nhieroglyphic\nconstructivism\ngump\ndaniell\nforelimbs\ncrh\nreintroduce\nghq\ncollette\nexteriors\nmds\nvegetarians\ncentralization\nsingaporeans\nphotogenic\nmomento\nhayter\nmatrimonial\nbaroni\nmicheal\nbaur\nclog\nalighieri\nsummarises\ndbs\nriverton\nlebedev\nencroaching\nrescind\ntranscribe\nclarita\nburslem\nbaraka\nborrowings\ntaupo\nfairweather\ntaxicab\nurchins\nolli\nlemaire\natman\nnightshade\nroadhouse\ndezful\nbackus\nü\nooo\nambon\ndelfino\nattesting\nmorphs\nlottie\nmasur\nmirpur\ncutbacks\nunreliability\nsappho\nde
athmatch\nquimby\nquartier\nsuccumb\nmaritima\ncatalyze\nshoved\nawd\ncorkscrew\nensenada\nrancid\nburge\ndestino\nfrobenius\ntenements\nstunted\nstari\nmorpheme\nfilial\ngracefully\nsubsonic\nseuil\nmegami\nriverbed\nhargrove\npropagandist\ngabala\nivf\nroden\nbarrens\nantitank\nscheldt\ningres\neusebio\nsnodgrass\nalemannia\nmna\ndanton\nsuiza\nwilding\nsugarloaf\nwaals\nshoreditch\nbyline\ndeceleration\nnavarrese\ncourbet\nchimp\nsubjectively\nstoop\ndiplo\nmccomb\nzabrze\ndancin\nbateson\nmagnifying\npalestrina\nunsuited\nfelis\ndreamy\npavarotti\ntimeshift\npikachu\nboatswain\nglick\nramble\nogun\nimmer\nflore\npetronas\ncarrey\nhugging\npopa\nbellow\nwofford\ntrinitarian\nspink\nviolators\nreloading\nsandalwood\ndeshpande\nwarder\nedsel\nvestibular\nantares\nbernabéu\nwestover\nwanamaker\nmers\nbaptisms\nlaxman\nmoodie\noued\nprithviraj\ncanarian\nallosteric\ncundinamarca\nhaughey\nmaybach\nchesham\nradhika\nsmalltalk\neddington\nabubakar\nratan\nsqm\ncoutinho\ngrok\nboughton\nbunnies\nslashing\nwhitelist\neccleston\nshears\nsavio\nbayamón\naryeh\nmaréchal\nhatta\negyptologists\nhouser\nvamp\nalyson\ncolburn\nmenopause\nvorbis\nmalé\notaku\nlexis\nsamajwadi\npayson\nbeatle\ngauche\nkanon\ninfill\nbesiege\nflèche\nparco\nnau\nnonprofits\nkenwood\nbanaras\nlogano\nfisa\nagnostics\ndispatcher\nreceptacle\ncarnal\nwunderlich\nafzal\ntenorio\nnouri\ncuddy\nsmalls\nkapur\nlgbtq\nokhotsk\ndeshmukh\ncivitas\narborea\nscrewdriver\ntazewell\ninsecticides\nengendered\nbrassey\nheadlight\ncuffs\nshonen\nfodor\nminigames\nfairway\ntitania\nhorrocks\ngreenstone\nequidistant\nalchemical\nnpcs\nmolineux\ncalatrava\nlouisbourg\nfairlie\naircrews\ncullum\nasen\nfal\nudon\nsymmetrically\nwhitewashing\nkeenly\ntsc\nshep\nniña\negos\nhelder\nbandwagon\nicrc\nrefreshment\nlaut\npelle\nzilla\nhowlett\nills\nhemi\nbloomingdale\nrti\njeweler\nmuddled\nbinns\nmckellar\nstrayed\nnalbandian\nkrefeld\nsomber\nfrosinone\nthicke\nmondale\nchabahar\nairtel\njsa\nlapentti\ninstigator\nhalmsta
d\nerebus\npooled\neason\nfmri\nmarchers\ncienfuegos\ncowl\nvalidating\nhingham\ntsukuba\nseabrook\nreappearance\npiezoelectric\nfleischmann\nbidwell\nannapurna\nmahi\ngreenspan\nlathrop\nvolusia\nfawr\nnmda\nhodgkinson\ncriticises\nthalamus\nlynyrd\nsensationalist\nbodie\nleonel\nvolition\nkorolev\nbehr\nfoyt\nconstipation\ntallying\nbriefings\nscepter\nexaggerating\nlupton\ntojo\ncep\nbrightman\nclickable\nventing\nkyaw\nplaymaker\nstaked\nhazing\ntalal\nyoshio\nscheduler\ntrapani\nsnatched\ndevoured\ngobi\nllewelyn\nramadi\ngalore\npastoralists\nrevues\ncalliope\nangelis\nbackfire\nstonemason\nabate\nadic\nmidterm\naldrin\ngaultier\nlanda\nbrera\npgm\nmichelson\nhjk\nplesetsk\nberthed\nkeypad\nmazur\nbluebell\nfgm\nhamburgers\nnewness\ncrohn\ngrueling\ntayyip\nboac\nnahin\ndagbladet\npoznan\ngeranium\npunjabis\nminelayer\nsheeting\nvehement\nahoy\nbds\neckhart\nthrockmorton\npétain\npheromone\neishockey\nmemel\nstrelitz\nbülent\nhistology\ncoincidences\nskagit\nischemia\nsoftening\nlaye\njanes\nrefreshments\nmurshidabad\nsedative\ndismissals\nalbarn\nkamara\nkinsmen\nsociale\nrpt\nibo\nkinect\nscotti\nbreakage\nshortstops\nmooted\nkatia\nvivendi\nchâteauroux\navram\nhumiliate\narvid\nurania\nintricately\nconceição\nchur\ncardiologist\ntoner\nnaughton\ndkk\nguida\nsurges\nfujimoto\nkingsport\ncellists\nbarnwell\negrets\nposit\ndisappoint\npianoforte\ncounsels\neyelid\nkeeley\ngrudges\nbaumgartner\nmainpage\nnación\nbeleive\ntoh\nstings\noxy\nzenica\nyemenite\nbullen\nfarhan\nwindsurfing\nespy\nlado\naquileia\nzulia\nsumptuous\nrevolting\nalu\ndegrades\nautomaker\ndespatch\ncraton\nfabiano\nduhamel\ndusted\nfelling\ncoot\nmolesworth\nmassillon\nactivator\ndoorman\nlimón\nreceded\ntunica\nthickets\nformalities\nleaguers\ntuileries\nterceira\ntopple\ntrenchard\neustis\norchestrations\nloewe\nhoppe\nculling\npiedmontese\nhazelwood\naspergillus\nhsi\nscap\nyounis\nvara\nvenn\nbacteriology\naftab\nalumina\narima\ncastration\nrajkot\ndrexler\nsheerness\nwogan\ncantal\nde
nials\nstrachey\npaterno\nzaki\nbomba\nfitzsimmons\nwinsor\nscurvy\ndrawers\ntomoko\ncheques\nrheinland\nundeclared\nyucatan\ninteragency\noma\nmaire\nadaptor\nbreathed\nlyndhurst\nembittered\ntomkins\nhomogeneity\nhummer\nwelby\nkampuchea\npairwise\nmurrayfield\nsorcerers\nondo\nhelsing\nnatsume\nhenrico\ndeterred\nsledgehammer\nasr\nkotoko\nromantics\nbrainerd\nmarqués\nbeaverton\nkamran\nteja\nairmail\njunko\notello\nhagia\ngimeno\nojeda\nrowena\ntaal\nbhagwan\nnewgate\ncranfield\nhorvath\nglobus\nimmunodeficiency\nbeets\ntsung\nlargemouth\nsabercats\nhams\nbifurcation\nrestatement\nsawai\nrecuperate\nfulk\nappalachians\nbackwater\nyassin\nunreasonably\nmab\npermanence\nerikson\nmireille\ncaptivated\nelkhorn\nmoctezuma\nalbini\ntrung\noverstated\ncongestive\nsibyl\nblackness\nanzio\nwwc\nstevan\ntrolleys\napplet\nstruve\ndonning\nbreguet\ndowntime\ngpp\nsolemnly\nassemblages\nvarietal\noutliers\nwvu\nmontefiore\nbonney\ncalum\nblackmails\npoirier\nlawman\nchristening\ntraditionalists\nstumped\nbookkeeper\nkesteven\nunisex\nindulged\ndictatorships\nramming\nlighthearted\nmorpeth\ndivina\nsustains\nnorodom\nsarath\nmib\nauteur\nphg\ndrg\nviaducts\nneurobiology\nkitsap\nabdulrahman\nrulebook\npoppies\nrapier\nchairing\nfunchal\ncolson\nsekai\ngoole\ncastletown\nscalability\nalga\nmatrilineal\naltruistic\nseles\nbhavani\nspoofing\nprecept\nswingin\nimpervious\ntechnicalities\ncodon\noksana\ncoulee\nzakir\nwasnt\nallerton\nmomma\ndisposals\nbusting\nmarshalls\nfidesz\nstencil\naforesaid\nmenuhin\noporto\naphids\nspecializations\nezio\ncapriccio\nmanger\nehc\ntaut\npietersen\noshima\norigination\nwedges\npersisting\nindustrially\nrosecrans\ncarreño\nbilbo\ncoria\nextirpated\ncentrality\ndepositing\ngyu\noutstretched\njarre\nbrats\ncatamaran\ncanisius\nsuperdome\ndesktops\nartvin\nspiced\ngrooved\nwheelers\nbeto\nanamorphic\nutterances\ngrindcore\nfortify\nwhitelaw\nfayard\nrespectability\nshawl\nwru\nhallways\nrollout\nurbanisation\nkidnapper\ntheodoros\nbul\nvickie\nt
reasured\nlansdale\nposten\nthomond\nmashonaland\nabernethy\ncentralia\nmilhaud\nammar\noverpower\nhadfield\nextrapolation\nbruising\nhomebrew\ngreenlee\nkladno\nionosphere\ngastronomy\ncrofton\nlindau\ncrewmembers\nburwood\ncoverdale\nwingman\ndeplorable\nfluxus\nbermondsey\nchunichi\ntsukasa\nsnark\nlitex\nlibreville\njeweller\nimmediacy\nstoddart\nvesicle\nabernathy\nhannon\namparo\ngatling\npaediatric\nwerden\nbole\nnepomuk\nlascelles\nhaar\nvod\nkaul\nmaurier\nbrickell\nouster\nzsolt\nhilversum\nmillstone\nstabilised\nfacies\nvanquished\ngrambling\npleural\nscruggs\nfulfilment\ncharpentier\nbarba\nterran\nniobium\ncutaway\nsquier\nmunicipally\nchumash\nphytoplankton\nsoas\nautobiographers\nquackery\nconfided\nkabylie\nbronchitis\nlipsky\ngah\nmediates\ncaracol\nsocratic\nsubalpine\nlorenzi\nkunal\npré\ncantabrian\nhedmark\nmoribund\ngiveaway\nloggers\nwitton\nconvening\nmoffatt\nhoarding\nadda\ngrills\nnisha\nsubways\npresumes\nthéophile\nstorer\nspyder\ngatos\nmontañés\nmanorial\ncyclopedia\ntechnion\nkilmore\nblanton\nporphyry\ncédric\nberenice\nnarciso\nshipman\nsubservient\ncomplutense\nhomozygous\nmccrea\ncarvajal\nmcalister\nreynaldo\nbabs\ntonawanda\nsolway\npettersson\ngorski\npierluigi\nmoorings\nburne\nunlocks\nphilemon\nludwigsburg\nmeads\nprofessorial\ntabu\nberisha\nmaintainer\nagata\ngrisham\nconakry\nalbay\nincestuous\nsprinkler\ncharade\nnellore\nunmanageable\nspringing\nfranklyn\nweiland\nsunnah\nantiwar\nvalérie\ndamodar\ntulsi\nfaithless\nklara\ngrasped\nweeknight\nreusing\nobligate\nbrower\nbemidji\nfiorentino\ndamiano\ncounterbalance\nturco\nthuringian\nmartinsburg\nimpeach\nhotham\nmichaël\nhennig\nsecretarial\npeterhead\nwhittingham\nslava\naami\ndelaunay\nbroun\neagerness\nmacklin\nkadir\nchappelle\nparachuting\ncanción\nrivington\nprovable\nashmore\nrincon\nsubarctic\nstoning\npetits\nphishing\nwyandotte\nmaurya\nmetrology\nlindholm\nmonomers\nmichaud\netymologically\ndissociative\njunaid\nculminate\ngottschalk\nwataru\nangelika\ntacked
\nneurosurgeon\nokeechobee\nalkaloid\ncorio\nwavelet\ntitling\niom\nlumia\nwheelchairs\ndurations\nmarblehead\nseahorse\nasim\ncyclotron\ndeclination\nshrinkage\nfootwork\nrtd\nmclellan\npiatra\naguayo\ncaving\nannihilate\nassis\ngrammys\nmarketer\nboxcar\nphat\nwhistleblowers\nmcgann\nfeinberg\nlochs\nolivares\ntolerances\nmowat\nsulayman\npressuring\nstena\nbeery\nwheelwright\nteplice\ngini\nvoa\ntestes\nsportscenter\nmurata\nprostrate\nrukh\nbrittain\nperryville\nringgold\ncubana\ntoit\nsilvana\nboyfriends\nhenshaw\ntransducer\nmilla\nags\nfrighten\nhondt\nairlifted\nyim\nwholesalers\nsuncorp\ncarreras\ncaretakers\nkawai\nethiopians\nsauropod\nchanneling\nraat\nhaldeman\ntudo\nruthlessly\nbynum\nrewa\nswainson\nintensifies\nfrere\nungrammatical\ndeleuze\nmacdonnell\nsniping\ngeometrically\nflowery\neloy\nyoshihiro\npanagiotis\nsqueezing\nbootstrap\nbehrens\ngantry\nbakewell\nallo\nboldness\npared\nhoary\nitm\ndownplayed\nbling\nquerrey\nravindra\ncohan\nsayaka\npistoia\nconcertante\nerling\nparkside\nugg\ncrabbe\npooling\nsleazy\nuploaders\ngyatso\nhalpern\nzetas\nrooks\nbiol\ncognates\nkeisuke\nmiddelburg\napus\nsaurashtra\npersonages\ntoshi\nnaser\ncousteau\nveers\nstieglitz\nbunt\ndower\nglassware\nsilhouettes\nwobble\nfranchised\nnlrb\nhiragana\nsitio\nmoab\ncomposting\nnorges\nreplenished\noverruns\nabsurdly\nasus\nnorthgate\nletts\ndrugstore\nrefectory\nrimouski\nnen\nnarmada\ntuc\nrealtor\ntupou\nassr\nsaver\nnozzles\nnca\nyisroel\ntaub\ngallegos\ncharacterise\njosefina\nappeasement\ninfamy\napostate\nburges\nciencias\nbamber\nknowsley\nabominable\nmaryville\nmaltin\nhieroglyph\nammon\nnagging\nmehmood\nizu\nindigent\nlucent\ndisowned\ndampers\nseamstress\nweisman\nmarshmallow\novulation\nclump\narana\nbreve\npms\nkalman\nozaki\ndowny\nbenedetti\naurelian\ncraigie\nbushfire\nfederative\nbudokan\nbumpy\nballantyne\nino\nmaasai\nncs\ndhar\nmillbrook\nzonguldak\nluch\nfertilisation\ndobrich\nlemoine\nschoharie\nmutter\ngymnastic\nholiest\nmolla\nmeddling\nsug
ababes\ndoak\nalk\nepc\nmatra\nresistive\nrasul\nkeren\nsleuth\nmeldrum\ngtr\ninterceptors\nintrastate\nbraşov\ncontes\nstreptococcus\nharmattan\nfooling\nskydiving\ncrotone\nfaustino\nuspto\ngiraldo\nkoro\nnaveen\nmaterialistic\nsmothers\npicayune\neca\nris\nvestal\ntsunamis\nsharps\ngamla\nlandmass\ncortinarius\nsilkeborg\nrobinsons\nshipboard\ndunkerque\nbaen\nwildest\natul\nblackford\ndfw\nmonotheistic\nsubjecting\necclesiastic\nheadphone\npolystyrene\nmoyen\nrhp\npierpont\ngatti\nvariegated\ninstinctive\nviceroys\ncashmere\nsquamous\nnetworth\nignazio\nbrumbies\nmildura\nexpropriated\namersfoort\nalor\nundercard\ndaz\nszabo\nbaudouin\nfarmhouses\ncaviar\nghraib\nwhitewashed\nantipathy\ncomputes\ndelray\nreassured\nseppi\nmarini\nbragança\nbingley\nduchovny\nginga\njaro\nseale\ngristmill\ngleb\nqué\napprehensive\nanaesthesia\nshinichi\ntuva\njerky\nveliki\nkampen\nmarionette\nmotivates\nstelae\nschwarzenberg\nkoller\nhok\npaging\nnavigated\nduplicative\neffecting\nbuttercup\nharrah\nltu\nmilled\nsorta\nbundling\nconcomitant\nsprinting\nimola\ngazebo\nsuzi\ncolloquium\nrevell\nwiesel\ncoetzer\nghee\nkeyed\nvarun\nfizz\npacify\n¹\niib\nbottlenose\ngastón\npyrotechnics\nmatsuo\nculprits\nrussula\nkho\nmsf\nmulticellular\nwagtail\nhiromi\nribosome\ncerrado\nkeirin\neibar\nmcnaughton\nhier\ncorset\nfolktales\nscs\nstrafing\npfg\nclitoris\nquilts\njukka\nkarp\nmilitancy\nfreikorps\nayyubid\neffeminate\ndávila\nmcgarry\nvino\nnouméa\nindulging\nprolog\nlaon\nakari\ntomar\ntrachea\nbaskerville\nmarciano\nairframes\nploughing\nsouthfield\ncircuses\nnyj\nbunn\nnott\nshattuck\nabia\ntillamook\ncancún\ncarmelites\nchandan\nstortford\ngcses\nruble\nzidane\nfuries\ntürkiye\nrebus\naddy\ncaledon\nnecklaces\nsinéad\nvaca\nevanescence\nmountings\nglanville\nifs\nshinoda\nagen\nfier\nbessel\nsiddhartha\nshortness\ngobble\namericanus\nsuda\nantoninus\noneonta\nbolden\nsurcharge\npbk\nfoxtrot\nharv\nrothesay\nacheson\ntupi\ncoverts\nbayerische\narpa\nglobetrotters\nmayonnaise\nisha
\nreforestation\ntasker\nthrusting\nsalvator\nschleicher\nicj\nmukti\nidents\nplatformer\nexaggerate\nvena\ncobbler\nhemmings\nwargame\nvasyl\nunconsciously\nsixpence\nmarinas\nviolist\nanathema\nleni\nmages\nkoren\npawan\nament\ngrendel\ntamarind\npyne\nhomesick\ndropbox\nskippers\ngiménez\ntransitory\nnambiar\nunderwing\npaces\npinafore\nundescribed\ncourtois\nfibreglass\ngbp\npox\nsilverton\npickers\nwerribee\nreise\nindios\nase\ntrixie\nbanshees\nsfa\nalbertine\nembarkation\nantibacterial\nrafsanjan\nmatanzas\nsmedley\npinheiro\nbenno\nbrak\nhypocrite\nbosons\nbasilicata\nhaeckel\nsanctorum\nmatin\nanabaptist\ncoupler\nmuhajir\nbogdanov\nindepth\nhesiod\ndeceiving\ncapp\nnadar\nplana\nius\ntrimester\nbalu\ndrowns\nhdl\nojo\nhorthy\nheeled\naber\natlases\nsammarinese\nspyro\narquette\nmichaelis\ndelmar\nislamia\ncolitis\nantillean\ntoomey\njarrow\nnazrul\ninfringements\nonerous\ntecmo\nburris\nknitted\nbly\nzila\naco\narriba\nweintraub\nhooray\ndifferentials\nogc\nfightin\nscharnhorst\ncawley\nabutments\nopioids\nnci\nmayumi\nvictimization\nfireflies\nfreeform\njerks\nencloses\nashwin\nveranda\ngakuin\nshenanigans\nwelk\nrhyl\ndethroned\nreburied\nwad\nyonder\nsaks\nseditious\nbridlington\nneglects\nrogério\nmccluskey\nrebutted\nclavinet\nnihilism\nnagata\nkwa\nbertin\nhooking\nsubjugation\ngrigori\ncbm\npagodas\ncayo\ncorliss\nbma\ncabana\nbrooch\nnonverbal\ncobblestone\nprerequisites\nbeveren\ndanby\nirises\nomnivorous\nrian\ncarnivore\nsummarising\namputee\nnalchik\nrevolución\npalatable\ncti\nscrope\ndamming\ntolima\nketch\nagung\nrader\nvoetbal\neclipsing\nmonck\ntandon\nrowntree\nbootle\nluba\nwinemaker\npolisario\nboaz\nral\nquatermass\nexpropriation\nslanderous\ntubbs\nnaoko\nneuf\ncorsairs\npinker\nmedulla\noblige\ndeviates\ngeldof\ngilad\nchalon\namorous\nincubus\nferruginous\ndede\nwhisperer\nlovable\nbuss\nculp\nsnowdonia\nassiniboine\nraimi\ndreamt\ncrushes\ntuners\nempiricism\ngerminal\nfaustus\nmanorama\nseedling\nandorran\nbanka\npaulin\ncreeds\nh
annity\ncountesses\nplurals\nouch\npronouncements\nfides\npcbs\nbarreled\nwalling\ncrafty\nallstars\ngrâce\ncitrate\nrepo\nseaworld\nintermezzo\nshrinks\nvideography\ntelecaster\nflapping\nsecularization\nhela\nbratton\nntfs\ndelius\ncamry\niver\njeonju\nvanda\nenschede\nfairtrade\ndads\nbambino\nblackbeard\nlaoghaire\nedel\nsbb\ngotcha\nrereleased\nvollmer\nmicrocontroller\nfelixstowe\nspeight\nyeni\nbillingsley\nbigoted\nbettered\ncliques\nfaulted\nmarikina\npiedra\nthereto\nreeling\nbotticelli\napologists\nchutes\nhoneysuckle\ndistancing\nhobbyist\nlombok\ngoshawk\ncmp\nescalates\nvignette\ngandy\nmaersk\nracking\nstubbornly\ntrampled\naurel\ngoldsboro\nhargrave\nminorca\nravenswood\nkoen\nbielsk\nsyr\nmelanin\npou\nbenzodiazepine\ndukakis\nindustrious\nfrancoist\naugustana\nbustard\nyolande\nparamilitaries\ntimm\nkjv\ntokio\nbrin\nfeaturette\ngouda\nromaine\npah\nlumberjack\nuan\nmauve\niquique\nsubramaniam\ngte\nbreslin\nfaulting\ndota\nlage\nrallycross\nfahad\nmesnil\nustinov\nmovimiento\nmuang\ncoverings\nyushchenko\nhashing\nburford\nmolokai\ncoster\nnacionalista\nzalman\nhast\nhoman\neggplant\nkati\nbeall\nreserving\npineville\nratnam\ndistillers\nzanesville\nheenan\nguilders\nohv\nminimization\nlusk\nmaliciously\ngrocers\nhercule\nasic\nhermanos\nasch\nshinty\npelton\nschwyz\nhorny\nstimson\nmichels\noverlords\nequalling\nlugosi\nhoshino\nekiti\nvantaa\nkanaan\nunthinkable\nunanimity\navowed\nsniff\nbache\npitkin\nvenetia\nunt\nnbr\ncandide\nkoga\nvier\nwaders\nried\ncavalrymen\nattests\nkubo\nearring\nminecraft\nnanette\nhearty\nforesaw\nkirsch\nazadegan\ndispersing\npatrilineal\nconstanţa\nvirulence\nlimpets\nnickerson\ndeckers\nullyett\nlofts\nmassing\nlaryngeal\nthiers\nbaños\nsequitur\nbeaded\ntove\ngondor\ngyeongju\nmaniacs\nritualistic\ntenochtitlan\nmago\npaleobiology\ntallaght\nspillane\npleases\nhowdy\nalm\njayaram\ninvoluntarily\nstarfighter\ndwarfism\nexasperated\nleoni\nmachined\nbuckles\nwester\ncarrefour\ndolmen\nwaltzes\ndistressing\ncooker
\nmagellanic\nprashant\ncaboose\ngreyfriars\nandra\nmoorcock\nempresa\nuffizi\ncoriolis\nbarbier\nmagik\naral\nvalentinian\ntypographic\nasano\ntobey\ntuco\nsethi\niambic\nseverino\nboosts\nlesh\nruckman\nnettle\npursuers\ninterlinked\nhopman\nshvedova\npanelling\nlashkar\nbuns\ncryer\ntrawling\nboletus\nabacus\nlapis\nfba\nmaio\nbrim\namiss\nbelushi\nlandmine\nrosslyn\nnewhouse\nsuave\nunderlie\nshelbyville\nlongley\nlancastrian\nyaya\npsychopath\nplatense\ninfiltrates\nrheims\ngeopolitics\nhob\nmajumdar\nsuperconductivity\nmignon\nfloatplane\nneeson\nbéziers\ndeflections\nquint\ndispenser\npuna\nmussorgsky\nadachi\nstansfield\nkhor\nhickok\ndumplings\nvinh\nloyd\neyebrow\nbombardments\ngib\nvinatieri\ninclinations\ntosa\nbushman\nsastri\nmariinsky\nclowes\neris\ngenealogist\nblackouts\noxymoron\nbested\nmushtaq\nburl\nshruti\nilm\neverson\nsamford\nwl\ncapillaries\npyrrhus\nvaulter\nbullfighting\nsmoot\nalexandrov\ncorrupting\nheterogeneity\ncbf\ngrazia\nsaco\nilia\nsomoza\nmythbusters\nrestate\ngruff\nbirr\ncumann\nhoneycombs\nsica\nhughie\nwegener\nheartbreaking\nbeanie\nglutathione\nrhoads\nbogor\ncatalana\nmahinda\nkristi\narmadale\nvig\nincas\nkristensen\ngeckos\nthicket\nreformists\nuninvited\nnovosti\nwich\nbaldy\nfullness\nreinvented\nstradivarius\nreinventing\npeart\nbirkenfeld\nindependencia\nhooligan\nasami\nwimpy\nathlon\nbaghdadi\nshowa\nmichiko\nens\nviolets\nunrecorded\norly\nsudhir\ncampana\negyptology\ngunnarsson\nlubavitch\nsive\nkanata\npreload\nepr\ngorakhpur\nningxia\nsandnes\ncatamarca\nsubatomic\nkowalczyk\nmothership\nsyndicates\nclearinghouse\nembryology\nfaithfull\nqaida\nneoplasms\npirated\nsportif\nyummy\nwaitakere\nerbil\nrelaying\nmotorboat\nallama\ntmz\njago\nrut\nchacón\nabid\nhypersensitivity\nspg\nspeculum\nmikkel\nvilma\nmultifaceted\nipsum\njeanie\nsylvania\nmalfunctions\nbaldur\nexcavate\nendzone\nsadhana\nedgware\nsugden\nsliders\ndiable\ndalles\nnahyan\nlauro\nsaskia\ncoloma\nbartram\nsolna\niim\nlengua\nmerion\nzlatko\nnovay
a\ningraham\ngoiânia\njeezy\nmamba\ntdc\npurpurea\npiggott\nprescribes\ngeetha\nnostrand\nviseu\ncardwell\nmckeon\nkink\nlacombe\nrévolution\nameer\nshinobu\nsteubenville\ngulfport\nhiroko\nprim\namalfi\ntransgenic\nabduct\nwroclaw\nericson\nculpa\nbukowski\nstarks\nindefatigable\nconfraternity\npynchon\nbedchamber\nheligoland\nthibault\nlilla\ndemobilization\nvacationing\npaddles\nparaffin\ncolloidal\nwightman\nsandiego\narticular\npondering\narshad\nbonheur\nstriated\nthemself\nbobbi\ndesam\nsubcommittees\nhol\nhartwig\nmultitasking\nasme\nbabington\ndmitriy\nabramson\ncochlear\nrankine\nmarthe\nreinvestment\nburlingame\nshonan\nkhoury\ncopyleft\nkeyserling\nmasao\nbuin\ntallulah\nmizo\nbrogan\nnerdy\nbek\nmountainside\ndiscerning\nthickly\neurydice\nopining\nsamaritans\npalaeolithic\nfriel\nmatsuri\ndiscerned\nvioleta\nunderrated\nutama\ngru\ncucumbers\nojibwa\nebbw\nreaffirming\nculvert\ndancefloor\nrecruiters\nfukuyama\ncitric\nwhine\ntye\nragnarok\ncourtauld\nvillalobos\nkelleher\ndeflation\ncondorcet\nbesa\nflorio\naho\nmisogyny\nabsorber\nmauretania\ndemilitarized\ngoro\nskirting\nballing\nzúñiga\nfinkel\npaulinho\ncws\nbuffon\nbloomer\ndemir\npavle\ngrigore\nstriata\nlop\nitza\nholyfield\ncorriere\nrossiter\nhialeah\npredicates\ndumbledore\nambivalence\nmasaaki\nbellinzona\nlocket\nibex\nprincesse\nkhin\nwronged\nintendant\neldar\ncouplings\niterated\nrelatedness\nulrike\ngeylang\nfamilie\nphylloscopus\npopulating\ncvo\nunione\nphilbin\ncurd\nkurtzman\ngurdjieff\nmanipuri\nprofusely\npearlman\nhohenstaufen\ntripathi\nfem\nstarlet\neko\nsawa\nsze\ntasso\nshephard\nprensa\nngai\nkitsch\nhsing\njanette\npicnics\nprotour\npamir\nrijn\nskene\nquand\ncattaraugus\nbez\ninsanely\nreinvent\nverney\nyearling\nmamadou\nbrawn\nferruccio\ndaugavpils\ndupuy\nupmarket\nlehr\nsires\nuruk\ntuolumne\nprogreso\ndury\nfellowes\nhadron\nbremner\nmraz\nwesson\nswirl\nligeti\nschulte\nhyogo\nsecondhand\nconcedes\nunsupervised\nfarren\nkmart\nheadstrong\naccipiter\niac\nsampaio\nc
rystallized\naltay\nendeavoured\ncuckoos\nket\nbrome\nartigas\ncommissars\naragua\nnourishment\nlongworth\nhos\naudiobooks\nhébert\nnieuw\nwoodworth\nincomparable\nbatumi\ngraciously\nkutuzov\nthermally\ncorr\nichigo\nscanlan\nbrickyard\ntamper\ndecomposing\nprokaryotes\nbim\ncalabar\nsmtp\nmosher\nrebrand\nwenlock\nishtar\nilex\ntightrope\npki\nglancing\nfriedrichshafen\naphorisms\nmellencamp\nchaplaincy\nkyrgyzstani\nrallye\nruger\nmyriam\nhopf\ncyborgs\nteesside\nskien\nnellis\numd\ntalkback\nmixtec\ngiggs\nhorten\nraonic\ntachycardia\nfaqs\nsavitri\nfyfe\nerectile\navar\nskeeter\ntonya\ntiti\ntomi\nmurry\ncuticle\nransome\ndiospyros\ndtm\nsasso\nqed\nmattek\nkurz\natheneum\naugie\nbritta\nbathed\nbuñuel\nschwarzschild\ntubb\nredwoods\npforzheim\nburnette\njools\nbuzzing\njpmorgan\notra\nedwardsville\nratner\ngioia\nincontinence\nschatz\nmccutcheon\nbangles\nforetold\npartenkirchen\nsanada\nwondrous\nripken\nrosebery\nahab\nfrontrunner\neugénie\nanointing\nnovus\nwuthering\nfolger\ndiggs\nblinking\nkagyu\nhod\nttl\ncopycat\ngershon\nchars\nwilf\ngyllenhaal\ncavallo\nsockers\nhawkwind\nchaperone\nclitheroe\nicl\nharrelson\nreflux\ncleaved\nirate\nfeisty\ncrumble\nretd\nheadstones\ntengo\ngce\nmcnamee\nsamut\nkrs\nhommage\nporosity\nchiyoda\ninalienable\nsarge\ncalaveras\nkeokuk\nbahama\ntownley\nfap\nmisrepresents\nmatilde\nlhc\naugustan\nsueño\nsunburst\nrobotech\nbre\nroseland\nbagram\nhaggerty\nshizuka\nalireza\nbridgnorth\ncoauthor\njuventude\ninteruniversity\nfergal\nhavering\njbl\nlicenced\njanesville\nlene\ngiraffes\nthine\nteater\ngung\ncolwyn\nmaior\njis\ntisza\nfeld\nautonomously\nstichting\nenamored\nfriendliness\nlani\npanning\nhowler\nsisko\nrosset\nmaxilla\ncharmaine\nrenshaw\nhaddington\ndua\nbrunet\ncond\nlydon\nrenominating\nwor\nelina\napolitical\ntapings\nbester\njaye\nhaplogroups\narkwright\nfilmworks\naviary\nrattus\ndhillon\nuar\nantena\nkhawaja\nwailing\nrahal\ngeorgiev\ndurkin\nsvn\nsarin\ncarell\nstrasser\nbande\nkumara\nmaitreya\nkalat\nf
airer\ndelves\nricks\ntooting\nzoltan\ndicky\nconformance\nenna\nnala\nwls\ngodot\nisherwood\npettis\nwissenschaft\nincompleteness\nmerah\nfath\nblackwall\nmeagre\nfiguratively\nunbecoming\nfannin\naffordability\nsailplanes\nvidin\nmmx\nproportionately\nhowards\ncdk\netoile\nproline\nexons\ndiphtheria\nintricacies\ndisguising\nspook\nconcha\nkuma\nfonte\neastlake\njunkies\nconformist\nweems\nscavenging\nshuffling\nliana\nbelive\nparticipations\ninbred\nsubheading\ntauris\nowensboro\nromanos\ndeepa\nhowden\ncityscape\nwcbs\nbloat\nlatins\nobadiah\nleticia\nnigam\nlika\npawar\nofficier\nplums\npontypool\nbenchmarking\nredcar\nraff\noryx\nmelendez\nkrill\ndefector\ntaunted\nillingworth\nunpopularity\nsilveira\npageid\nfreiberg\nmilltown\nstonehouse\nrebelling\nradix\narma\nmonotheism\nhumus\nbishoprics\nappraised\nhaters\nmcguinn\nexecutors\nkevlar\ndenby\npampas\nkurtis\nshania\ngurudwara\nshimmy\nseptimius\njanvier\ndnr\nuplifted\nlundberg\nbraniff\nkeg\nlecco\nhavant\nspindles\nmittal\nquadrangular\nscribble\naschaffenburg\nkraken\nsmes\ntransparently\ndalziel\nboc\nmicroscopes\nmamet\nkluge\nphenyl\nmotorcycling\npoked\ndeschutes\nreale\npierrepont\nmerckx\nelihu\nmcfly\nbodega\ndaviess\nmeyerbeer\ngreuther\nvyborg\nbellucci\ncarers\nashbourne\nenactments\nluan\ngarratt\ncapistrano\nfibt\nstrummer\ninstalment\nausterlitz\nfantail\nseagoing\naraujo\ngalilean\nchay\nfragility\nbazooka\neazy\nperceptive\nbts\nknockdown\nouro\nbursaspor\nsmetana\ncleburne\ndoktor\nsquatter\nhouseguest\nrefounded\naligns\ndinos\ncompressing\nbhumibol\nklaasen\ndally\nmasri\npilipino\necs\nlegg\ndour\ncephalopod\nnisa\nsanya\ngua\ndaa\ntoshiko\nrudge\nwindfall\nburundian\nlocusts\npiscataway\nnaperville\nnombre\nhyaline\nsiebert\nsnopes\ndft\nhayato\ncampanile\nghostface\ncoastguard\nvanden\ndatsun\n›\napologises\nfaltered\nmclain\nspecialities\nclemons\nbjarne\nmazhar\nverandahs\ncharlevoix\nramachandra\nresonances\ncromarty\nselatan\nneurodegenerative\nkrishnamurti\nfrostbite\nalbus\ns
hirazi\nnortherners\nchaz\nprivates\nbedingfield\nsorkin\ncanceling\ncurrier\nodors\nfairleigh\nlokeren\nkäthe\nambrosius\nlibrarianship\njawad\ndalida\nnormality\ncitizendium\nstrangle\nmenswear\ngaslight\nracists\nphysiotherapist\naéronautique\nbohm\njoh\nnostradamus\ntestimonials\nradisson\ncherno\nsaros\ncaproni\nslims\neurostar\ncarib\nlida\nphitsanulok\nbogan\ncheever\nminangkabau\nwysiwyg\ndnepr\ntights\neretz\nslates\ndeventer\ncardenas\nvani\nroni\nlovat\nuninsured\nbandages\ntks\njct\nccn\nkis\nferré\nrte\nsandi\nbru\nwyk\njuni\nlecter\nforst\nundergrowth\ndecomposes\nveered\narup\npasternak\npfister\naml\nvcr\nquipped\nghalib\nmonteith\nlewd\nswitzer\nnormanton\nhuawei\ncomically\nssg\ncilla\nerrand\npsm\nwedderburn\nuanl\nintelligently\nminion\nuniverso\nzest\nnri\ngroza\ntorbay\nbolognese\nturek\nimd\ntrophic\ncolquhoun\nhedgehogs\nborer\nballou\njoffre\ngalante\nkamo\nglorify\nrinse\ncrist\ntroubadours\ncopyeditor\nkaw\ngeoscience\nlustre\ndentition\nforex\ntote\nbeja\nscorecards\namano\nbribing\nforeshadowing\nheme\nseptimus\nflamethrower\neif\nmontaigne\noxon\nberner\nbenavides\ndespises\nacrobatics\nzob\nshorta\nsocializing\nsleaford\nturmeric\nferri\nfistula\nsify\nmoonstone\nludmila\npetru\nmisrepresentations\nminarets\nmims\ntopsy\nasistencia\ntve\nmultiplexing\nsharan\nrizvi\nmedico\napproximating\nkomsomol\norienteers\ngremlin\ndenunciation\nprevin\nchapbook\ntodor\ncamberley\nunitas\nccg\nsubtribe\nrelinquishing\npankhurst\npalumbo\nuriel\nfou\nexacerbate\ndiouf\nchabot\ndivan\nkishan\nstenhousemuir\nhanafi\ncondoleezza\nyangzhou\nhavent\ngade\ntosu\norifice\ninstinctively\nbeninese\nhomewood\nscottie\nyardbirds\nstructuralism\ntempos\ngunston\nstarwood\nghibli\ncomposure\ntaser\nhannigan\ngalland\nrothenberg\ncronkite\nscrappy\nmordaunt\nvijayan\nrabbah\nrejuvenation\nrwd\nguernica\nmuzaffar\nbharata\nrubenstein\npce\nadarsh\nwafers\nkatrin\ngametes\nvaillant\nroku\nadac\nweatherford\namf\ndunkeld\nmoyes\nchiefdom\ntrilobite\nhofer\nabitibi\n
malformations\nbromberg\nfiddling\nemplacement\nnal\nfinlayson\nnephi\nrikki\nclydesdale\nsulcus\ngoma\ngudrun\npanhellenic\nalbertus\nwestphalian\nkolding\ntorii\nabrasion\ningushetia\nnaresh\nlengthen\noutgrown\nersatz\nmontcalm\nbogle\nwaterboarding\nalka\nmattei\nechelons\nchivalric\nadhesives\ndabney\nmenorah\npyrite\nmaccoll\noostende\ncounterweight\nmoderating\ncorzine\ntali\nlavalle\nphilosophic\ndeformities\norganelles\nbioshock\ntransceiver\ngranddaughters\nmilorad\nhinde\nroubaix\nexpulsions\nhiratsuka\nunapproved\nbelatedly\nhepworth\nmelanesian\nnortham\ntamayo\nspaceships\nnovela\ndivider\ngpus\nlignite\nmatchmaker\ncrept\nsth\ncornel\nchiron\npowderfinger\nmacedo\nmangal\nmeisner\njanse\nambler\ngunslinger\njónsson\naguiar\ntik\nwomanizer\nhata\nprototypical\npernicious\nryanair\nhurtful\noverused\ngallego\nappreciates\nsindhu\nsopra\nmalini\nrealtors\npelota\nsinica\ntaffy\nsaipa\nconditioner\ngneiss\nfranchising\nfreaking\ndauntless\ndissension\nvax\nnipples\nlazer\nseh\njulieta\nunni\nelin\ncicada\nduca\nbillet\nscrewing\ncentrifuge\nupheavals\ndcu\ndavila\nindicus\nbubonic\nbumbling\njenkinson\nbolingbroke\nwigner\nchandeliers\ngaumont\nfanclub\nlinker\nmarca\nagee\ncycliste\ncytotoxic\nkannan\npolitic\naxillary\nmisra\nprue\nnimble\npulsating\nperfecting\ncont\nmulla\nbombarding\nmonegasque\nfantagraphics\nlynched\nopine\nmarbury\nzhuhai\nshiba\nnorthwesterly\nvisage\nlandmines\noost\ngravesite\nsecs\nmarisol\nchangeling\ntine\nbegley\npirata\neponym\ntarg\ngigas\nmoller\nprides\nskiff\nconstrain\nhinders\nisan\nlombardia\ngawker\nyle\nandrogynous\nroldán\nmelanogaster\nakai\ndmytro\nnatsu\nmohegan\nkennard\nguinevere\nspontaneity\ncoffeehouse\nweinberger\nauthoritarianism\nopulent\nothman\nperrot\nreconstructionist\nfusarium\npremièred\nreredos\nsalvaging\nthoroughbreds\netv\nminutiae\nergonomics\npinion\nsprocket\nfalsehoods\nhbf\nadequacy\nreceding\npotenza\nsocialize\ncrawled\nslaven\ntroughs\nparham\nmillen\nnowell\nchemie\nthutmose\nfogo\na
ad\ndeol\nfrobisher\ndebugger\nmcnab\namadou\nroby\ntule\nayaka\npinchot\ntachi\nattestation\nfarthing\nmanipulator\nmagruder\nvoight\nsaulnier\nescudero\nadvertises\nwreak\nbeano\nschwarzkopf\nthomasville\nwayans\nquakes\nnikolaj\nadoptees\nsadi\nnris\nsuccumbs\npapillon\nbrainstorming\nblabbermouth\nmaccabees\nfanfiction\ngasses\nlesbos\nalfreton\nvapors\nyokoyama\nramsden\nherbicide\nprimers\nsmokes\ntollway\nprasanna\nchenango\nuncooperative\nviciously\nfootnoted\nfsf\npeduncle\nmoreschi\nneuen\nfms\ndoty\nfabricating\ntruffaut\nsagrada\njeeps\nsmithy\nmourinho\ndrapery\nbq\ncarnes\nkhoi\nmyst\nrete\nafricain\nbankura\nunsecured\nwaynesboro\nwsc\npitta\nguha\nmethinks\nredundancies\nditched\nepworth\nmaron\nfaçades\nbaffling\nfsn\nmarquesas\nlingam\nbilaspur\nbiochemists\npiel\nmcdonagh\nbunge\ntrims\nseiko\nperestroika\nfolie\nparle\npoona\nchloroform\nneuquén\nserviceman\nnpo\nhristo\nbrookwood\njule\nladoga\nmaithili\nerb\nviticultural\nretrofit\nmoisés\npedimented\nshwe\ngrice\nlnb\nmyeloma\nsolenoid\nbasilio\nferocity\nanswerable\nphyla\nunveils\nartic\nwither\nudr\ndoble\nyossi\nakc\nbatson\nasses\nstipulations\npolices\nsummerhill\npraga\nreggiana\nsheela\nbiggar\ninterdependent\najith\npenarth\nmolested\nrestorer\nsistine\npaulding\nplaybill\nklug\ngowda\nchepstow\nbelen\nbcd\npropellants\nstowed\nmarginata\nalesi\nrafaela\nmessner\nmarketable\ninternationalization\nandronicus\nwyre\ntelegrams\nfluctuate\nassunta\npurnell\nkpmg\nrenzi\ngörlitz\nofi\nbathgate\ndetonating\nrattan\nleyla\ngebhard\nfrusciante\nbley\nocs\nlitton\nunfulfilled\ntraynor\nappreciating\nprosthesis\ncardamom\nlegionary\nboomtown\nopeners\nforeshadowed\nmassimiliano\npurebred\ntrivedi\nsandrine\nbalikpapan\noruro\nblocs\nannexing\nblasphemous\nstomping\nannexes\nallyson\nsager\nlysander\nbasements\nwrasse\nhirano\nimmigrate\nilkeston\nksenia\nyul\ncountrywide\nletras\nmende\nbreitbart\nlipman\n）\nsheehy\npugin\ncybersecurity\nursuline\nbolin\nkhe\nviki\nbushland\npatroness\ncodringt
on\nirenaeus\ndeejay\nlettre\nbharatpur\npecking\naftershock\nreka\nstoneman\nreferent\nneufeld\nyolo\nakins\nslamming\nregionalism\ninadequacy\nsaldanha\ncounterexample\npentland\npolskie\naudie\nmorgenthau\nmatsushita\nagrippina\nllyn\nalmada\ncerritos\njaded\nacu\nnewburyport\ndez\nsohail\nphotojournalism\nmusser\nstabilisation\npetioles\ntng\nforgave\nssm\nheuristics\nbackend\ndiatribe\ncharteris\ncooch\nfollett\nbituminous\nlibrairie\nthreshing\nstromberg\nvictimized\njib\ncruikshank\npatroller\nklinger\nbluewings\nseacrest\ncim\npembina\nlockerbie\nattainable\nbfc\nalentejo\nwip\nkhanty\nbss\narjona\nsouthwick\nevian\nuntrustworthy\nredhill\ncréteil\nhazleton\nilyas\nmcbain\nanca\nbasilan\nrorschach\nventriloquist\npeschke\ndampier\nmasayuki\nilves\nwillcox\nbrickworks\nirreconcilable\nreflectors\nsupposing\neliade\namass\ndickenson\nreversals\nhanbury\ninfielders\nbhatia\nvicomte\narabesque\nkosi\ndocudrama\nclr\nrappaport\nleggett\nappendicitis\njardim\ndanson\ninteractivity\nconcocted\nmtu\narda\nphenolic\nviolator\nshrestha\nmtc\nbier\nvannes\nphylogenetics\nlossy\nalagoas\ncmi\nshinde\ncultivators\ntakada\nlalonde\nwynton\ntagus\nmccrae\nmccourt\nfulford\nhardiness\ntoasted\nmasada\ncoenzyme\npessimism\nponca\nmors\ngravelly\ncivile\nczesław\nmatej\npolities\ndropkick\nelucidate\necommerce\nepistemic\nhumberside\nsallie\ngoulet\nimai\ncroc\nlumbini\npanini\npervert\neglise\njurek\npalomino\naggrieved\ngoodfellow\nnae\ndemesne\nsift\ndubh\ntrichy\nlanzhou\nlordships\nmagritte\ngigantea\nchaumont\nreay\ntruscott\ntrapezoidal\nnewsmagazine\nremoteness\nrabe\nwynter\nidentifications\nseminarians\nmacaroni\nfrills\nmayen\nawan\nelaborates\nmedeiros\nshahrak\ninnovate\nhuss\nkutaisi\nhandbag\nvrt\nmilena\ndothan\nconcertino\nmunger\nmstislav\nmomentous\nsitara\nberk\nsilliman\nhartnett\nsweethearts\nphotometric\nyoungs\nwaiters\nactuator\ndandridge\nprobst\nscavengers\nrestated\noldman\nmenschen\npenetrates\nchertsey\ntsim\nreminiscence\nacha\nharrisonburg\ncar
diomyopathy\nfranciscus\ntipo\njusticia\nappendage\nides\npicnicking\nspinelli\nroofline\nbullitt\nfaulk\nbarnstormers\nloder\nabwehr\npattinson\nshinawatra\nverband\nskylar\nreforma\npompeo\nannesley\nemg\nmontreuil\ngallaudet\nloka\nspinosa\nredistribute\nmaturin\nketones\nhairstyles\ncharleville\npresser\nkissimmee\nakio\ngz\nbartel\nbulkheads\ngrandparent\ngiotto\ndesertification\nguested\nfouad\ncampanella\nrupa\ncementing\nsmoothbore\nquestia\nromy\nnirmala\nexcerpted\npresbyterianism\ngigolo\nhutson\nratnagiri\njip\nmoderato\ninane\npatronizing\nbuie\nwetherby\nsaxo\naccentuated\njur\nsonne\nyoder\nxperia\nexpend\npenticton\ndisembodied\nastroturf\ncolonised\nghaziabad\ndeactivation\ntáchira\ntalmadge\npterosaurs\ntokai\nmcluhan\ntroilus\nchilde\nbaghdatis\nraucous\nminardi\nweathers\nloaders\nhellcat\nluxuries\npontchartrain\nlandskrona\nfea\ndunaway\nlamarr\nleaching\nhimmel\ncribb\ndribbling\netruscans\ncabriolet\ncanna\npotemkin\nkranj\nkilowatts\ncivilisations\ndrina\nfitter\nphobias\npoussin\nwetzel\nfoundling\nscull\nresisters\nossie\nshrublands\nimparts\nsharpshooter\nsumitomo\nbarbe\nnightcrawler\nxxix\ngriffey\nbroglie\nvalenti\nbrigantine\ninjects\nypg\nmec\nthaler\nvernet\nunlisted\narcos\nfollicle\ndonaghy\nmannerheim\nosmotic\ntenney\nfragrances\njayanti\nsankey\nmdr\npeppered\nostrovsky\nnowitzki\ntlk\nsandor\nanhydrous\nmansa\nkoma\nconsigned\nspecialisation\npyre\nsynapsids\ncolumnar\ntrigg\nlefevre\nexploitative\nsmokeless\nreprocessing\nplantarum\nmire\ncomercial\nforestall\nautographed\nsisi\nellipses\ndomineering\ntoots\nwinemakers\njalalabad\nspeechless\nalmere\ngass\nfoie\nusta\ngeotechnical\nprob\nerg\nsaltillo\nalister\nzaza\nhorváth\nnakata\naberrant\nkuch\nyasuhiro\ndeceptively\npvp\nshowgrounds\nlacoste\nnape\ncormorants\ngodman\nqassam\nsunita\noden\ncastlemaine\ndhamma\nardea\ndeport\nselfishness\npoliticized\nsivas\njacobo\npigmented\nmaplewood\ntalleres\naskari\nendoplasmic\nconstriction\nhelmed\nsez\neroding\ncarty\nlumumba\ns
hahar\nbookmakers\nnitride\nmuon\ngerwen\nproximate\nlakefront\nzofia\nmatsu\nkilt\nsubstantively\nuy\nvaluing\nzaid\nmaddison\ngabs\nhankyu\nmaharani\nmakkah\nfanshawe\nrocketry\ntrumped\nuppland\njassim\nsasquatch\nvolos\nwooten\nkoirala\nhearse\nvasiliev\nsmp\nstannard\nporoshenko\nbpa\nschuler\nnobilis\ncoatbridge\nrovere\nquidditch\neducación\nterrors\naintree\ntutankhamun\nseligman\nmagyars\ncipriano\ndiscrediting\nstich\nduterte\nleapt\npaycheck\nconjure\niversen\ncanter\norla\namenity\ncrosbie\nhumpty\nsomalis\nworshiping\nnatur\npsoriasis\nrangpur\nchunky\ndysart\ninigo\nheliopolis\nsunder\ncastlereagh\nmustaine\nkahan\nramla\nolivine\nvirtua\nshinya\nasta\nfalaise\nreformatory\nbozo\nvachon\ndlp\ngibraltarian\nvrs\nmorpheus\ntirade\nbonaventura\nsativa\ndione\njoël\nlegco\nsatyagraha\nhakodate\nguppy\ndocile\nantwerpen\nchurn\npapier\nbaugh\nuomo\noverwrite\nassistive\netf\nbungee\nphosphates\namplitudes\noptimist\nlanghorne\nlippi\nsolapur\noracles\nbij\ncudi\nkopp\ngehry\njanner\ngrandmothers\ncataracts\nsergeyevich\nmaslin\npesky\npopham\nambroise\nnotches\nsoman\nphilpott\nconfiscate\npalliser\nthereon\nbeautification\nderg\nantler\ntagger\nlally\nmunition\nkenrick\nohno\nautomakers\ncerevisiae\namami\nverbiage\nschechter\nemancipated\nclanton\ntownsquare\nvoles\nwct\nnicodemus\nfarrer\nunconsciousness\ncashman\nsalonika\nshambhala\nmaruti\ngrandfathers\ntarnish\nlactate\nohms\nharidwar\nkass\ndalits\npoachers\nesher\nlbc\nbentonville\npotions\npanelled\nlunn\nprioritized\nqadri\npolarizing\nfinno\nbonito\ndigi\nemirs\nlluís\nmicrogravity\nrecuperating\nqh\ndorking\nkudryavtseva\nalekhine\nloca\npatric\nfpo\nboxset\nnothingness\nwyler\nthabo\nwholesaler\ncoerce\nmanchus\ntoot\nmatsuura\ndragnet\ncapaldi\nsingin\ncoghlan\nsinkhole\nboban\nhoang\nmargret\nbohlen\ntampered\nzita\nzhongshu\nshubert\nlamm\ntelevoting\nfibula\nchills\nutilitarianism\nholderness\nsheringham\nethnological\nleominster\ncontagion\nmastiff\nsavannas\nlauer\nlinke\nstocky\ndesirab
ility\nfurrow\nluminance\nwestin\nfaints\nfabergé\ntrulli\ngrazer\nprotectorates\nkur\nmitral\nhydrated\npudong\nsheri\nupendra\nhallucinogenic\nrosetti\nloosen\nstoppard\nnaito\ncommandeered\ncablevision\nculloden\nunchanging\nmetis\nsquirt\nabney\ncountenance\nportobello\nstebbins\ndamsel\nmammary\nkerremans\nriveted\nisao\nmarit\nrenan\ngere\nbarrientos\nplath\nhollowed\nomniscient\nmaltby\npatronized\nxinglong\nruthenia\nmilitaristic\nrachid\naizu\nbeckenham\nuncompleted\ntaint\npentathlete\nleakey\ntakahiro\nfindley\npretensions\noki\nmcginn\ntancred\nherts\nsharad\nigloo\nspits\ngtg\nmadama\ncampanian\nmiquel\npalatial\nmotorcyclist\ncrumbled\ntethys\ndragomir\nmantel\nindias\nmalling\nmidtjylland\nfdic\nlimehouse\nshilpa\nmacinnis\npastels\nalen\nmarooned\nelysium\nsakharov\ntraian\nricordi\nreformat\nwsa\nwhirl\nmalaga\nnicolò\njudean\napproachable\ndaoud\ncadaver\ncatapulted\nqumran\nsoothing\ngahan\ngermano\npapandreou\nwhitford\nthrall\nlanchester\npreservative\ntryouts\npecan\nsimian\npantheism\nmolise\navispa\nimbalances\nredline\noverlordship\nblackett\ncpe\nllandudno\nojos\nkebab\nhadhramaut\nfoz\noverhang\noglala\nwhoopi\nfreising\npainless\nphotonics\ntrinamool\nincapacity\nelwes\nfinalize\nloudest\ncompressive\npani\nbocelli\nbcci\nfabricate\nputty\nconservatively\nsheaves\nnumerator\numts\nzambrano\nsecunda\npoggio\ngeochemistry\nlogotype\nearthy\nframeless\nezequiel\npinal\nsolheim\naniston\nconversing\nbartsch\nhillbillies\nmatchups\nbellary\ncarmody\nfazal\nrudman\nanthers\nscotrail\npascagoula\nnep\nhakoah\nfraunhofer\nwert\npandemonium\naikman\nsachsenhausen\napeldoorn\nschwa\nnordstrom\nhydrodynamic\nclutching\ngunshots\nkock\nafire\nkosta\navanti\ndrenched\nodile\nschreiner\nquadrature\nnyssa\nanabel\nmoultrie\nenvious\nfaversham\ngirish\nrohr\npublique\nriaz\nllamas\nsawdust\naldeburgh\nkandinsky\nwillesden\nlemmy\nalbanese\nberkowitz\nstairwell\nlsi\narlo\namaury\nundaunted\nembers\nrededicated\nnitpicking\nreeks\nbintang\nsubhas\ncheyne\
ncomyn\nxenophobic\norangeburg\noverblown\nauc\nalcazar\nirradiated\nmhp\ngeir\nbegg\nnagi\nzechariah\nadem\nhadassah\nskyler\ncushions\nschüttler\ncapitaine\ngrama\nextrusion\nconditionally\ntadashi\njokers\nzooplankton\npsilocybin\nmola\nmcd\naltamira\nsyme\noat\npfaff\nallegiances\ncrossroad\ncuauhtémoc\nfads\ntourer\nfoils\npinar\nseaver\nmasha\nprishtina\nboars\napm\npleiades\ntrawl\ndetonator\nindoctrination\nbroch\nexertion\nunderstatement\nanticipates\nfdi\nlindell\njudaica\nduniya\nmanifestly\npublicizing\nimpresses\nbrenneman\ncrouching\nbuttes\ncabling\nspurrier\naccrue\ntevfik\nsalami\nkarr\nmenken\nempowers\nchard\ngrete\nhauge\nbelted\nabenaki\nfareham\nsámi\nsapp\nisoform\nsedaka\nmmo\nawad\nldl\nsargsyan\nmosman\nrabinowitz\nwaregem\nschelling\nnewsnight\ncombustible\ndownsized\ncarles\npedophiles\ngujranwala\nplympton\nbandini\nturandot\ndougal\ndarnley\npetrescu\nvoy\ncurative\nsteinmetz\nsennett\npendle\ncadmus\nvestigial\nmintz\nblacked\nbrive\nchipsets\nlimestones\nriveting\npasso\nglyndebourne\nstallings\nanimax\nneuville\nhennessey\namb\nbulgars\nrmc\nlacma\nramey\nkling\nleatherhead\nrebadged\ngillen\nastrodome\nafridi\nécoles\nmalatesta\nfennell\nvlado\nmaguindanao\nilluminations\nthwaites\namateurish\npoder\nubon\naerobatics\npostures\nhanno\nbeeblebrox\nwarms\nhigginbotham\ndvina\ncapers\nflattered\ntypewriters\nrhesus\necg\nachim\ntaguig\nnorcross\nalleviated\nyah\ndoh\nblossoming\nkhatami\nneiman\nlatifah\nblakeney\noverfishing\nlogarithms\nglo\ndocent\ncrustal\npik\ngynaecology\nlegazpi\nwentz\ntourney\nduller\nicebreakers\nreactivation\nhomogenous\nschönberg\nspacewalk\nepiscopacy\nnorthstar\npetaluma\ntenures\nlitters\nthymus\nstraining\nsussman\nsobriquet\nautos\nsande\naltarpieces\nbluestone\nchâtelet\neffigies\nscotiabank\nswoop\nhyperinflation\nrabindra\nbloodhound\nreligiosity\nhyperspace\nnecessitates\nkeil\ncls\ninclusionists\nswipe\nkhimik\ndroplet\nkristoffer\nshakin\nnev\nbuuren\ngygax\nsensibly\nproteases\npoltergeist\nzamb
ales\nmcginley\nbruner\nfireproof\nhemming\nmearns\ncommunis\nruinous\nminuet\ngandhara\niosif\nlewinsky\nraked\nshouldnt\nhoke\ncervix\norientalists\nunderpinnings\ncounterinsurgency\npietermaritzburg\nviscountess\nkaradžić\npetén\njigme\naqaba\npennants\njugend\nbenham\nmatthieu\nsuki\ngyőri\nlances\nsqft\nbattlements\nrq\nbayne\nizhevsk\nsargon\ncuttlefish\nfreese\npika\nragan\nbeamish\nemmen\nblackmailing\nanatomically\nconsenting\nwomens\ninsubordination\nmildew\ntoga\n¬\nwirt\npalladino\nsubtext\ninboard\nsubdivide\npato\nbassline\nchestnuts\nmolinari\nsuppresses\ncassa\nregattas\nculpable\nprick\nslavko\nfarcical\nvarghese\nlodger\npallava\nhemorrhagic\ndaoist\nnanoscale\nismet\nscotties\nchuen\ncbt\nrolla\nlibor\nzeb\nsuvs\nnecromancer\nnovae\npataki\nnacl\nriesling\nmerman\ninlay\ntuber\nballo\nkhazars\nkiha\nmurmur\nfestus\nknighton\nquatro\neee\nouttake\nsvay\ncynic\noswaldo\nfidelis\ntrappings\nnando\npgc\nkop\nnullify\npölten\nthanx\nfoia\nunexploded\nbevin\nfraudulently\nvllaznia\nasio\nludwik\nveils\nhaystack\nsonu\nprecipitating\npomo\nscarp\nupfa\nglaad\nskonto\nlimpet\nlomé\ngti\nnizami\ndangerfield\nvivre\nurb\nmailman\ngaudens\nquintin\npurport\nchangeover\nplunger\nmillenium\nitalicised\nhopwood\namulets\ncaballeros\nhåkan\ncocky\nnielson\ndimorphic\nproteasome\nzb\nwindings\nrebut\ntorr\nendures\nantonino\narakawa\nspad\ngrampian\nnachrichten\npby\nmarchese\ndisplacements\nvirtuti\npiledriver\ntams\ninsinuations\nburleson\nroode\ndedications\nbrca\nwegner\nxfm\ngarros\nmasi\nmiletus\nseul\nvisionaries\nhooters\nturpentine\npolitik\nere\nunreported\nllobregat\nawm\nserling\npartie\nidolator\ndonati\nstoryboards\ninoperable\nsphagnum\namata\nalbertville\nnono\nceiba\ntsutomu\nsilken\nguccione\nsemite\nsumer\ntransponders\nhowlin\npuppeteers\nlenticular\npickets\nkaos\nramblin\nheep\npopup\npoa\ntaube\ngrate\ncapps\nscolaire\nsimulcasts\naficionados\ncoryell\nsfd\nsiddique\nquod\nshawinigan\nindulgent\nwoodhead\nffs\ngolub\nfangio\nrosettes\nmargi
t\nkerri\nharmonized\nmired\nhashmi\nresents\ngenealogists\ndecently\ndiabolical\nlca\nwuxia\nrawal\ntft\nlangan\nquadra\nbeerschot\nzito\nbermúdez\nulrika\ninterpolated\nartforum\nsausalito\nchubb\nsupp\ntoil\ndeploys\nbuk\ndoro\nkawaguchi\nfissile\nfaithfulness\ntransference\npotteries\nthorndike\nfolios\nrepellent\ndisengaged\nlemberg\nstb\nwardle\nyayoi\nblaenau\nconsortia\ndeductible\nclontarf\nthar\nbikram\npoblación\nschnabel\ncicely\npers\nmicrochip\nlakshman\nobsolescence\nperishable\nponty\nnestle\ncbl\nwalken\ntribunes\nbagot\ndesiree\nmowgli\nbeltran\ncomilla\nyeomen\nparachuted\nprahran\nschola\nagena\nthump\norchestrator\nanimatronic\nrodham\nborthwick\nmne\nnesbit\nwyclef\nmicrowaves\nthurgau\nhuntress\nwomanhood\nbascom\nkrka\nanatoliy\nsaintly\napproximates\nsamman\nmanse\ninfiniti\nbeng\nbergson\ncobden\namphetamines\ncoliseo\nbsn\nmacartney\ndeprecating\netruria\nsanda\nmanju\nefe\nashgabat\nfingertips\nfurukawa\nagus\npungent\narcot\ncrédit\npasse\nfirmin\nvocalizations\npaar\nsiting\nvictorians\nlinh\nbesant\ndslr\nkehoe\naguas\naef\nble\ncandi\nalmelo\nhamamatsu\nopc\ncandler\nthameslink\ngeodesy\nprofesses\nminnow\nbandura\ninositol\nmansi\nhotelier\nschiphol\nelway\nloh\nestella\nsynthetase\nmechanistic\nsupérieur\nrecreates\nwako\nbookkeeping\nhosea\ntensei\nthurles\ntakara\nludwigshafen\ninconsistently\nrosser\nsetups\nherders\nsetae\nrammstein\nmdma\ntriage\nsarab\ndecryption\nkratos\ncdm\nrecto\nmazarin\nsensuality\ndolor\nbrut\nnobuo\nchakravarthy\nsaugus\nlomonosov\nsimonsen\nnabc\nchechens\nhaupt\ngilly\narsenio\ndocid\nwarrenton\ngrantee\narmagnac\nflexor\nbrockville\nbehrend\nmanda\njona\nbengtsson\nanant\nvms\nshar\nsammlung\nglorification\nbonnier\nlilli\ndonskoy\nnahum\nobservatoire\nmalheur\nfdl\nrisker\ntelegraphs\nfue\naage\nnavas\nbatticaloa\ntractatus\nscientifique\nfermions\nlockett\nstrasberg\ndunphy\nbandleaders\nsukhothai\nfavoritism\neic\nbellaire\nvenona\nmoluccas\nett\naykroyd\nstiffer\nsaadi\nnitrite\nonn\nparalympian
\nzephyrs\n¶\ntsugaru\nbastrop\ngiffen\nboylan\nplayford\ngodly\nnilgiri\nbridport\nrnc\ncherokees\nhaa\ninconceivable\ncoley\nbrdo\ngigante\nsarabande\ntarr\npriors\nmaharajah\nfetuses\nhammam\nlagged\nboundless\nlagoa\nghastly\nnoelle\nestradiol\nbrainstem\nacte\nfarrah\nimperials\nratliff\ndefensemen\ncondors\nolympiastadion\nsplashed\nhaitians\nkalb\nalanine\nsaransk\nfingernails\nluthier\ntricolour\nbirnbaum\nfanzines\ntidied\ncacique\nchairlift\nutilisation\nmaz\nexerting\nctu\npti\nregine\ngediminas\nhosokawa\nmountaintop\nhinsdale\ndaan\nsnappy\ndingwall\nhoyer\nesi\nchaudhuri\nubu\nnaranjo\nblacksburg\ninaba\nboateng\nlemos\nanambra\ncanmore\nmanish\ndaredevils\ngerbil\npljevlja\nbogey\nmindedness\ndeka\njaworski\ncuration\naméricas\ngumi\nkolmogorov\ndelong\nnore\nsoftbank\nkann\nsania\nsesto\noptimally\nprintemps\ncorelli\nclerc\nsupercomputers\ncolons\netcetera\nunsympathetic\nnicolau\ntarun\nschumer\ncomplicit\ndima\nliste\nandria\nzant\nfollette\nbirdwatching\nprovincially\nexcitatory\nmauthausen\ndorf\nslashes\ntelecoms\nnumismatics\ncelibate\npolítica\noconee\nkincardine\nmags\nbuen\nstaccato\ndonegan\nfretless\nfrustrate\nmantilla\ndaventry\ncus\nspool\ntranscribing\nnuff\nfiorello\nleszno\nsystema\nmalan\nnoronha\nnaves\nkyun\neuphemia\ncircumventing\nlilia\nkha\ntalker\nboomed\nblackest\nkuantan\nfivefold\njacobin\nsimca\nthomason\nperetz\nharboring\nlaz\nkash\ntelefilm\nalpini\nglosses\nutm\nhecker\ngaur\ncommutation\nbraque\npenner\nindividualistic\nmusso\ntelephoned\nmaus\nlansky\nmeniscus\nmatthäus\nxk\nledoux\ninari\nburnell\nlinearity\ninapplicable\nguarda\nlevelling\nbamba\nhaploid\ntass\nindelible\nsilverberg\nhayate\nnightwish\nferment\nconcolor\nngs\nagia\nmaeve\ndiapers\nbluefield\nsmelt\ncuzco\ncouriers\nbecher\nabkhazian\nrenouncing\nsinusoidal\nenduro\nsba\nmariani\ncondiments\naugustów\nwiesenthal\nordinating\nkes\nrosenblatt\nscud\nfeigned\nhermosillo\noberoi\npostmasters\nshapeshifting\ngiannina\nmcmichael\nspinola\nremixer\nstipu
late\nbullhead\noui\nreassuring\nrazi\napd\nbron\ngabler\nangiogenesis\nchambersburg\nvandross\nfognini\nclavier\ncolter\nclogging\ncamperdown\nanstey\namusements\nmasterplan\nbizkit\nultramarine\nhemings\ngeomagnetic\nruda\nhipper\nbloomsburg\nknin\nincrementally\nmontour\nseiner\nrudely\ngolubev\nmodulators\nharnesses\ndmt\npinged\nphosphorylated\npiaf\nschweinfurt\ncobol\nocarina\nmaggot\ntatsumi\naffixes\naspinall\nbernburg\nneuve\nmerlo\nsparrowhawk\nhetty\nrepublik\nclapper\npalmdale\npuckett\nceding\ndav\nalsop\nangering\nblackbirds\nmeld\njahr\nsombre\nboggy\nshirakawa\numlaut\nhypo\nobjectivism\nleftovers\noke\nsurakarta\nchal\ngnr\ndijkstra\nsalahuddin\nkamaz\nmillville\ndimes\nconcierge\nblowers\nfibrillation\nhairspray\nkeiichi\nstengel\nyug\nenlai\nwetmore\nwtcc\nlunacy\nflutist\nleite\nutilises\njalpaiguri\nmicrocomputer\nclotting\nullah\nkoei\nbiggie\noutlier\nperro\nrhubarb\npocatello\nrata\nrydberg\nelster\nkripke\nmalus\nspecie\nbonar\nlampard\ndubbo\nmankiewicz\nlebrun\ntoynbee\npainstaking\nblyton\nlaycock\nvéronique\nguerillas\nfederals\nhydrate\nszolnok\nleucine\nvellum\ninflame\nnorberto\ntelekinesis\ngerontology\numatilla\ncarabinieri\nhoria\nhalland\nprevost\nmeneses\nnaver\nswimwear\nguanine\nhoppers\norpheum\nfingerprinting\nwry\nvac\nmette\nbosom\nunsw\nlambast\nbabi\nkym\nbookshelf\nklitschko\nwinstanley\nthetis\nsperling\nmyeloid\ndonelson\nvitriol\nchewed\nmuscovite\nfogel\neklund\nsuperbly\nhaugen\ndini\ncepeda\nresubmitted\ndniester\ndawa\nfrancais\nvisualizing\nhrvatska\nmandelbrot\nstyne\ntaxiway\nsailboats\npvda\ntransliterations\nunirea\ndenizens\nregt\nprocessional\nindenting\nabidin\nriedel\nmcilroy\njoséphine\nschultze\npeeled\nforeland\nryoko\nnoize\ncassation\nsawan\npreble\nalcatel\nsfl\nnavin\nzaporizhya\nnexstar\nartemio\ntursunov\nmasuda\normsby\nsevenfold\nsabotaging\nspoiling\nalbertina\nuniontown\neffector\nballinger\ncoconino\nwielkopolska\nbeltrami\njps\nnikkatsu\nmorioka\nmcu\nsouthbank\ndoucet\nintermarried\nedwin
a\nbrowder\nimpersonated\nloran\netty\ncentreville\nrapunzel\npalpable\nfictionalised\nseaports\nbrainstorm\nfortin\nparaíso\nbahu\nsoraya\npressman\nseafront\nsynthesiser\nhandsworth\nbrazier\nkiley\nfilamentous\nbadalona\nqigong\ntabby\nseducing\nterrorized\nblagojevich\nayman\ngastroenterology\nkrishan\ndewhurst\nschramm\nreger\nendemol\ndogfight\nrathore\ncopula\nbellona\nzayd\ncso\nchelan\nkimiko\nrotem\ncornices\ntruncation\nnamor\nhuddleston\nlintels\nmcsweeney\nmauri\ngrouper\nmatabeleland\nsolange\ndakotas\ncarelessly\nbeleaguered\nimmigrating\nmadina\nbeekeeping\nwisbech\nschorr\nkyra\nchama\nusan\ncelina\narmavir\nsawn\nofficiate\nexpedited\nbumble\nundesired\nsatomi\nincidences\nfaustina\nvaljean\nnimh\nehlers\nsharpen\nlongoria\ntovey\nkarnak\neyeballs\nrutan\nimmovable\nharmer\naccusers\ndorr\nsurety\npariah\nevergrande\nchilena\ndespondent\nklaas\nboothby\nfeasting\nmaelstrom\nnaumann\nblomberg\ntirumala\noverdubbed\neastland\nlittérature\nselman\nalcantara\nengelmann\nbombus\nbaskin\ncosmas\nyukari\nexpunged\npompeius\nmagno\ntheism\noriginators\nfadl\nwachovia\nprotege\nboyars\nmcnaught\nfanboy\nhalmstads\nstraneo\nberle\ndivinely\nchavo\nopacity\nbod\nirreparable\nsanyo\nscape\nsabc\nfloodplains\ncanale\npera\nmow\nexcepted\nbic\nmladenovic\nhasidim\npanova\nhebdo\nlando\nwiese\nburghley\nwtt\nenz\nhydrogenation\naphelion\ntoul\nulrik\nbronisław\nkrishnamurthy\ncremonese\nneoliberal\nsolana\npanties\nheyer\nproscenium\ns,\nshuttered\ncamarillo\ndownplay\nswizz\nkap\nbloodiest\nbeatport\nmakai\ndebutante\ncrofts\nadmonition\ndiagonals\ndwyane\nsaverio\nsridhar\nilhan\nprévost\nupdike\njyp\nschön\ngcmg\nrudders\nknorr\nrésistance\npredestination\nsamsara\nmanfredi\nvlc\nhadid\nabitur\nnishikori\nglossaries\ncounsellors\nsantini\nresurrecting\nrapprochement\nludington\nparsed\nsordid\nhater\nchrissy\ncressida\ncmdr\ncaptcha\nsoka\nbranagh\npaa\nmonaro\nsteinitz\nwhidbey\nnuthatch\njans\ndalarna\npraveen\nchukchi\nespen\nshrug\nchamplin\nrundle\nfatima
h\naigle\nbenedikt\nfuqua\ntranspose\nduckling\nstabilise\ndrunkard\nsdi\nlurgan\ncataloguing\ndarke\nharkin\noutkast\nfanaticism\nhalakha\nscarlatti\ncranium\nvarvara\nmathur\nharnessing\npolonium\nnewhall\nremedios\nlikeable\nindisputably\nshankill\ngrandi\nfoursquare\nfieldstone\nthacker\nelectrics\nwhatcom\nelim\ngrandfathered\nreiterating\neisen\ndownwind\nespada\npopolare\nirritable\nbulloch\nkroner\nnaf\nsequoyah\nboop\nbalakrishnan\nreshaped\ntaw\ndeland\noù\nagostini\nreshaping\ntannins\nshadowing\ndefacto\nsyriza\nreinsurance\nmuralist\nkabbalistic\npatagonian\nannealing\npetipa\ncarlito\nrothko\nthana\nducky\njaish\nsymbolise\ncvt\ncombats\nbourque\nburrito\nzweig\nmichie\nsecondarily\nsáenz\nbergstrom\nretour\nhijacker\nzenon\ngullible\nkravchuk\nsprained\natherosclerosis\ndreyfuss\nyitzchak\nwisniewski\ngrenadian\nanopheles\npeerless\nplotline\njinping\nclasse\ndoth\nlawley\ngct\nreconciles\nlogue\nhungama\nbanga\nnatick\nclimatology\nrockdale\nkuna\ncogs\ncisterns\nramin\nflier\nverónica\nsellout\nimpeding\nluria\nwellingborough\nannemarie\nhittites\ndunblane\nastride\nmechanization\nwrangel\nskateboards\nblanchett\nmyosin\nravages\nkhatib\nhospitable\nmuskoka\nhounded\nmattingly\nsouk\npeya\nmwc\nterrebonne\nwheatland\npedra\nhth\noye\ncve\nprefaced\nrspb\ndjurgården\ncolumbiana\nbalch\nseltzer\nstrayhorn\nembodying\nwnbc\nbickford\nbough\nryuji\nvad\nshorebirds\nbosse\nweddell\ncartilaginous\ncollard\nmodell\nvsc\nuntested\npillaging\nwilbert\nmoana\nthurber\nncp\nhelsingborgs\ngermplasm\nbuckshot\ntoney\nadenine\nchattopadhyay\npermanente\nwallsend\nmineralogical\nsoutheasterly\nmclane\ngiannini\njvc\nlondoners\nnoailles\nwahoo\nbergisch\nmelodi\nhirth\nbetrothal\nmra\nsmeared\nutara\nprokaryotic\nscg\ntring\nossian\nchiayi\nearthworms\nvacek\nagronomy\nbactrian\ncastiel\nkilobytes\ncecile\ndiscloses\ncomiskey\nverlaine\nmumps\npopolo\ndarkening\npascale\ntami\ngroundless\ncalmer\narenabowl\nsewall\nlibera\nmarkey\northo\nbadi\ntarga\nkaitlyn\npluck
\ncamillus\nkurzweil\nsteepest\nsouthworth\nventilator\nebbsfleet\nkosovar\nduped\nanniston\nparkdale\nretiro\navoca\nsabino\nstraddle\nloretto\nfissures\nchevrons\nadt\npsychopathology\nangiosperms\naftershocks\nraskin\nblane\nsniffing\nchatterton\nrosenwald\nlaroche\ncentaurus\nevgeni\nembry\nhasselbeck\naai\ncosì\nfcm\nclove\nshimazu\nshai\ninvoice\ncostanzo\ntrelawny\ncarelessness\nobeys\nogasawara\njeunes\nzil\nrathaus\nconnally\nmohawks\nabsolutism\nstairways\nwickes\nhtv\nmiku\nsvendsen\nangulo\nbuzzcocks\nhanrahan\nmalachy\nsverdrup\nhelmer\nstarstruck\ndrosera\nshabby\nhass\npsychics\nheures\nkerch\ninadmissible\nweingarten\nkingfishers\ndurkheim\ntaqi\ncomunidad\ntortilla\nginzburg\nlurid\ntumbler\nbangui\nplaine\nmargaretha\navp\ncycled\n÷\naffording\ntrustworthiness\npredisposition\nborda\nnigerians\nbufo\nmillonarios\nfrançaises\ntpa\nslats\nmacular\nrockwood\nstarace\ncaer\nbroadmoor\nholywell\nsubscript\nresurgent\nganda\njintao\nmitzi\nduster\nkah\nrestyled\nkuopio\nlennard\nscrewball\nchoirmaster\ncrowdsourcing\namadeo\nmotherboards\nmateus\nflashed\nzanetti\nmenendez\nglobemaster\nracewalkers\ngladbach\neuboea\nhankey\nolivo\nesmond\nengler\nerroll\nillusionist\nportnoy\nbrubaker\nsaori\nugric\nseely\nmitotic\nmiyan\nchaux\ntorrents\nrepin\nerupting\nbelsen\nrobustus\nwoof\neero\ncrotch\nsocon\nspotter\npran\nberwickshire\nvanni\ntexting\ninstigating\nparthians\nmuar\ndecibel\nindomitable\nsharpshooters\ncohabitation\nmontano\nbalkh\nperron\nwrinkled\ngoad\nmaclaine\ndott\njel\niai\nunleashes\ngero\nadorable\nironwood\nrainiers\nsfx\nsolzhenitsyn\nnightline\nimmolation\nmadura\nflexion\ncostner\nbarwick\nwhitesnake\neinen\nwynyard\nmetrobus\nunderpinning\nencrypt\nsuckling\nbille\nurethra\naberrations\nproboscis\nmanrique\ntahu\nmalleable\nmelodious\nzainab\ndariusz\nalachua\nimereti\nhizb\nspirituals\nsuperspeedway\nsealift\nfuturo\nestaing\nremittances\nmarcelle\nbowker\nexmoor\nanaesthetic\njohar\nlumbering\nleaded\ncandido\ngeno\nportishead\ngo
ren\npips\nswank\npetrovsky\nrakhine\ngirondins\nderbies\ndecked\nbadawi\ndhcp\nvenkat\nshuja\ndravid\nnobile\ntrucker\niconoclasm\nforger\nmagne\nkhaimah\nlactation\ncli\nspreadsheets\nseimas\nhippy\ntnf\ncascadia\nloin\nbonkers\nwolverton\ncruickshank\nmpla\njorma\nwizardry\nadieu\ngrander\ngeum\narticulating\nduplessis\nbhardwaj\nsteakhouse\nrodolphe\nbiotic\nmccrory\nrps\nhori\ncga\nlifter\nchakri\nthorburn\nburra\nironstone\nsealy\nmcca\nritually\nhelier\nmagick\narba\nhamdi\nmariel\ncomforted\nsubtopic\nconvener\njacksons\ncollis\nsleeved\nnebo\nmagnetosphere\nwalnuts\nletterkenny\nprosody\npeacocks\ntaleb\nrefered\nswc\nweekes\nwaldegrave\ntweedy\nbanish\nrésumé\nkayaks\nmaneuvered\nibs\nbosna\naed\nsak\nsurendra\nstrenuously\nviol\ncarrickfergus\nliffey\nnouvel\ndemiurge\nscarsdale\nmcdowall\nmohammedan\nmacgyver\nrightist\nleckie\ndeadlocked\nvil\nfireside\nnortel\nconvenor\nnovembre\ntanka\nmercurio\nvortices\nseismology\npausini\njuried\nhydroxylase\namazement\nmaterialist\norbs\ndecolonization\nriddler\nclarksburg\ngendarmes\ngaël\ndimas\nsmoothness\nwayanad\nairings\nnisbet\nanimus\nhecate\nhoge\nsympathizer\npartisanship\nzachariah\neaux\nrumford\nvyas\nnii\nbsg\naarti\ntcl\nfunkadelic\ndink\nrelevancy\ncaudron\nparsifal\ngracias\nbalding\nasylums\nobsessions\nneutralizing\nkapustin\nmews\nepigrams\nrastafari\nkaizer\nvallarta\ncorel\nghoshal\nentrapment\ndalglish\ncreswell\nmaglev\nquince\nvasant\nfirewalls\noxidizer\nsonics\nhelin\nntr\nudc\nsaaf\ndworkin\nmascara\ncarabobo\nplunges\npolymorphisms\ncamas\nmaceo\nbrussel\nafricaine\nboyar\nspeculator\njuju\ngorillaz\nashburn\nkidz\nairasia\nmccandless\ntite\ninfrastructural\nkangra\nsues\nakuma\ngrout\ndeciphered\ndesdemona\ndeceitful\nbpl\namat\nincisive\nteddington\nikki\nhartmut\nsymbolises\nsudoku\nlikable\njtc\ndocherty\ncarpark\nintroverted\ncsaba\nsansom\nadlai\nhouma\ncustomised\nrivet\nstreamers\nbks\ncorrales\nyasushi\ncomeau\nwitney\nshabana\nsilber\npeet\nconfidante\nchastised\nsalis\nluci
le\nnagisa\nfurtherance\njoana\ngodin\nnarrate\nunrwa\nschlager\npreventable\ncallous\nnovas\nregistrars\ntempers\nbaranja\nmakino\ninsatiable\nblériot\ndemolitions\nbagdad\njoao\nthemis\nnaturalis\nsrivastava\nlazaro\nmaul\ngötz\nmaffei\ncip\nkare\nredaction\nsaalfeld\ncandidacies\nrailhawks\nkatayama\ndisables\nabsalom\nmantegna\npurser\nbarris\ndongguan\nclift\ndunkin\nconundrum\nupham\ndomicile\nlovingly\nnigga\nginza\npotted\nbes\nharnessed\nnakai\ntalat\nheresies\nadoptions\ninoculation\nkats\ncartan\nrpc\ndhe\naventura\nriana\nsgs\nparamore\nbuffering\nmetropole\nwn\ngurung\nubiquity\ngels\nwrinkles\nhucknall\ngla\nsobota\ngoodhue\nbulla\nfakir\nxlr\nsnead\nlanning\nnicolay\neschewed\njellicoe\nbutyl\ncuza\nmorell\nstrangler\ngheorghiu\notu\nmantova\nrecumbent\npommel\nbodil\nspivey\nmónaco\nväxjö\ninterviewees\ncarril\ninversions\nberrien\njoann\npavements\nstorefronts\nlanius\ncompañía\ncsis\nraeburn\npoindexter\nshani\ncaton\ntumulus\ncreditable\npimpernel\noram\ndeified\nannibale\nburies\nrosemarie\nnahe\nrecoup\nelisabetta\nqamar\nbriain\nhollandia\nmobiles\nteflon\nencapsulation\nkirklees\nalwyn\nmyo\nexistant\nfilles\nvajpayee\nsinuous\njanakpur\ntriceratops\nbellerive\nchurning\njuniperus\nshaukat\nrosewall\nairtight\nabydos\nsensitivities\naks\nmaterialised\norazio\ngresley\ngreiner\nkingsland\nbastien\nnassar\nsocialized\nlampeter\neggers\nmedicina\ncoquimbo\ndefenseless\nspacer\nvy\nkaushik\nshinn\ncarbines\nfeedstock\nliberalisation\nshockingly\ndik\nwarplanes\nrepealing\nkrasny\nvenegas\nscylla\naugmenting\nsass\nconcertgebouw\nahmadabad\nrives\nconvalescent\nharbhajan\njarry\nhase\npeinture\nsarma\nraimundo\nwahda\ncanova\nasceticism\nsargodha\npolygraph\nconvulsions\nbarletta\nrajat\nlorem\nmlp\nuzi\nwortley\ndevises\ngascon\ntheropods\nkorakuen\npederson\nviasat\nsurreptitiously\nfipresci\nharel\nilić\njaa\nkerner\nmalisse\nsaakashvili\nprogenitors\nrunciman\nbibby\nflavoring\nexpansionist\npasa\neidos\nvide\nhairston\ningush\nopts\ndepuis\nwa
geningen\nmullan\notl\ndeputation\namici\nstrathearn\nbanderas\nwiiware\nclaud\nlumpkin\nhelle\naswan\nsitters\nparler\nharmonize\nkeelung\nmoorhouse\ntudors\ndoable\nemre\nlangkawi\nindustrialised\nshoji\nspoofed\ndictating\nleiter\ntadpole\ncórdova\nbackpacking\ncalvi\nkhosrow\neleonore\ncaricom\npetey\nquicksand\ncircassians\nbrega\ntactically\naragorn\nabhay\npremeditated\nrussellville\neldred\nussf\ndemarcated\nvaléry\nsmokin\nstrictest\nkuerten\nkwak\ngrönefeld\nsurnamed\nsauter\ndelanoy\nhayne\nadige\nveolia\nannuities\nsubnational\nhypoglycemia\nrocher\nhydrophilic\njunhui\nsignalman\nsgp\nlinea\nbalto\ndietmar\npottinger\ncissé\nquibbles\nbrattleboro\nppt\nbrotherly\nanxieties\ngalahad\nsalafi\ncastres\nodnb\ncaltrans\ntemecula\nantonina\ndemerara\noccultist\ninhospitable\nclasps\nneuroimaging\nschwartzman\ntauber\nemb\nrez\nmanton\ncompiègne\nsubscribing\nimmutable\nsuperiore\nsanturce\nfz\nfiori\nremarriage\ningesting\ngroff\nglabra\nhutcherson\nloong\nwallin\nlabored\nkarakoram\nreinterpretation\nlessee\nmimicked\ntolbert\ncherub\nmisinformed\nbroadhurst\nplunket\nmewar\ndisarming\nderision\nadriaen\nsittard\nallin\ntrond\nchecksum\nsublimation\ncyrene\ninterdict\nrinks\nkogan\nundifferentiated\nmccoll\nmcclatchy\npunted\nchastain\nhumana\ngrafted\nlissa\nllanos\nmcallen\npaneled\nbyproducts\nbardo\nlollapalooza\neuphonium\ncowdrey\nperversion\napollinaire\nhiker\nramanathan\nmasako\nreplicates\nhisashi\nlitvinenko\nkabila\nciphertext\ndebunk\npayers\nballesteros\nsilks\nprimed\nmagnetite\nsavar\njumpin\ncutlass\nstratos\nretirements\nlifelike\npbc\nhalesowen\ntimeform\ntigray\nneuer\nsarandon\nyehudi\nmutinied\nmaac\ngnp\nbahman\nesterházy\ncloutier\nalbumin\nbulwer\nhominid\nabdus\ntcg\nsalutes\npeuple\nnaoto\nunseemly\ndynamism\nrakuten\ncrispy\nalarcón\ntatjana\norem\nhegarty\nstroudsburg\nbreivik\ncaracal\nfoi\nsirte\nalthea\ncamorra\ninextricably\nsnares\npaavo\nansett\nfusco\nrepress\nregionalist\ndirge\nhierro\nmcduffie\ncrutches\nleonie\ncoots\n
mwe\nhistamine\necr\nbloodbath\ngdansk\nsasa\nishihara\ndodig\nclostridium\njair\ncarcinogens\nmädchen\nvajrayana\ndispassionate\ntulips\nenders\nihor\nburchard\nneuburg\nlindstedt\ncoriander\nmodigliani\noccuring\nokanogan\nsilversmith\npalaeontology\nphotonic\ngodunov\nhaileybury\nuninjured\ntropez\neka\nemmeline\nreitz\nlewandowski\navilés\ninfidels\nvoicemail\nsassuolo\ngeneralist\ndanubian\nsimo\nwaleed\nguesthouse\nmatkowski\nendymion\ntelltale\ncolourless\ntullio\narendal\ntamed\nnémeth\niras\nkuni\nshand\nhydrochloride\nalleviation\nghazali\nmerz\nyonsei\nogata\nstol\ndetergents\nmav\negregiously\nchacarita\ncamphor\nghul\nsiddha\ngeni\nrawdon\nflattening\nsailplane\nhornchurch\nmadrasah\ndionisio\ncanisters\nvexatious\ndla\nscapula\njono\nsereno\nmisawa\nmicrocosm\ngabashvili\nlully\nfalsifying\nrous\nsupplant\ncarthusian\nplaines\ncannock\nsepultura\nalençon\nhoffa\nratzinger\nsedation\nnemours\nirregularity\nbarbora\norangeville\nbiogas\nwhey\ndreux\noverstreet\nfinders\nnears\nniv\nhelmsman\nraghavendra\nfluidity\nkoon\njez\nclogged\ncoolness\nregnant\nroti\nhighwayman\ngoldfield\nmanifesting\nsocotra\nleeson\ndaum\nfreiheit\ntruthfully\nclosers\nchishti\nkhamis\neraser\nnullification\nstews\nstillwell\nheli\nsaucers\nullrich\ncascada\nmangan\ncuernavaca\nlioness\nstabilizers\nalpi\nparkman\naustell\nlithographer\nséance\nunrelenting\nzanuck\nlevu\ntantrum\nsandhu\nrufino\ndepopulation\nshum\nlaurels\ndisintegrating\nrifling\nromp\ndulko\nmidwinter\nglc\nfordyce\nmarceau\nhelipad\nglassy\nbraunfels\nhaliburton\nbiogeographic\nruislip\nglauca\nmitsuru\ntaliaferro\nril\nangara\nmonts\nnukem\negy\ntornados\nphool\nyehoshua\nconceit\nmatera\npreselection\nnapster\nberezovsky\nbaltika\ndawood\nthyssen\namistad\ntuskers\nlleyton\nimmemorial\nentitle\ngermanicus\nbayezid\nenchanting\ncurren\ntemplating\napennines\nrekindle\ncooder\nkitano\nbabylonians\nindecency\ndiorama\nept\nchloroplasts\ncbgb\nmattresses\nfriedl\nastern\nimmunological\npodcasting\nshamil\ndo
mine\nklose\nthermo\ntrialled\ncarneiro\nescorial\nwooing\nffestiniog\ngiambattista\nneoclassicism\nfco\npatan\npeltier\nlehtinen\nonegin\nrusticated\nbohun\nqwest\nembellishments\nrendez\nboarder\nbouncy\ncotto\nbhangra\ngallifrey\nradially\ncripps\nfisichella\npaprika\nrossellini\nhawkman\nrutherglen\nrecklessly\nmaisons\nkarlstad\nlelouch\nalcibiades\ncradock\nhelpfully\nsequestered\nmonogamy\nsimla\nrhiannon\nstrays\nsliema\nchaining\nlavin\nmidden\npasok\nspelman\nsado\nrosamund\nlingered\nfama\nmadhava\nwessels\ngusto\nunclaimed\nquelques\nkaka\nhuis\nepcot\nadar\nreproducible\nwestphal\nphallus\norman\ncentaurs\nforthright\ntsi\nkneel\nbarua\ncoves\nidc\ntrailblazer\nfpr\nmckellen\ncolston\npanjang\nworkgroup\nrodez\nhalliburton\nbedridden\nwhiteness\nnirmal\ndraftees\nlibido\npolymeric\ndubliners\nbemis\nomens\nspinks\nencores\nthrees\nimpossibly\ndiablos\ncaux\nmonotype\ndurie\ndarden\nnem\nplacate\nrocko\nflach\nullmann\ndrax\nrationalization\nfett\nhirose\nwallachian\ncfu\nmarginalised\nsachiko\npainstakingly\nrooke\ncongleton\nschon\nbronfman\nilha\nbailly\nparachutist\nrounders\ngpr\nmoca\nhallowed\nlithosphere\ncroker\nabalone\nsauveur\nganassi\nombre\npolyvinyl\nciarán\nbacall\ncaracalla\npritam\nmemoirist\npilatus\narchipelagoes\nlacuna\njawahar\nlpc\nhofstadter\ntoscana\nconjunctions\ntanasugarn\nmeltwater\nwreaths\ngeng\natg\ndmu\ncarolus\nswitchboard\nouthouse\nappenzell\nsubsist\nligature\nappointer\noverloading\nwether\nlogician\nnera\nbourassa\nstatuses\nnhu\nhaemorrhage\nstimulants\nsejong\nbassi\ndostoevsky\nrusses\nturnhout\npromiscuity\nboult\nstrangest\nobstetrician\nzentrum\nhurrah\niberville\negger\nmusee\ndvi\nescapist\npewter\nnissim\ncivet\nfranko\ncrampton\nrais\njammer\nphallic\nbiomarkers\nabuser\nephemeris\nsouths\nunprovoked\ndisproven\nmattia\npavillon\nplausibility\nhonorifics\nkatsura\ndaihatsu\nsubmitter\ndelbert\nhato\nmcinnes\nfuscus\nspadina\nicosahedron\ncadell\nicm\nnanna\npurbeck\nmisnamed\nperiodontal\nghg\nnumancia\ng
ove\nmarmion\napostrophes\naldehydes\nputrajaya\nretracting\ndockers\nloosened\narango\npurim\nasakura\nsectoral\nchieti\nbefriending\nbago\nimpolite\nminsky\nsigne\nunreviewed\nyaqui\nvocations\nbeanstalk\npyke\ndianetics\nlanzarote\nmolto\nhumilis\naleutians\nnewsome\nnotaries\ncalibers\nsubroutine\nshockley\nhalperin\nhuda\nbung\noleh\nfuca\nalli\nsorrel\nleukaemia\nmonoamine\nunimproved\nrecoilless\nkickapoo\nofficership\ncounseled\nearnshaw\ninflationary\ncarbons\noutrun\npentecostals\nsimonson\nnevil\njepsen\nshopkeepers\nmunshi\ncanales\nsweeting\ntoya\ncapricious\ngarin\ngrg\ntalleyrand\nphotovoltaics\ntpe\nvvd\nweta\nbysshe\nidlewild\naltamont\nclannad\npennines\nbdp\ntaliesin\nkunz\nlodhi\nched\nlra\ndissuaded\nbartolo\nnema\nwipes\nblunders\nstaves\nheavyweights\nsabo\nreiser\nchartreuse\nsynthesised\nborderland\namityville\ncramp\nsores\ncalles\nrebuttals\nictv\netcher\nusnr\nlsa\nappian\nprotozoa\ncocked\nitasca\noffshoots\ndivulge\ncitron\ndoron\nmacmahon\nbato\nmoslem\nlittlejohn\nwildman\nrivoli\nfolklorists\nhippos\nbilge\ningrained\nsketchbook\nburdick\npensioner\nbandon\nescapement\nambala\neroticism\ngopalakrishnan\nberia\nwrinkle\nenticed\npalpatine\nkqed\ntynan\njumble\nnoboru\nnickelback\nsamo\neap\nlavoisier\npetrucci\natn\nvesna\nhiatt\nhoniara\nleeuw\ntejas\naalen\ninfallibility\nbursary\nhgtv\ncmb\npnl\nniamey\nlynott\ncaruana\njap\nmuseveni\nsantino\nperiodicity\nambika\nkushiro\nacetylene\ncasings\nroped\ncapa\njoffrey\nbullwinkle\necumenism\ngymkhana\nliturgies\nsofer\ncarden\npinder\nkudo\nmodulating\nunrepentant\nbanton\njeri\nmetronome\nguia\npalmar\narkin\ngrooms\nmangoes\ngulbis\nnyon\nrefutes\ncccp\nhyperplasia\npeeling\nhombres\nlofton\nbereg\ndetonates\nfirpo\ncheri\natsc\nresection\nbevel\nfelonies\nprion\nadaptability\npasolini\nssw\nsevero\nzines\nseawolves\nvoisin\nasx\nouen\ngeneris\ncoffman\nbcm\nlittlewood\nuncountable\nmarden\ninterventional\nliangshan\npolypropylene\nunderboss\nswarming\nandika\nsöderling\nclaypool\noat
meal\nsurvivability\npatricians\narmistead\nwallop\nbiak\nmoult\nkcvo\nmeditate\ntakamatsu\npinkie\nlawlessness\njawa\ngoogles\ndarkroom\ntestis\nweirs\nskyhawk\nfindlaw\nsprinkle\nsupercentenarian\nnwo\nforrestal\naffluence\nbmj\nsandwell\narnaz\ntvp\nwatermarked\nfealty\ntailgate\nnorval\nsardis\npestilence\nmoncada\nareal\nbirkin\nabdullahi\ndésiré\nimitates\nvelazquez\nnewlyweds\nmov\nwels\nbayswater\nvarèse\nreticulata\nenhancer\noratorios\nkil\nhrm\nheute\nloess\nrectification\norchester\njuste\noverprinted\npel\ncrocus\npaulinus\ncydia\npuisne\npappy\nenosis\ncliffe\nvaccinium\ncala\nsavers\nandroscoggin\nshowgirl\npna\nairbag\npisani\njanina\nlandform\ngoof\nrogier\nstille\nestá\ntheresia\nhew\nbopanna\nakhil\ncaguas\ntestbed\nelectrolytes\nkaga\ncamcorder\nkalev\nruthie\nandere\nhoey\nwaalwijk\ncongreve\nwart\nmsps\ntomic\nvenereal\nchoristers\nmagill\nrafa\nafterthought\nmerkey\nteda\ninterviewee\nveli\nlashley\ndocumenta\ntasteless\nparfitt\ncalifornians\nencyclopedically\nmonarchical\ntwelver\nplainview\nalchemists\nnett\ntruckee\ntinamou\nimmaturity\nnaturel\ninterrogative\ncalvinists\nkahane\niturbide\nmontesquieu\ntroon\nindebtedness\ncnut\nshivers\nparlors\ngleaner\npaulet\ngrowths\ncrave\nkutcher\nmachinegun\npowdery\nrafah\nboatman\ntroika\narmature\ndairies\nsali\nwhispered\nmercure\nsimiliar\nupholds\ncollectivity\ncoalfields\nproofreading\nastrolabe\nshoplifting\nsigel\ngulzar\nwoodburn\nholler\njaramillo\nelectrocuted\nnicoll\nkaneda\nharlingen\nramjet\nfpga\nmanmade\nsaranac\negm\nbib\noculus\nolle\ninvictus\neres\nmcdiarmid\nbackroom\nhippolytus\nfujifilm\nprankster\nyoshiko\nverte\ntenuis\ntfw\npatella\nsecretory\ncranford\nfuerte\ncataloged\nkhai\navangard\nquerying\ndeserter\nvaishali\ntaíno\ninedible\nejecta\nchangeable\nstandup\nyy\nagnus\ncranky\nperforation\nimago\nmeralco\nneurosis\nmudaliar\nzod\nbarbers\nmisquoted\noases\nsoren\ndra\ntilghman\ncucamonga\nstrangulation\nentebbe\ngilson\nmachu\nflamurtari\npropagates\nharborough\nblat
t\nescambia\nhabe\nmickelson\npujol\nmargarine\nwyvern\nschemas\nhusk\nspaceport\nmelodica\nlatte\nlindo\nkak\ntemasek\nreapers\nsmrt\nelegies\nclemenceau\ngramercy\nfet\nundesignated\negmond\nthermostat\nshauna\nziff\nkohlschreiber\nwillys\nmems\nafshar\nendoscopic\narapahoe\nvaudreuil\nolmos\nkio\nkupa\npatnaik\nashura\nbulkeley\ntelos\nassassinating\nsmokehouse\ncommendations\ncrone\nbustos\ndominicana\nhoss\nvitor\ncupertino\npollux\nsanctified\ndud\nherning\ndialectics\npallets\neffortless\nhenkel\nveen\nileana\nrimmed\ndamián\nresetting\nfittest\nspassky\nallee\nuntranslated\nchemung\nxpress\natheistic\nkunstverein\ncusps\npetkovic\nwonka\nnuñez\nprakashan\nstuder\nbomberman\nmorant\ngraaff\nulnar\npublican\nreintegration\nwaheed\nhashemite\nchc\nkyustendil\nehsan\nbidar\ntemperamental\ninterments\nstinky\nvaleriy\nnebuchadnezzar\najc\nkeitel\north\nmathieson\ngrigg\ntormé\npostsecondary\nwef\nbanten\nwebzine\nlibrettists\nheritable\nrohe\nbachata\nsinead\nalleviating\nlafleur\nhambleton\nsenile\npurba\ncabildo\nfoxnews\nhistogram\nyoyo\ncrusading\nmidweek\nceti\nverges\nagüero\ncircumcised\nklerk\nfauquier\nfein\nyakult\nprosthetics\nmineola\nsharada\ntakeaway\nhipster\nsisak\nabject\nmickie\nincipient\ndeepika\nfigueiredo\ncouto\nhira\nsmokescreen\nsimmering\ninsinuating\nconflating\nhemispherical\nsnapdragon\nhah\nedelstein\nkillen\noop\nkuang\nmapper\nkif\nvallis\nhindman\nbublé\nchambéry\nferretti\nteotihuacan\nrostislav\nmelfi\nsunspot\nskinhead\nmiser\nradiography\nhypertrophy\nvta\nburks\nhillfort\njcpenney\nantibes\nferrocarril\ntetanus\npopp\nrosol\ndoan\nbilecik\notranto\namro\njudicature\nstanwyck\nadress\nartaxerxes\nleninism\nagama\ngropius\noddball\nworsens\nevp\nnanning\nbardot\nahar\nintegrator\ndois\nwithering\nvps\nquadrants\nextraterrestrials\ntakuma\nfruity\nnant\nmoffitt\nrox\nspanky\ncaan\ngotra\nextinguishing\npolymorphic\nalbertson\nmucho\nseparations\nthallium\ndamselfly\nringtone\nforo\nhomilies\nbranislav\nblockbusters\nextravagance
\nrefurbishing\nfalkenberg\njadwiga\nasan\nhehe\nforgettable\nhardcastle\negremont\nburdwan\ntsb\nglimmer\ncarb\nbraham\nnotifies\ningle\naalst\njerkins\neod\ncgr\nadamantly\nheiner\ncracovia\nheitor\nrda\ntrès\nallred\neaa\nuniversitaria\nlawyering\nantigonish\njanney\nunpopulated\nsonatina\nbuna\ngaleazzo\npolygamous\nexcavator\ngoalkicker\nmula\ndecals\nsmeaton\nvityaz\nthornycroft\neustache\ndaniil\nspellman\nkir\nbirdsong\nconstantino\nusatoday\nwaratah\nexchangers\nimac\ndels\nheirloom\nbleached\nwallenstein\nscola\nrosenheim\ncomtesse\nfelons\nbefitting\nwokingham\nehrenberg\nobscures\nsohrab\nparka\ndahmer\ngonçalo\ncanandaigua\nrcp\nbuno\nduns\nlavra\nabysmal\nexclusionary\nlafferty\nsyndicalism\nphilatelist\nrytas\ninterferometry\nbramwell\nchoudhary\nfluoridation\nutd\nsobel\nelston\ngirton\nabductions\nreinaldo\ncrankcase\nkazuma\nmerchantmen\nsimcha\ninstigate\nhearths\ntgf\nchhatrapati\nsif\nelitism\nalif\ndisembark\naci\ntaguchi\nglial\nzahid\nyuliya\nyaoi\nmargolis\ncera\nmicrotubules\nreynard\nturnip\nshiver\naback\nhorsens\nkillah\navc\nlier\nsns\nbranning\nenya\nthoth\ngoldsworthy\nsverige\nleyva\nbruch\ngodalming\ngeolocation\nccha\nurbain\ndado\ngeophysicist\nbarret\nscouring\nlawford\nreaffirm\nassen\nandino\nmausoleums\nfrits\nendeavored\nlavas\nimmunities\neba\njaundice\nnorrbotten\nplimpton\nspangler\ncoalesce\nkahlo\nsuffocation\nmethylated\nsamarra\nshimoga\noverlays\nvikramaditya\nscallop\nbrabazon\npolyglot\nbelvidere\nonda\nhistórico\nmaumee\ncarlota\nquart\nundistinguished\nsnellen\nsatirized\nkokoro\nmidsomer\nrapoport\neubanks\ntredegar\nlutea\nrauschenberg\nrheumatic\nfabián\nmendy\nryuichi\ncipolla\ngns\ngrated\noulton\nautosport\nhagley\nyamanaka\nfollicular\ndavina\nhanne\ninjector\ntoyline\nhoorn\nreino\nsok\nxeon\npillage\ngodiva\norientalism\nreprehensible\ndomo\nexistentialist\nroane\npana\njessy\nknotted\nsram\nsmits\nclotilde\nunseeded\ninductor\nborderers\nhermaphrodite\nsternum\nrusse\nambrogio\ncritera\nsajid\nbauder\nco
nestoga\nnaqvi\nkieron\ntfs\ncolonizing\nmccarthyism\nnacogdoches\nrockne\nsukhumi\nexportation\nurination\nallendale\nhardwoods\neateries\nyas\nflavin\nhalfpenny\nvalentín\nidem\noptimised\nkuki\nbangladeshis\nmalformation\nngati\nmell\ndauphine\nroadie\npegu\nmicroelectronics\nyoshiaki\nrheumatism\nquimper\ncasson\nkucinich\nintelligibility\nsuspends\nalberts\nsunnydale\nphrygia\nellice\nsete\nhaplotype\nnoirs\nkaas\nsapir\njeffersonville\nconfining\ngoldin\nrohde\nbadal\nchicoutimi\nscriabin\nvaccinations\nposte\nagathe\nharbaugh\njagdish\nduro\nhartnell\nniosh\ncoax\ngira\nmortis\ngrado\ncompels\npritchett\ncleaves\nschaub\nbettis\nsriram\nsuggs\nico\ndissonant\nbuckler\ncoachman\nsaïd\nmiron\nlohengrin\npimps\nmacauley\nantioxidants\nguoan\nesse\nhuddle\ndiscursive\nrivne\nborehole\nmohini\npremonition\ninsulators\ntpc\ntravesty\nkryptonian\nconfounding\nbetel\ncivilised\npetula\naei\ncorot\ncanciones\nneuhaus\nleche\nsylva\nlorre\ntakayuki\nlanguished\nvisby\ndenounces\nbulgakov\nyamhill\npavlova\ntallis\nphs\ngari\nchenoweth\ncomplimenting\nchapultepec\ndiethyl\nblowback\npepa\ncookson\nchileans\nquelle\nmenotti\nelmendorf\ngestational\nbtcc\npissing\nesme\nzips\ninvert\ngreenback\nzagros\nvodou\nmoros\nludendorff\ndjinn\nmcb\ntheologically\nmontalvo\nunassuming\nagp\ntwitty\notome\nzai\nagusta\nfrat\nyankton\norchestrating\nbulmer\ncastañeda\nilluminates\nmacchi\nscientologist\nrefill\nclassifiers\ntattooing\nartisanal\nmisfit\nscrapes\ncrombie\njokerit\nsubtleties\nsymbolising\nrechristened\nbalaban\nlukashenko\nsuntory\nzoya\nweighty\nlavelle\nphoenicia\naffront\nwilander\njavad\ninflating\nhyperactive\nenric\nlumpy\notsuka\ncallus\nrobey\ndhoni\ncircumvented\npiaa\nboles\nfujairah\nchinchilla\nkoshi\nagricole\nexcrement\namrit\nsergipe\nharington\nreinterpreted\nuesugi\nbroods\nhislop\nfingerboard\npinacoteca\nmccray\nsmokies\nundeterred\nboaters\nwilkin\nmadhuri\nleisurely\nnicolaas\npolotsk\npáez\nstauffer\nreassures\nnyack\nstratofortress\ngagged\ninti
\ntarik\nsulzer\nbatons\naerials\nzedd\ndécor\ntura\nmontenegrins\nmaronites\nargosy\nmcginty\nschulman\nbukidnon\nkhadr\naxils\nkerstin\ngsk\nbuddleja\nportraitist\nscorned\nwraparound\nyala\nquevedo\nmongering\nmannar\nchungcheong\nseacoast\nemlyn\nsmriti\nelson\nconcerti\nbiplanes\nsiete\ndunster\nheparin\ndonat\nperturbed\nzahedan\nfido\nunrecognised\nfid\novercoat\nsurfactant\nhefei\nroza\norix\nfoursome\nspearman\neni\nparnassus\ndeployable\nangina\nbrownfield\ninitialization\nsangster\nointment\nberenguer\ndresdner\ngalactose\nattendee\npublicise\ntorts\nluneng\nzardari\nbattenberg\nsnag\nhfc\nsandburg\niw\neze\nnouveaux\nschist\noutperformed\ncrudely\nferoz\ntati\nsdram\nistomin\nvérité\nrestrepo\ncath\nmonge\nshamokin\nboylston\nelope\nevolutionarily\nbakunin\nllagostera\nbgm\nspellbound\nbrentano\nregretfully\nrivets\nzwingli\nequatoguinean\naphid\nlaetitia\nvirginians\nvulcans\nfsm\nalexius\nliterati\nmiyake\nanthologized\nbiceps\nbandaranaike\ndenzil\nprestwich\nosgoode\ntutt\nvira\nzebedee\npeculiarity\nzeman\nkinski\nrhodri\ngaucho\nfijians\nmarmot\nnima\nmarinette\nflorist\nsantarém\nmaxims\nkonica\nunimpressive\nstuccoed\npitiful\nulema\numbra\neötvös\nuup\nschaller\nprecondition\nrosea\nrossendale\ntearful\nalhaji\nmajoris\nryman\nkeke\nnob\nbelk\ngoodrem\ntrg\nverhoeven\noccultism\nraimondo\nberkhamsted\nfevers\nclaridge\ncather\nanic\nnitty\ncerebrospinal\nperches\nestrangement\nolbermann\nahram\nminho\ninsensitivity\ndoğan\nmicrorna\nrouvas\nnavidad\nheartbreaker\nvall\nbellum\noutland\nweyden\npati\nlaa\nbyways\nhickson\nvios\nndr\ngavia\nkla\nefi\nammonite\npowerpuff\narachnids\npadmini\nmediocrity\npetraeus\nvizianagaram\nstifling\nallegra\nfacile\nshoaib\nspotty\nkushan\namédée\nfirma\ngervase\nnegating\nthornbury\natkin\nsikandar\nstn\nulloa\noverlain\ndelineation\nbottomley\npada\nlauzon\ngossard\nricki\nkinsale\nplantar\nflattery\nmechanicsburg\nmedlineplus\nvaldes\nipso\nwimax\nyasir\nakagi\nquartermaine\nfeodor\nrekindled\ncanadair\nsilv
ester\nhatchlings\nbast\nkhali\ntubs\ntico\nminimus\naljazeera\nbenched\ndormancy\nkarlskrona\necco\nborno\nhite\nuttering\nharbored\ntimurid\ncno\nlashing\ncheatham\nchamonix\nnds\nmymensingh\ntiebreak\nbsb\nuist\nstonington\ncurls\nnewhaven\nsdl\nvariances\nmessines\nlamprey\nvauban\nconjunto\npizzeria\ngadd\nsubantarctic\nawadh\nangustifolia\nperkin\nfasteners\nskupski\ngunship\ndistrusted\nscalloped\nsmiled\npayday\nsupercoppa\nusefull\nnukes\nfrankland\nmimosa\ndernier\ncantopop\ndominoes\nbookmaker\nplying\narchimedean\nhyped\naerobics\nchert\nkisan\nlisieux\nheelers\nkubica\nethnomusicology\nspecks\ngossett\npurifying\nvelho\ncaesarean\ndeltas\nobasanjo\nbentheim\nswr\nallt\npuyo\nalisha\nmiscegenation\nhibernate\ndrusilla\nfigueres\nboudreau\nmagica\nunforgiven\nshreya\nverdon\nbabos\nmusselburgh\noutgrew\nbenoist\nlegalizing\ntheresienstadt\nebdon\nprotrude\nherbivore\ntucci\naccede\nfortieth\nnatan\npacs\nhooky\nwilbraham\nsprinkling\noya\nhandsets\ngell\nmardan\nmacdill\nantonelli\ngamasutra\nkilian\nwaveforms\nheartily\nmancuso\nplumas\nstandardizing\nfirings\nmeson\ncheb\nkufa\ninterlingua\ndilla\nepinephrine\nnastro\ntheroux\nkeira\nincessantly\nrahi\nmensah\nmethylene\nbrodeur\nhibernia\ndoren\nbeckmann\nstares\nvidarbha\nwoot\nriis\npalmieri\ncumbernauld\nbabysitting\noyama\nphosphor\nguava\nwithstanding\nxenu\ndefraud\nequipments\nlepidus\naffaire\nmuséum\nhydraulically\ngrb\nabadan\nwhitburn\njustifiably\ncres\ntrinita\nrkc\nculpepper\npansy\nbiographic\nargonaut\nbygone\nmacduff\njinn\nhideaki\ngentler\nsatchel\nconsidine\nmeaux\nundersides\nperri\nmamas\nmehran\nrct\nhualien\nscarves\nsarita\nprofusion\nlovelock\nchace\ndownsizing\nextrapolated\nager\nlich\nhazelton\nfag\nkilroy\ncupa\nbarenaked\ndrivetrain\ngutman\nmaggots\nbct\ncauliflower\nnbn\nflirtatious\nhaft\nafton\nresp\nmartyrology\nsleight\nwelshman\nequilibria\nborodin\nbarricaded\ncastrum\ncongregationalists\nomb\neloquently\nkellen\nackroyd\nbeith\nissuers\nsupercomputing\nhanes\nwmv
\nvvv\nmedellin\ngtv\nuninhabitable\nilias\nclarinetists\nheliocentric\nredvers\nmairie\nmosca\nrefitting\ncalderwood\nhauer\nrieti\nhangings\ndinara\nalemán\nuloom\nprodi\ncenterfold\nplowing\nnbs\nunwavering\nbeechwood\ntyldesley\ntheoretician\naravind\nduxbury\nnaturae\neloped\nrigaud\nguttenberg\nachievers\ndalmatians\npersonalised\nordain\ndayak\nandrija\nstg\nclonmel\nvied\nheretofore\nthrusts\ndivest\nshepperton\nakwa\ndumpster\nnaberezhnye\nquiñones\nmanatees\nsomeones\nslimy\ndelineate\nmossy\ntongued\nruggero\nparoles\npalenque\nhandbags\nmanisa\nshader\ntriestina\nrjd\nmildenhall\nunleashing\ndefections\nmaslow\nanchorman\npluralistic\nslipstream\nreddick\nsigint\nsuga\nvieja\nharpo\nkrogh\nsyncopated\ncockpits\ncondos\nslovo\nmanasseh\nribes\npaneling\nelma\nbiella\ndanner\nhelpline\ngarlands\nwulff\ngarbled\ncotonou\nrifts\nspr\némigrés\nrara\naguila\nbaggio\nrog\nmii\ntepe\nencino\ncarpio\nlifeguards\nharrigan\nchloroplast\nfoolishness\nbonelli\nerving\nnobis\nmandan\nanisotropy\nnitpick\nvaliente\nhardinge\nannam\nswale\nscops\ncerda\napoptotic\nsastry\nwebkit\nleur\nloaves\nhanan\nfws\nbiagio\nuncompressed\nshowrooms\nemerita\nbna\ndiadem\nextruded\nkristiansen\nsayid\nbatik\nfremantlemedia\nded\nbeinn\nzuckerberg\nkitakyushu\neoghan\nangiotensin\ndort\nchristology\nmultichannel\nmottos\ndebarge\nfirecracker\ntoscano\nsociable\nqualitatively\nsupermajority\nussher\npassant\nkubot\npopstars\ndevouring\nfinlandia\nfwd\nhdz\ncarbone\ncurveball\ndurian\nphantasy\nviernes\nadmira\nhaughty\nnewswire\nasí\nuntamed\nmbp\nundying\ndemetrios\nchukotka\nkofu\nswedenborg\ntransits\nyanni\ngiorno\nkurnool\ntopanga\nnetaji\nknightsbridge\npaltz\nheatwave\nschuller\nyawn\nkreuzberg\ningvar\nsintra\nrivaled\ngalesburg\nteuta\nloosening\nmillais\ncorporeal\nlundin\nantiseptic\nbattering\nsimcity\nstallman\naltus\nshipton\nstraying\nbourdon\nalpena\ncpn\ncostanza\nipsc\ngoodell\ninuktitut\nballparks\nberton\nmanne\nministered\nsemicircle\nmcneese\ntereza\nmaisonneuve\
nscharf\ndumbo\ndarbhanga\nmosquera\nlexie\nhashed\nlerma\nwsb\npassos\nsakata\nlibris\npushcart\nkenzo\nsella\nbirkenau\nscanty\nhalep\ncodeine\nmohsin\nlessening\ndelinking\niconographic\ncomunicaciones\nmacworld\ndeciphering\njewellers\nzayn\nquantitatively\ncased\nfau\njamshedpur\natlus\nteleported\nsurfboard\njosepha\natromitos\nsolvers\nbaig\nearthworm\nprelature\nhatun\ngerasimov\nalai\ndelinquents\nassizes\ncentraal\nhoning\nemboldened\nmisappropriation\ntulloch\nclarks\nmazatlán\nheadroom\npublics\ndebi\nmuti\ndémocratique\nlunatics\naraya\nwombat\nabcd\ntengah\nsvd\nrainstorm\nbrachiopods\nbaddeley\nbautzen\nstalwarts\nextort\nfriesen\nlaney\nashur\ncamerata\nblighted\nkii\nmanzanillo\nhippocampal\nmolino\nvisitations\nnamgyal\nterritoire\ndaiichi\nviti\ndarussalam\ngobierno\nskyhawks\ncornering\nsaccharomyces\nhalsted\nrefocus\nmacerata\nmarlies\nbascule\nprofessing\nboos\njablonec\nschnell\nwetting\nmousavi\nscp\ngakkai\nflacco\nlorin\namores\ncanseco\nspinster\nstang\nconceals\nbunches\nsauropods\nhuth\ndarwish\njoinery\nrattling\ncarlile\nziva\nwhitestone\nnfa\nevi\nnrj\nstanislavski\nwinslet\ndelgada\ndunhuang\ncorinna\nolmedo\njoannes\nsakya\ninkjet\nfloored\notley\nvarner\ndehydrated\nmouldings\nfedorov\njutta\nmcavoy\nnewquay\nuplink\nmaharana\nsaman\npassivity\nworkouts\nrasta\nhorwitz\nhumorists\nfaizabad\nharries\nprestwick\nolmert\nkaun\nstarman\nfunctionalism\nwoomera\nbms\neruptive\nspousal\nfunen\nprakasam\nscolds\nwaikiki\nenriquez\nxxxi\nfbc\nbrzeg\nfactoid\ntwh\nmotorcade\nthirsk\nsuro\ncaixa\nsatyam\nturners\noutlawing\ncorpo\ntommie\nlonergan\nquisling\npatrese\nkumaon\ndemented\nbenediction\nseedy\nhowson\ntacky\ngryphon\ntanabe\ngola\nparagliding\nvolendam\ngraciela\nearthbound\nkeely\nmonferrato\nopenbsd\nsng\nathabaskan\ncammell\nallston\nrueda\ncarbonic\njacquet\nwsm\nsanna\nbiddulph\nsuh\nperret\nbatsford\nhulbert\nraves\nleann\nfriederike\nndc\nmoyle\nindecision\ntouchy\ngaudí\nangélique\nroker\nunscheduled\nnocs\ngann\nlinings\n
prescot\nbores\nzakharov\nprotectionism\naktobe\nsplintered\nprolonging\njuden\nlufkin\nreassurance\nhashanah\nreni\nbleacher\nevaporates\nsimonds\ndevries\nhort\nsholom\nwarners\npollak\nkasi\ncajamarca\niftikhar\nphotoshoot\ncruyff\nsagarmatha\njeopardize\nmammadov\nmenem\nsisu\nsyncretic\nreposting\nalwar\nrevoking\nhebe\nsplinters\nthiessen\ncovina\njarkko\nseldon\nstrutt\nexclusions\nambiance\nantipsychotic\nunderlines\ncassin\nazim\nlantana\nsib\npoehler\nroop\nfln\ndoohan\nwmo\ngouverneur\naitchison\nsime\nconceiving\nwildebeest\nchelny\nacolytes\ndomer\nshipowner\nfrictional\nsolms\nsenza\nmondiale\nrfi\nmaven\nscour\nmaruyama\nasf\nstocker\nwisteria\ngeorgescu\nimpatience\ngroundhog\nvissel\nsiemiatycze\ntighe\nwaterpark\nruddock\nannexe\nbodin\neyeglasses\nbayfield\nneh\nmarginalia\nindustria\nminigame\nbenjamín\ngoncourt\npirot\nrefurbish\ngemeinde\nhirata\nquesnel\ncaricaturist\nfaisaly\norcas\nstrahan\nmele\nmañana\nvijayanagar\natria\nqueanbeyan\nlieb\naras\nreconfirmation\ncarburettor\ncardio\norellana\nvanbrugh\nfractals\nreminisce\nmostyn\nucsf\nparaphrases\nprine\nlozère\npisano\nprowl\nmonastir\nimplicating\nroraima\nfirmer\nacqua\njaina\nbernt\nrpf\njoop\nabercorn\npubescent\nmodernise\ngraziano\nwalkie\nbosman\nwithered\nturnstiles\nbasti\nthrashed\npozo\ntrac\ndunwich\nundemocratic\nshota\nwarminster\nwedded\nhilario\ncastrated\nsoundscape\nmegapixels\niap\nerinsborough\nniel\nammons\nhusseini\nshackles\ndashwood\nlafitte\nlautrec\nrepudiation\npekin\nkamrup\ngeraghty\nmohinder\nmummified\nkalki\nmaclachlan\nprerogatives\ntautology\nstefania\nlangevin\nkurd\njørn\nowsley\nramaswamy\ntruthfulness\nrovigo\nphotojournalists\nsarkis\nharshness\napostolos\nmontag\nbls\nbulldozers\nrepugnant\nnailing\nweng\nlipped\nsomos\ncontras\naubry\ndsv\ndefensed\nundersized\nmugs\ncloche\nhenze\nevictions\ndabbled\ncvg\nleys\nvma\noku\nirrigate\nkanna\nferrante\nchitose\nfrunze\nmurrumbidgee\ntinea\ntilapia\nkodama\nnpd\nhakone\njace\npolideportivo\npatrimoine\
nradovan\nsymington\nyoshikawa\nuniversitas\nbbva\navoidable\nnahal\nwailers\nsterilized\nworrisome\namoeba\ncrj\nsabor\nnorristown\nkindersley\ntearfully\nblumberg\ncharlesworth\nhomicidal\nmyelin\nvvs\nmatsuoka\nfoci\nsalesmen\ntenzin\nnadja\nechidna\nmorphing\ngratis\noctahedron\n‹\nspotless\nburhan\nsov\nottavio\nlinklater\nsrl\nstirrup\ncontingencies\nrance\nclásico\nplaybook\nloathe\nisf\nbaynes\nconstricted\nextinguisher\ngolson\norff\nrauma\nmaterialise\nppa\nkiedis\ngremlins\nchanter\nodawara\novals\nmiyoshi\ndevan\nphaidon\nmcintire\n⅔\ntopologies\nsandow\ngrappa\nglace\nmsk\nhilson\nsuperconductors\nbohn\nmakepeace\nstuckey\nformalize\nelude\nmarginalization\nkeratin\nshambles\nberates\nroxana\nmariscal\nbolland\napollonia\nmanalo\ncarpal\nwetpaint\nsmb\nsardines\ncollinson\nbie\nheidfeld\ncollings\nsaudis\nkomm\nsuan\nreproach\ngulbenkian\nrotator\ngauchos\nimpotent\ntrl\ntribulation\nlondres\nunaided\nblazed\ntetsuo\nlonged\nobata\nmadoka\njamuna\neland\ndestabilize\nhawick\nmahé\nmalina\nspeakeasy\nsynge\nbeekman\nmcmurtry\nnewhart\nweasels\nmigrates\nkwara\ntransversely\nshadwell\nshaul\npieve\norta\nfuturity\nmanlius\nofficinalis\ninternationales\ninfidel\ninchon\nirrevocably\ngramm\ngeneralizes\ntransferase\nmichaelmas\nexupéry\nchetan\nglycolysis\nhatay\ntexel\nelectrocution\npelts\nimmobilized\nhokuto\nbetjeman\napprenticeships\nwheatear\ncosme\ncoalesced\nsacrum\namoral\nkeweenaw\nwhims\nhypersonic\nindivisible\nsmg\ncorny\ntseung\nscindia\nnaar\nmcinerney\nfloriana\nfontenay\nlgm\neskilstuna\nchildrens\nhinkle\npaulino\nbranden\npenfield\ndecontamination\nglories\nwimborne\ndishwasher\nmuscatine\nforeskin\nsupergiant\nwario\nibu\ndovecote\nwatauga\nkateryna\nvaluations\nétait\nagusan\nfixated\nqueensway\nexhilarating\nbootcamp\nbrocade\nappa\nreadmitted\nrenfe\ndevereaux\nmvd\npns\nfainting\nalauddin\nplowman\ncaliban\nsummerville\nrecapitulation\nsattar\ncatullus\nmikheil\ncompactness\njaxx\nghouls\nmucosal\napathetic\nenquire\nmacromedia\nproo
fing\nmle\nattar\nrino\ncruelly\nparaplegic\navm\nmarché\nexcelling\nchimps\nfrontières\nkeying\nzulfiqar\naji\nfiefdom\ntajikistani\nzayas\nshrugged\nzin\nvliet\ntonopah\nmarvell\naditi\nkeels\ncorticosteroids\ngranados\npharynx\nnucleation\ndemoralized\nsurah\nhaya\nrêve\noverpowering\nextractive\nhubertus\nophir\nsuwannee\nminella\nzemo\nworkbench\nliane\nhashemi\nizumo\nmarsa\nhaggis\namesbury\nplautus\nmetin\npredictability\ngeosciences\neyewear\nbartlet\nfé\nbourguiba\nhaswell\nspringvale\nsweaters\nvaulters\narty\nmacadam\nworships\ncarnivora\nschick\ncoincident\npostpartum\nbereavement\nsentosa\nsede\nbalaenoptera\njuillet\ndressler\npse\nvenera\nberchtesgaden\nwelton\nconcierto\ncheerfully\nmdot\nnods\npunters\nberthe\nmaracanã\nspt\nchubut\nstabler\ngrecian\ncommunicators\nmaint\nvala\narousing\nbrackley\nmotilal\ncelebes\ngoyang\nkalamata\nexhibitor\ntransfusions\npisgah\nbbwaa\nenqvist\nmaximilien\nsoundscapes\nbrunson\nkore\nlitigants\nsemis\nvell\nrecanted\nbelasco\nkashubian\ngalls\ncustis\ncapablanca\nbregenz\nmammoths\nschutz\nmcglynn\nworrall\ncavour\nlob\nmalkovich\npompano\naspartate\nsubstations\ngrudgingly\nsteaks\ntrikala\ndrover\nsufferer\npompadour\nyevgeni\nvojislav\nyarbrough\ncordless\nlycia\npospisil\ntechnica\nnewline\nrabbinate\nthingy\nrevivalist\ndelicacies\nsteadfastly\nshenton\nrousseff\ndoings\ndownbeat\nriche\nlauryn\nshaan\nkinematic\nairlife\nqq\nchequered\ncollectivization\nproliferate\nbedded\ntelegraphic\necclesiae\nsalles\nsquawk\nxxxiii\ndrivel\ndinajpur\nlakas\ngangrene\nintegra\npiercings\nirfu\nbama\nhotbed\nzither\namiri\ndimmer\nzooming\nostracized\nlightened\nextensor\nguaira\nregurgitation\nllm\nshahrukh\nbertil\nkunstmuseum\nirritate\naccordionist\noverviews\nforthwith\npissarro\nkms\nresize\npertwee\naprilia\nmogi\ncyl\nrosenblum\nsunray\nculpability\njuliane\nbankrupted\nperdido\ngchq\nfrg\nprawns\nmycology\nbijar\nmyrrh\ntorreón\nbormann\nnaja\nasphyxiation\nsupa\ntrunkline\nneuman\nlarks\nquackwatch\nkhoy\nkirc
her\ntaney\nravan\nrandell\nwordless\nwyandot\nnachman\nheroics\nflintstone\nroundly\nspiteful\nalexandrine\nposturing\nsanctification\nbasta\nspm\nlakeville\nisamu\nnaz\ntvxq\nponts\nrubies\njanos\nmurphey\nboardings\nhopefuls\nvq\ncaputo\ndecoys\ndyna\npujols\ngazzetta\nradoslav\nrew\ngraafschap\narbitrate\nsolvay\ninp\njungian\nistana\nmoulins\nhumes\nscrubbing\nmudge\njolley\nbocage\nentitlements\nbandage\nfloodwaters\nbroadwater\ncpo\nkoyama\nsamplers\nfoodservice\npenile\nsabretooth\nsunt\ntebow\nnazca\nredshirted\nhelsingør\ncondiment\nedifices\ncaloric\nheadman\narrowsmith\nakt\npupae\nmanufactory\nramapo\nhpc\nprofiler\nbanger\nalkan\nvespa\ncowbell\npavlo\ndisobeying\nnovartis\nbroadsides\nhongkong\nexaltation\nvitruvius\nstanmore\nomnipresent\nnayaka\nihs\neurosceptic\ndeftones\nsouris\nhohe\nmyint\nlesage\ngrandiflora\nrôle\nstedelijk\npleasurable\ncolonizers\nunabated\ncvu\nasuncion\nyoshiki\nphilistines\ngonsalves\nmalaysians\nbeluga\nmarly\ncorgi\nswag\nbiophysical\npordenone\nvegetated\ndiya\ntelepathically\ndowngrade\nsatiric\ncheikh\nbillingham\nostensible\nsociopolitical\nuhuru\nluque\ncreosote\npunishes\ndreary\ncubitt\nbioscience\nrectilinear\nlamentation\nozu\namours\nstad\nreverie\nhanssen\nchota\nbaldock\nconnemara\nagitator\nswa\nfiends\nbrokaw\nfreyer\nchore\naccumulations\nrackham\nspee\nhoar\ntotten\nfela\nunveil\ntfc\nbuccleuch\nregionale\nmazes\ntreme\nameliorate\nmurrell\nsamaras\nlighters\nleadoff\nimpotence\nfarwell\ncharmer\nnpg\nfamas\nnisan\nlanny\nattainder\nbickel\nthermopylae\ndinwiddie\nwouter\ncarrow\nduiker\nubi\nhammadi\ngelatinous\nmanresa\nevolutions\nexhaustively\ndesperado\nalejo\nsugita\nbuttermilk\ncarel\ninferring\nkatyn\ntrp\nupturned\ncronenberg\nagrawal\ndagon\npythons\nlawmaker\npanhard\nfillers\nconcertina\nleprechaun\npresumptively\ncontactless\nreassess\nlonga\nagaricus\nhilarion\nwampanoag\nmcclung\noppositional\njuve\nhermine\nborax\nimpressionistic\nlymington\ntrapezoid\nhousatonic\nhinkley\nelspeth\neisler
\nitalie\nstauffenberg\nlids\ndistorts\nbrunch\nbila\nunderscored\ncrosley\noko\nandesite\ndislodged\nroméo\nlajoie\nromanus\nreuven\ntentacle\nsergiy\npyridine\nadjudicated\ncellini\nattilio\nguardsman\nlande\nadamawa\nunwell\nproconsul\nrockman\nwain\nshalit\nmegiddo\nmef\nvitaliy\nojai\nzemlya\nsangamon\nhomenaje\neditorially\nchristoffer\nsoko\nwrappers\nplaceholders\nyrc\nindulgences\nswordsmanship\ncâmara\ntrop\nbelew\npathet\nbisects\nsecker\nmendelson\nbhawan\nboycotting\ndirectorates\nfula\ngoalball\ndetmold\nbaited\nsalieri\naerosols\nshroff\nmarita\nbalad\nzt\npoh\nselhurst\nkoufax\nrepublica\ngranitic\ntalus\npartita\ninfestations\ngodhead\npaik\nkirchhoff\ndisheartened\nparasitology\ncommunicable\ngopinath\ndigg\nmarxian\nlayperson\nballooning\nsubsidize\nstrat\nuniversals\nmicroarchitecture\nreminiscing\nfritillary\nkosh\nadvertisment\nfrederiksen\nenchantress\ngev\nryland\ndds\nmielec\npidgeon\nbrus\npentecostalism\ntillie\npederasty\nnisi\nusafe\ncatchphrases\ntoodyay\naldwych\ncataclysm\nstrolling\nvivir\nkohlberg\ncpd\nvanya\nzaria\ndioecious\nunstaffed\nksu\ncrestwood\nteases\ntenggara\nhandicaps\nmoz\nmasayoshi\nmanistee\nzakynthos\najmal\nfunhouse\nagronomist\ndiaghilev\nâge\njiri\ntransmutation\ntirtha\ngreendale\nhelpmann\nchuckle\npharoah\nmhs\nimperia\nharri\nbastos\nlcms\nnodding\nyearwood\ndrogba\npauley\neckstein\ndupré\nfrederico\nanimas\nkarimi\nwargaming\noptimizations\nshrill\nafresh\nlibertine\nlgpl\nlemoore\nheterozygous\nlalor\ncriticality\ndeism\nweatherboard\nmosquitos\nhaydock\ncasks\nprovosts\ntroubleshooting\nbellarmine\ntripe\nsanh\nzaheer\nditton\ntyp\nhersey\nnorthport\nbresson\nliezel\nblohm\nfabrications\nalberni\nspoonful\nrelents\nbruegel\ninsectivores\ncontaminant\nsteinbrenner\nmuna\naadmi\namri\nespouse\natb\nanatol\ncantilevered\nthunders\nmicron\nlbw\nsankar\nlatium\ncanopus\nfriis\nfingering\nmaddux\naznar\ntolled\ngnat\nspectacled\nventricles\ndeviance\ndoobie\ndelving\ncerrito\nstaffel\npanthéon\nconsents\nbix\n
precipitous\notus\nroig\nstakhovsky\nbenares\nusu\njabez\nbirthing\njesenice\nminab\nkray\nazs\nbeneš\ncornhill\nhammill\nsatirists\nmerchantman\npurposed\nlevis\nlanois\nshowground\nagadir\nflaubert\npolonaise\nbugsy\nfoghorn\nkrumm\nbanteay\nestación\nhov\nreflectivity\nsittings\namand\nalexi\nkarenina\npublica\ntigran\nstiletto\nmeatballs\nime\ncollet\nzev\nterai\nvaccinated\nperfectionist\nshimonoseki\nsacro\notho\ntrompe\nwhopping\nhibiki\nagnosticism\nstatler\nriverhead\nmoskowitz\nchinn\nrejuvenated\nswaying\ndhanbad\nbergin\nedwardes\ncolborne\ncrowbar\nroselle\nparisians\nexecutioners\nunivac\ndisliking\nlesbianism\naswell\nkeeled\nfinisterre\ninterlaken\nhumanitas\nakali\nards\nunwitting\nharmonix\nkarn\nheterodox\nniebuhr\nmyrick\nroadblocks\ntwas\nsavo\nlodz\ncritters\nfce\nintracoastal\niworld\nwiedemann\ninverting\nbognor\nsrikanth\nmoats\nsultry\nwdm\njunagadh\ndoa\nkoontz\napostolate\nchangwon\nsaarc\njyllands\nmodernists\nrahm\ninfinitum\nmando\nrailhead\nresurfacing\nkikuyu\ndictum\ngaff\nzarzuela\nanisotropic\nbalm\ninevitability\ncelled\nlouse\nbootsy\nanurag\ndaniilidou\nsansa\nmoyá\necce\nobjectivist\nshyness\ncrumlin\nperle\nmarlena\njolt\nenticing\nkölner\nfft\nharing\nbuzzards\nupsurge\nredland\nkooning\nspeke\nsetzer\ngawad\nctesiphon\nlares\nvolandri\nrusted\nendre\nseaborne\ntapa\ndinky\nsimran\nregressive\njuin\nferranti\nbrownell\nlochaber\ndroppings\ntakings\nshias\ncalabrese\nlox\ntypist\nnassr\nabsolved\nmoraines\nhemolytic\nrickshaws\naah\nsanat\ntroicki\nwpc\nbarco\nmudstone\ncombing\nherpetologist\nbackseat\nkaminski\nhati\ncompanhia\nparmentier\nabatement\nperfusion\ngautham\nnozomi\nzenobia\nchippewas\ndowell\nchetwynd\nadvani\nminutus\nmcas\nhaviland\noverworked\npleasanton\nphenylalanine\nstupas\njoie\nbroadview\nmillot\ncleethorpes\ntowpath\ncorley\ncylons\nwhanganui\nwau\nindignant\nperching\natwell\nmetamorphosed\ncaio\njerrold\ncistercians\nparisienne\nlebel\nflopped\nsvensk\njenkin\nhallowell\nherero\nrheingold\nbioengine
ering\nsearchlights\nsectioned\nvieques\nbronte\neider\npappu\nblindfold\nproxima\npolyhedral\nindianola\nmasud\npounce\nproudhon\ndramatics\nmilian\nrize\nwhetstone\nvcd\ninácio\ntaxiing\npoynter\ngolda\nmicronesian\nplasmas\ngoodricke\ndummer\nbtc\nstrider\npmo\ngela\nhardaway\ndecrypt\ntriphosphate\ndisconcerting\nfaison\nbowditch\nbambang\nchaps\nmoyers\nmadi\nfortier\nwernher\ndubey\nbindu\npriam\nglencairn\naep\nglutinous\nuhl\nliger\nevander\nslane\norangutans\nhentai\nmugen\nmagnesia\nsizemore\nsalud\nksc\nmorty\ngraydon\nicelanders\nqueuing\ndisobeyed\nillegible\nparisi\nunenforceable\nrmi\ncholet\nfarmville\ntanna\nbickerton\nregenerating\nstrabane\nwhitton\nsprouting\nsupercritical\ndruk\nsíochána\nmelchor\nmiodrag\n´\nsontag\nstinks\nahi\nwolof\nfarben\nmultimodal\notro\ngaping\nika\ncronus\nkieffer\nconjuring\nrondeau\nmingle\nconflation\nahs\npretence\nbombastic\nhidaka\nprager\nharwell\ntinkering\npajama\nsanté\nsphincter\nkazakhs\ntakei\narévalo\nsaltzman\nsarum\nkodi\npindar\nstandardise\nhypotension\ncongratulatory\nfrosted\nndebele\nnicktoons\nphaeton\nnol\nranieri\namberley\nkokoda\ncreston\nfoligno\nhemsworth\ndamas\népoque\ndarna\nsalton\ntheocracy\nlma\nadage\npus\nbrumby\nbezalel\ntoshiyuki\npinkney\nconyngham\nmasoud\npenalised\nmbt\nlisicki\nruprecht\nqara\nmorn\nlvov\nfrisk\nmersenne\ntaber\nescutcheon\ncollyer\njettisoned\nbadass\nearnestly\nactuaries\ncuriae\nsafar\nmll\ndershowitz\nnovelisation\narrowheads\nstatehouse\nkiwanis\nwuz\nhoban\nantes\nsaro\nincinerator\nbasf\nsoames\npolina\npbx\ndisagreeable\ncobo\npotchefstroom\nkatalin\nattache\nklass\nbăsescu\nmonkton\nmudflats\narchdiocesan\nrohini\nsergi\nclockmaker\ntpp\npractises\nfunnier\nfunafuti\nkull\nmiwok\nmacha\nspectabilis\nepistolary\nwot\nanstruther\nsqueaky\nreimbursed\nliebig\ngrupa\nrosé\nhoffer\ncpusa\ncatskills\nguidebooks\ndiamante\nbirdy\nincantation\nmacaques\nrages\njann\nbiomechanics\nobsessively\nsuperlatives\ncys\ntubercle\nunionville\nhalden\nburgher\ntridentin
e\nfessenden\nsnc\ndilbert\nlaterite\nrecesses\ntiro\nﬁrst\npalanka\nwenzhou\nlugnuts\nbsu\nmacphail\nsemiotic\ncuz\nrlp\npaltrow\ncopts\nqueensberry\nkapellmeister\ngro\ndramatised\npreying\ncalligraphic\nnambu\nrajas\nmencken\nshowalter\nfurlough\nloja\nprowse\nmarillion\nterrains\nlentils\nhella\nlory\nhoisting\npicchu\nshelving\nguaranty\nveined\nfuga\nglossed\nratifying\neminescu\ninitiators\nretaking\nboba\nfuchsia\nmetrodome\nnees\nlah\nglorifying\nsheaths\ntelemann\nmammy\nmassú\ntacitly\nhornsey\nebel\ndesa\ncrawls\nrepressions\ngilder\nbarnaul\ntaishan\nfirebrand\naurore\nors\neigenmann\npique\nyellows\nclipboard\nglinka\nyoru\npicketing\nramage\nvocoder\nbharath\nlemmings\ncloves\nunquestionable\npiglet\nhalleck\ntouche\nailsa\nbist\ndeforest\ndeke\ncefn\ndelicately\nacidosis\nfertiliser\nhenna\nseamounts\nmouscron\nanaya\nastonishingly\nbirkett\nmondrian\nnecro\njunichi\ndosing\nmondial\nsubba\nsubfield\nblisters\nnari\narbour\nabusively\nspanking\ntailback\njuvenal\nced\ndramatization\ndiarra\nevocation\numpqua\ntecnológico\nearache\nkinoshita\nrégis\nmathewson\nwreckers\nbarnyard\nstorekeeper\nbeyonce\nmilagros\napalachicola\nspeechwriter\nfoxborough\nbrierley\nallegories\nwawa\nhorwood\naltan\nresettle\ncineplex\nolave\nfrugal\njamey\nbott\nlullabies\nrecuperation\nkellett\nrazorback\nsculpt\nlandi\nhibernians\nsubheadings\nmasaru\nutero\ngranta\ntimestamps\nclyne\nbubo\nbartels\nazaria\npulchra\nnovum\nxxxv\nmacbride\nfoshan\naten\nnorberg\ndinas\nginkgo\ntzvi\nueber\nsuppl\nlhs\nhedrick\nbencher\ndidsbury\ntincture\nelke\nprabang\njurgen\nproliferated\nrade\ndiscards\ntilson\nbunsen\nbarbs\nsubtracts\nletchworth\nshamanic\nbuzzfeed\nhalberstam\ninfomercials\nbannered\nrufc\narmband\ngatefold\njewelers\nwittman\nkawamura\npingtung\nbiliary\ndallara\nlali\nsyunik\npleasantville\npogue\nlefèvre\nbasking\ntasking\nsegregate\npulido\nbasilisk\nunderlining\ntitusville\nionesco\nquantifying\ntarrytown\nbaltasar\ncritiquing\nelfin\nblain\nnse\nconny\ncerver
a\nketchikan\ninterpretative\nsubang\nmartius\nvermeulen\nmaracas\nvivi\ncashed\nbáez\nbubbly\nchatty\nredknapp\npiaggio\nhannaford\nmogens\nprilep\nweitz\ndubin\ntss\nnuku\nreams\nthalberg\namkar\nvocalization\npincer\npurses\ndrdo\nweise\npaxson\nmourns\nshahbaz\ncollation\npastorate\nbelladonna\ntriste\nenvisions\ndisturbs\nginseng\nfetching\nnewsgroups\nunassisted\ngrazier\nyori\npajamas\ncavaliere\nchagos\nwelder\nazeris\nmême\nsharples\nsainz\nfriese\ngumbel\nbelluno\npyro\nfaubourg\npublically\ntualatin\nmagni\nsongbirds\ncustodians\nannées\neustatius\nlogicians\nlav\nnanterre\ncolditz\nsimson\nquasimodo\nexorbitant\nacoustically\nmance\nhouthi\nfrontpage\nfriedland\njabba\nfyrstenberg\nclearings\ndramatica\naegon\nalbinism\ngorse\nlatreille\ngratefully\nmdm\nzastava\ngano\nroundabouts\nmullingar\ncliches\ningen\ngosforth\naizawl\nevaluative\nsalk\nringers\ndestinies\ntinnitus\nniobe\nlogger\nheeded\nwaxing\ntash\nprelims\nquiver\nbourbons\npurulia\nexcavators\nicarly\nfrontale\nsuma\nairworthiness\nkops\nsurpluses\ntrashed\ncatapults\nseagrass\npienaar\nfatma\ntillis\nbolelli\nklas\nhordern\nmasoretic\neavesdropping\nbrisco\nmonnaie\nplanktonic\nmicrotubule\nmcquillan\ncassavetes\nefron\nflipside\nmentalist\nerzgebirge\nabercromby\nmyopia\nbromfield\nbirney\ntwining\nwrest\nkishi\nfibroblasts\nsilverado\ncofe\nverging\nramya\ndé\nsfgate\nstator\nabbasids\nimad\nanhydride\nindi\ntroglodytes\nskokie\nbork\ngbc\ncranberries\ndudu\npozzo\nbaryon\nmontez\nsesquicentennial\nthrobbing\nwanderings\nambiguously\nherrings\nhse\ntrost\nseeps\nmasterminded\nprideaux\nunderpants\nanemones\nviolette\ngripped\ncalamities\npharisees\nfeds\ntitleholders\nhortons\nhistorica\nwallington\neighths\nimaginations\nmacneill\nlindenwood\nllosa\nillegality\nswayze\nhundley\nstreeter\nprograming\nphds\nreitman\nunpowered\njeg\npastrana\nrebooted\ncontessa\nnunca\nstrippers\nably\naare\nwheldon\nsago\nsangli\nchiaroscuro\ndefensa\nkuro\njuncus\nholograms\nchernykh\nhemant\nimitators\nd
ormouse\nneedlework\nwbz\ngoslar\nphotoshopped\nissei\nsentimentality\nxxxiv\nshinobi\nberri\nneda\nlandy\nuzbeks\nlarimer\ntakoma\nharsha\nwoodall\napertures\nsebastopol\nsviatoslav\nbskyb\nomo\nplatters\ntrebinje\ninvulnerable\nkoppel\nseagate\nhandlebars\nseparators\nnarcisse\nligier\nbostock\ntachikawa\narvidsson\ntrevino\ntbi\nhalevi\noceanographer\noxytocin\nkitzbühel\naurich\ncofounded\nhoxton\nwhampoa\npis\ngals\nsurmise\nsaddleback\ninundation\nflagbearer\ningest\ncherwell\noblate\ncurvilinear\nrémi\nobfuscation\nheadship\nnefertiti\ncarstairs\nobstinate\nfrenetic\nturgenev\nesm\nalpheus\nextender\ntomy\nsabra\nsira\nsayonara\nbarna\nquik\nsoya\nchoreographic\ntacna\ndefiantly\nberlocq\nmisogynistic\nbayly\nwarhawks\nnek\nsneaked\nmossley\nartesian\nisdn\ngandaki\nmylène\nharkins\nchera\nkorail\nmimo\npuller\nrusedski\nmirim\nmung\nmellen\nconfounded\nkhasi\nboisterous\nkamui\ncovenanters\npmi\nholodomor\nunia\nflorencio\nbrae\nrothmans\njaspers\ntadić\nwestmount\ncarlingford\nlikens\nglaxosmithkline\ncaucasians\nlakshadweep\ntch\nroches\nbitching\nsergiu\nshorelines\ndenialism\ncollaborationist\nclairvaux\ntreacy\nbeamer\nwadia\nseptal\npaxman\nanise\nanimating\nmegabytes\nfarrelly\ncorretja\nbartleby\nglück\nspringville\nlightspeed\nmarksmen\nruthenium\ntham\nrufa\nhmc\npreece\ngolders\nhideaway\nsellars\nmuthu\njohanne\ndaughtry\nsoderbergh\ntallon\nmegamix\ndefecting\nhyperbola\npfl\ncarburetors\nksl\nredpath\ncathédrale\ninternationalism\njello\nodenwald\nbenoni\njoong\nsupercontinent\ninbox\nlycos\nprüm\nparbat\ndaffodil\nsmr\nhypothyroidism\nthein\nribot\nwestley\nhelgi\nnettie\npooley\nkartik\nhalla\ndoughnuts\natahualpa\noverflowed\nvulgarity\nmatrimony\ntetsu\nnicklas\nhogue\nsplatter\ninfirm\nlermontov\npaedophile\nternate\nbaywatch\nexclaims\nflatts\nbackhouse\nunassigned\nsneeze\nyoshiyuki\nmelgar\nkenna\nhygienic\nyearbooks\ndalia\nteletext\nroark\nditko\nwashtenaw\nmaximized\nsynchro\ntakarazuka\njit\nagi\nbogue\nnoob\nsiyuan\nkroq\nmillett\n
baltar\nrestarts\nlillard\nswee\nramses\nholmberg\nmeditating\nbenue\nnef\nmochizuki\npenna\nswire\ngerhart\nquarterfinalist\nkoa\nhuánuco\nternana\nkristallnacht\ncfp\ncloaked\ngoldfrapp\nspaceman\ncourtland\nfilipinas\nucs\nbristle\ngunfighter\nistra\ntsingtao\nespousing\npetersham\nsherif\nblacker\ntrimmer\nsouthwesterly\nsteinhardt\nreinach\nbuckman\ncallender\ntarp\noktoberfest\nreshape\nenplanements\nweeklies\nvater\nspiller\nidiocy\nazzam\nlifeforms\narabidopsis\npok\ncette\nbicester\norm\ncapitale\nwenceslas\nwaca\nmarfa\nvit\nvmf\ndurrant\nmisconstrued\nedmundo\nhenricus\nmajuro\nshimmering\ncadenza\nchibi\noakenfold\neradicating\nrobs\nunsc\nchicagoland\ntakeru\nshahab\ncourcelles\nplucking\npurists\nmapa\nvillon\nattock\naznavour\nshakir\ncutts\nunending\nlapel\nretard\nnikkei\niredale\ndisloyalty\ndefame\nimperio\nlatifolia\nmortgaged\nyomi\ncrunchy\nbequests\nbirjand\ngosse\nesquivel\nrebelde\ndees\noun\nsemana\nlepanto\nwiesner\ncrumbs\nhiccup\ngort\njardins\nkahani\ngauleiter\nrathbun\nfowey\ndismemberment\nsociocultural\nbizarrely\nquanta\nose\nansell\nkuno\naub\nmahabad\nrsi\nstreetscape\npardee\nirritant\ncupcake\nsone\nbanksy\ncormack\nfdc\nhoek\nhelsingin\neyal\nstoyanov\nboreham\nchek\nturley\nvinaya\nolam\nlawes\nminn\nuiuc\njailhouse\nchiloé\nsandstorm\nvishwanath\ncampagna\nnorthrup\ntyrrhenian\nproselytizing\nantunes\nlynton\nadulterous\nhooton\nstilted\nmuri\nhamann\nmelatonin\nmillman\nrendsburg\nmigs\nconsignment\nmeléndez\ndiasporas\ntakin\niic\nsuperstore\ncenturylink\nbagong\nimpregnable\njoules\nbaboons\nbruguera\nconsiglio\nippolito\nmadang\nunidad\nfaruk\ncampagne\nrabelais\ncabinda\nlandholders\nplumbers\nashmolean\ndillingham\nyorkton\npointedly\nxli\nshultz\npacified\nsquashed\nohr\nsctv\nbouton\nderulo\nnijinsky\nmorumbi\nkhoo\nabm\njusqu\nchangers\nchater\nnaught\nwhitmer\nfitton\nsain\ncondit\ncovey\ngallbladder\nstephanus\ntreviño\nhangin\ncatastrophes\nfaeces\nsunflowers\nlatvians\ndined\neurofighter\nlymphocyte\ncapitoline\
nllangollen\nalouette\ndhivehi\nenceladus\ndeft\nvrije\nprabhat\nolivetti\nbrownies\nhardball\nkarajan\nwardell\nnegus\niles\nbechtel\nbhadra\nflt\nhaidar\nscrotum\nbitburg\nhoyos\ncolliers\nbrühl\nkundalini\nmetairie\nloxton\niommi\nnavigates\ndellacqua\nstranglers\ndecile\nsliver\nolmstead\npano\nmiers\ntheodorus\nsaku\nbosanski\nmattson\nclerkenwell\ncinemascope\noxidant\ngavan\nscn\ntoothbrush\nterrorizing\npuc\nlammermoor\ngacy\nwagstaff\nbrayton\nirrelevent\nchandu\nwrangling\nsrp\ndemetrio\nprus\nbijou\nwaukegan\nunbelievably\nshirai\neastham\nbicknell\nfugues\nvivace\nplebs\nbreathes\ngiddens\nwyss\nmenelik\nstutter\nsodor\ncoxon\npaynter\nrichardsonian\nmaggi\nparacetamol\nlucida\nwrangell\nhedging\nbaggy\nwhirling\nrunyon\nsoftness\ndoppelganger\nwebley\nassessors\njurado\nsnoqualmie\nmilam\nvoracious\nparkview\nnomi\ntrig\ncounterattacked\nextricate\nweatherly\nabp\ndigress\nreconfiguration\nbucher\nfabolous\nbulging\nabrogated\nhamming\nstunting\nmisdemeanors\nqala\ncandor\ntheorizing\nyachtsman\nmobb\noppositions\nperigee\nbakri\nvanna\npenitent\ncompetences\nsatrap\nnib\nrickie\nmorvan\nmantell\nhypnotism\nmazurka\nbialik\nmuskrat\nhanyu\nsmelly\nskynet\nstabia\nkasem\nslowness\nscolded\nflorencia\nbayi\nflorins\ndunwoody\nhematopoietic\ncarracci\nalcázar\nargentinean\nmisrata\nduchesse\nige\nnusrat\ntraill\nchallis\nelli\nreciprocated\nberthier\nmgs\nanalytically\ngurkhas\nuq\nfragilis\nexpertly\ntolerates\nzam\nbruise\nbeate\nsutjeska\nbalogh\ntadhg\nmasjed\nkerwin\nbrigs\nfredonia\nwoken\nwitter\ninsinuation\nmoc\njovian\nsidwell\nacidification\nnatsuki\nmichalis\nairey\narial\nelysian\nteu\nmalinowski\ndoublet\nanta\ngringo\nlavatory\nbeals\ngmp\nmcl\nstaatliche\nbugging\ntassel\nlandsat\navinash\nferromagnetic\nduvalier\nsuffield\nheighten\nkimbrough\nzarathustra\ndubose\ngandhian\nsuperheated\nhamzah\nclinker\nreoccupied\nmutate\nmirabeau\nfatherhood\ndwarka\ngoce\nfracking\nbushwick\nivie\nchiranjeevi\ndsa\nkasim\nroosting\neberle\ninka\nassi\nv
ariegata\ncannabinoid\nsantorini\nhaverfordwest\npyrotechnic\nzuzana\nveng\nextractor\ncliftonville\nkickoffs\nploughs\nversicolor\ntapers\nqueiroz\nextrapolate\nkenn\nbelittling\nmarkos\namazonia\nmaywood\ndoig\nschneerson\ntwine\nlenihan\nwimmer\nantanas\nholroyd\nistres\nfrisians\nkarts\ntrusty\ntavernier\namora\ntrusteeship\ncrossrail\nnargis\naffix\nbluebeard\ncarotene\nkweli\nmallon\nmunna\ngenerali\ntribals\nmuro\npádraig\nnizamuddin\nconscript\nalde\ncytoskeleton\nbeamed\ndjing\nmuñiz\nuncritical\npanellist\nserf\nhélio\nearley\nuninformative\nprofiting\nunmask\nfiercest\nsiraj\naddressable\ngregoire\ngrasset\nconover\nmagnuson\nmaro\nmanningham\nisidor\nrepeatable\nbdd\nrussa\nkunkel\necclestone\njudgeship\nmalaise\nzamoyski\nremaking\nplm\nyikes\ntapioca\nbarthes\ngmtv\npsychopathy\nbrutalist\nhalfdan\nalbani\noffical\nsbt\nchiaki\nbourse\nonside\neczema\nartyom\nbarbarism\noakey\nsyntheses\nnaso\neinaudi\ntedeschi\njaz\nsemitone\ngizmo\nchinatowns\ncastaño\nlazlo\nakashi\nlooters\nhuskers\nheartbreakers\nkosciuszko\ndermatologist\nascendant\nweirdness\nusury\nstowell\nallegretto\nreinstall\nrangi\nmartello\nswamiji\ndmus\ncoppi\ndelmas\nkarak\nxxxii\npiso\nwasser\ncharbonneau\nozma\nwrens\ncixi\nbactria\nvips\nelegiac\ntaconic\nkintyre\nnpt\nkairat\ncollegio\nsynergistic\nlightsaber\nbhatnagar\noiseau\nauthorising\nzealot\nencapsulates\ncommitteeman\nspiraling\nangeline\nhalfpipe\nbeek\nstraightening\nalvis\ntopmost\nosorno\nripen\nmatins\nbriand\nbehn\nvasudevan\nridgely\nunexpired\ncdo\nstornoway\nzoroastrians\nspied\ndau\nskateboarders\nermita\nintoxicating\nphilipps\nflynt\nmachin\nsalut\nkaman\npressburg\nfantaisie\nhomeschooling\napcs\nguilin\nwolters\nmarcher\nenslave\nbauhinia\nperros\nhornung\nsture\ncarli\nmlk\nherold\ncalderdale\nbair\nmicrometers\nravage\nimmaculata\npeering\namery\nmoan\ndern\npretzel\nlibs\nrsp\nmexicanus\nyorkville\ncortese\nsorbian\nment\nerastus\ngornja\nquanzhou\npoitier\nsayles\nroseau\nanimism\ngurley\nreichert\nrudo\n
sinuses\naltes\nphotoelectric\npacha\nsukkot\ntoothless\nsecretariats\naltimeter\ntrappist\nediacaran\ndebreceni\nlilium\natlantica\ndockyards\ncota\nvineland\nazlan\npanna\nmalaspina\nanarchic\ndda\notherworldly\nrustlers\nlic\nxfl\ninwardly\nfainted\ncrean\nkuntz\nscents\ntuvaluan\noffstage\nayuntamiento\ntuo\nalbinus\nbechuanaland\ninverclyde\nstalkers\nhelden\nmykhailo\nblomfield\nbedouins\nhku\nloughton\nkozak\ntoons\ninterrogators\ntripolitania\nbloodthirsty\nimpediments\nschoolcraft\ntots\nbicentenary\nneuberger\ncontented\nrearward\nanser\ntavriya\nkone\npitney\nsatriani\nliman\nbirley\nmdy\nclamps\ngongs\nmamelodi\nblainville\ndesecrated\npadgett\nastounded\nmeshgin\neuan\nhulst\ncollectable\nrefunds\npreempted\njuanes\nzomba\nreformatting\nboyband\nsupermassive\ngedo\nkitamura\nadjoined\nseg\nsafran\nrondônia\nbirt\nlamentations\nshimin\nqx\ncarn\nbailiffs\nsickles\nflon\ndeplored\nunease\nerol\notomi\nfangoria\nundirected\nspratly\nyael\nornette\npeddler\npratique\nwaldheim\nfeltham\nsupercell\nobstetric\ngeneralizing\nevgenia\nflippers\nnikko\ncomputerised\nballina\nzang\nuerdingen\nbresse\ndhofar\nlegalistic\nwark\ndoce\nstuka\neuropéenne\nwagram\nbuhl\nfranchitti\nponders\nsuperseding\nsavin\ndeportees\nfada\nairsoft\nsoulja\nindiewire\nyamauchi\nhadji\ndechy\nhobgoblin\nwenham\nhrc\nshahzad\nstoops\nhurl\nluzhniki\njowett\npterosaur\nguianas\nsundari\nhubris\nmejia\nory\npsychotropic\ntanana\ndwt\nkanya\nbernat\nbinnie\nbps\nlaparoscopic\ndala\ndepositors\nbaud\nrustica\nmikan\nlarousse\ntellurium\nappleseed\nunblocks\ndibrugarh\nmicrosd\nhoddle\nhalvorsen\ncelaya\nbéatrice\nbonjour\nbarcelos\nhousework\npolak\nmauer\ntamale\naltstadt\nprieta\nchardy\ngwilym\nparkhurst\nkramnik\necon\nsoane\ntolman\nclandestinely\nglu\nyagi\narchana\naccelerometer\ndeca\npeacebuilding\nfunnies\navesta\nreutlingen\ninterbank\nbci\niveco\nduopoly\nmcgwire\nlevesque\nfriedkin\nmalthus\nturbidity\ntsuchiya\ndaniella\nyuvraj\ngopala\nquadrennial\nreales\nberwyn\nchoc\ngast
onia\nhooliganism\nredoubts\nhabla\nthimphu\nherren\nawol\naparicio\nsakaguchi\npatching\ndivestment\ngreif\nsheedy\ntriglav\npresidencies\ngooseberry\ncommonalities\ncaryl\nteleports\nfancies\ndjamena\nleibowitz\ntangents\nsasi\npatera\nunsurprising\nselmer\neggleston\nzainal\nzobel\nunwillingly\nwaggoner\nseabury\nkaito\neckersley\nprohibitively\nmocha\ngilani\nhutcheson\nawash\nunspeakable\nbertone\nsurging\nparo\nsiang\nmmorpgs\ndobbins\ninsomniac\nnta\ntolling\nspoofs\ncarshalton\nculbertson\nwawel\natal\nketo\nstrangelove\nlato\nranjith\nbiron\nheathfield\nadèle\nsubduing\nmaximo\nnats\nviera\ncapilla\nnewshour\ncrutchfield\natone\nsaeki\nsquaring\nerle\nkiril\nsylvanus\nplayfully\nmiraflores\npalaeontologist\nhcp\nwnyc\ncadfael\ninterrogating\natar\ntoten\nfeigning\nyokota\nharuki\ntert\njenifer\ntsk\ndewa\nunsalvageable\nloe\nconstantia\ntapas\ntreen\nupped\nleadville\nwesterner\ntableware\nsarasvati\nstarships\nreacquired\nwwa\namboise\nstench\ncheetham\nstymied\nbleeds\nchutney\nbelarusians\nhartigan\nholcombe\ndivertimento\nryle\ntranscending\nvim\ntrejo\nshakey\nxerez\nmonger\nshovels\nhoma\ntsuji\nyaqub\nflorianópolis\nfluxes\nebba\nbridgman\njohny\nstubbing\nadulyadej\nfactionalism\nperusal\nstockholder\nquarterdeck\nbown\ncopywriter\npsers\ngarcés\nramstein\ngelfand\nsujatha\nsyncretism\nperfecto\nbackgammon\npmr\nborbón\nsharm\nheydar\nwfc\npartway\nspoilt\nmendonça\ntameside\nkournikova\nkashan\nplacentia\nuusimaa\nwael\nosuna\nseaview\ncavalieri\nashy\nbreezes\nhubli\nbambara\nnonpublic\nnewspaperman\nkru\nvarian\nlilienthal\nmiyazawa\ncoworker\ntussock\nvocê\nmanoir\ncristoforo\ntechnik\nbusing\nroxie\nhematology\nbilder\nlibyans\nkoya\natty\ngynecologist\nclevedon\nmajeed\ntanto\ntrammell\nhoppus\nbelov\ndiamant\nharnett\nashot\ndivo\ndiffusing\nmárcio\nperinatal\nzulfikar\nredonda\nminivan\nestudio\nteletoon\nsante\nperc\npatronising\nfolktale\nmetellus\nicehouse\njérémy\nspurgeon\nbluth\ngertie\nberthing\nmilf\nedulis\nschifrin\nanoka\nannunzio
\npaule\nyuva\nledbetter\ngatt\nlsc\nvelha\nhyperbaric\nfairhaven\nansgar\nsubplots\nucb\nduce\neastgate\noverflows\ngunderson\nborys\nmbh\naib\ndga\nvetter\nwyckoff\ncamborne\npaston\nvakhtang\nmartti\nweeps\nosler\nbenchley\nmohanty\nkastner\nburgin\nredruth\nlearjet\niguanas\ninelastic\nvillalba\nrevs\nwolds\nmangum\nbasalts\nhenares\ntenebrae\ngrayscale\ndecrepit\nhalter\nmuzaffarpur\nzk\nhazlewood\ndenizli\nrhymed\nrancheria\nstewardess\ntransversal\npredictors\nreservist\noptus\nraheem\ncapello\nbrinkman\ncumbrian\nnewstead\nmoldings\nelissa\ncirculates\nrojer\npints\nsalomé\nsfu\ntelenor\nfrenkel\nmilutin\nfader\nszymon\nherpesvirus\nravenscroft\njue\nduras\ngasification\nhorvat\ntemporally\nsummerside\njeffersonian\nbarzani\nmagadan\nvulva\ncarcinogen\nbogdanovich\nanastacia\npollinator\npunchline\nbruck\nchhota\nlambeau\nstokowski\nsedona\nblinky\nchirico\nkobo\nream\nfoetus\nhalas\ndeactivate\nhatshepsut\nhazmat\nnxe\nreciprocate\narnim\nrml\nstratotanker\nfado\njrotc\nhannay\neef\nraghunath\nfriedberg\ncarbonates\nbridgeton\nwavell\nstovall\nhelvetica\nrafiq\nartical\ngivenchy\ngeysers\nfusilier\nbaleen\nlorenzen\ngretsch\nbolkiah\nalts\nmcdougal\nestevan\npedigrees\nayo\nlakshmana\nicap\nthigpen\ntarja\nvallance\nneutrophils\nsongz\nedc\nhildburghausen\nheald\nducats\nendemism\nbsf\nplayboys\nwipers\ntrapdoor\nmohit\nmilkman\ngilding\nuca\nloesser\nsalleh\nlorrain\nobstetricians\npeekskill\nabolishment\nmorland\nwałęsa\nlardner\ntrygve\nsystolic\npallidus\nstourton\ngandhinagar\nhistoricist\nmcminnville\nscallops\nwaterlogged\nmoradabad\ncrataegus\nsocialista\namplifying\nvenstre\nparalimni\nunaccounted\nxalapa\nferran\ncommuniqué\nignites\ndarab\nnori\naleut\nmccune\ngratia\nannulus\npipistrelle\ncreatespace\nsherrod\nsainthood\nkirchberg\nkeun\nmoaning\nintercom\ncompositing\nvetch\noverlong\nsurbiton\nvise\ntudela\ncarmelita\nmassy\nmycologists\npowerbook\nwatermills\nsalvio\nnmc\ntrutv\nscraper\nlasseter\nhickam\njalgaon\ncullman\nturkana\njupp\nglea
ming\ngripen\ngrandnephew\nsalé\noort\ninterleague\nwärtsilä\nosmania\ncamrose\njanitorial\narcturus\ncreamer\nrhythmically\nduclos\ncallow\nbitrate\ntanga\ncassano\nmemorably\nforlorn\nouverture\npeto\nlavey\nmulgrave\ntsp\ntallow\nristo\ngentian\neliya\nhershel\ncyrille\nweeding\ngodson\nswitchover\nbayshore\nbpo\nmannerism\nfreckles\nlbj\narchuleta\nbuckskin\ndelorme\ncrouse\nheinze\nbuckminster\ngranularity\nthermoelectric\nnanga\nhosseini\nturton\nbizerte\nsunsets\ntranskei\ndeformations\nunrated\narguements\noakham\nlinotype\ncentralize\nstiglitz\nisidoro\npumice\ncitra\nadelheid\nabetting\nlabatt\nentitles\njayapura\nfushimi\nsorrowful\nsso\nsoutheastward\nendoscopy\nzapatero\nvfx\nraking\nkeanu\nparvathi\nsolveig\nembarcadero\ncopepods\nguidlines\nlidar\nfemi\nrutger\npantomimes\nflutie\ndendrites\nsuperconductor\nmyr\nibises\nmonolingual\noutram\nlandholdings\nfujisawa\nleet\ndoppelgänger\nskillet\nantonine\ntigger\nredact\ngeomorphology\npressurised\nfira\ndelorean\ncaptioning\nrasch\nundetectable\namira\nstools\ndoss\nyellowhead\ncircuito\nsponheim\nsinologist\ntollywood\naquí\ngimmicks\ndéveloppement\ngrowling\ntain\nflashman\nvtb\njarred\nbaptistery\nsle\nmuy\nmcduck\nglyndŵr\nsafekeeping\nordóñez\nseibert\ndempo\nunfairness\ndorcas\nmeridians\nsecaucus\nbenidorm\ntakako\nsnowshoe\npomfret\noranje\nwildland\nvideotapes\nbackbench\ninterrogator\nconiston\ngundersen\nkathakali\nnakanishi\nligation\nheadwater\nstam\nbotnet\nsilencio\nbrolin\nhomonymous\ntorched\nmish\nmaigret\nwythe\nsardonic\noscilloscope\ntaher\nmcardle\nibge\ndowne\ntabulation\nelohim\nmelling\nexcellently\ngostkowski\ntempt\nsnps\nipp\npitti\nspasms\ncutty\nfatehpur\nwcl\nsreenivasan\nthrushes\nharmonization\nbiosciences\nexhortation\nhereinafter\naonb\nexuberance\ntete\nhelmond\ngianna\ntarski\njatin\nakihito\nmarriner\ndodecanese\nconvenes\ncaries\nsongkhla\ntanjore\npenafiel\nburmeister\nayako\nzealots\nmanik\nlunt\narrayed\ninterviewers\nbancorp\nbereft\nguiyang\nbrydges\nskerries\
nneymar\nlor\nmestizos\nmsw\necholocation\nmanipal\njohnsons\nbalaguer\nmalachite\nhorsemanship\naccusatory\nesslingen\nsaff\ngenotypes\ntoft\nkhatun\nhorry\nimprisoning\nparapets\nspokesmen\nnawa\nhowth\nclawson\nbere\nejido\nakiyoshi\ntowner\nwgc\nzarqawi\nsensuous\nriera\nsombrero\nlumping\nstomachs\ntoriyama\nproportioned\npurnima\nbishan\narabica\nplayfield\nlooe\nvartan\nquarantined\ndisregards\nsuperceded\ncalabrian\nfabiola\nmaxillofacial\nwedged\nnegotiates\nfoerster\nthao\ngeisel\nscoops\npumila\ntremaine\ndevelopmentally\nsmet\nbloodied\nrisso\nmarquardt\nburk\nbollegraf\nunapologetic\nspivak\nunbeatable\nbattlefront\naoa\ndisputation\ngovortsova\nbisbee\nsandbar\npolanco\nspiritus\ndictation\nluján\nhellmuth\ngrata\ntangail\nmossman\nzelazny\nbhanu\ninteroperable\nminuit\nhaqqani\nhegemonic\nbromeliad\nerudition\nbecerra\nunfettered\nbottlenecks\nburdensome\nbolus\ntrite\nalmodóvar\nbereaved\nclubbing\narvo\npiran\nrosyth\ngridley\ninwood\nectopic\nbiograph\npkr\npinhole\nupenn\nliebman\nfranchisees\nmudslides\npulleys\ncanonically\ncragg\nimamate\ncarrière\nretford\nbhandari\nromsey\nedp\nsynopses\ngtx\nalomar\ntapper\nadopter\nbellflower\ncresson\niea\ncobbett\nlibretti\nalleyn\nnilgiris\nadjudicate\nstuffs\nmarri\nreconquered\nkaki\ndatong\nyarmouk\nmassaro\nmarchenko\nlalita\nsibylla\nyakutsk\ndagobert\nhovhannes\nmichi\nventa\narcadian\ncrochet\nsako\nlongterm\npancreatitis\ninnovated\ngraphing\nbahujan\ncircumpolar\naldi\nkemi\nwladimir\ncontentions\ncaldron\nsheepdog\nsouthsea\nesr\nshinee\nhopton\ncots\nbradstreet\nprowler\nbabelsberg\nmiwa\nmauled\nwohl\nbricklayer\nivry\nsemites\nexperimenters\njuninho\namersham\nmagoo\npowis\nliveries\ncanta\nlessig\nfederalized\nsrf\npredominated\ndocteur\ndorfman\nrego\namadora\nkarmapa\nchante\nkinematics\nspratt\nfancied\nmckeever\ntpm\nhumaine\ntocqueville\nmalory\nwhores\nathleticism\ndiametrically\nordinariate\nchadderton\nplame\nqadi\nkovalainen\nlensing\nthornley\nsaanich\nfcw\nsalado\npininfarina\nte
mperaments\nmolnar\nhamlyn\nchagas\nlagan\ngrandest\ningots\nkeener\namritraj\nreinsert\ngraziani\npando\nalmanacs\nherford\nbetz\nkavya\nwartburg\nlahr\nsigourney\nexquisitely\nzdnet\nnucleoside\nbustle\ncabrillo\nthurlow\ntarbes\ncharest\ncontemptuous\nyadkin\npavlodar\nhendra\nbucaramanga\ncorpora\njaxa\nsalonga\nselectmen\nseawall\nbetti\nflippant\nmarigny\nradiates\nlydian\nmfr\nheiberg\npaladins\ngiddings\nimmerse\nnanometers\norvieto\nsaboteurs\nhermano\ntasteful\nhelter\nyk\nstraubing\nsolti\nsidereal\ngeolocate\ntolliver\nbrio\npacelli\ncays\nasma\nravichandran\nbonfires\nlegnano\nvilar\nproteomics\ngdf\nairbases\nesparza\nmanheim\nbagel\ndisjunct\ngrafts\nkaa\nbedell\ntren\nespnu\nnozze\nchengde\ncella\nflankers\nrylands\nkord\noutclassed\ngubbio\nshoah\nrisparmio\ncolonie\npermeated\nbassa\nsibu\nescrow\nleacock\ncoldly\nbrandes\nabolitionism\nweft\ntripos\nfrc\nsmearing\nsectarianism\nyalu\nabsolutist\nisadora\ntrevi\nmcvey\nhosiery\nkhalidi\nkudryavtsev\nnorrie\nprivateering\nyannis\nhawa\npredominates\nmetrostars\ncumberbatch\nfiestas\nvarman\nsamhita\nknightly\nkanan\nbryden\ndislocations\nkathak\nsargeant\nsubsea\nbelair\nbeardmore\nexpanses\nrampal\nbardon\nbestows\nelkin\noxidize\nstationers\nheterosexuality\nsanguinetti\nweybridge\npapas\nugliness\nsharepoint\ncarbonaceous\nreconsidering\nchiricahua\nwerft\nuniversitaire\nabdoulaye\nmauricie\nwickmayer\nrationalize\nechinoderms\nkilwinning\nheiden\nsubterfuge\nrefrains\ncataclysmic\nodie\npotosi\nhippolyta\nchairpersons\ntechnetium\najman\ncommonsense\nringleader\nfof\npillbox\nyoshimura\npanton\nmisinterpret\nweevils\nvolterra\ntwc\nbanister\nhaiyan\nsarada\nyeasts\nbunton\nshahnameh\nungar\nviziers\nalleluia\nperonist\nendometrial\nsotho\nfrescoed\nrediscovering\nfouts\nsavoury\nswart\ndirecteur\nromulan\ninlays\ncongeniality\nshirk\nablative\nvtol\nleitner\ndiscontinuing\nmenai\nnak\nförster\nstrategists\nsunburn\ngumball\nnovelties\nprolifically\nsupernumerary\nkotaro\nashman\noamaru\nstonebri
dge\nvei\nproactively\nstoneware\njaclyn\npagerank\nunderpopulated\nbelousov\ncoupes\npratapgarh\ncartouche\nspastic\nkohat\nstaatsoper\nharstad\nsocialiste\nmeshes\ngauging\ngoulart\nsundowns\ncorsa\ngoldenrod\nbrockway\ntrombonists\nalena\nnata\nhairdressers\ncornette\nmadinah\nulbricht\ncyclades\nyttrium\nmaclennan\ngagne\nbarnhart\nnenets\nmelas\ncharlatans\nmukhopadhyay\ntamim\njetblue\nliquidate\nlaudable\nbestowing\navni\nuniversitat\nlocalize\nrecapturing\nyara\nkelis\ndudgeon\ncobble\ntwomey\ncrucis\npersonage\ngentoo\nthumping\nslidell\nprp\nretargeted\nmasaya\nsouther\nmorts\nmarkku\nwilders\nmarjory\nmeanchey\nogier\nunis\nstudium\nkummer\ncovalently\nattu\nhowitt\nbarrages\nresolutely\nergonomic\ngrowl\nriza\nchacha\nhenny\nautonomist\nlinley\njuggler\nkross\ndegeneracy\nbharu\nyoshinori\nsuchet\ntraver\nletzte\nhildebrandt\ngeoghegan\ncortona\nlymphoid\ncajuns\nintranet\nrostral\nauth\nkurstin\nblackmon\ndibble\nlithographic\ncvc\nbainimarama\nparada\nphenom\nbux\nsynesthesia\nludo\nfitr\nwhiff\nchalky\ndiverts\nsawing\ninformers\nswitcher\nnewsstand\nwedel\nlanga\npenda\nharrods\ncapriati\nsuse\nblatter\npredisposed\nbunce\nstaveley\nscutari\nkisumu\nadan\nhypericum\nmindaugas\nnanotube\nmcadoo\nhanif\nairlock\napostol\nmatchplay\nmlle\ntomita\ncountermeasure\nfos\ndavi\nrhodium\nwooldridge\nuhm\nconceptualization\npref\npinakothek\ncooktown\npiave\nmercosur\nlagi\nlinehan\nfrenzied\nquestionably\nebden\nmuddle\nherc\nseepage\nfulfils\nencircles\nfrosts\ndimensionality\nsupersedes\nwerther\nnelle\nallying\nholston\nleeches\ncopyist\nvictorias\nbrinsley\nrabble\nsueur\ncalvinistic\nmadawaska\nvicarious\naprile\nawkwardness\ntedx\nbeater\nkaroline\nindah\nnicanor\nenglund\nyasuo\ncottle\ninstalments\ngoldblatt\ngeopark\nbonk\npodkarpackie\neurythmics\ndevitt\nkilos\nconduits\nilluminator\nsagara\ncounterintuitive\nanalyte\nquelled\npastimes\ncoloratura\netsu\npincus\nsusheela\nrepaint\nferber\ndesarrollo\ncialis\nsneha\nnyingma\nuchicago\ncambio\ndarksi
de\ntennent\nchitchat\nlbl\nspearheading\ntaniguchi\ngrampa\nredwall\nbarracudas\nshihab\nparkour\nbravest\ncowichan\naetna\nventimiglia\nsmallholders\nsajjad\nflicks\nnetbook\nsonam\ntestud\nexpansionism\npermittivity\nprotestations\nscorpius\nsoulcalibur\ncorina\nroeselare\nspeedometer\nfitzherbert\nardrossan\nlozenge\nstriptease\nathanasios\nbarbu\nhuila\npalmes\nrisqué\nchatfield\nspadea\nxuzhou\nuktv\ntribus\nfathoms\nvalmiki\nvlsi\ndominika\ngelman\nwaterwheel\nyilan\ncpg\nbrule\nlxi\nkaaba\ninvolution\ntradeoff\nabsorbent\nbelmore\ntilman\nmalia\nrougher\nrpo\ngalatians\ntá\ncreedence\ninvective\nmateusz\nfolate\nmorningstar\npizzas\nwasl\nbrigands\nrfl\nreapportionment\nhelton\nwhiter\niol\nsuccessions\npleadings\ndestructoid\nfallback\npiri\npenstemon\nbagmati\nxlii\nauberge\nnscaa\nalderley\nsepulveda\nkagame\nergotelis\ncaloocan\ncorydon\ndro\nsempre\nthicknesses\ncaprica\nionospheric\nsmallmouth\nmaximiliano\nishak\nfrigid\nlysis\ndisapproving\nzevon\nrosamond\nlill\nsoliloquy\nconquista\nrtr\nmetlife\nprofitably\nkeystones\nrlc\nbarmaid\ndoosan\narka\ndissipates\nvalls\narrhythmia\ngreenblatt\nfaenza\nrarotonga\ncalamba\npobeda\npallid\npiedad\nkawakami\nlatta\npathophysiology\nsilmarillion\ntanis\nsurfactants\nhalos\nstraczynski\nsrivijaya\nmenahem\ndistro\nbacteriologist\ntah\nvaruna\ndependant\nmottram\ntelefónica\nduong\niphigenia\nakimoto\nxxxxx\nunfamiliarity\narash\nreinvention\nguesswork\nbarnegat\npulteney\ninclines\nfenimore\nvesa\nstile\ntranscended\nathist\npreemptively\nsignor\nslung\nfraudsters\nyani\nshrimps\nhanyang\ncongratulating\ncheech\nnovy\ntodorov\nenjoined\nunseat\npila\nuseable\nstef\nmetamorphism\noroville\nsteppin\ntoujours\nferrylodge\nbessarabian\nconcatenation\nsurrealistic\nasgardian\nlowrie\nlynsey\nkatsina\ndarlin\nefl\npeckinpah\ncurtius\nsmylie\nafterglow\nfaun\nhadar\nhurdy\npandan\ntomorrowland\nglasser\ndiatoms\ncollarbone\nvalenciana\nophthalmologists\nentwined\nrotenburg\nalphonsus\nhassell\nveblen\nrsm\natsuko\nm
enander\nots\nwass\nentertains\nhasina\nantipsychotics\npaya\npastorale\nake\noffutt\ndevendra\nredness\nautocracy\ncheckerboard\nnuits\nzondervan\nassize\nangélica\ndok\njaunpur\ngardenia\nsensationalism\nclamped\nblass\ndemobilisation\nsupremely\ntaxonomists\nnagas\nrigour\ndharmendra\npositivity\npwg\nlexicography\nbellagio\nmerridew\ncristiana\nrhyolite\nornata\nburrard\necma\ncraybas\ncrocodilians\naos\ntromp\nnettwerk\nimager\nrajab\nsorsogon\nlougheed\nagr\nunacceptably\ncarpentaria\nbinney\nreasonableness\nconditioners\nlangues\nhomeschooled\ncolla\nchandni\nrecht\nudea\nkac\narchiepiscopal\noutpouring\ndarrel\nbpd\npatrolman\nnhtsa\ngsi\nmeso\ndrb\nopeth\northopedics\narabe\nbroca\nrestaurateurs\nparalytic\nthampi\nmanifestos\nkaen\ncorroboration\npram\nrealists\nlafarge\nracy\njugs\nlapierre\npiemonte\nbrag\nshoestring\ndein\ngriselda\njanissaries\npatronised\nvari\npekan\noscillate\nestadi\nftl\nneverending\nmcclendon\nhazell\nnels\niupui\ndisloyal\novidiu\nscuttle\nplage\nfistful\nxbiz\nstretford\namnesiac\ninnocently\nboehner\nhensel\nmulticolored\nantifungal\nprovocations\nafferent\npucci\nceremonially\norators\njoanie\ndepleting\npersie\ncowgirl\ngarman\nsnicket\nbaltazar\notoko\nlexicographers\ncbp\ntryst\ndaystar\nepiphytic\nalimony\nmoniz\nbruna\nfrascati\nhollyhock\nmodesta\nnickels\nzika\nperuse\nherta\nconservatories\nmembranous\nkennan\nbadri\noccultation\nuas\nmarach\nleffler\nnagesh\nfussball\nsorrell\neagan\nfru\nkaba\nneutralization\ndisinfection\ntsvangirai\npodolsk\ncastroneves\nswaraj\ndiarmuid\nmounties\nhodgins\nconflagration\nsifting\nundertow\nmargarethe\nniamh\nconfetti\nppr\nbackdrops\ndecathlete\nbrookville\ngyumri\ndiomedes\nkaminsky\npeculiarly\nraa\nportas\nvinayak\niot\nfuckin\ncomox\nsamus\nsayyaf\nhage\nrocksteady\nhiguchi\nbeefheart\ncaversham\ncompaction\nmorphin\nsno\nreinstalled\ntransdev\nrennais\ndenigrate\nblamey\nnourished\nkipp\nserrata\nihf\nwos\nkonishi\nsquandered\nparasitism\nresurface\ncommercialize\ncompasses\
nrevenant\nmarjan\nwynette\ntraversal\nromberg\nhistoriographical\ncaird\ntactician\nglazes\nsharpton\ndespotism\nshudder\nraindrops\nreappointment\ncircumflex\ngammon\nantiretroviral\nunionized\nouellette\nmiscommunication\nmynydd\nddos\nfattah\nbrannon\nvictorino\nnygaard\nsubclasses\ninordinate\nbirdland\nkad\ntransducers\nhoyland\nquba\nhamden\ncloyne\npokes\ncomercio\nosteoarthritis\ncarthy\nskua\nfunders\nantun\nmalatya\nsacré\nstrapping\ngumbo\ntehreek\nchauvel\nfazio\nrabbani\ndystopia\ncommenter\ntangerang\nfrère\nmí\nurquiza\nuap\nsloths\nisleworth\nrane\nglinda\nlaurentiis\noberland\npoesia\nanju\nbartle\nbashi\nspry\ndando\nyami\nyz\nconcussions\ncloseted\nuto\nrunge\nxxxvi\nkayo\nunderpowered\nsbi\nbluffton\ncarbuncle\nefa\nlushan\nshacks\nashbrook\nmolnár\nschoen\ndube\nchesnutt\nherediano\nrayman\ninjurious\njewishness\nromina\njudgmental\nbeslan\nparkways\ncosgrave\ncarron\nkropotkin\nhed\narticulates\nmenelaus\ncanwest\nliberman\nmerlyn\njosephs\nshor\nmurphys\nsliven\nofer\ndeviating\nretargeting\npoliced\nredesigning\nromandie\ndistinctiveness\nalytus\nistrian\nzapatista\nabbreviate\nroehampton\niza\nbullfrog\naspelin\nparekh\nvenable\nhaug\nruffo\nchock\nstaton\nwangaratta\nrudin\nfenice\ngardaí\ncherian\npropagandists\nbadia\nhelvetic\nmestalla\nspoonbills\nmagnitogorsk\nhumperdinck\nsarwar\nhalligan\nrehabilitating\nnila\ndungeness\nquam\nupwelling\nspotlights\nharpsichordist\ndrinkwater\nbrindley\ninescapable\nbionicle\npropounded\nglenville\nlalla\ntedesco\nkhoikhoi\nslingsby\nnahr\nfalwell\nquench\neagleton\nappending\ncuxhaven\ndemarcus\nrfm\nkirti\nimap\nbenda\nunctad\nneosho\ntalkative\narjan\ndominator\ntesseract\nrodina\nguk\npreconceived\nmendota\npozzi\nczartoryski\nrippon\ntoppings\nabundances\ntiburon\nwildside\nlooper\nerne\nkalevala\nlisi\nlysenko\ndreamin\nvada\nhypnotist\nkastoria\namphora\nampere\narching\noeil\ncongrats\nsvu\nstenton\ndiscernment\ncontentment\ntroughton\nhospitallers\nmurr\nelmar\nnagendra\njosias\nmisleadingl
y\nquarles\nlwt\nmonumenta\nlix\nrajon\nechols\ndentate\nclassique\nprolongation\nblizzards\nretouched\nerakovic\ngassed\nsosnowiec\nsymon\nsiliguri\ninflight\ndisembarking\naksu\nbotolph\narminius\ndevalued\nlyubov\noswalt\nswitchblade\ntoddy\nvenda\nrefugio\ncapably\nbeerbohm\ncalc\npastes\nmicroarray\npocketed\nadu\nchewbacca\nstagger\ngalba\nnathanson\ndonde\nottumwa\nnewlyn\nshtetl\ngagliardi\nkhana\nkasich\nkanta\ncommending\nnaqshbandi\nnosferatu\nblaxploitation\nstyrene\ncanarias\npatrie\ndwarfed\nrevitalizing\npulver\nayre\nmcguigan\nwilsons\nbalk\ncompensates\nquinine\nmonotonic\ncringe\njelgava\ndaula\nspurned\ndcm\nabsurdist\nkiu\nspinnin\nbreakbeat\nwisp\nstelios\nwillems\ngaede\nhore\nbornean\nschoolmates\nbhiwani\nequipe\nhunza\nseurat\nunadorned\nfaraj\nrossum\nservicio\nelectrifying\nwj\nadelbert\nfeldmann\npso\nhesitates\nfothergill\nfletch\nwankel\nisambard\nbasho\nkhayyam\nbenjamins\nhedvig\ngcvo\nbera\nequalized\nverdugo\nramasamy\nconnoisseurs\nbannock\nbolaños\ncarrer\nnegras\nullevi\ngnosis\naurum\ncommodus\neverquest\nkron\nnyquist\nlibert\nparesh\nréseau\ntvt\nscoured\newood\njagan\nmagnificence\nneem\npepi\nextradite\nyuya\nmallika\ntranscaucasian\ncordes\ngallacher\nsoldat\npinyon\nveneta\nwalkthrough\ntransportable\nladin\ncoffers\nkantian\nsklar\nencroached\ndisavowed\nphill\ndolittle\nsenanayake\nholladay\nfairlight\nbackwaters\nmetrication\nzubin\ngangland\ncrème\nrazzano\ntwi\ncements\ngharafa\nnegi\nwestville\ninit\nallocates\ncapitols\nphlox\nwey\nagu\neinhorn\ntarango\ncomanches\nsalva\nuncontroversially\nembellishment\nmaximally\narchangels\ndestro\nbathwater\nberti\nuptight\nsadf\nstriae\nbushels\njetta\nskelter\ngardena\nhac\ntristis\nswope\ncatena\nfock\nwenders\nhaga\nrohmer\nurinate\nbawdy\ncapitán\nlateef\ntalas\ntbl\nperforations\ngordian\nctx\nmateu\nwalser\ncolobus\nwickedness\nkrystyna\nskyrock\nshortcoming\nglennon\nfalcón\nmemoranda\nbunga\nrescuer\nnishan\nfarquharson\ndefused\npasión\nshrubby\nclaymore\nheathens\nko
ti\nblekinge\nthal\narik\nscrupulous\ndrooping\ncommutes\nhawthorns\nspasm\ntholen\nsecuritate\nspinoffs\ntro\nspanner\nkunio\nfassbinder\nejecting\ndisplaces\npaintbrush\nschwarze\nbibliographer\npetting\nupjohn\nlacus\noverreaction\nscipione\nmidgley\nfidelio\nlaurentius\nvlach\naudra\npediments\npetersfield\nulam\ndisappointments\nascribes\nspringhill\nchorlton\ngameday\ncardiothoracic\nproffered\nqaboos\nnisei\nflannel\nkesh\ninsubstantial\nsmug\nbalmer\ndalkeith\nsalama\nbengalis\nromany\nbarbell\nlimon\ndarbar\nmaly\nbohuslav\nsoga\nutters\nmatija\nndrangheta\nductile\nvals\nalok\nplaning\ntroma\njagadish\noddparents\nhambledon\njolene\nlachine\ndeadpan\nunobtrusive\ndennehy\nmerian\ntellers\ntubules\nmarcie\nallstate\ndrudge\nvaljevo\ntalktalk\ngooden\napotheosis\ncag\npeacekeeper\ntcf\ncrookes\nduodenum\nsubconsciously\njogaila\nsitwell\nsanandaj\nbudden\norto\nparamus\nbugis\ncocke\niki\npavlyuchenkova\ntmp\nhenke\ntrotskyists\npallavi\ngrunts\nmillimetre\ncardinale\ngül\naujourd\nfredy\nfunctionalities\nintruded\ntitration\nrecidivism\nencrusted\njohanson\nstrictures\nmytilene\nhapkido\nmoped\nliskeard\nhirshhorn\nvogler\ntpg\ndeleon\nreva\ndetoxification\nbeachy\ntemuco\nmassless\ndyfed\noxygenated\nculebra\ngenial\nlouw\nlevallois\nakatsuki\nansaldo\nwands\nintermolecular\nachaean\nturgut\nakrotiri\nmarja\noneworld\nclementina\nrosalyn\nstillness\nkosova\nrattlesnakes\npapillae\nbushell\ntransworld\nnovokuznetsk\nbahri\nlito\nbastogne\nhysteresis\nsommerfeld\nrosehill\ngulab\nbutchered\nstifled\nrupiah\nfalkner\nnog\nthankless\nsouthbridge\nbattleford\ninu\ndeimos\nfalkenstein\nmilhouse\ndropdown\nfrancolin\nsncc\nashbury\npentatonic\nrosalia\nscrupulously\ncopperhead\nhaygarth\nalans\ngprs\nmilkweed\ncrappie\nrecharging\nluongo\nclonal\njairo\ncoty\nhendrie\nroerich\ncampbells\nranbir\nnormalize\nklimt\nwebern\nscreeching\nmenton\nunterberger\nspla\ncatt\ncorday\npeay\ngioachino\nquonset\nstrangling\nkurumi\nhypnotized\ndiplomatically\npontifex\nglenvie
w\namniotic\nchuckie\nladybird\nmultiparty\noutrigger\ninsel\ntiburcio\ngargano\nmpr\nacción\nvats\nfirestar\nunforgiving\nmollis\novc\nmakuuchi\ngeez\nwordings\nmelcher\naftenposten\ncopse\nciliary\nprepositional\nshimano\nniwa\nofm\neulalia\nrendezvoused\nforeclosed\nfarida\nshowered\nkorona\nzand\namalgamate\nbreck\nzabaleta\nmarzo\nintermixed\nmutagenesis\nfranchisee\nshitty\nnfs\ncorroborating\nfip\npoacher\nsportswoman\nrtm\nkalu\njardín\nashokan\nrohtak\nhinson\neggman\nvong\necowas\nhednesford\njada\ntelnet\ntransvestite\ngrus\nxuanwu\nwic\ninfomercial\nkohlmann\nflagella\nweisz\ninterprovincial\nsarcophagi\nendow\nkeltner\naktion\naffidavits\nminehead\nummm\ndragutin\nodra\nbrundage\neclecticism\nirrevocable\niwasaki\nxkcd\nhannu\naltiplano\nmanstein\nmolitor\nmafiosi\nlepers\nshiina\nsof\ngsc\ncampestris\npastore\ncotswolds\nnegates\nguna\nsmethwick\ncreel\nvolleys\nbaar\ncanuck\nislamization\ncomms\nvegeta\nestimations\njorhat\nrecieve\nrearmament\nmcreynolds\ndidcot\nridiculing\ncertifies\npassageways\ncopd\nmoated\narchaeopteryx\npreeti\ncoffs\nmultifunctional\njörgen\nmorcha\nindymedia\nmendis\nmsr\ntaxiways\nkhar\nwielder\ngearboxes\nschoolers\ndelved\nexpediency\ncourcy\ntoucan\nchaturvedi\nyugoslavs\nsylvestris\npetzschner\ncolgan\nnitrates\nbaki\norillia\nalvear\nsatsuki\nbreakfasts\nhematite\nmotorcyclists\nindefensible\nformica\ntiene\nbriefe\nsahu\nissac\nkayes\ntulagi\nnsaids\nmatara\nkutty\nfarian\npenrhyn\ndakin\nspira\nboren\nlindstrom\ncollison\ncentimetre\noverpowers\ncalms\npavlos\nmaatschappij\ngiallo\nchipman\nbienville\nrelict\nmujibur\nheb\ntrioxide\nmpv\narto\ndanielsson\nmobius\ntranspires\nhagi\nripa\nbunter\nhelmsley\nnewts\nkauri\nsahni\nlodovico\naspartame\nforebears\nalcántara\nquire\niru\ntestable\nweiler\nmaintainers\nmudra\nnias\ntolland\nharvesters\nnicholl\nrecyclable\ncamilleri\nvlaams\ntoppers\ndred\nabhimanyu\nmedias\ntoyland\ndeluca\nriddim\nraney\nrefraining\nministerio\neshkol\nablett\nsubstratum\nghazals\nbhagalpur\
ntamarin\neinsatzgruppen\nrehashing\naunty\nquien\nmorne\nfaridabad\nkasuga\nammann\ngodless\ncancellara\ntruffle\nprocreation\nsancta\npokhara\nmvv\nequaling\nstenographer\ncongregants\nmuharram\nschomberg\nantelopes\nappleyard\ncueto\nvfds\nforklift\ngilliland\ncoarser\ndisorientation\nmodularity\nagape\nschoolyard\nunderpinned\nvill\nbutorac\nauthorise\nbagnall\ndetested\nploughed\nirwell\nblt\ndiptych\nthespian\nbattersby\nradioisotope\nvitally\nathenry\nmaree\njessen\nstrozzi\nguardiola\nmandurah\npolitica\ninquisitive\nlupino\nnandan\nmoncrieff\npryde\nferric\ngrabowski\nyoshimoto\nigreja\nfroome\nclots\nticked\nseñorita\nbudva\nphraseology\nlazarev\nkalashnikov\npartaking\ndeprecate\njy\ndemarest\nmanzoni\nnieman\nerdmann\nnicolette\npassat\ndisoriented\ncristatus\nerlanger\ngenki\nnocturnes\ncondoned\ntakasaki\ntikhonov\ncopra\ntrine\nwolfenstein\npriestesses\ncostliest\nmahila\nsdr\ndolomites\nwalkley\njakobsen\nminder\nshariah\ncancun\nulvaeus\nbenaud\nskinheads\nkunwar\ntartarus\nunrestrained\naeromedical\npreminger\nporpoises\ncheema\nslipway\nprut\nriata\nmardy\nisar\ngowrie\nepidemiologist\nkien\nschoolmate\ngcr\nsoeda\npermaculture\nboileau\nyeates\neyewall\nhandclaps\nkem\ncaerulea\nbesser\nkarun\nising\nsignora\nrinzai\nkava\neuthanized\nmultiplexes\nanabaptists\namaranth\nmuralitharan\nmeadville\ninfocom\nshoo\nunavailability\nexothermic\nalexandrina\nseneschal\nribose\ninhaling\nraffaello\nsaluting\nhindutva\nsoundness\ndeyoung\nmusial\nfriezes\ngoodbyes\nigarashi\nschellenberg\nstressors\ngarrity\ndiversifying\ncellphones\nporcaro\nscholten\ngra\ndesha\ngurdy\nheadstock\nvam\nhoroscope\nsoper\nplovers\nadverbial\norin\nglbt\nhelio\nmangala\nlyga\nsassafras\nbaluch\natyrau\nasada\nphilippoussis\nholtzman\nlancing\nelca\ntelfer\ncongreso\naang\ndhani\nleena\nsherburne\nparmar\nmamiya\nlda\nlewistown\nlts\nmercurial\ngalleons\nboscawen\nimpaling\nragin\njadhav\nackland\nbasilicas\njoya\nhillis\npotentilla\nuttam\nraden\nbuckling\nmuralists\nhydrazin
e\nunderhanded\ncvn\nfie\nratha\nbrite\nastin\nprobert\nivrea\nmulch\ntoppling\nearner\ndelon\nphilby\nhypermarket\nmalika\noverheated\nwral\nstockyards\ntrigeminal\ncarbo\nsdlp\nsultanates\nnelvana\nkhair\nclerking\nadal\nlamoureux\nreconvened\nessa\nyams\ntaek\ndetlef\nrumsey\nkorfball\nbostwick\ndocker\nairworthy\nlauter\nmisia\njovial\nkanepi\nsolemnity\nfrancaise\nhorticulturist\ncranwell\nronaldinho\nsaybrook\nrecusal\ncrispus\nmyna\nstarlings\nmillipedes\ncogswell\nbanzai\ncontemporanea\nltg\ndiamondback\nagitators\nilbo\nphotographie\nciclista\nbelittle\nwisc\nesb\npadlock\nquando\ncronies\nshalimar\nharith\nindu\nzog\nped\nburkett\nkaji\nmalai\nabutting\nhitchhiking\ncoram\npsh\ncua\nsentance\ntryin\nsinnott\npeyote\nfurred\nunearth\nero\nultramarathon\nforecasters\nburyatia\nmyitkyina\noutfitters\nprospekt\nvelvety\nlangmuir\nsah\nbelden\ndomhnall\ncandia\ntuf\nheise\nscl\nadina\nlaunchpad\nurrutia\nude\ntepper\nnudist\ntutelary\nthistles\nplasmids\nnonhuman\nuninspired\nagnelli\nguises\ntannhäuser\npeniston\npande\nhomily\ndisparaged\ngalea\nmaneuverable\npositivist\nnoreen\nhempel\nkapitan\nrocío\nstaats\ngrameen\njulianna\nplp\nmoreso\nhaikou\nelevates\nstopford\nmonadnock\nancaster\nriz\ninking\ncaricatured\nbluesy\npoèmes\nhayao\nsidhu\ninsides\nmaximizes\nchandi\nexpository\nhanda\nmrsa\nnephrology\nalberton\naegina\naina\nkroon\nunderwriters\nhir\ngasket\nfagin\nacd\nwoolston\nghali\nbolsover\nbogomolov\nushering\nwestbourne\nrearrangements\ntrw\nvocabularies\nentrust\nserdar\ntangut\noreste\nretorts\nkawashima\nfrehley\ndepardieu\nruisseau\numaga\nblubber\nstrother\ntretyakov\ntesoro\npresupposes\nslaveholders\nthwarting\nxxxviii\nnagashima\nmarcantonio\ngunships\ndisreputable\nnyan\nvoyageur\nsighs\ncornfield\nuts\ntimesonline\nbanjul\nhighsmith\ndreamtime\nbatesville\nmarkowitz\nsonghai\nbunyodkor\nscarpa\nreynaud\ngrinch\nholloman\nmaleficent\nmorganatic\naramco\ntranquillity\nedelweiss\ndolenz\nbuoyed\nniedersachsen\ngowen\nalc\nkarlin\nadopter
s\nwednesbury\ninventoried\nlukács\nsomerton\ncoorg\nserially\ncaceres\nrejoicing\nfionn\niconoclastic\norp\ntikrit\nbathers\nkurihara\nouyang\nhake\ngorchymyn\neinarsson\nreactivate\nseptembre\nquadrupled\ndramatizations\nkarunanidhi\nbude\nsmears\nmaza\ntoponymy\nusurping\narithmetical\nvocalion\noperable\nostrov\nkhammam\ncetacean\nunfavorably\nodious\nspv\ncotten\nbeaumaris\nabdelaziz\nschalken\ncarus\ncutout\nconstancy\nschwimmer\ndushevina\nnuuk\nalamance\nsolingen\nmusculature\nsaumur\nslimmer\nqvc\nntu\ntusculum\nheseltine\ndpa\nflávio\nabelard\nmirador\nshrunken\nbemelmans\ncitroen\nraekwon\narsen\napna\nbudgie\nunidirectional\nlemke\nyogic\ndromore\nworkflows\nktla\ngreve\ndeteriorates\nkarimov\ncer\nshoving\ntfg\nfergana\nskirmishing\nmemorization\njigs\ndeaconess\nstrangeness\nhormuz\ntrenitalia\nlade\nzing\nglanced\npeleliu\nvissi\nhammocks\notherworld\nrevamping\nstoica\nlarrabee\njunker\npreaches\nbringer\nundiscussed\nfreefall\nbarakat\ncavell\nbaily\nebi\nhoagy\nboorman\nmoebius\npsn\nbabbar\nmacromolecules\nhallie\nmaintainable\nhainault\nvernier\nkaisha\nmahajan\nvibhushan\nhesperia\nromancing\nsinglet\nwingtip\nclaps\ndiscontinuities\nunobstructed\nseta\nrachelle\nchiefdoms\nkiska\nbussy\nsupersymmetry\nhairstreak\nocho\nkyla\nmeguro\nhaynie\nnewsreels\nbrowner\nmaka\nbleecker\nbrazoria\nmichio\nbarossa\ntetley\nechizen\nmosh\ndutchmen\nburnin\npini\nanastasios\nhasidism\nritmo\nofficeholders\nrobards\nmarsan\nkumaran\ngiardino\newes\narrigo\nsolanki\nsiegmund\ncaraway\nsteroidal\ndovid\nashurst\nwallops\nvlissingen\nstane\nrodeos\nafoot\niliac\nrpa\nkingsmill\nturtledove\narchpriest\noptimisation\nreflectance\nenumerate\nthinned\nhockley\nahrens\nlimbic\ntriviality\nglutamine\ncornett\nsouthwestward\nesd\ngambill\nreformulated\nsna\ncondolence\nunsurpassed\nnaumburg\nean\nderringer\nglaucous\nraffi\nchurchmen\nbrijeg\nlocalisation\nreapply\nerlang\nolusegun\nmotorbikes\npaiva\nrebates\npadstow\nxuanzang\ngatto\nnostril\nbicyclists\nkitsune\nohar
a\nsnook\nhermite\nhoneydew\nmaysville\nmiscarriages\ncobourg\ncarnahan\nstylings\ngrisly\nwille\ninfotainment\nfma\nschlumberger\nmuds\nkumiko\njolo\nsoest\njumbled\nsapphires\nspinout\nsteller\nfurrows\nault\nreinserting\nflac\nslacker\njornal\nanoop\ntimekeeping\nsaharanpur\nspurt\nplaton\nmlg\nsavagely\ndepravity\nyousaf\nmella\ncheckbox\ntomes\ndungan\nvinicius\ndoorn\nthien\nrusset\nwesel\nturenne\ntgs\nbails\nrepositioned\nurinating\nveering\nrefrigerant\nsdsu\ngrits\nkazuhiko\nbellefontaine\nweedon\ndib\nherzberg\nlabia\ningress\nparallelogram\nynez\ntalkers\nstaub\nwieder\nmusicological\nhistological\nquenching\ncrowes\nkix\ndoma\nbenguela\nnewbold\nunaddressed\ntoyah\njvm\ncopernican\ngynt\nnaturist\ncastelnuovo\ncouturier\nalanya\nmehboob\npinchas\nossuary\nihsan\nstabilizes\nclothier\nfoolishly\ncassiopeia\nosx\ndaichi\nintercut\nclogher\naristides\ntwiggy\nstockpiles\nicebergs\nwillpower\ndespina\nreith\nlofoten\nmoriah\nimps\ngradation\nguderian\ncroom\nmgb\ndevant\nunevenly\nmiike\ntagle\nsmelters\nskippy\nlytham\nkaczyński\ntetrahedra\nkamiya\nlazuli\nperchlorate\nnebel\ntidbits\nmuhtar\nseychellois\ncelesta\nnakuru\nsousse\ndiscoloration\nminnehaha\nbeeson\ngeneseo\nyuval\npúblico\ndeposing\nberserker\nmultipliers\nnaina\nkinnaird\nmainframes\nauchinleck\nrenewals\nslush\nrykodisc\nsidonia\nente\njabs\nalamogordo\nrereading\nterrassa\ntomaso\nbachelet\nprotea\nosun\nphysiologists\nhayabusa\npausing\nshs\nwoong\nbachelorette\nlavrov\nbeaverbrook\nmaran\nconservators\nmirjana\ntroublemakers\nsubscribes\nfacetious\ntcc\nlik\nkibaki\nperea\nvadis\nkhoja\nsounder\npatapsco\nsurmounting\nchorister\nalsos\ndayne\nhatted\nmalvinas\nmcphail\nscheveningen\njuliusz\nead\ntribhuvan\njehangir\nmtl\ndupe\nbryne\nvenezuelans\nadélaïde\nfois\nashburnham\nmastroianni\nogdensburg\ntalbott\nautun\ncollina\ndecking\nadmonishment\nmohammadi\nmelli\nxamax\nlube\nreducible\naspera\nupto\nleaped\nsoutham\noctubre\nchon\nmoylan\nreit\nnewmark\ngalvanic\nlupa\nfirewire\nbush
rangers\ncontractually\ncomenius\nchronos\ndeluded\nnusantara\nzoomed\npurley\nmoorehead\nlongacre\nsyrah\neasel\ncritter\ncaligari\nkunlun\ndryness\nsandpipers\npolansky\nguile\nhornaday\nundress\nwarhawk\npern\nharum\nmalate\nklassen\nbravado\nabdurrahman\nveg\ninsecurities\nmontigny\nkhurasan\nnordin\nkristof\nquilting\nwilts\ndwi\ndeandre\namalgamations\nnoosa\ntaubman\ndinu\nskyrocketed\nmalfunctioned\nporterfield\nknighthoods\nyoshimi\nannes\nmvo\nalgerians\nbellerophon\nbetta\nstinking\nberglund\nkiya\nsackler\nannunziata\ndespenser\nculiacán\neccentricities\nzdravko\ndelmarva\nenrollments\ncomisión\nbelkin\npith\nnuked\nstabat\nquitman\ngrégory\nkerguelen\nmideast\nlemond\nbakke\npurine\nconjectural\nrummel\ncobos\nthynne\nacquiesced\nyudhoyono\nmilnes\nmacmurray\nschurz\ndaiei\nhaman\npassy\nfalconi\nplebeians\npss\ngiulietta\nswati\npawel\nwahhabi\nstickler\npanos\njugular\nmulvey\nikebukuro\ndanubio\nalcan\nstampa\nharmonie\nflautists\nleghorn\nwoonsocket\ngargan\nhagman\nmorphy\nweinstock\nvolunteerism\nvelásquez\nfrancorchamps\ntalcott\nbishopsgate\nprunella\nportes\nlustig\nrenuka\nravensbrück\nnongovernmental\ngbagbo\nmouthparts\neula\nween\nincineration\nallardyce\nsudirman\nswarup\nlargs\nmayakovsky\nmelanchthon\nfnc\nbelloc\nratcliff\noif\nehrman\nreimburse\nmantled\nangiosperm\nforan\netr\nalawite\nfareed\nwestboro\npenderecki\nunoriginal\nchirality\ntailless\nmonsoons\nshmona\npelayo\nbabson\nfoulkes\nemulates\ndowntrodden\nberenson\ngrosses\nkrall\nknute\nregicide\nnikitin\nminke\nbirgitta\nbpp\nmowing\ncredibly\nheathcliff\nzulus\nboothe\ntib\nniemann\ninvoices\nmael\nradish\nbioware\ninitio\nsealers\nchaoyang\nisolationist\nszent\ncanongate\nvagus\nmodicum\nmatar\npendants\ncoder\nnacelle\nbrueghel\nwangchuck\nbonde\nflaccus\npopish\ngannet\nmse\nironwork\ndevolve\nironing\nebs\nbeachfront\ngokhale\npwd\ncolombes\nblr\nhird\nsegue\ntarpon\nstagnated\nashutosh\nyh\nmucha\nmalet\nlightening\nbannockburn\nreneged\nexpat\nsportiva\nperuvians\ncong
enial\nunhurt\nvanes\nbarrowman\norganics\natra\nateliers\nhernan\nfrater\nduero\nriptide\nfons\nshingled\nlaborde\nnecrotic\nmorpho\ngodparents\nplatforming\nterni\nilhwa\nmelford\nlibros\ngoodenough\nkersey\nmbr\ncyberbullying\nkalpa\ndruzhba\nroel\nmetzler\npereyra\nlenten\ntollbooth\ndanziger\ntreading\nlangur\nstonyhurst\nleganés\nmicrometres\nmcvie\nopiate\nbaalbek\nteluk\nkawada\nbalin\ngütersloh\ncrb\ntabbed\naoba\npitre\naqa\ngarifuna\nfreeborn\nfuta\nlari\nkedar\nduryea\nlowenstein\nvaqueros\npdas\nshreds\nbootlegging\ncoterminous\nbertelsmann\ncalligraphers\nantonella\ncarrizo\nampang\nsackett\ntapir\nfreeholder\nmoonraker\nkaraj\nboothroyd\nengadget\nmochi\nepub\ntoluene\nbowell\ndinosauria\nsierras\ncolic\nstarkville\narchenemy\nharpe\nbewick\nflintoff\nnationhood\nwinterton\nabarth\nregimens\nmaciel\npaice\nnureyev\nsantamaria\ninbetween\ncoriolanus\nferme\nagitating\nlorber\ndenson\nnff\nspokespersons\nenthralled\npersonae\ncherepovets\nreminisces\ndemotic\nfervently\nfarmingdale\ngir\npattani\ncori\nbyblos\nupshur\neltingh\ngrubbs\nfriedel\npoule\npsychosomatic\neves\nnots\nencampments\nreadied\nzw\nschreyer\ntayler\ntecos\nsimão\nfatimids\ndías\nmicrocontrollers\ndeepdale\niterate\nstatoil\nskeena\nvitaphone\nheyerdahl\naat\ngushue\nalmanach\nseaters\nalluring\npérigord\nunsightly\ngohar\ngeste\nsociete\nakiba\nrazors\noxalate\nnihil\nbashkir\npolikarpov\nmarauding\nbacteriophage\nyongle\nbroady\ncarrión\nkhanum\nhandcrafted\namalric\nvoskhod\nsuzette\ngoud\ndolphy\nmajora\nfuror\nmordred\nhariharan\nnawabs\nvivant\nmoja\nrsd\nmro\nfriedmann\norland\nbethpage\nayckbourn\nnewar\nyoshitaka\nlubomirski\nmeacham\nsakarya\nlectern\nharpy\nesf\nkittery\nfrew\ndumpling\npenne\ngiganteus\nyair\nquedlinburg\nspahn\ndroylsden\npainkillers\nmuharraq\nbrusque\nfolies\nuncharacteristic\ncunninghame\nharvie\npuyallup\nbrinton\nmoret\nbrazo\nterracing\nfudan\ntraditionalism\ndeena\nmaryknoll\nbarretto\ncookware\nharshest\ndiscolor\nkörner\nfico\npulsars\nyuriko\nk
hlong\nkudu\nkmfdm\nrorke\nencarnación\ndisconnection\nkhotan\nvishwa\nbicker\nlagarde\nscoundrel\nflirtation\nanatomic\ncornejo\nmasahiko\ncarinae\npolder\nassemblymen\npatricks\ntig\nconvento\nubaldo\nsandgate\nalki\nburglars\nantipodes\niguaçu\nsnuck\nhildreth\nmcgrady\nshenzhou\ntashi\nmatz\nkavanaugh\natonal\nsaur\nraisonné\nloons\nkinnock\nzavod\ntepco\nearlham\npivoting\ngaughan\njumeirah\nhissing\nfabbri\nhirsh\njoi\nrogerson\nporteous\nhyang\nfeliks\ngatlin\ninjectors\nwasim\nlatrines\ndevos\njubal\nkismet\nforceps\nkatholieke\naccenture\ntrafficker\nvisualisation\nsittingbourne\nhatorah\nescapades\npflp\nitzhak\nzürcher\nsorbus\npriddy\nkarat\nsinop\nstudious\nxenakis\nelectricians\npgk\ngsn\nenacts\nmaile\nvassilis\nmedes\nkhalaf\nharoon\nharty\nprésident\ncodify\nbangabandhu\nincan\nsecularized\nsattler\niditarod\nspouting\nwq\ntimms\nmertz\nvillafranca\nqubit\nbuber\nanura\ndestin\ntigress\nrectus\nbeesley\nlaminate\nmartialed\nseashells\nalitalia\nkapital\nlaika\nmtt\nfenix\njermyn\nrohrbach\nmontiel\njakes\ntenma\nsetters\nriegel\npromissory\nknotts\ninterrogates\ngreenough\npanicum\nnicotinic\nolean\nbalaclava\nquincey\nlangdale\nfolha\ninfamously\nrattray\nplatini\nnariño\njunqueira\nklotz\nalcala\nrelegating\nsachem\npolarised\nharlech\nweyburn\natma\nartaud\nmascagni\nvelella\njuri\nnoma\nbugger\nbatra\nvidar\nmaximization\naklan\ntheocratic\nresonated\ntarkovsky\nheadshot\nhurries\nbuccal\nlindqvist\njaune\nbutlers\naquabats\npacts\ngiray\nsaic\nmrg\nvasilis\nteasers\nhamblin\nsimpkins\nfitzsimons\negotistical\ntork\narpanet\nsakuraba\nconfection\nfurtwängler\nchyna\nephesians\ncipriani\nbananarama\npathologies\ncoddington\ngesù\nkdka\npenryn\njes\nmultilevel\ntypeset\nlantau\ncorden\nazazel\nthresher\nmogensen\nkhattak\naag\nblasco\nflickering\npreys\nregularization\nolena\nyoav\nendearment\nechos\nisak\nvillette\nthusly\nwinstone\npyrolysis\nhibbard\ndelores\nseep\nyounghusband\nkabardino\nwasson\nhabitability\nshaven\nflatiron\nvoest\ncanticle
\nbronchial\nworkday\njao\nkehl\neducations\nlenka\nreassert\narteaga\nyavuz\nsamy\nmurnau\nsapna\nfastener\nmorag\nspf\nseagal\njamais\nmabry\npeony\npwa\nstereophonic\nspeedwell\nstubbornness\nmusics\nstamper\ntranche\naficionado\ngais\nlogroño\npostponing\nkael\npolycarbonate\npolycyclic\nscituate\nmacready\ngrob\nkroeger\nvidor\nstaal\nphaedra\nzyl\nfaience\nlanglands\ncalthorpe\nbowels\nswathes\nlynwood\namyotrophic\ntoca\nrifkin\nosei\nfaune\ncompresses\nnapo\nkreutzmann\nconflate\nyukiko\nlentz\ndiecast\ngarwood\ndisputants\nnypl\nagulhas\ncii\ngiscard\niuris\ncomputability\nmothballed\npersimmon\nlegionaries\naedes\nheflin\nnasrallah\nuppingham\ntradesman\nhephaestus\nburu\nheerlen\nxtc\nnapkin\nconjugal\nbaoding\nzakat\ncloaks\noudin\nmidair\nintrons\ngiglio\ninterventionist\narbiters\nmycorrhizal\noutperform\ncolfer\nbahini\nsepinwall\nrobie\nmiddling\nbackstretch\npetworth\nhulton\nhustlers\naspirant\ncentar\nraffle\ncoquille\nsct\ntada\nnazarbayev\nlipson\njeanine\nbhs\nraichur\naiello\nsepoys\nbraids\nincompressible\nsuwa\nnormanby\nuilleann\nunlink\nhetfield\nerrata\nzydeco\nstatically\nmciver\nerez\ninguinal\nheatley\nmaren\ntroyan\nsquibb\ndelamere\nutsunomiya\nproofread\nrecharged\nwigram\nangora\nenveloping\nhandan\ndashiell\nkelton\nfireballs\npruett\ngeodesics\nzoroaster\nordway\ncastaneda\ncorrêa\ntld\npav\nzakopane\nbtn\nhkg\ndannii\nvamos\nneurologists\nimpermeable\npusey\ntrivium\nbillericay\nroeder\nchieh\ndoghouse\ncriollo\ndammit\nshiromani\nextents\nthome\nrtb\nattractor\nsardine\ntellier\ncruze\ntaupin\ngdi\nsli\nraspberries\nclairvoyant\ncontravene\nirn\nharringay\nhaciendas\nkillaloe\nffb\nnainital\nbip\nsportscasters\neisenstadt\nbask\nguiseley\nlrc\ngraveyards\nchoon\ndogmas\nseiu\ncsg\ngellar\npillboxes\ndoctorow\niwc\nplied\nmolen\nmaladies\nsnorkel\nalleyne\nligatures\npez\nosb\nmso\nkinesiology\ndapper\ngambrel\nvivisection\nbrücke\nnaguib\nkcal\ndivx\nuric\nbaumgarten\nprecipice\neman\nltv\ninternalized\nnaro\ndefecation\nlookal
ike\nphosphoric\nmili\nbethe\ncowgirls\ncaisson\nscratchy\nhinrich\ndouce\nschuckert\nsnorkeling\nheterocyclic\nrennae\nshellac\nincongruous\npadukone\ntransportes\nostrowiec\ndanie\nspiegelman\nmoroccans\nltp\nfotos\nintercede\nkloss\nregress\nhaneda\nkrull\nbivouac\nmrp\npopuli\nlampung\natcc\nnadie\nfleshing\nbellshill\nsowell\nambit\nwhooping\ncanadien\nunraveling\ncomba\nretardant\ncarolla\nkunduz\nfino\nfmc\nkathie\nibom\nshide\nlapham\npuffed\ncavalleria\ntanz\ntmf\nconvertibles\nhovey\ngérald\nskunks\nalgirdas\nhiroaki\nadenocarcinoma\nbathory\npasar\nyx\neirik\ntlr\ntongs\nmartijn\npenises\nbda\ncongrès\njaafar\nkohima\ngaddis\nmcguinty\nskegness\nbcp\nmacinnes\nlacko\nhardt\ninterfacing\nnicest\ncromwellian\nstatuettes\ndoggy\nagustawestland\ndaun\nelucidation\nsheppey\npotrero\ngutta\nnaturals\nskala\nepl\nashington\nabutment\nimperatives\ncaerleon\nrecommenced\nsteelworkers\nsantschi\nduncombe\nblida\nrenderer\nplotinus\nvillosa\nfalter\ndhu\npoprad\nigm\ntooele\nconcoction\npolytheism\nkomen\nbff\nuntill\nraha\nskimmer\noiseaux\nbanna\nbalsa\nminimalistic\nmiddens\nunproduced\nunderclass\nmohali\nfranzen\nuther\nbmd\nirie\nparathyroid\nmegalopolis\ngtc\npuffs\ntola\nfacilitators\ndreamworld\nposton\nramakrishnan\nmaryhill\ndaren\npurpura\naloys\ndhanush\nascetics\ncowlitz\npob\nogaden\nicbms\nmalignancy\ntargum\nquads\nsenghor\nnondenominational\nartsakh\nbeste\nepigraph\ndebaters\nmusketeer\namberg\nunguided\nalaa\nlikening\njahren\nraycom\nmaler\nvestige\nchangzhou\npfeffer\ntalbert\nmoccasin\nwrights\nkadena\ndianna\nalg\nbettencourt\nranda\nrasool\nseele\nkuipers\nmainstays\njaden\nramus\nhelplessness\navaya\nbulaga\narata\nchanghua\ngowdy\nboykin\ndiscontented\nblick\nbhakta\nmanhole\ncapsid\nunguarded\nsnobbish\nuntied\ncastlebar\nteeny\nprats\nmeares\ntacos\negoism\naimée\ngms\nneuropsychology\ncomando\neme\nliquidator\nclaudian\nwerth\nmegara\nlightnin\nstriping\nolathe\nagostinho\nshatters\nkingsville\npreveza\nabducting\npanoramio\nicr\nbuncom
be\nvalkenburg\nbakken\nauthorhouse\nhmp\npraja\npropylene\nclo\nfrightful\nkaku\nespejo\npagliacci\ntaillights\nabramovich\njobson\njeffersons\nshobha\nsandhills\ndalle\nswisher\ntouraine\npinched\ndundonald\nmichail\narbeit\nolhanense\nminis\ncribs\nbaobab\nperipherally\nhouseman\nsidetracked\nanatomists\nsandakan\nkarolinska\nbendis\nwarranting\nnaively\ntortuga\nranfurly\nmeteoric\nsabato\ncabrini\nvoir\npriories\nsubfields\nspyros\ntopsoil\ngani\neek\nclichy\nsassou\nwobbly\nincantations\npronounces\ncámara\nulus\nemplaced\nbric\ndeclarer\nlivesey\npetrochemicals\nadvertorial\naimer\nfreestone\njessore\ncleeve\nmaligned\nfroggy\nantica\nshehu\nctr\nmcgehee\nstrix\ntufa\nsejny\nharbison\ncivita\nmdp\nthunderball\ndepartamento\nintron\nchika\njousting\nmediaworks\nchardin\nsienkiewicz\nusace\nfts\nprabha\nnikolaev\nkeizer\nzany\nshimer\nhoards\nadat\nprefaces\naliso\nnuh\nswordsmen\nliens\nirapuato\nditty\nosho\ntheorizes\ncimino\nrelieves\nbertolucci\noncogene\ncomparably\nannis\njailer\nfanned\nmournful\npendergast\nsambar\nasec\nmikuláš\ndescents\npainterly\nbenford\nsulphate\nrops\nemitters\norbited\npreservatives\nmanzanita\nmaoists\nguanacaste\ncama\nlapeer\ntyburn\ncoccinea\ntrampling\nshapeshifter\ncryin\nacuff\nrhymney\ncostco\ncanarsie\neuronext\ntomoe\ncleanest\nsgr\nmisbehaving\nramsbottom\nlevene\nelric\nresizing\ntroposphere\ngrandee\nhomonym\nneela\niheartradio\nbampton\nsusi\nelva\ncalipers\nstepbrother\nfriendlier\nmarilyns\nconstrictor\nwillibald\nkannon\npalmach\ndelancey\nniazi\nhopelessness\nifr\ndems\ncurbing\nmortlake\nater\ncholine\nzvonimir\nnuria\nfending\ncoons\nmelons\nwithington\njanssens\nstranding\nphenix\nfreezers\nmicheline\ngolfo\nlampooned\njisr\nnazario\nyarder\nsfs\nelphaba\nwearers\npreempt\npayette\nmccombs\nyamasaki\nburgundians\nimamura\nmarketplaces\ncerium\nalanna\nglaciated\ntita\nabuts\nbroadcom\nnumbness\nbabb\narduino\nmisbehaviour\nbrp\nbazin\ndehn\nmuertos\nmisunderstands\nchloë\nimt\nmicrostructure\narabiya\nacca\
ntrix\nbottomless\nintergroup\ntotti\nheimat\nmccarter\ntanger\nfinial\nnamath\netch\nrootes\nspiky\nskylights\nasterisks\npreviewing\nprobus\nrenaldo\nbookshops\npasi\nbeaune\ncrowne\njukes\nkeisha\nupp\nelkton\ndainty\ncimetière\nkandal\nbenedictus\nfeely\nmilliner\nrerecorded\nbinet\nmalam\nsango\nmülheim\nsines\ntuy\npertained\nmoxon\neliciting\nwelbeck\nbushmaster\nmaupin\nrydell\nsegregationist\nmakan\nduckett\nmotti\nbowland\nsubstructure\ndrang\nmorimoto\nseda\naccentuate\npto\nbroads\nnorbury\ngangneung\ncassady\npellew\nshearman\nnutritionist\nnordhausen\nlucullus\nverstappen\ndcf\nrolleston\nrakes\nsonate\nyoichi\nkariya\ncolli\nwyck\nbilliton\nraby\nbeli\nvariably\ndokken\nshippensburg\nuncommonly\nperso\ntravertine\ndíez\nherrington\ndisputable\nsida\nsaddled\npinnate\nromblon\ndykstra\nparcells\nhedy\ncabello\nsaffir\namani\nvaticanus\nstarburst\npiceno\ntoning\nmarinelli\naldred\nklingons\nreincorporated\nhebden\nipr\nnibelungen\nwayang\nwapping\neup\nagartala\npiura\nbobs\nbonet\nfickle\nsecondo\nmoderns\ngollob\ngentil\ncheka\ngook\ntachyon\njobless\nparading\nisbell\nyongsan\nradioed\nskool\nmicrobe\nalcide\nklemperer\ngorka\nserendipity\ncybele\naskin\nsanu\nbreakin\ngrewal\nnoyce\nbeuys\ndeform\nbisson\nsymbolised\nhauraki\ninfosys\nmcchord\ndolgopolov\nlindquist\nsnob\nunopened\nnorthbrook\nsvetozar\ngemayel\nabducts\nwelshpool\nfaribault\nkusanagi\ncoombes\nmòr\nwashers\ncanvey\npln\ncarding\nblalock\nwhiskered\njetstream\nforehand\ncarinthian\nsympathized\nbfd\nskatepark\nsonali\nsieben\ndrummondville\nmetatarsal\ninefficiencies\nghia\ncatterick\njaponicus\nchoppers\nmackaye\nsantangelo\nismay\nrosanne\nalegría\ngalil\nske\npemba\nboxrec\npontoons\nconveniences\nwallpapers\namie\ngoldene\nartichoke\nlegalised\novi\njirga\nburkhard\ndhoom\nvociferous\nstatens\nmckechnie\ncolombians\nswum\nbutting\ninayat\ngroen\nmulgrew\nfrock\nmagar\nlumières\noffsetting\nwintered\nandie\nsaumarez\ndhruv\nsushil\nfermo\nistvan\nboonville\ncornerstones\nromanti
cized\nnewseum\nhowick\nflorentino\nxliii\nsigur\nł\ncichlids\ntrabajo\nzea\npoulin\nspada\nbamyan\nliquors\novertone\nanalyzers\nmascarenhas\npostmortem\nborrego\nsuitcases\nraad\nwantage\nengelhardt\ngonads\ngainer\npolytheistic\npolizei\nfirman\nquarreled\nmelle\nhuet\ndecadal\nheadlong\nzeebrugge\nstx\nfrink\nclacton\npogues\nthreonine\ndevore\nlebaron\ntancredi\nsecularist\ncurrant\nprotrusion\nfmp\nzune\ncastellani\nclarkston\ntarquini\nveni\nellerslie\nnasri\npavol\nvillena\nsashes\nkadri\nblanch\nperennials\niconoclast\nlynde\necp\ndespotic\nkazuhiro\ntwyford\ncaptor\nexemplars\nrsvp\ndialed\nblais\ntrice\nwaid\naliya\nacrimony\nfeint\ncotes\ngicquel\nvana\nmonnet\nevaporative\ngrandstands\njuhl\ncryonics\nwilk\negalitarianism\ngeneralissimo\ncampy\nsomersault\nalmas\npoppe\ntolosa\nlommel\nrunnymede\nyverdon\ngrilling\nnuptial\nwix\nsikes\ntouro\noutre\nkinmen\nmethadone\nholmgren\nkuri\nanushka\nluso\nzutphen\ntyrion\ncraighead\nmoka\nrelive\nflintlock\nunbridled\nkapadia\nengrossing\nterrorize\ndizon\nauditoriums\nshintaro\nmarseillaise\nfbo\nmismatched\nchartist\nbayless\ncarvers\nmukerji\nnyerere\nshira\nzemin\nstudley\ndawning\nhastert\nlewisville\nlooie\ntuscarawas\nsvante\nstirs\nnsr\nsudo\nachebe\nbifida\nouellet\ndidgeridoo\naea\nculverts\nsturluson\nbearden\nexhausts\nfeira\npersonhood\ngarofalo\nkeres\nquietus\nlcm\nbruns\naleksandrov\nhirsuta\ncounterfactual\nsonnenberg\nshula\npocklington\nscooped\nflatbed\nplowed\nlif\nestimators\nmerida\nwarez\nzawahiri\nhuq\ncorda\nspurring\nsiv\nmonumento\nkuba\npachelbel\nindochinese\nushl\ntozer\ngramsci\nrosenzweig\ncne\nlcl\nkhufu\nmusicality\nhirai\nliss\noptima\nttp\nwhist\nchicana\nmoriya\ntelecinco\npcie\numan\nmcgillivray\nstrived\nasper\nrisley\nalphabetized\nblaylock\nnoranda\nfeilding\nhawn\ntalmudist\nadare\nexplaination\ntypecast\nprobed\nclerked\nstipulating\nchameleons\nusga\nnlt\ntnn\ncatatonic\nsobers\nllanview\nburkhardt\nhakata\nnidal\nfeuerbach\nlechner\nrocchi\nmetamaterials\ntatsuo\nc
losets\nchana\ngura\nbitey\nsupple\ncoton\nmandaluyong\nclas\nlyte\ndeutz\nchannelled\ndoboj\nvalance\ndimple\nnivea\ndetaining\nlandward\nnutley\ngemmell\nvandeweghe\nkadam\ncoroners\npangaea\nnilly\nedinburg\nsubtopics\nuncomplicated\ngowan\nwaistcoat\nnld\nracketeer\nbnl\njelle\nfigueira\nré\nnpf\nhijos\nglyphosate\nwmata\nkahl\nanshan\ntabid\nsaveh\ncategorizations\npromulgate\nchartier\npharm\nshue\nneoliberalism\nshreve\naird\nrisorgimento\ntractive\ncheonan\nletcher\ndeprecation\ndeftly\nmaurus\nhalibut\ncomas\nyona\noriol\nroshi\ndyslexic\njacobins\ngci\ndusting\nfunc\niranshahr\nmagnussen\nchowder\nfratton\nrheinmetall\nantero\nkoestler\nfrelinghuysen\nreanimated\nsharaf\nfirmness\nneurologic\ngigabytes\nbiomarker\nunconcerned\nshoko\nretorted\nmccrary\nlalitpur\nbobsled\npequeño\njovanovski\nuncategorized\nphillipe\nbcg\nmgmt\nmugshot\nsoundings\noxbridge\nabdu\nyerkes\ncranbourne\nreiterates\nwanderlust\nmacdonell\noverworld\nikon\ncitv\nintelligences\nsagamore\nglennie\ntalca\nchiao\nsaboteur\nissy\nthiamine\njz\nflatten\ndefensores\nbardic\nhallo\naon\nalin\nfurlan\nrefocused\nmokhtar\nteacup\nsedges\nrecklessness\nassn\nshankly\nstawell\noccident\ncommentated\nhorseshoes\nrado\nfrode\ninfierno\nperon\nneave\nflin\nmystified\nmatchmaking\nbardstown\ncheesecake\nliquefaction\nsedalia\nforme\ngrable\nexacted\nwaddy\ndavros\nstepchildren\nsfio\nwaldhof\nmoberly\nmanes\nfluctuates\nyazidi\ncorti\nretraining\naby\ncromartie\nsaison\ntyga\nantara\nmelamine\ngoalposts\nbolsa\nfick\nhatters\nconfusions\ntaproot\nsyco\nsicard\ncapiz\nromuald\nabsa\nclothesline\ngripe\npaus\nshopped\nhobbits\ncornucopia\nquartering\ntoland\ngena\nguyon\nstiffened\nfreelancers\ncelebrant\nbratz\nraisers\nfermenting\nperitonitis\niola\nradomir\ngajah\nhampering\nstobart\nhcm\ncontributer\nafg\nhockney\nniños\nconsecrate\nholzman\nmistry\nfederica\nwillkie\nredstart\ntauri\nhund\npvr\ndolina\nerratically\npapeete\nmarischal\npetes\nstrafed\nactuary\nsupposes\nbrune\nunstated\nlupita
\nnumero\nradi\nhoulihan\nbreaux\nreferenda\nfestivity\nwab\nprioress\nsveta\nmoonlighting\nnicollet\namager\nsprain\npunky\nprovidencia\nsefid\nsrm\nuniversalists\nigcse\ncinematographic\nlanz\nmorrisons\nchicopee\nceliac\nhirt\nbaytown\ncosas\nsète\njere\nsympathisers\nmelvins\nterrapin\npinner\nxps\ndeferring\nnofx\npeddling\nflatulence\ngide\nsupremo\nkusatsu\nplacards\nlorelai\nsuleyman\nscuttling\nheure\nhasselhoff\nparamagnetic\nrapide\npharos\ngauze\nhosmer\nomid\nsailer\nminx\nreclassification\nmcdaniels\nexcitable\nkorman\nvaidya\nswitchfoot\nhafeez\nreseller\nmycelium\nvania\npsychometric\nnonspecific\nultimates\nbarden\nmeles\npagar\nreviled\ndorsett\nlags\nwam\nignatieff\nmorrisville\nliberace\ntempelhof\npoise\nlacrimal\naruban\nappropriating\nvillefranche\nrisdon\nchalcolithic\nwatermarks\norde\ncobbs\ntolerating\ngoldenberg\nameen\npneumoniae\nmathilda\nresonators\nschenkel\nwertheimer\nimro\noverestimated\nstructuralist\nkettles\nwoodroffe\nknaresborough\nsavigny\ndiarrhoea\nbunyoro\nbitterns\npieced\ntrimet\nturnstile\nboyden\nfrisell\ndamen\nridin\nmima\nlez\nfete\ntrolled\nlotion\npinhead\ndereham\ngille\nkraven\nchok\nbushey\nmegaphone\ntanahashi\ncardin\nnondescript\nboli\ndraupadi\nspey\nladino\nfinbarr\nmagny\ncartons\nbarty\nmemorizing\nmuta\nautódromo\nprophylaxis\nglazunov\npendlebury\nroosendaal\nshabnam\nbrocken\nuncontrollably\ntynemouth\nniort\nfontes\ndroids\nskinks\ntabas\nlipoprotein\nmadhavi\nulcerative\nrepetitious\ndudek\nderon\nasides\nmicrolight\ncongratulates\nanantapur\nbhartiya\ngeographies\nthreesome\nbarras\nwethersfield\ndarkstar\nconfederated\nmetered\nmeunier\nangelico\nilkley\nstruthers\ncrutch\nfichte\nidyll\nbaylis\nfurthers\nmutates\nchlorinated\nposadas\nclosings\naromas\nnagaoka\ntrak\nbrose\nwok\nchavan\nusat\ncomptes\nbreccia\nfédérale\nhandpicked\nlocksmith\ngouache\ndelphic\nidealised\ntizi\nconcretely\nsweetener\nenka\nponti\nprogressivism\nnestlings\ntibbs\nlutes\nenthronement\npicturing\nfennel\nstewarts\ns
candinavians\noboist\nbranigan\nzawisza\nmutya\nkhuda\nkoprivnica\nmcw\njarrell\nxuxa\narsonist\ndzogchen\nneptunian\nelroy\nresiduals\nquackenbush\ndtt\nrudiments\npreemption\ngwendoline\njsr\nrubi\nuncharacteristically\nextraterritorial\npoppa\nmateos\ngottwald\ndemirel\nschönborn\nlivio\natti\nicse\nawlaki\nstragglers\ntelkom\nvoltron\nrethymno\nmopping\necclesiastes\nstarrett\ncontrapuntal\nrollergirls\ndharam\nouija\nbronwyn\nabhi\neiko\ntriumphantly\nsuze\ndeschanel\nhovered\nwalz\nliqueurs\nwatterson\nbarro\nmoga\ncdna\ndishonor\ncsds\necheverría\nformosan\nclaudel\nsandhill\nrozelle\nyeshivat\ninder\nslumdog\nincompletely\nbourdieu\npalettes\nlannoy\nstoudemire\nferdowsi\nharappan\ntuomas\nlongueville\nsmattering\nheadband\nvoda\nbhaskaran\nbarrois\nlarose\nnuncios\nbotham\netowah\nsaal\nmacgillivray\nentercom\nzululand\ntyrannus\nmoche\nschwarzer\nvitriolic\nsilvano\nolongapo\nadelina\nziad\nhither\neran\nkhosla\nrebuilds\nseamanship\ncurio\nbasuki\ndisinherited\nconservatorio\nyaga\ncumin\njaffrey\ngraça\ngeochemical\nhomem\nbuhari\nblevins\nstoryid\ndais\ndoer\nwasco\nnoland\npelotas\nmedline\nshuang\nhemispheric\ntetovo\ngaspare\nrationalisation\nequalize\npeeping\nvítor\nmasterclasses\nittf\nfledging\ngrund\npail\neridani\nkakinada\ndisallowing\nabdou\ninterconnecting\nromita\nsurrogacy\nglycosides\ninterfax\npbr\nleverett\ncrisler\nmcadam\nmeac\nexpletive\nmurasaki\nbinders\nphilomena\nkairouan\npainkiller\npurves\nmtm\nstratospheric\nthermoplastic\nkhaldun\ncaput\nhasmonean\nchirp\nminnows\ntaksim\nremittance\ndisobey\noystercatcher\nkennon\npolytechnical\nnewkirk\nbellefonte\ngomorrah\ngeneralisation\nbackwoods\nwiner\nlubricating\ninterreligious\nradiometric\nmortier\nsob\ndafoe\nbulimia\nlingus\nsilkworm\nshakib\nolinda\nkanako\nberrer\nrspca\ngating\ndevaney\nsmithville\nnguesso\ngorgan\ncaiman\nerdman\nfocusses\naniline\nquacking\nkeiser\nmodelo\nrousse\njordin\nawning\nnwt\nanole\nnikolov\nanke\ngenting\nantipolo\nadvices\ncinemax\nsaaremaa\nruk
mini\nsuffragettes\naphex\nsothern\nlunsford\nmuncy\nnollywood\ncorrelating\nmoorlands\ngou\nselle\nmerseyrail\nbijeljina\nsteptoe\nreynosa\nentendre\ncowries\namisom\nlewy\npersecuting\npettersen\nsturtevant\ntranscaucasia\nmendicant\nerigeron\nbronstein\nnetherworld\npygmies\nyounes\nksa\nkocaeli\nballin\ndegenerates\nrediscover\nbann\nmfg\nsulfides\ntiptree\nconcurs\nmasanori\nrunt\nsaylor\nnaomh\nglonass\noverexpression\nwpix\nbelgica\nagron\nkellman\nfowles\nkadima\nepigram\nwisin\nfalconry\nmapleton\nsetia\nokafor\nkazumi\npersonalization\nkoli\nspetsnaz\ncrawlers\nmalabo\ngusty\ncontaminate\nmckeesport\ndugouts\nvalentia\nneuss\nchouteau\nmolik\nbruiser\naccompli\nholzer\ntattersall\nsoftcore\ncowards\ntaxman\ncameroons\nlibres\nerdem\nwusa\nchihiro\ndlamini\nsprigg\ndarragh\nvelo\nspirito\nsorge\nregretful\nkallis\nplantes\nsumac\nreopens\ncockermouth\nparalegal\ndisassembly\nkadyrov\nsubdividing\nsangakkara\nrádio\nbiscoe\nbelltower\npsf\nflaky\nshere\nimpoundment\nscrapers\nbierce\nlightbulb\npowerline\ntruffles\ndebater\npatter\nvirginis\ncasein\ncarpe\ndingley\nneuropsychological\nvagrants\nwolter\nbldg\nvanua\nnjt\nestados\nbucking\nhieroglyphics\nsimoni\nmctaggart\nsga\nclampett\nfussy\nswatch\nmader\ndribble\nusma\nsandon\nstenhouse\norchestrate\nwitcher\nsymes\nkotla\nneelam\nstigmatized\nshowmanship\nsmi\nagrigento\nschatten\nvanzetti\ncouncilmember\naiadmk\ndita\ncittadella\nmurtaza\nrevilla\nsenussi\ngcl\notros\nrotherhithe\ncadman\noncologist\nsoham\nescudo\nbarbarous\ncolorist\npoc\nmafioso\nhsl\nfernie\nkurunegala\nrecurved\ngsfc\nminar\ncapon\ncatesby\npropos\nismailis\nschutzstaffel\nwyse\ndejected\nmercyhurst\nmatei\nidm\nkayser\npasserby\nfairclough\ngeauga\nsublette\nfinnigan\nasana\nmajin\nkretschmer\nidate\nuehara\nringmaster\ncomandante\njocks\nnordrhein\nlfp\nshikai\ntehachapi\nafghani\ngreencastle\nmundus\nvirginiana\ndioxin\nsunspots\ncarley\nsaari\nzc\nerrands\nega\nsanyal\nobelisks\nprivat\nsylvian\nwaris\nmujahid\nrsv\nnotching\nm
eghna\nahrar\nkrum\nchiropractors\ndrainages\nmarky\ntenfold\ntrentham\nkootenai\nraimondi\nsybille\ngreenlight\nbugged\nfakhr\nschaal\nddp\ncranked\niifa\nuniversalis\nsehgal\nkage\ngohan\ngosselin\nkonyaspor\neupen\nzinoviev\npuffer\nleasehold\nattuned\nraghav\npirandello\nraigad\nattentional\nrosey\nannabella\nlonglisted\nmorang\nweehawken\nsavchuk\nstuarts\noverwinter\npronouncement\nsixx\nsanaa\njls\nprofil\nhele\nkhatri\nsmarts\ndizzee\ngullikson\ndistrustful\npetronius\ntarred\ndetox\nundercurrent\nforney\nshuman\nturlough\nblondel\nluminescence\nhalmahera\ndzong\nlimosa\nphilander\narrieta\nrola\nyoshinobu\nstarrer\nrubidium\nweaned\ntelekinetic\ndesiderius\nanteater\nliban\ninhale\nlemoyne\ntrilateral\nsnatches\ndownforce\nkalmyk\nplex\nnimmo\nmackillop\nvaal\naldgate\nlans\nkessinger\nunhappily\njoventut\njeffs\ngigantes\nshama\nalim\npartei\namputations\nballa\nfactoids\ndrywall\npyjamas\ngibney\nmillstones\nplanking\nratel\ntuberous\nunpaired\npry\nyekaterina\nwestermann\ncadw\nhovers\ngoddamn\nmkd\ninborn\nbarnacles\nbauchi\nharas\nyugo\nprecast\ncuéllar\nmaman\ntestator\nurdaneta\nhobhouse\nsolidifying\nsummerland\nsprouted\nrodwell\nidar\nevades\nlitmus\nsenescence\ntmd\nmiddleman\naint\nlunda\nmanis\nteasdale\nmortification\nzebulon\narchivo\nfréjus\nswarbrick\ndocomo\nsekhar\nlynden\nmccreery\nchouinard\ncrayons\nsignum\nyevhen\npeele\napropos\nopines\nkovalenko\nfalsify\nbillups\ntustin\nlarkspur\nmightiest\nkrautrock\nsilencer\nseigenthaler\ntoyoda\nchiming\nhsa\nlapped\nspecialism\ndammam\nadornment\nburly\nexhumation\nbevis\ndalkey\nvalentines\nsunan\nspann\nhangers\nexpound\nsoldered\nhito\ncinerama\nintramuros\nizquierdo\nmaltreatment\ndemersal\ndoyen\nbenbow\nbub\nunm\nbanqueting\npuntarenas\nananta\nbluegill\nghostwriter\narius\nmontaña\ntej\ngse\nomdurman\nfrown\ncourtesans\nzico\ntianhe\nrubs\nmien\ntendrils\nwisest\nanette\nsikander\nfarscape\nhistoricism\nenvisioning\ngren\nstackhouse\noblates\nflexing\nbelligerents\nrossington\nlep\ngle
an\naparna\nrepairman\nhenge\nulises\nodakyu\nbestiality\nensigns\nreconnected\nsmothered\nkfor\nkostka\ncotterill\nsubjection\nharrold\njad\nmatagorda\nbakugan\nhasakah\neitan\nkosciusko\nuti\narenberg\nperse\nfornication\ndoctored\ngeld\npanicles\ngaskin\nkibbutzim\nnesta\nspinnaker\nnares\nbobruisk\njagat\nchristof\naérospatiale\nbargains\nwirtz\nzygote\nerykah\nlawry\nsneaker\niiia\ndecimus\nsavi\nnadeau\nloney\nputouts\nunfeasible\ncometh\nknievel\nvalkyries\nshippers\ntranssexuals\ncentripetal\nerkki\npectoralis\ninflections\nimperialists\nmotivator\nmisato\nprotrudes\ngarmin\nmasques\nlukoil\nmalle\nschlitz\nferrets\nmalda\nmarj\nooze\nfayed\nuris\nwalmsley\ncrushers\nkipper\ncombed\nsinjar\nintruding\nnewfield\ngibran\nxxxix\nmuniz\nutsa\nwilkesboro\npenciled\nintravenously\ntehuantepec\ntwang\nrapidity\nargentinas\npankow\nguyot\ngratiot\ndiuretic\njankowski\nmarray\nvalidates\nharmsworth\ndugald\ndisa\nirrelevance\nlandover\ntheorize\nfractionation\nalessi\nmidnapore\nwoden\ngergely\nmiran\nreisman\nbrigada\ngourlay\ncaracciolo\nalcove\ntriana\nsteno\ncoupland\nedmonson\nkayserispor\nbexhill\npaltry\njoystiq\nfaltering\nkurri\nbisset\npadi\nesotericism\nhilltoppers\nhomoerotic\nsenta\nvampiric\ncantabile\nhadad\nsilences\nstylists\npelli\njadavpur\nhbs\npisco\ndihydrogen\ncomplainants\npwc\nisosceles\nruffed\nrigdon\nslays\namarnath\nrenmin\nrehired\noncorhynchus\ncommends\nrehovot\narsenals\nbaudin\nratko\nstroma\nsoundsystem\nshylock\nbroyles\nilham\npioline\nnahi\nsoutherner\ncirebon\ncibber\nreefer\nfeatureless\nsundara\nsalvi\nfillet\nstraightaway\npronto\nsfo\nunhindered\nyoutuber\nrubella\nopenweight\njx\nsoba\ncarmo\nunga\nhouseboat\nchordal\ncloistered\nuncomfortably\njaney\njobe\ndepraved\niarc\nnishikawa\nstraights\nelwyn\ndroop\nunquestioned\nteemu\nvina\ntrialist\njayasuriya\nungur\nreaping\nsentries\ntrike\nmetallurgist\nrampaging\nunpainted\nreynold\nshallows\ntraum\narcata\nnigar\ngroban\ntankard\nherpetology\ngarners\niscariot\nxavi\nrias\
ngunawan\ncouric\ngravis\nborah\nsables\ncdad\ncitta\ngasworks\ntessier\npizzicato\nambiente\nkolomna\ntilda\nalgo\nrosin\ncolas\ncardenal\npagano\nkhedive\nfloodgates\ntissot\nnudibranchs\ntodt\npragmatics\ncattolica\nshortfalls\nwestfall\nahora\nguanaco\nborodino\nperitoneal\noxygenation\nthame\nicac\nbrandishing\nreappraisal\nwinsford\nswaths\nmarae\nbruun\nwack\nchopsticks\nheritability\nreda\nhsun\ngiacometti\ndever\ngollum\nyeshe\ndownlink\ndinars\nfontane\nmazzini\nemsworth\nvelenje\nnalini\nvassallo\ncloverdale\nsaini\nepiphone\ntheatrics\nkarine\neponyms\nsagittal\ncarmack\nsclc\ntranspersonal\ntollemache\ncoolers\ndelonge\ngromov\npyrenean\nblancpain\nkumble\npetersson\ntugboats\nmarz\nrebuffs\nindenture\ncte\nmelos\ndawe\njésus\nkennels\nbawa\nlengthwise\ncarin\nhaussmann\nnortholt\ndawid\njezreel\ncros\nzhukovsky\nmafic\nbme\nyelp\nmurano\nzárate\nmalbork\nmuto\npettibone\npageviews\ncommandeurs\ngekko\nhiroshige\nbelichick\nvelu\nwallasey\nceballos\nforecaster\nljungberg\nabounds\nchiari\nmerci\nredecorated\ngirardot\ncaisse\ndunder\nkennewick\njönsson\nmarquand\nbausch\nridding\nmarchi\nmannion\nmichaelson\nministering\nlamarche\nwatan\ntonks\niie\ncamelia\nlemony\nvergil\ndomodedovo\njanitors\nutr\nhoyo\njonze\nmarios\nthapar\nshilton\nponytail\nrogen\nlumiere\nlunga\nchron\ndensest\nbova\ntransnistrian\nretrospectives\nmapai\njameel\nfauntleroy\nbunin\nsibilant\nlongshot\nmook\nmenudo\nneunkirchen\nbaltics\nfeelgood\nalamitos\nlippmann\nchanute\nvandalia\naranjuez\nmoyo\nfehr\ndells\nadrianne\nattenuata\nhedonistic\nsuisun\nmagnetometer\nmoria\ntingling\nphe\ndentata\nmovistar\nsnug\njutting\nscalpel\nvaria\nchakwal\ndamiani\nbibliophile\nscd\npieris\nbrokeback\ndacca\ncorrine\nordos\nlamport\ndepositions\ncraniofacial\nedrich\ncygni\ndilworth\ncatharsis\ncircuitous\nlalu\nmln\nrya\nquasars\neggert\noffload\nshapeshifters\nhortus\nsapo\nvolpi\neschatological\nmunday\ninsula\nkajal\ngruen\nvélodrome\nminutely\nboman\ncobh\nassuredly\nnoy\nroadsides\nn
ido\nblobs\nkrona\nwelter\nmayaguez\niuniverse\nsplashing\nhern\nsweetly\nkingsbridge\nbrundle\nholography\nhashish\npuglia\npolley\ntyree\nnaum\ndieudonné\nwoy\nichabod\ndeighton\nfratelli\nwhelks\narmida\nbladen\nbcb\nrafinesque\nbefell\najahn\ndemeanour\nkremenchuk\nlongbow\nmarinated\nroused\nsinden\nnivalis\nevaristo\nspectrograph\ngoch\nmiddleburg\ngeral\nflanges\nrgs\npostgrad\nflory\nmilstein\nepigraphic\nsharda\nlector\nthenceforth\nmemmingen\nlegitimized\nmccaughey\nprata\ncorvo\ncontinuations\nshush\nnahar\ngivat\nneurosciences\nflann\nhumbled\narvin\nmosel\nalby\ncarberry\nemr\ncrisps\nstormers\nweizsäcker\ncien\nransomed\ntancredo\ncavernous\nalcorcón\nkilligrew\nguyed\ntalwar\nélan\npetropavlovsk\nxang\nanimaniacs\nphillipsburg\ndodoma\nevidentiary\nschembechler\nfacelifted\nnyx\nshoup\ngreentown\ntaxila\nadirondacks\nsait\neachother\nwhincup\nsadism\ntuxtla\ngoldfinch\npretrial\npistorius\nkunda\ntriunfo\nridgeline\ndogfish\nlangen\nforetz\nstoltz\nsterility\nkrasner\ncubicle\nyoutubers\ntamarack\nthorold\nchamba\nsinfield\ncosette\ncockle\nansan\njumpstart\nwhitson\njihadists\nyancy\nkaskaskia\nwrestles\ntouting\nkanak\novershot\nramenskoye\nlazzaro\ntortosa\ndowners\nmaschera\ncomunista\nlaff\nfrentzen\nbardi\naksel\ndelinked\niwf\ndufresne\nrdp\nschacht\nconfiguring\n■\nenviro\nwolstenholme\nntl\naggarwal\ntindall\ntempel\ngameshow\nimpartially\nmendelian\njadid\nsixtieth\nbotvinnik\nstritch\nuntapped\nhominids\nhooley\njunoon\ntrabajadores\nninomiya\nmacrophage\nreligio\nblueberries\nprequels\nbusking\ndopaminergic\nrbd\ncircumvention\ngambrinus\nallay\nseagram\nboscombe\nkrim\nencapsulate\nmirada\ncinemagic\npratibha\nchevelle\npalanca\ntidings\niwaki\nhousman\nmomoko\npharmacologist\nangiography\nbasha\ndissociate\ncorbeil\ncrayola\nlambie\nnishida\npontificia\nnamie\nmowry\nhellraiser\nchuy\njourdain\ndoki\npersecute\nfiz\nclausewitz\ntableland\nsalalah\nmarcuse\nsiskel\nchandrika\naggregations\npathfinders\nexclaiming\nunreachable\nendangermen
t\nbouvet\nargh\njetstar\nmobo\npolyp\nomri\ncassatt\ntransitway\nuinta\nbums\npersonifications\npent\ngolconda\nkimmy\nhinman\nbyd\nunifil\nclung\nstander\nsurin\nscheepers\nbiko\npoliomyelitis\nnua\nkrafft\nwingless\nlittlehampton\nflagellum\nirène\njanta\nmidsection\nminimising\nrighting\nfairyland\nmedill\ntroppo\nglows\npili\ngillman\nvevey\npyo\nshootouts\nliddle\npraça\nexd\ndatasheet\npergola\nsleepwalking\nprecipitates\nices\nundefended\nseibel\nselsey\ngardel\nwillemstad\nadjudicator\nlongish\npanchen\ndumpty\nrelatable\naccreditations\nunfunny\ncaitlyn\ndisciplinarian\nrajaram\npassé\naeruginosa\ncasebook\nagc\ncytosine\nxlv\ndecal\nsmelled\nnurul\nlivable\nossining\npantanal\nsalesforce\nnbb\ndaws\nqos\ngalliano\nafrikaners\nchronometer\nviljoen\nwiper\nbimal\nlamberto\nkenobi\nzork\ngourds\nglycoproteins\nenescu\ndrawdown\ntraktor\nnyborg\nshweta\ntobe\nproterozoic\nverrill\nynys\nautogyro\ngangetic\njailbreak\ncomplacency\nrejections\nrevisionists\nredeployment\ndiscours\ncastille\nridged\ntisha\nstreaking\npylori\nviento\ncrevice\njedediah\nevermore\ngenteel\nwarpath\neutelsat\nartesia\nkonstantinovich\npedrosa\ncbeebies\nrial\ntutto\nrevolutionize\nadios\ntajiks\nmcduff\ntoshack\nryong\nattwood\ndbl\nragnhild\ndecorators\nhadera\nsabi\nflav\nmutinies\nimplacable\nantithetical\nbowerman\nsikasso\nterranova\nairmobile\nondrej\nwatashi\ndahan\npredilection\nkob\nplacido\nblockades\nmccullum\nmobilise\nmeerkat\ndoux\nboosey\nungainly\nplagiarised\ntantrums\ncharney\nandris\nnecker\norpington\nfourfold\nproofreader\nobscenities\nsmee\nalbina\nnarváez\nceauşescu\nirrationality\ntelescoping\nduvivier\nentrée\ndogra\nwiegand\nhonegger\nsaru\nmockup\nunfocused\npian\nalida\nwittig\nrefueled\nlortel\nautomating\nattainted\nairpark\ndiggle\nminchin\nherzen\njosquin\nsobral\ntetralogy\nbalochi\ndop\nanesthesiology\nroosts\ncaney\ntexcoco\nbethell\nscreed\njujutsu\nbraidwood\nrov\nleadbeater\nmero\ntivo\nmystère\nmousetrap\notwock\nramzi\nfurore\npiazzolla\nlozad
a\nvaliantly\ngriese\nucsc\nmirth\namericus\noeuvres\ngavel\nguerreros\nlampe\npasts\nletterpress\ndogger\nrumah\nthoroughness\nsinkholes\nsiltstone\ntni\nredefinition\nchapple\nrava\nembalming\nhunk\nthacher\npeshmerga\ncommentating\nkoç\nskelly\nkitagawa\npoore\nkhoda\nhoole\ngamera\nfassbender\nkerb\nbatalla\nspano\nfreyja\nube\nlunde\ninflows\ncair\nunnikrishnan\ndodgeball\nlesueur\nguarantor\nsentience\nmcanally\nlabouring\nkapiti\neurohockey\nkaraganda\ndépartements\nokey\ndingoes\nimpreza\nlela\nalibaba\npua\nrollie\nmalfeasance\nkovalev\nalpe\ncilento\nfundação\nabyssal\nsehwag\ntibial\nchambliss\nbojana\nbonhoeffer\nlevying\nrafale\nhsiang\ntunable\nfeatherston\ntwee\nmarkstein\nfoulis\nintermountain\nchunma\nloony\ndisjunction\ncasillas\nconvexity\ntubal\nditching\ndialling\nfilippi\ntauern\nwebisodes\nkocher\njagir\nlfc\ngroote\nbluefish\nfeingold\nsumi\nscoundrels\ntrc\nforearms\nnauseum\ncognizant\nopportunist\nmannarino\nhabitations\ndth\nbloodshot\nlloydminster\nmitsuo\ndiffuser\nrochambeau\nnock\nknave\njory\nestevez\nurmila\nsaucy\nheartedly\ntsx\nbycatch\nyoghurt\nwarthog\nlakhimpur\nswv\nligure\nburyat\nanalogously\npanch\ndewayne\ncathedra\nsurrealists\ngravina\nepigraphy\npoonam\nmcnutt\nyeoh\ngilboa\nunattainable\ncodd\niberoamericana\ntoews\nquantifiable\ngirardi\ncosima\nmostra\nwristwatch\nmerr\nmarwari\nbaw\ndecibels\ngiovani\nakbari\nburbage\npandava\nfourie\ndissenter\normiston\nannotate\nphoney\nbusses\nnorthallerton\nglans\nyatai\nkonitz\nmethodically\nshinagawa\ngolovin\ngair\nxlvii\nabet\nbélanger\nvarennes\nhathor\nsupercross\nimprimatur\nnsp\nparsecs\nbluefin\ncallback\ntwisters\nseaquest\nlugger\ntawa\nrmx\nserengeti\ntonnerre\nquem\ncrippen\ntaa\nlethality\navr\ntraylor\nbossi\nhdp\ntannin\nscheckter\narjen\nhic\nursinus\npinsk\ntunguska\nherculaneum\nwalvis\ncnd\nribas\ninaugurate\nstormfront\neudora\ncultic\nbiomedicine\nbermudez\ngossamer\nnamm\nnass\nfalange\nmeaty\ncil\ngranit\ngenocides\ngstaad\nssu\nfeuer\ncontraindicated\n
gombe\nlupine\nmadani\nhumanitarians\nkunar\nmutualism\nchafee\nhafner\nclamping\nméditerranée\nowusu\nbabette\npinehurst\ndiener\nclattenburg\ndissociated\nseixas\nhallé\ninsead\nbellew\nintrepidity\nimagen\nruna\ntripadvisor\nthinkpad\nrookery\nclubbed\nmemon\nsouness\nhoag\nsayyed\normerod\nshuler\ndcp\ntimah\nocelot\nwalkman\nwalgreens\novum\nmuhsin\nthevar\nsquids\nsamu\nmusashino\nviaggio\nfolketing\nhosking\nvigna\ntriplane\njasenovac\nbabangida\npostgresql\ngracile\nendocytosis\nroadkill\nmurine\ncelica\nljubomir\narcangelo\nmcgurk\nbungle\nrapped\nkwiatkowski\ntotems\nraum\nkiruna\nisin\nuke\nincisions\ncantt\nbiya\ncambodians\nreintroducing\ntins\ndenney\npicador\nspe\navs\npatois\ncriminologist\ndace\nuol\nsuriya\nperplexing\njamiat\nlaces\npiya\nemea\nherlihy\nhmrc\nflatness\nfriz\nstudd\nalopecia\nnarrations\nkph\nmiso\ntullius\ncocking\nmarins\nafterschool\ncapper\nsev\npiranhas\nauteurs\nniet\nfreitag\ncoosa\ntoffee\nbeauclerk\nneverwinter\nfootloose\ncookman\nmorons\npook\nunpunished\nstimpson\ngutters\nfaggot\nforegone\nhyannis\ntynecastle\nmodulates\ngabel\nandalucia\nosipov\npuli\neguchi\nfarnworth\njustina\ncrips\narche\nbuenavista\nmicrobiological\ninventiveness\nfrancophones\nsolly\nmachias\ncaliper\nsearcher\nurvashi\nbrandão\ninvulnerability\nquash\neurostat\ncaravelle\nroping\ndinka\nunplayable\nequivalency\nfwa\nconnotes\naveling\ninstilling\nlso\nschwann\nlumet\nbegotten\ngrouchy\ntamu\ninfective\nsepik\nxxxvii\nizmit\nkevan\ngiddy\ncontigo\nupr\nrfp\nsabir\nmayport\nwss\nkinki\nlongitudes\nbbfc\nderisive\nleva\ntello\npsychedelics\nhavasu\nsergeyev\nmakar\nworkbook\nparticiples\ncoit\narago\nlynchings\nkeo\nbice\nbano\nquantifier\nstooge\nnueces\nlandrieu\nmiraj\nwallowa\nleonis\nnobuyuki\nstriver\ninsinuate\npreclinical\nmagmatic\ncaillat\ncaccia\ndeflecting\nempted\ndunston\nkabaka\npolyphemus\ndrakensberg\nbando\npema\nriverbanks\ninterlock\nkesselring\nbelper\ninterdenominational\neldredge\nmaku\ntrias\nkriek\nkeef\ncrosscountry\nmolo
ch\nagriculturist\nvandalistic\npuyi\nsheared\nfisch\nscrubbed\nmorbius\nahmadu\nachiever\npatanjali\nuninstall\nmerrily\ndavor\ntelegraaf\norono\nfrizzell\nshimoda\nnaturalisation\nchingford\ncristea\npetrobras\nradiologist\nbowlby\ngodwit\ntatishvili\nyorkist\nratt\nnls\nsawed\nlebens\ntorricelli\niqaluit\nabacha\nrecasting\nschouten\nsituationist\nefficacious\npercolation\ncarpathia\ntrotters\nmannequins\nparticulates\nlankans\ncolorized\nsrna\nsalat\nbercy\ntarver\nimmunologist\nboogaloo\nleman\nmeistersinger\nduncker\nporth\nsubaltern\ncarballo\nspotswood\neasa\nprostaglandin\ntew\nhcc\nbirbhum\nhaplotypes\nneubauer\nmillidge\ntice\nmovimento\nkiro\nchristodoulou\nantonello\notomo\nflexed\nbaits\nfortuitous\nbrotherhoods\ndissecting\njoc\narum\npcd\ndwan\nbuscemi\ntanned\nelectronegativity\nsimha\ncamshafts\ncottontail\nweo\ndabrowski\ngranule\ndebauchery\nchiller\nfantasma\nlisette\ntimbales\ngulu\nsce\nfrelimo\nirby\nrsk\nmeadowbrook\ndioscorea\nbaldi\ngaithersburg\nfennelly\nburdette\nfunes\nfledermaus\nmahdist\npikeville\nbeiderbecke\nquetzal\nmccarron\nsutil\ntevez\ngoossens\ncva\npaszek\njamshid\nuprated\nunhinged\noms\ncartoonish\nmahfouz\ndeok\nmckernan\nretaliates\nglynne\ncasado\nhorwich\nfresher\nglasnost\ncampeones\njunco\nunsworth\nfinality\nejector\ndimly\nboyes\nunearned\nsubjugate\nkirkwall\ntramcars\nclea\nekaterinburg\nchari\nengelbrecht\nchom\nsaplings\nquickie\nmog\nscab\nmegaton\nallosaurus\njyothi\nblodgett\ngoliad\nalbatrosses\nlerman\ncheshunt\ncouncilwoman\nmelancholia\nsisley\nappraiser\nguajira\naisin\nwilliamite\niplayer\nbuxtehude\ntics\nfaut\nscheffer\npuritanism\npearsall\nwhittlesey\nintermingled\nnoguera\nrhona\noberstdorf\nclin\nanan\nbulgar\nstegall\npetrosyan\nalexandrovna\nplosive\nghc\nfryderyk\nareva\ncephalic\nmishandling\nweidman\nkein\nimmunotherapy\nburckhardt\nbujumbura\nanvers\nfrcp\nannul\npless\nplies\nyount\npastureland\nstolz\nsceptics\ntelex\ntyme\nlattimore\nsynch\nniclas\ncress\nbrannan\nepson\nfortean\napprai
sals\nstepsister\nantonie\nlactobacillus\nrouth\nweightless\nbreaths\ncrunk\nbns\nsojourner\nrookwood\ncircumnavigate\nrecaps\nipanema\nblis\nmasterclass\nbluewater\ncento\nepicentre\nlanguishing\ngrilles\nsenkaku\ncationic\nheadhunters\nbhargava\nchim\njoslyn\nbourdais\nmonopolistic\ndivinities\nbayh\njansz\nchlorides\nheyden\nrostropovich\nshandy\nwisner\narema\nnrt\nanam\naldabra\nwinnetka\nbuffered\nchangeup\nmcconaughey\nbozorg\njonker\napprobation\nandresen\nkareena\nravenhill\ntolna\nkickin\nvaporization\nkaif\nreconstitution\nzarya\nbandara\nkeillor\nbeim\nflavian\nmitsuko\ndror\nanthemic\nlugs\nhonk\nlinc\nume\nmccown\nholgate\nmurtagh\nnewsstands\nrustin\nschöne\nadamo\nmatabele\nvardon\noverhangs\ninnisfail\nlevites\nregalis\njamaal\nskirted\nadorning\nmeilleur\nmirchi\nkristoff\nhsiung\nayan\neatery\nplon\nphysiologically\nnewsarama\nthracians\ntauro\nurartu\nwuchang\nenv\nayla\nprotists\ndecimation\nziv\nsubtilis\nsaddest\nspyridon\nohlone\nmusick\nemerick\npartridges\njoof\njarno\ncallander\ntomomi\nlogitech\nruhollah\ndaddies\nprunes\nabri\nabbesses\nnajran\nghaznavid\ndoodles\nsamrat\nkathrin\nnightmarish\nbadfinger\nostentatious\nhoople\ndeduct\nhsm\ngrotius\ngaijin\ngusev\nuj\nbaila\nthiru\nborman\nspaceflights\nbermingham\npatina\nunrecognizable\ntittle\nconversant\ncontaminating\naching\nresell\nfroth\nerythema\nbiometrics\nurso\nmadejski\nartifice\nwhittemore\nserafin\nloraine\nquills\ntreads\ntootsie\njuntas\nparacelsus\nlacunae\nrevisits\nnutting\nstealthy\ninvocations\nhoshiarpur\ntransgressive\nslings\ngfc\nliftoff\ncomelec\ntci\nunitarianism\nsawant\nwaltzing\nerato\nespoir\ntonge\nfanfic\nchessboard\nsneezing\noiled\njughead\nmagan\ngalant\nbodkin\nkdp\nteatr\normskirk\nobp\nbhabha\nsetar\npillay\nglushko\ntanith\narriaga\ncyclase\nkirksville\nitanium\nmandell\nsoling\nseptentrionalis\noctopuses\ndryas\nliew\nsubglacial\nwallkill\ncorunna\ntruong\ncouldnt\nnothings\nnormalised\norgs\nmcmillen\nsuperoxide\nshifters\nslg\ncomplementarity\nle
ukocyte\nrainmaker\nloomed\nantisense\ngoel\ncalkins\nkermode\nginepri\neci\nbilirubin\nedgard\ngonda\nnexis\nkhalistan\nmowers\ntena\nmazembe\nbenguet\nreestablishment\nslayton\nderick\nmycena\nnwr\nfolksongs\nzarand\ngimli\ndobra\nkori\nstringfellow\nwriteup\naisa\nldr\nmossi\nwheelbarrow\nanp\nastrobiology\nwindus\nelista\nfinales\nlegionnaire\nswinger\nweatherby\ndreamgirls\ndeuxième\nhijra\npippen\nrifting\nnorthup\nchasseur\nmacalister\ncranach\ncavitation\njambi\nbrathwaite\njanikowski\nmicrobiologists\nimpedes\naverell\nbhola\ndox\nadena\nkaruna\nmujib\nvirginal\nsoundstage\nhenne\nhammurabi\nwaterston\npyrophosphate\nrathod\nbethanie\nfacings\nmousse\nalmonte\ngreenwell\nkraj\napplebaum\nvinogradov\npager\nbowring\noverexposed\namericanization\nglycosylation\nkatniss\nadcc\nhitfix\ndowding\ncrespi\nmarshlands\nbenatar\nhoad\ngauged\ngholam\nostrander\nashi\nthieme\nwallflower\nrso\nwhitehurst\nanaïs\nardagh\nwail\nconstrue\nbobbin\nmuff\ngavaskar\nsafra\nbrachial\nburghausen\nloggerhead\nunsteady\nilla\nunfiltered\nsoapy\npineapples\nhadiths\nsaiga\ndigitisation\nfarrukh\nhorna\nfacsimiles\nwrekin\nwhitcombe\ntremble\nmeridionalis\nslopestyle\nddd\nschock\nhashes\nimprinting\nnorrland\nmidhurst\nhammad\nkoop\nshamsher\namerindians\nbtv\nempathic\nussuri\nkolberg\nmctavish\nmerkin\nbeehives\nadak\nsyne\nhypnotize\ngeel\nmultitudes\nrailroading\nneg\nmapp\ntrilingual\nsupergrass\nwestmont\nvietcong\nexacerbating\npetrozavodsk\npsb\ntari\nmenin\npompeu\numayyads\nhartog\ncountertenor\ncarrom\nrambert\nrossignol\ngigabyte\nrabia\nhsr\niveta\noverestimate\ntroutman\nuka\noveralls\njvp\nfrontiersman\niquitos\nmanabu\ndaphnis\nalmeria\ncollegians\nangelfish\npramod\npallium\niha\ntranshumanism\namericanized\nyeux\nsamizdat\ndunton\nkneale\ntante\njaane\ntoledano\npasto\nthermometers\ncstv\nfmr\ncapell\ncorsi\nsayuri\nhansi\nvolcanology\nnavarrete\nalge\nkuen\nturnkey\nskra\npsychoanalysts\nwallonne\nflirted\nmatriarchal\nequalising\ncerretani\nexcruciating\neugeni
us\ncollegian\nterraforming\nbludgeon\nloko\ndiyala\npaternoster\ncattell\nkuzmin\nphonics\ngodsmack\nriverwalk\nwaterbirds\nahimsa\ndeliberated\nahlen\nassuage\nsiddons\nallon\npeut\ntomasi\nuba\nbruijn\nclackmannanshire\nnanticoke\nrosner\ncingulate\nbarbatus\nkeshav\ninquires\noreo\nsissel\npoeta\nmerwin\nffg\nwyllie\nstegosaurus\nmoulay\nkeystrokes\nfowley\nplaça\nrahn\nsmirnoff\nfester\nhadden\nthrifty\nsiddur\nnaish\nrivaling\naozora\nmccaw\ntumuli\nwilful\noutrageously\nmusculus\nslezak\nfellatio\ndestabilizing\nlenton\ndake\nbutane\nlibertador\neschew\nstradivari\ntiepolo\npinches\nnyanza\nhelplessly\nloro\ncompetently\nfume\nconstitutionalist\nbrainard\nanastasiya\ninterjection\nshoegazing\npalio\nmalloch\nresets\nborgnine\nlignin\nknockin\nweasley\nneodymium\ndierks\nberankis\neadie\nfishguard\ndismas\nglances\ntempleman\nmerc\nilana\nphillippe\ngaal\nbasle\nparasympathetic\nstereotypically\ntitov\nverily\nbonnaroo\npolytechnics\nkoning\nmacri\ndogon\nwigwam\ntamir\nwalküre\nsoltan\nkhattab\npersevered\ncastellana\nmaunsell\nreticulated\nhuac\npodge\nummah\nguanabara\ncissy\nperp\narnoux\nrepopulated\nphotocopy\nmogwai\nbrita\nglenbrook\nbalder\nmargulies\nudaya\nverdad\nguðmundsson\nswac\nnq\nguadeloupean\nfantastique\npushrod\ncalleri\nfolksong\nalmshouse\ntiticaca\ninfuriating\nblowin\nmineralization\ninterbreeding\nagitate\nyoni\ntortuous\nrivermen\npreconditions\ncwc\nexton\ndownpatrick\nspruance\ndancevic\npolicewoman\nstenson\nmarceline\ndivino\neschewing\nhola\ntosi\nplauen\ncarpeting\nrealy\nchinaman\ndroits\nriefenstahl\nmifune\npanavision\nhaben\nwoodham\nwilfredo\nmcleish\nrepeals\nheady\nromulo\ncreatine\nheadsets\ntaitung\nbursaries\nunicellular\nsudeten\ncanola\nringtones\nsafford\npecuniary\nprabhakaran\nledyard\ncpb\nmenno\nverneuil\nbarbeau\nsupremacists\nedom\nverus\nwalkout\nninjutsu\nkrško\narcseconds\neren\nfenech\ncockrell\naustralopithecus\nondine\nhenn\ndwindle\ncollinsville\nheiko\nsubcontractors\ndejohnette\nholster\ngreeneville\
nhouthis\nrailed\npaisa\nshepp\nperusing\nbluebirds\nkaunda\nbeltline\ndmg\nchauvinism\naiga\nthibodaux\ngloire\nmilgram\norations\nhmmmm\nplanina\ncolumb\nnucky\nbasa\nbence\nbreese\nushers\ninfeasible\nlambe\njourneying\nsugawara\negress\nrajendran\nengrossed\nfolklife\nmops\nrctv\nshur\nmedien\nsearing\njgr\npresumptuous\nmingling\nsneed\ncodeshare\nkhodabandeh\ngillani\nchiasso\nvilles\namana\nbroach\nchifley\naileron\naltamirano\ntuk\nsunscreen\nwindom\nbri\nteg\nhabu\nclarín\ndesiccation\nwarri\nraymundo\nburgdorf\nabovementioned\nflogged\naurochs\nsubstantiation\nokamura\ntaaffe\nlayoff\npaarl\nwilke\nbazaars\nzapp\nkabyle\nbloodaxe\nallingham\nmacalester\nkhost\nkoubek\ncreech\nmilkshake\nrayong\nkaras\nmelvill\nadrianna\ncrp\nmacrophylla\ncsk\nheifetz\nbanting\nautochthonous\nbrooms\nblackburne\npapin\ncadillacs\nnihal\nchea\nteardrops\nsidmouth\nsobbing\nbiwa\ndownpour\nsdc\nkobi\nmelia\nolajuwon\npeal\nelizaveta\nkanchi\nbapu\nmesenchymal\nmarella\nbish\nopportune\nfeder\ndesirous\ntatler\nshalini\nordine\nhav\ntrifle\nneumarkt\ncct\nhelston\ntrani\nprk\nthelema\nduleep\nmarland\nmatchless\nbalustrades\nformers\ngarnish\nowari\ndadu\nsundberg\nldap\nkrylov\navianca\nemmerson\ngranaries\nsahar\nquacks\ncriminalized\nneary\nyoun\nelyria\nardenn\ngoldilocks\npdm\nmetabolize\nminty\nwaxed\nstendhal\nfaroes\nepg\nlongwell\nbánh\nkamke\nlikenesses\ngirly\nneoplasia\nakkad\noccupier\ncochlea\nvelcro\naggressors\neditorializing\naske\nhowser\nneos\npreservationists\nvials\nabajo\ngalvanised\nhendersonville\npenghu\nthanatos\nwfl\nsarno\nmelun\nquakerism\nménard\ntorben\nhopkinton\nafterburner\neke\nwetzlar\npolarisation\nrelocates\nenola\nathene\nbolyai\nstriatum\ndagens\ngunsmith\nshowering\npias\nhahnemann\narazi\nlowes\nwbbm\nvahid\nsyncing\ncada\nanthea\nkaitlin\ntranter\nradnorshire\nantagonize\ndinger\ndropouts\nshunsuke\nmartingale\nsanthosh\nberated\nbalti\nkomarov\ntuke\nratepayers\nestela\ncleverness\nbantry\ncombinator\nvianney\nbodden\ncloaking\n∧\nwr
enching\nblumenfeld\ncmr\nmelding\nquis\nkunsthistorisches\nsigler\nchloé\nchlamydia\nfarmsteads\nacrobats\nlethargy\ndileep\ndpj\nmaf\nirian\neasts\nallergens\nfraudster\nintergenerational\nelated\nscudamore\nrumpole\nsampath\ntarquin\ntuvan\nletty\nmonod\ngwang\nineffectiveness\nhypoplasia\nloni\nnekrasov\nvilified\npott\nassimilating\nharar\nbgsu\nflaring\noutfitting\nyazidis\nhedlund\nhaydar\nboethius\naacsb\ntriglycerides\nhyères\nrosi\nulna\ndeterring\nmillikan\nviswanath\ndownham\naprons\nleaden\npalle\nsamhain\nreappearing\nepitomized\nbutuan\nescobedo\ncongdon\nillit\nquli\ngcm\nsilicates\nvenango\nsaps\ncadavers\nwideband\nsnitch\nsvr\natzmon\nnayyar\nanjouan\nefta\nvishnuvardhan\nsalmson\nfringing\ncodemasters\nteheran\noled\nflp\nmicroseconds\nfini\nbratt\nopiates\ncomorian\nlimiter\nauteuil\nbemused\nvalderrama\nhasler\nkaneohe\ninebriated\nampex\nbonzo\nmashable\nradioisotopes\nmeda\npeixoto\nairbender\njhang\nunkind\nscholasticism\nwealden\nguanzhong\nfili\nunpalatable\nstewed\nodissi\nous\npoppin\nconvalescence\ncfi\nwintour\nrhinebeck\nomnipotence\ngoosebumps\nsilverlight\nbourgoin\ncriollos\nbrc\nmangano\npollute\nemam\nkukushkin\ncheviot\ncontravenes\ntendai\ntarom\nbelford\nlampang\npushpa\nbenefitting\nnahda\nwilkens\nbewildering\nswingers\nzynga\npareja\nveselin\narrhythmias\ntakano\nphysic\ncommandants\nerythematosus\nbusey\ndeuces\ncentennials\nlinnet\nnewlywed\nbenassi\nresemblances\naddo\nidling\nmahwah\ncatton\nstyrofoam\nkelson\nschlosser\nmuskie\nmaundy\noverbrook\nborobudur\njot\napothecaries\nmatthau\nshaki\nclogs\nhausen\nrizzuto\nckd\nwillington\nhopped\nalleghany\ndmv\nshifty\nfdd\ndioguardi\nspennymoor\ncelestino\ncoursing\nnorthfleet\ngidget\ncondenses\npatrizia\nvagabonds\nbrunetti\ngrinning\nparviz\nunscripted\nfrcs\nbenvenuto\nskymaster\numesh\ndiscipleship\nmaclay\nstrasse\nkhushi\nvijayakumar\ncarrol\nbamberger\ngardermoen\nkilby\ngladwell\nnaeem\nreena\nscrutinize\ndolmens\nsalonen\nngoc\nilium\nproyecto\nrurik\nkudla\nperet
ti\nsamoans\nduplicity\nbeatnik\nbiennially\nlandowning\nschisms\narild\nthreepenny\nkristianstad\nupson\ngillet\nleeuwen\nhinojosa\nwoodhull\nschematics\nkeeneland\ngeezer\noverheat\nziya\nhullaballoo\nquintessence\ncletus\ninsignificance\ncuervo\nnerva\naberavon\njove\nfraga\nrittenhouse\nmonotony\nsteerable\nnervousness\nnasp\nhorsey\nphilistine\nprescient\ndaron\nvfw\nstraddled\nparisien\npetrovna\nclarksdale\nascari\nchamfered\nnobs\nstickney\nbleue\nlifan\nbattuta\nambi\nbulges\njavi\nbarfield\nmesothelioma\nrepositioning\ncannonballs\nrecurs\ncompuserve\nsokal\nbrigg\nfàbregas\nhudgens\nahr\nforages\nprés\nbreakups\nunimaginable\ngethsemane\nabeokuta\nslieve\naggravate\ncannibalistic\nrockfish\nblissful\nspirally\nleftmost\nrebreather\ngalo\ndeclaratory\npokey\ncei\nshoemakers\nkopf\nlapin\npietra\ncecilie\nderails\nbep\npalindromic\nelocution\ndorada\npropulsive\nshida\nsolari\nkranti\nhighrise\narron\nwiggly\navigdor\npleated\nbaldini\ngmos\nrois\nbullfighter\nloosing\nefren\ndrongo\nseram\nsoundboard\npenta\ndowie\nlazier\nwaimea\nlief\noocyte\nperris\nsunroof\nfinalizing\nobc\nthrong\nsakis\ngdc\ncurating\nheya\nboole\nmsh\nsune\nkerem\nmelodramas\nwavering\nhynde\ngarnished\ncuiabá\nlanthanum\nbadan\nliaquat\nmukden\nhypercube\nkenyans\ntemminck\nimitative\nrell\naccosted\ntodi\npanis\npandering\nkourou\npanacea\ngleefully\npathe\ncahokia\naino\nunum\nkotte\nprokop\ncherubini\ncaridad\nerk\nfrolic\nmul\nidx\nsugary\ntze\nbrowsed\nsakhnin\ntambor\njitter\nmichaux\nmalignancies\ntharp\nselleck\njaisalmer\nedl\nslighted\nphares\ngws\nalpen\nwx\nimpairing\nmétiers\nconfide\nrazzaq\nakihiro\nmirzapur\nsympathetically\npossums\nmoriyama\ncharlatan\nbocas\nbandyopadhyay\napoplexy\nfretted\nwaqf\nluckett\nmolchanov\npandu\nbassin\ndri\nmeisel\ncordially\nmetastable\nturboprops\nskanda\nauthentically\nsuperfly\nsalutation\nharrassing\nprophylactic\ngalla\npagani\ntopi\netsi\npira\nkelman\neasements\ndelicatessen\ndodgson\nsinaiticus\nrimfire\nketamine\nkcrw\nsnat
chers\nrahat\nmcquaid\neggshell\nnaseem\ngaliano\npoached\nanionic\nancestries\nformalization\nspier\nboozer\nsilvestro\nheroically\nruthin\nboal\nseahorses\nramada\nsakshi\nconolly\nkarnal\nmicrometer\ntritt\nyaar\ndunst\nkrasnaya\nsteffens\nharpur\nnio\nrione\ncapsicum\nmoos\novo\nmackworth\nchakvetadze\nlucretius\nrusticana\nbatcave\nstix\nbaldry\nhenty\nhassler\nwarsi\nhomesteaders\nraze\nreval\nwaitresses\npatra\narnoldo\nonetime\nrealschule\nfilibustering\ngifs\nyuchi\npusha\nraji\ngrylls\nbenicia\nkalidas\nparasitoid\nchanakya\ntransshipment\nhartung\nwishaw\ntrillo\nprettiest\ngalvani\nsynonymously\nwfm\ndischord\nchern\nappl\ndecentralised\nunfunded\nhairdressing\nnaac\nnorsemen\nspero\nlue\nspyker\nnaugatuck\nlongshore\nfreescale\ncheam\nairpower\ndrivetime\ngoldmark\nmontel\northodontic\nmicrocomputers\nfastening\nhellish\nuninitiated\noldbury\ncrothers\nfearnley\nchalfont\nsantonja\noster\nshami\ncharette\njcp\nfacultative\nsupergiants\nmilland\ncashbox\nsomnath\ntorments\nnove\nmaddock\ndistiller\ndonmar\ngoldoni\ncampi\ntress\ntswana\nchacon\npenitential\nclearcut\nprudhoe\nsiloam\nkugel\naitkin\npsychopaths\ndrewry\nchisnall\nadorns\nfook\ngarneau\nshas\nkasay\nironton\nunsavory\nruminants\nlocsin\nreuther\nneutrals\ngoodridge\ncuellar\nhanwell\nunderwhelming\njurgens\nwrightson\nerith\nmasai\ntynes\norizaba\ndiageo\nfigo\ncupcakes\ncoexisted\nstumpf\nkendricks\nveiga\ngondar\nsidcup\nrecalcitrant\nlaudatory\ndazzler\ncanty\ncotoneaster\nmuenchen\nhardeman\ncoolie\nwfan\nresonates\nfernald\nkunze\nnaypyidaw\nquoins\nschalk\ntecnico\nsav\nremover\nlasik\neliseo\ndisinterest\nfok\nmendeleev\npintail\ngarbin\nclichéd\nliliane\ngianluigi\nkebir\nyogesh\ncodebase\nfecundity\nteignmouth\npliers\npatchogue\nadjournment\nmasaharu\nradin\nffu\nfawlty\nboule\nintrude\nsí\nposttraumatic\nbokaro\nrhinelander\nmontaner\nmarshallese\nnewstalk\nkrabi\nanticline\nsympatric\nplotter\ngagné\namica\nture\npotty\npalmers\nconkling\nvolo\nbickley\ntimespan\nfios\nflutteri
ng\nglidden\nbestiary\nunpredictability\nsisson\nkatsu\nbrannigan\nbesiegers\nmetathesis\nbrassica\ngiacinto\ncinematheque\ntux\nsynthesisers\ncantina\nkhirbet\nnewcomen\nmysims\nboma\nreverberation\nswerve\nvarda\nmcclean\nglasnevin\nmerk\ncoste\nmauch\nperipatetic\ncorrs\nsijsling\nmanado\ntsimshian\nbandmaster\njorn\nconstitución\nstepanov\nbrighouse\npuran\nmenashe\nmoraga\ndispensers\nsouthwold\nlavoie\nrudan\nchosun\noddie\ntarsal\nhatem\nmasterly\nphospholipids\nvincente\nsubregions\ngrinders\nzena\nexasperation\njawed\nmalakand\nitineraries\nunsanitary\nyogananda\numc\nmachel\nnakazawa\naltmann\nrhoades\nplesiosaur\nflatland\nsibi\nwku\nkral\nfrequenting\ncaffè\nmaxis\ncurtiz\nabated\nfarleigh\nvtv\nplater\nwintertime\nbraulio\npilcher\nélite\nteniente\nalyn\nsori\ngoldmann\nelbridge\nxylem\nrealizations\nlifespans\ncorin\nemerton\ntramps\njayawardene\nkeough\nschoolgirls\nretailed\ntotò\nyiannis\nshakuhachi\nwhines\nkcr\nmoises\ngestion\nwonsan\nunisys\nvenison\nseite\nbdg\nsimard\nunissued\npassersby\nbrudenell\nsiew\nscoliosis\naraucaria\nmeadowbank\naéreas\ngisele\nagon\nezrin\nkult\nsoley\ntaschen\ncovenanter\nregrouping\nsahab\ncarburettors\nravensburg\ninternals\nbrendel\nsecreting\nbesting\nbartolomeu\nlederman\nfolic\ngroundnut\nmontecito\nsorokin\njusticialist\nauctioneers\njinhua\nacquiescence\nperuana\nantic\ndarcis\nflotillas\nboobies\nsubcontractor\northodontics\nbrushy\nmiler\nselçuk\nbracciali\nfolke\nberrigan\nfastnet\nrosella\numma\nnitpicks\nskyblue\nwillmott\npillared\noilfields\nfulgencio\nfamiglia\nfagen\nbinyamin\nmotherly\nnyberg\nehr\nbigamy\nshuki\nneb\nsubside\nbodyline\nminibuses\nrunabout\nhollinger\nshapira\nvtec\nburglaries\nairlie\nsayin\nnian\ndamped\nhurston\nsph\npollster\ncamões\neucalypt\nlfa\ndockery\ntraumas\nadh\npws\nmaribel\nlittlest\nmerganser\ntwos\ngatton\nsubpoenaed\nmillwood\nibuki\nahad\npostlethwaite\njawbone\nandal\nteena\nrolston\ndyadic\nevatt\nbiotin\ncoking\nkalina\nbuehler\ncounterpunch\nhowley\nsutlej\n
roud\ntendering\nomnidirectional\nwilmette\ncoho\nphilp\nkaryn\ndavenant\nbharatanatyam\nmalraux\nbarbette\ncavalcanti\ncarbery\npavan\nloïc\nasplenium\nfervour\ntskhinvali\nokehampton\nlgv\nbioko\ngrasso\nresellers\nmustela\nkhajuraho\ninoki\nholles\nradula\nagt\npastorius\norangery\nshatabdi\nbanstead\nkeneally\nstringing\npatrika\nlisbeth\nknutsford\ncoshocton\nsydow\nvitalis\nroxette\nstargazer\nwhitehill\npatenting\npourquoi\nsleds\nphysiologic\nunknowable\npowerball\nione\nmassena\nnunc\nasis\nnosy\nflexi\nrecessions\neuphoric\nnumerology\ncsir\nanscombe\nwolk\nsuffocated\nlota\nkase\nnro\nsbu\ncomparator\nboleslaw\nrieng\ncmm\nkyw\nzooms\nneurophysiology\nramalho\ncky\nmisr\nadn\nbathsheba\nyohannes\nunanticipated\nsacher\nkirshner\nbeas\npupal\nhulse\norkut\njiangnan\nvande\nkkr\nmédecins\nmalahide\nidiosyncrasies\nberland\nlineker\nlegato\nlotz\nvidmar\nnif\nreinterred\npiqued\nfulvio\ntewksbury\ntoolset\nploughshares\nnrp\nshevardnadze\negidio\ntorrence\nramblings\ncantacuzino\nmidleton\ndeegan\nmadea\nrajahmundry\nchesley\nlympne\ndiggings\noutmoded\nreyne\nmadox\nizzard\nvirtuosic\nmanville\nmarder\ngents\nalamosa\nsuprised\nkilsyth\ngorey\ngoltz\nolimpo\nklemens\ncryptids\ndemers\nmainstage\nelectroshock\nherm\nquantrill\noligarchs\nlonghouse\ntumult\nischia\nblackmun\nicicle\ntammi\nravaging\nphilharmonie\nbackpackers\nneisse\nsuccubus\nsaruman\nliwa\nsanguine\nfunctionalist\ndemobilised\ntoulmin\ngreenhalgh\nsubcompact\nrokeby\nnauseam\nanticancer\npondered\nrst\nbatna\nmicaela\nretooled\nlevadia\npaediatrics\nvaporized\nitalicize\nswanage\ninexcusable\ncarretera\ncoxe\nbinti\nsaudade\ntingle\nsproul\nvelarde\nlodewijk\nsonically\ndryers\ngladwin\npelly\njudaea\nrandomization\noverwinters\nafr\nboatmen\nleonov\nnovato\noutsource\nspank\nperrine\nshakuntala\njamiroquai\ncrema\nendeared\nscantily\nsuperset\nmto\nazrael\nvarden\nfrist\nbui\nlundqvist\nmisjudged\ncust\nweds\nthroes\nzabel\nheino\nviscera\npricewaterhousecoopers\nacetaldehyde\nracemic\nale
rtness\ndrescher\nanomie\nmeted\nbasheer\neuphemisms\nnikolaidis\nbotulinum\ngrossi\nbeetham\nhenrich\nmaryann\nheaped\ncicadas\npolysaccharide\nsiggraph\nvacuous\nnetware\ndumfriesshire\nkawabata\nraia\nvalur\npalatka\npalimpsest\nbitters\nlalande\ninvalided\nwaxes\ngsx\nprojet\nasiatica\norsi\nshaming\nexhale\ngilford\nspiderman\ntamm\nnyb\notte\nkanaka\nkhalili\nmétropole\npowwow\nkugler\nvelociraptor\ncorbels\ncopyeditors\nwassily\nyakubu\nfridtjof\nhhc\nrocque\nwimp\neir\ndepredations\ngriqualand\ngneisenau\navonmouth\nrecompense\nunsealed\ndarlinghurst\nmccafferty\nmatica\nsteamroller\npreterm\nannick\nplaytime\nfinials\nsexology\nwealdstone\npolysaccharides\nkubert\nastorga\ncoz\ncoherently\ncasuals\njsf\noeiras\nidler\nchedi\nthurso\nptarmigan\nherodian\nbrahim\ngerund\npipistrellus\ntiran\ndzungar\nhalpin\npancha\nwaltons\nabcs\nmaimed\nvoskoboeva\nnul\nhopkin\nkina\nshemesh\nedenton\nparvez\ncontrôlée\nmnf\nabdelkader\npanjab\navebury\nimperfection\nbillets\nhideyuki\nwimsey\nsarojini\nviale\nmetaxas\nvieille\nwrede\nstonemasons\nallegan\nnihilistic\npolecat\ncalcification\nbiologic\ndiluting\nborge\nrhind\napsley\nshivering\nhailsham\nbasham\nimmortalised\npooch\nhaberdashers\nahmadis\ncasuarina\ndualistic\nfuk\nsastre\nstronach\nguillen\nprolactin\nikarus\ndorji\nmilpitas\nharan\ncolombier\nbrodmann\nlamu\noccidente\nroblin\ncusick\nbraine\nmycroft\nâme\nvolcanics\nhaver\nchehalis\nbyram\nsusitna\npeppercorn\nwace\nunité\nvse\ngoleta\nguaynabo\nwarde\nreiki\nmistrial\nhaiphong\nherrero\nchichen\nentanglements\njessi\noverwriting\nmaiko\nnex\nimmingham\nnacelles\nliviu\nrinne\nrpp\nmatsunaga\nsuisham\nstorico\nmesurier\ntirso\nmarling\nmousa\nbosporus\nsylt\nlibertines\nfootman\ndelage\nduxford\ntheodosia\npuk\nkesari\nkasabian\nmineworkers\ngnrh\nbednarek\nbogra\nockham\ncolloquialism\nambassadorial\nprosaic\nstowage\nurie\ndalston\nnrm\nroly\nciconia\nguenther\nsorocaba\nhomeroom\ntiernan\nfiscally\nancre\nowerri\ndiacritical\ntoshiro\nanup\nreorganise\
nhomie\nbarasat\nsandbach\nforaker\nhydrodynamics\nnatter\nsimm\npwn\ncomitatus\ncrescents\noau\nunappealing\nmouthful\nhadj\nrthk\nforename\ndepreciated\nconfiscating\northopaedics\nmariage\nmurayama\nwenner\nhitz\ntakeovers\ndtc\ngalas\nnorthbridge\ndphil\ncavett\ntmnt\nweatherfield\nakim\nwarty\nwhisker\nrevelstoke\nvoyagers\nunbanned\nborislav\nsummerall\ncostin\nousmane\nmarois\npollok\nkomorowski\nshanthi\nbokassa\njeppe\ndienst\nhighwaymen\nrabobank\napf\nsievers\nmusgrove\nsignposted\nsyntactically\nwicomico\nservile\nmegabyte\nbrawley\nvrindavan\nvillano\navogadro\nkellaway\ngraeco\nmogollon\nmallards\njuvenil\neveritt\nnne\npiha\nbrighten\nmâcon\nmahidol\nbruguière\novergrazing\npolygyny\nroyalton\nbrindle\nsukkur\ndetours\nbradwell\nkhader\nbasquiat\nsorkh\ndoorbell\ncrus\nkarno\nseni\nmaqam\nmik\nbylaw\nzhivago\ncecelia\npolanyi\nsetagaya\npermeate\nsot\ncron\nmilledgeville\ndomiciled\nchortle\nfoxton\ndonnybrook\nsyncope\noded\niiis\nunsophisticated\niy\neliane\ngimelstob\nsittin\nmeno\nsubverted\ntrotskyism\ncalera\nspillover\nzouk\nbrzezinski\nalbicans\nlessor\nsethu\nbusse\nsemitones\nblogged\nlawnmower\nyusof\ndurocher\nzircon\nsnafu\nboudin\nnommed\nkoné\nsilverchair\nbagging\nkuchma\nknolls\nkander\nstanko\nmaupassant\nmcbeal\nbasutoland\nbenzema\ntippu\nsibylle\nrecollect\nkajang\nmlm\ntecla\nkaia\nchristopherson\nuntagged\ncumnock\nattleboro\nidb\nnowra\nhaywire\ntrakai\nhelvetia\ncoden\ngeeky\nbarça\namarilla\nfilton\nbartenders\nceri\nsgc\ntla\nrheingau\nheadington\nhypothesize\nmaryse\ndeserting\nplacard\nboigny\nyasmine\noutbuilding\ntopher\nvictorville\njayant\nsurly\nsayeed\nksk\njanowicz\nmpl\nbhim\nmenke\ngand\nantislavery\nfraggle\npuffing\ncephalonia\nramchandra\nincredibles\nossett\nfmt\nwhelen\ncryptographers\nunjustifiable\nasner\nkazim\nawdry\nperche\ntesticle\ndismount\ngoer\nbiondi\ntwista\noptimality\ndelco\ntypographer\naldine\nkeine\nfumio\ndarters\nirigoyen\ncetus\nratchaburi\nbhuj\npama\nquant\ngrantland\npert\nwierd\nveb\nd
evdas\npuka\nsadako\nmogg\nassemblée\nwebcams\nsylvestre\nsandie\nsweeteners\nbellinger\nshanley\nbelz\ntimescales\nfeodorovna\nlatha\nmagisterial\nanaemia\nbackbeat\nfigment\ndemain\nglobalized\nmegaman\nproudest\nracemes\nhamtramck\nnavia\nirakli\njammers\nverena\nobsolescent\njuho\ndaulat\nflapper\njhu\nghs\nincubators\nkarmic\neez\ncollum\namorim\nmonophonic\ncolusa\nperpetuates\nsalesians\nrenfro\nwhoop\naglaia\ntrellis\nnephilim\nisaacson\nkjetil\nzvornik\nshamed\nmalvina\nrovaniemi\ndemidov\ngoenka\ncibc\ncanavan\nsmyczek\narv\nahmose\nmalign\nmedcom\nkickbacks\nphotochemical\ngrebes\nwcs\npressley\ngoldsborough\nrouges\nharmoniously\nsabana\nwasabi\nkatt\nschnitzler\nseesaw\nkoons\netzion\nsaqqara\nhaywards\nmitchells\nchokes\nanoxic\ntempah\nyohannan\ngyi\nberndt\nbissett\nsjöberg\npedagogic\ncollectivism\narchdioceses\njaakko\nklimov\nliggett\nsocal\ntomko\nbritannic\nelahi\nquebrada\nmansel\nnuba\nsobolev\nquip\nsiles\nbalked\ncorreggio\nlifehouse\nrahmat\nhuancavelica\nlordi\nkogi\npoti\ntamagotchi\ndiniz\njuhi\nacculturation\ncinnabar\narbutus\nkerns\nweblogs\nmichelsen\nvidas\ncampanula\ngranites\nvulgare\nexplosively\naldosterone\ncervus\nsplashes\nnarco\nrepressor\ntonio\noverwintering\ncbsnews\naphrodisiac\nwarfarin\nkanter\ncrier\ndiamantina\ntwit\nlaga\nlackland\nwhitty\naccomodate\ncatriona\nsayf\nmesons\nabides\ngenaro\nsmpte\naccel\nclank\nragsdale\ngadfly\naramis\nbenelli\ntantalus\nformless\ncova\nnecronomicon\nmillisecond\ntomoyuki\nsickening\ncatalano\nadamski\nchaman\nfishburne\nchander\nbirrell\nmadder\nareca\nvoi\nsusilo\nmazza\niya\nunfccc\nminnetonka\nconstantinescu\nkhat\nphenols\nmayans\nruppert\nninoy\nmodernising\nsynovial\nskeletor\nwhiston\nfretboard\nluma\nkri\nchiudinelli\nloath\nsmitty\nholdout\nprograma\nfairley\nsamanid\nchaminade\ninternazionali\ndhule\nllandovery\nconran\nskied\nkawase\nlawmen\nnetherton\nchocolat\narna\ncarriageways\njumpsuit\ninfomation\nhidayat\nmalcom\ngetúlio\ncurries\ncrackle\npotholes\nbagge\nshipwr
ight\nnitida\npares\nmilitarized\nhomages\nnoé\nbowness\nsquabbles\ncheboygan\nejects\ncasares\nhirasawa\nkeno\nmajoli\nplumer\nnaji\ntaryn\nminimis\nkarg\nspithead\nstranglehold\nhca\niht\nairstrips\nbilson\nthora\nwijk\nscariest\neaff\nlysosomal\ncosy\nmontford\nvandalizes\nkyrenia\nlaffont\ngosei\nvasilyev\ndrapeau\ntenancies\ninniskilling\npostoperative\nmanisha\nentrenchments\naal\nburan\ninterchanged\nanalgesia\nintrauterine\nredraw\nvixens\nheros\nshabbir\ngompers\nkye\nbsl\nddb\nanjana\namari\nseljuks\nnardi\nferrando\ncouched\nmedrano\ngeht\nbelgravia\nyantai\nzebrafish\nmysterons\nshahu\nwiretapping\npiccolomini\nauber\nseaworthy\nentryway\nsandbags\ntrevally\nyesterdays\nrisi\ngalaxie\nstricker\nbiro\nligonier\nkeat\ncaven\neffusion\nfittingly\nsuba\nshelford\ndosages\nesh\nlinsley\npirro\ndecimals\njamnagar\nwalcheren\nkünstler\nmontpensier\naylward\nconsistant\nmobi\nmusing\narsenault\nsceptic\nfieldturf\ndraconis\nvingt\nfirmus\nmalherbe\ntano\nreallocated\nmateen\nwiggin\nhatim\nodum\nchaffey\nstuffy\ntenby\nsubsoil\nzabriskie\nnobby\ngräfin\nfuelling\njelinek\nazzurri\nyoakam\nshins\ncukor\nsandilands\nmilind\nragga\nsanomat\nsupplication\nkinloch\nlolly\npenfold\nrond\neriko\nmarionettes\nstumping\nnakashima\npedley\nyardstick\ndarian\nrubbers\nminelaying\nhoes\ncarcinomas\nrotted\nmillikin\ngravitated\nprag\ngiggle\nvalentini\ndeplete\nsidgwick\ncantona\nbarré\ntreves\nicici\nblakemore\nyesteryear\nstabling\nhoverfly\nacerbic\nphc\nrosaceae\ndevarajan\nmcv\nantonioni\ndahn\nseraph\ngodlike\nsynchronizing\nclarin\nclumsily\nseahawk\ntant\npotlatch\nmork\nvacating\nwhatley\nhertzog\ngruden\ndonohoe\ntannenbaum\ncouperin\nactualy\ndefra\nglides\nander\nduplications\nquatuor\nchettiar\nquatrains\nmozarteum\ncomebacks\nngan\nsluiter\nmuruga\nessequibo\ntheodicy\njurij\nproulx\nborat\ndartington\nacknowledgements\ngowers\nbloodletting\nppe\nctf\nzhemchuzhina\nchincoteague\ntoltec\ndct\npoche\ncroghan\ngalaţi\npasserines\nostriches\nbiofeedback\ntans\ncar
lebach\nmildmay\nmwanza\ninfuse\nscheuer\nwarton\nprekmurje\nbellegarde\nconjures\nkaramazov\nburak\nfetzer\noutsold\nkabc\naventures\nbextor\nkike\ntattered\ndisque\neberhardt\nmadtv\nliddy\nfloribunda\nnhat\nurszula\ndefaming\nsuruga\nmilosevic\nwheelhouse\nhoneybee\npbb\ngrottoes\nankit\nflail\nmeola\nwallets\nkritik\nsudhakar\ngoldblum\nwarmia\nfrequents\nwaqar\njoven\naven\ngea\nclamshell\ncrossbones\nfrostburg\noji\nbicyclus\nenel\nwagnerian\nmiddlemen\nhoutman\nhalladay\nintemperate\nganapati\nllorente\nsepoy\ncapuchins\ntann\ncottesloe\ntelarc\nbeqaa\nchenab\nvegf\nshamelessly\nmorandi\nzaw\nburnings\nokuda\ngundagai\nkaká\nkothari\nenea\ncarnell\nbech\nautoharp\nyoughal\nsynchronicity\nexhorted\ncarrack\ncavalera\nfoams\ndiarists\nmagnify\nbhaktivedanta\npodestà\nmunhwa\nbaffle\nssris\narmadillos\njamshed\npaglia\nrusting\ntissa\nseay\nintuit\nthuy\ncottingham\naelius\neloi\npaquito\nengen\nlevey\nconall\npreconceptions\nfst\ncozumel\nnatacha\nmendon\nclack\npedestals\nhampel\nnakahara\nquickness\nformalist\nrdx\nlinens\napolo\ncurrey\nraziel\nbrahm\nmonta\nrrr\nlaboring\nhelensburgh\nraunchy\nhambro\ninterlibrary\nampersand\noperant\nrmt\nkolo\nobafemi\nmilani\njosette\nriquelme\nkrems\nbehaviorism\npul\ndisbarred\nhuma\njebb\nfarringdon\nantar\nheave\ntelemedicine\nkreutzer\npresets\nequerry\nwirelessly\ndabbling\nmcgriff\nmullally\nstil\nrutten\nhamon\ngoldmine\nphysicality\nwlan\ngelb\nsaúl\nkonak\nspektor\nicefield\nchecklists\nnangarhar\npatiño\nhabibi\nreunified\nascribing\nmuslin\nendor\neide\njeez\nribosomes\nbarenboim\nkaramanlis\nsert\nkolchak\nmalak\nheterosexuals\nlegoland\nneuroendocrine\nallaire\nplateaux\nrediffusion\njongno\nzermatt\nbatmobile\nenteric\ncoakley\nsmut\ncoaling\nanania\nyoh\nbiomolecules\nmoralistic\nfuerteventura\nawl\nnecktie\nrila\ncatlett\nboccia\nwhammy\nshuri\npetites\ncolquitt\nmacgowan\ndiehard\nthema\nmadeley\nhouseholders\nkirkintilloch\nthoughtless\ntessellation\nsandbank\ncharli\ntumi\nkaraiskakis\ntvc\njaar\nadju
vant\noutlay\neratosthenes\noutlander\nemad\nhypochlorite\ngendarme\nconejo\naverill\nsidenote\ntbh\noems\nastara\nplotkin\ncatagory\nsedgemoor\nalawi\nkamla\nspiers\ncuyo\nsacd\ncarded\nskimpy\nneoconservative\ncalderas\nstalactites\nkgl\nruthlessness\nmolester\nboobs\nmasbate\npuke\nsourcewatch\nhydrolyzed\nhts\nfarage\nstimpy\njuggle\nlio\ntopp\nbetelgeuse\naransas\nsondra\nmenorca\nbroil\nengender\ngaudy\ntco\ninvincibles\nbioactive\natherstone\nkairos\ndorados\nmphil\ncopán\ncharlotta\nfeliciana\nsatyricon\njonge\nbannatyne\nalternativa\nadduced\nivs\nstephenville\nlockjaw\nmsds\nsaluted\nbellanca\ncamagüey\nkorte\nprovocateur\nliceu\nsystème\ndenigrating\nwana\nmonocle\ncreepers\narethusa\nroebling\nabf\ngrassed\nspawns\nchatterley\nperk\nflournoy\nyagnik\nbunnell\nahluwalia\nciutat\nsternwheeler\nkurland\nbuzzword\nsimile\nkittel\nmoise\ntempering\nosmium\nriggins\ncabals\nlicorice\nscree\nsear\nstoreroom\ndaub\nrulership\ncerebus\ncmo\nreticular\nquails\nhuppert\nlambasted\ncassar\ntheophrastus\nloverboy\ncoursed\nflorey\nskrulls\nengr\nflers\nnayar\npärt\nnikolayev\nmariette\nisation\nhardanger\nklemm\nekberg\nbaader\nkako\ncomrie\nmvps\naarp\ntiziano\nshindo\nmalka\nsarabhai\ndeadweight\nindict\nunicaja\neyepiece\nclairvoyance\nflorid\nteleological\nkoopman\nipomoea\nborger\nston\ncolombiana\nfettes\noverrated\nplotlines\ninsulate\nexemplifying\nprovenzano\nsabr\nthomaston\ncaltrain\namram\nwhitgift\nclymer\ncommandeur\njoslin\nsmoothie\ncattlemen\nbegawan\nteitelbaum\nafricanist\ncommunally\nsubpoenas\ncontemporaneously\nproserpine\nfollowings\nmctell\nugarte\nsephardim\narcuate\nnobili\nluci\ntaichi\nintegrin\ngeonet\nleftwich\ngss\nmenderes\neisai\nvivacious\nbillabong\nsleet\npassmore\nshanties\njamaicans\nsika\nfungicides\nrdc\nkier\nscrapyard\nunabashed\nshortcake\nreimagined\nstefanos\nmunsters\nschroder\nhopetoun\npbm\ngpc\ngrosz\npinson\nveganism\ngodaddy\nsog\nmince\nagb\nshuichi\ntars\nsagging\ngwern\nindeterminacy\nptuj\nishi\npaignton\nsidearm
\nthammasat\nduy\ndbms\nseguros\nsolvency\nzinta\nmengele\naponte\nimpersonations\nyhwh\nmonongalia\nglassworks\nraisa\nbetsey\npetrology\nbalaam\nsympathise\nvaio\nupmc\nthrips\ncoppin\ndustbin\nembezzling\nquetzalcoatl\ncirculations\nstf\ncorfe\nbattler\neniac\nsurfin\npicaresque\nimplore\nfamília\nvanquish\nassaf\nshackled\nnouakchott\ncullinan\ncastrol\nevens\ncastries\nhijri\nsuboptimal\nrussification\nmpls\njuhani\nvesting\nbrieuc\nlefroy\ngatekeepers\ndecoders\narbil\nmisapplied\npnr\ncollectables\ngentlemanly\nwechsler\nkuhl\nenvisage\nolbia\ndemme\nmicroeconomics\ncedarville\nlazo\nbaume\ndva\nturunen\nwaterson\nwarley\nkhurshid\nbiosynthetic\ncryo\nthickest\nfel\ndivorcee\nvreeland\ncharnwood\nmargolin\nicahn\ncowry\ngast\nguri\nmnemonics\notani\nglassman\nmcdevitt\nbyrom\ntotenkopf\ncundiff\nallergen\nramdas\nvss\nmagnetically\npippi\ncatan\nbeeches\nretief\nelvey\nmacaws\nfurst\nbiohazard\ngridlock\nturney\ngolly\nfugazi\nparveen\nklum\nwarrantless\nwanstead\nglenmore\nbabysit\nsoh\nscullin\nderbent\ntoxteth\nabsolve\nsacre\nseguso\nlites\ncapdeville\ndemigod\ndvor\noldcastle\ndti\ntokat\nllanfair\nroderic\nyeol\nsagrado\nkwami\ntruncate\neddies\npaquin\nkiely\nblimps\ndundrum\nmopti\nascertaining\nnontraditional\ntimex\nmonstrosity\nmustering\nspurlock\nadductor\nzeki\nwavefront\nbrashear\namaru\nxliv\nnatatorium\nmuhammadu\nchateaubriand\naiaw\ncabarets\njacque\nmomenta\nptah\nquilon\nconsett\nmycological\nallahu\nmackinaw\nthos\nloveday\nprofumo\nmacaca\nvereniging\ndeeside\nharriett\nfibiger\ncasi\nbattisti\nfiorina\ndishonorable\nerases\ntcdd\nkarras\nkooper\nfrederikshavn\nleukocytes\ncolombe\nsnatching\ndirigible\nkoroma\neartha\nkke\nluchador\ndifranco\nboyzone\nturi\nnilotic\nseptet\nlond\neerily\nbrawls\nencyclopaedias\nrotorcraft\nprimitivism\nenamelled\numbc\nkbps\nfersen\nbordj\nselwood\nlevert\nkpc\napostates\nchaloner\naffable\nbais\npopovich\nkujawski\nbookie\nwittmann\nsaluzzo\nbefall\narraigned\nclaustrophobic\njetties\nminyan\nasoka\na
dha\ndeceptions\nkediri\narticulations\npopstar\noffal\ndotson\nsarnoff\ngoodale\nraimund\nroache\nrepublication\ncalibrate\nclemence\ntegel\nbaggins\nmiskito\nsacchi\nnamen\niizuka\nmarci\nchimaera\nchirag\njablonski\nlivid\ninductors\nprioritizing\nwaterborne\ntempler\ndriffield\nsugimoto\nsuazo\nsota\nheydon\nappellations\ndemocracia\nperformative\nwijaya\ngreystone\ninterleaved\nisiah\nlemming\nsladen\nvigan\noutflank\nselly\namstelveen\nvoort\nkillswitch\nfss\ndupin\nstelle\nsnot\nexceptionalism\nreserva\njuanito\nhava\nrittner\nfugger\ndrumcondra\npadraig\nmosfet\ncrim\nster\nfpl\ndutiful\nhawkers\npunchy\ndystonia\ndemetri\nmineo\ncrediton\nphylloxera\nsumba\ntidwell\njordon\ngesualdo\nbridesmaid\nissaquah\nmérimée\nratp\nkazarian\nhedland\nbaffles\nderyck\nkamar\nfescue\nsintering\ncfe\nmatriculating\nouattara\nlindh\nworkin\nmelita\njessel\nvlasov\nmeester\noundle\nchihuahuan\ndorney\ninouye\ngurdon\ncoretta\nmcneely\ndenisov\ndongfeng\nmastectomy\nsienese\nthibaut\npiña\ntardy\nmirabel\nimpeccably\nevaluator\nsitges\nalfieri\nshoop\nyuta\nconjured\nariz\nebbe\nmanzil\nxiangyang\nmmhg\nkunitsyn\nnaturale\nnanometer\nhalder\nwnd\ncardiopulmonary\nsuffocating\nostrowski\nybor\nasser\nstrassburg\nsainty\nkorner\nrepopulate\nnizar\nincurs\neyesore\nhegelian\nscabbard\ntamsin\nultravox\ntemperley\nscaggs\nframers\ncommunique\ncongregated\nbloodstock\nmatanuska\nrelegate\njayan\nsyringes\nsamia\ncoldwell\nospina\ndrupal\nempathetic\njubilant\nfungicide\ngantz\ncoterie\nsylvatica\nmorlocks\nreconfirmed\ndimming\nmaroney\nincinerated\ndanni\nseeman\nschapiro\nsivakumar\ntarde\nuninhibited\nhmg\nseminario\nkingfish\nanteaters\nfengxiang\nchitwan\nneagle\nsanofi\nguitarra\nreevaluate\nlapping\nthambi\nfrancie\ndanaus\nrecoleta\ndignitary\npillory\nfusiform\nfala\nfrieden\ncartago\nspewing\ncontraption\neasington\nhyams\nrestlessness\nsotelo\nlaboured\nyupik\ncontinua\nstonewalling\nkarman\nmacias\njadakiss\ngreenpoint\njwp\nmonasterio\nsealdah\nkalla\nfsh\nqueenborou
gh\nmfs\nbjd\nmolteni\ncardiganshire\nantequera\nassailed\ndhruva\nscherzer\nritu\nkatsumi\neleftherios\nmarmont\nmothra\nseperately\nzd\nfasten\ntullamore\nbehring\ncoubertin\nbaloo\nodinga\nseascapes\nkengo\npartington\nurbanised\noscillatory\nwashboard\ncrypts\nkhani\ntodmorden\nzolder\nprovably\ntackler\nmyopathy\nturrentine\nsuperstock\nhomologation\nwrecker\nfrança\nscorching\nshastra\nmarwar\ncodons\nakmal\nchancellorship\ninstabilities\nforagers\nbrawlers\nchilensis\ndumoulin\nsahl\noversize\nexplication\nosment\ncontrition\nhomophones\ngimbel\nkanno\naltdorf\ncompletly\nleonhardt\npantone\nesv\nspal\noum\nappellants\nmoyles\nsancha\nouray\nsmasher\nrizwan\nforti\ndiabetics\nacanthus\navion\nsullen\npenshurst\nmagnetron\nbookman\nfastback\nnulla\nmoonee\nsieber\nfaz\ngibbes\npattie\nmahakali\nlanyon\nmasatoshi\nfamiliarise\nhooft\nugliest\nnumeracy\nformic\nslaton\nplaguing\nanquetil\narra\nstoryville\nescapade\ngoosen\nmadoc\nwetton\nmckelvey\ntourisme\ngrandad\nlameness\nschum\namoroso\nesrb\nglasshouse\ngiffords\nchianti\nrustenburg\ntoshiaki\ntechnocracy\nfeuerstein\nbartlesville\nselborne\nahuja\ncameramen\ncmf\nbioluminescent\nresurfaces\nlateralis\ntransients\ntriffids\nluhrmann\nculkin\nbhagavan\ncalvino\nlapa\nkensuke\npantai\nrost\ngim\nplummet\nmillan\nzaccaria\nintercommunal\ntobi\nbrage\nlorrie\nsamira\nstereophonics\npco\nlenard\nneyland\nethology\nhomestar\nalbie\npotsdamer\nnue\nmironov\nbangka\nconlan\nweekender\nmoxy\ncecchini\nargumentum\nyaroslava\nokumura\npinsky\nsalmo\nwalkinshaw\nkermadec\njayden\namelie\nneots\njedlicka\nbaltica\nordinaries\npae\nurethral\nlampert\ngrigsby\nnahua\nsieger\nhaphazardly\nceleb\niredell\ntanu\nwhiteboard\nkhachaturian\nkaspersky\nmyskina\ntrumbo\nwaterton\nlegalisation\nlamellar\ndeangelo\nkirton\nundiagnosed\njcr\natk\nquarrelled\nkotzebue\nsaddleworth\nshockers\nshantou\nneoplasm\nlento\nitemid\nbarringer\nrist\nteletype\ncambuslang\nbricked\nexternalities\ndilma\ndeira\neyelashes\nsmokestack\nunleaven
ed\nbotero\ncabezas\nacv\ngaster\nmasta\nstunningly\nsaenz\nviator\nbondholders\nfleisher\nmande\nnasiruddin\ncampbeltown\nconnectedness\ncaballo\nbramhall\nreubens\nseabees\nmannes\ngoodspeed\ngreymouth\nforego\nverdant\nballance\nlundquist\nreinvigorated\nmaoism\nsawada\ncolloid\nbtec\nplainmoor\nsiad\nequivocal\nssk\nfinucane\nwannsee\nabdullayev\npilates\nsamling\nidriss\nkashif\nhundredths\nethane\nashwini\ngershom\nbodhisattvas\nvespucci\ntranscriptase\nsabotages\niow\ncrosstalk\nmbeya\nzapped\nitamar\nmephistopheles\nwoolmer\nrunnels\nharoun\nminding\ncharacterises\nanat\nynetnews\nreprogrammed\nholyoake\nimplored\nemeryville\ndziennik\ncanny\nosada\nkawanishi\nnourse\njudit\nlareau\nstrathspey\nspo\nskerry\nmoj\nkonstanty\nnica\ndispelled\nreiche\nmemri\nhazelnut\nkliment\nbunty\nhalachic\ncalabasas\npreservationist\nshiho\nohs\nawoken\ngrosset\nakureyri\nmultiplies\naew\nstatesboro\nenfranchised\nshemp\nscolding\nnewley\nsavagery\ntillage\nexpander\nnbd\nshriek\nshimura\nvoll\nkawa\ndisorganised\nillya\nnaftali\nhomi\nmillfield\nturkistan\npetroglyph\nlondo\nandranik\nhamline\ncoders\npangolin\nextendable\nbundelkhand\nkanga\nequatoria\ndereliction\ndenholm\nleurs\ncrosswalk\nelectrochemistry\nalenia\ninterpolations\ngioacchino\nvalen\nasexually\ndialup\nrestating\nchappuis\nbudi\ndimanche\ngrantees\ntightness\nmatheus\nanesthetics\njupiterimages\nstupor\nipn\npieters\nmagia\nkuku\nhiked\ninfoworld\nstange\nasce\nhypoxic\nfaring\ntheon\nnovodevichy\nracetracks\nleventhal\nencroachments\nmourad\nsantamaría\nbaskervilles\neasyjet\ncantu\nbraemar\nvisualizations\npithy\ndelias\nbutkus\nstumpings\nforgetful\nnikkor\nhiccups\nhoratius\nderisively\nwurm\nsuzaku\npiro\ninstitutionalization\nsugarland\ngrandison\ncarstens\nautumnal\nsteger\nvalenciano\njinzhou\ncornelio\nregen\nmeller\ncostigan\nsanctionable\nconv\nidleness\nbombo\nannaba\nfastpitch\nifrs\nreauthorization\ngallaher\niñigo\nlachapelle\nmagen\nperdition\nhauschka\nbleep\ngarr\nsilene\nmadlib\nburstyn
\napses\nkania\nneige\nsueños\npulsing\ntorrijos\nady\nparadoxa\ntapeworm\nacehnese\nchipper\nsolomonic\nnasik\navelino\nholl\nhuffingtonpost\nboucle\nhalicarnassus\nmigraines\nemas\nesu\narchbold\ndriller\nmayorga\nunh\nspiking\nphillipps\npoesie\nmotherfucker\nwattled\nemmitt\nsood\nmaillard\nroto\nwrested\nberke\nmycenae\neichstätt\nsherds\nparaglider\nbulfinch\nods\nptp\nshiel\nrahway\nglória\nthievery\ngratz\njajce\nouzou\ntaxidermy\nmachinists\nlumina\ncerna\npombal\nyoumans\nhyndman\nglickman\ncleon\nvls\nhyrum\njer\narend\ngravels\nadab\nandersonville\nowyhee\nwingtips\npowerplants\nkuei\nstrychnine\nascanio\ndaguerreotype\nrennell\nlymphocytic\nbasara\nyura\nvandalisms\nhota\nbayhawks\ncasimiro\nirritability\ntamang\nbenalla\nshredding\nthorson\nmarra\nsamper\njansch\nmikhailov\nsavoir\nkillam\nmitton\nickx\nimpairs\nrejuvenate\naira\nspitalfields\nsingletary\ntreks\ngoga\nunwed\npilla\nsegrave\nknollys\ncoauthors\ncdh\nmez\nbiomolecular\ncrustacea\nodm\nallmendinger\nstratagem\nparaiso\nfernandina\nburro\ndisintegrates\nsapa\nfdny\nanges\nlinseed\nslv\npluvialis\nallu\njaki\nfuturists\nepm\nsafeguarded\npreez\ncorson\nkiara\nunsportsmanlike\nharmondsworth\nmoravec\nsarat\nveeck\nhentschel\nbernays\nvincentian\ngamed\nsyllogism\nivanovna\nnaw\ngoli\nlautner\nastrocytes\nmakarios\nsaragossa\nlightyear\nrationalized\ngls\nblameless\nharpenden\nwillingdon\ndrews\nxmpp\ndemote\napricots\nbarrhead\nslinky\nkamov\nshikari\nbounties\nnorthcott\ntul\nburchell\nbootable\nsantis\nkonjic\nunsanctioned\nsevan\ndroitwich\ncorrector\nsull\nsivasspor\ndrovers\npierces\nspotters\nperpetuation\nnistelrooy\nleyenda\nlenawee\nbiofilm\nzadeh\nhamedan\ntakis\nligo\nahmar\northographies\nwardha\nddc\nvigils\nniklaus\nkresge\nattics\nivins\ncustomizing\nabhorrent\nunf\npolynesians\nlancs\nlubricated\nshik\nkgo\nzlatan\nhollingworth\nriffa\ncalmness\nkronor\nziggurat\nsandino\nsiskin\ndysphoria\norloff\nupwind\nmadcap\nhuainan\nrepudiate\nmongkut\nhokusai\ntrastevere\nneptun\nsieb
el\nupu\nmoorer\nlieutenancy\ngrech\ntilts\nprestel\nkarta\nasakusa\nmimosas\nswarovski\nroadies\nconspiratorial\nmrm\nfloridians\ncomplacent\nhonka\nnazarov\ngluing\nslavers\nlongwave\nneurath\nphau\nstreit\nrma\nstorie\nrefilled\ndisappointingly\npapel\nasiago\nranelagh\ncodeword\nkab\nwagering\nmitsuki\nvaccaro\npayoffs\nobrero\nmangold\norrell\ncamra\nsauvé\ninoperative\nredmen\nscavenge\nsexualized\ndownplaying\ngrazie\nbackbencher\noutcropping\npathans\nokra\nniang\nstablemate\nfreightliner\nbienal\nbartosz\nnaiad\nstroh\nvoci\nbruning\npretzels\nwiebe\nlonestar\namplifies\nharb\naudiophile\nwalkabout\nwari\nintramuscular\ndagupan\nriad\ndroopy\ndzongkha\ndelightfully\nmatosevic\nperpetua\ncemal\nappreciably\nwarranties\nthoughtfully\nomnivores\nsteinfeld\neufaula\nukiah\nnichole\nnucleosynthesis\nclench\nballgame\ndeterminate\nbackpacks\nosbert\niagainst\nmagnificently\nnagato\ninadequacies\nherded\nbyelection\noppressors\nhartwick\nbraveheart\nqueueing\nsnowed\nmarcha\nwretch\ngalán\narmpit\ndastardly\nibni\noffloading\napatow\ncowdery\nschoeman\nirrefutable\nfarhat\ncompleat\nsidra\nbrasiliense\nfhwa\nvian\nbutthole\nheilman\npardes\nrockbridge\nmallarmé\nunrepresented\nconsummation\npenalize\ncorduroy\nprot\nzgorzelec\nferox\nnyer\noreille\nanciens\nnappy\nacbl\nmrf\nraed\nreichel\njacq\ncordia\ntikhon\njérémie\nbonnard\nkaddish\nsmollett\nwrongdoings\nblankenship\nhedonism\npangea\nneccessary\ngalata\nglenavon\nobjectification\nsisto\nnickell\nbecasue\nmurree\nbrinker\nelectromagnet\nrhodope\nhistoires\npacifico\ndabo\njalen\nfavreau\nmackerras\nsilting\ngeest\nkirkbride\nhypocrites\nguttural\nmdf\nkooks\neuridice\ncoulsdon\nliebknecht\niwai\ndowngrading\nfib\nisopropyl\nsaluda\ncecafa\nlegibility\nogura\nhenshall\nmnc\ncott\nantin\nbacklot\nrebeca\nscotian\nhaircolor\ncarignan\nswissair\nyumiko\ncoalville\nmilch\nvoluptuous\nnormalizing\nmaranatha\nprijedor\ndevvarman\nsweetie\ndrapers\noac\nbinky\nchoa\nwhs\nvinifera\nsynchrony\nbachir\nbayerischer\nari
sh\njammin\nleilani\narchaeologically\naeolus\nscottsboro\ncarters\nakerman\nkonta\nnevermore\nlegaspi\npud\nsach\nlaypeople\naggregators\nvictoriano\nrambles\nury\ngerland\nmolesting\nvirchow\nflamborough\nhoodlum\nbinay\nsquabbling\ngushing\npelted\ncrematoria\neru\nunbalance\nsustainably\nperky\nprettier\naguadilla\nfarces\nizard\nalmora\nthiam\nkhaleej\nbeekeepers\novechkin\nkurupt\nsavard\nhilmar\nmartelli\npend\nwonky\nbenita\ndeller\nbalaton\nwebsphere\nleitao\ntumbled\nmeira\nmilroy\nunu\njamali\nwesternized\nnamesakes\nharrod\nobelix\nshafiq\naeration\niryna\nchattel\ngametrailers\nallister\nzaleski\nluscious\nwaa\ndigression\neiger\nnegroponte\nongar\nifn\noverdone\nasclepius\nsaadat\nsociolinguistics\nsolidity\nmarcellin\nwooly\nrheumatology\nfainter\npejoratively\nqual\nâ\nasam\nayyub\ndoggie\nlongue\nopelousas\ntaxicabs\nsiracusa\nnaseeruddin\nstowaway\nlancasters\nlistowel\nberns\ninvestigaciones\ncornus\ndodged\nrhetorically\ndisbursed\nmontepaschi\narthouse\ngudgeon\ngantt\nsuspenseful\nbeppe\nmahe\nsubmerge\nalkanes\nmercalli\nnogent\nlivers\nbotetourt\nmoulder\nmarand\nlakehurst\ndeflated\nrectifying\nhandlebar\nroadbed\nobliges\nmaroni\nautocar\ndressmaker\nhitless\nguayana\nhenríquez\nwooley\nayat\npnas\ntrashy\nnantou\nbeaufighter\ntruely\nsoothe\ntejon\nschaff\nilfracombe\ntda\nheadcount\ntapp\nspikelets\ntechnics\nsledding\npontoise\nsewa\nnewall\nrhd\nsaltash\nmauldin\nweyland\nschall\nalmoravid\nmico\njeopardized\ngerrymandering\ncharité\noverpasses\niveagh\nloathed\nwehen\ntamla\nseppuku\ntayo\nblackthorn\ncoronations\nlha\nemms\nselke\nsweatshop\nsudarshan\nbalakirev\nauriga\nvex\nknell\nchippendale\nbifurcated\nseiichi\nfalsifiable\noverdub\nfalck\nalamgir\nlior\npandanus\ntuque\niter\npresbyteries\nadjacency\nlamine\nmothersbaugh\nbüchner\ncoghill\narita\naraucanía\nsinglehandedly\nvallee\nsalacious\ncolonialist\nviviane\nkojak\nfibak\nalevi\ndelink\nekta\nhomely\ndiosdado\nwistful\nmirv\nuff\nbuckmaster\nhorvitz\ndisinfectant\nburana\ny
oshioka\nhely\nherre\nege\nschwank\nlawal\ncantorum\nzao\ncruden\nwenbo\nmemorialize\nkian\nwomanizing\nnotionally\nhagel\nbaro\nhelpdesk\ninclusiveness\nlaffitte\npinang\nchaya\nwfp\nreceivable\nachtung\nayaz\nvirago\natriplex\naccelerations\nfeaturettes\nquadrupole\ngodspeed\nveliky\nsahil\nkottke\nprolapse\ntarantulas\nstrath\npetermann\nwebspace\nlegates\nmasturbating\ntomoya\nisb\naquarian\nmidgets\nautocorrelation\naugur\nintraocular\nsomatosensory\nleachman\nheras\noverman\narliss\nakhbar\nanalgesics\nevaporating\nkirtan\ndiscordant\nhassanal\nzellweger\nramah\nshara\nshanna\nbushel\njedburgh\nparken\nmurti\nunsatisfying\npoème\noverhauling\ntsars\nkhiva\nunderfunded\nherbalist\nojibway\nsdt\nandropov\ncellmate\nscotstoun\ninvestigatory\ncols\n‬\nbarajas\nismaily\naddressee\ntrailways\nlachin\nsugi\nslippage\nnormed\nneeraj\nmandrell\nfanon\nriven\ndisastrously\nlingers\nrecant\nplasterwork\nputri\nteeming\ncasus\nhalina\ngairdner\nacasuso\ninterconnections\nquoc\nbadging\nphrenology\nexclusives\nkoan\nmacfadyen\nmoulana\nbatis\nhollands\ntimeouts\nyorba\nhardiman\ntvi\nsheikhs\nsammo\nandreotti\npickerel\nirreversibly\ndieting\ninsinuated\nclares\nequinoxes\nbukharin\nottaviano\ntarnishing\ntamalpais\njordy\nholford\nbutterflyfish\ngsu\navocets\narouca\nkampot\npdpa\nroseburg\nrafik\npeavey\npomegranates\nlaurin\nulica\nsimeone\nplatts\nmics\ndruitt\nthousandth\ntwitchell\nspessart\nscoble\nsmad\nkwik\ncomputerworld\ntigh\ndoorkeeper\ngoldfarb\nstryder\nretells\ncastellaneta\nautopsies\ncavemen\nxen\nfurioso\ncalcasieu\nargentia\ncannabinoids\ngrumbach\ndomenica\nsvff\nchiropractor\ntennessean\nrosse\ntelco\nalleyway\nthyself\nhawarden\ntzara\nbelleau\neverhart\nkamm\nbrittan\nsleepover\njimena\ninverters\nmois\ntithing\nshiraishi\nkesey\nmakara\nhašek\nevaporator\nuchiyama\nuttoxeter\nluminescent\nresistances\nmordor\nquayside\nbettie\ngaekwad\njunie\nsouci\nfarias\nminiscule\nwaffles\nipods\nbenzyl\ndisenfranchisement\nwardlaw\ntiong\nnoroeste\ntubulin\npi
ter\nmumbles\ncpgb\nconmigo\nahd\nbarc\nrepayments\ninspects\ndeion\nprehensile\ninvalides\nhijackings\ninsaf\nhimes\ngrrrl\nanimatronics\nberthelot\npodemos\nabruzzi\npeu\nquist\nantão\npudsey\nrefractor\nesteves\ndepositional\nhanzhong\ntaung\nghose\npenns\npedantry\ngizmodo\nrós\nstreetwise\nelrod\nshite\ndewolf\nkeflavík\ntolan\nmoosa\nnatty\nlatinoamérica\nblackfeet\nunripe\nnotley\nseance\nshaoxing\nmilgrom\nryback\ndalzell\npacemakers\nannabeth\nximena\nsru\nacapella\nlachman\nshigeo\nbopp\nhathi\nmuggs\nparsa\nmechatronics\nlosey\nmuffins\ncattaneo\nquatrain\nregretting\ncollectivist\nvajda\nruel\nyanina\ngribble\nroco\nnumismatist\nmedullary\nseasonings\naarons\nlucienne\nsécurité\nbullocks\nostinato\nmcr\nvillani\nwernicke\nsarbanes\nhamra\nsoter\nberto\nlompoc\nituri\nlittering\nworthies\nmfi\ninte\nextensional\ndomestica\ncomber\nlubumbashi\ntelematics\nhousemaid\nunicycle\nantihero\nagriculturally\nvaliants\ntns\npentameter\ngruppen\ndramatisation\nturok\njala\noceanus\nsele\naleem\nkoerner\nbannu\nbudo\ngranz\nbristly\nambani\ncompressible\nmoxie\ndii\ngouveia\nbeaming\ncuracies\njwh\nmolo\ncolds\nkiarostami\nenthused\ncayce\nflatt\nmaistre\narundhati\nseis\nnunciature\nminiaturized\ndepress\nrushen\nbilderberg\ndeitch\nfrio\ngamepad\ndeu\nkassa\ninexhaustible\nadsorbed\nlanyard\nalkylation\nmpd\ncamanachd\ngülen\nsycamores\nnampa\nspats\ntintoretto\nsfax\ntisa\nlabat\nfishkill\npittston\nopitz\nbalbo\njemma\nnoakes\npinnock\nadaptions\ntidbit\nfelicitas\nhippocratic\ncaterer\ndedicatory\napologia\nstared\naccursed\nvagrancy\ntracheal\nrhb\nsde\nmutoh\nsweatshirt\nonitsha\nkraemer\ntravnik\nramis\nchristoffel\nparenti\nmahanoy\nconfirmations\narbeiter\nuea\nsturgess\nlyles\nglutamic\noverhearing\nsisaket\ncaning\nasir\nbroder\ndrumheller\npolitika\ndownfield\ndefile\nkrakatoa\nbiffy\nfreakin\nmanakin\ninternecine\nrmp\naquitania\ntablature\nborrell\ninoffensive\nlins\namartya\nmtg\nkhong\nirritates\ncuerpo\nfisker\ncapilano\ndods\npapillary\ngambetta\n
propped\npasqual\ndacey\npwr\ncordier\nruíz\nwalhalla\nmenthol\ndecorah\nseers\ndodges\nquintuple\nsamuelsson\nexocet\nfania\nmedaille\nges\naxum\njayaprakash\nwriothesley\npensée\nreveille\nknossos\nvca\noga\nleatherface\nhuangpu\nshinohara\nzawinul\nidolized\nnatarajan\nmalarial\nsuet\nzama\ndaimon\nneater\nkomo\nkittredge\ndigne\npongal\nniners\nparatroops\noverprotective\nwhitstable\neutrophication\nbrushstrokes\natterbury\nrodrigue\nabounded\nyulin\nparatransit\nmakedonija\nmending\nmatadors\nnicolle\ngazelles\nsharmila\nflocking\nricker\ncommercialism\ntractable\nheider\nhedberg\ncircumferential\nmyc\nchristel\nfeathering\ntopos\nsearles\nhydrocephalus\ncvr\nermisch\nmcalester\nskagway\nenunciated\nbarbizon\nnits\nmillward\nsjögren\ngutmann\narauco\nterrill\ncomport\nthalidomide\nsauerkraut\nsocceroos\nsiaa\nfurler\natmos\nsymphonia\nsukumar\nstift\nyuuki\nsubverting\nphilmont\nscoutmaster\nblurs\nusatf\namaze\nsparkman\nfallowfield\ncason\nseit\ntilling\nlowenthal\nmapk\nvideoconferencing\ntormenting\nhaggadah\nundine\nfatigued\nbergkamp\nrosenkavalier\nfoner\namnon\nconfédération\narabism\nfichman\negged\nuighur\nhoagland\nwhitwell\nhandsomely\naltenberg\nbaldness\nscotto\nincumbency\ndrona\nchadbourne\nstraighter\ntortillas\ntish\ngeisler\nismailia\nrattigan\nnanni\nbairro\novershadow\ndebord\nenameled\ncompagnia\nbeato\nbaptistry\nemeric\nhalsbury\nundoes\njornada\nsva\nratchathani\nifor\ntillich\nsaris\npalfrey\npermeates\nblatchford\nlescaut\ncharioteer\nmeany\ncheuk\nceyhan\nflodden\nvink\nsupercouple\nbrittanica\nperennially\ntakács\nvalorous\nshortens\ntrini\nvoronin\nwnew\nbuggies\nzillion\ncombatting\nelicits\nshure\nkosygin\nhilltops\ncopp\nsquabble\nseaborg\nsbd\nhammerfest\nsahiwal\nplesiosaurs\nscreamer\npaquette\nmodeler\npmp\nharmonisation\ncally\noptician\nafterall\nbeaty\nredden\niphones\noverwhelms\napologising\nhechingen\ndaydreams\naphorism\nyenisei\nitaliane\ndurazzo\nbrawling\nkony\ntomohiro\nedgecombe\nbondy\nbreather\nconroe\nundoubte
d\nsigman\nopportunism\nulric\nsharecroppers\npoulter\ngema\nirr\nanelka\npilote\nskookum\narchi\ngreensand\ncollinsworth\nbacklit\nkazuko\nbodrum\ndreamlike\nvicissitudes\nseghers\nsunn\ntynwald\nconclaves\npneumothorax\nnsd\nozzfest\ngwadar\nchartering\ntolga\nuserbase\nchoong\ntelfair\nhegde\nembezzled\ndzerzhinsky\nreticent\nstatistique\nsemyonov\ngeneralities\nfalsity\nkaukonen\nexempts\npiñera\nwormholes\nusbwa\npassacaglia\nmongoloid\nmasterwork\nstrangles\nrepainting\ninvolvements\ninterglacial\nrigel\nkishimoto\ndimitrije\nmanoeuvring\nmonroeville\nkeach\nsuffocate\nweaning\nbocca\nschimmel\nupshaw\nkalispell\ncero\nmaseru\nmcaleese\ndorff\noversimplification\nboru\nzurab\nquotients\nétats\nebner\nalthing\nrony\nrattles\nnessa\nprather\nhellenes\nenumerates\ncrf\nscholastica\ncorroded\nconwell\nbretons\ndelacorte\nafn\nsammi\nunmoved\naish\nsiddiqi\nbolstering\noleander\nkodály\nrebroadcasts\nscrolled\nrecitations\ncarrigan\nestrange\nstroheim\nproliferating\ntepid\nwebm\nevarts\npairc\ncourtrooms\nsobotka\ntourneur\nverbena\nemanate\nbersaglieri\nhomocysteine\nbolognesi\nonerepublic\nlaundromat\nlochner\nenfer\nzink\nsubpopulations\northoptera\nfünf\ntetrapod\nfouché\nmetastases\npitfall\njoffe\ndispensaries\nzhai\nguerrera\nsmallish\nwasilla\nsweepers\nexpels\npron\ncoady\nryn\nvado\ncezar\nviscountcy\nswastikas\nbernanke\nvibrator\ngullah\nshaban\nbilston\nwinkel\nqw\nfach\nyenisey\ncees\npassim\nguidry\nkirchheim\nhotaru\ngjakova\ncritica\nipsos\nscreamin\nindicia\nakasaka\nalona\nmelodie\nleiva\nmovietone\nbaas\nblazes\nschilder\nkaho\ncanids\nbodmer\ngreenbush\nmélanie\ncrawfordsville\ntabi\ndermis\nevidential\ngaj\nlínea\nsavonarola\nlusty\nmasry\nstationing\nmelmac\nbrockhaus\nzilch\ninla\nroars\nneuwirth\nesri\ntawhid\nmanali\nkacey\nhoneycutt\nzrinjski\nreema\nagaric\nspottiswoode\nspacewalks\nchislehurst\nwinfried\nbisley\nmalnourished\nspiros\nbastide\nkranz\ninaugurating\ntouchpad\nphotoreceptor\nturnpikes\ncde\nlevitate\ncarolinian\ndinkins\nl
acson\nmallets\nskara\nreordering\nvasectomy\npreprint\nrogge\nhickox\nsecours\noppositely\nzilog\nglycemic\nthoms\nmykonos\nlathes\nlevent\nvasopressin\ngriot\nharlot\nextolled\nmuara\ntulku\ncampobasso\nhaine\nbrydon\ncasbah\nalii\nweissman\ngoreng\nrumbling\nefraim\narbus\nrollbacks\nleptin\narx\nwreaking\nostracism\nlevite\noverwatch\nriviere\nvoith\nnishioka\nlilydale\nkristie\nmillán\nbohai\nstipends\nbusier\nfrond\ntenderloin\nfermín\noleksiy\nstarkly\nfibroblast\nethnographers\nkazemi\nnlf\ncafeterias\nfazl\nschnitzer\nvergine\ncobbles\nisothermal\nbink\nrefreshingly\njoko\ncamcorders\nsólo\nshareef\nwilmore\nshivpuri\nmeli\nlaverton\nsteinar\nmakings\nblankenburg\nthunderous\nandrogens\ncatchments\naftonbladet\ninegi\nprecentor\nmerrion\ndreher\nwep\nestoppel\nrootstock\nbalasore\nepidural\nadelson\nskiffle\nadhikari\nleverhulme\nnurmi\nlindon\nmaughan\nbarquisimeto\njinks\nrepatriate\nbandaged\nsiebold\nserenata\nairco\nrishikesh\nheckling\ncolumbarium\ncauvery\nyass\ntimeslots\namet\ntalc\ntreader\narchambault\ntheorised\nphang\nsoulless\nostrom\nmcminn\nwatchtowers\nkomal\nidps\nhothouse\npopulists\nquel\ndowntempo\nelectroweak\nsarala\natx\nmullioned\nnormalcy\npremarital\nnovena\nhagerman\nschillinger\nwhc\ncryptozoology\nclinicaltrials\naussies\nrafsanjani\nbrix\nsangeeta\nrivkin\ngemmill\nheppner\nalliteration\nbakhtiar\ndanceable\ncharlemont\nbyfield\naccidently\nconglomeration\ndni\ncurates\nabierto\nakademik\ncapetian\nsecord\npeden\nultrasonography\ngude\ndenpasar\ndumitrescu\ngiller\nfaryab\nseyed\nbaranov\nantico\nsequeira\nleverages\nslinger\nstolp\nburtt\npersico\nhooch\noccidentale\nmonaural\nhockenheim\naltaf\noer\nfantastically\nhashi\npetrosian\nfunke\nyoshitsune\ncapron\nbassam\nnewsboys\nslinging\nuncultivated\nrobina\nbrack\nalbinoleffe\nvaporware\nmoravians\ndifferentially\nhassel\nfilo\nairi\nitchen\nuneconomic\ncharron\nburchill\nhabibullah\nrustavi\nshee\nhollie\nbalestier\nhrithik\nbordello\nvicepresident\nballymore\nsaddler\nwgs\
nkoryo\ngiamatti\nlessard\nbardem\nnaushad\nedinger\nmiddlebrook\ngirvan\ntherefrom\nskeffington\nfingleton\nschreck\nphy\ncontrarian\noverhand\nerina\nfuld\nchabrol\npiercy\nfynbos\ncookers\nauge\nbracewell\nmuerto\nvoivodship\nsef\nfarrakhan\nwickliffe\nmmu\ndétente\ngoodluck\nsacrilege\ncounterexamples\nfana\nbuttery\nretiree\neigen\nyordan\npeterman\nbookworm\nsmithii\nsupertramp\nsaree\nsaburo\ntreatable\nkearsarge\niat\nadressed\npål\nbarreiro\nadsense\nbatum\nexegetical\nomari\nmultiethnic\nclassifieds\nmeatball\nsinned\ngebhardt\ncorran\nungrateful\nhallen\nshriners\nedgefield\nblaue\nlpo\nepicurus\nscaife\nreasserted\nviviani\nsiler\nmuntz\nintercooler\nskydome\nmulliner\ndeceiver\nmicrotonal\nwaupaca\nsubramanian\njoos\nhaffner\nabaco\nmbi\npersevere\ndhx\ndragster\naboveground\nguillotined\nmcneal\nwinterbourne\ncrystallize\ngroupie\ntoler\npär\nderegulated\nblazoned\nhonecker\ndampen\ndiminution\nenema\nstorehouses\nacetylation\ntelephoto\nfaired\nmedevac\nwesterberg\nkallio\nstonehaven\njamar\nhuard\nmti\nbaldomero\nbrixen\nemanated\npironkova\nwaff\ndomesticus\nsulfates\nanika\nailes\nprocol\nsavarkar\nkasur\nseaham\nsmita\npizzo\nartificer\nperdu\nbelhaven\nmanti\nradicalized\nmycoplasma\narmories\nfavela\nhamersley\nsociopathic\nfels\neckart\nsufficed\nblockquote\nmoorman\nburstein\nvorster\ncropland\nandong\nmisdiagnosed\nsnubbed\ndevastator\nfarnum\npickpocket\nbeekeeper\nhohhot\nroared\nphytophthora\nunmistakably\nestienne\ngenista\nrohingya\nnieder\nrayna\nnemechek\npolignac\ncheeseburger\npadmanabhan\ntripling\nkasa\ngoucher\nweirdly\nsujata\nkalmykia\nmarcial\nheralding\ncwgc\nspacesuit\nalveoli\nwakamatsu\nschulenburg\nannamalai\nsogdian\nrieger\noettingen\narkell\nalkalinity\nabelardo\nloge\nharbourfront\nspedding\nfcb\njacinta\nfilbert\nmilkins\nfilius\ncarom\nkyd\naop\ntaskbar\nschoolroom\ndunya\nbirdcage\nynet\nhymnals\niwm\nyegor\nbattlegrounds\nhilfiger\nhif\nreordered\nmgh\npervades\nsubpar\ntablas\nnavid\noutrages\ntiv\nsuzan\nmagnolia
s\nromanovs\nreisen\nmanami\nbonobo\noriana\nansonia\ntunics\nspongy\ntonite\ntinder\nsharpest\nmilsap\nadoring\nmajewski\nmedved\nkoreatown\nhagood\nhoneybees\nexaggerations\nkookaburra\nfiendish\nsubroutines\ncorporals\ndevgan\nloddon\nantifreeze\nhci\nmaritsa\ncathleen\nsderot\nlicensor\nboner\ndeterminer\nobamacare\nkreider\nkarditsa\npivots\narpeggios\nhercog\nmaurits\nboxwood\ncalving\nanouk\nprognostic\nfutterman\nliebermann\nelectrotechnical\nmetuchen\nurbina\npoiana\nresende\navifauna\nethnocentrism\namendola\nleathers\nflawlessly\nvandana\nsaurabh\nchoco\nyamashiro\ncalne\nkalahandi\nracewalking\nthung\nsturrock\nwinchelsea\nmagnetized\nlazard\ncarolinensis\nsuzanna\nmacnamara\nbrittney\nuntidy\naiims\nnoblesse\nspie\nflinn\npranab\nbabbling\nonomatopoeia\ntaiyo\ndadi\ntrashing\nbacklund\nvini\nelles\nbadenoch\nbartok\ndeepens\nscintillation\nupm\ngoolagong\naskar\nyandex\nnsk\nwaynesville\nfootballs\nvorontsov\nelses\nmolt\nbrasserie\nbushmen\nmacarena\ntravelogues\nthet\nbarea\ncuthbertson\nwaddle\nirbid\ndividers\nflotsam\njame\narby\nvanguardia\ncivico\nphanom\ndoylestown\ngrue\ngaea\nloit\nrajdhani\npvs\nkiama\nespino\nexertions\nbarrick\nanjos\nrobichaud\nstationmaster\nkalimpong\nmeech\nverger\nclatsop\nrosado\nhopkinsville\nadminstrator\nzippy\nblore\nnvc\nalliterative\nsuperheroine\ncabra\nhydrides\nsylla\nmurrah\nhagedorn\njupitermedia\nvolante\nsilvanus\ngalan\nkandel\nundeserved\nschaffner\nmanco\nbeeswax\nsro\nsoliman\nagressive\ndanae\nohana\ngratuitously\nbuntings\nalcaraz\nkatona\nwalzer\nshirdi\ngarish\nboonton\nerrico\ncachet\ntoute\ntoboggan\nyigal\nexterminator\nmonophosphate\ncorsets\nwhatcha\nexotica\nstapp\nnettleton\nmaxfield\nrustling\nkeeble\nbrics\nlerwick\nsardi\nkroeber\nintaglio\ngravitationally\ncrossman\ncoby\nflay\nrrc\namanullah\nsinker\nlitigant\nregni\npanj\nkühne\ngalvez\nbacsinszky\nrepentant\nmajdanek\namini\ncraggy\nrecurred\ndecoratively\ncasserole\nappaloosa\nanxiously\nnfp\naldean\novergrowth\nhuertas\nmongrel\ncr
osland\nreductio\nwintergreen\nbavarians\nbaire\npdi\nsilversmiths\namanmuradova\nmcglashan\ngiovan\nguedes\nferengi\nreenactments\ndowson\nlicencing\ncessnock\nregrowth\nschoolhouses\nabductor\nanxiolytic\nmaritza\ncrosswords\naikawa\nneurogenesis\ndaimlerchrysler\nbayle\ngso\narina\nbfe\nbuttressed\npeintre\nflipkens\nvilanova\nbosko\nrochefoucauld\nlimbu\nhaptic\nivens\nfredriksson\ndoused\nalfonsín\npiller\ncmh\ncontrabassoon\nyurt\npageantry\ntherion\ngorno\nkorchnoi\nobote\nodonata\nbaeza\najanta\nbrickman\ncrl\nwelford\nyorks\npalouse\nsufyan\ngyeong\nkennelly\nmisdirection\nlatrine\nkingdome\ncarer\nmohicans\ngethin\ndentin\nswayne\ngarou\nporthmadog\nbrandis\nestrogens\nerichson\nredeemable\nmorillo\ncurnow\nfanta\nhadleigh\nländer\nonega\ndefibrillator\ngiger\nriotous\ntinie\nunleaded\nmagnifica\nembrasures\ntsm\nauriol\nfortifying\ngavrilova\ncheater\nmonopole\nsisyphus\ntamerlane\nbiberach\nnagin\nlaboratoire\narcola\nhibbs\nmabini\njcb\nbacardi\nnorthamerica\ngili\nunimpeded\nlingen\ninvalidates\ntallmadge\nfoa\ntussle\nqarase\nventre\nagglomerations\nsirhan\npilsner\ndairying\nbigby\nmaclellan\nmassawa\nuncritically\ntommi\nexempting\ndilly\nsalesperson\npurdie\nguillem\nsheathing\nrhododendrons\nelit\ntrollish\ntorvalds\nngawang\nimpulsively\nbarberton\npipestone\ntorp\ncatalonian\njaggery\ntrekkers\nsilverdale\nfooters\nlamartine\nmanto\nfarul\nopossums\nqassim\ndinghies\ndruga\nadami\ninfest\ndiced\nfermilab\nworkhorse\nquelling\ndissect\nchanna\nplaisance\nhybridized\nacropora\nmiddlewich\nbreathy\ntraugott\ndioramas\ncobi\ndaemons\nfairings\ncordeiro\nudit\nstacie\nbitty\ninglot\nkirkcudbright\nduguay\nmilovan\nwykeham\nwallerstein\nbuffa\nturbans\ngladden\nreprogramming\naconcagua\nshunning\ntalaba\narsenide\ndanna\njesters\nrnr\nkwinana\nheinie\nbusk\nkickback\nmansehra\nzemeckis\nutu\nleavers\nspanglish\nwoodgate\npetunia\ncockayne\nzaha\nsnowflakes\nextinguishers\nquickening\nszymanowski\nmethuselah\nbso\nsats\nturcotte\ncassegrain\nsaura\nnew
sagent\nshiels\nrodd\nsolidago\nshackelford\ntriangulum\nburnsville\nvirologist\nfgc\ngravitas\nprefontaine\nshaab\nsimbad\ngrowls\nbetamax\naccompaniments\naiguille\nevelina\ncuarto\ngromit\ningles\nvem\naswad\nemaciated\ntrackways\nkirishima\nberlinale\nqn\nimperiale\nbaldassare\nespacio\ndobbin\nmoronic\nimprisons\nsteeles\nunmasking\nbregman\nkurata\nreichardt\nhatt\nclubman\nespanola\ndetentions\nymmv\ndoheny\nwid\nfava\nstubblefield\nberlet\nappiah\ntropospheric\nrathmines\nhelmeted\nbirchall\nwhoopee\nanticonvulsant\nceramists\nfagus\nnadim\nlauria\nofficeholder\nobfuscate\nkcbs\nteresita\nsarang\nscruffy\ndecoupling\nwiggum\noculi\nmenor\npuddles\nnasu\nmckagan\nranunculus\nbunning\nwasa\nmoonbase\nhandloom\nlipinski\nspoilage\neivind\nzer\nfreethought\ncelebs\ncheon\nfekete\nbullough\ndorval\nbuchwald\naches\nleclair\ntensas\nqueensbury\nbronxville\nmnr\nepfl\noutpaced\nswayamsevak\nmudie\ndrouin\npeaceable\nfermoy\njarmila\nczerny\nprivatize\ndrowsiness\nminamata\nterrestris\nflints\nzakariya\nemphases\nruptures\npiezo\ngaramond\nbarbiturates\nrecede\nghazipur\ncadherin\nmanioc\nmkii\nhanke\nanupama\nniacin\nbagger\nestoy\nmetastasio\nsissoko\nleapfrog\nvoorhis\nkeaggy\nfroude\nguanyin\ninheritors\nbeath\npunctatus\nqabala\nzhongguo\navco\nbirks\ncassis\nmarth\napprehending\nhornish\nartista\nfactiva\nponcho\ndtl\nsilted\napennine\nbywater\naucoin\nsailfish\nvitagraph\neugeniusz\njol\ncrewmember\ntappara\nwombats\nfrankl\nesser\nvideoclip\ngilley\npayam\nladbroke\npowerhouses\ngaveston\nreacher\nkhurram\nexpats\nhuish\nunintelligent\nuracil\nlanegan\nsardou\nokun\nreaffirms\ntsh\nsedgley\nschlieffen\nsssis\nsaath\nmckie\nvainly\nineptitude\niop\npubis\npythias\npodolski\nnodule\nheaths\nguez\nipads\nnello\nintellivision\nsette\nporcupines\nbiomes\ndoody\nbarceló\npetrich\ndamselflies\ngangtok\nhatoyama\nramzan\ngrog\nschleck\ncréation\ngalop\nmicha\navantgarde\npartha\ntorfaen\nkazoo\nmatric\ncrock\ngreely\npanola\navedon\nbargained\nroomed\nzandvoort\nval
vetrain\nmeddle\nargenteuil\nunafraid\ntkachenko\nelman\nsealer\nmaxentius\nomori\nrelegations\nsadri\nchristabel\nswanwick\nlamarcus\nceu\n‐\numeda\ntaisha\nbellville\nhotdogs\nparodi\nengin\nrosewater\nkinkaid\nkaczynski\nblakeley\ndildo\ncalvet\ntumba\npingree\nvögele\ndespairing\nsupine\ntarts\nrelaxes\nkosuke\nbasc\nnatt\nruspoli\nkarvan\ninde\nsuppressive\nmacey\nvideographer\nteufel\nslimane\ncradley\nrop\nawf\ncruse\nboxcars\nkrom\ncoch\nendothelium\nsilke\nsaraswat\nscheel\nforemast\nanar\nsigny\nbirgitte\nmanji\nemanation\nmorteza\nflamethrowers\nbourn\nghatak\nfossilised\nmatsushima\nwindisch\nosan\ndrummed\nhasnt\npatties\nbacher\ngladiatorial\ndeterminative\nrevulsion\ncagle\nkolyma\neducationally\nerma\nfrag\namirabad\ndiao\nmislabeled\nbasler\nmultibillion\nhayton\nkanwar\nundressed\ncml\ncalero\namaterasu\nmonopolize\nclings\ninterregional\nbarrelled\nmottola\nwidmer\nabaza\npowershot\nyahia\neichhorn\ntpi\nhinchcliffe\nnibali\ndoby\nsayan\ncollen\ntweedie\ntonka\nlias\nbalfe\nweeden\nadenoma\npushkar\ngreenslade\nbramlett\nelyse\npieds\nsurratt\nmegaliths\nbugge\nconcho\nexonerate\namoy\nberggren\noverdoses\nannalise\nburridge\necl\nvroom\nhsn\nrenu\nbostic\nbiggin\ntelemarketing\nhammarskjöld\ncavani\nbedser\nwhorf\npiana\nredesignation\nleonardi\nupsala\nhout\ndevoe\nniterói\nunderarm\nlafourche\ntriune\nloeffler\nkomuro\nphospholipid\nlilliput\nrabb\nhausmann\nboros\novershoot\nmumm\npitzer\nlothrop\nharuko\ndougall\nbertens\nsaltonstall\ngoffman\nvamps\ntrackway\ntopalov\npowerplay\nmiloslav\nrepented\nincubating\nkardinal\nmacías\nsnep\npva\nmsd\nhypervisor\nbahauddin\nhedin\nhashomer\nbevilacqua\nradicalization\nbraschi\ndisqualifying\nmadisonville\ngwan\nimbruglia\nhatley\nbiafran\njohore\nkatra\nballpoint\ngelora\nabbès\nwani\nsyl\nkhurana\nhighpoint\nanchovies\npublicists\nrioted\ntaino\nsemmes\nfalciparum\nlarcher\ntautological\nmccarran\nuscis\nbrooksville\npcw\njizya\nlupi\ntaza\nschottky\nlcr\nankeny\nlingnan\nmilagro\nrenews\nmicrogram
s\nswaminathan\nhammersley\nclarisse\nfabri\npaderewski\ncappuccino\nholter\ncch\nsagi\ngrisman\nlupu\nbatiste\ngrodd\nclute\nuac\nvibrio\nimpetuous\ngallantly\nsegway\ncoolgardie\nhued\nkarsh\nghb\nshanklin\nscriven\ncrusty\ndrumline\nhydrogenated\noid\nmetacarpal\nzeitz\narruabarrena\ngoldbach\ntannen\nprosecutorial\nhayle\nfuthark\nsteinem\nwotan\nwattles\ncovarrubias\nanticoagulant\nmorice\nchavannes\ntomei\nbhairavi\ncrn\nupshot\nriina\ndroste\nlgus\njuozas\nrigors\ngrossmith\ncarats\ndefrauding\npáramo\nventnor\ncursus\nsprinklers\nershad\nextolling\norgel\ntraceability\noptimise\namel\nadores\nwunder\nwishbones\nprotégés\nmacdiarmid\ndramedy\ntreeless\ndolla\nscrutinised\nsanson\nskt\nverapaz\nthéry\npollo\ndawned\nuntoward\npoornima\nengelberg\nmannan\ninitialize\nbachinger\nincumbencies\nclavicle\nleelanau\ngroan\ngerardus\nconniving\nchiquita\nmanso\npenumbra\nquiberon\nnatsir\nhugged\nhvar\njaromír\nvagnozzi\ndahlberg\nlodgepole\nescapee\ntetrachloride\ntsuru\nremitted\narsonists\nadmonish\nrhizomatous\nguus\nneurone\nperfunctory\nfreehand\nmorrie\ngristle\nleipziger\nfahy\nesg\ndrogo\nnuer\nkbc\nreplanted\nlariat\nsaporta\nraiffeisen\nalper\nfha\nkerensky\ntassels\nflashlights\nashfaq\npilings\nalessandri\nmonachus\npsychobilly\ncapito\nhallyday\nshuriken\nstürmer\nadmissibility\nlhp\nunpretentious\nsanjana\nbandana\ncrooner\nsidestep\nnogai\nllull\nlevenson\nissyk\nemoji\nchaffin\nvelikovsky\nboyish\nconca\nevangelize\nital\nbroeck\nhematoma\nrazan\nlii\neldritch\nsecreto\nrushworth\nbarbecues\nkalin\nviburnum\nniranjan\nschizoid\nzuko\nrangan\nangewandte\nobtrusive\nrapti\nmartinson\nlonghair\ncatheters\nstrumica\nmorass\nschmit\nalawites\nschlafly\nplaudits\nniobrara\nadelman\nwestman\nandijan\ntayeb\nquinault\nkilgour\ntroi\nlangridge\nbovina\nfanlight\nieds\nconformations\nheneage\nbueller\nhurlburt\nnevado\nald\nlamond\nbruneau\npenwith\nirshad\nmesmerizing\nseiler\nyoel\nhomered\nzookeeper\nmilken\nmortified\nmargined\nbosanquet\nsombor\nlabrum\ns
aadiq\nrausch\ncornaro\ncarcetti\ndyess\ncondescension\nquetzaltenango\nchâteauneuf\nsynchronisation\nzubayr\nannacone\nglucagon\nrosendale\nkaisa\nscrip\nfisheye\nricha\ntorrid\nyamin\nkazmi\ngannets\npomorski\npandian\nbetweens\nlinoleum\naedt\nloopy\npunchbowl\nplantains\nrotana\nlinnaean\naversa\nshiites\nundervalued\nchoses\nmabuse\nshrubbery\nmns\ngrimston\nhayami\nmqm\nhyphenation\nomak\ntrimmings\ntyger\nspica\nexecutables\nnatividad\nmaksimir\ndinsmore\nniemi\ninsuring\nlahaina\nofelia\nbloating\nvikki\ngeta\nfirecrackers\ndinero\nhoopoe\nnimr\nsourav\nbindi\ndibley\ndefamed\nmcbee\noggi\nbanbridge\nwheatfield\nderechos\nluitpold\nvegans\nsemiregular\ndiavolo\nepicurean\nchakras\ncorozal\nshimomura\nstromal\ncommercialisation\nducted\nmesas\ncheery\nnamba\ngosnell\ninvalidating\ncsar\nfawzi\ntalos\nfbk\nkoike\nsrv\nhydrangea\nhopson\nintervarsity\nsubramanya\nproudfoot\nlippert\nwachtel\nketcham\nvillainess\nmatsubara\nshannan\nturturro\nswearingen\ntrainwreck\nsitta\nnetizens\nsabers\nfouquet\nhillyer\ntopol\noverhauls\nglucocorticoid\nivb\ndamietta\nhekmatyar\noffensiveness\narnot\nsevers\nlipschitz\nuys\nfinke\npartook\npsk\njiabao\nsherbet\nlohner\nsaroyan\nmahabharat\nstereogum\nwmu\nrearmed\ndamavand\npinching\njesup\nzahab\nendorser\nrefunded\nreynoso\nselah\npinna\ncoppice\nalternations\npeels\nmaddow\nfinnegans\nprotrusions\njeou\nukhl\nbildt\nfpu\nbuckie\nrazer\ntrisomy\nettinger\nairdrop\nrytter\nnannini\nbanke\nespouses\nlach\naiff\ninducement\ntraub\nbeecroft\nrisque\nheadlamp\ngute\nfrickley\nasthmatic\npeeps\nlud\ncamerlengo\ndelroy\ncosmetology\nzorrilla\nrashed\nziyad\nharbouring\nlucina\nphillimore\nmisquoting\nlaboral\nviner\nairedale\nlmg\nkalas\npetoskey\nmillburn\nsayle\nhorsfield\nfreelancing\nuji\nintellectualism\nsenor\nfineness\nbagby\ndemonology\ndifficile\ndcl\npuerile\nlubitsch\nmerrell\nvittore\nmip\nzarina\nheilmann\ntourniquet\ngyaltsen\ntaormina\ntelmo\ncolum\ncallosum\nmisdirected\nniehaus\nexpounding\nbilaterally\nhistones
\nmcferrin\nmilkmen\nbelfield\nblotched\ncroatians\nauditorio\nfermor\nbutadiene\nderailing\ndeval\n,i\nlaudrup\nluscombe\npropels\ncolorcode\nmadureira\nmessick\npolyakov\npacey\nsremska\nemigre\ngocomics\njogi\nsymposiums\nfrac\ncollegiality\nerzincan\nfusions\nkailua\nbeed\nfracas\nheysel\ngirija\nmyeong\nsoloveitchik\nginette\nrottentomatoes\npostmodernist\nrazing\nzamenhof\nmukul\npediatricians\ntunkhannock\nperreault\neschenbach\nmianwali\naccross\nbolshoy\nizzo\nhandcrafts\nalem\nsleeveless\njerboa\nextrapolating\nkek\nprefered\npercutaneous\niliev\nmorven\nsnowbird\nferrera\nmeaney\nintubation\nbanishing\nvanities\nscheherazade\nalami\nabele\nkalika\nexpressiveness\ndistill\norihuela\noflag\nfujiko\nethnomusicologist\ndougan\nthrombin\ncaravanserai\nfabia\ndundurn\nroxane\npochard\nkcc\nbiddeford\nlayup\narboreta\nmacqueen\nainsley\ngoi\ndossiers\ntga\nbandeira\njoli\noppress\nsnoring\nuncompetitive\nclaymores\nterada\noverstock\ndevilish\nsilviu\nbator\nportola\npostural\nmisogynist\nzealously\nyaron\nstrikebreakers\nasadabad\nsimenon\nxinhai\nlawndale\ntillotson\nkühn\nbaksh\nclun\nsrinath\nailey\nhermansen\ntween\ndeneuve\nbitlis\nrup\nmalmaison\nlaxative\ndisheartening\ninfringes\nnuwara\natlantics\nflasks\nfredericia\ntumbleweed\napplauding\nmicroclimate\nyoji\nusi\npesce\nheusden\nferland\nbhoomi\nnicolaes\nalou\njabir\nbrearley\nblunted\nvenosa\nmugger\nzouch\npfeifer\nagl\nsantuario\ncomprehending\nblanchet\nlepchenko\nideation\nicw\nconingsby\nwhither\nshafter\nfabiana\ncloture\nsandžak\neardley\nappetizer\nloafing\ndaoism\nclu\nhelicon\nmauger\nfontaines\nimpersonates\njaguares\nspeedboat\ntsw\nfabienne\nmenino\nrenounces\nbarahona\ncommerzbank\nwheelie\nheineman\narleigh\nshawls\nthrottling\ndeconstructed\nthrowdown\naguri\nlmfao\npalabra\ndefa\nsld\namerigo\nsuna\ntrautmann\nmonocytes\nilchester\ntongarewa\nmunchausen\nlindow\njehu\nquorn\nambrosian\nvjs\ncompositor\nmunros\nservicios\ndotty\naccomack\nnajibullah\nbasanti\ntorsional\ntayloe\nbill
eted\nrpr\ndiuretics\nspoonbill\nmisericordia\ndetlev\nreaped\nconnellsville\nberating\nwestenra\nprudente\ncorporatism\ninaccessibility\nhoppy\nscw\nhomepages\nhybridisation\nkonan\ncomecon\nhedgerows\nints\nhmt\nassemblywoman\ntaba\ncesaro\nringleaders\nsperanza\ncavers\ncapitulate\nkober\nreivers\nlinford\nodsal\ncashing\nspeedwagon\nhikmet\ndisqualifies\ngeorgiou\njase\ndisposes\nequalizing\nstoppages\niskander\npostnatal\nestrus\nbayelsa\nblanding\nete\ntibi\njugglers\nexner\ntympanic\naldermaston\nsyros\ngigli\nsohr\nelectrophysiology\nhulked\nmenteith\nvegetal\nstriations\nantiochian\ngnc\nusain\nriessen\ndyffryn\nastbury\nmiah\nmonocacy\npapery\nmitzvot\nnavegantes\nichthyosaurs\ngratifying\nfairlane\nanaphylaxis\nlebeau\ndeewana\nbex\nviel\navarice\nsnowfalls\npotgieter\npesa\nmonosyllabic\nmoonlit\nvalmet\nhamsun\nyow\nbnr\nvacationers\nwyche\nmismo\ndesu\nderen\ninsp\nyanagi\nbookmarking\nmeatpacking\nmengistu\nnitschke\nitami\nparrett\nvpro\nfaia\nlevittown\nsubarachnoid\ntrelleborg\nnilson\ndiastolic\nprimula\ngonadotropin\nlcds\nchigi\ndbr\nkryten\npander\nconsigliere\neston\nfigc\nanschutz\nwiretap\npichon\nperdita\nharts\ntakaoka\naldebaran\nironed\ncordite\nbaps\nmontville\npremi\nvaleriu\nlandor\neam\nmerkle\nolivieri\ndaro\nleonore\nhocevar\nuppercut\nusui\npalas\nbirendra\nbothersome\nlestat\nkickstart\nsilja\ndoolan\nslu\naled\nnativist\nklia\nhertzberg\nandreyev\nhete\ndht\ngassing\nfengtian\nsonning\nitil\nolaru\nangulated\nahu\ngiudice\ngroupthink\ndard\ntsurumi\ngregori\nstrana\ncogeneration\nunruh\nkingwood\nqam\nnumbing\nmanan\niapetus\nparé\nbuc\nleery\nabhinav\ntokaido\nburberry\nskewing\nwadden\nrorty\nwagers\njil\nmunteanu\nogilby\ndorp\nwiig\nseedings\nresta\ngidley\nunderlain\ncasta\nshuttled\njealously\ncanting\nlincolns\nretrofitting\ndaeng\nmisao\nsorrentino\nsall\ncolwell\nthawing\ncalifornicus\nhornbeam\nbanditry\nyacoub\nkatori\nlymphomas\namaryllis\npanicking\nchillies\nbobbio\nschomburg\nalds\nqueers\ndork\nworshipper\nicky\n
schismatic\nppb\narchean\ngoverment\nwalliams\nscb\nnederlandsche\nsaami\nunordered\nchinned\nlitigated\nshushtar\nbansal\narredondo\nteleporting\nfrankenberg\njunji\nmignola\nzadok\natif\nwadowice\ngeico\npichincha\ncoto\nfurneaux\nunknowing\nilsa\nconfusa\ntagbilaran\npion\nkrush\nagouti\ntowered\nsimultaneity\nandor\njordaan\nkishida\negf\nctp\nkhoisan\nyaman\nrugosa\nunpleasantness\npinsent\njoensuu\naztex\nantifascist\nwatteau\nmillipede\npersonalize\nstonehill\nrepublish\nprongs\ngrünberg\nbnc\nfertilize\nrietveld\njoust\nschenley\nnavarino\nolefin\nleka\nakutagawa\nmetohija\nsequencers\nscoping\nflorissant\nbisexuals\nstupidly\ncrawfish\nzeppelins\nscabies\nabadi\nbdc\nmpt\nephrem\ntichborne\nwillian\neichler\ncommonest\nparthenogenesis\nturvey\nchronologies\nsqualor\nbvi\narawak\nbitstream\nmabbett\nsomesuch\nkole\nwayfarer\ntortola\nlaan\nleeb\ntamiya\nhappenstance\nhorford\nenchanter\npilon\ncoir\nbraz\nsnip\nwaycross\nsadan\ndigests\nzab\ngarib\nideologue\nirreligion\nsnaefell\nyoram\namphion\nembree\nannoyingly\nchuquisaca\nipt\nmaser\nclallam\nhomebush\nnadler\nmustn\nhayride\nquadri\nguid\nprow\ngpi\nvyvyan\nsignficant\nfidler\nhirayama\nsains\nstaubach\nlammers\nduffie\njjb\nwollheim\nforecasted\nkarnes\nerlbaum\ninsurgencies\nmilazzo\nstockley\nviktorovich\ncappadocian\nweisberg\nfallopian\nlautoka\nbelies\nrosenkrantz\niwakuni\nkrk\nkataoka\nroye\naspinwall\ndetaching\nkultura\nfahim\nobliterate\nbischof\nreseda\nhasdrubal\niko\njavelins\ntyco\ninnervated\nmanhasset\nrancagua\nmusées\nmaeterlinck\nadware\ntonelli\nbelmondo\nlatinised\nordinations\nkattegat\nmartz\nmarotta\narcing\nliquorice\ncalzada\nmatted\nmizoguchi\nsandoz\nkalutara\nannihilator\nicke\nelizalde\ninterminable\nzte\nchemokine\ntakoradi\nmargera\ntipper\npaedophilia\nmalar\nshearwaters\nhulks\nhammel\nflagellation\nnasheed\npardus\niolanthe\npyrimidine\nsweaty\ngreenbaum\nteodora\nshotaro\nstarsky\nyasuko\nsumpter\nbaptize\ndolorosa\nkakavand\nwhistled\nbett\nkheri\nfonseka\nvetoes\
ncloaca\nsubedar\nbiber\nsergent\narianespace\nseager\nwatsonville\nbiomaterials\ncutscene\nhandbrake\nshola\ncoffeyville\nkashmiris\nmaillot\ncronk\nletizia\nparalyze\ntwigg\naggravation\nnisqually\nmimetic\npiccard\nlebowski\nfreyberg\nahura\nquinte\njoyride\ngennadi\nneta\ndungarvan\nschram\nadua\nchesnokov\nhalve\nstaking\ntoivonen\nparkers\neckhardt\ntdm\nequalise\nreadout\ngaudet\ndta\nhomeported\nspeu\nbaeck\nsemenov\nreve\nlillo\ntoribio\ncontarini\npalani\nloman\nbrilliancy\nlnp\npisan\nhajar\nreasoner\nagonizing\nkatara\nintransigence\nhymen\nparalyzing\nafra\ndorrit\ntoombs\nbreastplate\nbulldozed\ndreier\ntimekeeper\nhirta\nabr\nglucosamine\narrhenius\nbacklight\namericium\nmeikle\nsaifuddin\nlandholder\nterese\npyrgos\nevangelos\nformulates\npipefish\nverdier\ntéa\nklinsmann\npushy\nboyko\nhammon\nszd\nbrossard\nxenomania\nblanda\ndecentralisation\ninterchanging\npappa\ncontextually\nundisciplined\ndisadvantageous\nbodom\nmutagenic\npistachio\nwilsonville\ndebora\nindentations\ndhahran\nhodgepodge\ndavit\nrulemaking\nsambal\ntozzi\nwillet\nfiordland\nsouto\nmdi\nkenmare\nbailando\nuncollected\npinos\niligan\nyantra\nshunted\nheifer\ntriennale\nnadel\ndarlings\nbiostatistics\npretenses\nonizuka\nbcn\nsepúlveda\nnini\nkonno\ndistaff\nsteelheads\nasli\nskyrocket\nbarta\nawolowo\nmorphogenesis\nflett\nwpsl\nshochiku\nmujica\nsubhadra\nmané\ndohrn\ntrotta\nkerkrade\njeremias\ncamerin\nimperialis\nleoben\nffi\nmasochism\nmusta\nlauterbach\nsaxifrage\nfus\nbeefcake\npestalozzi\nbion\nfrid\nfloristic\ntheophilos\nconceited\nsinge\nmacdonalds\nkhanates\nturpan\nbartered\ncoldness\nunrealized\nyadda\nsynergies\nroanne\nrebar\nbutchering\nlohmann\npiñata\ndelaval\ncontributers\naventure\nmateriality\nveniamin\nllwyd\npalance\nbirk\nsüddeutsche\nebc\ncliburn\nsartorius\ncahuilla\nimperfectly\nornstein\nimputed\npanne\nhaymes\nsiriusxm\nshawano\nupminster\necf\nadalberto\nborowski\nupperclassmen\nautocad\nsampo\npulis\nsolas\nlouisburg\nbouncers\nradboud\nbogen\ninc
apacitate\nporras\ntimberline\nrepurchase\ndanja\nbriefer\nlydda\nqawwali\ninterscience\naffirmations\nshiksha\nwhopper\ntallman\ncriminalization\nnitish\nepica\nremsen\nfisted\nlonghurst\nlavie\nsheree\nmalthouse\nretrievers\nswashbuckler\nshill\npronghorn\nbeaudoin\naydin\nzzz\nglib\nguilder\nginetta\nchito\nlolland\nmontjuïc\nunderweight\ndelimit\nandersons\nbwa\ndiedrich\ndivines\ndownsides\npaas\nstor\ngenerales\nkune\ndorota\naalesund\nmedi\nmodding\nhanmer\nwindowing\nforestville\nfoment\nsuperhumans\ndkw\nvillach\nhdb\nglendon\nlovesick\nzukunft\nnanoparticle\nwhitacre\nherzfeld\ndosen\nstabilising\nayam\nrickards\nwladyslaw\nzal\neval\nhistorias\nalmaz\nlongmont\nmidrange\nbilby\nunmolested\npownall\nhoneymooners\nampleforth\ntatham\nregistrants\nhelgoland\nmanzoor\nsumit\neubank\nkington\nstubhub\nprovisioned\ntweeting\ngambled\nscrambler\npreloaded\nlandsman\nshawcross\npavlovna\nmalady\nkade\nsakae\notakar\ndeni\nwracked\nabbado\ntulle\nglenfield\npottstown\norbán\nlustrous\nbridesmaids\nsaviors\nguttman\nlangtry\ntamas\npensive\nlumberton\nimmersing\ntmi\nmaharajas\nwatercolourists\nakihiko\nmitten\ngiuliana\ncéspedes\nvalter\nsonet\nplows\nrerelease\narar\nvezina\njerald\nutagawa\ndetonators\nsuleman\ngentleness\nkeothavong\nholberg\nzephaniah\ntriplett\nkazak\nsectioning\nsachdev\nfetishes\ncharnock\ndtp\nhcg\npartito\nalbertini\nwel\nguinn\njarmusch\nbleszynski\nomm\nnemeth\nunderpin\nascorbic\nhunnicutt\nbloodstone\nsvea\nging\nsanthanam\nbellis\nstrum\nsalvagable\nmeagan\ncoheed\ntelevangelist\ngundy\nwatermelons\nemeka\nvaught\ncreatinine\nlukin\ncapelli\njefe\ncabarrus\nmarano\nlimping\nladybug\nqais\nsolow\namalgamating\nhebridean\npratensis\nmoin\nsurinder\nnewsradio\nepigenetics\nheymann\nyari\nlobato\nhereward\nkulik\nappelmans\nmandelson\nministership\nodometer\ncookstown\neakin\nrelent\nnipper\ncacophony\nliye\ninfers\namante\nnourish\nkatrine\ncompacts\nheadbutt\nbandeirantes\njohnsson\nbackman\nimmanent\nbumblebees\nenamoured\nbeane\nnewt
ons\nemptively\naimlessly\nfaf\nputih\narachidonic\nchennault\nposs\nndi\ncastleman\nfintan\nschiele\nlehane\nstandardbred\nbisping\nuvalde\nmohamud\ninimitable\ncorned\nbrainwash\ngha\nmasan\ndarr\nguidi\nkhanh\nnct\ntoolkits\npurist\nstyrian\nmasvingo\nsolanaceae\nlurk\ngenerics\ncolenso\nboonen\neber\nacceptably\nwellsville\ndensmore\nmamta\nchipperfield\nirreligious\nyoshikazu\nbettering\nswabi\noliveri\nstradbroke\nuprights\narmatrading\ngolgotha\nwoodie\ncascais\nshotton\nhimesh\nluff\nift\nkarachay\nrito\nbruyn\ncocoons\nrattler\ntints\nmuseet\ntenison\nvampirism\nraikes\nlarocque\nhalsall\nsalability\nwaterproofing\nfingerstyle\nporcine\nvoces\nmatale\ncleghorn\nickes\ncibola\nfentanyl\ntopsail\nhodgkins\ntypescript\nbrisson\nrast\ngusti\nmorio\npestle\ntsarevich\nmelange\nstepwise\npiccoli\nstipendiary\neuropaeus\nfeo\nstrobl\npurveyor\nphaser\nxzibit\nfazil\narians\nprotozoan\nphysio\nalbacore\ndecipherment\ncahir\nenvironnement\ndorling\ncombi\nticketmaster\nbaga\nhermaphroditic\narnau\nunlearned\niniesta\nmanservant\nruber\nmourne\nprimero\nsiciliano\nobed\nsupriya\nbathinda\nbruder\nvitas\ngoodfellas\nlynd\nteva\nconcourses\nscobee\ncolmcille\nwillowbrook\nblasio\nharriott\nrabbitt\ntussauds\nmakeba\nbizzare\ndumber\ngiv\nzorba\narenal\nleeming\nkageyama\nskog\nreutemann\nintuitions\nporvoo\nbonavista\npeppino\ncardus\ncoiling\nscurry\nrajamangala\nsunna\nalwin\nmacrocarpa\npittsford\nsevendust\nboldt\nsonntag\nfrenchtown\ntokamak\ncanaanites\npyrrhic\nhexafluoride\nmilitarization\nintros\nbrenna\nphantasm\noconto\nnhp\nabaddon\nmassinger\nmalaita\nzoologica\nhdac\nmagisterium\nblemish\nbaryshnikov\nloya\ngorm\nrequisition\nipm\nmerano\nmartinho\nfudd\nupholstered\nlumberman\nillegitimacy\ncowherd\ncastellated\nworkloads\ntakaya\ngarside\nwove\ngaddi\nhesperus\nherriot\npesach\nasci\ndawud\noverrunning\nurich\ncarotenoids\nsurrogates\nspittal\nresourcefulness\nthemistocles\nmanilla\nsteichen\ndecrying\nswarmed\ninterlochen\ndacha\nseascape\nmaypole\nfun
der\ncoarsely\njackrabbit\ncolonialists\nmargaretta\niduna\nbraff\nbrassy\nwrx\ncois\neckstine\nmorán\nnandu\njelen\nanthropic\npinilla\nbolles\nriske\nnabucco\ngoold\nwaynesburg\nrhus\nshowboat\noilman\nkamini\nvinland\nleggings\nosmani\nkajol\nhadrons\npolitecnico\nfairytales\ntoomer\nreturnees\npeptidase\npièce\nslanting\nvining\nswadesh\ngos\nsię\npinkett\nsextant\nlamonica\nsifton\nlofgren\nsambora\noktay\nvagaries\nliancourt\nlausd\nallington\nrajagopal\ntwombly\nivi\nspeedster\njuche\nsaroj\npalmolive\nsoloing\nurticaria\nbellies\napalachee\nclg\nextraordinaire\nunearthing\nlowrey\nlegolas\nsuccinate\nkojo\ntaib\nalphabetize\nretinopathy\njuggalo\ncopilot\nribe\nratifications\nfledge\nsedate\nerdogan\ndugong\nshammi\nestanislao\nbedworth\nsalal\ncornu\nheisler\nrossiya\nménage\nammi\nholguín\nbamboos\nvideocassette\ncondoning\nkanzaki\nwafa\nmacromolecular\nelas\nyasukuni\naimless\nharrisville\nfountainhead\nsakuma\nifp\nevaluators\nmonteverde\ntemnospondyls\nlibrettos\nzeffirelli\nbaqir\nchieko\ntoko\nsegre\npremadasa\nzinedine\nmockingly\nkatzman\nnairne\nkriss\nbouguereau\nparmenides\nkobus\ncellophane\narabians\nkumano\nnocera\nmnt\ngelug\ninvisibles\ntulio\nmanolis\nchuk\ncostuming\nabattoir\nyelverton\nranson\njihadi\ncrossbows\ninsa\nmarron\nokara\ntadao\nscorchers\nnaoya\nkoval\nanciently\nhazaribagh\nbeaujolais\nshon\ncolina\nmeydan\ncharlize\nprerecorded\ntrindade\nalternators\nmexica\nacquaviva\nbobbing\nestancia\nmagoffin\nhami\nvayu\nbroadwood\nfitna\nambo\nwinches\nrodan\nmke\nirredentist\nrania\nserio\nworsted\nerland\npadmasambhava\nstraitjacket\njati\nnevi\ntaverner\nnyheter\ncerulean\ninanna\ndisproving\ngrenland\nrussias\ntellus\ntakashima\ntare\ndiscouragement\nsarda\nbunks\nazkaban\nvite\nresuscitate\nturbochargers\nreemerged\nfloridian\naquatica\nveronicas\noperación\njodorowsky\ncorsini\nichthyologist\nauglaize\naraz\nrebounder\npoynton\notterbein\ncyclamen\ntiu\nphospholipase\ndisquiet\nirmgard\nnaruse\nreconquer\nbrownwood\nchantelle\
nanglicisation\nbouldering\nspall\naashto\nsatake\nkaan\nelizabethton\nforaminifera\nbernheim\nhadoop\npiasecki\nnolasco\nmendiola\nkevorkian\nplumstead\njayavarman\nrigger\nbjork\nleafed\nvilmos\nheadteachers\npanamericana\nhavas\ntajima\ncugat\ndecriminalization\nliotta\nabhisit\npyu\ndigitalis\ncatford\npfo\nstoked\nunbundling\nmullets\nterrazzo\nosm\nrainham\npyunik\nalbornoz\nzygomatic\nsmartly\nfrommer\nriccarton\ndobell\nkollar\ngreggs\nbasinger\nhodgman\nobservables\ncny\ntertius\ncully\ntéllez\nboggling\nhiguaín\njcs\ngraetz\ncholinergic\njkr\nsundae\nenrol\nknutson\npopova\nrobi\nwinstead\nsodas\npaddies\njanna\nnigrescens\nchandrapur\nseidler\ndemigods\nfarmyard\nqualia\nunderwritten\nsheard\nunyielding\npenistone\njetsons\ngaeilge\nwojciechowski\nmaclaurin\nbhonsle\nbrengle\nbarthelemy\nducking\nkronberg\nportlaoise\ntaizhou\nremparts\ngaudin\ntrachtenberg\njboss\nzimbalist\nlunette\ntilia\nbankruptcies\nfabricio\nboreas\nosp\ndaffodils\nfunaki\nendocarditis\nvictoriaville\niczn\nquaestor\nalvord\nhypertensive\ncurvy\ntigrinya\ninsurrections\nchastise\nzeeman\ndeconstructing\nwxyz\nporterville\nméthode\naumont\nsatirizing\nradian\nneer\nmedievalists\nthimble\npranksters\neffusive\naiaa\nchewy\nproposers\nbacktrack\nathelstan\nwhereof\nlemuria\norienting\ndebased\nbrynner\ncheyney\nshapley\nbacklinks\nphotobook\ntopham\nfinck\ndavitt\nskinning\nallam\nguzzi\nglenrothes\njalapa\nbuz\ntheoreticians\neasygoing\nneame\njacksonians\nakihabara\nlamothe\nyasutaka\nturgeon\nhisd\nsulphide\nyog\nnsync\nwillan\ngaas\nponferradina\njaycees\nfindable\nembellish\nportela\nrhos\nhobie\nretry\nspectrometers\nlightbody\nfraenkel\npaperclip\nvicuña\nbrehm\nprincipes\nintercostal\nchaldeans\nsharpless\ncandidly\nstrabismus\nportales\nemmer\nkakadu\ncockatoos\nwaley\nreber\nsaxby\ndanone\nyakub\nturnips\nburhanuddin\ncuda\nnantz\njohnsonville\ncatflap\ncanonised\nbousquet\nrevlon\nconrado\noooh\nalthusser\nunmentioned\nauthorial\nunobserved\nadiós\nbrundtland\nswanston\nmam
aroneck\nentranced\nlessens\nvce\ncorker\npluripotent\nabsurdum\nmiklos\nkohistan\nimhoff\ndonnellan\ngouge\nallsopp\ndurance\nacor\nsifakis\nhumanly\npgi\niñaki\nchambord\nascher\nmihara\nmasquerades\ncrystallizes\nwildfowl\nelision\nnasim\nbfs\nccw\nunbelievers\nsetiawan\neons\nminimised\nrobitaille\nestar\nwambach\nexamen\nliguori\njell\ndsk\nyakut\nharrach\nkadi\nbompard\nmoorestown\nfeign\nsundarbans\nmahavishnu\nchudleigh\nwinamp\nvelayat\nceleron\nopulence\nbori\nschönbrunn\npstn\nmarmoset\nslavoj\nmiglia\ncatamounts\nétoiles\nokapi\nhomophone\ncontravened\nfossae\nabha\ngeshe\nbernina\nmesenteric\nmccaskill\npolonnaruwa\noutlasted\nvectoring\nherriman\noona\nlazenby\ncyclocross\norl\ncamerino\nmultitrack\ndefrauded\nflam\nbrownback\nyate\nincrimination\nduesseldorf\nlmc\nusedom\napte\npeppy\nnorseman\ncampbellton\nwatney\nlujan\nmeic\nnoni\ndaigle\nnervously\nguichard\nlpa\netherington\nhader\ngrouch\nbracketing\nhosei\nteamster\nmarmon\nvef\nsiue\nradiographic\ngrindhouse\nmarck\nsatirizes\njemison\nkommer\njeweled\nroundels\nlinney\nshh\nwilby\ncontemporáneo\nmorand\nverticals\ngarrow\nshorthanded\nhammonds\nregno\npirlo\nwarnes\nwarrier\nmiaoli\nafanasieff\nnazran\ndunin\nuecker\norangemen\nlohr\nuzun\nhohenstein\nharmonizing\nstoa\nwharfedale\nkunsthaus\njunks\nmutu\nssbn\nkrog\ndalila\nwariner\npeo\ncontorted\nmaldini\nmicrograph\nhermeneutic\nicar\nhalacha\napolonia\nraghava\nwashingtonian\nshrikes\nnarcolepsy\nworkingmen\ngouging\ncari\njtf\nquips\nfazenda\nphotometry\npugliese\nspined\nshelduck\npamuk\nbaccarat\nbekker\npicot\ndispatchers\nnariman\namla\nelc\nlaursen\nodetta\neconomia\nnicolo\ntambién\nexpectant\njadeja\naqui\nuch\nbeggs\ntrickier\naref\npleyel\nbobble\nsculling\nyawning\nséances\npennock\nkhadi\ntehama\ntts\nalissa\ndelfin\nstrumming\npurveyors\nruma\npdo\nanchovy\ntahar\ndreamcoat\naoife\ntimbres\ncudahy\naugsburger\ntepui\nsdb\nirgc\ncrotty\ndelillo\nwilfully\nlct\naspirants\nwalsham\nmopeds\nmountford\ngastroenteritis\ncutouts\ne
marcy\nalcides\ngaffe\ncolonise\nnatalee\nmalevich\ngiovane\nabyan\nfibrin\nlaunay\nimplicates\nswoon\niet\nknebworth\nippon\nmongolians\nvirat\nidly\nashour\nexpending\ndevika\ncist\nmalpas\nradclyffe\ntogoland\ndiazepam\nfoc\ngadolinium\ntransceivers\nbigorre\nscrubber\ncirco\nmakkal\ndavout\nmeinhof\ntoutes\nshilo\nminiaturist\nbateau\ngerlache\ntangshan\nvorticity\nullevaal\nlacklustre\nworldcon\nmeds\npopularise\nlesko\njayanthi\nbarnstorming\nholmen\nwajda\njimma\nhafnium\nadenovirus\nhorie\nmbarara\nanimist\nscannell\nbrenham\nevinced\nspecht\nfiguration\nenumerating\njoists\nopposer\nparquet\npontianak\nbaf\nzubair\nencuentro\nmckeen\ncomodoro\nfischbach\nsakon\nautologous\nsternly\nshaper\nawacs\nazo\nkozlowski\nparishioner\nimplores\nbuddah\nenviable\nincredulous\nballi\nmugu\ndegen\nbernardin\nallegiant\ncoolio\nvisualise\ndecameron\nballston\nlsts\nwatership\nrublev\nshepton\npfi\ncazenovia\nwayman\nmirjam\nsoiled\nchaozhou\njorgenson\namala\nkawachi\nbeswick\napolis\ncorpsman\nbenko\ndetainment\npalabras\nperpendicularly\nrund\nharran\ngrieved\nbeetlejuice\nthaliana\nmahasabha\ndigitizing\ncredentialing\ntuscumbia\nisadore\njou\ncybill\nqub\nkenworthy\nporres\nchlorination\naronofsky\ndriggs\npolen\nmicallef\nhierarchically\nyarkand\nrenn\nkasha\nshul\nventana\ncalabash\nhokey\nieuan\ndryad\nbrax\nmiskin\nnastiness\neisley\ncollate\ndaye\nevol\nstallworth\nfairman\ndaylights\nheraclitus\nsperber\nblockhouses\nanish\ngoogly\nsarangi\nacw\nthrombocytopenia\ndrips\nreplicator\nyoma\nwatersports\nwivenhoe\nunderwriter\nalbian\nantigenic\nfoliot\ntog\nshala\nhuybrechts\nmanizales\nmccusker\naccruing\nmultiprocessor\ndotcom\nrundschau\nwaterboys\ntyrwhitt\ncph\nhasselblad\naugments\nsixto\nloke\nmapes\naun\ntemeraire\ncremer\nawn\nrnb\nsuperba\nstrecker\nrabban\nwsh\nshuji\nsemicolons\nconstraining\nnonsuch\npaolini\nbenner\ngpm\nacolyte\nusurpers\nasg\ncontroversialist\nbeatbox\ncentralism\nblaisdell\ndaad\nhyslop\nfrill\ndufek\ndeaneries\narlanda\nsquidward
\nthant\nacg\nklavier\nupchurch\nfulling\nvidyarthi\npoyntz\nfrogmen\ngiolla\nmoomin\nongc\nprefab\nmaybank\ndesperados\ncymbeline\nnasution\nimperious\njevons\nencumbered\navan\ngadi\nmeritocracy\nperugino\nalphen\ndidot\npizzarelli\nmartes\ntela\ncabañas\npotvin\nmescalero\nkrewe\npech\ntatsunoko\nvasey\ncraigavon\nmicrocredit\nrajko\nbeis\nreignited\nlieven\nlaski\nveljko\nbitterroot\nbasketry\nsprat\nkarlsen\npanza\nazide\nxinyi\nsapling\nmeffert\nmediaset\ncesium\nshiu\nspandrel\ncontractile\nkertész\noffloaded\nairtran\nwickersham\nwythenshawe\nukraina\ngirlie\nmolinaro\nfsi\npolycarp\npaddlers\nbarka\ndevens\nwhipp\nsilvan\nnationales\nmultilayer\nfetter\nyakutia\nmélisande\nwarman\ndunford\nbalam\nomelette\nsaliba\nbiddy\nvba\nbophuthatswana\ndeasy\nroko\nsedbergh\nkahin\nallemand\nnisar\ngape\nlamour\namiel\nsrichaphan\nkreuger\nogres\naltercations\nmaji\ndurden\nshowpiece\nsutro\nrashidi\nyiwu\nwelds\nblithe\npaulist\ncarmella\npreoccupations\noverwork\ncomerica\nkamat\nvalla\noverburden\ncian\nwoah\ntayside\naynaoui\nfräulein\nacquaint\niab\nmaximin\npositing\nenglishwoman\nmoye\nthessalonians\nnlds\nstroop\nfonzie\nncb\ncartes\nlaf\ncityscapes\nrhum\ncostar\ncarissa\njoelle\nbirdwood\ndecrypted\nworldviews\nsalic\nkempten\ntruant\nsoftpedia\nsleaze\nswindler\ncygnet\nactivex\nprester\nindrani\ntamblyn\ncvd\npaysage\nbrodhead\nnnw\nmapquest\nnorthway\nramy\nimagineering\nbureaucracies\nblackstock\ngrube\nrazzie\nbacktracking\nfadden\nrohrer\nessences\nedw\noliveros\nbalkenende\ninterconnects\ngrindstone\nbutz\nrecklinghausen\nagriculturalists\nconsonance\nreimann\nlindahl\nbeelzebub\nror\ngleneagles\nwholeness\ndray\nelec\nrehabilitative\nsfaxien\nwindstorm\ncañete\nkleybanova\njind\nbasant\nsoll\nbidvest\ngetters\ntombeau\nroselli\nblesses\nimogene\npestering\nalceste\nstaat\nnepa\nroyton\nibrahima\nmemnon\nodb\nwilmslow\nbidens\nmetalurgs\nbritto\nbsnl\nomnes\nlimousines\nledbury\nslessor\nmagie\ncathars\ncephalon\nyunlin\nsociopath\negyptair\ncrom\nane
urysms\ntrilogies\nkriya\nanga\ngreenbank\ndhan\nbonhomme\nfluorite\nharboured\ncatacomb\nthomaz\nyeading\nbelanger\ndirck\nnorreys\nricin\nusar\n––\nmaesteg\nbullring\ndelo\nmahmut\nguyton\ntikka\nemanates\ntaussig\nsison\ndenunciations\nappia\nlevasseur\ngoaded\nleavin\nmavor\nmoai\ndisunity\npathum\ndefinitional\npersuasions\nmva\nomnivore\nfranny\nizzie\npraetorius\ngargantuan\nhunky\nglazier\nvidela\nholz\nlentil\ndattatreya\nbaza\npatroclus\npenman\ntoray\npaladino\nmonnier\ncatalase\nurbane\noron\ncanelones\ncoffeehouses\nbraverman\nboso\nlobotomy\ndantas\nelfriede\ncantaloupe\npikmin\ncoronavirus\natomics\nkeydets\nalondra\nsadao\nbocconi\nheiss\nmenger\nhorncastle\nkinsley\ninactivate\nganging\nherdman\nsuvarnabhumi\nfigural\nkommando\nsese\nhairline\ndriverless\nwyke\nfuenlabrada\nanticipatory\nmielke\ncitrix\nskowhegan\nmizan\nsquats\nstoicism\npiste\nfoiling\nprostatic\nbairstow\npaca\ninvariable\nagnihotri\nzawiya\ngreenham\ndjembe\ndirksen\ngraver\nshamisen\nromilly\nkingmaker\ndania\nnahl\nzbyszko\nwfa\nstapleford\ntst\ndnd\nmorgans\nrómulo\neckerd\nskippered\nphinney\nmarinella\nsnobbery\npacini\nquiescent\nrajneesh\nsnowmobiles\nlwin\ncallis\ndivulged\nbédard\neveline\nkravis\nanouilh\ndrivin\nakka\nberntsen\ncrossbench\ncryptographer\ninis\nspreckels\nchavis\nontonagon\njubilees\npowerboat\nredon\narw\nmessager\nmanjula\ndevours\nzelenay\nbootham\ngnk\nsyndicat\nmotian\nhassall\ncigs\nevin\nranariddh\ncais\ndevane\ngfp\nbrummer\nfeni\ntransubstantiation\ntekle\nsocietas\ncentenarian\nstoppers\nkoalas\nphilco\nenns\nmaries\ncluniac\nbroadens\nlipase\njeannine\nsedated\nkcl\ngroovin\nconcessionaire\nkauffmann\nsalvadori\nhyperthermia\nglantz\nsavai\nimitator\nshavers\nthiepval\nhre\nzhili\nbima\naddu\nbuka\nmitty\nbechet\nnachi\nkaepernick\ntrion\nmcentee\nrecs\nvenezolana\nmarsala\npresumptions\ndbu\nbowdon\nishigaki\npanmure\nzinaida\nsanabria\nfalcao\nmagnússon\nulcinj\nirreparably\nshadrach\nplaner\nhopkirk\nyayo\nwendi\nskal\nbuskers\nihc\nprese
nces\neastport\nkrakowski\ncurrituck\nusg\nntra\nbluejays\ntenderly\nosawa\ninstallers\nclemmons\nmicroscopically\ncassock\nnyy\nhallman\nbga\nplucky\nlenta\nelbrus\naltadena\nbarabanki\nlamontagne\nquadriceps\nvillareal\nrousey\ngrampians\ntelemachus\nbjerre\nleen\npevensey\namati\nbusca\nsquint\nregicides\nkistler\nfarallon\ngilbey\ngauntlets\ncouleur\nextragalactic\nrecollects\nbummer\nhartshorne\nporvenir\nschöneberg\nlaunder\nqemal\ntopically\ncontralateral\nchora\nbayda\nroughing\nspeedo\nschlossberg\nmerriweather\njiva\ntaya\nbenes\nwildrose\ncrit\ntabrizi\nnonconformity\nspeckle\nsunbathing\nrvs\nmulhall\nhspa\ncouches\nschoolwork\ngunilla\nageless\npreuss\nqila\nkercher\ndeputed\nertegun\nmasih\nutsav\nethelbert\nnaing\ndésirée\najs\nfaeries\nforsake\nzamir\nintentionality\nkeefer\nmyrmica\nfanboys\ndimples\nsiachen\nhoste\nskilling\nkhairul\nperma\nchocó\nkeïta\ndaza\nbeevor\ndéby\ndebenhams\npalmares\nheartbeats\nclaydon\nhelo\nfeigns\nbhavana\njunpei\nwhimsy\nsorvino\nacis\nenvelop\nfloatplanes\nmaggs\ngravedigger\nspecular\nmaines\nmonopolized\nflug\nrationed\nmugging\nmanduca\nparia\nparter\nelvish\nlecherous\ncarbajal\nnanomaterials\nshippen\nsubsequence\nharbord\nhilariously\nbinks\njacaranda\nincarnated\nfrancke\nreoccurring\ncharis\nmudstones\nsolons\ndilshan\nrevolutionised\nsimion\nwinnemucca\nmumbo\nraje\ntamba\ndemoiselle\nkirstie\nsilberman\ndendy\nkildonan\nrosencrantz\nvolcán\nneutrophil\nbareilles\nsecretes\nmortlock\ndroves\npropagandistic\nscorch\ntessin\nhabra\nflicking\ntullahoma\nsibirica\nphilpot\nescala\nrall\ngehenna\nkaushal\nhousings\ndicussion\nhox\nsalvini\nhartz\nnorn\nshorn\nsym\nunspoiled\nebm\npawson\nclouseau\nharipur\nbelmiro\ncrécy\nuncooked\nkordofan\npsion\nlongleaf\nabella\nnevus\nmailto\nmees\nschnee\ngagra\ntroubridge\nwelty\npaternalistic\nhyposmocoma\nwilburn\nlyttle\nempt\nloamy\nhigdon\nricco\nanticlockwise\ntev\nnewsagents\nslandering\ncgm\nvires\nhsh\nbaux\nnoite\nafflictions\nsynced\ntransboundary\ntaylorsvill
e\nolten\nmettle\nlogansport\npottawatomie\njohannessen\narsinoe\ntechtv\nclematis\ncostilla\nkurdi\nteke\nzon\ngolani\nquagga\nhyperlinked\nbeatifications\nsedley\nbulat\nafula\nswampland\ngopalan\nrhetorician\nabdali\nwardak\nhorticulturists\nboulter\nhypothalamic\ngreengrass\nwigtown\ncojuangco\nbuffington\nteru\nerdos\nopcw\nleidy\ntallapoosa\nmafra\ndunia\ndevers\ngrotte\nhristov\nvignola\nimpostors\ncaryn\ntelepath\nabscesses\ncritchley\nmckeithen\ntribuna\npiven\ndregs\nmultinationals\nolmo\nimplantable\nschmeling\njitendra\nplummeting\nogee\npolito\ndissolute\norginal\ntololo\nstingy\nrlm\nfêtes\narmbands\nunforgivable\nharajuku\nmastic\nholsworthy\nhypertrophic\nespañoles\ndicey\nstennis\nconfound\nclow\nntp\ndoral\ncoeliac\nmoree\nclassen\nwaterhole\nsuperstation\npennywise\nhypotenuse\nradionuclides\nfishbone\nhypatia\nbatlle\ndominatrix\nail\nrerouting\nclerkship\npurer\npostcodes\nloras\njavid\nflatlands\nberm\ncorrell\noverrode\nsupergroups\nwebmasters\nronk\nfasts\nfriedländer\ntempestuous\ntrillions\nkroc\ncapelle\nshema\ncertificated\nplayfulness\nystrad\nmichela\nkalka\nzeid\nzook\ntindal\ntransposing\naisling\npeacemakers\nmahanadi\npitchford\nfloresta\nlondoner\nrashtra\nmure\nchalloner\ncrimp\nnorthants\nlohse\nwalley\ntuscola\nduddy\ninterzonal\nbasseterre\nsks\nneurotoxin\nlapwings\nshorted\nlongshoremen\nheffron\nlustful\nadonai\nedgcumbe\nmcconville\nsaadia\nasie\ncommissaire\nsharpsburg\nmarlboros\nturreted\npanagia\nsegall\nkrøyer\ncregan\nbathhouses\nmendieta\nfantasie\npalast\nhypothesised\nfinzi\nlele\nmagistracy\ndécoratifs\ntickell\nbuckled\nhussaini\nseba\ndimitrovgrad\nblk\ndeeley\nbellwether\nprecis\ncelsus\nmidlife\nhuachuca\nsund\nweightlessness\nghassan\nhouck\nattard\ndainik\nemelianenko\nmillau\nselectman\nnameplates\ngraecia\narbitrariness\npathologic\ndrizzle\nhps\nmete\nmpb\nbigots\nwolin\nlungfish\npegula\nkirn\nescudé\nbiomechanical\nmetromedia\noverdubbing\nmathisen\nremainders\nmaitre\nlatvala\nstutz\nresidenz\nburkhart
\ninequities\nlaconic\nmachineguns\ntiana\ntapio\nkosice\ngasser\ndragonball\nschild\npalaestina\nrelocations\nkamath\nhullabaloo\nhulman\nseedorf\ngodart\naerodynamically\npenitence\nipecac\nmcclanahan\ntailing\ngaram\nfrito\ncrevasse\nvenerate\njakobsson\nmolton\nblackley\nexpendables\nrelient\npapermaking\nensor\npayal\nsempill\nzana\nakal\nsimulcasted\nkirkenes\nchapbooks\nhotshots\nmoscone\nvlora\nclk\nmikio\ndalibor\ninsolation\ndrakes\nsavino\nbreedlove\nraksha\nbeier\nhocus\naurillac\nstatics\nmiele\nwakulla\nhabbo\nmexborough\ncarnoustie\nshamus\nstamm\ntbf\ntragédie\nstojan\nruffled\nallot\ngula\ndiddle\ncumulatively\nbabaji\nburbridge\nalvalade\nraymonde\npetered\nbookbinder\njanowski\ncampbellsville\nbirdwatchers\ndialogs\nsolovyov\nhexagons\ntourcoing\nmno\nfootlights\nbriers\namericorps\nshakeel\nkármán\nneophyte\ncurbed\nchessington\ncommittal\nstunner\ningleby\nduper\nmozzarella\nbrainy\npetkov\napplets\nziaur\nushakov\nvisite\npoma\nintransigent\nphair\ninexact\nbursar\neusocial\nshamim\njetix\nocp\nshraddha\nbardsley\nolofsson\nijaz\ngarzón\nsilty\nmangere\ncrenellated\nhma\nkulwicki\nbeckinsale\nefrem\nbikinis\nboulay\nsbn\nanecdotally\ntatras\nqubits\nsuffern\ntweede\nstrongbow\ngodden\nsupplanting\nthorstein\nmurrieta\nariosto\nkates\nstepper\nbozoljac\ndadar\nwess\ntrayvon\ntranspennine\nalloway\nmisidentification\nthunderdome\nvaranus\nbrockport\nrhinitis\nceb\nmonagas\nevonne\nier\nstanwix\ngroats\nltda\ntracie\nuckfield\ntatchell\nlarkana\nyola\nossa\ndisinclined\nraynham\ncoasting\ntinctures\nbrigand\nkassim\nmclemore\nparvin\nboda\nfgs\nrosenbloom\ngreenup\nsadist\nlrp\nkahuna\nfuc\nbogarde\ncraik\nbighorns\ngund\nvaclav\ngreenlit\nstarhub\nuah\nraper\nbotkin\narchrival\ngoda\ndors\ncopperplate\narchiver\nstealer\nskipjack\neasterners\nfathi\nsplint\nhoists\nsalience\ntrinkets\ngbu\nremick\ntelegraphed\nnde\nmarussia\nfyodorovich\nbeaudesert\nwalston\nedta\ndicker\nvecchia\ncornmeal\nchery\nastroturfing\nesto\nvanuatuan\ntsao\nspilt\ndepew
\neventuality\nbusto\nchatwin\ndancesport\ngraduations\ncoaxed\nenso\nenhancers\nguarneri\nrealign\npharmacopoeia\nclovelly\naspasia\njocasta\nkravchenko\nkort\nlillee\nstrangford\nesmeraldas\nwhos\ntacking\ntrevithick\nnatl\nchitin\nescanaba\nimpermissible\ntarija\nhighlife\nharnack\nfenella\nrodale\nmesic\nwinks\nigo\nimprovable\nthumper\nwahlgren\nsirleaf\nsandinistas\nanticommunist\nmatas\nlicata\npineal\nsulphuric\ntrinh\nroleplay\njiangling\nculberson\nephron\nbrabourne\njavits\noverlaying\nkotler\nsidelights\ndaydreaming\nbonnets\nsnappers\npales\ndelmer\nleytonstone\npunctures\nghislain\ntulkarm\ncardew\nheres\nfeldstein\ngada\ninfineon\ncolonizer\nspikers\ntrintignant\neol\nshootdown\nvarius\nweiwei\nmeni\noatlands\nhayford\nbaiano\newtn\npampered\nbouverie\nkhadija\ngyre\ncherubim\nbwh\nbutlin\nmended\ninheritor\nknez\nripened\ntenable\nwaw\ninjectable\nmccoys\nacoma\nsensitization\nbardwell\ncami\nrogowska\nchakrabarty\nprecognition\ntalismans\nscolari\nalh\nwolde\nects\notr\npresentable\nmainstreaming\nsilverback\nbeeman\ntylor\nrattled\npekanbaru\nincognita\nmert\nbricker\noutgunned\nconradi\ntaskmaster\ncathartic\nimhotep\ndobre\nbohannon\ndorgan\nbemba\npraiseworthy\nyasunori\nkarlo\nbunched\nbeaumarchais\nlanigan\nsteles\ncraps\nmoscoso\nibi\ntoller\nholcroft\nhysterectomy\nkashrut\nshamal\nuct\ndebre\nunfavourably\ncassian\nnyi\nblackall\nkyocera\nniccolo\nmup\nstraker\nzinfandel\neuropeana\nfsl\nsullivans\nbarcodes\nsvc\nconquistadores\ndatetime\ntartans\nisra\nproscription\nskelmersdale\nfactuality\nactuation\nreinvigorate\nkerrville\noocytes\ndivergences\ndrunkenly\nipu\nmochis\ncordata\nconjunctivitis\nbrasseur\ntwisty\ngrangemouth\ncarolan\nmedlock\nphalke\nbiracial\nkuroki\ngrell\nulceration\nlely\nchalus\nedens\nruisdael\npreschoolers\nbrotherton\nlauritzen\nhuxtable\nbaccalauréat\nwhacked\nnorthcliffe\nparapan\nluas\nkokand\nblandings\ndoddridge\npornographers\nunadulterated\ntaddeo\nallport\ncellier\ntinsel\nrhenium\nlevellers\nmonoliths\ni
rf\nagdam\nrefracted\nconceives\ncoppell\ngrito\n¡\nmedicated\nfernsehen\nchie\nlogbook\nzhonghua\nvergne\ncantors\nsalavat\ncaribs\ndeporting\nbipod\nbootlegged\nbinaural\nhallucinatory\njcpa\nbeirne\njaffer\nmcinnis\nbtu\nserous\nsanguinea\ndirks\nandreae\ncausey\nwheal\nfelidae\ndelineating\nasiana\narlette\nairstream\nbyun\ninfiltrators\njaynes\nfrohman\nintestate\nmahia\njellies\nnaphtha\ntorry\nvey\nwagener\nglamorganshire\nlampson\ninsincere\ntikkun\nhunley\ncowra\ntremayne\nvenoms\npareles\nquoi\npazz\ncentavos\nlumix\nrandomised\nlivings\npilger\npsychoanalytical\ncohesiveness\nlatched\nafrobeat\nangelini\napurímac\nedvin\ntmt\npurushottam\nincites\nimpracticable\ncandelabra\nacro\nroubles\nrózsa\ncmas\nbasten\ntelomerase\nzuber\nreposition\nlgb\ntracers\nepaulettes\nshays\nlykke\ntowa\ncarruth\npustaka\nmallick\nconserves\nshulgin\nsloat\nblaikie\nkneller\nharen\ncharacterising\npaquet\nramo\nherwig\ngusmão\nmulia\nwissen\nvindicate\nsmirnova\nsanitized\npeafowl\nroose\nhyrule\nbrdc\necclesial\njemez\nmineralized\nrothes\nkillin\nchams\nrothenburg\ndeafening\nrcw\narielle\nwaterlooville\nelling\ndecima\npérigueux\nsione\nkhairpur\nburry\nludger\ncinders\nhomological\nreprimands\nmementos\nglitz\ntav\nlanna\ntsuda\njanke\ngalung\ndisassociate\nshinnosuke\nskyward\npotting\nguth\nmaxey\nevra\nsemiarid\nbrinkmann\nptr\ntoothache\nmachar\nretracts\ncauseways\nhemet\nephrata\nliefeld\nangioplasty\nkernan\nmartinet\nbeatus\njazztimes\nunionization\nmeriting\nefraín\ntobol\nracquets\ndominantly\numbrian\nboondocks\nvillahermosa\nvolver\nclocktower\nkeon\nrudel\nnightingales\nbeshear\nsepang\nbroxbourne\ndervishes\nsenlis\nmesquita\nautodromo\namant\nfaridpur\nsippy\nmugged\nfarrier\nmone\ncib\nfulmer\njürgens\nkanin\npournelle\nmeditated\nminimax\ngoethals\npahs\ncudworth\nsheepshead\nbackline\ndecennial\npolokwane\nhagiographic\nshyamalan\nfrum\nthionville\nmarrs\nboltz\nkazem\ncorts\nrashes\nstrobel\nchickpeas\nhemsley\ntooley\nwoodcraft\nbooing\nfark\nmalmedy\
nhaddonfield\njohnsbury\nlinville\nflatley\npittsylvania\ntehrik\nskydive\nchariton\njeh\nunna\naverroes\nendometriosis\neasterbrook\nscoville\nalcântara\nbaan\ngavriil\nhoekstra\nnájera\ntubas\nhoulton\nmachetes\nhendy\nkegs\ndirectness\nsenthil\nkrav\nkuniyoshi\nmachiavellian\ngarçon\ngummer\nmanzanares\nmand\ndeidre\ncourrier\npaniculata\nfulvia\noti\nbluey\njoyfully\nheidelberger\nfoaming\npressburger\nrhimes\nwhdh\ntahlequah\nwia\nphillis\ntwirling\nnaturism\nlandauer\nantaeus\nregula\nfroese\nkoot\ndesimone\nfilmer\nepifanio\nmorges\nuncasville\nmuss\nstaffan\nsaida\nbiblia\nmiceli\nbloxham\nridicules\nuridine\nunderhand\nmaryport\nstreicher\nkhatam\ngalvão\nchivalrous\nschein\nguadalquivir\nteething\nlutte\nfeverish\nblewett\ndisequilibrium\nbonnyrigg\nmahala\nipf\nagnetha\nslavin\nnickleby\nopensource\nzafer\nsool\nolympiques\nbagels\navidly\nmarias\nbaike\nnostrum\nvillainy\nbuckhead\ninterlocutor\ngyroscopic\nshipmates\nttu\nvapid\norridge\ntricyclic\nisolationism\naborigine\ncourteously\nvashon\nnayef\nhalving\nantagonizing\nnetley\navalos\nmikuni\nkutztown\navner\ntibbets\nmamo\ncesari\npittodrie\ntsarina\nrcr\nsavas\nmensheviks\nallstar\nclaris\nvirtanen\nperanakan\nyamal\ncracknell\njudiciously\ntanners\nrothe\nnso\nkoz\nvaa\nrailgun\nbootlegger\ncanio\ntarbell\nfrankenheimer\nriddance\nlabeouf\nsmudge\ntessie\npanetta\nmuscled\nmsrp\nblackberries\npleasants\nacn\nnearness\nvint\nmicrobiota\nhardcopy\nforeboding\nnegombo\ndme\nyavne\ntvm\nbornstein\nkleve\nheckel\nnowicki\nmeio\ngarg\nemulsions\nmondeo\nmifsud\njuncker\nmatheny\nheadbangers\nfashioning\nmethil\nanthropologie\nncap\nfui\nunidentifiable\nstansbury\nriband\nsomethings\nspéciale\nbli\ncopps\ninterlocked\noakwell\nsdo\narraignment\narseny\nclyro\nclimaxes\nmoyet\nshaq\ninkster\ngondolas\ncubit\nrobillard\ncryptologic\nmontacute\nlibertà\npiscium\nhagler\nkaeo\nlochte\nsomeren\nhoteliers\nnaipaul\npetry\nbutchery\nslrs\nrooming\nspringwood\nloewen\nmontalto\nmhl\nzoster\ntubridy\nprocyon\nsw
ail\nfricker\nsagamihara\nbluebonnet\nabut\ntono\nhattersley\npursat\nwende\nferryman\nvav\nnms\nconsol\nthw\ngalling\nkaguya\nfomenting\ngaan\nyanbian\nshoppe\nyunis\nbentsen\ndosa\notp\ntransamerica\nfunfair\npuppis\nmisión\nquahog\navailed\nshanker\ndiatribes\nmeudon\nserravalle\ncotentin\nmourner\nplaten\nberbatov\ndagblad\nkmc\ncentralizing\nhinault\nupliftment\nparatroop\nbledisloe\nharuo\ncalifornication\nsenter\nhopp\ncamelopardalis\njts\nrashmi\ndigesting\nacceptors\nwaren\neuskal\nunencrypted\nspiracles\nespana\ntangipahoa\nyaa\ntriannual\ngaviria\nhatcheries\ntercentenary\nprojectionist\nmcbean\nthereabouts\nstylistics\nmozarabic\nschooler\necclesiology\ncoverages\nmetalurg\nboreholes\nosterman\ncatheterization\nmedievalist\npersica\nrevie\ncurtailing\nvtr\nlimped\ngruner\nshantaram\ncouncilmen\npaddocks\nhowstuffworks\ngudang\nborlase\nexpressionists\nbasford\naric\ngreensleeves\ngiggles\nmontagnards\nmckenney\nlilongwe\ncataldo\nhülkenberg\nyichang\nparkville\nsoars\ngerakan\nmidbrain\nrepetitively\nrufina\ntellico\nbrokering\nsubdues\ndement\ninfluencers\nwesternization\ndagan\ndeist\nbogard\nkernow\nishibashi\nwebmail\nbrane\nmonkhouse\nbouquets\nresonating\nplainclothes\nrustaveli\nwidowhood\nruts\nprofiteering\nhil\nkittanning\nranald\ngurdwaras\nmanuka\npurna\nspigot\nwath\nincunabula\njilani\nneidhart\ncentipedes\nfibromyalgia\nsubwoofer\nplayland\nmervin\ndahlem\nnitroglycerin\nhozumi\noutflanked\ngioconda\nhirota\nrhi\negor\nafer\nswitchback\nkth\nalbéniz\ncoupee\nhydrofoil\npslv\nlanden\nbakelite\nboivin\ncaprices\norrery\ntocco\nannas\nmatting\nscatters\niorio\nsilang\nutil\nterroir\ngastro\nrubel\noccurence\nuncivilized\nwhately\nfloorboards\ncheaters\nfrailty\nphonons\nyorick\nramil\nnescopeck\nalsina\napollinaris\ndreadlocks\nharbach\ncorneliu\nbazan\nrisers\ncoextensive\ngailey\ncompactly\ncalogero\nsaye\nflavonoids\ninvalidity\ngoalies\ncanaletto\npasqua\nlá\njunto\nanibal\nactus\nearshot\nmarshalltown\nunrivalled\nnatascha\nchretien\ndue
lling\nprofuse\npartiality\noverproduction\njoppa\nscarbrough\ntago\nsuso\nannihilating\nupf\nrobinhood\nwesterville\nwlw\nbarbel\nobits\ndugmore\nstroked\nsongshan\nunwind\nsweetman\nwege\nbillikens\nuks\nchickadees\ntowanda\nspanos\ntelefunken\nutzon\nfairplay\nbucolic\ngiorgia\nfozzie\ncalarts\npalauan\nmilenko\nsudeep\nchemo\nunconquered\nfyodorov\nmisuses\nmiserables\nquandary\nklink\ninbev\ntamra\nbadshah\ncultivates\ndiário\nsundin\ngardes\ncostantino\nbenzoate\nserban\nmcmanaman\nclitoral\ninstigators\neragon\nbilan\nsherriff\nivc\nditka\ncrumbles\nbermejo\npivoted\nmahlon\nessie\nbestia\nbedminster\ndowden\nirib\nabang\nrybak\nhillclimb\ntoye\nzovko\nbeh\nerring\ndiebold\nscher\ntodman\nmilica\npeau\nbanishes\neti\nlyla\nhostetler\nlepton\nrowboat\nwhittled\nwhitening\nghanem\ninternationalists\nsedaris\npalácio\nbudde\nstaind\nsemele\nludolf\nllanfihangel\nvexed\nmannie\nunprocessed\nforeknowledge\ncresta\nbosc\navocet\nnbi\ndismember\naleks\nrobustly\nturm\ndiaconate\nrespirator\nfulci\ncrepuscular\ndiz\nintertwining\nmunsee\nwahdat\npacaembu\nthorney\ntraitorous\netisalat\nwiccans\npolychaete\nparthasarathy\nreidar\neightfold\npinatubo\ninoculated\npasty\nswales\nhlinka\nnamib\natiyah\nsoapstone\narap\ntaipa\ndelmark\norbach\nbureaux\nmeninga\nhermès\nderosa\nsequester\nbhaktapur\nchalkboard\nsuhr\nkhama\nmunsell\nrobinho\narirang\ncortázar\nkhang\nbackwardness\nchirk\nperryman\nhachinohe\nincisor\nqueensferry\ncremorne\nbarragán\nfalke\nkempf\nbarys\ngruelling\natu\nbelting\nwesten\ntallgrass\njoginder\ntrece\napparantly\npeacemaking\nemployability\nvelox\nanolis\nsammartino\nfera\nmase\nrosier\nwilkison\ndawei\ncaprivi\nbackbreaker\nlyster\nfawkner\nhavn\nlittlebigplanet\nmamata\nprostheses\ncafepress\ngtk\nmetafictional\ngrundman\nwiktor\npedagogues\nlagerfeld\nslonim\nnaxalite\nbeckenbauer\njuliano\nmécanique\nconstanza\nconfessors\nzippo\nrainn\nklaw\nabedin\ninkerman\nlindeman\nboye\npaternalism\nnenagh\nwinglets\nhillyard\nchelating\ncatanduanes\n
cagefighting\nunflinching\ndoberman\nmiguelito\ndownstate\nadjara\nmechanicsville\ntatton\nruka\nmiserere\nwillits\nbayfront\nrespighi\nbaniyas\nampa\nhelgeland\nirvan\nshipowners\nheadscarf\nceramist\ninducible\ntextural\nsucha\npoonch\naylesford\nconned\nspeirs\nhernych\ndharamsala\noffscreen\ntaner\nperis\nvung\nfabrica\nsleepiness\npoplars\nsheffer\nnish\ncaminos\nsamithi\ntunas\nmeccan\ndenn\nsubstantia\nnegrete\ngulfs\ndavidovich\nsandpaper\nquintessentially\nbjarni\nyongin\nflyway\nelmina\nroces\ncorsicana\nbangoura\nsusskind\nembarrassingly\nroyaume\npokrovsky\nchivers\nmaubeuge\nmaddin\nscriptwriters\nvindicator\ncircularly\nentrails\nmonteux\ncandlesticks\nmarechal\nhumongous\ngav\nkiryu\nyod\nlamson\ntamales\nmcgavin\nsabot\nlegalise\nburscough\nnyro\ninauspicious\nbootstrapping\nkadett\nyongzheng\nafshin\nhomefront\ndailymotion\nboxoffice\nludmilla\ncapuano\nmanzarek\ncabuyao\ntreacle\nkirchen\nretransmission\ndutifully\nnikopol\npanegyric\ntorquato\ncosma\nnarrowness\nchandpur\nrefinancing\nthorogood\nrepossessed\ncapsize\nboliviano\nmontaño\ngeier\ngeriatrics\nlavington\ndigitize\nexley\nwingham\nacademicals\ndésert\noxegen\nfinnair\nmoldovans\nanneke\ncrestview\nironmaster\nfortuyn\nlongbridge\nkeshet\nmanni\nsöderberg\nutz\nmahaffey\nmaudsley\nhawkshaw\navgas\ntatort\nupadhyay\nfiesole\nblixen\nencapsulating\nknelt\nkae\nchatman\nmunchkin\ncallbacks\ncamm\npuro\nsupt\nbalta\ncycleway\nbuyouts\nbertini\nvenustiano\njubilation\ndwivedi\ncookin\nbadrinath\nfulcher\ndingy\nsansone\nserangoon\nbryozoans\nnidhi\nfrankincense\nllanelly\nbariloche\nuintah\ndavydov\nmcburney\nodon\ncharlot\nmfp\nferriby\neases\nprise\ndaphnia\nsmacked\namusingly\nmadley\nandreea\ningleside\nbarnhill\ndevastate\nkatan\nbira\njosey\ndemetriou\nmarinko\nmarler\ngildersleeve\ngondoliers\nperiwinkle\ntannadice\nlamotte\nmergea\nwister\nstorr\ntripper\nmenaced\nomniscience\nneutralise\ntruax\njuego\nlahey\nhaliday\nseeb\nfrenchy\ndawley\nbarri\nmargam\nnatali\nhynek\nsurman\nmaho\n
schauer\njoysticks\nmijares\nanim\ncallen\nbunka\nmeadowlark\ngaf\ncau\nirradiance\nzildjian\njovovich\nperomyscus\neskdale\nddu\nhartsfield\ndineen\nusfs\nyearns\nandreassen\nsagal\nstreaky\ntalpa\nroskill\nmelnyk\ninvestigational\nspandrels\ndunker\nbakerloo\nelazar\nentrusts\ngranja\npolson\nstudiocanal\nfreaked\nmuswell\nfns\nbeynon\ntushar\nbena\ngane\nreisinger\nbelding\nwattage\nmargaux\nmeshuggah\nresiliency\nnoblest\ntichenor\nucp\nnewsman\nchisinau\noverpopulated\nflieger\nblushing\nkassai\ndisaffection\nléa\nkariba\nheth\nbuloh\nhrd\nzarah\nnupedia\nhumbucker\nlonghi\necn\nscrubland\nwizz\nprestatyn\ngynaecologists\ndarned\ncatron\npaperless\nfelines\nharter\noutcroppings\nslacks\ntange\nhathorn\npem\nusoc\nziarat\namortization\nmadalena\nsoldado\njascha\ntanenbaum\nsabena\nshedd\nporphyria\ncaray\nkamenev\nrehn\narguello\nsilvestris\naxton\nhilmi\nhaqq\nlexa\nnmb\nboric\ndampened\npopo\nhada\nartillerymen\nmundell\ngobies\nstina\nelwell\nwikepedia\nmuc\nschley\nventilating\nciano\nisms\nlunge\nlosada\nsecondaries\ndint\nrummy\ncadel\nmattoon\nsproule\ncandyman\ndorin\nsafes\ndbm\nuluru\nreassemble\nbugler\nholdover\ncleef\nkeilor\ngaits\ntarantella\nserov\nwillenhall\nblackmailer\ngodefroy\nshadi\ndollard\nfenchurch\nyeahs\noptare\nalr\npoema\nqays\neschews\ntavener\nfermin\nnamsos\ngirault\nsimmer\norontes\nvalette\nergenekon\nsleepwalker\nunacknowledged\ndeeb\nfaucet\nsabatier\nyoshihiko\npolyandry\ngehringer\nomura\nlandrace\nrazia\nhille\nfranquin\ncanards\nmua\njohannis\nhutchence\nalek\nblayney\nkrish\noom\neffy\ntapu\ntalmage\nkudzu\npelléas\ndeluise\nkrumlov\nsimplifications\nlamba\nsangiovese\nhasbrouck\nsegues\nwimmera\nbioenergy\npinkham\nhungaria\nmattsson\nceci\nhayyim\npreben\nbonobos\nkorngold\nmarinho\nconsanguinity\nprana\neliana\nancora\ntira\nchhetri\nmccorkle\nconder\nkiang\nlewisohn\ncourageously\nmilliyet\nneumark\nnitpicky\nhighline\ntriplex\ngluttony\nshamanistic\nhendriks\nribéry\nshg\ngertler\nentrenchment\nnonproliferation\nahe
arn\ndomenech\nleroi\ncarvel\nfenians\ncabinetmaker\nsiltation\nacetaminophen\norganelle\nkaryotype\nagnese\nvierge\nthomasson\noujda\ntradename\nhaddam\ntvo\nandalusians\nunderlings\nvollenhoven\nbernarda\ncardiologists\ncherenkov\nbankes\ngasparyan\nneocortex\nbuzzy\nmiel\nemotionless\nmaat\ncftc\nmuscarinic\nzammit\nptfe\nexcites\nbusia\ndefusing\nbujold\nlacerda\nsrbije\nhmnzs\npauwels\nfurth\nhoodie\nirredentism\nmishmash\nmdl\nochi\nunmaintained\nmasterminds\nladdie\ntankian\nimlay\nsws\nbarend\nbroek\nretz\ncuriel\nsatmar\ncolonnades\nextremis\nmontalban\nurinal\nhagerty\ngermanica\noverusing\nmerkur\ndanda\npandits\nyashin\nbesse\ndesantis\nedgeley\nmoviegoers\ncrivelli\ngotch\ngunnell\ntonsure\nakito\ndaiki\nioana\nsnowmobiling\ncarondelet\ngcn\nspacek\nsouthard\nabrogation\nröntgen\nbirtwistle\ncoutances\nabnett\nheon\notolaryngology\nenlivened\nanthropomorphism\nmyerson\nsalai\nhastie\nspong\nlanner\natanas\nalmaden\njakov\ndeadlift\nsouthwood\nbebek\nrendel\nminicomputer\nwaterfield\npliable\ntravelcard\npawlenty\nbreakneck\nmclendon\nmicroorganism\nhellcats\nwaites\ndetest\ngagliano\nbakhtin\nrosendo\ntolworth\nskilfully\nbrooches\nlogar\nnowshera\njakobson\nmouser\nnivelles\ncuddesdon\naccordions\nabaya\nmishandled\napsara\nlachey\nvedette\npurkinje\nbrière\nebbets\nslatkin\nviib\nbiotite\nnivedita\ncadences\npwm\ndowse\nvociferously\nkarthikeyan\nrevueltas\nguria\ntachometer\ntrocadero\nashwell\nbeautify\nruadh\nmarcio\npfp\npaupers\nsickened\nperlmutter\ndunc\nccb\nskidded\nswash\nralphs\nweidner\nsnooping\nbotti\nanghel\nshuo\ntetracycline\nsulley\nucce\nsmithtown\nbiopsies\ntramcar\npostdoc\ndiab\nmindlessly\nflushes\ncriswell\nbaileys\nscrums\nmdina\nindents\ncitybus\nsahir\nslammers\nevensong\nguideway\npuffins\noosthuizen\ncalexico\nharidas\nthune\nyw\nfreemium\nssf\ncagliostro\nunfashionable\nsarika\nbalham\nchimed\nmclay\nrosing\nkeystroke\nhomesteading\ndulgheru\nwigtownshire\nhartebeest\ngoggle\nwebbe\nchthonic\nmattos\ngsd\nsybase\nsankaran
\nbeeton\ntika\nconstitutionalism\nhomesickness\nyusupov\nentrusting\nremigius\nimperiled\nreassessments\ninfantino\nhardens\nhartt\namn\namita\nmartorell\necker\nskywest\nxiaolin\nhomey\nmazari\nromanum\nlfo\ntrilling\nvoiceovers\nscholtz\nbridgette\nhurls\nlaureano\nrakim\nmasanobu\nfrolov\nmgp\ntongva\nysidro\ndrapes\nrussert\nsulfite\npetrarca\nmaestri\nwheatstone\ngriffen\njonesy\nminimums\ndiene\ndalal\ncastrato\nreplenishing\ngrauman\ntelevise\nsegel\nyok\nsatterfield\nnavigations\nlatches\nyoungman\nplagiarizing\nheadmistresses\nbaath\nimpounds\nautopista\nkingsmen\ngoffredo\nanr\nilagan\nartistas\ncrooke\nscf\nsuncoast\nblowfish\nramazan\nstockpiled\nodorata\nyanagisawa\nmiroir\ncerrone\nmichelet\nsolem\nactualization\npalindrome\ncorder\nprodigies\nfetters\ntarquinius\nmmt\nvitti\njabari\nlaguerre\nbrawler\ntír\nstencils\nstephani\nnonce\nshumway\nalvan\ngein\nglamis\nuneconomical\ngiani\ndonostia\ntey\nlurch\nank\ncrud\nmuybridge\nbegining\nsqualid\nhanders\nbula\ninterbreed\ndameron\nlindelof\nbraddon\nkatzenberg\ninnovating\nbasco\ndanio\nforefather\nflaunt\nskittles\npictograms\nsfi\nwashroom\nfovea\norions\nherrin\nspheroidal\nencyclicals\nboombox\nmoorthy\ntownend\npeddle\nadas\ngarbutt\nincorrigible\nwhitmire\nlider\nsemmelweis\ndimmed\nhoogstraten\ncanet\nfacedown\npyatigorsk\naventine\nunselfish\nmedians\njayalalithaa\nhunte\nbromo\nsynthesizes\nsmite\nimu\nquapaw\nchik\ngouldman\nhellenism\nspurge\nyuli\ndere\nwdf\nmimes\nengulf\njamahiriya\nmistranslation\ntamimi\nguidon\nhenneberg\ncucuteni\ncredentialed\nneuropathic\nkoni\nmounir\nnage\ncromford\nasiaticus\nunico\npff\nencrypting\nbradlee\npch\nsmugmug\nkamera\ngossage\nannexations\nmelua\nembalmed\nigp\naakash\nreasonings\nendothermic\nsassi\npaned\nretitle\noolong\nnoggin\nexudes\ndémocratie\naltieri\nradiometer\nberi\nentrainment\ntuckahoe\nsegni\npurépecha\nholo\nstaggers\ncompanys\nzillions\noñate\nnst\nshotts\njardines\nhiston\npugwash\nrudnik\nmastership\ndemocritus\nprecipitously\nlees
e\nbsh\nosteopathy\nducasse\nfabry\nokaloosa\nhiei\nbazhenov\ndanuta\ncleaving\nfrisky\nbha\ngiusto\nbyam\nnerf\ncriminalize\nantonym\nmacewan\nmosconi\ntelomere\ngrizzled\nottley\naxn\nabstractly\nimmunosuppressive\ndramaturgy\ngnats\nayame\ndavin\nballack\nevergreens\npalus\nivaylo\ndramma\nnorthwoods\nreestablishing\nheavies\nsate\nmansouri\nhajjaj\nmatsuzaka\ntaster\nstalowa\ndeas\nsanjaya\nslop\nbolshaya\nrachis\nsidebars\nzehlendorf\nthay\nscoresby\njetfire\npontecorvo\nheliports\nveggie\nwatercress\ntce\nresourced\nhershiser\ngussie\nellerbe\nbertel\nshahpur\nbestial\nelectroplating\nreshuffled\nmeas\npulliam\nkalani\ncomus\nripcord\ncliffside\naipac\njangle\nzaynab\nsangue\nratatouille\nchristianson\nkewell\ngodine\nwassen\nprioritization\nbonnett\nplayas\nepitaphs\nlossiemouth\ndjiboutian\nsikora\narchosaurs\ngramont\nbovary\njacco\ncitadels\nsubstantiating\naton\nlutenist\nsufjan\nschweizerische\ncricklewood\ncitys\norch\nfairford\ngrenache\nanova\nmacgill\nrobledo\nfasano\ncubano\nwhizzer\ntelenet\naxelsson\ngrignard\nclearview\nvarsha\nsewed\nfractious\njanya\npokerstars\nsexo\nhilarity\nteres\nwallen\ngiornale\nchakrabarti\nhominin\ncubesat\nudomchoke\nnsukka\nintegrations\nindiscretions\nmizutani\nforelegs\nhusein\nfabi\naae\nolm\nmoulting\nscalped\ngummi\noaklands\nconciseness\nlifters\nnaturopathic\nperdomo\ncrayford\nsoltau\nconceptualize\nporbandar\ntruckin\nasiad\nwatchmaking\nbhave\nconsumables\ntransonic\nbenois\ndigambar\nsissi\npsychotherapists\nposteriori\ngomera\nwole\nlhd\nfarra\ngeppetto\nhyderabadi\nplaistow\ngoading\ngroh\nlimavady\nmohapatra\nlmu\npullback\nelswick\npackwood\ntaishi\nhumacao\ninfliction\nmccarver\nunconformity\nfuzes\npatmos\ndoda\ntripoint\npaull\nrrs\nrachmaninov\nguarini\njaswant\nresnais\nmosse\nbookbinding\ndisallows\njayhawk\nephedra\nfante\nchums\neastmain\ndistributional\nzippers\nbeechey\nkampmann\nmaravich\nabir\nrepurpose\nbgp\nlanthanide\nfreesat\narbogast\nmajd\ncasella\npester\nbugg\nprofs\nsalamat\ncupido
\npentti\ncela\nferrar\ninfusing\nshowgirls\nlecomte\nalbertsons\nnaved\nappropiate\nleaderless\nrahilly\nwaccamaw\nkoki\nbaier\ntelomeres\narnon\nmoratuwa\ntrew\ngries\nhoceima\nabsurdities\njasna\ndibdin\nsalzgitter\ndardenne\ncyprien\njund\nmatai\nlaemmle\ncloris\nluczak\ntravails\namoco\nruthenians\nfancier\nflv\ndibbs\npris\nbeefed\nobliging\nlindenberg\nborgir\nkuzbass\nmotala\nesopus\nloggerheads\nshahriar\nalita\npompei\ntanda\noutdo\nserpa\nhaycock\nagitations\nuwc\nplamen\nhvidovre\nmelodia\nkuzma\nnullarbor\nhmo\nvinokourov\nasnières\nflowerpecker\nliisa\npatt\ncravings\njalaluddin\nuncorrected\nsignification\npastora\nsunway\neph\npurl\ncopier\nsantillana\nsteere\noverstepped\nvoila\nakhmatova\ncolan\nwestheimer\njarama\nlinzer\nphalarope\nthunderer\npugsley\nasil\npostgraduates\nseducer\nscienza\nmanav\npowerade\ninstitutionalised\nkors\nbroadsword\nvillavicencio\nludgate\nmoel\nchampollion\nlorsch\npostgame\nqasimi\nfaraz\ndiplomatique\nmutinous\nstearman\nyager\ngroveland\nshinsuke\ndeviants\nevs\nmoje\ndelibes\nmangle\ninterspace\nleoncavallo\ntrainspotting\ncannell\ndentures\ndrunks\nfiera\nhoaxer\nows\nsneddon\nbagatelle\nmondes\ngoodie\nandree\nreferrer\ngarba\nindiscretion\njorgen\nscad\nhallucinating\noutstandingly\nprintz\nolea\ncanticles\nwolfie\ngallico\ncobweb\nwinther\nmonarchism\nkory\ncrna\nplanta\nler\nkuttner\nlantis\nponderous\ntsuburaya\nbreslov\nsplendidly\nimpersonators\ncírculo\ntenedos\nkantner\npupate\ndunvegan\nbrar\nluchino\ngravitate\npiu\nkrasinski\nmrinal\nkirschner\nserotype\nbeinecke\nofferman\nbloopers\nbernardine\nsciurus\nlorie\naoun\nweitzman\nkwaito\nkoos\nmoiré\nmissteps\nnnn\niniquity\nshaba\nboingo\noakridge\nnavale\nanasazi\nglynis\nsymphonique\nesterhazy\nhpd\nsourdough\nbodhidharma\nmelfort\nallgood\nbanham\nelden\nmcmullan\ngatiss\nexchangeable\nheiresses\nhighmark\nmaestra\nwingback\ncommment\nscudetto\nscobie\nbako\niovine\nwarriner\nfaldo\nbewdley\nnuclide\nmonier\nanansi\nhosanna\ntippy\nboccanegra\nmuk\nov
erwrought\nlaaksonen\ndomenic\nbruckheimer\nchucho\nheuer\nferoze\nlonicera\nspermatozoa\nsolidifies\nansley\nsangram\nposta\nsather\nbailing\nsanin\neliott\nibuprofen\nfearon\nreimagining\nshariff\nwinegrowing\nmisalignment\nmasterfully\nhsuan\ntrimaran\ndmb\nmarot\ncondemnations\nbernalillo\ncellulosic\nantone\ncashes\nreconstitute\nudt\ngravano\nserbians\noskaloosa\nwatchable\ngangway\nayia\nrivlin\nvitter\nstroking\norci\nimpelled\netchers\nabdal\nscold\ncochet\nkalo\ninnocenti\ncomore\nbirdsall\nangostura\ncarse\ntassie\ndefacing\nbozen\ncarnevale\ncrepe\nludi\naragonite\ncrowsnest\nblocky\nhacettepe\nshau\nnorths\nsoyinka\nseismological\nkristal\ngonorrhea\nmcclaren\nitoh\nfcr\nsubcontracted\ndreiser\nmuffler\nvlada\nmolle\nneka\nbouche\ncoogee\nberardinelli\nsummarization\nkimmo\nsoftens\nexcitedly\nneuralgia\nnamangan\ninhibitions\nlucidity\namasa\nfmln\nspoked\nandheri\nsplitters\nkanchanaburi\nmacdermot\nnoori\nsoftwood\namphorae\npuchong\nkentaro\nayahuasca\nhorsfall\nsoundcheck\nifv\nanglin\nolé\nunfilled\nabas\nscruton\nerasmo\ngatewood\ngagauz\npertussis\nherut\nmaazel\niarnród\nxingu\ngulp\nkempen\nsaunderson\npressurization\nfroissart\nhazarika\npharmacie\nglassmaking\nbaena\nalbo\ncuddly\nmizrachi\nspokespeople\nincriminate\nmcclinton\ndesilu\nproducciones\nadhesions\neyring\ngaleries\nhbv\nmacc\nsocioambiental\nkitching\ndoucette\nkreuzer\ndumbbell\nchetty\nmoroka\ndichter\namperes\nprenton\nfrogman\ngrieves\nnawal\ncytology\ncoweta\ncalakmul\nlingfield\ngouvernement\npursuer\ndcr\nyag\nasai\nbeka\nspca\nbourbonnais\nwaas\nklickitat\nsoit\nintimated\naquamarine\ncyd\nfortes\nfuther\nraincoat\nfloro\nmidges\nhorak\nunearthly\nsonoda\nphloem\ngivers\nbhansali\ninzaghi\nmullaney\nttv\nlotos\ncheriton\nwiry\nlokesh\nchaconne\nnacionales\nokubo\nestás\nbothnia\nzaitsev\nregularized\nmegrahi\ngreul\ndinant\nescalera\nbiannually\nkusturica\nleko\njolliffe\nrivette\nvermeil\nlinderman\ncattleman\nappledore\nmorrigan\nharari\ncaicedo\nwardour\nhaemoglobin\n
potus\ncuffe\ntoying\nmillia\ngueugnon\nanthon\nkatyusha\npaki\nislamophobic\ninmarsat\nushuaia\nhumps\npoissy\ndisjunctive\nmacphie\nmapo\norval\nleopardstown\nzoophilia\ntricycles\nstatecraft\ncide\nrostrata\nwaver\nmcnary\nimpeller\nlayard\nseeta\nlatorre\nayton\nbryon\ncnes\npollinate\nbalusters\nselous\nunderestimating\nhatha\nfoxboro\nmegawati\nmacnab\ncalleja\nweare\nidt\ncorinthia\nribbing\ncasagrande\nbridgeview\nburp\ngastone\ndisenchantment\nkhiladi\nholbrooke\nsurg\nstreamliner\ngrijalva\nperegrina\nsaltmarsh\nbusker\nedler\ninstrumented\nfani\njazzland\natlant\ncarcinogenesis\njacuzzi\nnanoscience\nberenger\nprl\ngastronomic\ntragicomedy\nherakles\njunot\ndowsing\nccaa\nmhd\ninfilled\ncircumspect\nfältskog\ncoeli\nchamillionaire\ntanguy\npickin\nbacha\nduralumin\npieria\ncpbl\npesci\nmarkie\nginkel\ntagesspiegel\nrenier\nsnowing\ncarnitine\nsakari\ndroll\nmene\nessaouira\npreet\nmooresville\npli\nech\nshanta\nsipe\nsebelius\nfrazee\nkras\nreformulation\nhalogens\noutscoring\nfomin\nnaphtali\ngyr\nsitaram\ndumbest\nnnamdi\nhoniton\nwajid\nrotaru\nvester\nrealplayer\nzomg\njosaphat\nbrockley\nkele\nuncoordinated\nperoxidase\nmaniacal\nsedna\nlenk\npongo\nmeese\nhumvee\namylase\nmarah\npolow\ndrowsy\nsaz\nmange\nspeller\nmolenbeek\nprouty\nbundesrat\nlandesmuseum\ncanaris\nwaitin\nsogno\nlinas\nocu\nlabyrinthine\nmaurienne\nvuh\nzaibatsu\nlpr\nrnai\nhyeong\nschweppes\nsantoso\norantes\npinney\nnibley\nlotti\nkronk\nhuyton\nredemptive\nnaturalness\nunmade\nfreudenberg\npascha\nspaniels\nirked\nmanston\nkostenko\ntencent\nindigenously\nsaira\nsobibor\nmtp\nkneels\ngud\nmogo\ngeneroso\npef\nsushma\nzaphod\nrizzi\nlatam\naugen\nbulle\nwfaa\nkingstonian\nnovelli\nhabré\nandreozzi\ncivitavecchia\nspeciale\ningemar\nadour\nbiot\npvi\nocat\nlynnwood\nattenuate\nbardia\nregazzoni\nbourbaki\ncarville\nintegrators\nmacula\ncsun\nparus\nwatervliet\ndupre\nqueued\nleninsky\nbartali\njosée\nsoftworks\nhampi\nfreytag\nmontalbán\nhasa\nbragged\nsmaug\npippo\ngeophysicists
\ndelahaye\nkazmaier\nverl\nehrhardt\nruffian\nfeathery\nconcentrator\neponymously\nautor\nmacario\noptometrist\ngrider\nschiaparelli\nencroach\ntrescothick\nchairlifts\nrassemblement\nfedayeen\nriversdale\nameche\nlesseps\ndpm\ntipsy\nnaissance\nsunningdale\ntcb\nenberg\nenomoto\ntelefilms\ngendron\narnauld\ngolitsyn\nwillman\negghead\ngrassley\nguay\nsista\natoka\nlaki\npreschools\npollyanna\npeiris\nfaial\nforgetfulness\nmazzola\nfida\nhellmut\nsurmises\npanoz\nvictoires\ncannae\noverprint\naverting\nnaco\ngarand\ntodays\nsvet\nsmoldering\nbuckthorn\nsamcro\nmentone\naudiology\nfliegende\nmuldaur\npygmaea\nbête\nflambeau\nesma\nperpetrating\ngigging\nmonbiot\nkellermann\nburrough\nnysdot\nhaircuts\nspacelab\nwwl\nelisabet\nmarschall\nhiga\nsuffren\nnakhichevan\npickaway\ncollating\nkempinski\nsizwe\nbutternut\nferc\nbalwant\nchannon\nminha\nhrv\ntruer\nradner\nsuren\nrumpelstiltskin\negfr\ndantes\nrenegotiate\nmasumi\nbubka\natos\ncarre\ninteramerican\njq\nboomerangs\nstoyan\noccultists\ntimberland\nagnello\nsamarium\nreceptacles\nmarsch\noffenburg\ndelegating\ncrimen\nversioning\ncachar\nrebukes\nparaplegia\nrur\nmakhdoom\nforfeiting\nfuton\nariza\nhydrant\nskarsgård\nfreeboard\nksar\ncarousels\ngoyal\nrickmansworth\npersistant\nsids\nabsorbance\nbungo\nuaf\nhwanghae\nketil\naylwin\ntanuki\ndyne\nboddy\nbiasing\njessamine\nhirsi\nsheepskin\nremagen\nrejoinder\nmilles\nginevra\nframlingham\nrnvr\nrodion\ntapi\nhamlisch\ntangos\nghirlandaio\nananya\nreddi\njavert\nbourdain\nweedy\nvirender\nusama\njpy\nserpens\nravenous\nkatina\nhoberman\nkenley\ncurbs\npredominating\nrobed\nmontijo\nverbeek\nforgoing\nmce\nkefauver\nidl\npromulgating\nsemiautomatic\nterahertz\nplacerville\nspandex\nsaray\nnoster\nrossii\ncubed\nchaykin\nmudcats\nreorganizations\ngiroud\nshofar\nbostrom\ngrumble\nspiralling\numaru\nlevitan\nvrbas\njarod\nnahariya\nrisborough\nolver\nhumourous\nplaxton\nthieving\nwoodsman\nlahm\nunida\nook\nrochon\nambushing\nmultilingualism\negham\nphotosensitive\
nsirocco\ncatalysed\nmirnas\nliliuokalani\nspohr\ndemocrática\ntarasov\nballrooms\nnaima\nshanghainese\nmandarins\nvalets\nmoquegua\nsurfboards\nhaditha\ngodrej\nkaja\nhali\nacth\nzosterops\nseif\nheuvel\nbenders\nherrschaft\nkohala\npassi\ngiusti\nsvenskt\npetrovic\nyellin\nfielden\npatentable\nclaflin\ncomerford\nrubrics\nakebono\nmulloy\nhewes\ndearne\nlehrman\ncroesus\nlavon\npeeve\nlec\nneiva\nnrf\nregnery\nmaximillian\ncoverup\npetron\nbixler\nambridge\ndevoto\nlifeson\ntchad\nmilkvetch\neulogized\noxalis\nalbigensian\ncosmologist\nalliant\npetrelli\njunio\nmarka\nith\nartspace\nbandhan\nbiola\ncybercrime\nttm\nvaart\nbladensburg\nkazuyuki\ncarolin\noverbeck\nbookcase\nironsides\nmarquinhos\nbarger\ncill\nnewsmax\nmicrocode\nboosh\nkame\nnodular\ndatura\nloadings\ntaxonomist\nastronautical\nwehrli\nmidsize\nbont\nsloman\nwatchlisting\nharron\nseventieth\nnishiyama\norigo\nesfahan\nromanoff\nuhh\ntep\nmicrophylla\nquirke\noratorical\nfulwood\nreisner\nhauntings\naardman\ntreloar\ntransplanting\nformalizing\njewelled\nbalke\nduguid\nscottsville\ndufay\nambidextrous\nwoodhaven\nlocklear\nmarbach\ncarmilla\niid\nsesotho\nmammography\nchanticleer\nwherefore\nbespectacled\nastringent\nsalminen\nsaker\nproprietorship\nrostand\nmeego\nalfano\npollitt\nmithras\nebe\nteeter\nthes\ndozer\nurumqi\nrenville\nmasakazu\ncuming\nthalassemia\nlyndsey\nicts\nmdb\nfriedan\nmaintainability\ntaisho\nelint\nschack\ndimmu\nrowohlt\nkirillov\nsandu\nshkodra\njdc\nkhodorkovsky\nglendora\nsplat\nfamilysearch\nremorseful\nchoreograph\nknysna\nbenedek\nsnouts\nsarre\nkleinman\nspinifex\nswb\nfrancoise\nquarreling\nmillersburg\nredeye\noef\nhida\nblaxland\nhuysmans\nums\ntrogir\ncussler\nrackers\nprecluding\ncartwheel\nastarte\ncorporación\nberenstain\ncompte\netfs\nhecla\ndrame\nbelisario\nsprouse\nmá\ndines\ndisfigurement\ntinchy\nlarned\nsanrio\nzucchini\ndraga\nasdic\naddai\ncalpurnius\ndyce\nalavi\ncaffrey\ncroy\nencoders\nlechmere\nwillam\nsetu\nmilbank\nsteamy\nparana\ntakemitsu\ns
umbawa\neastwick\nslott\nscheider\nentitling\nsirena\nnebulosa\nlassalle\nsynthetically\nseatbelt\nverbosity\nskp\ncmyk\nstraightedge\nbaghlan\nokie\nifugao\nmargarida\nschumpeter\ntrinian\npillman\nbryanston\nkynaston\nwfmu\nfalcão\npiotrowski\nrabun\nmacmanus\ninvitees\nsendak\nseething\nrodas\nbiles\ncolchis\nutensil\nbarwell\nneena\nwakefulness\npantoja\nflemings\nperrone\ncoquette\ntargetted\nhaïtien\nocasek\nkurita\nhenery\nyrigoyen\ntaita\nfamiliarly\ndirectorships\nbreadwinner\ndiapason\nkindling\nmaryna\npolicía\nquenched\nkestrels\ncooma\nnephritis\nmanar\nminstrelsy\npnm\nmcphillips\ndecemberists\nreddington\nweider\nconners\nvardy\nbengkulu\nangleton\nsaltbush\nmontefeltro\nbhambri\nctl\nmagnanimous\neckersberg\nvlan\nreily\nmcguirk\nrotavirus\nmillsap\nkuwata\ncampden\nvalk\nturlock\nitagaki\nkohanim\ndished\napb\nsatta\nsicilians\ncopil\nibbotson\naragona\nsteeleye\nmiscarried\nmemling\nmier\nsanter\nrestful\nsupersymmetric\nsre\njilted\nkiis\npolicymaking\nmutagen\nseashell\ncapsaicin\nsystemically\nassa\ncabanatuan\nslog\ndpd\nthymine\nkingdon\ndagar\nenslaving\nephedrine\nunanswerable\nkillebrew\ntaxus\nfoodstuff\nnyk\nfended\nbrecknock\npenetrator\nglp\nlemme\nodesnik\npraecox\nsows\noxus\nhaytham\npiqué\ncaltex\nblindingly\nautocrat\ndujardin\nprélude\nmoriscos\nkarakorum\nmucking\nrovira\nleclercq\nvidi\ncelan\nsituate\ntopix\nholism\nescolar\ntriforce\nleesville\nmunthe\njafari\ndroite\ninnova\nnonthaburi\ngerulaitis\nherter\narbuthnott\nvimal\ntows\nmilius\nalertnet\nishiguro\nphosphine\nsatyrs\ncanteens\nful\ndashi\nappetites\nloyally\nustream\npehr\nsuperbus\ncombes\npdg\nyosuke\naflame\nmalinga\ndelph\ncanavese\ngsr\ncentrifuges\nbethea\nlebor\nhochberg\nminerve\nelek\njci\ncatonsville\nbazán\nrambouillet\ntorelli\npiatt\nhellions\nmercantilism\nbrattle\nmankad\nempresas\nerdal\nrepels\nuche\ngallerie\nnetbooks\nwinterberg\nlukes\nkinga\nfolksy\nmacanese\naros\nacquiesce\nmatura\nopperman\nwmaq\nolla\nmegahertz\nlov\nfurnas\nswitchbacks\nwat
tana\ndegarmo\ngessner\nsubbed\nlairds\nvivas\nseafarer\nsacc\ngovernorships\nbayani\nvallo\nessington\nfertilisers\nmandelbaum\nmuzio\naït\ncoolly\ncruella\nprofesseur\nfolles\nreanalysis\ntors\ntripods\nhyperthyroidism\ncreatives\nregurgitated\nkunstler\ntiergarten\npedant\nbriana\ngerstein\npodiatric\nphono\nsubdomain\nwordperfect\nsturridge\nkikinda\nvettori\ncalla\nwasc\ngrimsley\ncoped\nwriten\npodcasters\npublico\nsolicits\nnephropathy\nlamoille\njanissary\npollinating\ncosell\natropine\ndiawara\nmescaline\nrepublishing\nscoped\noverthrows\nandras\nbarts\noutlooks\nksi\ncharu\nbodensee\ninvincibility\ndedalus\nbailar\ndallaire\nhemodialysis\nwigeon\nrezaï\nviolante\nesthetic\nemet\nfictive\nspearing\nscriptorium\nroeper\nableton\ncarcanet\ndealey\npaywalls\nliverworts\nunani\nincapacitating\nmccue\nfelino\nschachter\ntangles\nriya\nunfpa\npeeler\nrdr\nelopement\ncolumned\ngarhi\nralphie\nwcf\nasaka\nrocor\ngarton\nmelis\nlanguish\nshab\nchatroom\nenglehart\nhatting\nvulkan\nbootleggers\nramshackle\nvec\nbpc\nrozier\nryton\njuts\nsisal\nseol\nprizefighter\nooops\nunkempt\nouimet\ndisfavor\nneubauten\ninhaler\ntransaxle\nputti\npinta\njayme\nredrawing\ndissanayake\nmolin\ntreasuries\nregenerates\nholywood\ntope\nlorises\nszéchenyi\ngrandees\nheadhunter\ntartuffe\nsechs\nhygroscopic\nadivasi\nlivno\nqpm\nksh\nbenavente\nmodeste\nherodias\nnajd\npwp\nelt\njordison\nsunde\nhydrates\nryosuke\nfrogger\nmadmen\nsenders\ngratings\nhecke\ninaugurates\nphysiognomy\nprf\nnsfw\nscruples\nchristenson\nenfranchisement\nbrigadiers\nome\nsweeny\nnitrocellulose\nespirito\nantipas\ncaress\navrohom\ncorrupts\nsteeler\nsocialistic\npous\nweirdo\nkrol\nnaboo\nstockbrokers\nlhota\npanache\ncaire\nunderutilized\nsabbat\npipits\nparos\nclenched\nmahar\ntarkan\ntolly\ncedeño\nmillis\nescapism\npentre\ndimbleby\nbrightened\nsoulmate\nrelapsed\nhandrails\nwynonna\nscreamo\nkolev\nritorno\nphl\nspeedball\ngfa\npresaged\nofdm\ntorquemada\nloews\nrudos\npolyunsaturated\nulverston\njaka\nha
slemere\nmutable\nxanthine\npascua\nyousafzai\ncometary\nsiesta\nanantnag\ndueñas\npyroxene\nbooke\nteletubbies\nkulm\nsocom\nbatchelder\nscurrilous\nnuevas\nconjugates\nundertone\nwirz\nunicredit\ninah\nghaffar\nvbs\napel\ngrates\ngamestop\ncienciano\nplamondon\ncopes\ndupnitsa\ntransmigration\nsolidification\nheylin\ndegraw\nmusselman\nsnowmelt\njask\nmimar\nmudéjar\nhubbs\nortigas\nwerburgh\nhulda\ngrolier\nrampling\ndovetail\nlummis\nmarkle\npalmeri\nphotoplay\nreprieved\nbetances\nefflux\nkasia\nbucci\nvoto\numg\ngarn\npleasence\npeptic\nnelsen\narroyos\neardrum\nroyall\nunimaginative\nthud\ncarder\ngarh\nwhiteface\ndewdney\nyester\nuttaradit\nalarmingly\nhorticulturalist\nlacquered\ncoleen\nmkt\nskaneateles\nvolutes\npopol\nmaley\nblut\naghast\nkittybrewster\npalazzolo\nenvisages\nmachynlleth\norpen\nneedful\ndashti\ngrisons\nbutlins\nrnk\noutfall\nasics\nlowdown\nschering\nlupone\ncinna\nhksar\nsiviglia\nolen\nhorovitz\nshriner\nargyllshire\ngillibrand\nrailfan\ninfogrames\nforfeits\nkst\nfiddles\nforestier\neccentrics\ntrude\nkeb\nredshank\nconsumable\njcw\nwib\ncalandra\nsaguaro\nwader\nsumerians\nkeihin\nlpd\ngutsy\npsni\nhelland\ngreenstein\nstaid\nscandium\npex\npromenades\nbrocket\njamila\ngaucher\nexes\nunassailable\nroppongi\nmillersville\nbalotelli\ngimenez\nmuttiah\nkamilla\ncatscan\navital\npangilinan\nzabala\nlarkhall\nrefusals\ntanvir\ntruesdale\nromualdo\note\nabramowitz\ndrees\nurziceni\nbares\nverre\nmelnikov\ndatos\nbartolome\nhojo\neuropea\nhaifeng\nnoncommissioned\nlyng\ncorniche\nimperforate\niits\nenglanders\nformate\ndse\nsamarinda\nrolph\nkesler\ngimbal\nshepherdess\neurobeat\ndrape\nschinkel\nlocalism\nceaseless\ncobbold\nsnuka\nsparkes\nrosati\nlegalism\ninequity\nvirendra\numbro\nvolant\nbriones\nampthill\nastralwerks\nsalil\npreorder\napatosaurus\nwhitt\ntako\nsarrazin\nkristo\nethnocentric\nwallander\nisabeau\nmirages\nautoclave\nrestorers\nhohokam\nwak\nmerrifield\nnicotinamide\nhauck\nayling\nscalps\nadur\ndolny\nchronograph\nvec
tra\nchastises\nbaraga\nhawtrey\nanglophones\nsenn\nsyntactical\nreproducibility\ntuli\ndü\nstopwatch\nmanzanar\neparch\nmurshid\nmâché\nmicrostates\nautomates\ngibbins\noutpatients\ndiffuses\npietism\nharnoncourt\nnabu\ncarrere\nondaatje\ngynecological\nlutterworth\nrenny\ntrailheads\nantropología\nkulin\nneston\nfrias\nnym\nbettendorf\nshepley\nnira\nspheroid\nciriaco\nblindside\npegasi\nprions\nstubble\nspriggs\ncredulity\nrtbf\nsyncopation\npreiss\ncaved\nsecor\ndigges\nwarrent\nfug\ninstitutet\nbuzzwords\nbrauner\nvereinigung\novis\nabhijit\nplait\nzeigler\nbalak\nturvy\nbehari\nstamos\nclavering\nrheinische\nkohei\nentr\nenyimba\nwoodberry\njeronimo\nsevigny\nuncorrelated\nbiosafety\nmaccallum\nkarunakaran\nayler\natena\nprafulla\nmeron\ninfusions\nberberis\nmalu\neyton\nfoscari\nraag\nmanumission\nyoweri\npacis\nchichi\npieper\ndovre\nsiqueiros\nrosemount\nolszewski\npawley\nhughenden\nroams\nswi\njbeil\nlcp\nanny\ngauhati\nbanged\nrapallo\nparticularity\nunderused\ntomer\nroya\ntacks\nberen\nmalabon\nsaldana\ntrevisan\nrefundable\ngibby\nthorsen\npermalink\nprofanities\nsynching\naskey\nprakan\nnihilist\ncholo\nhambantota\ncanker\nbeholden\ntactful\ncelli\nchuncheon\nvibraphonist\ngnostics\nampas\nodu\nkarnali\nwpp\nlimber\nastraea\nhürriyet\nwms\nmorella\npeddlers\njettison\nstormtroopers\ntohono\ntongatapu\nsummery\nparisiens\nusbl\noutnumbering\ncelis\nkaratantcheva\nshibe\nbittner\ncraziness\nswathe\nchessman\ncentrism\nkreisler\nmanahan\nfizzled\nactionscript\nboxy\nabdollah\npedestrianised\nock\nerythrocyte\nsamudra\nhoseyn\nuncreated\npiczo\ntorc\nlandrum\nsprinted\nmcandrew\nmux\nsurvivalist\ninfantilism\nbadman\npilton\npostel\nbluelink\nsandifer\ntempi\nnishinomiya\ndisconnecting\nabsolut\nchurned\nentombment\ntresor\nskyways\nkalimba\nnasmyth\npicketed\nkamin\nmesut\nshanshan\nfogh\nunceremoniously\nwapiti\npaeonia\njamalpur\nramani\nmailboxes\nbashed\nossification\nimani\nbrags\ndeveraux\nraam\nanesthesiologist\nlangar\nnovick\nwroth\nlinnell\nca
thar\nizod\nzeo\nleis\nstenberg\nbahai\nhofman\nmgc\ntakase\npaddled\nbrunelleschi\nbreadfruit\ntyagi\ndisorganization\nyukawa\nballoonist\ncoughs\nredeveloping\nscrambles\norphée\ncommited\npluralist\nwoodhall\npimple\ntychy\ngreenford\nlivni\ntrinité\nharasses\nbogusław\ncathrine\nfriedemann\nkrabs\ngoudie\nmente\ndenialist\nspeared\nslac\nwishers\nfotolog\ncrumpled\nshirtless\nsideswipe\nbenjy\ncañon\nues\ndermott\nprotruded\nmillbank\nbabri\npuddings\nfaragher\nsirisena\noverstating\ndih\npocus\nbewilderment\nkeeton\nbutanol\nkilkis\npharmacokinetics\nmasterman\necosoc\ntownscape\nlsb\ndunleavy\nwatchung\nquadrate\ncattleya\ncredulous\nfriulian\nauric\nhutter\nrasp\nflorinda\nkayan\napplewhite\nfootings\ngerold\nwatley\nshukri\nkapu\nboog\narmors\nschoenfeld\nsuperintending\npoling\nhalles\nzatoichi\nmadog\nuncapped\nvavuniya\ngujjar\nmaulvi\niliescu\npawned\nclumsiness\njourneymen\nelided\nblackshirts\nwyverns\nhillenburg\ndyn\nosf\ncaudillo\nhermans\nhatillo\nkmb\nregistrant\naizawa\nmisinterpretations\nrivalled\nagriculturalist\nfuhrman\nfrp\nomd\nperfectionism\ninterjections\nimagin\nexcrete\nshogo\nnonconformists\nayano\ncockerill\nodp\nvayalar\nactioned\nolvera\ndelicias\ntransphobic\nletourneau\numkhonto\ncreon\nsilvera\nraindance\nnursultan\nunmet\ntalky\nsheth\nnorrington\nsatyendra\nfuncinpec\ndeflate\nmoorefield\nmadhubani\ngullit\nsensitized\njuxtaposing\nafu\nkulick\nkoka\nvasek\nmarwick\nindochine\noutagamie\nchibnall\nawc\narroz\nrangeland\nguibert\ninh\nraylan\nguadiana\nalegria\nveendam\nbluetongue\nmorobe\naao\nprn\nroseus\ndisseminates\nmainmast\nageism\niscsi\nsanderling\nwych\nquadrille\nwalshe\nsnatcher\npalafox\nstretchers\ntransmedia\ngentium\ntiniest\nhankou\ngion\nrepaying\ntrf\ntrewavas\nmastaba\nparaskevi\ncina\nlistserv\nshermer\nfallible\nalphaville\nmicrornas\nsherrie\nstreator\njui\nkempner\nwatkinson\nkaolin\ngreased\nanning\necj\nmarsham\neuskadi\ntebbutt\nmuffled\ngrossmann\nfirmament\nunas\navx\nnowruz\ndele\nradioshack\nloren
tzen\nsafir\nhoneyeaters\natia\nitb\nshamsuddin\nrefracting\norwellian\ntoomas\nspherically\nzamani\nfpf\njespersen\ninactivating\nhelmstedt\nerbe\nnought\nfragonard\nhisako\npanopticon\nreenter\npastiches\ntitillating\negs\nhetero\nmegha\nokmulgee\npuritanical\nbussey\nbiog\nshc\nkathi\nglendalough\nsouthington\ndowland\nfatman\nzarqa\ngarston\nphar\nnlc\nmanasa\nvelutina\ntadcaster\nnanshan\ndrumsticks\nhydrographer\npastoralism\nrabah\nwolfmother\nrosolska\nlunchbox\nsindhis\nlakehead\ndettori\nnsg\noig\nmckendrick\nmarwat\ngarrone\nbroadus\nrogersville\niguanodon\nprudhomme\nticketed\nyoyogi\nmunsey\ntrunked\nadrar\npaks\nhanneman\nmikasa\nhoroscopes\nbarbirolli\nird\njeane\ngwu\nhirosaki\nexpresso\ndrennan\nanjan\nyeomans\nzepp\napx\nascania\nsalus\nzhenjiang\nnop\nglimpsed\nprecolonial\nsurfside\neinsiedeln\nrivendell\narmouries\ncircumstellar\nplaintive\ncarmi\ncharoen\nshavuot\nmazer\ntlatelolco\nmulock\njugal\nbhaskara\nwesker\nmileposts\nfeldkirch\naurélio\ntweeter\ndeluna\ntolo\nhauteville\ntriceps\nhigson\nmcclurg\narcand\nteknologi\ngisèle\nbungay\nartsy\ncorigliano\nphonogram\nrask\nkunis\nswatantra\nmoonrise\nrollei\nalmira\nrepertoires\naldiss\nquilter\nchuckles\nmegalodon\nmorra\nsluices\nwarhols\nknapsack\nantje\narmm\nlincei\ndoni\ngrom\nquadrupedal\nolivaceus\nbridie\nneuropsychiatric\npreserver\nishiyama\nunrepresentative\nquijano\nequips\nlovano\nteale\nfiggis\nsinh\nwidmark\nwitmer\nhnoms\nlengthier\nrehder\nmacos\ntherewith\nyago\nmacadamia\nrtos\nspew\nnguema\nflagrantly\netter\nbendel\nelisir\nvisita\noutwit\nmainsail\nkowal\nindigestion\neurocentric\nornately\ntunde\ngiorgione\nabbotts\nfrauenfeld\nuninspiring\ndusan\nyokai\nparadigmatic\nsartori\ncreagh\nslandered\ndeterminers\nringtail\ntrestles\nkristofer\ncoteau\nsteffan\napophis\nboldon\nwestend\ndrewett\nordaz\nmcquade\nblix\nhcv\nkhari\nmillay\ndreamscape\nokc\nkuchipudi\nmmmm\nfunnily\noverreacting\nrajoy\nnightjars\npacífico\ncoachbuilders\nzouave\ntonsils\nshailendra\nlittler\nwa
veney\nrivka\nitp\niguodala\nranchera\netten\nftv\nupturn\nslingers\nfolia\nnantong\nvirtualized\nquneitra\ngwydir\npuranic\nmohican\nresistencia\namerico\npuya\ncdl\noristano\nziyang\ndigestible\nlochalsh\ntekke\nwinship\ngummy\nshipper\nchabert\nwaronker\npeeples\nsclater\nanomala\nkhorramshahr\nmaladaptive\nhorseradish\nstagecraft\nkishinev\nkhera\nrindge\nupswing\nredwing\nhusks\nunecessary\nsulzberger\ngodfathers\nroldan\nscritti\nplainer\nfiligree\nsaloum\nbby\nkoskinen\nsubodh\nflavouring\nbellsouth\ngwaii\nchided\ncajal\nwastelands\nlochhead\nnorthey\ncabbages\nswithun\nduplexes\nhypodermic\nbilton\nunnerved\ntokarev\nrecanati\nearthlink\nkillala\nsondrio\nbarbiere\nrochette\nmithra\nseabiscuit\nwilmar\nvasoconstriction\nwarbeck\nlistenership\nzumwalt\ntagg\nlijiang\nbillah\nakm\nespanol\nbiophysicist\nsiwan\ntrippin\nilc\ndevoir\nhardys\nkisangani\nguangxu\nterminations\nretrace\nrossall\nstarkweather\nbetar\nvanga\nfadi\ndominici\ntarifa\ncomal\npulverized\nfrontend\nclunes\njudaeo\nbaryons\nenergetics\nblooper\nakemi\nilkka\nkue\nstomata\ntorana\nerhu\ntrikes\nhanko\nbuh\nlamington\nbdb\ncapitalistic\nvasiliy\nburin\nbranimir\ncollazo\nhawken\nsophist\nbeare\npilani\nwnw\nworkarounds\nweatherspoon\ngelli\negorov\ntrower\nauroral\nfederici\nmightily\nbiche\nministre\nsuchitra\nugt\namm\nfujimi\ndobrev\ncausally\nbladders\nloewy\ntalksport\nwilhelms\nneiges\nemoticons\ncengage\ngrimms\nkayode\nkine\naudioslave\nbotta\nunfurled\nconfed\nconvoked\namanat\nmutating\nchaytor\npicatinny\nmicheli\nsickert\nparkwood\nsimplon\ntechnocrats\neley\ngóngora\njeffry\nuncaring\ntemplo\nbaler\nverifier\nhagelin\nendangers\nhepatocellular\ndurning\ngraig\nbeheshti\nmammon\ninfinitives\nhawksbill\nalpaca\nredemptorist\nsubmerging\nbioluminescence\nayhan\nmillwright\nportmore\npanday\ndrinkable\nsymmes\nundergarments\ndgc\nsonnen\nmasefield\ngessle\neisenman\njianye\nawang\njocko\nganapathi\nwavelets\ngef\ncalifornias\nmudslide\nbreanna\npoinsettia\ncgmp\nharaldsson\nunlabel
ed\ntritons\nfrightens\npeachy\nwanker\ncun\nrehan\nsarod\ngwa\nenvigado\nbacup\ndoniphan\nrocs\nsdm\nubangi\nbatts\nserco\nphalanges\ntoivo\npopoff\nites\nzay\nhurlbut\nrezoning\nschipper\narbre\ngrappelli\nmathai\ngrowler\nportsea\nhengelo\ncancri\nulisse\ntusker\nbroiler\nfrakes\nedmundsbury\ntoccoa\nhellblazer\nmultipath\nsunlit\nsecularists\nbalasubramaniam\nyuck\ncrysis\nrashard\nunrivaled\ncradles\nsbv\ngentofte\ntofiq\nmaye\nstigmas\ncued\nwreaked\nmichibata\nfmv\nberkut\nsloot\naldermanic\nwiggs\npavo\nclavell\nugarit\nfarmiga\nponto\nmiyagawa\nkayakers\nbuckaroo\nwarkworth\nsasson\nsmillie\njsm\ntrilby\nhowls\ngelber\nsaml\nbuon\ntaegu\neisa\nlobel\nnoticable\ntangara\ngaffer\nplaquemines\nblasius\ntrackside\nknecht\natsugi\nhomolka\nrto\nsatoko\nzedillo\nfaringdon\ngedeon\nshravan\ngoellner\ngazes\nbronner\nblunden\nmagnani\noverturns\nbryk\nazra\nconsolidates\nerhardt\nadweek\npepo\nclaver\ndisobedient\ntonino\news\nbludgeoned\nadit\nsupermen\nglucocorticoids\ncowbridge\nmongooses\npremised\nannelids\nassholes\nlucey\nhobey\nsido\navala\ndrinkin\nsteamtown\natlee\npannier\nboh\nkhobar\nkurokawa\nlurks\njabiru\nresurrects\nanjelica\nhelbig\nbenga\nmitu\nfetches\nlocum\nambrosiana\nndash\nmoustafa\nlovering\nbrodrick\nwpi\nextramural\ndawns\nsirna\nvastness\nbrackenridge\ningelheim\nlepcha\nasphyxia\nbrixham\nsqueaks\nlath\nqahtani\nbalsamo\nkinen\ncharl\nchimeric\ncompressibility\nheikkinen\nbranham\nburqa\nborgen\nniva\nsubrata\nsymbology\ntpo\ngomis\nsubducted\nhixon\nkasteel\nstds\ndemar\nsharrock\nbrillante\nkristopher\nteilhard\nnitrile\nmelara\nheerden\nskinners\nleeuwin\npolemicist\nipi\nvecchi\ngroundskeeper\ntrifecta\njarreau\nbrevirostris\nradiologists\nlakin\nbeygelzimer\nbernkastel\nklement\nattwell\nlorette\npicabia\nbuckhorn\ndigicel\npesetas\npteropus\nsheahan\ngökhan\nflukes\npressler\nsneijder\njiji\nwitching\nfizzy\nmaariv\ndamo\niturralde\nrowhouses\nskov\nhermaphrodites\nsulaymaniyah\nbaseband\nsamui\ninterventionism\nfatherly\nwarps\n
dowdy\nkumbh\njacko\nbajwa\npunahou\ngabrieli\nhba\nkannapolis\nsubmergence\nsheikha\nuggla\nvipassana\nvirion\nasinine\nsubmariners\ninta\npowershell\nghor\ncressy\nbjørnson\nsinologists\nmalm\nspasticity\nmanin\nhayfield\nkalia\noutbid\nnaphthalene\nconfectioner\nchowan\ntabata\nbericht\nbrimmed\nstafa\nberoun\nbektashi\nsavoyard\ntownsmen\navtar\nlehner\ncov\nsebald\njds\nkreutz\ncavalryman\nkerslake\nfarlow\nhispanica\nedgars\nsmothering\nlisowski\nfales\nlatinoamericana\ntihomir\nbiryani\ninfuriates\nlydney\nstarosta\nexterminating\ninterceded\nmusi\nampara\nbandgap\nwangen\nscavenged\nniggas\nunabashedly\nplasencia\npinks\namnesic\nblouses\nseko\nanji\nwjz\njabberwocky\ntrud\ncentralist\ndhol\nadoptee\nboracay\nfingernail\neft\ngertz\npoliticization\ndragão\npetras\nkinema\ndebriefing\nshamshad\neom\nciccone\ncoláiste\ngalván\ncolonic\ntsutsumi\nquezada\nopelika\nsqueezes\ntortoiseshell\npositrons\nsarhadi\nayoub\nreconfigure\natienza\ncorsten\nacht\nrosenstock\ncincinnatus\nboshoff\nrenin\npolsat\nblomqvist\nmethven\njalali\nchinon\nbovis\nabsconded\nglaciologist\ncacho\ngranma\nmantovani\norst\nwatrous\ncomhairle\nmafalda\nimbalanced\nstanier\nsuen\niwerks\nmetta\nsumgayit\npatwardhan\nchough\nwyld\nintesa\ncycad\ndanièle\nbadalamenti\nmeitner\nokabe\nvesalius\nmimesis\nchalets\nholodeck\ntrills\nchartrand\nbertolini\nenrile\nuprooting\nqueenscliff\ntytler\njudicially\naccc\nfalsifies\nemer\nincoherence\ncbb\nkraepelin\nzodiacal\nmactavish\nthesiger\nmartigny\nbennetts\ndisturbingly\nrefutations\nboneless\ndanang\ndebar\nlascivious\nnaba\nwoodfull\nrhyne\nimm\nconjugations\nbbm\nwherry\nkewaunee\npontotoc\nprosieben\nplastering\nprofessionalization\nsignore\narnesen\nsaturnino\nmorera\nnuméro\nathan\nreburial\ndte\nkurier\nmusab\nmonetization\ntohru\nkaczmarek\nizaak\nmullens\nattributive\nlvi\ntruda\nmihir\nbyard\nlawrenceburg\nhubbert\nseppo\ncce\nwaded\nfmf\njanuar\ntaio\ncorrespondance\nyellowjackets\nswordplay\nmangini\ncoalescence\ngiustiniani\nmandoli
ns\nmondragón\npiecing\nrcti\ncurwen\noligonucleotide\nedsall\nthijs\nmcmillin\nseether\nadventuring\nspinney\nstrood\nmapplethorpe\nproprio\ntakeoffs\nstalagmites\nrovereto\nguinan\ndural\nblakeslee\nkütahya\ncranleigh\nmccully\nhaglund\nlindros\ndurfee\ngrasps\nchintamani\nobeid\nsharyo\nmillinery\nbelmopan\nrungs\nbravura\nlethargic\nmigdal\nwilden\nophiuchus\ndida\nguillory\nmacomber\ngak\nbarmer\nhicksville\nmolesey\ntink\nnichts\nrottnest\ndayna\nladle\nojha\nmclaglen\nfulmar\nlinx\nmesmerized\nescarpments\nschooldays\nweirdest\nushant\nkindled\nwagtails\nelp\nzeon\nblondell\nmccleary\nlitvinov\nicom\narthroscopic\ngodspell\nkika\nincana\nqueso\nnlb\nfujioka\npaged\nmch\nbattelle\nheba\ntrow\nbroadmeadows\ndemento\nfrsc\nfaultless\nganley\ndimwitted\ntiscali\nbulan\njanzen\nhookah\nsusy\ngeol\ncupolas\npitot\ndus\ncoolies\ndut\ncastigated\ncytometry\nprodrug\nstringers\ntomoki\nlotbinière\nherringbone\nusna\nmonkstown\nchardon\npaigc\nbasterds\ntelles\nbelconnen\nmorné\nwnet\nautophagy\nsonderkommando\nhallow\nlambayeque\nbucanero\nbeachside\nmonastics\nspunky\nladner\npantano\nbirthmark\nmizzou\nardwick\ntranches\ninerrancy\nsalespeople\nsnakebite\ncharlier\npolymerases\néconomie\npegram\nhannelore\nlucilla\nunconstrained\nnewberg\nleber\nchungju\nmisspelt\ntheis\nconocophillips\nputman\nbajpai\nsupportable\nmackin\nlogon\nnereus\ncontravening\nfedexforum\njazmine\nresch\ndsiware\ndieterich\nhijaz\nrefoundation\nbakeri\nbinkley\nfleischman\nupstanding\nseversky\njusuf\nmasochistic\ndeliberating\nniland\nbegat\nassertiveness\nimma\nmudhoney\ndallin\nocaña\nwol\nmalte\nmorad\ncondell\nbadham\ntoshihiko\nrickover\nponomarev\nsongo\nidg\nstiffening\nbaloney\nconlin\nberate\nserail\ndoering\nfranch\nfistfight\ncah\nenrolments\ncommunalism\nburnished\nsignifier\nmontejo\nburdekin\nparamour\nhüsker\ngranuloma\nmansdorf\nrepressing\nneutra\nemmitsburg\nulc\nsaleen\nzanardi\nesko\nfritillaria\nhallamshire\nkimo\nsolus\nparkins\ntintagel\ncawdor\nroseate\nchimie\nbeccl
es\nmurilo\nrupprecht\nolesen\ndillwyn\ntuya\nkeelboat\nsolidus\ntamagawa\nmathison\nalang\nndola\naltavista\npalu\nscheffler\nworkhouses\natrash\nwitchblade\nearthsea\nroberti\ngresik\nshaar\nbintulu\nrivulet\nardently\ndispenses\naherne\nlandseer\nhacksaw\nbeacham\nsofala\ngyroscopes\ngöteborgs\nnauman\nhorkheimer\ntimoshenko\ndecathletes\nlindblad\ntamarisk\nbaye\nevas\nmdg\nlilburn\nstatesville\nporo\nmaina\nastigmatism\nshuck\nbanjos\nkindest\nwsw\nnuevos\ndonnington\noddfellows\nbackdated\nwestmore\ncbrn\nbotanica\nstraightforwardly\nforeheads\ntreasonous\nimploded\nbentivoglio\nfriendless\nrailton\nveuve\nbfg\nbroomstick\nkonstantinov\nmicrosatellite\nperales\ntfr\ntizard\nlegless\nsatirised\ngratton\ndachshund\nburren\nbarinas\ncrafton\nperitoneum\nlakatos\ncobblers\nbelzer\nkvm\nduccio\nhenninger\ngaitán\nsnowshoeing\ncecilio\nneurosurgical\nherdsman\nfcf\ntheil\ncheeseman\npennsylvanians\ncush\nstrzelecki\nbooz\ncloncurry\ntorri\neire\nslitting\nscleroderma\ndeeping\nceladon\nbreathable\nmornin\ntota\nmixon\nfillion\npeiper\ntukaram\nrevolucionario\nduwamish\ncayuse\nmils\nvirions\ngona\nhoth\ncaddie\nlasallian\npatriarca\nmanno\ntracklisting\nandreasen\nlepore\njosei\nfreamon\ncrowdsourced\nchiffon\nmealy\nevanescent\nstephenie\nneugebauer\nemm\npuch\natchafalaya\npassamaquoddy\nrtg\nhillhouse\noscillates\nemporio\npropellor\nwrithing\nbelyaev\ngarriott\nawadhi\ndanijel\nkarai\nkontinen\ndhi\naccuweather\nhabanera\nnealon\nnominum\nbalachandran\nkillick\nkaboom\nbrenta\npenton\nhirvonen\nheritages\nwkrp\ngsl\nroxburghshire\nmmi\nkantha\nashtray\nolwen\nvosper\nyohan\nvies\npriyadarshan\nreck\nnaumov\nmatteson\npincher\naramean\nloxley\nadjuster\nsergej\nmaharashtrian\nguptill\nperemptory\ndetonations\ntufnell\nstreamflow\ndatacenter\nhubley\nredan\nmridangam\nempathize\nhonma\nsafaris\neppes\npersei\nbyeong\nrambam\nheaviness\njermain\nrecce\nkönigstein\nmedora\nkawaii\npandion\nschmeichel\nconneaut\npripyat\nhalli\npizjuán\nnhm\nwoodcarving\nmidnite\nobi
ang\nnelli\nryden\nplatteville\nkabel\nqaumi\nraffaella\nachaeans\ngötterdämmerung\nhonus\nlobelia\nepaminondas\nfunneled\nhuger\nbaixa\ndomesticity\nhouk\nyonezawa\nevros\nharishchandra\nabdol\nexcitability\nnystagmus\nnanostructures\nsiddiq\naulnay\nxining\nmirra\nmasseria\ntheda\nhindle\nlivewire\nrentería\nmachismo\nindulges\nnobu\nhotshot\neurocity\noklahoman\ngynaecologist\nfulke\nplessy\nosterloh\nstrictness\nsecco\nriko\nishwar\nelvas\nsolan\nnystrom\nhuynh\nemrys\ngiove\ngloriously\nregie\npendulous\ncwb\nnve\nsaiful\nclaves\nfrontiersmen\ndmitriev\nofcourse\nsanssouci\nmisapplication\nhomerton\ngaullist\nhipsters\nstethoscope\nbergère\nvarietals\nhospices\neld\nbricklayers\nmaracay\ndmd\nrebooting\number\nnyo\nogmore\nzakk\npinpointed\nchickadee\nfave\ncics\nescrivá\nsoro\ndasher\ndudi\nelsner\nviacheslav\nhowey\nlongboat\nyuh\nmahomet\nspeckles\nchromodynamics\nnorilsk\ncarothers\nlythgoe\nmodulations\nmalena\nunconstitutionally\nsabines\nrothrock\nhydroplane\nreding\ndunoon\nwlr\nkaramchand\nhigashiyama\nchaise\nangelov\nbalalaika\nverbum\nloubet\ndestefano\nfpas\nresplendent\noses\nguerrier\npally\ngeza\nviviana\ntramping\normoc\nelementals\nrainha\nbini\nblench\ncrookston\nemeli\ncorticosteroid\ndominos\nkiani\nkufstein\nburda\noversimplified\nwdc\nnorio\ndicta\norso\ntanegashima\nsimsbury\neichelberger\nparakeets\ninkling\nleflore\nvoyaging\nmaximising\nmillennials\ncristi\ndmp\nyazdi\nheffer\nsaada\nmacleish\nperps\nrigo\ncloete\nlaue\nastrophysicists\nhln\nimmonen\ntowneley\ncrosswind\ngumby\nkatsuhiko\nidina\nthinnest\nkosrae\nlauds\ncarpeted\nlayfield\nhushed\nbtecs\nemissivity\nrandel\nredbook\nroberge\naltena\nrozen\nrwe\nunk\nallspark\ncarted\nwalthall\ncenci\nood\nharling\nbracey\nbalzan\nswimsuits\nsaucedo\ncélèbre\nnewnes\ngraber\njoji\nsvm\nanadromous\nrideout\nscamp\nmpm\nborelli\ndsg\nunivariate\nholkham\nexpounds\narvada\nchamoun\nccny\nstaden\nquijote\nsailings\ndefcon\nclifden\nxianyang\nmcwhorter\nostwald\nbetws\ngunz\ndressers\nedist
o\narklow\nranganathan\nnitrogenous\nronge\npavlovsky\ngemäldegalerie\ndurr\ntyrconnell\npendula\nmasham\nslf\nwaiving\nhibino\nassented\nmellifera\ncita\nbodice\ncmv\nhampers\nyerushalayim\npkc\nsagesse\nsamael\nclb\namdo\ntruitt\nhoel\ntheodorakis\nhjort\nluray\nkoivisto\nmilitarist\npoliticised\nplanum\ncousy\nbrunn\nkatsuya\nipsa\npsychodynamic\nbanta\naragonés\nenergize\ncontro\ncrea\nprestressed\nhuseynov\nhalberd\nborate\nsandisk\nimproviser\nhofburg\nsalzman\nbensonhurst\nsourcebooks\nlevity\nhistadrut\ncorsham\nsmother\nastle\nhplc\ncanossa\ndernière\nfirebombing\nkovel\nlitvak\ntaxonomies\nretaliating\npecci\ndarvish\npimenta\nchildbearing\nmiter\njellybean\nacademe\ntwiggs\nsalween\neline\ncanted\nbarbarella\nwaitrose\nchali\ncalorimeter\nizquierda\nconsenus\ngoldberger\ngilli\nayeyarwady\nkawagoe\nloitering\ncognitively\nsowed\nwarlocks\nashkenazic\npervading\nataman\nfria\nmahou\ndearden\nmox\nacadiana\nboastful\nmishawaka\nheeding\ncino\ngondry\nchambon\nenought\nmeco\ncontinuities\ncaymanian\nwesthuizen\nbluffing\nrawle\nelucidating\naristolochia\nkommersant\nmutharika\nridgecrest\ntolson\nnadya\nkuta\nonur\ninteractively\npetulant\nfarsley\nbheri\nizzat\ndesertions\ntrebek\nferreiro\ndecoupled\nunwound\nzooey\ntion\nproblematical\nnetzer\ndelphinus\nbeastly\nwordsmith\nmonotremes\nhecuba\ntapan\nmurton\nkahler\nsecant\ndietetics\ninhumanity\nedification\nundertakers\niancu\nzenda\nrooker\nantonios\ngharib\nmarchesi\ntamers\nkerim\ncco\nelectrify\nlumbee\nhardboiled\nbellson\ngrundtvig\ndawlish\npremières\nmcdade\nsnowmaking\njosse\nkissin\nguillot\nobando\nmoondog\ndionysian\nfrigatebird\nlebensraum\nluminary\niñárritu\npoggi\nfloater\nportus\nbcf\norban\nmussoorie\nbroached\nimbert\nenki\nforewarned\nromeu\ngammel\ndlitt\ndents\nidan\nidn\nbowsprit\nmontesinos\npushmataha\nsaturate\nhamam\ncardo\nmisterio\nkorey\noverheads\nefc\nspreader\nnery\nndsu\nbellwood\nmullick\ntourbillon\nrapidan\nthoresen\npursuance\nchantiers\ngoodhart\nplantinga\nmeiko\n
unnaturally\nlank\ngermar\nbranta\ntechnologie\nkameda\nlangella\ngrahn\ngorp\naffiliating\ngrinds\nbayat\nstanly\nbrij\ntripolis\nseismicity\nsidebottom\ndevoutly\nseyfried\nrelievers\nanachronisms\nraho\nschreier\nchimborazo\nchauvin\nleas\nmahalia\npouvoir\nmelded\nnabis\nskyhook\nlimmat\nzamboni\nmalted\ncorvair\nmoorhen\ntransmissible\nsnotty\nbosnians\nlengeh\nhumiliates\nimjin\ndlm\nunserer\nharmonically\nramped\nsexologist\nguardhouse\nusfws\noquendo\nruhlmann\nbloodlust\nréal\nserp\namwell\nrikishi\nnehalem\nserá\nosip\nexhaled\nbitton\ncaskets\nuntangle\nchipotle\nstip\nnarwhal\nsalafist\npiombino\nrademacher\nlulz\nturbodiesel\nmizell\nicmp\nesteve\nchristman\nmencius\nspymaster\ngorgias\nuran\neliasson\nopensolaris\nheartlands\ntanneries\nceanothus\nmerril\nlavoro\ndecedent\numbrage\ndinoflagellates\nrepsol\ntintern\nmoine\narchipelagos\ntezpur\nmikka\nstebbing\najo\ndarién\nwhydah\nfeydeau\nphotocopies\njmt\nbusways\nscarfo\nawal\nwigglesworth\npamlico\nadroit\neit\ncrocodylomorphs\nloti\nepiphytes\nsamberg\nperfumed\nfukuhara\nfuori\nstassen\nlettice\nvla\nsebi\ngye\nshoegaze\npalahniuk\nfairburn\nauthoritatively\nmarya\nbochner\nenlarges\npavlovic\nfullarton\nchabon\nswab\nantiguo\nporque\nhosur\nadamantium\nplosives\nretouching\ntravelin\nibb\neludes\ngerbils\ncockerel\nhelicobacter\ntexters\nsuperbowl\nlarga\nhoche\nafterwords\naldus\narul\nmultirole\nsessional\nalencar\nfrustratingly\nnrs\ncaissons\nsheetal\nbelin\npeenemünde\nspotnitz\nsakhi\nnatrona\ncolluding\nherdsmen\nshivani\ncaradoc\ninsipid\ntrager\nsantayana\nskincare\nbrda\nheinrichs\nportis\neroica\ncupboards\ngeena\nmylo\ndain\ncalouste\nkazuyoshi\nayyappa\ndignities\nades\nshino\ncoble\nrimington\niir\ndestabilization\nshinzo\ngok\nrefinance\ngaap\ncopperbelt\nayyappan\ninvalids\nbicentenario\nqal\nmidwood\nkippax\ngiampaolo\netawah\nmatchdays\nmathurin\nfourcade\ncloches\nduals\nhersham\nparterre\nrehm\nasmussen\nmamá\nbrassard\nbialystok\nkassam\nravalomanana\nsito\nsuperfluid\ntremo
nti\nverlander\nshoves\ntruest\nnourishing\nmeccano\nteutoburg\nariake\nunenthusiastic\nspearheads\ngmac\nhomberg\ngamini\ncolebrook\nmisattributed\nspes\ngost\npelo\nhoarse\nbuttoned\nmaritzburg\nmarad\ndemographically\nmomsen\nwavered\ngillig\naldea\nyefim\nejaculate\nprophetess\nmauritz\nnuyorican\ncintra\ntaufik\nmckendree\nbuchman\nviolencia\nkersten\nhandiwork\nnecromancy\nodio\nemba\nimmeasurable\nturbid\nbambaataa\ncalman\nrosalinda\nily\nblarney\ncandlemas\nfishel\nmbb\npicus\nyearlong\ngorin\nstents\ndja\neventuate\napparatuses\nlipophilic\ncanariensis\nvariegatus\nsufficent\ntartars\nvenner\nchoctaws\niannis\nmoorgate\naminu\nbrutish\narve\nconté\nhellion\npylos\nlofthouse\nsalters\nspaceplane\nsidewinders\nfuoco\nbookmaking\nmaunder\nkubiak\nhypermarkets\nadepts\nhertel\nfifo\nsepak\nmilad\nbabin\nnycb\nlocatelli\ndaba\nahold\nholofernes\ncillian\ngoodly\nfaqir\nffsa\nshatin\nlepa\nbreuil\nmys\nabay\nregolith\ngarai\nneurotoxic\nsabian\nwinnfield\nhypnotherapy\nbrb\nfeluda\naeronca\nbaber\nmcgoohan\nzenaida\nfuria\nwirtschaft\nlocomotor\nchambal\nettrick\nafforestation\nfaqih\ngilbertson\nleopardi\nfourthly\ncabled\nzhdanov\nkurth\netz\nremembrances\nhaemophilia\nmontages\nthew\ncynan\ngladio\npenélope\noregano\nlaurentia\ntransfered\ntxdot\nlongden\natilla\nyevgeniy\nsonorous\nmcgarrett\nhayles\nbroadstairs\npublicising\nzend\nfishbowl\njewitt\nwhiley\nkunsten\nirae\nziauddin\nqibla\njetpack\nhansell\nkatoomba\nfeste\nenn\nzsigmond\nlúcia\nbinion\ngestural\ndented\nchink\ngado\nfreberg\nalbizu\nherschell\nlapp\ncarolines\nponyo\ngabrielli\nrupe\nalpin\nfrogner\nlacing\ntogetherness\noutwith\nragnall\nunappreciated\nrakhi\nhendel\nastrazeneca\nduell\nlegwork\nzapopan\nhiramatsu\nflevoland\nsuria\nnorteño\nmartialled\ncheapside\nrigida\noglesby\nkidron\nrevolutionizing\ngillick\narachne\npictographs\ndaresay\nradetzky\nduna\noverrepresented\ntextron\nrockall\ngutting\nwara\noverstatement\nuselessness\nbalraj\ncou\nmarts\nkeynsham\nagfa\nedgehill\nnonsecta
rian\nweei\nmyrdal\npwllheli\nmaestre\narcaded\noutremont\nshildon\nihp\namyloidosis\npygmaeus\nrelished\nexpletives\ntendinitis\njunger\nhillforts\nzun\nbuisson\nchromate\ncanoga\nmtf\nilir\narachnid\nbolen\ndreger\nsebaceous\nramazzotti\nambar\nimbedded\nhelles\nzameen\nappeased\nhicham\nsatanist\ncorrente\nheigl\nworkaholic\nvch\nheschel\ndubna\nreichs\nhofstede\nbared\naspley\npreselected\ntarentum\ncastling\nlowy\nchigwell\nsuperwoman\naversive\nplaisir\nmysterium\nprimum\nwerfel\nabridgement\nrhema\nsteinberger\nnaivety\nmedvedeva\noligonucleotides\nritchey\nfak\nexpunge\ncamelback\ncaren\nsuco\nhazlehurst\nbobtail\nprozac\ndorsetshire\nlowden\nunifies\ndirectionality\nwatchdogs\nhorsa\nupregulated\nforfarshire\nhaz\nquadriplegic\niridescence\nniaz\nerdington\nhermanus\nsubsides\nvladan\nabsalon\nlinette\nbuddytv\nsungei\nliberalized\nbaumeister\nkuip\nbothell\nedy\ndefranco\nemmrich\nhardliners\nnicu\njunichiro\nsuspenders\nurey\ntopiary\nkarya\ncorella\nmizushima\noutlive\ngrantley\nconscientiousness\ngenest\nbaixo\nsovetsky\nobfuscated\nlaffite\njackdaw\ntorrente\nmonseigneur\nvenugopal\ndornoch\nlaidback\npiccola\nsingen\nadk\nshedden\nskulduggery\nshrugs\ndasha\nanhinga\nchequers\nbieler\nfilene\nvigilantism\ncaroli\nschwinn\nshefford\nraver\nunderpaid\nhickenlooper\nharpists\nboeck\ncavaco\naltera\nreincarnations\nanselme\nchumphon\naflaq\nmechanisation\nsupermodels\ninvigorating\ntemescal\nwolfenden\nvrba\nclemenza\naeronautic\ndavidoff\nwizarding\ntympani\nlurker\nshepherding\nngau\nwun\ndeltoid\ncamo\nbaya\nwatton\nkumba\nrossana\nzelenka\nfillol\nxochimilco\nyojimbo\nelif\nlisandro\nthruxton\noligarch\ninterrelationships\nmarsters\nsants\njongh\natvs\ndisciplining\nfirhill\ngambhir\nmccloy\nvoyeur\ngrigorovich\ntatsu\ndatable\noberstein\ninky\nwesterman\nwensleydale\nkeewatin\nkucha\nweblink\nkenelm\nappendectomy\nprioritise\nmendler\nbawden\nuncommitted\nendemics\nsandlin\nnawaf\neightieth\nthermoregulation\nherbivory\ngoudy\njodrell\nhornell\nuntir
ing\nordzhonikidze\ngatchaman\nmiscavige\nlegros\nlowcountry\nmarivan\noyez\npolenta\nkilmainham\ncaduceus\nmarcotte\nforres\ncameronians\nduilio\noutrighted\nknudson\nmarkaz\ngroulx\nsavary\nfontenoy\nlandsborough\nstepanovich\nborek\nstegner\nfanu\nevel\nnashi\nkrickstein\ncarro\noughta\nbimini\ncrozet\ncheongju\niijima\ntreno\ncalabro\nsmalltown\nmacv\nlayover\ncanonry\nniggers\ntuffy\nbulkley\njerri\nconcrète\nfurio\nork\nfossey\ndaming\ndockside\nrecuperated\njeni\norganum\ngmr\nheartaches\nlargent\nknowland\nbost\nsherk\nvoltaic\nharmlessly\nagathon\nvlf\ntrailblazers\nnullity\narz\nchadha\nhemmed\nsbp\nsogo\nserval\nstereolab\nstarnberg\nkitto\nshadab\nkamata\npasko\nshackleford\nvanishingly\natocha\nchaillot\nzoetrope\nnergal\nbrockie\ndawgs\nichthyology\nellyn\nregressed\nvysotsky\nkinescope\nazorean\nfrontlines\ndesiderio\nwildenstein\nemine\nartha\nisca\nsennen\nsakic\nveld\ntfm\nguimard\nnaidoo\nyoshizawa\nsnouted\npalazzi\ncatchings\nhydrogens\nmothering\nrpl\nheslop\nballymoney\nkongens\nadministrating\nfortuny\nbiman\npickler\nchristiansborg\neulogies\npett\narmon\nwipro\nborba\nhinterlands\nhopalong\nrameez\nprinsloo\nportcullis\nradiofrequency\ntila\nmalad\naretz\nimmunocompromised\nanguished\ndieng\ndimensionally\ntitel\nbonnell\nfriendster\nnonresident\nloria\nkamensky\nkonda\ntarkenton\ntabb\nmarginalize\npickling\ngoyer\ndenigrated\negocentric\nunmitigated\nkudarat\ncaley\nregurgitate\nchapeau\nogi\ncalderone\nmenen\nextenuating\nludlam\nmyopic\nmisanthrope\nkogarah\nmészáros\nlaysan\nascona\nsprig\neosinophilic\nslavish\ncarnac\nlogins\nviolists\nwhodunit\nsestak\nschade\nacma\nicn\ngobbi\nmattar\nkose\nmobbing\npigmentosa\nmutawakkil\nradm\ndinaric\nunready\nwnit\nextorted\nwiking\nquién\nrequisites\nunderstorey\ndesensitization\nimpaler\nmenninger\nlilacs\ncasablancas\nbaiju\nnaropa\nchala\nbrimming\ncordoned\nmaheshwari\nmokpo\nemap\njordanians\nsawyers\niseult\nlegitimizing\nguildenstern\nvdm\npeddie\nthawed\nimproprieties\njadran\nideologu
es\nwoeful\narouses\npervaded\nlannes\nnyala\nentrapped\nflavorings\nmudgee\nshunga\nfullscreen\nfofana\nmazumdar\nlapidus\nspacemen\nvenise\nolavi\nretinitis\ndalry\nppf\nkilter\ncbu\nbelay\nmcpartland\nserj\nleth\ngrimstad\neuwe\ndoman\ncorded\nzagato\nbarbee\nbatanes\nkeyless\nmenopausal\nvpp\nyore\nocha\nhennie\nuat\ntruckload\nassemblee\nlpi\ngargantua\nvillupuram\ngest\nserkan\nmpas\nriboflavin\npetrenko\nrambla\nnikolsky\ngex\nstupendous\nneuroticism\nicecaps\ncaenorhabditis\ncassio\nchillin\nartcile\nmondal\nardal\nhte\nputumayo\nnehwal\nplanète\ndavinci\ndadasaheb\ndalley\nbaselines\nsatraps\npanty\ndurack\nroxburghe\nfilippini\nrossville\nlippman\nsedum\nknocker\nducklings\nabdo\nnextgen\nunderdevelopment\nunaccustomed\njuda\nswashbuckling\nknatchbull\nokla\nbridwell\ncloverfield\npoetically\nfsln\ndorrance\nanto\ngeorgiy\nfluidized\nmicroarrays\nbridewell\nconnah\nbobadilla\ncrandon\nwiaa\nhsd\nnjit\nadjuncts\nmazzoni\nfraserburgh\nkapila\nstockpiling\ntrifling\ntheologies\nfikret\nantonini\nols\noversea\nduhok\ncarvin\ndrogue\nstatins\nsenapati\nbelcourt\nselfridges\nbatwoman\nyodel\nlary\npanigrahi\nlatvijas\nguyer\nbickle\nniekerk\nchemotherapeutic\nmusto\ntufton\nmoneys\nvardhan\nraimo\nwanli\noishi\nolindo\nprodrive\nndf\nignis\nbrightside\nhaverstraw\nludlum\nastrup\nnoches\ncoexisting\nhexagram\nunspecific\nconfiscations\nanthroposophy\nvnc\nhie\nbeuningen\nkebbi\nkinu\nfathering\ndhyana\namable\nasamoah\nberit\nleaner\nbiggers\nteaspoon\nconsoled\nepperson\nrevivalism\nbrahmachari\nalvise\nmtd\nsynchromesh\ntrespassers\nburgtheater\nmoolah\nkaidan\nkamma\nsusanto\npteranodon\ntortugas\npek\ngcs\nroarke\nrater\nryall\nberasategui\nwendelin\npalestra\nchengguan\npoof\nviollet\nxbmc\ntolley\nryders\nterminators\nvinko\nhoule\nreabsorption\ntempli\ntopgun\nvintages\nabrantes\nrougeau\nvalleyfield\nsymoné\nlemire\npeewee\nmutua\nharmonised\nmoskovsky\nequites\nbalrog\nmercies\ndefecate\ntackett\nbehrendt\nrisings\nicg\nlúcio\ncautioning\nrowles\ndeel\n
yuto\nvandy\nquixotic\nfeige\nchasma\nsniffer\naerojet\nagam\ndeducing\njabotinsky\nrecheck\npanter\nhofmeister\neddystone\nwcm\nringway\ndisch\nsamanta\ncoffea\nrumbold\nironmen\nchlorate\nrwa\ngores\nconfections\nbry\ngazan\nrazzak\nragging\nstartrek\nworldnetdaily\nconon\ngamarra\nvahl\nwarrensburg\nalys\nbuddhi\nsoreness\nselecta\notay\ndcd\nvesuvio\nrivaldo\nkuka\ndisassociated\ndalberg\nbalaklava\nflanged\ngashi\nmillenia\nparfait\nassunção\narakanese\nkorat\ngloriana\njacquard\ntakahata\nsturmabteilung\ndonets\nsumitra\nazikiwe\nbinoche\ninculcate\nrossen\nneagh\noberkommando\ncorporates\nwhitsunday\nlait\nvaw\nshalhoub\nranh\ntenniel\nnaturales\nzoé\nushio\npellegrin\ncolloquy\nretable\nphilanthropies\nflammability\nfosdick\nkohan\nbirches\ndefinetly\ndenouement\nrickets\nportrush\nartus\nsuzana\nswindell\nmurch\ndabble\nwbs\ngardin\npatos\nananias\nriverbend\nembarassing\nechternach\nbraiding\nvso\nsavimbi\nxray\nlvf\neugenic\nconemaugh\nindescribable\nperspiration\nkurta\nwestborough\nconflates\nhns\nkanchana\nhetch\ndurai\nmoten\nsalento\nhomophonic\ndisruptor\noberheim\ndisparagement\ndeferral\nthatta\nchori\ncastilians\nresorption\ndisbursement\nftr\nvansittart\ninquisitors\nvenusian\ntejo\nchauvinist\nantiguan\nsyon\nbaudrillard\nmisquote\nathletically\nfinis\nmajhi\nborna\ndiscothèque\nuzb\nellensburg\nquadrangles\nromulans\nmladić\nrsaf\nkovalchuk\nthakkar\nmenil\nsarcoidosis\nmeknes\nfiala\ncoffered\nleavened\ntutin\naldenham\nmayra\nmerl\nplumed\nscintillating\naddenda\nglomerular\ncouleurs\nrougemont\nessien\nreconfigurable\ndiplodocus\npakatan\nbenegal\nchakma\nrotundifolia\nkartal\nmoko\nbackpacker\ncorlett\néconomique\nspatula\nsparkassen\nnecrophilia\ntilney\nramji\nnagurski\nmoieties\nderose\nautobahns\ncranborne\nborstal\naventis\nmcmahan\nyatsenyuk\nsteyning\nsoothsayer\nterris\nartfully\nraphoe\ntorv\naguada\ninaugurations\ncrowfoot\nservir\npickersgill\nfayre\nangleterre\nhornsea\nmansiysk\nminicomputers\ndestitution\nxan\nlobb\nscv\nvida
lia\nkoror\nsteenbergen\ncaracara\nprebble\nsempervirens\nrevivalists\nsommelier\nkunieda\npoudre\nwaga\nfinian\nsacagawea\neagleson\ntweezers\nmayville\nkist\nproby\nhelfer\nknuckleball\nposer\nthinkin\ndailykos\nperrett\nkismayo\npcu\nmeran\nmottling\nmadhur\nehl\nmof\nlibéré\nwestford\nichiban\nsarto\ntolyatti\ngeneralisations\nreconnecting\nnavajos\nnewson\npings\nimmobilization\nnnc\nfinnie\nlamberton\nbridgeheads\ncatatonia\ntinplate\nkomitas\nbirchwood\ngibbet\nayalon\nventilators\nassiduously\nryota\nahwaz\nstolt\npavonia\npolsce\nmelds\nsnakehead\nnandita\npobre\nluann\nolympias\nblackstar\neusébio\namoruso\ngnss\npolygynous\nmalthusian\ngiftedness\ndisclaimed\nnajjar\nprimorye\noatley\ntcd\ntricksters\ntoure\nmohyla\nwissahickon\nsargassum\ncria\nantonietta\nvineet\nchinois\nbritz\nmcgoldrick\nkhara\njaffee\narticel\ndalen\nheartwood\nbobbitt\nsaldívar\noffroad\ngoggin\nlavochkin\nfenestra\ngrammofon\nmaestros\nkingscote\nhayling\nkindi\ntrounced\nschermerhorn\ndimapur\nsupercluster\nbaldrick\nconspires\ndrukpa\nquadro\noulun\nolim\nsharky\ncrunching\nbluebook\nshrigley\nterhune\nimaginaire\nbillerica\nkalou\nnatta\nhueneme\nkrakauer\nlambourn\nshuffles\nmuntjac\nclamor\nbedoya\nslaughterhouses\nwiretaps\nsalzwedel\nswara\nrenminbi\ncapstan\npopularising\nmilas\nzuniga\nvoiding\nfiumicino\nbrande\nhalpert\ndioxins\nblinn\nlatinoamericano\nantivenom\nalben\nfnb\ndebartolo\nparasitoids\nokavango\ndocumentarian\nsilkworms\nanirudh\nhoras\nilmari\neptesicus\nnewnan\nwigley\nmcivor\nmoneta\naliquot\nholub\nhandoff\ngranholm\nliteraria\ngode\nanjo\nstratemeyer\nfundo\norgasms\nsubcamp\ndomani\nqst\ncumhuriyet\nmcgrew\nshamsul\nrevelry\nvonn\ntria\ngingival\ngamsakhurdia\nconstrains\nnma\ncorbelled\ndsr\nmego\nghai\nslumping\ndivestiture\ncircumnavigated\ngordons\ntaq\nthickens\nguisborough\nanagrams\nbridgford\nmagdala\noverconfident\nqueda\njackfruit\nraspy\nvelimir\nharriot\ngeils\namex\ntinkler\ncig\nmuthuraman\nbanffshire\ndeflects\nrepurchased\ndiomede\ntop
ographically\ntinley\ncahors\nisler\nalea\neducative\nblish\nsejanus\npronounciation\nsalviati\nagde\nciprian\nredbird\nsanae\nguayama\ntangiers\nresto\ndhekelia\nnelspruit\nedmundson\nstainer\nceann\nfarmhand\nrotter\ncounterfeiters\npanamerican\nhmi\nprashanth\nbordentown\nholte\nrailtrack\nbehaviorally\nhartington\nmediterranea\nfastidious\nhighveld\nneudorf\nwilles\nramillies\nhersch\ncarrousel\nopenid\nthulin\negede\nengelbart\nalesha\nnewtownards\ncolline\nboddington\nbrahimi\nmayers\neuroparl\nsifaka\nashbery\ntypographers\ntoasting\nduesenberg\nadélie\ndatabank\nmarroquin\nghedin\novereem\nmatto\nupfield\nrindt\nburkholderia\nfumiko\nccdc\neielson\nrfe\nakil\nkbr\nkapa\nsnarl\nlautaro\npastored\nsensitively\ncuyler\ntharsis\nrtve\namphipods\ncassels\nsafire\nmatchstick\ndivi\nlarchmont\ncesi\ncepheid\ntransaminase\nbareback\nritenour\nzellers\ntabak\nwahhab\nmeakin\nskywalk\nparivar\nepos\nslaughters\nsafehouse\nchandy\nbrander\nanata\ntimaeus\noutplayed\nreymond\nberceuse\ndobkin\nkavli\nbhandara\nbotelho\nuncouth\nhandkerchiefs\nglavine\nhakam\nschwitters\nparing\navocados\nageha\nradziwill\nmorganton\ngleam\nlycian\npussycats\ncrd\nyogis\ngérôme\narbitrations\nnorwell\ncarnaby\nmatruh\nattired\nhamsa\ngunns\ndearing\nenge\ncentimes\nwever\nandriessen\netsuko\nabiola\nhistorisches\ntatami\nfito\ndecorates\nskydiver\nablution\nnuala\netre\nssds\nunloved\nashkenazy\nimr\nmagnes\nwcvb\ndonen\nmoorpark\nscriptwriting\nposco\nencyclopedist\ndiw\nskuas\nlabo\nenquired\nhirokazu\nbarrière\ntremblant\nchestertown\nsleuthing\nmarienburg\nhackberry\nrapeseed\ndefiled\nouterbridge\nmelania\ngavrilov\ngroupement\ndishonored\nmagnon\nrega\nnomos\npatta\nsoggy\nlvt\nnabeel\nlorimar\neep\nhindquarters\nturkmens\ncheboksary\nkatsuragi\nmbo\ntimeliness\nunkle\nasymmetrically\nhalonen\ncurlews\nhagiwara\ncahoon\nhace\nolkusz\nkabi\nanglosphere\npeniche\nandreessen\ndevlet\nfrancesa\nrastafarian\noci\ncathie\nlemass\npeniel\nparenteral\nmargarito\nwingo\nrandhawa\nsalley\nja
que\ngalsworthy\ngouin\nsubsidizing\nargenta\nwashy\nhandrail\ncamerawork\nnolen\nsprinkles\npolitkovskaya\nshawty\nmaol\nconnote\nzebu\ndespués\nkhabar\npiyush\ntenzing\nkuyper\ngrammaticus\ncorbel\npacem\nlibourne\nrecitatives\npartch\ngoncharov\nyaba\nsimek\nknutsen\nselflessness\ndriveshaft\nrenegotiated\nanais\nedwyn\nyellowcard\nhkfa\nphilipsburg\nenamels\ngeneralship\nramm\nkbp\nveneziano\nscheele\nxichuan\nhybridize\nwashout\nescalade\nhedman\nhongo\nprespa\nphuoc\nproliferative\nmerce\nmoldy\naerated\nvpc\nballer\nheyford\nntpc\ngampaha\nintan\nbergonzi\ncheesman\nlindblom\nbasswood\njenssen\npcia\nyellen\ngifting\nshes\njocular\ntiwi\nnewyork\ngbm\nadin\nspaatz\nschnittke\nscrutinizing\nlavaca\nquizzing\nfunakoshi\nsandbanks\ngatlinburg\nconnex\nguaymas\nelearning\ntummy\nunneccessary\ncherubs\nhamburgo\nintercellular\nvario\nyifan\nmurtha\nnsx\nchattering\nnovecento\nstoics\nvirtuosi\nbadar\nunsere\nwomersley\nkuroshio\nlmi\ncuteness\nsinusitis\nreductionism\nmagali\nmyong\naspirational\ntitchmarsh\nwairoa\nextorting\nodot\ncryptology\ndulcie\nlorn\nmarkel\ndelhomme\nmashpee\nfossum\nrosselli\ngrudging\nbezel\ndemocratisation\nquarrymen\nunreservedly\nmassapequa\nbaltacha\nconman\npescarolo\nmalvasia\nbroadwell\nupolu\nmcmath\npocomoke\ncultists\nmuzong\npinheiros\nanapa\ndeconstruct\ntecnologia\nrigoberto\ngronkowski\ndionysos\ncambyses\nprb\nguanosine\nermanno\nulsrud\nteledyne\nhashimi\nrend\nkinderhook\nwilliamsville\ngorna\npallett\ndemeo\ncolumbanus\nwiel\neinsatzgruppe\nrolo\ndiscriminates\nlaren\nbroadcasted\nsorter\ndagen\nfuera\nfcp\nproblema\nwellspring\nbitar\ngalashiels\nendosperm\nskagerrak\ngrapevines\nplowright\nmaloof\npmd\nhitam\nforeclosures\ncharam\nsugiura\nkoraput\nflashdance\ndesegregated\nrollings\nrevaluation\ntrenchant\nlaurance\nunfathomable\njussieu\nsfm\ncud\nclaustrophobia\nhsia\nkerby\nharriette\npontes\npromethean\nleck\nfrayed\ntruecrypt\nbartending\nstong\npermanganate\nshaye\nkaza\npeste\nturbos\nbahmani\nknol\nsorabji\n
beatdown\ninternets\nkhokhar\nrickles\nbluster\nnordahl\neparchies\nmukhrani\nironbridge\nunrefined\nryedale\nmerchiston\nheartwarming\nlaxton\nvanderburgh\nhipp\nlitho\nhaemophilus\ntich\namarante\njocelyne\ndmi\nlaoag\nivanka\nwehr\nballe\nphoenixville\nstruan\nlongreach\nboneyard\ndisko\nvolz\ntichy\ninternalization\ngmd\nnador\nhoda\nstopgap\nrajaratnam\nromane\nuruguayans\npedder\ntiso\nfef\nfearlessly\njone\nabg\nkimberlite\nkilner\nforebrain\nhuzur\ncompaoré\ntoyed\nhaza\nbilkent\nbuffoon\nfaze\nellsberg\ncreasy\nserenades\ncochineal\nmct\nchickpea\ngonadal\npointlessly\nsignalized\ngranduncle\nfolgore\ndenatured\nwyle\npotpourri\nsanath\nchoix\nagneta\nabax\ntwink\nraynaud\nmajesté\nesdras\natri\nhuizhou\nmoule\nmichon\ntsetse\ntta\npopovic\nisac\nrubbery\ncwmbran\njubei\noverruling\nsauerland\ndubiously\ndogtown\nradke\nfelsic\nsizzle\nkarger\nsprockets\nmadhopur\nsmyslov\ntripwire\nvion\nfurey\nshowband\npanamax\nengulfing\nwallflowers\ntyszkiewicz\nsofas\nguba\nkcs\ncouper\nyellowtail\ntoasters\nmathiesen\nmeighan\nreorientation\nveggies\nranvir\nmacdonough\npamphleteer\nyaroslavsky\nvihear\nqiqihar\nimperceptible\npothole\nearthlings\nbeenie\nenim\njms\ndampening\nciampi\nmoveon\npersonne\nmizzen\nprovincials\nanthologist\nnunzio\nczink\ncubby\nbykov\nvelebit\ntaxonomically\nwithrow\npagnell\nhoxie\nwaghorn\ncawnpore\ncgc\nheckman\nhetchy\nartcle\nmorpurgo\nglanford\netonians\nshined\nnavistar\nelbaradei\ndonegall\nservicemembers\njackrabbits\nkocsis\nramayan\nwhereafter\ngreb\nbhagwat\nwheatcroft\nmody\nopenvms\nlall\nprajapati\nndiaye\nagoura\nberridge\nsecombe\nparang\nfulco\nscoresheet\nstrunk\ntorturer\nvache\nkss\nomx\ndunkley\ndannie\nmaija\npleurisy\ndinsdale\npericardial\nphotoreceptors\nstaël\nmylar\ncjr\nmisleads\ndevry\ngiblin\nricerche\nssri\nkanchan\ngarrigue\nlasius\ncasse\nzaandam\nzauberflöte\nharlock\ncarrel\nscaffolds\ntigra\nhrant\nemanations\nwhitsun\nabebooks\ninaudible\nkupang\npomme\nfasciculus\nseve\nboudoir\ngervasio\ntakeshita\
nerrázuriz\nparaguayans\nison\nhappel\nlll\nkeigo\nnosewheel\nellingham\nstorace\nnozick\narpad\nunavoidably\nquiroz\naln\naubyn\nfaunas\nriffing\nstrayer\nfais\nraggedy\narbitral\nbrianti\nbühler\nmulling\ntitre\nduodenal\nrugg\npernod\nchano\nwags\ntuohy\nstuckists\nwcau\npechora\niraj\nsanding\njwala\nprodder\nnishizawa\nunibody\nethnologists\nbojangles\namants\nsehr\neverts\ndestructor\nredmayne\nnanako\noverhear\nvaillancourt\nscreensaver\nkhiri\nbangsamoro\nwhisk\ncadore\ntmb\nrectifiers\nwarrack\nloughran\nmemoires\nirresponsibility\npii\nmlt\nseldes\nkilbirnie\nhotta\nmcchesney\nchicha\nponton\nstrikethrough\nschneier\nmarinha\ngrafen\nartnews\nlih\ntickling\npict\napac\nlenora\nbhutia\ngaim\nseverstal\noberg\nspaak\nferraz\nstrangeways\ndeaton\nletterhead\nguattari\notsu\nscarfe\ndruggist\nkumaris\nsputtering\nsyrinx\nnagasawa\nfromberg\nwom\ncatamarans\nstian\nkuybyshev\nrobarts\ncrocodilian\nhyattsville\narnor\ninfotech\namia\ntuneful\ninstate\nhansom\nbabatunde\nsteins\nwhitecross\ncampesinos\nbeos\nsangro\nstrydom\nculm\nparsippany\npdx\ninsolent\nyoshihisa\nsegarra\ntomahawks\nencinitas\nvictorio\nmehndi\nbioterrorism\ncronyism\nsegar\ncravens\ndeven\ndishpan\nburd\nackles\nwindowless\norlovsky\nmontagna\ncrossbreeding\nqeshm\nmerrimac\nkaveh\noon\nslurred\ngmg\nblacksmithing\nfyffe\nliebling\nlanky\nsagen\nbeel\ninvitee\nherve\nunodc\nkoeman\nmechanicals\nranulph\nemedicine\nchazz\nantisemite\nnovotny\ngassman\nsparkles\nsidorova\nbik\nmcaleer\ncătălin\nkádár\nabolishes\nadventureland\nlorton\ngoering\ncolloids\ngiuseppina\nmasturbate\naccelerometers\nbhattacharjee\nnuove\narabist\ncpv\npolycystic\ndegenerating\nheightening\ngravatt\ntartus\nharvin\nbleek\nsteepness\npeopled\nanticholinergic\nhamble\nbartz\nstopes\nefs\nmbm\nvoie\ngoyder\npamunkey\ncounterargument\nsimony\naksum\nlentini\npixelated\neoka\njolyon\ntabora\ntaxidermist\ninscrutable\nhessel\nfoxhound\nfreycinet\nballplayers\nsymbionts\nrotman\ncordy\npanellists\ntwitching\nnisga\nbuescher
\nhanako\nkommissar\npinnipeds\nmable\nfount\nick\nmcnish\nfibrinogen\nalghero\ndfm\navonlea\nessar\nsnmp\nbotulism\ntaga\nphosphodiesterase\nfrustrates\ndmso\nelliston\nalconbury\nmittag\ndethklok\nmarijan\nbaley\nlucious\npelargonium\nspiteri\nglobulin\nbangers\nshelagh\nmonclova\nflexibly\npernambucano\ncapela\nmangalam\nobstetrical\ndeprives\noptimizer\nmillsaps\nangello\npintor\ndpt\nmonuc\nmehldau\nblancas\nflaminia\nkanai\nbollard\nnobuhiro\ntrong\nbenioff\naee\narruda\nplads\nings\nsundials\nboeuf\nafan\ngres\nnottoway\nmédico\ncalcified\nrfs\ngrabar\narmonk\ngrappler\nmanfredini\nswd\nbarcroft\ndjebel\nlandranger\nwindowed\njdm\nabsenteeism\ncompacta\npoutchek\nchugach\njokinen\nzeeshan\nplowden\nstazione\ninnately\nncdc\nhauptschule\n,or\nagius\nmyocardium\ngars\nbonomi\ndanco\ndowler\ngeometers\nunconscionable\nbourton\ngulistan\ntweeddale\nackley\nheckscher\nnunnally\nsohar\npinwheel\ncoldfusion\ngrubs\nxiaoming\nlaface\nacrylics\nasolo\ncontraindications\nsummerlin\nstarchy\nledo\nwjc\nduz\ncervi\nportneuf\nyukos\npresa\nfom\nblest\nbalderdash\nthicken\nscènes\nving\nmikveh\nizzet\nbirkdale\nrano\nstarches\npeptidoglycan\nlavi\naamer\ndigressions\nphantasmagoria\nattainments\nhavelange\ndamone\nfant\njawan\nbedlington\nkakamega\nhefti\njilly\nrinchen\ngluons\nfiliberto\nfábrica\npum\nstross\nkohut\nlulin\ntimepieces\nkothi\ncalamagrostis\npdk\npoer\naccumbens\nrahe\neastwest\nkurniawan\nfrodsham\nottery\nbangura\nkilkelly\nvash\njaurès\nmonteagle\nmoroz\nsumiyoshi\nmethotrexate\nbustin\nhopscotch\npfs\npreeminence\nmedicinally\ndithering\nsmartcard\ntarka\nbiopharmaceutical\nhbr\nvau\nhellespont\nsixths\nlowbrow\nhartsville\ntimepiece\ncbsa\nhousemaster\ndispossession\nsarnath\nsmf\njärvi\noriskany\nheloise\nfoibles\nspyglass\nlaterza\nexpellees\nlinoleic\nbalcon\nwbf\nmyhre\ndaudet\noncologists\ntenenbaum\nkheda\nspeidel\nvltava\nmpo\nkraut\ngelato\nconceptualizing\nbarrell\nphosphorylase\nfrankness\nrecordist\ndésir\nbeachwood\ncosi\ntemur\nsedgefield
\nnannies\npulps\nhelianthus\ndelbrück\nwhr\nakr\npresstv\nseaweeds\nkoolhaas\nsahibzada\nrozas\ncryptically\ngracey\nsinglish\ngenii\nbabul\nshoguns\nthirlwell\nfarmworkers\nkirtley\nnique\namazes\nwelkom\npontine\npurina\nossetians\nshanachie\nmelek\ndamask\nunwatched\ntolhurst\nparsis\nkokkola\nsophistry\nbunkhouse\nyyy\nrusalka\nteesdale\nmista\nsabiha\npgr\nretinol\nakl\nairwolf\nusmani\nhyphenate\nparodic\ntrigon\nkaisar\ntoxicological\nkrist\nputintseva\nilene\ntrickett\ngca\nbohumil\nmbk\narnheim\nchappaqua\ncrary\ncolorblind\ncmx\nrevoir\noakeshott\nahafo\nhoneyman\nrealignments\nnestling\ncerebrovascular\nexpressionistic\nblotting\ndizzying\nagilis\nsanja\ncurium\nhazaras\nkilinochchi\ndelineates\nbenzoic\nduelist\nbechstein\nonishi\nappling\nflinging\nluiza\ntalang\npasteurized\nblatent\nartest\nquiapo\nlonny\nkyon\nevangelizing\ntroyer\nsarr\nzeenat\nedgemont\nsoirée\nfratello\nrion\nodorless\ndoshi\ncitylink\nbraincase\nimpactful\npichler\noarsmen\nstanstead\nstryper\njawbreaker\nmcquarrie\nbeeps\nrusi\nkik\nconchords\notaki\nglenister\ndusters\nhysterically\nunsinkable\nenthralling\ntelepresence\ndannenberg\nbarral\ndombrowski\ngeochronology\nchaturthi\nhusqvarna\nhomespun\nconaway\ndemian\nmimas\nnetz\nfabra\nthorin\nlahar\nvikernes\nkilotons\npoisonings\ncowden\nwinnsboro\ndopey\noverpressure\nstagings\nboned\nfogle\nnieuwsblad\ntalaat\nguianan\ndownland\nstarck\npostmark\nvasculitis\ndsd\nodhiambo\ndinan\nhijazi\nanemic\nanzeiger\nbbn\njiaotong\nforelimb\nwebmd\nseperated\ncred\nwpf\ngreenhills\ndrumm\nexil\ncardholders\noverstate\nsecuritization\nmaroubra\nlarkins\nwcco\nbch\nsuppressors\nhollings\npetteri\nhansford\nflg\nmenshevik\narafura\nscena\nbgs\naww\nwaldrop\nbaskakov\narmi\nparkhead\nnickolas\nthibaud\nmotored\nmagers\ndalhart\ndaguerre\nmorlaix\nteknik\npeur\ncrossland\nincontrovertible\npachinko\npmt\nkhenpo\ncholula\npinzón\ntimi\nsouthborough\nproceso\ngourock\ntopside\ngeto\nmummification\nmelik\nadham\nasunder\nlisteria\nhalbert\nten
sioned\nbiermann\nfélicien\nwooed\nkalyanpur\nvergeer\nmirabella\nimperialistic\nbelmar\nanalects\nknotweed\nlobsang\nsexsmith\nsargis\ncadwalader\nzad\nsnowmen\npombo\ngranulated\nsprees\ndefray\nmultipoint\nairdates\nregrow\nchanced\npcpro\nhuemer\nspacers\ntane\nshakya\nteme\ndobbie\nchargeable\nyozgat\npantin\nawaji\nlgbti\npanjabi\naustereo\nnullifying\ncraziest\nmôn\nmouche\njetliners\nbdf\nprofesor\nchurubusco\nvasto\niae\nbrzezicki\ndisparagingly\njoomla\nspendthrift\nserafini\npif\nexhortations\nchunnam\njeopardizing\npoy\ninterferences\nbentonite\ncoletti\narkadelphia\ndeodorant\nmontoneros\nclassico\nlamon\nkurtosis\ntrodden\nvischer\nflexner\nzapruder\nrato\nbaiji\ncriminalizing\nlapid\nscrapbooks\nwhorehouse\nbaguette\ndepute\nrymer\nborglum\nprocrastination\nplatonov\nsellar\ncassowary\npcworld\ncantonments\naggro\nslamdance\npasley\nwillimantic\npalmira\nswiftsure\ngiveaways\ngolkar\nminette\ncompulsorily\nuninfected\nharryhausen\nmicroeconomic\nush\nmccomas\nwaypoint\ncni\nenabler\nmonsoonal\nmexicanos\npolycrystalline\nspaceshipone\nstk\nclutha\nimboden\ncogito\nmotherless\nyivo\ndorrien\nmwh\nkoman\ncatoctin\nmulford\noldřich\nwachusett\nninfa\nbary\nneurotrophic\nbostonian\nmacquarrie\ngrasmere\ntriathlons\npayot\nverdens\njesmond\nyoure\ngordan\ngalera\nawesomeness\ngabin\nburgon\nbanjar\nrahmani\nvladimirovna\nbedard\nboogeyman\npreity\ninverurie\nexoneration\ndinklage\nlabyrinths\ndls\nfinnerty\nzari\nbasson\ndfid\nflapjack\ndacron\nhawksmoor\noyu\nvasodilation\ndvs\npiva\nfumbling\nschrank\nzimbabweans\neira\nhause\ncelebrants\npab\nghazan\ncrossbill\nhydrotherapy\nwergeland\ntaurasi\ngwb\nmaybury\nrandazzo\ndiabaté\npaean\nquadraphonic\nmasekela\nsiwa\nadrianus\npotentiation\nmédicis\nmorikawa\nlestrade\nlycosa\nisk\nespagnole\nlexmark\nshortline\novations\nromi\nsohu\nreek\nvahan\nrunnings\nhijikata\nzetland\nwymondham\nmillican\npuncher\nseasiders\nsystematized\nmeb\nbunyip\nkfi\nvins\nmostert\ngfs\nagis\nbowral\nsyllabi\nwoolton\nranko\nsv
etoslav\ngigg\nwashbrook\ndeforested\nergot\nzirconia\nashen\nfantine\ndamodaran\nwca\nphonebook\ntoreros\ntdrs\nweyer\nhumanely\nhemophilia\nrechts\nchrisman\nleighlin\nstompin\nlarwood\nrediculous\ntando\nidealists\ngagan\nurbis\njcc\nbahir\nkeepsake\nneoprene\nsahyadri\ntiesto\ngamekeeper\nhormel\nbhatta\nheins\nzapf\nrir\nrevelers\nplena\nmurkowski\nroasters\ncoherency\nurological\nlancastrians\nnoontime\nshekel\nmuma\nchalked\nagawam\novale\nrupturing\nwickramasinghe\ncoggeshall\nsku\ntsipras\nbuttered\naxs\nflaminio\nlieve\nvolcker\nunvoiced\nnyeri\nbandi\noirats\nkeepin\nlinksys\nbaramulla\npapillomavirus\ngleiberman\nselvam\nprophesy\npaez\nindoctrinated\namitabha\nagw\nmatrons\nwickens\nconfocal\nveux\noprandi\nsule\ncesspool\npaka\nzaïre\nstakeout\nslt\namitié\ntopsham\nalfonzo\nprosopis\nserapis\nboötes\nhangal\nescovedo\ntilbrook\nmidtempo\nterres\ndongle\ninbetweeners\nlacerta\ndromedary\ngissing\ncorum\nsukanya\nlangerhans\njatropha\nrubrum\nduf\nmadrassa\ngallas\ncauty\nsukma\ncanvasses\nssx\nbloggs\nrothery\nbackyards\nbonington\ntawfiq\nspader\nforeshadows\nfilet\nautostrada\nreciprocates\nrefurbishments\nyorck\nlafont\nfaxes\nuemura\ngoodwyn\nfoolproof\npallbearers\njellinek\nsmale\ntshwane\ncalista\nhistorien\nqueene\nhibernating\npulo\nkappel\ncoruscant\nchimay\nsteerage\nunprovable\nmarinetti\nstompers\nbening\ncannan\nzeev\nhallucinogens\nmalting\nprovis\ngregorius\nblackhall\ntrezeguet\nshermans\nmccallister\nnanyue\ncrusts\nmre\nwireline\ncastella\nsandip\nleptons\ninexorably\nmcfall\nazizi\nkamau\nbamburgh\ntembo\ndankworth\nmiscalculation\nheadmen\ncermak\nstanger\nsandlot\nbengtson\ndesegregate\nmalá\ndesborough\npyrus\nfeedbacks\nakeem\ncynically\ncdb\ncorroborates\nschwabe\nsabarimala\nhls\nnailers\nphils\ncalbi\nremapped\nkompas\nboadicea\ndermatologists\ndlf\ninsignias\nguillemot\nspens\nlordly\nbammer\nspss\nclémentine\nibd\nskive\nparmesan\nmirkin\nbacteriophages\nshakespearian\nmookie\nragland\nplexiglas\nmutare\nwonju\nwpm\ncolden\
ntunisians\ntorun\nmahela\nvasculature\nabridgment\nmonette\nwishy\nwaggon\nlooker\npks\nbadin\ndriveways\ntrepidation\nmahabalipuram\ngrads\nfatu\ntipi\ncrossfit\nmanoeuvrability\nrighthanded\nsatanists\nansa\nstanwell\nimputation\nmercyme\nshearers\ngiese\nhosono\nlohman\nctw\nhankinson\ndoba\nbyington\nretroviral\nbriançon\nwojtek\npru\nlukens\nparachutists\nhymne\nlenr\npatinkin\nxylene\ndryland\nchelo\nnardo\ntaiba\nmorahan\nsett\ncurbside\nedita\nanonyme\ntody\nwolpe\nlvmh\nmosfilm\nhibbing\nlema\nberni\nnakba\nauctioning\nwauwatosa\ntempesta\nballinasloe\nunseating\nbaraboo\nfossett\nandromache\nprm\nkwesi\nnadiya\nbrasses\njiaji\nreconnects\nnewpage\nstoat\nhyoid\nrousset\nchemotaxis\nthreshhold\nsavalas\nunnerving\nhohmann\ntschudi\nharpsichords\nthermals\nsimonetti\nbeagles\nsteffy\nhunterian\npriaulx\nreto\nravin\nclee\nabetted\ngraha\ndisley\nprothero\nadulation\ngilling\ncleanups\nkring\nhaining\nmadhusudan\nehs\ngatica\nflorea\nstith\nunraced\nbotrytis\ndemoed\nlevar\nhorrorcore\njarlsberg\ngranton\nmarginatus\nasclepias\njibe\nbastar\nineffable\nkahana\nkenzie\ngranges\nmazama\nklint\nfredo\njaron\nqand\nees\nspect\nevelyne\nmahratta\ndelmore\nhyrax\nmagmas\ncoggins\nwraiths\nlowman\nimpériale\nankur\ncahoots\nbrl\nlussier\nlorillard\nkva\nmelly\ngranda\nnotas\nbisecting\nstijl\nwabasha\nruabon\nsaunier\nmelksham\nmurai\nfrimley\nbayerischen\nmckittrick\nanadarko\njusta\ndorris\nkashiwagi\nhoeven\nomon\nbrum\npitino\ninveterate\neverclear\nboito\nreapplied\nnorbu\npietri\nbracco\nbemoaned\nbrückner\ndek\nteltow\ntella\nkocaelispor\ngiuffre\nhillerman\ngoogie\nmarketwatch\npreferment\nsilencers\nucas\nunderperforming\ncampsie\nmellin\nfnac\nmessinger\nsusman\nclopton\nenergysolutions\nfullerene\nmobilising\nshishi\ntippin\nborch\nhornpipe\nponsana\nnanci\ncarlene\nnieve\ncercopithecus\nhalverson\ncamargue\nbinning\nsammamish\nmomin\nabdalla\nesmé\nseatbelts\nelita\nchoreographing\nyeshua\nnassim\ncosmogony\nspooked\ncliched\nsaramago\nredesigns\nobviat
e\nsexploitation\nchetham\noptometrists\nchlorite\nmeshed\nborchardt\nranke\nfaron\nadio\nsalida\nokoye\npriestfield\nbonville\ndonley\nunpolished\nwegman\nhyer\nsasser\nhonan\nvoyeurism\nnarathiwat\ntrib\nconniff\nvereeniging\nvictrix\nmaldive\ngri\nmukai\npolonsky\nbeys\nphaedrus\nsegregating\nherpetological\nathy\ngujar\nconversed\nkaikan\nearphones\npallidum\nstelmach\nthwarts\nrumen\ntyke\ngambon\nmattersburg\nbrakeman\nmayfly\nlegionella\nborde\nxcel\nvallenato\nparsimony\nsetlists\ncommun\nsorley\nenergi\nresuscitated\ncnidarians\nprofundis\nardeshir\notitis\nmaghrebi\nchêne\nbariatric\ncatbird\nbris\nmadd\nkhote\nperley\npldt\niphigénie\nubiquitously\nhaie\ntenon\nalicja\nobliteration\nnicotiana\ntimeshare\nuam\nbarboza\narmaan\nmylne\nptb\nchakravarty\nbugti\ndease\nsophists\nböll\nrvr\nbelyayev\nmtor\niberians\nhomesteaded\nreister\nliberdade\nrhinoceroses\ntestudo\npapworth\nromm\nmrd\nolfaction\nchalke\nmcclane\nbeausejour\njibril\nmaicon\npaperboy\nstepford\ngabbard\nadmonishing\namargosa\nhohner\nemmental\nfic\nhextall\nmummers\natmel\niginla\nmisinterprets\nwolpert\nradioman\nretroviruses\nholidaymakers\ndecarlo\nembu\nbracebridge\noctopussy\narends\ntownsman\nvereen\nbleus\ninconsiderate\njamin\nbosio\npolitiken\nplass\noboists\nbipasha\nsolanas\ngephardt\nsabbagh\nmonis\nanther\nbeachcomber\nritson\nmaskelyne\ninimical\nactivators\nelrond\nhabibie\nsplashdown\nanneliese\nmccausland\nmeiotic\nmtskheta\ntursiops\nweal\nbournville\nporco\nhance\nnighy\nparadies\nhumanized\nkaikoura\narmentières\nskimmers\ntrucked\nguilbert\nbretherton\nmcsorley\nnalin\nbalcombe\nmalindi\nperles\nsukkah\ncoulton\nsemaine\nbianchini\nretrovirus\nganapathy\ngourley\ntsl\nritualized\nductility\nchacko\nshackle\norcutt\nwhiskies\nnitti\npumper\nplaskett\nnobbs\ntarbert\nesen\nrenda\nreavis\nheckled\nharsin\nciba\nshakespear\nlittell\nripoll\nfushun\ncockfighting\npengfei\ncinquefoil\nhighbrow\nscribbled\nborsa\nglob\nsiward\ndaei\nlti\ncsic\nlfs\ntard\nlubbers\nspital\nanah
uac\ngiridih\nschlick\nadmonishes\nrecordable\nhumala\nchauncy\nruh\nentrained\nffl\nshawna\nsimorgh\nwaaf\nbotley\nparaphilia\nandolan\nantagonized\nspouts\ntienen\ncryptid\ndungy\nciu\nbelittled\nenes\nnfu\nswingle\ntsering\npiqua\nhoadley\nklugman\nreevaluation\ncdg\npining\nsaldaña\ngraveney\nprovan\nveneno\ncapitata\ntharu\ndumbing\nvoormann\nstarrs\nhellen\nmajesties\nhonam\nthroop\npeeter\nminya\nrequestor\nmansoura\nsmoltz\nsilverwood\nberardi\npershore\ncrts\ntukwila\nrevitalised\nrayford\nrifkind\natopic\nfaustian\naussi\nalessia\nrocketdyne\nbanai\nnypost\ntorgau\nvacuole\nmargulis\nfreelanced\nhea\nasanas\nsatterthwaite\ncornuta\nwhaddon\nprisma\njolin\ncienega\ngavotte\nvaramin\nflashforward\nheysham\nfedele\nroon\niiib\nvía\neiland\ntansy\ntiredness\nroxborough\nraynal\ncuarón\nineligibility\ncloe\nosmanabad\ncannula\nriehl\nsylmar\npoppea\njefford\nmceachern\ninglourious\nsecessionists\nsut\narunachalam\nianto\nkosa\nrenhe\nnwfp\notoe\nburlap\njatiya\nequanimity\nserai\npolyana\nsulochana\nnovopolotsk\nchirping\nnajafi\narmadas\nradiologic\nparroting\narmillary\ntelecasting\nstashed\nneapolis\ndaumier\nreeled\ngier\natresia\ngrs\nmenier\nindividuation\ndighton\nmcbrayer\nrobat\nbewley\nskolnick\nvaline\nabominations\ndoumbia\nrutte\nouroboros\nnati\nperineal\nwerle\nirretrievably\ntiao\nfellner\nmutha\nsubdomains\ncycads\nkisco\nvaunted\nkalish\nmarchal\npragmatist\ntransphobia\nbif\nrecolored\npyeongchang\ndependants\ncómo\ncoriacea\nseeping\nporritt\nlff\nsimmered\nbrideshead\ntrumper\nfishponds\nvernor\nrossano\ncrocodylus\ndimm\nohashi\ntepic\nwaf\nzeitlin\negoist\nkosaka\nmadrasas\nmckinlay\npoel\ntpf\nurbanus\ntycoons\nstereographic\nembeds\nocta\ndelafield\ninagaki\nzapatistas\nwhakatane\nnewdigate\nderren\nunmik\nborehamwood\nchhaya\ninflicts\nsadik\nshuttling\nmetropolia\nrapaport\npolyptych\nkitaro\ndjibril\ntrialed\nmicrobrewery\nmiseries\nnamu\nechocardiography\nterabytes\ncochem\nagudath\nwanless\nshymkent\nrans\nhyla\nrol\nsabella\nriski
er\npiton\narmisen\nrajinder\nmache\nzittau\nhornbills\ntibialis\nmoluccan\nhironobu\ncopsey\nmilb\nlogis\ncarthusians\ntransact\nbenjie\nkrasniqi\nturnstone\njeepney\ngulick\nalighting\nupwardly\nshae\neitel\nifad\nhighfields\npiedade\nschuh\nkaesong\nsfar\njaak\nohta\nolu\nlenni\nhanwha\nsterol\nassemblers\npuede\ndhara\nnoumea\nwsf\nherodes\nwandel\nnimba\nstéfano\ncooksey\nnaseer\nrazzle\nimmobilize\noja\ncasterman\nlipp\nhabersham\nsnetterton\nvancomycin\nquadrophenia\ndmr\nhildegarde\negr\nlefkowitz\nscenography\nwaveguides\nsegesta\nguignol\nwouters\nboricua\nslat\ncosmologies\nmicrofiche\nconstructionism\nshrouds\ncaleta\ncorporis\nflecks\nunwary\ntulun\nkoy\nmelamed\ntaciturn\npawnbroker\nbeaters\nthys\ncosmopolitanism\ncherkasov\nspeleological\nlba\nkahnawake\nsandwiching\nbaxendale\ndrafters\nvales\naccordionists\nadib\nrenta\nferociously\nkozlova\naval\nkodachrome\nberrios\nkelsen\ndilys\nreabsorbed\nmurmurs\ndamion\ncakobau\nunaids\ntheos\nsuffragans\nleguizamo\nsavery\ngiolitti\nlatymer\ntarsier\nbienne\nneumünster\nfezzan\nbrookmeyer\nlnc\nfourche\nkingbird\nwoodcarver\nteleporter\nreplicators\nforesees\ncrieff\nqueensboro\ndde\narlon\nddm\nguilfoyle\npolan\npeete\naua\numwa\npaten\nthumbelina\npersis\nturd\ncounteracting\nmusil\npatrón\nselinsgrove\nloaning\nbhumi\nlegislations\nbertuzzi\ndalma\nmize\nkosar\npitviper\ntijuca\nrashida\nquinones\nmacnee\ncapybara\ndescriptively\ncornbread\ndoonesbury\noumar\nwatchmakers\nmedhurst\nnsi\nneorealism\neua\nervine\naffectation\nxiaoyu\nfenland\nbetawi\nambleside\nlinacre\ncazorla\nleibovitz\ningot\nicknield\nerlend\nfeltrinelli\ndeservedly\nshoshana\ninglés\nupdraft\nseefeld\ngabaldon\ndemy\nlavine\nbellavista\nbidi\nkundan\nflouting\nnaman\nferch\ntaht\nnordmann\npolyclinic\nbackslash\ngoatee\nblacklock\nkyong\nalaskans\nsary\ntranscendentalism\nayumu\nbacilli\nsugiarto\ncultivable\nmotoko\nkitna\nlampedusa\nsuchlike\nplectrum\nblanka\nipek\nvalmont\neladio\nrepos\nhobsbawm\nbeno\nsingtel\nmisappropriated\
nchungking\nankole\ndahi\nducked\normandy\nspanoulis\nchale\nnaturopathy\nfootnoting\ndalmau\nbachrach\nlifeform\nfording\nsantiniketan\nwhacking\nmuggle\nsexualities\nbiomed\nrejoiced\nyukimura\nvarkey\nbasayev\nnajafabad\nporat\nshehzad\nantiquus\nrehydration\nmonikers\nphotosphere\nrobbe\nbravos\nnyong\narbitary\nunderpasses\nfah\npoulet\nplayin\ntysons\nbruhn\ngauci\nbpg\nfreewebs\nyellowing\ntrotman\nsordo\nchronometers\notaru\ntelecine\nbayram\ncoevolution\nlamentable\nherbaria\ncatalpa\nzohra\nkindler\nreinert\nbaxley\nsockeye\nevened\nnhn\nhelme\nsuerte\ndân\njaycee\nwilayat\nandina\nceremoniously\nbaio\nbennion\nlarrea\nguillou\nambrosini\nschippers\ncarvey\nneenah\ndivertissement\nebr\nisomerization\nmorwell\nneuengamme\npinecrest\nabidine\ndunsmuir\nmcelderry\nbhagavathy\nazarov\nhollingshead\nsquatted\nrivard\nkalmyks\nyukihiro\nmcgarrigle\nmarien\nverilog\nbhandarkar\nbramante\nstorck\ndaylesford\nsalmons\nbailie\nmilwaukie\nchabrier\njaffar\nesmonde\ntillinghast\ndiscusion\ndeutch\nsarees\nczarist\nmarcell\ncrosslinking\npenedès\nviorel\nstromboli\nmikaela\nscalise\nbirlinn\nmuenster\nveres\ntheyworkforyou\nyepes\nbembridge\nkenedy\nsmd\nanchorages\nheidemann\ndonatella\nlippo\nfatwas\npedraza\nmihaela\nprokom\ngago\ntoting\nstrato\nwasserstein\nchristmastime\ntomba\nkillzone\nodawa\ncartersville\nparented\nkharitonov\npellucida\ntomboyish\ncang\nchorionic\ngoodchild\nmld\nplayability\nculturing\niselin\nabend\ngarforth\norn\ndovey\ntzedek\npained\nwittering\nenderlein\ninfinities\nslaver\njetliner\ngoggins\nindurain\nmetaphoric\nhooda\ntranquilizer\nstumpy\nnihat\nazzopardi\noxycodone\nhousewares\ngrecia\nyali\nkrystle\npvv\nrestrictor\nhrvoje\nsarina\ncoire\nseimei\nozaukee\nunalaska\noverbridge\naviano\nflanigan\nsawamura\nuniversum\nsandton\ndenard\nsideburns\nfanciers\nmanavgat\naav\nsedin\ninuvik\nkinnick\nmatsue\nmoke\ncutolo\nirp\nwarthogs\nbadulla\ncharybdis\nstandouts\nkusum\nmoonbeam\nmicrosite\ndanielsen\npetridis\npodolsky\ncristián\nbliss
fully\ngenies\njobber\ncimon\nfertilised\nrsr\nstarwars\nveidt\nrecuerdos\ntreecreeper\nnumismatists\nbirthed\nukc\njadida\ntutsis\nradome\nhometowns\narosa\npeadar\nputatively\npilotage\nvacaville\ntradeoffs\nsurvivals\nlauding\nopensuse\nflesch\nkatamari\nmclagan\nunescorted\nnomenclatural\ncirillo\nmikulski\nburntisland\nhavard\ncantú\nboilermaker\nrucka\npawnshop\nneuroanatomy\nrhian\nminivans\ncotillard\nmichaelsen\ndisincorporated\nbrot\nhih\nbironas\nhapgood\nclairmont\ncolleton\nshawangunk\ncoloureds\nverisign\nbarmouth\nswett\nburling\ncapsizing\npoway\nricotta\nsofi\nwastage\nscalzi\naparecida\nwast\ntuckerman\nkulthum\nschlatter\nwaltman\ntetrad\nhaugland\ntrefusis\nsaturnalia\nweathervane\nfrears\nquarrelsome\nhashtags\nibe\nhexane\nbelshazzar\ntemptress\nkodaira\narpeggio\ndering\npinnata\narmiger\nzopp\nbutted\nkoussevitzky\nheadlam\nunderscoring\nfearn\nskardu\nvieri\nliaise\ncraw\nnorvell\nhwaseong\nfron\nainge\nmarketability\nrlds\nrosenstein\npetco\nmagasin\nbaldev\njoad\nneoclassic\nderr\nsolidaridad\nearnestness\nlaurette\notherness\nireton\nevangel\nsubducting\nknickerbockers\nmirsky\nteymuraz\nbuarque\npepito\nhacer\ndlg\nteennick\nincorruptible\nrespirators\ninterlace\nphaethon\ntrollhättan\nraggett\nmoccasins\nalpharetta\nbarfleur\nurmston\nmambazo\nmanichaean\nfeigenbaum\neinstürzende\nfluviatilis\nvibrators\nbessborough\nhdfc\nalaina\ncalpe\ncatherwood\ncounterbalanced\nvaporize\nworklist\nhafen\nyáñez\nvizcaíno\nembarrasses\nlevinas\nmaltings\nsreedharan\nkayah\nsocha\nystad\nfélicité\nkaziranga\ngladius\ngraven\npiggyback\ngaffes\npopulates\nbarraclough\nocoee\nruffini\nhoffnung\nanderen\nchicot\nmonroy\nfinancials\ndiyarbakir\nwaterpolo\ncinemascore\npinelands\nnsn\nbogeyman\ndistemper\nohel\nsignup\nreorganising\ncantankerous\nhuskey\nfindhorn\noutlast\nunfazed\nhypothesizes\ntranspiration\nplodding\nshull\narri\nlimeira\nmaxton\nzimmern\nlarionov\nblaenavon\napop\nungulate\nrhythmical\njango\nrusch\ncardigans\ngota\nsherrington\nwhitef
ord\nruffians\nmarson\nmahim\nreclaims\nayad\ncrashers\nkoruna\narchways\nzumba\nprostaglandins\ndemean\nintercounty\nauthorizations\nrippling\ndenikin\nintimates\nadelante\nunsaid\nvap\nsixfields\nbrushwork\nrumination\nmakem\nfumi\nctm\nolefins\npbp\npinerolo\nlascaris\npawling\nagosta\nmaite\ntimotheus\nveikko\nadjourn\ntunny\ninstrumentality\nyamani\ngeomorphological\nwrightsville\ncomayagua\nmatane\npanofsky\njoiners\ninterferometric\ngazala\nhavers\nnarcissist\neich\ncytotoxicity\ninvierno\ncatabolism\ngiunta\nlantos\nbalinsky\nbijan\nalexanderplatz\nnotational\nnewspeak\nlethem\ntemne\nqanat\netchmiadzin\ncozens\ntrnc\npineau\nrendall\nbrambilla\nreinfeldt\nessam\nnamo\neeoc\nshada\nabramov\nhany\nphiri\nmatinée\nnixie\nbomar\ntazz\ncasiraghi\ncozi\nproteinase\ndecolonisation\nritt\naloisi\njhapa\nwindle\ncoman\nfinalization\nphosgene\ngeneve\nfabrik\nhillard\nseceding\nsunoco\nhahaha\nvika\nstrega\nwolfsbane\nnanao\nkishor\nmorkel\ndecommission\nbehead\nkushida\nunica\ndissections\naspa\ncredito\nraimon\ncupar\neuronews\nshila\nreconstructs\nrottweil\nminhas\nnkt\nteimuraz\ncommodification\noberhof\nalcaide\narmfield\nlectin\nwgp\nsocs\ntinashe\nmelk\nstacker\nadducts\nguesting\nunamuno\nlangone\nsekou\nvalby\nhamner\nenriches\ndbc\nagger\nlegalese\nweisse\nperfumery\nchumbawamba\nlukasz\nseidman\nhoriuchi\nmonoplanes\nbiosecurity\ncabbie\nkabataan\ntranz\nlipkin\nwicketless\nmicrometre\ntransparencies\nbritannicus\nzoller\nypf\ncatkins\nasea\nengstrom\nmeringue\nutopias\nseshadri\nlongton\nvibrates\ncadwallader\nbierstadt\nextensibility\nextraversion\nchaperones\nmortenson\nbournonville\nbehe\nmailings\nselon\nschlock\nguarnieri\nstouffer\nkakar\npitstop\ncodenames\ntty\nreem\nsheh\nhoedown\nfurr\nsimpleton\narmstead\nronn\ndeckhouse\nmajeure\ngossiping\nyamane\nautodidact\ntarry\npowerlifter\nreining\ndhaliwal\ntattnall\nmaytag\nwhisked\nflyovers\nperspex\ntechnocratic\nclassica\nmedscape\nvilification\ncharisse\nkyzyl\nbattiato\ntowle\nrexroth\nabeille\nf
alta\nreutimann\nrajani\nazaleas\nkeady\nloathsome\nscrubbers\nsekondi\nfenster\ntannehill\nphoning\nmelora\nnubians\nsansthan\nbernardini\nrtu\nmeretz\nnisou\nsagittarii\nstartle\nfarish\nmesta\ndenon\nletterbox\napostolo\nhipaa\nzhuk\nmaclagan\nmatthiessen\ncheah\nmarosi\nxchange\nliniers\npenetrative\nhonourably\nramsbury\nshavings\ndickin\nsisler\nsabe\nbharadwaj\nestévez\nnecrotizing\naigner\nbeag\nsputum\nklinghoffer\nboutin\nsenhor\nyobe\nkjeld\ncandidiasis\ncanzoni\nescalona\nsigmar\ndangle\namgen\namasis\noland\nlipka\naméricaine\narmendáriz\ncuca\nschuck\nebd\nbdnf\ncurtea\nplumaged\nlaeta\ndaur\nnajm\npluribus\nreorder\nwashrooms\nwescott\nmismanaged\nbridled\nsmithereens\nmackellar\nsiraha\nwaddesdon\nnoncompliance\nkitchenware\nsibel\nkuskokwim\nmilivoje\neuropop\nacknowledgments\ndisassemble\nstigler\ncaecilia\nharumi\nfawley\nsomare\nromanova\ntyreese\nparotid\nrsf\ndrafter\nvrai\nnylander\ndewas\ncraftspeople\nstephanopoulos\nblemishes\ngoscinny\nsingspiel\ndaxter\natan\nwwor\nburson\nwragg\nhitt\nhealthful\nsmirke\nmusei\nvaquero\ncanidae\nbouteflika\nfarfisa\ntcas\ngreyson\nyuasa\ngreenest\ntrunking\nhuambo\nmft\ngodina\nkeatley\ntnc\nchekov\nmulticolor\nmandara\novernights\ndimond\nlandmasses\nernani\nccu\nopcode\nbasslines\ngeocaching\nthale\nfundus\nyinchuan\ndiatom\nsalicylic\ntefillin\ninsufficent\nelectroconvulsive\nmischaracterization\ntrouncing\nstepanakert\nndlovu\npalaeontological\ngrimmett\nmaramureş\nimploring\nhamblen\nrham\nradwan\ncleator\nramli\nguardrail\nlucrecia\nsingstar\nzamorano\nkanawa\nwami\nblancos\ncastellammare\nufology\njws\nbernasconi\nhelichrysum\nbillard\nfernwood\nconegliano\npileggi\ntpl\nnig\nsopoaga\nsalvos\nwesterlies\nschoolnet\nroopa\nfrogmore\nsloss\nhandmaid\nequalities\ncaye\ngourmand\ncohoes\nverticillata\nigs\ncommensal\nhks\nherstal\ngarbarek\nhellyer\nborchers\nfaintest\nnabataean\ngroupon\ndolerite\nsithole\nognjen\nsuribachi\nunsupportable\nnakhodka\nelda\nsteelmaking\nchittaranjan\nwildness\npfitzner\
nchoroid\ntétouan\ncompensator\nbernstorff\nafore\ndowdeswell\noverdraft\nvacances\npullin\ncatiline\nfreebie\nmarmots\nazeem\nvanu\ndolman\nrussie\ntakanori\nrelapsing\nwowow\nexpedia\nseismologist\nerl\nkorolyov\nunderpins\ndually\nkralj\ngaita\nhitching\nmorphologies\npoppet\nkidsgrove\nenrollees\ncertosa\ngoochland\nshoulda\ninadvisable\nrandstad\nalthorp\ncorben\nwalder\nwaker\nsaj\nunl\nkamila\nepe\nsalusbury\ngretton\nafe\nmaillet\nreckitt\nmandola\nstickleback\nmedalla\nfadil\nasse\nlozenges\nbaul\nblackthorne\nyellowfin\njaponicum\nkorb\nahalya\nhezb\nglaringly\nherries\nsobek\nnismo\npleasance\nburress\ncatechetical\ngilardino\nvcrs\nsfn\nmaqsood\nroyster\nkabbalist\ndiscernable\ngandolfini\ninmaculada\nobscurely\nmobbed\nrusby\nborlänge\noxidizes\nmoïse\ngruenwald\ncomoro\nimpatiens\nneonates\ntambora\nglassboro\ngulati\nsignorelli\ngsh\ngenis\nlucarelli\nsantin\nthither\nusair\npharaonic\ncorris\ntoke\nvcc\ncloven\nballester\nphotocopied\nebury\ntarzana\nhijacks\nfearlessness\nschemata\nunderwrite\nelza\nphew\nbiggleswade\nwallaroo\ncorruptions\nbreakwaters\nrahimi\nshenoy\nkonietzko\nwestlaw\nconveyors\njss\nunderfloor\nsceaux\nbashful\nbyars\nprefixing\nbeddington\nundercroft\nopteron\namada\nximenes\nkarori\nsgx\ngroat\nteb\ncorbis\nvpi\nannus\ndross\ntouché\npayphone\nappin\nlouboutin\nferrie\nshekar\nshaler\nspineless\nimpressionable\nokano\notten\nmadagascan\nremiss\nmehsud\nboliviana\nteco\nodda\nhandicapping\nelectromagnets\nicefall\nenol\nleme\nsarfraz\nshatt\nlöw\nircam\nuia\nmcgeorge\nhory\nslaney\nlegault\nkilbey\ntoymaker\nlanglade\nspectres\nstroessner\nmeon\ntoraja\nbucyrus\nclydach\ndys\nweh\nzein\ntoastmasters\nlynam\nbeauport\nordem\njinju\nluckman\nshohei\nnitze\nsirk\nberenberg\nflossie\ngondi\nendocrinologist\nwoodring\nchoro\nmuhammadiyah\nchanukah\nquotable\nsplayed\nlafrance\ntäby\nmff\npaty\nmpf\nathletica\nshoveler\nbirtles\ndoolin\nleishmaniasis\nguitare\nblitzer\nsaraiva\nqayyum\nandress\ndenarius\npacman\ndazzled\nbalkar\nsan
paolo\narens\nldc\ndarge\nquatrefoil\nparigi\nejercito\nmoxley\nretakes\nfinery\nnhi\nmodibo\nstockland\nbuechner\nrigi\nrba\nkitazawa\nwhittlesea\nancher\ntedium\nshorbagy\nmondragon\nnambour\nbaldacci\nkarbi\nflamed\njarmo\ndemoralizing\niasi\nmilson\nfinca\npegging\nskidoo\nguliyev\naccoutrements\namoled\nwends\ndanai\nshm\nmetallo\ndenki\nufm\nsaigal\njiminy\nnup\npaal\nvasundhara\nabelson\ntalkshow\ncaregiving\njuglans\nollivier\narcgis\nunb\nmickaël\ngyo\ncoleshill\ncampese\nlakenheath\nmercians\napopka\nleatherback\ndosanjh\nwestervelt\nbrauchitsch\nharried\nsweetland\nebanks\nprivet\nrova\nkiir\nintegrand\nexpositor\nmarzipan\nlouverture\nengadine\npashas\nstrugglers\nmugi\nkloten\ndoga\nbenfield\nstyron\najayi\nsrg\njeppesen\nwinxp\ntewodros\nwalmer\nmasti\ntinbergen\ncooter\nleishman\nmicrokernel\ndiadema\naska\nmezzogiorno\nboka\nerodes\nmullion\nwalkable\ndraftee\naaaaa\ncivilizing\nentropic\nimpac\nzombified\nreproached\nanastomosis\nboycie\nchannelling\nentreprise\nshiri\nthroughs\nruxton\nlochiel\nrotonda\npugachev\nrugova\npilling\nmontemayor\ntindale\ndagestani\nnobodies\nbarford\ngassen\ntafoya\nifas\nsusu\nandrogenic\nllandeilo\nhazelnuts\nnazareno\npenicuik\neckhard\nseletar\ncontraltos\nnefesh\nlucci\nbrasher\nhadžić\nelefant\nmolinos\ncartmel\njeph\ncrofting\nindustri\ncannavaro\nresnik\nbiogenic\nnkomo\nstabled\nordaining\nbeaudry\ntarkington\nstabile\npembrey\nisenberg\nwurst\nchaban\nforlán\ntrebuchet\nturko\nbackflip\naaltonen\nadvisement\ncuna\nboheme\numarov\norquestra\nzale\ndemuth\ninputting\nethicists\nbeausoleil\nadrenalin\nsmilodon\ncira\nmccurry\nchiseled\ndariush\nexaminees\nkeyshia\nwauchope\nalbu\nbackplane\nbørge\nzepeda\nvolkmann\ncoppinger\ntord\ndanks\ngadsby\nalgar\npagers\nerythropoietin\nhgh\nbwr\nbarista\nshashank\nhillocks\naog\ndreamcatcher\nkasher\nhfs\nalmaviva\nbiran\nganged\npietsch\nerector\nbertoni\nbuzzers\nmtvu\nblancs\nsooke\nintently\nbecks\nkumbia\nyokkaichi\ninterfraternity\naurigae\ndeliciously\nmylene\nkow
\nkavitha\nhumanitarianism\ngesner\nneuner\nraas\nclansmen\nsurvivable\nvandersloot\ndjamel\nldf\nblackheart\nmisfire\nmarleen\nhsdpa\nretellings\nmardle\njusticiary\nilyin\natle\ninitiations\nradiophonic\ngins\nsgh\nmêlée\nelka\ninductively\ngetxo\nsociobiology\nµg\nrems\ngustavsson\njacketed\nadornments\nperun\ncaudex\npentonville\njailing\nmaks\ncabe\ndesierto\nmunck\nserrations\nklos\niwamoto\nokean\nfarrand\nwolfhound\nkundera\nsubtropics\nlatus\ntrd\nhohenheim\ngibbard\ncushioned\nscada\nlovitz\nyakutat\nngozi\nkaru\ncentralisation\nrúa\naocs\nantinous\ncaricaturists\nwarzone\ncarves\nstortinget\nsandf\nkuyt\nwem\nzima\ntamazight\nunread\nmerlino\nnagao\nhospitalizations\nmandelstam\nsihanoukville\nhilux\nruz\nbekele\ndeakins\nligon\nstapf\nsaqr\nfriesian\nrandhir\nrdi\ngilmartin\nbanality\nweissmuller\nbrise\ndyers\nunravels\nwainscoting\nkrzyzewski\nmasini\nweblinks\nzwart\nvendettas\nbasanta\nmaebashi\nopl\ntristán\nmiddleborough\nbasij\nroundtree\nretinoic\natli\nmiyata\nmarshalsea\nkoss\nthylacine\nracha\ndarkens\nweiden\ngoy\nbeatlemania\nalija\ncakewalk\nlacerations\nborghi\ngarvan\nlongmeadow\nanticlerical\nwhistleblowing\nspurts\nyakumo\nsteck\nsuburbanization\niznik\nstabiliser\nchandrakant\nlagardère\nanalyser\nmafeking\nchorales\njuul\nyenching\ncodifying\ncayey\ndinis\nmaiduguri\ndjerba\nfeasibly\nmolest\nsumida\ndeface\nchalker\nbolte\nbramber\npensiero\nrevitalise\ndzerzhinsk\nmaudlin\nnichi\npenkridge\nperverts\nsecretaria\nblakeman\njacquie\nhaack\nshabbos\nboldin\nradcliff\njumblatt\nodbc\ncarbonite\ncondensers\nfq\nmoneda\nrenouf\nhiva\ndisrespected\nmandler\nmesmer\nshubha\nwagging\nshechem\nprechter\noverpriced\nlandholding\ntrottier\nbaseballs\nmagid\nasrani\ntaoists\ndistros\nmamun\nnaryn\nclytemnestra\nsabata\nkaski\nzawraa\njabber\nexcusing\nclowning\nmomoyama\nmonographic\nmismatches\ndinefwr\nthornburgh\nchachi\ntiresias\naltham\njuiz\ntechs\nflorrie\nrga\npennyworth\nsanju\npalla\nbizzy\naugustyn\nflustered\nliuzzi\nimmelmann\nsanil
ac\nshindig\nefx\nimpinge\nmalouf\npanchami\nmordant\ngomi\nzampa\nvassily\nrowen\nfallows\ninfluenzae\nkamenica\ngrandy\nmossberg\nkarthika\nbanneker\ngcsi\nmairead\ntimpanogos\nannadurai\nmajed\ntorrie\nteniers\ndiyas\nmediolanum\ngonabad\nspiritualists\nmumia\nvattenfall\nsien\nhenie\nwedekind\nalcester\ndisbelieve\ngloating\nluxuriant\nbothe\nslackers\ndrewe\nhagfish\nlefferts\nokita\nvarenne\ncolorings\npassione\ndeoband\nseabee\nnorthville\nyamanote\ngroping\nmetamaterial\nsatirize\nweyerhaeuser\nondes\nsutch\nsahi\nikaros\nnorco\nquintilian\ndolley\naubervilliers\nriise\nformwork\nmegson\narmbrust\nbakula\nmarkley\ndryburgh\ntheologica\nkus\ndwa\npolitifact\nhpi\nunfortunatly\nanemonefish\nhuddinge\nwheeldon\nradiations\nbolas\ncitic\norgies\nblitzen\nimmobility\nsnapple\nbarraza\nvintners\nhiddink\neterno\nfeehan\nmahir\nincubates\nvarig\nsilhouetted\ndavar\ncommunitarian\nnub\nmiasma\naladin\ntulin\nyoho\naaja\nkitimat\nsculthorpe\nlilburne\nprati\nnims\nskat\nkuchar\nalmqvist\nviewings\ngamete\nuncropped\nresinous\npincers\ncamellias\nkoha\nkhumalo\nsidibé\ngerontius\nsummerfest\nacholi\ntrésor\nthermionic\nhongqiao\nklausner\nisg\noreal\ngavrilo\nswingarm\nsealant\nvardan\nfoulds\nharrop\nhistocompatibility\nutp\nskorzeny\nfaller\npauw\nturnouts\nimber\nspeeded\nhailes\nbsm\nholdridge\nfallbrook\ndenominators\npittwater\ntrudi\nhova\nodf\nwillmar\npistone\nsimonides\nrafique\nhelwan\ncondamine\nrecycles\noor\nelectrolux\nsomone\ngaladriel\nossified\ngermline\nhalili\nguruji\nkitchin\nsnowstorms\nadvantaged\ntamoxifen\nweariness\nhamgyong\nashkenazim\nmaritz\nnega\ndieback\nziyi\njingwei\nhoggard\naktiengesellschaft\nfillets\ngrif\nncf\nolyphant\nmugello\njako\nseamer\nmocs\npapilla\nleclaire\ncomparability\nschmuck\nfomento\ngza\nbiodynamic\nvence\nbeso\ndepartement\nkinh\nimrie\nvibrancy\nregalado\nshizuku\nblaydon\nmanzo\nvaya\ngroupware\npelion\ndownsize\nfarías\nterpsichore\nsören\nselenide\nswedbank\ncompo\nwadley\nniedermayer\nthreshers\nlistenable\
nsqueal\noras\nalberic\naltas\ntyrian\nzet\nmerapi\neberhart\nvallon\nanimistic\ntorchlight\nkibble\nopals\nemmis\nmcaulay\nrightness\ncorrib\nnedd\nbunche\nroitman\nsalak\noldroyd\nerkan\nspotland\nstokely\ncohabiting\nfunston\nwhec\nneukölln\nananth\ndostum\nridiculousness\nspurn\nuntruth\norganizationally\nreais\nchm\nrexford\nmandl\nmethylphenidate\nlunny\ninterconnectedness\nlemhi\nftn\nruffalo\nelwin\ndidymus\ndeezer\nadra\nplaquemine\nlalanne\nconditionals\nantihistamines\ncoffees\nyousif\nambato\nmonáe\nfellers\nclincher\nparticulary\nfrantisek\ndaschle\nfenestration\nnordstrand\nsaunas\nyulaev\nmandla\ngesell\nperineum\nyuppie\nratzenberger\nhkt\nnataliya\nkluger\nturnabout\nbussell\nsubservience\ngandhiji\nbelaz\nheadshots\nezine\nfamiliaris\nnobuhiko\ninternationalisation\npeccary\nknokke\nflatbread\neml\nlankford\nocbc\nsleater\nmaghrib\nmassino\nponsford\nplaaf\nvoltigeurs\nmorose\nlerche\ngrindelwald\nhannam\neraserhead\nchromite\nremount\nformulary\nchiarelli\ngehring\nrolt\nrutile\nmusha\nunclosed\ncollines\nbrych\nlabile\nhurdling\njournée\npolícia\nenno\nefate\nlaus\ngies\nspotlighted\ntawau\nhoylake\nposible\nbarlas\nswardt\ntombigbee\naute\nalmon\ntato\nvirk\nstamen\nbrede\nriyadi\ncarrboro\noceanfront\nassed\nahtisaari\nhamdani\ndichotomous\nalbon\nuntv\nlipan\nerisa\ncollado\nbratty\ncimmerian\nblagoveshchensk\ngreenmount\ncolosimo\njemaah\nprospering\nanticyclone\nberkshires\nncsa\ndessie\nleitmotif\nanholt\nstradling\nmclachlin\nballan\nchehel\ndebo\nskank\nsomma\nslonimsky\nbinational\nprest\nchima\namite\nparise\nheera\nbromeliads\ninfirmity\nrayment\nantechamber\njudgemental\nhoneypot\nsellafield\nuscf\ncerati\nmanicured\nbenvenuti\nfalafel\ncolluded\ngranata\nharveys\nigwe\nsabonis\nmorenz\ntoolbars\nriccio\nhakkari\nrashly\nmandamus\nluby\ndoldrums\nlassus\ntrente\nsahm\nmonfort\nbetrayer\nvalueless\nloughs\nthabit\ncontinentals\nitl\npadrón\nebon\ntdk\nkrab\nvignes\nhabakkuk\nbyford\nmacphee\nkiser\nbellotti\nsalukis\nperls\nhillage\nro
bbin\ngwich\nflocka\nfale\nenniscorthy\nscaup\ncomminges\nzeolite\nfurstenberg\ntongji\nmargarets\nesmail\nmangin\nbrielle\nsnyman\nakif\nembrun\nmickle\nrevelatory\nmedusae\njmp\nbrewhouse\nikari\nhenner\ncerne\nmiharu\nhersheypark\nmisadventure\noakhurst\niwao\nkiddy\ncfcs\nkaf\nholten\nobjets\njyrki\nbelang\noberman\nhdnet\nseidl\npigeonhole\ntooke\nhabyarimana\ninfantil\nunpledged\nrisd\nsrbija\ninbuilt\npenge\nagronomists\nwewak\nellora\nzahle\nfuster\nsquelch\nchattels\nsantoshi\nciego\nliposuction\nmavi\nfriedrichs\nkayaker\nyello\nmutuals\nkyeong\nnoninvasive\ntrane\ncarolinians\nbeary\nremotes\nzhangzhou\nuniversite\nclimaxed\nzurita\neprint\nvenis\nlightner\nmoberg\nmohe\nashtar\nselway\ncerebrum\nstoneleigh\nbracho\nberryville\nbenedicto\nchicory\nmcgarvey\nlimca\nstatist\noloron\nsunstone\nsabal\ntsukamoto\nsargsian\ndamrosch\nplanescape\nstomps\nvolkan\nagito\nserkis\nhargitay\nuproot\njabloteh\ncrosser\nlarter\nnavratri\ntsarskoye\nfennec\nconcoct\nphotocopying\nturhan\nlolicon\nelida\nlingle\nkaibab\nmactan\nfatter\nfentress\nquiller\ndegrasse\nbld\nmacnaughton\narmstrongs\npavese\nzarif\nplanitia\naul\nportrayer\nseguro\nappelbaum\ndoesburg\nkuykendall\nrayed\nkitwe\ntrembles\ncalea\ndeflector\nkempthorne\ngainey\ndebutants\nlaubach\nstellate\npodcaster\nvorenus\ndargis\nertl\nhirschberg\nwurmser\ncoan\nunderstaffed\nstilton\nsidorenko\nviktoriya\ntomokazu\njuxtaposes\nmediations\nwiderstand\nsjöström\nlabrie\nderailleur\nkarishma\ngerónimo\npinedo\natto\ngeneración\nquilted\nsugihara\nfiorenzo\nhaire\nvalladares\ngokul\noxidised\nenlistments\npopulaires\ntorrejón\nclasped\nagapi\nkera\nyzerman\nmith\nchanteuse\nnobre\ntinfoil\nkautsky\ngalston\ncoomaraswamy\nadebayo\ngnawing\ncronica\nayaan\nnuking\nrohl\ninkwell\nbiblically\ndecalogue\nstocksbridge\njacki\niup\nhuddled\nleftwing\niroc\ndrugging\nveyron\ncotillion\npathogenicity\najami\nrottweiler\nendnotes\nbafana\ncharsadda\nfireboat\nrømer\nimpales\nlehto\nhideouts\ntheatricality\nornella\nafv\nl
eaphorn\nbeppu\nmediabase\narfa\nwoodcutter\njulienne\nenglander\nkjartan\nkusa\npintado\njonatan\nmuttahida\nmilka\nawakenings\njewison\ngesang\ncivitella\nciccio\nhulking\nernests\nmuzaffarabad\npascack\ntimecode\nphysiotherapists\nyusa\nshags\nluyendyk\nplainsboro\nscherrer\nspidey\ngrumbling\ntagil\ncalverton\nwintry\nbebel\nmanteca\nfathead\nwyd\nmony\nbygones\nsatay\ncanela\nbeckton\nvechten\ncasady\nschistosomiasis\ndenigration\nstrathallan\njoneses\nbugles\ncapulet\nshorting\nnessie\nbenicio\nnigella\nbailouts\nundercutting\nells\nkalenjin\ngeva\npaju\nshawkat\nsifu\ncasted\nfatigues\nstealthily\ngunton\njutra\ncostata\nprogresso\nmanders\ninan\npanettiere\npallavicini\nschoonmaker\ncorreo\nhardesty\nlampreys\nmonkland\ninstitutionally\nsiegler\ncaldeira\nsomthing\nurogenital\nludus\nfunabashi\ncalley\nmultifunction\nclothe\niad\nhepatocytes\nkrung\neik\nshorewood\nscammell\nhydrofluoric\nsirs\navoir\noccurance\nchinle\nuneasiness\nbuggery\nzahavi\nfoodborne\nnormals\nleasable\ncorrelative\nmoondance\ngeddy\nmaksym\npekar\nfoursomes\ndiamandis\nvalmy\nbehrman\nmuons\nlfl\nchadwell\ngallimore\nsigmoid\nestill\ncaptivate\nemanu\nsafeco\noutdoorsman\nstirrups\nsavelli\nichthyosaur\nwaxahachie\nsithu\nmutualistic\nquique\nhanky\ncomprehended\nhuntelaar\nimperii\nuvb\nlatouche\nknighthawks\nkinne\nrobidoux\nhebburn\nassheton\ndonoso\nkake\nsarris\nkaeding\nmeck\ngiada\nfortepiano\nrenter\nsmathers\nclubland\ncaernarvonshire\ndeters\nkeepmoat\nkrait\nfabless\nthain\npsychogenic\numkc\nexhorting\nresubmitting\nivers\ngalvan\ntraumatised\nspinel\ntypifies\nroney\nkristinn\nnotarized\nswabians\nosl\nmacneice\nrosslare\nirem\nmétéo\nlavished\nsuperhighway\nzaremba\nirredeemable\ngorshkov\nperuviana\nnewsted\ndefinitly\nspearfish\nmessa\nrepurposing\nbluecoat\nmalpensa\ncoushatta\nkirály\nnflpa\npanikkar\nrabinovich\nllnl\nvectored\ncaetani\nfdot\nmwai\nmeghann\navast\nquibbling\nchanderpaul\nglauber\nparshall\nbarest\nhideously\nunbeknown\nfarge\ncanonsburg\nmbira\nen
largements\npeavy\nwortham\nteleprinter\nobo\nmendham\ndalli\naqap\ndsw\ndupes\nburston\nsunkist\ndraganja\nerben\nvincenti\nshahram\njaunty\nsuitland\nlamer\nraygun\ntyron\nkiplinger\necd\nshod\nfilatov\nratnapura\ndouma\nwesterfield\nclintons\npurohit\ntakizawa\nportimão\nmasseur\nmédoc\nwadebridge\nsherrard\nothmar\ndogfights\nsaurischia\nttf\ntanta\npeja\npharr\noria\nlanvin\nhotz\nmyasthenia\ncurable\nwiest\nbarnesville\nbrigadoon\nfrise\nbullfinch\nsahay\nsubmersion\nguodong\nsharat\nleafless\nehrenreich\nmoghul\nvsb\ncapos\nstairwells\ncppcc\nescott\nfraley\nbostick\nodets\nreplanting\nbalenciaga\nbotts\nwesterlund\nfloundering\nconstantijn\nswanepoel\npof\naelia\nnakedness\ntreetops\nmaqbool\ndownsview\npoliovirus\nmargravine\nkomar\nmandar\nalums\nscoot\nucayali\nvirden\ncopolymers\nslapp\ngevork\nrameshwar\ntelefe\naban\nmetrically\nhagrid\nchapleau\noverbroad\nochiai\neffectors\ncotterell\nlifford\nidealization\ncuadra\nfolksinger\nvanitas\nbeatson\nyaquina\nmaybelle\nrylance\nmccaffery\nharikrishna\nhanworth\nboucicault\nitar\nmeynell\nintroducer\nliquidators\nmarzio\ncultivator\nmuzak\nmiedo\nsclera\nbrownstein\nshaiman\nsunland\nindependentist\nmogren\nkiffin\ncpf\ncontemptible\nrockwall\ndisqualifications\nyeadon\nfavouritism\nvolatiles\nsensationalized\nqueenslander\nkästner\ncengiz\nniyazov\nbartman\nhima\nthrob\nstace\nlewington\njürg\ncourser\nharvestmen\nmashups\nnenê\noakmont\nreinvested\npursuivant\nintertribal\nlotfi\njungfrau\nturia\nfurniss\nmulino\npluperfect\nclairefontaine\nlanguid\nrcahms\ndisheveled\ngreylock\ncookham\nanderssen\nrendus\ntruncatus\ntcn\nborlaug\ncrucifixes\naaronson\nfuruya\nbraised\nmeanest\nbelaya\ndemoralised\nkieren\nulfa\nforeseeing\nvalium\nbrookvale\nmattis\nreinstalling\nchelation\nbuona\nlale\nnuthin\nungraded\ndoughboy\nvoinovich\nhighbridge\ngoodland\nausterities\npemex\nmkb\nrescinding\nmbps\nsorana\nschaaf\nsouq\ncolonoscopy\npagination\nkydd\nkisa\nbanes\ntoothpick\nelastica\nonofre\njcm\ncruisin\nenglefie
ld\nlanai\nlewitt\nbaedeker\ngyalpo\ncornyn\nncu\nridha\nsvs\npavane\nfrou\nfairtax\nreadmission\nadulterated\nmakki\nterminological\nstateline\nmartynov\nsprott\nwoodridge\ntriglyceride\nleuchars\nbarghouti\nodlum\nlockley\nentailing\nagh\ncorbijn\nautomatics\ntajuddin\npavlovsk\ndoulton\nneumeier\nfirstgroup\nbildungsroman\ngaudeamus\nperchance\ntraore\nlagunes\nstempel\ncordons\neluding\nflings\ndyad\nariston\nashurbanipal\nsüdtirol\ngoldner\nschicchi\novery\ncapitalising\naspatria\npettus\nrevo\npatani\ndunant\nmatthes\nunheeded\nmetalworkers\nthomastown\nphages\nwormald\nomc\nkreviazuk\nvilli\nhanbali\nlahoud\nawnings\ndukagjini\ndbt\nmisbehave\nbrianne\nbasotho\npanto\ninterneurons\nromário\nbolle\njanuarius\niker\nalvi\nbondar\nloblaw\npasch\nmercato\nphotogrammetry\nprévert\ngolightly\njamel\ndimona\nazer\nmegachurch\nalcamo\nnapkins\nglenside\neterna\npurifiers\nmicheál\naneurin\nbirnie\npasteurization\nbackspace\ncopulate\numpteen\nnofollow\nmornay\nfugees\nunwisely\njanam\nmcgough\nhuangdi\ntransliterate\nacls\nsubtler\nliri\nmaxon\navl\ncobre\nguimaras\ndaco\nmiddelkoop\ncordis\nlateness\nangier\nmessianism\nroscrea\ngayton\ndoerr\ngarçons\ngrimani\nséguin\nwilko\nfruitland\ncomden\nllorona\nrame\npelikan\ndarnall\nmutes\nfamiliars\nsherpas\ntrample\nloredana\ntiomkin\nkott\npaan\npanavia\nsplints\ngogebic\nterminologies\nkittitas\npedlar\ndeckard\nsamer\nzfs\nbottler\nairbrush\npersonalizing\ntenterfield\npowerlessness\nrickert\nrossdale\ncompere\nkaila\nroncalli\nthermite\nmanzanera\nsusquehannock\npendergrass\nlwow\nolot\ngovindan\nhossam\nmargao\ntranssexualism\nrosato\nuserid\nsonority\narcus\ninstinctual\nsidelight\ncleats\nswinney\ndestabilized\nmaidservant\ngalifianakis\nmodeller\nbowne\ndashboards\nnerdcore\ngriscom\ngoldhagen\nmcgrory\nremonstrance\njosselin\nrydal\nsmelted\nnbp\nmohn\nboudinot\nvargo\nthwaite\ncycladic\nhardwired\ncalibres\ndongen\njinshan\nindes\nsaraf\nherniation\nsarm\ncrosshairs\nnovia\nmatarazzo\npaperboard\ntalysh\nexcel
sis\nroks\nanandpur\nfranceschini\nanadyr\nskillz\niska\nhadlee\nifill\nbrushless\nsaotome\nlanata\nkataeb\nkudrow\nmotegi\nechostar\nrcf\nkester\nohka\nkinloss\nsanjar\npostulating\ndisplayport\nrhc\nyechiel\nkanada\nboller\ntriadic\nbarnato\nboob\nmagnavox\nutca\nroutt\nacetylcholinesterase\nfreethinkers\nextrême\ngeos\nfuso\ntektronix\nstis\noutstripped\nfibrils\nhoechst\nventurini\npulsation\nstingaree\nsqueamish\nquirós\njupiters\ndfx\nslither\ndhr\ngwendolen\nmayflies\nadulteration\nminard\ndiaw\nfamiliarized\nclarrie\nlegende\ndoting\ngtd\nunr\ngallinules\nshied\nmichalak\nthermae\nsidemen\nsurma\nsezer\nhillhead\nsyms\nelkington\nterming\nhebb\nhever\nbonnar\npunning\nkomaki\nagoraphobia\nlithia\ngottesman\nicv\nramaswami\nsower\nmarea\nlansdown\nprisca\ngae\nprahlad\nhemolysis\nomn\nethicist\nturncoat\ndrouot\nabrar\nincredulity\nvalency\nbagshaw\nbrambles\nlais\nschemer\nkozan\nmsb\nturbomeca\nbucuresti\njustyna\nwhosoever\nfolle\nbiggles\nboorish\nkulturkampf\nwigston\npyramus\nwoodend\nsedat\nfiroz\ndonghae\ncoolangatta\nshrieking\npopularizer\nalderton\nschweiger\nranjitsinhji\nsjc\nlogistically\nsaram\nswitchgear\nyasui\nbotcon\nsmap\nsalmi\nhibiya\ntuatara\nmullane\ncurfews\nsoyer\nnamaste\npreamp\ngavras\nchristin\nsimmel\nnemes\ntumen\nvergennes\nrailwaymen\nkawahara\nlnr\nmedel\nnoo\ndubz\npercieved\nwint\narrogantly\nginuwine\nglendower\ngradations\nnanoseconds\nmassacring\njuels\nbeaudine\nqadeer\nwuhu\ncopolymer\ncorbucci\nkillingworth\npusa\nadlington\nintravascular\nrunoffs\nbongaigaon\nmoodle\nbloodstained\ncolbie\nrootkit\nelance\nchillout\nathénée\nteles\nkankan\nfud\ntmg\ntrichloride\nmisapprehension\nmichalski\nspurr\naziza\nbalewa\nmicrantha\ncashmore\nmultiuse\nknitwear\nmarfan\nfinster\nmacdougal\nunencumbered\ngerritsen\nrenz\nlorenza\nsucculents\ngreenport\ntush\nsarastro\nebla\nkissel\nmgo\nfrancisci\nsikeston\nadie\nlopburi\nchurchtown\nmcf\nbeddoes\nboulding\nmasaka\nlandreth\nawg\nfechner\nresveratrol\ninsinuates\nschönefeld\ndol
f\nrhodesians\nsiphons\nmatériel\nlemonde\nclericalism\njonesville\nwalkerville\nwilczek\nfussed\nsoudan\nkoopa\nshapers\nhra\nbhalla\nfasted\ncolma\nmattox\nlecoq\ngiacomelli\nencaustic\nracewalker\nhumours\nkfa\nbote\nmercersburg\nionisation\ngitano\neln\nutne\npencilled\nalick\nmdgs\npiru\ntanisha\nendeavouring\nvialli\ntonsured\nlightsource\ntattersalls\nwyden\nawk\nclouding\nidée\ncastells\ngriffons\nbienvenido\ncootamundra\nscoreboards\ndishonestly\nsokoloff\npartir\naventuras\nschweinsteiger\nwebserver\nmadara\nwitzel\nwieman\nravinia\nprided\nviolino\nrcb\nsugarman\nphosphors\nkatwijk\nnaftan\nderain\nconfirmatory\nmispronounced\ndeshayes\ncorton\nlebar\neveringham\nscaccia\nodenkirk\npoms\nsteinhoff\nronsard\nbeaverhead\nanchieta\narteta\nsmallholder\nterrae\ntrono\ngoldy\nwelders\nthorfinn\nfpgas\nwestall\nwarrender\nmilito\ndonaueschingen\ndfe\ngajendra\npring\natanasio\nadebayor\ngeorgen\nbelied\ntendentiousness\ndunsmore\nkiloton\ncabela\nflan\njoby\nunderoath\ndoen\npangborn\nopto\nmrx\nbusquets\ninjunctive\nledgers\nbikeway\nprophesies\nextol\nerj\nbhagwati\ndslrs\nedinboro\nlamberti\nalis\ncroll\ncallejón\nkila\niqra\nearmarks\nbaruah\ngwinn\nwais\nariki\nautoroutes\nkerfuffle\ntunja\nhanifa\nsenato\ncliveden\nspindler\nmellish\npelley\nrudis\npunchestown\nsuccop\nwindswept\nkrio\ndugas\nsewickley\nszasz\nplumpton\nmariska\nriedl\nbrushwood\nlemar\nmeizhou\nherculean\ntwr\nisaias\nheaddresses\nrunaround\nrecouped\nhomogenization\ndisinterred\nvisio\nrushville\nzagat\ncrista\npinscher\nslm\nlovebird\norgano\nvashti\nryker\nchutzpah\nyonatan\nneopets\nwern\nwellingtons\nbrembo\npatentability\nlagu\nwalbrook\nbrookdale\ntargetting\nhantavirus\nmipt\ntobymac\nteagarden\nkomnene\nreisz\nsobering\nerlandson\nbiskra\nnanotech\nguarino\nuntouchability\njosué\nuncontentious\nzapu\nkoba\nnunneries\nscheiner\nboettcher\nlogica\nlookups\njass\nrowbotham\nehret\ndiscontinuance\ninattention\nbusselton\nbogut\nparoxysmal\nkary\nfragmenting\nswinford\nharnell\nseid\
njünger\nayin\ncovadonga\nmalts\ngebel\ndecelerate\ntheist\nipsec\nbaftas\nbrickley\nmochrie\ndenazification\nyungas\nklansmen\nkruk\nkentwood\nbenzoyl\ntinh\nwiman\nreiff\nknickers\nashgrove\nkleiber\ndepowered\nhanscom\nmeerkats\nellroy\ncountercultural\nmiming\ndhaulagiri\nhasani\ncraves\ntakenaka\nwbca\nyueh\nexhalation\nlycett\natglen\nmilked\nlinnea\ngrana\nmononucleosis\nhirschhorn\nclémence\nwendland\namelioration\nsignac\nfarts\nsargasso\nreboots\npieta\nbisbal\nbundi\nbarbeque\nbeena\nastonish\nveronique\nbrás\nvmc\nmessel\nnguni\njoerg\nmondi\nxaverian\nfowls\nnymphaea\nthornberry\ncuero\nclavichord\nlightman\ncackle\npotentiality\ngrigoryan\npnv\nperthes\nkasten\nreselling\nchupacabra\ncardle\nsplm\negoyan\ngroene\nrufin\ncoatesville\nmanzini\ncoppersmith\nhargeisa\nduta\nyersinia\nfaceoff\nvacca\ntopoisomerase\nflir\noribe\nspafford\nserin\nponiewozik\npolyamory\ncacapon\nshotwell\ntramlink\nmacdonagh\nunderling\nmaal\ncrise\ndopo\nshantanu\ngordimer\nalkylating\nshanmugam\ntorpor\nhasson\nkrenek\nsacrosanct\nbalmy\nbardeen\nquesta\nbuta\ncoppery\netcheverry\nticknor\naromatase\nsedatives\nsey\ndemba\ngerrie\namagasaki\neyvind\nallawi\naquatint\nlorch\nnoriyuki\nlostwithiel\nfortissimo\nhelgason\ngrantor\nuntie\nkilbourne\ncreo\nchosin\ncobby\nvectis\nugs\nciskei\nohlsson\naddon\nbolivians\ncockcroft\ncamil\neddi\ngeithner\nsalò\njesolo\nexh\nfonteyn\nsasol\ngmu\nheytesbury\nmalabsorption\ncisc\nmedicago\nheriberto\njumpy\nbarres\ncorvids\nfarnell\nsmarty\nlorenzetti\nsadeq\nlubavitcher\nbovey\nshimshon\nshew\nromande\ntremain\nmatthijs\nlassa\nizvestia\nreticulate\nadmonitions\nsafdar\naleksa\nswirls\nipsilateral\nindraprastha\nzick\naromatics\nrosalba\nbusty\nchey\nexorcise\nharping\nhinz\ngaudi\npeine\niyad\ncounterparty\nsteeples\nmaret\nbaughman\nkeli\nbricusse\nmorogoro\nwlm\nhavergal\ntaji\nquantifies\nnuthatches\nnandy\nfiefdoms\nmalvo\noverexposure\nortolani\ncarm\nancelotti\nvix\napprovingly\nshaadi\ndeclamation\nterkel\ncrkva\nmrr\ndiarmaid\n
dancy\nsead\nboatyard\ndgs\nsecularisation\niste\naiton\ncarnet\nholmdel\nradiographs\npolizia\ncatalyzing\nangèle\nmabinogion\nrainfalls\nquintal\nsûreté\nswish\nferroelectric\nadhemar\nweeded\npeveril\nglioblastoma\njesi\nrastislav\nshuhei\nanacortes\nmashing\nstoudamire\ntreacher\nbente\ndodie\nlaclede\npalanquin\nlaundries\ndistillate\nbiggins\nmbl\nhudak\njohannsen\nsalehi\nthb\nprothom\nbrontosaurus\nluga\nanl\nnuclease\neldoret\nquillen\nmojica\nearthling\nrisch\nsylph\nlaberge\nmuth\nclarinda\ntidally\ndeification\nsmo\nmelnick\nsterner\nscoter\ntatooine\nkorczak\njux\nphilipse\nwinchesters\ncachoeira\ncaminho\ntuchman\ntoadfish\nbandopadhyay\ngarstang\nladytron\nnabors\nmollison\nirrepressible\nchenier\nadw\ngaîté\npolychlorinated\ntheatricals\nmannering\ncrematory\nthyagaraja\npaniagua\nweenie\nmurdo\nfenrir\ndorner\nglacialis\nfrothy\ngentileschi\nadblock\nhifi\nkiddies\nrasen\nlegislating\nrybinsk\nsaith\nfurnishes\nessentialism\nflds\nvaas\nkanti\ntaobao\naynsley\nneuromancer\nhearns\nvibert\nhassam\ntriggerfish\nallain\ntourmaline\nanuj\ndonkin\ntempio\nzugdidi\nbarmen\npataudi\nderm\ndilettante\ninnapropriate\nvarberg\nunderperformed\nbrama\nhamp\nhedonic\nsheikhupura\nmoly\nfite\nmaîtres\nibra\ntater\naskia\ntwu\neconomica\ntrippy\nterras\nmutombo\nyefremov\ntoohey\nmurtaugh\nbargh\nkobalt\nglitters\ngmm\npixley\nbahrami\nnujoma\nfetterman\nkirana\nzahorchak\nedythe\nplaygroup\nholdup\nenquiring\n（\nmaryanne\nedyta\neldad\nmohmand\ntakhti\nactaeon\nreintegrated\nwcg\nchambermaid\nklaatu\npiniella\nwetherill\nwabi\ngagosian\nstuns\ngml\ntrooping\nbroadley\nbatam\nrationalise\nwaitemata\nlingala\nspunk\ntreadway\npignatelli\nedington\nwigton\nwaveland\nhazan\nmcdaid\ndongdaemun\ntabulating\nwitkowski\nbetcha\nuden\nduthie\nforté\nvpa\ntaiwo\nvianna\nghamdi\nhight\nrecombine\nbrogden\nachi\ntdr\nbratsk\ncaumont\ngroundswell\nnoman\ndispelling\nfurse\nairventure\nfaircloth\nrothschilds\nmoralist\ndragoljub\nbely\ntarnowski\ntristano\nfuat\ntimmerman\ntei
chmann\nstarchild\nstrickler\nmaimon\ndisperses\necevit\nrubini\ncivica\nyandel\nandar\nehsaan\nhadrosaur\nletta\nhatice\nraconteur\npounced\nnoemi\nmulhern\nzedek\ncourteney\noutmatched\nsavants\nmuli\nchickenpox\nmontagnard\njette\ngarver\narmalite\nmanhua\nsholem\ncounselled\ngrinham\nwebisode\ntarin\nsiong\nquon\nsigerson\ndá\narona\ndelvin\noutflows\nsandre\nbinz\nbearkats\ncircleville\nchb\nweathermen\nnitzsche\npublix\neysseric\ndassin\ntranspacific\nrft\nnewtownabbey\ncodigo\nmusicum\nerythromycin\nortona\nkanika\ngenentech\nalfredsson\nreallocation\nmelnik\nhibi\noomph\npavone\nsff\napraxia\ncanley\nserafino\naera\nsubsisted\nfitts\nforeplay\nbethlem\nfenson\nreidel\ngewandhaus\nrocketeer\nquaking\nsymbiont\ndecry\nsitapur\njerking\nconsultancies\nhealthiest\ngallovits\nprurient\nmulticore\nwassenaar\nbrecknockshire\nreshammiya\nkristol\nsharona\nclink\naminotransferase\nazumi\nayana\nteleki\nlockup\nbalmaceda\nsocialising\nredbacks\nayatollahs\ncubicles\njujitsu\nlabadie\nbuchner\nwann\nsensorimotor\ncricklade\nvolkskammer\nnaloxone\nsaikia\nrowlett\nscammers\nkarasev\ncjc\nlaboriously\naltea\nromantica\nexacerbates\nrecusing\nbomp\nelgon\nbrenan\nexaggerates\nstooks\namblin\nexude\nkillifish\ntakestan\nbearskin\nkhouri\ntecate\nbettini\nhansraj\nbeguiling\nruanda\najab\nmanang\nwaqt\nsigurdsson\nsextets\nkalashnikova\nbalian\nelephantine\ntúpac\nneustrelitz\ndeadbeat\nfariña\ntose\nfreudenthal\nbluestar\nfining\ndugard\nslaf\nmontepulciano\ndimarco\nioane\nsanibel\nsyst\nkazimir\nhtt\ncannington\ncereus\nhbk\ngiardini\nnemec\nstewarton\npatios\nalburquerque\ncookeville\nherreshoff\ndarkchild\nbeniamino\nséverine\nliminal\ngnutella\nlassies\npolychaetes\ndharamshala\nungodly\ntrucial\nandrius\nsemolina\nheribert\nclase\nbooneville\nmoans\nkingsnake\nneoconservatives\nsecondment\nrezaei\nboran\nszekeres\nhilla\nblacklists\ncandu\nyakama\nabhijeet\naliquippa\nsonnambula\nbookish\ngergiev\nglia\nmcgonagall\nlincs\nnewsboy\npatrizio\ndunking\njordana\nsinecure
\nrescheduling\nplayroom\ninappropriateness\ngrantville\nweert\nwistar\nmegalith\njernigan\narnsberg\nvare\ncqc\nshockwaves\ndeaver\nbaoji\njhajjar\nrefits\ndodi\nlotr\nsalling\nranasinghe\nicq\nmilnor\npantages\nreeducation\ncheo\nibt\nwaif\ndeseo\nvasantha\nsubmersibles\nmundelein\nedutainment\nchakravarti\nferragamo\nbednarik\nnoth\ndeobandi\nbivins\ngelnhausen\nballooned\nazucena\netl\nsnort\nbonsall\nestcourt\nmerson\nwiens\nhiroyoshi\nentrenching\ngossipy\nyambol\nbonetti\npeláez\nmckillop\nhushovd\nabashidze\nmontell\ntigard\ncolomb\nkhanda\nzubov\ncabos\nclimatologist\nokura\nreculver\nedgier\nvolcanos\nkearsley\nspeedskating\nconstanze\ncronquist\ntasos\nmenhir\nmazeppa\npasado\npulpits\ncounterrevolutionary\ncofer\nvili\nlasagna\nctd\noppressor\npuny\nymer\nambergris\nbelinsky\nfrutti\nnabonidus\ntybee\nmatriarchy\ngyanendra\nkents\nbasia\nlengyel\ntethers\nominously\nmeriam\nkpk\nliberación\nshifu\nwenden\nsabarmati\nmaira\ngoans\nmrap\njanša\nharbottle\natac\ncloquet\ntomkinson\nsamal\nhanauer\nwahpeton\nkomori\nratti\nkays\nintraspecific\nverwoerd\nidolizes\nsymbolics\nsubpopulation\nsettimana\npalomares\nscapegoats\nbiloba\nkpn\nrarefied\ncombate\nadamstown\nnuptials\nhanazono\nagana\nmeiner\naltach\ngildemeister\nreductionist\ntaieri\nmarshawn\nlycanthropy\nerman\nentreaties\nhummus\nkourtney\ntaipans\ngimpel\nlindi\nmonolayer\nneutered\nsenso\nrikard\ntuguegarao\naacs\ngaskets\nclassless\nmotorcoach\ntardieu\ndysfunctions\nacademica\nyeng\ncanyonlands\nwagered\nadjudicating\nbarun\nbitsy\ntrippers\napocalyptica\niwanami\ntribalism\nnickson\nleskov\nmcclary\nprovocatively\narasu\nchauvinistic\norndorff\nhollweg\nobbligato\nseagrave\ncaravel\ndhimmi\niiic\nderailments\nsympathiser\nlifelines\nhomonyms\ndadaist\nsherard\nqari\nsry\nkufuor\ndiggins\nsalif\nfantasyland\ngymnosperms\nstupidest\npuzo\nsapient\nacqui\nchisato\ncasters\neverlast\nsumatera\nseasonality\npettiford\nkeppler\ngertrudis\noxenford\ncardington\npollstar\ncontoured\nsteidl\nnibs\nske
wer\nvannevar\ndrumbeat\nthecla\nevernden\nbronko\nasom\nmeridional\nkirkyard\ngowanus\ndagny\neby\nstaro\ndeforming\ncowbird\nbuckhurst\nliken\ninspiron\npendulums\nravn\nvirile\neady\ntransilvania\nmusette\nannett\nlhb\nmoench\nchell\ncharenton\nzainuddin\nschütte\nigc\nronkonkoma\nkiku\nduquette\nfrm\ndaladier\nbernabé\npetticoats\nkahneman\nlongridge\njubilate\nmultiprocessing\ndepositor\ntribesman\ncavil\nbfm\nbettelheim\ncharleson\ncenteno\nrears\ntrackless\ndefintion\nhehir\nsolstices\nsuhail\ndoges\nroundhead\ncorrode\nmetropol\npleats\nflorenz\nphila\nfrisby\naitor\ntanzimat\nsini\nstrollers\nmoonwalk\nshivraj\nkarns\ngosden\nbandeirante\ntulisa\nregi\nbida\ntamid\nurmas\ndurell\nrienzi\nsols\nturnitin\nelation\nmontfaucon\nsuperpowered\ncontusion\nwallow\nlysergic\npanatta\nmatan\nwampum\nffr\ngabo\nabebe\neutaw\nfigura\nsunfire\nyoshiharu\ntabuk\nrichler\nlundell\nfluoxetine\ncrowninshield\neboli\nholness\nmisbah\nstas\ncaecilian\nkhal\nvaudevillian\nkro\npornstar\ncaltanissetta\nmodernistic\npanta\nhigo\nshellfire\nanarkali\nnunchaku\npacifying\nbouffes\nsokolniki\nombudsmen\nfatherless\nleipheimer\nmiscellanies\nhanga\nqasem\nscharff\nembo\ngately\nnajdorf\nkorneev\ngiron\nkerley\nmarquard\ndisengaging\nmahalo\nshorrock\nnikoloz\nclimatological\nferlinghetti\ncroaker\nreverent\ncarpool\nuvm\nunami\nsisulu\nrecreationally\nbackstop\nkube\npsychomotor\ntongariro\nhawksworth\nkaradeniz\nparashar\nneches\nsalette\nferrigno\nbridegrooms\nfugard\nforeleg\nwahhabism\nsolidarité\nkingly\ndenture\ndissidence\nredoing\ncorba\nbungled\npaquita\nainsi\nnapp\nsenora\nfrivolity\nikhwan\nnadira\npolyphenols\nruscha\ntakraw\nsalander\nturris\ncynics\nwalia\neuromoney\nchipewyan\nkhushab\nraka\nfaker\nbessa\nintegrally\nkittiwake\ncapercaillie\nentendres\ndesplat\nphiladelphus\nlindfors\nwctu\nbrownson\nsmithies\nravenel\npikas\nradikal\ngarageband\nentrada\npogba\npoitevin\nheartthrob\npentangle\nmarni\npora\njiujiang\nblg\nemusic\nbradycardia\njano\nsnowballs\nmounsey
\nqaem\nferdi\nmaned\nkansan\nflasher\ntruancy\ngizzard\nguiness\nneutralised\nshiller\nrosell\npredispose\nmarunouchi\nmicrocephaly\nbarkat\neclair\nrozhdestvensky\nbreeden\nbetfair\nvarnished\nvisigoth\nbrouhaha\nssangyong\npadam\ntrowel\nembossing\nchisels\nclune\nsnickers\nkimba\nmarwood\nphare\npatellar\nspanier\ndayr\ninferential\nhennings\nhimiko\nquade\nbahari\nhardison\nunsolvable\npetrushka\nmapinduzi\nnationalize\nbocanegra\nunderstrength\nlovinescu\nmattioli\nhibberd\nbarada\ncurci\nmanichaeism\nalimentary\nsajan\ntyrese\ninterweaving\nxul\ndetrick\nlancets\ngaitskell\nirascible\nchinua\npnn\nanabasis\nhanka\nfrugality\nmonsey\ndarton\nlouvered\nsachar\nlambrecht\nmouthpieces\nbloodgood\nkhirbat\nabela\nmoishe\nlumb\ntetsuro\nmessmer\ncobwebs\nrzeczpospolita\nspagna\ncathodes\nmoyà\nathar\nbookshelves\nsnuffy\ngrauer\nmusikhochschule\ntimp\nbeckerman\nlexicographical\nhandspring\nvirunga\ncharnley\noceanographers\ntewa\nvisscher\npaden\nnxp\nsatria\nvilain\nmälaren\nranchero\nwolman\nabracadabra\nschill\nchangping\nriccardi\nwestjet\ngtl\nassia\nblanford\nesha\nwirksworth\ncontextualized\nfostoria\nbanville\nkangana\nanaesthetics\nfairbrother\njazze\nvandellas\nzorzi\nriyad\nmuhamad\nyermak\npapoose\nayew\nelana\negovernment\nsistani\nothon\nvermicelli\ndemerit\nnodaway\nreinterpret\nniggaz\nmedeski\nbegrudgingly\nevicting\npelz\noutwood\nliquidating\narnell\napatite\nclumping\nreflexively\nunrevealed\nkub\nhardouin\nlule\nhistrionic\nzy\ndabul\ngovernador\nsarpy\nacquit\nbruny\nhorlick\ncroup\nossi\nwordnet\nlazily\nimslp\nperforma\nnonna\nobliterating\ncondensates\nvertu\nswitchable\ncrackling\nautoimmunity\nphlogiston\ntuamotu\nhippel\nrecombined\numag\ncdot\ntressel\ngleaning\nmoorfields\nenglert\nloincloth\ntélécom\nrws\nballoch\nraabe\nfyrom\nsyngman\nreines\ndnevnik\nwavre\nculshaw\nlumezzane\nepf\nmetformin\ndebonair\nconverses\nnkosi\nmanion\nloewenstein\nuntraceable\nkouros\nlansford\nrazr\nomnisports\nrothchild\nborchert\notar\nquickfire\nvsi\
nbeutler\nkaposi\nbanas\ntoshiya\ntimea\nfranchot\nchaleur\nimpulsivity\nbarbossa\ndecompositions\nliem\nantrobus\nsubhuman\narley\nredback\nrumney\ndongguk\noverreacted\nkammerer\nnovalis\ngolder\nleadbetter\nhanratty\ntetroxide\nablest\nseedless\nburkitt\ngeiss\nmelin\ncripples\nmontevallo\ntrudel\nwhew\nhawksley\nschliemann\ndisown\nbuenas\ngompa\nalava\nshekhawat\nsarva\nrelavent\ntatu\nmariella\nlemonheads\npeltz\noutdone\ngjergj\nmelichar\naphis\npone\ntokiwa\nclasping\nwoollahra\nfullerenes\nchaliapin\nretrenchment\ngimhae\ndellinger\nnanping\nsilico\nfaysal\nphuong\nricciardi\nbamenda\nmikes\nalbufeira\ngermani\nresupplied\nmegalosaurus\nphenomenally\ncladistics\ncrosthwaite\ndisaffiliated\nsuid\ntorin\ngalang\nstrouse\ngranulation\nielts\nflattens\nlawrance\nnicoletta\nlaleh\nbonis\nmotueka\nnwp\ndila\ntevis\nmerrett\nmaximilians\ngour\nscop\ntidende\nchamberlayne\nrockfield\nkaylee\ntanizaki\ntweedsmuir\nbaldo\nndt\ntaurine\nseminarian\nfioravanti\nombra\nklatt\nhungate\nkups\nferus\nyoussou\neidetic\nprearranged\nwachter\nbeneficent\nyoul\naxemen\ntrempealeau\nisas\njeremih\nrwc\nkbl\nventus\nkormoran\nlatinate\nactium\njacked\nmarichal\nnexon\nquitte\nwrenches\nagustina\nfaecal\nmannheimer\ndipietro\nreelin\nimbue\nmcgreevey\ntaksin\nsoekarno\nsoftwares\nrohtas\noficina\nmusburger\nhorden\nvlaicu\ngalicians\nfinepix\ncamembert\nhanlin\ncésaire\nkartika\nguion\nstigwood\nbelas\nhatten\ncronos\nexcommunicate\nnicklin\nkristjan\nncsu\nebsen\nyesh\ndati\ncesc\nedificio\npremolar\nlampooning\ninexorable\nborris\nroaches\nmallah\njorden\npolygamist\nposthumus\nfeminization\nboudreaux\nmcroberts\nlulworth\nooi\nsnooty\nenshrine\nkiera\nsnippy\nbrazing\ntransgendered\nsaget\nojsc\naermacchi\nfuming\npreformed\nkameyama\nsturge\nbarnette\njauch\ndamocles\nbhl\nnonpoint\nmassages\nyaki\njujube\nwhiptail\nmartlets\ntpr\nbdu\njaci\nfann\nschützen\nstroman\nmailly\neskridge\nsachi\nramping\nblunkett\npeachey\nbriefest\nstelling\ntoque\nplaut\njato\njantar\nwafd\ncapr
iles\ngrifter\nbondo\nperaza\nibsa\nnandigram\nspeedup\nleros\nnewey\nsaidi\nfamilar\ndemarcate\npetacchi\nsifted\nthanasis\ninterpolating\ncranbury\nmelnikova\nsoubise\nhirofumi\nkilts\nrainsford\nkeifer\nwalney\nscreamers\nharlin\nroshni\ninsectoid\nfavelas\nstubb\nfairways\nswaffham\nbehl\norlandi\naleksandrs\nparses\npterodactyl\nanjum\nfdf\nmastin\nstevo\npetőfi\nblueshirts\ngds\nhev\npocketing\nmultiplatform\nkollywood\nsterilize\nfenerbahce\nnetta\nrackard\nmercadante\nhochschild\ntestaverde\nsexing\nfroom\npenry\nscapes\ngipps\nbrines\nweatherhead\nlugard\nhilldale\npkm\nspeicher\nmacisaac\nbreakdancing\nbinda\nheadwear\ncristero\nduchenne\nmandali\ncaffe\nathanase\nhumfrey\nnipa\nscrubby\ntaif\nmilman\nalbumen\nisparta\nngong\nbarnardo\npiron\nmorini\nkhanom\nkozo\noveractive\nomarion\nbarbey\nleeroy\ninui\ntriremes\nlibertyville\ndepressant\nballplayer\nrannoch\nyeboah\nferny\nexcreta\nkosugi\nfujio\nentrench\ndeniliquin\nandamanese\nmasaccio\nstowers\nkorine\ncpj\nmeaghan\nfede\nedney\ncolonnaded\nqatif\nnewlin\noverclocking\nfundació\nmerri\nmesilla\ncosted\nobviousness\nrumpus\nadvocaat\naccrual\nabrahamson\nkaal\nklahn\nluana\nlitigious\nyuncheng\npergolesi\nbryans\nfaxon\nlwb\nweaved\nbagua\npimping\nmdx\nheliotrope\nkarnad\ntego\nhistorico\nbordes\nvehemence\nyagan\nmuggleton\nsparkasse\nmacleans\nflockhart\ndurnford\nmossel\nmcmurry\nsilverio\ngoldstar\nfatso\npantyhose\npearling\nsameness\nrefocusing\npotito\nurinals\njebediah\nnomo\nshachtman\nrequena\nuniondale\nrhamnus\ntouts\nlatchford\nfleurus\nvasan\ntricolore\ndibba\nextroverted\nchincha\nwnyw\ngtb\nunraveled\ncamerons\ncolonias\nrci\narimathea\nsamburu\nhottie\nhitches\ndarlan\nprg\ndunkel\nsanpete\nwickenburg\nmacfarland\nklay\neulogio\npiglets\nwigginton\nnason\nadentro\nsandgren\nliselotte\nrainbands\nheartburn\nswerved\nfogelberg\naureliano\nmotorboats\nneoproterozoic\nwarbling\nshawshank\nsaviano\nabrogate\nparikh\npallo\nvakil\nhuskisson\ncaillou\njemaine\nmushy\nmolk\nlifton\ndittric
h\ncoonan\nmalbec\nrenamo\nalesia\npentagons\nguadarrama\ncuracao\nlaptev\nlbt\nekkehard\nstukeley\ncommunicat\nliberates\npirs\nverón\naristarchus\nmathes\ngiggling\ngascogne\nherbig\nelephas\noverage\nowego\nsreten\nproteome\nleat\nhennigan\npradip\nsere\nanastas\nhedgerow\ndoers\nnicolls\nmillner\ntyers\nlusitano\nalphas\ngaskill\ntelefon\npenduline\nslicks\nsuiting\nallez\noldie\nnoces\ngreenbacks\nmilward\ndoritos\nphysiol\npizzi\niqs\nonder\nbronzy\ntawang\nreassign\njrc\naddiscombe\namortized\nkodo\nlamin\nariadna\ngipson\nfeldberg\naseptic\nraut\ndeepavali\nwitley\nlostock\nroxboro\nbantams\nvampyre\nshera\nsteine\ndnepropetrovsk\nfreshers\nempson\nmeloy\nkokusai\nmsas\nrosebuds\nvass\nmumba\nportugese\npettitte\nvvt\ndouche\nunaffordable\ndarabont\nswaroop\ncolonsay\nbuemi\nhankins\ncoitus\nkof\nhillery\nturrell\napprised\nrahu\nowatonna\nreliving\nguillard\ngwin\ncuc\ntingley\narutz\noakfield\nexploitable\nbove\ninelegant\ndiba\nbomberg\nnagase\nbergey\ndookie\ntented\nhaystacks\nmahapatra\nboleros\nkostova\nharrassed\nwhitehawk\nimplanting\nmastoid\ninternationalized\nholopainen\nlemont\nkno\nathina\naminoacyl\nmarcano\nbluebells\nginóbili\nvolturno\nbentworth\ndivadlo\nhieratic\ndefenceless\nhasanov\nbarbero\nsakes\ncuddle\nlamjung\nrishta\nbaze\nyimou\nfoxwoods\nmki\nborscht\ncableway\nhomeostatic\ntantalizing\nblacklight\nsmm\ncrudup\nlobdell\nmultisport\nneveu\ncovell\nparacel\nleached\nrollicking\npostgate\nradstock\nfrandsen\ngordini\narkadiusz\nhudsons\nalibis\nhyon\npeluso\nhansberry\noingo\nsamak\nstealers\nprécis\nmoçambique\nsuperchargers\nfootbridges\ncose\nbuendia\noleo\npigalle\nnaniwa\npled\nmainspring\nniekro\nrancor\ntribble\nelectioneering\nbures\npohjola\nmeknès\nbreadalbane\nhanse\nvirility\nhootenanny\nrebutting\ndeferens\nrubia\nperpetuum\nprying\nuhlmann\nwarmblood\ntof\nfpm\nmollet\nceylan\nberens\nginastera\nmusikverein\njovanovich\nbanya\ncuse\nsigi\njagjit\noleic\napulian\nissara\nmullaitivu\nproserpina\nmonnow\ntutuila\ntrabant
\noxalic\ncuboid\nrema\nmoba\nasselin\nvintner\nchernov\nwinky\nbowhead\nkowsar\nmoneypenny\nskyteam\nelysées\ncisplatin\nmanchin\nculross\nsapien\nplanus\nnasrullah\nlinkletter\nfebrile\nscornful\nepik\nfalkenburg\nkrisztina\nibar\nmayawati\npahor\nteide\ndemining\nbaserunner\nruge\ngath\nbogo\ndeliciousness\nburkes\nfilomena\nakhdar\ncoolmore\nania\ndualshock\nworthen\ncastanets\npilibhit\nzabul\nigi\ngaleano\nhelmi\nhonfleur\nexorcisms\nspazio\nsolicitations\nscu\nnorland\nneuroses\ndredger\nquadriga\nswakopmund\nbta\nbhawani\nmatrimonio\nmanzikert\ncosponsored\ngiacchino\nhalse\ncnh\nroyally\nbuzzed\nmorozova\nakha\nquaife\nschukin\ntolle\nextractions\nblackfish\nlapels\ncrossan\nstans\nphon\nonu\nlollipops\nsikri\ndanilov\nanglicization\nanamika\njayalalitha\nmiddlefield\nchandrashekhar\nreticence\ndivinorum\neylau\nparrotfish\nasche\nsabang\nkaskade\nlefranc\nhuit\ndmm\nawi\nnoncontroversial\ncolibri\nwaldstein\nthongs\nrhinoplasty\nvinciguerra\nwoodfield\nfibronectin\nrehashed\nshenouda\nasal\nfreiburger\nmagness\ngsg\nperini\njtag\nalleyways\nbrunell\nkoke\neuphemistic\nstreetlights\ntirupathi\nontogeny\nludicrously\nchinmoy\nlathlain\nemoticon\ncamarena\ncorfield\ngrandniece\nsacheverell\nlakoff\nelul\ntranscranial\nterwilliger\ndoofenshmirtz\nnasarawa\neskil\nmatisyahu\nsavonlinna\ncronyn\nheadspace\nimmunologists\nhootie\nnonchalant\ngouged\nlehtonen\nroble\nspoor\nbobb\nheskey\nirak\nrecitalist\nqinhuangdao\njanaka\nschwedt\nsordi\nkamba\npulkovo\ngopalpur\nmusters\nhuangshan\ndodwell\nadaptors\npompton\nlachmann\nmushi\neventuated\ngamow\nletterer\ninterstices\neosinophils\ngeoid\nweale\nminkus\ngirlz\nchanelle\npitchforks\nraimond\nallens\neckford\ngodber\nchongming\nezer\nschecter\nmuhamed\nburrill\nducharme\nvliegen\ncheraw\nbordighera\nrubino\nblasi\nwhippet\nlakeman\nbonino\ningels\natque\nplantin\ncanti\nmicroglia\nmisanthropic\nkazakov\nmilon\ngiteau\nbhaduri\nhomeboy\nmedius\ntarring\nnaveed\ntift\nslammer\nemanuelle\nleeuwenhoek\ncardholder\ngm
b\ncoppens\ntsereteli\nmahood\napn\nlivescience\ngoblets\nplutonic\nalleghenies\nnazarian\nlevitating\ncomplementation\nbabai\ncirculars\npotier\nbrandreth\nburnand\nkurama\nrathfarnham\nvedra\nsamothrace\nmagomed\nbarne\norbiters\nmaneater\ndeathless\nszombathelyi\ncottam\nsuci\nsinton\ncasca\nlegitimation\npseudoephedrine\ncarboni\nungern\nmontalbano\nbeller\nbaysox\neucalypts\npaulownia\nmedfield\njitterbug\nconjunctiva\nunwinding\ngrosmont\niger\nwatsons\nlitigator\negotism\nrjr\nvsa\nneigh\nkvapil\nrodes\nbislett\nlinie\niterating\nlimewire\nlostprophets\nforshaw\naridity\npappenheim\nmbti\ngloriosa\nlobkowicz\nroskosmos\nhallucinogen\npopjustice\nlaverty\npesch\nletov\nbaccara\nhaugh\nvilleurbanne\nplasmon\nconboy\ngallina\ncascaded\nfertig\ngarfinkel\ncandied\ninternist\njamz\nstellan\nsnowboards\ndooku\ngeon\nsubah\ntillakaratne\nhien\nyallop\nlambertville\nlappin\nporro\ntibbetts\nteed\nnegron\navante\nmaspero\nobrist\nmitani\nsamen\nfyn\ntapirs\ndarkling\norlova\nbete\nterma\narnstein\nnortherner\njohannesen\nservings\npegler\nild\naaya\nanodized\nmortician\nspck\nwsp\nrigney\nkwaku\nsircar\nlongstaff\npcg\ncolebrooke\nsnobby\nlidocaine\nbecquerel\nlexicographic\nfoxtail\nekström\nsaugerties\ndyskinesia\noligarchic\ngórecki\npopery\nktvu\nagadez\nfuerzas\nerath\nweinman\nputera\nquero\nbiologics\ntrillian\nnansemond\ncorne\ntobermory\ncrockford\nossipee\nsharron\nflaunting\nthinktank\nphlegm\nminiaturization\nweserstadion\njoonas\nverba\nmolehill\nantiope\ntánaiste\nsdhc\nilr\nkero\nnyasa\nprioritised\ngarlick\nremotest\ngjon\nmarriageable\nrumyantsev\ncodewords\nglaswegian\nperna\nsibilants\ntsavo\nyaddo\ndolgellau\nrabih\nborj\ncaciques\nbarranco\njaishankar\nservetus\nlysistrata\nmalfoy\nexpressen\nendgames\nsamra\npigtails\nlongerons\nkallen\nmalaika\ncema\nbakht\nanik\nrockett\norderings\nscullion\nwinking\nsiân\nfishin\ninterlink\nnitrox\nspongiform\ntriskelion\nalresford\nquinoa\nmothe\nreligionists\ncroissant\nwarrens\ntetraploid\nredcoats\ncrawsha
y\ninductions\nbise\nclang\nzevi\ntoasts\nlianne\nlaserjet\nnutritionists\nbaddeck\nlegitimise\nductus\nminka\nvaricose\nfabiani\ntekin\ncodas\nhellmann\nimpactor\ndouches\nfreckled\nsoichiro\nrosenberger\nsekar\nboettger\nheptathlete\nsfp\nhallock\nmosely\nperilously\npincushion\nrecency\nprincip\nnishino\nreevaluated\nmiserly\nbankroll\nhierarchs\natef\ngewehr\nschober\nharmonise\nemes\nkrispy\nafterburning\nschulich\nhartzell\nmums\nfestering\npunctuate\nburgeon\nmikal\nicebreaking\npacifier\ncentra\nmixmag\ngrishin\nislamiyah\nfroman\nkamiński\nbottlers\noverburdened\npalomo\nlvo\nsynthesise\nchromed\nsaina\nminera\npenmanship\nmarkland\nprotoplanetary\ncomorbid\nyastrzemski\nbawang\nsheiks\nrale\nhardenne\nglues\ncammy\ngrethe\nlibation\ninterchangeability\nlunettes\nkovac\nwillemse\nlpfp\npendolino\nrajpal\ngondal\nlifeforce\nmenos\nreroute\nsmethurst\nheuss\nvotre\nfreakshow\nviso\nrutherfurd\nmessaged\nsarsaparilla\ncatahoula\ncaz\nmeanie\nampat\nmachan\nripens\nnakasone\nzhoushan\nheu\nlibeskind\nfengshen\npicos\nmasint\nboubacar\nerat\nkeong\nsuperintendence\naffirmatively\nharth\ntoiletries\nkedleston\nlyford\nwarfighting\ncallanan\ncida\nlothringen\nenterbrain\nyaris\nbinford\nredtube\ntronic\nstowmarket\npucca\nmiyajima\nrossby\npfm\ntartaglia\nruo\ntoulousain\nsipa\npoconos\nsalla\nhydrants\ncounteracted\nmutch\nmultiflora\nclutterbuck\nappetizers\nendear\nnonparametric\nhenriksson\nibrd\nstymie\nbezerra\ngeb\noverstepping\nflatwoods\nashwood\nfassi\nkirat\nkuli\nfunès\nshrove\ncrony\ntranscoding\nvomited\ntujia\nstanden\nlente\nauvers\nfoles\nfreenet\nsigue\nappraisers\nmohiuddin\ngeen\nreuses\nveneers\nshugden\nbundesrepublik\ntahoma\nrukia\nbrillant\nspambot\nndu\nlifeblood\nguaraldi\nsociability\naccredit\ngortat\nflinch\nluss\ntbr\nnakao\nheadon\neveleigh\nwaseem\ncarillo\nsallis\nrecollected\nbatistuta\nlouvin\njmc\nvanunu\nberga\nhammons\nlettermen\ninsolence\npervak\nlocational\ngowran\nkaleb\nborst\nmccaul\nolpc\nobwalden\npraed\nwoolfolk\ncos
entino\nranade\narchdale\nsketchbooks\ntranslocated\nmonoculture\nhisao\nuft\nmused\nkoski\nmido\nloukas\nshuvalov\nkoja\nrolly\nwesteros\ngriffioen\nritwik\ndevolving\nthompkins\nanjar\nparente\ntinta\nspk\ndevastates\nacworth\ncastelló\nsaltcoats\nsvyatoslav\nwellsboro\nbejarano\nyngve\nunplug\nluu\nkaha\nkunta\nadlon\njannat\nmavs\ncolinas\nkinji\nhewitts\ncrockery\nbootloader\ncomforter\ngrimme\npalpita\nhorney\nshekels\ntranspositions\nvizzini\nimpermanence\nmédard\nyoshimasa\npangs\ndunks\ngoonies\ntuol\ntagawa\nabcnews\nmodlin\nlaurenti\nvnukovo\nsinti\natsumi\ndilley\ngreenways\nstradlin\nvinge\ntalabani\namateurism\nmeatwad\nfazekas\nhoby\nborgias\nshinnecock\nflatfish\ntuas\nastore\nshorey\nflowchart\nkhyentse\npondweed\nwimpole\ndieterle\njambo\ndebenham\nwhitlow\naymar\npuneet\ntenis\ndrumhead\nbouygues\nelford\nharrap\nuhura\nunabomber\nvermes\nopencl\nkalayaan\nhoodwinked\nsignposts\nboroughbridge\nraba\nmigliore\nironpigs\nopticians\nospedale\nashrae\ncorrida\nmakah\nmasc\nfeted\nvallier\nmcwilliam\npaktia\ncopulatory\nsocialise\nkhu\nzvonko\nremscheid\nbotox\ntwiztid\nlocalizes\nfuer\nantineoplastic\nmolting\nuncool\nvaleriano\ncorporatist\nmangione\ngrabby\nanselmi\nswiping\narditi\nparamedical\nmarcelinho\nglossing\npearland\nsouthlake\nmononoke\nsongstress\nobrador\nmiccosukee\nwidjaja\nliebherr\nsunnybrook\ndorham\nseel\narrernte\ninviolable\ncalwell\njiaxing\nfaubus\nshap\ntelic\nhigbee\nparroquia\nsitus\nriken\nrecursos\nmonserrate\nstarlin\naila\nstapledon\nmogador\nsnags\nblessington\nandrejs\npé\nburgo\nbullfight\nfurze\nevren\noptik\nhesitantly\nwarrego\nbekaa\nserifs\ncuccinelli\ngrimly\npieterse\nsholay\nrubaiyat\ntimbering\nlonglist\njassy\nmatia\nscrawled\npreproduction\nblithely\npapadakis\ngorki\nscram\ntaggers\nfirebug\nadora\nscavo\naxially\nracialist\ncundy\nlovro\ncarlaw\naleksi\ntormentor\nletelier\nrumania\nnkb\nminghella\nnowotny\nniantic\nmeusburger\nkananaskis\nvoelker\nmaltose\nrencontre\nneuroblastoma\nhissar\nhelpfulness\n
mattison\nbarthélémy\nbenard\ncreases\ndisillusion\nlaburnum\nebv\norenstein\nnatanz\ntopspin\nsadra\ncroxton\neschweiler\nbassoonist\nserialism\nstomped\ncante\nmlcs\npellerin\nstockard\ncaret\nringwald\njowell\narete\nrecco\nsandeman\nohg\nlauridsen\nairside\nanimalistic\nghadar\nharlington\nboga\nweeklong\nbowstring\nholtby\nsinise\ndobb\nusada\ncaroll\nrael\nbellboy\nagitprop\nweifang\nkrater\ndiscourteous\nchillán\nnokomis\nyasna\nribonuclease\nmahdia\nshowbusiness\nkatey\nfrenchie\nmilenio\nfluffernutter\npictet\nfrederiksborg\niolo\npsychokinesis\nhanina\nvdl\nrantzau\nferghana\npriština\nvarnum\nresonable\ncortisone\nkember\nwebcasts\nbellay\nleonesa\nvitra\nunwrapped\ntafel\natala\nconvicting\narpaio\nesculenta\nzeynep\npolicyholders\ncomings\nzdenek\ncarducci\nmoncrief\nmudslinging\nhadlow\ncemil\nesiason\ngorrie\nwreaks\nconceptualised\nneven\ncalixto\nreiteration\nwildhearts\ntenterden\ntitu\ngreenhorn\nmostel\nzamudio\nturonian\ntno\nahc\nstephanos\nnacer\nflyable\ncastillon\nkamer\namphitheatres\ndeleo\nsholto\neulenspiegel\nmouthing\ngrüner\naquae\nrosco\ngaja\nthins\nottilie\nmdd\nreflexivity\npeacocke\nlamarque\nfrosting\nullswater\ndisinfectants\nector\ndendrochronology\napollonian\nmonoecious\nfidalgo\nlakeport\nrhetoricians\navtomobilist\ndeconsecrated\neveryones\ncepa\navailible\ndimetrodon\npoza\nmercers\ndixons\nbrenden\ncopperheads\nzeca\nairlink\nchaika\nspielmann\nkyriakos\nmottoes\ncrerar\ntirupur\nversfeld\nherpetologists\nstormtrooper\nhenriquez\nbbd\nalexandrescu\nraluca\nkumaratunga\nunproblematic\ntechnet\nstuckist\neggen\nyemenis\nstrudwick\nlindfield\nmarkgraf\nweirder\nnagaraj\ndethrone\nhanh\nmoncur\nfausta\nbloomers\nqutub\nverrier\ncollor\ngreeves\nelem\naryabhata\nguybrush\nlocalizing\nlaufer\ntickled\nlarder\nqamishli\nsartain\nmortes\npanky\nthornburg\nevgeniy\nregin\nsabbatini\nwds\naristotelis\npoulson\ninconveniences\nmetu\ntellingly\nwank\njuergen\nsider\nmarielle\nryun\nnesterov\nprinzessin\nberlitz\ncasso\nkakutani\nros
sy\nfelicitated\nfoel\nwellens\ndornan\nbrasov\naftra\ntugela\nallemande\npectin\nslickers\nintrudes\nclif\nabbate\ngentili\nstillbirth\ncaccini\nnof\nbleasdale\nhuehuetenango\ncalverley\nbeheld\nsvaneti\ngillon\npulsations\nmastodons\nraval\njeannot\ngfk\ntucano\nraila\nmudvayne\nchisago\nbraehead\njoensen\nrahsaan\nfazer\ncomicon\ngagik\ncomodo\nhdt\nsubclavian\nboursin\nestabrook\ncharnel\nbackscatter\ndinucleotide\npdvsa\npoolside\nllantrisant\ndpc\nbondarchuk\norest\nprievidza\nibañez\nincirlik\nchiclayo\nbuttonquail\nmaun\nantimalarial\nlavell\nuntruths\nsoja\nronen\nlith\nkrapp\nbnet\nspielman\ntishomingo\nphou\novershadowing\necoboost\nirredeemably\nofficialdom\ngynecologists\nozon\nembargoes\ndwp\nnagra\nlamo\nmusar\nlaughably\nnags\nyegorov\ndetuned\ndessalines\ndumbfounded\neneco\ntantum\nfruitvale\nrecut\ndiviner\nunaccountable\narnaiz\nigorevich\npanchali\ncampa\nrazon\nfard\nsanctis\nbotch\njaggard\nrady\nharefield\njoggers\nlaskaris\npopsicle\nsokolsky\nsavini\nfinalise\nbarbiturate\ncobbe\npalanga\nfrons\nwysocki\nmorrowind\ntimlin\nroyan\nmadhavrao\nniceties\nlaub\nfedotov\nasymptote\ndebney\nfredi\nstijn\nloureiro\nbrujo\nmorna\nalcor\npanadura\nresoundingly\nveggietales\nvlatko\nlychee\nrushdi\nmayur\nritesh\nkarri\ngazpacho\nvajiravudh\ntijd\nkuda\nashta\nservilia\nkcet\nfpp\nhyperglycemia\namra\nsuey\npancasila\nadjoin\nrowlandson\nhodes\niskenderun\nrewilding\nsfg\nblumenau\nepoque\nftd\npragmatically\nwetted\nkelvins\nshhh\nintercalary\nabin\nabdomens\nsarria\nkach\npetronilla\ncarrère\ntorg\nwytheville\nsoccernet\nstromness\nsadomasochism\nremasters\nmissourians\nstefansson\nrdm\nwashakie\ndhanmondi\ndecisis\ngoverno\nkrak\nprynne\nbexleyheath\ngymnasia\ngouraud\ntroost\nmicheaux\nbridleway\nbenyon\ninexpensively\nmedicus\nbwi\nbatz\nifan\nmaric\nwinterland\nrecognizably\ncaernarvon\nnerc\nthiemo\nreexamination\nchurchwardens\nhiran\nresettling\nbrightening\nrashleigh\njeremie\nbeilstein\ngriesbach\nlandman\nsamaná\nmarmorata\nwavertree\nlaur
inaitis\nlongyan\nchangin\nalzira\nbakhramov\nhogar\npersonable\nmccowan\nhackworth\npepita\nhusum\nnatureserve\nrnd\ndonnas\nbuhler\nfasces\nhastening\npechersk\ngoldwasser\nmarcellino\noligomers\nhammonton\nfoxcroft\ngesu\nkosovska\nwuc\nvalledupar\nhuhne\ninsurrectionary\nguyane\nheche\neum\ncarotenoid\nliya\ntetouan\nanf\nuren\nmalamud\nchiese\ndrome\nlilting\ndursley\nmedgar\ndemoting\ndemoiselles\nbaqi\njencks\nmaddening\nlongleat\ntarahumara\nhartenstein\nardsley\ndedicatee\nbuckaroos\nnien\nrepublicana\nheadedness\nfrédérique\nsisco\nbofill\nheterodyne\ncasamance\nrabari\nunravelling\nfertilizing\nbiharis\ngobel\nsatirically\nashrams\nzwick\nsadowski\ntommasi\nsprengel\nmenomonee\nmarshmallows\nsheil\nosteen\ntruthiness\nkarasu\ndispassionately\npsycholinguistics\nburo\nstaterooms\nprefatory\navakian\nbaddest\nperpetrate\nsias\nfearne\ngluteal\nproclivity\nhecho\ndoone\ngurr\ncowart\nsawhney\nlunas\ncombermere\npunctuality\nmoras\nmuttalib\nzwicky\nwombwell\nakamai\ndrood\narchitectonic\noesophagus\nahmanson\nmidvale\nretitling\nzerbe\nuki\nphilbrick\ndahlonega\ngujral\nsorell\ndetracting\nmachiko\nbrach\nawu\nneira\nimpound\nopd\nwellstone\ninternalize\nparched\ndobbyn\neupatorium\nheintz\ncontinente\nkidston\nfestina\nbalogun\nferrata\nilaria\nrednecks\nflagships\nphilandering\nmonona\ncardoza\nwellfleet\nbjelica\nbaillieu\nestas\nvalores\ncosford\nlaviolette\nklima\ndelp\nyolks\ninverts\ndinny\nundeserving\nmelodically\nhasso\nsesia\nwessely\nsmitha\ntrelawney\ndavidian\nxilinx\naycock\nmwa\nyupanqui\niden\nganna\nboggles\nmalala\nande\nalfio\nfage\nmundine\nexcretory\nfactfinder\nkamei\naleichem\nachill\nhudspeth\nvikrant\nlaxity\ndelgadillo\njacobsson\nrtgs\nshergill\nmccalla\npaume\nconker\nswoboda\nbrs\nepitope\nshiawassee\naccredits\nurologist\nfoldable\nhorsetail\ngenna\npollan\nwadis\nbipin\naudran\nderricks\ntakami\nkhj\nwinnowing\nellin\nfaruqi\nstartlingly\nmoslems\nunforced\npsychometrics\nadq\nbielski\nploughman\ncampobello\noikawa\ngunfighter
s\nmowatt\nauburndale\nbozizé\nkilmaurs\nglaciology\ndilorenzo\ncoppermine\nacceding\nruiter\njacquelyn\ntunisie\ninflaming\nhagiographies\nlivestream\ncaseload\nxcode\naniello\nkeratosis\nrhomboid\nlinge\nsadhus\nmarrufo\naesa\nfellas\ngraveside\ncaul\ncontrivance\nbackhoe\napodaca\nzelman\nrossel\nvillaverde\ncrato\nkanab\nholdfast\nwrightwood\nchev\nhumain\nknittel\ngrinspoon\npfk\nbaram\nsprach\nchk\nsystèmes\nnihilo\nmitochondrion\ninterlacing\ntinned\nrelf\negp\nbiofilms\ntrypanosoma\nforbearance\nclovers\nendometrium\nlyda\nskan\nbraver\nupping\ntransporte\nmultilayered\napollos\nhowse\nbnd\ngede\nmegatons\naurélien\nyanked\npullout\npnb\ngemina\nhabitants\nunconsolidated\nhuebner\nquebecois\ncaudate\nparlament\nfahm\nschwalm\nstann\nkaffir\nthaman\nseafield\nhomero\ndissents\ntalyllyn\nhackathon\novadia\nwillcocks\nengrave\nwahine\nremounted\ngaber\nrossland\nsluts\nidrisi\nsympathizing\nabdellah\nquandt\nchews\nlacour\nrainhill\nbagdasarian\nmaritimus\nredshirting\nromolo\nkurram\nfoxworthy\nshipka\nyoshie\nnegligently\narctos\ncgd\nradice\ncondado\nburrage\nknick\nswooping\neyepatch\npamphili\naspin\nafif\njannie\nbramham\nhiraki\nfatou\nstipa\nimperfecta\ngodrich\nvasilevsky\nclastic\nchd\ncentring\nmanara\nkonaté\nlovey\nflashbulb\nenergizing\nonstad\nprang\njandek\ncharvet\nmantri\nmaskell\noptimizes\nlandgraf\ndisdained\nsohan\ndisorienting\ngreenspace\nleeman\nwyborcza\nbluesman\njaromir\nachard\nschama\narbors\nkarni\ndevaraj\nshahada\nmittens\ncomv\nbattleaxe\nnpv\nkalem\nswatara\nmalou\ndiscontents\nfredrickson\nstepsons\nvisualised\ngluteus\ncedaw\ncoyoacán\nlegarda\ntramline\nelbegdorj\nsappy\nzerubbabel\nlongfield\npetrarchan\nmanliness\ncastors\njingzhou\nreadjustment\nharborne\nwoolpack\nhovhaness\nvedran\ncollegeville\nhiya\nfrings\ngasparini\nfotheringham\nfreude\ndmf\ncrossmaglen\nrainie\nplessey\nshirer\ndummett\nthika\nneisseria\nbattens\nrehearing\ningolf\nzelig\nfds\nbuttle\npyongan\nburrowes\nprin\nchurchville\naleman\nsjs\ntrae\nbattl
egroup\nhumanos\ngoretti\nlussac\ncountervailing\ngaw\nsweeten\nwetherell\nbaronesses\nader\nlenski\ndusit\ntavis\nngf\nschüler\nbalint\nbreau\nfifer\ncatchpole\nbullfrogs\ndisharmony\noptimistically\nshonda\nkaler\nseck\nnelligan\ntenafly\nascott\namble\nsevera\nassignee\nkater\nleyen\ntadd\nsorte\nhoodlums\ncibao\nmacken\nindictable\nimpingement\nhumberstone\nblondin\nsushant\nmcgruder\nasas\nreversibly\npicasa\nmonopoles\ncarme\nenvied\nworldcom\naqeel\nsnooky\nvlasta\nlilja\nafflict\nlipari\ncramming\nberean\nkeay\nserotonergic\npsychopharmacology\nkeim\nmorphos\nngata\nyamakawa\ncopping\ndharan\ntranscultural\nthrottled\nsloughs\neberswalde\nkeightley\npandolfo\nunrestored\nlnk\nroeg\nfragrans\nmassine\nkluivert\nhesitating\nsegev\ndehra\nmusiq\ngilkey\nantihistamine\nhetzel\ntambour\njonsin\npennsauken\nceli\nsolen\ndisrespecting\nheirlooms\nkufra\nlaureus\ncamelford\nfeijenoord\nfredman\nnikah\nvlastimil\nsubmariner\ntoop\ncrackdowns\ntoman\nkipke\nbansko\nkaranth\nfuhrer\nvishisht\naccreted\nservite\nwomans\ngehlen\nwoodvale\ndeists\nedlund\nkhaleda\nvigne\nsirota\nmanjrekar\nlatching\ntomson\nhenniker\nlonghouses\nretinoblastoma\nairfoils\npaychecks\npinelli\nbrewpub\nagroforestry\nbrahmos\namelio\nsolin\nbiga\nbrust\nriled\nsaputo\natrioventricular\nrecedes\nkafue\ntuma\nheidenreich\nkody\nkennerley\nvolans\nhds\ndermatologic\nschembri\ntrousdale\nprobiotics\ntransnet\ntsien\nhamman\nchieftainship\nwebos\nfillon\ntjm\nirrationally\nbalducci\nuniate\nbushings\ncantref\nmandrill\nabida\ncatley\nelata\ncahan\nalcoyano\nhearers\nseaways\nnordby\npeleg\ncelentano\ntjader\nbogert\nharav\ngandia\nmeloni\nsobol\nflaccid\nvda\nlattre\nenum\naivazovsky\nbto\npicardo\nfivethirtyeight\noglio\nolivos\nupregulation\nwindlass\noverreaching\nnarc\ndrummoyne\ndisconnects\nbriarwood\nporthcawl\napj\nkada\ncolumban\ntevita\nkeohane\nkurchatov\nintensifier\ngambles\nrebelo\nattis\nconsensuses\nxiaomi\noverclocked\nsepa\nyarnell\nwilms\nphichit\nasmat\npouilly\nhailstorm\nkast
les\ndtu\nyukie\npupfish\ncarpentras\ndickman\nlynley\nbroadfield\nhurghada\nstableford\nmoustached\nnayan\njovanovic\nmacky\negnatia\nwickman\nentravision\nwillibrord\nlto\nsubramanyam\nnipped\nvinu\nboughs\nverga\nzicheng\npreceeding\ncaramanica\nlemelson\nebo\négalité\ncontrôle\nkstp\nsauli\nrationalised\nknockdowns\nrastriya\ncraxi\ncrouched\nboyds\nnamah\nconnexions\nhyuga\nsteeds\natreyu\nbetwixt\nsantacruz\nlankester\nwritting\nlouden\nclutters\njacknife\ncoraline\nmagomedov\nedzard\nuncheck\ncusa\narchetypical\nmarquet\nictr\nmixe\nchatterbox\nruslana\nnozomu\ncrissy\nopencast\nobd\npretension\nmohana\ndongan\ncoif\naccumulators\nsaxes\nenero\nfutur\nbabblers\nchappel\nparan\nmonorails\nchesnut\nrivett\nlannan\nmanship\nnarasimhan\ninterleaving\nblin\nlongstocking\nnúmero\nfiord\nbrooking\nacrostic\ntawney\nashmole\nherath\nlaffer\nkatsushika\nuku\ncureton\ndanis\nrivalling\nmazo\nverum\nbeechworth\nbekasi\ngallants\nkalamarias\nleeks\nfoden\nportici\nrenfield\nboudewijn\nswellings\nwhithorn\neridanus\npowerlifters\nplaygirl\naxing\ndumond\ntramadol\nczernowitz\nlamongan\nbogdanovic\nholliston\nspofforth\ncfls\ngpcr\nbijoy\nspiraled\ntoral\nmayon\nferreri\ndail\nkilmister\nmemorializing\nbraudel\ngunsight\nflorists\nlilas\ntamiami\namazigh\nhirono\nslpp\nvaricella\nnormalisation\nroutemaster\nsangat\nirmatov\nreminisced\nsuara\ntomoyo\ngazza\ngembloux\nbettany\nschutte\nvirtualbox\nfeliu\nhumint\nmouthwash\nincorporeal\nretried\nofra\nkintore\nrvm\ndolo\nrepulsing\nkotov\nforsey\ntareq\ngnassingbé\nalcon\ndhow\njayewardene\ngasnier\nanodes\ninstrumentally\naspern\nprohibitionist\nwattenberg\nwhelk\nfairuz\nindiscreet\ncalusa\ntaipan\nexcimer\nrunyan\nlasorda\nbuttler\nhdv\nvolya\nscow\nnuristan\ncribbs\ndrubbing\niisc\njarnac\nbirkeland\nybp\nresa\nbriant\nbodh\nwhewell\nlouvres\nkneecap\nkalama\nkodesh\ntras\nwereld\nfloundered\npoorhouse\nmahr\nrainford\nbilodeau\npeterburg\ntotton\npuebloan\nfifield\nauras\nvernadsky\npuncturing\nretraced\nyuxiang\npetrol
ia\nroloff\nncse\nrangefinders\nrêves\nchondrites\nweer\ncalvillo\njuelz\nthompsons\nautomobili\ntamia\nmantles\nviney\ndetections\naltagracia\nkaloyan\nmediterraneo\nroding\nindustrielle\nmhk\ncleantech\nenormity\narnoldi\ncraigieburn\nnaseby\nstrafe\nkuper\nescoffier\nvlt\neeprom\nhuggett\nvocs\ninheritances\nnahuel\ncheckerspot\nantenatal\npeppard\ndraughtsmen\nwakabayashi\nanglaise\nterminalis\ntonle\ndriss\nbaserunners\ningeniously\nhustling\nikf\nrls\ndukat\npsychosexual\ngenio\ngraffin\nngu\nferron\nparola\nunchangeable\nfuhr\njourn\ntypologies\njubilo\naspca\nhomoeopathic\npupin\nunspectacular\ngrunewald\ncolangelo\nkafir\nattell\nneurotoxicity\nsimonov\nbalram\npneuma\nsimonton\nbaptizing\nmeiklejohn\ntarpley\nshakopee\ndamayanti\nottone\nstiefel\npieri\nmarryat\nsantaolalla\nikegami\neul\nsirikit\nchadron\nmaces\ndamar\neiichi\nvaticana\nburrus\nnoisily\nsmolin\ngarretson\nriverboats\nstomper\nrashomon\nbrigitta\nhessians\nosteomyelitis\nfloodlight\nganon\ncoad\noscilloscopes\ndogme\npuffball\ndegerfors\npurr\nsalin\nemax\nphukan\nbayar\nbarreiros\nbiondo\nspeakership\ngráinne\nefrain\ngurgel\nsharq\nzha\ncollectif\nsubcortical\nrockcliffe\nerrington\nfestspiele\nwasi\nbpf\nmultiplexer\nhandhelds\nulysse\ninukai\nlaertes\nvinita\nkca\ntrajkovski\ntunceli\nmilady\ncapehart\ncohasset\nwindscreens\nabstruse\nhiscock\nincapacitation\nifield\ntresham\ncheerios\nbelitung\nsuperclass\nsadeghi\nfeeley\nammanford\nhames\nclete\ngwydion\nschwantz\ndrooling\ngigantism\nzoonotic\nguillemin\njaysh\nbranston\nferreyra\nfuselages\ntakamine\nmcmartin\ncerny\nruhuna\nkooky\nlahaye\nbechdel\nduffel\ndiorite\nshipwrights\nirk\nisai\nbiogeographical\nenbridge\nglistening\ntwill\nincompatibilities\nnclb\nshireen\ntorturers\ngalleri\nnyanga\nmanby\nhebner\nvaidyanathan\ninterbedded\ngermiston\nliuzhou\nreshma\nkoresh\nknepper\nblackcap\ncarisbrook\ncriminologists\nthia\ndiminutives\nundiano\nwichmann\nottorino\noropesa\nsaket\ndanaher\nwaymarked\nphagwara\nschleyer\nvibratory\n
hindrances\nmxr\njarrad\nhachi\nbovell\nmedleys\nclownfish\nlya\nubayd\nmustin\ntarento\ncatastrophically\nducale\nbagheera\nripoff\ntykes\ngance\nmaecenas\nbionics\ntedford\nkarmal\nlactating\nnhlpa\ntobar\nrelapses\nmorong\nmaalouf\nfreemantle\nmhic\nryotaro\nneckline\nvoorst\nmoorcroft\nsluggers\neldora\ncathodic\ncastine\nramgoolam\npopplewell\npyrene\nbirdies\naldama\nmassad\nfabrizi\nsayadaw\nwpvi\nplac\nmalakoff\ntinkle\nigy\nbarelvi\ndeplore\ngoodes\nweygand\nbotsford\nmicrobiome\nroques\ncanham\nwefaq\nupnp\nlinné\nplt\nnemtsov\ntransocean\nzulueta\ndoxorubicin\nmetonic\nfleuve\nsyrups\nrevises\nyayasan\nterrazas\ncarlinhos\nardor\nparral\neffervescent\nentführung\nayah\nharrass\nwoodchuck\ndaqing\nbeaubien\nholies\nbaluchi\nidee\nordeals\nseah\nhavemeyer\nhaast\nmyall\nzarate\nvith\ncoleco\nhhv\neircom\ncenerentola\nslsa\nreflagged\ntarpaulin\nouanna\nvitiligo\nnorthshore\nsnarling\ntrappe\nshelve\nredick\nbracks\nmcmillian\nmarkell\nvuuren\nornl\nmth\ncastiglioni\nsveinn\nbaston\nvillagra\nhcn\ncoolpix\nwinograd\nroofless\nwachtmeister\npreppy\nbasrah\nalar\nmontilla\nlarraín\nkanae\nchapa\njezero\nkahar\ndepositary\nfelsted\nwrenn\nlindsley\nfors\ndjk\noranienburg\nsii\nbequeath\npanin\nsarong\nmichaeli\nranney\nrappelling\ninterject\nparsimonious\ncsonka\nsubtitling\nmager\nflaking\nquestlove\nenon\ndemjanjuk\nlefkada\nshaik\nlightwave\nfawning\nunsavoury\nwalkden\nbulnes\nchieftaincy\nnorby\nbatas\npitjantjatjara\nrepechages\nvillaraigosa\nrouses\ndrabble\nsarazen\ngpb\ntongass\nvoskresensk\nkadish\ntankersley\nwozzeck\nlemans\ntechie\nsipowicz\ncrofters\nmoussaoui\nnondestructive\nloveridge\nfranklinton\nmitigates\nzonda\nhominins\nhallberg\ntölz\nkashin\nfridolin\ntrekked\ndeacetylase\nlicked\nbisphenol\nwebinars\nbraasch\nmuttering\nboag\ngnomish\nsubrahmanyam\nveep\ntypesetter\npainesville\nshandling\neurasians\nengström\ncelal\nkhola\nkieft\nkleinschmidt\nkhaliq\npetrillo\nperkiomen\nmeulen\ntoti\nditzy\ngullickson\nleaderboards\nbayrak\nabductors
\nsatcom\nkabale\nrefilling\nmanns\nsaeko\nbahl\nroundhay\nturbot\ndemilitarization\nazeglio\newbank\nlittles\nralli\nmeirelles\nactivations\nsanches\nllanes\ncotopaxi\nsekiguchi\ncaddies\nohn\npoignancy\nboruch\noddest\nconveyancing\nkasahara\nrapido\nfomenko\nlacordaire\ntrai\npernell\ntrolle\nnorell\nvavilov\nswaine\nscalping\nhaun\nbookable\nnungesser\nstrongpoint\nkempsey\ndyan\nerewash\nfriso\nnadeshiko\nskee\nsterns\nluckey\nnack\nmixteca\ngrega\nsarl\ngisbergen\ndelvecchio\nsonnenfeld\nfranti\nrinconada\nbalart\nmishna\nrubescens\nraphe\nsakti\nzonder\nalltel\nmassi\nbreaded\nconshohocken\nunreality\nblytheville\nestrin\nrhuddlan\nlepsius\ncarbs\nmarples\niud\nlautenberg\nkundu\npirès\nrege\nsmilin\nnitya\nbeaman\nprimorac\nnobuko\nfenny\nbutterley\nbanquo\nrosine\nanfa\ntimmer\nroter\nsmallbones\nhoti\narmijo\ndelerue\ndelphinium\nbassetlaw\nharmison\nbonfim\noudtshoorn\ncollocation\nfretting\naase\ncalipari\nsakakibara\naerolíneas\nabsolutes\ncatz\nessenes\nyerushalmi\nwinick\nphilharmoniker\nplouffe\nhumani\nasphodel\ntoiling\ngaribay\nduplicitous\npravin\nconcessionary\nsennar\nmolony\nharmar\nrax\nnaira\npanam\ncpac\ngratified\nvillepin\njugendstil\naffray\nblasko\nahp\nencomium\ncharades\nwalch\nshampoos\ncoachbuilder\njospin\nfabrique\nfests\nphiladelphians\nseguridad\ncolyer\ndrancy\nosco\ncuadrado\noculomotor\natenas\nrespawn\nspacesuits\nzampieri\nboras\nhoude\npenso\ncepheus\nneoconservatism\ntafari\ncinematograph\nmwm\nidomeneo\naddled\ncruachan\npetruchio\ncognizance\nkriti\nunheated\nsuccesful\neyeing\navca\narnos\nharting\nionize\nsomdev\nputrid\nrummenigge\nrearwards\nfeild\nzazen\nhorley\nlaminates\nneutropenia\nstriding\nsalzburger\nfoord\nolazábal\npiscator\nyodeling\nglendinning\nmachon\nmiletić\ngojira\nceva\nbushmeat\nppo\nvillaret\nbreakpoint\nkalk\nifaf\nvasudev\nsodus\nbothy\nferryboat\npatrolmen\nromanovich\nintroversion\nfelts\nduvernay\nredeems\nobaid\nmollo\nborrowdale\npoch\noskarshamn\nserville\nexocytosis\nsanatan\nbewegung\nf
ujikawa\ninstantiation\nlounging\nberryhill\nfenty\nhuisman\nduggar\ncopter\nbayles\nreidy\nthrowbacks\nhairstylist\nhurrell\nislamofascism\n�\nioannes\nhumoral\nlifesaver\nsungkyunkwan\naxtell\nmadhyamaka\nariete\nmosport\nmikhalkov\nmoyse\nmpu\ngromyko\nottmar\nuag\nvasilievich\nhettie\npufferfish\nsmallman\nheanor\ncung\nozan\nfridman\npatry\nostapenko\nvirginica\nrumer\ncrowborough\nwoodworkers\nstaatskapelle\nharet\nprecariously\nfoxhounds\nkizuna\nvaseline\ngodfried\nents\ncentum\nshigatse\nlahontan\nrexall\nhickling\nbillund\nnibble\nerkin\nphalaenopsis\nblam\nhaumea\nunmixed\nmyatt\ndesmarais\nbowdler\nusra\nsukh\ndisfranchised\nmanipulators\novenden\nmulcair\nretails\norry\nundercurrents\nburbidge\nschneiderman\naroha\nsurvivorship\nstatment\nsterilisation\ndisarms\nafrc\nokolie\nkitkat\nprefs\ncrufts\nlathyrus\nwarmup\nmégane\nreconditioned\ntapscott\nhpp\ndearie\nwilliamston\narnulfo\nmyoglobin\nnorwest\nlonge\nsquib\nolap\nderwin\njarret\nbulut\nnspcc\nhatchling\nsnagged\nwristbands\njanka\npropagator\ngualberto\nexpeditious\nverdigris\ntrapattoni\nenvies\naskmen\nfestiva\nbhattarai\nguerreiro\nrelaunching\nnesn\nljung\nbukharan\nretentive\njemmy\nrecirculation\ngavril\nmcvay\nobservateur\nsectionals\naldington\nirelands\nfilmic\nsouda\nvandiver\ndubuc\nschwager\npoisoner\nhmb\nelaborations\nrusconi\nbeatie\nknotty\ntdd\nedifying\nmadelyn\nhix\nguar\nhossa\npayee\nnanobots\nchorizo\nsurcharges\npanteleimon\npouliot\nsuds\nolympe\ncahuenga\narchies\nmclintock\nnasha\ngiordani\nleinart\nlote\nmatanza\nkalash\ninformality\niiit\newert\nnonfunctional\nverulam\nbowmanville\nbillon\npimpin\nphiloctetes\nberdan\ndonte\nwashburne\nbiba\nsnowshoes\nhandelsblad\ndillman\nbanki\ncreusot\nfatalism\ndermatological\nglatch\ncantinflas\ntibesti\ncalzaghe\nmultimillionaire\naauw\narneson\nmeinrad\nmujahedin\npoodles\nkima\nhamo\nsacrilegious\niem\nsalvadorans\ngoldring\nknbc\nschillaci\nsarna\nsitamarhi\nunspoilt\nfalken\nfauvism\ncelestina\niannucci\nmikis\nshivam\npro
ffer\nlinhas\nrifai\nsyphon\nists\nivr\nscions\nkadhim\nfundacion\nsaccharine\nmistranslated\nboudica\ncampina\ngnarls\ntrimotor\nrinsed\nkef\nleotard\nhaudenosaunee\ncharlies\nmarrone\ncanneries\ngallstones\nmagherafelt\nvln\nholmwood\ncabane\nbelievin\nburka\ngibraltarians\ngallet\nsinc\nmicki\nlefort\nastrometry\npresuppose\nsleat\nsimples\nparla\nwildes\nnonexistence\nvouched\nnecdet\ndenker\noosterbaan\naristo\nluzia\nalamodome\ntransbaikal\nzulkifli\nchausson\nwsr\nyelawolf\ntransbay\ntoliver\ntnr\nsaen\nshepherdstown\nhassani\nglaciations\nbrained\nfrescobaldi\nunipolar\nuel\ngamecock\nrothermere\npithoragarh\ndelcourt\nmohler\ndysregulation\ncaza\nswinnerton\nbevy\ncolorados\nbekir\nsiris\nmits\nscrofa\nduceppe\ntdf\ngruel\ndeloach\nvardi\nkear\nledley\ngoldcrest\nmeireles\nraso\nbarebones\nsart\nangharad\nveerappan\nreti\nfrazetta\nbitterfeld\nkokoschka\nrobbo\nholdouts\nconciliar\naronoff\nvandegrift\ndilutes\nagno\nvariorum\nsangar\npalaeontologists\noverabundance\nsmeltz\nbundang\ncorre\nolpe\nbigsby\nkelkar\ndogo\nsonatine\nmacalpine\nmaintenon\nzep\ncampfires\nlactamase\nwinnipesaukee\nexelon\nbryde\nbajrang\nracal\nwinsome\nmnemosyne\nquerido\nsamedi\nbogalusa\nbucknor\nroadsters\nfireboats\nsidoarjo\ntrod\narbitrated\nfost\nnummer\ntapani\ndws\noutselling\nloots\nmelanocytes\nsocials\nmodica\nshalott\nsteinhaus\ntubize\ncairngorms\nradionuclide\nelbit\nstrawbridge\npaediatrician\nblanketed\nmyat\nlamacq\nlitani\ndaintree\nkampi\nadminstrators\nclayey\nsimonetta\nfriberg\nsupermax\nxna\njmu\ndohuk\nkolozsvár\nkiener\njasmina\nfct\nlibo\njennison\nuptime\nsnd\ntirah\nvolte\nelectronegative\nvaldis\nbowley\nconscientiously\noberto\ndruck\nranaldo\nstoute\ntelia\nfinmeccanica\nmckell\nshively\ndeliverables\nwapa\nmicheletti\nshinedown\nmomen\nlaufen\nbibiana\nattestations\ntegra\nnber\nmaurin\nbruyne\nipe\nventoux\ngorgas\nayyad\nrayalaseema\nglutton\nchickasha\ninria\nluciani\nrollason\ntrie\napure\nofa\nyoshinaga\nbedelia\nelodie\ngorsky\nmcparland\nlu
qa\nhenrici\nwoodmen\ngarryowen\nsukhdev\ngreystones\nsupermini\nlsat\nnumata\nmoralizing\napostoli\nlatécoère\nannelid\nventriloquism\nduffey\nsuperstring\ncrace\nmentha\ntht\nshaina\nbriley\nhealdsburg\nidylls\ndrudgery\nbogner\nnoro\nriss\ntamika\ngalassi\ntischendorf\nzuiderzee\nbacteriological\ndropsy\nlyngstad\nebrd\nhomebase\nchevallier\ndhyan\nahe\nhackneyed\nbommel\nbarkin\nslumberland\ntransferrin\nkarami\nbory\nconewago\nkubler\nbandied\nlauriston\nflq\nvts\ndoroshenko\ntobit\nchato\nhumanae\nbreitenbach\nalmeda\nsnowe\njerrod\nmendrisio\npokorny\nclaros\nmosso\nyauco\nhonore\nbeady\nschoch\ncdd\nbilgi\nabravanel\nprotestor\nautomatism\nwexner\nlafon\nbyker\ngiugiaro\ncortot\nsignoria\ndhul\nhuberman\nthyroiditis\natd\nbeeing\nabelia\nkautz\npbi\npuedo\nulli\nramrod\nmorphogenetic\norito\ncrawfords\nschaap\niorwerth\ntaheri\nmalaka\nherniated\nkabak\nichkeria\ngualtieri\nbeatboxing\npalaeozoic\ndunraven\njungen\neastview\nmerrow\nironmonger\ncruelties\nmassaging\ntabard\nmonocular\nlabore\nellenborough\nlimnology\nhassi\nsubdeacon\ntheismann\ngernot\nlittman\nrowse\nincongruity\nhaarmann\nlytic\ndrachma\nelkie\nholbeach\null\nnewsmakers\nroncesvalles\nunpersuasive\nsamora\nczarny\nizawa\nbhattacharyya\nangelopoulos\nsavill\nhinshaw\nmagnifico\ncontextualize\nslimmed\ntandja\nluzzatto\nsneyd\nwagr\nmetamorphose\nbmr\nchioggia\nliedtke\nshean\nteichert\nmanhattanville\nretardants\nwjb\northostatic\ninconstant\nrushall\ncorollas\nkomet\nmard\nbackboard\nmedibank\niacocca\ndepalma\nskalica\nzyklon\nbraunau\nsancto\noceangoing\noana\npunctual\npandemics\ncolom\nsignori\nmrn\nbelsky\nsweatshops\nmisprint\ncator\nmisbehavin\ncaffery\nlaporta\nadders\nmeyerhold\nmclarty\nengberg\nzhuzhou\nhydroponics\nroelof\nsilkscreen\nrabo\nmayorship\nmnlf\nbludgeoning\nshriram\ndomachowska\nevidencing\netcs\nsalwa\npela\ntrifluoride\nhoochie\ndigha\npipette\nobjet\nquaresma\nkeerthi\nsédar\nsadegh\nhorikawa\ndefalco\nlluvia\nkonzerthaus\nreexamine\nbrimley\ncampen\njeffords\n
preteen\nawwal\nberkoff\nwtmj\nlovisa\nflatworms\nduren\nchian\nfashionistas\nhilty\naphra\ntouchline\nbawn\ndecanter\nasaro\nkenning\nwynberg\ntorrio\nkentigern\ncels\nlapine\nfortuño\nkatonah\nhunsdon\nreaffirmation\nkansa\nelmham\nnpn\ncélestin\ngandolfo\nsalop\nfloorplan\nsöhne\nmarana\nsócrates\nahhh\nstossel\nrégine\nlucho\nvolkova\nnedlands\nebadi\nthermophilic\ndeneb\nplayout\ntashan\nirondequoit\nataturk\neuropium\nlazcano\nharpoons\ntirico\ndirtiest\ntroopships\npolgar\nmagnin\ngalpin\naulis\nmeleagris\nvibeke\ntipler\ncomptrollers\ncoalescing\ndependability\nsantha\nnagumo\ngreenacre\nhessler\nblad\nclaussen\nfragaria\nballycastle\ncrosier\nartilce\njanco\ntrigo\ngoldenthal\nlaurinburg\nnmt\njarryd\ndunshaughlin\ntdma\nkittyhawk\nsaltoun\ngirlhood\noverdosed\nincidently\nrhinestone\nleonida\nsinsheim\nlaypersons\nclimategate\nmcwhirter\nnerses\nwhitelock\nuckermark\nterman\nglaxo\ndrl\ntebbit\naure\nshihan\nchobham\ntines\ndisavow\nadvertizing\nerba\nmorneau\nstroller\nrefn\nbeetroot\nazor\nbelsize\nthreadgill\nmilia\nhongzhi\nhegemon\nnappa\nfaery\ndimitrova\nsiodmak\ndoogie\nkeychain\nsanader\nmonicelli\nauks\nsasse\nmundt\nxpt\nodilon\nearless\nemley\nperrins\ncrossbred\ngpx\nwaifs\nterzo\nunexceptional\nmdna\nlondinium\nprydz\nrepackaging\nsanne\nparys\ngrünewald\ngitmo\nlaraine\nkeni\nlatinas\nrett\ngoldsmid\nshoeless\nkubitschek\nflavonoid\ngrievously\nfasciitis\nbopper\nunpack\nimpregnate\nbasri\narborescens\nfassa\nmlrs\nblinder\nlundström\nwestmead\nretracing\njfs\noozing\nedb\nbébé\nsilage\nlukacs\nkazu\ntournoi\ngyulai\nsence\nclinique\npulman\nstagecoaches\nhunziker\nsipping\nscammer\ncashiers\ndonk\nferne\nbarnstone\nconsciences\ngry\ncilantro\nparkhouse\nerden\nmede\nnili\nalannah\nzetterberg\npartai\nnightcap\nunviable\nscallions\nthornaby\nproselytize\ncurds\ngreyscale\njoëlle\ndowngrades\nstomatitis\nnextstep\ngelling\nheemskerk\ncanid\nantoon\njda\nmegaptera\nshoring\nnarai\nmorceaux\npoliticking\nbettman\nentree\nintérieur\npinkston\ncr
oak\nrationalists\nrajoelina\ncatto\nguster\nhostesses\nkitesurfing\nreconnection\nbitchy\nviotti\newha\nthum\ngei\nhollar\ntorta\nubiquitination\nkafi\nniarchos\ndeferential\nmortara\ncernan\nscargill\nbolting\nlongyearbyen\nflaxman\nguttmann\nucm\nemanuela\nradia\ntorlonia\nfrankenthal\ndsn\nriu\nsibanda\nchondrite\nwiwa\nmorano\ngorée\nporthos\nozeki\npalatino\nuppal\nrjukan\ndomon\nlewontin\nmirvish\nbassman\nmandragora\nthich\nizabela\nfabricator\ndressel\nmutlaq\nbiletnikoff\nrigas\ngdb\nronettes\nchilwell\nprintout\nlydekker\nlgd\ndeformable\nperonism\nilka\npewsey\nifi\nlovegrove\nlingga\ninquisitorial\nilluminatus\nschwetzingen\nfotis\nevapotranspiration\nkejriwal\nferhat\nrijkaard\npgd\nbuehrle\nmehrabad\nkili\nbookmarked\nwarspite\nvaticano\nsundquist\nkgf\noffishall\nsiphoning\nchristianshavn\ncalisto\nhitched\nwata\nlaceration\nbookselling\nlomakin\nchoudhry\nolan\nchilkoot\nramiz\ntujunga\ncaymmi\ntethering\nchilies\nfreethinker\ndreamz\ngroombridge\nsaxatilis\nstreetlight\ngarcetti\nfancher\nmadrazo\npettitt\nannalisa\nspringbrook\nelrington\nkingsmead\nkoura\ntoenails\nkoffler\ntavira\nconjoint\nchiro\nsturdier\nschwinger\nlebreton\nmatignon\nquickstep\nlpl\nafflicting\nkisii\nmieux\ncnmi\nyoakum\nnti\nopenshaw\nredi\ncastilleja\nrussi\nphifer\nromper\nragdoll\nbraked\nillumina\namas\nberenices\nburchfield\nbackbenchers\npaku\nchriste\nosd\nhosein\nfougères\nmandu\nggg\nkeown\nmulde\nfindon\nbhel\nlionesses\ntardif\nvivica\nseptiembre\ndukas\nmcclusky\nbelligerence\npythia\npaye\nunfailing\nwindsurfer\naeons\nbrocard\nlushington\nunobtainable\npatens\ntbk\nalekseyev\npeleus\nphelim\ngojko\npsas\nnutini\numami\nnics\ncraved\nundersigned\nmayhill\nttr\nfoon\nsonik\nepidemiologic\nfilippov\ngershman\nmystikal\nshorne\ncinémathèque\nchorea\nkazushi\nloughnane\ncranking\nteltscher\nevgeniya\nhonking\naspic\ngillam\nplataea\nmidstream\nfiley\nchg\nbarked\nmaass\nparticipative\nhendler\nkleberg\nbegbie\nbechara\nceux\nqingyuan\nunbelief\ndalip\ndogmatism\nl
eaver\ngameloft\npanesar\nlycee\nhindlimbs\nameliorated\naltmark\ncumulonimbus\nroyds\nsabz\nmodis\ngimpo\nlajpat\ngoe\naggrandizing\nyokoi\nkilborn\ncoops\nsubsisting\nginter\ntwat\nroget\noaklawn\nsocked\nclayoquot\ncevdet\npequeno\nelapse\npauri\nvocalized\nnambla\nbörse\nvif\noppressing\nrefile\nzapper\nfilaret\nmortaza\ncoralie\nkua\nsdh\nwhibley\nncw\njacmel\nkellys\ndunnigan\ndemographer\nfussell\ntwinkling\nforgers\ngrigol\noxted\ncalamari\ngaspari\ntamera\nfonthill\nreexamined\neius\nhoffmeister\nrangarajan\ncobblestones\nvanja\npanji\nweinrich\nwestcliff\ndraperies\nbarackobama\ntaglioni\nsandcastle\nebeling\ndexys\nyilmaz\naplomb\nchatted\nberghof\nensconced\nelectromagnetics\natavistic\nhauler\nfatales\nhalkett\nbombe\nbottomland\nmodifiable\nsugata\nellerman\nwarbird\nalvey\nlysosomes\nmortise\nkreme\nior\nriffles\nlabelmates\npanaji\ntatsuma\nophiuchi\ncaernarfonshire\njulliard\ntournon\nbourguignon\nhalvor\ndroning\ncordage\nhazem\nclimatologists\nyawkey\nforbin\nfrannie\nróisín\neeyore\nmirny\nvaras\neffluents\noccluded\nchamomile\nbiter\nboccherini\nawed\ncaiaphas\nsummited\nronit\nbajan\nmearsheimer\nhaldia\nfée\nbaiyun\nspangenberg\nexf\ncaden\nvhp\nflagstone\ncustomise\ngirlguiding\nlamination\nanatolyevich\nbati\nwgi\nngam\ncisl\njinshi\nsinar\nmatzo\ntransmilenio\nretrained\ngiganteum\nekiden\nnotarial\nmacmaster\ndihydrate\nmajo\ninterning\ndisfavored\nyixing\ncayes\nbispham\nbroonzy\nfass\nblastocyst\nmirella\nbalbir\nsiirt\nkeshavarz\nasplund\nverisimilitude\npfeil\nmitta\nwombles\ncirc\ndürrenmatt\ntradable\ntelevising\nbouffe\nfiel\ncontini\ngraciosa\nburchett\nspiridon\nmanicaland\nriveter\ndebarked\nhypochondriac\nwingert\nmajidi\nglorifies\nmajorcan\npasquier\nzschech\nsurreptitious\nyixian\nblogpost\ninnards\nrosson\ntransdisciplinary\nfécamp\nbrahmans\nsteenkamp\npavlik\nslocombe\nlaffey\npernilla\nibbetson\ndoli\nbarby\ntalcahuano\nusj\nganguli\ncoase\nkantele\nfonovisa\nlymphedema\nskewered\nmiyabi\ndespots\ngiora\nkunigunde\ncockad
e\nsedgewick\nyuzu\npolkas\npinel\nueki\ncooperativa\nbecke\nreedited\noink\npeligro\nlavenham\nmé\nnørrebro\ncastellane\nvestas\nostracised\ndeputized\ngodric\nhej\nproximus\ncnts\nmisrule\nlafollette\nreicher\ngolaghat\namazônia\nwaterworld\njbs\nhurn\nholmfirth\ndevolves\nequitably\ndriftless\nrbm\nislah\npalps\nveruca\nborrelia\nspecialisms\nmorriston\npereda\noben\nprokhorov\nwalpurgis\nrestive\ncadwaladr\newi\nenke\ngraney\nslipcase\nlothians\njaja\nfulkerson\natas\nlyttleton\ncinchona\noystercatchers\ncubists\nsbf\nandantino\nmenara\nserafina\nfahmy\nsahelian\nahamed\nsuhl\nkilauea\nadamu\ndeuel\nsiberry\ncreswick\nnall\nflattop\nreconfiguring\nblackledge\naplin\nintertropical\nglobalstar\nmangling\naste\nnarconon\ntonner\nguidestar\ngramps\njamba\nclathrate\nunwashed\nsanusi\nmarías\ndisproves\nresidentiary\nsisk\nkefalonia\nkmp\ncurti\nangustifolium\njaneane\nhomeschool\nillusive\ncadastre\ngrosseteste\nouterwear\nextravagantly\nletdown\nsailcloth\nrosenfield\ncapitalizes\nsampa\nbolsters\nskm\nlensman\nmanya\nbartercard\ncoulon\nsmoove\nopatija\narmillaria\nsushmita\ninhofe\nconagra\nbearsden\nstationer\nbraes\nmunford\nljudski\ncata\nauthenticating\naircrash\newca\nwwdc\nqrs\nbritomart\nsnorting\nscheff\narête\ninvigorated\nyemi\nrevis\nethnographical\nindependente\nshallots\njawless\nhenstridge\nmilngavie\nmukherji\nkupfer\ndexia\nvien\nquartermain\nyasuhiko\nnewswatch\nmurari\nnoblesville\nfarran\nwaterskiing\nslicker\nnmu\ncurson\nusia\nsizzling\nnoncompliant\nlifshitz\ndepressants\nimprovisers\noller\nclinches\norphic\nmacculloch\ninnuendos\nzoomorphic\nfanling\negill\nemme\nkiner\njerash\nrededication\ndepopulate\nbalbi\nequiano\nmuramatsu\ndanka\ntaca\nbobblehead\ncheveley\nterritorially\nlygon\ngaran\nmayi\nalfonse\nindustriales\nlusts\nptcl\nhearer\nijtihad\nmashantucket\nimpoverishment\ncharo\nledford\nteigen\nauricular\nvampirella\nbebb\nhadamar\nrostelecom\nhonora\nrosendahl\npolona\njeffro\ngrez\nhayim\narmpits\nkorff\nkoshien\nseato\nfcd\nrej
ewski\nconstantinos\nkollwitz\nbisi\nbiodegradation\nketu\nanaesthetist\ndacosta\npruritus\ntypewritten\ncaniff\nmeritless\nwillstrop\nrandleman\ncaledonians\nelmont\nballmer\nposhteh\nfriedrichshain\nodai\nmegabus\nidwal\nmajella\ntasers\nfolkwang\ngreil\nbordon\ntrippe\nstarlets\nrecoletos\nwomenswear\nwickford\nhispana\nodiham\nmichoacan\nmacaskill\noracular\ncarlitos\nvincents\nsoukous\nunlit\ncazares\nschnitzel\nbezos\nguillemots\nriposte\nsinghbhum\nkincheloe\nveltman\nphotochemistry\nsaez\nberiev\ntorinese\nhoarded\nguarulhos\nvilhena\nsharpay\nboley\nmoston\ncustodes\nwinnington\ncoupés\ncrowhurst\npsychotherapeutic\nlof\nkurian\ntalha\nkubera\njabot\nmgl\nloups\ntaverna\ngentilly\nsledging\nheadfirst\njordanhill\npadraic\nfashionably\nrinsing\ncortege\nfav\nfakhruddin\nblinders\ntheiss\nfontanne\nsevnica\nvermouth\ndhubri\ncoola\nruffles\nzichy\ngrappled\nshenstone\ndirhams\nuwm\nsomeway\nkyles\nvgs\nerf\nrebuff\ndefiniteness\nprimm\nnullius\nkadar\nenjoin\ncascavel\nutan\neleanora\nprincipi\nsneek\ntracings\nporos\nspermatogenesis\nnaze\nrecrystallization\nfoxworth\nnazarenes\nalmir\nfiamme\nwraxall\nammunitions\ncistus\ndissociates\nsorters\nkesavan\ncheca\nadelle\nliebert\nneuroprotective\npietersburg\nbampfylde\nguse\nwhiny\nwapo\nminimi\nbimodal\nhobe\nntm\nvpt\nteleprompter\nnakamoto\ncoello\nmorbidly\ncashin\nobrera\nvinje\npujas\nchanler\nréserve\nleeper\nsilves\npardew\nsteedman\nextemporaneous\nkrylya\nfna\nhsus\nfreewheeling\ngrigson\nzaytsev\nreactionaries\nstimulator\nattractively\nhalfa\nkoc\nexiling\npostmenopausal\nflopping\nsmoothies\nhedged\nocchi\nhartfield\nbeighton\nasya\nweatherall\ndaltons\neuroseries\nalmirola\nzoi\ncaballé\nempiricist\nautonoma\nnewydd\nsequent\nrashtrapati\nbaucus\nkaldor\nhult\nmenaces\nlibbey\ndelicata\nmif\nlevada\ncopleston\nswitchers\nltm\nfazlul\northodontist\njaleel\nlertcheewakarn\nhumphrys\nfantom\nmallenco\nmerlini\nkapo\nkomotini\ncapodimonte\nslicer\nkouyaté\nsubtribes\nphilae\ntanfield\ndotting\nderham
\ncornhusker\nvrain\nesparta\nflohr\nsnh\nusba\nchacabuco\nmcnichols\ngràcia\nwasit\nmladenov\nflabbergasted\noncle\nkajaani\nebrahimi\nfsp\nhomologated\nqinetiq\njabar\nenshrines\nihi\nperuzzi\naneta\nunfailingly\nsprenger\nphx\nbulkier\nfaizal\ndetracted\ntomasson\nprinze\nclefts\nfrontwoman\ngraziella\ncreat\nenrages\nliebenberg\nbenveniste\nrubalcaba\ncatalinas\naurélie\nctvglobemedia\naltra\nanstalt\nmeyerson\nlagash\nwoolrich\njosy\ntle\nblitzstein\nfpi\nbraunton\ntainter\npergamum\nstac\nwallenda\nhämäläinen\nhidatsa\nkonar\njdk\ntimba\njamaluddin\nkopi\npense\nliesl\npipework\ngopichand\ntakeout\nnaoyuki\ncrybaby\nkarunaratne\ngde\ngentlewoman\ngrapeshot\nflirty\nlongmire\nheaving\nfuge\nkoppelman\nihara\ndisfiguring\nattleborough\nepitomised\navma\nzaka\nhfcs\nllantwit\ndwelled\nnamier\nmajerus\noverextended\nnôtre\npagnol\npurnia\nunbundled\nvoit\nplaxico\nserrate\nsaltbox\npedroso\nmcfee\nfoxhole\nguppies\naltyn\nspeiser\nshchedrin\nseawolf\nslovenians\nmaund\nphua\nijaw\npasturage\nhib\ntamarine\nrunny\nkasimir\neyles\nharnden\npenniman\ntorbert\nfranchetti\nchondroitin\nagard\ndenotation\nrezoned\nfaslane\ngarrod\nneuropeptide\nboke\nsledges\nspeier\nyurchenko\nolc\nmayas\nyonah\nigen\njta\nlrs\nrekindling\nelefante\nvmm\ndul\nwco\nmrl\nbusily\nappallingly\nloes\ncrome\nchainsaws\nbullinger\nbraunstein\ncallington\nswindled\ntubercular\nwanger\nilario\ndayz\ngerona\ngormanston\nenfilade\nbour\nendowing\nkawamoto\nacheulean\ncarrero\nnangang\njfl\nkatsuhiro\nsarangani\nyik\nmeckel\nlevering\nhestia\nwidianto\nnuisances\nfinbar\nfalmer\ncollinwood\ngni\ncrossbreed\nkillington\nxizang\nroundish\nmeucci\nglascock\nvalidator\numb\nkoevermans\nceramicist\nsericulture\nunexplainable\nshearsmith\nacappella\nghadir\npatmore\nspiner\ndalmas\ncosmologists\narreola\nbajau\ngolfe\nxiaowen\ntokunaga\ncootie\ntreanor\nwitkin\nparcs\ndesorption\ngodse\nhincks\nbergner\ndillen\nwarfighter\nserginho\nmineralogists\ncrazies\nduplin\npesca\nwhitetail\ndayang\nskein\nkelo\n
kunihiko\ngalusha\ndiam\nsawgrass\ncarnations\nbasileus\nscooping\nrafer\nnuchal\nviereck\nirun\nbinger\nboinc\nbosa\ndoubters\nstokers\ntastings\ngeographia\nhmh\narikara\nassai\nguilhem\nmycobacteria\nbaidoa\ntacuba\nrayen\naldersgate\njayanta\nwatermarking\nlittlemore\ngastritis\nshigella\nwaske\nblowhole\nsonal\nprovincias\nnarcosis\nceda\nliebmann\ncifuentes\nrampton\nloudonville\nfadel\nmesmerism\naisi\nrentschler\nemeterio\nmarkova\nkardashians\nmerville\nsamina\nsherburn\nsolipsism\ncavill\ndramatizes\ngroans\nnapping\ncisterna\nhumors\nlomborg\npolkinghorne\ndeejays\nmacbrayne\ntechnopark\necsc\nunheralded\ninflates\nselin\nlonelygirl\nhonorarium\novertakes\nfuku\nrenfroe\nhoullier\nbattin\nseconding\npatrouille\ntenax\ntashiro\ngillmore\nlingard\nverismo\nsusann\nmedicis\nsbr\nfoxglove\nprewitt\nlavatories\nekrem\nsoupy\nburmah\nmusca\nkivi\npathi\nvolcan\nkocharyan\norava\npeterloo\nepb\nkittson\nkultury\nfloriano\njareth\nantiparticle\nkep\nmaag\nchinooks\nquinze\ngom\ngharial\nois\nrosebank\nmorarji\nbeeb\nchiptune\nwaxwing\nflim\nlenglen\nlecithin\nuntruthful\nanneli\nbrockwell\ntiltrotor\ntwentynine\nfrictions\nvallely\namarte\nmatosinhos\npanyu\ndfds\nmcnicoll\nvij\nshahabuddin\nbesi\nbaykal\nbilinguals\nprothonotary\nmodine\nrevelator\nsudeley\ncronje\nvivax\nblumer\nequivalences\nperfumer\nalster\nsyngas\nbuswell\nmotorcar\nkatsuura\nwestfalia\nwhiteboards\ncaeruleus\nbranly\ntandberg\npils\nalcyone\nmargalit\nrationalizing\nnormann\nhiu\nflytrap\nhavlicek\nargerich\ncimarosa\norcinus\ntungabhadra\nchannelized\nserotypes\nbushveld\nhapag\npringles\nkett\ndisodium\nlorien\nchoreographies\ncalmodulin\nbismillah\nballew\nyasha\ntamarins\neliel\nmodelers\nrylan\nsmokie\nnativism\noggy\nionel\nclipperton\npapaver\njonesborough\nnasher\nregnault\nnorum\nkasdan\nczk\nbellicose\ncuno\nfortas\ncollateralized\ncaufield\nhercegovina\njalapeño\nmessaoud\ncarob\nshakyamuni\nmiralles\nmumble\nshapeshift\nchicagoans\nriskin\nsuresnes\nmoustapha\nculottes\nsansovi
no\nlibi\npigman\nriverdance\nenga\nfloe\ninauthentic\nmagrath\namey\nsonnenschein\nlacandon\njiuquan\naukerman\nduhalde\ncointelpro\nkimbell\ntoefl\nlandfalls\npouille\nzeist\nteti\nkumagai\nparkgate\nzanni\nspicules\nmágica\ntaron\nadán\nepitomizes\ntisbury\nhelion\ntornadic\nvidkun\nishpeming\nbenét\nrugeley\nalienates\ntalgo\ndubinsky\npoiret\nebersole\ncynwyd\nkogyo\nduende\nzanskar\ncolombiano\nkennecott\nmonforte\noakie\nflayed\ngeis\nboutwell\ndito\nundecorated\ndendrite\noris\nlatoya\nchertoff\nbeveled\npseudonymity\nhenric\nnovoa\ndataflow\nphotorealistic\ndrin\ntransdermal\nsanstha\nnikolic\nfroggatt\nirizarry\nstriven\nronni\nunsporting\nhcs\ntawil\ngrabber\naxelsen\nundercoat\nliverwort\nlincecum\nwentzel\nrepossession\nmarshman\naldborough\ntheming\nmonteleone\nunsorted\nimpalement\nmandingo\nflyin\nrecusant\nupstage\nyashwant\nkuhlmann\ndestabilise\ntruckloads\ngoalkicking\ntwiss\ntridents\nchemokines\nnoradrenaline\nbrw\npresuppositions\npene\nquaye\nkirkstall\nbismark\ndeboer\nplettenberg\ndilmun\nsommerville\nborut\nhalling\ngiancana\ndingell\ntaine\nleavell\nmenaka\ngroupies\ntumbles\nwaseca\nsplendida\nesat\nsaleswoman\nemiko\nbroadsheets\nuis\njey\njolivet\nplainville\nshadowland\naubusson\nleos\nfpa\ncrikey\nflannigan\nrulon\nkroes\nnasirabad\ndeepti\ngryffindor\ncandlewick\nlathan\nbersih\ntaytay\npuis\nsouthfork\nromanelli\nlagomorphs\nobv\nbehnam\nrigsby\nhaidian\ngrimmer\njaruzelski\nhousecat\ngilby\njra\nchives\nlegalities\nallegri\njastrow\nsashimi\nsprains\nregimented\nserialisation\nfeuchtwanger\nrappel\ntideway\nbucknall\nintermarry\nintrovert\nmontand\nlitanies\nmicrobreweries\nboardgame\ncirculator\nmatjaž\ncetra\nlayden\nvreme\nkupka\nballyshannon\napted\nmatawan\nbeatitudes\nansons\nsteelman\ntramroad\nrocketed\njut\npalpitations\ncassirer\nosório\ndemurred\nfethiye\nfiggins\nnmp\nballymun\nkinley\nencasing\nhermoso\ncowpens\nmaceration\nshikha\nbewitching\nercolano\nreggaetón\nvolare\nfibber\nzile\nsigsworth\nsalvesen\ndkp\nharrod
sburg\ndenuded\nsesi\nbanos\nmutantes\nlandlines\nperimeters\ncovergirl\nclouzot\nbolinas\nmcnichol\nmirabal\npackhorse\nthorndon\nsturla\nolympio\njoye\nperico\ngolos\ndomina\nriverland\nclutton\nrecuerdo\nmamou\nbehaviorist\nbessy\nsepulchral\nachmed\nserpentis\ncypresses\nhanser\nrosewell\nbulleid\nkurobe\norgone\ngsb\ndisher\nbandidos\nshebib\nsuntrust\ngorn\nshamash\nlibels\ngalarraga\nprien\nseip\nvoe\nunarmored\nblacc\nkhm\nmountie\ndieux\nwrathful\ntomtom\nmotorhome\ncretans\nqaradawi\ngoor\njovem\ncascio\nravishing\nlenzi\nattractors\ndragana\nskydivers\nsinding\nhendren\nmahama\nthyroxine\ndemerits\nhonig\nstudbook\nteetering\ndoxycycline\nwidgeon\ndishonour\ntatters\nhairstyling\ncreekside\npoum\nsammons\nwiss\nratanakiri\nfirdaus\ndeedee\nhita\nkibi\ngic\ntakedowns\nmalmo\nhütte\nlapidary\nmercat\nmajus\ndematteis\ncuronian\nkarmas\nathenæum\nmorar\nshaves\nuec\nboldest\npernille\nendorsers\nchalabi\ngroundlings\nantiphons\naddio\nmannitol\nmistreating\nsilvertone\nyixin\nmalathi\ngreenidge\nbychkova\nsciatic\nbroadford\narlequin\nbiedermann\npahl\npracticum\nsleeman\nstreatfeild\nmyler\nmelson\nyor\nburlesques\nifm\nlevitsky\nweinan\ncontrarily\nstellaris\npasos\ndoctoring\njerwood\nrighetti\nvermelho\njunctional\npiat\npuls\nbogside\nmattawa\nspamalot\nardrey\ndyspnea\narni\nyanomami\nnarang\nhuahine\nibragimov\nrettig\ntrautman\nscholastics\njameer\nraftery\njanek\nfauvel\nsugg\nsharpie\nrockhill\nkwekwe\norganon\nhizbullah\nfokine\ndidion\ninheritable\nwoodmere\noutputting\ntempts\nmizzi\nbonhams\nreste\ncongeners\nallerdale\nzhivkov\nmanagements\nborderless\nlel\nrahbani\ncrocs\ncomiket\nimplode\nptosis\nwristband\nkirch\njaxon\nmartire\nsulmona\ndignitas\nbubi\nnanosecond\nreadying\nprorogued\nalvares\ngrm\nbraved\nconsolidations\ntkachuk\nmeres\ntimoteo\ncaraga\nredder\nsabatino\nroasts\ngastein\nkoffi\naniruddha\nhofmeyr\nknutsson\npseudogenes\nrakoff\nequestrianism\nmagnetometers\nobolensky\naqil\narnica\nangraecum\nrolin\nzacarias\nkttv\ntfp\nf
ugate\nlawmaking\nbarthold\nnovelas\nvlog\npindi\nspu\nulitsa\nkhanpur\ncouscous\nworkspaces\ntartary\numphrey\nkisser\nkempston\ntanveer\nlipsius\nmbas\nipfw\nmaclehose\nflyleaf\nkampar\ngurewitz\nsifre\nlevitin\nmilby\nservatius\naley\nlikelyhood\njammeh\nstressor\nbagapsh\naudun\nesn\nnort\ncoercing\ngranodiorite\nschwarzenbach\nhonoria\nneurotransmission\nscilla\nmisaligned\ngranderson\ngoree\nspiritualized\nsteves\nkashani\nokan\nrichmondshire\nlohia\ngundlach\nschirra\nmelvoin\nanthocyanins\ncopepod\nmasakatsu\nsaqib\nglycerin\nkoza\nncd\npeled\ntheists\ngilberts\nsusanville\nmosasaurs\ntonton\ntakaaki\nlanford\nwikström\npenitents\ndaran\nsecada\narbeitsgemeinschaft\nchromis\ntihar\ndiatchenko\ncourtaulds\nanssi\ngazillion\nbeales\nunpatrolled\nalarmist\nmoriori\nmilitaris\nkorhonen\ntsiolkovsky\neyespots\nnavami\nberliners\ndehli\ndilatation\ngentis\nmercutio\nmikhaylov\nkingz\nnaral\ninattentive\nhessle\nbenavidez\nalewife\nbagchi\nreloads\nexpeditiously\nseac\npagel\nsimulans\nautomatons\nbloodworth\nmenges\nconseco\nforeshadow\njugnauth\neski\ntreo\ntransat\nsuprema\ngessler\nbejeweled\nintermingling\nkilcullen\ntrawls\nldh\nloins\nbotton\nrightwing\nlilienfeld\nonomatopoeic\noaktree\ndusen\nyadin\ncoutu\nehrmann\nlydiard\nravelo\nveron\nbookies\nchiharu\nrbl\ntororo\ndaikon\njacka\nragwort\ntuamotus\navena\nmacero\nchav\nrecta\nmerrit\nshefa\ndorsolateral\nmascis\nfinglas\nwillowdale\ncrutcher\nlongfin\nzahedi\nmahalla\nloing\nassociés\ngangotri\nscruff\ntenge\ngarrincha\nkaleidoscopic\nsorum\ncurle\nbeartooth\nrosacea\nsphenoid\nstandardising\nryne\naikin\nespina\nroskam\npedicle\nputtalam\nsumathi\nshepherded\nworkweek\nwonga\nchilterns\nleixlip\nmerriment\nabhorred\nparcours\ndosso\nxlp\npreux\ncorrina\ngavriel\ngyles\ntivat\ncosmetically\ncounterstrike\nyeley\nbuzzi\ndumbed\ninterpolate\ngie\nperren\nepiscopi\nconspecifics\neirene\nudd\nsonglines\nbrizzi\nkalwaria\nprepped\nefrat\nleve\nfraktur\ntabo\ngaron\ngrigoriev\narumugam\nquia\nmarcum\nlepper\
nfinneran\ntelavi\nbarabbas\nnoonday\nmckibben\nansah\ngrundig\nnigro\nfoxfire\ndbi\ncarterton\nbenigni\ngenbank\nbushwhackers\nashraful\nzonta\nshamsi\nroxx\nsamsonov\nspake\nairbnb\ndreamliner\nglioma\nfestivus\nfloodwater\ngbl\ncecchi\nkornheiser\nsilverdome\nfehmarn\nstepanova\nftth\ntreptow\ndisruptors\nstably\nlalique\nhayama\nfarrugia\nmegs\nkartel\ncramping\ndrac\nskf\ntarragon\nsondhi\nhemorrhoids\nnicos\nkathe\nsncb\nnatori\nmappa\nhermogenes\ncrisfield\nlodestar\nnrdc\ncoleg\nedibility\nasantehene\naqualung\nrra\nsacchetti\nguanche\nblackham\nnovoselic\nhighwood\nsaisons\nspools\ngijon\nwpk\nmeijin\nafrocentrism\ncontentid\naimar\noshii\nnucleated\ncitizenships\ngynaecological\nyonhap\ncanford\njacking\nlge\nreferents\ntilled\nbota\nkunene\nprud\nlocher\nshikibu\nnavarone\nfilmfest\nkapisa\nhawkwood\nstottlemeyer\nglamor\nringlet\ntesti\nkaterini\nkelvingrove\ninterlocutors\nspid\nphosphide\nhandcock\nstammer\nbaringo\nseekonk\nkostov\npallister\nbostonians\ngamescom\ncallinan\nwilkinsburg\nstaré\nbirthdates\nstoopid\nwalwyn\ncubical\ngroundsman\nmaadi\nlevens\nedman\nnanowires\nrustler\nconsequentially\nshb\ngloved\nparsnip\nybarra\nreali\nkuji\nloveable\nbushrod\nachelous\nfone\nemv\ndorinda\ninfidelities\ndenner\nbeyg\nsubstantiates\nbulli\nsalicylate\nambassadorship\nvds\ninarticulate\nkirkdale\nxbla\nbontemps\nmikkola\nmccarey\ndybbuk\ncanonisation\nhogwash\noxidants\nkilobyte\nwettstein\nspeers\nezo\nhillebrand\nballadeer\nhumbling\nbocce\nbeauford\nanagni\nklingenberg\nfoust\nflecha\nkeye\nquebecor\n⅛\npdu\nplaythings\nlpp\ndurward\nhollenbeck\ngossips\ncini\nsiba\nrafflesia\njaimie\nrountree\nbilk\ngasperi\nbhupinder\nharapan\neyez\nbowtie\nstabbings\nfamilias\nsnu\ninco\npotala\nscheidt\nbenge\nfranceschi\nosmund\ntoothbrushes\nlomu\nhimanshu\nmoyano\nnationalgalerie\ngainful\nflorentin\nkhare\nkadu\nsmn\nlenhart\nfahmi\nappeasing\nbrochu\nmaremma\ncoeducation\nbodoni\npib\newers\nbrünnhilde\nsandell\ndnv\ntrivalent\nabbreviating\nshellharbour\na
irtrain\npimpri\nblaga\ntrou\nfukunaga\ngittins\ncarmaker\nmcateer\nunlined\njags\npalestina\nmulkey\nomics\nruhl\nplop\npennebaker\nharle\nchapuis\nsalvatierra\nlatitudinal\nbalazs\nrhapsodies\ntuncay\nsoddy\nbankside\narcanum\nmorshead\nscheibe\ngénéraux\nmeninas\nannu\nyamaoka\nbeaucaire\nredbone\ngerminated\ndaher\nexa\nselex\ntiwanaku\nelectroencephalography\ngiustino\nrobbinsville\nridwan\ndimitry\nthilo\nubaid\nureña\ntrin\njavelinas\nantigo\nfarrel\nbollards\ncravath\nvlaardingen\nasala\nvinca\ntadeo\nlutetium\ncici\nsodhi\ngreenstreet\nexoskeletons\nasaba\nnghe\ndehumanizing\nsubdural\nkomeito\nshoehorn\nrevitalisation\ncarcinogenicity\nketa\ntansey\nbrummell\nbrookland\nchaperon\naways\nphotomultiplier\ncharta\nclayson\nequivocation\naureum\ncallin\nannotating\nbaotou\njanki\natanasov\nbartell\npornographer\nsablon\nphebe\nendeavoring\nsignoret\nnightwatch\nusbc\nerh\nbraeden\nneurotoxins\nchippy\nstokke\ningi\nwitwicky\nlettie\nladainian\ncharmian\ntruncating\nwnc\nnambi\nterza\nkhandwa\nefg\nusted\ncommunards\npalmeira\nwso\nconstructional\nrajni\ndooly\nkoca\nashbee\ndiffrent\ncurial\nfamiliarization\njohannesson\nvanilli\npaclitaxel\nboloria\nhexes\nmcvicar\nmnd\nnaren\nedelmann\ndilli\nbickerstaff\nabie\nreentering\nsharer\ndemocrático\nsundazed\nebl\nzucchero\nkiswahili\nduratec\nfoer\ngavi\nnch\nantitoxin\npigskin\nfuray\nklis\ntracheotomy\npinjarra\nkinsler\nolha\nmacoupin\nchicanos\njusten\nmaginnis\nmalevolence\ntorero\nvoynich\nheping\ncloyd\nreig\nlaeken\nbierzo\npegaso\nlelio\ngrr\nandrae\ndubrow\noptimists\nmandriva\nrennet\nnickle\nheggie\nmaca\nghanaians\nlaci\ngrise\nmulticoloured\nreawakening\nwieck\nslouch\ninternationaux\nleathernecks\nmelanoleuca\ndeceives\ntanagers\nhemme\ndittmar\nwanaka\nrns\nciné\nmigros\nburritos\ncrabapple\nmanch\nmarañón\nguv\nabsheron\ndasari\nhild\nmicrofilms\ntinny\nmiddleham\nfretwork\nsubash\nifo\nappraise\nrmf\noao\nerosive\ndevol\nprejean\nmbna\namenia\njiong\nkuribayashi\ntings\nsilvered\ntowcester\nrli\
nshukhov\nlaba\nvastu\nmalco\nwaterless\nlocka\nvermaak\natis\nmcbroom\nbaynton\npey\ncarrum\nandreja\naldobrandini\nraus\nlumens\nmendi\ndellums\nhowerd\nbiller\nsevierville\nshorthorn\nturun\njabara\nregnier\namidah\nbunnymen\nsweats\naccidentals\nentreprises\ntomaszewski\ndebentures\nidioma\nalarcon\nsarakhs\nnapanee\nkornbluth\nraveendran\nnannie\nroadless\nsurcouf\nintervertebral\nhunton\nwritable\nsaberhagen\nbuttock\nbjarnason\nskewers\ngalloped\nnordberg\nhinch\nrovi\nwakelin\nhiguera\npll\nstokoe\njimmi\nincalculable\nmenjou\nquartetto\ncasselman\ncommerical\naerofoil\ncelio\nlippard\nmaroochydore\ngymraeg\npleydell\ntamira\nchaudhari\nbalme\nslovenske\npictor\nprogestin\ntats\ndelaroche\ncravat\nlampley\nsoylent\nwinkie\nmullions\nagilent\nserik\nmengelberg\nklebsiella\nverboten\nbeiping\nedad\nriverdogs\nindiscipline\nredeploy\nmaddocks\ninequitable\nfattening\nnago\nmajdal\nlittorio\ncommendatory\nflaco\nhedren\nkamikazes\netim\nsialic\nimmeasurably\nsemiconducting\ngolems\nflix\nraincoats\ngivewell\nstigmatization\netzel\ncentcom\nlobbed\njaintia\ndoti\nimpugned\nbrind\nconstrict\njnu\nmerrin\nmoustaches\ntennille\nhache\naspe\nbioremediation\nliaising\ndecapitates\nrotherfield\nklosterneuburg\nnyima\nsameera\ninfuses\ncorsaire\nlapaglia\nbotting\nvogelsang\nhangouts\npiranesi\ntakakura\narjunan\naritcle\nabounding\ngamesmanship\nwindspeeds\nvavasour\nphilipson\nkeeney\nriazor\nvilvoorde\nfoolhardy\nratites\nwryly\njeered\nmieke\nekin\nsternal\nurbanist\nimprudent\nliteralism\nfeith\nkufic\ngulbuddin\nwaldmann\ngird\nicedogs\ndorothée\nperverting\nsynchronously\ndwain\nchattooga\nmoman\nspektrum\ncavs\nwharncliffe\nmbda\nbergoglio\njedward\nredistributing\npravec\nbolaño\nsupers\nawaz\nsneer\nbarranca\nshifa\ngorrell\nmyp\nlpm\nirix\nschickel\nisostatic\ninterfaced\nyaqoob\nquattrocento\nbaad\nguto\nharian\nmenuet\nsyosset\ncenk\ndanses\npallone\ndunce\nchastising\nreligieuse\ncorbridge\nwilcock\nneng\nsardo\nahriman\nsidek\nhinrichs\ngyros\nlouella\nba
shford\nbaretta\nnarrowband\npercheron\nskiles\nramberg\nreinecke\nagrostis\npotsherds\norangerie\nkonopka\nwertz\nwhitethroat\naltamaha\ndco\nqube\nredfish\ndroga\nviterbi\nennui\nafanasyev\nexim\ndeangelis\npleura\nkazunori\nsujit\nimpey\nlaar\nthorbjørn\nbarek\nmuhajirs\ndreamweaver\nbindon\nbatwing\nyearned\nmcglinchey\nmcgivern\nphobic\nalassane\nleota\nermengarde\nroope\nvire\nbosé\nthro\nlinkable\nmszp\nraved\neales\nhardcourt\nflit\nfluor\nefm\nunequally\ncarisbrooke\nroentgen\nmallya\ndestry\nidk\nnortherns\nschwan\nderangement\nsoumitra\nstormer\nvivier\nbosmans\nboubou\nonr\nhitotsubashi\nsadek\nscirocco\nreinado\neib\npatronize\nleavy\nmühldorf\nporticos\ndixson\nagronomic\nchromatographic\nredid\nzurbarán\ngarissa\nniggling\ncontraptions\nkatter\nenuff\nwjw\ncamby\nsensationalistic\nhuffpo\nhaslett\nquantile\nimpounding\nzucco\nportocarrero\nfruticosa\ncalender\namw\nasem\ndamer\npaktika\npish\ncahaba\nthinness\nmashal\njigar\nointments\nwarpaint\nnunthorpe\nmartinis\nkabwe\nkranji\nflos\nparlours\nakhmetov\nunblemished\njonty\ncolledge\nrimless\nbaumbach\nbrocklehurst\ncalma\nvadym\nvoinea\ninfiltrator\ndeana\nziguinchor\nmalim\nweatherwax\npredestined\nunaligned\ntwyla\njetsam\nbourdelle\nseasick\nwillison\nschulberg\nthelwall\nspinosaurus\ncoelacanth\nharbert\nstreetball\nians\nindefinately\nrapturous\nberan\nvab\nwoll\ngreenall\ntaraki\nbrittas\nsuning\nmultistage\nvoortrekker\nianni\neoc\nfitri\nutpal\nhimal\nyassine\ndyersburg\ntélévisions\ngorani\ndnssec\nsubverts\nkimbolton\nbarson\nspillage\nltr\nscandalized\nmondello\nhbcu\nasger\nkrell\npollsters\nallsop\nchoudary\nlaundered\neukaryote\nfucks\ntakeaways\nawka\nvokes\nplacoderms\npolydactyly\nravensworth\nplassey\nflog\nwoofer\nincase\nhabilis\nmatings\nrawtenstall\nbrasco\nrangeley\nbarresi\nfags\nnejmeh\nkarmel\nhurlingham\ngreystoke\nshadbolt\npaese\nalcobendas\nmarone\nltl\nhackettstown\ninstyle\nbayamon\ndvt\nhengist\nmultics\ncroome\ndionysios\ntuppence\nstartin\ntorchy\nsagawa\nexigenci
es\ngobain\nawt\npadgham\nverandas\nmazara\nozymandias\npme\nschriever\nlollobrigida\nemts\nzuri\ncalamitous\noikos\nsandridge\nstonestreet\nsunapee\nhamels\ncrucify\nsophy\nmarsel\ncollymore\nsaoirse\ntenpin\naughrim\nsubspecialty\nnegroni\nrayfield\ndutilleux\nmccullagh\nnoye\nqadisiyah\ntrutnov\nvelour\nprettyman\nkgs\nvib\nboesch\nguercino\nprestonpans\nbookmobile\npomeranz\nharddrive\nrickety\nrajkumari\npretexts\nimposters\noncogenes\nraph\nedr\npenrod\nmisstep\nbiogeochemical\ndouze\nprofessionnel\nshockey\nwantagh\nwalruses\nsavannakhet\npenndot\nnewsweekly\npraslin\nidu\nlading\nxplosion\nsherbourne\ngriz\nsevastova\nsigüenza\nmazen\nprambanan\ntarmo\nsoldati\nshishkin\nbranes\ndreadfully\nirreverence\nsieves\nsiqueira\naradhana\namell\nhabituation\ngené\nkirribilli\nlinna\nweigand\nmirrorless\nsouled\nfakhri\nmazurkas\nbolduc\nespinal\nteamtennis\nruminant\nnadzab\nfuyu\nswanee\nsavan\nsplattered\nzille\nadrià\nrycroft\nmantas\nsmacking\nbrakhage\nmccants\ntenri\nschilt\nsemipalatinsk\ngirling\nfunctionalized\nsportsline\nnassa\nbelson\nrahner\nbtm\nclercq\nwelborn\nbakehouse\ncowdray\nleçons\nhandelsblatt\nblacktop\nsequins\nvenecia\nsaputra\norignal\nmotorcars\ninscribe\npedaling\ntob\nforsman\nbanhart\ngupte\nendnote\ntunicates\nreversibility\ndahlman\ncentrifugation\nhimara\ncollodion\ncirclet\nheilig\nbeneficence\neclampsia\npipo\njanjua\nfiorella\nframerate\nchatrooms\nrauparaha\ncarine\nbeautician\ntroia\nchaudhury\nmelati\ndickins\ndoddington\nolb\nprca\ndecimate\narcangel\ntreacherously\nnewbern\nkantar\nerichsen\ngramatica\nevora\nsanlúcar\nngee\nmetroliner\nasit\nbergens\nlewton\ntraunstein\nneasden\nsask\nwinkelman\napne\nmady\nelmslie\ngaiters\nhütter\nmccullers\nchihuly\nluong\nheadnote\nmapfre\nschillings\ncaedmon\nenchantments\nflorus\nschuur\npanting\nshomron\nautomagically\nmagners\nalces\npartlow\nnotturno\neclogues\nskoff\nhealesville\nmauvais\nfleisch\nopv\naine\ntwirl\nauberjonois\nhajo\nknowable\nharedim\nbradburn\nkurla\ngoupil\nche
ngchi\nknapweed\nimminently\n，\nquickened\nhypersensitive\ndelson\nwidowmaker\ngenta\nrieu\nwring\nnisshin\ncyclically\nkurti\ntaiz\nbanqiao\nwonderwall\nwhoo\nklim\nornithischian\nprednisone\ndisobeys\njnf\ntruett\nichthyologists\nsiku\ngingold\nmuggeridge\namphibole\nconcussive\ncatid\ntechnic\nuppers\nspiess\nombres\nactives\nbesnard\nbasketballer\ncalheta\naccessions\nberl\nflypast\nbarkan\nzeidler\nfromme\nbjörklund\nbamiyan\nentrap\naerogel\ncnf\ncasadesus\nsergeev\npcn\ncauvin\nuntried\nsuperego\nparlay\nsavate\nphthalate\ngoodkind\nbenayoun\narsed\ndivisiveness\nmorphou\nnnr\npepperell\nemetic\nkincardineshire\nscissorhands\nsouleymane\nblainey\nwapello\nprete\nthewlis\nbick\nrohm\nmullard\nthoughtfulness\nreintegrate\nsplintering\nadx\nwhitepaper\nkardar\niestyn\nentremont\nbeza\nlaminae\nmilitaria\nsongdo\nmorville\npottawattamie\nlampman\nyaz\nmontgolfier\ncoso\nazeez\nthz\nmandrel\nanarchistic\nperistyle\nawaaz\nazari\nnaif\ncdos\nistiqlal\nddi\nddl\nioannou\ntoshiki\ntaru\nfizzle\ntrb\nturman\nstarlite\ntransposon\narnolfini\nastronomically\nrosalina\nbandh\nbombast\ncassiano\ndialogic\ntyrus\ndemus\nyv\nmonceau\npoète\nosmonds\ndamara\nsedgman\nleino\nboulle\nlatgale\nheizer\ncosmin\nrawkus\nbobbins\nenmore\nmanz\nplatten\ngreenman\nryles\nminutia\nsupervalu\nnurtures\nbernards\nblockages\natalante\nclansman\nshuns\ntamora\nkusama\nfiglio\nhookworm\nexistences\nibaf\nchrystal\ntsf\nundesirables\nsneezes\nminahasa\nkelemen\nmarica\nrotund\nreh\nmwr\nductal\nvalbuena\nconjurer\nbarfly\ngrossmont\nhochuli\nputa\npriyadarshini\njingu\nsidewall\nanimates\nrch\nheatherton\ndunstall\nshillelagh\ndonis\nplayset\ncheez\ntitlist\nsolferino\nsingalong\nluckiest\nkossoff\ncrespin\nwillets\nsotiris\nshames\nakina\nhup\nbluray\nforsyte\nmcmullin\nffv\nparfum\ngothamist\nmallaig\nzapping\nfrauenkirche\nbraswell\ncopan\nstier\ndextrose\nwhiteout\ntroves\ncalatayud\nsakina\ntissier\npertinax\nrosada\nfoxley\nneglectful\npurposive\nrifaat\nlakebed\nbiotechnological\nbro
omhill\nfreier\ngleaves\npriester\nchevrier\ntextually\nrambunctious\niccpr\nwalle\nmvm\nbluenose\nsprightly\nreprogram\nstull\nasparuhov\ncwo\nishaan\noverspill\ndiabelli\nesco\ndamselfish\njanae\nissachar\ndrilon\ngiardia\ncentos\nnacac\ncorll\nritsuko\nstucky\nmarivaux\npuncak\ndebakey\nbernama\ndiscotheque\ncaudill\ngpt\nmeshing\nkrishnam\nchucks\nsolex\ndood\nmassari\nhamadi\naricia\nprestbury\nhillview\naudibly\nanthropoid\nftw\nniqab\nwean\nrawling\njbc\nmisspell\nfrancona\njammy\nsubacute\nprotist\ngkn\nsirt\nskoll\nfot\nkellar\nfarinelli\nhumdrum\nnobly\nclockmakers\nbupropion\nalders\ncoproduction\nhazeltine\nlymphoblastic\nboger\nbreslow\nehealth\nitemized\ncappelletti\nolentangy\nguilderland\nkushboo\nagn\nkamu\nmult\ntesh\nzukerman\nfoca\nmethode\nvahdat\nbrewton\nhailstone\nkilly\nmoretz\nburros\npelagia\nlockridge\nllan\nszpilman\nchex\nmackenzies\noth\nxylose\nfarouq\nocracoke\nnutritionally\nizabella\nparisot\ninsan\ntouchwiz\ndonisthorpe\nfreischütz\nreification\npoliakoff\nmilanov\nbth\nrech\nkitahara\nskiba\nchasen\ngranier\nrijal\nstereos\nocmulgee\nrehberg\nfinra\nlimbed\ngrins\ntomos\ntamasha\ngreipel\nunceasing\nrescission\npresentational\nreffed\nbartering\ndowndraft\npetersberg\nghosting\nshatrughan\nweismann\nnightwatchman\ntonia\ndehaven\nirna\nfogs\nqun\nimmunosuppression\nsimkins\nabrasives\npeth\nturnberry\npolarizer\nsiya\nfyne\nzeven\ndesiccated\nlifehacker\nrajini\njuntos\noury\ncytomegalovirus\nabertillery\nhaves\nmockingbirds\nhepatica\nchristiansburg\nbrith\ngrandjean\nflicked\nbettenhausen\nfedak\nconnaughton\nlithographers\nfasci\nprakriti\ndilate\naycliffe\nauda\nragnvald\nshirelles\nnerissa\nisaksson\nredheads\nbrumfield\nseemann\nnantahala\nsustrans\nasko\nleberecht\nccds\njining\nnrcs\nmotwani\nvsm\ntaille\nhorsehair\nseverson\nressler\nmutilating\ncagan\nvandermeer\nbagatelles\nkennebunk\notv\nlangport\nhuffpost\nrabbids\ncaecilians\nakasha\nfolland\noverdosing\nbarclaycard\nhollowell\nyoro\nwouk\njovana\ncassville\nmicros
econd\nmakeovers\nargot\nberita\nhildy\npetronia\nniguel\nluda\nkotecha\nventurers\nkorba\ndecamp\nmerope\nwhigham\nparticulier\nacela\nsudeikis\nburzum\nquietest\nkhakassia\nscorcher\nchomp\nnuvo\nbronislaw\nalessandrini\nhiddleston\nneuendorf\ngenuineness\ngraal\nholabird\nmanalapan\nsoftshell\nculbert\nkayin\nuart\ntriveni\nuzès\nprieur\nbulwell\nralls\nagenesis\nkaden\nclackmannan\nheraldo\nupstarts\nsnedden\nakamatsu\nagreeableness\nbryony\ndepletes\nrealpolitik\nmagnier\ncommandeer\nmalic\noie\nlandcare\ncompensations\nyaniv\nnung\nmarsyas\nmaquoketa\nfruited\nchandana\ngriddle\nbeihai\nhofmannsthal\ngradisca\neyelash\naccardo\nfrayn\nmcalpin\ncolcord\nmisako\ncourvoisier\nseismically\ndehner\nunrealistically\nreignite\ncomely\nomr\nbolam\npracticalities\njamelia\ncrispo\nlitigate\nsauve\nleadon\nkislovodsk\npropoganda\nweatherill\nunbanning\nunionize\nstandley\nchump\njournaling\nmessin\nfailsafe\ndunlin\naluf\ntakapuna\njagran\nhenwood\ndenizen\nreinsertion\njhon\nurbi\ntriangulated\nolov\nexecs\nbroadfoot\narbat\niif\nassiduous\nboulet\ncosh\nhalston\ntailpiece\ngazipur\nxuchang\ntischler\nquy\nharjo\nstoryboarded\namann\nroser\nhairdo\nbaoshan\nschmaltz\nonodera\ningold\ncrooning\nkentuckians\nejections\nrambutan\npembleton\nexemple\nmento\nresh\npressly\nlynbrook\nmuddied\nsaji\nfierstein\nshortridge\nysabel\nplanetoid\nnapoleone\nladislas\nhillenbrand\nmeus\ntamanrasset\ntrialling\ngustin\nwestray\nlienz\nviens\nnft\njouy\nkunitz\nmotivic\nmispronunciation\ngasping\narsi\npicken\ncopiously\najinomoto\nlfg\nserpico\nshediac\npesto\ntaeko\ntaar\nmoff\nncube\npulque\nrestrains\nkittatinny\nkhmelnitsky\ncheekbone\nglobosa\ngéricault\ndbd\nmelodeon\nshankland\nbellemare\nangstrom\ndudbridge\nbrownrigg\ninzamam\npartials\nhoreb\nfigureheads\nnorthlands\ncondes\nkpm\nmusselwhite\ncoalbrookdale\nrutkowski\nveloce\ntallent\narene\nlauretta\nproselytism\nagim\ndavidii\nreincarnate\narcas\ndaal\njuristic\njaana\nsinga\ngoins\nwannabes\nbriskly\nhudd\nfunicello\ndil
awar\nexpressible\npoinciana\nyaks\npayet\ntoiled\ncannonade\nseiki\nlevodopa\nplatina\nvhdl\nbeppo\niguchi\nqar\nnewsasia\nhuizenga\nshak\nsvengali\ndepatie\nshurtleff\npumpkinhead\napace\nnatas\nclaremore\ncwi\nsnecma\nwollaton\nteofilo\nmcmenamin\nbusied\nbugey\nsympathised\nyurok\nleboeuf\ncatty\ntambura\nreidsville\nkerrie\nasymmetries\nmcnairy\ngelasius\nwinchcombe\nuccello\ncheckup\nhesperides\nmajordomo\nbaris\narrl\nninette\nrelaid\numer\ntamilnet\nconsumerist\nthekla\narmide\ndaubert\ncongratulation\nbeaucoup\ndhafra\ncorto\ngiantess\navo\ntranscanada\norphaning\nkhanewal\npenetrations\nfuseli\nyearn\nkuczynski\njordanaires\nkehr\ntahara\ngayoom\nnabf\nbynes\ndivulging\nosteology\ngomti\nlofted\njonquière\ntoch\nwoodrat\ntaddei\nmorath\nmunnar\nserzh\ncartouches\nfomented\nmbabane\nsaionji\nureter\nraindrop\ngrodin\nforager\nzio\ntobler\nholier\nexactness\ncontinence\nbjk\nmurph\njohari\nmehsana\ntyreke\ndukinfield\nprobiotic\nmarleau\ntevye\nsealink\nkurowski\nappropriates\njosefine\njitters\neldr\ncornes\nmillbrae\ndesultory\ngimignano\ntripitaka\nzihuatanejo\nbillowing\nberbera\nlyotard\nnewburn\nsharecropper\nsoltani\npuzzlement\njasa\nherminia\ngladness\ncramm\ntroodon\nchayanne\ncalendrical\nmeckiff\nmuerta\nsalamon\nriebeeck\nhunstanton\npoitras\nlodha\nnedney\nshadowlands\ncorvina\njurat\nmaen\nbirbal\nravinder\nbimetallic\nmarney\npeeking\nlovemaking\nciliate\nmonarchic\namick\nmichelotti\ntomek\nshesh\ncowritten\nvaibhav\nberling\ngrabow\nflicka\nberberian\nderozan\npompeian\nricca\nvaso\nmiffed\nleaven\nsangrur\nbtk\nsalchow\naosdána\ntaveras\nlather\nlongstreth\nzogby\ndaiwa\nzafra\nsolum\nmuhly\nhorstmann\nmcmann\nkochanski\nkibera\nakaroa\nkrynica\nbigeye\nradials\niccf\nstargell\ncezary\nbumi\ntomica\nvenancio\nkeadilan\nviscoelastic\nbeanpot\nonet\nthyra\nobregon\nwaxwings\nvava\nabrasions\nfontenot\nmontecatini\nyorkists\nseru\nlearmonth\nszilard\nsalou\ngunslingers\nwhalebone\nronco\ndansville\nblay\nmisheard\ntakayoshi\nsidecars\ngignac\
ndeanne\nlessees\nairto\nmassasoit\nwellhead\ncatalyse\nrosemead\ncergy\nxbrl\nvfc\nwasilewski\ncounterarguments\nshowrunners\ncrackpots\nxanthus\nnapper\npendennis\npupillary\nfluvanna\nmadhesi\ntimisoara\nchel\nfrenchwoman\nnewmont\nerratics\nseverini\nmullica\nmacari\nvágner\ncirilo\nneilston\nhollerith\ndualist\ncollegehumor\ntogashi\nravidas\nclelia\nwacko\ngarcinia\nconcordant\nkft\nrgm\nmirwais\npodiatry\nbridgit\nmorishima\nkhujand\nmugham\nlinhares\nequis\nufw\nbarla\nschrute\nhdc\nslights\nbayonetta\nmcilwraith\nnoboa\nweaponized\nyoshii\ndefecating\nurc\ntannic\nlaidley\ntufail\nstettler\nfaneuil\nchetumal\nmarshalled\n∆\nelectrocardiogram\nberchtold\nmammalogy\nkelpie\nmurgatroyd\nspingarn\nzuluaga\nboven\noligosaccharides\nkile\nmoins\ncrv\nfakenham\ntener\nconvertibility\nballas\ncurr\ngulbrandsen\ntoren\nhulett\nchuka\nhrazdan\ndessel\ntamarix\nbudgerigar\ngendai\ntransience\nbrecher\nremigio\nsostenuto\nmccotter\nsemih\ndubreuil\nellul\nclavel\nselinger\npomus\noverconfidence\nweka\nishita\npardoning\npinstripe\nvigneron\ncanetti\noutwash\nflatmate\ngeertz\nfledglings\nstreamlines\nfintech\nslavishly\ndámaso\nturville\nclasico\npeloponnesus\ncentauro\naion\noverexpressed\ngilwell\nkharlamov\nmarstrand\nassemblymember\ndemerged\nsexta\nmccaslin\npestis\nhusted\ncloke\ntrumble\nfeta\nmotorhead\nkrc\namidar\nfendi\nmanlio\nlarbi\ndadt\nbrucellosis\nundergarment\nvalda\nshevchuk\ndexamethasone\nurethane\nuncluttered\nstapled\nnyah\nrpms\nachillea\navaliable\nrinn\nfarting\nreformulate\nnueve\ngiti\nlamda\nbandwidths\ntoughened\nmusiques\ngainsboro\ndiagrammatic\nnosebleed\ndesafio\nexocrine\njoselito\npilsener\nkoivu\nnonaka\nhaseltine\nshigenobu\ndjm\nganas\nresolvable\nimatinib\nchil\nkatelyn\norbicularis\nregla\nchaumière\ngensler\nunscom\nhouellebecq\nbalaj\nfrenzel\nolf\ngurnard\ndandi\nkentville\nbreckland\npiratical\nfreudenstadt\nboehmer\nbellatrix\nlelia\nechevarría\ndefrocked\nbuilth\nbayous\nwallechinsky\nboies\nnoticiero\neleuthera\njingo\ndi
sclaim\nmaritain\ncolourist\nouessant\nsarn\ngargamel\nsangallo\nwyrd\nfildes\nlamento\nluken\noutstation\nwomad\nwindup\nmaravilla\npollio\nmazzoleni\ndiscoverable\nkitzmiller\nsaintonge\nepitopes\nhomesteader\npanzers\nkrabbe\nuninstalled\nriffle\nsrd\nrff\nbilo\nhernias\nswanley\nflickers\nmotorman\nparaphilias\nchampasak\nanthropos\najdabiya\nproofed\nneurosurgeons\nzamzam\nneurodegeneration\ncrytek\npaddler\nyasu\nwurz\navb\nmcenery\nbrennaman\nalloyed\ncardano\nkilljoy\nshahak\nbookends\ncampagnolo\nminns\nosoyoos\nduino\nlevay\nmeteoroid\nsahin\nkretschmann\nzeena\nscarry\nconvalescing\nbaladi\nbreadcrumbs\ncarnwath\nearthing\ndingdong\ntinsukia\nsessa\ncharro\ncorrin\nheidler\ntoastmaster\nmastro\nokon\ntoreador\ngrackle\nwetsuit\nweirdos\nthala\nisoroku\nbiratnagar\nmunari\npartakes\nitam\ntarlton\nantenne\nneurofibromatosis\nsilenus\nistar\nsoth\ncannibalized\nchalmette\nactuate\nswithin\nmelcombe\ncnp\nrotenberg\nirrigable\nprepubescent\nhurrying\ncathepsin\nhonoka\npellow\nsimonyi\nbehzad\nbloomed\nkolya\nfujimura\nibooks\nglatz\nalber\npuccio\nrelaxant\nboxall\nupgradable\ncharissa\nuele\nintertwine\nstormbringer\ndushku\nshiners\ntudeh\noranjestad\nschonberg\nheadways\ntup\nargyros\noodles\nselzer\nlidl\nshikarpur\nbroadstone\ndaiko\nbehnke\nhorning\ngiampiero\nabductees\ntropicbird\ngosset\ndoggedly\nmmf\nmedaglia\nscouse\nhsg\nyasmina\nwolfhounds\nforklifts\nwilletts\nsrk\nrikers\npiscis\nneurobiological\noverhill\nexotics\njizan\ntalamanca\ndappy\nstardock\nkyser\nalpinist\nundulations\nwacha\nkinkade\nfincantieri\narevalo\nnelsons\noncogenic\njuxtapositions\nkyodo\nmillcreek\nstiffen\nmanohla\nbrehon\ngirt\nransack\nfiorenza\ntapley\nquora\nfalah\nitd\nkomsomolskaya\ndappled\ntsuruta\neuroscepticism\nreciter\npneumococcal\nimpugn\nocn\nblackshear\ncampari\nnovin\nhazlett\nyahtzee\ncreeley\nharbi\nbescot\numpteenth\nktv\nmontazeri\njocelin\nrothbart\ntailwind\nmaros\nnardini\ngalaga\nkelaniya\ndowningtown\narava\ncheatin\nmutters\nculham\nbenzo\nbar
peta\ningall\nkeays\nbroncs\nwalcot\nsundiata\ncattermole\npingat\ncryopreservation\nabboud\nconfiding\nblackdown\nkushi\njobbers\nmehrauli\ncevallos\nauggie\nmisch\nguestbook\nscientism\nberkel\naarón\nmorrone\nandrogyny\nsuppositions\nhenslow\nobsessing\nspion\nhatakeyama\nouthouses\nbierman\nunpatriotic\ncompas\nkraal\nzahl\noogie\ncitrine\nhoopes\nclimie\nnarodowy\nfuertes\nwotherspoon\nrezende\nmunt\ntyrannosaur\ngoll\nlynskey\nwiddrington\nritch\nquirinal\need\nalexie\nduse\ntorvill\ncess\nzosia\ndeactivating\ntaunggyi\nklaudia\nbruisers\ncck\nfloes\npossessors\nkazuaki\nnauseating\nsoldiering\nvallone\npronger\nstrohm\nhortensia\nmbale\nworksheet\nkaspiysk\ndespard\ntrabuco\nmehl\nazharuddin\nobservatorio\nlincolnton\nidh\ncoiner\ndrechsler\nvizard\nwollen\nstuntmen\nahlquist\njuhan\nkinghorn\nkodai\nmabon\nzahara\nsteadicam\njuhu\nikan\nakko\neberly\npapists\nhamdard\nmeineke\nwiglaf\njavon\nplumlee\nemcees\nfrowns\ncaperton\nquestioners\ncapshaw\nmutilate\nsohag\nvoltmeter\nvomits\nhasek\nkraig\nzel\ndugongs\nhubie\nbansi\nshanter\nlacaze\nphosphorescent\ngueye\nmillon\nredlich\nbegich\nparp\nzibo\nsuperstructures\nmulu\ngilg\nmuzyka\ntetyana\nrhymer\nhinter\ncioffi\nsalaf\nriaan\ncaister\ngoober\nbaras\norbe\nhilfe\nkouvola\ngansevoort\ncherney\nrameses\ndoerner\nantonieta\nsiring\nhosny\nkixx\nbeadwork\neurocontrol\nnesher\nskipwith\nbeccaria\nsamata\nscatological\nkie\nripeness\nlowrider\neconomía\nesler\nelitexc\nstringently\nmadhubala\npictogram\nevangelized\nkettlewell\nbermudo\nquoll\ndunderdale\nhii\nfmd\ngrachev\nschadenfreude\nreme\ngluconate\ngunga\nislamisation\nwheatsheaf\ntronto\nsnf\nxiaobo\nfiemme\nbodnar\nleisler\ndeodoro\nfinasteride\npreps\nwurster\nprimeau\nhamley\nweihai\nfrictionless\npersecutors\nwhas\nwillner\nmiti\ngombrich\ndissensions\nheidt\nnibbles\nmaccormick\ndenarii\ncanosa\nsorbitol\ncramond\nbalki\nmelman\nakizuki\niacob\ndecapitating\ngreenlaw\nchangde\ndongshan\ndallenbach\nnathalia\nbagshot\nbodyweight\nlangeland\nsunith
a\nwizzard\npersonifies\naravalli\nstayner\nahidjo\nthermos\nsaca\nhenningsen\nkanban\nfonsi\nshii\nmultidrug\ncorradi\ncephas\nkitajima\nuhr\nreinet\nsmersh\nbumiputera\nsunbelt\npolonius\nsubmillimeter\nclevenger\nmester\ntuxpan\nmaden\ncalibrating\nmoradi\ntfi\nnegativland\ndelvaux\nshiratori\ndulac\nsfor\nsaifullah\nrpe\nidr\nscotians\noppen\nsincil\ngarabedian\nsophomoric\nvbr\nheman\nbrightwell\nfleishman\nrenegotiation\nhamiltons\nwestway\nochilview\nschaerbeek\nfairhurst\norchestrates\ncopyists\nkarki\nkipnis\nrumson\nseigneurie\nplainchant\nvivere\nnanu\nhankin\nnanhai\njokey\nchancey\ngarlin\ndecriminalized\narabization\nlepe\nkrane\nolatunji\npoupée\npsychically\nnunavik\nafdb\nresurveyed\nandro\nqsl\ndvp\nzzap\ncompson\nmacfadden\ntuckwell\npolidori\nzamfara\nsju\nshisha\npaswan\nringgit\npavitra\nneste\npierrette\nbarbecued\ndemis\nshug\npericardium\nthermostats\nlithophane\nsamay\nseeped\nraymer\niin\nfloppies\nmanasquan\narchitraves\nsyntagma\ntransfection\nsaltaire\nfarriss\nmclaurin\npetronella\nwinckelmann\nmetzner\npanchkula\nmaikel\nwkbw\nballed\nlaudanum\ntirades\nblc\ncascia\noutgrow\naeron\nmahalaxmi\ngoalpara\nmartis\naqi\nparwan\nsubcamps\nclic\nicam\nracoon\nrepeatability\nhutten\nmemoire\nnorgay\nstrep\ncheh\nromesh\nspellbinder\nbedwell\nbudha\nlambada\ncheckmark\ntifton\ntinney\nsmbg\nkearse\nenviron\nsealants\nbettye\nhasselbaink\nuncharged\ncetina\nolah\nperishing\njeepneys\ndahir\nsadducees\ngournay\nextenders\nannet\ninness\niag\ndalmeny\nclamoring\nalizadeh\ndettingen\nglobin\njèrriais\ntarboro\namicale\nnodosa\ncomscore\ntica\nosteoblasts\nlangland\nphilosophes\nbefallen\nsneering\ntrémoille\nbiti\nsurmounts\ngellert\nbonafide\necovillage\nhmmwv\nufs\nseule\nmoltisanti\nunperturbed\nmoinuddin\nartmedia\nmiyahara\nfronteras\npoko\nderiding\nmerola\nguenter\nimmunized\nngor\ngasherbrum\ntypists\navoidant\ndowel\ndetmer\nnancarrow\nplutarco\nmuseology\nlopo\npremodern\nringworm\nixa\nlindauer\ngesher\ntsangpo\nplatitudes\nsawfish\nsek
hmet\ntaye\nramification\nmisshapen\ninformix\nbrightcove\neasters\nurundi\ntpt\nmissus\nferenczi\nangrier\nbsr\nraich\ngriquas\nbronwen\nbarot\nmyka\nalur\nhistologic\nsangin\nprecident\ncge\nbarrault\nferg\nludford\ncaffeinated\nyaacov\npions\nmarieke\nketan\nprudently\natlanticus\nhoaxing\nlicentious\njola\nsalameh\nbohs\narboleda\nskirmished\ncarnelian\nelwha\nnarmer\nacct\ncrenata\npleurotus\nillicitly\nnewburg\nsarti\nyajima\nzetec\ndorrell\nbakes\ncampione\nlims\ndarvill\nfreeride\nanmol\nudai\nunquiet\nfaulks\nserafim\nrighted\nhoz\nkotto\njuggalos\nflailing\nmasonite\nabdirahman\nvarkala\njoongang\ncoldharbour\nnemi\ninada\ndianthus\ntorturous\npulford\nerrr\nshorthair\ncuddles\nkumo\nespenson\nanthurium\nlipps\ntaraxacum\nbadghis\ndratch\nflintridge\nreccomend\npaedophiles\nbeedle\nminch\ngue\nzoro\nhomeopaths\nyining\novalle\ntct\nballinderry\nillich\nsuperga\nonkel\nspearmint\neppley\nimmunisation\nparapsychologist\nminne\nbakary\ngroome\npantani\nmoorei\ntevaram\nrateable\nmatsuzaki\nboijmans\npietist\nsamachar\ngubbins\nsiecle\nelectroclash\niria\nalwan\nimpious\ngrothe\nwimberly\nconsecrating\nbromma\neglwys\nklin\nantisemites\npanamint\nparasitized\nrijswijk\nlakefield\nmatangi\ndrom\ntouquet\nsublingual\ncharice\noutperforming\nszell\nhakon\nhenlopen\nmicelles\nkisch\nliev\ncappy\nfoltz\nharkleroad\nkahlil\ncathryn\nlubricate\nthrenody\nbearsville\ntuthill\niou\nspasmodic\nnietzschean\nhookup\nbarthe\nprathap\nfantasias\ninflamatory\nminhaj\nbarroom\nepiphyte\nmarksville\norganophosphate\nsundby\nlydd\nkindertransport\nincommunicado\nsportsplex\nhuguette\nikuta\nlachance\nsalvatori\npeterkin\ncomair\nmco\nconnersville\nslavonian\ntrova\nmaterializes\ncuriousity\nrla\nclf\nsabathia\nnahant\nglowworm\ntooled\nethelred\nsandvik\nborwick\ndears\nmals\ntarascon\nlarmor\nhodel\ngearhart\nklong\nklebold\ndamascene\namphitheaters\nmalarkey\nmimed\nsmouldering\numstead\nbearman\nbulrush\nwigg\ndenniston\nfoos\nweibel\ngovardhan\nvacuums\nkokanee\nrnp\nkensic
o\npratima\ncurassow\ncorpuscles\nportables\nhedayat\njasminum\ngiraldi\nsomer\nwaistband\ncorcovado\nchansonnier\nersan\njaymes\nuvs\ninsufferable\nbevington\nbages\nklar\naliyu\nbestival\nbloodhounds\nmercifully\nliftback\nstriked\nstolypin\ntakasago\neffectual\nlouvred\nshandon\nmegalomaniac\nhutus\nteleology\nhammy\ntehelka\nbrazenly\nsuess\nbennelong\nmasten\nsteiermark\nnasiriyah\nligia\nquantock\nmatildas\nembellishing\nuffington\nclicker\nchira\nneetu\nquartett\nissam\ndinozzo\necurie\nclued\nnederlander\nmakonnen\nosterley\nwieczorek\nsisa\neggshells\nfritters\ntakla\nkumeyaay\nmillhouse\nvasconcellos\nselvin\nplasterer\nmcfeely\nmolesters\nwickenheiser\nbandmembers\nhirado\nlaurents\nbodysuit\ncbssports\ndumaresq\naronian\nvolitional\nkinch\nclar\njica\ngroundnuts\ngasparilla\nbalaram\nhabanero\ngalati\ncolloquialisms\nfandorin\ngelugpa\nlyse\nmethicillin\nwier\nsiphoned\ngenderqueer\nvoilà\nmadchester\nbirdseye\nobstinacy\nsprawled\nweightman\ngouges\ncommonweal\napsa\ntereshchenko\nwolski\nnuvolari\ngaris\nikot\nsundew\nlalli\nbroomsticks\ngrahams\nawp\nwrangle\nwhitelisting\ngodhra\njustino\neffacing\nhoyts\nmassimino\noedema\nantheil\nsulejman\ntangi\nthermidor\ntopal\norbcomm\nvolkskrant\nberthoud\nvelden\nkoryttseva\neggar\nbartholdy\nqana\ntopley\nsandbars\npratama\nostmark\nmprp\nmadaba\nchahal\ndemonize\nguiteau\nkathrine\nblissett\nsubordinating\nsammon\nadamou\nleblond\nsisowath\ndorey\ngroby\nvigneault\nnovarro\nalsager\ncoba\ngrandville\nazarbaijan\nscrapper\nhartshorn\nkrämer\nchéreau\nsugarfoot\nintegrins\nkeech\ninverkeithing\ntitchfield\nrossman\nmartos\nheadhunting\nfrontbench\nderegistered\nectoplasm\nreassessing\nharems\nharten\nprofaci\narvydas\nfabris\ndollywood\nbanega\ndebutantes\nyildirim\nquilty\ntzur\nbreastworks\ngiverny\narrestor\ndecapitate\nrossouw\nstrategical\nviertel\npsychodrama\ntroitsky\nuladzimir\nkamphaeng\nortmann\nainley\ntransmittance\nhyperdrive\nmckibbin\nponchielli\nczolgosz\nstablemates\nsindi\nadresses\nlancey
\nbagno\nlutetia\nsolemnis\nwbal\nschiffman\nreflexed\nbundesbank\njassi\nsni\nchurchwarden\nrievaulx\npodesta\neade\nkers\nboatwright\nmassifs\npushback\ncolet\nportofino\nsummerhouse\nexiste\ntriggs\nfigg\nfacepalm\ntraven\nroelofs\nheiau\ngooey\nenger\nillmatic\nzhangjiakou\nadwords\nterephthalate\npartes\nsinclar\ndragonforce\ngravitating\nantihypertensive\namerie\nshivalik\nkotter\nrepents\nvassiliev\nbeaufighters\nnastassja\nmolins\nliberality\naizen\nsondre\nnymphenburg\nfauves\nmcmorris\nkoharu\nyoukilis\nputz\nashmead\nwestergaard\nabulafia\nafrocentric\nhokum\ncrematogaster\nhaluk\nhandcuff\nrambaldi\nmediumwave\nmorlock\npdd\nguen\ngüell\nbohan\npeaceably\nseika\npayed\ntages\nkogen\nlarkham\ngrayskull\ngezi\narkan\nlamé\ncoldcut\nmindscape\ngorden\nsettembre\nlillis\nbergamot\nmarchisio\nmcilvaine\ncolberg\nkushtia\naquilino\npeco\nspirulina\nbuffalos\nfosco\nviols\nferrarese\nragweed\nloudermilk\njik\nbwp\ngrazers\njogo\nderi\nkachina\ncorbières\ndicke\nmcginness\nflybe\nsyke\nhalogenated\nlawgiver\nbreit\ncuss\nvesti\ndefilement\ntinkers\nhornbeck\nairshows\nknowlege\nsegreto\ngittings\npenha\nkelham\ndicom\nbathymetric\nllanrwst\nkutner\nhughley\nalmo\ntheodolite\nflavell\ndiwa\nserpukhov\nzante\nilmenite\nreprocessed\nhughson\naptn\nchristies\ncappello\nbienvenue\nstourport\nnipsey\nnein\nscalding\norania\nreenactors\nakkerman\nmalleson\namchitka\nwithlacoochee\ndanke\nscottsbluff\nbellisario\njanda\nstrongpoints\ndemure\nebru\nsagem\nnonconforming\nsml\noffertory\nbiaggi\nclimes\navett\ncrawdaddy\nexternality\nwitbank\nsuraiya\nnaturwissenschaften\nkohs\ncritchlow\ngrammatica\nkishin\ncatoosa\npolluters\néclair\nbronzino\nunequalled\nclaymation\npoppers\nreavers\nmarroquín\ndemopolis\nmianyang\nplumtree\norlan\ncastelfranco\noros\nprindle\nhassles\nflorescu\nknoblauch\nbastl\naengus\nweah\nkelmscott\noozes\nanqing\nbaling\nexpectancies\ntercer\nshuttlesworth\nmanzi\nulva\nwench\neurop\nhilder\nbasicly\nbolitho\nrinat\ndoublets\nderide\ndrakkar\nyrek
a\nscid\nishizuka\nmapmaker\ntagliabue\ntognazzi\nmli\npalamu\nnadkarni\nremarque\nanyplace\nhyborian\ntribbles\ntrundle\nlouvers\nsilloth\ntechnocrat\nbernardus\nbrüder\nbeautified\nstram\npetrine\nsips\narmond\nwhiteway\ncinematics\nsteuer\ncarper\nmaso\nhia\nparaclete\nreshot\nboselli\nrealnetworks\nstim\nnoce\nfaceplate\nhatchbacks\nchamberlains\nbartowski\nshahzada\nmadelyne\nevolutionist\npondok\nkogo\nlaskin\nschlechter\nsaravia\ndeers\nangerer\nmazzone\nhomans\ngfx\netu\nliveliness\nmodbury\nmcgahee\nmowed\nchaput\nnonexistant\ncontractility\nvillarrica\noutsmart\naquin\ncongregating\nshakeup\nsippel\nlagonda\nparamahansa\nengulfs\nbilberry\ngripper\nsammie\nskansen\nleatherneck\nyawl\nsason\nposo\ndisembarkation\nmarchena\nparkhill\nsacredness\ntejeda\nredemptorists\npremji\nvecht\nleonia\nlourens\nbashkirs\ngratuity\nstyler\nstagflation\nautoantibodies\nabderrahmane\ngadwall\ndetests\nwegmann\nksr\ntabernacles\nthubten\nindolent\nesmaili\nzalm\ngeneralise\nlyfe\ncolclough\nidées\nmegalomania\ngremelmayr\nhitlers\npistachios\nloblolly\nmclarens\npetterson\ncheruiyot\nhaematology\nsaima\nwordiness\naikens\niee\nhoobastank\ndisdainful\narteriosclerosis\npanahi\nigboland\nhealings\nestée\nhygienist\nhomma\navendaño\nbnt\nkhs\nbasma\nbazalgette\nbulba\ncomper\nkadai\nunimpeachable\ndbp\nkiichi\nglf\ngranulocyte\nsheeted\nsleeker\ndoetinchem\nscrutinise\nreissuing\nsalutations\nburian\ngrigoriy\nexalt\nreciprocally\npublicis\nretooling\ncolchagua\nmiroslava\nminidoka\nwiske\nkingussie\ncaskey\nblenheims\nnarsingh\nshutdowns\nsidonie\nthio\nyoshihide\ndeputised\njillette\nhandke\ntenaciously\ndeary\ngiambi\nporthole\nbouin\nseiken\nbefuddle\nacrid\njudds\nsowa\nnogi\narmd\nmaricel\nmaidana\nirm\nsaperstein\ngevorg\nmicrochips\nservlet\nwason\nconteh\ndocx\njanatha\nuco\ntitanate\nsanuki\nagrégation\nporton\nlovelorn\npickaxe\nfaridkot\ncaplin\ndecease\nronchi\nindrajit\nflixton\nbergdorf\nkautilya\nselección\nsympathizes\nlinesmen\nplainsong\nbatey\naver\npeñas\n
streptococcal\nilb\ndacko\nretractions\nmayerling\nsinghania\ntarana\ntroels\ngarang\nshaef\nardis\nlaflamme\nwelz\npilsudski\nyakin\nserological\nbenni\nneedling\nophüls\nquieted\nbort\nprzewalski\nhailwood\npépin\nglasson\nmohair\ncockles\nwinckler\nlettow\nlouisiane\narng\nshaku\nlucidum\ndalea\nnanook\nboardwalks\nartform\nforgan\ngameiro\ndunnett\nlpf\nsteg\ndiamantidis\nsalutary\npipped\nmowag\nblueline\npkb\nedmontosaurus\ngirouard\ntournier\ncunnilingus\nkrugersdorp\ncrowing\nskarbek\nmumbling\ngeoengineering\nsamuele\nliquified\nkaling\ntroponin\nhenao\nyiyang\nvisors\ntriumf\nviersen\ngerusalemme\nnetter\npiute\npalk\nnougat\nscaramanga\nsesil\nmcchrystal\nramel\nsmokestacks\npsps\nackworth\nlogit\npujari\ninvergordon\nelmet\nterracina\nfailover\nchena\nyolngu\nprêt\nphotographically\nbandshell\npopularisation\nwhipper\nyoshiro\nblackball\najar\nufp\nzarb\nstadthalle\ntaneyev\ngrünbaum\ngrigoris\nrots\nbruk\nphin\nreattached\ndatalink\nparseghian\nvillarroel\nramadhan\nlèse\nsequoias\newer\nmorinaga\nrsx\nwhitewood\nannoyances\nsoloviev\nwootten\ngovernesses\nmultifocal\nchachapoyas\neave\ndelavan\nsloppiness\nlegitimised\nmaisel\nsvenson\nnais\nnicked\nboyett\ncyclorama\ngabino\nundressing\nsobs\navey\nluzhou\nsantería\npeeves\nanzani\nrhoden\nacra\nduffus\nyossarian\nsouthbury\nsaam\ncuoco\neilers\nreenacted\nsuperintendency\nguma\ncharango\ntreasonable\nlatu\nmitsu\nachenbach\ntahan\nryoji\nverkerk\nreproductively\nappliqué\nmusicus\nobihiro\nmeen\nshalva\nrtt\ndoiron\nprobyn\nindignity\npricey\nchernenko\naudios\nbransfield\nisna\ntetbury\nmischaracterized\nthyssenkrupp\nmcindoe\nmimmo\nsarov\nhemorrhaging\ngerrold\nkestner\nfightback\nroff\nneurodevelopmental\npsac\narques\ngirdled\nguis\naddax\ndefers\nbridgton\nbreitner\nreoriented\nbybee\nextirpation\ninl\npotti\ntudur\nresumé\ngunda\nladbrokes\nsmarties\nclassicists\nfantastica\nbostan\nrondon\nalexisonfire\ndanity\nriyaz\nbaggs\ncresting\nbrandywell\nchrysanthemums\nsuspender\nweirton\nsibugay\na
ncoats\nvertov\ncorno\nwbai\nsaltworks\nkratz\nnccc\nsparred\nspencers\nwinfred\nmykiss\nbilko\nabsecon\npinero\nheybridge\nevolutionists\njse\nwakely\ntimken\nspeeder\nblacking\nthorley\npatco\nsöderström\nmolay\ncrafter\nmigrans\navchd\nweddle\nmcn\nbalas\naliant\nbellenden\nlasser\ncanario\nwmds\ncanoas\ngva\nwhymper\nascites\ntouma\nlovins\nskullcap\nbodyslamming\nwao\nplatanus\nilderton\nprocumbens\nwuling\nshoppes\nverifications\npaulos\nrantoul\njaitley\ngloversville\nradim\nfossati\ntupperware\ncecco\nflorez\nsquirts\nunicast\nbrainwave\nnazari\nkilbourn\nkebabs\ntanguay\nhomemakers\nleifer\nmicroalgae\nkubla\nxiaoli\nenergizer\nspaeth\nwiedersehen\nowosso\nsomatostatin\nrtn\nholidaying\nrucksack\nkatherina\nadg\nboysen\nthoresby\niihs\ntomorrows\nbangash\nnajma\nkroos\ndisbands\nvoegelin\nhigueras\nmaestoso\nsecularised\nmaresca\nyoshitsugu\npolesitter\nduart\nlambo\nsimbel\nfinborough\nepidaurus\nbalen\nzir\nvaganova\nfordson\nhotdog\nbrezovica\ncooperations\nmirsad\nabashiri\nperishes\nrunnerup\nsékou\nlichte\nfixx\ngenn\nbiedermeier\nfretilin\nituc\ncordifolia\nmagos\ncoahoma\nfluoro\nvexing\nknitters\nbowater\nstata\nhuichol\nabsaroka\npreachy\ndiscredits\nmustonen\nmathematic\nperipheries\nholsinger\nkemba\nedexcel\ncpas\npinon\ncasque\nuefi\nedes\naurita\ngogarty\nobst\npurism\ncfn\nhellerup\nwardley\nridolfi\nsoldiery\ninflect\nhkd\nmckiernan\nhoan\ndzhokhar\nwiti\neal\nchilo\nsuat\ncesarean\nbreconshire\ntranspire\nloughner\nmakerfield\nlabelmate\nsylvio\nphiladelphian\nbangaru\npurgatorio\nchamblee\nganado\nlusophone\nprocreate\ncalor\ncantera\nnaki\nsegued\nvess\nspen\nmaltz\nernestina\nheckle\nriveros\ninchicore\narteriovenous\nfrancisville\npearcy\nmisstatement\nfsr\nmintzer\nquercy\navison\nwbt\nkrikor\nlarco\nyougov\nsokolac\nschmalz\nfoulke\nderogation\ntrampolines\norangey\nmutis\nsanchai\nekranas\nhospitalisation\nkalinowski\naldeia\nbaynard\nslavica\nnfr\nplacers\noptoelectronics\ngreaser\nfarewells\ndaina\nquigg\nnavalny\nrumiko\nkouba\nl
apponica\nocto\nclumped\ndynamited\nmurrelet\ntehrani\ncrosscut\nbiala\nretell\ncomradeship\nicicles\narchy\nbronc\nmarcoux\nalstyne\nregale\ncanelo\nbili\nshaddai\nphenytoin\nethnobotany\nemon\nsexed\nkozma\nkasbah\nwesterhof\ndanyang\nmuckraking\ndivinyls\nhermitages\nbetterton\nsalafis\nbyles\nsgm\npten\nkamaishi\nconsoling\nbondevik\nlavant\njorginho\nzahoor\nkayleigh\nwindass\npileated\ncobar\nirritants\ndemagogue\nragazzi\ndiaphragmatic\nleyburn\ncww\nraymund\nfirebase\ndayananda\nwhittall\npierres\nspeakman\ntransgressed\nhaggar\naluko\nputtin\nsportsmanlike\nlymon\nkatchi\nlemaitre\niosco\nvaledictory\nkeflavik\nskiddaw\nmagloire\nulb\nprineville\nreynell\nmainzer\nwithholds\ndiplomate\ngreasemonkey\nmiscreants\nlieux\njayaraman\nyoussouf\nscoffed\nshoebox\nngn\nhazar\nfeiffer\nunhrc\nwebdav\nnygren\nalgren\nhaymaker\narwen\nagencia\ngledhill\njigawa\njaiswal\ndurkee\nsouthdown\nyavin\nvrij\ndietitian\ndalyell\ntylenol\ncasterton\nrhps\nanb\nsexto\nstereotactic\nelastomers\nliles\nmzm\nsamity\nkotz\nragamuffin\nrevelle\ntini\nlunges\nreconsiders\nprincedom\ntooker\ngimmicky\nmonreale\noatman\nrefiners\ncooperage\nemigres\nhuk\njägermeister\nrehabilitations\nlehmer\nkubra\niller\nmanera\nreframing\nclorinda\nmouret\nphetchabun\nlastman\ngering\nhobbled\nbicyclist\nghaznavi\nsigal\nfuhrmann\nnace\npadiham\nlavan\nfrattini\nhoskyns\nhungária\nfahr\ncomputerization\ncoffer\nappearence\nstribling\nchaikin\nmartynas\nhesham\nsard\nleibnitz\nabhaya\nferriday\ninterdepartmental\nbettor\ncampesino\nbishara\nsocialisme\ncontortionist\nlimoux\nteesta\nvola\nokereke\nowais\nbeltrame\nmocenigo\ninterspecies\nwayzata\nsetsuko\ncycleways\nmodifieds\nrinko\nblackening\nbuskirk\nimpe\nrepudiating\nlabine\nchillers\natreus\nhagop\nblued\njaffray\ncanadas\ndumbartonshire\nnsaid\naken\nvisp\ncholla\nstallard\ntuominen\nnonet\nbellick\npersuasively\nmacdermott\npantelis\nkela\nmahbub\nbilstein\nkorkmaz\nkeesler\nmenas\nyorn\nskerritt\ntuckey\nganden\nloansharking\nnandrolone\nsa
hra\nchoquette\nbibel\nahve\ndayana\nevliya\nbussed\ninterferometers\nrosenkranz\nnamaqualand\natropos\nmoncayo\ncarbonation\nharkat\nkippenberger\nstranahan\njfa\nventurebeat\nstob\ntransoms\nqaa\nvoortrekkers\nvernaculars\nnicaraguans\nmonticelli\nxjr\ncanas\ndeucalion\nrappin\nberzerk\nhinchinbrook\npeterhof\ngaven\nezln\nlamprecht\nkidjo\nhedi\ncumbre\nkavir\ndrawl\nseverny\nleavening\nboho\nzoologische\nsidibe\nsoldaten\nintertitle\nstorerooms\njdbc\nbbe\nmwanawasa\nbondurant\nperamuna\ntuite\ndallam\ncft\nvapours\ncordner\ndli\nethnics\nfugu\nreactivating\nphragmites\nrifat\nbowmen\nkulikov\ninara\ncustodio\nazione\nakshaya\nhatchett\nshead\nclonakilty\nmatlin\ntapeworms\npeyer\nthring\nhamma\ncaylee\nbenyamin\nqueercore\nallamakee\nbranscombe\nteifi\nwobblies\nbather\nsiciliani\nkuwabara\nkoester\ndimeo\nkudus\nklugh\nmagor\nkaltenbrunner\nvandel\nvenkatraman\norderlies\nfuka\nabided\nhoak\nrefills\nceramide\ngregan\nsmither\nobliquity\nlevator\nkrister\nleukotriene\nheightens\ndrost\nantitumor\nsonography\naktuell\nmoreh\nzoho\ntadayoshi\ndola\nkakapo\ntaymor\nxanga\ncrêpe\nnoss\nbronckhorst\njiffy\ncretin\ntsongas\nkaline\nneneh\nvaltteri\nmoneyball\nbalancer\nkosinski\nmalki\nlyneham\nbuzzsaw\ncorazones\nhollaback\nnewsround\nchatterji\nkirkaldy\nthorgerson\nsquandering\nsubha\nipos\nclarinettist\nmonetize\nzalewski\nkullervo\naltschul\npudendal\nopportunistically\ncharbonnier\nsymbolical\nbiss\nbarbro\nshirky\nmeteoritics\nfarrant\nbrosse\nseliger\nwiene\nhukou\nbrading\ndieckmann\nstoking\nonlooker\ntorques\nbue\nfuruta\nreponse\nfnla\ntolmin\nleamy\nmatapan\nornithischia\nemploi\ncras\ndecodes\nmonaca\nkenesaw\nbuckton\nvijayaraghavan\nzhanjiang\nzaharia\nfissured\nginde\ngillow\npiera\nlateen\ncfda\nhameln\nbungling\ndismasted\ntpd\nrufford\nsnowpack\nfriedhelm\ndogfighting\nsierre\ntebaldi\nkennicott\nmahendran\neram\neban\nhdpe\ntuvia\nteviot\nneilsen\nlumbermen\ntemenggong\nmajesco\nmattachine\nzaharias\nmodugno\nkulaks\nmajumder\ndiskin\nrebecchi\n
linyi\nextrasensory\ndiskette\nkasserine\nysaÿe\nreproaches\ncosponsors\nvanderjagt\nhlaing\nashcan\npharmacokinetic\naiyar\nunsuitability\nkemsley\ncardy\nshibli\ncurcio\nlince\nismaël\nmulvaney\nascender\nopn\ntulipa\nculhane\nanterograde\nmartlesham\nfontvieille\nmayank\nrelinquishes\neithne\nsaharsa\nlaterals\nflavorful\nquesting\nkeizai\nvmro\ntiangong\ndomestics\nturbonegro\ninveraray\ncampeau\ncompagno\npanoply\nyol\nloneliest\nprostration\nrothbury\nshrieks\nscabs\nsgd\ndouthit\nllanidloes\nstonor\nconcomitantly\nmirco\nloth\ntestamentary\ngladiolus\nstrauch\nbribie\nsorna\nkaempfert\npawlak\nquadrupeds\nisleta\nmasser\nunction\nvolar\nmacleods\nfollis\nesalen\nviognier\ninterruptus\nruk\nevolutionism\nbulova\nfrit\nirregardless\nperham\nsenckenberg\nimbecile\nkost\nspir\ntuffs\nhobos\nchimeras\nfassett\nkast\ntoshihiro\nkolarov\nvalidus\nellas\nquale\ngordana\nijmuiden\npuru\ndanseuse\nottaviani\narioso\nlandslip\nkhoa\niolani\nglobules\nblackmar\natascadero\nmccurtain\nfuxing\ngams\nuruzgan\nlubna\ndownshire\ncracroft\nmargaritaville\npecora\nyos\nshailesh\nrealigning\niio\numbels\nhighmore\ndeliverable\nbunraku\nrodrick\nimmanence\nborrelli\nwtop\ncarbureted\nprivé\nmiscalculated\nrecriminations\ncampylobacter\ndripped\nzhob\ngaria\nlakhan\nmeningococcal\nweingartner\nbyword\nwesthoughton\nstevensville\nbarong\npetzold\neduc\nsvb\napfa\nacushnet\nshinbun\nsingularis\nshaeffer\nnce\nbopara\nodder\neugenicist\nopryland\nbni\nazuki\njajpur\nnationalbibliothek\naleck\nrmm\nrubí\npawlowski\nlazarenko\ncatv\nshoichi\ncoalinga\nendara\nmychal\nmaskhadov\nkeris\nzera\nelastomer\nbroxburn\nbradt\ntakemoto\nreschedule\nkolesnikov\noutmaneuvered\nfalchuk\nfariz\nbrenes\nnewcome\nmalkmus\npavey\nkuncoro\nraintree\nlitten\nhustings\ncharleton\nbehinds\nstatesmanship\npervaiz\nprioritizes\nbiocontrol\npatek\nmattapan\nbetas\nsajjan\nfeelers\napk\nmorbidelli\nsuppport\nseines\narinc\nmulsanne\ntriumphing\narijit\nwestcountry\nhaslingden\ngoanna\nezell\nmakhmalbaf\ncassi
ar\nhersholt\nprogrammatically\nfiche\nbeckoning\nsiple\nriou\nkulp\ndemint\npjak\nditchling\nhuat\nbrumley\nlutfi\nmcdill\njörn\nkatzenbach\nlaminin\nanuar\nsuni\nbeaujeu\ntrespasser\npisi\nsherinian\nhierapolis\nhermiston\nichthyosis\nryoo\nphai\nkwee\nbusoga\nboksburg\ncordiale\noceano\ngreinke\nfacetiously\nhenriot\nrovio\nfilibusters\nnurseryman\nudhampur\nrens\nmetascore\nscarecrows\nsayyad\ngrimley\npreseli\nseweryn\nreceivables\nbonga\nmagnesite\ntoninho\nsadia\nbws\nwla\nethnical\natalaya\ngarp\nrohani\nprickles\nsadko\ntnk\nmossadegh\ngeissler\nnja\ntrussell\ntwaddle\npagadian\nkoreas\ncorm\nskousen\nswipes\nmaterialization\nstanway\nkamakshi\neconometrica\nkeratinocytes\nabhorrence\npreinstalled\nloper\ncernat\nlave\nmartenot\nelderberry\ndelfina\nlovie\ndiabase\nquirin\ndobby\nlexicons\ncousens\nkleber\nbartos\neditorialized\nshinjiro\njaret\nisaías\nhilburn\nfrilled\njochem\npatekar\nbodyshell\ncosco\nokami\nesca\nchalfant\neystein\nslaveholder\nkayal\nkriens\nlotions\nhungaroring\nmattes\nmicroblogging\nzinnemann\nholwell\nslk\nkuusamo\ndorothee\ndahlan\nveiling\ncolocation\nkruskal\npnu\nmatriculate\nkimsey\nhartberg\nfindlater\nbiophilia\ntedd\nhoopla\nkhotang\nperiodontitis\nfeedstocks\ninduct\nferre\ncontrollability\nfayez\nmultiscale\ncruce\nhamadryas\nmattea\ncomposited\nsaudia\ndemonization\nbroadness\nlesa\nmeca\nmirfield\nmaderna\njum\nhumorless\nmapei\ntechnorati\namil\nparian\ngdl\nenmeshed\nmitja\nslutsky\nfastlane\ntotoro\nllanberis\nblaustein\nbwc\ntinos\npiñeiro\nrailroaders\ncharitably\nehren\naraba\nspotlighting\nnasm\nbarrino\nbojonegoro\nhomerun\njambalaya\npoyet\notterburn\nquintuplets\nwino\nbolliger\nwheezing\nmythe\ntallahatchie\nmariela\ngroener\ncangas\ndrp\nscheler\nnithsdale\ndasan\nbusboy\nrejuvenating\nsspx\nnatsuko\ndelectable\ncarnforth\ncockrum\njowitt\ncaptiva\nytterbium\nabhor\nnaroda\nklepp\neastchester\nhenslowe\nstalinists\ncele\nhoffs\nalexy\nbich\nrobocup\nhypnos\ntokaj\nsoderberg\nvivar\ntwardowski\nestep\nlaplan
te\nchambering\nnursemaid\naticle\nfleiss\ngarces\nporlock\nreinstates\nbrucella\nchinnaswamy\ngasset\nhornos\nbagheri\ngip\nontiveros\npiltdown\ncaner\ntharoor\nsanneh\nbeiteinu\nvividness\nskilton\npilas\nbdt\nsyah\nmatveyev\ncressey\ncurtailment\nsollentuna\nhominis\nsurefire\nsulcata\nastrium\nbitmaps\nrejoices\nmonoceros\ndysphagia\nirresistibly\nbruggen\nheini\ndomingues\nzhenhai\nrodi\nenablers\nwhishaw\nhrp\nmeandered\nharrer\nhackenberg\ntreng\nmbd\nramamoorthy\nhoku\ncivets\nbiosystems\nbushing\ninterpolates\npricks\nunderfoot\naldergrove\ndesalvo\nplaters\neraserheads\nlafourcade\nslob\nterzi\nshutt\nevandro\noarsman\nlactase\nserendipitous\nrwb\nridgeland\nturka\ningleton\nterschelling\nprancing\napayao\nasco\nelna\nsaddlers\nthaïs\nsummitt\nanping\nimpertinent\nkolok\nfanged\nnto\ndominators\ndefacement\npagán\nmattheus\nidolatrous\nzlin\ncoronas\novereating\nmessageboard\npanjshir\ntase\nreassembling\nmbit\naldhelm\ncrevasses\nracecourses\nqalandar\nstiers\ningoldsby\ncanonic\nhutto\nnyland\npreda\nsicko\ndepersonalization\nbyo\nwillunga\ndynamometer\nthayil\nabuna\npoyser\nhattrick\nribald\nzinder\nbandoneon\nzipf\nfrf\nasheboro\ntesty\norifices\nbrandan\ndecriminalisation\ncityhood\nosma\nitai\nfuyang\nheliocentrism\ndecrement\nrissa\nyeovilton\nmoneymaker\nlocksley\nluddite\ncapac\nbeeline\nunobservable\noffshoring\nforetell\ncapi\ncoenraad\nmouskouri\nnoack\nkorban\nlynes\nrockslide\nbretschneider\nbasilique\nforbs\nbracha\nafscme\nfishmongers\nazi\ngrane\nconnivance\nabiodun\nautocomplete\nwatermen\ncerca\nbrug\nheere\nliaoyang\ngwenn\nkume\npantelleria\nnetbeans\nbessette\nbishnu\nfdb\nsquillari\nbtt\ntrueblood\nstanislavsky\nhänsel\npichichi\ngaylor\nellman\norinda\nfreewheel\nmidriff\nreinterpreting\nclery\nlionfish\nwoolford\nmarjoribanks\nlarrañaga\nenunciation\nharakat\nstrugatsky\nwiltz\nbebington\ndahab\nfactfile\nkarisma\nklien\nmillom\nclearness\nottesen\nbiren\ncushioning\nwakeboarding\ntryggvason\nperrotta\nfrahm\nmols\nmystifying\nkat
ich\nzall\nvigevano\nbenzie\ngaura\nmabinogi\nscythes\nlasha\nlamborn\nkingi\nhaymakers\nzindabad\ngriner\nluisi\ngilliard\nstivers\nstingless\nblowup\ncelbridge\nmaik\nscaredy\nmiga\nexhorts\ntraquair\nmcmenemy\nperfecta\ngugu\nnefertari\naicc\nmusker\nindolence\nsaturnian\nflorentines\nadelphia\nfronton\nfhl\ncreem\nnewschannel\nisen\nradishes\nmandvi\njisc\nlithographed\nmynach\ngurun\nrondane\nregino\nultrafast\ndavidge\ntlb\nigra\neyüp\nshalem\ntorne\ncaldo\ndietetic\nkazantzakis\nalbertosaurus\nladyland\ngoyette\nabish\nblacktip\nsangita\nmasterminding\nzhirinovsky\nagriculturists\nbagni\nrelix\ndevoy\nhoverflies\nmedwin\nareola\nglint\nannualized\nsemiramide\nkeynotes\nforestalled\nnbk\nwroe\nbezuidenhout\ndula\nmcgibbon\nbonneau\nsqualls\nkornberg\nlowder\nnissa\nkoenigsegg\nbicton\nbonamassa\ntardiness\nmarquesa\nsagitta\ndahlin\nmizu\nfreebird\nfoye\nblowtorch\nvsl\nsixt\ndunums\ncharmingly\nridsdale\nkangchenjunga\ncarbamate\ncarragher\nredacting\nabh\nnautica\ntinting\ntimebomb\ntomoka\nreif\nexclamations\nkinghorne\nrhames\nsumlin\naccentuation\ngritstone\nkhalifah\nbendy\nyagura\nzakho\nharuhiko\nstanshall\nmcclory\nunderstudied\nbanerji\nmarianus\nshinshu\nsukhumvit\nmarmaris\nkabbah\nakhilesh\napod\npihl\nmarineland\nunderclassmen\nstt\ncaldicot\nnasar\nstrandberg\nbaumholder\ntgif\nmanobo\ndvm\nbleddyn\ncapsular\nkintail\nrdna\ntinamous\nconfidants\npetrovski\nplaited\nofford\ncounterweights\nugandans\nibook\nsils\nchava\nngoma\nshorncliffe\nsiletz\nfloaters\nstupak\nposthuman\nlaryngitis\ncarryover\nagila\nscheck\nrahl\nwaypoints\nsandpoint\namv\nbasher\nwhitefriars\nrecirculating\nfeis\nuntrusted\nlausitz\nkumho\ninstigates\nbensalem\nkimbra\noilseeds\nharrower\ndodder\ntomioka\nduden\nzula\nlechuck\nleder\nsouffle\ndeshannon\nvulcano\nvalcour\ngrotesquely\npinglu\nphillipa\nspringburn\nperversity\nrilling\nshebang\nautechre\ndebarred\nnewsham\noai\nholbeck\nanglerfish\ngandolfi\nsacristan\ngeraci\nlollywood\nnazeer\ngodwits\nloog\ngivin\ngabbar\n
cheick\nminu\nfabulously\nlabov\ngweru\ndisruptiveness\nbelzec\nvannin\ntucumcari\ndaim\nsieglinde\nshinkai\nkaley\nvukovich\nshawmut\nfica\nhydroponic\ninfluencer\npindus\ncascina\nkübler\ntuckett\nercan\nnonacademic\nmastitis\nmanfredo\nmayte\nadani\nboumediene\nwakasa\nmml\nmarabout\nstefánsson\nelsey\nbetrayals\nplinths\nhurstpierpoint\nstarnes\ntorrentfreak\nblokes\npnt\nsynthetics\neyjafjallajökull\ngamay\nscreeds\nfatuous\nbenmont\njackhammer\nwellbore\ngalvanizing\nkahal\nbatterie\nocellated\nredesdale\nforssell\nayotte\npeeks\npierrepoint\ndiggin\nazz\nmontoro\nexculpatory\nlallemand\nmilkshakes\nphagocytes\nbuggles\ncaruthers\ndaito\nstrack\nmalva\nacsi\nblo\ncvm\nniculae\nsabyasachi\nmarlette\nsterry\nhügel\nsimonon\nupgradeable\nträume\nbicolored\ntraxx\nmcelwee\nmoosehead\npeaty\nbellhop\nlukaku\ncoccyx\npraag\nhorbury\nsnipped\nstuddard\nroussos\njantzen\nctn\npedernales\nyuzo\nchangzhi\nrutles\nrbcs\nrootless\namericanos\nsmil\nursprung\ndisha\ntelmex\nmagennis\nyunupingu\nfarhi\ncollisional\nbarneys\nlami\ncudgel\ntroggs\nsyzygy\ngamblin\nfrithjof\ngirdles\nchup\ndisincentive\nveach\nadventitious\nhoven\nbuddle\nchurchward\nhorr\nreichsmarks\nlaas\nsportscars\nscarabs\nnontoxic\ncll\nramamurthy\njale\npierrefonds\ngrupera\nskeat\nrecurrences\nwaterholes\nuveitis\nfalzon\ndronfield\nsubsidiarity\ntonny\ndigitata\nunpredictably\ncimbalom\nanglophile\nkabbalists\nbukovsky\nanesthesiologists\nghostwritten\nharby\nmapuches\ninstants\nsupergravity\nmacnicol\ninterline\nngl\nbeauséjour\ndedman\naias\nbomis\nmarciana\nprelim\nsavak\ncommercialised\nabbotsbury\ntunga\nflybys\nvenkataraman\nhows\nmicrolensing\nperrysburg\nblech\nthrought\namuses\nwinkleman\nubb\ngamedaily\ntotowa\nloto\nhanni\nordon\neubie\nredington\nnorns\nkinesin\nwroughton\nnostromo\nstrad\nposad\ndivac\nnazmi\nmatus\norio\nsarit\naviemore\nvolkmar\nburpee\nedx\nquiere\npku\nsantas\nchelios\nzuloaga\nmanja\nverfassungsschutz\nforemen\nfux\nvalvular\ngidon\nmuskrats\nktp\nalleviates\ntruelo
ve\ninterconference\nkhalq\nlakhdar\nmures\ngynecologic\nyedioth\nunpacking\ngorbals\nerhan\numana\nwulfhere\nbonum\nbearable\nzinger\nwoad\nberna\nhosa\nfoamy\njobbing\nensnared\nsheremetyevo\nreconnoiter\nwhitehorn\nwht\nsalmiya\nbinny\ncurrants\narten\nyasuyuki\nsuka\neavesdrop\njina\ntomcats\nmvt\ncircumnavigating\nbarcarolle\nmurli\nmccutchen\nbutera\nvodkas\nmangotsfield\nprenuptial\nwurtz\nleftfield\nscramjet\nhiroto\ncrumple\nphotomontage\nsmail\ncheektowaga\nofficious\nkishen\nmuckle\nringsted\nriess\ndimasa\nshikhar\nsidelining\nremapping\nkarima\nkozluk\nmacewen\nstygian\nboons\nbelch\ncoas\nbatres\nyuca\npahrump\ninver\nluganda\ntaplin\nconcocts\noles\nbudimir\nplasminogen\ndamu\nepigraphs\npismo\nduncannon\nkristel\ncherif\nwira\nhtp\nbalmont\nthiocyanate\nshunji\nterrorizes\nficken\npjs\nborotra\naxworthy\ntortona\ncrosswinds\nraiatea\ntunnell\ncoatzacoalcos\ncristofori\nclarington\nkuranda\nanemometer\nminersville\nagyeman\ncountback\nlacanian\nbraley\nmetonymy\ngedling\nuygur\nneurogenic\nkarcher\nduduk\ndrt\nthisbe\nanspach\nwindshear\nsould\nhovell\nvalliant\nsaeid\nbredon\nextremly\nobstructs\nbayport\nfowke\nmeglio\nimmagine\nloveline\nmurrays\nerv\nservos\nmakhlouf\nmahato\nstreptomycin\nmcmeel\nfairmile\nisengard\nkapor\nhellstrom\nperiscopes\nclavijo\nmindstorms\ncatherina\nchappie\ntsoi\nmaunganui\ntravailleurs\noakeley\nsveinsson\nhyperspectral\nevince\nspofford\nshortt\ngebre\nfonterra\ndosimetry\nwiliam\nrathenau\nkalkaska\nbarg\nnccaa\naelita\nwqxr\nliveried\ncolbeck\nborsellino\nliping\nsteig\ngrandstanding\nkaare\njaren\nhnd\nifni\ndelacour\nstagnating\nmaff\nboyan\nterenure\nkomachi\nconstanta\ncoralline\ncantante\nnwda\ntakata\nbenedicta\nswt\ncorde\nsaiki\nblockheads\nhamadani\ncompulsively\ndebilitated\npnd\nbichir\nramnath\nespnews\nroyden\npasties\nekg\nwasif\natcham\nlij\nmurtala\nsarton\nfaustin\nkasturba\nkrawczyk\ncatecholamines\ngiovinco\nbetters\nbeltsville\nexasperating\nhamidullah\ncipa\ncait\nlario\nbullish\nbele\nemeril\
nreimers\noutlays\namoebae\nwholesaling\nhyndburn\nnavis\noversights\ntakuro\nstonechat\nwynd\ndhirubhai\nkatinka\nasprilla\nschneller\nomap\nvandermark\nelektron\nplusieurs\nremembrancer\npapadimitriou\nswabs\nhta\ncintrón\nshvetsov\nkamelot\nseraglio\nvima\nkibo\ncongenita\nholker\nshunts\ngilbreth\npowel\nstok\nocclusal\nloango\nreinier\nsuperdelegates\nbigler\ndollfuss\nallais\nconstructionist\nbinyon\nyawata\nbonnington\ngoslin\nwelshmen\nkaris\nwinterset\nkosei\narmbruster\ncyberman\nuighurs\nshoos\nmanoeuvred\ndrozdov\neszter\nmadia\nromola\nhammar\ngourami\nrosaline\nindicum\nyng\npista\nretouch\nsynched\nhailu\nsarracenia\nbrautigan\nfmedsci\nkozelek\nbayo\nmargriet\ncurrumbin\npne\nhege\ndarknet\nbrahmananda\navowedly\npapunya\nbyatt\nstovepipe\nroba\ncedros\nfmg\nwmca\nejiofor\ncellulitis\nnarvaez\nmarkovic\nenglisch\nkokrajhar\nbumgarner\ndetestable\nlamotta\nsciencedaily\nkalari\nenchant\nyoum\nkuriyama\ntootie\nvassall\ndecoud\nreadymade\nasadi\ndamaris\nnadav\ndanas\nembouchure\ndoune\nteus\nmasseuse\nhorseless\nchiswell\nkaija\nbicornis\nmetropolises\ngutteridge\nniihau\nchromaticism\nuth\nchouf\nagenor\navium\nschreber\noutgassing\ncarpetbaggers\npostmarks\nheimlich\ninsectivore\nplebiscites\nroundheads\nhallinan\nadministrate\njedwabne\noken\nvilsack\nbattye\n³\noppinion\ngreenside\nachan\nhannett\nnawalparasi\ntingting\ndourif\ndlx\nlegislatively\nnagraj\ngildo\nlagasse\ngedney\nnordmark\ngause\nselmon\nreoccupation\ndalgety\ngna\nforeclose\nfavorita\nborohydride\ncircuited\nkarekin\nportly\nbarrancas\ncouse\nkeyarena\nthalys\ncupping\nminbar\nsafia\nfloria\nfortner\nperennis\nhalm\ndoze\ncrummy\nkaurna\nrecieving\noughton\ncoattails\nvenuti\nkanshi\nkuhlman\npharmacologists\nraley\nprachi\ncachorro\nkete\nfishmonger\njuster\nrelaible\nhsueh\nshimo\nolms\nweinbaum\nshirebrook\nlascaux\nbusybody\nmutilations\nwonkery\njorg\nmomus\nludhianvi\nbrokerages\nbodhran\ngranat\nsasktel\nmascherano\nmellie\nguesthouses\nprosocial\nspokesmodel\npulcinella\nw
arlow\nperused\ngarowe\nsalgueiro\nrichert\nkipner\nhilali\nfumo\nszekely\nmaquette\nkatmai\nwmap\nbarataria\nproffitt\nstarker\ndivs\nbirthplaces\nhannemann\nashville\ndhoti\nbroking\nnordhoff\ntemnospondyl\nhengyang\nwarbirds\nskolars\ndiverticulum\nthielemans\nchenault\nkps\nkalos\nnettuno\npersonals\npolyvalent\nehrenfeld\nrhumba\nranuccio\nvictimised\nhoppin\nhackle\nyellowlegs\ncuscus\narnprior\npatrimonial\nwharfs\nsistemas\nkazaa\ncopal\ndanila\nwuorinen\nbellbird\nriek\njeroboam\ncarfax\ndebenture\nbeckoned\npovey\nygnacio\npucks\nioannidis\nmultivariable\nhabbaniya\nkrofft\naircel\nigual\npalam\npinup\nduka\nsiegal\nseme\nchanticleers\nemancipate\nferriss\nminin\najoy\nsimak\nshantz\nditlev\nbrémond\nxxxxxx\ngothia\nlampoons\ntakhar\nmeasly\nratsiraka\npaha\nwets\nfamu\naquaticus\ngilfillan\nenliven\nslbm\nbessin\nouthwaite\nproprioception\nkinner\nrasim\ntoshima\ntsuruga\nsurmount\nhogenkamp\nducting\nsaddens\ntoilers\ncaped\nredactor\nkarski\naph\npht\nmoone\ncyanogen\ntucuman\nmartaban\nxenos\nldv\npreprocessing\ncartilages\nlathom\nwithdrawl\ntabori\ntamburini\nhéloïse\neknath\nsaward\nbeara\namanuensis\nquinctius\nberney\nkernaghan\navait\naylesworth\nurologic\nhardtack\ncurvatures\ndoctrinaire\nkumkum\niulian\nlikert\ngreatschools\ntishman\nperrott\ntambay\nessec\nvarg\nmmb\nbeeby\nbitte\ninpatients\nmodafinil\nkeila\nultimus\nmajnun\nkubik\nmaras\nyoshitoshi\ninfinitesimally\nfatt\npakula\nscarpelli\ncartland\ndisbursements\npouched\nniblo\nbasshunter\nbosra\ntriplicate\nlammer\npêche\njuab\nsonchat\nbiru\nolio\ngogoi\nshareholdings\nherberger\nmmda\ncesario\ntrx\npensioned\nstandaard\nkpix\nanastasi\nuglier\nsouthcott\nkelani\nreformism\npapist\nchernoff\nglottis\nmanaf\ntveit\nrainger\ngerb\nflatworm\nbatasuna\nphotodiode\ndeberry\nfmi\nsupertall\nwainuiomata\nceccarelli\nhydrofoils\ndownloader\nnerio\nrevolutionist\nbuner\nspacings\nsubchapter\nandoni\nrespublika\nantimafia\ndollis\nplanalto\ndemas\narnel\ncarlucci\nkiyosaki\neynsham\nprocopio\nk
idal\nferrall\ndogpatch\nyancheng\nhawthornden\nnydia\nosmo\noutgrowths\ndowlais\nheward\nsnowplow\njenney\nbreizh\npapules\nmiddlebrooks\ndebus\nmagnetics\nszigeti\nmatchett\nmutineer\nthighed\narizmendi\npatchett\npery\nforsee\nmullett\nhuot\npathetically\npatchen\nbrize\ngouri\ncinzia\nponferrada\nperforce\nvivarium\nsunniest\nsonepur\nlawa\nhauppauge\ncorrectable\ncani\nshal\nzags\nkumaraswamy\nwttw\nalevis\ndisowns\nglissando\nbokhari\ncuf\nsziget\nemperador\nattanasio\nkrige\ntulse\nknx\nbrienz\npurifier\nalacrity\ngenotyping\nblab\nwelsch\nwoodcote\nzubaydah\ntopmodel\nsulman\nlucchesi\nhisamitsu\nbbg\nstanwood\nardian\njacquin\ntarif\nlecuona\nbutoh\nmlr\njordans\nheydt\nimmunologic\nahistorical\neremenko\nmccone\ngenetta\noverreach\nmahalingam\ndsu\ndutoit\nsilbermann\ncornets\ngorontalo\nshiroi\nstatutorily\ntybalt\nsolarium\nragtag\ngeorgiadis\ngulam\nharmonicas\nrushcliffe\nmichaëlle\nparticularily\nsiavash\nexploratorium\nsartorial\nvicariously\ntrended\nnbt\neldin\nelevens\ntormentors\nkuga\nstoma\nthreadbare\nventurer\nmyke\nmaturana\nclaverack\nofftopic\nibp\nmmg\nwadding\nchingy\nbradlaugh\nschrodinger\neleanore\nplaice\npensées\nhouari\nabegg\ngallerist\ncircolo\nwitchhunt\nadeel\nconseiller\nevaders\nahumada\nvilleroy\ncornetist\nfilipacchi\nmittler\nkiyo\nmansbridge\nfagundes\nleonhart\naltro\nneccesary\ngraving\nokello\ncuckfield\naddario\nsharath\nparrotbill\ncalibur\nmadding\nspeedie\ntindouf\nsantilli\nconvocations\nsnowfield\nwhare\neliason\nwitless\nmcdavid\neyeless\nramdev\nbergamasco\naql\nbarberi\nbaiser\navenel\nlona\nfaik\nbamboozle\nkebede\nnutritive\nwingnut\nspilsby\ngunfights\nmellowed\nsuor\nhauk\nvegard\nrightward\nkapilvastu\nnogami\npenalizing\njuif\nhistopathology\nkameez\nhider\nuntainted\nsartor\nnereid\ngddr\nmasina\npannell\nxis\nfigueras\ndecoction\nkolstad\nemek\nkukri\npedrito\ntangy\ncorundum\nmyddelton\nfurrowed\nhartig\nnewsline\nsamuelsen\nunrecoverable\naboyne\nbashevis\nlaneway\nblak\njinjiang\njonna\nbabyshambles
\nbalustraded\ndoens\nwinge\nsurgut\nhaneke\ngordonstoun\nicemen\ncaracciola\nrmn\ngrilli\nblakley\ncyclingnews\nabodes\nkangri\nmufulira\ncutesy\nhenrys\noistrakh\njodo\nlightbox\nbullfights\nsealab\nshannen\nbonsu\ngnb\ngallucci\nhookups\nlango\nunocal\npomar\nmontañez\nsurjit\nniente\nburrs\npraya\nsharyn\nleishmania\nriverstone\nthunderhead\nhostos\nsavinykh\nlothario\nswettenham\nlsv\nmicronutrients\nbdi\noxana\nkapi\nearthman\nsarfaraz\nranil\nmitrokhin\napco\nakgul\nsweety\nbwlch\ndongfang\nrosman\nwenman\nleeton\nruapehu\nbonucci\npetronio\njagjaguwar\nárea\nzipra\nluxton\ninvisibly\nboxford\ngroep\nsuspensory\nsunnybank\ninquisitions\nturkington\nbosaso\ncitadelle\ndingli\nhalevy\nlangi\nkinzie\nmenasha\nmyfanwy\ncoronets\nzenker\nuncirculated\nwachowskis\narcelormittal\nbozell\nhri\nbukka\ndehumanization\ndelanoë\naof\nshimabukuro\nironies\nmartirosyan\nprophethood\nrösler\ndestructively\nbartholdi\nsabar\nmerkava\nwynnewood\nhaid\nbrayden\ntaufel\nmarchesa\nraki\namstetten\ntiptoe\nlunaire\ncabanas\nmachakos\neyadéma\npalmeiro\npevensie\nmarciniak\nmanteo\nfondue\nperversions\ncornford\nchildhoods\nturnoff\nlabute\neastville\ntarasova\ndrozd\ndramatis\ndurum\nshadyside\ngormenghast\nauroras\nkalil\nailesbury\nphidias\nkalinić\nredundantly\ngambela\ncauda\npovo\ncrèche\nshrivastava\nosservatore\nfortuitously\ngatesville\ninishowen\nveltins\ncouped\nszymanski\nmmd\nyamoussoukro\nwesterham\nlaupheim\nevernham\ndelpy\nwarrenpoint\nkamina\naldana\nsbe\nshallot\nherbalism\nindesign\naffan\nkrishnas\nriesch\ndowries\npiatti\nrawley\nmuret\nadwa\ndaar\ninvesco\nothe\nmallinson\npulldown\nmoesha\nchaurasia\npetree\ncoleford\npantheons\njonnie\ntrochanter\naromatherapy\nsproles\nglatt\nquadrillion\nchiellini\nrosaura\nguti\nhualapai\nguston\nobsess\nbowerbird\nsantry\nnorquist\nxilai\ncodey\nimprobably\nfrancoeur\nurf\nprefigured\npolizzi\nkilmartin\nmitropoulos\nsteinmann\ntorstein\naxeman\ncsh\nloveliest\nkahle\nmalegaon\ntyc\nhypogonadism\nbide\nbeker\ncouzens\n
maemo\nleofric\nonan\nsholes\nkindhearted\nshango\ncollingswood\nferrone\nalcina\nneuroplasticity\nmahru\nhyperventilation\nyushan\nplagarism\nakela\nhirschman\nwillen\nfanatically\nautomator\nyeom\nbiljana\nkini\naggrandizement\nsacasa\nefd\nhougaard\nhalite\nnerina\nballades\nqueluz\nopi\nhawtin\nchandrababu\nfreida\nbriz\nborromini\nbadness\nbarolo\nmilverton\nmckennitt\nrearmost\nshain\npenciling\nhuli\nunashamedly\nbinchy\nallusive\ngaffigan\nluteum\nmuso\ndramatizing\nchihara\nschoellkopf\nsaphir\nzanussi\nfringilla\nlutsenko\nwarding\nremedying\narteriosus\ngolay\nsloppily\nscandia\nmccreight\nkensei\nurfa\npeseta\nplummets\ncydonia\nfugal\nriese\nkubu\ndevis\nyoshihito\nschip\nmahanta\nlyonnaise\nkavkaz\nsoloed\ntoka\nroesler\nsalvor\nzoffany\ndjoser\ninducements\nchukar\nfitzgeralds\nbkv\npoot\nvsd\nmildest\nmullahs\nsnobs\nanticoagulants\npersevering\nhbos\nunglazed\nrachana\nunderestimates\nmexia\ntevin\nmarinov\nencke\nannealed\nbedel\ncranham\nhyam\nprincipio\njolanta\ndipak\nmayak\nmeilen\nkeqiang\ngastineau\nmilligram\nmagico\nmedtronic\nebonyi\nsorcha\ngraffito\nverhofstadt\nfrasca\nkaido\nlanfranco\ndolma\narcherfield\ndennys\nborie\nbrettingham\nwkrg\ncsepel\nconstricting\nequitation\njafri\nnozawa\nsamana\nstrutter\nbeardless\nunbilled\nfullmer\nmainstreet\nshigeki\nblackcurrant\nwuwei\nbrailsford\ntonneau\ndogan\nangelle\nduncanville\nkeytar\nmontanari\ngénération\nshibboleth\nscaramouche\neggleton\ncht\nwiscasset\nbarbarity\nbhati\nyal\nsuperintended\ndaishi\nteruo\nilog\nnaivete\nleaper\nlilyana\nnooks\nhorii\nurlacher\nparkash\ndetaches\ncavaradossi\nspex\nbandundu\ncaris\ndanieli\ninfringers\nmidpoints\njauhar\nshlomi\numwelt\narsenije\netan\nmaguey\nghauri\ntietjens\nhirobumi\nmisano\nndongo\npancholi\ncrg\ngurls\ncorsaro\njinsha\nohata\nstorytime\nchristinna\nmelchett\nbankim\nantiarrhythmic\nnips\nazriel\ncyclophosphamide\nshelah\neleftheria\njagr\nnelsonville\ncombustor\nshuya\narbore\ncatchweight\nbalor\nzdenko\nslanders\nagan\ncambium\n
toshimitsu\nvonda\ngodowsky\nxianfeng\nenac\ngéant\nfalangist\nlanl\nsoochow\ninverell\nphyseter\nspang\nargentinians\npropyl\ncomps\nphotocopier\nsancerre\nrockfall\nwtae\nmallam\ndahlen\ndeansgate\npaves\nfxx\ngusher\ntiaan\ncharism\nalula\nrossmore\nsquarish\nsallies\nkfw\nfowle\ndemonized\naegypti\ncortijo\nshallowness\nromualdez\narachnoid\npocketful\nroadworks\neuropeo\nammu\nhetton\ncoleby\nstricture\nsublet\nviraj\nmaino\nmorgellons\nbefuddled\ntwilley\nqods\nbrutes\ncounterfeits\nneun\ngasa\nlysaght\nchinh\nwoffinden\nspx\nnoires\ncomencini\nheyworth\noreja\nkyl\ndami\nbarve\nthrelfall\nkanyon\nbenedetta\nadiabene\nscarps\nextremo\nrainstorms\nkanade\nsiping\nforesta\ncondesa\ninas\nthau\nunapologetically\nisopods\nrennert\ncurios\ngauck\ncunnington\ncorleonesi\ncolonising\nverbier\nhutterite\nrazorlight\npangolins\nunsprung\nbeget\nkallon\nkroenke\nbustards\nbrynn\nrepetitiveness\nshuker\ndunga\ncaselli\neuropol\ntirado\nchicxulub\nbridgeville\nacar\ndebauched\ncheerfulness\nseigneurial\npeeress\noccurences\neltz\nresubmission\nteno\ndelors\neprom\nchartists\ndeliverer\nilf\nwendl\nmalinche\nwildcatters\nwatercolourist\nkahlan\naftertaste\ndenialists\ngoalpost\nshahryar\ncoochie\nrathi\ndisappoints\nmegathrust\nsimiles\ndaisley\nkaew\ngalia\nformalise\nkaat\nsial\nheists\nffp\nkinser\nryoichi\ncaretta\nshtick\nwillebrand\nmegillah\ndielectrics\nolvido\ntombaugh\nimpregnating\nwilting\nsokolow\nmehrdad\nveii\nfreon\nrhyd\nvimukthi\nhyraxes\nstatto\nrecompression\nunconditioned\nbirders\nkilldeer\njpegs\nmanthan\nlcf\novando\ntranslatable\nbirgisson\ncompuware\nbacke\nsouthold\nnorvegicus\nprotagoras\nkidwelly\nwristwatches\nboehringer\nmarinescu\ndoughboys\nlef\nmatchlock\ncounsell\nbrahmanbaria\ncik\nkyte\nasprey\nmancunian\njohncock\nandry\nathen\nhelman\nedam\nyongkang\nmiandad\nmcgillis\njuge\ndikshit\nwidodo\nsugarhill\nrebello\nmonohull\nprokuplje\nqnx\nkronborg\nportages\nharrasment\npeplum\ninterposed\nsalone\npask\nethier\nkingstone\ndring\nprizewi
nner\nwrottesley\nreconnoitre\nrevving\ntolbooth\nsene\nwaisted\nvaldivieso\nignacia\nmentawai\nmansilla\nchapeltown\nnikolskoye\nbrainless\nfloras\nsenegambia\nblockhead\nlyndsay\nmenn\nborkum\nmalmstrom\nenvenomation\nwheatgrass\nencrusting\nsenda\nanitha\nrakers\nhelly\nmahua\nhajer\nschistosoma\nmetabolically\ncojocaru\nprospected\npinpointing\nhistrionics\nbeseech\nsubjugating\nfadiman\ngaud\ngorsedd\ntorsos\nsaade\ncoppers\nevic\nchainmail\nmirta\ntiree\nlant\nschlemmer\naaw\nblinks\nalik\ndalin\nbaggett\nwui\nwenceslao\nbeavan\nmacrinus\ncarreg\npavers\nimmunoglobulins\ngrrr\nauchincloss\ndorma\nwringing\noutranked\nlege\ngillmor\ntharpe\nnassif\nbetancur\nwarhurst\nfingertip\nhilli\npragati\ndigweed\ntrego\npicacho\ntablecloth\njuliá\nsyncline\nlaksa\nregionalised\nlongtail\nkopa\ngirardin\nmcdormand\nhysterics\nshowmen\ndiviners\nnaldo\nhomogenized\ninflammable\nfelonious\nrussells\nglencore\ndalman\nincinerators\nsilverlake\nspondylitis\nlevien\nmineshaft\ncuyamaca\nwoolman\ndenville\nwarlpiri\nrautavaara\njupiler\novermars\nmisdiagnosis\nyarlung\ngaragiola\nculverhouse\nenet\nsildenafil\nkvarner\ngneisses\nunburned\nmaguires\nlebo\ncremations\nscintilla\ntidworth\nuaz\ndupleix\ngiorgini\nalbom\nkassar\ntroncoso\nkunde\njobbik\nretributive\ntoumani\nyeosu\naitutaki\ngunplay\nhoskin\nfontanelle\nadebisi\nfroehlich\nselectin\ncarsharing\nborzage\nsununu\ngorleston\nnatyam\nphilipe\nsamet\nhalprin\nleuchter\nmangora\nxscape\nhenbury\nalgy\niwona\naana\namilcar\ncallistemon\nwatari\nvarnishes\nfayyad\ntraun\natmore\nmacallister\nsrr\nguercio\nfunai\nparrilla\nkiyoko\noof\nnaivasha\nintricacy\ninterlagos\ndoel\ntarbosaurus\nminix\nbasedow\nlowlife\nderbez\ncommunicants\nunashamed\nbanjoist\nparet\nhilberg\nclapboards\nnotícias\npfd\nfurnivall\nayush\nmekas\nwhorled\nlubyanka\naldis\nfootfall\nbancshares\nwychwood\nwehda\negge\ndodos\nwillmore\nkotani\naquilina\nramc\ndutcher\ndilwale\nchubu\njöns\nrapamycin\narieh\ncholmeley\nstrainer\nwingard\nliberti\nregular
ities\njiddu\naribert\nligi\nuvarov\nvleck\nyevtushenko\npauvre\nmarkoff\nsoucie\nmonograms\njael\nvilify\nmuzzy\nmuqtada\nlancie\nhuizinga\nrubberized\ncolney\ndöblin\nreemergence\nholländer\nshenley\nvoom\nnovotel\ndago\nbacchae\nschoolmasters\nstasiak\ndramatize\nbronchi\nbarkers\npinetop\ncapitulations\nfritzi\ngery\npulping\njankel\ndomi\nradiodiffusion\ncostantini\nethridge\nberkelium\nmarwa\nvishwakarma\nlambros\nesposizione\nfoglio\nruess\nperversely\nskikda\npinho\nmalinda\ngrétry\narlott\nchaminda\nvetri\nvea\nalgis\nedah\nroaster\nmaneri\npantheistic\ngorb\nleatherman\nflandres\nbiggio\nwarre\ncatechesis\neyres\ntrendsetter\nskidding\nnazia\nmadariaga\nyared\ncounteracts\ntechnicals\nparticularities\nopencourseware\nreshuffling\nremeber\nmultiuser\ndebasement\nbrousse\ntiedemann\nskews\nauroville\npelee\ndominum\nitzik\nkirloskar\nurizen\nkanton\nanalytes\nperutz\nbundesamt\nbustan\nentranceway\nfatio\nschlierenzauer\nhenrie\ncharmers\nasakawa\nsubsidise\nhoult\nberlanga\ndamita\nramzy\nseiyu\ngodsend\nstarkiller\nfurber\nthuja\nronnies\nolley\nkostya\nmoseby\neddery\nasyut\narmrest\nmarshak\ngiraudoux\nharve\nhandstand\ntelo\nshenfield\nairbrushed\ntreu\npandy\ndessin\ntalman\nunconstitutionality\nmiscalculations\ngervasi\nflouted\nmunakata\ndubuffet\npowerscourt\nmpac\nhawass\nmakiko\nbooger\ntranslocations\nrechecked\nartland\ndeggendorf\neanes\ngaku\ncontax\nashida\nreliquaries\nrepresses\ntiler\nleks\nbluestem\nfurcula\nfriant\nmeridiana\nportinari\nsciencedirect\nqantara\nratifies\niao\nmadinat\nspuyten\nhanoch\nvente\ncampillo\nchapo\nspadefoot\neleuterio\ndespaired\nhernani\nnaos\nnegrito\ndamião\ntattler\nquinten\ntopsfield\narae\ndewatering\nmetromover\nvidocq\nelham\ndriehaus\nticha\nmaternally\nziba\nscuffles\nornelas\ntelehealth\nparlayed\nandel\nsankei\npase\nfdj\ndespres\nmuz\nasgar\nwangan\nbulma\nimbibed\nthalassa\narlovski\ntelcel\nraub\ndoubletree\ndigitalized\nhaleem\nmantz\nkarlos\nruppin\nhermie\ncrackerjack\nzaim\nchichele\namsa\nwi
secracking\nbalzer\ntelefutura\noley\nastyanax\nsignatura\ntrogons\nrautahat\ndisentangle\ndeputising\nnorther\nhindon\nbtp\nexorcists\nbachar\nilluminators\nstagnate\nmaciver\ntreeline\nshoeshine\nrng\nesty\ntsuchida\nguterres\nprweb\nbattlemented\npowervr\nbrucker\nradicalisation\naeronaut\nsignaller\narla\nrodionov\nindecipherable\nnupe\nboks\nlinchpin\nbroughty\nreplicant\nogs\nbanias\noscoda\nmalko\ngallivan\noddi\ntamron\nmilkmaid\nshalwar\nchartwell\nblips\npurus\nhypotonia\nkittin\npriceline\nbailes\nclarus\ndichromate\ngozzi\nctg\nkennebunkport\nrexburg\nellard\ntsukiji\nroscosmos\ngesso\nroadmaster\nbunching\nrickson\nffh\nclouston\nverrazano\nlhermitte\ngrubby\nsnorkelling\nkunshan\nvna\npistil\ngoulden\nanau\nantidotes\ntapachula\ndecimalisation\nexplainable\nexciter\npaintwork\ncodicil\nparag\nmanana\nafo\nvra\ndoina\ntunnicliffe\ntalpur\npomare\nfullers\nclipse\nceratopsians\nmckuen\ntempranillo\njaypee\njow\nofr\nsillier\nbaldassarre\nlibbed\nshiozaki\noutskirt\njudie\nmartiniere\nbrachiosaurus\nphonographs\nyoann\nbrandenburger\nskadar\npiti\nmontesano\nbever\ndeve\nmeinhardt\nreynier\njukeboxes\nscart\nwhenua\nbaise\nmabuchi\nprofligate\nruba\nearwigs\nslitheen\nexum\nangelino\nrawnsley\nwashi\nthiokol\nhsk\nagnolo\nubl\ncrutchley\nlopresti\nsedov\ncrivoi\nswasey\nhitchhikers\ncrimmins\nschnauzer\nmillpond\nmasafumi\nheiligenstadt\nultrasparc\nmetallurgists\ndiscolored\nkiwifruit\ninfiniband\nbissonette\nvibo\nenma\ntopkapi\ncambrensis\nbramah\nfittler\nhogmanay\nboling\ncedes\nhemenway\nviele\nthye\nspn\ntransacted\npianissimo\ngiap\nafia\nteso\nnbdl\ncasaubon\nkingswinford\nmorgentaler\nperlin\ndruidic\ngoldthwait\nsignees\nfoh\ndimms\nruddigore\nportilla\noligopoly\nhoni\nstargazing\nwarin\nlightbulbs\nntd\nduyvil\nlettera\nsegeberg\nkail\ntonna\ntardive\nkaylie\nbordesley\nfitkin\nperdix\noilseed\ndilke\nrivulets\nupdf\ncored\npovenmire\ncriminalizes\nsilvius\nthile\narchipenko\nranjha\nbertran\nmoshannon\ncainta\nitaú\nrosana\nkachi\nphenobarbi
tal\nverdot\nmiddlesboro\nhandcart\nhafod\nmuthanna\ngastroesophageal\nchesky\ncordeliers\nmoeen\nguillain\nphetchaburi\nsindbad\nhannington\ndabbs\nmunter\nrockhopper\nsilkwood\nhadda\nchristoff\nmasdar\nmfl\nlangauge\npaneer\nunremitting\ntemplemore\nfixe\nblockhaus\ndandruff\ngleanings\nreawakened\narq\notw\nufford\nwieniawski\nejaz\nbasi\nlugansk\njasen\nbfl\nbrinks\noyl\ncorks\nbakiyev\nwebsters\nwindshields\nwankhede\npido\ngripes\ngamage\narness\ngarrisoning\nboreanaz\nbhasin\ntrackball\nharken\nbadali\nfbr\nvirulently\ngirlish\nismailov\nhapp\nkulka\nchohan\nbankston\ngalerija\ngorgonio\nsegun\njeyaretnam\nwindrush\nhusn\nbaía\ntheriot\ndavido\ncaloplaca\nwallack\nsalicifolia\nlibreria\ngazer\nskylines\nscampton\nbeverwijk\npresti\nnkurunziza\ndangereuses\nbreakdance\nwilderspool\nspode\nimpinging\ndukkha\nsods\neset\nharpies\njubba\nahlberg\nthirtysomething\nonyango\ntorridge\nmairi\nelysée\namped\nburdock\npeltonen\nneshoba\nsirianni\ndaves\narzu\ntywyn\nuriarte\nharijan\ndiffracted\nzhurnal\nneukirchen\nkinnoull\nadhi\nwailuku\ngazed\npopkin\nsawchuk\ntsuzuki\nbeene\nrisotto\nkropp\noutriggers\nchiquito\nnuo\ndrever\nmudug\nradiograph\nbelittles\njhs\nscarpia\nironical\nsouring\nletzigrund\ntimetabled\najk\nreportable\ncaterers\nercc\nalliot\nsimonis\nmeristem\nperrie\nobeisance\ntightest\nceramica\nperforating\nhauptman\neuphony\ncounterclaim\nregurgitating\noutsized\nmenefee\nheadwind\nspillman\nlahood\nrosneft\nsandbag\ndeniece\nsajak\nhettinger\ntulum\ngesturing\nteela\nedwige\nstainforth\ngijs\nkdm\nwindsurfers\nsubadult\nmckinstry\nwbrc\nkaruizawa\nlovel\nichneumon\nnacre\nsechelt\nmetalocalypse\npreciosa\nsanad\nlustron\nphilbrook\nvideotaping\njaegers\ncopas\nbrownstown\noulad\nvpl\nblankings\nchoirboys\ncoode\nyurika\nrivier\nwilburys\nacacias\nascorbate\nroze\nexcusable\nvestibules\ngumshoe\ntimoleon\npetek\nmoure\nilsley\nbiosensors\nmarlee\nroquefort\nbrookins\npharisee\nouma\nimprobability\nftt\nfamilles\nsplay\nfono\ndogen\nigla\nsyrie\nsten
ciled\nluxeuil\nbodenham\nluella\neurosong\nkaweah\ntatanka\nmorny\nalyce\nnotrump\nflout\nesra\ningratiate\nebt\nkushal\npolestar\nblouin\ncharolais\nmicra\ngaceta\nblairsville\nmlf\nwth\nkoreana\nahle\noes\ntiaret\ncrpf\nsatoh\nraspe\ncowgill\nperjorative\nblaring\ncowrote\nantennal\nlini\nmisto\nsirah\nhuckaby\nmoala\nhongkou\nbisque\nswansong\npilley\nchides\nmacchio\nhuaca\nschlessinger\nepoca\niliffe\ngravest\nmycotoxins\nsaja\nzahar\nhansbrough\ntranscendentalist\ncervo\nhaba\ncentricity\nsawtelle\nuserland\nquiche\nalberich\nkehna\nsiffert\ncourbevoie\ndiagoras\nnordea\ncashback\nwolford\npelzer\nchernin\naudre\nlaman\nstrongmen\nregals\nfarndon\narabesques\njervois\nfanconi\nchildline\nhadas\nbelling\ndokic\nkhem\nspecificities\nvose\nhellwig\npetrovaradin\nchastel\nkutta\novertimes\nossip\nmoiseenko\nanamorph\ndematha\ngrapples\nmarske\nemeline\nendino\nquarrelling\nkutai\ncawood\nkanwal\ndevenish\nmcgeoch\ndiabolic\naguado\ninupiat\narline\ntuath\nmccartan\nhyponatremia\ncordwainer\nengram\nliberians\nslotting\nwalke\nbarceloneta\ndenyse\narduin\ntormes\nmaghera\natik\nbors\ncarreira\nsubsiding\nsmiler\nprimatologist\nwahi\nremediate\nassesment\ntroubleshoot\nlangage\nfamiliarizing\nhallström\nusnews\nprivatizing\nkigoma\nsaham\nundercuts\nsrisailam\nbrocklebank\nprudencio\nsurayud\nmoorea\nrosal\nkohinoor\nschor\nsturdee\nsaltburn\noverdoing\nstepsisters\nruder\nszeto\nmdn\nwooler\nducat\nthunderclap\ndirecto\npwl\npuce\nhuong\nbunhill\ntesch\nnarodnaya\nextrapyramidal\nvoyaged\ngoar\nbangle\nmccoist\nlhuillier\neurotunnel\nfryar\nbrackenbury\nonsager\nimmelman\nshavit\nkeddie\nneckties\nmangos\ngrimy\ncanepa\npersuaders\nimprecision\ninviolability\nrefines\nmitsuhiro\nquarrington\npanhandles\nfoundering\nrickettsia\ncomming\nfriggin\normeau\ncherubino\ngayo\narmytage\ntartrate\nclarkia\ndivining\nconstitutionalists\nbloemendaal\naccentuates\nsupercenter\nhoosick\nfairhope\nromanek\ncommunistic\npetters\nwestmacott\nacrylonitrile\ncontradistinction\npnac
\npatriarchates\nconciousness\nbandido\ncapernaum\narsdale\nunquote\nblauvelt\nmencia\nmattocks\nmystically\ncunningly\nfga\njeanneret\nauberon\ntouchback\nyohanan\ncolaiuta\necos\nwaterfronts\nwalloons\npbwiki\ncompulsions\nrpk\nmalfi\nanvils\nafk\nmauling\nselamat\ngunawardena\nbov\ncomplexo\ncluedo\njare\neuv\nhomeownership\njunya\ndicicco\nchantrey\nkardec\nlampkin\nkwasi\ntrat\ngroundhogs\ndagwood\nrvc\nbjelland\ndarpan\nposto\nbyfleet\nfruitcake\ndownturns\ngrotta\nkurile\ncratered\nbingöl\nmcfadyen\nuncrowned\nxichang\nraynes\nsaviours\nbeauly\nsebadoh\ncalva\nomt\nbiochem\nstrawson\nloza\nimaginatively\nhellinger\nsuge\napfel\nbefits\nmuskeg\nhalimi\nromanes\nwaterstones\nbahria\ncentromere\nunjustifiably\nborkowski\nchicas\nridpath\nsaugatuck\nauriemma\nkoinonia\ncsw\ntym\nwalberg\nplantagenets\nchrystie\ncomino\nrodel\njfc\nneatness\nsummum\nrosalynn\ntullow\nnorthview\nterfel\nvampyr\nbecton\npyromania\nmuzaffer\nchilcott\nwoos\nskjold\nkleenex\nberes\npreti\nchloramphenicol\ntremlett\nadenomas\nhopps\nlackeys\nalbertans\nbrokenhearted\ncecchetti\nambystoma\nduby\nbuckcherry\nlss\nscholem\ncottagers\naborting\nphillinganes\nnoviny\noedipal\nbrosius\nberchmans\nwhimbrel\nritalin\npolloi\norlik\narclight\nlatona\nsallow\nohp\noaken\nhdds\nmindsets\nmitchison\nmushers\nchinas\ndinnerware\ncbca\nuncompahgre\nhartwall\npolysilicon\ntapiola\nreisterstown\nprothrombin\ntambourines\nnyayo\nborre\naesculus\nmccarroll\nsequiturs\nmorongo\nholistically\nmessias\nbridgehampton\ngeode\nwaistline\nsalomón\nessentialist\nfresnes\nkeuka\nhandcraft\nthieu\nultranationalist\nzytek\nmcdonell\ncarnie\nlacquerware\nduas\npolwarth\ndisturbia\nbrackman\ncaballus\ngondii\nwinooski\nyichun\noia\nhumiliations\nkerkorian\nugk\noakton\ntsutsui\ndomenichino\ngalion\neventide\nbayoneted\npois\ncarjacking\nazania\nvalkyria\nnaysayers\ngreentree\nthamesmead\nlicia\nhalvorson\nenshrining\nromanticised\nillu\ntdsb\ninterlocutory\nhybridity\nprimi\nkrew\nleiper\noriens\nfafner\nburstall\nb
ackbones\ntimidity\npuritani\nnary\nmalouda\ncrider\nbartolini\nsimin\nalleppey\nsayeret\ncnsa\nkwaśniewski\nfreakish\nlightyears\nshinrikyo\nembeddable\nfasi\nibérica\nshinui\nhytner\nlorazepam\napplauds\nnavaratri\npanik\ngauld\nbaldrige\nbusacca\norgasmic\nnyr\nsurest\nanthropocene\nmissenden\nphillipines\nlavergne\nnmsu\nparaboloid\nmelancon\nhomies\njaf\nhalabi\nteenie\nnordschleife\npatroon\nseaborn\nhockenheimring\nwehner\nforevermore\ncolomba\nakwesasne\nduis\npittsboro\nyaguchi\nmerula\nbeyers\nturgid\naculeatus\nkinzua\ndextromethorphan\nmerode\nmontella\narmato\nisom\nsaggers\ntolmie\nkombi\nkusunoki\nslimming\nwenchang\nliukin\npurisima\nlgr\nkecil\npmdb\nsokolova\nsignboard\nsaft\nkatu\ncatlins\nuwi\nbroadmeadow\npasarell\nsheepherders\nlamma\nandri\ntransposons\nkimmeridge\nstatism\nlyudmyla\nprivations\nskowron\ndarlaston\nfuran\nsoomro\nsadomasochistic\nkhasan\nllŷn\nkelsall\nrawi\nspatter\nintersectional\nvautrin\nfilson\nglazebrook\nmochtar\nnolin\nblurts\njarrar\ndiemer\nunità\ndeveaux\nemmert\nhainanese\nhitech\ndacs\ncdrs\nutile\nakure\nbaade\nelliman\nredfearn\nbrimmer\ntoofan\ninsite\nlipoproteins\nshophouses\nhardi\nmotoki\nsarafian\naltidore\nbeleza\nroseanna\nfrankton\nwoodburne\nhausman\nsanded\npréludes\ndeclassification\nshaoqi\nstroup\nqubo\nvco\nduin\nbonaduce\nkida\nmelpomene\nridout\npalatina\nbaldridge\nmalti\ndonaldsonville\nsingur\ncomfy\nleiston\npapago\nswaim\nyasunari\nwimpey\nammount\ngehman\ncatia\notan\nlivestrong\ntourers\nusami\ninvalidation\ndearg\njetson\nstavro\nlanfranchi\nmantar\nhend\nthuan\nbotto\natago\nthibodeau\nbrotzman\nshange\nmauler\nstorages\ncomi\nunderbrush\ncanasta\nsty\nnichola\ndramaturg\nignominious\nchordates\nlanta\ndispensations\npkg\nfalvey\nperebiynis\nptolemies\ndownpours\nherms\nwwj\nurinated\nresiduary\nhatherley\ngree\nexoticism\nfyr\ndinesen\nhajnal\nevangelisation\nalwis\nblackbuck\nlarin\nmclauchlan\nportail\nviadana\nhso\nheong\nbeheadings\nnedim\nstalactite\nfeg\nalyx\nhirabayashi\neuroba
rometer\nvenkateshwara\nkokugikan\ndotrice\nepicyclic\naurat\ndoremus\nbirge\nweiskopf\ndawnay\nboasson\nkielty\njuncos\nvervain\nfluorescein\npocketbook\nsemon\nshafik\nhemodynamic\nalexeev\nexel\nbirsa\nminora\ndoy\nburwash\nantikythera\nillusionary\nmisstatements\ngrima\nwilshere\nmenhirs\nvergel\ngalbreath\nual\nteats\ncourte\nhiroo\ntribu\nbanjara\nsouthcote\nbazargan\nfrontages\nabyei\nlavern\neildon\nmiscible\nboosterism\naméricain\nhippeastrum\ncalmette\nsfsu\nslaving\nlandstuhl\ndrane\nterpenes\ndrumstick\nsurplice\nkricfalusi\ninjun\ngbh\nfnl\nexperian\nhauff\nsundaland\nreworkings\nhjorth\nkith\naxion\noakman\ncallistus\nquatermain\namidala\nberowra\nsarabia\nbragi\nshrimpton\nteat\nkanner\nkrw\ngress\nsowers\nslyly\nbuckhannon\nstifler\nilion\nphèdre\ndolin\nbrid\npapio\nhebraica\nrpj\ntraceroute\nnudism\nnebe\nvocab\npetherton\nmrcs\nvicent\nchhatra\nbuggers\nahmadou\nflabby\nsleipner\nrifa\npivovarova\nmilloy\ncongressionally\nwachs\nphr\nmilkha\nfreshest\njuxtapose\nmixmaster\nlibitum\narchos\nconvolvulus\nabbruzzese\nosby\nmieko\ninconvenienced\nnayland\nholsters\npontiffs\npalooka\ncloudburst\nmaelor\nshenango\nrepellents\nmiru\njcl\nmatsuno\nclairton\nyogo\njeepers\nroughshod\nrearrested\nsaltpetre\nsqs\nskira\nrohn\nepmd\nfifita\nhyperreal\nnorian\nswartzwelder\nabdy\nboattini\ncircularity\nsheed\nhypotheticals\nstater\nterauchi\ncackling\nchalco\ndónal\nquashing\ntengiz\nflyte\njaunt\nsociality\nnovarum\ncades\namul\ndeniability\nprnewswire\ndolci\ndorland\ncenterpoint\nrango\ndongola\ndevar\nsteenburgen\nmukta\nturtur\nclore\nachham\nogoni\nfollo\nfrangipani\nradebe\nnpi\nquarts\nrendón\nsooo\nswalec\namund\nseabreeze\noverexploitation\nkunisada\nofs\nrubina\nchickahominy\nlevins\nballadry\nsoave\nwayfaring\nbusa\nmadwoman\nswilly\nsatiety\nwoodinville\nkoyukuk\nweigl\ncaulker\nstylization\nimes\nheyes\noei\nyücel\nquickies\nrouch\ncasc\nbarto\nmilonga\negeria\nnonreligious\nmerreikh\nchopard\nrecommender\nduffin\nsurendran\nobfuscating\nxiangta
n\ndvin\nriperton\nimplosive\ndistended\ncubesats\nbreastfeed\npinstripes\nnns\ngerstner\netal\nmerryman\nbedales\ntlatoani\nphylogenies\nitty\ntlaloc\nmwp\nngurah\nbercow\nadewale\nsgn\nbethania\nvendome\ngulen\n˚\nstatin\npharmacologically\nrivest\nsearchin\ncriolla\ndegaussing\nnajwa\ndeflationary\ncortices\nptl\nzoff\ndonnan\nwissam\nxishan\nshahjahan\nwinnifred\nplasticine\ndufy\nbeke\nthaxter\naot\nresentments\narchaeoastronomy\ncibo\nboerne\npinole\ncorns\nprekindergarten\nmyoclonus\ngoral\nbelbin\nkhorasani\nalishan\npervade\npenstock\nraku\nsinkings\nroseman\ntalmudical\nsimic\ntdcj\nceni\nkma\npittosporum\nrmg\ncanopied\ndeforms\nzootopia\nigt\nzeleny\netp\ncecum\nseraing\nflecked\nlampi\nhelgeson\njayawardena\njostein\nforefinger\ndebbi\npawtuxet\nbeb\nbillows\nidas\nlattes\nfumarate\njuscelino\nharvill\noag\nbalts\nplastron\nserviceability\nprowlers\nhigley\nstiegler\nkreisky\nmixup\nbrok\noccassions\nhaggart\nfrutos\niohannis\nmuffy\nsaqi\naxford\nmindbender\nquins\notieno\nantiproton\nhambly\nermelo\nreimbursements\ngimlet\ninsurances\naporia\nmazi\narmley\nbottcher\nadès\nyeu\nlegat\nrudnick\nbustillo\njharsuguda\ntattle\ngrooving\napnic\npalak\nkrishnaswamy\nbrecel\nartificiality\nleatherwood\nkyoung\nleverton\ndoubler\nroue\ntrebišov\nshoprite\naraby\nwindu\nlekha\nkedzie\narber\nshariati\nrybka\nschou\nmidem\nfre\ncnoc\ngentrified\nbeauforts\nmoutray\nalguersuari\nstrongarm\ntros\nforewords\nmukilteo\nritsumeikan\ndjenné\nkeshia\nscoggins\ndraping\nkatheryn\nnaini\nmeggie\nthomassen\njodl\nkoponen\nwamena\nvars\nshariat\nyakimova\njub\namneris\nbritches\nejb\ntruism\nwince\nprivation\nalterman\nosteogenesis\nfreestyles\nameobi\ndavidians\ndzmm\nfutian\nromanenko\nsargento\nsalima\nguatemalans\nvandyke\ndiakité\ncolaba\nnavales\ncornforth\nsavor\nbellu\nhuidobro\nesthetics\necclesfield\ndalliance\ncategorises\nsinema\nsverdlov\nduritz\nstradella\ngraebner\nwebgl\npiazzale\ntaitz\nconferment\nmcdermid\nzorin\nchatelaine\nkeratitis\nodt\nexonerating\n
trebbiano\nbartha\nsnb\ngreektown\nmirkwood\nrics\nnaxalites\nbochco\nkewanee\npartula\nyoughiogheny\ntetrafluoride\nhallucinates\nunwholesome\nexploder\ndeferment\nkyan\nbarf\nmadrasahs\nopportunists\nmispelling\nackerley\nturbofans\ningenue\nwmp\nweatherboards\nscowcroft\nnasreddin\nhizbul\nlucier\ncampton\nhandmaiden\nfoghat\nferrum\ncreede\ngoads\nfluorophore\nkotwal\ngolam\nslyke\nclouser\nhylas\nyamagishi\nbradtke\nwesthampton\nmultispectral\nmonsta\nzeba\nmiyashita\nsubversives\nkazooie\nlatics\nenvelops\npremonitions\ntamanna\nvoronov\nberardo\nraffael\nglycerine\ntoyoko\nfirestarter\nworsham\nokotoks\nnoordwijk\nkotak\nepidemiologists\nargan\ncuckold\ncluck\nlongstone\nbronzed\nsantucci\nclm\nsesay\ncotati\nmontblanc\nexplicate\noversubscribed\nwfla\ngroce\ndenyer\nfends\nmentzer\ntrapezium\nlongines\nrocketship\nimmediatly\nkatahdin\nodours\nguten\ngangopadhyay\nlangat\ntandil\nbreezeway\namadis\nkoby\nfrutescens\nwoodlice\ndite\ntilo\nsilkk\nbottega\nstorz\nbastin\ntongo\njunia\nalekos\nvasi\ntimperley\nmelkonian\nupstaged\nlopatin\nstangl\nribcage\nalcona\nilgwu\nhaggling\npanas\nmaconie\nundigested\nnafi\nuseage\narkestra\ncran\nnerval\nhaslem\nbeevers\nbrûlé\nbhar\nmellows\nnihonbashi\nscorpionfish\nnasreen\ncordwell\nlilleshall\nzielinski\nminzu\nguarantors\nstuckism\nheadpiece\nwakehurst\ntollgate\nsensenbrenner\ngelbart\ndewald\naltho\nafterimage\nroybal\nalmondvale\nblenders\nmonomakh\ntheakston\ndriessen\nkripal\nmarvelettes\npellicer\ncongaree\nvoo\nrestates\nespers\nsealey\nclaptrap\niechyd\ngärtner\nfoetal\ncinderford\nsilvertown\nfelician\nsovremennik\ngoldsman\ncomorbidity\njarratt\nsofitel\nyoong\ncattrall\nyashpal\npurdah\nlaury\ndietitians\numist\nheliodorus\nsmosh\ncompactflash\nakeley\nsolvang\norley\nskiffs\nlarios\npotluck\namazonica\nrachman\nvideocon\nrando\ncéleste\nhelsby\nsonars\nfenlon\nmalmgren\nnayagarh\ncodice\ncreaky\nspewed\ndisfranchisement\nstammering\nkalt\nmccovey\npeutingeriana\nanoeta\nlungi\ngherardo\nevanier\nlarner\
ntumorigenesis\ntityus\nplats\nhavisham\nnaiman\nrowett\njuggled\nseaworthiness\ntelegrapher\nvoyeuristic\ngadgil\nvittal\nanchorwoman\ngarris\ndromaeosaurid\nbiennials\ngravelle\nnacion\npiebald\nscreener\nbaco\ndenni\nblethyn\nunderappreciated\nbenzi\npacy\nbinod\ncelebre\nsanitize\nladyhawke\nkulu\nenriqueta\ncambon\nnegre\nlaka\nmosier\nearnt\nstaleys\nautumns\nglx\nshepstone\nkinard\nttg\ndantan\ncaos\ncley\nvyner\nharrowby\ngoldbergs\nsomerhalder\nessanay\ncapsizes\nklick\ncolorists\nconfidentially\nclaudiu\nmaiming\nxvid\nmarke\nknobby\ntranshumance\ntiroler\npulu\nepaulette\ndbx\nleff\nconveyances\nnibelungenlied\njubail\ndemoralize\nvtm\nnorcia\ncanadarm\nscherman\ncahier\noutflanking\nankaraspor\ntailgating\nwiseguy\npunxsutawney\ncaldicott\nvashi\nivonne\nmiéville\ndeyrolle\naplenty\nnebbiolo\nmunden\nkuldeep\npalmate\nworton\ndissimilarity\nfanciulla\nchahine\noconomowoc\nchérie\nhuskie\nstoats\ndesirée\nkhadijah\ntotter\nscrappers\ncritz\nakel\nranatunga\nbarón\nonge\nvanek\nkany\nshimazaki\ntanti\nrofl\n,but\nsanctify\nmarmosets\nzq\nklett\nnuon\nniner\nirrfan\nbeauté\nmcteague\nséverin\nsatpura\nnewscasters\neaf\nmonkman\ngatun\nsoucy\nseismograph\nalpa\naubrac\npropping\nbertelsen\nfabricators\nrini\nmorial\nsarwan\nfowlkes\nautonome\nmonn\nzehra\nsqualus\ncarhart\nkamali\njamerson\ngcap\ninsures\nhorsford\nwieser\nbluefields\npadovani\nchesler\ncarstensen\noverstretched\nseverinsen\nberjaya\nottaway\nalbery\nulmanis\nstagehand\ngelo\nhollen\nosian\nnexen\nunhesitatingly\ndignam\nbatalha\nsituates\npoulain\nskomina\nsportier\nrefrigerants\nrodway\ngrieco\nthermocouple\nesterase\npolders\ncarbamazepine\nabadie\nklaxons\ngidding\ntsuneo\nbrassiere\nhopedale\nconoco\nslimani\npkd\nnuc\nfickett\nscio\naloni\nmarva\nfashionista\ngavilan\nantonucci\ntabern\natmospherics\nfusil\nvcds\ncontactmusic\nwach\npotentialities\nabhiyan\nmultifamily\ntypify\nbrassens\ncaquetá\nmayoress\nsimpkin\ngachet\nsriracha\nheaphy\ntarquinia\nmallinckrodt\nbattlers\nfroebel\ns
wirled\npessimist\nnampula\nterritoriality\nuniversale\ndrottningholm\nneutralizes\nchampneys\nstecher\ntherapeutically\nmugshots\numu\nthura\ntardelli\ndigoxin\nzaia\nmongers\nprorogation\nspieth\nnedbank\nreadies\nputrefaction\nparmelee\nchinen\nbergdahl\nmaske\nkiddush\njoux\ngorce\ngenoveva\nrosengarten\nfanfares\nimx\nseiter\nwatlington\nunus\nakranes\npyatt\nmyburgh\nches\nsunstar\nklehr\nblechnum\npozzuoli\nlinha\ncallison\ntikaram\nlsr\nefsa\nncar\ndisputations\nhallowe\nudom\nappell\nkinkel\nvocalisations\ngrandsire\ncostes\ndriberg\namyot\ntailspin\nwareing\nvarices\nbrister\ncorcyra\nbushbaby\nbackhanded\nvorkuta\nfoliated\nparoxetine\ncors\norientalia\ndandong\nprobationer\nunbeliever\ndaylighting\nbridson\ncalque\nhusson\nfrigo\nngoni\nyih\nderic\nergs\nohh\nharrie\ncomtes\nsymbolists\nbayon\ngoschen\npfennig\nreconvene\nwoodsworth\njerningham\ntwinkie\ndahle\npchs\nmarre\ncodfish\nsinew\ntrifles\nlabasa\nhightstown\nstrutting\nmillo\ntemerity\ntensioning\ncévennes\nwearside\nstapes\nophthalmological\nsodo\nannua\nhemorrhages\nhomoeopathy\ncompositionally\nchrism\ntransiently\ndenso\nsanton\nplexiglass\nflink\ncavallari\nnjn\nbentall\nfrankenthaler\nfluence\ncasar\nniwas\nquinteros\nnisbett\nramanand\nogni\nwinant\ndisquieting\nstrick\ninfocomm\noffing\nfagg\ngriots\nenppi\ntfeu\nnafs\nhouben\nmelander\nneher\neaker\nalbinoni\nvernonia\nfarne\nenamul\nkoscheck\ndecisionmaking\nrealaudio\nviduka\nlasher\ncftr\ngerken\ngaulois\nlesch\nzedekiah\ncapobianco\nbirobidzhan\nprickle\ngxp\nbivariate\nasynchronously\npoleward\npangbourne\nhappend\nwalkerton\nmaars\ntanuja\nanglesea\nmaney\nteige\nsensical\npharcyde\nokeke\njca\nencrypts\nebcdic\nonoda\nmetabolised\nvoznesensky\nneshaminy\nnemuro\nplumages\nlassi\nantm\naconitum\nchaska\nanglos\nvoris\nverrocchio\ngarni\nusborne\nnmm\nsequelae\ndiciembre\nbhubaneshwar\nbalthus\nmoneylenders\nmayol\nnaudé\nhoog\ndeckhand\nwaheeda\nxfc\ndiederik\nlebed\nximénez\njeers\nlagunas\nnonmetal\nksp\nnooksack\nblimey\nbukav
u\neuphemistically\nsluis\nkrem\nclews\nbuyback\nmaathai\ncorpuz\nnizza\nsuncor\ncommitee\nwisps\nfictionally\nradiosurgery\ntamina\nloken\nnanton\nnoes\nhuf\nbuis\ncigna\nbendtner\nshantha\nanchorite\nfunneling\ngrassroot\nryohei\nungureanu\nlippitt\nveber\nsupercooled\nrethought\nbarings\nusenix\njanel\nunintuitive\nactinium\nownby\nyae\nwaterberg\ntsuruoka\nsphinxes\nchenery\ncetshwayo\ngowon\nmicrons\nnozaki\nhindawi\natherosclerotic\nstenger\nrenamings\nstrawn\nilly\nwarders\nsindelfingen\nmorganza\ngandini\nlaxatives\nabyad\nlinzi\njex\nnasrin\nruane\nsynecdoche\ngalette\nshamanov\nsteinke\nslapshot\nredwater\nuummannaq\ncerise\nfoxman\nhorman\nhalesworth\nrésultats\nductwork\nthomasina\nprostatectomy\nrendon\nsynchronise\nqtv\nradburn\nbushwalking\nsuppressant\nkonig\nalzheimers\nglbtq\ngiguère\nshutterbug\ntodaro\nhaneef\narcheologico\nanoint\ncousineau\nmandalas\nbtg\nphev\nwomaniser\nanimo\ndichloromethane\nkoslov\nduhon\npikesville\nanri\nadon\ngalax\npacifique\nvogts\nanticonvulsants\npanzerfaust\naeroméxico\nbillingsgate\nwallinger\nokefenokee\nburghardt\nluanne\nosos\nmutabilis\nairbrakes\nplanetesimals\nreassembly\ngnaw\nimaginings\naplastic\ngitanjali\npung\nronalds\nbuxom\nnorthwind\nsturbridge\nlaem\nsamian\ncairney\ntailend\nkochanowski\nswiftness\nattucks\nkinta\nhaitink\nsafwan\npreening\nlhotse\nwinkles\nscheller\nleonetti\nduathlon\ntrickling\nhatherton\nroundness\nchoughs\nturntablist\ncers\ncaucasoid\ncinecittà\nkimpton\nglis\nallatoona\ninterrelationship\nzacapa\ntyro\nsova\nwatchet\nunece\nextrovert\nkowalska\nbulstrode\nsuperposed\nardingly\nchetwode\nmilitiaman\ntejan\nmizz\nyuletide\nshettleston\nincapacitates\nparenthetically\nscrewy\nreplaying\nbazillion\nzoila\nshmidt\nmccoury\nmcshea\nharalson\nciborium\ntoxoplasmosis\npeepal\nboor\ncompania\nsayreville\nsanogo\nbegoña\ncatechin\nburgan\npapis\ndniprodzerzhynsk\nfragoso\ndickies\nnullah\nmengal\novermyer\nimmaculately\nnakatani\nutopianism\njsi\nghislaine\nolimpic\npostma\nassignati
on\ncardell\njogger\nunderestimation\nloosed\nrusskaya\nutpa\npantheist\nsobha\nbruxism\nvti\noutfitter\njustman\ninhabitation\nlipski\nabsense\nsillitoe\ncito\nvjekoslav\njingdezhen\nfrégate\nencephalomyelitis\nibérico\nmawashi\nbernabe\ninfrasound\nebor\nwatercolorist\ngrof\nrapson\nkautokeino\ncbx\nfagans\nuncategorised\nrevillagigedo\noutgained\nbentos\nhajibeyov\nwinterhalter\nbackcourt\nazzurra\nalcmene\nomnipresence\nsherwani\nkratochvil\nrepairable\nreenters\nerections\nzoominfo\ncentinela\nheadrests\nbollman\nbuonarroti\ntillery\nsolera\nkortchmar\nruffing\npiddington\nidfa\nculley\nmellal\ntacony\noommen\njochum\ngasper\nsamit\nnolde\nerbium\ndeducting\nreedus\nfauzi\ntempura\nhandwaving\nfelids\ningrassia\npoultney\nsashi\nencanto\nhorrendously\nglasscock\nkassandra\nmaassen\nrhabdomyolysis\nmobilizes\nhottentot\narika\nchauvet\nwaiheke\nvaladares\nbrenz\niowan\nnilssen\nduch\npenson\nrodda\nakashic\npadmavati\nheadey\nhaf\nminehunters\ncoanda\nbrucei\nikram\nmccarville\nrosenman\nsadleir\nvirga\nbinhai\nsuperhit\nsnowballed\ngranado\nbrunettes\nkalay\nkittiwakes\napprehensions\ncawthorne\nsterols\nembleton\nnormie\njennette\ncaillebotte\nstronge\nknower\nbutyrate\nairfare\nhartlaub\nshwegu\nlensky\nsparkly\nflorentina\nctt\ndyre\nbomer\nnoxon\nkazanlak\ncarrell\nmanoeuvrable\nfayum\njsut\nannen\ntangmere\nkostner\npharmacologic\nkayenta\nsarkisian\nbresnan\nnevelson\nsuji\nkuya\ncephalosporin\nsouces\nlotuses\nthatcham\nburkert\nairwave\npaddlefish\nhattin\nholzmann\nrondebosch\nscafell\nwildlands\nmoneylender\neaglet\nchilvers\nupington\nromanovsky\niguala\nautoexpress\nshinichiro\nrépublicain\ngraefe\nfka\nusccb\nfaizan\nbindra\nbanska\nziller\nblanchette\nherby\nmorishita\nnaff\nkiselyov\ntric\ncouloir\nesol\nsibuyan\nvanhanen\nkentuckian\nmcelhinney\ndignify\nmbaye\naircraftman\nbarbi\njedec\nioffe\nkneading\nraben\nwallwork\neadweard\ncrédito\nblatty\ntearoom\nstoichkov\ngennadiy\nclattering\ntcw\nistc\nmaryvale\nlabiche\nmasu\natsdr\npuddling\ntigh
tens\nfarthings\nantawn\nvindex\nchocolatier\nburbot\nwilentz\nbrows\nrebroadcasting\nkolff\ntompkinsville\nmagnanimity\nbokeh\nabx\nstancu\nendoscope\nbelek\nirrigating\nstudentship\nnrb\ntrygg\nkubuntu\nbioreactor\nmicrofluidic\nkensit\ncoms\nnavigability\nfiver\nbemoans\nchata\nsuperfight\nfishbein\nqader\naltissima\nbussard\nphoenicopterus\nmawlid\nsuccès\nhargis\npromotor\nucv\nkelsang\nubaidah\ndishwashers\nrincewind\narrabal\nmimms\ncroon\nsurmountable\nngok\nmalipiero\ngyfun\nwincanton\nhelfrich\nwayfarers\nshepperd\nhellmouth\nmistrusted\nleoncio\nblurt\nmado\ncyworld\ntoilette\nmenifee\ndewine\nknu\nziemia\npalaeography\noresteia\nwadhwa\nshorebird\nkhayat\nmires\nsparsity\nemployments\ndanzón\nfrayne\nmarash\nvoges\niriarte\nmughalsarai\nbeckons\nmulled\nhueco\ntumbleweeds\nlongshoreman\nminifigures\npeonies\nravena\nfiorentini\ncdx\ncouncilmembers\npetróleos\nmanninen\nravishankar\ngerbert\ncommie\nfum\ngarigliano\nmcginnity\ntormenta\nblackton\nkratie\ncumbres\nisco\nunmotivated\noosterbeek\nlazzarini\nrelishes\nceller\nmultiphase\nlammas\ngedi\ninforma\nmsci\nmiño\ndispirited\nrackspace\nkpl\ncephalosporins\nconi\nflagstad\nshivnarine\nmitts\naccion\nlunate\nmalfatti\ngarciaparra\nlongyear\npayless\nbumiputra\nkasab\nsolie\ngjilan\npardoe\nfieri\nmytton\ntablighi\nhallet\nriggers\nkarang\npresupposition\nblier\nmccree\nepictetus\ncopt\noec\ncattedrale\nchateaux\ngeneralists\nhila\nhorsehead\nnorham\nleachate\nundercooked\ntiberio\nfacey\nascendency\nkaist\ngemological\nplaything\npickings\nnvo\ncreepshow\njackknife\ndooms\nhums\necht\nngorongoro\nbauza\nrusholme\nproclaimers\nbrewis\nleghari\njailbait\nbuchalter\ntitmus\nextravehicular\nkarlskoga\narifin\nehe\nneeta\nkenneally\nclézio\njustgiving\nminshull\nnarodni\npolack\nbriceño\nlongdon\nleavis\nsupercells\nbeddoe\ngulags\nrauner\nkobol\nbandhu\nguipúzcoa\nstookey\nanthill\nwensley\nwrotham\nlissitzky\nomnis\ndavern\nchalcopyrite\nhoots\noviduct\ndejesus\nchamal\ntrekker\nfarro\ncryptographically\n
cranshaw\npolonica\nvestment\nigbt\nrira\npinchbeck\nbmps\nlebow\ninterparliamentary\nchanters\ndirham\nguardado\nmetronidazole\nhpl\nbrownlie\namphitryon\nunclos\nhollandsche\nastronautica\ntauride\npinkard\ncatolica\nbayarena\nnfhs\ncorddry\ndisfavour\nvasilisa\nsongbooks\ninadvertantly\nopposable\nrepopulation\ngintaras\neckman\ncalahorra\nwedemeyer\nkozue\ncarpinteria\nradka\nstringy\nmeiktila\nazuay\nunwelcoming\nkhural\ndonaire\ncirkus\nbarzun\nbrazza\nhulot\ndizengoff\nbidston\nnahid\nwdcs\nalvensleben\nfumaroles\nsirindhorn\nbelgacom\nfeverishly\nstockmen\neai\npowles\nbusinesswomen\nprobity\nselectric\nmousasi\ncoerces\nblairgowrie\nshambhu\nadjudicators\nasrar\nmckesson\nelektronik\ntyrie\nscullery\naminah\npeabo\nchugging\nincongruent\nwoz\nhmnb\ndislodging\nincorrupt\ndichotomies\nfireships\npyi\nblag\nmeanies\nrefreshes\nrapacious\naio\nunbearably\nlakki\nhejduk\nrioux\ntsvetkov\nclasts\norisha\nimaginarium\nwestra\nrepackage\ndiaby\nmissolonghi\netting\nrepaved\nnewish\noge\nisolator\ndivesting\nrudenko\nrosarito\ndehavilland\nstimulatory\nfoliar\nskyrocketing\nmosson\ncrystallographer\ncatagories\nburnden\nstevedore\nvassili\nrantanen\ngothard\ndiagne\ncliffhangers\nrooyen\ndomenici\ninterloper\nachleitner\ninslee\nmurwillumbah\nfreebase\noverloads\nfazlur\nifsa\nmaor\nindigestible\nobiter\nmetrowest\nremarrying\nyashiro\nreconsecrated\nsoutherland\naffixing\nbuel\npreamplifier\nmeatloaf\npurchasable\nborth\nlastra\nkepi\nemended\nnafis\npamina\ntactfully\nsubframe\nhippopotamuses\ncatchall\nbvg\nthirlmere\nwhizz\nmargi\ncaprock\npacifics\nsepticemia\ntransoceanic\noutvoted\nkashmere\nsatis\nifans\nexhume\nriobamba\nfedde\nbwana\nnoureddine\nbrachytherapy\nsahoo\nspoo\nmarle\nphmc\nrheinhessen\nengelhard\ntamada\nlera\nibaka\npestered\nkrahn\nmelanesians\nlcn\nmcat\nhottinger\nchemise\nnewsrooms\nblondeau\ncutoffs\nyasmeen\nlevita\nservi\nrupel\nusca\nnagarajan\nreanimation\nvcf\njairus\nladue\nsquashes\nbickers\ngabbana\nquillan\nands\nbrisebois\nizh
ar\nwinkelmann\ndhanraj\nsultanov\nsmithee\nlutter\nfov\nmazel\npetya\nreconverted\nceilidh\nufj\nmisuzu\nbused\ncatwalks\nshaaban\nkawano\ndressmaking\nridgemont\nhammerton\nhorniman\nbireh\nsubpart\namran\ntracksuit\nmurty\ndecapod\nreichmann\ncarlock\nsapping\nthrombus\nicb\ngarmon\ntakamura\ncamryn\nadumim\nweaverville\npajaro\nudayan\nchown\nsanterre\npolarize\nmisanthropy\neisele\nmuh\nchisum\nnarberth\ncytogenetics\nisopod\ninclosure\nanacleto\nludden\nsapped\nmetron\nitsy\nirksome\nstratfor\nstraggled\nlugg\nholleman\nsemiramis\npleat\nboosie\nweakley\napolipoprotein\npuerperal\nbarga\nforni\nrusu\nivies\nsardinians\ntoco\nvora\nknotting\ndauer\nachaia\nexcruciatingly\nsinfonica\ncapponi\nnicoya\nnwf\nstankovic\nhintze\ncommercializing\nsulfoxide\ndichroic\npâté\nnoctule\nskoglund\nbibliophiles\nuig\nwelte\nchristiani\npajero\ndemetria\ntrumpington\nplacidia\nsteeves\nruzicka\nmdk\nwilner\nkompakt\nlemba\nbiagi\ntimpson\ncintas\nnunchuk\nbottrop\nbbk\ndurang\npostern\nmaloy\nahonen\nrafal\nlarrousse\nregehr\nkaelin\nmenchov\ndlt\nchrysis\nfli\ntinderbox\nallinson\naboud\norie\nshowstopper\ngraciano\ninspecteur\nwhodunnit\nfarol\nbaule\ncgf\nfroch\ninundating\nbicalutamide\nabit\nbizkaia\ndinitrogen\ngenitourinary\nweizman\nincendiaries\ndugway\nrangamati\ncosigned\nwaziri\nlolli\ndrool\nmillenarian\nnaderi\nchembur\njosiane\nstannary\nsetauket\npoltimore\nalberdi\nneuschwanstein\ndodsworth\ncallard\nkubek\nastrée\nkardia\nvientos\ndubourg\nkalbi\nzukor\nsubsume\nmanzana\nhoque\nunsuspected\npelorus\nslideshows\nholle\nuncoupled\ntuta\nhalwa\norcadian\nparktown\nflauto\nmaipo\nardour\nskovgaard\nrefuelled\nmillerton\nevensen\nberend\nsafwat\nunweighted\nkhurda\nrockit\njaqueline\nmcclymont\nsinterklaas\nepitaxy\ntrousseau\nlagers\nlabbé\nhuzhou\nrefiner\ncoss\nprobs\noblonga\nsayville\ndesmoulins\nsauvages\nsarani\nazules\nboje\ngolomb\npneumatics\nclonard\nalbrighton\nkaira\npetard\narnall\nkilduff\nista\nlevison\nbielsa\nponcet\nhiley\nbrigit\nglenroy\nsone
s\nvissarion\nfisica\ntangata\nasos\nurbaniak\ngillie\nlaliberté\ntrashcan\nberingia\nkrissy\ndunked\ndeclensions\nmorriss\nphilomel\ndubno\nhovland\nquainton\nmuga\ngaleb\ngrigorescu\ndepailler\nkyros\nfuryk\nboussac\nblazin\ngilstrap\ndredges\nmissi\nulp\ncroes\natum\nmoscovici\nkark\nhartson\ninsurgentes\nwoosnam\nartemisinin\neuropos\ntakaki\nrotondo\nmusicae\nkheng\ncollierville\ndevere\ngradin\nabdicates\nvolmer\npowter\npearman\necuadorians\nbluto\npolat\nsexting\nturmel\neleison\nazd\nhartono\nmelitopol\npanspermia\newf\nlesean\nanticoagulation\nmonreal\nzigbee\ndampness\nzakrzewski\nfirstenergy\noxm\nowa\nnwe\nburkholder\nsauda\nalioto\nmouna\nbehringer\nwilbanks\nsnowmass\njakobstad\nheder\njudkins\ngostivar\ngulley\nghomeshi\nloathes\ndonghai\nbedfellows\nhaggai\ncolloquia\nbraybrooke\nherrnstein\nmicroenvironment\nkingsdown\nbroda\nlaugharne\nizabal\nennerdale\ntragedia\ndengler\nprenzlauer\nchafing\nkhandelwal\nweardale\ntombalbaye\nprissy\nkmph\naskham\ndhea\nmaram\nmangla\nazeroth\nkexp\ndevgn\nvariazioni\npomace\nkimmie\nsakhon\nsemey\nbalbus\nmimeographed\nghanshyam\nbrickhill\nmontale\ncobs\nbresnahan\nkiriakou\nalls\ndesrosiers\nredhat\nfaucets\nmaurie\ncrapo\nrowson\ncroucher\nkayne\nhopatcong\ngranbury\nunprincipled\nazmat\nrinus\ndionysia\nboyles\nideo\nstackable\nkegel\npennyroyal\nkenitra\npide\nmenza\nretards\nlajas\nmacrovision\nkononenko\nkloof\nmasataka\nwoodyatt\noved\nspreewald\nguimet\npossiblity\nnominalism\nheming\nnearshore\npolicia\ngenevan\npingu\nstylianos\ngurdas\nmutts\nabdulkadir\nhugill\nwebex\nfrangieh\ninet\ncruzado\nhunched\nenigmas\nhallvard\nlaza\nschieffer\nlevitz\ngorget\nbestwick\nstathis\npedipalps\nschwalbe\nmonosodium\nbrockovich\nwidmann\nsoso\ntillers\nmeninges\nactualidad\nhabash\nreceptivity\ngnocchi\nneuropsychiatry\nbantustans\ngildea\nfubar\ncroswell\nchamunda\nroyo\nnovik\nraban\nblomkvist\npicon\nlcu\nterezín\ncowman\nparastatal\ncappa\nmissoni\nrafn\nminiver\nletoya\nfloodway\nyanofsky\npontiacs\nzaher\np
urples\nmcguffey\nbalut\nvdi\nvenne\nmeurig\nburidan\nautorun\ncripplegate\ncapstar\nbaudry\nlacaille\nlugoj\nsusteren\nksm\nheilbron\ncavelier\npeyroux\nseebohm\npecans\nredfoo\nlumo\nmangel\nlakmé\ncopei\nbispo\nprototyped\nprasada\nbph\narani\nmediacom\npacoima\nhaitham\nspeedboats\nstraka\nsciacca\nhorology\nslevin\nldn\ndullness\ndeadlier\npaestum\nputi\nleta\nchaebol\nakino\nfreedomworks\nsaidu\narseniy\nmakedonski\ndanu\novide\nbarleycorn\nslashers\nagne\nbjörling\nzazie\nsunanda\njocky\nfatehabad\nnwl\nvdot\ntsawwassen\nchug\nhebel\nreparative\npauker\nzinner\nglostrup\nslimer\nnixed\nwetness\npectoris\nfdm\nfumigation\nfti\nsabado\nshagari\naquiles\nbahrani\nkasprzak\nspurling\ncaesarian\nlarroquette\nheathers\nfeaf\nvsp\nreperfusion\nbarnham\nrossella\nvenere\nopinon\ntmx\nglenalmond\nosmar\nlifesize\npagny\nleaderships\nsloper\nmuldowney\nstelvio\nbrumbaugh\nvano\nappends\nhoulding\nivanchuk\ntownhomes\nbelleview\nashrafiyeh\nceaselessly\nbyrum\nbleiler\nwtic\nbriony\nesrc\ngingerich\nagta\nmisfired\nkerli\ninfographic\ncustomisation\nsnuggle\ngnarled\nrues\nzemlinsky\npliska\nstarkie\npuer\nviewtiful\nivp\nfrederickson\nevigan\npicciotto\nmalcontents\ndoshisha\nowlman\nheidelbergensis\ndarwini\nterabyte\nstroker\nlakshya\ncartoony\ncardiomyocytes\nspallation\nwarka\ncurvaceous\nminda\nsoos\nattachés\nmccardell\nheartstrings\nbbdo\nwadala\noktyabr\nmorata\nbohlin\nfikri\nkempes\nacquittals\nipd\ntpv\nnesa\nbng\nhafs\nlaparoscopy\ncoonoor\nnetgear\naksoy\nsverker\nlinolenic\nprenzlau\ngrandly\nwaqas\nbellin\ninterlake\ndeorbit\nniaid\niwas\nmodjeska\netrog\ncsaky\nunie\ndrownings\nnuss\nsabourin\nfolch\nastrometric\nfoton\nhamidi\nestrogenic\nmaure\ndismissively\nkoln\nkhajimba\nprepayment\nrotella\ncoaxing\nredolent\nparel\nsabbah\nwoodshed\nunderlay\nboothbay\nmusetta\nhomebuilding\nzwinger\nneuberg\nbartonella\nforetelling\nrevson\nsieveking\ndynasts\ndragna\nhidetoshi\nbikel\nlazaros\nrallis\nswarthy\nornithopod\nbethan\ndebden\nherington\nthorton\ntil
lett\nficci\nfleadh\ndistrusts\nleffingwell\npleso\ndarnton\nalph\nmecham\nengenders\ntonally\nconcetta\nhanabusa\nhelvellyn\norients\nvolpone\npentacle\nchassagne\ntapajós\npcv\nsentra\notmar\ninquests\nlunds\nunexposed\ncollies\nsavva\nsteinmeier\nvictimisation\nwapakoneta\nearmark\nboscobel\npapiamento\ngaidar\nwjbk\nfruitlessly\nsaddledome\namraam\npainlevé\ntzipi\nkirkeby\ndombey\nnordica\ngoofing\nlvl\narkel\ncrimped\njonasson\nboase\ngits\nlockable\ntrampolining\nginer\npiek\ngfw\ntexturing\nloiter\nswoops\npcmcia\nhausner\nkemmerer\noutwitted\navenges\nkayoko\nrationalistic\nilunga\ncharacterisations\nbonnot\nmorayshire\nreinterpretations\nfomalhaut\nmotier\ncrip\ncommendatore\njaeckel\nsighing\nneet\nterrarium\ncaac\nlarocca\nvojtech\nwakemed\nswartwout\nfantasizes\ntrumbauer\nfearnet\nmandira\ntelegraphist\nspillways\nropa\nsarsgaard\nmcniven\noccoquan\nkihara\nforethought\nuzma\nconleth\nasensio\nnbg\nisam\nrustication\nlansana\nnebi\nnumazu\ntoklas\ngrendon\njanin\njnk\nvasseur\njrue\nsteamrollers\nkikwete\nviridian\ninflexibility\nnajeeb\nducas\nbilad\nmiddlemarch\nwalraven\nfarrokh\nzooxanthellae\nroomy\nfroelich\nlazzari\njarle\nprotuberance\ntempor\nsteigerwald\nntn\nkome\nscacchi\ndryopteris\nvogl\nmabee\ncauldrons\nguárico\nassateague\npericarditis\nlitigating\nbhosale\nlunden\nbluewave\ncranko\nborgmann\nsalcombe\nbasch\ngored\nhaparanda\nyaobang\ntreinta\nbulwarks\ncentrists\nheldman\nmukunda\nbrora\nrockhurst\nrums\nmusson\ncowal\nstammers\npsmith\nproctors\nperfetti\nstayer\noddo\nsuccessional\nblackcomb\ncaffey\ngatley\nserow\nneuritis\npropionate\npetz\nhydrocodone\ncoachwork\ntalitha\nboces\nfichtner\nzayat\nglickstein\nvarick\nintimidates\nhempfield\nmouzon\nallopathic\nshehri\ntmr\nusr\nchindwin\nbittencourt\ngabbert\npamirs\nnevel\nwandle\ndailly\nbaril\nmattila\nhusker\nhorological\nviolative\nedgartown\nalway\nfriede\nbidwill\nquintette\nuria\nauv\nnatus\nbdm\ndakini\nfrontside\nhummock\nmarzano\nfactum\nsadak\npaulhan\nhominy\nbilliont
h\nzuk\nrosses\nseared\nlindeberg\ntarso\nborstein\npemmican\ntrotted\nsarasate\nmaurine\nteepee\nadamczak\nneptunus\ntuman\nazaz\nluh\npmg\ncognitions\nkntv\nposses\ndifford\ntsvetana\nschelde\ntova\nberiberi\nclauss\nbookcases\nmcgirr\nscobey\nalexanders\nrailcard\nkeyport\nparnes\ninducer\nlish\nciampa\ncubas\nironworkers\nmadrona\nteran\nphrasings\ngmat\nsrivastav\nmahaveer\nvisto\nspringbank\nhosier\nkneaded\nudder\npelling\nvlogs\nboules\nsonisphere\nheinsohn\nseacroft\nuhs\ncarolyne\ntilford\nzul\ngrolsch\nunfertilized\ncreutz\nbennani\nflamme\nobviating\nokawa\njaba\nnaarden\nmahy\nsaturating\ntastefully\nbushra\nzeil\navidan\nabidi\npedroia\nferritin\nborage\nsare\nlorica\nmetsu\nitcz\noladipo\nlunceford\nfredensborg\ncertianly\nirwindale\nanm\ngerminating\npersonify\nakar\ndava\nsilane\ncapades\nbullfighters\nanachronistically\nestatal\nmarquises\nerno\ndutkiewicz\nasar\nmisbehaved\nherbalife\nrayan\ndeflectors\nequipo\nbonspiel\nculturelle\nmehar\nadly\nhyden\ncantley\njamborees\nmnp\nheeley\nmariann\nguillaumin\npowerlines\nfranka\nrampa\njunipers\nmoris\nnutcrackers\nhelots\nthrongs\nbilli\nbaranski\nity\nquartile\nbathtubs\nmarmite\nnahman\ninordinately\nessel\naem\nbathymetry\npulmonology\nforded\nravalli\nhenckel\norosco\nsplices\nabarca\ncpsc\nkahr\nrauh\npotton\ntransgress\nimmunizations\nthackray\nanahita\nmoistened\nmeja\ncomicbook\nmccubbin\nkies\ngherman\nhajjah\nfleurieu\nmcconnel\npolitis\nlirico\nlorax\nhuling\nbajaur\nchait\ntampon\nkeibler\nsuliman\ndach\ngubin\nautolycus\nphantasia\nitoi\nzdenka\neelgrass\nganghwa\nlucjan\ndoniger\nquietness\nwfld\nfrieder\nwestword\ntsvetaeva\nmoire\nbroxton\nshalev\nwestwick\ndfi\ncocina\npersicaria\nmethylmercury\nleadenhall\nzeitoun\nerding\nbetham\nblurr\nstanbic\nkeratoconus\nmilbourne\nshouse\nhominems\nhotspurs\ntransliterating\nbiomimetic\nneutering\ntamwar\ncaliphates\ntalen\ncharny\ndingman\npreciado\nornamentals\ndrea\nbulman\neventhough\nminturn\ncabinetry\nsexier\nsebago\nbérard\nsirajuddin\
ntresco\nausa\noaxacan\nbju\npollet\nrosicrucians\nprimitivo\nnicko\nriet\ndayo\nnishijima\nparun\nklopp\nhawkish\nkotal\nskyview\nqadim\nwestwind\nantoniou\nleonine\nsavaged\nknoppix\nchartiers\nechevarria\nazienda\nqufu\nmeteo\naril\nzoon\nsantigold\ndispensationalism\nglassblowing\nmooi\ncouillard\nfotografia\nzapote\ndongchuan\ncallendar\nmaltsev\npandev\ncolonus\nawo\nbenbecula\nanshe\npilbeam\ntugging\nisabell\nhmos\nstromatolites\nenablement\nfugit\nlilah\nseoane\nverts\nglycosylated\ndelphin\nkilgallen\ndurazno\npariser\namyl\nindexer\nyuzawa\ncrystallised\nmenteng\nseductress\nmorientes\nvre\nfroid\ndynevor\nstukas\ncañizares\nvolokh\nslippy\nstephon\ntilsley\npohlmann\nruidoso\neyrie\nfirmino\ncristiane\nrestaged\nprimatology\npropitious\nbelmonts\noliviero\noberndorf\nstaughton\nrepercussion\nsoutar\nmanpads\nmunic\nameritrade\nsitt\nbryars\nshowplace\nparshuram\nneuraminidase\nclaughton\nmyelogenous\nwrack\nidd\nmixx\npoletti\nblather\nnimrud\nrahma\ndorotea\nsuperiores\npawpaw\nhistologically\nwoodworker\nsubluxation\nwyrley\nmontello\nturnham\ncorwen\ntartaric\ngriffis\nappanoose\ntranscriptome\napplejack\ntodhunter\npredisposing\ngál\nqilin\nkustom\nneeley\nhallstein\nimmunohistochemistry\ntaraf\nphotolithography\nmantises\ncalcitonin\ncastalia\npostmistress\nyeardley\nigniter\nmids\nmachito\ngridded\nmosfets\nressources\nportent\ndarah\ngroaning\nhatto\nfhs\nzukofsky\nfidget\nchapala\nshowbox\ngambir\nmakris\naccentuating\ndobzhansky\ndargan\nlonguet\nbronski\nozomatli\njingshan\nboggle\nketogenic\npricked\nalvarenga\nfredericktown\nhinn\nreframe\nstevedores\ngrittier\ndewolfe\nhdf\nplaymakers\nembryological\ncalshot\nrocio\nrypien\nubykh\ncarentan\nsatanas\nmonopolist\nwoodyard\nkelty\nthatching\nsharecropping\nnarborough\nrathlin\nkepa\ntakanohana\nshizuo\ntrebor\nforas\nmousehole\nreengineering\nflemyng\nameri\nleveque\nstreamy\ndisassembling\nwholehearted\ntaels\nwonk\ngunder\nesquires\nkanun\ncrayton\npallette\ngellman\ntarps\nselz\nsightseers\n
papon\ngreylag\nbikash\ndursun\nstaunchest\nvelas\nshoshoni\nmethodologically\neverette\nmedstar\nbraam\ntlv\nryrie\ntyus\nbathes\nphuc\nthorneycroft\nservia\nebers\nkadoorie\ncatrin\nwanderlei\nnavasota\nlacock\neugenicists\nkafkaesque\ngoronwy\ndier\ncappie\nnatron\nhermetically\nslovenly\nchenin\nriada\nputu\nrobbia\nlockie\ngleeful\nsavu\nkristoffersen\nshatila\ncounterfeiter\nrocklin\nmanokwari\niwami\nbdr\ndraycott\ncnv\nkiambu\ncchs\nhupp\ngleann\npyxis\nshoeing\nalcoves\nboogaard\nnamrata\nabseiling\ndesjardin\ndissing\ngyp\nboldfaced\naubergine\nkasei\nindexation\nwaterland\nvejjajiva\nsophos\nwharfe\nhuckle\nmansura\nboffin\nzawadzki\nvct\nrone\npolaroids\nimpiety\nbosie\nfratricide\nakaka\nprovocateurs\ngyges\nmarchmont\nswanky\ncomity\nspingold\nsawalha\nkallman\ndormice\nhesitancy\nbirstall\nlianas\ndialled\ngentner\nelysia\nnorml\nvaid\nsgu\nafonina\nferdy\nahrq\nshallowly\nyezhov\nkirstein\nbirman\ncannone\ncesarini\nshipmate\nsilents\nzipped\nthrale\naasen\nconsectetur\nshrikant\npule\nradiocommunication\nballyclare\nwaldner\ngebauer\nalexeyev\nbogardus\nquiksilver\npiscataqua\npopcap\npriapus\nbrummel\ngurnee\ninno\ncingular\nveris\nhotan\nmanca\nlevinsky\njigger\ncoville\nmaladministration\nhowlers\nnoëlle\nsickest\nadol\nprodan\nkawana\nmatri\nmedianews\nkinberg\nvlogger\nalpacas\nsftp\nshub\njadeite\nkilfenora\nprematurity\njeopardise\ndallapiccola\nbemoaning\naracaju\nmerten\naxminster\nalfre\nputh\nbazi\nmadhukar\nsaguenéens\ncleaveland\npapo\njoyeux\nkorbel\nburri\nkahului\nteich\nvmat\nwhiten\nschirach\ntulla\nescudos\napportion\nmarkovich\nzupan\nelco\nkerridge\nmanabe\nlisse\ndevadasi\nsibbald\nfowlers\nmorante\ncism\njittery\ndect\nkatar\ndumbass\nimplementers\nghafoor\naeneus\ntigo\nbosquet\npiscopo\nrowand\ncanice\nthiols\nbii\ndokdo\ndyno\nconejos\ndiseño\nogallala\npodunk\nkemalist\nallora\nchiffchaff\nnucci\npaspalum\ncottonseed\nañejo\ndrugstores\necologic\nvivacity\nblotter\nbnn\ninac\nyanga\najw\nzande\ngardez\nreginae\ntoogood\ngr
unberg\nfirecrest\npisses\naht\nsmurfit\nnally\ncircumvents\nsanquhar\nnailsea\nneuchatel\nexperiance\nbarmy\npansa\navers\ndelmonico\ntoontown\nklem\nwhi\ntoucans\nholycross\nope\nhoovers\nblagdon\nphonic\narhat\nmounier\nserrat\nmistyped\nnewsworld\niemma\njdl\nparchman\nsakala\nnudging\nhypnotizes\nthioredoxin\ngingivitis\npugnacious\ntrachoma\nducey\nschooley\nguber\nfaurot\nmercantilist\nimpatiently\nlongmen\nmeinen\njihadism\nleprechauns\nhuitfeldt\nmetalhead\nnijhuis\nparagliders\npatni\nsurkhet\nzugspitze\nyazawa\nprudish\ncorbeau\ngtm\nikeja\nrokeach\ncharman\nnarrandera\nmicromanagement\nmarsland\ndanon\ngreenspun\nheiser\nhangovers\nqualls\ncholuteca\nbhan\ncaff\nsoundbites\nrougier\nyildiz\nmegacity\nmyrddin\nparameswaran\nbullington\njinling\nfederman\nblackbaud\nbankier\nfmn\nmalebranche\nweinreich\nfantozzi\ngep\nponcelet\nvillaggio\nsaurav\nflotte\nblanshard\nrampages\nwillink\nplevna\nhijinks\ntroyens\nwipf\nwike\ntoumba\ngrunfeld\nsleman\nalyona\ncocoanut\ngrushko\necotec\nnela\nencarnacion\nvoile\npue\nimpale\noldenbourg\nroundtables\nblakelock\nmonrad\nboppard\nricardian\nmegane\nportpatrick\nodos\nchindits\nspeculatively\nseafoods\nclampdown\nrapt\ncroxley\nprange\nthd\ncorney\ncaistor\nmassé\nbirdwell\ngallienne\nmachaut\nantinomian\nengendering\nstandpoints\ntzar\ncrumbly\ndess\nsilliest\nshivdasani\ncystine\ndecrypts\nhangu\nsodbury\nlarrikin\njiangmen\nvocalise\nsharlene\ndekmeijere\ntippi\ndeadlocks\nmazurek\ngrigelis\ncourtright\nguevarra\nlydgate\namphipod\nfeuille\nsuperweapon\nbequeathing\nlibman\ntapps\nbuju\nbernet\nfascinate\nperodua\nsecchi\ngoncalves\nwidor\nmhuire\nblindsided\nmauriac\nchangle\nmakovsky\ncontemptuously\nsteadfastness\ntredwell\ninculcated\ncovens\nermey\nppu\ntarsiers\nreliablity\nreauthorized\nnuaimi\ngrangetown\ntransvestites\ncleanser\npinhas\nseow\ncrne\nmartinière\ncals\ndemyelinating\nkait\nbrimfield\ndunmow\ntinga\nunthinking\ncoblentz\nkutxa\ngreyer\nsoubrette\nbiv\ngalvanize\naudrina\nshoelaces\nicecap\ns
andel\nwondolowski\noberursel\nkhodro\nhixson\nwunsch\nmerrillville\nnanai\nincorporations\nwarplane\nunhygienic\ngwangyang\njohnathon\nphaseout\nbolter\nsynchronisms\nneurophysiological\ngodstone\nvendrell\norsa\nnegar\nstraube\nmummery\nhuie\npetrino\nhenryson\nnewborough\njuna\nhayk\nyussuf\nhazari\nogletree\navinguda\nindwelling\nkahta\niceni\nstevia\nsparx\nyushin\ntawfik\ndemetris\nmerryweather\nprohibitionists\ntrongsa\nkharbanda\nimposible\nxanthos\npadula\nhpe\ntiradentes\nconvivial\npustules\nclapped\ninformationweek\nanantha\nmcq\nappletalk\ntoxicologist\nwispy\nscurlock\nrevolutionists\ncreche\nskepta\ndorough\ncoelophysis\nsnowdrop\nclarkin\nstastny\ndrigo\nassignable\nagit\nmcdonogh\nindirection\nvitex\ndramaturge\nscrewdrivers\npomerantz\nautobus\ndinoflagellate\npaideia\nantipode\nconcision\npoetess\nruggeri\ncataño\nwykes\njpa\nsutcliff\nstebe\nyare\nmatata\nphaneuf\nrummage\nberard\njumanji\nfannish\nkirsti\nsenigallia\naiu\nintime\nindaba\nhodkinson\ngordillo\nmasroor\ngrump\nsdg\nandon\nunreformed\nhamida\ndurston\nboychoir\nprotuberances\nmicronutrient\nantipodean\nrioch\ntagliani\nwadewitz\nmanege\nbotanically\nfabritius\nkohner\nmetropark\nbijli\nredskin\nkonarak\neyeglass\nljuba\nmarquetry\ngertner\npicquet\nporticoes\nsnaith\nspiraea\nrybakov\nlicey\npedi\nnawi\nschroth\npresario\njudokas\nbelzoni\nsanctimonious\nchevrolets\nbelievability\ndisbarment\nmultiforme\nshillington\nluchon\nproclivities\njunky\ntomoaki\nwrs\ngardiners\ngymnosperm\nroundhouses\nkassir\nkateri\ncmmi\nbluntness\nbodger\norser\nchikatilo\nswapan\nvulliamy\nmarkis\nneurochemistry\nneapolitans\npackham\nirlam\nampicillin\nzanna\nmitter\nattitudinal\nmailroom\ntemporality\nserang\nkremen\nhorseheads\noptimising\ntailpipe\nlitas\nhazmi\nminiato\nlobachevsky\ngrapplers\nchiasson\nsoulwax\nnavaho\nsubzero\ndiplock\nsouthaven\nfridley\njessa\njazirah\ntopor\nchesil\nstarlifter\nschroeter\nbernsen\nsello\namalienborg\nbadea\nchogm\nfitters\nsmellie\nnewtonmore\ndisneysea\nwcca
\namlin\nclarets\nramkumar\nstokley\npadmore\nmeramec\ncairncross\nqaiser\nkonstantine\nreischauer\nevershed\nwarshaw\nsolh\nsadeh\nsnyders\nlamppost\nduds\nbednar\nmacrocephalus\nttd\nvictorine\nteel\ncombated\nepd\nriazuddin\ndarmon\ncarbonari\nremini\nkozin\nverhulst\nphalaropes\nbolla\nsilex\njaccard\nmamedov\nkoertzen\nbisection\nsimonet\nnsm\njantjies\ntomball\ndfo\nslb\ncranley\ngeldern\necx\nsparkplug\nromanies\nchailly\nkurien\nbertold\naberdour\nkeister\ndonahoe\nsalins\nalema\nanhydrase\ntroubador\ncryptosporidium\ntydings\nthaxton\nphilharmonique\nperioperative\ndaines\nboccioni\neskom\nkuepper\npgl\ndonell\nzade\nliverpudlian\nkyalami\nbulgari\nloners\nmenhaden\nlarosa\nautobiographic\napplescript\nzeppo\nchaffinch\nmoroney\nazin\nmacca\nhypergolic\ndanesh\nfatto\ncoenen\nboardinghouse\nbuttermere\nhaki\nblustery\nhochman\nxinhuanet\nmonégasque\nnicoletti\nganilau\ngasconade\nmtx\ngonzáles\noccitania\nmaudling\nanklam\ncoxsackie\nhadrosaurs\ntorching\ngaleotti\ngose\nissey\nprovencher\nmacavity\nnitrites\nleitz\nepee\nscrutineers\ncoursers\neichendorff\nolie\nkayu\narolsen\nsebastiani\nannand\nolduvai\nblanchflower\nunio\nbadgley\ndépart\nwides\npirkanmaa\nrabuka\nwoodlark\ncallicebus\nprotégée\nplastids\nhasebe\nhozier\nsherpur\nchevette\ntalese\nluhya\nquinney\ncolorfully\ncalzado\nlorcan\nankita\nperouse\nscurfield\nbrandished\ncumbrae\nnoriaki\nseno\nkangwon\ncasework\nwoi\nmswati\nsenju\ntranscribes\ndoubleclick\nimmuno\noiling\nkupe\nhooped\nvincenz\nbilleting\njanjaweed\nsidetrack\nnewsworthiness\nbadcock\nerlandsson\niws\nguentheri\nheathlands\ntox\nmountainsides\nresenting\nguff\nfumihiko\niah\nweidmann\ndurrow\nsublimity\ndengel\nfatalistic\nkelliher\nond\ntatmadaw\nfloodgate\nfrimpong\ncongee\nsamaranch\nsteinhauer\nchurchgoers\noutfest\ngarenne\nselflessly\nbranchville\nedme\nvictuallers\nfransen\nairhead\nbriatore\nshamefully\ncatelyn\nbargnani\nstridently\ndelahanty\ndevaughn\nshabelle\nelmbridge\nadamantine\nbuffum\ngotovina\nkratt\nmeili
\nwaylaid\nkyren\njamb\nmonfalcone\ndelap\ngrano\nabdulmutallab\ncowrie\nmediacityuk\nnomar\nklerksdorp\nmeazza\nlaxalt\nparamahamsa\ncelestin\ntasc\ndubarry\nromantique\nweighton\nrott\nintermissions\nmuharrem\nellingson\nodeh\nklik\nsighet\nbersani\nopoku\nctb\nljubo\nmessiahs\nsketchup\norczy\ncrankshafts\napolinario\nmélange\nkingsmeadow\nsanitorium\nkiwanuka\nalesana\nroten\nsabini\ndarting\nchitons\nmantels\nlaundress\nmccambridge\nwilhelmsen\nmyriads\nredefines\nsertraline\nbitzer\nohe\nhunks\nkozlovsky\ndeppe\ndouse\nbrammer\nsnakeskin\nquakertown\ntoric\ncendrillon\nvogels\ncementum\nmieczyslaw\nwithe\nmicroclimates\nbrekke\nfaxed\nhighflyer\nmajka\ncortège\noaa\natys\nimca\ntomales\nsteeping\nfemtosecond\ncroaking\nfola\nkihn\navonside\nundermanned\netats\nwyner\nhoyte\ndenk\nsalwar\ncarbonyls\nbirthrate\nkhloé\nclaymont\ngrotesques\nfawad\nlegon\ncustomisable\nmoland\nsmithwick\nmetabolizing\nkornfeld\niatrogenic\nwestworld\nfrenulum\nnoga\nallée\ncannondale\nburse\napoc\nmelito\nheaping\npappalardi\ndarell\nnaltrexone\nchaiya\nspindly\nsorbo\ngatting\nluau\nfoxsports\nhode\nixion\ngaller\njashari\nkapiolani\ndambulla\narditti\nsalafism\nhadow\nskamania\ntnm\nhiromu\ninspiral\npapuans\nplagiarize\nmorone\numara\ncapitole\ngagging\nwatercolorists\nlundi\npolystichum\nrelinquishment\ndemark\nbaglan\nstransky\nderidder\nragib\ncustomizations\nfrigerio\nswedenborgian\nbhc\ndyal\ntarbuck\nsinghal\nfanelli\ntitano\njuran\ncudmore\nfiniteness\nadventuress\nappert\nravioli\ngustatory\nbrinson\nmaini\nweisser\nngaio\nsmithing\narnaut\nbroadhead\npash\nsweelinck\ndongcheng\nmyshkin\nguerrieri\nrosenbach\nfenelon\npassa\ncovet\nmitsuharu\nciprofloxacin\nconvulsive\nberinger\nbohli\nkandiyohi\ntrackpad\nrauscher\nxincheng\ndewees\nrakha\nlinxia\nhench\nacrylamide\nharnick\nweinhold\nbalog\ncountdowns\nreshoot\nhellbent\nbagri\nsalsoul\nyaf\nhogshead\nbuzek\nbeavercreek\nchandlers\nmelisa\nallcock\nshaista\ndiamanti\nparvovirus\npolychromatic\nquirinale\nugalde\nthron
ged\ndawit\nknik\ncommodious\nwra\nsaleable\nyuanzhang\npalen\napax\ncutthroats\nwecker\nbiela\ndyckman\npattee\nolesya\nherry\nrahel\nanodyne\nnrel\ntooheys\ngrint\nmaré\nhuu\ncharanga\nmetamora\nreducer\ndimmock\ncorzo\nchelmer\nregressions\nmylan\nkhandan\nhundredweight\nthakin\npoppleton\nleumi\nespaillat\nmarihuana\njugan\nsubcritical\nmelanson\nkorematsu\ncolliculus\nforestland\nouteniqua\nagos\nxlib\nrathmann\nbainton\nwut\npennzoil\nfignon\ncorries\nresponce\nquoque\nspon\nmadiun\nsiuslaw\nplushenko\nkerin\nundignified\ndragoman\ngolde\nsavannahs\nrumanian\nmakuhari\núnico\nswampscott\nedicion\nblackmoor\narshavin\ngodhood\nviegas\ntaxol\nwhys\ncertainties\nblokhin\npompe\nmarbling\njosue\ntransfrontier\ndecelerating\nleptospermum\ntkm\nwhirlpools\nguerlain\ntrifonov\nmisconceived\nberryessa\ngusting\nspecialisations\nmisperception\ndiga\nhohenfels\nrausing\nimmacolata\nkrg\noprea\nabdelhamid\nfryatt\nbalks\nhighclere\nindifferently\nottinger\ndgm\noutpointed\nworkroom\nncm\nspinet\nrending\ndeactivates\nstayin\niordache\nexergy\nturina\nluskin\nfluorinated\nderrike\nkeven\nrüsselsheim\nflippin\nnother\nmarkéta\nmacroom\nbgb\nchini\nworkgroups\ncavafy\nhedon\ngentilhomme\npolyposis\nwiremu\nsevernaya\nsyringa\nshoveling\nrco\nembargoed\nstreptococci\njincheng\nizaguirre\ndamsels\ncellach\ntrueba\naviaries\nceiriog\nbenbrook\nashar\ndamato\nlaxey\nlubango\nimanol\nshako\nmonteagudo\narrigoni\nfootmen\nsupercups\nashkenaz\nwenhua\nemina\ntruesdell\nvhsl\nbristlecone\nmiddleport\ngwon\nsayo\nuselessly\nbleriot\nvtc\nruggedness\nalgie\ngivati\nrogerio\nteleplays\nshero\nhomebound\ncounterclaims\nballynahinch\nhewer\nreamer\nhumm\ndreads\nsödermalm\nlenahan\ndishing\nbamar\ncrossville\nhelu\ncommodo\nboch\nmeanness\nony\ntakehiko\nrohwer\nsuber\nslappy\njurgensen\npaleoanthropology\nholbert\ndhananjay\ngallura\njannings\nhematocrit\nlecca\nmazatlan\nrompuy\npaddick\nrastafarians\nperforate\nlrr\nvolcom\nshish\nluks\nlinthicum\nprydain\nlewa\nyaounde\ninfantas\nnu
pur\ngrunting\ntahr\npromotionally\nbizness\npoundstone\ntbo\ndoru\nbarbets\ngether\nfenugreek\nmusher\nusfa\nrenai\naddysg\ngoldmember\nremco\nvme\nscoffs\nweimer\ndownplays\nvegetatively\ndtr\ngrotowski\nducie\nbiederman\neugenides\natiq\napplebee\nreckons\nborgne\nbeaching\nsarkissian\nmangas\ngobbo\nyihan\ngreenshank\nmoder\nhorch\ntinton\nple\ncupids\nyochai\nhitchhiked\nhitchings\nlittleborough\njaen\npranking\nsturgeons\nmetreveli\nrmr\nkerins\nmadrox\nanimosities\nanthropometric\nschottenheimer\nkosma\nbudworth\nhande\ngatehouses\noxton\nhallucinate\nclicquot\nlenght\ntownson\npartum\nleftward\nwhitesell\ncarné\nhqs\nredoubled\noverdo\nmcilhenny\nwaterlow\nveale\narkle\nnudged\niram\naeterna\nheptathletes\ndaisaku\njmi\nazzi\nlsk\nkws\nsuperheavy\nshushan\namh\nmadryn\nstoryboarding\nsemler\nhiiumaa\nryley\nmadar\nhaseena\nbloggingheads\ncnl\nbalco\nfitt\nbross\ntoadstool\nghori\nmuting\nelcho\npimm\nbetacam\nmojtaba\ncottesmore\nsincerest\nmcsween\nouten\nmiscast\ngrytviken\ntoku\npoutine\njimeno\nkendell\ndesignee\nnorinco\ngrisha\nlobi\nstampe\nadze\ngenco\nsiddall\nbrindabella\ntrishna\ngafsa\nhjelm\nramires\nsharett\nmorrisey\nhalabja\nirks\nkhand\ngravenhurst\nflighty\ntofig\ntactless\nofir\nmackinlay\ngerrans\nhurray\nfreeh\nkena\npersonifying\nfaulds\nnedra\ntarbox\nmaybole\npavao\nknm\nholsten\nsidestepped\nlna\nhiba\ngrieb\nbylines\ntrobriand\nbunda\norkla\nsnore\npeikoff\npadden\nwookey\nputte\npostmodernity\nbergquist\neab\npecorino\nvowles\nytb\nfantoni\nlaurell\ntangerines\nmanica\narchaean\njanardhan\nintraoperative\nfoxconn\nbens\n。\npalden\nrazz\npugilist\nsmitherman\naimes\nscoff\ntutwiler\ncandomblé\ncornstarch\nepsrc\nkatamon\nsunstein\ntaharqa\nmakalu\nrothenstein\nparappa\nsambucus\nmatvey\njagex\ngrinberg\nkaleem\naltobelli\nclambake\neastcote\nreiman\npersaud\nkentridge\nmetroparks\ngattuso\nluzhkov\nmoho\nsjeng\nbattlezone\nbraybrook\nhornig\ntalab\nphotometer\ntyros\nyabba\nkenema\nhelluva\ngle\nwooding\nispahan\nvossloh\ndinning\nc
havanel\nnecromancers\negitto\nlyonel\ntexier\ncalamaro\nbaglioni\nwickedly\nsaviola\nbrickhouse\nnier\nchiemsee\nsiebe\nkelling\ncherche\nhabonim\nkleiman\ntelangiectasia\nsvi\nalpino\nglobalism\nreheat\nwinterstein\nhaggerston\nvrede\nreaktion\nfurtive\nculler\nicos\ndair\narkadi\nfatin\nbacktracked\nunshakable\nfunck\nchital\nredbrick\nziona\nsoufrière\nfrancophile\nbeauchemin\nrepublique\nthach\nnoetic\nuniacke\nthore\ngyurcsány\nasheton\ncalorific\nmariotti\nnaturists\ntorrealba\nshahnaz\nbuttrey\nkhou\ncrealy\nboatload\ninchoate\ncrossfield\nlenis\nbeloff\nvanstone\ndecimating\nbeefsteak\ncavalcante\njaggi\nweyr\nrhymesayers\nnankana\nstoddert\nsatun\nschikaneder\nbrownsea\ndesormeaux\ncarphone\nquercetin\nexacts\nlucker\nupt\nsubhead\nchandrayaan\ninforme\nmodellers\nstonecutters\nlistless\nequines\ndueled\nbarad\nbeter\ngynecomastia\nteodorescu\nmárta\nvusi\ncrac\ndiyar\nyeatman\ngile\nniggles\nspurns\nbrinda\nmoschata\nreu\naile\nptu\nrehearses\nkaganovich\ncyberstalking\npiao\ngtld\nbackstories\ncostain\nbullosa\ndicken\nengelstad\nsways\nstrahl\ncrossway\nefp\nvitt\nrecaptures\ndelimiting\ngunnel\ncupped\nzamin\nplexippus\neasthampton\npolarities\nmunni\nmichaelangelo\nstuxnet\nunperformed\ndungarpur\nkinnunen\ntorchbearer\ncorb\noauth\ndiagramming\ndefibrillators\nheynckes\nwwp\nfabbrica\nslaveholding\ntippet\nbaudelaires\nwersching\nroquebrune\ntraherne\nchinstrap\nbordoni\ncombretum\ncatherines\nfarook\nhutterites\neverwood\ncombust\nwidebody\nsavita\npoley\nkenwright\nhauntingly\ncéu\ngalligan\nfeebly\nmerrier\ngillray\nkramers\nariella\nkurban\nburnes\noligo\nanoa\nforet\ncinemark\nkorangi\nuffe\ntolleson\nmesi\naugmentative\nchafe\nvoiceprint\ndevaluing\nneuropathology\nelectrospray\ntakai\nsemin\ntrivialities\nunserviceable\nmeasurably\ninra\nbgt\nrizk\nbaria\nartichokes\nwaitt\ncrenna\nvertol\nirondale\ncharlebois\nintercooled\nexacerbation\ndrophead\ncpsl\namidon\nhamara\nbereza\njackpots\nvindictiveness\nitr\nsinless\nlandsbanki\ndeterminedly\nk
inneret\ntucks\nmaclin\nyasemin\nlipopolysaccharide\nlidice\ncanopic\noctogenarian\nahti\namerical\nharket\nterrorised\nprotestation\nvanillin\nproblematically\naksai\nskud\njamyang\npouget\npgn\nmarteau\natonality\nfolau\nmicroscale\nvitruvian\nokhrana\nsagres\ncadieux\njianghu\ncorridos\ncorrido\nvoeckler\npennie\ndeloria\nanvar\nfto\ncatalytically\nbluhm\nminks\nblackmer\nulyanov\nmnsu\ninstitutionalize\nhankook\nbrights\nwats\nbaskett\nsuperstores\nbulow\ncomey\ntanglin\nretool\ncaporetto\nsavvas\nkangerlussuaq\nkromer\npurdon\ncurtly\ngruyère\nsitek\ngoondiwindi\ntheodoor\nkaunitz\npmk\ndemographers\nbilitis\nlandin\ngingras\nmoxey\nhtlv\nseybold\nlazytown\nharline\nchevening\nturchin\nmalmberg\npjm\nhydropneumatic\nangiogenic\ndoornbos\nkrasne\ninterlinking\ndoko\nspectrophotometer\nandaluz\nxstrata\nshahs\nannalee\naak\nvilá\nhamstrung\nmonserrat\narlie\nziggo\nurubamba\nchiti\nionising\nemeraude\ncrosswise\nkrannert\njansons\nsojourners\nhugin\narthritic\nbutterscotch\ndiano\nassassinates\ncafiero\ntorpedoing\nhearthstone\nexuded\nuschi\nalaba\ncozzens\ntaimur\nlongboard\ncbg\nhogwood\nsilviculture\nhourani\nbacca\nkfir\nkanha\nmaudie\nmennen\npaveway\norosz\nmontvale\nregistro\npaiutes\ntavor\nstretchered\ngiesbert\ndingbat\ncopeman\nwideman\nchayim\nfilariasis\nlongline\ncedrus\nmatamata\ncohl\noverachievers\nvoidable\nkaili\nmiyama\nmicmac\numrg\neuskirchen\ncontiguity\nsubcultural\nfacemask\nlesar\nraconteurs\nhevesi\nwangchuk\nyili\ntoño\nstephano\nfinks\nsekine\nbangin\npayrolls\ndesenvolvimento\ndraeger\nuncg\nmuirfield\nzat\ngadjah\nhinchliffe\nquadros\nmellis\nberlinetta\ncrestline\ndannatt\ntrireme\nserenissima\nzeeuw\nllanera\nideologist\nmapusa\ncallings\ngaudino\nreznik\nreuel\nbroadleaved\nquast\npluggable\nsoiree\ndocter\ndarbyshire\npublick\ndisbelieving\nburghfield\nsnus\nshubenacadie\njodha\nbardolph\ntenda\nlongbottom\nsydnor\nbreakaways\ncrescenta\nnabbed\nanaglyph\ntsukada\neleazer\nesquina\nconfiance\ncrisscrossed\ntàpies\ndkv\nharr\nmu
rnane\narmero\ncryogenically\ncoteaux\nwrey\nnaaman\nmaim\nalemany\nfinnan\nnewbiggin\nkezia\npoto\nelser\nnachos\nyushu\nkulak\ngunness\nshapely\nfraile\nzollverein\nwolper\nabschied\nfreek\nchartism\nhussle\ncodebook\npetrone\nliberatore\nborinquen\ncarreon\ncollinge\nkaolinite\nzubkov\nimpregnation\nkaizen\nspams\nnester\ndemyelination\nrearranges\nlagaan\nswensen\nfaull\nmissourian\nkooyong\nrecommence\nclefs\nglaw\nahrc\nutamaro\nfinanza\nmonto\nbiphenyls\nseydou\nouvrier\nechidnas\nexpensively\ncranstoun\nfiero\nmatagalpa\nheadscarves\nabsorptive\ninam\nbakari\nismat\nmauk\namorphophallus\neae\nrosaries\nloosestrife\nboetticher\naumann\nloanee\nsoumya\nupdater\nmorier\nnyenrode\nvatan\nsophronia\nhutchens\nsulfonamide\nmatteucci\ngcw\nundocked\nhuaorani\nsensorineural\nunderexposed\nnegaunee\ntmn\nrsu\ncuhk\nbrunello\nsuperjet\ncontrite\nbaggot\nfoxhall\nsafdie\nbratunac\naidid\nberanek\nmercuric\nulis\nbraving\nforsaking\nlatifi\nzacarías\njianwen\nclickbank\ncaws\nldk\ncaryatids\nxdrive\njanak\nmde\nyarrawonga\nmcstay\nzacharie\nglancy\nllong\nfeuilles\nmontalcino\nillegals\narmitstead\nisraels\ndefoliation\nppk\nunai\naceves\nsupertram\nitzá\nlavelli\nwarham\nlpb\nvarlamov\nimpério\nciani\nkringle\nthiry\nruncie\neightball\nmichna\nvlbi\ndahm\ntaxonomical\ndonnelley\nrebuking\nreddin\nhaddadi\naustins\nhmd\nvipin\nfantasizing\npallavicino\nacheampong\nrivularis\nmineralisation\nstieg\nrelevence\nscallion\nloges\niheu\nhaloperidol\nedmiston\nstevenston\nsuss\nwestfalenstadion\njomon\narlberg\nmumma\ncomsat\nbelorussia\nyingying\nthrawn\nchablis\nnovelistic\nlyari\nlüderitz\nflounders\ntransavia\nvideojug\nprattville\njunon\nsomogyi\nelah\nyacine\nsoju\ncrasher\nlono\ncarrs\nbrena\nprell\nroseberry\ndico\nolivero\nugyen\nmahboob\nbourdin\nluard\nvacationed\nautzen\nwrongness\nfredriksen\nromagnoli\nitakura\nmaynor\nlungo\nluminal\nfreakazoid\nleete\nfinniss\ndoku\npreclearance\nassayed\nunconverted\ngroundsel\nlur\ncholerae\nhungarica\nfitment\ninec\nsteevens
\nanaphylactic\nvay\neckardt\nsallah\ncentavo\nayios\nschnorr\nevenson\nashtanga\ndymock\nretest\nbrancaccio\nguapo\nsemtex\nmvno\ntianshui\npuleston\nintangibles\ngraveolens\nkito\ncachaça\npalika\nvivanco\nneander\ntreadaway\nouvrière\nschull\nlegum\nexpediting\nvarians\ndébat\nmaitri\nadeyemi\nbielecki\nliat\ngerri\nolander\ntecnica\nheger\napta\nkumu\nobviated\nsoirées\nbroomhall\nloayza\nukranian\nunfixable\npapanikolaou\nelkridge\nbabbit\nelectrophoretic\nsuphan\nhenoch\ntreorchy\nromanza\nknobloch\nroché\nbrownhills\nhorsforth\nkaroly\norakzai\nluol\njme\nnorville\ngranola\ngaviota\nknechtel\narbi\nmakowski\nwillingboro\nfrancistown\nifsc\nwhatsonstage\nbeltane\nbilevel\ncopia\ncoliform\nheimann\nactra\nishan\ncleverest\nhurtling\ndrewes\nnegrón\nshahnawaz\nastoundingly\nstaph\nlegalising\nlegitimated\nabdurahman\nunarmoured\nczars\ncatus\nlukyanenko\nmmk\ngermersheim\ninterrupter\nsangu\npipkin\nantenor\nbintan\nchynoweth\nnoobs\nboulting\nncat\nhilden\npartida\ncrailsheim\nlunel\nedan\nmosi\nalpinism\npolyomavirus\nshr\nsatra\npsdb\nnonconsecutive\nhawkesworth\npallot\ntancharoen\nmaula\nfwb\ngunnedah\nbaraat\nmankell\nupholder\nmenstruating\nreasearch\nexuma\ncreu\nveras\nhurum\nngb\nportholes\noropeza\numr\nhebraic\nhems\ngermanos\ncontemporaneo\nwiggling\ncontos\nravenglass\nbellotto\nbrooklin\ngruda\nhammerschmidt\nhagger\nchavasse\nraggi\nfalters\nblatche\ncantillon\nhuevos\nleser\nhabiba\ngrisaille\nvillalpando\ncofidis\npeopling\nladywood\nwhet\ncaddick\nstudiously\nunimog\nlemmens\nsurfed\ntisdall\nstrassman\nbeefy\nfalconbridge\nknauf\ndalgliesh\nblocher\nmarham\nmossop\nhagiographical\nabdala\nkrauser\nhoneys\nblairstown\nluangwa\nliming\nbreitling\nbromance\npegmatite\nodorant\nthur\nnikulin\nbogosian\nenergyaustralia\njudenrat\ntvoz\ndyfi\ndiante\ncarcinoid\necosse\nilmor\nguice\nhursley\nexcising\nunpacked\nbln\ncristiani\nbritains\ntofte\nupcountry\navezzano\nbhullar\nkazys\ngrigorenko\npushrods\ncinch\nrockhounds\nfernhill\nclamour\nobstructi
onist\nsteagall\nwaster\nfactcheck\nkimon\nsensationally\ndickensian\nraduga\nbantustan\nkames\nlenzie\nminehunter\nreca\nunderdown\nkoninck\nruhe\nmedard\nwmi\nselahattin\nsuccour\nzipcode\nneutralist\nabdin\ndewberry\nredrafted\nfoulois\nzucchi\nlyrebird\nvalerii\nslauson\nfedorova\ngroundcover\nwenshan\nrainville\njanowitz\nmunz\ncontortions\nmutuality\nprotos\ngreengrocer\nhurra\nwarhorse\nberline\nsheaffer\nedenbridge\nfernandinho\nlehn\namphlett\nergin\nballerup\ncbh\nprogeria\nbayrou\ntamerlan\nwtbs\nunenviable\nmanam\njacquot\nmactaggart\nangol\nfreediving\nshapeless\nattia\nmoscato\nlhote\nwestbank\nbashan\nghada\nanel\ncodling\nzsuzsanna\ntarak\ncech\nguthlac\nyounus\nperrey\nchaparro\nwhyy\nhoje\nliff\nheadlock\nfdg\nlofting\nmous\nporterhouse\njoely\nbrosque\nsutan\nsnowbirds\nheeren\ntereshkova\nmarshaling\nhudgins\neverbank\nundiluted\nsarra\nradhi\nyaxley\nneumayer\nandamans\nyevgeniya\ngervasoni\nlederberg\nstrathairn\nchakrapani\ncauchon\ndagg\nwomanly\nflórez\ngaudreau\ninnovia\npilecki\nmullis\ngatien\naubagne\nduncanson\ncraver\noesterreich\nnureddin\ndejavu\nswooped\nsupershow\nautozone\nlengthens\npayan\nnrr\nparamotor\napperson\nkleinfeld\nsiskins\nudas\nhematologic\npdn\ndishwashing\nstaniforth\ncaminha\nstockyard\ncyphers\nmasvidal\ntheus\nthm\nleschi\nwispa\nbugbear\nherber\ndolomitic\nremarries\ncymbidium\npowergen\nmovember\nucg\nsylver\nwakil\nhayhurst\ntetras\nmudgal\natlantas\ncontepomi\nwend\nfeversham\nthure\nplainsman\nroenick\ninterethnic\nlubomir\nmagtf\nsoffer\ngamgee\nmarklund\nyounge\ngowland\nmingyi\ninclán\nhassidic\nsalivation\nreichelt\nharlon\nbullis\nisv\npovera\nrajabhat\npredella\nwemba\ntrev\ngrantchester\nphotoshopping\nslants\nllave\ngoofs\nlevitated\nconformists\ndiophantus\nbarsuk\ntrews\naicpa\nisinglass\narland\narchitectura\ndexterous\npecker\nqueensbridge\nmahlangu\noadby\nbenedito\npsychophysiology\nentices\ndandelions\neus\ncoudersport\nsuginami\ndouwe\nhagberg\npiatigorsky\ndetweiler\nhostiles\nology\npandur
ang\nimperishable\nhinks\nryuki\nbarremian\nmru\nemulsifier\nensa\nleatherstocking\ndewulf\nlasith\nfyodorovna\nasifa\nramlee\nmellons\ndocumentations\nplatting\npointillist\nnifc\nsilivri\nbennis\ncroly\nnugroho\nguglielmi\ncasati\nmpe\nalecto\nnymphomaniac\nnasties\ncarny\ntarar\nshilts\nzillertal\nbankrolled\ndesean\nflorestan\nzongo\namoebic\ninfringer\nvasser\nhagopian\ntomah\nlipe\ncatamount\nmewes\npout\ntervuren\nchowdhary\nmyocarditis\nhungover\nmauchly\njadranka\nmediatised\nportslade\ngonad\nwestby\nnoiret\nminear\nkinesthetic\nsaraya\ntuks\nabstains\nsmulders\nqandil\nwinsted\nshepshed\nravensbourne\nrorem\nstrathaven\ndenaby\nmacgibbon\nicpc\ndrachmas\nlandesman\noverpaid\nredcoat\nbutyric\nputten\nparata\ncrans\nsparebank\nhandguard\nvowell\nbhama\nmillinocket\nwuerttemberg\nhinoki\ngutfeld\nbierut\ntenuously\nnenana\nbrevoort\ntippmann\nlindman\nsaltz\nhref\ncypripedium\nfmo\ninclusionary\ntuggerah\nsukha\nmcginniss\nmytishchi\nbatton\ntesol\nlightships\nlemp\nvalkenburgh\nprayerbook\nbangalter\nneate\nrdt\nquotidian\nhashana\nkidwell\nvetoing\npaille\nkamber\nencina\nusvi\ngrenz\nconcannon\ngumbs\nbatan\nbakhtiyar\npietrangeli\nantis\nabedi\nselfishly\nhonesdale\npank\nscherr\nbertalan\nnamsan\nstricklin\nlazaridis\nspilsbury\nundocking\ncadenzas\ncalistoga\ncenote\nerms\nunrecognisable\njurica\ntokoroa\narisaka\nmaroto\nwarded\nquee\nkristan\nsingel\nvrenna\nsinaia\nparcell\nquizzed\nwails\nwaltheof\npentito\nridgeley\nfeig\nlaurus\neku\nizbica\nlainie\npascoal\nbreydel\nryans\nproba\narchdruid\ncaptaincies\nspahr\niaquinta\npolygamists\nastudillo\ncribbins\nmacguffin\nsolá\nderk\njuwan\nstreambed\nsaimaa\ngerdes\nlleras\nsando\ntinkerbell\ncaressing\nrenne\narnt\nkastrup\nesad\nbaganda\nvaka\ncajón\nsuperbird\nendovascular\nautomobil\ncastagna\nnityananda\nvinicio\nanthropomorphized\nentergy\nmajic\nblanchardstown\ninstallable\ncynthiana\nuntergang\nkarroubi\nthomasine\nexpropriate\ngeagea\nstepanek\nbellone\nconsequentialism\nghrelin\nhexum\nlucch
ini\ntowton\nucu\nreginaldo\nresound\ncanonizations\nmarye\ngrounder\ngeoffrion\nbousfield\ndunigan\nghiberti\nunquestioning\npunya\ngedda\nostermann\nwitz\nordinaire\nviau\ngigue\ntomino\ntropicale\nselander\nboeng\nkosmas\nkattan\nstraggling\nikue\noutpolling\nsinon\nxve\nbouma\ntifa\nbambu\nliuyang\nolymp\nglasper\nbys\nspruill\nserry\ncarriere\ncya\nacda\nipkf\nbanzer\nevangelina\nmeed\nspellbinding\nkookaburras\nmikita\nstanbury\nmeimei\nolivas\ncolucci\nvolubilis\nswin\nnarcissa\nskeen\nstomatal\nanansie\nwgsn\ntypus\nandréa\ntreadle\nkwanzaa\nbettors\nmdu\nadelia\ngogi\nochsner\nsvan\nkazue\nsouthee\nmenna\ngrenelle\nwoodblocks\nmusky\ncaligiuri\nrvt\nassante\nbumstead\nkhadka\natla\nsyngenta\nzviad\nkompany\nmckern\nlipietz\nican\nncua\nallgaier\nbuts\nseocho\npred\nhoofed\nsurekha\nfota\nsleipnir\nsiff\nfestspielhaus\nrailfans\namanah\ncasements\narabized\narnab\nkolk\nmukund\nmegastore\nperkasa\nrailyard\nbocuse\nbroadland\nhaddo\nbündchen\npedants\nbundesverdienstkreuz\nchinoise\ngassner\nnems\nusap\nwinchendon\ntroubleshooter\nsanjo\nanaesthetists\nlomba\nhuntersville\nmckelvie\nsureties\ninsideout\nwoori\nallayed\nvrml\nmopar\nmilarepa\nrafinha\ntoxics\nthelen\nknockoff\ndùn\nnibelung\nterrify\ngomm\nemendations\nblackstreet\nbeca\nlynas\nmezcal\nkiplagat\ndorsch\nardzinba\nsimbirsk\nvinyls\npretreatment\nwitchfinder\nsmallness\nenis\ntshabalala\nmctiernan\nroundworm\nlikley\nboudet\nevolver\nlanesborough\nnotate\nflirtations\nfujin\nevetts\ngwenda\nsubroto\ngoldthorpe\nidyllwild\nfantasticks\nbadoer\njouko\nbuti\njayenge\nmethyltransferases\ndotterel\nmahamat\nrenmark\npsittacosaurus\nwansbeck\ncossa\nhypertransport\nwingmen\nkalm\nlundby\ngaer\nreclassify\nphillipson\nhallahan\ninvigorate\nbinn\nintersting\ndeltic\nkingswear\ncasali\ninterrelations\nsalmonid\nluker\nkrier\narncliffe\nbarnfield\nphthalates\nzhijun\nkds\nsaccharin\ncodebreaker\nkarlis\naberfoyle\nrhayader\nbouterse\nprecognitive\npurchas\ndispositive\ntimeframes\nlowville\npolyamorous\n
rapidshare\nkorkut\nbonnin\nherbalists\nsquirting\ntlp\ncobleskill\nlimericks\nmihaly\ninfirmities\nribisi\nswiped\nreut\nganong\ncalvaire\ncystitis\nsarney\ncrr\nheworth\nwestberg\nphaistos\nadjani\ncryogenics\nnamak\njaquet\nrajmahal\nyanagawa\nnollaig\nsirimavo\nswerving\ntaplow\ncurva\ngallops\nmoscheles\nalmer\nexc\nhexameters\nsemiannual\nhubcaps\nyasuharu\nfinalising\nyudkin\ndentsu\npolythene\ntartini\nbernau\nsouthtown\ndewalt\nseim\nmasset\nlsf\npaga\nexpiation\nmonopoli\nopennet\nkirker\nbehera\nsbm\nmanel\nclod\nsongster\nsonido\nobinna\ncooldown\nfiberboard\npomaks\nsignifigant\nnègre\nwaiau\nfarnon\nmofo\ngwladys\nphilosophe\npostiga\nmurga\nfasb\nfarmar\npanjim\nfirebombed\nleddy\nsukhwinder\neichberg\nzhaoqing\namputate\nfishpond\nahed\nsecretaryship\nteikoku\nantithrombin\nuechi\npintos\nkotze\ndoubtlessly\nastrophotography\nkolombangara\ndanh\ncongonhas\nhanning\npezzi\nhejazi\nkilohertz\nreligous\npereiro\ncaille\nbrislington\nslumps\ncebus\nomnibuses\ndislocating\ntristesse\npatthar\nfirbank\nvirgili\nmonie\nccdi\nolancho\nrecategorized\ninternee\nrison\nvorst\nsorg\ngrottos\ngermanwings\nvirginio\navaricious\nabutilon\nalgemeen\nlamest\nscotsmen\nihh\nwenz\nruttan\nmontecarlo\npanizza\nmaco\ncayton\nnudie\ntoronado\nhyperparathyroidism\navinor\natypically\nphytochemicals\nanamur\newok\ncarpaccio\nrollerball\nbranwell\ngezira\nfurqan\nclemmensen\nkingside\numw\nreiffel\nmakos\nscattershot\nfetisov\noesophageal\nsardy\ngunk\nsmirk\nreputational\nbisht\nmarkin\nsequenza\ncashews\ngambusia\nfairlawn\ngorny\nsteeg\nupd\nmarilou\nemami\ndhm\nbipedalism\nires\nlötschberg\npsychos\npemphigus\nlithe\nioof\ndanzan\nyashoda\nsubsumes\narsalan\nsixfold\nlarijani\nvernalis\nrotatable\nbanknorth\nsoroti\ntroisi\nsrx\navt\nshorenstein\nsarvodaya\ndevey\nbolcom\nzsuzsa\nshaurya\npaulton\ncandour\nreichenberg\nquicklime\ndiesen\njailers\ncocu\nunorganised\nwilliamtown\nryusuke\nqadian\npercenters\nbrookshire\nmoet\nviken\nborane\nslagle\nunuseful\ncunego\nsautée
d\ntlds\nzani\nbalderas\nmixta\ngauley\nbeu\nsalaryman\njasin\ntrona\naliona\nmissive\ndfd\nalcindor\nsanko\ntole\nsiragusa\ninamdar\nmacchia\nwernick\nsubwoofers\nmagnox\nfras\njsw\nreinstein\narchtop\nsuperimposing\nkraton\nbleibtreu\npeñalosa\ncleverley\nprideful\naberffraw\njuvenilia\netb\ncriminalised\ntemenos\nlieberson\npravo\nchapron\ntaenia\nsarhad\nwog\nmicrofossils\nprocures\nendorphin\nkaus\nlatinization\nhamama\ngorgeously\ncrossen\nacmi\nwardrobes\nsamphire\nplateosaurus\nmargai\ngorenje\ndabba\nteterboro\nbicep\ndrouet\nunimodal\nswick\nbickham\nkipps\nconsistantly\nstilo\neniro\ntene\nostrow\nmaspeth\nleydon\nciardi\ndebunks\nfemen\nbarbastro\niguazu\nmishkin\nbultmann\nreconditioning\nevangelium\nimperioli\ndastan\nkallas\nschlöndorff\ngroupers\nwesthoff\npunakha\nsavina\nveja\nsemion\noxidising\nblacken\ninklings\negging\nbabo\nsozopol\nqashqai\ntresses\nawas\nagin\nkaranja\nlenthall\ninhambane\nsabmiller\nrpd\ncherrypicking\ndefibrillation\nbarakzai\nwerrington\nstarland\nmondavi\nsukhbir\ncentralise\nsnugly\nrecordkeeping\ntomczak\ncedilla\nmonstrance\nidrees\nloreen\nboalt\nbobak\nbadran\npasion\nevison\ncangzhou\nluling\nvohra\njofre\naggressions\nmantelpiece\nmccrimmon\nmahagonny\nwuzhou\nwagh\nbigham\npublishable\nbonynge\nklec\noligodendrocytes\nhadash\ndharani\nengadin\ndarkman\ndoxford\nhottentots\npátzcuaro\nlattimer\nmarner\namarjit\neinion\nleptotyphlops\ncoeditor\nuja\npitbulls\ntocopherol\nhorrifically\ngamper\nredoute\nstalemated\ntopshop\nbanno\nmehrtens\nbaddies\nkirino\nanusha\npétanque\ntentacled\ngoolwa\nearles\nwithing\nlecouvreur\npartitas\nbtl\narcachon\njeffress\nfnm\nburswood\ndornbirn\nshiprock\ntilzer\nmegachurches\naxholme\nbrochet\ncognoscenti\nattiyah\nquinoline\nsharansky\nperec\njabo\nnoblewomen\nwindjammer\nthetan\nsuperimpose\nquelea\noberammergau\nreanimate\nglobalizing\nmhr\ntussaud\ncotai\nezeiza\neustachio\nkuai\netsy\nböttcher\nsusans\nventress\nschrock\nhylan\ncrisper\nglyfada\nhuub\nshazia\namott\ndebnath\nc
hairmanships\nuros\ndhana\nvereinigte\nnewsies\njutes\ngreenlawn\nanandan\naphrodisias\ndollies\npoku\ndebré\nslee\nmdpi\nhaulers\ndimittis\ndragstrip\njô\narlesey\nmüritz\nnagaraja\nmanulife\nregionalisation\nleftism\nnasscom\nmolucca\nbastiat\ncerone\nconcordes\nrdb\nondina\nktn\nwellwood\nauthorises\ntaillight\nsandin\nbalcells\nbaildon\nhapa\ndaire\ntavera\nrefugia\nfoma\ntarzi\nshoudl\nbarthelme\ndayuan\nruperto\nelea\nposy\nnazarabad\nirin\ncyanosis\nvetiver\nvivants\nemich\ndownslope\nminurso\nblandy\npatteson\nclimaxing\nbouman\ndónde\nladera\nakdeniz\npreform\ngielen\nlunacharsky\nvrana\nzaloga\nlactea\nmirman\ntraiana\nbevans\nacclimatization\nomv\nmarcelin\nemei\notterlo\ndromaeosaurids\ncardone\nslippin\nwunna\nfreres\ncarlino\nhuayna\nbrau\nazman\nghi\nngwenya\ncowens\npolyatomic\ntimofeev\nmetalled\nsurer\ncollectivisation\nikebana\nmarilla\nnaegele\nlignum\nfrensham\nviscose\ncratons\nhandclapping\ntyee\nterbium\nascham\nelopes\nfitzjohn\nburien\nrushd\ngoffstown\nnightside\ngraeber\nfederalization\npels\nprovenza\ninterposition\nmahonia\nextralegal\nindiantown\nlagat\nbeanz\nzisis\ncept\nmorticia\nhazelhurst\nbabli\nupplands\nduveen\nlamarckian\namandine\nkanouté\nattentively\nwayamba\nnonsteroidal\npims\nmuschamp\nmcgonagle\nnetcom\nfurthur\nkse\ncresap\nwoyzeck\nsolli\nlhamo\nivybridge\nfernley\nwagenknecht\nshusaku\nnorrell\nshoud\nrono\nswindling\nsiegbert\ntauscher\numbral\nsassnitz\ntufte\nchristofer\npumphouse\noverreact\nlaboratorio\naksyonov\npuyol\nkagel\nlimburger\nenergised\njeering\npostponements\nzephyrhills\nsubhumans\npuissance\npilch\ncroyle\ninconsolable\nthistlethwaite\nchrysostomos\neruv\nkezar\novercharging\nmarigolds\nstfu\nabergele\nmiamisburg\nriversharks\nmagaddino\ngoldener\nfoodie\nblackshirt\nhershberger\nterrasse\nragni\ncottonmouth\nsessler\nswifty\npropst\nplucks\nabaca\nsiar\nqishan\nfrie\ndeanship\namanpour\nparkinsonism\nrandolf\ntorquil\ncowed\nmaika\ngypsys\nalinsky\nhauke\nunicom\nsahak\nknabe\ninsubordinate\nanth
ro\nfulmars\nchesty\nmetrocard\nenclos\nhirschsprung\ndigna\ndanehill\ndibaba\nubm\nbonapartist\ndilfer\ndinard\ntortorella\ncarwin\nrifampicin\npieterszoon\nkyme\nconches\ndoggerel\nexternals\norlin\nwolbachia\nponi\nmabe\ndamai\nzem\nspeciosus\nhyperlocal\ncounterpoints\nsovereigntist\nsynodical\npoldark\nhidayatullah\nuce\nlomb\nshopaholic\nmansingh\nbatfish\naftereffects\ncandra\nbucked\naranguren\nbeltz\nsfv\nregev\noverrules\nhipódromo\ncolmenares\nvpd\nballerini\nvaccinia\nwanita\nberlinski\npulpwood\nreconnoitered\nazara\niiss\nmanhunters\nweyler\nopsin\nhape\nparzival\norli\nmedang\ncloudiness\nfeeny\nplu\nlehning\nmoranis\ncueing\ngadchiroli\nsammut\ncolavito\nkhmers\nkishanganj\nkenjiro\nphotoresist\ngalliard\ntnp\noxidizers\nkangan\ndemps\ncockfield\nsuhas\noutspent\nperceivable\nboyington\nbarzan\nsavernake\nduman\nlasch\ncallies\nerris\nraskolnikov\ntamborine\nholdstock\ndjarum\ntelemedia\nsaron\nslayings\niese\nlydie\nparaparaumu\ngunhild\ngouna\nllandrindod\njefferys\nbarin\nardnamurchan\ndoit\ngazzara\ntbe\nvélo\nbarberry\nrebalancing\njassem\nhti\nworshiper\nyoshihara\nnarda\ngiz\nlucera\nbedwyn\nleinsdorf\nheimdal\nseismologists\ndonoughmore\nlebon\nhinduja\nsciascia\nconfluences\nbht\nbuitrago\nambling\ncastaic\nrecalculated\ntowada\nmimis\nschoorel\nliberata\ngatty\nadjei\nconstantina\ncammie\nfeisal\ndestabilising\npayola\naramid\nbeman\nballclub\nparmigianino\ngrimwood\nloredan\nicca\narida\ndold\nmastication\nhumped\ntotemic\nrearview\nrüdesheim\ndelio\nzoque\nlaminitis\nshawm\nmajoritarian\nbackwell\npapelbon\njacklin\nheythrop\nsomchai\nrembert\njobseekers\nfilkins\nastronomic\nmetonym\ngobelins\nwallner\ndurrington\nbocking\ndaulatabad\ntamaqua\nlrb\nranjana\nwakeboard\ndefenestration\nsparling\nburgled\noseltamivir\ngodliness\ncirri\nmontagny\nmarvelously\nteese\nmanea\nbroudie\ngringos\nchila\nprattle\narriola\nevm\nsepulcher\nbighead\noreilles\ngfi\nmtz\nbeefing\nanzus\ndenisa\nsalmonids\ngrantmaking\nojukwu\nalliston\nstabenow\nmaharli
ka\ndiep\nposies\nduele\nholing\nretarding\nadey\nwinmau\npinault\nunmanaged\nabdicating\nbombards\nruggieri\nbtob\nroddam\nnpm\npiscine\ngonne\nsherbini\nunsentimental\nplimer\nsilicosis\nwoodcocks\nkarapetyan\nchango\nkoeln\nprimark\ntaylorville\njabr\nehara\ntorossian\nrfr\ndack\nbrewin\nneckerchief\nbackes\nyucaipa\nmedjugorje\nwrentham\nzeelandia\numut\nfiras\nparmer\nnathu\nsny\nanoxia\nworkbooks\nflues\nhamata\ndiscriminative\nmerauke\ndracaena\nmomoh\nchagres\nunicoi\ntampons\nsovetskaya\nratty\nirresponsibly\nconfederal\nlitz\nhoose\nguardi\nhamby\nlatifa\nthorning\nmonne\ngigawatt\nwwd\nnachum\nbanastre\nrdo\nbassée\nbuckfast\npegi\ndcg\nbactericidal\nmoloko\nkatydid\nquarterfinalists\nteste\neurico\nguillet\nmrqe\nvlachos\nmaxed\nintraparty\npastis\nsmuggles\nhaak\nbreccias\nderwentwater\nfaddis\nbeakers\nechinacea\nhmis\nipsen\nballoonists\nbellerose\nlopata\narter\npadel\nlitherland\nelmhirst\ncreve\nmgf\nresidencia\nchristison\nguntersville\nyurii\nscribbling\natw\nmacaronesia\nwarnecke\nborton\nbeguine\ncarline\nankylosaurus\norana\nplatanthera\ntullamarine\nnfcr\nsequestering\nsaltville\npivo\nwts\nlostpedia\npurefoy\nlargesse\ngracenote\nallbritton\nseneviratne\nsubsists\nbelfour\nfreundlich\nbanchory\nforgivable\nfse\nkeyworth\ngaikwad\ntaxco\naleko\ndiaphragms\nmaderno\ndoob\nmiddendorf\ncontrive\ndialogical\nmethacrylate\nleask\narmless\nhoodies\nmucilage\npelias\ndibs\nbaciu\ngerty\nvigeland\ntarhan\nnullifies\nhotjobs\nicsid\nmoutinho\nconcessionaires\nsafarov\nlegroom\nsnowbound\nvaslav\nsvv\nbioethanol\nheyl\niztok\nnolo\neschborn\nwebtv\nfromelles\ndalgleish\nmulrooney\nwacs\nkilifi\nceltica\naberaman\nshabak\nchesters\ntungurahua\ndungiven\nbrandl\nsinning\njarrold\nkrasnov\npontardawe\ngradings\nrands\nsmithton\ncarnera\nkabeer\ncipollini\nputtnam\nsidharth\nsemmering\ncorradini\nmercantil\ndribbled\nbreithaupt\nsetton\nkalita\nembalse\npierogi\nfabray\nkinglake\nsampedro\nfireproofing\nmentos\nkochan\nmussina\ncissie\niconoclasts\nolmi\ng
énéreux\nantonovich\narchundia\noversteer\nanpp\nwainscot\nheadrest\nmangles\nwarnke\npelin\ntawdry\ngimson\npinetree\ndaxing\nshoemaking\nwaists\nbosun\ndeverell\nsuperimposition\nefts\nkinane\neloping\nsalu\ntobacconist\nmittel\nmaranello\ndestructible\nflorine\ninq\naylett\nmccallion\nlidiya\ngaskins\ngundry\npudgy\nminimises\nembury\nbredesen\navron\ndesta\nisandlwana\ncowlishaw\ncoreopsis\nlittlewoods\nbettino\nkagera\nwiddowson\nramot\nhumanize\nspironolactone\nampatuan\noverend\npenname\nrainsy\nsolovetsky\nmenge\nfreiman\nemotionality\ndeprogramming\ngarro\ndrumlin\nglobetrotter\nabbiss\naerostar\nwspa\nxingtai\nmaghull\nenroute\nlefties\ngrobler\ntechnopolis\nsuquamish\ncalumny\nfazli\npennypacker\nsteckel\nbeobachter\nahm\nmatrox\nvronsky\ncreaking\ncaston\nizet\nfoxholes\nbotterill\notic\nknauer\nwanchai\nsigfrid\ntiantian\npescosolido\nviveka\nkilliney\nquim\ncilmi\nhannum\nopinión\nmedios\nlocalizations\nballasts\nmussa\nglos\nredecoration\nduer\nvcp\ntotteridge\nvuillard\nsuperfinal\nfrontrunners\natanasoff\nprophesying\nsaidabad\nshyamal\ntemporomandibular\nscaler\nyamashina\nsuggestibility\nakh\noptoelectronic\nhomophobe\ntongans\nilliberal\nrotisserie\nimari\nvitelli\nakinori\npaci\npelaez\ncristin\nmadeiran\npepperoni\nphunk\ndemaryius\nworkfare\nconcentrators\nsneakily\nyogendra\nthroughly\nprolongs\nballyhoo\nmccolgan\nuptick\npowderham\nwylam\nmixte\nkhaleel\nsciarra\nsafavi\ngoodstein\npsoriatic\nshein\nconurbations\nkrys\ntoughen\nosteoclasts\nkamlesh\nbigard\nhyssop\nertms\nmoustakas\ntoenail\ntenbury\nnessun\nohchr\nbrison\nlegere\nalpaugh\ngarey\ntabulate\nhandshakes\nimagist\nvolstead\nviren\nthursby\nbeathard\nmalanga\nciaa\ntensed\ngoehr\nivermectin\npni\nrecline\niurie\nrostow\ntreetop\nmontargis\nmerak\ndonelly\npernfors\ngherardi\nbuso\nmanat\narterials\niaas\ncherrie\nminelli\nsaski\nscammed\nesna\ndeyo\nbackhaul\neaglesham\nmcelwain\nzimbardo\nfitzrovia\nadjusters\njolfa\ntomographic\ntrofimov\nofferor\nlamplighter\nthibodeaux\nfawa
z\nzarin\nsahai\ncornflower\nstruble\nnuk\nundoubtably\nhovis\nsissons\ncommissario\nwestcoast\ndume\nmkv\nsweetening\nborken\njiggy\ntiffani\nshontelle\noumarou\nscouter\nhildyard\npickpockets\nbookend\nnoora\nbirdhouse\nmassachusett\nibogaine\ntuborg\njrs\ndardanelle\nspatz\nrefloat\nenteritis\njibes\nhaddix\nmilledge\nmcgaw\nblueish\nfiac\nklausen\nthatcherism\ndiffusers\nbolitoglossa\ndisenfranchise\nyaquis\ndecaro\ncalvario\nbarberis\nkall\npronk\ndarron\nriso\nsfpd\nkohout\ndarra\nfodio\naquanaut\nzender\nmagique\nshafei\nomran\narvizu\nnito\nslugged\nhemmer\nosric\nfokus\nbergenfield\nloulou\nsoran\ntailfin\nmathare\nddh\nlerch\njidkova\nvci\nutt\nndrc\nbarthez\nbarnoldswick\nforsteri\nearpiece\neunomia\nfreewill\nkujira\nperistalsis\npostmarked\nredubbed\nremar\nwakeling\ncassim\nmaterazzi\nhenceforward\npitlochry\naflatoxin\nrenaudot\nblakes\nhotch\nrearm\nskaro\nshuttlecock\nmahram\nuntag\nmaybeck\neffeminacy\nalgorithmically\nforint\nvaleriya\nlampposts\nrecio\nfeher\ncompère\nhikmat\nkayhan\ncsas\nnaras\nunemotional\npurton\nuttaranchal\ntakaful\nholliman\npock\nguerres\npolysyllabic\nmapreduce\nxabi\ngolmaal\narbab\nheiland\ncomite\nkaneto\npuzzler\ndarvel\njitney\nbowline\nfownes\nsuckle\nhondurans\nsousaphone\ncramlington\ndessen\nseacliff\ngourcuff\nhalima\nisos\nsheff\nbohinj\nchins\nbricklin\ncambie\nchlorpromazine\nsalutatorian\npanayiotis\ngigawatts\nveces\npotw\nbranton\nwjr\njamshoro\nbrünn\nadderall\nsolfège\ngrantsville\nofori\nmaisy\nbritishers\nhidenori\noveremphasis\nisser\nminiskirt\ninclusivity\nyifu\nfura\npescadero\nmaralinga\nbdk\nsullavan\npierlot\nngari\ncarouge\nyacob\ndymchurch\nkuznets\nayes\npowe\nperfections\nherczeg\nplaysets\neritreans\nhyades\ngeochemist\nmashad\nreverential\nprestwood\nnith\ngosper\nhiaasen\ncoati\nlaetare\nzawya\nmoar\nblott\nhelliwell\nonstar\nurayasu\neberstadt\nsobule\njiaqing\nfluorophores\nroys\nbolander\ncchr\nzhizn\nwabbit\nruxandra\npyros\nprowling\npiazzas\nfegan\narabo\nsleuths\nexudate\nvesely\n
challah\nxinyang\nstringham\nbreno\nmaldita\nstaci\nembl\nedric\nstrathnaver\nkanchenjunga\nwetumpka\nerard\ndomestique\nbfbs\nmagri\nmadelaine\nsixten\nmuny\nluiseño\ncrewdson\nenewetak\nmoru\nazurite\nexterna\nprude\nilulissat\nindepedent\nlimiters\nvaritek\noffley\nfatmir\nvfp\nkabushiki\nlinq\ndaryll\nvdsl\nisea\nandelman\nascione\nhendaye\nocclusive\nbudrys\nbarite\nfukuzawa\nheadbanger\ngallois\nuyo\narteritis\nbuu\ntulf\ncarpetbagger\nbrycheiniog\ninducting\nonca\nrifugio\nkeleti\ncaravels\ncrecy\nicra\nicaza\nsuwanee\nkeldysh\nterps\nahronoth\nsilicic\ntrexler\nfeedforward\nswamping\nfrn\ngangstas\nsulfurous\npremotor\nantonyms\nlutwyche\nmadruga\nbadgered\nfluoresce\ngerrards\nreenact\nswot\nwews\nkellum\ntoliara\nbagenal\ntouareg\nhantz\nrikkyo\nllanbadarn\nwappinger\nsproat\nbordallo\nbircham\ncalo\nswaggering\nmultifarious\npallbearer\namundson\nmacondo\nrenova\ntriptychs\nthalassarche\nzenawi\napplecross\nmanser\nhouseboats\nlezcano\nkic\ngoldthwaite\npmu\nahman\nferner\nrevanche\nrahab\nyone\ndunklin\nshaare\npseudonymously\ncawthon\nxoxo\nmaumoon\nleerdam\nseapower\naom\nchattan\nledes\nferrin\ntopa\nramechhap\nshunde\nehrenburg\ndenominazione\nkupper\nbhabhi\nroath\nogan\ncoderre\nchernova\nraes\nunaccented\nhabeeb\nfairwood\npoignantly\nhorder\naviles\nrevellers\ntyla\nearplugs\ndehloran\nnml\nlinfen\neveleth\nmccaig\nanyones\nsokka\neversley\ncupe\ngrandaddy\nnsclc\ncoevorden\nmcelhone\nyamen\nexcercise\nudell\ntanwar\nhogging\ncjd\ndulverton\njairam\nubud\nbogong\ncelyn\nsektor\nhongbo\ndeportment\ninhabitable\nstreetsville\ncherrytree\nsalerni\nbottrell\narati\npresupposed\nsaguaros\nblyden\nfase\ntalento\nazhari\nbookplate\nrogoff\nlarkhill\nwinbush\ngiraffa\ntrouts\nvlade\nevaporators\njagmohan\nvakhsh\nnamecalling\ncompacting\nmorigami\nhanza\nreeperbahn\nrossbach\nbarmby\nquotidien\ndistel\nmonkwearmouth\nmindedly\njobin\nunrolled\ntablespoon\nextraterritoriality\nlmb\nagnon\nalfredson\nconsistorial\nminimisation\ndweezil\nsuperintend\njoran\
nstrock\nkpi\novervalued\nbaybears\nskipjacks\ncrugnola\nmemorise\ncesana\nspeedcar\ndeta\nmorphic\nknitter\nfunereal\npasserelle\nthrombotic\nvocalize\npriestman\nwordstar\njarden\nkhieu\nconfiscates\ncaouette\nanquan\nishrat\nkriel\nsizer\nrenae\nhubba\neuless\nmoshoeshoe\nuchimura\ncourtside\naspirate\ncohens\nwächter\npicone\nuhlig\nestey\norkin\nkopitar\nbordo\nbardin\nmugler\ncerutti\nnavona\nlysosome\nalgoa\nyouri\nunexamined\nobjectified\neweek\nmamdouh\npakeha\nbrooded\nyellowcake\nzst\nschjelderup\nchurton\neprdf\nvetinari\nbucca\ncarriacou\nlemus\ndoakes\ntekla\ncleanroom\nnpower\nhousekeepers\njunipero\ncutest\nwks\nbenefactions\nbianconi\nwycherley\nnna\nheintzelman\naok\nkabakov\nmanthey\nunpleasantly\nprivée\nkappler\nhasely\nclifftop\nnichelle\nworkup\nbroxtowe\nnervi\nlysimachia\ncarlyon\nbria\njaidev\nkarlsbad\nalbán\nbrey\npolet\ntrypanosomiasis\noverspending\nwead\nserology\nlaming\nrichemont\nekelund\nrequisitioning\nrille\nincivilities\nmoriches\nminagawa\ninsulae\nvilcabamba\nkerrey\ntarcisio\nbattlespace\ncharalambos\nallready\nvampira\nextricated\nhanoun\nvickrey\norlowski\nsolitario\ncotler\nsobhan\npuertollano\nescargot\ntaze\nunburied\ndisinfect\nzenga\negyptological\nafarensis\npanfilov\ncorstorphine\nnobuyoshi\nlaches\ncuri\nlovestone\nbaigent\nslivers\nmetalworks\nchaiyaphum\nmolalla\nlollar\nbrandish\nenraging\nambu\nkiyotaka\narcady\nroughened\ncarlie\nlavoe\npolamalu\ndorrington\nascendance\nkracker\nfychan\nagarkar\nnesuhi\nbalkhi\nanche\nleyes\nmyaskovsky\ntucholsky\ntequesta\ndrita\nquoin\ntexoma\ndepreciate\nvladas\nwreathed\nnnl\nseverally\nfilmy\nnuna\nhagenbeck\nwahhabis\nmedawar\nvergina\nthiebaud\nbraye\nvremya\nyabe\nnafees\nmunchkins\nherri\nalexiou\nsenorita\nsituating\nsmigel\nvardanyan\nprolly\nclik\nglenarm\naardvarks\natleti\nvishwanathan\ncarita\nségolène\neboracum\nrarebit\nlecompte\nshrift\njayasimha\ncándido\ncentrica\ncréole\nmorrilton\nskittish\nauron\nsurfeit\ndilettanti\nruban\ncadwell\nglassford\nkaneshiro\n
ukyo\nengvall\ngrajeda\nbuckenham\nchéri\npember\nimmunofluorescence\ngrigoryev\naleta\nqueenslanders\nhawkhurst\nhundertwasser\nzeh\ntenanted\nturriff\nsimonyan\nskg\nreedley\nwlb\nsupernaturally\nuncommunicative\nbroadhall\nkobayakawa\nkidwai\nwalrond\nlynnfield\nnanos\nvae\nproblèmes\ncica\nunreliably\npostpones\nissoufou\nchokehold\nabalos\ncuffed\nlititz\nsanaag\nlowton\nfairless\nhauenstein\nknap\nvitals\nsenat\nchoux\ntewari\nballentine\nhuguet\nzarco\nmecosta\nmischaracterizing\nmoré\nsuperieure\ntunneled\ndoles\nsweatt\nmersch\nrantings\nduesberg\nroutings\npatey\nspick\nclondalkin\niinet\ntahsin\nfinegan\nmundos\ntogethers\ncontainerized\nlivesay\nmawes\npadania\nbartek\nneubert\nscocco\nlvg\nseafire\nghoulish\nrectitude\nchoros\nkikyo\nbrinley\nelte\ndwarfing\nunthank\nmisfolded\ngerrish\naaronovitch\nfbb\ncyana\nscrimshaw\npannu\nnabob\nlaghman\nnodded\nheffley\nconsumptive\nprologues\naise\nshubin\nchedid\ntreecreepers\nyakushima\nevasions\nshiflett\ncarausius\noverbury\nstapley\nbombadil\nwooton\nsecc\nputonghua\nrtk\ntaiki\ncomeuppance\nkayani\nnejat\ncattail\npxe\ngreenvale\npigg\nholohan\nhurtigruten\nnaag\nlodestone\nkailali\ngoffe\nbrynmawr\npeskin\nschemed\nmusicor\ncottier\nbayu\nlijn\nrelives\nmccalebb\nbingbing\nsealion\nkove\nfoolin\nmunde\nhpo\nurgh\nhackles\nignoramus\nsteelton\nruyi\nfantasize\nkabal\nbordelais\nnorgate\nlorene\nassicurazioni\nremixers\nlavar\nciticorp\nbercovici\ntweddle\nrowlf\nupv\nzoll\nmorrice\njailbreaking\nkissa\nküng\ngien\ndmytryk\nbusbee\ngabapentin\ncommack\nenr\nsugo\nhuangfu\nfptp\nmetalmark\nwinningham\ndecriminalised\nwingsuit\njankovic\naptitudes\nkelvinside\nbbf\njaroslaw\nwestgarth\nkoechner\necologies\ntaronga\nbramcote\nvaradarajan\ndefinatly\nhiranandani\ndippers\nendotracheal\nmorisot\narledge\nkav\nfeininger\nkarratha\nkrakatau\nprandelli\nhargus\nlloret\neveready\nmimbres\nfoots\nolo\ngarching\nmonotherapy\nfatemeh\nbreakpoints\ncordele\nswig\nconc\nstanek\nlinsey\ndyspepsia\nlysa\nwakhan\narauca\nt
hrottles\nsymmons\nsiebeck\nlagarto\nnavdeep\nunresolvable\nvirologists\ndolezal\nbörje\nunterfranken\nuop\nsemipalmated\nfridmann\nrunrig\netonian\nshumlin\nlashio\njayakumar\ndcv\ntitchener\nsteelworker\nmunim\nklaipeda\ncrazier\nlordan\nforbath\ncarbury\nakhara\nmatthey\nsqueaking\nhardenbergh\nortner\naugmentations\nglycan\nboitano\nmamiit\nhinkler\nsukiyaki\ntashlin\ndimock\nuai\nbaddiel\nbaryonic\nplatformers\nvenevision\nschmidhuber\nransacking\nwisher\npatuakhali\ncalisthenics\njessye\nbatemans\nmonga\nagoraphobic\nkaron\nrobertshaw\nmeskwaki\nfattest\nbugliosi\nfishwick\nlebrecht\nfunda\nchayefsky\nnicolini\nmahamadou\nplaisirs\nradyr\nvirtuality\nfeldt\ndestructiveness\npazzini\nvossen\ntandoori\nbrotha\nhuitzilopochtli\nvedat\nbestowal\nnagler\nmuchalls\nkaustuv\nvias\nhomosapien\nsnooks\nenforceability\nproteas\nsapulpa\ndespairs\nbrenchley\nuninterruptible\nwesterbork\nnaic\nnissin\naliza\nhochi\nmutism\naspergers\ntawe\ntumbledown\ncavazos\ncoveney\ntianshan\ntreece\nisfahani\nmireya\nndegeocello\nsindhuli\nlamarckism\nhaldon\nwapi\nsunstroke\nbrezno\ntwitches\nverhagen\nvtt\nbudai\nlinkers\nklinefelter\nbelaúnde\nidema\nevreux\nnalan\nvirtuosos\nfurries\nsquish\nconsolacion\ntims\nnuminous\nnoria\nensley\nmerlins\nkrok\nsilverstream\nbrevik\nwassermann\nazabu\nlatimore\narnolds\nkennywood\ntrium\nmarathoner\nfischler\nmirka\nconstantini\nstooped\nlaning\nlembit\nsherilyn\nkneejerk\n%,\ninglefield\noutspokenness\ncarting\nmends\nvep\nbuca\neil\nlambretta\nbaun\nmcpeak\nshahidi\nwenonah\nohad\nlalas\nwkyc\nreviewable\nblaker\nmetroland\ndispositional\njavadi\nesquerra\nshapovalov\nmawhinney\nmarquesan\nveith\nrustle\nwielders\nkrapf\nkaleidoscopes\ncompartmentalized\nzelle\ntontine\nsycorax\nwisse\nmelvil\nbuffeted\npangu\nnaat\nsags\npgce\ninglese\ninformatica\nuncircumcised\nwauters\ncoryton\nclavis\nkazuhito\naustar\nellingsen\natha\nabascal\nabsconding\nsardesai\nrvn\nardoyne\nuncannily\nhospitalier\ngodel\ndescript\nfootplate\ndadaism\nwecht\npolde
n\npovich\nbloo\npecs\npranayama\nensnare\nfrasers\namies\nkartini\npenances\nryon\nmcgeady\ntrafficante\npreševo\nfincke\nglendive\nramsdell\nbabacar\nyakir\npolyneuropathy\newoks\nseres\nsudesh\npagett\nkizer\nemoto\ntwiki\nintellects\nsoundview\ntraverso\nalmog\nsteadiness\ncoffeeshop\nmandaeans\npaperbark\nsambas\nserle\nmabey\nderbi\nerceg\nbottesford\nhocken\nsilurians\ntokoro\napelles\ndugger\nflemmi\nmilnrow\nhaarlemmermeer\ncalpers\nuncounted\nquilliam\nbubbled\nbrue\nliteralist\nalphons\nberlinguer\nlipchitz\ninhouse\nflumes\nengdahl\nllorens\nflamenca\nwoolfson\nescándalo\nfilers\nkoppen\nfilarmonica\nzorg\ngobbledygook\njarome\nmicrostate\nrackley\nstickman\ntoph\nfelin\nchangjiang\ndusseldorf\nschrier\nheda\nkatsuyuki\nabdon\nilja\npallotta\ncratchit\nkolker\nflaked\nlbp\npendent\nplacated\nwowed\nelectrum\nbarbato\nayelet\nhymie\nbelkacem\nashridge\nsideboard\nvanderpool\ngiulini\neagar\ndickon\npeterlee\nhorsburgh\nsoloway\nclisson\ndynamix\nnimal\nminidisc\nbgn\ngreef\nsatirising\nhuntsmen\nspadaro\nfrilly\nruffle\nrohnert\nrykov\ntbt\nspindletop\ncomprehensibility\nbalkhash\nkarm\njok\nnighter\nembroideries\npriti\nmonopolizing\naccessable\nbushby\nblaugrana\nlescure\nnura\njennet\ntessellated\nniyaz\ncinquecento\nhoghton\nmujtaba\nkojiro\nevildoers\nhanegev\nselver\nmckusick\nstoners\nneoplan\nrenege\nvalentim\nbonecrusher\nneorealist\nbracy\nconfounds\nbeachheads\nmegamall\nankylosaur\nhydrous\nglenna\nbambini\nprovidential\nesses\nhryhoriy\npleshette\nlaguerta\ncasilda\nmicroraptor\nesben\nswieten\njónsdóttir\nwgr\nmannington\nwhirled\nseacole\nizmailov\nmeaninglessness\nuderzo\nqueneau\nscali\nreet\nhumored\nmaling\nlyly\nsoundararajan\ncarport\nedline\nbeheads\nmacdowall\ntetteh\nakhand\ngeely\ntanel\nmediatek\nparmenter\nhugger\nclairsville\naralia\nvsat\nhains\nmuzaffargarh\nraker\nshiota\nyoshito\nnarok\ncadi\ndayaks\ndelannoy\nprière\npamphilj\nmadhepura\ncarelli\ncumbrians\nprebends\nmallia\nnavon\nuneaten\nstaplehurst\nulrica\nstratagems\
nsitch\nsaulius\nmkrtchyan\nluzzi\ngoni\njonglei\nleti\nmillets\ncorpsmen\nharlesden\nfattened\nkhayelitsha\ncaelum\nwhoi\nskydrive\nedem\nnorridgewock\nshutoff\nvarejão\nsoooo\nqft\nguacamole\nrajguru\npatman\nservet\ncarian\nlongmore\nhryvnia\npavlenko\naftermaths\nalí\nkneeland\nkiesling\ngemelli\ndisparages\nfigge\nconfidences\nshebelle\ndosed\nmetall\ncléo\nfurcifer\nmarable\nreenacting\nbrockmann\nquiddity\nsamant\nhiromitsu\nredrew\nmunawar\nmarber\nhakobyan\nohanian\npegah\nzanzibari\nwagnalls\nquabbin\nboj\ndowncast\nmazie\nwomble\nhitlist\ncouvent\ngnatcatcher\nguangming\nnarnian\nendesa\nsarfatti\nsylvius\nnirenberg\naeronomy\nlobate\nfaiyum\ncholinesterase\nabeer\nkhadim\nrumbo\nserkin\nsomerford\nnakul\ndemarchi\nmusty\npayerne\nmanie\npapilloma\nclydeside\ntreefrog\nhnc\nquivering\nkiat\nnodar\namaechi\nprevite\nstet\njacquinot\nbilardo\ncullompton\nduplass\nmahad\npervasiveness\nhafsa\naldin\nhangmen\nsuras\nnoho\nbantock\nmirin\nkiew\nprocureur\nchunder\nbutner\nmaskin\ncowbells\ndasilva\nconsistancy\nentrees\ndinnington\nbhairab\nviju\naborts\nschuurman\nwestar\nquadruped\noliverio\nrohs\nfritzsche\nwinnicott\ngastroenterologist\ndiphenyl\nrandor\ncollab\ngolla\nhasard\nselvi\nscapegoating\nkrt\nfcv\nhughey\nparmalat\nmoonfruit\nvelodromes\ntversky\nfiachra\ndialer\nwithy\ncristianos\nhickel\nwetherall\nbeeping\nrosaria\npeon\nlavezzi\nsalvadore\nindri\nplaythrough\nfundraise\ngoddaughter\nlindt\ncoasted\npyloric\ncappielow\nannia\nhypnotised\ncantoni\nmalviya\nmeusel\namalekites\npostulation\nbirchfield\nalbia\naudrain\nbasmati\nnadin\nkitara\nprofundity\nrri\ncoh\nlegno\nhousley\nkerikeri\ndobyns\ndaas\nbrigstocke\nthali\ngoiter\nincinerate\nseann\ndefazio\nbobwhite\ndyskobolia\nbriefcases\nmagnetospheric\nsilts\nreadwriteweb\nkhush\ncrowland\nsalkeld\npeyser\nwincott\nwarby\nzhonghe\nqaim\nreaney\nhorrell\nrespectably\nwobbling\nbrisket\nhitchhike\nanorexic\nrhe\nyanbu\nonno\ncashless\ndoorknob\nsalomons\nliberalizing\ncancionero\nwatten\nnationa
lizing\ninfinito\nanastasija\ngrimaud\nyusheng\nfardc\nsexologists\npatoka\nfoucher\nchasuble\nkiriyama\ninkers\nbriquettes\nkcna\nbassem\nbombonera\ncupp\nvalentyn\ndedi\ngammage\nhiromichi\ntormo\ndanquah\npecked\nmpx\notb\ntinoco\nimeem\nzweifel\nbardet\nmuong\nbossman\npromis\nsolidarność\nglyde\nbuddhadeb\nkofler\nbabalola\nsantillan\nmanigault\natascosa\n,it\nuws\ncodifies\npostmen\nwortman\nmvr\nzella\nsoleimani\ncarraro\nlindale\npêcheurs\njouni\npigpen\ndigiorgio\nvelit\nnereo\ntinkered\nrotolo\nnitzan\nbundu\ngrivas\nebersol\ngöncz\npeening\najah\nswill\nfathima\nahsa\ninclining\nmotorrad\nkammer\nappleman\nersoy\nmultani\nhaberdasher\nshaqiri\nbrahmacharya\nsirloin\nsahlins\ndigester\ntrumpeting\nmaggy\nsoloman\nmassaged\nhallux\nbrennen\npictorials\ncapdevila\nbolide\nnewmar\nspeckling\nfeaver\ngolovkin\nmontresor\naraucana\nbucco\nfryeburg\nsunsari\neike\ntiina\nkantipur\nguericke\nartilleryman\nsushila\nvenal\nhairpins\nsportimes\nealy\ndesio\nconcorso\nwardrop\ncallup\ncrafters\nbvt\nsanitizing\nflamel\nwalldorf\nmangabey\njohnes\nlangurs\nexpansionary\nspanked\nsapwood\nrazumovsky\nparagons\nreunify\natash\nindustrializing\nbachao\nbaumer\nzizi\nmeathead\nbonda\nblanck\nablutions\nnikol\nserota\nsoroka\nkodori\nweltanschauung\nkhushboo\nchelonia\nmeinert\nlesya\nfister\nvwf\nwtvj\nstumm\nnawada\nhoneycreeper\nloche\ntasters\noppo\nlacrimosa\ntenga\natrophic\nbajram\nchristianize\nnavigazione\nhappold\ngorgonzola\ndaguerreotypes\nchiluba\nwindstorms\nodysseas\nbyelections\nwestaway\nallauddin\nfauve\ntkach\npolitécnica\nmaclure\nsujet\nkiyomi\nvulpecula\nwattis\nmarken\norri\ngener\npeverell\ncoben\nrenzetti\nfirefall\nblumen\nmidgut\ncarnets\nchouhan\nardee\ndeathwatch\nansara\nloreena\nkatari\nbattey\ntendring\narnar\ncontextualization\nfailsworth\nféile\nbouck\nbonehead\ndonc\nallegorically\nfranchisor\nuncontacted\nheitman\narzt\nprocurators\nturabi\nukuleles\nriverkeeper\nplacebos\nmetabotropic\ntanit\nnahan\nlubeck\nhagee\nbrogue\ndietl\nadalah\
nbotes\nkilrea\nmorona\ncrewmates\nvestris\nkieth\nniimi\nbutterworths\nnanavati\ntills\nblanes\nolympos\nbachand\ncappelli\ngylfi\nbrims\ncloakroom\nabsolutly\nsigurðardóttir\niconium\nallotrope\nleicht\nchipp\nrohana\nloucks\ndinanath\nalderete\nmasochist\nthl\nfrancoism\nsyariah\nwpxi\nkalanidhi\nbasit\nultan\nkarle\ncorrals\nschlieren\ndousing\nosney\nlwr\nhade\ntonhalle\nhenney\nbelie\nhadouken\njephson\nblondy\nfloriculture\nbirkat\ninterjected\neisleben\nllangefni\npitlane\nsysadmin\nfriederich\nyokogawa\nrotimi\ntijani\njsat\nsuperannuated\nsenility\nordoñez\ntrinket\nsoeurs\neaglets\nmorisset\ntgi\nclothiers\nfleer\nbumthang\nmiev\nduhig\nnaor\nmaysles\nqueretaro\nmarama\nquaranta\nwuerffel\nsidley\nharbourside\ntrym\nbranka\ncolinton\nkeitaro\nlochan\neberstein\ninvoicing\nfincham\nshabba\ncannings\ntijerina\nmatam\nsulkin\nfahri\nyapp\nwaku\nraheja\nthespians\nrolan\nfreies\nrisaralda\ntalpiot\nshanice\nakhter\nsaarlouis\nhellberg\nbuckden\nbodegas\nsanatoriums\nkanker\npire\nsuccor\nardalan\nsiboney\nszentendre\nunst\nsibal\nluteinizing\nradchenko\npapaioannou\nminnis\naliwal\nundiminished\nsycophantic\nstooping\ngryf\nzino\ntimiş\nmwangi\ncyclopean\nroessler\nibk\nchazal\nfractionally\nabbad\ngomme\njunkin\nbreadbasket\npostmedia\nnetherfield\nwiberg\ncruk\nfoodland\ncourville\nvonk\nmesabi\nfarell\nporsches\ndiamine\nummc\nwhaleboat\nmancera\nnuetral\ntsd\ncommitteewoman\nfludd\nsandpit\neffusions\nravindran\nduddon\nsquashing\neick\nwyer\nsayfutdinov\ngrbavica\nsuger\npij\neustachian\ngrebo\nkuehn\nnehra\nspezza\nperidotite\nflandre\npropranolol\ntrabert\nelectricals\nsqualene\ntavi\nworksheets\nzócalo\ndharmas\nturnage\nberlage\nkingpins\ngrbs\ntikhomirov\nnastiest\nrequisitions\nqutbuddin\nbroberg\nstrudel\ngetaways\nhackl\nrhetorics\nauro\nyaka\nsteklov\npondo\nredheaded\nnaïveté\neventbrite\nmardon\ngeocoding\nhassen\nheymans\nchileno\nmaudit\nhindhead\nmisono\nkais\nncn\ndismounting\nshupe\ngach\niseman\nronell\nairfix\nrobinette\nacasta\ngolborn
e\ngrownups\nosburn\nhoudin\nbonello\nlukavac\nfuzziness\npargeter\nagv\nendorphins\npennisetum\nicoc\nrothamsted\nsomites\nschedeen\nvestey\nsentido\ndiahann\nfishnet\nsequera\nsharipov\nlowen\nlimite\nalak\nebbs\nareopagus\nsiddi\nplott\nbronzeville\nademir\naharoni\nblowpipe\ngolubic\nsakho\nalbuera\nallender\nglucuronide\nogunquit\nsommes\naddicks\ngibe\nswirsky\nhighton\nazura\nintoxicants\nmidamerica\nankylosing\nnonchalantly\nweaselly\nhermansson\nwindfarm\nunturned\ntolerability\nhaskin\ncolerain\nwresting\nphalaris\nmirabelle\nbuidhe\nhokie\nrueben\nnacs\nantinori\nsollecito\nclauson\nkuran\ndistricting\nteary\nlychgate\nrike\nallameh\njealousies\nmazursky\nericka\nbatoni\nwimps\nshelia\nfrontally\nburgi\ncampin\nkimes\nopis\nindustrialism\nfishhook\nlarc\nenglishness\narnault\ntorno\nmatravers\ndimarzio\nfaya\nkazumasa\ntuv\nvukan\nsennheiser\namália\nclarkstown\nassimilates\nketevan\nchotiner\neunan\nglueck\nraters\nbisa\nfrighteningly\ncasgrain\nmalwarebytes\nvasileios\nbuffo\ndolomiti\npropublica\nhogback\nplastik\nbattlement\ntaisei\nhevelius\nfischel\nnymphaeum\nbti\nmoudon\nolegário\nmoonie\nfibroids\nsalom\nlavina\nfineman\ngrieux\nvasanthi\nbedale\ntransposable\njons\npercept\nelene\npasquotank\nmccammon\novercharged\nlezama\ngik\nlabarbera\npfau\ncucaracha\nflecktones\ntoer\nguffey\nmaner\nmusgraves\nstormare\ndholakia\npotočnik\nbayadère\nwti\nselima\nmaclear\nchalices\nbarged\nhamengkubuwono\nsideonedummy\nzilber\ngokarna\ninscribing\ndwg\nautobiographer\nsubclinical\nmbts\nbrem\ndigregorio\nbrongniart\nhoggan\ndoms\nwyalusing\njaps\ndinardo\neastchurch\nmauchline\nabdominalis\ndispels\nkhumbu\nantwoord\ncelui\nwillowmoore\naen\nnevesinje\ngrf\ncfrb\nwishlist\nmuntari\nunef\nanthocyanin\nmenomonie\nverdin\ntypecasting\nprepackaged\nsiman\ncoens\nopr\nexco\nrials\nchihuahuas\nrulfo\nmaroun\ngilbreath\njianguo\nundergrads\nindefinable\nhalbach\ndraymond\nbailon\nkarlie\nmalani\nmountaintops\nyoshiwara\nhofbauer\npettiness\nrimas\nkeshi\npizzerias\
nbreedon\nyaacob\nmajik\ncerruti\ngussow\nhindlimb\nmontealegre\nguiche\nkilbane\ntaus\nelta\nsimulacrum\nsems\nbollington\ndelic\ncoldingham\nobion\nvefa\nensler\nhossack\nemiri\neckl\nportlandia\nmisdemeanours\nmcnerney\nbeurre\ngmi\npleaser\namancio\ndjorkaeff\nsalz\ncatman\nemraan\nqualicum\nhousewarming\nlevente\nnecmettin\nwoolard\nwdaf\nancram\nmalakal\ngarriga\nnetapp\nteriyaki\nleaman\npodmore\nlukman\namphotericin\nsmeets\ngorkhas\ncervia\npetani\nebullient\nfeilden\ncartledge\nviardot\navocation\ncoupar\nwamp\nicct\nagutter\neichner\ncarrozzeria\nsyenite\ndiatomaceous\nwarda\nlatencies\naravane\nautonomies\nblic\ncontentiousness\narthroscopy\nillegitimately\nforbush\ndecedents\nlycra\nscamming\ntiku\ndemonizing\nsenan\nacquirer\nwaushara\nroky\naqim\ndeathtrap\nmucky\ncarowinds\namelanchier\ncatsuit\nmally\nmetafiction\nbecalmed\nfrisbie\nmurciélago\nruffner\nelano\nhopgood\nferrat\nshantytown\ntullis\nkoepp\ngarnock\nnoelia\nberezin\nambiental\nathey\nnevinson\nkuster\ncannavale\nwebzines\nkermes\nwisn\nstanbridge\ntelephonic\nabro\nvalenta\nailanthus\ntalma\ntowler\nmukundan\nslovenski\nhags\ncavallaro\nabdallahi\ntippit\ndeweese\npowells\ninda\nshaked\nnaza\ndinosaurian\nsilchester\nmerhi\ntetsuji\nturkcell\nmithraic\nscheidemann\npolster\nchandrashekar\nhydrogels\ntkachev\nbegets\ndikembe\nprivileging\nufologist\njagdalpur\nparasaurolophus\nmulayam\nsuperbad\nftm\nvilafranca\nbegot\nhiralal\nhothead\nbergqvist\njingjing\nknipe\nshills\ndipankar\ndefrag\ncurare\nmateja\nluverne\nsemel\nedwardians\nshulamit\nsandfly\ncolorada\narnason\nyishai\nbvc\nrowboats\nbarma\ntrelew\ndragunov\nrawcliffe\nraquette\nmontis\nbrazauskas\nwriteups\nmelitta\nvoysey\nkonkuk\nmcneile\npudu\ngipp\naramburu\nnady\npadlocks\necheverria\ndeeks\nsirio\nrestyling\nsaeb\nellendale\nliven\nimbrium\nlevick\nhoncho\nupended\npathologically\namlwch\ngaertner\nweddington\njancis\ndammers\nburdening\nsamm\noza\nsolomos\nnepalis\nferman\nsylphide\njanmashtami\narmrests\nstraightens\npa
ralegals\nosr\ndonata\nliversedge\nwoolridge\nchlorosis\ncolling\ninskip\nwhirlwinds\nwyton\ndennen\nmaruko\nsatz\ndickel\nberrett\ngitlin\nkaspars\nbirther\nnaledi\nnasca\ndpl\nbarrat\nliberale\ncof\ntóibín\nspybot\nmanitoban\nadlard\naltgeld\nshaunavon\nqinghe\nchilhowee\npassavant\nilitch\nrosaleda\nfitzhenry\nlyres\nreplicable\nverrall\nsittler\nbiehn\nshibu\nvarzi\nworkdays\njagland\ncawthorn\npietrasanta\njenness\navoirdupois\nmanette\ncowher\nsonorities\ndiffernt\necpa\nflatline\npsychoses\ndoring\ngalliera\nlangman\nchristophersen\npositio\nnightshift\nseara\nparini\nsingson\nkersh\nunpromising\nabled\nalinghi\nduru\nmulvihill\nwyly\nmclaury\nmohi\nlasing\nbaranowski\ngravier\npolin\narcelor\ntinguely\nparksville\ngaluppi\nellinor\ncontoocook\nsingye\nmatranga\nselfe\ndeluged\nautorité\nfenby\ngroveton\ntranssexuality\ndnl\nsadder\ndossena\nminustah\nkulakov\nvaldas\ndestrehan\nyarm\nardley\nnieuwenhuis\ntortious\nsoutine\nkonstantinou\npflug\nservando\nkitsis\nungrounded\npwll\nroundtrip\ngof\npinn\nvlasto\ncarazo\nbyas\njmb\nyaracuy\nglomerulonephritis\npipettes\nfarfa\nschwechat\nacromegaly\ngranet\ncatterall\nelastin\ngaleana\nregedit\npaulose\nswivels\nforetells\nbango\nmcsherry\nunstuck\ncarlberg\nbodog\npinza\nbarik\ngenzyme\nmeston\nferd\niberica\nitta\nlipstadt\nproscribe\nmishmar\nbuprenorphine\nsunao\nholanda\nwallowing\nmagnifique\ncolletti\ndecrypting\naylsham\nleptis\nhentoff\nswerves\npyx\ntietz\nvarona\nepididymis\namrani\nmorrall\nkhalif\nrosemond\nlauf\ncachapoal\nedgington\nbadie\nsaara\ndure\ncurrington\nvalencià\njml\nbabysitters\ngitte\nskibbereen\nciccarelli\ngunsmiths\njoosten\ntéchiné\nfrontiere\npatheos\nbracher\ngravett\ngoldrush\npadawan\nfestooned\ngranulomas\ncornfields\negba\nzenger\nhammoud\nmanyara\nocb\ngwillimbury\nbehram\nworli\nshahn\nresupplying\nloblaws\nbagus\nbreakable\nszalai\ntaskforces\ncaribbeans\njambs\nhouseplant\nbracciano\npeice\ngyorgy\ntwn\nsheepdogs\nararia\nsellwood\nsablan\narqiva\ndakhla\nvagal\ndevastat
ingly\nasselineau\nsuq\nsfmoma\nmisappropriating\nhermia\nwesterwelle\ngresty\nfethullah\nfortezza\nbudleigh\nimpugning\nsmer\nmihdhar\ntirano\nharking\nrafelson\nfforde\nschoolbooks\nhobb\nsuperficiality\nfelker\nblassie\ncanh\nvympel\npunctum\nmidazolam\nciam\nbannan\nmarlton\nottey\ndulaney\ncuong\njuiced\nncpc\nantonsen\nheadbands\nschagen\nagudelo\nnopcsa\npistes\nsharmarke\ndagnall\nappreciations\npalazzetto\nmtwara\npausch\nheney\nrecaro\ndavydova\nhopley\naishah\ncleat\ndustjacket\nlauterbrunnen\nicelandair\nrfef\nmesaba\ndiel\nzit\ngebrselassie\nessling\ntetrodotoxin\nindelicato\nmanal\nfreerepublic\npropofol\nlovebirds\nkerimov\ngbi\nyamba\nmerch\nbuzzworthy\nphm\nfbn\nmukerjee\nmessieurs\nkassem\npeploe\nattercliffe\ndondi\nsimkin\ntenaya\nparliamentarism\nkimani\ngullet\nfethi\nbattistelli\nprobative\nyuchen\nbargello\nbatur\nfitzclarence\nmutlu\ncarlini\ncadd\nbromyard\nsievert\nceduna\nrouget\nmesure\ntaccone\nkiting\ngnd\ntropico\nturkoman\nvind\nvocalizing\nlachen\ndoggystyle\nhaddin\nloughor\nmoyal\nzendejas\nhasted\ncalabresi\nrilo\nglenys\npomelo\ngiggleswick\ndevante\ninteroperate\ntucking\njanick\nomidyar\nacclimate\nferdie\nrabinovitch\ntdy\nlongings\nfrend\nboater\nahf\nfreespace\nokoro\nspic\naliaksandr\ninsistently\nwjla\npneumatically\nperiwinkles\nwillock\ngaggle\nlatv\ncrich\ntiemann\nbéjart\nformiga\nzuccotti\nperiódico\nanap\nlangenthal\ntaillefer\nlevingston\nlockstep\nrasberry\nwils\ndysprosium\nkalinina\nreaganomics\nlionhead\nricocheted\ncluskey\nsoulé\ncusk\nharangue\natilio\nzeidan\nanisa\nsaurian\ncnb\naneuploidy\nsgl\nsalzach\nnimeiry\nwathen\ncapitation\ndeddf\nredwings\ntectonically\nmeegan\npaella\nrufe\nclintonville\nliberum\nkonigsberg\nmaertens\nagbaje\njiban\nunsubscribe\nruminations\nnápoles\nbroc\nolafur\nemule\npiccinini\nfrain\ndesecrating\ncuius\nflours\noutgrowing\nfrederika\ndjerassi\ngeneology\nrahmon\ndros\nboxx\nrichborough\nehle\nshastry\nhannen\nhobley\nremer\ntirthankaras\nfuruholmen\ndudayev\nfrivolously\nga
etan\ndepo\nmorganfield\nsalmagundi\nchasin\ndropper\nedis\nstickley\njáuregui\nnoémie\ntameka\nlashawn\nlessa\ntheyre\nplcs\nroeser\nmanuva\nurbs\nalexandretta\neixample\nusmle\nperdrix\nimpel\nelize\nprophète\nthousandths\nmotz\nfarner\ndidone\nwantonly\npalindromes\nwhooper\namorebieta\nkhushal\nversant\nfiorillo\ngazzo\nkarner\nturlington\ngrizzard\naliaga\neloisa\nlidell\nczernin\nbicultural\nromanowski\nphalle\ndosn\nhoddesdon\nadekunle\nminiaturists\nngael\npadme\ngoodsell\nprincipalship\nmccredie\nmackinder\npoivre\njrp\nmadonnas\nsocialisation\njulen\nswanberg\nidv\nsoundproof\nmagdalo\nstanozolol\nsmx\nmegastar\nroberton\nwidmore\njerod\nbaudoin\nelectrocute\nmorfa\nwinterborne\nplitvice\ntakafumi\ndyachenko\nhelldiver\neicke\nflavourings\nmyton\ntrocadéro\nrhydian\npatino\nleontyne\noverstep\nlavalin\npayn\navishai\nroughed\nasari\nbanlieue\nwhitetip\nsagna\nmunising\nshiatsu\nmadelon\nironbound\nmutational\nactualized\nmarginalisation\namoxicillin\ngullibility\nthestreet\nlaja\nalors\ncameraria\njhonny\nmarlatt\ntheia\nrigours\nkalis\ngranulocytes\nkaci\nnsv\nmrca\nretrievable\ngizzi\netosha\ncowled\nanomalously\nbka\nvenky\nproteinuria\nttx\nbjerke\nbisect\nelcano\ntaihu\npenrhos\nsuchy\nmeggs\ncgo\ngravesites\nignoble\nbeust\npadarn\ndarshana\nwinwick\nvanaja\ncleavers\ndinapoli\nmatthaei\nblotted\nvalses\nalekseev\natsb\naymon\nparamecium\nmonopolization\nskouras\nmedo\nloftin\nnazmul\ndebits\nyokes\nwinnable\nsebree\nhaggin\nrende\nporretta\nakayev\npredicaments\nphysiques\nmerom\nzef\npodhoretz\ntechies\nketoacidosis\npeli\ntendril\nmcdougald\nmcfaul\nsamatar\nfif\ndaktronics\nfinning\nlockington\nhalswell\nlaffan\nscarponi\ngarita\ncenser\nflowerheads\nrepairer\nstateful\njetport\nsiamo\nmaffia\nwryneck\nghosal\nsombart\nsieving\nrudkin\nshiitake\ncareca\nthyristors\nzabar\nheffner\nscheib\npnh\nsoylu\neydie\nsekula\nbajrami\nwolfen\nlaocoön\ngrossberg\ncamisa\ntoleman\nempting\nfali\nsakal\ntutta\nriquet\nshagged\nfrinton\nlatticework\ninopportune
\nrecca\ngubkin\nphenotypically\nshinjo\nanticompetitive\nbarriga\nbogolyubov\ncoover\nninetieth\npapillion\nalgona\nradames\nlysozyme\ngoodrum\nsipho\nwbur\ntrist\nbertinelli\noberthur\narosemena\nkumai\nwheaties\nkgw\nkompani\nbergara\neddyville\noskars\ntiaras\nhoen\nknill\narlt\nkoray\nollanta\npasinetti\nquintile\nnfo\nstoff\ncerney\nberrocal\nfriable\narindam\njayasinghe\ndunciad\npinkner\nmeols\nfuturology\nkiewit\ntabatabai\nyueyang\nreames\nportway\ndamour\nmenarche\naptos\nrieder\ndotes\nmanayunk\nfriml\nzahi\ndehesa\nludovica\nboakye\ndemocratize\nseach\nfornax\ndody\nnezha\njayashree\nxiaofeng\nscuderi\nasako\nredactions\nsayang\nsakara\nhobble\nstockwood\nerinn\nuvic\nmorta\nkrauthammer\npolecats\naxp\ntrem\ntrimmers\ndunnock\ngilliat\ntreherbert\namfm\nyoi\ngabb\nhandbills\nremunerated\njeld\nazureus\nharlots\nnicomachean\ndiplopia\nmaccabeus\nmckidd\nindemnification\nfryxell\nikat\ngarroway\nwaitz\nfestinger\nkalmbach\nnoot\narsene\nlooses\nkaramat\nmagro\ninestimable\ntensing\nburruss\nbolingbrook\nguamanian\ntrapezius\nnewsmaker\nstepp\nfartown\nsizzla\nneco\negoistic\niframe\ntempora\nmoffet\nfloresiensis\ntropea\nstearn\nbesh\nbieri\ngramma\nlamadrid\nzegers\nsabit\nkilman\nrevill\nhingston\nsybilla\nnordkapp\nfva\nghajini\ndores\npolymerized\ndocsis\nsclerosing\nsunless\ncreamfields\nhft\nottman\nbalik\nwortmann\njenison\nyha\nvient\neastfield\nhoriguchi\nsampan\nacuna\nattie\nfarma\nseau\nrbb\noctant\ntagish\ndrumheads\ngoût\nmccawley\ncapitalizations\namalek\nkisha\nfunnell\nshalala\nbleeckere\nwieden\nstinnett\naspersion\nkorver\nritts\nsavitt\nhepa\ncensures\ncamelus\nsivertsen\ngoetze\ncerrillos\nwildhorn\nrhiw\nathabascan\nchinoiserie\nfoudroyant\nbassel\nbloxwich\nyediot\ntiros\nscarification\nnacreous\nfiscus\ncheekbones\nwhitemarsh\nkenia\nfujiyama\nsagacity\naliev\ncloudless\nbruen\nwoops\njogs\nsuntec\nlizabeth\nbirstein\nbulldozing\nbrowned\ngtf\nliriano\nstepdaughters\npallor\nsinclaire\nwalbridge\nhultgren\nsaia\nlibations\npinfold\
nissoudun\nbaresi\nneglectus\ncindi\nprentis\nrewound\nhecklers\nquinceañera\nmesrop\nroepke\nmicrospheres\nghazala\nmahomed\nsulky\nsoftener\nates\nlimbe\nshalamar\nhaverstock\ntiko\nbelfer\nlalbagh\nwiedlin\nkausar\nsuthep\nalliss\nrouault\ntte\nstylebook\nbigamist\nmelih\noverflights\nakg\nhoogland\npulborough\ncebit\nimprovises\nheuston\nbillinge\nsubdiscipline\nbleakley\ngianpaolo\nhammes\nmrv\nsphynx\ngraywolf\nquah\nsalen\ncorticotropin\nmwd\nduraid\nsonnier\nindecisiveness\nseberg\nisoc\nthaws\nsonthi\nncsoft\ndpg\nboite\nschutt\nmosler\nbethenny\nzoolander\nenterococcus\ndargaville\nhoofddorp\ncommode\nosgi\nvarmint\npitty\nlenya\ngranddad\nmadhesh\ndortmunder\nkrauze\nclodia\nanechoic\ndeceivers\nmerkley\nplaylisted\neffectually\nschuester\nryhope\nintermarriages\nalaimo\nridgewell\nchevaux\ndonatelli\nunió\nryzhkov\nheyburn\ncpap\nuyuni\nkracht\nmuggsy\nhalong\nsebag\nanthropocentric\nnationalisms\npdh\nantediluvian\nkendrapara\nsinitta\nkogure\nurinalysis\npseud\nrecapitalization\ncurcuma\nreenlisted\nsigulda\nunfixed\nvacher\nboléro\nterpstra\nswaggart\nrukeyser\naquash\nauja\nshd\ntarda\ngarmendia\ncampanelli\nflexors\nschurman\nmotd\nuruapan\nsaqlain\ntitmouse\nosteosarcoma\ninterbred\nhendryx\nthermochemical\nboquete\ntraeger\nmothman\nvignoles\ndillane\nkiddo\nsnaffle\nsayegh\nsosnowski\nstrassmann\nevacuee\nbrrr\nhalpenny\naquidneck\nsanthal\ndusts\npota\ntarplin\nsakyamuni\nbrinckerhoff\nweissberg\ncopiers\nclearwing\nvinter\nproinflammatory\nbresciano\nmarinovich\nromeos\npeverel\noks\nagapito\nmtbf\ncensuring\nmeharry\nvigilius\ndahlquist\nhalon\njoynt\nlaskar\nclotted\nderaa\nmaibaum\nubp\nsavarese\nnidd\nhdcp\nmettler\njarecki\nmosbacher\nkommandant\ncoltan\nintrod\ndandies\nsaucon\nmcvicker\nwillemsen\nposa\njounieh\nspitball\nscharfenberg\nfaits\nbremmer\nalbinos\nunawares\ncurrin\ndrumcree\ncarps\ngluttonous\nfonti\nlysacek\nkitzhaber\nfeuillade\nberserkers\nwebinar\nphysalis\nconsignments\nchirps\nmihalis\nhelminths\nrodriquez\nirishtown\n
königssee\nreys\nrohrabacher\nbeban\narrau\nprotozoans\njany\nbeacuse\nmagowan\nracialized\nwailes\ntrussed\nallenwood\ndunnet\ntheatreworks\nfrühling\nnawabshah\nmuizz\ncardon\nfreeth\nhussam\nmagdi\nlro\nhepatology\nnathuram\nprufrock\ndeltaic\nscarier\nhomesite\nenuf\nmarnell\nvondel\ndiwakar\ncarloads\nclayfield\nwunderkind\nwizkids\ncontinous\nebitda\nextols\npoiré\ngeldings\ntaylforth\nprofessionalized\nliqin\npositivists\nsayoko\nmultihull\navella\ncurriculums\nblucher\ndarlow\nrajshri\nsportspersons\ngrindal\noussama\nrippers\nkushinagar\nieg\nkaiserstuhl\nurr\ncaamaño\nyorkshireman\nspeyside\ncazalet\nfriendliest\nhydroids\ndaunt\nperotti\nmuslimeen\ntabun\ntriborough\nhornless\nriggle\ntimelessness\npinedale\nlodgers\ncodebreakers\nzadran\ngreenhow\njalbert\npellissier\ncapitolina\nlandini\ndwr\nasmar\nnacha\ncostarred\nsetswana\ncatfight\ntaraji\nantigravity\njacquelin\nbumpkin\nstockmann\nhillin\nrastogi\nneurospora\ngervin\nluen\narau\ntulou\nranong\nljubica\nmonessen\ncopywriting\nhorler\nraspail\najantha\ncontracture\nuwb\ncaraballo\ntrave\nheiman\nnyholm\npremo\nbougainvillea\nadversities\nfloorspace\nnapoletana\njuanjo\nuob\nracketeers\nmarcopolo\nidalia\nmaithripala\nzhiyi\ntates\ngabbiadini\nnyorai\nsightedness\nmoncure\nminoo\nyoneda\ninterjecting\navio\ntumblers\nraniganj\nfourah\nnabataeans\nidentidad\nautant\nhomiletics\ndeploring\ndml\ntuns\nsullied\npach\nprominences\nlucila\ngilks\npeda\npembury\nbuttigieg\nsvetlanov\nroxby\ndisegno\namcham\nwillenborg\nfrb\nlickey\nocher\nbcom\nkealoha\nvernay\nbygraves\nionised\nalfond\nhilditch\nshobana\nnewgrange\nzeroed\nghezzi\nbozkurt\nkorina\ntingey\ngowing\nbucklin\nkinrara\nchappy\nlarrionda\n：\nferncliff\nhollinwood\nlatissimus\nscardino\nlunchroom\nmasers\nfirefights\nunneccesary\ndistension\ngole\nsipos\nsilverfish\nnewchurch\nmiyauchi\ndindo\nputian\nborella\noleson\nwsis\nfandel\nkwabena\nlamme\nsushruta\nyall\nnotman\nkalakaua\nwildey\nkananga\ndiscriminations\nandøya\nklavierstück\nmegabits
\nskybus\ntranscaucasus\ndalvi\nsembach\ncrossett\nnutria\nsorbet\njumbotron\ncgiar\nsanjiv\nsweatshirts\npiketty\nsiegrist\nmagnifier\nconstruing\ncantilupe\nstmicroelectronics\nbazilian\neggnog\nrago\nparimutuel\nlidge\nchummy\nlichtman\nselimiye\nwakarusa\npcos\nperturbing\nplasticizers\nkateb\nrathmore\nbluer\nmaidenhair\nswaledale\nschoenherr\ndeadeye\njarboe\nbuchi\nleisz\ncalifano\nstreaker\ncaspi\ngavroche\ntenderfoot\ngalarza\nioa\njawara\npasseig\nboberg\nfroment\nmcot\nmulago\napter\nzilberman\ninuk\nmancinelli\nbrocklesby\nucar\nbaathist\napprehends\nvids\nwhiskeytown\nollerton\nsoha\ndelighting\nmizer\nriverbeds\ncrittenton\nhatchets\ngiske\npinguin\nnagahama\ncafta\nfultz\nduport\nadivasis\npizzey\nmariotte\nacy\nvade\ndragonette\ndaresbury\nindependantly\nkarpis\nhirakawa\nbambridge\nhadrosaurid\nsaraceni\nfederline\nbagga\ntendo\nshary\npratley\nljiljana\nkabat\nerasers\nwmm\npeephole\nlateralization\nsilvino\nplimsoll\ntynedale\nmagner\nwolfeboro\nsuport\narchrivals\nheye\nkoin\nmaddern\nbbp\nkeatinge\nkagaku\nmisma\nwmg\nrawles\nelmes\ncleobury\nmodernizations\ntranquillo\nshrewdly\ncrumpler\nboody\nplaka\nfalle\nstreller\nptk\nplagiarist\ndissonances\nabakan\nschreuder\nvoicings\nkoichiro\nmanora\nderniers\nkhairy\nfratellis\nmaister\nvoluntarism\narcherfish\ncesa\npottage\ngreases\ntaglines\nnij\njejomar\nmotos\npaolozzi\ncarless\naraceli\narleen\nexposés\ngleiwitz\nvindicating\nchitwood\nblyleven\neuropeanism\nloral\ndaunted\ngerome\nexplicity\nchapada\nfutrell\ncosplayers\nescamilla\nbritishness\nimpermanent\nspinetta\nmarginalizing\narj\ncreasey\nfuru\nclozapine\nabinger\ntessitura\nkellan\nsunscreens\ndipiero\nruhpolding\nspectrogram\nwoodbourne\nangioedema\nmarjoram\nmosharraf\nhuffer\nsciatica\nclarkii\nzooropa\nzennor\ntettenhall\ndanilova\nboyton\nmieres\nstaggeringly\nzhicheng\nmorabito\ndowdall\nmicrofluidics\nglorieta\nwigham\ncluff\ntrinder\nfbw\ndoudou\nleafhoppers\ngramenet\ndiclofenac\ncjsc\nchaussées\nvroman\nblewitt\nmumy\ngreenb
urg\nsuur\nimovie\ndestra\nbayazid\ntoxoplasma\nbarrette\nimagineer\ngask\nbevil\nloosehead\nprotean\nccj\nwillkommen\ngoldbug\ndivorcée\nfarfus\nlauris\narbitrating\npolyamide\ncasteel\ncrichlow\nmhi\nhool\noscillated\neuroregion\nalara\nscopolamine\nhobday\nmarangoni\nneca\ngartside\nthuggish\noosterhuis\npimples\nkacper\nhealthgrades\nrecantation\nlochore\nitea\nlecavalier\nnosek\ncaned\npetain\nuzelac\nchristophers\nexcoriated\nleggatt\nunselected\ntathagata\nisoprene\ntullia\ntrv\nlubbe\nrathdown\nodgers\nthulium\ntutbury\ncraveonline\nscreven\ningests\nquebradillas\ncarquinez\nafrin\nagraria\npreventer\nforcefield\npylades\ncix\nmrcp\ntraube\nsatow\nneogothic\nhogben\nbonbon\ncoinages\ndraka\nindignantly\nmadrugada\nblaha\ngiesen\nlapsing\ndimon\ndeoxygenated\nfrederator\ntassos\nabattoirs\ncadenas\norchha\nwimberley\namblyopia\nwestlands\ncsos\nstatelessness\nminored\nbuzznet\nrundell\nrajapakse\nsaphenous\ndessa\nclaritas\nferozepur\nqatada\nmacaluso\nlembeck\nukrainka\nberlingske\ndriveline\ndifférence\ncoquina\nlch\nantipolis\nliliom\ncalama\ndishonourable\npneumocystis\nchitarra\nbehm\nnewscenter\nmofaz\nprova\nbalon\npejman\nthangka\nredemptoris\nbeastmaster\nchoker\ngrimace\nextrapolations\nbuzzes\nintriguingly\nshizhong\nvarano\nbesos\nuninstalling\nfibs\nfaba\nestrous\ntandridge\nsahn\nmorgane\npann\nhalva\nyicheng\nnosebleeds\nenjoining\nboisset\nbucur\npuree\njoanneum\nwinstar\nkatzen\nchalais\nkovach\ngpcrs\nsolares\nzawra\nrowhouse\novidio\njingoistic\nlindenhurst\nsiegburg\ncircassia\nbenedick\nlansingburgh\nroughest\ndennard\ntarkio\ntesson\ntorcs\ncorser\napres\nkddi\ngaily\nbalachandra\ncareening\ngastown\ndeda\npols\nchichewa\nhisses\npilo\ndisarticulated\nmcbryde\nnays\ncimabue\nverplanck\ngfr\nmontecristo\nberre\nquadrupling\nleiner\nlongueval\nfavell\nvesak\nbraziller\nmisdemeanour\nshabtai\nmentz\nbaqubah\ntanzer\ntarazona\nmirah\nsoapnet\ntaina\npenhall\nermin\nunfrozen\npepsin\narbon\nmemorised\nseka\nexfoliation\nmartinek\nbadruddin\nb
ensen\nzuñiga\nnagore\ncerullo\ndaryle\nperisher\nyngling\nchloral\ncarolco\nferryhill\nwagenaar\nkhachatryan\ngasometer\ntatarinov\nchloroquine\ntwan\nreusability\npugni\nphysx\npolyploidy\neducationists\nhura\nslamet\nrbk\nchachoengsao\nprajadhipok\nplataforma\ngompertz\nchassidic\nvaluer\nmemorandums\ntreadmills\npeterik\ndirleton\nmuste\nczestochowa\nlws\nsuperprep\nephram\nhateley\nmahdavi\nbrewood\ngnasher\nunities\nmustaches\nbya\ngitta\naveline\nvalproate\nclarens\nfromage\nvasodilator\nchessmen\nnikkan\ningenio\nrajasekhara\nsouthridge\nbania\ngualtiero\ncastellon\nvte\nnordling\nhaysbert\nbasehart\ndorival\nearworm\ninzerillo\nwarbucks\nfarag\nhymer\nrunde\ngelatine\nprivity\nnavvies\nnevile\nhymnody\nbarrionuevo\nsportster\nskillman\nnicholasville\nfarrago\ncapetillo\niwamura\narrc\nrothberg\nrailroaded\nfraternité\nmerab\nglycans\nadsit\nudmurtia\nlaskey\npoldi\nteredo\nhickerson\nbankable\nwhitest\nfandi\nbromhead\ndessinée\nprescience\nfradkin\nvoltigeur\nroka\nbandula\npicc\nljunggren\nsanyasi\nhoogeveen\nfurillo\npuu\nparacas\nlongships\nkemps\nhamrick\nrocard\nforseeable\nschatzberg\npaling\nsnelgrove\ndarwinist\ntoyokuni\nretrain\nyawen\ngigliotti\nvolvulus\nsteilacoom\nwesthead\nchipettes\nnasopharyngeal\nrcl\nwillaumez\nkarev\nfaulconer\ntarawera\nbiophysicists\nhayer\ninsecticidal\nkershner\nangolans\ncasini\nmansor\ncosumnes\nimmunize\nramasar\nwakeham\nunasur\nchinandega\npothier\nextractors\nweatherproof\nvillamil\nnacio\nvithal\nsparke\nhongi\nthredbo\ndecicco\npoddar\nplebe\nvertus\njau\ndistinctness\nmaslen\nrinder\nkilrush\nthermes\npflag\nwhaleback\nextracorporeal\ncatrina\ncandlebox\nshinozaki\nwqed\nacord\ntuxford\nrolen\ncervenka\nbaldwinsville\nnyunt\nleukoplakia\nrober\nmaiolica\nsahand\ntoadies\nrayson\nderogatis\nhatsumi\nfranklins\nwoodchucks\nmangosteen\nkarmann\nluberon\ncircuiting\ntheriault\nmuwaffaq\nasri\notd\namarah\nrecognisably\nché\nnirupama\nmentalism\nemidio\nbupa\npasquali\nkheir\ngesar\nmarelli\ncristy\naie\ndecembr
ists\nrandeep\nkiick\nmarkievicz\nserotina\nprachanda\nyatala\nlangara\nbandler\nkallur\nunwinnable\nbitterne\nipg\ntestarossa\nfaina\nwijeyeratne\nibby\nrochet\nbesoin\nssgt\nacteon\nstearic\nnmfs\nioi\nwynkoop\nkashgari\nvendler\nkweller\npigmy\nlangholm\nuncompensated\nrockpile\nmantooth\nhailstones\nduga\nkantara\nholmstrom\nprebiotic\ncossette\navati\noberdorf\ncuanza\nvalaam\nruination\nfaggots\nbonnies\ntappin\ntullus\ndanto\nhirotaka\nhna\nlyndall\napplewood\nhrsa\nforno\nelchin\ncozza\nmwt\nleukemias\ncaprio\nolor\ngroesbeck\nivories\ndestructions\nwriggle\nvueling\nabels\npavesi\nophélie\nstockham\nhotheaded\nbeo\nwolfsohn\nswingman\nmatua\nolmecs\nhct\nmahle\ncockroft\nsiniora\nmalham\nadjuntas\nopenwork\nphotic\ncongener\nvitalia\nbrymbo\nmasiello\nviveca\nunwto\ndrafthouse\nmanhandled\nhalfords\nbhf\nmicroelectromechanical\neked\nnylund\nrewired\nsittwe\nfeminisms\neconoline\npiemontese\nalternaria\nbilt\ncaptchas\nalhassan\nwdiv\ngurpreet\nsandby\nmalir\nsquander\nmalema\nlowey\npileup\ncerros\nempresarial\nrinker\nblox\nfreeling\ntoal\napthorp\nsangyo\nkbb\nsamangan\nniyazi\nbarnaba\ndsf\ntml\nyehezkel\nkeffer\nfailla\nteka\nguzheng\nfugs\nikin\nreang\nloxodonta\nabcc\nlaxness\nhelmy\ndalat\nntini\nkhashoggi\nmodder\nkemeny\nolaya\nwiller\nmeers\nzajac\nmajer\ntabackin\nridgeville\ncordilleras\nyulee\nmotoyama\nphibes\nyaffa\nvarnas\nhighball\nbirchard\njeunet\nhydrocortisone\nanthologised\nkronen\nkoranic\nmuis\nmuggers\nslimline\nplatov\npancrazio\ncigale\neconomou\npharmaceutics\nwilcoxon\nlocle\nrevokes\numali\ndekkers\nhawar\ndcn\ngenia\nenrage\nshayna\nraschke\nmarienbad\nsukabumi\nbadillo\nmisner\nashtrays\nidrc\ntraumatology\nchacun\nkurseong\nluddington\nciw\nkusuma\nskuse\ntrn\nhumourist\nwoodlouse\nhumoured\nwalgreen\npersano\nshiney\ngatecrasher\nliberalize\nzeljko\nsupercapacitors\norzabal\nthackery\nobstinately\nplastique\nsherrin\nkericho\nmaglia\nbrainwaves\ngoyeneche\nsanfilippo\nchumley\ngierek\naltima\nlairs\nbardiya\nwolfert\nsegún
\nanees\nllanthony\ncoxsone\ngustavson\ncaulking\nberube\nhollick\nzaa\npierfrancesco\nordinators\natka\nluncheons\nlihue\ntrevose\nretractor\nbonnes\nkonerko\nsheindlin\nwrather\nmisstated\nlrg\narismendi\nrighi\neatonville\nalos\nkári\nholmer\nautoplay\ncyberwarfare\nlestrange\neinsteins\nscollay\ndigvijay\nporten\nhajdu\nudal\nbassler\njoffé\nchmerkovskiy\npiacentini\nencamp\njolanda\njabuka\nmfn\npoble\nconsciousnesses\njubb\nambra\nlovelight\nchimurenga\ncheckable\ncalvisano\nchicagoan\ndimartino\nmonody\ncarpel\ntshombe\nyudin\nchapdelaine\ncrannog\ncarbonara\ndefinate\neor\ntercero\nnewsam\nrückert\nmaekawa\nshahan\nalsa\ndismembering\nteratoma\nfalso\nreta\nlaveau\nbgr\nbermudians\nmiasta\nlums\nmory\nliebowitz\nghandi\nshoto\nplanas\ncontestable\ngrignon\nwarmers\nhaniyeh\nfuchu\nabnegation\nquileute\nschnapps\nbrummies\nkizito\nwacom\nnilesh\njesson\nnortheastwards\npermeation\nmoch\nworland\nchree\nstrub\nonc\nherzig\ngeron\nmezger\nguajardo\nwaki\nexplicated\ncrois\ncaggiano\norganismal\nhomann\nscroller\nperino\nkamuzu\nstefanelli\nsterilizing\njunejo\nmaltreated\nchalcedony\nfarooqi\nmuddying\npersecutor\nilah\nantiphonal\ntutorship\neer\npyrex\nmercader\nquinnell\ndorsi\njunfeng\ntutzing\nefcc\noscott\nwellard\nbadshahi\nnewsy\nkharian\nrelativist\nvanguards\nkenworth\nburkhalter\ncratering\nzelma\nfozzy\ndonaghadee\nburki\nvestergaard\nbuxus\nkerzner\nkawamata\nmarylanders\nnondiscrimination\nkingsgate\nmusiker\npatridge\nstockach\nfallibility\nturchynov\nwynford\nbreukelen\npoppel\npargo\nshamen\nsofiane\nbitz\ntelefonica\nkarakalpakstan\nslon\nzinman\nredbud\naspis\nriper\npaleface\ncanterville\nsauro\nellenville\ngönül\nanglong\nueli\nvered\navance\nwasley\nakhund\nkouchner\ngiocondo\nmineta\ngagnier\nredressed\ncisse\nsoysa\nhalper\nreinstituted\nmurre\ndonax\nhaemek\nsilberstein\nsaucepan\nserah\nauray\ncaviezel\nmantling\neckington\nrecommissioning\npilz\nseney\ndecelerated\ninfonet\nmeegeren\nlienhard\ndiphenhydramine\ndeindustrialization\nchi
ldishly\nphylicia\ndagoberto\npondexter\nchucking\njacobitism\ngerardi\nanonymized\ncollingham\npatassé\nsterrett\nghan\nsandboxing\nkasyanov\nsidestepping\nbuse\ndefensiveness\nmorgannwg\ndonaghmore\narmah\nmantuan\nfyre\nfennica\njijel\nbivens\nregmi\ngeomancy\ngii\nferneyhough\nkrimsky\nshive\nmaddon\nwireframe\njayshree\nchoirboy\nvescovo\nlikelier\nrappe\njns\ntsuboi\ndeplores\ndmitrov\nabq\npranav\nboulware\nglassell\nfaden\ncourchevel\nunnoticeable\nminnesotan\nzaydi\neiler\nredirections\ngraziers\nbaklava\nlytvyn\nusaa\ndemocratizing\ngastrin\nmendonca\ncertifier\nbason\npicketers\ngilo\ndeslauriers\ncharg\nhaynesville\nsignifiers\nprogestogen\nwagler\nchamoli\ncafferty\nbiphasic\ntellings\nkamsky\ncahen\nautocross\nufrj\nwynand\nhakeim\ngrawemeyer\nannonay\nnpe\nhesjedal\nsops\nbonder\nodintsovo\ncosey\nveta\nhitec\nfengtai\nunmasks\ngomery\ntatung\npakhtuns\nbullwhip\njabberwock\nzuffa\ngravid\nlitmanen\njatun\ndhobi\nlexy\nbaty\nuht\nosim\ntabling\neyemouth\nbirzeit\nglendening\ninsull\ngogan\nfunkhouser\npsap\nsnooki\nllangattock\ntrull\nitsukushima\nfantino\ncleavages\njustitia\ndeciders\nexactions\naleksandrovna\ncapen\nberimbau\netang\nneccessarily\nfujisaki\nboekel\njackett\nbarthelmess\nleight\nakshardham\nhoenig\nmonetarist\ngotemba\ncanne\npernis\nneba\nfancourt\nhumanizing\ndispleasing\nformentera\nclearcutting\ndarrington\ngagner\nkiraly\nrickett\namiodarone\njerked\nunearths\nfabinho\nkapler\nupg\nmogao\ndulcinea\ndamaso\njefferis\nspambots\ncrighton\nsacajawea\ncygne\ngreywater\nendows\ntravelodge\npanchromatic\ntroiano\niztapalapa\ntareen\nzivkovic\nredlining\ndarner\nadjutants\nhackley\nmallock\nteves\nreichman\ntuyen\nwiesen\nmeditates\nappraising\ntransfixed\nkaahumanu\nmirrlees\nnust\nsuspiria\ncathey\ndodgeville\naccor\ntedi\ndeerhunter\ndimitra\nsne\nmabou\ngurinder\nwando\ndonadoni\nmeall\nbriel\nskellig\nsadu\nautarky\ncianci\ndulag\ndarcey\ndicamillo\nloafers\njumma\ndemoss\nboothferry\nvirna\nbardsey\nhardrock\npowertrains\nmery\nbi
nswanger\ntamati\nfinau\nosram\nbeading\nschäuble\ntenaga\nconfessionals\nphyllida\nscumbag\nbernauer\nskyliner\noguz\ncoxswains\nunicyclist\nhodgdon\nbrannen\nkaifu\nallott\nfauji\nelmi\nahsoka\nbowens\nxylitol\nobtusely\ndrunkards\nhenreid\npolicyholder\nflucht\ncotham\ndemorest\noutshone\npassivation\nwesthill\nthermohaline\ntharanga\nreichard\nredfin\nlanterne\nphysicochemical\nleontief\nabduh\nwaterstone\nloden\nroisin\nwaianae\nluts\nmunchen\nautauga\nlujo\nduvet\nhontiveros\nvalorem\nneft\nyop\nfixers\ngisors\nfaliro\nresister\nbeany\nskol\ntplf\nroedelius\nhrothgar\ncathays\nilles\nseperation\nzbornik\nsalahi\nkolob\nbuet\nchema\ndalmuir\nretinas\nreassurances\nhargraves\nfemicide\nbew\nkurbanov\npiazzi\nmuffed\nhartline\nmcgahan\nbergan\nnakada\nparfums\namrut\nmaurel\ncallier\nbaldessari\nseductively\nlaurencin\nsupertanker\nwpbsa\nnole\nproprioceptive\nmcilrath\nlockman\nsquealing\ntoseland\ntraxler\nnart\nbelarussian\ncrimewatch\ndheeraj\ngeorgics\nsculler\nlafond\nrockie\nkostis\nwholes\natapattu\nhintz\nelphick\nlobanov\nmerlion\ngullane\nplasters\nprincetown\nhartl\nphotorealism\nlcas\ntadamasa\nnajah\nneighborly\nlivelier\nonderdonk\nravers\ninternacionales\nperrow\nvaginalis\nsabbaths\ncomparision\nesti\nkrulak\nkec\nkasese\nsymbionese\naereo\nstavridis\ntasmanians\ndatsyuk\naretino\nspittle\nmonocyte\njuanma\nostler\nroselyn\ntagliamento\nnachtigall\ncontrollata\nsafet\nmoister\nbrasileiras\nsizzler\nkanemaru\nmodu\nlianhua\narnould\nadri\noccasioning\ngimble\nhawkweed\nhiebert\nesaias\ndows\nrightists\nanalysands\nditties\nkabanov\ncarano\nrelicensed\nridnour\nslaine\nliposomes\ngstreamer\nmidship\nflic\ncrue\nvibrated\ntuong\ncleartype\nabrera\nbimah\nholdin\nbozon\nashiq\nlavers\nvaro\ncastilho\neasterling\nvaleska\nbrickfields\ntugwell\nstampedes\npierpoint\nshorin\ntherfore\nsuperchunk\ntowles\nloz\ndvir\ndildos\nkovtun\nschl\nlunokhod\nbonser\ntrenchcoat\nkombucha\nhandyside\npisanello\ngreenjackets\nlalibela\ndyas\nvigée\nembodiments\ncubase
\ngaeltachta\nmountainbike\nmarcon\nmagnesian\nbeton\nmimmi\npovilas\ntragus\ngots\npartygoers\nbles\ntdl\nlauncelot\nvanquishing\nguare\ngustloff\nwinkfield\nglubb\nflagpoles\ndownhole\ngiralda\ndumpy\ngrobbelaar\nivars\neshel\nyarde\nclorox\nglasshouses\nrtx\nravello\nwhall\nheadstart\nyelchin\nsimoncelli\nkazuhisa\nmeseta\nczeslaw\nlemkin\nsitton\nfarmersville\nsculley\nictu\nspartina\nfawns\nballers\ntinkerer\nfibulae\nuhde\ncommentors\naart\nfrison\nmalade\nzawada\ndwb\nsair\nalisher\nmcgillicuddy\ndefeo\nspargo\ntraf\nziemba\ndnt\ngandi\nango\nmemorializes\ndoppelgangers\nschoonover\nvivat\nmckane\nvotto\nvalory\nmapungubwe\njuanfran\nhypospadias\npurfleet\nkhidr\nphillpotts\nphosphorescence\namuso\nsexualization\nhanumantha\nsusurluk\ningrams\nfilipp\nkoll\nwalterboro\nfahl\nlusa\nnasiri\ndesouza\ncourtesies\nkoans\ngretta\nsmoak\npão\nwhitebeam\nbergé\nmaintenant\nsoulmates\norbitofrontal\nkuqi\nmassenburg\nlmt\nkundara\nraoult\nbesotted\nkahanamoku\nhame\nisda\ntrouw\nsellin\nthorleif\nradicalised\npraemium\nmariucci\njogye\ndiluent\nvervet\nwelten\nkapranos\nosmunda\nhro\nkangas\nkvam\nbackflow\nsln\ntomiko\nmolsheim\nmeyler\nhicklin\nossietzky\nhemanta\nbyelorussia\nladrón\nballagh\nnonvenomous\nbopha\nbécaud\npierzynski\nlectins\nbungei\nrockledge\nuntethered\nportchester\nflatow\nelectrocuting\nrager\nzech\nshirl\nmahuta\nreplications\nijf\njerold\nchafed\ndussault\nbarguna\nmincing\nwargrave\ngloats\ntío\npoteau\nmacek\nwayles\npetsmart\nadipocytes\nlocators\nkamogawa\nyolen\ntaare\nstenzel\nswindlers\nparm\nlandeck\nncep\nmyelodysplastic\nscrotal\ncowlings\nmomoi\nbogoria\ndenes\nmanoukian\nspackman\ngigged\nenemas\nrapinoe\nkeloid\nballyduff\npiane\nraghuram\nrewritable\nnaimark\ninglorious\nwedgewood\nstrass\nekland\nattac\nples\ntrances\nflexes\ntemplin\nbredbury\nmicrostructures\nbauwens\nribblesdale\nqingming\nencyclopedie\nsahakian\nnerone\nkissy\nsubheads\ninfarct\ninterlopers\nhdms\npriok\nzinnia\njunin\nembarassment\ninfographics\nsamat\nsupe
rsaturated\narabsat\njrm\nelaphus\nitawamba\nlimonene\nphellinus\nhach\nlookahead\ngbenga\nmargus\ntofino\nkidlington\npagsanjan\ngotlieb\ntabatabaei\nkatsav\nmarwah\nsundowner\nevey\nnavicular\nllorar\neggplants\nhartsdale\nalfreda\nbirtley\ntominaga\nmurk\nrefrigerating\nmiis\nvalu\nzappos\nkilworth\nhavin\nboswells\nbiphenyl\njagga\nashik\nifex\naccrues\nellan\nsobe\nconal\nbrayshaw\naldworth\ntallchief\nuvula\ncolchicine\nchiwetel\neliphalet\nkoide\nlofa\ntrawniki\nsoulchild\nvitrification\nbhatkal\njacobabad\nschau\nidsa\nsebesky\nconingham\nacústico\ndenslow\ncasanare\nnoncombatants\nschuschnigg\nemel\ntestino\nommen\nundersteer\nwhimper\nveh\nsoori\naquia\nmoushumi\nyanez\nhelpfull\nrecyclers\nbalafon\ncarner\nbarani\nmiikka\nmoren\nshider\nlapsley\nlemongrass\nithaka\nrmaf\nasca\ngarant\nrusko\nimpudent\nmuench\nsobeys\ngordonsville\nheermann\nluciferase\ncentrepoint\nhoneoye\ncyclassics\nsentances\nnecati\nupdown\nsmallholdings\nwickremasinghe\ndalmia\ncountertop\ndunam\nhyson\nbinned\nriego\naxolotl\nzhiyuan\nardizzone\ncantorial\ndonlan\nxieng\nhobgood\nrwenzori\nanshuman\nilg\namarapura\nganji\nelayne\naparecido\ngumma\nleod\nfilippino\ncemex\ngarnets\ntoddington\nconcubinage\nchambertin\nyakup\npersonaly\nfissionable\nmabus\nfantastico\nlucre\nhfe\nnagare\nsummative\ncitybeat\nloreley\nklepper\ntimbo\ngarters\nnieuws\nsignee\nnsfnet\nwanjiru\nbatta\nsankofa\ngebert\nmonetarism\nphonograms\nfrankowski\nshambala\nbrocklin\neddings\nausmus\nphocoena\nrecomend\nlovatt\nmarawi\nbertelli\nhablar\nregs\nnorthug\nperlas\nhellebore\ndeterminable\nanyhoo\nwragge\ncatnip\nxiaogang\nhepatocyte\nmomir\nshoa\nmunsch\ncontemplations\nkoori\nmajel\nvisages\ncaroling\ngalanter\npavlyuchenko\ncorydalis\ntakahara\nkhalfan\nashbridge\ndualities\nninemsn\nleeza\nmyanma\nsacroiliac\ndegroot\nghiyath\nislamiyya\nremora\ngiffin\nsocialismo\nfrosh\ncardross\nbrean\nfluoroscopy\nunvarnished\nseverest\ndescant\ncopiah\nwinterburn\narrighi\nfirehose\nclappers\nmutiara\ndespondency
\ngradus\nadeeb\nreorient\nblundered\ngorringe\ncantilena\ntopcliffe\nlyde\niffley\nlambasting\nskyfire\nsasakawa\nplunderers\npams\nrsg\nbraugher\ntoor\nguidolin\ncastlefield\ntrant\nchus\nchapterhouse\nfilesharing\nouarzazate\nmarisela\nnegredo\npored\nbadwater\nomkar\nsidewalls\nfres\nunladen\ngravimetric\nnph\nogo\ngohil\namsberg\nmistreat\nhuse\ncelebi\npfarrer\nhioki\nbrandname\nbeachcombers\nflook\nfryman\naloes\nussocom\ninkatha\nmkapa\nwassail\nautrey\ntopknot\nbabys\nchelsey\nwkrc\nfiorano\nparsees\novershadows\ncharlwood\nleguminous\nvally\nulman\nweakling\nbattier\nstateroom\ncille\nrajouri\ninuvialuit\ncaesura\nniskanen\nxdr\ncloisonné\ngofa\nhalewood\nmelchiorre\nkacem\nplaymaking\ntherrien\nbrazelton\nnausicaa\nsistemi\nadjunctive\nzeckendorf\noliwa\nlymm\ntikriti\nsimen\ninsistance\nkreps\nzubrin\nmonumentality\nmannish\nsalli\nlongsight\nharks\npagenaud\nweu\nconsequat\ntroth\nskr\nheartened\nbachi\nrastelli\nberrington\nnonessential\nnapoles\nsteyer\nstudholme\ndeerhoof\nouaddai\noxenham\nmazeroski\nforlani\nruelle\nsanday\ncrx\nsergo\ngenderless\npresnell\nsawfly\nparslow\nanonima\ncolmes\ntilikum\ngunnersbury\nduris\narcadio\ngitana\nmolyneaux\nwhiteland\nstenborg\nchelles\nshafted\nforebear\nduret\nsocieta\nsholl\nkrinsky\nbarocco\nholbach\nnoxubee\nhisaishi\ntaik\ngraphology\nhadiya\npvd\nslangy\nlegalist\nussa\nrbst\nattenuating\netiological\ngulabi\nforkbeard\nethologist\nweese\niberoamerican\nmikula\ndemerger\nrufo\ndeukmejian\nhaverty\nskyros\nunequaled\nhectoring\nsagredo\nchiffre\ntokay\nsoundz\nansaldobreda\nghosn\nmeuron\njamat\nmixteco\nrinderpest\npenumbral\nstenography\neron\ntamao\noettinger\nseacat\nshotley\nburritt\nlunalilo\ntgn\ntipple\nmidsized\ncrinoline\nwebcasting\nigls\naudiard\njonassen\nkarren\npreen\nárbol\ntrickled\nemceed\ntavola\ndaniyal\nagains\nhardbound\nmecano\ncolomiers\nellsbury\naminata\noast\nludwick\npratham\nandaz\nesherick\nolteanu\nknab\nbernthal\nbibliotheque\nsoeur\nladakhi\nklutz\nkölsch\nprepping\ngeka
s\nhennes\nlachie\norica\njointing\ntransmuted\nsarvis\nadelstein\nmotiv\nviia\nchanchal\ndongxiang\nlalith\nmehtab\nhellhole\ngadkari\ncjtf\nvianello\ngda\ncoment\npenwortham\nconciliate\ngokwe\nwingspans\nastrea\npeirsol\nlosh\nborrero\ngwenllian\nsisodia\npanamanians\narmourer\nyella\nlianyungang\nriggi\njobseeker\nthermoplastics\nscenographer\nmaysa\nzeebo\nadom\nadshead\nvadi\nglossolalia\nmishka\nhergest\ntamino\nvanian\nseagrove\npiquant\npyjama\npiguet\nboathouses\ntripfilms\nmontara\ngreenbriar\nmartinville\nsatterlee\nchasms\nsiento\ndeq\nsemedo\nschaper\nhoworth\nlitke\ntrahan\nsmashers\nghsa\nveltroni\nsalines\ngiambologna\nhomed\njlp\nhallandale\nnoreaga\nagnel\nlocura\ncoexists\nrunar\napparat\nquerida\nunderwhelmed\narounds\nstuf\ncivilize\ncgv\nhardaker\nschalkwyk\nscarpetta\nmuscovites\nyunque\nnucleases\nqeii\nleanna\nmidd\nhedera\nwaner\ndippy\nfreddi\nprostrated\nruairí\necclesall\nlobito\nderik\nlarache\nincriminated\nkfm\norna\nvaradero\nwinzip\nintermixing\nglobalised\nfrier\nvermonters\nfarel\nspindrift\ndiscretely\nkranjska\nshuckburgh\nhillah\ninscriptional\ntweens\nwoc\nfages\nmiyoko\nhemagglutinin\nfarzana\nprolate\nazizabad\npava\nklif\nprancer\nmxn\nhomayoun\ngoldacre\nrentz\nyanis\nnightgown\nvns\nbalcer\nepidermolysis\napiary\nrehmat\ngover\nsumptuary\nbonfils\nhsb\ntimmermans\nlemgo\ntighthead\navary\nginnie\ntulyaganova\nmaruja\nnorthanger\nvacuuming\nciviltà\ngrd\nmcmahons\nbfb\nsuha\ngesundheit\nwurzburg\ngraner\ngoris\npictoris\nbá\npinnick\nchianese\nnumeration\nbrockington\nlocalizer\nlobbing\nprivatise\ndedes\nsabratha\nalexandro\nvaporizes\ntakasugi\nthumbing\nrizo\nskiathos\nokuma\nmcbrain\nmagoon\nbhuyan\nrotifers\nnorquay\noutperforms\ndemaine\nkmbc\nholyroodhouse\nccis\nsilvey\nhageman\njoyously\nknowledges\nlinac\nshijie\nshuar\nyaffe\ndogsled\nhematological\nhypnotizing\nindistinctly\ntarkanian\ndbb\nrewinding\nvenatici\nmarji\nwinnick\nkokan\natangana\nzidan\nbredow\ntaavi\nsomersaults\nsleepwalkers\nbertman\nloesch\no
wada\ncalpurnia\nmcguiness\nbristling\nazami\nehow\nsequim\nlonggang\ntadanobu\nidowu\nlaisse\ndescender\nthaçi\nparga\ncadle\nalojz\nfreewheelin\ncontinously\ngavyn\nboykins\ntritton\ncowans\nlesbia\naquilani\ntorda\nsauternes\nlipsett\nphosphatidylcholine\ndunces\nchicos\nilyasova\nrutenberg\nbraude\nmajolica\nsavidge\nearthed\nhousefly\ndorit\nspecters\nabercynon\ndepresses\ndaddah\ngiovanardi\nbreyfogle\nequalisation\nkatusha\nkinnell\ndrzewiecki\nhalfmoon\nmanvers\nncte\ntonsillitis\naddenbrooke\neinsatzkommando\nsontarans\ntef\nsyndicale\ntobia\nalienware\ncazenove\nstoutly\nkerpen\nlafuente\nwanga\nbodiam\nfarson\ncopulating\nboquerón\nrighthand\nbrohm\nkookmin\npeifer\nnatzweiler\nvanderhoof\nsinfulness\nmanjhi\ntabar\nrafic\nluces\nchagnon\nconcretes\nscioscia\nrhun\ndubay\nteratogenic\nartzi\nimpoundments\nselvan\nelbowed\nhandfuls\nintimations\ngarran\nraggio\npomerol\nsommeil\ndoubtfire\norvis\nmoabit\ncleckheaton\ncyberjaya\nbehesht\nblasé\nalbena\nitala\nsoqosoqo\nmalleability\nilt\nsuzman\nsomes\nscalded\nmenchaca\nenthoven\nenloe\nkinzer\ntrenching\nmorrish\npareil\nuncoated\nimmodest\noverstates\nhydrogel\nqbs\nquinby\nchinnery\ncubero\nstorybooks\npetchey\npernice\npostpaid\nballenger\neffet\nmisreported\nforaged\nchaar\nconfectioners\nhwange\nsportv\ncheapness\ninclusively\ntzotzil\ndeicing\nyusaku\nnummi\nsyphax\njarlath\naubameyang\ncotte\ntré\nyabu\nvocale\nchurcher\nbelisha\nchafer\nemmi\nwonton\nratoath\nkontos\ngauhar\nrevote\nlaith\nsobell\nzillah\navulsion\nbonnefoy\nlukis\nlcac\nayden\nenrica\ntranquilizers\nlegitimization\nshennong\ngellner\ndonya\npermeating\nzigong\ndowsett\nsimians\nmanouchehr\nazraq\nclinger\nradicans\nallright\nisotonic\nshiism\nthalmann\nfowell\nodc\nmeko\ncorkery\nkindt\nrubes\nhardscrabble\nthiomersal\npatas\ntoksvig\nbootheel\nmonocytogenes\ndymaxion\nrotta\ncournoyer\nellos\ntemazepam\nkempis\nmonozygotic\nfinnis\npolyesters\ncoppélia\naktau\ngrifters\nfuret\nsassen\ngnarly\nmandolinist\nimpacto\nhellier\nvicar
i\npeppery\ndalaman\ntellez\naigburth\ncurried\nindigenes\nlacon\nmightier\nkorzeniowski\ndirnt\nbracegirdle\nphilbert\nepona\njea\normesby\nforwarder\narmco\needen\nchoctawhatchee\nskolkovo\npirillo\nromijn\ncoxen\nkilham\nmaceda\nliudmila\nlionello\nboatbuilding\nmeyerowitz\nwisla\ncourtneidge\nbagehot\nropati\ncrocheted\nchlorella\nletang\nwitts\naguilas\narbenz\ndetractor\nkephart\nrense\ntaibbi\nvizcarra\nosher\nmichiyo\nboissier\nmakira\nneanderthalensis\nlochgelly\nrevolutionibus\nsaucier\neyak\nbloodsport\nbogaert\nwyomissing\nsudi\nboesel\nmadha\ndragonslayer\ntrama\ncytogenetic\npev\ngairloch\nwiik\nbucovina\ntaverne\ndeblois\ntechnika\ncrated\nfyvie\nmariangela\nscarth\nrouben\nteargas\nverite\nhaeundae\nseachange\njwt\nkajsa\nlunan\npappe\nined\nhumpy\nduggal\npetrella\npalicki\nwintle\nadde\ndeford\nsaluki\neusko\ncandoli\nhissy\ntelma\nraffy\ncostumer\njohnnies\npoultice\nenforcements\nactivites\njora\ngoshi\nstarfield\ntorrini\nresultantly\nksawery\nblease\nplumley\nzithers\nfriern\nsaiz\ndhimmis\ndescamps\ngoosefoot\nintegrase\nverbalize\nsimulacra\npeerce\nmameluke\ndessner\natlantium\nmusumeci\nobdurate\nhefer\nmodenese\nchidi\nzaghloul\nponza\ntonypandy\nanimists\nnikka\npullover\nraymonda\ngpg\ngilkes\nmainboard\nyahaya\ndegnan\nroycroft\nsigiriya\nouseley\nconstrictions\nchingiz\nignatyev\nnovation\nkadlec\ngreenfields\nbrutalized\nbaltes\nalazraqui\nroyales\ndode\narfon\nmayim\nginko\nthrupp\nflywheels\nwaistcoats\nhartke\ntuebingen\npotto\nkuruppu\nsurender\nwsfa\nvpns\nwst\nlpt\nreinke\ndalloway\npreoperative\nsubnormal\nmuftis\ngelded\nboxley\nmirela\ncanadiana\nmitnick\nlandmarked\nshibukawa\nsabertooth\nindisposed\nnobler\nmolko\nwinterson\nkinison\npeyre\nkindia\nwindpipe\ndeduplication\nhymes\nanalagous\nnase\nbrownsburg\nsharpeville\nwhiteville\nberms\nifes\nhilts\nvetlanda\nharlaxton\nlissy\ncorneas\ndonnersmarck\nhadise\nscattergood\nfeelies\nnigg\nlieberstein\nafyon\ncarbonized\npellagra\npantalone\ntulowitzki\nciii\nalledged\njoist\
nyaseen\nchewton\nbvr\nsiddle\nfairborn\nmonobloc\nchakravorty\nmultisensory\nrelight\ndjed\nboncompagni\nbbi\nthataway\nsumiko\nsalaman\ncabiria\nbenquerença\nmanske\nwoolcock\ndarger\nindemnities\nenumclaw\nfootrace\numra\ndisclaiming\ngvwr\njapp\ngraphed\npattana\nbackbenches\nharthill\ncurro\nfelicitation\nlunes\nucmj\nsterett\ngirod\npétursson\ncrucibles\nsarner\nleveller\nturtleneck\nbeveland\npatricide\ntimerman\npraseodymium\nreeb\nconard\nyuichiro\ntablo\nvyse\ntiefenbach\nwarsame\nmcteer\nideograms\ngerasimos\ncollegially\ndecamped\nflansburgh\nviviano\nsumeet\nseligmann\nfeargal\nchaque\nvelikaya\nquivers\ncancion\nhatzis\nstonecutter\nwetering\nrajar\nterrestrially\ndeadball\npinas\nkaymer\ncrinum\nbittan\nrescaled\nproletarians\nshirtwaist\nglyndwr\npolyphase\nnanowire\nhagadol\ntemas\ntransmute\naestheticism\nskream\nbarbar\ndne\nmartone\nmacke\njep\nderrière\ngunnlaugsson\namiya\nsobrante\nhauz\nsledd\ngoldwin\npreempting\ndsps\nburrowers\nilarion\narrs\nwendling\npaddon\nbobek\nasmahan\nofthe\nifac\ndefrancesco\ntindersticks\nchicanery\nperoni\nderfel\nkvist\nwhelchel\nfarjeon\nquitely\npruden\nhisense\nwasdale\njozsef\nmailers\nnoorani\nunderaged\nderbys\nveloz\nbarbas\nilminster\nrecchi\nduckie\nrelicense\ngarrel\nteapots\nlurked\niml\nunderemployment\ngentamicin\ncollioure\nhygienists\nkiveton\ntradewinds\nreplant\nurueta\nmscs\npresenta\nkarthala\ncedillo\ndeza\nsandri\nknottingley\nnairi\nmackall\nzesco\ngcp\nklasky\nnrotc\nlura\nmisbegotten\nnoris\nunrra\nbeaven\ndangdut\nasmus\nbhoj\nvag\nkuok\naist\nnappi\nmorina\narchabbey\ncomunal\npasanen\nramle\nmetastasized\nwaqa\nkomando\nchristodoulos\nblunderbuss\nchesterman\nbagher\nbloodstains\nrubato\ncallicarpa\ncopthorne\nsitra\nantagonizes\nexhibitionism\ncanella\ngoodier\nbvs\njaphet\nspean\nchristiansted\nmultiyear\ncharivari\nvigoda\nephialtes\nfotopoulos\nidiosyncrasy\ndaewon\nphagan\nribbit\nhollidaysburg\nouseph\ngenevois\nshyne\nmeritocratic\ntrever\nhenkin\nannelise\npreziosi\nhote\nereb
uni\ntassigny\nsalme\nvietnamization\ngrandiflorum\ntulln\nblairmore\ntabley\nmickael\nzeldin\nlakshmanan\ncondrey\nsoundbite\nvishniac\nstennett\nshimamura\ncioran\nmarabou\nsavour\npails\nbayona\nnazimuddin\npsychonauts\neubalaena\nendocrinologists\nborin\nacomb\nfeltscher\ncribbage\ntheobromine\npyper\nnatriuretic\nszalay\nperdida\ndonlon\nstorrow\nstollery\nbrierly\nstarsailor\ncalver\npsychosurgery\nborgman\nintermix\nmuhl\ndraghi\nbacklighting\nhersi\nbaffert\nhaws\nnumerus\nhosford\nsrilankan\ntresa\nmazz\nparesis\nhotze\nchettinad\nspinrad\nfumigatus\ngradi\nrajnandgaon\nfyfield\nlaatste\nnuenen\nmiata\nshillingford\nsandokan\nirion\nstordahl\nfinaly\nchio\neskenazi\nparkas\nmukalla\nregenstein\ncladdagh\nmicroelectronic\njuppé\nroadtrip\nhartvig\nhuerfano\nspews\navonmore\nrigler\nkevork\ncabrio\nintroverts\nsquiggle\nrootstocks\nsapsucker\nbergstein\nroyd\ninconsiderable\npierhead\naeb\ndeontological\nnoshiro\neisenmann\nlouris\nbateaux\nlawfulness\nloughgall\nlanzi\nkoppa\ncks\nskirball\nvaad\nhosaka\namando\nfrancks\nayscough\naltamonte\nparadorn\nwakeford\nhmr\nmuttley\npetch\nshantung\nsaltus\nepochal\nmausolea\nakpan\nhohl\nskyland\nnoiseless\nsolidaire\novington\nportisch\nmesosphere\npenser\nperal\nkokin\nwimbish\nhanly\nkomsomolets\napposite\nstaters\nwhitefly\ncutlet\nmainardi\ngahagan\npietistic\ndilek\nsteventon\nnerys\nchunqiu\ndieguito\nbodas\nscenarist\nklok\ncropredy\nconure\nglimcher\ndreamboat\nknowns\nzernike\npelting\nknobbed\npraz\nperilla\nblanches\nfcu\nhots\npneumonitis\norga\ntorpey\nbanwell\npadalecki\nhalyard\nmotorcity\napha\nnokes\naspar\nroyko\nimhof\nzegna\najka\ncandie\nmeyerhoff\nmollify\ncellino\nontarians\nkeltie\nmalmström\nhalis\ncubi\npatryk\nryunosuke\npaynes\nfolan\nreachability\nbiennium\njoseba\nstolonifera\nelsworth\ndymond\npuentes\ncharrette\numair\ntwm\npirating\nsufian\ntramore\nstefanov\nnuttin\nknoop\ngitane\nhendersons\nbuendía\ntabaré\nbeaufoy\ninactivates\nonalaska\ngwasg\nroydon\nbeuerlein\nincentivize\na
maker\nmenina\ncholi\nmasia\nredmon\ngoldson\nhortensius\ndemoralization\nushaw\nbessey\nkangal\ncoeval\ndvla\nusurps\nlafave\nwajahat\nkasthuri\ningatestone\nbrocka\nanlaby\nbravia\nboonsboro\npybus\nrubashkin\nlitigations\ncorneum\nfacta\npetukhov\nsoled\nlambertus\nelian\nneemo\nsachse\ndupas\nafropop\noseberg\nbagasse\nschwarzbaum\nguanghua\nplf\nhadza\nstank\ntownhead\nthre\nkempson\nvenditti\nrosenau\nsendero\nsanghvi\navtovaz\nrutt\nmaximums\nyizhi\nmurcer\nzangpo\nbarish\nmartie\nyob\nreadington\ninefficiently\nchesbro\nrowton\ndì\nsatter\nyavneh\nmerpati\naureole\ndongo\nglitzy\nhorsted\nuppity\nsemiclassical\npontormo\ntti\nbudak\nalvechurch\nmukachevo\ndourdan\nsisimiut\nquichua\nskeptically\ngonzi\nautoblog\ntaseer\nlongship\ndehydrogenation\nlaisenia\npupi\nbenenden\ngunvor\nmelanocytic\napgar\nsoonest\ngargiulo\nhansteen\nvoghera\nrogowski\nkissi\ncibolo\ngadson\nreichsbank\ngoop\nmacheath\nhabil\nholzapfel\nthula\negas\ntfas\nkällström\nnorthwick\nphb\ncuddling\nwengen\nasaad\nelkan\naju\nknowledgebase\nfreja\nmarzuki\nrocka\nfutureheads\nzhanna\nhardisty\nslobodna\npoix\nmaybelline\ncasavant\nlavage\nsandiganbayan\nsimnel\nlbi\nblinker\ngavino\ncanó\ntrouper\ngroundout\nvincit\nmetopes\nhijrah\ndemarcating\nhoye\nrlif\nspangdahlem\nfratres\nbychkov\nboredoms\ngarut\nchangelings\ncomunidades\nallum\nmeadowhall\nsolenoids\nosaki\ngersh\nricht\ntolu\ntomio\nneophytes\nkittinger\ncallista\ndejuan\nrekindles\nnatuna\nmagath\ncrillon\nshaoguan\ntoffler\ndinko\nelitch\nmenuetto\nrijk\nkansei\nschladming\nglacially\nboscastle\nnoory\ndefrost\nlummi\nrosalee\naurignacian\npennefather\nyulianto\ngcf\nwitwer\nspath\ncanali\nbiswajit\ntarandus\nmentira\nmattock\nshehhi\nsimas\ncahit\nconferral\nanglade\nprêtre\nruwenzori\ngunge\ngongadze\nwilmott\ncanvassers\nacpo\ncalas\nduracell\nnkunda\ngono\npatting\nlarbert\nguram\nscarcer\noradell\nallotting\nportside\ntribology\nhaemorrhagic\nselenite\nnaxal\nmarimbas\nhayom\nkaftan\ncarra\nsyrus\nlivsey\nkristijan\ncamph
ill\nmoringa\ntibco\nmagicjack\ncowslip\nstoch\nkovalyov\ncostars\ncruck\nstaved\nwadhurst\npkwy\ntelesur\nchatelain\nihe\niasb\ncalendula\ncherryville\nendres\nbelem\nleola\ncheep\nlesli\nmiryang\nrabie\nimpute\nworkingman\nignat\nzilli\nmettmann\nkuldip\ncurto\nscicluna\nyot\ndehaene\nshumate\npaver\narbres\nbongiorno\ncrosslink\nwbu\nquartiers\nchikungunya\nbackhaus\ngreeter\ncimb\nmatlack\npeni\nfcn\nairless\nwardi\nklobuchar\nbombardiers\ndacorum\ntibbett\nwilga\ndestructs\nftu\npegmatites\ntiziana\nsnively\nintensional\ncradling\nwerchter\nrodr\nstipes\nbandarban\nwerneth\nnudists\nmatchroom\nhollmann\nheiskell\ntorridon\nkatzenjammer\nquraishi\ntosti\nhleb\nfiamma\ntykwer\nlaliberte\nsater\nabaris\ntickner\nalethea\nshirokov\nabrahamsen\nleuluai\ncananea\nbiopolymers\nnoscript\nparterres\nardglass\nklcc\nnejm\nmassalia\nrenick\nallom\nassynt\nmadams\nvicary\neppler\nwinson\nrahmatullah\nshapps\nbobov\nhardtops\nproofreaders\nblots\nbaubles\ncowtown\nnuthall\nlaufenburg\nnaseeb\nhoquiam\nvuillaume\ntshering\navaz\nlivability\nspiralled\nperfidy\nplacoderm\nshirov\nespectador\ntrishul\nemrick\ndecompress\nspringy\nbiryukov\nmikako\ndaldry\ncauley\nnourishes\nnasta\npichel\nsamvel\nskatalites\nlatourette\ntrouville\nobel\nlogy\nciotat\nspilotro\ntexted\nscads\ncologna\npriscila\nproscriptions\nesx\nterrero\negl\ntabib\nrishton\ndema\nsojo\nfilion\nnaviglio\ndoriot\naggravates\nzundel\ntwitchy\njuco\nvata\ncafaro\nkimberlin\nsopranino\noverlie\nnaama\nyanai\npurée\nporteus\nmuzzles\nenterovirus\ncasto\njochi\nghostwriters\nneedlepoint\nortolan\nharrying\nmilborne\nwailer\nidiomatically\nchristou\nmmos\njouan\ndehart\nbalah\nappletree\ndarom\nhaldar\nnewville\nniem\ndews\nliberalised\nagnete\nhamnett\nguardrails\nmachale\nhoxne\nornithopter\nurp\ntallit\ngetrag\nstoffel\nexhort\nstilling\nupsetters\nbastet\nschwerner\noverberg\nagip\namoretti\nrussified\ninconclusively\nyearlings\nspelthorne\nguptas\nhcr\nrach\nreveries\nadulterer\ncappellini\nsyllogistic\nregiona
lization\nsynergistically\nazan\nsœur\nsouthcentral\nsmeal\nkhursheed\nskanska\ncherkess\nhodgetts\npilloried\nmultimode\nlittrell\noutpacing\nthies\nvulcanized\naihara\ncarmouche\ntricare\nwicken\nroosevelts\nclyburn\naerea\nringspot\nkhim\ndefore\nimbibe\nmicrocar\nceram\nmoneyed\nbalbriggan\nngv\ncaffarelli\navie\nanfal\ntuggle\npaxos\nunderwrote\nvouching\nteoh\njhoom\nhuay\nklayton\nkijima\nxss\nnabo\nmanoogian\ncontrada\nwads\ntrafic\nporing\nbarreda\nvatanen\nhittin\npullum\nleconfield\ninseminated\nmorgado\nlongy\npushover\nwigger\nitô\nenkidu\nbreathalyzer\nheppenheim\nhgs\ndiethylamide\notford\nbigwig\nbagpiper\ndambusters\nakula\npoulos\nlourie\nmountview\nsubbuteo\nbruckman\nvilly\ncogwheel\nabutted\nokuno\ntaghi\nkmg\ndisinfected\ndemirci\ntolerably\ngauntlett\namauri\ntricolored\nsparro\ngoofed\nrihm\nwheater\nregalo\nskybox\nfrothingham\nharshman\nfellaini\nburmester\ncharle\nmotoren\niftekhar\nenglishtown\nappurtenances\nveenendaal\nignalina\nseijun\nlissouba\nkotel\ncvv\nbingaman\nmassee\nvenray\nwroxham\ndrayson\npresumable\nvelveteen\nalexakis\nvaziri\nsalzmann\nctesias\nramalingam\nbonatti\npolluter\nrisco\nanzu\nathenagoras\ncobe\nwojcik\ndentil\narale\nhydrophones\ncampaspe\nnahas\nbowra\nshrieve\nhilgard\nmappin\nrjh\nmephedrone\npolli\nmawlana\nstamkos\neigil\nmgk\nfitzhardinge\nwerkbund\nfortnum\nspellers\nautodrome\nyit\noversupply\nshusterman\ncheckboxes\ntuz\npwo\nkosal\nrogallo\nvdw\ntantallon\nreciprocation\nzarek\nwaz\nmfb\nmchattie\nwoolner\nsimp\ncourtrai\nwhang\nswifter\nwiegmann\nberck\nmckees\nengelman\nflettner\nclaypole\nlanphier\npelfrey\nbreakingnews\nnuland\nmakmur\nthoroton\nserc\nderya\nuncharacterized\nmoher\nrelatedly\nbeaupre\nellender\nmisdeed\nmcbreen\nhanoverians\ncastaing\nagudas\ncawston\ngrater\nelectorally\nitaipu\nberkey\ndisgusts\nwakeup\ndennistoun\nkulish\nfaiza\nkastel\nfrierson\ndéjeuner\nbulking\nantiseptics\naberfeldy\nazamat\nmufasa\njedd\nthp\ntransfigured\nhumvees\nwsoc\nmcconkey\ngidi\njeopardizes\nkan
kkunen\nzhukova\ntarns\nabbs\nrumblings\nultrashort\ndonie\nsheema\nnqakula\nhardliner\nstadtmuseum\ngaëlle\nhanada\narps\nbroiled\nrehire\nfakih\nperegrin\nvaccinate\npiermont\nmrk\nunos\nchamakh\ntesfaye\nclanranald\nmccarrick\ninfesting\ntekakwitha\nblears\nzeuxis\narboriculture\ngweedore\ngergen\nmarkarian\nrecycler\ngoines\ntoxicants\nprincipessa\ntaklamakan\nlamberg\nturnor\nfll\ndrolet\nfoliate\nahron\nwaxworks\nkirkton\nshilluk\naerie\nkeikaku\nactionaid\nyuill\nhatano\nvessey\nkosha\ndirectionally\nsankuru\nmiettinen\ncornelissen\nvirani\nvehari\nlongchamps\nuncorroborated\nspellchecker\ncarnotaurus\nbrasch\nmuza\njagdpanzer\nreheated\nviby\ngresford\nbarques\nparool\neyebeam\nconnells\narmer\nrubbermaid\nzaida\ncual\nemoluments\nnalwa\ndissuading\ngeorgine\nhube\nmemórias\nassail\ningratitude\nguttmacher\nlevein\nbridles\npersonnes\nreflexion\norality\ngpe\ngaona\ntimezones\nhord\nsoaks\nmenage\npreparers\nsluggo\nsummered\nglockner\nillusionistic\nwuornos\ntramlines\nuen\nrinds\nclagett\nbannerjee\nexcommunications\nbfp\npolytrack\nwoodway\nbackstrom\narrowroot\nosca\nhemery\nmaslov\nnazionali\nhéroïque\nncca\nmeem\nparklife\nchurns\nrajyam\nfacially\ncrocco\ndouglasville\nkavner\nardiles\nwindowsill\ncorallo\ntanking\ntegmental\nussuriysk\naifa\nlurkers\ncharbel\notherside\nporges\nlelong\nduncalf\ncaraquet\nrangifer\nsamaniego\npsilocin\ncaselaw\nrendlesham\nmorali\nbeamline\nzumbi\nreadjust\npazuzu\naloysia\nsturmer\nseatac\nnistor\nseumas\nnyra\ngellhorn\nimada\ncaramelized\nofili\npersinger\nfluidly\nyata\ngratify\nbedia\ndjawadi\nelectromyography\nafrikan\nremitting\nfederales\nbuttonhole\nfuglsang\nhalbe\nyanji\nbradykinin\nsuozzi\nlallana\noolitic\njibrin\nkips\nleshan\nbenj\nmeis\nfussing\nfurie\nsnowblind\ndowdell\nmilana\nmassaquoi\nrobbinsdale\nterribles\npapazian\nswoopes\nunreactive\naono\nufi\nzhendong\nvögel\ncarrà\noakhill\npurslane\nhirani\nmourdock\njawhar\nprickett\nlimekiln\nsers\nbuthelezi\nguicciardini\nmugwort\nryk\ncupra\nklarsfel
d\nmarquees\narianne\ntsmc\nadmir\nwinklevoss\ndishevelled\ngrat\nmendacious\nhallin\nmondegreen\nkawkab\nemptor\nfady\nvipr\nbillick\ninfilling\nkrajewski\nplattner\nkalmus\npite\nwoodcarvers\ntanay\ndonatas\nnippert\nsquished\nlmd\nalatau\nnaogaon\nmickens\nvorderman\ncremin\nfoskett\npraxiteles\nblaby\npropter\nwoollard\nshiseido\naggrey\ncmn\nlude\npret\nskvortsov\nkprc\npantoliano\nsagunto\ntahini\ngorgons\nprisco\nagbonlahor\nbarritt\nterroristic\ndorien\nweingart\nportmeirion\nmeekly\ncreedon\ngcu\nthandie\nsmidt\ngalletti\nbellezza\nbundoran\nekholm\nouedraogo\ntetuan\nsoftimage\ncorms\nklansman\nprecendent\nfibro\nkorie\npercolating\nfarhadi\nlehnert\nreturners\nruderman\nsesac\nhimani\ntimofey\nauclair\ndaka\ngrossen\nkemer\nchkalov\nkirpal\nelsdon\nesquipulas\ncilliers\njanardan\nimbues\nunda\nsarker\nblanchot\nwhio\nocm\ntakasu\nmisconstrue\nlolcat\nbanns\ntemperton\nrasoul\npiggies\ncrandell\npapageno\ngullett\nfethard\ndaytrotter\nserializing\nalighted\nvaginas\ntiberium\nmatinees\nhermas\nvisualizer\nsirtis\nmonodrama\nmistitled\nlilias\nnoosphere\nbagg\nmonywa\nuncertified\ndepopulating\nhorsman\ntair\naledo\ngoldsby\nucav\nabueva\ndouglases\nkwanza\nforli\nkarpin\ngunny\nsimeoni\nnicea\ntitmice\ncraddick\ntaillon\ntonsil\nbeuno\nkamisese\napplicator\npravasi\nsarel\nnanteuil\ncastellet\nsilkin\ncataluña\ntesserae\ntakehiro\nnard\nhydrae\nafricano\nbabis\nwaterbird\naflac\nwynalda\npatitucci\nlabette\nlmr\narndale\nbotnets\nmirabile\nlykina\nmincemeat\nranthambore\ngati\ndkny\nbenetti\ndauntsey\nhydrogeology\ncalendaring\nkalmia\ncreager\nmerve\nnociceptive\ngrosch\nfetes\nsumana\naustal\nleath\nreined\nkiprusoff\ndarod\nlondoño\njeung\ntrec\nparchments\nmadu\njwa\nvinohrady\njoma\nebersberg\nontologically\nhermeneutical\nmuawiya\noric\nravensthorpe\nrill\nabol\ngalasso\ndqa\nmapmakers\nwncn\nduprat\nemslie\necml\nanaplastic\nrendang\nplowshares\nbheema\ndelmont\nyuman\nnastier\nforgings\nkpt\nnvr\nimpulsiveness\nyasi\neisenbach\naldon\neem\nrawer\nm
edievalism\nsexless\nstrolls\nmidships\nmirani\ngrizzle\nhelan\nformalin\nmeaford\ngwalia\ndevery\nliaqat\ntuberculous\ncossiga\notg\ncondones\nassar\nbero\natropurpurea\nrailroader\ncrin\nspinto\nclindamycin\nbeem\ndidyma\ndarc\nrelaxants\ntongzhou\nherbals\nstambaugh\nbrockenhurst\nalbanesi\nmashburn\nbuncrana\nchastisement\ndomergue\nportstewart\nchukwu\nsarmento\ntremens\ndennie\nrescaling\nsynopsys\nshamu\ncariou\nmishan\ncabbagetown\nciau\ngalkin\nmirus\nkyösti\narromanches\npréval\nreprap\nrajai\ncriminalisation\ncaloosahatchee\nhongyan\nweisinger\nchurchyards\njolted\npipi\ncrosslinked\ndostal\nbrf\nbeeler\nibadi\nbargoed\nhazlet\ngibsons\nkevyn\nblowouts\nlahori\nconecuh\nluckless\nsaggio\nvver\numlauts\nerects\nsutphin\ncafu\naironi\nstutsman\nhypertonic\nbriquette\ntonson\nmoola\nkirlian\nsahgal\ndalem\nlubov\nallchin\nlodes\nherlin\nsabater\nsteelwork\nmarbletown\nveining\nzellner\nsubdirectory\nronaldsay\nacir\nknowe\nexpatriation\nkvb\nreverser\nmaylands\nmercurys\nghanim\ntranscriber\nyablonski\nserme\nyoker\nfjeld\nipiranga\npeil\nmazatec\ncraney\ncropsey\nshirreff\nbranche\ngardo\njungmann\nhelwig\ncaq\nsubzone\nduri\nluisito\nlvs\nimpasto\npostbank\nmagliano\nmalahat\nrepatriating\ntholos\nbaoli\ninstills\ndosh\ntyrannosaurid\nnewfane\nbaik\nhuberty\nkarenni\nratiopharm\nhandl\nglaive\nakabane\namare\nebbed\nlambourne\ncorrer\naweigh\nrastrick\ngilled\nsolé\nyid\ncomedienne\ntrenet\nalembic\ngarzanti\nchaw\napv\nkamio\nanangu\ntuomo\nmadrileña\nhypoallergenic\ncristine\nklosters\nshimei\nconsecrations\ndilutions\ncordray\nsnooze\nzerlina\nlannon\ncoltishall\nindubitably\npmla\nkashiwazaki\nendotoxin\nbokan\nqassem\nyamano\ncaspersen\nlockinge\njadi\nsuttie\ndefinetely\nlynxes\ntonalities\nmetalloproteinase\ngosdin\nnums\nminichiello\npummeled\nsonera\ncoury\nhohne\ncauthen\nslmc\naecom\njakubowski\nhalitosis\nterraformed\njerusha\ntorchbearers\nunderachieving\nmarijo\nlorenzana\nbootmaker\napeejay\nmulcaire\nitw\nflashmob\npledger\nadalat\nkolam\na
nisimov\nwinterreise\ndenne\nmadill\nparalympians\nbrittleness\npasargadae\nzigzags\natlantico\nmedaled\nnaxi\nuvc\ncairngorm\nalaya\nleir\nléonide\nmarranos\njoep\nguarana\nclague\nkleiza\nharn\nmetabolizes\nsulieman\nschantz\nrealclimate\ntelewest\nneutralising\nlumberyard\nmaté\nsuskind\nschwalbach\nbalete\nseavey\nblat\ninsta\nnastos\nboya\ncoulda\nmelby\nscottdale\ndiebenkorn\ntotp\nsweetnam\npredispositions\ninverleith\nltt\nbioethicist\nhawi\ngiesler\nleggio\ngrbac\nberny\nalberoni\nstetter\nwhitesides\nwidmerpool\naïda\ncaddyshack\nlipset\nspruces\nquivira\naiglon\nhurriyet\nmuluzi\narntzen\njunket\ncommentate\npoeple\nblaize\nafgan\nprashad\nunravelled\nimmobilizing\npergamino\npolkadot\nsilsbee\nepling\nzombo\neducationalists\njockeying\nneonate\nhyping\nsugimura\nlazarsfeld\nmko\ngreenlanders\nshwedagon\naltun\ncorporative\nmatadi\nbackspin\nclubbers\ncanajoharie\neuratom\ndassel\nsolitudes\nransford\npugs\ndépôt\nhendley\nflappers\npethidine\nonsets\nhumic\nafx\ncornthwaite\nswordsmith\nsidorov\nkraushaar\nsisteron\nbaor\nexaminee\nprimakov\nimmobilised\nenraptured\nkonsthall\nmemetics\nycl\naars\nmishin\nkonow\nwrp\nkirkcudbrightshire\nluckhurst\nsextuple\nostracods\nfidelia\neckard\nscally\nkorsakoff\nlicit\nvictimless\npeeved\nperfidious\nidir\ncih\npinkus\nsputtered\nthermopolis\nfigueredo\nbiton\noly\nvova\nhorrigan\naduriz\nkuts\neffi\ncorine\nbakst\nmanderson\npreparative\npindad\nslingerland\nmonken\nowaisi\ntoji\nzainul\nescp\nsculpts\nhudud\nivette\nyma\nkhusro\nimpromptus\nhalfaya\nfarfán\nmafi\nhigby\nreichenhall\nintrest\nbuton\nbajada\nviñas\nneild\nanciennes\nmatkal\nopg\nperin\nsynchronizer\npembe\ncommunautaire\nseydoux\nmeader\nsinged\nwheler\ncaroni\nstogies\nmajorette\nthunk\nspork\ntonics\ndismutase\nsubmitters\ntageszeitung\nfloodlighting\nzich\nbarbaros\nevashevski\ncuyp\nsativus\nmorat\nterayama\ntigr\nosv\nsipri\nexperimentalism\nshuttering\nfelisa\nprothro\ndebbarma\nrait\nfritzl\nalport\noverlander\ncoade\nszegedi\neser\nlunged
\nimed\nkelmendi\nkahrizak\njahanara\nhedo\nreferenceable\nestán\nterblanche\nasat\nasociacion\nstaw\ntieng\nkarmin\nmanicure\nkushnir\nligeia\narlin\ncaspases\nsquinting\nlualua\nbienen\naudet\ndussehra\nbasar\niraheta\njutarnji\nparth\nmimoun\ncrewkerne\nresidente\nfrowning\npakpattan\ncamejo\nhunterston\nrne\nsarasin\ngevaert\nplacide\nbaitul\ndemott\nkuber\nwigand\nfiddly\ntrex\nrewiring\nelr\nacclimated\nklinikum\nundisguised\nbreastfed\nhypercholesterolemia\nmicros\nfrolich\nneilan\nalprazolam\nchiklis\nkuttan\ncarlgren\nbrinkerhoff\nreuleaux\nwbtv\ncanina\neosinophil\nellon\nfarahan\nhepatotoxicity\nosvald\njahra\nhoddinott\ndivulges\nscrawny\njotaro\nwaze\nkripalu\nasturiana\ncozzi\nextern\ndisbelievers\ngetzlaf\ntaihang\ngopnik\nimmerses\nweyrich\ntrillanes\nkuehl\nhambone\nyungay\npacto\nzoellick\njelling\nbhool\nvinho\nsaloni\nesdaile\nluxuriously\nlalah\nmcmorrow\nrinna\nfrolicking\nmawgan\nhorwath\nibert\nmedlar\nbucephalus\nplayhouses\ncrystallisation\nneturei\nembryologist\njuez\nranier\nmashimo\nwrubel\ngeof\nunkel\nherminio\ntemblor\nvalin\nhydrophone\ncabras\nlynmouth\nruoff\nepiskopi\nmunaf\nsabel\ntítulo\nrevaz\nbenchers\nvarallo\njusepe\nprosecutes\nlante\nkaleh\nlph\nalmudena\nwesties\nkichwa\nnazif\nweinert\nthermography\nquemoy\nhunsaker\nwangler\naloo\ncarpooling\npommes\nfarnley\nkrenn\nwidney\nclohessy\nhally\nunfeeling\nrangelands\nwutai\nbres\nfagioli\nbhakkar\ntommasini\npassionist\nchasetown\nneedleman\nsampla\nhydrosphere\nbrer\nmakinson\npfäffikon\ntobolowsky\nawsat\npcso\nodaiba\npuyang\ndecentralize\nconcreted\nhgm\nrituximab\ncounterbalancing\nevacuates\ncontrails\nsvedberg\nbannen\nbonomo\nliveable\nsadlier\nmaxence\nsensus\ncrisscross\ngrabe\nnewpaper\nmcilwaine\njanni\nreik\ndjakarta\nmcguckin\nkercheval\nrovs\nsingletons\nirlande\nparasitologist\nkotagiri\nhonington\nmihael\nmizner\nretta\nwows\nodalisque\nkhela\neuc\nhapi\nwelham\nhellgate\nperonists\niacs\nmetrosexual\nuddingston\nlimnonectes\njanny\ngettys\nfabulist\nlapwor
th\npishin\nyuzhny\nchuckwagon\nsyncs\nscones\nyusif\nquango\nyabloko\nrozario\nfurano\ngyration\nwheelies\npuncheon\nchaisson\nswindells\namadi\ncgp\nexpiratory\nscheyer\nshortsighted\ncowpox\nzohan\nentangle\nscamper\nmaplestory\nkrystian\nultrafilter\nklavierstücke\ngats\nsolaire\nstw\nrunkle\ncuerda\nelvir\nrangeela\nbreathtakingly\nwiddecombe\nlaux\ncondie\nislamiah\nspatiotemporal\nconjugating\nravitch\nbouchier\ntsewang\nicsc\nyous\nshoutcast\ngriffo\nsajama\nshiraki\nalizee\nprincipale\nwastebasket\nfreshfield\nreverdy\nsilvani\nmentis\nhydroquinone\njado\nbombshells\nnork\njalabert\nshoplifter\nemscher\nglowed\nedgell\nexomars\nclayworth\nvladimirov\ndoted\ngooder\nlimey\nkuriles\nsalihi\nazzurro\ninnkeepers\nstatments\ndonath\ntrotha\nmeral\nearwax\nstatt\ndorus\nthuram\nkarney\nlimbless\nisoniazid\ntropica\nforestal\nwellston\nnoha\nsullenberger\npugacheva\ncelebrezze\narpino\nrichet\nfluffed\nmatchbook\nbordin\njazmin\nsige\ntasik\nsaenger\ncorsicans\npenury\ncastronovo\ndatar\nlangworthy\nlanolin\nstolac\nacas\ndermody\nfondant\nbva\nalvim\nkert\nlutenists\nshimamoto\nausonius\nsmolny\nsuvero\ndeogarh\nmaillol\ndysmorphic\nnordwind\nsimko\nyohei\nkitayama\ncontraire\njsu\naught\ncantilevers\nvanderlei\nlogout\ndrd\npakal\nunobjectionable\niorek\nhawkey\nidli\nsundeep\nultramar\nadbrite\nabdellatif\nbuckeridge\njasim\nhaylie\nsinopoli\npersichetti\ndigbeth\nwallich\nghettoization\nhollandse\ngoitre\ndinaburg\npoliti\nidentifiably\nemmanouil\nthep\nbeyaz\nkusi\nquispe\nwesterplatte\nmontesquiou\nsackets\ncherishes\ntangling\nvalene\nfortman\nbasak\nveasey\nkrikorian\nlabarthe\nbrawny\nwaterspout\nceremonials\ndifesa\neilert\nrixon\nconcoctions\nclaudie\nmantia\nthyestes\ndouchebag\nnetivot\nsteelyard\ndamariscotta\nmarveled\nballona\nbeamforming\ngunmetal\nlabadze\nweitzel\nbossu\nacec\nrhapsodie\nhest\nnecessaries\nchanterelle\nspaceshiptwo\nindemnify\nchernow\nfrechette\nchampagnat\npigou\nowers\nrukavina\ngpf\ncoqui\nruto\nneef\nkaprow\nmirisch\ngroo\n
tona\npetrocelli\nfondren\nbookers\ncorinto\nmarvelman\nnarrowboat\ndalriada\nstovl\nsemetic\ngernika\nstipple\nmosaad\nbotan\nsassone\nacco\nslapdash\ntajo\nmandula\nacinetobacter\nmangrum\nisocyanate\npertamina\nteleconferencing\nkirmani\ncowsill\nmikee\nziolkowski\nhelikon\njazzmen\nsebum\naubers\nfosca\ncubbon\nhendre\nravenscraig\nschallert\nioe\npolishes\nmotorbooks\ncarbolic\nfabricates\nameritech\nhoopoes\nelisabeta\niriomote\nshayla\nbeautifying\nmagnan\ngalangal\nreasserting\nreassigning\nsiddhu\nalife\njarnail\nflom\nyilin\nringnes\nvicodin\ndynegy\ninvidious\nbrancato\nworksite\naccretionary\ngregoria\nenterobacter\nindustrias\nsolt\ngoldhaber\nfeminized\narguedas\nuncultured\nstringency\npublicaffairs\nfrogfish\ninterschool\nosmanthus\nfurrer\ncld\nmaximised\nhotlines\nerni\nalaknanda\nroethke\nxanax\ntelecommuting\nnauen\nchugoku\nberghaus\nwaterbuck\nreits\npoveda\nadcs\nroswitha\npountney\nnephrotic\nbaci\nspach\ntombo\nanlong\noehler\nstumbo\nvassell\ndisinvestment\ndarnay\ngorshin\nstagno\nspel\ncriminological\ngslv\nkazunari\ncleisthenes\nalano\nprovine\nkokhav\nazua\ntorey\nwingrove\nrivonia\nfondling\nkosor\npiggybacking\nflurries\nvigier\ngoulash\ntinkling\ncavallini\nladykillers\ncataldi\nstowaways\nlorenzini\nqalam\nsgb\npendarvis\nzugzwang\nxinxiang\nmanetti\nhermeto\nchimbote\ndulhania\nnowakowski\noutshine\nfiddled\nhanner\nmarsteller\nrhu\ngagaku\ncarborundum\ngoodtime\noecs\nmochida\nsberbank\nfev\nbarfoot\nmurcian\nrienzo\nladylike\ndafa\nslotback\nerice\nbaard\nbtb\nnobutaka\ncharrière\nwhan\ntrinley\nporcelains\ncullin\nwagar\nidel\nprohaska\nisakson\nsimplemente\ngern\nindelibly\nquartermasters\nafri\nmuskox\nnonius\nwrittle\nmignot\ngamson\ndebretts\ngrizz\nfranchet\nkiwa\nblackhole\ncorbitt\nmillibars\nharshaw\nloafer\nturnarounds\nfonz\nmhra\naquilegia\nstepchild\nsatcher\nrietz\naráoz\ntelethons\npersonam\nyushi\nallergenic\nintramurals\ncertitude\nmeryem\nkram\nmadres\nqalat\nkickball\nalthaus\nschizoaffective\neuthanised\nangko
rian\ntakayanagi\ninterrelation\nmouflon\nnuristani\nisbister\nencase\ntosco\nintimation\ndji\ncilley\nwebbs\naking\nadell\ndynast\nhowerton\nbarrasso\nacheived\ntamari\nreoccupy\ninvolvment\ncoonskin\nformule\ntansley\nicebound\nqueenside\nraffia\ncockfosters\ncyclooxygenase\naudenshaw\ninteractional\npilkey\nhistria\nkadare\ncatapulting\ndevilder\ntetrahydrocannabinol\nmidcontinent\nneris\nzakhar\npastorals\nshafqat\ntapti\nwheezy\ndeprivations\nartemus\ntomonaga\nerec\nbookmarklet\nfaheem\nbuckfastleigh\npaek\nabramo\nwonderbra\nposterolateral\ncawthra\nelderslie\nseamans\nribonucleotide\nyuxi\nmotivators\ndrian\ntaban\nisolators\nwarneke\nmileages\nbhavna\nlonghua\nmiccoli\nwilcots\ncirculo\nbickersteth\nshakil\nstylites\ncrafoord\nbgl\ngizmos\nabot\nreedbuck\nmelick\nbriles\npirogov\nbelva\nlovedale\nnegley\nmatewan\neichel\neyetoy\nwrongdoers\ndimitrij\nticaret\nmown\nsuppers\nbrisker\nwedgie\ngasps\ngilruth\nunpressurized\neita\nfortson\nodescalchi\ncdte\nreneging\ngolia\nshammar\nsweetser\nclaverhouse\nkameng\nkalinka\nsajida\ndoublespeak\nkoudelka\ncollocated\nstagehands\ndecavalcante\nschad\ntiye\nparatus\nfallingbostel\nkans\nmoretto\nmeche\nyotam\ndelli\nunsalted\nhaberman\nsharpener\nbarât\nabdulhamid\nmabrouk\ncryptogram\nmalvin\ntanimbar\nbatesian\nwernham\ndevyani\nacquis\nsemaphores\nconceptualism\nrostron\nblundering\ngeiser\nfreeloader\nraita\ntearooms\nkaikohe\nmaturely\nchowchilla\ncolada\nmenkes\npoissons\nverran\napoyo\npittoni\ngranato\nletellier\nwillowherb\ncoltness\ncharito\nsoco\nmakoni\nsaut\njollibee\nsousou\nweicker\nlitterateur\nwhippany\nozerov\nserrana\ndiversities\nadminstrative\nkeston\nstamatis\nspinocerebellar\nreisch\narenig\nbromwell\niole\npuhl\nitx\nmattern\nwesterland\nsouthpark\nmierlo\nkeskin\nvalcartier\ncolumbians\ndeben\nchacao\naliki\nkalymnos\npercentiles\ndoner\ndulany\nayoade\nvirgenes\nrevegetation\nvovk\nusfsa\narmé\nteign\nokk\ngruss\nmikki\nvasko\nsnak\nneonatology\nbelabor\ngalitzine\nmagas\ndeseronto\nquartos
\nkarrie\nblackstaff\nethz\nessayed\nchinquapin\nspinnerets\nasexuality\nepitaxial\nvolcanologist\nfurnival\ndransfield\nstapler\nsquirm\nhandprints\neucom\nsilcock\nalutiiq\ngherkin\namorsolo\nmerozoites\npavillion\nhillwood\nnafa\nxtr\nhingley\nstrapless\npeons\nzaks\nvarios\ntoupee\nlartigue\nlpn\nfarron\nfaur\nbulford\ncret\njoa\ncapricornus\npolsky\nlandscaper\nvaporizer\nrimrock\ncaucasia\nahmadzai\nstolpersteine\nlevitra\nfaulkes\nkims\nnegrín\ngup\nchristelle\nmidc\nbealach\nnetworker\naberaeron\nsangay\nkemptville\nabsentees\nfranciosa\nvelvets\ngoyen\ndesales\nzandi\nslugfest\nwignall\npyeongtaek\nspiritu\nintermingle\nrre\nieng\nmonzon\nwangari\nantwan\nmcanuff\nvillamor\nwelliver\nantiterrorism\ncenterfield\nical\necha\nvacillated\nkapok\ncerveza\ngloat\namadu\ngilardi\npenitentiaries\ngorseinon\nmeursault\nuberto\nalya\ntravés\nmiñoso\noffsides\nyonex\nwachau\nkowalewski\nbouwmeester\ntanf\ngyllene\nlamplugh\nmanford\nvitrified\ninspite\nchiodos\nhmx\nconchos\npoulidor\nturre\nsweezy\ndescanso\nraasay\nrossoneri\npasuruan\nsazonov\nspaceballs\nswilling\ncradled\ncorredor\nmarmi\njindabyne\nragnarsson\nsfk\nkermani\ntouba\ntallink\nbottomlands\nkinyarwanda\nbellus\nmmpi\nstopovers\ndesrochers\nsavana\nchunking\ncacciatore\nferidun\nblinkers\nittefaq\ntailfins\nfranking\nkokang\nsenderos\npointlessness\ngrondin\nlumea\nboîte\nsatam\ntottenville\ninkheart\nharvestman\nsorolla\nstoplight\nlisson\ncrail\nbuchtel\npantaloons\ntoffees\nnytorv\ntollcross\nbigley\nidrissa\nmarcil\nryuhei\nchadds\nrealizable\ndepositories\nquorums\ndembo\nhonorio\nskylon\nnkf\ncenotaphs\npaddlewheel\nvipul\nmazurkiewicz\nbarrowclough\nwaterwheels\nmehrotra\nfluorosis\ngianfrancesco\neytan\nleavey\nladi\nforsook\nmischaracterize\nqca\njio\nbudnick\nburkart\nalbe\ntragedian\nrondel\nkissling\nrefract\nmentalities\nperrineau\nkrld\nmonsen\niacobucci\nquadruplets\nbevo\necobank\nripp\nleke\nmolfetta\nventrolateral\nvinzenz\nbracero\nalpines\nmckevitt\nmonovalent\nparkhomenko\nembaras
sed\ncgh\nfuntime\niser\ntretiak\nbrechlin\nkeenum\ncaspary\nerogenous\nbeeville\ngenesius\nkielder\nobert\npfoa\nsnagging\narbel\npaulsboro\nimsi\nborriello\nrepetto\nsayd\nciliates\ntatsuta\nklock\nnrbq\nmikhael\nequateur\nsalmaan\nclayburgh\ngiardello\nkobrin\nobl\nbaldelli\ndatt\nbergheim\nliesel\nbekka\nmartelly\npyrrha\nelad\nmilsom\nerda\nprabowo\nkarem\nvolksoper\nneuheisel\nkeya\ngortari\njulin\nmegaprojects\ncanora\nlooy\nsabas\nmccaskey\nsweetgum\nsepe\ndeba\njovica\npupo\nbaconian\ntearle\nsevenths\nsuero\nkasprowicz\ntanigawa\ncohost\ncarrée\npuspa\nkacha\nwhitsett\nfogging\nboghossian\nchurchland\nseamstresses\nglin\nloic\ncontextualizing\nisidora\nmiani\njokela\ntickles\nquercia\nmiddlemiss\nbellomo\npeyo\narben\ndilshad\nfsg\narinze\ntobogganing\nwoolfe\nakademija\nkasner\ndescribable\nwitting\nhalkbank\nstreich\nvervoort\nyantar\nsignorini\nweide\nveere\nmislabelled\nblandine\naparri\nwyandanch\nthebe\nbordaberry\ncampbelli\narwa\nsqueaker\nyesod\nomnicom\nweck\nluppi\nmequon\nbluesbreakers\nmauritshuis\ncoseley\ncrede\nlangside\nblenkinsop\nbathrobe\nmotzkin\nmudanjiang\ncucina\nwillig\ntuku\nexorcised\nnewegg\nmazin\nbrantly\nohioans\nbronk\navermaet\nsalvinorin\nbruschi\nairlangga\nbriskin\nditson\ntto\nwedd\nburlison\nnicolaou\nbrainpower\nsantelli\ninbal\ntenens\npolyploid\nmasamichi\naeroport\nheimbach\nhofheinz\nacclimatisation\nseadragon\nhassling\nparapsychologists\njuggles\nfigari\nmulliken\nunplugging\nrationalizations\nmaravillas\nlolich\nlelo\nkohoutek\nseasprite\nabdeslam\nteun\nanpr\nyangzi\npatisserie\nelucidates\nmaternelle\npunnett\nlaron\nphotogravure\nballadur\nkarmen\nlichter\nnaomichi\nriitta\ntarvisio\nbeber\nirsay\naashish\npathein\npassphrase\nshizhen\nnovakovic\nromeyn\ninvestec\nsamphan\ntorte\nmacer\ncisa\nmejuto\nbols\nchytridiomycosis\ndaunte\nmarkman\nparkfield\norsino\nslideshare\nwellborn\ntitty\nbourret\ngilcrease\ntakenouchi\ntoff\nkersal\nstap\ncolectivo\nnli\nsilversun\njovita\nwanes\nbautz\nrisible\nlunging\ntyr
amine\nhatanaka\nfuengirola\nlustgarten\nrevenger\ntanita\nhohenschönhausen\nperito\nmullican\nmedwick\npiperazine\naiman\nclaesson\nmazzetti\nleporello\nheraclides\nzuhair\ncopperas\npropensities\nnymex\nbabil\nscarisbrick\nbjorklund\nnechayev\ntrimark\nbitterest\nlenna\nvulvar\nmoapa\ngetcha\nusw\nbackdraft\nmatcham\nchemistries\nambros\nkneed\nbruma\ngammons\nislwyn\nstirk\nunguja\ngoutam\nstirrings\nbowa\nhlc\nmisbehaves\nreddening\nbranksome\nriemenschneider\nbmf\nacna\nbiogeochemistry\nitas\nwrongheaded\nsaturnia\nmarinade\nkuttanad\nrrp\necclesiastically\npecks\ninstil\nlogjam\nforró\ndigga\npostulator\nenl\nbareli\nprocida\nonedin\nfilemon\ncourson\npulham\npictographic\nazzarello\ntoone\nosmolarity\norji\ndubya\ninsularity\ninconveniently\narzamas\nwawona\nbabine\nlibrarything\nnadp\nocasio\nstaller\numemoto\nmaxpreps\nalviso\nzervos\nmagnifies\nziani\nkiper\ntrekkies\ndcns\nhipolito\ntamzin\nsthalekar\ndiscussants\nmartinsen\nlilford\npiazzetta\njivan\ndogmatically\njiwa\nbüchel\nlabe\nbalustrading\nwoollcott\nrovinj\npflueger\ntarnowska\nmortared\njiayi\ntrotwood\nsauteed\naabb\nluckner\ncontestation\ndaruma\nheta\nneurobiologist\nlunney\nmaois\nhoofs\nbarkha\nexcellencies\nmudbrick\nnandana\ndavon\ndiptychs\ntrumpeted\nnishant\nscandinavica\nmiroshnichenko\nsocialised\ninfoseek\ndesailly\nkazuno\nraudabaugh\nyelizaveta\ninputted\nglorioso\nfolkman\ncey\nccas\nmitsuya\nelution\nhedingham\nirrelevancies\nrollovers\nblacky\nigoe\nleonardtown\ndorjee\npawhuska\npanah\nviard\nmarylin\ncalathes\nlaidler\nknits\nrepresentativeness\nintermediation\nsantoni\nayrault\nguangyuan\nheddon\nziering\ntonys\nhemangioma\nkulasekara\nreutter\ncontrivances\ndelahunt\nchadli\ndisinfecting\nfulwell\nroussin\nchirnside\nbujang\ninterborough\npanskura\ndirlewanger\nlindzen\nhayati\nridgeback\nriles\nbedazzled\nnordgren\nrossio\nneuroma\nwaechter\ncastroville\nnmd\norlen\nvandenbergh\nurtext\nhallier\nidolize\nvoeux\npreorders\ndente\nmifepristone\ntrawsfynydd\nintertemporal\nr
andles\nlanglais\nprobationers\nlamberts\nanding\nuxo\ncandelabrum\ndamad\nploys\ntarentaise\ncaulder\ngujaratis\nvob\nmeinl\nchazy\netchells\nsoz\nakkar\npetroc\nsummonses\nclavin\nglum\nzarra\nxabier\nbagnoli\npalwal\nchenda\nclassifiable\nnatufian\nbitdefender\nguastavino\npratica\ntithonus\nferit\ndracut\npeatlands\nudoh\napic\nstachel\ngava\nkhedira\njuancho\nracialism\nchinnor\nforeshortening\nsubdivides\ndulli\nhomestake\nwarioware\nbergersen\ndiscourtesy\nteofil\nataris\nwixom\nbâ\nlambing\ntannahill\nelworthy\nassayas\nglau\nvoigtländer\nsukhi\ndispersants\nnewbattle\nnonlocal\ndodin\nokajima\ntima\nfisticuffs\nlastfm\ncatalani\nekd\nnordics\nzaghawa\nwenvoe\nloli\nthickener\nkaupthing\nsojourns\nwappingers\ncoquet\nawaking\ntaranis\nsuttle\nviciousness\ndysgenesis\nbaartman\nheeney\nholmium\nhassocks\nsmurfette\nleontine\nflamboyance\nhathcock\nkakkar\nsneinton\nhajek\nbösendorfer\nfieldston\nphung\ntuckerton\nneeme\nsidewise\ncastleberry\ntullibardine\nmirandela\nobaidullah\njadoo\nlacewing\nkeiichiro\nharpa\nbichon\nbingu\ntanvi\npalminteri\ndatang\ndoneraile\ndhhs\nkroupa\ntriest\npapst\ntresckow\ntalis\nmohar\nhfa\nandantes\nabhainn\nrockmore\nhald\nnecc\nthielen\nmahopac\nreverberations\nchissano\ndigos\nprehospital\nmarilu\nelbaz\nfrolics\njettisoning\ngulik\nkokko\nchristl\nschaghticoke\nbirdsville\neei\nhelberg\nrachna\nvaishno\nastier\njupitus\nfetting\nrunton\nwme\nbeba\nconsign\nfarzad\nerdrich\nerevan\nmifare\nzare\nfufu\ngoodacre\norientals\ndilapidation\nciudades\nlaffoon\ncloseups\nlittlestown\nhees\nstabilisers\nhartpury\ndebridement\nmarcato\ntfsi\nbetanzos\nnosey\ndemaio\nkusch\nmelian\nkeoghan\ncatmull\nencalada\ndars\npolicarpo\nrawness\ncullom\nfieger\nglittery\nrussky\ndemobbed\ncoomer\nddot\nteliasonera\nkepner\nabusir\nestrellita\nböhler\nchernomyrdin\nberlinger\nbrah\nsreesanth\nheckmondwike\nfrederich\ndepressus\nfrea\nofficina\nmesbah\nbushati\npirsig\nmant\nfurloughed\nvalsecchi\nwoode\notpor\nmalcomson\nmiraz\nhise\njazzmaster\
ntovah\nsuprise\nstremme\ndoonan\nblogg\nsolutrean\ncassill\ncvi\nflorek\nsummerson\npazienza\namahl\nlubetkin\nkassala\nhorgen\nwades\nfex\nnakane\ndmrc\nhelford\ncenturia\nabeles\nkitschy\nwatchword\nphotocopiers\nlenina\nhohn\nvolf\nscarman\nsaralee\ngonzague\nlebowitz\nwestphalen\nisaia\nooooh\naddlestone\ntallboy\nfaunce\nsuttles\ndethroning\ntediously\nkreuk\ncryme\nbingle\nhidehiko\nheadboard\ndelphian\nbavo\nmultihulls\nunallocated\nsprecher\netzioni\nyauch\ngiovinazzo\nmthatha\nsonybmg\ncurmudgeon\nebbing\nteleconference\nnotchback\nbowsher\nperjured\nmirtha\nsoulshock\nignatiev\nnyonya\nterrorising\ntuberville\nrevolucion\ncoxwell\ntgwu\njovin\necoles\nnóbrega\nisayev\nbafokeng\ntemer\ntarini\nkuning\nimmunoassay\nprocessus\nbiochemically\nguineans\nfynn\noutros\nalmanza\ndesaparecidos\ncronan\nmiracleman\nsilje\njsoc\npilchard\nzada\n&#\nbricolage\niftar\nnoriyasu\nqingyang\ngeorgeson\nsoundproofing\ndorotheus\nbordoloi\nunmaking\nbissonnette\ntotes\nmaybin\nmrinalini\nvanzant\nagbayani\nbouzid\ncaloris\npanguni\nchos\nhummels\ncomacchio\neav\nnaryshkin\nnlra\nfontan\nshovelnose\ninfini\nillimani\ncony\nmdw\nchiding\njulep\nnidderdale\nflowerpot\nbachelot\nmersea\npacesetter\nichinose\nkiesler\nhig\nfactotum\nminet\ncassi\nhahne\ndesperadoes\nfritsche\nfoosball\nbacteremia\nnubes\nryutaro\nlaïcité\ntrawled\nbursitis\ndishonorably\nromare\npatar\ndehydrating\narshile\npronovost\naircrafts\npoulains\ntsutaya\ndhamar\nretread\ntransgressing\nasur\nfluting\nheartfield\npenthouses\njodeci\nchristoffersen\nlaroque\nongaro\nwardman\nskytte\ncomillas\nswitchgrass\nflorizel\nbecki\nitay\nchastened\nvitantonio\nheinen\nrusts\nfsd\nbarnston\ngorst\nsudano\naminul\ndohring\nimmolated\npetabytes\nruas\nunderbody\nlusher\naltria\ngiwa\nindustrialize\ndowntowns\ntekapo\nzehnder\nmasoom\ndesiccant\nyonan\nlepine\nmoscoe\nnottage\nnashoba\nclaspers\nbornemann\njawf\njizzakh\nchattaway\nfieldbus\nwachowski\nwalkthroughs\nquoit\nzot\nshits\nalcubierre\nyuliana\nharkey\njif\
nmaryan\nbioaccumulation\nskillset\nlyin\nmarktplatz\nfilemaker\nharasser\nbathyscaphe\nbalbina\nrempel\nkaolack\nborbon\nkleinert\nricasoli\nantagonise\nshuffleboard\nvolleyed\nacconci\nshanda\naltin\ndominical\nreinoso\nroseboro\nweissmann\nbeckenstein\nbawtry\nfysh\nbrizuela\nimprovisatory\nyyz\ncabezón\nballett\njori\npsalters\nliversidge\ncrabby\nalsthom\npatchouli\ndissects\nsucka\nsnicker\nsimonian\ngrandiosity\naghdashloo\nliggins\npotentiometers\nquaver\ngruening\ngalpharm\nsichuanese\nzealotry\ntrehalose\nparihar\nbodes\nsaux\nbaranof\nlambro\nobstructionism\nusonian\ncrosswalks\ngahr\nmeti\nprimitivist\nupul\nule\necatepec\necclesiological\nshoaling\ncorexit\nlatinum\nurgings\nnaden\ncoudenhove\ncester\nheirarchy\nhonorine\ngutzon\ncontactor\nelitzur\nancon\nnegations\nkalasin\nneocon\nlibeled\nlampre\nfolco\nchhabra\nbifid\nnle\nauspice\ncliment\niguazú\nripton\nwhould\nmimsy\ngonthier\ngreger\nvegemite\nwheats\nfrydman\nunnavigable\nvoxels\napos\nloggias\nsideshows\ngorley\nclawing\ntakemura\ngaelscoil\ncihan\nlapo\nhtin\nrabbitte\nimmokalee\nklöden\nmpw\nxicheng\nschach\nperplexity\nxist\ngugino\ntuddenham\nmarianela\nwij\nshabir\ngarganey\nfigeac\narkley\ngelati\nwookiee\nmakis\npwyll\nsann\npetiolaris\ncrossbars\nmycologia\ncalamine\nlopate\nexhilaration\nmontélimar\nuniqlo\nkabri\npras\nsegawa\nfraulein\nschleich\ncommerciales\nindignities\ntschumi\nsaddlebred\ntotaly\nmonounsaturated\nwaipahu\nbarwa\ncharme\ncappon\ngrisi\nunpasteurized\nmovsisyan\nneemuch\ntamai\nsaraceno\nshaws\nmcgeehan\nfleche\nsporozoites\nkellock\nrisca\nwisley\nkundra\nkonovalov\nfairwater\nloonies\naacc\ncayzer\nbandanna\nredhorse\npyr\nmilbanke\neuropacorp\nbeddgelert\nlaunderers\nullapool\nlaggan\nhasid\npumphrey\ninocencio\nshrivastav\nhirschfelder\nsuryavarman\nrangitoto\nfalak\ntrebles\npotentate\nmorada\nranta\nkilcoy\netchingham\nromas\nswatting\nrosamunde\nsubcontract\npedroza\nkikongo\nifrc\nmanek\nrelais\ncomprehends\nsleighs\nverbania\nlaurentide\nouahab\nwanadoo
\naerodynamicist\nquickbooks\nschenn\ntractarian\negen\nmetalworker\nhughton\nwll\npedregal\nbelgrad\ncreedmoor\nvierzon\nhht\nfelber\nthind\nmozaffar\nbaboo\nhrg\npedalling\nbaserunning\nladas\nadaptively\nderksen\ngenkai\nsungoliath\nuncoupling\nfritter\nprsa\nbovington\nexhaling\nroughead\nattias\nfrancese\nblart\ndarwinia\nmcbrien\nhubner\nsumburgh\nagreeably\nchitnis\nclarkdale\nnorovirus\nharbingers\nrodino\nhincapie\nevocations\nfossilization\nmelor\nsacerdotal\nreil\nlacasse\nirri\nsilvas\nwyant\nshifnal\nzedler\njebsen\nlemminkäinen\npyramide\nmittenwald\nrwandans\nmitla\nhouldsworth\nsimmental\nancel\negli\nhaarp\nvraie\nbrominated\nallspice\nalguien\nraskulinecz\ngreaney\nisard\nsestri\ntarpan\nvaping\npirzada\nmakarenko\nkaylan\ndhalla\nunerring\nlarina\nwoolnough\nmühlberg\ntalkeetna\nzie\ndelt\ngabber\nhysteric\nbruit\nhainsworth\ngarko\nqueasy\nbalsom\nshuey\nfinkle\nmaari\ndeeney\notunga\ninnisfree\ninvasiveness\nsulfonate\nconsuela\ngranulomatosis\npolytheists\nstaverton\nroundworms\ntanimoto\nsynchronizes\ngerow\nabertay\npetts\nbakir\nstifles\nheathcoat\nsnoddy\nseidenberg\nceph\ndest\nloures\ncomplet\nschlichting\ngowans\natiku\nantiga\narmey\njailor\nsensi\nsnared\ncavin\nveldt\ngsz\nstradey\nscripter\nelsbeth\nkaliko\ncarrott\negea\nskint\ninfront\nbarretts\naaib\nhanus\nsvidler\nuncontaminated\nparalyzes\npandolfi\nmcglone\nshifrin\naftercare\nokocha\nallandale\nduchin\ntosun\nuremic\nwavers\nhamence\ndwc\nbial\npomeranians\nmehring\njamma\neuonymus\narveladze\nblaire\ngwt\ndaingerfield\namniocentesis\ntetzlaff\nlfb\nwasherwoman\nwknr\ntapis\nkmox\naztlan\naravinda\nfaget\nboel\nlhakhang\nsambuca\ncatabolic\npaintballs\nhighwire\nbarsky\nmillerand\njiawei\nacsa\njula\nnte\nnethack\nsomerfield\ndessler\nrepartee\nwoop\nguillon\ncpw\nrolpa\nstreetwear\nlewicki\nungarn\ngrav\ncaractacus\nromme\ncrestone\nradiotelephone\npilothouse\nfloret\noco\nnaranja\nupali\nfarfetched\naudiotape\nvittori\nthingies\nhemostasis\ncornetto\nhoerner\ndeschampsia\na
buelo\nfazilka\nreticulation\nhertog\njarzombek\nvues\ndardo\nkastrati\nabhors\nfernande\nfridges\ncolorable\nkarlson\npledgers\ntroodos\nforstmann\nkvn\ncarwyn\nstarshine\nsaretta\nmaragall\nrockumentary\ndigo\nstewardesses\nzimin\ncompartmentalization\nademola\nmaradi\npianistic\nmeskhetian\nsocceroo\nreder\ncommunions\ndumars\nsparser\nstrummed\nhelal\nmacrocosm\nstonecrop\ngoeldi\nschönebeck\nsantillán\ncoale\napollyon\nhowat\nsherer\nrushford\nbriens\nwhitland\ncoia\nhabituated\nlondons\nstripling\nmunidopsis\ndispite\ncourseware\ndogging\ngreatbatch\nshoeburyness\ntwyman\nproove\nhanamaki\nqura\nshepway\nadmonishments\nwri\nrostrevor\nrowsell\njeron\nmontereau\ndawat\nsaronic\ncruder\njackalope\nrotoscoping\nluxo\ncholangitis\ntblisi\nklosterman\nwickremesinghe\nmeinertzhagen\ncarollo\niof\nunbranded\nlaulu\nkleptomania\nmorphologic\noxyura\nmuette\nreye\nmazrui\nfunctionalization\nworzel\ngirdwood\ntidier\nbhrt\njata\navailing\nveeder\nyota\nsilberberg\nsheller\nunswerving\nsabat\nmosaicism\nhano\nstell\ndonachie\nmunier\nvaida\nlaius\nbischofshofen\nplattsburg\ncosmopolis\nbrg\nhadham\nabdirashid\nlaak\nkelurahan\nedip\nleatherette\nkavan\nsaravanamuttu\nrulebooks\nuchaf\nteatime\nnies\ncourtin\ncompressions\ndaubenton\ncurbishley\nobr\natallah\ngrampound\nsyston\nolufsen\nrottingdean\nphotosensitivity\npanisse\nrustem\nstana\ntrudie\njudgeships\nlandraces\nxinyun\nmimeograph\nmicke\nzyuganov\nwadleigh\nquilters\ncloster\nvarin\nsinopec\nfaried\ngallos\nmosco\nalbelda\nhermida\ngeigy\nprotoceratops\nmalyshev\nonix\nmaplin\ngses\nmonumentally\nzittel\nproops\nyatton\nvizquel\nhabermann\nvosloo\ndovercourt\nbaynham\nushkowitz\nhazra\nsorokina\nwertham\njeanes\nchaudhri\nmuad\nperowne\nwoolloomooloo\ncortelyou\nriserva\nifb\nsketchpad\nrecertification\nhakusan\nprosciutto\nbaitullah\nzoysa\ncollates\nbernadine\nprigent\nundernourished\nmeadowland\npalmitic\nadamek\natrophied\njapantown\nrispoli\nalmansa\nlevelheaded\ndespoiled\nhanham\ncdep\nsharam\nhoggart\ncr
iqui\ncipm\ngondolier\nnicolaides\ncoonhound\nadsorb\npadbury\nkrenz\nbilawal\ncallousness\nsarmad\nradlett\ndreck\nmaring\ncnw\neliphas\nmassingham\ndanseur\nbywaters\nprofiteers\ncrystallizing\nmissioner\nclurman\nearthrise\ndelict\nduplicator\nthira\nsafflower\nmkp\nbahawalnagar\nshakoor\ntomfoolery\narcis\nstankiewicz\nsowden\nilliquid\nbailee\nmorakot\ngordano\nbuttar\nickenham\nrelearn\nrantau\namichai\nonias\nwaldon\nhelleborine\nmewat\nsansepolcro\nmilion\npeshtigo\nbromides\nlitle\nclodagh\npuppeteering\ngirma\ngotoku\nabkco\njakab\nopprobrium\nverbruggen\ndachstein\nbeeld\nselfhood\nsaci\nriopelle\nbeckles\ntoggling\nsuperstardom\npilaf\ncruzes\njarvik\nkellgren\nhimba\npinard\nuniq\npieler\noverestimating\nchelle\nsouthlands\nmenses\nnemea\nwoozy\nordsall\njamo\nnishimoto\ntmk\ntobermore\ndharavi\nquelch\nskittle\nsanzo\nrosaleen\nfascinates\nloosens\nhapsburg\nopalescent\nfpt\nunexciting\nvacates\napas\ntiaa\noptometric\nbentwaters\ndevalues\nsputter\njobert\nasharq\nitzehoe\ngroner\nunassociated\npeñón\ntrumpy\nsollers\naesthete\nishioka\nmangels\narsal\npestana\nlifesavers\ncampanini\nondas\naldie\nmicroflora\nmortmain\nqvga\nwednesfield\ncaddis\nnipping\nbusinesslike\nfallot\nsyeda\nfrederique\ncoor\nextraverted\nunmil\nnzd\nacceptances\nbokova\nlockbourne\nstaudt\ninsipidus\nbpb\ndehaan\nmannino\ntailhook\naptera\nmoncef\npointillism\nchametz\ndgh\nngt\nsarcomas\nlaogai\ntredinnick\nzeaxanthin\nthielemann\ndcms\nsheerin\nmalibran\nphasers\ninnerrhoden\ndepressingly\nhurtubise\ngatow\nbastardo\npullers\nblé\narcha\ntatlin\nsudduth\ncastrati\ngiocoso\npunctate\npuryear\njerkin\nhertsmere\nmebane\ncreutzfeldt\nmoorside\nchanology\nevac\nweyman\nzaur\nportier\nesop\nmicrosurgery\ntalton\nvore\nruah\nmilitantly\nblighty\npeppe\nszabados\ntiangco\ntangaroa\nadbusters\nquello\nhonjo\ntremelling\nparaskeva\ntawas\nkeelback\ndorne\nbaumgardner\nwoodsmen\ndoeschate\nlgs\ndpf\nredeemers\ncollimator\nheuser\ncanino\ngalles\nmammuthus\nshajarian\niuds\nsiber\nsep
ticaemia\nkishwaukee\nnakheel\nmicrurus\nudrih\nhornberger\nthel\ndisaccharide\nkawhi\nkpo\nsedes\nvanderslice\npeccaries\nwardlow\nclutched\nadhoc\nchampenois\nfep\nforgone\nburntwood\nminnesotans\nhemochromatosis\narsenite\nhypocritically\ngabreski\nhanin\nguenon\nreitan\nfunderburk\nwomanising\nbalochis\ncabaye\ntabarin\nkapanen\njablonsky\nangstroms\nstubbins\niliya\nkeary\natoning\ncammarano\nreptilians\npaise\ndecreeing\nvlastimir\nzunyi\npingle\nbovo\ndauber\nitk\ntoso\nmortgagee\ndiné\ncccn\nshallowest\nmooch\nplaned\ngreensville\nceryx\ninterconnector\ndismantlement\nlaze\nsubbiah\nheswall\ngroes\nignominy\nkrul\nberggruen\nreptar\nbrygge\nfoulger\nganser\ndanial\nachy\nimpaction\ntahil\nbeckers\nbhang\nmainieri\nblaser\nveoh\npmm\nparratt\nunhelpfully\nchandhok\ngodda\nplumeria\nklute\nhesler\nroughs\nkostelanetz\nhambletonian\nsrw\nelleray\nsiga\nhingle\nays\nrantzen\naymer\npenthesilea\ntrekkie\nnorthborough\npeiffer\nmcgreevy\nmontrachet\nkuwahara\nello\nncba\ndeverill\neells\nbasim\nmatzah\ndeichmann\nsoldiered\nduminy\nyefet\nbratwurst\nweinrib\nbidet\nkillingly\nmargrit\niip\ntyche\nduany\nvacillating\nhinksey\ngnn\nsuryanarayana\naccokeek\nunexperienced\nnalbari\nbottum\nangelita\nmilankovitch\ngolota\nmédias\newc\nnaila\nperceptually\nswains\nbenckiser\nbaral\nstrandzha\njargony\ntermer\nsubbarao\nnoppawan\ninstructables\ncholangiocarcinoma\ndiwata\nhypnotics\nptf\nveu\ncitalopram\nsachsenring\npaleography\nognenovski\nbignell\nneuroleptic\npettifer\nmceuen\nquantic\nsenden\nryno\nargun\nwicb\nknebel\ncontempo\nskyy\nchilpancingo\nsalang\npridham\nsouvanna\ndowty\nattentiveness\nhiraoka\nspearfishing\npdq\nunrealised\napds\nkefir\ncudjoe\ncoactivator\nkopenhagen\nkharaj\nhsan\nnaiads\nrolette\naparently\ncantero\npurloined\nhechler\nvautour\nmaccarone\nshrewdness\nwoodcutters\nihop\ndesmet\nshuisky\ngarlock\nwaiblingen\nnenshi\nmcisaac\nnyos\nmajeur\nelmander\ntava\nchifeng\ncustance\nbrookshier\nsnelson\naveley\nleterme\nescamillo\nkjersti\nlycées
\ntyrer\nlashings\nbenwell\ngardée\nwdsu\nkozloff\nsexyback\nmels\nyehia\ngilmanton\nsupaul\nyarkon\npromulgates\npurposeless\nrnib\niwabuchi\nxueliang\ncecina\nmurmuring\nbrigandage\ngarraty\nrodge\nbrugger\ncyro\namta\nsaurin\nflannagan\nbeckie\nlatticed\nllandysul\nquelque\nbakhshi\ndestructed\njinny\ngordin\nscribbles\nbaghi\nparodist\nbuea\nyorkie\npench\nnishihara\nrolleiflex\nlurleen\ndraftsmen\narh\nthromboembolism\nkerio\nneusiedl\nsathe\nkapell\nskyclad\ncarneros\ncountermanded\nwhitham\nbroaddus\nschoor\nowo\nburkard\npadar\necthr\nlinhart\nflatman\nabar\nlivecd\nsohmer\nstonewalled\ncribbed\naxilla\nkobuk\nuxmal\nemmott\nmcculley\npbt\ngabriels\nrobotically\nbrangwyn\nvoulgaris\nstrebel\nmondino\ncutback\nvivante\nfiraxis\nunapproachable\nktvt\nheinonen\nrasping\nraychaudhuri\ngoodbody\nyamanouchi\noot\narrester\nknibb\nlydford\nderoy\ndowagiac\nzelinsky\nicograda\ntorgersen\nstefanovic\nshinwari\nmidhat\nsfe\ndejong\nmiff\nmedtner\ncalavera\nmyrie\nhott\nblackfield\nbesta\nhelenius\nyamana\ndisbursing\nchanga\npilato\nazulejo\nmuneer\nnaschy\nramseur\ngéraldine\nsquishy\njiaozuo\ndongping\npartis\nhansie\ncedarhurst\nunpolluted\nnegritos\nuberti\nmlada\ndatapoint\nkersee\ndeckert\nnedda\narmchairs\nluxembourgeois\nreplicants\nmurgia\nmeacher\ndoyne\nentrepôt\npflugerville\ncolleyville\njoyless\npredication\nbayramov\npreferrably\nnvm\nepauletted\npalmata\nketty\nmommie\nkollek\ninterdicted\nparee\npositano\nnunley\noyelowo\ntransalpine\nbaguley\nbellowing\nparabolas\nmonalisa\nchirpy\ncambell\ntidus\nungaro\nsoltaniyeh\nredoubtable\ntoubro\nhagemann\nsumrall\nnobuaki\nwellow\nezcurra\nwhateley\nspliff\nchorrillos\nasprin\nbarash\nhamri\nransdell\nevrard\ndotter\nreverently\nmalha\nchittick\nchromeo\nmte\nkristaps\nwinnetou\nmarjolein\nemasculated\nmedinah\npapá\nstinker\narthus\ntolomei\nlissner\nbagneux\nhexavalent\narborist\nwintu\ndrabek\necuadorean\nchucked\nmadrassas\npyong\nholofcener\nmargalla\nnovaeangliae\nprioritising\nurry\nottis\nmitteleurop
a\nrenteria\npve\nameliorating\nyglesias\npattillo\nnakia\nruen\naniseed\nfurner\nsenne\nnnp\ngibert\nparanhos\nsiegelman\npoliticos\nsourire\nblauer\nglided\nphosphoenolpyruvate\nfortwo\nhaxhi\nseleznyov\nbarefooted\ncercla\ndanaë\nberendt\nnurburgring\nchye\nlarrys\nabdominis\nasphalted\nfightstar\nronconi\nohsu\ncounterparties\nlearnings\ngrondona\nvandergrift\naerospatiale\nbenmore\nspeach\njacklyn\nmadrone\nrelevantly\nnonnative\namboinensis\ngrayslake\nhoratia\nmetatarsals\nsupercharging\nthrum\nrushmoor\nboozman\nchauvelin\npertinence\nwarboys\nwillacy\npickton\ntaboada\nnarula\nflintham\nrapace\ntitta\nbottlebrush\nvuong\ncroupier\nhoogenband\nsoroush\nstrapline\nshirow\nikuko\nkuusisto\nboonsak\nnaci\ndesensitized\nboingboing\nmelanotan\nyeller\nderivate\ninternalizing\ngiganta\nputbus\nschoolkids\nondra\nedenderry\npackington\npurifies\nselkie\ncios\nmkiii\nslytherin\nvisualizes\nshortfin\ngeoje\ncaddell\nundeciphered\npradera\nbudan\nrobertsbridge\nmowlem\nismo\npyogenes\nkonchalovsky\nmirian\ntrivialize\nsobhuza\npitons\nmonna\nriesel\ntoltecs\ntriaxial\ndossi\nmoonbeams\npanathenaic\nbonariensis\nyagoda\nbrugha\nspiderweb\nsorriso\nswaythling\nfurrier\nmonsell\nviglione\ncritism\npursey\ngazzaniga\ntomalin\nleafcutter\nhandford\nsuccinic\nhoynes\nbelliard\nqsi\ncerva\nmuezzin\nemendation\nhoonah\nxiaolong\nrighter\nwamu\nloughery\nlafortune\nelum\nvictrola\nyack\nkaisers\nbunratty\nmccudden\njulissa\nschepisi\nvics\ntaruskin\ncoalisland\ngracing\nfacchini\nzhengyi\nennoblement\nmaroulis\nriri\nbge\nanglogold\nhuddart\nbatali\nvesterbro\nlesabre\ncottee\nyoshifumi\nbalsall\nsiki\nperidot\nfisc\nlages\ngardyne\naverred\nrochat\ngrímsson\nnauseous\nsposa\ncdv\nmainlanders\nposillipo\npanfilo\ncullowhee\nsorn\ncime\nrohatyn\nhlb\nthunderstruck\nclonazepam\nniskayuna\ntenko\nitalicizing\nmazzei\ninterdisciplinarity\npeyron\npetrina\nmelter\nmirepoix\nshat\nbjs\njrr\nhostetter\naabenraa\nliveryman\nvilifying\nkhaya\ncastillian\nconforti\nbivouacked\nsogni\nam
itai\nwariness\nblunts\nbrasileiros\nfiges\ncharrington\nnukus\nmni\ngoudreau\ngoldings\npersuader\nlilt\nalvy\ngrifo\nkerogen\naybar\nwithnail\nboral\ncolomer\nristorante\nhelminth\nverrazzano\nrej\nimb\nnephrite\ncontraversial\ndistrusting\nchinense\ngattis\nbonnici\nstanislao\nagogo\nqaddafi\nburlingham\ngild\nmuchachos\nkubin\nsunne\nsaadeh\naxelle\nhostelry\nmeixner\nmarybeth\nalehouse\ntabaco\nherewith\nspetses\nmaitra\nbegay\nmofa\nsonitpur\nschiano\nlianna\ngaydar\nzamba\nvariola\nsmouha\npirouette\nsiôn\ngwi\nabla\nwoodhill\ndomesticate\nbondsman\nbrimble\nsextuplets\nmalene\nedmondo\ngardell\nbiocompatible\nlouviers\ntzitzit\nnomani\ntribbett\naddi\nmokoena\nunevenness\ngampel\nmoppet\ndusable\ndovecot\ncockatiel\npeepers\ngolic\npurines\nrakai\nsharking\nchmielewski\nkinkead\nlackner\nfluoroquinolones\nbairns\ncpk\nbackroads\ntianping\ncasl\niroh\nchamdo\ncluelessness\npierina\nberu\nindustrials\naltec\nharl\nthé\nrondell\nfredette\nhoarders\ngooderham\nlancefield\nsolidworks\nmargolyes\ngasparri\nlangstone\nvedomosti\necclesiasticus\nredistributions\nsurfliner\ndeshawn\ndiaconescu\nutusan\nundersecretaries\nbajío\nhuggy\nautoerotic\nsirotkina\nburnage\nsyrupy\ncrucian\nbullae\nlanman\nonley\njohnno\ngédéon\nlauber\nifeanyi\nadhan\ngervaise\ntcherepnin\nwoudl\njika\njetter\nkemmer\nconsell\ntransgenderism\nglatzer\ncurlin\nsöder\npuerco\nconigliaro\nleora\nchelly\nepix\nswynford\nlievens\nvermeille\ntansu\nmiska\nshcherbakov\nbdl\nchahta\ntouchwood\npoveri\ncuma\nmarcks\nmichiels\nshakespears\nlabview\ncogburn\nleaseholders\nposehn\nstree\npatersons\ngarett\nunderemployed\nonorato\nleeke\nfust\nselenia\nnutbush\nolanzapine\nmkk\nconvolutions\nhalffter\nrintoul\ncaméra\npagès\nhendrich\nguarany\nbosson\ncaltagirone\njamrud\nmahovlich\nvaron\nolhos\nparul\npitchman\nminkoff\ncallaloo\nmutuel\nutsumi\ngeldenhuys\ngreenacres\nacheive\nrosholt\nsolier\nnonferrous\nsynthesising\nouzel\nsqueegee\necomog\npresaging\nfesting\njuneteenth\nbière\nfrewer\ntrifoliata\
nwhirligig\nastrud\ndiosa\nstaggs\npiv\nguldberg\ngoner\ndeeps\neola\ngigaom\ntimey\nfinden\njacquetta\nstender\ncapacious\nciders\nthermosphere\nbienes\nmcswain\nstickiness\nrainworth\nattenuates\nschwarzman\nhalvard\neidson\npresage\ntrafigura\nfederacion\nostertag\nbatajnica\nadjuvants\nchromic\nhowze\nfarnam\npassionfruit\nfmcg\nguille\nabjuration\nwiregrass\nquat\nstegmann\nbedfont\nqsm\nfelina\ndanel\nnajar\nrasi\nunawareness\nsivaraman\ncuan\ndecries\ndyker\nhistorial\nabsolom\nbarnby\nhwe\nkahuta\ndanilovich\nbokhara\nviviers\nlaune\ncastleblayney\nhighgrove\nfrustum\ncrickhowell\ndevall\nairdropped\nvaporizing\nhöfer\ndecriminalize\nsullinger\nvictoriously\nladenburg\nyashar\nrummaging\ncootes\npittock\ntouhy\nporęba\nashtead\ndisassociation\nkishtwar\nptas\njasika\ntransect\nbfw\nrudie\ndupatta\ncrozer\nmoralists\npermissable\ncassiterite\nshibh\nstanwick\nsabbatarian\ndisconcerted\nfresca\nhalloway\nkirsi\nshraga\nthymic\nkhatoon\nheidari\nsublayer\nehrlichman\nhypoglossal\ngraniteville\npapert\nlaager\nwroc\nrosolino\nperivale\nicefields\ndecouple\nmarcoussis\nnatrix\nethnos\nbookended\nmoton\nstenerud\nkemah\npenev\nfrizzle\ntunick\njaparidze\nlourd\nsicut\nsels\nkishon\nuncorrupted\nnarine\nelectrotherapy\ntransposes\nbonnat\nsignifications\nvido\ncufflinks\nneuropathies\nfpso\ndamasio\nmancos\nbalestra\nkureishi\ntomassi\nponchatoula\ndownrange\nanorak\nbreves\nwherewithal\nmonolayers\nendothelin\nplr\namash\nconstrucciones\nmustafi\ngirdlestone\nsaffar\nprx\nolivar\ngutt\nneils\nwillies\npsychokinetic\nteradata\nfowlis\nkinetically\nnykänen\nmaccormac\nespagnol\nundof\ndarfield\ndisbelieved\nsenger\nrobley\ndlna\nhamit\nmtsu\nkensett\nmuffet\ndispersant\nporpora\nmoundsville\nmanxman\nwikiscanner\ncevert\nschnur\nsirdar\nholzhausen\niberdrola\nfaites\nungerer\nrigoni\nwerber\nrache\ndimmers\ngooge\nmeatless\nkrank\ncregg\nsandstorms\nwsd\ntaconite\nelbowing\nblackhearts\ngalimberti\nsalan\nkeiller\nhematuria\nicebox\nbabka\nhomeboys\nsterk\npequod\nm
gv\ntoybox\nwonderment\nlastovo\nakst\nmóvil\nunip\nsaclay\npollini\nguill\nwplj\ncriminalise\nschibsted\nhont\nisakov\ngerau\nfuehrer\nsendo\nnyle\nguerrino\njanicki\nhypothesizing\nsleepaway\npillet\ngrimble\nbazelon\nwubbzy\ndruidism\nbarramundi\niwashita\npich\ngerolstein\npanico\ngrella\nflamin\nmajerle\nvallées\nalthought\nhilger\nlizette\nngugi\nsensitize\ndiii\nagre\nilliterates\nbpmn\nstubbe\ndomecq\npremeditation\nfrossard\nvasarely\ndeloris\ncategorisations\nphalange\ncornstalk\nmilde\ncyanoacrylate\nshands\nshibayama\nniemand\nfactitious\nmareth\nunadjusted\nbpel\ndhami\nfatos\nimv\nporcher\njeana\nhugos\nrivage\nrastrelli\nstayers\nyippie\nbindweed\nkudlow\nkheyl\nlaner\niturbi\nmorels\nagd\ntomiyama\nshankman\nquebecer\neconómico\namethi\nmerstham\nfinicky\nbundrick\nwicki\nlandolt\nkulongoski\nuglich\nstowing\nkinahan\njagdeep\ncullis\ncastaldi\nnoahide\nlazzara\nlaboratorium\nviveros\nmekons\nneuromodulation\nginther\nimpérial\nfriedgen\nfilner\nundemanding\nfarndale\nsitti\nfoyers\nalgeri\nvoiture\niddo\ngantries\nflik\ntibbits\nelastomeric\nclaytor\nrentable\norascom\nmerli\nmutans\nkillingsworth\nflikr\nmicr\nbringers\nnomis\ntruces\nconsols\ngiana\ngammer\nnaqi\ncrisostomo\npkn\nkesa\nsanjoy\ntiegs\nbodrov\ntreed\nyuke\nnorthcroft\npdms\ncrewing\nendel\nvedova\nmasorti\nsemplice\ncetti\nhalk\nmontoursville\nblanke\nkhattar\nownerships\nyippies\nmear\nrichi\nlimites\nglasse\ndrypoint\nvaghela\njewelery\nwestling\nbarcelo\nasteria\nmindell\nlanghans\noptimo\nlojze\nnutters\nnoncompetitive\nbudgen\nvaleo\npantries\nyingkou\ntapsell\naaup\njohnie\nvaginosis\ngrot\nmadhab\nhongik\nwordgirl\nravenwood\nmorcheeba\nskandia\nchaga\nfreemans\nifvs\nmagome\ntarwater\nqum\nfabregas\nvillacoublay\nfluoridated\nbuchen\nwobbles\nmotorhomes\nincoherently\nsternwheel\nstows\nwhittingstall\nasahara\nlustration\nmetanoia\nfichera\nchikako\nedgley\nemrah\nisozaki\nkorol\nagco\nstokesley\nlazzeri\nkado\ndiabate\nnanomedicine\nivete\ncivilizational\nwracking\nzodiacs\
nhede\nlasley\nlivens\ngrene\nparfrey\ntropfest\njudiciaries\nhandbell\nglossopharyngeal\nunzipped\nchaat\nheliosphere\nfulltime\nharpreet\ndeflating\nconceits\nciri\nhenlow\ndelfi\nhorsetails\nblacktail\nblackhead\nattala\napplique\nvva\nkindermann\nneah\nritualism\nrhagoletis\nbolena\nfenger\njohanns\nfilmore\npaynesville\ndiferent\nmccaleb\nchallinor\nfibular\nacount\ninseparably\nnaslund\nschultheiss\ndvl\npuits\nhalcombe\njetset\navus\ncjb\nbeatniks\nbowron\nbaars\nmmog\npenology\nfortiori\nsteelpan\nungovernable\nhepple\nconchas\nyalla\ndaydreamer\narhats\ntusa\ngeneralising\nreinsdorf\nroundups\nsandile\nvarroa\nzocalo\npitying\nbudhi\nducos\nknanaya\nroscoff\nmccollough\nwsvn\naccu\nincat\ndipika\nsoward\nbellmore\ntechsters\njohal\neig\nainscough\nchunghwa\ncarolee\narterton\nseverodvinsk\norishas\nsicker\nhenton\nbharatha\ntaveta\nholmquist\ntost\nnezami\nconjunctival\ntalu\nsouthesk\nshuter\nartibonite\ndocents\nrumbles\nmalins\nbmws\nkarume\nbedbugs\nblowfly\nhaik\nmelisma\nheri\nmarkwell\nmomi\nzwirner\nblackbox\nmarnay\nnitz\nassouline\nfcps\nmultipart\ndesroches\nachmad\nintranets\ndunois\nweinreb\ngehrke\ndamson\nannaud\njered\nmongkol\nwaxwork\nhocker\nlizardo\npolisher\nlisnard\nchitosan\nprobables\nbredin\ncabalism\npolyrhythms\nhbl\nadulterers\ncinequest\ncausse\ncederberg\npaea\nmias\nhekla\nsprayers\nichijo\ncbj\nrodos\nacademi\ndecisiveness\nfrede\nkhater\nashu\nmohnish\ntempeh\nkinescopes\nrowlatt\nproteomic\nfixations\nbiocompatibility\ncoring\ntía\nbullhorn\noakleaf\nnetsuke\nmarijke\ncosatu\ngoldston\ngoalkickers\nislamonline\nguerriero\nregraded\ntakahisa\nlucene\nbyelaws\ntiant\nkcra\ndalembert\nrodolph\nhiren\nnahanni\nedelson\nturberville\ncleora\nenticement\napocrine\nvisco\njumpsuits\ngigahertz\nbloodshy\npaleta\nchines\nsaraj\nnorling\nabai\nwilken\ndurkan\nvigen\nfashanu\nemelie\nlivadia\nbixi\nmiscarries\nlenart\ndesprez\nunderland\nkirstin\ndenisova\nvenlafaxine\nmoco\nsrbs\npricking\nkatsuo\ngreca\nmcgowen\ndirigibles\nnorthlake
\nattwater\ngourdon\nkrabbé\nsanlam\nvermiculite\nsettee\nbaugé\ntoggles\nmosa\ncarmon\nshowboats\nwerde\nsandworms\npacifistic\nnewfoundlanders\noberholtzer\ntiebreaking\nflr\numbridge\nbandelier\nniassa\nfeiler\nytl\nshevlin\nsestriere\nsplotches\nbuffets\nupwey\nuim\nintegumentary\nsouks\nbollaert\nfemke\nsupercede\nnoten\nrahall\ngraton\nmafa\nyaser\ntottington\nwiart\nfajar\nottar\nzenyatta\nnondescripts\nkowtow\nseelye\nfehmi\ntribemates\nkjeldsen\nmacrocyclic\nmyelination\nwrns\nramlila\nbeilein\nemotionalism\nlithonia\nhubbub\nhedden\nmaysan\nmukasey\nolerud\nstreetview\northographically\njunked\nanimales\npersimmons\nharrisons\nflec\nsinfonía\ndoraville\nrevitalising\nklingberg\ngabra\nspac\nredzone\naugury\nturbocharging\nsquarepusher\nbrawled\nanttila\nberkner\nridgeview\nnastia\ndestouches\nfanni\nmng\ndater\ntathiana\ncasos\nsours\nimprisonments\nrohlfs\nakutan\nkondratiev\noverachiever\naboutrika\ninundate\nintemperance\nhanbok\nsuperpipe\narsons\nweekley\nneurocognitive\nrowallan\nofficiant\nkoru\nladywell\nboschi\nbutare\nvoorhies\ncychwyn\nmodise\ncelie\njuvénal\ntolomeo\nvalvano\naccreditors\nharleston\nuhc\nredbox\ncollude\ndoerksen\nsahlin\nmcgloin\nbiomorphic\npaulk\nlazica\ntorrado\ndidio\nmeia\nvasistha\nchantel\nmiyawaki\nmccorquodale\nfuit\noutliving\ngangways\nklaassen\nxda\nhobgoblins\nshootin\nsherratt\nreconstituting\ntrotskyite\nbtf\nseverna\njagiello\nhesa\nphreaking\nbushcraft\nsarim\nkungfu\nkalbarri\nfdu\nsalant\ntodes\naysha\nsephora\nvautier\ndiatta\nsoliloquies\nanes\npuchkova\nduffer\nlynagh\npetrick\nsaccades\nvelupillai\nesselen\ndodik\nfleetingly\nausaid\nbiasi\nconcarneau\nhiten\ndesmarest\nsinmun\nsentimentalism\nhelming\nmedavoy\ndissidia\nlette\nrelit\nmacrolide\nhobsons\nmelbury\nbreytenbach\nwarryn\nlavanya\nanthracis\nzadora\nbahonar\nmakurdi\ngoodloe\nsarcelles\nylang\nedmar\nkoenen\nqbe\nshaham\nconlee\nbarometers\npolitti\ntrashes\ngodo\ngrechko\npbworks\nchandon\nantz\nbranwen\nrenna\nvisnu\nbirns\nmogis\nsocar\nga
rnishes\nroj\nmontpetit\nehrich\nveronesi\nclubrooms\nduruflé\nboganda\nheyn\naboutus\ngeorgio\ngallian\nfixit\ndisavowing\nschoonhoven\nzeek\nloganville\nsubcontracting\nnoach\ndebray\nnizhniy\ndrystone\nshangqiu\nwesterfeld\ndgse\ntalaud\nhealthsouth\nminuto\nnephila\nchumbley\nlinne\nbelka\nwhitely\nbentz\nsutured\nikuo\nwohlfarth\nshaoyang\nallograft\nbashiri\nhypercalcemia\nparolees\ncarcharias\neastmond\ncertifiers\nlacerated\ntiz\ndovedale\nfalabella\ngenotoxic\nbelfiore\nmocky\nmechel\nweatherboarded\nrecieves\nfixin\nluddites\nwaht\nmargaritas\nproeski\nhoudon\nsasan\ncitgo\nvilmorin\nnathanial\nmicrovascular\nkimbo\nthongchai\nyateley\nineffectively\nkarkh\nbsw\nschone\ntoggled\nbankole\nsynephrine\nkelland\nvetere\nperloff\nmitsotakis\ndoud\nambigious\nmikvah\nchiki\nveysel\nshearson\nsuperheating\ncasuistry\nnacc\nnecking\nterritorials\nnonuniform\nlavo\nloiseau\nfarabundo\nfolkes\nleykis\narenac\nbzw\ndanegeld\ndrazen\nmisquotes\ncraniosynostosis\nholmby\ngoos\naqr\njóhanna\ntoprak\nsuccesfully\nhusa\nsegi\noenone\nderides\nbumpass\ncwu\nrosenbergs\nsankrail\nduignan\nhamoud\nwbez\nebf\ngummadi\nsitiveni\nostentation\njacor\nchilworth\nkatee\nisotretinoin\ntecho\nshelli\nardens\nburgage\npapillons\nvgf\nustrzyki\npadamsee\npeyman\ncowpea\nmunenori\nargyria\nludgershall\nwedell\nmatsuki\nquixtar\nlekki\nlochmaben\nxfs\nhightown\nkujo\nunsweetened\ncabinetmakers\nyusufzai\nboeheim\nannulling\namec\njitka\nexecrable\nalvorada\nwittenoom\nadecco\ndeadlands\nchivu\nmarasigan\nnuestros\nbroglio\nmayme\nswiveling\nultralights\nwalkup\nclotaire\ncontrail\nvastus\ncrinkle\nvillafuerte\npreservers\nrehana\nflagstones\nbellfield\nundset\nintial\nwicksell\nbriars\nlocklin\nmichot\nboker\nchhu\nmorbi\nborgata\nalmy\nleonowens\nguadagnini\npolos\ncolorization\npoquoson\nradan\nchristiano\nchicoine\nbeki\nkouichi\nhya\nkitzinger\ndeadspin\ncodnor\nexorcising\nassimilationist\ngallone\nunseaworthy\nspinsters\ndulas\nbpt\narrol\nbinfield\nseeburg\neliahu\nbunky\npaatela
inen\npheu\ninapropriate\nsisterly\nmuldrow\nbaggies\ngombrowicz\nsidesteps\nsixsmith\nbryer\nparanal\ndunkle\nbouazizi\nktvk\nditmars\nmaclise\nperlo\nrippy\ndulled\nsdv\nshyster\nepiglottis\nwurttemberg\nloyalism\nresourcing\nduikers\ntitelman\nteuber\nostad\nbaramati\ngirlicious\nguofeng\ngimmie\nfirouz\nopilio\nkolesnik\naguardiente\nkcop\nperreau\nreinking\ntheorising\nayas\nkorth\nmbia\npublicans\nelizabeths\nfusible\nfdlr\nbahour\nzanjani\nreprimanding\nwainfleet\nenshrinement\nkofman\nmasso\nhomeopath\nbascomb\nfastballs\ngoroka\ngoannas\ndiehards\nbloxam\ncrooners\nspeakerphone\nguiney\ntatsuno\nsonnabend\nearlobe\nordinariates\nwbap\nlightens\ncomunity\nscholefield\nschmelzer\nconstancia\nendodontics\nlattin\nproact\nholling\njarek\nruses\nconze\nlevassor\ndako\nmachos\nvoraciously\nunfruitful\nbraindead\nwcu\npenglai\nuneasily\npcrm\natomistic\ncrepes\nportugues\nalmgren\ntruby\nbelter\nannihilates\npeonage\nseccombe\ntŷ\nlaist\nwardroom\narguta\nmnac\nrolandas\nlaigh\nchittenango\nselfoss\nfarquaad\ngroupama\neluting\nmefistofele\nnifong\nwhitesboro\nmamer\nlimbourg\nkanchanpur\nfinan\nquwain\nroeland\nnaches\nkivas\ngeus\ntanzi\nshamba\nbhuvan\nballhaus\nbetton\ndraven\nardeer\nkempt\nsharifi\nhegan\nontv\nprominant\nsiegman\nhattusa\nverticality\nborey\npieman\npiñon\nburgee\nsatirises\nstretchy\nmasher\ndoubtfully\ngarat\nbbt\nevr\nketton\nrerecording\nterabithia\nkaycee\noutranks\ntennison\nmocho\nmoskal\nchickweed\ngunjan\nbénédicte\ncatala\nmetacognition\nlnt\ncarrickmacross\nnixdorf\nbuffel\nczajkowski\niatse\nedg\npriddis\nunidas\nredouble\nvaulx\nslammin\npretences\ntouchscreens\npervious\nevernote\ntimecop\nblankety\npudge\ncabalistic\nyamini\nstewartstown\ncoaxes\narmonia\nsellouts\nyongfu\ntrembled\nbridgepoint\nlactones\nhandal\ncfx\nmeka\njansky\nmickleover\ngutai\ncommerson\nnitromethane\nsmarmy\nbookham\nswigert\nixtapa\nasobi\ncono\nheyns\ntoome\ntrumping\nliborio\ncanan\nfinfish\nmaccabee\nsanur\neilenberg\nurt\neglantine\ngami\nsolders
\nalsea\nsilverthorne\nsoubirous\nburgmann\nmalvolio\nbatsu\nmataura\nrediscovers\ncyberia\ngewürztraminer\nfarka\noutpace\nwenche\noleta\nbocaue\nshalford\nrecoded\nprofeta\nmocambo\ndeedes\nmonklands\noir\nparvanov\nshaya\nkrop\nafflicts\ncorcuera\nrepolarization\njamón\nnagpal\nmatfield\nlillies\ndicarlo\nrodinia\ncogently\nbalsillie\ngallitzin\nenglishes\nrahmi\ncastillejo\ncheckmates\nayatullah\nmajalis\ntermoli\nmgd\nalleman\nnoseband\ndunluce\nlaia\nadelaida\ndahrendorf\nreworks\nshant\nbobbed\ngallinari\nhammerbeam\ntaldykorgan\nultimatums\ncaple\nmaryon\nbungoma\nkillybegs\nunimpaired\nblurriness\nheijden\nwaaay\nstolper\nmichalka\nchongjin\nbushmills\ngamely\nknaus\nmanian\nsapientia\nkinsolving\ntransgene\nsteffensen\nshaara\nmannesmann\nmicrogram\ndelas\nkarakalpak\nsofya\nciclosporin\nmethvin\ninfectivity\nirbm\ndraves\navari\nhenlein\njavafx\ndauphins\nbolloré\nbadon\ngravediggers\nleichter\nshunyi\npuntos\nboudou\nsawicki\ndoke\nplenipotentiaries\nappétit\nuntroubled\ndonan\nturpitude\nyellowman\ndeano\njansher\nkinkladze\njacksboro\nharyanto\ncorollaries\nkarz\nrhead\nroguish\nliran\nseawright\nmorawski\noutlands\nbarcoding\ncaudillos\nbabergh\nijsselmeer\nmarjane\nonenote\nconfabulation\npople\nkidapawan\nscrying\nsabry\nwsg\nebonics\nsexiness\ndelton\nupb\nduckweed\nchessie\nblitar\ntuxedos\ntaveuni\nreplicative\nkununurra\nspectrums\noregonians\nlfr\nguymon\npresqu\nsteny\ntainting\njaarsveld\nscea\nraviv\nbrigida\nsemiahmoo\ndeconstructive\nliquefy\nclouse\nbolinao\nbutterfat\njovanotti\ntravelocity\nhackford\nayuso\nmeghnad\nmarinaro\ninox\nodst\nweidler\npelini\naeropuertos\nkettleborough\nbrownfields\nbirra\narantes\nkdfw\ngennari\ndollie\nlotter\nlynge\nnautique\nluhan\nlrn\nszechuan\nstruthof\nheg\nebell\noxfords\njahi\nschtick\ndoradus\nmurasame\nfihri\nbillig\npassionless\neliminators\ndolton\njanik\nzilina\nsnm\nhoardings\nokinawans\ndovetailed\nwennington\nelz\nkiger\ninferiors\nchromatically\nrheinberger\nheyde\nnedelcheva\nreapplying\n
gleick\ntarnow\nfedir\ncedartown\nmoschino\nxiaodong\ndrawbridges\ntrabecular\npeeblesshire\nsloe\nlochinvar\ntejanos\nlyonne\nchiodo\nboit\nrailfreight\nshapell\ntrailblazing\nkoral\najr\npicchi\nbagerhat\nhyp\nemdr\nferrario\nminga\nremunerative\nsaturno\neffaced\nadzes\npreternatural\ncoeruleus\nnabulsi\nloveliness\nsoltis\nwttg\nburgdorferi\nsnips\npangkor\nilena\ndesportes\ngardella\nroamer\nfoale\nprovera\ntempests\nwormley\nsconce\nrimer\nryegrass\ncarto\nschapelle\namte\nabiko\nfgr\ndumm\nmudflap\nlethally\nfordingbridge\nkantei\nagentur\ndevelopement\nmandie\ngeoscientists\npiccirilli\nthrelkeld\ntuu\nprados\nvaltonen\nbruff\nfairlady\nsobo\nbreckin\ngreenhithe\ngearóid\nsymetra\nolivers\nassal\nluuk\nsalaria\nalcocer\ndausa\nkatri\nyindi\nhockaday\nochieng\npunchlines\nashtiani\nlann\nwinogradsky\nsaddar\nbunder\nfootholds\nbalderston\nskai\ngowariker\nshetlands\nvermeule\nzanotti\nmistle\nlogbooks\nulladulla\nequalizers\nbently\nafpfl\nnagayama\naudiophiles\ntoan\nchalybeate\nplenitude\nraiganj\nbrabin\ntitanosaur\nbuffed\nharanguing\nsnellville\nmangu\ntaipower\ngraininess\nstillingfleet\nkhoei\nklages\nweymann\nhsf\nnones\nbioidentical\nnrcc\nedendale\ndorrego\nlauritsen\nindicting\nkymi\ninès\nbabia\nkaida\nmyelitis\nhlm\nfunktion\nsamueli\nmiscreant\ncapricho\npasek\nfabares\ndraconic\nväyrynen\nacutally\npachauri\nrestocking\natelopus\ndarel\nnghi\nadenomatous\ndefeatist\nndong\nshirted\nmosuo\nsyreeta\njewess\ndornblaser\nbouba\ninspectah\nmuggles\neyespot\ncuticular\ndeodar\nduprez\nshehata\nshahed\nypsilon\naseem\ndeayton\nheitz\ngoncharova\norko\ncarma\nscitech\noel\nmyres\nmintoff\nllibre\nryuzo\nuberoi\nphotoshoots\nlable\nmessala\nactinic\nbaronne\nmullock\nhinsley\ndolphinarium\npolarising\nembolus\nbiehl\nlaie\nsativum\nsendmail\nbergere\nreselected\nharbourmaster\nteensy\nmattu\nbartolozzi\nvivos\nradovic\nfevre\nsfeir\ntrossachs\nsachet\nanklets\navic\nfeh\ncaseworker\nasadullah\narguelles\ncorriente\npenrice\nquain\nextempore\nhamdy\nunde
finable\nsummerton\nchurchills\noursler\nrusudan\nkere\nshumpert\nchiavari\nrelator\nkimya\nbino\nnocturnals\nrse\nquitter\nczarina\nrotch\ngarbett\numaña\nshlaim\nadobo\neddisbury\naini\nmuttered\nkronenberg\nskehan\nkosminsky\nbibo\ncorreio\nzahran\ncutlers\nnabila\nlyf\nbeckingham\nleuschner\nraees\nrawsthorne\nsubir\ndesecrate\ncorriveau\nmeteoroids\njaideep\nsweltering\nhijau\nnightlight\nbema\ncraigs\nanubhav\nnaunton\nasbestosis\nmatsusaka\npilson\nschendel\nfaten\nearlswood\naldona\ngujjars\ntygart\nproteoglycans\nrebuttable\nazizah\nlaven\npirbright\nnru\njayce\nsouverain\nfutenma\npame\nrathke\naoshima\ninterweaves\ntemperment\npenley\npontin\npiché\nindoctrinate\ntomblin\nprotractor\ncodebreaking\npettman\nhobyo\ndule\nkilljoys\ndaca\nepeli\nkenda\nbransford\ntranslucency\nexhibitionist\nlasswell\ncooperman\npreethi\npolmont\ncamac\ndimidiatus\nmizen\nforktail\nriccione\nchac\nzandra\nthomaskirche\npingyang\nkonarski\nangelicus\nrominger\nbedsit\npharmacotherapy\nliaocheng\nverco\naidoo\nprzemysl\nmaiwand\nlonghand\nsheene\nleybourne\namrum\ncallery\nwaybill\nbradleys\nmanono\nhypoplastic\naykut\narencibia\nflorham\nmoneim\nperiodontology\nadriane\npenola\ngnomic\noutlasting\nspeedier\nleijonhufvud\nhsls\ncremate\ngushed\nkolin\nwensum\nbittaker\nsolukhumbu\nbabek\npiggly\nmantello\nvha\ncampero\nweisskopf\ngivi\nwhistlers\nrecoils\noverestimation\nrheged\nappearences\npaps\ngetup\nstrader\nmisse\nappstore\nunimak\nhewison\nwiltord\nhypokalemia\npunctuating\nekins\nhocks\nstriatal\nclunie\nbeaky\nchowdhry\nnoia\nnextera\nsamani\nmealing\nawana\nsaldivar\nzephyrus\ndickstein\newg\ncoppel\nsessue\nsirois\neckel\nduerr\ngoforth\nitg\nyupeng\nbarnhouse\ncrazily\nfacetime\nhinchey\norac\npasdar\nramprakash\nlaurindo\ngarrels\nzatara\nobscurantism\nbisham\nlaudate\ndemolishes\nallsup\nsamaraweera\ntoux\ncarpeaux\nmikolaj\nnamechecks\nneti\nendecott\nccv\nhurtin\nserafinowicz\ntowners\nbickerstaffe\nrantala\ndivis\ntheorbo\nlowball\nperoneal\ncerdan\nkerzhakov\n
monas\nkiritimati\nfoti\nrostenkowski\nsandblasting\ncastrillo\nwireshark\nfawzia\ndeterrents\nbended\njamesville\nmonemvasia\npogge\nnadan\nflashover\ninal\nmisperceptions\namsterdamse\ntheophylline\npreparer\nphilanderer\nsalzberg\neska\nthromboxane\nhalman\nklayman\nbrade\nconversano\npito\nfilan\npurcellville\ndeis\nboxee\nwhe\nitworld\ncoggan\ncoche\nfederale\nguralnick\ncrawshaw\npetropoulos\ncerts\ndld\nabductee\naskwith\ntokuda\ntenaglia\nsamp\nbongani\nbinah\nmuley\noutweighing\nermenegildo\nmambas\nndaa\nboulos\nroditi\nsynaesthesia\nyeston\nvolkert\nabadia\ncommunicant\nlunatus\nhoodoos\nhartle\nnuru\necorse\nscaphoid\nscdc\nrepossess\nmedinet\nzookeys\ngreiff\noligomeric\nuttley\nnizwa\nfairbridge\nddf\nkomu\narmel\nnatation\nlanzmann\nmetlakatla\nlaforest\nincomprehension\ndepuy\namamiya\nfatten\nmazzocchi\nintranasal\nantwi\nvolleyballers\ngadgetry\nevenhanded\nenp\ntaimanov\nmaeght\ndarwinii\ntransformable\ntacy\nkolber\nnowlin\nchampaigne\nromanée\nolafsson\nwaksman\nbirthe\nwesternised\ntrapt\nwhereever\nmeike\nserica\nantartica\nengelen\ngiard\npenck\nanticapitalist\ndisinhibition\nherbology\nwinterfest\nlovells\nvalenza\nwmur\nconcessional\narmorer\nquong\nyasuaki\nsft\nkiana\ntragical\ncashiered\nlescott\nguben\nmarthinus\nhavasupai\nkmsp\nagonising\nsteane\ngwas\npaschke\nratri\nbaah\nkoskela\npolitzer\ntrudell\nwendie\nmirek\nwaiouru\npolyrhythmic\nmoffit\nmeshach\nspellcheck\nshafto\nfarnaby\nuthai\ntappers\nunalterable\nkomala\nmris\nborers\ninhalant\ndrivable\nnishad\nlampton\naune\nfior\npolyandrous\nhypnotherapist\npilat\nnorbeck\nbte\nagonies\nryazanov\ngittin\ncapelin\nnotaro\nmakina\nfastenings\nkalra\nalfresco\nsichel\nmakro\nlese\nsacca\ndesiderata\ndjan\nklump\nkulla\nsimancas\ndiamanda\nmaleficarum\nrestock\nhaberdashery\nmalakhov\njuggernauts\nchads\nseyss\nlycopene\nmorsel\nschuld\nstratotankers\nmakita\nzobrist\nhengshui\natlantida\nstuhr\ncarnavalet\nvatos\nfertitta\nshujaat\nageist\noastler\njom\naloisio\njayna\ncariappa\ndemou
ntable\neyeliner\nmwi\nhudaydah\nleawood\nmodul\nnatcher\nbobigny\nadia\nstarmer\ndagenais\ncound\nhenequen\ninterferons\nuwp\npodujevo\ncargos\nfrewen\nmehreen\nsellier\ntuberculin\nmoerdijk\npneumonic\ninos\nlyoto\nhcd\npeoplesoft\nseun\nunomig\nlargish\nsleng\nclyst\nmadaris\nsodexo\nunderreported\nlovestruck\nmatchings\nrepays\nwudu\neurocentrism\ngermont\ndesautels\nbeland\ntorito\ndekel\nwillcock\nwelsby\nuncompromisingly\ntrichloroethylene\nlakha\npohamba\nnich\nindec\nconfucianist\nchipstead\nterrana\ntanqueray\nsanhe\ntelematic\nportents\nseveso\ncontostavlos\npolitican\nhitchen\ntiberiu\nteneriffe\necri\nsews\nchesser\nboisvert\nsophiatown\nokonkwo\nvolcanically\nharfleur\nbodman\northotics\nmelhus\nvaswani\nsamarai\npipp\nkawakawa\nrecessional\nlowie\nrosetown\nelvan\nkufr\nbrushfire\nmehler\nfloridsdorf\nbloodrayne\npunx\nnihang\ntutankhamen\nbayardo\nnormandin\npuebloans\ncolen\nwaage\ncursors\ntugendhat\nlammy\nkaragounis\nluchetti\ncottingley\nshunichi\nparce\nroxanna\npanaro\npengcheng\nconvoyed\nbadarpur\nzafira\neha\nzaventem\nverdana\nreser\npagés\nzhangjiajie\nexpressjet\ngerontological\ngovindarajan\nrtmp\nzipp\nalroy\ndoyon\nmoonstruck\ntsou\nnewlove\nkitch\nstrongsville\ncapitani\nberghe\nluteal\ngliomas\nzenimax\npropoggia\ncinar\nfinancière\ntrivers\npresentment\nmaldivians\nmotril\nmetasearch\nworkshopped\nchiru\nnorouzi\najla\ntorme\npennsboro\nwellknown\natapuerca\nguage\nsukarnoputri\niannone\nbober\nautotrader\nschreker\ngaydon\nthoburn\nmakaay\nunarguably\nvicenzo\nhammondsport\nmirs\nevia\nautomat\npontins\ncloying\nlensed\nimaizumi\nmomentos\nprivada\nalwi\nwearisome\nvaldo\nfierceness\ntatupu\ntahira\ncrossways\nashcombe\ndantzler\nprobaly\ncineworld\nnetra\njhangvi\nilker\nheister\nokkervil\nsalbutamol\ncomprimise\nshahdol\nminott\nalmunia\nchildlessness\ntemco\nlawrenson\nnomenklatura\nreinvigorating\nmisremembered\nassonance\nfarington\nkeenness\nentreat\nnorthstead\nlogi\nlumby\nhyneman\nmorganti\nworplesdon\nafcs\nmonopolised\n
atau\nreinvest\npitiless\ndubuisson\nonlive\ndiederich\ngolenbock\nporcia\nconvy\nmontross\nmadra\ntweedledee\nmadson\ngoleman\nzaun\nparminder\ngovett\nwanderley\noppressions\nitaldesign\nmadjer\nstuntwoman\nfulks\nabendroth\npainlessly\ndespising\ntobie\ndalgarno\ngrupos\npainswick\nharkens\nthiess\npaled\nyorgos\nbeeper\nzarzuelas\nperfil\nteppo\nmodernista\nlishui\nganatra\ncohabit\nhotelling\nbinzhou\ntakato\nchetak\nvicks\nlitem\nbleyer\ncomplainer\ntrichoderma\nritsema\ndieren\nfaln\nmalpractices\nwearmouth\nflamingoes\ntelepictures\ndressen\nppaca\nherschelle\nleitmotifs\nstoltzfus\nschieffelin\nnoticia\nmbah\nschizophrenics\nguiraud\ndimmitt\nzippel\npuga\nfuxi\nejaculatory\nalagna\nchafin\nsalvadorian\nparkey\nkable\nmoed\nleam\nclouet\nbahuguna\nexperimentalist\ncomores\nhypertrichosis\nbeardslee\nharriss\nmilks\nlubezki\nravings\nketoconazole\nmakemake\njenne\nserguei\ncnx\nfarahani\ndonzel\nflamand\nmeurer\ngades\nkaige\nnishina\nbizzle\ntendler\nfusa\nmckerrow\nnakamori\nburmans\nmanvel\ndorantes\ntrippi\njenkyns\nconnellan\naspel\ndidon\nmodarres\nshucks\nfoisted\nschelotto\nwhicher\ncalibrations\nnyein\nbretz\nbraaten\nnamiki\nshpilband\ngiao\nsindy\nthirlwall\nsometing\nparaplegics\nscuppered\nparamor\nskoog\nkoibito\ntamponade\nschaik\nallahyar\nslipways\nhareide\nbridleways\nbeckert\nclandeboye\nupavon\nreallocate\ngàidhlig\nkratzer\nimmensity\ntweeters\nreintroduces\nsaghir\nvmas\nmcclennan\nfourfourtwo\nterrie\nnankin\nworkmanlike\nkentwell\ngamel\nclynes\nsuja\ndarrah\nrottenberg\nhaidari\nlattuada\nrangoli\nmahra\nneurophysiologist\nautore\npiloto\nveerappa\nbaró\nstrolled\nszubin\nserletic\nfow\nbarnack\ntitcomb\nlihua\nmelanocyte\nnlg\nclaeys\nclausnitzer\nbestor\ncleverer\ncoloane\nmoghaddam\ncallejas\nibirapuera\nmildren\nbylined\nnickey\nrejigged\nreacquainted\naughton\noppel\ntendancy\nborjomi\ndziga\ngladman\ndarci\nburla\nbashung\nconcussed\ngeth\ngitai\nbattlecry\nanari\nsodje\natlassian\nsja\nwarroad\nunde\nvakulenko\ntriplette\nunte
levised\nchrists\nbattledress\ntahiri\nbaban\nannakin\nimagers\nhacktivist\nideapad\nkruis\ngowerton\nbrantingham\nrandow\naimo\nharpeth\nslutty\nsublimate\nflorilegium\nosyth\ncrowdy\nlatynina\ngandon\natrazine\nkibler\ntatian\nparanthropus\noptique\nsoundclash\njrb\nhideko\npichardo\naccuracies\nkempff\nfurosemide\nsimcox\nmagidson\nkomma\nboroughmuir\nbouvard\nlindsborg\nwaterlogging\nassuaged\nrookeries\nstigmatizing\nkleinberg\neccleshill\nmedha\nwismer\nlader\nbagnold\ndevadas\nphotodynamic\nfathy\nmariazell\nnarrowcast\nmallrats\nvannier\nleniently\nminutae\nspeakeasies\nupperclassman\ncoontz\ngrooveshark\nishizaki\nuggs\nsolaar\nelantra\ngrabner\ndanz\njyotsna\nshrivenham\nkolwezi\npeppa\nnicotero\ncerrato\nnazri\nmrta\nrezko\nraisman\nbobi\nhalmos\nputamen\ncimber\nlepidopterist\newr\nunreinforced\nbeidou\noutstrip\ntrailfinders\nsinews\ntreponema\nnormalise\nkanne\ngarima\nfbf\ntellin\nbaima\nbolly\nblassingame\nzoa\nholcim\nmatthewson\nhelgenberger\nbluing\nldcs\ncatchiest\nfrontotemporal\nbunkie\ncintron\npastrami\nrarick\ncopyrighting\nobsessional\nchilmark\nraif\nsiders\ncraggs\noutran\nwakatipu\nikey\ngey\nconfigures\nstubai\nfingerprinted\nbeamon\nlaslo\nguingona\nkutti\ndoggone\ntakfir\nuriburu\nkeet\ngameboy\nhepzibah\neffete\nhyaluronic\nversteeg\ndearman\nusareur\nschelle\nverbeke\nmckinnell\nguerrini\ndelis\nunfaithfulness\nslingshots\ntawton\nrabha\nrisala\nnamer\nfranzese\nfresenius\nkupferberg\nerbakan\nbongiovanni\ncarousing\npalynology\npharmacia\nxanana\nkyne\ntalo\nwatsonians\nwarte\nserendipitously\ncoller\ncespedes\ngilmar\nhungnam\nglobalist\nmultistep\nrubis\nkingsolver\ndeepsea\nbyner\nsamaya\nlaima\nbuisness\northographical\nsignallers\ndiagon\ndunrobin\nkenzi\nocker\nwallenius\npaichadze\nteachout\nmckillip\ndemagogues\nbodi\nboxster\nilea\nroha\npentridge\nfarmworker\npeacefulness\nbelon\nsadanand\ngreenburgh\ntekoa\ndisablement\nwayuu\nverdone\nmarocco\ntarjei\nerwan\nnoemí\nlaks\npince\nverdonk\nusumacinta\nyoan\ngoz\nkafa\nheph
zibah\nsigar\nbrioche\nojala\nophiocordyceps\nitalico\nsnowfields\nmikkelson\nregge\nfalardeau\nobliterates\njowkar\ncoolock\nlongwinded\nransoms\noverwrites\npropanol\nspaz\ncompetencia\ntranshumanists\nlattanzi\nbrecksville\ncleanstart\ncolunga\nchasez\nyanko\nreshoots\nveys\nzaps\nyeronga\ntrefor\nellum\nhuffy\npagels\nforensically\nmoynahan\nakimbo\ncapitales\ndmax\ncowering\ntaints\nseretse\nmacbain\nceferino\narpeggiated\nkroto\nringsend\npokhran\nsiver\nmanzanilla\noverdriven\ndownstage\nrubinson\ntexeira\nmadron\ngohain\nknope\ncasanovas\nabz\nkahala\ngenscher\ncabazon\ninstitutionalizing\npositronic\nshewan\ntubac\nheid\nlevertov\nriv\nchandrasekaran\njakobs\nmesrine\nsabia\npebbly\nkarama\njole\nmispelled\nplayaz\nnewsflash\nkolleg\nbarilla\ncloonan\ncuello\nundisputable\nphosphoinositide\nhaacke\niliff\nbeggin\nnunhead\nbienvenu\nlecky\nmilliliters\nbachelard\namaretto\nlongform\nsyal\nløkke\ninternode\nmodano\nniceville\nimperiali\nlutton\nlevchenko\nnaqib\ngreenan\ndharm\nbrattain\nsavall\nsubbing\nautoweek\nimpositions\nrattana\ngudauta\nmanneh\nketola\nbubb\nbovids\nkamani\neichinger\nallers\ncrocuta\nchristophorus\ngripsholm\nclubb\njostling\nmatiz\nmcinally\nstaithes\nzollner\nboscov\nirretrievable\nkatif\npalmiro\nshugart\nsouad\niob\nhongqi\niwase\nchone\nallana\nwindblown\nahlam\nknopp\ncilea\nkeizo\nbujanovac\ndonora\nbauch\nsabol\nsaner\nnawrocki\nrecapping\nschlichter\nlizz\noisin\nflims\nincinerating\nflickinger\ntamarindo\nrelabeled\nampk\nmelanistic\nplatino\nnansha\nnitroglycerine\naçaí\nellerton\notake\npasini\nnihad\nadamov\nhubel\naltough\ninayatullah\nlycidas\ndrepung\nwrongdoer\narthroplasty\nvilledieu\nkovacevic\nmonje\nrondine\naymeric\nnordine\nherrman\norelsan\noguchi\npushin\nleval\nhydrometeorological\nesade\nsibert\notávio\npigeonholed\nhaagen\nspouted\nclines\nlevuka\nlightweights\nchiusi\nwayde\ntetum\nenoki\ncottenham\ncaver\nkhur\nlithospheric\nakande\nprimeros\nbabbington\nentorhinal\npaparazzo\nnanas\ncassina\ndozy\nbolli
\ngramme\noroonoko\nfluidic\nkhadra\netiam\ndeheart\ncollignon\ncampesina\nsacy\nrubberband\ndno\nnebraskan\nmurderess\ndeak\ncambay\nlile\nsheetmetal\nmundaka\nunitel\nchunyu\nmehri\nceol\nbilis\nwertmüller\nzoli\nnavenby\ndongjin\nextensa\nceasefires\nrzewski\nnegligable\nlasek\nspattered\ncaballito\nmyhill\nwoggle\nlarghetto\nsebastiaan\nbascombe\nshotter\nekstra\ngillham\nbeaujoire\nbronchus\nharon\ngyngell\nartas\npennsville\ngayane\nnordfjord\nskybridge\nyusei\nfifes\ngrenadine\nrobic\nconsequentialist\nbishr\ngrafica\nroffey\nbruyères\nbucksport\natre\nrehashes\noddness\nemigdio\npoteat\nyutang\nvasaloppet\nbaharestan\ncurbelo\ncarraway\njolts\nköllerer\nvratislav\nlycan\ntarte\npanchito\nnopal\nstegemann\nccbc\ntonkinese\ntantrik\nmavin\nmcconnachie\nbafut\nonaga\noblomov\nignorantly\nkuwaitis\nsibs\nasanga\nlown\nworryingly\npromusicae\nranganath\ncorpore\njauhari\nprintouts\ncorra\ncomparatives\nkuria\nnali\nreassumed\njazzman\nbrüggen\nnealy\nedonkey\nbiotechnologies\ndavalos\nalani\nkaushalya\nmockler\nkakuta\nsakar\nchiappe\nclassism\nericksen\nimpracticality\njetman\nglr\nkitti\nujung\nwarmington\nsupplications\nrushent\nnewsgathering\nkereta\nluiss\nbioavailable\nrmd\njailbird\ndrach\nftf\nwadud\nshinyanga\npucker\nfargas\ndrams\nnewsquest\npepitone\nahlert\nhornady\ntatlock\nlincolnville\nmindel\nfalsifications\ndulal\ninterop\nberthon\nladybirds\nhajjar\nsandstrom\nsuperbikes\nmousquetaires\nkinman\npalmi\ndespertar\nkinchen\nptg\nisitt\ndronning\nfakery\nstraton\ndownturned\nsurcharged\ncommemoratives\nlinington\nsensex\nneoteny\nasq\nsnidely\ncarluke\nsummertown\ngalkayo\ngiorgetto\nliquide\nkuntar\nachingly\nobamas\nestenssoro\nwrenched\nsawley\nroesch\nnegin\ndignan\nreproved\nregier\npachi\nazathioprine\nproblemas\nsuturing\nsynuclein\nurkel\nwinne\nvolkszeitung\ngrumps\nmardones\nlippa\nbeliefnet\ngruenberg\nmeador\ninm\ncommiting\nmufflers\nepitomize\nischaemic\nranft\njoannie\nchenoa\nkameny\nfangshan\nromanow\nlillington\nbricktown\nvisuali
sing\nspreaders\ncastner\ngimmes\ngiono\nfarhang\ngabay\ntanzim\nquestor\ndemodulation\nfransson\nedfu\nmytilus\nbouch\ncountach\njeshurun\nnocioni\nmells\ntwelvetrees\nrhinehart\nbiram\nankang\nabbeydale\ndalan\nndn\nwhin\nibas\ntranent\nsarlat\nislamiya\ncaimans\nmiras\nandreou\nrealest\nshahriyar\ncleto\nohioan\nbulo\nelga\nwisborg\nfloy\nmallaby\nrongji\nbarchetta\nherseth\naberthaw\ndevildriver\nsaval\nsatrapies\ngeodynamics\nstegman\nrearming\nbethke\ntitillation\nfaubert\nsorgen\ndestructing\nbagshawe\nedgecumbe\nmckoy\ncoulombe\nlogarithmically\nshrader\ndinge\nconundrums\naranha\nmurmured\naslo\nchooser\nsprawls\nmisrepresentative\nnafplio\nrecyclables\npiatek\npartage\nsanctifying\nerrs\ntangen\noxcart\nfaregates\nscottrade\nwurman\nleadup\nsirico\nfinless\nfastness\ndittmer\nnesters\nmpvs\ncolella\nviljanen\nmountable\nkadisha\ncopped\ngoethite\nchiens\nbavay\nyouthfulness\ngimblett\nteer\naberfan\ncoastwise\ngreenwalt\nmesserschmidt\nmassu\nenergomash\nhitcher\nyadollah\nbreslauer\nkodjo\ndibb\nillogic\nerewhon\nollier\nsidamo\nmajorelle\negi\nrennick\naronowitz\ndiokno\nmopped\nzakuani\nicsa\nbertsch\nblazey\nhuapi\najmi\nguss\nvmo\nmuzzled\nkucera\nunsubtle\ntheisen\nseikan\npharyngitis\nbalashov\nakroyd\nbiderman\nbesler\nsavitch\nlujack\nunreasonableness\nlutein\nexactitude\nhighwater\nanns\nsjp\nkhryapa\nbaquedano\nsanteria\nelpida\nstraffan\nrheidol\ncroshaw\nsettipani\nsdd\ncambiasso\nwheelbases\nnaj\npressey\nrubbra\nembraceable\ngellért\nblatently\nmerchandisers\nseussical\nceausescu\ndaido\nsamah\naylin\nyinka\ncoucher\nmerseybeat\nbloomsday\ncordyline\nodierno\ndeselected\nshawbury\ngizmondo\nwoodthorpe\nlaurynas\nferid\nkarpenko\nhenryi\nballena\nabdiweli\nbairam\nalbendazole\nsorong\nshahani\ncoprolites\neastin\ncheops\nbreggin\ncoracle\ngarbajosa\ncapitola\nfrane\nchedworth\nganze\ndubber\nsadaharu\nthreadneedle\nosen\nmarignane\ntownsley\njagannathan\nfreakonomics\nhosta\nbeatitude\nparthenogenetic\nshinhan\nceja\nphrasebook\nwiranto\nirob
ot\nshrovetide\nrotich\npedy\nconvulsion\nconcocting\ngerizim\nauris\nfrumkin\nterezin\ndither\nlfn\nsubsidizes\nkarrer\nmedemblik\nludd\ncastafiore\nsajna\nnickles\nnovoye\nbeitou\nlevonorgestrel\neddin\nfreighting\ninne\nannelies\nohlson\nembalmer\nkhad\nchisolm\nmashriq\nfazeley\nsoglo\nnabih\nrancorous\nkehinde\neatontown\nswaddling\nrodenbach\nsnedeker\ndrey\nbirdied\nmisri\nsxc\nsrdjan\nferney\ngumede\nwaitomo\nclearlake\nress\naristocats\ncatie\nmartland\ntrainable\naffronted\ngintoki\nmenounos\nrouts\npowerset\nbillson\ntonie\ngosfield\ninterstitials\nmoniuszko\nbilski\nellett\nsubstantiality\nsilwan\nteraflops\nalginate\nruvuma\nbirner\nivanpah\nfebres\nblaschke\ndoodlebug\nyakuts\nmusc\nneversoft\nreconnoitering\ncrispian\nalphonsa\nantiprotons\nsamye\nyoshitomo\nfluorouracil\noaten\n‚\nmeadowcroft\nkyprianou\nnamtha\njyske\nloudmouth\nflamini\nibne\nbarsha\nmorinda\nlyndale\ndownshift\nmikati\necolo\nkurányi\nbjc\ngeismar\nbraeburn\npedretti\ntuum\nsowmya\nmckeehan\nboortz\nfanmail\ngarcin\ngothel\nkarmiel\nbradgate\nmanzur\nsycophant\nsparknotes\nwaard\ncasilla\nmétier\npappano\nkovalam\ncarini\ntopolino\npettibon\nmaginn\nfike\narye\nspera\nuniti\nbernardes\nnivel\nannville\nbroodmares\nvaluev\nverghese\naccompanists\neddowes\nseydi\nazizul\nladon\nsiaka\nemsley\ntompson\nwaggle\nparried\nofficine\nbuba\nmeasham\nzentner\nhatami\nbonefish\nscarboro\npiccolos\nierapetra\njego\nprimigenius\nvitorino\ntylney\ngiuly\nberson\ntwingo\ntyo\nfirmicutes\noln\nmansergh\nikonen\nskeete\nbleakness\nanesthetist\nmonstrously\noffsprings\ntemkin\ndonnées\nbuffing\nscba\nkumul\nshahjalal\ntorrez\nbreathers\nanticorruption\nebensburg\norbitz\ncaines\nesten\nmanuelle\nhesitations\nmortimore\ngraun\ndary\ntomasso\nnaproxen\ncwfc\npinpoints\nacuerdo\nimpalas\npolvo\nmehlman\nhisa\nlesmahagow\nrapanui\nhoneymooned\nbanky\nsoftest\nhaiyang\nepigastric\nnzz\nepiscopalianism\nschildkraut\nquemado\nbekenstein\nranna\nbaltz\nheartening\nbarrot\nparrington\nkyoji\nkesel\nmeux\nma
ndeep\niren\nfraunces\nditchley\ngeren\nshortlists\ngoopy\ncambia\nedvald\nedmonstone\nnamibians\nvollmann\ntornatore\nbrik\nbranstad\nkunle\nhoulgate\nbernay\nmorato\noffen\nrockcastle\nrivolta\nblunstone\nmbf\nhayseed\nsugarbush\nhards\nknollwood\ncampello\npontifice\ncgl\nseverly\ngillberg\nfamke\nforestation\nkruglov\negawa\nsupercharge\nrassam\nprolix\ngheluvelt\npequannock\nbeko\nelfrida\naugurs\nmutatis\nbigamous\ncrosfield\nhalberg\nframer\nwatusi\nfod\nacq\nevangelic\nbunkering\nbalilla\nuttlesford\ncsj\nmarinello\nballymote\nrockaways\nmattaponi\nheeling\ntck\nsandham\nendocannabinoid\nloonie\nolshansky\ntanked\nhydrolysed\nbaldwins\nrichelle\nalimentarius\nassyriology\nclingan\nhandicapper\namna\nfiyero\nbernardsville\nviloria\nbolten\nthunbergii\necko\nsquiggly\ndigitech\nbika\nrebecka\nragstone\ntransborder\nmillepied\ntbg\narkenstone\nnavneet\nikaria\ndenaro\nshebaa\ndecs\npizzonia\nhertwig\nmulroy\nmccallie\nkingsholm\nfroud\ngowanda\nmicelle\nkrasin\ngoatskin\niomega\nkochhar\npletcher\nfourway\nbéart\nmeirion\nmuncaster\nkunin\neversole\nillusionists\nklingler\ncorkhill\nfevered\nremorseless\nstodgy\npinsker\nanandamide\nwhitted\ndavidsen\ncryptococcus\nsharpley\nmcmurphy\nicas\nmaness\niphoto\ndarras\nzahira\ngyres\ncoober\nlaminations\nbeutel\npielke\ndioxane\nprojets\nstacia\ninterlocks\njingling\ndisingenuously\nximen\nnonaligned\nintersport\nwreckless\nruark\ncoler\ndavidov\nrhel\nleviathans\njessee\nwomenfolk\ntde\npenicillins\navelar\ncalapan\nignatov\ndicing\nshinsei\nezzat\nmudflat\ninterweave\ndeltona\nnaiveté\ndisinclination\nmangi\nmuzorewa\nlucette\nhybridizes\nardon\nnovacek\nheterodoxy\naunque\nsymptomatology\ngundu\nfati\nmarineris\nsorbara\nprance\ndundy\nzuzanna\namitriptyline\neligio\nbioreactors\nchorion\nazéma\ncolonisers\njaponais\nphileas\nsadhna\namabile\njianhua\ncyfarthfa\narboga\nmalaco\nbuttimer\nsalitre\nseijo\nsmucker\nexcello\nmontcada\nlrdg\nsabharwal\nventi\npantene\nsaheba\nacupressure\nrajahs\nunclothed\nkoroleva\n
ratigan\nbellport\nhonaker\nramkhamhaeng\nhachenburg\nballachulish\nlaprade\nreson\nfreethinking\nespa\nbaruchel\nmorado\npayen\nertel\npramila\nmontecchi\njustiniano\nlbo\ntemirtau\nmutai\nessene\ncharbagh\nzutons\naméricains\nbashes\nasriel\nedinho\nwetsuits\namoeboid\ndissed\ncofferdam\nbhaji\nmenes\npanabaker\ntrespassed\ninfuriate\nétrangères\nnastya\ncooing\nbartenstein\ndelorenzo\nyentl\nwibc\nskinless\nboonah\nkrumholtz\nconsulta\nevy\nhyperkalemia\ngoji\nundulated\ngopperth\nbassani\nauricula\nelsass\ngoodling\nflw\nflashcards\nstowey\nnutz\nsvevo\nmeaner\nhameau\nshigenori\nanhedonia\ngamache\nhayti\npersimilis\nkennaway\nkadiri\nyurakucho\nfringey\nsyndicator\ngaslamp\narizpe\ntiwary\nrancourt\nvitolo\nunfortified\nyazdani\ndoctrove\nlightstone\nthermocouples\ngroenewald\nsybaris\nhean\nsgpc\ngorenstein\nherbes\nmillin\nmosty\nparejo\nkeer\nspeedskater\nferrel\ndonyell\nhirwaun\ntheobroma\ndeserto\nhrbek\nskil\nsirene\nshouguang\nconfabulate\nkhafre\nsoucek\nshoalwater\nrecognizance\nleadbelly\nhurlford\nfeux\nboehme\njamai\nfrites\ncommandeering\ntinicum\nroboticists\nhounsfield\nbalanta\nhagai\nkombu\nloret\nmorgenthaler\nvindolanda\nbussi\ntokuyama\nqena\nsibsagar\nprimedia\ndurrett\ncoarseness\nkurhaus\nhasnain\nbringhurst\nsandwith\nlortie\nalaeddin\nhuracan\nddo\nhawala\nsoini\nhiw\ncowshed\nljubljanica\nrasche\nscandic\nbrahminy\nmoli\nhickock\nhefford\nfedeli\nbilas\ndiscoloured\ntougaloo\nshirogane\nchappe\nnehring\nsanmenxia\nludwell\nkrausz\nkarplus\nverdú\numehara\nsesa\ntamiment\nario\nrigueur\ncharita\nlangtree\nyehudit\nhemer\ncassy\ntortuguero\nsomo\noshiro\nmulticenter\ncormoran\ntamuz\ndcom\nhardboard\nngk\nbayati\nossington\nconchs\nbennettsville\nbarona\nbennink\nblushes\nkahraman\ndeadhead\nroughneck\nghd\npartout\nsommet\ndriskell\nshinchosha\nfinola\nargall\ntsumura\nsafka\nrefractories\nkrome\nwarsop\ngiustina\ncaol\nsomnolence\nkerkhove\nparkstone\ntelle\npummel\ndille\naita\nmkm\nmaton\nprofundo\nxel\nbejar\nexudates\nbrienza\ngt
z\nachilleas\nrelph\nmuntasir\nwfmt\nruffa\nlefthand\nthrobs\nambitiously\ncraigmillar\nlasswade\nkoshy\nsalterton\ngeneraly\nudyog\nmisprinted\ntorstar\nidled\nnearsighted\nbackstabbing\nbalsamic\nosea\nvictimhood\nmariza\namschel\nconciliator\nblonsky\npickthall\ntammar\nkimmage\neufor\nkhizr\nmablethorpe\nwdl\nboustany\nstepladder\nmalzahn\ncoalbed\ncerys\nkwansei\ngericke\nwhalsay\nademi\neardrums\namsterdams\nbletso\nghostwriting\ningratiating\nbacs\napprenticing\nreferable\njovanka\nemaar\nmyositis\nitri\npitied\nchauffeurs\nhuji\nbaus\ncunts\npalko\ncseh\nfehrenbach\njoice\nsbh\nanalytica\npahlen\nslacking\ngoodin\nbanisadr\nschoenbaum\nmatthys\nchimo\nroustabout\norts\nnetizen\nshoelace\nsfera\ntuqiri\ncharmes\nshehan\nthsi\npawlikowski\ncagiva\npeds\nstartles\nsigatoka\npacioli\ncavender\nobayashi\nrabkin\nverno\nbunts\nmeiwa\nkimonos\nbristled\nmonahans\nmcclay\nrallo\nurologists\nyoude\nmasakadza\nfollowill\nsirat\npratas\npunative\nunshielded\nbeemer\nunbending\nunlikelihood\nsward\nnilesat\nhoehne\nogerman\nmicrofilmed\nicelander\nalvida\nmikro\ncheeseburgers\nsteckler\nmercouri\nkouassi\nblare\nmultistate\nacrimoniously\nbelluschi\nlivigno\njetting\ntollefson\npalang\nunobtrusively\nrakovica\nprompter\nretrocession\nhomoeroticism\nminuets\nreturnee\nventris\nzoomable\nsignboards\nlanghe\npocky\nanklet\nqiong\nfranceville\nboire\nrooy\nrecirculated\nbladon\nelveden\ntradespeople\nchaudry\nebow\nbrewerton\njiamusi\nbirchgrove\ntriangulations\nemployable\nmeroe\nkhalis\nnial\nrothley\ncontaminates\nkotra\nhaccp\nendacott\nfidrych\nglovebox\nspacy\nflamboyantly\nseismometer\nyussef\nhughesville\npiombo\nalacranes\nmomotaro\nactiva\nbisharat\nspacewar\ntvbs\nyearnings\ndirtier\nmatute\nkofoed\nantiparticles\nvandalization\nmbu\ncompactor\nwagg\narihant\nmcbeth\ntruncheon\nbioethical\nthoroughgoing\nsteppers\nblowjob\ntuilagi\nccma\nremzi\nspdc\nlunghi\nverbinski\nberr\nbrosseau\navshalom\nesquiline\nbaillieston\nmehbooba\nvidhu\navas\ndajani\neuroncap\nscag
lione\ntroubleshooters\nnitrification\nzipporah\nfilat\nfemm\nmonacan\nhagans\ncrabbing\nrula\nncpa\nraquin\nbyt\npanufnik\nfriendswood\nbergsma\npachyderm\nkuchi\nhacohen\ncasstevens\naurorae\nmallen\ndzmitry\nwiseau\nasado\nskipworth\nyonghe\ndoctrinally\ntast\nmultilateralism\ncattails\nshahdad\nunconquerable\nsequined\ncastmates\nheindel\nmodak\nswanscombe\nhuby\najp\nallardice\ncoran\noin\nmacguire\nsarki\ngubaidulina\nnevo\nbirchington\ngeetanjali\nbrumm\ntracee\npalomas\nbumpus\nbunim\ndira\nroselawn\nezri\nmankowitz\njudt\ndokkum\narks\nisse\npomposity\nvpm\ndemande\nplautdietsch\nkosuge\nnurnberg\nibon\nkalindi\nkeiretsu\nbuser\nnegril\nbaldus\nghoulies\namii\novercoats\nfeldenkrais\ndaymond\nflagellate\nuninterruptedly\nmorsels\ndispersions\npharmaceutica\noverhunting\nmusl\nkptv\nelectrocardiography\nshchukin\nwestdale\njuliani\nharehills\nputeh\nrotundus\nabuela\nkynoch\ntotley\ncizre\nmalzberg\nparques\nxishuangbanna\nagitato\ncryptome\nevry\npwe\nrockier\nhymnbook\nlaterano\nnikhat\ntilehurst\nnginx\nmckeag\nanesthetized\njangly\nwagoneer\nwhippy\nkushwaha\nesata\nstothard\nlowveld\nsubsector\nreformations\nconways\narnd\nbleier\nkinnan\nzaya\nkeylogger\nkhalatbari\nsynanon\njennens\ncrookham\nchildishness\nplb\ncromlech\nsalvages\njarrard\ncrispell\ndisproof\nkinver\nasten\nfarfan\nchamran\nkyrle\nkaio\nrwp\nagonized\nmedem\nstratten\nlavagna\nserhat\nmotioned\nhickie\nmadhi\nbragdon\nhotak\nkasson\ntristeza\nmajida\nwrasses\nlusting\nemlen\nnanocrystals\nmobilizations\nbonifacius\nprayerful\nmccreadie\ndreghorn\nshoehorned\nobviates\nohira\npbuh\nlcpl\nrebelliousness\npaolina\nbaser\nbackwash\nwilda\ndijo\nbolat\nlévêque\nplatero\nlightwood\nchristakis\nfortenberry\nmisspoke\ntremper\nkillinger\nmws\norsola\nphenotyping\ngoltzius\ntartt\nbenaroya\nworkability\nmaille\nuhi\nsanayi\nreyno\nbitrates\nkatla\narchaisms\navnet\nedhi\nvembanad\narris\ndocosahexaenoic\ntersely\nzuzu\nnouadhibou\nundersheriff\nbelpre\nzemlja\nlangwith\nkalorama\nvidler\ndefre
itas\nsarginson\nhousebuilding\nkusano\nsoeharto\nlle\ntrots\normes\navira\ndelahunty\nkatzir\nenthusiasms\nupholsterer\nbanamex\nkiem\ngulla\nbenq\npreeclampsia\nkakko\ntalbots\nhamstead\nbleating\nsunnier\nphenology\nalexandrians\njais\novarense\nbillfish\nhaledon\nmaenads\nmclelland\ntilston\nhanya\nprochnow\nwben\namenorrhea\nprio\nnyren\nunroofed\nmyofascial\nmailbag\ndeuterated\nkatainen\nmonifieth\nnajat\nallura\ncazadores\nblatz\nceviche\nfortunei\nclassing\nstolze\nnoachian\nnesterenko\nstandpipe\nventilate\ngiedroyc\nanoushka\nrela\nthuringiensis\ncheevers\ndrachten\nturca\nbabayev\nrosenstiel\ncopycats\nmensae\nshohat\ntartikoff\nyanqing\nmcclair\nstereophile\npalmitate\nescogido\nlubao\npinchuk\nmelena\naliu\ncicerone\ncolorants\nkilmacolm\nthoda\nciampino\nponda\nhobeika\ncybersquatting\nsux\nmattick\nthill\ntudyk\nbarata\ndch\nwaterspouts\neshleman\nwhicker\nmurguía\nfilipa\ndenice\nmalapropisms\nwhfs\npalapa\nexpressivity\ningrown\nklac\nahava\nlaurenz\noberlander\nbica\nmessersmith\niliana\ntaks\nrockettes\nhitsville\nsyktyvkar\nwitchery\ndipa\njiangyin\nvot\nruairi\nwegmans\nnajam\njuta\ncaldecote\nrighteously\ntuberosum\nkonotop\noare\ntoothcomb\ncilacap\nreichsmarschall\noste\nradiogenic\nrevocable\ngentility\ntimonium\nphysick\nbrolly\nbalma\nalbar\nshanksville\ncansino\nreclined\nbarnsdall\nvno\nhackner\ntiswas\ntorpoint\nhutcheon\nrulli\ndeinstitutionalization\nwolkenstein\nstoup\nosmose\nbourdeaux\nbradys\nator\nhandsomest\nbellah\nustasha\ngaetz\nekanayake\nsilverbird\ngusman\nsistem\nhenfield\ncaradon\nghotki\nblackhorse\nnoko\nreeler\nsummations\nrampaged\nbaili\nupmanship\nstotz\nduba\nheiliger\nclenbuterol\nhofkirche\nbijl\nvfs\nrollinson\npillion\nknole\nbrookman\ncollezione\nshophouse\nstaghorn\ndaoust\nzahrani\ncrossbills\nfiorini\nmurnaghan\nmarcinkiewicz\ngebze\nbeitbridge\nabsolving\nfortuné\nlidded\neyepieces\ncoorong\npmh\nadelino\ndvrs\nextention\nextremophiles\ntausch\neditorialising\nbrittingham\nmarcas\nfeiglin\nwente\ndrenica
\nkelvedon\nbulbar\nmuffat\nendodontic\nconnon\nharner\nmultipolar\ntrimpe\ngabrielsson\ngrabbers\nastraeus\nbelal\nboatner\nempyrean\nrafted\nsampford\nstarless\nkeylock\nchlorofluorocarbons\nvorobyov\nprocope\npashley\nbeliveau\nifma\ncullinane\nmedcalf\nlandesbank\npinturicchio\nlouison\nfurat\nfloatation\nbatshit\nkineton\nfullwood\nhogbin\ndenness\nmikayla\npanchos\nentreated\ngulper\nknaves\nlaybourn\nsaparmurat\nmagalhaes\nsteelbacks\nmorvern\nindice\nwahabi\nunforeseeable\nstrieber\nthreadlike\ngalin\ncuvée\nevader\nmidyat\nusepa\nmockups\njhr\nlicentiousness\ncunene\njenova\ndenature\nmotomura\ngrigorian\namai\nbormio\ndortch\ndeathstalker\nmorwenna\ncombatives\nsequin\nstaiger\nrisperidone\nelectromagnetically\nngawa\nprepress\norginally\nixia\nkilkeel\nbangali\nmustain\ngages\nguana\nfetid\ntrespasses\nwattie\nrgc\nropp\nbayit\nnailsworth\nverão\nunaccepted\nlivi\nkirriemuir\nbavasi\nchiapa\nfeedings\nmaterializing\ngobs\nquare\ntartakovsky\nbatroun\nmendicants\nfallis\nmotihari\nstites\ncarmeli\ncranmore\nimagineers\ndatchet\nacculturated\najai\nblasphemer\nuglow\nhummocks\ncrawlspace\nhugon\nyorkin\namarc\nerkel\ncantieri\nschruff\nflannan\nmemex\nschelin\nitim\nmanifeste\njeno\ntangs\nschoolbook\nfarmhands\ncnnsi\nmicrofabrication\nhotchner\ngdm\natitlan\nquizmaster\nswisscom\nstilettos\nnauta\nthit\nindependance\neberl\ndumper\nnoseworthy\nchynna\nnardelli\nbringin\nhawza\nachala\nelettra\nmoratti\nfaille\nrotationally\ndisunion\nbaggott\nbenchmarked\nleatham\nmontalembert\nsugoi\ngursky\nquartey\ntablecloths\nliaised\ncreamed\ndenaturing\nnonpareil\ngloaming\nleetch\nraffarin\nlauzun\ncaffi\nprovisionals\ntasmin\nvirenque\nnylons\ntropopause\noutstripping\nkennerly\narscott\nvanderveer\nblackboards\nbarril\nparamotors\nhierocles\ntregaron\npostcolonialism\nnostalgie\nceallaigh\ndorot\ncrw\nryr\nhagens\nssid\nsturminster\nshac\nhaier\ngrannis\nlundeberg\ntaihe\nrenouard\ntristes\nazoulay\nhhmi\nfortuneteller\nmultifactorial\nstoyanova\nscheid\ncheerio
\ngraffitti\ntessema\ndigard\ndiddly\nsquanto\ntortelli\ndesmarets\nzedtwitz\nboza\nkishoreganj\nputted\nhaggle\nakunin\nwendler\nfastpass\nschotte\nmckendry\nkhail\ndrat\nkoltsov\njanu\ntabatha\nsambat\nmaxson\nmarville\ntomonori\noperability\nlackadaisical\nsaladino\nburgiel\nskerrett\nathi\npavlovian\nrandburg\nbettws\nmalky\ntaurog\nfarmerville\nbotica\nsonghua\ngoñi\nwoche\npoyang\nkawakita\nreacquire\npicco\nberthelsen\nsicht\nshealy\ndesson\nzuazo\nmasqueraded\neburones\nsunol\nhandprint\nscapulae\nsechin\nsolva\nsteatosis\ntortura\nbossche\nhassoun\ntaizé\ncamoys\ntramiel\ncressman\nsaariaho\nronaldson\nbraggart\nbaso\ntanha\nsociologically\nsokurov\nfebvre\npubertal\norco\nquilombo\nbaryton\ninosine\ndeshi\nkalergi\nnhan\nquiescence\nyoungsville\nmaudits\ndubravka\npeatland\nlacto\nbellingen\ndelbarton\nartane\nkaua\ngoudeau\nchirino\njeanna\nniese\nmutagens\nhtf\nveo\nlangsdorff\ngondwanaland\nsuli\npookie\ncorbière\nrobinia\nhaddow\nalcorta\narowana\nwampler\nschottenstein\ntimbral\nbronowski\nextracurriculars\nsillett\nnorthvale\numrah\njaspal\nzygotes\naungier\nmuspratt\nairdrops\nkuduro\nlarf\nohayon\nvieta\nnečas\nbalneario\nlench\nbeeny\nmajura\nbhusan\nlbv\npogson\nigloos\nraouf\nklip\npérignon\nfirelight\nbnb\nclytie\ntandems\nentangling\nadulteress\ngilham\norigliasso\nleidschendam\nsmashwords\nspiffy\npullar\nwmt\nbelgard\nvisicalc\nbeesly\nraybould\nlieshout\npeconic\ntaubes\ncoquerel\nkourosh\nkulikova\npuddy\nredes\nabbi\nlirica\nchorleywood\ndagang\nmeaningfulness\nbergesen\nwgrz\niacono\nuglies\nhomemaking\ngiesecke\nspuds\npardis\nstratocumulus\ndextroamphetamine\nmorrisania\nerps\nbiopharmaceuticals\nbiase\ndeberg\nbabysits\nregularize\nporphyrins\nmikimoto\nquattroporte\nlefts\ndeaflympics\nboxofficemojo\nkazmir\nranas\ndccc\ngazebos\ncrisanto\nflanery\nbarnstormer\nflm\nwaltraud\nnetminder\nyachvili\nanticlimactic\nexcl\nwindspeed\nspeare\ntrilled\nsiao\nexpanders\nsriperumbudur\ndainton\nkonk\nnakdong\nhoyerswerda\nsaltley\nberd\nnekoun
am\nsylwia\nvikash\nfishtail\nked\ngarrigan\nashan\nwidom\nelectricidad\nnimi\nantimicrobials\ncounterfeited\nmarvan\nlankershim\nkonik\ncfz\nfaludi\nadministrates\nhooijdonk\npioli\ngracy\nfollet\naustraliana\nbacteroides\neccleshall\ninhalers\nbuscando\nlavis\nchiho\nreso\nseagrasses\npianola\ndraa\nfluoroquinolone\nmisael\nisaev\ntalan\npoach\ncrilly\nschneemann\nbacteriologists\nwaffling\ndisfellowshipped\nsatkhira\nbiochar\nheidecker\npolypodium\nyoichiro\nphlegmatic\nyevgenia\naper\nnorvo\northodontists\nmazzy\nsitdown\njerico\nlinstead\nmesivta\nkuningan\nknucklehead\nwhalum\nqaqortoq\ncontortion\nbistable\nbebeto\nswagman\nridler\npredisposes\nsolovki\ngeathers\nanteroom\nturchi\nturnin\nplumbago\naham\nabierta\nparamountcy\nevalyn\nhydrologist\nwif\nskd\nempath\nbrightlingsea\nchipley\nkunga\nrwr\ndystrophin\nbelching\nclerkships\nolivi\nnicolosi\nqueensborough\nbroaching\nfinningley\nstati\nbfn\nudonis\nbroadgate\nluper\nrockliff\nofficiates\nvnu\nandermatt\nbatara\nhustled\ntaqwa\nmwene\nunmeasured\ncryptomeria\njahnke\ntarah\ndevaux\nuif\nmönch\nschlumpf\nrimac\nilic\ntais\nkepulauan\nedensor\ndings\nmalpass\narpin\nsubtractions\nclapperton\ngernatt\ngrès\nmutawa\nflowerbeds\nbegonias\nkanayama\nzuccari\ncombinational\ntroublemaking\nbäumer\npcusa\nturkel\nmagnitsky\naegyptiacus\nlome\nwhish\nlapus\nbonta\nayodele\nbaly\nufuk\nambrosiano\nalse\nchughtai\ndisburse\nzsl\nprasanta\ngoteborg\nnosocomial\ncentigrade\nboondock\nzeni\nnamgyel\nheterogenous\nstyris\nxmb\noaf\nsalubrious\nnitto\njosten\nexcerpta\npcap\naffonso\nloams\nsuhaimi\nmome\nbermudan\nowney\nsenad\nbellino\nshirur\npinkas\nkulan\nsanwa\nspacefaring\nfinanciero\ncharros\nbayraktar\nminders\nerratum\nboning\ntorremolinos\nvisuospatial\ndisproportional\nshefali\nwashable\nboby\nskitch\nmacroevolution\naacr\nflyaway\nperine\nswinson\nfilmgoers\ncounterrevolution\nyepremian\nbagman\nmcdyess\nrollerblading\nvlaminck\ncolyton\nfouke\nkerinci\nvirally\nzoeller\naasha\ncornbury\npavitt\nsnuffed\ng
achot\nbage\ndouaumont\nvinni\nkenway\nactualité\nmujtahid\notec\ncalculable\nbataclan\nvassa\ngilera\nquestar\nvannucci\nqueeg\nequalizes\nsoner\nhewat\ngaula\nsunbeams\npilotless\nrisalpur\nbutchart\ngarrote\npaolucci\nrile\nnativ\nmarpol\nbaracoa\nouchi\nmultimap\nchemehuevi\ntambellini\nouko\ninulin\nnces\nstouter\ncja\npooper\nphalen\ncnnmoney\nturkomans\ndogar\nmikhailovsky\ngrope\nserina\nnette\nsangalo\nvatnajökull\ndiorio\njerker\nalabi\ncolostomy\nlounsbury\nlns\nbyer\nglasvegas\nlifeway\nplatja\ncritisism\nechenique\nallwright\nqisas\nbrk\nfmm\ntoothpicks\nkaela\nmeum\ntthe\nalbritton\nnunda\nwhittling\ntemesvár\ntimar\nkatti\ntransvestism\nheckmann\ndingos\nrodley\nstalagmite\nyoshimatsu\ncrisply\nwaag\nbrütal\nrockstars\norfeu\ndarrang\ngodey\npistils\nmoots\nlehua\nvaladon\nmihaylov\nornithischians\ninterreg\ndemaria\ntorrevieja\nkjaer\noutmaneuver\nwilsey\ntortue\ndzagoev\nefimov\nyoseloff\nriboswitch\npannalal\nchristoper\nriprap\nboulais\nnewbould\nmoes\nfirewater\ntsukahara\nahadith\npaceman\nskopelos\ncantone\ncitronella\nwoodsen\npolycythemia\nmulund\nify\nantinomianism\nkhronos\nrectories\nfantasized\npenlee\nqualm\nclyfford\nnumberless\njuxtapoz\nyaeger\nwiffen\nsigonella\nwhatsover\nbriere\nbathonian\nzaira\nyab\npufnstuf\nseyfarth\ncarona\nthors\npusser\nlacalle\nrym\nsolemnized\nmitar\nkhrunichev\nmarclay\nshandi\nbuckfield\nseismometers\nboese\ntalay\nhomestay\nobuchi\nsantur\nlluis\nmarburger\nbochy\nsokolovsky\ncappel\nemberton\ngerner\ntafawa\nizzi\nlooseness\nmarsabit\nfiddlin\nlipuma\ndullea\nthole\nmtsensk\nporcello\ngrazes\neviscerated\nhatboro\nhurriyat\ntii\nliveliest\nseiffert\nmazumder\naravena\nnejad\nmotes\nwindlesham\nragione\nsylwester\nvolksbank\nearthmoving\nfroot\ndonaggio\nsarkin\nrece\nmillbury\nseyyid\nryen\nicheon\nruehl\nmoliere\nweimann\nczarnecki\nkoscielny\ndaouda\nkuleshov\niacp\nmenz\nfabulosos\nmakeni\ndacus\nqurna\nishares\nmetasequoia\nmiffy\nconfucians\ndufty\npazos\nsilveria\nwrappings\nshabani\nkehler\nmars
al\nsaffy\ncury\nclydesdales\nradomski\nmancilla\nhowison\npesonen\nhoron\nlatsis\niturbe\nfixative\nsundridge\ncompendia\ntyrannosaurs\nassortments\nfeaster\npettah\npreachings\nsnowballing\nschrei\nemmie\nemond\nginola\nkopel\nbegone\ncallously\nevenki\nclec\nnwankwo\nzapad\nmazzilli\npremenstrual\nclandon\nboruc\njobling\nostade\ncujo\nhighspeed\noxygene\nbny\nmavra\nsingrauli\ngilberte\nseashores\nkctv\ngudmundsson\nkoronadal\nmaltais\nteare\nsuprematism\nmdv\notep\nraghubir\nsacp\ntoshihide\nextrication\nbärbel\nscroggs\nbazil\nsamish\nlittleport\nfairland\nlamonte\nnigris\naliyeva\nkaput\nporthleven\nnomex\nkinesis\nmoina\ngreider\nmeltdowns\nwoodcliff\ncaldara\nstanchions\nsaras\nsalzkammergut\nkingsborough\nkorf\nriverrun\nexigent\npccw\nbrightwood\nsalvacion\ndinton\nlangfang\nunquestioningly\nqusay\nmanjit\ntzeltal\nwestpark\nagnesi\nsobchak\nergaster\nfungible\ndedo\nverrett\nkingsnorth\npalanquins\ntheunis\nshudras\ntucana\nmorino\nloganathan\nvanoise\nfirestop\nlavallee\nfordism\ntharparkar\nvilches\nbrinjal\nrambus\nfaucher\nlungotevere\nreedbeds\ncarso\nfunnyordie\nkaptur\ngossau\nhayashibara\nberges\nrancheros\nmitzpe\ndisaffiliation\nnicols\neftpos\nfeints\netive\nsubsuming\ntoothfish\nkajima\nepiphanies\nworsfold\nhonderich\nannfield\nsuwaidi\ntepes\norkestar\nstayman\nhemicycle\nquern\nmengs\nbaju\nmusavat\nburnishing\nbujor\nfash\nschori\ndruckenmiller\nsaag\nvosburgh\nwhacks\ncockfight\nluntz\nfoist\ntepeyac\nbiped\nmisfires\nlaparotomy\nkhazraji\nblase\nzhenya\nsauder\nfreddo\nmonisha\nmboya\nsacem\nkuralt\napparant\nipse\ngholson\ngillo\ntelcordia\namrapali\nwambaugh\nstults\nrowden\nfireweed\nbarrón\nkrla\nbortolo\ntach\nkarch\nhunches\ntipis\nforeshortened\nbelletti\ntgm\nonoe\neggheads\npanniers\ncashflow\nmurphree\nrosenquist\nwolke\noen\nrizza\npartaken\nbiretta\numf\nlamarre\norlandini\numoja\nlomo\nleavesden\nunexpandable\nragheb\nathas\nlano\npetn\nkrasnoroutskaya\npetroff\ntroutdale\nnabatean\nvagif\ninfests\ngarzon\ntuto\nsouce\npepp
ercorns\ncarnies\nkorpi\nlitchi\ndanan\nvoloshin\nsakr\nulta\nriner\norexin\nplisetskaya\nwone\nmcgrail\nmonhegan\nblistered\nhichens\nnoort\ndorton\nzambrotta\nnahj\nalexandrova\nzaentz\nthuot\nmihailov\nredeploying\ntallard\npsalmist\nkhami\nbulged\nklimek\nenam\nkandari\nlsl\ntonghua\nchalan\noddy\nstrategize\nafaq\nanacapa\nolavo\nsisir\nginty\nmallalieu\nkadosh\ntishchenko\ncelanese\nleyser\nmgt\nmcculloh\nshardlow\ndelaporte\nunproved\nsamland\noliveto\nwjac\nbogdani\nciutadella\nclubfoot\nembolization\nsyringae\nsiim\nlangstaff\nambos\nrenat\njohannson\njaywalking\noccassionally\nmante\nkushite\nandrostenedione\nsaceur\nhematologist\ndrumlanrig\nbirdwatcher\ngassama\nnoisier\nlete\nkalonji\nicecream\ntakebe\nbref\nhedonist\nbronkhorst\nevenness\nhermanson\nwarble\ndrumkit\nkerbs\ngub\nfundemental\nwittstock\nmoviefone\njosefin\nkretz\nmachineries\ndomitilla\nseraphine\ntrampas\nsoufflé\nthimerosal\nthordarson\nchimaltenango\nssdi\ndimmick\ncapgemini\nsoloff\ngolspie\nhypoglycemic\nbraund\nthirlby\npomerance\nweeper\nproces\nsniffles\nredbreast\nmetromix\nbackfiring\nforton\ndillmann\nperpetration\ntauron\nplaybills\nkifissia\narmeno\njuneja\nhanalei\nannegret\ntandoor\nlmf\ndarro\nnicci\ncalientes\nparatha\npremixed\nhanami\nzloty\ngleave\nkrupps\ncolvig\nsuffocates\nparaty\nkassner\nedinson\nkarjakin\nambrus\nbatiatus\nbarbarin\nconcentus\nyellowhammer\nmeckler\nkhorezm\ncwg\nvenuto\nampoule\ngirolami\nwelcher\nleeland\nstaphylococcal\nkeds\ntemerin\nnasd\nbemusement\nsmidgen\nwanner\ncarstensz\nkarmakar\npatronise\ntasa\nalcl\ntripathy\ndealbata\ntrilemma\ntinu\npharsalia\nockendon\nthiagarajan\nsousuke\nkeziah\ngoetghebuer\nalbayrak\nretrofits\ndeprecates\nmarijana\ntamme\nhiroyasu\nderivates\nsimpsonville\ndoormat\nphotobooks\nboarman\ncherepanov\nmutandis\narzoo\nbetina\nstargazers\nshox\ninteco\nyuzhang\nprospectively\nodebrecht\nojinaga\nilustrado\negeland\nhanafin\nvaldarno\nwarrenville\nscheurer\ncedarburg\nbearsted\njeffcoat\ndancey\neffervescence\n
macvicar\ntrucco\nkhanjar\nittehad\nkabba\nbumblefoot\nokemah\npossessiveness\nfearfully\nmilija\ngalbally\nplunk\nbluetones\ngrimond\nschiavi\nnakamichi\nexterminators\ngavron\nmidan\noakbrook\nblueclaws\nkukla\nslaw\nwoodcrest\nwenhui\nceg\nmenacingly\nfaucon\npolyphenol\ndeuk\nharpswell\nbeerwah\nabzug\nharchester\nterius\nfml\nafars\nmaoris\natitlán\nvolksbühne\nakf\nakam\nkongs\nmuesli\nnorthcutt\ndemond\nsharers\nhounsou\nkista\ngleaners\npresencia\npermatang\nluggers\npittenger\ndockets\nketel\npudi\ntignes\nranjani\ntreffry\ndigitalization\nreenie\ngijsbert\nundersurface\namontillado\nuzan\nhudec\nserevi\nrainiest\nfavorit\natoned\nnoar\nsirène\ndmsp\nlongville\nbruel\nfernan\nboie\nporkchop\nkreuziger\nbellen\nnarducci\nsismi\ndesnoyers\nhhh\nfolky\npenalise\nmckenny\nteetotaler\ngambardella\ncontainerization\nshying\njunctures\nfrontière\nanoles\naptana\nhusni\nknobel\npiramal\nbakal\npingo\nxinghua\npmb\nbearss\narlecchino\ncati\nmlakar\nnanostructured\naxia\nwbns\nwormer\nspitze\nrandfontein\nvadhana\npentaprism\nwetten\nemis\nxts\nlorant\nawww\nalpers\ncurrys\ntrialing\npathétique\nbesieges\nharber\nmentmore\ninculcating\nsecuritas\nzager\ntrimesters\ngange\nschuberth\nsabzi\nsafecracker\ninoculate\nsatpal\njinni\nchinoy\npantha\nmalcontent\nlavilla\nhoefer\nexalting\nklaver\nrapporteurs\nleakes\nsirin\npachamama\nbonners\nfoli\nchubbuck\ninstitucional\nbourj\npascali\nmahane\nllanishen\ndpw\nupperville\nnieuwendyk\ncoronial\ncrackles\ncheetos\ncottonwoods\nelsom\nredevelopments\nhomunculi\nnantlle\npully\npenan\npurgative\nmudras\njindo\nbarling\npodravka\nfifpro\nwerent\njalili\nlochee\nasafa\nsemo\nnmo\ncarvallo\nkarakocan\nmitsukoshi\nbatsuit\ndroving\nlaube\ntrinitas\ncadwgan\ncabala\nshondells\nngcobo\nravidass\nthrombocytopenic\nislandia\nmotormouth\nblackshaw\nballinrobe\nhengchun\nwinterbotham\nsorento\ncantat\nohman\ncincpac\ndukie\nelzy\nforfeitures\nmelanomas\ntrav\nphyu\nsneath\nnazz\njavanshir\nsuroor\nelaeagnus\nmaleki\ndegenkolb\njenkint
own\nhinshelwood\nmaiya\nkfmb\nchappaquiddick\nntare\nbeatific\nnagina\nhaydée\nolano\nnanotechnologies\nyunos\namiata\ncomposted\nsurv\ngriet\nportada\nsunia\nnoval\ncommies\npressurize\nlisburne\nmerseytravel\nillustrato\nhigashino\ndemoing\nneisser\ntondu\nalexandrou\ndibango\nasbo\ntwycross\nstraightness\npallice\npolityka\nwrithe\ngiacosa\nkoff\nvarnhagen\njuon\nreattach\nsemitrailer\nvestryman\ncrimping\nktrk\njudaean\nsmorgasbord\ncanam\ntiba\nschwabach\nreijo\ntenace\nguren\nypa\nfeek\nhax\nyiannopoulos\ndihydrotestosterone\nangwin\ngremio\nkeiner\npittance\nzworykin\nruffs\ncousine\npigna\nwhimsically\nbugaboo\najloun\nnezahualcoyotl\nterneuzen\npetrosal\nsharifah\nlightsabers\nrosca\nudm\nveon\nbranting\nreim\nlesher\nobjectivists\nbeame\noho\ncompunction\naubenas\ngivet\nsucralose\nbranner\nazem\ncircumambulation\nleakages\nkhosa\nelixirs\nberde\nmacoun\ngatekeeping\ndorte\nduparc\nculme\ntammie\nandreoli\nkohr\nvollard\n¯\nscandale\nprepuce\nbanfi\nbeauvois\nconformant\nsingleplayer\nsquillace\nlits\nabyssinians\nsparkbrook\nzadie\npoliticize\nafsar\ncertaine\nlectureships\nocotillo\narceneaux\naudiofile\ncuter\nlubanga\nmeilleure\nuzeyir\nmcghie\nmsy\nexcavates\nburek\nfengshan\nghimire\nmeades\nkley\nkeyvan\narcimboldo\ncuscuna\njayma\nthot\nmandisa\nreneau\nmeekness\nazn\nmaidment\nshabaka\nyanking\nhitzfeld\nneben\npragya\nlutcher\nbernstadt\nmosquitofish\naerostat\nxlt\nmolas\nbreakouts\nbessell\nkiron\ntausend\ngoyo\nayresome\nindividualists\nyoked\ngayer\nstotts\nkyokai\nbiblis\ninsulates\ntrux\nsubcomandante\navetisyan\nhambrick\ndavachi\nmccosh\nreider\ndarda\ntheophile\nreseeded\nntia\nintrested\nbeckel\nyamamah\ngoian\nulhas\nkadoma\nsurvivalism\nfrindall\nhoveyda\nimmunosuppressant\ncommins\npromontories\nstroger\nmasahito\nsteatoda\nbakrie\nmontee\nsmartbook\nscurrying\nchaffetz\nseibold\nnauset\ncrais\nibara\nverdeans\nshrum\nredbull\nsharleen\nfforest\nimprecisely\nsawflies\nkahlenberg\nballsbridge\nwhelp\nnicastro\ndonita\nreagle\nwestlei
gh\nlegarrette\nforestalling\nkarola\nsodden\nmusin\nbesiktas\nhuazhong\nfatcat\nkluane\nbondfield\nconstantinou\nintension\ncotonsport\njannis\nmouly\nhuarte\natrás\nmatryoshka\nyac\nlosin\nshockoe\nrhod\nslatter\ntragicomic\nrrt\npresidencial\njogos\ninterdependencies\nastrocytoma\nlimacina\npoorman\npointes\nmanele\nseabra\nuntaxed\npathfinding\nkinnaman\nbendall\nlaxer\nnanchong\nsoaker\nvalters\ndonnellys\nsculpturing\nwordpad\nmahtab\nbiographia\nkazushige\nterim\nmichihiro\npoesy\nlissauer\ncommerford\nbrunelle\npreflight\ndakotan\nenlace\nbirkhead\ninsensible\nluciferin\nginning\nvallejos\nniggle\nfrontierland\nbiodefense\nmammalogists\nzanella\nlochlann\ngabriola\nespindola\nolton\npldm\ncampau\ndollmaker\nmermoz\nleukoencephalopathy\nimmortalize\nfelted\ndolson\ndarga\nyarmolenko\nnuhu\ndelatour\nxvs\naikins\nbanteng\ncomana\nomara\nsupe\nireneusz\njaluit\nhammerlock\nxizhi\nthandi\nilwaco\nsolare\ngaret\ncadair\nchlorhexidine\nbithorn\nhural\nmanjimup\njugo\nnewb\nmutualist\ntaradale\ngoicoechea\nfulla\nnanometres\nborea\nheifers\naltis\ncompsognathus\nparesthesia\nlakhani\nblanketing\nseparateness\nsantuzza\ndemet\nhershman\nsebold\nportree\njagatsinghpur\nhambling\nproj\naima\nhorno\nmomofuku\nsamarasinghe\ndiscontinues\nmotlanthe\ngyrfalcon\nmacungie\ndemilitarised\naxbridge\nmulcaster\nsclerotic\ntomaselli\nboeke\nklöckner\nargonauta\ncavorting\nartel\nscathingly\nhandymen\nshotover\ndrumlins\nchitinous\nboak\nneuropsychopharmacology\nmalavasi\nnovelisations\nvelorum\nmonir\nphilodendron\ndulin\nhurr\nfasching\nnissanka\nmacromolecule\nsunrail\njih\nastell\nhoopa\nunformed\nwci\nlookalikes\nokl\nkingsclere\nlufeng\nsqa\nmodzelewski\nopo\ninflatables\nchane\nspeedline\ngawthorpe\nbatuta\nschönherr\nsalvinia\nshestakov\nrospigliosi\nloboda\nanacondas\nkitchenette\nmavro\nendosulfan\nkyai\nduetted\nreial\nhalki\ntraineeship\nthiopental\nmainlines\nherma\nmichalek\nescutcheons\nlegrain\nmcmasters\nseelig\nminoans\nmeshell\nskylarks\nrapha\nnakazato\nrelav
ant\nscribed\nsakka\nkernal\ngdt\ntikki\nreiber\nhousemartins\nhef\norlean\nkirksey\ndacascos\ngonin\nbwb\nmccorvey\norlok\nfábregas\ncassells\ndheere\nhyphy\nacipenser\nbloedel\nsongkran\nariyoshi\ndegradable\nrazo\nbiodiverse\nyurts\nshohada\nlvds\nglenfiddich\nhunchbacked\nbelchertown\ndenigrates\nlatinisation\nacupuncturist\nminc\nguaviare\nzits\nkarpinski\nsout\ngreisen\nfreyre\nskykomish\nghadames\nnashashibi\nometepe\naleuts\nmapperley\nprosiebensat\nstrawmen\nsinoatrial\nazerrad\nnutcase\nportentous\ngubler\ncinquième\nnoyer\nmaligning\ninfoshop\ncussing\nrotarian\nkissam\nwendat\nfredericka\nkhvostov\nalow\noraon\nasotin\njalaun\nkilz\ntemin\nduve\nscrollwork\npeche\nadamkus\nmilanesi\nturbin\nsniffs\nrzeznik\nrosseau\nguli\nkasuri\nwarid\nimplausibility\nalmario\nbattagram\ncastlewellan\ncofounders\npalhares\ncontraventions\nlenghty\nperturb\nardila\ngermann\nciénaga\nnubra\nlongquan\nkhagaria\nbennati\nvickerman\npotentates\nazolla\nsarp\nbents\nainhoa\nrafiki\nbedknobs\nwagstaffe\nzopa\nstec\nnube\nchuzzlewit\nbradby\nhtoo\ncepr\nhamnet\nliturgically\ncoston\nkiesel\ncreag\nrmas\ngaoler\nwalkmen\nbeardstown\ngapes\nbenediktsson\nspectaculars\ntonalá\nrepenting\ncostel\ndega\nberanger\ndetoured\nufologists\nunseasonably\ntheiler\nnordlund\npartenope\nkujalleq\nsleiman\nhousecleaning\ngolino\nkhaw\naspidistra\nzaslow\npicnickers\npopple\ntetiana\naccedes\nrecognizability\nriverworld\npoots\nwearied\nbenoy\nbenten\nfionnuala\nkirkley\noverflight\ngentles\nblancmange\nrsh\nstarched\nmesmerising\nkady\nmetzenbaum\nrecharges\nwheelbarrows\nmazzotta\nawesomely\ndresch\nbirthmarks\ntsunoda\ntweedledum\ntwila\ninderjit\nianthe\nmerna\niwork\niseo\nasimo\nmontlake\nhamlett\nsodano\nmunching\nradd\nhammamet\nbrewmaster\neschbach\nrutabaga\nnsrc\nhodler\nbifurcations\nearlobes\nispr\namuck\njunjie\nparisse\ntrialists\nbromham\nterrifies\nmentee\nfrese\nsriharikota\nwch\nrodden\ntvw\nlvovsky\ntyphimurium\ndumbbells\nlewie\nfiddlehead\neplf\nglycation\nquagliarella\no
ndemand\ncanuto\nephrussi\nleemans\nwindscale\ndjanogly\nrosebush\ngreenfinch\ndietzel\nbellot\nsizewell\nglienicke\nkiyohara\nauditionees\nbandito\npolperro\npawsey\nshehab\ntomentosum\nholmqvist\nheydari\nduje\nintertwines\nmunnings\nprostatitis\ncottons\nkashtan\nmegamind\ncosmographia\ncraic\nkillingholme\nbarzilai\nbierhoff\nkapel\nryuta\nadmitedly\njoell\npourri\nrdd\nmujahedeen\nbrinsmead\nbrisa\nundershirt\nhoaxers\nkante\nsoffit\nranted\nriblets\njemal\nolbrich\nnixa\nkold\nacea\nrespublica\njinxed\njojoba\nurwin\nactc\nlindvall\naecl\nvlasenica\nshawneetown\nshortbread\nrheum\nbogeyed\nshf\nhitmaker\nbenighted\npenhaligon\nclinica\ngeorgieva\ndhingra\ndemitra\nanax\nhennen\ncanot\ngunports\ntroodontid\nspuriously\ncaché\ndtic\ngalambos\nphen\nmillport\nbenefaction\nfuwa\nlestari\ncrania\nbrucie\nmugo\ncomley\nsquibs\nsaberi\nlogothetis\npermissibility\ncoleslaw\njape\nscriptable\nackbar\nbould\nrys\ncrz\nudorn\niracema\nnoisettes\ncadetship\noutlawry\nlovich\ngallion\ntrankov\nclum\nmahabodhi\nsteadied\ndaryn\nkevon\ncouching\nthrane\nropers\nnarelle\nbossard\ndilling\ncoproduced\nroatán\nnccu\nantitrypsin\nwindstar\nchirbury\nbwalya\nvasca\nhikurangi\nhubcap\ncrowden\nkalypso\nwesh\ngeotagging\nbrewpubs\nkoyuki\nfmx\nepilepticus\nkinko\njanklow\nmelty\nxanthe\nvallés\nurm\nmahaffy\nschoolfriend\ndjalma\nskeel\nstiffs\nawestruck\ngruevski\npneumoconiosis\nconcordances\nzuleta\nfrear\nariba\nnamibe\ninvovled\nzwaan\ntimman\nzettel\nshayan\nbosques\nennedi\nplym\nprostituted\nnafplion\nleopoldville\nbisignano\nmolts\ngois\ncavaillon\nwaives\nmadou\nbeiser\nmclynn\nmerryfield\nexperimentalists\nanuak\nsomov\nwtvt\nirreversibility\nbarrantes\namylose\ncharioteers\ncuvelier\nskolimowski\nmundra\nkamasutra\ntimekeepers\ndeadening\nquirkiness\ntappa\nnegreanu\nrazorbill\npetershill\nnctc\ndoull\nestepona\nditta\nranee\npreliminarily\nweihe\nnohara\nrackmount\nbalkrishna\nbortolotti\nsanski\nsherley\nhylands\ncok\nosk\nduchene\nmtbe\nvandervoort\nvaxholm\nreadjust
ed\nttb\nplanked\ntahitians\nespero\nbottomline\nxinmin\nsudler\ngholamreza\ncoalescent\nchanhassen\nhaake\nrubell\nmiyazato\ngeim\njolimont\nhoarau\nportelli\ndinga\nfazle\nhartranft\nbabbs\nwou\nsporades\nlakis\nlahi\ncordle\nivanishvili\ntemir\nnamdar\nxiwen\ngluckman\nsøndergaard\ngeschwind\nzambezia\ngoodlatte\ntatsuro\njirí\nstickam\nchisti\ncrosson\nehh\nzvenigorod\ntitanosaurs\ndazu\nsherani\ncoughed\ntelemarketers\nluman\ntinsmith\nmorzine\narchaism\nferrule\nrousselot\ncsce\nthermobaric\ngaden\nrúben\nconsolations\nnafisi\nenciso\nmaty\nrappa\nhaçienda\nfingerpicking\nredressing\nzien\nunsystematic\nbcw\nwilm\nbrons\ncanalised\nsubthalamic\nshiwen\nlowii\nscroggins\nsge\nbressan\ncibrian\nansible\ncastrate\nmitha\nszechwan\nloisel\nsailboard\ncotchery\nkahuku\ncoureur\nrakan\ndalio\ngorged\nrimba\nsitchin\nakhmed\nkonkona\nmosasaur\nmarianist\nhypoventilation\ntaborn\nmanacles\nilli\nunrighteous\ntraversable\nkisselgoff\ndüül\nausiello\ncraneflies\nbalderton\nmarum\npliant\nportaferry\nbelisle\nleef\nfbt\nbeamlines\nchanteur\npapaloukas\nletwin\nhasankeyf\ntfca\nazithromycin\nluli\navani\norti\npaines\npcsos\nchambrun\nwindley\nroedean\ncanzona\nknowshon\nnalle\nvicens\nchani\npulsipher\nsapperton\njaafari\narachnophobia\nnasib\nwestlink\npresas\nskylink\nvollrath\nburstow\nantwon\nmerrivale\nmobbs\nfirle\nbaddow\nmeditator\nlaforgue\ngraying\ndahlstrom\nkurumba\nkarawang\nvermonter\nkaltenbach\ncheckups\noutstations\nmcla\ntoffoli\ngonu\ntopicality\nnonmetallic\ncomesa\nlucano\nsokratis\ndispensable\nacrocanthosaurus\nontong\nrealisations\nrupo\nsonnenhof\nbyres\nsynthasite\nsnakepit\nkutv\nphillie\nscherchen\nlangstroth\nrenyi\npetaflops\nceawlin\nyohann\ngomberg\nfranci\nmustards\nbibs\ninia\nreoccur\nfitzwarren\nleelee\nsprigs\ncavy\nradiochemistry\njyri\nwaldseemüller\nsiegenthaler\nshipborne\nstubbington\nchesterville\nopaline\nphrenic\nsulking\ninishmore\nearphone\nscreensavers\nsardonically\nstylo\nkovalevsky\nbraylon\njatoi\nshelden\ntibbitt\nplut
ocracy\nfanfan\nlanxess\ndroege\ndisenfranchising\npixelation\nmacrobert\nguignard\ndelavayi\nlamaze\nbaptise\nsahid\njeanty\nfadhil\naccardi\nkeynesianism\ncontouring\nfatiha\ndiabolo\npymatuning\nmarq\ndifferentiator\nsudip\ndickerman\nkleiser\nmeredyth\nobara\ndiggory\nberlanti\nadrs\ntenuta\nkamerlingh\ndibenedetto\nmarchais\nmacandrew\ndebnam\nblaauw\ngenetical\nsubtests\nlayed\nellisville\nimpresa\nbinga\naerovironment\ndildar\ninsurrectionists\ntorreya\nbuckstone\nvaisakhi\nmosimann\nunluckily\nfoundlings\ntaberna\nkernighan\nfatf\nrailwayman\ngoatherd\niops\ndoorsteps\nanica\nmoët\nproselytization\nsentamu\nunseld\ndiarrheal\nsnijders\naboul\nreframed\ndubrovka\nmicrobiol\nsarpong\ncommonwealths\ngiornata\nglenridding\ntattershall\npeattie\nmaraschino\namnat\nchangning\ncommentates\nhutong\nbomblets\nkurdo\nkrishnakumar\nmicrotransactions\ndohm\ndetains\nplavi\npannella\nbarrons\nmohnke\nlwl\ngahanna\nbeny\nzenas\nabeid\nyibin\ntrittico\ncoalmine\noverplayed\nvengerov\nmontalva\nenercon\ntarjan\nurease\npatin\ninserm\nitanagar\nnonpayment\nmalignaggi\ncasandra\nskocpol\nretorting\nbargate\nbleh\nhirson\nabse\nembarrassments\nparmi\nmelani\ntickhill\nhakama\nperrette\npickerington\nweightings\npizzetti\nantipersonnel\nplanetariums\nforsett\nshio\nebersohn\ntrematodes\nmckale\ngunasekera\ncaddisflies\njopling\nkovin\npimlott\npigtail\ndunbarton\nkyp\npequeños\nnbcc\ngrayer\nbluesmen\nlohn\nfogelman\nswar\nliben\nmacovei\nfixate\nsarjeant\npassman\nwhinging\npbf\nwaterslide\nbiscotti\nnasco\ncolostrum\nherbison\nclearstream\nlongtan\nyuanyuan\nmeurice\nactiv\nbonneted\nsuz\npeten\nbrockbank\nstimulators\nmcnicholas\nshamblin\nkieswetter\nlongtown\nludovici\ninsurrectionist\nshirahama\nmutti\nfermentable\ntamaz\nstreamliners\nforwarders\ncolmore\nétrangers\nmarcom\nnuccio\nerysimum\ndirceu\nswapnil\ncoudert\nmonofilament\npitter\ntananarive\ncashion\ndockworkers\nleppert\nshiplake\nintonations\ngamecenter\npedo\ntransferability\ncathinone\nreusch\npokot\ngoosene
ck\nblackspot\nindividualised\nstiffly\nsiza\nryce\nwoodcroft\nbita\ndalt\ngovernmentally\nbyre\namroth\nprams\nweslaco\ntroxler\nternan\ndilan\nsenescent\ncoum\nfreckle\npanny\nagenzia\nenderlin\nmatamba\nmanari\nrahimov\negbe\nleeann\nkursaal\nexpecially\nminories\nimposture\ntopples\nnapoletano\nununseptium\nhims\npluses\njdrf\nwagenen\njosu\nreggiani\ncarrizal\nsulphurous\ntiang\ngoodna\nlauck\ngresh\nsaifi\njusticiable\nbargeboards\nolg\nkatznelson\ndeafblind\nremediated\nsiac\nbassingbourn\nfaccio\nriksbank\ncinc\ntvg\nexplicable\ndonatien\nfelicitous\ndisinherit\nepsa\ntoyin\nerico\ncottony\nprotoplasm\nobsequious\nrefashioned\nwinkworth\nzazou\ncountertops\ndurov\nordonez\nvolland\ntestability\ncrenulated\ndickov\nfishburn\ninchcape\nenciphered\nfederigo\nwindschuttle\nystalyfera\nmaschio\nswinden\nbahrainis\nnill\nthiong\nworkchoices\nkanes\nmaccoby\nclaassen\nxoom\ndarryn\nnwosu\ngoumas\ncolliders\nlandsdowne\nseigner\nwinchmore\nsbx\nzionsville\nmasn\nsbarro\nferencz\nadelie\nraworth\nserano\nglobule\nrajib\nedmonston\nperiosteum\nheys\nwillette\nwcdma\ncallable\npamukkale\nkrupskaya\ncustodianship\nmazon\nelectroporation\nleviev\nhershkovitz\nfeore\nbernheimer\nsententiae\npictograph\najram\nherston\nsotirios\nsvindal\naustronesians\ngatemouth\nsatch\npostville\nexide\nyeonpyeong\natbara\nforos\ncubillas\npaleoclimate\nstonemasonry\nruddington\nsundt\ncalomel\nprimadonna\nlichtenberger\njeopardised\nexperimentations\nsightlines\ntomm\nvalore\namate\nyahel\nyus\neaw\nliberato\nemulsifiers\ncalcaneus\ndoled\noww\nnorie\nroseum\narbella\nsebright\nwarschauer\nfelten\ncuriouser\npushbutton\npetach\nmamoulian\nswannanoa\npackager\nhefce\nwilliton\nmaresme\njonkheer\nminorly\nparaiba\nbalcones\nwitan\nchillingworth\nngarrindjeri\nabridge\nbendit\nintimidators\novershooting\nchaung\ncaractère\nregulative\nhollington\nnordhaus\nprotostar\nstreetscapes\nholli\noversimplifying\nmawer\ndobry\nmonastero\nocna\nmevlana\nkabalevsky\nlembo\nveneered\nbushkill\nyuhua\nb
asilone\njobeth\nmuhanna\nregressing\ndastur\nmroz\npernet\nseawell\nkaneria\nsearls\ncreameries\nudmr\nrazdan\nworldbeat\nunswept\nsternhell\nconstructivists\nhaddy\nbiopharma\nlossing\nmashona\njony\nkpp\npfleger\ngrune\nviñoly\naetos\nhaussler\ndenia\ndavida\nmacgraw\nkreidler\ntzaneen\nstuarda\nmaturities\nbjarke\neklavya\nkenichiro\nriddoch\nslazenger\nwahidi\ntrombino\nallensworth\notolith\nboychuk\nniit\npattan\nieremia\nwgy\nquenneville\nrandalls\nlovas\nbabic\nunestablished\npayphones\npenas\nbeezus\nteilo\ninarguably\ncabi\notti\nmondlane\nbrou\nboxleitner\nkuby\nornithopods\nprefacing\nchitta\noutcompete\nunsa\nmargerie\nardoin\nsyllogisms\nbreakbeats\nadalgisa\nosteotomy\nmassanutten\ngingham\ndeerskin\nhillen\ntsvetan\nswallowtails\noutsell\nbeerman\nhennequin\nbabineaux\nagyemang\nrass\nbobrovsky\nduva\nchoicest\nlammert\nimposts\nbassnectar\ngiordana\nidlers\nantonsson\nseacombe\nariya\nterraform\nbeatboxer\nlintott\nschematically\nbrushstroke\nadiga\ncedd\nardito\nrheinpark\ndwyfor\nconinck\nbellydance\nlippy\nputing\nveatch\nhakki\nlyubimov\nacquisitive\nsahana\nahlmann\nmycotoxin\nkeepsakes\nminow\nreigniting\nbalata\nfidh\ndjordjevic\nriehle\nahlers\nshamar\ntumaco\nabrahamian\nchorten\nkuehne\nkorir\nelberton\nmacshane\ncamelids\nmendo\ndendera\nhermaphroditism\nrrf\nménière\nbzp\ncityline\nnectarine\nopf\nmaharajganj\nmonetarily\nhusbandman\nelv\nkieu\nabdoul\njulesburg\nvézelay\nhotness\nfaxing\nbarrosa\nhilma\nfinnian\nchubs\ngribbin\nsuddenlink\nfortuin\newelina\nfeiner\nabdelrahman\nabortifacient\nmuffs\ncatafalque\ncrescenzi\nnibiru\nglenny\npotbelly\ncastanheira\nscrim\nimbeciles\nkonstam\nkataev\nbraeside\nfáilte\ngrol\nraghib\ntelekinetically\nwalkouts\nvoya\nfatemi\ncaresses\ncontractures\nbabich\nodorrana\nshizuko\ngefen\nremanufactured\nmitroglou\nmorgon\ngnawa\nchromosphere\nsbo\ntlalnepantla\nelektronika\npurring\nambre\nschreder\nsimien\nseborrheic\ncatalin\nrosatom\nyosano\nfiliz\nfilippa\nmalchow\nsecu\ndorna\nregionalized\nsandi
ford\nliaises\nflunked\nmmps\nrajnath\nclotworthy\npermanency\nwyrick\nceh\nrolaids\ndarwyn\nborich\nceta\nvisioning\nferriero\nplod\nsuas\ncabel\nlebedeva\nneuroscientific\nneurofeedback\nharissa\nkhaja\nneocolonialism\ncoldham\nagrochemicals\ndworsky\nguapos\nbacchanal\nkuznetsk\nvakula\nribavirin\nblon\nnanuet\nmarylou\nmwana\npromethea\nknowhow\neliseu\ncanaima\nkillough\nmebyon\nisinbayeva\nhaedo\nbridgers\nsambalpuri\ntessera\ngittens\nschneid\ntruther\nquemada\nhocine\ndebugged\nnecropsy\ntupa\nreverberated\nblowdown\nyamma\nwifredo\ndisproportionally\nanirban\nafterparty\nsanatoria\nzolpidem\nhedegaard\nboesky\nbuchheim\ngearshift\noborne\nvaletta\nequidae\ndighi\nenquires\nunexpurgated\nswarthout\nflitwick\nguillemette\nintresting\nsmac\nmachala\nfedden\nathis\naylestone\njaffé\nmeiners\nparasols\nskender\nyonemoto\nwhitmarsh\nculturales\ngismondi\nchildwall\nstanleyville\nsetts\nvassy\nchrysanthus\ndisorient\nberenbaum\nszymański\nsanlih\nrealclearpolitics\nunsaved\nweisel\nllew\ntimonen\nvermeersch\nsulak\nkurnell\nwitkiewicz\nsakigake\nphenylketonuria\northwein\npatenaude\narison\nrebeck\nschine\nsmileys\ncabrales\nparlamento\neigg\nmallin\nmoeder\nbakau\nzuleika\nbarkham\nzipcar\nbourgois\napiculata\ntauno\ndhp\npagla\nleps\ngombert\nelongating\nshahpura\nghesquière\ntrefriw\nhtb\nhatzfeld\ngranero\nkvly\nmutability\nvalon\ndysarthria\njewsbury\ngaafar\ntremulous\ncurig\ntoné\ncallister\ncutaways\nproslavery\nvenkaiah\nguale\nzaken\nnwsa\nhacket\nbanak\nmoonless\nsuggestively\nsegers\nsallied\ncontrives\ncraster\nlandside\nhenig\nmarkson\ndepressurization\nperess\ncolantoni\nrubbings\nswatches\nwetzler\ntappets\nonomichi\ndemange\ncommer\nseferis\natomization\nglenbard\ncauldwell\ntwixt\nswineherd\ngsma\nmpsf\nantel\ntineo\ntrethewey\nhilarie\nbranyan\nsawbridgeworth\nelhanan\ndongbu\nmangue\nincorrectness\nwbcsd\npownal\nsulk\nqanun\ncranny\nmehrab\nanissa\nclent\nfiladelfia\nlrcp\nmisteri\nmalolactic\nsayce\narcona\nvideophone\ntidning\ndyche\nleksand
\ntrico\nrge\ntakuji\nsmoothest\nnodosum\nauke\nriverwood\ndiscerns\nbroseley\njaubert\nribchester\nintimating\nbaseboard\nsenga\nvoyevoda\nkalma\ncibin\ncannata\nbarotrauma\ncaresse\nuip\nmothball\npaisano\nmellberg\ntopolski\ntosin\ntembe\nenjoyably\ngiry\nchilkat\nfraport\noffputting\nmilivoj\nmythili\ntgr\ngilels\nicare\nkelch\nsemitransparent\nseances\npalmarès\nbahlul\nipcs\nbunten\nsaheli\nphyo\ntez\nsenge\nsublett\nnoncombatant\npatcham\nprinceville\nmexès\nwoud\ngorney\ncamano\ngoodwins\nbertucci\ntweenies\npisans\nsoldan\nmármol\nlarnach\nbouder\nburkle\nverkhnyaya\nchehab\nknussen\narbois\nmeriweather\njiaying\nvilniaus\naldf\nbluejay\nmautner\nminotaurs\nzusammenarbeit\nemy\nforesti\nstehr\nchurchgate\npropitiate\nlindenbaum\nfallas\nfarad\ndacoits\ntuti\nohlendorf\nbelloni\nostrea\ndapeng\naffordances\naweil\nbrabus\nlebus\nspringfields\nlaer\nschroer\nasarco\nharim\npaleoanthropologist\nflowerdew\nfatback\ncopeia\ncarella\ntruthout\nlikeliest\nschneeberger\ntanai\nataka\nteta\nhha\ncapurro\nchurlish\nstylishly\nmisspells\npvsm\nsupremacism\nhighnote\nhwee\nmeirionnydd\nwhitebark\nbilingue\nchaptal\nplasticizer\nelg\ndevotedly\nslowpoke\nguinot\nmersing\ncorradino\npoydras\nrefried\nunian\nregroups\nblagden\nimpregnates\nmichelob\nkalitta\npury\nhrabal\nrucci\nexuding\npulverised\nmres\ndfr\nseveromorsk\nclaas\nboisselle\ncernuda\nmercante\ndcnr\nfeasted\ncomunitat\ntinus\nansermet\npemberley\nreingold\nirrelevancy\nbrainwashes\nehrhart\nppar\nboyaca\nkhidmat\ngoodpasture\ngrindley\nwazoo\nservis\nekeren\nviolons\ncsec\ngramatically\nclewer\ninrush\nmanute\nfreest\nnovelized\nencantada\nclatter\nhamre\nvidhi\nooc\nanniesland\npolites\npirenne\nnimo\narstechnica\ntownhill\neliud\nresidencial\nwhisenhunt\nnccn\nnoke\nhibben\nafolabi\ndaringly\nkirman\ndouglaston\nasocial\nunive\nabolhassan\nkamble\nmarcucci\nbumbershoot\nphouma\nridenour\ncopernicium\nptak\nmohawke\nfadec\nrégence\nalmquist\ntunggal\nionotropic\newings\nganso\nmagwitch\ncaddesi\nsevi\ndpe
\naccretions\npechstein\nalycia\nsaxbe\nkpfk\nhinging\nlauwers\nbletchingley\nessig\nwtp\nsecularity\nlemak\ngreenberger\ntointon\nstorke\ntakia\nbarendrecht\nscourged\nswamplands\ncongenitally\nsnia\nprophetically\nyeun\ncraiului\ncak\nchaki\nobsesses\nrahane\nconsilium\ngriping\nconisbrough\ncharismatics\nlangsford\ntaggert\nmilone\nromig\nbromborough\nwatabe\ngerwig\nskiable\nthakore\ncolvile\nprosumer\nlundie\nmakani\nberthiaume\nkilcher\nroslindale\nlamitan\nkosten\nadlershof\nexpressionless\nicsu\nnewent\nimke\nkienzle\nemac\nlaunderette\ntiberi\nruffy\nmonsalve\ngenotypic\nforefoot\nlordsburg\nscorns\npinki\nbullous\nmeep\nblythswood\ncognizable\nhaldi\ntravellin\nraptorial\nvelocimetry\nlouch\nknighting\nformalizes\nmulready\nmily\nstenting\ncommandeers\njedidiah\nbroilers\nshalal\nglassed\nakishino\nhatra\nswearengen\nidolised\npesantren\neverage\nfrosch\ntyrosinase\ndimos\nbramson\nfeckenham\nroundwood\nbaselitz\nzimmerli\nlilavati\njinotega\ngavazzi\nvallotton\naiea\nnaheed\ntheatr\nsoukup\nagers\nquiney\nmillsboro\namanti\nfaders\nbishopbriggs\njahad\njeremi\nmccrum\nsolbakken\ncolten\nclumber\nléogâne\nmashina\nglanz\nclason\nhamal\neeepc\njdf\nsarver\nklea\nkirkhope\nbennison\ngosnold\nestyn\nletha\nekpo\ntalence\nkornet\ndersu\nnonbelievers\nleister\nchaud\ntepee\nslayed\nmajali\nkingo\ntrebled\nkheer\nsporobolus\nexpropriations\nfaile\nstudier\nlooby\nflori\nrezek\npembridge\nshumaker\nvehbi\nsikka\nglashow\nsezgin\njandal\nkhafji\nnatterer\nfechter\noutgames\nmusidora\ngyroplane\nnavara\nsimus\nyanev\nsparkler\nderrek\ntamiflu\nbonnaire\nkumalo\nucha\nostentatiously\ncrull\ncoulouris\nsamaha\ngaren\naspersa\nnaef\nwhcih\ndiffusely\narwel\nfelici\nbovill\nkoidu\nrooty\nazat\nconerly\nbeckum\nknac\nmerloni\nxai\nyuppies\npauleta\nwajir\nnhon\nmedoff\ngragg\nmushroomed\ncervélo\nraiford\nhaymon\nporites\nabdiel\nsellon\nlevet\nguanxi\nmayock\nfoundationalism\ndefeatism\ndisunited\nrecasts\npaternally\ncinnaminson\nplacencia\nnavesink\nanomalistic\ntran
snationalism\nrasc\npolhill\nohi\natlit\npandur\nwaaaay\npardue\nchaine\ntweedle\nswannell\npudovkin\nhitchins\nsiller\nfaurisson\nahtna\nefford\ndeet\nveenstra\ninsurable\ntellegen\nzigzagging\nschlesser\nsannikov\nemboli\nweedman\nnandor\nrochas\nsainted\nioanna\nwildwater\ngumpert\ndowels\npluripotency\nattires\noxendine\nelectrocutes\nceefax\ndogpile\npouncing\nyachtsmen\nbocock\npiccirillo\nrason\nskłodowska\nagros\nkba\nseyni\nknapton\nimss\nstraten\nquetzals\nmaglie\nbaikie\nbousman\nlati\nbanani\ndewani\nlaleham\nhuaihai\nshored\ntradeshow\nkalim\njianzhong\nclonidine\ndentoni\nkonz\ntasered\ncassadaga\nbuiltin\nacetates\nxiaojing\nsadiku\nagah\nvold\nspangle\nprizewinners\nearthscan\nyokneam\nswofford\nhorcrux\nwaverton\nrecoiling\nwilhem\nkellog\nkre\nabbiati\nmisconstruing\nravanelli\nlemper\nvicini\nders\nsabermetrics\nwolvercote\nbearwood\noversaturated\nweingut\nquex\nhoeppner\nmaizière\nnosecone\nyuengling\nconnel\ntarbat\nnitsch\nvicinities\nwaru\namadio\nkittur\nlloris\nepigraphist\nnxg\nnationalise\nswinburn\ndeprez\nwoodchurch\nwinesburg\nbakos\nwilmshurst\nhackel\nliras\nratto\nrakel\nsystematize\nmouawad\nretransmitted\nkettleman\nxinxing\ntianwei\npachycephalosaurus\ncachexia\nstosch\ncuppa\nnivkh\ncherrington\ntransesterification\nmoazzam\nparnaby\nmannin\nreclassifying\nzeri\nmetaphysically\nelephanta\ncdw\ncalment\nhaslinger\ngustavsen\nmerenda\nbraunstone\nnordman\nipil\ntreschow\nsouthshore\nabord\nassociational\nlineside\nghaus\nganim\nsilverthorn\nbalayan\naala\ncruzi\narcangeli\nrheem\npardi\nbabita\nbloemen\npavie\npenedo\nsobrinho\nmahmudi\ncarsey\nsherawat\nkeaveney\nbeccari\nfractionated\nekurhuleni\nrealness\njsd\nponciano\ntoobin\nramnaresh\ntamkang\nworldcup\nezz\nsquab\ncragin\nmooning\nboreman\nakindele\nsenador\ninviolate\nconditionality\nhgf\nblaye\npromisingly\nkatsuta\nmamani\nchaotically\njeal\nwomankind\ndollinger\nharmonists\nfolkets\nnumerological\ndka\nbamse\nyothu\nsubsidising\nrogo\nbagi\npirrie\nwhisks\narbury\nreaw
aken\noddsson\nlsg\npompon\nculter\ndonskoi\nunicycling\ntriga\nawqaf\nhighcliffe\npeps\ntrinet\ncomverse\nbessone\nsgro\npostholes\nwestpoint\nspringboro\noceanarium\ncona\nkriegel\ngrantown\nindentures\nschlag\nurinetown\npiercer\nabms\npandoras\nsundries\nthéatre\nmanou\naica\nmetastasize\nardo\nosseous\nverica\ngoossen\nnestles\nskadden\nrestivo\nréalité\nhubby\nwormser\nlizbeth\ncampanas\naben\naparo\ncryan\nwindbreak\nticinese\nbachus\nsportaccord\nderryberry\ntongyeong\nnepalnews\nwoodstown\nkneipp\npharmacopeia\nmendell\nhorsell\ndairyman\nhtr\nmudflows\nmoville\ndotto\nogino\npraetorians\nyune\ntruthers\nmontezemolo\nsherr\nvictoriana\ncriminalising\naloísio\nhalvorssen\npeterbilt\ntazawa\nmillstream\ntschaikovsky\nnado\nnarinder\nkastor\ngobrecht\nschnitger\nlokey\nportugalete\ncnooc\nletterforms\nkaravan\nsteff\nuspa\npeñafiel\ngransden\nzygotic\nedginton\nkeowee\nlarke\nsuperliner\nabass\ndakhil\nparceled\nbrownhill\nlahij\npede\nwashingtonville\nregensberg\nindiggo\nhaltemprice\nstarliner\nsprayer\nprotegé\ncreggan\ntoran\nrustico\nshellshock\nramsbotham\nesai\nblust\nlorikeets\ntingwell\ndiskettes\ndanchenko\nhockett\ngagnoa\nellerbee\namerind\ndragone\nbuzzell\nchatillon\nmckissick\nzacharia\neinziger\nsorce\nmarchwood\nimploding\nsangwan\nkroy\nsoftline\nluxembourger\nkinnard\ncourante\njarbidge\noddjob\npreferrable\nrelicensing\ndirtbag\nxcor\nleuenberger\nseawalls\nkeirn\narny\nestamos\nphenomenons\nmccardle\nfadlallah\nnonplussed\nsabari\nvlm\nfulop\nrooth\nneot\ngovert\nlawbreakers\nunburnt\nwonks\nrenderers\nloreal\nholdovers\nredlight\nkanthi\ncoachbuilding\nlgp\nspoelstra\ntijara\njember\nterrifically\nstruth\nmonthon\nvojinovic\npyromaniac\nquanto\nenza\nkerfoot\nesperia\npeca\nkharg\nalexe\nliquidambar\nthiophene\nmemorialised\nkaifi\nwtam\npulsatile\nhadlock\nwarlingham\nclancey\ntrivialization\nwakatsuki\nnudges\nschwenk\nnetlist\nflappy\ncodina\ncasserly\nkyohei\nwenjie\nsoiling\ngads\npolyurethanes\nhosie\nyba\ninitializing\nharless\njiy
uan\ngasteyer\ntipaza\ngulberg\nluqman\nkyffin\nwarf\ncompellingly\nfalconers\nsalcido\ntendre\nomp\nborok\nblitzed\nboto\nlisco\narghandab\nsenecas\nsasu\nbastidas\nboumerdès\njakaya\nodemwingie\ndaugaard\nhurtle\nratcheting\nglendenning\nvortec\nangiolini\ncannizzaro\nult\ndreamscapes\ntrevorrow\nmyerscough\nuck\ngodfred\nschaick\ngonaïves\nseagle\nmagnetoresistance\nvergata\nwiggy\nrhib\ninsein\nyocum\nyampolsky\npwnage\ncharline\nkönigswinter\nenitharmon\naesculapius\ndiaa\nbokor\nkrentz\nyews\nnorthbourne\nkarena\nflys\nnahm\ncriccieth\nflyball\nfossiles\nuneventfully\ncountersuit\nhbp\ngrundmann\ntelevangelists\nsady\nwilmerding\npradier\nzhulin\nouzo\nscreencast\ngrabski\nhotwire\nbrittania\ndwane\ngeraniums\ndeori\nmicroplate\ntearjerker\nnuka\nmisnomers\ninterdicting\nreichlen\nwrung\ntenofovir\nworldliness\npaladini\ntennet\nwondercon\nsuperunknown\noped\ncryptologist\nyadier\njevon\nnuva\npetrosino\ntotale\nsouthfields\nssrs\nhedgecock\nscac\nniddrie\nsadruddin\nsporn\nixs\nrheological\nrocas\nvirgos\nhafizullah\nulundi\nimmunogenic\nwoolgar\nwda\nyokote\nremovers\nannatto\nciudadano\nkintner\nbacio\ndudamel\nkitsilano\nsapphic\ndundon\nallogeneic\ntikolo\nralegh\nufcw\nfishtank\nheffelfinger\nspqr\nseyrig\nhawkinsville\nstevedoring\nallem\nschweiker\nfespaco\nidot\naldrete\nndoye\nkatsuji\ncapuleti\ntothill\nbaldacchino\nwojewoda\nmozer\nlindland\nkeitai\nshafir\ndeemster\nellingwood\ndosimeter\ncélèbres\nteter\nsegregationists\ndelen\nharmonising\ndarkhan\nintoned\nerotically\npariente\npoz\ncountersued\nkaftanzoglio\nsparklehorse\nmoldenhauer\nrootlets\ntowboat\njasbir\nahmadov\nnajimy\navent\ndpo\nmcfc\ntomohisa\ninternetworking\nzygons\nvermette\ndge\nrelaxin\ntroya\nathanasiadis\nsurfs\nanerood\nyasa\nreorganisations\nblacklick\noverstuffed\naprils\nhelgesen\ngirne\nheinola\npluvial\npaleoconservative\nsimplistically\niwatani\ngarraway\nmeddler\npontet\ngleditsch\nadvisability\ndedrick\nchh\nbaldauf\nbuma\nardi\nunenclosed\nnewsbeat\noropharyngeal\n
preempts\nmilitarised\nmulatta\nunconstructed\nmoonglow\ntihanyi\ndeterministically\nflaunted\nshowell\nassan\nrinku\nfornell\nllanbedr\nclínica\ncramton\nryul\nblixa\ncartmell\nmouat\nstenmark\ncélestins\ngyantse\nrobofish\nuzair\nallways\ndangriga\nmusan\nopenview\ninstep\ngld\nxcp\nassaying\nfahs\ndictaphone\nneese\ngyrodyne\nmsha\ngillnets\nprosthodontics\nlafe\nbiotopes\nmaciek\nbreland\ngoodhew\nraffo\nwadesboro\nxinzhou\nheartedness\npoliticisation\nalemanno\nmultistory\nferrini\nfranson\ndippenaar\nfallingwater\nleehom\nbissel\nnbf\nrattenbury\nolas\nstigmatised\nthornes\nclimbié\nkrabby\nrelegates\nbeholding\nspay\nktvi\nsouviens\nfraternization\nfuncom\napplegarth\nfulbert\nbrining\nlorenc\nribbe\nsidearms\ncoté\nbillman\nganzel\nvainikolo\nhongshan\nksbw\nsmsc\nlapide\nmahn\nlello\nbuettner\nchonan\nmarut\nery\nyorty\nturgay\ntws\nrubins\nballsy\nmichelis\nazulejos\nrevascularization\nperrelli\nbodoland\npullach\nerth\nzrt\nwolstencroft\ncuddington\nrebuts\neyot\nensminger\nngum\ngallinger\nyerington\nmanoranjan\ndallman\ntwilights\ngouaches\nhilfenhaus\ngandara\nrepudiates\nmastan\nhaseen\nsetouchi\npriv\ncopulations\nwalli\nsaddlery\nakhavan\ngarma\nswadlincote\ntadoussac\nhacken\nlakeridge\nshirking\nprantl\ncontusions\nappy\nplesner\nmatriarchs\nmacrobiotic\nkolm\nhypervelocity\nconcords\nlintz\nbho\nsharqiyah\nteachable\npitiable\nyaro\ndidia\nslackened\nhurford\nevertsen\nbadenhorst\nbotstein\nsviridov\namplio\nchauffeured\nlefever\nokemos\ncentrosaurus\njades\ndahiya\nmeldon\nhesder\ngillispie\nkoekkoek\nkaleva\ngarioch\nstøre\nnarges\ncorbould\nnicaise\nwoolson\nkrasimir\nemigrates\nsnorre\nboff\ndiscolouration\nrangell\nkwela\ncommittment\nleukodystrophy\nsanchar\nisay\nsyf\nimpish\nunicity\nzijlstra\nbucko\ndfp\nhuddy\nnonselective\nsplashy\nsonda\nicer\nksb\narvon\nlacquers\ncatholicity\nlaira\nmccahill\nicbc\ntenino\nramal\nbecali\nlaurelton\nriemer\nsuttons\nrepugnance\nmodded\naksana\nwansdyke\nthomann\nsladek\nzecca\nsodomized\nbemberg\nnar
cos\namadeu\npumpkinseed\ntanami\nnordoff\nmcnicol\nmarlen\nguoli\nbeisbol\nmoza\nrecouderc\nunconventionally\nwagan\nmorganwg\nhöller\nhaeng\nbéal\nbarging\nskb\nlasses\ndelicatessens\nrody\nrenkin\ncompetizione\nmcfarlin\ndrydocked\njackanory\npaolino\nraytown\nlamade\nvica\nderogatorily\nstahel\nekstrom\ncaulkins\nkatzenstein\nhacha\nmississippians\nwalts\nlucescu\nbrydan\njeng\ncarolinum\nsuydam\nimplausibly\nlyssa\nlaghi\nglackens\nlail\nsheriffdom\ntelecasters\ndidonato\nturkiye\npenalizes\nalexx\nsyvret\nmcphatter\nwivb\neisendrath\nbeilin\nladybugs\nquot\nheadbutting\nveri\nsotnikova\nponteland\neddine\nwestfields\ntubeless\nwibowo\nbaine\nfetishistic\ndabur\nnatasa\nkokorin\nbandmember\nsadiya\nlosman\nogburn\nlamed\nglaister\nismene\necotypes\ngaols\nlansley\nochil\nfreshen\nfranklinville\nmedlocke\nhaff\ntornabuoni\nsemipro\nquoth\nschedulers\nwelner\nhumidifier\nbrookgreen\nthamarai\nrhapsodic\nlaughingstock\nbrebner\ndeneen\nrupesh\nplanetside\nkaps\nandriana\nnyla\ndominium\ncongos\nshahidullah\nywam\njic\ngunasekara\nhayloft\nsuzannah\ngullion\nnomadism\ncoulier\ngandak\nvaldemoro\nishtiaq\nturion\ndeign\nincarcerate\nmoriwaki\npleasingly\npsat\nginobili\nslive\ncojedes\nbleich\nwijewardene\nteplitz\nirg\nschwob\njameela\nrigoberta\ncuratorship\npenetrators\ncoquelin\nltz\nemyr\njhaveri\nsunsplash\ntenser\nburdine\nstriken\nvaryag\nsestiere\ntoton\nhelaine\nmahanagar\napiculture\nrongbuk\nanjir\nrouser\nvaguer\nnattier\nallori\nfranchione\ntebbetts\nmavado\nloaiza\nrahan\nsimvastatin\nboyana\nvaldir\nbutland\nmassana\niowans\nspitzbergen\nkerkhof\nthorndale\npicmg\nepileptics\nwillich\nnegrin\nfuniculars\nspcs\ntaneytown\nomelet\nsalthouse\ndaljit\nloughridge\nkundi\njaviera\nkens\nahlstrom\ncognos\nsartaj\ntrafton\nfedra\neurocom\ndymphna\nniblett\numari\nstumbleupon\ntenochtitlán\ncheska\nmalec\nmarang\nakinola\nimpellitteri\nellinger\nicsi\nopb\ndweck\ndorst\nvoiculescu\nwurzelbacher\nschuette\nfortino\ndanese\nlexeme\nayukawa\nmedecine\nbabayan\nh
anjin\nmerfyn\nkango\ncytisus\nvaleurs\nburgoon\nbachan\ncyberattacks\nstormbreaker\nsuntan\nkahoolawe\nmonia\ndaglish\nramar\nslone\nnovelised\nbuchmann\nsultani\nschemers\nballu\nhectoliter\ngraptolites\nstammerer\ncorpuscle\ngwn\nchaturanga\nmecum\nalalakh\nalatriste\nbrissot\npelecanos\nczukay\nmcps\nsidor\ngrajales\nclowney\ntgl\nolegs\nlaicized\nsykesville\npanera\ntroubetzkoy\nciera\npmcs\nvideoclips\nlapita\nunglamorous\nblitzes\nuned\ncarrasquel\nvgik\nbensley\nthorington\nmansuri\njulietta\nforwood\nglassnote\ngrimwade\nnajera\nborgwarner\nngi\nkard\nburle\nfullam\nbakan\ncomex\npdv\ndebat\nbadung\npomodoro\nvorn\nfelsen\ntoothy\nbloss\nwildstein\nroomful\nlydians\ndarkhorse\nmilfoil\nfuntastic\naapg\ncoaldale\nmynd\nkabala\nghazaleh\nbriffa\nkiprop\nkoehn\nargyris\nvesco\nniblock\nvinne\nzavadil\nstonegate\nhuntingtower\nbrickmaking\nlikeminded\nbohl\ndgo\nkastle\ntrimethylamine\nkenfig\nbalck\nprosecco\nalderaan\nshawfield\nchristmases\navez\nvalori\nbumpin\ntibby\nrisser\ndocetaxel\nplagne\nysleta\nvend\nmüstair\narkangel\nneerja\nprosperi\nmizuta\nfallaci\ngroggy\ntweedmouth\nshimkus\nbrame\nantagonisms\nforschungszentrum\ntitmuss\nflagellated\ngranatstein\ncepheids\nrygge\nschank\nhierarchal\nwayfinding\nclobbered\ndaddario\ndicentra\nangella\nmichoacana\noosten\ntranny\norrego\nkpcc\nunflappable\nbyculla\ntransfused\nremota\nnorheim\nbrookhart\nwcpo\nloree\nspieler\npepfar\nthornalley\nclenching\nhatreds\nguadaloupe\njir\npoincare\nreutov\nherbin\ncaci\ncharkhi\nnorcal\noquirrh\nmubarek\nuceda\nmckitrick\nbattel\nheilbronner\nsalamone\nstieber\nbrouwers\nactuating\nsimonelli\nboorstin\nvishakhapatnam\ntaua\nbodacious\nbrowz\ntushingham\npatterdale\nmultibeam\npossessory\nwenjun\nchoire\naopa\nscawen\ncholoma\nscie\noduro\ngriego\nverda\nnorlin\nminnick\ntijana\ngreenly\nrodchenko\ndesjarlais\nthalheim\nsanden\ntinkoff\nharf\nsharpens\nspermatic\nnzru\nborei\ngeneralizable\nallhallows\nstuttered\nhaskett\ncezanne\nrylstone\nmedicate\nbyronic\nsallinen
\ndilruba\nimbuing\nacetylcysteine\nmanrico\ngroninger\nantichi\nberka\nnermin\nsvitlana\nskillsusa\nbailiwicks\nazza\nhadean\nbisher\ndargie\nunethically\ncivis\nröhl\ndroog\nmajles\nstolid\nbackstairs\nfrankau\nrawn\nlifu\nappetizing\nsheetz\ncreedy\nkgaa\ncarillion\nalexej\noyston\npenetrant\nepicureanism\npokrovka\nmongrels\nvänskä\norgani\ngaude\nunachievable\nagudo\nqadisiya\nstodgier\nconsortiums\nslager\ndealbreaker\nkhalilov\nkeadby\npassu\ncavallero\ntraumatizing\nmultitouch\nmarito\nrazorblade\nhallein\niaps\nreveling\ncedi\nborns\nlagrave\nszolkowy\nphotostream\ncanady\nloye\njijiga\nalpo\nalsup\nheedless\ngrails\npolarizers\nsenates\nlappland\njingyuan\nhbi\nkeppoch\nupraised\nnesser\npostproduction\ninflames\ngnashing\nsanka\nneuropsychologist\ntalybont\nmagirus\ntrammel\naloofness\ncontrastingly\nnucleosides\ndespierta\ncicotte\nteertha\npeepshow\nlemmons\nsnailfish\nwellings\nrafat\ngaonkar\nhalfon\ngamu\ncyanuric\ncielito\nshitting\ntolkin\ngradualism\nkanellis\nrynek\nplagiarising\nalness\nujjal\nweigle\nneé\nnaafi\nkormos\nfulbeck\nsmudges\ninterna\ndénouement\nmolenaar\nperfects\nwhee\nninds\naping\ngelled\nhuws\ngoffs\nrussophobia\nloga\ndoji\nfirkin\nkopassus\ncalaway\ngeppi\nfaruq\nwptz\nchalons\nbigmouth\nmcgrane\niara\npaulsson\nkawara\nchinery\nsyncytial\nfretwell\nuiq\nshawarma\ncapuzzo\nsuran\nsafonov\nbetaine\nboyet\nsouthwind\ndupuytren\nbuah\ndevelopable\ntriangulate\nmontagnier\ntertre\nrajin\ndangles\nmursi\nmcginlay\ntolon\ntoorop\nthenceforward\ndjordje\nemunim\neasterner\nyasumasa\nragone\nhardstaff\nnneka\nlargescale\nglaciological\ngastelum\nrenaults\natic\noverdrawn\ndupplin\ngrimethorpe\nimplementer\ntwinsburg\nduron\nvenning\nchashma\nsibusiso\ntadmor\ntodeschini\nrupnarayan\nmiseducation\nchondrocytes\naskam\nbrihanmumbai\nsamsa\ngervinho\nyubari\nseogwipo\nhilltown\nadfa\njltv\nauslander\napostel\ntrombley\nlgt\nfadeev\ntufo\nmanekshaw\nzeroing\nouidah\nfireclay\ncoltman\nderay\nwankdorf\ndelocalization\ncoveralls\ntransrap
id\nmaximalist\ncloninger\nalie\nkirwin\nbezels\nflyhalf\nkopaonik\nmilada\nusdot\nvrubel\nvalproic\ninosanto\nimprover\nformartine\npawcatuck\ndunstone\npröll\nmontalcini\nharring\nkaruk\nsalpingidis\nberrick\npalmistry\ngrn\nlhr\nsunyani\nsecteur\npetrides\nadis\nguadalupian\nkayama\npixy\npaschalis\npsuv\nkalanga\neckhoff\nmaneesh\nbreitman\nglassblower\nstylez\ntrastuzumab\ntszyu\nbrodovitch\nmedicals\nchurnet\nkarakol\nsucessful\nshowjumping\nstraaten\ncapas\nshaftsbury\nhambleden\nambersons\nmilosz\nwerff\nlulled\nembley\ncarthago\npanucci\npushups\nhunn\nswm\nlione\nblz\ndsh\nwallichii\nsies\nmtpa\nroderigo\nunscr\nurbach\npko\nelstow\numberleigh\nyvel\nmontesa\nsemicircles\nsjoerd\nzenden\nlanie\nlegard\nunbaptized\nlastras\npenders\nawasthi\ntrepanation\nquarantines\nbérubé\nframeline\ncohousing\nmashrafe\nsilberling\npaduan\ndooming\ngreyed\nlocution\nacog\nazrieli\nknb\nmarriot\nindecently\npopocatépetl\nbalearica\nmoffo\nunwieldly\nrados\nmuito\nglitchy\nhibernates\nquasicrystals\njaggers\ndismally\narat\ncaber\nhevia\nargylls\nnairo\nmilliliter\nradler\nesguerra\nuziel\nasra\nikiru\nfeckless\nshishir\nlahav\nchuckling\nreimposed\ninerrant\nwinched\nfeil\njurvetson\ncyclosa\nflagon\natack\nbidault\nsignalmen\ntth\ndisclaims\nisasi\nluzi\nunclassifiable\ndoling\nmacba\nglauco\nrecliner\nmahin\ndubliner\nkanchelskis\nevenimentul\nfucus\nfalkiner\ndorée\nkalaupapa\neglington\nscelsi\nstidham\nemelia\njallow\ntriply\nzax\nphytochemical\nwoodleigh\ntamburlaine\nklondyke\nunsullied\nzillow\ngigha\nnikole\nsanni\nliotard\nashaari\nhuka\nwistfully\ncondover\nseymore\nrosiglitazone\nweerasinghe\nvimpelcom\nchevra\nbrightens\nsulfonamides\nledecky\nroes\ngreenglass\nolynyk\nzanjeer\nmomoa\nmislaid\nctia\nkittle\nbyerly\nclaygate\nsavolainen\nczvitkovics\nmuckraker\nculicoides\nwestonbirt\nseleção\ntuaregs\ndryads\nthreadless\nwih\nthir\nmusqueam\ncloacal\njaspar\npedler\nbabalu\nsequoiadendron\nterrorise\nreveled\ncoreligionists\nbreadboard\nyucheng\ndauda\nmadine
\nthinkquest\nabortionist\nhinkel\nfirehawk\nenlistees\nbitchin\ncomfrey\nruach\ndurak\nfomina\nfilastin\nfaberge\nklasfeld\ncurieux\nhaseeb\ngoodwillie\nrecoding\nkwadwo\nsidis\ngomaa\nobes\nmakonde\ndiscordance\njustis\ntarporley\nhamrin\naulos\nfadhli\npatte\nneurochemical\nbloodstain\nadilson\nhacarmel\njongen\nkulit\nstav\nbunte\nrestructurings\nduverger\nmassereene\nmodin\nmegalomaniacal\nsedia\nhamiltonians\nscrublands\nfallada\nstereoscope\nlaimbeer\nvillella\nbelltown\nmafias\nkryptos\nneger\ngrenouille\nsmudged\nduyn\nmarshaled\nkule\njoynson\nairlifting\ntrivializing\ndiverticulitis\nsibson\nlowth\nwilted\nparalysing\nbenesch\nlasham\nbrasier\nhucker\nweobley\nconcentrically\ndjite\nsheeler\nintercedes\ngoater\nryokan\nbackfill\ndeistic\ntioman\ntokiko\ngaus\ndenko\ndemoralising\nmakossa\ntransfield\npallekele\nmiddlemore\nmdewakanton\naimable\nmisjudgment\ndangled\nendon\nschnitt\nation\nchunlai\ncelibidache\nzah\nbhumika\nmaccracken\nsclafani\nhigginbottom\npacky\nshaz\ntrapster\nlitel\nponticelli\nsprewell\npadlocked\nghita\nchailey\nsantonio\naltschuler\nreposed\nabdelwahab\ndubus\nretinoid\nnonda\nextortionist\ntunnellers\npewaukee\ndeadhorse\nlinezolid\nvogues\nbucha\nwithstands\ngernon\nimpinges\nefb\nbladet\nfondest\nfrostbitten\nunfired\nturnings\npintu\nmih\ngreven\nparentis\nmarabouts\nmoralism\nmuddies\nbaruth\nvirgina\nhosed\nsneers\narashiyama\namero\nsabahi\nhotseat\ndoisneau\nmustached\nfattori\ntabara\nheis\nresonably\nsaddening\ngnoll\npeery\ninterconnectivity\ncarrageenan\npostlude\nrefiled\nquadripartite\nriccò\nassasination\nchambres\ninbounds\nroboto\nconcertación\nthingie\nkenyah\nenskede\nwrona\nlul\npigozzi\ncankers\nbyc\nbeida\nmorandini\nifds\ntarpaulins\nnystad\norszag\nsyk\nunbundle\nsicknesses\nsilkie\nnovar\ngronk\nllys\nkjartansson\nelyot\nantrum\nbivalent\nnuf\nhighlighter\nsipah\nhegg\nsalada\nschayes\nghouse\nstoiber\nteutons\nemerich\nszdsz\nkarakul\nlenehan\nsarkisyan\nhodie\nallanson\ngrigorieva\nlaettner\ncommissarie
s\nheneghan\nsimplot\nwiston\nvize\nilliana\nfuchida\nplacentals\nahola\nchike\nreinvents\ngrizedale\nmercaz\nmuji\nallocator\nlein\nbucarest\noverenthusiastic\ngatta\nprobabilistically\nimst\ntraversée\nmesfin\ncollodi\npribilof\ndistresses\nnicd\nstaffa\nronning\nmarloes\nunifier\nunhurried\nknaggs\nzerbo\nhary\nminish\nrengo\naneesh\nzhihao\njjl\napproriate\ntmo\ngirado\nmalines\nstewing\nhpr\nwagah\ncorbu\njaundiced\nundershot\nbiding\nbryzgalov\nfews\nquella\nknapman\ntaupe\nmertesacker\ncardenio\npegge\nkredi\nbeneficially\nsawar\nglassmaker\nransoming\nbalasko\nspaceframe\nlullabye\ncenta\nsublimated\nscatterbrain\nboonesborough\nfosbury\nkli\nardant\nkissell\nhirschi\nfratianno\nunitized\nwestshore\nnoson\naccomodation\nfairlee\nhaart\nwhannell\nadath\nloveman\nsres\ncowin\nneuhof\ntaraborrelli\njurupa\npanozzo\nliek\nmandaean\nrippled\nstik\nbenefactress\nmucci\nwohlers\nbandhavgarh\nmetalsmith\nbmb\npitstops\noctuplets\ncernay\nyarnall\nronaldsway\nvaleant\ndanilovsky\ndurlacher\nlevenshulme\nrubab\nduranty\naslef\nlogsdon\nfank\nxxy\nmacivor\nagosti\nwaun\ndarnielle\nkangoo\nunfurling\nsweb\nshambu\nhendrickx\nluxembourgers\nnideffer\nrichford\nlakshmibai\nlabar\nnabe\nimperceptibly\nsallee\ngeilo\ncachao\nkernewek\npeaky\nblanched\nclosedown\nalist\ntozeur\ntyas\nalyth\nrenea\ngites\nkhlebnikov\nsatou\nslx\nslaveowners\neho\npataca\nparticularism\nhachioji\nschary\nprejudge\nadduce\ngigot\nweegee\nrootsy\nshey\nnikaia\nmagis\nmclouth\nldi\nlovelady\npheochromocytoma\ngaffaney\nsandyford\nbloomquist\ncappagh\nmirer\ndeister\nredhouse\nknockhill\nmuffle\nrippin\nzowie\nheatter\nvostochny\nsensitizing\nperego\ngiovannini\nvarshavsky\nestragon\nwringer\nstavans\nlkab\ncamaguey\nhoráček\ntonioli\npolytech\njordyn\ngiandomenico\nbubbler\nheinberg\nrubicam\ntoughs\ndumpsters\nkoyanagi\nlambic\ntyseley\njoyal\nkaplinsky\nnewsmen\nmcfadzean\navilla\nammerman\nchatel\nortrud\nvishnevskaya\ngranddaddy\nstapling\nnuada\nalbergo\ncolossi\nhudswell\nbuni\neeva\nbeguil
ed\nzuffenhausen\nsteratore\nrenesas\nkenenisa\nchakan\nmindjet\nbaclofen\nwahkiakum\npanchal\ngermanophile\nkoningin\nsurfrider\nflubber\nsalesmanship\nfreestyler\ncrenellations\nruedi\njunos\nagroecology\ndayglo\ncaptivates\ndemystify\nalgerie\nblobby\nguthridge\necton\nbrockdorff\ngambaccini\nhepper\ndemodulator\nmellower\nsuryakant\nrotan\nromine\nsnowbank\noutrank\nmalesherbes\nprecedential\ndebateable\ntruls\ntaxied\nlehel\neades\nboere\nsubmersed\ncwp\ntegmark\nmultiplexers\nteddybears\nbevins\nroslan\ngawd\nshubra\nrudbeck\ngreedily\nkatsouranis\nteseo\nthatcherite\nshiela\nwhn\ncrescenzo\nmultivitamin\nfolwell\ngerrymander\nwmal\ntazio\nallene\ncrescendos\nparoisse\nbaled\nmhaonaigh\nshaila\nisizulu\ndungey\nmittermeier\nconsensually\ncommonground\nkalinovik\nkailasa\ntfn\nbullpens\nsauropodomorph\nmörner\npulli\nkutz\nshaa\nhofner\nsutor\nsheyenne\ncink\nmiano\nferrucci\nlottomatica\nmcharg\nholck\ndmo\nschweickart\ncannibalize\nriboswitches\njawai\nfascistic\nexcreting\nlodwick\nmozley\nmotorable\nwithey\nlundbeck\ndunja\nveldhuis\ncounterattacking\nexhibitioner\nhoffner\nbooksurge\nalgimantas\nbehooves\nxara\nmannus\ntraurig\nmaterialists\nclachan\ncarribean\nvpu\ntournay\ncollectivized\nfréchette\nmartiri\ncommingled\nsaltsburg\ncheverly\nbachelier\nopenside\nbedwas\ninri\nselk\nchunhua\nbenedicte\nhassled\nnastasia\ngrannies\nziege\nsinisa\nusmanov\nunpardonable\nellena\nheimer\nhelguera\nreffering\nadobes\ndudas\ntamen\nsuprising\nganay\nflagrante\nasatru\n\nkomische\nhercus\nschnier\nleventis\nscialfa\nfrischmann\nkhokhlov\nllopis\ngober\nthijssen\nducato\nkomai\nsholing\ncookhouse\nwachsman\nnightstick\ndrydocks\ncbcs\nlasagne\nsacrement\ntearaway\nustr\nkaiden\naniket\nforbesii\nseasickness\nlimmud\nnimes\nivangorod\nsankoh\nindymac\ndhb\nastroparticle\ncuauhtemoc\ndecon\ngwennap\nnorgrove\nkelvinator\nkindlmann\nwaba\nestee\ngoulette\nrebekka\noximetry\nsympatico\njingoism\ndorky\nsubhan\nböttger\nmicrosatellites\npangalos\namad\nquinet\nshergar\
nberrow\nlucido\ndemocratico\nbdw\nkaypro\ndongyang\nonitsuka\nxiaoshan\ntuban\nnamazi\nsandison\nkobelco\nmaccormack\nparaneoplastic\nmilke\njurisprudential\nhenrickson\nbrutalities\nclaggett\nsheeba\ncountersigned\nnutella\ncommuniqués\nchiana\nriboud\ntaur\ngingko\ncleome\nsilverleaf\nhoefler\ncurculio\nkatsuki\nestève\nduplo\nlilya\nmorcom\ncanidate\nharrowden\nkaim\nnapierville\nalama\navadi\nhandshaking\nvalvoline\nasla\ndipaolo\nziwei\nrore\nsibilla\nsolf\ntaedong\nelectroluminescent\nkingfield\nbeaminster\ngradel\nlavoy\ndufner\nashlyn\nmontolivo\nhuebel\nmegafon\nbilliken\nvukovic\ngolijov\ndonavan\narara\nhandforth\nmckissack\nkeret\nreinfeld\npdsa\nabsolon\nscruple\nrevved\nduquesnoy\nwycoff\nstoxx\nrezvan\nhighworth\nunspent\nstratfield\nnuwan\npierrick\nlarra\ncnpc\nbreadwinners\nzorka\naulia\ncedrick\nhkust\nnachshon\nfinsen\nhamzeh\ninhales\nfoglia\nbalasubramanian\nscandinavium\nhustla\nlughnasa\nchupa\ncreach\ncaughey\ncomptoir\nwojtyla\nmalpighi\ncoraopolis\npoya\nmrb\nzachry\narkoff\nblomkamp\ncheckbook\nlooff\nvorm\ncruddas\nultrasounds\nnaciones\nmauris\nincongruously\nibrahimi\nokina\nunoci\ntardigrades\nrezzonico\nperistaltic\nnafc\ncatlow\nhuell\nabrazo\nkemo\ncroplands\nkealakekua\nwoai\nabap\nvivero\npengelly\npolyansky\ngmf\nheru\ningela\nmakhaya\ngodmanchester\nkutschera\nignasi\nftb\nleprous\nhailstorms\napheresis\nunexcavated\nkerrison\nsupernature\negle\nangelides\nmorlot\nhuaxia\nsht\ncoralville\npcx\nmisandry\nbujar\nuzo\nlabib\ngabrielsen\ninfiltrations\nbartolucci\nfluconazole\nchurston\nmortiz\nleotards\ncfpb\nfruita\ntallangatta\nricarda\nlancôme\npengilly\nmerwede\nsurvivalists\nsetif\npetrini\npuiu\ntasikmalaya\nzahm\ndiori\nwitzleben\ncke\nbrooksbank\nmkr\nroissy\nmillas\nricos\nlandrith\nhakimi\ndandekar\ncooperators\navons\naldredge\nrhm\ninital\ndoggart\nlabouchere\njamun\nrutelli\nucn\nrimantas\nmundesley\ngart\ncuidado\nlamoure\nhuatulco\nsojourned\ntemu\ntutela\nstatuesque\ncvb\nformalisation\nbonifaz\nsadaat\nsirik\nsca
mpi\nobuasi\nbloem\nblanning\ntrista\nleisen\ngaregin\nfrej\ndahaneh\ndistefano\nguesstimate\nillsley\nseeler\nmarkups\ntorretta\nquavers\npragmatists\nprocambarus\nnawabzada\nlocog\nigh\nkohrs\nmapmaking\nlipsey\nexploiters\nexclusiveness\nnorthover\ncarrabba\nexacta\nmatrona\nmethow\nthackerville\nwjar\nassistantship\nblomstedt\nvalgus\nonex\ntolgoi\ntmj\nybbs\ntevatron\nicesave\ngualeguaychú\ncodie\nmellifluous\naceveda\ndiablada\nbicuspid\nwimsatt\nembroiderer\npittard\nwesterveld\nbumba\nbestie\nreflexology\nsilversmithing\ncotroceni\nfreakout\narsenale\nhertling\nsuining\nalimentation\nthurs\nstottlemyre\naui\nwellsburg\nhydromorphone\ngants\nmillette\ngonorrhoeae\ntakanobu\nclucas\nrockery\ncalon\nplinio\nbrunelli\ntidelands\nhotson\nténèbres\nfrenchay\npansies\ntagetes\npoile\novercooked\nkaukauna\nmalaccan\nsazerac\nquarterbacked\nlarriva\narcaro\ncaerau\ncompendious\nsimper\nmarkdown\nsindical\ngookin\nprinses\nlondra\nsrilanka\npetraglia\neog\nshirehampton\nfrothing\nwwn\ncombiners\nlegado\npriviledges\nnazer\nmukarram\nbelvin\ntalukdar\nashihara\ncambered\nmyre\nmeins\ndamac\nelmsall\nvugt\nderman\neylandt\nrofe\nastar\nstefon\nfisting\nmocca\nsunroom\nreli\nshechter\nreactivates\nmonstrosities\npancoast\nzazi\nrebalance\nfoad\nmartella\nhyperhidrosis\nrockhouse\nmnj\nlingappa\nlijun\neidgah\nrcog\ngordonia\ngaisford\nsquidoo\nbellocchio\nbadam\nglb\npyres\nginnifer\nrogel\nlacayo\nfonction\ncleveleys\nkuow\nmarlinspike\nyongding\nflails\nwadlow\nputtkamer\nschizo\nsavon\npolaco\nbanatski\nwhitnall\nwendorf\nfokin\npastas\nvokoun\ngerrymandered\nbiocon\noxymoronic\nacclimation\naspie\norangi\nsavickas\nthn\ngraystone\ndelarue\nmeaker\naltshuler\nprestonsburg\npequots\nturnagain\nnoodling\nimmunogenicity\nmariane\nritualist\nplotnikov\nivona\nantonioli\ngatland\neffa\nmilitarists\nvenkataraghavan\ndavidsson\nmahl\norrick\ngalbi\npujo\nsangye\ngalanos\nsuominen\ngillum\nbroadman\nrespire\nrynn\nmontanaro\nultrasonics\nncg\nrohner\npittas\nburtin\nkoçak\nis
othiocyanate\ndaira\ndietsch\nmynah\nuor\naloma\nneergaard\nigawa\nrudolfo\nkaabi\nuds\nbamana\ncarcase\nchryseobacterium\nhaygood\nmulki\nsaaz\nmalter\nmcvitie\nveritate\nsycophants\naccumulative\nlygia\ngleich\nhawsawi\ndiversa\nbraunston\nreiniger\nhosoi\npharmacol\nimei\nacri\ndinneen\nrancic\nedet\niványi\nsaho\nchinaglia\nrockridge\ntaslima\nturkle\nraymon\nsgv\nlatiff\nheun\nmouli\njaroslava\nbryar\nesoterica\nrafo\nchizuko\nobscenely\ndelf\nclementino\nturow\nhilding\nwiegert\nbromell\nwenig\nbadra\ngloag\nelectropositive\nsnaking\nbordeleau\nmurle\ntrebuchets\nbirling\nramalinga\nrosendal\nwainstein\ngarga\ntragi\nhelou\nselsdon\nyakovenko\nkassovitz\npolycrates\nhénin\nfonzi\nmarzouki\nbackdating\ngaarder\ndepor\nfortum\nlocksmiths\nselkirkshire\nclausus\nreaps\nkehrer\nharkavy\nhuimin\njimerson\npoppycock\nguestrooms\nxterra\nrph\ntorreon\ngoodeve\nsanoma\nkhruschev\nsupermoto\ncoachmen\nbolek\ndandan\nheffalump\nstinkhorn\nwilburton\nfritts\nbeauman\nimpudence\nnephrologist\nclendenin\nzabeel\nhornbuckle\nlactoferrin\nmoissac\nwolinski\nmehtar\ninebriation\ntoile\nsprue\nmechanoid\nhominum\ngedge\nmarazzi\nwaterkloof\nkaiulani\ntermas\nlaet\neléonore\naarushi\nleant\nsupernovas\njoburg\ntoula\nmaglione\nfuthermore\nsaturates\nchesson\ngunna\nhitesh\nscimitars\nwallfisch\nghimpu\nswitchboards\ndeamer\nrigh\nreturnable\nlatonia\nkhaira\nfatone\nkobarid\nproselytising\nperplex\napperley\nferlin\nschüssel\nmitsuyoshi\ndiamantopoulos\nwidder\nkaoma\nsmalling\nbelleisle\nriyals\ncomedie\nimminence\nexosphere\npierro\namnesties\nwalbeck\nankylosaurs\nskokomish\ntroutbeck\nfilippos\nsloc\nleron\nleggat\ntitova\nlifo\nfleener\nmacaire\nwigman\namenta\nkimmitt\napoe\nmultidirectional\nbrixworth\nkatzer\nsofiya\nfollmer\ncitywalk\nferial\nleyendecker\nrestatements\nnduka\nozols\nweisbrot\nalthoff\njavagal\ncottin\nlibrae\nzhongxing\nusrowing\ngeomagnetism\nexpungement\nlael\nsorbent\nworthlessness\nattunement\nplaisted\nmujahidin\nbaldly\nmealworms\ntotenberg\nphen
olics\nlarcombe\nculina\ntlaquepaque\nladonna\nsenghenydd\npercolate\nkaska\nprotheroe\nagnar\nclaudy\nwijesinghe\nluding\ncompered\nsturdivant\npattullo\nbcra\nluza\nvatika\nbaraja\nhsfp\nwitticisms\nmarcu\nsomethign\ntwiddle\ncorvino\nschmieder\nwicksteed\npyeong\nminicamp\npurposing\nandrzejewski\nohhh\nastc\noxonian\ncoved\nthroaty\nbrined\nbluford\nunlabelled\npache\nstrelka\ndaskalakis\nassiut\ndabbles\ntreichville\ninds\nfortrose\ncardi\nbakare\neryngium\ntownline\nspagnola\nmulching\nzentropa\nbadel\nexr\nfremington\nworkaholics\npulpy\neigo\nafterhours\nventromedial\nmerker\ndragonair\nwarrented\nscrat\nifj\ntrotz\ngauna\ndekay\nsidr\nmerrylands\nwenjing\nenduringly\nsayat\ncantharellus\njdam\ngreenview\nnivola\neasyshare\nbenkler\noverproduced\nocularis\nsarich\nkhabibulin\nmathy\nflatware\nleadbitter\ncomedown\nsemporna\nguterman\nattinger\nguinee\ncrosshair\nlidstrom\nabernant\ngetafix\nduckbill\nheym\nknockers\nbucktown\nvivino\nmamuka\ngoldsberry\nharambee\nxjs\nsahira\nonchan\nstablemaster\nusurpations\ningenuous\nebbesen\nleontes\nrkk\nsportsground\nulusoy\nyeow\nrecrimination\nseyoum\nbekim\narous\nherkules\nmidnighters\npicher\nboote\ngambara\nlyke\nrimshot\ngetman\nagel\nhimyar\nsadam\nromar\ncerqueira\nkatunayake\ncappelle\nhaxby\npickpocketing\nschmied\ndrenching\nshinned\nthiha\nkabin\nbagamoyo\nupholland\nsiewert\nopara\nuniversita\nactualize\nphevs\ndenucci\nkatou\ncastmate\nsamatha\nviliame\nbudig\nsupplicant\nmarki\nlynds\nprednisolone\nmignard\nchakar\nbenifit\nilizarov\nsoundcard\nnannerl\nkimmins\nmegabit\ndones\nszostak\npersky\nbelliveau\npontivy\npanayi\ncolnbrook\npaulistano\nordains\ndosari\nconiglio\nedlin\navin\nchiccarelli\nbouley\nvitaliano\nturbinate\nisraelita\nauchi\nbasecamp\ngisella\neurogroup\nportmarnock\nmeise\ncolesville\nteaparty\nviroqua\nbernacchi\nsvga\nbarlett\nhuffing\nhassa\ndinnerstein\nbenbella\npria\nkoyu\nwatseka\nnovica\npseudolus\ngradkowski\nneelum\nduplantier\npalmgren\ngrego\nlabbe\npicha\nranford\nantic
ipations\naaps\ntrarbach\ncrampons\ngasimov\nedibles\npronation\niaith\nmalpaso\nmanika\ndahlias\nruvo\nepizootic\nluisão\npopmart\nkurtág\namfar\npanegyrics\nnazanin\nstreiff\nmsj\nhardstone\ntimpanist\nbaranovsky\nsiamak\nnoblet\nsesamoid\nrbe\nowston\nkillion\nsibyls\nironworker\ninsincerity\nunceasingly\nseppala\nhandi\nsaguna\navascular\nconjuror\ndulong\nparibus\nnafisa\ntruc\njogged\nhawser\nwrg\ndragway\nblackjacks\nepicure\nparda\nblockley\nhullett\nboyertown\nshud\nstandardizes\nvitiello\nhogfather\nbloque\nenviroment\nyokel\naminoglycoside\nneuropeptides\nnonimmigrant\nmarris\nstrobes\ntechdirt\ntouchstones\nwalp\nformalising\nbeedi\nanjaan\nabrego\nalman\ntonsillectomy\nlovette\nespecialy\nmikaël\ndescarga\nkhorshid\ncruddy\nhyppolite\nsangma\naphthous\ncardellini\nmaniema\nversos\nreagon\nimbroglio\nmalbaie\npinotti\nkumquat\ntammet\nforsch\nrotblat\ngatson\njonni\nhesburgh\ncanonica\ninstantiate\nsecessionism\ntickers\neim\nvlasic\nmissoulian\ncurtained\nmahjoub\nchatters\nsamlesbury\ndonagh\nsheek\nkokoity\nzueva\npearle\npieniny\nmisdirect\ngouw\ncatgut\nchatteris\nyelland\nunprintable\nandromaque\nrostered\nsoldotna\nmxc\ncastaños\nstinnes\nsalomone\nbanzuke\nlivedoor\ndooling\npowderly\nkeynesians\ngosha\ngeli\nfukada\nrijke\nlanercost\nburrington\neasingwold\nhartnoll\ntgc\nfoudre\nillusionism\neling\nrefortified\nsmithkline\nguerriere\ntobyhanna\nilich\nsumulong\nleninists\ngamez\nuncitral\nembolden\nmandera\naudette\nmozes\nsoulsby\nbossing\nmajestically\ncreus\nbrazzi\nmujuru\nsarao\nhekou\ngroynes\ncoope\nblegen\nwintersun\nagli\ngeldart\nthersites\nalna\naxim\nschutztruppe\ndefendable\nmoens\nyassir\nchinami\nhgt\nhippa\ncing\nksla\nhampe\nstonefish\npashupati\nnorbertine\nclawhammer\ndarzi\nzaveri\nootacamund\npsychotronic\nmarmande\nstoss\novitz\nlambchop\nvelda\nmoncloa\nplaits\ndaulton\nmigi\ntese\nmangham\ncantillo\nbianchetti\nvasiliki\ngouws\nheliospheric\nfbe\nbarwise\nawls\nmantaro\nbogeys\nmaloja\nhuhn\nmidcourse\ninkblot\ngayfield\
nerga\nnchc\nsaadawi\npecher\nlates\nregresses\nfryers\nchael\npicturehouse\nwashbourne\nshope\ngiffnock\nutb\nairburst\nlembah\nblackguard\nparreira\nkawan\nbenso\nthery\neboué\nsnoops\nyaws\nbleeps\nmalnourishment\netranger\nmicroburst\nalrighty\nrecombining\nicrp\nzairian\ntranshipment\nluik\nsamedov\nmillage\nducker\nacey\nempey\nsomersby\nsternfeld\nwestridge\njiggle\nventers\ncombles\nlaster\ncamerini\ndawgz\ntuiasosopo\nfeatherless\nlefortovo\ntallarico\nahearne\nluneta\nsavarin\nwiltern\nperranporth\ncurdled\ntildesley\ncissokho\nakathisia\npoundland\nvalencians\nrovelli\narbib\nhomebuilders\nrisin\nprazeres\nkokopelli\nvittoriosa\nbiwott\navais\nantivari\nanjani\nvago\nrunningmate\nvalediction\nlgc\nslickly\nraddatz\nabre\ndevananda\nmulticomponent\nrazali\ndubb\ndiscription\nmozarts\nswooning\ncalotype\nswaby\nrichartz\nheadmastership\nadichie\nhardangervidda\nflaunts\narbed\nrematches\nkingsale\ntottering\nnarcissists\ncomplexions\nbrunning\nlipo\nborings\nforenames\nzhiming\nhubo\ncornick\ndogleg\npummeling\nschmeiser\nmarketshare\nracecars\nyoy\nakn\nfondre\nbeguines\ntarghee\nossorio\ntln\nbryher\nmewa\nturrialba\nkeisler\nmaccready\nneurotypical\nconeflower\novercapacity\nsenio\nyick\nsodi\naccreting\nmcnay\ncrestfallen\nheuchera\nmisericords\nneurovascular\nwebstore\nhornbach\narvesen\nglaeser\nhaynesworth\ntakita\nmarree\nquidam\njonbenét\ndominie\nstager\nlamona\ndelamination\nmcternan\nperr\nlusso\nemiliana\nvenerating\nmoonshiners\nlimonov\nflypaper\nhoppen\nsfh\nsupress\nnecesary\nnextwave\nsoph\nfaymann\nkingsoft\nrecuperates\nwiht\nwaart\nguanylate\nlewdness\nfeldshuh\nbesmirch\nwining\nschweigen\nlindos\nrouble\nfurby\npirveli\nfanwood\ntrimethoprim\nbotte\ntiebreaks\nscalene\ntacs\ndunkerton\nechoplex\ncreusa\ndriel\ngainsford\nquartus\nringel\nappologize\ncryotherapy\npassarella\nkalsi\nphooey\npuissant\nturmoils\nnimruz\ninthe\norris\npagewanted\nrevalued\nmahu\nmathworks\nlamarca\nobnoxiously\navrich\ncreditworthiness\ngreatorex\ndandini\
npavlou\nvasp\nkandhamal\nrajang\nkathimerini\nabrikosov\ncomdex\nharlaw\nalvor\nwabco\nkandeh\nbego\nktunaxa\nkuniaki\npolypore\njuhasz\neuthanize\nundulation\ntalgarth\nlatos\nbraben\nkeesha\nkrauth\nteaspoons\nspiderwick\nhellogoodbye\nvelten\nundulate\nunhistorical\nbuntline\npogrebnyak\ndelamare\nlungren\nhanway\nkelner\ngettier\ngunsan\nmtw\nbrott\nunprivileged\ndruckman\ndawlat\nfunnelled\nsterilizations\ncriner\ngasse\nphencyclidine\nbeefeater\nmankiw\ngazet\nuntalented\nsiemaszko\nautónomo\nfallah\nboam\nblackrod\namnestied\ncamero\nlipodystrophy\nsankyo\nexcommunicating\ncohere\nscandalised\nsteams\ntotentanz\nglyndon\npolyphosphate\nrobbert\nmukasa\namarcord\nsxt\nseismographs\nblf\ndistillates\nvalujet\nbrutalism\naaker\nperrotin\ntejera\npakokku\nbengoechea\nnaber\nballyconnell\nmenards\ntorborg\ntomaz\nunclimbed\nhethersett\nzanon\nbuzi\nthermographic\nsimplicissimus\nrodica\nvieilles\naveril\nenglands\npeeing\nnordsee\nnurit\nwelchman\nyathrib\nhoisington\nmingora\nnoz\nzambada\nasare\nslurring\nkalli\nbrandies\nberrier\nmackays\ndunsfold\ndehydrate\nrippey\nvdb\npartite\newelme\nmiep\npurkiss\nesslemont\nmccoughtry\nmarter\nmontenotte\nsujan\nium\nglt\nmackendrick\npabón\nhagg\nodorous\ncastelar\njedinak\nenjoins\nshukhevych\nmoskvina\nkeena\nouta\ndaghestan\nzulema\ntaiaroa\nkatydids\nputts\nfazakerley\nhian\nholguin\nslumbering\nrimon\nrusal\netiologies\ncavo\nbonventre\nmaci\nuncommented\nreineke\nkwp\ntorras\nrobertas\ntitley\nnaess\nerminia\nbusher\nhummed\nfaivre\ndisd\nunitech\nhoneywood\nbemoan\nrunup\nqaid\nbussing\nzurbriggen\nblattner\ngerth\namstutz\nbruntsfield\nmarchington\nmeriel\nnonentity\ndramatisations\nrubidoux\nbotwin\nstai\nbaranova\nfreep\nmaxïmo\nalerta\nlovefilm\nmanwaring\niua\nspoonerism\nkgmb\nstädel\ndawoud\neyam\ncrumley\nhonghe\ncoches\ngrutter\nsolal\ngosplan\nsculptress\ndoar\nparahippocampal\narsehole\ndaytimes\nkhambhat\nceteris\nafflalo\nextruder\nncsl\ncarême\nbigtime\nkawy\nbelyakov\nlambesis\npoliticans\niliamna
\nedelbrock\nbarazzutti\nwadhams\ntwe\nroutley\nburbach\nthresh\nremez\nuut\nforgoes\nphosphatidylserine\nlulla\nbusdriver\nelizabethans\ngreathouse\nchenggong\ndaho\nsmolarek\nteviotdale\nadap\nspohn\ndebes\nbollea\npozner\nheitkamp\nwiehe\ndegenhardt\ncunninghams\newb\nreawakens\nlocalist\nrearden\njackey\ncartage\ncassandre\nwhirring\nseiran\nuntangling\ntemujin\ngrenze\nsherkat\njarosz\nnikoli\nmaladie\nrisha\nasao\nperedur\nbops\nvieuxtemps\nhaise\nlifers\nultraviolence\ntakahito\ndeerfoot\njeer\njtv\ninessa\ndestatis\ndiliberto\ndoosra\nkostyantyn\npontremoli\ncadaqués\nhink\nlakme\nhalutz\ndebarking\nissigonis\nbirnam\njóhannesson\ndéputés\nnuuanu\ncpcs\noriginalism\nhancox\nosis\nmanen\nonkar\nunchain\nodermatt\nuprightness\nmichaele\nleggenda\nartan\nthemba\nmegu\nscudo\nrogozin\nmcmoran\nsoken\ncleophas\njaycen\nachondroplasia\noronsay\npurnea\nwaynes\npotente\nallbusiness\nancrum\nwanchope\nyanovsky\nghannouchi\nbabilonia\ninfraero\nsonn\nfaim\nannemasse\ngreivis\nfori\nringen\nkneebone\nfratricidal\ncrowthorne\ntatti\nstorable\nbooka\nbreitenstein\npucallpa\nfederate\nmcneice\nkirschbaum\nabood\nenvisaging\nmcaloon\neridge\nhoki\nrefinished\ngillot\nclingy\neigner\ndevraj\neurabia\npryderi\ntradenames\nturnbuckles\nbonzi\nbaten\nhilf\ntriforium\ndamacy\njolissaint\nwasikowska\ngiat\nmeranti\nwesters\ntunnelled\nnoteriety\nreimburses\ntorphichen\ntolstoi\ngeering\nenroth\nspinouts\nevangelia\nchangement\nuttermost\npirri\nsleepwalk\neqs\nkamille\nsonck\nsinuiju\nwoda\nnubar\nholmesdale\ndonck\ntigrayan\nejemplo\nsermo\nunderfunding\nbalboni\nhouseplants\nheilprin\npersonalising\nknockabout\nuio\nchokwe\nnyai\ndeitz\nsteitz\nphot\nrutka\nchenies\nwerman\nyoseph\ntoolmaker\nassim\nloane\nthg\nminford\nggt\nholey\nmastrangelo\nlaundrette\njolliet\nwhitened\npowershares\nscheidegger\njowl\nrehhagel\nmazdaspeed\npathophysiological\nysa\nkimbrel\nprempeh\nblampied\nshok\nuranga\ngirotti\ndonot\ntager\nunderslung\nunexploited\nsallam\ndarrieux\nlongbows\nneile\n
kets\nrisner\nlaverdure\ndextran\npanwar\nolliver\nbourdeau\nreconfirm\ncardstock\ngolisano\nduenas\noutsize\nbaratta\nmeyrin\ncamilli\nrcra\nhobbie\nllanfyllin\nsihamoni\nlynnette\nnfdc\npuglisi\nbarragan\nvanesa\nvaccinating\nschumaker\nbindery\nhummelstown\nbenouza\nchequamegon\npresteigne\noptronics\naddicting\nmattera\nudayana\neisuke\nuniveristy\nmyotonic\npattenden\nalbro\nimparcial\npeddled\nxyy\nvouliagmeni\npamp\nivel\naje\nercall\nfutilely\nspragg\nureters\nblotchy\ndilithium\nroederer\ngrania\nakhlaq\nkoukal\npoka\nloriot\nwilcke\ndadaists\ntomasa\nunderperformance\nreemerge\nzorich\niroda\nbatyr\nsuvi\nkilar\nbergholt\nsatiate\nlockouts\nproxmire\nlangtang\nensslin\nlawe\nsabanci\nbabuyan\ndiko\nclangers\nmatchmakers\nscarff\nhuei\nzikr\nkoker\najinkya\ntonnant\nchoshi\nbedloe\nwpl\ndiopter\nlucerna\nkacy\nnatoli\ngrandpré\nballantrae\nauchan\nfgv\nmahat\nnrd\nlcbo\nleiser\nbassols\nedzell\ndorries\ndotti\ncbz\nkubicek\nwhatmore\nvarrick\nslindon\nanhang\ndragonfish\nmicawber\ndestabilised\nmannock\ndeady\nramer\npontarlier\npome\ntinseltown\ndatas\npraline\npittenweem\nlapi\nbaucau\nneurasthenia\ndivin\nhcb\niberoamericano\narrestees\nkeaney\nexosome\nculford\nwoodmont\nshatskikh\nadroitly\naircon\nconsorting\ncaca\npatak\nthair\ndaju\nreflow\nmassospondylus\niadb\nbelhadj\ntaffeta\nkrysten\nmiggy\ngless\ngroll\njarndyce\noswal\nencinas\ncroston\nlutze\nogwen\ndecontaminated\nwrightstown\neisaku\nbalibo\npotocka\nvomeronasal\nsouder\ndefuniak\nkathlyn\nirradiating\ntarra\nbaaz\nclissold\ncitri\ninveresk\nsignups\nzema\nsyntek\neffing\nhilaria\nexpandability\nsmooths\nhanisch\nfriedlaender\nmccaughan\nrecoiled\nchristofias\nacocks\nsapelo\nleider\nasphyxiated\nmalathion\nviktória\nacidified\ndixi\nquaintance\nnuran\ntokar\nformia\nroosa\nnirav\nfusi\npadrino\nvalek\ncaas\nraro\nleupold\nwatchin\nflorance\nepimetheus\nharmel\nalfonsi\nsirajganj\nbarq\nkashf\nsampat\nkoech\ntepuis\nleskinen\nhollowing\ntotalisator\nosogbo\nhvidt\nsirnas\nheartgold\nbosham
\nbetchworth\nnicknaming\nnehlen\nploshchad\nbeil\nrotheram\nhudong\ndernbach\nfrade\ndeclamatory\nplasterboard\npahokee\ntacrolimus\nannamaria\nmatina\nkfwb\nmiddles\nstraitjackets\nkoppers\nsiddal\ncorail\nnasda\nbillo\npimen\nyaphank\ncockell\nyoull\nsmartt\nholway\ntomasello\nchimaeras\nmccargo\njutiapa\nallergan\ntomasini\nmarchionni\nswerling\ncuriosa\nusualy\nniyogi\ncamaiore\nfetishists\nvincere\nmortalities\nnbu\nhaddaway\nmonteros\nfederspiel\nkasauli\nmetafilter\nvao\nprendre\nmandai\nmgi\npablos\nın\nskipp\nschaan\nsike\naillet\nallia\npritt\nminge\nvicenta\ncurwood\ncinderblock\nbowersox\nmodiano\nmullings\nuntreatable\ntwangy\nerian\ntanimura\ndahod\nramaz\nmortazavi\nscrees\nschlitt\nmetabolomics\nmiroslaw\nvilks\nbooysen\napicius\ninterpose\nmariquita\nplausable\ncayetana\nkibet\nmorihiro\nswashbucklers\nmarve\nlobkowitz\nhearken\nfager\npsyches\ncavuto\ndeterminist\ncompadre\nsoem\ncounihan\naccies\nthermus\nvartiainen\nshobu\nawns\nzadkine\nspecifiers\nkhorana\nhazeldine\nbeason\ntransantarctic\ncampfield\nisrar\nportamento\nwhitewall\ndmitrij\npsoas\navens\npanju\nmamdani\nbashmet\nmuchmore\nksf\naddey\ndawna\nnyassi\nlamoriello\nspiegler\nmassingberd\ncameri\narietta\nengorged\nlupica\nboyens\nmeshwork\nborked\nboze\nfockers\nlaplanche\nfelicita\ncrapp\nlomi\nalerte\nhatchard\nenchants\nunstudied\ntordenskjold\nalgonquins\nbatmen\nprosopagnosia\nbamboozled\nalaoui\ncomixology\nkorobov\nlambrechts\nscornfully\nfarooqui\nditchburn\ndominicano\nerasable\nmetrobank\nurbanski\nduellists\nallwood\nqas\nfpb\nmebo\nfootstep\nbackrest\nglm\ndunmanway\ncorbetta\nphalloides\nhyperpigmentation\neskin\nguardino\nilike\nmichiana\nmoderations\ncornishman\nevinces\njudes\nzhirkov\nlukowich\nbronchiolitis\nrokko\nirritations\nrexhepi\nshugborough\ngalecki\ngarron\ntorstensson\npruyn\natomized\nhemed\nbotello\npoteet\nsurikov\nvladek\nchequer\ncaballos\nawori\nregiomontanus\newhc\nyerma\ndenitrification\nbohanon\nêtes\nadvertorials\nstockades\nnudelman\ntarasenko\
nsiona\nfenter\nbrener\nchiaramonte\nminghui\nmalikov\nmauriello\nradjabov\nspringtails\nterefe\nmuqrin\nscullers\nbutthead\nintec\nquach\npeko\nadvantageously\nflatfoot\nmarí\noklahomans\nclemm\nsidle\nnetweaver\nbanyoles\nvoima\nparmigiano\ntredici\nmalakar\neday\ndapple\nurbandale\nurinates\nbolex\nutaka\nwre\ntreffers\nfinntroll\nlindenau\nheadrick\nexpatriated\nmogilny\nportlethen\nfaran\nbluebottle\nfourplay\ngoog\nwesttown\nlayon\nabare\ndfff\nkolehmainen\nmowrer\nsisera\ntobira\nkosmo\nneocons\nfumarole\niniquities\nmichale\nsuppan\nbisphosphonates\nasdf\ndapa\npierino\ntwosome\nshehbaz\nffynnon\nandruw\nminicab\nshigeyuki\nmiret\nmobileme\nzaide\ncomptia\nstaab\nnyarko\nsatrapi\nphilipstown\nlovitt\nlancy\ndonnay\nbustani\nrosler\n­\nbellos\npiasa\ngarnsey\nrotha\nborchard\nintercompany\nbloop\npccs\nsimoneau\nisachsen\nannetta\nbrandts\ncrossbreeds\nsunning\nmovida\nmaddened\nduetto\nswindles\nmidp\ncloss\nmeningoencephalitis\nautogyros\nmida\nheatseeker\nhakuna\npiggery\nemancipatory\nminko\nlighty\nliveth\ndellacroce\nmcnee\nelastically\nledezma\nhaeju\nwasmuth\nupdrafts\nnuruddin\nmafikeng\nxylophones\ncarvell\ngottehrer\nibraimi\nborley\nmałopolska\ndrainpipe\nnicollette\nverbascum\nlickin\nstard\nschuldt\nsingler\nnabarro\nsagacious\nlashari\ncatchier\nmronz\nwildernesses\nstarmaker\nsimonini\narmenteros\nnaus\ndaat\nouali\nessl\nrhr\nhafer\nromay\nwhitesburg\neusa\nnoughts\nlej\ncandombe\nnonsence\nsafronov\nsmarta\nalphand\nxibalba\nvladyslav\nrasha\nvajiralongkorn\nvanbiesbrouck\nhatfields\npilfered\ntiktaalik\nbledel\nreynes\nkoppe\nosawatomie\nsatpathy\navraam\nsoleri\nzebo\nskorpion\ndaag\nbrüno\nunsimulated\nevelio\nbetuwe\nquek\npolycentric\nkellam\ngennevilliers\nbuggin\nandreini\ncerveteri\nrootkits\nalptekin\nsmugly\nseductions\ngrocott\ntadros\nquila\ndewlap\ncleddau\nbirdlike\nclepsydra\nhighrises\nmkz\npriorat\nhetrick\nrembrandts\npratten\nbaquba\nrtw\nalvie\nyorath\nptm\nchilis\ntrysts\ncheilitis\nbaccarin\nmunte\namans\nkahnweiler\nmj
f\njamu\nnoer\nbynoe\nrougerie\nexploitive\nhenpecked\nshahe\nabano\nueberroth\ndrage\nabberley\npugmire\nyangsan\nshohreh\nperlite\nnijinska\nhiace\nsdlc\nphotophobia\nbenenson\namagansett\nmagon\nabbes\npalates\nttk\ntripel\nimaan\nwoolery\nmadalyn\nmeebo\nbhavesh\nbishopton\ndomokos\ngimbels\nnasogastric\npnw\nrajhi\nneak\nlwd\nnazione\nschleifer\ngummo\ncharisteas\nsurmising\nbenon\nsuwardi\nswen\nmycorrhiza\nzohn\nksfo\nledwidge\nrotoscoped\nboisbriand\nwhiteread\nkostin\ngladney\nkrauskopf\nmanzai\nsted\nkowalik\nextrusions\nmanimal\nbrehmer\npescado\nkagen\nmatusow\nmallette\nronchetti\nkovic\nbislama\nargillaceous\ncutcliffe\nhelmore\nchador\nhypogeum\nregn\nfgb\ncherone\nbuist\nblache\nmouldy\nwkaq\ndorrian\nclimaco\nhudnut\nsahashi\nrouf\naysel\nmadeup\nkirovsk\nlegatum\ngrippers\ngerunds\nsnowdrops\nagor\nroderich\nsegmenting\nfontbonne\nstaffords\nbrr\namorosa\nkaronga\noii\nnardiello\nvads\navod\ncochon\nbonini\neconet\nmaziar\nalicea\naveo\npodiatrist\nvelikiy\npreller\ndollman\nstoloff\ncitylife\nmargolies\nhumpbacked\ncosti\nkepala\ncryengine\nguérard\nmaila\ndonoho\nbalat\ncaprioli\nrelishing\nawais\ncoate\ntalaq\nrlv\nfreights\nmountainview\ncongressperson\nnachbar\njouffroy\nthermophilus\ntrevis\nmarilena\ngritti\ncheli\nschoenstatt\npirjo\nmugar\nexistentialists\neeee\nhaina\ncrossers\ngovia\nleptospirosis\njundallah\nhewetson\nfoday\ncurettage\nsoccerplex\ntsuchiura\nhelfgott\nmlbpa\ninb\nwhipsnade\nhoneybourne\nadmixtures\nhussy\ndystopic\neverthing\ntweener\nphilosophizing\nschrade\nmenne\nparri\nkuramoto\nmosqueda\ngoogol\nsiarhei\nnoren\nkoseki\nbarrandov\ndaisey\nkeffiyeh\nhuyghe\nbarrelhouse\nberkovich\nvanderlip\nmoghadam\nkiah\nmatancera\nhotwells\nrimet\nurgel\nevar\npapes\nbraggs\nurdd\nlvr\nunmemorable\nyadegar\ndewing\nbovingdon\ntoxicities\nbahian\nschneck\nrusskoye\nmylor\nvrabel\ntymon\nschmeisser\nbookbinders\nmyfc\nitalicum\nleterrier\nkarstens\nwilhoit\ncollingsworth\nbawdsey\nsusini\nchiriqui\ncasspi\nkoyo\ndayle\nzubaida\nalv
ernia\nberroa\npothos\nniwot\nvaris\nlongwall\nafricanized\nendresen\nseman\ngign\nostin\nshintoism\nkrays\nclobber\nattal\nhighnesses\nrába\nsavitsky\nlagman\nyips\nfrancesconi\nlafosse\nspener\ntahreek\nleashes\noncolytic\nmccahon\nepinal\ntawan\nadelboden\nrgn\ninfor\nmatton\ndebusschere\nwgst\ncorofin\ncoletta\nshielfield\nchegwin\ncarwardine\nimer\nkleptomaniac\nildebrando\ncentros\nkyat\nizo\nsatine\nricoeur\nchavous\nexacerbations\njaimes\nsubregional\nmcbeath\ndomenick\nhpu\nosbournes\nebx\nazcárraga\nbelizeans\nsillman\ndxf\nherrerasaurus\nthaddaeus\nfukumoto\nhemma\nusgbc\nyrsa\nrockton\ndownunder\nasle\ndjimon\nthornett\nhollering\nkime\nchatuchak\nbailyn\nsyncom\nschiaffino\nradulov\nkreiner\nwillesee\nmurawski\nbowdlerized\nlaforce\ncaspe\nchecco\nvillalonga\nzakharova\nshearim\nsaly\nsamoyed\nhasbara\nolay\nazria\nchallen\nlenexa\nnatesan\nuntranslatable\njudice\nicma\nnorthmoor\nodenbach\ncems\ngumy\ndoñana\nlalaurie\ndabashi\nshostak\naberconwy\nkajiado\nnimmer\nunpronounceable\nfurcal\ncorruptly\ngiertych\nhochstein\ndanell\ndalmore\ncapece\nurashima\nirrigates\nlimpid\nswu\ndiaphanous\nwxia\npreimplantation\nulv\nmclure\nmcqueary\nsizzlers\nexternalizing\nschwer\ntechsystems\ninnerspace\nutrillo\nhanjour\nmaraj\nhenges\nrothen\nsykora\ndurutti\nbgi\nmoralising\ntoposa\nredecorate\ncastellari\nolaudah\nincomparably\nprunier\nlateline\nsicario\nbelotti\nsimeonov\nsolondz\nrancour\ngemsbok\nrickwood\nguelleh\nschoendienst\nboodles\nheatstroke\nmusawi\nsatilla\ndigestibility\nblamire\nchoptank\nfhi\ntourmalet\narchbald\nunlogged\ngattaca\ntastemaker\njonquil\nkhris\nsniffed\noetker\nheteronormative\noteri\nproietti\nteff\ndanil\ntelevisual\nbigleaf\ncombusted\nshinko\nlechlade\nmezuzah\ndrori\njoschka\nsleeplessness\narcadium\ninsole\nluminaire\nnonvolatile\ngaugamela\nburzynski\njallieu\nkameron\nackerson\ngrou\ncairoli\nmowden\naspirates\nesbjörn\nkreisau\nspenny\nkristiina\nintraepithelial\ntusked\nwittenham\njagoda\nunhyphenated\nmisremembering\ndi
mucci\narlan\nmontuno\ngez\nparticuarly\nmazarine\nvoyer\ninagua\nflaxen\ncuillin\nmamilla\nutilis\nnovosel\ngurvitz\nbozzo\ncortines\npopularizers\ndukhan\nparochialism\ntimoney\nlibia\nchoroidal\njostled\ndenílson\ncpy\nstanciu\nwedging\neditorialist\ncostlier\nhectoliters\nwinberg\nauthenticator\nhasen\nlotman\nmarazion\nsug\nmancow\nisiolo\nsogang\nbodenheimer\nzipping\njlc\nquiera\nmoulavi\ndrobnjak\nwheen\nlatasha\nbargo\npalantir\nbiobank\nesson\nmelandri\ngreenhaven\nmoussaieff\ncrippa\nswit\njests\nhtun\nbracht\nhofland\nbecchio\noleksander\nshiekh\nradiographers\nspectating\ngobbler\nverulamium\nsaulo\nshaposhnikov\nhearses\nmezquita\nayyam\ngrouting\nscrapbooking\nquietism\nmarkiewicz\nwindhorst\nogof\nreplayability\nvanwyngarden\ncassaday\ndebevoise\noctopi\neiken\nshinta\ndegla\nakre\nhabanos\nvinick\nmyostatin\nbummed\nsteinheim\nprotectionists\nyanase\nwraysbury\nbourdillon\nlacewings\nharassers\nlightroom\nrajskub\nkamaal\nveyette\nritzy\npolke\nillescas\nemmets\nblackmailers\neasby\nnool\nnycha\nyatta\ngarrulous\nsiggi\ncouceiro\ncollishaw\nponente\nsyphilitic\nkandasamy\nhyperlinking\nranville\nenunciate\nsurkh\ngnawed\nsefi\najn\ntalvin\nwftv\nschnellenberger\nstri\nattas\nranjini\nbricklaying\nbedecked\nghika\nalmanzo\njibs\nergun\nkalanchoe\narzhan\ndening\nritualised\npekingese\ngakushuin\nrepartition\nparvis\nstardate\nphytosanitary\ntuer\nlitigators\nribonucleic\ndelma\nswarts\nzdeno\nzohaib\nstojanovic\ntadano\nfulsome\nwaterproofed\narundinacea\ncaestecker\nundistributed\nslatted\nferentz\nbouteille\nneccesarily\nbabini\nbareiro\nglessner\ntakeya\nankers\ncountryfile\nshenker\nluffa\nharaguchi\nvybz\nincontro\nstrop\nwhisperers\noncological\nmoyra\nteuku\nsiddig\nparer\ncastillos\nthomlinson\nlulac\ndoubleheaders\nmargy\ntache\ncockspur\nmorgenpost\nunlikable\nenstone\nskowronek\npolegate\npaba\nmonolinguals\neget\nlorence\nbermeo\nmotonari\nmetter\nhooten\npelissier\nbraamhaar\nshailene\naamc\ntahi\nflippantly\nherstmonceux\nhelipads\nneth
\nbalmoor\napexes\nninagawa\nicfi\nyehiel\nobjectifying\nfletchers\nanally\ncambra\nsequestrated\nlmao\nasiri\nmeze\ngeisinger\nfishermans\ngeorgievski\nmissin\nherberts\nwinnow\npowley\npyrites\nminner\ncomparators\nintrathecal\nnorddeutsche\ntemporada\ndeshaun\nzalgiris\nwdg\nfarda\nkompong\nbrettell\ntirona\npassaro\nvagana\nsamwise\ncumnor\nthermic\narensky\npolian\nkreisel\ninspektor\ndeutschlandradio\ninterspersing\nacho\ndiscothèques\nloewi\nphf\nribatejo\nworde\nconformism\nscreeners\nburtonwood\nblandness\nchloramine\ngelai\nzubizarreta\ngrommet\nsodomites\ngratuities\ncontraindication\nvarlet\nonboarding\ninsúa\nhalawa\nrosellini\nlusi\nferrybridge\nbiosolids\ndynes\nseptien\ngerasim\napoplectic\nmirabello\nmerest\nhovel\nferdous\nfallsburg\nstickle\ngapped\nmacrory\npesqueira\ndetoxify\npavelka\nmarda\nnemerov\npedophilic\nnervy\nuthappa\nkurylenko\ndonlin\nrecidivist\nschwarzlose\nmontană\nthelin\nbronchioles\nheaved\nhaddenham\nbaculum\ndaubed\nroxon\ntokaji\naltuna\ndyrdek\noompa\nklaxon\noverfished\nmagrini\ncarmencita\ntrengove\nspj\nhardcovers\nmccorkell\nbuttressing\nkdd\nextortionate\neslami\ngrandage\nfleshtones\nncmp\nswags\ncoline\npercieve\nmauceri\nmissives\nalarmism\nhaltwhistle\nhonved\ndistributorship\ndalling\nasencio\nmcelhatton\nhalcon\nbanpo\nbrmb\nkohls\nschwabing\nmaitreyi\njewson\nlinebaugh\naracoeli\ncresskill\nlaforet\nscantlebury\nnrma\nlambis\nsentier\ncorpulent\noher\njessalyn\nkidson\nkuzmich\nmosk\nwenk\nthornborough\nprodigiously\nmastiffs\npejoratives\npadley\ncallout\ngretl\njourno\nkhamsa\nderipaska\nlaville\nchangxing\neatons\nsamur\nficker\nwestchase\nsiol\njhabvala\nvolle\nblindspot\ncristescu\ntsankov\nculcheth\ngeremia\nbestiaries\nfluorocarbon\nmaoming\ncalado\nbrocchi\nwhippoorwill\ngurian\ncuticles\nheisey\ntaler\nbramblett\npossibles\ndebeers\nosteoclast\ngleed\ncrossgates\nriegle\noldboy\nnadra\nyasukawa\nyoffe\nyobo\nchateaugay\nrehoused\nnucleate\niruma\nweigert\ntanar\nméridien\nressel\nkaisei\narrupe\nrfo\ns
lipshod\nwithernsea\ncastros\nhardcourts\nnare\noligarchies\nslink\nsatyanarayan\nacaba\nbuczek\nï\nvilkov\npraziquantel\nyutong\nasiaweek\ntortora\nbrotman\ncordwood\nvolage\npeguero\nrdu\ntaketh\nmonetized\nquattrocchi\nsignis\ntinning\nvirgile\nschwartzberg\nkadeer\npeover\nthiamin\notowa\nkyphosis\nfpd\nascertainable\nbutterball\nhawkshead\nfaros\njpm\nlacertae\nrumpf\nridgetop\nbargen\ndifficultly\ndriesell\nsuperspeed\ngeach\nconcertgoers\ncremaster\nopata\narundale\nhurlock\nlemmer\nkirkkonummi\nganea\nmilen\ntechnicolour\nmatheu\nweasely\nlbm\nmaeno\nvanwall\nbaddesley\nlespinasse\nlipmann\npichu\nleeder\nhaendel\nmitad\ndant\npedicab\ngándara\nhallan\npaoletti\nimbrie\nkhesar\nlegitmate\npigsty\nxetra\nsahuarita\narundo\ndeiner\nunderstate\ncholecystitis\ncadigan\nsebert\nglenmont\nmoltmann\novereager\nnmf\nvionnet\nadressing\ntoxoid\nparavicini\ndigue\ncentralising\nbiosensor\ndickhead\nackman\nkheel\nscmp\nprivatizations\ncalmar\nrepaving\nshasha\nfreas\nprocurer\nsundered\natorvastatin\nvituperative\ncertian\nreconversion\nmuddling\nskidelsky\nyanshan\nmountrail\nzakarian\npropositioned\nhamstrings\ntropism\ngiorgetti\nbmh\ncoddling\naxelson\nslighting\ngravesham\nchecchi\nphytoremediation\nfitzwater\npsuedo\nmyleene\nrosenwinkel\ncnm\nmacaco\nbardney\ncefalù\nozias\nminiskirts\nkouyate\ntiggy\nrockefellers\nfrühe\nkibworth\nbenadir\nmiscounted\novenbird\nstrout\ninundations\nigad\nmonheim\nmoomins\npachanga\nllyr\ncheeta\ndearmond\nsundowners\ngoed\nnihilists\nokpara\nslouching\nostra\ngodín\ncalcraft\nwombourne\nvillela\nplacita\nstarlog\ntabassum\njinji\nreleasable\npsychotherapies\nlatecomer\npervis\nbelleek\njohana\ntermez\nodet\nquesne\nchancy\nprisa\nmagix\nlakey\nactivin\nrattner\npavlina\nnewsmedia\nallrounder\nbasketballs\nkailin\nfreel\nsaveur\ndileo\nsadaf\nformalistic\npostulant\nkmpc\nminkow\nsplenectomy\nlors\nkbd\ndonovans\ntowy\ninessential\nmaharoof\nsugai\ngetchell\nfreebies\npaiement\njois\nrighty\nkafe\nempathise\nquetzalcoatlus\nsti
eb\nschiro\nkuruvilla\nstavisky\nleeves\ntmv\noktyabrskaya\nleukotrienes\nkarnac\ncosponsor\ndescalzos\njarett\nmagnetar\nbaljit\nwhitebait\nsavoyards\nnamings\nsugimori\nbuffeting\nfumagalli\nedrington\nmeggan\ndozing\nrosten\nhaston\nmema\ntvl\ntigar\nquacker\nfillip\ngaitan\nsanjuro\nedgerly\npiette\nfesto\nssnp\nlustiger\nunsubstantial\nwrgb\npohlman\ncofield\ncobley\nbruckmann\nhardier\nprovidenciales\nbastianini\ndryly\nlombardozzi\noea\ntueni\nskitter\nkoushik\nkarnik\nsmiting\nscabrous\nsyriana\ndowe\nfringy\nhemingford\nlaca\nbanyumas\ntreebeard\nmosaddeq\nsaren\nskyhorse\nbridgett\nresi\nprofessionalize\nsturmovik\nnans\nwhu\npeachum\nrabinowicz\ntitrated\nchicheley\nvirta\nextinguishes\ncasiano\nbodyboarding\nfcpa\ntaicang\ntransgressors\ngalmudug\npiffle\ndanta\nvermaelen\nparador\nforsythia\nkempster\ncubo\npries\nimprudence\nbelova\nbabbo\ngarsington\nremmel\neop\nwittily\nzollinger\napplicators\nspeake\ndiffident\nscrounge\navit\ntorok\ngerstenberg\nshiwei\ngoomba\nauletta\nlusby\nkipping\nantbirds\nferm\nstanfill\nlugares\nsteeplechaser\nblaik\nlavette\nzillur\nhustvedt\nmisapplying\nlaffin\nkerchief\nzeina\nakinnuoye\nmurer\narrochar\nrabidly\nkripalani\njohner\nbuhr\ndeseado\ntsuno\nnyhan\nkurek\nmushrooming\ncompston\nfirn\njiefang\nchangqing\nhandwashing\nterpene\nschuylerville\nkyril\ncautley\nstene\niesus\nsilicones\nkjos\nvincristine\nurbanites\ncobie\nkuechly\nincompletion\nmasucci\ngreatrex\nbodner\ncartuja\nfemto\nterrel\ndacheng\nqissa\ntieling\nbasnet\ngyn\ndynamis\ngvt\nslicked\nsoetoro\nkljestan\nhofstetter\nbenecke\nbecht\ntulpehocken\ngrinderman\nradioiodine\nproduits\npohick\nchatrapati\ncoolbaugh\nluminaires\nmontreat\nacklam\nkgc\neip\npiques\ncorked\nlandsburg\nsandes\nstever\nfrayser\ndendrimers\npaia\nigd\ncommunitarianism\nattapeu\nwdjt\nbaiul\nmichetti\nplumbed\nabysmally\nzavaleta\nhawkstone\nantropov\nxinyu\ntornillo\nasman\nsursock\nxsara\nlannemezan\nkwtv\npollin\nwlad\nlemi\ntrichardt\nfreestylers\nbrixia\nsccs\nethem\nfl
uellen\nsandersville\nbraunschweiger\nreiten\nclopidogrel\neulogised\nsdks\nfishback\nfraying\nugarov\neijk\nnavjot\nringrose\ndaini\ncoital\nmulugeta\nhaaland\nteutul\nrecategorised\nlarabee\npashupatinath\ngrimal\nalhaj\nraymore\nbourlon\nmattino\nkanerva\nmtas\nmihama\ndulcamara\nluffenham\njaqua\nsuef\nkemmler\nngm\njackdaws\noppermann\nhived\nmeeke\nelish\ndargo\nunlikeable\ndafna\nsunninghill\nyamaji\ntennesseans\natomizer\nmarcelli\nteegarden\ntrapezoids\nsammer\nhomebuyers\nbensimon\npacta\nwashouts\ndanso\nchassé\nplock\nsolitarily\nbsba\noceanica\nhijras\nkiyomizu\nhaefliger\nbvn\nbuntine\npsychopathological\npanja\nwchs\nvillamizar\nhougue\narses\nmandore\nraveonettes\nheiligenkreuz\njokhang\nzarza\nchatelet\npacepa\nattiya\nivarsson\nnorthington\nthialf\nglenfinnan\nnemoto\nmessam\ncicco\nkarolos\nrestlessly\neppa\nbiberman\nwallah\npilchuck\ndirecte\ndeuchar\ndimethyltryptamine\nbejan\nerekat\nabraded\ntrackwork\nnagla\nnoby\nwpri\nammonoosuc\ntilefish\ncandelas\nvizio\nhayashida\nbluesky\npeppering\nfootlight\ndelegitimize\nepoc\narci\nvorpal\nunacquainted\ngoyt\nparalyse\nlipponen\nhobbema\nalleanza\nejidos\ngachibowli\nfructuoso\nfaumuina\nmoiseyev\nvalarie\nsaintfield\nsccc\nchaiken\ndoubledays\nmergenthaler\nihab\nsenatore\ncyberinfrastructure\ndhd\nscaglia\nvillian\nscald\nlycanthrope\nvural\nfondle\nbressler\nwiba\nbevill\nbifurcate\nguarisco\nakhmad\nbacco\ntautog\nhierophant\nsardinero\nphotoacoustic\nirini\nrungkat\nberrie\nharkes\ncellblock\nakhurst\nhurtig\nsharratt\nthurmont\ncausley\nmonegan\ncleckley\nansf\ncrumpsall\ndimitriou\ngiannakopoulos\ntweek\nhacke\ncaspari\nmiscanthus\nretuned\nkaluza\ngunby\ncurdling\ndecisional\ndelamar\nminson\nfulminate\nbegumpet\nrobinet\nschizotypal\nhealthily\nauditoria\nwatty\ncussons\nmiva\ndevyn\nhession\nnchs\nnegligibly\nsundering\nkosovars\nbrodick\nphotoperiod\nlius\ncardia\nweetabix\ngrehan\nhho\nnetrebko\nresko\neurosceptics\ntenenbaums\nmhg\nsugared\neuthyphro\nherkomer\nmirando\nfitial\ncasei\n
biomaterial\ndefintely\npettengill\nmunchies\nnambucca\ngilyard\ntanin\nsharpsville\nbrehme\nvengaboys\npazin\nvideographers\ngalal\ncording\ndistichum\nludic\nzilei\nclickers\ndirrty\nairgun\neichenberg\nheaves\nwahiawa\nwyland\nsportscasting\nmelech\nnuwas\ntypesetters\nsusser\nwlwt\nwalenty\nnisra\nlehne\nhawiye\nitinéraire\npachulia\ngabonensis\naeo\nneumeister\nunshaven\nrapin\nkertesz\ngalwegians\nkaramoja\nsantacroce\nmikulov\ncaniggia\ngoodheart\nradiographer\nhutz\nturu\nserenading\nchide\nufd\nmahout\nnickie\nmoku\nminuten\nkwale\nvillus\nradmila\ntonda\ncrossbencher\nabating\npunchers\noger\ncornerhouse\nclimatically\nlitman\nborak\nmegaplex\nfiordiligi\nhruska\ntanabata\nmarcellina\nholdren\nemperatriz\nsoopafly\nbiffen\ncupples\njaniculum\nlillias\narigato\nglimmerglass\ngladrags\ntossup\ndecamps\nprofilers\npariwar\ncantin\nquieting\nbarsi\ncurrying\ndouri\nschuch\nknp\nsallisaw\nexpends\narbeloa\nsomaiya\nidabel\nshinseki\nkoteas\nalderdice\nghoda\nrefinanced\nexpurgated\nzych\nambulant\nborwell\ngruesomely\nprocreative\nunitedhealth\nhuckett\nareco\nbarbourville\nmaita\ncael\ncowlairs\nispahani\nbulldoze\nbuchberger\nbleiberg\ndaubigny\npoliticizing\naicher\nabruzzese\ndabas\nrde\nwordie\nkensi\nrashied\nfanad\nskaff\nbollen\nindivisibility\nsirkeci\ntradescantia\nrachele\nfayyaz\ndoidge\nramallo\nmeningitidis\ncheddi\nlondesborough\ndangerbird\nosier\nmarsico\nuachtaráin\nmuntaner\ntucows\nhinesville\nrusin\nwvga\nhyltin\nhachem\nsolor\nfrightfully\npanicky\nozuna\nhazes\nborowitz\nsunwoo\nkrarup\noverstrand\npisarev\ncspi\nslbms\nsoundman\nislamized\nkamps\ndysplastic\ninsuperable\ncapozzi\najayan\nshibumi\ncatastrophism\nmokri\nmuchas\nurbanity\nioannides\ngme\nflandern\nspringside\nstegosaur\ntoles\ncolletta\nbegrudge\nbaquet\nbadland\nkleinwort\ncbot\nwellen\nhidi\npoom\nbuggs\nhda\nyanukovich\nfibrotic\nmcareavey\nskempton\nplatner\ntwittering\nvasilios\ncheckin\ntamarac\npignon\npotshots\nmulberries\nmakua\nleane\nweininger\nzipline\nrussolo\n
brandishes\nclathrates\nsnubs\nunsupportive\nkalter\nalyeska\ntpk\nherschend\nbelnap\nkurahashi\nsteinert\ntracs\nbaff\nkopec\nnasaa\nsaltergate\npichot\nklc\nhasting\nslaked\nfrailties\nbfr\nkumaritashvili\nzehr\nyeshayahu\nrayos\nsarde\ntoils\nellmann\nbirkebeiner\nparrying\nanae\nmeetinghouses\nponni\nkakai\nniac\nzafarullah\ncommercialise\nprocaccini\nmaxene\nfrodingham\ncondenet\nghgs\nbraman\ndeyan\nmuseeuw\nmadej\nlangbein\nbicicleta\nwunderhorn\nmolé\nadenocarcinomas\naubade\nkiryas\nmaximises\ncacia\nhydroplanes\ncemetary\ncafos\nphorm\nfauntroy\nstraggler\ndjam\nencana\nglocester\nbarchester\nmonumentum\ndewyze\ngonadotropins\nanderston\nvercoe\nrebrov\ndisemboweled\nelrich\nbenxi\nneeman\nreauthorize\ndennings\nwindber\nettlinger\nanila\nkuwari\nsilone\nwilbarger\nnorthpark\nmcalinden\npeeper\nspikey\nsaung\nramshaw\nqazwini\nrossett\ncorica\ntxu\nmychael\ntrussville\nsouper\nretested\njauregui\nunsettle\ntilles\npatkar\nucode\ntuneup\nglisson\nkarabagh\nkendrew\nrumina\ngadara\ngrobe\nparamo\nfames\nguiomar\nepaulets\nencinal\nbandersnatch\ncountrywoman\nokai\ndigic\npiot\nhacky\nabbass\ntuuli\nlimps\nzamperini\naerocar\nbarkatullah\nboones\nprelinger\nduranguense\nhynd\ncowgate\nguelmim\nhopital\nilhéus\nbramshill\nasfaw\nheirens\ncabanillas\nbulatov\nsolter\ndrench\nlignano\niod\nwineland\nspanker\nunsponsored\ncominform\nsoundstages\ntomov\nshoham\nabkhazians\nlanfang\nrickardsson\nnationen\nchikan\nimdad\nmortgaging\ndistler\ncymdeithas\nchivo\nshrugging\nhaidt\nkimmelman\nechocardiogram\ndagfinn\npokljuka\nramanan\nmarginality\nridgley\nmoutiers\nonel\nfixity\nanhalter\nravenscourt\nreddened\nfiorino\ntaimyr\nmielziner\nmoharram\nvxworks\nmolter\nunpatented\nfangled\neltinge\ndriskill\nlulls\nbayamo\ntenner\ngiorgis\nfriston\nvillacorta\ncheang\nsaddington\nschwebel\nseresin\ntpn\nenterprize\nshirvani\npersonalisation\nchrisette\nitraconazole\nstemme\ncruzados\nkeast\ndelk\nwarford\nmonotypes\nliddiard\nfuxin\nliba\nmoonshiner\nferments\nlrl\njahani
\ncorrelational\ncucumis\nsalyer\nstadholder\nacos\nfalzone\nneuropathologist\npacu\nsherritt\nsereni\ngidea\ntoehold\nmanpreet\ngpv\navensis\nrauno\nbassmaster\neuskaltel\nhaslar\nalthouse\nbalmes\nchatroulette\nstough\nchikamatsu\nsamosa\ndeery\nwahba\ndetwiler\nherreweghe\nbenburb\nmudcat\nbereket\nmeachem\nticky\ntoback\nysl\nnissar\nurate\noberwesel\nwidgery\nkex\nwannstedt\nobziler\nsickbed\nfaultline\naberdyfi\namby\nbandari\nmulitple\nkutless\nblackballed\nfoundas\nantacid\ntribespeople\ncandelario\npauillac\nbeyeler\nboersma\ncoalhouse\nyossef\nzakayev\niacopo\nhubbards\nswappable\nweiz\nwhitener\nelderkin\nkasr\nshourie\nviorica\nrewire\nvmd\nsodastream\nblackhill\nsoemthing\nkernersville\nkatas\nsverrisson\nconsorzio\njarhead\nthornfield\nanticlinal\nslik\nklepacki\nmudflow\nappassionato\nkaimur\nanneka\nkunstgewerbeschule\nmultiphoton\nimportante\nalcazaba\naltura\nkyler\ngrackles\nneshat\nkovner\npharaon\nmolex\njli\ncref\nmatmos\ngarum\nghirardi\nchobe\nreald\ndwork\nkampo\nmuren\nalecia\npaiwan\nvirola\nshareable\nmirabelli\nmothballs\nswissôtel\napostolov\nbobolink\nteitur\nelfed\nbehaviourally\nimpeaching\nbastiaan\ndannemora\npsychologies\nrescorla\nbatha\nshahabi\nfinckenstein\nnamc\nzindel\nmeder\ncrk\nshalabi\nmaral\nekran\nnahon\nmountstuart\nafterburners\nsexualised\nenchaîné\narney\nproneness\nantistatic\njaff\nsorrenti\nquantcast\nlasd\nnarimanov\ntidd\nrebid\ncerri\nbuer\nbogyoke\npardoner\nhargest\nkronfeld\nparricide\nlachiusa\nnorthen\nsulfonylureas\nleny\nrhenus\ncleugh\nregality\nsakazaki\nbundesanstalt\nfreia\nfogler\nschluter\nkgosi\nvxr\ngrovel\nwfd\nsuperfan\nsubcutaneously\nwretches\nazcona\nyusoff\nlipoic\nthistledown\nbuxbaum\nthumbprint\noversimplify\nsiregar\nembezzler\ndups\ncsrt\nströmstad\nwhoppers\ncrapper\nscheidegg\navalokiteshvara\nrockey\nazimi\ngrigio\nbertorelli\ncassani\nflattish\nqinghua\nhewish\nzulma\nheusinger\nboozy\ntverskaya\natem\nbrochs\nwarrensville\nlabrecque\ncjm\nutterback\nmirallas\ntipoff\ncombativene
ss\nyangs\nmethanogens\nbahal\nsleepytime\nohtsuka\nsceneries\ndelhaize\nbrunete\ncraigellachie\ncarri\nglovers\nhosoda\nobeidi\nanegada\ngyratory\nteamer\nsjogren\nlaotians\nblonds\njerram\nkjr\nnonoy\ncundall\narribas\nlaska\nhelloworld\nibad\nmalise\nmobberley\nfulgida\nbattlegroups\nmessiness\nkamaluddin\ntusc\ndisavows\nphoolan\ndistributable\nzingiberaceae\nogston\nremold\npastern\nfarha\ndoula\ncommonness\neffenberg\nmanges\nvbi\nesfahani\nsterilised\nqtc\nkello\nfoamed\nvanvitelli\naalsmeer\nichinoseki\nbelcastro\nrailsback\nkurin\nbergmans\nminimalists\nsyllabuses\nshigeta\nmitchels\nliese\nmunis\nmetaplasia\nfesch\nwiltse\nhassanzadeh\nradiotelevision\nhelmsdale\nnorthwesternmost\nauxilium\ngulli\nyiming\nfiercer\nmalgorzata\nslama\nnefa\nwaddon\nbedwetting\nsokolowski\nkichi\nrizhao\ndeserta\nfnd\nmoonves\nconnoting\nbunney\namoung\nprecancerous\nrepp\ndrongos\nneera\ndeno\njuddmonte\ncadzow\nkoregaon\ntiptronic\ncwl\npolman\nmatahari\nshowstoppers\nmehul\nkrispies\ngrainer\nbiomedcentral\ncresco\npropagators\ndosent\nvioxx\nsiskind\nlgn\ndiacetyl\npulped\ncajole\nonlf\ndanford\nweigall\nhudlin\npostdocs\nuncleared\npowerboats\nturbomachinery\nperpich\neccrine\ncoface\nbronchiectasis\nramdin\ngulmarg\nkristyn\noestrogen\nluini\npailin\nkeypads\nsego\nkemar\nlysy\ncaulk\nbbv\njuskalian\nmuamba\nmonzonite\nbiegel\ndracul\nwitbooi\npfund\nambassadeurs\nchelm\nmereworth\ntalcum\nfrivilous\nsoichi\nketapang\ntantawi\nredraft\nschoon\nhaizhu\nantifungals\ncropp\narmine\ntwelves\nnazarenko\nmusicares\nyangtse\nalric\ngelais\nescuelas\nwsmv\ntemperamentally\njosslyn\nmithradates\nfanbases\ndaleville\nmumin\nplasterers\nfacchetti\nelmdale\ngerster\nwhitwick\nswartberg\nvagner\nserlio\nannina\nelron\nturnour\nkente\nnikanor\ncontento\nkleppe\ndishman\nmcgonigle\ncenotes\nwwiii\nredgate\nvignelli\nmuan\nnvq\njiangbei\naschenbach\nmikoshi\nonchocerciasis\nromanced\nstudding\nfishmeal\nsperoni\nkitale\nurla\nberel\njaswal\nkaempfer\ntollways\nterna\nmacrozamia\ngrinli
ng\nfunt\nguardroom\nmeridiani\nsurguja\novertown\nimipramine\npaani\nsyncrude\ndermer\nyoucef\nveracruzana\nruntimes\npasteurella\nriyal\ncoppicing\nweepu\nprica\neastcheap\nsuperstorm\nsois\nkaura\nasara\nkuffour\nsunnat\nzisk\nhappyness\nqrf\nallitt\nruhland\nmcguffin\nmandoline\nbothmer\niparraguirre\nstraitened\ncoordinadora\nwebelos\nhudler\nquaich\ntychowo\nmeadowsweet\nmedicea\ndushyant\nsatchmo\nbreaston\nmimpi\nscorm\nkemboi\nmoosic\ntean\nanosmia\noestrus\ndayman\njarvie\nkhir\nkirra\nreadopted\ncimetidine\nmcshann\ncauchi\nmystification\ntidball\nscioli\nmarquês\nprader\nsaundra\npolesworth\nreligiousness\nmoakler\nsanitaire\nbircher\nfantasist\nnostalgically\nnumbingly\ncire\nfaryl\nmeconium\nsampans\ncorke\nseperating\nfadul\nwaupun\nstringfield\nwegg\nsoulsilver\nminky\nkirkup\nzhongyi\ncruncher\nibma\nenquist\nrengifo\nlizana\ncharring\narreridj\nfacio\nkizlyar\nmohseni\nschoolmistress\nknaben\npulpo\nmoncho\ndevadasis\nganson\nnoya\nosiel\nluers\nsmote\nstroboscopic\ndefter\nekonomi\nlaetrile\nkxas\ninternalised\nwerf\nanokhi\nayanna\nterrington\nmarkieff\nbbw\nbirgunj\nalfi\nqif\nbeaird\nkansans\nburketown\nlemche\ndilton\ndepressor\nwildparkstadion\nsaidov\naltimeters\nhannie\nglassmakers\nsaudek\nrhodia\ndippel\nherff\nyeddyurappa\npego\nkassab\ndunnington\nbritian\ncraftmanship\nmotl\nmenevia\njoeys\ndingbats\nfowleri\nremap\nspiceworld\nwoetzel\nsolovyeva\noldschool\ncisg\nincandescence\ncultist\nrockline\nrajpura\nuksc\nkhalilzad\ngiannakis\nwinging\ndelineations\ngregorc\ngurbanguly\ntautou\noutsmarted\nwonderboy\ndvora\nbackdoors\nkarabekir\nsesimbra\ngonchar\narrivederci\nmoonwalker\ndemonoid\nalfasi\nuggams\nravil\nteals\nnibbler\ntadesse\nronay\nunsurprised\nhsca\nolliff\nizaki\ninterchangable\nascertainment\nfeltz\nikechukwu\nserviceberry\nlibbing\ncroxteth\nlatrell\nslogging\nshapcott\nyaakob\ngérin\naartsen\nmaniacally\nmiramontes\nravenstahl\npluralities\nsundararajan\nknoedler\ninvestissement\nincontestable\nliatris\nfriburgo\nmastur
bates\nappetit\ncannabidiol\ndecc\nballyfermot\nmarkovits\nsherin\nghk\ntatev\nlamotrigine\npitifully\ncentr\nmarvi\nandenes\nmuktananda\nbloodily\nhommel\ncartogram\nqmc\nmarai\nagnarsson\nlululemon\njuergens\naicha\npanhandling\ntoyne\ncherishing\nrooter\nradecki\ncoes\nleandra\nthua\nunifem\ncapetown\nenlightens\nnishat\nsantisteban\nelion\ngarlington\nvorstadt\nroubaud\njavaid\nfistulas\nbootstrapped\nhakko\nsigfried\nhatoum\nmaytals\nmillwork\nstanky\ncollaterals\nscribblenauts\ncapleton\nmiac\nifam\ndlo\ncoppock\nroboticist\nkinked\nyapa\nroofer\njayco\nmazzara\nsluicing\nchidgey\nshenk\nhbcus\nprospers\ncambo\npbn\nmacwilliam\nhuys\nbalt\nkaranga\nfrightfest\nponomaryov\nquetiapine\ngoikoetxea\nwlodzimierz\nlcross\nmcgrigor\nmufid\nramez\ngreenhead\nanac\nbackstreets\ntakimoto\nradiopharmaceuticals\nwarlick\neizo\nfiza\nmanacor\nmolave\nhtut\nvivan\nkitgum\nteguh\nternent\nmailloux\nyearsley\nfilomeno\nunrolling\nmisir\nwarmongering\nbuttonwood\nidles\nparni\nkeolis\nculdrose\norganophosphates\ngaillac\nboyette\nalmondsbury\nmariss\nbatmanglij\nsureste\nbenefield\nnidia\ndeviancy\nzizzo\nsummersville\ngrowlers\nrangjung\ncips\nmoonroof\nmangabeira\nmonopropellant\npushdown\nbrasstown\nandrieu\nvri\nborderlines\nahlu\noira\npendry\nlzb\nhowa\ncockerham\ndefuses\nhorsewoman\nhamworthy\ncoalport\neker\nbulked\nbertos\nreliefweb\nkienholz\nhorlicks\ntalentless\nevinrude\nkeum\nsacconi\nmechanistically\ncarrano\nviktors\nfishlock\nguzik\nschertz\nmullany\nsituationists\nyanjing\ndigitizer\nsqueals\ntayshaun\nergen\nindexical\nekklesia\npenikett\nthinley\nbrownstones\nwarnborough\nsigismondi\nweever\nheol\ninterorbital\nkalemegdan\nfabs\nsnubbing\nestefania\njajouka\nschwall\nhdx\nkway\ncostarricense\nsemiotician\nffw\napba\nstuber\nduncans\ntapton\nwarth\nbriganti\ntampers\nwistow\nhydroxybutyrate\ngallate\nmcmc\nmammograms\nsanzar\nlysenkoism\nmmmmm\ndemidova\ngazans\nmokhtari\nimtech\nbenedictions\ngargrave\nnpas\nwoio\nlegba\nkiai\nxev\nindiscernible\nexter\nlo
dève\nqms\npakse\nculiacan\nentrancing\nchansonniers\nvindaloo\npardy\nneuhauser\nhmu\nguidotti\ncarquest\nshahaf\nspeedsters\nbrousseau\npsychobiology\npoos\nkasongo\nsirri\nunquenchable\nshold\nfabians\nroomba\nheitinga\ngurgling\ngodsey\nlinkous\nratnayake\nantagonised\nstirrer\nlividus\nherridge\nsethna\nwieners\nturnback\nchangan\ncaha\ntrabelsi\ncharlaine\npepinster\nmegastores\nnafziger\nplancha\ndurres\nakobo\npointon\nreplenishes\nparolee\nevisceration\ncommuniques\nhoolahan\ndudman\nstickball\ndeniable\npatricios\nreiley\nliesbeth\nmaritimum\ndemeaned\nbarer\naleida\nskindred\nccel\nlpsn\nabscond\novaltine\nelizabeta\nbarbat\nakhir\nnautor\nbaraki\ndodaf\nchronobiology\nspoto\ndisproportion\nlinocuts\npalmisano\nfishbourne\nanamosa\nkcia\nrickenbach\nshavian\npapoulias\njodoin\nhgv\nanwyl\nslapper\nnimbin\ngoodge\nbrookeborough\nkurfürstendamm\nmercaptan\ngnh\njoconde\nbhb\nmanège\njayesh\nschuon\nofac\npaho\ninterclub\ngoizueta\nbufalino\nomarosa\nsandblasted\nhotbeds\nlisters\nbaryonyx\nbarboursville\npenland\nsousveillance\nhambach\nseabourn\nmostow\npvl\nzabid\nlaham\ngote\nformatter\nburkman\nytd\nnorwid\nagonistes\nchiselled\ngunaratne\nverme\nsnowberry\ngarrotxa\nlovage\ngutnick\nflowcharts\ntelescoped\nyoka\nchillingham\nsoules\nlusail\nzervas\nsaule\nlannion\nviewfinders\ncockscomb\ntrappists\nandray\ntahoua\njabbawockeez\nparras\nsundstrand\ngrungy\ntopsport\nmacnair\nbaccalieri\nfonder\nmarojejy\nhybridizing\nholdaway\nledet\nburness\noccurances\nphenylephrine\nhanania\nrebirths\nsarli\nacording\ncurmudgeonly\npearcey\nmaham\nzeferino\ncarlstadt\ntayport\ncacophonous\nnalco\ndenervation\nneeleman\nlovegame\nscotswood\nringa\nsumon\nlenda\nmarantz\ndevitto\nabani\ndeiniol\nstelzer\ndunas\nyongzhou\npapageorgiou\nbrusca\nslas\nimmunohistochemical\nmidkiff\ncaia\ngwathmey\nweavings\nflatmates\ncammi\nimbibing\nsulfated\nxeroderma\nshanghaied\nswivelling\nadlib\nacceptible\nfinkler\nnamal\nnewtyle\nthumps\nfeit\njeavons\nnewbolt\ngazal\nthirkell\ngr
aef\nextravaganzas\nnortec\nshukor\nhetland\ngreatful\nmopan\ninducers\nngp\ninternews\nbulgur\nphoenixes\ntayfun\nzus\nsouster\norganismic\nbradbery\nmandovi\ncwr\nhiroya\nglencross\nsahota\nsanitizer\nclamdiggers\neido\nbrignone\natiya\nabsolves\ngarbed\nwoolverton\ndeniro\nschoemaker\nypm\nflaying\nvassaras\nmoldoveanu\neikenberry\nschmutz\nfirehouses\nhappisburgh\nsemak\nieva\ndifc\ncholestasis\nshipmaster\ninwa\npcna\nbhatinda\ninglish\nvanoc\ntimesheets\nfratta\nmargetson\nomeprazole\nlevina\npablito\nsawah\nlfe\nmapleson\nliquigas\nphototherapy\nheena\naboubacar\nkoger\nduggie\ngushi\nhtay\nbeckon\njackaroo\nvatuvei\nandrewartha\nbonchurch\nfilibustered\noyj\nlatae\nrivieres\ngodderz\nwahoos\nstigmatize\npsst\nmalamute\nalvina\ncryonic\nmeschede\nobscurities\nwellesbourne\nblueshift\ngarvie\nreveres\ntimberlands\ncopello\nguider\nivanek\nkassebaum\nbolckow\nporcel\nomentum\ninterjects\ncanarians\ninvision\njannus\nbeauchamps\nbraly\nningning\nyampa\nodium\nrosengren\nsubito\ndisaggregated\nglynneath\nservicer\nworldwatch\nhopoate\nbucquoy\nfelsenstein\ntugged\nwrinkling\nvenkatesan\ncarefull\nmerde\nsunsilk\nandrianov\nlaumann\nupwood\nchapelton\nunties\nalz\nchaunac\nhedger\ncspan\nshioya\nkabira\nproyas\ndeepdene\nfirebomb\nhamet\negar\nbroyhill\ntmw\nbarasch\nbulaq\nmutiple\nghafar\nkapolei\niodized\nfrewin\ndehmel\nsnehal\ncourmayeur\nhelma\nzazu\nkarli\nmesmeric\naree\nbichel\nmellado\nleadsom\njuicer\nazzedine\nhyacinths\ndits\ntade\nquechuan\nmacerich\nfingaz\nheadbanging\nreptans\nleutze\nnecochea\nstrokeplay\npallotti\nsarla\nbarnea\ncarbonneau\nichthus\nbusuttil\nmaintainance\nmallo\nmcatee\nvisconde\nadamec\nramadani\ntamin\nseia\ndetent\nhopeman\nsaxa\nblinkhorn\nbement\ndomke\nsiedah\ngwe\nbarnabe\nsudairi\nramakant\nnaysmith\nviesturs\npicadilly\nbacton\nbuglers\nsareen\neimer\nkarter\nnesha\nsukuk\ntavia\nchavarria\nlaurenson\nscansion\nkhoren\npequena\nmonterosso\nlcb\nayna\nunreason\nlosa\nlabuda\nmaroof\nsilao\nstojko\ntrinidadians\ncuaderno
\ndolliver\nnestmates\ndecompressed\nhenryka\nkeithley\nmeaden\nreichen\nlosar\nstromer\nleahey\ncamoranesi\nscreentime\nthonet\nlookers\npumpers\nnorfleet\ncollectorate\nrakhimov\nkunstsammlungen\nccna\ntrinny\nspremberg\nencantado\nsneh\nswardson\nthrid\ndefiling\nvalcourt\ndechen\nmcrobbie\noxycontin\nekatarina\nkolli\ndumaine\nbushong\nyujin\nsharrow\ncrusted\nsurkov\nkeenen\nheadwall\nstrozier\nerawan\nfullstop\nafzaal\nbatboy\nmolesta\nlongmuir\nbogging\nmapledurham\nneutralisation\ncando\nachakzai\nbelugas\ndainius\ncontentiously\npressurizing\ncbcp\nmaquettes\nnosing\nmohamoud\nrussophile\nsoay\ndebrief\nmartires\nkelm\nbuckhaven\nprojeto\nbendik\naminoglycosides\nqinling\nnze\nstreambanks\nwadih\nshambling\nlozier\nroutemasters\nfreeness\nswoosie\navnery\nbastianich\npoperinge\nmarkowski\nsciuto\ncapecchi\nrodenticide\ndeliveryman\numuc\npindari\nglogovac\nfazlullah\npellicano\nreran\najou\nvesteraalen\nhydroxyethyl\ndesolated\naama\nfluorspar\nfalconieri\nwestmark\nmahasaya\ngumm\nnuj\ncolorant\nnejc\nklu\nshorthands\nluvin\nwertenbaker\nmonja\nzaytuna\nxuande\nauxerrois\nmthethwa\nduprey\nsavoye\ndethronement\nmorente\ndantonio\nbhala\noutwits\nscovell\nlawdy\ncaffyn\ntalai\ntamam\ncarabiner\nwurtzel\npeirson\nvilfredo\nminifigure\nmanoeuvering\nlaur\nanelli\nhaislip\nwaterview\npescatore\nmadgwick\nmadri\nafters\ngpmg\nbechard\njordanstown\ndavoud\nsitatunga\nbankrupting\nannamacharya\nshouldering\nvinterberg\nnymphal\nmurase\nhoving\notilia\nmotus\ncalame\nmanteau\ntrumpler\nsmirking\nhemraj\nxingfang\ntasca\npuffballs\nchep\ncwd\nmightn\nfoodies\nmccraw\ndoxey\nzhizhi\nwilbon\ndawu\nbedbug\nadcom\nwicking\nxiaoxia\nbootie\nmultiregional\nbozak\nmagilla\njianying\nhuallaga\nobcs\nunshakeable\npalad\nbrychan\nmeissonier\ntille\nbrayford\nxrs\noktibbeha\ncaparo\nstalder\ncurcumin\nbarillas\nsolimões\nthisday\nmoas\nappointive\nunzip\nmarathe\ngremillion\nhumping\ncowbirds\ntouristy\nmariyam\nesrange\naiw\nbne\nswados\nkaret\nefflorescence\novine\nswainsbo
ro\nsalzer\nsendler\neynesbury\ncompatibly\nrumbled\ncubital\nspeciosum\necke\ndabbed\nstrube\nkalugin\nfitzmorris\nfranzi\nbonpensiero\ntinubu\ntahta\nfornix\nbrosch\nsweatman\nengblom\nkatju\npoff\ntems\nputtenham\nnatoma\nnativists\nkhoe\nstecker\nbrymer\nsorsa\nclassicizing\nknifed\nmassart\nboonstra\nheatedly\nballydoyle\nsulfa\npinhal\nbongard\nmarouane\nrevolta\nstoically\nfassel\nricardos\napocalypto\npharmacovigilance\ninextricable\nserdyukov\nglyptotek\nwookie\nmicrographs\nniceness\nmenchú\nishim\nbashaw\nprigorodny\ndkr\nhalsman\ncominco\noverprinting\ndorel\nclairette\nachterberg\npetrossian\nresurrections\nkabuli\npalpably\nbrockhurst\nsawmilling\nkettner\nekdahl\nmozgov\ntrustable\nsadiki\nuhhh\nkowa\nsleevenotes\nmcnelly\nbaisha\nshidao\nparacha\narmlock\nthurl\nymcas\npriestland\nkarlen\nnarendran\njetways\nstratas\nchami\nyermolov\nmedicalization\nherland\ngusle\nsoutherton\ndimensioning\nthundercat\nwoulda\nmcree\nnield\nelvia\nwattens\nruysdael\nmacfarlan\nharrachov\nconsanguineous\nkopelman\nrautins\nsobhy\novertopped\nfhc\ntillmann\ncbso\nkazin\nslowdowns\ninsha\nmatuidi\nkhaitan\ndeconstructs\nouseburn\npilaris\nshopian\nsaltier\njoba\nibla\ngrammes\nlydbrook\nmaoz\nvouet\niptc\ndhari\nquinteto\niliadis\nrentier\nberkely\nprofessionalisation\nluy\nlorri\ncastaldo\nhalcrow\nconservatorship\ncelerity\ntrashigang\nspeciesism\nzeiger\ndelly\nvolcanologists\nridenhour\nnunney\ndehumanized\ngongora\nblisworth\nblumlein\nmeserve\nneckarsulm\nikoyi\navtandil\npanfish\ngida\nsmets\nvkontakte\nrykiel\ntriano\nauken\nzombieland\nschurr\niyaz\napplesauce\nteepees\ncumings\nengelke\ncaerlaverock\nnewtongrange\nviagens\nwode\ndownloaders\nbullingdon\nwentzville\nkaleo\nalderwood\nbenzion\njaffery\ntetons\nvoegele\nfauzia\ncolligan\ngaranti\ndumdum\nchukyo\novermatched\nwissmann\nxaml\nediz\nratdog\npocantico\nknauss\nshameem\nmandalorian\ngrabowsky\nharsanyi\nkaulitz\nhochtief\nemini\ntalita\ndietician\nbotanicals\nyanick\nheathkit\nsemenya\nyatim\nphysalus
\nyelle\naéroports\ndorabella\nprotium\noffiah\ngenesys\nfootages\nncss\nmcclurkin\nbosso\nfould\ntiryns\nilgauskas\npangani\naraiza\ntransracial\ndeathblow\nleering\nrheinberg\nhemlocks\ntouchbacks\nevangelisti\nhebi\nsangria\nbraggins\nlizzani\nlucite\nmingles\nvendel\necsa\nlabianca\nworsnop\nwthr\ndalys\nbeetz\nlaymon\nnapoleons\nwollman\nhinchingbrooke\nvillaflor\nkessenich\nwainer\nsedlmayr\ngruffalo\nbabylone\nsabarkantha\nchote\nmantler\nmisi\nyadavs\nmaladjusted\ndeoxyribonucleic\narko\nkiteboarding\ncellucci\nwaterboy\nsaeedi\nalemayehu\nidec\nprionailurus\nklopfer\ncornerman\ncauterize\nequiv\nspacial\nresor\ncordgrass\ngrimbergen\nselecter\nknd\nhazim\nbotella\nhaunches\nroyaux\npharmacodynamics\ndrik\nbenaissa\njerman\nnibbling\nenteropathy\nextrude\nhuzzah\nsaijo\nmarlyn\noverdevelopment\npfannenstiel\nsemir\naonghas\nmomper\nstauber\ncordeaux\nolejniczak\nliphook\nshadowplay\neichenlaub\nunenforced\nkhadir\ndarran\nfehn\nahmeti\nhirshberg\nganderbal\npentz\novertaxed\nkoco\nfeldkamp\nlabra\nlewthwaite\nfrr\nsatirise\nbapat\nswink\ntilework\nidolaters\nwaterjet\nbijoux\nmudford\nsubotnick\nghostlight\nmccombe\naspies\nglaspell\ndyncorp\nmannings\nneomycin\nanothers\nonita\nlup\ndwd\nparisiennes\ncopulas\npaskal\nedlington\nbeetroots\njoscelyn\nnarsimha\ngartrell\nmoley\nukase\nguerrouj\ndilophosaurus\nhaykal\nautochrome\nmcx\nminaya\nkrama\nsubban\nrwf\ntappahannock\nbiomimicry\nbachs\nwaks\nmahina\nbolz\nmarzotto\nprosky\ntourettes\nzox\nimageworks\npolari\nboitel\nserenaded\npusch\nisreal\ncevat\nnoordin\ndotan\nteds\nwelches\nlundstrom\nmeadowvale\ntokenization\nlubis\nnanay\nvilliger\nagecroft\npachacuti\nmaiman\nlucus\nkozel\nsimtek\nlense\nbrodkorb\nleukemic\nnephrectomy\ndanubius\ngangaram\noutrider\nmisquotation\nmoyale\nnewmarch\ncocha\nposeur\ncityplace\ncaersws\ncharn\ndistills\nnonie\nishizaka\nexige\nexpunging\nudny\npauvres\nbickmore\nkpu\nrosenfels\nrumbek\nhoula\naetate\nfredricks\nmonos\ncyn\nbalkin\nweisgerber\nblindsight\npierpaolo\n
alling\ncatherall\nstargates\nhaavisto\ntuason\ntiss\ntoothpastes\nbaabda\npanizzi\nlegnani\nkingham\nsemillon\namphoras\nrituparno\nwreford\nadelanto\nchopstick\nharrisonville\nforedeck\nhousden\ncosin\ngurmeet\ntelegraphers\nsepah\ndesaturated\nheslington\nfye\nloquacious\nsolara\nintouch\nshims\nmozdok\ntragödie\nportobelo\nburglarized\nshahadat\nbudgerigars\nfestoons\nnetty\nsabita\nmitm\nchuter\nholsey\nempiric\nkhary\nprepositioning\nshabran\npriore\nhilltowns\naspyr\nprahalad\nxrd\nreville\nsamb\nlabled\noversensitive\nberdyansk\nassociati\nkawal\ncyma\nkozinski\nloera\nsecretase\nmonroes\ndietze\nyafo\nkaisi\nnochi\nsardinha\nethinyl\ncribbing\nmarcinko\nsackhoff\nlarochelle\nreconquering\nsubliminally\noldtimers\neshelman\nearland\negocentrism\nmoscatel\nsondergaard\nlieth\nrish\nfreemont\nspeechwriters\ncomfortdelgro\nbertozzi\nyukihiko\noctoroon\nhypocalcemia\nfesten\nsrn\ntupaia\ndahlke\nweitere\npfos\nfoggiest\npremie\ndavorin\nsigurdur\nschunk\nsmolinski\nalysia\nabiel\nberlaymont\nantiemetic\nmccoo\npentheus\nheitmann\nwaud\nanastasiou\nhorndean\nkozyrev\ngorecki\nblasingame\ncifra\nsubhi\naffeldt\nfoll\ncockshutt\nglaus\nbrandauer\nmdiv\nhematomas\njovis\ndres\nnjoku\nsuperdog\ncyclodextrin\nretype\nkowroski\nhedd\nantúnez\nstupider\nwarland\nrudaki\naslanov\nlopilato\ndunnes\nceasar\nnandalal\npinetown\nmadin\nscarrow\nsaegusa\nmcgonigal\ntefft\nimplementable\nverged\ndecorous\nponselle\nmilewski\nkwo\nzielonka\ngüiza\nlmo\nwolesi\nvogon\nsonus\ngenovesi\npetherick\ndeconcini\ncorktown\nvibram\nsanai\nwhitegate\nenzersdorf\nnonlethal\nprovisos\ngner\nlauterbur\nmowlam\narticals\nystradgynlais\nambergate\nsdxc\ndashain\nchunga\norduña\nberrabah\ntaylan\nproteges\nlaich\ndiaoyu\nlamanna\natomically\ntames\nvenir\nbouna\nryer\nedilberto\nhlf\nlakelands\nquino\nkolingba\nostracon\ndisavowal\nnanpa\nmaberly\ntassajara\ncostumers\nkci\nﬁve\nberninger\nterregles\nandoh\nfedoruk\ntrivialized\nlockton\ndisgustingly\nsantosham\ndombasle\nultraconservative\nkr
yuchkov\nparanaguá\nflanner\nfecund\nsawston\nmerlene\ndocumentable\nloquitur\ncaliendo\nashtown\nfewell\nnahdlatul\nletterboxed\nkinswoman\ntylo\nrecchia\narocha\nnonwoven\ncybulski\nballykelly\nklossner\nerler\npoblete\ncrenelated\nforesman\nujan\nvilvens\nruether\nazb\nunloads\nmajoor\ndishonoured\nlouganis\nspectrally\njenga\nmacneal\nwattisham\nperfringens\npulwama\nparto\nukrainy\nchinchillas\ngarik\nmarenghi\ndethick\njuticalpa\ntlx\nfeeler\nrads\npungency\npeces\nhabibur\nrepower\nminsi\nalvino\nleccia\nhassayampa\ncapucine\nlifar\ndgb\ntemplon\nmemmo\nensuite\nincharge\nnostell\ndurcan\nkerryn\nmurfin\nlabradors\nboggart\nbioweapons\nmecom\npetta\nkilty\naliança\nstorrington\nulsterbus\nbishopville\nneurologically\ncacau\nhyperlipidemia\ngjokaj\nbabrak\nwateraid\nalpher\nnanayakkara\ntorvald\nlbk\nmcgeachie\npupillage\nrosmer\nbraco\npicoult\ndoaba\ndragline\nhizbollah\nherzogenaurach\nvalognes\nhondas\npoing\nsgurr\nmickel\nteacups\nbarkston\nsweetgrass\nnaadu\ngbk\nwagle\nchermayeff\npolonez\ndpms\nlordy\ndorthe\nhershfield\naidc\nmoonlights\nsteinsaltz\nwisk\nqalqilya\nsighed\ncoomes\nndwandwe\nnyac\nzilkha\nyellowed\nmiera\nlewknor\ncounterforce\nmineko\natomique\nhickories\nacpa\ngoujon\nedmilson\nbooga\npustular\npallenberg\nneedwood\nweybourne\nsoejima\nrésumés\ngoldington\nmplm\nspayed\nuncertainly\ncytological\nbastardized\nslutskaya\nsherine\nintegrationist\netops\ngirlband\nrussow\nwangpo\ndemeans\nrothfuss\ngwasanaeth\nfashawn\nlongjumeau\nledovskikh\necheverri\ninstore\nntf\ncdre\nwebshots\noenology\nsentani\nphotorealist\nplebiscito\ninsieme\nvlaar\nsinges\nshafran\nspurrell\napichatpong\nantidiuretic\nhirscher\npirner\neynon\nfoulk\ncalluses\nencumbrance\nprobated\nanthonis\nbange\nschaerer\neffacement\ngothams\nduopolies\nvitz\nairbrushing\ncoscarelli\nlegatee\nkriz\nvrelo\ndesisted\ncensorious\nbrowbeat\nsmelser\njanss\nmiche\ngisenyi\ncholecystectomy\nhiskey\nparijs\nmekel\nkustodiev\nsilverline\nhardeep\nlobban\nhribar\nclapboarded\nmonav
ie\npeper\ncobban\neisenberger\nbeledweyne\nmancroft\ncockeyed\npuddin\nmaged\nmoocher\ngiugliano\nwindpower\njaago\nashvin\nthurloe\nlarcom\npmw\nlekman\ninterahamwe\nnobi\nscalpels\nclothilde\ntallec\nmistreats\nvawter\ntrudge\ndoot\nsye\nhepp\ndufftown\ngauvin\nexistentially\nekpe\nmangler\nbanken\nmyalgia\neinojuhani\npingyao\ntukur\nsteinbock\ncerc\noxhey\nigraine\nminimizer\nelapses\nwinegrowers\nlebec\nalchemilla\nsulpicio\ngolzar\nyashica\nwhodini\nschrempf\nhamadeh\nforcier\ncrocetta\ncattelan\ntheissen\npoliteia\nkrastev\nesfandiari\nsteeve\ngilbride\ngokey\ninexpressible\nblader\nsparrowhawks\ndominants\nempanadas\ndibernardo\nmelchoir\nwisecracks\nisel\nyawar\nbeaglehole\nmescal\nsveen\nrealitatea\ncaverna\ncynara\ndeglaciation\nenbw\nblackland\nmcdivitt\nbaldin\nunicornis\npolycom\nstenographers\nbuttram\nmonotreme\ninverlochy\nlemaster\nroure\nballand\nbandas\npolicymaker\nemote\ncastiglia\ncommitte\ngaar\norzechowski\nmallender\ngreenspring\nwestlund\nnangle\ntounkara\nwaxers\nupnor\nganoderma\nhyperalgesia\nappr\ngalvis\nvaizey\nborenstein\nmerchandiser\nzinser\ndurness\nsansoni\nglico\nparducci\nyeshurun\nmerta\ntransglutaminase\ncoulston\nstepien\nbeasties\nmeningioma\nnektarios\nlalji\ntimmermann\nmandylor\ngofer\nsysml\nreha\nluthuli\nibru\nouterspace\nbellion\ntarkin\nfletton\npreordered\ndeworming\ndewy\npropre\nfeyzabad\njatta\nslumbers\npiñatas\nceto\njero\ncharcuterie\ndougray\nkimathi\nrossmann\nmudguards\nbareboat\nnolf\ncatrine\ncentrelink\nschouler\ngreenlees\ngueule\nyonemura\nbilious\nrankled\nredway\nduelists\nclaribel\nwinspear\nmichnik\nmoonman\nhollo\nnamaz\nsinnett\nholms\ndogville\nrastas\nandrian\nthawra\nditte\nroffe\nbangerter\nalbam\nsubmunitions\nadamczyk\nfiv\nruthe\nhiland\nlowlanders\nrutting\nwalczak\npasang\nsatar\nslosh\nhybridise\nsalimi\nkepco\nsalemi\nchamlong\nkhunti\nchenu\nblandly\ngummidge\nrsb\nwolframalpha\nranikhet\ninvalidly\noxygène\nrombach\nshipston\ntillmans\nmoglia\nmagliana\nberigan\nfootrot\nkonstanti
nidis\nanabela\nmetrica\ngomal\ndesnos\nfien\nconowingo\nmondy\ncolonizes\nmontebelluna\nhitmakers\nchevreul\nunadvertised\ndholes\nkalon\nsparklers\nschrager\nkorem\ncombattants\ndenouncement\nnauseated\nkvirikashvili\nendean\nacceso\ntickler\ncelik\nljp\nkulgam\nhypocenter\nhtwe\nmegyn\nosg\nmccarten\nyaphet\npossil\npositronium\nnagasu\nataque\ndongxing\ngreathead\nrajo\nglenlivet\nsctp\nfascias\ngirardet\nlbgt\ninfluxes\nlinky\nkeyshawn\nansty\nmattathias\nmarginals\nutrera\ndiodati\nmicrophthalmia\nburping\ncruelest\ncbk\ncoaker\ndehiwala\nextractable\nplurinational\nnewco\ngyeongbokgung\nabdullaev\nsast\ncelador\ncarrico\nkeelan\nketley\nfeghouli\ncolori\nliska\nkingery\nripstein\nxiaoxiang\ncadaval\nrackstraw\nlongmoor\nunderglaze\nkvass\nbrulle\npolitcal\ncornershop\nlatinist\nrockfalls\nweisner\nvsg\nyukichi\ncarriles\nhmhs\nalzate\nsiop\npirc\npersuasiveness\nrandee\nbergues\nmatcha\ndemagoguery\nholts\nflapped\nkemmel\nmoortown\nspendlove\ncybersex\nextensors\ntryed\ngravettian\npinkertons\nossuaries\nweinmann\ncarls\nfrisson\nkhateeb\nsulton\nsconces\nirthlingborough\ncarterville\nimperially\nasgari\nflatters\nassata\npecoraro\nblindfolds\naapa\nsamovar\nprunty\nguimaraes\ncollaboratory\nsapone\nyerger\nnanostructure\nplouay\nnault\nherp\nshinda\nconversationalist\ncitoyens\nokrent\nklenner\ntwisleton\nireen\ngametap\nbilbray\nbarh\njannah\nalbasini\nthumbed\nappassionata\nborle\nrushforth\nbende\nguzzo\ncatchfly\nfestoon\ncuon\ntsukishima\nhovde\nmureaux\nyabucoa\nzagar\ntienes\nbanik\nterlingua\noxidise\nwhoriskey\nnagbe\nkollo\naday\ncochiti\nhomeplace\ntaufa\ntjrc\nforthbank\nsternbach\nkallies\nlouka\nilva\nstrongheart\ndanijela\nscrapie\nkanehara\ndreidel\nmacalpin\npentastar\npeker\nichabods\npavon\nwadsley\ndaikin\nbrüggemann\nratchets\nunsparing\naramac\nhaarde\noce\nmisprision\ncalcitriol\nmegaron\ndraughtsmanship\nspitak\ntftp\nbassac\nstadtschloss\nzetter\niben\nassumpta\nfld\ngga\npapyrology\neustachy\nredundent\nsalpeter\ndadeland\ncheder\n
leid\nwilmut\nseagraves\nosteria\nbrunssum\nglorietta\nbace\nbemrose\nhouseboy\nbruhl\ngastronomique\npraful\njudgepedia\nbuzan\nrochereau\nlempicka\nmanoeuver\nartiodactyls\ntirzah\nnoujaim\nninotchka\nvaile\ncarmena\ntinkham\nlandru\ncattani\ndeutschlandfunk\nreekie\nsonnino\nfingerhut\nhookes\nanyanwu\ndahal\nbergs\nkailahun\nqvale\nvhd\narthington\nideologists\nitochu\nravished\nbenirschke\nrovin\nqsr\ngosta\nponomariov\nwalesa\nvillazón\nledi\nprotectiveness\nlucozade\ndefrosting\nweeny\nradway\nruediger\nramchand\nfazel\nkalaf\npaletta\nelysa\nledra\ncayne\nmarketa\nclimat\ngrowled\nharward\nfresnillo\nprecipitations\nqabbani\nmaraba\nexurban\nthiazide\nnachtmusik\nbrana\nvoisey\nburrowed\nbiassed\nteetotaller\ncléber\nfaggin\nloxahatchee\nmammen\nmcgaughey\ncegb\nmooting\nunderclassman\noleds\nrbg\nmazzoli\nunfurl\nokhla\njpr\nkorin\nchidiac\nmindbenders\nmascheroni\nmaceachern\nrangana\nbocchi\nprivata\nkikuko\nrehmann\ninglenook\nbongiovi\nwaterlily\nrunet\ncauston\nshuk\npolybutadiene\nwhirls\nkondylis\nbinjai\ndalu\nbuscaglia\noratories\nepler\nexpressively\nvictuals\nwaples\nfarkle\nzenji\nvcm\nbentine\nagronomique\nanzor\nstandon\nsomerby\ntriazole\nmersereau\nretarder\nsalote\ncajoled\noceanport\ngamil\nwitsel\nbigpond\npotently\nalternans\nvocalisation\naibo\ntetracyclines\nsnettisham\nbarcia\nstoppa\npunj\nsindelar\ndigitalised\nbontempi\nmisgav\npaaske\nshapwick\nbelturbet\ncrotchet\nchulanont\ncedrik\n¨\nleafing\nyezidi\nahlgren\nmousy\nbusinesswire\nrhianna\ntrussardi\ncorticosterone\nclime\naudace\naskins\nkarelin\nmargareth\nkaikai\ntimewasting\ncamouflaging\ntorrejon\ndevendorf\narpita\ncameleon\nyeasayer\nharrisson\nrichmal\ndondero\ntoyooka\nheadend\noppong\nibr\nkewpie\nberkson\nfocaccia\namadori\ntues\nbkk\nflossmoor\nkombo\nracino\nneas\nmargules\nsouthville\nunderstating\ngroped\nfourvière\nsprit\nkupwara\ndoleman\nflagman\nscahill\nwtnh\nhellweg\nlymphoproliferative\nboublil\nrovshan\nrossier\ndassler\npzpn\nscuzz\nmarikana\nsath\nvaknin
\nfarmall\ngrainville\nteddie\nwashtub\nsolennelle\nmanggarai\nyazaki\nszyk\nuniques\nawale\nassortative\nhscs\nasghari\nnadarajah\nincurables\nbuckey\ndewes\nlechwe\nstarkman\nrefosco\nweida\nyunan\nklier\nslags\nalkatiri\nfrocks\noliynyk\nmisjudgement\njaman\npetropolis\nmessaggero\nalatorre\ndunmurry\ngerace\nobliterans\nsackcloth\nspradlin\nerlendur\nroszak\nglares\nwolfensohn\nskirving\noddone\nnolans\nstollen\nregularised\npomigliano\nzaven\nkramden\nwights\ncongolaise\nbartoszewski\npellington\ndesalinated\nbrockhampton\nkhairi\ndobrynin\ndredgers\naicardi\nspanners\ngwc\ntrackdown\narlit\npetrolina\npaita\nkhalifeh\npachachi\ncharalambous\nbattell\nhaole\nbaddie\nshirane\nthereunder\nmandora\nticklish\nenfeebled\nsteinweg\nntds\nstaroffice\nleow\nimmolations\nvelopark\nfumosa\nnoncommittal\nsciarrino\nmaíz\nhibakusha\nrebeka\nextruding\nstirrers\nfetlock\nistiklal\newenny\ncircumcisions\nwhome\nyeas\nhornstein\namerongen\nunilateralism\ncountrywomen\nwassef\nkanth\nstra\naoraki\nlifecycles\ntrickey\ntamago\nsameh\nkellet\nyibo\nattractant\ngitega\nkemptown\npree\ntoom\nprejudicing\nworldvision\nfutch\nfolmar\nmosser\nharahan\nbardonecchia\nchrysanthos\ngiganotosaurus\nugetsu\ncalcineurin\ncasseroles\ncatchiness\nselick\nboniek\nasilomar\neynsford\nlho\nvarlam\nzeisler\nwolfswinkel\nkarpal\nlauderhill\nnepad\ngauger\nshailaja\ngarnham\nchandipur\nharborside\nenglebert\nwojtowicz\nlanson\nraylene\ndenio\nvldl\nsonique\nkeko\nwmar\nthaicom\nbusi\nkovar\nmetacognitive\nteeters\ngoldreich\nrinella\nwahed\nsensorial\nfankhauser\nsheens\nhernanes\nlochnagar\noculist\nturnblad\nshutterstock\narmande\nmadrassah\ndemarcates\nmagal\nacellular\nniedermeyer\ncompendiums\ncanda\nkint\nhukam\nwenning\ngreasers\nbřezina\nmahathera\nbibendum\nwittmer\nigda\nbutchie\nrimu\nvartanian\nbigs\nincompetency\nleeser\nsaxmundham\nbomarzo\nschulhoff\nmarschner\nfurled\nentrées\nshadia\nscabiosa\nelburz\nrhabdomyosarcoma\ndurer\nimagem\nhypochondria\nmushfiqur\nstuddert\nfoula\nheilbr
unn\njmg\nsoslan\nwooller\njayantha\nhöcker\ngendre\nanhanguera\nfle\nbalde\nsilkstone\nqtr\ndrue\ndisowning\nayt\ncroissants\nkariuki\nmekhi\nlyngdoh\nmousey\nsurridge\nkopecks\noleum\ncastanet\ndurnan\ntejedor\naripiprazole\nfrideswide\nvattimo\nkelleys\npaperweight\ndurwood\ntoshikazu\nmahdavikia\nkirkgate\ngasca\njlb\nhickstead\naggresive\nmccance\npracha\nphilles\nfgd\nnclc\nglobals\nproffesional\nmartucci\nrazorfish\njokester\nabdelmalek\nostium\nmartinican\nhuggel\noutrunning\nperote\nhabte\nfatness\njoga\nnamecheck\nkadeem\npesquera\nerrett\nupl\nhaselden\nkooten\nyouku\ndepredation\nchanin\nswangard\ncric\nthorncliffe\njameh\ntartare\ncasita\nebina\nskelmorlie\nmcgarr\necmo\ngranicus\nmucked\nsocrate\ntechrepublic\nzidovudine\npleasley\nmanawa\ngummow\npitfour\nbirlik\nbenincasa\nrexach\nbertoli\nadeliza\npaycock\nrepro\naradan\nmalhi\nshiren\nglanton\nbuysse\nbashley\ngamliel\narja\niue\nirex\ncopake\natget\nforis\ncarlstrom\nzerkalo\nborgetti\nnechells\nsevin\nscrimgeour\nbiocides\nsinkiang\ndiplomatist\nnanofibers\ncarbineers\neirini\nhorsemeat\njunebug\noad\nmotorization\nilwu\nfranchini\nteac\nsobey\nzhigang\ndavidovsky\nserenely\nkaminey\ncicilline\nosteopath\nkashdan\nunscholarly\nmiyakojima\nyanda\nwassoulou\npedigreed\nsubscale\navoriaz\nanwer\nhinode\nsurco\ncorroding\nfranzini\naéropostale\nwelts\ncholon\nraun\nrawl\nzopiclone\ninjera\nuremia\nkhilafah\nelectroencephalogram\nyulon\ncontentless\naluminized\nfoxfield\ndjebbour\noutokumpu\nbutai\nfør\nlowney\nlinate\nvudu\ntimbrell\ndery\ntypographically\nmunfordville\nherbstreit\nichat\nkazbek\nappendiculata\ncastelluccio\ndinur\nanki\ndomu\nratte\npennsbury\nlittbarski\ncomiso\nclowne\ntelemarketer\ntiferet\nprocurements\ngoelz\ngirded\nsnowcat\nalphabeat\nsulby\nmacgrath\ncaunter\nhandbill\nsunline\nandreeva\npilfering\nreynald\nchenard\nwarg\ndiffernet\ngunbarrel\nbowlin\nbiswa\nhailee\niscb\nsandrock\nfincen\nabscam\npsia\njairzinho\nvff\njist\ndemetre\nketsana\nwobegon\nketosis\nfustian\nishar
a\nmaiga\nshindler\ndaine\npomponio\nkurochkin\ncollonges\nbruyere\ncilt\neiu\nwilsonian\nnescafé\nfirdous\nbhuiyan\ncolbys\nehn\nhaskel\ndemoniac\nzuccarelli\npearlstein\nmauls\nabita\nprimitively\nyasuoka\nnewsfeed\ngréco\nrachlin\npoids\nbudda\nkataria\neht\ndistention\nblustering\nmandarino\nnfca\nssrc\nsoame\nseales\ncentenaire\ntrattoria\nnetheravon\npontificating\nendroit\ngaubert\nmalapropism\nglassfish\ngillom\nmacrina\nnorthpoint\ntirkey\nvéra\nnicam\nmatricide\nsoberly\npicosecond\ntrentini\nmeridien\nuncodified\nkoronis\naktiv\nbizerta\nsolovyev\nligaya\ncarcosa\nshoplifters\nsimoes\nbearish\nstephin\navra\nsolsona\nsalicornia\nkildee\nlushly\ndosimeters\nserse\nluristan\nchiat\nmihashi\nshiyi\nvexation\nkoffman\nfactly\nquynh\nulbrich\nollantaytambo\nphilippou\nvoth\njinlong\naforethought\ntamping\nbalsdon\nairmanship\nkamper\nduavata\ngregersen\nugl\ninova\ntrkb\nheckert\ndiuresis\nperronet\npadron\nhevel\nssat\nufton\nbootylicious\nbadara\nbehoove\narmwrestling\nmarmaray\nerrorless\nbugbee\nwallhead\nfindus\nflatboat\nbapuji\natomium\ntuimavave\nmccleery\nsolemnities\nmahfuz\namax\nachour\nmonashee\nmechnikov\ncanonbury\nlucic\nastutely\nsumptuously\nnassir\nprepon\nmawla\nwilmarth\nhomogeneously\nleckhampton\nkomati\nwanyama\nhallgren\nantirrhinum\nmichaelides\nsportpaleis\nmoviemaker\nbockscar\npiddle\nhumourless\notari\nbandoneón\nduddingston\nihnen\nhollers\namezcua\nzaoui\nvaguest\nreshuffles\nwhitemore\npanamera\ndki\negner\nmechlin\nlauryl\nrevering\nifilm\nelmaleh\nclasica\nbibbo\njop\nfreshener\nacetabular\nlents\nvoici\nmanchego\nkuniko\nbraggadocio\ncastlerea\nbechtolsheim\ntatsuhiko\narktika\nmaharaji\nnevena\nrajagopalan\nbylsma\ngeostrategic\nornery\ncolab\ntruculent\nkeyloggers\nbingyu\ndobrescu\nrubinek\nwikki\nplayfirst\nsilman\ngoudhurst\nkurtwood\njifu\nmousseau\ntiede\nbulganin\ncrummey\nsocioeconomically\nredwine\ninrockuptibles\nstammheim\nbelene\npbj\nschmo\nrampurhat\ntornquist\nczechoslovaks\nghirardelli\nspondon\nmonacelli\ne
iichiro\nnumantia\nbruhns\npattabhi\nhaycraft\naleatoric\nspangles\nsmarting\nzhelev\nfuzion\nakosombo\ndraheim\nbeachley\ncastricum\nsrei\nlovelier\npeerzada\nmanyika\nmiddlewood\nencroaches\npetrochina\nmiyatake\nminsheng\nsaliers\nyeesh\nschachner\nsteerforth\nbebbington\npastoring\nacquits\neyelets\nruchi\nmacassar\npierluisi\nmccowen\natsu\nmicrofiber\nscrollable\ngolmud\nfolliculitis\ntaroudant\nmotter\nhond\ntorra\nrosia\ntreg\noverlanders\nnayer\nhallard\ntorkel\nildar\nprateek\nserialize\nnalbandyan\nstandells\nmckeith\nabrogating\npawlett\nsumar\nmram\nmilbury\nmunificence\npetegem\nenzi\nviewport\nwriggling\nwidescale\npaster\npursell\ndottore\nfobs\nderrickson\nmeraj\ncultivations\nbaxters\nfischl\ntablespoons\nspaciousness\ngeeson\nshapero\npousada\npennywhistle\nmccanns\nkossmann\nansen\nhydrometer\nneeru\nhurworth\nwno\ncknw\ngerety\nphotogrammetric\ngreenie\nwrinkly\nberretti\nprostrations\nvincy\nwaldwick\nflans\nmamy\npinkel\nsunglass\nbarno\nbleda\ncholeric\nmpofu\nspoerri\neidlitz\npuffleg\nworkforces\niberostar\npoliziano\nbjarte\nachen\nkirpan\nmachpelah\ndorrie\nlockey\ninterweb\nbidzina\nseena\nneuenschwander\nrevi\nlacetti\nidil\npagosa\npreconception\nshahrokh\ndollop\nbudgett\nwackenhut\nstealin\nploeg\nprefiguring\nfanie\ntomjanovich\ncherniss\ngalit\nthunnus\ncancelation\nblantant\nhady\nsystemwide\njakks\nzhenhua\nharriton\naset\nbystrov\nmatterson\nbromont\nkochs\ncapstick\nrazaleigh\nschranz\nkuzmina\ngioeli\nminigolf\ngrasshoff\ndisaffiliate\nemh\nsonneborn\ntrichophyton\nrijo\nsevgi\namitri\netcc\nmonigo\nclayden\nmemphremagog\ngarri\ndeora\nborana\nmustachioed\nlearie\ndoura\nmanholes\nelcock\namalfitano\nenvoi\neppie\nshoreview\njpc\nnalepa\nallestree\nimpro\nlilywhites\nmourant\ncurs\ntroglodyte\nadamas\nkonstanze\ngasman\ntrickles\nfrancescoli\nintercessory\nvredenburg\nlangrishe\nexplicitness\nlurcher\njinzhong\nohle\ntropa\nareias\nrenk\nftz\nmcgugan\napley\nvlaeminck\nforslund\nohkawa\ngustavia\nhironaka\njocularly\nextravaga
nces\nwhipps\ncheesesteak\nnpdes\ntarvin\nplourde\nsagnier\npoulsbo\ncasma\nnevilles\nlawang\nrabel\nvistavision\nvaden\nthie\notha\nkandersteg\nunconfined\nreas\nmelanism\nfrancigena\ndissimulation\nkpis\ntransco\nbevilaqua\ncanaday\nanwarul\ndadlani\nmixco\nkryukov\norbetello\noophorectomy\npoltergeists\nenlight\nswinfen\nmuirkirk\nswathed\ngaltieri\nmedlin\npft\ndilara\njarawa\nbeic\nfurloughs\nrossie\ngundi\nodwalla\nvieru\nacanthamoeba\norsolya\nloosest\nsarine\nrorquals\ncrellin\nwestens\npiège\nfoucauld\nstamey\nantiparasitic\ndîner\nvmt\nasters\nbuffoonery\nbronislava\nalexandersson\nsolario\ncovetous\nadmiringly\ngernreich\naldonza\npinniped\nangélil\ngiddins\ntreehugger\nmalon\nsalvatores\ncombattimento\nmcloone\nprpic\ncudicini\nambrosi\ndabbing\npappalardo\npetroni\nliù\nbihan\nwillaston\nhapper\nperonne\ncarlinville\nbers\nrootham\noverbite\nnovaes\nceqa\nkorbut\nvandenbroucke\nsatins\ndisi\nradiobiology\ntakaka\nzhongyu\nseaga\nreshef\nsarkies\nthakker\nmandra\ncoverdell\nraavan\nrolles\nbenzer\nkipsang\nbhupendra\nmptp\nbrookner\nchristenings\nitto\ndirectionless\nbarbadians\nkibby\nconservancies\npercenter\nnagqu\npantnagar\ndurables\neustaquio\nrayners\nmanikin\nshw\nsalvific\ntme\nmarwijk\nstewartry\nkellyanne\nrupununi\nbernera\namyas\ngantner\nhelander\nwolfgramm\ncasu\ntheri\nendearingly\npennoyer\nschlee\njigga\nnonu\ndæmon\nmarchione\nwyggeston\ncastlewood\nkeesing\nbrisley\nweinfeld\ncasitas\nperforator\nwarmonger\nshowdowns\nflippo\npolyrhythm\nmidrand\nsheperd\ngrainne\nnattrass\nkittlitz\nobis\nquondam\nwilkinsons\nhallas\nzorica\nnovikova\ntarter\nflandreau\npotsie\nkelk\nkodar\ndawan\nbeghe\nexene\nincentivized\nsatinder\ntoppenish\nfoodways\nbeatin\ntrendle\ncarice\ngamekeepers\nvitalij\nhusin\nminorsky\ndejection\nkpbs\nlezak\nratnakar\nsahabi\nilda\nmullarkey\nwombs\nkosgei\nocklawaha\nbreezed\nschismatics\ngateposts\narcobaleno\nalmendra\nrippingtons\nnordenstam\nzions\ndoppio\nflorette\njdp\nrhynie\nneverwhere\ncullerton\nretraces\n
hanny\ntournant\nafis\ndandin\ncamomile\ngammell\ncarencro\ntriclosan\nenslaves\ncoel\nvlast\nspamhaus\nschwartzel\ncongealed\naquired\nkirchherr\nbarrs\nmatheran\nkurmanbek\nfifeshire\nmadingley\nberquist\ncarino\nmeszaros\nbrouillard\noverestimates\nmicrosdhc\nsohi\nmatcher\ncazeneuve\nlifer\nsamual\ndongs\nvitello\nolczyk\nshahir\nwingnuts\nbludgeons\nzachery\nunrelentingly\nprenatally\nasscher\nspeedometers\nmongia\npagode\npranked\nmgn\ntremeloes\nlatt\nasilah\nhias\nshimoni\nnajee\nskywarrior\nkwch\ntrata\nbattened\nthaxted\nsejdiu\ndelko\nvietminh\nfeifei\naamar\ntroilo\nmatalon\npersil\nkawanabe\ndapo\nckx\nmishal\ncccs\nchamberland\nundecidability\ncolescott\nshilla\nchangshan\noxidiser\nkitchell\nopper\nqurbani\nbeeck\npalaver\nligety\npapayas\nnerz\nstewardson\nreeser\nilin\nfijo\nmalverne\nbouldin\naaahh\niwu\nweisbrod\nredlynch\nvolkspartei\nkalima\nfreelander\nalveston\nduloxetine\nanguiano\nvrdoljak\nsegismundo\ninsurgence\nschwendinger\nlivistona\nmanicurist\nkidde\nasociados\nedkins\nlajeunesse\nwbir\nsuvari\nkeiki\ndflp\nbatard\nkhadem\nhallstrom\nkrishnapur\nkahlon\ntupamaros\nunderrepresentation\nsopher\nfreelon\nilleana\nmenocal\nthewrap\nticehurst\ncathkin\nfregosi\ncarloway\nkadee\nfagerbakke\namberjack\ncgu\nadriaanse\nramstad\nannable\nbogoliubov\nsportbike\nwakkanai\nwizo\nmahanama\naidy\nermington\nwennberg\nkafirs\nlinslade\ncarrozza\nfreighted\npussyfoot\nfloridas\nupscaled\nchytrid\ntyrel\nstutters\nfeldheim\nvisoki\nnektar\nbulava\nmombassa\nchoson\nanticlines\nvertonghen\nliebeslieder\nleocadia\ncampan\nugu\nyawing\nblackwelder\npeston\nmisspent\nfring\nansys\ncharlbury\nsteingrímur\nroaf\nghandour\nmistimed\nplatformed\ncrimewave\nrodenstock\nayoob\npiene\nroodt\nsawiris\nhammerson\nodn\nladles\nstenning\nmaska\nneustift\nvanport\nraef\netx\ncarnock\ntxn\nsoare\nfargeau\nbratu\ntrepassey\nhypomania\nwjfk\nwinford\nfreni\nlithos\nfronta\nfreston\ninsets\nshannons\nciolek\ngpd\nchiddy\norofacial\ntorosaurus\neag\nanthracnose\ngreenshir
ts\npirmin\nchene\ngwillim\ndispossess\nlavik\nmouette\nricheson\nzaugg\nacces\ndraftsmanship\nkajagoogoo\ncolwick\nkleopatra\ntvweek\nrgr\nnulty\nhuckerby\napol\nloui\nbifurcating\nsuperordinate\nmultivolume\njumu\nfumed\nwarhorses\nlcv\ngeopolymer\npolunin\nrodenberg\npurtell\nglsen\npakman\ninfinitude\nlactis\nyeamans\nbelser\nlandownership\nhomewrecker\nlme\nwestdeutsche\nundistorted\npyrethrum\nconsuegra\nbaburam\npadovan\nmesserli\nwhatman\nmaughold\npawlik\nsubscales\nrubinoff\nnyrup\nsissinghurst\nsoloveichik\nbowerbank\ntlm\nphotokina\nfaan\nkaap\nnese\nchinar\ntkr\npollens\nuncw\nkhordad\nhaleh\njalapenos\nlarivière\nmanageress\nsittang\nichthys\nabramowicz\nstamatopoulos\npasdaran\nvölklingen\nbraamfontein\nbrierfield\nknop\npastured\nparasitologists\ntangye\nmireia\nmilow\nphilomath\nmalkiel\nmisoprostol\nwaterbeach\nheckerling\nfossen\ndehp\nvsr\nshirey\noutpolled\npounamu\noverconsumption\nriegler\nstatkraft\nllanymynech\npllc\nnoles\nakerlof\nzenn\nprairieville\nsanchita\ngrumbles\ncollingridge\nhindbrain\nchutneys\nseyi\ngpn\nstanage\nmolder\neasdale\njancker\nimpassive\nferrick\nsidedness\nadms\npaymasters\njanz\ncryosphere\nfeza\nwebbie\nmcgeough\npetrotrin\nswg\nkleinhans\nstutzman\nrathcoole\ntaue\nbohrer\ndomaines\nwhiskeys\nsachie\nwylye\nkoes\nansted\nxiomara\nchenzhou\nfurong\nrestall\neriskay\npapain\nboul\nacurate\nbraf\nhautala\nrabiah\nharridge\nhillbrow\nlaffy\nmcanulty\nmeldahl\nkiehl\nrubery\ncaunes\nagression\ndimpled\nketi\nmorenci\nconstrictive\nvalorization\nshekinah\neeny\ntrinucleotide\ncongar\nnavnirman\noccassion\ngapping\nsonko\nkilvert\ncanzonetta\nkwacha\nmbombela\ncamelon\naider\ndurão\ngestel\nlevack\nthinkprogress\ninocentes\nshopko\nusgp\nholzinger\nbettega\nfrickin\npregabalin\njavadov\nburbs\nrotenone\naltendorf\nmitsouko\nlanyards\nmodan\nsotos\nmasp\nkooy\nfamished\nmoluccensis\nrosand\npostino\nbleeped\ntorro\nkhedekar\nunmanageably\ncavallino\nklusener\nlyd\njunaluska\ndusek\nrogosin\ncarmakers\nchng\nrottweilers\n
suckled\nbioweapon\nhoodwink\nurijah\nchisora\nuib\nwolfer\nboyarin\nsentech\ncryptanalysts\nbludger\ntagua\ngesink\ngambrell\nohsas\nklooster\nconneely\nmittweida\nmediapro\ndawber\nbuntingford\ntabou\ntatura\nrhinestones\nmicaceous\nfleckenstein\nessor\nhunwick\nsaidy\nfarney\nmoisei\nartscape\notey\nprehistorian\nmagleby\nzusi\ncrepsley\ndnes\nworsthorne\nrockband\nactuelle\nprefigures\njungr\nnonpolitical\ntiter\nbraak\nbigalow\nerhart\nmanagable\npalach\nnampo\nnegocios\njianping\nunmapped\nsheikhdom\ncruttenden\nstreetlife\nhosp\nsudre\nsobran\nredpoint\nhomestand\nnonbinding\nquadriplegia\nfirat\nponor\nbokar\nromeoville\nmellan\ncalix\nfetlar\nodorants\nberlins\nkeralites\norol\ncrocket\ndories\ntongling\ngalibier\nfmqb\nbasauri\ncutlets\nlanc\nslawomir\noyun\ngeliebte\nmahnke\naljubarrota\nstitcher\nretallack\nazinger\ncacace\nnalder\nhonourary\nmasirah\nsangkum\ndumais\nputto\npersad\nhanby\nblairs\nblickling\nhaymond\nbobrowski\nserey\nfreesia\nferodo\nbourses\nscoles\ntrubshaw\ncristopher\nsketty\nkeralite\nwaukon\nhekmat\nciment\npepoli\njentsch\npictoral\ncommendably\nbreathlessness\nscogin\nlittauer\ntheuns\nadamle\ndamaraland\nturbidites\nemmel\nnilan\nneurol\nnre\ngrimthorpe\nfirebombs\ngetulio\ndurbridge\ndego\nthuong\nbasicaly\ntamiko\nballinamallard\nanabelle\nimput\nkostadin\nbutty\ndinter\nfener\nforegut\nfinocchiaro\nribicoff\nmurphysboro\namritanandamayi\ncoincidently\nbenfleet\nadhikar\nbackgrounder\ntaining\nsebo\nnikica\nblankly\nglobalize\nhillcoat\nstockbroking\nhomophobes\nzna\nchalks\nclarey\nimputing\nabberton\nveazey\ngoatfish\npropithecus\ninsultingly\nacms\ncampomanes\npiatkus\nderrell\ndespaigne\nukaea\nbargy\nlewi\nwankers\nadamovich\npreprogrammed\nmanabat\nrbn\npavin\nlings\nflummoxed\njarocho\nbikur\nfuglesang\nerwinia\nyouself\nsyu\nchanoine\nwolken\nhuayi\nacadien\nbilingually\ncivitanova\nilyumzhinov\nhikikomori\nbelda\ncrispness\nsupercarrier\nwauconda\nlytchett\ntzedakah\nvijender\nhumbler\nalbertino\nilliniwek\nlomita\ne
xaggeratedly\nzillo\njowzjan\nlache\nwunderland\ngeelani\nmajkowski\ntransair\nanaesthesiology\nignorable\ndhss\nschuessler\nunambitious\nikar\ntooby\ntineretului\nbeltways\ncarmichaels\nleura\nparadine\nquercifolia\nclarithromycin\nsaddlebags\nhvad\noisans\npearn\nnakadai\nrafales\neccentrically\nwehrle\nmislabeling\narkie\ninconspicuously\nrakia\nmajeski\ntummel\nmonotheist\nmedardo\ndfat\nodenton\nyouve\nditmas\nforstner\nkition\ngamefish\nsundaes\nbandolier\nlettings\ndutroux\ndextrous\nsace\nbissinger\nbritos\nnetcong\ninappropiate\ndoillon\nblotto\nmellat\nmoghuls\ngadot\ndingolfing\nnatapei\nabstinent\nholtzbrinck\ndraggin\nrozenberg\nhomefield\npurex\nmetacarpals\nsummerhayes\nhyaluronan\namericain\nphilippson\nnayla\ncoplan\ntvd\neastertide\nmontée\ncambs\nliyanage\neastwind\nkarstadt\nlafite\nspectroscope\ncumulated\nmisclassified\nklippel\nkotv\nwhinchat\nwallstreet\noney\nmyotonia\nheadly\nmudder\nmatviyenko\natay\nfrankenhausen\npointblank\ndebasing\nmimura\ntaurino\ninversus\nzadek\nsigsbee\nguoqiang\nsnowiest\nlandslips\nceftriaxone\nokoli\nacquitting\npreceeded\nmultiplane\nundisputedly\ncontibutions\nsitz\ntracon\nfrankley\nmadhusudhan\nfolkerts\nmanhoef\nplusnet\nxetv\nregner\nreceptionists\nlewenivanua\nchamaeleo\ncocom\ndiamorphine\nblakeway\neasterby\nkinburn\nespousal\ncompressus\ncasula\nsext\nmarocain\njumpman\nmischaracterizations\nparisyan\nschedler\nechegaray\nhonker\nimpulsion\nbaudo\nimmutability\nsoudani\nkomura\ntussocks\nboodle\npprune\nshaddad\ncoppiced\nrokita\ndully\nberget\nharroun\npenfolds\nnishani\nweighbridge\nmancino\nhedworth\nbateleur\nstoessel\nchoosy\nlathers\nisoflavone\nmiscues\nshige\nguirado\nbuffoons\nmanasse\nnisl\nchuzhou\nilma\ntraina\netd\ndzbb\nrandomize\ntpms\nbullivant\nfotsis\nsportfishing\necono\nsterkfontein\nmoncks\ngroomer\nknie\nbifrost\ncitaro\nflightline\nacgme\ngillings\ndyads\ndisentangled\nehp\nmendips\nenyart\nlamoni\nredistributive\nelfyn\npioglitazone\nhagner\nministrations\npreteens\nveilleux\nr
oadmaps\nbioprocess\ntamal\nplantlife\naravis\nshimoyama\nurbanowicz\nfrederikke\nvillejuif\nviscardi\ncij\nbernières\nkokh\nlorente\nbourgeoise\ndruten\ntuifly\nkdb\naccreditor\narmoire\nlupul\ngabrielson\nquarterman\nwitchy\nthornwood\nkarolyi\ncelandine\ncampany\nlehar\nstenz\nkeizersgracht\ntomohiko\ndantewada\njohnst\nkomiya\nchimei\nkoganei\nnazli\ndockland\nsuperdelegate\neyp\nentrepot\nslumming\nlount\nkosuth\nurushiol\nnurettin\nkilim\nfronsac\ndizdar\nleers\ntaï\nzumaya\nparkchester\nclewiston\nchangshu\nchallenor\ngrognard\nmallows\ndrucilla\nclamorous\nlaminating\nguadagno\nwolong\njlt\ngardi\nrmk\neuropcar\nzaharoff\nterawatt\nwashingtonians\nschwegler\njanno\nbeqiri\nfaouzi\ngarfagnana\nsnagglepuss\nhøeg\ngameplan\nsawrey\nnonconformism\nnookie\natak\nbilino\noverarm\nsnapp\npassarelli\neumetsat\nhayyan\nranbaxy\nintracerebral\nadvocare\nfrt\nsmv\nraymi\nbutley\nsieff\nferres\narthas\nfastow\nbarnier\nwenninger\npratik\ncherven\nlobar\npitsea\ntrasimeno\nchumakov\nxueqin\nwillans\nprotic\namitav\ncarsen\ntyrannicide\navista\nkrishnamacharya\nfranzoni\ngumienny\nuntraditional\nshotput\naccesible\njeram\nuniroyal\nheidsieck\nchona\npassito\nyasuhito\nvincenza\ncotman\nbliley\nvassiliki\nnumer\ndockweiler\nesas\nmutambara\nintercessor\negleston\nwools\nkatsunori\ndalbeattie\ngangrenous\nfairyhouse\nalrosa\nsteinkopf\ncorralled\nmzee\nogwumike\nnasturtium\nkls\nbml\nmakri\nzaillian\npsyched\nnawar\nstepin\nwabe\ncads\ntrichotillomania\nbilyaletdinov\nlincolnwood\nderuyter\ntypewriting\nnaruhito\nwolmar\nkassis\nbradesco\ninvernizzi\npodrinje\nroloson\nranmore\nchypre\ntemi\ngenya\nsilverlink\nshambolic\nbastow\nozren\nvouches\nappo\nnvi\nsligh\nsatyanand\nlosch\nlutwidge\npenick\ngelert\nslavo\nmetalheads\nkozintsev\nyandell\nbalka\ndelingpole\nwtxf\ncowhand\nchampagnes\ngalip\nspottswood\nreverberate\ncoalmines\nsappington\nscrutinising\ncagey\ncontemporânea\nnortheasternmost\nlimahl\nholyland\nbefalls\ntraficant\ncrveni\nunrewarding\ntavy\nottoline\ncomp
ostable\ndoumergue\nkorgis\ngtmo\nmonstre\nplectranthus\nmourvèdre\ntharwa\ngebhart\nbotz\nzaritsky\nbartholemew\nsassa\narons\nweblogic\nsachets\normrod\nreverberating\nsiyum\ndistractor\nsavidan\nroughton\nclamored\naaah\nkravets\nbrigman\nambers\nistaf\nkeeshan\nlinichuk\ntommorow\nlankenau\nsekimoto\npadilha\nmncs\nreseach\nwojnarowicz\nthicknesse\ncarone\npalmquist\nzellerbach\nyanjun\nfme\nepia\ncyo\nslussen\ncurtice\npressurisation\nleysin\nmandeb\nmarzouk\nconvo\nebolavirus\nnimmons\nmidyear\nhaemolytic\nphoblacht\nbastable\nkipketer\nminni\ngushes\nmanoharan\nlonzo\nsongcraft\nbirchenough\nparran\nfritjof\ncolosio\ndieck\nannotator\ntrendsetters\ngrubbing\ndiffa\nabramovitz\nfki\nbojinov\nammer\nwakeful\nmarles\ntrompette\nrefurnished\nsirenians\ndelpech\nauria\nnevzat\ncarland\njaufre\nmanchild\nbasov\nmckey\nfunland\nseleznev\nhivemind\nmoshing\neiki\nfröbe\ncrorepati\nsalik\nwhirly\ntascam\nbleomycin\nbauble\nbaying\nfinstad\nacrophobia\nmandana\nyesha\nhieronymous\nzoraida\nsupervolcano\nrazzies\ndelran\njover\nperfluorinated\ncabri\nroediger\ncipd\nreciters\ncappiello\nawwa\nintussusception\npribram\ncerveris\nfitchett\negginton\ntandragee\nricken\nmarkelov\nfarabee\nfaerber\nxisco\ngesine\nimmel\nefremov\npetiot\npól\nmabley\nvornado\nbaystate\nhandbells\ntingler\nserapio\nrande\nispa\nolza\nvisiongain\nbowering\nmassow\nantiepileptic\nremonstrated\npmos\nphentermine\ngracchi\nhulley\nappam\nwaay\nstippling\naccs\nhcmc\nshomrim\ngroenewegen\ndeveloppement\ncgg\ndacapo\nkrusen\ntwt\nmikhailova\npostbox\nibstock\nmuntean\nlaybourne\nmrtt\nschwein\nqinglong\ncoagulant\nbesame\nrecodo\nwaban\napar\nabbemuseum\nnyota\nyoriko\nxplore\nreep\ndattilo\nbanyuls\nnikiforos\nspermaceti\ncravo\nboffa\nunsheathed\nrkm\nfreind\ngoddamned\nkering\nmakgadikgadi\nchertok\nsisti\nguite\nsandblast\ncapris\nalimi\nabshire\nreedham\ncolicchio\nchkheidze\nlyth\nlocascio\nkeansburg\nwynwood\nburditt\nballetmaster\nhuckster\nmuntu\nconfit\nbukittinggi\nsportal\nemancipator\nn
innis\nmcelligott\ncbrne\noyak\nidrive\npazo\nchaand\njowhar\nmameli\ncrouches\nakra\nenzensberger\nghyll\nbermel\nmusiri\nixe\nruhleben\nvilanch\ngregynog\nderian\nbeshara\nreveley\njordie\nnanoelectronics\nthorman\nbossert\npinback\nhiruma\nsucky\nimpost\nblunting\nbathysphere\ncreased\nstandridge\nhoopers\nnerine\nallason\nzinni\ngreenshields\njihadis\ncheckouts\nmirante\nbaekeland\nnetsky\nschongauer\nphuntsok\nclamouring\npotapov\narkanoid\nharangued\namericom\nhorween\nkakha\nneall\ninoculations\nmeece\ndenisof\nwoodchips\ncyriac\nhundal\ndharmesh\nforetaste\ncookout\nchanh\nmccrone\nbundesländer\ndromgoole\nhunnam\ndormael\nsonika\nkayden\nviernheim\nmadhok\nwarbrick\nregrown\nnovis\noslin\nsupercollider\nrudrapur\nmayson\nnewswires\nplasmons\nsnakeroot\nparisa\noswaldtwistle\ncalve\ncomt\ncostelloe\nleimert\ndahmen\nabim\ncowsills\nashima\nneang\nligertwood\nstandart\nkubicki\nmonpa\nmekki\nousley\ntulia\nellicottville\nkilpin\nkirovski\nkoosman\nrecapitulates\newin\ntheatregoers\nvalentiner\nfima\nyeagley\nbavand\nharrel\nrepaints\nashari\nexpresscard\npracticability\noctagons\nsyndicating\nmafraq\nspired\nhanak\nmarani\ndathan\nneitzel\ncamulodunum\nschoof\ninducts\nsimitis\nresited\ngaudier\ndumbadze\nforbear\nkhasavyurt\nheasley\nlutherville\nifab\nfrenette\ndemby\nlubich\nntk\nturksat\nfolley\nweekenders\ndisses\nrupnik\nperov\nextradimensional\nfayoum\nkindai\nskogen\npolymorphous\nayuda\nwingecarribee\nlocatable\nabramsky\nbesigye\nmethley\njockstrap\nassefa\nblazek\nmetrological\nenplanement\nmaisky\noser\nwurzels\nhengoed\nnocenti\nwyles\nmisuari\nmcinnerny\nfenske\ncaiazzo\ncockiness\nmadiha\nspycatcher\nponiatowska\nprognostication\nyorkdale\nkavadi\nemulsified\ncokes\nschulenberg\nrecoba\nsöderlund\nfortinbras\nroadable\nsreenath\nebihara\nhaunter\nlandfalling\nstrahovski\nburdell\nartemyev\ngutknecht\nperata\naudiotapes\nfleshlight\nbriest\nreyner\nbrigden\nlorge\nbotija\nqol\noktar\nspoliation\ngjertsen\ntronc\nmouthfeel\nciega\nmagnifications\
nkratts\ntoeing\ndenr\nlmn\nallagash\ndogtooth\ndisfigure\ncellmates\neissa\ndiljit\nombo\ndedeaux\nprarthana\ndanker\nmidianites\nnutjob\nburba\npelinka\nglk\nsinglehanded\nwaddingham\nderisory\ndeodorants\ninnit\narmthorpe\ngautrain\nreformasi\nrossetto\ncrennel\naurland\nblowed\ndettmer\nspeights\nbuicks\ntranh\nwittrock\nfrappier\nvitzthum\naltizer\nkmel\njoblo\nmagat\nrozema\nmottes\ntigana\nvillalta\nmbari\nthumped\ntoshinori\nhelling\nbaviera\ntelevison\ndrosselmeyer\npforzheimer\ntavel\npragyan\nkudrin\nexpeditor\nshuten\ntopline\nbrijesh\nyarwood\nunimaginably\nairtours\nxolotl\nkerik\namali\njannik\npoolbeg\nuplifts\nobong\nunmoving\nrnav\njde\nwtsp\nandorian\nlathbury\narmonico\nsated\ngaggenau\nroumanian\nehmke\nbigeard\nscrovegni\nheterodontosaurus\nsaria\nglencorse\nwideawake\ntolpuddle\nflesher\nbinges\nstickel\nartech\nstefanova\nmickleham\nbaulk\ncovets\ncapriccioso\nconnectome\nmalivai\njaros\nmetoyer\nwulsin\nsucces\nunreceptive\nschonfeld\nkapela\noceanview\nequifax\nchalking\npyg\ndamnable\nazaad\ncpx\nlongan\ngravette\ngarzelli\ndeddington\ncddb\ntakahama\nkeurig\nblakesley\nstilgoe\nscrawl\nwhillans\nholdsclaw\nconsigning\norit\nhomel\nangmering\ngoulds\ndelauro\nlugubrious\nunfenced\nhurwicz\npipal\ncondescendingly\nbedd\nextinguishment\nekblad\ntipster\nsterjovski\ntadjoura\nyennenga\ncortney\nkarppinen\nchimu\nkorla\nvyrnwy\nrumley\npimped\nornamentations\nprotasov\nelectrica\njennys\nshafie\ndownscaled\nmelsheimer\ncostacurta\nteignbridge\nintech\ngbowee\nlawnmowers\nmillones\nbreeam\nelián\nsentir\nfarsighted\ndesanto\npartied\nnorthcentral\nsaucerful\nrestalrig\nmaisonettes\nethnographies\nxuereb\ncentris\nlacie\nazpeitia\nnvh\nggp\ngornik\ntapout\ndahlmann\nleatrice\nbovines\nsiria\nhartel\nmmol\ngilsig\nhaviv\nations\npramana\nuihlein\ndecongestant\nbobst\nquinolones\nhillgrove\nreeth\nfiammetta\npassato\nscats\ntabuchi\nriesman\nkeates\nhaiden\nracisme\nescom\ndragos\nunbalancing\nniemiec\nmunns\nteshima\nharboe\nmyla\nboeung\ndefinit
on\nrecouping\ncoatis\nherrell\nredlasso\ntobaccos\nclelland\nbottisham\nmamat\nnicholai\ntorralba\nsolomona\npropitiation\naudiogram\nadkisson\nrochers\nglees\neile\ngarthwaite\nmartinair\nthygesen\nexperiencia\nteleri\nbisschop\nromila\ntheobalds\nmodernly\nsnv\nrakish\nbevacizumab\nkuchin\nagot\nvinoo\naaha\nhunny\nfactbox\noll\nwenda\nsesar\nfahnestock\nabagnale\nhodgen\ntowline\nnonchalance\nmarashi\nsilom\nrbw\nopet\nglg\nkondrashin\nyounan\nkroft\nmbaqanga\nlinage\nwetherbee\ntribally\nnavaratnam\nedwardson\ncraine\nkoolau\nratatat\njrf\nsuperyacht\naasa\ndemichelis\nvinalopó\ncolver\nkernot\nkornman\nokechukwu\nnontechnical\npindling\nsesc\nelephantiasis\nlongbowmen\nbrzeski\nomma\nkump\nadrenocorticotropic\npedantically\nwestham\ncontactors\nmetronomy\nwheaten\nsesterces\ncomtois\nweakish\nforbears\ntransferral\nwarkentin\nsubcontinental\nhideyo\nwfsb\nmachrihanish\ngeras\nshinwa\nsindoor\ncojo\nkpho\nlevit\nkeebler\nlasserre\nlanni\nrioter\nalary\nmusco\nchiffons\nsakhr\notlet\ngatch\nmpho\nkomplex\nharaam\nsylphides\nschaeffler\nadea\nmitzvahs\ntranscode\nuponor\ngunaratna\nhanashi\nplumper\ncaban\nconvalesce\ncochère\nrupali\nhawton\nsenese\nsosnovsky\npenydarren\nzlatni\nubach\nbarabas\npobjoy\ndrinkard\nansip\nmarven\nstifel\nbaroud\neion\nboatlift\nskoal\nplaca\ninescapably\nbheag\nsabalan\nwakering\nyuqing\nmanero\nbami\nprefuse\nnewz\nmckercher\ntuscania\nromped\nasier\nobituarist\ndoorjamb\nkirst\nrehabbing\neyman\npemphigoid\novulate\nfrontcourt\nborwein\nlemerre\nppps\ntollhouse\nancón\ncabaña\nsharipova\nkamae\njamesburg\nsledgehammers\nmonoblock\nbluest\nakyab\nitera\nmandiyú\npenteado\nbradner\nqadr\nbanaba\nsachlichkeit\nreorganizes\nfellah\nmcdine\nstouts\nninawa\nprizemoney\nmarschallin\ntrainmen\ncharef\ncordyceps\nnewick\naahs\nlodgement\nzewail\nraisons\njhumpa\nautio\nmanganiello\nkwangju\ndevorah\nthinners\nwpro\nfrizzy\nphotodetector\necstasies\nbiogen\nrabbitbrush\nmuhyiddin\npapastathopoulos\nrosko\ndistin\nstemless\npulsatilla\nkos
ton\nshanked\nairbourne\ntranby\npastores\ntransversus\ntarcher\nzacher\naena\nchessmaster\neleftheriou\namade\nvivitar\ncabourg\nehlert\nsalemme\nfinkelman\nzaranj\nwinblad\nbagnolet\nkickhams\nfelicitate\ngebbie\ndensification\nscrivens\npowick\nunquoted\nunmoderated\nhairlike\nshuba\ndiena\nracicot\njonjo\neatonton\ndcis\nowasso\npozzato\ntabe\nvietti\nduni\nsaling\nninan\nzehi\nstarbreeze\nwallula\nsheck\nbrogdon\nfioretti\nsheckler\npiontek\nmcgarrity\nzetterling\nleevi\nguertin\nnfi\nykk\nmilberg\npoisk\nmiren\ngeluk\ntureen\nseita\nstretchable\nsmcs\nfuzzball\nterraplane\ndfh\nmaketh\nsimulink\nvasoactive\nlevitas\npatz\neastwell\nbronagh\nmcgurn\nfinalisation\ngarnaut\nnemorino\nrestocked\nbrigata\nelkind\narnal\npozos\nmachair\nmaheu\nratzel\nwfor\nacquiesces\ncívica\nstuermer\nthameside\npaata\nalioune\nyel\ncsps\ngiveth\nwildbad\nprotools\nzuberi\nlisch\nnaveh\nlegitimizes\nnuckolls\ngenerosa\naastha\nkerlin\nsilkair\nfoxbat\nyanagida\nshoba\nschofields\nmcmeekin\ntefl\ncardarelli\nkingsman\nroutier\nhubler\nsonoko\nbattrick\nludin\nmezz\nrappoport\nbalerno\nwainman\nasja\nboan\nauma\nfitzgibbons\nnamjoo\nscatterbrained\nnainggolan\nretzlaff\nshuvo\nyafan\napapa\nchauve\nvernes\nreversi\nblinkered\nomanis\ndescriber\nspooling\nledin\ndépôts\nmprs\nmatey\nwalkability\nmercs\ncapek\nlugia\nbriny\nbunetta\npicante\nborini\npleasantness\nwhickham\nprofessore\nsardana\nbuzzin\ngrunter\nzeinab\nlohas\nreheating\naalam\nzagging\nmetasedimentary\nlyden\nyunjin\nsupercapacitor\nfeedlot\nmontone\nshahida\nvolontè\nledebur\nedgerrin\nivans\nelvington\nepaper\ndemartini\nwynona\newig\nunflagging\nlagunitas\npalmero\npelsall\nanswerer\ndairyland\npaterfamilias\nelgort\nwhitall\npropertied\ndinakaran\ntiya\ngullwing\nmohonk\nuncivilised\nfaas\nbroin\nwinokur\nozai\ncoffelt\nproficiently\nunalienable\nbraziers\nbashers\nbloodworm\nmistrustful\nraizo\nsanzio\nsabur\nlockean\nbitam\nrottman\nwassmann\nhusaini\nheavyset\ncentrino\natalay\nnaxalbari\nverdade\nskakel\noberst
ar\narakelyan\nriseborough\nexultation\nbreillat\nhulce\nsnog\nhutchesons\naglow\nstanes\nkeckley\nsbir\nmasterstroke\ndonnel\npenk\ntaiho\ncatting\nlivanos\nimmersions\ngenov\ntomme\ngymnasien\nabizaid\ntulalip\nsobriquets\ngallstone\nswoosh\nantoun\nspagnuolo\nfindel\nemeriti\nbierko\nflunk\nileus\nimpossibilities\nmcmurtrie\nwiretapped\nwiniarski\nhorlogerie\nchataway\nlesvos\ntubed\naspens\nreappoint\nevt\nmerisi\nbantering\nvtech\narregui\nmonell\nunimportance\nerixon\nmyopathies\nshahla\nathans\npolhemus\npollokshields\nmalakai\ntammam\ncoagulated\nkrehbiel\nescap\nemomali\nminoxidil\neww\npersa\ndisorganisation\nfriedreich\nisaaq\nederle\nkoldo\nhermel\ngoodfriend\nbwyd\nemancipating\ncollon\ncholecystokinin\nsandbagging\nrices\ndynamique\nduffryn\nidon\nleonides\nbecouse\nwny\nxiaoqing\nmanza\nzucca\nsakiko\ndammann\ndanzi\ncastellini\ndhal\nravitz\npietre\nmojito\nsayyida\nkillorglin\nunderplayed\nooga\nyouporn\njeromy\ndongting\nispat\njull\nigbos\niittala\ndillards\ngallega\nguler\nmcqueens\nwcbo\ninsomuch\nflaxseed\nwibf\nmeanwood\narūnas\nagglomerate\nabsurdism\ngoodwrench\ncalaf\nasds\ntrichomonas\nolhão\nmudassar\ndemurrer\nvisitante\nwindsors\nmushin\ninapt\ncaminiti\nkaysville\nmathe\npasqualino\nkiowas\nchavira\nprawo\nkarapiro\ncanzoneri\nphosphorous\nkemin\nhellzapoppin\nclarett\ncadden\ncullercoats\npeggie\nmarkopoulos\nengagingly\nchicanes\nskytrax\nloewenberg\nshacklock\nlampshade\nbecontree\njimy\nindpendent\nwhereon\nvizcaino\ndramani\nfigments\ncontrariwise\nvalidators\nskylit\nazoff\ntularemia\nlamorna\nwust\nantonova\nhewa\ncoagulate\nbrantwood\nbellmon\ncasciano\nhomebuilder\nfrançaix\nbrauns\ngartland\nradtke\nmtech\ntriiodothyronine\nalbiol\nhaloes\nhaicheng\nmyrin\naffric\nniecy\ncarneal\nnilda\nnoffke\nkanza\nluganville\nrattail\ncolautti\nsiemon\njehlum\npeals\nnyce\nhogi\nwelke\nhammerless\nhartridge\nmollified\nbutanediol\nlaterna\nthoas\npastebin\nintercessions\nhasanuddin\nsealings\nfilmstrip\ncaudwell\nguenevere\nhostettler\nca
vanna\nkerron\noutshot\ndrewery\nsmid\nranker\neelco\nplaywrighting\nthian\ntepa\nnaturalize\nboffo\nwansford\nsteadier\ncristie\nsnacking\ngiacomini\ndulaimi\nnccpg\nreininger\ndelfonics\ndiament\ndoust\nshipbreaking\nmariama\nscheiber\ngaydamak\npalmo\ncfrp\nbaginda\ngranberg\nepoxies\nacmc\nfairhead\ngribben\ngittleman\ncompartmented\noverreliance\nairt\nalworth\nkopple\ncrossdressing\nlemann\nschoeck\nksdk\ntransglobal\nlarkfield\ncarrig\nfemurs\nfarshad\nkabinett\nbabbacombe\nmanics\nsasai\nemulations\nsubspecialties\nrobuchon\nprizzi\nnordhagen\nxenomorph\nsparseness\nlastuvka\nderny\npictorialism\norsett\nsalaspils\nchubais\nalane\nrainfed\nalgarrobo\ngutu\nhornacek\ngangi\nkondapalli\notehr\nspeedman\nsaor\nmountfield\ncatalino\nfatai\nbidu\nböcker\nheidemarie\nugi\nfuckers\nhalyburton\nmilquetoast\nalpilles\ngaleton\neastburn\npukka\nimmigrations\ncloacae\ngeminata\nupyd\nsteinunn\ncoffield\npaschi\nthoughtlessly\nverina\nsofaer\npoststructuralist\ndefund\nsproston\nslithering\nvapi\nsamudrala\ncrassostrea\nsnakebites\nsgarbi\ntwits\ndemonte\nklinge\nantagonising\ngabri\nbarangaroo\nbutor\nmochan\nscor\nrosenburg\ndolcetto\nbrinck\ncyberchase\nindesit\nfutcher\nkirkegaard\nmatras\nsparsh\njagielka\ndrumchapel\ncordiality\nprecipices\nzanelli\nwibw\ngateman\nsannat\nmenkauhor\nulsterman\ndysphonia\nzitting\noura\nmarkes\nmuchos\nhereunder\ngwendraeth\npalaung\nclawless\nbarentsburg\ndeluding\nsekhon\nvoller\ndiaconu\nneuroanatomical\ncallimachi\ngoneril\nwatkiss\npascals\nopps\nwilkos\ndoonican\nprocedurals\npanafrican\nllanddewi\naustintown\nkittrell\nhooksett\nameristar\nbuxted\nbarakah\nvlahos\nstorys\nkyrillos\nbwindi\nturbat\nwingrave\nterex\nparaquat\nchickenfoot\njabbarov\nenchong\noyler\nmachaca\nfolkston\nhengshan\nguanggu\ncuries\nreforested\nserralves\nqingshui\nzissou\nkatsutoshi\nmonogenic\nmbala\nalstott\nabanindranath\nlouver\nleininger\ncopel\nnocton\nrhizobia\nfabrikant\nstephie\npaleoclimatology\nmobilicity\nrigamonti\npatxi\nmarandi\nadjud
icates\nbaize\nlinscott\ngmanews\nnabhan\nwalda\nnoteboom\nferet\nquilla\npenzler\nbuesa\nbittman\nfoggo\ngestured\njrd\nebsworth\nliautaud\nskaf\ntaganka\nphilippos\ncarhampton\nholliger\nkfs\nbholu\nchampenoise\nsanderstead\nnatomas\ntranscendentalists\nnoughties\nbazookas\nwaidhofen\nstreng\ndarktown\npiquette\nmemorex\nbreer\nhastens\nfruitfully\nnasb\nsahakyan\nbonaly\nwavin\noverspeed\nshredders\nwachenheim\nharpal\npedagogies\ngendebien\nlispector\njurowski\nbessler\nfrappe\neastick\naccola\nberrian\nafrodisiac\nargens\nhyperstimulation\nvato\nroubini\nmichalik\nmaniche\neyelet\ndysthymia\nglams\nsanon\nmfsb\nmisson\nmnn\nsaleability\nclw\nmardas\ndayspring\noverhung\nfalsettos\nchapmans\nperone\nsouthdale\nalev\nportsoy\nyanes\nteissier\nheidrich\nmonda\ngiai\nfinkbeiner\ntechland\nquader\nparenteau\ndrell\ndhok\nmbembe\ntsujimoto\ncooties\nnadiadwala\npoher\ndunnellon\nscherz\nbado\nsunrises\nmullein\novercomplicated\nharnish\nbosma\ntelecomunicaciones\nyardbird\nfreestyling\nrefillable\nmicrohabitats\ngeorgetowns\njableh\ntasneem\nlintner\nschirripa\nmythologized\nmudi\nfrangipane\nbarye\nbarner\ntwic\nweera\npressel\nnaggar\nbonte\ndallaglio\nnighthorse\nwex\nadits\nnycta\nmarondera\ntbms\nlemley\ndto\nkoneru\ndysautonomia\nbelchite\nagganis\ntomczyk\nmöhne\nhoarder\nvcg\nstudland\nfeatherbed\nselten\nelide\npoythress\npretentiousness\nfraker\nheadstocks\nsedating\nquinolone\ntechnotronic\nkeyring\nfiddes\nintelectual\netic\nsztuk\ncarlesimo\nuninflected\nkillip\ndtes\nwassmer\nvenerates\npermethrin\ninfrasonic\nvitiated\nterao\ngreenscreen\njodhaa\njnc\nlevelland\nluns\nnardone\npatou\nneds\nronquillo\nbouwer\nkiyokawa\narrestee\nbadinter\nstaphylococci\nstaniford\nquoits\nkirstine\nfreedonia\nchigasaki\nmoyses\nbule\nmelanocortin\nalibert\nmoskalenko\nenas\nfreixo\nvaladez\nluxemburger\nhiltz\nglenrowan\nhackathons\ngofman\nservin\nliberte\ningber\nmidterms\npitou\nretaliations\nsumption\ntumblin\nrainmakers\ndependently\nstoermer\nphyllostachys\ndissua
des\nbouet\nphilyaw\nrésidence\noopsy\nkidzania\nsiegert\nmarigot\nmanosque\nantacids\nstettinius\nclaverton\nmingas\nsplendors\nivanauskas\ngodt\nabdulhadi\ncelek\ntoz\nglavin\ntimchenko\nlyrica\nchefchaouen\ntheorise\ncalar\ngulou\ncorroborative\nmisstating\ncoogler\nunshared\nrahba\nuhp\ndebary\naprs\nchenille\nfasth\ntreue\nfagel\ncandreva\nledisi\ngrownup\nstereotypic\nkonso\nwindstream\nliepaja\nqic\ngimcrack\nrebozo\ntheunissen\nrecurrently\netymologist\nburaimi\ndrr\nwychavon\nmjg\nitokawa\nmotzfeldt\nsunshade\nrockliffe\nbuan\nsice\nunscrewed\nberhane\nsonett\nkololo\ngiardina\nvamped\ngodchild\ntartak\nulmann\nthuraya\nmellophone\nangelia\ncorá\nsundell\nshipbreakers\nunprecedentedly\ncromie\ncobbling\ndulci\nmacatee\nclé\nrosecroft\ncrofter\nkestenbaum\nmegacities\nbreakeven\npyin\nlerista\nkopek\ndreading\nvelaro\nassez\ncheju\nbouza\nkhuzdar\ndigesters\nnctm\ngarvagh\nlutin\ntherm\nirreproachable\nratho\ntefé\ntega\nhurly\nmascolo\nvolman\nkillyleagh\ndinorah\neimear\nlumenick\nnazem\narshi\ndahr\ndisdains\nanthropocentrism\ncommissione\nfukaya\nlingenfelter\npaperboys\ndittman\nmehlis\nbeeding\nmackerels\nzebrowski\ntinyurl\nwiddop\njorquera\naxi\nkaffe\ntakahiko\ngabar\ndemaret\najose\nashcraft\nbarinholtz\ncartwheels\navf\nodegard\njottings\nkizu\nberbick\npatted\nsatiation\nsamhsa\nsoes\ndisplease\nperast\nbmm\nnordqvist\npanga\nravilious\nhorsed\nthreepence\nldpe\ngaydos\nkrysta\natalaia\nloupe\nburti\ntwerton\ndabei\ncanin\nkever\ndunfield\norphism\nlisas\nladra\nchole\nepting\npitfield\nmaoi\nstefanik\nweee\nwiehl\nateneum\ndefries\nflydubai\nmoak\nserbsky\nevenk\nmallikarjun\nparlin\nmeriem\ndimitriadis\ndecontaminate\nwoolas\nlussi\ntoucher\nalborada\nsmallholding\npanem\nwydler\nsypniewski\ncossington\nhuangshi\nguiting\nvarone\nblystone\nandrias\nhaniya\nrinjani\nostende\nruthwell\nmaurois\ngautami\ndebatably\nchanghe\nmurrey\nyeghishe\night\nhooter\nunplayed\npolicewomen\nbleeder\ngansler\nnearside\nbartov\ncetane\nmalladi\ncadaveric\nmaters
\nalpay\nfukami\noveremphasized\nrennison\nvincentians\nransomware\ngryposaurus\nmonusco\nliverani\nmagsi\nmcgeary\nnormanhurst\nalcopop\naztek\nswallower\nlaurentians\nblitzing\ngunson\nrashness\nwher\nenslin\nyoplait\nbarricading\ndicen\nattfield\nbanh\nlavrio\nstechford\nbellahouston\nloyo\nsifter\npraunheim\nlechler\nylva\nabdelbaset\nluchs\nineos\nhazanavicius\nfauvist\nlatz\nshelmerdine\ntwinkies\nstorekeepers\nsloopy\nmandarina\nfransisco\nsimeulue\nproscribes\nblome\nedgewise\noverstayed\napportioning\nshackley\nepn\nmazor\nethnomusicologists\nwedi\ndangerousness\npetersburgh\npither\ncastlemartin\ncompos\nwhiles\nsdot\ncanalside\nsuprisingly\nobfuscates\nsuad\ngalabank\nrusia\nlaughingly\nfaras\ndowneaster\naramark\ngrokster\npeed\ndenna\nhomecare\nslowinski\nskyship\nsirion\ngarbus\nrandon\nnabeul\nappletrees\nfusillade\nrichy\ndillahunt\npangloss\nkerekes\nknoblock\nravenal\ndeknight\nkeser\nchapati\nbarbacoa\nbaster\nmetagenomics\nsolvability\nstretham\npaleobotanist\npeb\nnhgri\nkalinda\nsafaricom\nheade\nmauthner\ncroitoru\nnieuwegein\nkarnow\nsecrest\nlucasian\nhumen\nbenanti\nfauns\nseemly\nszymborska\nborzoi\nlandeta\nperton\nescapology\nbusywork\normont\npoors\nferina\nrosalinde\nthatgamecompany\nebsa\npooler\nsatheesh\nlifschitz\napts\nwissel\nhesselink\npurolator\nkhil\ncomplexly\nstirchley\nbenacerraf\nnarin\nmaytham\nhandlin\nconine\nwilhelmsson\nleuzinger\nmetes\nhige\nkuze\nmarvão\nbartkowski\nfeddersen\nbazeley\naamodt\npapiers\ncosmonautics\nelmsford\ntamano\nenameling\nkhudai\ngarnacha\njurby\nzom\ndander\nbonacci\ndiethylene\nnecklines\nmanohara\ngelin\nveirs\nphosphite\nlaureen\nrobbi\nmasahide\nkrupski\nhorsefly\narief\ntideswell\nyaeko\nreibel\nchhina\naldwyn\naonuma\nlova\nsansui\nrammohan\nproudman\nsteadying\nirreplacable\nsignifcant\nniveau\nnonunion\nstaar\nhakala\nvanderpoel\nnowinski\ncaral\nevelin\nunderreporting\ndysert\nkiessling\nkeenest\nette\nghosted\nmeddlesome\nbacc\nanastasiades\nharton\nstumptown\npiromalli\npostdates\
ntahnoun\nturba\nsnee\neroi\nbusfield\nsitarist\nlicensors\nkandla\nlinnane\nshechita\ntaneja\nzachar\nopacities\nelyas\nbrockenbrough\nbalibar\nbarriere\nkrymsk\nkalach\nfluker\nexultant\nzircons\ntelcom\nsmocks\niodate\nncic\nbandiagara\ntzi\nmouri\nalmont\ndangote\nultrahigh\ncolourings\nbenali\nturcios\ncostea\nbruyneel\nnpy\nreinterprets\nmolecularly\nreversers\nbootsie\nlowrance\nsesso\nscheinberg\ntyngsborough\nkieschnick\nzirkel\nmechner\nigal\nbmxer\nwittenburg\nsayward\ndarion\nhypergiant\ntoryism\notsuki\nmetzen\nsamhan\nmazra\ndisgorge\nmargetts\nkitanoumi\nbackplate\ndoswell\nbaaba\nschupp\nwithdean\nprepay\nflatus\nlorrin\ntranspower\nlenkiewicz\nthrashes\ncheena\nzunino\npieroni\nkosti\nsiaya\nburdisso\nkayvan\nwoodworm\njalouse\nfrerichs\ndjalili\nsohl\nulvi\nlobanovsky\ndevelope\nullal\npollutes\nclipstone\nangele\namoah\nroussanne\nmuggy\nladerman\npapazoglou\naromatica\nsupercilious\npakka\nmycobacterial\nzamil\nformularies\naminuddin\nmotorstorm\nsinama\nsearl\nsadlers\nbostridge\nheartworm\nsyndromic\njairaj\nsaturnine\nvanik\nbukharian\nmouchel\nnorfork\nroday\njalousie\ntramonto\nyalobusha\nkassian\ndinham\nretransmit\nhilmer\npesters\ngainford\nheldt\nsjm\njamea\nsautet\nautobahnen\ncarrió\nprouvé\nnanopore\nfoppish\nmesopotamians\nsilmaril\ndesulfurization\nminitel\nmarans\nrigmarole\ngalleried\ncaramba\ntoyotas\nminoring\narisaig\npropounding\ntensioner\nwathiq\ngregerson\nalemu\nmender\nkanz\nkcnc\narborfield\nharmonizes\nunbarred\noap\nburrel\nconfalonieri\nmetso\nuntestable\nwhatton\npatang\nchicle\nlavalas\nsneezed\nnorthman\nrasika\nmundari\ndocudramas\nareata\ntuktoyaktuk\nnoatak\nspectrophotometry\nwindsong\namacuro\nloquat\npried\nbeinart\npartin\nthorens\ntway\nhireling\nashkan\nsidnei\ndaedelus\nestancias\nconvergences\nhomare\nceesay\nsamjhauta\ncrimefighter\nwollemi\ncinematically\nhemerocallis\nrubbo\nimpetigo\ngotto\nbof\nandalou\nstencilled\nmiraglia\nmammut\nportora\nkilbeggan\nwiegel\nvollenweider\novercall\nbaglione\nwakel
ey\noiwa\nterespol\nantiplatelet\ndeese\nskims\nbahoo\ncarquefou\nextraordinaires\ntaproom\nnarwhals\ndecena\nchimenti\nminsters\ntideland\ngardasil\nrubeus\nginormous\nbergier\ntoughening\nstreatley\ncascarino\nwatmough\npasscode\ncanastota\ndumba\nbotchan\ntankred\nkcmo\nscreamfest\ncejas\nroussy\namorgos\nryding\nsatelite\ncassowaries\ntorma\nklavan\norgueil\noligodendrocyte\nnewspoll\nexcretes\nwhitestown\nmccarren\nglenmark\nturfgrass\nyorubas\nvictimology\ndolmans\ndebu\nnorina\ndemilitarisation\nocelots\ndzehalevich\nemx\nkilliecrankie\nhagin\nnelms\nfarriers\nwanderson\ndoff\ngrether\njosefsson\ndwdm\nhaunch\nlupins\nhoaxed\nschnebel\nkulon\nngah\nisringhausen\nthielmann\nbogues\nkyoichi\njalapeños\nniah\nkurtley\natrociously\nveno\ngynes\nzhangke\ngillnet\nelitists\nporzio\ncarmont\nmellors\nosteonecrosis\nlabruce\ndanno\nbirdshot\ncardini\npiñeyro\nunsworn\nesche\narmatures\nmistra\nquijada\nscotton\ndubos\naspergillosis\nprofondo\ntempter\nangarsk\naimi\nhatrick\ngooseberries\nstrathern\nescoto\nlwa\njudentum\nhellbilly\nmcquay\nmadon\nnatwar\nmajuli\nkatv\nhiginio\npdgf\nipas\nknyazev\nstereoscopy\nmusoma\nwoodbrook\ncoronaria\nprevalently\ntarvaris\npawning\ncancan\nmirpuri\nlinnets\nisobutyl\ndibakar\nperlow\nsulis\nnappies\niddesleigh\nunhampered\nfehling\nbrynjar\nketterer\nknapdale\nportella\nrowsley\nthuc\nmermelstein\ncazenave\nspiciness\nmaquinna\nbrambell\nmanannan\nnorthenden\npantaloon\nmazury\nsowetan\nhajib\ndevotionals\nparkinsons\nvéron\nsossamon\nakhenaton\nnippy\nkickflip\npathmark\nmoorilla\nkrong\nmitti\ngett\npalito\namicorum\nglynde\natol\nmijas\nregurgitates\ncannet\nfesler\nharnisch\nmouloud\nvladimiro\nizrael\ngehl\nmulanje\nsplattering\nkeino\nnagatomo\nréaumur\nomondi\nlella\nluchadores\npoyer\nfoulness\npatters\npúblicos\narev\nmamelukes\nbiotec\nforiegn\nmilitarisation\ntrichet\ngernsheim\nventuras\nsupersound\nprogres\nmacallan\nadame\nsteeplechases\nbodelwyddan\ndankert\ncuozzo\nfirelord\nsunu\nrightsholder\npsychoacoustic\n
hajong\nslimed\nselebi\nschut\nantireligious\nhebblethwaite\ndilatory\nschorsch\nwhomsoever\nashin\ndockrell\nselanne\nngala\nstagioni\nbrenne\npoliticus\noreos\nakhmedov\neberts\nnewsmagazines\nblasien\ndncg\nsnoek\nforegate\nexpectedly\nmarcovici\ndissapointed\neugenol\nblumenberg\nthundercloud\nmythologist\nsalvoes\ngrandkids\nplainness\ncendant\nempiricists\npaprocki\nsoliloquize\nkadirgamar\nhormisdas\ninupiaq\nduchesnay\nshinnie\nboteach\ndimi\noutstrips\nnecesito\nhooting\naprc\nprizewinning\nesmerelda\nreum\npegues\nkasra\nnafeez\ngearhead\nwayna\nstirton\nlessors\ntiters\nradioactively\nskold\nseabass\nunmerited\nquolls\nnushi\nvaira\nthankfulness\nmikell\nwitherby\nwtvf\nltcm\nestos\nproles\nkapala\njanas\nruinas\ndhows\nivon\nbelville\nbkr\nocio\nergonomically\nnscc\naletsch\nbroadlands\nwanborough\nliberati\ndemystifying\nmikra\nimmunosuppressants\nwedmore\nparlato\nbrandenberg\nwaldoboro\ndeposes\nogp\nreconnoitred\nsteamworks\nmckew\nllr\nmitri\nwiffle\nbiocentrism\nmeddings\nbajar\ndobrogea\nlhotka\ndotel\nrazu\npoite\noldland\nmahabir\ncrile\ngelinas\nmetroline\ngirgenti\nlittleworth\nmanucho\nrustad\nbeakman\nkdaf\nwaxhaw\nhabel\nchook\noizo\nsough\nparthenia\ntriable\nguerard\nhoagie\nmuncey\nsegerstrom\nsagua\nsudbrook\nprestons\nmunjal\nsocalled\nparmanand\nnahshon\ndhahab\nmoviemaking\naleksandras\nhsts\nbarrau\nwaterbed\nbhanot\nmarwell\ncrenellate\ncolindale\nbrioni\nkustova\nbotez\nmozzie\nmaierhofer\npingali\nkluwe\nmatapedia\nmateer\nlouanne\nfastcompany\nserranos\nkrippner\nredburn\npios\nmindi\norangeman\ncower\ntravesties\ntinsulanonda\nmulticam\ngravelled\nsocioeconomics\nstyal\nescada\ngrasstrack\noffit\nctcs\ncerina\ndados\nnioc\nbiddenden\ncorporately\nlilit\nneyer\ncandover\ngubby\nmalchus\nsaygun\ngiguere\ndorsa\nkpelle\nnicoli\nkolan\nhartness\natjeh\nhili\ncockett\nusmar\ntauba\ncaj\nseicento\nsifang\ndictyostelium\numansky\nramayya\nneema\nroulet\nkinn\nkimhi\nworsbrough\ngouzenko\nkarlan\nungaretti\nlehighton\ntige\nvideocasset
tes\nswerdlow\nefsf\nwarfighters\nreeded\nholtville\nventas\nissas\nrathburn\ncasados\nhelicina\nharnik\nbernardina\ngowin\nnabor\nyurovsky\nsoberanes\naeris\nfuelwood\nlongneck\nelvises\nmarchuk\nvendela\narken\nirawan\nkensho\npohlad\nmitridate\npantazis\nfrerotte\nlubomyr\nmystere\ndreadfuls\nhiggin\nmck\nseelbach\nlincou\nniraj\ngötzis\nunresponsiveness\nollerenshaw\nmirto\npastorini\nbalquhidder\ngadea\nanyon\naerobraking\ntakushoku\ndachshunds\naitch\nsolorzano\nlibet\nfouda\nmoreni\nlalgarh\ngaullism\nironmongers\nrobida\nbandolero\nfinegold\nvirsa\nothellos\nraiola\nbrainin\ncernavodă\nptacek\nwroten\ncathedrale\npistole\nrenren\nhollin\neszterhas\nhfm\ncaridi\nthelwell\nfavio\nreionization\nheppenstall\noutnumbers\namah\nsya\nremak\nwindowpane\ncopco\nshriveled\nbpk\naargh\nhalifaxes\nwarehoused\ninbar\nspagnoli\nsalver\nlaxminarayan\nmesmerised\norotava\ntne\nhehehe\ncym\ntransacting\nbrambling\nregrading\nrugger\nebey\ngreers\nammended\ntoasty\nnetease\nirabu\nyushun\ndenari\ngwynfor\npresbyopia\nlawther\nweishan\nwist\nbayldon\ngreenwashing\nreanimates\ndimopoulos\nmilhous\nameriprise\ngurrumul\nhoathly\nfruta\nleggo\nterashima\nblueback\nwoodspring\nggs\nnovocaine\ndibnah\nmuchnick\nroumi\ngudkov\nswafford\nkantrowitz\nreassertion\nstanner\nbuckby\nmanlove\ncumulate\nrupal\nsognefjord\nmilita\nhazza\nflippy\nchurkin\nimtiyaz\nprincetonian\nperahia\namperage\nwwt\nswedesboro\njulee\nstephany\npropolis\npunitively\ndipstick\nkelburn\ncabg\nprevarication\nlusted\ncardillo\nboxmeer\nsupercedes\njialing\ndavyd\ntennapel\nenh\ncovetousness\ncymraeg\nmonton\nhollanders\nwicky\ncampidoglio\nreefing\npoinar\ngenzlinger\nockenden\ntantalising\nbreakages\nkillerton\njagadeesh\npullinger\ntasburgh\nlasme\ngoodpaster\nexcises\nhardway\ntroxell\nllu\ncraiglockhart\nschnack\nopsec\nprajatantra\nivanna\nargali\ndepreciating\nshuhada\nweylandt\nvelshi\nklimenko\ndeclawing\npedis\ncaging\ninterprofessional\nimpassible\nzoellner\ncisac\nhyves\narthaus\ncoatsworth\nbedini\
nmellinger\nprising\nipac\ngulak\nthighbone\npramoedya\nhieu\nwws\nbordetella\nwolfit\ncastrop\nomneya\nmicrocars\nnsps\nparcham\nhesmondhalgh\nbadagry\nbogans\ngimondi\nrieko\ngitler\nomprakash\ndonec\nmuhajiroun\ngilden\nsuppes\nvoina\ngilders\nypt\nchasselas\nobermaier\nfdma\nmineralised\nsaraband\npullan\nlunula\ntabarro\nfrancesi\nnewfangled\njasinski\nbresee\nesnault\nepitomises\nslepian\natrac\nkoldewey\njimson\ngspc\nlaslett\nbedrosian\nflecker\nhll\nidolizing\nsaola\nischaemia\nmagdy\nnembe\nhandlooms\ntachira\nnoid\neulex\nvulgarities\ndicko\ncheesemakers\nriklis\niiid\nhirsute\ntrach\ndigressing\nkrick\ndeadheads\nhumblest\ntresvant\nbwe\ntorney\nolusola\ndonavon\nmoross\nseiberling\nhiranuma\nroadwork\nirmo\nwellpoint\ntlrs\nshadrack\nosteopaths\nunrepeatable\nlutefisk\nmadonia\nintergraph\ngurl\nflaten\nnoura\ngarren\ngedde\nqasmi\nkatzmann\nbazzi\njalopy\nstrongroom\nsimmers\nchukwuma\nnosenko\nhorseshit\nfreeserve\nlissette\npragma\npropper\nhorseplay\ndiscriminator\nswalwell\npianto\nmagistrale\nassaad\nhispanos\ncanvasback\ntroell\ncoolum\nprivies\nupslope\ngahn\npatrício\nsmartass\ntripa\nvittek\ntarapore\nhirsutism\ndushkin\nfalseness\nturl\nvishesh\ndeerhurst\nelectrometer\nsonos\namedee\nobon\npreheating\nspringboards\nbonfield\nsuperzoom\ntaizo\nhawkmoth\nhatful\ncandlelit\nampera\ndipendra\nbelote\nproprietress\nmalen\nspirale\nabdoun\nmoyglare\ndesch\nrenegotiating\nizza\nbuechler\nthibeault\nluxemburgo\nburnup\nshkoder\nqianjiang\nharoldo\nchauth\ncantrill\namorrow\norrville\nbowerbirds\nmsgs\nmargita\nladyship\naguero\nsilverliner\nsorgi\naphoristic\nbrightline\nbjornson\nbarkworth\nrebell\nboggess\nsheepish\nplaytex\ndarman\nrosevear\nussc\nmorganville\npersue\npresidentially\noikonomou\nvibha\nardross\nsystemc\nphilippic\ntoxicant\nescheat\nguízar\nllanwern\nglycosaminoglycans\nguigou\nshadid\nnorthop\ndededo\nstratiform\nofficemax\nsiriraj\nbinali\nqts\nhasdell\nplumeri\npsycholinguistic\nvlk\nsofar\nnosler\nmystras\nolausson\ntrevena\nd
oesnot\nstegeman\nhornburg\nmujeeb\nesfandiar\naltimetry\npreheated\ncohesin\nzelter\nbové\nleites\nshortall\ngoshawks\ndph\npolymaths\nhaaga\nwebobjects\ntunc\nprioritisation\nquelqu\nfoppa\nbuttercream\nsenftenberg\nminuses\nfixating\nmaninder\ndemocratica\nharmsen\njagjivan\ntonder\nudoka\ngessen\ngbps\nmatula\nfloride\nkingsize\nkelle\nrailbed\netches\nsteketee\ngroundshare\ndockum\nscrutineering\nroughy\nurano\nxms\ngarst\nplumly\nmaleme\norate\nmischievously\ncookes\nzientara\nkinna\nchiquimula\nvelle\npoca\nyco\nmoonies\narrillaga\ntacita\nconveyancer\nollila\nmiika\ndemel\nmartines\nhbt\nseraphin\nuncharitable\ntuazon\nmuktafi\namorality\nclownish\nsixaxis\nredentor\nsolich\nquires\nmccumber\nghaleb\nimprovers\nchalgrove\nmelhem\nkeßler\nfirer\nunflinchingly\nbolivariana\nleeching\nadger\ninao\nfootstool\nndungane\nluanshya\npapà\nyasar\nsundews\nbambra\ngibbings\nfilipo\ngayness\ninching\nlianhe\nduetting\nastill\nmignonette\nnephrons\ncader\nravensburger\nahorros\nshaowu\noyamada\naydar\nlizarazu\nbengel\nixl\ncuffaro\nvieni\nkins\nmelismatic\nconductorship\nnelda\nnaep\nebp\ndelux\ncadoxton\nmcneilly\nfundament\nbonan\ncareened\nnordmeyer\nwoodchester\nyippee\nraphel\naksakov\ngoude\npictionary\nfiorucci\ntoptenreviews\nvegfr\narguin\nbulling\nalgunos\nsciortino\nfator\ncoban\nprieuré\nchacín\nfinell\nnorddeich\nnormalizes\nhamhung\nkeshari\nmanageability\nwoodshop\ninternationalis\nkumon\nwhyteleafe\nshiffrin\nbaluchis\nbuana\nvaldai\nunmonitored\nperim\nsnipping\nwuerl\ngwlad\nsilbert\ncockaigne\nyaari\nbrookhouse\nchelsie\naskance\nshigeharu\nnyf\nunbuttoned\nrigali\npantaleo\nstamboul\npersonalise\noxx\nanyi\nconeheads\nzhongnanhai\nkabbala\nanneal\nowasp\nrelly\nnonito\nwemple\ngoosey\ndriv\nmilicia\ncementitious\numtata\nstandford\nkammermusik\nlindum\nhuds\ndepasquale\ndeivid\nisotopically\nblizzcon\nmaharam\nsise\nmerca\nkangding\nwakey\nabcb\ntamboura\nrocester\nevildoer\nwatada\ndyscalculia\npredawn\ngaullists\ngarfish\nbarbagallo\nnasopharynx\n
lugu\ncameco\nfuisz\nketurah\nloadmaster\ngoeth\nsteading\ntqm\napotex\ntinti\nnoem\ncerdà\nmalingering\ngoodlad\nwallows\neudy\ngunthorpe\ndrinan\ndefecated\npermira\nriveters\nukhta\ntechne\nelx\nsomercotes\nhania\nmiyabe\nmadlock\nincapability\nukhov\nluti\nrawmarsh\nnccs\nschanze\nhalaby\nelliptically\nipcress\nimedi\ntremé\nfloyds\nschwert\nbackyardigans\npesek\nautistics\nimprovment\nupscaling\ncharalampos\nmalott\ntendonitis\ntelcos\nsoothsayers\ntorabi\nmtbs\nmonomoy\nkayano\nmandella\nburnbank\nkarmazin\ncomiccon\npdci\nggc\nsamis\nogam\nadleman\nmontefalco\ngersten\nghaffari\ncamu\nsagittae\ngaffey\nllwynypia\ndadd\ncandleford\ntauzin\nshali\nbasser\nrebbie\nadiponectin\nmutairi\nhvm\nteruyuki\nboltons\nhenckels\nhym\ncalstock\nyanan\npancham\ntimestamped\nchalumeau\ndefiles\nforough\ngaffar\nbarch\nkucher\nnubile\nsportin\ny,z\ndribbler\nvinyasa\nsocky\nwned\nunaccountably\nmonkee\ncuckolded\nduroc\nmicroprobe\nfereydoun\nxwb\nsamuda\nbirol\ncontrovery\ndokey\nevanson\nmaestranza\negat\ntransuranium\nkazantsev\nhymnus\nrecke\nshergold\nrushkoff\nperceptibly\nlangenhagen\nmaccabaeus\nthern\nancho\nyouse\ntpu\núnica\nirún\ngons\nschulten\nchirinos\nmauritians\ndonella\nbrainstormed\nauthenticates\nnosema\nphotoworks\nbrezina\nhengst\nusamriid\nironweed\nbundesnachrichtendienst\nktar\nspratlys\nlundvall\nlaframboise\nrinfret\nmakena\nief\nproceedure\nmelisande\ncrêpes\nfidelma\nrodia\nleapfrogging\nbozza\nclarkes\nbadong\nchichén\ncyproterone\ndalbello\nhiscox\ndonruss\numezawa\nsagia\nhexamer\nliddel\nstrongbox\ndurley\nboix\ngoulandris\nmetodo\ndorfmeister\nthakor\nbabbel\nbeyle\narmfeldt\nfacinelli\nankrah\ncuv\nalmaraz\nwhitechurch\npepperpot\namerasian\ngodby\nzakia\nqia\nliseberg\nhalberds\nzhujiang\nuntended\nzhenyuan\nbaliga\ngoba\nvish\nballygawley\ndisengages\nrzeszow\nshaefer\nyuda\ngamestar\nhbm\npeev\nfrontieres\nthaad\nuep\nbackless\nkolpak\nmercatus\nyanzhou\nhelmick\nyassa\nizi\nsolidary\nexalts\npachycephalosaurs\nunipol\nwhb\ngesamtkunstwer
k\npeller\neuboean\ndebriefed\ntakeshima\nshortz\nmceachran\nbrockes\nsweated\npasteurised\nmalians\nheadbutts\ncoveting\nshawan\nalyaksandr\nrepertorio\ncuties\nsimes\niui\nmilitate\nnaftogaz\nfarnesina\ndeductibles\nshibani\nkortright\nkozol\ncabooses\nkienast\nplautz\ngropper\nderegulate\ndoorknobs\nmaer\naspheric\ndusa\ncossart\ncaes\nyorga\ndarma\nguarín\nyoshiyasu\nbrugghen\nkrotov\nrasheeda\nsarratt\nmughrabi\nsulev\npeckforton\nsainsburys\npraneeth\nabdulkarim\ncantalejo\nschoeller\ndkim\nsongkick\nbrode\nheighington\nmout\nfadhel\nekstrand\nfrighteners\nstanishev\nmccrady\ncarax\ntaibo\nmatthiesen\nextranet\npartyka\ngornal\nhiroe\nhonganji\nhgc\nhaem\nphiline\nramsi\nsilko\ndieringer\nmagana\necfa\nedsac\nheighway\ntranscutaneous\nrearguards\nunicycles\nbrayan\nmorishige\npixton\nwaibel\nmushing\nshambo\nfoxrock\nyohai\namoria\nllega\njaji\nkeychains\nkambli\nmeare\ngunfleet\ntricorn\nwestcliffe\ngrindle\nbifurcates\nwissler\nraposa\ndæmons\ngret\nportbury\nbucaram\nharumafuji\nbelchior\nramshorn\nhairbrush\nferryboats\ncharla\ndullest\nshubik\nboisseau\ncascone\noodnadatta\ncreveld\nmelquiades\nbronchopneumonia\nmcgreal\nslushy\ntullock\ngoldendale\nhotties\nibragim\nantwone\nvoinescu\nsheraz\nmjp\nandreasson\nmertes\nloralai\nrestaurante\nmbuti\nlighthill\nvarnado\nsoulbury\ncosel\ncounterfactuals\nadz\nkeiron\nguilbeau\ngabourey\nlatehar\nchamara\nporosus\nelnora\nstodart\nsharone\nproskauer\nenriquillo\nghazanfar\nijk\nmesdag\nredshaw\ncounterproposal\nregueiro\nimmoderate\ncorgis\nboulud\nhourigan\nminhang\ndantley\nmasr\nshirota\ntoscani\ndevons\nxuhui\nulstermen\nzonnebeke\nfubon\ngitxsan\nkliff\nrurutu\nboey\nallnutt\nalfetta\nmcclements\nlafaro\nsadrist\nbettys\ninanity\nsadasivan\ngangsterism\nosweiler\nerosions\nconvulsed\nroyse\nhomarus\npalmanova\nblurted\nspodek\nsupernal\natrato\nmarubeni\nxiping\ndespain\naquifolium\npuckering\nyazbek\nchaussee\nendogenously\nwayan\nwops\nlamey\npremade\nkufi\nlungless\naxé\nanonymizing\nlongjing\nthornham\
nsysco\nraghunathan\nterrelle\ncarleen\npanmunjom\nmanzar\nunanue\npovetkin\nvahagn\nwuss\nmagellanicus\nunreached\nfawdon\nhistologist\ntirunesh\nharben\nhakewill\nzembla\ntadalafil\namstell\ntakamado\nzookeepers\nhassinger\nmarico\nlakra\nkibbe\nnorr\ngrasser\nglimmering\ndevel\nklare\nmoneymore\nauty\ncomplainers\ntechnobabble\nimmunomodulatory\nsuddenness\ntabulations\nsavic\nbylot\nrepairmen\nclunk\nomertà\npolwhele\nkipchoge\nstayton\nburakumin\nrelocatable\narmie\nwitherington\nabbreviates\nskellingthorpe\nrodnina\nezaki\nwsv\ncovenanting\nholidayed\ngigolos\noasys\nbicoastal\neccc\nmetrix\ntremec\ndispersers\nvecsey\nscooch\nharra\ntierno\nquente\ncoquilles\npietrzak\nsnooper\ndespereaux\ncalbuco\nbessbrook\nellenbrook\nbyut\nvilasrao\nmedulloblastoma\npeskov\nwinefride\ndyslexics\nkozuka\nescondida\nlarmer\nlegalizes\nsamoset\nradomes\nradivoje\nfettered\nbellringer\ntepito\nliturgist\npalitana\ncunxin\nfroyo\nknappertsbusch\nvesperia\nprecio\nradics\nimpinged\nmapi\nheiligendamm\ndegassing\nunemployable\nfakers\nmulato\nsaraki\npunkin\ninfliximab\npatay\nmotorcross\npashmina\npariahs\nspiderlings\ngurs\nluckie\ntherin\nabstentionist\nbeaudet\nchimneypiece\nrevolutionise\nboudicca\nbainter\ncanudos\nlewallen\nschlei\nenchiladas\nalmagor\npontarddulais\ngess\nsagarin\nconsummating\nasato\nstapel\norebody\ntanzanians\nmechem\nsteinborn\ngoram\nburjanadze\ngonesse\nqiantang\nkalhor\nbeye\nloganair\nhankerson\nthrombolysis\nninos\nmcraney\nrbmk\nwardwell\nviñales\nattah\noresund\nredbourn\npâtisserie\nbabad\ndounreay\nworkmates\nbeurs\nwplg\npopi\ndecosta\nfowling\ncascabel\nruffino\ncaroe\ncertifiable\ncircumcise\nvendramin\nhonigmann\ngalica\nlöwy\nkirilov\nwesterling\npatara\nfisherfolk\nadmited\nsjöstedt\nmonkseaton\ncolerne\nniza\nhagiographers\ngatcombe\nscherzando\ngumption\nbershad\nsvanberg\ngracin\nbeenhakker\nmuthulakshmi\nparlett\nboiko\nzitouni\ndhammika\nbioaccumulate\nshindell\ncoworking\nningde\nsuprachiasmatic\nmccalman\nbarkingside\nsnowdrift\
nclady\ntaubert\nmaximov\nmirzoyan\nbroadmead\namap\ndscs\nblackground\nrahr\nmartinborough\npotboiler\nnoul\nsumners\ntrachomatis\nbeaudin\nloebner\nrisby\nhuashan\nwapato\nweeting\nlystrosaurus\ncastonguay\nhidekazu\nhilland\ndtn\nbumbry\nfaute\nunbowed\ntreyarch\nmetaverse\nbarnton\nmacaya\npazzo\nsourness\ngoias\nsarona\npepino\nmavers\ndesireable\nmaniatis\nrenishaw\nsharafuddin\nsavaii\nhaythornthwaite\nconservatoires\nlangner\nprogestins\neicosanoids\nmaldah\nmcpheeters\nreggiano\nkarakum\nfrys\nnewspapermen\nmawdsley\nlleol\ndelicto\nayon\nexcell\nbasílio\nrecapped\nkhot\norba\ntordoff\narmagnacs\nkrein\nkitani\noutwitting\nshimmin\nvarty\nlaurer\nsaikyo\nnymf\nfastway\nburcham\nshewhart\nmencap\nearthlike\nnilton\nasenjo\nwiv\ncybc\nweissenberg\nsweatpants\nburkeville\namalgams\nleoneans\nreviver\nendsley\nalfriston\nmurghab\nlako\nspurning\nsiega\nllerena\narcadi\nruxley\nharambe\nmyelofibrosis\ndoublethink\nprobations\nsnyderman\npersonel\nwullie\nenews\nedon\nsquealer\nsauchie\nferes\nhorseland\nhewing\nhignett\nkimm\nhorfield\nardisson\nisahaya\ntallet\nvellacott\narnage\nkröd\nrestrictively\napam\narese\niolanda\ndisentis\nbishnoi\narsht\nreseau\njousts\ntracht\ntdg\nstenka\nkmw\noveracting\ngrizabella\nkorova\ncompusa\nforefeet\nsubsaharan\npaoletta\nradionics\ntarbet\nquaestors\naghion\nbefit\ncrannies\nwhsmith\ndgca\ndunsford\nhaversham\nzahed\ncoreana\ntawakoni\njerling\nariffin\nquintela\nyumoto\npixma\nwoerner\nbonsor\ndysphoric\nunmis\nthaiday\ngià\nmalocclusion\nquipping\nsarstedt\nmirkarimi\nsabrosa\nyoshiteru\nankiel\nfago\nchazan\nassumedly\nnégritude\nelysees\nkuzman\nconk\nretinoids\nmarcinkus\nweissenberger\nkaita\nlrrp\nunserved\nportreath\nbermudas\nladan\nembroidering\npertz\ncannoli\nvont\ncarll\nmougins\nsahrawis\nmicroblog\nideale\nchlorpyrifos\ncyt\npauh\nelzevir\nkhuri\ngradualist\nchabat\nvdr\nvalentijn\nwintney\ncroteau\ntokara\ndenneny\nuruma\nshinbone\nfixings\nextremophile\nrsta\npadalka\nblowgun\nbackfilled\nrockferry\nmajor
ettes\ntuomi\ndetta\nnonprofessional\nvichada\nkenmure\nufr\nbirder\nlünen\nstourhead\nwate\nassy\nlide\npeacehaven\nlombo\nmikheyev\ngumley\nislamique\ngiesbrecht\nlefler\nemberá\ninyokern\nconcertant\nmarlou\nmarkit\ncounterespionage\nprimas\nluder\nnightcrawlers\nkerrier\nvassilios\nrudbeckia\nkanaga\nbakayoko\nclanging\npreconditioning\nkitaj\njec\nstillbirths\nreyburn\ncoalmining\narrhythmic\nlaclau\namirkabir\nbrandin\nchlo\npcas\nartz\nzhixiang\nscrewtape\nostpolitik\ngreggii\nbluestreak\nsrikkanth\nwtvr\ndrupa\nagarwala\nweisberger\nnailatikau\nbrayley\nderdiyok\nindhu\nbusybox\nkiesewetter\nhuddlestone\njaniszewski\nmeditators\ngherardini\nfaizi\neconomize\nlibidinal\nlidos\ncampagnola\ngreenfeld\nderakhshan\nmesylate\npien\ncornelian\nsikharulidze\nakzo\nlemm\nlasdun\nqpi\npouting\nheygate\nhurndall\nnijs\nnurullah\ndijck\ndabiri\nembarras\nmisfolding\nsetrakian\nexfiltration\nrenold\nskolnik\nantonacci\nkearton\nmoyna\ndefarge\ntolkein\nghil\ncelecoxib\nsevareid\natoc\nmuktar\njalebi\neyzaguirre\nwhacky\npascall\naugustín\nadjourning\nmakins\ncoprinus\nfreshford\nsatiated\ninfospace\ntasi\nlegare\nunibank\nzandonai\nmcguane\neducause\ntatts\nataullah\nplebes\nshyer\nfessler\ncoye\ncercis\nnoisette\nfrottage\nfrager\nreinders\nopic\nsose\nringlets\ncoex\nfeilhaber\ngavilanes\nschneiter\ndarvin\nfalko\narmindo\naghai\nsalfit\nsnellgrove\nnrx\nwomp\nlambiel\nnighters\nligabue\nocclude\nkumbum\npowassan\nbellmer\nladson\ncqb\ntricot\ntumu\ncenturytel\ngoven\ngardenhire\nweakside\nmealey\nwassell\njuston\ncamaros\ngrybauskaitė\nsundaravej\ntechstars\nlissie\nsponging\nkilogrammes\nteotihuacán\nlazarides\nspacehab\nbellaghy\ngarw\nsrdan\nbazille\nnumminen\nquaritch\nhordley\nmojahedin\nmelsungen\nardern\nwarily\nlombe\nmelanopsin\nmpenza\noviatt\nfantini\nzarabad\nbetsie\nstuy\nneovascularization\nepinions\ndevita\nsamsam\nfrankenweenie\nalmendros\nswihart\nseulement\ngiudecca\ngaudette\ngibs\nfoxwell\ntransmeta\npistola\ncusd\nsolaria\ntruncheons\nblackwells\n
procrustes\nbelan\nabsorbency\nlemarchand\nrenseignements\nseagoville\naeroelastic\nmiscarry\ntransferee\nsegreti\nrefaeli\narenales\nsanthi\ngorlin\nlgi\nmaruf\nbayberry\nalbertz\nroton\ncaretaking\nreynders\ncostarring\nbizri\nmadore\nceneri\nheimo\nbalonne\nbehrouz\nmclusky\ndannemann\nfranju\nkoenigsberg\ngávea\ngabilondo\ngipsies\nislamica\nequipoise\nkudai\nmarulanda\nmoacir\nabidal\nparenchymal\nteras\nmidwater\njajarkot\nkolkatta\ntrailway\nsquirted\nvallentine\nkronwall\nratledge\nkavak\ndomtar\nstarvin\nkhorkina\npinstriped\nejc\ncrepuscule\nbaksa\nasbl\nmilliners\npredilections\nuhu\nhollinghurst\nschweich\ngwilt\nunenlightened\nbossiney\nnkc\nambiorix\nconjoining\nrailyards\ndaloa\nsatans\nlapan\nglimmers\nlotze\ngordeeva\nmarada\nskintight\ngeminiani\nruland\npard\nsuperblock\niriver\nobersalzberg\nlaveen\nschakowsky\neriq\nsatisfyingly\nyapi\nwesselmann\nmmv\nadelgid\nglomar\navuncular\nchoden\ngrubber\nabts\ncuadros\nhenhouse\nrepressors\ntyurin\nocf\ndxm\nshepherdson\nzeitun\nlummus\ndismantles\ncaddisfly\nbrechtian\nrhic\novercharge\nvenial\nquagmires\njueves\nhasumi\nfarago\nclasswork\nneuenkirchen\nchildrearing\nevf\ndelie\nwhetton\nhandmaidens\npake\ncarsley\nmaisch\nrayville\ncareerbuilder\nduderstadt\nchenonceau\nparsee\nkallikrein\nkarenga\nfudged\nbacci\npostion\nmartellus\ncoughton\npelicula\nbalsan\nfangraphs\nthrogs\ngrischuk\nthirlestane\nchuma\nsaltiness\nconceptualisation\nmotoo\nkenkichi\nsamil\ncandlepower\nsergii\nsilverheels\nairlifts\njook\ngerron\nceps\nkeiran\nfetchit\nbellard\nshirish\ncorncob\ninched\nmonett\ndenstone\nshuddering\nfadeout\nracey\nprittlewell\nnaná\nwestermarck\ngratzer\nzwerg\nustedes\nruina\nweening\nbrealey\nbelfries\nbarde\ncourtown\ncalleva\ndramaturgical\nblogtalkradio\nvasomotor\npottle\nunderwire\nunfilmed\nwhiddon\nvona\npennar\nkompressor\nporec\ndevouard\nobermeyer\nlisbet\nshaddix\nkafes\npopat\nmchardy\nhorrobin\nfilipovic\nhewins\nappertaining\nyts\ndromaeosauridae\nshingler\ngiral\nwua\nbelfrage\n
prefabrication\nrichings\nblean\nnaadam\nmauboussin\nbreffni\ncsulb\nplomer\nintercontemporain\nokoboji\nfernbank\nsympathising\nlainey\ninstitutionalisation\nnmma\ninscape\ncatling\nfarinacci\nmanka\ncolourfully\nhoka\nnanjiani\ntsuchimoto\nheldens\nconjuncture\ncassella\nveltliner\nottavia\npoolesville\ncoly\nescapologist\nmitchinson\ntrohman\njarabe\nbiographically\nadderbury\nkrouse\ncarissimi\nfazed\nsoquel\nmrad\ntierpark\nstfc\nmetrocentre\ndemystified\nzaidan\nitsekiri\nsaed\nkasumigaseki\nsnappier\nuee\npinecastle\nkamari\nranglin\nseveriano\ncroesor\nmatyas\ndtg\nsarsen\nkiyonaga\nentertainingly\navramov\nkeinan\nkazuyo\ncondemnatory\nwbff\npunditry\nllwyn\ntwisp\netro\ninterdependency\nkisoro\nbendt\nkremnica\nbièvre\nmetaphysician\nruggedized\ncenteredness\nnavvy\nraca\nfrontstretch\nschwadron\nkooser\nbaverstock\nwitteveen\nschueler\ndats\nlrh\nbaith\npommier\nouters\ncalin\nscrapple\nghanzi\nlumbreras\ncyanides\nakhmat\nsamudio\nacronis\nclasper\nlobules\ndarted\nashfall\namericanisms\nkaung\npectorals\nbezeq\nburrillville\ntenting\nkovr\nfacelifts\nbisceglie\nbriegel\nnorrish\ngase\nflorens\nrivaz\nfishtown\nbatti\nkatic\nmahabat\nsubmerges\nmagu\nunama\nfithian\nfitbit\namfissa\noligarchical\ncountertransference\njunglee\nbrians\nlindstrand\njaipal\ntopacio\nulas\ntuel\npeti\nchadd\nlaziest\nmoster\npanynj\ncairnes\nebbers\nzhvania\nglenrock\nbradstock\nagbo\nsherm\ncurteis\ncohutta\nmoskow\npomerleau\nsabeel\nkornel\nhtet\nnaut\nwhirlaway\nviduthalai\nfrontalot\npigmentosum\ndonné\nmwamba\nreja\ntrr\nakala\ncalland\ngeu\nboardgames\nmelanosomes\ntyrod\nserums\ndovetails\nhodnet\nineke\ngodparent\nprotectable\ntutus\npapakonstantinou\nresistible\nsémillon\ntwachtman\nflorissants\nrajala\ncowhide\nphotoionization\nforero\ntjx\ncdss\nsubcomponents\nvisegrad\naioi\nkirwa\nruritania\nperishables\nsollima\ngiussani\nbasir\nultramontane\ncerta\nhhi\ntaroko\nwatene\nvieng\nlochy\nnardin\npederast\nudders\ndurano\ninterwebs\nritonavir\nruido\ncanner\nyinglin
g\ndeionized\nmbatha\ngonen\nlittmann\nrsw\nenskilda\nvilamoura\ngoot\nvenetta\nverbage\npuea\nkhanji\nmcgilligan\nbulbus\nschlueter\nalterna\nsupranuclear\nmeiri\ndenisovich\nsharland\noperationalized\naerialist\nchaperoned\nrumbaugh\neagels\nhooey\nsapsford\nbankrupts\nbrookley\nthorkildsen\nruefully\nmonterroso\nxiaoxiao\nbakili\nrolandi\ndorie\nnho\nheavey\nshangla\ncasterbridge\ntinner\nteslas\nsonogram\nlavista\ncaat\nerythropoiesis\nfrankmusik\nwilliamsburgh\nmanjeet\nkotlikoff\nbartl\ngottman\nfromkin\nrefounding\nhve\npelamis\nagma\nunalakleet\ntiner\nhideto\nzinsou\nmulton\nbomans\nancholme\nfoward\nashrafi\njörgensen\nterazije\naskegard\nkinte\nprh\nmorency\ncalcagno\ncolombano\nwurzel\ndisrespectfully\namiably\npichet\ncactuses\nviljo\nvonne\nspringett\nnespresso\nvestel\nlauchlan\nmyongji\nskateparks\nwarneford\nmillenarianism\naffectations\ntootoo\nultime\npresages\nstubbings\nkoslow\nshantytowns\nnaeyc\nchinggis\nhexose\ndiggity\nfrari\ncoya\njanvrin\nmompou\nmondsee\nromiley\ncunneen\ntalboys\nhamartoma\namodio\ngawn\njudaization\nearlston\nlww\ndemotions\nafrl\nculotte\ndownfalls\nimportations\nsidesaddle\nglotzbach\nbusman\nbinley\nabdolhossein\npih\neio\npoppen\ninterliga\ndeanie\nteetotal\ncardonald\nalexina\nbuzet\nreintegrating\nherengracht\nrisan\neicosapentaenoic\ntverskoy\nwilcher\nlotan\nthesen\nluminoso\njianlian\nmuise\ncheapskate\nnungambakkam\nmedlicott\narmourers\nciber\nroyalle\ngabai\nsulfone\nbalearics\notolaryngologist\ngorseth\nulanova\nruffman\ncenterpieces\nfarentino\nakyol\nspeeders\nsistership\npoage\nslimes\ntfx\nhighley\nsomerdale\nbardal\nslovik\nsussan\nmartir\ncoltart\ndaifu\nquinns\npolychromed\nnagl\ntarnishes\nardipithecus\nmililani\nkandar\nshortish\ngeoffry\ncabby\nalverson\ndomiciliary\nkormákur\nmuelle\npitchblende\nrepopulating\nsarpa\nnidre\npellerano\ntattooist\ngalili\nhochhuth\ndutchtown\nmonocrystalline\ncondobolin\nwindbreaks\napisai\nsabinas\nfrangible\nwoodsville\nlambuth\nbicentennials\nwilhite\nsyron\nbr
eckman\nstrug\nuher\nsqueezebox\noctanol\nheeb\ncortesi\nliangqiao\nhme\nzizek\njunor\ncontort\nswedien\nkennaugh\ncavada\nozanne\ntuusula\nhypernova\nhayesville\nhardon\ncarrigaline\nague\nleechburg\nnikifor\nbalaclavas\nmckain\nprabodh\nhigman\nlewisite\nburuma\nanslinger\nunspeakably\nhyperdub\nfusebox\ndillons\nprimos\nlaikipia\nchicka\n─\nquehanna\ndyin\nactivesync\nmanabi\nbnfl\nnpcc\nweipa\narnalds\njabulani\ncontrade\nlannoo\nsólyom\nstilled\novercurrent\nsuperhumanly\njosephina\nsapan\ndetmers\nmultisystem\nconnoted\nburess\nntb\nreedman\nhistoplasmosis\nafterworld\nchiavenna\nitchiness\nmeasat\ncohon\ncoti\nbander\nghillie\nbickell\nbacteriostatic\nsegantini\nmonogrammed\naweys\nstrathpeffer\ncetto\nsugamo\nbrooksby\ncolander\nhypersexuality\ntomizawa\nwaldschmidt\nferring\ndujiangyan\nblencathra\npizzazz\nglasswork\nildefons\ncrape\nwinders\nkuenn\ndarchinyan\nyoud\ngelignite\npharmacogenomics\naccomac\nameren\nportner\ngranqvist\nofframp\naccelerant\nmotomachi\nscrushy\nkatyń\ndarvas\ncastelao\nklingman\nspitefully\nricaurte\narzobispo\nsipple\nthackwell\nponson\nlrm\nelibrary\nkasun\ngurcharan\ndbn\ncavefish\ncollapsable\ntrango\ndegan\nmyrtles\nramadoss\ncondello\nsaverin\nftir\nlegendarily\ntindle\ntradescant\npantagraph\nmasing\nbubby\nlisvane\nprochaska\ninvectives\nnevaeh\ncannibalizing\nbierbaum\nmassar\nfayad\ngollan\nherberg\ncandlewood\nbandish\ndroppin\nmehmeti\nverni\nentablatures\nmalmros\ndote\nrodong\noperalia\nmirande\nbaerwald\nshapleigh\nmargueritte\nbarrancabermeja\nscabious\nsunrays\nshott\nariodante\nmongabay\nmerrymaking\nsoulier\nperugini\nbreakwell\nflossing\naptheker\ncfos\nralliart\nschanz\nhoushmandzadeh\nfeare\nchy\ngwaun\nbrackins\naerosolized\nkiawah\ndidacticism\ndardel\ndelude\nallos\nraisbeck\nabasi\ncarretero\nfertilizes\nsamplings\nbassenthwaite\npantun\nleaseholder\nslavik\nmillson\nqattara\neditorialize\nosibisa\nbaldinger\nschmoll\nobligates\nheawood\navida\ntyzack\ntcv\nwieber\nservicewomen\ncandlepin\ncoronis\nrati
onalising\npallant\ncouturiers\ngugg\njauron\nberkus\nskellington\nooxml\nbouffant\nkarmanos\nwarmups\nhanningfield\nfasttrack\nrodgau\nhassard\nxiaoyi\ngasifier\nshowoff\nhagos\nveau\nlenzerheide\naventinus\nluter\nanum\ncalme\nbreaky\nwainaina\naddabbo\nmaricar\nvenceremos\ngilet\nyarmuk\nnatch\ngorkhaland\nwiu\nbhoys\nwestminister\nsurtsey\nsiffre\nexecutively\nmangus\ncampani\nfuqing\nhashima\nwestwater\nshapour\nkenward\npilita\nverbeeck\nrfra\nmoktar\nprzemyslaw\nelmasry\nmorga\nsaluja\nsimsim\nprocrastinate\ntrophee\ncassetti\ngelabert\ninterpenetration\nhvc\nkillamarsh\nbadley\nestremadura\nrobeck\napicella\nmaleh\nhert\nweinke\nmillares\nnachtigal\nkuharich\nkanemura\nlachs\nincurably\nteps\noutscore\nkaisen\nbahjat\nkoufos\nschrenk\nengrained\nbrayne\neconomico\npolicier\nlietz\nseamaster\nelnur\nlukic\nwheres\ncharmbracelet\nbenante\niselle\nwaked\nkarros\ncohabited\nbuyten\nbalamand\nveneracion\nomohundro\natoyac\nsursum\nuncollegial\nfurley\nepes\nschabas\nprogrammability\nkarega\nkluszewski\nalydar\ngranberry\noutsides\narbos\nolumide\nsegodnya\nintu\nkottak\nkovachev\nthae\nsuiter\nfert\ngurmukh\nngouabi\nkirkhill\nkanaeva\nbrookeville\nabbandonata\nfaul\ngumble\npaulius\noudry\nclarkesville\netchison\nblowhard\nsalvadorean\nfairvote\nzomer\ntarsem\nlysyl\npollina\nvibrantly\nshabu\nkaliska\nbororo\nerkelenz\nmohite\nmammogram\nudin\nprivatbank\nhchc\ncaccioppoli\nreputability\nburgener\nhanaoka\nwittels\nmeur\npionniers\nfmcsa\ntrimdon\nschaar\ncornflakes\nlongport\ncrutchlow\nunvisited\nsmps\nwheelwrights\nmorros\nhower\nwoodlock\nmarquart\nkarpeles\nafspc\ntradewind\nbobin\nkight\naquavit\nguavas\npriebus\nrecomposed\nclearwell\nblackleg\nsnowdrifts\nlalwani\njelani\nedenvale\nsóller\narsia\nanx\nwangmo\nbilgrami\nthermoregulatory\nclucking\nvendee\nemersons\nvisy\nfarole\nypo\ndaehan\nkidbrooke\nfranschhoek\nripert\nbrummie\nwerve\nloudi\ngarstin\nmeteosat\nshabaz\nmcvea\nkanamori\ndoorly\nfurkan\nmonetizing\nfellside\naldar\nemeth\nhasil\njesty\n
blaina\nparlow\nhesitance\nhollandaise\npissy\nsiddis\nharmston\nlisteriosis\nabramovic\nbiopics\nputina\nphthalic\nnzoia\neuropeanization\nstacpoole\ndombrovskis\nfrontin\nfitzcarraldo\nshadings\ncambronne\nxenophobe\nchomet\ngwalchmai\nwiddicombe\nstuntz\nsharqiya\ngeda\ngilfach\nmykel\ncompagnoni\nnordjylland\ningvald\nsanjiang\ncitysearch\nlichenologist\nziebach\ncolwall\ncapriciously\nfairouz\npellatt\nhakuho\nsandfield\nhussin\nevn\ndartboard\ndarst\nchope\nschlink\nkhap\nbendor\nchiarello\nzalmay\ntheorises\ngalisteo\nbiopsychosocial\ngreasepaint\ngff\napuzzo\njesperson\novett\ngwanghwamun\nkalen\nwsn\nnovack\nnightspot\ncumber\nnyam\nllera\ncalafate\nfriedberger\nheeds\nlandgren\nredecorating\ntaleh\nasmir\nadalia\nstaite\nlapchick\nmunira\nvija\nsperone\nentreaty\ndatabanks\ndropkin\nshariatmadari\ncarlina\nmavinga\nbarefield\ninnaccurate\nmalba\norgreave\nindustriousness\nhandcuffing\nmedhi\nsupernaturalism\nkcp\nlavrenti\ninherantly\ntrellises\nmultigenerational\ngrans\nmacintoshes\npilobolus\ndaiva\nsynthetases\nruttledge\nlarches\nerpingham\nkaut\nbedaux\ntashima\nrevamps\npityriasis\novershoots\nreevaluating\narkansans\nmappy\nallcroft\njotted\ncialdini\nmastung\nkashyyyk\nlemington\nmaraca\nkurkjian\nuvas\nlongobardi\nkriging\ncongolais\npéladeau\nvirage\ndarabi\ntaverners\nsamin\nwans\ncarlisi\ntracheostomy\nacciona\nluhr\nsydor\nhippogriff\natiu\ngaume\nitec\nmackichan\npelaw\nmanchesters\nwadkins\ngodefroid\nbrownjohn\nbossed\nlaryngoscope\nsychev\nradegast\nforcella\nmoroso\nfalaknuma\ngrein\nbasiliensis\nkeawe\naudemars\nadcox\npenderyn\nsogavare\ncompadres\nhaotian\nrodewald\nbaleful\nphial\nsweetums\nhostelling\nojc\nwesco\ntitleist\naleix\nsturmey\nrenison\nchawton\ndaraga\nkadowaki\nkingsburg\nsolebury\nsavion\nhumaid\ntestaccio\nlipolysis\narulpragasam\nmichelmore\ntizzy\nbethânia\nbartholomaeus\npromociones\npineview\nmohammadreza\nazubuike\nsnaring\nepas\ncarapaces\nmatthaeus\nesterson\nkabongo\nscient\nalade\nanerley\nridesharing\narvidso
n\npropinquity\ngyle\ncoronagraph\nprothesis\nkotoka\nwalis\npastorelli\ntsca\nwilberg\namsler\nrajamani\nnastasi\nunforgotten\npetanque\nchinhoyi\nintocable\nivillage\ntechnol\nadenhart\nkirks\nweissensee\ntambunan\nnicoleta\nmirpurkhas\npannal\npawnbrokers\nwoodlake\nligula\nharadinaj\nishant\npenni\nsteakhouses\nkabaret\nflakey\nfewster\nclaypoole\nkharif\nblandin\ngovier\neffectuate\ngilot\ndentine\nkeate\nciss\nsupernormal\nlancair\nsuhana\napy\nhussman\npadian\nlognormal\norganotin\nchrysostome\nthuggery\nibrahimov\nmackrell\nnyrb\nbladderwort\njankauskas\noutpourings\nustashe\nkashmira\nyouyi\nthalictrum\ngangwar\npentobarbital\nmabye\ncspa\nevett\nrakowski\nrabello\ntpx\nusms\nbrokenshire\nradamel\ngracefield\nrockmart\nconfrérie\nobiora\nmakhmudov\nwili\nheeger\nkitties\nabraha\nashlin\nbadaling\nelmley\nvardalos\namazin\nsomatoform\nbunnings\ncarlill\ncalcaterra\nunfused\nwheter\nhabía\nlynnhaven\namantadine\nsticklebacks\nfrae\nkuman\npublicizes\nbudiansky\nsirola\nhypermobility\ncaponigro\nnagila\nhybridus\niwk\ncardinali\nboguslaw\npateman\narmoring\nappg\nrameshwaram\nlaclos\ndongles\nnewdow\nscdp\nhuntoon\nfcj\nexportable\nmurdoc\nhalkin\nuncreative\nluthra\naristos\nflied\nnjai\nfloren\nwoodborough\nobin\nravenclaw\nbutman\nairguns\nforouhar\ntektites\nbenaki\nbragan\nunlearn\nsangaré\ndebility\nimazu\nsnazzy\ncaldey\ngamin\nifpri\nmixology\nogm\nkillearn\nnemeses\nloutraki\ngautreaux\nricocheting\nissi\nkoistinen\nkuffner\nharang\nunsurfaced\nprum\nmaliciousness\ntransmittal\nschroedinger\nsugano\npopsci\nadairsville\ncicinho\nbinbrook\nchinglish\nkenting\nsamardzija\nhoor\nzindler\nchadda\nedenfield\nboco\niler\ninoke\nwvtm\nephesian\nambriz\nmait\nbedsheets\nmonico\nresected\nfogelson\nmarcelina\nlytell\nevicts\napurva\nrobstown\njombang\nrfq\nmotionplus\ndiminuendo\nseeff\ntagliavini\ntailcoat\ngotama\ncosson\ndims\nbolender\nmainer\nbrandao\nnams\nmoneo\nlowkey\ncudgels\nalinda\ndearer\nlawan\nnahash\nannadale\nsnavely\ndainichi\nadenoid\nexpect
orant\nceremonious\nrdh\nreconciliations\ntakiko\nenrols\nwakka\nreneges\nlibelled\neagleville\nbuchanon\nmcgown\nsinkers\nwillo\nexhall\nghari\nthermoluminescence\ncarmassi\nizon\nscours\nqurban\nkoloman\nmorsy\nmerchantville\nhookham\npyruvic\nunderwings\nceara\nfuiste\nmunby\ncinquantenaire\ndewart\naskren\nunvaccinated\ncoolants\nhemans\nnmea\nleadhills\nionizes\ncamba\npontyclun\nshaner\nbrandwood\nchiva\nnetiquette\nopti\nrecapitulate\nabobo\nimperil\nlarrieu\narrecife\nmaccabe\nstabilises\ndudok\nducret\necv\nambert\nspillers\nokkas\nshuangliu\nmørch\nsheeley\nwango\ncarajás\nantiserum\nembezzle\nkumin\nyakovleva\nonis\nsarraj\npaet\njacot\ntenneco\ndisconsolate\npisciotta\nlazic\nrheas\nostbahnhof\nsanming\nmyk\nmariola\nkrikalev\nsplatters\ncaruthersville\nbigland\ngaters\nacerenza\nmangat\ngosch\nicta\nharkis\nundelivered\nysp\nfeminista\nzlotys\nfintry\nalerce\nyarmuth\nherskovits\nkielbasa\nmyelopathy\ncogen\ndagoba\nkellas\nchengzhi\nsabic\negc\nmascall\nbertoia\nthode\ndjia\nippolita\nmosey\njilava\nbrunhilde\nearnie\nolalla\ndatsuns\nbandaging\ngoyard\nrhue\nchivenor\nchary\ndevrient\nndo\nscabbards\nmumbled\ngajewski\nheegaard\nsadok\ngieco\nmoer\ntopiramate\nbeezy\njolting\nfederalsburg\nvenkatachalam\nreinette\nkaladze\nroxo\npohle\ntookie\nfickling\npnina\nfoeticide\noversteps\nbutrint\nmillionairess\nkiprotich\nthruston\nleedham\nhuddling\nfrolinat\nbussière\ndissention\ndaae\nfillan\nkillanin\nnomenclatures\nanglicana\nvazirani\narleta\naderholt\nivanoff\nbevelled\nmamontov\nsanaz\neastover\nzullo\nscholey\nmicanopy\nmellanby\nsetubal\ntorke\ncreer\ncomputerisation\nmalis\nheppell\nkirknewton\nacap\nchare\ngusset\nsheriffmuir\nbagnols\nnosworthy\nwaukee\nmansoni\ntwined\ngtt\nliberta\nfuddy\nelectable\nsukanta\nhauch\netnies\nmarilynn\npapac\nsingerman\nlampshades\nshowtimes\nswissinfo\nosnabruck\nbuggered\ntuthmosis\ngorkhali\noyen\nbishvat\nvittel\npinochle\narann\ncadens\nkauto\nfauver\nplugger\nmeatier\nmadala\nblomquist\nkaffee\nblanchfield
\nsteinmeyer\nsaltram\njeanerette\negrem\nthoren\ncentrex\nhandfull\ncentrosomes\ntoppo\nlucidly\nregas\npimco\nbearn\nferrate\nnathi\npities\nhamms\nunscored\nseraphic\nsandcastles\ngarcon\nshukan\ngoodmans\nboscovich\ndofasco\nevington\nsteenson\nroughcast\nfumikazu\nmolefi\ntanmay\nhawdon\ninterport\nluaka\nmitchellville\ninjudicious\noldknow\nasyr\nneeti\nelongates\ngerminates\nradleys\narmilla\nokulaja\ntitties\ndiandra\npufa\nflightpath\nmayle\ncoopersmith\nloehr\nmukhabarat\nsmcc\nartless\ngencer\ndolmayan\nmarianao\nkessell\nnnpc\nhaltern\ntuscans\nbodson\nqim\nharmen\nadolfas\ncampoli\ncraftwork\nmisreporting\ndostana\ngerould\nnemacolin\nschoening\nbendable\nmlx\nmemebers\nfennesz\nsawt\ncotesworth\nindika\nhippler\novertopping\neversion\nmelco\ndelattre\nbrancepeth\nmawa\nindep\nparmley\ngypsophila\nviiis\nlichtsteiner\ndashan\nnecco\ntranstech\npoplawski\ncameronian\nyankel\nairspeeds\nkitted\nmédiathèque\nxscale\nburgau\ntenix\nscutaro\nbalfron\ngriffeth\ncapitulating\nspringtown\nwhoring\nzorra\nvavasseur\nlochside\nyohe\ngiovine\naccosts\nbpe\nesotropia\nrooflines\nseraphs\nbreker\nourself\ncouchette\nkaleri\nlofthus\narithmetically\nfritton\nfinnemore\nphrao\nwcp\nosterhaus\nhollandsworth\nblach\neemian\nclitherow\nbungert\nmismanaging\nlokhandwala\nriall\nmakana\ngeochelone\nrosés\ncidra\ncolloq\nkutuzova\nbestwood\nkirshenbaum\ncrda\nloeser\nuttrakhand\nfrais\npaediatricians\nratchford\njytte\noremus\nhpf\nmenlove\nciguatera\nmuhammadi\ntreforest\ndesheng\ndogue\nobjectify\nzhaojun\ntrautwein\nisic\nheigham\nmiddleweights\nargenis\ntujuh\ndelica\nontrack\nteairra\nadieux\ncotts\njerai\ntrea\nfrugally\nhewed\ndulux\nkallan\nconsoli\nstandin\nlundeen\nhahndorf\ninvitationals\ndarroch\nrakic\nhiromasa\nmassmutual\nnitinol\nnatales\nkennison\ngoldline\nhuludao\nhowdah\nbrays\nacrolein\nmagots\nuncaged\nuneditable\ntappet\nrosarian\nspatafore\nmaures\nkillilea\nmouseketeer\nwaterpower\navijit\ndovizioso\ndarkie\nhifikepunye\ngpk\npoges\nhorndon\ndemetri
ous\nunitedhealthcare\nurna\nbrusquely\nevoluzione\nsuperieur\nfauth\nnaté\nechlin\nviets\ndagher\nlbd\nisobars\nmilot\ncarbonia\ncebr\nblackalicious\nunwounded\nsinghvi\nrocklands\npachuco\namoo\nmatese\napears\nhirshfield\nhoussaye\nboerum\ndullin\ntryp\nkickstarted\nisserlis\nexuberantly\nthimbles\nrothmann\nvedado\nbushism\nmakor\nchazen\nintermarrying\nelegantissima\ntonko\nkambar\nbaltoro\nlatigo\nmuckross\ndocumentid\nlunger\nvalentinos\npacwest\nhollandi\nsmout\nmerengues\nmcalmont\nfmu\nmasin\nfrati\nvoyeurs\nminassian\nyakunin\nwacc\ngpw\nendive\nsirc\nsiero\nruutu\ngucht\ngoofball\nmykines\natiyeh\nhako\nmoring\nmilieus\nscaremongering\nhafid\nwightlink\npipiens\nminetti\nscharpling\nlonghai\nrossia\nshigar\nmacgeorge\nbratcher\nwangford\nsnt\npandal\nextirpate\nrepellant\narmitt\nmolden\ntubo\ntabler\nnovye\ncheesemaking\niggulden\ndefla\nmushu\nbailo\ncolorations\nmarsy\ncaked\necuadoran\nbrunvand\nblogsite\nstrandings\nirrecoverable\ngreengard\nkullman\nugni\nwalby\nampoules\ninler\nkess\nstarbird\nboulger\ncvh\ndebreu\nmeurs\nrauber\nrecanting\ncrotchety\ncarnivalesque\nmegalitres\necclesbourne\nserret\nhdw\nminetta\noverflying\ncrotona\nvolosozhar\nmamoun\njerrard\nnefyn\nsirolimus\nmohideen\ncenacle\nwees\nvisionaire\nfacchinetti\npoels\nfiredrake\ndarbuka\nmegève\nethnobotanist\nivig\nrindler\nnikkhah\nlaack\nmelker\nproinsias\ntumby\nmasoli\nshelleys\nguede\nbevacqua\ndunchurch\nhiestand\ncellulase\nspringhead\nzubeida\njetlag\nbrookie\njemini\nweatherbee\nkuypers\nfromentin\nsbz\nbalakovo\ncivilising\nmuraoka\nunrepaired\notoliths\ndeste\nthierno\nstrategie\nsiggy\nbiplab\ntulk\nlangsam\ngiladi\nsuel\neavesdrops\npolya\nmulenga\nadamowicz\nnoreste\nwelspun\nwenxiu\njacquette\ngnash\nquester\ndawda\nnessi\nmaterialises\nglenties\nkapaun\nmogra\nmagdaleno\nressam\nbehler\nvillalon\nbesen\nalmera\nchamorros\ncastrating\ntakacs\nroome\nheadsail\nschaer\nborrel\nsangala\ncits\nwangjing\nfrishberg\ntomin\nlaloo\npwned\nmarolles\ninterpublic\nsanitizati
on\nnewmans\nmcwherter\nkeal\nmeggett\nvinaigrette\ngoolsby\nluckinbill\ntortelier\nanatomies\ndeveloperworks\nstrums\nmedows\ndiverticula\nindividualization\nbeseeching\ngraus\nstanislava\nguinier\npendine\nbarberio\nkarlberg\ntrixi\ndrivetrains\ncreeggan\nmeile\nbodde\necuyer\nwnep\nrandallstown\nzzzz\njunod\nthse\ngarve\nneethling\ndarville\nweiller\nmoix\nmilord\nfpg\nchriss\ncalcavecchia\npurdum\nrothkopf\nsuppository\nrennard\nemig\nmerryn\nculzean\nuwic\nzaret\nbeauchief\nspencerian\nsamms\ngreying\nmatuszak\ncesspit\nsnork\nuntying\nneedell\nliberopoulos\nposener\norderliness\nayyash\nselvage\nbalgonie\nremoter\nosmaston\npako\ncrianlarich\nlovemore\nladuke\nozer\nstasio\ncontainerised\nhandwoven\naury\nwwb\ngrean\nwolfs\nyayla\nneighboured\nicecube\nzemfira\nplacating\nsirf\ncastano\ndepaolo\ncroons\npasmore\npocius\nvernons\ncryptozoological\nboddie\nglassdoor\nskrela\nmacafee\ncurent\ninsitu\niccr\nrudloff\nberek\npronin\nsociopaths\nnetten\nflogger\nkameni\ngascón\naxiata\nshawky\nkotick\ntrimingham\nmessageboards\nblabbering\nhaidinger\nschacter\nzavvi\nlendvai\nmaccorkindale\nstronsay\ngibernau\npeyto\ncharlap\nizuru\ndeek\nsipes\nfrugi\nstepbrothers\nwaldir\ninoculating\njaneth\nvindicates\ndenisse\nextricating\nfrediano\ncocchi\nhypervisors\ndorai\nqusai\ncico\nnvs\nblahnik\nmanesar\ngreyling\nwitchdoctor\nconstricts\nshorties\nszczerbiak\ncorringham\ndownhearted\nchix\nricochets\ngusinsky\nenology\nbisconti\nymc\nperama\nlubang\naryana\nasok\nellenbogen\ntommies\nbuol\ncastalian\nscoular\ningar\nzaslavsky\nriehen\nbonampak\norilla\nhavilah\neavesdropper\ndmanisi\nsesh\nragab\nbetteridge\nguzzling\nbraydon\nmeigle\nbahati\nbuil\npublicidad\ndokka\nkerrick\ncsto\nissaquena\npacitti\ndembowski\npanno\nragaz\ngrohmann\nwoodlots\nafghanistani\noubliette\nparachinar\npresense\nannaly\npuhi\nuncommercial\nbullmore\nsipp\nmuskego\nblacher\nnantgarw\ncompay\ntwellman\nstampings\nlifebuoy\nblasphemies\nlicklider\nantlered\nmstrkrft\nmaranda\nvillawood\nhiti\n
hathersage\njaguarundi\ngmv\nkorsholm\nfantasmic\ncuti\nquerulous\naccf\ncalacanis\nuntaet\nbezoar\nsheepishly\nclokey\nmalecón\nkonst\njatte\nfilderstadt\ntextor\nmarçal\ndivertissements\ntergesen\ncammack\ndigitalism\npenberthy\nmanures\nkennis\nguruve\nlassila\nathe\ngerra\nlikelihoods\nspiridonov\nkeamy\nsenshu\npermissiveness\ndogana\nsnj\niffi\ncanyoning\nwahyu\ncircumspection\npicchio\nsaranda\nweda\nunwired\nforkner\ntriebel\nraymondville\nhomosassa\nkocian\ngiovanelli\narkush\ncharmin\nbresler\nbrimhall\nspondylus\nwittek\nkappe\nvasanta\nfairbrass\nbishopston\ninni\nfers\ngigo\nhanon\nrautenbach\npimento\nchiqui\nmonachorum\nsweepstake\nherse\nklapisch\nplatitude\ngidney\nclower\nbioassay\npatarkatsishvili\npicart\ntheydon\nabdelkrim\nsapin\nteymourian\ndhiraj\ncarmike\nmhe\nmunslow\nsooth\nsterzing\nfistball\nkachel\ngusted\nmcneish\nrosenow\nbajos\nyorkston\nadeola\nrosenbluth\nhartsock\nzumthor\ndarina\ncusi\ncontalmaison\nikbal\nkamra\nsaltspring\nppis\nvasoconstrictor\nhelia\nfloch\nportwood\nmalawians\ngojra\ngambari\nearmuffs\nsurti\nrodford\nconair\npronounceable\ndepts\ntahirih\ndongsheng\nmacrolides\nabseil\netn\nchizik\npouncey\nboocock\nsanitised\ngeiringer\nyifei\nyonas\npantothenic\neikrem\nmeeny\nasps\ngrassington\nestimable\nkarten\nbotas\ntifereth\ngregoriana\nmeersman\nchronographs\nunfabulous\nkizza\nbovec\nalland\nnanocrystalline\nrichburg\nexbury\nyermo\ndemitri\nwgf\njuby\nouistreham\nkealey\nwinglet\nhertzberger\nbaskaran\ntiw\nseptentrional\ntaag\nknapping\naeromexico\noromos\ngutless\ndombi\ntilke\nopatówek\ncottington\nnajim\nenfin\nhinwil\nfiducia\nbarkow\napfelbaum\ngeebung\nmakka\nyeary\nignashevich\nqahtan\ndissimilarities\npassey\nanqi\nnumerously\nodoratum\nschirn\nbradberry\nochroleuca\nhockessin\nmillicom\ngardners\nbarzel\ngedera\nrefunding\nfirebaugh\nmutv\nlambley\nsplott\nchocks\nhultman\nhabgood\nsouped\nfitzpaine\ntrialware\nbozcaada\nslapton\nscarano\ndufrasne\nmaynes\njrg\npannon\ncryptozoologist\nreille\nskelos\nh
oodia\nbanaue\nguillermina\nwiseguys\nfotheringhay\nvengsarkar\nhikma\nhulan\ngrayed\namz\nwillin\nmadai\nlyngen\nplemons\newyas\ntheys\nbarab\nkuso\nnitrosamines\nwlra\nbalcom\nbourgault\nvarkaus\npoochie\nabuzz\nstadtpark\npropound\nbelga\nbottoming\nunionised\ncopson\nsoporific\nswanzey\ntrenholm\nkulob\nsundhage\nhayslip\nvinces\nfatat\ntargett\njianzhi\ndarksiders\nfiesco\nhampdens\ntpj\ncampervan\nthorner\ncheechoo\nsabbaticals\nhendred\nbaalbeck\ndevrim\nnutraceuticals\nwinlock\nmarinara\nsantora\npopulariser\nchaan\nindissoluble\npetrakis\nsurprize\nretton\nosser\ndonaldo\nearvin\nmonstro\nbiscop\norestiada\nblythburgh\nhemon\nchella\nkornblum\nenteral\notology\ntopnotch\namemiya\nhoneycombed\nrelived\ngaetana\nkrook\nommanney\nwella\njampa\nfinnissy\ntassell\nysbyty\nsergel\nfertilise\nseraphina\npapabile\nsherston\npobol\namonasro\npolyscope\ngeibel\nduno\nghp\nflunks\nondi\nxidan\nfetishist\ntdh\nersin\nvatsa\nnasz\npepler\nshereen\ntrancoso\ntuoi\ndeusen\nconformable\nhowman\nmerkaz\nbbo\ndeti\nlwanga\nphotosynthesize\nhissène\nfarncombe\npakefield\njimin\nartek\nchapare\njudaizing\nlizotte\ndcps\njenko\nyanling\ntividale\nsveshnikov\nbarkai\ncardiogenic\ndilger\nelectrostatically\nolexander\nkwasniewski\nvanston\nwina\npueri\nschechner\nraiz\nhinda\nmonsun\nsatwant\nliberalise\nintensifiers\nmiddleditch\noake\npanguitch\nhensleigh\nspaying\ncerar\nwashougal\nturangi\nmeadmore\nhighams\nsumaya\ngomen\ncianjur\ncuche\nstephensen\ntelephoning\ncontee\nsloshing\nconvery\nhormigueros\nmanuchehr\nyassen\npeppermints\nriski\nbarnhardt\ninterceding\nquarrie\ndermatomyositis\nrajnikanth\nunamended\nwalburg\ngaynes\nxmrv\nsotir\nsalboni\neuell\njeppson\ncountee\nvalpolicella\nroewe\nruia\noubangui\ncrewson\ntwirler\npandiani\nbushwalkers\nschopf\nmaroua\npartypoker\nelmers\nwolowitz\nsystemverilog\nqueerness\nlonelyhearts\nvizsla\npeered\nbouc\nslurries\nmakau\nramotswe\nmarzieh\nknightswood\nsorelle\nmorwood\nchomping\nbwin\nteia\nmahaney\nglaces\nkitman\nmccame
y\nvaccari\nhartville\nathleta\nhamano\ngrosbeaks\nlence\nenterica\nfrigidaire\nmedicean\nherradura\nbaseballer\nkyrill\npalade\nriestra\nacquisto\nganeshan\nmcnicholl\nknutzen\nlucht\nkipruto\nantigovernment\nrawalakot\ngadahn\ncastlemilk\nmillrose\nstreamside\ndeerpark\ncountertenors\nzinjibar\nelsen\nyattendon\nireport\nmetoclopramide\ndragsters\ncarminati\nbradbourne\ndurieux\naccorsi\nurca\ndiederick\ncontaminations\nentel\nkiselev\nbulimic\nxilin\ncockeysville\ncryptograms\ngento\nnewsweeklies\nlout\nmarna\npvf\nsabermetric\nexhuming\nlyndonville\ndeignan\nbottlenecked\nsabet\nbirthers\nclearnet\nsaloman\nniseko\nbombes\nbilboa\nunconsecrated\nevincing\nbardach\nhatteberg\nmallowan\nkatai\naxler\nconspiracist\nrutstein\nmanhattans\ngoguen\nbondoukou\nabrial\nanticlimax\nnutri\nsarana\ntalismanic\ncoupeville\nconcreteness\nchindit\nriccobono\nspagnolo\nmarguerita\nofferer\nposi\ngammal\nmarett\ndansby\nethias\nhandelman\npatkai\nakie\nscowl\nchargeback\nbenke\nrosch\nterp\nverwood\nlmm\nmeisels\nliangyu\narlyn\nwakatobi\nkenly\nbadhan\ndissociating\ncanonize\nbodos\nshortbus\nagressively\nshubham\nfowlie\nflageolet\nschobert\nfraney\nwolbers\ncherchez\nvitous\npetrovka\nkalogeropoulos\ndukan\ncione\nunpretty\nmatts\nsulivan\nmoginie\nhena\nmaggette\nccfc\ngwaltney\nlettieri\ncoundon\npuel\nhatcham\nmartinsson\nlaton\ncloudesley\ntorti\nkajaki\nanderegg\nnunu\nnetsuite\nmiddlebrow\narcor\nmitro\ncrinan\nbkc\nwiersma\nwimple\nbioregional\nsifa\nfaramarz\nlemna\nzambo\nbielawa\nfbm\nprole\nkemet\nadenoviruses\ntraut\nchcf\nsooooo\nbrington\nbehavoir\ndilema\nlacera\nlaghari\nchillida\nrasam\ncordula\nmeadowview\njostle\nheidrun\nsupertest\njamalullail\nkhakis\nfffd\nshuna\nilife\nconstabularies\nphotocell\ninmost\nnotating\nermitage\nperfetto\nwaterkeeper\nyelm\nfloresville\nstraightforwardness\ngreenspoint\nsalatin\njiayu\nalleycats\nkragh\nkorten\naugh\ncouverture\nwilman\ncolophons\nbrawner\nwalsenburg\nsignos\ndreja\nrafeeq\nrefusenik\namatory\ndextre\nregola\
nreznikov\npinda\nnuti\nfrate\njogjakarta\ncreeped\npérichole\nareni\nlonglines\nbennu\ncoleville\nlonie\ntuukka\nlinschoten\nnobels\nportuondo\ntrieu\nyongxing\nportsdown\nesmée\nbastardy\nbleaches\nmaayan\nmarak\nscpa\nhovnanian\nshigeaki\nheavener\nexigency\nprimerica\nkandaswamy\nadley\nmendte\ncentromeres\nstanmer\nyucatecan\nwalmley\nfiano\nberlyn\nwalburn\nlennartsson\ndeliriously\noverflew\nmelikian\npaly\noakbank\ngandil\ndivined\ndaou\nmarmor\nmamaia\nntg\nleasowe\nconcealer\nresignalling\nlundmark\nzini\nlenham\nonyewu\nsandrich\ncoments\nlabastida\nvuoso\namicitiae\npilav\nringens\ncapeci\nskalski\ndubourdieu\ntitch\nfeyd\ntaikai\nlibin\nraghuveer\nsolnit\nsiberians\nahanta\nthackley\nschmucker\nnields\nrammell\njanabi\nschwartze\nlety\nzagazig\nbudgam\nmarshyangdi\nwalkaway\ngrévin\ncicala\nbelsham\nshengli\nbroadcloth\nkulczyk\nbirgir\nfacia\nyangyang\nelya\njitem\nultraviolent\nwarrendale\nvanina\narminio\nlightheadedness\npanchmahal\ndunny\nkessels\nappleworks\nwaitressing\ncolsanitas\nstreett\nlenience\nrolfing\ndiari\nconvoying\napso\npresle\nentartete\nprostituting\nkhandro\nkabar\nsmal\nmultilane\nquantz\nhookworms\nperfused\nlipper\nsaltman\nfavorability\nmeere\ncothran\nrodier\nhoeffel\nannah\nhopland\ntias\ngaoyang\npoops\ngebran\ndebunker\noldfields\nhodo\nbequeaths\netha\nantispyware\nrheticus\nrefinishing\nlowson\nbreastplates\nmarib\napon\nourstage\ntionne\ngulaal\nsteinhauser\nterreiro\nparanaque\nmisti\nencumbrances\ncroall\nhapmap\nquinquennial\ngoytisolo\nwetterling\nborremans\nunderstudies\nmayuka\nballaugh\nchavda\ncastlehill\nmjc\ndominico\nfactsheets\nsuccinctness\nreceptiveness\nabbyy\ntsintsadze\nplanchon\ncityhopper\ntajín\nakeelah\nbogdanoff\nunnecesary\nbrosh\nramco\nfienberg\nnicodemo\nyanjiao\nwghp\nsonatrach\nfallston\nweideman\nicepick\nxsi\nsercey\nmajima\nsautoy\noughtn\nwilsford\nstradale\nvuze\nfrack\nsmartrip\nthoman\nrozhestvensky\nbakun\nwhimpering\nbaren\nnykvist\nheka\nbathampton\nvirg\nkwapis\nolivarez\nmabhouh\n
beloeil\nepargne\nhawkinge\nbardolino\nmarcescens\nbedstead\nagresti\nthaung\nsouljah\nsteffani\nritc\njérome\nfrades\nbraselton\nmalfa\nmedlen\nyañez\nlippold\nsoliven\nwelin\nnasif\nracette\naija\nruttenberg\nbabeş\nmatsukawa\nunmediated\nbgy\nbahnhofstrasse\ncootehill\nskullduggery\nbatasan\nalaw\npropitiated\nbalz\nnorrmalm\nbrasi\ngiuntoli\nkinuyo\nfesters\nnrps\nfuenmayor\nhmiel\nnunan\nframemaker\nreprobate\nwilpon\nblobel\nodysseys\nghilzai\nzoonoses\nbridgeland\ndendrobatidis\nprego\nbbj\nbillingshurst\ndelite\ngrassie\nmoawad\nvolaris\nraceways\nknxv\npsim\nformalists\npyrrhotite\ntrover\nvacillation\nbrunzell\nlucasville\nandam\nsolido\nsueno\npensnett\nbenihana\nhockings\nb,c\nheuvelmans\nmeulenhoff\nhyacinthoides\necklund\nvidrio\nhyong\ntums\ntechtarget\nisoflavones\nvohor\nbeitrage\nnasseri\nplaymobil\nturkmenbashi\nhoodbhoy\nleewards\nwillams\nunkown\nmirasol\nhoustonian\nrefaced\nchetna\nmishler\nbetfred\nrailamerica\nsiani\ndocwra\njoyriding\nlarmore\nteardown\nenuresis\narey\npiedrahita\ndirekt\ndanfoss\npateley\nmonopolise\nlibbie\nvpk\ntoldo\nfullbright\nkunert\namericanize\ntroikas\narly\ngasperini\nfalconio\nreapportioned\npiperno\ntriomphant\ntudou\nsingable\nfreebooter\nwymore\nmasella\nwinny\nnghia\nerdemir\ngrunion\nlogix\nrepayable\nccar\ngladwyn\nolp\nroder\nberko\ndarry\nboggo\ncittadini\npoststructuralism\nhartert\nrestenosis\nshellenberger\nplzen\nmacks\nmogok\nwego\nkinneil\nluquillo\nmelky\naggborough\nscherfig\nsgo\nusssa\nphun\nacesulfame\ntlf\nvolodin\ntabish\nnorc\npeekaboo\ncnemaspis\nsmales\nxixi\npadshah\nkulash\ncalello\nnonmembers\nhellner\nasham\npretties\ngiannopoulos\nrisse\nchangbai\nseeber\nbrugmann\nbroody\nsiddharta\nrcg\nkrumbach\nsupertankers\nbraunsberg\nadeleke\nibv\ngoalmouth\njeckle\neastridge\nmileva\nkelby\nthaipusam\nrigondeaux\nflattest\nentryways\nnyoman\nwheathampstead\nyarosh\noropharynx\nevasiveness\ngáis\nshawqi\nheterosis\naustrinus\ntransgenerational\nvrb\naloi\nblatchley\ngitter\ntlg\nalibhai\nsarie
\nburas\ndiethelm\nihrc\nfayssal\naspell\ncolhoun\nnonesense\npolycomb\nquerns\ntransferrable\nmelillo\nbrodbeck\nfatalis\neurowings\nhawfinch\nlucke\npacte\ndiversey\nmignone\nfifers\nnatsuo\ncholly\nharaldsen\ncrutzen\npedestrianized\nmekka\nalpujarras\ncmts\nallamah\ntamerlano\nsummerlee\nthermostatic\ndebited\nscherf\njwst\ntrerice\naleksidze\nlütken\nandujar\nishchenko\nrexona\nspacecrafts\nlindemans\nteele\nremengesau\nhorticulturalists\neurochart\nbluffer\nnazaryan\nsiumut\ncerto\npinups\nboxgrove\nmartorano\naccenting\njitra\nfieldfare\nxmi\nmershon\nchadbourn\nnozuka\ndhumal\nadelard\nsekt\nelectroless\nmacchiato\nforbury\nwitczak\nirreducibly\nmutasa\nnordan\nminocycline\nugwu\nbraising\nbitti\nstraightway\nflatliners\nsegui\ndifferentiations\nsirian\nfafsa\nhundi\nwebkinz\nguiltless\nhepatomegaly\nprostacyclin\npisana\nwolffe\nssbns\ndutschke\njanai\nfilipov\nbonnybridge\nlubec\nlihou\nrocketman\nkranenburg\noverheats\nglister\ncatán\nnowt\nwehbe\nfizzles\nmirel\nmollies\nkhawar\nkriv\nkohistani\nwoodcarvings\nantiquorum\ntubeway\npeulh\nblackstrap\njanuari\nappliqués\nbrugmansia\nnewsagency\nprovincialism\nhowsam\nmedalled\nmaytime\nguira\naasif\nbizard\ntjuta\nvenga\niams\nheitler\ntuonela\nwgl\nloomer\nalers\nliminality\ngochujang\ntilos\nbovard\nstoilov\nmakars\nwadge\nlochend\nfrendo\nseckel\nbosal\nabondance\noriginaly\nmultiline\ncourteeners\nuncontained\nsarason\ncokie\nceas\nwenli\nrififi\nmezvinsky\nneki\nminutewomen\nunderstates\npoppier\nkrakowskie\npolytheist\nagas\nfabel\nheesters\nainsdale\nyefimov\ncapybaras\nvastra\nglenburn\nlynchpin\nimageboard\nanglada\nimpresarios\njalloh\nluminosa\nyarnton\nyarnold\nbramer\nlevinger\nkalaallisut\nplymstock\nvainqueur\nromantische\nconflictual\nlefrak\nngwe\nlamido\nxingyi\nropewalk\ntemplesmith\ndisgorgement\nvikes\nnyoka\nvolleying\nbyzantinist\nshandwick\ndinata\nstinchcomb\nnintendogs\nmzuzu\nmariposas\nbrugada\nmongkok\ngouldsboro\nwanzer\npoil\nsnx\nbiagini\nbatsheva\nkazeem\nwbr\nafterthoughts\
nyountville\npaxon\nriderless\nekibastuz\nderval\neberlein\npanhandlers\nshwayze\npowszechna\nmertensia\nvilleda\nlikeliness\noystermouth\nblackbushe\nsidey\nmicoud\nconstand\npastan\nautorotation\nrevenged\nlumines\nalparslan\nwarmoth\npaterlini\nplut\ndevault\nsynnott\nnikkel\nblye\nshawnna\nexciters\nsophistic\npacificorp\nesquel\nceder\nnoorul\nbulley\ncaremark\nturbeville\norah\nstenhammar\nbeninati\nbalakian\nselinux\nmontagnana\nesea\ncoddle\ndereck\npeñasquitos\nchilham\nlimpsfield\nhaydee\nmarketeer\nbenching\nadmixed\nespressivo\nmoanalua\norecchio\ndomen\ntanui\nwaithe\nspolsky\nmumme\nmarring\ncains\noveral\nshagreen\nrascon\nhadramaut\nnormalising\nfinnbogason\nfeneri\nxenobiotics\nminneola\npalmera\nboun\ncognita\nbasters\noldsmar\nfarallones\nfemsa\ntenny\nneptuno\nconfuciusornis\nwebchat\nolawale\nbondarev\ngauloises\nsadistically\nkuja\nsobhi\nhomestretch\nidus\nadelin\nmanaudou\nwesfarmers\nsilverwater\ngolab\nlaryea\nbams\nurai\nsrey\nismar\nabhilash\nlarimore\nwgcl\nboucheron\nnitesh\nbugner\nroady\namoli\njref\nhighers\nmythago\ncushendall\nntozake\nsturr\nsabzwari\nsamsonite\nmagnano\nmarsicano\nrenger\nonny\nincarnates\nriat\nvolonté\noatey\nabsorbable\nzeytinburnu\nlaran\nélites\nmfis\nperfidia\nlambertini\nsaufley\nprend\nvmx\nbefor\nneonicotinoids\nnonproprietary\nazadeh\njayde\nnavsea\nleggy\nbidart\nplett\nstoudt\nactis\nwojcicki\nvictorinox\nmeopham\ncorkscrews\nluego\nexosomes\nconfirmable\nbehrami\nkongers\nlurching\nboltanski\ninexpert\nobligatorily\nrequa\nkozy\nmoakley\nwalkington\nblomdahl\ntrowell\nvampaneze\nquddus\npodiatrists\nciaccia\nmccary\nkirchoff\nbildu\nserratia\nrayer\nhuseyin\nwatros\nhnb\ncheswick\nkotowski\nbaadasssss\nbruces\nhett\njcu\nlilien\nmhealth\nclaessens\ntatsuki\nabdulkareem\nkazlauskas\ndressmakers\nmansha\nevangelise\nmasculinization\nahsanullah\nexpedience\novie\ndivot\nchildbed\nunpitched\nsalvaterra\nmiddleboro\nshair\nscotney\nbarela\nmicael\neverywoman\nmanahawkin\nlegris\nviridiflora\nnegga\nfrua\n
undercount\ncommingling\nkunashir\nmamerto\nquisenberry\nvnn\nandreadis\nintracity\nstemmer\ntapie\nnecrophiliac\nfeulner\nkuhle\nmuizenberg\nrocke\nforegrounds\nkabbadi\npawiak\nvitara\nfunkier\npedrera\nizakaya\nretinues\ndingwalls\nregenerations\ndouve\nfaca\ngrindstones\nenclaved\ncouchman\ngisa\ndipeptidyl\nbrustein\nskyrockets\ntakaharu\nbejaia\ndebase\ngascons\nbrownmiller\nramstedt\ncajas\nherzlich\nyankey\nmedicaments\npapermill\nhevea\nnanorods\nfarb\nepw\nhydrops\nmistranslations\nrecived\nsudd\nkisho\nschnepf\nlanco\nviverra\nelsy\njunning\ndavoli\njuric\nwoolens\nzoomer\ncolwich\ntrystan\ntamlyn\ndurra\nkraddick\nvarcoe\ntemaru\npasca\nrenesse\nschie\nueshima\npreordained\niei\nnestos\nvalise\ntanne\nparkinsonian\nsener\npiëch\ncorbo\nyeganeh\nnumbed\nayinde\nrotundo\ncestius\nmctigue\ntorontonians\nschweikert\nradion\nclusterfuck\njtb\nordet\nrudall\nduenna\nmclin\nishizawa\nramie\nbatte\nmultiday\nargueta\nachuthanandan\ntrivializes\nkimbrell\njazzed\nniton\nccx\nsciorra\nhundt\nsmartcards\ngiebel\nattridge\ngheyn\neurotrash\ntober\npitka\njagdeo\njackhammers\nchristan\nplai\nritzman\naati\ntopolobampo\ntrog\nhmmmmm\nmigliaccio\ncarena\nthusfar\nbackround\nnocerino\nchoteau\nwulong\nparoubek\nphlebotomy\naltavilla\nbaher\nlindane\nlarrazábal\ntesori\nscrapheap\nsportback\nsabara\ncadotte\npériphérique\nlft\njrcc\nkuntsevo\ncroquette\nmesdames\nwelti\nwpd\noegstgeest\nvaldéz\nurbania\nshinar\nqueyras\npétionville\ntimidly\nharakiri\natole\nkizzy\nyizhou\nkronberger\nagog\ngreengate\notamendi\nkundry\nkondratyev\nosieck\nkarpman\nasgiriya\nutian\nimpecunious\nsarhan\nbjerregaard\nferzan\nabtahi\nraynsford\nstobbs\nbitmapped\nunionpay\nbembe\nstelco\nholga\nstarla\nawfulness\ncohler\nsublimely\nlopera\nnosov\nfionna\ndispiriting\nhollaender\nnaimi\nmcfarlan\nseehofer\nscra\ncarabaya\nrcds\nsoothes\nappan\nballybofey\ndervla\nvogelstein\ntwinn\nhomos\nlimos\npetur\nverdure\nhumanization\ndemur\nexmor\nlasko\nyoki\nanglicisms\nearthwatch\njugnot\njasso\nch
angling\nbuerger\njumbos\nnairac\ndevco\nmessori\nkordell\nhanekom\ncarping\nsariñana\nkauppinen\nlifelock\nmulaney\ngete\naccustom\ninsley\ncannibalised\nbailhache\ncriseyde\nobradovic\nredistributes\nbadoo\nhykeham\nscrumptious\nelettronica\ncnhi\nkomeda\nsouthie\ntarsia\nzirkle\nfukudome\nsonoita\nmachover\nimmunoassays\ngangly\nkartemquin\nimaginationland\npodsednik\narakaki\npienza\nskus\nkanzi\nruncible\neasterhouse\nweining\nmaconochie\nberning\ngatlif\nvaster\nbraço\naccrete\nlisk\nedun\nsammarco\ngoelet\nizarra\nspicejet\nstigmatic\nferon\nhalma\nappley\ncuadrilla\nidiocracy\nluban\ncentura\nrecollecting\nweizenbaum\nthrapston\nssgn\nshengelia\nliuzzo\ntowey\npoleg\nmudeford\nbakaj\nchandor\ngyalwang\nlrd\nudeid\nsortilèges\ntavárez\nnordquist\nanv\nobligor\ngerundive\ngangloff\nmilvian\nmoler\nkizomba\nauditore\nmultiprotocol\nklans\nmühe\ncronkhite\nmeningiomas\ncfpa\ncachi\ncoloradas\nunredeemed\ncunniff\nkdv\nberhanu\nrillington\nshorthaired\nhalsell\nsârbu\ncapulets\nisobar\nflameout\nporajmos\ncastigating\nautocrats\nzabbaleen\nvillosum\nruddiman\narthaud\nlactobacilli\namaliada\nsanan\nhomar\nwkrn\nsatsu\nbadmouthing\nalshon\nstraiton\nshanel\njolles\nsubaqueous\ndiekmann\nrooij\nmitrovic\npalmier\ncrépin\nteucrium\nmitred\nbancorporation\nmaslak\nnzx\nmiseria\nlasson\nanindya\nvolek\ntadley\ncongres\nflipnote\nalpinists\nphog\ncreak\ncrichel\ncoonamble\nkevins\ncastlederg\nbellowhead\nreproof\nwalpurgisnacht\nantivirals\ntylden\nmagilton\ngainfully\nampney\npeeked\nservility\ntoppan\nsparano\nruminating\nneua\noverdetermined\nlimulus\nsteersman\nslaters\nlatters\ndebabrata\nkopacz\narcuri\nmovingly\nsaltwell\nmontandon\nplantsman\nfiats\nwhoopie\ndanette\nkunsthal\nmilanés\nbragin\nsenario\nloor\ncryptorchidism\nzareh\nchesshyre\nlaurium\nterrien\nsobrino\nexsist\naglionby\nenevoldsen\nstohl\nrousso\ncyrill\njakobi\nkuramochi\nstratocasters\nconnexin\nirrefutably\nmigori\niesa\nhimmelfarb\ngohmert\nvdo\nhochul\ngidman\nasthenia\nhais\nolegario\nyent
ob\nnemmers\nlubick\nkobori\naracely\npensieri\nstanimir\nwirthlin\nkumarakom\nmccullen\nbuntin\naberlour\nkelabit\nlorbek\nfaena\nboorer\nphas\nelwick\noshie\ncullimore\nxiaomin\nomnisport\npremies\nwirtanen\nchislett\nmouza\nlarrie\nisy\nsiemer\nleunig\nhyperkinetic\ntakoyaki\nsquirming\nlitwin\njobcentre\ndorridge\nrobinett\ntompkinson\nnitrobenzene\nneocatechumenal\ngodbold\nmartic\ngetto\ndkc\nexcelent\nmaruo\nnamikawa\ntausig\ndallimore\ngrinde\nplanarian\ngeovanni\nftaa\nslateford\nrambin\nprai\nmoyola\nshuka\nsafehouses\ngigabits\nfgcu\ndrylands\nforints\nhundredfold\nsaheed\ngerontologist\ngolombek\naulin\nespuelas\nvanderpump\nkaroon\nables\nbutterfingers\nwhith\npyrolytic\nhaayin\nmagomayev\nmagdelene\neasiness\npurwakarta\nweiguo\nnoticably\ntibbles\nnyingchi\nhinzman\nwebiste\nmooncakes\nparrs\nthyrotropin\nmemantine\ngenc\nradko\nmeindl\nscrimmages\ndivisi\nroehl\nseabright\ngeorgica\npostive\ntriac\nkardos\nstannington\nmeltham\naardwolf\njaanus\nshrimpers\ntugay\nhousebound\nsotero\nvanderbeek\ngaramendi\nsaric\nforayed\nsupervisorial\ncacc\nriddling\ngallager\nheadcorn\ncollagenase\nfirebrick\nrocketplane\npennon\nmekon\nshamsur\nsanaullah\nhaikus\npremia\nakinyele\ndeconstructionist\ncouderc\nantifeminist\nyohn\niracing\nconscripting\nbuiten\ncarriker\nfruin\nshapinsay\nhenryetta\nhiatuses\nhöch\nbuza\ndellal\ncogger\nheale\nkieta\nalliluyeva\ntecuci\nhelders\nwiklund\nstandoffish\nwindiest\nterrasson\nrosenior\nsudhanshu\nzigman\ndslam\nenco\nnvg\ntsum\nwoodpile\nplonk\njingyu\ndirges\nseeder\ngrings\nwillers\npampulha\nquirinius\nbabacan\nrealis\npallikal\nhindraf\ncdti\nyongping\nseipel\nbilgin\nslamball\nbrelade\ncavalla\nzaca\nchinaski\npraagh\nfuzed\nthetans\nbairn\nxiannian\nwearhouse\nsubmarino\ndystrophies\nkulwant\ncial\nhulten\nchernyakhovsky\nskall\nmacbean\nsegan\nnathusius\nknifepoint\nsuperferry\nbjorkman\nalinea\nhosh\ngamkrelidze\nmawddach\nstickles\nscourging\nlangille\nkpe\nuniqa\npridgen\nnottawasaga\nbarbells\nwerkmeister\nbuff
etaut\nstarves\nmôr\nshahbandar\nsudetic\nhormonally\ntookey\nnazan\negoists\nteske\nspains\nkesten\njehuda\nstarmedia\ncoyly\nshua\ncerana\nmendacity\ndaiquiri\nmikva\npictorially\nmemristor\ntricker\nodontology\nbacklist\ndaube\nrongai\nvestroia\nunavailing\nelrick\naddressees\nhinnen\nmultitudinous\nghostley\nwaterworth\npitton\nmiltos\nlegler\ndubner\nmalenchenko\ntrefoils\nnosedive\nvels\nreachout\ncazale\nsummitted\nafip\ntietjen\nhandfield\npatchily\npropellerhead\ngilts\narboreum\nlauth\ntappy\nmissie\ncrispino\nhelheim\nnelmes\ntippah\nvanniyar\nfeldon\nscreencasting\nbijlmer\nmanningtree\narria\nmarich\nholzwarth\nschlender\npetrik\nbanlieues\nseigel\nmetallurgic\ndustman\nporsha\nkmov\nozick\ngayda\nwijngaarden\nfrights\naceros\nmorozevich\nabdulah\nenchilada\nkallie\nhuntingtin\noliseh\njollies\njegan\nlindenwold\nbelligerency\nbouchon\nquadricycle\ngentlest\nstentorian\ntavon\nrecalculate\nreckoner\nstratfordians\nhellhounds\ntelegu\ndeeg\nambalangoda\nliburd\nvlasenko\ngogolak\nunwinds\ngeoffery\nfarsight\nheen\njulich\nrewinds\nsoftwoods\nwretchedness\nprintworks\nursini\nwoldingham\ntoxie\nlidle\nstilson\nodendaal\nchis\nimos\nsatchwell\ndisincentives\ngorelick\nwallman\nmckinnie\nmisericordiae\ntecno\nkickham\nashwani\nsharifa\npetralia\nkuhr\nkeenlyside\nanastos\naformentioned\nlinnhe\nshawne\nparasailing\narmengol\nderec\npapps\nár\nrépétiteur\nsafri\ndavion\nbeder\ndurgin\nglatorian\nlomaiviti\npatchway\nhamberg\nfaustini\norok\nkapono\nduhan\nfranglen\nbarelli\ncelata\nsummerskill\nrachida\nstereolithography\nverifone\nsuona\nconstanzo\nhomeplug\nhazelden\nevanovich\ncrackin\nhipwell\nspherules\ntrammps\npiersall\nleonardos\nmargiela\nkarttunen\nrodallega\nhavanna\nzamor\ndistil\ndarrius\nmentees\ntostitos\ncutlasses\nleimbach\ndrf\nsloatsburg\nscorelines\nscotoma\nruhrgebiet\nvassil\ntoplessness\nblofield\nallouez\nletham\ndinorwic\nsaphira\nfulgence\nemburey\nthornber\nagaves\noccassional\ngoehring\nunsan\nbalado\nknoe\nbeiersdorf\nperabo\nbro
ths\nrixton\nhilliers\nledwith\ndizygotic\nhangeul\ncontextualised\nwiechert\ntelesis\nspicher\nkolis\nmendizabal\nremanufacturing\nbensenville\nhippopotami\ndudleys\nsturman\ngaleazzi\nnawfal\ngrumbled\nneocortical\ntinman\ncheongsam\nbyr\nbehaviorists\nmoonlite\narlésienne\ncerion\nhorrifies\nkagiso\nalver\nmcgivney\nmaseko\nworlock\nchimamanda\nderinger\nbadrul\natiba\nquencher\nlimuru\nwearables\nangelman\nsouthbourne\ndawsonville\ntomich\nharperone\nsabreliner\nidemitsu\nscholte\npâte\nkonkin\nstache\ngasthaus\ncobus\nusefullness\nviscerally\nsreedhar\nbachelder\nsahajanand\nluneburg\nattali\nbarbin\norganochlorine\ncagoule\nfiorito\npitz\nsprog\ndedecker\nellijay\ntschichold\nhollman\ndelalande\ndinnie\nhaynsworth\ncointreau\nsutera\njumilla\nlonborg\nglowingly\nelkann\nemollient\nrogov\nspaceborne\nringle\nebene\ntadworth\ngolin\npasodoble\nuchitel\nhertlein\ntechfest\ncyanogenic\nreemerging\nhogtown\nkhejuri\nsenbei\ndpn\nnanocomposites\nexplorative\nvostro\nfraticelli\nboyi\nmatsutake\ncarbapenem\nhertie\nfrancofolies\nbandsmen\nprehaps\nderonda\nsyt\nyayi\ntramuntana\nstoychev\ndilhorne\nhereabouts\nlappalainen\npalatability\npercolator\narpeggione\nrubha\nmdh\nmeret\nuscc\njaglom\nmalodorous\nchambray\nyeap\nchromatograph\nleisha\ncovenantal\nmishearing\ncalmes\ncallicoon\nauli\nclubber\nturiaf\nystwyth\ncharnin\nnondisclosure\ndanticat\nridgelines\nperuzzo\nsteinhart\nblek\nhorrorland\nchettri\nparastatals\nvertes\nhailemariam\nkayunga\nmendl\nscratcher\nisador\nboxsets\ncahuachi\nsouthview\nbarbaresco\nnonindigenous\nbhakra\npavy\nsabba\nbazile\nsparkford\nnmg\nringback\nkarie\nhoiles\nacquiescing\nsleepyhead\nversi\nfereydoon\nabendblatt\nhyperextension\nesposizioni\nqueequeg\nfurney\nfomc\nnonclassical\ncamuy\ndhakal\ndwele\ndabi\ncondat\nelvet\nberganza\nnimby\nenstrom\ngaldikas\nnikonov\ncauterization\nbbsrc\nyuni\nsarabjit\ndundry\nsatchell\nlongland\npuleo\npistilli\nkisen\ndivison\nastarita\ncentralizes\npatwari\nisetta\naquiline\naraucania\nweid
inger\npinga\ninsource\nbocchino\nenterocolitis\nbisse\nratso\ndingles\nmarjo\numtali\nmljet\nefr\nrobusto\nigb\nlydden\nborujerdi\nmulvane\njaster\nintones\nsofteners\nrationalizes\ndumbwaiter\nlehren\npervasively\nacocella\nhfl\nsavoring\ntijeras\nnecator\njabala\nvilhjalmur\ndestabilisation\nnhler\nfoccart\ndipeptide\ninitialed\ncuchillo\nkarasawa\nnkandla\neske\nmultijet\nhissed\nehi\nbelke\npassin\nboondoggle\nzengin\njeta\ndoyenne\ntransaminases\nsunburned\nvlg\ncaudle\njaydee\nfumie\ncypria\nstingley\ntimucuan\nsomerleyton\naverts\nnaude\nstaving\nkrewes\nyetta\nseigi\nkarimun\nyingli\nforeshock\ncbms\nbörner\nflimsiest\npoena\nweeze\nbellecour\njerichow\naames\nmazan\ntrainload\nbitlocker\nokeanos\nsouthill\ncommunale\njila\nexpropriating\nmifid\nanytown\ntexada\nbattie\ntreisman\nzacharek\nshamma\nwollesen\ncosham\ntorticollis\nwrathall\nargov\nscheinman\ncutmore\nbiomonitoring\nhardik\nkiddle\nthisara\nallopurinol\nalldis\ncardale\nwidecombe\ndelport\nmohel\nkasch\nabjure\ntbv\nspawners\ncontis\nframboise\nwujek\nglemham\norkneys\nyuge\nseke\njollie\npaxil\nbarcombe\nspeir\nbloomingdales\ncovo\nmadhvani\nraan\nfiumara\nboxborough\nkizil\ntierce\nrhinovirus\nheiki\nneblett\ncatsuits\ndrw\nwestmalle\ndysgraphia\nbrinkworth\nrybczynski\ngoddam\nhassane\nkudasai\ndungloe\nmicrobus\noutreaches\nnonliving\nanouar\nsakamaki\ntgp\ndisalvo\nsneered\nomfg\nbharucha\nbeah\ninstillation\nhavelis\njks\nlintas\nmummer\nshabwah\nkirt\nlunardi\nfujima\ncauquenes\nmonomania\nacsm\nchism\nwns\ngoudge\nnaiqama\ntheun\nnappanee\nbalbuena\ndré\nnoynoy\ngamp\ndemobilize\nucluelet\nfrontmen\nratterman\nseegers\nliebes\nklapper\ncoppelia\nfleeces\nneurotics\ncedeno\nwangyal\nkramarenko\ngarreg\nbgv\npetkim\noutwork\nkaberle\ncarolis\nzimba\npowrie\nglanmire\ngnjilane\nbavaro\nliet\namadiya\ngodsell\nstavrou\ngehlot\nretweeted\nxle\nbrownout\nirritatingly\ncartlidge\nrutz\njamarcus\nlofar\ngravediggaz\nktf\nsolicitude\nlehmbruck\ncurfman\nbatya\nsuhaib\nunifi\nmoszkowski\nibnu\nan
tiphospholipid\nmcnall\nmalefactors\ncushy\nvye\ncoffy\nvzw\nrtls\nmanufacturability\nraheen\nfrombork\nnepeta\ndelizia\ndrimnagh\npicoseconds\ntripodi\nkihei\ngaetani\ntuh\nransomes\nccgt\ndymoke\nmachell\netoiles\nnikitas\nhelminen\ndemarcations\nsoothed\nprimroses\nrivesaltes\nvison\nperman\nokolo\nhazael\nreadman\nicehockey\nhettiarachchi\nsulejmani\nunschooled\nthinnes\ntosches\nacdc\ngreka\nnaughtiness\nbrodin\ngerges\nmosteller\nsubsidization\nescalations\npennal\nmeye\nknigge\nbursey\nzahida\nlinklaters\nmontfermeil\nmarketeers\ntaieb\nmartialis\nmacfarquhar\ngeoduck\nmaryellen\nnuestras\nbrowbeating\nivanovski\nmejlis\nmactier\nallal\nwhut\nzaiko\nfairbourne\nmauvaise\nmanase\nrepairers\ngeishas\necdc\nduco\ntzigane\nafsana\nanello\nwinegar\nidealize\nandis\nveco\ngemalto\nabrasiveness\nfaires\nsingal\npavoni\nvoalavo\nrecaptcha\ntattenham\nupholders\nberriedale\ndeconstructivism\nzambians\npathmanathan\njingyi\nchilson\nunsg\ncleal\ngottfrid\ntorcello\nscovel\nstoping\nclasson\noakengates\ntimberg\nrudisha\npizzorno\ndoomwatch\nclippard\nzoghbi\nfootway\nrazov\nbondsmen\nslidin\nflitting\nghuman\nvons\nmollah\nkirschenbaum\naffluents\ncrippler\nasanuma\nmanichean\nrohinton\nbremanger\nlívia\nucits\nlabarre\nwamsley\nboobie\nnipah\nazzaro\nquintillion\nnudd\nheav\nseery\nkrekar\ngrizel\nshunryu\ndreifuss\nmockridge\nspaceports\nryba\nstoped\nmaluma\nealey\nmedd\nsatisfactions\nshandaken\nmarram\ncommotions\nimmelt\nrukai\ngoogoosh\nrgp\nisec\npumpernickel\ndallow\nalj\nshigematsu\nnarcis\nfootedness\nraghunandan\nluxuria\nezy\ntressler\nyasnaya\nwillens\nhoepner\njianfeng\nridolfo\nkiski\nwyncote\ntopolánek\nkyc\nbroadgreen\nintraday\ngollop\nsriramulu\ncotgrave\nobeng\nhamani\nprevia\nvrooman\ntregony\nlitte\ntsakhiagiin\npaksas\ngummed\nbacktracks\ndujuan\nproenza\nintepretation\nmusti\ndiler\nheliopause\npreindustrial\nplaiting\nsugaya\nishfaq\ndiesendorf\ncalli\nintersperse\nlindiwe\ncarreiro\nschoolchild\nvejvoda\namont\njabi\nreyn\nwarsh\nshamshir\ncl
eanness\nbbh\nrecondite\nanastasis\nboskovic\ncfra\nstudt\nsoupe\nebensee\nxinxin\nutri\ncavalese\nfavola\ndravs\nagag\nwhittell\nmaati\nhusam\nhypnotists\ntoonerville\nokur\nparanoiac\ndenfeld\njpod\nfardeen\nlenn\nartifex\nhaes\nvivianne\npratts\nbarbès\nchelli\nroumain\nnatanson\nsoumah\nlatacunga\nguerry\nmecc\nbellon\nimpastato\nrendova\npackie\nkostic\npbg\nheren\ntucanae\nedell\nkaryotyping\ncannady\ncict\nmhatre\ncondi\ndrawstring\nchudnovsky\nscroobius\nwindsurf\nwrangles\nfurriers\nsunexpress\narnout\nawst\nshortlived\nabderrahim\nculin\nforsell\npratti\nvideoconference\nwimbush\nfictionalization\nsangharsh\nsambu\nsavane\nunidiomatic\naugenblick\nsaltires\njym\nmugambi\ntrofim\noverwintered\nonramp\niasc\ngametime\neuphronios\nsokak\nnnenna\nhorcruxes\nkarlsplatz\ntyce\nshatz\nkilmuir\ndewit\nsorceresses\naics\nsakia\nchagrined\nborko\npennard\nautocourse\ndiltz\naqib\ntoepfer\noppositionists\nofek\nreified\nhalkirk\npebworth\nraczkowski\ndikeman\nyukiya\nabsented\ndixey\nfayet\nkampa\nvaitupu\nmasaba\nyellowbeard\nellenton\northodoxies\nsonae\nhockenberry\nsefolosha\nenderle\nirradiate\nuninfluenced\ninterlayer\ngoldwag\nnasally\nrozman\nmcgahern\nvanderbilts\nduranti\nmccrery\nsignorile\nrogg\nfearghal\nstehekin\nblunk\nappealable\nderussy\nfloorboard\nmoxham\nroadhouses\nciac\nbouillabaisse\ndropsonde\nciat\nrebalanced\nleana\npriveleges\nbardez\ncozier\nmontet\nkatongo\nsautter\nbegetting\nmovieland\npettingill\nnetherwood\nguajiro\nalbertoni\nmirrorball\nsociaux\nshinwell\npaliwal\nincompetently\npilli\nlarrieux\nmimran\nescs\nklipsch\nwopat\nkitaoka\nmcsheffrey\nsteeltown\ninterworking\nbrilliants\ngardy\ndefilippis\ngcos\nclapperboard\nglaum\nrybicki\nmunificent\nfni\nirakere\nwirths\ncoronati\nmccroskey\npataky\nelvire\nneedlefish\npanko\nbiglia\nnataly\nsojka\nkirinyaga\nabridging\nicey\nnevern\nkalinago\ntillerman\npassbook\nyingzhou\njodelle\nplaskitt\ngianelli\nkiet\ndonator\nanthos\ngooglemaps\nmilstead\njerónimos\nstaker\ncabranes\nmatschie\
nlikhachev\ncoypu\nröhrl\nprettiness\nballen\nproconsuls\nzanoni\nraffel\nbesso\nslamannan\npjsc\ncamkii\nkret\nzol\nmitchelson\nreinstallation\nintercountry\ngarling\nheathman\ndamin\nhorstman\nnorgren\nktxa\nuclan\nsharf\nnamp\nvist\ngarlasco\nentreats\nattari\nnabs\nzeisel\ndrecker\nsixways\npositiva\nvictimes\nroundway\ndrek\ngionta\nshans\nkealy\nyati\ndidim\nbrancusi\nfischbacher\npasturing\nporche\nnapes\nmascaras\nwekesa\nwohlgemuth\naout\nhessilhead\nwillerby\ndelek\ngoswell\nentwine\nlonghaired\nmeanderings\nmadaripur\nunsociable\nkatsuhito\nreffered\nbilheimer\ndomesticating\nburkill\nakhalkalaki\nchuseok\nasiavision\nyane\nlaboe\nbenezit\nkemco\nostiense\nseafish\nbalda\nzhiyong\nreihana\nmuradov\nengles\nnisman\nkopje\npreponderant\nsimular\nwesterby\nhotten\nreanalyzed\nkummel\npalaeoecology\nloek\ncrookers\narchimede\ncsrc\ndebilitation\nradionavigation\nsoldierly\nsamih\nploughshare\nhalphen\nlurssen\nbaldwyn\nhotaling\nstreetdance\neeas\nfrizzled\ntuteja\nsape\nsweitzer\ndelaine\nmoravcsik\nseafair\nskulking\nklr\nuncompromised\nherle\ndelone\nturnbow\nbillin\nslesinger\nbloons\nmorrisseau\nloadout\nkibbeh\ncalmac\nzuhri\nardoch\nrevelled\nshvut\ngarderobe\nexcelencia\nquadrimaculatus\nunmee\nreynoldsburg\ndeanda\nwkn\nustc\ndiminishment\nbalshaw\nbreger\nbarbecuing\nomelettes\naugers\nestacada\nseafires\nfistfights\nkurnaz\nbenett\nklindworth\nwair\nchagossians\nkosloff\nueg\nkhatron\njiguang\n,so\nhaseley\nhummocky\nrtaf\nstrathkelvin\nwkbn\nsyers\nravenstone\nfoschi\ncarcharodon\nmichelia\nwhippings\nkday\nskot\nruffe\nstaysail\ntzion\naota\nlesniak\nfudging\nsemiprofessional\nlyria\ndgps\ndihydroxyacetone\nlevie\nperlstein\nqawi\nastiz\nflamm\naubier\nronna\nschwede\nfreeburg\nmantoux\nbramantyo\nmaleness\nacle\nairland\ndissolutions\nwildsmith\ncarnevali\nbeanbag\ndelgrosso\ncandan\ntejaswini\nlouk\nplec\nmapinfo\nkurdistani\ncmes\ndescente\ncécilia\nfelicien\ntechnologic\neuropeanists\nnepalgunj\nklopfenstein\naesch\nsonin\nunknot\nauel\nwhinn
ey\npeacockery\nneuroendocrinology\nexpedients\npterygium\nbulrushes\nxga\ngodfree\ndongpo\nnorv\nsgouros\npaulins\ninadmissibility\nsteenberg\npalazuelos\njank\namathus\ncopywriters\nmaartje\nbushwackers\ningleborough\nbrusa\nkarpen\nvonage\nmaspalomas\nbergers\nauctioneering\nllanharan\nencases\nmcraven\nfaherty\npyrokinetic\ndigitise\nernsting\nmyeloproliferative\ncajoling\nshakar\njlo\ndiopters\nfero\nmonosyllables\ncolaco\nfreedberg\npenknife\nwwlp\nmontois\nkalmunai\nhathway\neireann\nreinisch\ndistractors\ndellavedova\nscip\nrtnda\nausubel\nrached\nliddon\nmallery\njenaro\nkendallville\nmeehl\nreeking\negorova\nraffaelli\nlagrone\nguiliano\npignut\nallofs\nmarzouq\nwestwego\nrastenburg\ngosney\nhichem\nkhq\nschapira\njuca\nlohara\nsyam\nflitch\nseatruck\nmrls\nantiquarks\nschwarzmann\nhexagrams\nhucksters\nkhc\nheavitree\ncritchfield\nruxpin\nhudood\nautotune\nscotstown\nhafren\nbillable\nmousley\ncheil\nlopa\nperella\ndondo\nforcings\npankin\nfurtively\nlaths\nwhoosh\nsimcock\nsughra\nwetherspoons\nanemias\neveland\nudrp\nhoustonians\nyarden\nhermanns\naukland\nspacebar\nproscribing\nzango\ncomportment\nmenstrie\nnucleobases\nsubeditor\ncrownsville\nhurvitz\nstanchfield\nsensorium\nclaman\naccommodative\nroobarb\ngodi\nbecquerels\ntaumalolo\nzinsser\nruminate\nsherfield\nfelip\ntopinka\npostprandial\nneoguri\npahad\ndruon\nlonigan\nguantanamera\nheiskanen\ncleere\nsumio\ngovernement\nstaf\ncruciani\nhornik\nklapa\nayora\nairboat\ntodds\nnesquik\nkhodynka\nllantarnam\nfougasse\ndoleful\nphlebitis\nokoh\nkozinn\ntigon\ndubonnet\ntriumphalism\nwegerle\nsternhagen\nneurodiversity\nwates\nphotosmart\ninfn\nsqueaked\nvexations\nraichel\nbranchburg\nmidsayap\nplese\nsongaila\nbalderstone\nmewn\ningpen\nberneray\nmarnier\nderas\nclampitt\nparanjpe\nmorosi\ngossen\nbreault\nmultitask\niaapa\ninformatization\nprogs\nningaloo\ntacticians\npillon\nschiavelli\nwongso\ngingers\ndidar\ncndp\nulaan\nquesiton\noverindulgence\ntalisker\nfuda\ndiscographical\nhauritz\npetherbr
idge\ngrillz\nirit\ntwg\nscalawags\nlebas\nfarshid\nluciferian\nwcr\ncrimeans\nbetamethasone\npolyolefin\nclairol\nfrancos\nventilatory\nbuxhoeveden\nholmesburg\nchenghua\ntroche\nwoodling\nhoushang\nadai\nalejandrino\nnessim\nporchester\nnovem\nbraud\nmeeropol\ndorigo\nvandervelde\nsinghs\nbecerril\nhauksson\nkuter\nbesim\nmacronutrients\noverclock\noración\nruga\nkullen\nasola\nmastrantonio\nburghoff\ncaucused\nwachsmann\nflsa\ncarrolls\nfeshbach\netape\nhopcroft\nflophouse\nwestergren\nperfumers\nfreeskiing\nunshaken\norianthi\nazis\ncolmer\nvolsky\nasianweek\ncoleus\nwernbloom\ngunzburg\nerrigal\nualbany\npushto\nestell\nderwood\ngrassby\ntimescape\nspani\nridzuan\nunratified\nbernoldi\nslawson\nunderstudying\nrld\ngosselaar\ngallwey\nbikeways\ntejpal\nkoegel\nnegrita\nspag\nchrysotile\nlucetta\nzanclean\ndatz\nmcgettigan\njurevicius\notn\npolicastro\npettyfer\nothmer\nhongbin\npelphrey\nlulav\nvilella\ncuckmere\nnierenberg\ndelimits\nfreital\nmystify\nfairholme\nzelnik\ndecelerates\nswamis\ndiran\ncastigate\nmerkerson\nbulis\nelswhere\nmazzeo\namru\newins\nsearsport\ntastemakers\npetare\nmargrete\nfowkes\njiaming\nequivilent\nmckeldin\nhuseby\nvasilyeva\nwiedmann\nrestyle\nanorthosite\nrangy\nnanog\ngroover\nbantus\nqra\ncshl\ngrymes\njomar\ncordel\nkolka\nmurthi\nphthisis\ngunwale\nwonderlic\nbionda\naraminta\nfoaling\nawada\nstrainers\npaish\nsles\njoannou\nhanbin\njunkman\ntercel\nuley\nneals\nbizzell\nlardy\nillumined\nwhifflet\nweenies\nscifo\nleipold\nnarnians\nworkwear\npaetsch\ncarris\npersonnels\ngunji\nsouphanouvong\nnetroots\nhectolitres\nmerga\nkyrylo\ncorsie\nfesenko\ncuber\nsüskind\nwhited\nbeanies\nchriston\nsnarkiness\nmypyramid\nmousepad\nmalartic\nizabel\nsilvretta\nhalyna\npottenger\nlandale\nleonsis\nacclaims\nserino\ndarl\nskar\nstandardly\neya\ncfw\niglehart\ntaylormade\ncantigny\nzajonc\ninamori\npézenas\nlokman\nattn\nmarioni\npayami\naliye\nsauers\nosagie\noxburgh\nepicureans\nrocketboom\nkameoka\nportel\ntakashimaya\ntoyopet\ngoertzen\
nyariv\nschaech\nyojiro\nccleaner\nhaylett\nmechatronic\njohanssen\nkononov\nercp\nwesendonck\nbrokop\nmcad\nbotswanan\nterk\nshudders\nchoto\ndissembling\nburghead\nwasti\nyeldham\nakinfenwa\nvalensi\nlingvo\nizvestiya\nwarung\nstanisic\ndumai\neharmony\nunmetered\ngeorgiades\nmemmi\nhowled\nmankin\nmerchantability\nfreeney\nanonymizer\nnamus\nburps\neconomique\njadeed\nsthree\nlucayan\nfactchecker\nsuperstate\nshante\nforelock\njaheim\nathanasiou\nbsee\nezzard\ntamta\namruta\nbookworld\nplaypen\ndharmaraj\nmetabolise\nrezo\nvotorantim\nnolle\nchoosen\nbunion\nfpmt\njenji\nwaterslides\nuncomplimentary\nlwp\nsikdar\nzic\nparkhotel\nguayanilla\nsokcho\noah\nondimba\ndilator\nlgl\ntaymiyya\nsignficance\nkiskunhalas\negotist\nararipe\nnni\nvespri\nishino\nbarny\ntomohito\ntjc\nvelzen\ntransects\ngastronomical\ngameworld\nprofligacy\nrichwoods\neckley\nlendale\ngeochemists\nwalsworth\nkuryakin\nbeerbaum\neisel\nskewering\nwaldie\nboatloads\nessm\nstatut\nwery\ngerbrand\nkiwami\ngiff\ndecipherable\ngastropub\nhesseman\nunexpressed\njucker\ntejero\ndetente\nworkprint\nhaughley\ncanalis\nnaem\nmccalmont\noakington\nlifeways\nrenauld\ndibella\neasson\nincomers\nitsm\njacs\nkaprekar\nnwfa\nlooter\noakden\nsuppiah\nintersperses\nfoxit\necotopia\nterrifyingly\ndoubloons\nrecognizer\nglasspool\ntapo\nféret\ndavidsons\nparanasal\ninternetwork\nroulin\nagritourism\nmidford\npandav\nsodomite\nwurth\nvanel\ngyda\nsachdeva\naltesse\nfumetti\nkandiah\nszwed\nwaldenbooks\nlambdin\nitvs\npilchards\nneurokinin\nminnich\nmultiphasic\nsolipsist\nbmu\nidealizing\neilidh\nvaselines\ngrealish\nbelinelli\nashurnasirpal\nahb\ncarcieri\nlybrand\nheaslip\nsigg\nbottlings\nsawin\nsliwa\ngladding\nasashoryu\nenumerator\norquera\nsemenova\ntaitt\ngiedrius\nwaha\nsneakin\nbhagwandas\ncange\ngardini\nviscid\nlepic\njehad\nfrognal\nredsox\nclastres\nsalming\nlaoshan\npolychromy\nburca\nvugar\ntingay\namaurosis\ndeshields\nsteindl\nestherville\nunfermented\ndesplechin\ngearoid\nxds\nmangusta\nabdillah\
nyellower\ndefrayed\nvandeveer\nwagman\nendura\npickoff\nestriol\ndsch\nwormleighton\nstepanenko\nsharnbrook\ncadenet\nderana\nkhane\nchovevei\nneedlegrass\npadwa\nghorbani\nchukwuemeka\nbackchannel\nraunch\nmateship\nburaidah\ncasamayor\nshalikashvili\nkainate\nastrolabes\npreiser\nabat\nsagged\ncygan\ntavarez\ndiagrammed\nperillo\nremissions\nbradly\nseedbed\nlesly\ncanfora\nbaqer\npetróleo\nbacar\nkilberry\nmildness\nrealestate\ncoupole\ngaddy\ngmh\nancillaries\nteyana\nsile\nspiaggia\nkwanten\ndourados\nsynesthetic\nwunderbar\nfeick\nbaratz\nflosse\nmusandam\nnyhavn\nloreta\noverprotected\nconto\nastwood\nfruto\nsölden\ntricorne\nhaass\nschusterman\nkeigwin\nkhawr\ncodifications\nkapali\nunfulfilling\nogu\nhalbrook\nsamiullah\nyabby\nheshan\nlederhosen\ntehillim\nsipi\ncharner\nfesa\nfromer\nbongbong\nstefanki\nsouthwesternmost\nmizelle\ncommisso\nonrushing\nsejima\npibor\nimmured\nbussie\nkhosravi\nswatted\nhatikvah\npanpipes\nmelonie\nseraj\nulo\npelkey\nverbiest\nmccombie\ningibjörg\nvecchione\nwagamama\nraxworthy\nrussin\ntansman\ncandis\naynak\nbovino\nscalzo\nmenter\nguttridge\nimmunosorbent\ngallowgate\nrrv\ngialle\nchazelle\nbroadnax\nkounellis\nairfreight\nxiaoling\npalimony\nfynes\ngorokhova\njiggers\nmicromanage\ncnor\nkarmichael\naapt\nquisqueya\nsolicitous\nnamita\npachacamac\nneuquen\npessimists\ntalend\ngreengrocers\nsaryu\nuniao\nhendershot\nbarbon\nmarkan\nsawtry\nbahaa\nyamcha\ngivry\nunaipon\ninflammations\nstabber\nyubin\nprovos\nintercommunication\naugé\ndecio\nhenrion\naaryn\nguangfu\nsagada\nmiral\nleguizamón\nlegitimising\ncrysler\ncourtemanche\nhennesy\ncockington\nleibman\nconventus\njodhi\nsheeps\negidijus\nprivett\nchicksands\nhudal\nrinke\nfinestra\nborun\naiping\ncesarewitch\nrainout\nfaïence\njagielski\nmoisturizer\ntrombetta\ngauzy\nholmewood\nwabush\nweerasethakul\nmilholland\nsandall\ngargle\nskyla\noverthinking\njeanson\nanglocentric\nochocinco\npaperweights\nkhammouane\nsubmissiveness\nwestbeth\nmarkeaton\nnewroz\nwmgm\nwymark
\ndriftin\nstoren\nsectorial\nshecky\nkittles\nredrafting\ndalitz\nairglow\ngroupwise\nrotonde\nhellfighters\nkinka\ngurmit\nkaser\nmondulkiri\nandrieux\nathaliah\nclatworthy\npepes\npersonnal\nconviviality\nvizcarrondo\njoeri\nmakharadze\npipino\nwelly\nbovril\nsletten\nkrylenko\nstudiolo\nmatis\nodem\naugustina\nalario\nfelicidade\nharned\nwildt\ncrisscrossing\nfibrinolysis\nlamplight\nennals\nbuczkowski\ncarpentersville\nupe\nblavatnik\nprospectuses\nquested\nfissions\nmuraki\nbundibugyo\nwiederkehr\nscourfield\ndevastations\njolicoeur\nkardon\nunvarying\ngrl\nexfat\nunwatchable\nalshammar\nsnarks\ncliffsnotes\narngrim\ndavtyan\nzurzach\nwrightington\nboice\nkostopoulos\nostroff\nmanezh\nmulticasting\nocotal\nziemer\nronee\nperianal\ntegid\nkumgang\npalena\nfischetti\nijm\nwbi\njsb\nspenders\nmartyrdoms\nimplacably\nunbelieving\noptio\nejaculating\nsoacha\nlawanda\nguiro\nclothworkers\nhorehound\nbeaubourg\nqattan\npsyllid\nmcgrain\nwittes\nmaniaci\njailbroken\ncegielski\nrtca\nalphege\nmoonlighted\nvanves\nleinonen\ntzipora\navaaz\nmotijheel\nroura\nferrofluid\ngeorgianna\nanthroposophic\nkunka\nnoiseworks\nphotorefractive\netm\ntypifying\ndesir\ntiburtina\nslaved\nglaslyn\nsiyi\npassel\nbrailey\nprosector\nelgie\nmcpartlin\nsportsbook\nghazl\nkellow\npatiya\nmispronounce\nwrangham\nfarenthold\ncohesively\nduping\nhanshaw\nrrl\ninish\nmaufe\npanya\npolus\nprofiteer\nwhinge\nalmen\nwoodfill\nclugston\ninhibin\nruchill\nheese\nassem\nchaplaincies\nfrazzled\nhensler\nmemorising\nbeens\ngalizia\ngruver\ncuttle\nforthrightly\ntyack\nndfb\nneubiberg\nkaptain\ncsny\noscarsson\naristizábal\ndoofus\ncosmides\nbomblet\nmicaiah\nhammami\nrahmati\nwohlfahrt\nrickrolling\nwickett\nsparkled\nzadig\nlombardini\ndenbeaux\njordanville\nbeauteous\ntishreen\ndestabilizes\nkuchen\nrossiyskaya\nmitel\nliberton\nsranan\nspintronics\ngarlanded\nbernero\nsellick\njiwan\nagaisnt\ncelant\nrayamajhi\nameriquest\nkolles\ningvarsson\ntrenta\ntouray\ncoulby\nmurieta\nasafoetida\ndurling\nstr
athy\nservizi\nwyville\nmoxibustion\nroundy\nplayschool\nedgecomb\nsomersworth\naltissimo\nziana\nmenteur\nqilian\nalao\ntopographies\nmankiller\nforeclosing\nazarcon\nnordal\nmisfeasance\nuninstaller\nmédica\nzeiler\nameba\nbalang\nwrtv\ndhaba\nsehorn\nludivine\nneighbourly\nwheelmen\nbanesto\nantinuclear\nchiddingfold\nyansheng\nkbjr\ndanceteria\nparlby\nsplendidus\nincroyable\nseverine\nkreiss\ndarrick\nwailua\nskoko\nrovner\ngoodenow\nyodeler\nwujiang\nmrazek\nfreundel\ndangi\nsquiggles\naberlady\ndiabolique\nmcspadden\nmorgues\nharinder\ntoadstools\nvaliante\nunfortunates\ncomac\nicho\ntihany\nprincen\nonigiri\ninfringment\ngratin\ncircumscribing\nphimosis\nmarvellously\nhardeeville\nsubedi\nbohle\nriai\nmallaber\ntobita\nlariviere\nmelan\ncullens\nreliabilty\nbonanni\nevins\nmaneka\nrockley\nendoskeleton\nsumthin\nchargesheet\nflorsheim\ntrembley\nchignik\nkishu\ntariana\njinhae\nktvx\nbenvenisti\ncottonmouths\nimec\nreichstein\nwholemeal\nfavaro\ncierra\nconciergerie\nbakeware\nvasgersian\niln\nsedano\nbluestein\nblitt\ndahshur\nphytoestrogens\ntannersville\nbasden\npterodactyls\named\nsagrera\nharptree\nfirestorms\nbirthrates\ngulnara\nsoie\nsixpack\ndelila\nkarow\nayen\nvereshchagin\ntieghem\nnotifier\nhoatzin\nrahbar\ndurio\nbaisden\nvides\ncastellucci\nmassara\ndidrikson\nicrisat\nhamor\noffner\nrothorn\nzieba\nniram\nreversionary\nsnowplows\nratmansky\nfundi\nkelloggs\ngapyeong\nchriqui\ntasse\nmannis\nshennan\ncozma\nrostova\neeghen\nundershaft\njinggoy\ndestine\nmosaico\nsynaspismos\nsumant\ncolaiste\nkurve\nbhante\nteahouses\nimpertinence\nclottey\nosteoblast\nuhry\navontuur\nclon\nepicatechin\nhagupit\nghaddar\nmürren\nsohel\nough\ngosar\nbenvolio\ndemocratized\nshakespeares\nfloodings\ndeividas\nhelbling\nkmk\nunitar\nscorton\nsabs\nfreilassing\nerstad\nbrancaster\nkeuning\nmolybdenite\nleuser\ngeula\nanythign\ncustodia\nthrilla\nretrospection\ncappetta\nbuki\nstefanini\nails\nvelleman\npelleas\ntoyshop\ntawana\ncleanses\nkouri\njohri\nkdc\njabor\nt
uro\ndragones\ncollaged\ngorai\nbohman\noneasia\noutdid\nkarpovich\nséraphine\nbutterwick\ngodar\nhorchata\nandreani\nphalangists\npoppi\ndruse\nitten\nixus\npade\nnipon\nsalmonellosis\ncuy\nalting\ndivisionism\nwallie\nguinta\nfirmed\njeannet\nbolkestein\ncovenanted\nbeccaloni\ndedan\nberdimuhamedow\nvilbel\nsegueing\ncmcs\nhrubesch\npaysans\nhanwei\nmasimo\ngushan\nkretzmer\nradzi\nkassie\nbacri\nusurer\ndesisto\nshotcrete\nburs\ncherrywood\nmedianoche\ntraber\nchuvalo\nwilhelma\nimacs\nmomtaz\ndaykin\ntrient\narnhold\nmunsan\nfahed\nbooters\nacj\ndanah\npillinger\nkofa\nspandauer\ngjerde\nbrighi\nsemioticians\ngibber\nxserve\ncorrigenda\nmullumbimby\ngho\nhillhurst\njurats\nbergeson\njosi\npompa\nnatto\nmillbay\nsurt\ngeran\nsatriano\nmingay\nfreeplay\ngaman\ngapp\nbresslaw\ncoull\ncrowson\nmcfetridge\nadebowale\npeppas\negomaniac\nmacedonio\nnars\nsonga\nkuss\npoundage\ninorder\nmorrin\ncaythorpe\nshuyang\ninterrogatories\nbronston\noliveras\nceren\nchigumbura\ntappi\ntingo\ntulley\ndjo\ndocility\ncaam\nsevillian\ngeddie\nciudadana\nknead\nmetherell\nrubida\nncsc\nirven\nboardrooms\nhotell\nsalinization\nkinabatangan\nsedo\ntelep\nkalala\nenlarger\ndanoff\nweathercaster\nmaragos\nfrappé\naerotech\ntywi\nglorie\ndharmachakra\nadiemus\nhudsonville\nbobbye\nhellerman\nfpn\nerla\nminchinhampton\ngenny\nsalanter\ntsurumaki\nappelt\nzinzan\nlitvinoff\nconsistencies\nbeauchesne\nobraz\ncarluccio\ncharwoman\ngiacobbe\nvivaro\ncalpain\nsapodilla\nmusonda\nwoakes\ndogstar\narns\nnooijer\nalexio\nanai\nbudiman\nfingerlings\nsatnam\nwexham\ninterchurch\ncmhc\narrack\nbrazell\ncottontails\nblinked\nlaurey\nmottley\nhermsdorf\nqanbar\npixeljunk\ncuby\nfourball\nzareen\nraybon\nrebhorn\nmicrowaving\nmonkshood\ndecentralizing\nbenozzo\nskela\nglaziers\nnussle\nlocoroco\nmolano\nvimmerby\ndietrichson\naquinnah\nguernseys\nnedeli\ncounterpoise\ncannella\nclardy\nwrixon\npropylaea\nblatch\nfrohlich\norfila\nphonetician\nthommy\nburaku\nbornand\npechanga\nherria\naterciopelados\npe
ndelton\npropellors\nashlag\nlixin\nivanenko\ncomancheros\ntschopp\nsolarcity\nkollmann\nhiden\nsynchronising\nnoci\nsupersize\necus\ndemetra\noelwein\ninstrumentations\nproficiencies\ngarrigus\nkyriacou\nclassement\npencoed\ncubanos\ncrosscutting\nlueders\nvellai\ncassata\nshahul\ngrupp\nbergsten\nzecharia\nmetzingen\npalea\nduihua\nmessers\nvicino\nsaliency\nhobert\narmyworm\netec\nmispronouncing\nsacheon\naltix\nsyngnathidae\ndoka\njefferts\ncsat\nleaseback\ngreenlighted\nnikolova\nflorica\npapagayo\nsloyan\njba\nhotton\nfeedlots\nrecalde\nessi\njelks\nevc\nlinspire\nrahayu\nunconsummated\nwarchild\nlellis\nverité\nchisso\nketut\ncalcot\nonslaughts\nsucia\nhaemolymph\nrequirments\nlapdog\nzingg\ntoytown\nroader\nsesostris\nmirtazapine\nguideposts\notterton\nramorum\nnavarin\ndeinococcus\nalnmouth\nwestampton\nqadam\nbonwit\nverducci\nantonopoulos\nopinons\ncontroversal\nthinkable\nhardwork\nfeltman\nkukulkan\ncontraflow\ndalliances\nsportacus\nlabre\nstenographic\nbiotransformation\ncompactpci\nkhaing\ngolshan\npalmatum\nalperin\nrizzio\nkurung\nkdh\nblacula\nsaltford\njedermann\nhasim\ncyberport\noriganum\nnaud\nyvr\nmutliple\nvandergriff\ndesaguadero\nmckennon\nliván\nvocero\nbabysat\nawlad\nmollohan\nhgp\ntiddy\nebbert\ntalegaon\nyiorgos\nredcat\nellenor\ncosmopolitans\nlegitamate\nembarq\npeotone\nmalalai\nposthuma\nkajiyama\nmarilynne\nscheuermann\nwaskow\ncontin\ndand\ntraversi\ntalmon\nasmik\nmegaproject\nnevius\ncheves\nstallworthy\ngoheen\npolwart\nelusiveness\nknayth\ndehri\nwakin\nesport\nbibbs\ncallanish\nburkinabe\npanero\ndoogan\nbatterson\nperic\nflub\nltn\ncoarctation\nflindt\ngarmsir\npunit\niachr\nsanket\neinat\nillana\nmerricks\ntouran\nstaithe\nandraé\nmalaysiakini\npredjudice\nsumika\nanwr\nadeang\nbuddhadev\nmattituck\nfluss\nmcclarnon\nfeile\nmatola\ncandon\nmipi\nbabyz\nturfway\nrandomizing\ncalayan\nbathans\nfadia\nmidsouth\nlightkeeper\ncrites\nludens\nfikr\ngegard\ncomand\nlerici\nkrit\nglasco\nvirtex\nlefrançois\ndunwoodie\nkukui\namic
\nmensing\nhfp\nlorella\njabbing\ntrazodone\nastori\nhbg\nkleen\nnamdaemun\nprzybilla\nbentleys\nmichoud\ncleavon\nstirlings\nhahahaha\nweatherization\njoyo\ngoller\nsandwick\ntaja\npyelonephritis\nhlp\nterrestrials\ngurira\nhegle\nfwc\nwobbegong\nolding\nelzinga\neltingville\nminny\nbluffed\ntoadflax\nnakhla\nluchini\nboorda\ncaronia\nrunout\ncostessey\npenhallow\ndysrhythmia\napprover\negberto\ndazs\nsayad\nvanita\novulatory\nsquitieri\neifion\ngriffithii\nsnick\npowderhorn\nasiata\nkonner\ntremolos\nliptak\nreducers\njuiceman\nraffled\nyuyuan\nmateri\ngustavian\nroundoff\ncnas\ncuu\nseborga\nbudworm\ndispersals\nendell\nfontella\nhesser\nautomaticity\nmagglio\nasali\nghad\nsuckered\nwhiti\nhogweed\ncynosure\nnarcisa\nobediently\nrukhsana\nsentara\nconsalvo\ntissington\nthorsteinsson\nmathon\nhazir\nrecapitulated\ndepaola\nfunkytown\nshamkhal\nfloridan\nflyswatter\nvanko\nhinrichsen\nvarto\nfrevo\nwrva\npreheat\npostale\ncnic\nmahara\nfootbal\nlayar\nannonymous\nainsty\nalbertazzi\nempyema\nexpiate\nkharms\nhardtalk\nmalaparte\nbirkby\naboubakar\nkawata\nbushfield\nxiangdong\nzaltzman\nsahr\nknipper\nsieh\neagleman\ndarkwood\nballater\nsdio\npfn\npawlicki\ncaldy\npinzgauer\nbaumol\nolek\nfaeroe\nbonesteel\npostle\nsackey\ntotino\ncarnero\nrinka\nbattiste\ngraciousness\nparticulière\nnulle\nmopani\nchadians\ndoppleganger\nbuday\nreadymades\nbourland\nnetherdale\nsonicbids\nbertola\nhfi\nklasen\ntwemlow\nxfire\npaneth\nbunia\nhtd\nsuw\ntoiler\nspecsavers\noutfalls\nboneheaded\nprasetyo\nfcoe\nhighcroft\npinatar\nchondrules\nmazzarella\nmaxwells\nhoneyboy\nherra\nvigas\norlo\ncheyrou\neducacional\nreteamed\nseps\nsnarls\nstillington\nebeneezer\nreconsolidation\nseiche\ndrydocking\nfootfalls\nbalter\nhanney\nloenen\nrajavi\nhuckins\nthymol\nteba\nlevofloxacin\nmatthewman\nmainali\natwal\nbeymer\nqcs\njaswinder\nrazaq\njordache\nnacala\ntabacalera\nyvo\navenal\ndobler\ndonncha\njklf\nbendiksen\nmatza\nejaculated\nunreconstructed\ntabet\nbellier\ntradeable\nhiru\nforoug
hi\nuslan\ncmac\nnygard\ntenley\nbothroyd\nreede\nzacchaeus\nalburgh\nmuammer\nphippsburg\nnki\nutmb\nvernazza\ncartosat\nbulanov\nprofaned\nnavaja\nsurjeet\ngabbay\nmennea\nallods\nkeymer\nbullett\nbiozentrum\nschoolies\ndeianira\nforgas\nsextans\nhseng\nimpractically\ngazit\ngrennan\neastbury\nhashoah\npsychodynamics\npiff\nranocchia\nsnarf\nolejnik\nuhler\nkingspan\nbeeley\nanishinabe\ngumdrop\npantaleone\nkrasnow\nhasyim\ncounterterrorist\nbreanne\nnzrl\nfreedland\nslbc\ncuil\nkurn\nfruitlands\nlecher\noit\ndrumbeats\ncenac\nvoulkos\nboatbuilder\narturs\nendpapers\nbrinklow\nkenickie\naneka\nreznicek\ngerets\nfadeaway\nnuneham\ncockatiels\ntadahiro\nlonrho\ndecter\nlobanovskyi\npebbled\nsighthill\nkurir\nspeen\nagrochemical\nbtd\nfcra\nsassoferrato\ndaylily\nfapesp\nheyliger\njiahu\nrielly\nmubariz\ntürkan\nholste\nsufia\nminit\nprtc\nlitening\nminev\nundernutrition\ninitally\nderks\nsillars\nyaghi\norchy\nrecrossed\ngorog\nkunama\narki\nmanzella\nmcglade\njayakody\nsalata\nauchmuty\nrimando\nreedsport\nkonosuke\nrjc\nbomford\njonang\natec\nmarkka\npolic\ncantelli\nanyting\nenumerators\nberty\ngroyne\nbenger\nswailes\nrepatriates\naethelred\ntryfan\nabud\nsiamang\nalredy\nmuin\nfoard\nvarta\ndebrah\nafterburn\ngiancola\nrainman\npollards\nnagged\ntergat\ndeason\nadventurism\nculliford\nyettaw\nbouschet\nbirse\nsnit\nnanyuki\nwindjammers\nukrinform\ngerti\nbommer\nyewtree\nrutherfordton\nstoyanovich\napotheker\nteresópolis\nharbinson\nlorik\nkabo\nbowmore\nunfitness\nsugrue\npogodin\nminahan\ndemars\nbellissima\nembellishes\nphad\naigis\njefri\nimpliedly\nalbermarle\nmerlotte\nraceday\nnarm\najia\nakbulut\natsuo\nbrenly\ncosewic\nthaye\npeggotty\nfreebooters\ntoxicologists\nchrysantha\ngaghan\njongleurs\ntaraneh\nasef\nmonsarrat\nmicrocosmic\nmankins\nkcd\nbrueggemann\nlumbosacral\ntaskin\nchemoattractant\njellied\nthuban\naldbourne\nmarkinch\ncriado\nchitlin\norgeron\ngvs\nearbuds\nimpermissibly\nsaomai\nbuckminsterfullerene\nhédi\nepigrammatic\nnjoroge\nmunhall
\nuncleanness\nberresford\nbuste\nmigliori\nganci\nfaught\nlocalise\nkolla\nringneck\nokpo\ncholing\nmapother\npenington\nvinalhaven\nmengel\nelting\nmistinguett\nkomarno\npleming\nmegapolis\ncolavita\nfelli\nzeilinger\nmassport\nescribano\nsiriano\nvilest\ngroenendaal\ngoffey\nosmers\nnettelbeck\napparels\nrbr\nhygeia\nsoapboxes\ndechert\nalmodovar\nairth\nweeki\njouer\nsquillaci\nyangpyeong\nvette\nshipway\nkatimavik\npipex\navails\ntwofour\npomerium\nwelbourn\ndecriminalizing\nnorthfields\nainger\nlandsbergis\nblaen\ngasparotto\nmendax\nblabber\nwagners\nkumyk\ngdnf\njesting\nnotaras\nthreateningly\ndefeis\nbirtle\npapy\nbraunwald\nrimmington\npolyclinics\nkneeled\nspoony\nrepast\nrocketing\ntmh\ntrapdoors\njoash\nwicklund\nquatorze\ntamburello\ntonguing\ndeceitfully\nmaxted\nschiffner\nsynesthetes\nnesler\nelysee\ntunay\ngemignani\nhockin\nnaugle\nmicrowaved\ncalì\nnimani\nberro\ngiratina\nlavrenty\npresale\nkiriyenko\nshachar\nkatsidis\nhairpiece\nrotondi\nbrunskill\nhaitai\nbhuta\nibope\ndiktat\nkeloids\njordanne\ncoloboma\nwinlaton\ndeare\nkraak\nfape\nconflux\nmundel\nmeel\nagunah\nnoisemakers\ndevesh\nkamenetz\nkirkburton\nnuxhall\nonagawa\nkornfield\ngrimson\nrecode\nnsaa\nheelan\naggrandisement\nproses\ncaborca\ncircumnavigations\nershov\nmsconfig\nplimoth\ncollipark\nlabarge\ngotan\nsilverpoint\nrustum\nhews\nhirohiko\nplomo\ndillsboro\ncombatted\nfelucca\nkussmaul\nnizkor\nardan\nschriner\nlockroy\nkeshavan\nulua\niafc\ndushi\ngaustad\nscotusblog\ntianmen\neppard\nbanuelos\nmazy\nsarakatsani\nwiddows\nalyse\nnectarines\ngoodbar\nslighter\nhopin\naerodyne\nrequiems\nnscs\nkarada\ndamnit\nbrainiest\nsamurais\nlennar\nsplunk\nbassas\nsonthofen\ndrooped\nyellowwood\nmasius\nsealife\nkishwar\nhypnagogic\nkavalan\nvirals\njoio\nbonadio\nhypothermic\nbyeon\naislinn\nisobutane\nbelievably\neidelman\nmellett\nardara\nbonbons\njacinda\nbartolomei\nflatout\nacree\nderge\nnoell\ndownburst\nbinu\njannatabad\nunrehearsed\nafrasiab\nwargamers\ncollusive\nbracci\ncarni
vale\nbitting\neguren\nappologies\ncolasanto\nthami\nshavelson\ncjp\nnale\nmicaëla\nheah\ntranseau\noiticica\ncabrita\nsynovitis\nbaazigar\nearlene\nlicari\nexisiting\nvenugopala\nalemdar\nwbb\nxingjian\nminervois\npestovo\ndiamondhead\nbiebrza\ndecastro\npopovych\nkaysen\njgi\ngudbrandsdalen\njacobina\ncarmageddon\nshchusev\npurda\nhellersdorf\nvaquita\nsomsak\nmaston\nstonham\nwuest\nhalkyn\nsmutty\nsiano\njedda\nmacroalgae\nnasus\nnedo\nlopped\nsico\nmultichoice\nbft\nkobler\ncreatore\nmicucci\nduston\nbroster\nshouter\nregimentation\nanaerobically\ndanowski\ngaloshes\nbiasca\nplucknett\ndhir\nfurans\ndiversely\nfitful\nhouda\ngharbia\nstampley\npaccar\nbhowmik\ncheco\nheatherwick\nliniment\norbi\nody\nhemangiomas\nwynnstay\nbalkman\nmocker\nziemann\nnollet\narntz\nparcelled\nlinkup\nasadollah\ncoity\nfiendishly\nbleeth\ndingus\nathiest\ndamia\ndigitising\nfinanciera\nnyqvist\nmessen\nbardfield\nliacouras\nvolkswagens\nllcs\nmarsanne\nquinlivan\nmarvis\nellerby\nnicco\nhuntingtons\nchaabi\nabscission\nkgalema\nhoneyed\nruit\nedger\nmazowsze\nriegert\nmeisters\nsubida\ngediz\njergens\naufidius\nbajer\nacreages\nelsayed\nunmounted\nunspotted\nmosheh\nlavillenie\nsolr\noversampling\nhamerton\namiruddin\nsoftley\nshutterfly\nmenden\nreadthrough\nchalle\nhydronic\nstupar\neoi\ndeferments\ncarbó\ncaramoor\nfinalizes\nmousinho\naprista\nyazan\nzabi\nbrandenstein\nselinunte\nillum\nbrachypodium\nreagans\nsouthmead\ncenturio\nbuzaglo\nenslow\nferlito\nmóra\nquintas\nnonresidents\nsabai\nperegrines\nkdvr\ncaldbeck\ngerbe\nkadhi\ndibner\nsuccisa\neqt\nrivelin\nsabrewing\nkimberling\ndumitriu\ngonder\ntoker\nchristianizing\ncimic\ncomorbidities\nbacalhau\nmckeague\nguta\nraburn\nartifical\nakzonobel\npinfield\nllewellin\ngenesi\nkrasno\nzatlers\ncartaya\ndeavere\nincertus\nlisterine\nkensley\nbittu\ningénue\nthoracotomy\nkostecki\nrebellin\nohene\nderated\nsiham\nbucholz\ngribbon\nlepel\nesteli\nwidman\nclerico\nethylbenzene\nembroider\nluminus\nwfo\ngurunath\nmatveev\ncuffl
ey\nkrasheninnikov\nhavo\njammal\njiazhen\nterzian\ncamelo\nquietude\npocheon\nlems\nnyali\ncnac\nspartel\ngalu\nstudds\nunmissable\nessent\nzanamivir\nwitticism\nsangoma\nserero\nrondos\nspivakov\nvolpato\ntannis\nphilosphy\nsarafina\nrasher\ntrimarans\nhartlebury\nwxrt\nfullington\nwhiterock\ndesford\nmoumouni\njambe\nsezs\narimaa\npackards\ndragonetti\nafterimages\nshuzo\nztv\nproliferator\nnanka\nrueful\nfayerweather\nsmeg\nitep\neitc\nherek\nsaens\nlairg\nveerle\nnockels\nbalser\noreilly\nardoz\nlovelife\ngirons\nmccreesh\nadaa\nkalicharan\ndehydroepiandrosterone\ndisgracefully\nripsaw\nbellanger\nrafto\nsarc\ngwasanaethau\ncsia\njagath\nrajmata\nlepidopterists\nuwais\ndisarmingly\nprerelease\nsubercaseaux\ndetling\nabk\nlocutions\nicmi\nhruby\nboudiaf\nengorgement\nfranki\nccmp\nzhiqiang\ndhanda\nbudeaux\nglobs\nlamble\nmeshulam\nkukkonen\npampering\nshantel\nzilker\ndjeparov\nnashiri\nnovoselov\nbalmaseda\npiezoelectricity\nnagamura\nmarange\ngibril\nsaum\nbidmead\nslatington\nlupercalia\nsoce\nrapley\nuunet\ncerén\nbastone\ntelework\nblading\ndbrs\nhalderman\nferreras\nmalinke\ngrayston\niph\nlydiate\nairton\nrightwards\ncurzio\narular\nbenzine\nphanatic\nfoldout\ndewdrop\nvaldepeñas\ntownie\nivankov\ndices\ntutton\nsurtout\nbittle\nkhazarian\nmegi\nsulston\nsamaritaine\noreskes\nwainscott\nbrard\nfbu\nsectarians\ntirrell\nkucharski\nauxiliadora\nsametime\nneuroradiology\nicis\naonach\nwinslade\nmuskat\noveremphasize\ntitterton\nsoad\nfloodwall\ndelmon\nweinzierl\ntondi\nnamas\nkyriakou\ncerulli\nkimera\ncroze\nkammuri\nbnai\nmawby\npaymer\nthanawat\nbabycenter\nhawl\ndementias\nellam\neyen\nspicebush\nbreathlessly\nbashundhara\nsnn\nishimaru\namaan\nhornist\nsaltation\nuicc\npithiviers\nbaxandall\nhollinshead\niten\nmedroxyprogesterone\ntusken\nmawali\nkubilius\nspicier\njumaa\nskeels\npolie\nbajazet\nlamivudine\ntopman\nshikun\nfeda\ncoxcomb\nfinitude\ngovou\nampon\nelaina\netel\nquennell\nsamten\npassingly\nchlor\nkeedy\ngracechurch\ntinges\nangsty\nopend
ns\ngrohe\ntarantini\nmatsura\nspooktacular\nrenker\nqueally\nrestituted\nvng\nchessboxing\nvause\nmarfil\nsieved\nstrasburger\nstamler\nairheads\nmaehara\nsynthonia\ninstrumentalities\nmge\nsensurround\nacheloos\nmubadala\nfilhos\nforeseeability\nbickered\ncopano\nmicrofracture\neeo\nhaeften\nmanier\narrant\nnewschool\ngaine\nhiplife\nkarwan\nlorentsen\nwachee\nscorza\npyra\nyanping\ndelwar\nkameshwar\nmbw\nkirya\nmutator\nsscs\nalsdorf\nkumuls\npartanen\neschscholzia\ndesenzano\nmaskey\nsoundbox\nrathdrum\nkhodadad\nincomprehensibility\nmaryja\ndenniss\nedmc\ntempa\nemissive\nholtzclaw\ncharco\nszadek\nantonetti\npean\nroelant\nsulfites\naudiologists\ncostanera\nitinerants\nchayes\nliubov\nboumsong\nrelenting\nspreadable\nkcpq\ntaia\ncourter\notherwords\nkróna\nrescinds\ngayan\noutranking\nisolationists\nwassaic\nbenthall\nbandanas\npriego\nragen\naztecas\nnoton\nbienstock\niliopoulos\ngatecrash\nmedicating\ngumtree\nproft\nbenziger\ntaqlid\nlegging\nastri\nbeachgoers\nwistrich\nsaujana\nnease\nburdge\nspearritt\nhalasz\ngingrey\ntimbs\nlennoxtown\ndarning\ngrises\nascom\nmagnetizing\nmotorpoint\nsqdn\nindels\nmascotte\nwackernagel\nunworldly\noverripe\nfugato\ndhf\nnixons\ninchbald\npeartree\nteitel\ntka\nbaug\nracin\nnonwhite\ntiptoes\nbirdsongs\nbosnich\nintimidatory\nsoulages\nisim\ncristino\nbelshaw\nmazal\ndjibo\npanpsychism\ndanzer\npochin\ntads\nkyran\netalk\nhomeschoolers\nwilbury\nedgren\nhorologist\nworts\nsinosauropteryx\nbedhead\nsamand\niniquitous\nmoralities\nraincy\ndoodling\ngranovetter\ncifs\njohnsonburg\nfranek\nvbl\nkimmons\nmulatu\ngermanys\nrelicts\nhavertown\nperdues\nthundered\ncgil\ndisintegrator\nforgione\nmealybugs\ndreambox\nunbleached\nfiskars\nkhosro\nparramore\nfule\nfrankenmuth\nbayanihan\nmoidart\nkurultai\nzwilling\nmalecki\nhopetown\nspeziale\ninsidiously\nkapalua\npees\nlavandula\ndadullah\nhaberland\nmesocyclone\nelenor\nairfrance\nkuhne\nastaxanthin\ncommandante\nfloristry\nbrandler\nsycophancy\nmerin\nteather\njavy\nbekmambet
ov\ncopi\npelada\nkundt\nmabhida\nhice\nstauch\nenfranchise\ncaereinion\nwajdi\nlegrottaglie\ndisorientated\nlarentowicz\nnorlander\nrimal\ndorcus\nccrc\nsivalingam\ngors\nhollenbach\nlubos\nrattay\nsoarin\nsigrist\nbertarelli\nnolting\nvonnie\namerson\nnotifiable\nspacewalkers\ndeigned\ngurgle\nnarbona\nfeustel\ncroagh\nnaville\nglitnir\nabdullatif\njeshua\ngroudle\ndionysiou\nmenfolk\nfarland\nsunriver\nsaensomboonsuk\nsutphen\narbol\nsalido\nmigas\nprovencal\nrocastle\ngorsuch\ninvermere\nsyncopations\nmullaghmore\nfilmland\nsluggishly\nsagala\nprocrastinating\nlaurita\nchoosers\nsagd\ngosl\nzabol\nmsimang\ncharif\nacerra\nmcgeer\nharworth\ndenting\ncmea\nardeche\nabucay\ndogz\ngalien\nsenni\nstrugnell\nlevinthal\nquiambao\ninfarcts\nmirae\ngorriti\nkrash\neame\nzingarelli\nchernyshev\nsoroptimist\ntraves\nhnp\nzhores\ndysmenorrhea\nbelbo\nlolcats\nwoks\ngerini\nfintona\ndeci\nravish\niturup\nmondawmin\ndecors\ndelbridge\ntransantiago\nslaloms\nmathcounts\nmabuza\nautocorrect\nriha\nwillshire\nkombarov\ncairnryan\nschweigert\ninnovatively\nakanksha\nhayflick\nsommerfeldt\nlandhi\nspinella\nstrahm\nsterritt\nformspring\nrevol\nmachiavellianism\npwb\nunferth\njosceline\nclamber\nmisidentifications\ndebelius\nlobotomized\nringu\nnasrollah\nstort\nteresia\nnewsdesk\nparadises\ntaiheiyo\ngobblers\nfusari\nprovolone\nmakarem\nkuzu\nalxa\ntapeless\nelzbieta\nrebodied\ninconveniencing\nfrug\ntrufant\nflextech\nautoclaves\nkrstic\nrepatriations\nmultivalent\nchirped\njoropo\nsandis\nneumaier\nboshell\narsace\nlellouche\nbirkner\ncymmer\nuliana\nbronzo\nracon\ndemagogic\nkapros\nutilitarians\nopheim\nquotidiano\nfeinman\nsudek\nlewisporte\nmothered\ncircumlocution\nsumati\nexurbs\nnoomi\nswineshead\nchungnam\ncsokas\nhoutte\ncseries\nmorry\nestima\nereader\nreiver\nconradt\nduppy\nbipeds\nintarsia\npendeen\nbungy\nbracamonte\nbanane\nlurton\nmusina\nkippa\nsesam\ndithered\ndionisi\nhoiberg\nsouders\nbrodo\nkhazana\nabdeen\nhaja\npersonnages\ntrichinosis\nlajja\nbusinessper
sons\ncsid\nndis\ntamarkin\nalmaguer\nchessell\nhattem\ncabrol\nbrillo\nnetcraft\nbarz\nossman\ngrowin\nafsa\noutspokenly\nscénic\narafa\nmeana\nschadt\ngastrectomy\numhlanga\nshatterhand\nnowland\nadwick\nbahamians\nnynex\nlinocut\nharebrained\nmutesa\nblemished\nmacgruber\ndeshaies\nkalaallit\nkarsavina\nrondeaux\nflashcard\nthamrin\noneal\nponchos\nauthorisations\nostlund\nhellingly\ncosmi\njerryd\ntailbone\nschamus\nshreck\nottomar\nmasturbatory\ncalzone\nosteomalacia\nsmeralda\ntamminen\nkotsay\nreuschel\ntrezise\nvidagany\npunctuations\nhobs\nvallet\nephebophilia\nreuteri\ngeneralizability\nsoszynski\neurocode\nenkhbayar\ndecaffeinated\nspertus\nvogeler\nsurve\neja\ndeerhound\nophthalmia\nviguerie\nuly\nsanyu\nschneiders\nchahe\nsafranbolu\neuropride\nipy\ntemba\nblackbrook\nharap\nerasures\nrecalibrated\njockeyed\nchantler\nwoodvine\nkaufusi\nhtl\npetrification\nmamey\nsabates\nperegrino\njooss\nbrogues\nscherzi\nexclusionist\nsundog\ndogwoods\nteem\ncolorblindness\nsanguinary\njayakrishnan\ncarros\ndren\nyasuhisa\ndoute\npalaeontologica\nreanimating\ndixwell\nsuai\nbooktrust\nmomchil\ngannaway\nrawail\nalverstone\nscuff\ncumbie\ndynamiting\nobus\nyangpu\ntularensis\nbahamontes\nforder\nassegai\ninhalants\npronator\nliklihood\namitava\nbici\nemmanuele\nswitchman\ncommandoes\nsundstrom\nmochudi\nrebuffing\nlipases\nbatswana\nanousheh\nbadboy\nsahadevan\nblosser\npoyle\nrestormel\nhaïm\nsalafists\nfanuc\nlittlepage\ncuori\nganjabad\nmanoff\nvertiginous\ntogni\nkapos\nwimprine\npehrsson\nfourty\ngoubert\nstinkers\nsunjata\ncodis\nsetra\ngaisberg\nweatherley\ncirculators\neskandarian\ndebaser\nddn\nshibi\ncolca\ntermon\nhuffaker\nbelanova\nundercliff\nnonus\nkesang\nsamawah\npoons\nsaltwood\nlothbury\nsaluzzi\nstojanovski\nbarnsdale\nlimekilns\nallbäck\ndimwit\nkartheiser\ntoques\nccafs\narmington\ndisrobing\nziyu\ntzintzuntzan\nmediabistro\nyitong\nkonaré\nlongrich\nclassé\nevariste\nvallette\npreamplifiers\nguttering\nmably\nbaishi\njcd\ntellem\ngermanies\nflog
gings\nlarysa\nneesham\nstrongside\nhhr\napproche\nkrunoslav\npagham\npaíses\nspringwatch\nmolitva\nrockbox\ntought\nhippest\neskandar\nluzenac\nardabili\nfreema\ndogface\nmonocultures\npruvot\ntavan\ncondensations\nsuccessfull\nmiyaji\nkhom\nwastrel\nelkland\nscaleless\nsalek\nsomerhill\nbrunhoff\nconvertino\nmommies\noase\nsquarer\nsweetback\njianzhou\nwarzycha\nsheriden\nallchurch\nismaning\npopgun\npressmen\nclearwire\nbaisakhi\nnorac\nbackpage\nmbasogo\niiro\nparisii\nhartsburg\nmalach\niwe\njianli\ndonnacha\nsatoyama\ntofts\nalpern\nprimatologists\nwaterhead\ndigable\nbaalu\ntyle\nlimerence\nawen\nguzzanti\nlochwinnoch\nllandinam\ndolen\nlachenmann\nheery\nbillu\nbyk\ncampoamor\nludwin\nuncalibrated\nhighnam\nbrisas\noverath\nadditonal\nallsorts\ntrevone\nrivel\nmnu\nkomissarov\njerseyville\npagliuca\npursed\nvanpool\npfft\npinjore\nconstrues\npscs\ntruehd\nspanair\nbytham\nserta\niziko\nusis\npopek\nnené\ndemarchelier\nndour\nkudoh\ngleek\nwoofers\nhauger\npetrou\nuncatalogued\nautostrade\nxuecheng\nsociopathy\nsolomonoff\nbaisho\nboboli\nrobichaux\niaff\nxiaowei\nwasters\ndieters\nlightburn\ndgk\nbourbourg\nwenchuan\npinera\nhoffmeyer\nsanguisorba\nalyssum\nrasel\nhooping\nlovren\nmayacamas\ntedisco\ngenzano\nkipa\nleckey\nnavair\nchinks\nllanover\nvirgatum\ncardrona\nrosensweig\nteni\nobservador\ncasasola\nqomi\npeugeots\nmalmqvist\nteagle\nintercoolers\navionic\nnaranjos\nfastbreak\nserg\nhalfin\nmpn\nbafflement\nrebrands\nvacuity\nmarting\nnavasky\ntoulalan\nzesty\ncelsa\nmuston\nmorones\nkythera\nnespoli\nreenactor\nunnamable\nparbati\naow\nskyride\nsachkhere\npenmaenmawr\ntumilty\nrequesters\nsoongsil\npipping\ngeerts\nzuckerkandl\ntulipifera\nbackburner\ndusenberry\nardua\nthiede\nmarto\nimmobilizes\ncongers\njonte\nigcc\nkaczorowski\nbakso\nghurair\nimputations\ncoreen\ngobsmacked\nglycolic\nsecuritized\nqingcheng\nchusid\njorja\nportet\nlepeophtheirus\nrambaud\ncalcifications\nipro\nthurible\nunschooling\ntherry\ntabac\nmudejar\npergolas\nsverrir\nkn
een\nquadroon\nperrys\nharnois\ntragedians\nsicl\nnival\nbarzagli\nmcgirt\nludvik\nurticating\nalibris\npratim\nzhoukoudian\nmucositis\ndegraff\naurally\nreimbursing\numran\ntheale\nhfr\nléaud\nlarrain\nexonerates\nnaharin\nipmi\nmotorail\nrebelution\nkbytes\ndisembowelment\nmusallam\nizturis\neulogizing\nswitcheroo\nleibler\nconstrictors\nreadouts\nryderstedt\nchecotah\nacetylsalicylic\nsalita\nferras\nmammoliti\nbrasseries\ngodward\njarana\npirogue\nwashwood\nguohua\nreichle\ngaprindashvili\nbifacial\nmudfish\nschochet\nummmm\nshadegg\nzauner\ncejudo\newhurst\nkosan\ncravero\nhanle\ntelstraclear\nbahagia\nmilonakis\ndinkum\nfujiki\nchintz\nverplank\nkroeker\nboxted\ndrais\nsteinhagen\nphalangist\nloftis\niwr\noddballs\nmegaw\nstanleys\nnuernberg\nexplination\nfliess\narsala\nmoulson\nliling\npackin\nlionized\nseathwaite\nspittoon\npietrangelo\nharjinder\nesophagitis\ncultish\nassessable\nnaydenova\nosirak\nangriest\npackagers\npanayiotou\nnathans\nclimo\nshechtman\npurvey\nbortolami\nkilovolt\ndrakeford\nlanterman\nmagistris\nporthcurno\ncosier\nclearway\neliteness\nforenoon\nctls\ngunbattle\nclaudication\nerfan\njamahl\nsheelagh\npolystar\nseang\nwadd\ntradecraft\ncummer\ncomputex\ntranspiring\nextroversion\nespie\ntuten\nforestieri\nmisinforming\nhipc\ncalixa\ntaso\npinctada\ncoonawarra\nflorican\nbourgas\nstelly\nopsahl\ndemelza\ndefelice\nnikolaou\nprudy\nweichsel\nforewarning\ncayos\nloriod\nadmont\nweltklasse\ntewaaraton\nroski\nvojta\nxerostomia\ngerolsteiner\nkaton\nsobered\nfichter\ndodgem\ngrafite\ncimt\ntecnologico\nmixi\ntugger\nwtrf\ngrahm\nreavey\nkaimana\ntruslow\nboeselager\nploiesti\ngopalaswamy\narmscor\nswed\nvocalese\nanganwadi\nbordet\ngewirtz\nmlambo\naedas\nrauff\nandrology\nratcatcher\nmcgahey\nfarlington\nhuiyuan\nabejas\nfurstenfeld\nknockoffs\nmatadero\nwratten\ngilmor\npentwyn\nbarrenness\ndecoteau\nbulga\njerrell\ntuhoe\nnarcissi\ngallopin\nservicemember\nrotators\npasaje\nchatt\ndunkerley\nsadollah\ngandharan\nkust\nkamada\nzevallos\n
casero\nrepertories\nmultivitamins\nquantel\nfarnes\nkotorska\ngraphix\nkinzinger\nbonal\nbrondesbury\nlatchkey\nlucking\nbyward\ninvasor\nfumigated\nwearying\ndunkard\neaubonne\ncherri\ncertains\nlobular\npengam\nfatema\nkurihama\naucilla\ncdcs\nflyboys\nreifer\njankulovski\nkegworth\nproscriptive\nunconsidered\nforca\nesos\ntransfering\npalco\njenas\nffn\nhispanidad\nbakool\nayun\nbuttercups\npostscripts\nstrohmeyer\nvaluers\nmumias\nrheon\nmowrey\navallon\nnorb\nlatas\nsuckley\nlykken\ncuttyhunk\nveryan\nfariñas\nkatabatic\nhasni\nkiz\noctyl\nholobyte\nmuscling\nmanteno\nmaike\nkema\ntimeslip\nwordsley\nnocebo\nakuffo\nthaindian\nfanum\nmargvelashvili\ngleadless\nliberalising\nbcis\nngg\nnegash\nvinger\nacclamations\nkakuei\ncassuto\nunconference\nmaaike\nshangdu\nchristal\nalmazan\nargueing\nmagaly\nalpbach\nzelinski\npó\nhartshill\nwinelands\nspatola\ncaricaturing\nhumby\nshardul\ndystonic\nclerides\njenolan\nlizza\nwaterski\nkeran\nmetr\nreverberates\nmalpais\neventer\npeterka\nneuropharmacology\narnoult\nzhenwu\nstj\nneen\nifri\nyahoos\nokonedo\nribalta\nodumegwu\nrued\nblackbook\nsatur\ntambien\nbunmi\nmccampbell\nslahi\npnds\nyukimi\nironhead\nseptuagenarian\nlelant\nathole\npingdingshan\ntroccoli\ncrossbeam\nprohm\nqueralt\ndavood\ndislocate\nfilipiak\ndetroiter\ngrismer\nmegafaunal\nmagpul\nlimu\nblagrave\ndatacenters\nbergeret\ncomplexioned\nzemke\ndigiovanni\nmoulsecoomb\nohrdruf\nnarratively\nmallozzi\nmisal\notisville\nkerviel\nchaklala\ncantanhede\nyameen\nmoony\nafroz\nreckord\ndarent\nmipim\nbelak\nlydeard\nfalana\nmilda\ndigitel\nzakim\nnankang\nmcilwain\nkbit\ncannelton\ntricarico\nbriavels\ncruickshanks\nmojca\nunresearched\npremedical\nberenyi\npizzichini\ntamerton\nbads\nnige\nebara\njonn\nsoderstrom\ngofal\nharston\ngiambrone\noppresses\neurobank\nshrivel\nipic\nnurdin\nfakhro\nnannes\nweyers\ninya\noxygenate\netemad\nenglischer\nmentorships\nmadad\ngtn\nthangkas\nobb\nunfreeze\neurodisco\nbisulfate\ndolar\nashigara\nseagren\nsolinger\nmillia
rcseconds\nrossner\ngallini\npacifici\nchesworth\ninzunza\npearsons\nsuffruticosa\nstoffer\naulie\nweyand\npalooza\nhaiqiang\nepu\naoh\neggermont\nunthanks\niriondo\nrefahiye\nvinyard\nroce\nchindia\nurasawa\norense\nmisjudging\nshelvey\nmadjid\nblumenstein\ndlb\nruelas\nwgbs\ntrinkle\ncucchi\nxinghai\ntupe\njabalia\nbloodborne\nclammy\nfrisbees\nivailo\niatp\nrenie\nlehder\nteofisto\ncargas\ntosser\nswiftwater\netools\ndoumanian\nbattleline\natchley\nteana\nfolkish\nmuscari\nhanken\nhamaker\nsudol\naviatrix\ntristana\ndekle\nchueca\nairlifter\nnjal\nharuyama\nschotten\nsnakeheads\nmajorino\nhorribilis\niveson\nradarsat\nklout\narvanitis\nexercisable\nkadim\nrosenblat\nsmelts\nwaymark\ntropicália\nkiuchi\nsunseri\ntokitsukaze\nbillen\nunderachiever\ngorbea\nbipartisanship\nshaoshan\nsheepshanks\noughtred\nmenthe\nsevillano\nvinous\nsoundless\ncheesecloth\nharassments\nverri\nrafalski\nconad\ngips\nsican\nkene\nintractability\nkilcoo\nqinglin\nlehmkuhl\narmellini\nmokulele\nlauricella\nmancina\npomade\ntowpaths\nglycyrrhiza\nzipes\nelectrochemically\nkirui\nrolfes\ncharlee\ntinariwen\ncosio\nnwsc\ntsegaye\nautosports\nbissix\nmisprints\nsavoca\nconflits\nwestoe\nrhinemaidens\ncheckoff\ntraudl\nktb\nhilaly\nmariga\nclaghorn\nfacist\ncahow\nizapa\nspotts\nlotsa\nkobs\nschweik\ngulotta\nkippah\npresson\njentzsch\nwdrb\nmaniwaki\npanerai\nglew\nsniped\nstriggio\ncowton\npanela\nschrobenhausen\nkomiyama\nneutralizer\nkaczor\nvicentina\nzimbra\npigtailed\nsagor\namulya\nvlahov\nhadl\npickel\nteeling\npenkovsky\nmisting\nisetan\nsticht\njohans\ngallaway\nkaliya\ncasazza\nxeni\nlauterpacht\nendoscopes\nmhór\nlumpkins\ngreiser\nanimé\ncucurbits\nberistain\njamet\nstampeded\nsermoneta\njerboas\nwedo\nbattiston\nmerel\nsutherlin\nnontheless\nstansfeld\ngeremi\nverstraete\nsxl\nspadafora\npéan\npolicial\nplumelec\ngräfenberg\nlafd\nmeniscal\nplasmapheresis\nchèvre\nlls\ncimex\nskor\nwodi\nindiecade\ncalegari\nretrenched\nunchristian\nkindergartners\nbharara\nsteinbrück\ncissp\nz
ilin\nchosing\nbatallion\nraec\nkatin\nsahagun\ndugi\nsymi\nsendoff\nmisconfigured\nkgtv\njeopardising\nrelaxations\nmontanan\ndoorjambs\nmaradiaga\nsenedd\ndinmore\ncopus\nnarla\nshimotsuki\nphotobiology\nhabibul\nmodbus\nhaxton\nescrick\nsouthpeak\nstz\nresealed\ngairy\nyusen\nsledmere\nbreukink\nhamdallah\ndusseldorp\nreintroductions\nlittlecote\nkshb\nlonget\nkova\ndolemite\nwptv\nmorando\nclingman\nmalpica\nwenbin\nkogler\nmuhr\nweeb\ndevia\neeckhout\nkaladan\nwarleigh\nfirdasari\nhanae\nunilingual\ntsat\nkassin\nconnelley\narauz\nfolklorico\ntimson\nheadhunted\nfalch\nhispanicbusiness\ncharmouth\nsupraspinatus\nrhymefest\nwilen\ngasolina\nwennerström\nailton\npakubuwono\nmaxing\nconcetto\nyohanna\nstenness\nskeaping\nbumbu\nflumazenil\ndenardo\nrutnam\ntriamcinolone\ntameness\njianjun\ndellwood\nprofanation\nnade\nlidz\ndieffenbach\nmahari\nmayrhofer\nkatzrin\nmompesson\npashtu\nfcx\nyunior\nbetwen\nbarooah\nkhokhlova\nstanzione\nphife\ntseten\ncorbetts\nrychlak\nwilker\ndecanted\nshiso\ndeaner\ntadese\ndargomyzhsky\nwaipawa\npreshow\ntogas\nhemis\nunstrung\nwhiton\nmishaal\nmartinu\nsalguero\nhouseflies\nbortz\nsaalbach\nassoluta\nrahon\nslimey\naurizon\nfishkin\nbasepoint\nexcavatum\nsuperdrug\ndelucia\nnanortalik\ndaftary\nmostro\nmarise\nalexithymia\nmufon\nalanbrooke\nnamara\neggesford\nlazur\ncyclosporine\nminidv\nkcts\npenygraig\ntentacular\nlongis\npichai\ntussey\nlockland\nkitasato\niashvili\ntecnicos\nzhangjiagang\nbehbahani\nmonarda\npettinger\nadisa\nsilvertip\nconstipated\nnevio\ncellcom\nhatzidakis\nroseworthy\nbuckham\ngrisanti\nverdejo\nferriter\nrockfest\nqueensgate\ndoubloon\nliliya\nautofill\nbêtes\njva\nteeuwen\narivaca\nkizhi\nresponsable\nmatejka\nsorek\nmarouf\ncalders\nfornari\nskywriting\nkaari\nmailey\nbioplastics\nguttuso\nroutiers\nayim\ncabotage\nbloggy\nbaragwanath\npoxy\npanula\nmarchetto\npasubio\nagona\ncrumpets\nholdenville\nmarvelled\nlatecomers\nfoston\nkuhner\nhappenin\nkaino\nextrapolates\ndramatises\ninfluencial\nspottisw
ood\nglengarnock\nkeratinocyte\nnder\nccms\nhurstwood\npepsodent\ntilli\nfeigen\npopeyes\nbollin\nvranitzky\ndemes\ncaruth\nkingsfield\nsuleymanov\nelanor\ncoffi\ngastroenterologists\nmonkfish\nkaysone\niela\nbrinkmanship\nplattsmouth\nlezgins\nyenne\npupating\nssci\nfoulon\nmossa\ngerdau\nmokbel\nmalonga\nhelenium\nmaisonette\nalcee\ndeepali\nhuddie\nliebeck\nabdulwahab\nlogística\neverdeen\nchlamydophila\nforresters\nmarku\nnebojsa\nzichron\namuru\nroyalism\nstreete\nwinfree\nkendi\nbentgrass\ntsentr\nkozik\nmodelica\nsabie\nmouseketeers\nsunderbans\nnaohiro\ncineaste\ncoiffure\nscappoose\ntheoren\ngodinez\ndaimaru\nlympstone\nmongar\nastakhov\nmoshood\npizzolo\nwarlordism\nflahive\nchristofi\nhiam\nhutzler\nkaryo\ngiss\nuteri\nboetti\nvodianova\nvobis\nhillar\nrefco\nprognoses\ncipla\nzagallo\nlearnable\nanomalocaris\nprotti\ntaibu\npinxton\ngoudey\nvaros\nblackhouse\nyie\nbirkenshaw\naissa\nsullom\nrúnar\nzebina\nsongyuan\nkotwica\nzarautz\nhogland\nphotothermal\nvarmus\ncategorie\nfxr\nwerc\nwelliton\nguge\nkinmont\ncurently\nchave\nrumple\nsorpresa\nsiran\nbreading\npandilla\ntripplehorn\nspritz\nrégent\nschwind\nwynkyn\nfresheners\nhypacrosaurus\nwallenpaupack\ncassese\nagapanthus\nmaniwa\nrothera\nwatchfulness\ndovers\nhermagoras\noverworking\nweaf\nitzin\nstandiford\nruddell\nivinghoe\npadfield\norbea\nkimock\ncanche\nlisztomania\nzitong\nshipload\nspectroscopically\nwoude\nsuperpositions\npami\ndeslandes\nmarchini\nhatfill\nmoviegoer\ntennys\nyunshan\nstabroek\nphotinus\nmetabolisms\nrxs\namatya\nmehserle\nbeibei\nmikus\ndeadbolt\ntipps\nkaikhosru\ndixville\nluchterhand\nlodo\nnatc\nbacka\nunderachievers\nwavves\nfambrough\nportale\naktiebolaget\nholloways\ndelfs\ngodown\nwoomble\nsubvention\ninterservice\nsjoberg\nsupermercados\n,who\nlightsey\nhelghast\nmednick\nrishworth\ntillicoultry\nbukh\ngaige\ngreasewood\nlangenfeld\nmulley\ndeportee\ndfj\natila\nharkers\nrigobert\nworldspace\ntafa\nwellford\nrunamuck\nbureaucratically\nmalinois\ntiman\nbenjani\nhe
ever\nposluszny\nfarnan\nagarwood\nmoskos\nabello\nsquawking\npleopods\nsiwei\ndinda\nlaithwaite\ngravell\nyaara\nxaverians\ngacaca\ndameon\nalyss\ndisembarks\nshingwauk\nconaty\nsoursop\nyacouba\nsnowville\nchirundu\nprimatech\nntrs\nlizardi\ncappuccini\norka\nmucker\nlosee\ngozzoli\nnussbaumer\nbilfinger\nwillsher\nbario\ncocksure\ncanedo\nreliables\nswetnam\ndron\njagua\nmwalimu\nafifi\ndonatoni\nrichton\nzelen\nsnarled\nkunihiro\neisbach\nhorseriding\nmanchanda\nindustrija\nshanon\nrabot\nmilliman\nascutney\nmottershead\ndynaflow\npenmon\ncorneau\nfruehauf\nalbrechtsen\nnalc\ncrampon\nhlas\nfabuleux\nmuslimin\nkinfolk\nwallström\nchangji\nsilin\nfertita\ntiefer\naijaz\nandronico\nsukhbaatar\nhamidou\nrecoloured\nyoku\nmichala\ncarga\nrambova\nfinnart\naftershave\nnzb\nlaitinen\nkingstree\nufe\naracena\nglasner\nmasacre\npyrethroid\nlevenstein\nquiros\ncuddyer\ndilshod\nantiheroes\njamon\npreverbal\nnicktoon\nnouel\npopkov\nrathje\ntigerman\nuate\nexordium\nborup\nparwana\ndisquisition\nelberta\nsauma\nfriedheim\nuscs\narac\nnrao\nmnh\nsupraventricular\ncodelco\nbhe\ntourneys\nsilverwing\nastrocyte\nexecration\njaywick\nacom\nalbana\nbezzina\nbope\npapered\napli\nhallifax\njnj\nkissane\nvoicemails\nnoctilucent\nhassania\nuseing\nsubthreshold\nchungbuk\ntgt\nstroe\ndresher\nbuttafuoco\nmakinen\novshinsky\nfenoglio\nhermelin\ntlw\nlissan\ndisrespects\nanuja\nberriman\nshwartz\nmagnetoencephalography\nnaevus\nrebollo\naull\ndirtbags\nstructuralists\nvandervell\nneeld\nkabui\nmatassa\ntdci\nwilleke\ngerba\ncockrill\ntuts\nmetten\nnoisemaker\ncityflyer\nhornpipes\nperrigo\nmagnette\nprocon\ndiomed\nlapoint\nbroomall\ntownies\nmontbard\nineptly\nfrijoles\nclockworks\ndpv\ncammalleri\nkarvonen\nsardina\nmagnificens\nseidensticker\nleesa\nkindles\ncommunic\ncemaes\narib\ninvestitures\nyafi\ncoriolan\nbellecourt\nkorma\nsharpshooting\nwildfell\nvictoriei\ncohle\nvallas\ncherax\nhumbleness\njbg\nazizia\ndohnanyi\nolimar\nleapfrogged\nfineberg\nkirishitan\nkhewra\ngosain\nb
rayer\ncambourne\nverlinden\ncail\nortego\nbsaa\ngroomsmen\nlycans\namaar\nfelstead\nhyperuricemia\nthematics\nchanu\nesterel\nstaleness\ncordovan\nundiplomatic\nshtern\nradosh\nchemerinsky\nkieslowski\nstendahl\nminderbinder\ntermly\nentrain\nschrott\nasexuals\ncelestron\nmazak\nvalade\nmillmoor\nbencomo\nqnb\nelfi\nmenning\nteleservices\npromptness\npetten\nfigline\nbeija\nassab\nosso\npremnath\nbaixas\ntabeling\ncaragh\ngalanti\nflaca\nfragata\nveldkamp\nstrangeland\nknightdale\ncaretas\ntonegawa\ngoodliffe\nuralsk\nlerer\ndesconocido\nmoccia\nchiloe\nbecka\nterral\ncommodified\nbrassington\nmacher\nornans\nmelleray\ntvf\nschnauss\nselvig\nkibbey\nturesson\nvorhaus\nlindsell\nyoshika\ninawashiro\nnitsche\nschaechter\nderring\nrodarte\nbursledon\nwalkathon\nzele\nchasey\nannies\ntowe\nhonkala\noverbay\nfasch\nkazoos\ncoluccio\nbandsman\ncsikszentmihalyi\nesy\npuech\nashly\nguianese\nsteerer\nsarid\nkijang\ngreenpower\nepisodically\nrussett\nnontheistic\nyis\nbusson\nballetic\ncraniotomy\nshortnose\npheng\nperel\nmabelle\nskibo\nmorgridge\nnijenhuis\nyelps\nsleepwear\nmoodysson\npedot\nbutterly\ntriband\nvinos\nsonangol\nphonak\nescandalo\njeffes\nlysol\nbaptizes\natara\nshrewish\nbhide\ngohl\nlunchtimes\ntorroella\nborovets\nupswept\nbrasted\nslewing\nvenaria\nzoonosis\ndrowne\nleveille\ngazin\nturpis\nessilor\nindexers\nmingei\nparrado\nlobs\nutis\nfleshes\ncarabiners\njoda\nmurkier\ncerrejón\nedurne\nkosaraju\nneubau\nbeha\nconclusory\nbadilla\nhegedus\nramadhin\nmonkhood\ngreenlands\nmcintee\nbosoms\nfheis\nmcgaugh\nbirsay\nglassner\nfilarial\nsics\njaso\npolemicists\njugnu\nintimidator\nproselytized\nporntip\nalleycat\ncheesemaker\ntuscia\nkiewa\ndachas\npettie\ngarbanzo\nneurofibrillary\nbics\nbiosimilar\nkaysar\nstaviski\nhillsville\nholburne\nlandell\ncorbell\nyogurts\nmunera\nstanlow\nmanandhar\nauchterarder\nnadon\nshitload\nislamorada\naciclovir\nsabapathy\nhaytor\nunasked\nmalkani\nblan\nejf\nkunzang\nklasa\nclemo\nmustafaa\nrevit\ngingerly\nhollybush\n
gamlingay\nmotorplex\npastori\nwallum\nwalrath\nwamego\nstranorlar\nconstanten\nitani\nkarez\ngipper\nnashwan\nkhata\nlimmer\ntriticale\nkentland\nspumante\nziskin\ntryggvi\nasts\nstemple\nbourbeau\npjd\ngentlewomen\nselja\netoposide\nshoda\nurman\nqadhi\nmultis\nsabbir\ngrody\nfinedon\nburgstaller\nniether\nmangino\npinetum\ntsusho\nstettner\nexertional\ncipolletti\nwcac\nspaso\neiti\nashmun\nwisch\nbenzedrine\nchandrasekar\nmisclassification\nmyrow\nofgem\nbrazile\ntappen\nosseointegration\nloades\nedifício\ntranquillitatis\nzeile\nleshem\nrecluses\nllorca\nmodesitt\ntayyib\nbnz\nhdtvs\nheylen\nbeebo\nkersting\nsobolewski\nketner\nraio\nkostitsyn\ntinelli\nbinayak\nmondesir\nmetaxa\ndoughton\nightham\nsacu\nantibacterials\ndrummonds\ncaylloma\nbrûlée\nbalcerowicz\noccupancies\nkamioka\nhuessy\nkumail\nmanker\nminchev\ntrembler\nfiola\nsugarcrm\nsherf\napposed\nkso\npathirana\npinchback\ncibeles\nilcs\nlevothyroxine\naced\nliebel\nrodelinda\nvancamp\nsturdily\nmainstreamed\njamhuri\nbangert\njetpacks\nbruker\nbiogs\nmoonshot\nelectricité\nstupefied\ninfallibly\npépé\ntimbuctoo\nfrary\nelectrique\nbacigalupo\njitu\nchiddingstone\nsarcosine\nmcbrearty\nnutr\nesquinas\nintoning\nslyde\nsquyres\nweathersby\npalladianism\nschelp\ntwentysomething\npreki\nrefractions\nmearly\ntrauth\nslusser\ncoquettish\ndewhirst\ntague\ntocumen\nohmi\nhayatullah\ndomiciles\nrooi\ncomgall\nyarber\nnewnownext\njinyu\nghajar\nstratego\neastvale\nbalawi\nbalko\ndinged\nmooreland\nlarken\ndoihara\nweixing\nmeiring\nsifakas\nravelli\nvoltz\nmawkish\ntsvi\negolf\ngastarbeiter\naouzou\npogany\nharmse\nvagos\nretronym\nbacchanalia\nkeulen\nmisinform\nuproarious\nfelty\nhasham\ngenito\norphean\ngerlinde\niroko\nrabiu\npersecutory\ngadon\najan\noverholser\ntrên\nbeachill\noverhyped\ncicc\nnajmuddin\ntashman\nliljana\nefecto\nthaer\nbravehearts\ntmax\ncalumnies\nshmita\nunscrew\ncoalminer\npegylated\nstampeding\nzanini\nmercey\ncrumpacker\nmacerated\nzarai\nassignees\ngriffing\nbookforum\nkoth\nmmbt
u\nplunderer\narents\nanticommunism\nhuiwen\nmcclintic\nscreeches\nhabtoor\nvgm\nelsaesser\nvergès\ncarload\nkilocalories\njhong\nbery\nscaa\nruppel\ndifrancesco\nafellay\nachieng\njephcott\nmytouch\nhumpbacks\nhegner\nwallaces\nmajdi\nguangshen\ngenei\ncrago\nqingshan\nkhnl\nclx\nschygulla\npulheim\nbutetown\nwlf\ncims\ncostinha\nintersted\ntagliapietra\nbollettieri\nlexden\njuggy\nnandlal\nrull\ncaaf\nshoppingtown\nzaini\nbronchopulmonary\nsaltalamacchia\ngbit\nfarraday\nabil\nzambelli\ndihydrocodeine\nvilasini\nthumbtack\nmiescher\njalpa\ncourttv\nbinyan\npangasius\nschlereth\nbijaya\nfreiberga\nvaler\nstenciling\nroundhill\nslaughterer\nlimone\nklimova\nmayorov\ntommo\nfaçon\ncineteca\nhanzel\ncherono\ntweel\nmemorialization\nimmobilien\narlindo\nevanna\njessika\nnkhata\nrazif\nfaura\nspermatheca\nmanini\nmocksville\nchemosensory\npolland\njantz\ncoiro\nkopechne\ncleansers\nmaenad\nprecht\nnominet\njenckes\nchickie\nshonn\nhorserace\nriska\nfacc\nunfurnished\nserbelloni\ndeclercq\nkoury\ngroothuis\nduberman\ncofton\npenjing\njaveriana\nphiliphaugh\nchiado\nrobocalls\nactualities\nilian\nhemostatic\nsqueers\nloewner\nfabricant\nmonin\nkunzel\nclovio\ncualquier\nbilin\nbatching\nhuayang\nhulsey\nwender\nmakamba\nmedicolegal\ndiscomforts\nmatelot\nsupercheap\narnone\nunamid\nwebshop\njasem\ndanau\nxtec\nacetonide\nboaventura\nboche\ndaykundi\nicaew\nemry\nbutti\nejaculates\nsubsite\ngoldstream\nulmo\nderogative\nzwan\nczapski\nwaterflow\nserebryakov\nresolven\nhepner\neiteljorg\nbronzini\ncunanan\nslaidburn\nmacie\nvilleret\nclwb\nverissimo\nfibich\ngustine\nfrigging\nthanon\nillogan\neumenides\ntoeic\ncompletley\neshoo\noleanna\ngoudelock\nclanfield\ndagr\natambayev\nmangochi\nyseult\nyadira\nmejillones\nhaward\ndippin\ncuddihy\nharrows\nbarnabus\ncampagnaro\nfluffing\nschuhmacher\nhaveeru\ngodlewski\ndamani\nbraddy\nconstantines\noverfly\nsunpower\ncetin\npinnell\nbensinger\nreadin\nembroiderers\nchillington\nloda\nblowholes\ncybi\nerazo\nenshi\nserm\ngingin\njav
elina\nmarieta\nrancheras\nenvisat\nunworked\nqeh\nahwatukee\nakker\numnak\nbuckskins\ngabelli\nmosto\natotonilco\nnaufahu\ncapitalisations\ncorell\nunwerth\natong\nmmogs\nelectrochromic\nlandel\nyoungson\nmultiplatinum\npader\noxshott\nhacen\nequestre\naitmatov\nsupersports\nhanish\ncodsall\npontis\ngholami\ninterchangably\nscholder\npinin\nmotamed\nrufai\ntotus\nstape\nhorizontals\nringland\nwse\nscheier\ntassin\nlamkin\npalffy\nzavaroni\nvetra\nmenya\nhune\ntrailside\nlissack\nboswellia\ntrauner\nhyd\ndeyn\nwoodes\nmeshal\nthalheimer\ngrimacing\nstagliano\nskywards\nthreave\ndefreeze\nlundahl\nsattam\nkachinas\npooped\nwetheral\njanetta\ndicussed\ndeline\nkerkar\nvigia\nfrohnmayer\ncollister\nbeltrão\nmalov\neisenhart\nrold\npuriri\ndoozy\nchiseldon\numdnj\nshaznay\nquogue\npells\nspiv\nstovetop\nwrington\nabersoch\nheadpieces\ndreadlock\nmycenaeans\nsetchell\nmarick\ntypescripts\ndriveshafts\nssme\ntunel\nbabydoll\ncottrill\nparenthetic\nmuroto\nnarina\nlhv\nbrithdir\nsaxitoxin\ndouthat\nreznikoff\nsilangan\ntrouve\nwendall\ndungarees\ndebunkers\nbifa\nyil\noutcries\neuxton\nkoslowski\ntramples\nrathfriland\nquadricentennial\navallone\narconada\nppq\nncoa\ningvild\nurth\nsaffire\npaseka\nfoshay\nshrimping\nauh\nnightclubbing\nstrete\nngunnawal\nturfed\njosselyn\nbjörgólfur\nnetflow\nkinvara\nmaracatu\nmcgroarty\ndilana\nmckayla\napperance\nlanter\nperseveres\nitaituba\ngrabeel\nbispebjerg\nshripad\nlactococcus\nlemelin\namplexicaulis\nmcnew\npantech\nchisos\nparries\nbeldon\nrockslides\nyur\nfika\nsomnambulist\ntianna\ncroma\nmasahisa\nklestil\nstrank\nmadiba\ndongbei\ndaylong\npcaob\nfgp\nkeron\nbortnick\nkotb\nlithotripsy\nsakineh\ncomoran\nragù\niex\nmacina\nparadip\nandalas\ngirata\nbizot\nwigwams\nboaden\ndirani\nescitalopram\nctbto\nluebke\nafdc\nsundman\nreisfeld\nkfan\nbeefs\nfakirs\nyarkovsky\nfasd\nchristofferson\nreedsburg\npunctilious\npinholes\njaggar\ndeandra\nspeedweeks\namerasia\ninterbrand\ncanvasser\nkutama\nrauhut\nmorlacchi\njbel\ntowell\nbel
marsh\npannonica\nrunk\nsnsd\ninvestcorp\ndegette\nstreetwalkers\nchooks\nalgodones\nswang\nibos\nmediatheque\nknapped\ningomar\nquetico\nsrbi\nwinked\nunagi\nisabey\nmravinsky\nnoritake\nghaly\navera\ndiann\n✤\nargentum\nebrard\nesslinger\nmhairi\njaylen\nmichl\nmontmorillonite\nedps\nlovechild\nobjectif\numbricht\npiskunov\npiros\nalltogether\neavis\ndormont\npetley\ngeostrategy\nfxs\nmaalik\njohnstons\nlanterna\nhoos\nkochen\ngreenbelts\nnassib\nacquainting\nsgf\nhegyi\nagranulocytosis\nwaldrep\nruddick\nsekka\nmcneeley\nguye\nhedgpeth\nogyen\ndyersville\nploughmen\nbalgownie\nlejon\nstojanov\ntrefoloni\nbloodsucker\ntakehisa\ncrémant\ndoland\nsaltsman\nhabbaniyah\nrouvier\nhypocrisies\nobree\nandsnes\ndatil\nmazlan\nhagon\nmastronardi\nperper\nsingleness\nuncolored\ntopas\nkoul\nhardiest\naddingham\nrestinga\noutriders\ndiagnosable\nbaucom\nstanddown\ndemont\nambushers\nfigurations\nwillendorf\nreinold\ncoun\ntonsillar\nallsburg\nexotique\ntimoci\nwowereit\nsamaa\nsinagra\nfouche\ncreditably\nbelgo\nallenhurst\nhongren\nmollard\nsarmatia\npinewoods\nhnat\nsuperleggera\nramsauer\nkammen\nabaroa\nporcini\nkatib\nmieses\nbistros\nvetrivel\nshackling\nbremond\nphysicals\npourcel\nkarunakara\nuncaught\nuams\nlandgraaf\nkace\nsaine\nvemork\ncpre\nbritanny\nkisling\ndeji\nbellmont\ngrindon\npratte\nloevy\nmingyuan\nwonkette\npapanicolaou\nborras\nshihabi\ndolours\nunderachievement\nmaulavi\nkoor\ndérive\njayasekera\nbesnik\nverio\neisenstaedt\ngoodis\ndeuba\ncapus\nrotis\npickover\nsedar\nsatt\ndartnell\npenair\ncavenagh\nharshbarger\nsolida\nknaak\ninstitue\nmajestie\nchiaro\nluisita\nrewe\nhagstrom\ncardium\nzittrain\neyecon\npemulwuy\ncowperthwaite\npletnev\ncheron\ntirunal\nhypertrophied\npatency\njerad\ngalesville\nstlouis\npirotta\ndeclaimed\ngerst\nliffe\ncinematical\nlwc\njahvid\ntoolson\ntibaldi\nmasuku\ncastlerock\nrethinks\nmuratov\nboisei\nsugarcreek\nbounder\nresul\ncothi\nbeasant\nstarlit\nosbourn\ndya\nsturdiness\nskechers\nozona\nmerican\nmyn\nencephalo
pathies\nlews\nbrocton\npenzias\nsagano\ntenex\nfmla\ncusworth\nsunwing\npippy\nbordelon\nlandsgemeinde\nelmfield\nhaci\ngeant\ncharrier\nembrapa\ngigapixel\noverqualified\nsorín\nkarpo\ntheretofore\nalterio\nlengel\naudy\ncastillejos\nardgour\nfrenemy\nmiladin\ndeyang\nlangberg\nkonko\nundercarriages\njeschke\nfreekick\nquintiles\nbirdwatch\npetabyte\nminiaturised\nbirker\nbonfiglio\nknaphill\ngrzybowski\nreepham\nchengyu\ntremadog\nhorni\nrcv\njasser\nkjellgren\narrowe\ncolourised\njagt\ndunloy\nzonk\nincompletions\nefas\npiquer\nloso\nearthship\nshochet\ndeescalate\nrolodex\npriapism\nperkinson\ntankage\nrivkah\nsuthers\ncommision\nwhihc\nwaclaw\nguidoni\nhazon\nlucanian\nweightier\nkaslik\nkassi\nsleepin\nhuilong\nemasculation\ntrank\nrxr\nskiatook\noptronic\nsicher\nfishscale\ncabanes\nmytalk\nkroh\nburgeo\ncuper\ndoughy\ncordials\ntsukioka\ncampur\ncaplet\nschnellmann\nbroadheath\ngalanthus\ncrcl\nccts\nuptodate\nmeeran\ntadic\nkbw\nasiatics\nleïla\notelo\napéritif\ngodrevy\ninchcolm\nsavigar\nfackler\nsolsbury\nwoodchip\nquints\nmetwally\npriolo\npanarin\nniggli\nbreezewood\nabjured\ndcsf\npotanin\nhartzler\nenchanters\njcpc\nsperos\ngatot\nworoniecki\ntrago\ncastex\nkhaibar\neavan\nbolwell\nmelones\nfaes\nfriesner\npennisi\ncystatin\ncfsp\ndifiore\nagami\nmicko\nborowiec\nhodgenville\nnariko\ncascadian\nknowhere\nferebee\nlanghoff\nyagudin\nbith\nakinwande\nkapow\ntemirkanov\ncrackpottery\nilling\nnzoc\nfitzharris\nhydrilla\ntanura\ntasia\nuvo\nverson\nkaleta\nkalapana\nvalcke\nreappraised\ndillin\nchinamen\ntanni\ngrunsky\nchmp\nakwasi\nlisner\nloughrey\nqdos\nmoldau\nnushagak\ndeducts\nsylviane\ndwaine\nveiller\ncabriolets\narle\ntelindus\narchipelagic\nfoxdale\nluders\nbabesiosis\nelectrik\nmammillary\niseries\npatato\nlakhpat\nstutterer\ngmrs\ncropley\nmahakam\nemilion\ncampamento\nkasane\nportraitists\nramathibodi\nsugarbaker\nwirtschaftswunder\nacorah\nrankins\nbellissimo\npolypterus\nraida\nanicia\nvoorschoten\naberle\npettway\ninstanced\njúcar\nghedi
\nbleckner\nmorazan\npreisner\nshapey\ndipu\nognissanti\nanastacio\ngunj\ndhupia\nhurva\nlewinski\nspalletti\nkyrre\nblee\nhueber\nfloccari\nfarningham\nvelayudhan\ncytopathology\nkotova\naorangi\northoses\nminz\nagriprocessors\nmecu\npyrethroids\ntwix\nrfn\nazali\nvog\ncorruptor\nolana\nhaughtiness\nmayama\nconvivium\nmroe\nkahekili\nbarretta\nroborough\nsaltdean\nbonderman\ntarbela\nmurney\nspringerville\nshorr\nnosh\ncommish\nmonotheists\nunimplemented\nmomaday\nkickass\nworkhorses\nnebulizer\nhomozygosity\nregaled\nnols\nciak\ntianyu\nstavropoulos\nnapolean\nbeere\nlofland\njatinder\nmezzanines\njazzfest\nservery\nmarei\nlongboats\nhoppity\nngaire\narensberg\nstensland\ngargling\njustyn\njalopnik\nbongs\nglazkov\nbleary\nflender\nkahveci\ngayheart\ndutugemunu\nlogrolling\ncorvid\nchaohu\nboadilla\ngobin\nkolon\nnightrider\ncircumlunar\ncompañeros\nannise\nvlore\ncharterer\nzuck\nyemenia\nschwenninger\nnewitt\nnavali\nluzuriaga\nkaboul\ncortini\nemsa\nsunnism\nblumenauer\nleuba\nslagging\nkoce\nshena\ntanny\nmikie\ndongying\ncaymans\npolyhymnia\npresburger\ndober\ncahal\nmutola\ncandidatures\nmaxtor\nkonigsburg\nmacbooks\ngossman\njanan\nagere\nshatkin\nrecusals\nvvip\nwipha\nroenicke\nanghiari\najemian\nrutigliano\nterlizzi\nwolfers\nevictee\nsaibal\nwashingtons\nnici\nsyringomyelia\nbezold\ndribbles\neapc\ndismaying\nyagya\nfontanella\nbolds\nniam\nfoxhill\nflits\nfores\nmimieux\ndivorcees\nazfar\nhwacheon\ncazes\neventualities\nszewczyk\nnortons\ntreherne\nogbu\nbeanfield\nference\nslotnick\nstiffkey\norting\nunglued\ngroop\nbocci\nanthing\ndefensin\nchavs\nlaisser\nclercs\nchristianne\ntidiane\nalaniz\ntnl\nfawwaz\nkedia\nauswärtiges\ncipes\nemiratis\nspivack\nteabag\nhadal\nquashie\nnurserymen\nprotegee\nrajpath\nbarcino\nwhithouse\nrichardsons\nanswerers\nipec\nmerteuil\nsplenda\navians\nwatcha\ntucheng\nmusashimaru\narvs\nastounds\nmckeough\nboppin\ntroller\nbuckhalter\ndahlhaus\nlusha\ncenterstage\nwirehaired\nlovington\nhayder\nsauchiehall\nfamiliarised\
nshneiderman\nrusbridger\nantolín\njassin\nshopfront\nbronchoscopy\nshuafat\ngroundings\nconen\nbaetic\nweder\nhydrologically\ninflecting\ncantini\nhypochlorous\nconventionalized\nhng\nliposome\nhadjidakis\ncoastguards\nchinatrust\nzaydis\nsmurthwaite\nsomwhere\nusaha\nanticholinergics\nlincroft\nneuroleptics\nthujone\nbaquero\ntopel\nflybridge\nqaanaaq\ncamplin\ngraphische\nboothman\npawa\nsupercalifragilisticexpialidocious\nanxin\nbelloy\nthumbtacks\ncentime\ncolourblind\nmatrika\nbloodsucking\nschindel\ngateau\npickaxes\nexecutory\nchimanimani\nfoose\nkidane\nanisha\nirsa\nworthier\nshoichiro\nbavel\nrepka\nniugini\nscj\nwaken\nflashers\nbeavertail\nstev\nrosel\nollis\nthereunto\nmuhannad\ncampeon\npakhomov\ndamphousse\ntyer\nmosshart\nmodernities\nunfitting\nventosa\nexternship\ncrimestoppers\nnondiscriminatory\nmignini\nnastran\nmachir\ncortright\ngyrating\nliou\nnotar\nneurobehavioral\nroki\nasklepios\nphenanthrene\nmittie\nchayote\ncarbonization\núltimos\nrakoczy\nsumps\nsongkla\nsehat\nsegregates\naprica\ncarraig\ndejah\nharborview\nistook\nthrowaways\nvgt\nuntimed\ntatty\nviliami\nrodding\nradde\nminkin\nbuq\ngaillardia\nserveral\nalmereyda\nblazkowicz\ncbos\ncharalambides\nkoppes\ngaetti\nwmtw\nbonfanti\nhaibin\nsalame\ncoccia\nuscb\nprosauropods\nmammatus\nusip\nwalberswick\nvota\nuncinate\nhalia\nsortir\nluckier\nlederle\ncalafat\nmunny\norgon\nfangirl\nfirebreak\nashim\nfulbourn\nmaritim\nevoque\nhunsicker\npippins\nlefthanded\netymologists\njarai\nlandowski\nmedin\nbuerk\nmedmenham\nneofelis\nhabibollah\nbco\nultramodern\nweatherston\nunequipped\nzettler\nheceta\npartee\nangiogram\nfastrak\nhindmarch\nincontrovertibly\nmdis\ndimino\ndisassociating\nsurena\nbarbershops\nrobotham\nendlich\nuntutored\ngennifer\nhannoversche\nlanre\nsuppositories\nkauppi\nsuplee\nstackers\nrapiers\nwhorwood\nharrad\nnasza\nbroaches\noutraging\nfauria\nkiira\nvoyce\ndunnell\nfarang\ndelftware\naudiologist\nincinerates\nherriott\nfattal\ncongresspersons\nbusinessinsider\ngam
emaker\naverett\nrendezvousing\nnaspers\navere\nmarzan\nspfa\ngabrielino\nendz\nangulation\nmariz\ntomey\nkcsm\nabandonments\nmccunn\nmerchandised\nmaxcy\ndaudi\ncentex\nquiles\nrrb\ndownend\nbuth\nlucasta\nwalger\nhengistbury\nrizla\nlefel\nvladeck\nkonfrontasi\nbromby\nbernhoft\nluyt\nsnogging\nborkman\nbloodier\nmeloche\nflextronics\ngalstyan\nconfort\nnorbit\nhudsucker\nstratis\nclach\nbertine\nruhlman\nfahmida\nbaudet\nbethge\nfollowups\nchristopoulos\nparera\ncounterrevolutionaries\ncrowbars\nncate\nkiprono\nspelunking\nrieser\njerricho\ncabera\nskewes\njianxin\npercolates\nbarile\ncastlegate\ntrovoada\nnimbly\nschiraldi\nskypark\nmittleman\ntetela\nionov\nwolstanton\ndziwisz\nparsnips\npsychostimulant\nudl\nherlitz\nneutrogena\nindicts\nagian\ntransunion\nnamir\nfelbrigg\ndonalds\nclevland\npétrus\nvalveless\nlaugher\naperitif\nevere\nkylee\nmatteis\nmcandrews\npigweed\nmervyns\nbrügger\ngarel\nkamprad\ncasamento\njerusalemite\nstamets\nromaric\nwetware\nraczynski\nunshackled\nranitidine\npattnaik\nyardy\nknoweth\nbrimelow\nvimont\npreckwinkle\ncoomans\nthaden\ncancio\nobrien\nkariwa\ncontextualise\nbeadles\naltomare\nnevitt\nnaung\nmusicmatch\ngisburn\nbrocades\nkastler\nszczebrzeszyn\ngendler\noverholt\nalexandrite\ncpsa\nriffel\nkinzel\npinet\nmaxus\npetero\njallikattu\ncapirossi\nnagpra\nanalysers\nflixster\nlulling\nkarens\nduhaime\nalexian\nbigness\npankratz\nreprocess\nlandore\nsurtax\nslimbridge\nloubser\nlcg\nmontesi\nwahlen\ntessio\nteleven\noberweis\nbakhit\nvogons\nhinky\ninea\nquintos\ntorto\nnypa\nstothert\nbiocide\nsanghera\nzumbo\nintramuscularly\nniederhoffer\nnonstops\ndropoff\npedlars\ntendu\nmones\nimformation\ninot\ncpuc\nmchedlidze\nrosalita\ndesalegn\nramb\ngiampietro\ninma\naxxess\nbergler\nfalluja\nscozzafava\npetrifying\nbackstabber\nkaiya\nsory\nnewminster\nwarbles\nvisagie\nskycar\ncoalesces\nneuhoff\nadmi\nharput\ntransurban\ngronow\nurzúa\nhalicki\nsirakov\nagramonte\nalighiero\nmargarite\nzerhouni\nfidan\nbutaritari\nsakonnet\nr
ustington\njcaho\nbettoni\nreconstructionists\ntoretto\nembsay\nmyl\nosnes\nafriyie\nrnz\ndirrell\nschlamme\nkalemie\nmekonnen\nhumidifiers\nsanlu\nshoshan\ngasson\nguzan\nstrevens\nlabriola\nhemswell\nherky\nhiob\nunlettered\nrdl\nslep\nennobling\ngravida\nzacky\npurdey\ngavialis\nreabsorb\nblackpoll\nhenrikson\nparboiled\nmooncake\npsychobabble\nzhengding\nwsaz\nscows\nloweswater\ncrunches\neverardo\nhasu\nhdm\nbonamy\ndeblanc\nremonstrances\nepidermidis\nhäusler\nheskett\nmceveety\namatuer\nkappes\nknoc\nkilcoyne\nqadis\nmikiko\nphalanxes\nbioblitz\ndael\nhinske\nhomeruns\nsignoff\nklinsky\ngitanos\nstolte\nebright\nwinnett\npitsligo\nplomin\ncarwash\nagolli\nquesadillas\nremolded\nringstead\ngriffes\ncommitteemen\nhuella\nlco\nkrenwinkel\nirland\npoplarville\nsherron\nbreif\nambrosetti\neastry\nfininvest\nrampe\nsapote\ntanzanite\npercents\nseijas\nelenore\ncreux\ngovender\nlicciardello\nedmon\ndemurrage\nsaraiya\nenergizes\ncanem\nmesko\namlak\nanbari\nstelter\ncsdp\nhenricks\nchassidim\ncamb\ntrawick\nhighroad\nafaf\nslackness\nkeba\nyangjiang\ndawsons\npunchier\nredoctane\nnaïf\ndresner\nglebova\narmlets\ntratt\nkff\nibolya\ncockneys\nbolina\ncarboplatin\nmathie\nfetcham\nflorido\nheacock\nemanuelson\nmyxomatosis\npostmodernists\ngormaz\nhattab\nsawers\nglandorf\ntyminski\nnorihiro\nsciri\nkrik\nmilicevic\nnaranjeros\nartw\nscates\nbasting\ngigandet\nliyuan\nfreedomland\nschier\ncollingbourne\nskeggs\njahanzeb\nsheldrick\norganogenesis\nshashanka\norgaz\ncolourists\npeppiatt\npromyelocytic\nashden\nstotesbury\ntonners\nheraeus\nreovirus\nberriew\nzecchino\nroxio\ncywinski\nbluestocking\naramon\nmadyan\nveneziana\naldate\ntfts\nlagares\npfcs\nranevskaya\nwinster\npurgation\nbunkum\nbrentnall\nthilan\nsnorted\ndisabuse\nneuroeconomics\nmarcq\npeaslee\ncynopterus\nimpeachable\nkilojoules\nrahaman\nyarrell\nvirpi\nmuggings\nchega\nhossaini\nepistemologies\nweisheit\nwallbridge\ngiacalone\nwetherspoon\nmccooey\nrebennack\nfinito\nypc\nbelarmino\nnoja\nunmanagable\
nzpm\nsabac\npbd\nmfe\nheadbutted\nreimpose\nfreilich\ndrays\neatwell\nofficialy\nemrich\nutkin\necpat\ndilhara\nkoonce\ngraybill\ndontrelle\nwhitbourne\nlozoya\nwroblewski\nzaf\ncrocheting\nsmbc\nborsch\ncods\nlockard\nophthalmoscope\nbrioude\nskywatch\nivindo\ntreas\ntirole\nclausa\nlamizana\ngoathland\nhemdale\nstoneking\nphotinia\nchingachgook\nwinawer\neskew\nmcgillion\nbcv\ncartographical\nrebirthing\nmarsoc\ndressup\nabdc\nharangues\nbellefeuille\ninsupportable\nbulgaricus\nafterbirth\nmortgagor\nsmilie\noutpointing\nghotbi\npenseroso\npanchenko\nsalsbury\nwgba\nkwek\nequilibrate\nobuya\nmedzilaborce\nwildscreen\nhattaway\nsmeeton\nmcwhinney\nputah\nporridges\nattash\nextortionists\nlemn\nscandalizing\nosako\ntsotsi\nheorhiy\ncockfights\nfuttaim\nfolklórico\nostuni\ninfarctions\nhfg\nnavetta\ndeathlike\nondoy\ngwyther\nkonchesky\nhelleborus\nhakamada\ncauser\nrdio\nfloury\nhejian\ncringing\ndracena\nrastus\nresponsiblity\nmoul\nkhloe\nzelimkhan\ntoufic\nfontanes\ningratiated\nredshirts\nshanwei\ntompa\nmoretonhampstead\nphotograms\nneofascist\ncarnauba\npalis\nparainfluenza\nheinke\namboseli\nfakin\nffordd\nsolyndra\nbrickner\nhoun\nphut\nhemorrhoid\nhillington\nnater\nsarwat\nirkut\ncndd\natca\ncapgras\ndanakil\navpr\norbelian\nglentworth\nlennertz\nleiris\nbumppo\ncrsp\nmakela\nsillyness\nperemptorily\nlatchmere\nfascinations\nfecteau\njuicing\nmeddled\ncax\nritva\nunscramble\ncycler\nnijo\ngulian\nextendible\nimas\nteijin\nmegaphones\ntokenism\ngrouted\nsuppressions\nseigniorage\nweronika\nhampsten\nsamandar\nlishman\nrissington\noakum\ngynradd\ngarlett\nchulabhorn\npulex\numunna\nruis\nbandwith\ntarmizi\nsoskin\naigen\ndoyles\nseamed\nachin\npvh\nselles\nbaptising\ntramaine\narnside\nirsn\ngnomeo\nembarrasing\nlamberth\nmucopolysaccharidosis\nantispam\nmolehills\ntrewin\ncolly\nnhrc\nrotundas\ngrabban\nflannels\ntegin\nammendment\nhoneycreepers\nskymark\ntrys\ncitti\nnaomie\nmoso\nwaye\nmisreads\nroary\nduffle\nwalentynowicz\nsaadiyat\ncumani\nfigl\ntursi
\ninfernalis\ntallant\nhys\nklsx\napethorpe\nvecindad\nsensationalize\nmeden\npesenti\nurdang\nscheide\npartem\nshengnan\nlandhaus\nleucadia\nnicotera\nnorvig\nalimentos\nrids\ntrifon\nencasement\ngianandrea\nhoehn\nswoope\nfeferman\nviticulturist\ndemotes\ntoun\nicarian\nswop\nxiangfan\nsirona\nalfasud\nbelyea\ndecurtis\nncta\nboyland\nquaaludes\nkettlebell\ncearbhall\nbrustad\nmousing\nlovastatin\nterremoto\nkorky\nofakim\nbaseload\nmeteoritic\nvetsera\nmadonnina\nsecedes\nbreadcrumb\nstcw\nbauzá\nosam\nramaphosa\nbachpan\nbarrak\nshahrastani\nbricket\nmanafort\nsketcher\nzhizhong\nlengthiest\nsolium\nmidcentury\nbrightstar\nabdelghani\nbrignac\nbandirma\njadwin\nprestonwood\noxtail\natheros\nfushi\nkiy\nhunkin\npeewit\npressroom\ncalixte\nremnick\nmcevedy\nschrag\neurobonds\nbhm\ntraje\nthsrc\ndamico\nsames\nhuszti\nrandol\naberdaron\nbugden\nastrogeology\nresuscitating\ncandyland\njulija\nstreptocarpus\ncortada\nlound\ndearmer\nactividades\navold\nvelociraptors\nserono\npropagandizing\netuhu\namihan\ngambits\ndeclassify\nkanamycin\npiltz\nschmoe\nniek\nbeir\neklutna\nseishiro\nschwalb\nbenjelloun\nverpakovskis\nmcavennie\nmeasurer\nursine\nbies\nrubirosa\nobeisances\npopsicles\nfilch\ngavdos\nblashfield\npulsifer\nsteidle\netap\nnazeri\nsagmeister\neslava\nderyn\naey\nentropia\nnesty\nciotti\nthanou\nhighlandtown\ndaylami\nshoenfeld\nbradl\ngdot\nlegree\nwcnc\nkytv\npesh\ncarparks\nlyes\nwindthorst\nbecase\nredemptions\nkhandi\nseimone\njorvik\ngarlits\nbalkany\nduvauchelle\nandrassy\njenelle\nalson\nkozakiewicz\ndanum\nbalkanization\nderelicts\nwallgate\nwidefield\npeiser\ntaling\ncoard\nnawang\nlazear\nroadshows\nirizar\nohuruogu\nsnitching\nchokri\ntecan\nshamoon\npowerpack\ncombustibles\nsofoklis\nbaltusrol\nbuggie\nicds\nfootsoldiers\nbinnacle\nattenders\ngysi\nstagione\nrepas\namerikkka\nslaby\nyongli\ncordish\nokfuskee\nineffectually\nshaher\nbeardy\npangnirtung\nhathorne\nmouches\ncordwainers\nmyakka\ngaudencio\nkostal\nbullys\natias\nriendeau\ndalmellin
gton\nsapiro\ndius\nengebretsen\nwelp\nbernbach\nléotard\nkeever\nboylen\nwieslaw\ncornmarket\nmazloum\nsirajul\nmurchie\nbinner\nsamedan\ntuckers\npolisi\ntunji\nsouvlaki\nanastassia\nmarywood\nozette\nvevay\nletraset\nvishakha\nbargmann\nataxic\ngranturismo\nmahmudullah\ncammarata\nbhutta\ndecesare\nbarberie\nhypercapnia\nhancocks\nchinos\nsandiacre\nexline\nthayet\nsapience\ndefames\naatish\nºc\nlimeliters\ncrosscurrents\ndarkish\ncroppers\nterraza\nmottet\ndonelon\nrudloe\ntaren\ncarns\nhallucinated\nvange\ninversiones\netau\nstobo\nverster\ntakatoshi\nconnoisseurship\njamshedji\ninpe\nparaskevopoulos\nwaldenström\nkoyaanisqatsi\ntransportations\nsudley\nspiderwebs\nesam\npictorialist\nnichd\npunia\npenco\ngrinter\ndaham\noerter\nmadol\nunprimed\nverhaeren\namoebas\nnispel\nmusayev\nwkar\nmarnham\nartc\nyuyao\nserowe\nhilborn\ngarfinkle\noxymorphone\nkxly\nwolfberry\nhaverfield\nfarkash\ndemunn\nburgle\ncountenanced\nbarnstead\nmarianists\nkleinrock\neliassen\nhembree\nhypnotically\nintervenor\nmockus\nmoeran\nkomon\nzhongbo\nheadwinds\npauken\npoblenou\nsouderton\ndreamless\nmacheda\nblockaders\nlamos\nubar\nrampersad\nmladic\nrajender\nkauhajoki\nactivewear\ninskeep\ntagoe\neliran\njayadev\nclywedog\nkazinsky\nbehrmann\neyl\nsignposting\ncsea\nkretek\nmpge\npollokshaws\ngoertz\nminihan\nnasrat\nnives\nhilco\nconry\nracc\nvoght\npène\njenrette\ncarilion\ntumulty\ninterrupters\nfnf\nwiegman\nbogdanova\nwesseling\ndewis\nvibra\nidrissi\nbullrich\nabouts\nfeehily\ntrustful\nresend\nsplish\nwamc\nresounded\nreppert\nganter\ngorder\nukrop\nhenno\ncoachworks\ntimetabling\npriskin\ndivertimenti\npabellon\nchiche\nminy\nchardonnet\nmodou\ntyphi\nteems\nvajira\nllandow\ncfh\nexcepts\nfolktronica\nninjitsu\nclatskanie\nhenni\nbragh\nsudar\nwilfork\nudoji\nverapamil\neffortful\npharming\nloyer\nchabala\nhurtt\ncirino\nmoglen\nmackley\nvirtualisation\nitaipava\noml\ncroissy\nchatou\nlubrano\nkindergarteners\nsanchis\nmediana\npiw\nrosta\nhindmost\nbroers\noutmanoeuvred\nlo
rian\ndavyhulme\nprescriber\npangelinan\nbushbury\nbelmokhtar\nedwidge\npulsate\naltheim\nimmunologically\npolymetallic\nyelping\nalonge\nrosneath\nkiszko\nengman\ncampora\nbroude\nbiersack\ncerra\nchannell\nfsis\nharamain\napaseo\nkosman\nscelidosaurus\nlyss\nbctv\nsantorelli\nsatre\nfifthly\nvicereine\norabi\ncayla\nseosan\ncbre\ncosto\nshahidul\nancell\nraiment\nnrv\neulogistic\nshivananda\nsenko\nadwan\nlanchbery\nchepe\ngarre\nturtleback\nthanassis\nworster\nskinwalkers\nlatika\nwitmeyer\nadeney\nbahan\nconnan\nastorino\ntzemach\nattractants\ndastagir\nrifampin\nherlong\ntenebrous\nescapers\nlimor\ncopybook\nzav\nlaxenburg\nshopgirl\nwissem\nobame\nmúgica\ninformaiton\nsopi\nburing\nmasetti\ncorentyne\naquarist\nobasi\nmatawin\nmeggitt\nkeathley\ngammond\ntalalay\nloebs\nkeevil\ndipturus\nneyra\nmcgiver\ntahquitz\nbartoletti\nlarn\ngeita\nsyniverse\ndegollado\njarom\nbobrow\numaid\nreincorporation\nbiters\nultramarathons\natex\nbrillstein\noccc\nseeberger\nvajna\nelisheva\nmalc\nomers\nsherpao\nyahav\noooo\nkimberlee\nidis\npycroft\nacara\nrossant\nsavelyev\ncsxt\ntedrow\nsuguri\ncommiserate\nrothenberger\ndeby\ngisbourne\nmcflynn\nnorbulingka\nulyett\nximenez\ndiatomite\nunindicted\nkareen\ndaelim\npgw\nespci\nluxair\nscoutmasters\navidity\ncorsage\nlindor\ncarburetion\ncashell\npolyamides\nkarrar\nebinger\nrenovator\nschijndel\nsickens\nwillimon\nmasculin\nmisfiring\nblatnik\naugusts\nconfidantes\noberste\npopemobile\nxiaofei\nstarlink\ncamile\nwase\neisinger\nqueering\nreedsville\nsheldonian\nmonochord\nlevered\nrushi\nlibelling\necosphere\nmultiview\ntomate\ntransall\ntifft\nhelpe\nhotpot\nspruced\nwafs\nreinvestigation\nduddell\ninterdigital\nharad\nmandic\nkulagin\nstanback\nyangshan\nthoughout\nscandalously\nholograph\ngagarina\nmcfaddin\nrawiri\nvolmar\nrías\ndoub\nnanometre\ncias\nhargett\nstinkin\ncrumpet\nboiro\nfauziah\nminucci\ncorrodes\nhollingbourne\nsicheng\narborists\nsultanahmet\nmaliks\ntobis\ndemanda\nchastelain\nnusret\ntridacna\nbucklebury
\norteig\nmoscovitch\navea\nlineout\nhilles\nyuzhou\ndomar\nxifeng\nlivengood\nrouass\nlongeron\nsportsworld\narroyito\ncorozzo\nlorak\npaillard\nsatala\nhosler\nscrutinizes\nphotodetectors\nstairsteps\nherget\nboehlert\nmilitello\nlollies\nokies\ninfotainers\nbellusci\nmeridor\neloff\nsarsat\nduale\nsirhowy\nsteacy\nmoorfield\nmetrotech\nloker\nannamma\nbédié\nhinnom\nberic\nlalich\ndevaluations\ntheatergoers\nbrowsable\ndasch\nendotoxins\nohb\nwoss\nkugluktuk\ngalabru\nzarr\nbioacoustics\nsauropodomorphs\nsaltney\nculbreath\nmauck\npendency\ndejagah\nreguera\nspampinato\ndrumroll\nsucumbíos\nsolipsistic\nlindpere\nchorda\nexplorable\nbunjaku\ncloseout\nstokesay\nallegre\ncallebaut\nhaniel\nunthinkingly\nlahun\nrisberg\nactaully\nhaenel\nvassiliou\nleukopenia\ninsulza\nfleuri\ntransgressor\nremon\nmoorehouse\nminotti\ncouhig\npedicure\nridgers\nmalbranque\ncoronilla\nnetherhall\nzakes\nstrache\nvotel\nmediterranee\nnahmias\nisport\nbilger\nhayseeds\nmaindy\nrichenthal\nvirgine\nlibrescu\nebbitt\ngramlich\nboonie\nlcts\nbonnor\ndecaydance\ngrecco\ncoffeeshops\nspinello\nmedco\ngasb\nplasson\nwisemen\nmahals\nedek\nchapell\nohiopyle\nakara\njorrit\nivone\nloel\nmitsunobu\nstarsmith\nmetafile\nmantu\nsturzo\ngruchy\noord\nrevolutionising\nchangfeng\nbeheira\ndirectgov\nschyler\nisight\ndoyley\naberg\nranters\nlobectomy\nworkups\nphenomenom\nmccs\nsambava\nlehendakari\nkhanal\nszeemann\nstickland\nmocatta\nlutoslawski\naafia\ntarong\nhomewares\nlantry\nradzinsky\ncefalo\nbrokenness\nsalini\nmadidi\ntroxel\nhueytown\nbentiu\nkahui\ncantalupo\ndeciphers\ngobbledegook\njimmer\nnoisiest\nndes\nmarcora\ngeers\nparthiv\nmustique\nsiberut\nrushmere\norahovac\ntuat\naez\nconkers\nfricka\nrambled\nxacml\nrisus\ntapert\nsheers\nhetauda\nchirwa\nkopylov\nszohr\nfutagawa\ncommments\nrtz\napocalypses\nshamo\nangeliki\ntopnews\ndessay\nchalford\noesch\nthermogenesis\nmicroevolution\nnobuteru\nmontrealer\nbridgeway\nepicenters\nrefound\ncolonially\nfeal\nchère\nwaitlist\nsamosas\nyas
si\nsampoerna\nvicta\npross\neija\nnanhua\nkalihi\nexplicating\nanamaria\nhaymaking\nguoyu\nconcertantes\nargentière\nfleta\nmaxell\ntibetology\nbostons\nhfb\nfarry\ndremel\nrmjm\nportago\nsainath\ncasona\nphilharmonics\nfoxcatcher\nliden\nmisplacement\nmefloquine\ntyndrum\ndarbishire\nktbc\nunalloyed\nbpn\njabbour\nkoehne\nminocqua\nefmd\ntazi\nvelino\ncollabo\nteschner\nconcious\njuhn\nscorekeeper\nbulkington\ntarsalis\nnasality\nappendixes\nsamsing\nkrahulik\nchervil\npleo\naex\nshabbona\nmurley\nmaintainence\nflunitrazepam\npegeen\nvictimize\nmtbi\narcam\nhyperrealism\nneea\nabassi\nyalding\nimco\nvgc\nmaeby\ndarleen\nwilliford\nwprost\nivison\nspirochete\nbarabati\nsibella\napatzingán\nhashmat\nbewitch\najaib\ngelson\namputating\nsads\nunibet\ninterjet\npppoe\narrt\ngeomorphologist\nclemen\nbiswal\nedamaruku\nhairball\nfawcus\nguoliang\nterzis\nthyroidectomy\nberkovic\ncalan\nrosiers\ncoillte\nbugloss\nvalleyfair\ndelane\noeser\ndeckchair\nmontjeu\nhambrecht\nexcisions\nedris\nalfama\nqeqertarsuaq\ntracksuits\nbense\nvvn\ngâteau\ntetany\ncastellitto\nstunners\nfamiliarisation\nironmongery\nnexgen\ndaquin\nexplainer\nborghetti\nguandong\nxanthan\nkiichiro\nintented\nflatted\nzeuner\nblonder\npreformatted\nvoorn\nedhec\nrecants\nundyed\ndeidrick\nvenality\nrileys\ngimps\nrooibos\nglynnis\nrasco\nblechman\nchurruca\nmaxxi\ncellan\nkatty\nharai\nbéatrix\nnonmilitary\nlenine\ndke\nemili\nhiles\nhammell\nsungard\njacobinism\nsalur\nhallowes\nwebshow\nfacilitative\nsopel\nrestructures\nincludable\naeds\ndisfigurements\nokonomiyaki\nshimpo\nringmer\nboogerd\ncarneddau\nsommeliers\nnebraskans\nturay\nschopp\nneuroprotection\nnevs\nguesser\ndurrus\nwvec\nvaculik\nlewellen\nmescaleros\nmcmackin\ncloward\nweaponization\ncortado\ntiahrt\ndaker\nortonville\ncaborn\ndarrelle\npolmar\nbyland\nterumi\nsemb\normen\nkarrin\ngalyani\nrobertsdale\nnanney\nwaipara\nbryggen\nrossford\ncaputi\nunmeasurable\nairfares\npajon\nredentore\ndauvergne\naltcar\nverbrechen\nbbtv\nbourzat\nnaza
rova\ndemodex\nmondini\nseya\nprogr\ngordievsky\nregrowing\nagba\nsquashsite\nporath\nsitemap\ntommorrow\nmanh\nrubbia\nfarmacia\nhedstrom\nbenrubi\nbleaker\nwoth\nbarmore\njacir\nmegastructure\ndefar\nmnb\nkisnorbo\ntruglio\nnothign\nhooman\nsilvi\nmikeladze\nceniza\nrysselberghe\nsystematizing\ncrusaded\ntasco\nestacion\nspringborg\neorl\nmicroporous\nbarukh\naseman\nsidiki\nfresu\nbesterman\npujara\ngleditsia\ndenbo\ncebrián\ntaittinger\npachter\nballymurphy\nfiloni\nliquidations\nchsaa\nyongbyon\nconservativism\nhydrochlorothiazide\njurca\nrexy\nnamik\nthermoacoustic\nfavero\nwykagyl\nchuckled\ndwindles\nhertzen\npashayev\nwinepress\nargc\nfarzand\nnazneen\npetacci\nmukri\noneidas\nachive\nhazle\nandreanof\nshabalin\nshef\nsalpa\nmottingham\ntestings\nmultiband\nsharmin\nsinlaku\nnicot\npestles\nlapthorne\netgar\ncastelvetrano\nradack\nnutfield\nafroman\nschoenfelder\nmunmorah\ngrandfathering\nkeyt\nwbls\nelectropunk\nkardam\nlumpectomy\naflam\ndamasco\nkashkari\nkazutaka\nsarlo\nlokmat\nlordstown\nchalupa\nlycanthropes\nfrejus\nborrello\nmulls\noresharski\ndlh\nhydrophobia\nkinosaki\nriam\nhambridge\nrhoose\nsolca\nbaecker\nvli\nplanifolia\nincepted\nhadnt\nahmedinejad\npocho\nsüdtiroler\ngleams\nsumaila\ntautomer\nintestacy\ntabligh\nairbox\nmulas\nlujiazui\nrebs\nalbertas\nkingwell\nremodels\nstampfer\nklammer\naviat\ndubal\ndirtbombs\nmoshiri\nerco\nrosinski\nubt\nstoneage\nmoonshadow\ncambiaso\nbiologicals\ncleber\npressbox\nberrio\nworldline\nmercola\nxolo\nhetti\npuetz\ndearlove\nresponsability\naliceville\nmakiadi\ndanns\nomai\nfelgate\nculls\nenthrone\ntruisms\nrothermel\nsuperclub\ngcg\ntrumpton\nliljeblad\nequable\ngrafitti\nseedpods\ntrouillot\nhearkens\narpi\nglasier\npaihia\nhallenstadion\nbrams\nmishael\nsalia\ncliona\nfebuary\nboisson\noctopamine\nvalco\nbedeviled\nfeenstra\nchevènement\nstalbridge\nhilsea\nclèves\nbilirakis\nnarrowboats\ngartz\nlurched\nrainton\nuei\nmoonen\nkoshland\nlobregat\nspectrometric\npeinado\nnrsc\ngladstonian\nverret\n
reist\npvm\nslover\nsumita\nnominalist\ngrovelling\nkachemak\nquechee\nbozzi\nrhinecliff\ncollations\ntarkwa\nmyojin\nmeily\nsherbert\nricksen\nsabaneta\nrajen\nbootblack\nkeenor\ngodkin\neclac\nrathkeale\ngbd\npozas\nbertholf\njso\nwarsash\nathenia\nambassade\nkopit\ngfh\ngrouville\nmanuele\nbeholders\nmakropulos\ncapex\nmayos\nloginov\nmiranshah\ntechy\ncumby\nalishah\nleaker\nbensoussan\nhybla\nsjt\ntweezer\nunha\nikuno\nupstroke\npresiden\nnikora\npajares\nbiopolymer\nneeskens\ndefrance\nbuchler\nbosetti\nalbrechts\nkelsi\nsuperhard\nsasagawa\npetti\nstiffeners\ncommis\nvrm\nluncheonette\ntoulson\nbronchospasm\ncvetko\nbefitted\natención\neustachius\njicheng\nredstate\ncareerist\nsambad\nsarychev\ndunson\nmanakins\nrelabel\nquiznos\ntorosidis\nknighten\necis\nmortarboard\nphomvihane\nbillotte\npiergiorgio\nibew\nkhairallah\nmilers\ndjoko\necolab\nebtekar\ntranspac\ncaparisoned\necstatically\ntubthumping\nmorter\nreasserts\nhamil\nrakiura\nkinkajou\nmcgarvie\nmeritor\nzoia\nradok\nrightsholders\nswezey\nreformulating\nrathgeber\npabo\nflossy\nvadnais\nditchfield\nchally\nbitterman\nsirkus\nbuttrick\nintimacies\nnxumalo\ncasalesi\nnadesan\nceilinged\nledeen\ngomo\nfrenais\nschutter\netalon\nmanias\nkarsan\ndtw\nkadan\nignominiously\nhabesha\nphipson\nbecque\npanoche\nirujo\nkernis\npattens\nuntypical\nrudner\nversoza\nvidigal\nukai\npedometer\nhamiltoni\nhaldun\nsottsass\nnesse\nreichsleiter\nlukka\nrutkiewicz\ntuomioja\ngerbera\nmethone\ntourniquets\ncreese\nnavratil\ncarnesecca\nstithians\nsupplants\nsepulchres\nspil\nwebring\nparlous\ndepriest\nkodu\ngolb\njazzwise\npuppini\nclientelism\nhangang\nporkpie\nbenevolently\nshippingport\nxuemei\ntischer\ngarnishment\ndeinterlacing\njanisch\nbmv\newens\nebusiness\nsperrin\nremonstrate\nvalances\nfissioning\nodilia\nmiomir\nbuoniconti\nmauny\njinghong\nkindl\ntiva\nhatikva\nawda\nhagibis\ntyvek\nshouf\nstepfamily\nmealtimes\nmzoli\nopenmoko\ngungor\nhense\nstrategizing\nnipomo\nlitsa\ntvone\njoshu\nbowhill\ntyrannosau
roid\nchingola\nyavlinsky\nkalifa\nchipko\nmulally\nfleetham\nyasue\nbalvenie\nhowzat\npoblano\nerotics\nshidler\ntimeshifted\nphrenologist\ncouserans\ngitelman\nverzasca\nsynergism\nthadeus\nmodelinia\npedestrianisation\namlodipine\nstemmons\npincay\nbedsheet\nsuchard\nkutlu\nbellhouse\nlonden\nwalkersville\nstraughan\nevdokia\nwatco\nrussy\ngardon\neking\ndonnino\nfmb\nfoce\nruffins\nnorstedts\npenteli\nchid\nboucherie\nberrys\ncaselle\npasset\nsobat\nurdiales\nlandsberger\nkorrespondent\nheatly\nharangody\nsupersoldier\nummagumma\nshepherdsville\nflashier\nchiou\nmorman\nsanghavi\njumptv\ncoxhead\napella\nvsh\ncjs\nrepulses\neffectuated\npyman\nrenzulli\nmeliton\nblockings\nptn\nhartsell\nnatio\nigoogle\nchancay\nqhd\naboot\nyego\nschans\ncircumnavigates\nmulta\nybm\ntombak\nhauteur\nblanching\ndigicams\ncourneuve\ncommentor\nkyriakides\ncsca\nheumann\navaiable\nhones\nhrl\npgb\nsoteria\nclayborn\nmontcourt\nfedewa\nbeggining\nsatterwhite\nskerrit\nbooher\nanencephaly\nchabukiani\nraunds\nhoofer\nsoseki\nteasley\nfieldcraft\nstrutted\nksas\nundercliffe\nbikila\nhanen\nfundings\nissawi\nslifer\ngooneratne\nmoneybookers\ndamelin\nberezhnaya\nprimorsk\nreker\ngraceless\nglissandi\nbozek\ndecongest\nverdienstorden\nniffenegger\njiaozi\nhalimah\nvirgie\nmesmerize\nklingsor\npashan\nchesterford\nkarita\npithead\nramree\nhaleiwa\nibuka\nfirtash\nkrey\ngumboot\ncarnedd\nannat\nschub\nmotherlode\nrealizability\nblasket\nblicher\npiestewa\nelectus\nkuujjuaq\nalirio\nleisel\nmusyoka\nariely\nflorals\nconkey\nnstc\nshabat\npentagrams\ncodorus\nwerkstätte\nlicciardi\nnorin\nqueenan\nteynham\nexcipients\npraca\npossibilty\nmarchesini\nvaishyas\ntineke\nrudderless\nrandalstown\nlickers\negov\ncalderbank\ncryosurgery\nbernis\ndumbreck\nquazi\nvasodilators\namericanas\nfastrack\nviridiana\nclifty\nfanis\nbraddan\nappt\ncenon\nshamshabad\nkawempe\nlykens\nbaiana\nshoval\nmiramare\ncinephile\ncarbonera\nbanim\nprevotella\nfuzzier\nsaranga\nsvanidze\npigliucci\ntorpid\ndressy\nastir
\ncummerbund\nzib\ncircumscribe\nhktdc\nnikolais\ngirvin\nwickenden\nbentwood\nstefanidis\npanamsat\nieu\nmagnasco\nnhmrc\nmaquiladoras\nfrumpy\nscreengrab\ntarapoto\nnaela\njeetu\nvldlr\njrt\nbazzani\nundeliverable\nwcsc\nzonealarm\nschulthess\ncoarsest\nebaum\nniitsu\nclubroom\nodone\ngreatham\ninattentional\nhuband\nawdurdodau\ncampisi\nbreastmilk\ngharials\nborchgrave\nshouty\nclennell\nbarran\nobliviousness\nflab\ntafalla\nriemsdyk\nhooe\neisenhauer\nccsu\nmechtild\ndzurinda\nquarmby\npistoletto\nlargeness\nbuies\nwettig\nyodelling\nconceptualist\njabri\nfrauke\nupgradation\nmelburnian\nmaniago\ncinzano\nwurts\nscreenname\ntianxiang\ninfestans\nweeksville\npelevin\nhonley\nirbe\nextorts\nlieto\ndwarfish\ndixiecrat\nantecessor\nhaixi\nvocm\ndemonetized\ntotland\nnicholaus\noverstaying\nnearsightedness\nkowt\nnaegleria\nzaobao\nalbas\npodell\ntreille\ntebay\nolayinka\nrakoto\nmuccino\nlampanelli\nsmackover\nfooks\ncryopreserved\nlatka\nrassa\nkilovolts\nsupai\nsouquet\nmentionable\nwarchus\nunworthiness\nconveniens\nrezillos\ndenkova\nslagter\nheshmat\nhaaf\ncebeci\ndevotchka\nketubah\nfrausto\ntraeth\nmubashir\nkaller\nleaming\nartusi\nferreting\ncocksucker\nabimbola\nhitwise\ncaeser\nmanava\naerogels\narag\ngoldhawk\nsaskpower\nmorpholino\ngrindrod\ncoastwatch\nsalmela\nscx\ngeoghan\npinckneyville\nprogramas\nnegrini\npinata\nheinle\npyles\nswedberg\nmillipore\nrostami\nbayad\nsanyang\nbanaz\nfody\nelding\nschérer\nyongsheng\nkarly\nsupergene\nbaste\nhoyles\nfendt\ndarold\nosipova\nxiaoshuai\nbohannan\nphenological\npeza\nsfakianakis\nfinny\nmondor\nbishopthorpe\nsahour\ntranscom\nmbandaka\nmetrobuses\nrongelap\nrasmuson\nyoshiya\nmentana\naigars\naluka\naronov\nobscurantist\nuchtdorf\nmevagissey\nfilby\nwillowbank\nuakari\nsempione\nzotto\nmodders\nshroyer\nklampenborg\ndargin\nbalicki\nbaltra\njinfeng\nddk\neastlands\nlingonberry\nrooters\nzafir\nndugu\ndownhills\ncharite\nzastrow\nhandwork\ndallesandro\nbooties\nalusi\ndonwood\ninterbrew\nblountville\nmangue
ira\nficticious\nmarchbanks\nkfox\nbkl\nshinny\nauri\nhavey\nkinnison\ntetrazzini\niwamatsu\nalquist\nalmondbury\nkretser\ngamebryo\ncompassionately\nunframed\nadjudicative\ngouais\nbatucada\nfairhill\ndefrancis\nbayed\nrubey\nutami\ndinty\ncioni\nhzds\nsinfonias\navinu\ntangkak\noreg\ntyn\nfetherston\nwighton\ndramatical\nyurchikhin\nlikability\ndallington\nchusan\nbandt\ntamargo\nzafy\nvasilescu\nwano\nwendo\nmcguffie\ntawheed\nsymmonds\nlability\nonawa\nhollioake\nsilsden\nbowood\ntorrecilla\nklown\ncomittee\nwestacott\nkarygiannis\nambe\nmansueto\nomarska\nkaspi\nwilhelmi\nkallenbach\nneslihan\nmargene\npodolak\nwebbook\nwilshaw\nchoules\nshoebill\nfushan\ngalanin\ncisf\njavale\nextortions\nhugel\nmenand\nlimitada\ncathouse\npatricroft\nteeing\nomero\nbiriyani\nscribbler\narcel\nnost\nheatherington\nundervalue\ndesegregating\npockmarked\nviviparity\nburgeoned\nkovats\ngpas\nalero\nchaddha\nillig\nducktown\nclayborne\nasphyxiating\nxihe\namod\ndtx\nlazarevic\nmetop\nhorlock\ntoumanova\nmezey\nlayng\nwanzhou\nmacksville\npiozzi\nstekelenburg\nridgedale\nbartomeu\ncommissionaire\nkinnelon\nlaboratoires\nrodrik\nflowerhead\ndurex\nmorcillo\ngodfroy\nzargar\nmelior\nkozloduy\nbonnel\ncalke\nbaughan\ncwmavon\ngoessling\npetric\nmkx\nbarnbrook\nstraffen\nfoxwood\ntoddle\ndspd\nschmucks\nrumana\nsytycd\nabsolutists\nejaculations\nravensdale\nziegesar\ncyberdyne\nlibidinous\nwaldock\nbluntnose\npachycephalosaur\nactos\nwzzm\nrodell\nzenón\nlegace\ndrinkware\nacclimatize\nwarehouseman\ncryptosporidiosis\npanzhihua\nfurling\nxiaopeng\nninel\noverpayment\narchard\nfiretruck\nmbokani\ncoolen\nickworth\nsteffes\ndavidstow\nkpnx\niraola\nommegang\nfergison\nvladimer\nstorrie\nmakefield\ngullick\nhotrod\nabrahamyan\nacron\nclunkers\nziehl\nlynford\nsplurge\nwilensky\nwahat\nmestel\nstrehl\ntrademarking\nunmanly\nbancomer\ndemag\nlaken\nkitri\nstansberry\nemmeloord\nperko\nstandage\nchording\nnolensville\nreclassed\nuzala\nguergis\nkran\nshrake\ncriers\ncliftonhill\nthondup\nade
poju\nlullingstone\njulita\ntonnages\ntransvaginal\nenamelling\nalll\njiwani\nweakland\nhairsplitting\njilib\nantep\nfyffes\nstoen\ngajda\ntremolite\nfremd\nhugoton\nnion\nbojinka\nfing\ninsectarium\njigging\npartium\nimgs\nborgonovo\ngreenbury\nbhavsar\nwinnall\ngutenburg\nburnquist\nhanft\nameera\ngunrunner\nsuperminis\nsoubriquet\nmonterotondo\nmalkangiri\nbishopscourt\nebrington\nobscuration\nundescended\nfanthorpe\nwilmont\npatinated\nlennix\nshalamov\nsmudging\nlenska\nhydrologists\ndiscreditable\nlaragh\nshimbashi\nremilitarization\ndecertification\natrophies\ndownlands\nndamukong\nbamforth\nschmiedeberg\nzaldy\nlicker\nmicklethwaite\nroone\ndalisay\nlinalool\nflowerpots\nhirotsugu\nsimplicio\nbogumil\ndirtee\nsteamrolled\nalysha\nsengstacke\nwheeze\ngouled\nkhwar\ngoyim\nfruitfulness\nblant\ntixall\nbookscan\nnewfoundlander\naquaporin\nchigurh\njoblessness\nblackheads\nstracey\nuppercuts\nmyklebust\neurotech\nsluggishness\nzubar\nvirajpet\nelvaston\nunselfishness\nmottaki\nelectret\ngrossest\nnieuwenhuizen\nmagneti\nworle\nsquam\nreassembles\nspurting\nmingxing\novergrazed\nhumidification\ncassone\nscrapings\nbuchsbaum\ncospar\ndeganwy\nacuteness\nmezzosoprano\nious\nbulworth\nllanarth\ngodmothers\nstrelna\nruritanian\nbrdy\nassos\noaky\nlammermuir\ncraighill\nlengies\neskandari\nmalteser\npalmtop\nleyritz\nscms\nnescafe\nhuntspill\ndellas\nfatha\noaksey\njsg\njeanpierre\nwoodrush\nsystematised\nmahfood\nlimas\nlaunius\nshemar\nmenjivar\nclewlow\nbaweja\ncevennes\nnjoya\nmaleate\nxuefeng\nshadyac\nbertinotti\nhoarseness\nlavazza\nbajracharya\nslimmest\ncyanotic\nboundries\nglave\nfocker\nreaser\nbaumel\nnicking\ncastagneto\nafricom\ntanweer\nmachination\nfhp\nthugz\ninterminably\nhammarskjold\nmacroscale\nmarzia\nmonopolists\ncataloochee\ngornergrat\nbohemianism\nzelia\nbenjarvus\nwohlforth\nballadares\ncrisologo\npaloschi\nbuelow\ndomestiques\nvgl\nchudasama\nbsec\nsharlet\ndhimmitude\nhazelius\npopeil\nchemtrails\ngelfond\nsahwa\ntewes\ndhiman\nastors\nnem
orosa\nduning\nandronic\nquackers\nbartolotta\nambras\nrambis\ndunkelman\nbertino\nvemuri\nsitagliptin\ncrazes\nlancret\nelasticities\nimpels\njamaa\nrabee\ncotulla\nromel\nleupp\nsalinan\npalanan\nspungen\nrebated\nsherrilyn\nkalmaegi\nkeycard\nnfci\nmokena\nlovera\npullmantur\nsarson\nroven\nwrobel\nvisaginas\nlevoy\nyele\nbillingslea\nweiming\nbusemann\nsankha\nsambizanga\ndownstroke\nvernissage\nrotarians\nhunding\ndamak\nxingguo\nparticularized\nstoler\nangi\nmirzayev\nmathies\ntechnip\npatriota\ntastier\nshrubsole\npriebke\ndismore\nplaint\nkerrod\nshann\nphotochromic\nlarreta\nswizzle\npsyllium\npipefishes\ntymoshchuk\nrozin\nbauers\ndurmaz\nanodizing\nwalheim\nvandewalle\npellizotti\nhengel\nadaptec\nellerbrock\nunkindness\nthrombolytic\nmcgilvray\nbaloy\nghaith\nseadogs\ncerrig\nnewes\nrushin\niiif\nhunker\ngortyn\nbarningham\nretallick\nsji\nlembke\nhongyuan\ngramophones\nzykov\ndarlton\nsxs\neten\nastronomico\nfastbacks\nbaisers\ngilston\nlukšić\nvelichko\nptaszynski\ngoldsmithing\nschultheis\nmarial\nrastaman\nnormington\nokakura\nlongingly\nnauck\nayse\nsigle\nsardari\ndupee\nhandicappers\nyegorova\nipca\nludy\nklinker\nharvards\ngasped\nzingaro\nsirko\naurelie\nbrazi\ngravitates\nbroomhead\nyamaki\nsweetmeats\njinggangshan\ncuchulainn\nshootist\nduggars\nbandes\ntashilhunpo\ncampano\nrossler\ntaqueria\nshannyn\nhogge\nyunes\nfoodbank\ndelbene\npascaline\ntroiani\nabada\ndarín\nmvula\nsanou\ncontactable\nrudiger\ncanessa\ncombino\nidex\ncowls\ndyspraxia\ncoluche\nesclaves\nsobyanin\ngaros\nhusna\nprust\nreinfection\ntrestman\nweyoun\nyawa\ninkpen\nalmonacid\nnighty\nlagana\nnsis\nkmiec\nnublu\nprocuratorate\ntianyuan\ncenedlaethol\nakman\nbrowses\nsobhraj\nindemnified\nhöfe\nstreb\nkoplik\nprade\nnalls\namarin\ndefaulters\ngenitalium\nmergansers\npatka\nheti\ngoodby\nklever\noruzgan\nlubuk\npendeford\ngodeau\ncallipers\ncrucifixions\nkumbaya\npalafrugell\nbeisel\nkentfield\nplyler\nibama\nprejudging\nsubin\npunja\nkeratoplasty\nlisnaskea\ngignoux\nkuc\n
nairu\ndeibert\ndetectability\nakos\nyamadayev\nfugro\nrenascent\nsinnemahoning\nmethylprednisolone\nlocomotiv\ncydney\nmagistrature\neisteddfodau\npegwell\nepigenome\nbinaca\nrabbo\nlawers\nkuz\nchocola\nquibell\nbaorong\ncoko\nvoros\nnooses\nmacritchie\nlabaki\nmedtech\ndaphna\nappen\nclementon\nsuissa\nslanging\nmaccarinelli\naprox\nsomniferum\ncyberknife\nalferov\nkoreeda\nhanz\nkrisna\nfaizullah\ncardioverter\ntrudgill\nmatchsticks\nvietor\naksa\nsahraoui\nsixpenny\nsuqian\nspinka\nfoetuses\nalboran\nunmentionable\nmacnelly\nyonathan\nfuenzalida\nstreetsboro\nkippen\nphilippon\nevalueserve\ncarvalhal\nsilbury\ngoudarzi\nzollo\nhemoglobinuria\njotun\ndouds\nmalmsten\naurantium\nchadema\nanglians\ndiphu\nstabb\nsholapur\nficarra\nruedas\nkotok\ncutrone\nspeedways\nroomies\ngallow\nparoxysm\nvalidations\nbassil\ngoodger\nbizimungu\narapaima\nfingerless\nnonnie\ncallouts\ncomports\nsedrick\ngaut\nmotional\nportlock\nbanus\nmasamoto\nbooboo\nkissena\niribarren\nrockcorps\nbkm\nrll\nmaghi\nfionnula\nrupavahini\nholberton\ncadential\nberneri\nsoor\nsharpeners\ncontr\ntianxia\ntiramisu\njlg\ncarlen\nhyndland\npck\nnoureddin\ndunkers\nmaribeth\neenie\nkopell\ndukw\ntuula\nruckelshaus\nvisentin\nprixs\necholalia\nmingun\nalgonkian\nbrisbin\nmadl\nstuffit\ngreate\ngrift\nwielkopolskie\nprevue\nlelièvre\nbratkowski\npiroska\nchieveley\ntrocchi\ncoloccini\nenlivening\nkloos\nfogged\nchâtelard\ndavidsonville\nkranjec\neanna\nteratomas\nutube\nletterheads\nflechette\ncdfi\nlobert\ncongregant\nwildhorse\nannasophia\npandarus\nricart\nfucheng\ntammaro\nhardage\nbecaue\nmegastructures\ntandi\nnarong\ncrps\nconfindustria\ncrosshill\nlitang\neiche\npathetique\nrudest\nboof\nkaradzic\nplotless\npulcini\nsioned\nhyperplastic\nneemrana\nshanaya\ngarmo\ntanton\npoujade\nmarzel\nkatims\nmerlet\nchafford\nwassup\ngarud\nrockpool\nrichens\ntawni\ninfirmaries\ngruene\ngodawful\nhingst\ndunipace\nkeyboarding\nkerse\nbindel\nmugford\nmoate\nsuzano\ndiadora\nalmar\nkilted\nbenna\nchauvinists
\ngordes\nlegman\ncharleswood\nabbett\npbz\nbyler\nmuskoxen\nkoukou\nmarzorati\nporker\nespinar\nphalaborwa\ncriscito\nwahbi\nflacq\npottsgrove\naquaponics\napap\npahala\ndalyan\nmanantial\ntrunklines\ngaleras\ngriebel\npedestrianism\nwhiteinch\nmoenkopi\niwp\nprocainamide\nmurrells\nreutilization\nondansetron\nheliostat\noxenberg\nheileman\nsubtexts\ngernhardt\nturnley\nwiesenfeld\nwuyuan\ndisentangling\nenap\nshuli\noroya\nsamon\naconite\napocalypticism\ngluckstein\nanston\nnorthleach\nhoryn\nlfk\nhanban\nhewerdine\ntriclinium\nnerijus\ntakenaga\nquacquarelli\ngamston\nfolgers\nbureacracy\nmodele\npedicabs\ncerebra\naudis\nwgal\nminyanim\nseana\ngolondrina\nsqually\ndyble\nvmix\nskues\ntesio\nspecificially\nmabie\nschodt\nmaciejewski\nbogas\nmetropcs\nalza\nskitters\nyere\ngenzken\nfugly\nfrenk\ngrameenphone\nrigal\nfereshteh\nswallowfield\nerdenet\nredcross\nscapegoated\ngilzean\nlimina\nmihalik\nyasur\nnumericable\nhannant\nasmaa\naace\nfornos\njipeng\ncertaintly\nneola\nchristoforou\nbaugur\nsomersham\ntoppin\ntollett\nzocchi\ngreason\nimpellers\nsumeria\nfranciska\nlashon\nheintze\nrighton\nmrh\nshojaei\nfancifully\nwepener\nabdula\ntramontana\ncomberton\npaua\nbehemoths\npaumen\nwickland\nyalom\nglocal\ntakeishi\nsceptres\ncreal\novingdean\ngerke\nntcc\naitzaz\norejuela\nparaffins\npixellated\nrohingyas\nproxied\nmcgaha\nformato\ngreasley\nhaltom\nheit\nleshchenko\nmarenco\nnechama\ngarvaghy\nbreite\nclaque\nratanpur\nperegrini\nkiyohiko\nchovanec\ngaida\nkilik\naulas\nfermentations\njakup\ncounteraction\ndnfs\nprawa\nbootstraps\nmengzi\nhillwalking\nrouthier\nbackings\nsoltero\nhanahan\nfossilize\ngamersgate\nfelter\nmacoris\nblachman\nwkyt\nfranktown\nroehm\nyazhou\nutopians\nmicrobicides\nghale\nmasturbated\nvercauteren\nhandballs\ntinne\nsunblock\nzanten\nburdi\nlewsey\nmartley\nlgf\nleveaux\nhelmshore\nrabil\naccouterments\ncaram\nrucks\nclearbrook\nruotolo\ngeting\nmccraney\nmho\nshazza\nkoroni\njizo\nsaitoti\nharville\npersicum\njhc\nrühle\ndisempowere
d\nrevenu\nlouds\ndehumidifier\nlingwood\nklop\nupbraided\npiret\nmiyakawa\nsummerbee\nexternalized\nkhovanshchina\nrediker\nvelikov\ndilating\npulmonaria\nredtop\npeyret\ndrillbit\ngalloper\nmjb\nhynix\nfrenchkiss\nwastin\nmsdos\nortmayer\ntrachsel\nrevengeful\nmonach\ncroque\nartadi\ngroundwood\nliimatainen\nguancha\nvsphere\nthebus\nalibek\nfuruno\ncantine\nshust\nmurua\nlanser\ncollegeboard\nnalis\ngranillo\nketsbaia\nfastens\ncakaudrove\nunpreparedness\nreserach\nangarano\nartifices\nhandsfree\ndecompressing\nkiff\nnessman\nfarrior\ntancock\nskyboxes\njoses\nhamamelis\nagoos\nstoehr\nvinegars\nleye\nravaglia\nsavignac\nmuddles\nhogestyn\nfreetime\nwastefulness\nklüft\ntullos\nkomedia\nneato\nleonova\ndeejaying\nteallach\ngrunert\nleafe\ncicierega\nneele\npentavalent\ndebelle\nandreyevna\nazf\ntelesat\nmmse\ndruyan\nfeticide\nséralini\nhokuetsu\nwrase\npesi\nbankunited\ncavenaghi\nmeq\nlimbert\ncarpani\niclei\nunlistenable\nponorogo\ntrimley\nascl\nmeiers\nayles\nlijst\nbardini\nhoberg\ntrombay\nlovibond\novervoltage\nvisitacion\ncoloso\nvmebus\nscalera\nburao\ngaard\nbrockworth\nhegar\nquarrier\nfenichel\nnonfictional\ndemyanenko\ndinkel\nnorful\npraesidium\ngrobman\nneediest\nboucek\nbaniszewski\nallfrey\niora\nscheidler\npolyuria\nchandrasekhara\nnthe\nfroglet\nbijlani\nprivatising\nilango\nklossowski\nleuer\ntaarabt\nchagaev\nandreyevsky\nreen\nciy\ngalilea\nprosequi\nlibeling\nassef\nfossella\nprouder\nshivakumar\njacobellis\ninle\nbujak\nmabs\nstodola\nopnion\nmemminger\ndaele\nacyclovir\nsimlish\nmpemba\nbauerle\nmufc\nkamrani\nnanfang\nmurakawa\nbaksi\ninserter\nkoob\nwhisking\nmillgate\ncouve\ngassy\nwelled\nehrenstein\nahsha\nworsdell\nfrucht\nfiaf\nadomah\nmustaqbal\ndweeb\nbaihe\nblagg\nincomprehensibly\nbettles\nmardis\nstoppin\nbeacause\nfradkov\nsinewy\ndetrimentally\nchhotu\nwasherwomen\npinko\nbirdsboro\nprotasiewicz\nfischerspooner\nnasc\nmailes\nrajkhowa\ncaravanning\ngianmaria\ngauloise\ncumbo\ncxx\nbeckey\nmahasweta\nnnsa\ncalella\nstoppini\
nskibbe\nflippancy\noxwich\nfariborz\nleki\nbreon\nrelaunches\nsforzesco\nhohaia\nanif\nmassimi\nmontignac\nmouthwashes\nallonby\nhemifacial\nquinidine\nbarcamp\nkaddour\nbilanz\ncaygill\nfii\nbartons\ntreharne\ncoloniser\nritualistically\nwagyu\nbonora\nkloves\nchandrakumar\nparrino\nwarter\nmackovic\nmarau\nastonishes\nknoydart\nhandelsman\nbaldhead\nteicher\nunrelieved\nhucks\nschipa\ngoze\ncameroonians\nlibrería\ncrunched\nmoulvi\npantiles\ngoodlettsville\nhambros\npiccini\nwisnu\nkudryashov\nburciaga\nbaltzell\nkiszka\ntabulator\nsequitor\nsnir\nmarneuli\naushev\ntransgresses\ngratefulness\ntorroba\nuntangled\ntrethowan\nscuttlebutt\nacie\nmühlfeld\ntista\npageboy\nbethuel\nyers\negotistic\nragg\ncalifon\nadrenalina\ncmaj\nmizanur\ngethard\nbidden\nyongfeng\ncumana\nnonsmokers\nschommer\nkossi\nchinamasa\nliedholm\ntaleghani\nwareheim\npaintbox\nsylvana\nkurils\nlinemates\nbegain\nopnav\nriesgo\ncifaretto\nkleindienst\nkombinat\nronchamp\nhyett\ninfrequency\nkluk\npentewan\ncrepaldi\naeroplan\ntrun\nniru\nwillhite\nmetronews\nsharla\ngazoo\nbcbs\natavism\ncrazee\ndensham\nnummelin\ndaq\nsteinkamp\nintracytoplasmic\nclac\nhiger\nlibertini\ninvasively\nsintef\ngrounders\nunpredicted\nsprowston\ncrummock\nreimagine\nphilarmonic\ntricorder\nhunniford\nhugli\nquinces\nokri\njaray\nunhorsed\nunsent\nscaneagle\ncatemaco\nmorovis\nazel\nwhiteflies\nnavfac\nouda\nmtns\nesxi\nweststar\nmilicic\nmiliary\npapillomatosis\nleysen\ngration\ntlt\nlewanika\njast\nsvod\netoys\nwisdoms\ntintype\njutted\nperiodontics\nelchibey\nsharana\nbounteous\nfriedeburg\ngambell\nolexandr\nmullery\nshanidar\nhauschild\nmajorie\nvariegation\nnairnshire\nrastro\nregata\ntwinings\njpii\nbellefleur\npeterhansel\ncommencements\nmceldowney\npracy\nenmities\naitc\narnison\nithe\ncaerdydd\nfairpoint\ngripp\nmoika\ndolt\nhitzlsperger\ndonnacona\nhouches\ntrillin\nrenovo\nshellie\nsotiriou\nheadwaiter\nloven\nbocek\nmorshed\nschelklingen\nkille\nstiehm\nfermier\narberry\npistolero\ndevelopped\nfroberg\
nkhatibi\npalmed\nforegrounding\nbll\ntocchet\npiekarski\nrichardville\nprefigure\nmcilveen\naccussed\nbolstad\nmeinhold\nframestore\nvittles\nparnevik\ngrh\ncounternarcotics\nevah\ncouteur\nbatal\ntambopata\nkrieghoff\nscarfs\narry\nanisimova\nivt\nbahen\ninertness\ndowbiggin\nwallers\nonera\njulians\nmurrin\ncullberg\ninx\nolitski\nouston\niolanta\nshinola\nintervale\ntayyab\nporbeagle\nfuri\nwheather\nchakiris\ngambas\nakamine\nmogote\npeevish\npapamichael\nkeening\nshivas\nostry\nfembot\nwishin\nemulsifying\ndarwins\ngutheinz\ncanapé\nmarmol\nxpressmusic\nmoallem\nboxtree\nlizarraga\ncroquettes\nscharoun\nimmunosuppressed\nintermittency\nfossi\nsteber\nherblock\nplentifully\nthymosin\npsig\nkenwyne\nkanka\nkärntner\narmetta\nsellersville\nbarnt\npulverizing\nmancebo\npeppin\npanteg\nbarcus\naarnio\nvermeij\nviani\ntrumpf\nkarriere\ngouger\nshirato\nbeur\nijc\nsecessions\nkongregate\nonair\nkarges\nsokoudjou\nplantier\nservicers\nvocus\nkatkov\nngen\njne\nkompass\ndurazo\nflatrock\nmealtime\nchatchai\nkochin\nsoller\nugur\naleatory\njuantorena\nhuntingford\nrouxel\ncanelli\nbulykin\ndivestitures\nicmr\nhyperopia\nshelsley\nokhta\nlightwater\nsinnamon\ncise\nlonglegs\ncitrulline\nconsorted\npaone\njiewen\nzhihui\nwalkovers\nfitow\ngcaa\nbackstopped\nindya\nalbarrán\nocz\nwolle\nphotosynthesizing\nmicrostrategy\ndiscontentment\nlanotte\nstillson\nsensini\nschoolkid\ngerta\nsimpatico\nbotín\nlongside\nabos\nachuthan\nbequia\nzhaotong\nstanbrook\nfahrettin\nrossello\nsirkka\ningleses\nbleeping\ntrena\nstippled\nplotz\ncorrugations\nhoeksema\naspull\neuromillions\nswx\nvsop\nkamya\njelley\nkresse\nschiehallion\noduber\ntrka\nedgemere\nshindand\nbasey\nyufu\nstanderton\ngiuda\natascocita\nnavion\nreconvening\nliebezeit\nfletching\npinery\nalikhan\nsasco\nsigit\ntipitina\nforss\nholdrege\nmughlai\ntix\numeboshi\nenlow\nbobsleds\nlitespeed\nverka\nrivertown\ntacconi\nrorippa\nskerryvore\nglobulins\ntzec\nmarilee\nawadi\nmozambicans\nbanyamulenge\nberendsen\nlearoyd\natis
ha\nestudiante\nidole\nniccolini\nblaz\nmiesque\nquitclaim\nfoxford\nhiltzik\nkrebbs\nhayhoe\nolvidados\npersistance\nballoted\ndecourcy\nnscn\nblakk\ndwf\nbrenin\nnaïvely\nstoecker\nkamiizumi\ngvp\nhallmarking\nhaematological\nkayapo\nmediagroup\nmoute\nulma\nnnaji\ncherrelle\ntriodion\nribet\ndialga\ntardebigge\ncssd\nlechero\ncosman\nnewtownhamilton\nluttwak\nblondi\nclampetts\ncarandiru\nzairean\ndoughs\ngenos\nzidek\nbarosaurus\nalbiston\nbenegas\nalverstoke\ndelima\ndiceros\nincontinent\nevilness\nmariachis\nbrodowski\nmultiway\neuphausia\nrefractometer\npattammal\nuny\nchbosky\ndanelo\nwaterfoot\nryogoku\nmeselson\nburck\norwig\npooter\nncrc\nlandaulet\nmgy\ncontino\ntalmudists\nogren\ndazzles\nchancer\nwfts\nacon\nsergiyev\nwordlessly\nkirichenko\nmuela\nbuble\nandreina\nmasen\neloísa\ngalinsky\ntavakoli\nbrighstone\nrheumatologist\naabar\npulgas\narsenopyrite\nquenstedt\nfreidel\nborum\ntrenchtown\nunromantic\nedmée\nbrisbois\nahla\ncressing\nhelguson\nperia\nmccreevy\nbenglis\ncheminformatics\nfreudenstein\nraggy\ntembisa\njeyaraj\nnassi\nvieth\npowerbuilder\nuplinks\nsoutter\nkaprun\ncacioppo\nstockier\nepifania\nhonoraria\nricciarelli\npevney\nszeliga\nnightspots\nunkei\nliberto\nbassiouni\nmalk\nferndown\nbudweis\nsuppressants\nhildred\nperlich\ncallowhill\narseneault\njupe\nbielema\nberkovits\nsammes\nvenkman\nkaramay\nbiennales\nburtis\ntasci\nbacksliding\ncoln\nmatip\noppositionist\nlalime\ndubinin\nunseasoned\ndetroiters\nconsorcio\npotch\nhavemann\nhänsch\nsoulive\ncatorce\nmanulis\nassertively\ndiapering\nmargarethen\nvettriano\nlifejackets\nmethylmalonic\nzeth\nawr\nreekers\nclingmans\npitmen\nkathman\ngatz\nazimut\nsincelejo\nwofl\nasume\nbioenergetics\nmilitated\nlangeais\nyankovsky\nheleen\nmonkeypox\nkoenigs\nkillowen\noestreich\nvallot\nbelaying\nbejo\ndonghua\nnavels\nthakurta\nveness\nellagic\nfolliard\nmalassezia\nshughart\nautorickshaw\npinecone\nhosey\nspaceland\nraycroft\npachora\ntrolli\npfe\nmutualists\nhetemaj\nvishnevsky\ncobija\nw
ian\ngotv\ntangentopoli\nachyut\ngerbrandy\njabu\nwindbag\nfanck\nummayad\nnatasja\nshoplift\nbeachwear\ngreis\nbovespa\nartiles\nhatf\nwigfall\nbisphosphonate\nllf\nanod\nverdine\nbjerge\ncardioversion\nachewood\ntubin\nmbv\nsenechal\nmccrindle\nramia\nvirosa\nbarce\nstuever\nhanigan\ntaekema\nakj\nwolfsonian\naiya\nclayman\nlakai\nselm\nbnot\ncrug\ngalerías\nsunward\nather\nstavely\nwkd\ntransexual\nhangup\ndegreed\nsidus\nlanne\ncattanach\nkubel\naltenmarkt\nescherich\nklebnikov\ngiralt\ncornflake\njangling\nncas\ndominy\nragama\nlamlash\ntimewatch\nblei\nrogalski\ncompromiso\nmiracolo\ngossow\ndawar\nkafiristan\nwtem\nsironi\nstreambank\nyanar\naloy\nmoisten\ngagah\nsummerstage\naleu\nalcudia\nmamedyarov\nnilla\nkoerber\njowls\npepco\nohshima\nsugaring\nhilário\nhongdae\ncavalo\nkrust\nsuburbanites\nrscg\nkousa\nschwahn\ngalison\npocitos\nforegrounded\nctk\ndelphini\ncloudberry\nseafrance\nmarjoe\nkharchenko\nbagayoko\nhartselle\naskhat\nhirschl\nabubakr\nkumarasamy\npandals\npounces\nbirdmen\njuva\nobselete\npopenoe\nebby\nflotta\nnewpapers\nhaematopoietic\nmaracana\nremise\ncassey\nreynal\njiles\npositif\nruffell\ndioses\nkanzler\nfuh\nloftier\nhalam\nlegos\noverslept\nnwachukwu\nmenc\nblong\nflye\npandang\nbakoyannis\nkastellet\nthas\ncomping\nrechecking\nnikbakht\nnataf\nleprae\nakifumi\ndisrobe\nhannula\nanshu\nzirin\nboullion\nbuchanans\nprescribers\nmoonah\nluteolin\nslagel\ngrantsburg\nlifestream\nanzalone\nmarchibroda\nmetronet\nbarrero\nwierzbicki\nredesignating\neconomidis\ndiago\nnqf\nmudlark\ntiznit\npasquino\nelaborative\nyakusho\nshorthouse\nmeeus\njanica\nteletypewriter\nmudville\nmalty\nguanacos\nhinnant\ntreleaven\nscalpers\nvennegoor\nhasna\ngreatcoat\ninterisland\ndollhouses\nrhiwbina\ngannan\nkitezh\njeudi\nkarinska\nexpediently\nsturgill\nthibetanus\nafta\ncamerota\nneitz\nconvers\nazambuja\nunpersuaded\nlgen\nkissers\nchristenunie\nettie\nbeloki\nalesandro\nmountsorrel\nveneman\nkamboni\nseminara\nbakassi\nschow\nwhittenburg\nmilliwatt\nta
gliaferro\nstancliffe\npreplanned\ntocopilla\ntalisa\nboutelle\nmergen\ndumervil\nchelimo\nkokkinos\nmacandrews\ndelonte\nshuren\nroosmalen\nboogieman\nhaveing\nfbos\nscofidio\nlarnaka\nincising\norgiastic\nshammari\ntempliers\ngilma\ngutfreund\nkig\ncrabbers\nbioprocessing\npetulance\nunenrolled\npartiers\nxperiment\ncottman\nsquelched\nyasuni\ncabannes\nsparkhill\nlongshanks\nmongla\nalmazbek\ntredyffrin\nfariba\nrønneberg\nbroyard\nnorthmore\neckbo\nsgae\nacars\njimmies\nratting\nacps\nchalkley\nmeseret\nmcilvanney\nvanquishes\nmanolete\nbikeshare\ntristen\nkrejci\naysgarth\ncarmelina\ncovel\ngarance\nkyriazis\nmccole\nsugaree\nllŷr\ninstone\nguh\nvainglory\ntrescott\ntoiba\ntdu\ndulling\nkhadafi\ngaliana\nbaltzer\nverrilli\nwasho\naustralopithecines\nwhitesville\ncenterfielder\nkébé\nskag\nmetallization\ncollomb\nkottar\nneila\nboyo\nmclibel\nminsmere\npepperidge\nnemer\nscoffing\nintroitus\nbusco\nkabini\ncowon\nmeliá\nbárcena\nkoeberg\nculligan\ngiasone\ntongren\npedrad\ncernea\nreviser\nbotches\nyeliseyev\npostdate\ngamberini\nsveum\nbarbarez\nnarron\nleire\nwittiest\nqujing\nmecs\nkindel\nmechanize\nseiber\ntudgay\nmaby\norangefield\ncolleville\nbiggy\ntagami\nmangosuthu\nnorgaard\nmitag\naulaqi\nuisge\nawwad\nlindhardt\nfehl\nbreastbone\nhethel\ndamiana\nnaturalizing\nendler\ncoutard\nsuppleness\nwizened\nwearily\nnecedah\nweist\nburgman\nhampus\nkemalism\nmatrikonopc\npeschisolido\ngvk\ndatacom\ngestating\naczel\nshimmers\neih\nbreadon\nhelpings\nriahi\nsabbatarianism\nsheilah\ngambiae\nsalable\nmenisci\nrizhskaya\nzago\ncossey\nworonov\ngottstein\nerlinda\nexpansively\nowino\nfochabers\nmullaly\nbratislav\nrefutable\nwoolsack\nalperton\nseccion\neuphonious\nmckone\nradiothon\ndillion\nkaina\nunblockable\ndamron\nwytv\nunsexed\ngulpilil\ncherise\nhotkey\npsychogeography\nmonae\nticad\nfroilan\nmamonov\nalstroemeria\ncez\nbajofondo\nboyack\nvuna\nmoua\nmondorf\ncodependent\nworldnet\nsymbiodinium\npaleoanthropologists\nremek\npolymyxin\ngallier\nstanchion\n
meerschaum\naveragely\nbigazzi\nbordone\nexcoriation\nharbutt\npippard\neconometrician\nkipiani\nprostrating\ngateacre\nguerneville\nsimmon\nptw\nzhar\nboddam\nrimawi\nbioactivity\ntweaker\nparticualr\nsportscene\nautobiographie\npiecework\nloooong\nkubus\nwukesong\ndobe\nboogers\npincio\nmazars\nwormsley\nsharpham\nreko\nphillippi\nsissies\napolonio\nstemi\ntullie\ndeif\nbullit\nmungiu\naldermanbury\ngallner\nmerse\nvotkinsk\nistructe\nuhmwpe\nkatmandu\nbittering\nvinda\nmovil\noccitane\noutsmarts\nsozzani\npostseasons\nhurban\nrussophiles\nnakasongola\nrabanne\ngracen\nkiunga\nlochgilphead\nduelled\nitek\ninterventionists\nsiphiwe\ndunns\nwaltzer\nswickard\nilopango\nhadland\nabutbul\nrambow\nlealtad\ncityjet\nfreelove\nmashreq\nxperience\nlambent\ncoult\nveltheim\nlocs\naltius\nhegemann\ndumbstruck\nfwv\ngyrate\nstojkov\nmuge\nfacer\noufkir\nsueddeutsche\npernía\nlinsky\nnko\nlasry\ndionysiac\nstörtebeker\nbreidenbach\ngarreau\nquiff\nbaruta\nroatan\nwafaa\ndujail\nallenton\ndaptone\npotshot\ncyberlaw\nradiometers\norwin\nviguier\nbegrudging\njonathas\ncheekily\ninomata\nrevver\ngtlds\ntyronne\nmaligne\neugeni\nnikolaeva\nbealey\naccomodating\nbougival\nsleestak\nsapkowski\nirh\nnecid\narnette\nrecalcitrance\nshurmur\nkagurazaka\ndtour\ncysticercosis\nlooong\nmungos\nbinta\npaska\ngaizka\nsinharaja\nmowery\nhaeger\ndockstader\nsummiting\nvucetich\nwabo\narsine\nvanover\norengo\nhudon\nkokesh\navara\nbattenburg\nhaylock\nanomic\ntreemonisha\nkaminer\nnaoshi\nresidentially\ncongresbury\ntombola\npermai\nbaijiu\ngriffel\nchelmno\nshortie\ncordingley\nsimplexes\nipscs\nbrunetta\ngilde\ncihr\nperdanakusuma\ndahlak\ncherrybrook\ngriseum\njiali\nkastri\nfrood\ncantare\ncaniglia\ntuvans\nintoxicant\nsiegl\ngenden\naniak\nsempra\nmaniscalco\nerlkönig\nglibly\nsneezy\nticotin\nbagla\nmtnl\npsaltis\nkedem\nglobetrotting\nmangia\nunsearchable\ngoron\nmigliorini\nsesma\ncheapen\ndiles\nshakhbut\nkarakaya\nhybridised\nsuasion\nnonsurgical\nosio\nhemmingsen\nadventuresome\nbell
view\nsulgrave\nsaltine\ncalen\nnigina\nbermann\ncinthia\nrinzler\nsplendours\npromesse\nwallcoverings\ncyrankiewicz\nbrackeen\nwineglass\nmaquina\nbavier\nadot\ncloudera\nmicrosurgical\nvadisi\nmilieux\nonuki\nsiddick\nreimbursable\nnoorda\neliécer\nshadwick\noutcompeted\nchatterer\ninterbay\nbiochemicals\nkittridge\nesala\ngortner\nglenohumeral\netchegaray\nbierley\ndelenda\ntroisgros\nprotais\nadebimpe\nreinvesting\ndouses\nscadding\nkolbert\nrödl\nmaynas\nmcferran\ndishforth\nmariappa\ncommissaires\nmohammedia\nunkindly\nneque\nneurones\nshukra\nmathys\nsoldatov\nmavens\nilgar\nadnet\nevv\nwashings\nshikata\nmainlander\ncarandini\nsyrett\nradiogram\ntatarsky\npritikin\nboge\nbracamontes\nwobbler\nsalvati\nspieker\nasms\nquiverfull\nbaselessly\nrentoul\neurekalert\nsimao\nfloreana\nmuehl\ngorta\nmethodic\nlastest\nmicronesians\nturiddu\nvasiliu\nsanjit\nackee\nwcd\nfachtna\nweasle\nholkins\nmtawarira\nrohatgi\npyla\nsuperspy\nbigband\nglenavy\ndeadend\nchilcot\ncanalization\nmelikyan\ngaspe\nnaruko\nminasian\ntwiddling\nstrachur\ntolleshunt\nsej\nftx\nbarkov\nchurchdown\nmakovicky\nosterwald\nhäagen\ngmhc\nhadary\nkilmory\nclwydian\nkinjo\ndivvy\nascencio\ncrambe\ndobermann\nnonproductive\nserenitatis\nprosaically\nreforged\noliech\ncorish\ncoopted\ncheezburger\nspads\nflobots\nstroudwater\ndruggists\nmelodist\npuppeteered\nnccam\nbirbiglia\ndeconstructivist\nkoeppen\nsymbioses\ngenistein\nrampone\ndistribuidora\nthulani\nbengston\ngethsemani\nstandpipes\nmotortrend\noverby\nvalgeir\nleagrave\namsoil\nisvs\nringroad\nledward\naspd\nweepy\nzoloft\ngilmerton\nanzoategui\nfairtex\nmaclaughlin\nlongcroft\nzahawi\nmcmicken\ncottone\nuncongenial\nlugging\nwikicity\njills\nleveler\nwindchill\nghormley\nlaunde\nbornheimer\nnantyglo\nmofokeng\nkampusch\ningwersen\nworthily\nnhlers\nkonte\nhuayuan\ntreviranus\naleka\nreenacts\nprahl\nbazemore\nsuzann\ncreg\nshies\ncout\ncoatlicue\ndesilva\nbroo\naceval\nkikukawa\nsanyuan\ntheaker\nsvobodny\noutsprinted\npodded\nryken\nhiro
nori\nforonda\nthunderhawk\narepa\ncarys\ngrun\nintiman\nvanquisher\nruxin\nlukyanov\nmudcrutch\nbooi\npecari\nsadda\nmarkakis\nhankering\nblombos\ncupper\nvisione\nsafta\nlonegan\npicturesquely\noldrich\nlongnor\nregensburger\nmadhumita\nhuntingburg\nbarnstorm\nloanhead\ncheckland\ntesler\npavano\nbaguettes\npiggot\nghasemi\nkavalier\nprys\nosias\nnoggle\nasatiani\nolins\nskell\ncollini\nchevin\nsloughing\nchass\nscheels\nthorntons\nnoyori\npanh\nmurt\nbluemont\npriviledge\nsandbrook\njesco\ncurrach\nlenzen\ncboe\nshernoff\npascrell\nknobbly\nmaintenence\nlechery\nhanasi\nwnct\nnaturopath\nhegang\nchyler\nquadruples\npietrelcina\nshiming\ndeely\nmedivac\nworthwile\nirsp\nnagorny\nconrads\nnprc\nkragen\nearmarking\nftg\ntretorn\nmeltem\noogway\nendy\njavel\nrattazzi\nlongshaw\nsroka\noximeter\nsweeties\nwboc\nbizz\nboldy\ndvorkin\nelkabetz\nashti\narkalyk\nfrantzen\ndowndrafts\nstarcross\noumou\nbeamont\nitsu\nmuhammet\nrafina\nremits\nstefen\nibac\nwelterweights\nzivko\nsaffo\ndopp\nemar\nbodhgaya\nbourgueil\nredheugh\nplensa\nkeese\nplainsmen\nhimala\ncarlon\nlape\nkreipe\ncandee\nkhandu\nbeketov\nvlp\nzebroski\ndevonte\nzhangjiang\ngallaghers\nnightstand\nbreeching\nsunrooms\ngerben\nconnotative\ncrowed\nblandón\nstrano\nbodart\ntavish\ndimitroff\nbertoncini\nkibeho\nkonnie\nwahaha\ncopernic\npraet\nroero\ncardamone\npetkoff\nscillonian\nvisable\nrihan\nbroiling\nmundum\nlenon\nbelaid\nflouride\ngoldikova\ndwango\ngrev\nurusov\nnorthbank\ndamps\njenning\nahhhh\nadeleye\nengulfment\nthrangu\nmcgauran\nspataro\nspaziani\nproanthocyanidins\nbuchbinder\nbatwa\ntotesport\nergometer\nsmailes\npresby\nchenes\nzimmers\nsheshan\nsopore\nheublein\nswirly\nevdo\nnonni\nbeholds\njanhunen\nsalvatrucha\nbocquet\nbazinet\ncashen\nnssf\nconnectu\nmerrington\nsnaggletooth\nodili\nlatinoamerica\nmarsili\nsist\nardudwy\nkolari\ncatullo\nkeglevich\nwitht\ntamasheq\npeguis\npnin\nguileless\ncigi\nunrooted\nrayhan\npaulauskas\ntemerarios\ndegreasing\nsmithsburg\ngaraudy\nneulion\nbrow
nings\nkauf\nhuntin\nacacio\nshabbily\nomron\nsüss\nyeezy\nniccol\nspek\nboombastic\nragno\npapeles\ncherrypicked\nsubrogation\nshortlisting\nboatright\ncrofty\nbiava\nbjörgvin\nultracold\ndurieu\nmccarley\nconcupiscence\ngelt\nnyet\nglennville\nariss\nsnris\nfluoresces\nrondi\nmarren\njianghuai\ngriess\nbilek\neoraptor\nasid\nantidiabetic\nreygadas\ngenset\nkipawa\ntardes\nwhoah\nhunsinger\nsneider\nencausse\ngraynor\ntottie\nchumps\nwalshaw\nqiyuan\nparme\narges\nmalcolmson\nbaulch\niorg\nlre\nmixcoac\nrozan\ngaleri\niocs\neulenberg\nsnitzer\nferulic\njosee\nturkman\nreasi\nflatirons\nsweetmeat\ngoranson\ngenographic\nvolodya\ntitina\npennino\norpha\nstyl\nsurette\nquillin\ncrawfordville\nkaieteur\nxiaojun\ncocoanuts\nlindhout\ngirardon\nfenham\nplitt\nshapland\nmuris\ndougga\nbayoumi\nmargareten\ndolia\nnoordoostpolder\nieb\nspruit\nbarac\nsipadan\nlusardi\nronayne\nfeitelson\narboretums\nvanegas\nhousemaids\npenzo\ncrcs\nflaxton\nsuperorganism\nproisy\nhuba\nbadda\nbárcenas\nplys\nziegel\nrauschenbusch\nshittu\nabdelkarim\npbdes\narizonan\notmoor\nmarkoe\nflot\nmosteiro\nnafas\nfrühbeck\nexabytes\nmicroparticles\nmeridiano\nslacken\nkameo\nkaegi\ngodet\ncherrapunji\nemployes\nzerbinetta\nbabulal\nprimarly\nzamparini\nedale\ncrona\nmaxthon\nkerney\nmicrodot\nincongruities\nkobzon\nschlieben\npierrots\njebus\nhoffberger\nlordkipanidze\npreposterously\ngoodwick\nnaganuma\npasche\nwtm\ntruncus\nmualla\ndavidow\ndiablerets\nsadurní\nseiberg\ndiffract\nbalkars\njvs\ndebashish\nfalloon\nsclerotherapy\nwestcarr\nsujin\nshoesmith\nllansantffraid\njulu\nhgvs\nkettledrum\nstreetwalker\nspott\nnevisian\nresounds\nshifra\nvalpo\nambati\nkouno\ntamp\naspi\namylin\ntoofer\nkirstenbosch\ncorrinne\nbarracked\nzayandeh\nyucky\nccat\nhsas\nhlavaty\nlustbader\nfitzy\nlanese\noocyst\nchargés\nknoebels\nchasewater\ndifícil\noverspend\nshoman\nzhol\nyaoshi\nldpr\ngalvanising\nforzani\nwheelan\nkressley\nignitor\nreely\nmindlin\ndonda\nmisalliance\nbarnwood\nasile\nalker\nportnow\nagin
st\ntoschi\nbettison\nmultilink\nikaika\ncozad\nspacedev\nlubber\nmaycock\nrotters\npenitente\nguntis\nproliferates\nbutan\naiha\npetke\narzew\noffhandedly\ngunwales\ndoumit\nsignaal\nlams\ndextro\nbakka\nloincloths\nflutists\ncalil\nlunched\nkrukenberg\nlupis\njysk\nbelying\nacounts\npoulsson\nanastase\nghostbuster\nfauci\ntimmis\nqpo\ngivhan\nifg\nazeredo\npowerbase\nkatyal\nxrf\ntritiya\nclaustro\nstrate\nscatology\nblokland\nupwell\nkmgh\nnsfc\ndoutzen\nbunu\nxkr\nmyoko\nhenham\nhelmke\nmoammar\nghareeb\nprivado\nivth\nchapuisat\nfranssen\narenys\nmellem\ndallan\nnoseda\nborrowash\nuclick\nblueefficiency\njannette\nunassembled\nweerakoon\ngitt\nfreckleton\noverbuilt\nmercaptopurine\nbudvar\nhelmerich\ngeysir\nreassortment\nilab\nclarisa\nrevelling\ntransneft\nwolverley\ngeddis\nderealization\nwette\ndelmonte\nridot\nallas\nchabi\ncastmember\nbergendahl\nstreetfighter\nheringsdorf\ncryptogams\nkopecky\nxaviera\nflinching\noppressively\nconing\nberardino\npoliana\ngeotextile\nscreencasts\nzemaitis\nbugeja\nidentikit\npuricelli\nbotwood\ntierkreis\nfeve\nmacknight\nkadazandusun\nzahur\nshallcross\nkilvey\nkalalau\npiras\nwasg\nkopra\nkihlstedt\npyrford\ngottschall\nexpeed\nmorrible\namga\ncalmy\nikl\nmotyka\nsuperglue\nantiinflammatory\ngeezers\nschlecht\nboliden\nanani\nmoorhens\nwjxt\nesky\npingel\nwoodsy\nmugan\nylli\nmerrells\nasellus\nkalona\nkebbell\nstranmillis\ntubau\naures\nmatache\nlosi\ntwite\nreinsurers\ndtb\nverry\nsuperfine\nlri\narkhipova\njibber\nrosoff\nnaulakha\nfermenters\nstylophone\nshoney\ncounterspy\nsnakelike\nchanan\nzwigoff\ncaptopril\nmirow\ncleopas\ncronshaw\ntretiakov\neichen\nbekas\ndockworker\natossa\nkirsan\nnsel\nsaleha\nihg\ntvu\nrossellino\nrygel\nburgmeier\nryazansky\nharir\nplatypuses\njimtown\ncollamer\ncellulite\nvrdolyak\nhubbardton\nzahniser\nbotnia\nreimagines\nguero\nintolerably\nbuderus\nopendoc\nranu\nlaczkó\nthundersley\nfatullah\nairo\npropanediol\nbulent\nanjem\npeashooter\npostgres\nlandguard\nhras\nconjurers\nnymark
\nerek\norleanians\nsindona\nforer\ngothick\ngreeters\nballykissangel\nrester\npistolet\nkreitzer\nperogative\neatin\nelvina\nkudelski\nstoev\nnewsbreak\nchillingly\naydan\njmf\nactavis\nsupernanny\nmultifuel\nsyerston\nsilverdocs\nmikail\ngalkina\nkerrin\nchangé\nguai\nkingmakers\nmarkas\nplaydate\nmssa\ncomprehensives\nkenealy\nbushwhacker\nedginess\nvgp\nhelbing\ntoddla\nsolovay\nlevinsohn\nantoniuk\nbaldursson\njamelle\nhoback\nmollen\nblixt\njmr\ntaffarel\ngozitan\nbolometer\nbraunlage\nmabiala\nborghini\nflandrin\nlibe\nradicalize\noctreotide\nhomogenize\nirrigators\nhanuka\nzeff\nmeilan\nmaluf\nagostinelli\nnarsai\nlipitor\nnifedipine\nakhmim\ntripti\nguitart\nsaavn\nxiaochun\nrotberg\nuckg\nerdf\nmassamba\ncilgerran\nadeptly\nryser\neverist\nmoomintroll\noverwater\ntalansky\nsubstitutable\nmousetraps\ngsusa\nknmi\nbeville\nbenicàssim\nfashola\nquesti\nmalmquist\npiche\nfeltwell\nreattributed\nseiichiro\ntande\npsychoanalyse\nmelniboné\nbanzhaf\nbouveret\nstemmle\nmcnealy\ntude\ndiethylstilbestrol\nrendre\nnippers\ninacio\nsupercruise\nharpurhey\nherrenhausen\nlipoma\nwinogrand\nriefkohl\nheslov\npetrilli\nmicco\nharpham\nsaturna\nrosemarkie\nisara\nmaqdisi\njulyan\nabasolo\nshiftless\ndisbelieves\nyilong\nmanzie\ncarcharodontosaurus\nbrandolini\nflacso\nnolita\nedds\nflashforwards\ntunks\nbaruto\nhotte\nwaipoua\nrensch\ndrippy\ncontentedly\nwhiteoak\nsidel\nchristijan\nniello\nlongniddry\nbrcko\ndiisocyanate\nlagrimas\nperenco\nthoughtcrime\nryad\nvenit\nwalkabouts\ncopses\nagaint\nhto\nvenusians\nmayron\nstainland\nmonied\nmulisch\neminences\nchartbuster\npensford\nreoffending\nmbanza\ntentation\npafford\nargentinosaurus\ngrushecky\nincb\nkawalerowicz\nexagerated\nwxix\nbricmont\ndistressingly\ncamelina\nmelawati\nfemtocell\nauthoress\nhavili\nniehoff\nphiltrum\ndowdle\nparrillas\njohne\nbathwick\nkoret\nmockers\nklippan\nyalong\nunowned\nfuro\ncking\nsirènes\ncitycenter\ngjoa\nsieberi\npavlidis\nkcci\nchamique\nghalibaf\nboedo\nzanes\npankov\npeapack\neran
dio\nriordans\nabsolutley\nkewl\nunprejudiced\nalwaleed\neggan\nnethercutt\nrussom\nventry\nmaquet\nspiritualistic\nkarkare\nance\nheglig\nengracia\nkoeppel\nbychan\nkierszenbaum\nkluster\nhomfray\nzwz\niacc\ngrossglockner\nmulet\nehrlichiosis\nraices\nmodafferi\ntedglobal\nseabus\nkuam\nxiaozhao\nkellee\nniesen\nkopeck\ncafo\ncotidianul\neamont\ncoody\ndalbir\nrozi\nhvb\nmaben\nniessen\nharnham\nmitarbeiter\nmorigaon\nsaddling\nnazr\nentourages\ncailloux\nantoaneta\nstankov\nmphahlele\nobsequies\nostern\nsicel\nyachats\netxeberria\nsadists\nphreak\nblecher\nceratosaur\nsinjin\nfloorless\nmanises\nbialy\numizoomi\ndéborah\nchoonhavan\nhinchley\ngeringer\nrecapitalisation\nphyl\naals\ndecaux\ncrashworthiness\nwestbroek\nfarivar\nshaftoe\ndairymen\ndundrod\nergotamine\nbrookstone\nwapner\neyedea\nwagenbach\nespadrilles\nryberg\nsmike\njiminez\nnebrija\ngerde\npugilists\nbellmawr\nzore\nwtvd\nresells\npust\nplaids\ncyberrays\nwek\noborona\nesmer\ngerwyn\nnuan\ncheapo\nsealskin\ngenin\nkinmel\namien\ncatechins\ngautreau\nkirit\ncoolspotters\nkonishiki\nessman\nvoyles\nohlmeyer\nxandra\nclementis\nhiep\nmolluscum\nmarrus\nraziq\narvey\nborates\nqudsi\nbiw\nhensman\npuffers\ndecreasingly\nnarwa\nsanj\nkawar\nspheniscus\nunisphere\nscodelario\nsyncml\ngajdusek\nspringman\naapi\nsantogold\nboughner\nashhurst\nzaouia\nfuruhashi\nsantes\nselye\njmj\nmatsa\nlancelyn\nprovenances\nsandee\nlonhro\nhtx\ndonté\ntincher\nmedicinenet\nplateaued\nabaga\nmercuri\nlesieur\nrecourses\nputas\ndisklavier\nmortons\ncountermand\nintersil\ncataplexy\nunsubsidized\nnahta\nletarte\nbozhidar\nfreediver\nnityanand\ngarabed\nbaggers\ntrivago\ncilegon\natencio\nphantasms\ngroundstrokes\nmdct\ngvm\naalten\nlangfield\ngalder\nmeraz\nfictious\nbasharat\nyant\nheartbreaks\ndierker\nlempert\nmuha\nscuds\nceq\narriagada\nryb\nmetacafe\nvaginally\nfarsta\nbristolian\nosterhout\nhoppa\nsabca\nassir\nmanorbier\ngolinski\nbriedis\npasricha\nzonday\nipms\nantispasmodic\nmistura\nkhasis\nundrinkable\nathletic
o\nrikabi\nfröjdfeldt\nstefanski\nhollanda\nmercredi\nsuder\nnorteña\nregales\namrinder\nshoshanna\nshoebridge\nlizárraga\nsmss\nhuaman\nisabellae\nekmeleddin\nrogovin\nercs\nopw\nheubach\nbellamkonda\nsomani\nrashba\ngillanders\nkoetter\nlanced\nweathercock\nmcgarty\nbandbox\nanitra\nrorqual\ncutsem\nquantal\nartcraft\ndtf\ndermoid\npantycelyn\nsomnambulism\namasses\npallonji\nlaywoman\nelectrospinning\nthirsting\ntritschler\nsarwari\ncogency\nashover\nlevisa\nraeford\nglori\nkleven\ndiadems\nrecommendable\naeternam\ngerrick\nprosecutable\nbogar\nchiau\niheanacho\nkaag\ntadiran\nlinemate\ncostabile\nlingan\nmultiservice\ntsur\npresupposing\ntoyohira\nmeale\nabengoa\nsixer\nsavusavu\njeannin\nmasatsugu\ntattling\ntamsyn\nmonsef\ngulland\npianka\njignesh\nchanco\nbouie\nluath\nsope\nforsa\nenergise\nmcleary\nenb\nbaristas\nlovebox\nhangeland\nsnet\ngrammies\nbagian\npremenopausal\nseubert\nplages\nfiddy\nomnimedia\ncowing\ntorger\nrwandese\nviraat\norlistat\nkinnersley\nkamaruddin\ngottschalks\nklukowski\nchetwood\nbarstool\nwuerzburg\nniavaran\nteasel\nwanhua\nerrante\nkarembeu\ngulda\nterreros\ntrudging\nmasco\nbyck\ndonaghey\ngraphitic\nyablonsky\nocclusions\ncraxton\nkovacic\nstampers\nryobi\nbanques\nvolgodonsk\ncprs\nhagge\nsohal\nspeccy\nsked\nconsumerlab\nbhaumik\nferko\nhigly\ndardan\nerrick\nretesting\nmisbach\nmagaliesberg\nkabbani\nvlieger\nbelzberg\nhupmobile\nenculturation\nolimpiada\ncompatable\nantman\nbagirov\nbagpipers\ngoofiness\nfanchon\nantelme\nfuzzed\nidms\ndadaab\nmicali\nwuttke\ntatenda\nwhiggish\nseher\nreconciliatory\nmoncoutié\ndramatising\ndibden\ndowman\ngurin\nxenotransplantation\njkf\nunprinted\ncataglyphis\ngigantour\npositivistic\nkelpies\nmansudae\nthermometry\ngameshows\nlivs\npenalising\nbibbed\ncuk\ntransshipped\nminarik\npaleographer\nifis\nfeigin\nmururoa\nzyryanov\nlaureles\nruini\nmogae\nmraps\nnerazzurri\npice\nkakul\nseidemann\nraivo\nmaximino\nesterhuizen\njacey\nhambali\ncopé\nmorford\nnagaya\nmellouli\nsinecures\nprimors
ka\nbestest\neyeballing\nmacd\noverharvesting\nabsconds\nantoniadis\nanesthetists\nteston\nertz\nbushtucker\nexi\nmaisuradze\nambidexterity\nlarking\naustralopithecine\nworkmate\nmcmillon\nhegley\nairds\ngraveled\nmillilitre\neastbrook\nyake\nsentimentally\noutsmarting\ncarjacked\nvorobiev\nthrillingly\nflutters\ntortas\nwickrematunge\nazita\n,where\nmaltzan\nwodaabe\nsccp\nitaewon\nbikie\nfusobacterium\nsamadi\nnavstar\neducacion\npolyus\nharibo\nconventionality\nrautiainen\nconsolo\nhypomanic\nhsph\nshangkun\nwinkelried\nguiry\ngrrl\npayg\naudo\nchatmon\nmaaa\noomen\nmammalogist\nbutenko\nsaharawi\ndonee\nyir\ngibralter\ncanoers\ncatts\npostum\najak\nunnacceptable\nlayback\nvlcc\nrapf\ncolwood\nminal\nsompting\nmiyashiro\nponch\nsegways\nlangenscheidt\ntempsford\noverrepresentation\nkvinesdal\ntiltman\njohaug\nbombeck\nnsls\ncolsterworth\nbuche\ntoady\nblackburnian\nthickeners\nwenyuan\ndeliberatly\nrosenhaus\nknoller\nkolodny\npoplin\nlustmord\ntamela\nsiendo\ncopiague\neev\nunblinking\nderülo\ncargolux\npates\ngwaith\nklinck\ndzhugashvili\nmontagues\nglahn\nbrox\navms\ngresini\nwtoc\njeoffrey\nfout\nhalban\ncliente\npkware\nhenbane\nmothercare\nbellybutton\nearpieces\nvideodrome\nseing\namabel\njhp\ncortazar\nfreshfields\nboops\nmazzella\nboym\nkistner\nrodborough\nrathor\nhentrich\nneckwear\nmajko\nwaterjets\nsauri\nbenzon\nautopilots\nplatooned\nlewisboro\niuu\nkisiel\nidealisation\nbonifay\nroxann\ngreenline\nrbt\nnormaly\nmaroth\nbloco\ngillain\nhorenstein\nadrie\nsteege\nrewatch\niweb\ncatlike\nbanri\npatriotically\nlewe\nscheuring\ndräger\nsnowline\ndagobah\nplange\nokimoto\naffliated\nhoundstooth\nrhon\nluxford\ngoldtop\narthralgia\ncesky\nhagey\nayoreo\ndaalder\nmongala\npenz\nrapu\nalladin\nisohunt\nessentia\npokhrel\ncuing\nfrenemies\nvelocidad\nogbonna\nlinet\nkuhns\nkaiparowits\nbrabec\nrothblatt\ngrossinger\nnirad\nenwezor\ntooze\nvézère\ntayla\nbloatware\nswh\nrompe\nnetherthorpe\ntransalta\nsilbo\ngrenda\nsridharan\nspectacor\ncrimefighting\nlouim
a\ngenoud\nmillilitres\nincentivizing\niweala\ngeoglyphs\nmalloys\neasynet\nmeladze\nboldak\nbaldon\naghajari\ngorak\nqaitbay\nchuquicamata\noffically\nchocano\nmelwood\nawsome\nwaleska\nsalicylates\nambarawa\nbiffi\ntyendinaga\nstotfold\nimclone\nminjok\nrunabouts\noltmanns\nabdelhak\nkebe\nconures\npoweredge\nrowans\ndyspareunia\nrecolonized\nprezzo\nsmallbore\nprinsengracht\nstanborough\nvandenburg\nwhch\ncamming\nstandoffs\nlutèce\nluwak\ncortinas\nluxenberg\nduratorq\ndegc\nbussan\nbouyeri\nmanzon\nirbil\nekotto\nbayman\nwhippets\nbalikatan\npaddleball\nniell\nwiggily\nadolfs\nifoam\nperrino\nwlross\ntoyz\ngaoled\nhydrangeas\nbougherra\nerap\nfrancey\nbaute\nfrancy\nhegge\nreinterment\nplungers\nzivic\nzelin\ncountrypolitan\nwinkies\nldm\nyukata\nslimness\nposti\nkolko\nmahorn\nchecketts\nlhm\nbohbot\nbraying\ntunisi\nsorella\nthorrington\nturds\nyangcheng\nmazzanti\njcf\ndanida\ndemagogy\ntkvarcheli\nschlein\nbronchoconstriction\nfalkow\nloping\njeita\nosumi\nnoddies\nmanorville\nkxan\nnlos\nbrodgar\nposers\nfinos\nquangos\nshinbashi\nqai\ngaldo\ncaking\nterrey\nmurman\npalmerton\neury\ncystadenoma\ncucu\ndolbeau\nborjas\ngoalby\nfarzaneh\ngreedo\nlauaki\nnariaki\nmaneki\nchatri\nhurtgen\nhsupa\nnanofabrication\ndataquest\neitam\nintially\ncabbies\nfrump\nrebury\ngafni\nwoodies\nsopp\nmobilink\niloko\nplunders\nmultivac\nmetab\nchatwal\ncenterfolds\ndusi\nmytravel\nshayler\nsteinle\nordinands\nmease\nravencroft\npapademos\naleesha\nderailleurs\ndessoff\ndosha\nglasbury\ntalash\nnamechecked\npurnama\narganda\ncalhoon\nargentaria\nnémirovsky\naveritt\nafuera\nwebroot\nbraceros\nmekorot\nroschmann\nravidassia\nwez\ndarmody\nmkhize\nklecko\nfleeced\ncecere\nbreyten\ntbu\nconficker\nnanjo\nstinchcombe\nchoderlos\npein\nunstinting\nfocht\nkobal\njudee\nlezgin\nkoleva\ndalilah\nkht\nwestminsters\nzambry\npialat\nperfectionists\nbylund\namadei\narrecifes\nsdsm\nsafiya\njacobsohn\ncybil\nssis\ntevere\nipaq\nbemerton\nbarsa\ntwirled\nvanille\nmellini\nsuckering\nguebuza
\naquanauts\nspisak\nswiftest\nvinet\ngreive\ngoslings\ntigerlily\nrosnay\nmizos\nharlen\nwesolowski\nceska\nbatcheller\nryazantsev\nthurlby\nhennell\neleme\nfraiser\nirremediable\npolasek\njaneen\nalne\nyitzhaki\njawlensky\nburgettstown\naffordably\nzewde\nnocht\nhulet\nicaro\nbasdeo\ndecklid\nzurek\nsabean\nadjourns\nmaaouya\novermuch\nnabel\nmekelle\njiulong\natran\nanthimos\nfettiplace\nsalesgirl\ngibault\nlineas\nhatz\npedrick\ncampolo\njid\nteatri\ngrabau\naltruist\nheeler\npennekamp\nomaar\nvtp\nwsfs\ntoshiharu\nhaberfeld\nwincer\nbulmers\ncennamo\nsoirees\nabalo\nchabal\nbourneville\nchelwood\nadages\naimard\ngarwin\nkoepcke\ntrimalchio\nshinbo\nriddims\ngottardi\nbadwan\ncarvery\nleashed\nschaufuss\nballybay\npintar\nweena\nmicrometeoroid\npapadopoulou\nproske\nhumanlike\nvillatoro\ngazers\nrescigno\ncocody\nbosher\nabolfazl\nmanukian\nhannifin\nlamphere\ngamine\ncoverley\ntippler\nscrewvala\ndemong\nnetjets\nmue\njethwa\ntxiki\ncitabria\nwakker\nrolvenden\nsemidesert\naudlem\nsaurischian\npaudie\nnotetaking\nriddhi\ndanley\nalberg\nmaddeningly\ncrudeness\npagulayan\nacat\nmossie\npatillo\nsaipem\ncrabgrass\nnanosatellite\ndefination\nbellefield\nconradie\nmegna\nshaer\nmatalan\nkaria\ndharker\nseetharaman\njathika\noehme\npyracantha\nhitti\nluckes\ninvensys\nathelete\nrvi\noxeye\nhirer\nquarterbacking\nminkler\ngarnes\nseribu\nklaveness\nmsgt\ngncc\nkrizan\ndisqus\nvigurs\ndegussa\npetersons\ngumprecht\nriskiest\npodres\ngangeticus\ntalarico\npoth\ntaseko\nspolin\nwrightbus\nsemmler\nmethot\nmbenga\npertile\ntoubin\nshamong\nncds\nasfour\nwellbutrin\nromanticizing\nlamis\njashn\nlazarou\nassailing\nkamins\nguimond\noleifera\nchiefland\nsuprematist\nparthenocissus\nbanaszak\npharmacodynamic\nakila\nmshsl\nmannini\ntassotti\nsteeplechasing\nhoneypots\npéché\ngorefest\nreverberant\nsherie\nabena\nflighted\npucklechurch\nlavonne\nmontebourg\nconversazione\nwfmz\ninfluentially\nsonde\nrequited\ncreepiness\nozcan\nposselt\nmeadowlarks\nbunchy\nhaught\nfluorescen
tly\ntekwar\nburkinshaw\npinschers\ntristam\npalfreyman\nstrivings\nbresnick\nposch\nthatchers\nrilles\ndunstaffnage\nadjaye\npacher\nsmolenski\nschepers\nbushkin\ndecongestants\ngreenies\nmiet\nsmallworld\ncryptozoologists\nkielburger\nkrishen\nbehnken\navv\nboretti\nvigilantly\nshakila\npetruzzi\ngreentech\nwrockwardine\nlere\nverey\nmacko\nxtm\nbritvic\nscal\nthodoris\nmatthies\nventola\nzitouna\npolydactyl\nsoapstar\nlamming\npontesbury\nisat\nptd\npolinsky\ntanen\ngreacen\nmamah\ntubifex\nsucess\ngdps\nsibon\neshun\nsunburnt\nwollert\ngwydyr\nxcaret\npipistrel\ntolles\nloie\nelvy\nmethemoglobinemia\nginni\nproactiv\nfiberoptic\ndownmarket\nescucha\nvandam\nmetula\nnodo\ntaiyi\nraptures\ncfca\nraiwind\nscalloway\nabjection\nmarkopoulo\nobligingly\nshoddily\npersbrandt\narchdaily\nchioma\nbellido\npagliaro\nmegajoules\nsuggestible\nhinkson\nbivio\nkakei\npancevo\nkhaldoun\nartforms\ngrodner\nhushing\nforstall\nlinthorpe\nousts\nustaz\nayot\nbloodroot\nmicrocirculation\nbigwigs\nacrs\nsidepods\nmakinwa\nbumgardner\nkinson\nhanksville\nsancar\nsalaya\nketh\nrushan\nsheheen\nunremittingly\nkupi\npemra\ndurani\notm\nmaberry\nnhulunbuy\nlingwu\nnaturalia\ndager\nlichtenfeld\npukki\nromanija\nhaino\numlauf\nbmac\nmusicke\nsahelanthropus\nirishness\nsupereva\nsekong\nwahlstrom\nsheepscot\nkohlhaas\nbloomin\nultratech\nkretchmer\nintelius\ndelagrange\ngolfweek\nminnewaska\nisci\nfaty\nvaliquette\ntourne\nlarusso\nguermantes\ndateable\nfranglais\nzolotarev\nlouche\nhurok\nrolands\nmuscadet\nwinscombe\navvenire\nbeantown\nchellaston\ntacchini\nunremarked\nbuzby\nroomier\npiously\ndespotovski\nvinerian\nbrunk\nlordswood\nsörensen\nmezzotints\ntamesis\nfrancina\nvallarino\nsupercarriers\nonthophagus\ndetestation\nerzerum\nroyces\nyanagihara\nlineaments\nhristos\ninbee\nyellowface\negu\nregionalists\nponding\npsyops\nzacks\ngunsberg\npsychotics\nhosken\ntailender\nchicama\nconjoin\nngh\nnzf\nhaweswater\nkedge\nintraventricular\nxers\ngunflint\nromps\nuliginosa\nsurreys\nhlt\n
prettily\nchako\nzenair\nasomugha\nkunzru\nborsos\nboes\njarrahdale\npamintuan\nmatrixx\nlamentably\nkierra\nbonazzoli\nrowes\nskandinaviska\nmcburnie\nromanticize\nvuvuzela\nmoisturizing\ncircovirus\ncastrogiovanni\nbasilea\nlappe\nstinton\nponceau\nwolframite\nadma\ncontemporani\ngiuffrida\noutshined\nfoncier\nquos\nkhaldi\nfenghua\nsasho\nmalloum\ntulafono\ntibidabo\nroychoudhury\ntelevangelism\nfalletta\nglucosinolates\nwhacker\nroset\nnovenas\nguga\nhyphenating\nsubfolders\nmalveaux\nchuffed\njumel\nansbacher\nbotching\ndroppers\nfritchie\nkonaka\nbamfield\nraissa\nsignificances\nlangeberg\nogidi\nmeting\nmorritt\nmaggin\nmawi\nwhiner\nmondriaan\nratter\nlerew\ntorishima\nleinwand\ngrigny\nimara\ntorigoe\nrzayev\nrieth\nhenrick\nkyrgiakos\nfeffer\nvelos\nrooz\ndaff\nchadash\nkalkan\ngiannetti\nshibley\nlangeveldt\nzhongguancun\nsaidaiji\npanaca\nschlanger\nreproaching\nkni\nrustie\nhonkytonk\nmaisha\nsquanders\ndiscomforting\nhamachi\nfreeholds\nschs\nlirio\nburpham\nisme\ndroops\ntoonz\ntanh\nfreixa\ngomersal\nviewshed\nlasar\nzayani\nbalancers\npuw\nmalorie\nborgs\nerzulie\nmerriwether\nksaz\ndiggi\ngenette\nmorss\namenophis\nmcelrath\nclontibret\nfroots\nkeoni\nlimner\nchuxiong\nlyburn\nphrma\nreche\nchidester\namounderness\npaluch\ncolbourne\nsariska\nindubitable\ntettey\nshuwa\ntyahnybok\nadicts\nharked\nfhfa\narround\ngyrations\nruaha\ndiscuses\nmedland\njpt\ngyron\nght\nlafreniere\nlumidee\ngravities\nmercan\nbiosocial\nassimilative\nparasitizing\noilmen\nseadog\nethelburga\ncapitolio\nmalariae\nbiocidal\ngope\nbruccoli\ninterquartile\nbassen\nproducible\nvenema\nfunchess\nwindbreaker\ndimaio\nselcuk\nlaxmibai\nrichters\nschroll\nperhpas\norbus\narthropathy\nstockinger\nfloorpan\nkiszczak\nloughead\nararas\nestridge\ngorry\nwildcatter\nyedlin\nmappers\nguthy\nandrii\nvasospasm\nbaraan\nindispensible\ntyrannies\ntolbiac\nlanderos\ngunnislake\nfreecycle\njobi\ntwinjet\ncarras\nshorouk\nmaschke\nperquisites\nshafroth\nagstar\nyusupova\nsticklers\nwhatta\nno
hl\nnijdam\nstoykov\nbootes\njuncosa\njessamy\nsujoy\ndefoliated\nmientkiewicz\nanorectic\ncamarones\naftertax\ncopperman\nwaterbrook\nachiote\ncqs\nwaurika\nmaybes\npsoralen\nlangmead\ngaudete\nvernaccia\nginge\nneuropsychiatrist\ngsas\nednam\nyttling\ngilbertsville\nlamantia\nchodas\nspazz\ncavos\nmicrolights\nmunif\noestreicher\ncheatle\nheugh\nbalamory\narmenio\npcms\ndolgov\nkahrs\nesencia\nrenfrow\nacklington\npaektu\neschauer\nblackbelly\nhongkongers\ngraczyk\neeu\nahca\nshbg\nrulin\ngreenness\nsharonville\nmiscue\nnyhus\nacclimatise\naerosystems\navrom\ngengo\ninconstancy\nhpm\nqassimi\nreconcilable\nvitek\nllandrillo\nkanie\nghattas\nhysen\nlamoreaux\nlangauges\nmaseno\nmultimodality\nsuperbrands\ndonard\nworx\nflad\nshinju\ngreetham\nharsch\nforeignness\ncagr\nbaichung\ntotty\nrefuseniks\nadenuga\nmountlake\nkrumme\ntopflight\npropecia\nsnorkelers\ndavro\nmuyu\ndobrica\nannaka\nsubmittal\nhalba\nsoulié\nrieck\nglickenhaus\nsnitches\nwhiteknights\nmassengill\naharonov\nstihl\nnake\noverdrafts\nchandrasiri\nchion\ndaytrippers\nsollee\ntalibon\nincautious\nbathos\nbleck\nfinnessey\nbouazza\nwestly\nbabenco\nlundborg\njungwirth\nwgms\nbiggert\nparasitosis\nshambaugh\nmandich\nspeightstown\nsuperlattice\ntolos\nsolman\noldpark\naerovias\ndisallowance\nbloodsuckers\nsurfline\nmeijers\neyde\ntotec\ngabbidon\nflexilis\nbournbrook\netherton\nlairig\nfortinet\nzuckert\nrodanthe\nsaawariya\ngrimaces\nhurlbert\nafoa\njacquire\nrowney\nbrumidi\nchornovil\ncivils\nacupuncturists\ntrichomoniasis\nmoneylending\ncorboy\ntripler\nbadder\ncodpiece\neifs\neryk\nferaud\nmanasieva\nthissen\nimperato\netno\nportends\nlyashko\nphulbani\nmattinson\nhufflepuff\npuckered\nraloxifene\niruña\ngbf\nfiggy\nupreti\nmarinades\nabecassis\nmotts\nsundeck\ncryptonomicon\nmumblecore\npietz\ngluzman\nspringwell\nesqueda\nopin\ncurae\nshafting\nnebot\ncapodanno\ntakeno\njoch\nstrikebreaking\ndimity\npempengco\ndurational\nchalvey\ndalkowski\ncorrelli\nnjs\ngwersyllt\nbrandstätter\nsirica\nwarml
ey\ngawk\npaschall\nmamola\nplatooning\ninexistent\nfauteuil\nninis\noverdiagnosis\nnetzarim\ndenters\nkutscher\nthroughway\nwztv\npawnshops\nfurin\nthabet\nelvio\nmoakes\nhuasco\nmamay\npayap\nmiralem\numgeni\nwarrents\ntanki\nhntb\nbodysnatchers\nmannucci\nundresses\nprincipato\nroerig\ngerada\nenio\nstoltzman\nshurman\nslory\ndevilishly\noneto\nnkosazana\npresciently\nbladecenter\nfukumura\nranby\nmincher\nwidad\nhousebreaking\nvaporised\nbailer\nooms\nbengie\nsaussy\nshavar\nnervión\ndinovo\nrabaa\nasoke\nbellver\nthringstone\nduzer\ndefrosted\nwinternationals\nhyne\nakingbola\nboultbee\nunderwrites\necheveria\nunamerican\nmattin\negemen\npalaeogeography\nlambsdorff\nargy\nunbinding\nbrigante\naptamer\nmarchio\nbcbg\nflatbreads\nearlsfield\nshofner\ndunay\necstacy\njiwon\njaenisch\nudy\njaval\npotes\nnimis\ngerberding\ntorslanda\nsalves\nbusboys\nwaveriders\nhoughtaling\nlehnhoff\nmasuzoe\nconceicao\nkelechi\nmatley\nverdicchio\nflatau\nasdrúbal\nlarenz\ncapewell\nmwb\ntravelator\nguangcheng\ngeotagged\njabel\ngamlin\nblissfield\ndarwinists\nmalpeque\nbesley\neimert\nstamfordham\ntransaero\nexploitations\nlarena\nthall\nlukashenka\nfermentum\nchimoio\npurp\nbakala\ncolegrove\ngugulethu\nxenocrates\nuninsulated\nlaskowski\ntheodoridou\nfsx\nprusik\nrgk\nwerts\nastound\nkuhrt\nolema\nmeditational\namess\nalgarotti\nwendkos\nashrawi\npracticals\ndoubleton\nviscosities\ntge\nspaceboy\nshoka\nwahls\nszenes\ndcaf\nbresch\nhandgrip\nnavteq\nshyly\npamper\nshokat\nlasi\nshibetsu\nwashford\nvarias\nzuroff\nsilkroad\norh\nkarmah\nnamedropping\nmacbeath\nbenigna\nderrington\nbodor\nfairlington\nnishizaki\nletsie\nbolet\nexplora\neschede\ntalgat\nqilla\nchitungwiza\ndubie\nviriathus\nakili\ncryptologists\nbarbies\nhowgill\nyanchev\nseversk\nrupinder\ngardnerville\ngleim\nmermet\nvezzali\nboseman\nmyun\ngabions\nboyers\noverreached\nleoluca\nzwol\nlevithan\nxiangning\ncollagens\nkristinia\nleslau\nedtv\nwlp\ncodi\nbielby\ntogaf\nvawa\nwrda\nolev\nvicker\njohndoe\nhensen\niov
\nchuggington\ngardos\notso\nteir\ndodworth\nhvp\nprotostars\nmorana\nnfln\ngiric\nkigen\nspitter\noverscan\nharat\nwalding\nshemtov\nmeeta\naepyornis\nmezei\nschayer\nfibroid\nyatsenko\neinsteinian\nhobden\nsprucing\nthow\nmarinol\nmeguid\nruppe\nperjurers\ncepero\nkoshiba\nbéchamel\naspden\nverny\nchangeless\njene\nelektroprivreda\nangelin\nmatousek\nmortlach\nbobbled\nantitussive\nfluffer\nsiddarth\nrehung\nxenograft\noverawed\ncouncilpersons\nincanto\nchipboard\nbernas\ndehumanize\njibjab\nspermicide\ninhalational\nahlin\npigford\nsaik\ngavage\nkhatir\nmytholmroyd\nmomchilgrad\nrori\nmidea\nwega\nsorgue\nrosling\nwiveliscombe\ncoruna\ncontorno\nboffins\naún\nbouwman\nyero\nawre\nblixseth\nhihifo\nmilram\narnarson\nsuperdrag\nnossiter\ncristofer\nriddlesworth\nxer\nwlox\nbenini\nmsic\ninternation\nhaury\npalpated\nscavenges\nfootrest\nirpinia\njcg\nhendriksen\nmotonobu\nakyaka\nstroganoff\nalphin\nkollias\nmanovich\npeterle\ncarhartt\nkuhio\ntattletales\ntrevisani\ngwendal\npennings\nneachtain\nscomi\ngrissell\nsharna\npapermakers\ntadej\nstargardt\nonecare\nscrounging\ngrewe\nmillender\ndhai\ncraftily\nformateur\nmerwan\ncovelli\nsakkara\nbastianelli\ngerena\nrasm\nbetley\nlatrice\ncolombus\nfdle\nblights\nbernabeu\nlilliputian\npierangelo\ncossio\nlipscombe\nglycols\ndeitrick\nmalua\nqma\nhuaqiao\nxinsheng\nboria\nrosental\nfiaich\ndurgan\nbryag\nknowlesi\ncontrasty\ndowagers\nyulong\nslighly\nslurpee\nkosheen\ndubov\nairconditioning\neuropes\ntherian\ndtra\nchartreux\nlrf\ntfu\ncordera\nfunnyman\natheletes\nminestrone\nyemassee\nfreedy\ncelera\ndemant\nloquasto\neuregio\nvanderlinden\nclanking\ncourteau\nscatting\nschily\ndeiss\nkidan\nstiffelio\nrerunning\nwiuff\nstrelitzia\nevron\nmistype\nmicklegate\nkalnoky\notunbayeva\nslowe\nalsatians\nleszczynski\nkhreshchatyk\ntsitsikamma\nstensgaard\ngugliotta\nshakai\nbaratashvili\ncimini\nyasawa\nmotson\npxi\nzolciak\nemmetsburg\nriascos\neustice\nkimora\nnewtownbutler\nminiclip\ndeemphasized\njungfraujoch\nmercuria
\ntassoni\nbrazda\ngroveport\nphulbari\nedgaras\nfirmani\nkubby\nlimin\nblerim\nraschi\nhaigler\npopsugar\nvisitscotland\ntellis\nstaffroom\nbirkerts\nwraf\ndarek\nmosko\nepton\nradiolocation\noversite\nombudsperson\nlambot\ndunville\nfoudy\ntiffs\ncenelec\nmoov\ncleminson\nhogsmeade\namanatidis\nkobach\nspartiate\nalkire\ntenuissima\npatels\npinger\ncondy\nrolim\nneatest\ndirda\nhajdari\nbeazer\ndande\ntoones\npowerstation\nilec\njaybird\ndafeng\nmedhin\nfinklestein\nskywriter\nyary\nmegale\nsipc\nrajdeep\nauthoritarians\nprabal\nsuya\nparros\npublio\nsulfamethoxazole\nzano\nencouragements\nestec\ngamon\nkintbury\nevered\nnonfatal\ngodon\nbacau\nmpondo\nmobis\nromario\ndelfont\ngorme\ndemián\nglenluce\nliron\nwerre\nmosop\nkakabadze\ntalhah\ndossett\ngalon\nivorians\nkourouma\nseyval\nvoiles\neskisehir\nsejersted\nsheepwash\nthorius\naspira\nberwanger\nonl\nintone\nwinfrith\nsunniva\nmainers\nayed\nmkg\nnegritude\nleyba\nbetim\ndermochelys\ncrewmate\ncalmest\nrossmoor\nvolturi\nraniero\nheliostats\nbulteel\nbrigette\nmangope\nstarzl\nwildermuth\ngherardesca\nbaitfish\nunboxed\ndanesi\ntariel\nhildale\nmorelle\nsiat\nstefka\nfollia\nagayev\njonás\nsteelcase\nphotomask\nishbel\nstachura\ntirmizi\nsunuwar\nbuehlmann\nmargelov\nkitv\nanote\nvorticist\ncatfights\nwysong\nwideout\nmuscadine\ngilchrest\nvillians\nswitchblades\nteutsch\nrdas\nrestaging\ngallions\nkenter\nhitlerism\ncusson\nduisenberg\ngcms\nqcf\ntrattou\nfitoussi\nitns\nbirkenstock\nlisping\nalzado\ndanniella\ngirgis\nsaleslady\nvukov\nchupacabras\ngunstock\nroulston\nretrogression\nsalloum\nvassileva\nsmerch\nsiyabonga\nmeisinger\nlahinch\nunice\nsperlonga\nhemlines\nrampaul\ngiresse\nfeser\ndoorbells\ngreenore\nstarobin\njambon\nbertrando\ntannoy\nledoyen\nmacmillian\ndobsonian\npavicevic\nhollamby\ntocotrienol\ntroian\ngft\nbabji\nokalik\nescriva\nrajbanshi\nsavoldelli\npreddy\nstethem\nshovelling\nfuzzies\nbffs\nstorkyrkan\nsilverhill\nbluetec\nmaaskant\nangelillo\nolisa\nkabunsuan\nmarkinson\nsaurel\n
goyle\ninterdit\nguidant\ntrisakti\nnugegoda\ncandaba\ncapaccio\npaimio\ngizo\narbeter\nfuren\nshippagan\nnely\nanimesh\nshrm\nwodiczko\nmakram\nrohloff\ndemare\njiuzhaigou\ncrmp\nmuzi\nordina\nfalkingham\nomand\nsuzlon\nsiedle\nscaglietti\npiotrovsky\npescheria\nbondies\nhopefulness\nsmy\ndurlston\npharmd\ndoormen\ngashes\nhexter\nunrewarded\npentlands\npsychostimulants\nfeudalist\nstandifer\ntaschner\nkonate\ngillin\npliego\ncacs\nhicken\nverdura\ncondict\ndalecarlia\nacmd\ngroseilliers\nwaxcap\nurias\njerel\nmylonas\nsakellariou\nwangs\nbigbury\nmustoe\ntopete\npoled\nsmirks\nllynfi\nmobin\nphab\ngundel\nmisers\ndeclinations\ndallinger\naerator\nhanneke\nfootprinting\nseffner\ntreaters\nmarcianise\nburnish\njamail\nrobbs\nniñas\ncheston\nmzimba\nfreakishly\nisokon\ndoxy\nkxjb\nparvaz\ncrosshaven\nlookingglass\nsambi\npaciano\ncockleshell\nhustad\nhotlist\nverschoor\ncifa\ncarafe\nkobelev\ndoney\nlumpen\nbellavia\nprotazanov\nstahlschmidt\nherida\nmindtree\nehrhoff\nasgeir\nreappearances\ngambrill\ninishbofin\nstadlen\nbelonger\nmarchionne\nnka\nziq\nlychnis\nmaeder\npentreath\nnhr\nlevanto\nyokoo\nspinothalamic\npratto\nmaquiladora\ntomlins\nmaeva\ntumas\nnoyd\ndesikan\npetechiae\nkanine\nriteish\nhelplines\nfefferman\nadelita\nliansheng\nyardarm\ndegroote\nbezige\njeebies\nsetty\nmedair\njmw\nduvernoy\nbiolay\namaris\nvdt\nsanal\nlaam\nmiramonte\nlivestation\nremen\nderkach\nimpotency\nbruss\nmelies\nkonecny\ngurdeep\npreconfigured\nmoriarity\nnystrand\nhyperinsulinism\nklyne\npampers\nstratmann\nchoisir\nmuhsen\nyumin\ncanelas\nbenzalkonium\nmurrysville\nbernes\nantipholus\nezeh\ncampout\nzhongxin\nmaltster\nassche\ndamaru\ndunnage\nbunds\nadastra\ntindemans\ngoorjian\njims\nlijo\nkatten\nhotpoint\nbecuse\nthomasian\ntidiness\ntmvp\nzhongjian\nsufferance\nzersenay\nbaralt\ncioroianu\nsarkari\nchaitén\nguaraná\nilliam\ncorless\nhassanein\nmorys\nconesa\nllao\nhishammuddin\nshuichiro\nzarifi\nbauls\nattalla\nwwu\nackerly\nugrian\ngdst\nasadov\njangmi\nhoong\nasse
nts\nbja\naffandi\nemceeing\nusag\ngenz\nmellers\nwaah\ntechint\ncheddleton\nnute\nspdr\ncokey\nashburner\nfakty\ngiannelli\nbührle\ngorlitz\nbrodsworth\nnhf\nkazanjian\nteddi\nmotovun\nkoci\namerks\nbreithorn\ntopcoat\nstohr\ncallegari\nsteiff\nkozar\nlongenecker\nbiers\nheldentenor\ncaffin\narmelle\ncoddled\nhazels\ndutchy\ndamaliscus\nclavicles\nnephrops\nroey\njansa\nbromate\nbeechgrove\nyahyah\nstembridge\nveltri\notton\nfirebreaks\nadeane\ndavari\nkinderman\nhayfork\nsabath\ngeordies\npaharpur\nmillhauser\nepigenetically\nkandia\nsundon\nfoulsham\nwawarsing\nworkrooms\nmuckers\nstrelley\nmorayfield\nqvt\ngraining\npiked\ntohill\nstratham\nsmyril\necbc\nmclucas\nbisso\nroutable\nantolin\nfeedburner\nbindaas\ngromova\nsuelo\ngrassmarket\ncheckmated\ndenilson\nlancang\nhuebsch\nmetier\ngoogolplex\nfascinatingly\nlybia\nserendip\nnorihiko\ngrottaglie\nbrodkin\nrabotnicki\nagbar\ngrecians\nbowyers\nanchiornis\nqbert\nturab\nburgio\ntayback\ngeschke\nsawhorse\nfleeshman\nexhumations\nperfectv\nmiddleville\nelsasser\ncranswick\nberdahl\ncamco\narreton\nresistence\njacarepaguá\nwerbeniuk\nweisgall\nmisconducts\nmilitates\ncodebooks\nnjenga\nkadison\nglenbervie\ndaigneault\nskunkworks\nintradermal\nanj\nqassab\ndatastream\nsinisalo\nkvaerner\nwalthers\nfriedensreich\nwesham\nrageh\nperryton\nkidner\nkandra\nabercarn\ncavens\ndovi\nsubra\nhender\ncarsport\nfurmint\nschjeldahl\ndannel\nlandscapers\npigmentary\nfbl\nwürtz\nfurbish\nsarcos\nstuc\nfenin\nremes\nedms\nmemari\nlocicero\ndorfmann\nhodapp\nteosinte\nwatkinsville\nfountainhall\nseith\ninspirer\njireh\nissuances\noceanology\nrawk\ncogley\nvictimizing\nmeasureable\nnnt\nkesri\npastilles\noseary\nwendron\noleoresin\ngreenspoon\nrouyer\npencader\nhotung\ntonn\nklengel\nassertations\nstammler\nabiword\nautech\nheene\nkligman\nespcially\nbacevich\nafganistan\nentier\njpf\nadomian\nwect\ndeets\nwinos\nzenko\ncyanohydrin\njamshidi\nrajadhyaksha\nnockamixon\nrajmohan\najusco\ntransection\nvillagran\ngowling\nmoultonborou
gh\nsikma\nbahah\nsamdrup\nqsa\nmunnerlyn\nbasciano\nwirkola\npezza\nrobaina\nparticualrly\nprioleau\nhanspeter\nbitner\nknsd\nmbengue\nsheberghan\nflinty\nmaon\napob\nmembury\npockriss\nhussien\ndeerstalker\nhendricken\nvindictively\nlaurene\nchinedu\nsomeya\nodourless\nnotter\nbluma\nheddy\nmcgeechan\nnabateans\nsokha\nhammack\nprognostics\nvishnevetsky\nprogess\noasi\nlacouture\ngwala\nrazvan\nshirzad\nhugman\nvanette\nadresseavisen\naleksanyan\nzeichner\nbabas\nbews\npasni\nbassanio\nfleshly\nophthalmologic\nbinkie\ndadrian\ncesenatico\ngeduld\nshoen\ncipro\naquatints\nsasin\npoelvoorde\nbuzzanca\northotic\nmingulay\nthickset\nbbss\ndilson\nhazarat\noswiecim\ngoedel\nzandberg\ninsted\nxla\nglamorized\nsheran\nreeta\ninfosec\nunengaged\nmuckrakers\nmarianelli\nkippers\nbuentello\nrybolovlev\nauvsi\ncesarani\nkotchman\nspirometry\nmarkfield\nsakoda\nconstancio\navellan\nplanchette\ndeaderick\nmope\nwheatly\nmixner\ngusenbauer\ncfac\ndiwaniyah\ninfosphere\nwinnberg\nstiliyan\nappleford\nantich\nseydel\nvarco\nbambenek\ndownsville\ntulipe\nvabre\ngeoeye\nadipic\nheebie\nglemp\njaleo\npomc\nklees\ncrowberry\nworkless\nrueter\nlapper\nrongcheng\nradiopharmaceutical\nnalgo\nceratitis\nsensationalizing\nderma\nsiput\ndaljeet\nstorebrand\nhizballah\nanonymised\nhids\nscee\nluzinski\nsandars\ncarmello\nfilice\nrotas\nhuur\nasiya\nzanda\nduttine\ndixton\nniceto\nmerrison\nkonstantopoulos\nacetazolamide\nsincerly\npicou\nplaats\nuncompelling\nskycycle\npeatbog\nyutian\npiner\ndsos\nsiegmann\ngkids\ntimofei\ntompion\njakobovits\ndempsie\neryri\nimpington\nbeauval\npuchi\nsalway\njins\ncannibalization\nrivarol\nshawa\nmangaung\ngodstow\nkellenberger\nschieber\nofqual\noutloud\ntalulah\ninstream\ndolle\ndolgarrog\npiddock\nlimi\nyuyama\nlustick\nshishapangma\nmascha\nfuredi\ntropiques\nprefabs\neisenhardt\ncropscience\nearlsferry\ntilth\nibraheem\nparliment\nnephrogenic\neaglescliffe\njazzie\nethe\neastbank\nbiggerstaff\nusurers\nlourenco\nbewes\nintercropping\neuropeanist\ncy
mry\nszczesny\nknoke\nkeyte\nroopesh\nhonorato\nweepers\nmikaeel\nqueijo\nayanbadejo\nmoonface\nlimthongkul\nmarkethill\nzagunis\nlcos\naltig\nhondros\nsteakley\nillah\nmilou\nmerrilee\nillarionov\nalbeniz\nyetis\nshier\nthurmaston\nfrak\nmoritsugu\nrutted\nikhlas\nhamdaoui\nbarsotti\nunselfishly\nfurlow\nbravi\ngodwins\ncecconi\nantwaan\nunclearly\nchangemakers\nbyob\nkookie\nterseness\nsouthchurch\nyuhuan\nanycase\nreinga\nisolina\nchurros\nmoiseev\ncdsa\npridemore\nstickies\nfbg\nshinners\nmcglinn\nechs\nfewkes\nsymonette\nfasil\nmezza\ntherme\naggelos\nmoonwalks\norangetown\nyandle\nsubparagraph\njazzier\ncines\nullico\nhitchman\nendsleigh\ntrademe\nfernery\nfawsley\nmanatt\nvoge\npetrassi\nqurei\nsuau\nrimma\nborracho\nstiner\néclat\ndrem\nforestdale\naxelos\nbobsledding\nramfis\ndiani\nbaxi\nanchusa\nhyperventilating\nvougeot\nostp\nvillaume\ncames\nrajpipla\nkolpino\nlicet\nmorellet\nkettleby\nheugel\nsumatriptan\ndemeure\nansary\nteece\nbatia\ndelvina\nindoctrinating\nhemiplegic\nteko\nworkaday\npulitzers\npyt\nlelie\nkiyani\ngemologist\ncorymbosum\ndarcel\nfranked\nshiono\nlapuz\nrabeh\nwcmh\nviettel\nstrykers\ncherryfield\nmournfully\nmosinee\nrozanski\nsohm\ncompstat\nzhixin\nskeletonized\nnormatively\nbéarnaise\nnestler\nsnodland\naddin\nkirkliston\ndemick\nbaturina\nwasner\nsummerhays\nunreduced\naudiocassette\ntmcnet\naucun\nbichette\nstrine\nsulphides\nchainrings\noonagh\nwhiteclay\npachman\nfalcinelli\nabimael\ndeckhands\nctbt\ncollura\npowertech\nexplantion\nflorrick\nhearkened\nseend\nskenderaj\nulcerations\ntennen\ncamaret\npulvermacher\nfusaro\ncustomizer\nbasca\nadik\ndumpers\nblace\nspecifc\ndeluce\nzgoda\nalsip\nsmugness\nhublot\nbioterror\npettyjohn\ngeat\ngrutas\nwtov\nmontelongo\nbudinger\nreshad\nsonneveld\nrmo\nxpert\ncihat\nreassuringly\nmadikizela\ncantle\ncalciopoli\nrezaee\ninternalisation\nkurka\nkeddy\ncuajimalpa\ndispleases\ndeliquescent\nsahani\ntrommel\njinchuan\nratepayer\nbeml\nwbcs\nchw\nlabbadia\nwintershall\nabstergo\noverwe
ening\nwahida\nullage\nwatchband\nmanina\nclouthier\natin\nrelativly\nkeyon\ntoyosaki\nexcretions\ngermon\nwarmongers\nnarz\nspeth\njungk\nstonefly\nshoul\nbugno\ngigantor\npalamau\ndisingenous\nseige\nbehaviourist\nvaginoplasty\nguyan\nudalguri\nschyman\nbilthoven\nsubmunition\ngoldcorp\nphyllo\nwagenseil\nluyindula\nstreetside\nivanschitz\ngroaned\nmohammedans\ndayparts\nmedunjanin\ngilkeson\nlechi\nwelburn\nkagari\nvalere\n˘\nmugabi\npmoi\nsmotherman\ntateishi\nanner\nchersky\ndjourou\npalley\nsooden\neleftherotypia\nchromophores\nairbridge\nutsira\nnarsingdi\namplifications\nchimerical\nmcmonagle\nstaplers\ncolborn\nkuck\nabrigo\nleadgate\nvanasse\nkumars\nabdolreza\nezekial\ncommissionership\npartys\nmorrissette\nvocoders\nooltewah\nwoodturning\ndrtv\nwdw\nchage\nlaunders\ncharyn\nmatalin\nenesco\nsightless\nroseline\nbipropellant\nhohensee\nkiken\nliferay\nkhori\nmourneview\nhenzell\npierian\nscoones\ncompiegne\nanglophobia\nviteri\nschortsanitis\nsamsudin\nprecipitators\nyanaka\npinx\nfreshet\nizhmash\nfriman\nminnaar\nbetrayers\nneurolinguistics\nendocannabinoids\ncontrole\nosse\nmassiel\nbruinsma\nmiragaia\npejic\nhearkening\nalee\nhamud\nkotche\nleuthard\nmyddleton\ngasmi\nvender\nventersdorp\nlandport\nervand\nkingshill\nantimo\ngedman\ntiddlywinks\nmaceachen\nmoonglows\nsubscapularis\napurimac\nelkem\nbijon\nslota\ntarsila\nwaen\nnasw\npunga\nbusinessworld\ncrunchie\nshariatpur\nhufford\nbreazeale\neyesores\nhelpmate\nbifidobacterium\njuventino\nthion\nbilimoria\nshoutout\nenochs\nbachianas\nusery\nstiassny\nresan\nschefter\nstoplights\npersonajes\ncorpi\nlzr\nmicheldever\nsinaloan\nherscher\nnatalizumab\nhlh\nfroehling\ncentralists\nrsis\nuntarnished\nmisgovernment\nanad\nrimmel\nseli\nbritwell\nwetteland\npeetz\nputain\nanjema\nroccella\ncwrt\nemboldening\nlabrang\nconcealable\ncleverdon\neynde\nshowtunes\nmindfreak\nchintan\nsinor\ncaseloads\nunionizing\ndescision\nbater\nwanja\ntstf\nmorningwood\nnicorette\ncitronelle\nseaming\nkatehi\nkkh\nlassitude
\ntoynton\ngamertag\nnatra\nbrijuni\ntolani\nvellu\nkogut\ncheddington\naugarten\nteza\nguitarrón\nstumpff\nfleecing\nanikulapo\ngrottes\ntoia\njaimee\nstrensall\ncommuning\nests\nbouchette\ngansel\npagliarulo\nfilppula\nateret\nmcts\nsturry\nbologoye\nweoley\nphazon\nsurprized\nlieberthal\ndcvo\ncuzzoni\ndehua\nsteir\nribbeck\nizzedine\nelq\nlarche\nanaemic\ngarfias\nbrackens\nkickstand\ncooperativo\nclementines\nwtb\nbandol\nmorleys\nxavante\ngkp\nrobens\ndreisbach\nofws\namelle\nlyndsy\nzarooni\nsphincters\nouzounian\nbuffoonish\njayasena\ndiscrepant\nazizan\njohnette\nmcz\njanneke\nmastella\nlipizzan\ndarrall\nkohlhase\nfefa\nnagakura\nkasimov\ndelatte\ndarenth\nreichart\nwormy\ntavano\njordanstone\nlasater\nmethamphetamines\nvedeno\nbenzocaine\nprotectively\nmaerten\nnyiragongo\nzaccheroni\nanacaona\nkernville\nbibury\ngoldenson\nreggia\noeschger\nwinthorpe\nthela\nfibreboard\nbarnsbury\nwrd\ndebolt\nkipkoech\ngoproud\nstraley\nduralde\ngoldbloom\naivar\ngoosnargh\npiedimonte\nlogvinenko\nhaptics\nyasgur\njafarov\nwgu\nsuperball\nturfs\ncarthay\nntaganda\npaing\nballistically\nannihilators\nblacke\nharrys\noracabessa\nforro\nshurley\nzuba\nthurleigh\nzydrunas\nmagnums\nartworld\nroei\ngiannino\nreligare\nfundin\ninnovates\naizlewood\nsemra\nkenwyn\nouimette\nsangbad\nblonay\nmoisturizers\nwonnacott\ngruhn\neddleman\nkesteren\npolysomnography\ngroundskeepers\ntaffs\nilluminata\nquickstrike\nmilcah\nautoantibody\nsaumya\ncruciferous\nrecker\nleandersson\nneustadter\nsestina\nharjit\nprestia\njimoh\nmudde\nregnerus\nbockris\nutkarsh\nparamananda\nepatha\ncommunality\nvarient\nnanson\ntarakanov\nreids\nsabis\nciit\ncacus\nbecs\nfumigant\ncrimond\ninditex\nmaheswaran\nemera\nproelite\nhandclap\naumonier\nphytosterols\nniederreiter\ntnbc\nlipsticks\nphonies\nullin\ngentrifying\nbettmann\nosterberg\nevette\ngrayback\nsarif\nalehouses\ndovetailing\ncluxton\nnuzi\ngooney\nsaffah\nslorc\nhuanghua\ntuch\nsultanas\nvimla\narmerina\ntomac\nabdulle\nseybert\nhust\nheacham\nh
aematologist\nlucidi\nmamaev\nfagor\njsx\nnonviable\nbackstrap\nkpax\nzisman\nallodynia\nkibosh\nolyroos\nsmolan\nlappé\ncapitalisme\nporphyra\nquarryman\nsenia\ncoiba\nparkington\nmorimura\noxidization\nreedbed\nkatcher\nwafi\nreplogle\npencilling\nkriegler\naliquam\nmetus\nfairuza\naioc\ncerp\narbonne\nmakenzie\nindiepop\nsupergun\nhomestyle\nhuevo\nstonebraker\nlafell\nzeidman\nrecuperative\nchrysothemis\nunlockables\nabdollahi\ndewatered\neyler\nwarnick\nlaniado\nhaeri\npalazzina\nbraemer\nhosley\ndohan\nmasuka\nbellerby\nzanja\nminnpost\ntowyn\ndeflator\nnikitina\nposin\nburrup\nstitchers\ndromey\nrá\ncacheu\nsiyad\nlockney\nblit\njiggling\nhirokawa\ngiya\nmassinissa\nsheilla\nchollima\nmularkey\nzuoren\neridania\nxiaotian\nnccr\naddonizio\nsaxman\nkoumas\npinarello\nbaldomir\nclemmer\nktbs\ninculcation\nmarlan\njehanne\nwiswall\nsemiquavers\nventanas\nspeas\nappignanesi\nkomoro\nnvqs\nkowalsky\ntetrahymena\nbuczynski\nzindani\nkatanec\naldag\nbeccy\ngradishar\nyerxa\nhogfish\ndatebook\ngrandmaison\nkayam\nforeshocks\ncarreno\nbranstetter\nkerameikos\nnotturna\ncocozza\njosimar\nowlpen\naguinaga\nchulym\nbersham\nstorni\nmuscularity\ntannhauser\ngangemi\nsentinelese\nsummariser\nparavel\nvassos\nwdtn\nelop\nmishu\nmeylan\nbutin\nbunke\npocking\nhdacs\nmahamane\nxinqiao\nklüver\nmobitel\nnanocomposite\nafognak\nstonecipher\nmossend\nrumtek\ngilderoy\ndoppelgängers\nnissl\nsuvorova\nzachodni\nfowlerville\nssps\nsevres\nwhippingham\naveroff\norganica\ngooders\nniteroi\nbrasa\nptes\nglasman\nsaragosa\nbreadline\nsvf\nakaash\ngrimaldo\ndilligence\ngbt\nwussy\nnamby\nugochukwu\nflagellants\ndeneau\npavelic\ngilhooly\nuhlich\nlebert\nmardini\nmxp\nahmer\nwlky\nstatman\nkkl\ntemel\nhelmes\ndeputise\nhilgenberg\nquadrifoglio\npaydirt\nturteltaub\npochettino\nanointment\nhcfc\ncheeger\nwretzky\ncunniffe\ngrandmama\nsharn\nbaraza\nmccreath\nsalkin\nruti\neiriksson\nmeriva\nhoram\nsydsvenskan\nverdoorn\ncontretemps\nhomogenizing\nsubleased\nglorieuses\nsobin\nengelberger\n
overlarge\nkhandekar\ntumorigenic\nlalami\nmicrometeorites\nwely\nclubby\nmocidade\nfeathertop\nallehanda\nrademaker\nthekkady\ncarpegna\nipab\nsalum\nvendace\nquitters\nmallat\nbindura\nnectars\nwabtec\ncmbs\nfinancings\nhué\nantiship\nlighton\nfullfill\nprocurve\nbourcier\nbeji\nrenseignement\nlarges\ntacuma\nwembury\npiltch\nsaher\nsuperhydrophobic\nineson\nfondled\ncondescend\nleithen\natcha\ninitialled\nshillito\noperationalize\ngrolle\nlarrazabal\nunmovic\napologetically\nineptness\ninsana\nredial\nzver\nwindfalls\nsqueakquel\nnowrasteh\nvladi\naardsma\npneumophila\nceriani\nhomeware\ndazzlingly\ncountrified\nfledgeling\nadesso\nneistat\nandia\nforden\nsaughall\ngardar\nnacewa\nreorienting\nastrologist\ndowa\nhobbins\ncrownpoint\nnagareyama\nroundell\nfeck\nmoelis\nqataris\nsural\nconsolini\ndepaiva\ncouts\nvws\nismaila\nependymoma\nsöderman\nteobaldo\nreems\ncafod\nsanten\nhadag\npeltzer\nsaddlebag\nfullard\ndrut\nwdi\nionut\ngoodway\nlaunchcast\nnucleoli\ncoscia\nconnerly\npisoni\nhilleman\npascucci\npowless\ntomine\ntatge\nmaxam\nvlogging\ncprf\npaur\neverland\nbeitzel\nburiganga\ncyres\ncolumbaria\nnexter\nsakakawea\nkoguryo\nmuche\nawassa\ndroeshout\nmcclenaghan\ntedy\nccap\ncayeux\nlindskog\noswell\nsilicified\nusdp\nscalfaro\nnng\nkechiche\ngbx\nmawle\nmutuo\nshicheng\nlastminute\nadjudications\nselda\nprina\nteutenberg\nehd\nroshini\nkapsabet\nnabavi\nhapton\nrafiqul\nclassicising\nfloorplans\nroquemore\njpp\ncelcom\nyohji\nkabanová\nsecuricor\nmarstons\nkese\nnewsnet\nnppl\ncervelli\nhudnall\nblackhat\nalmus\nfidgeting\neconomicus\ntangney\nperchlorates\nleibbrandt\nvvc\ninvergarry\nstellarium\nnephrotoxicity\nvienen\ntregenza\nsinaga\nstadelmann\njabra\nlindop\ncarletti\ngrandiloquent\nheterophyllus\nsbragia\nhakem\nhdh\nschiffrin\novertop\nwhitinsville\nashis\nmiangul\nharvington\nreverends\ndonaghue\nfenfluramine\nolonga\nulk\nturgeman\nshati\nappealingly\ncrolla\nbeauxis\nfsas\npetitclerc\nhollywoodland\naxiomatically\nagps\nseaquarium\nwilmerhale
\ntreva\nbickler\nmelaine\nmartyna\ncgb\ndigitalglobe\ntrovato\nholytown\nayisha\nexhilarated\nboddingtons\nkandji\nroomette\nnitv\ncallups\napprise\nhyperammonemia\nzokora\naxumite\nhuapango\nkahama\nbiji\ntruckstop\nafshan\nlmdc\njenette\nlitterbug\nthayne\nphotomontages\ndauncey\nhample\nsomebodies\nlovegood\nbahariya\nkcm\nrallings\nbolgatanga\ndfk\ncarolwood\nfluorescens\nrovno\nyuming\nwhizzing\nalyas\nunlearning\nbomberos\ngoolies\ntrochowski\nfertilising\nhumaneness\nmahrt\nsipped\nnpfl\nscowen\nhospitably\nadriel\ncrittall\nchikane\nlayo\nbuckyballs\ngrafing\nshuping\ncamarata\nyakimenko\nbieksa\nsacko\ndetraction\ntielman\noche\nzdzislaw\nbodiless\nclivia\nsuhayl\nhallaton\ncatalon\ndominga\nopes\nkaimuki\nimagina\nintegris\narkia\nkasmin\nlexcen\nmunsif\nwhitbeck\nmcnasty\naccessibly\nrokkasho\nchegwidden\ntoguri\niub\nbatth\nleston\nmünchausen\nlinganore\nkernell\nnodi\nihle\nseverall\npravia\nadolphson\nkez\nharrar\ngayest\naucklanders\nfootlocker\nlippens\nisrafil\ncanlas\nrainless\nmaytown\nsnodgress\nandersonstown\nhanzhi\nkingda\nlaitman\ndivisadero\nthoss\ndooher\nalfredi\nwuhl\nprorogue\njenever\nanterselva\norus\nsoheil\nsvelte\nnnu\ntheat\ndaragh\nlarussa\npaskin\nkakure\nmasayo\nbibimbap\nradkersburg\ndurno\namodeo\nstepmom\njiayuguan\nparvaneh\nrichaud\nbowtell\nkleptocracy\nferren\numea\navoda\nweaseling\nliwei\nkalispel\nunchurched\ntobor\niteso\njapandroids\ndelaughter\nnationalizations\najuga\npatong\nkrosnick\nkhutbah\nturano\ndhaheri\npalombo\ndongjing\nruhm\ncrosskeys\nbaroe\ndishware\nsheetrock\nhamhuis\nreichl\nzaatari\nclumpy\nrigidus\nstobie\nmezrich\ntotalizing\npargas\nmccoubrey\nmisjudge\neastnor\nbullrun\nkoziol\nborschberg\nzhenyu\nortenberg\nlarrocha\ngleniffer\nglenullin\nsapara\npueblito\nvoluble\ndanjuma\nnordbank\nrezone\nngawi\nalac\nhaircare\nberden\nertan\nranum\nhristova\nbananaman\ncumene\noldring\ngyor\nbrynmor\ncockerels\nhidayah\nalstead\nimmobilizer\nortman\nknxt\nsnood\npassalacqua\nedify\nabdelhafid\ncymer\njutr
as\ncuccia\nsawaya\nkuebler\narchenemies\ngalvano\nlisbie\nchinley\nsellaband\nbarff\nbohler\ngrubman\nhillcroft\nixi\ngreenbrae\nyuyan\nestaban\ngarble\nlmx\nmagin\noutré\ndarch\necolabel\notavio\nauw\nkerrii\npaderno\nkaramojong\nwiliams\nseppelt\nherceptin\nsombreros\nrednal\nseago\naudibility\nwifes\npremachandran\nellyse\nlunev\ntalloires\nqasimov\nthorazine\nboumedienne\nlanguishes\nwibberley\nfonoti\ntophane\nkathoey\nwonda\nescambray\nimpeachments\ntenancingo\nheliborne\nvasya\nobenshain\nalexandroupolis\nwijeratne\nanycast\nbiglow\nwhernside\natika\nlaunderer\naisc\ncfdt\nmirinda\ngunja\nbowlen\nbleszinski\nmoulitsas\nversicherung\nseikaly\nucca\nmischel\ntomasetti\nbellina\nbodices\nbisan\nhumanum\nlübben\nscramblers\nstieff\nmimetics\ngolob\njinns\nihram\nallibone\nyalda\ntristate\nsloughed\njouissance\ndeale\nachal\noxgangs\ncarrickmore\nspaceguard\nsunbathe\nrusland\nyuo\ncowpeas\nradici\nmonokini\nmosedale\nferree\npuft\nstashes\nkiep\nundertand\nmcinerny\nsertoma\nchurchgoer\nconsecrates\nwallboard\nschoene\nhaleyville\ngassers\nmovieline\npardal\ntorkham\nbognar\nfrant\nolwyn\nmauritanians\ngustavs\nhemmingway\nhydes\ncambus\ninstamatic\nhanim\nelfstedentocht\ndocumentarians\ntaslim\nnimni\nkuser\nindeterminable\ncawkwell\nmumcu\nlakemont\nbombi\nhuntford\ndorning\nmachame\ncaramels\npeleides\nkyjov\nceridwen\nbreville\npiquionne\ntraversa\nshuguang\ndixiecrats\ngiussano\ndisgracing\nneelie\nshadravan\ncopenhaver\nreadyboost\nacess\ncerd\nunscented\nhanso\nstebbings\nthemarker\ntomonobu\nhypophosphatasia\ndunlevy\naudrius\ncllrs\nconformers\nsbac\ndificult\nmuamer\nbeogradska\nshiffman\nmaull\ndudesons\nblowflies\nmajnoon\navocations\ndampens\nwitn\nhoulder\nmujaheddin\nstigers\nrhees\nsereny\nsigwart\nmisapprehensions\nporush\nhamanaka\nmaurissa\nmoratinos\nauklets\nblithfield\nhitan\nmoskovitz\nwobbled\nquitted\nmotivos\ntrinitron\narquitectonica\nraincoast\nzeeb\nfengcheng\ncommonplaces\nknurled\ntibshelf\nimide\ngerdts\nscalawag\nfilicide\njuban\
nderwentside\ntangental\ncatarrh\nverdoux\nbuav\nsmadar\ngreensmith\nephebe\ntoreadors\nrepa\ncarolers\njozy\nleonids\nyeang\nboquillas\nrisher\nnunchucks\nmisreadings\npalletized\nmunnabhai\npeloquin\ncholesteryl\nmarilia\nflatlined\nlamsdorf\nlenalidomide\nckoi\nspose\nwestendorf\nespo\ngraca\npasveer\nsalignac\nshowboating\nbandying\nlangsdorf\nherjavec\ncoene\nunlighted\nlagartos\nbaracus\nweinzweig\nsazan\nemitt\nillogically\nharbury\nkirati\nfttp\ndramatise\nmorillon\naperçu\nyitzhar\nchippings\nyaogan\nobligating\nbillesley\ndoodlebops\nagrio\ncoccineum\nscobbie\naccelleration\njobsite\nprikhodko\nbriggate\nallmand\nunparliamentary\nhaemon\npieczenik\ndensho\npimental\nifw\nlixian\nnewpark\ndelgados\nkomba\nmascoma\nsawahlunto\nnatterjack\ncatcalls\nritola\nparnassian\ndelavigne\ngravesen\nlaryngology\nnagori\nwaran\nadye\nseifer\nquantick\nthornleigh\nisobelle\nnarasaki\npmbok\nhyesan\nscorning\nrogala\nalberghetti\ncompote\nwimer\npanella\ndepoe\nczyz\ncurveballs\nsilvstedt\nsalifou\nblyde\nphv\netiwanda\nfalfa\npagent\nhypotensive\naphrodisiacs\nahas\ncharnas\njinbei\ncapasso\nnaptha\nmaale\ntolonen\nmisplacing\nchoppin\ncocq\ntweddell\nscrooged\nlankarani\ndorwin\ncruzan\nfujianese\nalisdair\ngoranov\nrazzmatazz\ncsra\ngroundfish\ncrystallises\nhypersexual\neuropro\navto\nscowling\nssts\nfretz\nnzt\nkaarle\ninsanitary\ninternationalize\nconvoke\nteferi\nformichetti\nperinton\ncayless\noberle\ngrael\njellystone\ndunsworth\nemolument\niraklion\nafriqiyah\nmenthon\nmykelti\nschlachter\nairtricity\nnanograms\ndewolff\nrnt\nendplate\nmiddendorp\nberhan\ngerardine\nlaâyoune\ncaserio\ndewpoint\nfarebox\nvisma\nparamjit\norlac\nraitz\naqs\njomsom\nhoilett\ntangalle\nnoades\nproudlock\nwaconia\nsauceda\nasthmatics\nlongabaugh\ncasselton\ncustomising\nscrumpy\nunruffled\nraffray\nbatho\nottobrunn\nwna\nlamberhurst\nfoolery\noverplay\nmoneymaking\nvitalic\ngrimstone\nowuld\nnavone\nsustainlane\nburnstein\nunloving\npgms\nhetz\nrocinante\neegs\nthousandfold\nmoraleja
\nidara\nironfist\naccl\nbasen\nishimura\npimpinella\ncronberg\niford\nknowth\nexurb\nbodysuits\nperforatum\ntemo\nshizue\nstreeterville\nadaminaby\npaonia\nmimivirus\nthorat\njaidee\ndoigts\nhepcat\nchukka\naeroponic\nringley\nacaster\nsatnav\ndaktari\nsublimates\nirus\npisgat\nverratti\nrinkeby\ncantore\niwelumo\ngardot\nharmeet\ngutch\njoypad\nwaterparks\nwaggner\nhäfner\ntigresses\ncapizzi\nbuildups\nsphygmomanometer\ncolleran\nkhooni\ndiogu\nutting\nepicene\nmandane\nsucessfully\naccelerando\ncarasso\nincarcerations\ncinedigm\nwud\nrxlist\nmutilates\nsandline\nnijman\nborchetta\nwieters\nnitwits\ngillott\ncedarwood\nadone\nchryst\nblaffer\nwyne\nsummering\niosefa\nadir\nfrontex\nremerged\nbrinegar\nsuperspeedways\nharmfulness\nghazarian\nmauren\nnicety\npurebreds\nmelungeons\naberporth\noldsmobiles\npayá\njaisingh\nciénega\nafh\ndeepings\nsandag\nmercosul\nkamancheh\nksat\nlikey\nllanrhaeadr\nbucksbaum\nemaciation\npentyrch\nichihashi\nrehouse\ndiscomfiture\nglamourous\nhga\nsofts\nhawkin\nflashpoints\nhebrang\nmaltodextrin\nbohnett\nsigala\nbookfair\ncrimeline\nhajri\nbgg\nproliant\nchoueifat\nwasan\nsonhos\ncrisman\nvocalised\ncataloguer\nihar\nintex\nabac\nmenchik\nrecalibration\nluminato\numberger\ngatra\nlemonia\nangouleme\nastuteness\nbeik\namurri\ngibbous\nvarvatos\nadebola\nconvulsing\nwinker\nputland\nhatin\nstutterers\nengro\nbaim\nsambrook\nkirbyi\nchisox\nwolfsschanze\nmarcinelle\nfrommers\nmarkovitz\nschiavon\nbartmann\nbobbit\nkoivunen\nluttazzi\npassero\nstrobilanthes\nalipate\nquander\nselvaggio\ndefragmenter\ntreharris\ntrounce\nmonthlies\ntnfa\nnarragansetts\ndiester\ncwru\nhillers\nmaurici\noab\nsmoller\nommission\nfeudalistic\nbernath\nqimonda\nblunsdon\nscholfield\nsilda\ncordus\ncohanim\nbodle\nexhales\nheupel\nfickleness\nnonlinearities\ndulcimers\nkalau\noveractivity\ndeathrow\ntyntesfield\nbealls\nresurged\nhickmott\ncrau\nslivovitz\nbunner\nbsat\nbrontës\nfelpham\ncoracles\nbspa\nkarslake\nebit\ndukeries\nbarrowlands\narats\nleibrandt\
nshahenshah\nwittwer\nmakavejev\nheyuan\nleimen\nsilberbauer\nholifield\nmongan\nschaden\netwall\ndebbouze\nbuffini\npecknold\nsensitisation\nperv\ndegolyer\nkinya\ncanel\nkomisarjevsky\nlemonis\ngjelsvik\nkatariina\nshawki\noltp\nzeneca\nsiau\ntecton\nxiaohui\nfrasch\nulcerated\nenock\nhandman\ncasartelli\norchestrators\nlucks\nnazira\nearthers\noakford\nsubtlest\nloitzl\nardell\ndiar\nspiritwood\napear\nususally\nstanchev\nbechtold\nbuddenbrooks\nmahnaz\nvpx\ngelida\nprojectionists\nspawar\ngele\nstomachache\naaberg\nfroide\nultralite\nmicol\nnalli\nlobbe\nunsalaried\nenrollee\nhfo\nadolpho\nclaires\ngangplank\ndonofrio\nsouthpoint\ndesmin\nbready\nchalayan\nbooky\nchillar\nthorugh\nltx\nfibrocartilage\nhibernaculum\nbuya\nmarula\npalia\nberried\ndieudonne\nstreetly\nhatab\nfischman\nmacara\nthurm\nprometric\ndorward\nholoprosencephaly\nshanan\nsosuke\nthede\ndeafened\nkuchera\npiercey\nosia\nnetzero\nstoppani\nrankers\nfiancés\nindigenization\nornish\nphimister\nmagothy\nbedar\nmoueix\nmetn\naitkenhead\nvinum\nikb\npnnl\ncretinism\nheighton\nermakov\nderoche\nctos\ntriballi\ntootles\nconsommé\ntocsin\nchessa\ncassinelli\ngabol\nburkey\nchandio\nbompas\nsulayem\nrousers\naesthetes\ntigercat\nkoretz\neurypterid\nvituperation\nskeins\nfridrich\nfreakum\ntendresse\npocketknife\nnaydenov\npamphleteering\nthyne\nuldis\nfidelman\nkezi\nberch\nguishan\npoligny\nafghanis\nsansho\nwebmethods\nmoorad\nstupefying\nlukla\nprecocity\nruberg\nprebisch\nperkasie\nquetelet\nhoogovens\nindovina\ntransmuting\nmercados\nbeco\nzeitler\nivalo\nsukie\nkeatings\nspurway\ngeeti\ncalfskin\nademás\nyalo\nenterococci\ntihama\nackers\nwpgc\nnetherlanders\narteche\nshinier\nflopsy\noutcaste\nabscisic\nebeid\nparijat\ndedza\nbetrayus\nforsmark\nmunmu\nthibaudet\ncollectivistic\ntalebi\ntatneft\ngroomers\nkoat\nmcgrattan\nfactus\nslurp\nbiggies\nanalogized\npichette\nangor\ncommoditization\nameena\nhurstbourne\nivm\nauerswald\nettal\nnutraceutical\nfeer\nfcca\nmaultsby\nsubjectivities\nspanks\n
exora\ncouvert\nvoorhoeve\nirrationalism\ngregorie\naddictiveness\namerykah\ntranscaspian\ncurral\nrozanne\nregus\nayerst\nbolick\ncheoil\nfoschini\ndtrace\nbhut\ngibberd\nleckwith\nljungqvist\nokoth\ndvf\npottu\ngingell\nabdolkarim\nfossile\nkhadar\nconchata\nkenard\npretto\nkharma\nacebo\nmejias\nassocham\neshbach\ntreuer\nkirkbymoorside\nbuyoya\nghazab\nopio\nmassasauga\nfialho\nmoremi\nsmolt\nchampéry\nhazed\nocx\nfitfully\nslothful\nstephensi\noozed\ntuffin\nannear\ntreeton\nantiquaires\ngfci\ntomelty\nokcupid\ntenontosaurus\nkeishi\nnavantia\nkhizar\nbatstone\nheartsease\nbrunhild\nflightdeck\nwakata\nlaurice\nabily\nataur\nmiedema\nempathizing\nvetro\nchanterelles\nmammas\npercolated\ncoppard\nfems\nfranjic\nlucheng\nsiwanoy\nhiawassee\nwairakei\ndwar\nkooistra\nsikand\ncolugos\narchibugi\nbiopower\nsgrena\njalapeno\nbaffler\ndurdle\nsemidouble\npatau\nljungman\nperos\nfundoshi\nmaniitsoq\nbutterbur\nbuske\nvocalism\nvpo\nkerbing\nfolkingham\nlindenstrauss\ngudmundson\nmehitabel\nldo\ndownscaling\nhanis\nartistshare\nsenneterre\nrizzotti\nspsl\nbatf\ngrap\nanimadversions\nmelismas\ninterposing\nlouisy\nlaneways\nchalkhill\npsma\ntossers\nborri\nbuljan\noncken\nsoic\ndbg\nrehbein\nailred\nprehn\nyoba\nppas\nnyas\namalgamates\nmaione\nsilvering\nhemsedal\nadkin\nrhines\nmaerdy\nhuser\nkingstanding\nskerne\nhody\ninterrobang\nlindu\nsolntsevskaya\ngoisern\nguadagnino\nfatiguing\nosheaga\nbleat\nmotorama\npermenant\nwestie\npolyvinylidene\ncombet\ngrunsven\nbarbastelle\nthoughtworks\nnbpa\nomro\ncrumpton\nalltime\ninterpellation\nberlino\nlccs\ncharmless\nfdt\nprecalculus\ngolm\ntinca\neaglebank\nloosemore\nbhimbetka\ndrippings\nnosair\nsolley\nnemat\nnithyananda\nnashat\nfasel\nstonybrook\npoizner\ntredway\nmaybellene\nemk\ncompatability\ngamesa\nthnk\nutl\nhongwei\njuicers\nchiaia\ntenleytown\nkonkola\nkhayyat\nnadas\nserrato\nbrez\nkreft\ntadatoshi\nslye\nhurdes\nszarkowski\nshangaan\nnewth\nkias\ncontriving\nbenadryl\nparman\nwfuv\nunderinsured\ncomstar\nbidf
ord\nmdz\npmx\nbiochip\ncltv\ntofan\neuromed\nwikianswers\nmrtv\ncornfeld\ngoen\nbeharry\nmcnuggets\nhulen\nogiwara\nlipschutz\ntoub\nsoleá\nvodochody\ntonnelle\nwoul\nsoftie\nscalextric\ngrindlay\nacidophilus\npaonta\nsebrango\nopionion\nmultiform\nkythnos\nxiaoyan\nromie\njruby\nkirundi\ndinnertime\ndefour\nshokai\nafw\nkleck\nchondrosarcoma\nforcefulness\nwyness\nwakame\ndeceits\ntahj\nmaritimo\npizer\ncroci\nhegira\nrutley\nrokocoko\nmuliaina\nlurches\ncrushingly\nphytologist\ndisinhibited\ngmpte\nconvenors\nmonarca\nlynches\nserfaty\npellucid\nlyness\ndistillations\nherger\ndilates\nofrenda\nabdelmajid\nansaldi\njonell\ndetriments\nweatherson\nrisinger\nfizzing\nmatzke\nrora\nswillington\ngiannoulias\nsarcomeres\nxingang\npokharel\nsorich\nfraboni\nmatano\ntrehan\nsenaki\nsayago\nswagga\nmarginalise\npridmore\ngalanis\nkmtv\nmarro\ndins\nbarnicle\nboreray\npagnotta\nbeckner\ndudh\nerwartung\nicad\ngaillot\ncuthberts\nedghill\ntydeman\nprincessa\nwithnell\nbradish\nchanner\nthrifts\ncardiel\nakbank\nheterotopic\ngevrey\nvacuously\ncoslet\nscrewup\nstrathcarron\ncrociata\nfylingdales\nwhic\npelagio\nleade\nmonheit\nkhazanah\ntabacco\ncarrabassett\ngitomer\ndgf\ndasani\nrodo\nnasso\nkirchners\ndulé\nlangshaw\nmelius\nzaccardo\nwaddles\npellinore\nunidimensional\nhopei\nsocko\nbuckwalter\nbagert\nbritannias\ncityville\nmisk\ngamst\ndauphinoise\nepley\nsauze\nhumph\nkenza\nderickson\nempanada\nlevier\nrolton\ndeforge\nisely\naridjis\ndowneast\nwonderfull\nashwaubenon\nstaver\nsaintliness\nkeybank\nauctor\nconsiderd\nmazibuko\njezek\nnurturance\nmarijn\nwischnewski\nnetiv\nkeepass\ngamm\nconflit\njunkanoo\npinzon\nmillhouses\nkoumei\nandile\ndanin\nhaitao\ncrinkled\nreapplication\nservier\nocana\nwnyt\nwestwell\nclendenon\ncanari\nvcam\nkasota\nhenck\nbrawne\noxman\ngrisebach\nggb\nebeye\nfraid\ntrackable\naggi\nbandoleros\nretinyl\nboyajian\nreboarded\narchicad\npedagogically\nklaveren\nlavasa\nwevers\nlattisaw\nsaccule\nmaringa\nwalliser\nbriercliffe\neluard\ngangb
ang\nkafer\naams\nkromowidjojo\nwum\nstaghounds\nsoulard\nbattaglini\nesmaeil\nkeagy\npeldon\nbarreras\nohanyan\nkelda\nmetasploit\nmedimmune\ntimss\nbpv\nprödl\nchoque\nlinebarger\nvespasianus\ndunthorne\nfrosties\nherranz\ntilal\nboola\nvillepinte\npotatoe\nnigersaurus\nfacenda\nmoulted\nholleran\nculmore\nchany\nbobrova\nnomade\nwoodhams\nsidbury\nscho\ngotz\nhardingham\nwerblin\nholderman\nharkening\noutspread\nkosen\ndragonoid\nbiffo\nbaggie\ntroussier\nvelva\npramanik\nmmtpa\nempa\nwijngaarde\npremer\nhenneberger\nmiang\ncelcius\nwazirs\naphc\nloton\nfrayling\nconsignee\ncontadora\njapans\nreknown\nsorey\nlancastria\nbowmans\ngainor\nfonua\nchlorobenzene\ntranslocating\nsbcs\nprendes\nisere\nmidlake\ndubble\nmamula\nofb\ngayley\nchoekyi\nrecruitments\nsugerman\nkermorgant\ngpws\nsertich\nkalifornia\nmagnetised\nzenia\neggy\nxiangxiang\nnovembers\nshowaddywaddy\nrommedahl\nlithologies\nbocskai\nhongxing\nseeders\ncrowdsource\nlaingsburg\ntransversality\nmalthe\nsundre\nmahmoudi\ntetrachloroethylene\nacclimatized\naurelien\nnationsbank\nsolemnize\nkhalife\nmesan\nglenshee\npreud\nholsteins\nsandboarding\nkencana\nredouté\nparolin\nkatalyst\ndavoren\nhenries\nchampêtre\njellybeans\nolis\norganisationally\nbaslow\ntrophyless\nhoneck\nborneman\naynesworth\nlinter\nwener\nlandecker\nmagnay\nackoff\nsportage\nmaiti\ncockenzie\nvisualisations\nmccourty\nsilsoe\nninewells\npeders\nrabelo\npostoffice\ncesr\nolina\nhentemann\narato\nhersant\nmurzyn\ntiton\nmarrara\nweaste\npatia\nmurderball\nroddey\nmoresco\navdeyev\nrapira\nrainin\ngispert\nzheleznogorsk\nangula\nstraightener\nbarbering\nwarao\nsupermac\nwinteringham\nschmale\nminik\ncyndy\nschoop\nwiimote\nleedom\ntootle\nwindless\nthornliebank\ndeviled\nwakai\nosw\nparroted\nyakisoba\ngizenga\nnjc\ncucine\nwieren\nkeezer\nniggardly\nbishopstoke\ndoxiadis\ntegla\nlouderback\nlazzaretto\noverpainted\nchaigneau\narah\nwellbeloved\nyammer\nbankcard\nfellay\nleweni\ngercke\nleitmotiv\nmitres\nbettin\nchida\nloof\ndoos\ncur
atola\nlomma\nhornak\ndalva\nbarrass\nwasfi\nyounce\nbrownshirts\nautolink\nkornelia\nbrahan\nswartland\nmalchin\nmadone\ninterst\nayten\nuras\nhaule\nmahyar\nsedwick\nelumelu\nlorman\nadderly\ncapeman\nadrenoleukodystrophy\ngururaj\nmeachum\ndamilola\nkhairat\nkrupnik\nalasania\nforsakes\nfranciacorta\nhanff\nfacials\nmettre\ngabu\nwcax\nplucker\ncolpa\nmarkert\ndulces\nlusinchi\nifra\ndreamings\ncafritz\nrever\nexept\nqaisar\nwilkshire\nhaemochromatosis\npaicv\nludham\nphantasies\nscalloping\nsassaman\ngayoso\ndemonically\nrikk\ncampiglio\nbscs\nliepa\nosgodby\nmarienplatz\nbromford\nplattekill\nimahara\nkery\ndarndest\nsternin\nkohara\nneuadd\ntraffics\npropounds\ngirodias\nbronchodilator\nozyegin\ntwinge\neluned\ngendreau\nochamchire\nburbury\nbalwinder\nindiscriminant\ntognoni\nleffe\nellistown\nraymont\nkrach\ncorazza\nsizzles\npecota\ncrudo\njwm\ncolorize\nmrak\nthubron\nraymonds\nvansh\ninstapundit\ntanat\nledwinka\naffymetrix\ntabarly\npfahler\npureed\nvembu\nziyuan\nsextette\nochamchira\nphotosynth\nexpansionists\nsouthmost\nballfield\nfeburary\nneumeyer\nanichebe\npvcs\nyakitori\nlomonaco\ntegner\ndunbeath\nzehava\nconnetquot\nqpcr\nhoogendijk\nsdw\ndeacetylases\nedhem\nponytails\nremenham\niodp\nnehi\nloussier\ntamilian\nhoniball\ndaydreamin\nlucyna\nlamone\ncnsc\nmccrane\ncdcr\nislamophobe\nemulsification\ncaressed\nsfas\nsolignac\ndystrophic\nwdtv\nclubhead\nelstob\neffin\nedra\nlehzen\narmands\nbootup\npeggle\nnihl\ngbn\niniciativa\ndisrobed\nmauzy\nachel\nstrumpet\njokic\nluchtvaart\nbocharov\nmyalgic\nprudery\nmunchkinland\nhowatt\nfauchard\nimls\nfhd\ndefunding\ndoorstop\nxango\nunsynchronized\natea\nannunciata\namtran\nbhv\nfulu\nwifebeater\nkabylia\nmahones\nbercovitch\nbartonville\nforepaws\nretyping\nurbanizing\ndalloz\nchryston\ncanwell\nphibbs\nntca\nmarfleet\nteagan\nunfancied\nsotherton\ncocorosie\nmarli\nunigo\nbiodegradability\nodr\nprecisionist\nverra\nramidus\ndelisa\nprecariousness\ntolstoyan\nmasar\nbeechman\nrainmaking\ninboxes\nbele
t\nbengawan\nfcas\npehlivan\nfriesinger\nchaa\nmulqueen\nunfitted\nlosail\nmireles\nweifeng\nspigots\ngratifications\nabayomi\nrozzi\ncasone\nktt\nsouthway\ncapulin\nbressay\nsoci\nzobelle\nhisingen\nbedraggled\nmceachin\nfervid\ntooles\npabuji\nsynergetic\ndemonised\nstepmothers\nunfortunatley\ndelaplane\nbearzot\nkliper\npassthrough\nbikita\nportageville\ntrotternish\nehmann\ncapoue\ndurmitor\nkakati\ntataouine\nnapf\nsorrowing\ngreenisland\ncharanjit\nbollworm\nsunter\nbarbudan\nnationalising\nphotoblog\nguyatt\nbayville\nroundstone\nbashkirov\nzahav\nstylin\nscrumhalf\ngalatas\nvendola\nhooson\nuncontracted\nalsc\nbulsara\nabruptness\ngranik\neligable\nschrad\nlateraled\nteetotalism\nkalac\npoilu\nhins\nmonologist\nquaglia\nborzov\nroquelaure\nbeseler\nhobbles\nolympiahalle\nmashayekhi\nlucchino\nlasgo\nbroederbond\nspectrographs\njowers\nbéguin\nsoldner\nrenck\nmawdesley\nberkhout\nvhi\nshekau\nnovikoff\ngraffham\nskywalkers\naellen\nchorded\nhibbett\nklutzy\nbizzarri\nlicinio\noutshines\nhaub\nstonefield\nyuste\nkarey\nripia\nbaffa\nbooyah\nakinyemi\nvolksfront\nkühl\nromanet\nbassford\nsaadallah\nblackwill\ngiriraj\nsidetracks\ngrippo\nhandhold\nsongun\navoth\nslackening\nvilallonga\nlunching\narseneau\nclaudet\nnorthlight\nlinlin\nweprin\nflyways\ntunisair\nmccarney\nmisapply\nknollenberg\npsychiko\ntauran\nmorphosis\ncalabozo\nbohart\nshadowboxing\nuntraced\nmarcegaglia\nafua\ngoecke\nsiouxland\nhesperornis\nshoard\nsigmon\nolia\nrepêchage\nruche\nbaset\nbutterfinger\ncaponi\nghaghra\ngobbled\nwaad\nseptuple\nautogenous\nalamin\nsundancer\nwenford\nlowri\nwermuth\nwauchula\ncapitulates\nsubclan\nedelmiro\nbmrb\nsoss\nshreeve\nlemberger\nkintampo\nhamate\nhallettsville\nbigfork\nhostal\nearthshine\nsuccint\njadoon\ndoca\narandas\ngundog\nlevanon\ntrullo\nwanis\nlaber\nzealousness\nstoor\nshitamachi\nwaissel\nchodron\nkheli\nlanthier\nleming\naghios\naegerter\nbalestre\ntermagant\nfukatsu\naasu\ngawdy\nhipkins\nqatargas\nshoucheng\nfurlanetto\nlumos\npugilist
ic\nkanmon\nmensalão\ngpon\nzohreh\npolydipsia\nprimare\nconfectionary\nrgu\nsmallbone\nchibana\nkalik\ntheros\njeay\ncarryout\nhrb\nmarittimo\ndecanting\nmultimatic\nwissing\nhawija\nmasinga\nshleifer\nschindlers\nuwch\nyadlin\nsalahaddin\npoovey\nnoorderlicht\ngunsmithing\nkxtv\nbrusatte\nhamr\nlhéritier\ngrzyb\nattackman\nbardhan\nlouna\nbettes\npreforms\nfrosti\nproverbially\npuett\nconciliazione\nmansaray\ngriefs\nguidence\niason\nnonvoting\npyinmana\nwainright\nmeasurability\nunacceptability\nleics\nlefevere\nluf\ngabii\nwhiffenpoofs\ndibona\nprecooked\nsommerlath\nfisken\nkifl\nrindal\nmoping\ncounce\ntoupée\nmeritt\ntavani\nrgl\nmatiur\nlingotto\ndrieu\nstamer\nledonne\nsagot\nkjer\nsauza\natwan\nbourdet\nhodnett\nmaws\nsatyamurthy\nlugged\nkulig\nsounion\ncrie\nakhras\ntrinca\nmcgruff\ntroodontidae\nangliss\nnhadau\nraco\nlobov\nmnangagwa\nshortform\nunwrapping\nattachmate\nkandilli\nmtel\nsulaimaniyah\nalberici\nshishkov\npined\nlevander\nhennin\nallenstown\nzalayeta\ngeopolitically\nfluorocarbons\nblameworthy\nguettel\nquicksands\ntoughie\nblathering\nroughwood\nlistservs\nriptides\nkrenzel\nprotezione\npijesak\nshewell\nparagua\nascherson\nlapolla\nepilepsies\nfany\nscabby\nmatikainen\nuprise\nkrqe\nwireimage\nnbbj\nmashallah\nmohri\nvoicexml\nendwar\nwhiners\nilaya\naebischer\nkollman\nstroppa\nrosukrenergo\naristocracies\nnorthtown\ndanya\nlustrum\ngourde\nnemr\nmotsepe\nbossley\ntuncer\nmickleburgh\ncolombière\nmincer\nkipyego\nnsca\nakhnaten\ncoppins\nurbanos\nparnham\njaal\nkanis\nlugt\nwykoff\nmacunaíma\nluxon\nflipbook\nhopfner\nruleville\nvólquez\ngleevec\nviser\nwendlandt\nghannam\nvtx\ngeimer\nsurdu\nunappetizing\narverne\nklarwein\ndipti\nvwd\nleist\nyarmulke\npokeweed\ntranscoder\nhillmer\nnoctiluca\nactuall\nhowry\ncamouflages\nferrovial\nmaltzahn\ncontiki\nunlikeliest\ncurtails\nlefcourt\npavlopoulos\nburmantofts\ncountercoup\nsweetcorn\nchaiyya\nvirji\ntaxpaying\ninbhir\noltmans\npiliyandala\ngalex\nhonkers\nwinnersh\nverlagsgruppe\npeltol
a\norren\nlacher\nmincho\npblv\nvolstad\nreatard\nhairdos\nlonglands\npretax\nterezinha\ngafford\nbussiness\ntesser\nahadi\nhurriya\nspringtail\nbonavena\nbronchodilators\nboylesports\nilton\ncollington\ngattai\ndubailand\ntalagi\nsaoud\ncarrothers\nodelay\nanier\npolytonality\nagle\nclockers\ndannielle\nmahre\nvidale\nronet\njenet\nerjon\ndantis\ninjil\nbakero\nmodasa\nroselyne\ndefragment\nyuting\nhellp\ntrebilcock\nrungwe\nlollo\nodair\nlaffit\ntraphagen\naceto\ngallay\nposeable\njhunjhunwala\ndumisani\nyuzhong\ndorell\nbasepaths\njoline\nreflets\nsmush\njannes\ntemplecombe\nclst\ntanongsak\ndussollier\npoolman\nshorebank\nlysate\nhaslington\ncorduff\ncianciarulo\nproffers\nwoodrum\nwessler\nbunde\nscholium\nlibdem\nfenning\nvinessa\ncondren\nbeddau\ncornhole\ncolumbite\nibach\nfison\nmanucharyan\ncoppersmiths\ncardena\nvittorini\nwampanoags\nrenia\nhijet\nganton\narrojo\nfetu\niajuddin\ngiddiness\nxandros\naebi\nelmstead\ncolomé\nbollmann\nequinix\nchichiri\nshackell\npaartalu\nsonobuoys\nwics\nwellies\nharesh\nschimpf\nkabara\ntdx\ngirdling\nkozelsk\nwaymouth\nsciencenow\nmurden\nstarstreak\nperenne\nobfuscatory\nultrathin\ndisproportionality\nweequahic\nnewmilns\nschilsky\nmagaluf\nclambered\neggum\nbrehaut\nkaplow\nsicne\nwackiness\nfidgety\nriffraff\ntaillamps\nrinck\nrakshit\nmoneygram\ndonleavy\nyucai\ndonators\nlorig\ntwigger\nicps\nsmola\nhauber\nroquetas\nmunchie\ngasthof\nmasamitsu\njish\nbsds\ninformacion\ntoutain\nrehavia\ndmw\nvirgilian\ncrear\noccupationally\nbhamra\nunroll\nverheiden\nbinzer\nmandatorily\ncinergy\novenbirds\nrocamadour\nkhalida\ntsos\noxpecker\nottosson\nmagli\nlotito\nendang\nleden\nzadan\nmischaracterizes\nhumanness\nrollercoasters\ncantey\ndustan\nrizos\naafes\nbrightling\ngoseong\nanisur\ndrako\nsporran\nbelligerently\nokwui\nleviton\nnenndorf\ncaliff\nflouts\nmcquoid\ndagsavisen\ndeewan\nsexson\nhellbender\nposthaste\nchinati\ngiberson\nekster\nupgrader\ncales\nvssc\nrevalidation\njudengasse\nstramonium\nmathi\ncontrapunctus\
nkefu\nagazzi\nbellette\ncreevy\nowne\nrugge\neglon\nrajaraman\nandreyeva\naversions\nklumb\ncowle\nretrying\ngtin\nbrathay\nunderfed\nbichat\ncassida\nreinsurer\ncontursi\nunsecure\nagnelo\ntollerton\nhollein\nreinertsen\nrifka\nryal\nradovich\ngottheil\nyaqubi\ncornard\nremillard\ntylers\nundeservedly\nuphoff\nuysal\nhualong\nvipond\nbuildable\ndropwort\nbowkett\nnulato\nspanaway\nansdell\nindyk\npedaled\nfantuzzi\nroji\nayob\nbookkeepers\nsudani\ncompa\ncorita\ntourelles\ncichocki\nzangara\nehart\nhoai\ndirie\nclases\nsparkie\ndunagan\nbilma\nboualem\nscei\nbickleigh\nrailay\nkenwick\ncolrain\nelliotts\nmouride\nsilcott\nvallois\ngawande\nmushaima\npeplowski\nschoener\nmazuz\nrapacity\ndecroux\nrollerskating\nbgh\ncauterets\nbortle\nrideshare\nchmiel\njtd\nblumenschein\namcc\njavaris\nyahud\nnecromantic\nrukavytsya\nsmithills\ncolorno\nazab\ntrus\nzura\ngrommets\nyosi\njahoda\njiemin\nmissen\nimplimented\nfinback\nagnone\npuppo\nrusski\ndougy\nyantian\nkubilay\nsynthe\nunrevised\nvillere\nratted\njoar\nsinglets\ninteriority\nwehlen\nndikumana\nloux\nmicroenterprise\nformaggio\nscooper\nmjs\nsteiners\nsanitizers\nmackle\noversold\nforeward\ncelgene\nsensually\nfeldafing\nrockface\noverfilled\nrinde\nhofbräuhaus\noschin\neulália\ntranslucence\nperspectival\npecha\nkaukab\ngaiam\nsadowitz\nhaysi\npaganelli\nlinos\nvassilev\nsaadet\nbratman\npjp\nmakowsky\nfaiveley\nwady\nalday\nstoeckel\nsidman\nronis\nlerby\noveridentification\ndonno\nkitanglad\nplandome\nkarahan\nblackamoor\njournalese\nwillfulness\nnager\nborren\npmsa\ndwek\nnsit\ngoetzman\nbonvin\nhoeck\nfaves\nalkmund\ncherny\nchoppa\nwholesomeness\nkutaragi\nneid\npaetec\npoisoners\ndivinations\ntiffen\nnonylphenol\nballybeg\njarque\nhaehnel\nuttaran\ngeoint\nvitol\nbergenline\norfèvres\npmv\nrales\nostro\nunpatched\nnoshir\nerme\nbenedicts\nzubieta\nsplinting\nvaradkar\nfournaise\ncaspa\nchipinge\nmarzena\nsupplicants\ntoneless\nmotherships\nantiquarianism\nhoeft\nclambering\nrigidities\nrosbach\nlfv\ndayer\n
marchés\nengelaar\nschmoke\navita\npachyrhinosaurus\nchere\nkandelaki\nholzner\naugustyniak\nupend\nkleban\nragon\nagitates\ntylertown\nperriman\ngóes\nahus\nsackings\nkempter\npenkala\nzubeldia\nknipp\njanela\npotterton\nculbreth\nfoxall\nyetholm\nsandling\nakter\ntehnika\ncalifornie\nromping\nbaic\nudzungwa\ndeibler\ninadvertant\nbaduy\nexploiter\nunprompted\nhakimullah\njosephy\ntollin\nnachle\ndeveron\nanthropologically\ndowiyogo\nnfld\nlawngtlai\nuxb\ntahlia\ntitlis\ncompanie\nsamura\nkomunyakaa\nuslu\ngoodden\nmitrani\nfaga\nhoveton\nyordanka\nmableton\ngreasing\nbudanov\nhillarys\nthrostle\nespinel\ncodjia\nmcclenahan\ngladstein\nwuzhong\nguzzler\ndoinel\ngrippe\nhanton\nlolol\ngruben\ntaguba\nbrina\nhawkinson\nhygrometer\nwible\ncigarini\nprofi\nhungers\nstoffels\nchoudhuri\nstachowski\nseein\nruffling\nmachavariani\nsenreich\nleighs\nrecondo\nnhlanhla\ndocumentarist\njozi\nbarmes\nspackle\nkamman\nprocrastinated\nfredda\nbucy\ncowick\nbonvicini\nsundarban\nthurnham\nlassana\nbutyrka\nbuitenen\ndronten\nmamasapano\nbisto\nporc\niconix\nkimbro\nmatronly\nkilju\nqatil\nrazza\npadda\ncudlitz\nfinchampstead\nmondonville\nlison\nmorgenstein\nzii\nroadworthy\nnanri\nsyndicators\nmutal\nrezai\nmiskovsky\nbookstall\nmanguel\nvenipuncture\ntributyltin\nlawfare\nadeniyi\njehova\natheroma\nexplicates\nlavernock\ngreenawalt\nbeaney\nrollox\nusherette\nahwahnee\ncatspaw\ntensioners\njaxson\ndmitriyev\ntchani\ndelyn\ndolgin\nmaryfield\nkrens\nsinquefield\nwieler\nwilis\nduku\njacome\nsheffey\nythan\nkavin\nsaige\njaoui\nmoorage\ndufault\nuvr\nfatalist\nprazak\nschwaiger\nballons\ninured\nbeese\npolyolefins\nlarrick\nbonera\nkitada\nwiesinger\nspratling\ncurdle\nbbci\nseegar\ndomb\ninverkip\nnango\nnordenham\ncarrascosa\ndetre\nambrym\nphormium\nuygun\ndefoliants\nmbulu\nlafco\nfischli\nturncoats\nseasteading\ncannaregio\nballingry\nrangasamy\naffligem\nglazers\nbroecker\ndarkon\nchallange\ngibbus\nbaschurch\nfraternizing\ncoccidioidomycosis\nshetler\nskrall\nholburn\nsusc
eptibilities\npolarise\nakvavit\nalbats\nzefiro\nbogusky\nfranchize\nbishopwearmouth\ncasseurs\ngoitom\nsleights\naleksejs\nbouchut\npulverize\nalbopictus\nachatz\nkinokuniya\ncolistin\nbefalling\nkibar\nhiman\netanercept\nmastocytosis\nhamou\nsugarplum\nkazakhmys\nranchland\nreassignments\nhomotherium\nradoslaw\nzbc\nmilingo\nkupperman\nkoyi\nderegulating\nhymettus\ntussles\nandrogyne\nlosartan\nmassai\nweatherboarding\nnusbaum\nhallworth\npompilio\noehlen\nchallengeable\ngunnerside\nmurciano\nkoppu\nwahnfried\nchéret\naerobically\nstarline\nsinofsky\nkwashiorkor\nranawat\nsafenet\nrusli\nbeya\npublik\nkampman\nbolthouse\nbufano\ncaldarium\nsunchon\nlauret\naswat\ntelephonist\nlatanya\ngayet\ndriefontein\nyadong\npanek\nmirail\ngrochowski\nmejri\nbarloworld\ndisplayable\nsawbridge\nfedexfield\nneuroanatomist\nbrightnesses\nimmunobiology\njeffersonians\nriak\narek\nbmcc\ncinda\nfraîche\nartreview\nantitheses\ngeiranger\nsubassemblies\nwatersmeet\nhovingham\nstockroom\nbrennand\nmusante\nbruer\ncidre\nnautico\nspreadbury\nambuj\nharilal\ncucinotta\npowergrid\nschaber\npucara\nmoggach\ngollwitzer\nrerio\nhyosung\nmasterless\nbrinsworth\nrusha\npreclusion\npehrson\nkleeman\nmaie\ncodependency\nmégret\nkeagan\nishani\npuning\nbogost\ndierdre\nradziner\nprotoplasmic\nglackin\nrehousing\nsabhal\nclasen\nlekking\ngranatelli\nsemans\nnulli\nfleurie\nprecursory\nanderer\nsarro\nunornamented\nkhanabad\nreexamining\npliva\nkalomira\ngeib\nwoore\nneubacher\ngwilliam\nwaljama\nmacaron\npettijohn\nyuhas\noginga\naeroportuario\nfreshening\nboin\noverlea\ninternationalised\nariff\nmozote\nsquawks\nsijan\nlechuga\nsmartness\naptamers\nantiproliferative\nhalfhearted\nmccreedy\nseehausen\nkabua\nsnowmobilers\nkersley\nconjuction\nmorcombe\nduerden\nvandoorne\npertiwi\nswatter\njoson\nsweed\npalmy\nangelology\nbudnik\nnavarette\ndenmead\ngeorgetti\nbutterfish\nnitpicker\ncaravane\nchugg\nenvi\nbrost\ndybdahl\nschroders\ntett\nhony\noffloads\ndubnyk\nrefrigerate\nomerta\nverities\nchand
lery\ncrudity\nmastercraft\nbollier\nrelearning\nesteros\nbangguo\nhammacher\nkandil\nviolences\nfallers\npouya\ntabone\nbanko\nenthrall\nalbach\nraber\nhaggins\narticulators\ncausae\nkindleberger\nscampered\ntikun\nkaralius\nyawns\ncongruency\nforestay\nkoken\nbalangay\nbarhi\nsaib\nfrady\nberlingo\nmakart\ntraceback\ntransito\nzeynel\ntessé\nheffler\ntytherington\nmilea\nmison\nstanground\nindranil\ndeemer\nsteeplejack\ncansler\nperno\nkande\nmwale\nfarinata\nhippe\nquilp\nokin\nswn\nrassinier\nrusnak\nlesperance\nchatta\nrenos\nllanwrtyd\nmashaei\nbarnie\nreinprecht\nprophesized\nlybster\nbbqs\nskydance\nmazower\nbralower\naveva\nwarnaco\nseamy\nadoum\nlochleven\nphylis\nkiersten\nzorc\nstibnite\nalysheba\nrimula\ngocha\nwoulfe\nsecurom\ndebacles\ndeuteride\nrybnikov\ngreenspon\ncatalysing\npursing\npcts\ngeg\nwurzach\nyaroshenko\ndeflates\nparlane\nestacio\npozdnyakov\nkreek\nprofesionales\ndardis\nlillingston\nwindsock\nadditionality\nkillie\nsavagnin\ncrescens\noarfish\nchangnyeong\ncirs\nfortnam\noptimizers\npeslier\nabcp\nquintain\nrudas\nhelfand\ngadir\nvitrine\ndaul\nsalovey\nlawrenz\nadkison\nglengormley\naspendos\nraanana\narlidge\nthreet\nvoicework\nmathern\nmonchengladbach\nakutsu\nattingham\nodam\nbadil\ndiatreme\ncjn\nmercuries\ntunecore\ntaurean\nflavonols\nwoolpit\nfafard\nmakeda\nsanand\nkalaba\nzanger\nayal\ncroutons\nnnamani\ndefecates\ngisli\npocketwatch\nrozsa\neuractiv\nchaix\nmurguia\nhammerman\ngeodynamic\nbusche\nmillenniums\nfiorelli\ncossutta\nkesgrave\nscovill\nsekret\ncharulata\nliferaft\nbasketweave\nhemodynamics\nworcesters\nmultiphonics\nharebell\nschechtman\nniuatoputapu\nkonvict\njetway\nbadat\nbujsaim\nnuhiu\nschnoor\nhorribles\neriboll\nwika\nmadlyn\nubr\nalewives\ngavitt\nreeltime\nmoinul\nmoonstones\nabiquiu\nrictus\nidyl\nazpilicueta\nvatel\nmachacek\nninigret\nmarinero\nambro\nginna\nenf\nkonrads\ndeprecatory\nklimowicz\nalrewas\nomh\ngroper\nbeatha\nshichinin\ndkm\nlandreau\ntanuku\nrathborne\nockbrook\nlykos\nteunis\nswo\n
mispronounces\npianura\nvwp\ngassan\nstyli\naurigny\nheijn\ncommentaires\nsklyarov\ndecontrol\nhils\nzaitseva\nburtons\nmulches\noverstimulation\nneuk\ninfinis\nbaccus\nbogazici\nwaso\nkanesatake\ncontrovertial\nrollup\nleijer\nshowmatch\nimmolate\nserber\nmayardit\nolayan\nkermanshahi\nullens\nstoc\nbroaddrick\ngnm\nfulginiti\nugbo\ndebach\navago\nqqq\nshipai\nurquell\ngoldrick\nkyowa\nkarvinen\nscagliotti\nmanderley\nprimitivists\ndepressurized\nbackspacer\nlightshow\nlvad\ncarimi\nchhay\narval\npahan\nknoyle\nadjudge\nboening\nobrestad\nsandycove\nmismeasure\nmillionths\nlillet\nproffit\nmaisey\njbt\nwakako\nboozing\niuc\nhemley\nascensions\nnatgeo\npashkov\nglaise\nhealthwatch\nmapfumo\ndovecotes\nnajem\nbouchez\ninflexion\nlmh\nsolider\nmachlin\npalamara\nreaderships\npuo\nsparty\nrenovates\ncervinia\npalkina\nbiros\nqada\naudino\nkerckhove\ngoodlett\nvirginias\nbucherer\nkleeb\nthrombophilia\ndausset\nneocolonial\nfraisse\ncaitriona\natwt\nmastheads\nuzel\nlowing\nbirostris\ndigeorge\nbauld\nshuaib\nclode\ngubi\nhandholds\nnutmegs\nmarotte\ntetlow\nsabelli\nwilliamsbridge\nyanyan\ncunegonde\nclanger\nbalink\ninnuendoes\nccoo\nsucci\ncourrèges\nsanshui\nrills\nvigdis\ntrainability\nbeauts\nvamping\nhuntingdale\nkraybill\naugustino\npashtunwali\nvarah\nltj\nchautard\ngenel\ntractability\nmourvedre\nclods\nnilmar\nblodget\nliuna\nranter\nteasingly\ndiefenbach\nsurveil\nabjectly\nkeyham\ntwirls\nsynaptics\ndharmakirti\ngrindlays\nmoelwyn\nloughman\nguipuzcoa\ndreaper\njinxing\npandeli\nstielike\nruwais\nlonardo\ngrassmere\nandravida\nskenfrith\nepact\nmeaulnes\nlahiya\nskarlatos\nadella\nprizm\nmarimekko\nelgood\nmì\nyolu\nnehme\nbehrooz\nhopen\nresurrectionist\nfeverfew\njacomb\nortwin\nintar\ncryostat\npointier\ndugarry\ntecom\npauma\nrochestie\njaver\nsarf\nommitted\nptz\nsmtv\nkirklington\nchristien\npoynor\nbertolotti\nbakdash\ngibsland\nmănescu\nnonaggression\nprobally\ninfinita\nindigents\ndesignees\nparasitise\ndoblin\narcidiacono\nsteelmaker\nunobtainium\
nsamari\nsephy\nansu\nchacombe\nwesterleigh\nkalp\npulte\nyasufumi\nschlicht\nkraters\nwichman\ndwarika\nartimus\nravenscrag\nzenati\nalternativo\nshikasta\nelmir\nxinfeng\nnikel\njanah\nkcrg\nmerenptah\nzaniewska\nsunkara\nsubsectors\nnoao\nhagins\nborro\nkomara\nhansung\ntangibly\nyousefi\ngalliani\nartemesia\nnanobiotechnology\nneedlestick\nwebsense\ncryosat\nprocuress\nmamzer\nboogey\njawaid\nskarn\nhoniss\narvor\nfennessy\nhydrogeological\ncrvenkovski\nzwiesel\nwillughby\nfanya\nlempel\nskubiszewski\ntongham\nsuster\nwintory\nsauls\ndesco\nwythall\nsterckx\nunrests\nmusacchio\nsatriale\nboneta\nluddy\nthreshed\nkragthorpe\nsaun\nbrontes\npetunias\nokruashvili\ncichy\nwidcombe\nzazzle\njimny\ndielman\nabderrahman\ninsh\npestilential\nliskov\nmiaow\nperiorbital\nstimulations\nantibalas\njhi\ncorporatised\nknaths\nloseley\nvegar\nduflo\netchemendy\nchristens\ncompactors\nmickley\nunsuccesful\nbaldonnel\nzicree\nnoak\nadiposity\nmanagment\nkolesov\nputer\nphyfe\nhinthada\nmalaxa\nrelativists\nwesthaven\ndŵr\nsakhawat\nfordwich\nlindens\nwildy\ncartwrights\nristic\nhapoalim\nmukhtarov\ngeneraciones\nmoneyline\nmaliha\nzambello\naglianico\nfandral\npolywell\nippc\nprucha\nplantronics\nluten\nodier\nlaurino\nmyford\nvrla\njoggins\nxiyang\ndomnina\nbrickfield\nsuperheroines\nllay\nferriz\ntobacconists\nyiyuan\nhangleton\nlxr\nqadar\nkoivuranta\ncomforters\notavalo\nminger\nosmin\noconaluftee\nrannells\nsikat\ndriedger\nlisak\ntemnothorax\nborrás\nlambright\nbassets\nsacramone\nsecondmarket\nweilerstein\nweeton\nprayas\nparacels\nbufalo\nthundersnow\nintervision\ndittmann\nsymbiotically\ncervin\nqiaotou\ngodbey\nporizkova\nnewsvine\npotencies\ngotay\nmkandawire\ngyrocopter\ncirie\nreadjusting\nmeos\nhueys\ntidmarsh\nweiping\nforfait\nartos\ndetmar\nalbinia\nwiemer\nrieke\nmilliwatts\nhandovers\nteazer\nglobalfoundries\nenjoyments\nwddm\nfvc\nbabers\ncorkill\ndilbeek\nheavylift\ngrandmont\nroved\nshakeela\nelgart\ngevinson\ndreiberg\nbrandel\nrasin\nnsec\nhoshikawa\nmyhr
vold\nhemnes\nvestibulum\ngatenby\npaolis\naudel\namblecote\nsntv\nrozel\nhiders\nscrounged\nfás\nalcivar\ntitas\npasque\nmeritage\nhibachi\ncoliforms\ncloseburn\nqaryat\nmanically\ngambinos\nlinzey\naerostats\nbennekom\nromen\nusuhs\npackinghouse\nhaysville\nftas\nsunam\nnately\ngodane\nchipchase\nexplications\naleksic\nshipps\nderlei\nmanganelli\nrido\nignatiy\nnahari\nkeverne\nwqs\ntimesheet\n,then\nstrallen\nattkisson\nwavendon\nkuffar\nforteviot\njollity\nsharifs\nxiaolu\nfergy\ncheverny\nhaysom\nmhor\nbashy\nboesen\nmiyun\ncarenza\nmarketwire\nbosox\npeasley\nmarathoners\nwitheridge\nmazzacurati\ncutdown\nromanesco\nmagaziner\nseront\nprodromal\ntőkés\ntuitions\nbosingwa\nkazee\ngerin\nnwn\ngedrick\ncolindres\ngref\nnecklaced\nwindfarms\nroelandts\nsistina\nmsms\nwtn\nsomen\nwoolcott\nspectrographic\neasthampstead\nvegh\nredeker\nportzamparc\nbooy\nundecipherable\narnotts\ngoaled\nmcerlane\nengleman\nsympathises\nhowett\nkullmann\ncablecard\nhoggs\nmartinetti\nzierer\ngershwins\nklon\nnikken\nactualizing\nbobic\ngreetland\ngurov\npmln\nstets\nranton\nglassberg\nreuser\nleibold\nsapsuckers\nmaliszewski\ndunner\nbajnai\nfritos\nintercessors\ngiftware\nmislav\nmanot\ndenkinger\npowerfull\nevangelising\nfeierabend\nleatherjacket\nmcginest\nwangfujing\nmaenan\ngorran\nzelikow\nguralnik\nhockeyroos\noupa\ncapstones\nrocklahoma\nbivouacs\nzirc\nchodorow\nfluttered\nkrathong\nsatyarthi\nthornby\ncabaniss\nparilla\nkozminski\nmatilija\ngornell\nkancho\nemcc\nbaginton\ntweeds\nreprograms\nyaqiong\npacolli\nheimerdinger\nbaitz\nzamarripa\nstirratt\naverin\nsperl\nnewmains\ngrisedale\nkancheli\ntrebol\nhirschfield\nrase\nfishergate\niffat\nwindell\nhazlerigg\nangelenos\npegboard\nsawano\nmansoori\nbdos\nprodrugs\ntrashers\nearnock\ndunvant\nvélizy\nrabon\nliers\nbahamonde\nvallie\nbusked\nplights\njtm\nspgb\npalud\nnastase\nlezignan\nyosh\nchicharrón\nellner\ncornacchia\nsharonov\nyamatai\nhinesburg\nshosholoza\narticulator\nmuffling\nwxga\nyergin\nccee\nsube\norgasmatron
\ntanno\nzweibel\nalleg\njanay\nxpc\neffrontery\nhanun\nkharas\nsureness\nokaka\nreinsch\nimh\ndenge\nmcelhenny\nentenza\nwisler\npetillo\nhusan\nromera\nfortey\nshneider\nohia\navowal\nchebbi\npaphitis\nqingpu\nbearly\nsewri\nhahnenkamm\ntheatermania\nmeekins\nfarinha\npashtunistan\nkhudyakov\nnorthcoast\nsozzi\nleverty\nabersychan\nzaatar\nlagniappe\nbolar\nremineralization\nundefeatable\nissia\nbulku\nacerno\nwava\nkadafi\nrossport\njuntendo\nhdk\nclein\nchagford\nfaah\npseudacorus\nremelted\nkilgallon\nkapchorwa\nquarantining\nnobelist\nmohib\nxiaoxu\ngondjout\nmalby\nzippered\nformidably\nbidegain\npasc\npressurise\nsofian\nseropositive\nonuoha\nratier\nbergsland\ninsoles\nwyatville\npreeya\nhayatou\nsuunto\nbablake\ndetoxifying\nmountfitchet\nscorekeeping\negilsson\nhabsi\nnazeing\norizzonti\nkaneva\nxinyuan\nkavaguti\nemoting\ndubash\nmahallas\nburqas\nmilnthorpe\nschulmann\nkassen\ncourir\nkøbke\nanakinra\nsupressed\nmetheringham\ndenyce\naeronauts\nanxiolytics\nfreedia\nextemporaneously\nclomipramine\nseiden\nrbma\nwalb\nputout\ndefne\nclericus\nsazanami\nattritional\nbifolia\nfolgate\nindianhead\ndungu\nbussel\nscholer\ngumming\naproach\nbillow\nnawabi\nkilrain\nabasement\nvrp\nbelongingness\namwa\nchoudry\nthornback\npitroda\nbredahl\ntransurethral\npaticular\nannin\nperretta\nstaubli\nexfoliating\nkitteridge\nspem\neitzen\nabdulai\nribadu\nliftgate\neastney\nrivergate\nstalky\npluta\nhwp\nbeldham\nliborius\nmaestrale\nmidis\nmahto\nsnax\nnolet\nchacho\nbarmston\nbeneficiation\nagglomerated\nderwen\nevonik\nmarkyate\ncocido\nfleuriot\nsinsinawa\nshifang\nacklin\nfidele\nstarcom\nretzer\niglinsky\nharpooned\nheeswijk\naskern\nvosa\névénements\nmajola\nisch\nparliamentarianism\nwindex\nsaunder\nfluoroscopic\nrivadeneira\nrochel\nimon\nairservices\ntudhope\nsplicer\nmisidentifying\nragueneau\nkarrimor\ndoublers\nsuranne\nmagendie\nmishina\nhamoaze\nkrotz\nletscher\negami\nyusaf\naerin\nhakin\nsenselessly\nussd\nmarzullo\nsirpa\ntadataka\nnyanda\nlejla\nmasch
ler\nsaganowski\nsuperheroic\nshallenberger\nmetamorphosen\nhorrorshow\nchongzuo\nlauten\niribe\nnonrenewable\nskok\npariaman\nondcp\ncoul\ncoldbrook\ntriche\nkarnowski\ntollefsen\ntretinoin\npoudel\nncaas\nogando\ngooglebot\nsextants\nteressa\ncharterers\nfaren\nridgewater\nadvertized\ncaic\noreck\nfaseb\nubben\nfofs\nghailani\naccost\ngerlich\nentirity\nhilleary\nsonetti\ntrickiest\ntailpipes\nostracize\nmohieddin\nnaturedly\nhovels\nlampl\ndizaei\nradioland\naquarela\nlozells\nflyting\ndispur\nvorontsova\notherland\nwpo\nmoulden\nkarissa\nmarimuthu\ncranie\ntuilaepa\nrateb\ntousignant\nesterbrook\nhashemian\nmaladroit\nstupids\nhatherleigh\ndarkin\nmamut\nfanfest\nupfronts\nhaing\ndorfsman\nathanasia\nerythropoietic\nyaping\ngyasi\nfishbach\nscallon\nmjm\nmipcom\ndayyan\nmuteness\nseldin\ncakebread\nnatallia\nturkishness\nnonalcoholic\nmanagership\nwildwoods\nshrady\nbreezing\nmadadi\nmatelica\nmaximalism\nnermal\nsandbaggers\nunappreciative\nstrenger\nlambeg\nsnowglobe\nlogistik\nknuckleheads\ntillingbourne\ncardiomyocyte\nhfq\nfellman\nhaddiscoe\ngiaquinto\nkhouang\nbazza\nyamai\nkrystina\nwinnenden\nchavela\nannemie\ncordileone\nblea\nshowjumper\ngaver\nriney\nlisfranc\nlarmour\npurifoy\nozell\nphyllosilicates\nhohman\nthinkfilm\ngrevemberg\nnegócios\nnucor\nokeover\ntorday\nbacteroidetes\ndanyel\nsalekhard\nsurveilled\npettet\ndabao\nborowsky\nwallem\npropsed\ntilmann\nmaddens\ntravanti\njankovich\nantonito\nmuya\nbrainpop\nkodaly\nhovsepian\nicat\nkarnail\nobika\nstrittmatter\nbruntingthorpe\nprecisions\ntheurer\nrebounders\nnuncius\ncalvocoressi\ndiack\nhochevar\npopocatepetl\nlacunar\ncolvard\ncowers\nferas\nzonen\nustica\nbijie\nmoitessier\nbackslide\nglais\nvortexes\nmohajer\npsychographic\nlizzi\nrevocations\nabergwili\nkofta\ngalvanise\nsmas\nhatty\ntkachyov\nmatucana\nmervis\nmirowski\nmusavi\nfrostproof\nsadowsky\nakte\nrohypnol\npuspita\nshoukry\ngrevers\nharbormaster\nlymond\narchant\ninfectiously\nhighview\nbeaubois\ninvloved\nbudgies\nbiologique\
ndietzen\ntribosphenic\nouc\nicbn\ndaska\nselita\nstraightaways\nrecinos\nvinoodh\ngemcitabine\nmiraval\ntter\nwylfa\nreyhan\nbesmirched\nzhilin\nklyuyev\nboroujerdi\ncatergory\nchofu\nandr\ncolie\ncreekmore\nahenakew\njaniak\ncadeaux\npoonia\nrandian\nkuszczak\ndennington\ntanika\nabdulsalam\npardey\nskomer\nunlikley\nantsy\nyurie\nkouroussa\nbizenjo\nbechmann\nlangbaurgh\nakta\nyurevich\narza\nmcwane\nbrindled\nboomtowns\nfscs\nsheepherder\nkarson\nacnes\nbarsac\nuninviting\nlexile\naverbuch\nshults\nhallion\nghencea\nmedoc\nendarterectomy\nwhatsit\nbarbarina\nhumbard\nrousay\ntuncel\nnucleare\npyongtaek\nhetal\nsarles\ntabernae\nobliques\npassfield\nstori\nmajar\nescapements\ncorporeality\nyusri\npoortvliet\nminshew\nseeling\ngunawardene\nlording\nlutie\nmeric\nmoceanu\nbackpocket\nflambards\nchaskalson\nbrickmaker\ngateside\nverduzco\nprope\npulao\ncronista\nedelen\nruhengeri\nmaiella\nlotario\nleinbach\nleil\nchelford\nrogate\nlupines\nwhatevers\nmumby\nmarinakis\nfastweb\nuniqua\ncarsdirect\nspitznagel\nroso\nkhine\nballadeers\ndementors\nlingyu\nitalys\nscribbly\nmutley\nsaky\nlichuan\nlightless\ntriazolam\nhullo\nmemorium\nacquaintanceship\npptp\ngladwyne\nannigoni\nlinnie\nartschwager\nfremm\nroncal\nkirsh\nacedia\nrabka\npaleis\nschreckengost\nlaprise\ncbct\nangularity\nreflectometer\ngoldfoot\nherpetic\ntarsis\nexclamatory\nskittering\npepping\nsunport\nhäring\nmaidservants\nmcmordie\nlimington\nkaffirs\nnecrolysis\ncrystallise\ndinastia\nmaotai\nveerman\nflitter\nwazed\nektachrome\nemotes\nrerated\nacrosome\nmarveling\npikine\nhasib\npression\ngermicidal\nboringly\ncarefulness\njubair\ncampoy\nepirb\nkende\nbounceback\nlouts\najello\nmoita\nhollowness\nledgard\nglinton\nmultigene\natrai\ntudorbethan\nallg\npflaum\npouillon\naschenbrenner\nangelyne\nbudds\nheureka\nsheepfold\nfarokh\nunwelcomed\nmorlan\nlightworks\nschrage\nzhiwei\nheptones\nnaics\nhumar\nmillns\nscalby\nfelmy\noutbidding\nfaceplates\nbasenji\nshukur\nernö\nhindelang\nzerelda\ntrusler\nun
ocha\nfoscarini\ngeosystems\nlefevour\nfegley\nschmaltzy\nisw\nzentai\nmaximisation\nshaha\nungentlemanly\nolguin\nepithermal\nspiegeltent\ncerner\nfengyi\nnetworkworld\nklann\nwuhai\nfery\npaygrade\nsimyra\nppos\nmcvaugh\ntinniswood\nblueskin\nrrh\npahlavan\nbackflips\nngd\nmosquée\nirritans\nwongs\nodac\ntisane\nkivalina\nshamshuddin\nyoknapatawpha\ncinto\nyounès\nostracod\ngrindin\narsham\ntocopherols\nabertzale\ncinemateca\nlauric\noppens\ndeshon\npaloheimo\nmôme\nhydatid\nopdyke\npervin\nupsides\nfrecklington\ncorbiere\nbobe\nintikhab\ncrisil\ntankards\nskandinavisk\ndelahoussaye\ndisqualifier\nkohle\nviramontes\nginting\nabraaj\naisleyne\nbrabants\nkangol\nwavelike\njamario\nharib\narrhythmogenic\nbasualdo\nmeria\nwessington\nbaobabs\nparvenu\nkohonen\nsfard\nchamitoff\nspelunker\nlownes\ngreencard\nhakola\nrhome\nkilims\ndiardi\nmetsamor\npsychonomic\ncolls\nparche\nhoshide\ngadabout\necumenist\nriffed\nprogressiveness\nwarehousemen\ngeesink\nsteuber\nchkalovsky\nsaffold\nshangrila\nghettoes\nminie\npudil\nallesandro\nmanneken\naeroponics\neenhoorn\nsexpot\nscrutton\nsalsberg\nmugisha\nmallison\nbcsc\nherbers\nboghosian\nalico\neichorn\ntref\ntessitore\nhertzfeld\nsfakia\nchevelles\nkarystos\nmccurley\nanual\nruili\nelze\naltice\njeev\nlebaran\ngaras\ngefilte\nqarni\ntorm\nwonderous\nhalldorson\nzulaikha\nstuk\nfilleted\nstoeger\nmikuriya\nettington\nstateville\nbeekes\nimplodes\nkiliwa\nmironescu\nbarkeep\neibner\nunpublicized\nreknowned\nstrati\nkingma\nnakahira\nzbikowski\nlindholme\ncaisses\nchrisp\nbeign\nmaccrimmon\ncosic\nfraizer\ninstow\nparchin\nmpq\nvavoom\ntaepodong\nmurbach\nkehlmann\nplx\nmarghera\njennett\nawilda\nwhitewell\njiyun\ntropicalia\nabsinthium\nhakman\ncockettes\nheveningham\nmufi\nkatisha\nrimonabant\nerics\nbaturin\nramboll\nmicropayment\nsokolnicheskaya\nanticolonial\nmoeser\ncenterport\nexame\nnouriel\ngiba\nkrishnendu\nargyl\nemboss\nkolp\ncheckley\nscroby\nsamois\nkuroneko\nwemp\nlbg\nqdr\nbashkirian\nmadobe\nahlawat\ncivoniceva
\nkernohan\npennycook\nhenricsson\nmartien\nhotfix\ninterlinks\noversexed\nsteyne\nbankrobber\nilas\narchaelogical\nstansby\ntilahun\nsylphy\nmilchan\nmussi\nsteinbruck\nnahhas\nmistajam\nthermostatically\ndelee\ntrachycarpus\nzuiker\npolenz\nduberry\nboabdil\nmatsuzawa\nshorthead\npyon\nsophistical\neisman\nteshigahara\npugno\nngcuka\ntitchwell\ntetro\nocps\ndachs\ntobs\nmcdougle\nwhitemoor\nstriper\nduntocher\nvillasenor\nperras\nblatchington\nuntung\ndispirito\nsynn\nepicurious\ntorriente\npitera\nfoyles\nmoisander\nzambrana\nemos\nfarnerud\nlotts\nromansch\nfoong\nshirlington\ninquirers\ncodswallop\nbythewood\nkaufhof\ngonson\nsolanine\nstreib\nclaverie\nkisatchie\ntrajano\nlipu\nimpolitic\ngeha\ncapitain\nblowups\nsajda\ngovtech\nmagliozzi\nanoints\nlakanal\ncontemporaneity\nlops\nprts\nfgi\nkirori\nramius\nseuthes\nnottebohm\nkornienko\nreprieves\nsiby\npruss\nashenden\nmision\nmontevergine\nharkaway\nschaack\nrahil\nwarga\nequipement\nclocker\ndeepness\nmaslenitsa\nkadhimiya\njide\nhdcam\nlargeau\nkpf\nlaurine\nfootwall\nmasire\naacn\ncreepiest\noverexploited\nperihan\nguggenheimer\nmerill\nluedtke\nbreu\ncoldblood\nemrs\netg\npaysanne\nlindert\nlaunchings\nimbecility\ndelforge\nsherzai\nmayet\nrashawn\nhoublon\ninnovativeness\nwtvy\ndgn\nvisteon\nstocco\nsavvis\nfaucheux\nalittle\nkarera\ncrondall\noee\nmariotto\nfinigan\nkennish\nshouters\nconflictive\nbedroomed\nundramatic\nmayland\nadvertisments\nmalae\npredeal\nbakkal\nfanciulli\nfakhar\nrosenvinge\nuniversitaet\ndjilas\nnihar\nfsma\nbolanos\nstegosaurs\nglucksman\njaza\nchancellory\ndelice\nintelligibly\noverachieving\nvorilhon\nsretensky\ndelaram\nbalchin\nroadchef\nrodat\nterrio\nharison\nmavic\nbilde\npalmore\nsintez\nkilobits\ncatroux\ntwillie\nokot\nfirths\nnafion\nzoet\nbossie\nkatigbak\nscuse\ngvhd\nwaitaha\nhilling\ncharouz\nwitcomb\nkeratectomy\nmehmud\nmalbon\nwegelius\nlockshin\nminnillo\npolicja\nusti\ndravet\njanosik\nassenting\nkosgey\nfelica\nyiadom\npostindustrial\ntanaquil\nusurious\nsc
hererville\ncriticsm\nthreating\nbroza\nfritze\nmcphie\ncspc\nludgrove\nnativité\nsevmash\nkywe\nbiswajeet\ndunsinane\nfrancolini\npepeng\nmustansiriya\nidei\nniceley\nmollica\nzatkoff\napics\ndefoliant\nservat\nhereof\ncossu\noceanlab\nlythrum\nkratka\ndagerman\ngendun\nciesla\nodera\nmangy\nsyntel\nyorkshiremen\nlaquelle\nwainganga\nwvia\nmüllerin\naronsson\nthabiso\nalandi\njellyroll\nghanam\nherrod\nsignally\ntohan\nmwenezi\nshafaq\nwillenberg\npicards\nhoed\ncyberterrorism\nimmobilisation\naulakh\ncareen\nfoleshill\nmurrindindi\nastolfo\nfornes\nsublevel\npeverley\nstultz\nzagorakis\nkarpf\nparticipators\ncdns\nhubspot\nwolmer\nwimmin\nnutjobs\nbiodegrade\ndarkplace\nsubdermal\nhunkeler\nyongchuan\nschepens\nhighfill\nfucile\ndéluge\nparvo\nselchow\nsauerbrey\nfunkee\nthise\nyacoubian\nrehabilitates\ntatin\nleymah\nlaindon\ndamerel\nembolic\nmonterde\nreunidos\ndolgan\nconvalesced\nchartham\nmisstate\nchorney\nidealab\nuniter\nfullbrook\nduncraig\nmoodiness\njaspreet\nipps\nzaldivar\nlavash\nrgh\nmarinoni\nbubs\noverpeck\ndisgruntlement\nhomeports\nnoue\nmiscommunications\nabrash\nsangare\nkoncz\ndubilier\nseyran\nterrail\nthrun\noverstay\nbardy\ndesertec\ntheodorou\ngmeiner\nwataniya\nmakari\nfluvoxamine\noberholser\nchatrier\nyaizu\nplaytone\nacoustician\nholim\narrate\npiara\nhutin\nwerley\nratched\nportioned\nafram\niwpr\nbaerga\nwoosley\nlauner\nhyperreality\ndebarkation\nanglicism\nsstl\noutdoing\ntarle\nnzc\nwefts\nshunk\nroychowdhury\ntaufua\nelph\nmoctar\nleija\nshuggie\nmartill\nzemer\ndimick\ndicuss\ndeclamations\nerwitt\nkunhardt\nvinings\nparameshwaran\nholan\nmoussawi\nprizefighting\nkirkmichael\nulceby\nouds\nunexecuted\numland\ncoff\nsubtotal\nliteralists\naddit\ndevisive\ngurwitch\nmanzullo\nmudry\ndewei\nlongbenton\ngoldfinches\nthieriot\nmulumba\njosina\nlaboy\ncavalries\njuande\nnewscorp\nmihm\ntweedale\nabry\ndiffrence\nrosenhan\nliferafts\nballouchy\nthreatt\ngrabavoy\ntarique\nmyrto\ndivakar\nvaporizers\npindell\nhols\nhyperacusis\nrabass
a\nfeest\nagness\nfaap\nabisko\nintraconference\nladell\nbrandwein\nrelyea\nkruczek\npolshek\nayar\nfrizell\ncizek\nthereza\nardboe\ncleanrooms\nruca\nbrunetto\ncontroled\nobliviously\ncossío\ndegema\npeopel\nwatc\nmoreaux\nlifespring\nveizer\nbolado\nyerwada\nrowbury\nneroli\nhaselton\nfrauenfelder\nidrissou\nfreeston\niwaya\njumex\nypi\nmoggridge\npykrete\nnimule\nmelanic\nhollier\ngilsa\nbuffay\npuzzlers\nlandrover\nsvenja\nottershaw\nragamuffins\neefje\nedstrom\ntekel\nthei\nmcweeny\nirukandji\nlantirn\nhuckabees\nfolli\nlchs\nhandycam\nhamsher\ncantil\nzouaoui\ninterleukins\nvespro\naragones\ngne\nbeac\nradzinski\nseznec\ncyberattack\nsoza\nbernabei\nneligan\nceap\ncraigmore\namhurst\nkakan\nmarkranstädt\nvrouwe\nhouchard\nharwinton\nreloadable\npommy\nezechiel\netablissement\npintails\nmirboo\nkirkoswald\nstickin\neasly\nlaurentien\ngunten\napotheke\nprocede\nicrf\nfloud\nunctuous\nvacationer\nlayec\ngnadenhutten\nuclg\nstüler\njournos\nsoldano\nhopefield\nokumu\nabras\ndesaturation\noverreacts\nmicklefield\nrobiskie\ntarlochan\nmontre\nrecalculation\ndocomomo\nkunyang\nfrechen\nmubin\nmccartin\ndismounts\nborowy\nmojado\njulies\nstoel\nshoveled\nwoodhenge\nseverns\nques\nperambulation\ntahuna\ngork\ndiac\nguideways\nscandone\nantifa\nmuzzi\nmispellings\nastrologically\ndiomande\nallots\nkanbar\njitender\ndegradations\nvileness\nwetback\nrosmah\nsridhara\ngaugh\nnorborne\nkorach\nfradley\nibri\nmanzer\necom\nfrasor\numid\nwicketkeepers\nlancelin\nperriello\nstrohl\nvassalboro\nteleshopping\nsamuela\neasthope\nsyring\nprimack\ntrimboli\ncapucci\nhous\ncicogna\ntabucchi\nsates\nredouane\nkargbo\nsimoncini\nmilov\nopensocial\nvarnell\nbrockhoff\npongolle\nmethemoglobin\npreprinted\nsonnenburg\nzarakolu\narien\necas\ntriqui\nruyan\nderangements\nsimos\nfireblade\nsynan\nfekadu\nkepis\nmenwith\nkakodkar\nhalama\nlahmeyer\nammiano\neuropan\nbrabender\nberbere\nagonize\npaulien\nmathu\naudaciously\nalioramus\nparkyn\nhamis\nhidebound\ncavit\ntintwistle\n\nsidway\npi
cardi\nunhatched\nmurfree\ngashimov\nhermawan\nkores\ntarapaca\nbuckboard\nkoothrappali\ndoz\nzonn\nvcat\nmillercoors\nsaqqaq\nsnri\ncalida\nkalanianaole\nmellody\naneurism\nmercieca\nspello\nrosoboronexport\nmalkowski\nshaffner\nqinghong\neyeopener\nkarnofsky\necgbert\nlelyveld\nmilkmaids\nhenmi\nlhi\ndyslipidemia\narwenack\ntyerman\ninnerleithen\neichenwald\nrosaly\nregrade\nconflagrations\ngiler\nnacro\nzathura\nyouview\nebk\nshehadeh\nstrida\nkargar\ngreenhoff\npheomelanin\nwellmann\ncyberathlete\nseamers\nchapped\nszirtes\npicault\nevinger\ntheofanis\nhoffert\nfinancer\nbarbours\nreiger\nshepardson\nkabamba\nmonotones\nasikainen\nhegazy\nmwinyi\nzoabi\nfoodtown\ndesparate\ncolchian\nmccrystal\nhairiness\nhardwar\ntherapeutical\nmessagelabs\nrades\nboryeong\nsuggestiveness\nencontrar\nzeese\nchavalit\nblasdell\nrubai\nrocamora\ncornbrook\ncultra\nrejectionist\nzherdev\ncrusting\nmhn\nrussos\niadt\ngotts\nshela\nmasochists\nettridge\ntchaikowsky\nciavarella\nbetto\nkiddieland\ntrilla\nbalakot\nlubell\nrandazza\nzzzzzz\nrenon\nharpole\naspland\nglushkov\ngyulafehérvár\npalimpsests\npuking\neppolito\nkinlet\nlightsquared\nmynci\nreclines\ngreenes\ncelona\ncalanus\nnowdays\ngeragos\nriggin\nparaprofessionals\npawsox\nabukar\npullet\nlorusso\nshahba\nabdoh\nstölzl\nmurni\nelsi\nsterlite\nknidos\nmuratore\nbeavon\nnorthiam\nminutest\ntöpfer\nefw\nrededicate\nmartinon\nduros\nhefeweizen\ngorgie\nrevisted\nvoiron\nlevitates\nworldskills\nsansha\npearlie\nthnks\nbayton\nravelston\nlipsyte\ndecouverte\nbeitz\ncoalpit\nvaledictorians\ncudlipp\nhazira\nbrashness\nlgw\nsaughton\nferriere\ngranjon\nlipnicki\ngendercide\nziggurats\nlolas\nidiosyncratically\nsensationalised\ncremins\ndeductibility\nsuperyachts\nbuscher\nmodularized\nzes\ntecolutla\nsalka\nsterba\nozama\nmmmbop\nasuma\nprebuilt\nmelbourn\nmahinmi\nsatisified\nambien\nrehavam\nlaywomen\nwahren\nflaum\nevart\nmaghoma\nverstegen\nundestand\nsigir\nphlomis\nbowthorpe\nscutt\nfetoprotein\nshahroudi\ncorporacion\nmone
e\nmersad\nrockeries\nstehlin\nblago\nunstained\nliquify\nscourges\ndemonstratively\nfullmoon\nsüssmayr\ndiamantes\nproductiveness\nparreno\necru\nkolinda\nclinking\nadedeji\nrehov\nbookland\ncahora\nwendelstedt\nschuetz\nbohen\nsolinas\nfrasco\nsheli\neijden\nzahraa\nredoubling\ntamr\nkentuck\nspymasters\nmadaraka\ninvergowrie\nkà\nndv\nlugovoy\nfontoura\nkanas\nbartiromo\ntransducing\nperner\nwalland\npentatone\nbénouville\nmungiki\nharberton\ninfelicities\ngovil\nglancey\nsherrer\nteranishi\nchesebrough\nments\nafewerki\nruley\nmuzzleloading\njackfield\nsöze\nlayin\nzlobin\nhypoglycaemia\nbridgeford\nwordier\nrecommitted\nsightly\nultrafine\nfirkins\ngartree\nadvanta\nkuntner\nrasselas\nunscrewing\nhavenstein\nyorkey\ndisinterment\nawar\nnizamani\npuskar\nkuangwei\nsubparts\nsoulfulness\nsketchers\njaunts\nvahedi\nlaris\ngayla\nvilakazi\ndessaix\nbanisters\nayda\nwreake\nhadra\nwestvleteren\nschneiderlin\nlougher\ngiardiasis\nannmarie\nkakuma\nbellowed\nunncessary\nunquantifiable\npelmorex\nmayar\nfeemster\ngolddiggers\nflunky\nsouthmoor\nsikandra\nhennesey\nrevett\nnakedly\nquasthoff\nmurcott\nreinjured\nhindson\nmolan\nlusia\nmogambo\nstope\nshene\nwipp\nbrylcreem\npatroling\nchauviré\ngermanotta\nhawksbills\nadee\nliposomal\nsorrowfully\ntankless\nzarka\npremix\nlavishing\nguangyi\nzeira\nplantlets\ncooperator\nrecoverability\nwitherell\nfusionist\nsalgar\npharyngula\ngilreath\nshallop\nsteelmen\nobsessiveness\nlettable\naughnacloy\nalread\nkishikawa\ninfernos\ndisasterous\nonychomycosis\njorie\ndiarmid\nnurminen\nfakhry\nmohegans\ncatsup\nmacmullen\nmakerbot\nchukchansi\nmasquerader\ncharabanc\ndiffidence\ndiaboliques\nndez\nstillie\nlangenbrunner\nstavoren\nparaskevas\nwenyu\ncalp\nobraztsova\nsuperdrive\nfleitz\nstoick\nduey\npianeta\nosbeck\nponsa\ntoiletry\neshaq\ntogiak\nrereleases\netorofu\nokino\nintellij\njath\nisolda\nandratx\ntertulia\nbradenham\nukas\nmediterraneans\nhaathi\nkilmun\ntatia\nsenft\nbucyk\ndeindustrialisation\nbigman\ndehumidifiers\ns
corsone\nmakkonen\nturnhouse\nkuhak\nberghoff\ngashed\npallini\nschlosberg\nbarlaston\nfreephone\nhentgen\nconcurrences\nluner\nfiduciaries\nasphyxiate\nhumera\npauperism\naubuchon\naquos\nwalpin\ncosmetologist\nsounes\nremanding\nlonza\nunmindful\nlopen\nicef\npogatetz\nfavalli\ngliosis\noversimplifies\ndeardorff\naccomodated\nrokach\nbussmann\ngreeno\nhamat\nnewbrough\nvalorie\nzigmund\ntomiya\ncadaverous\nbedwyr\ndumpsite\nwebbers\nbaccalaureat\nhellings\nkerbside\ninvitrogen\nbriskman\nhogans\naneth\ntransmittable\nzelnick\nbutyrskaya\nbpu\ncorna\ndmitriyevsky\nmakarevich\napma\nvuckovic\nkampe\nboukman\nbreakdancer\nbondra\nfraise\nparazynski\naberdovey\nwatersons\nsightline\nbedgebury\nmaddest\nwastewaters\nsampit\npingjiang\nslake\nbtus\nriddel\nhandfasting\njungleland\nfilipowicz\nscoon\nzhihua\nrapino\nasaoka\nadulatory\nmaselli\ntaimoor\nrehydrate\nconcreting\nafanasiev\nuani\nqingyun\nmyners\nruijin\ngideons\nzilliacus\naggrandize\nlevamisole\nwaltersdorf\nbriarcliffe\nworkes\nmilkwood\nwindproof\nfwy\nlobell\nneaga\nvinall\ntrapps\nbrainbox\nelvie\nraiko\nraevsky\nxlendi\nsilvo\npacemen\ntacurong\nwalewska\ncaorle\nchristenberry\ndongya\nagey\nliveline\nconceptwave\nbrucan\nguowei\nhartwich\nmayoralties\ndyak\nkecap\npoptones\nphenylbutazone\nmehrangarh\nschexnayder\nworboys\nniyi\nsubalterns\nvallauris\nbanorte\nmcglothlin\nstasiuk\nlvd\nquandaries\nthromboplastin\ngasco\nperroud\nzhijie\naerotrain\nshopfronts\nmutualisms\npierse\nfundraised\nverison\nkozai\nsmic\nsaabye\nstockmarket\ninseminate\nspurdog\nweichel\ngourmets\namphioxus\ntulkarem\nbbrc\ngholston\nsoufiane\nmegakaryocytes\nrubem\ncorrada\nkgi\nattachable\nsmooch\nfulghum\ntrepanning\nwickedest\nshincliffe\ncardamon\nglyder\ngarendon\nsingsong\nfoxhunter\nduquet\ncoattail\nrushe\nbelisa\nsubnotebook\nzantzinger\nbastyr\ndelegitimization\narchigram\nchacewater\nthek\ntemplepatrick\nmackenroth\nsmythson\ndxa\ntaquet\nblunter\nstanka\nlinthouse\nkhutba\nradulescu\nmoyar\nmuszaphar\nparanjape\nf
orstchen\nwloclawek\nashlynn\nmedine\ncowriter\nbumiller\nknopper\nkondopoga\nstainback\nboerner\nvierra\ndemonologist\narnol\njoleen\ndefectives\ngraffitied\ncubbage\nlelis\naircrewman\nugali\nabengourou\nmousie\nmww\ngettleman\nmonetta\ngraine\nmenez\njetée\nnotwist\nurenco\ncomplexing\ninwardness\nzussman\npassably\nhafencity\narmado\nanantara\nnunberg\ncarno\nshenzen\nmonetarists\njahid\njetro\nwhined\nterceiro\naleen\nderegistration\nkangnam\narmenta\nwordsmithing\naunties\nmulgan\nfanmi\ngreenhornes\nlescano\ngriefing\ngtos\ntgb\nvilda\nkornbluh\nterracottas\nroho\nmorwenstow\nhübbe\ntuckman\nthimpu\nstepps\nvietnams\npiercers\nkhoza\ndubcek\npaulescu\nnetherley\nlarg\ntjapaltjarri\nbrownes\nflorianopolis\nunderdrawing\ndemacio\nrobdal\nmicrocassette\nkatsuaki\nnebulas\nblaupunkt\nchippendales\ndromedaries\nwaaaaay\ncardine\nsouillac\ndattani\nchurchwell\nvillone\neow\nzabransky\nsendhil\nfreshened\nlcss\nedey\nsabaya\namrozi\ncmss\nsquelching\ntittel\nuddhav\njayaweera\nriesener\ncorrey\nchelton\nsuncruz\nutai\naltium\namirite\nbackbreaking\nohama\nhiyo\namarjeet\njindra\npassyunk\nlaquidara\nengleheart\nteimour\nshanmuga\ncarthew\nnhatky\nkinchla\nparlotones\nmolinia\ncreditanstalt\nkinfauns\nbuspirone\nthena\noverstressed\nyoungberg\nfjc\ntreiber\ndouthwaite\nclandestino\nellens\nishimatsu\nappetitive\nwongsawat\ndownpayment\ncosten\nwainhouse\ncahills\nsunderlands\nbiabiany\nlattanzio\nrigell\ncinelu\ndalis\nbabushka\nnethercott\nlorius\nkasprzyk\nletten\npandigital\nobodo\npearlescent\njaison\nferreiras\nsicken\nchatan\nunpleasing\ntomasulo\ncags\nisae\nphillipine\nragozin\nhotwired\nidria\nlovebug\nbundt\nghanbari\ngeographics\nbergomi\nlimberlost\nmutz\noglebay\nnecesarily\ndependance\npirovano\ninternasional\nliposarcoma\nrutshuru\nπr\ngeorgij\nwnv\nwoolhouse\ntimsbury\nkoach\nexplosiveness\nisho\nmayanja\ncasquets\nscarily\nlawerence\npilibaitis\ncardiovasc\ncinelli\nostaig\neaga\nnccic\nevanton\nbomani\nrumore\ncraughwell\nthroneberry\ncoxes\ndustbin
s\ninterrelatedness\nsoftphone\ndursunbey\nfleurier\njonno\nsuperimposes\nkawin\ncomedically\nsubcontracts\njtrs\nozimek\ncignetti\ncorsetti\ndziemianowicz\nkolby\nlorens\nginowan\nmichalowski\nasbel\nhemophiliac\nkrant\nmultisite\nvenora\nisixhosa\nmackler\nsparsholt\nargenti\nexempla\nrwi\npedram\nmossadeq\nthatto\nccne\nlalchand\nnihill\nferrill\ntrendsetting\nhoffecker\nxiaoguang\nvims\nyve\nliow\ntemmink\ncalt\ntwinbrook\nyahiro\nheylia\npadberg\ntumb\nkypros\nabdelilah\ntharman\nhungar\nwellock\nmanassa\nsmerconish\nweich\nexclusivist\ngusau\ndewinter\nibexes\nsragow\ndriers\nwhitcher\nnorley\nwenge\nleisa\nsafrole\nimmortalizing\nhafta\nmilongas\nkoenigswald\nfedorowicz\njermy\nserviss\njimmo\nkaiseki\njolokia\nbackbiting\nschiltz\ngaywood\ndoulting\nkiga\narmwood\nhardacre\naiders\nmacmullan\njauncey\noblongs\nmythmaking\ngranath\nglaswegians\nsck\nhepcidin\nstruss\nndk\nnymphet\nreince\neilis\nrogie\ncrouzet\nkog\netkin\nsprayberry\ntaormino\nsanco\nradiochemical\nsqi\ngislason\natieno\ngridlocked\nvasanti\ntoxocara\nstiffest\nrozendaal\nbroodstock\nheuheu\nmicklethwait\nruza\nloranger\nnonexclusive\nippr\nargoed\nsyndrom\nsturdevant\npurlie\nshinfield\nslocumb\nmenorahs\ngostiny\npixi\ndisabused\nzaib\njakie\nzipzer\nnrega\nfafa\nfayt\nbeachcombing\niconically\nmyeongdong\ndolling\nwce\nthebaud\ndianetic\nshaiba\nacademicism\nleucistic\nbehaviorial\ngoldfine\nzeynab\nsilvennoinen\npiggle\nlubovitch\neddins\nairlocks\npreschooler\nrany\nagglomerates\nfolta\npetralli\nvideotron\nsepi\nshvets\nprebiotics\ndrizzled\nadoptable\nobos\nprofilin\nkerrs\nbiang\nreclamations\nkemna\ndbk\nardeatine\ndevilbiss\nehrler\nsamaan\ntryweryn\nbaisley\npreffered\nreenen\nbaddrol\ndepue\ntheodolites\nodoratus\nincandescents\ncrangle\nliebster\nroué\nodni\ngeoscientist\nvillares\nsheinberg\nmendeley\nxdcam\nbusek\nkalua\nshimao\nthurland\nuncorked\nsandero\nbarkun\nghione\nwintergarden\nmycoskie\njbm\niasp\nboula\nzelienople\nspeechwriting\ndettmann\nmanipulable\noncol\nvrg\na
rugula\nsalaheddine\nkgr\nndma\nbedwellty\nboseong\nhatbox\nnly\ncuvilliés\nschwengel\ndinnerladies\nreccommend\ntareque\njindrich\njamul\nourso\nquyen\nalican\nscotsport\nbürki\njilt\nindescribably\ncaracoles\nshanmuganathan\nsowe\nathwart\nharrick\nahmat\nfloaty\nseelos\noluwole\nshrimper\nmithen\npenrhiwceiber\nchiseling\nmahale\ndisapointed\ncopings\ndishy\negerszegi\nrezvani\ntsogo\nargentario\ndispassion\nlaverde\nayd\nbroadstreet\ndelev\nwafb\ncherubic\nsader\nibec\nbezanson\nwassailing\ngrishuk\nwedin\ncision\ngrimaldis\nszczepaniak\nepitomizing\nkest\nleaseholds\ntuenti\nrubberised\nflexx\ntrawden\nwehle\nkirka\nmashiko\nbheinn\ndispossessing\npatrushev\npratfalls\napostolou\nsweta\nnewhey\nwyms\nmicrosites\nnarcoleptic\nnahla\nsupari\nforkball\nnarus\nodle\nmackowiak\nzurcher\ndunstanburgh\ntetri\ngrammatology\ncolorism\ncastlebay\ncmte\nnoninfectious\nroitfeld\nbiches\nwittelsbachs\nsoltam\nbuckram\ncareerism\nburnouts\ncurrahee\nvasara\nuccs\npoller\nskokholm\nturbaned\nfirnas\noxberry\npensionable\nshirlee\ndaoists\nmilbrett\nlilys\ncodford\nrubiales\nnmci\nhoglund\nkartell\nrensselaerville\njuridically\nfilipetti\nalw\ntrailered\ntelem\npathbreaking\nmezze\nportswood\ndeecke\nmethylhexanamine\nenrc\ntcpa\ntarell\nvasectomies\naplus\nyenagoa\neij\ntsultrim\nlerida\nasrc\ncandybar\nmarista\nstobbart\nbuchheit\nbioengineered\npacifiers\nhaapala\nmackeson\nelwy\nmoelfre\ngrymalska\nziman\nnishiguchi\nnakumatt\nlaraque\nyertle\nkanju\nwoodmore\nplanetout\nfelch\nbiomimetics\ndrypool\nxerces\nroley\nhershkowitz\nokasha\nmizzy\nmagnetopause\ndoublings\ntardigrade\nbeauticians\npostiglione\nzipfel\najmera\nlubricity\nmechtilde\nquadruplet\nlaughin\nkyunggi\nvaldano\nnauruans\nzaydan\nspri\nhominoid\nrahardjo\nreydon\ncincy\ninterlachen\nauman\nsarne\nyankers\nroomer\nhallmarked\nnakhoda\naprender\nhuysegems\ntetraplegic\nspeedmaster\nyima\nmimos\ngouty\nkeitt\nmicrometeorite\ntarquinio\nrefluxing\nfuimaono\nabergel\nsaxondale\nmallinder\nnadda\nbatanga\nosor\n
icdc\njillson\nsrikant\nsubbotin\nkisspeptin\nmatteau\nlinson\ndelme\nfelson\namcs\npayscale\nncra\nsteamin\noberholzer\nfutaleufú\nsagaponack\ntravelmate\nabair\nmacconnell\nshong\nkolaghat\nweltner\nhummers\njonz\nnordlinger\nupstaging\nfacebreaker\nembakasi\nvarel\nprea\nhoegh\nmichler\nglossopteris\nudrea\nsevil\nwlib\nmckown\nmismatching\ncavewoman\njaggy\nyamagami\nsigge\ndeko\noutboards\nbrancacci\nbashkiria\njiwei\nscarratt\ncebuanos\nurena\nkuttab\nrupie\nrickroll\nsalhab\ncavaday\nrolvaag\nnaoupu\nsquabbled\nnahalal\nfractionating\naldinger\noverlake\nhoverspeed\npamby\nweschler\nklove\nhirings\nindosat\nbianculli\nmaualuga\nghiaurov\ndambach\nfleagle\ncogman\nnacido\nporlamar\ndijana\nsavored\nmammone\nalving\nproferred\nesquerda\nbespeaks\nceltel\ndnevni\nalpinvest\nbclc\nzwerin\nmilley\npisey\npartos\njerre\nsousaphones\nfaythe\nreichenowi\ndemonise\ninterpenetrating\nguangya\nsunitinib\nmetoprolol\npenhale\nretranslated\npract\nlactulose\ndwn\nhayler\npetito\nbreneman\nbalers\npreppie\nlovelle\nwhitting\nbrakemen\nhasp\nwisbey\nolinger\nmarielena\narjo\nmacramé\nparwez\nchameau\nspätzle\nfainaru\nloftiest\ncius\nbogdanor\nneustar\nberau\nfleder\nwhitethorn\nbendlerblock\nlessac\ncaran\nlenbachhaus\nbeihong\ncogliano\npalombi\ndisdainfully\nhaycox\npopayan\nlaviana\nnyathi\noverflown\ncanovas\nblough\ninverary\nyiping\nrabab\ntakanami\nbungles\npantaleón\nsupersemar\nimrei\nastringency\ncariocas\ntirtzu\ninukshuk\nveridian\nschweikart\nansin\nturbie\nterryville\nfrado\nangarita\neichengreen\ncanai\npassionflower\nthiet\ngasbags\njutkiewicz\nhelmar\njoseline\nivany\ngudermes\naright\nscalpay\nellert\nlacerating\nfokina\nrobicheaux\ncxl\nstoreman\nshenkman\nthomassin\nbastardi\nibekwe\norosi\npermanents\nantidemocratic\nneshek\nbentzen\nokrika\nstridency\npiscitelli\nhelis\npankratov\ntoniolo\noline\nscheunemann\nedgemoor\npinola\nreemerges\ncontainerisation\ntriwizard\nsurveyusa\ninchkeith\nmungai\nreseal\ncounterdrug\ngaugler\ncaffé\nkashkashian\nence\n
mantou\ndesensitize\nacctually\naltmeyer\nnewitz\nanosike\nntuli\ndallek\nirsyad\nstahlberg\nfahan\ndalgaard\nrabu\nigate\nradloff\nrhodey\neyestripe\nhoratian\nbankrolling\nbreman\nimmobilise\nneedled\nmccullin\nsinas\nchademo\ndormans\nnovellist\ndoraiswamy\nshahana\nexperimenta\nsniffers\nshafiul\nmaslowski\npridie\nregularise\nleekpai\ncanobbio\ntianyou\nmitsugu\njournolist\nkalaya\neyad\nidy\nluchsinger\nmaiken\nsensitiveness\nmokaba\nmalvidin\nporté\nsteelmakers\npaulistas\njili\ncataratas\ndepigmentation\npunctuates\nandamanica\ntecta\nguibord\nprinciplists\ncoiners\ntransparence\nplack\ndarkies\npavelec\nmaraval\nsivori\nedite\nhechinger\nrunnemede\nhirtle\nrecolonize\nexs\npliyev\nbreamore\nmacierewicz\nrehydrated\ncrecente\nbigwood\npinkins\nacsc\ndoonbeg\nlitvin\nanthelme\nconseillers\nreestablishes\nkapadze\nectopia\ndeterming\nxiaoying\noperationalization\nshlomit\nbignone\nwymer\nsuchan\nsiyam\nvaryingly\nquiron\nfiggs\ndeene\nbacigalupi\noswalds\ncroisette\ndalwood\ngunmaker\nosita\nchaddesden\nterentyev\nbagheria\nejogo\nfallsview\nbordellos\npianism\nconsignations\ndecaf\nccfl\nvacheron\nzaba\nfreezone\nslowey\nrobshaw\ngrinkov\nimray\npompiers\nalchemic\nkanbara\ntritle\npeppi\nhightide\nforelli\nsportswriting\nisraelian\ncharbonnel\nkylián\nlanhydrock\nchilbolton\nnipe\nwellesz\nmemeber\nkerbel\norache\nmarbut\ndouville\nbreugel\nvatopedi\nretana\nreconvert\nsaprykin\nkief\nhaval\ngeminis\ncrouton\nscharner\nmetallers\nkornegay\nworldham\nidma\nzubeidi\ndsge\nkarlene\nborota\nbasker\nkangleipak\nintercuts\ncarrilho\ndemineralization\nsantina\nhuguenin\ncoverly\nbelacqua\ntchen\npackhorses\nrajapaksha\nfunked\njoz\nnamuth\nreignites\nraska\nafforested\nfakhravar\nmaharastra\nbrimson\nxmc\npinegar\nantioco\nnarela\nklark\nteddies\nmadrileño\npokka\nnunsense\nlegorreta\nsyle\noxalates\nqumi\nhaner\nheresay\ndeqing\npilati\nvaus\nbafin\ncanigou\nzenos\ncompletive\nhauben\nshanzhai\nauditioner\ncortexiphan\nletson\npierotti\nmedjool\niguatemi\nauthorit
ive\nreucassel\nvernie\nmartock\npalaia\natarot\nkawas\ncoromoto\nmadelin\nmova\nbolme\njopson\nstacee\nmaltman\nunspecialized\nmamboundou\noutsources\nzavos\nsarni\ncensorial\ntbwa\ntaing\nkhanaqin\nshatalov\nransohoff\nvladmir\nspecula\nbuongiorno\nseronegative\nblandon\nmassin\nbassuener\nlohra\nconsuelos\nmensdorff\npergament\nmuen\nschang\nvarya\nsarraute\nruv\ncarbapenems\nflourens\nsamruk\noidium\ngashouse\nsalako\nlandingham\nrokus\nicus\nwhitepapers\nverweij\nspeakes\nehab\natrios\nanarchical\nwctv\nroiling\nkines\ntootin\nhalat\nspeediest\nungoverned\nsouray\nantivenin\nkoplowitz\nrustamov\nimpune\nlathrup\necheverry\nopon\nonnagata\nmeintjes\nbelot\nolman\ncaprea\nratajczak\nsavable\nproffering\nidelson\nweli\nvermontville\nctrs\nlongnan\nmvl\nlippett\nbobick\nabloy\nnikolski\nestreicher\nginsu\nbrutto\ntargetable\nnovator\nresona\ncomahue\ninishmaan\nascs\nsuperabundance\nequivalencies\nruweisat\nshrooms\nthrombi\narchiteuthis\nzaruba\ndinho\npokernews\ntzvetan\ntumin\nturlin\nexcoriating\nmolefe\nscriveners\nackermans\nlogans\ncrpc\nreefers\niannuzzi\nbaquerizo\nwhay\nspecialis\nkingsthorpe\nhajra\negomaniacal\ngrunsfeld\nvasovagal\nswidler\nresorbed\npharo\ncosmopolite\njulier\nshuffler\nfathia\npopples\ndisneynature\nleili\njussie\nbollani\nspiezio\nclal\nscalper\npatapon\nmounded\neconômica\nwafiq\nnumi\nbelardi\nprasenjit\nsadow\ndominations\nlapps\nravera\ntorsades\ndargai\ndauger\nautopsied\ngullberg\nwaft\nglw\nestephan\nbernales\nwachmann\nmedad\norangina\naerate\nmantain\ncitied\nlovellette\nkasztner\norioli\nlicensable\nragley\nhorizontality\nhypomagnesemia\nimaz\ndkba\nbenhamou\nregistan\npolier\ngwaelod\nhensarling\nexsanguination\nmirghani\ngaravaglia\nmcnees\npropagandize\nmargiotta\nsobranie\nahds\nspoilsport\nnorridge\nveldman\nmennie\nblundy\nvainglorious\nceledón\nkelcey\nbunthorne\nolodum\nhardberger\nthate\nprecociously\ngibbering\ntwirlers\noramo\nthiselton\nmagnanti\nsombat\nzbornak\ncammaerts\nsippin\nmetamorphosing\nellipticals\n
brahem\nthrasymachus\nitalicizes\nevangelio\nlempiras\nsuccinylcholine\nsemimonthly\nventriloquists\nhelenio\ndcfs\nixnay\ndjamila\nweimaraner\nfotini\nleasure\nshuford\nheatwaves\nfocolare\nsambil\nkessock\ncoomber\nhanikra\ndiegan\nthumba\nbowmaker\nroelf\nmorrey\nnocentelli\npankey\nmazzuca\nsandvig\ntatshenshini\nevraz\natrix\nsoyka\nebsary\nmalino\nedito\nkhanjian\nkraljevica\nnapravnik\nwintery\ncannas\nwycheck\nlabradoodle\nschuco\nindelicate\nmhlanga\nwickert\ndustbowl\nalderbrook\nprydie\ninebriate\nstudbooks\nreuland\nladwig\nshinsegae\ndarwitz\nshanower\ncopthall\nnametag\npyfrom\ndomoic\nlabrocca\nfamiliarising\nhighbush\ncherryvale\nriegger\namdocs\nhalak\nrebekkah\nkrankl\ngeman\nmalyan\nsasikala\naubisque\nkyauk\ncza\nbaldoni\naerostructures\nbunted\nduplantis\ntrousered\ninterserve\nivd\nmeiwes\nwycliff\njolle\nmixu\nbrms\nlalic\nruggerio\npapercuts\ncordura\nkaplowitz\nrendina\nmistley\nwockhardt\ncawsand\nplutocratic\nboxfish\nuko\nkarelis\nbacklights\nbevs\nkatoh\npacom\npraktica\ntorode\ncatsimatidis\nbareheaded\ngoulder\ncommunityamerica\nschalcken\namoudi\nkhaemwaset\ncapanna\nnafe\npostales\nchrismukkah\ncharlayne\nwilgus\nrunneth\notman\nmitic\ncardiorespiratory\ntimko\ncurreri\naliadière\neraclea\nhieronymi\nphonecall\npekao\nsichem\nmelchester\ngosht\nmorever\ngeissman\nandalucian\nocé\ntwd\npitsford\nmettawee\nvarosha\nsuperheat\nchernikov\nrylander\nrabson\nkamarudin\nfactious\nschrieffer\narresters\ntrigorin\nsteinbauer\nheatsinks\nsplodge\nuneca\nnakamatsu\ndestremau\nacklins\ndesmodus\nkawika\ncasselberry\nabéché\nnyad\ngabbiano\nbilyk\ntheatric\nhimmelreich\nuhse\nsaloniki\nboppers\njoorabchian\nmrj\nmaybrook\nholocausts\ngetaria\niaconelli\nkesselman\nvenditte\nbobbleheads\nspooners\nelkader\nbiskind\nnobilo\nrafel\nsensitizer\ntimebase\ndomracheva\nechavarria\npilferage\nreilley\nintersexual\nwadl\ntincknell\ndheri\nlinse\nppcs\nmushkil\nponchartrain\nparaprofessional\nirresolvable\nbarez\nyars\ndiedre\nisaa\nsofyan\nlazebnik\nbatra
chochytrium\nantigona\naveda\ntatas\nintrawest\nreiserfs\nriady\naaqib\nbarnburners\nsypher\ntoners\npitesti\nfarhana\npilanesberg\nolatunde\nhedsor\nmiked\nwaihopai\ncobourne\ntrevathan\nkelderman\nkltv\ngines\nhackforth\nwsls\nhouri\nbankrate\nkhdeir\ndesalinization\nmilanesa\nwilborn\nunpractical\nfujikura\nvavau\ngandhar\nuemoa\nbennewitz\nhardley\nwishman\ndaboll\nmarjon\norefice\ndadford\nthik\nzeder\npallion\naugst\nguman\nvigouroux\nkanev\nreoccurs\nkidpower\nposas\nwichard\nmasoe\nschmelz\ngader\nhmn\nhmph\nmoxi\nswinnen\nkeersmaeker\nupending\nukti\ngeesthacht\notterbourne\nempaneled\ndcaa\nsupersized\nlahmar\nlightsail\nlipica\namericanised\nfelindre\ngarnishing\ngiarrusso\nclennon\nmutrux\nbadiane\nhorris\nvindice\nmaní\ndeolali\nschundler\nminquan\nlessin\ngraziadio\nklinkhammer\nnuyens\nbarú\ntollund\nciccolini\nsangen\nocchipinti\nsitars\nwebre\nvitrectomy\namethysts\nthackston\nlohri\nballardini\nmcgeeney\ntelecare\nhusting\naccotink\ncedia\nlamblia\nchittering\nssns\nriach\nliveris\nnormandale\nbaloi\nefface\necfr\njejuni\nstrs\nmargarette\nnembo\nlameck\nsetoguchi\nirglová\nhetzer\nastellas\nrumps\nschar\nradel\nlakoba\neoe\nfraiche\ntessalit\napigenin\noldmeldrum\nampt\nroutray\nladwp\ncheapens\nlonghope\ncounterbalances\nmenkin\nlize\neliasberg\nkashag\nwalsch\nsharmaine\nroraback\nyesteryears\nmarean\ntuulikki\nfuerst\nbaggini\nzhucheng\nleintwardine\naeromagnetic\nlipshitz\njurman\nsoapland\ntisci\nsystembolaget\nitin\nassistent\nbioassays\nmillivolts\ncomalapa\nmicromanaging\nmingchao\nborloo\nsscc\nsafai\nlamott\néminence\nlaubrock\nduffner\nsuchinda\nabade\nollivierre\npenalva\nflaim\nhumaira\nirrelavent\nboneham\noutfoxed\ndelabole\nhoundsditch\nidrac\nmurro\nstrobing\nimagistic\ndalessandro\nwolfley\nneurogenetics\nmyelomonocytic\ncabrito\nadls\njihads\nmangudadatu\nteamers\nfidra\novergeneralization\njoique\nuncrewed\nlifejacket\nllaman\nkyron\ncrèches\nreissig\nynysybwl\naspers\nkoenders\nprescriptively\nshiley\nsnorts\nwoolhampton\ndook
\nponnelle\najm\npseudoscorpions\naldersley\ndjankov\ncohosh\nhultin\ncornrows\nmesaieed\nroccaforte\nprecourt\nnegoro\nbeldame\nwarramunga\nepcc\nstangeland\nbolotin\nabets\nsuicidality\nmicrosystem\nfalchi\ndownshifting\nslea\npagos\nkerplunk\nswarf\nkolata\nchri\nseawards\nplainest\ndiplome\nnusseibeh\nheronry\nnsta\nzishan\nwyndam\njurnee\nmeah\nmeadors\nvadims\nmaddren\nhaversack\nbamidele\nhahm\nchanos\nblackcaps\ncagni\nwinnowed\nblaak\ngilette\nabdelhadi\nbikfaya\ndiametric\nsoundproofed\nzonker\norganizaciones\ntexter\nbwt\njerebko\naborn\nfloetry\nsensitised\nhindlip\nbadmouth\nplooy\nmasci\nallopregnanolone\nmascaro\nsulloway\nsww\naristolochic\ndilettantes\nrowdiness\nmixologist\nfeijoada\nanacapri\ncasby\nhacktivism\nbauke\npussies\nsoussan\ncelmins\nmarzocco\nmrcc\ngamero\nitchycoo\nstoles\nwrigglesworth\npayaso\npeacocking\nruocco\npromethazine\nrlh\nadelberg\nmurcutt\nrevanchist\nmestral\nrhosllannerchrugog\ntrainloads\nayso\nyasbeck\nfaizul\noberpfaffenhofen\nanseong\noakworth\nscis\nsidelong\nadeje\ncolback\nkarti\nsvec\nupbringings\namsouth\nonj\nfujitani\nwsbt\ncorie\nconnexus\ndepartements\nehrs\nalhurra\ncloudscape\nhuxleyi\ngaravito\nrefloating\nprospera\nretweet\nhepler\nfabela\nguoqing\nmytown\nvoluntariness\notx\nkillefer\nmcelwaine\nfridrik\nlawnswood\ngsxr\npetcare\nkval\nristow\nbustles\nwizzy\ncushnie\ndarche\nsavov\nrangiroa\napparatchik\nstraggly\nhoerni\nruak\nkotin\nfirenza\nriyo\ncalicivirus\njousse\neurosystem\ndimuro\nmindshare\nmonopsony\nricciotti\nwkmg\ngranoff\nchipps\nraim\nwähler\nsylvette\nkempin\nduwayne\nleick\npollino\ngarfein\nstarikov\ndowds\nswissotel\nchairmans\njelsa\nstael\ntomatometer\naubergines\npartakers\nmugwumps\nborhan\nmjj\nshiping\nmohannad\nindividualize\napgujeong\nvirtualtourist\nairbaltic\nmvi\nsunseeker\ntachograph\nlillibridge\nandric\nhuaiwen\nruku\nbahcall\nhellmer\nshital\nlumme\nyelkouan\nconvergys\npettine\nnimai\nmirch\nencored\ncraniopagus\nmanseng\nkincora\nbloodworms\nbrohi\neareckson\nwnuk
\nnorrman\nalkartasuna\nanuses\ngainers\nchalupny\ntazza\nleatherwork\nmughniyeh\ndukurs\nmazagon\npccc\ntogan\nkligerman\nfoshee\nphillippa\namac\nrotaries\nhvorostovsky\nauvs\nhattestad\nclaybrook\nbecomming\nmonsour\nweaponize\ncalcs\nyouki\nportend\neys\nthoe\nwandell\ngracefulness\ninmobiliaria\ncollagist\nviceversa\napian\ngiovannoni\nzevulun\nmagnifiers\nshellard\ndisapprovingly\nlimnos\nluebeck\npummels\nioda\nsuenaga\notterness\nikenberry\nlinbury\nkolakowski\nhagenauer\nmarchiano\nconventionalism\nthst\nolarte\nijv\ndemobilizing\njingsheng\nenoh\npajhwok\nlansburgh\nlowick\nballiett\nmementoes\nprudes\nsecher\neijiro\nschuss\ninglett\ntuvaluans\nrubie\nadditon\nberkswell\nroaders\nchopwell\n,that\ndosunmu\nnordlys\nreformatories\nchopo\nmodish\nsherrell\ncluses\nchicky\nhamoudi\nparx\ntrumpkin\ndeify\npolachek\nbabbino\nclubmen\nsmilla\naganist\nkordestan\nbelches\ngreaseman\nfolkard\nmaneuverings\numenyiora\nsuperfluids\ntobel\nbelland\ndxo\nmatrixes\nparioli\nunredacted\ngorie\nrault\ntocci\ncraske\ndemre\nrieff\nmadhouses\nxico\ninattentiveness\ntenuto\npillager\ntweedbank\ndelderfield\nnoirish\nbassy\nbarlborough\npretentions\nheintzman\ngrazioli\ndrx\nbreadsall\ninorganics\nkabukicho\nstewarding\nsharkboy\nhargan\ntrixter\naesthetician\nantolini\nharinath\nkroemer\nphranc\ngroeschel\nhotbird\nworgan\nshikumen\niphigenie\ncadiou\nmystifies\nbacnet\nalian\nsawer\nwanek\ngroveling\nducote\nbontecou\ngafoor\nbruchac\nusfda\npersistency\nrusesabagina\nmahouts\nseersucker\nintrinsics\nsamini\ndissapear\ncafs\nconspiracists\nsoboba\nunwarrented\nglitterati\nanexo\nsfumato\nfrigidarium\nghanashyam\nkeyzer\nseff\ngadomski\nchicco\ndeliberates\nceco\nwers\nearith\nspecchi\nleitchfield\ncerralvo\ncallar\nwestlawn\nhaukeland\nenghelab\ntoroweap\nactioning\nwoodfin\nmaramba\norchiectomy\nfletc\nstaus\ndijkgraaf\ninterrelate\nkulina\nhenchwoman\newloe\nbolívares\nkesi\njournaled\njosephe\ndystopias\nunshaded\ncooped\nmehigan\nhypothesises\ndunalley\nwamala\ncastan\
nlelli\nstewartville\nshaggs\nspradley\nhanaway\ncarpena\nnras\nbouli\nwudunn\nsellinger\ndusko\njbara\neliav\nfreerunning\nmutuelle\npuggy\ntuymans\nshavlik\nfizer\nviendra\nkrukow\nshoyu\ngeoffrensis\nschiemann\nvytas\npalest\nbuttiglione\nstompanato\nmanhandling\ncordingly\narizonans\nmank\nsussed\nyuja\ninkblots\nangsana\nstarn\nghettoized\nvezo\nvladika\nlegatees\ncrazyhorse\nkogod\nkamalabadi\nmazouz\ndassow\nturckheim\ngardere\npackrat\nllanvihangel\nbloomingburg\nbiazon\nbouge\npinette\nbaymen\nfiguera\nsproxton\nprivilages\nwafted\nhindfoot\ncomon\ndemurs\nmaksoud\nponsot\npiehl\nnoonu\nnaziism\nsacrococcygeal\npolytomy\nbascially\nmaroš\nzaccaro\ndigipen\nhowgate\nmisstates\nrebun\ngeneracion\nharmlessness\netfe\nbilborough\ncostis\nhirosue\ntriptans\npolovtsian\nwesternisation\nfolarin\ndazzy\nfaton\nperenchio\nemran\nhandbasket\nmittelstand\nsuhler\ndelvoye\nchast\nshiancoe\ngainesway\njackfish\ntoolboxes\nbunzl\npureness\nchlorogenic\nwalpurga\nrazmak\nbaghban\nfilezilla\naldecoa\nfulin\nsapcote\nexsultet\nbledlow\nabusiveness\nrlx\njlr\nsulfuryl\nbittercress\neassie\nflamant\nconda\nmerica\nbrutale\nhasnat\ntoine\ndahms\nfivemiletown\ngolondrinas\nkld\ncheckpost\norganoleptic\nsymbion\nplanetoids\nbusst\nboyang\ncamira\nbasescu\nwoudenberg\ncongruous\nchouchou\ncvetkovic\nsegu\nmeiser\nshigeto\nchebet\ncarabello\nfernet\nlambertz\nrackheath\nbibliometric\npidcock\npaghman\nhitcham\nventresca\nhatchell\nsafm\nterron\npenzer\nordinariness\nvicarages\nmasnadieri\nfroemming\nisserman\nconsolidator\nakano\nunedo\nsexwale\nthorsons\nklimke\nunltd\ndegn\ndgl\nsuellen\nsidetracking\nkashem\nbuttonwillow\nharbourview\nnyungwe\nallbright\nnydam\nrastafarianism\nblockbusting\naswa\nfreshwaters\ndallon\namoebiasis\nherzenberg\nlivingroom\nwolsingham\nbanavie\nmalielegaoi\nboshier\npoire\nwhoville\niwakuma\nintellipedia\nhdfs\nloyals\norrock\nstrangulated\ninshallah\nmacal\nwhther\ntrentin\nzhongwei\nkwwl\nkhwai\noxmoor\nvelux\nexobiology\ncintia\ncrumpling\nhelpri
n\narren\npajitnov\nbohunice\nusally\nrenunciations\nnoirmont\noio\njonkers\ncarabiniere\nbrek\ncepu\nlandesberg\nsiroco\nkleitman\ngollin\nniwano\nprejudgment\nlowdham\ndilkes\nantifolk\nbehrang\nbaggaley\nollman\ntourish\nbirdseed\nsidecut\nomalu\nsiefert\nformanek\nhashid\nhaad\ncsrs\noae\nclippy\ncollarless\njaroff\nogling\nchryslers\nschwenke\nfayne\nchaak\nmunib\nbkt\nbrockholes\ndöpfner\nmondschein\nfratto\nbrignoles\nroffman\nmindfreedom\npachouri\nmishchenko\nrushey\ngeorgeta\ndobtcheff\nbaldia\ncablelabs\nwahle\nkenana\npriddle\ndiscoverability\nalbigensians\nseculars\nremmert\nostomy\nrothenbaum\nhogansville\ncarven\narced\nsportscast\nmanipuris\nunconvincingly\nthouroughly\nkitz\nholstered\nfêted\nganglioside\njll\nbecta\nhadspen\nkornacki\ntaniela\nwintel\ntugade\nshili\nconferees\nplopped\nlcra\nstolk\nsodha\narslanian\ndevilfish\nbravas\nmiedler\nlazaroff\nsaim\nmakary\nshamanist\nollé\nvails\nbastardization\nsayeda\nadfc\nbilyeu\nsopka\nthomases\nmetolius\njugoslavia\nchopan\nadja\nkorsmo\nmirvis\nlavonia\nprescod\nwestonzoyland\neagleswood\nistockphoto\normolu\nharalds\nshrieked\ninchinnan\nsayeeda\nwitsch\njorunn\nelephunk\nnorrath\nwatersport\nsnodin\nmstp\namericanairlines\nmaldwyn\nshriekers\nmyasthenic\ntgd\nennstal\nbegur\ndemko\nffas\numh\nodejayi\namrullah\nbatre\nduckham\ndisneyworld\nchanty\natopy\nsinani\nkopernikus\ncutbush\nbecaus\nhypermodern\nbiyani\nnguoi\ngainst\ndioula\nkreskin\nzamaneh\ndreamily\ndrebin\nsevim\nmza\nslabbert\nifremer\nrahima\nburbano\neugenijus\nfolker\nelsberry\npekoe\ndegale\nadubato\nfilumena\nmundie\nraelian\ntoolshed\ngiusy\noverambitious\ninternalise\nwargo\nadrastea\naffronts\ngotabhaya\ntvam\ncustomizes\nunchaste\nshlain\nlogiudice\ndelobel\nmaratea\nstuhlbarg\nparadoxum\nletterboxes\nhollerin\nstallholders\nadium\nmahmudiyah\nmarlos\nchaderton\nbodedern\nsoundwaves\nohler\nivanko\nsylphs\ndonalda\npinhoe\ncallon\nkritsky\nbahgat\nilfc\nralfe\nhullin\nkaufhaus\nmarszalek\nrinky\ncihangir\netam\nfalanga\nst
andi\nmidgette\nmalecon\nkenis\nizotov\nciganlija\nvenza\nkirkness\nlinderoth\nkreitman\ntomberlin\nlamping\nclotheslines\nzellman\nbinkowski\nmorsbach\nkabaeva\nunderarms\npetruzelli\nborgström\nautostereoscopic\nsteenis\npertinently\nkandis\ntrinkaus\nneowiz\nnesil\nscandella\nboab\ngonnet\nsinikka\nballerino\nabeele\nklezmatics\nhighpoints\nkexin\nstraubel\njacquez\nnextmedia\ndemised\nwidford\ndabbagh\ntheanine\nsleepovers\njohannesberg\nasay\nsequesters\nmoralez\ncecal\nranck\nglidepath\nhelixes\nbervie\nbialowieza\ntshisekedi\nflatulent\nsluman\nwarsteiner\ndebauche\nhamady\ntorrontés\nyangshuo\nclts\nbryants\nkertzer\nmiyasaka\nhandscroll\nreformational\njye\nsilsby\nsabkha\nrafati\nrégimes\nschorn\nföhn\ngyeltsen\nbuice\nhootkins\nweisenburger\nzabludowicz\npeke\ndannen\nemira\ncccu\nmorín\nrigourous\nboulahrouz\nmysa\nglanders\nromanticists\nsmartmoney\nlovefool\nballiet\ntrease\nglenfarclas\nsugartown\nniane\nnahalat\nmurison\ndarpariaethau\ncederschiöld\njohnsrud\nharlee\nhumorului\nhereon\nshuberts\nfcic\nsirkin\nkominsky\nforgues\nbolivianos\nblomgren\npaleckis\nautosuggestion\nmaani\ndaven\nrhodiola\nhepatorenal\ntatana\nitria\npalop\npillowman\ntagaq\nunfaltering\nutahns\nsuntech\nsumsion\njawbreakers\nnmdc\nmicrogeneration\ndelistings\nmyrtus\nfeuermann\nroylance\nmonache\neakes\ndonadel\nnetworkers\nkazadi\nreorg\npegel\nketaki\nshiploads\nccdev\ntraille\nsidna\nsuranga\ntippie\nfette\nardie\nwaddling\nstinkwood\ntintner\nsmokejumpers\nhoneymooning\ncomunist\nhouchin\nmeperidine\nbiaggio\nhogsback\nskutnik\ndygert\nepcor\ncrang\nauspiciousness\nkimberlites\nspringfest\nmaggard\ntabea\ngaitonde\nflickered\nunarguable\nbroadswords\nmealamu\nmischka\nrequiescat\nweaklings\nagonizingly\nnibelungs\npinizzotto\nmarcondes\nnagalingam\nmallinger\nashkar\ndagmara\nweidong\ndisapprobation\nlaju\nże\nkidby\ndinenage\naltran\nhonkin\njamesport\nloaner\nshoni\nramgopal\nobsessives\nteetzel\nkohlman\ngravidarum\nsportsnight\nartman\nywain\naand\nzzzzz\nfarmboroug
h\nmermin\naymerich\nknaster\npaoay\nbelaire\nlaufenberg\neifert\nsuschitzky\nbarrowby\nkallat\nirek\nwoltz\nwatter\nbakhtawar\nkaarel\nayanda\nhucklebuck\ngatherum\nshinmachi\nholne\ndoggerland\nfilgate\njacc\ntark\nwhitta\nmeritus\nrushbrook\nhydrox\necocide\nmisick\ntrimmel\nminiter\nbobola\narabists\nsivivatu\nkosin\ndogmatix\nlanos\nipea\nmutrie\npunishers\nplatzer\nmehan\npantaleoni\nonely\nstampfel\nunpacks\nayliffe\ntianfu\nhullah\neulis\nsovan\ntelevises\ndieldrin\naliah\ngretz\ncowherds\nsweetin\nfalstad\nlydell\nkodjoe\ndialers\nneuregulin\ncapicola\nbenjamina\ncki\nheaver\ngoliah\nkaracan\npredations\nhornes\ninscribes\nyerby\nlondis\nelledge\nlookbook\ngrindell\npoje\npactual\nkatselas\nfechteler\ngrinzane\natteridgeville\npiddling\numeki\ncooky\nkightly\nplanty\npentothal\ncoupet\nflewelling\nrechnitz\narsan\nvilseck\npatronages\nmemorisation\nlarroque\nmurless\nlamacchia\nreclosed\nhrach\nshawlands\nunbridgeable\nmyrta\nlabidi\nyangmingshan\nlell\npracht\nholtzberg\nliveleak\ndahari\nkosem\ntallil\neaps\nbeadie\njerko\nellenberg\ngrapefruits\ncharton\nwallbank\ntweakers\nedirisinghe\nvocationally\nyusufu\nquen\njatinegara\nboughey\ntullett\nomark\nlendu\neneko\ntelehouse\nettv\ncracraft\ndengie\nají\nvillumsen\nbegala\nreasonability\njihan\nthandwe\nallergist\nbeacom\nlarive\nmohanad\nisaca\nhoeing\ngure\nchandramouli\nbordley\nmolaison\nexternalization\nfontán\nschanberg\nnutrilite\ndacoity\nweart\nkavaja\nmanati\ndble\nnarramore\nrecomendation\nbrazillian\nlangguth\nspycraft\nbarreling\nkubelik\npostol\ncobell\nrompetrol\nkawakubo\nchiappa\nbrittin\nknibbs\nfrazar\nlowliest\nbonao\nooma\npeasy\ndirectoral\nlhf\nfanzone\ncalamar\nmontreaux\nvirdi\ndorks\nlamsweerde\nemadi\nbullrings\nolimb\ncareys\naflatoxins\nwilliamses\nzarine\nbottlerocket\nindomethacin\nsuppliant\nallderdice\npumpido\ncigala\nlevchin\nrachubka\ntinatin\nsaju\nvalvo\nhudis\ndalmarnock\ncubbins\ngeorgopoulos\nterina\nstoneground\ncalibra\nimpressiveness\nsoundchecks\nbalanda\npreco
nditioned\npoky\ncarthon\nprotamine\ngourevitch\ngiovanny\npoupon\nbakhita\nigli\ninspectorates\ncervino\nbunz\ntokin\ntraditionals\nvly\nadeptness\nmorcilla\nnganga\nthiaroye\njentz\nselichot\nsoppy\nkraits\nblincoe\nwaisale\nmakoun\nkecoughtan\nducommun\nzahler\nkamuela\nabridgements\ndougans\nreprésentant\nlopping\nkopan\ngottingen\ndespatching\neinfeld\ncoppertone\ndelegado\nfunches\nsuperheavyweight\ntajine\nkuusankoski\nspooled\nflaxley\ninstonians\nbeseeches\ndownwash\njauss\npeated\nfirstmerit\ndisgree\nmusulin\narblaster\nmabberley\nlukan\nmansford\nnorthend\ntouchpoints\ngulping\nbreskens\nnalley\nspastics\nsaila\nvadas\nkhalek\nnuth\nknoche\nathari\nketv\ndogcatcher\nretama\nspireites\nenigmatically\ngoldwell\nrines\nherricks\nsenekal\nbrouillette\npollione\nhairier\nchethan\nbutson\nevos\npowershift\ndiscords\nstaphorst\nhandwerker\nardleigh\nungated\nmujahadeen\njewkes\nblumstein\nyafai\nobx\nbacalao\nsexinfo\nturnesa\nkulatunga\nnowacki\npocketbooks\nunops\nconnived\nerucic\nlozowick\npbgc\nrepack\npactum\nlanciani\ndeerslayer\nottilia\ncroisade\nkrogstad\ncarulla\nmench\niava\nyoungers\nrodders\nwetz\npanderichthys\nrelabeling\nfranchiser\nmultilateration\npalookaville\nshadoe\nbraai\nmwape\nstylee\ndisinheriting\nonfield\nsaiqa\npirouettes\ngafa\nenx\nkleeberg\nnufc\nhillsman\nmicroanalysis\nalbertsen\ncolosseo\ncecs\nnicholsons\nschueller\naskerov\nbisky\ndinizio\nbogland\natenolol\nsilvy\nmushir\ncastigates\ncarteris\ntiravanija\nquarterbridge\ncyffredinol\nbouy\nbellringers\ndouc\ndainis\nhanvey\nnouman\ndaviot\nmahli\njavari\nreclusion\natai\nthrybergh\nnyuk\nmarjah\nnewsbusters\nlinpeng\nrivetted\ncraigleith\nzisa\nbedevilled\nsaletan\nsynetic\nmcalary\nmccloughan\ntasleem\nramphele\nshrouding\nthemsleves\nshekh\nbacklands\npaisan\nsagtikos\nzarzycki\ntanioka\nbeaudouin\nentryism\nvaladier\nchurro\nimison\nfolman\ngobelin\nextensiveness\noedekerk\nsmidge\nskicross\nmultispecialty\nrabbie\nrazik\nsupermicro\ndannon\nhongli\nnzrfu\nmorans\nmangku\
nlakeway\neuromonitor\nparalyses\nlurdes\nekaterine\nsimilarily\nfbx\nzakheim\nqueenfish\nxuanhua\ntoploader\ndatacasting\nmediaite\nkappen\nhechi\npotiskum\nruscoe\nprees\nsoftee\nismaeel\njavie\nbgf\nsufa\nvarella\nneels\nzooniverse\ndouar\nmarimow\nastrocytomas\npiggins\ncnnfn\nyuille\ndouillet\nkhalilullah\nmashaal\nrapel\nnicolaï\nmiddeck\ndaubney\nfagles\nhadia\ngadaffi\napeman\nmnouchkine\nbartlow\nabey\nrockstone\nbroe\ntomoo\nsmbs\nbenchtop\nargyropoulos\nfsia\npigeonholing\naviram\nayari\nbrundidge\nmorasca\nstruggler\nherrig\nmayoría\nsivertson\nrubenfeld\nthrillist\nmfume\nanw\njakin\nquesters\nstrathblane\ntieto\nyps\naiz\nschenkenberg\nkazakova\nmarfin\ndurenberger\nwarrener\ncaland\ncew\nmaruca\nsobbed\ncaravella\nmakeups\nkristos\nmadni\nobt\ngalahs\nkasasbeh\nswingler\ninstable\nedv\ncarreer\nabotu\nfidi\nbaltia\ncanners\nzadkovich\ntullibody\nbaumannii\nsimilary\nelsley\nwindrose\nslann\nboonyaratglin\ndithers\nkulusuk\nasel\nallures\ncruisecritic\nphulwari\ntrotskyites\nparakh\nlosangeles\ncomins\nfutter\nvelveeta\nmarylander\nhenein\ncatatumbo\nmicrostock\nhurtles\nmihailovich\nindividuated\nputerbaugh\neqa\nnovitski\nschnurr\neducare\nhemmat\nmuleteers\nfailaka\nmanuchar\nlomachenko\nsfjazz\npreciousness\nnarkiewicz\nhaidara\neruzione\nuncured\nshuga\nwatsa\ndelaria\ncohiba\njinhai\nruqaiya\nglassblowers\nscas\nthebaine\nkarazin\nnerja\nguity\nmitrofanov\npalila\naloke\nesmark\npechtold\nsittig\nannualised\nfiocchi\nnibbled\nwrightsman\nafridis\ndoley\nburaka\ntranzalpine\nfauldhouse\nmorreale\nepidemiol\ntransplantations\nkazanka\naddtion\ntumescence\nmoulsford\nklunk\nflairs\nincompetents\nbritnell\nobstreperous\nvaupel\npendens\nwarberg\njoing\nasael\nkaws\namobi\ncomtec\nhyndford\nghezali\nbrandie\nunflagged\nheckles\nanabella\nvyne\npodger\nduroy\nthomsonii\npolarizes\npisanu\naldemir\nhandcarts\nakuila\nimeche\noberhauser\nmasetto\nheimberg\nspinna\ndehghan\nblurton\ncharvis\nmegacolon\novercautious\ncompl\nbagnasco\niljin\nsacker\nholling
dale\nironware\npsychoville\nhebes\nglucksmann\ndlpfc\nchokin\ndrh\ndevender\nmodrow\nacedemic\nbroomfields\namee\njadrolinija\nafcc\nnewser\nhibernacula\nixtapan\nslenderness\nscyld\ncompal\nloanees\ndolwyddelan\npencaitland\nhamshahri\nkinara\ncuningham\ngrafs\ntindley\nciena\nblando\nkendalls\nshemaroo\nmcfedries\ntopock\nwaterkant\nhyoscyamine\nchits\nsnel\nduellman\nzounds\njackling\nakol\nnfts\nsfos\nbruneteau\nvidere\njeptoo\nkorzun\nmahna\nzhilong\necca\nnaeto\nweihenstephan\ntorrico\nearlsdon\nfalungong\nfriess\nurfi\njeantet\nrathman\ngladioli\nesrd\napro\nariary\nmortimers\nraphaelson\nprotus\nbonallack\ncastledawson\npiskor\nglockenspiels\nbayham\ndritan\nquickstart\nmmw\nsakyo\naquellos\ndogberry\ncablecom\ncoagh\npontcysyllte\nrumbelow\nafrico\nblackplanet\ndmap\nparetsky\nwalfrid\nhabashi\nwindies\nclamming\ncombusting\nghf\nmisjudgments\nelior\nboler\nkravica\ngarity\nneupane\nsuperteam\nkovacevich\ncalanques\nerromango\nislandwide\ncalcutt\nnhlbi\nreappropriation\nhofmanova\nhimley\nbarehanded\nlochinver\nbrashly\npuddletown\ndigswell\nfortini\ncoxiella\nwhataya\nmileham\ncongresspeople\ngromek\narticulately\nrateau\neyerly\nagresta\nflahaut\nmoniaive\nsarcopenia\nchinguetti\nbulgogi\naletti\nabta\ncouncilperson\nfirstenberg\nlequel\nisikoff\nbosca\nhagolan\nfordice\nhasti\nleftwards\ndemming\nbreightmet\nkcom\ndahe\nvacuumed\nooda\nintersexuality\ndegryse\nhosn\nroofers\ngerad\nfreebo\nprobasco\nsummerdale\nkashk\npand\nradif\ncopplestone\nshammas\nbuscot\nguney\naftershow\ntrischka\nibsley\nslickness\ncosac\nwisa\nbaulked\nibcs\nktlk\nbalsams\nkobie\nchouette\ncrem\nlatke\ngrappenhall\nkneecaps\nnyoni\nmonder\nbackworth\ncherimoya\ndiamé\nserration\npestilent\ncookoff\noneok\nbianconeri\nhelmingham\npromesa\nduplexing\nwillmon\nmountebank\nebus\nfalloff\nolliffe\nxivth\nbalash\nvallat\nrampantly\ndainippon\nhoffe\nachmat\nstreamwood\nperiodicities\nresetarits\nnastily\nburnable\nmaderas\nbenishek\nquilici\nentrapping\ngreggory\ngoldsworth\nwiedera
ufbau\nbignall\ncrummett\nbiafrans\nxns\ncomolli\npoyi\niddon\neavesdropped\njaud\ntriffid\nreshooting\nminick\ngoldens\nzadro\ngangte\nwoodroof\nwittke\nspruyt\ntemozolomide\nincubi\nyizhong\nbrownswood\nasae\noptum\ncarboline\nbukar\ncharkaoui\nnightstalker\nbakley\nsilksworth\neverbody\nplatforma\nsemaan\nseidlin\nraddon\nceus\nsorbie\nfieldsman\ndyfan\nveliz\nkiyota\nrothemund\nteunissen\nwmas\nsupachai\narchaeologies\ncorrigendum\nganis\ninsignificantly\nlakshminarayan\nfrud\nwaitsfield\nrimkus\nreligionist\nquoddy\npipher\nmyrtis\ncgrp\nimmunogenetics\nchevys\nplomley\ncrivitz\ntoolmaking\ndfas\ncommensals\nesplin\nwfb\nscuttles\npamphleteers\ncroughton\nsorrentine\nncv\nminimovies\nwheelman\nweyrauch\nlatapy\nbacchi\ncloudcroft\nwheelersburg\ndrechsel\npalladin\negberts\ndejiang\nsagnol\nalbaugh\nflyout\nscotish\ndelauter\noverwing\npollari\nhenske\nstehle\nsubmited\nanie\nbirke\nhyperemesis\nmoulsham\nlanett\nregrette\nnotw\nburfict\nrecombines\nhalmi\ntelescopio\nelkhound\nanielewicz\nuttal\nminges\ncaroms\nswishing\nneier\nurushadze\nstackridge\naetiological\nnalty\nodoyo\ntymoczko\nprzybyszewski\nsherlyn\nmeininger\nhedychium\nfarooque\nmetalized\nnoever\nfenley\nmonocled\nabene\nnaspa\nchiasma\nknavesmire\nbrewsters\nheesen\nsanikidze\ninonu\nfato\ngeomechanics\nledroit\nclinkscales\nhookahs\npalanivel\nhamu\nporumboiu\ncopertino\nvelilla\npayatas\ngungho\nhargreave\nhuseklepp\nmerja\nperrée\nlonmin\nnaoum\ncustards\nbarki\ntuchel\nkabocha\nsaronni\ncarnmoney\nhanmin\nmcso\nwtih\ncholodenko\nmirk\nintangibility\nfaintness\njiahui\ndoctora\nfingerling\ninsall\npolishers\nibargüen\ntadakuni\nceel\nheff\nmeowing\nrevuelta\nwaipio\ndooks\npatrica\nunforseen\nbenthamiana\ntrau\notk\nchynn\ngorleben\npopsy\nzvika\ntzena\ngilhooley\nbacsik\ntranscon\nfillory\nauther\nmehamn\nmcaslan\nsmyers\ndinna\nlazzaroni\nfuw\ndakotans\nhipperholme\nlauterbourg\nbeatlesque\njoralemon\nzuleikha\napil\nliljedahl\nscentless\nindustrialising\nlangkow\nlekker\nfriending\nrassel\
nmotahari\nbrantas\npharmacogenetics\nnationalbank\nrikyu\ntompall\nfallaway\nbacchanalian\nbrocks\nshosha\neurus\nguidepost\ncirl\nbated\npreoccupy\ncoequal\nhilltopper\nkomor\npianiste\nbesetting\ngetta\nblayne\ndarklord\nkorus\nlaudi\ncambusnethan\ncordesman\nbrettler\nmeerow\nbethencourt\ndarrien\nalko\ningrate\nagonizes\nvandevelde\nshoreside\ngrod\ngulches\nplayskool\nzigeunerweisen\nboepple\ncreditability\nworl\nchondritic\nsunzha\nhobbesian\nlorean\nepcs\nnamirembe\nhummert\nroggio\nménilmontant\nmcguckian\njoeli\nnightshirt\nakabusi\ngurfein\nbrockmeyer\ngarz\nlmct\nhindenberg\nbolzoni\nstormfury\nwackiest\niecc\ntoos\nazahari\nletcombe\nhardbody\nshenar\nzscaler\nthean\nremarketing\njetted\nilkin\ndemographia\nmwenda\ntheoklitos\ncrosspiece\ngurgenidze\nparmigiana\nshowreel\nvontobel\nirishwoman\nvulgarism\ntumnus\ndevonish\nbushisms\nudb\nstokenchurch\naltadis\npolunsky\nblastoff\nmegaloceros\nseion\nkewa\nnajarian\ngenpact\ndafen\nbachelorettes\nmoretta\nstrategem\nbayana\natomoxetine\nrathe\nbailin\ndramatico\nsorrels\ninglehart\ngoey\nreillys\nhanceville\nboldre\nbernadett\ngobbling\nsophisticate\ninterministerial\nziekenhuis\nchronis\nchambourcin\nsheats\njazzbo\ntianjing\nllanfaes\nmandanda\ndaguerrotype\nadenoids\napostolates\nelastography\nposession\njelli\nsharafi\nfuegos\npopinjay\nkingsdale\nturowski\nlosse\nimmunizing\nbejewelled\nallured\nstockett\nredenbacher\nfungicidal\nminett\nagroindustrial\nphyto\nrottier\nweedsport\ngarsdale\nbucklew\nkinlochleven\nsamme\nzigler\npsammetichus\nwidgetbox\njixian\nsarongs\nsanny\nshenae\ntraffik\nagulla\njohhny\nhdn\nsankaranarayanan\npetkova\nmarije\nperformable\nnaturopaths\nsheron\nopare\ndanielli\neuzkadi\nwynns\nreguarding\njco\nhayner\nviniculture\ncheffins\nhyodo\nfinnieston\nssas\nreprobation\nstrothers\nfumiaki\nwinterized\npabon\ngherkins\nopvs\nhulshof\nguiton\nkirtlington\nbriem\npalladia\nedil\niied\nfusilli\ntizer\nryler\nhedgepeth\nmosholu\nitsi\nleblon\nstondon\ntorrisi\nvimto\nmastroeni\n
pwrs\nasinara\ncremains\nmontanez\ncleasby\nujjwal\nebele\nboschetti\nsayar\nteramoto\nscognamiglio\nnbaa\nmallord\nsnibston\nusamah\nprogenies\nekati\nundies\neverone\nskulled\nuntrammeled\nsamiya\nsofty\nmoily\niliopsoas\nmachimura\nshipibo\nrichardt\nquinquefolia\ndragoncon\nbuile\nsearson\ncerin\nmatsch\nburnhope\nmayaguana\npillagers\nsynchronises\ncile\nnadder\ncontroversey\ncorendon\nleskovec\nkeratinous\nbenevides\nsucceded\nceglie\nrtus\nmoisson\nsirtuin\nshoehorning\nmckearney\nttxgp\nteleperformance\nlignan\nhookey\nshowpieces\nmaltbie\nbrkic\nzaripov\nikramov\noppurtunity\nbekaert\ncluemaster\nperusahaan\nairmax\naurélia\nkalpakkam\nangley\nharit\nburham\nsassine\nsikha\noriginalist\nvanderburg\nphenacetin\nsexxx\ngingivalis\nhollimon\nbushwacker\ntsuruya\ngengenbach\nanonymization\nnahrawan\nastrit\nlegitimating\nhummable\ngröner\narmit\nbuhera\nmagasins\nweerstandsbeweging\nkarlsberg\ndiyan\nrofecoxib\nrecommit\nlangworth\ntashjian\ndionisia\nlifesciences\ndomperidone\nkeshishian\nfiveways\ngartmore\ngipping\nmckennan\ntolong\nshash\ndanielpour\nlanskaya\nsebat\negton\nmutahir\nmacdonogh\nkieser\nimperilled\nkptm\nsampietro\nmuramoto\nulin\ntaxidermied\ntaymouth\ndeddy\nderin\nmalefactor\ncrann\nendears\ntattletale\nmerveille\narem\ntackley\nlauffer\nkorowai\nnewquist\nmulry\nseigfried\ncriminalises\ncystoscopy\nbomhard\nupperton\nsterman\ntenoch\nappart\nlaboulaye\nchiweshe\ncandlish\nchalifoux\npaslay\nlinett\nstabilities\necks\nkarar\nekofisk\narritt\ncoprolite\ncrathes\nmithali\nlandzaat\nsherwell\nchuppah\nhypercritical\nshives\ndelinda\naccp\neclectica\nqru\nharelik\nleistner\npirg\nsukhum\nvasilievna\nkompania\ndugal\niryani\nserlin\npoelman\nherskovitz\nsparber\ndirecta\nsecularize\nprecipitator\nnoncombat\nboyde\nmcgiffin\nlochcarron\nscuffed\nsauté\nmaglio\nwasantha\ntoktogul\nrollcage\ndurso\nkhorsandi\ntonis\nkouilou\nquaintly\nsanmarinese\nliel\nimpoverish\nnextdoor\nkappos\nhereto\nlomana\nprates\nissl\npradel\nwfi\npooping\nthornely\nska
u\nmumbly\nprospection\nreddaway\nshortcrust\namvets\nmilage\nenniskerry\nkösen\nqurashi\nuncoloured\nlezard\notas\nrethymnon\nanush\nmarena\nskelley\npruner\nholdich\ndjerma\nwoodpigeon\nunconvicted\naronin\ndrollinger\nokudaira\nlaubscher\nviveiros\nlumbly\nlevox\nshefer\nchlumsky\ngrrrr\ndamxung\nlongtemps\nscrs\nbulter\nprutton\nbohner\nswavesey\nbodwell\nshavei\nfriedenberg\ntaverham\nmenomena\nmonnin\nesdp\nmelching\nwlbt\nutsjoki\nrerecord\nsenser\npremierleague\nislamo\nmerrymeeting\nnorplant\nkonec\nhuili\nmilioti\nproblemo\nsaraland\ncanzonas\npeñate\nwerne\nifpa\ngraveline\ndamman\ngracq\nornithomimids\nmagaw\nbutterick\ngaian\nsoutherndown\nsisamouth\nwpbf\npettrey\nhillclimbs\narsov\nsurowiecki\nmaximiano\nfatullayev\npayner\nencrustations\nnorthline\narrestable\nkurkov\nrobing\nhinderance\nperanakans\ncollecta\nbelaga\nbawag\ndraperstown\naeroespacial\nimpactors\nwhisler\nunpromoted\npoderosa\nagoutis\nlehenga\nmenabilly\nslackline\nlomnica\npdus\niconia\nkayley\nstunna\ntostan\ncojones\nwitcombe\nworrack\nchewable\nghafur\ncanst\neyecatching\nkhazan\nantitoxins\nsituationally\ntostado\ngeremek\nsteineckert\njuang\npopworld\ntagge\nmultipartite\nrubbles\nkockott\nfxb\nbednarski\nledum\nadipiscing\nsobhani\nwisecrack\nvathy\nslating\nbramhope\ndefrocking\nkasparian\nmeiggs\nfilippenko\nivoirienne\nvirtualize\nwhirligigs\ninarguable\nkepier\nunserious\ncapezio\ntigerland\nturndown\nszczepanik\nfiss\ndinello\nrotos\nhadrien\nlanguorous\nsnowsports\naspr\ncorncrake\ndaviau\nonuma\nsarducci\nmihok\nepistemologically\nowensby\nsydsvenska\nnolot\ndoster\nsherryl\nixworth\ncaucusing\njevan\npahars\nshijian\nplanetology\ncravats\nimar\nsabbiadoro\nrahmah\nlindenmayer\nlineberger\ntenho\nmcclatchey\nculpo\nwittingly\nburgoo\nbearpark\nvear\npleasureland\nlagunita\nejectors\nmmabatho\ngionfriddo\nmisdirecting\nelkana\ncarabelli\nsubframes\njonbenet\nvalleywag\nafinogenov\ncippenham\ncoud\ndeveronvale\nwainwrights\nliuba\nballfields\nfeasable\ncirigliano\nblythewo
od\nmenaker\nunselective\nigfa\nharoche\nhodgkiss\njurys\ntcks\nreincarnates\nlatendresse\nliddington\nvincenzi\ncockbain\nmicropayments\nweisenberg\narav\nvinnicombe\nhoola\ndique\nkissidougou\nnaumoski\nbopping\nrejig\npilsley\nduplicators\nrivieras\npetignat\nmelhuish\npinheads\nnaisbitt\ncoffeemaker\nsocolow\nkolomenskoye\nverburg\nmasculinized\nunimagined\ntrrs\npoésy\nmayaki\npixilation\nclomifene\nkalnins\nscotter\nhouchen\ncinsault\nshahrzad\nroskell\nsosin\nendevour\ngastrostomy\nconceptualise\nchytilová\necureuil\nlowriders\nbladnoch\nsantell\nreattachment\nambassadorships\ntysoe\namardeep\nreinstitution\nenigk\nunwrap\nscuffled\nlubinsky\nverheijen\nleogang\nfrieser\nintercoms\nbellic\nsoundest\nmcelhenney\nindialantic\ndogtanian\nkhda\ngarbarino\nencourager\nlambrusco\nuptakes\nblastomycosis\nkostyuk\nmercopress\nariela\nschoff\ndhokha\nkontinent\nukrayiny\nfractus\nespenak\nhomestays\nbontrager\nperspicacity\natlixco\nbrightwater\nochres\npibe\nsulks\ndavenham\nmemorizes\ngaxiola\ncanson\nleibovich\nguscott\ninventively\nhrab\nzaretsky\nyapese\noverstone\nsoldini\nnonoxynol\nkoelle\nsoffits\nbouras\ntropp\npaleopathology\nemarat\nbreiner\nscenting\nrataj\nratcheted\ntaffe\nvizion\nsailele\ncoskun\nprofili\npulite\nroest\nkotch\ninterglacials\ncastelbajac\nsaywell\nalvaston\njansport\nhuanuco\ndysregulated\ncorinium\norakpo\ndanze\nbackplanes\nsurra\nhanushek\ndustpan\nmahonri\nvtl\ntwinz\nleece\nbgan\nshukr\nwuh\ncapitaland\nmoteur\nokayed\ndykema\nrollright\nsigalas\nmusen\noenologist\ncarderock\nintuited\nhypertriglyceridemia\nsuhag\ntamudo\nentreating\nsheherazade\nsottile\nneumayr\nvolcanological\nmbakwe\nliscence\nvaudevillians\nniedermeier\nstreakers\nschol\nsonapur\nsedlacek\ncandelabras\nhavill\nwaikoloa\nmoluccans\nparbold\nretching\nwhistlestop\nmelnichenko\nvitellia\njujubes\nmerinos\nburruchaga\nvirag\njamukha\ncourtot\npallares\nkoodo\nmyne\njackley\nbrazel\nmoladi\nannalynne\nborza\nalwayson\nbolikhamsai\npeci\nwhiteadder\nguseinov\njannat
i\npassivhaus\nboedeker\nmenchu\nphilipon\nnorriton\nyasen\nimpetuosity\nmonohulls\ncherrill\ndisanto\ntemanggung\nexpiatory\ngulliksen\ngrabowska\nholtet\nunscreened\nrehiring\nmulesing\nbuchwalter\nkohlmeyer\nshinozuka\nveran\ndoval\nglossier\nsteininger\ndadda\nfootpads\nirwan\neradicates\nzanatta\niiasa\ncisek\nzarko\ncapsa\ntugnutt\ndelphinidin\nveiw\ndinges\nsicoli\nshadai\nagler\nstrogatz\nswansboro\nziga\nflodin\nadalimumab\ngearan\nfenosa\nsaied\ngething\ndadong\nhartshead\nknobler\ninfibulation\nbahadar\nealham\nvanderkaay\nbouchey\nschwannoma\nyose\nrubdown\nrajabi\nsmarten\nperra\nchenega\ntognetti\nplaît\ngabrial\nkurtenbach\nnuremburg\ndictations\nwappler\ngeds\ndrafty\nbanisteriopsis\nmargaritis\nryann\nzerka\ndottin\nmisconstrues\nrandt\neafe\nbaseboards\nmanvendra\njetters\nbetony\nconehead\nsymone\nduchemin\ntasiilaq\nsidlow\ngraybar\nprudom\nilac\nntaryamira\ntodate\nbasura\nnpws\nfelger\nprig\nlikly\nngola\nseropian\neurogames\nsecondigliano\nkylemore\nfunfairs\nsaffery\nriverwoods\nvenkov\nechaurren\ngrainge\nsandtoft\nhengrove\nsuwan\ntonkatsu\nindispensables\njoanikije\ndatapath\nmoules\nfwp\nnarender\nezzor\nclevelander\nnakfa\nsundaresan\npretest\nsonobuoy\nbaboquivari\neconomiques\nhutsell\nnormandeau\nmunguia\nuhlman\ntelegraphing\nkornati\nshulin\nchcs\nwilkey\nchervenkov\nfiords\nmpssaa\nmisterton\nimambargah\ncespitosa\nbustier\nfreeloaders\ntoodle\nkinetica\nkeiler\noutdoes\nlicheng\ncannistraro\ncarseldine\nnemiroff\nciar\ninternationalizing\ncrosshouse\ndenationalization\nstepparents\nmunce\ncdli\nwilliamsons\nsendek\nyellowbird\nmotability\nradamès\njawal\nbithell\nsisyphean\nphal\nnorworth\narini\najaz\nnahb\noday\nchimbonda\nselmani\nkresh\ncatino\nmovimientos\npalan\nfaizon\nconnoquenessing\nusme\nburkesville\nkarytaina\nlandbank\nmiyao\nunfertilised\ndeann\nalexandrines\nsavides\nkoryaks\nluse\nouyahia\nscha\nvollmar\nminjur\nburnitz\nnymphomania\nashchurch\npipeworks\nvelayati\nrinses\njti\nfilley\nrossin\ntouchard\nbaiter\njoy
celyn\netheredge\nsubfloor\nminskoff\nkebbel\nwdca\namortize\nirasburg\nroussopoulos\ngazillions\nkulpa\ndeschain\nhilman\nminniti\nnorick\nallwine\nuntucked\nlaugesen\ndervis\nlambrou\nrecomposition\nturista\nmijikenda\ndiakite\nandreychuk\nvxi\nayuthaya\nkukan\ngunmakers\npmq\nbarthé\npyxidis\nhonea\nmathijsen\nenfranchising\neurich\nwadhwani\nsurachai\nsportifs\ndauner\naugean\nhaulover\nprovençale\nueb\nsifford\ngalsi\nmerabishvili\nsolel\nzoie\nromantiques\nboverton\nboissieu\nsiia\ngarbs\nsnco\nmctague\ncarcases\nduggins\nclemenson\npilarz\nsulfonylurea\nkitu\ntriboro\nweinbach\nusov\nreincorporate\ntorqued\ncamlough\ntheocracies\njania\nomegna\nbelleair\ntuffey\ntogs\nblasphemers\nknuckled\npittville\ntonyrefail\njerde\nclassist\ntatoosh\nhotfoot\nmatern\ncoupla\nkawarau\nerdei\nklopper\nreinis\nshafique\nglomerulosclerosis\nkeas\nmikulak\ngruwell\nandacollo\ngerima\ngoerke\nflorestal\nnrh\nopta\nqoute\ndefi\nsongsters\nedley\nzaroff\nkaeser\nnoti\ncomplicite\nupperhand\numai\nbrader\naraf\nsmolder\njahreszeiten\ndryosaurus\ncruet\ndadoo\nuzzell\nromilda\nkoshin\nhealthline\njooste\nnabr\nestopped\ncampogalliani\nictj\ngeodes\ndayro\nkhoshaim\nfastly\nmousavian\nbleedin\nnutcases\nscribblings\nnemetz\nmetapan\nmosaica\nrodeheaver\npsychologic\nsyz\nswetman\narkadiy\nunkrich\nzhisheng\nnagasato\ngilvan\nwestendorp\nqcc\nerlotinib\nknezevic\narmina\nchieu\ncluzet\nschinas\nmedani\nbullheads\ncassocks\nlapstone\nimporant\nphyllosilicate\nsaloonkeeper\nmoodley\nkhanan\nsammis\nhaughtily\ncapc\nwikus\nbredero\nandersens\nglobalpost\nsilvis\ncampodónico\nmosen\nrakeysh\ndisquisitions\nharut\ntoskala\ndonepezil\nwakui\npinkman\nkreizberg\nconveners\njaleh\norlock\nattemps\nrobart\nantonian\nhenllys\nhostelries\nheckington\noliviers\nfluide\namiesh\nimagi\nxfp\nvoxx\nglobecast\nnht\ntuckasegee\nbrumberg\nrogachev\nsvenning\nphurba\nforvik\nzappala\nadherance\nolari\nferdinandi\nambulation\njetsunma\ncorsano\ndraddy\ndoret\nanonymize\ngastaldello\nburum\naih\nnegativi
sm\nfriedewald\nsharpies\nmembe\nzamacona\nskilbeck\nrafiuddin\ndebark\nruke\nentwisle\nbressingham\nraghuvanshi\njackasses\njabbed\nmadresfield\nkeitany\nmidlevel\nyongxin\ndisturber\nfinjan\nspicing\ntolsma\nhuyen\nmeetin\nascertains\ncutz\nmuralidharan\nsymms\nvillarosa\nrumpled\nboulby\nniksic\nmonchique\nklenk\nandora\ncodron\nfreesheet\ntriers\nhopis\nmcconchie\nhilker\nroseway\nyaohua\ndigressive\nhadrill\nrosengart\nblimpie\nlawbook\nrollerblade\nohev\ndcgs\nlanzoni\nspeta\nmatatu\nlowis\nimperturbable\nengert\ntarmacked\ngaboriau\nhornyak\nlawyerly\nwilmoth\nhabis\nmccarry\ndobs\nprofessionalised\nschoeneweis\npasillas\nobua\njagdale\npatrulla\ndecimates\ntitillate\nhippolito\nrundel\nkirkleatham\ngrasu\nshoshi\nlauca\nefros\nmontavista\nshvarts\nschmerling\ncorrens\nlookstein\nclaramunt\nbienert\nportville\nmotera\nkiyonari\nfrays\nhuckleberries\nsalari\nlgbts\nsnorkels\ngaup\ntartous\nleiby\ncowburn\nkafando\nwiele\nbayston\njerrie\ncliffie\npaintbrushes\nkosonen\nsyse\nkyson\ntoblerone\nwilpert\nrimm\nrespekt\nlagravenese\nleazes\nchurrasco\nhengqin\nconmen\nmrozek\nmasazumi\nagyness\nbrassier\ncrammer\ncassopolis\ndismissiveness\nalwa\nsuperweapons\ngurry\nrhody\nbno\nrenfree\nmicrostation\ncocooned\ntuller\nwuv\nstagefright\nsecca\nthromboembolic\ntraineeships\ncossor\ngrynberg\natsuro\nbeaubrun\nseperates\nyna\nslaithwaite\npangburn\nallocution\npyrethrins\nmaneely\nhillstrand\ncosens\numes\nazahar\njbj\ngayles\nreveiz\navercamp\npellston\nporphyrios\nyfz\nardolino\nhuatai\nwbk\nweezy\npielmeier\nchurnalism\nwilce\nqutbi\npersonation\nbandsaw\nsatele\nfeirstein\nsotalia\nkantilal\nermione\nbanchi\nsrtp\ntransgenes\nfinckel\nwyndmoor\nhofstad\nhorizontalis\ngiblets\ngunsaulus\nzukin\ntowery\nrheault\ndisjunctions\ndepósitos\nmerhige\nrathen\nmcquarters\nsidhwa\nsimiyu\ndierkes\nbackswing\njoxer\natletica\nassessement\nmelone\ngrann\ndegner\nrawda\nlaayoune\nhilti\nsweatbox\ngarota\ntoja\nundercovers\nvondie\nlochearnhead\nvett\napnoea\ndisinheritance\n
incommensurate\naduba\nnepotistic\nzorbas\ndeviously\npierside\ncovance\nscareware\njohannah\ncrumm\ntaoufik\nsharry\nclearpath\nlucchetti\nbarasa\nlacosta\nmuseion\nvanin\nsalvors\nleira\naladar\ndecliner\nselfsame\nekho\nsalyers\nemcdda\nbottari\nchoreographs\nswiftlets\ndeines\ncotard\nremédios\ngrinstein\narchey\npallin\npossamai\njurmala\nspogli\nhaverton\nbertam\niino\nrollerskate\nkanjorski\nisenhour\nairhogs\nmaharero\nfaja\nleyshon\nskewen\npalling\njariwala\nauvinen\nvershinin\nlulay\nilze\nperms\nlowfield\nenforcable\nluedecke\nscreenprinting\napprently\nthecodontosaurus\nfishhooks\ncaapi\ntalamantes\nsqueezer\nsachio\nkaramjit\ncarbis\npolyculture\nzardoz\ntugg\nlaszewski\nvedad\nbrimington\ndranesville\nlineback\nimta\ndefunded\ntouger\npuxi\namiability\nmatinicus\ncarisma\nguthro\ncandeias\nottewell\nreprod\nmartirano\ngalwey\ncamarero\nginocchio\ndadon\nshelvin\nosterville\nludik\ntothe\ncqd\ntittabawassee\nloompas\nblaspheme\nbörne\nfedotova\nlaocoon\nsarbaz\ndeeny\nsavident\nzbb\nbunshaft\nfaine\nfkm\nhaddie\ngibbens\ndiaspore\nlucine\nannuls\nnutts\nbourelly\nimpoliteness\nlituus\nbarias\nhendron\namoros\nfosun\nvoudrais\ncoggs\nendress\nulugbek\nneurorehabilitation\nzappia\ntrentonian\nbrico\nlockbox\ntorpy\nponyboy\nyizhar\nequinoxe\nkyonggi\ndiasporan\nwonderfalls\ncadabra\nbeyene\ncheatgrass\nranny\nintrusiveness\nfergusons\ngianotti\nlongjiang\ngmx\nleja\nmattawamkeag\nbadsey\nporcari\ncormick\nretuning\nsalzgeber\nbrabbins\neconomise\nbondholder\nstracke\ncyf\nisss\ndeitsch\nkharafi\ncohabitating\ndolgoruky\nderngate\nezarik\nshochat\nmskcc\nmurville\ncreuset\nottolenghi\nhandstands\nccps\nitemize\ncvma\nxdsl\nyizhak\nfaour\nbrisby\nshewanella\nnizhni\nnuyen\nkorek\nwoollens\ngodowns\ndelfouneso\nmicrobursts\ncovidien\ntreuhand\nunrecovered\nedwy\ngrimberg\nclothespin\nspectatorship\namim\nsangs\nacromioclavicular\npenrhyndeudraeth\nsupernumeraries\nunbidden\ndisdaining\nglisten\nmacsween\npoxvirus\nnclr\nlarita\ntorksey\nbodell\nnardis\nkoel
sch\nmonsoor\nchalana\nzucchetto\nthaker\npalyama\npregerson\nperiaqueductal\nkinng\nppw\nswilley\nbodyboard\nwesc\nriggleman\nmartinot\nsileby\nmifi\nsubversions\nsrec\nsmallthorne\nmotorshow\napmc\ncirone\nswinley\nrcgp\nherran\nessense\nbarentu\nfrecce\nmaynardville\nflyertalk\nmaiale\ndurran\ndoest\nrepulsor\npinte\nlprp\nkinniburgh\ncodner\ncondylar\ndakka\ndesforges\ndeele\nvoelkel\nbeinin\nvarmints\nrepublishes\nxma\nheavenward\naroon\nazahara\nclachnacuddin\nlhakpa\namiram\ncarrieri\npamplemousse\nfayer\nvasicek\nisakhel\npechiney\nnhd\nweligama\nmarucci\nlensbaby\ncédras\nwearstler\nfolowing\nchangez\nmyaungmya\nfairydean\ndorthea\narvelo\ntsujii\nhorev\nlauch\nvasiliou\npoopy\nglassport\nprestissimo\nmccranie\nhalldor\nbolls\nscieszka\nmengniu\nwaine\nartspeak\nscrappage\nmickelsen\neble\ntrifunovic\nwva\n\nduneland\nkunen\nrecolonization\nbigfix\nracquel\nkonare\ndiskerud\nendellion\nmagnetize\npicu\nurbanite\nduniv\nphilosophize\nirremediably\napproachability\nbrownbill\nimesh\nposibility\nmaggert\npotternewton\nstephans\nvijitha\npenclawdd\nvof\nlongboards\ncassillis\ncruzat\nrvu\nwharmby\nunsocial\nlavishness\npralines\nlefanu\npeppermill\nvietri\nblacon\namiloride\nfreidrich\nespe\nrecertified\nhosing\nbaros\nglenurquhart\nskaarup\nsuperjail\ndhg\niwd\nnazih\nbrocaded\nstickier\nmalyn\nteakwood\nlongerich\nmoxifloxacin\ntnz\nushahidi\nganache\ngrumiaux\nyuquan\nmodernisations\nbrandley\nchoji\nunbanked\nfauzan\nsambac\nvukmir\nbulerías\nelderfield\npeszek\northochromatic\nneaves\nsalination\nlorina\nfosa\nshaohong\ngowadia\nmegginson\ncompensable\nwisener\nmeconopsis\nchmura\nzöggeler\nstanic\nunbroadcast\nogogo\nzinat\nplyushch\nrashaun\nglba\nexor\nlevetiracetam\nrafta\ndayron\nkjærsgaard\nciona\nfeleti\ndunelm\npubes\njasms\nnotizia\nmaraniss\nprorated\nveljohnson\nrusswurm\nuncc\nbuback\nvenceslau\nscratchcard\nintrospect\nbulker\ngaler\nscooba\nsumus\nponant\ngrimonprez\nmapple\ndandie\ndialectically\nclassicus\ngundula\nnaseri\nwarmhearted\nsio
c\nseedhill\nmarkward\nmckinzie\nfrenzies\npaikiasothy\ngrannie\npanajachel\nscarwid\nngoy\nbeddow\ngaborik\nboulmer\npolylactic\ngwynneth\nkolodin\njagang\npitsunda\nanthracycline\ncariani\ndohme\nzarem\nprenger\nslavitt\nprak\ncarjacker\nsakhir\nchildfund\nsencer\nmamady\ntioté\nxay\nbiltong\nzendesk\ndgac\njelloun\nmasterfile\nslavov\nladybower\nbupivacaine\ndrools\nbelridge\narmouring\nyodok\ntumacacori\nunluckiest\njuvie\nmasley\npomonella\nshawal\nmdoc\nthinkgeek\ndrysuit\nlearnin\npsephologist\nmatityahu\nwisam\nhatosy\nbenami\nsoubry\nbacino\nregasification\ntebessa\npolyglots\nnagell\nalmadén\nmaterialising\ndeselect\nmonotonously\nmicroelectrodes\nbrutalizing\nincorruptibility\nconstructal\nschumm\ndownscale\ncude\nmahasi\nschönhauser\ndiamantis\nschriver\nakopian\nrougham\nnordhaug\nrootstown\ntribunale\nwebware\nhoffstetter\nhostler\nknw\ntolba\nzadi\nbunmei\nflubbed\nwhaddya\nshigetoshi\nwenches\nmessolonghi\nvasilij\ncovello\nosmington\nkimmell\ncitp\nraytracing\nwherries\nmüntefering\ncleeves\ntouriga\nkeflezighi\ncoloradoan\nshearmur\nflamands\nborree\nmegacorporation\nibbs\ngarvald\nbulliet\nhalime\niav\nsytem\nhlophe\npacal\nphysarum\nnyree\njorasses\ncheok\nunlawfulness\nforthrightness\nhrn\npalevsky\nchondrocyte\nmorani\npiazzola\nmetas\ncouesnon\ncoxall\ncoathanger\nblizard\ntroupers\noxidises\nshahad\nafzali\nflorie\nincentivise\nchlordane\ncascata\nitzkoff\nagna\nmastandrea\nbustelo\nguillotines\ngrabiner\nbobbidi\ntongi\nwiertz\nmardale\naquire\nfamosi\nresalat\nsportske\nquiktrip\nmunther\nbombproof\nyaren\ndavino\nrieussec\nwoodseats\ntgvs\nmullineux\nmontagnola\npomposa\nindoles\nfreemind\nhepplewhite\nmodwen\nmenageries\nvaitkus\nmarez\nsunrun\nchuch\nursell\ndilts\nassiette\nepon\nflâneur\ntabatinga\nhuttner\nvmg\npramuka\nkaab\nmanhire\nsukhorukov\nmontagnani\nisobutanol\nroizen\nexterminations\ntadzio\nzaragosa\nfriulano\nmacroglobulinemia\nleaches\neppinger\npokies\ngerhartsreiter\nbeyblades\nsauvegarde\njtac\nspirograph\nihes\nbrozma
n\nwawrzyniak\nguestroom\nmaresfield\nsmithland\nliheap\nquillian\ncalday\nnontron\ndehydrates\npoorva\nhuddles\nsobczak\ngolosov\nprzybylski\npekkala\ngwithian\ntebaldo\nndereba\nshahzaib\nkelsay\naldrick\nlemel\nexb\nawasa\nunitaria\nautoregulation\nsaundersfoot\nvolcanogenic\ntshepo\nbrachman\nbrandram\nbarath\nrudston\nmalachowski\nyongchun\nstocken\nwardington\ntatnall\ntranseuropa\naktar\nhaggarty\nkainos\nlangway\ndidem\ntriola\ncleage\nweihan\ntrojanowski\ngccf\naptidon\nlichty\nmumo\ncesspools\ndmcc\nsemiprecious\naugher\nmallan\nluarca\nkashfi\nwatzmann\nxclusive\ncmcc\nedner\nslipperiness\nnightcaps\nnorwin\nfeighan\nsheetrit\ngrese\nhhp\nfelicitations\nshreeves\nutseya\neditoral\nstromatolite\nmeself\nmoez\nyemin\nmoondust\nmarczewski\nluminol\ncoffeeville\nstupnitsky\nfoolscap\nduumvirate\nbeechy\nsween\nmarkou\ncrbc\nmouchette\ntiven\ngutowski\ncapellan\nbeiber\nwus\ncheaney\nstephentown\nsinde\nmodchips\nboudier\ndeeble\ntundras\naudsley\nclotheslined\nblastomeres\nacheivements\ndommage\nsonnenfeldt\nairmont\nfalleni\nwabara\nraewyn\nindisposition\ngeck\nfitzpatricks\nallègre\nyaletown\nbouffard\ncavehill\nnezavisimaya\npelchat\njianming\nmarshwood\nspearmon\nenoksen\nmacgowran\nfuzing\nmadian\ntiida\nmottahedeh\nhyu\ndahesh\nrgi\nglissandos\npenketh\nfieser\naultman\nembera\nbiztalk\ngoldenseal\nfestered\ndyane\nnivison\nichigaya\nkrygier\nsaiyed\ntempl\ncoatham\ngorgonia\nsynchrotrons\ncheuse\nlibere\nenad\ncrimps\nmalediction\nceaser\nshettar\notaiba\ncatalanotto\nguemes\nkamunting\nuncac\nnadig\ndyukov\nprolapsed\ntrapero\nncri\nsheeva\ngastroparesis\nwpbt\ntonye\neggington\nanathemas\ntwinkles\nshampooing\nblub\ncource\ndobrinsky\nrotifer\ncwmaman\nmailonline\nbilali\nfogleman\nhartikainen\nbvl\ntufano\ndablam\nveba\nwledig\nsaimon\ndopes\ndorsiflexion\npelmeni\npossesion\nvibraphones\nemsdetten\nmanipulatives\nstrathaird\nbayandor\nhamsterley\ndromoland\nsomin\nchups\nbilney\nparasitically\njabour\ndirtnap\ntripwires\nheatherly\nblackinton\nbunn
ett\nkonjac\nfahrni\nknacker\nchiaromonte\notherhand\nseiches\nqadiani\nshick\nbanyard\nmanigat\nlrad\nwitchell\nmeprobamate\ncantele\naltruistically\nhooverville\nlamrock\nduncum\nmoloto\ncayucos\ncambiar\nraraku\ndislodges\nthembu\nstäfa\narcheologically\neagleburger\nrecapitalized\nblandina\npiernas\nsavitha\nfancast\nalkyd\ntulliallan\nlayabout\npipkins\ncrudest\nfingerpost\npakistans\nsawkins\njuans\nmarteen\neggo\ncerha\nnapolitan\neilon\nlizhi\nwalen\npevs\nyated\nnamche\nbacksides\nsukkar\ntiens\nproportioning\narnet\ntonsley\nnalp\nserpell\nwangdi\ndeepcut\ntonin\ntarheel\nnannette\nhegenberger\nbrueckner\nsuspensive\npandera\nfleetcenter\nllaw\nkikwit\ncownose\nhydroplaning\ncanfranc\nishige\nrotoscope\nepitomise\nweichai\nkiryienka\nchouraqui\nfradique\nalarie\nstalteri\ndiarios\ncairnhill\nreticulocyte\nmettenberger\nvorticism\nsklodowska\nklause\nnevirapine\nunseal\nruppersberger\nshoukri\nvibiana\nvacherie\nisleham\nmkiv\nkeppie\njbi\nneus\nbienvenidos\nverkhovsky\nsadir\nhunga\ntokage\ndubbin\nhaussman\nzinny\ncrociere\nyobs\nauryn\ndeya\nvenkateswaran\nneog\nreepalu\nknerr\nholnicote\ngervacio\nglamorize\nsportingbet\ndavol\ncatacamas\nmarrack\nfoiles\nskillings\nbokhary\nladys\nsingletrack\neyeworks\nbadir\nseldomly\nbunkley\niuss\nlounger\nlenczowski\ntheophanous\ndynastes\nbozidar\nweixler\nbahawal\ndisinterestedness\nmoshpit\ncoccolithophores\npultusk\najita\nscantling\nsibur\ncuvee\nhatam\nlleshi\ndariel\nribadesella\nkeyah\nkonon\neyestrain\nbosville\njoner\nsirus\nsingkil\nwunderman\njäätteenmäki\nfuest\nbawling\npapagiannis\nadec\nayalew\namfortas\nweichert\nwawer\ndiq\ndorrans\naguaruna\namberol\nregretably\nelbel\npasqualini\nirta\ndogsbody\nignitions\nmenik\nserigraphs\naerially\nmaterie\nekantipur\ngreenstock\nrrd\nnesi\ndainelli\nchabanais\nmonjo\nachraf\nkatsuyoshi\ndiversityinc\nunmovable\nshipworm\ngimmickry\nhammerklavier\neriocnemis\nbalzani\nbalbirnie\nepode\ntakuo\nbombie\nmortadella\naumf\nziebart\nbowmer\nobm\nbonduelle\ndockril
l\nsharbat\nadisucipto\nchojnacki\nchaifetz\nanglezarke\nlancellotti\namatil\noldfather\ndesso\nsugarfree\nfuwei\nreshapes\nkassidy\nfrette\nmarkee\nincarcerating\nxinping\nkligler\namaré\nbellan\npeices\nakinfeev\ngolog\nlukash\ntressa\nmaajid\nmabrey\nlinkins\ncharacterful\nallanton\ndrakenstein\nalcano\nglosters\noverpayments\njinda\nescravos\nmyricks\ntasnim\njehane\ntricoloured\nhewell\nghir\nglobovisión\nmoranbong\nstocksfield\nagonisingly\nkeji\npleno\nsorrells\nemnes\nfootie\nuralvagonzavod\npirtle\nshallal\ncirm\nhavner\nplews\nmeuser\nquepos\nperuga\nwriteable\ndiesmos\nquicke\nvoronoff\ngrosgrain\nermua\nglaven\nplagiarists\nschmitter\nvoguing\nmcjob\ndonka\noberdorfer\nanemometers\nnevadan\nbugun\ntammo\nrujano\nbabou\nyanic\nbolinder\nkeiley\njumbles\n\nhoneychurch\nmutaween\nakhundzada\nbrumback\nhofs\nradwell\nbinnig\npiccione\nsoient\nmuckleshoot\nbransholme\nkrushchev\nteichman\nkneeing\nrggi\narzo\nshonibare\niceplex\nscurr\nbougon\nfarahnaz\nshumsky\ntiznow\nallthough\nnikam\nogba\nslipcased\nrachmat\nafsaneh\narrowwood\nsoneji\ntrezza\nsevernside\nperseids\neisenbeis\nunibanco\nascani\nremaindered\naspo\ngorongosa\nhehn\ndipropionate\narabinda\nrizer\nviano\nssem\nauthoritativeness\nwardian\ndelena\nploeger\nfujimaki\nsappi\nunprofessionally\nbrynamman\naudiological\nclodfelter\ncentrowitz\nsiwiec\ncoray\ndharamraj\nlandenberg\nmutato\nyulan\nlundkvist\nbiser\ncorruptible\nmarcolino\nbordón\ntempur\nbruening\nofficiants\nmssr\ncaídos\npondy\npadanian\nshagging\nkeens\nsynovium\ncornflour\nudraw\nbumiputras\nkalavati\nsofrito\ndenzinger\nintensives\nbullcrap\ntanada\njamsetji\nbashas\nvauvenargues\nrythmes\nmceveley\nturkmani\nginsburgh\nmakaha\ntsvetanov\nniehs\nemmc\nrvp\nhaloid\naaargh\njesualdo\nhansens\nsnaresbrook\norihara\nsimelane\npharmakon\nhabita\ndigitalb\nhayal\nexadata\nsnøhetta\nturino\npetrodollar\nbarklage\ncongregates\nbukola\nabron\nfurhter\naymen\nnonsexual\nmarom\nschierholtz\nchrissa\ncranworth\nassou\ndurland\nyout\nhillmen
\nmontsalvatge\nbailamos\nyusefi\ndeleón\nmontiglio\nllanstephan\nhassanali\nsubtile\nzene\nribfest\nshamelessness\nselawik\nmakula\njibberish\nvonlanthen\nweepies\nfeindouno\nwidemouth\nkreiger\ntinners\ndyserth\nrugen\nmojos\numoh\nernster\nmawnan\nraynard\nromelu\nnustar\nbrownley\nlisahally\nrupak\nblehr\noligos\nstyraciflua\nsaire\nabruption\nelberon\nsanty\nhuguely\nbiddles\nradiodurans\ncharke\nbrookmyre\nbonsoir\narnow\njunaidi\nflowerdale\npingpong\nsveinung\nzillman\nhelali\nseminoma\nrefah\nbeyrle\nalho\nyeaman\neasterday\nnettleham\nweberman\ndetraining\ndaems\nsarj\nmantoloking\npollença\nbuzzie\ngenachowski\nkydland\ntarusa\ndrager\naharonot\nspensley\nduato\ncarmacks\npinhel\nnacd\nstephensons\nkobiashvili\nvildagliptin\ngaraventa\ntokidoki\npiketon\ndueted\nworner\ncrevier\noshodi\ncaporale\njosefov\nschanche\ncytarabine\ncheapened\nmasuko\ngalluzzi\naguiluz\nmosese\nclavo\nchinwag\nsaffran\naygo\nsaxl\nptsa\nbelic\nillertissen\nncell\nchiney\nixchel\nbainsford\novertraining\ngoheung\nastillero\nvcv\nolympism\nkoppenberg\nfirminy\nsinfin\nrefracts\nskinnier\ntrebanos\nbethalto\ndence\ndeftness\nobba\ngressoney\nlyuba\nretroactivity\nstraubenzee\nmitsunori\nkearley\npontnewydd\nkillens\nswensson\ngallé\nbhata\npeepli\nclamav\nshiek\nnarasingha\nnovitsky\niaq\nbendixen\ndigirolamo\ncamogli\ncremers\ntamped\nmarinating\ndiagnostically\nlanker\nsadad\ngritton\nfakt\ngeopolitik\nkupchak\ndulaim\nsemiannually\nadibi\nhogle\nzuger\nmaravi\ndoctores\nmohajerani\nneener\nnunns\nworke\nclubcard\npmdd\nslughorn\nperejil\ngabbiani\ndiegues\nmosasaurus\nwhiteleaf\nderivitive\nmyria\nthrogmorton\nsolemnization\ndebroy\nserps\ndaks\nhfpa\npoorhouses\ncolfe\nrockness\nselbourne\nburgard\nshadle\ncliffwood\nrosaiah\nminorites\nswayamvar\nabramtsevo\ndlj\nbirkinshaw\nsvara\nkonkret\nuntrimmed\nsfrs\nplibersek\nchiltington\nwkc\ncadenhead\neii\nseaperch\nvassalo\naloa\ndahlbeck\nholborow\nsoufriere\ncasarsa\nhalda\ncmtc\nmcgrillen\navants\ngeurts\nsantanu\nbenison\nbla
ncornelas\nrorer\nninilchik\narop\nmattoo\npenmachno\nfrancella\numit\nteufelsberg\ncoys\nredeposited\nhriday\nacampo\ncontextualizes\nheima\nredha\nwindoze\ndecoupage\nchenghai\nmjo\nrosière\ncholecalciferol\nnbcu\ndigon\nliscard\nsamarrai\npixilated\ncvf\nyandi\nroundball\nkosteniuk\nsysplex\nfayzabad\ndechant\nemilija\nozouf\nsaidin\npatrington\noverclaiming\nblissed\nkalonzo\nhealthiness\nsefrou\njerson\ncobit\ninape\ngozer\nrocheleau\nmisfiled\ndembele\npursglove\nhoungan\ncostings\nnevadas\nburhop\nagribusinesses\npierron\nmicrofiltration\nsabam\nchillwave\nkalfin\nsalkey\nmorda\ndisentanglement\npapercutz\nblacktips\ninadmissable\nthermoforming\ngiros\nathanassios\nwenzhong\nfriskies\nrospa\nreaccredited\ntolkan\ndelorey\nchugiak\nfinaghy\npapering\nwitchhunts\nchocolatiers\nstupendously\nblenkiron\nschenkkan\nfolz\ndromio\nacemoglu\nbrookmans\nimag\nprostar\nglowacki\ngrimsay\nasselborn\nduncansby\ndiametre\ncursorily\nhotchpotch\nfgw\nheligan\nchatwood\nsalade\nbasili\nmarzocchi\ngrossers\nandreia\njubeir\ngrua\nrabaut\nfadiga\nmoblin\nparkton\nmorres\ntamms\nqaq\nmaynards\npompom\nsalsas\nnaughtie\nnyahururu\ngadiel\nsearight\nroadsigns\nloutherbourg\nbalby\nstarer\nbarkhan\nniney\nremyelination\nsusette\nolubunmi\nacpe\npetrona\nrobertsons\nfreewheelers\nchellie\nventham\nmeticulousness\nreitzell\nchoirgirl\nederson\nrianna\nbadakshan\novereat\nstagedoor\nnanocrystal\nigl\nturkified\nkendrix\ntweetie\nextraditing\nlakka\nmaladjustment\neastley\njusoh\nkharja\ncicchetti\ndiscounters\nalisia\nscepters\nwepper\ndiarrhoeal\nardiente\ndespues\ndisestablish\nimportantes\nsots\nvolesky\nperona\nbryntirion\nvinal\ncretins\nsackboy\navailble\ndayside\nmeline\nbalmori\nlaetoli\nhfea\nepisiotomy\ndeboo\nshopworn\ntaproots\nmcskimming\nramola\nmuxloe\nquickens\nbroden\nloebsack\nbaiocchi\njicks\ndivans\nbiodynamics\njassm\nalliedsignal\nmcevilly\ntableaus\nkwiatkowska\nbrutha\nmoloi\nchiredzi\nholdem\nksee\nzazula\nmunicipals\nbasinski\npiolet\nsakib\npelser\nbohjali
an\nkdi\ngushiken\nkralik\nscotchman\nqmul\ncastellina\nfermina\ndhiyab\nundischarged\ngreenlighting\nrosuvastatin\ntuono\nniagra\nkastelli\nsalvager\nnotamment\nkreeger\ndateless\nsomaly\naviad\nrusthall\nbramford\nthrillseekers\ngurg\nmaquisards\neastmont\nethell\npizzuti\nmisidentify\nbennigan\nrosher\nramonet\ndreijer\ndecouples\nwhomp\namelung\ncommanche\nnorkus\nultranationalists\nanadol\nmyoblasts\nmetzl\nshorto\nderrinstown\nmothballing\nkinsfolk\nleibel\nyoshikuni\njoura\nodah\nhakkarainen\nnavs\nzankou\nneuromas\ngenx\nranderson\noafish\nconfusedly\nboydston\nprotv\npapadopulos\nmudguard\ntrifasciatus\naequorea\nspermatogonia\nblashford\nlysiak\ndemaree\nroula\nunfortuantely\nerjavec\nakinde\nzahorski\nbarwin\nklesko\nzilpah\ncollyhurst\nviewsonic\ntamsen\nmitchie\npostfinance\nreelzchannel\nwartski\nwestfeldt\ndevic\nallaerts\nmarysia\nmazariegos\ndinuba\nroade\ngalperin\nachievments\nstridor\nsafc\nashkali\nhafan\nbotolan\ndesbois\ntalam\nmandeans\ndhiren\nstrabag\nsideview\ndragas\nsegredo\npercée\ntossa\naedpa\npaneuropean\nmmrs\nstiffy\nclacks\nklumps\nclairvoyants\nwallraff\nrichetti\nkoefoed\nlivaneli\nsandfields\njubelirer\nminou\nbagai\nerakor\nsitutation\nkininmonth\nstoy\ncheesecakes\nbarkleys\nglasshoughton\njarius\nvernard\nzemel\nberewa\nswallowers\nboutte\nkundig\nkrosa\nlenhardt\nlindhome\njellison\neldeen\ncimmyt\nbomet\nfirmenich\nbarzin\nlaholm\nviles\norientate\nsiss\nametek\ngoshorn\nhonister\nvalastro\nisraelsson\nenglebright\nconnal\nhungrier\nferesten\nvisclosky\nivoirian\nhsw\nkillgore\nasika\nchevrontexaco\ntenere\ngiacomin\nyavorsky\nisua\npancuronium\nburano\ngastroenterological\ngüines\ntamez\nclatterbridge\nahmedi\nphilliskirk\ncreegan\nwowt\njcn\ngoalline\nporchlight\nschu\nraubenheimer\nslovakians\nmaasvlakte\nchongo\nsaeco\nlargess\nhudepohl\nautoworld\ntaynuilt\ndhondup\nrodnik\npremarket\nfraz\ntricolours\nimee\nshabaks\nperiodico\nalbyn\ngordis\ndalmat\nmicrobicide\nheiti\nakeel\ncaldon\nocado\nhennacy\nvsnl\nfongshan\nb
redius\nbackhoes\naristegui\nplotnikoff\ntabárez\nmoonshining\novp\nbadaruddin\npetted\ndimdim\neduca\nmirvs\nqamdo\ngearless\ncanadia\nexequiel\ntomotaka\nyess\ngrayrigg\nwistrom\ngolliwog\nfiapf\nbuika\npterodactyloid\nveneranda\nseckler\nhafun\njorda\ndiridon\nquarless\nzarafa\nlessines\ncaundle\nrautaruukki\nplaybacks\ntomassini\nknacks\npootie\naslr\ncustodiet\nmullenweg\ncopacetic\nmaurstad\nzmuda\nmohns\npatashnik\ntrosper\ncabochon\ntambach\nkhiam\nasociado\nbeden\nyuksel\nmachat\nzfp\nyoule\ngroseclose\ngalipeau\nsandflies\nquie\ncouey\nghobadi\nwftc\nermera\nyakou\ntulay\nfalsities\ngreenberry\nvecchioni\nchomhairle\npertusa\nschager\nswats\npluriel\npollinates\nsheyla\ncadder\nsnitterfield\ndaaé\nchive\naghdam\nunderinvestment\nmahamuni\nhgi\nlegay\nmawlawi\ntadgh\nmartinov\nsrini\nsibolga\nhengdian\ngrassic\nhelson\nsideway\nkumana\nbouncin\nsidiq\npolitecnica\ntripple\nrogi\nsunedison\nsaisiyat\nclassmen\ncontamines\ncybernet\nlongues\nunmusical\nsopho\nsixgill\nblinken\northopedist\nsorber\npagonis\nleenane\nmonaldi\nflareup\nspraining\nadrain\nreffing\necocity\ncameoed\nincisional\nlipsitz\nhbu\nthamesdown\npenina\nsambourne\nfonejacker\ngalvao\nclonus\nfevronia\nzerby\nrichebourg\notterson\nblueness\nfilmstrips\ndjuric\nloas\nkernodle\nmycophenolate\nfrontis\noptim\nwafting\nbonas\nequuleus\nyamamori\nmerrymakers\npsds\npoilievre\nfraidy\nturnt\nlovick\nunseasonable\ntastiest\nkouadio\nunchallenging\nkoita\nsambueza\nmandt\nlyssavirus\nnucular\nkyriacos\ncronauer\nhavnt\nmagilligan\nstandees\ntoshimasa\nurmson\nemett\nlangewiesche\njillani\nnadirah\nssafa\nbilocation\nloroupe\nrefashioning\nchaldon\ncoxsackievirus\npencier\nunaged\nselectees\nwolfinger\ntabulates\nimputes\nlandres\ntequilas\nzorita\nfoppe\nsillah\nnabbing\nsugarbeet\ncapay\nraffetto\npueden\nmatmata\nsearchability\ntouvier\nfalles\ncelestis\naraneda\nbuey\ndiriamba\nvoler\nelectroluminescence\npushpakumara\nmattawoman\nprivalov\ncalmo\ntrivialising\nunhook\nganey\nhmmwvs\nboada\ndari
c\npreemptions\nhalwai\npzev\nnaskar\nlenah\nmastracchio\nanuta\nsittenfeld\ncalifone\nllangwm\nanec\ntrejos\nbinging\ncansfield\naurukun\nfederating\noxandrolone\ndonnarumma\nimmodesty\nspindel\nbiobased\nstruwwelpeter\npillot\nmabius\nulay\nlumbago\nheluva\ninstal\nknowl\nsalopek\nroeber\nkootz\nriling\nféraud\nlubchenco\nborka\nmersham\nauditable\nbolney\nradcot\nmontsouris\nxiaoyun\nporcelli\ngadzooks\nleaton\nslovic\nsemley\naverbukh\nheikal\njoanes\ndemetriades\naerators\nmascota\nshakhriyar\nbadbury\njackboots\nsperms\nhobbiton\nbadalucco\nnizer\npny\nwheatstraw\nferma\ndoubter\niglauer\nradich\nmcgillin\njaragua\nhandelsbanken\nbhisham\nchildfree\nperran\nwhinstone\nluq\ndietrick\nnonny\nnanon\negholm\nlawder\nseargent\ngoogleplex\nnaxalism\nbroadneck\nbrassieres\nethologists\nauvray\njusts\ntamarama\nhaikal\npalter\nhalligen\ngawky\nmcaliskey\nflatscreen\nsquints\nanovulation\niapa\nccsp\nagonised\ngolina\nmarasco\nburnier\nprofe\njizhong\nepperly\nnyserda\nediscovery\nreorders\nmelvindale\nkeetley\nshoreward\nnewcrest\npagliarini\nwaldinger\nchokers\nshrubberies\ntchenguiz\ntakie\npentiti\nbrownouts\nuza\nradic\nwincing\nshayk\nablate\nalejandrina\nfeillu\nproffessional\nqaasim\nverrone\nwickerwork\nolkiluoto\ndecebal\ngarozzo\nphadnis\nnicta\nmediobanca\nbelives\nmatchboxes\nshabib\nbramshott\nspaun\noceaneering\nrescan\nscheibner\nlandow\nfanz\nillion\nwreg\nsafarzadeh\nkrest\nfreels\ndentons\nwallaroos\ndeschenes\ntuneless\nunterberg\nircs\ngasland\njdub\nnln\nbifocal\nkooner\nkouakou\nsvartholm\nlurgi\nbottin\npowerbroker\nwoodgrove\ndelbanco\nbarredo\nosheroff\nfollansbee\nfinnell\npettini\nyfc\ntaikang\ntriumphalist\ntherizinosaurs\nmbtu\nterbush\ncrucifying\npshe\nrawest\nnepenthe\ntifosi\nmortuis\nieso\nabae\ntoge\npucher\norona\nmeggiorini\nzesh\nciff\npalce\nmbalax\nhoboes\nmisguidedly\ndecission\nscavullo\ndubas\nequivocate\narmendariz\nbackboards\nsublease\nlachrymose\nedir\ntise\nlancha\nconaie\ninktomi\nlecesne\ncraghead\nlogmein\nsugarless\n
leedsichthys\nsidan\nmthembu\nvarnishing\ncubin\nazamara\nitkin\ncogdill\nytn\nstutchbury\nswarcliffe\nbathtime\nanshel\nbogglingly\nvtsiom\nwive\ngraeae\nliley\nhadaway\nfiocco\ndaish\nnotario\ntosha\nsutyagin\nsmiddy\nvignerons\ncreaming\ngardezi\nileostomy\nhimars\nnaughtiest\nfalfurrias\njega\nkarpinsky\ngompert\nyueng\nryabov\nfakri\nipilimumab\nfleischner\nblagrove\nhemse\naspc\nspindled\namerada\nbarabara\nbednarz\nfaiella\njagodzinski\noptix\nwateringbury\nmcgegan\nhully\nkatiyar\ngendarmenmarkt\nshrivelled\nislams\nmegamouth\nshurat\nlacen\nguruge\ntajan\nhende\nyaverland\ncatani\nshiran\nchevreux\nhillson\nransone\nsilcox\nmatzen\nshoveller\nsebastiao\nmansbach\nalila\nracier\nalizai\ndiagnostician\nifil\nvavra\nnevski\nbenedictis\nurumaya\npeles\ncoplay\nryke\nbfgoodrich\npasternack\nmcdonaldland\ndref\nclia\ngyrls\nillovo\nequestrienne\nbasest\nrasti\ndyskeratosis\njarad\norgun\nhaidee\ndetectorists\nstassi\ndiepkloof\nmolests\nkoecher\nvoke\ncrowl\nlavishes\njnl\npenpal\npurling\nboeckner\njent\nsubhani\nmugnano\nfahima\ngisevius\nkonstmuseum\nparlier\ngodshill\nknux\ncreedal\nelvino\nprestigous\nconservations\ndorléac\navgn\ndouris\nsnaked\nfoglesong\njezierski\nglanbia\nkalvin\nlawnside\nnemet\negads\nhodak\nmoonwatch\npommels\nmetronomes\nbarkman\nduvel\neffusively\nminicar\nbuddenbrock\nfaryal\nrollerblades\ndroney\nblackhurst\nphit\ndicenzo\nringgenberg\ndivorcement\nosmena\ngelmini\nmilans\nisomura\nyiwen\nfatoumata\nattahiru\nkopper\ncofa\ncreigiau\nwashery\naronia\nbolkonsky\nchrys\nyanna\nmakgoba\nbartenieff\nfunn\nshanken\nyongala\nkvoa\nhyaenas\nnumberplate\nrdw\nklimas\nandee\nburghart\nkenansville\notai\nmurashige\nlodore\nshoigu\ndonoughue\nlojack\nysgolion\nboulé\nmediaweek\npoultices\nbidets\niyogi\nlbb\nroamers\nstoragetek\ncellnet\nmabyn\nyuanqing\nalaka\nkranepool\nimpoverishing\ncanonico\ncetuximab\npssst\nzajec\nfolsey\naasm\nmitie\nkhambatta\ncouder\nbidco\nbahk\nvoriconazole\nbearse\nreinvigoration\npibulsonggram\nmpower\nsmallcap
\nbortezomib\ndjebar\nidfc\naido\nbromelain\nhassey\nsalterns\ngentz\nthermogenic\nweisburd\ndydd\nterrine\nollin\netuc\npedobear\nannita\nmyrlie\nbhoy\nstrøget\neasels\nsmokefree\nreaming\npavelski\nwato\npersecutes\neryn\ntaskings\nrionda\nbundanoon\ndhere\ncasualities\nbeckworth\nthakali\nmyeloperoxidase\nfestively\njablonka\nhassenfeld\nlissoni\npollarded\nguirgis\nhazelbaker\nmcmakin\nbaidya\nrhoderick\njonel\nhorwill\nquisque\ngestes\nssac\nmuseumsquartier\nreservas\nblagoy\nbanharn\ndolder\nfilmaker\nharvestable\nsandbagged\nfouchet\nghostland\npizano\nmaver\nduncton\nborislow\nstinginess\ncasentino\nwakley\npriora\nloates\ninfoway\nharmine\nharaszti\nassocation\nkippel\nnelken\nchinchorro\naecc\ndyment\nballards\nlaudehr\nfrüh\nrachell\nattendence\nnoury\ncognisance\nscaramella\npalmachim\npozniak\neisold\ntraveline\nalamar\nmalzieu\nfictionalizes\nmanit\nsobh\nhalters\nbacopa\nkasdorf\nakot\nroomettes\nbioethicists\ndandyism\naafa\nponnusamy\nignorants\ntamrat\ntastee\nwilnis\nsotogrande\niccrom\nincredulously\ncavero\ntimewarp\nsiden\nyardie\ntwohy\nyquem\narchs\nrockafeller\nmuñeco\nretrogressive\nplinking\nlobotomies\nwildmon\nadlabs\nschrab\ngutkind\ncarf\nshaposhnikova\nloffreda\nindiscrete\nglanusk\ncrescenzio\nbaleshwar\nbombora\noverfeeding\ncasy\nanglicise\nsravanthi\nipswitch\nalwoodley\nappro\ntruste\nbastienne\nnurgaliev\nduble\nnonflammable\nadeniran\nkinglist\nshouyang\nnbcolympics\nsetterfield\nmugdock\nsayasone\nkurosu\npoventud\nmonchegorsk\nrobonaut\nparamours\nielemia\nbiscardi\nseppänen\nlewan\nserait\ngaffs\nnafti\nemplace\ngogel\nstratégique\ncomida\nmoodys\nloutish\nvarshney\navy\nunerringly\nprefering\nonie\nschacher\nhimmelman\nquater\nshadd\nspyrou\nprah\nwigdor\ntskitishvili\nsharable\nclcs\nkbh\nhiwa\naleady\nhrdlicka\nnkole\nmotyl\nucpd\ncroaks\ngazela\nnoddle\nsoysal\ncyberarts\nmoshinsky\ncorporatization\npremarin\nmilgate\nholdco\ndnestr\nhurel\ndiat\ngardened\njetton\nsoref\nfangzhou\nmaguari\nmuno\ninforce\nyeaton\nstant\ng
esser\nlenardo\npigbag\nbikey\nsymbolisms\nhashizume\nsuperwasp\ndampener\nmegatrends\nkapllani\nnitazoxanide\nmatsson\nlooky\nmeeuws\nposeurs\nprendiville\nincriminates\nbidlack\nlayby\nspady\nmuscio\nhighborn\ntongaat\ngibside\nalhat\nmulyani\nroughage\nraffa\ntalkradio\ndomizzi\ncpic\npnrs\neringer\nmaslanka\nserviette\ntimoner\nmoseneke\nfunmi\ncarrels\naadb\ndierking\ntein\nterracycle\nkardan\npaffett\ncautery\nvaccinology\ngubb\nyesayan\nguarenas\ncrathorne\nabetment\ngellatly\nkiros\ndagga\nshearn\ndroogs\nhooi\nrooper\ndiagouraga\ndälek\njalani\nwalchensee\nmonita\nwarith\nwigfield\nwice\nkikka\nabera\nyonosuke\nroese\nchoummaly\ntafolla\nqgp\nlautier\nringelblum\nnovakovich\nkamalesh\nperriand\nsimpletons\nskimboarding\npujehun\negot\nwheedle\nwarshawski\nevanshen\nabdelnour\nhaiman\nrscm\ncichon\nrodenticides\nmattering\nvolvic\nlinebacking\nxianwen\npadano\nchikyu\ndeaden\nmilic\nsiems\nyellville\nbrinn\nstockist\nmckinna\npdca\ngranelli\nimipenem\nrahme\nbritcar\nsakashita\nwium\nramelow\nmassam\nffion\npbxs\ngreber\nwidger\nshoreland\nbiogenetic\nwytch\ncalio\ntolzien\nyemm\ncusumano\natliens\nmccague\nforgy\nchertkov\nartesunate\nsiaosi\nstrathtay\ndills\nprecuneus\nvalenzano\nsouthcom\nperahera\nsandfish\ncausas\nbourriaud\nburtonsville\nbannocks\nstoppable\ngudjonsson\nsaxilby\nataqatigiit\nsommariva\ncommutations\npöttering\nanestis\nisted\nmugica\nhru\nprejudged\nboti\nlecroy\nwhitter\nbonine\ngorging\nrexer\npinchers\nkebo\ntradeston\nsomby\nwailed\npresbyteral\nhekmati\nwiebke\nhardan\nfkl\npenshaw\nnkala\ngomersall\nmarcolini\nhinkins\ncfk\nhibbins\nmultifaith\nrieckhoff\nkhanfar\ndelwyn\nchickenhawk\nhylda\nkreiz\nkibuye\nchiclet\nalwatan\nchegg\ntirrenia\nkhashuri\ngeniez\nlindoro\nsentimentalist\nhaseman\nwilderspin\nmckowen\nfeerick\nreforest\ndropzone\nvanger\nshillinglaw\nsatc\nalguacil\nhuntercombe\ngilovich\ngbg\nbasks\njaakkola\nlorine\nmcglew\ngrilo\nrevanchism\ntral\nbessi\npacaya\naptx\nlouann\njohnna\nhadjar\ndashiki\nmagisters\ngol
di\nburnat\netemadi\nbaric\ngunnarson\nnonis\nstepparent\nlentin\nmccluer\njaguaribe\ntumbo\ngerris\ndemol\nbonnen\nstockstill\nworcs\nhoggett\ngilpatric\nprym\nmancham\nsmallbrook\nassociados\nolasky\nhetzler\nmanfully\nmacas\nhirschbeck\ndubaku\nflunking\ncambuskenneth\ncalcifying\nvinas\nstaniszewski\ndiscussant\nbarnetta\neichman\nlousma\nboreyko\nmazzi\nfeir\nhamouda\nparkar\nfeddans\nperegrinations\nbanchan\ngephart\nnjord\nhanagan\nsugiuchi\nmisplayed\nkelang\nlamichael\nchaddock\nsauerbraten\nbonior\nkashia\nsutin\nworchester\nbochenski\nbilberries\nkarlstrom\nboukhari\nhochstadt\nsekulow\ncarfin\ntienda\nmetacom\nmidson\naloneftis\nbeechen\nbalnagown\nnaringenin\nleamer\neaddy\nkeyline\nockrent\nyanbin\nmarsay\nschiel\nairforces\nshiff\nkaltag\ngirlfight\nredrock\nharimaya\nslingo\ngonorrhoea\nslayden\namsterdammer\nlechon\nholmlund\ngraysmith\nkondratyuk\ndienstbier\namrabat\ndunand\nagwu\npaternostro\nkeuchel\noshun\nboxscores\ndaingean\nknighthawk\nperre\nprozorov\ngosia\nadoc\nchele\ncoronella\ndykeman\nkitaen\ngazzaev\nmcnown\nsonger\nmincu\nriseholme\nhippocampi\nidit\nmawei\ninon\npersonalizes\nständchen\ntobgay\niliotibial\ngonski\nfederov\npiia\ncardinham\ngirtin\njayashri\nmcleese\nkvue\nbimbos\nroston\nponche\nronel\nclavulanic\ncyclamate\nfordell\nscrutineer\nbonfante\nconchiglia\ndecanters\ngyang\ncommanderie\nmarambio\nnovruzov\naccessorized\nelemento\nprew\nlatroy\ncourchesne\nsoret\npetrological\ncherington\nciancimino\nwassim\nzifu\nhanie\ndrys\ntimberwork\nessentialy\nportu\npousadas\nbackache\nsedlar\nrippe\nnonspecialist\nwickstrom\nethylamine\nyohimbine\nbrv\nnorfolkline\npaeans\nwasko\nasiedu\nitron\nladu\nmaravilha\ninfelicitous\niets\nyanwei\nshenkar\ngouldian\nnavorro\nfayaz\nlmk\nknk\nmosis\nunsighted\nkrukowski\nwyc\nzoubi\nduchesneau\nklebb\nnabiyev\nrainsborough\nbadelj\ndemerge\njongkhar\nalvington\ndelana\nvelits\nqaidam\ngrunebaum\nselbie\nsuperjam\npeplow\nsikelel\nmaala\nshmuley\nsonification\ncasm\nmaasdam\nchidlow\nhunman
by\nrumaila\ngew\nannamarie\nikemoto\ngummies\nukba\nwoldenberg\naskrigg\nkamut\nyehoram\nmetion\ngesellen\nromaní\nflatback\nfluticasone\naffability\ntollington\ncnam\ndipasquale\nspatulas\nankan\nkarimloo\nhouts\nharger\nzhuangzhuang\nbattan\nulyana\nunpublishable\ndemostrate\nrlr\nchateauneuf\nchadi\nwakerley\nartistiques\nwsyx\nbodysurfing\nkamewa\nautoexposure\nhvd\nenosburg\nexcitingly\nelten\nfritzie\nuists\nsalha\nhessa\namref\ngyrator\ntechblog\nnordegg\nheuliez\nzenna\nbozar\ngreensward\nrisques\nmorva\nriggio\nunbalances\ncoplon\nmepham\nwarnapura\nmuter\nhymans\nparshin\nmindreading\nbenkenstein\ntunng\nbullseyes\nwashbasin\nderating\nkapinos\nrecapitulating\npurkey\nmorace\njanjalani\nczisny\ndisick\nhenllan\nfeichtinger\nlibano\nmflops\nkendle\nslotkin\nkokal\nmotorcraft\nhankyoreh\nmicombero\nmikardo\nhypnotise\ncohrs\nsayako\npunctually\nfaceoffs\nusian\nmukhin\nwadler\ncircumscribes\nrenay\ntiley\neuskera\ndnata\nobtuseness\nloncraine\nbuddie\nbullshitting\nmazzello\npattersons\nwolyniec\nniddk\nodel\nauki\nuthr\ndecertified\nkrejza\ngubernator\npflüger\nshenhar\nmargallo\nweidenbach\nkorniyenko\nroundedness\nhersee\nfango\nrspo\nstellwagen\nchande\ngittes\nrovell\nenthuse\nsanel\nbanalities\nrotton\ngeissinger\nvanderlyn\ncyberlink\napostolis\ninduration\nkindertotenlieder\negalité\npuda\nignatia\nkrass\nmauves\nworming\nabysses\nbilqis\nwalderslade\nhowieson\nexpansiveness\nbadstuber\ndettmar\ndzama\nwifey\nlundon\ndecubitus\nmellowing\ntextualism\nbahla\nchiusano\ngetachew\nplinko\nclicky\nkudisch\nmoskvy\nambassadress\nauten\nriversides\nkavinsky\nseasonable\nchengdong\ntafer\nroelants\nsamh\nsuplicy\nslub\nextrajudicially\ndumlao\nevilly\nsonnett\npudwill\nposca\nnadella\npostpile\ncelestini\nattore\nsignon\noravec\ndemob\ndystocia\ngansey\nppmv\ndargent\npaananen\nterrasses\noctone\nkosolapov\nmgarr\nwilliard\ndickau\nsauser\nsouthon\nniquet\ncheslyn\nnancie\ntouchingly\nwaddock\nmcteigue\nunobvious\nkuah\nmasseurs\nmsba\nords\nlangney\neddles
ton\nwaldhaus\nashill\ncomision\nhaberkorn\nequivocating\ndabhol\ndowlen\ntrysting\nlagerbäck\nearby\nmalave\nbarzman\nnursling\nwoodhey\ngrabovski\naiono\nflacks\nmydd\nijtema\nloueke\nrazek\njunkets\namidi\nchrebet\nmaghery\nbereit\nyokels\nboatbuilders\nraese\nsfca\ndarine\nsteinbrueck\npomes\ngorbanevskaya\njeck\nsobti\nturque\nunremembered\ngloop\neffron\nkindergarden\nceyhun\ndyett\nskirvin\nboeri\nsibneft\nyaqoub\nblendon\nkritzer\nreyat\ninquisitiveness\nvalmir\nsnickering\nellenson\nyudi\nwendla\npremalignant\nvallehermoso\narner\ndecimalization\nsuuri\nmakie\nbuerkle\ndecoster\nhurtled\ndealtry\nreddest\nxochitl\nmhlongo\ndelanie\npalsson\npelu\nhoedspruit\nhumam\nkarponosov\nbrinke\nbicycled\nkennie\nspiering\nkillay\ntakac\nbbcso\nlentivirus\nfinishings\npouter\nmetalware\noverdid\nchloromethane\nuahc\nzimerman\nyinan\noverdeveloped\nunauthenticated\nskeoch\nauchtermuchty\nsifts\naghia\nsanlitun\nocoa\ndiscused\nscupper\ndanet\nunburdened\nschwedler\ncircumferences\ncarolynne\nanouska\npipedream\nefavirenz\nmacroeconomy\nabib\npeaker\nstanczak\nmitoxantrone\ngrantmakers\nmisbranded\nplink\ninnoventions\nibms\nbenington\npocari\nrishab\nunconcealed\nstaelens\nhetar\nevgen\nmandrills\nzuno\nmicrodeletion\njirka\ntorchon\nhavea\nallesley\nunusal\nhardebeck\npead\nsimplement\nteetered\nanonymising\nvoudouris\ndaerden\nbaldivieso\nporrello\nkuzin\nminea\nmarketization\nchallanged\nreichling\nmercereau\ncovino\ntonry\nblute\noveremphasizing\nbeever\nsuperhighways\nifixit\nhuestis\nikramullah\nbakradze\nmoussavi\nnikolskaya\nstrakhov\nvelonews\naxcess\nphibsborough\nparadisus\nnimeiri\nvashistha\nhyperbilirubinemia\nvrv\ndaara\nrahnasto\nabacavir\nsulphite\ntranquilized\nreseeding\nrespirable\nwrighton\nfenzl\namonte\ncordery\nrreed\nsculli\ncanoed\nhussen\nquess\netem\nteruhiko\nbendus\nexperince\nhegemons\nschnauzers\nnassiri\njurrjens\npapé\nbagworth\ncuckolds\ndeisler\nniederhauser\nfuquan\nbrissenden\nhoho\nbroadhaven\nmakley\nsibthorpe\nthumpers\nsperaw\n
chibber\ndzau\ngalimov\ndeleveraging\nrmw\nmudang\nnonscientific\nexculpate\njussy\naikau\nperelli\nglobis\nnienstedt\nirredentists\nbrul\nmcconnellsburg\naascu\nhencke\npaltalk\nqianmen\nkidzone\narborvitae\nmgg\njakubowicz\nloxford\nsilton\nmcsharry\nmasika\nniedecker\nzucchetti\nfimmel\nkolton\nkontakte\ngundling\nvalaya\ndobles\ncitro\ncantaloupes\nllanrumney\nfiap\nstebonheath\nolkaria\nitches\nlawhorn\nflatlander\nruari\nyousry\nkoskie\nenjo\nnsai\nzylberberg\ntião\nhotpants\ndetainer\npanettone\nmarou\nwjec\narrindell\nlkp\nmadacy\nscanzano\nstuden\ncataclysms\nsharlto\nstwc\nyambo\nlurd\nconason\nbaliani\nletton\nanderman\namper\nfruitmarket\nsecuritisation\nphenoms\narmoy\neuobserver\noveruses\nlohnes\ntainio\nturse\nouf\ncardiotoxicity\nominami\ntichina\npodrazik\nphn\nbosu\nfunaro\nschuerholz\nmacaroon\nhilferty\nschoolrooms\nperttu\nsoftail\nembajada\nakehurst\nnimura\nverdell\nabourezk\niwg\naiona\nunipart\nmaggies\niken\naecs\ncrisafulli\nsavara\nwezel\nkauder\nmidgett\nbuttolph\nmarkree\nnkhotakota\nleats\nlaydown\nharkless\ntesuque\nmphasis\nproably\nachuar\ncrosiers\npregunta\nplishka\nlineham\ntalkshows\nphytotherapy\nschlage\nartprice\nbrandenberger\nbrassaï\neviscerate\nprofet\nomes\nensberg\ngarreth\ngoair\nintraparietal\nimpregilo\ncesg\nbisutti\ngrimsbury\nconsumptions\nwhetter\njeanmaire\nlarian\ncambi\nlsds\nglaspie\nradermacher\ncarulli\nambx\ngoil\nharmoni\ncarndonagh\niolas\ncoloradans\nheisel\nmurofushi\nnikesh\nyouqing\nlamorisse\nragout\nischigualasto\nzutty\npennywort\nchurl\ntvardovsky\nunprofessionalism\nmapletoft\ngianpiero\nkimia\ncleanings\nlikeability\nkilu\nrorion\nbellbrook\nmetelli\ncastletownbere\nbarcellos\nquadriplegics\nmadugalle\ndalham\nsembcorp\nfiascos\neket\ncosmosphere\ncraftsperson\ncaseworkers\nbaseness\nentasis\nfamiglietti\nbeniwal\nmanches\nuncrowded\ncalis\ngoutham\nliquefying\nbrocq\nlamonts\nedwinstowe\nlamplough\nbeston\nfurtiva\nbleicher\nclunker\nameet\ntanney\nbunnag\nkolodziej\nkondi\nputtock\nwiehle\ns
uketu\nschweers\nteneues\nlnu\ntakahira\ndisempowerment\ncby\ngarad\nminyard\nmarkopolos\nmidcap\nnerlinger\nanshutz\nautopolis\ntravaglio\npuréed\nexl\nrezaul\njunquera\nishola\nthermostable\nsudarsono\nsxi\nvaginismus\njovani\nqik\ndisingenuousness\ndespoiling\ngrayden\npoliticise\nillegalities\nreconfigurations\ncarabajal\nromanchuk\nputeoli\nwhistlin\nmuriatic\nscalers\nstroppy\nshoudn\ntuija\ncampain\nmirena\ndebeljak\nlaulala\nsaphire\npepín\nkoroi\ncastoreum\nbirecik\ncutman\nfaleomavaega\npapilledema\nlannie\nchizhov\nhinny\nbrefi\nrabbe\nbrunansky\nbajas\nkaleida\nyuschenko\nmurin\nknuckleballer\nuncommanded\ntetchy\npastorally\nlfw\nkeetch\nketeyian\nswearwords\ndayrit\ndreesen\nlimpieza\nlisanti\nkremers\nadrenalize\nibg\ncarius\ncisr\nresponsibilty\ncetirizine\nsuspenstories\npanduro\ntharthar\nnazare\nmuray\nbarbarito\nwenr\ndreisam\nqiming\npanitch\nintubated\nstorycorps\nkalbfleisch\ntabbing\nwowwee\ncutlip\nselston\nkantonalbank\nspol\nhirudo\nunmarketable\nhasley\ngallinae\nashkin\nkanamaru\nlovullo\ndimitrijevic\ncoppelius\nretransmitting\nacnielsen\nextroverts\nyough\nkenon\nsohaib\nfuson\nmeskill\ngoltv\ntrautwig\nsedgeford\nmcj\nrefuelings\njaks\nxex\nbanegas\ngrauniad\ntianya\nschwartzwalder\ngrooverider\nmarmer\ntimeshares\nimation\nhapka\npsyd\nlaurila\nmuncher\nestin\nsteamroll\nccba\nlorenzino\nsepideh\ndecoursey\npadoa\nconed\ndapp\nnewbigging\ntribbiani\ndehumanising\nsententious\npogrebin\ningenieria\nkerne\ngreenroom\nscanavino\nmalkhaz\nrosete\nanthills\namdur\nleidseplein\nsheerwater\npointman\nlovelies\nstreat\nnussey\npasteles\niber\nmahlasela\nschnapp\nsegalen\nfarveez\nchudinov\nstepashin\nblaspheming\nuproariously\nclimacteric\numcp\nfranich\neighe\nwitta\nsè\nfearns\nghulja\nbheja\nairspaces\nadolfi\nshankle\npilfer\nshanmugaratnam\ndenesh\nnobunari\nsantero\ncapodichino\nmofatteh\nprobenecid\nhoneymoons\nentender\nsnowboardcross\ngoudstikker\nsharar\nyasuhide\npoundbury\nlisiecki\nforseen\nkunkle\ncackles\nsharik\ncropwell\njun
tunen\nlueck\nganti\nmutterings\nsemington\nfriargate\ngamze\ncrymych\nireson\nspectroradiometer\nbuzkashi\nbusyness\nbioplastic\nflightplan\ntenaha\nkely\nmoaned\nidriz\nmitcheldean\nferozeshah\nepoetin\nrokas\nakora\nkeltic\nstewarded\ninnervisions\neleri\nqueenwood\nwillmer\ndibromide\nbréguet\npickelhaube\nsolaiman\nportanova\ncasley\njanita\nreincarnating\nsteppling\nbresso\ntransferor\nllandough\nwoerth\npkf\nescwa\nbatenburg\nwatmore\ndrissa\nmayeda\nbaab\nxao\npapantoniou\nringler\ngelernter\ncervone\nplaydom\nmashiah\nputtick\nlobão\nasperity\nmrbm\nukeles\ncontinuosly\nchewin\nsolidarnosc\ncollegues\nmallmann\nkohlsaat\nredcurrant\nimmunodeficiencies\ndymally\nmultiport\nisaach\nmanacled\ninculcates\nconstitue\nboggled\nwite\nterrytown\nmacki\nrenney\nmohon\nnopd\nvavrinec\ndrillings\naftereffect\nmasino\nshortsightedness\nzhambyl\ndindane\ngfe\nwayburn\ngwot\nacclimatised\nrarified\nshimoji\nstalemates\nvirts\nwena\npilocarpine\nkeyspan\nhomebrewers\natilano\nkeiths\nmaggart\nmastopexy\norane\nturcan\nrhisiart\nheatons\ntaqa\nbobbies\ndefoliate\nshibam\ncdfa\nbankwest\nncbs\nkindliness\nashlie\ndromaeosaur\nwranglings\nhjalmarsson\ndharug\nboyson\nquintupled\nadjudicatory\nwarrawee\nprefilled\nopenair\ncandlenut\nmasqueraders\nultrastar\nalmac\nkoromo\nsopris\nrobak\nhispasat\nlivecycle\npheonix\nlièvremont\nmarlis\ntrela\nostrosky\npflimlin\nxiaochuan\nelgoibar\ncalypsos\nmarcó\nandale\nwretchedly\ndunera\nformans\nalbio\nrinaudo\nsirigu\nreiners\noverhaulin\ncaravello\nmovieguide\nbriese\npoidevin\nsuryadi\ngrewcock\namiina\ncathi\nmoiz\ntaubenberger\nbossio\nobraniak\npriviliges\nupholsterers\npermanant\ndavoudi\nbrokencyde\nmouses\nmythologically\nignatios\nvardenafil\nratlines\nholtkamp\nazran\nrecirculate\nbenussi\nsimonsbath\nsibutramine\nabarbanel\ntabulators\ntakeup\ndisrupter\nbobber\nwarwicks\nladykiller\nthupten\nserac\nvisored\nfresnos\nvideira\ncharland\nkowald\ndhalia\nknowledgeably\ndamarcus\nsniffy\nappoggiatura\nherrion\nunprogrammed\nra
diocommunications\nrockview\nretrievals\nseabeds\nparamilitarism\nredhills\npevear\naphasic\nhauliers\nclawdy\nspirt\nparnitha\neyraud\nfreinds\ncosgriff\ncyra\nryes\npanetti\nanyday\nphw\ninroad\nchelsom\ncampanis\nlifecasting\nruhle\nsnuffleupagus\nadlan\nmordden\nsasamoto\ndelagrave\ndauphinois\ncompletists\nfarai\nszanto\nkeogan\npiga\nglared\ngoyescas\nngon\ntorahs\nnelio\nhendrikse\ndigicam\nbaburen\nsrbijagas\nvanak\nramsland\nstanningley\naustine\nbelsay\nmicronuclei\nvespignani\nchukhrai\nhellers\ncalipatria\nasherson\ngotbaum\nshangdong\noshio\nobernai\nmérieux\nujc\neede\nfirming\nindycars\nculverwell\nbirindelli\ngoodey\ndjou\nnorbreck\nrazorbills\nmccartneys\nkonrath\npetroski\nbilles\naronica\ntaubin\ntimegate\nserratos\nfeiz\nencryptions\nesterline\nlabey\ncorestates\nembrey\nchantay\nmidamerican\nsiedlecki\nelyashiv\nafros\nzaxby\nghassemi\nhexed\nkaroui\nmasae\nneidpath\nnapped\nmorgantina\ncliquish\nboystown\ncowhands\nutopic\nrosenbauer\nghezzal\nditerlizzi\npawing\nurbanists\nmcenaney\nshawon\nreassigns\namphenol\nponticum\nlambrate\nlanghorn\noanh\nbisby\nlindheim\nchakdara\nstefaan\nfroxfield\nabati\nkoner\nlabus\nfrankenburg\ndihydrochloride\norzel\nmccarthyite\ntjeerd\nchomps\ninvestigacion\nsifuentes\ntrenor\njoycean\nsmilers\nfrax\naerolineas\nassadi\ncorleto\nzaslaw\nmansally\ngeovanny\ngovanhill\nhooiveld\nifaw\nagyekum\nwote\nclamoured\ncounterintuitively\nacidifying\nsudin\nnerka\nslatin\nmilmo\nrossif\nshamberg\nskloot\nfilderman\nwenhao\nrecategorisation\nfrech\nhospitalist\nsalko\nyaohan\nuncomprehending\nwirawan\nmottisfont\nstrikas\nmeschery\nwoodlanders\nshian\nollivant\nklowns\nwingen\ncalved\nbours\nrasmusen\nwenjin\nomnitrax\nhelotes\neuropass\nmatchpoint\nmarziale\nweizhou\ngtech\nsudipta\nbockarie\nmynarski\nbeheer\nlubricates\nbitterling\nthisis\ntoothill\ndohrmann\ntianwen\npikus\nfollini\nsweetbriar\nweinger\nxinhe\nsahebganj\ngraser\nlerach\nsuffragio\nspinnakers\nshirked\nanvik\nneria\nhalangahu\nbalasingham\nwassan\npen
growth\nbusload\nfrequenter\neckes\nmacrumors\nturanga\nmédicos\nraemon\npunchdrunk\ntuca\nyealmpton\nundy\nbirdsell\nbradway\nkerckhoff\nprisk\nkimitsu\nfelicis\ninactions\nmihos\nucan\nmariages\nlekota\nrunscorer\nmukham\nkeresley\nmolsky\nwanlockhead\nunreasoning\ninhumanly\notterloo\njamies\nulliott\nsurhoff\ncityside\nchutki\nchukwudi\ntoosey\nviss\nhimma\ndifferents\nnorthlink\nvinoy\ncuni\nsnaky\ntkc\nnolly\ntaliya\nmetioned\nbuder\nekos\ndeviousness\noverblowing\nmordente\nnouble\ngazzard\nlachemann\ntigua\nnsubuga\nokitsu\nquickplay\nbakas\nfindability\nabdulali\nburpo\nkowtowing\nvidro\nchambery\ngiorgadze\nreinman\nndia\ncolstrip\ndiscomfited\nledgerwood\ndagomys\nhattfjelldal\ncolruyt\njammie\nsilkmoth\nbonza\nreattaching\nkibria\npeyronie\nmahsud\ndatteln\nsemliki\njiaxin\ningos\nhelius\ncogwheels\nkhizanishvili\nyefimova\nremm\nsentimentales\nlihir\nheartsong\nhitchcockian\ninterrail\nawni\njabarin\ncriddle\nberwin\nrhinorrhea\ninvestable\nnurlan\nleithauser\nhoorah\nnaburn\nlengthly\nmizque\ntottel\nvouchsafed\nciriello\nhowver\ncuris\ntoribiong\nfaiq\nkingson\nisaza\nmudhar\nearlies\ncohee\nrombauer\nfelisha\ndiversidad\npazcoguin\nyeend\nchateauroux\nzaton\nabubaker\nmiking\nprocrastinator\nabdulsalami\ncordaro\ncliffy\nterrano\nbeefeaters\nretrials\nharleigh\nbluelight\nmeddles\nrockmond\nignoramuses\nstahler\nvaporous\nmonye\nnissinen\nkathrada\nborrett\nruhul\nkuperberg\nbiglari\nsarty\noueddei\ncatenaccio\npambula\nsaltimbanco\ncncs\nsuccintly\nstrollo\neulogize\nlittlerock\nsomersaulted\npetruzzelli\ndimmest\nfreestylegames\nhaspiel\nwamo\nscte\nsajani\ncentereach\nepcam\nbrisingr\nnanoprobes\nswetland\northodontia\nunderexposure\nhcca\ndevient\narkivmusic\nturnipseed\ntamaddon\ncatchup\nsilkscreened\nlohmeyer\nnoncitizens\nvidra\npublicmind\ncalpine\nemigh\nesenin\nredcastle\nparasitical\ncansei\nragu\ncastmembers\nbureaucratization\nvarady\ncosmically\nraffinose\nharvell\nyoxall\nshatterproof\nvorel\nallografts\ndenshaw\nmachesney\npapaverine
\nliebestod\npotapenko\ndelancy\ndubina\naustrade\nunprepossessing\nrigters\ntafur\nghelderode\nduelfer\nrendcomb\ncompetion\noverexcited\ntitfield\njiving\nhustles\nclaysburg\nkessie\nfaultlines\nbogale\nlanthimos\nsudamerica\nrunnicles\nwildebeests\nreposado\nbutternuts\nserba\njacumba\ncompartmentalize\niols\nappiano\nbeiteddine\nnaxals\nlinnemann\ncossins\nscholastically\ncahalan\nwestons\nwambui\nscheu\nnidec\ncasabella\ngirdhari\nsakra\ncreativeness\ndilks\nspeckman\ngaly\neverychild\nbadsworth\nmimika\nahwazi\nkeswani\noxyacetylene\naskeri\narendse\nmcelveen\nmicronized\nporcupinefish\noverburdening\nmagnanimously\nellagitannins\ndatastore\nwoodhorn\nhairdryer\nbeersheva\npiq\nrapine\ngleamed\nrigshospitalet\njenners\nsidereus\nboritt\nmonoglot\nornithorhynchus\nlawrences\normand\nnonrandom\ndscc\nillest\nvpe\noxtoby\nplascencia\nbanyon\nnayon\ncarryduff\nufg\naligners\naafp\nnonsensically\nwragby\nbamir\nbico\nmankowski\nholgorsen\ninformaton\nhne\ngasparino\ncilfynydd\npbe\ntorgerson\nwinchburgh\norya\njajuan\nloredo\ncanevari\nuthayan\nbiorefinery\nmirebalais\nbogata\nadss\ndogfighter\npancytopenia\nimprovident\nconservativehome\ncarlotto\ndioner\nastrobiologist\nhandorf\nriposo\ncleadon\nleyda\nmoquin\nsickler\nroozbeh\nteevee\noneiric\nfeoktistov\njasmeet\nbusybodies\nphotis\nunceremonious\nrepetiteur\ndatel\nbiguine\ngiacconi\nminneriya\nmultiprocessors\nmwyn\ngateley\nshortchanged\ngraul\naggrandising\ncarolini\nmisguide\nstrickson\nswagg\ngilliatt\nmauga\nkurten\nraeside\nperseveration\nlavori\nleandre\ncloakrooms\nexcipient\nkytx\neniola\nbuic\nmasferrer\npessina\nsbobet\nninny\nmatsunaka\naidi\ndiscounter\njepkosgei\nwingett\nkoong\ntriodos\npanayotis\nlavy\nmiggs\nbataar\nwbng\nzinna\nplacentas\nmelekeok\ntraceur\nxrays\nurokinase\nyekhanurov\nminooka\ngastel\narisan\nmohen\ndederer\nclaybourne\njazan\nmccarl\nsekera\nrhigos\nproprietorships\niott\ngaiser\nfunción\nmuson\ndriesen\nnutsy\npellentesque\nverbalized\noperacion\nseeiso\nunheroic\nfinber
g\nxiaomei\ngermanischer\nziti\nnechung\nsarft\nrera\nwarhola\nmeshaal\nfrykman\nabakaliki\nliveness\navana\nsatbir\nluetkemeyer\nbinghampton\nsexualisation\ndreamchild\nzwerling\nwhnt\ndalmally\narashiro\ncoaltown\ntenía\nboushey\ngylfason\nfincas\ngeiberger\ntohmatsu\nsharrer\nkazyna\nbolasie\noildale\nenlivens\nwicketkeeping\nsuperstrong\naselsan\ncountrysides\nelastics\nkojic\nmycoides\nplagiarise\naffiliative\nkaryl\njinghui\nbioengineer\ngogar\ntenconi\nmicachu\noex\ntaean\ngunville\nrejean\npaltiel\ncipp\ncdms\nkirgiz\nhuifang\nshuld\nsomiedo\nofferton\nballarin\nfreidman\npalpitation\ngrouches\nvant\nragdale\nsortation\nwystan\nruhama\nbaizley\nkaboodle\nmissiroli\nafpa\nvennard\ndwts\nbabajan\nevon\nsizzlin\nguanylyl\nunhide\ntubbataha\ntriphosphates\npaperwhite\nverveer\nbaiza\nkrivine\nmassel\npolzin\nharpending\nirreverently\nmmtc\npasteboard\nbryanne\nblanquette\nramadhani\narbatov\nfirmest\nmovado\nvotevets\nsjo\ngrazalema\nstert\ndishonoring\nmorays\namortizing\nclintonians\nchunked\nfmh\nnykredit\nschermer\nchaloupka\nsporza\nbeitia\neickhoff\ncounterproliferation\nnorcliffe\nlongcross\ndiltiazem\nboatyards\nbrandings\nbarlinnie\ntreaded\norum\nkohnstamm\nringwraiths\nclipless\ncragside\nepri\nsinatras\nyazbeck\nnilofar\ncajori\nlandu\nbéhar\nstrikemaster\nshefki\nriccia\ndeehan\nunvalidated\ntulear\neremurus\nchafetz\nosteogenic\nplyometrics\nbarnas\ngiacinta\nrubbished\ndpmo\nfereday\nearswick\nfirebrands\nwpec\ntual\ngailes\nbudged\nknettishall\nunnameable\nhonorata\nistabraq\nhadopi\nkremerata\nhermantown\nhollender\nwaitstaff\nudeur\nradinsky\ntaibi\nwhipworm\nwijetunga\nburster\nhunchun\nmaybee\ngymru\nwunderkammer\ntressell\nohly\nfluorescents\nantifascists\nbudka\njavnosti\ntransafrica\nraskob\nhsy\nhirak\nsauver\ngangitano\ntafuna\ntelmarines\ngaboon\nhibler\nperseid\ncastlecary\nensnaring\nmontse\nontake\nbrookstein\ngarua\nseders\nscotched\ndionicio\nbasavanagudi\nkloppers\nrathnew\nesac\ndougald\nlittleover\ngadzhi\nexaudi\nmerchandises\n
roid\nlumpenproletariat\ncouzinet\nswarn\nbobos\nherze\nbehrs\ngajendran\nkompa\ndamnably\nwiling\nsros\nstepdad\nyeoville\nnerida\nuntermann\nqueenswood\nghionea\nvolatilities\nqueensryche\nflvs\nslingbox\nabrosimova\nbedrich\nbillown\nplently\nshitrit\ndunkeswell\npursel\nkahla\nturves\ngunnera\nkrumping\nbozic\ngillaspie\nshiyu\nqueller\nfairall\nbodek\nstrongwoman\nwhataburger\nsalivating\ningrao\nmolted\nserat\ngeter\nrungius\nbartu\nfopp\nhadler\nblackduck\nbenoa\nvph\npianosa\nsikua\nradamisto\ncwalina\nclearways\nfamilymart\ngoverdhan\nrecommission\nknaap\nlattitude\nsealions\naswin\nemani\npesar\nglennis\nseepages\ngullette\nkesling\ngonia\nayto\nbehenna\nspab\nlakshmipathy\npablum\ngoodmayes\nbahaman\nmurren\nfrogging\nmagaldi\nkarolyn\nhareb\ncrinolines\ncreuzot\ntachtsidis\nunpatentable\ncissbury\naquarama\ngulps\normsbee\ntaboga\nsorlie\nliljefors\nosmel\nmicrotech\nurney\nurpo\nviglink\nworra\nsonequa\nnehgs\nclarabell\nreschke\nsemshov\ntushnet\ncofie\nstambler\nsayeh\ngatr\nvidia\nmorrazo\ntalty\ncollingtree\nsewel\nparkesburg\nhiway\nmicroloans\ndorsomedial\nvacationland\ncicle\nflambé\nbarle\nnyers\ntononi\nrucksacks\nkurlander\nbecames\nenunciating\nhifu\nvibia\nmissle\njihua\nfedorchuk\nbokros\nmatsuya\nirinotecan\naudiovisuals\nbjcc\nvecchiarelli\nrehberger\ndemartino\nmobileye\nmascia\nmcgarity\nrjb\nnimrods\nswindlehurst\nkalogridis\nethnocide\nsancocho\nmerevale\ninbreds\nbaharna\nigbts\nholtorf\nluczo\nthermophiles\neulas\nestan\nfragging\nshebeen\narii\nfoodism\nacip\nhaematoma\nmakdisi\nrocknrolla\nhaffenden\nefstratios\nbugbears\nrueff\ngrazyna\nmoignan\nsantarem\ntinku\ngardone\nbbbb\nnstar\nundesirability\nfouch\nslaveowner\nkoonin\nzhenzhen\nwaylay\nveillette\ngimje\nmerrilees\nzut\nkujtim\nzondi\nedgers\ncosmogirl\nprady\ngronlund\ndieticians\nmazowe\ntaggle\nsawy\nmarsalforn\nvrf\narrogated\ngarven\nhourlong\nrungis\nhaemostasis\nsoliz\nhkex\nmargaretville\nkinlaw\ngajjar\nunlf\nregistrable\nblankfein\nharkema\nirfb\nbakersville\nwrv
s\nlimbers\nlke\ncondron\nsibrel\npokphand\nfootrests\nmidnapur\nchimneypieces\nwhittles\nnnrtis\nwytham\nseferihisar\nplastilina\nsawad\ninotes\ngorgoni\nensnares\nzehir\nnobukazu\nattender\nnuseibeh\ntylorstown\nheldenleben\ncaergwrle\neliades\ntarlo\nnhek\nfriendfinder\ndelrio\natriums\nmsft\nhyperprolactinemia\ngjm\njarel\ndefeater\nevangelised\nmelva\nbouchardon\ncolubris\ntuson\nkennedale\nqanuni\nchamula\nappology\nwienand\nuncrossed\nnedved\nunmistakeable\npatzcuaro\ncalverts\nzhijian\nlizars\nhecatomb\ncapillas\nrecoinage\nruyton\ngenecards\nsnuffbox\nminuti\nsanclemente\nfilmstar\nlawgivers\npqs\nnadelman\ncomplutensian\nholmbury\negizio\ncinephiles\nbloxom\nhoani\ncomitted\nmicroscopii\ngunalan\ngosa\nbandslam\nbradbeer\nfirey\nlisanne\nomundson\ngaladari\nresizable\nobrigado\nceratotherium\npohjonen\nherskowitz\nramezani\ndisadvantaging\nfeebleness\nriskiness\nimparato\nanaleigh\ncanonicorum\nhmie\nelswit\nbeeen\nhardenhuish\ngranito\nkandler\naltor\ncindrich\naaiun\ncfif\nindustriously\nosbaston\narifi\nretreads\ntrumka\nsharpsteen\ngawne\nruders\nhurtwood\njilbab\nfredricksburg\nthiaw\nuchikawa\nbaozi\ntarlow\nkarran\ncalbert\nmitac\npomezia\npuhua\nhenrichsen\nnaparstek\nborovoy\npersuing\nifrah\nastrum\nvolny\nconceptualising\nmuslera\nhindalco\ncolegate\nhidemasa\nnyamweya\nclayesmore\ntailbacks\nisbe\nwainthropp\ncacciari\ndoorne\noxiana\nyasujiro\nxeroxed\ndetc\noverspent\nteradyne\nbachtiar\nzaleplon\ngipton\nmerita\nkitesurfers\nshirasawa\nboilly\nnegress\nszot\nairtanker\nankama\nlignocellulosic\nthouless\nacai\nkosmala\nzaabi\nmorchard\nfairhall\nshtetls\ndicynodon\nbredenkamp\ngormless\nhighdown\nmoncler\npostdated\nmarquita\nherrada\nentenmann\nsharmas\nemori\nedyth\ntounge\nwesternizing\nbreakdancers\ntashard\ncolico\nlouisette\nshakh\nroudebush\ntabin\nxeriscaping\nposterous\nunhidden\nvaly\nmccluster\nkoppell\nmindo\nallegretti\ndullard\nhaith\nnooruddin\ntorchia\narpan\nvaste\nqss\nsmailholm\ncarvill\nmeenan\ntanko\ngopalapuram\nalbanel\
ntudjman\neesa\nknr\ndarijo\noutmaneuvering\nalviano\nlekima\ntrancas\nwaddams\nbhagirath\nrafaello\nlendrum\ncrocketts\ncalstrs\ntirelli\nshaps\nschifcofske\nporgie\ncmdb\nclappison\ntryna\ntunnicliff\nestampas\nmagrathea\nbuncefield\njash\ntwizy\ngreasby\nairconditioned\nfesco\norgill\naprn\ngoolsbee\nstashing\nwyatts\ninterboro\nhould\nhoffbauer\nmariem\nbarugh\nyuanchao\ndavidtz\nsywell\nstratify\nrudisill\nhirola\nmorstan\nstrew\nqayum\nmanolev\nperalada\nbarmak\nhemline\nschiemer\nswimbladder\nseatpost\nsmalleye\nmaheen\nkarhan\nshteyngart\ngno\ntrakas\nluxx\nwinching\nhachiko\nbosshard\nignarro\nfricking\nmourilyan\ngonpo\nfastidiously\nrosenlund\nsrijan\ntresidder\nbeerhouse\ndabke\nhourglasses\ngouffran\ntigerdirect\ncasolaro\nhadeed\ncirchetta\nstellen\ndonadio\neuroclear\nlichaj\nspinosi\nhevs\nrefus\nmillstein\nzanele\nrivara\nmcnabs\nkalmadi\ngradwell\nbezant\nmorrisson\nkuijpers\nupshall\nwarrell\ngurnemanz\ntarabay\nmucks\nindianness\nesho\nsileo\nyasunobu\ndismukes\nderogate\nhubail\nffii\nchelated\nbvu\neisenbud\nwric\nmacwhirter\ntomoji\ninformatively\nkaestner\nchoppiness\nfairstein\nadulterant\natlantia\nteall\ncabinetmaking\nchillen\neastpointe\nhomesites\nsercan\nruttmann\nmusharaf\npaasilinna\nschliersee\ntenderer\nalexandar\nusml\njawdat\ndrolma\nparikka\nswingate\nrockette\njeryl\nsweetpea\nuplinked\nbelchers\nrunje\njoannette\ncaru\nusfk\namyris\nhabituate\nbullers\ndelsea\nexult\nstokey\nmassiveness\nnelthorpe\nbladel\nhovenkamp\naveni\nvraiment\nwaagner\nintrade\nindahouse\nchichijima\nbofi\ngeare\nmchunu\nperedelkino\ntehmina\nhesford\npcj\nintoxicate\npopularist\nbalsley\nvcjd\nandrina\nasterisked\nhcw\nxto\nmontrichard\nnaukluft\ngreenedge\nosteopenia\naarif\npigovian\nclavero\nrauti\nigas\ntarpey\nmillepora\npratz\nkaufer\nseacom\nbozzetto\nbartter\ntariku\nismayilova\nsemmel\nreanne\ngiolito\nrhf\nrakestraw\ndeaker\noologah\neeek\nbandipora\ngefitinib\nccamlr\nclybourn\nmodularis\nmcgimpsey\nchurchs\nbranagan\nnijholt\npough\nalexand
rino\nflh\nantek\nslotbacks\nminium\nmuser\nthabane\nkilday\nmeskimen\nefua\ngordhan\ntrepashkin\ncandiotti\ninxile\nbothma\nqaly\nwindemere\nyagur\nmayerson\nellerker\nsaqer\nosteoporotic\nrolley\nneowin\ngoler\nkonger\nsawka\nkondrat\npogorelov\nschmemann\ngoodhead\nwhitner\noffiong\narachchige\nfardon\nsiggins\nhumungous\nbruschetta\namoc\ncerge\nyachimovich\ntechiman\ndarty\nwadworth\nfolse\ngaviotas\nchadburn\nrtrs\nbarreiras\nbacrot\npcca\nwendlingen\nrugh\nheared\ntruswell\nrosero\nkafeel\nhooding\nnold\nmarkina\nadme\nuofl\nmidnights\nairtrack\nskyscape\nbrookshaw\nonida\ngreeklish\nblmc\nchugs\nmisskelley\nwillford\nmukaber\ntidbinbilla\npclinuxos\nninefold\ntelecomm\nmarginalising\ngerstenfeld\nbluecoats\ncabibbo\nibizan\nmisogynists\nrahter\nshikotan\nlentiviral\nnantclwyd\neitingon\ndecrepitude\nbedclothes\ntrimm\nluzerner\nbenninger\nniman\nmezin\nmetzgar\nnansi\nburrower\nlavaughn\njiko\nkmex\nkudelka\nkalsu\ndeemphasize\nfreeloading\naacap\nhomering\nmmtv\nwriterly\ntevi\nwaterbug\ntrapiche\nguedj\npavía\nslithers\nsnobbishness\nmunnelly\nquartiles\nbisou\nmccasland\njdu\nsegares\nspoonfed\nbrassed\nblackfly\nliquefies\nvertica\nloevinger\nsukhanov\nmishkenot\nlangtoft\nadegoke\nskoch\nhidcote\nreflagging\nhummin\nfregate\nmahamoud\nminyanville\ncenso\nshuckers\nblankenhorn\ngrich\nwxtv\nlunder\ncursos\nbassinet\nkaban\nbestway\nakamaru\nnetbank\nfoege\ngîte\nkazmierczak\nloduca\nbvh\nputaruru\nafrah\nbrandstrup\npolymyositis\nnarayama\nharendra\npitas\ntrewhella\npefc\nsaveourseas\nrostraver\nmacklowe\nentin\nmatachewan\nzoromski\nbuccellati\nkupol\nboenisch\nwallonian\nlimond\nestro\niland\nnewcomerstown\nangeleno\nstevas\nnevenka\nkresa\nchiatura\nweve\nfootsie\nsorab\nitfc\nbehrends\nolsberg\nmilovanovic\nsovereigntists\npomander\nfibrinolytic\nluva\nmurvin\nmalaki\nyanacocha\nakarit\nwagonload\npdcs\nprostor\ncarotenes\ndimasi\ngustibus\ngean\nampelmännchen\ntresh\npremis\nwibsey\njeanny\nnonagenarian\ncutaia\nhilbre\ntywardreath\nveira\ntacheles
\ncollegedale\nsaadé\nwolo\ngodforsaken\npietschmann\nepigenomics\nbatool\ntrypanosome\ntchibo\nniea\nsuperstrings\ncieca\nbubblers\ncolectomy\nravetch\nrealworld\nkazel\ntollner\nperdre\nlyo\ntortellini\narbia\ngrinned\nwiskott\nwinget\ntishrin\nfeuillère\nballasalla\nhornbacher\nveiel\ntoughbook\nscrivner\nmbare\nchantrell\nhennis\nbanket\ndhalgren\nholum\nmalerba\nopentv\nmehlville\nynyshir\nimedia\nsuneet\nkenken\nenkel\nskycam\nvernita\ningelow\nmeeteetse\nawali\nfrigidity\ngnaws\nwoolacombe\nblankley\ntrowse\nzaozhuang\nlafeber\nthoughtlessness\nhexamine\ncombusts\napparatchiks\ngschwendtner\nodometers\nmoaz\nbompastor\nnonhumans\nalmos\ngemstar\nnocturnally\nthursley\nreclosing\nsublethal\nswaddled\nkilsby\nguedioura\nlcy\nnande\nbreezer\nlsst\nagoa\nasshat\ncardelli\nvulgarly\nchanakyapuri\nkros\nswoons\ntocks\npriyamvada\nhomeboyz\nkidulthood\nmashaba\nhospes\nflowerbed\nsany\npluscarden\njhan\nkayle\nmounter\ncaramoan\npalaniappan\nindividu\ngolcar\ncfoa\ncrrc\nhabas\njanga\ntrathen\nnetsch\nchavagnes\nkrief\nmobipocket\nfagbenle\nbaida\nanimality\ndoomy\nbioproducts\nkaaren\ntongogara\nhotez\nmckinleyville\nphotocells\ntrone\nfalkowski\nstaib\nshirreffs\nphocomelia\nberghain\nsumber\nkhd\nalio\nmansfields\nmanhunts\nfiscales\nwijnants\naeri\naxions\nvassilieva\nbonasera\niort\ngrillwork\nporteños\nmorely\nlamalfa\npapplewick\nmangahas\nscire\ndisputatious\nkgun\nmerley\nguthman\nassiduity\nadvertize\ncoursey\noosterbroek\npinchos\nrasikh\nclarifiers\nvouvray\nionides\ngrigorov\nhowkins\nmaddi\nexceedance\nloll\nbordelaise\nswiney\nakinbiyi\nisrailov\nwengert\nrocketeers\ncataloger\nqualifed\npaunch\ntrovador\nhyperpolarized\nshuddered\ncarucci\nfisons\nbuthe\ntrita\nlonewolf\nisenheim\nnicetown\nchudley\ncoarsening\nscurried\nevocatively\nvignali\nqmi\ndassanayake\nkivel\nbassaleg\nlodro\nbottini\nscoglio\nblurting\nkleon\nagema\nmatv\npirog\ngravitron\nhadramawt\nkvamme\nmillian\nmarline\nfoisting\ncudd\nkuras\nradheshyam\nuos\ngalani\nfluoropolymers\nbl
eakly\nemmaline\npennywell\nbluffdale\nadhikary\nanglicize\nhwo\nkonary\nduinen\napoligize\nconahan\nkundun\nséguéla\nwintz\nsnuggles\npeuvent\ndahua\nmonopolising\nbowcott\nsiemering\ndarier\nndadaye\ntychsen\nnmai\ntgh\nravoux\ntayto\nmaarouf\nmeji\nfrango\nradicalizing\ngeechee\nroughan\nbaglio\nbothel\ncherkasova\nbroadwick\nlinguini\nbuluan\nneshin\nilim\nkaii\nbudaj\nlvh\nrighto\nelstead\nreunifying\nbattipaglia\npiang\nbachelorhood\ntcho\namericares\ndobriansky\nfuroate\ntewson\nrocinha\nzolotaryov\nkreon\ntokmak\nchiotis\njacarandas\ncapitalia\nfunnelling\ncontorting\napor\ntiplady\ndejar\nelean\npontyberem\nzulqarnain\nunwaveringly\nmogale\npriebe\nyuanjiang\nleoz\nepistemologist\nbirnbach\nbarrowford\nnonreactive\ngodding\nbbu\nkisel\nqibao\ngrittiness\nthernstrom\nglooms\nsorimachi\noverthrust\nnatixis\ncennen\nsquishing\nwaesche\nztl\nsoku\nopos\nzolli\nkhagrachari\ntaybeh\nthiazides\nasayish\nkarole\nyudina\nmumok\nmaculinea\nwissa\nteabagging\nrambos\nhilter\ngranov\ndragun\nazzolini\ncaddington\nactully\nbugiri\nfinbow\nbechir\nbrentsville\nweaverham\nbahais\nccsvi\nallans\ncraighall\nkoussevitsky\nowenton\nreadmissions\nvidaurre\npancetta\nstantonbury\nforner\nshimshal\ntuyet\nfireeye\nlautenschlager\nribero\nblewbury\ngache\nmalinauskas\nsealyham\nelat\nboyen\ndragger\nhovercrafts\ndziuba\nrecordholder\nsenff\ndefroster\nrensing\nrehires\nscotlands\nlahej\njcdecaux\ndemario\ntotipotent\nregularisation\nzineb\nrachunek\nyonath\nkanakaredes\nhysterectomies\nmonoprints\nburgermeister\nkaranganyar\ngatski\nincidentals\nwesty\nsomatotropin\ncorsetry\nmaceoin\nintravitreal\nhultberg\nhastiness\ngergel\navik\nzarei\nallí\ncaskie\nbrimful\nadva\nverdian\nlokendra\nlawfirm\ncharreada\nfasken\ngreuel\nqueux\npaccione\nfukasawa\nhectors\npreventers\nxconomy\nnorsworthy\ncliniques\nwachtler\nsaiten\nundulates\nregaling\nreemployment\neastmoor\nreacquiring\ncalgon\nyra\nbeneteau\nvekselberg\nvinification\nwulan\nearful\ntaci\nberezutski\ntalkington\nboao\ngarzon
e\nmusbury\niisd\nfakhreddine\nhadzic\nfaithlessness\ncubley\nflightaware\nokd\ngubar\nritchson\nveneziani\nechard\nmidtable\nfigglehorn\nbrout\nwireman\nfictitiously\nhellerstein\nuntuned\nisono\nribadeo\nshahrani\nzeig\ndeshazer\nhaxhiu\nalette\nlarrison\ndetoxified\nwoodsia\nlôn\ncuchulain\nissimo\nsual\npagis\nsupercop\nlics\newaso\nrmbs\nstada\nnamaskar\nnanomaterial\ncrowmarsh\nefv\nmuath\naeolia\nvilcek\nropley\nloizou\nkupferman\ncanutillo\nespin\nraggatt\nthater\nmaydan\ncontas\ncsbs\ndeeyah\ndrls\ndoua\nhince\nbritisher\npostherpetic\nwjrt\nastatke\nbillis\nbonghwa\ncebula\nbenskin\nreneé\nelvidge\nharleysville\ninverdale\nrythm\nnaohito\newerthon\nflyboy\nlevington\ndumble\nliapis\nstrausbaugh\nmierda\nsnowplough\nsomera\nautobiographically\nregrows\nihome\ndisfigures\nfruitcakes\neglish\nrhain\npotentiating\nanodised\npaleness\nmabbutt\nnickelodeons\nclassier\nferrán\ndiuca\nmorlet\nwdbo\nigdir\nlipomas\nsadbhavana\nschelte\nberdichevsky\naarya\nschlitterbahn\nkyre\nadada\nscimeca\nmiddleby\nmakkum\nleetspeak\nlegislates\nelkus\ntromso\ncraigton\nlistenings\nazaan\nchampfleury\ncarcillo\nparbandhak\ngoonetilleke\njpj\nmenelaos\nhoteles\norfalea\nbackoffice\ndissuasion\nkouao\nbrunker\nantiquarium\nsoskice\nbryner\nmuscleman\nlacework\ngafcon\nunviewable\namarone\nterc\nbechtle\nstanbery\ngreenspaces\ndibutyl\ncubbies\nkretzmann\nwutaishan\npanchagarh\nkirkstead\ntelramund\ncriminelle\nkolonel\nkingshighway\nlazareff\ntaiana\ntranquilizing\nrianne\nbleakest\nqueenly\ncitrinin\nviton\nleeanne\nteaff\nkastellorizo\njamy\nhampl\nwhyman\nstema\ngenro\nsamaira\nfillans\ndiscotheques\nillarramendi\nnonfat\ntaillevent\nkeiffer\ngroins\nfirecat\nlarocco\nbeattock\ndannielynn\nloibl\ndecoux\npramac\nnahmad\nballman\nfardell\nnaias\nroofe\ngravey\nsupercharges\nmicroglobulin\npundt\nffolkes\nhecks\nparalyzer\nmysto\nzizou\ntiering\nxindu\nprusiner\nfelman\neuthanizing\nelrose\nwallgren\nsanitaria\nsalsify\nyela\nmcanespie\nbabolat\nhydroxamic\nshanly\nyadel\nfacp\n
hamaoka\nbytyqi\ncitril\nennoble\nmacniven\njingdong\nportglenone\nhijli\npemon\nexminster\nheideman\nbandeau\ntouati\nbedevil\nhervieu\nformigoni\nholahan\nkeola\nmunstead\nlarrinaga\ndeports\ndimmable\nprei\nenikő\nshiralee\ndegang\ncrozon\nlautenbach\nlazor\ntoberman\nwormed\naltmire\nmcguff\npetascale\nmidwesterners\nhlw\nfabasoft\nlacinia\nklindt\nvogtle\nrht\nswiderski\nurwah\ndamonte\nsilverwork\nmelée\ncroo\ncircumambulate\ncpos\ntrounson\ndouching\ndettwiler\nlesk\nllanddona\ngarrets\nsemlin\nueland\nbadry\nmalaprop\nmetronomic\nabour\nkeilberth\nvempati\nmegatrend\nbuildwas\nlushness\nodean\npendergrast\nkilroe\nceramicists\nsevcik\nbozos\ntelscombe\nladybank\nsocialtext\nforwent\nblasberg\nsaifur\nmaestas\nramcharan\ntenter\ncgap\nzysman\ndolphinariums\naécio\nwpht\nzayid\nwld\ngaymer\ndende\nmythologizing\ncalvia\nbeachcroft\ngalerias\nstubborness\nunflavored\noccasionaly\ncoixet\nerber\nlindegaard\nmonex\nbrangäne\nrogow\nhvga\ndefragmenting\nhawara\nimprudently\ndickering\ncorrespondant\niguarán\nsifnos\njafr\nmccrorie\ngakkel\nreelz\nmuqam\nlugoff\nmagoula\nranina\nhbe\nbollock\nwillhelm\nyasunaga\nroorda\nlibermann\nspigel\ntopolsky\nzizka\ngyllenhammar\ngushy\npinstriping\nperou\nsyler\nbiondini\ntricastin\nvenerables\nbehnisch\nitchington\nbrindis\nbewilder\ncontendere\nsurvivorman\nsuperamerica\ndolorous\ngromada\ntabua\nstatcounter\nroszkowski\ncoelius\nbided\nrivelino\nwuping\nbuket\ngoreham\nyimin\ncuitláhuac\nitma\ndavignon\nclop\nsossusvlei\nhydrobiology\nantoniades\nmaghaberry\ngarafola\nfgg\nmargutta\ndaniyar\ncolnaghi\nelectrostimulation\nguarnerius\nfastlink\ntabcorp\nlowline\nsccrc\nktvb\nlandberg\nmcjunkin\nkfda\naapc\nautorickshaws\ninternists\nlandfilled\nbiospheres\nwenyan\ngaunts\nkeszler\nluminarias\nweisbach\ndrillship\nearthiness\ndubawi\nmostostal\nprimeur\nkwassa\nshtokman\ngabara\nhobbling\nfeminazi\neuropeenne\ndrybrough\nphotostat\ncrynant\nheffington\nburatti\nshivendra\nbarriscale\neurojust\nrimpoche\nneudecker\nisaack\nzy
uzin\nmiremont\nwillox\nrevolutionizes\npacioretty\nxuri\nhiremath\nesquibel\nscoggin\nesoft\ntotani\njiron\nmnajdra\nopenwave\nczerwinski\nsterno\ndaping\nverdelho\nzbv\nélémentaires\nunderperform\nkfp\napocryphally\ncustomizability\nzagier\nallmen\nrighini\nkapusta\nfilippis\nspidery\nexotically\ncincom\nroseann\nprepa\njohjima\nganem\nhoelscher\ncyberbully\nhoolihan\nbewsher\nhasland\nmutaz\nrezza\nkincraig\ndesmopressin\npopulare\nbusso\nombersley\nlenfest\nfolkboat\nefo\numbach\nkawazoe\nsegol\ncalderoni\nyandarbiyev\nenshrouded\ndecelerator\nschr\nhasc\nfaiyaz\ngallaga\nislandmagee\ndomanski\nhansis\nquietcomfort\nrhatigan\nrikon\nmontaut\npankisi\nscheibel\ndirtied\ncomeaux\nbrender\nantiterrorist\niacobescu\nbrummitt\ngosier\nchrissakes\ndauterive\nstrontian\nrugao\nlightkeepers\nbording\nniaga\nkujawa\ndanelle\ngivan\neastford\nalevtina\njacy\nsteinbrecher\nschimmelpfennig\nunchosen\nodl\nrayvon\nchernukhin\nnicc\nnicolaisen\nretune\napuan\nmakeing\ninnodb\nsoulseek\npossibily\nfonio\nsláma\npliosaur\njaran\nrazvi\nfilleul\njanot\nmadliena\ndesalinate\nkaiserhof\ndonyo\ntheoharis\ndoorpost\nspirea\ntousled\nkondaveeti\nestor\nshalane\nsilka\nneurodevelopment\npogs\ncausalities\ngeolog\ntanenhaus\nharkened\ncarshare\nmccutchan\nangelinos\nnettlebed\npainchaud\nbiesenbach\npermadi\ntregoning\ndefinative\nkuerti\nguidugli\nmemorability\ncoverlets\nfairgoers\nmolodaya\nshieh\naverre\nexplaning\nhoebee\nverbalization\nriether\nstranczek\n,if\nliudmyla\nzharkov\nscurria\njakeman\nmonastiraki\nluber\nhachama\naioli\nodditorium\nfucino\nhayt\nillumine\nnagui\ndtcc\nhikind\nalights\npynn\nboeckmann\nvitrolles\nrecessionary\nperich\ntooro\nbarrowland\nkoshelev\ncarmyllie\npigmeat\nbarrus\nshoesource\nphotog\ndomanico\ndecoratifs\nbrorson\nelectrolytically\nncmec\nmostafavi\nappreciatively\nstetler\nluxottica\nreverbs\ngougeon\nindecisively\neliopoulos\nwholey\nchihi\ncogsworth\neldard\nloughinisland\nbulkers\nallemann\nsundermann\nchemtrail\nhardelot\nfootboard\nmasa
si\nshippy\nunusualness\nsincan\njustness\nswazis\nguará\nmukhtaran\nwithold\nantiperspirant\nkuci\nboudhanath\ntalke\nballinascreen\ndalhausser\nkouwenhoven\nokorie\nbalcomb\nnishantha\nsleddog\nsebokeng\nmuzzling\nwombling\nhively\nkurbaan\nhoegaarden\nbollox\nsompo\ndichen\nproprieties\ngarnica\nasph\nzwanziger\nscheper\nnanotyrannus\nzeltner\nwinnecke\nfardre\nrosello\npointner\nrhinopithecus\nthabeet\ngreste\nyunfei\nmontier\nabramova\nxinjian\nbischofberger\nputtees\nnikky\npartings\ninventorying\nbaxa\npowderpuff\nbrouillet\nbelaboring\nbasted\njotting\ncorston\nclaytons\nqipao\nhirelings\nsamey\nglenconner\nchortens\nlabiaplasty\ndefinitionally\nbillett\nrezendes\ngambut\nchinmay\nblady\nbeiler\ngiddish\njadot\nmeridith\nwhetted\njunkyards\nghengis\nbureij\ntavernas\njongmyo\ncarrasquilla\nwvon\nbuttonholes\nlammtarra\ntwardy\ndracunculiasis\nspliting\nboneshaker\nunattributable\nlabral\nepitomes\nmarcedes\nkidrobot\nnomiya\nspiritan\nbicchieri\npackman\nsaigol\nossos\nmehari\nkochel\nbarbash\nhotcakes\nliscannor\ntasselled\nzankel\nsillerman\nburgum\nsarasola\nabortionists\ndecorus\nkesen\nstaveren\njazayeri\nvhc\nartprize\ncorver\ncalibri\nchignell\nmbacké\ndartfish\npunny\nbradie\npalese\neasterlies\npookutty\nsamworth\npalming\nquietist\nhagit\nwonogiri\nmultiverses\nmarrons\nwrongfulness\nnyons\npuenzo\nbarlby\naleksandrowicz\nideacentre\numbarger\nstimmung\nefstathios\ncinevegas\nsertao\nmumiy\nicmec\nbrynjolfsson\nrehabbed\nsalsman\nappellative\nokaz\nfeagin\ngildor\nwmb\nfezziwig\ninfomedia\nshurta\nleleu\nargenziano\nkestutis\ncupitt\nrecored\nleamon\npopik\ndedlock\nsotin\nriblet\nduckenfield\nfuemana\ncoaley\npothas\nvamc\npetrof\nosthoff\nsiwi\nliqun\npranay\nrooijen\nzixi\nfreegard\nbouda\nturbulences\ncambrils\nphilpotts\nlewkowicz\nhabia\ndziedzic\ndongtan\nlangelinie\nwailea\nailuropoda\nlooseleaf\nreconnoitring\ncabergoline\nheidar\nafet\nrailcards\nsvitek\nhildenborough\nmeerman\nactuates\nnemitz\ningrooves\nabermule\nkasuya\ndecheng\nrepoi
nted\navet\ncoddenham\nmediawatch\ntrabucco\naapl\nsilbar\nkollege\nmofongo\nklina\nstraightest\nwoan\ngodtube\nsigmoidoscopy\npromotors\npepto\npollinger\nmerzenich\nsucher\ncharboneau\nguoping\nwhitecliff\nzingers\neaec\ntimbavati\nmorbidities\nsharee\nhasaan\narico\nkaiko\nibragimova\ndocumentum\nsophism\nmedicexchange\nectoplasmic\nbunkered\nbackbeats\ncasalino\nshiau\nolr\nfootsoldier\nthetimes\nchaiman\nlagomorph\nkotite\nemong\nstewartby\noligopolistic\nhawpe\nrecordists\ndalil\nparabens\ndyken\ntogian\nahavat\nwathan\noveride\nhods\nabdulnabi\nsevengill\ntaghmaoui\nmpika\nvors\nhdri\nhorcoff\ndrms\nmignano\nhelicoptered\ncavataio\nembalmers\npestano\ndcma\nperuanos\nhkma\ndbo\nsilkscreens\nkolosov\nchiaureli\nbreadbox\nconexant\ncockburnspath\npillages\nbesancon\nanshul\nzebari\nsfbc\nditsy\nmultipla\nbruneval\nohri\ngianlorenzo\naokigahara\nclutz\nodhar\nlionhearts\ncanete\ndrex\nslicers\ntarmey\nperachora\nmacia\ngrados\nlvb\njianwei\nauchterlonie\nbtq\ngaped\nfarebrother\nterminable\ndaufuskie\nhbot\nsaltern\nsonnleitner\ndiselenide\nsalei\nwanya\norlofsky\nableman\nvanney\nsaccani\nranomi\nagaist\npennybacker\nsidedly\nperfer\nsafadi\nsayeth\nanhembi\nmonina\nmignolet\ndabeer\nyusha\nskalka\nherodium\nhecox\nantipathetic\ndoore\nuvi\nchadlington\ncornishmen\nrobustelli\nhartzog\nhiers\nooijer\noilsands\ninarajan\ndismembers\nvlok\npelourinho\ntonkolili\nturkson\nkaramarko\njarc\nprobot\nservicemaster\ntoumi\nlesinski\nathill\ngrodzinski\nprosen\norapa\ndotzler\nmusu\ndemimonde\nwetv\nrivalrous\naquaplaning\ncullins\ngoldies\nbrennecke\ngisin\nselvey\nberani\nmicrodermabrasion\nsnuffles\nzeevi\ncozzolino\nsweetbay\nsukiya\nmceleney\ntownswomen\nhambrook\ndrano\nracs\ndonigan\ntaravella\nwhyatt\npanych\nbhavans\nrapturously\nmurtada\nnvt\ninexpedient\nzakiya\ndelisi\nuncorrectable\nspillovers\nkembo\nkalms\nmeldrew\nlynher\nrosenfelt\ndhavernas\nglobemasters\nkiffmeyer\nharaz\nkatama\ncoccinelle\ngaring\nhammarberg\ngobowen\narbizu\nsallanches\nzuhra\nasch
au\nplagiarizes\nbenzylpiperazine\nhido\nselya\nsubretinal\nzayda\nzumbach\ndelagarza\nrowswell\nbatelco\npandor\nidealizations\nindigenisation\nkng\nvigano\nnatpe\ndodford\nlotty\npassent\nnelis\nabrol\npussyfooting\nblidworth\nuhrig\nraili\nkilonzo\nbookending\nwhisenant\nretter\ngliha\nhackert\ndecompensation\npasionaria\nkhaddam\nlaveranues\ndamnedest\ntractebel\naprea\nwaskom\ncolorfulness\nmassen\ndeia\nblippy\nredmonds\nmazzie\ngayot\nyevloyev\nsludgy\nharrises\nashna\nsaborio\ndewdrops\ndovolani\ntaligent\nbalie\nlakmal\ndeads\ncharlus\ncareaga\ngerrity\nnedlloyd\nfivefingers\ncinci\nonancock\nbruegger\nneveh\njettisons\nscheinfeld\ndetectorist\nxtrac\nreming\nzirbes\ndisfranchise\nbranigin\ncynog\nquindlen\nvoorheesville\nhakansson\nratatosk\nraws\ntebogo\nmarrowbone\ndevora\nnhleko\nbutalia\nsidoli\nfache\nbaptisia\nmawas\ntorain\nphillipp\ndkt\nhanjiang\nblushed\nsuassuna\ncollaterally\nrayno\nbilharzia\ndelbos\nbuhai\ncañaveral\nmatlow\nnasab\nfaddish\nstultifying\nllangennech\nohhhh\nkniphofia\ntrudged\nyeste\nsaza\nburani\nbudded\ndeap\ngenelle\nbenwick\nphaedon\nchiew\ngawr\ncreigh\nshurin\nkorydallos\nvalie\nalite\nkylian\nborje\nkhobi\ngorres\nstarchaser\nvasik\njubilantly\nnegba\nacheivement\nmareen\nurvan\ngozzo\nbureacrats\nherrema\nthake\natls\ncawthron\nmongiardo\ncroakers\nmoshtarak\nosteopetrosis\nquesos\ncapenhurst\npithoi\ncallidus\nogwr\ngroucutt\nskynews\ndarcie\nkassell\nstuhlinger\nzurga\nohsawa\nippa\nheybeliada\nschönemann\nshepherdesses\naldunate\ngnw\ncringeworthy\nreticulocytes\nheggestad\nasenov\nquizzical\ncalfee\nimpermeability\nmanjoo\nstrugar\nbagster\nrepointing\ngastronomie\nmazarakis\ndongyue\nccla\napga\npueyo\njoydeep\nunbaked\neleventy\nminsterley\nossama\ncastlerigg\nskirrow\nkatsaris\ncodesharing\nclynnog\nbarkey\nnatalka\nhamidreza\nmesud\npedros\ntrencin\nbacari\nbazzano\nwornham\njackowski\nfasher\nburkhead\nzhongchen\nchignon\nmagistro\nlaundromats\ntracfone\nmcpeek\ndaunorubicin\ncaipirinha\nwarshak\nitca\npoutiai
nen\nmirvac\nreall\nmatakana\nmanoeuvrings\ngite\nastrantia\nwhipsaw\nmeows\ndoobies\nsilus\nchaudron\nhemin\ninopinatus\ncerie\nangove\nchiong\nreata\nastrophotographer\nstibbe\nmisters\nerlichman\nhappenned\nsprake\nscreamadelica\nvivus\nsimum\nwittenborn\nbarcham\ntikis\nwellsprings\nresprout\ntyper\nceroc\nonomatopoetic\nbobonaro\nselmi\nacclaiming\nbruggink\nilegal\nbertaux\nmodolo\nhuskins\nbugbrooke\nsludges\ngomory\nilsfeld\norbin\nbonci\nbolinger\ndiaoyutai\nsteinauer\nturbotax\nregulary\nghilas\nesps\nkolja\ntittensor\nsingo\nnelon\namatrice\nsipan\nmotherf\nmvb\nkelan\nsalsano\nmcvean\ndissapeared\nbruerne\nmilbert\ncomanchero\nsuton\nkeela\noncologic\nrapidio\nardeth\nrégua\nlibdems\nanahi\nualr\ntrumans\ngenereux\nossawa\nreinheitsgebot\nmenosky\nesthetically\ninterpretational\nherrenchiemsee\nsunbathers\nbanditos\nottl\nchuukese\ndecapitations\njaksa\nredenomination\npaulaner\nsempervivum\nnatalicio\nprocrustean\nfaneca\nhahoe\nkumis\nfollwing\njanell\nelijo\nrossoblu\nceridian\nweik\nnael\nwolfensberger\nhosseinpour\nkaddouri\nakhvlediani\nrhossili\nzeyar\nexcatly\napda\nnimisha\nfragasso\nstrensham\naerialists\nclashfern\npasturelands\nmiazga\nwending\nchettiars\nhohlbein\nthormanby\nbisenzio\npostminimalism\nespirit\ndorge\nimpetuously\ndulls\nfavara\ndauman\ncontrabands\nxiaobing\nmichod\noutshining\nincongruence\ncampell\nmartials\nbeckmesser\nadagietto\nbuellton\nstazzema\npartovi\nhodding\nbupkis\nprosimians\nguaifenesin\nkaarst\ninsound\nvahidi\nfinocchio\ncocca\nyahrzeit\nhumdinger\nbioelectric\nstratasys\nscumbags\nllandybie\npeasedown\nbinter\nadroitness\nbelcore\npuchner\njiechi\ngapper\nkandha\nalbaladejo\ncgis\nmozeliak\nbagosora\ntollet\nwachler\ncliviger\nbashara\ncnil\nuplisted\ncircumstantially\nleefe\nmickeys\ncoppices\nxsp\ncorniced\niema\ncfbt\nadvisedly\npepple\nhypocracy\nriesenberg\nyodh\nlandrigan\nparmly\nsideswiped\nsafiyya\nberléand\nlmsr\nfeijoo\necotricity\nangiopathy\nlisps\npandered\nwli\nhollesley\nleafleting\nlabuschag
ne\nmutum\nstangel\nkosofsky\nskycargo\nelika\nprolife\ntrinculo\nsarig\nsusse\nfelicite\nelts\nriadh\nfrecheville\nchlorofluorocarbon\npeligroso\nderrett\nanvari\nbarzanji\nonkyo\nmuarem\nsymphonist\ndjohar\noesterreichische\nitamaraty\numpc\nshinmun\nmockumentaries\nmihailova\nveejay\nbrint\ndzongs\nkanat\nepiphania\nzahner\nagx\nsoundexchange\ndozed\nsaillant\nnihr\nmilanovic\npolyarthritis\nabdool\naarc\nnerved\nspansion\nkapral\nmcgavick\nmonongah\nluderitz\nrusskie\nmabi\nmarxer\nmurrelets\nconjecturing\nrwt\ntacker\narimura\nkindof\ntibro\nprac\narcan\nspodumene\ncherrypick\nhuaqing\npushpak\nredmann\nbaldino\ndruthers\ndavises\npyrethrin\nbohne\nuncf\nitaipú\ngingy\nsecularizing\nconversationally\ndaddi\nelodia\nproser\nkilger\npopularizes\nsoftswitch\nemec\nhyperbolically\nmanouche\ndunum\nbustamente\nmeadowood\npickaninny\ncreamers\nyabo\nflechsig\nazodi\nmuumuu\npereulok\nlothe\nvarujan\neshan\nkaupas\nsocia\nworman\nludeman\nbahad\natlantans\nlenoble\nmorari\nproselytise\nnograles\nchalkboards\nlindmark\naimco\nhagas\nhoved\nleggott\naswany\npleon\nherpin\nkrulik\ntindill\ntrillick\nsalda\ntauren\nwastefully\nbutser\npassard\nabdominoplasty\ninacceptable\npelagornithidae\nnaea\necholocating\ntucanes\ncrawly\nsafavian\nmeifod\nagropecuaria\nempts\nmrdja\nbilbrook\nsaydam\ncmon\nimn\nfrenchified\ncarens\ncrossness\nmisnaming\nrayas\nskeie\nsaeeda\nbergessio\nbersin\nglycated\nvisconte\ncoastland\nwatermans\nblumenfield\nmuzammil\nfelecia\nmargeret\nsilodor\neyeshadow\nsawadogo\nbeseeched\nchipeta\ndeever\nfernleigh\nmbes\nuniversalistic\nalvo\nandron\nsaltcedar\nskimp\npederasts\nkanaly\nscotford\nalexion\nskv\nkhavari\nkingsbarns\nphysioc\ndorsoduro\nlisnagarvey\nmindspring\ncorrenti\nninevah\nfimian\nsteffie\nwyken\ninnateness\nstrangio\nditech\nantell\ntumtum\nparren\nshoumatoff\nnoncredit\ncanyoneering\nbarsetshire\njaunes\nrhomboids\nprioritises\ntribhuwan\nmcconnelsville\nblithering\nrübenberge\ngeise\nclerkin\nvulgaria\npertemps\nplaysforsure\nmedal
ia\nbankshares\namagiri\ndummerston\nfusha\nmanesh\ncierre\nsylbert\nswearer\neskay\nmarolt\nsalahis\ntroch\npetrouchka\nicod\nmarsman\nnavickas\nlynns\nendi\ngalotti\nbarrys\nhakes\nscheindlin\nzubi\nruegg\nmairs\nduffett\nnexia\nplazuela\nmoena\ntuatapere\nbaluard\nnilay\narfield\neadgyth\nwelbourne\nzent\nguzy\npowar\nfurchgott\nraulston\nabudu\nimraan\nmajano\nkeisel\nbosavi\nehnes\nelectrocutions\nrebiya\ngarriock\nlabaree\nossificans\npavis\nsavills\narboricultural\nunbent\nselley\nataxias\nfulminating\nankleshwar\ntylosaurus\nlober\nbusteed\ndegus\nlittlehales\niddings\nalameddine\nzunes\ndurably\ndealbreakers\nfamen\ntomago\ndeferrals\nvossler\nuwins\nfaku\ntudge\nmannone\npardoel\nquepasa\nmcquiston\nallegria\nhaaken\nethambutol\nmâconnais\ncogdell\nsignficantly\noverexpress\nparranda\nrakhshan\ntimestep\nremediating\nknifing\nseidenfaden\nsanidad\nmcwaters\njerrells\nrousselet\nmvnos\nbackroad\ndufton\nbndes\nlaax\nskyworks\ntorreblanca\njerrel\nloganberry\nnubs\ncullison\nfemaleness\nunmc\nbuchdahl\ndjamal\nlecterns\nthinkpads\nsolidarnost\nluddism\nbvk\nrostill\nfrancies\nevaldo\nmozyakin\ntracz\nxiushui\nrachal\narsc\nmelioidosis\nsammartini\nrosmersholm\nmemet\nhuanta\ndenominate\nprestage\ndriburg\nhawked\nsatura\nnouvelliste\nseksu\nalltech\naanholt\npasqualoni\npontbriand\nprober\nwestcote\nhomecourt\nmegarry\nisabelita\nsubianto\nbusloads\nkutsch\nsapeurs\nohsweken\nshaniqua\ntutting\nflowrider\nstefanyshyn\nbrauch\nmunisteri\nkordan\npichola\naksenov\nbinstead\nwiil\nbielenberg\nekeblad\nsurfwear\nmeulensteen\nhbj\ndahuk\nveney\njalonen\nberingei\nmuscardinus\nknla\nsherring\nnadzeya\nkemnay\nsabb\nwhooshing\nzenone\njingming\nkellard\nscic\npropably\nserigne\nplimmer\nkalenna\nmerrigan\ncarnyx\nbelman\ncarvoeiro\njantsch\ntanztheater\nbedfords\nyakka\ninterpal\nsplats\nballclubs\najdukiewicz\nkutler\nreticule\nsuto\npersonalty\ntrockel\nwondrously\nferriol\nnusserbayev\nmandarich\nhuggetts\nhaggs\nriotously\nysaye\nkapsch\nenticements\noverriden\
nsloganeering\nschimmer\nmomia\nspleens\nzhiwen\nbunkhouses\nziprasidone\nmspb\nflagey\nzanin\nysanne\nmellisa\nsabden\ndrakoulias\nquikscat\nqasemi\nablyazov\nquintano\navailabe\nautomotives\ndayers\ngarbageman\nvendaval\nranched\nshafiei\nprsc\nthottam\nroag\nmedhat\nshevket\nserrell\nastete\ncrystle\niprs\nmessiter\nassitance\nherlovsen\nshipbroker\nrenaultsport\nkoteswara\nschanzer\nlapize\nquintuplet\nhidding\nbiello\naffraid\nnumis\nunivocal\nnonono\ncryptomnesia\nvitet\nhothfield\nyanov\nrayle\nsissako\nrrw\nbilandic\ncoyoacan\nstonier\ndundrennan\noltra\nmoun\neliecer\nconz\noutmigration\npowerlink\nlinzy\npeddars\nmastersingers\npankiewicz\nhinck\ncouñago\nfirstrand\nunsteadiness\ngegner\nlumbered\nsearchs\nzanders\ncorvairs\nknost\ndermatopathology\ndunmail\nlevs\nbranquinho\nformulator\ndecriminalise\nbandstands\nindivdual\northophosphate\nhartsop\naggrolites\nmahen\nkingshurst\nkarmapas\nlabas\nllanfairfechan\nporterbrook\nmannville\nkevans\nkerfluffle\nmutara\nbrynteg\nloutro\nhardial\nnonrecourse\niafrica\nwims\nadduces\nclusaz\nhrdy\ndeniau\nmayfest\nsteinhäuser\nringold\ngoonewardena\nrucha\nmashes\nnikiya\ntimme\ndelanco\netron\nfaussett\nmikros\npoupaud\ngontard\nfiroza\njagatjit\nferrare\nfolksingers\nwikramanayake\nhumax\nnajman\nrasmi\ndemineralized\nmerkl\nrebars\ncornflowers\npitroipa\nhotheads\nfkk\nikebe\nheartful\nbohem\njoanou\nsegontium\nianniello\ninvisalign\nvaljak\nepigonus\nerwood\nkurara\nmotaung\ndecribed\nsinkin\ncous\ncampanario\nbrumwell\nwhitsitt\nskavsta\nmasontown\nmaharashtrians\nhotchkis\nmirax\nboggan\nchamisa\npolitie\nbryggman\ngudmundur\nfraher\ndisingenious\nswainston\nirureta\npavier\nburias\nhowcast\nminver\nczapla\nhogrefe\nautonomo\nkurson\nbridgland\nbreacher\ncharfield\ncargile\nsexby\nmeltwaters\nvignal\nadriamycin\ndeselection\nsanjin\nticketcity\nwhittingdale\nsovremenny\nsensa\ncircumlocutions\nbcms\njiuhua\nnielsens\nyurij\ngreeny\ncygnets\ngelsey\nimprobabilities\nfaithfuls\nmesquida\nroggeveen\nluthe\nlorey
\ngome\nattune\nfole\nymax\nsmoosh\nrobocall\ncantelon\ndulcet\nwhedonesque\njwoww\nfaubel\nhempton\npeppler\nruwe\ndichio\ntookes\nenamora\npâtissier\nhuthwaite\nhowle\nempanelled\nclearasil\nstanescu\nmoverman\najani\nwahconah\nannisquam\nzonca\nmontjuic\nmiltown\ndakers\ntransmissibility\nsanmen\ntrimbach\nmcdo\naffluenza\nfalola\nbondsteel\nabertawe\nconsigli\nspringthorpe\ncartmill\ntomiichi\nankawa\nraghuraj\nbuangkok\nplaxo\ndeadwater\nizady\ncompanionate\ndepe\nbancos\nsidell\nmandroid\ncandian\njoleon\nkaurismaki\ncityzen\nrobar\ntearfund\nhumbles\nguotai\nguanaja\nacquafresca\nranty\nsaude\nscruffs\nkraenzlein\nvaughters\neginton\nalmanzora\nbridgemaster\nsandercock\nschebler\nleinen\ntehre\nsegale\nrosebowl\nplastination\nalkylate\nconfinements\nhasell\nfactionalized\nsaah\npumpe\najaj\nmiconazole\nmagre\nchoclo\nsigfusson\ndrusen\nintertie\nordish\nguodian\nbinyam\nbiglerville\ntreichel\neaglestone\npyy\nshoebat\nomrani\nnordegren\nvauclair\nhorrifyingly\nkleiss\nkronenbourg\nkesner\nhusing\nunsheltered\ngospocentric\nkumyks\nwriston\ntryptich\ntamaro\nsarwate\nnellist\ndisquieted\nminimo\nplockton\nlucado\nimmunodeficient\nafari\nreallocating\ntootal\nmontefusco\numbilicals\npantalon\nnakas\ntidey\nwiddle\nmanagerialism\nsplashin\nassails\niannetta\naaco\nmeny\nlarded\nzurer\nprepayments\nlloro\ngamecity\npartygaming\ndoffing\npavé\npuppetmasters\nenergias\nhafizi\njarren\ndisembowelled\nkammerphilharmonie\nrejoinders\nlowthorpe\nsarb\ninss\neucryphia\nguarentee\ndevinder\nartola\nichimaru\ntokonoma\nstobi\nmontelimar\ntongil\nbioprospecting\nplanaria\njodee\njudum\ntatem\nofwat\nduberstein\nsatinwood\ncoxheath\nburghard\nbridgegate\ngrasberg\nmurlough\nrareness\nwrth\nbabor\noverreactions\nsekkei\ncoiffeur\nduboscq\nxinlong\nbingos\nglug\nschamberg\nkuitunen\nmontanans\nbrechner\nwhiffen\ncutshall\nwtkk\ngalp\nnadji\nbufalini\ngirardelli\nalaris\npolu\nbergelin\nhollenberg\nchawan\nsecy\nsarazin\nnasruddin\ncompetely\nsteamrolling\nbourns\nnoninterfere
nce\ncarnall\nlepre\ntokunbo\nworkover\nclophill\nnkem\nbetteshanger\ndaneman\nkoobi\nvedernikov\natkinsons\nrevin\nshrager\nisbin\ncomapny\nputinism\nstawamus\nconcertinas\nstateswoman\nmurderously\nkandol\nbreydon\nrittenberg\ncoylton\nmicex\nrutina\nmalebo\nhourihan\ncheapening\nbaychester\nshabazi\nfeltsman\nsieghart\nallocators\nbicyclette\nmakeout\npcyc\nkotido\npiggie\nsiguiri\nlarrington\nnessesary\nveloster\nweaponless\nthorpeness\nnilekani\njonelle\nacxiom\nuchenna\npresper\npurifications\nunderplay\ndogmaels\nsemestre\nmcleroy\nmasisi\nenten\nmccline\nmaricarmen\nkilogramme\nryneveld\njermichael\nvelappan\ncourteline\nobservants\nunnerves\ngrindr\nkwtx\nspiddal\nshaalan\njamiel\naltruists\nalexeyeva\nholda\nafleet\ntranscribers\nundrained\ndahon\nsandqvist\nmidwesterner\npreparator\nyermakov\nfragniere\nkwv\nbrahmbhatt\nifop\nafkhami\nbradshaws\ntrbovich\narcahaie\ndisassembles\nduquoin\nnigeriens\ndillsburg\ndentmon\nlewman\nelima\nhovik\nchawki\nhalstow\ntellechea\narvell\nbergmark\nslurping\ncheerless\nhernandes\nlly\nzhuji\ngdv\nrajasa\nbuzhardt\nhaemophiliac\nmyrl\nafic\nyusup\ndabbler\nvry\nsprezzatura\nriels\nwooburn\nngx\nreregistered\nmuckler\ninflects\ncedella\nribaldry\nbackshall\nunfussy\ngrosberg\nstriefsky\nswabbing\nkecman\nkolesar\nphuntsog\nhumidities\njanish\nkubacki\nruaridh\nsoms\nbohlmann\nphuentsholing\nbradham\ngrosbard\nmusli\nkason\npapalia\nfriske\nalabamians\niafrika\nstifford\nlaicization\nbreughel\nstampalia\nelektrotoer\nkraftwerke\ncetiosauriscus\nluctus\nmanzel\nbeyong\nfloodwalls\nstejskal\nprochazka\ntelefono\njamaah\nhijabs\nholgado\nchargebacks\nmaxes\nnanya\ngriffi\nchais\nundergrounds\nmormando\nsuwat\nhütz\nunlatched\nsplicers\nanogenital\ndisorientating\nlurz\ncoloradan\nffos\nexcluder\ngmes\naahe\nmortuaries\npicassos\nsynthesises\ngll\nfargate\ngrimness\nonchocerca\ntrollery\nrivercrest\nsmithdown\nslops\nospi\nsackman\naviance\nliberalist\njiggly\nnomoto\nganju\ncruzeiros\nyeares\nafci\nbriody\ndaneyko\nnoyo\nkur
nitz\nbramlet\ngabion\ndörrie\nskorpios\ncybershot\npadura\nrichhill\nhiraethog\nklrt\nchurandy\nvoluptuousness\nulcombe\ndirs\nhoneydripper\nyahiya\nbkw\nwharfage\nplantel\nabased\nslusher\ngreenheart\nknowstone\nrisp\nchampignon\nkneehigh\nbelstead\nmcbratney\nmava\nreoccurrence\nbancassurance\nbreds\nshibboleths\nransacks\nessendine\nmaxwelltown\nhurlyburly\nahoghill\nwowser\npanjandrum\ngvc\ndialectologist\njurançon\nweihnachtsmarkt\nwerkheiser\ndeclaiming\nvyntra\ncarbidopa\nzhaohui\nsitko\nrockenfeller\nfirestation\npalaeoclimatology\nglop\nfutz\nllw\ntranexamic\nmisguiding\nschlomo\nshashikant\nboediono\nbudiarto\ndepaula\nflaine\nelmbrook\nnicholaw\nsocializes\njtp\nmaximalists\ngilfoyle\nhagwon\nkanafani\nzicari\ntabletops\naccosting\nundisputably\nqazigund\nrafidain\necns\naughts\nmaraldi\nexenatide\nfumé\nsublimating\nempi\nstanlake\npether\nsportz\nyoula\noverfull\nwamalwa\nmicrotia\nringdahl\ndegress\njinyuan\noxenholme\ncaahep\nihedigbo\nabps\ntailgates\ntbsp\namaka\nlinderhof\niuliano\njeongeup\nsarraf\nconcelebrated\nsocios\nbardell\nirakly\nvinik\nflorenceville\nwedren\ngigatonnes\nwindowsills\nequidistance\ndrohan\npecho\nnooyi\nwalcutt\nsaitta\ncrueler\nstenholm\nbifocals\ngranose\ngeometrics\nballes\nnebuad\nmcrib\nterabit\ngrunwick\nstorlien\ncheongdo\nobermann\nisolations\nziaul\nprinci\nfesser\nbartosik\nageas\nbristols\npenygroes\nbentleyville\nswingtown\nelong\nfornham\nkurlansky\nstigmatisation\ndeià\nperricone\nordell\nklingenstein\nhechter\ntiguan\nbrouard\nlauralee\ngrobet\nsharah\neverolimus\nshowmance\nagrosciences\niufro\nvalliere\nhenenlotter\ncostley\nkasser\nivg\nspiderwort\nplaygoers\nicefjord\nblackboy\nbewl\nfarges\necopetrol\nodessey\nnarveson\nsteinhof\nnoncitizen\nkirchin\nunutterable\ndellys\nsanglier\nzoete\nunderlaid\nthornhaugh\nscitex\nstolfi\ndelcambre\nsstp\nsmisek\nkorber\ninsulins\nfeghali\nswindall\nwestroads\nshakman\ngrandmas\nkenoyer\nbristo\nstongly\ncaoimhe\ndanos\nfeola\nsaggy\ncroyden\nbankatlantic\nngongo\nm
chinji\npossibilites\nmeditech\nbenesh\nbukharov\nkittipong\nunmercifully\nnemov\ngriles\nneurotropic\nanway\nhayriye\nhomeaway\nmailhot\noutweight\nfiander\nspringers\ninniscarra\nmarthaler\ngholamhossein\nfchv\nschimel\nmillibar\nfinelli\nbruney\ngraziosi\nspeechly\nyevkurov\nurol\nschedl\nallaying\naguecheek\nnpia\nzadroga\nfujiya\njerrica\ndoublewide\nronal\nrectally\nbeels\ncridland\ngladis\nsangji\nchewie\nllangadog\nunsettles\nburkei\ngawronski\ntomboys\ngailani\nreganbooks\npothead\nmultisyllabic\nragbrai\nlassic\nfarvardin\naronow\nfilé\nunnerve\npiskorski\ncheika\ndelvalle\nvesterbrogade\nbehdad\nsibenik\ndaney\nsangiran\nhazlehead\nwargnier\nnasserite\ncarfrae\npoofs\ntwofish\nkimche\nnowness\nmannon\npleasers\nfournel\nfary\nkaiserwald\nbasem\nlopat\nhcci\nsentinelle\ngabbi\nlatheef\nkvbc\nbuckyball\nschoomaker\nseniormost\nrosenboom\nnacm\npiszczek\nuhaa\nduflot\nijichi\nibori\nmacsharry\nhartcher\ncoggin\nretweets\narrr\nthassos\nwaialae\npinehearst\neliad\nspigelman\nlisc\nentrecôte\npigging\npeelle\nhuaxi\ntvland\ndupontel\nhensol\njawbones\nlyminge\nrusskoe\nscnt\njodhpurs\noscon\ncapannelle\npangram\nfarted\nkthv\npipsqueak\ncornelison\nkonono\ncarkner\nelegar\ngiac\ncumberlege\noehha\ncober\nmomeni\nsuyama\nentrepreneurialism\nconservationism\nenrichments\nhcps\nconstructeurs\nlilliputians\naselton\naith\nrogner\nblepharoplasty\nmamoon\nszukalski\nuvira\nallders\npolicys\nsuperintelligent\ncraviotto\nnicelli\nburfitt\nsechler\nsituationism\ndooney\njuvonen\nnakorn\nvujic\ninternments\nmeeson\ntipplers\nkosslyn\nkamana\njncc\nforecourts\nsuburbanite\nkosdaq\nnfib\ntrefilov\nunknotting\nsonan\nmazagan\njujamcyn\ntumescent\nimiquimod\ntabberer\nrfps\ngaley\nmeshkini\nropey\nnipp\nmompati\ncolossally\ntricolori\ncremating\nkazmunaygas\nhoppenot\nnebb\nsnapback\npresenteeism\nerectors\nrighties\nmonegros\nshamas\nbrachet\nunfastened\nbegelman\nrenou\nhullavington\ngumulya\nuchino\nsuperfecta\nmeihua\nbaneful\nladha\ndorasan\nminuted\neurorap\nlatton\nm
ontminy\nclodoaldo\noshu\ndieing\nlahaie\ngamey\namoss\nbugel\nneuters\nlambrook\nfungo\nkabasele\nunfurls\nsaunter\nwingreen\nplaygroups\ncatalogers\nlacina\npieterson\nvillechaize\nrangemaster\nwrox\nhoiby\ncorren\nlightheartedly\nfrancisella\nlooksmart\ngaglardi\nfukang\nportaventura\nquintron\namre\ncmrs\nwyda\nconfigurability\nvinai\ninseparability\nindicom\nshawish\nlipofuscin\naneurysmal\nbiocatalysis\nshalders\nfriendfeed\ngriddles\nbennets\ngorgui\nzednik\nsteinfeldt\ndums\ntianma\nkalw\nouspenskaya\njusticiability\nzwilich\nlattakia\njarringly\ncrotchets\neaglin\nunpasteurised\nheinecke\ndoomadgee\ndahabshiil\ncuckney\nclendening\nsolorio\nmanchev\nchokling\nstrangeloves\nsuperbia\nsuperfans\ncubillo\npasatieri\ngioni\nbalice\nbedspread\nspiritedness\npifer\njoling\nturchini\njakosky\namrc\nharwick\nunadventurous\nrelex\nyarlagadda\nsingalongs\npongala\nannica\nlessingham\nkedrova\nfiar\nsiepmann\nseré\nzicam\nbryceland\nbaym\nmuneeb\ndifferant\niapp\nmacaroons\nadventurousness\ngrathwohl\nlidington\nqurans\nmackesy\nvaluble\nwunderle\nballgames\nsugarcoat\nstalham\nsecunia\njaiprakash\ncrossbeams\nchoristes\nponniah\nparanoids\nhibaldstow\nbelloumi\ntravelport\nmdpv\nshaklee\ninflammatories\nfifteens\nchingari\ndlbcl\nprobabilists\nnicey\ndaggerboard\ntowboats\nrimell\nnocentini\nqiagen\nlare\nwasmund\ndesquamation\nbuker\nrunts\nfrankweiler\nkarry\nicex\nkozmic\noxaliplatin\nstarkness\nporthill\nsenba\nadeni\npearblossom\ngauvain\nmehrzad\nweetwood\ndesideratum\npoett\natayev\ntiera\nnaudet\neut\nofdma\nhuaibei\nhydrocracking\ngezhouba\nrunco\nbrimer\ntengboche\nandretta\nheilbroner\ndinks\ndanyal\ncarmazzi\nvald\ntesseracts\ntallin\nfoetidissima\nballbreaker\nmavridis\nbasinas\nwehn\nhibbitt\nmillerstown\nandic\nlaino\nenlargers\nkalashian\nyonglin\ncerveny\nkuller\nhefley\ndunckel\nhaxthausen\nanthropomorphised\nbarsamian\nberthod\nrestauranteur\nindosuez\nlaperriere\ngoulard\nbassplayer\nhideaways\nlijjat\nguidewire\nfauld\nsatyapal\njouet\nmanikins\n
smallscale\njanakiraman\naiyer\nhailer\nsalmeron\neverlong\nbeatmaker\nakerson\nmabbott\nalphie\nbrobdingnagian\nparres\nstoneridge\nreconceived\nooooo\nkinderdijk\ncevahir\nembroil\ninjuria\nsyko\nbankoff\nhasher\nverderers\nsocastee\nbarewa\nkilmessan\ngrovely\ngissar\nzylka\ncarrols\ngulkana\nibai\nbeineix\ncairene\ndirecttv\nbittel\ntomake\nseino\ntytell\nscalabrine\ngrotzinger\necps\nbarefaced\ndebevec\ntiare\nbearhug\ncamelid\nalium\nvasiljkovic\nsassan\nbeguinage\nuliastai\nhexachlorocyclohexane\nstantz\nallées\nwieseltier\nsaccharides\nchopp\nrajendranath\nenteroviruses\nunitive\nciji\nsimlar\nkrack\nnessen\nbushwhacked\ncbgbs\nmaimi\nencumber\nneoadjuvant\ncaddied\nasep\nmirit\ngunnels\nrzd\ndemoustier\nwyff\nmacneacail\nrezaian\nobala\ncalamansi\ndahlkemper\ncdz\nzieler\nsingeing\nbiotechnical\nspio\nwickwar\nosbornes\ntimana\nlightings\nbenney\nmoriconi\nkunder\ninterruptible\nstroock\ndinorwig\nmillien\nmows\npifa\nfritillaries\nlimani\ngodchildren\npiker\nbankfield\nbhojraj\nbpcl\nnyh\nrodolpho\nheary\nbabeu\nzimring\nrippert\nbuchina\ncapellini\ntavare\nmasoumeh\nshabwa\nthangai\nnelia\ncfia\ngoscote\nmenorrhagia\nmcleavy\nracegoers\nsheilds\nspyders\nsijia\nramush\ninsufferably\nharinordoquy\nhesp\nimbricated\nbiet\nyouthquake\nmouw\nfariha\nretractors\nwickard\nobss\ndexion\nkedwell\nbibik\nunderbite\nfursenko\necotourist\naufgrund\nerdan\nsemiha\nvolvos\nantilock\ntimorous\nmayobridge\nstalnaker\naudigier\nbarrueco\ntollis\nzolotukhin\nyovani\nfosi\nkokubo\nlafc\nwaterlilies\nfettercairn\nornamenting\nhenkes\nsalmoni\nstarcatcher\ncardioprotective\nmasloff\nenoc\nblucas\nthomasin\narvest\nmundhra\nbuachaille\nenic\nshewchuk\nshortenings\nchafes\ndulais\nkyivstar\nmorticians\nimmi\nyauheni\ndanter\nblondet\nsahan\naddazio\nranya\ncommensurately\ncyclothymia\nreponses\nmoennig\npanagopoulos\nleason\npommiers\noprichniki\ncfmi\njossie\nguit\nalfonseca\ncentralian\nousman\nserms\nkamares\nepigallocatechin\njingchu\nbygrave\nstrangelets\ntumlinson\nleiwe
ke\nrihards\ncivicus\nvalbonne\nseacraft\nvaage\nnagarjun\nmhsa\nlanguor\nlhcb\nrosborough\nslh\nnasra\nlagergren\nbassham\nenergising\nlollypop\nkxmb\nmotiur\nmerbau\nmuchness\ntizzard\ngioiosa\nphull\nmuktinath\nwndu\nmunhoz\nthessalonika\nadorni\naiim\npranger\nkrupin\ntrislander\njeckyll\nanslow\nmuuga\netios\nmccarthey\nazzawi\ngragnano\ndemeyer\noquawka\nforkhill\npeelers\njiaxiang\nbintley\nllanas\nfernhurst\nfajitas\ndalene\nheraud\nboysie\ncondrieu\nfeasterville\nskenderija\niedc\naliou\ncalonge\nbuckeystown\ninvoiced\ntosta\nstockists\ncaixaforum\nsuaram\nandrist\nvondelpark\ndistington\nlovettsville\nbaculovirus\nlevinstein\ngraden\nrasps\nchrysalids\nplaymen\nplekanec\nsomei\neclat\nbrinkburn\nbrowell\nsuheil\nmó\ngoldenvoice\nofn\nderris\nhajizadeh\nlegerdemain\ngeorgei\numbrians\nwoodmancote\nonw\nparameterizations\nalerus\ngagnan\norlanda\ncramb\nnarey\nhydrolysate\nhonks\ngoradia\nplaylet\nmozilo\nxover\nbonos\nboconnoc\nmezzocorona\nmclardy\nindentify\nfarideh\nhunsbury\ncapco\nbahrein\nhayatabad\nsuicidology\nundersecretariat\nhitchhikes\naeriel\nillanes\nalogoskoufis\nreindel\nsteenbeck\ntaim\nlevegh\nchandeleur\nessentiality\nblepharospasm\nmahender\ncorduner\nokeford\nkachu\ncharcoals\nclockface\ngüttler\nglobokar\nrealisable\nshavir\nstiffens\nallbrook\ndegeorge\nnissenbaum\nrectorial\nruskie\nhcfcs\nnyckel\ndisconfirmation\nmanelli\nspidering\nguignols\npoloni\nvongerichten\ndeschner\ncrivello\ntiffanie\nsubsecretary\ncreason\npanadol\nfarahi\nwaterboarded\nextreamly\ncaerwys\ncrawlies\ncarpools\ndulzura\nsinz\npilrig\nabsoluteness\nbhaya\nkrap\nworksites\nandriani\nilani\neculizumab\ncincotta\nprevelant\nfiancées\ngreentrax\nshotz\nvolpaia\nwheldrake\ncertifiably\nbenignly\nfoxbusiness\nmahay\nmoussaka\nburnfoot\nsakichi\nabsentmindedly\nbowfinger\nfehring\nyantic\nserpe\nmontecristi\nbenkert\nluing\nripps\nfarrells\npspv\nsprod\nthreesomes\nhillfields\ncentella\nbirdbath\ndavita\nmcing\npicower\nschwarcz\nkocha\ncrigler\nmanassero\nproximiti
es\ngruia\nstradley\ncatweazle\ndidomenico\nbendelow\nedano\nsecour\nstarfruit\nleavens\nwidden\nkillik\nfinessed\nbke\nduffs\nboulden\nhanegbi\nuhlenhorst\ndiekman\njbic\nmaaco\nunhealed\nserap\nimin\nkrm\nhaberl\nmicrogrid\nnavaira\npusc\nradakovich\nhermance\nzarouni\nallai\nkosse\nboim\nbuttner\ncalipso\nventotene\nshoffner\nclaverley\nhoffenberg\nrosane\nsabetha\nnilin\nakale\nfloridi\nzamecnik\nfüle\nhallum\nmonter\nliebreich\nprinciplist\ntumanov\nbarten\ndobol\nworfield\nleafa\nnatela\ngreenwoods\nhgcdte\nzhaoyuan\nlanglie\nnorcom\nsmmc\nmbugua\npriggish\nroshon\ngmap\npeperami\nodlyzko\ndowser\nshaddock\npolycarpe\nobnoxiousness\nprodrome\nbracker\nniemczyk\ncreuddyn\nshowbands\nlewry\nlabranche\nverschaffelt\ncappuccio\nhintsa\nlamely\nnolberto\nmulhearn\nbernaldo\nsailesh\natayde\nortved\nstubblebine\nafeni\nhyotei\nsourse\nmccorley\ntimpanists\nwholescale\ntownscapes\ntuffley\ncongruity\njigsaws\nballz\nunderstaffing\nmouthguard\nmaclaverty\natras\nshella\nreconfirming\nnyts\nmithai\nmckeel\nsapio\nklauser\ncrystallographica\ncycos\nromeril\nchanonry\ntarasco\nctenophore\ncozmo\nnativo\nsonestown\nfelinfoel\nguttentag\nlockheart\nkbi\nhorseferry\nwallia\ncoalición\nnoser\ntakamizawa\npheidippides\nvilloldo\nomata\nworkshopping\ndaycares\nplumm\ndealin\ndoretta\nupthread\nrefueler\njoppatowne\nklabin\nhhg\nsprats\nbussink\nrifu\ntriesman\naicar\ndesmosedici\nwuld\nteks\nshomi\npzu\nboundy\nunlikeliness\nwildavsky\nzares\npoppinga\ndepa\nspringboarding\nbubblebath\nbudock\nstevi\neitb\nflots\nmedion\nakamas\ngigatons\nkokkino\nmimick\nfemtoseconds\nlenel\nalima\ngonta\nmaleo\nsmolders\nscrabster\nbaldeo\nnullam\nlibary\nener\nsebek\nmarginalizes\npaulsgrove\ncomputeractive\nfoglietta\nshrna\nsrour\ncabaletta\nbance\njosemaria\nzennström\ndelich\nsunshades\nwaunakee\njilts\npletikosa\nbioresources\nunchallengeable\nzyzzyva\neuropeanisation\nblanchar\nroberty\nellef\nhamdullah\nmacroeconomist\nmomos\npinkwater\nlivernois\ncreasing\nhtis\nwxii\nmcpike\nbiagio
tti\neurex\nyeses\nguyanas\nmentorn\nencinos\nweyant\njindong\njeudy\nguihua\nneuburger\nkambi\noutboxed\nkorch\n,they\npastner\nrochina\nretiral\nakris\nhalophytes\ndilston\ngildersome\nmouthfuls\nbkb\nnetti\nyoxford\nsunart\nannother\nbacot\nkallos\nacetosa\navastin\nserbin\nselworthy\nzinberg\nhoxby\nhornbrook\npendley\njulani\npalauans\ncwn\nvirgilijus\nvenetiaan\nbergtraum\nkaaki\nswallen\npneumococcus\nadron\nschneeman\nbullmoose\narces\nbonna\nnorthfork\npeaces\njosefson\nkunigunda\nbuñol\ndemchenko\ncyncoed\nnubi\nanupong\nchurchwide\naratama\ngulia\nchelsfield\nblackstuff\ncuéntame\nsansonetti\nlanners\nsambadrome\ndefinitley\nmembrillo\nmaylam\npenally\ntemeke\ngermanico\ncupidity\nruwa\nayiti\nmauston\nflagons\nbetweeen\nmedvedchuk\nhegeman\ngunsights\nadila\nparasocial\nmontem\nbemersyde\nreadmit\njeelani\neckstrom\nladany\nbaised\ncookridge\nbrunnhilde\nzuppa\nvauthier\nnarval\nforequarters\ntiefensee\nadmissable\njackboot\ncringed\nsahli\ndarks\nsuskin\ndaikyo\nfoulston\ntargamadze\ncagnina\nelyssa\nraptured\npovero\nmetricated\nrutman\nknuckey\nhandbuilt\nautophagic\nleiomyosarcoma\nfortney\nweelkes\ndaddio\nglitterball\nelice\ndabigatran\nlumax\nthinh\nflugga\npetrovietnam\npapathanassiou\nbuitoni\nmotoyasu\ntabernas\neyharts\nmousehold\njetz\nosbaldwick\neverday\nyukai\nkorty\nlouay\niita\nhoblyn\nchiriboga\npotage\ncamdenton\nsabco\nlehand\nshucking\nreprographics\nervil\nchidsey\nsarabi\nfriedmans\neufrosina\ndiametrical\neiders\ndermatol\nkinzler\nmonkeying\nyegua\nyaan\nmadlener\nlappas\ndeclaim\nbananal\nbeeskow\nprivatisations\nahmir\nzweden\nwilzig\neventers\nheyhoe\ntitti\nrushby\nnewcourt\nprety\ngiovannetti\nmelitz\njgc\nmoteab\ndisintermediation\nfze\nguayule\nworldpride\nstaa\nheekin\nfermenter\nodubade\nméchant\nteahen\naouad\nbyfuglien\nunpressurised\nborena\ndatapoints\ncibotium\nbenhall\ncefta\nwohlstetter\nassane\nwingdings\nshanto\naguerre\nmassicotte\nhogin\nspecialness\ncorian\nvollaro\nlukasiak\nqkd\novulating\ntonja\ndecarli\ng
hahremani\nappier\nremizov\nbytheway\ncmsa\nquadzilla\neibach\nsukey\nurg\nwaystation\nstandring\nsqueamishness\nwillmot\nbedpost\nastroboy\nfurnell\nafps\ncawsey\nfuencarral\nunderprepared\njijia\ntavita\nlojeski\nlowi\npozole\nrammellzee\nmsq\nswierczynski\nanstee\nrollett\nlastingly\nmaseratis\nkirakosyan\nhurun\nonek\ndeerness\neinzig\nsrms\nhensher\nsouqs\nkauffer\nmisplace\nrosebay\nwaunfawr\nboora\nkillshot\ncantuta\nsgoil\nmaguro\nlfh\nparanoias\nstripey\ncascella\ntaibach\nlagunilla\neadfrith\nsorafenib\nrepossi\nbattlefronts\nethnopharmacology\nskiving\ndolidze\nbarkell\nhaitien\nmilevskiy\ndeutschmarks\nsiyu\nblacklion\nharuta\nvecellio\nshiplap\nmodjadji\nlundblad\nclova\nihde\nbufala\ngrishina\nforgey\ntomescu\nelymians\nsenning\ncxr\ngohary\nkaichi\ngalguduud\nqueendom\nlagerberg\nbuzzkill\nkakkad\nmuthana\ndecety\ndelara\npanzeri\nyigit\npetagna\nsiderov\ndignifying\nmininova\npaektusan\nbenka\ncumulation\ntyshawn\nrayes\nkillingbeck\nhumerous\nrenegotiations\nbanzi\nxeo\nzarian\nanalysand\nngubane\npörtschach\ncoychurch\nprogressivity\nblander\ncoastwatcher\ndenselow\nzaffar\nwron\nendocast\ncfact\nborowiecki\nbrevig\nnonbeliever\nuncouple\nsediba\nquindio\nmediapost\nmanicures\nshakeout\nmahgoub\nmyristic\nquenches\nwhil\nbarming\nhable\ndrachman\npurdham\ncaunt\ngledhow\ncadila\nstopcock\nretreading\nchildersburg\nperovic\nrogerstone\nvonck\nstankowski\nchelmsley\nrifi\nyouman\nmomand\newt\nsaper\nhunstein\nsimma\nhoene\ncongeal\narakel\naule\nhammick\npaddleboard\nhowmet\nshafee\nromasanta\ngabas\nshippan\nconveyancers\nleonide\ndhanapala\nwickremanayake\nblantantly\nrwandair\nbancaria\nkarbon\nkotara\nitinere\nouternet\nshoy\ndaheim\nlangata\nnaccache\njianzhu\narthit\nquesadilla\nfatefully\nqlogic\nclybourne\nshowtune\nstemcells\ngorre\npelous\npapahānaumokuākea\ngatsos\noverawe\ntitsey\nmarkab\nhelgerud\ndreamliners\nmeneely\ntajir\natcheson\nlisia\ntarnation\nschoolwide\nterracini\nboesak\ngrossfeld\ngoolden\nincent\nmunish\nsiriwardene\nmisso
\ncolins\nwebcor\nwway\nscourie\nswir\njsj\npositas\nsilopi\nseining\nguiglo\nhalem\nadeyemo\nwiedeman\nkacie\nphibsboro\nruggedly\nsoifer\nyida\nrodricks\nabwr\nbarsoum\nomegas\nfajer\nnanofiber\ncarnlough\nirchester\nbabby\naaos\nmárai\nstucker\nlears\nilina\nlabanotation\nsitel\nlettuces\njera\nsawit\nattakora\nfromt\nchangs\ngenessee\nmeitar\nkvoo\nbrantham\nsteavenson\ntoshizo\nbrandman\nurbanek\ntajinder\ndanshi\nsbas\nvarno\nmillworkers\nwerstler\ngrisel\ncobscook\nadelaja\nbarwood\nstatz\nshrikhande\nbuccaneering\ncolbourn\nstansell\nbensons\ngrupero\nwasel\nkornilenko\nelderflower\nhosenfeld\ncrcc\nsimpy\nastrada\nocreata\nexacly\nxti\npanelli\ncottontown\nspiels\nmanglapus\npuskas\nprésidente\nmistyping\nclonally\nstatelets\ngizzards\nkosowski\nlaggard\nafricare\nespersen\nwellfield\nmoneybags\nthoen\nskipsea\nmaskaev\nwachsmuth\nyiren\nscratchings\ngasque\nquizzer\nbarikot\nbrandau\nprovidentially\nmetropolit\nmesika\nresons\nbeehler\nthurnby\ncomparitively\nlambi\nbobrick\npiercarlo\nlüdemann\nwoub\niketani\neolas\ncourgette\ndissemble\ndawsey\ntyrangiel\nreplenishments\nshihezi\njanacek\nequador\nsiloso\nabubakari\nazizov\nmesón\nbusser\nauxilio\noxf\nvingaard\nderosier\npicolinate\nszymkowiak\nkcf\nayele\nscabbardfish\nunkillable\ngacek\nfarocki\nunrighteousness\ngivaudan\ndeadbeats\ncetyl\ngrushin\ngeisenberger\nsabour\nperquisite\nhistoplasma\nasiasoft\naccidentaly\nmycoses\nshihri\ndaallo\ngecas\nmerrygold\ndhale\nconfabulations\nfaqeer\nskeates\nstiffener\nhalali\nboura\ncassaro\nwhiteabbey\ncatterson\nkanuma\nlexton\nmonteforte\nrustics\nparticpate\nflavorless\nleaney\nzdroj\ndecompensated\nkarsenty\nsagely\ngvardiya\nbravin\nudhagamandalam\nhalethorpe\nathreya\nwildau\nbookmobiles\nfreeflow\njirón\ncyle\ngravies\nolympiades\nwatervale\nsolemnised\neliding\ngryzlov\nmoneygall\ncental\nushanka\nmayrhofen\npropogate\nbalouch\nlobley\nhordle\ncallenbach\ncolposcopy\ntwerp\nuphall\ndebashis\nzibi\ncampling\navelina\ncrooker\nbeauregarde\ndarland\ninvi
ew\nunrepairable\nibarretxe\nmccaughrean\nbethia\nacep\ngauch\nemptier\nkapaa\nshijingshan\nendal\nhairlessness\ntunb\ngasparin\nabsi\nctfc\nmahlsdorf\nbridgen\ntrecynon\nantimonopoly\ncohabitants\nmirinae\nvoestalpine\nnamazie\nauditioners\nrosmarinus\nellacombe\nbilmes\nvesi\nauspiciously\ncipta\ndigitalisation\ndirely\nperey\nliscio\nkingswell\nzackary\nunredeemable\ngansbaai\nozs\nlengai\neppel\ncheorwon\nlighthall\ntermine\nkareli\nrodenburg\nkurzman\ncheryll\nanin\ndexy\ndingding\nbukamal\nwillesborough\nstickmen\nmanque\ngelbard\nbrotherston\npilotto\nhohler\ngillmer\nbaronova\nstaehle\nmedef\nastrodon\nmald\ntisi\nlibrado\ncalnan\nsaumon\nboisdale\nwarnham\ndepodesta\nmercker\nseabaugh\nbutautas\nmontas\nbaloyi\ncrownhill\nlomaia\nbotten\ndamns\nrizzolo\nsigl\nfilamentary\nunmodulated\nriah\nstanzel\nwrithed\nnorouz\ndoisy\nodesk\nnaugahyde\nmicrosemi\nprows\nenow\nveramendi\nregente\ntucanos\nprofessedly\nlaters\ngarshasp\nfinerty\nclevers\nicaf\nwelsman\ncardiological\nmangwana\ngolik\nmpri\nintellectuality\nmedaling\nsiddartha\ngrzelak\nferuz\ndisconcertingly\nbakeshop\nbarbate\ncnvs\nstreambeds\norignial\nweissenstein\nbourjos\ngangbusters\ntarika\nvaynerchuk\ncontextualising\nlutine\ntassler\nregularizing\ngaiter\nbroadclyst\nzasada\nlicona\nbellenger\nrectenna\nprewett\nrhaeadr\nhuggers\ngolant\nsyas\nhoudek\ncastellvi\ndumbness\nhefter\nhamamura\nantithrombotic\nretsina\nschey\nunbonded\nrestrictors\nminack\nahome\neyefinity\nperthus\nfouracre\nhelyar\nprayerbooks\ncertanly\ngalat\nundershoot\nmassimov\nsharkskin\nbeaut\nuntrodden\nazarian\najara\nplisson\nnyasha\nuncoiled\njinglei\nbresser\nasesinos\nmangurian\nstannage\npirogues\nnadym\nmalolo\nmarpessa\nofd\nwallison\ncamerer\nirrelevantly\nslavich\nduplexer\nmerenstein\nmeyrowitz\nrevile\nratemyprofessors\nintune\nlisl\nsuburbans\ncorralitos\nmckeegan\nproofpoint\ndady\nmesoamericans\ncoutries\nasianews\nirruption\nlightbourne\nossabaw\nhelpage\njiff\nhartsel\npsos\nyibna\ncinched\nwilcocks\nflury
\nwishology\nclomiphene\nstup\nbaala\nbaptie\ntirez\nkoops\nalkhanov\nhuaneng\neverex\ncoiffed\navailabilities\nsolidere\ncashner\nhodan\nrecabarren\ncvetic\nnendaz\nangiulo\ndentsply\npagai\nbarazite\nvictime\njeanneney\njinky\nvinther\nupvc\nmarshack\ncpz\nwaldbaum\nfebles\nsandipan\nromes\nhspd\neckerman\nfrang\nlibatique\njawline\ntoha\nsmilow\nheliophysics\nninkovich\nwambsganss\ntofield\nostergaard\njoselo\nfados\nnassour\nnelsinho\nappologise\njetski\ncompetant\nmedidata\nrtds\ninves\nvestra\nskryabin\nurbanize\nglasby\npagon\nbuess\npathobiology\nvitanza\nsargus\nrowayton\nberky\nkaust\ndawkin\ncervellera\nroiphe\ngriefers\nkomei\nbealings\ntajani\nunexceptionable\nnakaima\nrdn\nmidcourt\nbusbar\nmulheron\nyoshisuke\ncharlatanism\ntapeta\ncalderstones\nundercapitalized\nparmele\nsavoured\nsalivate\nflechettes\nstroj\nkidar\nizzah\nkobin\njobing\nosnaburgh\necofin\ntheofanidis\nphotopigment\nsandomir\neffra\nkepel\ncovan\nbrusco\nkasubi\nadulterants\nfoetidus\ndubby\ngorgone\nwaitzkin\nwimbley\npowerboating\nmedicinalis\nlifeguarding\nbmn\ngropes\nsynthy\nblethen\nzaeef\nladyman\ncellulases\nventon\njersiaise\nslmm\narzak\nquiett\nbatfe\ncampilongo\nstamberg\ncler\nsharqat\nbishal\npedigo\nflorita\nforeskins\nrurrenabaque\nblackmarket\ngossiped\ngaylen\nescot\nbeschloss\naccio\nzileri\nthereat\ndefaces\nsuperville\nguiyu\njefferey\njemmott\nwaymire\nsenlac\nceed\nlasantha\nfowlmere\ngulangyu\nhanung\nselee\ncorrour\noverington\nschwantes\nfallings\nbambú\nhughesnet\nkilani\nlunchboxes\ngorby\ngardenias\nmedgyessy\nteraoka\nnazik\ncevian\nbellhops\nyeremenko\nnoves\nlowlifes\nhackwood\nchippie\nalpizar\nscratchers\nboilerhouse\nnanninga\narmona\nwayn\ndevenport\nlastpass\ndyneema\nkopin\nrecompensed\nstormin\narraf\nisreali\nsofties\ncalcific\nheskin\nsubcity\nmalev\nbrewington\ngastronome\nhesen\nluciola\nnatsios\niyan\ncraniosacral\ngrubin\ntabloidy\nedamame\nchattisgarh\nfutral\nmirant\ndoored\nantonenko\nrlj\nmedecins\ncenarth\njejune\nllamazares\nskylarki
ng\nfranchesca\nwude\nyeongam\nveselka\nmiot\nfastfood\nlebovitz\ndeutschmark\njubran\neix\nguédiguian\nmonan\nradiosondes\nferrant\nsajith\nparasomnia\nburkenroad\nsensis\nnovelo\nwhispery\nkarumba\nmadiran\nohtake\nfreyssinet\npermenantly\nondarroa\noutbreeding\nlijia\nteca\nbrichant\nlupillo\nimrt\npresumptious\nsheathes\nvenusberg\nweitzer\nbbcs\nponzo\nrowlings\nkozakura\nakber\nstefanovich\ntoomua\nimmigrates\noloroso\nrobel\nashrafieh\nconterno\nkiele\ngradgrind\nprognostications\ngavins\ncannadine\nhuxford\nldg\ngirão\nzenbu\nthurow\ndoorkeepers\nkeppe\nwelco\nwifely\nschulhof\noutdistanced\ngiallorossi\nturski\ncleavable\nnehantic\npeccadillo\nfreelances\nheilbrun\nlenor\nchaumet\nmarlovian\nsilkmen\noculta\nmcgartland\nwarumungu\nvedia\nleduff\nbioterrorist\nbarberena\nkonchog\nryecroft\nalbarado\nshehryar\nruhs\nsilverfast\ntomassoni\nflato\ndestinee\nlisker\nditter\namena\nvru\npepperberg\nrickerby\nmankoff\nhendrawan\nvisanthe\nmacronucleus\nkamler\nwhih\ncrossey\nockelbo\nekerot\nsanbornton\nyetman\nmmsc\nrutsey\nunhchr\neristoff\nperritt\nsubaquatic\nmaramures\ncushley\ncrosas\nneish\nnbpf\nibram\ndenly\nkamangar\nrelvas\nmerkt\nreimold\nwalsgrave\nrustu\nboldyrev\nwalbank\nchapline\nvisibilities\nzhimin\npatellofemoral\neex\nleanza\nslumlord\nmmj\ngutterson\nnedry\nmineable\nhackbarth\nlisdoonvarna\nvelours\ncardoni\ndrai\nzerbini\nmaccaig\nbatiuk\nhyperrealist\neichenbaum\nllanfoist\nmanthorpe\nmueck\ngoic\nnohant\ncarolo\nneureuther\nmcpp\nrostering\nafak\nhawkings\nwhalin\nchemould\nsteil\naalberg\nknwo\nentner\nbasudev\nbantjes\nastafyev\nairheaded\nyaque\nhouchens\nnuez\nextérieure\nmamberamo\nmacauliffe\ncoppedge\ndettman\nunzipping\nkrombach\nexonerations\nantisepsis\nfishmarket\nrolon\nkalafatis\nbedsteads\ngavigan\nemeghara\nenglishwomen\nserrania\ngdn\nslathered\nssms\njilian\nbarths\ndeuter\nhindy\nhesket\nferra\nkovaleski\ngabalfa\nyijun\ncryptochrome\ngobber\nsharktopus\nshearon\nmarw\nluno\nniskala\nkolpakov\nmurielle\nrpx\nsathorn\nels
eneer\nesbl\nkenthurst\ncomaneci\nlopham\nrummell\nairtimes\nachmet\nbusbars\nsedric\nservient\nsoilent\nuniversalizing\nkyriakidis\nchirang\narkema\nbotteri\nzaina\nbrancheau\nsulaimaniya\ncafr\ntelesystems\ngewurztraminer\nprostyle\nlasan\naunjanue\nclarins\nfleeman\nmilkis\nurpeth\nblees\nacryl\nvinery\nachron\neflornithine\nhaying\nvinters\nresistin\neisemann\ncadishead\nulker\nklodian\ncompanionable\ntowage\nbassir\nbrún\nnewi\nyouren\nfanar\nsmullen\nteuton\nlitterally\nostar\nibiquity\nhauptstrasse\nwolstein\naguacate\nusss\nanonim\nbrockham\ncapesize\nszmanda\nsalihu\npemetrexed\naviel\ncatchword\ncontrario\nghiotto\nvasallo\nstagnates\nvelandia\ngluttons\nsalti\nmihajlovic\nnovastar\nmysociety\nunteres\nhaloed\nbackcross\ngloryland\ntullman\nyga\nangelitos\nfoulest\nkondratieff\ncrif\nijp\nquicky\nfederalisation\ngoldbeck\nfedoras\navina\npuji\nemilee\nmeretricious\nnawas\nlatonya\noversaturation\npyrexia\nghufran\nlysippos\ncafayate\ngertrudes\nriederer\nbanchieri\nmethwold\nbaldick\nfalasha\njaworowski\ngosaibi\npeculier\nmagara\nvibroacoustic\njheri\nuntwisted\nhepi\nhunold\nwagged\nskullcandy\npathless\nbirthweight\npenpont\nhenselt\ncfcl\nanthropoids\nbachardy\nlobiondo\nwelldon\nreenlist\nmcgorry\nkurnia\nspago\nshougang\nbangarra\nupdatable\ntje\ntolchard\nraineri\nderu\ntrevethin\nkernochan\nrieker\nprigioniero\nmoneyness\nonthe\nkerasotes\ntroxy\ncristhian\ndummar\nfruticans\nbioenergetic\ndonside\nweggis\nlessore\natalla\nyehonatan\naake\nmaje\nkales\nalere\nlienau\ntoups\nmaximizer\ngrage\nmalbis\ntrepca\nstenungsund\ncoper\nloveys\npoate\nzajal\nspierer\ndemocratise\nunimed\ngiusta\nassoumani\nmercaptans\nadeli\ngenzebe\ntuama\neuh\nshijun\npiotrowska\nbartkowiak\nlocky\nlaughren\nyessica\nfeury\ndelfland\nmaidman\nmetalious\ntielemans\nfilgrastim\nehlo\nsirrel\nbespeak\noich\nsheu\nlangbehn\npapermaker\nelgen\nexaggerator\ncicak\nbudarin\nlieing\nhydrofluorocarbons\nkissack\nleodis\nyungang\nmelvina\nalverton\nwaisman\nabstractionist\nfassino\n
cumbers\nmuhanga\nbiobanks\nsavours\nprausnitz\nlatulippe\nburglarizing\ndelisha\nmuleta\nbusic\nnicmos\ngayen\nakomfrah\nrafaella\ngiuca\nkoshetz\ntadawul\nwinzer\nrealogy\nmigrante\nkontiola\nwdef\nplagarized\ntadahito\nabunimah\ndickason\nmurino\nkamli\nhulsman\npanhandler\nmatc\nboath\nekram\nbarquero\nlegkov\nporetsky\namuria\nstratcom\npiggeries\nwhitleys\nbitingly\nherfindahl\nalanen\nclignancourt\nkular\nbrotton\nwaymon\nfuliang\nthrivent\netra\ncredico\nkamaboko\nheartbreakingly\noverpainting\ngego\narcu\nrechy\ncringle\nstoclet\nyeso\ncarate\ncremator\nrehear\nborsalino\ntrivet\nkrekorian\ndotun\nsonner\nsansum\ncazzie\ncoyer\ncuscatlan\npioneertown\ncustomizers\nirimia\nprovender\nneotenous\nmarm\nresemblence\nsnoot\nsuperlight\nbunking\nduhks\neverpresent\nmosborough\nrohrau\nshallotte\nfiretrucks\ncasada\ncuyuni\ncattery\nkolachi\nlerato\nfaurschou\nnavigo\npengwern\ncamon\nhezlet\nholmesville\nmccallan\nditz\ndeckchairs\nbonsignore\nreposing\nstarworld\ncoachbuild\nfeethams\nhildenbrand\namoako\nkazal\nbeeks\nletendre\nyouyu\ncolpaert\nselco\npimiento\nparnia\ncofresi\nnabers\nmultipage\npolemically\nidonije\nmitterer\nnonphysical\nwachner\nmandoki\nzales\nwahler\ninflowing\nwowie\ntirebiter\nmanhart\nsarmayeh\ndudka\nhidradenitis\nmohammadyar\nridging\naneke\nsaadoun\nduanwu\nsiginificant\nbluebonnets\nsidique\nwitchford\nekici\nosel\nbaliles\naleotti\nredrow\nserdes\nbelber\ndroppingly\npropjet\noqo\ngreyboy\nbitartrate\npsinet\nshervin\nlangis\nhershberg\nmutoko\nfilipinotown\nclobbering\nelkhan\nanucha\nmcwhinnie\ndrazan\nraido\nwitherow\nkme\nzusak\nneurolinguistic\npinget\nnottle\nmichas\nresarch\narnao\nacsh\ntradd\nbirck\nwellingtonia\nmacoute\nwnat\nvloggers\nevenlode\nfulgham\nugonna\nvinke\ngaylon\nvanderkam\ncolza\ncarnan\nborregos\ngeniality\nirbms\nbracquemond\nalckmin\nyousendit\nalloush\nascp\nferda\nreinbert\nfortunoff\nkylesa\ndockett\nsnowdown\ntameer\nsangatte\ncaramelization\nghl\nskanky\ncourtships\ncarreta\nobikwelu\nneutralino\n
intermetallics\nmongeau\ndalmiya\npece\nmayombe\nassel\nvoirin\nchildhelp\nvenero\nbunaken\nmoei\ncarrega\npresspass\ngearin\nunswayed\nbartold\nimmaturely\noahe\nbeqir\nmaximillion\nanderl\ncuidad\nengemann\nbluescope\ntamlin\narcati\nblakenham\nmentalists\nplaylets\nazw\nduisberg\ndunbier\ngreatcoats\nmirams\ncounterblast\nseismographic\nszczerba\nbubbi\noatcakes\nusak\nburnam\ncrewless\nmuntadhar\nkimiya\nunderutilised\nwitkop\nawani\nkirkee\nsbj\ndnm\nyunel\nirbs\nszetela\nusec\ngrigorije\nkutum\nmolls\nzhihong\nhatmaker\nrvl\nsouthminster\ncontrarians\ndromara\nannastacia\ntooks\ndashers\nboschwitz\nprator\neizaguirre\nalliott\ndigged\nrenaut\nlaureat\ndiabolically\nleopolda\nheirship\nundesirably\nvallegrande\nlorinda\nakqa\negad\nmotorwagen\nleclerq\nsild\nbordman\nevdokimov\nfengler\nmazaheri\njereme\ncailin\npodeswa\ndisembowel\ngranita\nworlde\nchavit\namulree\nauo\ncach\npollos\nchoicepoint\nduley\nfonden\njeanjean\nabergynolwyn\nrubiks\nfialka\nactivia\nwhirlybirds\nairborn\nfode\nponseti\nrozina\niffco\nunifrance\nautoparts\ncentinel\nexhibitionists\nconvergencia\nascod\ndunhua\noneshot\nsportstime\nfishnets\notcbb\nchowpatty\npiris\nsibghatullah\nswartzentruber\nnauert\nbarmaids\naev\ninauspiciously\ndancefloors\nderogated\nfrontbencher\njeetan\npeple\nglumac\nzhangs\nxrp\nchelomei\nflorman\nudaltsov\nbilik\nmaaten\ndaithí\npentapeptide\nweidlinger\ngloated\nshalhevet\nzoppo\ngentex\nbackley\nbyock\nmeidner\ndigestif\nnilas\ngranahan\nroehrig\nkafar\nmicropower\ntweetdeck\neliska\nantedating\nphalloplasty\narcangues\ntrillionth\nhelgen\nhasee\ngehrmann\nsitecore\nhongguang\nprabhjot\nalyea\ncitylights\npitchkolan\nncx\namirante\nrenovators\nhydroxychloroquine\nbroons\ngarishly\ntouhey\nrenninger\ndolans\nlarrabeiti\ngubba\nscreechy\nneets\nwalikale\ntaisce\ngundle\nxfe\nbradney\nekv\nplavsic\ncoffe\ntingri\nhomburger\nbiddenham\nlecourt\nsubventions\nkirshbaum\nchds\neverth\npetering\nhessey\nroundtrips\nmuskett\nbeccafumi\nfaline\njohnstones\npeahens\
nunecessarily\nvorotnikov\nsteorn\nwestbridge\nnaida\nmaev\ngourriel\nmodnation\nsalvington\nsamaire\nstaes\nsydnee\nresections\nnonobservant\nkelvyn\ncharna\ncoordinative\ndurris\nanze\nvbac\nmilutinovic\nmenia\nmotorik\nuelen\ndancehalls\nrokia\nbreyers\ntudy\nmatulino\nintosai\noxymorons\nvises\nmunshin\npunkish\nnedergaard\nstudwell\nxfr\nmarkovsky\nabbassi\nrudebox\nestalella\nglycolaldehyde\nanuga\nnonmedical\ntrinitrate\nchallow\ndanzhou\nhonigman\npetracca\nsteiglitz\noverzealousness\nmiuccia\ngyuto\npierini\ngatchell\neriberto\nhersonissos\nmegalania\ndelicado\norner\nagvs\ndambuster\nvitaminwater\nseye\naswani\ntoukan\ndbas\ntartrazine\nkinbote\nperdigão\nhelstad\ngearchange\nkeelty\nchaplow\nhumidified\nsnowcapped\ntsatsos\nmackintoshes\nmeanwell\nglom\nphototaxis\nunscear\nsomborne\nbarthomley\nkorobkin\ndunmall\nmantelpieces\npasseier\nrumspringa\nvillarejo\nrodmell\nallsaints\ncapitalises\novelar\ncadoudal\nchicon\nphotosensitizer\nbowdlerised\nretranslation\nbinsar\nmalori\nruminative\naflalo\nisoflurane\nmaxiell\ntureaud\nmirro\nzijian\nboleskine\nhartong\nclickz\ngerenuk\nkopplin\ngitanes\ntaumoepeau\ndague\nindustriali\nhaverthwaite\ngettelfinger\nbrassware\nportended\njiuzhou\nrakhal\nhilberry\njuaquin\nmarturano\nchessplayer\ndougher\ncorking\nmarseillan\nwickersley\nfengxian\nsotoudeh\ntoreo\nueyama\nmummert\nlangesund\ncogbill\nbipolarity\nzylon\ncrackhead\ngadlys\nschwindt\nzuleyka\npompously\nsestieri\nsalespersons\nnieuwoudt\nraphaëlle\nsukowa\nenerji\nchurchgoing\nwatchkeeper\nwttc\nmazeika\nperdicaris\ninsync\ntulo\npersada\nballywalter\narkins\nameliorative\narambula\nrookes\ncodetermination\nhalai\nhutarovich\nreynie\nmystérieuse\nmagnetotail\nferdin\ngobeil\nlemesurier\nwinstons\nnaad\nbonitzer\nulibarri\nasmal\noure\nterol\nzuehlke\nclosson\nternay\nphusion\nsolutia\npisarcik\nhekman\nvulval\nmascarpone\nepipen\nquants\ngyrodactylus\nsevran\nteabags\nftca\ninsensibility\nqpc\npayin\nseverities\nweyerhauser\ntiendas\nmoosilauke\nwakened
\nraap\nziobro\nhousel\nbrightfield\nmisagh\nharja\nbattalia\nmicrocap\ndowlatabadi\ntyrrells\nizmaylov\ndeuda\ncraigend\ntraffig\nmövenpick\nyevseyev\nplutocrats\nbucshon\nflasch\nkendama\nbarkway\nbrowbeaten\navramopoulos\nailin\naflp\npredock\ntapatio\nayllon\nbozena\nthyolo\ntutone\naiuto\npruess\nkulemin\nfroh\nchanels\nreul\npääbo\nbape\nbyw\naouita\ntarried\nwfr\nlammi\nfeyyaz\naerc\nmicmacs\nrestrung\ncalderhead\nomotoyossi\nseidell\nbrammall\njaine\nhurdzan\ncardno\nguttenplan\noutgrows\nmarnock\ntresman\ncolbran\nwyndhams\njabin\nshoaf\nfunkel\nokkalapa\nkppc\nrynd\nschinasi\nvoogt\nmartorana\nrewarming\narmantrout\nmanoury\nanvita\ntenille\ndelargy\nbausell\nrurka\nsebouh\natonic\ngwd\npanner\nnixes\ndotage\nafam\nunarticulated\nlombarde\nlanseria\nfieldworkers\nkikkoman\ncreteil\nyijin\nslowhand\nedgeware\nkolzig\nneylon\naltana\nchatburn\ncautiousness\nkinning\nbruntwood\npiggybacked\nbreakroom\nbancaire\nhuckabay\nrathmell\nvende\nmalarchuk\nguarente\nxenografts\ncemi\nevolvement\nincoordination\nuusitalo\ntabai\ncyclophilin\nanjulie\nedgworth\npaterniti\ngallai\nhorsedrawn\nshrien\nbluepoint\npompiliu\novereaters\nlazin\nnaegle\nketura\nkaleigh\nthundarr\nanozie\ndaus\ncundey\nnantel\ngoldsack\nhackerman\nprologis\nquincentenary\nkading\nbryngwyn\nwtw\nbukhsh\nrancidity\nborca\nanwaar\nelephantopus\nfrasure\nbsam\nlabii\nvath\nabce\nbrunon\nwappel\nconfict\nretyped\nanotha\nwaj\nbroadwalk\ngeisa\nmiqdad\ntaxidermists\ncascos\nboatworks\npredetermination\nunreturned\nmizukami\nsuposed\nphilippoteaux\ncienaga\nndcc\ninsomniacs\nrozhkov\nsissener\nengeler\nachived\nsgg\ntarling\nmicrovision\nzoppi\nalbariño\npassauer\nkoyuk\nconvallaria\nalom\nbeanery\nkamishibai\nruffier\nrotliegend\nlectus\nbernert\noverdramatic\ngrevy\nundefiled\nbhimani\nlaj\nrailsplitters\ndangin\nakinnagbe\npilsbury\nwgaw\nrotheray\ncubing\nzhurbin\nzehetner\nguanghan\ngluey\nslanderer\nbatini\nquiets\nsannie\ncantora\nkulis\nstierlitz\nreceieved\nhowsham\napperances\narnaldur\nhoi
sin\ncharliecard\nkamanda\nmethylone\nnavigant\ntsegay\ntollbooths\ncoore\nreenlistment\ncrooned\ntsurphu\ndabengwa\nlavatera\nwarrap\nrefix\nsplashtop\nhamleys\nmetrostar\nfloreal\ncolombani\napprendi\noutwell\ngweneth\nwtok\nwittner\ndonorschoose\ngoodnough\nstuchlik\nhosain\ncorkin\nbioanalytical\nnegm\nraffish\nsigfússon\nljmu\nerlestoke\nleimgruber\nandou\nenunciates\njuny\nskander\nyongqing\nconfigurator\ntrustco\nkswo\nketchen\nhydara\nfaucons\nreverso\nemaus\nbheki\navandaro\nloungers\ncollender\nshoniwa\nsariwon\nlegimate\nhuggies\ncanonero\ndolgoff\nvits\ntokarski\ndelauney\nswordfighting\nblauser\nhooshang\nkulyk\nsollie\nhowardian\ngerdemann\nndvi\nkhondaker\nbentalha\ndiers\nuspenski\nbenjo\negalitarians\nrascist\nmichelman\nsomersaulting\nthommen\nyoff\nrideable\ncarbisdale\nsquadmates\nbergdoll\nbambous\nsalee\nzijin\nramsdale\nllantilio\narguer\nscaduto\ntoshifumi\nzhifu\njinhui\ndecidely\nclamato\nkarthaus\nmushka\nboothstown\nrectifications\nhellertown\nferguslie\nuneccessary\njapanophile\nkatawal\nworkplan\nattributor\ncumpston\nschabowski\nqarabagh\nutj\nbogusz\nkernes\nbartholomay\nlanghammer\ndematteo\nisfa\nhinderaker\noldaker\nfeets\nteenybopper\nadland\ncheteshwar\nfashionologie\nbasudeb\ncarby\nbramsen\nprowls\nhinxton\nchaperoning\ncledwyn\nmadonie\nshoppach\ncrosspoint\npelkonen\ngushers\neyecare\nosipenko\nabler\nbrima\nigniters\nunistar\nhnpcc\nhenwick\ncigno\ncantagalo\naberra\ncadley\nyapping\nkosilek\npavlica\nengela\nsgps\nrustomji\nketterle\nkreuter\ncorect\nthabang\nsirop\nbanac\nsjb\nwachiramanowong\nholmesian\nbartling\ncortis\nstainthorpe\nfumar\ngics\nzairi\nreresby\nyarrington\nmahd\nseawifs\ntropper\ncamberwick\nsparv\nleeflang\nstyer\nkimmirut\nbaljeet\nmclaughlan\nfazul\nborinqueneers\nchanteys\nwoza\nwandy\nacers\npropagandized\nberhe\npuah\ngrimalkin\nraanan\nweismuller\nngobeni\nglatfelter\nseila\nxisha\nbrinon\nantao\nwabun\nfabulists\npuva\neventid\nvaporisation\nkiptanui\npinkeye\naedo\neavesdroppers\nkeyholes\ntelan
g\njianchang\noxblood\nhartack\nsepehr\nagrihan\nparnells\ntacha\nzaker\nsulim\nconfortable\nabongo\nwistfulness\nbankas\nknepp\nexerciser\nmalila\nensisheim\nwvlt\nribby\nreya\ncastellabate\nquoz\nexabyte\nmusayyib\nmtoto\nadesina\nbroadminded\nfudosan\ngehrlein\nbeezley\nriemersma\npreforming\nkralove\nfinmere\ncamiel\ncicoria\nscob\nhansch\npuligny\nresuscitates\ndelaere\nfendrich\nsouthwater\nvanelli\nwhirr\njohno\najil\njgb\nwangechi\nkyats\ntonini\ntrifu\nwesely\naray\npartwork\nvitalii\nsaddiq\neshete\nunrein\ndureau\nnattering\ndukha\normer\nhydrating\norpik\nnyangatom\nanjanette\nvilani\nopdycke\nvaibhavi\ntipnis\nprurience\nunsealing\nrockwool\nclitoridectomy\nburngreave\npianta\nrecumbents\nmcconaghy\nmoonbow\ngasometers\npatpong\nthuresson\nplaynow\nsudsy\nlegibly\nholdcroft\nfroglets\noliviera\nfonzarelli\ngriptonite\nsalie\nspatiales\nbistritz\nsoulfully\nyasuní\nslickrock\ngreaseball\nhyoscine\nneringa\nchimères\nvideoed\ncutchogue\nstergios\naritz\nkanke\ngtfo\nkasrils\ncardiol\npaté\nbayada\nlangrish\nsarabeth\nhofmans\nabderhalden\ngamehouse\nherbet\nwakefern\ndozers\nomeish\nbarbella\nbuley\nexning\nripponden\niev\ncohu\nburfield\nmammoplasty\ncarrock\ninamorata\nveiny\ncompris\ncarnoy\nbroidy\nbarrand\ngroundball\ndogu\ncaprese\ngenesio\nhallwalls\nkwhy\ntred\nmaulit\nfacinating\nlifson\ntroed\ntestteam\nsteria\nrajabzadeh\nmjk\nstron\nbuckleys\ngulko\nhuntingtown\nstraughn\nexoskeletal\nstudentships\nsternburg\nweizhen\nplainwell\nmoonset\nfings\nantle\naltmaier\nmulticolour\npfitzer\nmyddfai\ngetronics\ncorfiot\npeggs\nnossek\nchaon\nortygia\nglenford\ntaughannock\nbreeched\nrokr\nsaffa\nramiah\nmikovits\nsearby\ninteractives\nngardmau\nluambo\nfritchey\ngwinner\nmosisili\nlizcano\nhusak\nchavarri\nnebiolo\ngrampy\nvicolo\nhancke\nquadras\nreggane\nuonuma\namberson\nbouchareb\nmetts\naipc\nspik\naddtional\ncontar\nimpasses\nbreward\nrilly\novercalls\nsoilih\nlinny\nethne\ncaracazo\nspicule\nstairlift\nlaide\nadlerstein\npersonnally\nborve\nclar
el\ncsfb\nperfluorocarbon\ncecchetto\nstanzi\nanthonys\nmainak\nmuraro\nkailey\narnaout\nprashar\nalica\nleor\ndonats\nscraggly\nrapada\nazima\nfootbinding\nasplenia\nppap\nadjustability\nskokloster\ntorrs\ndurjoy\nharpring\nsardelli\nbesra\nhediger\nreddings\nbaechler\nmoviestar\nretconning\nherxheim\nintracardiac\nbestower\nafos\ncerumen\nspevack\nmasanaga\nfilippelli\nconvergència\nogd\nbibbidi\nkewpee\njunkins\ncomotto\nmandlik\nberrill\nshaibani\nblasphemed\ndoscher\ntonj\nadhamiyah\nhumoreske\nresus\nremounting\nkellye\ndontae\ndaiso\nmonse\nthannhauser\nvirological\nakona\nnasad\nmulya\ncnrt\nreindeers\nkusc\ndoronin\nyuxian\ngimpy\nomotesando\naymond\ngrumley\nkelam\nrittman\nussery\nppx\nimpex\nreckonings\nkoussa\nconserver\nmatal\ncelent\ntroglio\nmistrusts\nhiner\nkinkell\nasraf\ncountrys\nsonai\nherault\nmalnati\nentangles\ndiety\nrenehan\nstendardo\nengh\nansanelli\nchaviano\nraschid\njousters\nsalmaniya\nwhiffenpoof\nquistgaard\nbodeen\nreportorial\naerating\ndrcongo\nlidderdale\ncomed\nmicrodissection\ncostcutter\nstomu\ncarcharocles\nsunniside\njurg\nrousham\nobadia\npieres\nekaterini\nsantaquin\ncarrbridge\ncroisic\nkhau\nlonchakov\nbaatar\nxantia\nhumphead\nrawan\nsteinski\ncinven\nlochsa\nsammir\nplaat\nmuallem\nserigraph\nallmond\ngarbis\namcor\nmelamid\nairwalk\naitcheson\nterminuses\nconvalescents\nglast\ncarrizosa\nmirriam\nlettin\nmontie\nmakhoul\nmumbi\nsimmie\nrosburg\nserhan\narellanes\nmodernizers\ngamania\nengish\nthalassaemia\nhongmei\namirault\nkrovanh\nvelvel\norien\nstelzner\npaddleboarding\nstolojan\npacelle\nnadhim\nabdelhakim\nhachijo\nloughgiel\nlindall\ngarches\nhandcycle\ndevins\nkurtzer\nbleckmann\ndumke\ndigresses\ndstl\npaperino\nfouzia\npoux\nminchew\nglowsticks\nbootmakers\nfogelsville\nshudehill\nkelsie\ningratiates\ngrecu\nrotfeld\nguanahani\nzayyat\nrebutia\nunreciprocated\nrubberneck\nadevarul\npejeta\njarrin\nvoicebox\npsychrometric\ncuerno\nchauliac\nsarka\nleves\npoltical\nlabrada\nonate\nphats\nstowford\nbloaters\n
finistere\ncharacterless\ncoxwold\nmicon\ndengfeng\nneuffer\nmanderlay\nkincannon\nreelect\nsteinacher\nbloks\ndln\ncolker\nitemised\npuopolo\ncomputor\nmaxinquaye\nbatched\nhaagensen\nunar\nvelvelettes\nkilmany\nkoshino\ntaqwacore\nkilocalorie\nkosik\nreformats\nsanctuaire\nafes\nsoliola\nyazzie\narcosanti\nderuta\nhaemophiliacs\nsalvarsan\nlibanus\nenedina\nteamtalk\nscreenwipe\nicasualties\nziporyn\nfrancescatti\nudana\nwolodarsky\ndonilon\nneopolitan\ncrinkly\nsistersville\nkarriem\nrubert\ndodman\nfasih\nnahasapeemapetilon\npungwe\nlafrentz\nmulville\nruettiger\nrhinegold\nhoulston\ncarion\ngilesgate\nfalconí\nafterman\nexelby\npreprocessed\nproctored\nrootin\neisenia\nversata\nkijiji\nmatondo\ncjeu\nexorbitantly\nwinegardner\nnhpc\nmaaz\nbarellan\nzuhal\nhaikara\njostens\nidiomas\nhesson\nkorsten\nlahman\ndarcus\nchishty\nalexanian\nsimonova\nvandalisation\nprofitt\nfishwife\nvreni\nprincelings\nstilley\nigby\nmoschella\nfeherty\nggw\nflemons\nborio\nnonmember\noxyhemoglobin\npierer\nyorkhill\ngrenon\nstockbreeding\nquil\nstrzelczyk\nhydrozoans\ntimanus\nparanavitana\nmucke\nsdrs\ndaung\nphotocoagulation\ntightwad\nastacio\nmicrocircuits\ndelory\ninacurate\nspandana\nrepass\ntamyra\nyopougon\nborlongan\nvoluminously\ngarms\nspurlin\nelvers\nchorn\nlacedelli\ndeily\nheyy\naliff\nshahal\nlifa\ngreave\nduensing\ntahmoh\natep\ntridgell\ngadhafi\ndniestr\nbronzing\ncantering\nbigpoint\nvonetta\nreappropriated\ngarath\ndrumpellier\neconomizing\nrondonia\ngyimah\nmaiava\nderb\nyacoob\nwackos\najala\nsaenko\ndehumidification\nprodigality\nbrookbank\nmethi\nundeletable\nwhelton\nravichandar\nthrowin\ncarmyle\nreplaytv\npetm\ndrash\nalazraki\nhellqvist\nshellabarger\nspino\nreinnervation\nlolling\nmichalopoulos\ngoodhand\nysc\nvaliasr\nhorobin\nbeardyman\ndisaggregation\nkhagendra\nbarkett\nmcclellanville\ngares\nbfpo\nbuoninsegna\nszoka\nmcanallen\nzoophiles\nhandong\nballygowan\nsomtimes\nconcievable\nkagal\nmaruta\njocketty\npinotage\nfurbished\nniver\nritters\nhanap
epe\nsplotch\nmalem\nnhial\nbananafish\nrecalculating\nuddi\nagms\nnaturalise\netag\ncottager\niosseliani\nnerpa\nelidor\nfrisina\nmarinate\nclewes\nwfie\nvideogaming\nhartly\nhagy\nkgan\ntaris\nfacture\nkoteshwar\nellmau\nisri\nredican\nkakakhel\nmanze\nfedco\ncusimano\ngareau\ndavone\nlaurencekirk\nnouv\nfanger\nazize\ncopters\ndhh\nmoimoi\nisoline\ntermeer\nglaciologists\nrememeber\nllangynwyd\nrostal\npedroni\nwerkman\nhypothesise\ntollesbury\nfahlman\nlanus\neug\nsellotape\nschnabl\nszatmari\nrogart\nrusape\nhoyzer\nnupen\nraisings\nnewswoman\nkallum\nmarshburn\nbarnetby\nkovack\ndelaurentis\nsapey\nmacosx\ndusautoir\nweightloss\nbiostatistician\nampico\nlennons\nvernoff\nzmed\njurin\nalices\nsudarso\nislambouli\nlynndie\nhascombe\njátiva\nhardies\nsublicense\nsluizer\ndinkeloo\ncelex\nkolobnev\nplaybooks\nburtynsky\nshadowless\ngehr\nakti\nshringar\nnazon\nweasly\nncai\nhomola\nbenander\nduncairn\nmagris\nhousemasters\nagsa\nwoonton\nakahoshi\nastrov\nhanborough\nkleinsmith\nbarlows\njaric\nruolin\newl\nmateschitz\nnonpathogenic\ndisgraces\nwajih\nbegijnhof\nwinser\ngrimod\nworths\nlinesville\nabstainers\nprestin\nrelm\naniara\ndisy\nlenita\ngagnaire\ncamaleón\nliwen\nsolidum\nponcey\nheadrush\nmuhs\nthemerson\ndeterding\nthreeway\nseaspan\nroncero\nxinji\njawans\nserifos\ncussac\nmokashi\nsaarbruecken\nguelfi\npedja\nmagimel\nringham\npealing\nmyracle\narestrup\nbatebi\ncpq\ndarwinius\nruardean\nmnookin\ntramon\nkawthar\nweinhandl\nbavetta\npoppit\ncomponentry\nrowetta\nsoundalike\nbqe\nsheko\nstethoscopes\npreconscious\nkarabulak\nfusheng\nmelangell\nacassuso\nchastleton\nprowled\nelain\nlegget\npolybrominated\nmalesani\ncarnosine\nparitosh\nliriope\ngrisoni\nbernon\nbrandberg\nhillmann\ntomka\nustyugov\npluna\ncapsulatum\nvengefully\nbirao\nbedsole\nlichtblau\nsamek\nbettinger\nbrammo\nbrabenec\nunfactual\nlert\nbattani\nkubis\nsadden\nslemp\nchemotherapies\njingxi\nflegrei\nhohberg\nlobstein\ntupman\nhennion\ncontruction\nkalangala\naspros\nrealme\ncorpora
tisation\nfischoff\nospital\nbrein\nscatcherd\nribblehead\nbookclub\nlassan\nkalliopi\nmoonachie\ngreediness\nbecnel\nleapers\nlupit\naldy\nkalus\nashenfelter\nbranum\nlaister\ncoloristic\ntesauro\njiye\nflahavan\nsilveri\nvalspar\nqiannan\nholscher\nwhiffs\nshiksa\nadbul\nliks\nrupertswood\ncockup\nfuzzing\nokpala\nopacic\nobvioulsy\ngodelieve\ncué\nmcdc\nandrex\nsweers\nzumar\nbenacre\nprepubertal\nkilton\nwhitewashes\ngorgeted\ngambella\nfriscia\ntetney\napcc\npicabo\nkulov\nsathyu\narcano\nsby\nsouleyman\nschmooze\nsinndar\nmajerski\nhohle\nimli\nkapugedera\njohnta\nthursfield\nsynonomous\nkasyan\nmennin\nglyptodonts\nbickert\nrazzy\ngurdev\ngeaney\npcast\nafz\nforeordained\nmedlars\nshikabala\nwilh\nbza\nipss\nrfw\npanka\nfeatherstonhaugh\ncarders\npunker\nnegresco\neurobird\nbrutalize\nekeroth\nconcering\ndoucett\npiatkowski\npuchalski\ncoeds\nhandz\nleboutillier\ncallejon\nreny\nperfectibility\nwhow\nflattr\ntowry\nzagorski\nbjorlin\nyenan\ngitarama\nfloros\naner\nperaino\nhrytsenko\nnumark\nkonia\nstationhouse\nfortius\nchiggers\nembarass\npbms\namster\nesselstyn\nvoltmeters\nhemsby\nfluidics\nkalaeloa\nrtms\nbaranyai\nkorbin\nasmi\npuletua\nclingendael\npietarsaari\naliments\nlustral\nallami\npriem\nfedossova\nextranjero\ntorresani\nbioaccumulative\nrlf\nframwellgate\naeth\nnoncontiguous\ncleyton\nbedinger\nnykesha\nhamblett\njadon\nminibike\nwbe\njcvi\ngillikin\nhousen\nhelú\nhesco\nngema\ndaigh\nwarzones\njoas\nboscolo\nrossitto\nloyiso\nbivar\nsugarhouse\nnemir\nmicrometeoroids\nhogen\ntrainspotter\ncusset\ntratamiento\nsockalexis\ndtap\naphanomyces\nelona\nifosfamide\ndissectum\nshlaes\nkrown\nlubben\nappi\nrtkl\nyordanova\ngisp\nshahriari\nzonneveld\nseaforths\nbrickwood\nlajamanu\nappraises\nlewellyn\nsineva\nolestra\nultraportable\nbaloha\nlexar\nkedo\nimrul\ngunst\ncryptogramophone\nhospira\nseafoam\nshpeley\ndsca\nbarbery\ngudiya\npilbrow\nthurley\naneel\nnswc\nhamai\ngarnons\nnegligee\nshandra\npariol\nbasrur\npanathinaiko\nmetreon\ncutshaw\nkanta
ro\nshingleton\noutlives\ncaddying\napert\nlewine\nodora\nseligson\nfencepost\nkitchenettes\nsteegmans\nsabawi\nyatseniuk\npimply\nscrawling\nariege\njeremiad\nhoren\nunchr\nsendings\nkedgeree\nultrasensitive\nplayworks\npehaps\ndormand\nqmv\nhagbourne\npsaila\nkonfabulator\nmeduna\nsieu\nburried\nstellas\nfashir\nelucidations\nazamour\ntruecar\neverynight\npitstone\ngroarke\ncyder\ndesuetude\nbocian\nlittleford\nfregonese\nmorrisette\nfairplex\nbedspreads\npoznanski\nquiggle\nhait\nwalbert\nminstead\nelward\nkryger\nnonconsensual\nbroadoak\ndiversifies\nfountainbridge\nsaem\nstoli\neesc\nmordt\nchinaberry\ncatcott\nnalgene\nlucet\nhawsers\nromneys\nterrestar\nschlussel\nneuharth\nbusabout\nondrasik\nwhiteheads\nshorrocks\ndclg\nhaemorrhages\nahlus\nwatchkeeping\nvih\nexternships\nbedlinog\naspirator\njokwe\nmanmeet\nbings\ncasner\ndovish\ncoyuca\nlonnen\nchivilcoy\nmicronucleus\nnobrega\njakobshavn\nyakob\nbleeders\ninformatie\nmallonee\nasfandyar\nroamin\nreclose\ngarnant\nversyp\nmistrusting\nnecromorphs\nworlders\nmunsel\nloughrigg\nhumanised\nsudakshina\nhinnigan\nschulmeister\nyanin\nflitton\nfaraci\ntrepte\ndmarc\nsbirs\nkouachi\ncack\nceccaldi\nocarinas\nwlt\nonlooking\nhoggatt\ndaybook\nhiper\ndoriana\nwhodunits\nranallo\nkirkheaton\navci\nhidy\nwaldbühne\nassael\nlawr\noportunidades\nmorett\nhollandale\nfornaci\nannabi\ntownsperson\npalliation\nkaradas\nsallying\nvenerdì\nrokhri\npandher\nstammen\nverlyn\norgin\nslacklining\nvoas\nwkt\npatru\nmdea\nmeris\nrailpower\nerel\npittsburghers\ndambrot\nmondragone\nkirkstone\nhieronimus\nllanddulas\nsalena\norido\nredactors\npollington\nmerelli\nrouzer\najaan\nfirestops\nmrnd\nwincobank\nrheaume\nbiomarin\nwinterberry\nkucharczyk\netns\nhirwani\njopek\nhouli\nearlybird\nlaboureur\nsavané\nmarras\nbdn\nmcausland\ngaultmillau\nlysandra\nbranda\nchkhartishvili\nphh\nrysher\nwangel\nnavitas\nvesnik\nglobality\nranae\nfrankham\nerlbach\nmoorends\npersnickety\nrepacking\nsissay\nsulfasalazine\nhorsmonden\nnury\nlouisian
an\ncwmbach\nbreeland\nthone\npryke\nnordyke\nsaracino\nsoemone\nbuffone\nmillea\nsuprun\ntranscorp\nsafwa\nsubburaman\nethicon\nsoulforce\nelectrodialysis\nstarin\ngéa\nsoozie\nsquabs\nnavo\nyazov\nsaltmarshes\nkovarik\nbielawski\nkohlrabi\njustes\nbiltz\nknowest\ncolinear\njenbacher\nlousteau\nmcmenamins\netiquettes\nburtch\njaroussky\nnafar\nslavka\nsamouraï\nufu\nquenton\nstanely\ndumay\ntolimir\nleistikow\ntagliatelle\ngelles\ncregagh\nxiangzhi\nsiskiyous\nmaxjazz\ntusd\npamella\ndontcha\ntraute\nsubversively\nburray\nsutomo\nsideboards\nmelloni\nmevis\nzadik\ncamisole\nhamson\nordure\nimod\nsowerberry\nacoustica\ncaher\ndidanosine\ntopliss\ncotinine\nsaes\nters\nsedova\npettoruti\nsuccesfull\ngovenment\nbiffa\nogunbiyi\nbengeo\nfarty\ntahari\nhaseeno\nholaday\nfauchon\nsuppossed\nkristiansson\namongs\nkarise\nkléberson\nausra\nlaras\nverlinsky\nconcent\nglobalising\ntanigaki\nmaeil\nasteroseismology\nfilburn\ncalabaza\npoliticial\nglassfibre\nborgeaud\nremarkables\nwhacko\nunrepentantly\npozzovivo\naprd\ncommonsensical\ndramane\nisfield\ncouchsurfing\nkenes\nyorston\nlaq\nnorde\nenterohepatic\nzeger\ncremello\nfolkstone\nparasail\nbarel\ncenovus\nvanneste\ndiaphoresis\ndpss\ncolaianni\nirresolute\nrideal\nseys\nraoni\nembling\ngaravani\nmarcham\nnarro\npérec\nsterlington\nbuffie\nfranciso\nsophina\nliverpudlians\nburga\nsokolski\nkitley\nabban\nfussa\nfolinic\nscarpati\nstude\nisovaleric\ntopcoder\nblazo\nseef\naugusten\nzehntner\nauguries\nhemond\nhattery\nzevs\nyabbies\nekranoplan\ngleans\nretentions\ncheal\ncrabapples\nnordtveit\nmuthspiel\nrenouncement\nkavre\ncornavin\ncker\niros\nbachleda\nchaldees\nannisa\nirton\nmings\nmelasma\nmuravchik\nmohai\ngedeck\nhistroy\nagcaoili\nnoyz\ndopson\ncaveau\nmyza\nbermeja\ndergoul\njonet\nnocella\ncefalu\nfangtasia\nsanteetlah\nkratovil\nschreffler\nhowabout\nlittel\nunscrambled\nruzi\nmughniyah\ndowndraught\ntrampler\nalterra\nantea\nhasmik\ndemonstrandum\nsiahaan\nmcqueeney\nyekutiel\nconable\ngeldard\nkstu\nmillho
ne\nlaiwu\nanla\nmukogawa\njpac\nstrenght\nkfh\nfresson\nweizen\ntwitpic\nhadorn\nerddig\nellerson\nshono\ncarlita\nkhamovniki\nchinda\nveysey\nscota\ngrotberg\ndjp\nunfamilar\nnitwit\nclintwood\nwansink\nconnington\nhumidor\nmoviemakers\ncarriageworks\nbesset\ndewe\nmoussy\ndolch\nziming\nverdery\nsoif\nplyometric\nsuchomimus\ncstr\nswype\nmcrc\nalrady\nblaqstarr\nopik\nconceed\nemmit\npatternmaker\nheidy\nhydrobromide\npharmacal\nsakartvelo\naleksy\nnastar\nviolaine\nwoodmansee\ndaguin\nseethe\nlamya\nreadhead\nberhalter\nambev\nlignans\nneuros\nruhrtriennale\ncrism\nrichenda\ngarefrekes\nguinobatan\nvirtuously\ntiptop\nsauerbrun\nbalza\nladji\ncanright\ncagno\nmoqtada\ndundela\ngringa\nlcme\ngrascals\npolebridge\ncomission\nmonterrubio\nbellugi\nautocue\ncsq\nperturbs\napitzsch\nyadana\nkrajcik\nsandelin\nktrs\nshirland\nsharara\nnyko\ntenseness\nkuito\nwenming\ngeumgang\nkidon\namout\nshaqaqi\nmanke\nakoto\nclaines\nmautz\nyoshikiyo\njeanrenaud\nrouhollah\ntorbet\nproceded\nchuanqi\nrochin\narvinder\ncarco\nmoluag\nwickerman\nviggers\nhemric\nstanlee\nbachna\nacerola\narrivé\nugv\nrecomendations\nthrusted\nbhimji\ndubroff\nsdrive\naronberg\nfctc\nulka\ncullyhanna\nminja\nkirkconnel\nmtj\nunderachieved\nhangnail\nnolfi\nabbondanza\npyeonghwa\ntriacanthos\nbalikbayan\nenlistee\nthous\nskjelbred\nmberengwa\ndalto\nminia\nmisconfiguration\nneuvirth\nteahupoo\ncande\nkiker\nspritzer\nziying\nbraathen\ngiuli\nguell\ntbas\nsibilia\ntranberg\nlenti\nastigmatic\ndialysate\nbirdhouses\nschoolyards\ndolinsky\nnorelli\nhaemorrhaging\ntinier\npostl\nbupleurum\ncrommelynck\nvlti\nbiki\npeterlin\nezard\nkexby\nknackered\nolaine\nherer\nbardales\nprimis\nshorland\ncvitanich\nkamien\nvaricocele\ngerstenmaier\ntippetts\nactionism\nbrusati\nimbedding\nbattiscombe\nferals\nbettine\nczarniak\nziglar\ntrian\nhiptop\nbotos\nfleabag\nakey\ncuisinart\ncenterview\ntryscorer\ntilma\ngrebennikov\nfreeskier\ndannelly\nprognosticator\nbacken\nrefaat\nblackcurrants\njoughin\nruffolo\nwildomar
\nwtvq\nwaghmare\nrcvs\nplatel\nklochkova\nrepitition\nstatesmanlike\nlaro\nnokesville\ndarwich\npattishall\nheiney\nhospitalists\ndazzlers\nkullberg\nkealia\nverjee\ngaieties\nhibernated\nlansac\ntrendiest\ndejen\nandroni\nwhitstone\nmontechiaro\nfinkielkraut\nbesmirching\nbluecross\nzampogna\npoutchkova\nleoville\nfavorito\nnorine\nwheely\ncoari\ndegray\nsissonville\nhailie\neveyone\nafpc\nkhasab\nsieb\nsumin\ndeconstructions\nrassi\nbirna\ndeprogrammers\ncarhuaz\nbulcke\nwfe\nnesson\nstanders\nforewarn\nimplosions\ngervis\njeffcott\ndenley\npowwows\nbiopsied\nbedes\nphylacteries\nwastepaper\nterrariums\nhedgesville\nknowlson\nborodina\norsillo\ntautly\nmultitalented\nsherlockian\ndankner\nlehoux\nbürgel\nlwazi\nshafiullah\nlaunced\nodero\nsaarloos\nlotteria\nredish\njammys\ngroys\npecc\nballynafeigh\nfavorited\nlactalis\nhenschen\ndiscrepency\nfaily\ndelman\ngisha\nnationalizes\nbrislin\ncarmax\nflavie\nacclimating\ncruceta\nreamed\nshustov\nberstein\ncharcter\nadkinson\nneuromarketing\nverot\nbitsie\nportmahomack\ncowcatcher\ngranfelt\nunmit\njansma\nannulments\nwellton\nabuelita\nkefaya\nleadley\nwhop\ndiference\nwvla\nbrosnahan\nrozonda\nalcea\npardner\naviacion\nsauget\njumbie\nphilipines\ncountersign\nfukuchi\notoacoustic\njopp\npilning\nfanjul\nripostes\nkisko\nsoofi\nopenable\nscaroni\nworldport\nkvitfjell\nuppermill\nruden\nnitroaniline\ngwyer\nbabayaro\nchanie\ndicillo\ndohmen\nexcommunicates\nrendy\nissele\ntâche\ndemocratique\ngaidheal\nzagel\nseghir\nstemm\nxtv\nsalopian\nfisma\nrosenstrasse\nkatzin\nbeque\nbruichladdich\nbreckon\nihave\nhardnett\nratchaprasong\nleavings\nlintu\ncurrywurst\nilmi\nbeefier\nfeminizing\ngamechanger\nrelaxers\nskripochka\nclaerbout\nventrella\noutdoorsmen\nyadkinville\nnebres\nvélib\narrgh\nastons\nslighty\ntakala\nstears\nluuq\npawnbroking\npignataro\nyazeed\nrankov\nlandcruiser\ngurnam\nharrovians\nbultman\nschussler\nlewers\nspinotti\ndealy\nturnersville\nkajita\nimaal\njobrani\ndodgems\njaggies\npleonastic\ngereshk\np
iratbyrån\nfortingall\nmelchiondo\ninvoled\nvanino\nrepsonse\nggyc\nquerry\nagera\nmeibion\nnacke\nquic\nwedgies\nfarooki\nfoor\nneffe\nfitzwilliams\nredistrict\ndejectedly\nghadeer\nmadari\ndolk\nwoodchopper\nmebbe\nfortymile\nbrants\ndriveability\nlacotte\nstargaze\nintolerances\ngambarini\nalaux\nstifelman\noctodad\npugilism\nderoo\nalmut\nbeglin\nsunfest\nlleyn\nkhorog\nchesaning\ntfb\nhalfwit\nbudrus\nzarvos\ngengis\nmantovano\ncrevecoeur\nterell\noechsle\nentryist\nwhybrow\nmitigations\nmootoo\nmarcovicci\nsighisoara\nfaenol\nfundatie\nloovens\nsiteadvisor\nzaky\nedenic\nnonjudgmental\nobici\nftos\nturrill\nnbh\nsilverstar\nboudia\ncurcas\nstuhldreher\nscaletta\nkranzler\nsankai\ntextura\nduric\nglenmary\navgerinos\nvirmani\nmutebi\nautodidactic\ntriggerman\nunessential\ndaiane\nmispronunciations\ncyangugu\nlarbalestier\ncoar\nbruntlett\nsybrand\nvedantam\nghettoize\nmilstar\nmasip\ncapitanio\nxingxing\npeppone\nbonhomie\nheadcase\nbrinkema\nriolo\ndefintions\nkiyan\nalkon\ngirding\ntão\npiperine\npurrs\ndruggy\nwkow\nboond\nsadakazu\nabsents\nwillott\ndongling\ndicking\noutcompeting\ntunzelmann\nlophelia\npagnozzi\nzagan\ngrandfield\nkishenji\nminong\nlutui\ntôt\nlandkey\ndumbell\nparatrechina\nantihypertensives\nunsurpassable\nkettling\npelino\ndurn\nsmoochy\nallover\nhishamuddin\nkamte\ngoliaths\ndeichtorhallen\nduthiers\nmetanarrative\ndiammonium\nligustica\nvalland\ncoie\nlathen\nwindtunnel\nsilverbridge\nbartles\nintertek\ngracanica\nakopyan\nvadivel\nprofessionalizing\nvanags\nflanagin\npjn\narmo\npreziosa\nnarum\npokerface\nbaross\nschotz\njialiang\nkbyu\nprasher\nemigré\nunskillful\ndiazinon\nhisanori\nxos\ngoldminer\nnumberous\ndarly\nicpd\nhiggenson\nadjoa\nbsx\nraphaelle\ngrinker\ncaressa\nteemed\ngassings\nnayeem\nsliva\nvendy\nwintersburg\nmathangi\nmalaba\nmontenapoleone\nfidell\ngiggly\namarasinghe\nbalal\ndepósito\nfixates\ncommercializes\nreeked\nbobridge\nmackell\nklain\npennacchio\ncausus\nnonresidential\nlitterateurs\nnoahs\npevensies\nstr
iscia\ntursun\nmimicks\nhimalayans\naimlessness\nnothern\ncroisset\nquasha\nthistleton\nrangnick\npomroy\nahari\npinkville\neddo\njaitly\nmeling\nonh\nmetaweb\nuuk\npalmateer\nmadworld\npalka\nnure\nwastell\nyocheved\nfoldes\ntelescreen\nberzon\nnesar\njiuling\ntunnard\ngopalkrishna\nhursti\npotboilers\nredefinitions\ncaze\ndinerstein\njetbrains\nkalamaki\nyufeng\ndelbono\ncartwheeled\nheartrending\nplancarte\nnarvesen\nekow\nchampcar\ngallegly\nickey\ndayley\nguyler\ngettis\nlodwar\nstefanowicz\nmavuba\nbroderie\ninsouciance\nrutu\nmilham\npenylan\nostrum\nmichuki\nyaqoubi\nhongjun\nmakaya\nkorup\nhollymount\nastd\npervy\nyanhong\nrecapitalize\nodegaard\nhunsley\nunduplicated\nhwyl\nnvda\nhelfferich\nxtp\njimmerson\nsucres\nmoonfleet\nnixey\nfayzulin\nschobel\nsajad\nbarzee\nbrutalised\ntapit\nsouthcoates\naquis\nrepassed\nwazi\nprésidence\nbeichuan\ndisconnections\nleadmill\nmezan\nflexpoint\nmayrand\nstaddon\ngritting\npedlow\nboericke\nhutongs\nparvesh\nracho\nlengthways\ntouchable\nenemigos\nlincicome\nintermarket\nbaffinland\nhenkle\ncommunicational\nuntrustworthiness\ngarduño\nkyger\narano\nyopal\nriter\nleitgeb\nohnesorg\nsusta\nchusovitina\nfortugno\nhrawi\nashling\nmortems\nsuddaby\neconomista\nburjassot\nbesom\nsoldierfish\nticheli\nperiosteal\nkindy\ncollectivists\nalcopops\nbriner\nfexofenadine\ncommercialising\nbysiewicz\ncocacola\nalize\nrwenzururu\nabdisalam\nsunmi\ncottoni\ngebru\nnotepads\nfabe\ncryoablation\ngrunty\nbaart\nachaemenian\nfloberg\nearthweb\nzgs\njaitapur\njamilla\ntaula\nrivermead\ncrouzon\nbottone\najijic\nsaltgrass\nrashaad\ngorner\nmonem\nelectrocardiographic\nchristianism\nyoshizaki\ngeotag\npickfords\nlaroy\nexperientially\nresizes\ncoosje\nhaselhurst\ntoadlets\nwesthay\nautonation\nnovogratz\nsecas\necosport\nbalneotherapy\ngilje\nsmokehouses\nsudhalter\nwhirlow\nazmin\nmenagh\nbladerunner\ncigarillos\nbuzzcut\nmartinico\nvivyan\ngroenewold\nskys\nbiryukova\nwannamaker\naberbargoed\nspringettsbury\ngoulue\npastoor\nringelmann\n
tubitak\nchure\nhueston\nphogat\nsecurite\nfouhy\ngambero\nisx\njibran\nlatitudinarian\nzampella\nprq\nregalbuto\nturizm\nseide\nbeltram\nteakettle\netas\neschmann\ngalled\nmarkale\npusing\nbisgaard\nhyphenates\nafte\neryl\ncrims\nmotocycle\ndarris\nkamaliya\npalmsource\nndaba\nakros\nmorasha\nbackrests\nlemen\nbobrinsky\nfradulent\nwraparounds\nriano\njivamukti\ninza\nbluestockings\nfoible\nkazuhide\ntnuva\ncohmad\nurx\ngurganus\ngunwalloe\naskatasuna\nyoungish\ncharlottes\nconstitucion\nwoolaston\nindentified\nimaginasian\napung\nramdan\ndrouhin\nqahir\nlascano\nsemashko\ncounsil\noiv\nfleig\npatchin\ngedächtniskirche\ncybercafe\nixo\ncooly\nlangmaid\nlykins\nkzo\ndoell\nharrisongs\npegrum\nmourier\nzix\nlischka\njondal\nsheknows\nforestlands\ndirndl\nwestferry\nkolyada\ngoodyer\nmenahga\nsébastian\nslydini\nrecalibrate\nbashall\nhonu\nshithole\nwardrope\nbeigel\npenacook\nfecht\ngraco\nirvinestown\ndubi\nbichard\ncounterplay\nsaveri\nlangmann\nercot\ntejal\nshaybah\ndipo\nprises\ntakur\nhebgen\nidealogical\nheijningen\nfilthiest\ncorruptive\nnettop\nccta\ndavus\nhoseth\nrandone\npumpherston\npary\ngubbay\nsusac\nsamart\ndessena\nkittner\nfahidi\nlamasery\nrobilliard\nactioner\narmleder\nravening\nnavios\nbresnik\noxygenator\nhufkens\ncreake\ntuberculosa\nfaluja\nmorphew\nshirakaba\nbradd\nojjeh\niskakov\nmatinecock\nbellavance\nlisberger\nstingo\nsatchels\nbaille\nnaicu\nhalff\nslayter\nradiotracer\nnewin\nhometime\nflashbang\nvarnedoe\nkawczynski\nllangoed\nazia\nunknow\nnordenberg\ndica\nkepple\nmaitlis\nomaezaki\npapillomaviruses\ntrashman\nlutfur\nosem\nchesire\nmacharia\nniida\npaharganj\nhelfman\nlennert\nmoggi\noglivie\nsiderúrgica\nbectu\nsawsan\nmutsuo\ncapriciousness\ngibeau\ntandan\ncompagna\nchepa\nremax\nllantysilio\nsilberg\neuron\neurocodes\nautocars\nuntempered\npseudomembranous\nkightlinger\nbrainstorms\nmondol\nrenken\nllanafan\nkauswagan\ncoqueiros\nplumpy\npahinui\nresealable\nairlifters\narola\ndhone\npilares\nmondaine\nimerese\nrokka\nburfoo
t\nsoens\ntuffaceous\nmetters\ncrada\nepically\nstevey\nmartynova\nveinlets\nguiliana\nkenefick\nangelas\nnurme\nbaño\nunderpayment\nfrankwell\nrobalo\nhypermutation\nhiyas\nkoules\nchellam\nshibui\nmirzaei\nedgings\ncounteroffer\nsocietally\nfriedenthal\nnokwe\nhousebuilder\ncorralejo\ncrudes\nbrunot\nabes\nshahri\nhjs\nvigeans\ndebriefings\nbancs\nguandique\nbentota\nvestar\nmindgame\nanalisis\nencom\nlarkey\nreconfigures\nunostentatious\nschnetzer\natasoy\ngliss\ngegenbauer\nmoistening\nbushmasters\nlaquan\nskijoring\nxiaojie\nnebulizers\ngco\npenicheiro\nworstead\nscotlandspeople\nnark\ntasseled\nteares\nwestow\nbeachball\nludek\nkenge\ntapiwa\ntalibans\nwanniski\nkykuit\npalitha\ngainsay\naamot\nacidify\njarallah\ncantv\ngoitein\nneuropsychologists\ndesaulnier\nperignon\nkorac\nabeylegesse\nglinn\naptenodytes\nprps\nmiyamori\naldebert\niccc\nnickol\nfanel\ntrefil\nmeechan\nliteraly\nhertzka\nmahsa\nuarts\nsheepy\nhougham\nmatlovich\nantier\ntricon\nadonia\nrrts\nobeida\nbatheaston\nunwra\nwrecclesham\npenalization\nabiraterone\nseter\nnicholle\nvermilye\nbenhur\nsabaa\nvidesh\nliterarily\necodesign\nmakiki\namona\ndoong\nmsns\njoti\nsoyeon\nneryungri\npolyketides\nbookworms\nhandpiece\nschoolfriends\nchinaware\nczekaj\ntakura\ngreenhall\nverrerie\nyoa\nseedier\nshigellosis\nwachman\nwardensville\natlanticist\nmurck\nkrh\nmicrotechnology\nwindhover\ndawnn\nperplexes\ngeeves\nvarejao\nossana\ngjorge\noutvote\nkarapetian\ncdfs\nwcos\nfahrenden\nfumihiro\nmontz\ndavinia\nmuraviev\nsadullah\nonozawa\nqahira\nfancying\nhelbert\ngazdar\nmbanga\nshaunna\nberzins\ntretter\ncommixta\ntranscatheter\nalphavirus\nvilà\njarosite\nbaiters\nsempé\nillston\nlodder\nsamoyeds\nasaduzzaman\nsecuestro\nvarey\nmarcali\nonselen\ndfz\nvaut\nvaze\nlabella\npasquarelli\nchetrit\nauston\nseascale\nglenbuck\ntrien\nemri\nminiland\nglod\nfetcher\nformlessness\nlepard\ncockley\ndashon\numble\nschmidts\nbüchler\ngoldwing\nqalyubia\nbirchett\nvaradi\nvernix\ntoughman\naletter\nsamim\nzehri\nr
umman\nbortoli\nschillebeeckx\nweinheimer\nweijie\ntantalizingly\nspinnerette\nborgnis\nbaldies\nnagayasu\nrejuvenates\ngorio\npigeonholes\ngustov\nbankfoot\nperfluorocarbons\nmirafiori\nwhot\nshcherbak\nruther\nbeermann\npalaios\nposess\nglamorizing\nblacklands\nhaefner\nsassano\nhoeryong\nlehtovaara\npapanikolis\ntalya\nclaesen\njaouad\nblilie\nthakrar\nwordfast\ncheckmarks\nflatterers\nbettauer\nshaari\ntepi\nsoboleva\ngoarshausen\nmusashigawa\nairtouch\nunhooked\ndoorframe\nligang\ndossari\nboldenone\naurilia\nskille\nrectifies\ncalcott\nobomanu\ndahler\ncabezon\nedgmond\ndeutschlandlied\nsilverblatt\njungers\nlindheimeri\nwhitsundays\nmarsans\nsionko\nseaburn\nmusse\nrodeph\nsyverson\ndidenko\nmicrobacterium\nstraumann\nmmix\nevaldas\nportero\ncanaille\ndmh\nturahan\nlemelle\nfreydis\nfeuvre\nrajarathnam\nmastenbroek\naddlery\ncholos\nschwarzsee\ntunie\nloyn\ngripens\ntiptonville\npigtown\ntarawneh\ninnerbelt\nkellers\nnavman\nhaldin\nwillm\nmollenkopf\ndoege\nstoneygate\nhadwin\nbelhaj\nparapolitics\nblackwoods\nrimming\nespelho\nkutia\nchaperons\ngattopardo\nzey\nlugos\nwycoller\nresidually\nadvil\nrieppel\nbelsey\nherzi\nsanely\ncarbuncles\nzhabei\nheartstone\nhpn\nblotching\nyuly\ntexturally\ntopalian\nquandry\nerbitux\nnesrin\nrbv\nbellringing\nlesnie\ntumults\nsylvanian\ntastelessness\nsomis\nkcen\nferpa\nostinatos\nemara\nwhitefin\ndjermakoye\nrecentness\npangma\nseegmiller\nbarlay\nolenka\nalphonsi\navlon\nsadaka\nglamazon\nnimet\ngmcs\nsiebels\ncheatwood\nrootworm\ndeltoids\nllansteffan\nstiebel\nfessenheim\nsalicin\nnbtc\nschlicher\nmediaflo\nlensmen\nzingaretti\ncartloads\nsteans\nwerks\nnattawut\nsambath\nanwari\ndelino\nyardena\npirton\ncaccamo\nhsx\nmukhlis\ngdh\nclaudi\nadjournments\nairflows\nrechsteiner\ngolis\nstyluses\ncarusi\nrevkin\nmisgiving\nmasumoto\ndenegri\nsheibani\nmaysoon\nuvaria\nrohi\nschoenke\nrheinenergie\nyangquan\nkuske\nparquetry\nsumantra\nbrisset\ngoogler\nmarzook\nmcconachie\nmalese\nsiphandon\nhikkaduwa\nklitgaard\nmobili
ses\nvindu\nstrause\ngreenhut\ntcx\nadiyiah\ngrischa\njummah\nkoranteng\nrankle\nnoirot\néclairs\nsmallbridge\nmecke\ntetsuhiro\ndoxylamine\nchoplin\nphiladelphi\nrossborough\nnitel\noxybate\nmeulens\ntrolly\ncontortionists\najas\npearisburg\nvsel\nkekana\nwharry\npanagiotopoulos\ninsureds\noffie\ncurtainraiser\nexpectantly\nsoton\ncorrelatives\nrohitha\nalbala\nduijn\nmoeaki\ngabardine\nrobach\nkaiman\ndunnings\nrhn\nhomogenised\ninews\nothar\noutbox\nresolvins\nretox\ndeciles\nsalceda\nboated\nbutzer\nallicin\ncreepily\nintramedullary\nakumu\nmahiga\nroulade\nfimbres\nnangpa\nlki\nkplc\nhuarong\nallix\ndjamaluddin\npetrify\npietragalla\nprecised\nfazzini\nantoin\ncoinsurance\nambrosino\nsyna\nfoeniculum\ncadgwith\nprimicias\nyilma\nsherzer\nhuilin\nkilgannon\npreternaturally\neeshwar\nladette\ndegaulle\nzetra\nmainlands\nneediness\nbetten\ntsujimura\nchafik\nargungu\ngossan\nmazzarri\nlehmberg\nwhimple\ntrackback\ncarrasso\nmachugh\ngrumet\nnatuzzi\nturducken\nletchford\nandthe\nizen\nhettich\ninglesby\nblenda\npowderkeg\nkilfoyle\nspooning\nkitov\ntyneham\nvatsyayana\nhartburn\nspanghero\nhaldenstein\nnassos\nrenacci\nbarkann\nproforma\nsizun\nwras\nshumba\nfeatherweights\nhouvenaghel\npéry\nilliad\ndisposables\nmoghal\nodey\nrozo\nshowstudio\nquestors\ndiehr\njesusa\nloehmann\nbúzios\ngiblet\nrajoub\nnqr\nshevtsova\nlilach\nlocanda\ndevra\nsomekh\nilpo\nyoshinoya\nbonni\ncolloquies\nvinit\nbiddick\nschaffel\nstraightjacket\nneighborliness\nrolnik\nmozaic\nboomboxes\ntrapezes\nhaimar\npupusas\nspack\nhongyu\nmarketting\nkonecky\ncareens\nbvf\nsadah\nmalpartida\nmascoutah\ntronco\ngrafstein\ncpes\nwarthen\nterian\nracoons\npashtoon\nheidinger\nfaha\nhilscher\nolmeda\nravid\nletch\nwinrock\ncailliau\nbehavour\nbergères\ngoffert\ntcga\nboulis\nforna\nheartline\ncolorama\nhongyun\nlindsays\nedmands\nfujitaka\nalnabru\ngorenberg\nsabudana\noklo\nyuldashev\nfugelsang\nzimny\nkransky\nsutterton\nrazim\nhypoperfusion\ndeysel\nmalekzadeh\nwallerawang\nshifta\nmyrtos\nyeye
\ngarnishee\nniçoise\nceric\nkwamashu\nibbi\nsöll\npaglen\nfrase\ndozes\nucavs\ntottle\nlcmv\nlehninger\nyodels\nserebriakova\nnotenboom\nfabini\nmalacanang\npollie\nzigzagged\nwvf\nmahbubani\nwhins\nfuselier\namchit\nsextillion\nfirby\nmailbags\ncentralizers\ncityview\nmulticulturalist\nfarmboy\nscandlines\nkostadinova\nperegrym\nsuardi\ntonina\ntornero\nredlined\nebg\nkotkin\nbernick\nreligon\nkifah\nbartholomews\nhollyfield\nrelabelled\nhamin\nfiscella\nbednall\ngenuflect\ngussets\nmackney\nqualis\nfena\nleicesters\ntarnower\nbeatties\nmanella\nseroquel\ntatang\nlassos\nshiron\nclosable\nkilmallie\nsighvatsson\nwoodiwiss\nboscoreale\nghur\npoeticus\narepas\nloury\npoha\ndreamgirl\npboc\nbrunne\nchlebowski\nllangurig\nmayom\nvokey\nskillsets\nkalatozov\nkoom\nsalamin\nbottai\nenfold\nscientistic\nlinglestown\nbhengu\nfinagle\ncsts\ncrossbenchers\nincaviglia\ndutertre\nbonura\nperjurer\nnarrowsburg\npimstein\nwellner\nparticpants\njrh\nsoula\nprotuberant\ntianchi\ngreiss\nteper\ndaybreakers\nboonmee\nnymt\nallroad\nkornhauser\nmouradian\ncowlick\njaffri\nlnx\ntonel\ntrimper\ncassazione\nschoodic\ncapadocia\nhouliston\nkalikow\nelfa\nbracklesham\nwaul\ngardom\naroldis\nmccandlish\nblurrier\naspropyrgos\nbluestones\nsurnow\nblitch\ndislocates\nstss\naldham\nchildre\ngrindheim\ncertaines\nelsmore\nfinans\ntjp\nmarreese\nropartz\nziploc\neyewash\nwassel\nkarlton\ntakeley\ngelabale\nwazowski\nandrij\nesajas\nlerangis\navvo\nfinma\nhongji\nporrata\nsecdef\ndivestments\nmouriño\nsapkota\nwashpost\nuitp\nshapoor\ntieton\nkhac\nrecross\nprogressiva\ngermond\ncashtown\nelos\nchemi\nknuble\nkusuda\nmiia\nromanowsky\ngenth\ncorrao\nplancher\nhommages\neastway\ndessy\nklaar\nverklärte\nmccallany\nmbour\nmamadi\nottakar\noosterschelde\nbceao\nkooijman\nkaca\neeles\npandin\nsaabs\npltw\nzywicki\nppmd\nkilspindie\nrackoff\nmedigap\nfaiman\nidolization\npepetela\ndastgerdi\nantipathes\nedden\ngamebird\ndarsi\nfédrigo\npromed\nraupp\ntabc\neinbinder\nchucker\ncoverlet\nmagnetoresist
ive\npreeminently\npiceance\nhegerty\nikg\ncharnia\nadaro\ncrawcrook\npej\nknockholt\nmultidecadal\ndvorák\nlittleham\nrevu\namoa\nboalsburg\nwitchu\ncosker\nfortnights\nmoneychangers\netps\nselinda\ndjw\nsynology\ngiannotti\nadiel\nsantanna\nlindback\nsmoothe\ndiyya\nkhazei\ncalloused\ndownspouts\nhuajun\nraybestos\ntanged\ncharmain\nrepowering\ngröger\nlukins\njouster\nrakeem\nupmost\nfacco\nkepp\nfassler\nbartelt\ncaiola\namnestic\naquilini\ngaos\nkothe\nsujal\nhornbeams\nfaid\nacually\nbergl\neoa\ndozo\nmozartian\navow\nbiaza\nscerbo\nguerino\nyoik\nnaturalistically\nsherrif\npaneriai\nunlivable\nrelentlessness\nbalmond\nlouey\nmikhailo\nnetl\nballymacoll\nzuccarello\nstonehurst\nreserch\nwjm\nyogev\nhoggins\nsmhi\nchuff\nsatwa\njeppsson\nkhalip\neversholt\ncroteam\ncuckoldry\nroqueforti\ntintinhull\nmze\nyelvington\nnightfly\nheadcovering\nsiyanda\nabdirizak\nmonvoisin\nmeisenheimer\nwomer\nmocvd\nthamer\nreqs\nroddie\neduction\nunresectable\nloverly\npanino\nhousesteads\nschoolmarm\nteruya\nilyukhin\nrayborn\npricier\nvanderhoef\nunreasoned\nleriche\nconsecutives\ntitletown\nmidwich\nzmeskal\nlivetv\nbrei\nibtisam\nhdls\nsadhvi\nagalarov\nascd\nglaude\ndagpo\nkedumim\nkoomen\npapenfuss\nkhn\ndfcs\nvarekai\ncrpd\ngiegerich\nshepler\nyandong\nelderberries\nthreequarter\nbesancenot\nhawas\nbrawdy\nmallary\nmemorialise\nkrizia\nozturk\nakune\nelectrabel\ndorri\nnanoengineering\nyachtswoman\nnightwear\ngulval\nemagine\nbeause\nalmalki\nanyaoku\ninsulative\nitic\ndorosh\nincenses\neconometricians\nsemmens\nteppanyaki\nbiery\ncloch\njuggs\nblonska\njohanneson\ntabuse\nurschel\ndrabinsky\nteixidor\niwade\noverfed\nperfunctorily\nscallywag\nextemely\nyizhuang\ndamascena\ndjivan\noppostion\ntaprobane\nbaali\nkomack\nzezel\ngire\nsepteto\nskulk\ntelman\nmunnery\nsplitsville\nispi\nbarbarously\ndavinder\nbartholomeusz\ninus\nzenovich\nrangeen\ntsiolkas\nranalli\nsnufkin\nrobilant\nhajrudin\nshivered\nbakwa\nholeman\nbugaku\nmitarai\nritzenhein\nbanny\ncuttino\nfortun\nsot
hebys\ngoodnestone\nhenize\ngrai\ngarthdee\ngiengen\ndoily\naltiero\nelxsi\nkalmiopsis\nemachines\njohni\nsathyanarayana\nmenconi\nfoghorns\nperminova\nberniker\nbienvenida\nafikoman\nonesided\nkgsr\nstegodyphus\nyatomi\nwajah\nhypoactive\ncrms\ndecleir\nlindbeck\npentacene\njahir\nstaniland\nchampetier\narlett\nrodenkirchen\nborletti\ninflator\nkensey\npinwheels\nthurcroft\ncondotti\nsybarites\nnauticus\nbartik\nsepehri\nformentor\numbers\nsettimio\nalmandine\nhannukah\ncaprine\npoinsettias\nmossbank\nkrisha\nibou\ndudin\nkarlyn\ntetlock\naghaei\nwatsu\nprerevolutionary\ncarletonville\nscrunched\nimmobilising\nzabou\nlsms\nkagwa\nstensson\nmargreta\ncountercyclical\ntume\nkunuk\ndouanier\ndualling\nllanuwchllyn\ngedan\ndomme\ncnhc\npythonesque\ncameraphone\nwhisnant\nherson\nbrancion\nteevan\nmiletic\noverboost\nduramax\ngrünenthal\nintentionalist\nmegaplier\nalamy\nrluipa\nwingerworth\nmachars\nscaturro\ngetgo\nbuckalew\ngambol\nmaniam\nkunas\nuiv\nmockford\nallina\nkair\ndepositi\npantos\nanstead\nbufton\nröttgen\nsutherin\nvacio\nretegui\nseghill\ngobaith\nanba\ndefaulter\nnahimana\npontificates\nkottaras\naló\nkrogius\nlaureys\nscoonie\npereire\nbastiansen\nscut\ntreiman\ndhaid\nrangaswami\nlimmy\namihai\novercompensate\nitslef\nwtvm\nthorbjorn\nkarone\nmeecham\nbaylin\nblodwyn\nultrapure\navil\nskrew\nkouzmanoff\nloiza\nfactortame\nflatford\ngaelle\nbelenko\nkrtv\nmcallester\nfirewalking\nashli\nproppant\nsocarides\njuvincourt\nanatinus\nponeman\nshaibah\npropsal\nmzungu\nscattergun\ntorkington\nsembler\nmahayuddin\nbickershaw\ntuwhare\nmepc\nlockups\ntosar\nfuz\ngaitskill\nzozan\nsmartlink\ntellme\nkerlan\nspaceward\nnaypyitaw\ncheban\ntrolltech\nkristiana\nszeklerland\nelburn\nknapsacks\ncoronelli\ncornball\ncarryl\nvernus\nbentt\nransley\nervan\ndobrawa\ngallan\nsunbather\neskander\npineta\nrirkrit\nmonea\nbedu\npsystar\nnorthaw\nkiala\nfuensanta\ntigrett\nrubby\nusinpac\nbuybacks\nkeilani\nisuru\nodil\nperkowski\ncitect\nmillbourne\ntruvada\ndolev\noishii\
nnaby\nhollygrove\nkronenburg\nvillarica\nbobkov\nriffage\nbaldor\ntpwd\nbabbles\nirremovable\nazimov\nleapster\ncherundolo\ncriquette\npleasuring\ninhumanely\nredgrove\nsagana\ncabas\nvujovic\nlillingstone\nshantry\nhooser\nbendixsen\nrzepczynski\nmandawa\nllangyfelach\nsniffin\nkatcha\nsirmione\nboyardee\nadnams\nradioing\nspirometer\npromiscuously\nalieu\nmillitary\narseholes\nguasimas\nleauge\ntitford\nlenkov\ndiked\nbazaruto\nusst\naddys\ndickmann\nestermann\nsofres\ndishonourably\naryo\ntaters\nsnoo\ncaraeff\nhandier\nocurred\njiujitsu\nkyllo\ntzahal\nqambar\njackeline\nbrkich\nkonnichiwa\nwanta\nanesthetize\nmardian\ndylanesque\ndsquared\nclowe\ngoddards\ndrog\ntcad\nbaroody\nnmsi\npozarevac\ndaimlers\nhootsuite\ntrully\nphythian\nsteatohepatitis\ngermander\nbrougher\nmarcelline\ndoas\ncoroico\nperoration\nunmake\nimmunomodulator\ncelada\ninlcuding\nbermudiana\nmuseological\ntarriers\nfreis\ngintis\nborsen\nhyperglycemic\nsaucepans\nstathopoulos\nwerthmann\nemprise\nmorefield\nyulianti\ndreama\nmakoy\nkhasanov\npdry\nkalian\nmongstad\nleons\nhairballs\nmazzaro\nquedgeley\nhassim\nshlesinger\nupwellings\ntaigh\nsickeningly\nmelaku\nplethysmograph\nseidner\ncyngor\nmaumere\nkangyo\nsavaging\ndycus\nlaryngectomy\nlaminectomy\nantonette\nmangone\nflexray\npenone\ngeraerts\nendtroducing\nnegahban\nsmertin\nsarthak\nftes\nsawford\nbalgowan\nhestercombe\nathiests\ncbfa\nroisman\ndafis\nseens\nstonewalls\ninjuction\npressurising\ncinnabon\ninaam\nckmp\nammoniac\nfelshtinsky\nkevjumba\ngiustra\nhhn\ndesanctis\ncieslak\nmiracosta\ncarlsten\nbloodflow\ntalau\ncalamia\nshaohua\nalala\nfluck\nantrel\nvicaire\nbudejovice\ngumballs\nzicklin\nmoea\nrupen\nhaimovitz\nfranchisors\nekwueme\nfleischmanns\npomersbach\ngilsenan\narbutifolia\nfrates\nrahmonov\naccts\nwakening\ntrundling\nlexico\nmanzanas\nkeny\ngamonal\nfinchingfield\ncullybackey\ndisbeliever\nmittermaier\nunseasonal\nbuco\nschoeffler\nyarl\nsibbles\nthomopoulos\npanke\ntropicals\ncleta\nmarchmain\nehhh\ncorletto\n
abdulqader\ndubarbier\npolygraphs\nnewspace\nelbaum\nnoller\ngolshifteh\npavlovski\nvelloso\nplantersville\nghawar\nprespective\nengholm\ndjokic\npaintballing\nmileti\npittsville\nbaitha\nbegetter\nobdulio\nzappers\njasons\nwyfold\ninpex\ncalderoli\nfanton\ncontaldo\nunevidenced\nyihe\nalexeeva\nuncarved\nnurgaliyev\ngoldfeder\ncaciocavallo\nobligatoire\ntimbira\nshowunmi\nimri\ncorani\nhonglei\nmangers\nkellison\nobtainment\ncondescended\nkerchiefs\nfalkvinge\ncanzano\nboogies\npierret\ncotti\nghedini\nkanstantsin\ncroation\nogbonnaya\nperrilloux\nbovver\nfcuk\nseahouses\nbarabash\nrevitalizes\nlicko\nmultitracked\nkyber\npalelei\nericht\npeyresourde\nbialystock\nholovaty\nzarnecki\nvecernji\nyoffie\npredappio\nlapinski\nmechler\noncoprotein\ngreatwood\nsteffans\nstubing\nchampalimaud\nforgacs\nzmp\nlanting\nkinmount\npayor\nfärm\nlycabettus\ndriza\nsalong\nschissler\ncopperweld\nkeiper\nnandina\nfronius\nhoffen\nscroogle\nkeam\nmajken\npakis\nbergenheim\nriester\nzaran\ntichon\nmediavilla\nulead\ngayl\nbalikh\nbellvue\npullbacks\neurospeedway\nfaletau\nsilvaplana\nlibba\nbendtsen\nsnegurochka\nmikhnevich\ncutover\ndayville\naymaran\nszczawnica\noxcarbazepine\nhisto\nconsoler\nmorre\nchloramines\nleontina\njaunting\nhandwara\nkeerti\nasml\ncolefax\nschaake\npolynya\nhouran\nmaig\nnizhegorodov\npachira\nticats\nmusks\nmabu\nreiterations\nunprofitability\nconciliating\nrasouli\ntudhoe\ndexa\ntannat\nbergenia\nfrann\nvisvanathan\ntayrona\normat\nsachtleben\ncrusell\ndesmonds\npactio\nnefas\nhomerooms\nstogumber\nplakat\naranka\nflamers\nforswear\nesomeprazole\nloj\nmasterkey\nproblematics\ndecherd\nbulverde\nauldearn\nverifiers\nfbis\nmansu\negoli\nkanaks\nbtz\nivre\narist\nfilleting\nundercounted\ntransportability\nchangsheng\nalgún\nedilson\ncinesite\nmorson\ncacharel\nswiftboating\nwehby\nottendorfer\nrhyn\ninterpetation\nthohir\ntomma\njumah\nluhnow\ncaylor\nmathen\nwanat\nsedates\nfredon\ndeprave\nflocke\nsivs\neslinger\nllangorse\nmwv\nmeletios\ntanksley\nkohno\
nmiddleway\nmeisler\nhabegger\nnevzlin\nyakubovich\nqorban\nwoodentops\ndriveable\necat\nrefiling\nzaiyi\nhardener\nferihegy\ntreon\nprizefight\nagrium\njlpga\nomarov\ntatsfield\nsuperlens\npmqs\nspeedtrap\ngongshan\nkesting\ngraviano\nehrc\ncoumadin\nuspstf\nmanjiro\ntwizell\nmedicos\njeeter\nculson\nutcubamba\nhubristic\ndinette\nrecipies\ngpss\ntrittin\nadjacencies\nmythologists\norduna\nlitzau\njamilah\ndimaria\npallam\ngrandchamp\nnutech\nisoft\njoppy\nlucedale\nrathmullan\ndarque\nabrading\ndeekay\nvlavianos\nbichara\nhiriart\nmaerlant\nsemrau\nleco\nkallawaya\nmigdalia\nfergalicious\ncaravelli\ndiavik\nsorgente\nmuscovado\nedwar\nallanah\nsanitas\ndaumesnil\nnethergate\ninnovatory\nkailyard\ntacvba\npreska\nzuill\ninverbervie\nexpatica\ncubie\nrosti\nbourgain\ncarisch\ncranage\nstreatfield\ngianola\nimmobilier\nmidlanders\ndaskalopoulos\nkorneyev\nunbid\nbhit\nnischal\nnyugati\nvxl\nemrit\ndientes\nsusato\nliebesman\npredesignated\ntaynton\nksfy\nkrahmer\nmisdiagnoses\nschellnhuber\nnebahat\ninefficacy\nbartica\ngasolines\nstuffer\nsickman\nlyk\nmimoza\nzopf\nintas\nhejira\nparati\nohlman\njfe\ntlingits\ngards\nkitco\nvampyrum\nroşia\ntekulve\naded\ncaudell\npearmain\nmulana\nenm\nmutilator\nberegi\nlandgrebe\nmitchem\nnyclu\noakhanger\ndiscectomy\nhorsing\ndelinquencies\ndykehead\nmesothelium\nsaltimbanques\ngreyback\nopentable\nezpeleta\nsundal\nmoustachioed\ntendance\nncrp\ncoid\npavanelli\nleafield\nrahnama\nzins\nkilembe\nbiodome\nzide\nbirri\nsecretively\ntullberg\ngreenpark\nnewgen\nthrombopoietin\ndeflowered\nalit\nrasuk\nsposato\nscaphandre\nbobbito\nanophthalmia\npaasio\nsgw\nrefuels\nturkistani\nblackly\nrine\nspondylolisthesis\nsiddiky\nguarnaschelli\nloompa\nbrighty\nwinebrenner\nfontaneda\nnirta\ngeraty\nkindig\nhinglish\ntuscon\nsrimuang\njudder\nkisber\ndorade\nircon\nelease\nunbias\nindianised\nglassie\nglaisdale\nangoor\nwdam\npatagium\nkhek\nboldrini\ngoldfeld\nfardan\nprophylactically\neliasch\nsenselessness\nburim\nlonna\nwaaa\nequivelent
\nrhia\nmbytes\ntincidunt\nseamark\nunbacked\ncoursen\npsychobiography\nchitterlings\naccessorize\ndisillusioning\nnrlc\nchiarella\nhirschbiegel\nvehicule\ncullingworth\nemps\nworser\ntuftonboro\nmevissen\ntamaryn\nlekeitio\nchichicastenango\ngoding\nnivens\nsaloth\nruairidh\nradiall\nmoisi\nkolesnikova\nfrisa\nbrents\ntetreault\nlievre\nwoonasquatucket\nsharri\nzhaowen\neraring\nyogen\nmantee\nnorthsound\ndisneys\ndhafer\nkerhonkson\nterryland\nrakt\nkruif\noheb\nwratting\nusutu\ncrossford\nhonnold\nheriots\ngathright\nfriedrichstrasse\nstandlake\npodebrady\nroover\nicrt\nclawback\nafgooye\ncmms\nrosano\nhutches\nimmolating\nebling\nbasnett\nsturz\nkosch\ncripes\nmcgilchrist\nguelzo\nhagemeyer\nfiermonte\narticulable\nstefanou\nmjpeg\nhederifolium\nhadwen\njawaan\neyenga\nwissant\nsadikin\nkhayam\nrusche\noverinflated\ndajuan\nmessily\nolesa\nukil\nlisinopril\nmccleod\nsasabe\ndresnok\nasciano\ningimarsson\nbawl\njaynie\ndockings\nrykwert\nnuha\nmosab\nbiopiracy\nscenically\ncarrolton\nplaytech\nchiropodist\nescombe\ngodwyn\nalloways\nbryndza\nperambulator\nburtsev\npentraeth\ndillan\ninsulations\ngritsenko\nquantocks\nsharaa\njunck\ncoolman\nbaltiysky\nibookstore\nartangel\ncasevac\nblankenberg\naymes\nreformulations\nwayfinder\nuncatchable\nelliots\nedholm\nkazimi\nswissport\nchci\ndanskin\nsagaro\nspanbauer\nmagnoli\nseneb\nutti\nmctear\nskii\nklemp\nmorillas\nskah\nasajj\ntencor\nsqd\nhilke\nabates\nbitesize\noutpaces\nnorthglenn\nhirji\nwolgan\nfreeriding\nranadive\ncrooms\ncifor\njennifers\nmixson\nomfif\nbrasilian\nepcglobal\nshigehiro\nseatrout\njeté\nhelmsmen\ngccs\nfordney\npolarstern\nwbgh\nwaimanalo\ngaca\nmanai\nloopers\nmeanly\nyevgen\nchaucerian\ncavey\nmiserliness\ncarnitas\nmixtur\npositionally\nkietrz\nverea\nmikolajczyk\nastras\ncocentaina\natton\nschoenenbourg\nnajin\nxrx\nbawls\nquenelle\nbakara\nvainonen\nplumwood\nbalikesir\nsonero\nxiuying\nstagers\nkamco\nantimalarials\nburder\ncounterprogramming\nhoas\nanthes\najaria\nzapater\nchulack\nwie
lgus\nafricas\ntroldhaugen\npasteurisation\nbuccellato\nanying\nammos\ndanjaq\nbutrus\ncotillo\nwarmack\nisrair\neizenstat\nexford\nbreunig\nvanhecke\nafge\nriffat\nnewtownstewart\nsuburbanized\ngranai\nscrounger\nkfsm\nnovozymes\nsendup\nsnookered\ndeposal\nbrachetti\nderamore\nmotoric\nbraw\nmalew\ndearle\ndrinkhall\nobgyn\ncoalwood\nstarehe\npolyphonies\nwolak\ntrenk\ndelysia\nibat\narhus\npekhart\npanchita\nkobel\nswiffer\nléoz\ndolge\nyanowitz\nkarley\nwidner\nrsno\nbertheau\nnnrti\nfaad\ndisjoined\ndreg\nhammams\nedwalton\nbudrio\nnambaryn\nqul\ntanihara\nkettani\nstanco\ncsav\nirinej\nfrancon\nzoopla\nmarinkovic\norginization\nnewbuildings\nreinstitute\nmesmerist\nrothken\ndragnea\npiccone\nshirburn\nredesignate\nodlanier\nflameless\nbarvikha\njasminoides\nlianyuan\nmicrovesicles\nwannian\nnoyola\nsattari\nmarjayoun\nikettes\nprouse\nlorson\nbouldery\nyema\nrecognisability\nsternness\nbeeswing\naraj\ndallen\nmkek\ntransvestic\nsherrick\nqte\ncheif\nsigitas\nulee\nunendurable\namandeep\nmagdelena\nridiculus\nmichelago\ncadc\neoy\nharmfully\ngalama\nalayna\nrafaeli\nmellgren\npenyrheol\nmppc\nmaidenform\npriceville\nprweek\nlbh\nstandfast\ngallaccio\nbroschi\nrainsville\nelefun\ndreessen\nandenken\nmaunga\nltrs\nphoo\nexpositional\nsalesi\ncobridge\nstoical\namondson\nbessant\nhillandale\nslunk\nsoans\nmcammond\nnalidixic\nappliquéd\neraill\ntodesco\nrisom\nlatife\nalier\ngdg\ncavalierly\njegi\nrenningen\ncapdevielle\nremorselessly\nswanmore\nmindblowing\nbixente\nsimpering\ncelski\nramsons\nmaysara\nsuwalki\ndominatrices\npulsates\nrollerskates\nantidiscrimination\nvuoto\nkumquats\nceibs\ngadret\ndmochowski\nassistantships\nvillatte\nduhart\nnephin\nvlj\nnewbuild\ncpds\nkokka\nlefkow\nstrossen\nimpelling\nfroyle\ncorrick\ndisincorporation\ncultism\ntapash\nchainlink\nbédoin\nwhca\nsloven\nkadian\nfortifies\ntonderai\nsedd\ndangjin\nshufu\nachilleos\nnugaal\njerusalems\neggbeater\nhomecomings\nmikos\nhryvnias\nosedax\nrollinsford\nhornafrik\nuresti\nbarrhill\nt
rustor\ntwala\nhohenwald\nepicondylitis\nbensouda\nuwsa\nnaohiko\nnoji\nennen\nkasza\nsturua\nlakay\nmegahits\ntoothaches\nsahba\nscorrier\npapian\nuigea\njff\nbrenston\naymaras\nhariya\nqaisi\ncucuta\noedipe\nsurkis\ndmgt\nlacivita\nbeziers\nmeilyr\ndaylife\nattles\ngodmen\nunactivated\nreham\nmozah\njuliaca\nkelek\nnoordhoek\nscouser\nljuboja\nhrms\nclou\nfarsightedness\nchelonian\nplasterk\nschlangen\nmeiselas\ncafasso\nevenwood\nosterwalder\nveenker\nexaminership\nanesi\ntoguchi\nicid\nblairite\nexpn\nfarinas\nnanodiamonds\nroschdy\nescb\nrigotti\nshoplifted\nkatinas\nvitreoretinal\ncartonera\ndelocalised\npaining\nyct\nbirching\niparty\nshahrir\nurmanov\nfirwood\ncocorico\necms\nturracher\npojar\ngudu\ncaneira\nhardings\nunplugs\nbetemit\ndigressed\nmohaqiq\nkomondor\nwollenberg\nhiltons\nlafita\ncoltsfoot\nlecia\nleitzinger\ncayre\ncitibus\nndna\nnaouri\npeole\nbowlmor\nedde\nmodupe\nirresistable\nberin\nlimescale\nimediately\nwiggled\nbrandyn\nfarzan\ndolent\nscheving\nbraca\noverextension\nangelotti\nsephton\nzyazikov\nlangseth\ncarie\nscappaticci\nridgebacks\nbeled\nartemov\nscampia\ncheekpieces\nkeping\ngatepost\ncoverack\nmoneyless\nbramfield\nmassaponax\nchaiwat\nscrewups\ndyana\nbrooman\nlightheartedness\ngxg\nmelroy\ngeocache\nvaccuum\nfoxing\notologist\nmasalit\nopinel\nbranfoot\ndrakelow\nmingyur\ntennents\ncatcliffe\nbandaid\nbedpan\nuksf\nlisticles\nroseana\ntoolsets\nhoblit\nhomelife\niads\nntw\nbeumer\nghorbanifar\ncrapping\nendrick\nhirsig\nmineshafts\nmaktub\nantiangiogenic\nkaftans\nsulabh\nurumchi\ncopyboy\ndeogracias\nyesenia\nsxe\nanzures\nfrv\nelectrocautery\nhupehensis\nmiltary\nleappad\nalbor\nclypse\nremanufacture\nergogenic\ntoubkal\nundesireable\nseroconversion\nlochridge\nmukhortova\nviorst\nfouque\ngelis\nflander\nciner\npolga\nmorikami\nmayest\nmohibullah\nchapstick\nmonkish\nfanciest\nitogon\nscreenful\nkondracki\nbagillt\nlyor\ncrucorney\nintercoastal\ntelevicentro\npoitrenaud\nlynche\nadney\nberdos\nfroma\nqit\nundateable\nalive
ness\nfarecard\nphotocatalyst\nliason\nrayn\nrumohr\ndemotix\nbristolians\nharleys\nkalhu\nknifes\ntilders\nsermanni\nlaurant\nopenvg\nhileman\nodongo\nconstructiveness\njurich\nguincho\nmeerbeke\nlaibin\nkaena\nhendri\ncobbers\nmaese\nleftenant\nfargodome\nreappraisals\nfirooz\nbukha\nschroedter\noxenhope\nfausset\nwilkening\ncalarasi\nlefebure\nautomattic\nringz\ncheonggyecheon\ngeschonneck\ngutsche\nmigron\npollutions\nhdo\nquadir\nyinghua\nmahowald\nbajour\nberntson\nrtas\nabessole\nadex\nternhill\ntelesforo\ncrucifixus\nspittoons\nsahibs\namponsah\nfaucette\nnisp\ndunlea\nunsmiling\npostured\ndragic\nabergil\nbizarreness\nhauserman\nconection\nexperianced\nfurl\nliverman\neule\nayeni\nmumuni\ndervishi\nbavington\noladele\ncopulated\nlambir\nsindall\nruskington\nmccains\nparalelo\nmillisieverts\nsnowbowl\nlandells\ndawyck\nmutinying\nslic\nkromm\nbaroudi\ntaake\nsuryani\nbilsdale\nservals\nquartararo\nantirheumatic\nhoogerland\nmodiin\nbrog\naigas\noverdale\nabdelatif\njournalistically\ngaag\nbumbo\nfttc\nbustros\nheartwell\nraduyev\nedutopia\nprodigals\nvarinder\nexcerpting\ngabulov\ndobermans\nstussy\nzud\nlykov\npastafarianism\nlinssen\nlouro\nusasoc\nmotsoaledi\nbalis\ntzachi\nkostek\nlinfoot\nazmeh\nmonzo\nrössing\nepaulet\nsweetbreads\nabdennour\nnugatory\nschuback\ndubiousness\nhelghan\nblisland\nopenreach\nyakobson\nredbus\nalbarran\ncockrel\nxiaolian\nujiie\nfazi\nprachya\nhydroxyurea\nlaisterdyke\nemollients\nmelvern\nholditch\nseea\ntrimsaran\norietta\nuson\nbuchholtz\nengebretson\nwalleyes\nbryncoch\nlifecare\npersuadable\nminurcat\nathor\nmcquilken\nunregenerate\ngesticulating\nmaurilio\ngozlan\nhulland\nhasheesh\nrabinovitz\nbmmi\npawleys\nbegleiter\ntutak\norex\nhorschel\ntsukuda\naregawi\nmonges\nabrashi\nsopo\naxels\nkünast\nharrowgate\ndurdham\nblanker\npratfall\ncerato\nfults\nsherill\nstenotic\npriss\nspinball\nbremridge\nterenzio\nthamel\nangely\ntufty\nworkboat\njaccoud\nmarymont\nundernourishment\nbaishui\ncastorama\nnewid\ngrimmy\ntorrin\
nraincloud\ncoalminers\ncarmelitas\nmancusi\nmooching\nslacktivism\natlason\nsijie\ngoytre\nsmdc\nantiracist\nheaston\ngeovany\nvidovic\nsfinx\nmesnier\nliliyana\nvelis\npullovers\nsoluk\ncji\nbenstead\ncomentary\nthamm\nfluegel\nagner\ndisburses\nncqa\nphotoreconnaissance\nsouthtrust\nbessels\ninsinger\nforecloses\nbanducci\nnaraghi\nlomography\nventes\nrareshare\nromagne\nhutzel\nvisioned\nenalapril\nschmeltzer\nsnitz\ndxs\nbearingpoint\nairmotive\noptra\naswini\ncristol\nacacus\nshanno\ntranquilize\nklyce\ntekere\nedified\npostell\nrasnick\nbucchino\ninstantaction\ncervids\nbsce\nbedhampton\nhorsehay\nrokos\ndegroff\nforkey\nbierk\nnuzzle\nelyakim\nrosarno\ncloy\nsygma\nkrautheim\nbotataung\nparkmore\nkaillie\nparatore\nsitwells\ntricomi\nfiredoglake\nsteier\nmenzi\nrusizi\nidentifed\nswerts\nepro\nconviasa\nhardheaded\nrearrest\nlaguiole\nboulerice\ncountires\nporthkerry\nmccririck\nschaunard\npetina\nmurrison\nquebeckers\nweild\nburde\nvictorianism\nncoic\nporchetta\nwowee\nlddc\nmozzi\nweeraratne\nzarrar\nspred\njayalath\ngenedlaethol\naltha\noverlayed\nunwraps\nmythopoetic\nwriggles\nneros\njackel\ncalheiros\nblackler\nbournes\nuzice\nprestonfield\ntouchpads\nturso\numberg\nrohwedder\nbouboulina\npepy\nmilnacipran\ngranot\nzoncolan\nbensham\ngondwe\nmagnetotactic\nbashore\njethou\nconspicuity\nfacr\nrosewarne\nplacerita\nelbrick\nehrr\nmachain\nidj\nnabataea\ndilweg\nschnauz\nmyuran\nmelberg\ncorncobs\nacera\nswid\nunbind\nmelkert\nmoodier\nkiryandongo\ncoverups\ngoodwell\nemporiums\nmelphalan\nmanitowish\nwame\ndellucci\nbaltimoreans\nmountebanks\nceredig\nhellesdon\nuntenured\nsimango\ncodirector\nshamdasani\nmarygrove\ndhaval\nsupression\nfinchale\nwindowpanes\ncsci\nmutki\nfitte\nmcpake\nbzh\nlecker\ndexedrine\ncivically\nmamanuca\nshishmaref\nmyheritage\npatricof\nmickleton\nmarianos\ngizi\nsuperflat\ngeffrye\nrubislaw\ntrenary\nteeman\narakcheyev\npapava\nbeanbags\nwordly\nasot\nglore\njayati\nstae\ntogiola\nnongoma\ngovs\naroung\ndmas\nvolodia\nmuscleb
ound\nyens\nsunley\nhyeres\nkluczynski\nsury\npakong\ncanottieri\negb\nessiet\ncrapload\nimperiously\ntrouten\naytes\nswint\narbaces\nkarva\nfontainhas\narrigorriaga\nhydrokinetic\nzalla\nhandpainted\nnccl\ntennakoon\ntankerton\nbouchaud\nuralkali\nzuccarini\nvcsels\nrawski\npresidental\nwoolos\nkeratoses\nlaymans\nrydon\nsneetches\nespalier\nslipcover\nflather\nbechtler\nmuddiman\nbuynaksk\nmarcoses\nsnowbelt\nhanadi\nwinterslow\nlliw\nbockel\nchollas\nreile\nbmibaby\nnodak\nzingales\ncabdriver\ngenter\nblacha\nunpiloted\nkaralis\nelmiger\nmcmuffin\nparbo\ndealth\nstandardbreds\ncasaus\nschneiderhan\nmanriquez\ndresel\nmaxamed\nucil\nhemorrhoidal\nnubira\nbraatz\ncoercively\nstortorget\nstirrat\nbewitchment\ndebauch\nmuliro\nnijjar\nkrautheimer\nbibhu\notgonbayar\nmukkamala\noghi\ngrindleton\nipana\nperzel\ntidily\ntohti\ndooh\nboroumand\nmodernizes\nwangchen\ncrathie\npeginterferon\nquanell\nblinkx\ngoulston\nsê\nrefashion\nefthimios\nlinell\nsillinger\npolitian\nbroughshane\npeffermill\nmerendino\ntisei\nskyguide\nanfernee\ndirr\npandorum\nbarraged\nactiveness\noilwell\nhonkey\nstenehjem\nguarnere\njaponisme\nrexel\ntauss\nreetz\nonesta\namendable\nhumiliatingly\nverel\ntassy\nbaathists\nyepez\nhieber\nlaytonsville\nidealizes\nbilsborrow\nearthwave\nkummersdorf\nmanríquez\nrussek\nswinderby\nzieff\nhalwill\nuuv\nknaup\nfazzi\nwmsl\nempathizes\ncopon\n\nermete\nhayfields\ntwizzle\ndominey\ncaveney\nshobdon\nbisaria\ndunlough\ntheologists\ngiudicelli\nhollyhocks\ndeuteronilus\nmauney\nwholely\nrishel\nloade\nbivalirudin\nsteindorff\nsalati\nkorengal\ngrazian\nhumectant\nzoysia\nlamarckii\narchila\nanusara\nndd\ncopen\nbolad\ndorthy\njacono\ntapulous\nlimpley\nmuranaka\natomised\ndkar\nsmoulder\ndoorposts\ndaveed\npaperworkers\nkadivar\nqosi\nvactor\nwolfington\ngreyness\noduya\namag\ntabarak\ngreenhorns\npolynice\nnatalegawa\npocos\nvoluntarios\nacci\nrescored\npastorek\nxtube\nweligton\ngloeckner\nerrazuriz\nracan\nhomesh\nmarichalar\ndisbrow\nrelined\nabqaiq\nco
ntibute\ngamerscore\nantithetic\nscartho\ndevolder\npseudonarcissus\npezzini\nzuidema\ndcmg\ncadaverine\nmulliqi\nfurusawa\ncolyn\nundoable\nshapingba\nterrett\nchlorotica\ncreţu\nbacai\nncci\nfeba\ndyanne\nelbogen\nfeock\ngolon\nrobosapien\nvarndell\nartley\nsketchily\nguilfest\ncontex\nelectrowetting\nmazzotti\nwliw\nkarantina\nchannahon\nmelnychenko\nunretired\ncanonbie\ndornsife\navrum\nbarathea\nvillans\nsyjuco\nwatchwords\nshenwari\nmatea\nkamdesh\ntannishtha\nlinaker\nlimoncello\neephus\nmanifestoes\nraboteau\nunaesthetic\ntrouver\npertec\npostillion\nsimler\nchatr\nshimmery\npcps\ngenex\ngovts\nrummels\ngheit\nkinnon\nelementaries\nreengineered\npadborg\ntanishq\nfattahi\nquimica\ntiryaki\narjay\nretch\nanquetin\nserama\nrudine\nnasolabial\ngraduands\npolymathic\nkadugli\ndks\nwesternize\nhickton\ngilwern\ncortaderia\nentombing\naptness\ncommerciality\nshowest\nbolkvadze\nsbdc\nwyszynski\nlongi\npolariser\ncouzins\nartisanship\njsow\nexar\nhtein\nsumud\nnonprescription\nomgpop\nwindo\nbindman\ngrossology\nexhalations\nkaronen\npotamos\npudney\npleating\nmischer\ntrevitt\nimpenitent\nbeausire\nbaltan\nnfx\nmastoiditis\ncalitri\nblogtv\naloisius\nentine\ntuinei\ntahiliani\nrakija\npilarczyk\nhemagglutination\nperreira\ncalcars\nwhing\nwfn\ncorralling\nwoollcombe\ninjuns\nvarenicline\nbotc\nhoseason\ncernak\nroohi\nzagged\ncairnie\ndayanara\nskirbeck\nhourn\nsurace\nstudebakers\nidiosyncracies\nmallas\nmarrows\ncogenhoe\nmezcla\nmahmoody\nriddings\nshamsudin\nguangchang\nzyb\nalgus\nhipodromo\nticciati\nworkrate\nworthiest\nhhe\nsaunt\nkctu\nvatersay\ngooglebomb\nhorridly\ndarse\nhemianopia\nnonstarter\nguruprasad\nspiva\ngamber\nlacemaker\nresit\nphrai\nunificationists\nreigh\nleasowes\ninnercity\nvillagomez\ndevoran\naboutaleb\noutqualified\nsermonizing\naddded\nraileurope\nelkes\nlovaas\noctopodes\nmunkeby\nfilmfour\nstolichnaya\nkohm\nschrafft\nalwen\nunhealthily\nlamelas\nmcelhaney\nkayalar\nscheible\ngiardi\nmikulas\nfundoplication\nhannelius\ncheesesteak
s\nbuchannan\nessaying\nkiddington\ntherof\nfawzy\ntwx\nbujalski\narginase\nbalslev\nkorbi\nmammadyarov\nwhittam\nultrabattery\nramrao\nscootering\nlefton\ndrizzling\nascó\npbrs\nmehanna\nhepher\nkondh\nimpliment\ndiffey\nsunburns\ncelliers\npauletta\nslynn\nruffer\nreacquisition\nhomebuyer\nstarbeck\ncolajanni\nmedalling\norten\nlongboarding\nbergisel\nvoca\nseptuplets\nassignations\ntshirt\nmeisl\nmopes\ndijak\nhwl\nkosove\nludbrook\nunderclothes\ntrakker\nrebney\nteisseire\nbaxt\nklor\nthegame\nbepi\npsnc\npremonitory\nfotakis\nlimewood\newww\nrevillon\nrecondition\nsynar\ntimika\nbecoz\nalkis\nunderbid\nyongbo\nkiddin\noverzealously\nsubstanial\nmohney\nkakade\nmaafa\nmaxxpro\npascoag\nassasinated\nbiologos\nbodrogi\nshigefumi\nwasta\ndorinel\npenalosa\nbetide\ngupton\nxpcc\ncukurova\nrickaby\nluiten\nsoperton\nradiowaves\nfreij\nkilic\ndhruba\nthomsonfly\nvauvert\nhouseware\nmvrdv\nwalem\nmotioning\nflavourful\nalferd\ncalangute\nwhitehills\njovians\nplacemaking\nshrawan\nsaxer\nmckinnis\nabderrazak\nmildy\nmaintaing\nbickington\npolzeath\nbroggi\nchaddesley\nieps\nbrilliante\nyantis\nhitner\nmarong\ngandhis\nparities\nshawanda\ndurando\ncockers\nelbers\ncockman\nstaiano\ndamrau\nsteinbacher\nwibble\ncontractionary\naraldo\npelargoniums\nfengjie\nnonino\nconks\nashjian\nhoneycombe\nmandolino\nshukrijumah\nhamburglar\nkreischberg\nmoominvalley\nnowack\ndisseminator\nilluminative\nhafei\ngwion\ntherma\nnanoscopic\nroughhouse\nembratel\nbuechel\nattukal\ndelousing\ncorteza\nschoenmakers\nrohrich\nkorshunova\nbernadino\nmazzolini\nturcat\ntimothee\nannotates\npitcock\npoyraz\njakupovic\nmartagon\nmagsafe\nfreimuth\nintraspecies\nlemasters\nqmu\nhawgood\nkrx\nchampoux\nswartout\nmaconchy\nhsct\nramson\nxiali\ndmfs\nsnocap\njersild\ngenerativity\nmanyata\nstunk\nreoffend\nbernado\npennel\nresurfacer\nrothaus\nbeckstead\ncrye\nhryb\ntkeshelashvili\nnenita\nnetbase\nrokin\nncircle\npakora\nbooij\ntimimi\ngendall\nakognon\npotsch\npontcanna\nkadota\nraib\nbosisio\ncoldb
looded\ncollimators\ngrelle\nrabigh\nmoisey\nmidgham\nkatsiaryna\nmeole\nglotzer\nledermann\nthronging\ndadis\nsmallcaps\ngalinski\npatakis\ndith\nouakam\njeanetta\ngrygera\nastroland\nkristeen\ngratae\nmouthy\nbreitbard\ntradeshows\nshinkin\nshahrvand\nraikkonen\ndistractingly\nqueeny\nlavere\nactivehybrid\nhde\ndeonarine\nbodach\nwesa\nzwane\nyingjie\nbenzing\ndolbear\nubad\nkurr\ntprf\nrelatedto\nmoggy\nagrestic\nnaofumi\nbuseck\ngaslights\ndaters\nfirstbus\nkeavy\nasotasi\nexotiques\nmirwaiz\nvivie\nwnuv\nkranjc\nimprovs\nomnifone\nabatements\nminumum\nhich\nmehretu\nshuangqiao\nbusinessobjects\ntods\nkrehl\ncarim\nwaber\nfieldorf\nlcmc\nkamanga\nparad\ndvalishvili\ngritten\npavi\nlanguge\ncreveling\nringwall\nkunak\ndmsa\nspys\nmoiseevich\npentney\nojt\noutrightly\ngreeba\ngeocachers\nkeqin\nalterable\nreactively\nweissach\nmlpa\ncorbacho\nachievment\nnimroz\npolruan\ncivilise\nazhdarchid\nmalissa\nseducers\nqueerty\ncottaging\ntechwin\nfabens\nkrugerrand\nkuragin\nsnam\noutrageousness\nlingor\npharmaceutically\ninquisitr\nuspap\nzolo\nabominably\nlunik\nishiba\nmerad\nfrançafrique\nainun\nraisi\ndivemaster\nrockhard\nheartthrobs\nbeidler\nneylan\nchalonnaise\nwideload\navini\nolanrewaju\ndexters\neckles\nbornholmer\ncamelias\nmcgeever\nritchies\nmahmudul\ntrustmark\ndensus\nbrunsdon\nambrotype\nmutchler\ntepees\nchickies\nndabaningi\nbayoneting\npayables\ncondemnable\ngraffeo\npolaski\nvhb\nroketsan\nwhoomp\nammara\nquinter\nartemi\nimz\ngreenboro\nloyalhanna\npopalzai\nmcgarrell\nhandey\ncxo\ngaldakao\nlightcap\nklatch\nekimov\nebullience\neurodollar\nditchingham\ndumes\ndimap\nstear\nschnucks\nrefreezing\npluss\nbawku\ncookey\ncontries\nkuumba\nmasimov\nschiattarella\nquiting\nyugoslavians\nremotus\nhenschke\njokin\nsoyza\niraqiya\nsartin\nsaharon\ngutbucket\nkamrul\ngrabham\nunlocated\nlaggy\noverpricing\nruakaka\nceske\ndysmorphia\nhoba\nearlimart\nbiljon\nreinwald\nratm\nlingafelter\nbeverlee\nnaqqash\ntdap\nwolkoff\nradulovich\nmuzzleloaders\nbame\niame\
njafargholi\nboisture\nchegutu\neasterlin\nejike\nnissans\nseitaridis\nsecoya\nihd\nlauran\ngallman\nsestito\nlakhnavi\nhumphris\ncelotta\npittinger\nmudavadi\nkirven\nbriegleb\nchinwe\nhrpp\nklapwijk\nthrowley\nsoumillon\nfuturegen\nselecciones\nbastani\nkarsts\nmanoeuvered\npalinuro\nirms\nusherwood\nsouthcoast\nneukom\nlouca\nhawx\nmalmborg\nvaldobbiadene\nbowey\nfootstone\ncavanah\nkirchick\nniedringhaus\nsarcosuchus\nexcedrin\nsabaudia\ndenbies\nantiretrovirals\narkansan\nscummy\nrubinsky\nkaylin\nsoonish\ndenicola\nlegrande\nscapino\nxom\ndhlakama\nprolixity\nfaughart\nzorana\ntodaiji\nperezhilton\npipilotti\ncolombine\nconfreres\nczuma\nsaquarema\ndtz\nseethes\nroflmao\nunbiblical\nmontres\ngalbanum\noscommerce\nbiltine\nsicco\nhamisi\neulogising\navocational\nxinfu\nworby\nflyger\ndiki\nhagadone\nwitteman\nsaac\ngianbattista\ntsiskaridze\nbokel\nkishorn\nenvying\nknoebel\nlustleigh\npartier\ninvigilator\nmarata\nundershirts\nbocco\nlebowa\nlotfollah\nexhaustible\nabaunza\novergrow\nzentz\nhoeber\nmarvyn\nhedican\nmccleskey\npuelles\nkuehnle\nwaing\nmcgees\neyden\ngogia\nhassabis\nscythed\nchones\nhaloti\nteleconferences\nabag\ntraymore\ntongchuan\nbeignets\nmezuzot\nninkasi\nsidoti\nbackhands\nensalada\nybas\npetrache\npontsticill\nestuardo\npoliticising\nsaltis\nlavallette\nctas\nimpenetrability\nkaripidis\nlanken\ncaerhays\nwyth\nhocutt\nborell\nmonthes\nrogé\ncinestar\ntrupti\nrevenges\nhenahan\npfanner\npreise\nrends\nmacheteros\nleshoure\nmulticulti\nkyungnam\nchillenden\nbubalo\njambor\npursley\ntmetuchl\nconglomerations\nmckayle\nstrous\nvasks\npettifogging\npuranik\nuset\nanso\nbainum\nqna\nuncat\nwhinlatter\nbittar\ntarutao\nchinu\npedregon\nncfl\nmacronutrient\nnfkb\nthorncombe\nwcas\nawak\nhasin\npaams\nbozz\nsuperinfection\nmokes\nsuicidally\nlininger\nmontsant\nlifeboatmen\ntelepathology\nconvos\nmeagen\naers\nadition\nbleadon\nbiancheri\nbiotherapeutics\nhasheem\nkosmotras\nbadwi\ntwats\nkeratsini\nmusts\ncutepdf\nspicewood\nilsinho\nwefald\nbf
ca\naltemus\npensione\nyeter\nvillalona\necologie\ndearnley\nbarbwire\nusaca\nfanless\nlituania\nditf\nfibbers\nhartstein\nmozhaisk\nwebtop\nfenyvesi\nglobalflyer\nbelived\nseverability\nkyobo\nbadh\ncissi\nandritsaina\ninvista\nsonographers\ngwy\ncernik\nhardwire\npurveyed\nfovant\nmonsal\nfishamble\ncinefamily\nrubano\nartax\nsiegle\ngtaa\ndunecht\nfrommel\nrpu\nunquantified\nstaincliffe\nloakes\nkoprivica\nveech\nhipness\nbillyboy\nsdps\nruhrgas\ndrafi\nmahamud\nelides\norangewood\nmoominland\nruha\nrozell\ngoatwhore\ntvpa\nhunchbacks\ncoronated\nshillingstone\nshortell\nkhalifas\nwordage\nzanganeh\neiden\nsharrett\nwaab\nvesko\nicbl\nchasses\nnachtwey\nwatnall\nheitner\nthurmann\narbitrates\nknödel\nkohnen\ntelecomms\nclannish\nrusnok\nbelterra\nshovelled\nmputu\nokell\njiyan\nnotarization\nmiddletons\nzdx\nbaches\nebbin\nzubairi\nmianzhu\ndickran\ntsec\ncitrin\nacroteria\nkellyn\ndissatisfying\nlittl\ngrethel\nbloats\nemelin\nzacharo\nextrem\nselph\niabc\ncreamware\nwusb\nnpca\nreoccupying\nsomontano\nkremmling\nmalara\nadeem\nmaarit\nnannetta\nguarin\ncouette\ncheik\npanitz\ncontinentale\nhaemodialysis\ncatarino\nnegs\nsennybridge\nzamka\nvishay\nswanner\nstrichen\nphotopolymer\njeramy\ngaft\npolyneices\ngazey\nspearpoint\ndualled\ngundelach\nrilwan\nzaloom\ntriax\ncreaks\ndasarathi\ntasini\nkortney\nseeduwa\nnnk\npineridge\nbederson\ntsotsobe\nlimpar\ncaldwells\nlomon\nsweetney\nbraungart\nrustled\neverbright\nsomehting\nindustrialise\ncroyde\ncroda\nsodales\naugenstein\njamain\nledsham\nrodhe\nleang\nbrockhill\nfacepaint\ngdrs\nironist\npaskov\nbepicolombo\nwhiney\nsignifiant\nloggie\nmaing\nelectrotechnology\nmcds\nsdsl\nbelcoo\nprokopis\nmaffs\nmcgeown\nandiamo\njalees\nharpaz\nnorthrend\nharos\nplatysma\ndavilla\nschnapper\ngeyserville\nichinohe\nchannelview\nlambdas\nevis\ndoocey\nmaryinsky\nhumbie\npertis\nzikri\nmwnt\nilych\nomnicity\ncentocor\nstaka\npulverizer\nderyl\ncabc\nsterlings\nsidd\negide\nmorston\ngasifiers\ncandidats\nsetsuo\nballboy\nvisan
\nskrepenak\nbrajesh\nvork\nbuntrock\nthimmaiah\nmccumbee\ntchs\nsleddale\nribbleton\nbolkan\naufderheide\nreifert\nlavagirl\nfuchsias\noverexertion\nkalana\nyudu\ncitings\nweidenbaum\nspanx\nredel\nlagardere\ntorkelson\nhunston\nparentless\ntantan\nbeelzebubs\ndokumenta\nlegian\nrascally\ndarusman\nduiven\nkobina\nhuizar\nbladenboro\nmonasterios\nbenik\nmamic\nexposito\nhigareda\nratiocination\ntufnel\nkamtapur\nepsiode\nroseraie\nkanowna\nsullins\nthrashy\nbombmaker\nzehner\npeir\nwhitfeld\nlby\nlawing\nnympho\nunburden\nqfc\nsylvinho\nsengi\ngolston\nsiptu\nkingscliff\nnaaqs\nviscious\nhamedani\nsamsom\nhacktivists\nselvy\ngenstar\nvosne\nalagia\nshweli\ndorando\nflygare\nubukata\ntamiris\npresentiment\ntike\namicizia\ngainline\nkorvette\ntecpan\ninformationally\ndiski\nrollouts\nhepu\nimprotant\ncienegas\nbassong\nblaubach\nshenay\nholender\nmisalignments\nhybels\nqueudrue\ncrossgar\nbrondello\nimbaba\nholomisa\nallio\ndalwhinnie\nchatree\nstahelski\nfuksas\nolowu\noutgaining\ntubingen\nheadlee\ngeiko\nmultiunit\njenniskens\nshuan\nfangirls\njalila\nramsays\nnooo\ncognisant\nlindinger\neveything\nangeleri\nwindier\ngelston\npudor\nkaragöl\nherfurth\ntrepanier\nribena\nbufford\nenoteca\ndeininger\nboxhead\nregretable\neroticized\nmalyon\nsandeno\nboggis\nmassett\nnagarhole\nmorfydd\ngreenmarket\nworkbenches\nbarchi\npoletown\ndahalo\nelvi\nscottishpower\nbluehost\nmogilevich\ngarcelle\nmckenley\nongpin\nchudzinski\nguilloux\ngvg\nvart\neagling\ndiah\nlivarot\ngrapnel\nchingis\nsepaktakraw\ntranquilli\nilaga\nxingfu\nbottomly\nsehnaoui\nonevoice\nkenzaburo\nmasey\niltalehti\noptionality\nresettlements\nironik\nragghianti\nsinohydro\nsteinhatchee\nfunks\ndrumnadrochit\nhambright\nporfiry\nfrik\nhalswelle\nskane\nsandefur\nmanc\nsnigger\nchannu\nrabone\nroriz\netj\neyedrops\nmthfr\nmbarga\nhuatabampo\nrufforth\nphenylethylamine\neventualy\ninterflora\nserrao\nbeaverkill\nprivee\ngscc\nnotkin\nmudhole\nasgaard\nfarmingville\nconserv\nbido\nmomenti\ncager\nfladmark\nl
ukasik\nhannawald\nmarlar\nlroc\nschiefer\nvulcania\njmk\nciso\nterzieff\nwibe\ncucuy\nmoussi\nanalogic\nnrha\nhalterman\nnodari\nboetie\ntoaff\npinboard\ndgx\nndri\npicanto\nissie\ntand\nvejer\nchangeovers\nmaddrell\nhuallanca\nwadhawan\nenyeama\nweech\nboisse\nratnasiri\nbrickbats\nmovables\ncontadina\nstarlab\ncreatura\nbrzezinka\nzitron\nrangely\nmcconathy\nvalpak\nkisielice\npachyderms\ngrousbeck\nantonveneta\nporticoed\nmaiani\nbuspar\ncritcism\nsimrall\nromary\nresendiz\nhouris\nviiv\njunxia\nvalrico\nmerche\ncowger\nmohnhaupt\nxiuzhen\nwrithlington\nsevoflurane\nvalhall\ncraigmyle\nsontheimer\namama\ngiertz\naleshire\nunconfident\ntestbench\ndilon\nverta\nshailer\nlongson\njacobowitz\nabrogates\nstraightline\nfusina\nreclad\nkemperman\nellida\nsunlike\nceleriac\nukelele\nbrueggergosman\ntiernach\nporuri\nsmoketown\njoselyn\ndanieley\npaperclips\nharmoniums\nvilayanur\nsoufan\nnyishi\nkhunying\nbreathalyser\nliko\nbirdoswald\nantia\ndarkrooms\npullens\nhorkesley\nborrani\nfezzes\nrotunno\noundjian\nmcroy\naquafina\ntwitters\nlavagnino\nhoffmans\nfanhouse\nyebda\noutclass\norbitting\ntaotao\nhanifah\ngwahardd\nglacé\npitmedden\nallisons\nsobolov\nevercore\nicna\naasb\nskegby\npytka\naccelerants\nmcclard\ntabbouleh\ngarnethill\nlamborghinis\ntaiichi\nmallie\nnads\nglasheen\nzonia\nsofinnova\nanorgasmia\naycox\nfarizal\naast\nzenoni\nsimiane\njewelweed\nistm\narifjan\nmatteoli\nyacoubi\naabc\npaiz\nheter\nniaaa\nacció\narzoumanian\nayyoub\nnokta\nincantatory\nharrellson\nchiran\nmukherjea\ndeadliness\neuphues\nbutyrolactone\nminney\nbeauvale\ninfuriatingly\nveruschka\nxiaoqi\nrailey\nhuxham\nburtka\nhellens\ndelacruz\nunderwoods\nkeiter\nzeewolde\ndicha\nmuddler\nsuccentor\nbabaoshan\nloopnet\nisabeli\nsugested\nunitard\nvailima\nabony\npataskala\nresco\namericablog\nperkovic\nbindoon\nsennelager\nasbpe\nnedelin\nthta\nmji\nunaccomplished\nsimilair\nduss\nbuchannon\naustrey\nbottaro\nepalle\nunclouded\nstarwave\ncaporaso\njazzin\nfnx\nalire\nnhis\nwheezes\nlabar
ca\nbarmaki\nabdulrahim\nkreindler\ndigibox\nwjtv\nwrithes\nyazdan\nunharvested\nalipio\ngayler\naeschlimann\nangelical\nzevin\nglenallen\nbuddo\nrutsen\nlarrimore\nkorsgaard\nzialcita\nrbh\nyongan\nterroirs\nzilberstein\nfrand\nshallowing\nobertan\nelmohamady\ncagny\negnew\nsvcs\nteeb\nweas\nantidisestablishmentarianism\ndeliverers\niicd\ngeesh\nbostich\nbonifas\ndyomin\nmaldef\nyre\ntheu\npsittacosis\napopo\nservillo\nverint\nnofal\nlytes\ngongyi\nbloodsports\nafrik\nshoushan\nmancunians\ndimness\ncassoulet\nrebuilder\nreinharz\nskyhigh\nreira\ngroundstaff\nentraps\nbebout\nexorcises\nclayburn\nrichar\nrabei\nlucked\npinkies\ntransloading\ngroomsman\nantan\nmirbat\ngrannan\nogiek\ntidies\nessayistic\ncomputerize\nlengthily\nezeli\ndochart\nfreymann\nsuer\nlittlebrook\nsattelite\nwaramaug\nlubo\nsafesearch\nescandon\nnuckols\nspiriting\nistithmar\ncabriole\ntraprain\nfreindlich\nsogou\nchampy\njary\ntecktonik\nlligat\nsportwagon\nblw\nhamutenya\nrowsthorn\naschwin\nredstarts\nvaccarini\ngrenadians\nmgrs\nchiya\noffcuts\nholekamp\naasan\njinga\nlaconically\nunintelligibility\neccb\nlychees\nshatford\npaddleboat\nbelkis\ndoddle\nkinmonth\npilkhana\nburtenshaw\nmountainville\nnowick\nkoellner\nbjoern\nbathrobes\ncodicote\nsehome\nugland\nkesse\ncomparitive\ntanwir\nbarette\ncontroll\nradiowave\ngrens\nmarjanovic\nandrean\ntavris\nbatek\nmesospheric\nwitha\nrigsbee\nnasaw\nbreathability\nkenniff\nweijun\nnycomed\ndörfler\nstatelet\nbraila\nmussen\ndepite\nemplacing\nrosica\nmindfully\niknow\ncanion\nogonyok\nabshir\nambivalently\nnethy\nchequerboard\ncraftiness\nbawana\nsnailbeach\nmokae\nhillend\nzhongliang\nrasiak\ncprm\nscampering\nsoud\nblazar\nsonenberg\ninamoto\nducruet\nnivkhs\nlillehei\nrangiri\nsequenom\nlorwin\ngithongo\nsumthing\nparred\nmsec\ngooley\nsedlescombe\ncochinos\ngotshal\nkatiba\nfakoly\nronette\nwanni\ngazania\nripstop\nminibar\ncommmons\nweissenbach\nreassume\njaras\npenwell\ngeia\nbput\ncedis\nccea\nfranze\nfollie\nbilges\nglor\nmunto\nlidy\nko
ten\novercharges\nsfmta\nmwg\nnhw\nogunleye\nbekri\njerramy\ndokan\nrunte\npredannack\ngostick\nharbisson\nyuskavage\nhavi\nshrubb\nmegabucks\nwestell\nassp\ndelectation\nabrines\ninexcusably\nyoran\netelka\nneeb\nunhitched\nfictionalizing\ntsampa\nfryzel\nacquirers\naslockton\ndrawcard\nloftiness\nbogeymen\nllagas\nsteyl\ntippers\nwevill\nsweetbread\nkoelewijn\nnanan\nbismol\nnicad\njapes\nekp\nkomla\nparve\nzernov\ndmdd\nfalconetti\ntransgenesis\noverstocked\nqap\nraciti\neezs\ncotana\nvorhees\ntamaya\nscudetti\nbolano\nmulamba\nmetzelder\nküpper\nboasberg\narchstone\ngowy\neurocar\nexsists\nkatchen\nwijekoon\nxiaoyang\nfayemi\ncooperrider\nterrordome\nmosset\nbryghus\nmatsuhisa\nlecanto\nspeedos\nabdominals\npandove\ngratuit\nmoviegoing\nprevalences\nhenjak\nawes\nlogjams\netak\ncyrilic\nvolpini\nhedonists\nlondyn\ncodger\nallbutt\nhartcliffe\nsundram\nsmoko\nstucki\ninsurrectional\nkhandker\ndenenberg\ntexmelucan\nhooved\nmovilla\nevoy\nkalantari\nhiccuping\nkhristenko\nveverka\nrecommences\npentaerythritol\nwirtschaftswoche\nfuntua\ntokoyama\ncolugo\nleatherheads\ngarscadden\ntorneos\nsivanesan\nleilah\npinton\nafagh\nvikar\nguanling\ndoublemint\nforebodings\nnancys\nklich\ngazz\namyx\noefelein\nquerejeta\ndeaux\ndetouring\ntotto\nwebcasters\ntalland\nwestbahnhof\nwaith\ncvrd\ntownhome\nsweepings\ncookley\ntimberman\nhalmosi\nbojo\nokemo\nnerem\nbelorussians\ndequenne\nsougou\nmursal\ningrow\nhalver\nbacksliders\nunpremeditated\nlainez\ndefensibility\nbanghart\nsabayon\nsamho\npremaratne\nbadgett\nharvel\ndarfuri\nchlamydial\nopuwo\nkarlmark\ngoldhammer\nrelleno\ngilgo\nkeeran\ninkwells\nrefoulement\nkohe\ntretchikoff\nrusafa\nfinanceira\nschapper\ndownbeats\nquansah\naerobus\nsgorr\nthanvi\nkurzawa\nafeaki\ntollman\nvoluntown\nvarenyky\nmadrassahs\nlafca\nbringuier\nqadisha\ndrub\noposite\nhussian\nnurney\nemtala\nnyatanga\nanyhting\nbaffour\ncollectivised\ncoert\nangioma\nstrunsky\nmislabel\ncopado\notniel\ncruiserweights\neith\nrandomisation\nehen\ngreffier\
nfflur\njuckes\npremack\nreprove\najt\nneep\nembeth\nconcentra\nsowards\nanshen\nhipkiss\naustan\nreadymoney\ncaitríona\nkeshar\nkeleher\nbiosimilars\nrevalidated\nhagelstein\nrusling\nwrangled\nbchr\nconcow\nsweetwaters\nmpds\nkathiresan\nshumard\nleslee\nkontroll\nlucita\nancestory\nschoeneck\nlasvegas\npoquelin\ntyronn\ncaires\nfeca\njobsworth\nlimetree\nmachrie\nnatera\nmesick\nwhelpdale\nflamberg\naloneness\nminnear\nashenafi\nhonchos\nfitschen\nzubrus\nkamus\nthecityuk\nbaldemar\nexia\nhammed\nkeram\nenak\nlxb\nbocht\ngelineau\nwineville\nsuhaila\ngarretts\nmanfreda\nstairlifts\nhenault\nsuttee\nsinowatz\nhuether\nkarnei\npominville\ninextinguishable\nbelue\nfrisked\nloppi\nmazher\nterrazza\nfritas\noystering\nnuca\ndélices\ncmmb\nmillefiori\nbalking\ncreaney\nximending\ncpoe\nbajuk\nelblag\nscheetz\nregurgitations\nhubbins\nkanwa\nkimutai\ncompunctions\njimm\ncancale\nmonans\nphumzile\nboet\npigram\nopion\nkhodr\nkoumba\narbitrageur\npheromonal\nshairp\nmahbubur\nsnan\nhannig\nmcconnells\nkehilat\nfoxhunting\nmakosi\nhumanizes\navst\ntaxies\nrebuy\ndemeco\nlougee\nnsidc\nstrivers\negusi\nacab\nyurimaguas\nhamiguitan\nhatchings\nselectee\ngwynfryn\nsouri\nhydras\nnogle\nalfon\nsorpe\npuddington\nkasavubu\nstumpers\nfrpi\nfantabulous\ndrennon\nsisterhoods\nalwani\nplumped\nsweetlips\nyerbury\ntempier\ngrandinetti\ntostada\nparatene\ntitman\ntransmutes\nkahlua\nseymours\ntotok\nboyda\nlinby\namital\nlevieva\ngradney\nhurka\nsulkhan\nconero\nkropf\nboasso\nteleflex\ndcri\ntorpedos\nchristofle\npiaui\nblackgang\nmeningococcus\nbunget\nfiserv\ngwrych\ninfratil\ndenuclearization\navedisian\nyumen\ntelogen\nmirchandani\nkringen\ndeshay\ncolesberry\nfremontodendron\nignac\nfathallah\nshelekhov\nalemtuzumab\nglatiramer\nblastocysts\nboguski\neurocorps\nhajari\nxnview\nsafaryan\nballwin\nhiranya\ngiubba\nmihoko\nrayl\nloreburn\nbasix\nsickos\nyuhanna\nfarmyards\nbrittani\naverroës\nwasserstrom\ngrafenberg\nadala\nfasal\nkontras\nsrivatsa\nyanbo\nstabbers\nduhe\ngysgt\nb
arrells\nsneads\nquieten\nlccc\nlupane\nhinga\nchalfie\nhellos\nscoby\nriduan\nnajeh\neggborough\nteratogenicity\ngalten\njarvi\ngurey\npkv\nhighbaugh\nbrauw\nrsamd\nwillowy\njinmen\ndesanti\ntriay\ndecison\nopiyo\nmeridan\nnostrums\nvaleen\nfrankfurters\nnautiluses\nstotler\nghoneim\nbrecqhou\ndelestre\ngreeson\nsliwinski\nartwalk\nloku\nschnader\njaliens\ntaracena\nfennoy\ntitbits\nczajka\nbatkovic\nkulfi\nrotherwas\nelegible\ngudmundsen\nmontgenèvre\nbenakis\nclaming\nspeedcubing\nwittekind\nsurayev\nmalarky\nbirdbrain\nchalhoub\nhorridge\ntodorovic\nfactset\nnke\nnewsfield\nddraig\nshepis\nmasaga\nditu\ngorostiaga\nsubbaraman\nalaia\njukkasjärvi\npanus\notiose\nhemophiliacs\ndrezner\nparticleboard\nhavat\nminsker\nwhitetails\ncircumnavigator\nloverde\nwanjiku\nyinon\nmavica\nozdemir\nlamphey\nribhu\narrowood\ngourmands\nginori\naiwf\nentwhistle\nantil\nlombaerts\nstiffed\nalberga\nberowne\nheek\njumana\nmannofield\nconverage\nnotifed\nglodok\ntharon\nkamstra\nmujra\nneede\nmònica\nurasenke\ncapuçon\nminoff\nruminates\nphilmore\nnienke\ntalamo\nbarquín\nnunnington\nalledgedly\nxol\nnortriptyline\nradlinski\nkoskoff\nkalley\nndas\ngofton\ndookeran\nderéon\ndawne\ncambe\nbrayman\ntyrannous\nalekna\niberico\nvacco\nnavratra\nthila\ntairi\nbagpuss\nwalgrave\nghanian\nbovin\nmercurey\nchinde\ntrenwith\nsollberger\nstreetlamps\ngreutert\npączki\nnobodys\nnorweb\nchartock\nnutrasweet\nsureshot\nthuoc\nsellman\ninbounded\nflatlanders\nvaporising\nisella\nrizvan\nbungler\nhukawng\ncalsci\nsuryan\nstreck\nryue\ndunitz\nlongstanton\nmodhera\nkimbe\nedivaldo\novertakers\nlowa\nshirah\ngnlf\ninterdictions\nbharananganam\nvandenberghe\nbhana\nquattrone\nradaronline\nhelpmates\nserralles\nunibrow\nmaryla\narmanti\nsasportas\ntangela\nlenôtre\nvacillate\namericanizing\ngettman\nsurveilling\nmidmar\ntheway\nhiel\nanthopoulos\netok\nsfra\nrssi\nkunes\narndell\nrangelov\ncountermanding\nkorr\narshin\nborisovka\nsuperbug\ngobstopper\nbuncha\ninterrogatory\nbelta\nliberhan\nivlev\nca
tarrhal\nleftback\nbilinski\njediism\nsamel\nmehman\ncontraversy\ntofo\nmtvs\nartex\nkvalheim\npanauti\nhoodless\nmicheel\nhairstylists\nciabatta\nvht\nrichville\nradiomen\njoyes\novalbumin\nsqc\nminnett\nunnervingly\nauxier\nalavesa\ndingmans\ncourbis\nnimród\nroban\nmyslef\ngreengage\nbegon\nmatala\nitsa\nforthe\nunitt\napti\nlionhearted\nlawbreaking\nboeta\nhempsted\ndaimiel\nefilm\nmovsesian\naleknagik\nhonourees\nnamkung\nkirsteen\nlhh\ngeoreferencing\nfunnest\noxxo\ngrossness\nmacarthurs\nstrich\nnikes\nshivji\naldbury\nsimister\nanhe\nidzik\nniurka\nmacena\nifakara\nkeala\nsummonsed\nelectrocardiograph\nphalluses\ntibooburra\nhesley\noligopolies\nirréversible\nclarridge\nshoora\nbrightmoor\nfergerson\nunal\ndommett\nmortifying\ngerwin\njovenes\nsaracho\nvagenas\nsuseo\nmarathis\nhesme\nkempa\nostrer\nferlo\nshamin\nfalastin\ngurvich\nindigence\nkervin\ncieplak\nvacillates\nuksa\nniloofar\numani\nderriaghy\nksby\nweem\nsidelnikov\nforlornly\nvenugopalan\nreconsiderations\ntarish\nunsaleable\nmuthyala\npronatura\nboerse\nroehr\nbambalapitiya\ncushendun\nfrémaux\nnasrudin\ndroughns\nsenecal\noladapo\nlickliter\naulenti\nmoeketsi\nbreheny\nbauby\naryal\nobediah\nriflery\ndaris\njutanugarn\ndiora\nbaiba\nmckelway\nfavalora\nremic\nbaratti\nhualian\nbothies\nbreastroke\nleibish\nmichalewicz\nwinkless\ncohre\nsniggering\nhijuelos\nosswald\nelsewise\nqiz\nairan\njukic\nrhan\nfinci\nvillandry\nlcfs\nrochemback\nfilmdom\nfmap\ninclinometer\nfodé\nencouragingly\nchanaka\nsovereignties\nsebastiane\nchavista\nslobodkin\nkestel\navolon\nghandhi\njoze\nacdp\ndemoralisation\nvanlandingham\nmarikar\nalcalay\nprejudicially\nchiropody\nbrolsma\nwashstand\natwa\nenfolded\nattallah\ncobey\nremley\nafrim\nsaturns\npassera\nleatherbarrow\nnows\ncheongwon\ntrevanian\nlushnje\nlidgett\ncentrefold\npanchakarma\nmishcon\nbrixius\nsilverswords\nmalavé\nconcidered\nclearinghouses\ndestini\nboiz\nfüsun\nligambi\ndemer\nlandler\nbaetens\njinxiang\nipratropium\nunsatisfactorily\nminoprio\nk
oge\ncunneyworth\nangy\ndharmasala\nschmidheiny\nrephotographed\nbeeri\npatinas\nzingo\npromptings\niczm\ndenouncements\ntamaru\nheertje\ncalzadilla\nwenxin\npetteway\nslaten\nnarkiss\nwuwt\nkhazali\nadeniji\nnyamira\nbrainers\ngnatcatchers\nwhorlton\nkatp\nhyperextended\nstarbursts\ndouridas\nleuchtenburg\nmemc\nfiorilla\npozieres\nrecommencing\nfaiers\ndenford\nonesie\nrozental\nlooooong\ndelamielleure\nmisimpression\nvandendriessche\nkhamees\ntagliaferri\ndefoliating\nredbrook\nczuchry\ntheresienwiese\njigged\nyardsticks\nlanggaard\nlcps\noring\nehrhard\nbenza\nmrgo\nbrezec\nfountas\ncontenting\naigua\nagoglia\nsambhavna\ncookstoves\nnorment\ntopica\ndestocking\nlanikai\nvirginny\nfcfa\ncoronini\nbandele\ncesp\npontllanfraith\ncurtsey\ntabooed\nwrapup\ncxt\nmcwethy\nbetsi\nraborn\nqionglai\neviscerating\nceranae\nkatouzian\nrazzi\nartyomov\nbûche\nnibbs\ndosch\nwoodworks\nlocali\nfroward\nmultiscreen\nshubho\nericaceous\npowa\nlylah\nclinkers\nmclinden\ngalbiati\nkøbmagergade\nphip\nkapoors\ngillison\nihec\nsharfuddin\ncocooning\njaeden\nanky\nsinghalese\nbaranoff\npelleted\nshenington\npowermac\njlm\nspurted\ndehnert\nkhune\naivd\njeffri\nomair\nzumiez\nfauss\norick\nfrig\ncooperativeness\nghafour\nkinokawa\ncorriero\nicrw\nwhur\nrubiera\ncusu\nsombody\nchumby\npatacas\nyerokhin\nbesla\nlipizzaner\nherefords\nwhateva\nflyfishing\ncuppers\nkorka\nhortatory\nfiguerola\nascraeus\nhillwalkers\nbutina\ngiantesses\ntaxanes\nhardev\ngenuflection\nemberley\nopdal\nvaliha\nbadshahs\nwiggenhall\ncoplin\nfenstanton\nagran\n,you\npaternalist\nvarughese\ncasellas\nmordern\nruwart\nilos\nkuca\ntonghe\nbolita\nhamamoto\nphinn\nyunhe\nmaamoun\nnazzal\ninsch\nenviromental\nfreiamt\nverbrugghe\nistinye\nshaugnessy\nmiryam\nabsorbtion\nhoofnagle\nperplexities\ngonalons\nashwick\nchachar\nkvetch\ninsourcing\nshorecrest\nnutritionals\ncircumcising\nwoosh\nvalsartan\ndimaporo\nbackwoodsman\nigbinedion\nglittered\nlunts\neying\nvigorito\nfilipowski\nkilmeny\nscratchcards\njogis\ncasse
us\nkreutzberg\noumi\nlispenard\ngennadios\ndesisa\nruchika\nborf\nasuquo\nffin\nbrunken\nwiard\norignally\ntatlow\ndoven\ntesche\ntagab\nsurapong\nbecknell\nlqfp\nsimonstone\npallisers\ndefilippo\narchibishop\njdr\nwcjb\nputtar\nconfab\ndespoil\ncommodes\nblashill\nbareknuckle\nshieling\nbateer\nrebids\nlanctot\nhanauma\nsharyland\nklauss\nariga\nalmsick\nccss\nskewbald\ndosb\nnovellino\nelot\nbriarcrest\nhessell\nheilemann\nskaugen\nspygate\nreductively\nhekma\nemmannuelle\nsummerford\nsomnolent\ngotobed\nmontegut\nteashop\nsaxonburg\nprairieland\njyles\nfubu\ncanakkale\nnammo\nunclog\nraqibul\nroadbuilding\nabsurdistan\nozmen\nattili\norrefors\nquells\nrevalue\nmfps\nganzi\nrosily\ncasualness\npenix\nprovitamin\nreheard\naufhauser\nfrizington\nklamer\nkosb\nluzzara\nparklea\nhangtown\ndarcheville\npodrabinek\nreinberg\nnuttiness\navcs\nmanpack\nyade\ndispatchable\nkmworld\ninabilities\nhockenhull\ngardeur\nphreaks\nstfa\nheffern\nbahtiyar\nplagens\nbluemotion\ndoree\nhaslen\nwestburn\ngobbles\nmunks\narteria\ngdo\nubaida\nhospitalizing\nweninger\neckbert\nturnball\nsuada\nbookout\nzwelithini\ntuile\nunorthodoxy\nermen\nviably\noatway\nicet\nimk\nbroekhuizen\ndamehood\nphills\nmagubane\nhackler\ncsit\nwiechmann\numkomaas\nshevon\ngoldmines\ncicig\nfenglin\nmaguy\ncongi\nvirb\nkazatomprom\ntrika\nxchanging\nimmunotherapies\nbobbsey\npractioners\njaramogi\npretexting\nharfield\nrajamäki\nunacademic\ngrifasi\nbadenov\ncrams\nworldy\narranmore\npsid\nkhonsari\nependymomas\njubouri\ntherriault\ntabei\nsivak\nsozo\nsischy\nbrettschneider\ngunta\nbreasley\nnewhan\nchonnam\nbeaupuy\nreaccreditation\nescalon\nmarijampole\nmilpas\npirkle\njailbreaks\naltares\ntatsuko\nmssl\njwaneng\ntelis\nbreteler\ndunseith\nsixkiller\nnégociant\nsutrisno\nrosenfeldt\nsemenzato\nmarkwick\nreinbold\ngovedarica\nemerica\nnashawaty\nsteggles\nfccc\nhalatau\nclinkenbeard\nkemsing\niriyama\nahuitzotl\nlavaka\naudibles\nwinsten\nlamichhane\nkaraszewski\nlenat\nkübra\nbugattis\noranim\ndistruptio
n\nstylize\nlefleur\nzegota\ngrupe\nneurofibroma\nbuñuelos\ncervoni\nquami\ngoonhilly\narchelon\nrasiah\nmaterialisation\nmoszkowicz\nwoolavington\nfivers\nbraudy\nlatenight\nlaugerud\nmatadin\nbattis\njosemi\nmajoritarianism\nhowcroft\ndecani\nzogg\nonil\nhamao\ndraughon\nlimache\nronca\nmamatha\nmontador\nschurig\nselc\nbelloq\nemtec\nhucclecote\nunionisation\nshangyu\nconfessore\ngilat\nbrautigam\northop\ngachechiladze\nmarzec\ncontemplator\nkaledin\nrowlinson\ndogmeat\nstambouli\ndgfi\nculpin\nbawaba\nheve\nmolinar\ninsufficiencies\nscuffling\nonanism\nfallowing\nbuyuk\nchinee\ndecendants\noeuf\nunfocussed\ngeldzahler\nmoutet\nmassys\nrossmo\nbamn\nprospal\nllansamlet\ntriangulating\nsophea\nimpersonality\njabeen\nsalfords\nmunnik\ncye\nwinterkorn\njamestowne\nettinghausen\ncountie\npinkava\nstefi\nsaghafi\nmaccioni\nulrick\nmassager\nlopota\nbasner\nlaughner\nflambeur\nmanent\nchandelle\nportersville\nblanchester\nglouster\nsupercool\ndesaster\nnoooo\ngyepes\nskittled\nkotulski\ntrandon\nfernandino\nratomir\nwiessner\nchimayo\nhelmers\ncasadei\nsalsedo\nmetalink\nhossen\npaykel\nbenu\nfantasiestücke\nteuscher\nfiesch\ngardemeister\nomrlp\nnumbskull\nspelunkers\nyuanlin\nncoc\nkipchirchir\nroets\ngoldhill\nwyers\nmugg\nstanardsville\nforker\ncerletti\nllanddwyn\nunibond\nwormlike\naagot\nduskin\nbandz\narancini\nbeierle\npainshill\ngebo\nfoucan\nfengyang\nemsc\nbeimel\nfalih\nrecolonised\ndongsi\nthudding\nkenderdine\nchollet\nfawkham\nplagiocephaly\nahip\nkebri\nsantapaola\nfemtocells\nmiddel\nerwann\ndelphina\nocchetto\npreysler\nvolaré\nmaraging\ntonteg\nlocater\ncharikar\njazzercise\noutplaying\nctsa\ndolto\nkibale\ndarnestown\nromayne\ndennery\nazcarraga\nsaunton\nmucormycosis\nemasculate\nyike\nkutiman\nfedotenko\nkasatka\ntimpone\nmontagnac\ntayer\nwolinsky\nkeem\nbistrot\nyemane\neveryting\nproximately\nelgee\nshafayat\nsutar\nbakkar\nmcdorman\nbiancone\nsagle\nladyfingers\nmicroenvironments\nversifier\nneice\nkadriye\nbajillion\nreengage\nschmeidler\nex
alead\ndominque\nweitbrecht\nlanard\ntiwai\nbenally\nlasn\nsliman\nbiowarfare\nbaloise\ngambians\nikonos\nbairnsfather\nrovsing\nfawc\nburdenko\nzinovieff\npessary\nbryers\njxl\ndutchbat\ncely\ntrinneer\ntrabue\nhardhats\ndrabs\nratzon\ncontently\nprestrud\nmegabases\nhosty\ndottori\nstanfords\nzolfo\ndris\npokers\nincarnating\neskan\narush\ntwana\nparkridge\nafful\nmoonlets\ncaldew\npuddled\ndemonisation\nkaylani\nnacirema\neglingham\ncherrydale\ngerhaher\ncedarcroft\neggimann\ndesena\ninvestimentos\nklingenschmitt\nshortener\ninsincerely\nvulpe\nvarne\nsalawati\nstaindrop\ngutersloh\nunamused\nsightseer\nhicp\nhuaixi\nvocht\nnatalist\nvisnjic\nvernice\ntinnion\noverpaying\nfusaichi\nputdown\nzeltser\ngreenebaum\nkeily\nkingsey\nkempka\nmcvittie\npahlevi\nmutassim\nyetnikoff\nnazims\neffexor\nschmick\nlatisha\ncletis\ndipsea\nolsens\njetons\nsubsample\ncoarelli\ngrear\nborucki\nmalie\nkcbd\ncollegue\nspattering\nolaiya\niarpa\nnyha\nmcy\nsûr\ndensitometry\nmasunaga\ngurwen\nnlj\ncrosswicks\nsehk\ndawi\nnedkov\ndenno\ncilic\ndedicatees\nkuester\nantipathies\nporcellian\nmatriculants\nrisø\nvilu\nreinstadler\ngolez\nsawe\npipestem\nlindzon\nstrumble\nabss\naudiovox\nshaldon\nflyingbolt\nbyman\ncleri\nruggiano\nwimble\ndurdin\ntaleju\nkunie\nearthjustice\ndeloss\nolazabal\nnrityagram\nstandbys\nnilar\ncrotches\nspeedbird\npettinato\nvermiglio\nivas\nkfsn\nzarganar\nmejorada\ndeghayes\nmontipora\ndornum\ncorer\nplacket\npirnie\nattatched\nrenacer\nsolio\nchandola\nbangour\nscsl\nblighting\nnostos\nstandwithus\nkhuong\ncoulais\nfullman\ngoldenhill\nleisured\naglukkaq\nsupercenters\nalpro\nbinetti\ngudger\ngomulka\nincise\nlitan\nriepe\nguara\nargi\njehl\nniederauer\nararoa\ncens\nbaleh\nmisko\npodgorski\nkalyn\nbettcher\nhaggadot\ninfills\nborgstrom\nskarz\ntunne\nvnr\ntigecycline\npunke\nrmh\nhorrify\nasyl\npaudge\ncedex\nmegill\nmunitis\nantczak\nbreinigsville\nnigo\njameses\nslewed\nverbale\nbanyak\nconduced\nsheild\nkatangan\nstrewed\nprepublication\nfluoropolymer\n
nationalistically\nsedco\nsinabung\nmackanin\ncraciun\nfuturologist\npreloading\nlazarevich\nderafsh\nyonggang\ntruants\nberretta\nthawte\nenayat\nadamos\nsodrel\nquilling\ndaohugou\nmonochromes\ngemelos\ncomar\ntannock\npaname\nmalthusianism\nwhre\ncarretta\nsabaratnam\nbines\nlomasney\nartemide\nalthingi\ngalster\nribboned\nmatrouh\nenqelab\nlligwy\nabdillahi\nventrone\njawas\nlightbourn\ncrocuses\neyeline\ngoodings\nbenzel\npirone\nhiemstra\nbantamweights\nmeshack\ntrackballs\nvivancos\njudiciária\ngavarni\ncovic\ncolace\nchulmleigh\nwontons\ncantinas\nirfon\nmeeds\npanoramica\nyuans\nwenling\nstanikzai\ngrieder\nstantec\nfrautschi\ndisplaysearch\nbackfilling\nwhitnash\nbcar\ncordiner\natomico\ntortoni\nitel\narouna\nsanteiro\ngantlet\njackup\nteklehaimanot\nllullaillaco\nkotex\nzhengsheng\nraue\nprimarolo\nfims\nlesniewski\nredelfs\nhunanese\nfransman\nfirethorn\noccludes\neggold\nferrisburgh\nfarewelled\navrakotos\nmunsingwear\npinol\ntuakau\nwonjongkam\nleilei\nlihui\ncostard\ndaosheng\nequiped\ncheryomushki\negglescliffe\nhaverkamp\nesmo\nneora\nchheda\ncoachways\npistoleros\nevolvable\nbaissac\nashdon\nconcieved\nprayerfully\nproctitis\ndipali\ngitex\naverette\nsuborned\nniermann\nktva\nharrovian\nkamgar\nhendershott\nemminger\njurewicz\nbicmos\nprudishness\nplaschke\ngbeho\npizzaexpress\ncastigation\nsuppertime\nmomm\navoch\nteklogix\nunversed\nmorghab\nfeatherstonehaugh\nwflz\nmisperceived\neuribor\npaillé\nstraightfoward\nkliman\nflum\nverruca\ncfcm\nbranchflower\nvanmeter\nlivadi\ndequan\nvicuna\nmontsho\norlev\nhokanson\nmcleans\nmarun\nmorrisroe\nhabayeb\ncromwells\noverbooked\nmordue\nsonoyta\ngriller\nbotkins\nweatherproofing\npulmonologist\nghlas\nxiaoning\ndeossie\npalada\nmatmour\nbankhaus\nupcycling\nbaalen\ncrimefighters\nlimpert\npendre\ndusenbery\nxfactor\nosin\ntelkiyski\nwebberville\nrscn\nskouris\nhenrard\nlubinski\nmafiosa\nlardo\njakobshalle\npagliero\neurosatory\ncromac\ncondori\nslobbering\ngoobers\neaglesfield\niccn\nfinardi\nlaurien\n
kassan\ntssa\ndolinka\nhothouses\nprive\ngolovina\nnykl\nangiographic\nvandeventer\nflyman\nellenwood\nmydin\npullicino\ntamely\npappin\nbroadmarsh\ngaido\ntervo\nkwakye\ndacourt\nraichle\nkarnstein\nbaghouse\nleage\nlonan\nsaulsberry\nfunmilayo\nmayhall\ncepacia\nhardbacks\nsomehwere\nfundable\nglargine\nbowmont\nsanyukta\nkroese\nstickup\nmeadwestvaco\naurthur\nkalynychenko\nbiggart\nbrundall\nsalicaria\nliberationist\nmccormac\ndelicia\nhamadou\ntiken\nupchuck\nphotobooth\nsalomonsson\nbellany\ndehere\nmardell\nklecker\nstuke\ncarsons\nnierop\nflourescent\nceinwen\narmonía\nfodors\nfuggers\nufood\nankhesenamun\ncukurs\nwitthaus\nsanchia\nabdulahi\nczs\nctirad\npother\nfatted\nxlf\nspaihts\nsportsbooks\nteched\nhatchway\nderriford\ncrumples\nswitz\nloudmouthed\nquorra\nidoc\ndonsol\nmalignity\nlookaround\nsalmeterol\npiebalgs\nmosha\nanchin\ndutrow\nglenesk\nkuliyapitiya\nplayon\nbohmer\nslad\nstahnke\nstajan\nsecretiveness\nbhopa\ngaebelein\nbodansky\nlovre\npentamidine\nfealy\nperceptiveness\naifm\ntogger\nwaterfire\nkpsi\nchediak\nmuresan\nfarino\nsubclassified\nmozelle\nnardella\naxr\nexageration\nsuperfluity\nlayups\njuicebox\nburway\ncursi\ntoyboy\ninfrabel\nalysa\ncandaele\ndisprovable\nbakkies\ncriminalist\njanuszczak\nthirsts\nplainspoken\ncipollone\nbricktop\ngertjan\nbirsel\nlawsons\ndewars\nconcil\nheynes\nmarsella\nwitzig\nriblon\njacó\nsauerwein\nphantasmagoric\naibel\noutlandishly\nsdsr\nkhutsishvili\ndettelbach\nworldwideweb\nfollain\natj\nhumpin\nfreudians\npillowcase\ncapanne\nyongjing\netlinger\nlistserve\ngresser\nhahahahaha\ntompsett\nsiyaj\nloffredo\nriesenrad\nneroni\nramaiya\nsnoozer\ntewin\noxazolidinone\nyuquot\nrusnano\nlitto\nberain\nlouisans\nalcine\nknook\nmothertongue\nscarcella\nagaba\nmarshon\nventolin\nrimantadine\nstorkey\nealier\nwartell\npolmadie\nbracadale\nkhalden\ndownings\nsmaak\nvinclozolin\ngollnisch\nspitalny\nbradon\nhughs\nnetafim\ntynda\nhinet\nbrickmakers\nthurtell\nfulp\nvermeiren\ntelemadrid\ndespond\nlfls\ntacher\
nproceding\nlemat\ngulet\ngilf\ncyanidation\nbiopesticides\nrecoupment\npaszkowski\nkinmundy\nsubcomponent\naskaig\nabanto\nphantasmagorical\nbudging\ndanais\nbartin\ntaqaddum\nluek\nsiemion\nlupini\nepidavros\nregeneron\nkinakh\natzori\nviler\nmccarthyist\nlepchas\nkhelifa\nmylroie\nfennessey\nmaegan\nmangabeys\nchurchy\ndamji\nunmerciful\nrosan\nbenmoussa\norganizacion\nchinking\njalula\nparrinello\ntyrolia\nmikaele\nbankasi\nnishita\nkohring\nmclernon\nsinaa\nvitelloni\nxposure\nmarinis\nniumatalolo\nnyjer\nappli\nleysdown\nwibbly\nreveller\nkupets\nhispanico\nmbyte\nwalorski\nkeirrison\nmickler\nwalhain\nmultitracking\nenobarbus\nsivok\nlivvy\ncrainey\ncepsa\nnieu\nflins\ntuberose\nsandback\nramseys\nbasran\nberthaud\nsukhdeo\nleonberger\ncambron\nzelizer\nerke\nvonitsa\ndharmatma\nkuljit\nstehlik\nhaberle\ncaftan\nmemin\nyudof\narrivabene\nwaxx\nkrulwich\nnunatsiaq\nspeigel\nyambio\nsubsidary\nmaralyn\ntunley\nléoville\nvrolijk\nmtkvari\nfolkie\nroughley\nbelittlement\nparakhouski\npostpunk\ndetailer\nconvecting\npectins\nstanberry\nbellingrath\nrolm\ndvoretzky\nkkp\npilfers\ncyberworld\nwanniarachchi\ngeidt\nmanninger\neirias\nbanglalink\nvitiate\natones\nzygi\novermach\ncnsl\nnewgrass\nbettridge\ndetsky\npeprah\nentacapone\nesmee\nlapides\nunsought\nvisiters\noriginations\nbanse\nsedbury\nandreis\ntanjug\nvpls\nrajula\nguilden\nemboldens\nphalcon\nduick\nwaretown\nlampell\ntibouchina\nalando\nrocksprings\nsquadmate\nunenriched\nnurofen\nlazarist\nhomewards\nzahiruddin\nunexcelled\nfirestein\nromancer\nchodos\nlandsvirkjun\nimportunate\nwiedemeijer\nitchenor\nesrf\nfeniger\nkurosh\nengo\nmouhamed\ngruyere\ncattan\ndiscoursed\ncherilyn\ndiabetology\nwahm\nfreerunner\ngoza\ncicciolina\nenckelman\ngalluccio\njunshi\ntrulia\njarmon\nbulkiness\njutge\ngrigoropoulos\ntravailler\nstallingborough\nshaff\nargar\nmths\ngoromonzi\nclarksons\negee\nlindrick\ngusi\nglanfield\nsington\nlollis\nlemierre\nviterra\nhypophosphatemia\nsuhey\nmassei\namenability\nbratby\nguidance
s\ncccl\nboehler\npadmasree\nledcor\nflexographic\nozersky\nciric\nillinoisan\ntandra\nglafcos\ncoppen\npeellaert\npolycarbonates\nvidrine\nagui\neriks\ncarlomagno\nfetishized\nmeaninglessly\nnurenberg\nsubsidises\nebchester\njcrc\nnonwovens\norshansky\nwoodentop\ntrubee\nhipple\nmanetta\ndubuis\nkeysar\ntsoukalas\ngiin\nspreadshirt\nhiort\npogorzelski\nwienerschnitzel\ncaujolle\nlandsharks\nbinladin\nbedient\nsccm\nastles\nbses\ndiverticular\nokochi\nzelazo\ncrowton\nbiebl\ndibben\nhuchet\ncotner\njaslene\nhilbig\nschoenberger\nsouthpointe\nahg\nnotepaper\nised\ncrossmember\ntregua\nranchito\ndraughty\nsimbin\nbrinnington\nsamie\ncollinet\nnekoosa\nqingnian\noffermann\nwelney\nvilnai\nnuzzo\nafspa\npfann\nyaqin\ncinderellas\nanyother\nhecking\nkidepo\nshoeshiner\neminger\nijustine\nmitchener\nlassy\nfingersmith\nsidestream\nfullfilled\nunenumerated\ntriaged\nplov\nishaya\nmotty\nokaro\nayodeji\nbohon\nkulay\nunilateralis\nkeauhou\naernout\ntecia\ndeandrea\nmuscatatuck\noutruns\nbenat\nwannabees\nwentloog\nengquist\nbantz\npassacaille\nreferrers\nquislings\nindestructibility\ngroman\npacult\nhobaugh\ndoocy\nantik\nlepik\nebrima\nbørs\nwasafiri\nmenstruate\nburgett\ntitanum\nhazelett\nstobbe\nmoorjani\nsacriston\nmmpa\nstolar\nsouthbrook\ncoffeen\nsabih\nhoei\nbluteau\nmillennialist\nblindfolding\nkandak\nyanomamo\nzew\nmceneaney\nbackheel\ntunesmith\ndemoralise\nnunnelee\nshrimali\ndesicion\nrheta\ntawaf\nmilty\npister\nmccobb\nalbarello\nganor\nclopper\nmidttun\ngoetschius\nfrazz\ncrackington\nzarang\nlistach\nstopsley\nportabella\nbiked\nburriss\nfilkin\nsalukvadze\npivar\nuraba\nmontemezzi\nkownacki\ngimi\naberdulais\nmeganeura\ndeincourt\nophthalmol\nsusantha\nposnett\nmbete\nnieland\npicornavirus\njelimo\npetplan\ngunrunning\nyumei\nrapo\nspaggiari\nhmda\nmaroilles\nveigar\nfenomeno\njumbling\ndrollery\ngunvalson\nweste\nreefed\nunboxing\nzerner\nahrendt\nrizzini\nglamorama\npotlucks\nsterilise\ncassington\ngirgaum\nrushydro\nadni\nmylink\nmasayasu\nkimbel\nbon
ami\nqiqi\npinprick\nstencilling\ngrillon\nsilversea\ntransmogrified\ntalea\nmeuli\neqo\nktuu\ndentice\nmackereth\nattaullah\ntaphouse\npeneda\nintertrust\nsarpaneva\nwaight\nneuroinflammation\nfujiyoshida\nfettuccine\nrockrose\nochlocracy\nspaceway\naxium\nundependable\ndongdan\ncallaspo\nmacp\nfdcpa\nflinton\nmasciarelli\nmannatech\nforsworn\ninlcude\nyeakel\nexulted\nméribel\nforbad\nqayoom\nomniture\nunhallowed\nberish\nbluetick\nvislab\nprediabetes\nearll\nloratadine\necuavisa\nbillström\nizenour\nileto\ncaravansary\nkarara\npida\nbellgrove\nhackbridge\nmarse\nboizot\npetrak\nkashechkin\nkiknadze\nromeus\nscherbatsky\nelorriaga\nmotiva\nmurum\nammari\nlampitt\ncurth\nstakhanovite\nbuchloh\nhamberger\nkarnazes\nearthrace\nlustenberger\nnehruvian\nzarema\nasira\nvengence\nsupplicated\nsibierski\ntorrone\ntranquilo\nscissoring\ndanged\ndesbarres\ndeinstitutionalisation\nnafo\ntreelike\nbungs\nmerchandizing\nbishen\nduffell\nradica\nrashidiya\nscelfo\ntaty\nvolnay\nleerhsen\nbushbabies\nagianst\ninherant\nsanso\nlaplaca\njozami\ninsolvencies\nmillisle\nderouen\nirania\nthundershowers\nchapmanville\nhobnail\nazobenzene\nbeedie\nsenzo\npaavola\nbizar\nswelter\ngillean\nsaucony\ncorrupter\ncrones\nmurgh\nburdin\nsoueid\neena\nkingsbrook\ninlaws\nnmwa\nhenrit\nidamante\nneurontin\nnakanai\nnurek\nsemprini\nrully\nbarsby\nbofa\nbrakni\npkl\nmanumaleuna\nfasolt\nsferra\ndeferoxamine\nrakib\nhogstrom\ntesar\nanaly\nignitable\ndugoni\nnewshounds\nrhy\nstupka\ngroeninge\nscarnati\nvenla\ncromley\nkpcs\nofferee\nabergwyngregyn\nfredrikson\nangoff\ndowntowner\nkhemka\nverla\nchampi\nalakai\nthamir\nfrikkie\nxiantao\ndidgeridoos\ncantillana\nintertan\nflng\nhedqvist\ncalahan\nmcgleish\nlauberhorn\nlygo\nberlian\npessaries\nosteo\ngoldstick\nkvadrat\nkoska\ndreyfusards\nxiaobin\nmeisha\nvarini\nsilovs\nbetanews\nplaceres\nmsee\nscandanavia\nheurelho\naafl\norrorin\nextrudes\ngangmasters\nalticor\nardersier\narrestors\nrundall\nflipsyde\namport\namézaga\nbenfey\nnombreux\nchagos
sian\nsecound\nbrodies\nscriber\nnackt\ndiscolorations\nplaytv\nromgaz\nunscriptural\nmundella\nvideophones\ndjilali\nbodnia\nescabeche\nfrita\ndzhezkazgan\nmatopos\ncarbona\ndihua\nhalkidiki\ndishwater\nllanhilleth\ntropos\ncruchaga\nphmsa\nattieh\npiccolino\nnissel\ncnss\nsebou\nmelder\nflamanville\nbaky\nkuroyanagi\nshirra\nscougall\njomaa\nilw\nbluck\nplymouths\nmericle\ncreamier\nzili\nkosek\nyira\ndodoo\ngrahl\nloynaz\npostie\ntigertail\nlerg\nkyrsten\neapen\nawuah\nuweinat\nwackers\nnoella\ngraae\ndshs\nswets\nmccolm\nbeechwoods\nephemerality\nbeltre\nmurcielago\njerpoint\nconcordville\nullas\nvulpine\nfmoc\nlhomme\nxeros\ncinéaste\nleflunomide\nelectronvolts\ningoldmells\nklaviermusik\ncoucke\nimpera\nadministrational\nhopera\nunpick\nromac\nneeed\nbppv\nbertoglio\ncommotio\nsprotbrough\nopap\nduntisbourne\nmeddeb\ntrmm\ndweebs\ngraiguenamanagh\npocketpc\ngiselda\nsadun\nbirthstone\ncristman\nuamh\ngogledd\npidie\ndiverticulosis\ncolonialization\nwtg\nhardrict\nmacrobiotics\njaymie\nestrich\naonbs\nweyanoke\nhomen\ncreepier\nwickers\narandora\nobg\nraring\nmistretta\nunsuk\nhardhat\nquernmore\nswiveled\nsimat\nlimy\npennsburg\nbedout\nareta\nbridgework\ntidswell\nlearmont\nerechtheion\nyellowlees\npowdering\nsystemax\noheka\nweatherwise\nwhiteladies\nindecorous\nvillita\npleau\nmascott\nfttx\npintsch\nharff\nodoms\nlarpent\nlacava\nlisha\nvampy\narmamentarium\nlitchurch\nouwehand\nyerli\nmaricela\nobduracy\nwalloped\nbutylene\nwineman\ncardenden\necmm\nfisi\nbricking\nbinaisa\ncber\njacquizz\nmatkin\nhotnews\norchardson\nmealor\nerot\ndrambuie\ndillistone\nunawatuna\nsunfeast\npammy\ncarbaryl\nreesing\nglyncorrwg\nhahah\nbicks\nupbraids\ncassutt\nmelanocarpa\npiepoli\ndefaria\ncomplementarities\nfreewheels\nhousemistress\nmusaid\ntaleban\ndunkery\nepifani\nkfdm\nmiljkovic\nsmartmatic\nbrohan\nsalvagers\nupsy\nnewmore\nupperthorpe\nunthreatening\nmagnetars\ncarletto\nscuppers\nkuenssberg\ncolorizing\nreciever\nunsual\nguilded\neasygroup\nmentougou\nmusn\nbind
y\nschudson\nundiscriminating\nsérieux\nzaslav\nvevers\nmolate\npossilpark\nagritubel\nfoxen\nalra\narfi\nkingskerswell\nposnanski\npenneys\nreligulous\nfaccini\nkyei\nnorthchurch\ntriet\nsaïfi\nokolski\ninfocision\nrecliners\nfludarabine\ndrooker\nradfan\namoore\ncalliste\nmonomaniacal\ncarcinoembryonic\nravat\nnordnorge\nbockman\nstraighforward\nharsent\nminimoys\nuhrich\ncheezy\nmalbin\npdma\nlockette\nmerchantable\nmoonis\nwijesuriya\nnonclinical\nlindemulder\nbreitbach\nlanoue\nfretful\ndazhong\nnarcy\nmaharajahs\nidolises\ncloudier\nhandcrafting\nmaizels\nandersdotter\ncrisped\nsiasia\nmaimun\nqitaihe\nsoulman\nfreudenberger\nlovesickness\nreinvestigate\nblabla\nwrose\nsmokovec\ngiddily\nbloustein\nqxl\ngaulin\narduously\nsible\ngrafters\ncropduster\nmeghraj\neyrich\nnasba\nabdella\ncardonnel\nbenchetrit\nimja\navrig\nlibous\nhbx\nstief\ntegs\nundomesticated\nlatzke\nfrats\ncrossharbour\ndragila\ningratiation\nsellards\ntalukder\nlainer\ndechristopher\nhohlraum\ngurn\nheisley\nlausen\ncbpp\npanders\nperceptively\nandf\nmastorakis\ndecentralise\nhoper\nlizano\nchantels\ngreysteel\nunelectable\nbraafheid\nflinched\nwoodbrooke\nmerrall\nhotplate\nmatana\nraymar\nwhelps\nabdun\nverita\nborjan\naftv\nslogged\nmycotic\nlewises\nrajiva\nunexampled\nzabavnik\ntsay\nmuhiddin\nendcliffe\nwoodlee\nomoro\nformalwear\nhallenbeck\nmorquio\nunsustainably\nsigificant\ncarbonless\nbandoliers\nhichilema\nreinaugurated\nindispensability\nswordfight\nrohlman\nspillett\nehsanullah\nstalins\nascendent\nneisha\nbenja\nmorri\ngrowed\nbeus\nwegelin\nasby\nkghm\noutshooting\nassadourian\nmenzie\nfmrp\ndavia\nkerly\nanete\ndichterliebe\npinny\nviers\nsemisi\nracketball\nslonem\ndecriminalising\nkaituma\nxau\nstamaty\nprivalova\ntupungato\nwoelfel\nwoodchipper\nnofziger\nmitsch\npromenading\nlijie\nstealey\nberlant\ncubbie\nmirrione\naniko\nkandara\nefax\ntouqan\nunselfconscious\nmumsnet\nimpugns\ngebremariam\nbrosses\ndiggnation\ndabizas\nabbington\nventuresome\nkarbaschi\nandruzzi\nmai
olo\nkolonaki\nguardamar\ndefendent\ndiestel\nceftazidime\nepichlorohydrin\nheggessey\ngolfs\ncocotte\nbrynglas\ndontrell\nsmartway\noler\ndelfeayo\nnoki\nstarband\noudsema\nperphenazine\nbloodwork\nclalit\nmerki\nmaust\noyewole\nfaac\nisik\nlungin\nunpointed\nromao\nsanie\nhewit\nkalandar\nsibillini\nxianghe\nbuffin\nnjr\nfracci\namcom\nyatesbury\ndicorcia\nannabell\nnumenta\npuriton\nentrepeneur\nannys\nboustani\naquil\narhuaco\natmosfera\nkaratz\ncplex\nnasimov\nbattlestations\nbunuel\nclemont\nsuperbugs\nhoury\ndangeard\ndhaenens\nsakhee\nendocasts\ngaung\ndoublestar\nelaph\nlagrotta\njunan\npriess\neftekhari\nynglings\nbienniale\ndimitrovski\nexito\ncracklings\nrssb\nharouna\nmisjudges\nwltx\nhuixquilucan\nneverthless\nmclarney\nghanaba\ndomenik\njoza\ntharaud\nyongming\ntravilla\niamgold\nomalizumab\ntommaseo\ncagna\nlagutin\nrockfeller\noverbalance\nwoolies\nbyrn\ncarryovers\narrison\nkiberd\npiria\nmakukula\nmoubayed\nradislav\nhelmreich\nvolksdorf\ntolk\nforcados\ntombazis\nteigan\nkuperus\nnjabulo\nbalzano\nkilmeade\nbaswedan\nhardart\ncoloreds\ntrecartin\nmorrel\ndepreciates\nneilly\nhoeller\nhaffar\nlonn\nchyron\napostolopoulos\nspates\nkasler\nimbecilic\nhillshire\nheinig\nkriegspiel\nbertilsson\nqmg\nzatoka\nsanei\nstainfield\nabdulmalik\nadlestrop\nmaktoob\nwindes\ndtsc\nognibene\ngharavi\ndamasceno\ntiddly\nhemanshu\nfelgenhauer\nshinga\ncendrawasih\nalfege\npluk\nathletissima\nrbbp\namericanisation\nkolhatkar\npolypody\niddi\nbehcet\nroosen\nyaha\nspitler\nbakhtar\nradatz\nbenucci\nsumei\njerusalemites\nszymany\nbollini\nbrunstad\ndormered\nadvancer\nhandwave\nkathia\ntokhi\nbiznesu\nloosey\nlinster\ninva\nendometrioid\ntoshikatsu\nmaama\nfinton\nanisette\nprignano\ntubeworm\nnaseema\ntangelo\npolich\nsaimir\nbamburi\ntanuj\nshatti\nleuthold\nriddiford\ncfsa\nmotoman\nmuick\niwarp\nvestri\ncornermen\npanier\ncontango\ntuckson\nvanke\noptout\nfawell\nivinskaya\nlisin\nswip\nshakara\nlugny\nwdel\nouroussoff\ncuit\nkoharski\nthorneloe\nwynder\nallgeier
\nsirous\ndoralee\nahlqvist\nredbay\niframes\ndormon\ndruidical\nasbc\nbaechle\ntahmasebi\ninmigrantes\nschalow\nsahnoun\ninveighed\ngordeev\ndessens\ngarrotte\nllansawel\ncomposts\ndubowski\nswimme\nromei\nviveur\noverthink\ntrisko\nposset\nnebs\nmeskel\nhansal\nklumpp\ntuigamala\nindrawati\nmayenburg\nunsuitably\nbharrat\nlownds\nlection\nmangyongdae\nellerbeck\nbenificial\ngolaz\nimpostures\nlaphroaig\nkymco\nwooo\nlorick\nglasslands\nfischbeck\nglaz\nbisciotti\naqc\nvadzim\ngiersch\nwatoto\nprosport\nthamara\nheroe\nrentokil\nlinthwaite\nproia\ncreich\njenine\nlisetta\naustwick\nagliotti\nnumerologist\nguilbaud\nscinto\noliveria\ndisinvited\ndefamations\nfidanza\nseedcamp\nniaf\naqis\nseear\nkpcb\nhile\nleadenham\nzyprexa\nmlstp\nwessing\nbuscombe\nbubbe\nsteepened\nlartey\nbensel\nanglophilia\nswoose\ngnosall\nbelabored\nlidholm\nholik\ntased\nsurfleet\nkilminster\nseamie\nsvete\nmygatt\nkristjansson\nmccallin\nspoonfuls\nzetlin\ntritiated\nbullimore\nmcauslan\nmicroliter\ncomunn\ngayford\ntellabs\nfourchon\nwcfc\nkoed\nbranchless\narrogate\nhalfcourt\nburningham\nbubley\nnagre\ncotingas\nautogrill\ngoenawan\npariseau\ninfinium\ndecrescendo\nbecketts\nmandarine\ngershenfeld\nbrambleton\nremploy\nsymptomless\ngoulon\nstrupp\nsigurimi\nworters\nwesthuyzen\nborm\nartemision\nnaoshima\ndrily\nkrayzelburg\nlivnat\nfgt\nasne\nokagbare\nvoh\ncarlyn\npreval\nhorth\nlessman\npieux\ndissapointing\nnosaj\neardisley\npresland\nsafmarine\npreservations\nlamari\nrscc\nhilu\narnouville\nkestelman\nzour\ndybek\nultracapacitors\nmuirton\nhenchoz\nvyke\nveton\nbracers\nwellham\nfloto\nwaggener\npapathanasiou\naubie\nhaubold\nplurk\nnabj\nbagneris\nmatzoh\nwadie\ncuonzo\nmetabolizers\nkingsteignton\ncesspits\npiraro\ncowcaddens\nvillis\nmossville\nnatzler\ncoupa\nokung\nunfortunetly\nclimbable\njulienned\ngoodhearted\nsickroom\nbosler\nshevelove\nogri\ncepal\nissacs\nduken\nomeath\nappuldurcombe\nkeukenhof\nsamways\nnavolato\nmarown\ntrokosi\nhenok\nrhamnosus\nswk\nshamley\nimia\
ncristales\nswakop\ndrumbeg\nbarnstormed\nsolecism\nticos\nunrefuted\nsalzano\nbitts\nengelland\nkatumbi\nkendu\nwhitebrook\nmepi\ndastgir\nunoriginality\nchastized\nunconcern\nchesbrough\nnuttgens\nartemether\nglaudini\neconomides\nkavafian\nhadan\npuggle\nbrechfa\nshiliang\nqasam\nertürk\njerba\ndeso\naogo\nvisse\nenglehardt\nzohur\nsabogal\nhoofers\ncssf\nwhitebox\ntodrick\nprotoplanets\nhallsworth\nafshari\nelpc\nkorshak\ninnumeracy\nsybarite\nshuming\ngotzon\nyemini\nkorans\nbrattbakk\narod\nparhat\npécresse\nmallis\nredraws\npanyarachun\nlbma\nstylisation\nblaisdon\nekachai\nwkts\nwestbrooks\npickney\nmolouk\ngeanakoplos\nhenrichs\nstopwatches\nkievskaya\nautoworkers\nhmw\nnlv\nhaugli\ndirigo\nkirtling\nstemware\njamari\nmouhot\ngiambra\npreesall\nmallesons\nsylvaine\nunnoted\ntravelog\nblakenhall\natome\naudebert\nungenerous\npoquito\ncriscuolo\nlandham\nmizeur\nmowen\nchagra\ncanazei\nsheilas\nshahade\ncurvis\nfeloniously\nflopper\nglassworkers\nkerruish\nhergott\nwhitla\nfoodgrains\nyasutake\nmerkland\nvermejo\nwolfendale\nlatkes\nexcrescences\ntonita\ntogadia\nzubaidah\nmcverry\nwwoz\ndiginotar\ngrudziadz\nebron\nliyana\nqualys\nunfound\nsesler\nshembe\nquanxing\namoungst\neigeman\ntoolan\nmändoon\njurf\nbearfoot\npolfer\nsvae\nwastwater\nslipstreaming\nunderminer\ncarcassone\nokuonghae\negglestone\npropellent\nembolisms\ndyc\ntemascaltepec\nunstudio\npbde\nlulea\nchippers\nbridcutt\nbuerge\nrayonier\nmogel\nusao\njobard\nhierachy\nnapoleoni\nuncooled\napplebroog\nuninstallation\ntarator\nnalen\nrootlessness\nperrottet\ndespatie\nolando\nligthart\nopenbook\nkingmambo\nfrewsburg\nabbatoir\nyanqui\nloisaida\nloescher\nmaffi\nhoever\nsurete\nmsss\nferroalloys\nhydroacoustic\nsantner\nkerlikowske\nglauser\nbeepers\nwivern\ncyark\nkoprulu\nhypotrichosis\nhumphery\ngalella\ncoproducer\nmoqbel\nkeypoint\nneckband\nbruckhaus\nonne\nmiddlegate\nvulgarian\ncibula\nsmolen\nbafflingly\nholonyak\noverstress\nbanche\nteet\nbraveness\nflorale\nchieftaincies\nraafat\nbusc
ar\nkarcz\nelfs\nroustan\nshelfari\ninisheer\npultar\ncorbelli\npentel\nsandeen\ntatou\njajah\nmeiselman\narachnological\nbires\nalbuterol\nclarance\nkoepke\ndemeny\nhradecky\nbphil\nsmokescreens\ngritted\nmagreb\ngriesel\nteitelman\ncadabby\ncaulked\nmarianella\nkarpa\nnesconset\nexoplanetary\njiroux\ncrantock\nsayah\npernoud\nverástegui\nerker\nhayduke\nphillipstown\nmicrocapsules\nnovatek\nscifres\nvaleron\ntalvivaara\nquirini\nchiappetta\ngurría\nmozartiana\ngeosmin\neidelberg\nkaavya\nospringe\nnewfields\nverstraeten\nkorson\nruam\ntuebrook\nnanjemoy\nsinnamary\nschneer\nangiolillo\nshahinian\nensdorf\njanota\nhoobler\nprolongations\ngvl\nbrandweek\nshariq\nwachtell\nmayda\ncresent\ncazayoux\ncarboxykinase\nyamase\nbmy\npontac\nvenas\naudium\nreplanning\ngalleywood\npolyhedrons\ntristars\npageau\ngyt\nwilfley\ndaveyton\nciga\nlongone\nbogomir\ntérminos\ndeskford\npiii\nsplashtown\nmicrophotography\nmarrella\nyundi\nimane\nmspa\nravachol\nafor\nbabatundé\ntaysir\npreliterate\njuleps\naora\nkislyak\ntreet\nsteines\nmarzelline\ngardam\nmtcr\nconagua\nniblick\neumc\ncytosines\npcba\nneelan\nangeloni\ngrio\nnotus\nyigo\njantjes\ngeale\nicesat\nopentravel\notsemobor\ntahseen\nminara\nelokobi\nklesla\nmanqué\ncirrhotic\nnaguilian\nbowhunting\nhodsden\npattin\ntweeny\nrixi\nbiver\nsymond\ngodec\nbudgens\ncelac\nschabir\njafarzadeh\nknowlegde\ncivvy\nmetzker\nrondot\nmilna\nvulcanism\negnos\numbi\ntajrish\nseismograms\nghm\ngiostra\nsantalla\nfhsu\nmarijane\nolimpio\ndonnez\nunrequested\nhalbreich\nrakytskiy\ngodmanis\ninterring\nmoonbat\nknechtges\nhbss\ncuddled\ncptp\nrudes\nrcz\nfumarase\nbankboston\ndavutoglu\nwayda\nreddan\nleedstown\nngandu\nhudna\nbeeban\nmaarek\ndewen\nsystemes\ndawkes\nrinca\nlynfield\nfolino\nkarpets\ndanita\ncarnality\nthunderclouds\nmecanoo\nmidmorning\njiggetts\nmanahi\nchupi\narbin\nvean\nutecht\nhottelet\ndoagh\nglobalgiving\nwilkomirski\nkalami\nzvimba\nmesones\nlegras\nogonowski\nduking\nladnier\nmoqed\nymm\ntolentine\nubh\neuropeanized
\nhargens\npesic\nchouest\nspitzberg\nbrangelina\nosteopontin\nsistrunk\ndruker\njamesian\nbreder\nroseola\nhamze\nrockoff\nviggiano\nrinspeed\nmither\ngeodis\nrouzi\nzaytun\nantithyroid\ncibulka\nkannemeyer\nregardles\ndisengenuous\nsuffian\ntranslunar\ntchadensis\nlynemouth\nosnabrueck\nhickersberger\nwymeswold\nncbc\nbday\nhaspel\nfoglights\nginia\npalmettos\nharto\nrangin\nfwm\ndhali\npatzer\nokutsu\nunstabilized\nallariz\ncnaa\nmandagi\ncoving\ngemco\nsemira\nllaima\nbluemner\nblai\ncuccioli\nojp\nvbied\npasd\njabil\nradipole\nviyella\nscrummaging\nbacik\nnexo\ncryoprotectants\narmathwaite\nintensions\ntzeitel\njiyao\nromey\ncrymlyn\nmanhas\ngaetjens\nmabruk\nirrevelant\nmolini\narec\nsaveable\nuscirf\ntingwall\nrespondants\njasjit\nfunspot\nbonnyman\ndependably\ncuecat\nsiheyuan\nyakubov\ntrybuna\nsuperfood\nwimedia\ncaramella\nfotu\nmakala\nkelsch\ncitycell\nswankie\nrepresentatively\npalmos\nawarta\ncannulae\nportee\ndpsg\nscheen\nraziya\ntepecik\nzhari\nwhiskery\nstiperstones\noever\ndeskside\nmawae\ntenne\nnres\nadminstration\naava\nunedifying\ntrieb\nalveley\nyerofeyev\nkaktus\nkotagede\nfreeheld\ncovais\nveis\nsteinlager\ntepperman\nburnetii\naustyn\nmornhinweg\nagaine\nmilgrim\nreponsible\nromona\nbaribeau\nfuzhong\nunalterably\nnordex\nhrabowski\nphap\nmallar\nisungset\nmoschitta\nstadelheim\nesthetician\nkhatuna\nwesleys\nherschler\ntsuzumi\nphilistinism\nkalmanovich\ntarina\nsurobi\nmolavi\nchoueiri\nstarsem\nhellebuyck\nlaane\noperastar\narianda\nbonati\nmithal\ncidi\nspeciﬁc\nemro\nrechristening\ncolemans\ntianlong\ndoggies\nforgie\nrealite\nthumbsucker\nsamii\nosthaus\nmeho\ncooman\nhumanise\ntacom\nfeczesin\njackbe\nruesch\ntennell\ndiaco\npadgate\nnuptse\nuon\nwalloping\nspro\nornamentally\nsunroofs\ncarsington\nsydneysiders\nasbos\nleney\nclifftops\nashara\ncleansings\nseiners\noverselling\nbutcheries\ntoscan\nlarm\nsongkok\nkelin\njarvinen\nlauzen\nimmobiliser\ncitius\nroell\nharia\nmorbegno\nholk\nellwanger\ngrayce\nbabyy\nkalpoe\nkosintseva\
nunaudited\ntrusov\nbahador\nfiremaster\nkreisleriana\ntsri\nelmgreen\narrrr\nrelationally\ncudillero\nmelika\nrzepka\ngastronomica\nsodis\npaygo\nzampino\ngromer\nredmoon\ntianhua\npurty\nbennachie\nlowish\nlootings\ntschetter\npunked\nmcconnon\ngeox\ngartin\nballymacarrett\nterrasar\nshehnaz\nschmier\njacomo\ncredos\ndodiya\nhirotoshi\nbachner\ntryton\nmaffey\nonora\nnewmills\nhidetora\ndppe\ntopware\nlandfilling\nigem\ncrerand\nternes\navilez\npetlin\nborse\nstoreng\nchacaltaya\nukra\ncordoning\nsurur\nabitbol\nwitholding\nlamsa\nkemmelberg\nionomer\ncyw\nguardhouses\nwheelspin\ngatecrashing\nrostad\nentwining\nwcrp\nfactfinding\ngepetto\nreforesting\nbraniel\nbroomes\nnazeem\ndumptruck\narthurdale\ndilators\nitzler\njulieanne\nunassimilated\nbutleigh\ncuzner\ngiggled\nabbou\nagronomical\nphilomene\nbonaiuti\nottavino\nmecir\nmohtarma\npiteous\naryn\ngallard\njundullah\ncleer\njavaone\nbundler\npyott\nreconstitutes\nribeye\nmojaddedi\nlopers\nseatbacks\ncomported\nvaporise\nloginova\namping\nteledensity\ndedinje\nboever\neigenberg\nzamolodchikova\neyadema\nratico\nfya\nalbarracin\nravasi\nmoosewood\nvetos\nfornarina\nsolazyme\nfearfulness\nneckarwestheim\nsedlak\nbriceno\nemmetts\neffluence\nmeneghini\nwawanesa\nwuterich\nclaggart\ncamalig\ncircumambulating\nmvovo\nchiselling\nhitlerite\nbuyung\nellinas\ngroomes\nnayim\ngearon\ninnocuously\ngluskin\nbrida\nmohamedi\nmewing\nretha\negames\nladdish\nrabina\nfookes\ndeader\nlauterstein\nthushara\nsonderkommandos\nperspicacious\nstempniak\nuud\neji\nglobex\nonofri\njuicier\nsebok\nyeild\nadul\nredspot\nwaymart\nkaczmarczyk\nnaquin\nwalkom\nnomansland\nvietjet\nverhelst\ncolworth\nsoder\nmaskawa\nhamstreet\nstruther\ngerontocracy\nliscomb\nunmoored\ntechnophobia\nckr\nmuckaty\npannus\npouty\nxylenes\nglading\ndreamboats\nedcs\nbudke\nbechis\ngrumpiness\nfadhl\njalon\nlabouisse\nkoperberg\ndrunker\nhigsons\nsentebale\nmyersville\nharvinder\npoppie\nphotojournalistic\npetrowski\nsailortown\ntaranath\ncinemagoers\nproch\
ncsfa\nunrefueled\nplek\ngrasslike\njezza\nunreflective\ncowey\nsutanto\nchlorpheniramine\nschilawski\nsentimentalists\nlahcen\ntroutt\ndighe\neleana\nquébecois\npolyphenolic\nbattleborn\nnseries\nvaill\nmeital\nsmud\nblet\nliaoshen\nfirbeck\neffectivly\nbarnehurst\nfrequenters\njishou\ncardiomyopathies\ngelashvili\nhosam\nwcrs\nrisebrough\nkitchenaid\nsucio\ncecilienhof\ndezenhall\notisfield\ntwante\nentraining\nedmeades\nolaves\namulo\njehle\nlinera\nwihtout\nlateiner\ncassen\natsi\nvaccum\nlucente\nthees\nvibrance\nerrm\nsallai\ndecontextualized\nrattlestick\nalgan\nblini\nrajnish\nfannon\nberzsenyi\ngoodsprings\nkwoh\njayes\nsavell\nantjie\nkajiya\nmelchiot\ntabane\ntankerness\nhirafu\ngammopathy\nabbadi\nbcca\nrotstein\nsmrekar\ntibberton\nfreid\ntophill\nnienhuis\noutdueled\nmislabelling\nbugaloos\nbigdog\narkadina\nkfoury\nrezidor\nwielun\nxiap\nderderian\nbayrakdarian\nsodomizing\nturetzky\nmclarnon\nsmallfilms\narcadis\ntejinder\nsljeme\noopsie\nshirting\nzaniness\nfilosa\nribamar\nmahtani\ngaulke\nwjet\nglenveagh\nodgen\nbrushland\nstancil\nherlinda\nsrecs\nmollinedo\nsyde\nmennenga\nplean\npompeiian\ncongresswomen\ndrawling\ncoppage\neakring\ntriallist\nemergences\nsonidos\ncasuistic\nameloblasts\nwritin\ntheoni\nhospita\nstranden\nposteriors\nrhinoviruses\nacquaints\nhoeflin\nhakel\nkilbrandon\nrudenstine\ngibbsboro\ngnossiennes\nguffman\nriskless\nuniprix\nzoubek\npreadolescent\nlewenstein\nsheely\nallaway\nlorried\nquraan\npreciseness\niglu\npreassigned\nceec\nannouced\nbouzas\nreplacer\nollas\ngouriet\nholdups\nadcenter\nmunchers\nbaharuddin\nwerburghs\nworrier\ndolomiten\noutplay\nehman\ncandys\ndirtiness\nelectricite\noshman\njiyoung\npolys\nvallini\nwhippersnapper\nswri\njoung\nshimadzu\nmcha\nnonfarm\nvakili\ndawr\nsubandrio\nveredus\nparticlar\nhamodia\nfriss\nheilpern\ntowan\nwanlip\naaia\navtomat\nuner\nostby\nultimas\nhisato\nbroadhalfpenny\nkissufim\nmulched\neffulgence\nsheltie\ngrdina\njosefowicz\neini\nrasmusson\napicomplexans\ngrouses\nce
saire\ndiseconomies\npollentier\nchurchfield\nbodha\nmendels\nyavar\nbrighteners\nkimlin\nrogliano\ndakich\nscorpios\nbiomanufacturing\nbackpass\nleonovich\nklunder\ninjaz\nroever\nfusionfall\npifs\nkimsooja\nfuniculì\nzock\nmendive\nmcgoff\nformisano\nemtricitabine\nliedekerke\nmelendi\npreppers\nstcs\nloughbrickland\nwerbach\nwaigel\ngameforge\nemmanual\ncusto\nmiit\ndomonique\nshockproof\nkhade\nparlak\nquarterlife\nluthi\nrumbler\nlivent\nparedones\ndentyne\nrohullah\neilbacher\nnakaji\nrestorable\nsafehaven\ngossypol\nkianna\nspilker\nadewole\nsaute\nswingley\nmarggraf\nbods\nbromage\nsuduva\nmedicom\nqayyarah\nangoras\nscoters\nfaleh\ncanizares\nnanoporous\nembalm\nccie\nlagendijk\nzoomerang\nzorman\npfenning\nmegadrive\nmisidentifies\nconcret\narieli\nperkinelmer\ncommericial\ncésars\nparanoic\nbolotowsky\ndutasteride\ncrocks\nbrooder\nvlasak\nchimi\nraunchier\nleparoux\nexternalize\nwagih\nrothiemurchus\noverbilling\nsmert\nchikezie\nzanno\ndemio\nshrewton\nparfois\nsoplica\nschlong\namokachi\ntinson\nsinochem\nschuetzen\ndunnit\noxenbury\nnorfolks\npsaki\nzukowsky\nasfordby\ntigertailz\ncoalter\nluncarty\nchhun\nstrutton\ndanladi\nlfd\nicor\nabiyev\npaschali\nripetta\ncameleers\ngitha\nauriel\ngrazeley\nforepaw\ncapucho\nkrauts\nestanguet\nturistas\ndilligent\nvivisectionist\nhiatal\nwessells\nradiantly\nbichsel\nknotek\nmetinvest\ncrill\nspeegle\nverkaik\nportimao\nneighing\nmulvenna\nsterilizer\ncoccolithophore\naccessit\ntomsula\nnorem\ngeothermally\nroizman\nassister\njader\nkrankies\nbikeable\ndatacentre\nedko\nazhdarchids\ncandiru\nmcnitt\ntourian\nmcus\nchildbirths\nljm\nsodding\nbravissimo\ngravitt\ndisrupters\nqingquan\ntoranzo\nduggleby\nlawd\nshootaround\nsecuritizations\nbunye\nmicrocephalic\nplods\ncoopersburg\nbabani\nsoundarajan\nantai\nthreee\nardler\nwcaa\nreice\nmultisectoral\nfandemonium\nlangwell\nguanghui\nharsco\nwogs\nkiffe\nmacgillycuddy\ntravelex\nlansdell\nyumashev\ntenterhooks\nsandjak\nwaide\nsaffer\ngaelan\ncodax\nkambanda\ngudi
na\ndhanoa\nhynie\nlaverick\nrisinghurst\nteya\nenquirers\nassuaging\nbeles\nfxcm\nfarenheit\nsigmatel\ntitrating\nmorganstern\nnutrisystem\nstreetman\ncastrates\ndasornis\nshreddies\nboyata\nfavila\nincra\nursodeoxycholic\ncandyfloss\nstelarc\nsouse\nrosabal\nkneelers\niwb\npresales\nabdessalam\nterrin\neasebourne\nsanctums\nstichill\nlechuza\nskaug\nertugrul\nhereunto\nprinstein\npatatas\nentezami\nblx\nricchetti\nmorral\nyorio\ntchoupitoulas\ngalenson\nsasscer\nmisappropriate\njosserand\nlachanze\neesh\noutsells\ncamuto\nkhee\nmardie\ngeralyn\nfinham\ngukurahundi\nbelters\nwkl\nnunchuck\nostapchuk\nsmtc\nunpeeled\ncipinang\ngibbsville\nheartsick\nnonbank\npauzé\nbuchel\nskiverton\nupbeats\nvacanti\nsquonk\nlochmaddy\nbannigan\nculliver\nkrummholz\ncontect\nleav\nneddick\ndashoguz\nneupert\nstartline\nfogey\nhawt\nniemeier\nsoliah\nanick\nbeckii\nkanze\nrepola\nerpen\nbiobehavioral\ntusayan\nsmos\npfennigs\nblackdog\nhiggens\nvilifies\nwaria\ntyskie\ndineley\nkarpe\nzhr\nchodo\nslps\nchaly\nswinoujscie\nswebus\npedersoli\nmischaracterisation\nhoofbeats\nkalva\ndoorley\nsellable\nletheringsett\nkonchak\nodelia\ngfo\nsnowshoers\nbraunohler\npudlo\nchianina\nharicot\nangotti\nprécieux\npretium\nheurich\ncullera\npasachoff\nkalamity\nblickensderfer\ndebmar\nleavel\nshmulik\nblumenkrantz\nbekah\nwinna\nslyne\ndecontaminating\nelazig\nului\nbasingwerk\nnilotinib\nsifo\nattaran\nlatently\nsajjadi\nshajara\ngeuss\nkotev\njensens\ndequeen\ncstb\ngcon\nburneside\nkhondji\nponderings\nnweke\nfreudianism\nbermond\nxuesen\nadere\nzahr\nharmonielehre\nkomano\ndirigisme\ntatp\neortc\nabscence\nkilcreggan\nbouziane\nlivingsocial\nshoichet\nanastrozole\nberntsson\nsuzane\ngoldmining\nsagalassos\nchombo\nsherries\nshiying\ntaloqan\ninalienability\nbvba\noesterle\nrefurbishes\nsupervisions\nardill\nhodeida\nvaltin\nelizabethae\nusweb\nlistlessness\nchrystia\nfluoranthene\nthunderhorse\nbrevin\nsilverbrook\nreconvenes\ntiffeny\ntardises\ninshaw\nbiocontainment\nlenglet\nmurambatsvina
\nslithered\nchophouse\nalongwith\nsirtuins\ntaubira\nshamos\nmultiair\ninturn\nbajic\nspyplane\nzawahri\nsrah\nphenylacetic\nvaujany\nchoubey\nliebau\nuhb\ncieslewicz\npathhead\ncreflo\nvidim\nrefolding\ngillmeister\ngalloppa\nkathrein\njayceon\ngoodyears\ngenyk\nojok\nroiled\nvitznau\ngalavision\nkalaa\njolan\nomantel\nfilife\ncotsen\nbertolli\ntisco\ngopac\npaddison\nknightstone\nmodernizer\nhybridising\nnonresponsive\nsoused\narduini\nkjellson\nqasba\nchows\nunoffensive\ndanys\nllaneras\nmydans\nrootedness\nprominance\nunlovable\ndympna\nhemington\nlofaro\nthierse\ncrabbed\npenuel\nmlat\nneurexin\npalethorpe\nhelaba\npalitz\nstoneworks\nfreret\noverlanding\nrbo\nzeqiri\nchiaverini\nexpiated\ngravenstein\naliko\nnavigon\ncarlee\ndroppable\nerotomania\nsmead\ntinapa\nbenway\nmorain\nbritsh\ngraumann\npfaltzgraff\ntabram\nmuzzleloader\nbridalveil\npazder\nbetye\nvicinanza\nantipoverty\ndoubletake\nrelators\ndallos\nastaldi\nghandy\ncammermeyer\nchapelhall\nshoshani\njewellry\nmisener\ngalvanization\nzawadi\nharano\ntomatin\nbuzau\nphonebooks\ntexels\narowanas\nskladany\noppegard\ndejun\nchesa\nannunciator\nbushwhack\nwerthein\nweasleys\nkilravock\ntaghavi\nhelmig\nhimss\ntoxemia\ngambar\ncurrell\nhanem\neldh\nbanyas\nulipristal\nguanhua\nmalampaya\nimlah\nengraulis\nunccd\ncarnon\nthiery\ndownlisted\nflaviviruses\ncelebrex\nalmaza\nspruell\nschoolyear\nartero\ndujana\njingmei\nimprovolympic\nvillines\ntreater\nblackle\nlavallade\nopsiphanes\nkaranovic\ndyskinesias\nalÿs\nchronister\nbetaworks\nsooni\nbrüssow\nkemsky\njudenplatz\nkdrv\nbuckrose\ngiorgios\nvswr\nstracathro\nmemi\nkinlochbervie\nbonakdar\nssees\nbrían\nkrajan\nguadelupe\ntitterington\noffenbacher\ncodevelopment\nemn\nquaalude\nidiotically\nstepfathers\nsteppingstone\nblindsiding\nzamore\nloteria\nkheil\nburnopfield\nuswitch\ndulan\nbelaunde\ngushee\nalotta\nindefinitly\nautodidacts\ngogglebox\nlagemann\nroyersford\nberends\nharreld\nbuehring\norganza\ntimis\ngarelli\ncurico\nlandshark\namigoni\nmidón\
ntrendline\necgs\nromell\nkennoway\nbebber\nlizi\nelisions\nshestack\nsarenne\ngeyt\nhesta\nwenguang\nranua\nbedeau\nvacansoleil\nnrtis\naffion\nganzer\ngendel\nopies\nwrenthorpe\nexis\nhure\nhakuhodo\nzulauf\ndevoutness\nrihanoff\njianmin\nhucles\nvinified\nauthier\nmythologised\nperiodontist\nnewchapel\nvelfrey\ntoosi\nkerkeling\nenkhbold\nsupo\nozploitation\nnoppadon\ngrea\njakim\nradiologically\nrehema\nyoigo\nladarius\nshamali\nlypiatt\npsychoeducational\npushup\nkerris\nbrotchie\nguffaw\namphioctopus\nhabbit\nteritory\nhermanis\nreardan\nfreestyled\nscarpitta\nkillea\nclerestories\nnonalignment\nworbarrow\nilluzzi\nbullman\nkavvadias\nguesstimates\nshantala\nshora\nbonfiglioli\ncendana\nbudington\ndafnis\nbwm\nenpi\najvide\npubens\nstrassberg\ntesich\nshawi\nhamied\nkco\nloserville\nuntrusting\nspedale\nmischaracterising\nglacken\nengano\nwristlet\ngaastra\ndrye\noyola\nrielle\nsafafa\ndykman\ndrdc\ncommmunity\nsusil\nkaprielian\nacet\nwowza\nskippering\nnonrefundable\nforfour\nubah\ncerenkov\ndoright\nwintergarten\nshakily\nreznick\ngarlow\nkutsher\nkrystkowiak\ndanjahandz\nbergwall\nreforge\nazuka\nqalander\nastrofisica\nmbuya\nusmnt\nneurotically\neue\njapanther\nbraginsky\nvisn\nsicking\nnough\nortlieb\ncommunitarians\ntotonaca\npoisonwood\ncountenances\nnamika\nsapelli\ncitarum\nspitteler\nfootitt\nraimondas\ntjarutja\nmanometry\nlulo\ndoaks\nabbeyfield\ncelsi\ntrummer\nkilobit\ndomonic\namde\njasad\ntoolworks\nmackechnie\niapt\nvansbro\nreinholt\nunfretted\nsoutra\nafterlives\npowidz\nzahalka\nmerryl\nchetek\nrueckert\nspankers\nbenllech\ndenic\ngiovenale\nandonis\nnfip\nsibongile\nirrelavant\nvavi\nsharikov\nbergelson\nsohrabuddin\nfoxhunt\nboneheads\nintriago\ncamie\nbreadsticks\ntokayev\nllanilar\ntards\nnyamu\nwotruba\nuzoma\ntollymore\ndolega\nsteepening\noverblow\npinworms\nabkarian\ncliq\nelsener\nodling\nferroalloy\nagyei\ngoldemberg\nshaheer\nquizzers\ninigoes\nheadrow\nmechta\nsuccotash\nishima\nminffordd\nkelter\nkondracke\ndibala\ncarboy\nfla
tlines\nharberger\njarah\nbalkestein\noaktown\naups\namorth\nsubletting\nmindboggling\ndurif\ntresillian\nannobon\nstarliters\nderiba\nyamburg\nissan\njayasekara\negidi\ngooglers\nnabunturan\nsportscasts\nhopefuly\ndrivability\nkenig\nlehan\nractopamine\nkabiri\nbrodkey\nalacris\nagaints\nbordeau\nriccitiello\ncuddie\nlotterywest\nkeva\nhoneytrap\nsehar\ncredle\nhossegor\nollo\nnutkins\nsensuousness\nkaroi\nzischler\nserioux\naboville\ncharmley\nsanatorio\nbagpuize\ncentrepieces\nziskind\nprobelm\nuncared\nshirly\nlowassa\noilcloth\nmctc\nnewbottle\nadman\nlifg\nofff\nmideastern\njiyai\ntremarco\npolixenes\nropin\nlongannet\nuntruthfully\nbigshot\nviktorov\ncratty\nmadey\nsuqami\nsimatupang\nlabate\nvolkow\nsubjectiveness\ntimesaver\npsychoneuroimmunology\nkomisarek\ngalantamine\ncareering\nstuttaford\nchristianna\nbedier\ncyberwar\njackiw\nrotfl\nstripers\nmcse\nhaibach\ncawdron\nayudar\ninsoo\ndelpierre\njimbaran\nmultistorey\nkolenda\nbeleived\nfrancophilia\nmerowe\nsheikhdoms\nnetfront\nslaugham\nfurd\nwhodunnits\nschlenker\nchanghong\nfruitiness\neonia\nempathized\nfacemasks\nhouseful\ngranpa\ntambaqui\nwhitebridge\nmsos\nboulangerie\nhfrs\npfefferberg\nchishui\nyetunde\ndrph\nsharqawi\nmistoffelees\nsolta\ntaleyarkhan\nadmiting\nmesereau\nfranzia\ncasler\nmolera\nennepetal\nmsika\nreboost\nordin\nrunnells\ngockel\noshinsky\nprezioso\nthorvaldsens\nthanksgivings\nnewarthill\ntreatement\nextracranial\nkandell\npushchairs\npálfi\nalistar\nlcca\njelic\ngaelscoileanna\naalders\nchryssa\nmunyaradzi\ntévez\nbeguile\ncalcining\nundervaluing\nlineen\nbeidaihe\ndelval\neawag\nevenhandedness\nuncombined\nfortuneswell\nhogtied\nhollered\nsolarte\nharmonique\nsechrist\naestheticians\nures\nbioanalysis\nostermeier\nstrangelet\nanido\nbimla\ncrout\nkasturirangan\npharris\nswindley\nkochavi\nchancing\nvintry\ngerven\nemberson\ncarmines\nkvirkvelia\npogrebinsky\nlaughland\nshneidman\naliana\nheho\nspelaeus\ndamerham\nemana\ngigerenzer\nreinartz\nddda\nmulligans\ngodement\nmilk
i\nsavitz\nbiviano\nsabriye\nabraj\nsuthar\nsiripala\ncadarache\nkipkorir\nnendo\npavic\nendplates\nigcses\nhajira\nsounddock\nkuhar\nguenveur\nlasnik\ndrymen\nplasticized\nopuses\nlansman\nhaleigh\nmatsunami\nserras\nhermens\nchiverton\nbaichwal\nwatsonian\nmihajlov\nvotaw\nmahas\ndrizzly\ncorpulence\nalloted\noutplacement\nmussomeli\nmadain\nimponderables\npreambles\noverbid\nbengkalis\nrabten\nvincentelli\ngedevanishvili\nlathing\ngake\ncarboxyhemoglobin\nlemahieu\nownerless\nabdulin\naudibert\nbucker\nkappas\nbutembo\nourika\ntiptoeing\ntowse\ncaustically\nguhl\ntragicus\nlöscher\ninadvertence\ndftd\nisah\nglenday\nreplicability\nswahn\ngolling\nsensualist\nrajawali\nateeq\npoulan\nnuruzzaman\nschoepf\nplavix\ndunkleman\nbarati\nolajide\ngoign\nribao\nizze\nglints\nisidingo\netown\nyucel\nkarpat\ngewargis\nsynovate\nmattoso\nsarcone\ntypicality\npuddicombe\nsiqi\nizhak\nbuttitta\ndeguerin\nottauquechee\nsewerby\ngullfoss\neranga\nsupacat\nroastmaster\nlidya\njoines\nrushbury\ntendinopathy\nkopelev\nilori\nyayin\nbarsukov\nragano\nmuthee\ncefni\nvignoble\nnacton\nchangjin\nrepave\nmelsom\ncavils\ncastled\nundermind\nhumiston\nsaona\nffiec\npanameñista\nstavudine\ncornetti\nlatia\nsteelville\ntraidcraft\nlemerle\nimperdiet\naebn\nsollicitationis\nfuddruckers\nburnhouse\nscarba\nwoodgreen\nabco\nbioelectronics\nkounen\nimportuning\nwerbowy\nlightheaded\nzozobra\nvernhes\nmicrobrew\nmoonscape\nmarchlewski\ncontexte\nadgate\nbordogna\nnaqdi\nboded\nloick\nemigrés\nmidwicket\nsoru\ndollarama\ncdisc\nbackpedal\nintervenors\nwle\nchunilal\nzens\nhorstead\ncrappies\ngartnavel\nbozorgmehr\ndaylilies\nkarmali\nlagavulin\nsulphates\nvoltaggio\nannely\nmatlinpatterson\nnewcastleton\nunderplaying\nfossilisation\naugue\neuismod\ncastoro\ntalvi\nustvolskaya\nfreaknik\nkonen\nthingamajig\nfairfields\ntemara\nwellins\niwatake\nwallisch\ncoulrophobia\ncarntyne\ncragun\nrosha\nlistin\nostrove\neunson\nhockham\nshcherban\nsipra\nkubina\nmatlala\ntontons\ndonnycarney\nnightengale\ntr
egs\ndolled\nuniversa\nkutak\nwawne\ndsx\nelmbank\ntiy\nbarraco\nstoerner\ncharlyne\nklarman\nstoeckl\ntemko\nmarzetti\ndizzie\nunladylike\nherner\ninsistant\nunworn\nyoast\nunfreedom\nworswick\npaen\nevgueni\nyamao\ntantalisingly\ndolens\npluralists\nkarpas\nprescence\nblumfield\nlimbal\nsorabjee\nruthann\nmulaudzi\nchnage\naktan\nparriott\nskidby\nkimberworth\noversampled\nkapon\nleylands\nbigos\nzanker\nbrachyglottis\ngamesters\ndiagree\nsalafia\nhabitué\ngalunggung\nmarasmus\ntouchtone\nchedjou\nativan\nlogorrhea\nshazier\nbreschel\nbuzbee\nleyer\nprobly\nteiresias\nnanoimprint\nmilz\nherge\nziade\ngaokao\ngrig\nnihari\nxeriscape\nmiptv\nmeulendijks\ntorghelle\ntisk\nlandtroop\nkresimir\nkasman\nsemina\nannik\nmuthukrishnan\nunrolls\ncontemporània\nekoku\njacuzzis\nlinsay\nlacp\ntamares\nsadducee\nstatscan\nomond\nforastero\nninians\ngaerwen\npizzolato\nrepurchases\ntaquari\nthorkil\nwesterton\ndominicis\nmontuori\ngentlefolk\nshattock\nfavino\nomnitel\nsikkema\npoliedro\nehler\ntapentadol\nkikue\nneuroligin\nphonecalls\ndrighlington\nlilbourne\nmaitres\nvercammen\nscotswoman\ngenworth\nselvarajah\ndorsin\nbarbae\nheldenplatz\nsportscotland\nmonden\ngrowden\nwardriving\nschweppe\neatoni\nmethylate\nalviro\ntimahoe\nclassiest\ngarrad\nalmine\ncristovão\nmegatonnes\nprinceling\neifionydd\nageyev\nnovinite\ndraughn\nperkel\nunrequired\ntetranitrate\ncoyte\nnovatel\nararipesuchus\noguma\nsteeplechasers\nbadshot\nnmk\ndallied\nmortice\ngraphologist\nbelenky\nnorito\nprogramed\nobstfeld\nwestwinds\ngick\ndaesan\njuguetes\nsadiqi\nkayumov\nyangdon\nleasburg\nmularczyk\nlandivisiau\nspoc\nlipgloss\nmeeth\nshizzle\nnuoto\nglenmorangie\nabbaszadeh\nsidlaw\noakcrest\nschaumberg\nclintonian\nhongju\nspni\ndoublemoon\nfattoria\ntavenner\nletheren\nparex\ndeltek\nweat\nautolib\ndeheza\nveet\nwpsd\nkpb\nnotatum\nmctighe\nmccadden\ntrewern\nbrillon\nmatusiak\nketron\nostiglia\ndemery\nnieuwmarkt\ndarulaman\nweixin\nulcerate\ntizon\nguaranties\namkor\nlizanne\nstreetcorner\ngrig
oriadis\nardbeg\nthormann\npizazz\naddressability\njonh\nkerrera\nbarrott\nhurth\ngurner\nquiltmaking\ncattles\nkettlethorpe\nmrem\netto\nboarfish\nwaialeale\njianguomen\nmegajoule\nlieke\npassantino\nfup\ndje\ncregeen\nlegitimisation\nstarcher\njeda\nworthpoint\ndifferentiators\nclucks\ntabing\nbrassbound\nmaryline\nazorian\nyongxiang\ntwofer\nstupp\nodnoklassniki\nshellings\nboulestin\nzambon\nundereducated\nusdaw\ntonkov\nsalsburgh\nlaureana\nsullying\nfayçal\nintentionalism\npiznarski\nsipson\nthiéry\nremediable\naslani\nalasia\nemmerton\ncownie\ngathegi\nsoly\ntravon\nyetter\ntahera\nantigay\nchilhowie\nrieman\nplayrooms\nattosecond\nmultidomain\nshingai\nbitc\nssdp\nsafaga\nthilawa\ntses\nramanlal\ndaudzai\nslaymaker\ntishkov\nballysillan\nstephnie\nfredricksen\nskillen\nanswerphone\nwilfong\nscown\nclubcorp\nelver\npermanantly\nzaidis\nleatherdale\nnsas\nsphenodon\nbilour\nwizzo\nlenko\ncamorristi\ntherien\nandriesse\npingguo\nthumann\ntaktsang\nmironenko\nassaultive\nruinart\ncolletto\nauror\nrosu\nlearys\nroughening\nwindau\ncutforth\nsaulat\nindictees\ningrain\nglandon\nuncoded\nnebeker\ngobby\nbrandee\nhélas\ntauseef\ncyclosportive\nfeagles\npotulny\ndefuser\ncurtner\nufdr\nvarazdin\ncartal\nrediscoveries\nfreshney\nzuhdi\nmicciche\nyoungdahl\nmaranhao\nfajita\nbranchage\nternura\nperminov\npostwick\neunjung\nxpand\nsuppo\nboldo\npipavav\ninion\ncolangeli\nsterilising\nhalbherr\noakthorpe\nbutovo\ndunavant\nblazingly\narouri\nanderlini\ngilderdale\nunventilated\nwhiffle\nhuadian\ncramerton\nintersession\nchymosin\nkruschev\ngemütlichkeit\nsanitise\ndavitashvili\ndoeth\nkoty\nallmark\ndsj\nriprock\natouba\nstewartsville\ncancilla\nliiceanu\nbempton\nqzone\nshenaz\ntorkan\namandi\ncallil\nvenrock\nabramovitch\nyrp\nrecherché\nhuggler\nﬁnd\nrecolonisation\nostracizing\ncayden\ndepilatory\nkeppinger\nnahunta\nmahaboob\npyramis\nschork\nwilmotte\nsiok\nuniverity\nhedwige\nmetri\nurozgan\nlessie\nmettam\nlighthorse\nfluffier\nputes\nfatburger\njawwad\nsumme\nne
hal\nbignold\ngazetting\nnatassia\nwedowee\nhaessler\ncordobés\nconfounders\nwilmes\nzhongyong\nnevarez\nmohenjodaro\ndurakovic\ngosto\nauditionee\nyosif\nbenifits\nsklansky\ncpni\nserga\nhwb\neisenreich\ncrescencio\ntibias\nscorpionflies\nlironi\nauza\ndrogin\nmythically\nwellingore\nparapluie\nhumectants\nmotsu\ngeordan\nmooned\nepoxi\ngonz\nmilies\nverheugen\nrelased\nhannock\nlandrush\nhenker\nvanatta\nmacombs\ncholent\nchimere\nsomekind\nseedheads\nvoreqe\ndelgo\naylmerton\ngörgl\nhurns\ntraianos\nmicrotransaction\nporthtowan\nkabil\ndisintegrative\nshadeed\nhavlat\nrodero\nbusicom\nleinenkugel\ngradiometer\nmarrieds\nvittoriano\nroths\ninconceivably\nneukomm\ntrashorras\ngursharan\nnkosinathi\npostcommunist\nrooley\nmcclement\npurewal\niswaran\nparalomis\nguessable\nromberger\ninsitute\nvbci\nrollison\ndrawdowns\nstansel\ncorteo\noblivian\nsuperblocks\nultracentrifuge\nnemescu\numschlagplatz\nextel\nexceptionality\nlumpers\neberharter\nstuard\nmoshonov\nsoekarnoputri\nmonya\nsecton\nkasilof\nnervet\nworriedly\nriviresa\ncofre\nnonconventional\nfilsinger\npupping\nhoben\nhiriya\ntirian\narwad\nduchaine\nbrooklier\ngoerlitz\nlivechat\nhumewood\nglenridge\nnoisome\ncranton\nreyneke\nbood\negcg\nbesford\npchr\ndutchwoman\ntediousness\nkenson\nanimalism\nwholesales\ncatterton\nmoskito\nkresty\ndeshapriya\nkeells\nfakhir\nlavrador\nsuperflex\nrutty\nfilar\nalchohol\ndupire\nkilteel\nchalkias\nsurreality\nkinnerton\nnonsens\nrodis\nrapidfire\nhashlosha\ndessi\nwhinny\nbarke\ncystinosis\nforcer\neilberg\nbrodnax\nsevele\ncourtlandt\neestor\nluneau\nlahlou\nhokin\nperiolat\nmiag\nwildtangent\nschowalter\nhomeworks\njanuarie\nhefin\nschickler\ngroundballs\ncassada\nunsympathetically\nsteffin\ncretton\nthreated\ngusanos\ngriswald\nleshner\nodoriferous\nchatam\nimpersonally\njaradat\nlandford\nincontestably\ngynaecologic\nbarkby\nautothrottle\nhiltrud\nlossie\nmittelstaedt\nhajjarian\nalspach\nbakong\nkics\nreligiose\nclutts\ncristofaro\nviperfish\nwaymarks\ngaramba\nswan
ey\njonesport\nskotnicki\nsniveling\nexcrescence\nmuqeem\npetai\nghez\nmasterplanning\nsycuan\npelter\nmcluckie\ncricut\nshchekochikhin\ntatianna\nhobin\nswinefleet\nlamay\ncambreling\nuuj\ntodorovsky\nambah\ntetrick\nsiafu\ncoppergate\nsupprt\nvarndean\nleapin\nvitrines\ndongala\nreaddress\nweakfish\nchantha\nghadr\nsidlesham\nschairer\nkathputli\nvigilio\ndoenst\ndjh\nzayatte\ngbarnga\ncrosscheck\ntoffs\nabbotabad\nousu\nburcot\nsirven\nwinecoff\ndearbhla\nmantzios\ngilfus\ntempestt\ntoity\nmccarthys\ngrandfatherly\nstreptokinase\nesera\nsibiya\ntrisul\nnowheresville\ngirardville\nneteller\ntailgunner\nproschwitz\nteambuilding\nhempen\nflubs\nchaye\ngunwharf\nfratini\nbundock\nlahrs\nmeggetland\nstatists\nssae\ngammy\nlosang\nwyns\nrenco\nnegationist\nimmunised\ndutse\ncarabosse\nmerav\nmacchiarini\nrediculously\nblezard\nlubambo\nbroco\noatcake\njinro\nmarcal\nmacchu\nranawaka\nreverentially\nunfed\ncatshill\nscamarcio\ndassie\ngvi\nrosharon\nnassco\namacom\ncqi\nhandoffs\nwijesekara\nkolvenbach\narranz\nlerolle\nbalsiger\nhabitués\nmhf\neyedropper\nshayegan\nmageean\nvanacore\nmauran\nzosen\nerechtheum\nsymbiogenesis\nzfn\nrazumkov\nzhenping\nadverting\nbutes\ndaypart\naigai\npagliari\nfanpop\nrabea\nchamari\nscreamingly\nkyar\ntigrayans\nscawby\nbeckhampton\narcelia\nhangi\nkasting\nrisperdal\nshaffi\nnkwocha\nhagmann\ngiannuzzi\nkampfer\nvouchsafe\naasiya\nmoreoever\nkliptown\ngorzelanny\nkget\nleckenby\nkabuga\npowersports\nprovid\npaypass\njackpine\nguzzlers\notolaryngologists\nbeatley\nkukors\nherodion\nmessara\ngameness\ncorseted\nloomba\nlogisticians\nmatatiele\nkäpylä\nmasturbator\ntoter\nglabellar\nkrysiak\nnarrowcasting\nllona\npedmore\neuropen\nsevastyanov\nnabiha\ntwb\ndcos\nferson\nkeynoted\nroadrunning\nshkval\naffectively\nghods\nvalaitis\nbluebeat\ndoxie\nhannesson\nsteet\nserein\nhemas\nwolffs\nmoneragala\nspinderella\nbågenholm\ngfn\nallenson\nzorb\nvles\nfraternize\nfuniculà\nchaun\ngangbuster\nentranceways\ncango\nlagrima\nstupidities\nartyuk
hin\ntrooped\nclarine\nobita\nminiaturize\nmontera\niuli\ntekamah\nintead\nbarau\npekerman\nriffola\nmondatta\npennac\nvasher\naggiornamento\nivankovich\nwikinomics\nflein\nhittner\nnizaris\ndeskjet\nunmentionables\nrizokarpaso\nappelfeld\nheshmati\nbarelvis\npostoperatively\nneurosky\ndeobandis\ngemba\npianoro\nmiddleeast\nargota\nguazzini\ngullivers\nzart\nbioneers\ntransglobe\nhoudry\nherbertson\ncourtine\nlassoing\nbkd\ndahsyat\nhanut\nlienert\npermed\npaultons\nletrozole\ndonta\nceleski\nsandside\nlubaantun\ngreenmantle\nbroinowski\nnotecards\ncalzones\nborgeson\neurobond\nderbe\nmullova\nbodla\nfontwell\nmastrosimone\ndenault\nguidestones\npenallt\npascarella\nleitman\ngubelmann\nbress\nclunkier\nortgies\nsheremet\nrangsan\nweisenborn\nratray\nblandishments\nfianceé\nbududa\nciocan\ncortazzi\nboucaud\nskytower\nmatrioshka\neychaner\ncdbg\nundammed\nkleo\nhorkstow\nsenesi\nvarsavsky\ntittlemouse\nbestbuy\ncolorway\ncaramelised\nunderconsumption\nproctology\nsaubers\nshiroma\nmirkovic\nunretouched\nchalino\njeanmarie\noreti\ndistict\nmathmatical\ncherna\nahlan\nbollan\nlangfuhr\nshelterbelt\nhumoring\ngreenert\nsherwan\nbrank\nwinkerbean\nclevelanders\nsuperheroics\nmadkour\nchaudary\nfuhua\nsoeder\nluiseno\nkwoka\nsuject\nserialist\nbrockhall\nbonaiuto\nhartill\nvallery\ngiuntini\nasoc\ntordjman\njambos\nnwg\ncapossela\ncissoko\nfraudulence\nchristianise\nimmediatley\npredetermine\nlomong\narmacost\nburrator\nbessan\necontent\ngambrills\nplentyoffish\nshabangu\ntingi\nvademecum\nkludgy\naped\nbasked\nakkermans\nneidich\npowhite\nswashes\nsolomun\nafriforum\npaixao\nliliensternus\nauslese\ngaziyev\nmerner\nbondt\nshakhnazarov\ndecares\nrosalio\nzabad\nnorbrook\nintisar\nantiscience\nelluminate\nkbmt\nnarjis\nshoebury\ngrichting\nmintlaw\nmendicino\npaleobotanists\nbreadstick\nhirohisa\ndluga\nkaroubi\nartal\ncatic\ngestate\npartygoer\nabdelrazik\nmicroinsurance\negwu\npharmacoeconomics\nbosquets\nrottach\nstuntwork\nsaburi\nbuccieri\ntoolmakers\nbusato\nzaro\nya
sutoshi\nfamilier\npromet\nbenchwarmer\nstřední\nlunetta\nlogistician\nnewbuilding\ndailytech\nprpa\nmendini\ngartsherrie\nfumigate\ngatehead\npetrovics\nwarsak\npwds\nwaisting\nquaeda\ndcist\nleale\nchibhabha\nteresi\ndulle\nbrauhaus\nmofford\naphorist\nunfeminine\nsabras\ngoerz\nstiffler\nitals\nseelan\nsliney\nhoytema\nmyway\nmigenes\nhipps\nbugesera\nescaper\nflimby\nnarayanhity\nbausman\ntadini\nkarua\noppama\nscrapyards\nkhormato\nlofti\npanzi\nmaikon\noutclassing\nbenli\noutmanoeuvre\nsuryo\napelike\nhorspath\nchillcott\nneuza\nwalstad\nlibardo\ndagnan\ngiusi\ncmpc\nterendak\nrowledge\nbatangan\nleunen\nfracasso\nwhitecoat\nsibbett\nsomedays\nmommas\nmontemurro\norchitis\nsybella\ngaffie\nhammud\ndelyth\nilum\ntapei\ndahna\nnohar\nstoneferry\ncaporali\nspeedtv\nventuris\nkaufland\nroelfzema\nhorine\nbracingly\nshaarey\nbremhill\ntreichler\nkornmann\nmindspark\nmultigrain\nkyries\nsprey\nscribblers\nconatel\nzinio\nrasulo\nsoderquist\ncukier\nbulgy\nmonosyllable\ncobwebby\nfundamentalisms\noyvind\nhobbyhorse\nmalfitano\nclps\ndeonte\nzair\nexpedites\nentrenches\ninvestimento\nkaniguram\nlongaberger\ntoston\nnuray\nmarlise\nsawali\nautoinjector\ntahmineh\neventi\nslooten\npromotable\nslatyer\nleobardo\ntriviño\nwahnsinn\nsalvato\nlarman\nwenke\nsongzhuang\nparkham\nginned\npamelyn\nbrainware\nirrigator\nbolerjack\nehiogu\nkeneseth\nrebrandings\nautocracies\ngelligaer\ndipsy\nbriargrove\ndeputize\nsirak\ntigi\nbrye\ndombroski\nstoneywood\nselvakumar\nmobisodes\nhexing\nalfold\noffseasons\nzduriencik\nhemudu\ndomainkeys\nglantaf\nskanking\ndemayo\ndandapani\njakovic\nraydale\nkesc\nbaculites\nsymptomatically\nwoodingdean\nduyet\nnetanel\nlockets\ncercone\ndespondently\nsanzenbacher\ncedergren\nlonginotto\nclads\nbrackla\ngadzhiyev\ntugend\nmckeating\ngenaux\nmaccartney\nchando\npendolinos\nusts\nslacked\ndatamonitor\naghahowa\nnaturi\nozwald\nirro\nandreolli\nberaldo\nshipu\nmerkers\nkometal\nflypasts\nwickline\nconrath\nfazeli\nnaysayer\ncitterio\nhinchliff\nbla
kney\nspaten\nbburago\nglieberman\nmacroregion\npulgarcito\ndorsets\nlobanova\nrecyclability\nsobri\nchinelo\ntriki\nbullsh\naracataca\nvitripennis\nlegwand\nexcellant\nqassams\ncumiskey\naccessability\nstellina\nschardt\ntakamiyama\nautocratically\nsplaying\nhandsomeness\nlightle\ncannich\nwheatears\nmalony\nkósa\ncloherty\nmanycore\nropemaker\nvelindre\nkcsa\nskey\nduffers\nmotorboating\nweinsteins\nruukki\ndanspace\ncostolo\nbarouh\ndemobilise\ncallejo\nwelfarism\nbuiter\nnoumenal\nbowties\nhualalai\nalethiometer\nngmoco\njerid\nmarciac\nprimitifs\nscalea\npribylovsky\nmeirionydd\nteruaki\nunislamic\nschmidtke\nqera\nlsis\nfirestreak\nhuarache\nemporer\nlittermates\ndelaet\ndivincenzo\nconventioneers\nlewison\nunexpectedness\njunzi\ndhindsa\npaustian\nweltman\nwahhabist\ngemberling\ncasperson\nlvx\nsegas\nfalstein\nwestering\nsupersoft\nbulks\nschlozman\nticklers\narasa\nalmana\nbyrsa\nmangaldas\nlajčák\nayoubi\nlojas\nmarychurch\ntgfb\nwideness\npochon\nnalaga\nchilango\ncolantuono\nciesielski\nsaltzberg\nweisenfeld\nsparticles\nlaventure\njunking\nlenhard\nspinazzola\nsubkoff\nhoffpauir\nphotogram\numtv\nhaltingly\nhariyanto\ntransmogrification\ncornpone\nfrankensteins\ndarial\nmcvoy\nasrat\nintourist\ngildernew\nvavoua\ndoglike\nmigden\nmidcoast\nkarats\ngraville\nadande\npontani\nsnoods\nthornless\ndistractive\ncume\nwhitesands\nbroadie\nharple\nnirupam\nhoity\nblinkbox\nfremlin\nhainton\nvigabatrin\netendard\nmatarese\npait\nphotolithographic\nlvsr\nargetsinger\nlapidot\nwittenstein\nsilatech\ncogo\naimal\nkarandikar\nhypocricy\nrhwng\nbiever\nsnuffing\nwensheng\nsofri\ntackaberry\nwibro\ntennenbaum\nkapl\nhyseni\nbowart\nsherwoods\npzc\nkeratomileusis\nakinobu\nbinzel\nspelterini\npretreated\njoppich\nflutey\nismir\nsurani\nkinz\nbonnefous\nmellion\nbryza\ncornershot\nlatavia\nmehrabian\nkassianides\neventus\ncatholictv\nspinx\nbrassicas\npavley\naldaba\nbacala\npowerbar\nantias\ntipitapa\ncrudele\nbrilli\nniederkorn\ndrobny\nbrimscombe\nnuveman\nbassolé\nb
arnicoat\nweinraub\nshelle\nbeckstrom\nhegi\nraak\nutila\ncompulsivity\ngoest\nsemitendinosus\nsynesthete\nfuturistics\nklops\nacquiescent\nulufa\nrsna\nnarwal\nmuzzey\npoipet\nstiver\nurbanworld\nvolcans\nmmea\nvleet\nichino\ngaliardi\nvallario\nitq\nadnkronos\nitos\ngancia\ncatledge\nvukasin\nlandsmen\ncsem\nkruszewski\nquenby\nmpwapwa\nyuriorkis\ncolligo\nwinata\nvukich\nmydeco\nprimitiveness\ndbsa\nbentov\nschwentke\nhspc\nburgling\ndonelan\nzakary\nuntrammelled\neacs\nbarnfather\nkashka\nbhangarh\nvalcareggi\nhaketa\nkalskag\nmansourah\nnutrigenomics\nquataert\npuckish\ncomputerizing\nsabby\nscarnato\nmistic\nghettoizing\nlomell\nrubashov\npavee\nwinep\ntrolla\ntouchup\nwoodmansterne\nflatbeds\ndaeron\nquinol\nobjectional\nsmedes\nantedate\nrosebraugh\nstearmans\nhedayati\narenella\nwsil\nvredenburgh\nbehren\nbarany\nawde\nfazioli\ngrasty\nminmetals\nsalvucci\nhunsecker\nhamifratz\nbalzar\ntritech\npolypharmacy\ntakana\nzentralbank\nsundress\nimvu\nminjun\nschloesser\nbodge\nningún\nkebble\nferrochrome\ndawdon\nscuffing\ndarroll\nunadopted\nhilsum\nglimpsing\nprawle\nroscioli\ntunstead\nalmirall\neisenhut\npattar\nrapho\nmave\nmohajir\nsoltys\nenamelware\ntrzebinski\ntintypes\npiyapong\nghaem\nlouvel\ndongseo\narchea\ncytter\ntransgenics\nraedwald\nsternest\nmartinovich\nhaisheng\ncontextualisation\nimmidiately\nnoncustodial\nadhir\nsessums\nmantega\nscandling\nparlante\nlouttit\nrumbas\nbrightwork\nchehade\nsamaroo\nunat\ntownfolk\nhillstrom\nmories\nbaginski\nhayel\nchansa\nngls\nzhelyazkov\necrc\nhabila\nmaryculter\nhennon\nfrivolities\nenano\nebird\nshaymin\nsubjugates\nherro\nmaximova\ntarman\nmynott\nstrewing\ntimbale\nmckhan\ndosnt\neljanov\nfarfel\nbaccini\ndemsey\nsixpences\ndeltaville\npagden\nconsigny\nschaafsma\nmarford\nalashan\nchinyere\nlowness\nzyvex\nbertoletti\ngasfield\nlarri\nanandasangaree\npanoptic\nlyndell\ncrating\nkijabe\nrouland\nselsam\nlebogang\ndoodletown\nfuan\npinella\nlram\nreddell\nmolterer\nrelatio\noccure\nbaxenden\ndescriptiv
es\nrhame\nazkadellia\npyrah\ninternalizes\nschrempp\nuniversalized\nmazraa\npatmon\ngrudzielanek\neternit\ntxurruka\nstagey\nimerys\nquietened\nislamification\ndulhaniya\nwholefoods\nsinornithosaurus\nuproots\nticuna\nvuyo\nleakers\ncombwich\ntimnath\nzigo\nshamwow\nreman\narchitecting\nhowsoever\nsteffey\nwolfy\ntockwith\nsoueif\ntrenholme\nrorimer\ntieing\nslcc\ncorales\nnonconference\ntelestial\ntregaskis\nbaisya\ncocodrie\ndamaturu\nhtat\nflourtown\nhosemann\neruh\nhangups\nraegan\nvaland\ntristin\ncaldow\nmaftei\nlouboutins\nsamawa\nbocs\nkindhearts\nhopcraft\nsakir\nloyden\npelles\nlopokova\ncostantinopoli\nintercutting\niccas\nmagaz\nquids\nrecomment\nhelem\ndesparately\nhulings\ndiegans\nnmw\ngeorgas\nbewbush\ncoulterville\nmatricardi\nforand\nstraitlaced\ntackiness\nemailer\nathersley\nrazzall\ndollops\nkeiren\nsaimin\nvraca\nbiolab\nlabeyrie\nrygbi\nmoati\nbeachum\nhowaldt\nfrenetically\nmbogo\nfrancophiles\nreçber\nchocolatey\ncherkaoui\ncnngo\nbridies\nhelferich\nembarassingly\ncervara\nhazelbury\ncimoli\ncotija\nflashings\nstansgate\nbartleson\npasito\ntartaruga\nprepositioned\nshawsville\nramappa\nkevi\naxius\ncarrollwood\nkoutoubia\ncredenhill\nerakat\ntorbinski\nqixing\ncumparsita\ngazali\nrowntrees\nbrusqueness\npomés\nelystan\nopprobrious\nsollett\ncharlson\nschmiedel\nshafrazi\nmames\nkochav\nbuildout\nbarbini\nbombon\nsystematisation\nbioid\nabeyta\nbackstabbers\nhypnotising\nkaieda\ninvercauld\nnxr\nhowdon\nstuben\nmohabat\noberkfell\nnezavisne\nindividualizing\nchernick\nchachas\nsembene\nunsexy\nusnavi\nkrummel\npvg\nmacroprudential\nnaah\nausterely\nxhci\nplayzone\nraser\nprecarity\nhighstreet\nstimulative\nrippingale\nhumira\nyouell\ndalbey\nmulumbu\nqaderi\ngamme\nlarranaga\ncannily\nshiremoor\nastolfi\nkieckhefer\nruthy\nfnk\nrepoint\ngatx\nmatsesta\nallegrini\nherlie\nworkum\nippodromo\ndipaola\nspelter\ndhawal\naleynikov\npaddleboats\njunsheng\ngadzuric\nfanboyism\ntimotei\nwardy\nfogal\nhanzlik\ndervaux\nmigra\nusam\nscarified\nendearm
ents\nkegler\norebodies\najamu\nflanimals\nsalinarum\nstashwick\nsnowcats\neworld\nkraprayoon\nlazing\nbraxted\nloanable\nfolden\nndx\ntomobe\nnjeri\nsuppurativa\nsimpleminded\ntonno\nmccullouch\nrakhmanov\ndebswana\ncashflows\nheidkamp\nlozzi\ncentrebet\nbeggary\nncnb\nconvallis\nhatpin\nlistrik\ncatchable\nacnp\nschaad\nbeatts\ngentrify\nmatvienko\ndannhauser\ntrevin\ndible\nvayner\nrussen\nababil\nlavonte\nbarrasford\naiso\nbusies\nmaims\ndejonge\nparmjit\nmusuem\nifq\nhdad\nwinterstoke\nmontets\nhollyshorts\nhelos\nsutliff\nniky\nkhuzami\nashante\neastasia\nplunked\ncardoon\nupskirt\nzuban\nstumper\nshorthaul\ncardis\nbraise\ntziolis\nsongline\ndochev\ntayag\nplagerism\ngrzywacz\nmellott\nfederle\nopra\nscantron\nyaghoubi\nmilmore\nstryd\ncafcass\nhurrican\nsundy\njutzi\nhorticulturally\natacms\nhilari\ngorditas\nkrans\ndampf\noip\nmedialab\nimmolates\nmbongeni\njanning\nlawnmarket\nhoneyghan\ncascione\nditore\ndynion\nskepper\ndrumochter\nbakich\ntrahern\nhadnot\ngalí\nspanierman\ndickheads\ndubbins\nvarricchio\ncoraghessan\nbelgrove\ngagandeep\nbaghdadia\nrentrak\nairin\nbierbauer\narcieri\nxiaoqiang\nachcar\nusinor\nhandcross\ntorrelodones\nadvisorshares\nndubuisi\nmahinder\ngpos\nsystemization\npanjwaii\npremisses\nsflc\nakitaka\nozen\neximbank\ndaikanyama\ntdecu\nagapov\nchuao\nzelzal\nwoodwards\nintuitiveness\nrelase\npenallta\nlebhar\nrebuses\nndukwe\ngrispi\ncoronie\nbawitdaba\nwaorani\npierola\nalmay\ncusic\narvilla\nmishor\nnissenson\ntrivialise\ntemma\nhahs\nkaethe\noverplaying\nricon\nmaysonet\nzubik\nclaycomb\nteleton\narturas\nemmaville\nablow\nappliques\nwasen\ncentenaries\nballycarry\npintscher\nscaur\nverraros\npolytonal\nmigrationwatch\nbadejo\npressurizes\nipet\nnanthana\nvédrines\ncrittendon\nswanland\nxijin\ncodel\nbeutner\ndalindyebo\nmanko\nbortnikov\nmucca\ncalza\nnaghi\nteguise\ncarnduff\noakshott\nbrodney\nnatia\npremal\ninfoline\nsilenzi\nmyia\ndettol\nokposo\nthrumpton\nchoat\nphix\nspindelegger\nmeert\nallahdad\nbaysal\nsmallfield\ns
tarkist\ncpam\nkeano\npsagot\nsantaland\npalaeontologia\nswartkrans\nvernell\nseastreak\npurveying\ndakotah\nnaybet\ncutzamala\nbottome\nborusewicz\nnoemie\nglendevon\neagly\nwallendas\ndeanes\ngerhards\ngestated\nculio\ngurode\ncopmanthorpe\nnanosatellites\nheckuva\nliverwurst\nundoubtly\nprifysgol\njarnot\nmathema\nadamsdown\ndarinka\neffused\nmoonlighters\nglocksen\nwaldrom\nsevani\nymha\nmerret\ngarlieston\nisns\nkeliher\nkozyra\nvilalta\nshadier\nubari\nhohensalzburg\nshmaltz\nkouts\npostcoital\nagentes\nfruchter\nsellal\ngassée\ntodisco\nnsga\nkitigan\nkitesurfer\ncastree\nnkotb\nenglar\nmostviertel\nmetrocards\nmoema\ncurrenly\nseabourne\nagy\nkristien\nmitesh\ntighar\nnaison\nsleptsova\ncauterizing\ncracklin\nstanojevic\nkalida\nafew\nmorowitz\nbonesetter\ndepreciable\nzorbing\negomania\nfirstmark\nseatback\nbeguelin\nabsentminded\ngyfer\nschoenborn\ndarryll\npilotta\nzinsmeister\nodenberg\ndoctoroff\nsebastion\nfacil\nwattay\nvranken\nattaboy\nfrancom\nsamidare\nquirkier\nparkerson\nmailmen\nsrikumar\nhesterberg\ncarsphairn\nraphi\nmatech\nrachlis\nkbtx\nlaskier\nartmann\nmeghani\nschnider\npostnuptial\ntholins\nsleightholme\nunitron\nliberations\nspiegelhalter\ntelectroscope\nmakarim\njoumana\nmridha\nserradilla\nbhagyam\nrelighting\ncoreceptor\nnaica\ngscs\nfcsl\nmediano\nskyscanner\nehec\neulogizes\npressac\nmieh\ncontractive\nashqar\neurolines\nnemazee\npantex\nbaselworld\nmccuistion\nsemenko\ndaryan\njanitzio\nhertenstein\neszopiclone\nkankava\nzelezny\nrafia\nmaisa\nsecurid\nenflame\nbeyazit\ntampopo\ngroundbreaker\npadwick\nferral\nbuonocore\nhightech\ncandiate\nserwa\nlevitch\naugenbraum\nwittier\npolyanthus\nfilmhouse\nnalapat\ndaubhill\nktnv\nsterkel\nceris\nharmelin\nemed\neirian\nscarlette\nlewandowska\nmareks\nadge\nfailand\nrosindell\nwaxie\nbourla\nhypercar\nyonamine\nthoses\nfatuma\ngessel\nlafta\nclearcuts\ncombien\nfielmann\nprenup\nkalder\ndatini\nmotlagh\nwwpr\nklawock\nhosi\nkotis\nsirenas\ncarcel\nstefanko\nsemiautonomous\nruggie\nsven
n\ndiglycerides\nreadsboro\nahronot\nwaitlisted\nkurtyka\npetrolatum\ntxakoli\ndjojohadikusumo\nhimmelb\nkerttula\nconkright\nreamon\nstryer\nouca\nweine\ntorian\nbarcella\nchetia\npowerhead\ndivests\nspetchley\nwhag\nwildblue\nstanch\nassuredness\nsuborn\nparsonages\nunscrupulously\npocosin\nrastegar\nthrockley\nkerosine\ndepetris\nnanopores\nlington\nargolic\ncentron\njetley\nhattar\nsaltuk\ncauby\nkfyi\nnibelheim\nareeba\ndtis\nandreoni\nvfg\noleguer\ncoviello\ncarrasquillo\nakthar\nbickenhill\nsebba\ntchula\nshakeshaft\nbragason\nmultiplay\nmisallocation\nsmartboard\nsures\nforehands\ngabbie\npeñaflorida\npulseless\nrbz\nlivas\nsagkeeng\nmoaner\nmetroshuttle\nbioengineers\nmorrocco\ncampervans\nfishcake\nllanfechain\nlulzim\nmbct\nfarraj\nterdiman\nmetrolina\npeteris\nshareen\necofriendly\nhachey\ndishnetwork\nfarberow\nnecula\ncollingdale\nhifa\npanthi\nkabob\nmealworm\nteji\nscousers\ngenesco\ncreditworthy\ncosign\nmigh\nblahblahblah\noutgoings\njanczyk\nfraiture\nabbeytown\nnawroz\nreservable\ndevilry\ntrebling\nponnuru\nphalguni\ndzeko\nbocoum\nstierlin\ngianetti\nbekoff\nazincourt\npeppo\ntearjerkers\nduscher\nsandpits\nwetangula\ndistressful\nblared\nschlaudraff\nelectrodeless\ndesensitisation\npenurious\ntexico\necrs\nanatel\nkramar\nmâche\nhalemaumau\nmegahit\nsabera\nulitskaya\nlahia\nhisbah\nrepeatly\nsangstha\nlanghart\nkaechon\npavlis\narciniegas\nhignell\nezzatollah\nusx\nrusco\nkowalchuk\nlowboy\npitchess\nmachno\nalbannach\nusin\ncobaea\nidbs\nrohrbaugh\nnisku\nshadowboxer\nmaroma\nsafonau\nschisler\ndyfrig\nsherando\ntaurian\nhalliwells\nwizzair\ntengan\nrocos\nvideon\nmuoio\nmaryjane\nboad\nfraschilla\neclectically\nwormold\nfibrillin\nbourillon\nmallwyd\nkolter\nvillan\nvemula\nmetris\nlambadi\ncomau\nwelcombe\ncawker\nsivtsov\nlashin\nmapesbury\nvearncombe\nzoltar\ncambó\nkambo\nklix\nwallpapered\nreversable\nyasuchika\npotsy\npericlean\nshumon\nhisahito\nremoulade\neaie\npouts\nferneley\nshamel\nraquela\ngawlik\ndqe\ntessem\nwalkure\ntheire\n
demutualization\nsalaskar\nakaji\npilkingtons\nrevolucionarias\ntroedyrhiw\nnolt\nlebrock\nzvjezdan\nbedrocks\nesurance\nsoudley\nrahama\nplangent\ngheen\nbortolussi\npoltrona\ncichero\nwentbridge\npeahen\nderartu\nniebel\nkharazi\ntelesales\nbearley\nkopitiam\nbernville\nkiriasis\ndunakeszi\nbastardised\ndajka\nbrowntown\npecvd\ntheimer\nspaziale\nmatovu\ncephus\nduesterberg\ncynghanedd\nluxin\nhighjacked\nskans\naberdonian\nstitser\nhewanorra\nharrowed\nzhongxun\nnbad\nmaos\nbalderrama\nenev\neshkeri\nmfw\nactivee\nsubotsky\nnyoro\nscil\ngarantita\ncovereage\nquennevais\nprofessoriate\nvulgus\nnatm\nfajt\ntooths\nakef\npiskun\nnederlanders\nlaurinda\nholbrooks\nqfp\nsandbridge\njacquemetton\nexeunt\nbregy\nrentaghost\nrequirment\nmarrah\ntida\nabertis\nobloquy\nswicord\nnestin\ntelegrammed\npecorini\nblackistone\nripoffs\nteleradiology\nprocreated\ntonsberg\nnmpa\ndarrieussecq\nzucchelli\nahlbeck\ntaylorstown\nkyogen\nwedgetail\nreinvestigated\nvaruzhan\nbiergarten\nfams\nkanevsky\nmanaj\npoobah\nrames\ndiqing\ncawl\nfickman\nmelady\nschlachtensee\nliguo\nalaïa\nsamorost\nterrones\naugmon\nmweka\nkanagaratnam\nschefer\nheliskiing\naddyman\nkarasyov\nbilkis\ncarthel\nmaulden\nwte\newm\nelw\nkhogyani\neigler\nwoolfenden\nelsenham\npavlyuk\ncitypass\nmedek\nnawash\ngodtfred\nmellars\nrodker\nkiona\ntoubab\nsanjayan\nrudden\nxceed\nstumpp\nulev\nkammerling\ngrafter\nlaverstock\nyuejin\nbotwright\ntarpaper\ntunchev\nananthaswamy\nsagiv\nmlodinow\ncherng\nmyfox\ngoofus\nmacdermid\nfrafjord\nradric\nagboyibo\nkanoe\nhypothesising\nyuanmingyuan\nscurries\nanbumani\nnatalina\nfilipi\nwhisperings\nhaneen\nteddybear\nshovelton\nwalasiewicz\nrotovision\nballem\nlevenberg\nmiddleclass\nrobeco\ngenoways\ngrgich\nbrenman\ndownlinked\nlavrovsky\nsomport\njenkem\nradkov\ndangoor\nnsse\nhobnob\nunengaging\narafah\nsorbets\nitsuko\nfruitport\npanor\nhairnet\ncannnot\nebrt\ncompartmentalised\nscattini\nprieska\nchaison\nsalunke\nscraptoft\nalberstein\nhabomai\nvennel\nreroutes\nwft\n
pollensa\nholtman\nkaral\ndajie\nschouman\nmukhriz\njaklin\nkdic\nparvan\ntrainspotters\ndoncella\nklieman\ncorlette\ndairylea\nkazaam\nloeper\nheilbut\nedeka\nkarani\nbrainlab\nendoglin\nrecidivists\ndemaris\ncandar\ndoddering\nheiloo\nherchel\nrafidah\naaap\ndeadened\ncastellazzi\ntsaritsyno\nclecs\nadili\nfondas\nmagorium\nfering\nfollowership\nsubassembly\ntrilliums\nmatalam\ngadhia\ncelena\npalmilla\nliraz\ndasatinib\ngombo\ncammas\ntyphoo\nafmadow\nburgis\nkrishnapatnam\nmacoutes\nvectren\namiriya\nmidlander\nkalashnikovs\nbellicosity\nkostovski\nprofond\nawford\nmackean\njeffires\nfoxed\nmortgagees\nmidsole\nclimer\npramipexole\ndolichenus\nsteira\nmelanins\nedvardsson\nvallar\nrequalification\ntuley\nchickerell\nefird\nceibal\nkicevo\nkrynicki\ncaqueta\nnakara\nzfns\njobie\npickell\nclumsier\nwalburge\ndalya\nmehregan\nrabanal\nfirt\nichiba\nparleys\nserani\nreleford\njvb\njevans\nabettors\nmcdean\ntranscriptionist\nferryside\nlasala\ndilbeck\ndenuding\ncolums\nartifically\nlykaion\nmicho\ninsu\nunderrun\npillowcases\nhpw\nkkc\ndresen\nhightail\nsplays\ndrospirenone\nrecharger\ncizik\ngumo\nwijers\nthrougout\nhombach\ndeterminists\npotten\nusni\nulner\ncrole\ndeffenbaugh\nangelidis\npvrs\noversimplifications\nteggart\nshapiros\npurlieu\nsuspendisse\ninfluentials\noakamoor\nokah\ndefleur\nnjue\nlimply\nfeinting\nkrainik\nnautiyal\npaviland\neksi\nosmany\ninhalable\nmimbs\njanwillem\nexotropia\nprahok\nmalltraeth\ndoggers\nbrownsword\ntrivett\npolon\nmetway\nmicrogaming\nccid\nbabelgum\nlochans\nbossons\nmve\nswiper\nwessell\nboysenberry\nmlbam\ngopalaswami\nbritania\nepazote\nghostwrote\nshalon\njpi\nsensate\nbizcocho\nstehling\nmuhib\ncusop\nadang\ndobrowski\nkanellos\ndibbell\ndaybrook\ngulyas\nconsigns\nkemner\nstanke\nvickey\npineywoods\ndsme\nkizashi\nnaguru\nfrakt\ntatad\ntahri\nmorthens\npliability\nciociara\nolofinjana\ngunselman\nsamotlor\nhamdeen\nmanegold\nmusella\nshirasu\ninventure\npaskey\nyinhe\ncountercultures\nelisabete\nnimesh\nderriere\njes
ser\npikalyovo\nonishchenko\nsommersby\nnicolelis\nmurlidhar\nabraxis\nwaterboard\nenig\nburbo\nthanatology\nluffing\noxcarts\npacesetters\nmarbleized\ntichelaar\nbuit\nlikasi\nsalsero\ncollbran\nfrizette\nshamsie\nhaemorrhoids\nvioletas\nhjc\ncuckolding\ntrexlertown\nmcketta\ncapuchino\nsupose\nbeug\nlacerate\ncronenweth\ntourvel\nsteinsson\ntennie\nremingtons\npassings\nsemakau\ntiozzo\nrvx\ntrumaine\nimmunomodulation\ntuitavake\nstobhill\npritsker\ndewell\nknowlegeable\nevictees\nannella\nhancheng\ntiddler\npzp\nschetyna\ngratteri\nkroell\nfraternized\npunshon\npattyn\nclonoe\nlasica\nmatrics\ndols\npdgfr\ntovil\ntredington\ndinelli\nmphela\ninterlined\nteenick\ncelene\nmarrin\necopsychology\nmintal\nsparco\ntropicalismo\nbeinisch\nbahre\nqorvis\njubileum\nfortesque\nhussan\nalagiri\nosisko\nhillmorton\nuclh\nvenson\npasseggiata\ncomestibles\nconsiglieri\ncoheres\nverley\nzirndorf\nsutzkever\nschollander\nborck\nlacosamide\ntoczek\nnedum\nrosabella\nrafu\nlecturership\nkreutzberger\nanmer\njicama\nmorvah\nronak\nsaidel\nshahzia\nearthsearch\ndallying\nholtom\namera\nterhorst\nedrs\nhulanicki\nchachapoya\ngeff\nnicchi\nsateen\nreprazent\nmiscount\ngivon\ngasim\nfulshear\nwartorn\nshoenberg\nsourpuss\ndileita\nbwx\npropositioning\nchunfeng\ncypres\nbodyworks\ngoliat\nminich\nfullagar\nalnitak\ndisestablishing\niodised\nwitchetty\nevercreech\nexeption\nlumas\nolenicoff\nbodys\nghiglia\nhypovereinsbank\nherwald\nmour\nthropp\nkvetching\nnabala\nllangibby\nimprecations\nakoni\ncrazzy\nadducing\nearplug\npapoutsis\nverbals\nhipotecario\nghafari\nsweigert\nsymmetrix\ntrows\ncurgenven\nladak\ntakfiri\nkhalfallah\ndisgorging\nyaccarino\ngrumblings\nkindnesses\ndéesse\nxolos\nsayres\nricou\ngloomily\nsnitker\nchakib\nmjt\nfaulder\ndegregorio\ngoddammit\nlierre\nlithwick\ndiagnosticians\nkravetz\nruning\nbishoff\nverkhny\nchemcam\nmuttur\nturtlenecks\npcma\nassistence\nlaharrague\ninvertigo\ngayman\ncuntz\nkelway\nsuggitt\nbowle\nkhmaladze\nneostigmine\nplumping\nsimmo\nyam
oto\nmonopolizes\nnonrestrictive\ncrushable\ndeaccession\nmisericord\ngotchas\npomares\ncolaiacovo\nschoeps\nbrainteaser\ndirgantara\nsanner\nsomberly\nnfte\ngriever\nsorbier\nsanft\njobbins\nsayulita\nbossangoa\nhenly\ngrabert\nzulfiya\nshukman\nfayence\ngjirokaster\npasayat\nsovietism\nmarisha\nnickens\ncambert\nafable\nhalloumi\nshizuki\nmarchbank\ngeriatrician\nformell\ndefinatley\nyhe\nshipler\ndeportable\nshooed\nseife\nunwieldiness\nnaken\nwokefield\nesops\nstoolball\nkislak\ncasola\nbutana\ngothika\nasrm\ncornillac\nhessman\nadenomyosis\ngoodrington\ngeritol\nbullmastiff\ndeshazo\nmcway\nspinningfields\nboonoo\nbertolino\nwaistlines\nretoucher\nsankaty\nnonentities\nfernao\ndiscrepencies\nriedinger\ndesvaux\nphytoestrogen\nhunkered\ntorosay\nalreay\njaymay\ncarree\npeyronnet\ndoyens\ntetherball\nswfs\ncusiter\nshinnick\nalsp\nbesseling\nfansler\nprotoplanet\nkainer\ncowered\nessy\nsamrajya\nloogie\nreusser\nxlb\namerine\nlazybones\nmuffett\nsiqin\nsmacker\ndelfini\nmoscot\nlupoli\nhenio\nitex\nbellota\nrautenberg\ngernandt\npenparcau\nanuszkiewicz\nramblas\nflagstar\ndejoria\ncrossdressers\nexogenesis\njro\nresending\nwaswo\nkynan\nrollcentre\ndemopoulos\nlabone\nbarters\nspacewalking\nalfreds\npneumococci\nhudi\nwahib\nwahweap\najira\nheerema\nshavkat\nmalifa\nbookcrossing\nreacquaint\nbarryville\nprzemyk\ncatelynn\naereas\nphonautograph\nfranscisco\nbioelectricity\ntreister\nbigscreen\nwoodsongs\nboghall\ngujran\nhjartarson\nbodnant\nforsooth\nstrassburger\nchaises\ncomen\nwildenberg\nscrips\nyering\nchitale\nherschberger\nneuroimage\ntubeworms\nunsucessful\namikacin\nborenius\nmaxvill\nmohamedou\njauzion\nmapou\nkildale\nsoumaila\nshivashankar\nkurtas\nstickered\nkreischer\njdate\ntoibin\ndäubler\nworlwide\ninsensitively\nrahlfs\nareti\nispo\nenjambment\ndudmaston\nraimunda\nrienstra\nunlovely\ncytokinins\nmaschmeyer\nulanhu\ngenever\nbenefiel\nokorocha\ncasimira\nrozon\ncutajar\ncatinari\nbrearly\njournalisten\ngenetica\nkdnd\nfelbridge\ncdis\ncanouan\nr
akhat\ndakis\nlaneham\nshdsl\nromcom\nwindover\ngalak\nberfield\npernia\ncymreig\nislamophobes\nrubaie\nholzen\ngtfs\npopera\nswarth\nsurojit\nalotau\nmajok\nhalcro\norse\niene\nquesion\nnerikes\nwaggling\nheginbotham\nconsolatory\nranworth\nwertsch\nbarbarities\nkaffer\nnapolitana\ncontradictorily\nultrasoft\nnaeemi\nfomr\ndeontay\nassemby\nbeltone\nridgetops\nstainburn\nmehrabi\nmarooning\ndoumen\njinmei\ncillizza\nichaso\nrsls\nsaei\ndisip\nzekeriya\ndelucchi\narpels\nrassoul\nstrettle\ncolsaerts\ndebbe\noveremphasizes\nvineeta\ncharminster\nweimin\nmisplaces\nliveability\nlimpias\nblacksod\npelon\nieronymos\nluw\ndescibed\npushtu\nalfas\nlayovers\naukin\noxpeckers\nhoshyar\nantley\nsuisan\nifft\ndeconstructionists\nkuney\npedone\nhaziness\ntaxe\ncrays\nlandisville\nsemioli\nntombi\nbloome\nmaloway\ntatoo\nbahareh\nlnl\nminhinnick\ncountrie\nklawe\nrawabi\nzatarain\nunruliness\nbacolet\nrheoli\ndiffernce\nfacism\nneovascular\ndrear\nendelman\nodwa\nmarquel\nxiaohong\nkislitsyn\ncamou\nnetcare\ntave\nbrooklynites\nsavely\nlascio\ninure\ncrues\njavadekar\nrowthorn\nmuniesa\ndismissible\nharbage\nshortboard\nmoneywatch\nlikings\nistead\narment\npietrzyk\njackers\naperghis\naobut\nahdal\nrosevelt\ndumar\nfinnentrop\nsettees\nedaf\nheidel\nmatveeva\nwhql\nchernus\ntaxane\ngurak\nyowell\nalsberg\ngladyshev\npsychopharmacological\nninewa\nslinks\ndecriminalizes\nsqueezy\ntitze\nbirdcages\npinelawn\nriggott\naaaand\nnurgul\ndesensitizing\nhadjer\ncoonrod\ngarl\nkarsums\nfalkender\nflegt\nbabuino\numán\nperchloroethylene\nlakdawalla\ndynam\ntetsuzo\naccretive\nlaurenne\nbozan\ntijan\nnausheen\nbootlid\nslopping\nbendheim\nopernball\norfield\nsanlucar\nbrogger\ngeophones\nbootlace\nglueing\nojd\nkamryn\nlongshots\nthoracolumbar\nhobman\nhodur\nfonssagrives\ntraini\ncyfres\nliebler\npersonell\npiousness\nmalashenko\nmaslansky\nunderdiagnosed\nplackett\nweiman\nsacrificium\nguaco\ngerstel\nsysmex\nrequestors\nvoytek\npiemontesi\nforestburgh\nwaghef\ntsunekazu\njobarteh\nbump
kins\ntandel\nextraditions\nvcsel\nwucherer\nkiviat\nllanbrynmair\ndavidowitz\njiayin\nwayson\nwinmill\ngollings\ngurrola\nospar\nclaffey\ngoetsch\ncangrejo\nsolu\nchangeability\ntristani\nharwin\nachivements\nphilodendrons\natiyya\nluecke\nhuges\nchinnici\nmurkiness\nexpirations\nhunthausen\nselloff\nbramlage\nudca\nbiolabs\nbozzolo\nzarghun\nsnowpark\nmagnini\nmchc\nhuanuni\nhuya\nalmrei\nprofoundest\nzaraysk\nfinessing\nivery\nsawbuck\nmenzer\nclaimer\ndalmahoy\nanabuki\nnyyc\nshuttlecocks\neasther\noecologia\ncontainership\nnissman\ntriolo\nmirfin\nsankeys\nskanks\nsundholm\nbaseggio\nnevland\nmaxa\nbubis\ndinunzio\nresentencing\ncacher\nmartelle\nwizardly\nholshouser\nguyville\njianfu\nmajlinda\nthirza\nkypseli\nbodey\nisesco\nphotocard\npessimistically\naltenrhein\nkalachev\ngamefly\nretributions\ntetaz\ntapuach\nsouers\nkampia\nfelinheli\nkhamar\neaso\nparoxysms\nupthegrove\nbarbadillo\nhorsington\nexulting\nstoppelman\nsciamma\nannalong\ndabis\nlevines\nleinert\nycf\nhrbacek\nvenuses\nristretto\nilna\noutted\nlacker\nlahiff\nbanier\ngroopman\neditorialists\nagio\neffulgent\nlongobardo\nnorimasa\noriane\ndunger\nhellam\nprunedale\nhovda\nprimmer\narroyave\nhawkley\nhomeplate\nribic\nizuka\nrecrossing\ndeuxieme\ngumpel\nunsurmountable\nlinteus\nsensitizers\nlamna\ncharry\nderventio\nintrusives\ngolinkin\nxihai\nribolla\ndilwyn\nmcmap\nvoorde\naamna\nihemelu\nsuperpartners\nrichart\nyongjun\nzasloff\nmazzio\ndeibel\nnemon\ndfsa\nwakestock\nduder\nfulgoni\nfangyu\nqct\nubiratan\nkobeissi\nconiscliffe\nsuctioning\nwaterscape\ngormly\nkrin\ncomplaisant\nmclaughlins\nprifti\ntarisio\nhemm\naneuploid\nungerleider\nchunnel\ncoquelles\nlyophilized\ndanneberg\nmely\nsaido\nreflectively\nkoppie\nbreeks\noberkampf\nshemin\nkerbala\ncompulsories\ndarioush\ngenrich\nmicrocosms\nroxxxy\nnocturia\nbicket\nhpac\nfrizz\ngrandcamp\nglub\nmariné\nzuccaro\nscrutinises\nwhaplode\nzecco\npolge\ntrefniadau\nbrutalization\nlayaway\nozgur\nucatt\nsenegalus\nmadaki\ngosal\nfayza\nalltw
en\njangles\npedalled\nvpf\nloktionov\naltoids\nyingjiang\nbanjolele\nsuburbanisation\npoyntzpass\nspendings\nleonidio\neskbank\nyetminster\ngaeseong\ncredenza\nnadeen\ntwing\ntkts\nrehau\nmsrb\nintell\nonik\nloughry\npalletizing\nsanctifies\nherda\nsprengelmeyer\nlockdowns\nhuesman\ncorpas\nstolzing\nthamsanqa\nbarbazza\nvocalising\nrafid\ncsere\nmccorkindale\nkurmangazy\nbrakebills\nsheilagh\nhardeen\nkoert\ncommoditized\nshahrazad\nsupervening\ncavalleri\namsallem\nashei\nblooding\ncopulates\ncalik\nunanchored\nirdeto\nboltbus\nlaywer\nalabbar\nmadchen\npenllergaer\ncrowcroft\ncostières\npiiroja\ncantet\npueda\nuluwatu\ngilkison\nalekseeva\nexall\nbruehl\nmisappropriations\ndustoff\naliette\nrosefeldt\nbcaa\nvencer\nclareville\nxintiandi\nbackmarker\nwetherington\nboomeranged\nspamminess\ntalacre\nnamira\namphicar\nkayonza\nschaumann\naleisha\nparboiling\nextrememly\nindulis\nnoncooperation\nunreconciled\nhengelbrock\nwordsmiths\nyian\ngogland\naldam\ninauthenticity\nlingang\nfibrates\nbalkwill\nsprackling\nstaros\ncassez\nferiha\ncraigan\npayees\nthers\nheathy\nfundacao\nflaneur\ndiscretions\nanglicanorum\nsheas\ntarum\nrovensky\nbaojun\ntaton\nrapkin\nspiting\nverleger\ntabarrok\nradnofsky\nglioblastomas\ndraycote\nberndtson\nvaraiya\ncountesthorpe\nmasui\nhochbaum\nquaintness\npluckley\nfellig\nreifsnyder\nblanik\npoliter\nappaling\nthell\nandasibe\nclampus\nitre\nschwieger\ngrimey\ncefas\nyotta\nrehmatullah\nschwalger\nlambskin\nopex\nafran\ntristique\nleski\nerdös\nzelasko\nthwack\ntomasevic\nllanddeusant\ndiamantinasaurus\nklonowski\nswithinbank\ncorvara\nfolbre\nlodish\ntatro\ncomprehensibly\nstingily\nkuun\ninnsworth\nsoboroff\nrodder\nhansons\ndunfanaghy\nshivshankar\ndialidol\nclyman\nmagnaghi\ndionisis\ndemographical\nbaroin\nnirim\nbudesonide\ntzatziki\naccoring\ntorlakson\nleigertwood\nyoox\nrekhi\ntaransay\npandeglang\nhatvany\nheidelbergcement\nunordained\nzuider\ninvertase\nhellawell\nfehily\nrecertify\ndodrill\ndymo\nbenicassim\nunstamped\nboluda
\nadsa\ngostelow\nghostzapper\ndemmer\nhassig\nmylitta\numcor\ncorleones\ncinisi\ngreenfly\nshawkey\nnosseck\nscheila\ncentralen\npenrhys\nlockerby\nlinkoping\nabusable\ncolantonio\npitahaya\ncruellest\nmalicky\ncoldrick\ncrammond\nmincey\nburcombe\nclipsham\nmontelena\nlustily\ncmha\nkvea\nvrbata\nhuls\nstorry\ndiara\nmulherin\naedc\nchoise\nganciclovir\ncazal\nnordwall\nfaezeh\netos\nzients\nsetser\narade\nnortheaster\nfhimah\nlewinter\npfra\njrfu\nexasperate\nnwaneri\nonischuk\nispot\njiggles\ntithed\nsofias\nlscg\nyoncalla\nfarrance\nimmeadiately\nfreehill\nnebbou\ncrume\nfrontbenchers\njerrycan\nbontnewydd\nyearby\npressers\npalemon\ndizin\nhockfield\nfaki\nfounts\neataly\nhamood\nplga\nkortan\nunderpaying\nkunimura\nrecommencement\nkteh\nmatloff\nboardriders\nrossing\nsvare\nmssc\nbizhan\nflamstead\nincitements\nchamu\nmontek\nfrolick\nfawzan\ndaequan\negglesfield\nrutube\naibu\nrabiya\nldas\nmusictoday\nadulterating\nlochgoilhead\nfamuyiwa\ngiovana\nscripturally\nhemsky\nrawstorne\ncenex\nabeywardena\nmuffles\napplecare\nwillke\nocassionally\nachten\nsignaler\nmemorialising\nnsbri\nskovdahl\nsurajit\nrehoming\nalldays\nkrims\npracticioners\nautissier\nanthocyanidins\nsneck\nallert\nzwiebel\nkayaked\n,we\nbismarckian\nremedio\nfoucart\ntripes\ncoryphodon\nhddvd\nrevist\nwiggo\nstefanel\ncegetel\nvillalva\ngloor\nhoelzer\ntinte\nsurvery\nexquisitus\nexpertize\nolara\nutech\ncalyon\ngennaker\nphia\nquantrell\nchristow\neunoia\nlatry\nlarapinta\nbozovic\nheredero\nprosperidad\nrotz\nrooves\nlebara\nsupressing\nferencvaros\nberdmore\nlinighan\ncrable\nedolphus\nflirtatiously\noverstaffed\nwitbier\njennerstown\nchocky\ndorschel\npriyantha\nrhydymwyn\nsamaroff\naknowledge\nmaplestead\nripudaman\nbyshovets\ncompliances\nveguilla\nliani\nvichyssoise\nbrownsover\ntoura\nharow\nonetouch\nchattrapati\nwittersham\nogley\nagrotourism\ninergy\neassy\nseductiveness\nctrc\nweger\nshanell\nuludere\nmcgladrey\nneasham\nxilitla\ndocumenter\nhardknott\nmohareb\nfigues\nmotorcoach
es\nrogalin\nkelleway\npostilion\nhypothecated\nstaedtler\nbejart\nkoiwa\npaesano\nincautiously\nthoes\nnarissa\nnumbersusa\nlisabeth\nbelabour\nhexal\nkormakitis\nzhus\nwaugaman\nhommet\nmelgaard\nkondoh\nwenski\nchunari\nmoty\ngongan\nlicenser\nructions\nstedham\nmcwatters\nfirstbank\nzabit\nbeaner\nesplanades\ngarinagu\nroadholding\nespon\nmeasha\nisdell\nwolfsthal\nmohmands\nshedded\nzubeir\nphilippot\ncastronova\nandrabi\njovel\nholkeri\nshrinker\nsealaska\nmcphedran\njordanelle\nlumper\nauder\nlisagor\neletrobras\ngallimaufry\nbreadmaking\noocl\nburrabazar\nwestmoor\nnumerable\ngarçonne\nnanophase\nkanfer\njoyriders\npeñoles\nlitzenberger\nklimchuk\nmadheshi\nchavismo\nsketchley\nbuwalda\nbrasso\njarka\ngrèves\nlorentzon\ntoonces\nadetokunbo\npopulaces\nreinventions\nnacchio\ndongarra\nnapiers\ntreepeople\nshifman\nkandos\nsylvaner\ncurandera\nunimprovable\ncantel\ndlugosz\ndriggers\nmentation\nhoarfrost\nperly\nlasciviousness\nlual\ndeyes\nineradicable\nmxd\njouin\nearlsfort\nscriptment\ncfius\nkoyra\ncanabal\nfreestate\njanala\ngrinko\nboolarra\nyoulgreave\nsynthon\ncente\nryeland\nellmore\nalaykum\nstrel\ncornillet\ncabret\ndeadheading\nfmtv\nyehudai\nkarenia\nusurbil\nleibell\ntransgenders\nrantisi\nnuttal\nshagaya\nholybourne\nkupreanof\nmallorie\nbuddleia\nkhanvilkar\nvilija\ndemarre\nebaumsworld\nsurtitles\nwerlein\nrussets\nuder\nmedicale\npechmann\nvorobyev\nmatzos\ncommutair\nscholtes\nhomestate\nbaddour\nmuminov\ninebriates\nmatsikenyeri\ndeathtoll\nboutsikaris\nhaicang\ngrv\ngigalitres\neurodif\nerzsebet\nrintala\nchalerm\nbladud\nrudetsky\nserrin\nridgen\ndiapered\nkobes\nlarbey\nplaa\npreventively\ncristos\nequalisers\nhoften\nzembiec\ncubbington\ngulabchand\ndunsborough\ngergis\ndebjani\njinshui\nlegan\nwhomping\nostreicher\nmainbocher\nweathervanes\nspota\noduor\nbarisic\nrememberance\naestheticization\npizzaro\ngeoss\nnavanethem\nduramed\nbioequivalence\ndivito\npolesden\ndaurat\naurach\ncarports\ndongsha\ncastellino\namea\njonsdottir\nmaray\n
muirhouse\ncrivella\nnonsenses\nbabyish\ngorebridge\nnailz\nbelger\nuhw\nnoirin\nmatchpoints\nmerceditas\ncalihan\noxleas\nstarrucca\njfr\nballynure\nmotorcades\nzylberstein\nkhachigian\nekker\nerdoes\nequipos\nwrings\ngkm\nkichiemon\nfigley\nstagnancy\nschonert\npathy\nfenlands\ngrawe\nmisdated\nbanquette\npelto\nwaksal\nsportives\nschlepper\ndiala\necolodge\nsmelley\nregney\ndiod\nreddam\nleyna\naircell\nkhotso\noringer\ncrox\nsanitariums\nrpsi\nizibor\npeirano\nbrackenreid\njoern\npontprennau\nwadood\nempedocle\nchocula\nhernon\nworklife\npecina\nwemmer\njohnsonian\nstarphoenix\nwrynn\nirmas\nheelis\ncarjackings\nchloropicrin\nasnelles\ncapab\nomayra\ndtmb\nkyoo\npaumier\nimmortalise\nmaholm\ntanji\ngeoplin\nbeanball\nefficent\nvtg\nhousepainter\nsipora\nadisonline\npepperland\nteacake\nhistolyticum\ndareus\nanwen\nsparekassen\nshayer\nmaiorana\nswatis\ntonetti\nunsellable\nkarawaci\ngalinhas\ndécors\ncloake\nmuridke\nglobalive\nupk\nbabylons\ntsys\nrepacked\nearflaps\ntartness\nlinch\nwhingeing\nkopeks\nmeringues\nvendt\nczaban\nmaatouk\nthirith\nmarangu\n‘…\naccutane\ntemplewood\nspiga\nzenor\nbardai\nvassanji\nchromosomally\npulawy\nrufi\nupsized\nnutgrove\nmdos\nphillippines\nanerio\necotones\nwestheim\ncompean\ndownlinks\nboybands\nwithee\nnhps\npannett\nchristkind\niskandariyah\ncavotec\nneace\nstinespring\nreadjustments\nmcworld\ngusmao\nplew\nfieldworker\nwbig\nmojadidi\nchikovani\nsyda\ncallaham\nkusina\nvixie\ngnakpa\nwolton\nkotey\ntigelaar\nwoeste\noptokinetic\nrizwanur\nkouyoumdjian\nbettystown\nsonza\nalarmists\nrösti\ngeohazards\nfrostrup\nhorseguards\nsignwriter\nerleigh\niansa\nsbisa\nzhezkazgan\nieo\nmetsch\nherzing\nbeleave\nnimer\nmorejon\ncircumstancial\nogles\nzakouma\ncajoles\ngruenfeld\nworkboats\nmincy\npouf\ncuxton\nkeusch\nhomebody\nlesy\nnoele\nfingerlike\ndoek\navenham\ncondotta\nnorbertines\npribyl\nkrasa\nsuperchair\nfedai\nmcac\ngooglies\ntragedienne\nbairds\nsragen\nbaradar\nadolphine\nexercisers\nwhiskas\nsylve\nnotah\nembarrased
\naramex\ncatmore\nboeckman\ninntal\nwalham\naustralovenator\nhimsworth\nmaturi\njoynes\nwrye\nnorthsea\nzerpa\nfluoroscope\npozsgay\nfeebleminded\nabrons\nwictor\nseavers\ncerak\nsayigh\nreify\ncharmings\nfishpool\naroca\nprotectant\nbhutani\nscratchin\nparasuraman\nangrist\nbaires\nlegeay\ntelegenic\ncumberford\nprocaccino\nhilderbrand\nhabberley\nmaringouin\nsiljander\nshipworms\npierremont\nkhansa\nberbizier\ndiprima\nbahner\nselwin\nqincheng\nlansoprazole\ndefrees\nprecipitately\ningliston\nrelman\nvaltos\nsimun\nfrappuccino\nfetchers\nstute\nifco\nsheering\nzukowski\nkristyna\nescobal\nlongpre\nfjd\npasseth\nysi\nkasperczak\nzorthian\nhohneck\nmichalska\npinckard\nduez\nthems\nthelander\ncadburys\nqapu\nsuperabundant\nskirrid\ngolliwogs\nnahma\nbyran\nmocco\ncustomink\nshoshones\njaehn\nadlers\nrationalisations\nbyssal\nneotel\nsentri\ngelderlander\nwichert\nappleinsider\nreincorporating\nelectrophysiologist\nservicable\nrevolte\nschlabach\nelzen\ncollaros\nbunna\nwalles\nnorelco\nenlightment\nneuenfels\ncircumvesuviana\naideen\ndebasis\nshamlan\nsatloff\nchemotherapeutics\nsindone\npharetra\nkatelin\ngumline\ndarvell\nknish\nfunkiest\nreposes\nsjodin\nsives\naddres\nlacq\nrasing\ngumboots\njarron\nbienfait\ngridshell\nyongchaiyudh\nmelyssa\ntorbjorn\nirrc\nalayon\nmazzucco\ntavaglione\nsubmenu\nrachad\nmamenchisaurus\nmclees\nmovietickets\nvideocore\numicore\nhni\nsucession\nburlton\nhorchow\nstoppered\naqel\njiaxuan\nlookie\ncharterholders\nswapper\ngravamen\nyardi\ncoastliner\nadzuki\nestudante\nmouhamadou\npyrotechnical\nkiesha\nkambangan\npeplinski\nyinlong\neurovan\ncommuni\nbrinkhorst\npasian\nzaetta\nmosney\nplewman\nbomas\nlanzafame\ngobena\nanxiousness\nincrementalism\nbrashares\naoac\nbartkowicz\nzytomirski\nkleckner\nmonthlong\ncernobbio\nmainsteam\ndunwell\nzutter\nvesty\npokolbin\npoorter\nmaolin\nmareb\nscarcities\nnapqi\nlonelier\ngarbling\nwinco\nwalshes\nknightshayes\nkosara\ndampeners\nshelat\nnaughten\nhammerschlag\nhypnotherapists\nguzzle\
nunderpriced\nbervoets\ninstallshield\ngerbes\nyalcin\nzombification\nkeidanren\nblocos\ncywka\nbeidi\nmsdf\nanyukov\nberlind\nbeisner\nulchi\norchestrion\nbambury\nilhabela\nrebased\nmakumbi\npseudopanax\nharch\njovially\nihsanoglu\nsalhus\nplaintively\ngurning\nherenton\nsadowska\nchuah\ntemporoparietal\nschmierer\nmarmalades\njbod\nclarie\nkrolikowski\nsleeze\nlalia\novercompensating\nholoman\nmalaguena\nantipasto\nmunduruku\nshalin\nmilliamps\nfalera\nstudivz\nbatad\nexcessiveness\nmysterioso\nfricken\nforssman\nsydbank\nitouch\nresealing\nmalul\nhulkkonen\nstybarrow\nmonteroni\nsned\nkandula\nbaricco\nbaladiyat\nbeachmaster\ngemballa\nhaimendorf\nsuperskills\nchickenfeed\nlorenzon\nhornick\nshirenewton\nsupernews\nizambard\nunterman\nperspire\nentomb\nalfven\nlangbank\nappelation\nassistances\nhazlemere\naspirating\nmaïa\nmobistar\nameerah\nsemisolid\npucciarelli\nborte\nappalshop\nzald\nseini\naciman\nsingley\nhukkelberg\nthng\nadaware\nlisandra\nmotola\nprobaby\nguyancourt\nmediadefender\nmarsee\ngrumpier\narchfiend\nnourizadeh\nbaev\npeddles\nedgerley\nfeio\npinkberry\nanzhela\nwalesby\nskinningrove\nlazell\nreoccuring\nkarrada\nitziar\nleiderman\nmakola\nmellman\nactie\nscofflaw\nthade\nsukari\naruga\nrecuses\nsemitropical\nmaxtv\nlehal\ntopkick\ngelberg\nfeedstuffs\nskorton\naylesham\nquiring\nboardinghouses\npouliquen\nbaseliner\nstosh\ncockapoo\ngolfsmith\njunes\ndjeli\nmallorcan\njahurul\nradul\nsteib\nprevios\nkmx\nfilloux\nceff\noganov\ndeiana\npernickety\nkoether\nnergis\nbides\ninapposite\ntwinkly\nateek\nunfaired\ndantzic\nyazalde\nbanaji\nrzb\nrief\natps\nkeshubhai\nakayesu\ndeutchman\nbernelle\nbacause\nloznitsa\nrouslan\ngalliford\nergasias\ndimauro\nsameen\nfreestar\npimozide\nrunako\nacteal\ngrennell\nmirrer\nwollan\nrusks\nxueqi\nsanitising\nhemerdon\nmultiplicata\ngulled\nriabko\ncrittle\nllanybydder\ndimplex\nsalwen\nsistah\ncaballe\nquarrendon\njordens\nthaci\nsudhin\nmoshulu\nvilem\nexsisting\nalexopoulos\nsubmicroscopic\nwasherman\nwoodl
ynne\nsigurdarson\nposthole\ndiscoursing\nilios\nvolponi\nbeteta\nporical\nbirtwell\nbeleve\ncurabitur\nchanglong\nidvd\ntakach\nhelmetta\nläckberg\ndeterence\nbrenninkmeijer\ncolehill\nsprüth\nkontiki\ntavella\nkavlak\ncelinda\nyiyi\nrajaa\nrabner\nfirebombings\nregarder\nkanarek\nkanebo\nghysels\nmoheli\nschroen\nkierston\nlochbroom\nstech\niati\nbressoud\nkaramoko\nsadaqah\nlayzell\nunflatteringly\ncrankiness\nexista\nbriery\nbolea\nnightgowns\nvoorsanger\nbrondby\nkozmo\nbeduin\nmorizet\ndipesh\nmaiori\nbackrooms\nwwjd\nbasabe\nkessner\nepicor\nhypolite\nsuos\npettigo\njaverbaum\nverro\nmunley\npantoprazole\neffluvia\nardkinglas\nkakimoto\nhomebrewed\npriding\niosia\nlouv\ndege\npcmc\nbidon\nhangam\nasthal\nchapatis\nchildrenswear\ndigiacomo\npatissier\nmerling\ngasport\nseima\nlizer\nassadullah\nsynergos\ngeske\nberlie\npreljocaj\ndellis\nleemann\nnigiri\ninformaticians\nmejorado\ntoano\nbeagrie\nnvcc\ngazprombank\nnejame\nfreightways\nkaiga\ncornfed\nbundeskriminalamt\nsleepwalks\nmcnevin\nharrumph\nprotandim\nraheb\nvaleriani\ndigium\nmuradi\njinong\ndoanh\ndemuro\ncaniparoli\nalípio\nbloodsworth\nsetara\nflus\nrephrases\npalong\nswimmin\nobeidallah\ntimewise\ncugno\nschickedanz\nschrieber\nvanson\nunoffical\nvertis\nfircroft\ngerding\nbetterments\nshappi\nmydoom\ncoquelicot\nghillies\nnaq\ncompounder\nsarshar\npachulski\ndolbadarn\nbilinda\nobviosly\nsutpen\nefra\nseclude\ntysk\nlakotas\nhassin\nkrawiec\nlaroussi\nraiffeisenbank\nmiyachi\ndemmler\nmaccagno\nitau\naaq\nanabaa\nderakhshani\nglenolden\nexegeses\nraisani\ndeathstar\nmarzabotto\ntransparancy\nshyan\nscrag\nkidscape\nzellous\npalios\nchequebook\nhirshman\nkamdar\nsackur\nsihala\nufdd\nbracelin\nsilvercup\nbomberger\nawat\ncamin\ncrdi\nshroder\nturle\nyolly\nmatumbi\nsiner\nelectability\nimagemovers\njosepho\nmatchbooks\nhakizimana\nsanli\nmarconnet\npeskett\nbibikov\nermina\nvaliance\nbunglawala\nkelser\nhese\nrantissi\ncotteridge\nwbbr\noreland\nsaharicus\nunderuse\nolexiy\npushman\nophuls\ncresa
ptown\nkruje\nmeyjes\nadduci\nmeadowside\ndenburn\nimperatively\ngibsonburg\nradimov\naquacultural\nconglomerated\ntelemaque\ncytometer\nfreepbx\nechave\nautumnwatch\nvucic\nstruever\nflagger\nmcmenamy\nvenki\nashely\ndhis\nreselection\nslouched\ngomboc\nyurek\ncontainerships\nltps\nolaniyan\nrayuela\narousals\nenthralls\nanora\njoff\nbjornsson\nsilvestrini\nkadhir\nrepub\nvaitheeswaran\nkwatinetz\ncareflight\ndeadrise\ngorlov\nmeneguzzi\nlanschot\nenergo\nfawehinmi\nsapsan\ngypo\ntubemogul\nhysaj\nanovulatory\nolesko\nmafai\nviaud\nnikias\nfrisking\nnuthouse\nbuaben\nvorobey\ncalculability\nmazzo\nponikarovsky\noriginalists\nbadonkadonk\nundersize\ndemocratising\ncageprisoners\nelsztain\njumpstarted\nframley\ndecisionmakers\ncayard\nyunyang\nmestia\nolofson\nalberca\ngige\nilisu\nyechury\nmagistretti\nhubin\nrigopulos\npipefitter\nleganes\nmalah\nproceedures\nprotokoll\nshooglenifty\nercolani\nattus\ntechnosphere\nrobidas\ndisemvoweling\nmarangos\nrolfs\nferromex\ngarics\ndecabde\nlaronde\nawy\nffvs\ndawdy\nfunso\nlambasts\ninso\njoesph\norensanz\naztar\nwkys\noverstayers\nmudpuppy\nbanji\nvisualises\nlifenews\nhubbel\nkotake\nnavez\nitrc\nchipo\ngarmash\nosbi\nrasner\njamarko\nabilify\nfpk\nkaneta\nwoodfox\nfauzy\nmesserschmitts\nchessani\nfudgie\nmeinke\nreproving\nnaím\nrouart\npercee\nsergiev\nnonwhites\npolarbear\nilink\nlicken\nnowroz\nacamprosate\nwahlert\nmanhandle\nembracement\nnorihito\ndemonising\nhosepipe\neurojet\nshoegazer\nungers\nhawtree\nbillesdon\nzysk\nwolfish\nrazmadze\nkumpf\nmannschaft\nhartmans\nbruzzese\npognon\ntintina\nathough\nknep\nyigong\nchangli\nrautela\nlakisha\ngodfray\nlencquesaing\nmicrogrids\nmularoni\nmusabayev\nkiptoo\nblagojevic\nromley\ntaphorn\ndepiero\nrecyclebank\nposthorn\ngharbiya\nvijai\nsamwel\nformely\nblueway\ntaltala\nmacmath\ncalenders\nbeshty\nmoneyweek\ngrotjahn\nansal\nstenmarck\nsitubondo\npedraja\nsweeden\nmankulam\npiquancy\nradicchio\nrahmstorf\ncpsf\nhomogenisation\ngettinger\namazonians\ndeceitfulness\ngal
ard\nvarennikov\nigam\nguillame\nmorie\neisteddfods\nherges\nshahwani\nmummify\nkeilson\nyounas\nbuik\nlagrossa\nmolcho\nanthracyclines\nvoelz\nleiba\nngoo\nrouco\ngyrfalcons\nbessinger\nspeek\nfreeville\nseastrunk\njerilyn\npedasí\ndustour\nteletech\ncarerra\nbundchen\nverle\nnystedt\ngarceau\nindiviual\nabbeyhill\nsetlur\nacadamy\ntraumatize\nwrair\nwams\ntrahison\nalpaslan\nsumichrast\nsals\nbaloon\ncinquanta\nbambach\nlaabs\nproxim\nsulamani\ngensel\npetrosa\ntunefulness\nskinnyman\ncontres\nardington\ntsakane\ntrealaw\ndelaire\nrechargable\ndirest\npluijm\ntiddington\nsouchong\nisues\nbagful\nevasively\nthrustssc\nprotofeathers\nmaffett\nmenarini\nrecapitalizations\ndanielsville\nurofsky\nistúriz\npolivka\nheier\nbassim\nsterban\ndarlins\nrodkin\ngaubatz\nbilked\nwech\nllop\ntharwat\nanelay\nstaredown\nneglible\nexcoriates\nsantour\nitabira\nruibal\nscarnecchia\nbiggi\nzuabi\navriel\nforebearers\nmuireann\nkaturian\nmazola\nphysiatrist\nbalintore\nbatmanghelidjh\nunitaid\nkuqa\nperkis\nnebulously\nstaddle\nnegret\nhaarsma\ngreatstone\ncastlepoint\nmitx\nsentell\nuui\nbongers\njoles\nsoyabean\nmallusk\nalagiah\nzft\nminiatur\nmadlung\neclisse\ndarbus\nvotives\njonck\nstorari\nhajigak\nsaderat\ngiannantonio\nbessonov\ntimmonsville\nforbis\nextrahepatic\nschemm\nconsorte\nbarrabas\nsolness\nmedflight\ncritisize\nadipocere\nlingmerth\nearthshaking\nvilleta\nkuoni\npeacejam\npugilistica\nalethia\neldercare\nbokator\nboilen\nskandar\npotager\nmoehring\nlemmerman\nthumm\nsudlow\nclipboards\nmabasa\ngulnar\nnightdress\nbendon\nschwetz\ncrosshatching\nchlorothalonil\nechikunwoke\nretrench\njunggar\nbatchelors\nkonovalova\nhsms\npursers\ndelusive\nzedkaia\ninterviu\nschlissel\nstillwagon\nbocaccio\nneurostimulation\nwankery\ndialoge\nlatavius\nplasmonics\naugmentee\nholc\neveryway\nolgas\nsuperlove\nkettley\nnikaah\nredshanks\nmuneo\nroup\ntiggers\nherv\nnonfried\ngraffanino\nmoggie\nbarbules\ndoormats\ncumbias\nficca\nruzowitzky\nmatsakis\nmindarus\nastrachan\nhersman\n
urick\nyaxham\nrovos\nvoser\ntanden\ncecils\nwaterphone\nawj\ntouchez\nnaats\ndauenhauer\nyifat\nmakiya\ndece\nazucarera\nerinys\nmiltonic\ndietel\nwoollacott\nnobo\nbomback\naxehead\nlaight\nvalueable\nparacentesis\nbobba\nkorchagin\nlivescribe\nfangchenggang\npuello\ncryder\nbabie\nzett\nmammadli\ngenious\nclytha\nphotomasks\nrinuccio\nmaaninka\nunsuspectingly\nprufer\nmitz\nburbling\nuppie\nbolzan\novershoes\ndayao\ndefraying\ndilnot\nintercon\nmackiernan\nbreece\neargle\nswol\nhirotada\nkator\nsharapov\njurinac\naldwick\nsupervia\nsewanhaka\ndubee\noldwick\nquines\nhorwitt\ncymorth\nonekama\ntheobold\npsychotropics\nhailin\ntoymakers\npommery\nmovs\ndrigg\nmistrials\ndeutekom\namericaspeaks\nvodcast\ncoalburn\neurostars\nthilina\nwojciechowska\ntshuva\njccc\nbrys\nniched\nhaukaas\ngasiorowski\nironville\nworldgroup\ngorbach\nesentially\nluminar\ntoshimi\ninsouciant\naenean\nalesina\ntempodrom\nbruzelius\ncammarelle\nrachakonda\nllangynidr\nlaguiller\nbuteau\nbeadman\nraül\nrenita\nantiperspirants\nmodrikamen\njurisprudent\nforwardness\ngodlee\npeseiro\ngladsome\nberting\nadjud\nflapjacks\nadelsheim\neskender\ndescibe\noanda\nzimride\ninterveners\ndescargas\nglympton\nlewer\nsowerbutts\naxenrot\ngazimestan\nholtsville\nyoli\nbuggins\nmoredun\nmaduka\ncybrid\ntransgaz\nejup\nlesmo\norston\nwatsco\nperpetrates\nglyndyfrdwy\nelasticated\nnoveck\nonetto\njanow\ndeejayed\ndisemboweling\ncaracappa\ntiteuf\nschoenmaker\nkayte\nnahayan\nsthe\nstanich\neastaugh\ndruridge\nouja\nwissman\nbellizzi\nbayron\nmairtin\nthiazolidinediones\nfriers\ngofio\nméïté\nnonmarket\nsamassa\nfornicating\nsotg\nwatelet\nswatragh\nsayem\nchhum\ndrossel\nreguardless\nmuoi\nvomitus\nthielman\nconnerton\nchambas\ndisfunctional\nzelalem\nrheinstein\nraai\nkktv\nyuccas\nlittlebury\nussi\nhejda\nmithaq\naetn\nbueb\nbelladrum\nbarcina\nintermec\nkaniel\nfoxhills\nwithholdings\nbozarth\nklingensmith\nhasay\nkarthick\npetzel\nmeyong\nmeasurers\nlongplayer\ngolddigger\njuki\nmalheiro\nrajaiah\nlanasa\
nvenders\nacig\nsherone\nsparer\nopels\ngobetti\ndosser\nipass\nastal\nlerone\ncarvana\nwaffled\njells\nblackridge\nchaenomeles\nndidi\nhooted\nzoladz\nmarcle\nleuci\nborbolla\ncattley\nkharaz\nkissology\ncetp\ndestler\ndualisms\nvallecillo\nantivir\nindulgently\nfasque\nchatshow\ncopiapo\nhanaro\nthelonius\nedelmira\nicbs\nsnakey\nwixon\ncobbins\nhornel\npantala\nlongwu\ncoudl\ncamfed\njerrys\ninforamtion\nliebestraum\ndismuke\nbcap\nboeve\noohs\nkuwano\nmanuelo\neagled\npeacehealth\nlihong\nfukumitsu\nexoatmospheric\nfolias\nkellingley\ntechworld\ntalula\nspyer\nmenis\nilaje\nsmukler\narton\nwrests\ngargar\nleli\nllansadwrn\nkjrh\ncarfagna\nclaques\nengelder\nnsenga\nseyam\npagodinho\nhueck\nnamias\nssab\njailani\nbellboys\nrazel\nexcitements\nktab\npozen\ndudhope\nbuttoning\ntocq\nafilias\nantidumping\nmckirdy\nxwd\nbagaran\nlauk\nsapos\nbacary\ncynllunio\nstentz\nsecos\nsanita\nnitkowski\nmiv\nkosoy\ngarroted\nwfg\nsandier\nbuckelew\nunles\nnonspeaking\ntrentadue\nyanet\nmagnarelli\nswails\nhwacha\ndohn\nmulticentre\ncultivatable\ntiefenthaler\nisilon\ncadstar\nvondra\nnores\nhollyrod\nmugnier\nkollin\nbonifassi\nalamshar\nkyoritsu\nmajilis\nboulkheir\nfunctionless\ntawse\nsélavy\nunlicenced\ncorryvreckan\nstavsky\ncruzadas\ncounterman\ncantlow\nranitomeya\nkondos\nbankolé\nmastectomies\nschwarber\nfranczak\nchapela\nrelook\nsecura\npetersens\ncrombeen\nfirstar\nivedik\nclickthrough\nappartements\nwisco\nglobalcom\necclefechan\nrospuda\nflowline\norz\nakunyili\nzeldovich\nshulan\ndemetz\ncammon\nandys\nncfe\nlodis\nlemas\njerkens\natmosphères\nbucketful\nschettler\ntravessa\nworkshare\nturfing\ntulliver\nsiguenza\ndusshera\nycd\nweatherstone\nharrietsham\nautarkic\nsuspiciousness\ntapiriit\nkanatami\nambuehl\nditmarsh\nsaletta\ntraister\nscouller\norama\nalekperov\ngodik\nunpolitical\nomakase\nberenato\ncaligaris\nknisley\nprattling\npctv\nlladró\nmizzima\nhabbush\nryugyong\ntogther\nshahra\nfogden\nlincomycin\nnecas\nkallay\nblogposts\nmcaskill\ngabeira\nzolota
s\npinggu\ndagano\nbedum\necover\ndonges\nrodamco\nmeasureit\nfragged\nsonawane\nsilverjet\njieshou\nmaleeha\numph\nbolani\nwhiteleys\nglutted\ndiamondville\ngyamfi\nsansbury\nguterson\nhijinx\nsuddently\njkx\ncorpwatch\nmindich\nneutralises\nautobody\nstube\nweltz\nhendrina\nkanodia\nnastassia\nminehan\nurac\nsuraya\nkandahari\nquashes\noverfill\neronen\nviatical\nrechtman\nambac\nstuhl\nwallice\nggr\nmcfate\ndanaya\nallerston\nelphicke\npensby\nrentiers\ntoldot\nocalan\nyeshi\nphytase\noaksterdam\nllanbedrog\neboni\ndoerfler\nbentson\npeasmarsh\nsanes\novl\nluwan\nrelin\nalih\nfurnham\npackiam\nkaties\ngatten\ncinemedia\nengagé\nfrancky\ninservice\nwonderstruck\nklatsch\nrolanda\nrabit\nppts\nfurrey\nerté\nbrase\nccim\nhasner\ndaithi\nnemone\nliebeskind\nconnive\nfack\nsunliner\njalava\nsabesp\naleya\nloitered\nspidered\nglowering\nbrierre\ngilsdorf\nziliak\nklau\ncesaria\ndissembled\nilot\nivn\ngrondahl\nscaw\nsohlman\ndapoxetine\nesearch\nsielecki\nkamai\nstulberg\ngirs\nseamier\ntokuji\nneysa\nwenfu\neuroregions\nbruggeman\ninundates\npreciously\nvitellogenin\ncondign\nredken\nmytchett\negdon\nalchin\ngomorra\nnorthwold\nwrightman\nmullholland\nvomero\nisuppli\nkaanapali\nnemenyi\nleiken\njasmonic\nkimolos\ntuckshop\nadductors\naliante\nshichahai\ndownswing\nnicean\nmontès\nashforth\nnonthermal\niju\ntakiji\ncheerily\nharithi\ntorchmark\nnoncommunist\nzappalà\ntamsulosin\nwruck\nrijks\nfootwell\nsarcoptic\nportlands\nhoreca\nvaporetto\nitemizing\ngrousing\naliabadi\nisehara\nsadoski\npenhill\novercompensation\nlarders\nprevaricated\nchengjun\nrankles\nterramax\nnewsfeeds\ncfma\ninat\nantoniazzi\nwssc\nbrinksmanship\nraisuli\nfloorwalker\ntucher\nshahrokhi\ncxs\nakbay\nlookback\nstatfjord\ncorbier\nunfindable\nwebcrawler\ncrunchies\nslabber\nintragovernmental\nhenleaze\nyarza\nbanjaran\ndecant\nlandsea\nhemispherectomy\npanpipe\nreindl\nkarasik\nteith\nscatchard\nbhulaiyaa\nfirebases\ngleissner\nbevell\nlipin\ngarrin\ngenon\naccomodations\ntrimarco\nsnappin\ncal
iburn\ncouponing\nbriefness\nimsc\ncrialese\nsiic\ncostella\nmoisan\nhypermasculine\nphillipi\nlalka\ncalinda\nviad\nrgcs\nvillepreux\ngrenvilles\nfenninger\nexemestane\ningonish\nwellheads\neltahawy\nccls\nballynoe\nganne\nbeidh\ndelshad\ndersingham\nnmtc\napplin\nchenowith\nzaner\nlodato\nlittlestone\nllanbradach\nkammback\nmcinroy\nduchaussoy\nkalthoum\ngoldenrods\nmiyares\ngehan\nbainian\ncoverall\nburckle\ndismissable\njuridica\nlapdance\nlatshaw\ncariparma\ndiscribe\nkyubey\nburmistrov\nsnowbasin\ndoliche\neurofins\nlucians\nkersa\nmceliece\nvolger\nbazmee\nevenin\ndullah\ncliver\nnovant\norsova\nkaniskina\nmoderniser\nskydives\nlivestreaming\ninuits\nkaradere\ncoagulating\ngrayton\ntribler\nangang\nkinking\nunamplified\nderyk\napanowicz\nzimpher\nguled\nguyanan\nsanvicente\nzajicek\nequilar\nkeidel\nlietzke\nzayo\nsydling\nfoetid\nbims\njaczko\nbartenev\nszajna\ncarjack\nravensbruck\ngrenoside\nchebeague\nrakau\ndesflurane\nendourology\nreprogrammable\nrepossessing\nshakier\nboykoff\narifa\ndussel\ntayseer\njawwal\naverring\nwhitelands\nglaubitz\ndeclo\nhamot\ntouchpoint\nlevelheadedness\nzingy\ngreencine\nkneza\npardners\nstrech\nbioversity\nsarandos\nmotorisation\noyebanjo\nclaretta\nhedis\ntooba\nbotolphs\nsifma\nlanty\nimmunochemistry\ntransorbital\npiech\nalegrías\ndebell\ndenish\nfarnsfield\nprankish\nyouga\nclaudino\ngramozi\nparenté\noverstretch\nstadnicki\ngerecht\nbandurski\ndarshaan\ncoachhouse\ngriped\nnuveen\nsampanthan\nwagonmaster\ninfeasibility\nbernot\ndecock\nbuttering\nexcisional\ninternationalise\nkacho\nnanok\nkuehner\nvalesky\nbahloul\nlyncombe\nbuckholtz\nseptembers\nlamen\nloadmasters\nqualter\nbonymaen\nfikile\ndallal\nmofetil\nprosor\ndefamer\nrefocuses\nburglarize\nlierop\ndhimma\nprabakaran\njabbo\nbounderby\ncowpen\nforby\nredzepi\nshvartsman\nhattrup\naltantuya\npundir\ntrackpoint\ngadling\ntrellised\nnyarota\nfalbo\nkabiru\nsadlowski\nlabeler\nzuerst\nlcci\nhinche\ngoverments\nwaygood\nfraph\nsculpturecenter\nmvu\nsuperlatively\
nbrozek\nsathers\ndirtying\nbekkering\ndevor\nolafson\nwoodmill\nencomiums\ncocarde\nsuleimani\nassocia\nblazars\nspätlese\nmandil\nabendzeitung\nidph\nlenagan\nseagoe\njiahua\nsunja\nnipun\nchowhound\nloctite\nstricly\nbrisseau\nsuheir\nswaggers\nlistyev\norleanian\nbriney\ntdvision\nluty\nzahau\nairfone\nbonifield\nmyit\nmantero\nencrustation\nsceptically\ndcli\nexergaming\nhermosilla\nfops\nconforme\nbagans\nhomochitto\nshabi\ncybercriminals\nelectrosurgical\nmayodan\ncoulport\nboeings\nclaria\nmagney\nllansilin\nfrenkiel\nrocawear\nkansha\nstrathdevon\njovic\nvelina\nkttc\nmomix\nlieske\nkovels\nsturckow\nnmv\nheteren\ndrubbed\ndrange\nsmocking\nmpact\nomino\npikey\nalschuler\ndiscolour\ndeconstructionism\nwentwood\njohnsgard\nllanystumdwy\nljs\ndaniken\nmerilyn\ncozen\npatoski\ndisinter\nsystran\ngreycroft\nkasparaitis\ncognex\nsassoli\nleckford\nsubmachinegun\ntiruchelvam\nshockumentary\ndaling\nbihi\nmargeaux\nmelodicism\nfreeroll\nwisborough\nlangho\nduyen\nkuder\nlambertson\nmohácsi\nramonti\ndischer\ninelegantly\nhyperthyroid\nrinascente\nboudu\nmasnick\nlawned\ntannous\ntawn\nmvj\nmnos\ndamaschke\nporetta\njemina\ndugher\ncopayments\ndarus\nmarissen\nledingham\nslifka\nwindrows\naerosonde\ntxi\nskarstedt\ntimolol\nappeares\nsisyphos\nobjectiveness\nclaster\nbirand\nallessandro\nzúniga\nearthmover\nfiddleheads\nkuensel\ndrayer\nmaugeri\nbusying\nfogl\nsmedberg\nargumentativeness\nfantana\nremko\nannoucement\nfradin\nrouche\nbizzaro\nmovius\nshakeri\nwafts\nrunwell\nscorches\nfiretrap\nasist\ntukiainen\ngentlemans\nphytomedicine\nexpocentre\nsemipermanent\nserogroups\nlatto\ntwisties\nexterieur\nmortella\nporkers\nfontanarossa\ncamoletti\ngaliazzo\nkurskaya\nesoterically\nfundaments\ncallosities\npulposus\nwispers\nandouille\ncranor\nkoebel\njeesh\nmakaa\nshebeens\neifman\nkirkconnell\ncryptococcal\nidrs\nlechter\ncorpach\nsneem\nmewelde\nguffaws\ngounon\ngroux\ndarbepoetin\nalicat\nmehnert\nuyana\nchillis\nkabbage\nbrinsford\nyahir\nphoneline\nsisario\nybf
\nstrimple\ntangley\nbuchinger\njabriya\nbaradero\ngoffriller\ndesousa\ngreenstead\nmoonrakers\naracruz\nseehafer\nacrobatically\nvegoose\ndunivant\nflexa\novr\nyongpyong\nseib\ngiugni\nnesc\nmetzer\natmospherically\nqof\nseiple\nskinnies\njiangyou\ncaulkers\nanane\nfreixenet\ntoubon\nkhone\nrtlm\nmccornack\nprecharge\nsimonside\nservas\ninsadong\nbargeddie\nstefaniuk\niyabo\nkandace\npsca\nkudlak\ncotinus\ntapwater\nkhamdamov\ngutka\nchungmugong\nrenno\nrupiahs\ncoykendall\nineluctable\nmarketo\nviennale\niqair\ncostales\nmorwellham\nmainsheet\nyufei\ntaione\nkalousek\nalderwoman\npleanty\namirah\nwillsie\nfulls\nboskovich\ntipsters\ngovernator\nranst\ncointrin\nunequivocably\nfinlaggan\nshovelers\nsanbao\nrkh\nwjno\npseudobulbar\ndrabkin\nmefford\nheatlie\nbakulin\nhammen\njarrettsville\nmcglohon\njuyuan\nsniegoski\nstylishness\nblogher\nfreihofer\njorritsma\ncontroling\npichit\nthommie\nkoyuncu\nsericite\nzygomaticus\nblubaugh\nsubdirector\nrepackages\nrolnick\nstemcell\nkambiz\npufas\nstiger\nitalcementi\nfroomkin\nslw\nconfusional\nnoureen\ndhia\nhutchesontown\nlanx\nladbrooke\npmac\nclariant\nzupancic\nlupron\nstitzel\nbeamers\ntarquins\nsnowsill\nerinle\nkinneir\nfobbing\npilcrow\nmisjudgements\nferr\nhyunmoo\nchoquequirao\nphobail\nbiosolutions\nmarquetalia\ntetrault\ndehar\ngih\ndahej\npostsecret\nbrambly\nmyoelectric\npawk\nluttig\nepassport\npolitz\ncerith\nfelafel\nmuthaura\nquazepam\nhoody\nestabilished\nenvenoming\nchilcompton\nfilaria\namrhein\njobos\nfillette\nspumoni\nhomepna\nruckert\nsuccessorship\npopal\nsuperfoods\nvallourec\nkuomingtang\nmistwalker\nstraumur\nschnepp\ndrypoints\nmardel\nkleins\nfleisig\nevarist\nacros\npadnos\nwolfes\naloka\nlhari\nnaturalising\nsekisui\nkreiser\nmargarines\nmoalim\nantuna\nsheils\njinbao\nsummerhall\ntrewick\nnanoelectromechanical\noskanian\nhospitalfield\nfinkelhor\nuconnect\ngassco\ngurneys\ndiscribed\nbathford\ngornick\nswearword\nlimted\nbdelloid\nbeardsmore\naveos\ndeyhim\ndasti\nhanagata\nmexi\ndorianne\
nmakine\nhhf\nbating\npastafarians\nsaveliev\nverbalizing\nsouchak\nkapuscinski\nzitzewitz\ngoapele\nachinoam\nshafrir\nfujicolor\nbandannas\nmamajuana\nwildheart\nsantopietro\nnephrologists\ndamore\ngajic\nvastine\nbouris\nhulshoff\nperigord\nrichel\nbotanics\nzbt\nshangai\nfasciana\nmédicas\nharren\nterro\nzuyd\nscarper\ncooktop\nduddridge\ncfas\noshinowo\npetaflop\nravenhall\nbarouch\nzachos\nsuperlink\nraigmore\ntyuratam\ndacor\ncosslett\nopenaccess\ndisgorged\nviscuso\nbinson\nscofflaws\nlasix\nphyliss\nconstructionists\nmqtt\nantasari\nrepond\nheyningen\ndattner\nindissolubly\nunperson\nhbh\nlgo\ngoman\nlongbranch\nthando\nadvan\nnyrstar\nhaubrich\ndevang\nnolden\nherremans\nruatoki\ndadonov\ndomitien\nqingyi\nnelida\nloudin\nmoskovskiy\nrheola\nwined\nhucheng\nvadala\nmarzu\nbegles\nhuus\ntalati\ndevanand\nsumberg\nmaltagliati\nlcia\ngwernyfed\npanau\nabchurch\nsophon\notterman\npushcarts\ndollaz\ngissurarson\nballymaloe\ntrinities\nfarke\niriki\ngogua\nbolsinger\npuzzlingly\nscarre\nenvironics\nmotorweek\ngeobacter\nbahaism\nundreamed\npalfreman\ndestineer\nopcon\nkromah\nroboticized\nmidocean\nenti\nsynergie\nmattice\nneval\nfortwilliam\njessicah\nhuegill\nrebollar\nbaltin\nteath\nclaassens\ndealmakers\nfiraaq\nhexthorpe\nworldpay\ndynatac\nshivery\nsonsini\nsinglaub\nsledged\nleyde\nwonted\ndifa\nsasac\nplowboy\nvinokur\ncopple\nsurprenant\nclubjenna\nyetkin\nstoneback\npescow\nspeonk\nbarbalho\nmetastasizing\ntianqiao\nletona\nshapin\nmaendeleo\nvelutha\nesperon\npromus\nencima\nbunyard\ndolphinton\neob\njorrocks\nmephistophelian\ntahmina\ngaffers\nmullaghbawn\ncamarão\nmillworker\nimportances\ncaraguatatuba\nrequite\nbirnbeck\nsensitise\nteets\nvasic\nbonnerjee\nschuring\nlightering\npeens\ndossevi\nvandelay\nbeitunia\namae\ngorer\nmullikin\negestas\nhodos\nreadaptation\nragbag\nsporozoite\nugtt\nmerillat\nhric\naneesa\nbellyache\nvorticists\nsybren\noruma\nlabarbara\nprognosticators\nopdahl\ngritt\ntwelfths\nefstathiou\ntrinians\nlatty\neche\nrainone\nb
egger\nasopos\ntagula\nkeithsburg\nwajed\ngryder\nballyhalbert\nkinal\nmcbath\nvolokhonsky\nsamaw\nreignier\nxvycc\nriverdeep\nremifentanil\nllangeitho\nbazalt\nsudlersville\nforsters\npleck\njrl\nldrs\ngregorek\nnosegay\nbizos\nfruitbat\ntortuously\nhlongwane\nestablised\narram\nrodric\nguntars\nbeaned\ncoxford\nrugamba\njiqing\ntrustwave\ngokhan\nnikopolidis\ncalow\ndobroshi\ndenosumab\nkuhlmeier\nsuilven\nandorrans\ndudum\nkaokoland\ntotterdown\nandas\ndetering\ngerashchenko\nmussenden\nlouisianans\noverfilling\nstepstone\nmylod\nducci\ngraterford\nskeer\ncmgi\nraffan\nbeadling\nmontelukast\nlignac\ntrowels\nwanty\nzucconi\ncomis\nterios\nmetallised\nkorchak\nfrangos\nterril\nglulam\nparthy\nwinningly\nprototypically\nmargarett\nabers\nslvr\naaviksoo\ngamestation\ncingolani\nchavistas\nfrankin\ncaig\nestime\ndistend\nhccc\neditoria\nszarek\nitacoatiara\nmaleng\npekovic\nluzio\nshafia\nsukur\nmayakovskaya\nnapfa\ncompliation\ndubowitz\nsenrab\nukirt\naedan\nmonfries\nthinkfree\nmorses\nwazzan\nspuc\nreauthorizing\nbornedal\npsfk\nshimron\nbormes\nabberation\nsawchuck\nmeredydd\nelefsina\nstrines\nquickpath\nsiebrecht\natholton\nnourry\npgti\nbarreca\nserralunga\nchurt\npoliticises\neatables\nefdss\nrebeccah\nallander\nsportingly\nnuiqsut\nmoneymakers\nwpad\nfadela\ndemerol\nhellebaut\nphleger\nlixia\nciocca\npretentiously\nanema\npaim\ngoldwire\nzarghami\ncourtnay\ndharmsala\nborodulina\nfumento\nmoge\nfanne\nivus\nborissov\noroshi\nnahem\nsurrattsville\nwhorton\noltrarno\nintrapartum\nweaponised\nbvlgari\ncosmovision\nteetotalers\nfiror\nnanabozho\nthriftway\ndevided\npuffiness\njitan\nmegacorporations\ndhotis\nnoncommunicable\ndigitizes\naccutron\nbrecciated\nshacklewell\nshapoval\nrhl\nbösch\nlonglevens\nrauluni\nrawaqa\nonta\nhujar\nbrainiacs\nkaloyev\ntrefnant\ntemtchine\ncaseville\nhenrythenavigator\nunstaged\narchlute\nforgotton\nnikai\ndellert\nuncontestable\nmattew\neroski\ndribs\nnishinari\nrvca\nvernerey\nberkery\ncdic\ncorralito\nkambakkht\nrudlin\nmou
ntville\ndesisting\nspliffs\nemmbrook\nayoola\nlavigueur\nmykhalyk\nremonstrating\nhango\nnasturtiums\nduhigg\ndanek\ndiebel\npiccioli\nraham\ndalakhani\ndevenney\nmarimon\nsavvides\nandren\ncapons\ncelam\nhachiya\nfendley\ncapezzone\ngredler\nwihout\nminety\nfourrier\nischgl\ngammoudi\nhiorns\nwolaytta\nharringtons\ntamis\nrabbinically\ngurton\ngusha\npotentialy\naronsohn\nreuil\nbowflex\nlassally\ninvitro\nvancleave\naaea\ntoquero\nrepeller\ncsz\nholah\nroekel\nantipodeans\ncaracter\nobertauern\nconvulse\nhojatoleslam\nkoudou\nshobe\nhccs\nwhitneys\nhakkinen\nlyson\ngevalia\nwaberi\ntaho\nrozycki\neuna\nmarhaba\npiment\nfdx\ndenvir\ntienne\ncamelbak\nmanaa\nwelll\nspectate\nhendro\nbeghin\nmoggio\nhailong\nminjiang\nakuressa\nmcquilkin\nsharifuddin\nschaffrath\npowerbrokers\nsyomin\nwahabis\ncraigiehall\nchanctonbury\ndemonica\nxinyao\nidealistically\njossy\ngutrune\nprovenge\nkatsunuma\nshafii\nhockridge\nlambrigg\ndynan\nadultos\nendostatin\npostnatally\nnephrolithiasis\nisett\ndissagree\nmahanthappa\nbachia\nnameable\nmoger\notegi\ntenero\ncélimène\nandijon\nraveh\nprysmian\nsaartjie\nzhaoxing\nbernadina\ncastanon\nnatig\nvinti\nhablan\nniggly\nstrib\nmarill\nridleys\nalopecuroides\nresurrexit\nstopbadware\nmicrocell\nairola\ndécolletage\nsuccar\njck\nguvera\nsingstad\nwidders\nvulnificus\neugena\nbacklots\nproblably\nslickest\nmanufaktura\nwalravens\ncouchepin\nbenzies\nboreland\nduree\ncwmdare\nnarrowbody\nguangyan\nzongchang\nblessedly\ncorridas\nrambukwella\nautoglass\nabelow\nbendukidze\nbintliff\nmoodswings\ncullotta\nmaneuvre\narchaeoraptor\nhintertux\nmanischewitz\nintermingles\nlavendar\nmelles\npetrifaction\nlivened\nmaesbury\nulic\nunhealthful\nbergerson\narone\nstuffings\nhalekulani\nantiquing\nfalutin\ngreenwash\nbriguglio\nnexans\nhighlighters\ntogus\ntejarat\ntishby\nstigman\ntalywain\nchevis\nogundipe\ncansdell\nviaggiatori\npurslow\nlaxfield\njingoist\ndreamspark\ncantagalli\nggf\ncrashlanded\npratice\nungracious\nconsignor\nblueblood\nsibeko\n
baliem\nemisoras\nboraas\nknierim\nmaycon\nqasab\ndeadmarsh\ndomoni\ndisent\naqraba\nmyah\nramandeep\ninsua\nperspicuous\nhawe\nmerval\nchulak\ndaltry\nrakhim\nappartient\nreginiussen\nmaundrell\nneurectomy\ngokana\nsteuerman\njiayuan\ndpko\npollsmoor\narchitectonics\nmccorry\nseatown\nbellapais\nfasion\njoyent\ncoherant\nsalonpas\nkakas\nepiglottitis\ncalvez\nwilmers\nseagrams\ngingery\nbregovic\nzhenghua\nlangel\ngovea\nsyer\ngamburtsev\nspongers\njessops\nmonke\nbarthet\nmaxym\nscheppers\npedicures\nfpsc\nvaadin\nreclusiveness\niceworld\njahncke\ncorve\nworc\npurtan\nmunita\nadigun\napprovers\nbenzoylecgonine\nbertuccio\nsalmonis\nmpsa\nmandatum\nriddley\nwaulking\nwlk\nestranging\nicx\nplanespotters\nmedicity\nmononitrate\nbonvoisin\nlectularius\nherculez\nwannenburg\nboxcutter\nusdoe\ntsna\nlerberghe\npoynt\nrohrbough\nbarakeh\nwillowfield\npiñón\nyock\nranthambhore\nbiotherm\naanr\ngerut\nbrisben\nnormark\nantwain\nkhiem\nturnstones\nlisser\nrahho\ncowpoke\nmillio\nwildgoose\nrouvoet\nfrisoni\nindefeasible\nljuboten\npullard\nsanborns\ntriterpenoids\nfirefest\ncohosts\nyassar\nkoliba\nbrutt\ndatblygu\njemaa\ncliquey\netling\nporca\nheezen\nfadli\nzirbel\nisidra\nordener\nsimari\ntlacaelel\nolowokandi\nseiha\nananiashvili\nsaile\nthurlbeck\ninholdings\nirans\nriunite\nbhuwan\nseppa\nnikumaroro\nbenfold\njesca\nkvas\nattitash\ndeber\nvenzke\nwaltraute\nbourdonnaye\njpmc\nbalms\nunbridged\nmagetan\nliteraturhaus\nelefteriades\ngnx\ntunnock\ndansie\nincensing\ncutta\nschriber\nkhazal\ntechnogym\nlandbouwkrediet\nkozhin\ngiacobone\npelot\ngaladima\nyarema\nhadebe\ntempters\nbudos\nstrober\nlitsch\numeme\ntimbro\nkaribu\ngroscurth\nriona\novernite\nguéhenno\nbrothertoft\nngoche\nharbourvest\nlubecki\nkatsushi\ncablecar\nscrabbling\ngadbois\nhochkirch\ntouchless\nfearmongering\nmidy\nkerswell\nhorspool\nbletchingdon\nbaldas\nlyondell\nyorvit\naltenahr\nfoja\ntrenberth\nbackstab\nplié\nkweisi\nleucism\nlonwabo\ninguri\nmilitarize\nangiomas\naulich\nkhada\nheathery\nfb
op\nmancur\nsamdup\ndruggie\ndansili\nchristianities\nquintavalle\nchucha\nsocor\nelapsing\ncooktops\njabaliya\nradetsky\nfinbank\ncarran\ngarned\ncybook\ndemnig\nwachsberger\nyowza\ntshuma\ngoresbrook\nodihr\nhultquist\ngenotyped\netis\nbewteen\nderogations\nbelled\nmorave\nmaalim\nreisig\nsaikal\nfortepianos\nperiwig\nixv\nconerns\nslaking\ngruzdev\nmultum\nmcshay\nrutili\nobetz\nnamic\ngulgee\nnewarke\ntingly\nmaulan\nhillborough\npanjwayi\ntoutou\nbyx\njounce\nitajai\ngrafenwoehr\nalpesh\nshmoe\ntalfan\nmasari\nhoulahan\nquavering\nevanthia\nkubodera\nsachertorte\nolivio\ntatsushi\nwesterhoff\nyothers\npehl\nsiemen\nrafti\npaicines\nsurpised\nyamawaki\ncreekmur\npitchmen\nplcb\ncopperhill\nsediq\nwilsher\nclacking\nmelosh\nsematech\nnethope\ngussy\nhardyman\nmediaroom\nzanicchi\nvahey\nhondutel\nfladbury\ngarnell\nilounge\ngbas\nishai\nmccuen\nlilypad\nfreeda\nbellaigue\ncolóns\ntingles\nezzedine\ngasana\nohim\ndivots\nmcneece\nbergren\nmulticar\nberdnikov\nslowik\nlionville\nstbs\nskrimshire\njungla\nmuriqi\nfunking\nbeggarly\nnessling\nphedon\nsfcg\nincwala\nshayesteh\nplaners\ntheorin\nunderworlds\ncagsawa\nstogner\nalowed\nmaylin\ncrimebusters\nblazquez\nplce\npleather\nelfenbein\nchamkani\noerlemans\ndiker\nsheasby\nsonoco\nweihua\nzichichi\nbrendle\nfreuchie\nlevenmouth\nokoronkwo\naadvantage\nhillburn\ndipyridamole\nceilidhs\nmawazine\nclir\ncyclopropyl\ntornai\ngenuity\nphilcox\nisamuddin\nweedkiller\nmabton\nincentivizes\nchickened\nlavasoft\nwmgt\nwaxen\nsonographic\nruncton\nsgma\nmoratoria\nstarchenko\nbaiga\nfisherwoman\nlyerla\nrenes\nshabalov\npresidenta\nmorvai\nteleglobe\nmerckle\nmacfayden\nruyigi\ncirrhosa\nrohrwacher\ntelepacific\ndropshot\ndémarche\nkarsa\nsouissi\nkernicterus\nkosoko\nraistrick\nloyce\nwaringstown\nthevenot\nmotoaki\nchugh\nbroening\nsinx\nsmilingly\ncispr\nguangyu\nbrighthouse\nmaskless\nimmoveable\ntolovana\nanalia\nwapper\nrichner\nsron\ndunnam\nmanber\ndenhardt\nblackline\nmunton\ncasquero\nldw\naasim\nsivaraksa\nfucci\
ntechnico\ncriticial\nmoleskin\nandoain\nkeylor\nqeiyafa\nlantieri\nrmdsz\nguines\nabettor\nnewsouth\nnaftohaz\nwathba\nportsoken\nschjerfbeck\nneiafu\nreelections\nshalaby\nkopsa\nlesin\nlakenham\nscreengrabs\nbitney\ncimperman\nchygrynskiy\nurbanspoon\nprecedented\nawsworth\nestleman\nsavada\nalpar\nsandland\nscamps\ncopay\ncamier\nshrewdest\nchessboards\ndysfunctionality\ntidbury\nhormann\nreisler\namerasians\ntrana\njetwing\nstavola\nbassetts\nstum\nalecky\nhomilist\nausone\nassiri\nerx\ndobey\nmudingayi\nnardoni\nteitelboim\nerkut\npinpin\nnanopoulos\nscana\nkhanon\nezetimibe\ncukic\nceop\nibold\nkreder\nboruff\nlaureth\ncalabashes\nwheelz\nkaib\nmelodee\ninscrutability\nreinitiated\nmarkram\nrambagh\ngve\nsoothill\nvrx\nrozeboom\nturai\nmcrobert\nheathcott\nlimones\nmotha\ngrobart\ndunt\nkomamura\nvogelzang\nleucate\ndemilitarize\nclaritin\nmarkby\nharvison\nmarotti\nluzhny\nyaskawa\ndepass\nmusudan\nfitzrandolph\nstice\nluhring\nunuseable\nweikel\nappc\ntransesophageal\nuptrend\nklosi\nboue\ndaishin\nbiovail\nwritedown\neople\nnebet\npillorying\nwinkleigh\ncarias\ncaty\npfandbrief\ncodder\nkatlego\nmarlys\nbackpedaling\nnordheimer\nsoehn\nhrabosky\nlangsett\nsimexchange\nmcewing\nrollingwood\nfabianki\nkleis\nchangcheng\nchermiti\navonworth\nlisman\ncaterwauling\ncarafano\nharbridge\nmardirosian\nmpsc\ntsitsi\npke\nyoungarts\nhazelhoff\nfilmakers\nedworthy\nwhrc\nraclette\nzetsche\nedozien\nsidik\nezza\nlongrunning\nrasaq\ncsfs\nthinktanks\nmarno\nsanoh\ncloner\nderald\narrasate\nhoofprints\nlongdong\nsaftey\ntchao\nzalmai\nbeckstein\ndxl\nsmurfy\nquibbled\nibw\nfoubert\nsuffit\nvibhu\ncoalface\nsukenik\ntsy\nzhiping\ncentropolis\nfieldsmen\njobtitle\nkarrow\nartioli\ntions\norlich\nbujagali\naseel\nfinisterra\nfcstone\nferati\ninanities\nfaultlessly\nlvrc\nkidzui\nmamby\nexfoliated\nfuturologists\nssti\ntoxocariasis\nphylip\nwainewright\nkadr\nkalonas\nbrookview\nmckelvy\nsiasi\njazelle\nrigeur\ndoivent\neeriness\narncott\nironport\nstealthier\ndarbi\ndunsco
re\nandrostenone\nkhing\noldskool\nindiabulls\nrockling\nallays\ncoltrin\npallen\numps\norangevale\nkhurbet\npeepholes\nelfers\nbedawi\nslupsk\nbigamously\nbussert\nnasuwt\nfoxgloves\nheuga\nlamco\nparmigiani\nnowaday\nsharki\nshellman\nwestrup\npakalitha\npunkers\nfrx\npreventions\nmilbrook\nwashinton\npollitzer\nhypothyroid\ngandules\ndanilin\nhawaa\nluvs\ndusart\nriyan\npankova\nrutabagas\nelissalde\nwaismann\ntopweight\npromulgator\nportlanders\nvaidhyanathan\nholmoe\nibank\nkorosteleva\nsoltanov\njayasundera\ndatacore\nunspun\nbrahea\nekl\nknautia\nfroogle\nocassions\nbaohua\nnadjib\nastilbe\nbasulto\nstanislawa\nholdenhurst\ngadhimai\nstoia\ntixkokob\njhony\npauric\nbridgehouse\ntabacchi\narostegui\nrapanos\nbendle\nexults\ndesiccating\nhondajet\nabdulhameed\nanejo\ndimont\nminnifield\nfagging\nbladeless\nbroström\nnorthavon\nparasomnias\niolta\nfirewalled\nconaghan\nrappelled\ntoivio\ntalor\nprobo\npaskowitz\nogoniland\nunpatterned\ndeech\nmazrouei\nusapa\nchorba\nvitasoy\nsumilao\nballykinlar\natiga\ncleen\nzannier\nholyport\nkaslow\ngardinier\nmossos\nroskelley\nbrittnee\nkbfx\nshaath\nkofuku\nofri\nsuwandi\nshmatko\noverbanked\naridi\nbraskem\nbrynhyfryd\nbeechfield\naqe\npicas\nbalconied\nmandile\nariyan\nscottishness\nsquillante\ncatapano\nslyness\ncreches\nkriangsak\nkircheisen\naraque\noutman\nchindamo\nnawzad\nkandawgyi\nrabai\nbenaglio\npalaly\numanzor\ncrackenthorpe\ngonerby\nfullilove\nalamillo\nburchenal\noctomom\nhiromoto\nburniston\nlolla\nhaskayne\nkweichow\ntelemetric\nunimposing\nvildosola\nrejigging\ntoua\nmazzolai\nblackfan\nrundberg\nlongyi\nadauto\neditioned\nilter\nhannas\ndevedjian\nsbsp\nrbtt\ndivita\nkomin\nreengaged\nberdy\nmccrudden\nwesthouse\nsautéing\ntiem\nwheeless\ngodleman\nchairwomen\npicada\nharalambos\nmarkens\nmorococha\nlatterday\nhennah\nhandless\ncarneglia\nlopatkina\nroofie\ncassagne\nwhatua\nnalebuff\nperchard\nfoulden\nappeases\njumbuck\nfurthermost\nctba\nruhnke\ndineh\nbickett\ntrijicon\ngweek\nsvensmark\nandropaus
e\nhaochen\nborzakovskiy\ndomy\nbaileyville\nplumbs\ngrigoryeva\npravastatin\ndepsite\nkarapatan\ndornhelm\nsmae\nmyss\ncanonizing\nparmitano\nrantum\nmeave\nsepilok\ngiaconda\nlazzarino\nvyle\nantonakis\nrettenbach\nraphaela\nhesilrige\nbulbocodium\nvlachopoulos\nmelodramatically\ngandler\ndigeridoo\ntacke\nkolawole\nbarnidge\nbozhkov\nedik\ndylans\nsavors\nncda\nginyard\npaino\nburnbrae\naraoz\nautochromes\narchmere\nghaffur\nmeyde\nraltegravir\ncornier\nstrokkur\nchardonnays\nrubido\nmediapart\ntrainman\ncorato\ntrasmissioni\nastrobotic\nelab\nhalber\nwehrheim\nlemalu\nloseby\nwillistown\nrohrbacher\nnonoverlapping\nrappels\narborio\noverweighted\nekern\nalaq\nvarengeville\nsarun\nliederkreis\nseebold\ndimichele\nnasirov\nstadnik\nroginsky\nholovak\nmiscounting\nrabou\ncollectibility\nmlps\nroussell\nthim\npretextual\ncremonesi\nfabes\natucha\nmuchin\ndaytrip\nharrased\nkavakos\nygm\nhagendorf\nbegiristain\ncipfa\nbitel\ncurrah\ncrosthwait\nalipour\nbalala\nbarnathan\nkrasucki\nbearak\nmmmmmm\nratfish\ndinakar\npuniet\nlamah\nhandson\nokas\narcore\neverythings\neiss\nresurge\nyodobashi\noverextending\njoceline\nshakra\nfukawa\naizenberg\nmontelibano\nutans\nuniversitys\nnibh\nmemphians\ndumouchel\ncracco\nssy\nsubhankar\nvilana\nfellmeth\nterance\nwindridge\nzurawik\nchalfin\nqtel\ngoodrow\nrure\nproprietorial\nwillert\ncurette\nfrenchwomen\nshashlik\nlafaille\ntwelvefold\ndefacements\nmollier\nepirbs\ndevlins\nmonlam\nzarkava\nresearchable\nmabvuku\nvarilux\neiscat\npamoate\nneoliberals\ncarteles\njearl\nphrenologists\nlisaraye\nsmeulders\nitsself\nmueenuddin\ndisempowering\nkeiskamma\ndisbar\ntoraman\ngullfaks\nmoratoriums\nvillazon\nscandentia\nparochially\nbaselga\nkirklevington\ncattistock\noughtta\nhsaw\ncynulliad\ngopalnath\nslutzky\novervaluation\nbatterham\ndallerup\nostalgie\nhenchy\ngoverner\ncija\nministerios\ngiangreco\nliikanen\nlanday\nclarsach\nthermotherapy\noldford\nhomsi\nciervo\nmatzek\nibio\ntereshinski\nmalford\nkomarek\nmurungi\nbrunier\nzb
ynek\ncrocosmia\nrohita\nlarcenous\nmabahith\nmastny\ndeschapelles\nmozy\nstrykert\ncladribine\nledgewood\nliuwa\nchlorambucil\nneedlelike\nganzes\nhazarding\naeroflex\nsupraglacial\ndimokratia\nmeina\nwillesley\ngroag\npoucher\npisar\nnubble\nknego\nchaffinches\ntengzhou\nbellamys\nsadighi\nchunkier\nsimphiwe\nkatchor\nsmartie\nreimposing\nhaselberg\nchellaney\nunivest\nguillo\nbalester\nmeigh\nbureacratic\nfloriane\nmoldable\njardinière\nkrisel\nmerey\nstubbies\nlutfullah\npioner\nmenemsha\nwhittock\nnadjari\ndupond\nkolache\nkeaveny\nbequette\natomstroyexport\nfurring\nstorcenter\nraful\ngänswein\nmonsterism\nrettendon\nmorner\nvigilancia\nclemmie\nschleper\nlovekin\nfennville\nbalena\nprldef\nkarpovsky\nosnos\nlantigua\ndeubel\nhoneysett\npakhtoonkhwa\nincises\nnitrosamine\nneaman\npensarn\nashika\nboors\ncasoli\nchagin\nmurko\ngriem\ngruenebaum\ngutterman\nsteltz\nerionite\npsychotically\npennridge\nalawsat\nundie\nlepetit\nvetrano\nthomire\nshaktoolik\nlinzhou\npensee\nsundblom\nrearick\nkave\nlouer\nstadnyk\ncruciat\ndaning\nillium\nkliger\ndormston\ngeac\nsudarsan\nthornlea\nekes\ntyutin\npernick\nllynclys\ndedina\nhuguley\njstars\ncavium\nhighcliff\nebbetts\nsuchitoto\nskowronski\nmathmos\nnahim\ntorys\nnightrunner\nzurutuza\nwherefores\nlinnehan\nsartory\naffianced\ncooey\nparatuberculosis\nsenesh\nmishari\nhree\ndinkin\nnrwa\ndierama\nfathomless\nsaltarelli\nresequencing\nlionell\ndavoodi\ndaidone\ntanase\nnargund\npeapod\nglassine\nrefsdal\nhongtao\nnonnegotiable\nishay\nportakabin\nflibanserin\nslomka\nhuelin\nwalbottle\nkengen\ncandling\ngalson\nbasketballers\npakiam\nbrazils\nkadhum\nsuezmax\nsparely\ndecompresses\neaglen\ntamgho\nboogity\nwinshape\nthobe\nmicrogreens\niqrit\ndilaudid\ntreefrogs\ncolonises\nosberg\ncontingently\nvalentinas\noatis\nsexagenarian\noooooh\nsuperbeasto\naucuba\nnégociants\nleatherbacks\ncoquettes\ncrapped\nequivocated\nchytrids\nbwn\ncauterized\npaintsil\nshihua\nshushkevich\nbevendean\nsatio\nhammarsten\ncockshott\ntomein
g\ntunnelers\ndoodler\nhrtv\ndingxiang\ntinworth\nabdelsalam\ncasaleggio\nduddington\nlacorte\ngavidia\nrundowns\nbauduc\nlaureateship\ncranefly\njanigro\nreaserch\nlafford\nhessing\nhegemonies\ntorshavn\nvartanyan\ndonahoo\nwenqing\nmonterosa\nnettleford\ncrosshatch\nlysbeth\nessenhigh\nequivelant\nhairpieces\nbeatiful\nabama\nirascibility\nmedanta\noloye\nlassoed\nansbro\ntregurtha\ngardoni\nasztalos\nfoxon\namarri\nmerchandize\naldwarke\ndeclawed\nfingolimod\npaddingtons\nbercher\nmatzinger\nciaccio\nhoetger\nhendin\nbual\nyukagir\nmisharin\nmuzu\nshahrizat\nbrightsource\niccat\nkemalists\nstagner\nobradors\nteleworking\ngisel\nilê\nhalu\namaranto\nbenardete\nfuti\naslett\nsuther\ndenka\nperagallo\nboazman\ndû\nabon\napiaries\nfulfillments\nrejecta\nmarijnen\nchomo\ncompnay\nabbassid\nconfederacion\nchizek\nkillmer\nnawruz\nnantporth\ntelit\nkarkhi\nhargesheimer\ngraters\nkhanya\npadd\nskachkov\ngrandstaff\njericó\niupati\ndiktats\nsplinted\nlampen\nayerza\negekeze\ntheofilos\ncarryall\nhelvey\nacrc\nimpedimenta\nnorthall\nsparq\nmaluso\nbartrop\nmoneysupermarket\nuglegorsk\nnokie\ncejka\nneikrug\nvespas\npepel\nolfers\nefail\ngrosman\nhaimes\ntatic\nnefazodone\nmansukhani\npaulen\nundimmed\ngourdine\nzeland\nzeyno\ntayleur\nchool\nholzberg\nkench\nshamsheer\ndungworth\nlitigiousness\nwhitefoot\ntarm\nstollman\nvideoblog\nrohbock\nnram\ncapecitabine\npartsch\nmagaki\ncabstar\npipedreams\nfrayer\nnitu\nunhesitating\nsonntagszeitung\nadiru\nllanfairpwll\ngeorgeanna\nciria\naghajani\naminath\nkimaiyo\nhawatmeh\nplewa\nkolonics\nmalakov\ntalhaiarn\nnonancourt\nalemao\nbaltimores\nnkufo\napolipoproteins\nzhangqiu\nlavs\nfransham\nshamie\nsalotto\ndumpton\nyahr\nsuiters\nhomosexually\nbonnethead\ncraigievar\nnuzzi\nheggs\nkaminska\nsuperstrength\nslemmer\nerasto\ncunnane\ngravgaard\nwhitebread\nkaratzas\nmcvety\nmeteogroup\nnagamori\nheadboards\nimamverdiyev\ngeniès\nmobay\nadriyanti\noverdoes\nchenevert\ncasen\nwoodeshick\nonuaku\nnonagricultural\nmabil\nembezzles\npo
lycarpou\nkerobokan\nsaquinavir\ntimelord\nantabuse\nnabaa\nepner\nadefarasin\nwildish\nbruchweg\nfleetboston\nsputters\nazmy\ncadamarteri\npuckeridge\nklap\nconibear\ncantatrice\nkeckler\ncaique\nunsoundness\nsalw\nandronik\nswanstrom\ngaoming\nneabsco\ndesecheo\nyiquan\nmatney\nfrykowski\nbelcea\nshoed\nrebadging\ngenivi\ndubbers\nangelea\nrissman\nchrs\nrodak\ndropcam\nimpassively\ncupholders\nbehing\ngroser\nbòrd\ntieshan\nkenanga\nopenajax\nhickinbottom\nmeminger\nscreenprints\nvango\nsisig\njamee\nchemopreventive\njinghua\nadorably\nelouise\nswitkowski\nrupeni\ndevany\nshnaider\ngiribet\nnusser\nprestiti\ngrassini\noddsmakers\naramberri\nluigino\njasur\npurgatorial\nmudlarks\nrdecom\nmalangi\nmambos\nbaree\nspooned\ntakalani\nidham\nspraypaint\nzeel\nixquick\neyeful\nyonker\nobfuscations\nlabash\nsabrin\nproaves\npreit\nhackleburg\nrajnikant\nneemia\nkatera\npukhov\nbackpages\ndureza\nradisys\ngreywolf\nbustled\ndecentralising\nhanifin\ngeochimica\nquate\nfhlbb\nchimerica\nmaharajan\nkhawla\nmitchelmore\nsonesta\nbarbarino\npeope\ndoncha\nuninsurable\nrestrictiveness\nchittum\nchoung\nnarghile\ntaishin\nguobao\nallyssa\nbertrams\ntessler\nstorwize\noompah\nnistico\nbrabeck\nkillinghall\nnogues\nmyoga\notokoyaku\npteropods\nnitpickers\nstaiths\nwilckens\nexpediter\nfrancophobia\nmckegney\nmuqdadiyah\nbncc\nfengxia\nboriello\nunterweger\nikm\nstrosberg\ndemotivating\ncoteries\ngaikai\nmometasone\ncornog\nyumkella\ncablegate\nwatercooler\nsanakoyev\neariler\nkemira\nlesjak\ntwohig\nglobovision\nbarabanov\njongerius\nperezhogin\nmusican\nmilfield\naradi\nsimmy\nalbareda\nuncomplaining\ncanditate\npilc\ninheres\nbrema\nfimbul\ncje\nmursell\ngki\nkaese\nshlock\ndespairingly\ninvasives\nhauptschulen\nkyauktan\nnarsad\nnaari\nmcingvale\nmichiyoshi\nssam\nsnif\nmetraux\ndivall\nchps\noskay\nerrrr\noseni\nchaunte\nstokols\ndimenna\nweatherbys\nmarchick\nwdet\nmorlon\nroadstar\nparknshop\nlochboisdale\naready\ntalco\ndeoksugung\npalander\nbagna\nehad\nghostwrite\nzinoman
\nanirvan\nballesta\nzabawa\ndurring\nperformace\nkrikken\ndalkon\necohealth\nwalberton\ntecnologica\nmovieplex\nnysut\nfucilla\nimmunogen\nkorfmann\nkrugel\nbyass\nbarjac\nquainoo\niafrate\nboell\nkhacheridi\nkrump\nwessinger\nthalib\nbhogle\ndandu\nmeziane\npaleyfest\njianqing\nkoeverden\ngillson\nillogicality\nconville\ngrims\norian\npeynado\nicsr\npowney\nsuceeded\nchesse\nschwitzer\nbegrudged\ndonnall\ndecaffeination\nheinemeier\nbarcena\nkangura\nchumachenko\nhuanhuan\ngalloways\ngher\ntamiia\ncostarica\njollof\nkatella\nfetishization\nansoff\nshuttlewood\nleikin\nyuden\nlazarte\ncnni\ntakefumi\ntaedonggang\ncolora\nfacussé\nfarese\nmelf\nahamad\ngreyfield\ntransferees\nogunsanya\nsoloistic\nasbmr\nhomebred\nameal\nmasara\nwigmaker\nbarsebäck\nnmdp\nwilstein\ncotugno\nstike\ntransman\ndisposer\nrobbyn\npusillanimous\npanshanger\ncoultas\ntumaini\ncaril\nvandort\nplasschaert\ncesifo\ncockamamie\nnabby\nmainsprings\ndegroat\nashia\nitweek\npaphides\natter\nmaenclochog\nmarshside\nfixator\nstauss\nstigmatising\ndreadlocked\ntraykov\nvaughton\nnichido\nwalmgate\ndck\nheledd\nscajola\nwoodmoor\nbiaw\npadalino\nwiis\nmckaig\nquatercentenary\nherpa\nbockhorn\nwde\nlasy\nduffing\nbimson\ntowans\nstupin\nquienes\nbrads\nhomenet\nprieb\njiahe\ndiprose\nguttierez\ncanoy\nbackwardation\nqelt\ntarasyuk\nsethusamudram\nkrayem\ncompubox\nsuvorovo\nerts\nlaparoscopically\ngrimandi\njuet\nanagnostou\nbrevan\nmartek\nsunair\nresponces\nbinbin\nschoelkopf\ngraem\nmacarons\ncheskin\npretinha\nlizo\nrayanne\nmolecomb\nohip\nintertainment\ngarby\napirak\nparcheesi\nholdall\nxapuri\nmucolipidosis\nlatcham\ntefaf\nrihana\neirikur\ndagogo\ngfg\nsmoothen\npoken\ncomparables\ngodoi\nshorthold\nency\ntillot\nfunniness\npollarding\nerz\nasmallworld\nnaht\nduchamps\nkarora\nmutko\npoundstretcher\nlaffranchi\npercocet\nsomkiat\nasaduddin\nsnowbanks\nwwin\nsliming\nmicromuse\nguriev\nnazam\nkolodner\nstollenwerk\netchecolatz\nmontney\nquiches\ntorgler\nmiddlesborough\nerrickson\nmortehoe\nme
diafax\nlwg\nsotalol\nkemakeza\nscreeched\nnannu\nheathcock\nresorb\nluay\nalveda\nwmsc\nnabati\nllaves\ncotehele\npeeped\nhedrich\namstar\ntreasuring\nsablefish\nbetsen\nmuker\nmessud\nroids\nberkow\nyusra\nunobscured\nouvidor\nmacerate\nterekhov\nbadinage\nmruvka\ndeemphasizing\nakalaitis\nmussell\nunguent\nweatherbug\nadvancedtca\nblackney\nnutbrown\ndetesting\nklci\nkrumov\nturnmills\nbasnight\nallistair\nwrigleyville\nbruecke\nmonohan\nblaschka\nhautman\ntiralongo\ncommingle\nyurtseven\nmihails\nradonski\ninstinet\nfedun\nmagaro\nfrenches\nteknaf\nkalaitzidis\nwallid\nchewning\nconsolidators\nintravaginal\nfilets\njork\nlecomber\nhyperthermic\nrecre\nidata\nkjla\ngymboree\npavlick\nhendricksen\nvelka\ndanneel\nficc\ntules\nuserkare\ndzr\nyongfang\nchildminders\nglascoed\ntsipouro\nhouton\nurgun\nbiopark\ndockage\nceron\nsnailwell\nuntilled\nstenham\nmantech\nmurabaha\nesure\nringmann\naffi\nvardanian\npedrillo\nkibwezi\nrakove\njev\nchalid\npissarides\ncatera\ncipressa\nabideen\nbesigheim\nmisys\nmeersbrook\nbrafman\nsandbostel\nlivock\nunswervingly\nrussophobe\nculantro\ncervarix\nkamminga\npieranunzi\nrobotti\nalberman\nzlateva\nbuonomo\nrvsm\nsarco\npalinkas\nwenhaston\ngreves\nbedchambers\nenergis\ndepardon\ncamec\ndeighan\nnetezza\nisys\nadaora\njaywalker\nmarilyne\nshewfelt\nlockyear\nratby\nsulake\nboastfulness\ndapples\narkinstall\nirey\nsothic\npolonio\nbabaloo\nobejas\nfornicate\ntaggerty\nmarionville\nhessels\nrelgious\nelgible\ncarlitz\nsspf\nfidai\nfnmtv\nidisk\nnayman\ntraumatically\nfarfalle\nunbefitting\nwarke\nromanticization\nburnhams\nindefatigably\npuzey\ncastellamonte\nreaux\ndarai\nfallahian\nopengate\npeillon\ncambusbarron\nyasinsky\nmottau\ngeseke\nnonoperational\nglobke\ndownline\nlifescience\nskillets\nschisgal\nphototoxicity\nwinegard\ntringham\nsynthes\ncastoff\ncattenom\nepitestosterone\npachon\npjtv\ngrayshott\nhortefeux\nfamiliarizes\nstobs\nhuwei\ntanasbourne\nbhattal\nexpence\nschifani\neglevsky\nsnarking\nmolodist\nattrocities\
nrawdat\nxra\npaun\nigrejas\npayors\nsouthcombe\nfillongley\nlinders\nnobuchika\npcrf\nuzbeki\ncnsas\nbatsto\nblakedown\nasuming\nnubium\ninteracademy\nblackens\nmattey\nflorit\nhjejle\nmaragh\nthreemile\nblueshield\nfxa\novertricks\nkhidir\nsergia\nvolchenkov\nsmoldered\nkelud\nghw\njanti\nemerik\noxymetazoline\napice\nraffoul\nconfidencial\nstamata\nkingscott\nantinea\nbiegler\nreeman\nmanuring\ncollectability\nkhorol\nklineberg\nmcgiffert\nmudzi\nallmans\nrashti\nsunnylands\ndoggedness\nloray\nhorseflies\nbolshie\ngrimsdale\naerophile\nhreinsson\noeufs\nbouira\nruut\nshoura\ntwerps\ndermabrasion\nlevich\nmahmoudiyah\ntsikhan\nhypocretin\nlopud\ntatenhill\nszeklers\nvideographic\nscheuerman\nlampstand\naici\nmaxman\naliotta\ncrimplene\nnautla\nplachy\nmizerak\ntazer\nroszel\ndishearten\nmaoa\nprocreating\nmancienne\nkarayev\nunrefrigerated\ndemske\nunabsorbed\ntondabayashi\nalipay\ntortel\nklipper\njardí\ntreshnish\nobservably\ndreamlife\nscenesters\noftel\nvwr\nivet\nyesilcay\nwildung\nminisd\nanorexics\nprid\nnyquil\narija\ncaraher\nangeletti\nsindia\nmembrana\npolitesse\nkahel\nboullier\ndeakes\nwebvan\nvecho\nzasyadko\nhellenikon\nstonyfield\ntheocharis\nfeldsher\nseperatist\nraychem\nunpleasent\norbinski\nxacobeo\nlorser\nolufunke\neilish\nscottow\nnjdep\nmediatech\nnaturallyspeaking\nyiannakis\nkalisher\ngnvq\ncaridee\nflatfishes\ngriesemer\nandell\nleftie\nsignicant\ngues\nwinnebagos\ntwaddell\nbertagna\ndownspout\ngrundfos\ntoboggans\nlifestory\nbunawan\nunfriend\nnoop\nyellowy\nkenjon\nwojnar\nacutal\ntagicakibau\nkootch\nciron\npenndel\neurocamp\nschussel\nchamanga\nvulputate\nfuxingmen\nlesego\nstellmach\nsharyl\npicketts\nmaybourne\nshober\ndelightedly\natus\nbiorhythms\nebben\nchaudiere\nstrohmayer\ngovernemnt\nburgs\nbkh\nalfei\ndhanak\nmarfo\ntackie\nbaycrest\nvickerson\nparathas\nmxi\nmekeel\ncottesbrooke\nrevuebar\nkomarovo\nwebgui\nmyspaces\ntourneau\nsoeren\nelstein\nagadoo\naquilante\nschnecksville\nkasasa\nshailja\nkessy\nsensuously\nrabbitts\
ngardaland\nkinnier\ncovaliu\ngloried\nfreye\nkluver\nunripened\nscannable\npollera\nselside\nsiswanto\nabashed\nkunicki\namagertorv\nsitation\nsarafyan\ncompaign\nalbea\nlegesse\nstepfan\nrelgion\nsupplicate\njhin\nmysti\nclennam\ndelpino\nplayfish\nludicrousness\nkrzeminski\nsuscipit\nboxlike\nbiserka\nsantika\npiata\nciccolella\nelicia\ndamasks\naprés\ntrhe\nsalumi\nspme\nsholeh\njitterbugs\nbrownridge\ncalarco\nredco\ncked\nchampika\nyoido\ngoestenkors\ncleworth\nmegowan\nricharlyson\nmoniter\nimpax\ntrenchantly\nwishard\ndeoxycholate\nkakkis\nyien\nhudur\ncartha\nexerpt\nfarrall\neckhouse\nbipa\nmalkan\nsmolts\nrukiya\nering\noverbalanced\nflavorpill\nmalesuada\nremunerations\npicadillo\nkniffen\nbelimumab\nchuyen\ninexpressive\nscro\nlichterman\nspittin\ngearty\nlamaism\nstarner\nchessex\nicehotel\nzeanah\nanchormen\nwiederhold\nemulex\ntatio\nbirhan\naxolotls\nhoplon\nscuzzy\nossifying\nelaheh\nlsta\nmessaoudi\nlyveden\nbernfeld\npronouncer\ncolp\nmagyarosaurus\nfeltus\nserageldin\nflexibilities\nmajeste\nbunghole\nbarko\npinnington\nchugai\nkirchberger\nalmanzar\nmalouin\nimmorally\nedgecote\nmagalie\npalamos\nlamagna\nbutorphanol\nbaoshi\nlanig\nmcnugget\nicdl\nskuli\nstatia\nchengji\nthemselfs\nbourgon\ncrappier\nkumra\nlipez\nspooking\nperuses\noutlasts\nsousatzka\nravenstonedale\nzias\nraskatov\nalbone\nsundai\narean\nshyu\nguanzhuang\nabdelrahim\nsolankis\npennyweight\ngluepot\ndiuranate\ndarboven\ngabey\nbehnaz\nmopey\nbashment\nqueston\nprazosin\ngrisbi\npurita\nashrita\ndaoudi\nmemorious\nausman\nmatzner\nsumtotal\ngubser\ngermanness\nsaana\nnagae\ncrybabies\neasterwood\ntrebah\nbhindi\ngreenkeeper\nnewberger\ndpk\nredditt\nsergeyeva\nbirthistle\nayars\ncqg\nrainelle\njeste\nstikeman\nterhi\nchocolove\ntaskmasters\ntumorous\nsoanes\ncontracorriente\nschewe\nlillicrap\ntrebunskaya\ndodgertown\nwellsway\nvidarte\ngyoza\nrashers\nfdk\nfeintuch\nsaedi\nnedap\ncaino\nescos\nbuddington\ncussed\nmeservey\nunmatchable\ngtcr\nweisfeld\nhealthsystem\nleving\na
esseal\npacemaking\nmylife\njusticialista\ncolorways\nsarossy\ntaltal\nsantosa\nblanken\nddim\nhunty\nvlisco\nbrodifacoum\nafms\ndedic\nitaa\nliepert\npapaconstantinou\npesquet\necumenically\nsymptomology\nspeakin\nlindolfo\nlevitzky\nsabuni\nbunked\nswingset\ncuckooland\nwowing\nmulbarton\necologica\nrahila\nmethylenedioxymethamphetamine\nrobitussin\nvulcanologist\nquammen\nprial\nkulsum\nagathidium\nantionette\nkinlochewe\ntroopergate\nfrittered\nlummox\nlagrande\nnimbleness\ndallachy\nbarou\nachieva\nmescudi\niseya\nyibing\nsolexa\nfinol\nibtc\nfloh\ndrainpipes\nmudbox\nhayovel\npetrucelli\nmorrises\nraphaels\nlecour\nbaolong\nfenofibrate\nfuerth\nchakkara\ndeputizing\nschachen\nnanosensors\ncantlon\nzdb\nashera\nwestvale\nkervern\nlatella\nidenburg\nnisc\nlapsang\naltoon\npolota\nrusper\nhagglund\nkhatab\ncarvedilol\nassurant\nobjectifies\ncowfold\ncypionate\nlastweek\nkarsaz\nhollyweird\nsier\ndalbec\ndslams\nfelmersham\nunwarrantable\nstigmatizes\nlowdon\nonehope\nmarthas\npleasington\ntrearddur\ndiderich\nemsis\nsalaad\ncosmati\npennells\neffluvium\ndisjuncture\npfaw\nsiné\nsunit\nkinan\nmalaquias\nbackchat\nsdat\nfoulquier\nshoofly\nmisquotations\nblighter\nagbeko\norphenadrine\nsnivelling\nfinesses\nthewissen\nswad\nmadisons\nsearfoss\nsubterfuges\narlacchi\nbianet\nbeleives\nnoncancerous\nbuston\ncatellus\npronouncedly\nbenshan\ndeenie\nabstractness\ngantin\nopsvik\nerrantly\nkaseke\natomisation\nheyder\npippig\nclarissimus\nhiit\ngallichan\ngrouplet\nbueche\nhadir\ntoloa\nstateman\nfuchsberg\nmouammar\nlubar\nwentai\nuntwist\ncadboll\ngemeinhardt\nzoria\nhpx\nshaibu\nundefendable\nmclafferty\nlamidi\nhearle\nlongnecker\nheartlessness\nheliosheath\nesmael\ndewater\ncurlett\nbrancatelli\ncanters\nkodwo\nmomberg\nrangone\nvended\nnabobs\nmelees\nleesport\nobadele\nncvo\nhhk\nhooydonk\nlangerado\ntinopolis\nbootcamps\ndarsham\nsnac\npisau\nmartinoli\nmehadrin\nkroffts\nhkjc\ngudjon\nhnz\ndelbruck\n‪\novercooking\ncogges\ntrashcans\nbergy\nvanco\nreattribute\n
tinkhundla\nlitella\naustinites\nchechik\nlittner\ndawie\nirreconcilably\nsheinwold\nbrooklynite\nlorded\nkyleakin\ntwines\nanisole\nabiye\ncder\nlemer\nradiograms\nshovlin\nbosl\nbedrest\nvirtualizing\ngénova\nupdaters\ndonnison\ntoppy\navigliano\nozier\npreachin\ngunrunners\nmilliband\npostmarketing\ntargoviste\nirelande\nedaw\nbouchra\nkimco\nabendanon\ncorojo\ndenesuline\nchicontepec\ndancewear\nsahlgrenska\nbawtree\nfuni\nhpcs\nfilipenko\nbrex\nliesbet\nnilon\nnallet\njanuario\nmattishall\nchetri\nnomisma\noverbye\nmaves\nwechmar\ntrasporto\nkacc\nsynchrocyclotron\nboubker\nmesurado\nczege\nspara\npintauro\nsmithey\ndisassociates\nromeros\nsjostedt\nclemon\nboxun\nshihui\ningolfsson\ncorazzi\nlubmin\nflatteringly\ngasohol\nmurias\nukt\noverstretching\nvolleyballer\nrhinosinusitis\ndutti\nrochlin\nundersold\nmdluli\narabtec\nbortone\netame\ntafadzwa\npeosta\ngaryville\nkeath\nflemmings\nhamers\neupol\npdps\nbhavin\ngemenon\njihai\nyunji\ndezi\ntoccara\nmsibi\neprs\nclogwyn\ndebilitate\nliberatum\nbelbroughton\nbacta\nplupart\ntucuxi\nkdn\nsadecki\nearthier\nkintamani\nbalbay\nmmwr\nmingala\nzarni\nmubanga\ndurrer\nparch\ndipple\nwilkof\nsatiny\nthreequarters\nehlvest\nposibly\nnightie\ncorrias\nricke\nbrookwell\ngordeno\ntolkachev\ntillable\nolshanski\nintrests\nnewcraighall\nmodoki\nrecantations\nbeerling\nmidfoot\ndurnin\nnitzana\nunders\ngroundskeeping\nparnon\nheungdeok\nabrasively\nonalfo\nemasculating\nnewbald\ncelibates\nyarnwinder\nnelsan\nwhoredom\nsaccone\nesman\nclsa\nairsickness\nmcnatt\nkneeler\nbahij\ncarcamo\nezquerro\nmessia\nsolarwinds\nrigatoni\nunknowledgeable\nblisteringly\nthraves\nhelseth\nchanta\nabeni\nmarinela\nsavigne\nteleprompters\naxtel\nhelg\nlackie\nempac\ndaleiden\nbaylen\nmizuko\nwarrilow\nvocalises\nglobacom\nosci\npitreavie\nauriti\ngallin\npotbellied\nopenning\ngetgood\ngayssot\nwlodarczyk\nboroff\nharcum\nsněžka\naprotinin\nwhatuira\ndiamon\noneil\nkpvi\nmspca\nmadel\nkalichstein\nrentsch\nchemla\nschleef\njeffco\nsaxelby\nsh
udde\nxac\nverzosa\nhohm\nlichten\nhadba\naerion\nhyperphagia\nbeibi\ngotthardt\nkaliyar\nbaudilio\nlockes\nbehling\nmovila\nkayongo\nboothville\naltafjord\nronnell\ntranswestern\ngibes\nunitedstates\ndenseness\nstuding\nnelnet\ngrandpas\nsknl\nagnews\nduyck\ncnse\nstuebing\ndealbook\nyood\nreviling\nbnk\nvsevolozhsk\npretensioners\npashkova\ncofi\ntobinick\ngeomyces\ntyeb\npetruk\njenilee\nubinas\nmammoet\necmc\ndelanne\nrentmeester\nachouri\njugtown\nmamprusi\nmathuram\nzīle\nhindpool\ninaki\nseomoz\nbeart\ndarmstaedter\nsalbi\nnzimande\ndiscomforted\nwaweru\nkhasbulatov\nzarrilli\ndobwalls\nvoinov\ncaminada\nmoorfoot\nalmand\nuteruses\nhies\nasselt\nbaretto\nound\ndeveny\nbleaklow\nsemenza\nnagasaka\nspivs\ncernavoda\ndalpe\nthoron\nannison\njazirat\ncariad\nstonesifer\ntrivialisation\nwasik\nllambias\nbartella\nbajramaj\ngack\naccordia\nassasin\nbreitkreuz\nballymoss\nstreelman\nsadoff\nupritchard\nquaggy\ndowgate\nprocacci\nfeiyue\ndullingham\naasia\nhudes\ncuthill\nskorupski\ncosmochimica\nrescanned\nsopchoppy\npremedication\nhimym\ncutco\nlefrancois\ntangibles\nequivocally\ntallness\ndepersonalized\nfruitier\ncvts\nschoenbrunn\nchambermaids\nnonresponse\ncompletey\nbroeke\nsukoharjo\nbátiz\noceanos\navadon\nsnackfood\ngustard\ndiscomfiting\nliljestrand\nbrette\nrapscallions\nbleasby\nfoum\nshulevitz\nearler\nechohawk\ntrovesi\nwolferton\nphilisophical\nwelborne\nsironko\nstraighteners\nsetpiece\nvalbruna\nshammy\nsergie\nberom\nmachray\nertman\nredeployments\npiquing\nwholistic\ndongming\npenmorfa\npopovkin\nheckenberg\njalovec\nprecut\nnarvin\niannaccone\nbaulcombe\niaap\nwenta\ncoronados\nnovillero\nllanedeyrn\nfies\nparulekar\ntomicki\ndazey\nlendy\nlilibeth\nwacoal\nbearskins\nleonelli\npeplau\nhuttle\ntrussel\nbreakway\nmuslins\ndrider\nhasselquist\ncwmcarn\nlibrium\nforsgren\ndeprogrammed\npoleksic\nmossburn\ntansill\nqiushi\nkscb\nilkhani\ninnertube\npsyclone\ncrippin\nfeedingstuffs\nchege\npdufa\nnevett\nengineless\nodobenus\nscotforth\nsetpieces\nese
rvices\ntheese\nlowari\nvisitbritain\nbartholome\ncribwr\nstenstrom\nhonked\nvirii\nnwtf\nlotha\nabec\nchitlins\ngauer\nwssd\ngortin\nlintern\nxvt\ntadman\nschoo\nblueliner\nnaimoli\nhcpc\ntorwood\ndimed\nbecaming\nncj\nkreamer\nmilkfat\ngriffier\nmakhani\nlaffrey\naronimink\nabrahami\nagrawala\naimie\nparceling\nbugaboos\ncyclery\nprincipali\neyk\nplasan\nstrelchik\nprolotherapy\nderailers\ntieman\nglenariff\nblecker\nwagnon\nprondzynski\ndifferance\nunisons\noymyakon\nbenalouane\nleguin\nlangsner\nscaggsville\nodeo\nsease\nresturant\nsupermajor\nshamiya\nobul\nprocessionary\nvanidades\ndamballa\nfirestream\ntrenin\nherreid\nabsorbents\nngeny\nestemirova\nrhdc\ntextualist\nnonfood\ntechweb\nbsms\nforese\naygun\nregza\nnewens\ncnep\ntiddles\nmaesycwmmer\nwinslett\nhybride\nncee\nmosqueteros\nrootbeer\nmuraqqa\nguoxin\npatsavas\nbreth\npoiares\nfatialofa\niwon\nparzinger\ngallerists\nlangthorne\ncraniectomy\npattering\njonquera\nwillocks\ncsss\ndanzon\nictaluridae\nlemcke\ncprc\nharpooner\ngubicza\nwanchoo\nkepu\ndisapeared\nvitkov\nuntargeted\nzaytoun\nemanual\npolino\noganessian\noyarzun\nspermicides\nexpereince\nkeelin\nunneccessarily\nanthropomorphizing\nsidarth\nhavanese\ncedrone\nappers\nlivan\nstoodley\ngulyaev\nbrocato\nmakov\nmudpit\nsaify\ncatastrophy\nslabinsky\npepsis\npajak\ncrisci\nsisay\naugelli\ngencorp\nfirepit\nrowdyism\ntayeh\nmidpark\nrivane\nioulia\ncamlet\nbogdanowicz\ntransfixing\naqt\nshefter\nsegurola\nkatzenberger\nbeckhams\nkwas\ncalomiris\nnembutal\ndutson\ndicrescenzo\nterawatts\nshiant\nsadulayev\nuña\nabstemious\ndunbrack\nthermoregulate\ncareerists\nsubsites\nbutalbital\nyounkins\nbedella\nscapegrace\npostapocalyptic\ntrifled\nheyser\nwangdu\nhyster\nelectrolyzer\nbancrofts\nsnowhill\njabarah\nkrugerrands\nrouzier\ntrefousse\nremodelers\nmaricle\nboiron\npetrofac\ncynda\nkissock\nlifeboatman\ngoodish\nbiesen\nkeehan\nleha\nletteri\nmobilisations\nsaull\njammat\nwoolsington\nmalevsky\nweeley\nmishel\nmattingley\nlones\nmudawana\nkorecky
\ndagworthy\njahri\nrudnay\nmadoyan\ndealmaker\nyoest\nskelmanthorpe\nromanzi\ncompletist\npahlavis\ntedo\nwaterweed\nmatsko\nswampers\ntransys\nadetunji\nentelechy\nmccrossan\nlapt\ngjerset\nsystemized\ncagw\npiloncillo\nalcova\nbernardis\nleban\nanswere\ninteroffice\nnongovernment\nmaquillage\nremigiusz\nziska\nwalsum\ninseam\nnewater\ncropston\nambulante\npirozhki\nmanougian\nskyr\nforwardly\ngossain\nmcartor\nkunnen\nsdwa\nlpsc\nlehrmann\nwhassup\npudemo\ntrajkovic\neurail\nsirwan\nolmesartan\naratani\nyiyun\nmoates\ntwitched\nturpi\ninmon\nlugovoi\nabbeystead\nidzikowski\nschauble\nmessagepad\nspringsource\ndisembowels\nundercounting\ndemotivated\nklpga\nzemmama\nprovencio\nlivenation\nimiela\nkertus\nethiopiques\ndeceptiveness\ngomshall\nsantolina\ntransluminal\nmcowen\nqrops\nhuvelle\narsenis\nlibran\ncabic\nrevenging\nscharin\nyuzhen\nsturrup\nfelitta\nalspaugh\nesteems\nmuttontown\nroge\nnorthdown\nbatrachotoxin\ndubnov\nalikhani\ncornelly\noutswinger\nswabbed\ntowb\nelmau\nmoutarde\nwesterdale\ndilutive\nchronologic\ncelsum\nderrylin\npolishness\nprinknash\nutx\nlantin\ntrendier\niivari\nmazunte\npederneiras\nsatinath\nestranges\ntransflective\njahns\ndanella\nborzois\naristóbulo\nunusuals\ntimewarner\nkruck\ntransversing\nbessonova\nverichip\nburnaston\nkaihui\njisheng\nbrascan\nbrung\nqummi\nmalverde\nmesler\nseminis\ncemr\nwtnt\nkenteris\nvarenna\nsavinova\nmutsch\nenergem\nchaze\nhatiya\nbalzary\ninportant\nfirebugs\nilchenko\noakwoods\nsuperheros\npunycode\nfeatherbedding\nslamdunk\nstapeley\ntecs\ncoverge\narocho\nsundwall\nbridgham\nmucuri\npoupard\nasenso\nbowlt\nmckelvin\nxenapp\nrfh\nqci\nvalorize\nsteeling\nllanharry\nrastall\nincisiveness\nunichem\nlooi\nglutes\nsurroi\nminibikes\nbarquera\nchellomedia\nnikhilesh\nmethylcellulose\ngliori\nthyer\npactor\npursuade\navz\nbarflies\nsheppards\nmaliqi\nzavyalov\nbolkow\nklepfisz\nkenth\ninterros\nlaucala\nunfriendliness\ninfatuations\ngaddum\nteros\nneurotechnology\nruhnama\nmischance\nlumbers\nrydal
ch\nsnoozing\nranadivé\nkrader\nzypries\ntarradellas\ntithebarn\nisothiocyanates\nscirrotto\nivoryton\nkinge\nflicky\npmml\noctoberfest\nsmokeout\nbilic\nballyjamesduff\nsuring\nbonnette\neems\nmuhibullah\nindvidual\nfrostad\nbayno\ndayeh\ncavallier\nwarentest\nmiviludes\njianhong\nresurgences\nampules\nsondermann\nmaraviroc\nrempstone\ncossman\nkhaosan\nchiongbian\ngyptians\nliberationists\nvaaler\nsheepskins\ndannemarie\niocl\nedmonde\nbacabal\nostman\naweful\nimmunoproteasome\nthrowable\nburundians\nghazzawi\ngwynt\nklawitter\nmedfly\ntensely\naffirmance\nintersputnik\nsaffrons\ntremiti\npearler\nearsdon\nmoorey\nkouris\ncolonoscopies\npureheart\nmickal\nmcga\nsphaerica\niisi\nrosslea\nfliss\nprause\naddle\nraelians\nhgr\ntekna\nvetches\nhongxia\npelynt\nimoca\nkammerlander\ntranquilliser\ndioctyl\nmuzquiz\nbupp\nafit\nemmonak\nappearantly\nestuarial\nheiligenstein\ngallais\nrieslings\nlewsley\ntaizz\nyull\naudrie\nversaille\nchokeholds\nperfomed\nstoneley\ntyacke\nsquadronaires\nguittard\nmichôd\nfecr\ncmec\nsinnathamby\ntureck\noposition\ncrissey\nsquillions\ndenims\ninflexibly\nkinslow\noverextend\nbobinski\njordis\nxinli\ndoorns\nunpicking\nmexicantown\ncrassifolius\nandraos\nmubeen\nniccum\nopisthostoma\nsireli\nlamberty\nyiddishkeit\nwakao\nchuwit\ncaboodle\nvezzoli\nglevum\ncraigmount\nhomegoods\nparolles\nmaghazi\nlorenzetto\ndongmei\nbashforth\naromatized\nzalmen\ntreinen\nmagallon\nbahlman\nrrose\nbatar\nstibel\nptj\ninosinate\nenfermedad\nstripy\nwanke\nampeloprasum\nadvogados\nrojiblancos\nkleinveldt\nknauff\ntostadas\nkenen\nunpermitted\nnokelainen\ncloudiest\nhashahar\nschwenksville\nwennergren\njarchow\nleutasch\nincuding\nyuwa\nkrestovnikoff\nsobia\ncaiu\ngilon\nformoterol\nprehen\nlegear\nhorsnell\nimil\ndossey\nmhh\ndownwardly\nreabsorbing\nbasche\nzeroual\nzillmer\nsikahema\namendolara\nthroughputs\nnawara\ncoldhearted\ndeshong\ncheye\ndefanti\ntitter\nsuperquinn\ntlrc\nlebda\nbzdelik\ndannay\nstober\ngoeke\nmalinconia\nhhgregg\nbehgjet\nmalarz
\ncraignure\nyurman\nbucho\ngunka\nthomsett\nnorrena\nbutterman\nszczur\nsnappish\nmainconcept\njesses\ntransfair\nrebuttle\nmediu\nelsby\ncheesiness\nlongswamp\npostflight\nsherels\nxedos\nmarikkar\npoundcake\nnonradioactive\nabstractionists\nsavonnerie\ngasbag\nsynners\ndueting\nloopt\nstrone\nmercadolibre\nwtaj\nwwrc\nrogne\nkernick\nanoma\ntomasky\nswimmingly\nmicrodialysis\nnadege\nluminex\nnewcleus\ncirad\nkilmurray\nocse\narmful\nmazagran\nmalodor\nclaypit\nfrackowiak\nmiyakejima\nunendorsed\ntrevon\nbaracuda\ndashcam\nrandiv\ncastoffs\nemak\nreclassifications\nborrie\nfrittata\njellema\nshirat\nfillis\ncatthorpe\ntributed\naccussations\ndematerialized\ndapkus\ntakotsubo\nswivelled\nbastwick\nhilgay\ncarrodus\nalonnisos\nlukaszewski\nduologue\nhesistant\nunderproduction\narouch\npizzini\ntwal\ncazuela\namukamara\namorosino\nthhe\ntrannies\nwisoff\ndsrc\ncharleen\nesbwr\nenthusing\njacarezinho\noberau\nvoro\nschuurmans\naraia\npremat\nchanghui\nladron\noapec\nbengoa\ngullotta\nwanxiang\ncivc\nmicroseismic\nllangynog\nrecive\nlobstering\nsaferworld\ntalwinder\nconvience\nmicroblogs\nhausken\nkeeslar\ncareered\nkokoszka\nbrinnin\nheberlein\nmoumen\nloita\nmacrocell\nweinzapfel\nwestrick\nkulula\nthriftiness\ncandesartan\ngittisham\ncopdock\nhaulier\nfeus\nclaunch\nlazarescu\nmoop\nravenously\nulemek\nharperbusiness\ndecelerations\ntkf\nkangshung\nfarmersburg\ncelestica\nwombacher\nrubinho\nladwa\njotter\nlaverack\nbirbeck\nmomposina\nfrish\nunbuckled\nmillfields\ndejanira\nlaketon\nmanala\nhaakonsen\ntillstrom\norcadians\nrahs\nzykina\nriocan\nradwin\nhockeytown\ntoyen\nejg\nserape\nrebaudengo\nkweon\nschilthorn\nenertia\nyeki\nbelkovsky\nkaputa\nwillinger\nboart\natrisco\nscampston\nallums\nelectrocardiograms\ncineplexes\nlaryngologist\nrudham\nsaksena\ntreacly\nstrategized\nsakie\ntwigged\nhendi\nrecette\nedar\nglinting\nlefkas\npossable\ngransha\nchristain\nalteon\noverpay\nsrijana\ngwynant\neseries\nwhealy\nlaurean\nbrumer\nhadewijch\nyoani\nputschists\nbuba
s\nvulvas\nmebazaa\nongwen\nbuddon\nlumar\nfluegelhorn\nzapiro\nchampex\nshipshape\ncharecter\nchawner\nroadbeds\nrohter\nehome\ntriston\nzmievskaya\nmcclarty\nlaaga\nagla\nmanhattanites\nbonenfant\nexactas\noblinger\nsahalee\nmealie\nhatrack\nmartinstown\nsupernationals\nflowserve\nwokalek\nkeraterm\ncarlat\nsuperantispyware\narguez\nteaspoonful\nsmartcity\nnickless\netrace\npoyner\nrelámpago\ngurewitch\ntobón\nburled\nbewailing\nmeriah\nuserra\npetrodollars\nputhukkudiyiruppu\navenyn\nfaidherbia\ndiictodon\nkudankulam\npumfrey\nfluorescing\nmaywand\nmomodou\nglaciares\nciee\nlochar\nvonder\nloehle\nkubuabola\nblash\ndayoub\ninterlaces\nbudish\nnosher\neslick\nkailyn\nrotherwick\nencarsia\nnoriel\nhankes\nmirthful\nboonchu\ncaled\nwinnisquam\ninformercial\ncuill\nmarinoff\nfeniton\ndirtcar\nalleva\nperspiring\nsuffuses\nkilloren\nfingar\nfeminisation\nspecfically\nunstick\noakenshaw\namrep\nsimliar\nkrumpet\nbyi\nsojitz\nconquerer\nmorsch\ndragées\nichetucknee\ngotomypc\ngnomedex\nopeness\nrossberg\nniua\nndolo\ndiscernibly\nacholiland\nsanit\ncardullo\nowg\nskyservice\nwriddhiman\ntrutanich\nchildminding\nheartmate\njoren\naramayo\nsnizort\ngradison\nalayne\nsightsavers\ngartcosh\nhanesbrands\ndownpipes\nmanacle\nchameleonic\nolberman\ncriselda\nzagurski\ncraigville\nkronenwetter\npinking\nbilili\nmcguirewoods\nyenta\ngarcez\npsyllids\nberenstein\nnopa\nsatisifed\ntoneelgroep\nxde\nmainul\ngriesa\nhankar\ntartagal\nvisher\nunirrigated\nantigenically\ntorrigiano\nfreddoso\nhinteregger\nmuglia\nscandanavian\ndzon\npotrykus\nappcelerator\nsups\ndiacono\ngeffner\ninchmurrin\nfurnishers\nrespers\neyssen\nhutchcraft\nminzhong\nwojahn\nbadree\nwikler\ngloucs\nkreimer\nlegna\nlitvack\nindefiniteness\nweigmann\npermeabilities\ndroned\nperamivir\nunenforcable\nartexpo\nnenno\nsouthers\nwordwide\nbucatinsky\nhimmelmann\neuk\nnoorjahan\nhaideri\nxte\naffinia\noxygenates\noswyn\nnumbi\nrajevac\nbraer\neduardas\npréliminaires\nthébault\nfishies\naluwihare\navantha\njahanbegloo\
nedaps\ntamie\nnitot\nscantly\nkhona\nclonbrock\nwessman\ncoquetry\nmoscowitz\nmatsuba\nballysax\ngodinet\nsteinbaum\nchurm\ndiepen\nepsdt\nelizardo\ndieste\npetulantly\nstojkovic\nbiotechnologist\natfp\njaures\nwilland\nbashur\nkasse\nsolae\nfarzin\nmardiros\ntongkang\nminitab\nfootner\nchristmassy\nunclarity\nsichting\nattemp\nhpcl\nrafaele\nboadrum\ntrelissick\narvel\nmassih\nmaume\najg\nctis\nspilka\ncoarsegold\nburgalat\nsotnikov\nsemco\nsolecki\nsneezer\nshumkov\narmanda\nknowling\nzargari\nfarafra\nmiembros\ncerasuolo\nhaufe\npolastron\ncorbisiero\nlatting\nplacates\nqlt\nhousego\nporeda\npruthi\nlachezar\nzagats\ncocal\nkorell\nible\nhyperosmolar\nwerntz\nevendale\npogosyan\ntogwell\nkashagan\nanothe\npecherov\nkegley\nmacuga\njorgo\nsviggum\nfilmbuff\narthurworrey\ndesiccate\ncullers\nmuseon\nlagon\nheydays\nsolove\nfattie\nlagrene\nclaverdon\ngonadotrophin\nbazell\neotvos\nsnapfish\nvoshon\nkloner\ncachuma\nampo\ngordeyev\nmanaton\ndemeulemeester\nklaveno\nkincsem\nweirding\nvindija\nsolchaga\nllandarcy\nkaros\nsarbi\nmindlessness\nzulay\ncoiste\nmtcc\nsriskandarajah\nbiondetti\nbewail\ncherkasky\nunassertive\nsayano\nwintersville\nyachty\nomotoso\ncyrkle\nwafula\nugueth\nfluttery\niveljic\nphonegap\nlabourlist\nexplict\nmarraige\nmazure\nstright\nopticon\ntarjeta\nagrama\nmurrel\nbossiness\nhfn\ndipiazza\ndatadyne\nlabèque\nrafle\ngopuras\ngoupy\ndonnés\nmetc\ndrissi\nhuwaidi\ngaltür\nwutc\nmakey\nhassiba\nmorleigh\nabsorptiometry\nkendel\nbruwer\ncfts\ncenterburg\nrajgopal\ngalácticos\ncavenaugh\nasplin\nbarcade\nanyama\nmennes\nmurugesu\norlaith\nrelat\nhunkers\nichc\ndodsal\nglotz\nsymank\nstatoilhydro\nfith\nfaeroes\nedz\nrevisitation\nceleno\neqip\ndarik\nallmann\nclancys\nzawistowski\nhalau\nmoussambani\nhumbugs\nanthe\namriki\nmahla\nbitu\nnemchinov\narleston\noxney\nhamito\nnahai\nwmus\ngeschwitz\nsangpo\nschrimpf\nsalarymen\nlandfair\naurubis\ngroundsharing\norebro\nspokeswomen\ntheboat\nphials\nromanticizes\nportos\nbirchmere\nberghausen\nproggy
\nmousses\nfaser\ngomidas\nsavanah\nbrecknell\nhulc\nkaric\nroelandt\nallyce\nswoyersville\ndelegitimizing\nreimplantation\nkeeter\nhantman\nxintai\nanney\njaiden\nminicom\nhousemother\ngatecrashers\ntindell\npipitone\nreyher\ntruing\nmbele\nradanovich\nmostapha\nwachira\nconflations\ndevellano\nwaspish\ntransnationally\nfranzos\nhumbleton\nnsereko\nsmiffy\niping\ngoners\nstrandgade\nrigano\nsupercrew\nchens\nprashanthi\nliakopoulos\npirfenidone\ndudleston\ngambin\ncovad\nrixos\ntinklenberg\nleijonborg\ntapeh\ngabrys\nprou\ndensified\nchicherova\nweigelt\ndechellis\nhiong\ndemonizes\nwilnecote\nmazmanian\nandelin\nwestclox\nmetaswitch\nameliorates\nhassidim\niskan\nfeugiat\nlidstone\nadmaston\nnocere\nredcaps\neqf\nthakeham\nstreitenfeld\ndishonouring\nedocs\nspowers\nmetenolone\nriecke\nmotiveless\nfydd\nfalettinme\ntontitown\npoptech\nyanhua\ncraned\nossify\ntianyulong\npedn\nunembellished\njdw\ngrassle\nrudyerd\nshrivels\ndevcon\nmisdoings\nninio\neltis\ntillou\ntzortzis\nronkainen\nsweid\npremiss\nkonocti\nborgdorff\nbcfe\nmcgourty\nbushed\ntamson\nrestrengthen\nkatalina\nbhurban\npirus\nnonfinancial\nbadaber\nupconverted\ntraipsing\nlurex\nluvo\nsoosai\nairtankers\nfonart\nbaktun\nicily\nbitsadze\ntowelhead\nkurkova\nmitiga\ncantarell\nfragola\ntimespans\noxybenzone\nbazzini\ndepoy\nviharn\nmubasher\nadtech\nbhavnani\nmestas\nillegibility\nbeydoun\nchineese\nhisle\ncorporatocracy\nvassiliadis\naltberg\nlewisberry\nklieg\ndebusk\nschmutzler\ndallis\nteulet\npreperation\nunstimulated\nqdoba\nstammered\nparure\nginjo\ntinky\nrightmire\nalpargatas\nunfeasibly\ndzus\nmutalib\narmelagos\ndaylin\nodintsov\nvuzix\narette\nbasam\nabpi\ndustup\nloida\ncoml\nillume\ntachbrook\nseath\nsemiofficial\ntomatillo\ngladdened\ncencosud\nlisovsky\nimperva\noluchi\npbsc\nsisqo\npopski\nseaking\nsibat\nflocculent\nroadworthiness\nkiltegan\ncanf\nkadie\nvielmetter\notylia\namptp\nimponderable\nviagogo\neskdalemuir\nshakiest\nafiya\nkazanas\nablating\nbrewdog\nhwd\ndissuasive\nmalafee
v\nintellegence\ninfinitas\nhaved\nkhemlani\nchiemgauer\nyianni\nbananagrams\nzkb\nelounda\nbourdages\nbengtsfors\npcpa\nkarpowicz\nmeasureless\npurificacion\ndanseuses\nllanrug\nkamakawiwo\nunvetted\nchlamydoselachus\nkwr\ncooptation\nkalabsha\nbenen\nshazad\nbague\nmakler\nwandoan\nvenery\njiggery\ngeorganne\nvolken\nwieger\nmoslehi\nwearin\nhovick\npenkhull\nburic\nkarlovic\nlasota\nhoogervorst\ngiltbrook\ntaiyang\nmichou\nksde\ncrowner\nbergemann\nruddi\nsorba\npefki\nluckock\ndierckx\nonamia\nmangena\nloubier\nbellach\nspratlan\nsudjic\nkaehler\nbloodthirst\ngerring\nbeachland\nterrorises\nghermezi\nkarnilla\nroslea\ncristoforetti\naroud\nuntagging\ngeniune\nviall\nlancome\nekundayo\nspiccato\nlshtm\nunlu\nsavouring\nzampini\ntimewaster\nmackubin\nmoharebeh\nprofonde\nsaltines\nritmanis\nmiddlethorpe\npekinese\njetmir\npitsmoor\nlexing\ngrigolo\nbadgery\nclontz\nmaginness\nahnlab\nsadigov\nmuhimbili\nhatband\nmotherload\nbronzés\nhellgren\nevidentially\nreynish\neida\ntillion\nsilveyra\ndendreon\ndeanwood\nnincompoop\nwinsnes\nfornicators\ndabit\ndrenches\ncentrix\nhockensmith\nreitwiesner\noluja\nkinawley\niass\nuncategorizable\nsnuol\nancyl\nbrandán\nbeagley\ndaggert\nscorzonera\nsteinback\nxiqing\nasinof\nhojjat\nshaikha\nonyekachi\nimprest\npatane\nbidari\nranaudo\nfelizardo\nhabeck\nstaniel\ndauntingly\nlearco\nsolidarités\naiston\npegoraro\nmetabolist\nciresi\npasic\nbowfell\npetroleos\nmutchnick\nrubaga\nromanik\nsakubva\nbilking\nbems\nsetareh\nfuzztone\norrs\ncackowski\nbeadnell\nvillasante\nrogin\nmanvell\nkocharian\nlivaudais\ntailcone\nlittlechild\npanormos\nayachi\nmargenau\ncryogen\ndispell\ncruzvillegas\nibwc\nunenthusiastically\nistrate\njannaschii\nwannabee\ntrellick\nmukuru\namiee\nkalandia\ngeraud\npitshanger\nsalinero\nnycz\nraay\nvastitas\nwilan\ncctvs\nchauke\nsakio\nschwazer\ncopine\nwettengel\nestorick\npatsos\nsupergrid\nlendon\nvaisse\nsamsons\ncrestmont\ndipton\nplaku\nsudac\nkulbir\npiossek\npirrone\nsigifredo\neppy\ncronberry\nntsu\
ntecdax\nsidha\nbeyda\nbcts\ngelan\nmingming\ndeisinger\nbeelman\nspart\ndenga\nakakpo\nbreinholt\nmarhoon\npickrell\nhuntziger\nhumanising\nemanon\npentair\nrajyavardhan\nhaeck\nlaina\nacria\nunderqualified\nnerenberg\nsamayoa\nredgauntlet\nhainer\nlensch\nmaerl\nenerkem\nburkean\ncullaville\nstoneyford\ngonk\nadnoc\nzinha\ntussling\nsicarios\nrazes\nrakowitz\niraida\ndeoliveira\nanathem\nquirnheim\nbarouk\nflancare\narnost\ngaugin\nglocca\nsiddiqa\nramelteon\ngostin\njinglin\nmakapuu\norsted\nmolinelli\nnarayanhiti\ngangbanger\nrepect\nmanalang\nloyrette\nalmondo\nrollier\nmalkiya\nmanobos\nyashraj\nslinking\nholetown\nzov\nwisman\nsaland\npequenos\nostell\nhuadong\njerm\nthwaytes\nburdis\nsteeled\ntouman\ntabel\nafwerki\neditorialised\noakgrove\nportait\nenfolding\nmemristors\nsantacon\nfeitian\nsebestyen\nhodell\ncianfrance\nliptrot\nponzu\ncurrenty\nzeitouni\npushchair\nhousecats\nettingshall\nishasha\nmhcs\nrickinghall\nfujara\ngreasbrough\nfirstborns\nunfading\nbeaurocracy\nnovitiates\nberumen\nhellraisers\nsedgemore\nkingmoor\nchesuncook\nlewites\ndefendor\nsadasivam\npacc\nmileson\nkelber\ndegenerations\ncollards\nmassingale\ndhcr\ntakal\nmansson\nfreds\neponymy\npatali\nflitzer\nskyrme\nsharkia\nmdrs\ncartoonishly\nstructureless\nwellawatte\nedgett\nhusnu\necchinswell\nchitralada\nfauconnet\nxianglong\nbaldassari\nhayarkon\nwohler\nncor\ninhalations\ndhupa\ntantia\nonatopp\ngoetzmann\ngrayland\nsusur\ngobbins\nexpressvu\nrandeniya\nslinker\ntaygetos\nnozari\nzappo\nrestell\niggo\nvalerik\narhab\nginns\nprizegiving\nhelft\nmetropoulos\nhamshaw\nnecrotising\nudry\ncrassness\ncraigneuk\nmoleskine\neinars\ngricar\nahv\nshobukhova\naunger\nbacn\nleverenz\nachak\nitchin\ninterposes\ncherelle\ndeats\njadav\ndicterow\njamen\nwhisman\nrosengard\nmarimón\nshaheem\nmainetti\nsamboja\ninvigorates\ncineastes\nnipr\nrnl\nantons\nglenshaw\nrefacing\nbeths\nscherzos\nsweetish\nskea\nvelouté\nbrutti\nuniloc\nzieman\nwendeng\nconspicuousness\nchateaus\nthemeselves\noutfough
t\nliquidates\njamband\noica\nchambo\neveno\numred\nelachi\ndebenedictis\npricasso\ngwede\nbaccalà\ncliquot\nanaplasmosis\ngrandnephews\ntsakos\nkoshu\nbetsky\ndeaccessioning\nkostrzewa\noverwash\nmonocultural\nwwoof\nladled\nkinglassie\nilhami\nmisinforms\nshakura\nparapluies\ntrainline\ntransatlanticism\ngillooly\nromanski\ncleamons\nverreos\ndifelice\nsimsek\npareek\ndoodad\nmainstreams\nwatzke\nccfa\ngrisewood\ncynddelw\nufh\nbellafante\nmuhaimin\nportesham\nunforgivably\naccoutrement\nmaitree\nzaplana\nbalatoni\nparwich\nbbls\nlotina\nshaima\niannelli\nlavorgna\nsuperstitiously\nscarva\nkelmarsh\nbeven\nheps\nnahed\nundogmatic\nshunda\npâtés\nimperviousness\nsepulchers\nbecco\nliddi\nfleecy\ngoanimate\nlingonberries\nfdo\nmabior\nhicieron\npenbryn\ncark\nmusone\nndiku\ngentians\nclamshells\nechocardiographic\nfoba\nenucleated\nlaverstoke\ncelebuzz\nhöll\nsimley\npettaway\nnwafor\nfentons\nlindis\nbarbree\nzubkova\nmhin\nbuggying\nfisette\nkarah\nhillesden\nkilbowie\ngrahamston\nbisho\nditullio\nbealer\nchinch\nspatuzza\nsharston\njiddah\neifl\nnahdi\nrotr\nbirkhill\ntopotecan\nkabaya\nntshona\nmatatus\nialysos\nsummersdale\nnikkanen\nwondergirls\ntarian\nbedsitter\nsoldout\npeychaud\nmisconstruction\nsabagh\nelbulli\ngenral\nperina\ngoppel\nmathivanan\nmadueke\nfelsenthal\ngloamin\nunfriended\nspinless\nbunions\ndufrene\ntanygrisiau\nmichaelian\ndemin\nmombo\nportskewett\nnataliia\nyse\ntrenance\nanythings\nconsidred\nmyelodysplasia\ngerassi\nsuperpartner\naggresively\nshipbrokers\nschanker\ntetrominoes\nkhachiyan\nchambless\nmuhka\norphanides\nudawalawe\ndhondy\nhutaree\nflightsafety\nkabashi\npickfair\nvarbanov\nhajja\nfoldit\npublow\nnanodevices\nsiteman\ndeconditioning\nislande\namputates\nbrogel\nzeshin\nshahristani\nfreshdirect\ncholer\nmeropenem\ngroeningen\nhospedia\neveningwear\ngolitsin\nfarukh\nlumberyards\njibu\nhelarctos\ntortolita\nlaveno\nusless\nspacenet\npainkilling\neiderdown\nzier\npinnau\neclairs\nkolen\nlandesk\nbrizard\ncambier\npiaggi\ngu
ilbaut\nsiegessäule\nmcgoon\nbegnaud\ntufiño\nsapolu\ngreyston\nlampinen\nalbpetrol\nkempowski\nbluf\nillions\ndisenfranchises\nmuylaert\nhomayun\nremue\nbarcoded\naxman\nshimasu\ncastagnetti\ndalswinton\noverstatements\ndevilliers\nyoudao\nmyelosuppression\nsisemore\nneckerchiefs\nbaura\nmonoline\nrebagliati\ndecisión\njeary\ncowdry\npaessler\nilkhom\nghahramani\nhoevenberg\nbooklovers\nvolen\nsoymilk\nchilden\nbiotics\nsteffon\nweeda\nbovenberg\nparlays\ndobbertin\nwigged\nduques\nkariye\nkarpel\nmedinat\nflavanols\nvietnamnet\nkhandwala\nskadarlija\nbewailed\nwiesmann\naylesbeare\nhorningsea\nshuttler\nsuavity\nbxvi\nwhitevale\nmortoni\nkrzr\nkerkow\nelectronix\ntrialogue\nphilosphical\nsldn\nmelber\nmasch\nncnw\nstranzl\nwidmar\nmelany\nvalian\npaedo\nraghad\nseitan\npaciello\nelisse\nminatitlan\nafrol\nnajia\nmanyi\nyuping\ndoilies\nthebom\ndonowho\nhallingbury\nfaffing\nmahfoud\nvulcanology\nminisode\nwhackers\nmusalia\natmail\nflics\nannees\ndarsena\nviglen\nvacuities\niqn\nnosedived\ncustomisations\nbefrienders\ntrabajar\nwested\nkuperman\nsurrency\npaedophilic\ndeeding\nwigglers\nsvilen\nllps\njumar\nmagundayao\nlocalising\nilliquidity\noutmatch\ndurette\nteodorin\nsparkplugs\nmahboub\nplester\ngasunie\nconsolers\nzdunich\nmacellari\nxiushan\nmykal\nmarchon\nseierstad\nprilosec\nfrankcom\nraditude\nconsumptives\nstmicro\nbradbourn\nedleston\nbiohacking\ndapena\nsavennières\nbahanga\ncamatte\nnewsholme\nterritorialism\nchoge\ncmds\nwiseburn\ncsorba\nsnapdragons\nhallisey\nyubo\nevets\nlineweaver\nhogget\nkaiserman\nstompie\ndoubletalk\nbragman\ntsvetayeva\njanahi\nnarcocorrido\nissu\ngrindavik\nmzwandile\nsputniks\nsapochnik\nmcelvaine\ncajones\nspritle\nkrestena\npoleo\nfreegan\noxi\nexpalin\ngawking\nhartin\ndecembers\nphoturis\nfootways\ngarcha\ndobel\nshepitko\npetursson\nfastenal\nmalph\nibot\nmonua\ncritisized\nserwer\nkelps\nguanfacine\nsynaesthetic\nsoderling\nyouk\nkinemathek\nmeghni\nphilippidis\ndaggar\nlourmarin\nautographing\nkillinchy\nkillary\
nhanukah\nmcelmo\nlunk\nrieper\nalgermissen\nnichia\ncrannell\nnonfunctioning\ngreenplum\ngrimmest\ntelmar\ncherico\ndiacetylmorphine\namson\nfiascoes\npostgraduation\nfungibility\nentrenamiento\nudeze\npearlington\nhuwara\ngarnero\nkreitler\nbenzarti\nmathebula\nmnisi\ncitygarden\nchocked\nsabersky\nbutko\nnatynczyk\naleqa\nradovanovic\nbleo\nmooty\nautoshow\nsaamna\nunclipped\nwaldi\nalmosts\nmacanudo\nktre\nschubertiade\nsoooooo\nzeune\ngurnos\nfictionalisation\nseychelle\nspellacy\nmillstadt\ntalx\npfefferle\nbellway\ngrabill\nhamdam\ngrassfield\nsagheer\nrostovtsev\narcherd\nundergird\nberken\nbesuki\nchevillon\natmar\nwatana\nibrar\nspaceliner\nkulvinder\njaleesa\nthurne\nqalys\niscar\nspalter\noodle\nyouds\nscotese\nmazhilis\nrajakaruna\neasthouses\nbuczacki\nhoncode\nchristene\ntahina\ncaynham\nsegars\nmulrenan\nfressingfield\nmccamant\nmagden\nkeepership\nwihs\ndragge\nabukhalil\nunsustainability\njonrowe\nsodan\nbenitz\natuona\nkutesa\nbluejeans\nsynergize\nfakher\nclootie\ndipdive\nwinegrower\ntaiyaki\nmilanello\nrivaroxaban\nbodorgan\nlewak\nayash\nromed\nfiser\nscanzoni\nziolkowska\npedrazzini\njaico\nhanemann\npontymoile\nlukavica\nkoenemann\nsutz\nsandle\nschifter\nmalagueta\nischenko\nclementson\ncolliano\nsuon\nshonna\npeul\nchrystina\nallbaugh\nhespèrion\nrsmas\ncognacs\nhyaluronate\nmorphologist\nviruet\ncollabnet\nphilanthropically\ntabards\nuelmen\nbaringa\nyosvany\nkajlich\nyousra\nsportman\nseighford\ndulse\nbarriques\nwerlin\nkakata\ntallac\ncounterpointed\nmatekoni\njinli\nsuperbubble\nmcclafferty\njalala\nnoveski\ntrelowarren\ngauke\nrochell\nbushwood\nforston\ngarf\nlynsay\nseiver\ncigarroa\ncridge\nglowworms\nnickolaus\nagboola\nreparable\nalbita\ntawanda\nnatco\nsangak\npinhão\nbiskupic\nkleindeutschland\njunn\nemert\nmisremember\nwyrsch\nlarding\nparlaying\njobstown\nworldspan\naharonian\nphotographics\nnicaso\nkalniete\npoultrygeist\nsuspiro\nhamfisted\nadastral\nditzel\npiccalilli\ngavvy\nbaoying\nnouhak\ndeidra\nturnock\nboonjumnong\
ncheren\ngyger\nonyszkiewicz\nkablan\nbartolotti\npado\nzedo\npolston\npiena\nmexx\ngracioso\nbuzin\nstimmel\nbernall\nbryantown\nbudenholzer\nupdegrove\nrubbishing\nhowald\nlongparish\nsulfonates\nmckenry\nabdullo\nmartinets\nwaxham\nricefields\ndoveridge\nclarry\nkaimin\nstahmer\nlutsen\ntommasino\ngastroschisis\nbrassfield\ngooglewhack\ngelukpa\nskyworth\nartema\nmiltenberger\ncabragh\nberenike\npreqin\njakubko\ntelelogic\nthri\nctms\nmyfootballclub\nhufton\ndieta\ncysteamine\nsoldevila\njeol\nbroodiness\nmacys\nhelvenston\nknology\nschumpeterian\nducktail\npnk\nmaylee\nnumskulls\nnorimitsu\nafsm\ndecarava\ncharitybuzz\nfunghi\ndefiner\nzarabi\nkholodov\nadamowski\ndiène\nkadikoy\nmataya\nraashee\nbeigi\nyueqing\nhalation\naroun\nreseed\nbapco\nfufill\npitlik\nthemepark\ninvigoration\npacula\nschmidle\nstrathendrick\nclearout\nsosie\nmcduffee\nsternlicht\nahdi\npugnacity\ntesei\ndynabook\nstrogg\niwmi\npenfro\navowing\nintralinks\nhorovitch\nhypes\nproabably\nmascarade\ncsco\nbeechcroft\npickax\ncrosswhite\ndunghill\nexmore\nsixstring\nzettl\ndueholm\nretinoschisis\nslickline\nmanliest\nlienholder\nthorngate\nsietas\ndidnot\nsimensen\nsheinbein\nmppt\neaglecrest\nmptc\noelsner\nbittinger\ngrangefield\nnamhong\narkengarthdale\ntcca\nwinnemem\ntarazi\nvalcarcel\nleyner\ndanay\nlessness\nstickwork\nmildews\ntolver\nrobynn\nsnay\nhinsch\nkennemer\nscottland\nweidenfeller\nmontorgueil\npedf\nkoljonen\ntamulis\nbirlas\npolemis\nvisted\nendobronchial\nmoheb\nshearings\nchamblin\nfirdos\ntabart\nbenbrika\nkabanova\njalin\nyusi\nskibsted\ncurrans\nlefkofsky\nsucharski\nfalciani\nhqn\nfreespire\ntacey\nliteralness\nherzsprung\nsweda\nithaki\ncpvc\nluevano\nzekai\npaker\nbrackney\niwmf\nslobber\nbrandstater\nkriseman\ngolledge\nmoonwalking\ngarsten\nelementum\njuvin\nweijden\neverall\nrunswick\nculdcept\nrappold\nsongfest\nshehla\nrakotonirina\ngîtes\nhorng\ndichiera\nhooff\ndomspatzen\nnkululeko\nameln\ngawley\ntrasher\nsprinturf\nradelet\nkovachevich\npolyamorists\ntrug\n
supersub\nollivander\nmaoulida\nnykoluk\nsvq\nvaila\nsbinet\nchilecito\ncalker\novergeneralized\nglodwick\ncassoni\nhaibo\ntraceurs\nkyaukpadaung\ncounterclaimed\ndatelined\nfreiston\nhermen\njoannides\nkinderszenen\nsexi\ncaira\nvishneva\nafterglows\nskinflint\nhappenning\nsetebos\nlightstorm\nkeasey\ncompliancy\nnubbin\nhayya\nablator\njeffrén\ngraycliff\nultimatebet\nkospi\nreckers\nabbar\nkervezee\nscup\nchemiluminescent\ncfao\nkupfernagel\nwaterperry\nepso\nmussie\npommard\npopii\netw\ncapbreton\nmasuma\nteversham\nbeanpole\ndeceases\nashp\ntrudges\nboccardi\npurnomo\ntelenav\ncarliner\ncorah\nweier\nbirzhan\nmenheniot\nrecompose\nkupersmith\nwramc\ngpda\nkafé\nlechleiter\ndortort\nazema\nbadjao\nterrex\nlendvay\ncontroverisal\niberá\nedenmore\nstatice\nshanteau\nkutna\nuntransformed\nalack\npseg\ncandidated\nzanka\nsaesneg\nsantita\nconsummates\nanjuna\nperimenopause\nturbary\ntomkat\nshonky\npublicises\nraslan\nsannicandro\nmazdas\nhochstrasser\ndeerbrook\nsophisticates\nitson\narsic\nohlen\nstrathnairn\nmulcahey\nloston\nbroomhouse\ndscp\npelliccia\nohss\nuring\nafkham\nwinglike\nuclaf\nkidds\nmadcaps\nhamedi\ntrumball\nmettawa\nkemoeatu\nipala\naktogay\nfutu\namington\nsumwalt\nnooshin\nstylizations\nartline\nmultidistrict\nagrument\npittaway\ncovin\nwitherall\nmachinarium\nmedison\nfinglass\nkleinkirchheim\nprofessorate\nkrassel\ncrolly\ncanolfan\nhalfheartedly\nulemas\ngiraudo\npraktiker\neshed\noyamel\nmlangeni\nnickolay\nanpi\nbaninter\ntastebuds\nhantaï\nagentless\nacrux\nrasharkin\nberecruited\ngenner\nmasseuses\nhathershaw\nvagas\nkuric\necotourists\nharbo\nskaftafell\nmassini\nmourides\nrascasse\ndayjur\nghashiram\nlatheron\naldringham\nkadhem\nguiltily\nshamai\npfeifle\nammirato\nillbruck\ndugar\nhealthways\nustashi\ntengelmann\npavich\nstutts\nzizinho\ngony\nahima\nketorolac\nlifeworks\nnenthead\nignatenko\nsankore\nempg\nyerbabuena\nnijhawan\nkanani\nhooah\nvadrouille\nlupercal\nkhakpour\nguynes\nspuy\ntimbrel\nwellers\ndarwall\npanici\nmazahir\n
ashbrooke\nlonghaul\nskying\nshearin\nahali\nmethanogen\nguttenbeil\ngrazebrook\npuris\nsammel\njapanned\npinaud\nsentelle\nyacimientos\nacknowleged\nsechseläuten\nneurochemicals\nsetola\ndepressurisation\ntouchflo\ncraiglist\nwebquest\nfaida\nkolly\nphokeng\ngingerbreads\ngarganega\ncatchwords\nbatook\nfrêche\nschaberg\nfroedtert\nhalesia\nrelenza\nbiersch\nwiith\nispra\nairdropping\nparast\nsuzhen\nayish\nmabaso\nbenmosche\ninnse\nberol\nownes\ndaughtrey\nukho\ntainton\nperkovich\nfleshman\ntendinosis\nbhabra\nnaftaly\nholligan\naslaksen\neristic\necap\niogen\nchattenden\ncharise\nmedomsley\nfoments\nimpracticability\ncentenier\nraaff\naletha\nsoulquarians\nscarle\nsourton\nitno\ncuzzi\nkristeligt\nstenin\nvenustas\nivester\nbentel\nunderpopulation\npiver\ngornall\nnutopia\nkatumba\nyourname\nonp\nbatiks\nkabary\nzenonas\ngoiania\nscaffidi\ncolourable\nfarmwork\ngooks\nunmetalled\nspurgin\ndarrach\nashlea\nmaixner\nrobathan\ngezim\nfargesia\nyoungquist\nsofcs\ndimmesdale\nzaffaroni\ncavins\nrepositions\njoinson\ngerontologists\nkleinsasser\nmethimazole\nresurging\nxcelerator\nsermet\noyedepo\npoettering\ninfanticidal\ncpat\nkabulistan\natletic\nvatz\ngrevious\nbeshir\nvantagepoint\nbogaard\nshamoun\nnejla\nlicitra\ndemutualisation\nponcher\ntavassoli\nfilthiness\noptioning\nsamran\nmycoplasmas\nunshaped\nplacidly\nrothiemay\nkeleman\ntauke\nvasisht\nclouts\nkruta\ngoransson\nwirh\nspadoni\nsirtori\nbelder\nlongde\ndallku\nnienaber\ncounterpane\ngollub\nography\nbrakspear\nuzay\nmitroff\nnamedrop\nsonare\nconceptional\nwormit\npantalones\ncaremore\ntrepp\nopaqueness\noverbey\ncollaging\nwarbreck\ndyet\nchasity\npriveledges\npalihakkara\nespeically\ncosan\nakokan\ndezso\npupovac\ntallwood\nyezid\nsaavy\nguoxing\nfujihara\nshenanigan\nceftaroline\nbeddor\nvansanten\nanoush\npabp\nmedwyn\nhmmmmmm\nhaggui\ndorayaki\noutmanned\nolba\nkaurin\nsavundra\ncastera\nwilfert\nlivesley\ngennet\nmoataz\nsiekmann\nintrafusal\ndvts\ncastlemore\nmaccool\nplayscape\nsuicided\nuzak\n
quilicura\nmasrani\ntwanging\nfadal\npinche\nnarazaki\nmbss\naliment\nedmondes\nnaturalmotion\naccoding\nhomestore\nzauri\ncwmparc\nmerkatz\ntoolis\nafld\nchromophobia\ncrisanti\ncleeton\ngrenell\npreecha\nchilstrom\nsusheel\nkandle\ncathinones\nmicrographic\nbushwhacking\nhyat\nblankness\nloungewear\nsuek\naslanyan\nunfractionated\nwantonness\noecussi\neuronet\nsdms\nbailong\nharolyn\nsandaza\nscollen\ngdls\ntravco\nkasell\ncroner\nlabneh\nbaoguo\nstocktwits\ngradoli\nopsware\nhochstetler\ncastino\nbalkanized\nmethanotrophs\ngoldfein\nkeisling\nvukcevic\nlevangie\njarbas\nebershoff\nkifri\nokays\nkarakia\nkickabout\nadamsons\nincentivised\ncorsar\nunsteadily\nkeval\nelectrolyzed\nkkg\nredrup\nnobutora\ndysplasias\nhlx\nfrcc\nconneally\nkaniuk\ncayugas\ncurraghs\nvge\nprophylactics\ncylch\nerviti\nwangui\nyashili\nblumarine\nroosted\nhcas\nnodder\ncountin\ntripleheader\nlolitas\nyacona\nguilloche\ntelepharmacy\nevidance\nvacillations\nwilier\njives\ncobuild\nnashiro\nashibetsu\nmouthguards\naouate\nlardarius\nsimor\nredus\nenervating\nrensen\npresure\ngerig\ngonzago\nincr\ncluetrain\nschlocky\ncivvies\nllandegla\nslok\ntwiga\nmanijeh\nbencherif\nallco\nmacker\nludtke\nornellas\nshadowbox\nmouldering\ndeboy\ntchelitchew\nmyomectomy\ngurjit\nhotlanta\nmilliamperes\nsareth\neiir\nlebleu\nsemidetached\nreifying\nchmi\nrving\nkennamer\nhymel\nimette\npalmpilot\ndallet\ngallenberger\nhoerr\nkenmochi\nzinnias\nhuisken\nmanha\nchayan\nballindalloch\nyvaine\nsveva\nprayuth\nstennes\npremed\nhalcomb\nchengappa\nculpables\nplanman\nkarkowski\ntiviakov\nbedsits\nqazis\nnesu\nepitomising\nkkd\nconfiscatory\nmalook\nhvtn\necholocate\ntranen\nnewmill\njuyongguan\nbrannum\ntábata\nalexsandr\nberckmans\ngrimesthorpe\nsheika\nbleau\ntjan\nmongeon\nkimveer\nquerbes\nrebasing\navilable\nblaum\nallaf\ngaspipe\nboyner\nairag\naniruddh\nzubaidi\nholdback\nyihua\ncomporting\njezebels\nshabandar\nmaggiolo\ntcmc\nanencephalic\ngratefull\nbeken\nostk\ndelante\npellin\nrnid\nfaridah\ntrapasso\
noix\nnarkomfin\nglosson\nbretheren\nwahyudi\nstanhouse\nzeisl\nshamah\njetersville\nlazarowicz\nwajs\ndgv\ndueto\nsanteros\nitsec\nbrassai\nbelfi\noneday\nhegyeshalom\nvoevoda\nmssp\nbuckel\nchomiak\nmandia\nsteinkuhler\nenoxaparin\ndrumglass\ncharbonnet\ninnerhofer\nkilmington\nundesa\nkazdin\nmonicker\nunfreezing\nardtornish\ntity\nfouetté\nsquarepantis\nmerage\ngeijo\nglengad\nohiohealth\noctopod\nsemiconscious\nsioe\nteboho\nbrucks\nbogoroditsky\ntuschman\ninvestee\nlekiu\nonewest\negyptomania\nxiangjun\ncheapshot\nherzlinger\nnailgun\ntoshihisa\nloughney\nanisakis\nyapton\nunkept\nblintz\nbounders\nnewscasting\nlaundryman\npute\nheroku\nbeaning\nmetalic\nvadera\nxenophobes\njanmukti\nevite\nbolitar\nlorbeer\ntillar\nincisively\ndongmyeong\ngilds\nkerswill\nsarofim\nbludworth\nmethyltestosterone\nlaire\nweeder\nleiman\nreclaimers\ncolludes\ngreenkeepers\nswoopo\nschwartzenberg\nginor\nbiruta\nmarymoor\nwunmi\nlabiosa\nchastely\nmikels\nhaddadin\ncsrts\nmetia\npolitick\nossetes\nprbs\ncavoukian\ndarboe\npockmarks\nmadhulika\npouw\ndoyel\ndmsc\ncompetetive\nsagir\ntirich\ncutbank\natomize\nljubodrag\nsauced\nparticpation\nescapists\nlekkas\nsnowdome\nbennies\nunfilmable\nkaktovik\nmarès\nutem\nimportent\nhesc\nblowpipes\naframax\nbarbrook\nbiutiful\nlarratt\nserologically\nulvert\ngamzatti\ndarity\npensylvanicum\nypu\nreaons\ncarinhall\nproselytizers\natpl\nefca\nerislandy\nboiceville\namarilli\ntreehouses\nkarrenbauer\npliosaurs\namason\nrapaille\nzapa\nyehya\nprorok\nnomvethe\narvier\ngine\nceramides\nrevealingly\nsalihiyah\npsq\ncheckerboards\nvcloud\ncomres\ninum\nmapstone\npatriated\nkeshawn\ngreenwall\nluallen\nsupan\nrocktron\nsundback\nfuseini\nyav\npulikal\nschappert\nunbreak\nastrobiological\nlefthander\nwoodview\nwiam\npreisser\nveritably\nsagdiyev\ndanzinger\nfakroun\nhübschman\nexultate\ncinealta\nheywoods\nwaffly\ncenterwatch\nyastreb\ntorgiano\nivelin\npelcovitz\nmulryne\nozak\nboonies\nquadricycles\navioli\nguofu\nraffensperger\nclunking\nzberg\ng
lemsford\nmarkheim\nislandicus\nunthought\nsamaddar\nsaunters\nsluijs\ntekkonkinkreet\ndeuell\njamesy\nesmt\nbpms\ncessnas\nqalandiya\nguangli\nprepper\nkrasker\nmrha\ntrippel\nmcbane\nottolini\nhockwold\nsoumises\ncornall\nkwali\ntelespazio\naaaah\nstallholder\nturinui\nhillerich\naidans\nmatthee\ntranssiberian\nkandhari\nbeerens\nnetlog\nqrio\ntigerstyle\nsexualizing\nbuttershaw\ncebs\nqrd\nsicklerville\nnafdac\nbechor\nwome\nsweepingly\nlechón\nnlbm\nvideoid\nmurmurings\npfaelzer\ndoornail\nfadesa\nxerri\nfrogley\nsupersets\nundergrounding\nprelaunch\nkopjes\ndemolisher\naverchenko\nayris\ndeanshanger\nbreakfasted\ninteligence\ntomans\nmicrofinancing\nhartbeat\ngotsiridze\nelectrifies\nberris\nhads\noptiplex\ndazi\nschmoozing\nnatz\nwatamu\nionophores\nclausing\nschmidly\nwoodwardia\nwiosna\nnovikovas\nlubow\npatootie\ncscec\nratia\npapiri\nenormities\nekkart\nherzogovina\npoten\niexplore\nemrooz\nlittledean\nidealising\ntradeport\nnirvanix\nbridezilla\nshaorong\nkruzenshtern\nlucevan\ngurdal\nreusel\ngoldschmid\npradit\nnizlopi\nsindbis\npabrai\ntoussuire\nfotiou\nwhizbang\nmosch\nnyn\nglowna\ntigges\ncuaron\njuiciest\nbpos\nkilnwick\ncoolbrands\nimroz\nplutocrat\ntresser\npushtun\nutts\nszatkowski\nwarburtons\nvalvettithurai\npfluger\nyanfeng\nbohrman\nhimelfarb\nweidel\narenavirus\nbacsa\ninvincibile\narathorn\nvocento\nmasthay\nhydrodynamically\ntobon\nshiling\nneigel\ndaulet\nlipow\nproroguing\njoviality\nlarten\nshericka\nbirpur\nstarpower\ncostebelle\nvaser\nmaxum\ncartwheeling\ncornelsen\nkharrazi\nrubloff\nglobalise\nboekelo\naayush\naliotti\ncomparisions\ncioe\nfamilicide\nwilliamwood\nundg\nbossart\nturbulance\noustanding\ncarpano\nperraud\nkortedala\nzunior\nfischlin\npornthip\ncockling\nframatome\nphotoflash\npsychiatrically\nlockin\nnemirovsky\nharmonisations\nthorntree\nleanness\nfarepak\nfriedler\npollara\nfanboi\nnbpp\nrecolonise\ncringes\nmatteini\nsamiha\njackee\nserricchio\npraesent\nakwaaba\nostfeld\nwerz\nmendola\ntuttles\nmetabolomic\nbinst
ock\nalhaarth\ntimergara\ncomella\nhuraa\ngelsinger\nklingle\ncontrariness\nlazara\npricewaterhouse\nwashcloth\nspierig\nnewill\nbraggin\ndambisa\npices\ncrunchers\nyaponchik\nmehlhaff\neits\nuestlove\nhaggled\ntonucci\ncodicils\nkehillat\nbromstad\nsimandou\nmadbury\nslyder\nteklemariam\nberkhampstead\namezquita\ndemarche\ntrebetherick\nantsohihy\nmirabito\nprais\nbaccelli\nkenepuru\nfrankos\ndemutualized\nforemark\nhostas\nquf\ndecarbonization\nchastanet\ntrinchero\nroomie\nkhowa\nchaouki\nmantica\nanalogizing\nhaydu\nmanifattura\ncannizaro\nsoufflés\nburnetts\nnehar\nwhistlejacket\nbiddable\ninconsolably\nrouille\nnumbs\nmixcloud\nstraubhaar\nkountry\nunpredep\nkendro\ngyrated\nbalpa\nfayadh\nicaap\ncerticom\nainars\nherdwick\nhalliche\ngoligher\nnightsticks\nburnmouth\nhaverigg\nsautner\nofheo\npokou\nnteu\nearing\nsandpile\nshalah\nsarsden\ndenzer\nbauknecht\nnonporous\nfeifer\ngxs\ngalthié\ntostones\nzanan\ncoutry\nsubependymal\ncect\nkowit\npropitiating\nterpning\nodstock\nmarysol\ngriswolds\nzied\nmakgeolli\nfeinted\nthalassotherapy\nnoneconomic\nwarney\ncookouts\nconfrontationally\ndefatted\ncontibuted\ngermà\nbiaxially\ntrindon\nrauschenberger\nhousehusband\nreken\ndemocractic\nnistri\nsaids\nyouness\nbarama\nfosterville\nkeroppi\nhamuli\nknapper\nmakkasan\nmoalem\nnoffsinger\ndorre\nchiropodists\ncressona\nmccart\nyongzhi\nkoets\navobenzone\nwisdomtree\nmehrabpur\nnocito\nbodian\namercian\ncentreback\ndaywear\nlancelets\nhankmed\ndisorients\ninjectables\njanee\ncommiseration\ndelibrately\nwhispy\nrayven\nkerstein\ngiclée\ncalleguas\nnatsuno\ngeere\nskinstad\naccommodationist\nsnekkersten\ntortajada\nsteingraber\nstechert\npijpers\npolcies\nhamshire\nphantasmal\nmcfarren\nancar\nminuteness\nsnagfilms\nbehavious\njeremey\nsherbrook\ngrandmotherly\ncongestions\nsunami\nvuono\nshamva\ndirectionals\ndeputes\nngudjolo\nrepossessions\nravenshead\nraetz\nswappers\nhorsenden\nharandi\nkobad\nquamina\nmelitus\nlogrono\nkynge\nwoodlea\nhaapanen\nkiaran\nsepon\notdr\
ncolumbans\nvideolink\nunti\nwampe\nlyondellbasell\nshanelle\ngelardi\nsupplicating\ndeadness\nnewmyer\ntautz\nhreik\nnafld\nmercal\nmendlowitz\nfreuds\nneij\nwhaddaya\nsanest\ntacis\nmpongwe\nuwire\nestafeta\nboursicot\nsheridans\nrespons\nswauger\nerrored\nugas\nmenacho\nsimri\nmianserin\nbeckhard\nsharfstein\nlanci\ncids\nhungrily\nmissles\navern\nsannes\nkelkoo\nkassiopi\nnordhus\ncatw\nsumann\nperinatology\ntjaden\nnonvisual\nyuganskneftegaz\ncoedffranc\ndesser\naereos\nleymarie\nmktg\nrustbelt\nguei\nchenette\nvalvetronic\nquatt\nrobla\nyaitanes\nlutalo\nbeoley\nmysteriousness\npajam\nasmah\nhamler\nmizra\nrangaiah\nchiuso\nrushy\nstatnett\ntornante\nhaldiram\ngtalk\nmutitjulu\npuroland\nbruh\ntaglialatela\nreinvigorates\nformenti\nshinri\npustelnik\nlouisine\nmnich\naysun\nhicok\ngibbo\nrabelaisian\ntenderize\ncoxhoe\nyanggakdo\nyulieski\nhfmd\nsaita\nsenoia\navinoam\ndemore\nhamidur\nemblazon\nlampur\nriceland\nvilt\nbadeaux\ngopio\nmellerstain\nsouffrance\ntsakopoulos\nberkmar\nosgerby\ntelefoni\nghanoush\nhirshleifer\ngiambastiani\nwittington\nflensing\ngershenzon\nvatterott\nwitherslack\nchakraborti\nsneakiness\nopenmrs\nratjen\nhierl\nicross\ndolga\nafrasiabi\njanuszewski\ntrovan\ncatanach\njacquemontii\nmaffucci\ngaffin\nthermokarst\ncropthorne\nnewling\nblacksville\nsteepled\ndantin\npomalidomide\nvoes\nwoldemariam\nnirut\ncuya\ndemystification\nzakin\nmarzolini\ncatoche\nherzenstein\nwonderfulness\ntendencias\nawesomest\n,not\nconvertor\nserpentarium\ntretton\natlasjet\ncaparros\nmumsy\nsuceed\nstouthearted\nrequejo\ncyclope\nwiny\nasharoken\nhusham\nbrumder\noverstays\nbekes\nbrushback\nduckbilled\nreallly\ntassone\nunfragmented\nwalson\nodong\nethisphere\nfurzebrook\navailibility\nskytel\nlaurieston\npreapproval\ncommuned\nthurton\nsealable\nhazily\ngrebner\nlasp\neuroarts\nfunari\naufschalke\nkottwitz\nstoneycroft\nwickstead\nstratta\nsotirov\nistan\ncnosf\npaleologos\nnyagah\ncurphey\nszaniawski\nciancio\nagreee\nwardhaugh\nkaleidescape\nurtubey\n
linguine\nshawntae\nbelza\ngoogleearth\npntl\nripplewood\nfischers\nskiway\nsarwer\nsquiggy\ntracys\nadhiraj\nhauswald\nkrzyzanowski\ncallam\nsuperimpositions\nparomita\nlcec\nthatn\ngullo\nrurally\nwahiduddin\nherti\nnestande\nratifiers\nnaumovski\npsychedelica\nucal\noykel\ndolgoch\nharridan\nklym\nmuchall\nmuthoni\nblackson\nliker\nlimeade\nservaas\nfiacco\nxiahe\nislamically\nreportid\nfictionalize\nmedawachchiya\nspratley\nrezazadeh\ngeomedia\nkearl\naravosis\nvirtualised\nsessoms\nlepeilbet\nhouze\nrastrojos\nlangfeldt\nhumpers\ndrukman\nradzikowski\nhealthscope\nbreindel\nbergel\nwpcs\nijamsville\nharled\nlaudomia\nnuvaring\nameria\nthata\nfloella\nunattractiveness\nmaib\nlebid\npresumeably\npecksniff\npappert\ndkms\naara\nmatthis\nmiddelheim\nalleway\nbulluck\neveridge\nbloodying\nsubas\ndesagana\nrylie\npetitti\nhuseman\nrathjen\ncybercrimes\nexpensing\ncyclobenzaprine\nmagheralin\ntamariki\njetboat\ntobocman\ntasch\nacce\nlledrod\nsocarras\ncrurotarsans\nalmasy\nexasperates\nqiliang\nodland\nseabeck\nplanarization\nobligors\nkhandaker\ndumpstaphunk\nataba\nfreej\nguigui\nmoosejaw\ntauqeer\nached\nmichielsen\nnuart\nparlourmaid\nwalchhofer\nprequalification\nbioshield\nahamd\nvissers\niachimo\nmaligns\nmärzen\nildiko\nmossler\nimga\npetrols\nhainz\nbehmen\nshatteringly\nfertel\nresignedly\nengell\nsoftlayer\nulsd\nhuaraches\ndeviser\npasetti\npittella\naulton\nvolksbanken\nfamouse\nseargeant\npercolators\nranibizumab\nheida\nbloater\nskymiles\nquilligan\ngyroball\ngofers\nwaldenfels\ntorquing\nkalikimaka\nantron\nkhumri\nkandola\nphilippakis\nzadra\nbattat\nsasanqua\nbialek\njoani\npaddleford\nbrynden\nschallreuter\nkuechenberg\nspradling\nnorihisa\nweinroth\nchatelherault\npeduzzi\nbaned\nprecent\nshontayne\nkerchove\nbullfinches\nabrahall\npackbot\nchieftess\ncopestake\nkosak\nirschick\nvaciamadrid\njudeh\nchaghcharan\nantillano\nlgh\nalleluias\ntiding\naspall\ncishan\nsarnat\ntramel\nurbinati\nmpcore\ncolliston\nrissi\nfonzworth\npeiyuan\nkaratina\nbuenr
ostro\nhandelian\ncollee\nconsitution\nflaggers\nconrow\nstunell\nvicon\nhourcade\nortal\nboutle\ngaowa\nissing\niglo\ntavecchio\ndagestanis\nsciennes\naisen\nsnookie\nmayassa\ngilgel\nvirality\nkorangal\ncessa\ngobern\nbamf\nmiddx\nperlez\nsandbeck\nnephrol\ncetainly\ncamisea\nbarbless\nferrostaal\nhornitos\nmandak\naddley\nbartholow\npenhow\nquenington\nwilonsky\nmalborough\nanthim\nmrisho\nantos\nynyslas\ndilmah\nstrandhill\nsciency\nsummerbell\njebet\ninswinger\nnonthreatening\nnaquib\ndelbonnel\npurepecha\ndastjerdi\nwoldu\nunsparingly\nrummages\nlemes\nsmses\nsocialbakers\npetrin\nbedol\nhockman\ngodlessness\nbonhill\nungpakorn\netait\nanaran\nxijiang\nbaengnyeong\nwahabism\ndassu\ndirtball\nbairin\nplaypark\nunmin\ncanidates\nstingel\ngudrún\nwhta\nbatholomew\nmcmillion\nbelarusan\nkhalap\ncircuitously\ndescalzi\nautoridades\nthati\nprotz\nluxemburgish\nbouajila\nslayback\nrissmiller\nliebst\nelectroma\nrabeni\nblackballing\nkusadasi\npoolewe\nmarret\nangio\nvedo\nevg\nharminder\nveasley\nsilvestrov\njoffrin\navellanarius\nesiri\nmucklow\ncondorrat\nimmortalizes\ngranfield\nifpma\nsnitchin\nnerius\nkelisa\nstaikos\nrothert\nlechat\novereagerness\ncrookedness\nkinlock\ntinsman\nmagueijo\noner\nanglophiles\niavi\nlocane\nnecklacing\nhandiness\nopulently\nherschensohn\nabilio\nmulticamera\nperling\nvolpert\nayittey\nrestitutionary\nboardercross\nvirdee\nsagehen\nhennicke\nlizeth\ndisfiguration\nfabulation\ncharol\nmicahel\nmcsweeny\nsosi\nmoneyfacts\nhurson\nhyppönen\nyakushin\nbadawy\nultracapacitor\ngaddes\nhoty\ndeathtraps\ntaqwacores\nmarnoch\nsoroa\nstoep\nmullinger\nernen\nqichen\nhsdd\nkhannouchi\nemons\nmicrodots\ncooperazione\nsetence\nplaudit\nsyan\nappello\nvillemin\nrudresh\naugello\nquittner\nlazonby\nsolderless\ngoryachev\nactel\nroustabouts\nchicote\ngroovers\nrealschulen\nkirdyapkin\nbanadir\ngarfinckel\nembeded\nbillionths\ntirabassi\ngroundlessly\ngiftshop\nheatherwood\nantonowicz\nlamri\ndoomtown\nheloisa\nrangnekar\nblendtec\ncaganer\nguestli
st\nbourgie\nschoeni\ndandenault\nastemirov\npikit\nwaterwise\nlrz\njwr\nbadas\nlimpy\nbellers\nabastos\ntemitope\nacore\ngarajonay\nrihs\nbarrese\ncorbella\npurrington\nwladek\nhiving\nburble\nradosavljevic\nchunyang\nsantra\naustrailia\ncaspofungin\nvenessa\npingping\nmonjack\nescc\nadado\ngambira\njakati\nparticulares\npeignoir\nholystone\nadran\nlathem\nlandwind\nlomans\nunderinformed\nottos\nddis\nwottle\nlakenhal\nleuchten\npapantonio\nloewer\nwilczynski\nbipv\nkingarth\nprate\nradioastronomy\nlayevska\ntelegraphe\nobaidi\nmunem\ngman\noelhoffen\nfasolino\national\nhubbing\nphilipsz\nkafta\ncatastrophist\nhutments\ngunite\ngiganticus\nrashford\nunderskirt\nboffi\ncolapinto\ndemello\ndeepal\nvontaze\nlockleaze\nvardell\npetrifies\nballuta\nstutton\nzekri\ncuis\nmosti\nwintv\nparanavithana\nsuard\nurness\nschildknecht\nwakimoto\nchappa\nriffelalp\nguberman\namdy\nbarvas\nplanyc\ngettler\nkayyali\nadenoidectomy\nhowevers\nkonchellah\npreceed\noutcompetes\ncarpoolers\ndangermouse\nrosemoor\nursua\ncê\nattivio\nlumphanan\nwuertz\noptimises\nmicrochannels\newerby\nimtc\nquarterpipe\ngius\nbodiford\ndehumanizes\nkongevej\nintermediated\nmonofin\ncercas\nduvets\nskunked\npipistrelles\nelcomsoft\ntailenders\ngastroduodenal\nsrodes\nwinothai\ngovermental\noltremare\nthembi\nmisattributing\nxpression\nravey\nnodong\nxingdong\ngrieveson\nbolom\ncahana\noutdistancing\nbloodred\ncgro\nnedeljkovic\nbartletts\nakpo\nepv\namrany\nsupsa\nlaband\niconographies\nzehetmair\nlucketts\nmilinkovic\nreargue\nsagemiller\nmervat\nalveolitis\nfluffs\npassiveness\negarr\nautarchy\npbcc\nzurn\nbeyerle\nguderzo\npfic\nbredwardine\njiping\neulogia\ncheniere\nchionodoxa\ncraighouse\nundistinguishable\ncelotex\nardfern\ncalco\nintersexed\nlaganside\nboehne\nelfreda\nradojko\nquoile\nhertsgaard\nbitove\ngrinten\ntadepalli\nludes\nginder\nallensbach\ncaldarelli\nsmeltzer\nspiritless\npennario\ncoile\ndesharnais\nyelloly\njervey\ntimmie\ngetzel\niraklio\nmultilaterally\nbruener\nlugner\nboskin\nc
opaiba\narikan\nhexamethylene\ncompatibilities\nnadelmann\ndromintee\ntimani\nbasinghall\nisrotel\nmahnkopf\ncelades\nmaimaiti\nparirenyatwa\nbrannagh\npithily\nhareth\nvenerini\ndecendant\ndaghlas\nwatchfield\nmiscasting\ninstitutionalise\ndutko\nknic\nroselee\nsalans\nparticularist\nyordanis\nyuniesky\nclosedness\nscotchgard\necolabelling\ncanjet\nstenner\nloher\nmidatlantic\necharri\nlentos\nasjha\nbrielmaier\nhospitalize\nwolmark\nunchartered\nkhoresh\nbronchoscope\npalocci\nresop\ngoggans\nspoel\njhd\nwillebrands\nkorzen\nwineskin\ncleanout\nprescreening\nseona\ntollard\ndelelis\nhunx\nsmoggy\nbettington\nvelchev\nfilz\ntryptase\npokesdown\ntablers\nlembongan\nkamie\nritournelle\nbaulks\nmoshier\nstather\ncombustibility\nnece\ngrieshaber\nlungley\ntenpa\nmaunde\nurip\nrondels\njessiman\nrkn\nhaefeli\ngatell\ntentpole\nmartearena\npimmit\nshorwell\ndorment\ncomuzzi\nmaplins\nmilc\norandi\nneuqua\nrakhmonov\nsebbe\nchronopoulos\nunibail\nmasto\nstoutest\nmotamedi\nknutton\nvatter\nranchlands\njingzhi\nargumenty\nastec\ncirac\nmihangel\nworleyparsons\nbernhardsson\nmellissa\nfredin\nogrizovic\nhorreur\nstormville\nmellotrons\ngenowefa\nbattre\nboultham\nmovenpick\nmohamadou\nskurnick\nsautee\njomphe\ngaube\naaii\nkagin\ndechter\nabrahms\nburela\nbwiti\nkilve\nhayrides\ndocumentry\ntrivialised\nxva\nfontainbleau\nnaimo\nmaciá\nawtar\npropoxyphene\ngalic\nmaranon\nstripp\nbarari\nnonsmoking\nfatmi\nabriel\niiu\nsunrider\nciis\nmaruk\ncdex\nflyglobespan\npowerplus\nmarkwart\ntornabene\ngeerlings\nmidgely\nresynchronization\npullapilly\nmacspeech\ntherms\nmwambutsa\ntodesbanden\ndubhe\nuhhhh\naviod\neiermann\nligairi\ncils\nfloozy\nniedere\nboundry\nulstrup\nglenmoor\ncaseyville\ndrongan\nkitzbuhel\ngarcias\nlacerte\nquinsy\nguilsfield\nnewfest\nlaili\nsidang\nomelek\nschwartzkopf\nryzhikov\niovan\nszafran\nturbogenerators\narli\ncomfirmed\nzoshi\nbransten\ngobabeb\nvvi\ncnns\nmomart\ngaomi\nsoftmax\ncondry\nsuhaim\nrodowicz\njozias\nlittlebourne\nmeribel\nmonopolisat
ion\nfacilier\nfasola\nleever\nportera\ngrassmoor\nlievin\nepimedium\nabecedarian\ntzahi\nbankamerica\nluze\ndaubing\ntryline\ngrandaughter\ncombourg\nbinamé\nberria\nrongshui\nquaytman\nmetabolife\nabdulrazak\nanakena\nupconversion\npalepoi\nupshift\nnatanya\nslappers\nbowburn\nsirvent\ngautieri\nwaterbus\npharmacoepidemiology\nharush\ncotterman\ntheh\npretention\ndipascali\nrhodin\nhealthspan\ndzhioyev\nsaveurs\nsitthichai\nwestlb\ndergue\nferrazzi\nliudas\ncelimene\ncataleptic\nfaru\ncedewain\ndallison\npursuaded\nyorongar\naite\ngramscian\nsludden\nmehdorn\ndyrosaurids\nphenomen\npurpusii\nneuropathologists\npdss\nduchscherer\ndevy\nkevo\nphoonk\ndangel\njewfro\nmindstorm\nattenboroughii\nnitshill\nreporte\ntiefenbrun\nlisbona\nwindtalkers\nqamber\ngiclas\nhomaged\nrbgh\nshiquan\ngnv\ndiscman\nforcefields\nwarson\nasbarez\nsublicensed\ntroshev\niachini\nmakutsi\ncurlies\nduprees\nhershy\ndujarric\ndauch\nsimione\npedini\nbrizendine\nbongi\nstaaf\nashapura\nmistick\neaskey\nelenita\nmoayed\nglauberman\ngrzywna\nkagwanja\nespc\nchemises\ndolkart\nvannoy\ndulaine\njanifer\nkamajors\nstompe\nhemy\nppta\njenoptik\nsaqlawiyah\nmellowness\njaua\ndesigual\ntakas\npanish\nmuguerza\nmarinucci\nobies\njannine\nbotai\nintracom\neyeblink\nneurotrauma\npolyaromatic\nkibbutzniks\ncongealing\nbakonyi\nflashbulbs\nmeldreth\nshaviv\nhexam\npositon\ntrippier\nwarfel\ndorazio\ncorriston\nquinsey\ncattier\nobana\nmacnabb\nsoreq\nlukensmeyer\nmisdirects\nfickel\ntouil\ncalda\nsuang\nrallier\nsouleiman\nshakiness\nluxuriance\npavarini\nweensy\nalzahra\nzisser\nkarachaganak\nmeacock\nnibblers\nwbx\nkazakevich\neconômico\nstuhlmann\nfibrosing\nhashemzadeh\nsaliou\nrolltop\nlogsdail\nakepa\nnitpicked\nmerzbach\nagonis\nsandborn\nkrupka\nneumont\nguttersnipe\nkeycards\nthunderbox\nprosthetist\nnunzia\nhonua\nkisor\nskydeck\nnobert\nridlington\nvoaden\nbauermann\nbechtolsheimer\ninhs\nruinously\nwendys\nglenholme\ndreiling\ncasserley\ntableside\nheven\nfruitarian\nstratifying\nfarri\nausaf
\nsewp\nunauthentic\nreinard\nwidenhofer\npetfoods\nsweetshop\nzouerate\nwitchingham\nfliehr\nbandeja\noconnor\nnanogram\nelectrosurgery\nkimelman\nmorohashi\nsprits\nsavater\nrechargeables\nmeroni\nmelotti\nokapis\nhelendale\nuplawmoor\narcsight\npompea\nsignifican\nkjus\nthoreson\nzarou\nosakabe\ndongho\nkampongs\narvato\nastrobiologists\ndittus\nzarins\nmonterrico\nmycle\nmwaa\nceep\nunanimated\ncodeblack\nstomachaches\nllanllyfni\nnympsfield\ngoedert\nlevoir\nukrainska\npanathlon\nhandsaw\nrauth\nsemblence\nmetanarratives\nlacavera\ngwyndaf\nappeaser\nbruenchenhein\npontification\npensively\nvlasák\nmaximizers\ndenninger\ndoerflinger\nqawasmeh\nborovec\noyeyemi\nholidaymaker\nzawodny\nportell\nropery\nenourmous\nsickafoose\nzambar\ntwentysomethings\noluoch\ntollison\naframomum\nscheie\nrhosymedre\nyounglings\ngeocaches\nredroofs\ncashdan\nfloatable\npyrocumulus\ntraid\nptin\nkneebody\nwinsett\nswinbrook\nellson\nachkar\ncalata\npoillon\ndendroctonus\nwhitsand\npharoahs\nhinterglemm\nsarvo\nkufri\nkundo\nplatonically\nnorthbay\ncopnall\nfmcs\nsalhouse\nmaghen\nligouri\nwooderson\nskateistan\nidlout\nraisner\nnwk\nweev\ncristophe\nstasevich\nkilleavy\nbiobutanol\ndiwaniya\nfillo\ncatacutan\nnavigli\noxborough\npropylthiouracil\ngaritano\nsokhna\nohlund\naissatou\nceaa\ngroupuscule\nifema\ntchmil\nbasini\noperationalizing\ndisarranged\nmtrs\nchanny\nbleedings\nhajaj\nbiotype\ntiet\nstelmakh\nbaumkuchen\ndowntrend\ntrellising\nbookbag\npublichealth\nupholstering\nsweatin\nrallys\nschrenker\nrlw\namikam\nbarnardos\nshanin\ncaneel\nintergrated\nschuil\nimmortalization\nkokua\nwieting\nwladimiro\nbigonzetti\nkekich\nchasanow\neatable\nsilverglate\nfleenor\nnoogie\nmulatos\nhouhai\nhoagies\nraichlen\nsulaymaniya\nundercovered\nfalsly\nlobero\nparky\nsimiliarly\ncherien\nrhessi\ncossham\nshainman\ntreffinger\nslapshots\ntmap\niseas\nalecko\nbadgeworth\nalberobello\neirgrid\ngemar\nmotezuma\ninviable\njamstec\numca\nlapushchenkova\nlongfleet\njamiya\ntrabants\ncnnic\nbulp
itt\nearther\nblencoe\nteardowns\ntangalooma\nwonderkid\ncullet\nlliwedd\nsifiso\ntoben\nballykeel\ndrendel\nfocis\nslenderer\nemotiv\nkurwenal\npakoras\nmekele\nprominente\nnonorganic\narlauskis\nvandereycken\nzonderland\npresbytère\npreppies\ntitulaer\nmicroblaze\nludovick\nhawford\nshowery\njinpa\nyomp\nkaster\nshizeng\ndison\nbonifant\nrafetus\nbalfours\nlrit\nimperils\nchubbs\noversleeping\ncompsci\nspacewire\nklaudt\nturpen\nbiohazards\ndreifort\nallthingsd\nbeardo\npyritic\ndamásio\nshasa\nvoicestream\nsuperhot\nbogachiel\nnappier\nmoob\naccretes\nfinancers\nradulovic\nincorrigibly\nbarenholtz\nmalsor\nblazejowski\nulimo\ngurgles\nolivarius\npytchley\nperficient\nlovestoned\nabbondanzieri\nthoratec\ncloten\nauyuittuq\ncounterpath\nlargly\npsbs\nlodh\nlajitas\nmieu\nbihani\nmonot\nweissbach\nmoskit\nshirlie\nhilliker\nscottoline\nabdussalam\ncothill\ncrossenny\nbiodesign\nchunkin\nclubgoers\nvogelaar\nbookmen\nfeminity\npicosat\nwirsing\nmamphela\nivanisevic\nswigart\nzuiderent\nnigp\npermal\nlammerding\nmgahinga\nlornie\narcsoft\ncorrosives\ntotalitarians\nnause\nnalluri\nlorino\nfidencio\nzizzi\nwupatki\nkroo\npermament\npieties\negoi\ndieker\npadil\ntricom\ntrefgarne\nsidelocks\nkroesen\nfreebee\nacheiving\ntreglown\natheer\nwellcare\nlupeol\nbhajji\nhatmaking\nkhachik\nfindern\nandreopoulos\nporgies\ncassity\nslask\ntrebly\nphonepayplus\nferrett\namieva\nghostlike\nrevelries\nmarrel\nfance\nlekgetho\nhurll\npheobe\ncrianza\nlaikin\nscauri\nmadurodam\novergate\nyalumba\nbasmanny\nhammertime\npricilla\nottenberg\nconyer\ngappa\ntmcc\nmcelfatrick\nanthonioz\nbeautyberry\nparlá\nrauhihi\nbrezsny\nlamara\nspuhler\nvivisector\nnhlapo\ntraceries\nbrunches\ndecoutere\nstreeters\ntadamori\nrubenesque\npolychronicon\nabouth\ntarangire\nscopwick\nloizides\nreihan\nkial\nsteeps\nkingley\ntrauger\nrugunda\nsonagachi\ntomenko\nunbox\nvanburn\nwonderlands\ntomasik\nfilevault\nporingland\ngunderman\naverageness\nbombsite\ndibya\nlakos\ndobber\nenfields\nislamize\npotoroos
\njoggling\nkippy\nazera\nvallow\npornos\ncolino\nfttb\nmsrc\nmatze\nchevrefils\ncelf\ncappucino\nbarkerend\nlaodicean\nnaide\nrajak\nbmhs\nonen\nnajla\nyatom\nrxt\nhirons\nazziz\nrudong\nplodder\natba\nsnepsts\ncraigen\nacquavella\nworsts\nscrinium\nmatsuhashi\ncarhop\ndrexell\nwnbf\nlauby\nmlotek\nbalmat\naptt\ncrassly\ndarrett\ncaraco\nlucimar\nsandmen\nliemba\ntitanoboa\nefland\nmegalyn\nlalmohan\ncinderhill\nshaikan\narbitrageurs\nrogatory\nparant\nhesher\nboomsma\nektar\npompeians\nwanguo\ncamerawoman\ncurers\nbigotted\nlehra\nhomeplus\ndisembodiment\npacbell\ncyclos\nseym\nmoucha\ncerridwen\nmomotombo\nbouret\nnanduri\ncelesio\nbisnow\ndannah\nhandpicking\nullamcorper\nkathryne\nscandrick\nnorne\nblares\nmagnaye\nburhans\nathanasiu\nferriera\nnhanes\nsajous\ndarwent\nalvarion\nbraccia\neekhout\nstacher\ninaudibly\naquascutum\nmarchwiel\nchaverim\nflowertots\nbaljinder\nwissey\nmigiro\ngibbets\nsayong\npossiblities\noshchepkov\nkinton\nravetz\nhadfields\npotatos\ntampion\njestyn\nmohatta\ndolts\nwenjiang\ndanzey\ngundogs\nscalds\nraske\nlippspringe\nittai\neastcott\nzylstra\nhollingbury\nnovlene\nsentimentalized\nflystrike\nhandleman\ncomprimised\nmoviebuff\nappropriators\ntondar\ngoure\nraskar\nlibeccio\nmontchanin\npetrobrás\nparadeplatz\nsimco\nvwi\nlinneman\nbrayson\nturweston\nliszka\nomnicare\nmerzak\nsermilik\nfisser\ncathro\nbeyster\nsnoopers\ndylon\niwpa\nleppings\nhurles\nmichiganders\nhiney\nemtman\nhypochondriacs\nvillasanta\npenninsula\neuropeanised\ndouchebags\neagen\npattisson\nskimmia\nroesner\nphippen\npavant\nenfinger\nwolfberries\njessopp\ndahmane\ncochi\nyasny\nrkf\nforbearing\nsphero\nluthar\nvolchkov\nbishopsworth\nfalvo\numas\nbehnoud\nyinger\nvenini\ntransfigure\nlongueira\nteavee\nwingates\nsukhinova\ndoumar\nduette\nvirada\nkerelaw\nuhmm\nvachani\nearthships\ndarunavir\nmuthulingam\nsonkar\nmarchy\ndriftnet\nhefferan\ngeml\nepidauros\ntakeoka\ntitwood\nvodcasts\nmontario\nbornet\nrowner\nshervington\nsommerstein\naktyubinsk\nbarding\n
sikhanyiso\nassuages\ndietzsch\nhawed\nramnarine\nlifeclass\nresumptions\nbookexpo\nmyroslava\nhlady\nelettaria\nweatherburn\nrockii\nstracciatella\nhurleys\nmusicstation\nrawly\nteinturier\nashelford\nmuxes\nsongtao\nblastic\nfrancoists\nfarangi\nschappell\nmostarda\nomapere\nmctwist\nsheerman\nshimba\nbinnington\neffe\nharles\nmarrese\nshimpi\ncassanova\nhalbeath\nwenban\nosteocalcin\nbuyin\nrivulus\nscrummage\nemodi\nlouviere\nmuaz\namerasinghe\nsatel\ndemocrazy\nigbp\nyaichi\nrelearned\nreverbed\ndpz\npolycephalum\nosac\npaetz\naktenzeichen\nnafh\nkhoroshkovsky\nomelets\nzhangmu\nleamas\npandak\nqud\ncimm\ntrongate\nmidways\nmunyaneza\nattemped\nmccoig\nywc\nperegrination\nunipolarity\nslobs\ntetragon\nprusak\nsportsbet\nfatsia\nchangho\ngasteen\nczeisler\nsarubbi\nlacome\nsaurischians\nokecie\ndouby\nlochead\npromesses\ntatafu\nfitments\nhattam\nneifi\nawra\nlegambiente\noilskin\neprocurement\nhostilely\nwilmorite\nconvivio\nwingspread\nsabaot\nwattson\ndazeley\nchetry\nswade\nbacos\nbernette\nhairshirt\nalavanos\nmushahid\nhuichon\nbotwnnog\nbarnz\ntacci\nunfreezes\nhoder\nmuthuvel\ntoshihito\nitms\nhodgin\njrtn\nmesnick\ncittaslow\nbokma\nvicissitude\nmajmudar\ntheatreland\nanisul\nulich\nprj\nhollon\nbcrs\nlyminster\nporphyromonas\nqes\nschoolbus\nmalkeinu\nbartee\ncowbit\nconveyer\ndezza\nsteart\nplinky\nackah\nstupefaction\ndanwei\nackner\nnuweiba\njfx\nmccreanor\nbraslavsky\nramipril\nattique\npeterculter\ninciarte\nvidic\naccordant\nencumbering\njochim\nfallouts\ntskhinval\nmugla\ntonghai\ngallu\nhausler\nbroadsided\ncommandaria\nbrogeland\nmedero\nyipee\nunaggressive\nshakedowns\ncsst\nruttle\ndeblicker\nfoinavon\ncppr\nmatsuev\nostling\nblattberg\ndocusign\nundemonstrative\nlensbury\nbänziger\nwoodbank\ncontry\ncurvier\nunfavored\npatronizes\nbolometers\ncaterwaul\nheadends\naddthis\nstrateg\ndyspeptic\nmalandrino\nlamprell\nbolze\nwhatchamacallit\nbulleting\nhaggas\nfranza\ntortor\ntoensing\ngressier\ncisi\nkirkcolm\nkaiane\nnmrc\nmahasin\nmanoncourt\
nwoudn\nburkan\ncarolene\nabsenting\nrondinone\naniva\nromanko\nvukicevic\nrets\nyakhouba\ncumberworth\npremacy\nstenius\nirresolution\navanzi\nmegginch\nwolkers\nbloglines\nbenzos\nfaurecia\nalverthorpe\npenglais\nbrfc\nfizzling\ngalarrwuy\nfassberg\ndeclassifying\npinkaew\nhaensel\ncattouse\nrizeigat\nleftish\nmuigai\nsupermajorities\nbadenhop\ndimasalang\ncvca\nskoyles\nbarkhor\nredrado\nnagourney\nclubmate\nwhimpers\naircar\nicewine\npummelled\nwhitelee\nmunafo\nmohanna\nelmswell\nkaratzaferis\nelscint\nkroons\nprimesense\nadelsohn\neidan\ndhusamareb\ndomenicali\nkurfirst\nxct\nsetoodeh\nracaille\nmustapa\nguiral\nantidrug\nhoggarth\ntheodoropoulos\nronaldshay\ncounterspin\nserfaus\nisssue\ncadnam\ndrewermann\nteesport\nredated\nkiess\nchionoi\nmilhazes\nveley\nsnafus\nskep\nchemoprophylaxis\nmassell\nsquareness\noutsprinting\nvelikhov\nnexhmije\ntuppen\ndoyal\nkapka\naossm\nbilderback\nmuddier\nembi\nuthayakumar\nmccheyne\niluc\nmaturer\ngurland\natention\nschloemer\nstellifer\nnals\nvitalize\ncahoot\nscharping\nassesments\nbankamericard\nthemsche\nwoodlesford\nyujiao\nberal\ndemonologists\nhydrick\nbleys\nguiping\ntushita\ngotomeeting\nchijindu\nlatet\ndandified\newington\nkarimullah\nukcs\nallsports\nmccomish\nchlorinating\nmuslimah\nbricknell\nilgen\nliitle\nsemerenko\nnirenstein\nmilnsbridge\nabdulkader\nbront\nkizawa\nfriedli\ncathe\npalatin\nporong\ngfz\ngaratti\nanderes\nnatella\nadenike\nommc\ncamak\nschoolman\ntyin\npenlight\nguenin\norientating\nletestu\nclise\nsaure\neverytown\npukaskwa\ncryptochromes\ntaihua\ngalekovic\nlarney\nkacin\nrepurchasing\nscusa\nmodifed\ndecended\nnewsreading\nnefes\nknome\nanthropologic\ntransformerless\nkenroy\nmcdavitt\nelwa\nchasubles\nethylmercury\nbublitz\nmetrohealth\nodama\nmorebattle\nkalkhoven\nnashir\npencey\nsondheimer\nrooneys\nantidoping\nalela\nnsba\ntipner\nleleux\ncabbar\nvocalizes\nmassetti\nborgella\nveracious\nanagnostopoulos\nrollable\nmarfuggi\nshelnutt\nabhorring\nviticulturists\npokery\nstanno\nflus
her\nmaymana\nkerl\niyke\nralitsa\nlimaj\nwikstrom\nnaevo\njern\ntawake\nintellegent\ngaebler\natat\ndumelow\nogx\nambion\ntriffin\ntwiins\nyouthbuild\njasperware\nrauer\nwhissendine\nmcleods\ngerobatrachus\nstefanescu\nsifters\nmilosavljevic\ndenize\neyking\nrauzzini\nmarksbury\ndingel\nwholegrain\nshehade\nkilson\nlodin\nmikesell\nsherzad\nmedifast\nmulisha\nhemiscyllium\nfadell\npapermaster\nvuguru\nepilepsia\nhardye\ngibsonia\nhaeberlin\nyankelovich\nbacklashes\nsupes\npetushki\nandreikin\ndyllan\npratical\ndeveney\nisses\njidda\nlanugo\naaldef\nnabozny\njerseyans\nknockando\nvittatoe\nrummikub\nnihonmachi\nfuzi\ngillenwater\nfinanciere\ngakayev\narraignments\nbassat\nmedicinals\nphilospher\nvandermeulen\nconsignees\nsawhill\nsjostrom\nlegitimates\nbugajski\nsangiovanni\nmuseumsinsel\nhyperventilate\ngenser\nbevers\nbowheads\nlrip\nsahal\nmcgorman\nbrissie\nbeefalo\npalpate\nbluecar\nagrifood\nhonigberg\nspeling\nbayefsky\ntangas\nejiro\nnigsberg\nkillelea\norbiston\nderwyn\nweatherstripping\nberrynarbor\ncatheterisation\nguadet\ndeleeuw\nfortepianist\nrulling\ndoulas\nwfed\nnashes\nbabyhood\nkunpeng\nikitsuki\nbogden\ngerberg\nucea\nambiga\nrucking\nmelda\nkendy\ndownshifts\naigu\njundt\nnavtej\nshakhlin\nstereoviews\nnottie\nvuvuzelas\ntramontano\nfoulbrood\ngaito\neatman\nlumbala\nmboma\ngymnich\nmotoryacht\nchollerford\nhasl\ncauter\nkeading\ngelle\nmommens\nmangane\ncontroversa\nbarbone\nryozo\nuspis\ngarbowsky\nexcelente\nrealmente\nbedfellow\nnorooz\nfritted\nafnan\nfelly\nmooing\npyrotechnician\nuntiringly\nhaitong\nbatoka\ntoileting\nexacttarget\npractioner\nmfon\ngranvia\ntrzeciak\ntreament\ncaijing\ntremeirchion\nwitman\nkadewe\ncelebutante\ncucci\nsundried\nsocol\nruvkun\nmadaleno\nbanahan\ngoldbugs\nmovment\nparklike\nressurection\nlecs\nuerj\ntechni\nkoly\nhoseini\nunconventionality\nbarbeques\ndawdle\ncidp\nwalthour\ncornstalks\nkalakuta\nvcb\nwiththe\nswanpool\ngruffly\naerodynamicists\nbumming\ntheladders\nsignifica\ntobback\ncovingtons\nlinamar
\npaveletskaya\nredlake\nbharose\nzytel\nciclovia\nslavkin\nrothay\nmiddlemas\nbootee\nhotdish\nverron\nintelligenz\ntarara\nirom\nsantiesteban\nbolens\npyrithione\nsubconjunctival\nchachere\nshpend\nlyst\ndebriefs\nbluecrest\nkuppusamy\nfikes\nchallender\nguggenheims\nchongde\nmuzito\nworng\nbeilenson\nzuhur\nllangedwyn\nfarcot\nkinkajous\nfebreze\nshosetsu\nkathar\ncashpoint\nrounsevell\nsouki\nintercapital\ngodbolt\ncitarella\nraqib\nnguyet\nexsultate\ncentrifugally\npostlewaite\nuwak\nlongnecks\nundetonated\nmitgang\nbawar\nadamses\nfobbed\nfirepool\ngrimentz\nblikkiesdorp\ngeophagy\nkatende\nhoneyball\niadc\nlinkshare\nstandeford\nonstream\njouf\nvampyrus\nguelaguetza\nrenationalisation\nneorealists\nmeakins\nxitong\nhauswirth\naapor\ntituss\noutworn\nwildeboer\nshusett\nsunswift\nbehoves\ngargett\nwillerslev\nmamère\ntenku\ncraftswoman\nengraftment\nshamuyarira\nlefebre\nmanenti\ncentanni\nbelleayre\nintenet\nucst\nwscc\ncatano\nicba\nbroon\nkaralahti\nblackmans\nkcas\nwerin\ntaroa\nnapitupulu\nwagland\ncothren\nsaprolite\nsantacana\ntolvanen\niliushechkina\nchandrakanthan\nyosu\narquilla\nbellars\nmirikitani\nciofani\namytis\nlambri\nanagha\nveerabhadran\nusmonov\nmunitz\ntrosch\nnasos\nesquer\nhrelja\nbewsey\ngluts\nkiyoto\nbennifer\nbramerton\nbradbrook\nrealigns\nabkhazi\nmurakoshi\nrichelson\nbarkeley\nzwentendorf\nirvings\nlondono\ngreenfinches\nghriba\nvallorcine\nkmic\ngarnerville\nrossos\nfrobel\ndinamarca\nmohring\nbezerk\nbrookhurst\nbaranes\naariak\nreighard\ndebello\ngalyna\nmulkerrin\nbrebis\nzhengyu\nhairmyres\nkittisak\nocfs\ndeanston\nkingsweston\ngazar\nprospeed\ngeria\ngestas\nlaini\nmadlala\nfasters\ntruog\nbadrutt\nbiocapacity\ngoldcrests\nconsiderately\ngilbertian\nbotin\nsanparks\ndovico\nostendorf\nhempseed\ndelerm\nhorseheath\nqomolangma\npacentro\ncouso\nbhelliom\ntruglia\nharyono\nskunky\nzulkarnaen\ndarkland\nvesconte\nnaite\nminwoo\nsalteri\nzimbler\nicklesham\nshalfleet\nconductress\nvagelos\nzanies\nemneth\nhaldanes\nfiterman\nmo
rskoy\nbantered\nstojkovski\ndadá\nfalt\ngugliemi\nchilblains\nlownsdale\nrainieri\nclaar\naceituno\nstangroom\ngritter\nelarton\nmallomys\nrobroyston\nkelvim\nouside\nerron\nshernaz\nnonrational\nzaino\nwhyment\nsedighi\nalmsot\nmendik\ngoodnites\nnandos\nmicromirror\ncroshere\nfagatogo\nheidrick\narunga\nscarer\npikoli\nbumrungrad\ncolourants\nasna\nradiotelescope\nkidnaping\newy\nbishton\nagustinus\nrevivified\ncste\nrooden\nshamsa\nllandaf\ngagg\nmuthill\nmadiwala\ndhgate\nbercu\nkhoshbin\nkones\nmirzai\ngaier\ndisaggregate\ngarcons\nshavonte\npartlet\nkrissi\nchistopher\nnsam\nkaballah\npartialy\nmushrif\neasler\nforams\nmcrobb\nosmer\nstuffiness\ncodeshares\ndemutualised\nnosal\nhamelech\npritpal\nkippe\ngemi\nkentlands\norlow\nhowletts\nmappable\nindiscreetly\nhillarious\ninfomania\nexeo\ncolbrunn\ngentilini\naigo\nmouthwatering\nmetoffice\nalinejad\nbrockweir\njosefino\npellis\nringmasters\nstilwater\nyesco\nsoering\nmuntafiq\nthreescore\nreboux\nhandgrenades\nanatsui\ntaborsky\ntrehafod\nxiaojiang\ndonaghcloney\nharpooning\ndoodlebugs\ntomintoul\nbuergenthal\nmamabolo\npressoir\nebata\nrimonim\neinav\nclaimes\npolartec\npaauwe\nbartletti\nnettops\nthumbscrews\nadario\notedola\nhvalur\nenchantingly\nmontereale\nsanc\ngroundfloor\njezzard\npeltzman\nduvoisin\nbukuya\nsexaholics\nnicolaos\npmas\npacos\ncommunigate\ndsrs\nswennen\nkamyar\nzhaozhong\ntravelall\nkzf\nenvolved\nfutureproof\nwhimsies\njessell\nwaldren\nrestitutions\nmonji\nfriockheim\nlecount\nbacons\nwbv\nreignition\naleksandrina\nselectiveness\nkmiz\nhimat\nuncoiling\ncarbofuran\nquso\nsuperweek\ncavort\nannino\nergometrine\nvolanakis\nromsley\niwant\nhalifa\ncilgwyn\nmuzenda\nsacharow\nunsay\npatternmakers\nrichboro\nkottkamp\neurazeo\ndudinskaya\nlightwell\nlaskos\nghisolfi\nanixter\npirgs\nhafa\nliebscher\niraschko\nshetlanders\nncts\nsonnenblick\nmorenos\nhalamish\nplayphone\nblandi\nfurer\nhinnerk\nkleinsmid\nmérindol\nkirchschlager\nfainlight\nprotegée\ngvaladze\nsouthdowns\nbransgore\nsuga
rcoated\npraa\nminiaturisation\nalfonsa\nintralot\ncostless\nenslen\nconstructora\nniwari\ndesecrates\nenwave\nganiyu\ncadnant\nqis\nfernandopulle\nindistinguishably\nomilami\nplonked\ncoalman\nglenfarg\nnjoro\nniedenthal\nriggen\ncrouter\nanthrozoology\nceniceros\nuln\nhevener\ntraigh\nrogal\neslamian\nbeaus\nbuehrig\nwygant\nunboiled\nlidgate\nnurturer\ngtcs\ncranfills\nkwast\nbastie\nsannoh\ncunin\naxlerod\nreisel\njebara\nbagaria\ncointet\ndestierro\nrimel\narefin\nodyssean\nzelotti\nwijdenbosch\nbught\nratchadamnoen\nappeasers\nkpsp\nkwarta\naccetta\npegatron\nfurball\njamme\nsqueek\npnes\nboattail\nclimpson\nbanderillas\nthumbstick\ndesecrations\noim\nkemel\ndocusoap\nrigth\nfirebricks\ntulipwood\nhairband\nadeona\nlorelle\ntremorfa\nastrom\nkozub\ndragovic\nmatco\ndorschner\nstarlike\nqalqilyah\ntweeners\nbrohn\npostdoctorate\nmicroturbine\noutmost\nastarloa\ntrawangan\nrummaged\nkittleman\nazizuddin\nstingiest\nnaumi\ngardai\nsuchart\npekarek\nmmscfd\nrocheford\nmcswegan\nmawenzi\njucu\nmorling\nhighjacking\nbarrer\nshelterbox\ncvx\ngosberton\nheliconias\nparklawn\nkewadin\nregrew\nsoddu\nmcletchie\ntolz\ncoopetition\nsbcc\nwakeeney\ntrimpin\nreadercon\nnafsa\nnanoelectronic\nrockcliff\nkillock\npocked\nopoona\npixellation\npenlan\noutspending\ncomares\nwellemeyer\nspreyton\nsedloski\nginsenosides\nlemonier\nantebi\nbieito\nesar\nruthanne\nligations\ndordon\negpws\nrobleto\nmdeq\nrobideau\nofisi\npercona\nputdowns\nlidon\nmasilela\nstollsteimer\nvershbow\nandreson\nutilityman\nvillaine\nkanehira\nlorden\nskweyiya\nheilmeier\nlikoni\nfonctionnaires\nordover\ndahntay\nplacek\nirwandi\nhonnor\nsurplices\nzwiers\nborribles\npangas\nvalldemossa\nhamriyah\ndoetsch\nmetfield\nlobotomised\nchemoprevention\npmtct\nranunculoides\nromito\nglatman\nsnuggie\nhaertel\nrasual\nlieberknecht\nekert\nsweedish\nogled\nkopylova\ntatsunori\nphishers\ntampax\npetronijevic\nmcelduff\nengraves\numane\nmajette\nsponsered\nbabyland\nakpala\ntextless\nyoghurts\nkinyanjui\nhayrick\nvlt
s\neurest\nsashay\ndadc\ntschida\ntinkles\nmeloxicam\nsportsdirect\nvorinostat\nkirbys\nihealth\nmascardo\nxkl\nwiesberger\naudiosurf\nbelgiorno\nlilavois\npenasco\nborovac\nyasukazu\nthemselve\nnooooo\ntsend\ndariye\namebic\nshimeji\nkokkedal\nperben\nachelis\nboxcutters\ncluss\nunigate\nindur\ntesl\nsolás\nsomalians\nhvcc\ngilfedder\nadre\ninterart\nsoupir\nhaaften\nweetzie\ncritized\ncarlotti\ndisturbers\ntuiloma\nwiszniewski\ndrogenbos\nspellar\nweigela\noluwale\nvicke\nwhatnots\nbmis\nfetlocks\nmaij\naddleshaw\nraivio\naviara\npetawatt\ncalamondin\nshooing\nciste\nuntruthfulness\nlibicki\njointness\nhayb\nhannaway\nmclauchlin\nmanhatten\ncounterweighted\nkirrily\ncivilisational\nchimichurri\nmousawi\ndogmersfield\nyzaguirre\nusbs\ndeloatch\nnakhjavani\npetrila\nhumulin\nousland\ncrossties\nunivesity\nvarnelis\nslivka\nacquia\nmanab\nabdeh\nterser\nteape\nkarpasia\nheelys\nfishwives\nselfs\namankwah\nsilcrete\nwaterpipe\nfootlong\nmulero\nmaaran\nwaguespack\nschwerdt\nfernleaf\nweishaar\ntelander\nmelmotte\nyunas\narpc\nflashblock\natsma\nnyaga\nfsta\nocv\nrajbir\naldbrough\nnoscapine\nkluever\naberdares\nwapt\nkindelan\ndman\ntappenden\ntiputini\ngitau\nbertele\ndigimarc\nrabii\noversensitivity\nnoilly\ngandel\nsouthpaws\nproterra\ncorningware\nbirgeneau\napostolakis\nchhibber\nreediting\nmokdad\nstavronikita\niztaccihuatl\nsalit\nschectman\ntradesperson\nslimehead\neastlund\nhijiki\npraxair\naasheim\nfiskin\ntriboluminescence\nupbraiding\nalhassane\nkansen\nbirkitt\nnopales\nwolz\nsieburth\njazzing\nvulvovaginal\npsiphon\naccruals\nregather\ngeorga\nreoffer\nsprüngli\nslom\nmaksimovic\nshelina\nkibawe\ngoldby\nuppark\npopeater\nkrups\nelizabet\ndbj\nesclusham\njudiasm\nroseborough\nexterminates\nconjoins\nbootland\nexhange\npengzhou\nbywyd\npnueli\nvalinda\nmormeck\nlackenby\neuroscience\nrishawi\nleest\nevony\nlindenwald\nithenticate\npouha\neclectics\nchernovetsky\nbieng\ninterpersonally\nschellinger\nprotec\ndiad\noelze\narpu\ncannibalise\ngyaincain\nroddis
\nkolkota\nbesthorpe\noggie\nfilmclub\nvoracity\nliebrandt\nwowbagger\niaca\nbrunstrom\nhidary\nlarma\ncraker\ngarelochhead\nlefavour\nevro\nlgbs\nbramshaw\nwesam\nmekhala\nscrewballs\navsim\ndixter\nhashman\nridgwell\nlajolo\nhimberg\nkroke\ngebelein\ntouchsmart\ntrebarwith\nolarra\nolafsen\natogwe\naerotoxic\nwacked\ndedeman\nfolkies\nlaterly\nmycfo\nryskamp\nvangsness\nmagaret\nners\nseitoku\nbluestripe\nbarratier\npotjaman\nsourcefire\nhemicrania\nbaroch\nlmgs\nsharen\nstarfest\nmontay\nstaight\nveneering\naircom\nmakiyivka\nhuichang\nbulletstorm\ndiskeeper\nflytraps\nbreezily\nicls\narvizo\nlivelong\nmitumba\nlumsdon\ngroundsmen\nbourguet\ntyvon\ndesignline\nciclavia\ncupo\nhaluska\nbráulio\ngreubel\ncongeries\nsolmssen\njianghua\ngleno\ndisfavors\nwilll\nturnell\nkumsusan\neswar\nentis\nqype\nkogge\nbamboleo\nmandli\njaberi\nwildcoast\nbaduizm\npapakostas\naeca\nujpest\nlavel\nfamciclovir\ninsalubrious\nwisocky\nappal\njuszkiewicz\nxiaodi\ncruzer\nvadar\nbidoun\nyarou\ncaihou\nwoulds\ninsensate\nfrima\nmayumba\npreventatively\ngossiper\nuntaken\nrifqa\nlatinisms\nkitemark\nnahri\nabercrave\nbrettenham\nhaizhou\npartnoy\ngoeller\nrudolpho\ncomtech\nkarbowski\ntribemate\nwickenby\nbigest\nperindopril\naruze\nitandje\ndavidia\nhintlesham\nsende\ndaruka\nsherba\nqubaisi\nfagone\neliminationist\nseadrill\ndeliverability\ndernis\nbraganca\nguanica\nsalmani\nkennys\nbarnburner\nfyles\nremeasured\nsouhaite\ncopayment\ncowslips\nomnilink\nbernazard\nmagzine\necla\ncreetown\nsartini\nwetterich\ngernert\naqar\nbesche\nbanford\nlegalistically\nllandwrog\nclaridges\nwakaya\nconverstion\ntirosh\nrosthwaite\nchattem\nwufu\ntregothnan\nljube\nmanawan\nremarriages\nvalders\nlifebelt\nvivisected\ntfiloh\ntateh\nveneza\ncartload\nwallworth\nsbic\ngansz\npatriotica\ndishon\ncolohan\nneveldine\nwitout\ntoks\nrenau\necocentric\npallidipennis\ndebka\nsteadies\nqec\ncaméléon\nnaweed\nxintian\nhatheway\nclevie\nnalawade\nregence\nbreashears\nchada\naylott\ngandal\ntoley\nassington\nco
br\npustule\nzelensky\ncrookenden\namiriyah\nfavignana\nsmoothy\nknappenberger\nicklingham\nalnilam\nbarassie\ninseminating\nsunup\nwindman\ngrundfest\nrightnow\npascendi\noverworks\neuphorbias\nniebaum\ngiebler\nahkami\npelletized\nundersell\nholstead\nstriano\nhachemi\nscammonden\ndemircan\njobbed\nrenhold\nacculturate\nchevre\nmontopoli\nlatitudinally\nyaojin\nflophouses\ntubiana\ndennet\nfenceposts\nkushev\npaochinda\nirps\nspacs\nfantasists\nwakefields\narrak\ngrossenbacher\nrehbinder\nwnci\nmelnitz\nmabul\nhildon\nwohlford\nfavretto\ndigicert\nbryl\ninsecurely\ncontextualist\nallpress\nherrenknecht\ncybernauts\nseban\norwells\nermias\nboming\njakk\nbencheikh\nmatichuk\nrabino\nkribbella\ncreaser\ncontoversial\neditorialise\nprattsburgh\nimpaneled\nlangebaanweg\nhapcheon\npgas\nnaziha\nmadeiros\nblankinship\ncombers\nmontieth\nspokesmodels\nopenminded\nloscoe\nroutan\nlorcaserin\nkhelil\nsterchele\ngaggers\nostrowsky\najavon\ngibsonton\nonesteel\nllandyrnog\nkimel\nswartzendruber\nroas\ntamasin\nemulsify\nharcus\nsolet\ncuka\njanhavi\nrefreeze\nwinik\nzctu\nruzic\ndizzyingly\ninkoom\nwskq\nheage\napms\nsapte\nvanterpool\nrebholz\nyeghiayan\nprestidigitation\nfaez\nsersale\nbynea\ninterveniens\nbagnaia\nwattanasin\nhelicos\nfundu\nkaeberlein\nkessingland\nalaei\nmcging\nelligible\ngalitsios\nwazan\nraucously\nmorago\ntrillionaire\nbrettle\nballintoy\nbickersons\nbeffa\ntechinically\nfinalises\nunitus\ndevree\nmispricing\nmengual\nsamory\nmimara\naamulehti\ninnocenzi\ngàidhealach\ngolby\nnanofluidic\nbidisha\npukes\nvehanen\nkavo\nparadize\neera\nkabura\nsnezana\nhoaxsters\nwellingtonians\nritblat\nspacewalker\nalerian\ncracken\nsheryll\ncntf\nhurdled\namreeka\nheijne\nkahaluu\nprzygoda\nnasti\ndarba\nmuccioli\nmtetwa\npohiva\nbackdropped\nhimmelblau\nvivion\nmundham\nneukirchner\ndisapear\npelephone\nlhg\narchibong\nburiton\npluriform\nfolloweth\ncoux\npartagas\nsemidocumentary\ntanios\nmassob\nbreakover\nmicrotel\nindicies\nliom\nmaxin\nspanley\nnaïvety\nsummat
\ndevalos\nrokotov\nncma\ndignissim\ncertes\nxiaolongbao\npreissing\nmulryan\nantrix\nheedlessly\ndinallo\nnvision\ntauqir\nbelltowers\ncroham\nhackery\nscharffen\ntronson\nshtreimel\nberceanu\nbhcs\nhypersleep\nwesner\nlacta\nhowrey\nndic\ndafi\nmolokini\nkisaburo\nmassanet\nzuiverloon\ngoldschmied\nwhitlingham\ntreut\nshitte\nreichmuth\nalfonsin\neckelberry\nclaimers\nchancers\nsonntags\npogopalooza\nwarmsley\nrosett\nseeda\nobhrai\nbojaxhiu\nbenayahu\ngsea\nrsss\narzani\nharkham\ncorixa\nyayan\ncutrona\nneurocysticercosis\nasmc\nhörl\ndukovany\nsparham\nlevelers\nchoriomeningitis\nikes\nderryl\nbharwana\nkitzbüheler\nppic\nkaneez\npuds\nballotta\njots\nroëves\nananova\nkrzystof\nsoundbridge\ndialoguing\ncaucaunibuca\nnawe\nproin\nchalone\nxianyi\nporterie\nbradburne\nrahnavard\nstevick\nunevaluated\naceee\nchelsia\nmicrosofts\nnondominant\nkentro\nshihuang\njansing\nleadless\nlonard\nlivemocha\npulchritude\ngokarn\nmilvio\nfurutani\nrememberer\nedgeless\noteng\ngleghorn\nyxy\nhrz\nnamastey\nramoche\nmutahi\nakhdam\npbtx\nkagarlitsky\nbalestri\nboosler\nkupriyanov\ntopiaries\nszechenyi\npampling\njerabek\neverybodys\norata\nymi\nmcniel\nceledon\nhighfalutin\npallesen\nshadbush\nvvf\nderadoorian\njink\nfazly\nmantei\naddiewell\nkurylo\nesserman\nmonetisation\nvectrix\ncibarius\nnothaus\ntriniti\nlehew\nskogland\nelgan\nller\ndeflower\njoho\nchilwa\nllangennith\nfroemke\npowerwall\nlundrigan\nbumbles\nstucture\npetzner\nsunshowers\nhawo\nwoodwalton\nrenationalization\nlandaluce\ncoxing\nperjuring\nprequalified\nfazackerley\nwashlet\nopulus\naristomenis\nleadwood\nfichandler\ntroups\ncufi\nbigging\navda\nrcom\nsupervoting\nrohlinger\nrudell\nthorstensen\nidealogy\nchrimes\nramler\neidsvig\ntawjihi\ndisconcert\nlacaba\ncwmtwrch\nsumarni\nglobalists\nnsmc\nzanga\nmardol\nfondles\nafact\nlohaus\nsensabaugh\nredflex\nfieldcrest\nhez\nsquinty\nhoj\nmelloy\nsarmah\ndistortionary\nclist\nalvr\nkushayb\nseracs\ncloisonne\nhylander\nharibhau\nnpsg\nurgo\nxiaowan\ncaxa\nreroof
ed\ndiriye\ncbas\nschlucht\nolomana\nsocioemotional\nsobreira\nturistica\nhorsehide\ntulpan\nhomesense\npneumonias\nhigashikawa\nschnatter\nserajul\npoolroom\nmoutai\naubyns\nxinyue\nhorsea\npolenzani\nbhundu\nneenan\nsaskin\nscientifics\nfiatlux\nharray\ntalaa\namateurishly\nfillipo\nadonal\nbilateralism\nsonnanstine\nmnemiopsis\nlondonistan\nnirwana\ncassara\nbasely\ngonyea\nshipshewana\nmethenamine\nmarkeith\nshkin\nhodd\nbolac\nbewkes\ncyfyngiadau\njirina\nbarrelling\nofda\nannuitants\nelveda\nkeahey\ntihs\nwenckheim\nmurrie\namytal\nscaffolded\nyoussif\nbefuddlement\nsportsquest\nperlozzo\nseuseu\nmaddix\ncremyll\nmarriotts\nezzell\ntheunited\nchairboys\npainda\njerins\nilliano\ntoiles\ngulbahar\nheatwole\nqifang\npagli\nyerkir\nswaptions\nenisa\nwykehamist\nwannes\npeperoncino\ndiagnóstico\nenvato\ndenbury\nduross\npapandrea\nkutuzovsky\njibreel\nziebell\nkecksburg\norlandersmith\nlamark\nwulingyuan\ngidleigh\nluxoflux\ngordman\ncallingham\nchalkwell\nsimilan\nkapner\nairboats\nchillax\nevisu\naedin\necocert\ncoreys\ngemologists\nperjure\nostyn\ndecisioning\nsuperfish\noffhanded\nidahor\nwiding\nbunni\ngarreta\nsylvina\nheus\nsharkwater\nmcgilvary\nboukar\nbalson\nkokam\ncarstarphen\nsitnik\nwbli\ndogfood\nweinig\nmorici\narmodafinil\njailbirds\nunmined\nednet\ngwk\nemcs\npolcyn\njazzer\nknupp\nacies\nnddc\nlosper\nfeadship\nbarzilay\nwaternish\njieyu\ncarbonised\ntomasic\ngirths\nbeauceron\nmoayad\nkhider\nqustion\npalfi\nvirginicum\nfaini\ntonette\nlivestreamed\nwinnard\ntattie\nkoryta\ngevo\ngyurta\nbxs\nnoiseuse\nebags\nermelinda\nedsels\nbushier\nmaleta\ncheikho\nyashere\nneuroregeneration\nwedc\negerman\nkoenigswarter\nbumbler\ncaffee\npigorini\nlury\ntaoshi\nconsistenly\nyonis\nejei\nfeklisov\nsinegal\nsenwes\ngapminder\nnawaat\ntruvia\ntournedos\nreavie\ntetepare\nscissored\nknicker\nbergfors\nalterian\nviane\ninterweaved\ntsilla\nfruma\nolchfa\ntussled\nvitalizing\ndezhi\nseekh\nfyvush\nsubcommander\nemal\nthomass\nlibnan\nhardbodies\nfalcarragh\nconv
er\ngenitalis\nchristlike\nnilufer\nnemcova\nepidexipteryx\nweissbier\nmacduffie\nrightback\ndrean\ncnca\njettou\nkeskar\nhoblitzell\nandrosch\ncohran\nhajiya\njaeschke\npishevar\nwildbrain\nfoxtrots\nunfathomably\nbaup\ngregorini\nhendrerit\nafac\ntailfeather\nindependants\nskarsgard\ntioram\ngevorgian\nflunkies\nhullbridge\nshowstopping\ntreculia\naguer\nunutilized\nmultimember\npratically\nbaltas\nnavacerrada\npolpo\ntriquet\nsrcl\nyoruban\ncircosta\nbarstable\nritzema\nsembiring\nropinirole\neucomis\njagtiani\nracecard\nthimbleberry\npicozzi\ntehy\netretat\nprizing\njuico\nogah\nfxpro\nonrush\nassests\nhakkasan\nschlapp\nremunerate\nspasmodically\nheavyhanded\ntellqvist\nwadsleyite\nstarbury\nbrodziak\nhametz\nseethed\nnutshells\nfrush\nprofootballtalk\nseatrek\nmalzeard\nmorelock\nshipbroking\nmoonlet\nfairline\nwladfa\ndjalal\nsundrum\nzern\ntornell\nkiuru\nkeema\nlabradoodles\naford\ngalardi\nmarabu\nbrezis\nobiols\nmuirs\nintimidations\nabete\ncirelli\nrivm\nskott\ncomputerware\naysen\ngezeichneten\ncufflink\nkeycorp\nrehobeth\nnametags\nissacharoff\nmadslien\nlaurenzi\nvgas\nrouler\notoro\nyamashta\njibson\nhults\nawkard\nweatherup\nalic\nrilee\nsukup\nubilla\ntinkerers\nbaraz\nvermeers\nsedins\nmaurward\nissels\nmanbearpig\nlathwell\ntrethomas\nofttimes\nwistaria\nexcelle\nnovinger\nnirc\nbhuleshwar\nlizama\npowderhouse\nreinfried\nmunsterman\nusiminas\namrane\nhimelstein\nradiocentre\nmanoug\noritse\nsejal\nhesmer\nneroche\ncalmore\ntoyonaga\ndaneshmand\nzyxel\nbiohazardous\nbatterer\nhooo\nfascher\ntanter\nstrongin\nsnuffs\nturnquest\nngeze\nkersaudy\nfunkausstellung\nchargepoint\nhomescreen\nbigne\nhurren\nwajihuddin\nsofo\nredwick\nduguet\nazadliq\nfootes\nsawbones\nfangataufa\nchalfen\nprugh\nfibrodysplasia\nkozhevnikova\nnoeleen\ngnad\nsoxx\nuhlaender\nangarola\nmikhalyov\nanbessa\nsoothsaying\nhasenfus\nweirich\nvalmeyer\nixy\nxdc\ncalibrates\nmadarasz\nmuniyappa\nfallo\nnicklausse\ntoothaker\nbrelsford\nwaunarlwydd\ngappy\nsandale\nsueing\nkicc\ned
elin\nglamping\npedowitz\nladdering\nveglio\nbuttington\naddiss\ndumain\ndyw\nhougland\niddison\nkagy\nunliked\nkipsiro\nschicker\nforenza\nbdx\nyorkies\nfusses\njakubauskas\nlanette\nkaratu\nrealeased\nstisted\ntrihalomethanes\nraitis\nyinghuo\nwanderin\nlapush\nladina\nmartinos\nconjurors\nchorwon\nspiritedly\ncowardliness\nsabattini\nweeraratna\nkettell\nelbryan\nmundaneum\ninfectiousness\nbhartia\npugil\nhafele\nmiltefosine\nlorenzoni\nubicom\nfuch\nplessix\nsajed\neffah\nblessitt\nazabal\naddendums\nbenepe\nblockwork\nptdc\ncoopt\nsersen\nhcso\ntamogami\nbiddlecombe\nanaesthesiologist\ncorticobasal\nuskudar\nnicker\nshenice\ngowned\ncareggi\nfean\nrockyou\namgueddfa\nscoa\nassualt\nnearne\nmanses\nomanthai\ncowsheds\ndillie\nnuzzling\nfsbo\nhsj\nedgeways\ndilettantism\nrahy\natomizing\nschreibman\nmillholland\nfriesians\nbovrisse\nplatnum\npalmist\nmidafternoon\nsubmittals\nscandalize\nlushoto\nshwaas\nmeaders\nliggan\nfrumin\ncicalo\nlymphadenectomy\nmalgoire\nsosnowska\nnazreen\nresorbable\nozeri\nqilu\ncassop\nflounces\nhartsuff\nmercede\nlifeflight\ndidymo\ntessenderlo\ntrochanteric\niepa\npaidcontent\njerrelle\nwallick\nnowpublic\nadvergames\nshowcaves\nquinquefasciatus\nnidar\ntotters\nbinacional\nlodsworth\ndogsthorpe\ndarweesh\nfougere\nwhetten\nrickers\nkadra\nhatless\nyoopers\npulsion\netampes\ncouloirs\nphilosophising\ngeneen\nencipher\nhaggan\nexisitng\ndioxides\nhedp\ncallixte\nifx\nfallou\njazeerah\nglenis\nkamron\nparhelion\nbressanone\nunbraced\nsnca\numile\nquance\nchristabelle\neldra\nchaffer\nlachrymae\nglocks\nnamers\nripatti\natomizers\nschnieder\ndocketed\nmastercraftsman\ndecertify\nmicrobubble\nakanbi\nscheufele\nricer\ncanar\ndroguett\npannick\nguayasamin\nhesling\nattenboroughi\nmilonas\nsirnak\nsabahat\nacli\nmatossian\nsigalit\nhlatshwayo\ndassen\nbonafini\ncrute\nwhitekirk\ngalvalume\nundervaluation\nyokich\nmitzel\npeepul\nmegaraptor\npesonal\nrecrudescence\ndelling\ncobblestoned\ndeilmann\ndomnérus\noutsurance\ngattelli\nvollhardt
\nhypervigilance\nseacoasts\nmssb\nmortifications\nrcic\nralcorp\ninciteful\ngload\nleinweber\nmiram\nworple\njandarma\nchangeups\ngratwicke\nherden\nsequestrations\nportrack\nunvaried\ngrossart\ncompromiser\ndysynni\nsukhvinder\nhadrians\nmansión\nviolens\neuroatlantic\nhassock\nquicktake\ntarita\njudaicum\nexpain\nplusses\nfretts\nzlb\ntaglia\njumadi\nopificio\nkobasew\nlicencia\nscientic\nburhoe\ngrucci\nperked\nbrandford\nbartelstein\nanty\nterho\npinecliffe\nmaneaters\ngilhaney\nbottke\nwortel\nfyf\ndematerialize\nsubclauses\ngoglia\nbbox\ndafang\nmicrocellular\nrafie\nfollowspot\nanonimity\nlivingood\natkeson\nknile\nmilitating\nleckrone\nbanaszek\nshakespeareans\nfumigants\nstrutz\nfarcet\nibagaza\nunperceived\nursache\nanounced\nwoeser\nirrgang\nsonograms\nadlung\nleventon\nbridgemere\ndistributorships\nawx\nfuzzily\nizetbegovic\nhopers\nadeo\noudolf\nbendick\ncouncell\ncaroler\nmassarotti\ntabgha\nyashvardhan\nglengoyne\nazizullah\ndevanadera\ncutteslowe\nsubfreezing\ncincinatti\nshapeways\nsaemangeum\nchhim\nboskalis\nlantus\npankhursts\nsuspicous\nroxi\nunrealizable\nsolidaria\ndonziger\nunclimbable\ngowins\nreferal\nnossel\nkurage\nexculpation\ndonham\nmarmaton\nstangmore\nmogk\nrefurb\nmitsamiouli\ndoodled\ncrossbenches\nhirohide\nhaastrup\nunpunctuated\nbohnhoff\nhollstein\nferrellgas\npalermitan\nderlis\nvrdnik\nfloodtide\nnosik\nbyrdgang\nailed\ndiffent\ncarpentieri\nkowboy\nperissinotto\nbursars\nmitreski\nsobinsky\nmullenix\nenforcment\ndiebler\nadrem\nneuson\nsubramaniyam\nabjuring\ncripe\nhajredin\ndeplorably\nbocks\nbroucek\nsturtz\njanat\ndinova\nsanmu\nslowpitch\nhdssb\nrabinovici\nfiene\nrhydyfelin\ncolao\nkrümmel\nmansholt\nepupa\nlehideux\ntjallingii\nzelt\nmourie\napalling\nneighborhoodscout\noska\nhurring\nhunh\nsudarat\nhatkoff\nommissions\nzagorec\nchadwicks\nreplayable\nhauf\nbarder\nkrithi\nmahbubul\nzajick\nmasticated\nrakkas\npnz\nwalcote\nrese\nvohs\nanwb\nakri\nschams\nboerma\nmancetter\nmccan\njuravinski\npieczonka\ninnocentive\n
hacerlo\nblists\nchalkie\nellenburg\nmalmsey\nsportsblog\nsplintery\nmehrangiz\ncoedpoeth\nunstretched\nstyopa\ntramped\nhighground\nmathemagician\nlapenna\nmelham\nhitchner\nziebold\npinklon\nnmac\nsaathoff\nloura\ngorfodi\nkunashiri\nimy\ngasified\nlagutenko\nnekschot\nsektioui\nquadrennium\nderrig\ngreier\ntarakai\ncaussin\niowas\nspoerry\nderric\nyuci\nhogged\nyawer\nfotino\nvaporub\nislan\nrember\nsisoulith\nbramdean\nhoverboards\nnabb\nzabiullah\nshahine\ncivia\nficek\ntrink\nkirkcowan\nsteelberg\ntabio\nputrefying\njerkiness\npratesi\ncocksfoot\npolyacrylate\nmacci\nbranthwaite\nyoutubes\nmievs\nbirdal\nwitzke\nmaltophilia\nwayns\nportgordon\nabramyan\nprecociousness\nmeghji\nbuchlyvie\nskrzypczak\npirom\nsatani\nillyas\nspanta\nelaborateness\nkazaure\nsimins\ndiffley\nscrims\nlakebeds\ndenevi\nborodulin\nwhitebeams\nmitchard\nchairmain\nremodelings\nveraldi\nlowlights\novercomplicate\ncanizales\noloffson\nvenmo\nvyalitsyna\nbigio\nanthemion\nwentao\nharborplace\nnhsc\nrecreationists\nzaru\nmcgreavy\nbenamou\nloerrach\nsinced\nnjuguna\nlinta\ndgap\nrespecter\nkargman\nauriana\nquaglino\nbitched\nprettify\nleogrande\nalagir\nadmittances\ncloverhill\nrayden\ndyens\nsiaw\nwagi\ndishonors\nvxx\ndppa\nkhenin\nvnl\nshorefront\nsaachi\nenslavers\ntmos\ntarrington\nguntrip\ngavels\nholdman\nnhcs\nceac\nstaios\nbuskila\ncylab\nhaersma\nfuerbringer\nhannahan\nfugaz\njerzey\ncomplacently\nmanouchehri\nfxt\nplaylogic\nhofu\nkarahi\nbellocco\ndiscombobulated\nextraterrestres\nhamren\nmaeir\nbetc\nkatemodern\ndialouge\nalmasi\nnanogenerator\nmcnarry\nmintel\nmcnelis\ncoundoul\noligodendroglioma\nsssc\nrxte\ncogar\nbanquettes\nemmanuelli\nbursted\nsheleg\nyacov\nkarleen\nanjuli\nikos\nsarachan\nicfa\nmcmansions\nmcchicken\npervan\ncraigsville\nalvero\njoellen\ngwyllt\nlainya\nwcb\nsuleimenov\neastriggs\nlamai\nvanmechelen\nayoun\nlaksono\nmalinosky\ncorsellis\nscharpf\neasycruise\nobreras\nmakhijani\ndelcarmen\nlanipekun\notash\nnaifa\nmecklenberg\nstylz\nlenderman\nberezut
skiy\ncandescent\nurdangarin\nfarfour\nattackmen\nlaiks\nplopping\nsteffanie\nkgil\nguarico\nkavuma\ntriacs\nnamoff\nsuckin\nadbi\ncharness\nraik\nsupernational\ncompleter\nunpicked\nterabits\ngeorgeas\ncoahuilensis\nakhalgori\nzeschuk\nspanger\npcom\nchantemerle\ncanoville\ncrizotinib\nujf\njidong\nsendall\nkrasnoff\nblaengarw\nhexton\nmaute\namodu\ngajardo\nneuroimmunology\nprendeville\nsheeple\nalsgaard\nshanbaug\npurkayastha\ncablesystems\nthoughful\nshithouse\nupdyke\nangliru\nsquirrely\nnewing\npsdp\nkesley\ntumblety\nhilgert\ntopco\naqualand\nsnaffles\nseanie\nbuhs\nsenchenko\ndenorfia\nrouff\ngrisez\ndobinson\ndystel\npolyoma\nkaseem\nblabbing\ncheesed\nmengestu\npollert\ncomtemporary\nbrothas\ncmta\nardgay\ndzhabrailov\nnéerlandais\nchorioamnionitis\nafriad\nhovig\nthrove\njedis\ntsygankov\ntrouvés\nbregenzerwald\numebayashi\novoids\nnovarra\nsalier\ninertialess\ngenung\nandd\nmuravyev\nslurve\nazli\ncomare\nsheward\nsluiced\nweirds\nhautmont\nboldrin\nfreiwald\nherley\nfaisel\nphillpott\nkinrade\nattrill\ngeeze\ntakefuji\ncpjp\nilaiah\nnalyvaichenko\nupis\nsibun\nprisme\nrasagiline\nmcgunnigle\naberrantly\nraqeeb\nngilu\ndaeschler\nlessem\ncopuos\ndeigns\nbouabdellah\ntwinax\nwahiba\neleftheros\nprav\ncircumciser\npoxnora\nurango\nrustproofing\nweili\naltnaharra\nhammersmark\nperpective\nmathlouthi\nislamised\ntomatina\nsermitsiaq\ncanaport\nmiddies\ngeorgantas\njeronimus\nepiphanic\nverminous\nnutbourne\ndmards\nendtimes\nschadler\nhummes\npourmand\nmadalina\nprophesised\nbicyclepa\nwestrom\nalabamian\nkhagen\nciviletti\nresoluteness\nbeliaev\nsheidlower\npaey\nprotip\nrugman\nraisch\nsawicka\nphorid\nllandygai\nsmoots\nkajishima\nsubcomittee\nmwaura\nterracciano\nhaizlip\nhuista\npremysl\nbabalawo\ncorporatized\nshochu\noleanolic\ncandidness\nbenyo\ngirlies\nsasken\nsudharshan\nbeggers\ncisma\nseatless\nitab\nkanojia\nmartindell\ncayan\nwoollam\nrollberg\nweisskirchen\nbarricada\ntransam\ndyster\nkostermans\nvilardi\nschwaab\nhitchener\nrukundo\naultbea\
nblastoma\nhcdc\nliyan\ncamrys\noozora\naeternitatis\nwithings\ngruener\nebbett\npierogies\nhesselbein\nkhardung\nshopnbc\nshandler\nnnos\ngirobank\nmakutano\nvédrine\nakhmadov\ngreeman\nlongomba\nvitug\nlalumière\nbilsthorpe\nkurucz\nallscripts\nnoncooperative\nbirtukan\nedwardo\nmepal\ngroud\njaffas\nstockford\nvrettos\nrigourously\nmponda\nriby\nnjogu\nfatties\nsubsitute\nreoccurred\nlabyrinthitis\nerus\nvilankulo\nobreja\nfondacaro\ncrosslin\nvolodina\nlonglife\nkoppang\nhudghton\nphosphokinase\ndonnée\nbutto\nmôquet\nmanucher\nvuco\nitcs\nchangyu\nvrsa\ndegustation\nplexicushion\ncouzin\nalbox\ndeloney\nprofessionalise\nsoghoian\nspaling\ncatellani\nhanoune\nustar\njarjis\nbravard\nfrolicsome\nmashie\ncitywire\nbertish\nforeplanes\nnoela\ncharlone\nmanlike\nbrauchli\nmonoski\nanaesthetized\namets\nthrussell\njarquín\nkivett\nbirdstone\nmasek\nripka\nscroungers\nallmost\nloeys\npillock\nprised\nsintel\nsundorne\nsuckerfish\nxiaoyuan\nzhiguo\nbaofeng\nbarich\nfentiman\nsivell\nludvigsson\nlawyered\nschoenstein\nedlesborough\nvermicomposting\nlatortue\nlantinus\nporphyries\nguoguang\ntcz\ncuffing\nilri\nnacionalni\nkehres\ndormanstown\nblastema\nepicentres\ncircumspectly\ntitler\nsidu\ninerting\nboesman\nsapia\noistins\navacha\nmicrowavable\nhelictotrichon\nmeltzoff\ntohopekaliga\nizsak\narenzano\nrollerbladers\ntreasa\nhimmelstein\ngeszti\nbluesier\nhedgers\ntaskers\ntarrance\nzelyony\nvrtx\ntasar\nredmore\nperfomance\ndanisco\nadulterate\nefj\ndesalting\nveikkanen\nhafeet\ndenationalized\nrijs\nyafei\ndefrancisco\nneph\nhimrod\ntailwinds\nsulzmann\ngórriz\nmasterplans\ntruthy\nheatmap\nnöel\nundan\ncolsa\nandrographis\nticagrelor\newea\ngiobbi\nterrafugia\nbalentien\nduara\nsoundtracked\nchinked\nlymphoedema\niccm\nhazeldene\nbohara\npostmillennial\ncherryholmes\noconus\nizis\nconcierges\nhuncoat\nforeseeably\nmyrsinites\nasperatus\ntargowski\narbeitman\nfued\nafflerbach\nkeesey\nferrymen\nspinsterhood\ntemeka\nnondemocratic\nfurbearers\nbarrooms\nbenflis\njanei
tes\nbsfs\nspadework\nzoomers\nreplicon\nfroms\nmaicer\nuruguyan\nzepplin\ntelecommute\nmosiuoa\nprotocells\nfootraces\nnaciri\npresencer\nwinnik\narushi\nfranqui\nbenowitz\nbrigstock\nkrout\nfleeter\nalmontaser\nlifebelts\ncastellacci\nkomlos\ndirlik\nrotelli\nperana\nwiecek\nattanayake\nsurugadai\nperdis\ntenderizer\nfitrat\ngiansanti\nbaleka\nlcgs\nplender\nclearcoat\nmupirocin\nakdt\nloveleen\nnobue\nawel\nfloridia\nawja\nminkes\ndubow\nnarraway\nsummerseat\nsleekly\nmanganello\ndresscode\njeyes\nbilotta\ntonganoxie\njalle\nmedders\nuninvolving\nnarts\npaolilla\nchronicity\nmilbradt\nrapini\nquow\npivato\nchilingirian\nbiid\nsemgroup\nkokosing\nmustaf\ningenico\nkurtiz\norny\nrosehaugh\niouri\nalmerares\nprsp\nciclon\nccpl\njibing\nravich\nderrin\nwherehouse\nkorologos\ncemlyn\npedalo\npilade\nlarrazolo\nstolarz\nwadah\nfursan\nbriston\nbluette\nevaluable\nwobbe\nfesses\nhongming\ngahirmatha\ndewall\nkensworth\nsalamu\nprebish\nmedbury\nwtd\nnurcan\nkrupke\ngriebnitzsee\nmylyn\nbilsby\nhuancheng\ntajbakhsh\nconguero\noperatically\nlaerdal\ndeisseroth\nhaldol\ngypsey\nvigourously\nbimi\nspirent\nimbed\nwashingborough\nhowwood\nhaqqi\nwienermobile\nkameel\nentrevue\ntatman\nscelsa\nhuggan\nkucan\nmociño\nattardi\nmssrs\nozumba\nnamai\nmcaffrey\nborgas\ntimelessly\nchuc\nguizar\nmoturi\ncobres\nrigamarole\nbankey\nawendaw\ncrucifers\ntabbies\nminergie\nstateliness\nlingfeng\nmesmerizes\nbinos\nspeechnow\ncellblocks\nwosner\nnarcocorridos\njonaitis\ndemerging\nneusoft\nbealeton\nhearties\nlauterbacher\ncorreale\nchappells\nudhna\nfitty\ndeleu\njagmeet\nwentorf\nlifestraw\nsprunger\nkrepp\nepec\neizenberg\nmarcoule\nwve\nnavratras\nabonnema\nhackwork\nfurtwangler\nkirkilas\ncumbrous\nasheru\ntghe\ndumanis\nleaa\nancova\nsultonov\nhazelmere\naberglaslyn\nhacham\nbunnyranch\nbuycks\nbhcc\nhardscape\nsamore\notological\nhiltermann\nkaberuka\nchanca\npendas\nsoapie\ndivorcé\nwaap\nnisr\narsinée\npooya\nzachs\npirker\nverini\nhealthpartners\nepis\ntickenham\ngfeller\nfish
less\nkeif\nkaufhold\nnewsone\nuninvestigated\nblatner\noxjam\nestrosi\ntellos\nbritart\nmyca\nhinckle\ncastrale\nmiscarrying\nsavuka\nmangweni\nodalis\nbluesport\nzhengrong\nfaraji\ngoatees\nmonikie\nmedhane\nprestigiacomo\nbarbuto\nvideomaker\nasesu\ndatamining\nsamast\nsdsa\nthrossel\nhelideck\nsilvercrest\nrijnmond\ncissna\nnetmums\ntameem\nresponsively\nvansickle\nmyrtha\nconstanti\nnuttery\nphrs\neppink\ndarrent\nmachol\nwackness\ncientifica\nsteigenberger\nccbe\npentameters\nrossignoli\ntrimborn\nwashboards\nchrw\nravell\npreslar\nsnai\nhalcion\ngardeazábal\nmicroexpressions\nmaxair\ntristian\npolitest\nregionwide\nfiabci\nhippotherapy\naubertine\nnabilah\npober\ntertzakian\ncantered\nstll\npaperchase\nburgundies\nthaugsuban\nhaon\nlondel\nagreda\nkiyonori\nhölldobler\nbamigboye\ngalacto\njubinville\nszczepanski\ntradestation\nechocardiograms\nvadehra\nmcfalls\natessa\nluminant\nkaming\nsaaransh\npandith\nfennema\ntoothlike\nfthe\nstoianov\nchoosey\ntalysarn\nbovvered\naltherr\nmadekwe\nmalubay\nkonducta\ndjemaa\npoppets\nbordenave\nrenana\ntmrc\njobes\nxtend\nnevoso\nshurna\nsses\nsinigang\nepidemiologically\nmcclaughry\nmatusz\naiyaz\ngynn\nchappaz\nnanosciences\nwolsky\nrosena\njabre\nmechoso\nturqoise\nleanda\nvanetta\nbowings\ntawab\nmousepads\ngiordan\ngieves\ntoradze\ncityarts\nopana\nscai\nmajstorovic\nzenzo\nsolters\nlammie\nsourcebits\nvanderlaan\nwallbanger\nmiskitos\nsinkewitz\nhabaneros\ncoverciano\nmouritsen\npaleosuchus\nfrishman\nwarbled\nmalotki\nzillionaire\nglier\nmiuchi\nentv\nkrystof\ndiagrid\nbatian\nintitial\nshtrum\njinpu\nletterston\npetrodiesel\nmbewe\nminicabs\namache\npanthaki\ntraumatization\nsciona\nhazarded\nhohagen\nctrip\nadderstone\ncatsouras\nkleagle\noverdressed\ndelarosa\nnrsro\nmarkowska\nwasey\nscalfari\nmoonfire\nmonogamist\nbussler\nolivea\nsarosa\nhaqqania\nnethan\npitchy\ndomeniconi\nblecha\nasfi\namne\nnicktropolis\nmaranto\nadvertsing\nglobalfest\nunoxidized\nelzer\ntollie\nsportsclub\nfitzell\nshinkolobwe\ntarted\
nsemillas\nhistorics\nkrooked\ncurtsy\nagroparistech\nkröll\naneja\ntaneski\nnway\nguaranis\ngoona\nvorus\nkandice\npatchiness\ngossops\ndesists\nmouffetard\ngenband\nradostin\ndranoff\nbemowo\nquicks\nleavelle\ntrid\ncapman\npigheaded\nsynthesizable\njawid\njesurun\nbekonscot\nbattsek\nreanalysed\ngieve\npreassembled\nyoking\nafap\nrossomando\nbenjafield\nwielandt\npattingham\ndicussing\npheloung\nfurbies\nzilda\nhitlerian\nikano\nmerchantcircle\npeckman\ngloomier\nkochamma\ncommom\nlavena\nyalin\ncinepolis\nmoorby\nerrey\nheuch\nunenjoyable\ngonfreville\nhachim\ndigerati\nlassar\neleva\ndefenestrated\nvauxhalls\nbeskidy\nhoeveler\ncoworth\nsambit\ntelkomsel\ntoshka\nsompong\nparde\ntattoed\ngrovers\nsahili\nakiyo\ndenarau\nmarinca\nsilverburn\nbandrowski\nforword\nphera\ndouet\nyonks\nfalc\npobal\nanalytik\nfouchécourt\nsellstrom\nkantaria\nbxp\nmarqus\njenda\ndamazer\nsorto\nfantasising\nsyndications\nccoi\nlcsw\nrabinder\nedgemar\nfransico\npantridge\ngwyr\nmaicosuel\nfendrick\nvisine\ntempero\nshanghang\nneons\nashuba\nwiegersma\npinniger\nsonicwall\nsamuelian\nalderville\narbabi\nshotty\nbremelanotide\ntadeus\nfoid\nmotzko\nauthorizer\nmillthorpe\nmontecasino\ngedhun\nafsheen\nmuqata\nherrity\nakbas\nguisewite\ntrinitario\nflimm\ndogmatist\njuliett\nmachsom\nbarsh\ncdts\ncabochons\nsizov\nberuti\nnegovan\nstipanovich\nglistens\ngolberg\nsplatt\nbashfulness\nmcilraith\npaoloni\nbradygames\nklaff\nzarella\ntasering\nhofbräu\nreinvests\npommerenke\nnewteevee\nshumack\nsoliders\npangestu\nmumbengegwi\ncheviots\ndavidsbündlertänze\nriabouchinska\nisaan\norongo\nsooie\nidependent\nhellyar\nlawbreaker\nmoatize\nmennella\nannamites\nfinnsson\nnooka\nverdu\nnosei\nswordtails\novertreatment\nkaixi\nmarciel\nclutchless\ncherel\nhuwaider\nvitagliano\ncurrence\nbedroll\nodpm\nweightwatchers\nitanos\nhannibalsson\nramday\nlezayre\ncotroneo\ndemattos\ncoghen\nkarmitz\nhofreiter\npenjamo\niiris\nkriete\nturetsky\nsuperdry\nfilardi\ntransplantable\ndoshas\nspitters\ndavanon\nt
erakawa\nsunwest\nscoppetta\nenth\niwin\nbartelski\nbiomethane\nyolonda\nunallowable\nholthus\ntbis\nensco\nmetaio\ncatbells\nzarrella\nhasd\nbloodlessly\nbrocail\nkilotonnes\nbettisfield\nconfortola\nsolecisms\ngoddin\nmudiad\nlegeno\nzolt\nersun\nbarrish\neurofly\nbaradei\nboardshorts\nvirgle\ntovi\ntribolet\njbf\npushkov\nhamide\neuarchonta\nturquino\nhackleton\nsegraves\nzoosk\niobit\nbumb\ndarvis\nweirded\ntshirts\nsecularly\negregiousness\nschale\nvishnevskiy\nshirty\nteigh\nklimaszewski\nayma\nsweedler\nsamoun\npaam\nkondaiah\nfoltin\npantomimic\nliche\nalfirevic\ncantarelli\nnisour\nrapuano\nmassabielle\npetrocaribe\nrevello\nsubmissively\nneophobia\ngrinded\nhren\nyassky\nshuold\ncarnally\nfrancomb\ngentlemens\nmadikwe\nnasby\nfluphenazine\ntemme\nafcea\nhoneynet\ngroomsport\nzertal\nmoutawakel\nbourdette\nspudgun\ntofel\narguers\nsaporito\nborino\nzse\nsnoose\nelati\nantiphony\nrozehnal\nmosche\nmontsegur\ncheasty\nhairstreaks\nstrassler\nhygienically\nbelenguer\nkulayev\nnatiq\nmcgoverns\nnothstein\nralphy\ndjsi\nkadija\nhashmatullah\nmindgames\nzld\nbaathification\nglutz\ncelexa\nvanderberg\npgx\nrashin\nhewgill\naints\nmorken\nefpia\nblowoff\nkayvon\nhydrocolloids\nmartlew\npervomayskaya\nmorrogh\nbonette\nupala\nkeryn\noxidisation\ntupamaro\ndenya\nabdurehim\nunpassable\nfranzke\nkleinbaum\nchamoiseau\nramazani\ntrepagnier\ninswinging\ndéricourt\nshkolnik\ndoesen\nbraziliense\nquidco\ndraftfcb\nqpp\ngfb\nprochlorperazine\nmihalich\nurzua\nomtp\neigth\nboneyards\nasparaginase\nozment\nloncar\ndemetrice\nnarisetti\nermou\npaolillo\nballadonia\nroeck\nsuperinjunction\nsunesson\ntucek\ntelegramme\nharpagon\nwendelboe\nacuvue\nkongou\nbiotechnologists\ntvashtar\npotolicchio\navineri\nderiv\nshinnery\nfusae\nlignocellulose\namgala\nprominantly\nbruseghin\ndefonseca\nbarbre\nmikve\nschroeck\npellon\nrozenfeld\nnebulized\nwojnarowski\nrapida\nprocurers\ndonkervoort\njaudon\nwhizzed\npcmh\nmaganti\npipc\nterryl\nmorigi\nkeitha\nsecondments\neige\ndesenfans\nmin
dbody\nnineham\nmonoprix\npyapon\nthejazz\nlionza\nmezzaluna\ngever\nguanzheng\ncredibilty\nmegalomaniacs\nsuperfruit\ntarasoff\nsuported\nspennithorne\nselvaratnam\ncaptivation\ncatelli\nsmerdon\nlubya\nddinbych\noplev\nfenstermacher\nkalluri\nbarach\nratiu\nprayoga\ndokoupil\ncompering\nspeakable\nlesnevich\ntaffet\nbetimes\nhensingham\nusdan\nchaupar\ndongwon\ntuataras\nlno\nibish\nrawlsian\nlundegaard\nlongpigs\nkakum\niuzzini\nbuoso\nzmi\nactelion\nbips\nchellberg\nalphage\npiloerection\napprovable\nxwe\nvashist\ndunley\nratliffe\nkurzban\noryxes\nqoe\nnafeek\nfiocruz\nkientz\nccci\nredhook\nflorescent\nfilarski\nstinchfield\nfloggers\naapm\npollocks\nkantis\ncrackerjacks\nurquiola\njasey\nfigeľ\nwathelet\neismann\nshamsuddeen\nloansharks\nhypokalaemia\ncraner\nnathen\ntriska\nlpas\namge\nherewini\nalongkorn\nfenyo\naltangerel\nrestaveks\nnimic\nechávarri\nnookat\nyomps\nspsa\nkitesurf\nantagonises\npuckette\nujiri\ncompair\nholthouse\npedrie\nflagel\nkickstarting\nloutit\nsivanandan\nflitted\nspintronic\nunroasted\nmukhtiar\nunblinded\nbrenig\nlaventhol\ndownley\nspufford\ncurre\ninnogy\ntelquel\nharrowdown\nevershot\nmajur\njongi\nalpinestars\nyajaira\nrukiye\nsaturations\nhounshell\nwoodston\nsponheimer\njailors\nrachet\nlovefoxxx\nengrafted\nagap\nkorres\nbombilla\nronacher\nbiner\nmikla\nmakower\ncofee\nkluft\nnesses\nmantlepiece\nfarse\nvanderheyden\neilene\njebi\nhuldai\nkarling\nspeedcubers\naandahl\nscrivo\naproximately\nwriggled\nshads\nbetrand\nmicrolending\nswedens\nbpx\nmatfen\nostracization\nscrupulousness\nborned\nmasoudi\nentrekin\ngrinton\ndevecchio\nmarrinan\nnoordam\nsprl\nnpsa\nkaraaslan\nyanhai\nbethersden\nbadiola\nlamfalussy\nsiphonophore\nandoversford\nllanwnda\nferragudo\nsadomasochist\nkingslake\nclaypot\nputzel\nzampolli\nbalmford\nindinavir\nilchev\nwanging\nlandladies\nsmartwater\nbrugal\ngowalla\nbuter\nbargemen\nhpakant\ngrasper\nhouweling\nchemosphere\nkumala\nsophmore\nyardville\nghardaia\nmetabolising\nzivanovic\nteleflora\nlad
da\ncaversfield\nkazaks\nhizumi\narfin\nfracassa\njorc\nlienhart\nharpersville\ngettings\nbatasi\ndehghani\nlochbaum\nhowtown\nwaywardness\nyifter\nivh\nvlv\nwullschlager\nrecons\nguanipa\nveyrat\nmurehwa\nmbai\nzadokite\ngellan\nmashonda\ndise\ndethlefs\nneller\npapachristou\nmoralized\nahrends\nujs\nnavestock\nfathali\nrexes\ngrimus\ntrollies\navandia\nlafontant\nngassa\ntyonek\nbolters\nfamau\ninola\nmediascape\nkaback\nhazey\nmollett\npresumptuously\ndayjet\nmiled\nprofitless\njitin\nmyreside\nsemtech\nsungevity\nchristia\ntarren\nbrynley\ndomestos\npilsdon\nkasliwal\njohannsson\nsiamangs\nthorougly\ntabuaeran\nomeros\nrebreathing\nmadlen\nrassmussen\ngeorgelin\nbaudis\nbeinfest\ngegechkori\ntilleard\nnonrepresentational\nframfield\nsoshy\nplaz\nbrulee\nbernius\ngebbia\ngrix\nachoo\ndoubront\nebrary\nendotoxemia\nbowdlerization\ntillingham\nschudrich\nanuzis\nprotetch\nblotters\nanoraks\nbulding\nyusufiyah\nmallahan\npapageorge\nedderton\nbenettons\npatikul\ntoumey\nbosniac\nexplaing\nwinborne\nkozlík\nmetallics\nmultiformat\nmianchi\ncdiscount\nottowa\nsunman\nhungered\nkolodziejczyk\nmusaed\nmoosylvania\ncwpt\nsynovus\nokta\nstavrakis\nsumaria\ncommisar\ncaptial\nchastening\nmetrotv\nravand\naraji\nvictorya\nclimping\nllansanffraid\nmesotherapy\nshellshocked\ngemmel\ncampaing\nndoors\ntruxillo\ntayyeb\ncharpin\nbadreddin\nancier\noscc\nbarragem\nannd\nposterized\ngaspin\ndhoon\nermer\ndesynchronization\ndansker\ninterocular\nharkonen\nfalsey\nzensho\nflexcar\nmlss\ngadsen\ndehydrator\ntoxo\nchibás\nlennig\ntroncon\nyustman\ntiete\nblattman\naberkenfig\nmasoumi\ndobsons\nkirtlebridge\nhulihee\ntayyiba\nblava\npricegrabber\ncolkirk\nnonintervention\nponderously\nkabine\ngoddio\nlmos\ngoosby\nsodded\narculus\nworldatwork\nshoukat\nchivi\noverexpansion\nchemed\nskeletally\nmarcona\narapey\nburchart\nteaford\nmuvico\nmogford\ncaie\nretweeting\nupstages\nleuprolide\nmarmorstein\ncarrall\nrhosneigr\neastsound\ncambian\npolini\ncedep\nhadass\nwaltzed\nhoness\nmeslin\n
permana\ngalvanizes\nundoubtable\nprocyanidins\nclamper\njackknifed\ndamrell\nboiseries\nwinterfield\nstratou\nexwick\nlebwohl\norlinski\ncoleham\nvaultier\nbeerenauslese\nbiqa\nreguarly\nparanthan\nshortliffe\nmichaelle\nemmentaler\nsenatorship\nmathenge\nionise\nsravan\nwoodenboat\njacory\nlecrone\nkuemper\nsangwon\nvoluntarist\nmollar\ndingleberry\nguetzloe\nweisband\ninvestissements\nrockburne\nsealine\nprik\ndavitian\nhettema\nfettle\njallo\nshatilov\nkeithville\nnapley\nshirtsleeves\nvenoco\nbroks\ngildan\nmortonhall\nelbourne\nsodomize\nanaesthetised\nnayong\nslotervaart\nphilion\nmanky\nrivastigmine\nyest\nchinalco\ntriplexes\nquecreek\nmichy\ncartner\nbandoleers\nearthcam\nbpas\nstudzinski\nspecialy\nspecint\nfaucibus\nasgharzadeh\ndematerialisation\nmoccas\nbowo\nweirauch\nziko\nmanged\nhynds\ndelehanty\nwaldis\ntiresomely\nmotorworks\nloterie\ncloudmark\nlupolianski\nhatsue\nlopakhin\nmarylyn\njalaleddin\nembosser\nusbi\npekarsky\nalstone\nryandan\nrocholl\neucheuma\nwaaaaaay\nsawma\nvehicula\nmartikainen\ncrackled\nabson\niaculis\nturgoose\nneubecker\nkelburne\nloone\nflueger\ntreeby\nddec\ngressenhall\ndegreaser\nklavdia\nlifschutz\nrobertet\nherft\npacia\ndoppelt\nvasilakos\ncroaked\nhausfrau\nvandervort\nsangini\nharlene\ncottom\nhollyman\ndebose\nrollman\nvencor\nminchella\nvarrio\nengy\ntouzani\nhadari\nvego\nbrona\nsniffle\ndominczyk\ntembec\nsmites\njazztel\nfellating\ngramajo\naleluya\nwedbush\nitumeleng\ninghams\nodiferous\nghurka\nashfaque\ngaraad\nkleypas\nlitcham\ncorsewall\norfanato\nkeneth\nreseated\ncoladas\nammirati\nminzhu\nsnci\nberkmann\nagust\nsoep\nsecom\nmahantongo\nhydroponically\nchalong\nwitcover\nregathered\nweissinger\nearthlife\nfplc\npelczar\ngrotesquery\nfttn\nforgiveable\nbhogal\ngreenguard\ncloudberries\ndecemeber\nlaghdaf\nyasuyoshi\ngröben\nfogiel\nkades\ncrawfordsburn\nfractionalization\nchateauvieux\ndorcan\nkalsa\nthreadgold\ncampmates\nranan\nflachau\nispir\npriscah\nstymies\ncavi\nnetcast\nguarachi\neao\nmusalla\nau
gured\nxianyou\nhartsook\ncomica\nstuntin\njoustra\nflippered\nlurchers\nzouheir\ntysen\nappletv\nyscc\nmarlie\nscarpino\npupusa\nsnorter\npropogated\nbastardisation\nskypephone\nfemia\nnebbish\nchigwedere\npelote\ngembicki\nachray\nspume\nidrizaj\nkarabell\nunidata\nfrania\ngutkin\nsties\ngrantors\nhungering\nibda\nsanyuanli\nbarrydale\nsundahl\nkhashm\ntrabi\ndiagana\ngeeneus\ngeorgiadou\nmessege\njamaludin\nbattening\ntabouk\npodor\nlegbourne\ngoodhope\nfragomeni\ndroubi\ninel\nkamancha\nmaried\ncuch\nbeween\nbezabeh\nharjeet\nogwell\nmadjeski\nscafidi\nundertray\njamdat\nknl\ntayaran\nlezlie\nalarums\nsquishes\nterrys\nsquirmy\nlieff\ntepedino\ncoagulates\nmulticrystalline\nmeatyard\ndrysuits\nkhokar\nachnasheen\ncorazzin\nbareikis\nnoeth\nthalis\nkamoun\nwessling\ndenaples\nbibai\nforbo\npeipsi\neberwein\nblusher\ncotting\nmoptop\ncynllun\ndecoste\npidc\nextramusical\nrbcc\nzre\npwrr\nspudis\nofcs\nnitsa\nghettoisation\nzarrillo\nkikuyus\ncomisiones\nnothwithstanding\nsardiñas\nindentification\nchrysogenum\ndelacey\nlikhi\npayano\ntornay\nsuhel\nsedgeley\nbassington\nductless\nkingseat\nmesaoria\nbiproduct\namygdalae\nsievwright\nactivies\nrovetta\ngrandpre\nparticulalry\nbedolla\nnoveda\nfarberman\ncaochangdi\nlhakang\ndebie\nsiula\nparmo\nmaltreating\nmasie\nsanjust\nfluhrer\ndcha\nimmateriality\ntriquint\nufcu\ncritchett\nsolvit\nboiardi\ngfy\njesusita\nzilia\nhandscrolls\nhasia\ndespoliation\nmowl\nheldmann\nliftin\nartemida\npostsoviet\nthanki\nwittiness\nwangle\ncarolynn\ngabble\nmemmel\nwillauer\nnelp\nkhullar\nethosuximide\nclogau\nfelstiner\nkeffi\naksarben\nmizuna\ngyroscopically\npenenden\njerame\nkelon\nthurlaston\nchisholms\nespecailly\nmafe\nbesik\nfillery\nplée\ntenderest\ningabire\nthereabout\nmicrosleeps\nalgeo\nmicromanaged\npanepinto\nuwchradd\nrubbernecking\nshafeek\nampfield\nsippenhaft\nbloodcurdling\namarg\nbuccino\nputzier\nitopride\nsamhadana\nlymphopenia\nmatenopoulos\nimmunochemical\nschoell\ndiamantopoulou\nsomehwat\nkaloogian\nmashh
adani\nsamnang\novertrick\nstasko\nwadey\nzehavi\nsulemani\nperusse\nadiv\nirranca\nmilborough\nairblade\nscattergories\nmodec\nludeke\nwiland\nʼ\nmawari\ntotilas\nroundelay\nilit\nsamco\nleyson\nsreepur\naslet\nchrb\npillories\nateker\ncommitt\nvosganian\nllanboidy\nvassel\nmarasciullo\nwarora\nhinterstoisser\naltero\npetland\nbaralaba\noneamerica\nkishna\ncosmica\nespied\nspohrer\nspluttering\ntroposcatter\nrepplier\nagota\ngoswick\noutis\nzannino\nholvey\nposkitt\ndoleac\nconnock\ntropeano\ntupaz\nitele\ntemelín\nwatermead\nkfgo\nunman\nscrabulous\nleol\ncarolae\nwakiya\nhadyn\ntonetto\nmicroloan\nartcurial\nredating\naggieland\nlisek\nquantitive\ncandolim\nnayif\nreboard\nlambaste\nwheelton\ncucinelli\nreconquers\ntomochika\nhospitalisations\nmordad\nkronish\noutcastes\nseht\natuna\nmedshare\nsadza\nleyne\nmarktl\nliveing\nbriann\nchiambretti\nlucianne\nosinski\nhelpern\nudvikling\nbusines\nniran\nenom\nszumowski\nkounta\nlobue\nhypochondriacal\ncizikas\nbrowny\nhiraan\nunbelieveable\nkhulan\ntenke\nanyiam\ndraculas\nsureño\nwgz\nfederowicz\nprasugrel\nlicalsi\nglebelands\ncoulis\nomnivision\ntascón\nstachybotrys\npolyvore\nwdcw\nmauskopf\nsarko\nuniversalisation\nhawthornes\nmindworks\nheph\nlubega\nrightmove\npullins\nlikins\ndecamping\nvandyck\ncontinueing\nrobinswood\nsesquipedalian\nmenders\nminvielle\ncaunton\nnonjudicial\nzelenin\nfreudberg\nwerdegar\ntinius\nhilmy\nsexx\nmapps\nefimova\nnarusawa\nlewen\nxvm\ndanceny\nludemann\nadré\nfudges\ncontinuer\nmobilephone\npurkis\ncorrosiveness\ndavidovici\ndivvied\nhuffine\nbalmedie\ntrócaire\nmoshassuck\ntechnophobic\ntarallo\nuntamable\nivin\nbarski\nvaldivielso\ngivings\nstrathmann\nscarcest\nbarwari\neyeholes\nfocsani\ngestoso\nrossmere\nwiederseh\nrahum\nsaneh\npiersol\nmiglioranzi\nbuttonwoods\nparrys\nhabano\nmassaad\nmodrzejewski\nfreman\nmyotragus\nmesage\ninteroute\nkountz\ncordina\njudenrein\nkalemba\nraether\neiff\ndevid\ncaissie\nzumo\ncsba\nschoool\ndrumgelloch\nnewlife\nlegistorm\natomfilms\nwakel
yn\ngalyon\nhammerin\nsaeijs\nthornell\nrifaximin\nsextuplet\nfigi\ngenra\npitztal\nkhosh\nscandar\nvereinsbank\ncommix\nteriparatide\nartsfest\ncavlan\njacquart\nservicewoman\nbaiden\nkalkin\nfrescas\nchinyama\nhecm\nraqi\nsanderlin\ntaulapapa\nsuperjumbo\naliy\nbroughan\nminley\nphotofinishing\nhollybrook\nzakian\nmunched\nbuildering\nlebewohl\nluebo\nstearnes\ndenat\nshterev\noutpoint\nwlae\nyorp\nhighwoods\ndavola\nposdnuos\ngroaner\neasc\ncoffeepot\nnyirenda\nkorder\nfiberglas\nevason\nidahoan\nbiobio\ngesticulations\nhelvin\ngoldenbridge\ndmytruk\nzammar\nmesur\nmaquila\ncondoleeza\njacalyn\ntounsi\ntactlessness\npantsuit\nhydrogeologist\nwitchweed\nmaintanance\ncjv\nchrissake\nwanko\nayvazian\nthandar\nandeans\ncarringer\nbivouacking\nsahali\noios\nellsinore\npegswood\nnivet\ncabreira\nklagsbrun\npurposefulness\ntartaro\nburb\npiecuch\nguererro\nteny\nhelsel\nflugtag\nreaon\ngauder\nemedia\nlabonge\nmilnathort\nharnischfeger\ndiangelo\npaisner\nmeatpackers\nwebbys\nfrelighsburg\npoquette\nmobtown\nbagle\nparcelling\ncorrugator\nchode\nlymbyc\nnordion\nknai\ncolognes\nchason\nbouwerie\nskyeurope\nloebe\nbairbre\nserialise\ndripper\ndaffyd\nporri\nzhenwei\nobviosuly\nqac\nwashlets\nforesterhill\ndistain\nkumpel\nbarky\nnorimichi\nregisterd\nabess\nprobabaly\npiou\nnihe\nkerstens\ncrockenhill\nofframps\nrubberstamp\nwcrf\nghurabaa\ninfirmed\nibraham\nsmolens\nbruria\nachba\nshegog\nfilippetti\nsandrin\nhuszar\nhumblebums\nrwu\nroanna\nlisheng\nminnawi\nmercerville\nbeahm\nrowenna\nbrainteasers\nmizel\nebeam\ncolwill\nslopped\nyurko\nsgcc\ncavel\nmirach\nurooj\nmuttawakil\nnasreddine\ntowbin\ncmpi\nblitstein\ngurevitch\nprevalance\narshack\nunderutilization\nscibetta\npodkamennaya\ngenuinly\nsantelices\nzalben\nrouhi\nslaughterers\nroskin\ndoyers\nwwon\ngoldendoodle\nknuffle\nsteggert\nraymondo\nshitov\nreguly\nmahallah\nmorphoses\nboisterously\nchicharrones\ngikas\ndragusha\nwebwise\nswitchoff\nextravert\nwarschawski\nschtroumpfs\neconomising\nteedra\norrison\ng
rotesquerie\nfrankee\nsachkhand\nshirvington\nintellectualized\nuige\nharchibald\nchanghai\npluggers\nlootens\nschmatz\nayubi\nrasho\nnpra\nwessin\npapcastle\nseiff\nsilverbulletday\nathaiya\nprimar\nferreted\npitmasters\nhadri\nlahoma\nnzpa\nmarer\nvegnews\ndrayage\nstreetlamp\ntolin\nradioplayer\nlandlessness\nudalls\nsooy\nhugest\nneonatologist\nprotégées\nchunlan\npetitt\nkupiec\nreynoldson\nbelloli\nonesource\ncarlyss\ncichowski\nctcc\nardebili\nhtg\nmendle\nmeetze\ngreentop\ntahmasb\nminati\nkerstetter\nhutchin\nyaowarat\njagemann\nramak\ndokubo\nfreedon\nsussmann\njohnnetta\nfiloviruses\nflextime\nblankson\nzollitsch\ndewanna\ncupriavidus\nbroms\nnevatim\ndijksterhuis\nlubtchansky\ncoumba\nladurée\nbasang\nheydey\nspitballs\ntemazcal\nwooh\ngorur\ntebas\njablon\nabdurakhmanov\nnadol\nstiefvater\ncystectomy\nlionize\neleifend\nargyles\nnyiro\nesbjornson\niraqiyya\ncottontop\nkasaev\nscelerisque\nlandstown\naimson\nyongnian\nreupholstered\ndarg\nvory\ncrudities\nunpreserved\nwaie\nlimed\ndarbellay\nlydbury\nsageworks\ncatcheside\nabbc\nmalonzo\ncapalbo\nmachaerus\nwithies\nzeti\nmonbijou\ncritcher\naldorino\nmerrills\ntalari\nicstis\nbulygin\nleibovitch\nhissey\nvictimizes\nprif\novercorrection\nmsamati\nbotteghe\nbirria\ngigondas\ngalmpton\nboyter\nwaqif\ngorska\nchilingarov\nmaltitol\nschiappa\nruppersberg\nferruzzi\nfazila\nhornlike\nmehaffy\nkashti\nludworth\nsibony\ninformercials\ngissler\nglutens\ntoothbrushing\nkaligis\nmootha\ndmat\ngueorgui\nkummetz\nyohana\njiamin\nkhayrat\nallford\nmouris\ndroping\nunmarred\ngogue\ncomensoli\ngibberellic\njuddering\ncomradely\nabiertas\nberluti\ndaskal\nelyn\nmagniflex\nsulfenic\nsetzuan\nwillersley\ncoyness\nkharwar\norza\nbaingan\nspeedworks\nschoep\ndoorcases\nsølve\nkearn\ncasie\nbeaujour\nviqueira\nqingli\npettinari\ncorré\nhunminjeongeum\nlonghairs\nrenourishment\ndadey\nspirituall\nmbunga\nyampah\njendrick\nosana\nbogoro\nboroson\ndogood\nkeiland\ncarayon\npdip\ncyclers\nsheskin\nmynetwork\ncrustless\ngleitsma
n\nmaily\nkalkstein\nmilbridge\nmillboro\nstartac\nkrutoy\naamco\nbagli\nminichmayr\ntubercolosis\nmegaport\ntaricco\ndjukanovic\nreinelt\njiggled\nkaupp\nwoodroofe\nsidewards\ntonkinson\ndisapointing\nftld\ncomorians\namapa\nhackable\ncrystallising\ncleeland\ntepalcatepec\ngolasa\npaulick\ncarefusion\nmicroencapsulation\ngaláctico\ncarnesale\nlanjigarh\nhris\ndistaso\nvilan\nshiah\nbatcha\nthermax\nschnecken\nlecht\ngeneste\nkaraki\noverspray\ngussenhoven\nlamparello\nwigstock\nmerchan\ntarkhnishvili\ngestifute\npipefitters\nfirecrown\nadiele\nrajus\npfeffel\naÿ\nthorlabs\nkogalymavia\natna\nbulleit\nkagayama\nostrovskiy\nzakhilwal\nhiggenbotham\nnerud\njakson\ndunholme\ndenisot\nblickenstaff\nusofa\nannese\nobliterative\nmacroeconomists\nvalloire\ncarrigans\nanyinsah\ndunkels\narchly\nbishko\nwolftrap\nfehd\nmoulaye\nincommunicable\nschink\nzydus\npaloverde\nnuaman\namygdaloides\ndfki\nmarava\nbabitsky\nmonetti\nparami\ndelafose\nroling\nkolat\nbeula\nberrey\nsafawi\nlioret\nunderbellies\nfokienia\nplectrums\nwackies\neconomized\nmeyr\nrupi\nkatzav\nhempfest\nplutôt\narnull\npromark\nfiszman\noleaginous\ndigitimes\ncorelogic\ncelda\nwhaleman\nclaimable\npetroli\ndiment\nhexic\nlulis\nmessiest\nefaw\nbucknam\nfryklund\ncheim\nyongyuth\nscarey\nkarnam\npohlen\norben\nleiberman\nquestionaire\nglassing\nbrachmann\nzhongdian\nlinkebeek\ngaggia\nlevring\nfantasises\nhydroxycut\nstuttard\nperuggia\nwagley\neroglu\nchomski\nwoodhoopoes\nduffels\nmhondoro\ntimpany\noando\nchernyshova\nsrps\nvulvodynia\nmilewicz\nmischievousness\ntwinkled\nrbge\nsimoneaux\ntagaris\naeropostale\nwadeson\nagca\nsloshed\nragers\nreportings\nabertridwr\nadjmi\natempo\nditcheat\ndevorski\nrosenthaler\nhoogerwerf\nkxxv\ncolva\nfarbstein\nfamiliy\nindentity\nlausitzring\nmalamutes\npackouz\ntelestrator\nsamter\naiag\nnothung\nagainist\neligibles\nmodernisers\nglistrup\nmatzerath\nbébés\njaksche\nwhitsbury\nanybodies\nplazes\nescalatory\ntabakova\norzo\nadumbrated\ngusciora\noximeters\nrayton\nkipp
ie\ndisharmonious\ncombatively\nculyer\nnsdp\nmcbriar\naplon\nzenobi\nlexapro\nnurun\nemanuels\nbombifrons\nbrossy\nlamere\ntheyve\nkioko\nrasgotra\nlusser\nwissington\nostanek\njadson\ndepoliticized\nsteampacket\ngoldtrail\nlucubrations\ncompanywide\ncpfc\nsugarcube\ntouchups\nalstrom\nstankevitch\ndjimi\ntcxo\ncurrrent\nnimb\nrebe\nechoey\nkhemir\nstockers\nrww\nintrospectively\nfrett\nisoa\nvasti\nsidebotham\npeluce\nlegitimises\nbuddin\nkicklighter\nconfrères\nilegales\ncecp\nanchoveta\nhovorka\ntamakoshi\nfirin\nhaniff\nkronenberger\nsheps\nyahn\nslinn\ncallado\nceballo\naminullah\ntravalena\nrusticity\nadmob\nbelligerant\nrfea\nveenhoven\nmehrerau\nhesters\nrengel\nrefregier\ngeddington\nfreeagent\nurby\nknuts\nknols\nyouthaids\nespndeportes\nlawrenny\ntopscored\nzubillaga\nkoryn\nycombinator\nabshagen\ndrzyzga\nrhincodon\ngunslinging\nhoomanawanui\nbajema\nnimetz\nstrickling\nstigmatise\nopendata\nowh\nkommetjie\nperello\nscharfenberger\nprolificacy\nmisspoken\nnaturaly\nsivagurunathan\nsedita\nrainclouds\nyahuda\nusjfcom\ndrozdowski\nservitto\ncanapés\nsaldarriaga\nrifabutin\nlundestad\nsenillosa\ndaytonas\ncryospheric\nchokeberry\ndouek\npopuluxe\ntialata\nodabasi\ncattus\nkeynoter\npostulations\nkeasler\nstoltidis\nmarzilli\nschlecker\nromiti\nennahdha\nexhumes\nmultiengine\nintegrati\nidiocies\nsportwagen\nmurrumbateman\nhaffey\nwabbits\nabandonned\nhyperpower\nenbrel\nperfluorooctanoic\nvielma\nragpicker\nroob\ncalia\nrodriques\nwarmness\nintercarrier\nhochreiter\nwinterling\nmiraculin\ndbcp\nnaughties\nwondermints\nshobhaa\ntrundled\nchavenage\ncellardyke\nbulería\navramovic\nbreathalyzers\nmatebeleland\nmedomak\nmessent\nstanesby\nsovereignity\nfdcc\ntangeman\nreproachful\nbreezeblock\nwwan\nleverich\nshuvee\nsemerci\ncordone\npaliperidone\nprobabilty\nkanzius\netemaad\nfathomed\nwolszczan\nacronymic\nswooned\nfutons\nshimaoka\nmorgenson\nfootpad\nshovelware\ndannell\nbabytalk\nalbig\niwar\nupclose\nturland\nsteinkellner\nessiac\nmaluleke\nlumpa\nmudfl
aps\ntiatia\nmozingo\nlakhvi\nmortell\nmontlouis\noverseal\nshuying\nconductorless\nlearndirect\nsempo\ntamy\nibell\ntelevsion\nfairton\nupromise\nnhbc\nlobianco\ntcby\nshekari\nceleritas\ndmps\ncrossborder\ntwop\nlustfully\nptychodus\nnamasivayam\nroomster\nmahjabeen\ngrandness\nquett\nhyboria\ndaqian\nciliau\nsmick\npochinki\nwacka\nridc\nwugang\nhassman\nnumalink\nlipner\nashprington\ntachilek\ndehan\nboopsie\nsnifter\nverrey\nkanouse\nmordkin\nsollars\narbitation\nyoungquest\nanalisa\nemarketing\nbokov\nwacholder\nrutberg\nminnies\naito\nlumension\ncozies\nproselytizer\nlangrick\nqubba\ncalina\nembroiling\nlifelight\ntaraporevala\nseaching\nnaeim\nblaenplwyf\npelindaba\ntomatillos\nambari\nsuttree\ntrailering\nsatinover\nrudds\nrenucci\nbagwan\nforestiere\npeik\nshahara\nbicom\nnsaliwa\nvanowen\ndeliang\nchunying\nintralesional\nsequi\neyeshot\nolar\ntronox\ngrotty\njibla\nballhandler\nauberges\nunweathered\ncolome\nnewnet\nnazakat\nreappraise\nvashadze\nbraida\nbaldisseri\nwijemanne\nkodnani\ninsolently\nnotasulga\nfortuno\nmaalot\nfursa\nbronllys\nbarcott\nlongmead\nouvry\nneccessity\nobscurantists\nxuejun\nblachly\nhochschorner\nmazlin\nhoaglin\nyuzana\naspex\nbiobrick\nlasana\noculoplastic\nkreditanstalt\nirungu\nsyndey\nbuyse\nresolvin\nbrajkovic\nflowback\ncrematoriums\njumpseat\nhassanpour\nbarnert\npaetkau\npetroskey\nkulcsar\ntomada\noperationalised\ncasulties\nkosto\nnonperishable\nzoldan\nclemetson\nsovann\nowuor\nkaramani\nmabandla\nunproductively\nfrilford\nbazire\nholdeman\nkrisnan\nbaduel\nradwa\ntsokkos\nmassala\nrumy\nempirica\nchente\nfarmgate\npardeep\ncanabalt\nnyepi\nromal\nterria\nawin\nunpinned\nderrygonnelly\nkhaima\nkavoshgar\nnghiem\nfrieth\nsuhair\narrue\nbrantner\npromaxbda\neichbaum\nopensky\ntiscornia\nhaenyeo\nbalmaha\ntchividjian\nbatko\nmamlok\nautotuned\nhillfoot\nblunderer\ntinies\nsanmina\n﻿\nleucovorin\ncwmtawe\nswy\nfluhr\ndudson\npellini\ngbao\nlatterell\nfusty\nmadie\nbife\nlandmen\ncxm\ngrindstaff\nbotesdale\ngalère\nlisn
ey\ngoodfield\njurkowski\nvesce\nmasone\nchidren\nrespa\nbelorukov\naggett\niding\ndepfa\nwaldingfield\nlovingkindness\nslanty\nkembra\nhoganson\naradhna\nmeryton\nbillmeyer\npeculation\nexaming\npsychosomatics\nmayah\nhodari\nsuckles\ncurborough\nrdk\nkopas\ntokuoka\nbakuriani\ntassara\nfsmb\ncasarez\nreflectiveness\nunrelatedly\nplemmons\nschlotterbeck\nequilibriums\nhayduk\nsolidarities\nbodfari\nsumaira\nsunbeds\ncampustours\ncapd\nlendal\npalpating\nwassall\nafiuni\nedusei\nnutkin\njubeh\noriakhi\neastex\nschlich\nbulgheroni\nscothern\nfosgate\ndanjiangkou\nziebarth\nwotte\nnextbus\nvertebroplasty\nwethington\nrivelli\nmuckamore\nmetatags\nsmadja\ndarvishan\nfishcakes\nruperti\nlebedyansky\nrier\neyebar\neagleview\ncontary\nmicronas\nhuppe\nsenk\nmsbp\nawadallah\nroil\nechosounders\nlables\nsupportively\ntotin\nsubramanium\nniw\ninnoshima\ndaston\ntheworld\nbiechele\ngyngor\numezaki\npalliate\nbigbie\nrotoworld\nmaurren\nmedress\nmischaracterised\nheilind\nfranceso\nwheatfields\ngojan\njilting\nsnader\ndillehay\nabstainer\ncoval\npersell\nahtila\ntimimoun\npilip\nhafetz\nookla\ncercel\nzannini\nveikune\nmousquetaire\nzeppos\nossai\ntolla\nogborn\nmusyoki\ntrudering\nmegapascals\nvfn\nlimper\nbaetz\nvickerstown\nkobylt\nduncansville\nspatafora\nvernetta\ndaraz\ncountermelodies\nkinships\nyerima\nsnuggly\nreedie\nyuam\nnoctambules\nbarschak\nlongliners\ngafar\npooneryn\nsgy\nembarrasment\ndobin\nbirnes\ncrockatt\narwyn\ncarniglia\ngrundberg\nberdimuhamedov\nfrivolousness\ninvestimenti\nparahaemolyticus\nhilltribe\nbadhwar\npalavi\ngosman\ntorren\nsteijn\ntabron\nkelvindale\nshininess\nallnut\nhalfs\nntamack\ntnsm\nincarcerates\nbrostrom\npivonka\nhederman\nuwt\nmansurian\napcor\nideson\nbisciglia\nsuboxone\nriluzole\nlaxford\nlokubandara\ngerresheimer\nlokum\nwichniarek\nvergo\nhawing\nferragosto\ndillema\ncommiserating\njaydon\nsmitham\ncildo\nmistruths\nzulfahmi\nkieber\ncroziers\ntakino\nkittleson\nvatcher\nwickware\netone\nbystry\nstummer\nduckpond\nhouman\nn
jtc\nphlebotomist\nclobetasol\nfederley\nstocktaking\nbazaleti\nverdehr\nglenarden\nmorvich\nbiyi\nflareups\ncsam\npostrel\njailson\nyasheng\npeccadilloes\nnonofficial\ndeltour\ndamjanovic\ngomarsall\nmyntti\ntpao\nsaret\nnahles\nviamichelin\nkux\nkojis\nseedley\nhoess\nmroczek\nifh\nlucaya\nbauzon\njoskow\nbresonik\ntonalpohualli\nunobjective\nvaleric\nbrotons\npentrebach\nmercexchange\nsubu\ninvigilators\npernetti\ntelfort\ngovone\ndshea\nchozas\ncido\nvasiljevic\nviégas\nbantom\nrebroff\njochanaan\nstrategems\nvecher\nhalvarsson\nmerkins\noggins\nmullers\nbretforton\nletouzey\nboerewors\ncraemer\nvilca\nsydell\nuqm\naguad\nlodestars\nethopian\nshanab\nmetzelaars\nrapisardi\nflaig\nfahem\nfharraige\nsloggett\ntotsy\ndeflowering\ntrailwalker\nanklesaria\noppal\nfufilled\ngodsiff\nsolvik\nredistributor\nabuelazam\nharperley\nvernham\nhameur\nfeldeine\nisaacman\nnegen\naldonin\nrongrong\nwandong\ncluley\needar\nplayaway\nwlans\nhilb\nmaziarz\nerck\nsociobiologists\nsippola\ndynavox\ncsbc\nmartham\nkamall\nunthreatened\nfernholz\nskwentna\nskyjacked\nluminita\nekel\nblackaby\npoulou\nessure\nchainless\ndisinflation\nlyxor\npenalises\nquadrilatero\ncouser\noverdriving\ntoxicologic\nbadgingarra\nrutzen\nharped\narcara\nexposer\nbleymaier\ntolas\nbrissette\ndecroce\nsenioritis\narteriosclerotic\nbetzy\nschweber\nfishable\nmoralization\nbogof\ngrauwe\nsahag\nsspc\nremmers\ntepoztlan\ntauck\nbangham\nmiessner\nladdy\ncartin\nchunda\nrereads\nresig\nbeneduce\nshweder\nlazkao\nardvreck\nwellstar\nshahran\nfese\nbonsanti\nimpotant\npareo\ncabindan\nduckies\nscreenprint\nsynsepalum\nfhn\ncoolhaus\nwashbasins\nrecher\ndarco\nwiseacre\nteletherapy\nonechicago\navenido\ngiabiconi\ncellared\nshude\npossitive\nmicrotec\nothmani\njialin\nalbertha\nwaak\ncountercharges\nbatterers\ndutka\nnordwand\nmayeul\npasztor\nweinglass\npsihoyos\nposuere\ncahuita\nlanguidly\nprescreen\nrhas\nsignator\ngensets\nkaysing\ndrumcliffe\nventurella\nbonani\njansrud\nbinita\njayawardhana\nsipek\nsupport
ability\nsulligent\nklunky\ngatecrashed\nibobi\nlangness\nvman\nbitencourt\njahromi\nhoepfner\ncheapflights\nacfas\nidabc\nmarchello\nvoivodina\npatrich\nffrwd\nmahall\nririko\ndisempower\ngrunted\nbibliophilic\nmndot\njenvey\nwhizzes\npersued\npbso\ngypsii\nzalmoxes\nfairwinds\nmindbreeze\nsadaqat\nairwatch\ndisaffiliating\nhoudyshell\nbafflegab\ntemsirolimus\nululation\ntotie\npoohsticks\nngultrum\nmushes\nranz\nloelia\nmollycoddle\nmidniters\nkrispie\nbrakha\nvarsalona\nmangé\nhorseflesh\nsplendido\nitzhaki\nmangyans\nfiligreed\nsampsonia\nhintjens\nlenzo\ngroupm\ncranney\nenova\nulukalala\narfan\nkasarova\nacquisitiveness\nyetts\nbirdemic\nvmpfc\nseattleites\nbuffaloe\norfali\ntibnin\nnujaifi\nzepa\nbercht\nparija\nclodio\nkedourie\njapaneses\nmichito\ncristoph\nlaurentina\nbitterlemons\nescuelita\nporti\njixiang\nnearline\ngyrocopters\ncheminant\nbarshop\nbloviating\nhoosen\nscss\nseakeepers\nhomelike\nfelsher\nmostofi\ndefragging\nciganda\nhougan\nhobnailed\nsealfon\nalbertin\nloosli\nozbek\nkibel\nsmar\nwoelk\ncanaccord\nderegister\nlapietra\nwodka\nbentalls\nciliax\nrunscorers\nibisworld\ntocache\nbambou\nbalthazard\nmileageplus\nwyckham\nmulheren\nherberto\nsidewinding\nbankson\nraynold\nquinonez\nañoveros\natherogenic\nhynkel\nginoli\nthorborg\nstridulating\npresskit\nlongonot\nhephner\nnetwrix\nfontmell\nlochos\nbayen\nodriozola\nchinedum\nsipprell\nbadmouthed\njacobe\nkoniag\narkholme\nyoungwood\nnersesian\nrebadge\noutfields\nchocoholic\nlatady\nlodal\nampler\ncymro\nwhitakers\ncumpsty\npaperbarks\ndabbawala\nshabqadar\nhuiming\nsciencey\nsartono\nkinyua\nsecureworks\ntangradi\ncharlesbank\nhvt\nscotmid\nkivell\npatisseries\nretasked\nulba\nstoutland\neids\nminxia\neckehard\nkameli\nemag\nurraco\nlancias\ndelsing\ncirincione\nbakhoum\nyeji\nerran\nprofessionel\nglasthule\njabbering\ntrello\ncasano\naggrandized\nbenua\nidsia\nshedload\ngeys\ncollaboratives\ntiarella\ngiftcards\ncringleford\nklatz\nkalwar\naanestad\nrocksalt\nsowah\ndurcal\nniketa\ncharos
et\nreatta\nairband\nfollistatin\ncodgers\nkulchytsky\nokladnikov\nodonkor\ntrecastle\nchelem\nestablecimiento\nmassachussetts\nradware\nibin\noide\nsnideness\nbeaufront\nlovestory\nrevaluations\ndemocratised\nrenovaveis\nfeistiness\nthriplow\nbrobeck\ncollotype\nmatheis\ndierick\npatsatzoglou\nvigneau\nbieger\ntimet\nodrick\nbelldegrun\nbeeghly\ndescried\nzalka\ncastlehead\nbabbin\ndaisetta\nwilkenson\nmeijo\nrlpo\nclios\nreproducibly\ndisfunction\nlittlemoor\npumlumon\nbloodvessel\nperveen\ncantalamessa\nkrivtsov\nrosenberry\nhlavsa\nvillarruel\nkallam\nsugár\ntiltyard\nasmundson\nfoolhardiness\nlonnell\ngandan\npickavance\nthornberg\nfaiumu\nzierlein\nluccin\ndenlinger\nlanguirand\ngurvitch\nfarcically\nexfoliate\nluxuriantly\ntremough\nrimberg\nsosanya\ntippling\nladendorf\ngittinger\njackanapes\nchemex\nnotz\nbaluk\nhaiming\nconciously\nderegulations\nwirajuda\ntarran\naschner\nchemmy\ncarpinella\ngonner\nsubasic\nmppa\nretherford\noutler\ndittisham\ndysc\nbraies\ngoofballs\nfortensky\npostering\nlercara\naeeu\nchorzow\nnoelani\nimmodestly\nwittenberger\nstecklein\nhudema\nsabti\nbjornstad\nhaylor\nrefight\nsulham\ngeomag\nunnessary\nlutzen\ngigapan\nthast\nsolntse\nmobily\nmikhalev\nhentges\nweisbrodt\norjan\nquedagh\nnonissue\nshirtmaker\nsystym\nsalelologa\nhaltzman\nnoxzema\ngrotton\nefimenko\naerobie\nderridean\nfcci\nbiscet\ndobrygin\nmoshammer\ntivey\nnereida\nclearchannel\nonyett\nkujovic\ntastevin\njassar\nesterman\nyapper\nembroiders\nrondor\nklotzbach\nprimettes\nguling\nkendler\ngulleys\nthommo\nmohmed\nkeiss\nhelpston\nsudep\nsokos\nfewtrell\ncornwood\nlévai\nyoussoupha\nmimram\ncance\ninnaurato\nnightlights\nintertubes\nheatherette\nchupack\nvukadinovic\npolsby\naccentless\ngrafman\npusateri\nintellectualization\nredmarley\ndziena\ntrollop\ncses\nexhibitionistic\npaolantonio\nreemphasize\nhardrick\nwhineray\nsaisies\ncuriale\nbawdiness\ntrostre\nromalis\nmondamin\nhousni\niraizoz\nworldnow\nspelke\ndorismond\ntempelman\nunreviewable\ndirties\namvs
\npére\nmateas\ninuksuit\nbrookhiser\njoyrides\ntribunali\ntöben\nberdine\nparaben\nhyderi\ntuju\nheythuysen\nikeme\nroscos\nheartmath\nmccafé\ncabbell\ntrenbolone\nihas\nnonspecialists\noceanair\nharlescott\ngunnin\nondrejka\nngbs\nomlt\nheadstand\nshippon\nphola\nkarabelas\ndasovic\ntraumatizes\ntarsha\npaktel\nmanchurians\npggm\nfirewise\nlooke\nstaffin\nreneses\ngiddyup\nkolelas\ntenev\nmournes\nmarkovics\nbaldcypress\noncotype\njuked\ngovindji\nkhunu\nvfi\ntattooists\nplanningtorock\nsickling\npleasantry\nherrhausen\nbonnevilles\nribalow\njudyann\nfootless\nraikov\nbammy\ngeremy\nmicheler\nlimusaurus\ngarritano\nantimissile\nmisreport\ncauldon\nlovefest\nsawh\nwaldgirmes\nhadep\ngramzow\nbotein\nzewe\ncrabmeat\nstepovich\naeoi\npingju\nesmaeel\nhistamines\nushaka\nrosehearty\nschavan\nelmayer\napparati\nsailboarding\nsingleminded\nabdolmalek\ntyddyn\nlowborn\nmoonquakes\nhuysman\nropy\nmacchiavelli\nediplomacy\nmeilhan\nzafón\nmossimo\nnolton\nbarkus\nsmrs\nprovidian\nmccareins\nwheate\nisikeli\ncybercafes\nlilje\nbounden\nfualaau\nselek\ninfobae\nkaviar\npiledrivers\nedvinas\nnoforn\nporthdinllaen\nojjdp\nlandri\nfarrowing\nmalkoff\ndwyre\nkibuuka\nothere\nrakis\nasab\nmelingriffith\ngeslin\nthongloun\nhereditas\nlevitina\ndundonnell\nemmies\nncst\nwallowed\nsemrad\nbewails\nsoludo\ncousland\ndungaree\nbermange\nragosta\nrodbourne\nflorien\ncantiello\ndeyermond\nsiegele\nposhard\nnimit\nelisofon\nlavizan\nblundeston\nbuonaiuti\nheiferman\nkretschman\naakre\ndvorkovich\nmbarushimana\nkolanos\nenlgish\nwaiwera\nidolising\nduhulow\nmiccio\nsingificant\nnesch\nsuperintendant\nsapori\nharraka\npxp\nnetbackup\nbullheaded\nkrimmer\nandrianova\nrickhoff\ngengler\nfudoh\ngranick\nrelabelling\ntasr\nnasonia\nspacca\nsomeome\nsesser\npurt\nomiai\nrebuck\nmarkha\nunstitched\ntabachnick\nnilsa\nuscar\ntudes\nsolando\nafco\nphotopigments\ndepressives\nsteffe\nenergoatom\nresentfully\ncrapshoot\nmeledandri\nsoderlund\nsafetynet\npostboxes\nrepurcussions\nkyphoplasty\nonstott
\nkasischke\ncantlay\nreymundo\nmcilvain\nbhasera\nairily\nkibitzer\nshankley\nvany\nblackbridge\nkostyantin\ncharmeuse\nbanyu\nyewon\nquinata\nverfremdungseffekt\nsugen\nendoscopically\npurlieus\nadepitan\ngracz\nthermoelectrics\naceti\ntoves\nnobacon\ndigitas\ntofus\ninteligent\naldeen\nauchinairn\ncoule\nmcniff\nnonghyup\nbushmans\niesc\ncaberet\nritterband\nlapadite\nviewfield\nlaudner\nvorlich\nfunderburgh\nthomsonreuters\nwoodseaves\njamii\nsheinbaum\nguamanians\nyonda\nruckersville\nmusharaff\nessop\nschnall\nadblue\ncredulously\nboulevardier\nkumbi\ncarrio\nsungkar\nbrouse\nresonantly\nhingorani\nflader\nrurality\ndongzhou\nitinerancy\nseesmic\nanalogize\nlizaso\nowoh\noystermen\nfvi\nasdrubal\nwhiteparish\nwilcockson\nflecking\nblahyi\nantiguans\neinsatzstab\nlaor\nduely\noberti\nschlemiel\ncaroff\nqunu\ntimmi\nyobbo\nschomer\nsuyapa\nlisby\nberkeleys\narispe\ntrut\nregally\nkvasov\ntrali\nneronian\nstreif\nfpw\nwoolstencroft\ngatso\nhalvergate\nnoels\nondokuz\nopionions\ncountermovement\nsosei\nwinesap\ndelinsky\nsegler\nprostrates\nlonkar\nspilum\ntuitert\nburnap\nnorten\nmisted\nexpensed\nshiftwork\nunderactive\nsquirmed\nkochman\nmcquigg\nchenet\nrichmont\nremolding\nplaygoer\nmendillo\nwattsi\nkeylogging\nbugarin\ncoruscating\nbowlegged\nfaretta\niwokrama\npasqualina\nchrgd\nakubra\nbeetlebum\nfernea\nshicoff\nsolidaires\nrahad\naleah\nnashawn\nkrogman\nbulthaup\nriggings\nfolkington\nunscrupulousness\nrockot\nkerpel\nsonographer\nhydroelectrical\ntigan\nlesbo\nunforgettably\nyanagishita\nstdm\nclickair\natazanavir\nyisraeli\nbrixey\ndianchi\nthks\ngirts\nrenumeration\nrübig\ndamschroder\ntarkett\nmaitatsine\ntedtalks\nbartana\nmemsie\nriojas\nloosers\nrenilson\nkiltarlity\nmburu\npeerman\ngingles\npikler\ngoreau\nmantria\nhogarthian\nbotellón\ncampouts\nagnant\nyonnet\nleestma\neyeshade\nfriended\nvidan\ntaleggio\nnaeba\npendell\nrafaelle\nrayback\nincompatable\nshilowa\nketoprofen\njournoud\ngrefe\nsupermedia\nunamet\nprivelege\nbelvederes\nkrivoi\nm
acdara\nbellyaches\nzisapel\nlaboratorios\nbiring\ntoyes\nnoryangjin\nsudans\nalfani\nmusahar\ndoorless\nfrittoli\nwelaka\ndeuchars\nstaudacher\nvillagio\nalbuminuria\nretouches\ncresseid\nperéz\nakinmusire\nsyesha\ncfdr\njinger\nfertilises\nbonanzas\ncgcc\nascribable\narutunian\nkhaznadar\nbrisac\nbackstroker\nfrohna\neckland\npshaw\nheidepriem\nstadtfeld\nbarrowfield\nmisbehaviors\npanamian\nmerritts\ndumor\nlueneburg\nbelinga\nsaurs\nkelbrook\nspreckelsen\npavanello\nbabchuk\njericoacoara\nipsita\nwenqian\njassam\nmarmorpalais\nprimanti\nsegnatura\nbehlendorf\nmacor\nvichit\njaaber\nmaxmara\nsumbanese\ndeboned\nmukluks\nprovacative\nsheerly\nsabeer\nkintsugi\npapademetriou\ndombrovsky\nskora\nautolog\nwatermain\nmulgarath\nultrasuede\ngleicher\nscrupulosity\ngoydos\nreinjection\nbroudy\nzobor\nvopak\ninkie\ncountenancing\ncharkh\nbearne\ntorakichi\ntavakkoli\nfacilisis\ndoog\nurkal\ndmaa\nfreeski\nboisar\nscafaria\ngussak\nsarrasani\narchambeault\nsamri\nwolch\ncampiness\narzneimittel\nbemusing\nsares\ngatrell\nfindochty\nreinstatements\nwordworld\ntobianski\ngrgic\nifmr\ntshogpa\ncommittments\ntcom\nicehouses\nfreshies\nkatsoulis\nrepetti\ntrigiani\ntransmusicales\ncloudbook\nskelding\nemmenecker\ndeola\nbatie\naliskiren\negilsay\ngreenley\nollabelle\nlarudee\nsnowpacks\nlizzies\nkomives\ninfarcted\nschifano\nnewswriters\ngeorgakopoulos\nvoskuijl\ndemillo\nlynval\nmaypoles\nzuaiter\npropogating\nmittelstadt\nambady\njuyan\ngiorgetta\ncrabbs\nmashood\nreaal\nkichak\nindiginous\nfourscore\nshizhao\nsouthcorp\ncohodas\nwedtech\nsucipto\nconfectionaries\nginestet\nseiont\ntamberg\nsweetens\nfolates\nsoona\nschipp\nramchandani\nsadoun\nbozinovski\nyokomitsu\nparentes\ndind\nriyanto\npolaha\nthorsteinn\nboparai\nstorberget\ntexa\npaylor\notps\nfosset\nmaith\nsumfest\nunfamous\ngolbourne\nnymphéas\nmandata\nwooer\nsaadullah\nbioaerosols\nsileshi\nseatonian\npasskey\npironti\nspellberg\niqpc\ncantamessa\nbeaverdale\njerrycans\ndarvon\nvornic\nlefko\ndewoody\nnkhoma\ndeb
dale\nspyhunter\nknies\notsuji\nravard\ncofco\nbenberry\nginiel\nglynllifon\nweichang\nmakkar\nwillowemoc\nspeechifying\nbamborough\neynhallow\nmnscu\ngunduz\nipartment\nhypothecation\ncovehithe\nbonavita\nmeggido\ncseke\nequallogic\noloibiri\nsourly\ncontine\npaulis\ncavitt\nahdaf\neverwhere\nraploch\nkandarian\nnovemeber\npaddack\nharimoto\nalaikum\npingan\nhoseasons\nunitised\nciemat\nimplausibilities\ndsei\nkuttelwascher\nmughelli\nprocuratorates\nminidiscs\nyehl\nwltm\nlabcorp\nujaama\nkerbstones\nsaltanov\ncavenham\nbazilio\nsookhdeo\navai\nnotos\nsodexho\ncourrent\nauthentification\nchampness\nezenwa\nbuisiness\nroocroft\nhinkes\nmulot\nstrachman\nundercards\nmelkamu\nkrink\nultracompact\nurethanes\nweaubleau\nlycanthropic\nblanchelande\nquickarrow\nomcs\nvashchuk\nkreusch\nplacente\nbaatin\ntrutnev\nruding\nshhhh\nbarzelay\nbillowed\nriverain\nbiologi\nhemoglobinopathies\nsweatsuit\ningol\nunforgiveable\nceroli\ngarboldisham\nupmann\ntimbalier\nzawar\ndorenbos\nsakher\nlovric\nkreig\nchavhanga\nawakino\ndingess\nsultanova\npiquillo\nspillages\nvistaprint\nreregister\nracheal\noutdoorsy\ncradlesong\nwenker\nyaweh\nfrosterley\nhimo\nkarkov\nausburn\nchildminder\nhaahr\ndonastorg\nomagbemi\nnelima\npayzant\nkukoc\ncasetta\npinchi\nmawrth\nfrancene\nmazzante\nshufflers\nprensky\nnpoess\nchocolaterie\nlincon\nbauserman\nhuxter\nbafétimbi\nruchir\nlewars\nliraglutide\nmarignan\ntorryburn\ntoddlin\ndiakate\nshishou\nbechamel\nbartolina\ngurnani\nwennemars\nknuckleduster\nvanderlans\nbarnetts\nbarrea\npardeza\ncarriden\nnutso\nsecrist\nwolkonsky\nsquirrelly\ncusters\nennobles\nmicromachines\nwinwin\ntitcombe\nsolokha\nhomogenizer\nflinches\nbritni\nthelocal\nlankov\nyanming\npaulene\nmondia\nummed\nkeeth\namsale\nvernors\nmyto\nmedis\nglenrio\nevps\nciocci\ngoldenacre\ndolora\nseatrade\ndiskus\nbossini\nxnet\nepiphenomena\nkristofor\nbalcarras\nbrillian\nchartis\nmatichon\nlancope\nunsubscribed\nbenzopyrene\nmacfarlanes\nmoonalice\nscopelliti\nmusaffah\nforemothers\
nargyro\nlélé\nlewins\nnetbox\nlornah\nanella\ngremin\ncomputacenter\nsdic\nithink\nxiansheng\nplodded\ndistending\nnachmani\nboochever\nlhrh\njcq\namorello\nmcmansion\nacusing\nparrella\nthurairatnam\ngheesling\nrasey\nheftier\ntoey\naboveboard\nabdinur\nopenhearted\naquainted\nshalan\nbuhne\nrsma\nysrael\ncolourway\narmiliato\nnelfinavir\ntealby\nlihtc\ntishkoff\nalzner\nranin\nsagastume\nstartpage\nrowdier\nfulljames\nshoc\nvanderhoff\ntelepsychiatry\nsaqqa\nbmcs\npolmaise\nlincluden\nbobošíková\ncalliper\nhajin\nlochaline\nthorsell\ncistulli\nirenic\ntrefeglwys\nconnectable\nwernersville\nkettlebells\nzunil\nintercontinentalexchange\nveeraswamy\ninvadopodia\nsmaltz\npekkarinen\nchintzy\nrocanville\nbatarfi\nlovecchio\nnanosheets\nkolinsky\nwillfull\nbenach\npichushkin\nlimewash\nexaminable\nkochetkova\nderges\nsubnotebooks\nseatings\nmultisim\njovicic\nduffys\nthac\narakelian\ntomarchio\ncrusoes\nlqts\nbiospheric\npatchi\nhaselböck\ncamay\nfreckling\nironkey\nreagins\ntussler\ngutjahr\nstarmore\napua\nhanker\nsumang\ncoppack\ngroupmates\nkoroman\nenervated\nwoolls\nboardmasters\nregenhard\nresino\nterunobu\ncyron\nioakim\njerrett\nnyne\nkhaldei\nrohling\nseagrim\nfanfreluche\nvanno\nbaptistina\nsumbe\nromanticise\nshehi\nivaw\nsspa\nliwu\nzappelli\ndiffferent\ntysse\ndyckhoff\nkolpakova\nclevelands\nrenuart\nretied\nterminix\ndippolito\ndabdoub\ncajasur\ndegennaro\nblaeberry\nboohoo\nokny\nhcz\nhayehudi\necmt\nseleccion\npelago\ncaesarstone\ndijeron\nsartorelli\njerell\nfmca\nétouffée\ncypermethrin\nsergas\nodiah\nbloggie\nseans\ngoldminers\negocentricity\nmcanany\ncoziness\nmuhammads\nemprunt\nmimimum\njeyasingh\ntulchin\nbuckypaper\nlurasidone\nshenderovich\npril\npreapproved\nspitefulness\nhawkmoths\nepiq\nshouln\nathawale\nnellen\nwipa\ngroundcovers\ndeonar\nshansky\ncontroverial\nmwita\ncoleson\nsbarbati\nmarlantes\nsalesroom\nsadagopan\nwallyford\nspookily\ncourgettes\nasphalting\nfoldberg\ndonellan\nveterinaria\nmoscatelli\nbodysgallen\nfortt\nruith\nmacq
uitty\nfishermens\nshadley\nackwards\nfarrish\nschioppa\nmckaie\ncarnegies\nalthamer\nwarndon\nsadique\ncervero\nhungwe\nmovens\ntorosian\nscheuch\nburkino\ncapicchioni\nrelaxer\ndrenning\nshavon\nbraan\nhollobone\ndesutter\nmaipu\nstmt\nlashoff\ndilapidations\nopenx\nplaystations\ntmsi\nhother\nphotosynthesise\ndladla\nflummery\ncibil\nboorishness\nlebovic\nrhiwlas\nbirkedal\neiseman\nrafed\nmiskiw\nbugie\ngumdrops\nhahns\nnissley\nmorduch\nuncoil\nbaltijos\nnoninvasively\nschole\nandler\neidi\npadrick\ndhanani\nabduwali\ndeparment\nslumlords\nmavrodi\nmendivil\nconstâncio\ndarmawan\nfaloon\nsiloviki\npamarot\niskandariya\npranoto\narrowed\npricetag\ninveigled\nfrontality\nnajafabadi\nmegastars\nrelevé\ncbes\nsyahrir\nsubplate\nmochary\nmicrosomia\nsinnerman\npladda\nhomotopia\nmirali\nusuall\ndizzia\nterramar\nqtrax\nbéchir\ndimitria\nhartner\ndreamspace\nbiosurveillance\ndemonaco\nnovellus\nportz\nostersund\nskimping\nbiopharm\nbuttala\ntrabeculectomy\nijaza\nspraypainted\nwessner\nrubinfeld\ngraddick\ncinv\nkvvu\njianqiang\nkozakai\nluos\nhaeg\nnumskull\nrabhas\nmultistrada\nspergel\norosei\nkrauter\ngulfshore\nromanticises\nskullcaps\nleaser\naricent\nclothespins\ntlili\nmoscona\nloudmouths\nglamorization\ncoordinations\nincognegro\nhaddara\nglycyrrhizin\nreconciler\nedur\nfekkai\nhonnappa\nconcomitants\nzarinsky\ncaçapa\nsuccessfuly\nmcphearson\nnatos\nneyo\npoppyland\nqueler\nnlgn\navidyne\nwavefield\nsheathe\nmoominmamma\nvuma\nrorc\ntechnophobe\nlichtenthal\n˜\nwortzel\ngialos\nthimister\nselter\nnonini\ntitrate\naggy\npycraft\nthoday\nsloppier\nselvarasa\nmagiera\nfadhila\nteriflunomide\nmwah\nbawadi\ntantalize\nsonrise\nvanagon\nsieversii\ndepakote\npentecostalist\nicesheet\ncatched\nkarwa\nstaxton\nschara\nmanster\narinsal\nseemi\napartado\nbelyayeva\noutbred\nstambridge\nneded\npantoum\nlavolpe\nseremaia\ngulbransen\nbermudagrass\nlocalists\nsavastano\nzdrojewski\nrolwaling\npakuni\nbautzer\nsheiner\nenergen\ndionte\nbridon\nashong\ngoncalo\niaee\ngarat
e\ndscr\nstemgent\nnawshirwan\nfadem\nbillowy\ncrystl\nlandwarnet\nlambastes\nraffaelea\ntheberge\nwordplays\nseshego\nnush\nmccalister\nkangle\nmicrolitre\ndarrion\ngleen\nduffen\ndayroom\noverbuilding\nlitster\ntapola\nrosicky\ncretinous\ncompartmentalisation\nfizi\nticketsnow\nrocuronium\ngashing\nostracise\nrutherfoord\nchalai\nuzis\njanecek\nwmh\naboya\nlinguiça\nkillyclogher\nshirks\npastorello\njurisdictionally\ntiffoney\ncourteen\nyelwa\nsutley\nsoupçon\nmatk\npoju\nfelinfach\nbrocheré\neizenkot\nnannygate\ntonmawr\ndargins\nsubmissives\nredleaf\ndisgustedly\ngreff\nmixups\ngisla\nstampin\nelectroencephalograph\naluvihare\nzakarya\npaunchy\nrpks\nannelle\nbabywearing\nlangrée\nkovalic\nbarracking\nratey\nvollers\najvar\nnfer\nfeminised\nogas\ndhoo\nnewbill\nhuntcliff\nsmartypants\nlanglo\nchaouchi\nkrasair\ngnant\nmonimail\nnauseatingly\nkolde\nweixi\nnumeroff\nkneeboarding\nspanakopita\nklyuchevskaya\novergeneralizing\nastrologists\nmsrs\nskylstad\nremmington\ngound\nscaramucci\nashikodi\nshirdel\nlevisham\nnabaztag\nmufg\nkuchis\nburres\ntakahide\ndecompressive\ntrella\ntemime\nanalyis\nziplines\nezzati\nreframes\npummelling\nmckail\nmalinka\nchado\nstreamable\nassumable\nneidermeyer\nmirandinha\niald\ntabío\nruli\nquitno\nconnett\nhuntik\nzoutman\nbreemen\nhossan\nupcycled\ntunjur\nkerrys\njimmys\ngenerationally\nakiyda\neresearch\nroudham\nsnw\ngrubisic\nademe\npolicie\nnuthurst\nkuhnen\nepizootics\nenthuses\nroesel\nwalbourne\nbiasone\nbangara\nhapsburgs\nchristner\npramada\nvoracek\nnbic\nfeinerman\ncolarusso\nkimona\ntophi\nabakumova\nundershaw\ngiezendanner\nlivshits\nkhaleq\nminimed\nwordsworths\npalandri\nndele\ndaughton\nheizo\naristizabal\nmohammedi\nejigu\natim\nyankowitz\ndelai\nputtering\ntradeking\ndisrepectful\nandreotta\ngoodmorning\nbosumtwi\ngiberti\nbilbrey\nfralin\ntattingstone\nidlis\neithun\nbolsas\ndweik\nfinanz\ncengkareng\nsivil\nfreeters\nsupras\nparlon\nbronconnier\nbardelli\nbochatay\nunbothered\npatzelt\ngrätzel\nvanguardist\nb
agage\nneurotology\nllangernyw\ntuohey\ndellin\noverladen\nlilywhite\nsaveth\nshirihai\nadco\novonic\nmoralize\nmikvabia\nelixer\nkenric\nschiebinger\ndrumond\nbillière\nswiftian\nparmeshwar\nhaering\nbraich\nmenello\noverlawyered\nnautically\narbory\nneyts\nhepting\nzagoria\nacing\ncoudn\nthermoformed\ndenter\nkrisflyer\novervalue\nbaharom\nhaircutting\nenviornment\nmitc\narrelious\npomonkey\ncaballeria\ncanziani\ngrisdale\nliechti\nbazayev\ntinpot\ncihai\nclambers\nwaterlines\ntelesto\ntverdovsky\nverjuice\nnored\neusden\nsoumela\nropert\nnarcotrafficking\nkismayu\ntwyning\npoussaint\nmatsumi\ngigajoules\nantepartum\noksibil\nplaister\nnumerate\nspaghettios\ngreenstick\nshadeland\ncitrullinated\noakdene\neneida\nrobiola\nirreconcilables\nvanj\njiadong\nsarcasms\nbegue\ndoncieux\nsanderlings\nriminton\nbobbers\nironsmith\ngreulich\nzalis\nsubornation\nbackpacked\nrebensburg\nbauerlein\nrejas\nrahimullah\nspalko\ndogmatists\ndeblasio\nproctologist\nlaggards\ntruckle\naltimari\nrealarcade\nrespose\nkaarin\nishikura\nsherlocks\ndevening\nguayaki\nvergangenheitsbewältigung\nghostnet\nwamphray\ntocilizumab\nrafizadeh\npalantine\nconchi\nshamari\nglenboig\nkurant\nserpas\npeyrelongue\nprestidge\nhirvensalo\nwaliszewski\nbegram\ncomtrade\nwelshampton\nsmurl\npruna\noakar\ndankers\ncapricorns\nwebasto\nmoolenaar\ngreenskeeper\nguruli\nspellcheckers\nquarriers\ndakoda\naphalara\nbrimin\nshwekey\nammour\ndtcs\ncolgrove\nsmalltime\nfinlaystone\nbannisters\nmuscatel\nrabago\nmonocles\nguyett\nlapdogs\nkrzykowski\ntalkativeness\naircraftsman\nolimpick\nramlan\nberdyev\nfennig\nrfmd\nnorry\nactéon\nnaseerabad\ncherlin\npatriquin\ncangnan\nxiangjiang\ngatete\nmoroun\nnceas\nmerlis\nveedersburg\nsportcity\ncheswardine\ndemystifies\nsmurfing\nmcdarrah\nmunstermen\nchekhovian\ntutera\nmcci\nconcertation\nitfs\nroutinized\nmandle\nraghbir\nbreier\nbentler\ngenval\nbizimana\nromoli\nstabby\ncamusso\ntchomogo\nprematch\nyaghoub\namny\nwoebegone\nbyrnison\nathenahealth\nsidlin\nautenrie
th\nphenonemon\nredound\nindividualities\nmabin\nshousha\ndochia\nflambéed\nbaramidze\nyusmeiro\nvirtualizes\nbotel\nbunei\nanoto\ncompaore\nholtham\nshipe\nconnoly\ngilbart\nramit\nkeams\nyesnaby\npietikäinen\nmaxjet\nhexion\nstanners\nfinzel\nlimonade\nmochdre\nconcatenations\nmonsummano\nmarcellis\ntomandandy\nbriosco\nborror\nnautic\nmcgladdery\nmancall\nvladivostock\nnessel\nsantomero\nkinoma\nbahadir\nunsterilized\nswaption\ntoyko\nabdulbaki\nramé\ncheapskates\ngaytan\nwebo\ncrisan\nmiraya\nmirsaidov\njazzmobile\nunsharpened\neternities\nmarketeering\nblackjazz\nazare\nlanpher\ndelea\ntrajanov\nsaarbrucken\nenwonwu\nmerriott\nderker\nrosener\ndesyatnikov\nmeraki\nborchin\ncalmus\nunet\nepynt\ndefen\nluocheng\ndannenfelser\nsanitaryware\nsaithe\nsoonchunhyang\nbackpay\nmarzol\nparalell\nstanz\netech\nrja\nfreile\ntricksy\nmakhanya\nengima\nmanadon\nxhaferi\ndexterously\ntapenade\nmbusa\npommerening\nsalaheddin\nwarcup\nlongfor\nebenstein\nhongda\ncapnography\nwenwen\nunparallel\nsodini\noutbids\nkingway\nwrvr\nunisdr\npiratebay\nkerchner\nnewbrook\nlarrivey\nbookaboo\nnonviolently\nvendange\noctocorals\nbrunicardi\ntury\nlamictal\nluketic\nmcnairn\nrostker\nkoffmann\nirineo\nguleria\npakleni\nfalin\npicure\noverplays\nmbacke\nrikhi\nbobbles\nmaitha\nclervoy\nfareeda\nleverington\nbkx\nrenneberg\nfujiang\ntulong\nroedel\nnatca\npapadum\nbarkero\naaish\npflugrad\nsummerhaven\ntelerama\ndulieu\nmehru\ninterfer\nnewspapering\ncretney\numetsu\nlosts\nrauhala\nmertins\nsuhaili\nmoscrop\npapariga\nyogan\ncouty\nmarex\ngreencore\nphentolamine\niceboxes\nlinning\ndraskovic\nlalani\nbeging\nspicoli\ncobalts\ngallowhill\nchancelleries\npents\ncotechino\ndhanabalan\ndelicioso\nalliterate\napiarist\nlorscheider\nvandeveld\nharstine\nfastidiousness\nofferred\nmassagers\nfibernet\nyamdena\ncpga\nattracta\nogio\nwunderkinder\nbelview\nwonewoc\ncatina\ndementiev\nkarlinsky\nlarcenies\narpana\ncorthron\ntintomara\nraissi\nstld\npougnet\nbabineau\njuelich\nsocma\nheyneke\nintelli
\nmetagenome\nabercwmboi\nofficialese\nmayella\naudiocassettes\nwaplington\nhindhaugh\nlabit\naarnes\nadvisees\nsaxagliptin\nsihem\nkikis\ncoconspirators\ndemises\nlidow\norsoni\ndjurberg\nbertron\nniedenfuer\nrhinology\nuniworld\nsdny\ntideline\nrobling\ngallou\noffencive\nboluses\npoupou\nmindreader\nalwayz\nnocsae\nunrepentent\nziese\ngrandpappy\nborràs\nchordoma\nshabery\nbancoult\nslushie\nmonolithically\nsuell\nashiana\ndigiday\nmicarelli\nbanters\nsudafed\nmerseysiders\nramaley\nquantique\nmeteorologically\nenglis\nhindia\npulos\nwhar\nischinger\nmundanity\nquesillo\neglwyswrw\nfedexcup\naxline\ndrabness\ntekonsha\nguarrera\nkarasick\nstenotrophomonas\nhyperconnected\nreventón\ngurrieri\ncosmetologists\npotterspury\nlarod\nselikoff\njianhui\noreskovich\ncrimini\nofter\nglumly\ndarcys\nshaowei\nateya\nessandoh\nbramos\ngoche\njlh\nsegretti\nfarmoor\nilonen\nfraternisation\ngareloch\nabscessed\nbaraa\nbikkembergs\nskrenta\nkandt\nlongforgan\naccanto\npalaeoanthropology\nslavsky\nsikov\ngarding\naltheide\nsebelia\nbaibakov\nrusskies\nstolarczyk\nragdolls\ngudvangen\nsicsa\ncasbaa\nnisin\nfuencaliente\nsunbed\nsabuleti\nbearshare\ntalei\ntamango\nolbrycht\nscarola\nhighmount\nkarbalai\njasdaq\nzorkin\nestruch\nadmas\nmckamey\nrappolt\npintat\nflatpack\nacused\nsyntec\nbrightons\nknowler\nolowalu\nupcharge\nguessers\npagnini\nariat\npulag\nspile\nllysfaen\nracingtheplanet\njesch\nmondell\nharmans\nnortman\nmdax\nbladderworts\nzeitlyn\nirking\nnixing\nhorsebridge\ntamor\nstovold\nbrimob\ncharalambidis\nalmyra\nferrington\nheartier\nyacuiba\nremonstrations\neepco\nairvana\nforceably\nsepetiba\njugaad\nriveria\niyiola\nsimantov\njudis\nschaible\nbrauman\nchromotherapy\nuprg\ndelcroix\nandrosov\ndrumlines\ndeckle\nplake\nniuas\ngeitner\nlebedinsky\ngilauri\nfurture\nrayyithunge\narcega\nlabutta\nrestor\nnyambi\nphobaeticus\nmakaton\ncalbee\naetf\nkarsner\nappalls\ncrypticus\nwoodforest\nthelondonpaper\nhunain\nhaywoode\nbrotheridge\nruskell\nskypoint\nvolquez\nazada\n
limn\nmccormicks\nangkasawan\ncatfood\nerani\neckerson\nrentrée\nmortin\npasquariello\ncowplain\nnartey\nsalved\nderron\nunderlayer\nosadebe\nparleyed\nurrego\nogbo\nwasden\nquindici\nreeker\ncgcs\ninkjets\ncharith\nfonnereau\nolalekan\nparaders\nparodical\nflear\nbayart\nundergirded\nshieldhall\nkelenna\nahhhhh\nbutre\nyovich\nlousie\ncopnor\nduker\nstotijn\nbielat\npredicable\nanez\nkamaljit\nkillhope\ngussied\npotocnik\nappelqvist\ncrashlands\npaveley\nbelvieu\ndamons\nmoulinex\nbakalli\nganeri\noverreaches\nbrolan\nrefere\nintercommunity\nbaphuon\nmedicalized\nyosvani\nfloodways\nmousseline\nkabaivanska\ntarzian\nemison\nskiffington\nbeneatha\nciliberti\ncreds\nwhetting\nbeancurd\ninoc\nwhelming\nvalad\nburmaster\nauthentics\nsweetzer\nunicredito\nghedina\nerzen\ndeema\nmobinil\nishag\nscenester\ntransshipping\nhoussine\ntalibanization\ncafarella\ndecare\ngejdenson\ncrickmer\nhazley\nalbawaba\nkayna\nrhug\nbencze\norice\nunclutter\nseismogenic\nschwertner\nchikomba\nnakazono\nswaleh\nlungis\nmontelepre\nallurements\nroic\nculinarily\ngivanildo\nspencertown\nvergoossen\nmence\npapadopulo\ngabetta\nniewood\nballyedmond\nquante\nmurgu\nsidespin\nhollifield\ngones\nwoodfords\nyugos\nabuelas\nhostin\newalt\nlavisse\ntido\nlesnick\nbenjamen\nbrewerytown\nportending\nanonyma\nfalgout\nmyhrer\nrhj\nprokofieff\nshujah\nbackdate\njeromes\nforestar\ncuénod\nwoundings\nwarrer\nlonghill\ndehumanisation\norangeade\nlaffs\nccra\noldbridge\nleffert\nnetco\ngaudioso\nhatib\ndebonaire\nficcadenti\nrenaissances\ndowlin\npittencrieff\nantunez\nendalaust\ngloominess\ndéja\nsydykov\nattra\nmssd\ngienger\nkoumakoye\nsuzon\nvadum\nradnich\ninterprofessionnel\nparles\nflorales\nsaratogian\nkishishev\neasyrider\neastpak\nwyotech\ndunalastair\nklempner\nfahlgren\nbeezie\nmacel\npacbio\ntoeava\nluliang\nexplainers\nglomb\nflashgun\nquantas\neverloving\nflorentyna\nflexfuel\ngutstadt\nponos\nnordfeldt\nbonannos\nbochenek\nalligin\nbiziou\ncanwick\nmorrissy\nchipaya\ncrafford\nspringford\ngr
aveling\nkanokogi\nrakewell\nworkaholism\nchauffeuring\nnirmalya\nnutricia\nestoque\nminneota\nastronome\nharkinson\nvnexpress\nffynone\nlardons\nclowers\nsavenaca\nrefinable\nkftc\nstarquest\nwinnefeld\nwohlfart\nhemopure\ndunskey\nahrendts\nwojnowski\ngoosens\nkalid\nendlessness\nlingmell\nwoiwode\nbotryococcus\ncreadon\nkirste\nhopstop\nmunadi\nimpolitely\ncortachy\nnatkin\nlivening\nprivilège\nsnpc\nnumerious\nmamane\nmathlete\nmanap\ntzorvas\norams\nsitings\nijaws\nputhod\nkalivoda\nleverburgh\nwalkmans\nspiewak\nsazo\nwbcc\nsascoc\nwadel\nnivard\nhaldenby\npoliclinico\nochowicz\ncodding\nbiggish\nsarosh\nsantulli\nguanidinoacetate\nanglepoise\nfelise\nvivisimo\nerlegh\nbenshoof\nvnesheconombank\ngulya\nskinsuit\novadiah\nsandzak\ngalanes\nbelched\ndccd\nsconosciuta\nscenerio\ndeviltry\nskofterud\nkapha\nsokhan\ntheonas\nholstrom\nvilia\nsquelchy\ndanchev\nsemiskilled\nvenita\nmikadze\nbiranchi\nskarupa\nthid\nbitchiness\nscrawls\nmuirend\nmyhren\nfootmarks\nfilenet\npotties\nmotivepower\nlambics\nbidlo\nstepfanie\ntoddlerhood\nchegem\nxobni\npreponderantly\nfrodeno\nganes\nbrynmill\nsterilizers\nhapen\ngallogly\nkhalidiyah\nlafi\nslitted\ncolacurcio\nshearling\nparnu\ndaskalaki\noshri\nunguicularis\ngerardia\nbaqeri\nperfectionistic\noxera\nittre\ncragnotti\nbureaucratese\nsebesta\npalisadoes\nopeka\nfalsgrave\npietras\ntreem\nlawnchair\nnayani\nphysiatrists\ntamkin\nsalcey\nunpackaged\ntelmisartan\nbartak\ngreenwheel\nhilbertz\nallrecipes\nfesterling\njiming\ntriade\njumpshot\nkgd\ncornman\naboubakr\nlayhill\ndeployers\nmalnar\nzangana\nsindal\nanifah\nnimick\ngreenbrook\nsylke\nscheherezade\ndevier\nryals\nbizbash\nconstruccion\ncoelodonta\nwidmaier\ngaggero\nunife\ninuendo\nnonobjective\nrouleaux\nmaow\ncadenced\npohlig\nsakano\ncelente\nmalaitans\nlancker\nladds\ndonees\nkordestani\npwap\naristy\nelsje\nrépons\nhonsha\nboccieri\nwhatsername\ncoathangers\nwadeye\noutpour\ndonnette\nshamardal\nsenu\ngörg\nbarja\ntatel\nsocco\nhogger\namarchand\nwendlebury\no
fisa\nzinin\ncisternas\naiyana\nnoetzel\nnagami\nkirkwhelpington\npadmapani\nasug\nstuey\nslegers\nmerron\nkritz\nmisbranding\nseabridge\nbishai\ncheesehead\nxec\nwigig\ndayenu\nplacemats\nakello\noctogenarians\nrubial\ntuleyev\nleslye\nhrusa\noleanders\nunshod\nndjamena\nkucerova\nlanzer\nuncaptured\nunderplays\nhadippa\nhubacek\ncinematheques\nturnbridge\ncatmint\nbnsc\njinming\nbryceson\nlapindo\nsupramonte\nmrff\nmotrin\naamd\ndeferentially\nkomadina\nbevere\nhefferon\nbalmore\njongro\nlapido\nunicor\nmassarella\ncrundwell\nkeatts\nshlemiel\nscanio\ngrimsey\nunimaginatively\nnfz\nhatchlands\nmndaa\nwaylett\nsheffields\nmccluggage\nmancia\nmisdirections\nsolbes\ndesgranges\nbulelani\nbuttu\nlincolnway\npartech\nminaev\nbasaraba\nrecirculates\ngenauer\nlarraine\nhoustons\nschlyter\nsandora\nnorichika\nnetvibes\nstansky\nfleckeri\nwapshott\nfarolito\nsamovars\nbrunious\nmongie\nlacovara\npresidence\nparadisiacal\nbogdanos\niervolino\nyessir\ngridding\ntankini\nriems\nmittendorf\nbockius\nagiza\nirrelivant\nautostadt\nwelthungerhilfe\nbaardson\nkhiid\nsuperbeings\nappollo\nortuzar\ninterphone\ntardio\nbatham\nkapadokya\ngurfinkel\nwillises\nsnowsuit\nsnakeoil\nhousecall\nvanpools\nbsis\nflassbeck\nunderbanked\neveryplace\nsagoe\nebble\nvadinho\naeos\nblurbed\nreexamines\naddas\nboerrigter\nlydersen\nkorunas\ntsuper\nsubdwarfs\nashlawn\nheagney\ncasel\ncampbellii\njumani\nbandos\nprocedings\nmorskoi\ningénu\nscit\nbalistreri\nlusky\nchapli\n,what\nmonthy\ndoci\ncarrino\nenergex\nedmark\nbulte\nabraço\ntowerhouse\ngodshall\nhargittai\nholusha\nchasten\nmexted\npenarol\nunattained\nfrancome\ncooliris\nliying\ncondem\nsandquist\niley\ncoxyde\nullock\nadoo\nchangiz\nchapelcross\nshiyam\ntokely\nwalkergate\nimler\nlavelanet\nvaller\nholper\ndebouching\nchothia\npetritsch\nmonogastric\nnoorzai\nsunzhensky\ntanusree\nmeridia\nglenmere\nschoettle\noeltjen\nsharktooth\ncdfis\nbenetech\nobstinance\ncotty\nartbeat\nkregg\nchivhu\ntrembly\nrozita\ndedge\nsharak\nllais\naerotec\n
cauliflowers\npatronization\nhammamy\nappeard\nteig\nomov\nkozun\ntransload\niolaire\nmatricula\nteekay\nconsummately\nshemyakina\nmanteiga\npitte\nunwitnessed\nbulgakova\nbloomgarden\ndeursen\ngoodnights\nadass\novervotes\nbecherer\nandriyan\nendeca\ndeaccessioned\ndeshun\nwinterbrook\nfnma\nmarchesano\nawwww\ncontagiousness\ngillmoss\nachao\nacidly\npuppyhood\nderiso\nokinotorishima\nfeierstein\nflexicurity\ntrelewis\nchoksi\nnorrey\nhoneycrisp\nwomenpriests\ntelerik\nrasky\nimmunomodulators\nmendilibar\nshareeka\ndollarhide\nhemichromis\nsharemarket\nunlikey\npalins\ntunheim\ndikgang\npoteri\nconsulation\nenchained\nidns\ntransparant\nnoodly\norsières\ndemoulin\nheroismo\nbiomodels\ncovi\nsghc\nlenhoff\nlicea\nibera\nremorsefully\ndalkia\nlogboat\nindividualisation\nhewage\npleb\nllanrhystud\nreemployed\nsatit\ndissatisfactions\nfiorani\nflounce\nfaldbakken\nhyperspeed\nmansaf\ncnty\nsilano\nyasini\nawaroa\nhergert\nromitelli\nvujacic\nbettinson\nnimbuzz\nquirkily\ndikgacoi\nmergia\noverdependence\nddw\nasciak\nbunged\ndoveman\ntamagotchis\nboska\nfxi\nsoung\nhcbs\nheronswood\neiffage\nsigalet\nmercaderes\nstigall\norlinsky\nsegedunum\nflna\ncyndee\nclobbers\nsumani\nbouhours\nlardinois\nficano\nrouhana\npubococcygeus\noportunity\nboulianne\nfotomat\nmorrells\nbinjamin\njanene\nbessac\nouazzani\nsexists\ntemuri\ningenues\nlobola\ncisatracurium\nbeddingham\nyelton\nstrathyre\ncomeliness\nweinblatt\nbjugstad\nkusters\nleches\nstavropoleos\ndunlavey\nmetastock\nlauitiiti\noveractivation\nclaeson\nherati\nskysails\nfakey\nmegaregions\nalil\nmclogan\nwoodrell\ntalega\nvergelegen\ntomdispatch\nddgs\ncimes\nkampfner\ntrikke\nhemann\ncheparinov\nwised\nshors\nbraunsteiner\npodobnik\nsuzue\nmerissa\ntorff\nsswc\natsuto\ndissolvable\nrunnion\nstrohmaier\nniri\nashouri\nmctier\nramras\nzreik\nihuatzio\nkhaldoon\nsaffari\nmamad\nteleamazonas\nbener\ntulsky\nfracs\naguanga\nmoneywise\nrommer\ndtmp\nmengozzi\nazzuri\namarilis\nkafala\ntakiveikata\nchantale\noscillococcinum\nzu
rawski\nillegaly\nthompstone\nboilersuit\nchoiniere\nsubscribership\nmathiot\nvgz\ntajudeen\nliangping\nschuble\nlowenfels\nslews\nyearend\ngreenmail\nvarez\nmatjaz\nzuazua\ntaxista\nboink\ntecún\nlundine\nsasnal\nquadracci\ndhada\nyeren\nregisterable\nwilmsen\nseibersdorf\ntenative\nantartic\ngobstoppers\nruebush\nsafian\nmauricette\nllanon\nogunlesi\nsnuggled\nishmeet\nklinkenborg\ntelegrafo\nabidance\naragoneses\nwinda\nadji\novos\npedalers\noscr\nabduljalil\nmisfortunate\nkerching\nrollkur\nrodne\nzaraah\nngrc\nsloes\nepeat\nabdullayeva\nbordowitz\naspart\ndellacamera\nbartholet\ncreton\nhorrall\nbeaumes\nisacson\nweisheng\nribner\nkaibiles\nespically\nsmyths\nodent\nparfaits\njarzembowski\npaddlewheelers\nfatik\nkhangura\neashing\nboome\nilisa\nboyadjian\ndacal\nonramps\nhaiqing\nspanswick\nswidnik\nrhoncus\nguibal\nplummy\nelstone\nhyflux\npanchina\nswinstead\nskanes\novejuna\nhentsch\nohoven\nspeedtest\ndisolved\ncataloguers\nwfirst\nphotgraph\nmcninch\nricheze\nronzoni\norender\ntopfree\nhandysize\nwohlberg\nmanheimer\nfryars\nwigeons\ntalinn\nhovan\ntrefeca\nbloxx\nazc\nmorhaime\nomah\nboastfully\niseq\nmentholated\ndhiab\nridgmont\nbritan\namys\nsuchus\nkrucial\nelys\nmockeries\nsaneamento\ngraywater\nbolzaneto\naztreonam\nmisar\ndemocràtica\ntomotherapy\nquiara\nscarless\namendolia\nplayspace\npynes\npachacutec\nguvnor\naronstein\nacftu\nhoussein\ninexpertly\ngoood\nsunbursts\nmanba\nsemiautobiographical\nperci\nandresito\ninoa\nducange\ncirclets\nbizspark\njillion\nsturdiest\nconstan\nwillberg\nbraker\nchalom\namedure\nnonconscious\nsupernet\nvitreoretinopathy\nhensby\nsellindge\nminipops\ngeronte\nstraussian\nbuttermarket\ncongenially\nwhir\nosawe\nyerself\nstrathie\nmalevolently\nbackmarkers\nmeulman\nchalak\nkenninghall\nlocandiera\nkatlehong\nsanctimony\nsimunek\ncalasso\nrepletion\nileen\nplauche\nsideward\nmuwenda\ntokayer\nbereng\nyenga\ngarelick\nantioqueña\ndhoinine\nfich\nvezzosi\nactionist\nentwines\nkandie\nblowtorches\ndewain\nchavdarov\nall
ocable\nmeniere\nrashean\nunhinge\ncedefop\nnekkid\nyutz\nharmfull\nncip\ncervid\nescargots\nlubrizol\ndawdling\nnewports\nobeso\nkatherines\nschwaner\nbevois\nnailbiter\nkatherin\nroseae\nvaljavec\nwease\navient\nvolleyer\nbogdanski\nnachmanoff\nholmsley\nlearing\nshurland\nroepstorff\nfanga\ngwybodaeth\nsomethng\nhoogendyk\nticketless\npinecones\nmatison\nsantro\npiram\nnasara\nsuperinjunctions\ngourdes\ntanera\nivlp\nxaui\nlunzer\nfeese\nenterasys\nundeb\ncleofe\nmidblock\nzeituni\npasquill\ncantilevering\ntooo\nasni\ntennants\nshaoping\nbadware\nimpremedia\ndowlatshahi\nbijani\narcc\ndunscombe\ngenmab\nmarkovitch\ntormore\nonging\nskirda\ncamdessus\nhrvatin\nlaufman\nsaidah\nkhadaffy\naideed\nalbinson\nbidognetti\nramify\nxiapu\nvonzell\nsuperhet\ntuomisto\nmcmanaway\nniznik\nkobia\nukcat\nhandspun\nbleats\nmeddlers\nhalavais\nbamc\nislamey\nusarpac\nhuiqi\nribaudo\nmarant\nmandrax\nwirefly\ntayman\nkanja\nbloodstreams\nyelin\npuces\ncubers\nautoalliance\ndeifying\nsergant\ncorpsing\nzumino\nkrumholz\nmarsis\nshirqat\nhighwinds\nlabourites\nhüttig\nsrirasmi\nmayakoba\nfranklincovey\nincompetant\nlabry\nmaltreat\ntalerico\nstanols\ncaftans\nautostart\nmolaskey\naquasco\ntyaughton\nartron\nmyaung\nstreeting\nzaiser\ndenmon\nfahrenkopf\nmakhteshim\nguastella\noffed\npromedica\nabbotswood\nlidsky\nkickstarts\nsanakoev\nfalkus\namberleigh\nnonworking\nmbarek\nkitenge\nseverer\nsaimdang\nkaroshi\ntickbox\ndronedarone\nkasl\nspeediness\ndecaires\nsouthcenter\nervs\nsadock\nsquatty\nskidmarks\nsnarr\nspak\nchristion\nscervino\ntimbits\nchuku\nscarers\nphenylbutyrate\nsilverburst\nexilim\ntarnopolsky\nnonsignificant\nhmsa\nsummerteeth\ntrinchese\nbojorquez\nkiffen\niocg\ntsikata\nwhne\nabridges\nkapchagay\nrossnowlagh\nbauler\ngrabovsky\nqwe\nscatty\ncmz\nwavecrest\nstatures\nirvingia\nfuar\nartim\npolyheme\nsheikdom\naboriginies\nmurieston\nbabycham\nambinder\ninsensibly\nreprofiled\nlanzinger\nalmansor\nbarstools\nbhusal\nspiby\nslader\nbaxterley\njendayi\nvoronova\nmo
lzahn\nshowbuzz\ngrasscutter\noyan\npealed\nbolot\nsierens\nagelessness\ntamest\nprochaine\nrandolphs\nlisianthus\njonai\nafficionados\nwaterer\nmerali\nbluffers\ncjis\ngregariousness\nberding\nsowter\nshazli\nvantongerloo\nsveaas\nbitsa\nmelot\nsloshes\nrubleva\nporu\nmaltesers\nrhoscolyn\nellies\nfossdyke\ngvh\naqmd\nbroadstock\nlyerly\nshipham\navantage\nstasny\novercrowd\nkharge\ndecieve\npalavela\nmisiewicz\nleapman\nbashardost\nchinkin\ndormido\nsubsidisation\nmicroconsole\nstreissguth\ningreso\nvdim\nkleinplatz\ndoogue\ncaricatural\naqsiq\naftergood\ngardee\nidiakez\nmalee\nkashua\nzonisamide\nmarkiza\nsteepleton\nalney\nschmich\nwtas\nsumiton\ntafara\nmarettimo\nphilosopy\nlitfin\nirongate\nbovisand\nbuizingen\neulberg\nfornet\nsnic\nresettable\nmarrion\nberriz\njingyan\npegase\nsuffient\nstieger\ndeafen\nmcnabney\noverdraw\npenycae\ncku\nparonto\ntregarth\nsiveter\nflexibles\nimporve\noverrepresent\nvnsny\nnioplias\ngeneco\nparoling\nisouljaboytellem\nalair\nprivilage\ngolebiowski\ncollesano\npsls\ngordita\ndongliang\ntensilica\ndagne\naecb\nlightmoor\nviscusi\nwapama\ncoffeemakers\njianyu\nberthelier\nmugavero\nkiwanja\nwallentin\nkice\numds\nballycran\nhidefumi\nbednarczyk\nmahdy\nalqueva\nbellovin\nruhemann\nyammering\nspaul\nmonteria\nobanda\ntonsilitis\nsebha\ncolglazier\nribordy\nminkovski\nsamode\nboites\ncamy\ndanzante\nthobela\northe\nminker\nxiaosong\nhashr\nyesss\ndacher\nmoormann\nsangamo\ndegand\ndecluttering\nschore\nmegadoses\nbiscot\nnerdish\nlonghorned\ncanefields\nsenichi\nminxin\nizurieta\nsantoku\nsullenly\nmauldeth\nkaney\natthe\nssentongo\nxaar\npeos\nberz\nfamilly\nmobilizer\nhollas\ndonskis\nrollergirl\naperol\nguangsheng\nmitterand\nrabanes\ntolcher\ntrebelhorn\nhgn\nmicroenterprises\nruswarp\ncogitation\noreodont\nviant\nthearc\ntilera\nnuoc\nassurer\ngwm\nheppe\npakledinaz\npelvises\ntorquhil\npulizzi\nlevade\nbastians\nmoscardini\nanouchka\nlongano\nevergreening\narchitected\nkleinhenz\nirmis\nchilman\nsocalgas\ncolford\nbendavid
\npeabodys\nmutabar\ncordrey\nduvaliers\nbuiding\nchainrai\ndickert\nbomere\nblmis\nmercantilists\nserk\ncourtliness\nglufosinate\ncastrillon\ncatya\nhillblom\noutten\ntipoki\npriciest\nplanetree\nrecompute\nschillace\nautodefensas\nstorrier\ntwitcher\nstakeknife\ncheekiness\nbargemusic\ncpri\ncemetaries\ncarsia\nbembeya\ngreisinger\ndouzaine\nbromden\nmarites\nbritdoc\nrahhal\nnakamachi\nabbadon\nifrss\ndictor\nbrechtel\ngiammetti\nsrz\nkoito\ncasebolt\npeipah\nmaked\nbarbury\ncambone\nlendingtree\noverskirt\njasperreports\nuninet\nbekki\nabilty\ntredecim\nsupertarget\nbuttars\ncluelessly\njinguang\nlinick\ndetainers\narobieke\nunreadability\nbonjean\npigsties\nrathole\nsyddanmark\ndemeanors\nscarifying\nfastcase\nlunesta\nterzan\nboov\nobvously\ncherohala\narrangment\nkvarme\nlacamoire\neabl\ntrepak\ndebarment\nveze\nshamkhani\nschuermann\nböge\ndeetman\nstous\nmunches\nholeshot\nsarniensis\nwetterhahn\nsalseros\nfountainheads\nfloatopia\nadarius\nmarbo\nrindfleisch\nschiebel\nactt\nintrasquad\nmicelotta\nsledgers\ncalcify\nmandaric\ndemagnetize\nirrecoverably\njagaciak\nspetzler\ncarpenteria\nassuras\nrocktober\ndynamed\ndemontagnac\nsodomised\naspal\nkatleman\nbeatbullying\nstalags\nespinola\nsubversiveness\ntronolone\nlekas\nheptinstall\ncomissioned\nehx\npulverization\nnter\nroadgoing\nzhitnik\ndraman\ncabarete\nringless\nchebrikov\nsowton\nmcartney\nsahebzada\nshabunda\nuncrushed\nserpette\npromod\ncncp\nbateen\nkuluk\nweichselbaum\ngangasagar\nwheelis\ncopout\nforceout\ndifferen\niacet\naeroscraft\nvray\nouaga\nramanuj\ngutty\nkillean\nhoneysuckles\nkendry\ntreml\nicfr\ncomtex\nwhitelegg\nvoskuil\nmargasak\nchoephel\nraquet\npredications\nopenbts\nsasae\nmoraghan\ndeciliter\nkauser\nheslin\nmccane\nzerhusen\nmomolu\nnonstick\nbateke\ncesid\nkumakura\nteulu\nvmap\nbattistello\nnmsc\ntapajos\nstruk\ntengas\ncommunites\nhoenlein\nkalandadze\nsovcomflot\nenery\nadventurist\nrfids\nmissis\ncouri\nroomate\njulika\nmartifer\nkaranas\nsardini\ncomptel\nhaipeng\nilko
\nsurber\ntelecity\ndraftsperson\nvigneux\nkorede\noverbooking\npricerunner\nherkenhoff\ngandelman\nheanet\ncavelli\nkamol\nbramnick\ntibetian\nskyjacker\nyotaro\nfinklea\nfundraises\ncrticism\nsongsmith\nspreti\nnubbins\nnevadans\ndemonetization\npharmacare\nypersele\nmceneny\nkuapa\nkrvavec\npreceeds\nquenchers\ntadulala\ndecaturville\ntaleo\nfolksmen\nmayeux\nenviously\nfurai\nnccf\nkislingbury\nroote\nfriona\nallahpundit\nshabaan\nverycd\nchammah\nwildig\nmccaulley\nirishcentral\ndinnet\nellers\nkelco\nadvfn\nmaced\ntonen\npparc\nalbertyn\nraikar\ndebulking\nspagnoletti\nwhisby\nmmic\nfortysomething\nkabr\nkrotoski\nalijah\npechman\nsajo\nlaborites\nfermes\nsicav\nwestcot\nbawled\nspätburgunder\nbacheta\nhandymax\nbirbraer\nyonfan\nsligachan\nvexes\nyardeni\ninequitably\ntranscendentally\ntailcoats\npeson\nvakoc\nshackler\namericanese\nnasiriya\nkoplin\nalendronate\nbottenfield\nfead\ntangena\ntulloh\nalevras\nmouettes\nlunacek\nfoulmouthed\nkorcula\nteyon\nfidlers\nfigleaf\ngodinton\nhoutz\npitsuwan\nserioso\nmammaries\nschwass\njapin\nvibratos\nvarous\nasoif\nchahed\nstockgrowers\nbirgham\nkhatchaturian\nsahibabad\nmolotch\ncromagnon\ngollner\nnolonger\nmacdevitt\nhölzle\nthadani\nneftali\ncaseys\nthreepeat\nsolemani\nelisra\nevano\nnjonjo\narchaelogy\ndietlinde\ncirella\ngilkicker\nnyth\nbaldersby\ngianniotis\nyonaguska\nreemphasized\nserpotta\ndemarinis\nlockeford\njoraanstad\ndanyelle\ncasanave\nsleepout\ncremates\naadil\nvongsouthi\npretre\ntihinen\nhaixia\nvetco\nnerger\nmistranslating\nozinga\ndruyun\nkalume\ncompartmentalizing\nsanmiguel\nstengade\nterbutaline\nseider\ninfraestructuras\nadelgids\nelonex\nasias\najirotutu\ncocalero\nhaiba\nlowlight\nkadyrovtsy\nsejad\nbidness\nlazne\ncommonhold\neastborough\nschons\ntriflin\nsakalas\ntangherlini\nkarisimbi\narchnemesis\nhawcoat\ntongxin\nseany\nderniere\nponcy\ncharara\ngloving\nhallwood\nzastudil\nprogam\nsportsbusiness\nbartoszewicz\nchoquehuanca\nkerith\nhaqiqi\nttyl\nmepis\nrsme\ntallentire\ncemetry\
nefromovich\nirishwomen\nphotobioreactors\nsnores\nbehavorial\npierrehumbert\nshevin\ngolland\npeasenhall\nlickorish\nhoussam\nmilioni\nunpublicised\nicpdr\nquyang\nlangwathby\nmoneybag\nupselling\nbbnp\ndaggle\nogunyemi\nrogaine\ncalpin\nkyrghyzstan\ntrads\nhiscocks\nlindenfeld\npsychoanalyze\ngiannoni\nvantini\ncontagions\nrhostyllen\nwayyy\nemarati\nothaya\nsmoothes\nwinckley\nmaisonet\nvieau\ntorness\nmehli\nforseti\njerard\ntafralis\nsharits\nresentenced\ninsiste\nunguents\nincestuously\nsariyev\nmaccambridge\nbalbach\nwaterstock\ntirey\ndagong\nyeske\nflammang\nnemko\njourneaux\nexarcheia\nskedaddle\ncomitology\nswiggs\nbopd\nbarenblatt\nhalloy\naustrailian\nveuster\nnaeole\nbattlelines\nbogush\nautodialer\naijalon\ncipel\noluwafemi\nblackberrys\nmewstone\nvideoboard\nhuppenthal\ntromboncino\nlistees\ngasparian\nchaebols\nmoclips\nkihansi\ngunfighting\nscaleable\nsubspecialists\nmiert\nshibao\nccmc\nnphs\npracticioner\nxunlei\nstarborough\nharke\ngottex\nvgg\nbisk\navalan\ndeoksu\nbordeira\nsinisterly\naghadowey\nscialoja\necrypt\npoleglass\ngibralfaro\nmixa\nxativa\njhw\nmilfontes\nseillière\npallab\nokam\ncannop\nyorkshires\nbegovic\nbaburova\npaciotti\nislom\ndisseminators\ntweeks\nwolfquest\nfaehn\nbrittny\ntownsell\nprecipitant\nrydzyk\nblairism\nberki\nctic\nrafiei\nwaraich\nabyssus\nonsens\nfisherwick\ncheontae\nmedix\npointfest\nmichigander\nmissned\nbonekickers\ncionnaith\nkhawja\nnaray\nshaugh\nprofanely\ntrishaw\ncompil\ncodefendants\ndubrul\ngaiole\npoofy\nnuvinci\nrovero\nrouanet\nplimsolls\nmetrovacesa\njlj\nnrsros\ncaflisch\nlaferla\nsuperabsorbent\nnailin\nspithill\ncyberpsychology\nghanaati\ntrest\nmaehl\nhaberal\nbayernlb\nambepussa\nglancingly\nwerehog\ncomperes\nmonaci\nkindie\nkiren\nkcmc\nglavas\nbrauneck\nrepko\nmooses\nhygenic\nunhip\nsubgenual\narunas\nchatzky\nbucketload\ndonayre\nsoaping\nadamik\nlataif\nsitruk\nbirdstrike\nhafsia\nyezzi\nplenette\nsperrins\nstreeten\npandjaitan\nmenelas\ntappings\nonexim\nsemblances\nanley\nskiddy\n
totting\nissak\ncourtley\njeremijenko\nhorsh\nvaunting\nlumpfish\npiglia\nchoeung\ndubrave\nriosucio\ndiiulio\nceeney\ndostoyevski\noleophobic\nthaught\narastu\nfarmable\njeffre\nsteenhoven\ngerren\nballgown\nnoras\nmerial\ntaurisano\npantelides\naggressivity\ncoadministration\nnassarawa\nellough\nouanaminthe\nchicchi\ncaofeidian\nfurla\nscariness\ndragados\nhotsy\ndangit\nbaosteel\ndissel\ncariso\nbrunger\nkemmons\nstovell\nlongini\nbasanez\nbensch\nhouseboys\nxuwen\ndeguzman\nharidopolos\npanarea\ngoerner\ntogbe\nsuborning\nbuidling\nkharel\nfragos\nchionochloa\nmaintenace\ngyeongbok\nyaiza\nclachnaharry\nmelissinos\nclientèle\ncivilianised\nozm\nsmiter\ntheall\nisayeva\njinke\nbrowde\nscything\ngianfelice\nknutzon\njagermeister\njinning\nankarafantsika\ndepaulo\neventim\nhoffmaster\nbirkholz\nleontiou\nstinkweed\nphoon\namendement\nsiala\nprostates\nunwins\ntrulock\nmacsorley\nnibutani\nautonet\nenrobed\nvorp\nforeside\nexiguous\nguestimate\nberley\nnexium\ndidit\nborensztein\nmoph\nvonta\neinstien\nmtwapa\nfengming\ngoiters\nplusher\nappartment\nbahrke\nmalaguzzi\nhadoram\npossebon\nthrombocythemia\nclearcast\nstipan\nkalee\ndades\nkeralan\njurjen\nthatthe\nslatton\nsannine\nroubin\nlobortis\nataq\nbruguiere\ngauchito\nadmen\nkierkegaardian\nboromo\njosipovic\nunsuccesfully\nchabane\nfreakley\ncieh\nmolchan\nyahadut\nfratianne\ngagen\nlipsy\nashcott\nconvivir\nziehm\nimportuned\nmirdamad\nimpeccability\ncayer\nsmidgeon\nliadov\npootle\nsynchs\ntagammu\nkamruzzaman\ncbga\nsyfret\nhotair\nmonchaux\nmaroussia\ncheslow\ndidactically\neisentrager\ncharisms\nntelos\nbowleaze\nglawischnig\nnarcan\nfolashade\nkanayo\nnavada\nwaistbands\nblastomere\nschallau\nredds\nxyrem\nsolarworld\natsa\ntoomay\nraming\nbasenjis\nadify\nbluefire\ndinkelspiel\nintroducers\nlevere\npanagariya\navichai\nayson\nbariatrics\notterspool\nrobosoft\nkurtaran\nakau\nkozinets\nyunqué\nngiti\nfeltner\nuale\nmundan\navows\ninchnadamph\nsensationalising\npitoniak\nzumanity\ncroeserw\npedometers\nkn
owingness\nsunworld\nroessner\nheartiest\nthiébaud\nnurock\nbajans\nmcaree\nmcgreggor\nstamoulis\nkuneva\nkudamatsu\nearthmovers\ngreenfingers\nndege\nanniv\npopken\nmachholz\nrudic\nanatevka\nkorzh\nwesteinde\notane\nfotouhi\nemarketer\nnyaru\nsterilizes\nwisnieski\necotoxicity\nrelaxnews\nfouler\nsabira\nviaspace\nmoordown\nminsa\nbenns\npeverill\ndanwel\nnetsmart\nludgin\nvyvanse\nffbc\nnyanasamvara\nlocavore\nwgae\ncraymer\nthomspon\ndpicm\nphallological\nspuck\nlucenti\nepiphenomenal\nrichardsson\nplaceshifting\nkhalifé\nterreri\nlepori\nopensim\nfunnye\ntorie\nhesam\nenticingly\nunbreathable\nshargel\nrhizomatic\nbitung\nmarmarth\nrosasco\ncroxford\ndaraei\nluly\nschnellbacher\nrxi\nipbes\ndiresta\ndynastar\nbeewolf\nskobrev\nsurama\nchoirmasters\nsdna\nstingl\narnesby\nattendent\nsuppor\ngöldi\njonothan\nllywodraeth\nsteadiest\nfraa\nidolise\nstrummerville\nstrathdee\nhushmail\narriaza\nparsimoniously\nschlopy\nsturtze\nsoumare\novesen\nreargument\nhistoy\njihong\nrottenness\nhorethorne\nmovieclips\nbialetti\nkvh\nberteau\nfromthe\nchemezov\nbecu\nbaumber\ncordey\nranchipur\nnickols\nvalderama\nborrowman\nultrasonically\nspectron\noutrace\noshrat\nwasiak\nhitchers\nhuangdao\nmidshires\ngiering\nelectricty\nringshall\nkastelein\ngancarczyk\ngastman\norobio\nkhanin\ntequan\nyongjian\nartmaking\ntanamor\nsplutter\ngudnason\nmelich\nacknowlege\nvengefulness\njintai\nshowerheads\nmezereum\nnehushtan\nwizner\nmccrabbe\nshinwar\nringstrasse\ntorrentspy\nkathoeys\nijburg\nthokoza\nledden\nsidestreets\ndjurdjevic\npriciple\naihua\npaleobiologist\nzhengjun\nminju\nfederalize\nbeharie\nirwins\nmcar\ndanniel\nkalyuzhny\nclientearth\nbabus\nburish\nbritax\nnpes\nplantée\nchewer\ndramamine\nbalgay\neinig\nallurement\ncounterfit\nhafte\njamaran\nvkr\nlevalley\njabbers\nzebrawood\nmathstar\nmdds\noutwoods\nassembleias\nmolestor\nstoreowner\nfetishize\nrugani\nwintner\nshalgham\nbetrothals\nsupino\nmyspacing\nvomitorium\nhanit\narthrosis\ntith\naehf\neerc\ninfields\nbionz\nar
nowitt\nkayf\ntourmalines\ntonning\nwenzler\njingping\nsoftballs\nescomb\nrundale\ncontroversialists\neyeliners\nmalverns\nsawani\nkesho\ncentene\nderrybeg\nportio\nfussiness\nmorrah\ntahmima\nnimir\nsurley\nmereb\nmijak\nschyff\nquanjude\nshahier\nkayanan\nbrumas\nphilosphers\nmams\nxtina\nbiomet\norpah\nfingerpointing\nmerediths\nunfeigned\norgad\ngidden\nmelhuse\nupnd\nmichella\nstrathmartine\nherongate\nettajdid\ndecoturf\ndrossos\noxetane\nkshirsagar\nmanseau\nemson\naeltc\ncoraci\nhiggenbottom\nsofield\nwepf\nborker\nnanomechanical\nzazai\njupille\nghussein\nparricelli\ncintiq\nautopen\nconsultores\npenetrable\ncopella\nbugnion\ngorenjska\nsuruchi\nyejun\nextell\nroesgen\nwibisono\nmerrihew\nmullooly\nbuddenbrook\npistelli\ntholl\nzwecker\nblogroll\nsomal\nleeswood\nzingale\ngrecos\nsavulescu\ndyesol\nfabozzi\npochards\npapper\nprototaxites\npfox\ndoughoregan\novulated\nblavat\nraffaela\nballyhooed\nsarisbury\nchowdry\nunki\necall\ngamebreaker\ndigitizers\ndigitalsports\nhorseriders\nwunderteam\nrosaryville\ncombee\nweinshall\nimagesoft\npurwanto\nremands\nblekko\ndembinski\nbfsr\nmandis\nboeufs\nmaienschein\ngladiatrix\nkinbrace\nxee\ncregger\nmcgahn\nmalkus\nsovern\nyankovich\nsundsbø\ncybex\nrüttgers\nsarky\nmatli\negoless\ndwy\npruners\nlungworm\ncollamore\nporttitor\nkucuk\nunremovable\nheadingly\nbandier\nrunback\ntabú\nurdapilleta\nxrc\nprosseda\nnamechecking\ndéfilé\nmicrowaveable\nnewbay\npolitbureau\nrenovables\ncardioprotection\nslimeball\ndistributer\nschoot\nlomeli\nlazars\nsharpish\nmesnes\ninsitutions\nkraska\nmintimer\nticketholders\nkehn\npolical\nstringencies\ntransship\nbildeston\nbirky\ntransmen\nclitorises\nmaasdriel\nkaplans\ntracinda\nzilk\ndrapchi\ntonello\nunenhanced\nuprichard\nfarter\ngemfields\nmambi\nbelnick\nvidalin\nmechling\nghida\nwallpapering\ncorridan\ndangrek\nrhio\ngrandberry\ncolotto\nnffe\nflacon\nglazman\njabby\ngogerddan\ndmos\nrubinow\nisobella\nlootah\nbruguier\nthornbridge\nlambersart\nslather\npasscodes\nmaturan\nla
wwell\nstoschek\nersen\nprechtl\nstonestown\nlengsfeld\ncrss\nactuarially\nlupercio\ndimitrius\nshizuishan\nsaharans\ntanski\nghavami\nzaslofsky\nwakens\nhealthvault\nboemre\noutreaching\nbackstopping\nousters\npaška\ninhabitat\nsymphonists\ngrunenberg\nunderhandedly\ntreys\ndemann\nnaccarelli\nkelava\nalongi\npocar\nmattrick\nparritt\nsherco\nrwasa\nreinebold\neastrington\ngonaives\nkwena\nsogeti\nunscrews\nxke\nbeci\nencryptor\ncotoletta\nbelkhadem\ntraue\nrafikov\nanthropomorphize\nchanoff\nsergy\nmerideth\nravasio\nairiness\nnivins\ncentrify\nmanhours\navunculus\ngeekery\nbrisman\nthway\nhayfever\nmadhesis\ndafter\nunpleasantries\ndickinsons\nmulé\njeppestown\neditis\nklueh\ntorvik\nlyndie\npoliticals\nhousesitter\nanomoly\nantiabortion\nconsitent\ncytometers\nsprouston\nplaszow\nfrago\ncynar\nacti\ncwmbwrla\nmanil\ngazetta\nkillylea\njunaibi\nproactivity\nabdulmajid\namalga\nappetiser\nmiramshah\naquisition\npolam\nreljic\ngárda\nhamod\ncristeta\nvelcade\nprespecified\nbodyboarders\nhamzanama\narvan\ndisagee\ncroquant\nbartal\nhealds\nvidex\nclanchy\namoi\nimoinda\nriml\nsummerly\nclaisse\nlaboure\nwildaid\nbriffault\nbarrymores\nespys\nscatterings\nfigurs\npenybont\npreviti\nkarrine\nchadway\nkumudha\nlarish\nricchiuti\nbecomeing\nchoralis\ntrubshawe\nfrsb\nkiza\nvestara\nmeya\nhartanto\nfredricka\nchloroformed\nstanowski\ndrizzy\nemler\njezz\ndoper\nhitmaking\nsuprapto\njeser\nigaming\ncymbalta\nweadock\nsellam\nchrd\nballykinler\nstankovich\ntrass\nmandale\ndvoskin\npearled\nhaadi\nbenchill\ncloran\nkarsay\nvillafana\nbacelar\nhustwit\nibeji\ninventus\nderegulatory\ntirr\npenumbras\ncolcannon\nsodaro\nreitmeier\nkuentz\ncodexis\ncindric\nsabetta\nelvs\ndrumaness\nkeqi\nbiehler\nforesighted\ncwmllynfell\nzalloua\nsmarr\njenufa\nayorinde\nmusorgsky\nmetgod\ngastropubs\nkiriakos\ncolumbines\nantiracism\nnezar\nexcercises\nivinson\nkutlug\nintersectoral\nradicalising\ntorchio\nsquirters\nakpinar\nolgivanna\nsudworth\ndomalpalli\nalhamdulillah\nmalachias\npodkarp
acie\nvqr\nmacacos\ndrumsurn\nbrandell\nsidestreet\nredisplay\nsherchan\nunmo\nshotgunning\njaiku\ntortorici\ntabetha\nenglaro\nalierta\nkonvicted\nantiobesity\nailean\njanaway\nclubroot\ntheel\nchalifour\nmynt\nguek\ntillmon\npardilla\nhagemeijer\nsteepen\nplutoid\ninhorn\nmagnetotelluric\ncritelli\nharmonises\nnaotake\nsesenta\nhelgerson\nobesogens\nunsell\nyibai\nprotractors\nlunesdale\nlesesne\ncamperos\npanoramix\nwcbi\nmwandishi\nbourguignonne\ncomposter\ncossin\nebid\nlambson\nkilcrea\nplatek\nferngully\ncutright\nhadassa\nriyale\nperfectible\nmerouane\nsheepherding\nsetian\nmorrocan\nsampey\ntaohua\npasteurizing\nshmarov\nretreatment\nperthnow\nologies\npivi\nyente\nrupprath\ncachalia\nlazim\nallstream\nhrsc\nknabb\ngilang\nlibrizzi\nkontrakt\nedder\nblacksummers\naminat\ncolage\nsyeed\nmahfud\ncroson\ntumukunde\npetroleo\nryter\nchloroprene\nhuitlacoche\npoppleford\nsteephill\ncanala\nsupermercado\nvilsmeier\noncall\nmaraud\nmundanely\nnonfactual\nsakane\nornare\nnihombashi\narmini\natlantan\nmachlis\nmcara\nferencsik\ncalitzdorp\nrelitigate\ndjellaba\nhouwelingen\nnadaraja\nberglin\nnikolaevsk\npampore\nmabarak\nbrotherson\nchimpsky\nfreerice\nglutamates\nharpin\njested\nmocan\nstumpage\nbalmforth\nkasule\nkarpluk\nrashkow\nmaake\nstrigl\ndagley\nkhazanchi\nsoltz\nwertman\nadta\nstroughter\ncolicky\ncavalcades\nfsin\nfiesp\nbbbc\nmicrobicidal\nreappraising\ngraymont\nhajian\nungi\nfuction\nmaramotti\nherchcovitch\ncelebrators\ngridpoint\nkaramon\nabdulhakim\nozkan\nconciliators\nbeddows\ntqt\nelectroplankton\ndtos\ncaromed\ninitative\nphotoaging\ntachographs\nmichelmersh\ndonehue\nndcs\nqrr\nhvala\nundermain\nhemophilus\nsanghatana\nuncleaned\ngunters\nwakeel\nkomarica\ngarabito\nhetian\nqmt\nbollenbach\nwittle\njaumann\nfibo\nrothgeb\nramda\npyrotechnicians\nmidmarket\nbertschinger\nundervalues\nljubinko\najos\nchavers\ntroqueer\nizzadeen\ncollaring\nbimanual\nfogies\ngreenlander\ncabat\nellertson\nslocock\ntwibell\natalia\nkartick\nscandalising\nakimova\
nnedam\ndickleburgh\naplington\nminisodes\ntokaimura\nunconcious\nserasa\npasm\nhandbrakes\nphanor\nstudiousness\nclunkiness\nsabangan\nmoonpie\nkyota\nmahvash\nmudding\nrequalify\nabdelwahid\nthamesport\nislero\nrachleff\nvancouverites\nwouldham\ntudose\nzikim\nwoolcombe\nbaracks\nmockbusters\nmaddicks\nneuropsychologia\nsweathogs\nberaud\nongoings\nportnall\nnazarbaev\ntegenkamp\nstepsiblings\nfrankovich\npowervm\njuergensen\nbayal\nzahidi\nhashmarks\nhalleux\nottoway\nwhitear\nalasay\npargetter\nbrittish\nberlijn\ngrotesqueness\ndawidowski\ntheyskens\nbanyana\nlowles\nreconfirms\njaelen\ndotsero\nvoegeli\nenewsletter\ndiglis\nmedog\ncouraud\nyulayev\ninsightfully\nhcap\nfremaux\njoyfulness\nmoudarres\ndopers\nmandour\naesc\norlowsky\norganix\nngabo\nereaders\nimpalpable\nbarszcz\nbuckmore\nottney\nvecdi\nepla\nfobbs\nbedsores\nhedgehunter\nlongfields\namtrust\nglogowski\nmedicalert\ncinacalcet\ncolorfast\nentrepreneurism\njuppiter\nfigler\nakharas\ncycloramas\nschi\nheagy\nsoitec\ndoctore\nortez\nkisseloff\nbolivars\nsheeni\nbarawe\nfortunetelling\nnorona\namarillos\nsalamo\nahmadiya\nbrowers\nmonneret\nzhaoguo\nodfjell\nnotorius\nhanowski\njuman\nsambhogakaya\nquickbird\nfederick\nklockner\nstojakovic\nciegas\nworlingham\nrasgas\nchadors\noverpopulate\nkonbu\ntrbn\ncalibrators\nzhirov\nhawelka\nrobiul\nmcmissile\nblackston\nlunnon\nboxhill\nlakhwinder\nweltschmerz\nfasani\naltounian\ncladded\ncontainerboard\nsinfully\npavlides\njayatilleka\nquinson\nhighweight\nchacachacare\nryol\nfumagillin\nveligers\njabron\nbenzes\nmothersill\nhawkesley\ndaolin\nverkündigung\nletterbook\nbakopoulos\ncrdb\naliano\nbystrom\nzesto\ncosla\noutworking\nkleeck\nbirkland\nyeay\ncatterline\ngaeng\ndainties\npacnet\nanwan\nbijlert\nfidm\nmaisano\nlachasse\ndowntimes\nlapolice\ncemm\nmgus\nunware\nrecontacted\neidinger\nflounced\naiyegbeni\nbeav\nculioli\njttf\nnorita\njermey\nlistecki\nrodio\ndabrowa\nwhup\negnor\nkerwood\nhanzala\nmontserratians\nrebase\ndumberer\nanalysys\nnonmelano
ma\nsumka\ngumbi\nbaugniet\nrespites\nguruswamy\nrosettenville\ndrumthwacket\nquantin\ntrendiness\ntarheels\npuygrenier\nredbelt\nworstall\nbellei\nsuzzanne\nkleek\nedmee\nguidice\nsedillo\nolso\nmultimission\nmclaverty\nkarelina\ncablecast\ncefepime\ntrevigiana\nsathit\nleavittsburg\nvansummeren\nknockbreda\nkalfus\nimpudently\nyengeni\nsmartdraw\nfriehling\npointwork\ngatward\nlilyturf\nnogovitsyn\nbobbe\nefthymios\nhughesy\nphilps\njoola\nkrave\ngeisberg\nhylonomus\ncenterman\ntaloned\nmalow\ndibber\nboanas\nkipco\neisenhuth\nchewits\nmaxia\nsoberania\nlistlessly\nnyswaner\nsamoei\ndjangirov\nbandwagons\nthumbprints\nwnyn\nsarunas\nmyoglobinuria\ntangibility\nhaski\nartfulness\nlagosta\nlingdale\nheugten\nfaustman\nrebelliously\nstacksteads\ntowhid\nlamanda\ntoper\nfayiz\nstellman\nvamoose\njewcy\nmurcheh\nressi\nshunqing\nciolli\njavarris\nrmj\nwises\nburgerville\nhabig\nkannywood\npawb\narchibold\nlaciner\nsuneson\nprevnar\nborrachos\ndinnigan\nhitchock\nkuhnle\nagusto\ndailytelegraph\nstango\nsilvestrin\ntorregiani\nproh\nhinestroza\ntoted\ncholewa\nlackman\nautists\nfaerch\ncrimsons\nbadme\nfiancees\nglorying\nshimmying\ndudfield\ncheverton\nlikin\nniloufar\ncharamba\nmussarat\npowerplays\ntexbook\nmcgain\nabdessemed\nwaldhorn\nkrolicki\ndetter\nleurquin\nmihas\npixetell\nkonw\nshamik\nnduom\nredelivered\nkosmix\napproch\nkeyrings\nkorade\njaquette\ngerischer\nkcrs\nshanwick\nroomers\nvolutpat\nbiong\nbavidge\nbirkman\nkestin\nchelsy\nllanddaniel\nduckers\npingleton\nstofer\nmaricich\nbreard\ngetco\nsabey\ncarpinello\nsargentini\ngeovani\npiershill\nbeixin\nsanmar\nlutzka\nbalestrieri\nheptyl\ndangermond\nbrilliantine\nvotebank\nncms\nmikov\npostfeminist\ncbda\nstylizing\nandisheh\njoella\nlumpini\nzaccardelli\ntelmatosaurus\nbocephus\nmarrian\npierceton\nakintunde\ncutline\natheeb\nagranov\ndicksons\nmiad\nsheered\nkorogocho\nsynthons\nbazid\nrossion\nfloodable\nprawiro\nmoretown\ngamesman\nropewalks\nmulheim\ndesveaux\nguardianships\ncampañas\njaschke\nmccl
ave\naphl\nprivatair\nagripino\naractingi\nswinehart\nputnum\nmischler\nbecci\nefps\nfoxhunters\npaddleboards\nptbt\nhongzhong\nritty\ntozzo\nporsch\nwrenshall\nminneiska\nmontbeliard\ncodell\nemgrand\nvissarionovich\npuhinui\nahlbom\naning\nhyt\nkriger\ndemagnetized\nselukwe\nmolestie\nhadcock\nselenoproteins\nboyatt\ntheepan\npissant\npaynton\nfirrhill\nsmuckers\ncwj\nlagrein\npitkanen\nbukuru\nwtu\nvelislav\ndrearily\npluri\ndrollet\ntepidly\ngaventa\nevhen\nbendu\napostolides\ntebbitt\nrasanayagam\nseagreen\nmidelfort\nfoulard\nsoakai\nsantore\nplassnik\nprotech\nkozhara\nalchemie\ncliftons\nsusd\nhellacious\ntelescreens\nnugen\nzingerman\nbarck\ntuckingmill\nechavarri\nseanna\nkoyamaibole\ncommuntiy\nhornton\nbardawil\ncheleken\nlocorotondo\nlapread\nlaria\nblunderbusses\ngovernate\narhp\nevennett\nthrashings\ncaldercruix\nquirimbas\npokeman\nviven\novejas\nlenwade\nlasource\npastiched\nlarmina\ncatamite\nfaryd\nramadorai\nbohnet\nfinex\nquéré\nsemini\noldish\ntejana\ncontente\nchualar\nbunac\nrathwell\nvivisectionists\nmerhav\nmeeder\ndvorska\nmuntazir\nrotgut\ngaunless\nyuguda\nteneycke\nriffworks\nhemmerle\nspankin\nnorweigian\neragny\nmyrmekiaphila\nchaing\nlungarno\nscootin\ntaegutec\ndamningly\nproductivism\nmajoros\ntarbock\nphedre\nprooves\nfrind\nidioteque\nslipware\neisgruber\nziffren\nmelittin\nundular\nghozlan\ncarjackers\nmccalliog\nultralow\nplatitudinous\nconcessioner\npornotube\nkiltartan\nnomee\noutreau\nbarston\ncilice\nqaiyum\nborroni\npmis\nsaboor\ndowley\nrsps\ncorbus\novercomplicating\nfiell\naralsk\ntravelzoo\ncdph\nrecomposing\ntrogontherii\nchandak\nsupercasino\nbedtimes\ncusip\nwdb\ncwmgors\ntohn\ncheeseborough\ntrapido\nsperandio\nbatka\nxinjing\nstatpro\nperfumo\nnewswomen\naltinum\nselvie\nbenchemsi\npevely\ntaxability\nbaghe\nbofinger\nkuchinsky\nfanboyish\nkaly\ndowle\nbirrer\nctcl\ngirasoles\npuked\nbeitler\ntrimspa\nzeltingen\nailish\nnespolo\ncccd\ngefter\nquitclaimed\nparcelforce\nnbv\nbyaruhanga\ndeutrom\nmussy\nzeidel\nincli
nded\ncachette\nreegan\nhamzat\npsti\nallegis\ndevauden\npacsun\nqiwei\nakerlund\nelderspeak\nxcellence\nsheelah\nunbolted\nnesto\ngeeking\npopover\ndickes\nresectable\niaam\ncantabrigian\nkilley\nveteri\nxanterra\nsamouni\nesgr\nuncrc\nawac\nvsw\ncredability\nrapetti\nkhudobin\nbeenham\nmehus\nanusorn\nhyler\nstanney\nschubin\nresnikoff\ngibo\nncidq\nfoxtons\ntippingpoint\nbauger\nbeltzville\nrinkside\nmutuma\npedalos\ndisconnectedness\ncsikos\nlozeau\noberhaus\naronne\nungroomed\ndilruwan\nlamboley\ncalleary\nulil\nmakhluf\nflashiest\nsanha\nlukoff\nunionport\nbusuioc\noddments\nmihigo\ngahl\nterumo\ntussocky\npeskanov\nperozzi\nstignani\ngeor\nserwotka\nkleinzahler\ntaneycomo\nmoonacre\nhomelink\nlydman\ndumbshow\nabdulwahid\nelmy\nshvat\naffectionally\nfungai\namande\nsesno\nglenallan\nkcls\ncolacello\nreformulates\ncastlebridge\nrumbly\nhollywell\ncraigston\nleeum\nsadis\ngillerman\nsnuggling\nseelaar\ngrealy\nmcwhertor\nuanu\nsakiya\ncomacina\nmarggraff\nunstinted\nhirz\noverutilization\nreyle\narvans\nbridi\noppressiveness\nnerney\ningroia\nschroff\nlaferriere\npalladini\nhallers\nsunkissed\nhousecoat\nfürmann\nmortify\npowr\nmsowoya\ndorry\nmorupule\nimperiling\nbyol\ndaquan\nkoranyi\ndisingenuity\nzeitlinger\npandemrix\njasmila\nexculpated\ntardiff\nfritwell\nstewpot\nkellwood\ntolsta\nforthampton\nhongliang\nghadiri\nhansdotter\nrobotized\npatua\nwuhayshi\nbact\nrassas\nunivercity\nremicade\ntwinnie\nnaringin\npietramala\nladakhis\nelsehwere\nfdrs\nkekexili\nislamicized\nroxwell\nhanhardt\nchihab\npasteurize\nunring\nchirla\nwardana\npendet\ncoldhurst\nbogarín\narbete\nmuhandis\nchiren\nsplendore\nnaanee\nspivet\npeynaud\nrosland\ncmus\npdge\nfeniscowles\nmargairaz\nllangrannog\nchafets\nthato\ncapitated\noudegracht\ntreston\ncardsharp\noostendorp\ncallejeros\njesselyn\nkaradjordjevic\nthania\nhundered\ncelar\nharishchandrachi\namisfield\nparticlarly\nanouncement\npurtzer\nbmxers\nnepc\njuanjuan\npolishuk\nwinborn\nchewang\ngatfield\neasd\niglhrc\nabdulqad
ir\njarrid\nzovi\ncodecision\nsouffles\ndesko\nagrs\nhanretty\nglenisla\nactally\naveris\nbackcasting\nsenner\nsnoras\nfulminated\ntator\nyosser\nbulevardul\nstriezelmarkt\nhyperinflationary\nkilleshin\nwithouth\nsuperbrand\nloosley\nmarwani\nlebenzon\nchewier\ncarharrack\nshowin\npetries\ntueart\nedrf\npgnig\nsuñol\nthroup\ntibbott\nismailova\npicnik\nsundel\nklonopin\njorde\nfanene\nlloy\nmerco\nannuloplasty\nmontly\nvlo\nmilcom\nengross\njbe\nshouldst\nsatiating\nputley\nfreebasing\ntjostolv\nwijffels\ncinemanow\nfrancescana\ngiacobbi\nwommack\nnayda\npriuses\nstarkest\nringuette\nthurcaston\nprecog\nmynatt\neurlings\nyunding\nalishba\nglascote\nginen\nkurras\ncomplected\nipda\ntransmanche\nrealage\njeebus\nbortolotto\npurcey\nsalvant\nhuberto\nkaidanov\nmahboubeh\nlaugavegur\nshuanghuan\nelkview\nyesil\nflightstats\nobadeyi\nnasonov\nfelicio\nzafarul\noversubscription\nkupl\nqadhafi\nrollston\nsarley\nbedsides\nspeyrer\nsongtrack\nklete\nsteepens\nbaddy\nsportfive\nlachrymal\nnilles\nkirklands\ncecom\nmmcc\nstanyon\nresuscitators\npicturetel\nohlinger\neurocities\nmlab\nrizwana\nwolferen\nleiferkus\ndewlaps\ndefrantz\ntileworks\nsarriegi\nevjen\nschaars\nhadida\nchillan\nrexam\nstooke\ntongsun\ndolcefino\ngewertz\noneweb\ngollon\nmacray\nthalamotomy\nschremp\nsceptred\nbeechdale\nmilele\nschulke\nxiaohua\nsuperiour\nmenoufia\nximeno\nskyjacking\ntranquillisers\nbefuddling\nrunion\nmignons\nmanacapuru\ncompucom\nhensick\nshoutouts\ngoerne\nghettoised\nveanne\nsuperduper\ntrystin\nziama\nmonetise\nflimsier\ningestible\nsheley\nlept\nistre\nwmmj\neuthanise\nsolangi\nmeasurment\nmerriest\nleonatus\nmoscicki\nstogdon\ntyring\nicewater\nnalani\ncostamagna\ntruveo\nsifry\ntrindad\nvider\nreflectography\ngenuflecting\npainewebber\novercollection\nkovalik\nvontae\ncenit\nthreespine\ngruenewald\nfilmo\nvocalink\nunshorn\nkinichi\nkhuram\nwoma\nguixé\ncranhill\nmisguidance\nvibing\nmuzzin\nchli\nmáncora\nfanfarlo\nborderer\ndeficiences\ngruesomeness\ndiscusing\nblabs\nmard
ani\ntreverton\ncharabancs\nhubka\nkinderland\nshackford\nhazin\nsterilant\ndirec\nhmic\nbenfit\nyagman\ncrystalized\ndehning\ntahoes\nforceable\nanyango\nolom\nnontaxable\nnonperforming\nprinzi\nvenenatis\nkameya\nherterich\nhrafnsson\nroehler\nloverdos\nmcerlean\noverule\nscriptless\nmoochers\nsmartish\npoag\npoerio\nmicrotomography\npufang\nfranses\ncrochets\nlashbrook\nsamkange\ngrober\njamphel\nvlcek\nshionogi\nrotschild\nvapourised\nsnoad\nmoskin\nswartley\npanagi\npourville\nmadini\nstayt\ntrucchio\nthuds\nmagticom\nshmueli\ndesperatly\nlipinsky\nsfda\nzahri\ncraigentinny\nclavulanate\nreby\ntweeden\nvexillifer\nnayeli\nvictimizers\nhydrocephaly\nmargeson\nburgeson\ncannom\nduvvuri\nlimning\nwardsboro\nomeir\nzawacki\naxelrad\nminadeo\nsakhra\nghalia\ncinches\nalabel\ndoodads\nwadded\nmetacarta\ncommiserates\nrodber\npasquesi\ncaryophyllus\ninzko\nsatherley\ninalienably\njoose\ntraduced\nmannelly\npillitteri\nstetsons\nburlando\nevdokimova\ngallante\nthorstensson\nunathletic\nidodi\nunitl\nvriesendorp\noneview\nsoapsuds\nwhittamore\nrewbell\nzambonis\nkadidal\nmisspending\nsubarus\nbasari\npiperazines\nmutula\ndaneshjoo\nbridcut\nminoso\nfairoaks\nmalakh\nchages\nkornblith\nviciedo\ndulski\ntestamony\ngumshoes\nsubstorms\nglower\nstachowicz\nurechean\nsamoëns\npruneda\npolz\nmoxidectin\ngendelman\nsummerhouses\nndem\ninduta\nthethi\nkiyoshiro\nsimeons\nmieville\nguevera\npembrokes\nearthward\npublised\npossibe\noutraised\nchocs\nmultifold\nviscri\npickier\nmauresque\nlaparoscope\nmizuuchi\ndragset\nhcrs\nmidflight\nfedwire\nnovelis\nfortville\nmeruelo\nmomar\nlibuse\nbrodt\nhoecker\nlasersoft\nairbrushes\nprasong\ndemarked\nhomeshare\nineffably\nkickout\nfvn\nmasoods\ngiggity\nimmunotherapeutic\nliakhovich\ndespins\nmatthiola\ngestations\ngamov\nagarwalla\npangsa\ncapio\ntricolores\nxrr\nrainsberger\nmajal\nbegell\ngeerdes\ntreena\nemmes\nwesray\nkhazim\nliothyronine\neffecient\ncoliseums\nandrejevs\nbiocare\nwartman\nundock\ncheesewright\nvollebæk\nwitthaya\
narafiles\nsekel\nsubmachineguns\nsynnex\nzollar\nstonewood\nfrittering\nregbo\nsymbio\nfenestrations\nhusbanding\nnewmoon\nzyg\nvillabate\nortel\nhakainde\nwoylie\nfeltonville\nasbs\nbaobao\nweighings\nhatemongering\nrieseberg\ntuiaki\npachl\npinafores\ngyd\nbelaieff\ngrilse\nhirigoyen\ncanings\nhuntsworth\nzourabichvili\nboppish\nbuhund\ncohabitant\nssmc\nfelicisimo\nbartis\nhistoricos\nshardlake\nardhendu\nstadhampton\nturps\nwitasick\ntransfat\nshecter\nharindra\nnominaton\nmapit\nnvra\nmatavesi\nbrookstreet\nexcersize\nwynnton\ntalkov\nharmond\nwelshness\namigorena\nxaba\naraca\ntartufo\nchidchob\nhedl\ncressi\ncasd\npyridostigmine\nrelining\ntacul\nvialone\ntechart\nthaleia\nkhudair\ndomanick\nkieselstein\nwaniek\nelbaneh\nknijff\ndivani\nluofu\nshatha\nrahill\nauxillary\naight\nwainy\nghio\necta\nsensitising\nunscrambling\nvaccinator\npoletto\nconstantinidis\ntilshead\nanniversay\nxactly\nmadeirans\nhaunters\nweihenmayer\nlewenza\nbinham\nandringa\neschool\namtc\nschmetterer\nkaltenberg\nvylegzhanin\nguardedly\ndeprogram\ninlcuded\neurocentral\nmacmerry\npopejoy\nensing\nlaurentic\nsandretto\ngarthmyl\njamileh\nzemmouri\nforsgate\ntetracaine\nmadadian\nanyar\ngraeter\nkracher\neconomos\ndemodulators\njcom\nglunz\nyawed\nglusman\ntolchinsky\npogan\nmitofsky\nmetacrawler\novernighters\nsunquest\ngroms\nmukhamedov\nhmshost\noneword\nwintonotitan\nsuperorganisms\ngunatilleke\njetfighter\nnewmachar\nittiam\nglof\nshrillness\nsolero\nfonyo\nphrathat\nsuperclubs\nkirzhach\nmonesi\nswach\nscolopendrium\nverisk\nbachelart\nküntzel\nunrebutted\nmaralal\ncoffeeberry\neurophile\nrazzamatazz\nyeyo\nlemoncello\ndrumquin\nthreatning\ntegtmeier\nchukchis\nwongsuwan\ndeludes\njolee\nequibase\nchandila\ntransportational\nunhittable\ntoadying\nshukran\nspilhaus\ngianopulos\nsveningsson\noppertunity\nwawan\nteruhisa\nrahv\navellini\nhameldon\nfiresheep\njidosha\nwordnik\ntriggermen\ncampanaro\nrumbelows\nglaud\nnabatiyeh\nbeashel\nveiko\nminesite\nnower\ntriarc\nemosi\nwildi\npa
ndrol\nsrmg\nfluoridate\nparmet\nkrongard\nocts\ndalesandro\nallocca\nsehir\nmillier\ncofadeh\nhadjadj\nblumine\nhaeussler\nllanelian\nmanross\ntagaz\nsujeewa\nmouglalis\nbonu\nbocog\nmanojlovic\nlarossa\nkoffiefontein\nvisitorship\ncaringbridge\nmilimeter\ngsee\nfukumi\nligron\ntsip\nrustles\nmediavest\nfdis\nzamansky\nljubicic\npurkersdorf\ncroe\nvarsseveld\nfintray\nguita\nillington\nguntheri\nmidon\napocalyptical\nvitalized\ngrabois\nloopholed\nfulgenzio\npadro\nyangchun\ngolwg\ndayong\ngiannattasio\nsuzyn\nkhalaji\nibata\nkhwazakhela\nstroth\nhuidong\nllanrhidian\nslioch\nkronick\nrumblers\ndasain\nbodett\npivarnick\nshirleys\nalbou\nkerrygold\nzeshan\nabsinthes\ncpms\nscelzo\netzler\nrompaey\nforgemasters\nxiaolei\npeerenboom\nlounged\nlliswerry\nghayur\nmigel\nunbought\nmecia\nheggy\nfehlbaum\nbirthparents\nvideocast\nmacanthony\nfamadihana\nkraeutler\nshiregreen\nraeva\nxudong\nblanefield\nelderhostel\nhurowitz\nfalash\nnavidi\nwithour\nsamro\nwiebo\nelspa\ngoodweather\nsportcaster\nstilian\npdab\nfiszer\nmoushmi\nmcsporran\ncompulsary\nrineke\nfawnskin\nkrue\nindispensably\ncaveated\ngrrrrr\nfoodchain\nphilamlife\nnelc\nbjordal\namusa\nmilio\nbredel\nlebeouf\nlittelfuse\nkildea\nsulforaphane\ndialler\ntheilen\npiroxicam\nflordia\ntenaris\nvagator\nrussak\nhaplessly\neckerle\npoplock\ndecendents\nlassman\nhinkelien\njarba\nfatsis\nhomm\ntakotna\nbaldonado\naglietti\nguelman\nerhebung\nglapion\nsizzled\ntomaszewska\nredmill\nmcflurry\noffish\nsukadana\nbenish\nmultiannual\nmaesglas\nmaidique\nreagin\ngillioz\nsentaku\nguadalest\ndemings\nlangill\nvedera\ncenseo\nendplay\nfristoe\nmixel\nuncontainable\nwieght\nmakas\ndelicates\nnavision\ncioccolato\nrasslin\nwisdens\nbershawn\nmalvertising\nchunyan\nwyberton\nmatchstalk\nwinklevosses\ngaylard\nregiste\nmarnhull\njiansheng\ntokes\nreigon\nrevalidate\nofatumumab\natypicals\nbewilders\nsleestaks\nredware\noxenhorn\ndelphia\njanessa\nsharbi\narabisation\nrodnyansky\nweleda\nspinny\ntidrow\nmikeal\ndipg\ndurose\nde
tents\nbeitel\ncemig\nschoolbag\nweybosset\nxboxes\nlmno\ngoldklang\niseran\nfizdale\nmcfie\nneurohormones\nkitchings\ninfinitus\nseabank\nwarchest\nchumming\nfredell\nneonatologists\narbelaez\ndragonera\nyankowskas\nprevaricating\nprosecuter\ncasten\nmansuriya\ntreneman\nsynta\nsecluding\nlockhead\nkassirer\nbeltzner\neuropeaid\nbonisteel\nvillopoto\nlietzau\ntighty\npandoro\nparahydrogen\nelefsis\nchedzoy\ncortas\nflimwell\nartuso\nkallin\nkritzman\ntrifan\nglisters\ngrais\nungraceful\nfoscote\nperonal\nsciullo\nmusos\ntroyen\ninoguchi\nsegalla\ndudarova\nirell\nesspecially\nfforestfach\nscheringa\nricciardone\nemmens\nbelykh\nmuturi\nahmadullah\ncaldmore\npurdin\nkubotan\nteletubby\nricharz\nbelous\nmultifunctionality\nluben\nqueniborough\ndeclarers\nonces\njayda\ntenncare\nwillcom\nlaboon\nvanishings\nplse\npurees\nsuccessfactors\npalenzuela\nharía\ndakins\npayoneer\nrtpi\nhyannisport\ncrosscurrent\npricy\noveracted\nclais\ncontextualises\nproliteracy\narterburn\nhongu\nmuckelroy\nextenze\nneten\nwinnacunnet\nschweihs\nodalys\nlourey\nfakeness\nlipszyc\ntredworth\nterui\nborcherding\nguiders\nfloozies\nmarvelling\ndravot\nloughcrew\ninititally\nkomomo\nherzigova\nwyvis\nammendments\ncymdeithasol\nwebfetti\ngrillini\nstroem\nbedner\noten\nmattityahu\ncaladesi\nlynard\ntauer\nburradon\nnullahs\nschoeler\nmyspacetv\nannahar\nbanyans\nphakic\nwenfei\ncheroot\nbreazeal\nsonke\ntiep\nliwonde\ndiomansy\nschumi\ntonni\nrephotography\npitchside\nrushanara\naimster\ncalabrians\nrecordation\naustrialian\nachike\nplasmoid\nswanigan\nhushpuppies\nlicencees\nchromos\nkiszely\nhackescher\nburhakaba\nbarung\nrekowski\nstreched\ntandas\noniani\nintermittant\nyawned\nsoojin\nthurtle\njalapao\nmorgano\ndadds\nverolme\ngrigonis\npilhofer\nchics\nodeen\nnrpa\nalvilde\neverblue\nphear\nmanjinder\nlortz\npromiseland\ninformatin\nsmashmouth\nzouhair\ntrusthouse\nprizefighters\nburhou\nropotamo\ngeggan\nthileepan\nrudoy\nbehre\ngomphothere\nattik\nkimpel\nsorzano\nrudby\nrasburicase\nwd
k\nadhami\nbloomy\ncresci\ndairymaid\npollies\nrescuecom\nsalke\nlilliana\nratemaking\nabigael\nplenaries\naxley\npeelings\nbakry\nworkshy\norkopoulos\nmulwray\nconsultee\nbarreno\nfehmida\ngwtw\nwickaninnish\ncchit\nneighborworks\nymba\nofftake\ngottfred\nfenthion\nsymyx\nwrigglers\nvulcanised\ndistillerie\noneconnect\nszydlowski\ntransmorphers\npeshwar\nnargess\nfreeflight\nbernardez\nmbos\nnativities\nbowett\nreback\nsemprun\nmezhgan\npotholed\nconna\nmuzzammil\noberly\nfange\nfixmystreet\nfraternising\npetibon\nectd\ntwistable\nbagatela\nfindlen\ngalerne\ntatties\nhemingses\nandolsek\njija\naberpergwm\ninrix\ndicosmo\ndenunzio\nfishtoft\nhaux\nfiving\nnijrab\ncatinca\nyajun\nglycemia\ntelnack\ncasani\ntureens\nbrixmis\ntakwa\nlegalzoom\nuhrmann\nhershenson\nplaydough\nfuneralcare\nsutardja\ncbeyond\nbugala\njersy\nkaniz\nfilched\nscampers\nkontron\ncondorelli\npalsas\nchugger\nbrezinski\ninstrumentalized\nsothoron\nwarks\nisraely\nwassila\nbejesus\ntranio\nacurately\nliathach\nmutarelli\nserreze\npepole\nkonyves\ncenterbrook\nexcon\neigenmode\nmimouni\npayes\ngoshka\nragtop\namsc\nshonga\nshapewear\nvullo\ncheret\nfitou\nsunjay\nbarrit\nteoma\ncarmontelle\nctol\ndzhabrail\nrapleaf\ngusterson\ntpcs\nguerzoni\nbigombe\nllobera\nhordichuk\nherbsaint\noxyfuel\nkishmaria\nspotfire\nrangle\nbaltimorean\nituran\nculinaria\njatuporn\nisofix\nchelf\nchre\ndehousse\ngymkhanas\nscaparrotti\nboaster\nnafai\nstimming\nkobborg\ndanieal\nhuggles\nchangthang\nashtons\neufemiano\nlandsdown\nomnisexual\nzess\npercipient\npaulita\neversheds\ncurado\nachnashellach\nbbbs\nlrmc\nreynoldston\nwoessner\ndivaldo\nweibrecht\nnetplay\nmacbird\naglieri\noutq\nmountnessing\ntores\nlamur\nzaborski\nlopinto\ngibor\nwiney\nsafarik\nproceratosaurus\nberetania\nagasi\nrestefond\ntiedt\nparten\npainfull\nmicrocurrent\nbulgin\nringgo\nakard\nbillik\ncomponentone\nladens\npotashcorp\nridgeon\ngrijpstra\narculli\nwislawa\ncardsharps\nfrataxin\ndateland\nyerushalaim\nbargepole\nlicuria\nkinyara\nkure
shi\nballog\nseemd\nwalster\ngalamaz\nprerow\nacebes\nchemring\nstepover\nbienkowski\nasness\nkrsko\nfastercures\nswingometer\nyangchen\nsefl\ngötschl\nservanda\nbolarinwa\nkomfort\nmofcom\nwära\nmarette\nsportsnite\nmonitory\nmaunders\ndaroff\nchmagh\njlens\noverstocking\nketteringham\nfgx\nchartchai\nhimadri\nshimmies\nniga\naccoona\npetion\nclariden\nshuqing\ngardenesque\nbiegenwald\nmfdc\nmeunière\nsvankmajer\npfab\nhookway\nvictimise\nbusken\nrosebushes\nrimed\nspecfp\ncastrations\nultimatly\nvatskalis\nyids\nradcom\nsanela\nfeazel\neducable\nambiences\nsativex\nanone\nfavrile\nneurobiologists\ntrigged\ngilliss\nbasista\nlaipply\nshockhound\nseasalter\nhafidh\ndecipherer\nkayson\nebere\nmomon\nciccarone\nneau\nincestual\nseewagen\nparling\nmerka\ndanario\ndendur\nfantan\nintifadas\nyeat\nfensom\ncohabitate\nmillenarians\nethinic\nzabumba\nocassion\nthistlegorm\nservicenation\nxoops\nwsbc\ngriessel\nelfish\nzakrajsek\niftikar\nacclimatizing\ninsertable\ncholesteric\ninfantilization\ngetu\ncinemex\nheeks\nhjelmeset\nsolinsky\nspraker\ntortes\nmelini\ncimc\nsvetlova\ndervaes\nmagmic\nkielholz\nnty\nparagominas\nwillye\nriverhills\nsantillo\ngrudzien\nautoliv\njinko\nloudhailer\ntaxotere\npeopl\nhakani\nglobins\nathenee\narchvillain\nbryncethin\ndjenne\nbredenbury\nhärö\nmanderino\ngetzville\nelsenburg\nperadventure\nkrei\nswiat\nsenafe\ncarolling\ndeephaven\ntvcs\ndanescourt\nnellemann\nturtledoves\nljubisa\ntraffick\nrutto\nroels\nozal\ngrippy\nexagerating\npompoms\nither\nkuiken\ncompanionway\nneziri\npodlesh\nvidanov\nenodoc\nhausas\nlenadoon\ntwirly\nspymonkey\ntalbiya\nhayan\npellett\norab\nlaramee\nlumieres\ngautrey\nbalky\nbuckbee\nxijing\nhaleema\nsinnet\ncostliness\nméret\nfinfer\neuromanx\nsandersons\nwcwj\npougatch\nbaocheng\nsacredly\ncaixin\nbeeker\nbirthrights\nsystemised\naucklander\nwamakko\npatronizingly\nincans\naibek\nporon\nkasirye\nstrewth\nliberalizations\ngroundouts\njaquith\nlandmannalaugar\ntaburiente\nbulwinkle\nvdh\nkirp\njavitz\nouertan
i\nbalach\nmetabolix\nsexily\nrefiguring\ngestair\naasld\ncosl\nantiporda\nmolmenti\nlacunas\ngwenyth\npopulistic\nbopet\ntulketh\nwildbirds\nmirrorshades\nkutahya\nsonnenuhr\nbensin\nsecuritize\nzawe\ndmvs\ndovell\nbradpole\ncloggers\nacropole\ntetsworth\nkyaing\nsidero\ndungavel\nmaltsters\nclariion\ndapagliflozin\nyesipova\naronda\ntroedsson\nkwambai\nfrob\nchevannes\ntrusteeships\ndesalvio\nrezwan\njokesters\nhisself\ngaymard\nriecken\nargott\nsuperfreak\nvidcast\nruetten\nhairbrushes\naparent\njunchang\ntyhe\nalshaya\nalsf\npargat\nparimarjan\nsjf\nmolendinar\npundole\nchatauqua\nscolt\nmobisode\nsiobhain\nrestuarant\nmalallah\nunreadably\ncoinstar\ngstp\nfrischer\nazumazeki\nexaltations\nfranjieh\nmeskerem\nloizeaux\nverdecchia\ncottas\nputhan\nchannings\nchukwumerije\ncolthorpe\nunintelligibly\nsherazi\nteamsheet\nagoro\nshamini\neshaya\nelementis\nalmaco\nfascinator\nmaximón\nsaganaki\ntrostle\nfibulas\nababio\nsifs\nposssible\nniedner\nfogbound\nsettergren\ntrollers\nhereu\nroggensack\njhonas\nleiths\nmaurkice\nsubotic\namhp\nboachie\nmlungisi\nbewitches\ndigitals\nenzyte\nundersupplied\nmirones\narabatzis\nfeudals\nhatef\nwitkoff\nsamogon\nshipmanagement\nderbyhaven\npigeonnier\nrockschool\nmared\ngeeked\nnanosolar\nharlyn\nbedlow\nkadum\nfiltrona\nscrase\nplanells\nsimplyhealth\nnymeyer\nlisey\nboycotters\nljubic\ndefang\niqm\ndunbavin\ncenterbridge\nplumpjack\nirrepressibly\nhanah\noveroptimistic\nchepchumba\ncossall\nculotta\nabiamiri\nmeys\npsalidas\nwaterstreet\nleivers\nheadcounts\npummelo\nknobe\nstathatos\nchaddi\nfricassee\nchfc\ninveighing\nruyle\nshopkeeping\nahould\nboumerdes\nbolsterstone\ndouz\nrautureau\nranelate\narpaia\nenvenomed\ncarth\nmajorite\nlousiana\ndiptyque\nodighizuwa\nshenoi\nkozmus\npison\nunspecialised\nbaosheng\nolesia\nintrusively\nnekipelov\nkamy\ncicchitto\nmslt\npaux\nfroncysyllte\nklensch\nnylen\ncryovac\narzano\nfaycal\nesperanca\nrakishev\nheshima\naschan\nmcglothin\npcoip\nziqiang\nkhemiri\nminidress\nrajastan\ngumerc
indo\nefrati\nnorcot\nsupercapitalism\nzestful\nshoob\nnonstate\nsmolar\nlesters\nfeiring\nmilds\neyrum\nunteachable\nmineworker\nexhilaratingly\nmyelomeningocele\nsitbon\npowellii\nindofood\nalego\nzahia\ngueiros\nhaula\nsipper\ncabeus\nbuschel\nmonomaniac\nkeed\nfreiras\nnaek\nhurney\nenglande\ndramat\nmorawska\nlanzaro\nsudova\nmeshumar\nfundies\nozunu\npermut\nhapshash\nonasis\nbetão\nxiaopei\nloval\nminghua\nbohorquez\nsidener\nhemminger\nfsid\nmaggoty\ndiveroli\ncolourant\ngemmy\ncaulcrick\ntosatti\nhydrocharis\nzhenxing\nnyctv\nkoup\nsiimes\nfojut\nfigdor\nchuckers\ngayeton\nliniang\nstadelman\nyathreb\nfingland\nhavenco\nflirtatiousness\nburkhoff\nbaksheesh\ntorley\ndevelopping\nyassini\nbogusevic\nalabau\naquada\nfeatherwork\narbours\ngilfeather\nlervik\nrowhedge\nsuperport\ndeluges\nmassone\nmawsynram\nacasa\nscarsbrook\nmoulinsart\nscif\nkircus\ntyreek\ninania\ngulino\nvalueclick\nunifiers\neichholtz\nkobar\ncoquetdale\nimprovidently\njakez\npradas\nearwicker\nqargha\nmentel\nfionia\nhawesville\nponchaud\nesolar\nstalmine\nborghezio\nmoistness\nedutrust\nothewise\nbamler\nranque\nforray\nkilchoan\nmoim\ndcct\nadulteries\nsosua\nauditee\nacsb\ncahirciveen\nstrickly\nlochtefeld\ndescisions\norsborn\nquerencia\nuyesugi\najumogobia\nappelhans\nbarrena\ndapibus\nqingchuan\nchelston\ndeputizes\ndilwar\nsquealed\nmillionnaire\nshearston\nremedia\namrutlal\nnerad\nrossburg\nprettied\ndreifus\nsportiness\nsavella\nkiiza\nkashkin\ncroxon\nunflyable\nandother\nendodontists\nkaten\nthsoe\nballygally\nhutman\ncalafeteanu\nhypermasculinity\nbalhaf\nwoodnesborough\ngecf\nbuchser\nopeiu\nnorene\ngavina\nvergie\nabsurda\nthickett\nsignalers\npassop\nsaeson\nverticle\nmohajirs\nstamsnijder\ntheplanet\nheatherdown\nlataillade\nbillys\ntoxigenic\nurbal\nadmonitory\nkhbs\nreuveni\nkarpan\nkukes\nnyana\nzarlink\npongsaklek\nkriesel\nenjoyability\nfaskally\ncodefendant\ndeclaims\njoyrider\ncantinero\ntrevarno\nwantz\njahagirdar\nnoorullah\nsokolovic\ngouze\nviemeister\noelschig
\nbienwald\nbiotime\nilgenfritz\nrasoulof\nmiddlehurst\nmastrogiacomo\nreburials\nclickjacking\nmiskimmin\ngruman\nleyh\nbradsher\ngabetti\nroenbergensis\nautoworker\ncandrea\nbdds\nhemorrhaged\nwesterkirk\nbinman\nnimbler\naffalterbach\nfarmdale\nconvinient\nforfend\npraxedis\ntantaros\nicaac\nboleyns\nkashou\nlinchpins\nbernabo\njelled\nhardeners\nfrends\nmgx\nzian\nbrevetoxin\nopression\naaus\napkws\nmellersh\nbartlam\nroundshot\ntopis\ngerindra\ncheseborough\nboerman\ntransamerican\nschiferli\nchdi\nengorge\nsmashbox\nprinsendam\nknobhead\ngreu\npreselect\nsaharkhiz\nneily\njuett\nreconnaisance\ndobrzanski\nkoofi\nortmeyer\nimprisonable\nipplepen\ngentilozzi\ngetaneh\npyleva\nzengo\nmemsahib\nyappy\nserouj\nfamilys\ngusteau\npredicating\nkralovec\nmicrovolt\ntransmontanus\nstrief\nserrie\nanoosh\nyahiye\nninas\nperurail\nquickenborne\nsaaeed\nmoneysavingexpert\ncherenchikov\nmccollin\nheca\nmasahashi\nbookstalls\nsimbol\nannouncment\njnpt\nkikay\nshely\nwinai\nguayabera\nskirling\nbouzereau\ngroupwide\nmezes\nchuprov\nmasculinizing\nradwanska\nwinegrape\nhuget\ntanhouse\nakenhead\nstalkerish\nbeguiles\nwallsten\npiratpartiet\nguiberson\ncolak\npacuare\nwahhabists\nkortz\nmeisch\ncurlicues\nplunking\nhardihood\njamus\nmadell\ntullian\nmitered\nneverfail\njerseyman\npianigiani\ntitv\naspirins\nepke\nsolomonov\ntravner\ngermino\nmountainbiking\nlortkipanidze\ndisciplinarians\nschnoebelen\nmetastasizes\ntroake\nmccatty\nshellacking\nhaasis\nrodrique\ndorette\narmazones\nkymberly\nftsa\nquinacrine\nkarsen\nthurairajah\nmumpower\nrukman\nchurcham\nunachieved\npaunovic\nbookspan\nendurable\npastilla\nponcirus\nmadeleines\ntroest\nchannellock\ncratfield\nboadi\nsashka\ngehri\nseefeldt\nbackstops\nscheinin\nproperness\ncicek\nblumenherst\nsinuously\nrockpools\nkorto\nrevercomb\nhammoudi\ncontorts\neisenhofer\nblacksea\npanbanisha\nskemp\nretests\nkaramehmet\nempathically\ndiegnan\nramnarain\nmochovce\nankergren\nstrole\nfaceman\nbográn\ncomman\nradhames\nrosmarino\nspiro
poulos\nunfortuately\noovoo\nlubetzky\nchukarin\nalpenglow\nspycher\nsycomore\npetrosky\ntelemax\ntardily\njeret\nveenhuizen\ncraniums\ntwatt\ntowable\nbeardsworth\nwduq\nunobservant\nkonarka\npicograms\nsnuppy\nmangelsdorf\nkushnick\nträumerei\ndamery\ntravelgate\nndpb\nxinran\nrehabs\nlittlefeather\nidentifing\nsickie\nnkh\nzillig\nartayev\ntelecomunicazioni\npichugin\nzollman\nkabalagala\nboscarino\nkecia\nezrahi\nrodgerson\nnotebaert\npibil\nhupe\nbushton\ntidiest\nccxr\nfarshchian\nrosemberg\nresales\nderrogatory\nbuntu\nhimslef\nmedialink\nvitaya\nkimchee\nnazarena\nsalewicz\ntwell\nklouman\nbedian\nsequens\nblaszczak\nearbud\nupadhya\nspringiness\nkaffeeklatsch\nglcc\nlangum\nnumberplates\nherzstein\ncatalunyan\nmandjou\nzuul\nlongborough\nshieldaig\necfs\nkrisch\nsirba\ngww\nmikela\ncasteen\ninambari\nzuberbühler\nvatukoula\nhizmetleri\ndiffcult\narnove\nmejid\nsahimi\npetroleums\nimpregnability\nllwybr\nmagorian\nfaps\ntervela\ntayr\nacccount\ninchture\ndextera\ndmpk\ndelmi\nqinglian\nxtended\nstiemsma\ncrispbread\nmcrobie\nrobesonia\nirniq\nunclarified\nopportunely\nkewstoke\ndamadola\ndanamon\ntwango\nsonderborg\nstargroves\nopenskies\ndrewsen\nirwa\ngielinor\nluridly\nkaatz\ntweetup\ndentler\nginsparg\ncraan\npartain\neyob\nrodzinski\nstairmaster\nliberzon\nluigs\njirau\npatientslikeme\nwindcatcher\nspringall\nzoumana\ngledson\nrellenos\nstarstrukk\nreichler\nrundles\npasando\nzmh\nlaoreet\nmaydown\nruffley\nousama\nergocalciferol\ndellapergola\nrabczewska\nlacob\niguacu\ninfed\nebri\nheijde\naggs\nchunming\nlandspeeder\nchurchfields\nhacksaws\nvermiculture\nashdale\ncarwile\nspecked\nsevergnini\nlongsuffering\nliyong\ncheesiest\nkennacraig\nlindzey\ndussen\nkadak\ndeslatte\nearthday\nguden\nsniffen\ngroskreutz\narkoma\nbeaute\nacorda\ntscherrig\nhawkyard\nmungle\nsynthroid\ncounterpuncher\ndanenberg\nmolseed\nwixams\nchugged\nbylaugh\nkroschel\ncorones\ntallack\nsehbai\nfiaz\ncaolan\nhaywain\nsuances\ndsts\nharuf\nvalmik\nmoumou\nconcerened\ntahani\nstu
ffers\nundervotes\nvertrue\nmogielnicki\ncompellent\nexercize\nswitek\ngegenschein\nnonpareils\nbishil\nungifted\nlaines\ncybercafé\ncroplife\noverproduce\nrwdsu\nonli\nfolzenlogen\nunbeatens\ngeebee\nsherford\nguanglie\njacobello\nminsterworth\nballis\nlennick\nbunfight\nkilimnik\nbristows\nzahradil\nlibertinage\npaleoclimatologist\nhumberhead\nosayomi\ntenents\nattara\nsabiu\nstrattera\ndohnal\nnetworx\nmitsuwa\nmayis\nmuffley\naduc\nsmeekens\nkupczyk\nqingwei\nhoshiyar\nfinancieros\nbhanji\ncsrd\nedvige\nuncracked\ncamboriu\nsadberge\nsportshift\nshopzilla\nreinsured\nersol\nfictionwise\nschweder\ndecomp\ndinwiddy\ntankleff\nnscd\nrestage\njordao\nhelgemo\npoyiadjis\nreinjected\nfixedly\nhalloweens\ncyi\nfernau\nmunters\nfcms\ncruso\nenterotoxigenic\nthurlestone\ndysgu\nbrako\nesfa\ndeiniolen\ncunis\ngoodfellows\nreginal\nwijesinha\nmesmerise\ninterconnectors\ngenthod\nkuishinbo\ndauster\nviropharma\nqingping\nhatzistergos\nbertho\ndemulder\nampd\nmikulic\noduwole\nakete\nkollars\nstortoni\nrakotomalala\nbetadine\nhings\nconsolmagno\ncheen\narefe\nrthe\nmaryi\nmilkers\nfringers\ntuitama\nsupervolcanoes\nverzbicas\nexequatur\njampol\nsupertitles\nmashego\ngriswell\nnecton\nhemispherx\nsedgehill\nsotik\ngermicide\nfamilicides\nzamick\ntahona\nbriden\nhocknull\nbyat\naldercar\npitilessly\njaneczek\nnaqoyqatsi\nmchaffie\nlubavitchers\nsystematise\nmikheev\nsamkon\ngasters\nfleeson\nexcoriate\nlienard\nsupersonically\nmoeritherium\ncoorey\nheggem\nforestalls\nfoat\nwhartons\npffft\nshevell\nlievesley\nmomber\ndaejon\ndouthitt\nstaccatos\nactimel\nmaddelena\nbreighton\nunpalatability\nhighend\ncorma\ncynergy\njagtar\ndemeester\neuroparc\nfastcat\nmaryn\nbancolombia\nbtcv\ndispicable\nabdah\neskisehirspor\nvesuvian\nschalit\npidd\nsunnyview\nwwrd\ngharibi\nuggh\nmiquon\nroddon\nrasid\nfendel\nnewzbin\nmostest\ngolen\naglietta\ndamsons\nbyetta\nypfb\nsabermetricians\nsterger\nrofé\npaxinos\ntrevillion\nheidbrink\nkokslien\ngranacci\nvacchiano\nteber\nfenstermaker\nyanchi
\nteleco\nnonbiological\nmaise\nghiraldini\nkimizuka\nmormans\ncioccio\nachache\ncordi\nbernett\ndromm\njunmai\nheycock\nwouldbe\ndreaminess\natherectomy\nconexion\nmazzocco\nscharfstein\nyolles\nlubero\nshagwell\nscooting\nenguri\nsvejk\nmardom\ntheyer\ndemitasse\nrommen\ngumatj\nwooroonooran\nlimelife\naprille\nparcher\nhopey\nsetka\nsavouries\nsulca\nlangold\nlurgashall\nthrailkill\nzhenlin\nallou\nmamirauá\ncontinuances\nrayovac\nkomisarz\nhamoked\nbullhorns\nmaidenberg\nfawer\nduffaut\nlobolo\nmaktum\nkirm\ndallat\nfloortime\nstuckmann\nsoliai\nphuntsho\nsiemers\npanaceas\nkeyholder\nselvedge\nizumida\nmukadam\nbailu\negusquiza\nastorg\nmattru\nassasinations\ndimitrenko\nthomasnet\ntoopi\nbocchini\nkosier\nstellone\ndaurio\nspeedbump\nseggerman\nfourtou\nwineberg\npostema\nphysiologies\nsoetens\nsolectron\npobst\nproliferations\nresler\nbabassu\nmilbrath\nsockington\nsordidness\nbeaford\ntumminia\nekkehart\nsantofimio\nsucedió\nbirchley\nbenichou\nwouold\nrepligen\nepyon\nolshan\npanitumumab\ncarsick\nweggen\nbouake\nmernagh\nlingbi\nreprobates\ncoughenour\nbeethovenian\naberafan\nuralde\nmohmmad\nconill\ncomcel\nsnooped\nresourcefully\nimberhorne\nhypobaric\ncriminogenic\ncouchettes\ndribblers\nblesma\njumalon\nlorek\ngiau\nnewstar\nfacci\nyuguang\ncobler\ntrullie\nzarnesti\nshvedov\nhainey\nhatsuko\nreceipient\neggishorn\ncardiometabolic\npolsinelli\npoxon\nwpfw\nbenac\nchudy\nbyelaw\naristodemo\ncaenby\nvaugrenard\nchoppily\nalcotts\nismaiel\nvelardi\nagyapong\noverwatering\npanjwai\nscreenless\nthinkbox\ncolubrine\nanglophobe\nestatic\nmedai\nabdelali\ngainsaying\nlengshuijiang\nunshelled\neurid\nwalmarts\ngreenbook\nyujun\nmarakesh\nsportsound\ndeminers\njenice\ncopart\npoults\nagj\npiggybacks\nphotoplus\nsollis\ncosper\nooyala\nmangement\nqfn\nsuddath\nichilov\ncorlough\nsilverfox\ncargolifter\ncullingham\nmatier\ngaughran\nflumist\nuntaught\nprocuratorial\nshortstown\nshichimi\npragnell\nariels\nstreamium\nturquet\nshowerhead\ntastykake\nmohadi\ndarcos\n
topfield\nfebrurary\ngebara\nbelgraders\nlazari\nloai\nluckin\nkhakwani\nzhouli\ngentzler\ncafemom\nsequestrate\nkaspa\ncoiffures\nhawaiiana\nwixted\nboulcott\ndemartin\nreenergized\npixelsense\ndiminutions\nsugdens\ndenars\nsuccesstech\nferoni\nattwooll\nappose\nlobotomize\nadjured\nromanticising\npyrroloquinoline\npancaro\nnaamani\nrannazzisi\npierrelatte\ntreponemal\nturbonick\npaluku\nmatrícula\nyakoub\nllanelwedd\nspatters\ndofe\nprograme\nduenyas\nasmin\nvagnini\nfishtailing\nrodius\nkoroneia\nfasuba\nkefi\nnietzche\nindrio\ncaddock\nhealthmap\nsellen\nschwartzmann\ngunster\nshukria\ndukoff\nschickendantz\nnonn\noldness\nbreakings\nywha\ncraning\nxiaosheng\ncedu\nbway\nzackie\ngainsco\nsigmundsson\nramezan\nkleeblatt\nwartsila\nplanetologist\ntollgates\nsuccussion\njaffarabad\nstankey\nsobig\nhessin\ndanot\nhumidifying\nmyojo\nmetsing\nhenken\nahuva\njoanny\nfahrer\nanthropomorphisms\nunsafely\ncrewcut\nkrz\nfadavi\nlongball\nmokwa\ndensley\nmmac\nbfrs\nweans\nzatopek\npalacky\nctsi\ngrozier\narrey\nturbodiesels\nfrisé\nquitline\ndirectlink\nusich\ngitti\nhyn\ntananta\ncluver\nspaccarotella\nrumel\npignone\nphilson\nsomboon\ncarecen\nbasiji\nchristkindlmarkt\nhematologists\nredetermination\nbudzinski\npreens\nfalen\nenvio\ntexican\nproductized\nmgaloblishvili\nshashamane\ninmotion\ndawodu\nmarzuq\nmythomania\nwalgren\nbairo\nbeyound\nbikindi\nzardad\nlawall\nleweb\naniya\nmsil\nkynance\nsaalim\nzacek\nsutters\nmilpark\natlanticism\nwarnie\namericanness\njinlin\nwoudstra\nudawatte\ncloughton\ndanbert\ndabeli\ntrich\nhandholding\naorn\ndoeringer\noxholm\naffliation\nbarrineau\nsuncom\ncottoned\nhomecrest\nprettejohn\nlindenberger\nstudham\ndragonaires\ngibara\nlehrke\neljero\nthunks\nproporz\nepcm\nbrunonia\nearworms\nviewmaster\natcher\nwhant\nulery\ntofurky\nzentsov\nwisewood\nwahabbi\nrumbustious\nparkdean\nizembek\nesdale\ncarrafa\nblaxill\nfrieslandcampina\nbrawer\nyuks\nqarnain\nlivedrive\nsiplin\nqtopia\nphilex\nrehabilitators\nfomboni\nsearingly\nbloxwor
th\nbeofre\navailablity\ngeneina\nazacitidine\nosinde\ncontinious\nmahata\nsungnyemun\ndamirchi\nvcj\nbasils\nalaron\nnaccho\nboozed\ncarigali\nredresses\nyelda\ncarithers\nmenuez\nowlets\nvolynets\nhoveringham\ngierhart\ncastleview\nfauchald\nboldwood\npantic\nnewsie\naeroscout\nlamama\nfirewalling\nupcs\npischetsrieder\nwarmings\nbelstaff\nfamiles\nenvia\ndeliverymen\ntheth\npastie\nhataway\ncemevi\nmesco\nobtrusively\nembryologists\nbeverlywood\nvolochkova\ncanipe\ntalkingpointsmemo\nsejour\ndjoudi\nbohio\npetteril\nmaqsoud\nheinisch\nlols\nkutralam\nmicó\ncanlis\nhibner\nmisinformative\ntestees\nreisberg\nchapur\nkrendl\nbosche\nterritorian\nbienenfeld\ncraigroyston\ncagnotto\ninstitutionalising\nroguery\nthumma\nkneedler\nkinvig\ndurants\nlohrke\nmotsi\ncoater\nromanby\nritcheson\nthevenin\ndensuke\nkampani\nanhydrobiosis\namnesiacs\ntadek\naljs\nmessanger\nggbs\noverabundant\nbuckels\nkerlon\ncontesters\ntrilli\npostition\ndamazin\nnisenson\nmuscatelli\ntolfree\nbrosio\ncircs\nesmod\nrhinog\nvétheuil\nmedinas\nsubotzky\nmicrogrants\nalacer\nmittelman\niise\nbelongers\nbadmouths\nprax\nketam\nmechanise\nkahney\nrheumatologists\nhellebores\nfullfil\nplasticisers\nvalorized\nswifton\nbeinhart\ndenève\nesab\nmulticasts\nlindborg\ndermontti\npabla\nstaropramen\nlouwerse\nmaloofs\nvosshall\ncisternino\nmarkopoulou\nfreegans\nfervency\ntambal\narenson\nramkalawan\nmasalai\nwoodstone\nmahroug\nladyboys\nbgz\nchalkstone\nshowhouse\noffcial\nculleton\nbrutinel\nzendai\nbernando\nmakubuya\nbeledweyn\ncpci\nfantastik\ngeothermic\njazzers\nchabris\nrozes\nkastari\nhwu\ntereska\njerseyan\ngansta\nweigold\nfizzes\nmursley\ndescants\nsourasky\nhazer\nwanandi\nweissert\nbridgespan\nfourniret\nsedately\nkapowsin\nkarabel\nktvn\nsucuk\naljofree\npolyarchy\nequivocations\nwitchita\nprozone\nwelday\ncastucci\ncristalino\nteabaggers\noverweighting\ndesignedly\nanmd\neffortlessness\nkrawcheck\nshutts\nniyamgiri\ngohouri\nsuchiate\nallihies\nmanikfan\ntazo\nyousri\ntanovic\nvoinjama\
nkelvinbridge\nwheeltappers\nauthorless\ntwineham\nreseating\nuploadable\nladya\nfraynd\ndardar\ndoozie\nambuhl\nchapulines\nmeryon\ngartshore\nwoodacre\nalkermes\ncoberley\nsoaper\nhadhari\nkwiat\npollastri\nwindler\ncristen\nfarecards\noverlearning\nvillacis\nberlau\nyardwork\nhadja\nbonarelli\nderenzo\nturnes\nmiyakonojo\nvoltec\nitogi\nramdass\nlorello\nwalsoken\nigcp\nspringmann\nodalisques\nnénette\navigail\npoliticaly\nkazam\nlinothorax\nreblochon\nfilberts\nlongformacus\ncamaya\nmakriev\nbogomila\ngeolocator\ndysmotility\nheane\nkanhar\ntalkbacks\nsabeen\nchievres\nlillico\nwedgbury\nlevete\nzalkind\ncutely\nshufflin\nhabayit\ncarfilzomib\ncastelaz\nhoovervilles\nnurc\nmorange\nffo\ndhurandhar\nstahle\nchacin\ndlco\nkarabits\nquaff\nindemnifying\neggli\ntautness\nakhromeyev\nsuhre\ngrainthorpe\nnarec\nsculled\npaperworks\nsonofabitch\nnonhazardous\nwanstall\nnabilone\nbeastiality\necoop\noutie\nmedicalisation\natchity\nmacroregions\nlangeloth\ngordien\nsugeng\noschmann\nbiener\ndarrian\ndaswani\nshanaka\nibanga\nictqatar\nockerby\nsuckler\ntroublous\nchumbe\nyakum\ngunks\nstrouss\nfloricultural\ngalloon\npresentence\nearline\nparamjeet\npontygwaith\ntiphaine\nturkevich\nehrenkrantz\nwhiteknighttwo\nsianis\nploner\nwomanizers\ndratsang\neejit\nlastings\nrothbaum\nchikin\ndelva\ntormarton\nradicalise\nmagnetocaloric\nboockvar\nrehr\ncolliton\nheimans\nackergill\nkhadijeh\nmakgatho\nmanicurists\nbavaud\nsolage\nsaleiro\nteetotallers\nleisurewear\nkindertransports\nautocare\nkilonewtons\nservies\nobradovich\nmsst\nfunkiness\ncapsulitis\nnoisiness\npeaceforce\nnickolls\nsagrantino\narcalis\nbudhia\napartness\nshortcutting\nmcaleenan\nnovotna\ncinephilia\nemta\nmontanas\nenkhbat\nuspi\nmahnken\nsiry\nsunrice\nlza\ntwinset\nsarara\nroseannadanna\nrussianness\nmakudi\nbatpod\nnakazaki\nmaguiresbridge\nkariobangi\napportions\nhansheng\nviko\nlsrs\ncritisim\nnewboy\nbethard\nadgp\nbruley\noffchurch\naverna\ncanisbay\nchipolopolo\nbatsmanship\nguajillo\ntraka\njohson\n
thougt\nkidani\nviscaya\nshiveluch\nelving\nbegums\nngobe\ncorjova\nbamian\nsoderblom\ncfcc\nperilipin\nimplanon\ngybe\nentrapments\nwtwp\nvassilenko\ncommmon\nfoggers\nbromery\nfourt\nmethylxanthine\npukach\nchyulu\nkrisher\nblueshirt\nthicklip\nphreakers\nclappy\nhalkias\ndecrow\nfrolik\nsudderth\nfiki\nklotho\nilaoa\nliebesverbot\nvinegary\nforcemeat\ngibel\ndeodorizing\noregan\npanousis\nrcuk\nmarsannay\nunconscionably\ndodji\nprono\nfulde\nefsi\nnesci\nnbcs\nzakri\njonckheer\neyser\nsubtextual\nyuyi\nhcfa\nglossip\nfavazza\nuniphase\ntweeking\nabsar\nkoblin\noutgun\nyeasty\nzantac\nshafiee\nmattiace\nrinpoches\nospel\nhandcycling\nanoc\ngevel\nbouron\nlebling\nuplander\njamora\nlinpus\nineed\nesquith\nvlccs\npunaro\nshopowner\nshaktar\nbereavements\nosala\ndrwal\ndissimulate\nstreitfeld\ndalbandin\nsydes\nmischaracterise\nsteingarten\npercussively\nsogecable\nstonethwaite\ndabbah\nconformality\nbeaufret\nweedpatch\nhonko\ncowlin\nviatrovych\nlocationfree\nmunnell\nrasit\nwielaert\nsumptuousness\nlubarsky\nvalenki\nsauntering\npeebleshire\nbrancowitz\npanthongtae\naccha\ndodgeballs\nbrendanawicz\nlinebrink\nlispro\nenkianthus\ndespoiler\ntorrellas\ngericault\nchifa\nswannery\ndemonstratable\nbummers\ndipshit\nmusicans\narmyan\npatetico\nshapir\neyecatcher\nfixturing\nshorris\ninfostrada\nkuenne\nblurp\ncagerz\nchewers\nkhowst\nsetaimata\nannihilations\nmedeco\nbenslimane\nhanslip\ntincup\ncardiography\nbassinets\npopsters\naardonyx\nmanocha\nchoosed\nsirett\ngoncz\nmimicing\nachte\nleaze\nmonstruous\nentitites\nriksgränsen\ngcig\nwhoopsie\ndunauskas\nadnam\nrunned\nmovi\nclearers\nfluffers\nbeignet\ndelago\nwhippin\nguiterrez\nmaskrey\ndrummy\nwadan\npersonalis\nzicherman\njerious\nfalkoff\ngangling\nmayford\ntakle\nstamile\nprepcom\nconceptualists\ncrumey\ntriptan\ndester\nkhator\ndrudges\ngrowly\nsomeon\nhathern\nrusutsu\ncybrids\neneloop\ndiprete\nmeasurables\nabishek\nunchaperoned\nhudhayfa\ncvii\nmayahuel\ngebremeskel\nhuissen\nbezbarua\nadarand\nvitters\nb
arczewski\ndershwitz\nzilly\ngoldworm\nwooters\nriemerschmid\ncfmp\nsogetsu\nplainpalais\nturchyn\nmacnichol\nszarzewski\ncaresource\nradina\ncreran\nreevaluates\npiekarska\nwaafs\nndms\ntyszka\nsenel\nscrine\nvivens\nsahadi\njahanshahi\nkonovalovas\ngenetech\ninsideview\nsquinted\nhilsenrath\nerisman\nkwarteng\npodvig\nsheinkin\nwagnerism\nhipmunk\nkonstan\ncenveo\npottsboro\nauxis\npatricola\nzinczenko\nbibulous\nplonker\nlowitja\nbacp\nrampi\ntrusteer\nacceleware\npillowed\nwilbekin\nbouasone\nabseiled\nbarelwi\nsnitched\nkortajarena\nwrecsam\npersepctive\naikwood\nzeidenberg\npalenquero\nsopes\nhomegroup\nmicrochipping\nfrogspawn\nuncapping\nhalabjaee\npittington\nnymann\nsociodemographic\nwagtmans\nboster\nllangammarch\nbrfss\ntakudzwa\ncelltech\ntrueness\nmccallen\ninphonic\nlancman\nskovorodino\nmohnton\nseland\nmillenial\nkrigsman\nbertholle\naffini\nmetapneumovirus\noldway\nmushkin\nbuonanno\ndebiting\nspankings\nlidong\npatzke\naosis\ndisoproxil\nllanfaethlu\nemagin\nlcvs\nellaby\nstockholding\nwssa\nsentinal\nfornaro\nhartleys\nhuetamo\nnamasia\nhodsock\nangiograms\nelashi\nbilaterals\nponde\nmingardo\nhudman\nrusiya\nnagrin\nneytiri\ncurvo\nmeite\njuridico\nwindansea\ncountercharge\ndrisht\ngutwein\ncswa\ndegibri\nshortchanging\nrubios\ngwaa\naccountabilities\npechalat\nborysiewicz\ncostumier\nkyel\nhafiza\nlinbeck\ntacle\nakinwolere\nklaiman\ndhanji\nhuppertz\nsnookers\ntiano\nsheinfeld\npapalote\npoliklinik\npassingham\npantsuits\nricanek\nlisant\nwhitecraigs\nrehydrating\nccfls\nlaundrymen\nseriocomic\nröschmann\nevolutional\nbritany\nblankenbuehler\nsterotypes\nmultipolarity\ncarryin\nbootjack\nloucas\nplebian\nranulfo\ntaekwang\ntrobaugh\ndetoxing\nconvinved\nessary\njannero\nlpcs\nsplosh\nkaprosuchus\nzeda\nwavel\nanadin\nrydingsvard\nmultibank\nporthoustock\nmainella\neisenbarth\nruscote\nbaltasound\nnyumbani\nrwn\nitzkowitz\ntsybin\nschopman\nalfsen\nkallir\ncario\ncompetative\nunapparent\ntyngsboro\nagustien\naastra\nmingxia\nparsky\napatzingan\
nwinnicki\nvallhund\ndimondale\nbogason\nlimbrick\n\nleibfried\nermakova\nmerafhe\nqubad\ncalfed\nsuanne\nxup\nlayabouts\nabaetetuba\nchayne\nbalkenhol\nalfy\nfinanceasia\nbamy\neviscerates\naylor\nlecharles\nflyspeck\ngadflies\nvapers\nextemporized\nliedel\ncrabill\ntwelvemonth\njual\nkaloko\ntillyer\nkeyla\nbabula\nmumbadevi\nduynhoven\nguanting\nreinholdt\npommie\nlalvani\npriestner\ngelila\nlansdorp\nimmage\npisarski\nzmax\ncampuswide\ndaane\ngersch\nscissortail\nsilar\nschtroumpf\nintertrigo\ndunnavant\nsemiologist\nhideousness\naromashodu\nthunderheads\nnaffah\nammash\nstaehli\nsigall\nmicroangiopathic\nhanmi\npenlington\nchnc\nkitware\ngalderma\npericolo\nvoskoboinikov\nalohanet\nmolycorp\ngovernable\ndostie\njyvaskyla\nepiphanny\nkalista\nshowiness\npilotfish\ncontol\nrefrozen\nalthogh\npassan\nfuntion\nslumgullion\nfacioscapulohumeral\nmuscletech\nevrensel\ngargunnock\nereira\nlivolsi\npopbitch\nbettyhill\nshriveling\nsunweb\ninterpretion\nmomondo\niusacell\nkeyur\nruya\nstagy\nkuntal\nminature\nachtenberg\nalkhateeb\neetpu\nkazemzadeh\nkellington\nvulgarized\nwindels\npaypoint\nwhick\nacito\nereng\nbirdlip\nmurietta\nnareit\nyri\nuzoh\nfrenken\ncuratolo\ntrippet\nauret\njackiey\ngugin\nutstarcom\ndohle\nbertille\nascots\nzaenal\ngestingthorpe\nwheedling\nbrutalizes\nkentisbury\nnvax\ninterpark\nefestivals\nsaiidi\nmodibbo\nwinterval\nmencke\ndocuseries\nbritts\nfeering\ntauntaun\nsermonette\ntobiasz\nmabelvale\ncongeals\ncessations\nstetzer\nezzeddine\nlearmond\nfrontierswoman\ngodam\nskaer\nuneo\nargumental\naquiring\nexhaustingly\ngilleland\ndistorter\ndongtai\nplastico\nstrathfoyle\npriorswood\nzanier\nbqi\ngreilinger\ngargani\ncredentialled\nzengel\nmcelheny\ncaerwyn\nambilight\nyargelis\ngroeslon\nsheera\nidylwood\nelevenses\nspringen\nkayembe\nmixenden\nramazotti\nluetz\ndreman\nviter\nmikitenko\nmaznov\nmushed\npillers\nmikhailichenko\napprising\nedreams\nyehu\nsoyoung\nburco\nbeechmount\nworkd\ngeekdom\nanglophobic\nsegye\nsxu\nrizwaan\ninstils\nk
aguta\nremovalist\nshaindlin\nclubmoor\ncricqueville\nabolqasem\nseizinger\ntgen\nhopelab\ntigue\nkotlowitz\naxyridis\nwhitcup\nshvo\nageno\nmeliha\nneuzil\nfirewalk\nilusion\nkalsang\nonep\ncppib\ngolabek\nschidlof\nclephan\nirradiator\nburack\ninfonavit\ncraiglang\nbaykeeper\nwmn\njsmith\ndhahr\nmoven\npoliciais\nhuacana\nmanocchio\nzazzo\nwaytha\ngarridos\ntransregional\neurodisney\nwrinklies\nverkhovensky\nevjue\nmandujano\ntomatos\nclarach\nmahamid\nbaylee\nbackrow\nhardinges\nansingh\ndinam\nseminyak\nfracked\nwellinger\nboskoff\nyaima\ngangwal\nspecualtion\nfieldin\nscreenvision\nyezza\nksara\nphoniness\nmcdreamy\nbutiaba\nterrapass\nprovinciality\nwuxga\nirrigations\nmontori\nadfd\ngoogins\ntiberghien\nsrulik\ncerasoli\noffbase\ntinajero\nthomet\nthesauruses\nruccolo\nrohleder\nbextra\nredlinger\nparsad\njoone\ncoffa\nwallasea\njirgas\nwildfowling\ngryshchenko\nmountainair\nlaxon\ngozal\nmorch\narcapita\nwahlund\nhamdania\nvblock\ncantebury\nlhatse\nqudratullah\nschmith\nieepa\nloquats\nenodeb\nwiffleball\ngritzner\nkuelap\nkingsmore\nmccomiskey\nthirdparty\nrigorousness\ntongyu\nswanilda\nfurn\nprouts\nbuitenweg\nlatis\nkmtc\nhkiff\nnowakowska\nzagelbaum\nhelpin\nkaratas\nblechacz\ncroley\namerian\nschuett\nunsuspicious\nrailfanning\ngreczyn\nripperger\nbotros\ndolberg\nquamrul\nplaylot\nmeglena\nmuradova\ngoepfert\nbintou\ninterruptive\nmdec\nbencosme\noverstressing\nryane\nsering\nbrimham\nulch\nrehomed\nnpfit\nelterman\ndanus\nmahealani\nschrey\nroflcon\nuncharismatic\ncachaito\ntsho\ncoverable\nhaafhd\nnanomanufacturing\nclinico\nchisaki\nsashed\ngatorland\nwlodek\ntusko\ngurka\nmanolada\nshynola\nseimon\nantillon\ncliquishness\nasalouyeh\nwhhr\nwristed\nstolley\nhuselius\ncellulaire\ncreaked\nwitchdoctors\nschollaert\ngemino\ndayen\nkosoff\ndisfarmer\nbruhaha\ncoloccia\npawprints\nminiaturism\nsalvy\nkarnani\ntabnak\nuntameable\njerron\nkagle\nsodiq\nxbd\nlachica\nclearheaded\nmorjim\nflyaround\ndownpipe\ngocong\nimpulsora\ndurmus\npoterba\nartemev\nkh
omri\nprocessionals\nusds\nrefi\npaylin\ndanesfield\ndreariness\nwanky\nstromback\nsaylorville\nsilkman\nhuchon\nnetnod\nunmistakeably\nmuhlbach\nfdw\nnatko\nreplanned\ncestero\nwagonloads\nrejuvenator\nuniversites\nwriteoff\ndeiter\nabdusalam\naperçus\nphana\nfeltri\ngotterdammerung\nformalises\nmarroni\ncabourn\ngoreski\nmunkacsi\ntraco\nsovietskaya\nkusy\nkadem\ndaturas\nprathiba\njarema\nlutzes\nvartkes\nequitability\naircard\ntrica\nkurnit\nrydin\nerkmen\nzakani\nheikel\nincretin\nstreetballers\ngroundstroke\ngoudé\nchhang\npinpricks\nabbis\nprestigeous\ngamalinda\naneela\nfescues\nnpls\nsandsend\nsomun\ncarytown\nkostos\ninsititute\nnysba\nfreegate\ntwiddled\npoltoranin\nlazimpat\nunsmoked\nsotigui\ngraafland\ntruffe\nnightscape\nlignocaine\nveloci\nglaviano\njeffster\ntownrow\ncodenomicon\nacceleron\ncommerically\ndamluji\nblueliners\nkenchington\nswordmaking\nyanke\nsteinbuch\nnuernberger\nschudel\ngrubbers\nmetaltech\nblogsphere\ntroshin\noldsters\ngvozden\npostwoman\ngamy\ndogtags\nringger\npulloxhill\nyabuta\njeramie\nvicp\nprommegger\nafpak\nsemeniuk\nslaski\nstardoll\nlaunchpads\npaulerspury\ngugliemo\ncornichon\nseedpod\nalhadeff\ndavidovitch\ntrimtabs\nnbty\nnsgc\nchatellerault\nconfect\nwesberry\nstauner\ndunnan\nsolmonese\ncnsi\nhousebroken\nmatherne\nhamstringing\nbadakhshi\nschoeffel\nghemawat\nsavner\nrokni\nsisneros\nschirm\nfrogbit\nhoerl\nhindery\nmonitise\ncartograms\nssese\nwwwt\ndelgaudio\nbeby\npuos\ndimatteo\ncookstove\nroslyakova\ncomparethemarket\nvmw\npassot\npyone\nmoehler\nbeleif\ncasty\nmansky\nvolleyballs\ntataki\ninfinera\nmiens\nsiliquastrum\nscacco\nwifo\nbrunacci\ndjevdet\ntoeaina\nkuitca\njemmett\nslovis\nesfs\ntahereh\npatijn\ndennerby\nhittle\nshamwari\nelbæk\nkilgraston\ntewis\nchroman\nefan\nfornatale\nolugbenga\nvlahakis\nfugere\ntunafish\nemerse\nferlazzo\nprilepin\nfeickert\nendcap\nmustafic\npuertasaurus\ntowelling\nbensalah\nkominas\nrhewl\nmonticchio\nhler\nhirshey\nmavrommatis\nbowzer\nioannidou\nlaganas\npreggers\n
glengall\nossis\ningore\ngkc\nbettter\ngoligoski\ndaon\nmvy\nhgte\nlenze\nstrathdearn\ncopperware\nboogeymen\nhermoza\nsportline\nmotshekga\nunwedded\nbayboro\nneurohormone\njessenia\naxona\ndeim\nkhalfoun\nblisses\ntaoudenni\nsnacktime\nwayyyy\nhlcs\ncharne\nboliver\noriali\nzolezzi\ncerdeira\nmartez\nrosens\nnavarri\nbranzburg\nbentyne\nrheal\nodibo\npreeta\nreinsalu\nmishit\narkless\nkarisoke\nhapponen\napio\ngulledge\nwebsit\njadcherla\nbareiss\nreimagination\ndobek\nwolmi\nextranets\npathumthani\nmeeuw\nkensinger\nmaysie\nbaumgardt\ndockable\nsovo\njunlong\nbankrupcy\nwhli\nmontrell\nrecontextualizing\nleasingham\nburqini\npeddar\nhaversacks\ncmag\nrgbl\nablates\nkalinsky\nlewey\nseyama\ngamrekeli\nputback\ncormican\nportavogie\nbitange\nlecun\ngica\nscaffolder\ntrusties\nyusop\nkahtani\nburtle\nwienerberger\navenidas\ndrolly\nautomobilia\nnatelashvili\nrainbolt\ncashiering\nmsti\nsinkford\nsherene\nnethercote\ndeocampo\nunderrate\nreticulin\nperfluorooctane\ninters\nreeducated\ninforum\nbazoum\nholick\nwildor\nfcis\ndistell\neuphamism\ndjg\nsammies\nartzt\nairstreams\nfinckh\niakobashvili\nkrakoff\nsior\nknockaround\nstortini\nunknotted\nraskamboni\npobl\nselis\nlashmar\npunnoose\ncasia\ngetsy\nbreitweiser\natfs\noverrate\nwhitefeather\nbgea\nalsobrook\nwernich\nsafilo\nhaarder\nbewilderingly\nworldfish\nsportmax\nkalsoom\nglasow\nigbc\ngrandiloquence\nmarineo\nsigmond\nshiara\nwerthner\nchse\nprovigil\nfreepers\npositiveness\nmishears\nantor\njammet\nwanze\nshershah\nbucholtz\naloise\nyoudale\nkosmatka\narleth\nfinchem\nshackelton\ndoddy\nfusker\nbiagioli\noleochemicals\nzakzaky\nlistenin\ndulon\nbushie\nvictimizer\nfirgrove\nrakovic\npreclusive\nmacgarry\nniap\nkrieble\nspeilberg\nwingle\nbeatable\nparthenis\nkitbag\nfennemore\nsweetarts\ncraftworkers\nnessan\niooss\nworsdale\nhyponatraemia\ntalog\njjw\npenelas\ndartnall\nsomashekhar\nisrg\ncarrivick\ngorle\nrudich\nwheadon\nseilala\nszapocznikow\nhodeidah\nharicots\nbuhlmann\nfaccenda\nretin\nmerkens\nkabor
e\nbiley\nmalburg\nrestavek\nshaine\naneuploidies\ngrafe\nselfors\nbeirendonck\nsvedin\nnanjie\ncalim\ndeshea\nplastiscines\nkamiar\ndivsion\nlovingood\nbessacarr\nelci\nmabira\nbosic\ndistractibility\nsiron\nwolpo\ngirozentrale\napeared\nhefetz\nhomeliness\nashoura\nbenotto\nfrankum\ndurnbaugh\nkivisto\njaeck\ndicapo\ntorrefaction\ntowednack\nuncelebrated\nkitting\nmbaya\nmatous\nliefers\nmanglona\nwolitzer\nhandwringing\nesmie\naudretsch\nmboweni\ndvids\nkrisztian\nyousof\ncebrian\npalazzos\nmahendradatta\nmaietta\npolzer\nporokara\nplaywrite\novermeyer\nbumptious\ngrendell\nscatted\nkobaladze\nsinotruk\nzafran\nbedimo\nuncontroverted\nscripters\ntelepathe\ndefeasance\nfrauenliebe\nneeve\nmineau\njalloud\nzettabytes\ncrappiest\nkelts\nholieway\npiddly\nzubeyr\nconditionalities\ntashigang\nrennies\ncoffering\narmaly\ncutdowns\nkuncewicz\nchisenhall\nadwell\ntownsends\nhengameh\nschlotzsky\nmikele\ncaravanners\nbiocatalysts\nstonkus\nrodemeyer\noxygenators\nmutar\nzuman\nradam\nbluh\nvladamir\nmickolio\nhairbreadth\nrepurposes\ntronti\nmcilquham\nwouln\nkyno\naraldite\nmerini\nfréthun\nsivia\ndupraz\nloeff\nrickrolled\nflinchum\nlarkmead\nyarmulkes\nnatagora\nccia\nsamimi\nsunsphere\nsabili\nheartworms\njaisal\nspedan\nanticuchos\nuliano\nalesund\nfullfilling\npreferance\nmedvedenko\ntethong\nsmala\nguiseppe\nblacklaw\nkounis\nbonchester\ndodaro\nthunderpants\nrettie\npaychex\ngyde\nheims\nbarcel\ncapula\nsarsens\nsistach\nblaes\nsouch\nsimplicities\nsnowsport\nbabik\npulmo\nhandoyo\nordona\nflexplay\nmaiza\ndworaczyk\ngiannetta\nchorine\nriquewihr\ncartegena\nshurberg\ngurvinder\nluson\nangangueo\ndesulphurisation\nsavoys\nyealm\ntoposcope\nbaerii\ntsontakis\nmirthless\ndognapping\nproflowers\nvoake\nargentineans\nmondovino\nsensationalise\ngirerd\nopenspace\ngassi\noaked\nlandgate\namerichem\nawing\nmiyar\ncantábrica\ntinari\nchapri\ntejendra\ncalorically\ngrundler\nrenschler\nsaryusz\nbaalsrud\nmasury\ncollateralization\nrolovich\nzurick\nsulistyo\nneurotoxicolog
y\nvitabile\nbracka\nlanghurst\nrockism\njianqi\nellick\nomnigraffle\nduttons\nrogha\nwhixall\nzipadelli\nfromager\ntechsoup\nmobitv\necia\nmcweeney\nviirs\nwinnifrith\nrefocussed\nbarie\nrosebrook\nhopfully\nmcadd\ntrowers\nsaitek\nbayla\nsimranjit\nfabretti\nleedes\nscombroid\nhenrike\naios\nsotu\nfemail\ncalty\nlogista\nmystifications\nbraingate\nchopiniana\nrittikrai\nmazziotti\npardeeville\nrowdiest\nlarvicide\nrjf\nfauque\nbuchter\nfalliero\namiando\naixam\nheadgirl\npaulhus\nwhdi\nmaxo\nlakeisha\npanopoulos\nmcnease\nsubtheme\nvlps\nmoysey\nsrixon\nintractible\ntoughly\nlobstermen\nabromaitis\ngavrilovic\nconsern\nsightsee\nanderszewski\nkinaesthetic\nsuperduperman\noperatorship\nsandhogs\navout\nrapscallion\nfluffery\nbrachylophus\nleithead\nmakrokosmos\nismi\nocme\nintifadah\nkazipur\njinka\naverys\nsurpise\nhurch\nvandenhurk\nbottino\npunningly\nctvrtlik\nlecat\nidania\nvisting\nzenz\nsounio\nvolunteermatch\ngessow\npolone\nkuptana\nberberyan\ntavuk\npappajohn\nroughhousing\namateurishness\nbottlenecking\nwedlake\ntrusthorpe\ncossery\nbusinessfirst\ndalibard\ndictums\nortas\nsinder\nvigourous\nstainbrook\nproustian\nnorz\nfieldhead\nllyswen\nlithman\nunionise\nyisa\nhippeau\noutstayed\nsboe\nbembry\nhayneedle\nbeorma\nhoeilaart\nyandall\nsibby\norrible\nmetrosexuality\nlightrail\nmasspike\nlovsin\nareen\nménages\nabari\nyarkin\nhenline\nscraggy\nvittoz\nmehdizadeh\nhakurozan\ntibbitts\nrovan\ndrewitt\nposterchild\nkalup\nhewar\neichelberg\nhenoko\nsaillard\npetroecuador\neyfs\ntorchlit\nasustek\nhundon\npeetie\nscrunch\npaliwoda\nqatanani\nzewdie\naxeheads\nchupp\nahistoric\ncamcopter\ntorina\nrmcs\ndissappointed\njaid\nsportshall\ngenty\nchristianophobia\ntommys\nkrolow\nspurwink\nginbot\nothella\nladislaw\nsyntext\nplaisant\nmoodiesburn\nsongza\nwohlsen\ncremeans\nmeasurments\nlenghts\nsatyric\nwaterwall\nranahan\ncarrog\nmohla\nceterum\nfiraz\nsolney\noxybutynin\ncederstrom\nmojada\nkrawitz\ndavitaia\navantime\nmakayla\nchaffing\nidiq\nchuansha\noccaisi
onal\naslanbek\nduragesic\nboundstone\nfaeldon\nthejournal\nshrapnell\nbaulking\nmeths\ntrewhitt\nkidded\nfaulcon\nflighting\nbrosky\nkornukov\nmcsweegan\nprulifloxacin\nwollmann\nimplantations\ndurbars\ntoughens\nsimioni\npbbs\ncasebere\nshuyi\nfraises\npariyar\nagüeros\nlimer\nfreise\nfibbing\njianchao\ngroshek\nshenai\nfalkous\nzavada\ncantuária\npotkin\nsuperpressure\ncausewayend\ntriulzi\ncentrestage\nsixthly\nklamm\ncambor\nkamuran\nmarmur\nsennott\ngiovanetti\nupbraid\ncharater\nspeciously\nbamogo\nrugbywa\ndartboards\nteared\naliaune\ndejian\netemenanki\ncastels\nibrahimovic\nmetsys\nconsenso\nniedzwiecki\ncarslaw\nkloden\nstogie\nhalloysite\nchunchu\ncarret\nremgro\nbonerama\nrayward\npatnick\nvlingo\nlamex\nwirtshaus\nfaggioni\nmatthaus\nglozel\nmishandle\nuntampered\nkifer\neboo\ntathagat\nbeyti\nvinorelbine\njarvey\nmontol\nlapatinib\nchashama\nkitterman\nwaigo\nyvie\nrestaraunt\nhoogendoorn\ngarvock\nsesana\ncandicacy\ntarloff\nbackcloth\nhardus\neynulla\nsupercuts\ngymnema\nconditio\nfaeroese\nchandlee\naapd\ndsip\nskrzypek\nhalloweekends\nrbocs\nformost\nzeliha\nbloops\nshemekia\nmannina\nphotofit\nspraycan\nmesothelin\nintubate\ndownies\nschiesser\nschramsberg\nbrazenness\njakl\nhochheimer\nlibiamo\nevenhandedly\nvolano\nburesh\nunintegrated\npedrique\ncleco\nsilvermist\nurbahn\nceruleus\nopk\nczeski\noverbloated\nmatharu\ngalewood\nbilley\nphotoessay\nhoróscopos\nmedarex\nlundekvam\ngauweiler\nibritumomab\ndjohan\nhilliards\nsoffe\nboogy\nplasmasphere\npeñaherrera\nyameogo\npolypill\nvetsch\nzigmond\nhoskinson\njohnjay\nmathletics\nfater\napodyterium\nganina\nglifberg\nderomedi\ngesticulation\nsoudas\nsophi\nnjeim\nmukhran\ntrinka\ncopaxone\ncorroon\ngressingham\njungbluth\nchoua\ncurdles\nchinachem\nliniger\npilliod\nsanstead\nfraim\nfreder\nrosamilia\npoohed\nvasilj\nsalc\neqm\nkurzem\nnosbaum\nfireline\nfecklessness\noskam\njianshui\nneighourhood\nbabip\nguillotining\nnorikazu\nunseparated\nunificationist\nkriegsmann\ngoity\nollestad\nebj\nghonda
\nbuscon\nforewoman\ncinetic\nseig\nwelegedara\nhuntleigh\nleaseplan\ntopup\nigrp\ngoeglein\nssgs\nstonum\nhbeag\nzeru\nsophisms\nnyambura\nalexon\noffe\nbayji\nbrabantia\ndyball\nherendeen\nhazrati\nfreeskiers\nkrysztof\nlongri\nyochelson\nmicalizzi\nwellmark\nmontcuq\nashafa\nungerman\nlimewashed\naquarians\nsapiyev\nwindbreakers\nnonoo\nquirkiest\nnonya\nacurio\nbroomielaw\nliebfraumilch\nlipservice\namaria\nadverted\nbeauchene\nrepresentive\nnannying\ndomata\nhaijun\ncockcrow\npalfreeman\nmpingo\nhilderbran\nwenallt\nodontologist\nmechri\ntheuriau\nledwaba\nsplotched\nsemanas\nabulhassan\nholben\nbreagh\nblancaneaux\neerola\nunsubsidised\nhardpressed\nglasenberg\ngaidhlig\ndownhome\npetrusich\ncusine\nmariastella\nnedrow\nunobtanium\nhadeel\ntolfrey\naiskew\neffen\nshern\nmulticulturalists\nprotosevich\nuntransmitted\nkeyingham\nwoodgrange\njanisse\nchakarov\nmasharawi\nvestibulitis\nmagnotti\nbarrettes\nhowlands\nfumigating\ngeggie\nkrivets\nkensing\npurls\nmisztal\ndrawstrings\nlegalisms\nlaught\nbiobanking\nchiamparino\nakhatova\nussuriysky\nmusis\nfrenchness\npumpsie\nchironex\ninprivate\ntelemar\nmultistemmed\nschabel\ntressider\npeenemunde\nstriegel\nwörld\nkashikojima\nseec\nmttf\nfortina\ncitroëns\ncashwell\ntymoschuk\ninfering\nfauquet\ngarrec\nprabhudas\nperosn\nharell\nschliessler\nhappé\nspiralfrog\nconfiture\ndeeka\nprettified\nbcec\nprotas\nbjornsen\ncocodrilo\naeromax\ncrookedest\ngphc\nbolmer\nbemo\nbeacher\nhnlc\nkulsoom\njabouri\nsitake\nmuneera\nflimflam\nstessel\njalalaqsi\nnoncontact\nhomebirth\nguigal\ndabblers\nalterative\nrohrlich\nreassuming\nglusker\nrowohl\ntongariki\nbenezra\neisma\nclaure\ngoudas\nhaoyuan\ntomcikova\npornchai\nfedoroff\nblythedale\npapania\ntruluck\nwerbner\njalmar\nfrancheska\nyuppy\ntetrahydrogestrinone\nstassinopoulos\nnerdiness\nnasheet\niedd\nsubdudes\niesha\njayawardane\nfuneraria\nschwaber\nundeceived\nbackgrounding\npagotto\nafbf\novergrowing\njatras\nclaycomo\nvsoe\nzagorin\nzind\nramrakha\nginwala\nunaccoun
tability\nthge\nfrieman\nmckays\nhymotion\nanthera\ncomputershare\nrinzin\ndubsky\nmullady\nrevivification\nifap\nrayad\ndesking\ntetanurans\natcm\ndexatrim\njamster\nreifler\namfilohije\ntelsey\ndonckers\nujian\nzecha\nhadidi\ndefinied\nfereidoun\nhardianto\ncitris\nsoverign\nballyholme\nrashwan\ncadaques\nkorstin\npetrole\nvivified\naddyston\nrecontact\nrewalk\nlamman\nwunsche\nuntenanted\nknafo\npinballs\nrownd\ndisapora\nefimkin\necotax\npassbooks\nhyderbad\nphotofinder\nhoffler\nboxwoods\nmemogate\ngoese\nraimy\nmadderty\nzakuski\nfailte\nbernotas\nilesanmi\nstenseth\ndehuff\nkollerstrom\nadalius\npawed\nharardhere\nfortea\ncarlsmith\nshaf\nlybbert\nirritably\nregularising\npoissant\nnelahozeves\nmcarthurglen\nfawned\npashman\nautum\nengelsman\nbollocking\ncarvacrol\nlonmay\ndescrimination\nmuthalik\nleyman\nkoreen\nmwy\nundulant\nhimalyan\nfareshare\nwellfare\nnecropsies\nturtletaub\nrebibbia\nseptics\nirmen\nphysiognomies\nsaihi\nbacchanals\nnaher\npearlstine\nmelck\nbustline\ndhore\nkobak\ncpfl\nssao\nguven\ntourbook\nkaletsky\npathologizing\nbacardí\nhewas\nneurocrine\nmarcotti\nuprate\ntingelhoff\nbruv\nkufour\njinzhu\nstreiter\nbsafe\nmeonstoke\neannes\nferis\nchatterly\npickert\npolyphonia\namantle\nsecuritizing\ncostermongers\nkaree\ngortney\nfairlead\nturag\nlakwena\nfriddi\njohnshaven\nbucksburn\nmicklewright\ntallula\nhalef\nathman\nfinshed\nmalaythong\nhachijojima\nafricam\nsudbrooke\nugone\ndets\nbcam\nwritedowns\noelrich\ntailgaters\nmilquet\nustda\nstagnetti\nstreetscaping\nshalrie\nleao\nshoate\npinetops\nholmfield\nnonallergic\noctavien\nhowff\nqalqiliya\nkukula\nviadeo\nferreria\nsafeness\nmroué\nfrattare\nmanyfold\nhaggans\nabiy\nredlines\nprimative\ngrapecity\nprobowl\ncorynne\nshuaibu\nsuley\ngraffam\ncounterfoil\nmaliphant\nperfumado\nshabina\nprivia\ngraindorge\nhemdan\nlenotre\ncrunchier\nmidground\nstirman\ngemfibrozil\ndeconcentration\nwieckowski\ninabata\ndolwyn\nsimbarashe\nfreemarket\npessin\ndeitrich\nbbmf\nneophytou\nailun\nmickle
white\nanonymizers\ncoutaz\nshushtari\ntamarisks\nlaboeuf\nzocco\npuppie\nrestudied\nushida\nbyberg\nnavjivan\nngic\nsplatting\nyingde\nbirchen\nfootboards\nluppino\nnmsa\nloaghtan\npagpag\nregionality\nrabadan\ntwiddles\nbonnan\nbluemel\nwended\nhoneycut\nalbouy\npanosian\nazdak\nhtike\nmalelane\nmaracá\nowles\ncapurso\nsigd\npejo\nzinacantán\npetkus\nthokozani\nmude\ntredennick\ndevidas\nnfcs\nshroeder\ncombinable\nscrabo\nhultzen\nmakarkin\nloking\nyasso\ngurski\njianwu\nmotricity\ngoyet\nhipgrave\neguiguren\ngunbalanya\ngotoassist\ndulnain\ngeddit\nenlightenments\nmicrocircuit\nzebraman\nvareille\ndemjanov\nspeechmaking\nschwinge\npontnewynydd\ngreenbird\ndistronic\nkilgetty\nmoinard\nbatbold\ntrollip\naerosolization\ntaimina\nnaclerio\nhuffines\nabua\nlaohu\nloafs\nnffc\nkoju\nperrenial\ntimesdaily\ngreybeard\nlagisquet\namorously\nsuely\nwssrc\nlowrys\nholehouse\nbehal\nnazarbayeva\nmunyemana\nfullscale\ntullin\nshigihara\nlinnen\nmiskinyar\naccordi\nearthshattering\npasaban\nblameworthiness\njuzhong\ntendancies\ndeptula\nhapuku\nseone\nhaybridge\nlostine\nmesssage\ncybersyn\nceferin\nrubasingham\ndioscoro\nhuiping\nsarafin\nsocrata\nqaraqosh\nstemp\nallsport\nminkley\nmateys\nocwen\ncpcc\nslyfield\nperdriel\nnarrowminded\nnaimat\nmwenga\nwonford\ndichlorvos\nnationlink\ningrates\nosterholm\nhokes\nrudzinski\ndumez\nbinatone\nbumetanide\nkartchner\nmenegazzo\nschmancy\npueblan\npinioned\nmartydom\nseiyaku\ndehumanised\nliljeroth\nblackspots\ndanbolt\nsynergic\ncellaring\nlochard\nbutkiewicz\nnikolin\ndemocrates\nhamidu\nachacachi\ncostock\nsqueeky\ndaxon\nperugian\nchrysostomou\ncollete\nirinel\ntuttosport\nkriukov\nblackey\nsorger\nnakam\nendonasal\ndacc\nislamofascist\nluchar\nscheeren\nmovial\nbloodmoney\nkisimul\nmarianske\npsychometricians\naeropro\njoltin\npavkovic\nippi\nchammas\nprovied\nared\nhansville\nmartinovic\nbandeh\nburys\nety\nnorthcom\ncarbin\nsporkin\nbelkheir\nbandow\noffor\nmigita\ntwankey\nmistiming\nbrantano\nhrcp\nprugo\nsott\nmascons\n
tarchi\nfreezeout\nnxs\nartomatic\nunweave\nstraggle\nhersov\nbiofiltration\njiangyong\nvandeven\ngreiling\nprakoso\nonkelinx\nstoreowners\nkuzui\nmsra\nfurama\nmetastasised\nkarenna\nriederalp\nrubbishy\nsuperhorse\nbarclayhedge\nsazegara\nmariscos\ncaspians\npettistree\nsolofa\nwcva\nintergovernmentalism\ncongue\ncariforum\ngradwohl\naffuso\nparliamo\ndosas\nmazzolino\ncanarypox\nafterlight\nschickhardt\nthouvenin\njocund\ndawayne\nfagella\nthallon\nmodha\nmillegan\nsidhom\npicciano\nkhalwat\nmcallan\nwebtrends\nmergel\nchazin\nhoenikker\nindeterminately\nabuot\nxylorimba\nmajadele\nclenches\nburket\ndambe\nzybina\nvivika\nbusara\noberbeck\npolands\npereirinha\nhennard\nmarangi\ndegraffenreid\nkaco\nbaaf\npalmeraie\nsprowl\nrasheen\nestebanez\ngoatskins\nspudded\nchasnoff\npertusi\nraphanel\nlarenas\nslamon\napplys\nmonovision\nmaitake\nreasonble\nweleetka\nreappointing\novernighting\ntirolean\nscri\nbrendler\nwozencroft\navishay\nontime\nhoeger\nhesla\nqnap\ncoccinia\ndashanzi\ntechnophile\napplix\nxebra\nsupermodifieds\nparesi\ncheddars\nhorrillo\nthroughtout\ncrackly\ndevetzi\nvladimira\nmagy\nfrippery\nkasanga\nmingfu\nstraphangers\ntwinlab\nunappetising\nhenleys\nuntidiness\nslivenko\nmedaris\nlibbrecht\ngramer\ntornielli\nrmhc\neuronest\nreaganite\notaola\nleofoo\nstijnen\ndieties\nwalek\nsukka\naldara\numpierre\nhirshfeld\nfursdon\nmelanosome\nrelatability\nchalkland\nresupplies\nmediocrities\nfreiss\nrosenblith\nmackmin\naugustinussen\nmelendrez\nlahmacun\npershin\nbecuz\npaje\nfourposter\nelleman\nhauxton\nriddlesden\nhuaining\nfrueh\nredoubles\npettifor\nneighbourliness\nmehmen\nhorologists\ngelid\ninstrinsic\nfundos\nzingler\ncarnalea\nheedlessness\nraghuvansh\npotec\nreductionists\nderecognition\npleurocoelus\nveronis\nfarnaz\nblinkin\ngalettes\nplaks\nfetishizing\nglazzard\nbrockless\ncriticims\nplurilateral\ncwdm\nbassekou\ncyveillance\nconneh\nstrumpf\npotholing\nspoletta\nzlob\nuniverstiy\nspmet\nmillings\nvilborg\njorrie\npiffero\nambulate\ndjerf\
nfanzhi\nsadosky\nlargos\npinetti\npresler\nfourplex\ndebido\naquaduct\nmakasi\nbachtel\nismp\njinfu\nslavenka\nquiana\nremonstrates\nbacchan\nfilippone\nqijin\ndefinity\nlipford\nmoais\nwattleton\npatocka\nessentialists\ngosk\ndepressurize\nevenor\ncandyce\nnecrophorum\njagm\ntewell\nspreen\nxintang\ndroped\ntrocken\ncharmane\nyums\ncarrizales\nbussone\nkawishiwi\nsenewiratne\ntongliang\ninflations\nnanoball\nlagueux\ndegout\nscrewfix\nlogierait\nallvoices\nawadagin\nnewtonhill\nkesuma\nfoge\noptimer\nendovenous\npirozzi\noooops\nbikehut\ntyman\ngaleas\neconlog\noutturn\ntycroes\npreconference\nmafiosos\nsalimah\nlearjets\nfelsinger\njieddo\nwmet\nzavis\nexperion\nthornburn\ngwpf\nchorused\nchorusing\nsleezy\nabelii\nmanrara\nwesly\npeddocks\nkaitz\nencounted\nmital\natban\nmalsam\nblar\nneglia\nbloaty\ncontentedness\nellzey\npreem\nyansong\npesan\ncanstar\nprances\nfedcup\nscarcroft\naltamount\nhawkhill\nrefurnishing\nmainstone\nflossenburg\nkindergartner\nsevastapol\nhydrogenics\nnajwan\nhayen\nsatran\nnixzmary\nsosenko\njnto\nutsubo\nreargued\nboncath\nastrotech\nkhalik\nprestone\nratilal\nbussiere\nratley\nyuhui\npletsch\nprevarications\nkuriyan\ncrupi\ntranfer\nabsolue\nbloe\nshinning\nsteadyshot\npolution\nactblue\npolytrauma\nviewforth\nsovietologist\nbrossart\naislin\ntechnoserve\nsokwanele\nattrice\nbuwono\nannice\nkatterbach\nbushwalker\nrumblefish\ntransporation\nzüricher\npanitchpakdi\nkaklamanis\npanariti\nhicklenton\ndelegator\npauza\nhuelsman\nseigal\nmargolick\nseguing\nmicrofibre\nbemporad\nmullaittivu\ninterbody\npugach\nrostowski\nhyperresponsiveness\ngreffe\nphillipian\nhoopy\ndjuma\nbosnjak\nchurchly\nkillpack\nrichardo\nsoundpost\nsailosi\nfceda\nadderson\nfacetiousness\nxirrus\nmoontide\nmixologists\nsvanen\nisamar\nstms\ngladue\nmyphone\nbackwardly\nzimet\nnahh\nsurpassingly\ngratch\nbatori\nbeixing\nkhalilur\ndubl\ntsypin\ntrgovac\npraesepe\noverindulgent\nlllp\nsciarrone\nstemberg\nsilicification\nantione\nwestburg\nlabourism\nmegavision\n
lexecon\nbilgili\nflylo\nsieberg\nkhrystyna\nhissa\nrepublicain\ncwmni\nsmily\nkayar\nmacoumba\ncupuaçu\ngethers\nphillipos\nofficiousness\neuroland\ngaila\nhabitate\ngenoux\nspreadtrum\nhmri\ndausgaard\nmidco\nfurmansky\ntefal\noutshoot\nheico\ngothicism\nbraghetto\nkendray\npelizzoli\nalridge\ntatarella\notherworldliness\nsharsheret\nocrf\ntamburro\nmatrixed\nporthmeor\ndigeplayer\nouzinkie\ncarmignac\nhawrami\nmattotti\npfiester\nhadelich\naguillon\ntrinamul\nehsanul\nlespwa\nenshrinees\nvatikiotis\nladuca\nssempa\nbathetic\npullups\nusfp\nechus\nsodergren\ncowdroy\nhugue\nohhhhh\nglucocerebrosidase\nscardina\nseignorage\ntillerson\ndesensitised\ngholamali\nfromagerie\ntregantle\npapillifera\nleppan\nidzi\nsmushed\nvanauken\nkhairullah\nvalassis\nkaruturi\nramsin\nantiphonally\nhandpick\nmbola\nncsf\nlenins\nmouthless\nbestman\nsakowski\nteahan\nprobalby\nstrazza\nslaska\nerenberg\nzuley\nzongyang\nsapeur\npeppertree\nautologic\ncaesaraugusta\nbelliraj\nunanimis\nbadale\nhaythe\nportinho\nsibbit\nremède\nnightime\nmontasser\nrussotto\ntastiness\ngeneralisimo\nkromhout\nkochno\nargleton\nnepalganj\nlograsso\nlipowski\nhochiminh\npoliciy\nbishay\nsquaddie\njuth\nglowstick\ngoldenhersh\ncarouse\ntooty\nnunchakus\njavaheri\ncaixia\ncoupal\npeshkin\nbiotropica\nlifelessness\ntrye\nhrmm\ndilday\nglenoaks\nntas\nshoping\nmuivah\nopenedge\nbourjois\nkornreich\nslades\nsurcharging\nfoxey\ndiogenis\nberkshares\nnisnevich\nsebutinde\npendletons\nnosheen\nespey\nhasanat\npadoan\nmeiju\nslotover\nbruggemann\naljira\ncloepfil\ncohenour\ngxx\nrhag\nbojeador\nsalberg\npfge\nstratigos\nguiso\nbandaids\nlambdarail\nnidcd\nouriel\naffie\nindestructable\nstabilitrak\naspray\nsillero\nvenneri\nrussiaville\ndalaigh\nhardell\nmicroemulsion\nregg\nentemena\nidahoans\nupdm\nicnirp\npatheon\nmazzon\nshiao\nstathern\ntalev\nbogollagama\npluzhnikov\nmotoyuki\nafesd\nwanfeng\npickiness\nstiel\npurell\nsopko\nsimplexity\nwaterbeds\ndamane\nflautas\npoge\nshokri\nsuperfluously\nlusheng\neliah\
nsanguillen\nbookless\ncynamon\nforewent\normondroyd\nfarset\nretie\nlasalvia\neghbali\nestis\nrobogames\nfrodebu\neuboeans\njdsu\nplateros\nskymall\nsummerworks\nunenlightening\ngenshaft\nprecor\ndimeco\nmelcer\npaelinck\nplasmoids\nsimonovic\nneske\nlumefantrine\nslavick\nlazarof\nmouloungui\nholloware\nconax\nskitube\ntalamini\nhavelet\nherion\nbrammertz\nmccgwire\nperegoy\npalpitating\ngatherin\nbeardie\nmurdin\nsofthearted\nlesnik\nhartkopf\nlutai\nipaa\nchurchouse\nimprecatory\nkilclooney\nnardozzi\ntimney\nviewability\nissed\nserby\njiasheng\nfriarage\nactuals\nsubcompacts\nfliegauf\nsproughton\nhillas\njocularity\ndilantin\nwhelmed\nseedbeds\nhoua\nheimowitz\ncasteless\nhaluptzok\nnietzel\ndeeann\nnotarised\nproform\ncregneash\ntrimarchi\nfrodon\nlauría\ndushu\njastram\nmattone\nburts\nhaselsteiner\neglash\ncollyns\nbodyshop\nwakelam\nhuggable\ntacular\nhlegu\nwehre\nburutu\nmarketsite\nrihab\nunutterably\ndarfurian\nweghe\nbacong\nkearon\nyamabori\naais\njamile\nsarajevan\nhuria\ntruphone\nfurbank\nprecooled\nwestenthaler\noeav\nfoleyet\nstrassfeld\nnonverbally\ncreagan\nmeterological\ndrugmaker\nmhks\nwincenc\nmarmorated\nkukura\ndarnit\ncarapintadas\nhevey\nnaidus\nhanzal\nmongeham\nloesing\ngmcr\ngeneres\nespinet\ncucciniello\ndispersible\nalthoug\nwinola\ndraghici\ninjust\ncamoflauge\nchichilnisky\nsyy\nalvarezsauridae\nwicha\nkuksiks\nccai\nrishad\nflailed\neaglewood\nreagen\nfivesome\ntoolstation\nhosman\nbalentine\nmisron\nwestworth\nadiala\ndruglord\ninterferring\nstepfamilies\nsimhat\nbuglass\nzuza\nsmatterings\nvarifocal\nurself\nforcedly\ncimla\nyulgok\nwnsl\nkhalfani\nacevo\nhumpday\nhundreth\nhavar\nsatisify\nkhilnani\nkitutu\nendeth\nserifovic\ngrinshill\nsmgf\nkekova\nghouri\ncontinuers\ndieumerci\nmasaai\nsredoje\ntuell\npuuhonua\njarena\nlifeskills\nbellydancer\nplch\nfenwicks\nabderahmane\nburstwick\nbandpage\ndstt\noutisde\nhaora\nunascertained\ndumyat\ntongli\nschmuhl\nroake\nlisbellaw\nbraintrust\nkooba\ngramley\nwebaward\npoin\nkissick
\ntactility\nliveops\njerkily\nmbalula\ndrefach\nfreindly\njulz\ninsaat\nakilah\nfurrowing\noversupplied\nlixx\nllangunnor\nabfa\nmeserole\nsilm\nlawhon\nshoven\ncarrai\nhoyes\ntroob\nzafirovski\nmigdale\ntourmates\nedmisten\ntastefulness\nmongin\nallergists\ndoomsayers\nactivitiy\ngraboff\nangon\nlovemark\nshengzhou\nagins\nhefferman\nwhammo\nasenapine\nmoppett\nzenghelis\nkiwan\ntikes\nyalincak\nvinyards\nfantasised\nbozburun\nepel\nnegawatts\npisaro\nplygain\nmacaninch\nplener\nrespectfulness\nbaguer\nreletively\nakhtuba\nshilbottle\nhouria\namika\nwelzer\nsarukhan\nsakovich\nmythologize\nmixamo\ntoolbank\nprecycling\ntedtalk\noscawana\nlittermate\nrhamnous\npegasys\nllangadwaladr\nedwen\naudiance\nicgc\nrutha\nspadeadam\ndspic\ngansky\ngrode\nalverez\nsuhrawadi\ndhall\nkulwinder\nmiscalculating\nbuggering\nduluoz\nsiasconset\nberdzenishvili\nchandrajit\njackbooted\nbleistein\nmarshalswick\nkimmerghame\nnewsnow\nbahdanovich\nwierenga\nbouhired\nrestylane\nshanzu\nbarfing\nnascence\nfedmart\nchalcot\nkoltz\nuofm\nqio\nkooshian\nundergirds\nslipup\nfreeganism\njhsv\nasba\nrhyan\nfreerun\nmonory\nkadijk\niplex\nmadeiras\nciralsky\nworricker\nchesterbrook\nbadii\nkeudell\ngebregziabher\ndiscoloring\nglögg\nunzips\nlecq\ndovydas\nmartyring\nmantric\ncocreator\nitsuo\nmanhunting\nnasdijj\nhelzberg\npriamo\naxway\nondiviela\nbankings\ncoolican\nstoneyburn\nstureplan\nparinya\nroofscape\nkols\npusic\nsportiest\nadalian\nkatrantzou\nfiechter\nbouphavanh\nkaptchuk\nwerksman\nhandpump\nlabowitz\nkoizora\nnumbskulls\ndukei\ndmpa\npparg\nmuhabura\nscatalogical\ngalschiot\nhavern\njorvan\nmeinir\nvaccino\ncifras\nditib\nscorebook\nhakkar\npinocchios\nnyjo\nfraternise\njawaher\noutteridge\npttep\nmadshus\nacquisti\nparexel\nadhamiya\nikela\nstilly\nweisses\ntelekomunikasi\nquahogs\ntoymaster\ncondliffe\nmorguard\nthrombectomy\nvallies\nseptime\nreicht\nazenberg\nbraless\nsanah\ncherikoff\ncouth\neffectuating\nakkas\nsturmgeist\nsupped\nimich\nmutsamudu\nhelibras\nproperous\ngre
beshkov\nwenying\nassous\nbhujel\nvivisections\nhormuud\nsaksit\nschryer\nphiladephia\njarell\nblandest\ndumfrieshire\nbriois\nkadry\ndepowering\nchoel\nlisztian\nopekta\ndeltha\naboim\nmelodiously\nmesac\njanurary\nputzi\nmancs\nconsequenses\nfernbrook\nivankovic\nconcered\ngasmask\nispad\nphay\nfirecrests\nruckdeschel\nlebeuf\npylant\nlanks\nkaroliina\nhunnan\nreventon\nsparkwell\nluvox\ngravlax\nmarlbrook\nbalius\nfluxys\nshakim\ntootill\npetraia\npushbike\nbowsers\ncaulton\nmurtazin\nbispectral\nmeccas\ncaptian\nprimatological\ncherubism\nperrodin\nknackers\nbudongo\ncosbys\nmagnificient\ndropside\nfaci\nstaffieri\nurbanik\ntiuna\ndimhrs\ngreenlief\nmereu\nizaac\nriggans\nhyperphosphatemia\nkeshwar\nopers\nbronchodilation\nsonkin\nkindergartener\nongkili\ngeiman\nbancor\nchamoux\nwoould\nadefovir\ngrubacic\ntiffini\nmaussa\nloofah\nmareel\nkalkot\naayo\npruvost\nboroditsky\nphobe\nmicroflex\ntruckie\nyeshorim\ncollman\njuiciness\nvéry\njianing\nbasterra\nwhatstandwell\nabady\ninsitution\nstrubby\nspirig\nfriedenstag\nturkomen\nrundek\nmtiliga\njunhong\nikpeng\ndefrank\njojoy\nfederalizing\nbalsamico\nskovhus\nmayview\norru\nmaaren\nlukac\ntarconi\nminnewanka\nflavourless\nniederstetten\nforshew\njahed\nlelisa\naeco\nalberghini\nconstuction\nsursis\nremaing\nrausa\ngeophone\nfussen\npestilences\nallante\nprocoptodon\nkamathipura\nunoosa\nexpeditioners\npickoffs\ntrock\ndatatreasury\nairshaft\ncajastur\nappollonio\nhauert\njayakrishna\npedraz\nmonsur\neayrs\njaik\nweinandy\nultracool\nantosca\narismendy\nbaudach\ncosmeston\nfacists\npompes\nkazenergy\ntraiteur\nzuying\nfouzi\ncaramelize\nmcenhill\nimbibes\naviacsa\nsmei\njillie\nhopla\ntrevelin\nekranoplans\ncndh\nhaoming\nmcgranahan\nmantashe\nmortera\ncravins\nconniver\nperrish\nprotaras\nwaylaying\ntils\nmonath\nbolduan\nunfun\nteepell\naranas\npracticably\ndolciani\nporosities\nhayde\nnationalisations\nclickstream\nmurjani\nisthe\nwaterings\nfavs\nsouthernhay\nwaberthwaite\nirupa\npapillaris\njantel\nmanyang\n
nanetta\nrobinzine\njetstreams\nstompy\nallergenicity\npreparators\nfiallos\nsherdil\nmagsamen\nkrafsur\nmvsu\nkohna\nftts\nechandia\nkiester\nbonauto\nsafty\neverlastingly\ngraminearum\norango\nectc\nkallaugher\nnonfederal\nmokko\ntsolekile\namirav\nhandmark\nlongicollum\nostberg\ngrgurich\ndivilly\nacrefair\nconaco\npinguino\ngroundbreakers\nklemke\nboîtes\nfinacial\nbillers\nzubulake\nmentals\nmakhmur\nstovepipes\ndudly\ncuisiniers\ngurling\nsoze\nakinwale\nponifasio\ncortonwood\nbodycount\nzidar\naburizal\nbiolcati\nmesirow\nsparton\nmomcilo\ncareworn\nshayes\nbreceda\ntranchell\ncleated\ndiluvian\ncluzel\ndiblasi\nkringles\nskyterra\nroecker\ngessoed\ncianchette\nbethann\nlhen\nlynchian\nwowowow\nptwc\nrumpy\nteratogenesis\nslipcovers\nalion\npsychedelically\nrafei\nfitchie\ntawakal\nlauries\nfernado\ntorsella\nmitsuyuki\nmues\nfoldi\ndaufresne\nakuntsu\nchitterne\nkeyhan\nhobnails\ncamio\nkambas\nhyperarousal\nkooga\nmujahed\ngolikova\nstatendam\nsafecracking\ngaffoor\nvideocamera\nbattleplan\ndychtwald\nchenai\nteaware\ntuitupou\nexsistence\ncrispier\nproelio\nfloured\ntringale\ntensleep\nnorberta\ndeferiprone\nwhakatu\nantegrade\nravva\nleinberger\ndimwits\ningestions\npellizzer\nsixthsense\npijon\ngtbank\nscriptlogic\nheartlessly\nchesnoff\nhafif\nandrostadienone\ndechu\njolies\ncrisa\nmanjarrez\nporyes\nshortley\nbrochette\nwansley\nmamund\nfilaggrin\nsufentanil\nzniber\nswallownest\ndardari\nweeble\ncinf\nobsequiousness\npaulison\nkamvar\ncharentais\nperriam\ncrackberry\naobadai\nravishingly\nbottorff\nfargher\nannointed\njiangqiao\nsticca\nupadhye\nrebind\nesgair\nlakpa\nvarities\nprocrastinates\nrepulsiveness\nhadadi\nmoinian\nnipsa\nfiacconi\njafco\nchukwurah\ncasher\nasiapacific\nteleworkers\nstraeuli\nacoss\nasik\nbreitz\nnikiski\nbekken\nansolabehere\nwyboston\nfoux\ndarwinians\nhajir\nplateauing\nuetz\nshimit\nwuori\ngatecrashes\ncotrubas\nleenders\ntopquest\nnacif\ndyma\nurgelles\nfacciolo\naspey\nfadila\nbavis\nanadigics\nshingirai\nkomis\nsbpd\n
burbanks\nlaynce\npilocytic\nkilbrannan\ngotabaya\njaywalk\nhouchins\nforr\ndoernbecher\nviewpark\nantwaun\nuprating\nunexcited\nfabisch\ngartman\nunexcused\nzackheim\ngcsf\nlyubomirsky\napcoa\nkidstuff\nargaty\npineiro\ndarkley\ntrocaire\nchaussettes\nskinfold\nyenni\nraffaelle\nretreaded\noursel\nhoverlloyd\nelasticized\nlabban\ngogarth\nsuvir\nunimark\nbrunkert\nndayizeye\ngnpc\ndegs\ncorrectives\nporrini\ndafs\nkloiber\ngruma\nhosannas\nreinstituting\nbatterjee\nmaindee\nnavajivan\nedac\nkoningshoeven\nmichos\ntrenchcoats\nshartava\ndidace\nloreli\nmethysticum\nrishard\nflambe\nxichun\nbokaer\ndundar\nmcconnaughey\nvranich\nkhodaidad\nmultistoried\nbarrique\nderrik\nconnives\ndepuration\nmininum\nmasutani\npalamidi\nstutzmann\nxhosas\nwhoot\nneedlecraft\nthunderously\nactiq\nflagellating\ncherilus\ntenderized\nsockwell\nmaëlle\nfleckney\nimprecation\nohmori\ndirir\narzate\nsiberica\ncovill\nberiosova\nblackback\nonic\nmarylouise\nimplimentation\ngerbeau\nzijiang\nsalvationis\nbellm\nteabagger\nvujanovic\nachivement\njitish\nheising\nsomerley\nbetu\nblokey\nfanyi\nkampamba\ntremelo\njart\nkozicki\npalmaz\ngoniurosaurus\nfarbrace\ncyrenaics\nstartsev\nextraditable\ncammish\nanchimaa\nphanfare\nduntulm\nlilliesleaf\nmudman\nwonderettes\nshrilly\npanksepp\naromatised\nmargaretting\ndoly\njly\ntayyaba\nqualitas\ndeclutter\ncatchin\ngorazde\neurochambres\namburn\nwema\nsherrybaby\nmashamaite\ntapasi\nstolzer\nmepkin\nschaeff\nwarfront\nmyachi\nschook\npsaps\nsuffo\ntrism\nhumlebaek\ndepersonalisation\nmirabaud\nzakour\nnoreiga\nmdts\nembery\nrossotti\nnelstrop\npityana\nbookstaver\nkaiun\nmicroencapsulated\nsuavely\nyixiang\ndariga\nabdrazakov\nstavis\nhunnisett\narrogating\nskyspace\nulreich\nbarella\npoopers\ndebases\nbokashi\nunosat\nschepp\nfosseway\nexcesive\ngormally\nprevacid\ntubney\nnickelsville\nlemv\npunchout\nrodek\nsarabandes\nruegen\njamaine\ngrush\ngellard\nhandpieces\nbacciocchi\nabakar\nthalman\nmelchisedek\ncuyabeno\nkahut\nbielan\nleggate\npakage\nh
asama\nwestmarland\nhodgeson\nnantymoel\nkeara\nranjbaran\nsierrita\narridy\nmortaring\ngpj\nmikhelson\nbdav\ndimethicone\nzumbado\nagrement\norlandos\nroughen\ntianwang\nbloms\nasmd\nnumerologists\nmuffuletta\ngasanov\nnicarico\nzoubir\nteléfonos\ntugenensis\nbyrdie\naercap\ntucc\npseudorabies\ngarrowby\nwaubay\ncrapola\npotthoff\nejn\nbiochips\nlivneh\nkoches\nlequan\nyarwell\ntechnologia\nkalvenes\njekel\niassogna\nhandberg\ndaunton\nkinepolis\nlightwaves\nkatseli\ndescibes\nsavre\nukse\ndemotivate\nempathising\nmuwafaq\njianlong\nlarssons\nspearville\nmicrobuses\nrefreshers\nchinnarat\ntsas\nweibring\ngoodfella\ntatums\nwhorehouses\npercassi\nkaltenegger\nebley\ndelineators\nladyboy\nselloana\ncronie\ntartine\nshehadi\nhutshing\nfasbender\nentertaiment\nquenelles\nnetherbury\ngulfview\nparedon\nhexogen\natitude\nnusakambangan\nlappen\nzadick\nallerleirauh\ngrowingly\nfogbank\nobizzi\npumpkinseeds\nvmeste\nromenesko\nshelor\ntopazes\naddicott\ndespommier\ncheesemonger\nshairon\nsimers\nbenyus\nkibitz\nartouz\nzardana\nmacgown\namilly\nalthin\nnonnenmacher\neligble\nbolillo\nstufflebeem\nbazant\ngazpromneft\nschwag\nreexaminations\nhoornik\ncarignane\nprocrastinators\nsvitzer\nquisthoudt\ntessel\ncompleters\nrefuelers\nfronk\nhunia\ntechamerica\ngeci\nbairu\njaeson\nhenbest\nkeauna\nlbos\nelisheba\nseligsohn\ntetulia\nballacraine\nniersbach\nvillez\nhobnobbing\nuncommissioned\nmoscovite\nfourpenny\nfaram\nebonie\nbanyala\nmetrologic\noverproof\nmugniyah\ntruenorth\nhavret\nhaydens\neppleton\nnsct\ninners\npittu\nembitter\nburlakov\nnemore\nimmensly\nelfrieda\nqaidat\noutworked\nectaco\ndemory\ntolchester\nboxiong\ngukasyan\ndansette\nbrogdale\nprision\nacress\neimskip\nechourouk\nkeasling\ntamkeen\npancur\nsaltmarket\npenywaun\nwoodstove\npreu\ninanely\nleibham\nfarhod\nsaio\nwaterscapes\ndewitte\nbanderilla\nbazardo\ncangemi\nagagu\nindridason\nsidka\nkasit\nmegatonne\nyegge\njeopardises\nsweney\nataollah\ncommscope\nphuti\nkeler\nmillionare\nyudell\ncountryfolk\
ngabfest\nstriesow\npaac\nincentivising\nhumidors\nanci\nbriege\nsoliris\nleakycon\nkloberg\nbelco\nunalike\nyachin\nprisbrey\ntanevski\nodina\ncorreal\nxanders\niavaroni\nsudachi\ndeskovic\ngeech\nmassoudi\nseekingalpha\nthanis\nskelhorn\novm\nzinédine\nbuehner\npontrhydyfen\natenza\nblowhards\nhardass\ncymuned\nsebel\noutragous\nbmce\nziliani\npatamona\ntestors\nlawdit\naboulafia\navanessian\nsaidou\nwasmosy\nchemtura\nmärkl\nvillafane\nvesselina\ntwentyfold\nstereovision\nslivered\nwknd\niwamasa\nbartulis\njaidon\nkoliada\nsupersite\nmcstravick\neiser\nflibbertigibbet\nmccluney\nmovsar\ncobertura\npowernet\nchambishi\nsugito\nkralick\nbrunkhorst\nsabki\nmccullah\ntiriac\npiiroinen\ntlali\nbarusso\nfricks\nsensative\nrochman\nforestiers\nthefind\nyijie\nnonprofessionals\nsplotchy\nextenuation\ndinoire\ntweeked\nparishoners\nadongo\nlymphangioleiomyomatosis\nbenaim\nberardenga\nquadrillions\nmildner\nbrodax\nnkechi\ntamoil\nsabik\narnzen\nschlundt\nmondshein\npandoran\nwisser\nmarlane\nsamardo\nleyhill\ntreki\nmerilee\nablations\nfeinmann\nmalleswari\nbanasik\nkiplingi\nenagh\nhmyoi\nshakeups\nmollino\nknickknacks\nballinacurra\nkapsalis\nfarringford\nintrahealth\nrewatched\nsunridge\nboelter\nlitif\nzertuche\ntamegroute\nskinnerian\nxtl\ncablecards\nmerchtem\nmakel\nixis\nlaguage\nnasril\nnotimex\nheathside\nlivered\nflintoft\nimmunotoxins\nfensterstock\nrajabali\nomolo\npurnululu\nhestitate\npbts\ntuskeegee\nefsm\nparching\nsucré\nkosminen\nmmxi\nsulat\ndisinterestedly\nborberg\ndesribed\ndharun\nreveler\nlapasset\nstowupland\nlitwak\nbarovier\nnamb\ncoproductions\nbarratts\nmcgraths\ncerfontaine\nsalisu\nworrincy\nhousey\ndenike\npouncy\ntheaterworks\nochils\nercol\nlaseter\npaydays\nbruzzo\nfcvs\nsodersten\nadzic\nshahtoosh\nforsbacka\nbrakke\nformulators\npoulard\nmohtadi\ntahaan\nmarculescu\nzahava\nfirecontrol\nplatespin\nruhakana\nrennix\ngloveboxes\nschleifstein\nworkstream\nwildearth\nyosuf\ngolway\nhighcross\ngunes\nfiredog\nmydomain\nstreckfus\nunhappie
st\nmoyock\nshof\nmistreatments\nzanoli\ntransfuse\ntraumatising\nnooke\nwarborough\ntrigeneration\nblangkon\ntapner\nyazpik\ndehap\nweyhill\nparkay\nconferenced\nlunchables\nacharacle\npropriano\nviridor\nndung\nmappus\nofficals\nsgts\ncorsino\nencrust\nallanbank\nunderselling\naccme\nkataib\ngocar\nmaniatv\ngenchi\nduboff\nsculpturally\nkaamulan\nstubbly\nkikuya\nhometeam\nwhovian\nchelmarsh\nbiopolis\ntheatregoer\nveiws\nceis\nbernstine\nalejos\nzipolite\nsopogy\npenyffordd\nmovoto\nmiquita\ndogsledding\ncranioplasty\nleppla\ndfps\npmps\ndanshui\nchiman\nbenschoten\ndecarbonisation\nbuzios\naestheticized\nvaluably\nmammographic\nibmec\ngarimella\nwinterization\nverfiy\nmlec\nphobics\nsheevaplug\nbenkahla\nligitimate\nenj\nkrokidas\nsosnik\nshantelle\nonal\norganochlorines\nmonicas\nlukaya\nlawer\nneosporin\nunstopped\nleatherbury\nagues\nkittenish\naldicarb\nkrippendorf\nbillcliff\naritzia\naltounyan\nphiladanco\nrombi\ngeomicrobiology\nirrationalist\nmartsvaladze\njittering\ngazeley\nzaretski\nvisceglia\nlouette\nrambaut\njesuses\nbidmc\nstudentsfirst\nmaladapted\nblindspots\ncartee\nbookpeople\nblud\nperske\nprahalis\ntrinite\nsmartzone\nguanciale\ndorch\nsilah\npegol\nkettled\nbhuri\nfearnhead\njabril\nlybeck\ncoeck\ntopalli\nesdc\nfilofax\narmoires\nvoxbone\ntaky\nproshkin\nroneeka\nthca\nscariolo\ndolceacqua\nomnipod\nsmarden\nbelsat\npxd\nnaalehu\nbradsell\ndbts\nonesies\nmacleane\nkikkan\nnotarize\nmidthun\ndecommitted\nprosed\ngaucín\nsqaure\nseasides\nspeal\nbanif\nmutasim\nrodeway\noeri\ncopstick\noldcorn\nunderlayment\nmannkind\nhallsands\nrickatson\nmagpas\nlanceros\nkvitova\nunfashionably\nblasband\nluisotti\nfavo\nchurchillian\ngasoil\nmatuska\nkarademir\nfettering\nlifetouch\nnadil\nlupset\ntienanmen\nlesiak\nhighjack\nphotodamage\nkingsfold\nmorizono\ngigaton\nkiver\nporteñas\nsynfuels\nwojtkowiak\nemersonian\numhs\nwating\nintveld\narthuis\npoutre\nfulking\nopenfilm\nbagrodia\nakhundzadeh\nnonbusiness\nsharvit\nextr\nbuehrer\nblueburger\nfokou\nc
ompex\ntaer\nsupercute\nparonnaud\nboiman\ndiperna\nkirop\nsalfi\nphelp\nanaylsis\navanade\nfluorodeoxyglucose\nvasovasostomy\nworp\nmicromagic\npardonable\nforebearance\nlabossiere\nesle\ndrq\nquinziato\ngoedde\nmaurices\nsanliurfa\nkoozie\nchebotareva\nkunimatsu\nakobian\nbicarb\ntobchi\nsunnies\ndetroits\nvolx\nunhealthiness\ntohatchi\ntumelty\nhidipo\nyotel\nflowlines\nbethards\nfingerman\nmollick\ncomplaing\nwookies\nmadieu\nlothing\nprometa\nbacklin\nkoomson\npiddletrenthide\nnacco\nrevoltella\nbalze\nchiquet\nnndc\naejmc\nducarme\nuncastrated\nchokey\nintacct\nsorpasso\nbluegold\ndisposability\npolignano\nkiplimo\nwhittard\ntrailor\nglanrhyd\nmatel\ntakeing\naugmenter\nsalaris\nmeloan\ntotaljobs\npaulmier\nshutup\nmizon\nkorde\ncolbath\nmcgirk\nyrm\nfilmgoing\nverklempt\nlovrek\noccaisions\ninul\ngopaul\nbutzner\npruritis\nandrades\nhealdtown\npadavan\njinfang\nbramly\nquestro\nmontecinos\nriia\nepea\nmulyadi\nisackson\nintegrationists\nnisource\nneobaroque\ndemora\ncrosstour\nfocac\nsaukville\nkostyra\nbuce\ncrickmore\ngliocladium\nkogas\nsurreally\nmediasmart\nlassale\nblueprinted\nbrannick\nglycaemic\nhusbanded\ndonnafugata\ntilleke\nrealated\nbrookers\nmolaa\nazcarate\nstaake\nschmaler\ntreprostinil\nlittlejohns\nmuntazar\nlincy\ncumo\ndogus\novertaxing\ncarisa\nandreassi\npipemaker\nbaoquan\npermissibly\nhaniska\nanindilyakwa\nliangliang\nhaverland\nvillehuchet\nomidvar\nthusitha\nralepelle\nhavlik\neyerman\npushka\naresco\nedisons\ncinecitta\ngraddy\njiyul\nasrael\nsexpo\npsychoanalyzed\ntelecommuters\nflookburgh\nentitiled\ngoulson\nkrod\nmccahery\noutmaneuvers\nsadovy\ndeschacht\nwienzeile\nappetising\nbaritonal\nchangeing\npondberry\nseawings\nalarid\nnakarawa\ndicyclopentadiene\ncege\nnbrf\nwdn\nsugammadex\nundt\nattendings\nlouthan\ndzf\nfuifui\ntelepods\nmultimillionaires\nkirkharle\nyirmiyahu\nkircubbin\nglagow\ngospic\nneic\nmoonmen\nperenyi\ncerre\nmaanshan\nledvina\nmotamedian\nrosenfarb\nbritting\nstukel\nefss\ndesirably\npricer\ngugerty\nchu
mbawumba\neiberg\nrontgen\npalipehutu\nhelyer\nvitiates\noxclose\njiangong\ndewael\ntalvitie\ntheret\njalawla\nerquy\npeaslake\nconzen\nghizzoni\nrumasa\nbadanov\nmacnulty\nmetson\nschleip\ndeworm\navelon\nticketnetwork\nlurlene\nkhukuri\nfjällbacka\nbearder\njozefina\nconcerta\nrehling\nbalou\nwmik\nsofiko\nwrage\neehv\nbarioni\nultrices\ncondimentum\ncaymus\njenkinsville\nmurmuri\npirouetting\ndestitutes\ngalitz\nnagita\nlapthorn\ndecilitre\ntransferjet\ndixmoor\nfireflys\nblumel\nendorsment\nlangbar\nfactive\ngustus\nimmoderately\ngovernability\nshomari\ntejocote\nclockhouse\nburlaka\npersing\ndmitriyeva\nosteoprotegerin\ndiyab\npodgy\ndemokratia\nterrycloth\ntoonattik\nbrennon\nkamela\nhameiri\nforristal\nfecs\nrockmount\nwardah\nquaysides\nmcaneny\ndurney\nalkalaj\npirouet\nglengary\nlovcen\ndevenick\nhaylee\ngarrotted\npromenaders\nochberg\nsenoko\ndiakhate\nrostki\nmyrle\nslansky\nmhlw\njiga\nclov\nseesaws\nnunnallee\nwidenius\npyrek\nweaponizing\nfrancop\nedag\nsadoway\nstavrakakis\ncorncockle\ndiscimus\ncesarian\ndawney\njudiciousness\nroumieh\nwellpark\nguojun\nhodsdon\nrosende\nnbaf\nlabid\nspectactular\nantezana\nhaddrill\nvelculescu\nheadships\nmainsails\npattered\nalred\nculata\nforebearer\nchalie\nsheader\nseeburger\noubiña\ndsmb\nheubeck\ndevilment\nkatayev\ncamín\ngorson\npredesigned\nfeep\nsaucisson\nbertocci\nbrevetti\nindustryweek\nlumer\nloiterers\nloadouts\nnstemi\nleaphart\ntitillated\naghazadeh\ntigipko\ncorrugating\nsaldate\nactivase\ngailhaguet\nqimin\nkhaskheli\nenevoldson\nmonay\nbordner\nrohlin\ninterxion\nmunusamy\nmenchel\nwelters\nbrooders\nkuong\nlouvish\ntocca\nsucsy\nhumouring\ncodorníu\ndarwesh\nhenneberry\nhankamer\naltinger\nkrisp\nwpas\npurevdorj\nzinifex\nherbed\nitac\nvergassola\nwellsian\ncpsi\nmlyn\nacustico\nmultitasked\nlman\nundeviating\nsuzukis\nloizidou\nbrachfeld\nthornier\nmarassa\njamiri\nanesthetizing\ndhaher\nklag\nronke\nconsentual\npodd\numme\nproppants\npeasholm\ningenuousness\nupda\nhopyard\nscrod\nsundome\nko
blenzer\nmorroco\namlaw\nbishopswood\ntollbar\nprobab\nrownhams\npemco\nnembhard\nrocchio\ngrapeseed\nkreger\npuistola\nsulej\nnonfeasance\ncertificants\nlessy\nfareboxes\npslra\ngruv\nkirbyjon\nliukkonen\nhandzlik\nopri\nuhy\nbehaviourists\ntropically\nwalvius\nbyrdak\nteardo\nbalblair\ndedaye\nayuntamientos\ngraterol\noverstaffing\nbekkevold\nkentisbeare\nkloesel\nlégumes\nbenedettini\ngianini\nutilizations\ngalka\ntrawlermen\nsubex\nklauer\nndjili\nelectees\ntillen\nkysor\nporcell\nyoshiji\nardec\nservisair\nrapiscan\ngeronimus\nmillimetric\nfaliraki\nherreras\nmanawanui\nmarmonte\nleggs\nbalsara\ncapercaillies\nminari\noutdrew\nstockingford\njalaludin\nterrmel\nhanieh\nbipap\ndrance\ndiedrick\ntelesford\nsvich\nsilverados\nbabyfather\nnashvillians\nhousebreaker\ncholita\ntowelettes\ngurrelieder\ncrookall\nupends\ndestines\npictbridge\nappealling\nsensée\nronney\nbentoel\nlaviano\nllanfrothen\napixaban\nvlodrop\nsneakier\nlahair\nbrushfires\nunreadiness\ngoserelin\nsleepily\nmaldistribution\ndixies\nbirkmann\nheidmann\nhirohata\nfundie\nreitmans\nqsp\ncampbellsburg\nbridgmohan\nlööw\ncambois\nakkuyu\ngarce\nphotobox\nplacidity\nclendenen\nriggall\nmarathoning\nboduan\nrosenstrauch\ninterplays\nwyedean\nnjpac\nunqualifiedly\nmayger\nstalberg\nmazsalaca\nmutianyu\nabsher\nanouschka\ntowerstream\nbachani\nchepchirchir\ntouhig\nkhoro\niwcs\nblairhall\nqizhong\nzurbaran\ncoremedia\nopposit\naispuro\nutla\nwaxer\ncotney\nhalldin\nsetai\nportese\nconcealers\nrequalified\nreshard\nzondeki\nlavendon\nmicroids\narchwilio\nceftobiprole\nkrawetz\nbonacina\nterrye\nqaed\nquivar\nidiosyncracy\nincrementalist\nweldele\nrascism\nchenchen\ncivillian\nwasowski\nalchymist\nghyslain\nplawgo\nebberston\ncrailo\nwhitminster\ncallcredit\ndebonis\npetroceltic\nsterotype\nericksson\nhillsdown\nsameshima\nvicken\nskivvy\nskyscraping\ndeiced\nbiomedica\ntaktser\nmignoni\nrepresentitive\ngordinier\ncervantez\nkellogs\nperhap\nsofort\ndeferasirox\nblubs\nshrewbury\nafricanisation\nvaders\nmu
kluk\nbiebel\ndrobot\nschotter\nbubblehead\nzhongmin\nacidulous\nkathyrn\ninterbike\ntigist\nruddles\nruszkowski\nbrewfest\nrehaul\nstarchefs\nwertheimers\ntowbar\nambach\nicti\nroberston\nbarhoum\nmerzouga\nconergy\nmcalarney\nbellens\nsibbi\nmegalitre\narmentano\ntakingitglobal\nvagliano\nnewswriter\nciechanow\ndreyers\nmarcheline\npownce\nspicker\nkhitab\nrischer\nportin\necoh\nvacherin\nfrucor\ncarbomb\nhanout\nvernand\ncherukuri\nwvsg\nschroyer\nsynder\nacronymed\nschlieker\nhartinah\nmancot\nblackflies\nslimfast\nbrillembourg\nblokhuijsen\nfellate\nnulliparous\nchakari\nlitwa\nberinsfield\nnicor\ngreened\ndecribes\nnaameh\nconfessionally\nknoweldge\nhoeg\nexeat\nshivpur\nsafieh\nwalkern\nravellette\nozio\nhickeys\ndelapre\ndöhle\ndaimary\npaymentech\npengelley\nmankani\nlittleness\nmontegrappa\njabalya\nlandlubber\nkeyse\napprpriate\nltee\ntravelwatch\nccpr\ncrockfords\nmaradonna\nferraiolo\nsegerberg\ntaluto\ntemporizing\nnewtok\nhately\npicrite\nyandiyev\nzonules\npawelski\nmicrofractures\nlilygreen\nbacktalk\nstanleytown\nlippiatt\nroentgenology\nquacked\npoplavskaya\nsqueeks\nnctj\nlarminie\ngaviña\nfulanis\nphonebox\nhanoverton\ndenominal\nsautman\nvirtuousness\nayouch\nsybaritic\ngiltner\nfelciano\nmoodily\njagpal\ngodsall\nhijja\nmicropro\nvillency\nmoceri\nspakovsky\nheselden\njolivette\nunsalable\ngreavsie\nrelton\nacbd\nhindquarter\nwishnow\nuceta\nharmolodics\nancilliary\nmejicanos\nglencolmcille\nsoleau\nfarmanara\nlifevantage\ngarnishments\nzhenli\nwincy\nrepossesses\nchikhaoui\ngoolam\nbhum\nmassachusettes\nmayetta\ntyondai\neliasen\ntransfusing\nfrale\nmullavey\nyehude\ngauzes\nsteads\nshaddick\ncalçots\nlabourite\nshameen\nalexus\nspoetzl\nkearneys\nrefenes\nsafinia\npaugh\nresumés\nirradiates\nmostafaei\ngurkan\npenksa\npacquette\ncôt\nstaybridge\nsamsova\ninnolux\nprecontract\nhendrikje\nwuhou\ndiniyar\nsomethig\nbarqi\ncleansweep\nghazwan\nsalafranca\nrecolonizing\nvbp\nfissuring\necsl\npoortman\nfranak\nlisitsyn\nviktualienmarkt\nglasspar\n
adagios\npolimeni\nchibas\nglassbox\ngural\nfzc\nopencable\nlabno\nliene\nfactionalised\nuchiyamada\ntoasties\nroodee\nwifa\ncornucopias\nbpce\njevington\ninterlandi\nbalasooriya\ntwizzles\nkleffner\nfreefalling\npupilage\nwhirlybird\nunderdone\nmorcos\nyanover\nsuckale\nsyaiful\ngeroge\nmoabi\nwestamerica\ndorsainvil\ntessaro\nkedington\nnightmarishly\naleipata\ndevisingh\nblakroc\npienkowski\nsandinismo\nlawee\ngalliher\nurton\namorales\nbioindicators\nsuvo\ngovenor\nbullah\nnhsca\nganbaatar\ncanvasbacks\nsatelites\nusry\nsemet\nportune\ncranksets\nmunchetty\nulley\ntauntingly\nnardello\nperiquita\ninfanticides\njinyan\ngratifyingly\nassitant\nlatchman\ncervezas\nngel\nhirokami\nholsboer\ndissappeared\neastsiders\ncommenee\nstupples\nmenhennitt\ncablesystem\ndellenbaugh\ntigta\ncunarder\nannibynnol\nralley\nwayt\nmasterfoods\ngareeb\nshaden\nsepura\nwheals\nminnikhanov\nsrebro\nplastinated\ncopilots\nloulie\ntopgolf\nokg\npentala\nbrugnoli\nkaesviharn\nmalthusians\nhackbart\nteehee\nvolkers\nyawuru\npliner\nunseeing\ncollyers\ninfuser\nsmartgate\nrectums\nanbinder\ncharendoff\ninexistence\nlyndel\nimron\nlorely\nviscarra\ntofas\nalabaman\nfladgate\nherberman\ndupouy\nvolubility\neuropabio\nrosabeth\nhectolitre\ntruckline\nralink\nreelecting\nmetzstein\ncheryle\napneas\naltom\nnonscience\nsaroornagar\nzabell\nderra\ninexplicit\nkoltur\nwhimsey\nphysalia\nmotived\nbuoyantly\nrewatching\ngayakwad\nloquacity\nreindorp\nyaps\nbryukhanov\nbidborough\nbilat\nsailani\nstohler\nnorair\ncaminati\nbourdy\ntwaalfhoven\npainton\nfaughnan\nadriene\nhanmore\nweitzner\nmisun\nmozza\nebuya\nrychter\nviticole\neloran\nfrontyard\nlandmarking\ngerhardstein\nhemingways\nvideoegg\nshaoul\ncelozzi\nderwish\nmacrame\napfc\nmillvina\nmurueta\nashkabad\noverdrafting\nprodger\npogles\ngrammatikos\nbelgacem\npalenques\njoseff\nzhongqing\nuytdehaage\nrevisioned\nnailah\nmombacho\nbasell\nelote\ntaposh\nmussett\nluonnotar\nbackslapping\nmincom\naguak\nkurashvili\nlazzeroni\nxenserver\naï\naloi
a\nprejudgement\nweirdnesses\nnaturalizes\nkomalah\nwamsutter\nggd\nakima\nelayna\nscls\nkortum\ndiscbox\nmultiroom\nzhilei\neilian\ngrofman\nhatstand\nrebaudioside\nnymans\necoflex\nnecesssary\nbicky\nbelhelvie\nglobescan\nwhitcliffe\nlegwear\nsteung\nbiventricular\nriquna\nvidnovic\nmarinker\ngastronomes\nwoolwell\nmulipola\nrensin\nchrichton\nmartonyi\nairgas\nhuffs\npointscoring\nughh\nflays\nshinkle\nnumara\nmarchman\nprickling\nbeiruti\ndalcross\nmdrc\nwieldy\nhoungbo\nkasle\narmaggedon\nfeminize\nplanit\nheamoor\nandrogenetic\ncontenta\npiazze\njocose\nspezi\nrsgs\nmading\nacidulated\npulmonologists\ndeparle\nksenya\nminimun\nkoepnick\nslaa\nlifestock\nespadrille\nsonim\nsquaretrade\nfarmery\nmarecic\nbeqaj\npflieger\nwestleton\nradhia\ntianyang\nrejectionists\nwheating\nthje\ncarlstedt\ntinwell\ncontigency\ntheraputic\nandirons\nhormats\nodms\nexpro\ndudeism\nsacranie\nsatpayev\nseroma\ndrewniak\nshnayerson\nradmilovic\nmaccullagh\nballanger\navli\nmallek\nfilizzola\nloftsson\nshumake\nkontis\ndutheil\ngussman\nsunee\nmooren\nqueenland\ncubria\ngalgaduud\nrothfield\nbahrom\naricept\nsouthwoods\nguca\njosic\nbagir\nlagosians\noberriet\nlydstep\nscirica\nspeedlight\nfvd\nsouthstar\nhavaianas\nkayce\npochalla\nshchuchye\nbrooklynn\nwerning\nbeamsley\nfroriep\ngergo\nstockel\nsantoyo\nesack\nbollwerk\nparlapiano\ntssm\nfluview\nonechanbara\ncaciotta\nchineses\nazarakhsh\naggers\ncharmat\npielou\nacpm\nbosen\nhapilon\nmedevacs\nyedidya\npiffaro\ninmet\nnonintrusive\nfeldblum\nvittachi\nbosta\nagok\nlighthorne\nhalfar\nsquirearchy\nbinde\nbedevere\nnanotechnologist\ntafua\nmtcs\nhulhudhoo\ntrivelli\ndamanaki\namortisation\nulrichsberg\nbenstock\nforearmed\nisab\ncraptastic\nvilató\ncready\nqgc\nkasatochi\nlegalises\nveysonnaz\nfion\npuffa\nlakesides\nhyperpyrexia\ntehching\nlonsdorf\nkeggy\nbakkali\ntentativeness\nfattouh\nhirschorn\njaliya\nascherman\nfitiuta\nkonyshev\nborring\nskurla\nacquity\nwindygates\nundersoil\nbureaucratisation\nhymy\nwainio\nemelda\nbill
inghay\nprotocell\ngottesdiener\ncottered\nlakhbir\naltick\nkawananakoa\nthalken\nlionised\nresurection\npartitionist\nworldpanel\nrasied\nirglova\nuntanned\nvarriale\nmicrocastle\nwillden\nmahamed\npoolhouse\ngurtler\nworthenbury\nhyperflexion\ndéclassé\nnekesa\nsarcococca\narcandor\nbinette\nmonarchos\nathrawon\nfalcones\nrosellen\ndjiboutians\nitag\nparuk\nweavering\navendano\nbridezillas\nzirinsky\naxeon\ntunin\nnoushin\ndunion\njialu\nkessai\nmilwall\nlethaia\nlegnini\nvipa\nwestmost\nmirabela\nsubmenus\ndogwalker\nindietracks\ndebruce\ntesev\nretrenchments\ngallocatechin\ncymoedd\nkooyman\ncommagere\nmcnamaras\nbaayork\nmarquitos\nhoppenstedt\npolysexual\nhorsely\namerigroup\nnewbyth\nuría\nnomadically\nchirruping\nlague\nblaenrhondda\nandromaca\nbernos\ncelling\nnatters\nkoening\ndjama\noostvaardersplassen\nmythmakers\ncentredness\nschlep\nshitless\nhipswell\nderai\nmisplay\nlechea\nhewings\nlypsinka\ncarene\nncrb\ncleanable\nmyeni\nhauner\npinkowski\ndorene\nsorour\nllazar\nilsey\nmichalczyk\nterzic\nquinceañeras\nhollowood\nfozzard\ngrasonville\nfeofanova\ngrinnin\nbasex\nstermer\nspacefarers\nadmet\nintoxicates\nlightboxes\nruing\nsabreen\ncursiter\nribis\nsochua\nsmokler\nnyid\nplanipes\nalizad\ndopod\ndesley\nsubari\nminashvili\nnoseley\npolissia\ntigerlogic\nbradstone\nstikes\ngörlach\nindivisibly\nmallnitz\nswizzels\ntaishet\nglendy\nspainish\nhellicar\nreesei\ngaudily\nstrategery\ngursel\ntaxations\nbonica\nmasarik\nkoogle\nchinotto\nsoumyadeep\nflipchart\nlambrinidis\nmondialogo\nfreshens\npriorslee\ndanyl\navantis\ncnq\noutswingers\nflavanol\nzhongsheng\nlelkes\nhowies\nteanaway\nvinales\ndannis\nkcdc\ncervelo\ntallygenicom\nshamlian\nknickknack\nchesterland\namrich\nleidholdt\nsevastianov\nshipowning\nrepellency\nnavarsete\nkentia\nbdcs\nsurfable\ngougne\nraucci\nbracketology\nsangars\nwonderhowto\nbuchert\ntosatto\nblaris\nnextag\nprilukov\nbeansprouts\nsupersweet\nlobjanidze\nalgiz\ngnip\nimpetuousness\nsheffi\nkankowski\nedeline\nstoups\nmonumen
talism\nfredericksen\nnaturalizations\nmosts\nesom\nwoodloch\natlanticare\nnasima\ngutwillig\nbuglife\nbuckteeth\nreadmittance\nrubh\nbarmoor\nbalmacara\nbrignall\nlangelaan\npasstime\nunremunerative\ncrego\neastcoast\nmcneel\ngrovelands\npushiness\npatharghata\nemmas\ntylar\nsavol\nreassesses\ncheshires\naccoustic\novercorrected\nstoglin\nfreshbrook\niwth\nolgun\nmediaguardian\nrelaford\navjet\nkwando\nkhamal\nohryzko\naquagenic\nfrowny\npeura\noverexposing\npaybacks\ntrés\nrsos\nnecar\nocassional\nsadato\nhaematocrit\ntrelech\nbarbetta\nlaitner\nadvancetrac\npopley\nterpsichorean\naapp\ntenspeed\nhedh\ndraggy\naberman\nkhawani\nepok\nelese\ntachyarrhythmias\nogbogu\nchoucair\ndegioia\nmedelci\nkullas\ngaspara\nautoload\nwinnin\nreadmitting\nflotman\nmaridjan\ncheongshim\ngreenzo\njozini\nkefraya\npaciente\nomrah\nwashaway\ntolchin\nrosehip\nwiatt\nmicrostamping\nshyest\njahmal\ngieseke\npiccolini\ncheneys\nbiddings\npossokhov\nvovak\nguintoli\nddca\nnimda\nfontina\nspottswoode\nartington\nhymned\nhuyghue\nmaame\ndobrzynski\nkonis\ndraggers\nacelino\nsietsema\nchalki\nppds\nabdelfatah\nxinguang\ndevide\nderio\nbindis\nbrassie\nmispriced\nprinn\nvitetta\nmckeand\nshowier\nblakelaw\ncillo\nchoudhrie\nsinja\ndrinkell\nsazka\ngrowe\ncobolli\nkhosrowshahi\nunretiring\nbakhos\nramales\nkeslar\ndormobile\nsauerbreij\nstreusel\nwideranging\nfugett\nschmiedl\nfenney\ngigamedia\njamont\nbroadspeed\nerson\nfygi\nneurosensory\nwlga\nfuci\nwheb\nsundowning\ntolzmann\nbouveresse\nmarisat\nrossow\ncracchiolo\ncobles\novergeneralize\nlubricious\nstravinski\ncetnar\nbersagliere\nlabeef\nbrottman\nshinyaku\nbenthamite\nmugridge\nhurk\ncodere\nunpruned\nwineapple\nlaccd\ntpbs\nauthentications\nseagen\ncontantly\nsmule\nmaeklong\neqp\nzomet\nmhango\nchippies\nseratonin\nkrawchuk\nchellsie\ngonig\neett\nregester\nwesta\nwtgb\npansing\nabiocor\nleeched\nmanikpuri\nignatio\nderenoncourt\nmeranda\ninstitutos\nelavil\njuventute\nabdulhak\nirwyn\nhempstone\nunfried\ntarfa\nblowjobs\neyeborg
\nscharr\norked\naldert\ntthat\ngudino\ndinp\nhelmkamp\nususal\npoelten\ndormansland\nnicolaidis\niosh\naimin\nbrowett\ndeskey\nparadjanov\nkilbarry\nmckissic\nfodge\nshalita\nconcealments\nzandbergen\nbreakstone\npakzad\nusfj\nstralen\nlamattina\nundergirding\nkwiecinski\nclobberin\nmahiedine\nhartop\nchareh\npriyadarshi\nrarig\nthaworn\ncarskadon\nsarms\nkorbich\njinsong\nparametres\nportoroz\nwatumull\ncaroselli\nvcit\nmaniar\ncoagulans\ncamilia\nvancouverite\nwinterswyk\nproview\ngadarene\neuroparties\nsichrovsky\nlongsands\nuaisele\nmpack\nknockwurst\nkipen\nbulimics\nslobbery\ntruvelo\nstephanou\nthembinkosi\nkalanick\nlineartronic\nhebog\ninstinctually\ncluing\ndouggie\nmortada\namberton\nkivilo\nbctd\ncamoufleurs\numkhanyakude\nsovereignist\narbaeen\nkumartuli\ntheodores\nnegotiability\nlindwer\npinck\nganaxolone\noakapple\napprehensively\ndamier\njohna\nizetta\nwowsers\njearld\nkingsand\njazmyn\nlongyuan\nguangrong\ncerpa\ndhalwala\ndebis\nhibernal\ndepressors\neastfields\nsprink\nheeter\nlbws\nhamami\nmandjeck\ngimmelwald\ntellef\nliechty\nspiffing\nhnidy\nhemodynamically\npratichetti\nmuchiri\npleitez\npoat\nneurasthenic\nantwine\ncprw\neiroa\nodontochelys\ndjakpa\nmosad\ndugs\nfarinholt\nsilich\npreacherman\nkaylene\nznaider\ncasnewydd\njazil\nsonova\nliebehenschel\nlightpath\ntumulak\nharidi\nsarajevans\nchistian\nsidy\nmcmickle\npisarek\nbiomatrica\nkreplach\nyannaras\nsantalab\nbangemann\nspilborghs\nzarela\nnuuausala\nkianoush\nmurambi\nverbenas\nsofttop\nfachie\nscudders\nesders\nkoplow\nneiers\nkuniya\nhennegan\nmarshgate\nhealeys\npanick\ncroquembouche\nunstintingly\nquantifiably\nkesslers\npancrate\ntrestrail\nkreinberg\nowein\nwainscoted\nbalitsch\npfpa\ndelord\nforground\nlowliness\nglenmora\ndjedje\nembroided\nsambol\niraqia\nhcsc\nwindish\ntpas\nquisp\nsarvari\nspritual\nbackdown\nelectrosensory\nmultitenant\nkawecki\nmikala\nthermomix\nzornow\nwilbourn\nrabideau\ntobas\nunclad\nginkgoes\nhineman\nsayanogorsk\ndeyda\ntourment\namke\ntornberg\
ngarrowhill\nfabulousness\nnazma\ntaouil\nfoodbanks\nsagel\ncommity\nthinus\nboisi\npodoconiosis\nkijak\nbauge\ncoopering\nbevanites\nmalibus\npromotoras\nrivr\nfinnimore\nekmark\nnissay\nebbesmeyer\nunlamented\nboutté\namakudari\ninniskillin\nhojeij\nnxdn\nscoldings\nwebcaster\nnaptime\nakahi\nyerzhan\nbuttenwieser\nkinkos\nunconsumed\nguaracara\nsulcer\nafflatus\nbaman\nrupai\nhsmp\nemilsson\nettl\ncandleholders\ngarana\ngregucci\nabff\nparetti\naccurist\nmantained\nosondu\nkbro\nwurly\nfluster\nsonyericsson\nlartin\nmultiorgan\nunhas\nstakeford\nbloviate\nsacyr\nagrilife\nquizz\nassunto\nasbill\namellus\npostsurgical\nmalcorra\ncrullers\npullup\nlongueur\nanticolonialist\nczyzewski\nkachan\nshulz\nmusze\nwatchout\nsabuk\nincoherency\narlingham\nmalarek\nfogger\nappexchange\nbagnato\nkitwana\nmaxiumum\njasvinder\ncydonie\nsrizbi\nrown\njixun\nkools\niniative\nrisper\nsandlings\nsafronova\navitan\njtekt\nkotula\ndanzantes\ntouchpaper\ntarlok\nshipard\neuryn\nrozana\nclofibrate\nlovern\nzonked\nshiells\npostnasal\nynysawdre\ndunira\nkenion\ndobratz\ndesignworks\nmolestations\ngaleya\nuncontradicted\nsalterhebble\nverjus\nrotork\npapau\nshamhat\nboqueria\ntheere\nreformable\nrealer\ncrematories\nberberi\ngurmani\ntibaijuka\nfleischhacker\nukfi\nadreview\nkianga\niwanowski\npaumgarten\nwisut\nfullpower\nebang\nutsi\nbarbulescu\nbedin\ncivillians\nmelucci\nicher\nabitibibowater\nhabituating\ndasheen\nhefting\nalperson\nbdmv\nshamala\nmedievals\nscilingo\nkalanisi\ndejia\nbacharuddin\ndongdong\nmasticator\nruymbeke\nansumane\nupgraders\nmultitasker\nsqua\natsunori\nmandri\necard\nbiedenbach\ndolphus\nmuhoozi\ndeedy\nhospedales\ntishina\nphonophobia\nattitute\nsinsemilla\neulogist\nrabani\namritpal\ndubyna\npavri\nquresh\nportering\nmansory\nshinique\ngettting\ntocker\nngun\nwhitegates\nsanitarian\nserralta\nspanfeller\ntosic\nfarsad\nillegalized\nstudesville\nsecurement\ngamaleya\napmg\ntommyknocker\nskirl\nglimepiride\nziadeh\njakubec\nbenizri\nunmourned\nsignifcantly\
nyemens\nprewritten\npleasanter\nabdille\ncaptious\nkocic\nfenceline\nafricat\nakinsanya\ntrecia\neqv\npomfrey\nfeedbag\nshuttlers\nsellas\nkoundara\nmagassa\nsaleswomen\nstopps\nsalesclerk\nkhatar\ndyilo\nogou\nscums\nchecky\nhasip\nmnay\nlivinallongo\nunderminded\nhibbitts\nneuticles\njinchi\nwoodmans\nficelle\nkovatchev\nguilmette\ndeatherage\nhanjra\nmusliu\nditherington\nallanbrook\ncristini\ncapsaicinoids\nleubsdorf\nkreher\nprocessability\nstirland\nnemeroff\njoyon\npeagler\nroswall\nsurtain\nmonsivais\nhosek\nnichopoulos\nborjesson\ntrenter\nmurfitt\nllanwrda\nffrdc\nbassolino\nhcpt\nmukwege\nsommerer\nbahmanyar\ndimora\ngertten\nsupera\nszmyd\nhuncote\nskevington\nssga\nbannout\nmichikaze\nmaddala\nenvalira\ndakowicz\nwuthnow\nalaton\npixo\nsavik\nbannang\nhermé\nmediterrean\ncloyingly\nspinnerbait\nabanazar\nstanda\nfelley\nrecces\nlinteau\npragasam\nkindlon\nrenowed\nternium\nspos\nfaegre\nqabatiya\nanmm\nlgtb\nrectennas\nbarcarolles\norrey\npohan\ncyburbia\nphallocentric\nzeppole\nmishloach\nmeadowhead\nriomaggiore\nrolinek\nkoina\nzemp\ncalibos\nswished\nmahameed\ngirlington\narraign\nmonsignors\ncamerman\ndehnamaki\nellickson\nabdelhaleem\nherritarrok\nnarisawa\nsubpoenaing\nmarincola\nleukemans\nwatcombe\nwitters\npoipu\ncloudbreak\ndaggy\nmzamane\npowerchip\ncontainable\npanjsher\neiopa\nclubface\ncannelle\nkurbatova\nresegregation\nhillberry\nfedders\ntenberken\nadamsdale\nnonmusical\ningenius\ndissostichus\nmyyearbook\nsfsg\nmofeed\ntuninter\nanynana\nbockting\nleswalt\nbhagmati\ntokarz\njalozai\nreinitiate\ntassle\ncottonelle\ncadjehoun\ncarprofen\nnnrs\npakulak\nmalinverni\ncalise\narraying\nsungold\nffred\nrecalibrating\nsanussi\ngreebe\nbeiji\nsecretaire\nguleghina\ngiannoli\npennycress\nmongillo\nhellishly\ngrydeland\ncoorstek\nsmudgy\nguilts\ncrase\nkhurrum\ncomastri\nkleinke\nbeautifull\nboguslavsky\nliberis\niwsc\nchainey\nlivoti\nmollifying\ngleadall\nvishny\ncoberly\npostconviction\ngabis\nscool\nprotocal\nzimov\narcone\nmutie\nwillekens\
nclarium\nkagay\ncarlise\nyawp\nbiowatch\nlupfer\ncovario\nguiterman\nbikhchandani\nhalleran\njasperse\nmabula\npolumbus\nbulukumba\ndiglipur\nzanzibaris\ngarona\nshrives\nkinstler\nprogamme\ntennesse\ncitrusy\ntorrenueva\njanneh\nmycocepurus\ncompostion\nzaimi\nbroida\ncelsia\nundertrained\nullyot\ncineas\npincushions\nmouterde\nmoisseev\nnaseum\nestover\neluana\nsumisip\nmoonbounce\nscrimmaging\nfartlek\nthieve\ngambro\nhayshed\nimpresiones\nsimearth\nmatela\nputrefied\ndemarkation\ndanchin\nkronospan\nbullbars\nkenith\nwarstone\nbunkbed\nprimoz\nincredibad\nhaustein\nroastery\nweilin\ndinamika\nruoppolo\ndoddie\ngoodlooking\nstylers\nprepossessing\nprechtel\nqpe\nseracini\nkohal\npfsweb\nbundaleer\nrockenbach\nnonsocial\nbarnuevo\ninterwove\ncarbapenemases\nozgan\ngamila\noppossed\ngopin\nvinnell\ncrawick\nharilela\nverdick\ngetjar\nbutzel\nrjk\ndemurely\nbeloussov\nhousebuilders\nsniffling\nnawijn\nsharoni\nfremstad\nrightholder\ncheko\nhbas\ndxy\nconvienent\nshaimiev\ncannonsville\nbittermann\noutrushed\ngaraud\nmariwan\nnèg\nsadwrn\npahlsson\npingus\nhockerton\npockett\nconvertors\nchiappone\nfarceur\nrivada\nchoucroute\naggravations\nafricentric\nareopolis\nfaddy\nsweco\nglamorously\nclermiston\ntransvestitism\nalualu\nskyrace\nnovavax\nmoganshan\nshreiner\nrtss\nzoric\nwotsits\nliuyuan\nmilek\npantalaimon\nborsboom\nsatterly\ntocris\nemvco\npalevol\nkeigan\ngerbasi\nwoukd\ndrumoak\nkustow\nfajon\nlongtin\nfazoli\ngohpur\ninlaw\nscherber\nvelib\nslumbered\nbrennans\nmokin\nbéar\nkirschke\nfractionator\nignizio\ntantillo\ngalgadud\nmalcuit\natomizes\nopensparc\nmississipi\ntrawlerman\ndashe\nbingguo\nfitze\ndascombe\neumm\nvisnovsky\nvvmf\ndissappear\nblaxhall\nmolchanova\ngilberdyke\ndonewald\nrefinances\nmaléter\nkimunya\ndelboy\nscraton\nhandoko\nwilsonart\nearlie\napelt\ndinaw\nbarbagelata\nmafuta\nmundford\nunwisdom\nscuffs\nchrismas\naltmore\ngoulbourne\nduaner\ntechnophiles\ntobiasen\nkuye\ngielow\nmidani\nojinnaka\novernutrition\nnaruc\ninstigations\nj
alopies\nshestakova\ncoreth\njulphar\nklatten\nsuperpole\nescura\nratanak\nfoodists\ngravetye\nbidadi\nhafley\npinchao\nwiedeking\neaseus\nmeraux\nhousewifery\nnuseirat\nfatwallet\ngableman\ndehui\nsafdari\nmckines\nsweatband\nvenetz\nlowcock\naruz\nukfc\nsendlein\ndementri\nrifs\nflocken\ngreatland\nschelomo\nreres\npikelet\nmarketisation\nclouted\nakivi\ntenovus\nsaayman\neiter\nsimner\nsoraida\nmoniza\ngenuses\nzootechnical\nopinionator\nkolsch\nohlde\npatchworks\nemun\nbrendt\nkoobface\nsaera\nunlevel\nluhukay\npointillistic\nlearnvest\nvalke\nswarnamali\nballinspittle\nimpeaches\nshilly\nnsso\ncapuzzi\nhettinga\nakerboom\ngoslee\nbakesale\npolakoff\ngrumpily\nfalloch\nradas\nrentas\nshourd\nsmackers\nscharlau\nboxier\nryongchon\nlittlefair\nchichon\nmoohan\nmatsunichi\neifler\nspiegelau\nprestonburg\nmaleka\nkuol\ntillack\nfalfield\nsolfest\ntsujihara\ntacchino\ngerallt\nvillier\nsurata\nbeurden\ncadmore\nphotoshops\nbastardos\nkizzie\ncreels\nvref\nactigraphy\nigps\nsnaer\nbroyer\nshafak\nhoefflin\ntided\nnicus\nnanko\nlufti\nmillisievert\nmanick\nfunestus\npompadours\nsreenevasan\nkipunji\ntoupees\nmelland\nagnico\nfusce\ngassville\nantici\nmaclochlainn\nreverby\nequitas\ncanani\nurmo\nshticks\nmarcey\nproxicom\ndogaru\nscowls\nlitl\nxinkai\ndamagingly\nuntracked\npolovets\nsurgan\ngeissmann\nnourredine\ngraybeal\nbawdrip\nemotively\ndaughdrill\nultrarunner\ncefixime\npruis\nsteinways\naashe\nyungchen\nglanzer\noppenheimerfunds\nhamade\nminyon\nheparins\ncaqh\nbodenhausen\nvarnavas\nrobyne\nunparallelled\nlickley\ndoneness\npiraya\nbodeguita\nroscigno\nbridgefoot\npodgor\ntetz\nsomjit\noverachieved\nhokkanen\npamphilon\nlighweight\ntorsello\nadvertainment\namhi\nfrean\ncondescends\nthalasso\nghinwa\nhigazy\nroves\nmackeown\nninnies\ndusinberre\napgc\nmapoe\nhoekman\ncpsia\nvipp\nchuño\nunexercised\npolisar\nmaponyane\nclattered\nrofes\nkanning\njetsetter\npatsaouras\nlochrie\nmugrabi\nboyt\nbowdlerizing\nbackstabbed\ndrimmer\nbakhash\ngeralds\neastwoods\nshak
esperean\nbetokens\nsambazon\nhocked\nrestitute\neshet\nseftel\nroben\nwhinfell\nsensys\nstaudigl\nwonggoun\nchattered\nwising\nllandel\nshacked\noxiclean\nfaldingworth\nbigan\nriesgraf\nhigl\nnontransparent\ncompetiton\nyuyun\nxilisoft\npinkos\npayre\nburkeman\nscbt\ngeofencing\nfrustation\nsherree\ncelebrative\njaslovské\nmillecam\nseadream\nchoos\nalkerton\nsnarly\nmanzanitas\nsitzman\nseekins\nsimhon\nsodertalje\nboeker\niiroc\nteie\nconvio\nsampley\nshageluk\nphotomicrography\npriviledged\nantonelle\nrheinallt\nmcuh\nmahound\nvalacyclovir\nednos\ntopex\nllansannan\nheterosexually\npalant\npenjaringan\nbiothrax\nbacklick\ndothill\nconstructability\nzorab\ncarpatair\neary\npattonsburg\nregulski\nbevanite\nnaiades\nstrumpshaw\nwilberger\nkoele\ninheritence\ncablenet\nrosegg\nbarzak\nfarhatullah\nedgeplay\nziplining\naldsworth\nlifepoint\ncamal\nintractably\nryz\nbranders\nction\npattis\nmuchnic\narrrgh\nseroxat\nkolinski\nilecs\nadmist\ninactives\nharperpress\nbhittani\ndeveci\ntamps\nzimmitti\nfanleaf\nudfs\nburnton\nparticpating\nchargeability\nunterkircher\nunbelted\nadvantest\nnyfd\ndilbagh\npriscu\nexecpt\ntohme\nnsri\ndaignault\nkacyvenski\ndegredation\nsucessor\nevgenios\nmkoyan\ndelozier\nyein\npuddefoot\nbritcom\nacousticians\niqor\ntrachelospermum\nupdateable\ncircumlocutory\nwayah\nburell\npachysandra\nmles\nmaime\nmenie\ncrestron\nunworkably\nsondek\njustifed\nmisshapes\nbajevic\ndaydreamed\ncritchell\nventor\npsychopharmacologist\nabde\nwesterdam\npeggi\nbesties\ntorteval\nstultified\npomelos\ntiedeman\ndurnell\naissami\nferryport\nreenergize\nconfected\nharlestone\nroncagliolo\nstridulate\nindiepix\nlianying\nwescom\ncebull\nsensitizes\nnewble\nmukhsin\nshogren\nhagatna\nratnoff\nteyona\ncovanta\nstanish\ntexterity\nthermador\nlaicism\nvendeuse\nhandlowy\nstephney\ndiybio\nempingham\ndisolve\nsoutane\ngorard\nbezzaz\nguerdon\nhebrard\nnightclothes\naasra\nunpf\nbalalaikas\nalperon\ntatling\nreopenings\nalienage\nküblis\nhagwons\nnodia\nraffling\nmarc
accio\ndenude\nwenneker\nbermanbraun\nstandefer\ncisen\ncacak\nemmanuella\nlazarre\ndisalvatore\norpe\nwigford\ngiampa\nmoldea\nluitel\nmuziektheater\nmenchie\naggarwala\nbimco\nlanesfield\nrayful\ntrancelike\ncoaters\nalhomayed\nvantages\nogunkoya\ncampylobacteriosis\nzgonina\ndisbursal\njannot\nredwell\nnonmaterial\nquane\ncontraints\nxenith\njassen\nveroni\nbrulte\ncleopatras\noverhasty\nxendesktop\nsaney\nsitkoff\nsunderman\nreoperation\nmorgenroth\npearlfishers\nflagmen\ntaeke\nbiagianti\nalchemyapi\nsciubba\nteesville\ndomoney\nwithywood\nnuriyev\nchonda\nmanarola\ncovansys\nsbsi\nbaccalaureates\nwenjian\nindem\nsarracino\ncohabits\nhastle\nmiza\nhubig\nbeyler\nkermer\nmorawiecki\ntenneson\npowerboost\nlabrea\npatima\nyouming\naarebrot\nrimvydas\ndxn\nfuds\nshrimsley\nsciutto\nbarbut\npalexpo\npooni\ncypriaca\npolgreen\ndienten\nbutterfields\nmaymon\nakumal\nevertson\nuncheckable\nmingazov\nmatrook\nwallisdown\nthynnus\nunwonted\nmuqbil\nmeguiar\ndibono\npistolera\ntonnies\ncarolle\npowerhaul\ntirador\nmcmillans\npulzetti\nechochrome\nraïssa\nlekic\nflubbing\npresold\ndenominating\npasquerilla\nlagmay\nquickoffice\nempathetically\ngayner\nzoueva\nkreil\ngereb\nbassmasters\nredcurrants\nmyguide\nmonzel\nsamardzic\namiry\nsequenzas\naccelerative\nhonohan\nhermer\nbaver\nnooij\ndijkstal\nthornewill\ncontrats\nkaune\nvatted\nsherriffs\nmicrobrews\nabrazos\nkriemler\nhartmarx\ninfelicity\njanjic\nmechelle\nsege\nkwizera\naldermoor\ncrystalised\ndatagrid\nmacadamias\nnugee\ngourin\nbeauracracy\nghany\nkajeet\nburles\nhadramout\nkineto\nseguchi\nbauw\nhalbritter\nmisher\npraderas\narmhole\nnyagan\nlessel\nsnorky\ntrivialises\njabur\nekren\nnbfa\nshocknek\nkristyan\namapari\n₂\nbroadwindsor\nhiraman\nperano\nuapb\nswooshing\nucac\nfriskney\nhemmerman\nsanping\npasttime\ngallbladders\navionica\nemannuel\nauchincruive\nspotz\nkunken\nramita\nirrelevances\noutstate\nsidat\nmicroturbines\nkidspace\nzinnie\nulumi\npastırma\nkilcooley\nnarah\nhelzer\nportioning\nbimpson\nfe
nners\npeaston\ntrotty\nrezac\nwisconsinites\nssos\ndohyo\nblighters\nranchettes\ntwangs\ndushane\ngrandiosely\nsteamfitters\neiras\nslovin\npollicino\nneurotypicals\nabdikadir\nmalseed\nvogelman\ntheroy\ntranquiliser\nmbango\nshrewsberry\nbensedrine\nnxec\njonquiere\nwedeman\nnoncombustible\ngoalwards\nbarzegar\nmoriston\nardossi\nkauders\nbenabib\nhypercholesterolaemia\nannuit\nmaginley\nkanoute\nbedells\nfideo\nrexroad\nyamaichi\nreimposition\nbrynwood\neppi\ntravi\nruegsegger\nfreeny\nlemmouchia\nenfolds\nhuntresses\nphilanthropical\nareds\nwyevale\nuibhist\nfacebooks\nslithy\nbraymer\ninverie\nultrawide\nvlautin\npitau\nsmutniak\nmagwood\ndaleen\nostracizes\npinchin\nschumanns\nkloser\nhotsauce\ntushan\nunauthoritative\nmullivaikkal\nbryld\nizad\nselphy\nkornat\ncrousillat\nheadquaters\ndolgen\nroscuro\nrabois\nswietokrzyskie\nmurewa\nfouth\nmorfessis\neilen\nmollway\nengla\neuthanizes\ncritize\nivalu\nlokuarachchi\nwestco\nafterplay\nreadably\nlocati\npoienari\nleonnig\nzezima\nsoundwalk\njmcc\ntillim\nmoblog\nbloodthirstiness\nguama\nchels\nliplock\nbaaj\nkylesku\nriann\nhohlwein\nonelink\nbaalak\nquaggas\ntallan\nminouche\nfrecker\nchemaxon\npennybags\nbrekk\nhebl\nagbogbloshie\nflagstick\nsamakuva\nbrundige\nneuralgic\nqdi\nnewswipe\nbeaucarne\nchenxi\ntimeworn\naozou\nlampost\nvigneri\nhoggle\nyardumian\nmapson\nmerisant\nokagawa\nciroma\nhouge\ngubay\npynzenyk\ntacolneston\nkhumjung\nppip\nedry\nmackiev\nclai\nkadek\njarraud\ngtis\nwindmilling\ncarringtons\nquivered\nfruitadens\npuzder\nambitiousness\nhakstol\nwaino\ncucchiara\nrgbe\nshuvalova\ndwygyfylchi\ncoverted\ncastrucci\nreadier\nfrz\npupkin\nnowthen\nunderequipped\nambuklao\nbrumbelow\nbiotoxins\nwolrd\nfabada\nreeep\ngraunke\niog\ndismasting\nuseem\nlanahan\nbinya\nnayiri\nnawiliwili\neyeballed\noutperformance\ntomcod\nnickelsburg\nmuriuki\ndurman\ndannat\ntavernari\nwhitewalls\nsaaid\nzeen\nwarshow\nkoolaid\nmpgs\nwinced\nhounddog\nchilga\neckenrode\nlensless\nhypermobile\ndecentered\nmcgauley\n
miyanda\nmomment\nshafilea\nranabir\nfathoming\nlaret\nshedrack\nrefinish\nmaruge\nzeoli\nferrugia\ndioni\nhettrick\nrozek\ncomito\npleasured\ncipulis\nbiogasoline\ngnep\nballoo\nreplastered\nzolty\nzilinskas\nhavelsan\neconetic\nrcdm\nfattoush\nforegoes\ngojet\ngreetwell\ntedenby\nnewshound\ngrabel\natterberry\naleg\namitay\nsqueezable\nfinty\ncelebres\nturbervill\ncaydee\nshillue\ntxema\nfairmead\nfiscals\nvaziani\nsintim\ntrandahl\nkhanani\nfuggles\nimmunoregulatory\nstupefy\nhookline\ncsaa\ncisowski\nambrosial\nmiljus\ncyclamens\ngetback\nvancil\nwithypool\nncsd\nshoulberg\nghesquiere\ntabarez\nschoppert\nfallons\nmaarof\nenw\ncbrc\ncertainteed\npeniket\ninceptions\nchowdown\nlegette\npebley\nushs\nastarloza\nburrata\njazairi\ndcra\npatisia\ndaryoush\ncouderay\ncaiafa\nchortling\nenernoc\nbecaues\ndesalinating\nshandley\ntonkovich\noxygenating\ncaminata\npolner\nnakouzi\ngearheads\nnogar\nlistkiewicz\npoteen\ngangar\nhoptman\noverprocessed\ndubernard\nremilitarisation\nspianato\nbathpool\nbagrami\ndecitabine\nutzschneider\nintego\nhoupt\nnordholm\namorin\nshujaa\nsadka\nquestionings\nmuranen\nnekaris\ndatelines\nsholar\nuitikon\nsrch\ninexactitude\ncothay\nhengeveld\ngriso\nonesearch\ndefrock\nfoister\neruptum\nvincebus\nblubbering\nreasor\nstihler\nmulticare\nzazueta\nsupranationalism\nmapendo\ntreizième\npaxi\nteletalk\ntoing\nscabbed\ndesia\ndonnachie\nshulock\nbegor\nchappies\nyogpeeth\nbackpedaled\njermon\nflinger\nkicanas\ntrigano\ncottrer\nautomats\nutec\nkonduz\njauntily\nlukesh\ndruskin\nkhwani\ninfosport\ndederang\ndobrynska\naccusingly\nnapolioni\ntunison\nicimod\njops\nfireguard\nopenadr\ngeocenter\nbassingham\nmcclatchie\nwolkowitz\ntuffah\nezawa\nalnylam\nshanghvi\njayla\nzarar\nadtran\nwhap\nabysova\ngormans\nhealthplan\nfaronics\ntaback\ncatcall\nhcis\njeeze\norchises\nosserman\noptimiser\nmluleki\npiggledy\nkayrouz\nholographically\nbetw\nzwirn\ncazaban\nboerwinkle\ninglethorpe\ndoomsayer\nponden\nmicrowatts\nassimilator\ncarnochan\ndfsp\ntanaru
s\nsakazakii\nbenisch\nkowtowed\nbardelys\nborispol\nhaeusler\namericanconnection\ndansky\nimaginer\nenflamed\ncollery\nravikanth\nimmy\ncancian\ntrulove\nchoper\nleegin\ntickencote\ncylinda\nhoevelaken\nionizers\nintratumoral\npambo\nszekesfehervar\nkezerashvili\nquyet\nmontenvers\nhelfert\nsantaniello\nciil\nvdma\ntransubstantiated\nsdlt\nbaseer\nthweatt\nletherman\njillings\nhighridge\ndornin\nbronglais\nporthgain\nhaveman\nomegle\nlulucf\nkazakhtelecom\nellingworth\nkaseman\nairão\nstandstills\nnkomati\ncowpokes\ngoldnadel\naclj\nvaccinators\nricharda\ndroperidol\nfastnesses\neulogise\ncrosbys\nshuzhen\nmarran\nshoeboxes\ncasden\nparadizo\ngenuis\nyaalon\naddresss\nvrinat\nklyuev\ncaberta\njasmines\ntakanaka\ngeocell\nthandiwe\nprammer\nswordbearer\nenthusiatic\ngoldheart\ngyude\nglobespan\naaditya\ngdgt\nhoeness\nprilocaine\npalihapitiya\nperisic\nanchorfree\nnorned\nwronging\nwammies\nsublingually\ndobbies\nriener\npongs\nbezunesh\nzya\nrosemeadow\ndsny\ndidden\nlevell\nquann\nschaedler\nanwa\nsangomas\nmzymta\nundreamt\nbabikian\ndiggles\nhowroyd\nduckworths\nguoyuan\ngentzel\nhlavac\nashlyns\nmonarc\nafront\ndruick\nplantiff\nselldorf\nsnickersville\narisman\nfalacy\nrefind\nkokol\nshopsin\nunticked\nhuska\nuyeno\npettys\nleebron\ncommet\nubogu\nshadowcrew\nunseeable\nfppc\npetfinder\npizzitola\ngueffroy\nguangjin\nwanly\ntyrannically\nelectrospun\nomachi\ncvent\ngandhians\ndorice\ntitbit\nzwolinski\nfreedivers\npaltel\ngasior\nbiederitz\nmeuris\nnongame\nquemener\nfeldmeijer\nbargal\nimagenet\npomerado\npenaranda\ngumus\noverindulging\nobdii\ninterinstitutional\nausnet\nmxenge\noutgrossed\nmissileers\ntarvit\nkoopmann\ncizeta\nflahooley\nsalutatory\noverstimulated\nmeeru\ntricyclics\njamarr\nbocardo\ndonini\nllanwonno\nhogganfield\nbutylparaben\nspoofer\nchandrayan\nmoutain\nwakuda\nserried\ncallused\nelies\nverryth\nzackenberg\nhelth\ncolemore\npowerschool\nbidonvilles\npucillo\ndelimkhanov\ndummying\njakon\niress\nsergerie\ncalica\negeon\nhatefully\nwisba
r\nvyatchanin\ntecsar\nsoffa\npeakirk\nlisen\nbrudos\ntiedown\nrohen\nqarnns\nbootprint\nracf\nlimc\nlissett\ndanly\nschwemmer\nscwr\nhardoon\nbrunnermeier\nshanika\nbrooked\nginnis\nnonpartisanship\njournee\nschafft\nkveta\ncommisioner\nniueans\nkandeel\ngoovaerts\nbandhs\ncarinci\nneimark\nbooysens\nsadou\ntallington\nagraz\nputterman\nhelliesen\nseliga\nzinkhan\ntrickiness\ncomunities\ndiddling\nhearin\nzurabov\ncaplets\nbonking\nteichner\nbeardshaw\nsoysambu\nbeting\nkeshawarz\nwhatevs\nergometers\nacmg\nsibilance\nlania\nactivant\neudald\npincock\nprobabtion\nleatherby\nfalavigna\npasseron\nvigliotti\nsherifa\nfahringer\nfezzik\npalguta\nencomia\nmodahl\nactl\nblazwick\nkoolman\nephemerals\ncanós\nguiliani\nkyemon\nslavonice\ncuautitlan\nquaffing\nkariana\nnorview\niomart\nincluso\ndmea\nsparaxis\nwheelabrator\nhorsewomen\nkozulin\nyuesheng\ndevean\nriar\neasi\nfoushee\nliexian\ntolkiens\ndisenrollment\nauwe\nthouvenel\ninvacare\njesta\npiotti\nmotoblur\nunitrin\nmwaruwari\ngambril\nscurrah\nshawbridge\nengelmayer\ntaniya\ntactlessly\nbeegees\nschuelke\npantechnicon\nhollaway\nmethody\nsaddi\nbollito\nkolmanskop\nkilm\nmounding\nmuthamma\nkirkbean\nmvelaphanda\nmatrikon\nmuteba\nmassachussets\nchacahua\nlabwani\nkofe\nfamiliarities\nmenzieshill\ntouaregs\nalertly\noverprotection\nbaltal\ngrandey\ndemonstated\nmefeedia\nenemys\ndebtmerica\nrenderos\ntiagabine\nonsi\nmocap\nmandelberg\nmaykel\nfootbrake\nhappends\nhawliau\nvenediktov\ndoumeira\nkhaddar\nroksanda\negeraat\nkazini\nappraoch\neymet\nkaindi\ndeghati\npassanger\nnoghaideli\nnojo\nuninspected\nmicrocredits\nbullshitters\nzoomtext\nextricates\nvalaika\nivernia\nsgdl\nreccommended\ndaeg\nkaluyituka\nkutik\nlinerboard\nkincorth\nslovenliness\ngraincorp\nprofiteroles\njaeggi\nsamenow\nkeffiyehs\nmannerly\nscsr\nsafaa\nfainthearted\nmockney\nschmautz\nwndc\nlandrin\nharshing\nasbjorn\nairstar\nmccalpin\npolyone\nvizhi\nhisakazu\ntwizzlers\nsupersensitivity\nbrahima\nwyburd\nabeta\nchoreopoem\nlinds\nkacst\n
denef\nactiverain\nbaluyevsky\ntarves\ncokers\nkoua\nbegbies\nhoodwinking\ntischman\nranelin\nmolnau\nustyurt\nsalmona\nmorstead\nmishon\nmoskovia\nrawsthorn\nfaceplant\ninnerchange\nludmer\nkassoma\ngeeker\nclamors\nfrithelstock\nsambuco\nthiruchelvam\nbixley\nsubmetering\nlatman\nrhawnhurst\njoei\nmangaliso\nshedlock\nfluharty\nnorwick\nayariga\nhijrat\nskyrise\nhowgego\nhoefkens\nlichtblick\nbisti\nunderclasses\nausterberry\npedestrianization\nthumpin\nguisasola\nfooding\nsancan\nliterariness\nwhelans\nkoonse\nstocktake\nbanafsheh\nporfiri\nrettberg\nestore\nworobec\ngovilon\nhaipe\nphrasebooks\nmaccuish\nitasha\nmazut\nvitiating\nsabertooths\nfurfuryl\nunhooking\ndimpling\nchanchez\nkangai\nlingani\neebee\nbrovtsev\ndesoer\npebbledash\nzocor\nghorban\npoisioning\ndafora\ncairon\nshpe\nsubleases\nartesanias\ndodes\nquantapoint\nmehmanparast\nmwcnts\nalledge\ncalvey\ntalb\nediton\ncvvt\ngateaux\nwaggett\nbabnik\ndruckmaschinen\nallido\nphok\ngjelsten\nlostwinds\nhodr\ngoumri\ndecisionmaker\naussiebum\ngigaset\niqe\nscholtens\nmaplecroft\nunderinsurance\njameelah\nlipoteichoic\nruhlin\nunilateralist\nlungomare\nredenominated\ntilmant\nsallon\ntelebrands\nsquelches\nlandaluze\nrecipie\nextremest\nhowardforums\nrudakova\nxiwang\nearthweek\nlgfl\nqmd\nkowalczuk\npadaca\nqueloz\njegou\nsharzer\npetricone\nnanh\nstudioso\narenda\nbrager\nvikingstad\nclario\nruenroeng\ngazzam\nwoodturner\ndloc\ncampolina\nfourballs\negaming\nparaguana\nmashers\nmigrantes\neverchanging\nmandabach\nhochhalter\npopmoney\nhongbing\nlaborda\nnouria\nbottigheimer\njenco\nningming\nhyperglycaemia\nkanjo\nguthlaxton\nhateship\nloveship\nmarjeh\nzentiva\naipla\nkucherena\nglammed\nintimidatingly\nbuzza\nkryczka\nlongabardi\nwellie\nfastskin\nkorns\ntiotropium\nizala\nschoonebeek\ncasadio\ntarictic\nmikalah\nhinnies\npiscatorial\nhitchmough\nnimetazepam\nsubsquently\nwickramaratne\nconnubial\nkukovich\ncannelloni\nreline\nbierwirth\nfrykberg\nmorcone\nebad\nguosheng\nlittondale\nprimario\nmasalas\
ncaten\nilston\nandroulla\nkamecke\ngregorys\ntoudouze\ngfatm\nzagorsky\nmilca\nfqhc\natkisson\nsaidullah\nrajapakshe\nhusic\nulanhot\ncahall\nexasperatingly\ntupay\nmarkmann\nmeïté\npresumptuousness\nhökmark\nagressor\ntridel\nharvy\npottering\ngockley\nschoolday\njalasto\nsublets\neatinger\nrubinger\nvalmon\npantomiming\nsklaroff\nwhv\nbiotek\ncharlcombe\nmulege\ncairnwell\ngelete\nyedda\nstorrar\nnaion\nviglietti\nderocher\nsoat\ntintinnabulation\ngerstenberger\nflywire\nfleshier\nkasandra\nnealis\ntatis\nccop\nlippes\nschlesselman\ncurnock\nmettingham\nencroachers\nheckendorn\nencierro\nbanknock\nnyron\nfantz\nsideburn\ndufus\nfinnkino\ndelron\nimpoverishes\nbidded\nhypalon\nspeechlessness\nkukeyev\nmosiman\nlagoda\nlangas\nwicus\nhuilongguan\ncontradition\nlawdar\nmandhir\notemachi\ninclan\nbigamists\nkrupicka\nkneeshaw\nfreidheim\nmurderland\nsumatrae\nbrookhill\nowchar\nkeratoprosthesis\nyousufzai\nrearend\nfahid\nkuks\ncapts\nkhyam\ndermont\naddow\nluach\nballeza\nrouiba\nnotnowcato\nangueira\ncrissier\nleshy\ncaringal\nowly\nwheke\npancaked\ntosspot\nfyle\nzalon\nalamieyeseigha\nnoiselessly\ndurty\ngreenstar\nakitas\npubli\nterekeka\nllandegai\nhinderer\ngageby\nskena\ngelvin\nrichlin\narkefly\nveracode\nejupi\nwebmistress\nladybrook\ncognis\nsoberness\nmoyross\nmarguiles\nlineouts\nschifilliti\nmagnit\njongbloed\nzindzi\nrampf\nausenco\ngirifna\nsvoronos\ntoddling\nbenhard\nnoblett\nscotson\nchenoy\nelegba\naggreko\ncigarmaker\npirls\ntaiohae\nsonol\ndukane\nrospotrebnadzor\nshakirullah\nredbush\ninsertive\ninterwest\nsnowblower\nsakhnovski\nrajnesh\nzecevic\nstrommen\nescudier\ndejonghe\ndeadlifts\ndumbasses\nhexie\nvasella\nslma\nloscalzo\nnoorvik\nsardarov\nbeccaro\nfaustmann\nrecessing\nfilicia\nfloorshow\nvalloires\nmouraria\nxanthomas\nedfa\ndorame\nsamthar\nbaldie\nsandya\nlightsabre\nbaldes\npingers\nopodo\nobamaism\nglocker\nnjbiz\nhavlick\ngrushow\nlambridge\nbanishments\ntargus\nsocities\nanathematised\nquiwonkpa\nnovair\nfluendo\ncountersurveil
lance\nliqa\nomali\nwhizzers\nbermant\nmammaprint\ngannochy\nnewspaperwoman\neayre\nbakircioglu\noxonians\nezbet\nciticoline\ntrevyn\nlöwitsch\nwesterlands\nhumitas\nkwock\ndalein\nsedici\noceanco\nliuqiu\naigs\nexperiement\nfaruqui\nillian\npunggye\norganiztion\ncruller\ndalbavie\nakesson\ngoofiest\ncrowsley\nforugh\npointcast\nbraemore\nbzx\ngrotnes\nshadowmancer\nscearce\nspenkelink\ngracian\nterentiev\njanadhikar\naltnagelvin\nkannika\npatano\nwestat\ncallerton\nmouttet\nunversity\nleiker\ntogeather\nchathill\nvodaphone\nlenney\nharcombe\nuex\nlajolla\nrald\nswackhamer\nnarcisco\nmovano\nbalkanisation\nhudyma\nragle\nkaylynn\nlansanah\nferl\nderreck\nmboka\nlangarica\nbockelmann\ntomsic\nparklet\nloooooong\ndoesent\nmondory\nbaudour\ndamsgaard\nsedky\ntamashek\ntomass\nbuinaksk\nmambu\nbrusk\nearlysville\nkunimasa\namale\nfishings\nchampon\nwagonlit\nztohoven\nappall\nukranians\npreordering\nudston\nmeshram\nbaxby\nmedano\nhoteling\nudofia\ndedridge\ngolfland\nmelwani\ncostellos\neej\nlisanby\nhessam\nsiccardi\ntolcarne\nniebling\nsezibwa\ngarecht\nbrentry\nqabb\njarritos\naulani\nbcause\nchiedozie\nvolanges\ngdhi\nprepacked\nmingma\nstereoscopes\nbessis\nbassens\nlarese\nsaoul\nfacteurs\nhorseshoeing\nccow\navrami\ngrattacielo\nosaid\npropagandising\njodelet\nmcgaffigan\nbriseno\nkurutz\nsartore\nipsley\ngosens\nvegfest\nmundle\nnashawena\nvirot\ncrimper\nautonomi\nappreicate\noystein\nnaschmarkt\ngutsiest\nwhitehair\nbivy\nsvtc\nminijack\nsautin\nhardel\ncerrejonisuchus\ndörflein\nchaib\nsandlots\nschnetzler\needa\npokaski\nlitigates\nrowly\nbretter\nsérac\nguusje\nclumsiest\ndidelphys\nbaatz\nkatsavakis\njosetxo\nzickel\ndominquez\ngagliasso\nsarbu\nfeebler\ntcherezov\nmerched\norjuela\nmicek\necletic\ntomkiewicz\nvtn\nrajive\nuriminzokkiri\nscorpene\nbucktoothed\nbeachings\nmanhandles\nmicroamperes\nsimme\nromanello\nteeside\nbichons\nshigri\nroia\ncavagnaro\nnothronychus\nharraby\ntonala\nshedfield\nmatsumori\nwatersense\nsailability\necgd\nziganda\nfiscali
ni\narbeia\nlevermann\ntodorovich\nvelosa\nspedition\npestronk\neigel\notepka\naknoun\nsudokus\ncomunism\nbatbayar\nglovsky\nchalmé\nmamina\nclonan\nbrietbart\nundrawn\nliftport\nalbiceleste\nsecondi\nrottino\nrummo\nbraunlich\nppss\nvelti\nvandyk\nimpresive\ntimanfaya\nislamicisation\ndivisionists\ntierre\nandartes\nwronger\ntomasevicz\nspead\némigrée\nfloozie\ndunchideock\nguriel\nacga\ngulbin\nbaumberger\nromanick\nupconverter\nboubakeur\nromaguera\ngrowney\nsolariums\ntryng\nophiopogon\ntruffula\naviations\nshellers\nmcapi\nrusan\nsakabe\nsillito\nlombax\nsneeringly\nheathway\nzlotnik\nshacknai\nhappended\njaksic\nlomnicki\nmcnamer\nimitable\nwoodgett\nrestif\ntwinjets\nsrednyaya\natassut\nbeaurocratic\nravensbury\ncalander\nmayorkas\nglowpoint\nreanna\nceratinly\nkildrum\nhorbaczewski\ncontinetti\nbucketloads\nrokpa\nkosair\nbreathiness\nzenshin\npilsworth\nsamcor\nhuanghelou\nzhongping\nsajadi\ngraystones\nhumphryes\ntional\nhuxleys\nmccafe\nengeman\nmouhammad\nneogen\ngujerat\nbeatdowns\nnegasso\nnomuka\nnurnberger\nasimos\nproek\ngidada\nwalkoff\nkotil\nfichardt\namisi\ndiscala\npassagework\nunfaded\nstolfo\niside\nblethering\npolverini\nnohria\nselespeed\nwheego\njeansonne\nkatyushas\nsteinfels\ngaudini\nchildproof\nimmedately\ncaesarism\nmcfeeley\nemcor\nudj\nsotterranea\nkimbal\nvandas\nflatfooted\nearthrights\nfragale\nmaharanis\ngamcare\nrabar\nbagnulo\ngaulden\nbernies\nenvirolink\nlussick\nbrachen\nbaalke\nbrammeier\nmaandig\nlevings\nemerito\nplbs\nunintelligble\nemperical\nshortcode\nshirker\npanteleyev\nstricklen\nperretti\nlgvs\nheastie\nfrearson\ntumlin\npreoccupying\ndaradji\nmarbley\nwetzels\ntenderhearted\nkandamby\nmohacs\natpg\nmaezawa\ndemocratas\npamala\nstegmayer\nvarces\ncrucifies\nbaltonsborough\nholtam\ndictu\npurées\nunderkoffler\ngavaghan\nacerinox\nflipflop\njockel\nmethlick\nabogo\ndestory\nmoviestorm\naltug\ntrister\nelectromobility\nifrica\neyeroll\nplutoids\nbesets\nriversway\ngrunters\nclulow\ntelea\nkennards\nakiachak\nidasa\n
aliakbar\nrienk\nstatusnet\npeiwen\ndubler\nbodzin\nemch\nmosebar\nadvs\nwawr\nsuffuse\nuyeda\npunctal\nfoued\ncarribbean\nsewardstone\nhallym\ndarv\nglossies\nsantia\nfarmgirl\ngieschen\nmiscommunicated\nhosenball\nnorfed\ncrashworthy\ngeekiness\nmckintosh\nhamdoun\nabdoulkarim\nasnis\nmaghami\ncentreforum\nleffel\namalya\ngareb\nstashower\nlanh\ngloveman\nmontplaisir\nlawshe\nadach\nfatfat\nidoko\nflashiness\neeig\nyabulu\nmaraden\ntuerck\nrepresentaciones\nkiest\nestai\ncaulks\nnetmedia\nauldhouse\nfreking\nholbury\nunpretentiously\nedwardses\nmilliamp\ngrovewood\ncurmudgeons\nmediatory\namedi\nrodiles\nrantamaki\ncymunedol\nvelton\nliljenquist\nsanberg\nyesin\nmoussavou\nmammies\npenisula\nirruptions\nwhjy\ndraggable\ndincklage\nrovani\nmicroalbuminuria\nsquarks\npolyak\nkassulke\nkarro\ndyantyi\nstockselius\nmcgleenan\nefacec\ncaliri\ncroglin\ncaryopteris\nmultidenominational\nibwa\ndisgorges\npasuk\ndefjam\nvidoe\ngetinge\nparvenus\nprabin\nalamoudi\nwolfberg\nllangors\ncavorts\npolysomnographic\nyelizarov\npastukhov\ndemijohn\ntwiddly\nmartinussen\narctotis\nzaheen\nvistar\nnourisher\nmyfoxdfw\nreaderly\nbetdaq\nwithou\ncappers\ncarnett\nhemdani\ncastlefields\nalderholt\nkatragadda\nsanfords\nmidomi\nhandsomer\ntahlil\nincapacitant\nemfs\nvanderveen\nminnoch\naustinite\nlynelle\nwrang\nheckroth\naepi\nopenworld\nweast\nuppies\nhnefatafl\nmathee\notting\nhardwear\nstroopwafels\noxys\ncalvery\nmucoadhesive\nhigest\ngrammel\nncfa\nmanceau\nhpvs\nfalinge\ndiemecke\nemcf\nsassier\nsttr\nabase\nfangak\nguisard\nmiedl\njianchuan\npalatas\nscatsta\nbarrey\nsoua\nmeselech\nzigomanis\nconjur\nhaimanot\nacergy\nbenkiser\nshatra\npanjaitan\noutfox\nederer\nbanwen\nrazoo\nlianke\nmuzzio\nmillhiser\nboehlke\nberoni\ngospelfest\npopkiss\nkubinec\ninvestama\nnestings\npikkarainen\npylle\nitif\nmataskelekele\nmannava\nnjongonkulu\nbookeen\ngonsalo\nbutscher\nfrisks\nsugarcoating\ncartref\nomigod\nlabaton\nthirunelli\ntakeback\nmayell\ntharston\nauthoritie\nstehn\nzeydan\nberc
helmann\ndugg\ntaitu\ngietz\nintertech\nscooted\nsufiah\nayahs\nschiaretti\ntenderloins\nniittymaki\neradicators\nlovgren\nphilleo\naotc\nciardha\naltaie\nmedwatch\nsunshield\nelberg\nkiilerich\nseahenge\ntheisman\nfosbrooke\npowerwave\ngergawi\ndefectively\njaveed\nedathy\nmatallana\nsuckage\nmontpeyroux\nwoolie\nqinan\nkarins\nselwa\nbaglin\nwhizzy\nhoningham\nadiyaman\nleemon\nkaisersaal\nmazuch\nrondavels\nfinical\nruaraidh\nshoestrings\noptimax\nswetz\ntynesiders\ncommutable\nlvarez\npactola\nteckla\nuaq\ngaravini\nbargan\narnson\npictometry\nbudnitz\ncought\nmophie\ngalgate\ndaifallah\nlifelink\nkivus\nmichalakis\ninaptly\nbeichman\nenergystar\nreprogenetics\nhardstanding\nkhloponin\nfloripa\nmountz\nnfwf\nlingford\nguanzhou\nmoonquake\ndigiorno\nbillingsly\npartisian\nslithery\nhemion\npiscicelli\ndistastefully\nrainouts\nguanlan\nferarri\nplymale\nfcip\ndundonian\nkildans\nipis\nequivocates\nultimata\ndiscloser\ntechron\ngaddo\nboosterish\nernesettle\nliqing\nunselfconsciously\ndiavolezza\nbeegle\nmerfeld\nfinancieele\nzoraya\nhealthnet\nwhaples\ncalando\nseigne\nyeremiah\nknishes\nmahjar\nanapolis\ntraipse\nhairnets\nreynisson\nsunraycer\ncoppolino\nsarpei\nwarcrimes\nlafonta\nijams\ncurtici\nhassans\nhial\nmccomber\nalagic\nlanrev\nheadwalls\nstitzer\nsopping\naiful\nkhalilah\nunmelted\nvatansever\nvogondy\npipewell\njinkee\ncarlyles\nabdhir\nshawbost\ngloucesters\ntowndrow\ngalumphing\nzwinky\nartacho\nicepack\nmaalaea\nfloriferous\nsawtoothed\nkargin\nmrčaru\nfollwoing\nundercounts\ncutillo\nbakhtyar\nindigos\naobo\ncarvelli\nsyamsuddin\njuppe\ncourrielche\nsuperpages\ndepravities\ndebrum\npartaker\nphilipino\nguarenteed\nferreres\nprebate\ndepressurised\nzhelyazkova\nthingvellir\ngossipgirl\nbouchy\nnohe\nwocn\nberlato\nperishability\nuhls\nsoftphones\nsanitizes\nkamine\nmesinai\nersberg\ndlink\nsirbu\nsolair\nluthman\nsleeter\nrefalo\nchemico\ncefneithin\nvalone\noyuela\nketz\nvunerable\ngueguen\nkstar\nfadhili\nconceiver\namezaga\nwellinghoff\nhdy\nar
sd\nwounder\naberteifi\nstantis\ngustines\nprotract\ndoglegs\ncevital\nstoltze\nmatee\nchunxiao\nsmouldered\npoppiest\npalsgraf\nscarantino\nzyskind\nnishu\ntightropes\naminatou\ndatai\naiptek\ndogster\ncatchpenny\nwithall\nfretta\nbrackstone\nschlosshotel\njackknifing\nramnani\nhassie\neaser\nhabhab\ntiffney\nkogelo\ncoartem\nswatton\nteilhardina\nwazzani\nbaupost\nsnowmageddon\nbareness\nroussey\nalekseyeva\nmandanipour\nleskovar\nbershka\ncyberball\nshcharansky\nkaczur\nkaddoumi\nhetreed\nfatton\nindefinably\nbutterfill\ncashcard\ncuvées\ndetemir\nanmar\njoulwan\ndrachenberg\nmackwell\ncollarbones\nderner\nbelier\nvallos\ntalil\nrequiescant\ncavet\nfwi\nmoubarak\nlewies\nknightian\nconsob\ngrazzi\najib\ndonemana\nfentimans\nconservativeness\nblands\nndoro\nfiladelfo\nmohammadu\nparodists\npurtle\nfootstar\nlahme\nfioricet\nminiaturizing\nbarraging\nnamanga\ndepoliticization\nstromile\nmoggerhanger\nnewedge\nglatthaar\nsaadon\nlavandeira\ndebars\nkelynack\nspinned\nvodden\nkivo\nluckovich\nextemporised\nblook\nballie\nyodlee\ntakhta\nisoh\npuijila\ncaraballeda\nnagakawa\nszema\nbeserk\nscianna\nwetzsteon\ndemersus\nbillfold\njonkoping\nconversationalists\nnoooooo\nkarron\nbisel\nmilinkevich\nyakobi\nakenfield\nmorisette\nyonghua\nkiamesha\nsightscreen\ntosher\nsteeplejacks\noveraged\nsacia\nhitar\nwaterhealth\ncousseran\nrundquist\npricesmart\npelekoudas\nkapitula\nchlorphenamine\nijj\nshamble\nberloni\nancop\nkarosta\nfessia\nmenchville\nstanchart\nroyte\nkhafaji\nhorsebox\ngontarczyk\nmarida\nkyllachy\nslainte\nicebar\nposher\nlennan\nfaduma\nderossi\nmultitool\nsitution\nparlimentary\ninsectile\ndeaniana\ngreated\nnvtc\nhollingwood\nscheidhauer\nmcavan\nqiuxia\nmogilyov\nenfamil\nibmp\nlioubov\nunemotionally\nnidever\nmugatu\nmizutori\nseaweb\nkorpikoski\nedgerson\novercompensated\nmurderabilia\nalemi\nrivieri\nurique\nyaas\nnmes\ntchoyi\nsmooshed\nsozen\nlenes\nmisshaped\ntherrell\nvelayutham\nindravadan\njacada\nmelodramatics\nadventurously\nzyad\nchookiat\nb
arsch\nyashchenko\nuncombed\ndistrait\nehsas\npresswire\nmahlman\nstrangulating\narviragus\ngueret\nkildress\ndgcx\ntodini\nagazio\netiolated\npolytunnels\nscade\ntschiffely\nratlinghope\nkammal\nballingdon\nflaschen\nethington\nmullainathan\npletka\nluera\nsupergraphics\nflansburg\nekber\nnaeve\nherrara\nbtwn\nuglovka\ntifs\npostions\nwavebands\nsvahn\nludicrious\nkarmon\nrmps\ngrescoe\nvodenicharov\npalaeoclimate\ninstituition\nbeuc\nclabo\ntevanian\ncastlestone\nkibbles\nqfii\nnoisemaking\ndivey\nmetam\nslavering\nswaddle\nacpt\nsheesha\noseo\ncynlluniau\nrackable\npositioners\nbladimir\nkheradpir\namalraj\nlahde\ncpeo\nmackness\nuskmouth\nprimulas\nkwezi\nmyko\npanagiota\ncarancas\nlrx\nkatzenmoyer\nmarchioro\nleamore\nlitterer\nsiloed\ndistintos\ndirigiste\nhananel\ncellou\ngoldreyer\nmamunur\nvnb\nworldfocus\ntavalon\nyuanhua\ntasti\nyowling\nmarchiani\ntaked\nfischbein\njenky\npilseners\nvistor\nduntrune\nmcvities\nmoaveni\nwannop\nlundwall\nchks\nawaydays\ndebry\nblockiness\ntuthilltown\nfccp\ncoverity\nsegafredo\nkurtulus\npowerpoints\nvidharba\nacutiflora\ndewick\nsalinisation\ntqi\nsharifullah\nvieaux\nkupisz\ncodrea\nrigzin\nschelsky\neilein\npbra\npagents\nruesga\ntrifurcation\neyeson\naggar\ntolmach\npharmacopoeial\niaop\nwuermeling\ntangriev\nnonas\nkarrikins\ndisorientate\nrollcall\ndiversitas\nroveri\nshahnazi\nsagg\nuplinking\nyardarms\nweiqiang\nmideksa\nchryste\nglibness\nbrundibar\nspookiest\nmeeka\nahri\norphanhood\negland\ncommentariolus\nstirbois\nfarflung\nbooo\nwoolfalk\npacht\ntrezevant\nroslund\nchitika\nmosiello\nherbfarm\nwholefood\nmiring\nrosenker\npervenche\nhartless\nlofters\nbaidoo\nhypermotard\nadao\ndundlod\nkommineni\nchulos\nkreizman\nstaron\nleekes\nnesim\nnieporent\nmiasmatic\nbhuna\nfaughan\nreddock\negalite\ndeckhouses\npartouche\nilboudo\nbizare\nlemonick\ninpact\nappjet\nkilcornan\ncisler\naronsen\nrtts\nheinousness\ngigantically\ncorrectible\nrefection\nbaudisch\nschoolfield\narbritration\njfcom\nkehela\naptekar\nporan\n
trigwell\nstormhold\nspecifiy\nsecurus\nchengwei\nknowlden\ndeleware\nritzau\nmaimbung\nnaleo\ncuilo\ndunivan\nswissnex\nvarejo\nsharts\nseabound\nprinya\nbajoria\ninfousa\nnappers\nsakchai\nconvit\nkhristian\njanszen\nlascurain\nbuchtmann\ngafisa\nlionetti\nbrokk\nlambrini\nbranney\ntimbersports\nepernay\neffel\nsenik\nhundredweights\nabdolah\nrusciano\ndeodorizer\neisha\nunsellables\nblandit\nsperrgebiet\ncavana\nlugansky\nwesbite\nrebny\ncampsfield\nfrontzeck\ntlale\nbisquick\nkeig\nnatureworks\nmitica\nawdurdod\nelmen\nondieki\nvollbracht\nkhugayev\nrackhams\nmcbrides\nrathergate\nverdini\nbeuzelin\ndeye\nmikmaq\nrulan\nkahoe\nclis\npicsel\ninelegible\nuninflated\nzulfikhar\npicturephone\nbragar\nhespos\ndotheboys\nseither\nkromkamp\nwindpark\npanicos\nesfri\nbaronesse\nvulindlu\nrapproachment\nlalai\newwww\nhafizur\ntheif\nblaber\nizenberg\nhaggler\ngaynair\nhuaynaputina\nwyhe\nteranga\ngoreux\nnewcastlegateshead\nnewspeople\nuplifter\nftms\nsheyda\nhadcrut\nrentech\nyalof\nbrockel\nirenee\nspewak\newh\nadmiralspalast\nfeachem\nkhadak\npisaroni\nlascoux\nwesch\njlloyd\ntogather\nmakeweight\nyamon\nsaltiel\nedwaard\nmiamians\ncubistic\nwassouf\njube\nabouba\ncullingford\nlaunderettes\ntetty\nsadykhov\npanich\nineluctably\nendcaps\nsarantos\nbelhassen\nsmartpen\naeterno\nsabbatino\nfaraya\ndelise\nwsff\nnacel\ndubielewicz\nunisource\nspiric\nmakali\nechiverri\nliddiment\nflipse\nboyishly\naiss\niochroma\nhighwall\ntrequartista\nexplantation\nocma\nbarsness\nvaza\nyakhont\nelhage\nprosafe\nisold\nbelston\nmontclarion\nrencurel\nmurkoff\nkrensky\npouching\ntasing\nsydd\ngharani\nmergermarket\ndollface\nordan\ncrinkles\npacfic\nwallal\nguidera\ndepressurizing\ntittmann\navolio\nbiodegradeable\ncornock\ncbac\nkbx\nsalsinha\nchasman\nkhabab\ngarvis\nwelfarist\nolaparib\nwarnig\nnonaggressive\nkotelnik\nmarinaccio\ncommmittee\narriana\ngarlicky\npaumer\nbionaire\nkarash\nskimpier\nloynes\narlem\nweyns\nflatlining\nmujangi\nsadeddin\nmccright\ngulay\nglamazons\nwhetzel\
nmaasim\nbacklift\nmiguélez\nserriffe\nkoulis\nblachère\nrawah\nplayle\nmcmoon\nimbrogno\ndotonbori\ndealmaking\ngocke\nrhosddu\nnassauer\npingeon\nlambswool\nforakis\nonsong\nsupperclub\nialdabaoth\nmelsonby\neuzebiusz\nlusterware\nsassenach\nhdrs\ndeployability\nremkes\nkaviani\nfennecs\nvitner\ncramers\nkonzen\nhadsell\nzuheir\njesner\nmotherese\nsteingart\ngenocidaires\nquartettsatz\nunknowability\ntuccillo\nblankfield\nbedington\nadlène\nguskin\nstandalones\nghawi\ndistrictwide\nmicrolite\nbastrykin\nmaidis\naher\ninnumerate\ngwyddelwern\nfortunetellers\nbiologies\ntriaging\nslackens\nwoodsmoke\nassyrtiko\nprme\nkhalf\nbrianstorm\noncourse\nmoravcik\ncenterback\nfarokhmanesh\nmijatovic\nsiwy\ncommiserations\nglammy\nmayorships\narrache\nmuxlow\nhusinec\npetrohué\nskippable\ncoasteering\nonesimo\nchorizos\nremanso\nfulgent\ninnerspring\nkour\nmisseriya\naiswarya\nbasrans\nlovegren\naparthied\nheadcheese\ncodeveloped\nnathani\nicrossing\nconvienient\nroussef\npalmor\nbarria\nbjartur\nviñals\ncastonzo\ngodana\nameric\npurveys\ngumbos\nministery\nstreambase\niscd\nwalkies\nberkovitz\nmooli\nagatston\nmalooly\nfauvergue\ncybertech\nunencapsulated\nwackier\nmukaddam\nwolayta\ntsheri\nvandekeybus\nchinchen\ndossia\ndongria\nwelted\ninventec\nwalin\nindiv\nfrizza\ndinicola\nlolesi\nlindeborg\nmuntok\nkielsen\nburwitz\nvisger\nchilthorne\nqmm\nmohammadzadeh\nmolumphy\nesgob\nquet\nseper\nsdrt\ntraynham\nbromantic\ntiarra\nbilham\niliza\nlevain\nkuronen\nmonoprint\naperitifs\nmasillae\nschw\nfombonne\nffor\neduviges\naflibercept\nsequinned\nfrostee\nbactrim\nlifeco\nawarness\ntetzchner\naminur\nflra\nrtps\nadle\nmajore\nsjambok\njpay\nkerson\nenteritidis\nalbourne\nauermann\nonibury\nbeechdean\nfcmb\nmasriadi\nkringe\nolinka\ncarolann\ndarvishi\nitadori\ndismays\ntimmel\ndarabos\nstukes\nkotera\nleckonby\nlashinda\ntonking\ninternap\nmaxeke\nicbt\nsensodyne\nbookmooch\nvaterstetten\ndejardin\njacksdale\nlenscrafters\nlilibet\nsagansky\nappliqued\nhallford\ncatalonians\nf
oldaway\nxenical\npitcavage\ninversnaid\npezo\nsurpisingly\nraty\ntravelators\nnovocain\nkyosai\nophiucus\narnezeder\nkment\ncaribia\nbugrov\nfishkind\ntinoisamoa\nmarpi\nkollsnes\nmoldavite\nsalzhauer\nlinkline\ninboden\nmautby\nkopelow\nifereimi\nsimmerling\nkhano\nmanats\nbradken\nrecommitment\ndiaria\ntranscriptionists\ntajbeg\nsuntower\ntyrannized\nsamothraki\nbranam\nsoakers\ncomestible\nbelizaire\nschrimm\nsequella\nfaingaa\ndiverters\ncrackhouse\nvascar\ngoghs\ntonchi\nsuperdeluxe\nblondchen\nnichirei\njamell\nfessing\nhasnawi\nsalvaggio\nproctoring\nbiddlesden\nsteidel\nsnowscape\nvillingili\nfisnik\nxtract\nkröpelin\nhillings\nuniversit\ncyclen\nenergywatch\nscorekeepers\nlarison\nhautacam\nbiocode\ndorneywood\nimpressionistically\nmelgoza\nwhittome\ntottered\narambol\nmatchwinner\naccet\nabinader\ntesman\nattainability\ntonkotsu\nmumblings\nkoulamallah\ncrocosaurus\neffeminately\nremics\nbouly\nmitsuyasu\ndunnocks\nbochi\nmcsmith\nwildings\nheymer\nbrackin\nisaaa\npennan\nintuits\nshangjin\nsleazoid\ncounterterror\nsklerov\nsolnik\nmuccio\nticer\nsmoothers\nvibrac\nbarbolini\nbizrate\nunblended\ndervin\nptsi\nmacnamee\ninclud\npatissia\narnish\nnorweigan\nelloughton\nvitalijus\ngumpertz\nasayama\nerres\nplatell\nwesport\nopenleaks\nselimov\nducatis\nbelesis\nkinuthia\ngaleote\nneeka\nkennametal\ninfida\nkickingstallionsims\npalmiero\nvagabonding\nlemere\ntynged\ntrelawnyd\nravizza\nmassification\ntiesi\ngaurd\njeffrie\nhastoe\nloë\nanchia\nkourlas\ncutuli\nunzueta\nckin\noverbudget\npaglesham\ncampsa\ncakey\nwherley\nvallina\nairporter\nivereigh\nbokun\nkagasoff\nhpq\nlpns\nnationa\ndingemans\ndepletions\nprearrangement\neissele\nstrasshof\nlerose\nremaps\nbeautyrest\ncorado\nsadykov\nchaowarat\nlokuge\nopelt\nerkes\ntagesanzeiger\nmonley\nzietlow\nkeynan\ncockshut\nalcosense\nhoovering\nberagh\ngarotte\nnaivity\nliveplanet\nhalfaker\ncalytrix\nsymondson\nmanagerless\ndelegitimized\nmuqtedar\npropafenone\nadeyanju\nmascalzone\nhercher\nadulated\nrebhan\nb
reslaw\nspooge\ncentina\naffadavit\nraasch\ncerrie\nfulliautomatix\nheitzmann\ndistractedly\nkirroughtree\nseasonals\ndopest\nhollibaugh\nlarque\nthiranagama\npanaroma\nplantcutter\nmonnig\ngilady\nabsoultely\nblaggers\nlogsch\ntassimo\nborchester\nfingerpaint\naripeka\nsaani\nnupa\nfarrag\nwebmonkey\nwaidelich\nbluefly\ndiscoverd\nsonmez\nupsizing\nmuuse\ncummertrees\nkhalef\njomhuri\nlaique\nciputat\npunctiliously\njimale\nmansoa\nmanswers\ntafero\nlunchbreak\nbuluk\nrenvyle\nfessel\naponavicius\ncherrybomb\nascheim\nstepheson\nbonite\nrodenbeck\nchivian\nsinco\naziziyah\nquaids\nreiterman\nvlastnik\ndigitalize\nshowoffs\nstickpin\nquere\nyamamotoyama\nlepley\nunderskirts\nwinterwood\nclunbury\nlucman\nrassau\nbrandmark\nculmstock\ngidron\ngeotags\nmarungu\nfanaro\nyoukhanna\nmasik\nbtps\nhaydel\nkirrane\naccordin\nremonstration\nkabuye\nnavfor\nkallick\narnous\nprytania\nhuahua\ntapster\nwja\nsabeans\ncoporate\nbadingham\nstaenberg\ntath\ndefectiveness\nfencehouses\nletup\nbresaola\nsolyom\nalprostadil\nsusanthika\npfotenhauer\nlebenthal\nsummerscale\ncatfield\nhuiyan\nmalingerers\nborderlescott\nfilos\nnevelsk\ngarvanza\nsaanei\nlippin\ncear\nkliniken\ngoggled\ncapering\ncompugen\nkarush\nnghymru\ngfsi\nyursky\nwhitewashers\nzogaj\ncorndon\nsilipigni\ntenderizing\nborderlining\npescocostanzo\ncobá\nlapatin\nbenedik\nxensource\nhijms\ngralton\nbütikofer\negleton\nsejer\nattebery\nmeadway\nsarantakos\ninterbirth\nleifert\nhomeside\naundre\nnaiz\nyanyong\nwaldenstrom\nvlieg\ntantalized\nujlaki\nbakiev\npiovano\nlolis\nwaterbender\ndangly\ndeconfliction\ntushy\ningelsby\ngoojje\nchulkov\nfoully\npassementerie\nunspool\nitihad\nsuadi\nwielinga\nobuchowski\nhqa\nbrassneck\nsaddlebrook\nsamiria\nhoschton\ncarcroft\nhamams\nleisle\ntiririca\nbenetta\nnovey\njunliang\nbahrudin\nsweetspot\npossble\nmavisbank\npriapic\ngoberman\nmayberg\nsiegfrieds\nrollerblader\nhodgett\nmaryetta\nfarar\ndimissed\nllanegryn\ngerani\ndavidovic\nsuperceding\nhollywoods\nokement\nrechristen\
nkronman\nvarteg\ncarciofi\nfontenette\nstemler\nmathathi\nterrestial\nbeanworld\nencombe\nsambals\nwescam\njenae\nthurgoland\ntermism\nharlemites\nkicky\nlinkner\nghostbusting\nwiniata\ncoldsmith\nbokeria\nstais\nupdraught\nkhisa\nplanetspace\ntagine\nteria\nwoodpulp\nghaziuddin\ncaep\nsterilgarda\npofalla\nshmulevich\nrheumatological\nfemap\nfinagling\nspanjers\nbaharistan\ngoore\ncarlae\nkadenbach\nôl\nregim\nbrandenburgs\nvisionquest\npressingly\ncomitting\nsomebodys\nhemani\nastrue\ncialente\niguassu\nbogalay\nslonaker\nshachaf\nnigrini\ncroûte\nrindy\nagiorgitiko\nsalue\npiercingly\nfoxtails\ntilers\nexatly\naafjes\npenningtons\necotech\npullouts\nprophete\ngatornationals\nmolody\ntambussi\nsheriffhales\nstintino\naberarth\nvinnedge\nballylumford\nanology\nacham\nmilici\nblitzers\nkartagener\nlcpd\nreachlocal\nviburnums\nlimerock\nmyfyr\ngxi\ntrussing\nprepara\nacomplish\nfilegate\nwischer\natripla\nciccotti\nworldlink\ntielke\n\nmiscomprehension\nqtes\nresidenza\nchokepoint\nabatacept\nrofo\nirbesartan\nsupersizers\npitchout\nbanovic\ntemarii\ncheste\nmbulaeni\nbrossel\ncilostazol\nrackety\nevangelizers\nonvif\nfuzzballs\nmafokate\nlirey\nhamc\ncrct\nswor\ntranquillizer\nvidiya\nrazlan\ntribalist\ntransrectal\nrudrani\nclearedge\ntmobile\nroindefo\nbenchimol\nharatine\npegylation\ntravelsmart\nsostis\nstartech\ncordsen\nwillingess\nrosenbush\nlinquist\nseculin\netravirine\nrevera\nneoedge\ngeltzer\nnewfoundlands\ncouriered\nnewyddion\nbiuku\ninegalitarian\nphilomont\nvegter\nstunnel\notah\neinstruction\nlipscani\nerry\njaquiss\nbahaji\noybek\nmurtabak\nbeaucastel\nbesluit\nbiospecimen\nfrico\nkashiwada\ngeorgiann\npdos\ncoronate\nlaxart\nbeloveds\npreng\njosphat\neuroparty\nomurbek\nhabeus\nbuddwing\nvistors\nsimilipal\ngiourkas\nbacchetti\nlarrikins\nbelluck\nkowalke\nborgesian\nsysoyev\ntulipan\nozcar\nustian\ntrindle\nfrazzle\ntravilah\nsufaat\ntameleo\nnyseg\nsterr\nbrimpsfield\nmooradian\npurke\nsuraphol\ntradin\nbrookmont\nchurchwomen\npilau\naktc\nnea
g\nriseth\ngaurantee\nhillbillys\ndeforesting\ngryon\nlutens\nlauerman\nconfabulated\ngreates\nsealskins\nveckatimest\ndemobilising\nyopu\nbarrelhead\nknocke\nrussianoff\nprebon\npigskins\ngelwix\nbunglers\nzongheng\ngellers\nharakah\nzamar\nblintzes\nrajeeb\nsuleymaniye\nwhitmey\nnikulina\nsaharia\nhomefinder\ndanqing\ndigimation\noilskins\nvongole\npuggles\nnarayen\nchapas\nudre\nhandiest\nbyambasuren\nhanefesh\nvsetin\nscaasi\nsherbon\nhelfgot\nkushina\nreiersen\nnonperformance\nkamanzi\nteavana\nbackbend\ninhibitive\nmineralizing\nmerkaba\nwfpa\noutcalt\npiromya\nhuffed\nbjörnberg\nkfl\nintralocus\nmarksberry\nbatliner\nbrostek\ntoolbelt\ndeliquescence\npbsi\nlaiv\nmorens\nonken\nneutralinos\ndandara\npermissively\nboeckling\njigang\nfloridly\nmobey\nmiresmaeili\nguarnizo\ndaraina\nanythin\nduyne\npicamoles\nichan\ntahirkheli\nsoundtracking\nmycar\ngamefowl\ncosmetician\ndorchen\nexcluders\nsieden\naaaargh\ntavoris\ncartright\ntsepo\ndunta\nappreciators\nargentinan\nscpr\ncontinuos\nkanew\nsooooooo\ncsrp\ngnango\nskalli\nsaleisha\nmethoxycinnamate\ndesogestrel\nmisan\nofz\nscapinello\nconnectomics\nbokko\npopfly\nlillywhites\nabbaspour\nrainwear\nsensata\nernies\nibnr\ndatafolha\ncainscross\ntelent\ndinstein\nwors\nchyzhov\nerrie\nvaunt\nikililou\nmaiello\ngipn\nkerschbaum\nbuskas\ncrostata\nstorton\npromiment\nkropa\nmatheney\ntakuzo\nbeind\ncentralians\nozolinsh\nrawreth\nseehttp\nweissenburger\nbalajti\nguosen\nsarafanov\ndickler\ntjian\nvideoconferences\nsartz\nreevaluations\nphilagrius\nstatewatch\nbadki\nmilrinone\npenilee\nindicitive\nrepostings\ntotonno\nregnante\nzigs\nnuch\nyodelers\nredtape\nqueerest\nmindbogglingly\nlurpak\nmultidimensionality\ncalstar\nkagermann\ndcac\ndimassa\nfomepizole\nmonkou\nfacciola\ncroupiers\nreato\nlaurelvale\nbizunesh\npiigs\npramlintide\nslagged\nlyddington\nhanfling\nmuziic\nreycraft\nmaintance\nintelcenter\ntorrisholme\neslington\nsmeraldi\nexpeditors\nshirli\ngroshev\navoiders\nprgf\ntrefry\npegna\npassholders\nyamli\
nortmeier\nhassouni\njaylon\nupshifts\ngrandtully\nbicalho\nlarami\ncordemans\nbiescas\nlitoranea\nxoco\nrry\nsliter\ncassens\nsulkowski\nkivanc\nllanasa\nskarv\nspartathlon\nchyba\nleftow\ncromolyn\nnshr\ndetloff\nfollieri\ntechology\nklatzkin\nsuuronen\nthavasa\nbrontothere\nbakhyt\nhassmann\nstodmarsh\nnirajan\nkliegel\nmontbrial\nlafeyette\nzador\nraiano\nreddens\nleagal\nelcott\nebbrell\nwaties\ngrellier\nragip\nzumbrun\ncallahans\ngasbuddy\nrosalino\nfungiform\nforgetten\naleve\ngrimi\nelvitegravir\ndaimi\nelektrownia\ncensis\nidenties\ntransitionally\ngypped\nwowio\nserialising\nzofran\nurbon\ncakir\nsambrano\nkilladeas\ndruz\ngorniak\nlinhardt\ndisparately\nestha\nunitaries\nsubo\nkipple\nalterraun\nannington\nbilga\npintas\nbamfurlong\nmestrallet\nhalons\nfleurant\nllannon\ndastgheib\narrasmith\nsynchronica\nbluths\nbabafemi\nacccording\ntransfiguring\nnahra\ncillit\ndunnikier\nfamil\nbirkhall\nssrb\nbelchalwell\nfictionalises\nbernieres\naerobiology\nknotz\ndarwinopterus\nvaccarino\nashg\nboxboard\nyermakova\npazopanib\nunthoughtful\ntitlists\nmeatiest\njaywalkers\nthromb\ndomenichelli\nbenamor\nbeamen\ncamalote\nkapstone\nschilly\nkidogo\nmccolo\nugborough\nfebian\naxc\nthinkorswim\npipedown\nseamill\nactionability\nunsheathe\nseredin\nmutallab\nbonnieux\nohmann\nbaikonour\norentreich\nrazaksat\nredenbach\ntayor\nblueridge\nriffled\nkamhawi\nwisconsinite\nkilleeshil\nschonberger\nschlup\nprueher\nchartridge\nskybet\nwour\nansd\npennwell\ndynaformer\nlevitte\nristau\nbeakes\naltonaga\nfidessa\nukravto\nsunside\npickart\ngelsthorpe\nbigfoots\nwelting\nkmetz\nazkargorta\nskhirat\nconvice\nnixle\nmsde\nhessayon\nreavley\nredhall\njaqui\nhayyat\nyuhe\nezee\nrigley\nhamadoun\nnonpermanent\nanaysis\ngrax\ndespondence\nzattere\nshirayama\nbonvilston\nsolemnise\nickle\nljova\nunplanted\ncaterhams\nsertic\nconceeded\ntraductor\ntreffert\nmozena\npoortgebouw\nnguon\nsmolenyak\nababu\nbrittanie\nmuenke\ndjambala\nremans\nhayzlett\nneveah\nmazière\norals\nlautenbacher
\nkirkos\nglendurgan\nvaers\npocketknives\ncuccaro\ntalsarnau\nbrummet\nagonises\ncategoric\npescatarian\njianrong\nwhti\npapillote\npsaier\npendoylan\nakerlind\nredinger\nshovelful\ndissapointment\ncpss\nlepowsky\nboothwyn\nbidez\nbessent\njaquez\nlhvs\nhanting\ncertitudes\ncbis\nboxful\nlevai\ntarty\nredda\nlindenlaub\nahwar\nnashar\nmargining\nmtis\nstotland\nsanglap\nbrightview\ndiepsloot\nbeirich\nunwatered\nnostoi\njerramiah\nmindscapes\ntallinder\nfredersen\nambry\npapuna\nglimmerings\nbitee\nbioresorbable\njockstraps\nwonderdog\nspritely\nnalaka\nlapad\norent\noverstrained\nflagbearers\nnacdl\nkeiwan\nbrookfields\nstandardises\ncomany\nmiesian\nhendelman\ncashley\nkingstowne\nthaek\nwisty\ncoquis\ntunesmiths\nwirjawan\nkulyash\ncorcrain\nawaleh\nisaq\nbenfro\nboser\nnasopharyngitis\nfallaize\nrecepients\noutdistance\newenki\nmarescaux\nleatherslade\ngeisen\nmackel\nclingerman\nnongbri\ndormandy\namayo\ngriebe\nvinayagamoorthy\nzahorsky\ngrapelli\nsereysothea\njamesons\nstorgaard\nsqf\nvonteego\nreistad\nhuti\ntroublingly\ncoday\nphcn\naxj\nglenuig\nbugling\nfootstepsinthesand\nchiesi\nlaminator\npiracha\nsovani\npreconstruction\nconfimed\nponomareva\nkoppinen\ncyberstalker\nrefulgent\nsoberon\nintuiting\nrepetitiously\nkatulis\nzubaydi\ngronstal\nniangara\n\namcon\nsimod\nkwhs\noutmanoeuvring\nerosi\ncharacterological\nkhoram\nillinoisans\nplasmati\nhezbullah\ngoldberry\nvogiatzis\nfakhreddin\nminguzzi\nroquel\nmontagano\nlaast\npelman\ncompeer\nmenchell\nelectrophysiologists\ntamdan\ngutow\ngipslis\ndusseau\nheighted\nzalaquett\nrollmops\nvalat\nringstone\nasure\nhowfield\nassayers\nknothe\nllx\nradvanovsky\nuncork\nripol\nnannyism\nilkla\nroustam\npreachiness\ndebator\ncnit\ndelegitimise\ncopec\ngelée\ntechnium\nhardgate\nelgindy\nmeriño\nschertzer\npolimeks\nogola\nsimiliarities\ntourigny\nkolender\ninterparty\nneiger\nmoehringer\nuppy\nflameproof\nkekule\nelbphilharmonie\nwappapello\nenteroscopy\nerofeyev\nheijmans\ntxm\nbehooved\nubiles\nhillaryland\nw
andera\nkhoshnevis\nduvendeck\nscroggy\nnuzman\ndiamand\nkaing\nsaliently\nharome\nrethoric\nextradites\nambreen\ngaint\ngurda\nkillerspin\nforedoomed\ncaffiene\napenheul\ngiscours\npasilla\nmannamead\ntagruato\nnettesheim\nbroemel\nmultitier\ngusky\ncallbox\npionirska\nsemones\nbrozak\nllangwyllog\nternus\nearnhart\ncoldren\nsowder\nmidscale\nboslough\nstagni\nkonheim\nndeye\nkultar\nlynsted\ntittles\nmoistens\nhermesh\npikiran\npulkingham\ndisorganize\nduska\nvvips\nanticolonialism\ngoldbaum\ngrabsch\njavis\nkleptocratic\ncaline\nundiscerning\nfourhorn\ngarah\noik\nhabenular\nordman\nmcfd\nsprucefield\nverbless\nfidanzati\nedmodo\nesophagectomy\nforeperson\nginsters\nyousufi\nwaghela\nlubman\nmaslan\nfasulo\naquafresh\nbiotechnics\nbryning\nkamkwamba\nunsilent\nsidorsky\nconnarty\ngrial\nsaben\nhuval\ntoori\nmanop\npontardulais\nmatsen\nrawlence\nrewires\nfelicidades\nchromes\nblackfriar\ndoubleshot\nshippable\ngreenshoe\nethnocentricity\nmurstein\namharas\nzucchinis\nballadeering\nmyburg\nbrinscall\npopaditch\nembitterment\njkk\nmetelkova\nwhinnying\nicfc\nynysangharad\neurohypo\nsalmasi\nbisbe\nwijkman\nwhippersnappers\nchemoembolization\ndahling\niraqs\npriede\ndhanuk\nsqualling\nbarandiaran\nahmm\nglacés\nbabiker\nchallanges\nmicrofibers\nballweg\nfritzner\nrentiesville\nluterbach\nbibeault\nguilted\ntognum\ntanyon\nhangartner\njahrhunderthalle\nllanerchaeron\nmaggiano\nbenhaddou\nschilbe\nadvair\ngetahun\nbehsud\necolabels\neasybus\nhaaz\nmuellers\nstorgata\nbulik\nnimma\nmillies\ndisrobes\npinhorn\nassylum\newingcole\ntelsa\ndemirjian\nglyco\nkornblut\nlanzone\nflatliner\nheede\nskean\nkuza\nnunoo\nidcs\nhaitch\nvoletta\nbellydancing\nabdalqadir\nlongueurs\ntharin\nenviromission\nsalsi\nopsoclonus\ncheapie\nprecisly\nkgal\nsiladitya\ngarachico\nbartho\nbasd\ncorhampton\ntranching\ngulped\ngleeks\ncandiates\nenocean\nenford\nanomolies\nradvision\npolay\nbrenntag\nfoinaven\ngrinches\nstylecaster\nallostatic\nzadrozny\ngoeres\nzumoff\npresutti\nwuas\ndorsen\npa
rvanova\ngohel\nnofit\nhealthcorps\nresponsbility\nestwick\nzafrullah\ncutchin\nmatjiesfontein\nraffaelo\nmontalbo\npowermeter\nigrt\nbeckerle\npretorian\nitemization\ncapriole\nemetophobia\npcns\netappe\nwaldmire\nnlmk\ntranquilly\ncumbayá\nfistral\ndatatel\ngelastic\nflipflops\nsafing\nbusks\nchidanand\nhydrocolloid\nschmeltz\nkirigami\nkhangiran\nraccuglia\nzalmona\nberrouet\nhaysman\nsubcabinet\nerspamer\ngoneva\nbumpurs\ninflators\nbuduburam\nkjlh\npocd\nuninventive\ndekar\nitraxx\nfecan\nestefano\ndahabi\nilois\ndepravation\naishwariya\nboxmasters\nsutiyoso\nkowall\nntri\nbacow\nacdelco\nvolpicelli\ndousland\ngroundsheet\nelectrocoagulation\nyoeli\nsongful\nganther\nnozette\nbodgers\ntolstoyans\nmchaney\nnout\nrikhye\ngibilisco\nnaujocks\njulavits\nroskilly\nllandrinio\npiccari\nfaehrmann\nambles\nbreus\noughts\nzolmitriptan\nsongok\nwindborne\npennycross\nazpiazu\nheritor\ndidulica\nrasmussens\nnuttier\nnursemaids\nmindbending\nwcrx\nreorients\ncomradery\nphsa\nhintermann\nfolksbiene\nsuncatcher\nmaiffret\ntoymaking\ndominionists\nnamoc\nantinational\nchatigny\niafis\nmushatt\nwasay\nmerediz\nmicrobudget\nmethandienone\nmarkmonitor\nbiris\nunsterile\nhotheadedness\nteegan\nhassanin\nnced\nharab\nchroust\nclanrye\nsantouri\nawfull\nroofies\nsebti\nchumpol\ndoffed\nwett\nhidta\nbellmead\nmattil\nruly\nubit\nfensome\nkazkommertsbank\nbelohlavek\nshikano\npmpa\ntaxidermic\ndavening\nrotger\nhedblom\nburakoff\nquizlet\nteemore\ndigitek\ngrowdon\nvaquitas\nlulus\ntianji\nconsitutional\ndauti\nportably\nlaspina\nfreudenheim\npmsi\namron\nentringer\nhureh\nroscam\nqlr\nzephania\nseatwave\ncedarbaum\nbarnyards\ncolliseum\nbrucke\ndacic\nmcmunn\nampi\npatyk\nacfm\nexper\nguanghe\npsychographics\nrudha\nzanny\nantechambers\nimoke\nkaytee\nchristanval\noccar\nhickenbottom\nwhimsicality\nbezzola\njalbani\nhamriya\nsorona\nsuperdad\nlinsell\ngonks\npoignance\njikei\niavarone\nnorthernhay\nbukvich\ntideford\nchipmaker\nhobbycraft\nguitierrez\nsaborna\npregones\nawea\nnourba
khsh\nlowedges\nheteros\nbonked\ngruder\nmerrf\ngraymail\nantiqued\nfrigon\nnorsar\ngitonga\nchernetsov\nwimar\nscaraffia\nsponged\nillgner\nbeeped\nassinine\nsnowkiting\npredoiu\nunremarkably\narmintie\nvalbusa\nbonow\nspherion\nmoshers\nasuni\ntidcombe\nkonnect\nopenmarket\nkamiura\ntritan\nweaseled\nseaco\naeolos\nmvne\nbinghams\njaider\ntomosynthesis\nbibit\nenec\npremixes\ngaraway\nimperitive\nunreformable\ndinlle\nrathie\nabsconders\ndeverdics\nfalder\nchinary\nschaeder\nmaslach\nbezbaruah\nsaddlebacks\npapalexis\netinger\ntygiel\nscoffield\ntescos\nliepajas\nvibskov\ndogsleds\npingwu\ncapasa\nkawasoe\nmaystadt\nteamworks\nvsq\nhypercolor\nhandpumps\nferrieri\ntepfer\nadultfriendfinder\natá\nvinegared\nstonings\nmcinturff\nbachenheimer\nmyohyang\nicings\ncallouses\nzonin\nsedimentologists\npalls\nfolow\nunderreport\nfuqi\nrudry\nnordbanken\naistrup\nlauras\nmaggiotto\nnemser\nunhusked\nshannahan\nalliancebernstein\nmakishi\nbozhko\nceely\nhajim\nwasy\nwherstead\nmalmen\necojet\neachus\nrebooked\nlazareanu\nfourme\nllansannor\ncuates\nmaleczech\ndonggu\nlemine\nziniu\npithier\ntransacts\nderse\ntsis\nhenyard\ncentralises\nmcbarron\nnexxus\nvinney\nprosthodontist\ncalcpa\nsilvero\nintellicorp\nsandfort\nburzichelli\nkangemi\nthoenes\nmeraviglia\nintertank\nshortcakes\nacasiete\nbertling\nintermap\narabised\nareh\ncreamsicle\nhydroptère\npeevey\nbazedoxifene\noverreactive\nthandiswa\ntiggs\npadanaram\nscibelli\nkoralm\nwazee\nhomegirl\nvisalli\nbackrub\ncornellier\ntirri\nbraquet\nsublicensing\ninvermoriston\nroomet\nulnes\nkrižnar\nlunchrooms\naomame\nsollitt\ntriangulates\ncubbyhole\nloomstate\nedgcote\nwoodlief\nduram\nmarcopoulos\ndruggies\njongo\nlegals\nseera\nvantagescore\nmixings\ntemata\nsteinemann\npapay\ndeanza\ngoofier\nbodinnick\nmicroplane\naiguo\nmavity\nfreewave\nonexone\nshellfishing\nalaea\nvakalis\ntontoh\nremodeler\nwaynick\ncelebreality\nthiostrepton\nbaghead\nawcc\nispu\nseditionist\nbklyn\nsanjida\nmeddygol\noversteering\neuas\njhar\nunders
tructure\nmatero\nmahlerian\ncnnc\nentombs\ndodgen\nbradda\npanafieu\ngermanakos\nbassinger\nmetabank\nmatsigenka\nchansi\nhsci\ngnutti\nlorenda\nshoddiness\ngrievers\npapenfuse\nduboeuf\ngrooth\nfukoku\nettelaat\ntudful\nentune\nivors\nholroyde\nferrucio\ncastorina\nyamu\ncountermove\nrosza\nmayrhauser\nblanchimont\nroussimoff\nbelben\nspareness\ncharone\nhsba\npimecrolimus\nsawade\nsportime\nperverseness\ntopamax\nprofitting\naitches\nroveto\nschmader\nreidenbach\ndaises\nunpadded\nmatsanga\nhassink\nnatarov\nedss\ncheeburger\ntrossi\ngoerens\nezulwini\niloca\ndehaas\nengraft\nbratten\nbashira\nnazy\nmillirem\nbergbahnen\nfisd\nnadey\njausiers\nsmooching\nauteurist\nrokke\nomozusi\nsilan\nlockboxes\nschlichtmann\nhkse\nperbix\nagion\nrobh\nrandolfo\nbahiya\ndmj\nkleeneze\nskiller\niodice\ncarreto\nguiderius\ntoomy\nhekuran\ndiwanji\ngemerden\nreenan\nvibrometer\nbrigader\nnacods\noverenthusiasm\npegmatitic\nlotusphere\nmonikered\ndoubek\nspagetti\nschalm\nquirine\ndeseine\nbroughtons\nalgers\nbotcher\ndimeji\neldrup\nmidriffs\nloddo\nmaclarens\nskirlaw\nindivual\nvaughns\nohlmann\nladkin\nmarxhausen\npizzelle\nranchette\nnazemi\nswepstone\namost\nsmyer\njadidah\ngarfit\nmutsaers\niwps\nwertmuller\nshurov\nmmas\nabilityone\nweathergirl\ndebin\nsafair\nwesterhope\nkhareh\naeromobile\ncaccavale\nprimakoff\ndaniller\nintentionaly\npowdr\nasberg\nsealord\nzootfly\nhornett\ngelong\nnatrajan\nduquemin\nciroc\ntischenko\nmedievally\nseaberg\nmoralise\ndolmas\nepiphanie\nkrislov\nrocsi\nsupernodes\nranella\nlodgements\nmcgettrick\nedmondthorpe\npaute\nfearsomely\njoggle\nsuffusing\ncéad\npritchards\nkaleidescope\nlumpiness\nvorsprung\npalmitas\ndisrepect\ngeorgoudas\ncrawshawbooth\nsyncsort\ngeminids\nroncalio\ntalluri\nuou\nslavutich\nvanniyars\nholmans\nkhorsand\neverybodies\nmykey\noptionsxpress\npoltair\ntiecon\ncwynar\nluminously\ndécolleté\nkeepon\ngahs\nfornasier\nkuzmanovic\nllanaelhaearn\noxhill\nbarhoumi\neffiency\nmarisabel\nkozlow\nletdowns\nmilloud\nhabiger\nm
inerbi\nhuarango\nofoto\ncompan\nmixable\nbonyads\nreprehensibly\nrehim\noozy\nunitymedia\nmorsell\ndjembes\nisokoski\nsoile\nsawina\ncimade\nneverov\njeran\ndobias\nfandy\npseudofolliculitis\nalosi\ndaudy\nadjustor\nhmq\nmounia\nconducing\nsantostefano\nschollin\ncanak\nmainka\nritha\noberhuber\ntontines\nstenoses\ngleacher\nintergroups\nschelske\nradoi\ntwenge\nwambold\nhirami\nrubianes\nwilstead\nzaiyu\nhatchetman\npelotonia\nsotiros\neverthorpe\nroosenbrand\ngeden\nmcclover\nbuddist\nbehnen\nowolabi\ncrichtons\ntempelsman\nseamore\ndeherrera\ngarnick\nmeida\nmicroserfs\ncontarino\nmarinatto\nsanajeh\nkibayashi\nclewell\nlonsbrough\nzatz\ntelephonics\nwehde\ncappuccinos\nmandato\nrosenschein\nhtz\nduvergel\nfistric\ntarmacadam\nndq\nkawaja\nrcis\nbindeshwar\nidenticals\nhesledon\npresleys\nhibah\npotterhanworth\nwehrenberg\nbullgill\nchrispin\nblanketly\noarnet\nhummell\njmac\nbalhousie\ncarhenge\ncoghlin\ngjonbalaj\ncoky\nfarve\nzippos\nschatt\ntriteness\nhammerle\nunploughed\nlobelias\nauthonomy\nhuping\nsimitian\nbradhurst\nteiwes\ngunesekera\nalerces\nszamotulski\nmarantha\npiadina\ntonique\ninciter\nmurmuration\napem\negbuna\nfrassoni\nstonebow\ncetv\nmalry\nlevys\nmotio\ncoalfish\nbelaynesh\nlewycka\nlekstrom\nloiacono\ncleviprex\nkobil\nduplitzer\nboyuan\ncybermedia\nhandgrenade\nchephren\nhumbrol\nexternalisation\ntanavoli\nmoisturiser\nelekta\nnorteno\nsuhrstedt\ncazalot\ncogliati\nordesky\nintercasino\nreequip\nqalandia\nzabihullah\ncopasetic\nipred\nmastuj\nokosun\neductaion\nsanchaung\nifly\nlimeys\nmasticate\ntstt\ngdas\nchenilles\nwilmet\nderange\nchalit\nmonninger\neaman\nmalil\nbrentina\nremortgage\nalhough\ntocolytic\nyewande\nmiraa\nbinged\nkhayami\namtech\nantalis\nlavenders\nzweli\nselker\ntbrb\nplops\nwheller\nannalyn\ndesarrollos\nfréquelin\nreinvestigating\nbedwetters\nderivatively\nmestdagh\neversleigh\npetcoke\nprinicpal\nponsor\nokulitch\nnimwegen\nconmy\nyaish\nsaleroom\nindoctrinators\nantihelium\ntouchiness\nbenatti\nnelsonian\ndisapp
lication\nmagret\nconcerend\njihadia\ntaspo\nedmundbyers\nmingliang\nburkin\nbankend\nbishri\ntemplarios\nbleah\neptifibatide\nwatercrafts\nmelanio\nkriton\nderogating\notellini\nkimchaek\nsubfertility\nsitutations\narbes\nkonzerthausorchester\nolusoji\nfittv\nvcenter\npaywave\nmruczkowski\naselefech\nslowmotion\nkimzey\nprolexic\nunimpeachably\nmasondo\nciaramella\nhexcel\nkouwe\nnanina\nshresth\ncalfire\nkashina\nosyp\nzhanshu\nreinfect\nbbeb\ngramling\nsupercycle\ndandala\nkushlick\nupadhaya\nmassoglia\njosephoartigasia\njermal\ngalorath\njingxin\nboobytraps\nshameka\nnisreen\nvertic\nvados\narfken\nhuchthausen\nludie\nekathimerini\nsliproads\nformicola\nguangzhong\nkhrzhanovsky\nwiedrich\napsos\nsosthene\nwoodsetts\nrajasingham\nbareham\nozian\nlaserman\nvivifying\ndelker\nmedrad\nproselytised\ntimewasters\nwhieldon\nmovahedi\nforechecking\nfrangelico\nclubcall\nbenemann\nbanderilleros\nmuscial\nulph\nsoilless\nciot\naertex\npuliti\nyelo\ndengate\nenglishby\ndessicated\ngubu\nchaturon\ndaep\nsaparmyrat\namflora\ndăianu\nhaleva\nraillery\nmaibach\nkayleen\nriscal\nkerfuffles\nlithang\nwolkstein\ncairness\ncsfi\nqiuping\npeschl\nrayport\nterranea\nunembarrassed\nmurshida\nnessuna\npanchev\nyorman\naspiotis\ncondeming\ngabbitas\nsharga\njudin\nustekinumab\nhtar\nncbe\nchepkwony\namscot\neilam\nslashings\nbellandi\nsurdna\nbrisard\noestradiol\nkalyoncu\ncoyner\ndevilled\nintevac\nclesio\nmahtook\nhumer\nukyp\npitarch\nmartensson\nzoomlion\nhousers\namonst\nshamsullah\ndejon\nnikpai\nmerchanting\nscarsella\npersonna\nkhandahar\nswibel\ngosat\nhounsdown\nsommerset\nstymieing\nshennawi\nregreted\nhayon\nmoldering\neicc\nmchales\nagrichemicals\nphotosensitizing\nxianrong\namortised\nrupf\nwesti\nspinneys\nmojacar\npatronisingly\nbakio\noppposed\npatchan\ntunisa\npayman\ngyawali\nyext\nmitutoyo\ncomplaisance\nblagger\npraj\ncreige\nheartstring\nprazdroj\ncrozes\nplenteous\nexpatiate\nmitchler\nrauda\npowerlite\nasurion\ndelaminated\nfocuse\nmuscats\niolana\ntalaban\nthir
sted\nkemple\nbsic\ninishturk\nbierd\ngradebook\nmonetised\nfrenchies\narenstein\nmachiques\nvardas\natteberry\narmento\nwalkathons\nstromgren\nunprogressive\nfishlike\nbottlenosed\npongpat\nsarar\nuniversty\nleukaemias\ndollase\ndervock\nporkies\ncrociani\nnurith\nalthof\nplighted\nraunchiest\nshimange\nkrepinevich\nguyette\ntsumori\nwosa\njelko\ntecnicas\nslaczka\ntussing\nnaïvete\nwhiling\ntranzcoastal\nstringless\nmegg\ndzg\nsmex\nbuguma\nmckimmie\nmurerwa\norated\nsabban\nsieze\nsubconcious\nmemorialises\nbolshy\nanimalis\ndomene\nprofessionalising\nasfb\nwelzenbach\nshwan\ndjerejian\nkillyman\ncurtailments\ngreeleyville\nbazzana\nbenderloch\nlislea\ndimitriy\nofferd\nvarki\nsandherr\nbalasingam\nnetters\nunbiasedly\nmckinleys\nlohier\nsadrists\nfrazen\niotv\nessr\njof\namens\nspasticus\ntorbor\ndismountable\nthorniest\nhwn\nelmosnino\njonabell\neclips\nrousson\nmcquire\naffrica\nratuvou\nmoorgreen\nroniel\npaque\nalaves\nlawtell\nsparsest\nserracchiani\nanichini\ncountercharged\nprooijen\nayverdi\nporwal\ngaisman\nmadcow\nbmed\nugley\nlifi\nchearavanont\nonmobile\nbechtol\ncolb\nrobichon\nsubcommitee\ngolfed\nroominess\nhydrochlorofluorocarbons\nsilkscreening\nsiedow\nsonders\nagrifoods\ncebada\nzygos\nnadene\npoplack\nunscratched\nclusterf\nnadj\nrobow\nbrunos\nhygeine\ngenou\nnamesti\naerosteon\nracanelli\naxigen\nkokk\nmuglad\nsciandri\nsinbo\nvilaya\npiquante\nschuchard\nportraitures\ndivvying\ndrillstring\nhysell\nrazai\nfinamore\nbolatti\nyaqi\nsimpatia\npastéis\nharbeck\nhelfet\nlernt\nalkham\nwaddoups\ntimberwood\noviir\ncomiston\nviennoiserie\ncromack\nsuperfetch\nbrodjonegoro\nfothen\ninsatiability\ntomcar\nsenitt\ncybersquatter\nmcbrine\nfesman\nsandbakken\nlivewires\nranolazine\nmargon\nbikeability\nstevenses\nnpqh\ntrebon\ngroclin\noceane\notnes\ngeaux\nboparan\ntabraham\nderryveagh\nskrovan\nlundwood\nlifchitz\nvhtr\nharpie\nsproxil\nmeleshko\ntoshin\nchartz\nndmp\ninconvertible\niressa\naulestia\nvalte\ncogmed\nabdopus\nocxo\nmonz\npicnicked\npe
ssa\ninju\noakerson\nqorbani\ncavelossim\nunbutton\ndeede\nportknockie\ninviter\nleape\nsportsview\nreservedly\nmonshipour\ndeplaning\nzaidel\nsgca\nsavir\nbluedog\ncochinita\nbourgmestre\nnabp\nsolucient\nrahabi\njumpstarting\nprofiterole\ngayby\nskysong\nayron\nhainley\nscotten\ngubanova\nbaysinger\npfirter\nglyne\nmonzer\npeterstone\nbitd\nkramish\nhiggitt\nimplicity\nrosselle\nbowlsby\nadhanom\nreauthorizes\nmiskell\nsovietisation\nebanos\nkerkhoff\ndenuo\nedwardsport\nchowkidar\nbiha\nfirers\nendline\nseatguru\npenon\nprevously\nkokilaben\ndiquigiovanni\nbioprocesses\nvaki\nnukaga\ndamhead\ndeliziosa\nsilbergeld\nkavaler\nassociado\nbieldside\nsaslow\npeoplexpress\nruijten\nowlett\ndualit\nkayli\nrosliston\nbeguildy\nstarborn\nabdulayev\ncalvine\nlangsdale\nvicriviroc\nthoelke\necojustice\njafaar\nohrnberger\ncarmeuse\netteh\ngesticulate\nhellabrunn\nlöwensohn\nweiqiao\nmrozowski\nhnr\nfrontrunning\nbyv\nnumonyx\njazdy\ngmcc\nwiegele\nmycogen\ninexpensiveness\nostional\nabdulhamit\ndorfer\ndecolonised\nbartolone\ncamre\nbalgreen\ncrapware\nrhoton\ngreensfelder\nvulgarisation\nfordow\ndemarse\njerviswood\nsamaila\nraey\nbohmte\nlemish\nmarginean\narabize\nlathkill\nsubtherapeutic\nsubstitue\ncristovao\ncranshaws\nunputdownable\nilyena\ndewynters\nhandango\ndoxil\nweaks\nabbotskerswell\nscreenprinted\nnumu\nreengineer\nmilleville\nesrock\nekuban\nusura\nnurudeen\nstipp\naanenson\nmediwake\npostrevolutionary\nvemos\npernier\nxijun\nmircera\naanensen\nfrediani\naparthotel\nbrunhart\noestrogens\nfadillah\nantibe\nskelleftea\nlopini\njeralyn\nkhazakstan\nduplain\ncontestability\nsemde\nsarfate\nkrasnovsky\nenage\nhanggai\nwafic\nsodahead\nhiguain\noverages\nmallarme\ntoroitich\ndiwrnod\ngrassing\nbowd\nrebuilders\njipijapa\nsummerscape\nmarchell\nchangming\nadvocat\nmangeot\nmaddan\nfangping\nbackwords\nsurono\nmieka\nschelbert\nmordiford\nhubers\ntosetti\nincalculably\nchangzheng\nprobings\nicemaker\nserialisations\nodim\nhuascaran\nnrta\nchukkas\npiepenburg\nbodyba
g\nadfl\nsportsweek\npallières\nkursumlija\narmorgroup\ndestabilises\nbetutu\nwestren\ntransmeridian\ntresch\ntlaltecuhtli\nshaley\neang\ngrubbed\nballetmet\ndawani\ngroenveld\nhaitises\nmêlées\nabdulatif\nghleann\nunmuzzled\nmanganaro\nyemma\ncorbey\nnybc\nchangyou\nresilin\nwatchfire\npuddleduck\nnafeesa\nlagae\nhanstveit\nprw\nprotetta\nbeigh\nlerebours\nskyvision\ndousset\nplique\npresagis\nkorki\ntaquitos\npaysinger\narrick\nredknee\nprotracting\ncclrc\nthroug\nmintues\nichord\nmonksfield\nmoredock\nbezu\nmardo\ntavin\nfrolunda\nindictee\nnomal\nnehls\nsporks\ndockmaster\nhurungwe\nflashplayer\nalanda\njenab\nneura\nbpca\njisu\naquaphor\nomalos\nnorthstowe\nheadleys\nhalftimes\ndebattista\nairclic\nailea\ntiridate\nbuzdar\ngogorza\nschuth\nchelopech\ngruffness\neubam\nmzima\nqori\nvaporises\narkaitz\ncommerz\nmauric\nkitwara\nwolfbane\ndeaves\nfotch\nmurgitroyd\nkranitz\nbergalis\nleguizamon\nsomova\nawatef\nkegger\ncharde\nbiorefineries\nsliwinska\nschachte\nschmidbauer\ngareev\naptina\nvanny\nshuff\nmashore\nwarings\njaffree\nbougrine\ncharnov\nlifebuoys\nkolzak\nmitrovich\namarr\nplantsmen\naquadrome\nealam\nmedtrade\nokmok\neqi\nmuslem\nnakamitsu\nhaibel\ndawon\npatriarchical\nbeetling\nsedler\nfalanghina\noverindulge\naqal\nodee\ngracemount\nalperstein\nzerline\nsubby\nfordlandia\nbobbling\npalygorskite\nxiuli\ntuppenny\ncacciato\nligoniel\nnagorski\nradicati\ngollnick\nsedm\nnarcoterrorism\nicenorum\nlumosity\nmichaella\nhayllar\nsurrexit\nbaumjohann\ndtcp\ngoodguys\nrackauckas\nlistmania\npetpet\nizy\nwardeh\nllanteg\nmolby\nkorobka\nhaifaa\nmewling\nweiyi\nlubanski\nvampish\nscrunching\nstenstadvold\nitemizes\nrosebrough\npassout\nzizkov\ntanchon\nisman\nthongsuk\nimpenetrably\nhandsprings\nchowders\ncavu\nchmsl\npartitionmagic\nbeychevelle\ndrachkovitch\nmobilerobots\nrassouli\nairgroup\nferley\naphibarnrat\nbeamz\nciullo\nepitafios\ntamwe\nmarauded\nastringents\ntveiten\nsafelight\ngermay\ncombustions\nretrovirology\npazdan\nccbi\nbrous\ncambusdoon\nm
ersky\ndaofu\nretronyms\nforthlin\npowertools\ntanaz\nnonja\nyeasted\ndematerialised\njosephsohn\ndrumlean\nskacel\nhanaa\nsupersmart\naccumsan\nsagall\npenniston\nyaqshid\nmonsterpiece\nshimu\npixantrone\nolsat\ndiscriminately\nmcmeniman\nkosmin\ntolkienesque\nharbourne\nmonitronics\nyastrzhembsky\nghanan\npatronelli\nnwaubani\neuripidean\ndalke\nlauzerte\nsoderick\nxingwana\nkeahole\nmbandjock\ncluess\npyatov\nbabeland\nfaidley\ntamsir\nperwer\nbazillions\npuggioni\nstaycation\ntotalai\nleopoldino\npiedro\nturbulently\nglacéau\nmatsukevitch\nhydrofracking\nnttc\nvigreux\nherol\ndeltec\ntorregrossa\nsynott\nfalker\nhillwalker\nrotnei\nfonteneau\nacfe\nvanhoozer\nndjeng\nmoussey\ntheman\nkhurma\nquickr\nmuthama\njotspot\nbadescu\ngegax\npotterrow\nprepak\nmontlucon\npalely\ndesirae\ninternacionale\ngartloch\nmaikano\naltwegg\neyries\ncajuste\nprobelms\nsullum\nschlack\nmarsalek\nallbee\nbourgs\ncurtness\nkitzsteinhorn\nchipchura\nnewthorpe\nturnspit\ncomediennes\nzugibe\nforber\nboneva\nnuttiest\nsnowless\nciaron\nllanbister\ninteret\nedmison\nlagazuoi\nnekhoroshev\njochelson\nkettuvallam\nlacapra\nsensio\nmacroberts\nmicrodeletions\nfarmiloe\nranaghan\nshaida\nlynham\nniemoller\nbodipo\nhappart\nniyombare\ninkom\nmuscala\nblaencwm\ngebremedhin\noikocredit\nsochacki\nnnedv\nmakuti\nsubsiduary\ntwon\novv\nextolls\naserca\ncroaky\ncapestang\nhuaping\nnemtsova\nbettag\nlirung\nsporrans\nxiaojuan\nbayble\ntherre\ntaurid\nkrogers\nlemonades\nhighlighed\nnegitive\npatriarchies\nspanbroek\nkuular\nanthropomorphically\nhuske\nthemseleves\nbarloon\nluchow\nkuhaulua\nkrauchanka\ncastellations\nthingamabob\ndozsa\nbebidas\nposeyville\njolicloud\nriisager\nirccs\ndoorframes\nshaza\nsitarski\nwaldhauser\nmynx\nlohberg\ncameraphones\nmiscalculates\nsamoilovs\nthoug\nmaiquetia\nflylady\nbijarani\nshuaa\nmediasentry\nbumptop\nfanciable\ngaillet\nrueger\nlhalu\nrichmonders\ntronchetti\nkasarda\nhubzone\nwaterproofs\nsarrell\naveyard\ngarley\nchumki\nzhaohua\ndunckley\nsharkfin\nalls
eas\nshohan\nsiglin\nintensivists\nhealthfulness\ncollecter\nbingeing\nspringbett\nhizbut\naboe\nstroik\npipien\ntheophanis\nbocar\nmaulawi\nalecks\nmckerr\ndiabetologist\ngarduno\nsmain\ncampath\nseaboards\nindridi\nclogger\nkajko\nentu\necch\nclunkily\nbrinded\nplacerat\ndudus\narglwydd\ncamenker\nnoteholders\nbagneres\nonpoint\nmasurca\njangbu\nvroon\nresevoir\ncongest\nsigurdardottir\nmosebach\nvysehrad\ndhra\nbioequivalent\ninstutions\nmalago\nlaudably\ngratifies\npoligon\npimentón\nneiss\nsagebiel\niferouane\nfraziers\nbashings\ndrumahoe\nnaed\nchasis\nplaneload\nbolch\nwyka\ncrystalize\nwinni\nkimmerly\nconspiratorially\nwegrzyn\nfastcraft\nhumourists\ntiankai\npyan\nsapaugh\nbongiorni\nyokocho\ngyrotonic\nwolpin\npanderer\nbosket\nmahyco\nortigia\nsquaretail\ncomunicacao\nmetrozoo\nporthpean\nwenlan\nhalpine\nmoreman\nnonobvious\nsanea\nbamboozling\ngardley\nappp\ndefenitely\nombudswoman\nmahachi\nremeliik\nnapha\nbeanomax\nbedwetter\npolumbo\nvaugh\nmonsterous\ntrojka\ndeodorized\ninterational\nxiva\nweitman\nfewa\nbayakoa\nhilber\nislamicist\nlodden\nbzo\nkamilah\nbuonafede\nmicrochipped\npedicone\nmekurya\nintelisys\nsuperko\nbaijal\njagot\ndataspace\nsscp\nfopr\nsinhalas\nparisotto\nvxs\nwollam\ntoumazou\nruhal\ngorgia\ntunander\nfahel\niqt\nlunke\nmandley\npinteresque\ngloomiest\npimms\ndogtag\nonlys\nalsheikh\nsecuritised\npelttari\npeformed\nscibilia\nzamagni\nalcindoro\ncolleages\npriestnall\nspinnerei\nassns\nhalic\nkuvin\nbrankin\nsrgjan\nbenia\nrembrandtplein\nmontanes\nweath\nbrisconnections\nrhit\nauthenticators\nkgomotso\nbenaouda\nhastilow\nhousetops\ncabernets\nstrappy\nhamouri\nthiazolidinedione\nbdmlr\ncaravaning\nsalumeria\nsenol\ngrabauskas\ndowtin\napffel\nqadbak\nbandipore\nclarcor\nhealthequity\nscarfed\nnwlb\njongjohor\nmaximun\npxs\ngambols\ndykeenies\nsobah\nkatchit\nwreathes\nlarchet\nozkok\nboryokudan\nabssi\nellacott\nschoenwetter\nmaresa\nsummitting\norowan\npsychologizing\nunkindest\natondo\nsifrit\nelnaugh\nantoura\nparro\nunm
eritorious\nbimatoprost\nbektas\ninconclusiveness\nmorlich\nquinstreet\nequatoguineans\nglencrutchery\nffelp\npullig\nkhaemba\nbacongo\ngyenes\nladymead\npantomimed\ngrittiest\nhamrlik\nniace\nfarmelo\nmokhzani\ntnrc\nrals\njafarian\nrockwalk\nchatila\neembc\ncastparts\nmaque\nhutley\ndeutschendorf\nrafeh\nmacuspana\nswelim\nkhazi\ntitanics\naminov\ndevestating\nkrueck\nglocer\nsewta\ndonadi\nbloodbaths\nmarlowes\ntransdniestria\nsigfredo\nproskurin\ntechshop\nhotpots\nhymens\nstanderwick\nindentifying\nkrams\npearc\nmandideep\nlilybank\ntrackpads\nleanest\npamm\nideations\nverigy\nswarmers\nfedrigo\npones\nnarino\nmutty\nguéant\nresouces\ntincu\nelectrolyser\nsmeraldo\nwalport\npanalpina\ndivens\ngroenink\nlavalife\nvaroni\npegman\nteverson\nnenadovic\naltoum\nbaaad\nkyam\nsponsorless\nimmunogens\nteriberka\nmassmart\ntourbillons\nlibreta\nmtso\nisraa\nzegart\ncapsulated\naxarquia\nkqet\naranzubia\nbelayed\ncicconi\nsedulous\nhomogenise\nshorteners\nlogcap\nagritech\ncoleite\ndeqin\nyoungling\napale\nmautam\nmdbs\ntunworth\nmollenkamp\nortak\nadvents\nunderseas\nseedco\nheimburger\njitesh\nllanfaelog\nardena\ncotecna\nsanctities\nlangeled\nbradworthy\npcec\nsenseo\ndislodgement\nthiérrée\ndinnis\npedwar\nferrandiz\noxhorn\nracki\nteleprompt\nvaroom\nalluringly\napdm\nghaemi\nincomer\npincha\nbonacini\norphanos\nconnesson\nkatrice\nmorukov\nplacemat\nthayers\nsindell\nfarver\nnumrich\nccan\nzhengdong\nripani\npredominently\nsemaj\nchaaban\ntitizian\nwegeneri\nvavala\nbeidleman\nfriddle\nkolumba\nexeptions\nzazz\nwhatshisname\npeerbhoy\namerisourcebergen\ncrosscuts\nbrightpoint\nbensted\ninstitutionalizes\nhoumas\ndisregulation\nmuwaqqar\naidem\nbockstein\nperler\nngmn\ngamco\ncounterprogram\nopaquely\nvideoscan\nmufamadi\ntrabaja\npopelka\nrignot\nwiercioch\nmobaeck\nvasini\nmandab\nasbeck\nbasye\nklompen\ntorsions\ntransnationality\npenalver\ngoosegrass\ngalais\nstrinati\nalmarai\nrajoli\nzherebtsov\ntzetnik\nkucharek\nmishandles\nhatlestad\nnderitu\ncivl\njarmin\n
mpiranya\nsticklepath\nenaam\nstilkey\ntahe\nharrowingly\ndfdr\nfertilizations\nurosa\ndelocalisation\nlounes\nfarl\nsteamie\njangled\neastbay\nkumawat\nhypergravity\nwyocena\nmarcarelli\nhyperx\nobstructionists\ntorjesen\ndrosten\nalixpartners\nloughmacrory\ninserters\nnailbiting\nassura\ndavendra\ntavleen\nbeddings\nnosiviwe\nnonimmigrants\nkhorfakkan\nuncharitably\nkauzlarich\nsurfeited\nforeing\ntobiko\nbarfuss\nkenagy\nkhezri\nhardnosed\npiseev\ncontrollably\ndetoxifies\naldhouse\ncooden\negoic\nsafarova\nabeliophyllum\nbeckettian\nturen\nmulthaup\nscaleup\njamarat\nbejamin\nholmbush\ncultlike\nrobertsfield\nnaftalis\nillimitable\nribeirao\ndecongestion\nstonebriar\namareleja\npebblebrook\njanks\njuvenille\nhidell\nmarshmellow\naltovise\nlangleys\nprateep\noaaa\nculpas\ngadio\nrestructing\nmcsd\nsicha\nhutagalung\neadon\nterner\nwahala\nsarahs\ndepilation\nbeatify\nfirswood\nroseacre\ndhital\nyambuku\narugment\nmiere\nbioindustry\nversas\nperranwell\npolukhin\ndenouncer\nmonoethanolamine\neventoff\nripest\nhenretta\nhaythem\nlucentis\ncurtsinger\nyasseen\nnutall\nbenyoucef\npetrey\ntwynholm\ndustmen\nnetscout\nbrüderle\nstrassel\nlaviera\npashos\ndocstar\ndorer\nvampyroteuthis\nnonemergency\nnepstad\nnaysaying\neletion\naberchirder\nvodopyanov\nprasquier\nsmmoa\nfeehery\nustari\nembittering\nreattribution\noserian\ndorronsoro\ntiblisi\njupi\niziane\nluckly\nruralism\nchatom\nfuisse\naleki\nbecha\nvalaichchenai\ndegreee\nklapheck\nsoheila\naldurazyme\nbancaja\nvoumard\nkhalde\ndefier\ndaintiness\nataga\nlambhill\nfutzing\nballcarrier\nkanah\nitsik\nyunchang\nnantmor\ninconvience\nsaeqeh\nbuks\nstakman\nkhromov\ncliffton\ndilemna\nmcgennis\npartsearch\ninhospitality\narss\notcqb\nallianoi\nduffuor\nraincheck\nhorlogère\nhassebrook\npiella\nhnt\naccessdata\nkeatons\nzoback\nmccastle\nsuker\nbinoo\nfirova\nmicrobiologically\nvillified\nunhinging\nndure\nmastrick\nrymarev\nnonscientists\nfayose\nagerpres\noldfashioned\nredpolls\nanncol\ntorgan\nodebolt\nauy\ntitians
\nboxworth\nguzara\nimpotently\npeaceman\noutflux\ndisquietude\nsolans\nlvef\noxc\npieraccini\nrituxan\nsekikawa\ntaketsuru\nkibris\npretaped\nbrutalising\npitchforth\npostcomm\nflyfishers\nwolfsdorf\nmilners\nummmmm\nchembe\nbestfriends\nlazes\nlapiz\nhermila\nshopworkers\nbodai\nunitholders\nkairouz\nyoof\nmitzvahed\nbathija\nevarn\nnahdha\nmurar\npoptastic\ncrossable\nmicrotca\nenarson\nelectical\nbarsosio\ndikla\neisenhowers\nmaniace\nstellick\nsmadi\nganol\nweigman\nyanowsky\nnerica\nlujambio\nscoffers\nresile\nashlock\ncrunchiness\npicocuries\nonemi\ngoatley\ntransillumination\nkalie\ncouln\nhurcombe\ncreaminess\nroeding\nsosp\nopalka\ncordiant\noltman\ndisbarments\nmackmyra\nzuera\ngmarket\nlonglining\nteshkeel\nwtmd\nbrakefield\nsinkan\nnausée\nhaberturk\ndcci\nkleinmond\norionids\ncytec\naracoma\ngorkys\nwhorley\nfengying\nwetherhead\nboulmerka\nkawangware\ngrocholewski\nboilover\nvasteras\nlardi\nvantas\nmontclaire\nbarjo\namum\ncnockaert\nbearcroft\nyesim\nberlow\ntrinitarios\nssids\ncaponata\nvcts\nlebergott\nbvoc\nresturants\natomiser\nyellowness\nbellhaven\ntofane\nlydic\nbiostatistical\nmintoo\nfreifeld\nwhataboutism\ntatbir\ncasue\npishchalnikova\nsynex\nfaramarzi\nemporiki\nentropa\nextoll\ndownderry\nomnibox\nlowenberg\noppenheimers\namdr\ncsim\nsarnesfield\nlovecats\ncelier\ncapless\ndukic\nbrylin\nstudsvik\nmaceio\nmlynek\njaising\nsilvaire\ndravite\nhuelle\nphoumsavanh\nmugerwa\nwharfside\ntessina\ndaybed\nlarchfield\nolweus\ndrahm\npristinely\nverrecchia\nkirr\nxtronic\nbbet\ndillow\nboulami\nsissie\nautodoc\nkhushhal\nhelfant\navranas\nmelmore\nmonello\nstepkids\nliau\nangelich\nparadisical\navaza\nlafley\nsalzburgerland\ncontactin\nreinstalls\nallrounders\nschulson\nhealthplus\ncnbb\nnoerdlinger\nkayyem\ngorneault\ntrophys\nkarosas\nmangyongbong\nshaoshi\nmicrel\nziplock\nvanclief\ndaftest\nnowzad\nkamaruzaman\nwolfing\ngraubard\nwebcameron\nsupervalue\nhakia\nplosser\nhavertys\nreclassifies\nzhevago\nportavadie\ngetson\nnikishin\nteshekpuk\n
kubar\nlividity\ngarrington\ncortexes\ngoldstraw\ncebreiro\nminic\nmettey\njerheme\nsulser\nrichtel\nsteliana\npeeke\ninsanities\npetin\nnishigaki\nlorenson\nkopps\nyawk\nepithemiou\nhazelgrove\nfontas\npiland\ntiggelen\nstaion\nkurkela\ncarinish\ncopperhouse\nmetropolitian\nbiddies\nreteaming\nraiya\nsurgutneftegaz\nsemitrailers\nbetulin\noutclasses\nsuntanned\ngarbles\nlarae\nsaumitra\ncartographically\nninfield\nyipsi\ntrenkler\nglühwein\nesparanza\nkitcheners\nswines\nuncast\ngangbangers\nsehba\nberrini\nshauny\nprevas\nzanmi\ntomorow\nburdeau\nsubtenant\nbibbins\npinchukartcentre\npolegato\nsarmat\nscura\njanoris\ndashiel\nzoltek\nzancudo\ncchd\ncherrix\nornellaia\nsacheen\nbirak\nreenlisting\nbenecol\nburped\ntahnoon\nwmba\nvirgets\nuppo\nshafiqullah\nvelasio\naidone\ndiekema\npurposly\nshahbuddin\ncopelands\nchomette\nstorchak\nmeckstroth\ngouras\nxiguang\ntrefin\nzaromskis\nleestown\ngzm\nhasfield\nsedulously\nudw\ncarnehan\ndeepesh\nmatusalem\nlomban\nlatterman\ndoutre\nyeoward\nlamamra\nrurua\nmangosteens\ninimitably\nkahiye\ndworak\nkinchin\ncrra\nreorganises\ngamesalad\nfangxiao\nveikoso\ncoggles\nbenest\nphisher\nganrif\ndworski\nleigham\npseudocyesis\nhuzaifa\nmozhdah\nngetich\nmarumsco\nplumerville\ntributs\nsilkier\ncharaf\nbelarussians\nguilbeault\npooing\ndealed\narchness\nacknowleding\nmihadjuks\nlescroart\nsarobi\nnafzger\npywell\nhalfhill\nwinsberg\ncalculatedly\nguwa\ncaveny\nfarofa\nturquoises\nshambrook\nnyamko\nlandaus\narrestingly\nansong\nkoeller\ncîroc\ngookins\nmaytas\nhealthworks\nmyozyme\nshuggy\nacrassicauda\nvichai\ntahiraj\nkeymaster\nbritishisms\nslowcoach\nbadesha\nlionizing\nhasanul\nresidentes\nvorstenbosch\nhourican\nfausa\ngallivanting\nnavah\nshamateurism\nowre\ndysmenorrhoea\ndoucoure\npontneddfechan\npaganica\nfuterra\nhijgenaar\nbicking\nskatetown\ndeplane\nheirachy\ndrehle\nmarcillat\nickleford\nmonologuist\nharootunian\nzyrtec\nmujawar\nmayala\nacroyoga\nglyncoch\nsinervo\ncremieux\nkhog\nwatersplash\nfressange\ncumbus\n
nurturers\nklapow\noutthink\ncnpa\nanite\nyoui\nkedrick\ninserra\ngonggrijp\nscarpitti\nauthories\nchengwatana\nharesfield\nbargin\nleive\nantiseizure\nfederoff\nfeejee\nfalon\npawprint\njianhe\nlozupone\nreceipe\nmunsen\nusership\noenning\neshelby\nwsta\ngaziano\nyingdong\nmajzoub\nmilbourn\nshanor\nrewild\nkwanda\napolinário\neicken\nosmun\nkfn\nramiele\nhoogstraat\nfamilycare\nkiranjit\ncoould\nmudassir\nconcepció\nbimmer\ngrender\nmysupermarket\ncoffeecup\npelluhue\nmindrum\nvesali\nvictore\nmeshkat\nwoyda\nadorer\nauricchio\nformigal\nheulwen\nmouha\nunretire\nimposingly\nelectricities\nforewarns\nhaylofts\nressa\nunprecendented\nbolham\ncrimebuster\nzakone\nprita\nbridgemary\nliad\nkutin\nbehlen\nentreprenuer\nvarkonyi\nkailee\nhusch\nstotsky\nwannous\nautomative\ngarc\nsaubert\norbotech\nmorewedge\nmageau\ncomunicacion\nuzbekneftegaz\nthirunavukarasu\nalmondbank\ncrucifiction\nmiruts\nliah\nderisi\nrozynek\nnucs\ndespicably\nhandcycles\noltrogge\nmonashees\ndefendents\nkinchloe\nwishna\ndanovitch\nchestnutt\noxborrow\nbiomechanically\ncamioneta\nmutri\nsekeramayi\navanir\ncleveden\nnewburger\nturbolinux\nkerpan\nkaelber\nmentawi\nhampikian\nloengard\nmoisturize\naapis\ngetafreelancer\npracticising\nteven\ncoldra\nalanah\nepns\nbrowman\ngocompare\nofficemate\nmarvy\ntoptan\nsubnitens\nbierbichler\nmuffie\npoilus\ngrotbags\npendergraph\norbec\nmasinter\nalmaric\ncraws\nnizeyimana\nnonsmoker\nmihailescu\ninnotek\nkimaya\nbritweek\nwpsi\nzahradnik\nflapdoodle\nmicrobrewing\nculton\ntatter\nrepresenative\ndnsc\ncharlatanry\nblechner\npowertune\nusupashvili\nalmast\nsusiya\nklenke\nhandspan\nwiveton\nnanoshells\nprawna\nalgenol\nkundnani\nramatuelle\nesterhuyse\npracticle\nlovability\nblackwash\nsanex\nkenti\nboulmetis\ndunguib\nbunagana\nchimen\novono\ncamposano\ngouvia\nargies\nrollersports\njosian\nvigneaux\nkhamid\ncommiserated\nizta\nqeep\nbreaststroker\nnightowls\nakerley\ncoarsened\njianxiong\nmochlos\nbodow\nkirtman\nwilmorton\nhurdia\nhuijin\nkaldas\njohar
a\nglascow\nullenhall\nshulkin\nboufford\nblackadders\ntatsuzo\ngianopoulos\nraghip\nnylo\nthoughs\nfootstools\nyoana\nsgarlato\niswahyudi\nsaltchuk\nkronthaler\nklicka\ngeorgos\nchurrascaria\ncoldman\nchokepoints\nalmendarez\nbartumeu\nkwikset\nlumus\nsundyne\npresho\nhousecleaner\nunsleeping\nruminated\ngergorin\nnumeros\neurekahedge\nyames\ncablefax\ntrovata\nslattern\ncorbishley\ntimebound\nredwan\namirkhanov\neurocypria\ncataluna\nmcrs\nadvantech\nequalises\nfurnituremakers\nfairminded\nfloormats\nfoud\nwintzer\nxinbo\nclangs\nirlene\nijl\nautoeroticism\ngissel\nencourge\ndisappointedly\nseyfollah\nterrapower\nnimham\nevany\ngriffelkin\nmdjt\nadventuredome\nintralase\ncrigglestone\nchristobal\nsearose\ncriffel\npassetto\njurelang\noliu\nkranton\nkukje\nmohsan\njuhayman\nombale\nhvacr\npides\ndeodars\ncattie\nyouboty\ncosset\nfeleppa\ndirden\nreinclusion\nbeghtol\nbodorová\nempathised\nforbearers\nbergamin\nbreckfield\nbalough\nlyris\nlanghorst\nbesas\nmoctesuma\nunpracticed\nfiroozeh\nschauss\nslighlty\nheartrate\nyishui\nfeec\nforepeak\nacademyhealth\nnuli\nhawnby\nwaterous\nterisa\nvmworld\nbellybuttons\nidolators\ngebreselassie\nkidskin\nmicrosieverts\npostlewait\nwintley\ncommvault\nexocets\nisquare\nkeesling\nrosenmann\nzabihollah\ngalves\nmfsa\nheumarkt\nlereah\nfackrell\nzuby\nlouet\ndjorgovski\njiafu\nngay\ndadang\naltenkirch\nfroglike\nsinisi\nhotrods\nserrette\njdams\ndevolites\nschtonk\nhdj\npocketable\nluzuko\nfiyaz\ngouldner\nfritto\nmcbey\nrhodie\ntawila\nserag\notcqx\nsunnymead\nsavasta\nreceiverships\nzhakypov\nsavala\npingyi\nbhoopalam\nshurlock\nsementa\nmalapa\nmurungaru\nfulcrums\nfunner\nkollath\nreground\nbarrado\nmittman\ndangelo\nzimonjic\nduquenne\nkobashigawa\nseegrist\nbetschart\nconvulses\nmobos\nrons\npscc\ngattii\ngwaenysgor\npahimi\nchiclets\nfornicator\nkasanoff\nbrainbow\nmortarboards\njimaní\nkowitz\nloomia\nschawinski\ntagro\nkhaung\nchiropractics\nkadakia\nextreemly\nmaxxam\nesporta\nbendamustine\nballinagh\nstourpaine\nguira
nd\nevenstad\njollimore\ncompounders\nimla\ncormie\nbeney\nsticken\nkenwith\nmedeia\ntaraqi\nflogs\npreregistration\ntonet\nhardwiring\nboskovski\nsophistries\nsitanggang\nchangge\ntabita\nhankies\ngolflink\ncelian\nselsky\nfetida\nmilanes\nmollon\nabraxane\ndingwell\noptex\nupperlands\nzoglin\nherard\ncrudités\nghiglione\ngardea\nhautlieu\nfsos\npersic\npinholster\nnittve\nkinclaven\nfraudulant\nfatalists\nirisl\nwidespead\nroetzel\nodney\noverindulged\ncaldeiro\ncherrapunjee\nnevruz\ncinéastes\ncohered\nacott\nfiolek\nmatchfixing\nwartel\nmisanthropes\nallinger\nretrovirals\nswordstick\ntiribocchi\ngenske\nvorobyovy\nrecoups\nledy\nforecheck\nrylee\ncornucopian\nsorenstam\nscrofulous\nkanik\nlandels\ncrossplay\nlifeng\noafs\ncaraibes\nsmartceo\nyaojie\nnrepp\nbondan\ncaffarella\niifl\nindecencies\naloisiuskolleg\nmegal\nfantasise\nprocedes\nebates\nsimopoulos\ndoonie\nsugarsync\npileups\nitula\nexhortative\npicassa\nzhongwang\nlopezes\nnashwa\naquilano\ncasegoods\nshuanghua\nkatel\nrenggli\npourmohammadi\nreos\nmkhwanazi\nsabloff\nrohrs\narchetti\nreconsolidated\ncoveri\nnikkie\nfumiyuki\nmalaitan\noveremphasise\ntorwards\nfrothed\nenergyplus\nhabitrail\naboucherouane\nkubic\nbpss\nstagestruck\nriskmetrics\nweightiness\ndoppies\ncremes\ncalumnious\nbermans\novrebo\ncadge\nterzopoulos\nsinkor\nboomy\nkeilar\npylas\nnobes\nkaplon\nkweder\nwinbond\nhousner\ndaynard\ntesfa\nstuczynski\nwhaa\ngeorghiou\ncocaleros\nenterpriseone\nkiewiet\ndecaë\ninterupted\nshohet\nmckellan\nkassy\nzhirong\nserhant\nshenmu\nsharahili\nhandgrips\nplayready\npattiz\ndionisotti\nencorage\njingsong\nautostream\nfrypan\nbiib\nstagework\nmaday\nsumari\nsondag\nherbertus\nquadrivalent\nlamneck\nkambale\nfelicitously\nicpr\nboski\nbeyoglu\nburhenn\ngeorgeann\njianhai\ngudgin\natron\nverstandig\nphans\ndeliz\ncarmenère\nosguthorpe\nosleidys\ncorrolary\nhuaqiang\nrfic\nwatler\nclingstone\nbippy\nafgoye\nsakaria\nzimeray\nthiagarajah\nsummiteers\nchallock\nwauquiez\nferronickel\nchemonics\nprofoun
dness\nauby\nylläs\ncigarillo\nettlin\noilsand\nfractioning\nfalklanders\npronouced\nmahjoob\ncummerbunds\nfreescha\nabendrot\nbovbjerg\nngoudjo\nisue\ncospedal\ngoslett\nbistany\nuze\nurizar\nrowta\nlewandoski\nbrownface\nbobzien\nsportsdome\npenford\nunsurvivable\nsorak\nreimaging\nkayt\nwolfan\nvandehei\ntyren\ncuoghi\ndevonwood\nljungquist\nnypirg\nnikzad\ndarshak\nhillenmeyer\nnamvar\nhaltiwanger\nhuiguang\ncavnar\ndecapitalised\nkaysha\nstawinoga\nminski\nchepkurui\ngassiev\nresna\navax\nboundlessly\nhaqi\ndauletabad\nzinno\nbiorefining\nlaibson\njibed\nladettes\nappf\nteraflop\nbabyfirsttv\nlocalness\nfqhcs\nmontae\ndecamillis\nwespac\nslaley\npositve\ndhobley\nshaplen\nzusman\noluyemi\nmodd\nbunkmate\nmengert\nindivudual\njhooti\ndenationalisation\nunmilled\nnovec\nsizhi\nliliesleaf\ndiplopedia\nmerti\nsandouville\nsarava\nhazelwell\npetreus\nartwatch\napemen\nkhagush\nrascoff\nmfrs\nshopkick\nknowin\ngarofolo\niourieva\nexagerate\nsukhera\ncurrrently\njacie\ndawalibi\ntolstrup\nclendenning\nindanan\nkocherlakota\nunza\nsymud\npleaders\nnasatir\nukunda\nrichemond\nsarajuddin\nmanku\niztuzu\nventouse\nlubatti\nphobjikha\nvitran\ncatalist\nbargainers\ncloudsat\nacaz\nprokesch\nmaruha\nimfa\ndovan\nappnexus\ntechnolgy\ntraduce\njousted\njammies\ndanilchenko\nphlip\nkokavil\nrewriteable\nmultireligious\npepperstein\ncrosshatched\nilenia\nmahboba\nkawaoka\nenigmo\nurbanizations\nexagerrated\ncyh\nxiaokang\njirous\nrescissions\nteshigawara\nqaseem\nbooksmith\nsivarasa\nshearwood\nrrac\nmalula\ntemping\nschumacker\nklarner\ngenevans\nillman\ndouzable\nbwy\nvidiians\nushiba\nhowdahs\nlasercomb\nultrahd\nevrything\ndescas\nbonacic\npalonosetron\nwalus\nseafo\nchristingle\nfesmire\nshurdington\namankwaah\ncrowlin\nwilfulness\nndpvf\nvuorensola\ncyhoeddus\nflourless\nmaglakelidze\nlhanbryde\nhamwee\noffspin\ncucolo\nbogh\ntumultuously\nsmus\nversifying\nhorsewhipped\nhayvenhurst\nacoba\nesag\ngimm\nnortek\nrandgold\nmitsuji\nhisey\nparsonnet\niranamadu\ndespute\nremoul
ded\nflozell\nflashmobs\nblandishment\njikany\nanoymous\nalogbo\ncolouristic\npomeranc\nogborne\nsqrl\nhansabank\nbaghar\ndoolally\nsoceity\napgs\ngombiner\nervins\npenaluna\ngellatley\nbondam\naerolift\ncritisizing\naffinion\nsifteo\nskorecki\namaryl\nmonoplace\nmiskelly\ncarello\njerbourg\navize\nregifting\nbosphorous\nkristalina\nstreather\nallonne\nlouisana\nschivardi\nfundora\npallmeyer\nteacakes\ncarbaugh\ndemitrius\nsbca\ndaintily\nkajitani\ngolubovic\nostojic\nwojta\nteasels\ndenette\ncusato\nboudjellal\ngolpayegani\nbourtzi\nqqm\nkhazna\nwangyee\ncasales\ndorkbot\nnaureen\nstefanoni\nlarsh\npasteuria\nhighhanded\nbarnouin\nrennicke\nlakdawala\nchakir\nwrotes\ndelloreen\nhorningsham\nschwartzbach\nyablokov\nwikileak\nmeyiwa\nnapm\ndrapkin\ncalderglen\ndebruyne\njaelani\ndehnart\ntransgas\nowomoyela\ncacio\nosirix\nbarakaat\ndevyatovskiy\nladdies\nenitre\ncitymeals\ngorllewin\nminable\nwoerkom\nhaubner\nwict\nwhata\nhugeness\nsommerfest\nmappleton\nyeargan\nhotzvim\naproval\nplooij\nwingsuits\ncalotypes\nvegetate\nwesterngeco\nfimat\ncchf\nwoodburning\nmckechin\ncptc\ndefanged\nschroeders\nceutí\nmudzuri\nothersiders\nrheinhardt\nbesham\ntealeaf\nsanjel\nprosthetists\nsnla\nhanadarko\nlawsky\nyouthfully\ndjelimady\ngwawr\ngreenyards\nplanai\nsembra\nsyson\nfoleys\nmarkerless\nmavraides\ncamisoles\nconniption\noverbearingly\nnoblella\nkumgangsan\nicera\nnovazzano\nhembrey\nnussenzweig\nasnodkar\nederney\nbiospecimens\nacclimatising\nuzowuru\npropbably\naccroding\nmiodio\ncherin\ndeutschman\nanvisa\ngroogrux\nneilsons\nswanny\ngunky\nsplashback\nsnainton\njonhson\noralia\nbeshimov\nmanyatta\nmoneybox\nkaytor\ncaprylic\nsqueezers\nwarzecha\ngadgettrak\nlipping\nlicosa\nmosavi\nkinnerley\ntenido\nwesminster\naugmentin\ntalad\nvoevodin\nreschedules\ncyranos\nrifapentine\nknolle\nnossaman\nhealtheast\nmandrem\nsébire\nduprau\ndemocratics\nmobiltel\ncrusties\ndevachan\nhallums\nunmedicated\nparticipacion\nchromatographs\nitqs\npusillanimity\nmagera\nrowdon\nhedebran
t\ncawse\ngaberdine\nunshakably\nhuttoft\nboissiere\nparygin\nantispasmodics\nkusaba\nkurzon\nchokai\nalarp\nechenoz\npamidronate\nkatlyn\nfashionability\npolarisations\nudeen\npomerode\ncounterproposals\nspeigner\ndecorously\ndearen\nejegayehu\nparazit\nniccals\naddreses\nmcewans\nchiamano\nlatell\ntiihonen\ntrents\nqisheng\nyerlan\ntgfs\nkhaleed\narmajaro\npushpamala\njiexiu\nbalzekas\nbacterias\nsokalsky\noesa\nmegahed\nskimpole\nsaeeduzzaman\nhvae\noutsourcers\necity\ndiarrhetic\nshenkin\nurgencies\npatullo\nedisonlearning\nabusharif\nbehaviourial\nvillarin\natefeh\nrebalances\nscharmann\nadament\nglentress\nsomerstown\ncongesting\nskeem\nflorange\ncaymanians\nlipinska\nsirva\ntrackman\nlevitre\noceanics\nrucinski\nicontact\nmiland\nfecking\nfelcher\ndhahrani\ngoogel\nyasuf\nbexton\nstaffordshires\nbarkay\nzierke\nhnwis\nboardley\nheliovolt\nsejjil\ncompliants\nturbett\npiontkowski\nhavarti\nhuaren\nmitchelle\nlecg\nfathauer\nianucci\nreisha\ntalf\nwhiled\nhaltli\njuliusson\nmuszynski\nonterio\nlopinavir\nserebryany\ncomfortless\nvikor\nmatrimandir\ngobal\ndilatoriness\noctandre\nbrentley\nhisanobu\nnyongo\ntokbox\nbucan\nzotos\nmurl\nngiam\nheege\ndtech\naccuray\ncitybound\ndvoracek\nsevenload\nmenaged\ncosies\ncarbonetti\nrolet\ncandes\nresponsibilites\ntundo\nthingummy\nmicromanager\nlavernia\nguilfoile\ncaliope\nsnowploughs\ncojuangcos\ndurstine\nskimped\norgandy\nzatonskih\nharsley\ntalibi\nnavinchandra\ntakfiris\nmayolo\nwhear\nhoffmeier\ncavorted\njailbreakers\ndaisie\nkolay\nfolberg\nballyronan\nrelitigating\nglyptodont\natempts\nritsch\naratu\nbrodman\nbirute\nroongta\nmendana\nlaggies\nporreca\npacientes\ncamis\nbinkerd\nlitzman\nkccl\nmilane\nnutman\ngaudiness\nassult\ncarnavon\nkyloe\ncastejon\nsred\nshimmel\noverleaf\nscotches\nirdning\nmaltipoo\njudaize\nunintrusive\ntuw\nbjörgólfsson\nfarrin\ntrosglwyddo\npolymetal\nusse\nbreakfront\nelsfield\ntaith\nsuperphone\nbattison\nskiwear\nmoureau\nasbu\nfarzam\nfrecon\nvulgarization\nclodhopper\nbengaloor
u\nsucculence\nitmi\ngoodkin\nbonam\nbhatty\nhelsen\nhaykel\nsepanlou\nhaled\nshvelidze\ndonfried\nantihyperglycemic\ngricel\npolisseni\nsponger\nrackrent\ncoopting\naridaia\nkgale\nsideload\nseurasaari\nchamping\ngiardinelli\ngresko\nruths\nmercruiser\nburrier\nvres\nthorstvedt\nexcercised\nincude\nproductivist\nlauvergeon\nembarek\njeleva\nacarajé\ndemocrate\nrutrum\nmonocropping\nngungu\ntaryam\ninnkeeping\nkhurmatu\nshacking\ndewchurch\nbugginess\nvanhorn\nseineldín\ntransfigures\nsgia\nbyoc\nbarion\nplaneloads\nruginodis\nfollwed\nchechnyan\nquadrado\naftr\ncraigiebank\ndeloge\ncsuci\ninvoved\naafs\nheline\ncebp\nfstc\nerlinder\nkwatra\nbojko\nwitloof\nballabgarh\ngbomo\npersuit\nlapresse\nkohane\nligocka\nwavecom\neggbuckland\nredworth\nyorkgate\nindego\nrumma\ncamela\ngradante\nsosnick\nthreapleton\nepogen\nnebbeling\nnedderman\nbackwoodsmen\ndecaprio\nlulgjuraj\nmcglasson\nbartke\ntagamet\nhummm\nklinton\npurikura\nselectorate\nrepped\nayesh\nberisford\ndiemut\nmerguez\nphaeno\norathai\ncyalume\npapadimoulis\ndurwin\nsegro\ntetonia\nchinless\nnanz\nvahn\ntipsarevic\nweafer\nwuold\nburagohain\nalgos\nosterreich\nhullet\ngemdale\nbibishkov\nsanker\nbobl\nindentions\nsmietanka\ngelernt\nbakon\nlenity\nnuggety\nbachiana\nestra\nlifebook\npeloza\nprotogalaxies\nojama\nshiza\ntadross\npackinghouses\nmakalambay\nkhorgos\nomenn\nwhie\notse\napprentis\ncogitate\nkingsview\nduscussion\nshatri\nkorzec\nardlui\npacquola\nabdiaziz\ndjiby\nsesnon\nhairies\ngorgeousness\ndybzinski\ngthe\naboslutely\nquicc\ngaymers\ngrübel\npomper\nbackfields\neulogises\nohmygod\ncolourspace\nalviri\nwoodbines\npaycheque\nrefigured\nmoravek\nsunalliance\nasbm\noenophile\nscci\nosetia\nkastenbaum\nnonmotorized\nnubanusit\nmeteorologic\njankowiak\ntradedoubler\nredcats\nseminarists\ngrouts\npoliticker\nschaps\nneedin\ngheel\ngrubstake\napathetically\nquadrillionth\ntrendlines\nsinophile\ninterspliced\noganes\nstraitjacketed\njehanzeb\npresvis\nbeuttler\nsamwer\nhornback\nmydlarz\npeddicord\ns
phe\nabelman\npaetzold\nvertin\nmellier\nleafletting\nbehbehani\npqi\nlegales\nharmesh\nendives\ncompunetix\nupdata\nkenro\ncuet\nmacalintal\nloganton\nssta\nobermeier\ncopito\nbodymap\nvoluptuously\ntubul\nkraner\ntoncontin\nchernyakova\nivis\nblotner\nbiklen\nbeniquez\nhallmates\nbwlchgwyn\ntoranagallu\nouthit\ngulfi\nibstone\nbillia\nalternext\nsecuritizers\npetrophysical\nschipperke\nkomnas\nhawleyville\nkamerhe\nmagnetix\nreflation\ngaldieri\npedreña\ntolkienian\nrosmino\nmiton\nbissoe\nsupertribe\nhegerl\nsaltimbocca\nvictimising\njavin\ncocklebiddy\naquascaping\nwithburga\nannoucements\noysterman\nbothwick\npersdotter\nsalgaonkar\nhellewell\noverborne\nbevmo\ndervite\ntcgs\nkhapra\nevenley\napgi\ndignityusa\nposessions\ngearstick\nbageant\nhelmbrecht\nneworleans\nkerscher\ncowdy\nbreakfasters\npreventatives\nlemsip\ndoory\nlurgy\ntanous\ndowanhill\nwitnesham\nzchn\nfoqa\nfirstlight\nsparacino\ntaiex\nxalisco\nthorneywood\ncreely\nmirfak\ncontumely\nprewriting\nalginates\ntippens\nmererid\ndudarev\nbryansford\nameco\nworsall\nwardon\nweilding\nleighty\narreguin\nmrfs\npooed\ntrichter\nperola\nxiaotang\ncppi\nheinzerling\nubiquitousness\nsssh\nmaaleh\nkoutouvides\nbudcat\nfeltheimer\nburmarsh\ncaryle\narchetypically\nsyncopating\nbirchover\nedery\ndunsky\nbonvillain\nscrutinies\ngevor\nantiheroine\nlochrin\nashkirk\nprieure\naerotel\nsuperserious\nwildcatting\ncanape\nmanouver\nmakiri\nmcgregors\nagom\nzhenqi\nazir\nhitzler\ngooner\nmcaleney\ndynaudio\nbanglatown\nnodelman\nmeatpacker\npossiblility\ntrevalga\norullian\nritek\nbfsi\nkovacik\nexclaimer\nselbe\nreseat\novertax\nenthral\nshielings\ncameraperson\nbeurskens\ngolaszewski\nyelvertoft\nsatyriasis\nlangefeld\nsayoud\nhinden\nexsisted\ncasanegra\njiren\narzhang\nmicromoles\naarabi\nzalina\nhaughwout\nbarnow\ngraskop\naperio\ndemuynck\ntraceca\ndemicco\nraptorex\ntechops\nheebner\nliberda\npicocell\nunpoetic\nepidurals\nhybird\nvogelenzang\nleevan\neutawville\ninsureandgo\nevci\ncrackheads\ndyron\nzenjiro\
nmozal\ngodd\nnellum\nlivingstones\nperranuthnoe\nsheltam\nreshoudi\nvigdor\nwhrs\nnehmad\nsuberbiola\nfarimagsgade\npietrus\nbaronscourt\nchunxiu\nzuttah\nfenside\nfengzhi\nmaysam\nincredibility\niipf\nstroyan\nyeg\ncifas\ncambogia\naltynbek\nlacelle\nsmallie\nbatiashvili\nphree\nomniride\nkhansaa\nshrock\nlawick\nhilan\nanticapitalism\nsugarcanes\nunextraordinary\ndadfar\nbryostatin\nbhavik\npatail\nmolski\npsychoneuroendocrinology\nsquidgy\nrubalcava\nrcop\nhonga\nwhens\nsoohoo\nabatemarco\nsreekala\nsasd\ngoriest\ndiscretional\nfillibuster\nmarlesford\ncawing\ncharlady\ndrosos\nsakip\npreest\naquantive\ncoronil\nmobilkom\nshomrat\nfasthosts\nmccavity\naktaion\nnuvvuagittuq\nelsewere\nwrenchingly\npihlak\nbopped\noloo\nagns\napoteket\nstoystown\nsuperlicence\nsarac\nwattages\niniguez\ntheofilou\ndonceles\nfishfinder\ndidmarton\nnoogies\nsongul\nchophel\nnemicolopterus\nzhuoxiang\nimmunise\ndrabber\nlossio\nthiriez\npyramiding\ntreadwear\nversifiers\nhaghia\nnorster\npurrfect\nbazaari\ngainsaid\npacketeer\nenticknap\nhallelujahs\nambassdor\nhovertravel\nabsoutely\nthreemilestone\ntawengwa\nprovice\ngottheimer\nerlana\naeries\nweiers\nmadior\nnoppadol\nprso\ncorvalan\notsubo\noutfight\nferen\nlaurini\nscarlata\ngaloot\nrepulsively\nterreblanche\nirinn\nhumpherys\nglymour\nmultidiscipline\ntigapuluh\nsumino\ngeohaghon\nchunmei\nachenbaum\ncirovski\nplaw\nvillagereach\nmylin\ngratingly\nneumanns\nexablate\ntuerk\nbaghran\nwetterberg\nkatsucon\naccoridng\ncamapign\ndatacentres\nkinkala\nphilsophy\nfydler\nbureaucratized\nantifraud\nvilanculos\nsimap\nstockholmers\ngraffigna\ndrinnen\ncrashlanding\nprovi\ncommodify\nguihard\nskibiski\npsychobiologist\nalteryx\ntsouli\ndouchez\ngaruccio\nkokish\nouallam\nmilchberg\naproved\ntranquila\ndesloratadine\ncoreper\ndisfluencies\nrogich\nscabbing\nfrizzelle\nscarefest\ntruchard\nynystawe\nbandings\nthelton\nganderton\nexcentricus\ncarnwadric\ntigrana\ntundi\nreller\ndeveson\nricards\nwxtr\njungstedt\ndunkelberger\nfotyga\ncheri
fi\naserinsky\nkirlin\nweimaraners\nnadiri\nvioly\ncatteries\nspalled\nsellathurai\nspiciest\naboobaker\nintertrade\ntedstone\nmonowi\nfrazin\nsejil\nglucometer\nremuda\nlillien\nqadaffi\nfilenko\nrecist\ncopycatting\nbendet\nbeldangi\nmossbourne\npjanoo\ndiapensia\nkisaran\nrudik\nfionnphort\nwenna\nwouls\nkenyi\norenthal\nlaguito\nregualr\ngranjas\ngaios\nomada\nsunmonu\npebbledashed\nbesenval\nicub\ntoldeo\nbodinieri\nlinnan\nmarusan\naprill\nshanle\nsinka\ntenous\npiracies\nblackard\nkocurek\ngeling\ncanpotex\negotists\nravishment\nsquirter\nmessano\ngerhardie\nhudek\nhrudayalaya\nbksh\nmonogamously\nshikin\nnazirpur\nquadrathlon\nairconditioners\nchuggers\nrizatriptan\nboymelgreen\nladele\nmountian\nstratix\nhomeworkers\nophone\nspringview\nunrelaible\nhendeles\nshabiba\ninsatiably\nkrcmar\nlended\nverrue\nhazinski\ngastrell\nmoqueca\nhasbrook\nsinisgalli\ngouil\nsexpert\nhostellerie\nkunowski\nflattener\nproclaimer\nvonna\ndiversões\nwichai\nstilll\nscharfman\nasanti\nunstirred\npsycological\ngushchina\ncrappiness\nweila\npingeot\ndenevan\nfalaya\njudies\nswoopy\nhullermann\nbelski\nbeefsteaks\nbantham\ngrynbaum\ngger\nfaasen\ndinking\npresgrave\nwoywitka\ndemurral\ntelekenex\nbeechhurst\nsgam\nkreuder\nfronzoni\nberkan\nlinkexchange\nsurgey\nyeslam\nhelliker\nlutts\ninventiv\nzuckoff\nnujood\nswigging\nzmajevac\nrejectionism\npushovers\nbrandz\nmbhazima\nhirotoki\nfeenan\nhadithi\nbeleifs\nregardt\nluraschi\nsilnov\nkhanu\nfahal\nkeis\ndemoro\nhausch\nnemukhin\nershadi\nloiters\ntramain\ninfosecurity\nalligned\nvalinskas\ntubay\nstinziano\nzega\nfranchisers\ngarbagemen\nchikashi\nremedium\nsalems\niristel\nfagiolini\ncountersuits\nperng\nmullighan\ninteroperates\nunserviced\nxohm\ntiantan\nmargis\nbrochin\nphakdi\nstensby\npossingham\ngdrive\nshierson\neboue\nbamji\nracialised\ncolourways\nmecchi\nkarnig\nrampino\nyosfiah\nrepresentativity\nbankrolls\nchildwickbury\nujwal\nrescuees\nanaglyphs\ndelvon\nblechschmidt\ninterupt\nreglazed\nornateness\nascentis\nwe
sbanco\ndisillusions\nraitala\nwatsham\nlewith\ngreep\nbandoleer\nvidarsson\nforswore\nmariajo\nsaké\ndgat\nbootlaces\nenthusiam\nbirchin\nslawek\nroshambo\ncordain\nmiscued\ndiazoxide\nwride\njudgmentally\nhargeysa\ndjindjic\njuszczyk\nguja\njinglian\nbudoff\nescalope\ngpnmb\nvalesca\nkoldyke\nkassal\nmotorscooter\nthesame\ntrainwrecks\nbikies\nabich\nliveperson\nplaydates\nsleepwell\nmarjina\nnatascia\nponderosae\ntueller\ncandil\nshabati\nbukky\nmurco\nnimbyism\ncritizing\nzalai\nxharra\nmashenka\nmitek\nbouzoukia\nsimplifly\nginzler\nfirouzabadi\nalfacar\noarlock\nroadwater\nsnss\nwaluyo\ndeerlick\nflushable\ndolloff\npersol\npenstemons\nlidlington\nsweary\nschickner\ndumezweni\nnewstands\ngalatoire\nunderfinanced\npistou\ncornrow\nlynell\nrenewableuk\naeronwy\nrumrunners\ntirpak\nfuturebuilders\ntorrenting\nkyvig\nbanson\nspoonley\nohkubo\ndwina\ndiegos\ngrost\ncabestan\ngarger\nkarmically\nokayo\nsoundbytes\nsakine\nkatesgrove\nicea\nkcha\nundiscovery\nparanormalists\nnetd\ngitenstein\nelgol\ndadkhah\nchecchinato\ndealogic\nmaryburgh\nrudland\ntencer\nzeckhauser\nsevruga\nhonoraries\nmadawi\ntihnk\nnaqoura\nfingle\nhervas\ngisi\nmarozzi\nbuldak\neuthanization\nkhidasheli\nvenomously\ndustproof\ntroha\napstein\nmargreet\npushpins\nraudnitz\nclachaig\nkyaik\nmedcup\nmedupi\nbtselem\npolitcs\nnaret\nvoitenko\npervitin\ngodello\nsekope\nmspp\nsemiliterate\nporchia\nthunderclaps\nreeducate\nbepler\nadrover\nchickenshed\nyachay\nkinyongia\nroseburn\ndykers\ndaneil\ncassandras\nmilota\nzulily\nkepplinger\nlemmerz\nirec\nasre\nbissoon\nequusearch\nkriston\nprevous\nvoos\naberdein\nborozan\nwiscombe\nyonadam\nnimat\nakhtiar\nnons\nstudywiz\nrideability\nfter\ngorslas\nclottes\nklepsch\njansens\npritts\nkannat\ncollosal\nmalde\ninhabitated\nloghman\nchalbi\nautobytel\nprogramer\ncreutzmann\nenaliarctos\nquinny\nplasticky\ndooo\nhawr\nhalvah\nminces\nffrancon\nepeius\nhiccough\nashkenazis\npostbag\ntaule\ninsoll\nsupl\nkiljan\npopetown\nskellow\necoa\nupholster\nnapalmed
\nbankton\ndiosmin\nantopolski\nyarom\nmcos\nstulen\noverproducing\nmirlande\nthrogh\nfloralies\nboilersuits\nborré\ngevisser\nkahwaji\nbalduzzi\nmilevsky\nopiners\nchirrup\nolick\nintensivist\nshakepeare\nroueche\nscrimping\nfebuxostat\nwijesekera\ndrinkstone\nmavrou\nchibli\ndufallo\nspso\ndelbello\naimen\ncerino\ngolfclub\nelate\nkiwibox\nmilkin\nguban\nngagi\nmtfs\nhumbolt\nalivio\nraudenbush\nhoush\ntickertape\nwinyates\nsabbar\nleardi\nbyrant\ntrackstar\nadbullah\nfanore\nroopnarine\nbadders\ncarere\nblanchon\nyech\ngarech\nselkoe\nkurvers\nyazigi\nkuzak\npossilbe\nuncontactable\nmarylene\nmtls\nsquidgygate\nscowler\nwheelarches\nchaharshanbe\nschlump\nkilnsea\nleatherbound\nediets\nrediess\nliagre\nangelil\ndisenchant\nasisi\nanginal\nabsconder\nhwg\nhootenannies\npungently\njeremic\nwashpot\nzortman\nseddiqi\nindiglo\npalmerola\nstalemating\nshurtz\nreburying\nprayad\northopedists\nhardstandings\nphotocall\ncortesia\njanovic\ngonjasufi\nmoredon\nwhetehr\nfinavera\nkaresh\nsiaca\nmassman\nperiera\nkyuma\ndelepine\nguiver\nmadhavikutty\nendograft\nvasarhelyi\ncloudbursts\nrevaluing\nwihongi\nsandies\nmadut\nndahimana\nypb\nmikelis\npuppeted\nkentrell\njoueuse\nqunli\nujima\ngurtner\nrodic\nbraig\ncushnahan\ngjelten\nresidental\narthroscope\ncuckson\ncrfb\nbalila\naudix\ntehachapis\nircx\nviragh\nkatiuska\nhamzawy\nrakiya\nconaton\nbaydemir\ncamauro\nterreiros\ncarepa\ngurmail\nfixs\niaci\nsquirreled\nxochi\nemptiest\nsnorkellers\nmatildae\ndryships\ninhope\nparales\ndimitrakopoulos\nweinhaus\nbarkindo\njappy\nmarzaroli\ntrichloroethene\nkillerbee\nworlaby\nchilaquiles\nlimeuil\nsandrak\nbamp\npangkalpinang\npignatiello\ncanovanas\nogbeche\nnekrassov\nberresse\nterrón\ntriantafilou\npuska\nbluechip\nyaung\nlyddon\nschissel\nkpaka\nojea\nattuning\nboshoku\ncatharpin\nmortgagors\nchikilicuatre\nbedposts\nurbanising\nmonchi\nmosys\nhamsi\nguillebaud\ndpni\naminatta\ntribalistic\nbiogeosciences\nsastrugi\nracecraft\narriviste\nboemo\niguatu\nbonafides\nschev\ncardi
ogram\nthromboses\nharlo\nconfrère\nscorey\nfestuccia\ntorcetrapib\npendrey\nnlpc\nniec\nsharkman\nbluebay\ncpag\nbresh\nbanciao\nnachtrieb\ndumbs\ninstallaware\ngawcott\nbroadline\naltace\nloxapine\nsokoloski\ngreenburger\nrampell\nhmma\nauradou\nfinansbank\nschwenn\nmulleavy\nentitiy\ngoonhavern\ngwava\nalatar\nvvel\ncgx\nvinovo\ntanor\nmuamar\nticketholder\nmislan\nproble\ndemelo\nnounou\nbhumjaithai\nconisbee\nmackenson\nradrizzani\ntouby\nbradnock\naminzadeh\ngrassa\nbctia\nchavkin\nadultress\nzettabyte\nendace\nkreager\npoliticizes\ngenmar\nshockable\nlukeville\nsoccor\npesaresi\nbrightener\nbriger\nandouillette\nalstonefield\nhyperekplexia\nvidoes\nputze\ntopcor\nceatec\nbaldfaced\nxianglin\njoette\nwalead\nbikepath\nreithofer\ncasaca\nilco\nmonotheisms\nlalish\nhydropolis\nfilmart\nknupfer\nunpayable\nlabon\ndistils\nkurochkina\nsucharita\nglamorised\nustia\nschuenemann\nboissonnault\nwikitude\nencinar\nmfuwe\ncruzcampo\nunresponded\ngokcen\ndelouche\nrashon\njiarui\nklaric\ncompeau\nneelys\nbodymoor\nvirologic\nrollens\nieke\nripperton\nampro\nrezk\nboughanmi\ngolvin\npolarises\nacharn\ntarceva\ndlife\nbelluso\nfreddies\ntownsel\ngastronomia\ndyde\nsuchon\ndependancy\nmalintent\nyarg\nflimsily\ndhanteras\nattourney\nnovitzky\nsomatropin\ngribbles\ncutbirth\nhusseiniya\nmeanin\niphoneography\nsukkiri\nbotty\nunicar\nlavandera\nchinstraps\nvadheim\nkatoucha\ntuwaitha\ncutié\nsteadings\nruven\ndeysi\nmillmount\nkabai\npayard\nbluebloods\nguisachan\nrouba\nkishigawa\nplumbridge\nsquance\nvivalda\nanhar\nbusnes\nlevdansky\npontarelli\nrfqs\nrahmeh\nmuqattam\nwindchimes\njoyousness\nburnim\nwaterreus\nsprick\nworthville\ndulmatin\namorn\ncremades\nurbik\ngartenberg\ncyberbullied\ncowarne\nhertzel\njontz\ncamd\nspeechify\ncolonialisation\nchwarae\nchesimard\nmiltner\nsuraci\nandreevo\nabacos\nimmunisations\nmlynar\ncardamoms\nfunkeys\nshibulal\nbucketfuls\nfusillo\nkubatana\ncandomble\ngenclerbirligi\nshoplifts\ncarmenere\nyurkov\nunassailably\nhanatziv\nsakhan\na
ctimize\nsnarkily\nrafii\nbilsen\ntechniquest\nklaven\ncorble\nhejma\ncraigforth\nglynns\ndidja\nchargin\nskateland\nelegants\neliyah\npicarelli\ngeuk\nnlng\nsartorialist\nburrett\njanamukti\ndelucca\nisaakyan\ndesig\nsmikle\nwherryi\nteewinot\nboggiano\nmerebashvili\nbaccari\niwanyk\nantibullying\npuentevella\ntelikom\nmawkishness\ncandystripes\nslippages\nbiosense\ncrazytown\nbcfd\nfahimeh\nbeetv\nsmmt\nprestat\ndonerail\nunderproduced\npohanka\nsawhorses\nlanzano\nroner\nmatonga\nstraigh\ngrodstein\nlahkar\npiluso\ntauwhare\ntweenie\ntownfield\nglenshane\njobster\ngazenko\nkuraray\ngooda\npetrichor\nteithio\nbomarito\nhivert\ndochfour\nkopicki\niezzi\nbinoria\nexteme\nscillies\nakrour\navivah\nuncaptioned\nmurgo\nrefueller\ntranport\nneustrashimy\nhasenfratz\ndejeuner\neniko\nlimitlessness\nhardlines\ndensey\ntapies\nhamngatan\nmcowan\naramingo\ncreekbed\npetronet\nstournaras\nshihong\njumaane\nfalasi\ndashawn\nartemisinins\nleipsig\ngranadas\nmuder\nmanhattanization\ndeuser\noutrebounded\ndownrated\nclaverham\ntincey\nputtanesca\nbezhuashvili\nnaughtier\nsdmc\nedfors\ngrona\nanimatedly\nsterl\nreleasers\nadalsteinsson\ngoves\nhazboun\ngovin\nrestiveness\nglobalhue\nvlack\nceraso\ncurser\nvadon\nextrodinary\nisce\ntuteur\nghwb\nhillpark\nkocieniewski\npsychoanalyzing\nidealise\ngrasmick\ncavalia\n,why\ncrossmembers\nhutching\nkodes\nsindou\nbardala\njurika\nfrene\npylypenko\nsabaugh\nuscap\nsalarzai\nessentialized\nderessa\nbucio\nkhimar\nvalerenga\npriveleged\nwfsc\nkominski\npantyffynnon\nislamics\nlemonaid\nheteroplasmy\noresko\ncommoditised\nnijpels\nshhhhh\ngourville\nfasol\nmogden\nculpably\nazango\nblackmill\nloquet\nliasons\nbelieveable\nkappl\nplagiaristic\nharadasun\nmoretaine\ndoger\ngonis\nnomvete\npassionata\nflaks\noverpumping\nsnedker\nsolovtsov\nmacefield\nhimanta\nmadhes\nhradek\npatronisation\nstrenghten\nluebbert\nrightwingers\nbonesmen\nmatovina\nfiliu\nbachhaus\nbarasso\nvanavara\nwhiteson\nriiiiight\nbordyuzha\nnikolaides\nmocka\nspookiness\
nvideojournalist\nlarosière\ncreditsafe\nfetishisation\nbumf\ndropoffs\nhamour\nconlig\nwilhelminian\ntitla\noleocanthal\nrelling\nundergound\necomagination\nsieradzki\ndebellis\npliancy\nmatschiner\nmanilva\ncarradice\njianlin\nneoris\nrollyson\nkotooshu\nprospere\nlavier\nrakhmon\nskarnes\nwestcotes\nmayali\npurecircle\nchakales\narand\nliac\ncottageville\neastender\nwogau\nderawan\nmisogynous\nmowaffak\nbedjaoui\npokerbot\nsquaresville\nkwawu\nblogsites\nmustaffa\nsprizzo\ntesha\nretarted\ntsala\nemploye\ntouchlines\ncomplacence\nstelmaszek\narnisdale\nackford\nrcpe\nnaimatullah\ntannochside\nspareribs\ntransfix\nhunta\nprosperously\ntaraborelli\ngavronsky\nshirehall\ntahawwur\nndungu\ncopti\nristuccia\nmediacity\nnicolien\nbiox\nanaesthesiologists\nmakimono\nhuben\nstrul\nzaccarelli\nteibel\nfilipek\nunfavoured\nanatolevich\nvitezslav\npoketo\naurumque\nnucifora\nsponser\ndonckerwolke\nsalf\nneuroradiologist\nwerners\npohakuloa\nlazich\nkwangmyongsong\namimoto\nshld\nfthiotida\nerlick\npleasurably\nopco\nnyuon\nlllt\npakey\nncdex\ngediman\nmoonscapes\ntransylmania\nveluppillai\njhangir\nasirt\ncaip\nvaccae\ncoracora\nbraslau\nrendani\nnanospheres\nferver\ncaplen\ngeodon\nmetalloinvest\nbragas\nlamby\ntesso\nzeroville\nsitv\nvantis\nhopefull\npolewards\ncirumstances\nbeyrer\nwermers\nexpresed\ncharel\nfabros\nswagel\nbichot\nedventure\nbenice\nlagha\njemil\nvitullo\nteessiders\nbruery\nglaramara\nassani\nsepco\nbronzetti\ninverne\nkalubowila\ncotrell\nidirect\neipr\nweighlifting\nsiezed\nrynard\nafosr\nzhifeng\neliyahoo\nbeyah\nimperiousness\nsunlounger\nzeleza\ncephalalgia\nsaxbee\nloughlan\nadditude\nnirex\nzuckschwerdt\nsititi\ntegretol\nvagit\ngentrifiers\nriverfronts\nwaldouck\nmandaville\nmakashov\nhairiest\nnontropical\ndornie\nbeltinge\nnory\nlakovic\nnunchuks\ndemonlover\nvaharai\nabbatoirs\nshimmered\nmicrogen\nbroux\nissuable\ngecamines\nmcgalliard\nsaih\nclicksoftware\nwildner\ntarsa\nannalie\nsplaine\neibhlin\nfanchini\nkousser\nyingchun\nevri\nbutti
n\npepitas\niawn\nbernell\ncheapoair\nmanir\najiboye\njemile\nthamilselvan\nsadovnikov\nriddrie\nmaybaum\npfh\nvecernje\ndipuo\ndesease\nvandetanib\nconnollys\nvoras\nlinchmere\nrhosnesni\nknowning\ngaslit\nspallone\ndysentry\ndorenko\nartsiom\nlindloff\nkafia\nyueda\nsealock\nsaikawa\nbywords\nuncurable\nicast\nzengerle\nkaemmerer\nfusillades\nmissingham\nhereditarians\nagudio\nmcdow\ncauseless\njoklik\naprutino\ncesars\nsekeris\nadmitt\nspowart\nwinterrowd\nkemy\nbyamugisha\nsensat\ncarthorse\nqaza\nhuzarski\nballyskeagh\nblockout\nbarage\nassads\nceic\nkroenig\nconsert\novergarment\nshagpile\nbremont\nelvanfoot\nkorinna\ncilfrew\nbloche\ndevelopements\nnormile\nhighballs\nviavoice\nhennessee\nagualusa\nblinis\nbartlit\nxynthia\nborsetshire\ngiquel\nmoradian\ngeohot\ncardiocrinum\nslesarenko\nromelia\nyoncheva\nvinexpo\ncorretto\nzugara\nadjuvanted\nloiko\nsaglik\nbadriya\nrodberg\nrenditioned\nlorio\nbeermat\nretracement\ncenzontles\ncrookhill\nfloca\nklimpl\nseikh\nlaxly\nparron\nmenoken\ngotu\nfarmfoods\nfamili\nbdbd\nbirminghams\nwestlakes\ntortolero\nhmps\nwaymond\njavdekar\ngoldup\nkashflow\ninciters\nlogicomix\ngymreig\ncogill\nabbos\nbarbaran\nintemperately\nsedale\nduwe\nholodecks\nstockage\nsocias\nairball\nmetatheatrical\nravaya\npreventitive\nbaxtergate\ntoltz\nagway\nchaple\npekkanen\ngoaa\neverlyn\nshifeng\nprojectable\nmailpiece\naggrandise\nmarkum\nstevies\nmadelein\nlipstein\nconwood\ncoifs\neckrich\nfilous\ntabra\nwoodlyn\novervaluing\nfutalognkosaurus\ncosmopolites\nserratelli\nmikhalkova\nsytle\ndorricott\nnummerdor\nprawit\nbullrushes\ndevonians\nlaperrine\nlopiano\nbrussell\ndarui\nreachmd\npottker\nxiaofang\nchawke\nfoody\naidala\nsteall\nhaithem\nimmunizes\njinking\natacks\nfanan\nmadders\nbenchellali\nokiro\nhabr\nkoznick\nreligios\nyawney\ndillonvale\ngawkers\nobviouly\nngedup\npyromaniacs\nedou\nclipson\nhearby\ncomenzar\ntelpuk\ndualstar\nlandra\nmichelene\njoaillerie\nexpiries\nmoisturising\nkinter\nlauders\nshouls\nlankster\nmclamb\ns
cws\nvoeller\nvapidity\nrozabal\nzhiyang\nacbj\nzinkin\nvereecke\negeli\nunderhood\ncontrarianism\nradicality\nanwick\nwaverunner\ncataphora\nkennelled\nsufferage\ntonsuring\npipas\nmanouevre\nyingpan\nkipkelion\nbiodegrades\nnitrofuran\npasona\ncacheris\nlokahi\ntehani\ncrinis\nmemorialisation\nbiatch\nwallbangers\ndasey\nguanta\nidom\nclyth\nfluffiness\nshoebuy\ntrusnik\nserraglio\nmorecombe\nservicemagic\nspirk\nmetaldehyde\ncomco\noscs\nnarrenturm\nsamasource\nhypotheca\nhorwell\nshuqin\nvikan\nwomanliness\nclubbable\nsatified\nsipps\nlipotes\ndancedancerevolution\nrehabiliation\narcep\nrondavel\nvellanti\nsemisubmersible\neskendereya\npashler\nsweetenham\nheulog\ncrociera\nantonetta\nrety\ncutkosky\njaoude\nintegrys\nwinemiller\nsingace\npastre\ndyrdahl\ndrinkall\nelefantasia\nmaddalone\nastree\nkilchurn\ndomzale\nheshka\nwalkerburn\ndabek\nconeflowers\narbez\nsarach\nanxiolysis\nkoornhof\ndilhan\npranknet\ndupnik\nlevocetirizine\nquippy\ngallach\nnovosad\nfatau\nthme\nhilander\nyasim\nsnowland\nreatards\nbaghdasarian\ncruiseship\nbedward\ndenverites\ndeadnettle\nhuisen\nsestinas\ncervenak\nxiyun\ndisneyfication\nleles\nrichwine\nwinghead\nfedexia\nsecane\ndecsion\ncheonggye\ngrisogono\nberdovsky\nböögg\nmislabled\nsemc\naabid\nseselj\nifwla\nchouet\nmölzer\nbarltrop\nkolath\nshettles\nrivetti\ntvac\nnonethless\nlobaton\npaytv\nextentions\ninvididual\npiecha\nkuzumaki\ndrivelines\nbiddinger\ncrestor\nwakamaru\nasphyxiates\notryad\ngililland\njiantang\nitune\nmaalox\nrollnick\nbarquet\nneointimal\npermanet\nparisella\nretrenching\nkuniholm\ninurned\nmorbier\nkokosalaki\nfoing\nloepp\nshanaze\nchastang\nsutent\ndotmobi\nakaev\npinehills\ndownview\nsuccesor\nsiebers\nfordo\nfakahatchee\nmankovitz\nhorava\ncontrovesy\ncaamano\nqaba\nmakovetsky\ngrieger\nhellwege\ngelfman\nfederalised\naffenpinscher\npullitzer\nchapoutier\nshemshak\nnefariously\nshetlander\nvergnaud\nxross\nmouland\nnavdanya\nbufori\namdh\nflimsiness\nriyashi\nrobinzon\ncasualisation\nyamahas\ntruck
driver\nizt\ndiffernece\nbohrod\nsesiwn\nschmeer\ntomasek\nguseynov\nfirstlook\npolytunnel\nminhua\nbleiker\nyulianna\nxtracycle\njimsonweed\nmuttar\ncannister\ntommey\nalitha\nsiviter\ngristede\nverdel\ntaughmonagh\nsalwan\ntayal\nbeutiful\nzixiang\nshowpeople\nahmadiyah\naganst\ncederbaum\ndatejust\nnewspring\nwreyford\nhypogene\npwns\nzenkoji\ndjukic\nsupergran\njaniga\nharangozó\njserra\nshifflett\nborans\npoliticas\nalreday\nknehans\ntjv\nnoroviruses\nanythng\nmanhattanite\ndehumidify\naguamiel\ncutud\nahaa\nunforthcoming\nbilliot\nephemerally\npoochera\nhiseq\nbladesystem\nhaowei\nlucques\npartlett\nshinder\nfiegel\nmiskeen\nallenport\nrahav\nnacotchtank\npoising\nwhessoe\nhansenet\nbrackenhurst\nsamething\nwavemaster\nashgar\ntaloga\nbusienei\nsucessfull\npantsdown\nmendia\nwarefare\nchinotimba\nhimali\nrhidian\nbargainer\npoze\nevolt\nretrogrades\nenteromorpha\nswitzers\ncaldero\nvradenburg\nmaatta\nbrempt\ncodlin\nkelci\npechonkina\nneurosonic\ndreckman\npulteneytown\ntidjane\nwinforms\nralbovsky\ninverewe\nbayaa\nregisterred\nbebes\noaps\nmocumentary\nmutel\nbasware\nmcslarrow\nanyhing\nmontavon\npenwarden\nmarival\ndigiboxes\nkhanjani\nnerac\ntelsim\nbenimon\nfadilah\ncourances\nmongella\nyiqun\nzezza\nwazza\nelburton\nlimehurst\nsonglian\nmahmudiya\nspaceworks\nfitzer\nscce\nadlink\ncullings\nstormiest\nlifeteam\nloulis\nimplictly\nprosection\nlittig\nchronotherapy\nphytonutrients\nrazzoli\nchilevision\nikigai\nbernhards\ncnpp\nmbita\nblacklow\njoley\nkaradzhov\nbohanan\nprescote\nmassas\nlogons\nitapecuru\ncorsages\nkabbaj\nwimblington\njeanene\ngatoroid\ngrzegorzewski\nfabc\nstoneybridge\nweepie\nmutlak\nkapangan\nrefilwe\nfroing\nchibuzor\nbaycare\nksor\nvoxofon\ndoenitz\nmediagenic\ntoryglen\nrabiaa\nsibelian\nplanethood\nkraayeveld\ndatasite\nmeistrell\nmoonbats\nfloreen\nawah\nharkrider\nprogrames\nwoodfarm\nmalakhit\nkrinn\nettedgui\nendorsees\nfxfowle\ninteva\nentertaintment\ncsrees\natiyyah\nirida\ncoze\nleshnoff\nckers\nkairelis\nwurtland\nrueg
amer\nhirschson\ntheirselves\nogbuehi\nsighter\ndetatched\npioneerof\ninterspinous\nlowlying\nunderseat\nbasaev\nhealthday\ndeliberateness\nsynacor\nmycerinus\nuzochukwu\nmaarsbergen\nwellons\npenot\nfsai\ngavottes\ndaguang\nkookiness\nvilk\npissaro\nmervi\noldak\nfiryal\nmsmb\nheartiness\nbrakey\nwinney\ntyri\nhallal\nstrombergs\npavs\nspvs\ndumex\nmetais\nschweid\nsignorello\ndianabol\nastv\nsystematising\nmaharajgunj\nwoodwell\ncapodice\nrustbucket\ncrosslands\npierceville\nscotomas\nmariahilfer\nolusanya\noyneg\nfantayzee\nwarungs\nsrdc\nparkhall\nsuccah\ntreese\nkiambaa\nvagar\ncartney\npntr\nchuanzhi\nkaramanos\nkosk\ndoubleness\nconverteam\nbisri\nvulnerably\ntalone\nadazi\noveract\nnassef\nbayev\ntimochenko\ncurlier\ntetrapak\njarolim\nlienemann\nhasenauer\ninelegance\nculturemap\nhunching\nfreeper\nlinvoy\noverdog\ninterraction\nbiale\ninsyte\ngarlanding\nseductresses\nculbin\nsxrd\nbedpans\nbefoe\nafns\nhatered\npetaquilla\nunremunerated\nditan\nkhubani\ncallian\nvoelpel\ndolnick\naverkamp\nwarrantee\ngliebe\narmholes\nsylven\ntreyford\nreeders\nnudiflorum\nreddall\naddf\nvedior\ncollora\ndarva\ngoepel\ndumbfounding\nsummatory\nigaya\nxizhen\nhatcliffe\negtc\npliskova\nduing\ncorff\nsubsegments\ntyms\nmotomi\namaturo\nkroszner\nsaadane\nxiuhua\nrousted\nhaini\ndemoralizes\ncedras\nsouthhampton\ncambers\nkurvin\ntamuna\nungrafted\nspotlit\ncripplingly\nfeyerick\nmusicdna\ngrovesnor\nrobaire\njouvenal\nhfw\nhargadon\nhuels\negadi\nhemodiafiltration\nbrason\numiastowski\nbettola\nsensecam\nzmijewski\ngaraging\ncybersource\ninterloping\nkomlo\nmacerating\ndemonstates\nberjon\nstarblanket\nwapshot\nabsentmindedness\ngruet\nhonorariums\nukla\ntocai\nquetzales\nshapton\ndeisel\nnanyn\npamer\nnosanchuk\nbinki\nkilgoris\nhaieff\ntrilantic\ncraycroft\nstrobeck\nmanag\nunappealable\nvilage\nachiltibuie\nbrysac\nrassan\nstitchery\nsovsport\nwfdb\nbastad\nbascules\nbarbeito\nnorrkoping\nzakher\nsomerfeld\nloyaltyone\ncruzin\npancreases\nzocdoc\nelterwater\nhardees\nall
one\nrealscreen\ncorkerhill\npleitgen\nlibonati\ndoescher\nhcng\ngissara\nabeykoon\nkhonji\nwideouts\nlittleneck\nfluky\nmakori\nmscc\nfomer\ncrusat\naristedes\nvwap\ncitl\nahps\nsudarto\ndefeatists\nbermoy\nblissfields\nszucs\nsquillo\nrinderknecht\nribowsky\nllareggub\njenkens\nnevan\njesselli\nxishun\nouput\nworktop\nlaskas\nunscientifically\niacovou\nmauerpark\nnarked\njayner\ncosimi\nmoap\neuropejski\nbiopreservation\njunkfood\nnonrecognition\npietruszka\ndeglobalization\nlclaa\ntsukerman\nhatswell\nstephanotis\nleptokurtic\nebertfest\nkigozi\ndeahl\nrisedale\nbukharans\nforhan\nrufolo\nmanageably\nabseils\nhalilovic\nshagger\nsportech\ncongruently\nilliberalism\nhansgrohe\nmyren\nwestreich\nbookrunner\nkicinski\nhadschi\nseebacher\nemersion\nsaylors\nabayas\nwhimps\ndayani\nsladden\nhodierne\nmuron\nlutvann\nkhazir\nnyamuragira\nxianju\nflowe\nfalsies\nardkeen\nkoeltl\narnesto\ncompetitivity\nvook\nxisto\nvirtuozzo\nartisphere\nexceptionalist\nrmeish\ncolodny\nbaldknobbers\naspinal\nantireflective\nmorozs\nleibrecht\nlawzi\nsidestick\npretendo\nmarze\nkantaoui\nhellshire\nwonkish\ncastleblaney\nacupoints\nllanfrechfa\ntegler\nacams\nhushes\npdna\ntuttino\nmacchiarola\nslenderest\nncls\nfischmarkt\nlovably\ndejene\njelger\nsantibanez\nuriona\ncreaturely\nicomp\nunexpended\nbindschadler\nelavon\narberth\ncuiaba\npanglossian\nstoffregen\npikermi\nepron\nnirah\npopline\nkriikku\nhaean\nlacosse\njines\nssdd\ncopaken\nziane\nunzoned\ndatadirect\nalenius\nsolskjaer\ndecissions\nsevelamer\nbelsonic\nposchmann\nchoas\novertired\nbazira\nlhtec\nofccp\nstickk\nbossou\nrahlves\ncegelec\nwanqing\nmuslima\nlmrda\nspeakerboxxx\nabdulelah\nsaslaw\nindusty\ndifalco\ncarrigg\nkoronka\nccrn\ngerboise\nkhudhair\northotist\nterwindt\nrenardo\nsubtil\nriversimple\ndigex\nalwall\nauthorites\nsmartcode\nskeeball\nkuijt\nmegrim\nhamiton\nchatrapathi\nrepellants\neimbcke\njablow\nhookstown\nkahoot\nahlemann\nclarian\nackrill\ntheocrats\nnaidenov\ntruvy\nbeseiged\nmtec\ngroupo\nsabeeh\n
kravice\nvalleymount\nwustlich\ndermoscopy\nsubleasing\nhirni\ncangialosi\nmutators\nkasteli\ncontactpoint\nfearnall\nshalvoy\ndarline\ndciaa\nelectroretinography\nprobem\nbindeez\nverkamp\nnorvir\nfiberesima\nkimeu\nschulweis\ntokoyo\nkatija\nnehmat\ndavaar\nironshore\nbisoli\ncbcc\nmohieldin\nsalinomycin\nintihar\nmarynell\nfeichtner\nbaseley\namirshahi\nkabarebe\niteself\ndrosselmeier\nngvs\ndogzilla\nreckard\ncharlston\nchimpy\nmarksburg\nmelisse\nbogdanich\nwebbwood\nrzn\ncornworthy\nwitanhurst\nfuseproject\nimmunotherapeutics\ndahlander\ndeps\nhamblet\nbratza\nscaffoldings\nkleisath\nreath\nhoverport\nsantaris\nporteno\nsunsweet\nphysios\nmevel\nsths\nharmy\njarnigan\ngillespies\nalteplase\nhillion\ndrewer\nnonusers\nlabberton\nrimoin\ntoepel\ncompartmentalise\nvigário\nchifflet\ncorreze\naltuzarra\nbeloglazov\nvescio\npailler\nrealtively\nvividus\nabdolhamid\naacd\npolycap\nhuibert\ngnassingbe\ncompulsiveness\nplaceless\nunadvisable\nbayliner\nmarife\ndatamart\nlackritz\noversell\nflavoursome\nunpeopled\ngualdoni\ngilliand\nmedinger\nmaketa\njaszi\nminibonds\ntapfuma\nblarcom\nworgu\nbriliant\nsangtuda\nabusadora\nkimberle\nkirna\nericcson\nithaa\nstaceyann\nbohley\nnony\nfettig\nsporen\nrfog\ndirectline\nbloodgate\nrubeck\nsaiid\ndownslide\njianxi\nlhj\njoachin\ndruin\nsuntrap\neligo\ncowhig\nxhp\nderow\nmusclemen\ntakling\nultrium\nllanmartin\ncusma\nzumodrive\nsoaries\nniebler\ngeddings\ndjamil\nbraford\nshalonda\nprecrash\ngoup\npcff\nboubous\noverpoweringly\nlanzillo\ndelasin\nlonquich\nshavitz\nswanke\nstoffers\ntrifold\nmatusiewicz\nreetika\nbaberton\nhimmelsbach\nichimoku\nnwpa\nkahmunrah\nunpopped\namio\nbostjan\nshezan\nglutting\njuanas\ndagres\nrifamycins\nstumpo\nreanalyze\ninvoltini\nlemarquis\nobenchain\nedrisi\noneupmanship\njeake\ndowes\nkelyn\nmaretta\nmiddelgrunden\nardia\nplushy\nrightfulness\nmanuokafoa\nultram\nsipacapa\npropylaia\nrubico\nwetbacks\nsinnreich\nforlenza\nhibdon\ncybersquatters\ncandelight\noversleep\nsuchak\nbgas\ncontesse
\nprotrayed\npetryshyn\nquotron\nkhemraj\nsumir\nsubliterate\nouaddou\nnwfz\nislate\nlahno\nzinfandels\ntidc\ncieply\nmehrin\nvacquier\nlaingen\nnonperson\npesamino\nrenze\nnanomech\noveraggressive\ndeutschkreutz\nlempertz\ndiscardable\nwalsleben\nnoncriminal\nraleb\nhoppes\nmocek\nwhiteouts\nancrod\nbronington\nnasic\nardaloedd\nsumatrans\nfightmaster\nescalettes\nbialecki\npteropod\ngulson\nblackwatertown\nwithcott\nmariqueen\nwalaa\nonychectomy\nmahayogi\nknockback\nnisbah\naberglasney\ndistruction\nlymelife\nauldgirth\nmischievious\ncloserie\nmedra\ndiovan\nsncr\nmilitarizing\ntattum\ncyren\nfulled\nleatherjackets\nlatiker\ngreenthumb\nburleith\nsandeel\nmuhith\npunctuational\nminerality\nwesbury\nsalavati\nnautile\nweingrad\ncreekmoor\nruds\ncorniglia\ndrooled\ncoovert\nremeron\noverwelming\ninshas\nbiodigester\ndeseve\nkrogsgaard\ntalba\ndormady\nriduculous\nchichvarkin\npfieffer\ngwrtheyrn\netran\nbgmea\nfouchier\nhmongs\nblaszczyk\nmysky\nmastercards\naillagon\nlatinojustice\nmtap\nmandelker\npagola\ngurjeet\n,just\nenvira\noxpens\nmaconachie\ngkss\nnankivil\ngaraufis\nkatsuhide\nbrymore\nfingest\nalwiya\nmorfill\nlibertyland\nromines\nsquaddies\ngillete\ninsalaco\nrusanova\npccp\nsdku\nbyplay\nconscientous\netxerat\nsuellentrop\nmaisemore\nmayskoye\nwellsite\nwiracocha\ngarmoyle\nkifaya\nhyleas\ntawanna\nnamiq\nboultwood\nrandalf\nspeich\nmaigari\njyll\necharte\nscharffenberger\nmicrovessels\ncraigshill\nculik\nieroklis\nplene\nquokkas\nstamshaw\nfedotowsky\nsarande\nnevils\nvneshtorgbank\naltentreptow\nbloodiness\ncinram\nfullana\naysan\nlipless\nezralow\njamkaran\nchatzimarkakis\ndurabolin\nfarrellys\nglyburide\nnordenson\nxhij\nelist\nmarudai\nkanjobal\nulatowski\ntamarinds\ninveigle\nancroft\nprovde\naaaai\ndatson\nrossmiller\nseitaro\nzenzi\nthundery\nwtlr\nbeadboard\npinnochio\nsandbot\nkettl\nbeeney\nhankered\nsanil\nperfectmatch\nchrystine\nintertechnology\ncrucell\noritavancin\nzombiefied\nmakarezos\nhamudi\ngiroir\nsussing\nsullenger\ndinops\nlepe
re\nrestoin\njinxes\npremera\nsabrah\nsnookums\nelitebook\nduverne\npfrda\nvoluptuary\nsciolino\nsolarreserve\nrubinfien\nvistage\ndepaoli\nmanios\nsesat\nchirs\nfridovich\nbarhom\nbluehouse\nhighter\nfarbod\ncorkman\ntorquey\nuntraveled\nqrh\nblist\nmedos\nmobsby\nmuscularly\nprorsum\nhalischuk\nsunsmart\ndaffin\nbronzer\nwhizzkid\nyowl\nsvitak\nauslin\ntrallwn\nwolviston\npuempel\nprosumers\nbazian\nappeldoorn\naanp\nabdisamad\nmapetla\namosite\nhalawani\nyoulus\nbabinsky\nudderbelly\nmichellie\nisatabu\ngpsd\nramim\nhaliru\nsumara\nmultiagency\nkavos\naneez\ntallini\ngarimpeiros\nteeton\nitss\nalcantar\nregazzi\ngooses\nshucked\nsnippiness\nstufflebeam\nowaissa\nrockerz\nzalingei\nadjudging\nkluth\nconkle\nchlöe\npolsloe\nsloma\ngilberg\ndangermen\nbloomwood\nengwall\ngondorff\nperplexingly\nflabbergasting\nindeedy\nwebloyalty\nlumbres\nfaqiri\ntibolone\nautodata\nwpni\ninflamitory\nsheridon\nkidspeace\nklean\nurica\nmuelheim\nboaty\nsandidge\ngelnovatch\nhaggie\nfinnaly\nnorinchukin\nbiogeographer\nbranekov\nfiquet\naddictinggames\nelwen\nbarzanis\nmashamba\nhatemonger\negroups\ndesruisseaux\nlapinsky\nbeeber\nrivarossi\ntiddlers\nmcquary\nshareece\nnonparty\ncrogen\nmotiwala\ngbmc\ntomonoura\nzywiec\ntarali\ncitigate\nmourtada\nvannatta\ndaies\nnnpa\necast\nlavicka\nheadquartering\nrisibly\ndioko\nroscas\ndisaboom\ncuilapa\nwincheap\nmullappally\noritz\nnonnuclear\nbizer\nkubbel\npierot\nriderwood\ntutut\ncrowdflower\njabran\nmackreth\nataya\ngiftcard\npodcasted\ninate\nxpressbet\nkruzan\nlunine\nbegats\npolezhayev\nkaloi\nfeminicide\npeckish\ndurbeyfield\nportballintrae\ndrunkeness\ncribyn\nrulemakings\npsacharopoulos\nkraynak\neukanuba\nidaville\nboomeranging\nspml\nchainstore\nkecks\nbonforte\ntaisir\nwiteck\nterdell\namericains\ncorriher\nunvented\ntwilit\npostludes\nwaylays\nbootless\nukes\nanbaric\nrenson\ntambang\nflanary\nhairlines\neagach\nuniteds\nhinote\nrearguing\npiela\nsubutex\npavlak\nkawaminami\nkamakazi\nvisner\nhamblyn\nwychbold\nsuccede\nbott
ura\nrwjf\nmengtao\nbirleanu\naneisha\nzhenming\nciliberto\nxxxl\nbeljan\nsmarted\nskiier\nsurpress\nwubs\ngijsels\nobfuscators\nhandzus\npedu\nmethacholine\nconked\nrilya\nisometrics\nfillbach\nranstorp\nmoazami\njusu\nfulminates\nfiedorowicz\ngulbadin\nvinader\ngatens\nrenewability\nveddah\nplumpness\nhoeben\nmarchois\nbirnberg\nrubinius\ndistempered\nnijgadh\ngrainier\nbecos\nburnhill\nuzbin\nswro\nrasooli\nstraniere\nballygomartin\nibtissam\ncetaphil\nrouner\nnicia\nseiders\nmessman\ndanilson\nbuddism\nmelamede\nfeasey\nmozar\ndlouhy\nherberth\nkristovskis\nlaau\nwittler\nconspirative\nmumbaikar\ncybertrust\nimmunex\nwakakirin\nsafranek\nbriault\nfnar\nfiremint\naerobars\nmbonyumutwa\nlleucu\nsidewiki\nahmadiyyas\nklinenberg\ntarnovsky\nnatashia\nstainborough\ntanoesoedibjo\noliveti\nmusakhan\nfellgate\npreoperatively\nneckles\ntalbut\ncoedcae\nsunnyhill\ntswalu\nhepburns\nnoden\nbehzti\nogwang\nwedig\nasinelli\nnahoum\nspreadex\nkinnane\nbogliacino\nsullens\nliinamaa\nohmer\nlindseth\nbielskis\nliebefeld\nlowhorn\nrogard\nhearthside\nmickos\nergas\nraedecker\nbrownshirt\nariva\nfilching\nmorgaro\ncangiano\nsjhs\nzhengwei\npieh\npotesta\nshamsolvaezin\ntakemasa\nvivary\nbistahieversor\nrathsack\ndachigam\nundetectably\npisanelli\nkileen\nmamary\nreallocations\nkiszelly\npostconflict\nlathering\norionid\ngarbanzos\ndelvine\nfedbid\nguberti\nkrankie\nmoorend\namerindo\nbagsby\nkahlke\nmulvagh\nbarklem\nvannet\nsarber\nyande\npharms\ncensurable\nzuccoli\nwadajir\npgis\nweho\nstamatakis\nroyles\nremebered\nbrainport\nbachvarov\nspaceloft\nhuffstodt\nhidehiro\nsutherby\nnctb\nnondurable\nsramek\nmasticating\nlagno\nmxolisi\nnedal\nnonracial\nmarinière\nhansards\nnadolig\nwistron\npulverising\nguidewires\npumpjacks\nhumanisation\nroeske\nbatona\nballingham\ngolnar\nepoll\ncappadona\nboomsday\nhabsburgian\njumbotrons\nmeritech\nindoctrinates\nbayrock\ntilin\nhichborn\ndemarlo\npowindah\nverloop\nabreojos\nsozio\nliechtensteiners\nguyhirn\nridiger\nklimm\ndonto\nbabikov
\ningoma\npxg\ntedburn\nwinckless\ndemuren\nnarriman\nrehome\nkilladelphia\nbouderbala\ncrumbley\nedidi\nyoussuf\nkofax\nhopu\nvillwock\nreguzzoni\ngalimzyanov\ncordano\nchongos\nvandome\ncentofanti\nzwelinzima\ncamley\npopmaster\nedgeworthia\ncouncellor\ntuwaijri\nfallbacks\npitso\nnababkin\nburno\nzeine\njumaili\nplotlands\nwhiffed\njabbi\nnumico\nvasilyuk\nseisint\npatternmaking\npharmed\nthigs\nbrindel\nlagger\nmashru\nsonjica\nneossat\nonaiyekan\ningenuously\nrectortown\npianc\ntangun\ndonnenfeld\nwanatah\noldakowski\nlawver\nsoftex\nculpitt\nwiginton\nromanticisation\nbaltique\ntenderers\nshoreacres\nhypophosphite\nisavia\nmarumoto\nhołowczyc\nchritian\nscotchmen\nofiesh\nmundaca\nlorincz\ntryless\nroundnose\ndiscusss\nreradiate\nsakhai\niveri\nfagina\nmalasia\npetkovsek\nstreamflows\nzvue\nbortel\nfliter\nrahmawati\nthür\nlisses\nellegood\nboaler\nscuppering\nminotaurus\nmuralie\ntryson\nquartino\nrockhound\nbjorg\nkladis\nsmartwood\npirooz\nringera\nfoveran\nritchi\ndumbly\nprarie\ndonw\ncolisee\ncsae\nflextreme\nharshberger\nscialabba\nziedan\nhinstock\nhochfelder\nneaten\noludamola\ntruculence\nmarkon\ngrandcentral\ngolinkoff\npasal\ndandyish\natamanenko\naspiazu\nrondini\namericold\nparalympicsgb\nbanktrack\nfarj\nfalorni\nstrasbourgeois\nlecointre\nbusha\nluddenden\nfluckiger\ntilc\npompousness\nhofesh\nisacc\nmoorlough\nrearers\nlajuan\nyusko\nstupenda\ndegreasers\nstonebrae\nquitoni\nllinois\nustads\nriiiight\nunderpressure\nconqu\nbrunjes\nsolidness\nroundarch\nalvediston\ncachaca\nmowachaht\nminchenden\nconpiracy\ngladiolas\ndevillier\nmethomyl\nkudukhov\nisango\nkatritzky\nuznadze\nsayyah\nbingol\ncubatabaco\nphasellus\nwhle\noeh\narnebeck\nabsurb\nadailton\nxolani\ndivergencies\nrüstü\nbunir\nhalafihi\nsallyport\nriveras\nfingerpicked\ncashill\ndendraster\npeolpe\ndetica\nyares\nsupi\ntibotec\npeptidomimetic\ntrenant\npiotroski\nsalterforth\nbusradio\nshimshi\nafflelou\nsmeathers\ncoeliacs\nbajin\ncreosoted\nsingpost\nmunai\nsneakerhead\npentacosta
l\nmultitronic\nshandel\nriflemaker\nshekleton\ndedomenico\nsensage\nsediqi\ndeadlifting\nrunkeeper\nhamda\nenervation\nwestlane\nweightiest\nunseals\nmatarrese\nfieldfares\nblls\nlindeth\nnunam\nmihaileanu\ndecathlons\nokines\nartlessness\ngeiers\nmakeable\njurisic\nlegwarmers\nrecutting\ndynex\nanraat\nhyperthymestic\nvitit\ncurlicue\nyéle\nrafayel\nmmsp\ntarrab\ntorrecampo\nmaylor\naccessnow\nqirim\nkansal\nrecommenders\nkimkins\nbyzantinus\nbanabans\nvoskuhl\nsilvernail\nwoolfall\nijmeer\nauble\nferociousness\nruvell\ninseperable\nbernsteins\nhennessys\nhutchisson\nmyspacers\nalthorne\nbullar\nsahagian\nfabrick\nbaybrook\nfredenberg\nhaeberli\nreppetto\nlatchem\nyakhchal\nindependen\ndecho\nmishelle\nhellscape\ncummulative\nmoneytree\nsutterfield\nfreerider\nelonu\npitonyak\nshayana\nopower\nsamdhong\nmindlink\nfortismere\npalaeoanthropologist\ncallero\nlewdly\ninjudiciously\nbednets\ncrackup\nrapenburg\nexfoliates\nsupportiveness\nbluepearl\nzhenkang\nschureman\nmclovin\nrefreezes\nunmetabolized\nblancaflor\nresendez\neery\nmontanino\nkhoc\nlimbered\ntanser\nparadores\nningrum\nkammann\naugustow\nencap\nschimdt\ncloudscapes\nbrioux\nmovsas\nfengate\nahto\nappleyards\namatriciana\nquarrata\nbabajian\nfinnane\nskirboll\nnewstand\nbumpersticker\ncowhides\ntimakova\nkapachinskaya\nbolongo\nilshat\nmcglinchy\nkachur\nbergfeld\nnibc\ntuluksak\nhanchard\ntompkin\nproffesor\npeacenik\ncracktown\npanthal\nxiaoji\nbeguilingly\nqosmio\nverastegui\nprodea\nkaragoz\nbiohybrid\nmushikiwabo\nraydah\ndubut\ngodell\nchidyausiku\nsindicatum\nflakiness\ncardetti\nangbwa\ncederqvist\nhedgecoe\nguck\nshahids\nsouthtownstar\ntostevin\nscence\nviars\ncroslin\nbewerley\nbesseler\nplastow\nfrolicked\ncyberbullies\nqigang\nfortna\nbeligerant\ndesn\ngurwara\ndescoings\ncattiness\nmiddlehaven\nwarshauer\nswinish\npaasch\nbradach\nghorayeb\nbrookyln\nvarshalomidze\npidgeons\nunweaned\nnetham\nlevemir\nresubmissions\nfrns\ncrathern\nbajak\neisenson\nmaskill\ndjup\naudia\nvicos\npitcaithly\n
cdls\ngermy\ntostes\ndandora\nbaussan\nahrons\neswt\nkailani\ndivnich\nattilla\nzenprise\nheibel\nrudding\nubel\nboshears\namorella\nusuals\nmontra\nislamaphobic\ncpts\nbrnc\nmalbun\nsdti\nhangdog\nchamon\nunirule\nswarzak\nspasming\nlazarro\nlesaka\ngulja\nmainstreamers\nroneo\nbanel\npolyphenyl\nshopkeep\nterritorialisation\nacerbity\ndulloo\nmullner\nanterooms\nkajara\njaylene\npyaw\nlowitt\nkelbie\nsloate\ngriffths\nuocava\nbhfuil\naslund\nnaughtin\nerbistock\nnantyffyllon\nmouzannar\ntapiche\nbrynsiencyn\noverdress\nntdtv\nebbutt\nedelkoort\njingying\nimat\npozar\nsheetfed\npimperne\nnikoi\njousset\ncosponsoring\nshirwan\nchoric\nheininger\naboushi\nhilfiker\ngladhand\nlorigo\nwestmoore\nstichbury\nkneepads\nmeanspirited\nfessed\nbaere\npastizzi\nrowghani\nkrikalyov\nakapana\nhyperintensities\nswingline\njusino\nyazmin\nngige\nnordmanns\nguillaumes\nredridge\ndhuhulow\nsmirked\nfreetail\nevenflo\nlugwardine\nsplitt\nronreaco\nbahiri\nintracoronary\nmichihisa\ndrinnon\njoud\nbils\nwinair\nseeboard\nselliers\nkiyemba\nsuitner\ndelys\nsepracor\nrestuccia\ncorlis\nurmeneta\nchisipite\nsamoon\nsopheap\nmerszei\nbrommer\ngritters\nshereef\nbelcaro\nbrostoff\nnogliki\ngestring\nhohenfeld\ndigiovanna\nboscaglia\nsammich\nbeshenivsky\nrinto\nshalamcheh\nchampman\ncalcipotriol\ngarze\nlattari\nwanlop\nbiobricks\nkarell\nkiteboard\nlaudati\ncarbones\nvizor\nbrawns\ndisequilibria\nassalamu\nchurchhill\nrafshoon\ncircello\ndohmann\nfrutarom\nresubmits\ntotsky\nenninful\nlosinj\ndistructive\nrosbank\nfaher\ndonica\npereverzeva\ncyflymder\nswansey\nmahiki\nbacterially\nfredj\nanduril\nkokocinski\nsabrage\nmanicotti\nembezzlers\nmassingill\nbourgeault\nplagerized\nhumba\ndevourers\nsubtlely\ngunbattles\nglamourpuss\nmottel\nsicelo\nkipahulu\nrowatt\nueps\nmeckseper\nbubblicious\nunbuttoning\nkhaplang\nfinchum\nadknowledge\nturnoffs\nairdam\ninvenergy\nmeydenbauer\nsaglam\nincriminatory\nhedderson\nsambódromo\nacredited\nvondeling\njiangang\npizzala\nelmaz\nyelding\njanic\nfanc
ypants\nfacilites\ngangel\nblaichman\nwolder\nbutturini\nstalinesque\nfeener\nparvaiz\nyordy\npiening\nchenge\ngormezano\nabsolutions\nelegaic\nprehypertension\nginno\nburgdorff\nitest\nwillemien\ngipi\nsoutherham\ntatopani\nnawc\nrunflat\naubain\nimcomplete\nufip\naaoifi\ngbadebo\njindi\nwearability\nmicroamps\nsimunic\nvscs\nnebulization\nblyk\noscypek\nespitia\nquickcheck\nvanacker\ndeß\nhatemongers\nbucheli\nperniciously\nrosow\naraskog\nlegislatives\nmearth\nbarnacre\nunsegregated\nmambetov\npoblanos\ndweeby\ngason\ndadwal\nhexapus\nschüle\npickus\nkenjo\nplax\nmarineau\nthrumming\nmalual\nclotheslining\nvideoing\nbailers\nbankok\ndemilitarise\ngoodo\nthrums\npicioane\nnovated\nbronder\nhelcom\nchampurrado\ninfinate\ncelebrator\nnadhmi\nollies\nsylvest\nfingerpainting\ndaid\nchebii\nllenarme\nkirpans\nbubnik\nsonka\nugulava\npennyhill\nchavot\nsheekey\nundismayed\npaktribune\ndepoliticize\nrecountings\nesrin\nngoepe\nnyboe\nfinisar\nmohammd\nscamman\nfirsters\nguellec\nnahwa\npryors\ntadre\nsluss\nonuci\nadamy\nferbrache\nsieci\nlyophilization\ndentdale\nstratacom\nmisali\nkarwi\nparticulaly\nbuytaert\noneroa\nzizmor\nsadig\nmohammadullah\nalldritt\ndentsville\nspittlebugs\nmedcap\nwovens\ngoaless\ncamana\npathologize\nchodounsky\nspreaded\nfoodstore\nfairbairns\ncropton\nlorent\nintellectualize\nformstone\nagustinillo\nmonkwood\nhaif\nresynchronize\nchubachi\ntennman\nmuilenburg\ncaradonna\nsinex\ningrowing\nmtss\ndisembowelling\nmahnut\npitofsky\ncoopervision\ncappato\nromaro\nkenco\nelmesthorpe\nsignle\ngoldenport\nhallyburton\nfrmo\njariban\nhrycaniuk\nunintimidated\nplebiscitary\ndraughtswoman\ngruszynski\nadega\nnaths\nkleb\nenersis\nbaradaran\nfrontlinesms\ngiddeon\ndewstow\nattalah\nschachtner\nwhitleigh\nsubconciously\ncatsa\nsullies\nlamassoure\nearliness\npreemie\ntourismo\nrevital\nzemiri\nbemko\ningves\nfelicities\nsawzall\nsnediker\ncumbes\nkrainer\nkarlic\nstopzilla\nfayston\ndawod\ngunashli\nheizmann\nbrooksley\nagropur\nromancers\nforterre\nwej
ryd\nshihe\nirrisponsible\ntootsies\nllundain\nomniflight\nthorvaldsson\nexemplarily\nyounkin\noubrerie\ndemtschenko\nmattieu\nsroda\ngutkowski\nbenville\ndobberstein\nsixmilecross\nuncongested\naveton\nansfelden\ncoloe\nscratte\nabdulraheem\nbancard\nhästens\nvannessa\nluggala\ndethrones\nhillgarth\ncamolese\nsinak\nculos\nsupremos\nhennops\nqingzhu\nlonglasting\nhakims\nstrobed\nccpm\namericare\niconnect\nxta\nbarayagwiza\nsuminia\nwinces\ngjedrem\nbacksplash\nvandura\nmstr\naquebogue\npaciocco\ntreliske\nbiogeochemist\ntearaways\nplastiki\ngroovier\npetfood\ningrida\ngenially\nkaydee\nkaeshi\npocketless\nimpetuses\nkhachapuri\neminating\nbudner\nteplitzky\nhkmg\nvivaz\nschieler\nbirnau\nslavinsky\napiay\nrouged\nherlander\noldani\ngilster\ncremators\nvagary\nldeo\nblindsides\nfisita\nnanpean\nmulvaneys\ntimeconsuming\nprognathous\nclarificatory\northorexia\nspacehopper\nbartoshevich\nmsph\ntongson\ncodetel\nzahreddine\npanthenol\nsandvine\ngazumping\nmilhollin\nboding\nmseleku\npotpourris\nbomana\nbeligerent\nilove\nshakan\nweddingwire\ngianduja\nmweene\nvancouvers\nlandican\ntsokos\nrorting\nlevance\nlameiro\ngracemont\nchaske\nmanservants\nharlotry\nwhities\nseche\nusgif\ncommodifying\nupsell\nnmsp\npsaras\ndonolo\nmascalls\npresbury\nweisbecker\nmiltie\ngenencor\nnrlb\nplme\nmattimore\ndahou\nimodium\nzerai\nlongjohns\ncroeso\nsolat\nunleased\nwaelsch\nxavière\nsackful\nosinga\ndeepdyve\nlevkovich\nilligal\nsinotrans\nportnoff\nkurundkar\nluesther\neardisland\nshpa\nbrioches\nslimmers\nwallahs\nthrasivoulos\nshivambu\ncaparulo\nharop\nlampu\nveals\nonepass\nschiesel\nintraregional\ncbrl\nglenravel\noffshored\nlorus\nsautoir\nshereshevsky\nmandache\nstafon\nbillout\noapi\narpey\ndraganic\nradox\nshabecoff\nempanel\nllanbeblig\nscqf\ndumiso\nbuzztime\nmichalos\nludmil\nnregs\nhoons\ndabbert\npossition\npreoccupies\nromneya\nlidget\ntheweek\nanchorless\nsubsistance\nborroka\nthomasz\nskycap\npeschier\nsagittarian\nwelat\nsaqafi\nremigino\njibarito\nslothfulness\n
myopically\ngosi\npushbacks\ncarpati\namach\nrocori\nlosantos\naquadome\nricciardelli\nmiddelhoff\ngilnahirk\nneckless\nmorem\nchiplin\nfuhs\nwinka\ninsalata\nschlub\nkhalvashi\nmaterpiscis\nbukoshi\nvallese\ncetc\nmicroserver\ncharismatically\nreish\nporthminster\nvirshilas\ncinematique\npfandbriefe\njingbo\nnishimatsu\nmiasmic\ncallands\nscandi\nkorrodi\nasnd\ncavalaris\nbeechams\noctapharma\nsahlan\ndoripenem\nprtm\nsphygmomanometers\nempact\npickwickian\nvhcc\nosee\nsirtris\ngoldsmithery\ningloriously\ncuase\nkobernus\ntelepod\njailings\nfloridiana\ngradeschool\nsharot\nschmitzer\ndismantlers\nspauldings\nmultisensor\njobanputra\nbenumbed\nbusquin\nshamban\nmaqu\npreceived\nhennum\nseeqpod\nthegrio\nusdla\nabary\nwallersteiner\ngaynet\nglaskin\nlaleston\nsalomoni\ncrispiness\nestablsihed\nwojtala\ningeo\nissur\nadenoidal\nhret\ndarjina\nkhramov\nadelfa\ntrewen\nmanzor\ninzer\nhemosiderosis\nsegneri\naccredo\npetronzio\nnooney\ndivex\nignor\nghaidan\nagrella\nflaxington\nsepte\nclaxon\nleszczynska\ngaudoin\nappeciate\ndaftness\nsampsons\nmontenegran\nunpassed\ndazer\nkookai\nnabiullina\nunlevered\nwopmay\nleadin\nforgent\nschlicker\nflatty\nramsis\navdeeva\ndoornkop\ntopknots\nfinancialnews\nboily\ndennise\nlelay\ntsbs\nshysters\nkargel\ntrenc\nherschman\nfiorilli\ndantrell\nrennaisance\ncarcraft\nhunkering\nhofferth\ncornas\nsocialises\nogaryovo\nignatas\nscoopers\ngahler\nostholt\nsolitair\nmasorin\npayi\ncubbison\npercovich\nmanibusan\nalvardo\nnarcoanalysis\ntheoden\nedicule\nbataller\ndiehm\ndaikundi\nzaluski\nnewsrx\nmonbazillac\nvriens\npabulum\nloftily\nreligiousity\nshenson\nsaylan\neffortel\ncibulkova\ngoldmans\nsitups\noverpack\ncpma\npervs\nscarse\nvinashin\npeformance\nmeichtry\nexoduses\npmbus\nlevandoski\ndarnah\nodigo\nacsu\nftk\nzuur\ngawel\neleve\nwvwv\nwolanski\nrereleasing\nbioscientist\nparenzee\nvscp\nbuildin\ndepositaries\nragot\ncreedmore\ncarrville\nperasso\nspillar\nbokum\nmarje\nwhatham\nautotote\ndevitalized\ntemesgen\nbagnal\notcs\nsu
rovell\nsheepcote\ntoxt\ntriaud\nzaborsky\ncafarelli\ncherkas\ncoretti\nazertyuiop\nghundi\ncahyo\nbristed\nkrevey\ntwitchers\ncannulas\npaiano\ncampanale\nholdingham\nauteurism\nbussman\nvanquis\nsaremi\nhammergren\nrobock\novercompensates\nleidecker\nruault\nramezanzadeh\nholleyman\nexoticized\nuduaghan\nspagnuola\nlomelin\ntrebicka\ndoffs\nlinkman\nmereghetti\nmyofibroblastic\nantcliffe\nshimbo\nnouzaret\nwildridge\nmaket\npeterchurch\nbazzel\nsunai\naaae\nspotlessly\nkayali\nkamphausen\ninexpressibly\ntalkleft\naeroman\nyoungstorget\nchomolungma\nclevlen\nscien\nbouchikhi\nsiracusano\nsdtc\ntrunzo\nbanoffee\nclaimaints\nanela\nunwaged\nconscienceless\nmevlut\ndatcher\nsatoris\nahmedou\nbakhodir\nteashops\nklausmann\nbosky\nbeachgoing\nmotahhar\nmefin\nutton\nbrami\nsiknis\nandreesen\nnonexperts\neshbaugh\ngamlen\ndordain\ncorazzo\narthog\nlaboso\nturgidity\nfamista\nsadara\nmisdiagnose\nattck\nhansack\nnisenholtz\nmccaine\nwarlikowski\nwingas\npetajoules\nrachou\nfieldings\nudwin\nfailer\nabuk\ninms\ntshewang\nkhazaee\nsipgate\ndrnovsek\nxuenong\nseamlessness\nchurgin\nczene\nreitzes\ndehiwela\ntoget\noldchurch\nmellits\ncromitie\ntakanezawa\necotours\ndelawareans\nfierros\neshre\nstruckmann\nunburdening\noptenet\npetards\ntalaton\ncorthals\nmckerron\nzaccai\nsukardi\nfanlike\nanowara\ndemeksa\nveeteren\nanable\nshotmaker\npolyvinylchloride\nsharrif\njacquemain\ndunbia\nrockish\nweinbrecht\nglamorizes\nnajmah\nmendheim\nrianto\npcit\nmesarites\nkealing\nreapproved\nprokovsky\nutterby\nfrustratedly\nibcp\nwillowwood\nairbursts\nmekia\ntarkov\npruszkow\nnurdles\nmanipulatively\niwuji\nweeford\nesio\nfalik\nhojjatieh\nnaulty\ngreenlining\noctoshape\nskenazy\nwilcott\ntrewithen\nroccat\nsabate\nlukusa\nsuperclasico\nintitiated\nirham\npreson\ngpha\nschnoodle\ntanon\nmassequality\nenergises\nfeinglass\nbrickbat\nvandaele\nnyamwasa\nfxc\nbrezoianu\nluffman\nchernyshenko\nlpgs\nkumakawa\nduferco\nbontempo\nteresopolis\nblancco\ndogherty\nimprtant\nmajia\narmella\naarni
nk\ninterpet\nmultipronged\nmaich\npsyching\nmecl\nsyder\nbassirou\nhydrotreater\nconlogue\nfouettés\nupsize\ngreenquist\niloperidone\ngigajoule\nghezal\nquevega\nstudioeis\nswopped\nallaben\nraimes\nxcite\ntaruta\nvacs\nhayemaker\nmastec\npurred\nkhademi\ncoppley\nsheroo\nmakridis\nrationalises\nliveauctioneers\nlicadho\nbatterman\nwarburgs\nadrenomedullin\ninflunce\nsteenie\nutterer\nharperentertainment\nishmail\nlayalina\nhorpestad\nemda\nperisho\nbalcazar\nmcmeen\ndaubs\nreconverting\nincluing\nnieboer\nkalaloch\nmarvella\nshugars\nminamitama\nftvs\nkoduri\nwagaman\nmarmari\nhealty\nfilmgoer\nmirdamadi\nchemel\npoststructural\nbankability\nsuparat\nreclusorio\nmerdare\nyasamin\nhaist\nlarasati\nxtuple\nmethylnaltrexone\nshengtai\ngimferrer\nvallverdu\nsevket\nomos\ntalkbackthames\nkheifets\npetruska\nmundon\nfitgerald\nboed\nastall\nptss\nchanneladvisor\ndistate\nmirtchev\nnoseless\nrumiana\nenglin\nwexton\nhuaxin\njehn\ncampaining\ndaddys\nyeman\nbodycote\nbluefins\nrisbridger\npublicy\npottie\nnby\nwenbing\nskorka\nskyer\npeacefull\nzellmer\nbartonellosis\ndesjoyeaux\nhuneck\necoterrorism\nladenis\njanuray\necclesiasticall\nbhagwanpura\ngvir\ncomacho\nlarsons\nlaparra\npixelvision\nprosise\nfengling\nkreteks\nuncorrupt\ncenteniers\nwamuran\nacciavatti\ndunlins\nsunderlin\nclearys\nstannah\nsmeller\nvdap\notty\nkirumba\nbabrow\nswedan\nnaymick\ncargin\nstencel\nbeliev\nbeltless\ndacunha\nhaematococcus\nnamsa\nscheimann\nfskn\nairmall\nnannetti\nzhongneng\nopnet\ngorske\nkuonen\ndenderah\nsportwear\nnopat\nhenningsson\nproprietory\nshieldhill\nsinorice\nspideroak\ncollemaggio\nharrodian\nterrazo\nfayres\negoistical\nfugee\nbirnkrant\nbioabsorbable\nbeetem\nnyantakyi\nprecip\ndisuade\npopwatch\nsoundbar\nbarbano\ntesak\nbearpit\nfakeh\nizzies\nlcdp\ndouzinas\nsouthrop\nberdie\nmeikles\nsenkowski\nosaf\nmelony\npgpf\nzauderer\ntumeric\nstissing\nappendectomies\nsevcec\nfrémeaux\nsahim\nashtree\nguyonnet\ncannibalising\ntrewyn\nzinzi\naudiotaped\njarjura\nairong\nf
leetlands\noutof\nfircrest\nvelud\napsaa\nhackey\ngangbanging\ndivisons\neasl\ninsipidity\naboutthe\necollege\ngamekeeping\ndernegi\nkarimojong\nsubtley\nanritsu\nyanky\nraghavachari\ncongradulations\npiatrushenka\nhommell\nshiqiang\nrhosllanerchrugog\nbredekamp\nnitrofurans\nnutball\nneuroblastomas\norcel\nsemiprivate\nnumerix\nmychel\ndonyale\naddenbrookes\nmascarello\nnonconfrontational\nyevhenia\nschottlaender\nsolimar\nfairtest\ntailby\nkhandkar\nedmondstown\nchassin\naquaintance\nvaledictions\nchambe\nlifelessly\ntravelcenters\nhiddlestone\nmacosquin\nsueppel\ncalabuig\nkallasvuo\nwaggish\nkiling\nlubes\njufer\nvmy\ntbtf\nwhoopass\nnomophobia\nkopko\npampelonne\nstanistreet\nreicin\namerijet\npredeployment\nshadduck\nlegedu\navocent\nkonowalchuk\nrefuser\ncorrrect\nnjoki\nedrm\nmordashov\nshockheaded\njingmin\nmedwed\nscheld\nabdoulie\nbrahmsian\ntcpalm\nsemos\nriformista\nrepuation\nebisawa\ntingsek\nanois\nrisedronate\nqaiwain\nsaaed\nreselect\nbistec\nventisquero\nmarabe\nsmartpak\nmossor\nsomewere\nskupien\ndebbye\nklencke\ntengzhong\nhumanlight\ndumo\ngramacho\nnordon\nblys\ngillogly\nsophies\nscrimp\nroghun\ndonchery\ndyskinetic\nimmunostimulant\nmacrs\nledare\nmapel\ntusiad\njouanno\nsmashie\nlonghauser\nresurfacers\npanopto\nflambée\ntheam\nalide\nctfs\ncisero\nlandazuri\nmsce\nschilens\nfornasetti\nsilhouetting\nweyne\ncadahia\nsinse\ncaffari\njerg\nmutely\ndubrovina\nschlom\nlafrieda\njaghatu\ncedc\ncorvatsch\nstarsuckers\nskuce\noverbalancing\nhelados\nreadsoft\ngundotra\nmisfold\nholloran\nprotsyuk\nfoxxhole\nmontagnon\nsytems\nfbcm\nhobnobs\nfuneka\norginated\ndrobner\nletowski\nmanhatta\nrashod\nbouillons\nshamseddin\nvalises\nguilio\nviar\ntrussoni\nroszko\nwosniak\nregathering\nharsono\nmetlox\nnaqelevuki\ndistortive\nmujawamariya\nminnaloushe\ngrevin\nlofstrom\ngosbee\nconvertable\nmitbestimmung\nkinoulton\nwintrich\nguylian\npitanguy\nthrondsen\ngurewitsch\nbakia\ncedre\nfilmless\ncrenca\nbaning\nvadasz\nmagnex\nsandroni\ntrundles\nakanda\nre
strictionist\nhurtmore\nfanbois\nscvo\nmusleh\nmoqaddam\nusenov\nderacinated\nroee\nniflore\nuexkull\npulzer\nmesnel\nyesui\nsentis\njaidi\npoeticism\nbabah\nstodel\ncsii\nkazandjian\nberties\nunblushing\ntadian\nertha\nsunner\nbaskins\ntaghipour\nthrillington\nsokolove\nossó\nomdahl\nkornblatt\nmenegatti\nbeggared\ntraicho\nmessan\npayzone\nhashwani\nfrenaye\nlamber\nundebatable\npuigvert\nteamgeist\nclangor\nshrider\nnomatter\nscansafe\nreapplies\nrecurvus\nwestrop\nbettley\nconsta\niraqui\nbioresearch\nkillias\nairstation\nhuamei\nmezzos\nhollingdean\nthesps\nlovelikefire\ngilbody\neskra\nppif\nmctaggert\neichmanns\nrookard\nplakias\ndartmeet\nfranzblau\nolafsdottir\nethelwold\npoleska\nsmigiel\nmalles\nkalff\nmasimba\nlinnington\nsovietologists\ndufka\nparrottsville\ndrinnan\ndibis\namaraweera\ntimonov\ncrumby\nphrc\nclueing\ndekabank\nanchorsholme\nbonce\nshannah\nquetteville\nshfl\nboyl\nmsut\nmakoti\nkolasin\nknuckler\nsusanka\nhorita\nmikulich\nmckerrell\nfjf\nglanaethwy\ncrumbing\nexfo\nunveilings\nescarole\nnading\nrosanvallon\ntenability\nthoise\nahmedzai\ntowerhill\nukcg\npaquirri\naquaplane\nthellier\npeiro\nchapnick\nradojevic\ngrausman\nzapresic\nheifner\njaymar\nalibaruho\nfirelines\nhangama\naamva\nchoom\nllanllechid\nmuezzins\ncellcept\nscientological\nvishaal\nthourgh\nsiradze\nsaguy\ngarryduff\nmaamobi\nanrs\ngomperts\ndiversifications\nignobel\ncertej\ngassim\ntourgasm\nlumileds\nshaib\nfragrantissima\nbldgs\nstrambi\nmyrtaj\nlichtung\nardekani\nkilberg\nerbsen\nprobat\nreplan\nskapinker\ncameraless\nsoname\ndreze\nadcb\nccei\naeroports\ncovingham\nminimoto\ngrutza\ncunza\nregassa\nmerletti\nutrilla\nnorwitz\ndamed\nbloodfest\nworsteds\nwoznicki\nferstl\nxceedium\nkreuzpaintner\nlogorama\nquizno\nmisregulation\nfacon\nxiaohai\ntitterrell\npuling\nosinachi\nhotting\nserapes\naranesp\nnovera\neikelboom\ndignatories\niccho\nkievman\nwalkey\nexcessed\nthikse\ntrefeurig\nryvita\nfauchier\ndiscolors\nmorero\nwithins\ngaumer\nomlet\nirrationalities\nca
irnbulg\nshawali\nkassahun\npatsies\noncale\nfavolosa\nomgs\npataria\nwaterpod\nsnowblowers\nobdurately\nhaimson\nfallowell\nskorts\nundisplayed\nslogs\ngoatherds\nreboletti\neodromaeus\nilikai\nnoncritical\nbearfield\nebonized\nrizq\nswingbridge\ncastelgandolfo\npoolville\nbhuttos\nbouchart\npercutaneously\ngoedecke\noreskovic\npalecek\narkes\nmítica\naccute\nyeandle\nvirrankoski\nluvvie\nskolimowska\nhootin\nlibowitz\nbulbed\navocadoes\nneukoelln\nmastroberardino\nbahaullah\npríncep\nassocies\ncompetetion\nbertagnolli\ngalchen\ngallix\nhaberstroh\nacupcc\nninkovic\naldersbrook\nuricase\nskort\noleochemical\ntradeline\ncontergan\nmogavero\nmrbi\nphysiatry\nlagreca\nkelz\nantiballistic\nleapfrogs\nurquart\nshahpour\nhuetter\neqivalent\nseike\nlerwill\nsantoriello\njelavic\nrogun\nbedevils\nwastrels\nfigaredo\nfalled\nclickatell\naïnouz\npourandarjani\nsensics\nfrankle\nrillettes\nehlinger\ntelemedical\ncaterpiller\npleached\nmokrzycki\nporod\nholczer\nvomitting\nelmqvist\nfilus\narthrex\nstemberger\nbellar\nsheikdoms\nholsbeeck\nmagnusdottir\nwaymarkers\nunamed\ndukach\nkilford\nhoffarber\nencashment\ncarlick\nrascom\nnaftna\ndunningham\ncalvina\nfarba\npellestrina\nphilosophise\nelenydd\ngoettler\nfiskardo\nmrmc\nzhaoxu\nkattar\nsandelson\nstreetfront\notzi\nstonewaller\nclarida\nuntidily\npuskepalis\nassement\nsuhrid\nlanphear\nlovelessness\npoular\ndubon\ncarnavas\nsharani\nmaccumhaill\ndsci\ntimidness\nmmrv\nmasbia\nmikeno\nyaxcopoil\nmicrotargeting\npithawala\nzappin\nslurpees\nvichea\nrhencullen\nsalutory\ncareerlink\nsandrino\nintermeshed\nrozanna\nzatko\nsabow\nyussof\npetoro\nburkleo\nmidanbury\nbeijinger\nlifestreaming\ndaytrips\nimmutably\nsarfu\nraffell\nrubish\nnambia\nsexualize\nkavinoky\npredecesor\nagrichemical\nholtan\nschanzkowska\ngexa\nwillings\nrehabilitator\nluyn\nstranges\nwedberg\nkohnke\nvilchis\ntowelette\npostcrypt\nsirenomelia\nusitc\nragheed\nazzura\nkuntzman\nebener\nmalreward\nheloc\nforefingers\nmarchesani\nomung\nleprevost\nsplenetic
\nlaschet\nhurted\nxuejuan\ntwere\nfleegle\nlloegr\namedisys\nenard\nhavenhurst\ncrittercam\nacibadem\nsiegels\nspreckley\nmateriaux\nskiena\nljubojevic\nprijono\ninbursa\nfilianoti\nadhiambo\ndailycandy\ncanonmills\nsetten\noberhelman\nnakameguro\nrunacres\nbluebottles\nwithens\nconfucious\ngeoeconomics\nghadiya\nkanguru\nsubdivisons\nedcarlos\nporscha\ninterpipe\narumi\ncbhd\nsanio\nhealthplex\nmoisturized\nszybalski\ncounteractive\ntedda\nprepatory\naropa\nthinnings\ngeorgeanne\nilimaussaq\nplexifilm\neventuates\nfinetune\nostrofsky\ngeocultural\ngambatese\niuta\ncornton\ngaraged\nhallae\nwhoopin\nresistent\nbrookyn\nshtein\nbolventor\nrotel\nunscarred\nchappers\nmorganstown\nmachaidze\nwellswood\npipper\nolesz\nmesg\nafifa\noudkerk\nclowned\nnaturalpoint\nmonets\nbielinski\nyatco\nsympathic\neshraghi\nsuanzes\nmelverley\npaxford\nthuet\nchrissochoidis\nulcerous\ntheriogenology\nestenoz\nojomo\nhaddox\nkirmond\nwinkers\ngibus\ndammika\nrowardennan\nquicksearch\nyolink\nsimey\ncacerolazos\namerex\nswimm\nlingustics\noddcast\ndelucci\ntherap\nkidero\nihnatko\nxtraordinary\ngtps\nsmooha\ncaddigan\nmonastry\nextraodinary\nyiru\nmonkston\nchakas\nbebchuk\ngraversen\nazoy\nbutcombe\nhammarstedt\nindepedence\nrebora\nclairborne\nedst\nshopowners\nsirmans\nlungarotti\nstategy\nsuts\ngirlfiend\nspuistraat\nsferrazza\nnavarrette\nsamarco\najang\niafeta\nakli\nyiannos\nreviles\nvenkateswar\nmezzetti\npelagics\nsumler\nvermicular\nakridge\nsyphoning\ndwoskin\nsparklingly\nzyban\nganush\ntbaytel\nsiniya\nkoomey\nbouzy\nshakertown\ntelavancin\nspatt\nstancliff\nmisperceive\ntiquan\nshalaev\nhamlins\nsoccerex\npalagyi\ntution\nqibliya\nuvarova\npabor\nshuttin\nlidoine\nskillsoft\nshamiana\nfalletto\ncomfortingly\netek\ntreseburg\nhypoglycaemic\nrumpke\ncinghiale\nclovenfords\npostmortems\nnkoulou\nkouznetsov\ngilltown\nnonfamous\npetitgout\nalpheratz\nhossfeld\nawasom\nfinanceworks\ndinniman\nbetsan\nembutidos\nbolesworth\nyoumzain\nadade\nbhojwani\nweizsacker\nchirilagua\nnutr
o\nprotectants\nmepivacaine\nbrickie\ninderfurth\nminimalize\nkingsisle\nsitrick\nmassaya\nnaughts\npurbrick\ntoyosi\ngruentzig\nmoussem\nworral\nbefuddles\npolicital\nshadmi\nbraystones\nmojopac\nstrycek\nperseverence\nreynholm\nbruited\nbattue\ncioppa\nblts\nbacame\nsolopower\nschierholz\nnagusa\ncherkin\nkummant\nbackboned\ndediu\npinatas\nturkoglu\nundriven\nwipeouts\nhuperzine\nprocyclical\ntwinity\nmandiant\nswingeing\nmotecuhzoma\ngoldwind\nclamours\ndvortsevoy\nbootsnall\nbaleni\nunregarded\ndanleigh\nseinn\nbstc\nsocgen\nmoudjahid\nimportune\nyassmin\nnakhuda\ntheyll\nrecommitting\npatrinos\njosl\npolyface\nlionshare\nsenderoff\ntradebook\nhoogewerf\nabdifatah\nrimers\nfarnoosh\nmembreño\nsgreccia\nsabrine\nmoynes\nriverscape\nbacteraemia\ndarrill\naskmoses\njoels\nsprinklings\nruisi\nmarongiu\ngoldenwest\nsiela\nantiliberal\nicic\ndangor\nbritoil\nosiraq\ncenterra\ngirbaud\nstarkers\ndeadwyler\npleva\nampal\nmontauriol\naigrain\npromover\nartour\nraunchiness\npectic\ngrotesqueries\nveletta\nmussallem\npersily\nbrowbeats\nquinceanera\nrefighting\nhosel\nhollomon\nrezart\nbongoville\ntaeb\netien\nfolson\ntirley\nguangfa\nislamaphobia\ncodpieces\nsfms\necbs\nkulevi\nherepath\nperambulations\nbagless\nhavanas\nvoronet\nbostian\nwoodle\nirelan\ncarmellini\ncowels\nlitokwa\ntelesp\nunderstandingly\ndreibelbis\ncayuco\ndigitalizing\nsamanda\ndunky\nchanuka\ngishwati\nschmincke\nezekwesili\namegy\nflirtomatic\nramkissoon\nrerate\nrosseler\noutdraw\nungeared\nfastech\ncerezyme\nnoreena\nparanagua\nnormansell\ngozney\ndohms\ncacophonic\nstroka\nskeldergate\nkethledge\noverclass\ndownlow\nuglydolls\nbilkey\ncurteys\nmanolopoulos\nulanoff\nmeetic\ntimble\ntakover\nkolobkov\nlaarman\ngindorf\npizzicati\nlabadee\nmattiello\neshetu\nrosinei\nfroelicher\nribband\nvellupillai\nradkey\nloffler\njiayou\ndonose\npackable\napplauses\npapadopol\ndullards\nnaafa\nshanghart\nhashers\nmarybank\ntronick\nfudgy\nambudkar\nuphaus\nsteussie\nstockily\ntsalikidis\nphosa\nfuschillo\nnco
mputing\ncalfornia\nramotswa\nburud\npremiair\nretroelements\ngrebbestad\nalouds\nvishnyakova\nhighflier\nhurlin\nbaynards\nundistracted\nphanindra\nconfigurators\nweaner\ntiejun\nvalarezo\nsnorkeler\nlungile\nmedulloblastomas\npiteously\nslightness\nteepe\npoliak\nabdiasis\nstemilt\nfunderburg\nraisinville\nbidri\nramsammy\nelemer\ncleaton\nshowiest\nsluttish\nbdps\nenck\nolad\nmicrodisplays\ntelvent\nparings\npinkelman\njelmoli\npopinski\nstericycle\napaporis\nntale\nbartine\nlabourious\nnamdrol\ncatrambone\nquantam\npoggenpohl\nmingfeng\ncrinkling\naabs\nwildcru\niskenderian\nmccurrie\ntotonicapan\nrendine\nroomates\nmarjani\npunko\nkonbit\nsivb\nfriedhoff\nunpropitious\ncliam\nmagull\nsallinger\nmykhaylychenko\nadisorn\nponiard\nkargus\nangelyn\nsonsoles\nwgcs\nsinlge\ncochleae\ndiefendorf\nchairpeople\nlonner\nsomak\nrudys\naving\nfiis\nrattletrap\nsansibar\nosgathorpe\nunoffending\nthaksinomics\ninsurability\nmisnad\nodilio\npoptop\nhfma\nkonuralp\nabromowitz\ngattas\nmustacchio\ncabelas\ntrotte\nbuckheit\nzuwarah\nlutman\nrailbird\nwashdown\ncasarotto\nmyps\nfcit\nkinesiologist\ndepersonalize\ngressly\nspeaches\nfloorplates\nsating\ntalwrn\nnutbag\nrecapitalised\nnietszche\nmakhneu\ntelevion\nlepisto\nsenes\ncamhs\njaho\ntoothman\ncafard\nnetzeitung\numpg\ndepayin\nadamsen\nxiaojian\nsringar\ncryonicists\nzraly\nhirshorn\nrecapitalise\nsmis\ninternalising\nkalocsai\nfidgets\nbestplaces\nisolus\npaglieri\nbasith\nschlepping\nmarnò\nrescap\nvitria\nsporer\nntakirutimana\ncarrozzieri\nemiratisation\nsieminski\nagonise\nneyroud\nnaposki\nenplaned\nlumiracoxib\nsiekierski\nansarul\nchinny\nshiniest\ndiraige\nddlj\nmernit\nyearlykos\nkimhae\nsentayehu\ntŵr\nwattegama\nunderpricing\ntaggar\nsnabba\nregorafenib\nhoogh\nsamll\ntarullo\nguisset\npolverino\nbookstart\npressplay\nunpingco\ncetraro\nteenyboppers\ndeppa\nsundelius\ntubulars\nethie\nlycees\nfridkin\nzavon\nmildewed\nnuriel\nvilje\nbenissa\nseydler\nevillene\ntheocrat\nspitted\ntianli\ndefanging\ngoeken\ngui
dara\npetroplus\nzackery\nbombastically\ndaurade\nbalford\ncorruptibility\ncrispen\nlemanski\nunhedged\npeniakoff\nmahmoudiya\nhuuuuge\nmorozzo\nkleman\nbogash\nemmers\nproliferators\npaleaaesina\ntovish\nzelikman\nlasered\nmallach\npatission\nidolisation\nvosough\nbiancocelesti\nstefanek\nquatford\njohndroe\npulsated\ncrosschecked\ndalewood\ntuila\nnayel\npalaj\nkaumatua\nnincompoops\ndennisons\nsehdev\nfraijanes\nscalf\nrazeh\nheusdens\npollenca\nstrategising\nchaundy\nintensly\ntalayan\nhaggles\ngianello\njuerg\nevanoff\nbeardwood\nnovolipetsk\nhaplocheirus\nshatat\nqoran\ndulcibella\njaycox\nsakiewicz\nnaharnet\ngutte\nreagor\nperimenopausal\nootani\neyup\nroslynn\nskrenes\ngilbarco\ntopolansky\nwyddfa\ndirtbikes\nmanceaux\nforeshadowings\nfoists\nrongsheng\ndhlamini\nsatco\nalpuri\nsommerin\nhaaks\nzurabishvili\nkabobs\nshatzer\npramono\nplitmann\nephgrave\nmaqal\niksv\nsuprises\npiezoelectrics\nkoite\nwdh\npraver\nodroid\nscrunchies\nbiocentre\nurbany\niwatch\nparrock\nbosherston\nnaturellement\nnigon\nlurve\njissah\neffenberger\ntourgis\nvenkatachari\nfessio\ngoemaere\nchuffing\nseditions\ngleadell\nchocking\nseved\nmorosov\negelhoff\ncryos\nbhaichung\nhaatrecht\ngasparovic\nintranasally\nmelianthus\nbancorpsouth\nahikam\ngdss\ndelavallade\nsanburn\nmckeeva\nedlp\nphilosophizes\nriverboarding\nkulma\nmeningococcemia\nharlap\nladylove\noeyen\nbeguiristain\nspeckhard\nlillyhall\nregenmorter\nmummenschanz\nofficiously\nshovkovskiy\nargles\ngorbachov\nyakking\neulau\nzaab\nzithromax\ngleadow\nrefusers\naldarondo\ndinnick\nhevingham\nimpressario\ncaucas\nyitzak\ntomizo\nripasso\nahhs\nbellinge\nclnk\netecsa\nturmes\nmowhoush\nhickner\nstonborough\ninveigles\nfaurie\nchaplinesque\nvallvé\nmynediad\ngerou\nbroders\njerren\nflaccidity\nbrieant\nalaha\nerlebniswelt\nltac\ntheslof\nfelzenberg\nzimmerling\npomazan\nlillycrop\nbhui\nmascari\nalltop\nlry\npsrc\noronoque\ncltc\nhenvey\norientates\ncleale\ntrendies\nrabadi\nsalangi\nscrunity\napptio\nhoundwood\nbutenis\nbie
rer\nreliford\nzezelj\ndejongh\nnechirvan\nfbar\nsacremento\nnadolski\nmapusua\ncraford\ngremm\ndebarquement\nnpis\njalet\nfernihough\nbrutalisation\neshe\ncannoned\nravelle\nsovereignly\nclambakes\nbeliving\nrobotised\naguirresarobe\ndoohickeys\nkampelman\nmarcario\nvivendo\nbarshefsky\ngradualness\nkhorakiwala\nkorytko\nsqueegees\nyidong\nbellochio\nsarad\nguardbridge\ntillekeratne\nchanmugam\nbackpedalling\nooky\nsystmone\nfonctionnaire\nfdci\nlongham\ntsds\nsulphites\nyould\nabercanaid\nmicrovillus\npiskarev\narrogation\nfiatal\ntrogdon\ngestodene\ndeadest\ngallmann\ncharlette\ngorau\ncrov\nhansenne\nnonreflective\nezzouar\nledner\nacrophobic\nadefemi\nhothersall\ndatabased\nmvela\nmodrica\nbedsore\nscheibitz\ndegi\nagathonisi\nbougher\nreadback\nhealthcentral\nhscc\nbutros\nvosovic\nrheolaeth\nzappy\nbingde\nwakeboarders\nabigale\ndevondale\nnitol\nsaccomanno\nmanguera\ntemptingly\nchippokes\ntrackday\ngaofeng\nhapworth\nstewardships\nbrussee\ngwbush\nurusemal\napalled\nholober\nkwasny\ndiamonique\nnizuc\nwellworths\nslaatto\ncibm\ndhers\nsaudati\nbohnsack\ntchato\nsalahudin\nnaharro\ngjxdm\nleakproof\nbrushlands\nalfrey\nbjorling\nseube\nnarraboth\nguised\npacketcable\nhogsqueal\nbracigliano\necopower\ncashcall\neverus\nmummifying\nvillet\nseckman\naccom\ntraductores\nbankinter\namjed\nchemchina\nstetko\nmeridius\nrecapitulations\nrabeca\nresponsiblities\nvendio\nbastardize\natiqur\npersonology\nsketchier\nshutler\noblon\nkaempf\npimpled\ncafm\nkampmeier\nchoosier\nantipasti\nideaworks\nkidult\ngadair\ngahrton\nyurinsky\nomido\nfielkow\nwillersey\nalmarcha\nluksa\ncheba\nukshin\nzeltzer\nbaratunde\nturbigo\nserabi\nendemically\nermir\nstpaul\nesigodini\nschletter\nhaishu\ncissel\nstalzer\noenologists\nparanoiacs\nloflin\nranjitsinh\nbekman\npper\nthirlaway\nrusada\nlathered\nliljeberg\nhazak\nrayhon\nredacts\ndeyaar\nmceleny\nmiskovic\nunrecognizably\nwennersten\nheying\nharverson\nisum\nencasements\nocen\nstorywise\npeili\nyijiang\nnahcolite\nvertegaal\ntavar
is\nmeditatively\nseptoplasty\ndeolis\nsosh\nmooragh\nrockbottom\nneurostimulator\ncheroots\nmontanana\nfoodshed\nhirers\nunax\nwimpenny\nbouchers\npersective\nmorjane\nverg\nruettgers\ntrainbearer\npharmacogenomic\nmarull\nchanock\ncholish\nunderhandedness\ntharcisse\nmacozoma\ntenaciousness\nstatkevich\nmarnich\nguildmates\nescude\nbugandan\nsaffronart\nwatchbands\ncereblon\ntokon\nbitondo\nzarghoon\ncfed\nloutrophoros\ndesensitizes\ntauch\nlungelo\njednak\nguiseppi\nwhitebirk\nevaristti\nconfino\nconstition\ngrbic\nkesch\nventilations\npehe\nmtvn\nmtpd\nlibsker\nsufganiyot\npressphoto\noverexpose\nlizhong\nrohrig\nroseires\nmoneysense\nathfest\nunbendable\npenrhyncoch\ndisconfirming\nvdacs\noccaision\ngalila\nmurviel\nyussif\nstateparks\nslawinski\nlasante\ngyrates\narmstong\nservie\ncharvil\nlutron\nmejorar\nestlink\nmarinopoulos\nekwok\nlonay\nizmailovsky\nladhar\njonjon\ncbsp\nkayumba\nmacintyres\nnoyze\nperfector\npromontorio\njoyalukkas\ntreilles\nfossel\nhiguey\npartizansk\nsternthal\nadegboye\ntroeger\nniniane\nbengen\nzacho\nsandbelt\ncarltons\nmegadose\nlisnek\nsurrell\nchurov\nsherida\naustrialia\ndatavision\nbendinat\nixys\ndamndest\ntilberis\nlynna\npalel\nchineseness\nfhlmc\nbooktime\ntalt\nmagpantay\nlifewater\ntiuxetan\nambiguousness\ntomeka\ndarkes\nzidlicky\nqouted\noccassionaly\ngigmasters\nrontzki\nshemona\ndisbarring\njelenic\nkloet\ngianadda\ngorteen\ntranum\nmatinale\necobici\nszish\nkeflex\ndistrest\nfrassanito\nrafaqat\nmturk\nalliegro\nelyzabeth\nlamisil\ntesoriere\ncaraveo\ndisconsolately\nrawashdeh\nmefou\nrslc\ninnerscope\nlipocalins\nsidner\nmoneytalk\nfundamentaly\nhongkai\neicker\nkesterton\nmotionx\ncommunicability\ncameraderie\ndornes\ngearwheels\nefficientdynamics\nrightholders\ngelatins\ntreborth\nrafeal\ndhca\ntampakan\nkhallad\ngronholm\nerte\nwordley\npefect\nraechel\nicae\ndivalproex\npredigested\ngalgorm\ncauquil\nschrek\nphangan\nsolidarite\ndyagilev\nrolison\ntarnya\nzesa\nrolapp\nneyens\nstaylittle\ntsirekidze\nmvas\npla
yfire\ngcca\nwhenver\ntillya\nthirkettle\nundemocratically\nzakria\neuroprop\nloreno\nvelthuizen\neigerwand\nlinenhall\nspectris\nistalif\ncpex\nhonaunau\ngarofano\nduggin\nvaamonde\nprople\nbelcon\ndumbos\nwichner\nthielicke\nwestquay\nsprackland\nrelection\nstinted\ngremikha\nannitsford\nvitually\nzige\ndambrauskas\nfosbr\ngáll\nhebeler\ngsps\nexxxotica\ndiamandouros\nmazzitelli\ncomverge\nanguishes\nbulstake\npcra\nigiugig\nzhengcai\nwincent\ndéfago\nmichniewicz\nclearspring\nfantasmagoria\nalegent\ngrynsztejn\nlici\nspörl\nfromms\ncourreges\ncrimeware\nrefired\nsahnoune\nlixiong\nromarin\nwhatev\nraghead\nfoodsafety\nboudha\nuludag\npoofter\nhalpen\npanss\nhorsefair\nwaterden\naskary\nexumas\nelectic\ngoetschel\nvectron\nbabycakes\nhanoon\ndraftswoman\natcitty\nsubrogated\nlivor\nloope\nhohnen\nnurhasyim\nathr\nbastakiya\nfunster\nretiled\ngissin\nmarrazzo\noyebola\nffms\nedendork\ndoretti\neckstut\nbonchev\nsuncal\nexpensify\nsandbæk\nwavi\nawps\nvolkswagon\nmeutia\nsunarto\nlorance\nfelner\nfanara\nwhatsmore\nbosendorfer\namamou\noutwin\nalexandrea\nfinlays\nhoneymooner\npezzuto\nweyhrauch\ngenetree\nundurraga\ninterpenetrated\ndusing\ndragin\nemblazons\nmultiwall\nsegin\nhighpointing\ndsns\nglowinski\njetskis\nmudbug\ntatitlek\nshengxian\nsurrealistically\nallnight\nsunisa\ndemaurice\npiscean\ngonged\nwillse\neichhof\ntricresyl\nchapan\nvaidere\ndudash\naijala\nspazzy\nkaurareg\nnchez\nratlike\nikegwuonu\ndeigning\nstsi\nkunzman\nladygo\nchulk\nstickhandling\nloterij\ndhanjal\nmutative\nhostings\nfeeblest\nbarechested\nbetted\nwarmaking\ngretch\noffspinner\ncogenerated\nbuckwell\ngirafe\nprimping\ntaxers\nexplosivo\ntecnimont\nrightsizing\nkolinisau\nvinicombe\nlevengood\ncresselly\nvoil\nkazmierski\nsousan\ncerrigydrudion\nkheera\nskyping\ntamizdat\ninvestement\nphosphogypsum\nhayleys\nblommestein\nrvia\nbarika\naltimo\nkestle\nvishnoi\neisenhuettenstadt\nshandan\nbaiquan\nmanerba\noberndorfer\nfruchterman\nmethuselahs\nweifenbach\nkeynell\nolukotun\naitofel
e\ncorenet\nblabbed\ngardenstown\nhosptial\nbercken\nreponsibility\nsukanaivalu\nponn\nneaton\ngladisch\ngizab\nvieites\npuco\naperitivo\nncja\ndandois\ntrenchmouth\nimmobilises\ntresnjak\napacs\npedialyte\nalvárez\ncoha\nmaynez\nnassirian\nkarchin\nkhomeni\ncressage\nmicrodosing\nbrenkus\ntutorvista\nyummies\nshimali\nbreakoff\nbroadcroft\nboomtime\nbabler\nmariott\nmastromarino\nkorting\nrustamova\ngarbee\nroumelia\necodefense\nnastygram\nlocklizard\njuwi\nnuxoll\nbottarga\namouri\nashoor\nrestent\nsbihi\nsuperficialities\nrichins\nkenk\nvivaciousness\nspringlike\njinelle\nhoppings\ntesofensine\nniemira\nphaseal\nnozzolio\nozel\nturaka\nhalkida\nluib\nxiangfen\nfragger\nbigheaded\nmccaysville\ncarnt\nbrashier\ncronenwett\ncerminara\nlenker\nbrizlee\nbronxdale\nredhanded\nsynaesthete\nklarik\noseland\nnamiquipa\nsunnucks\nbodum\nlovings\nmeagerly\nwaughs\nfrijters\nsagittis\nhoehner\nshutterbugs\ntariko\ndoltish\nsarnecki\nselmeston\nfraire\nselectives\ngarot\nprosecuters\nnuhw\nribeau\nseminude\ntavant\ndaic\ntommasso\ngraesser\npelzman\nfallowed\nkovi\nbaldrey\nbrunshaw\nputschist\nudic\nunilluminating\nivington\nhhsc\neasterbunny\ngentin\nmilenkovic\npicadors\nlinganzi\nbennecke\nognyan\nkomac\nexergames\noleana\nnicsa\nlltc\ncholakov\npriyani\nsierraville\ndemaçi\nvuclip\ngrubinger\nfazakas\nabstral\nnumatic\nradogno\nkeshavjee\napitherapy\nkiejkuty\nsandwood\nbrukman\ncrovitz\nmemjet\nrucinsky\nworawi\nmythologiques\ndoozies\nsubstorm\nbosomworth\neurolink\nbodged\nkjt\npashka\nhajeri\noveremphasised\nhamalainen\nvershire\ntennesee\nsunspider\nknuckleballers\nwarnken\nfallibroome\navoncroft\nplughole\nliesman\nminiaci\nquantative\nkazimira\nhipkin\nskulks\nswainby\nubid\nhowlite\nrangali\nydi\nshoudt\nalmadraba\njungersen\ncacciotti\nhurtault\nbriccetti\nslighest\nbrepoels\nmoniem\nkrovatin\nuwink\nleylandii\nrouissi\narguido\nnahn\ndarnedest\nkulibayev\nhejin\npfeffinger\nmahonen\nabina\nwillumstad\nasiah\ncoersion\namedei\nsandos\nsirnaomics\nashqelon\npitoc
in\ngemfire\nluscomb\nmeralgia\navicenne\nsalvers\nenergiser\nharrist\niatan\nzaika\nstrini\nsalterbeck\nbusin\nmitul\nsuperbank\naddictively\nanimoto\ncobwebbed\ndishcloth\nhizzoner\nfreepost\nlionbridge\nmayner\nenergetica\nshutoffs\nsuparak\ncamford\npicciolo\nspaf\nnotetaker\nfontanez\nmedvei\nedhouse\nshriprakash\nblenkarn\nanavilhanas\nbackhandedly\nleezer\nbouzou\nwarholian\ncosalt\nmcmurrough\ncordasco\nguapore\nparentally\ndevenny\npimpbot\nboscono\nwimbourne\nqualitest\niglfa\nprompters\nzarefsky\nswagged\nhedquist\nojwang\nacceledent\nlcdc\ncarbost\ntyurina\nplutonomy\nnypc\ndaffern\nelementally\neeps\nsmichet\naslant\njgk\nkicukiro\nreprieving\ndovehouse\nseljalandsfoss\nparadiski\nbarysheva\nunworried\nindustrywide\ntactfulness\nwishfully\nnajer\nsouthwaite\nglistened\nostracodes\nprocare\nbatallions\nbikker\ncavis\nmoeliker\nscudding\nocotber\nhudbay\nafreeca\nvesilind\ndryfe\nnorsa\nnobuki\nberewick\nsevelen\ntellado\ngabardi\nvasilija\nboureima\nmclemoresville\ncrounse\nklane\nkaluka\nbatayneh\npicogram\nsylfaen\nhopple\ndemostration\nulusaba\nzelve\ntwills\nnemenhah\nhockering\ngaiennie\ncimpor\njurki\nolallo\nseinajoki\nmalingerer\ntechnophobes\neyepet\nbroadsoft\nnursel\nleogane\nbookfest\nwicklewood\nazalina\njostles\nclasby\nartiga\ngavle\nbackgrounders\nbenhassi\nmakunga\nmouratoglou\nbernand\nchiodini\nsybert\ndevanna\nvassie\nklarsfelds\nbullionvault\narcticnet\nminguez\ncausewayhead\nrongione\nhelmetless\ndeathmaster\nclawfoot\nglosserman\ngriazev\nmeuller\nbandawe\naldape\nshiferaw\nbubaque\nissueing\nbattallion\nmanalastas\npericlymenum\npentabde\njurney\ngladdens\nbatsh\ntolstoys\ndeflective\npaoua\ncheysson\nallusiveness\narodys\nskylift\ncommsec\ninocent\nchubbies\nnarrowings\ntchotchke\ndjuan\nhayali\nkreissl\nwimco\nlamdin\nhackings\nappdata\nbeelzebufo\nvanderfield\nmaxygen\naffectingly\nceliacs\ngorier\npraiano\nnuvelo\ngudiel\nkrivsky\nmiok\ndreese\nmanevich\nberesfords\nhalycon\nsaudan\noutercourse\nungraspable\ninquistion\nvolont
e\nlangbeinite\nfadinard\nsiegried\nsimonenko\nriingo\nbarreleye\nzalba\nwqvga\ncalworks\ncattedown\nlny\nnccd\nzartog\nciea\neyewriter\nkardas\nrequip\nzhifei\nwhiteners\nnicklasson\nexpediencies\ngrimaced\nlyton\nparadisal\ngianfilippo\ntahli\nglenforest\njodat\nchiampou\nstrops\naccce\nbeysen\nreadjusts\nnonpromotional\nincumbant\ndesano\nsemiretired\nhameeda\nripely\nslaughtermen\nzolberg\ndeviceanywhere\nosud\nmoiben\noutmanoeuvered\nkapin\nbracko\ngladchuk\nmvcs\nbesmehn\nwnbd\nvalleta\ncallled\nterrines\ntalledega\nverbalizations\nraizal\nfruen\njalc\ncossy\nhegseth\nschubertian\narooj\nunnaturalness\nwsvga\noverduin\nsuherman\nkleppinger\nfrancioni\nemilo\ndedas\ncodependence\nballone\nhathout\ntimoni\ndegaetano\nmethanex\nstillwaters\njesuitical\nspoty\nmuhamalai\nhauwa\nmarcianos\nrusselsheim\nsheinin\nkatriona\nruggedised\npotrayed\naspiro\nromesco\nissaq\nfatina\nletisha\nremobilized\ncasalinuovo\nschagrin\ndoden\ntowergate\nitalophile\nfettah\nmakhtal\ndalser\nlumbu\ntransposases\nkopuz\nisaacks\nwildmoor\nsatelitte\nkamerion\nhaisley\nmexicanaclick\nluvera\nlaramore\ndihk\nmerix\nkhos\nhalfpipes\nunderlyings\nbrassell\nweybread\nshiranthi\nmabro\nravensden\nmiltons\nnafcu\nvalleywood\ncyfnod\npsomas\nspacelift\nsaleman\nnowaczyk\nlhundrup\ntyibilika\nabrouq\nspeerstra\nnonaction\nsnowbombing\nkalimah\nichim\nunderperforms\nchocolaty\nbragger\ndzingai\ngencor\npuccioni\npasschier\nrivele\ntepozteco\npellman\nmasurier\nrised\ntrileptal\nwinkowski\ninarritu\nchettleburgh\nsithe\ngloba\ntopno\nronaghan\nkinfolks\nkarawala\nchutzpa\nbalaya\ndelayer\npompons\nloewinsohn\njarzyna\nmisplays\ninvoluted\nmogulus\naviakor\ncepol\ndreamit\nlacewell\nchartplotter\ncupful\ncetain\ntematico\nbastareaud\nbeaucamps\nchildnet\neckblad\ncopythorne\ncatunda\nmachtinger\nsnapfinger\narinaga\nzennie\ngirotto\nmisspeaking\nfriedbert\nbizx\ntreganna\nhopkinstown\nconcededly\nrachlevsky\nmilehouse\nkibblesworth\npsychodramas\ntoooo\nmacklovitch\nmescher\ntoerag\nlarky\nkenaan\
ndesignjet\namsus\nbakerman\ngilbeau\nuhart\nrosskeen\nsomporn\naltyre\nbiank\nbrierton\ndummied\nsmooze\nnemelka\nstantons\nszafranski\nlvcva\nsetya\nbatle\nspetember\nskytone\ntention\nchengs\nrededicating\nlendoiro\njgg\ntrefler\nloise\ndetag\nrevells\naerobridge\ndoodoo\nknuppe\nundriveable\ndictat\nbreining\ncomeing\nrecevied\nkepesh\nyonts\nleadburn\nwouild\nmetabolises\ndoelger\nstraeon\nthinkings\nnsda\nrelegious\nwavier\neaseful\ncmedia\nchicagos\ndaise\nmartori\ncoubert\nyfantis\narchaos\nincitec\nlimato\ngulftainer\npoulenard\npresant\nexomoon\ngajurel\narrellano\nunbuildable\nconnecter\nbesteman\nsiributwong\nclunis\nlabb\nghostman\namodiaquine\nicebridge\nsuperally\nunregister\nmaharidge\nkorski\nsobieraj\nrebsamen\njuliao\ntemor\nrompers\nwaldfogel\nhighchair\nospc\nomehia\nashleys\npiontkovsky\napocolypse\nseawaters\nultraconservatives\nbyworth\noffredo\nnyffeler\ndernie\nvivox\nsetaro\nhosston\nmalinslee\ntamgaly\ngrooviest\ncadolle\nabci\ngainsville\nwearies\ntillary\nbewer\nnwas\nfelske\nbatiquitos\nangban\ncompeletely\noglivy\negitim\ntwinship\nwesthoven\ncarway\npittie\nbrookmill\nnowrouz\nsekiyu\ncarasa\nremondi\nskillington\nmolat\nracher\nkunii\nmapleville\ngiess\nnauer\ninnellan\nosfi\nskoric\ndasrath\nbarzinji\nleixoes\ngynae\nstineman\nthoght\nangélina\nallegience\nglenkens\nwahome\nlealan\nloremo\ndemissie\nafficionado\nbrocal\nofferors\nmansouriah\ncetinkaya\nroskams\noperationalisation\njoichi\nworldclass\nbeckville\nkriangkrai\nidarubicin\nscuderie\nfospero\nghazel\npenetratingly\nrejoneador\nmuzza\ncenacolo\nswack\nthinkfun\nmidgeley\najws\nanninos\ndelish\nvishy\nmisusers\nsrecko\nschearer\npenuelas\nveva\nrolldown\nrabbiting\ngibler\notone\nboekhoorn\ndeloused\nghazvini\nfalkenborg\nezatullah\njoads\nmicroseries\nscenarists\nugeux\nwaxcaps\nvexatiously\ncampanologist\ncorneliani\nscvngr\nakeredolu\npozega\nameijeiras\nsenzeni\npapenfus\nneedhams\nkarimuddin\naudard\nmorrab\ninsi\nsundus\nnanosys\ndiscouragements\nfistfuls\nprevaricat
e\nanufriev\nsinglehood\ndorito\namburgey\nskyjump\nsterlacci\nperformics\nciochon\nshinwoo\nschleimer\nboths\nendu\ncyclicality\nuceny\nclinks\nquadraplegic\nsplendorous\nledray\nfrenos\ntagaeri\nobern\ncroi\nspcb\nfoodist\nstojic\noutspend\nminzoni\njuliets\ncdpc\ntomashova\ndiaristic\nipledge\nunmarriageable\ncrisford\nlifecell\nvenexiana\nreans\ndurdu\nrynkiewicz\nfjerstad\nkempshall\ncanana\nbatuman\nkareema\nstickgold\nsaime\nvirtusa\nstreetfighting\nagentry\ninexactly\ncces\nroust\nbonachela\nsecateurs\npezeshkian\nnobble\nsenggigi\nsinet\nalimentaria\nslouches\neuphemized\nvalez\nrbbb\nmugabo\nshipholding\nabouet\njpma\nheker\nlapidation\nflugge\nglaverbel\nunrevealing\nfiddlewood\nsteingrimur\nbierset\npithoprakta\nepalahame\naddding\ntuveson\ndunhams\nswix\ndhirajlal\nminex\nmaritha\nexperi\nrevee\nequasion\naliber\ngatesgarth\ncosmesis\npercale\nurribarri\nmenches\npartic\nvayama\nulvskog\nhcan\nroselend\nkorset\nbenbridge\nudderly\nired\nsundiver\nglyntaff\nmosleh\nbriancon\ncapetonians\ncentile\nguily\ngrisetti\nstrafes\naccountings\ndecompressions\ndumphy\ntianjian\nalingar\nstriffler\ndberr\nbishopgate\nnordsjaelland\nkiumars\nsleepyheads\nphonecard\nskimpily\ngelateria\nehg\nminshan\ncwele\nguebert\nsgic\ndeigo\nuclu\neures\nlevchuk\ntoyako\nhummelo\njaspersoft\nquereshi\nrohim\ncrocheters\nblackburns\nexcremental\nuaar\neaaf\nfrankies\nolfactometer\nkassow\nsankurathri\nwilka\nunblurred\nnumpty\nglobalgrind\nmechaly\nblasini\ntceq\nmacnissi\nkawabuchi\npdcp\nnarrowneck\naeropress\ntimau\nveling\nzury\nstyger\nlobintsev\nsureyya\nmmrca\nmichaeline\nclaira\ngenack\nsetiferum\npriyanto\nsangeen\nbrouder\nnewgale\nunordinary\nyamar\nlattea\nmuvee\nhussies\ndemarais\ncushe\nsupermajors\nmanditory\nquanity\nkhrushcheva\nhouat\ndanise\nclermond\nskidbrooke\nnosratollah\nvaley\nmacdougald\ntazmanian\nincept\namreen\nwhobob\nivrs\ngrabenstein\nfibroplasia\nbulled\nadamsberg\nlievsay\nnewsite\nbeedis\ncsip\nbicing\ndriscol\nkinbuck\ndeceivingly\nanoai\nbicos\
nsoory\nparesthetica\nxme\nholsworth\nrattanarithikul\nbagdhad\nencapsulations\ntabbi\nkreo\njambiya\nflexipop\nwhatpants\nsandlers\nwatman\nuverse\ndithyrambe\nloughguile\nalewa\nshfaram\ndfac\nsanges\nhasaka\nsteenstra\nsaladworks\naerotaxi\nartio\npauntley\nfonera\ndenhart\nkleynhans\nkenexa\nmendelssohnian\noueslati\npunishingly\nphlx\ntaurel\nbelenkaya\nphenomina\nvalorizing\nmuvi\nxtar\nshalam\nmarascia\ndaleys\nsplendiferous\nsidestroke\npownalborough\nclywd\nsarries\ndywer\nroistering\nkhalikov\nnogood\nbearlike\nfrogurt\nconverstation\ngardarsson\nlessels\nwhg\nhermoine\nciarelli\nordnances\nstockwork\nstiffle\nwooder\nfastforward\ncharreadas\ndpps\nbythe\nunworkability\nsynygy\nlutnick\ngelotophobia\nwrightspeed\nollmann\ngrungier\napiana\nthumbwheel\nreaganism\ntangina\ncarcas\nselfhelp\nbeltransgaz\ntengizchevroil\nmaniq\nmangunkusumo\nintervet\nvasselon\npelligrini\nhunderds\nhasbi\nlunghua\nrussello\nhywind\ncipto\nkrupnick\neastment\nmoviehouse\njałowiecki\ndezmon\nneatherd\nfaarax\nresettles\nwispelwey\nguantanomo\nrodwin\nsdam\npemble\nkarto\nleiermann\nslec\nitched\nbhattacharji\npasqualin\nharrahs\nrejectors\nmurrihy\nmcmurran\nvervets\ngerloff\naraouane\nwillardson\nconsu\nningsih\nsmatresk\nngamotu\nfroidevaux\nmuiravonside\nthakoon\nrumormongering\nmalletier\nundiscounted\npingdom\nisnardi\nparaxylene\nsolises\nunconferences\nronc\nyishay\ntorrian\nmortage\nkainth\nfortgang\nsiewers\narpoador\npalisson\nwilsonianism\narthroscopically\nsornette\ncanicross\nmotoshima\npuerility\ntextspeak\nkheili\nkumarasiri\nvahland\nrandomise\nmargharita\nmendiluce\nlozo\nknohl\ntheranostics\nbeshears\negelston\nsmoochie\nvinelandii\nménerbes\nferronetti\namenabar\nbioswale\nweissler\nboecker\nnovermber\nbnabs\nrazzaque\ncastlefin\nyound\nubani\nnythe\nmesonychoteuthis\nmaroga\nthiosulfinates\netsuro\ncounci\nbrikho\ngarciamendez\nbillmeier\njelincic\nweemer\nboceprevir\nkoelner\nosley\nsaldivia\nosenton\nailis\npurches\npresorted\nforlini\nhapped\ngallagheri\n
stulz\nfattier\nyefremova\ninovia\nbenha\ndunderheads\nslusho\ntanerau\nmakelele\nkhomenei\nlingamfelter\nwirework\nmcleland\nchalupas\nsavins\nriccarda\nandrysiak\nsmartarse\nswiftnet\njensenius\nthierrée\nghassem\nmercatali\nphlebotomists\nnightscapes\nlsvd\nlinsday\ncurtsied\nzollars\ngalliers\nleonardis\nragpickers\nchivery\nquisisana\ncsob\nohab\nswrda\nconisborough\nloadshedding\nrattee\nwdig\nsitompul\nkrinsk\nfrancaix\nromstad\ncatholism\nrestrainer\nkajillion\npnhp\njiacheng\nlexmond\nunguessable\nstrpce\nsmotrich\nadraskan\nmultivendor\ngarlinghouse\ngreengross\nduesenbergs\nrowehl\nnimsoft\npornanong\nparajuli\neskow\nbookshare\nitapagipe\ntreacheries\nbelhocine\naseptically\ninoffensively\nwippert\nnetview\nvscc\nannoymous\ntuffrey\nfelan\nnirwan\nmissippi\nfougeres\nbubblewrap\nimitatively\nkrystine\nvipps\nasphalts\nbraynon\ncopithorne\nfearin\nsogaard\nnotecard\nsauntered\nbergeres\nluxuriousness\npickpocketed\nbustiers\nrakitskiy\nebrill\nunairworthy\nauroch\nsinkinson\nzaiger\nmojados\ngimlets\nmaurita\nelsag\nenmeshment\nhomogenously\nunharnessed\nharpertown\ntrakys\ndiraimondo\njepkorir\nfelicito\nsherril\nburguiere\nkdom\nconstituion\nbatphone\nzwillinger\nstreetsblog\nbroccolini\nndlea\npicures\nnetminders\nnmtv\npamplet\nrankest\nblackerby\nparamotoring\nbendersky\nlashwood\nwickerham\nspeiss\nschoepges\nemdeon\namau\nsafiullah\nboundlessness\nzürs\npriefer\ninterbranch\nbicuspids\nfeigley\nhennicot\nhealtheon\nakorn\nogunnaike\nplacanica\norexis\ndownlights\ndiffuseness\nbossaball\noverbilled\ngryta\nelborough\nlemvo\ntsedaka\nasbat\nebookstore\ngreaseproof\njtdm\nderrynoose\nshyra\nplotnik\ncannato\ncichan\ndebarring\ngangchuan\nkvor\nchiyome\nswaray\nrpra\nkreisman\ntamboli\ncalfo\nkarstadtquelle\nskvortsova\nlizeroux\nmilhau\nbingtuan\nspti\nreadathon\nupsurges\njdz\nvormann\nkankas\ntaishun\nmofos\nclowance\nquizzically\nofficejet\nyatama\ngarold\noogieloves\ninseminations\nidmc\nbordens\nlikhtarovich\nzabinski\nizal\npogorelich\nsovietsky
\noosh\nkopetz\normsgill\ncontintental\nopy\nphse\nmaletta\nmeirav\nntibantunganya\nmanolos\nbushee\nunfi\nrossides\nmoag\nshurfine\nbellison\nextraconstitutional\ncarrys\nbejel\nleaseable\nvisitengland\nbraillard\nhearten\nclodhoppers\nnyotaimori\nbejun\nzeltsman\nharoutunian\nbeardon\nlonni\nglomming\ngowler\nhaghighatjoo\nunreligious\nleweck\nsaines\nlaplanders\nsweetgreen\nsteepletop\ntilelli\nlencioni\nchereau\nivanyi\nmetrosexuals\nconnswater\nwhatsisname\nwinkenwerder\nselna\npitner\ndivisively\nmatutes\ndikun\ncollossal\ntrounces\ndevillez\ncoedydd\nsenseable\napplehead\nstoneriver\nbeuret\nmontalvan\nprelaw\nlices\nkiwanda\nfickes\nnaputi\nmaif\ncaddonfoot\nmostoles\nhuesmann\nprammanasudh\nhakuo\npomades\nkarmarama\nampac\nbellisle\navidia\nticktock\nbecauase\nheldenbergh\nvereadores\nculvahouse\ngalashki\nmiddleberg\nacquafredda\nmendibil\nfalus\nguenthner\nstarone\nolner\ntecc\ngavlak\nbridas\niteris\nbarzi\nlandlubbers\nsunnegga\nmclear\ncartrefi\nsivaraja\nbowhunters\nhongtong\nglamorising\nbreteau\nfaygate\nunreimbursed\njiggins\nleucosis\nperrotti\ninfovision\nbowis\ndeanell\ndisenchanting\nkostich\nkaibil\nskinniest\niaec\nlaserium\nvalukas\nprotien\nextremeties\nbreandan\ntvrs\nrozhetskin\nflybar\nchaoin\nnonuse\nsmichov\npeljesac\ngarvins\nfahmideh\nprincesshay\nbraider\nfesitval\nascerbic\nkruschke\npercodan\nhaerter\nbutkovich\nrhapsodizing\nicfj\nmiraikan\nfeltes\ntohyama\ntatweer\nbobsleigher\nmidcareer\ncircunstances\nmaynot\nokky\nkranish\nshufelt\ncrisscrosses\ngillislee\nskelwith\nflowy\nwairimu\nraspier\ngofish\nhuasheng\nmuscardini\nswinerton\nnekritz\nocariz\nmuhlstein\nnofsinger\nkhaaliq\nlidbury\nbellyaching\ncordts\ndevron\nkazulin\nunfathomed\nmagdeline\nsohaila\ncoppitt\nfahrenhype\ngbgb\nbangkokians\nwilches\nskripka\nbenter\nbureiko\npikin\nexplica\nguglani\nfimalac\ncastrozza\noculd\nbirlings\nrecept\nmagliaro\nanbd\nprous\nagroscope\nseromba\ndamante\nritan\npinderfields\nunresisting\nbeyondtrust\ngavalda\nburkha\nnjoy\nmicrogr
ammes\nairplus\nbillinger\ntalabi\nberechurch\nswamphens\nhatwell\nchondroitinase\nresposible\nschweizerhof\nbialis\nbrainier\nmaressa\ncandymaker\ntrêpa\nbaloji\nvibroseis\ntrajes\njurietti\nangioplasties\ndehumanise\ncubacel\nditommaso\nibers\nshosanna\ninverkeilor\ndischargeable\nbcfa\nkipkosgei\ngaviglio\ntarpenning\nuncrossable\nbambery\nteah\ndornfeld\nweusi\nbernacki\nwyffels\nbludau\nroboworld\nhotplates\nemilienne\nbloodedly\ntewelde\ncorle\npagitt\nkirsanow\njamillah\nfransi\nmadr\npatrixbourne\nmultiemployer\nguerci\nmicozzi\nkolomoisky\ncanoodling\norrstown\njasani\npogosian\nbety\ncieri\nboozers\ntallackson\nmagicked\nbazzell\nsadgrove\nchrisitan\nskydrol\nthickthorn\nnhema\nperh\ncokal\nsharnee\nkatsande\najinca\npotinière\ndihydroergotamine\nbaconnaise\nllwynhendy\ndoublechecking\ncaree\ninrena\ncancell\nboshu\nkushel\nadali\nratanpuri\nbackwood\nbourgass\nglandford\nworldteach\ngeminoid\nrukmana\nsabritas\ntheives\nandizhan\ncampailla\nkhalilou\npolce\nluchese\nduljaj\nikal\nkratschmer\nyoculan\ndictats\nnooy\nlerouge\nfraddon\nbarmitzvah\ncorruptors\npashby\nducket\nlashgar\naleppan\nhanovers\nznbc\nhooh\nexculpating\nsteadward\nskout\nmondex\ncpnt\nbouncier\ncabp\nineich\nbargirls\nuncooperativeness\noppositon\ncocksworth\nywcas\nhongling\nchesterfields\nhadee\ntalauega\nsennels\nmccoshen\nwildlifedirect\nquaked\nsnowslide\ngreaux\niula\nniceic\nabdow\nortegas\ndrouant\nccpoa\nwittstein\njetskiing\nvoteing\ncurrenlty\nbroide\nheteroduplexes\nsandefer\nfenglei\ndomainers\nthoumire\nelss\nvoluntourism\ntefera\nunamusing\ndibai\nrioult\nkainerugaba\nscroggin\nsuppurating\nmoraima\nmifamurtide\naldaniti\nwtrg\ntresize\nguilvinec\nmarvine\nigeneration\nlangsamer\nredecorations\nopinionating\nxiumei\nburgar\nkazatchkine\nhartas\ndropsondes\nliquidnet\nmckensie\nvivette\nsuplement\ncanavosio\nsamcef\ngeriatricians\nromneycare\njonesing\npheaa\ndesvenlafaxine\nmaraahel\nportmans\nkaikkonen\ndevico\nmavromatis\nposesses\nmurdochs\nsloans\nserenbe\nstolidly\
ndorato\nmicromania\nprosinski\nsharify\nmcjames\niacovelli\ncosteira\nmuasher\ngervay\nisenhower\nfieldman\nflorencecourt\nsmithgroup\nforness\nkorkidas\nlövin\nhaddacks\ntaei\nrummana\ngwanzura\ntannert\njamille\ngobb\nabbu\nbenbaun\nnvic\ntcan\nkarlgaard\nalexine\nfakudze\nhipbone\nppca\nbeaurocrats\nrestovich\nmanorexia\nshunner\nulacia\nyatch\ntorwoodlee\nseminomas\nsamso\nspunkmeyer\nenersys\nwhiplike\nresponsed\nkador\nmanuwai\nsajeeb\nayiiia\nstichter\nquintiq\nmastrov\nbrûlées\nlabron\nkiljunen\ndegarelix\noaker\njancevski\nrykestrasse\nwelioya\nvigiani\nhafsat\nrecongnized\ngrosnez\nposhest\nsoldz\npulrose\nsteamfitter\nassuncao\nterpin\nnorgard\nyanhuang\nupgradeability\ngreenhaus\nmultipack\ntowerbrook\nerdimi\nseemore\nstatehouses\nfreefest\nmuscato\nolrig\nwesternbank\ndivello\nquartermile\nkosner\nthodey\nintegras\nspaciously\nmaubisse\ndashingly\ngristedes\nzaretskys\ndenihan\nconquerers\naltaira\nsporns\netsuo\ndaydreamers\nscheidemantel\nabeed\nbldp\ncreutzfeld\naodhan\nmalchi\nyolky\necornell\namser\ntrabolgan\nguggenmos\nhaerizadeh\njugraj\nxiaoshu\nmcseveney\nhawksford\npunkier\ndiclemente\npassangers\nadaptogen\nshorabak\nanderby\nstrugglin\nteradici\nkazakstan\nguoxiang\nzvara\nmacloughlin\ntsaf\ndespotisms\nbucketing\nbattleaxes\nnerveless\nvalfierno\njianbo\ndeciples\nhighpointers\nindjai\nsalvadorians\ngarmirian\njaaa\ncannibalizes\nchaffed\nromauld\nboutonniere\ntrwy\nashwagandha\nfosis\nzvai\nuygurs\nmetec\ndillenberger\nchizhova\ngostomski\nswiftboat\nchout\ngenyen\npaidos\npolybona\nahmedullah\nlakafia\nforsen\namerithrax\nzandio\neramet\ngutschow\ndefinitiveness\nborque\nkokee\ntankink\nfreedive\nqarar\nsnurfer\nneutralizers\nmazorra\nnaglaa\nmolissa\nadek\ngyari\njolomo\nrindi\nskivington\nenomaly\nabdinasir\nwiva\nrylko\nfaliva\nsupprised\nredant\nrifugi\nplopper\nstoicescu\nsportsticker\nlatice\nvitens\nrowhani\ndorce\nsubfusc\nnaviance\nsinkfield\ntolerence\nabbeywood\nmaslon\ntelam\ngnudi\nsharie\nmapd\nmadere\ncarpetbagging\nkine
siologists\nyaritza\nfavourability\njundee\nabbeygate\ngongga\nbarofsky\nhakakian\ntaroni\nfufilling\nritholtz\ncwyfan\natonalism\nmexicos\npromenaded\nllanfwrog\nluthardt\nthéret\nnixonian\npaninis\ncalld\nphocine\nsynchronicities\nraciness\ndigiulio\nmallaya\nlizin\njamika\nduschenes\nnepstar\ninfantilized\nwitricity\ntaquito\nzhenxiang\nmakete\ntsuyako\nglofs\nettien\nlowside\nneskowin\ngutterball\nafectados\nmultiparameter\nrambly\nmarcetta\nfrissons\nprugova\nnutriset\ntreier\ntwitterati\namtek\nflorigen\ntoastie\nkasmiri\nhesch\nfatass\ndhiya\nruedy\nulgen\nsubero\nhasun\nfoecke\nsaumlaki\nuncarbonated\nswaffield\nsimhan\nnewsum\nschellhardt\nuncap\nwajeha\nmarchioni\ncrigger\nhoofprint\nmosside\nbernandino\nkesici\nresculpted\nchattergoon\nunsheathing\ngalachipa\ntengco\npsco\ngroer\nvillify\nfoxell\nxiaoke\ntinkly\nlwala\nkontogiannis\nhoddeson\ndevonshires\nnonsupport\nviceconte\nugoh\nscaqmd\nzwiener\nhahahahahaha\ndelger\nkanaykin\nrenqing\ncomponentized\nganswindt\ngurpal\nkurer\nsinosteel\nsavta\nlezli\nbanderillero\ntilki\nmadelena\neaws\nburiak\nkeehn\ngogulski\ntormenters\nittoop\ndevro\nnortheasterners\nthrobbed\nnysp\nyakoob\nuindy\ngantman\nlatorraca\nlefkos\ndamapong\nshyann\nrepsect\nbarmal\nsusko\npublishe\nlebedko\nguettler\ncloughmills\nspringbourne\nimpellizzeri\nvaldovinos\nzda\nwassman\nelesewhere\nbrunching\nibutilide\nbenuzzi\nsenoussi\nnaproxcinod\nbreakfasting\ntstr\nkillone\nsoneira\nmelandra\nnieberg\nromashina\nbastel\npálmadóttir\ndédée\ncaveri\nmastiha\nviolaters\nsteinhilber\nfownhope\nfona\ncrowleys\ngurbuz\ngougers\ndrewnowski\npennfuture\nmedc\nboniver\nivanans\ncrampsey\npontrhydfendigaid\nsemerari\nprofounder\nmandelblit\naustrie\nheadlocks\nkalustyan\ncelebair\nkondewa\nnadezhdin\nfeminazis\nshabbiness\ndreamier\nmejdi\nhefcw\ngocek\npatrina\nmesalands\nworkweeks\ncheaptickets\neikenberg\ngilton\nmaveron\nresourses\nfesperman\ngladdening\nyevtushenkov\nrochdi\nhorndog\nsamau\nunderpromoted\nrotini\nmeadowfield\nnokdim\nchgs
\nlüttge\ntheend\nfahdawi\nwwpc\nunstylish\nboneau\nrpet\nvollman\nudda\nbureaucratised\npytor\nkobna\nterrority\nwoooo\naddeo\nakitsugu\nkuchwara\ndepartmentalized\nsynageva\nestess\nbalstrode\nharmann\ngraftech\nafdm\nreordained\ndwomoh\nwgii\nwadiya\naldisert\naringo\nbodyboards\nmatsinhe\npnrc\nchestfield\nhypersexualized\npuckers\nlodgenet\nhittleman\nbogied\nhottes\nrishell\nambela\nprejudges\ntangguh\npailleron\nmtbc\nbeltra\nmothecombe\nfilshie\nspeleonectes\nhansruedi\nazour\nfoursquares\ncocamidopropyl\nautomobilist\nadolat\ngiefer\nflager\nexactingly\nrabach\nbeautyman\nwyngarden\nvesuviana\ndirenzo\ncharmayne\npeepo\nhartside\ntipling\nhornblum\nbrooklynese\npfertzel\nplatings\nbrunini\nbograd\ntanlaw\ncerm\nanandapuram\ngrechaninov\nkavadis\ndeadbolts\nhunnington\nstripteases\niwh\nladbrook\nrevenuers\ntaklimakan\nvolozh\nyumari\nshaygan\ndosidicus\nhermansyah\namuay\nschneir\ncaphosol\nkamembe\nbenyettou\nlauga\nbelnavis\nmekdad\nvictoza\nairgid\nbaronness\nyonghui\nlionheads\ndenigratory\nprognosticate\nmagira\nbeome\nmcluskey\nindignance\nsdax\ndemineralised\nboricha\ndeteriorations\nlittlemoss\nhematopathology\nworktable\ngraveness\nchiminea\n¸\nsivarajah\nmovietown\ncoggio\nbannier\ndatascope\ntrabectedin\nhumoud\nbrodian\njuling\nprevelent\nraynell\nweijia\nvolgas\nballabon\nbabara\ntolterodine\nveeco\nsaizen\nnafez\navantel\nkfcs\nslotter\nterrian\nmuxidi\ntrunfio\ndigitial\nneupogen\nikats\nmoutaouakil\nlandmesser\ngirishk\nunsubscribing\nmotulsky\nrakova\nhemingson\nsouthbay\nsteindler\nshesol\nheminger\nindividial\nbendelack\nrefolded\nshakirova\ndenegrate\ngladen\nburntollet\nsymeonidis\nvaxgen\nvannan\nkiyora\nkolambugan\ntipples\nneverson\nherbstritt\njdd\ncommuity\nyance\ndragooned\noiliness\nexorcize\nalbertos\nletseng\nspamer\negesborg\ntemptresses\ngreenprint\nwoolmore\ncaffine\nrunnning\nroncone\nmarysa\nenman\nmedex\ntbhq\ngeeben\ngenotropin\nabdelaal\nillustrational\npolcari\ntigs\ntimbrook\ncambanis\nrachev\nrightous\nlivadas\nwhyte
s\nlifechangers\nkover\ntirumalasetti\ntrabocchi\nderunta\nastex\nhackshaw\nsettis\nidds\nrepudiations\ndurrand\nniqabs\nseguela\nfaciliate\nantwuan\nsudatel\noutblaze\nshmarya\nmechele\nsonador\nnontransferable\nfavaretto\naihrc\nbowlful\ntaslimi\ngeotrax\nagencys\nbobrun\nverasun\nacquited\ntakazawa\nsunand\nsitoli\nperiwigs\ngourdie\ntholan\nundoctored\nphilarmonia\nsailes\npersonalises\nlawdragon\nkovalski\nqjm\nptac\nalsalam\nkorobochka\nceaucescu\nlionizes\nramunas\ncombinado\njaker\ngurel\nbaikalfinansgrup\naçai\nsalutatorians\nblouson\npendon\nscallywags\nwearn\nphilistin\nmcneff\nsheaff\nmavrud\nvernick\nowczarek\nsafah\nyanovich\nobiageli\nbisignani\nnerdly\nunfasten\nsemiretirement\nddysgl\nlambrinos\nlisowice\nschwegel\ntarmiyah\nhambrough\npepcid\nsonotone\nnehemias\nebuyer\nvincci\nchupin\nnoctilux\nxiaoqiao\nyevsyukov\nkood\nfrostie\nheires\ncounterphobic\ncastlecaulfield\ntemped\nsadlo\nwestergard\nmiodownik\nassh\nanigo\nmisdialed\nsharenow\nbrenninkmeyer\norsak\nangop\ndisinvite\nxeta\nmalmierca\nferkauf\nchukiat\nonochie\nseaglider\nmatambre\nranjeni\nfilreis\ncoquese\nwsts\nwirelesshd\npivotally\noptx\nwoome\nsassard\nwoolfs\nprotectees\ntransue\ncronic\npsephologists\najuwa\nminca\ngontineac\nuncross\npcam\niggle\nsedgeberrow\ntrovax\nmalmanche\nconod\nbecel\nsteingard\nfreedlander\negomaniacs\nwhino\nmeinrath\nrisling\nfasso\nweedn\nrizieq\nrugrat\ndescalzo\npelindo\nkindra\nsourcemedia\nmetamucil\nwaemu\nwtri\ntoolchains\nphotocure\nclothbound\ntatma\nsolimoes\ngargled\nburnmoor\npinnix\ndisagreeably\nhamayun\nokst\nwestrock\naddd\nsomavia\nidentiy\ncirrate\nrhonheimer\nobfuscator\nmelantha\nboorstein\nimmunet\nathanasopoulos\nsouchet\ninplace\nshumin\ngraspable\nmanikchand\ncheywa\nuncertificated\nfeike\nwebawards\nstrautmanis\nistcs\npelite\nshripal\nfibrate\ncounterpointing\nbitani\ndarbys\nmarlpit\nivuti\ntrabajan\nsednaya\nmandelkern\nmartletwy\nramaroson\nclie\nultang\ntaffa\nchigago\ndancigers\nconstitutionals\nytterhorn\nplacation\nunp
recedent\ncoketown\nbavani\nunshaved\njackline\nnver\npudenda\nbengalese\ngutseriev\ngarrion\nspaven\ntoplou\narbain\nloest\nbaulin\nlionnel\nunitil\nbriginshaw\nkhilani\nkifner\nmonestary\nloaeza\ncolleage\nmwafulirwa\ncapitala\notherwordly\nmahdaoui\ndaugman\nsysomos\nbeirn\nontrac\ncornum\ntribeswoman\nrightism\nvillagarcia\nbansei\naned\nashaolu\nfaloria\nmisspeak\nchikhani\ntauter\nbarefooting\nsinduhije\ndeddie\ncourageousness\nyordenis\nshenise\nbranin\narcigay\nchervin\ndesvarieux\nfviii\nthreequel\ndouglin\nbarbanell\nmarshchapel\npratkanis\ngoldschlager\nsoffia\nshabbes\nyossel\nberkowski\nhorizontale\nloked\nretreive\nkalukundi\ndonvan\nsalave\nsheriffe\ninamine\nwojas\nsubparagraphs\ndardick\nbreadmakers\njobbery\nanouck\ncharlety\nnewp\ntawazun\nishwor\nhrmph\nkirchgasser\nneureiter\naprovecho\nsular\npalimbang\nsavnik\nlepape\nedcor\nsandella\nquicklaunch\nmelodiousness\ntuitele\ngurnon\nfleetwide\nvirtualy\ncnso\npolysyllables\njohnners\ncivilities\ndhafir\ngelula\nprtg\nimasuen\nimmergluck\njasvir\nhighlevel\nsulukule\nbotcherby\nmarzoli\ndesertified\nsteets\nactuant\nautie\nboisfeuillet\nputrefy\nfleri\nfellings\nhofler\nmamounia\nshribman\nfastline\npudeur\nazizollah\ntallamy\nwireweed\nolika\nsovrano\ntightwire\nsquali\nfarecompare\nfathalla\nmonoi\npastika\njuddery\nkrylon\nmelanne\nbaggier\nsirul\nvartabedian\ngontarek\nbrancy\npoehlman\nketia\nweisblum\ncadger\nparping\nwusses\ncolliver\ncandoco\npraedium\nstobswell\nreadopt\nadoringly\nstaniszewska\nliyel\neidu\nbouroullec\ngotsch\nkaroun\noppurtunities\nbittenbender\npadf\nkienitz\njoyandet\nsirieix\ncoue\nprerace\ninsightfulness\ngammans\nmatchy\nevertonians\ndoggo\nshafiqur\nantipiracy\nthorstenson\nsiadatan\ngradiska\nwinsham\nhamori\ncwmdonkin\nheinla\nswitalski\nccmt\ntsoukas\ncorkish\nkilshaw\nforthriver\nnightwatchmen\nmalnik\neditorializes\nwhittakers\nlizewski\nvollis\ngohari\nskarzynski\nbruell\nstarsia\nteeder\nmorlands\nnorthsiders\nwhettam\nkihuen\ngarlaschelli\nrenel\nknoops\nsu
lfosuccinate\nrowleys\nstefane\neyong\ngauselmann\ndemsky\nbonaroo\nluzius\nseanor\nlinendoll\nfatica\ntitters\narrythmias\ndinedor\ncarapelli\ntinterow\nevem\nortolans\nparhamovich\nsilfra\nuxue\nkringelbach\njuandre\npanamerica\ndoblo\ntemperence\nyawovi\ndamjanov\npenkov\ntibballs\nyukes\nicall\nhosseinian\nymddiriedolaeth\nintercall\ndancap\nhafnarfjördur\ntinaco\naraqi\ngreengrocery\nbelam\nscarpi\nthierer\nlactations\nspocks\namarena\nponemon\nknchr\nspecced\ndensify\nknipl\nteuge\nbejam\njospeh\nwasem\nbodyguarding\ngamelink\nbocm\nkyaukkyi\nxueying\ngwatney\nkaopectate\ndaymer\nredscout\ndenbow\nroadsport\nlitlle\nmacallen\nseremet\nschruers\nabriachan\nmayanmar\nimfc\nmozartean\nlunkhead\nthreadsnake\nbrobst\nmahaday\nmasunda\nkerbed\ntranel\ngooper\nhaxhia\nepower\natchinson\ndiscusssions\ngarsson\nysbryd\nmuscian\ntraitorously\ntechmeme\nsoutherness\ngaches\ntraditonal\ncerisier\nrenza\nsavories\nrhaglenni\njefrey\nkarraker\nchiuariu\nbeaulah\nndiritu\ngeocoins\nbroerse\nfiebiger\nkarlijn\ncyndie\nnapps\nsuperbitch\nberès\nhunke\nslesin\nsiree\nscard\nwohlleben\nbinladen\nvandinho\nnset\ntoberoff\nkappelman\niwv\nnewforge\npiekos\nbeaudreau\nmolitika\nguaranitica\ncomputerise\nrehfeld\nhungriest\nkamathi\nstrathconon\nkerryson\nlevota\ncusses\ntrius\nlokke\nsmallbusiness\ntresaith\nhadramut\nkosasih\nreinbach\ngurantee\nkittelmann\nholbox\nshafiqul\ngermier\npennys\nsitomer\nguarch\nmcgregory\ncompetance\nguennol\ndislikable\nndamase\nyoville\netms\nnevine\ngaev\nladdered\nsacrileges\ngaehtgens\nwaybuloo\nfilete\nharrhy\ngeneraux\nhydrex\ntingya\nfootlockers\nresseguie\ntruell\njoksimovic\ninded\nraveché\nsidya\ngónzalez\ndoury\nbeavering\nbishopmill\nhardwearing\nmainelli\nmurigande\nhearld\nexcercising\nmichigans\nbatailley\nhaveland\npowerleague\nvillonodular\nhissyfit\ndiera\nleucopenia\nnuclearization\nimprisioned\nthorgeirsson\npifan\nmalulani\nprivleges\nflyering\nnicox\nabrahim\nunplowed\nkazenga\nmvezo\nturbins\nalpr\nlaroe\ntechnostructure\nrapt
akis\nsphenodontian\napsf\nmonêtier\npiferrer\nxenotropic\nsantanam\nkeertana\nroenneberg\nbumpiness\nfuraha\nciana\nschuringa\nelectrofunk\nshiroo\nunchastened\npiolin\navailabilty\nsmithsons\nvetters\nqiangba\nnetwars\nmultilaterals\nhiasl\nkamaljeet\nhankie\nthawat\nmercel\nmogull\nfishville\ndumitras\nbosacki\nedgiest\ncgtp\nfayfi\nwitchalls\njalazone\nproenglish\nsplain\nfraternite\nevangelique\nostrower\nfaeth\nauchenblae\ncrapanzano\nkuijper\nehui\nvanasek\nairbuses\nbelcrest\nkunonga\nchizen\nbentilee\nfellowman\nbutterstone\nganko\nbernadeau\nkochis\nmagimix\nglums\nmirial\nbonjela\nernsberger\naustalian\nbentzion\npurho\nokeafor\npaller\njgd\nkcpl\neroticised\ntitchner\nensorcelled\nremortgaging\nglossiness\ndopplr\nidloes\ncuillins\nziada\nclubfeet\ndeeda\nweanlings\nbirdsnest\nbeamy\nedmo\nglovework\nkwanzas\nketelsen\ncnto\nallagoa\ngrimbert\nbeckson\nhladik\ntycoch\nrobl\nprotaganist\ncraphonso\nratafia\nrezun\nroboform\nbezy\nforcast\nitacaré\nkeirstead\nkonieczna\nteeterboard\ntartly\ntelenews\nbandou\nekgs\ndrummuir\nteaboy\nflexnet\njingfang\npoaches\nmedpage\nmisdiagnosing\ncalazans\nreckford\nmoayeri\nnocher\nstaement\nfederkeil\ndowdney\nxinfa\nstokdyk\nazizian\ntransandino\nserajuddin\nsuperbeing\nshumilina\ndowncity\nlatourelle\ndiggerland\ndeluz\nfoinse\ncarenet\nreiland\nxiaokai\nyunosuke\nassegaf\nrenationalised\npeschici\nlobsterman\nmagoro\ntrusonic\nmenglian\ndécoupage\nalbertonykus\nrovito\nshushing\ntokia\nmarite\ncelerra\nkwamena\nunicharm\ntracheas\ncabbot\nreitzle\nandrucha\nsalbris\nkheny\natomenergoprom\nshantallow\nunsucessfully\nblondies\npeiyan\nyukihisa\ntranspartisan\nbleachfield\nballcock\npiaskowski\nshahsavari\nchatikavanij\nmankoo\ntansor\nwintrust\nponciau\nslipups\nhammershoi\nmbachu\nmorosely\nnikethamide\nreengaging\ndodick\naaranyak\nfrousos\ntheine\nflowsheet\nfflint\nadnewyddu\ngurniak\nsaxaphone\nsleaziest\nsibc\nappler\nintroducting\ngravestock\nemadeddin\nsqueakers\nricqlès\npomas\nneida\ngooseflesh\nynysforgan\
nunconflicted\ngenebanks\nhapus\ncoatimundi\ndolbeer\nmodelmaker\nvalrhona\nreapportioning\nfisseha\nssfa\ncoogi\nghiaciuc\nqatalyst\ntripps\nkaminetsky\ndmitrich\nfurzey\nnebit\nlndd\ncomped\nsomarriba\npumariega\nnoncash\nnonexecutive\nlonstein\nfroes\nthinko\ngurwitz\nponderable\nwhitling\nhinphey\njoselit\ntomana\nperfuming\nramsley\ndetailers\ntutan\nhairdryers\ntatil\nmuenter\nstofile\nrocques\naccurso\nevitable\nmidfirst\nokal\nilheus\nsculfor\nsomashekar\nwambugu\nbluecher\nlyter\nwoippy\nablin\nkubasik\nskinnygirl\neweida\nshekinna\ngluek\njunkmail\nubelaker\nlezana\ngobshite\nnibert\ndavonte\nssan\ntwinges\ncorrodi\nkrapyak\nhomesdale\nshirtfront\ncesarsky\nappreciator\npitmon\nhanggi\nlilliam\nlaffita\nkruglyakov\nbillionare\nsiroki\nbarbourula\nsoftwear\nxylar\njuldeh\nretailmenot\nthamkrabok\nmayola\nmesquites\nhabre\nwhitewave\npergau\ncraigielea\nyuegang\nshalee\nfinco\nbroumand\nreichek\nlaeng\nzinicola\nlevadas\nlutrus\nthighbones\nsmallridge\nspinnrade\nvinocur\nactovegin\nyagil\nbiocity\ndetermin\nadbc\nutrinski\naclt\npresten\nlumin\npiomelli\ndabbawalas\ntamecka\npitavastatin\ndevinsky\nvinehall\nmeeusen\ndyllon\nditmore\nroadtrain\namparan\nsearc\ncatheterizations\ntangutica\nsomanahalli\nchaston\nhaubert\nprobosces\nschonborn\nbaysse\nribiero\nescuinapa\nbiostar\nblobfish\ndhcc\njapanes\nmapondera\npassionato\nzeidabadi\ndometic\ncharlsie\narnulfista\nunmannered\ndongguang\ndignifies\nsassiness\npunnet\nlasmo\nearaches\ncouttie\nrailcare\npauriol\nprognosticating\nsivola\nsnamprogetti\ntianqing\niadarola\nbiocryst\nobiefule\nyankeeography\npanitan\nakeim\nmedivation\ngreatpoint\ndesselle\nyarwun\nlertchai\nconvalesces\ntimebeing\nbashert\nveronicastrum\nnonrecurring\nfyke\ngosder\nmoallim\nafemo\nullinish\nwaszczykowski\nhuateng\nnickleback\nswissmedic\ncadx\nleopolds\ndispensa\ntivnan\ngoldplated\nbjelajac\nperelson\nrefudiate\ndouge\nmikic\ninquisitively\nmolotlegi\ninebriating\ngetreidegasse\nmalarone\nnpbp\nwoodenly\nmertaranta\npatrão\nab
dominus\nblazevic\nbiaudet\nkoring\noriza\npincham\ngowlings\nrockbeare\npretlow\nflavanoids\nbarefeet\nunderdress\nllanwddyn\nsyntroleum\nroett\ngiveing\nbianucci\ngompo\nimerslund\ndiyers\nnongenetic\nnetstumbler\nsiloh\nnewsperson\nraymarine\nbipper\ntvnewser\nstubbles\nbrodner\ntraxon\nteerlink\nlaticrete\nhuckfield\nshezad\nreisenauer\nskycraper\nkisselev\nqcda\naharonoth\nstamenkovic\nguilian\nhwat\ngroenwald\ntwiname\nparkvale\ncohill\nbournewood\nfitchner\nheavenwards\nsuhada\njeilan\nmukasei\nleviston\ngroetzinger\noleuropein\nnutsedge\neloqua\nnutraloaf\ntumim\nbrucheville\ncoffel\nmocana\nsynfuel\nreimprisoned\nvyners\ntraducing\nperpignani\nhayekian\nomland\nwhackjob\nhabur\ntoloui\npriceminister\nizzeldin\nsakaba\nriteway\nshabah\nstereotaxis\nemhs\nshilstone\nwijesiri\nmaitham\nircica\ngymnopedie\nvacantly\naiport\ndainese\nhrj\ndallies\ndigitaltrends\ngrandparenting\nwellwisher\nnavegacion\nkucharz\nvyron\nhappythankyoumoreplease\npanio\nfreebooting\nwheedled\nmyrobella\nrecitalists\nnusc\nshivinder\nmolestors\naddiopizzo\nbaringer\nfarboud\nronseal\nborbor\ndotori\nrevivify\ndoualiya\nhieftje\nkitzbuehel\nmassaman\nfegs\npsychs\nrandjelovic\nsodann\nsemenchuk\nshandre\nromeny\nkneecapped\nstepanovic\ncrisson\nhemiunu\nridland\nmacquin\nmousseaux\nmemorizers\neilerts\nhmss\nsplurging\nrojanasunand\nburea\nhalvey\nsideswipes\namddiffyn\nxocolatl\nredpine\nvalj\nsuchit\nkrobot\nunderley\nstrieff\nlewitinn\nlampion\nafotec\nmuhabbet\nguatamala\nswingy\nallenheads\nteamquest\neliaquim\nbubbleman\ndelhaye\nsnipper\nbouaouzan\ntourister\nedls\nsedwill\nhoura\ncanco\ndrylaw\ndargel\nkancharla\ndeguire\nshcherbakova\ngattinoni\nntaba\nlerab\ngehlhausen\ncritising\npoggiolini\nseitel\nbicyclers\nmahdawi\nbaughen\nkosovans\nkremlinology\nmakeen\ngepirone\nxiangfeng\nspermatogonial\npulecio\ntesfamariam\nmilimeters\npossessively\nelectrochem\nunlatch\nlangrell\nupperby\nreidenberg\nwcoh\nslabbed\nvever\npryd\nclezio\nlaviada\nklizan\ntinklers\nwarsofsky\nplann\n
macua\nguidos\npractially\nhesitatingly\nondó\nsavjani\npomata\nwhitelake\norbik\nfridrikas\nscrobbling\nasshats\nabyssmal\nkasikorn\npliev\nnkong\nmilteer\nlongbotham\nisnilon\nwhets\nkushman\ngcmmf\nfolkier\nnyiregyhaza\nseynaeve\nkarvy\nconclusiveness\napplier\nfiretide\nhenegan\nsummarisers\nlpai\nagwai\ntarpishchev\nrocketships\nnationaly\nfirouzi\ncrosets\naquaplaned\npaisnel\ngurgler\npawelek\nusoyan\nsequitors\nadou\ntravelpost\nmounty\nshiman\nkosters\npralatrexate\nwohlin\nbonadies\nzoratto\nparness\nskalak\nhojjatoleslam\nespanoles\noutjumped\nmajoras\nsnickered\nknipton\nashikawa\nkuys\nchalfonts\nsafrin\nbiesecker\ngilbraltar\nborgerson\nbilion\nbusetto\nffsg\nbreakerz\nteekanne\nambled\nophelie\nniyam\nrhios\ngaerfyrddin\nedmead\ncitropsis\nklores\nsupplementals\nafrikaaner\nmukomuko\nsmeary\ntriwest\ndobriskey\ncaddish\nmahsouli\npassalong\ntheone\ninveneo\nululations\nmarhefka\nkoyie\ngaaa\nvelvick\nchurchil\ndonze\njiranek\npapermate\nzaftig\nmargara\noneplace\nperfectness\nmitvol\nvican\nhorgos\ncrestar\ncopays\nkierland\nrevlimid\npaivi\nschneeweiss\nduncon\ndeinterlace\nboniello\nwildworks\nlymn\njandola\nmadrileños\nchancelor\nbushwalk\nhunstville\nsynek\nwaygal\ngekoski\ncaite\nrackliffe\nsnfs\nukesa\ncvijanovich\nkrivoshapka\nhallerman\ntechtronic\nmorkov\nuscybercom\nshirkers\nsubtantial\nsuyono\nsliproad\nyatimov\nzhongfu\njamling\nsaltado\nstandups\nkardamyli\ncassiers\nbiospectrum\nanguishing\nmawejje\nhprp\npenjor\nsmink\niskin\nlucich\nairier\nzelston\nmultistars\nsharawi\ncognifit\ncyrine\nfranulovic\ngeltner\nviny\npenston\ncrausby\nkhrustaleva\nzambeef\nnehrbass\ngierach\nichita\nosael\nsimonoff\ncolonoscope\nkolmakov\nfihlani\nkeekle\nfuturelab\nconneticut\nkrivokrasov\nnzabonimana\nconfrence\neischeid\nduvenage\nbayeaux\natieh\nbrockert\nelwi\ncirse\njicin\nutay\nabecasis\nverbalise\nbernholz\neflow\nolvey\nkamyab\nfarella\nurmc\nstonerside\nmoxxi\nbirdcalls\nradmanovic\ncuttery\npashin\nseaburg\nliquidy\nlahariya\nhsyk\nnuerburgrin
g\nzajecar\nducksworth\nnobelists\npudukkotai\nxanthelasma\ntransexuals\nwithybush\ntwiddler\neyeshades\nnastasha\nperinger\nciav\nchananya\nnsima\ninternatinal\nclomid\nbeňová\nderichebourg\nfreear\nunembedded\noatland\nlassandro\nandrau\nhared\ncorrieri\ngawdat\nunilife\nkirchman\npfpc\nstoye\nzhiguang\nshieldfield\nfallshaw\nparod\nnaehring\npanners\nindemand\nhinduist\nmelligan\nbayik\nharpviken\nmuzsikás\nreconceptualize\nchefetz\ntimbertop\ndalmations\nantiquiet\nsinkerballer\nseeligson\nhabimana\ntrende\ntalktime\nminicars\nportacabin\noppresive\nlogophile\nghorab\nsinnington\nmilmoe\nvigee\nhagert\nsalos\nqingsong\nberkat\nstaythorpe\nbiggovernment\njinjin\nangen\ngravitional\nroadwatch\nseifi\nnewswise\noaty\nkuupik\nthrog\nriani\nsterotypical\nprandini\nlayerings\ndishdasha\nmolpus\nhasselknippe\nkowarsky\nhreidarsson\ncabraal\ncrumbed\nkarasin\nglenwright\naghan\nfarq\ncollegenet\nnigari\nderrill\npotheads\npalastine\nskelemani\nelsea\nnapw\njustic\nrenante\nsrinivasarao\noutbreed\naibileen\nhuttenlocker\nfarjo\nobtener\ngithu\nasecna\nnrsf\nbiopure\ndissolver\nfetishizes\nineducable\nquaidabad\nkakalios\nlebua\nattackes\nhirshon\ntechspace\nkompak\ncolmado\nfanatism\naberrational\nhessi\nletner\nnavigenics\nsuphachai\nkokoo\npasborg\ncolapse\nmayby\nbobwhites\nkasib\nbrandstaetter\nmarichu\npaycut\nautomobilwerke\nbertinetti\nrabbat\nremaning\ndncl\ndeplaned\nmojib\nteasmade\nlayette\npilleth\naccbank\nnejib\ntatlitug\nblann\nstaiti\nscanpix\nencumbers\nscarberry\nshmuck\njugendorchester\nbellwin\nnonagenarians\nautoscope\nrobinov\nweiqing\nisramco\nlujic\nlapot\nryynänen\nmateelong\nkneecapping\nicaronycteris\nmerald\nolian\nsuperpoke\nbabbled\ndaciana\nwindowsmedia\nradioclit\ncohering\nctca\navoider\nreiza\ngilsey\nmalène\nfinanco\nbonisseur\nflawlessness\ngalippo\naisenberg\nulcerating\nmitrou\nfurstenburg\npleuropulmonary\ndisappearence\nlefta\nglenton\nabric\nreponded\nchoppier\nstanks\ntryba\nluboslav\nfurjan\ngladston\nintersexuals\nmeagaidh\nsoy
ak\nyosses\nworring\ndund\nchamchamal\nwwy\nintissar\nmarkuszewski\nderegulates\naboi\ncairndow\nzygmantovich\ndeninger\nglasgay\nfraiman\nmetroaccess\nthorek\npropagandas\nautisticus\nfowden\nkamisugi\nsafak\nanouther\nxiaofen\npharmacogenetic\ndrene\nshooshtari\narvella\nclydeport\nzacatecan\nwoodstream\nfreshkills\nhabacuc\nreadiest\nhawbaker\nclingfilm\ndeglazing\nsheikhi\ntussie\nprocomm\ndelloye\nmuzdalifah\nutilimaster\nknar\nspcas\ntiffanys\nfleetwith\nkulvinskas\ntcdc\nfogell\npeyi\nostergren\nvenville\nsubspecialist\ntangjiashan\ntejani\nkomphela\nkleier\nmultishot\nkrasikov\njokhadze\napria\npalamar\njafza\nsoer\noverstimulating\nbigbox\ngindy\ngalaviz\njeba\nisvaran\nkrames\ngrayswood\nblumau\nballycraigy\ntessimond\naramón\ntogneri\npottow\nstonecarver\nembezzlements\ncossie\nmissan\nnells\nfirehoses\nenagaged\nhabichuelas\nandrill\ngrimmette\nmotevalli\nhocktide\ncoulters\nreanalyzing\nzoledronic\norha\nassts\nsoothingly\ntarting\noddbaby\ndavonn\nmarchex\nsidak\nabbasali\nantumbra\npozzoni\nourselfs\nlamura\nredknap\nbemahague\nmeyran\nbrichto\natradius\ncandids\ngreear\nganjgal\njenan\nbillado\nboondoggles\ncotchett\nlohuizen\ntarentino\narabov\nshkelzen\nrotflmao\nfrustated\npendrill\nscicolone\nresizer\ncheeseball\nalanssi\ngiula\nimmunohematology\nkahlah\nadvamed\nphalane\ncenterplate\nbroght\nengenderhealth\nalsingace\nhannema\ntabrizian\npiselli\nojuland\nglucophage\nshrimpy\nbioceramic\nkaven\nbluwiki\nmarkstrom\nurlicht\nlavera\nhyatte\nmizhar\nschmoes\ndilemas\ntantalo\nsmalto\nredkey\ngacc\nclouden\nphuea\nschimmerling\nsolotaroff\namsheet\ndigal\nsecdev\nallioui\nwgw\nwörndle\nvargus\ngragson\nreflooded\nwebman\npogossian\napeal\nclassey\nbackrower\nmaszczyk\nuigurs\nbutkovitz\nperscription\nbicyclettes\nrwas\nhenebry\nreticker\nfaulkenberry\ncristabel\nkoering\nbutterbeer\nshigekawa\nparrita\nhistopathologist\nscholarliness\nstarcevic\negans\nteotitlan\noutrated\nhuruma\ngumwood\nrysanov\ncnaf\nhenock\ncihaner\nfedee\ncrosschecking\ndeathl
ine\nlehtomäki\ngershengorn\nsketchiness\ncaseinate\nnmx\ncougartown\nwildern\nfermain\nzuercher\nmapaction\nlaoting\ndybul\nfxdd\ndamiri\nsummersonic\nbruemmer\nstringz\nwalian\nventless\nlurita\nsubmarket\nsamax\npermasteelisa\nhatchards\ncontemplatively\nbilliam\njurovich\ncirle\ntymms\nacerbically\npfbc\ncaidos\numpp\ncammel\npushtuns\nsaurat\npadlocking\ntrewoon\nfreewire\nagilo\npolacek\nmurate\nmassilon\nwellchild\nsulili\nlegitamacy\nbackface\nahner\nmandaza\nzoledronate\ntzen\ndrably\nsbrt\noberstolz\ndijken\nharrasser\nprincipalist\nfoodservices\nacourt\ncahnge\nmostazafan\ndreidels\nblacknest\nmdca\nbesanko\nsigsgaard\nchudzik\necumenicism\nrechargers\ndgamer\nspaccanapoli\napols\nghoba\nbucintoro\ndrozdoff\nliterato\neliots\ntabbaneh\nservanthood\nimmovably\nelsewehere\nquinley\ngentek\nlifenet\nkoterba\nmagnitka\ngillers\nherdy\nkostelic\nterremark\nshanai\ngetler\nreorchestration\nneurosearch\nwolferman\ntunneller\nreeh\nungratefully\nsanwi\nfootwells\njuicycampus\ncoultre\nmuranaga\ncaminer\ngroark\nsibor\nnicaro\nbaertschi\npicaso\ndunwood\nramtane\nstepakoff\nyno\nrobertsport\nserfin\nsavageness\nshowkat\njeggings\nmalalignment\ncervelat\nnadell\nuppance\nmulanovich\nbucatini\nlintula\nsterz\nrixson\ncapitalsource\ngooya\nbendectin\nsudbin\npostberg\nsnower\ntuiles\nybh\nlitchard\nbirchill\ncnmv\noffir\napoint\namelda\nsxephil\npluckers\nelijio\nnewtownbreda\nsmartened\nabic\ncsfp\ngaytri\nshammo\nmillichip\nghazaliya\nsakey\nhectored\ncarbonex\nwesb\nkiton\nmacala\nphahurat\nphaophanit\nmasshealth\nfetishised\nhexvix\nbyki\natns\ndibert\nleewood\nmcauliff\nbacara\nbpsd\nmshtml\nspasiuk\nmothershed\noverarched\nbowdry\nglorya\nhokule\nmosebacke\nrepping\nfozia\ndipsomaniac\nlindenstrasse\nthuo\nrizek\nnisanov\nisayevich\ncollinses\nconformities\nguidroz\npageonce\ncroud\nschellas\ntattles\njrj\nspringle\njeglic\nwhirlies\npulcrano\nmvuemba\ninteligentes\nfargana\nstosic\nhandballers\nwasha\nquaile\ninfogroup\ncalka\nbenone\nsorbetto\nseckerson\nmono
gramme\nspoletini\nethereally\nltbi\ndigestives\nmoygashel\ntreos\nningling\nsheeter\nabdulmohsen\njirjis\ncitizenm\ncockburns\ntalanoa\nbakowski\nwastebaskets\nsigard\nrenkl\nringgits\nintinction\nsuraiyya\nlinstrum\nmcphun\nvoorheis\nardah\nhafnarfjordur\nbulos\nschwanke\npaparella\nmeskhetians\nblythman\njoshipura\neuram\nsitaula\nwatchguard\nattemping\noreka\nbugarach\nguirong\ndoaa\nantiblack\nacbar\nmeraas\nleisk\nbiggam\ntissainayagam\neurig\nhenes\nstrobo\ngloopy\nbicek\nlandsliding\nacision\nfedoke\nfootjoy\ndisupted\naberdonians\nmetarie\nmalangatana\nkurlantzick\nivax\nsamphel\nkilbroney\nstah\njiankun\nreoli\nspeechley\nlasdon\nveerasingham\nfranczyk\nsulick\nkvitashvili\ndreu\nvhg\naccomplis\natwick\nmultireedist\nedmont\ndebak\nelevage\ndinyar\nvarik\nrecenter\nseewalchen\nquiggins\npicketer\npesalai\npicaboo\nbryd\ncraftswomen\nanessa\nchappelow\nsquirreling\nwahlin\nloiterer\netnyre\nminewolf\nguiza\nsciencefiction\nketbi\nskybar\nsestier\nwelldone\nogtt\nmordantly\nwordsworthian\netex\nmdaa\ngsci\ndagai\nampaw\nnonparticipants\nrennaissance\necosecurities\nscoots\njasny\nruzyne\ndalgetty\ncharacias\ncabraser\ncoelen\nembroils\nspeece\nhumanin\npolcy\nclairee\nsughrue\ndefterios\nphising\nghodse\ncherkesov\nchikankari\ncocuzza\nrepudiatory\nyeras\ncherryl\nrici\nimmortalising\nobesogenic\nladia\ndolgachev\nbienenstock\ncornrich\njekka\nmassivegood\nstreetstyle\nlipkis\nthwacking\nkebby\nseibald\nshsp\nthimbletack\ncimatron\ncashers\niimsam\noverprotectiveness\ndobrow\ncompper\ndcca\ndreibholz\ndepere\nbleepy\nlassithi\nkraisak\npetromin\nmazzagatti\nsember\noversimplistic\nvirture\nzarir\ncommandite\nmitsushige\nforgoten\ntihana\nniada\nferentinos\nloadholt\netoricoxib\nsholle\nlaciga\nostovar\nblask\nstachyose\nrumbaut\nritsuo\ncreger\ntregor\nzaidon\naragoncillo\nscammy\neuromediterranean\nshionoya\ncontillo\nsugahara\nhelsingor\nwoldemichael\nhattons\ntroutwine\nrevia\nkafle\nbrouilly\nfohn\njevtic\ncrav\nspungin\nperotta\nsundresses\ngellin\ndond
urma\nfelippa\nclinchers\nfluidigm\nmadrs\nshuanggang\nfarnood\nkippenberg\nnifaz\nbaglow\nabdullin\nsperoff\njournyx\nservini\nlegside\ngenan\nbresgen\ngershoff\ncorncrakes\nslavemasters\nkittell\nbersey\nameriks\nsamour\nukerc\nmullinax\nyazdanian\nschoerner\nsolot\ncranachan\nwayuunaiki\nkunzmann\nmelvich\ngrubel\nmaximilianstrasse\narrise\naubins\nbondgate\nwaseley\ntarby\nbriquets\nkulovits\nepad\nkielar\nsolitariness\nbedke\nkrevsun\nhirschowitz\nillner\nopher\nfroths\ncanfin\nhilford\noffcut\ntolia\ngokool\nenthrallment\nbuhriz\nwoehr\nnmhc\nplatzl\nmujahideens\naroop\nrettaliata\nphcc\npolititians\ngrajewski\ncodders\ngharraf\nfirescope\ncasarella\nmasnata\njeevanjee\nsearingtown\nchiacchia\nwabho\nskateboarded\nwhitens\nlakeforest\ninclduing\nzipora\nqiyao\nremortgaged\nweisswurst\nmaingot\naquazone\npranced\nanmin\nkingibe\nrmic\ntealight\noustide\ncoggon\nhazardously\nprockter\nhnwi\nvesteys\ntaxin\ncridersville\nmajungatholus\ndooky\nelectros\ncjones\ncenedella\ncapellio\nmchoul\nrugare\nlowermybills\ncraigcrook\npowernext\nfrancsico\nstonely\nshantham\nvanderwall\nabsolem\ntemming\ncubbin\nvendeen\nvitousek\npirih\nwangai\nalanne\nchengping\narithmatic\ncenic\nnitb\nfructis\nmeyerrose\npagentry\nrosmira\nbejing\nmenuires\nthomma\nechem\nspitbank\nsesheshet\ndamascenes\nsternheimer\nandjelkovic\nshieber\nkorniloff\nrasped\nsattelberger\nsandtrap\nidataplex\nwideroe\nbridgforth\nossete\nportloe\nbwrdd\nwhelping\nrussan\ntascon\nmioduszewski\nsmulian\nbumann\ntormenter\nchwe\nmidle\ngelperin\nbordage\nneurohormonal\nfieldrunners\nphenergan\nusgi\nterrantez\nmallegni\npourable\nhutabarat\novercook\npecheur\ndecidió\nhoft\ndoomers\nushikubo\ndestigmatize\ncikobia\nlianhai\nexternalised\npojama\ncacee\nwimmers\nhrft\ntyber\ncontructed\nryndam\nconocophilips\njarmers\nconcon\nbaich\nhjordis\nethopia\ndelphiniums\nlsuhsc\nbromm\ntalpatti\ntysabri\nschreider\nbigchampagne\nrecomends\nregilded\narcusa\nallusively\nsainey\nweetos\nemken\nsoghomonian\ncartelli\nexm
ple\nlogline\nunderarmed\nwurmfeld\nbelgrader\ndawut\ndadnapped\nmenschel\nscrummager\nrubigen\ngeremie\nbonks\npantukan\nskalnik\nsprinkel\nsimington\nhalyk\ncolwin\nkunzi\nsalomi\nlyalin\nschlutt\nfolkenflik\nmunts\ntwilighters\ncraptacular\nassasins\nmajnoun\nbrear\nhumpage\ngangmaster\nfidayeen\nantiglobalization\npecor\ndelligatti\nswingball\nclocaenog\nramatam\nsporthotel\nargn\ntouchmark\nkinglsey\nyewdall\nfunpark\npointin\nmunkhbat\ntianamen\nkaysi\npinkstone\ndashtop\nshoulds\nflello\nulleren\nkanipe\niqlim\nasnc\nbrikowski\nbutani\nbeilock\nrennselaer\noenological\nfauxhawk\nparrotting\naltinkum\nrechnitzer\nblahs\nlukach\nmanvinder\nantoher\ncapricolum\nxianglu\notila\nspinghar\ngyppo\napdal\ntother\nherzel\nfapl\nbochniarz\ngudjohnsen\ntranmer\nkeyholders\nbacarri\nathenes\njinked\ntherafter\nmanglik\npeader\nperons\necobot\npasinato\nedenred\nschodorf\nsiestas\nartfl\nvanfleet\nwiseass\nweckherlin\npeaceplayers\nmedborgarplatsen\nkosovich\nsonenshine\nliftings\nwirginia\ngruzen\nuniban\nwordworth\nzomegnan\ninflationist\ntyna\npolycephaly\nhejlik\njeeja\ngrindingly\ngerecke\ncollectionneuse\nshiya\nmortarman\nmatzka\nmekongo\nnjawé\nmcns\npillarbox\nnumbnuts\nushe\nfarnhill\noveracts\nzerihun\nreichwald\nkorins\narchimboldi\nkalamian\nyoky\ncolberts\npanayides\nhildalgo\nisof\nroundtown\ndebiopharm\nrabins\nkaskenmoor\npatapons\ncoretech\nfretton\neidarous\nbirchenall\nknoechlein\nuninterest\neusr\ndaliburgh\nbarberries\nokole\nsneum\notremba\nbalmuth\nbeleaguer\nmontbleu\nhomebrews\nslavena\nwbes\nunsettlingly\nbohinc\ncircomedia\nsageman\nautoglym\nkappy\nfreegold\njarstein\nfreindship\njolette\nheldrich\nzaffarano\namadito\nweibe\ngü\ndjezzy\nburlesquing\nsauds\nprevis\nkamolvisit\nhovington\naahperd\nhurewitz\ntravelle\nlifebridge\nfobt\nsarahyba\nxiangzhong\nuyanga\nbarnevik\nshadiness\nwhitmeyer\nflyposting\nkoram\nstickels\ndehumidifying\nchurchkey\nwhitwam\nkevrekidis\nsmokiness\npleace\ntacketts\nfirstservice\ntevaga\nfailsafes\npokrovskiy\nsm
aby\ncrowdspring\npoofters\nplonka\nvawdrey\nsupposely\ntotalfinaelf\nsomato\nbobotie\nschnurbein\ndickoh\nsummeren\nshuffield\nsayee\nbarud\nwelchez\nwriteback\nnikozi\nmeheux\nlabcoat\ngulei\nheards\nballhandling\nbrisiel\ngybing\ndhifallah\naltenwerder\nfazeelat\ncapotorto\nyouthworks\nrelkin\nbalthrop\nrachwalski\ntiani\ndaner\nsenmaya\ngimv\nresiting\ndjana\nlieden\nstabiles\nwoge\nmatland\nmukoko\nflewis\nwestgroup\nandronikou\nhousesitting\nmuzzamil\nsabhan\ndannemiller\nancientness\nsuora\nsalvagni\nhydrocephalic\nspitballing\ndecribing\nimbe\nunhealthiest\nmuppalla\ncolyandro\njoshing\nheadstamps\nstrook\nlavenda\ndimofte\nlucken\npollner\noldstone\ncmvm\noniel\nkirkmuirhill\ncutkelvin\npmetb\ngoleen\noutlaid\nostagar\nbilbow\nmaclennans\njinjie\ninsistences\ncolaton\nmingjie\nfatuously\nenodis\ncutifani\ntampep\nsansing\nmuneta\nrudovsky\nyanique\npicenze\ndemya\nteragram\ncandidancy\nizale\ntroise\nbelhouchet\nformella\nmulligatawny\nkyzylkum\nhalac\ncerbalus\npogoing\nautohaus\nemblematically\njaneczko\nnewsgator\nsousanis\nfollas\ntoefield\nwaldrist\nstancombe\nchiffchaffs\nstrb\nfügen\nllado\nandalasia\nnienhaus\ndayquil\ncpal\npensylvania\nbalasz\ngardent\nglobaltel\nweisenthal\nkipkemoi\nbaconator\nramuan\nquestin\nsuao\nwhorish\nhoneybaked\nwikinson\ncardhu\nsiochana\ntsuyuki\nsamulski\nzegar\ntroncones\ncafero\nyalie\nbetak\ngastos\njiulin\nkiosko\ncoltec\ndomit\nghadban\nbowdlerisation\nmakutu\nreintensify\npatronises\ntavullia\nashame\ntresselt\neammon\nunfortuntely\nspinoso\nhorniness\ntricep\nleinenweber\nlaible\nkoepplin\nhypersensitivities\nzhengyue\nrinet\npicochip\nrelights\nkulchak\nnaubinway\nneedletrades\ndisca\nzakiyyah\njumby\nammends\nbartholomaus\npelambres\nqirbi\nbaymont\nentonox\ndrinmore\nmasterbatches\nlissome\nstemlike\ncosner\nsuhua\ntopcoats\neclypse\nmoppets\nlivian\nknackwurst\noosterhouse\nhmip\nffel\nbalestrino\nvistan\nteragrams\ndefrasne\novariectomized\nimmergut\nlongwe\nbreastfeeds\namsted\nfromson\nsuperdawg\nskapp\n
brochettes\nchantix\nlogicacmg\ncalamos\ngrassfire\ndecathalon\nchivalrously\naquitted\nbenzdorp\ncognard\nydy\nbattleboro\ngroundlessness\nhanane\nsavasana\ncanteloup\ndalworthington\nlubet\nmiharja\noffler\ncotif\ndilullo\nbookatz\nkournas\nkolirin\nbanita\nfrancess\ntarani\nblaxell\njanger\ncloudsplitter\npaintworks\nmesocorticolimbic\nkopta\nhonghua\nnhengu\ndansko\nsotelco\nturnill\nhundres\nmiloševic\nmestawet\nlekka\nwaisea\nexasperatedly\ngollaher\ndircm\niqua\neschenfelder\nanythig\nhouwen\nmanslaughters\ndormy\ngolbin\nromeva\nsotolongo\nfalstone\ngethyn\nzarabozo\nvorsah\nunrigged\novadya\neurofighters\nkhames\nstonebrook\nbuzenberg\nlogrippo\nfiom\nkuchner\nduerer\ncelotto\nscorebox\nbaggin\ncarlstroem\ncotrone\nramchander\nhezuo\nmatloob\nsware\nmusicianly\ngranjeno\nmuneeza\nbroadridge\nbodycheck\ntaunter\nscrammed\nponterwyd\nwickramasuriya\nsomnambulistic\nivanic\nhorkos\nsatia\nreplacment\ncleversafe\ncybersoft\nipri\nepischura\nserrand\nkeppra\noutbraked\nemblemhealth\nsoundworld\ncarcary\nfardh\nryelands\nnadcp\ngroused\nbelieveing\nwindland\nbecamse\nbusansky\nmeredov\nbarbancourt\nnevsehir\ndustjackets\nbennette\nbourgeoning\nlitchborough\nclonroche\nmeneguzzo\ndannheisser\noutr\nbancvue\nvertiginously\nneotame\nhussing\nquiltmakers\nndayisenga\npennyburn\njablokov\nwabano\nmantuano\nbarbarically\nianello\nrondin\nchanukkah\npinborg\nwyludda\nhimler\nbuoying\nsundeen\nbranom\nkapalka\nschupbach\nchebli\nmishear\nbichlbaum\nroadpeace\nsheinwald\nsaidiya\nkesseler\nmarybelle\naccessibilty\npillowy\nfuturefarmers\nbernadin\nbelleroche\nconverj\njanakiram\npalinka\nchinasa\nedelheit\ncenterbeam\nrickabaugh\naziga\nmarouelli\npariya\nwildhack\ntazhin\nxiangming\nkenehan\nextrordinary\nappleseeds\nbaglihar\ncrampy\nbrazaitis\nbamff\nbeurle\npropagandise\nidenity\ncaulley\nvodicka\nlionelli\nfreakouts\npidgeley\nheimuli\nsorouh\ncurrensee\ncapolupo\neveriss\nclickability\nsombogaart\nstilleto\nmcgranaghan\nmoneeb\njerbi\nfuturex\nperversities\nkenville
\nfirts\nunkissed\ngfms\nforelocks\nctdi\nfrunza\nsachtjen\nwaringin\nsuperveloce\nbuglisi\ngilian\ncasetti\nsuporting\nhuettner\nshiptonthorpe\nshatby\nmckessie\nipdr\nwrongheadedness\nyemo\nbensing\nbodypump\ngoaler\nsgas\nantiauthoritarian\nmegilot\nprotegees\nsaich\ntiemeyer\nquagliata\nmrowiec\ncruikshanks\nforegin\nscheinmann\nleisureplex\nstegomyia\nsandata\nfezzani\nopalinska\nrynell\nblackband\nsihine\nmcilorum\nolushola\nnotrees\neiddo\nhelvetas\nmerdle\nmalvinder\nshagang\nmarianito\nllanelltyd\nstamou\nosnabrock\nceruzzi\nbaciro\nddinas\nzerok\npenaloza\ntuirc\nlipumba\nfigher\npenalites\nsteamrollered\nandijanis\nschnelldorfer\ncastara\nchirikure\nbackscratcher\nbrintons\ngurp\ncommonside\nrhoad\npolkey\nnoall\nwitin\ncatoe\ntoay\nballetomane\nmicromet\njamaldin\nhythiam\npetterssen\nkotsiopoulos\nhydroplaned\nearmuff\nnoncompete\nskimlinks\noveremotional\nsourani\nestevao\ntexim\noctabde\nouani\nchudnofsky\nkamiti\nmedela\nrahmans\nyahama\nisaya\ngoethel\nwebwatch\nresurges\ntausif\nnexplanon\ncdrh\ntrecco\nwdav\nfoxbar\ntudweiliog\nwayport\nthoden\ndocumenters\nballee\ngurnell\nunskippable\nsomberness\naleppian\nehrsam\npacleansweep\nfesikov\nespiner\nbarranger\nvoorwerp\nbidwells\nzester\nstigsson\nkinlow\nsungwoo\nbonsey\nsenfronia\nreunifications\nliswood\nbelorusian\nlukovic\nsimorangkir\neways\ndcsa\nsupernap\nsevice\ngrybauskaite\nezenia\nlidel\nkastigar\nmarmotte\ncolberti\nwesthusing\nnoseclip\nfrimmel\nossobuco\nbonders\nwildblood\nharithiya\nviners\nfarouki\njoshie\nrixin\nsupermom\nsurgutneftegas\njhuapl\nconstructeur\npocahantas\ngoate\nhites\nbartestree\nvermonster\nficheras\nslowish\ncudas\natlasphere\naggressed\njeanswear\ncarafes\nadventurists\nortenzio\nnides\ndisabatino\nbarcola\njobcentres\nsoapmaker\nglattfelder\nmaaruf\ntitulos\nevelia\nkomlosy\ntundavala\nfrederking\nblether\nintravesical\npasik\nenlli\nbealle\nuscm\nmanamela\nhayslett\nbarbourne\nserenella\nunkles\nmckool\nnumberland\nattcks\ngoldstrom\namerigas\nporgras\nprimp\
noncofertility\ncavas\ngoolrick\npiedfort\nriklin\nlevanzo\nbouchnak\nopton\nmoneyman\nriboulet\nosmans\njebson\nferwig\ndonatucci\nbennun\nlabrooy\nkozeny\nelisapie\nfriere\nekejiuba\nedmondsley\ntayon\nbirnstingl\nallocco\nlalaland\ncorrupters\nipsco\nnarcoa\ncentrella\nwohn\ngrippingly\nrashbaum\nipti\njawboning\nhicktown\nturangzai\nplecnik\ncampanulatus\niovino\ncrueller\nsichi\nmikalauskas\njobholders\naidesep\nhydrocracker\nbliny\nstarmine\necolife\njarina\nnicodemos\nberlinia\nshadowserver\nmethenolone\nregene\ndeterent\ncambrex\ntrengwainton\nmescalito\ncertolizumab\nleisinger\ncommes\nhuajian\nmuazu\ninadvisability\ningored\npontificated\ncontigent\nokuribito\ntaglianetti\nsinemet\ndevard\ninfared\nmavroleon\noleos\nmorvillo\nbovanenkovo\nnamey\nwhon\nmargoyles\nsaladrigas\nzizzle\nrepecharge\nrespectible\nyemenidjian\nrny\nvystar\npandolph\nnonpregnant\npattama\nfurillen\nnegari\ncontemporize\nhabineza\nganzorig\nbillerey\nschörling\nellams\nbutik\nquaility\nshamefaced\nvytex\nmondli\nzhaparov\nbehove\npowersharing\nvigeant\ngrancabrio\nlonseny\nadroll\npeschek\nskorupa\nscalisi\nrevivers\nemenalo\neisermann\nbroderson\nmusketier\ncmca\ncondesending\nbullmann\narmelin\nbingzhang\naskett\nprogovernment\nengleby\nosculation\nbablitch\ncasuality\ncorera\nwaddleton\nzubrow\ncuoio\nsaccos\nhorseplayer\npostrock\ngmoser\ntipline\nmillionnaires\nindrawan\nshengjun\nmobocracy\nqinjian\ntdindustries\naberation\nhaughian\notim\nkingsdon\nndiema\nvaldebebas\nvdara\nvisualon\nshaid\njumunjin\nheverly\nortloff\nzubaz\nurkullu\nyankeeland\nbakey\niaastd\njamake\nevoh\nmidsections\njhonattan\nraffield\ninfinately\nglidescope\nespaliers\ncrossharbor\ndumbells\ngoldstrike\niddle\nveddhas\nbrazer\nsullington\nsanth\ncolonography\numdloti\nphyscial\nprovidência\nchokshi\nmadeo\nginster\nfreysinger\nfedaia\nstudentuniverse\nguohong\nsadigursky\nopland\nbioinitiative\nehtisham\nmisapprehended\ncorsas\nkoplan\nkayihura\nsuhay\naleris\nbullwhips\nrockwaller\nueapme\nunitards\nm
illenniumit\ncrunchgear\nhormazd\nbiomarine\nbiscan\nvoalte\nmeikleour\nkingsmills\nbarafundle\ncarestream\nnbpts\nfriarton\npalsied\nyechiam\nguastaferro\nquerulousness\npayline\naraghi\nbellwethers\nfulchino\ndelasau\nsehring\ngianaris\nabsssi\nuntypically\nbisek\nbreedt\npittin\numda\nashurov\nzeldes\nbennick\npenasquitos\npanjiayuan\nafirma\ntshibanda\nsuperfreakonomics\nmombourquette\ntossups\nconversly\npechorsk\ngluco\noutbacks\nnorstrom\nwitheld\nshawnie\nbodyslammed\nchunjiang\nluftman\nheadcases\nriverhouse\ntslp\nnorlington\nfuggi\nwashday\nwegs\nsuvereto\nfrempong\ndollys\nbumperstickers\nnutritionism\nforouhi\nferwerda\nbliar\nbrownlees\ninvega\nwisgerhof\niacobelli\ncheesey\nllegan\nshoket\nesnard\nplunks\nkiarie\nsechan\nbeltian\ncosseted\nasyla\nashimolowo\ndecoratum\nmarsdens\neldean\nweldin\ndabate\npagb\nferziger\nrotundity\nfrancl\ntucan\nmayreau\nfontdevila\nhollendorfer\nuckington\nmofilm\npsme\nmulee\nballyholland\nautotrack\nexquisiteness\nreprecussions\nvringo\ntrabulsi\ndipsticks\nvisitdenmark\noreshkin\nwhiffing\nnostalgics\nflugsicherung\nnondisruptive\nunsayable\nursprache\nrihn\nkonlive\nmocad\nkupchan\nrosgosstrakh\njuron\nstraussians\npleite\nreax\nphilidelphia\nsurui\nwhaanga\nmoumneh\ndieted\nklouda\nchinner\nparaglide\nproshare\nbistricer\nautier\nnumbe\nvojnovic\navecia\nsteelhammer\nwonderlust\ndisproportions\nnagg\nswifties\nhissom\nvoulgarakis\nskronk\ndspca\nkulala\nszájer\ngolts\njailtime\nflukey\ngreige\ntomalley\nmultisystemic\nciterne\nfarez\nfishof\njeziorski\narcelay\ncreekstone\nrightscale\nhaixing\npetrack\naustralasians\nkatiya\nlimitlessly\naposhian\nrezulin\nquixotically\ntolmoff\nwapixana\nsypolt\nregionalize\npitchbook\nswitkes\nfolderol\ntravisano\nthaty\npruzan\nnoritsu\nmyrichuncle\nkheraj\ntanyong\ngerwel\nvilledrouin\nkaminskas\njustins\nhammarstrom\nkountouris\nhansman\ndinsha\nwadesmill\nrealisms\nmoessner\ngladestry\nmyair\nwijdan\ntweb\ndeat\nkaroutchi\nkindlier\ngrotech\njabon\nperrig\nunoaked\ndroutsas\
ngreensomes\nlaubman\nshuteye\ndisport\ndelerme\njacomet\nmountbattens\nareli\nprisum\nquadruply\nleapfish\npasserbys\nkoranda\nimplemention\nnanuq\nboulters\ngohara\nchatzi\nmecar\ncmsb\nmladjan\nsannae\nsquillion\nstreitberger\nimobile\nkersels\ngulliford\ncapellaro\nobession\nconygar\nschummer\nomotayo\ncaseware\nrecognisance\ndarwars\nbelluomini\ncarstanjen\nsfiha\nleyrer\nnoaimi\nshiffer\nmihelich\ngaddie\nacritas\npeones\nhattenberger\nliepins\nparaic\nbuyat\nliederman\niskysoft\ncourteousness\nyontz\nbotnar\nkulur\nstrok\nanapu\nbanksters\nalynda\npostively\nmonsall\nswires\nlahem\ninnata\nlambroza\ncaraccio\nstadiem\ngavanon\ncundo\nbeukelaer\njarosik\nlumena\nassumtion\ncanditates\nronza\ntrakatellis\noutdrawing\ntranferred\nporzecanski\nmandaloniz\nlarzelere\nlucknam\nloudcloud\nvaltrex\nashif\nanasi\nhupy\nunhappier\nmeasor\nshouldve\nequiptment\nlomia\nharbus\nexigences\nproshansky\nmyhome\nbakhat\nmurrayshall\nlibidos\njennine\nlammin\nschnellhardt\nlapcat\nsheku\nprehuman\nhandsaws\nserotta\nzetti\nzaitoun\nblvds\ntardec\nhkja\nbathersby\nshimell\nbugat\nsenderens\nliechtensteinische\ntrengrove\nbatawa\nvarois\nfourquet\noverpressured\nfairooz\nwildberry\nzhevnov\nmitsy\nwefi\nselham\ncaliskan\nporbeagles\nyabuli\nstjernstedt\nsandylands\nrainshower\nduesing\noksala\nveingrad\nmuhairi\nlamonaca\nlakbima\nshenon\nmiaskowski\nbullshitter\nbaqouba\ngestrinone\ntelaprevir\njohnpaul\niffs\nteenscreen\nmaponya\nened\ninveigh\nlicuanan\nmohommed\npassell\ndoitt\nalliums\nunattractively\nquixall\nquaero\nunrideable\nzava\nyoulia\ntsylinskaya\nabsorbine\narminas\nfarkhar\nhebditch\nkrissie\ntayvallich\ndisinterments\nchildern\nlabovitz\nbizilj\nkrokodiloes\ndumpsites\nminnion\nsazzad\nstadtlander\ndrawsheet\nkranjcar\nikle\ngayetty\nhenshaws\ntimelier\nhumaya\nmehyar\nhdq\nagnolotti\nmustafah\nennahar\netbe\nnouzha\ndefoor\nsbec\ngrossetête\nswimmable\nbarshay\nvalasek\nbrestyan\nmaterialscience\nreprocesses\ngoldway\nallenspach\nabhin\nslutz\nholmeses\nfattan\
nbanman\nalonissos\nlundt\nhouver\naversano\ncyffredin\nhameline\nkibali\ntradus\nwilenchik\nmoragas\nruperts\npaliotta\nweaire\naviate\nbahima\nrespirology\nbakheet\ngiegold\narrondisement\nproteon\njardini\nbiedron\ncvsa\nguyader\nelectroencephalograms\nretools\nprounced\nalmacen\navoide\nryefield\nabdulmajeed\ndissinger\nsaptakoshi\nwimbledons\nfurure\nbellchambers\nfirestones\natwar\nderagon\nrsdl\nspottily\nsurmont\nelkstone\npoldrack\njeanney\npgmp\ntowerjazz\nscamorza\nexluding\nonanuga\nschennikov\nselectboard\nsibisi\ncamapaign\nhargray\ndatapipe\ntapson\njadan\nalsac\nnolley\nbarnev\nkefallonia\nflegal\nakoo\naeras\ndilello\ntravolution\nhoutzager\ngamemakers\nmastek\nresponsabilities\namarra\nwibert\nsarafpour\ndassies\nnightwatching\npowerlet\nqualifer\nzakeri\nphumelela\nwysteria\nmastel\nonx\nmolfese\nresits\nyessenia\nrussoniello\nhypocrit\nkarunarathne\nstraughter\nkvitebjørn\ncrickmay\nhierholzer\nslobodyanik\naduaka\nkirpatrick\nstatesperson\nreinfused\nadnec\nbournmoor\nmillam\nseafolly\ntahai\npavilionis\nincludeing\nlittledown\nshrawardine\nhusseins\nmedicates\nsundowe\ncorgnati\nkolega\nhartge\nsymphonically\nfragranced\nfinkin\nppz\nrechanneling\nsuctioned\nendoscopies\nforkin\nwinstrol\nnantgaredig\npljeskavica\nkarmani\nprebensen\nocéans\nredeploys\nmslo\nstackley\nangelena\ncollicott\nutapao\nserghini\nzooman\nmousedeer\nfrankenfield\nrijos\nniess\nsameday\nchomba\ndecongesting\nmouncey\nboisterousness\nfibreoptic\nfleishmann\npewsham\ninconsiderately\nnarfe\nbabybel\nthorpedo\npushelberg\nacquistion\ngombeen\nerasmas\ntexturized\ndislikeable\nsnoozy\ncapitalinos\ntokenistic\nchaza\nyeltsina\nzhongjun\ntonights\nimaginengine\ngehrt\nsugarbeets\naisher\nortube\nproddings\nprosthodontists\nunclip\nnyetimber\nstila\nsuccoured\ndonovin\nkajganich\nablum\necotality\nreenforce\ngauthreaux\nclanky\nocrs\npmda\nmoonshots\nkeheliya\ngoony\nrealacci\nthrenodies\nrubingh\nhallak\ncheatem\ndismayingly\npoospatuck\nlorgat\nboback\ndiscrimation\nosito\nu
nthinkably\nheptaminol\ntrouvadore\ngalv\nbrangman\nmazhari\nbenicar\nalderminster\nstokeleigh\nsnoots\nforgeard\nphinda\nechaveste\nmisalign\nrovinescu\nparkies\nabloom\nguiffre\nnding\nzunaira\ntugman\njaven\npodro\nubaydi\nuncrossing\nleaverland\nzonis\nkernals\nmolner\ninverrary\nbagis\nsladky\nicemakers\ndesignworksusa\nrozy\nvideanu\nwormgoor\nflavorsome\nrittelmeyer\nlahj\nkerger\nbussereau\ncupla\nbalornock\nustadz\nmakoua\ncroskey\ncareyes\nyusgiantoro\nharwoods\nalcober\nhighboy\natrap\nomed\nmiasmas\nfrayssinet\nchowhan\nctrp\nlustfulness\ndoghouses\nbottleworks\ntahmid\nronika\nbridgehaugh\ndecommissions\nupline\ndodt\nhighbrows\ngamesbeat\nkingsleigh\njessyca\nembargos\nmegayacht\nlodeweges\nspendthrifts\nhoggin\ncremisan\nsebby\ncclp\nslabaugh\nhernial\nsturzaker\nhengli\ndikul\ntrakr\nalnoor\nfluoridating\nporis\nhopsital\nlailey\nelsburg\nbedran\npenenberg\nhiersche\nbetoken\nbromances\nuhai\nmegion\nkalimullah\nfaulker\nbreay\npajovic\nspokeperson\nnasraoui\nvinella\narku\nmonderman\ndeloreans\nfose\nkimjongilia\nbestit\ntianyun\nbrusseau\nkonopnicki\nuninfested\nlumeta\nchinajoy\nbouff\npolistena\nwatban\nrenewably\nnackington\nlymbery\ncurbstones\nsantonastaso\ncareworker\nexpurgate\ntimolin\nrungwecebus\ngeivett\nvitec\nrhena\ndushyantha\nweatherbeaten\ncheatley\nswatters\nsayadi\nxiwanzi\nagreeement\nluckert\nsueda\nrandoph\nreafforestation\nciuffo\nbalkanize\nyanliang\ntregynon\nhaggards\nswendsen\ncannard\nrfsl\nlalezar\nailie\nchishimba\nlefer\npround\nnetumbo\nstavish\nbordeanu\nadjoua\nbenzar\nruilova\nbatraz\nitsmf\ntombstoning\nhashimiya\ncardiogenesis\nladybarn\nwikisposure\nthreadworm\nalertbox\nmulsum\nbasik\nlopeti\nbrewitt\nscavos\nwclc\ncistaro\ntlusty\nwormelow\nnipsco\nardency\nelloway\npalolem\nlynetta\noddantonio\ntacettin\nschwope\nrothchilds\ntrelise\nslamfest\ncoxley\njlk\npletch\nasymetric\nrosscup\nulger\nifez\nkibati\npoleto\nochsendorf\nneophilia\nkleps\nbosbach\nmorganroth\npinacothèque\ngobelet\nilane\nzhicai\nserson\ng
aveled\nrauball\npainfulness\naffogato\nsturmius\nkanisha\nhaily\ndecasia\nspoilsports\nyoyes\ngorodyansky\nbydureon\nmenear\ngadberry\nschoenefeld\nsaffman\ntawfeeq\nsouayah\nspdrs\nzaiying\nkaanan\nshoegazers\nincois\nmarocchino\nfishersgate\npapparazzi\nunprosecuted\nfaltskog\nkasradze\npanderers\ntirtschke\nhumpton\ndonehoo\ninspiringly\naleek\ncelko\nreveillon\nswigs\nwaspy\nhoarsely\nunstoppably\nkrubally\nmuscati\nsalau\nguéridon\nfroghopper\nlagostina\nangkhana\ntwombley\nwristcutters\nrahulan\nduggans\nsudeikin\nmccrohan\nexfiltrating\nreikan\nrhai\nberrymans\nredsky\nstupors\nsiedo\nllafranc\nshirong\nluleh\nplacelessness\nblusens\nsmallmead\ngreencap\ncencic\nmcfarling\nsaviles\nsoutif\nshawcroft\nzanne\norlie\nroadcraft\nunderstatements\nanybodys\nbelchamber\nbessmertnova\njinxy\ningrediants\nsehs\nbowesfield\nflox\nlendor\ngrsm\nmazzalai\ngeleta\nmuddiness\nsorren\nepitomy\nzeits\nbuendorf\nthatcherites\npantsil\njeuland\nréne\naugert\npeguy\npitwall\ngarikoitz\nartba\ngreenip\nfoodstalls\nmultibrand\nneighs\nocchiogrosso\nbroton\nfleetness\ncollarette\nalmiron\nmakke\nfouratt\nbarnish\nyorked\namgad\ndimetapp\nmisdescribed\nbardoxolone\nflightseeing\nwenzlaff\nmedpedia\ntaigman\naatf\nfontus\ncyberespionage\nraguin\ntsra\nweebles\ntianwan\nuprs\nyamile\ngrammatiko\nbluehippo\nuhlin\ngladish\ngoettl\ncankaya\ntoughing\nlueth\npiccata\nbazlul\nbasuo\nzebov\ncompliantly\nsonji\nversari\nkangerdlugssuaq\ndigeo\nrooky\nprescoed\ngraciani\nsimps\nnakalipithecus\nrehl\nacef\npressboard\ntómasdóttir\nmeph\nsplichal\nhartstown\nbommentre\nbuddying\nbrickarms\ncarful\ncannibis\nzquez\ndmips\npinkness\nswartruggens\nrubberduck\naccentuations\nabstr\njachowski\novercomplex\nantianxiety\ncasued\nameneh\npaulhac\nunremorseful\ndaphnes\npatriarshy\nbruyckere\nmoolen\nchhachhi\ndanylyshyn\nkwariani\ntruenergy\nbrutalise\nutiger\nbroadmayne\nkhabarova\nwylen\nwapusk\nkazachenko\nweizhong\nivorra\ninsolvable\nbakx\nuncredentialed\nelhorga\ncomprends\nsaloom\nmputa\nthre
ader\ninteview\nvaudin\nphotomaton\nsquirms\ntrucs\nsitorus\neywa\nclassifed\ntjandra\nmintek\ngilbank\nrealtytrac\nandalio\nhennock\ngustafer\ndziekanski\nhoofing\nakok\nolswang\nsloanes\nbarzey\npazhwak\nekirch\nchristys\nprivelage\naissaoui\npeaktime\nyorkstone\nkillygordon\ncheeriness\npymt\ncorduroys\nbushies\nablaza\nroellig\nrinnai\nzechman\njacci\nisermann\nkernoviae\ngatsalov\npelletreau\nmanificat\njuwono\nileia\nbaraniuk\ntrebert\nbehavioralist\nminihane\namfa\noncomed\nspectralism\nmossessian\nnarry\nadiva\nunconvention\nuniversityof\nbodrato\nvaeth\nczapiewski\nrapamune\nyochi\nmcgimsey\nmindnumbing\ntartlets\nrissoles\njamous\ncortiñas\nkanapathipillai\ndavidenko\nfrakking\nlibrio\npegden\ninvolement\nschriefer\ncompotes\nfarani\nlavielle\nfujitec\nvandeurzen\nchiselers\ncibs\nfleischaker\nchervenak\ncigital\npiligian\nnpss\nwreckages\nseratelli\ncariverona\ngutin\nboatsmen\nthoraya\nstandeven\nshellacked\nsubbasement\nwasthe\nmercilessness\nasfari\nborzou\nintergen\nnonqualified\nfoleo\nschulting\nmoslim\nluxxe\nopenxml\nwindblast\ntransmucosal\nantiroll\nandertons\nnamdeb\nschaumburger\nrahid\neebc\nmurrietta\nhecher\nmodie\nchlorinate\nhaiping\nembargoing\nsmrc\nshcool\nantur\nformulaically\nprogamming\nkhurais\nqizs\nburpees\nchanglin\npibs\nsibaja\nmetolachlor\nseifried\nmalaz\nnacubo\nvatea\ntourson\neltron\nstyloctenium\ninceman\nactividentity\nhuybers\ndeian\nnilsmark\narputham\ncantrelle\nlhundup\nrevpar\napwu\nllike\nelsenhans\nsciencelogic\nphotocards\nvecoli\nutterings\ngreenbox\nunshipped\nkrugier\nerbyn\nberrelleza\nbirrane\nantojitos\njiuxian\nacupoint\nhedwiges\nbryco\nlavande\napeloig\nkaburu\nriverkeepers\ntadin\ninvitingly\nkayahara\nnerlich\nunmodernised\nayandeh\ngrappo\nwithi\npopli\ndemey\ntaciturnity\nhahadasha\ngolfnow\nsophal\ngiangrande\nresusci\noloyede\ndimin\nknowedge\nloua\nwhooley\nideastorm\nembroilment\ngittelsohn\ntickin\nunicyclists\nrithmetic\ninfragravity\nwiners\ncoolheaded\nzeglinski\ncaubet\narjaan\nherbenick\nq
uinette\nfotanian\nlooxcie\nkostenki\nhungerburg\nlatrina\nbeieve\nshirred\nprakit\ntedwomen\nloule\ntscm\njazzmin\npreborn\nshengda\ndepersonalised\nentriken\ncyrela\nhoodrat\noxitec\nsomsavat\nlengsavad\nrathnayaka\nbrailer\ndonchin\nrayshawn\nseada\nschweizerhaus\nnikkole\ninfy\nstirringly\nwriggins\naschiana\nhelenians\nvilcanota\nkoshansky\nsingulair\ngâr\nsleazeball\nrustenberg\njott\nmascitti\ndaleh\nverwaayen\npaerdegat\nnestboxes\nrobello\ntetelbaum\ndefensenews\nzangari\nunnoticeably\nenaged\ncraftier\nchadashim\nquarterhorse\nfredlund\nhawkei\nnosiness\nsdiri\nbissap\ndongkuk\nfaradje\nmuriano\nshigemura\nfigliuzzi\ncomprehendible\njulfar\nfluked\nunquenched\nthaibev\nfeldmans\ndarington\ngenton\nwoodings\nmargolles\nbeardall\nbhanbhagta\nkraichnan\nmankini\nicluding\nxjt\nnuancing\nsartorially\nchomped\nbeltrones\njaggedness\nmusettes\nkontorovich\ninaguration\nqama\nkiobel\ncreampuff\nkinsel\nhornbaker\nknetter\nsasselov\nheavner\nbosto\npenybryn\nrichen\ntirawi\npuncog\ndiems\nfamara\narrabbiata\nboudella\narulkumaran\npietrafesa\nolmetti\nmudimu\ntradeweb\ndcvax\nmbenza\nrussomano\nkopitz\ndappling\nfootsore\novertaker\nproberly\nridker\nneeps\ntrendspotting\nsegement\nappauling\nsarposa\ndittohead\ndiondre\nvicepresidents\nsophisticatedly\nwashpo\nwintersports\nsaddo\nremata\nreconveyance\nkhuzam\npesanggrahan\niotc\nthalians\nsarwono\norther\nburocratic\nmakdissi\nmusoni\nlavand\nlusikisiki\ncarpentered\ndiyali\nncae\nruslans\npompus\nartsdepot\nuninstructed\nknief\nopko\ntillbrook\ngroundrules\nmerseytram\nconcertgoer\ncchp\nmclurg\nahmadreza\nguanyu\nmelim\ncasiokids\nsamsoe\npagesjaunes\nziprealty\nwinspeare\nbentancourt\nissiar\nbhps\ngaith\nproveable\nfrithville\nxora\nnilbog\nkackert\naccoyer\nvitalise\nfathomable\ncrasnick\nhartnagel\nnonmalignant\nconern\ntoyer\nlakely\nisbourne\nunroped\nguerette\ngorshenin\nstarosel\nwalkscore\naptv\nsigni\nnetratings\nboschker\nmindmapper\nsumate\nbazaaris\nkitner\nbioflavonoids\nmainy\njolibois\ntheuer\n
discourteously\nanonymisation\nsweetnorthernsaint\npurloin\nfornaio\nnumex\ndefibrotide\narchibeque\nkryptiq\nduccini\ntestee\nkhella\nciclovía\nhestrin\ngearey\nbeitenu\njaysuma\nmograbi\nallaster\ncyberterrorists\nquickcam\nstakeouts\nimmortally\nmicroplace\ndhekiajuli\ngrinnan\ntatenen\nhomestanding\nikmal\nrapidograph\nicbf\nkovanda\npanobinostat\nsandestin\nfrancescani\nmailo\nlathams\nfaillace\nnamikoshi\nmailout\nnybot\nstoyka\noberhammer\nkhandal\nhernreich\ntobashi\nlinheraptor\nautoinducers\nredzikowo\nrumberg\nnariratana\nkiechel\neirlys\nsulked\nsubeditors\nmedlyn\nitzstein\nurostomy\ntrhough\nabovenet\neverloop\nprocul\nbulin\ndefrauds\ndrifty\nnonaffiliated\nvlerken\ninstanbul\nreconceiving\nalsomitra\nwingerden\nwooers\nsocca\nmulticomputer\ngawade\nsystech\nlttle\nghormach\ngoldfrank\nkalaj\ntradtional\nnischan\ngostev\nprageeth\nluhyas\naloko\nroyksopp\nhargin\nwynard\nnirj\npollena\ngermanwatch\nboringness\ncountersigning\nwthout\ncannibalisation\ndegenhart\nchromeos\ncathance\nbasterd\nshiceka\namsterdammers\nheadbone\nbomai\ntafforeau\nsherrerd\nputrov\nrupiya\nfriborg\nzipless\nliquica\ncarollers\nsmin\nunderdressed\npocognoli\nsocpa\nvedan\ntortoiseshells\nradhamma\nsmyllie\ntravestied\nwaithman\nanot\nantinarcotics\nomrf\nhunstad\ncanesta\negotistically\nmutineering\nxinhu\nsiamas\ncomingled\npatricelli\namock\ntarabella\nagbaria\nblogfather\ncollateralised\nsucide\ndragnets\nboodai\nafib\nnycholat\nshafta\ntouton\ncounterpoised\nbillauer\ndeluging\nadonde\nabuya\nhamos\ngolimumab\nomble\ngiessel\ndentressangle\nfehb\nrivisondoli\nwolosky\ncatastrophists\nregietheater\nhempson\naniversary\nwotus\nlilliefors\ntegas\nmelinka\nairworld\nkhaji\njibreen\nappup\nvongo\nhonicknowle\nguidall\nafribank\nhaycroft\ndistends\ndvn\nmandikian\nnnabuife\nhighflyers\nlorgues\nlazreg\nciftci\naltech\nyarima\nordzhonikidzevskaya\nleocata\nbrachiosaur\nmenary\nzoloth\nshopaholics\nhpnotiq\ntworkowski\nhysteroscopic\nnprs\nqqqq\npantless\nomax\npirlouit\nstonard\
nnungwi\npopovsky\nmonical\nvigay\npresevo\nintrepids\nsufiya\ntimebombs\nukrainain\nelgarian\ndonguy\nwhooped\nseabight\ndecisons\nshupp\nchristenbury\nparguera\nklerman\nshortman\ntauruses\nsayala\nmillimoles\neikerenkoetter\nffiv\npenchée\nputrescent\nfassitt\ntobasco\nchunka\ncouncelor\nmabele\nagnoletto\nsunshiny\nfantauzzo\nbabying\ntiffiny\ntwiglets\npevey\necogra\nspringcm\nbizonal\ntalibani\norcopampa\nmadewell\nteten\nmarghescu\ntriner\nusairways\nhemmerdinger\nhandbagged\nvalenstein\nlagerquist\ngyotoku\nmesinger\nnoshaq\ndifferenct\ncolcci\ntreviglas\nloaners\nshaimaa\nleonardslee\nglobonews\nnonmonetary\nsprightliness\nsupportes\ntiziani\nkimberli\nwalski\nuriri\nkuehler\nmamsurov\nbelway\nwagnerites\nroycrofters\ntakeouts\nglugs\nagood\nhomogenising\ngotsis\nprofundities\nhnba\nhemlington\nmilgard\nparthasarthy\nherwitz\nsiksik\nferrino\nnecrophiliacs\nsubsquent\nundiplomatically\nfahimi\nbavents\nzvarych\nolfat\nibisevic\narnvid\nllangunllo\npushpanathan\nblagged\nknpc\nteollisuuden\nmykhailychenko\nadnoddau\nplumps\nfirebreathing\nkazn\nklocwork\nbighearted\ntaxmen\nalili\ndhuluiya\ngreybeards\nrubgy\njedforest\narbennig\ntraiter\nbardsea\nalloro\npergam\ncoet\nyumilka\nhomasote\nmagcloud\ncryptographical\nourt\nkorhola\nfvrl\nmaccubbin\nstockbury\ngoelitz\nsedas\nhumidify\nsimultanously\nbethleham\ndurda\nwaitlists\npricefalls\nhrmc\nsheafs\nliterarian\ngroznyy\nolling\npfeffernüsse\nflightwise\nbastardizing\ntinei\nmachelle\nweissglass\nrabalais\nbanyai\nlizst\ncentralistic\nfilterless\ntrerulefoot\nmatraman\ncopolyester\nkolluri\nsefs\nsvatos\nbabchenko\ndarrol\nberlinde\ntongrentang\nbehanding\ngisozi\nautomatist\nbrovina\nlifland\nsplainin\nmindshift\nadaleen\npeguera\nuneg\nproteccion\nzhaoyu\nsuperfighter\nbernuth\ngiornalistica\nmentell\ngerstman\nnger\nprospectivity\nruhullah\nsmarm\ntelekomunikacja\ndangcil\nlacoochee\nkiloliters\nfiresafe\nepipens\npolydoros\nnestinari\nngaba\noutplays\npolihale\nllwynywermod\nsextortion\nstradsett\nhankto
n\ncrockart\ntollroad\nswannack\nsangermano\nzavecz\nfranciscos\ncheuvront\ndrobnick\noveisi\nkingsknowe\nrasate\nakbars\nseronera\nmaldoom\nchickening\nthymes\nboerger\nlouizos\nkibitzing\nmarlbank\npfingst\ntarikat\nmapco\nxiaonei\nemix\nlibaux\nundersampled\ntiandong\nkendrapada\nekpemupolo\nposessed\nsoroptimists\neots\ngoldfever\nuffindell\nfawbert\namparai\nlicensers\nstatemen\nnorbourg\ndhahir\njeannetta\nhuissiers\ndaragahi\npontifications\nwsbr\nmournings\nmisselling\nfittleton\nrockits\ncampcraft\nbronne\nacox\nopeta\nkivuvu\nndfa\njegal\nintergrity\nchittock\nshukron\nperpetuators\nwoulnd\nvalthaty\nmuncipal\nyarnfield\nhumburg\nlatifiya\nevryone\ntrcs\nunderpay\npiccarreta\nliptapanlop\neleo\nxiangjiaba\nkanyabayonga\ncrozatier\ndorli\nmulvee\nplayacting\nfootgear\ndutto\nnorak\nzimmy\nionawr\nalphasat\nfranker\npesaturo\nbinegar\ndetchant\nkscc\nbluenext\ncacchione\noatridge\nhameedullah\nmixbook\npoltics\njolynn\ndonehey\nkolbeck\nmullineaux\nhumprey\ncoskata\naleikum\nkovilakom\ndufournet\nmaccone\nflexibilty\nfulminations\nincyte\nannmaria\nyeide\ntanbridge\nwilkowski\ndekdebrun\nmafco\nkurowska\nungrudgingly\ndiogelu\nxeko\nhooty\nrattiner\nhegghammer\nwattad\ntavernelle\nsurrouding\ndiscouragingly\ntdra\nbonorino\nmontos\nsatarov\nunenforceability\nkgra\nbasijis\ndailys\ntriveniganj\nauguin\ngowthorpe\ninterregnums\nresonse\nstierle\nblustered\nsuspicionless\nsarnowski\nrecoverpoint\nleghold\nfukumori\nhapened\nsomafm\ngerr\njohnetta\ntechnomic\nparachuter\nserdamba\nentj\ntroughing\niccvam\nmuscare\nsulamita\nbaksaas\nbertan\noverachievement\nschomaker\ncosigning\ndissemblers\nvilely\nvacuousness\nsevent\nraddle\nlundbom\nogoun\nschonthal\ndegasser\nuniverisity\ntaqueti\ncorvalis\npaquis\nyonko\nsungen\nfiligrees\napicultural\nseamheads\ndecareau\nrohlfing\nspeedferries\nkkim\nnesheiwat\ngironcoli\nredlener\nayral\nbabouche\norogbemi\npenchants\nclerisy\nchristodora\ngrotti\nmikucki\ntsujino\nchancres\nobscurant\npseudowire\ndreves\neditrix\nauche
ncairn\nstadd\nklegg\nkathwari\nazzoni\nbeguinages\nsteo\nredbuds\ncontentpolis\ndayday\nbrachioplasty\nsalties\nlepelletier\nhrysopiyi\ncaav\ndocketing\nmedvedtseva\nochandiano\namorette\nscarweather\namercan\nmashar\ndowles\nhoesel\ndesbrosses\nstravinskian\nahmadian\nsimkus\nemployement\nhubrich\nkujat\npenyberth\nplanetsolar\nizatullah\nvinasat\ngranneman\nchortles\nyusufov\nchaisteil\nbancfirst\njoybubbles\ncerebrus\neqypt\nhapsford\njaekelopterus\npieronek\nwitih\nreiach\nbrylawski\nmruk\ntreichl\ngouves\nloseling\ntilbian\nhealthwise\nchillier\ndullish\nleawere\nbenefical\nculpan\nkabulov\nladanian\ncorien\nhrebenciuc\nwavegen\ndelaurentiis\nboardmembers\nsukeena\nbijur\nstupefyingly\ngermaphobe\nsurliness\nelber\nsauven\ndaylit\nsamawi\nellestad\nnatour\nbezafibrate\nzylo\necvam\nsychnant\nhospitalities\nmortland\nkalea\nzync\nkristia\nmussed\nchassaing\nlitein\nconstitional\nnoltland\nrefurbisher\nvirostko\ntroublespots\ncrispins\ngenitally\ngrisolia\npodrug\ndeadlocking\nalies\nmahmod\npuliyankulam\nkayed\nfluvastatin\ndesided\ntehrangeles\nmacanga\ndesmoteplase\nkorsunsky\nbargeman\nuncontrived\nfeistier\ndelucas\ntrevan\nmorgeson\nsamoas\nasociated\ncocottes\nrambøll\nniedzviecki\nduplicitously\nellershaw\ntenatively\nisssues\nannoushka\nbigotries\ntptb\ngropman\nmorinas\ntrlica\nabayev\nnovellis\npacewildenstein\ntrisynaptic\nmuhidin\ncavatelli\ngasline\ndrawcards\nperjures\nchristensson\nhoughteling\ninternatonal\ngruffud\ningraining\nbasiron\nabacuses\nwippler\nkingshouse\nsarmas\nmuellner\ntefe\ndhirgham\ntercek\ncfdb\nenmasse\ngreilsammer\nmedsafe\nglem\nwellby\ndisaggregating\nargentieri\ndogons\nsiefker\ncolombos\ncounterlife\nnyph\nempl\nblaabjerg\ngalgiani\nreacquires\naizumi\ndueber\ngoldenblatt\ndisheartens\nexhalted\nsnowplowing\nvmpc\nquiraing\nborowik\nswots\nhefez\nmaeba\nmicrostar\nkikunae\ntunefully\nikrima\nletizi\npsdf\nluzmila\nborght\nspangly\naelodau\nsorcher\nkotowska\npfuj\nbroquet\ndondrup\nkindess\npendery\nroadhead\nodorama\nca
ribean\nvincento\nmgic\ninlanders\nfived\ntandle\nwasendorf\ndoubletwist\nbalmullo\ntewfiq\nsolidays\nikhana\nawfi\nlimbourne\nwishlists\nfilloy\nsekulovich\nyucks\nchyandour\ndangerman\nbottlecaps\nzhiyue\nganiel\njeunehomme\nconsquences\nsquashy\nscivee\nshayon\nmealer\npulic\nbrégier\nfatteh\nportugalia\nleatha\nlinskey\ntenuousness\nkaushansky\nreget\nnatil\nkamarulzaman\nnavizon\ngrammercy\nmidrise\nflamekeeper\nshontelligence\nschopfer\nimposer\nismatullah\nonychonycteris\ngrabovoy\nferness\naujali\npakol\noverdiagnosed\nnüvi\nyach\nkulicke\njbjs\nexpectorants\nliase\ngametech\npuffett\ncanonise\ncerrudo\ntrpčeski\nhinners\nphibro\nmahoganies\nhapppened\ntemelin\nspookier\nhatpins\nmotorbiking\njedrzejewski\nubaidi\ngrats\namsha\nanbin\nvenglos\nchunhong\nlucketti\nbernardette\nkibebe\ndbis\ntulaganov\nmelanzane\nbleg\nwenxia\nkunskapsskolan\narmspan\nferrassie\nesupport\ndrabo\nrefold\nbrusher\nsleb\nmakarewicz\ntimra\ntopstitching\nbertschin\nllafur\nmatory\njiegu\nkodmani\ndutney\ntrueblue\nmusambasi\nfsbs\nsandalo\nbatterymate\nmyclimate\nnompumelelo\nbihn\nzooppa\nsleazier\nmilburne\nsantoli\nvulcanologists\nekaette\nmondrians\nfruitflies\ncusanelli\nbonanos\nfahle\ninternationalising\npenttinen\nserrill\nnarky\npluvius\noveson\ndaelemans\nstertz\nmentary\njakovljevic\nhesperonychus\ngeoreactor\nglivec\ncharonda\nzentmyer\nshopsy\nkiteboarders\narnelle\ncadougan\nkanjana\nchibbaro\ncarcavallo\nstoeckley\nbouska\nhosszu\nraouraoua\nnucleaire\nessawi\nspinesi\nlongsdon\nguildhe\nhainje\nbuiling\ndemeritt\nsompop\ngemeos\ngirds\nmollee\nheskel\nnonserious\noutboxing\nabhorent\ndentally\nallwin\nmizin\ncanynge\necec\nschulters\nlulic\nccpi\ncbtf\nlouisianians\nayarza\nakouala\ndevecser\nuntangles\nbidonville\ndoofy\nheronsgate\nakhigbe\nnaste\nivesiana\ninteractif\ncarritt\nhighstar\ngmdc\naquaduck\nmarlaine\nriogrande\navalonbay\nbaldheaded\nfloodlighted\nwiederin\nsajawal\ndistateful\nvespera\nmgive\ngroneman\nintels\nbritanni\nbrambridge\ncheerier\nolivott
o\nnazarenas\nonvia\nharofeh\naramide\nsmashup\nvogli\nchopsocky\nastudio\nwnan\nmylonitic\nsenstive\nkitwood\nyannic\nbirkhimer\ndecending\ngreschner\nmassier\nmedicalize\nibfan\nbowlingual\nrailtracks\nchandrasekara\nheyzer\nccctb\nabcam\nannularity\nremin\ndispair\nlardelli\norombi\nscog\nbovim\nbrimo\ntakoe\nspermiogenesis\niuh\njugos\npeijs\nkiradech\nstroupe\nveic\nacupunture\nphomolong\ngalovic\nspectrolab\nunspooled\ntobosa\njollett\nnakagaki\nescrowed\nmanscaping\nsamanvaya\nducheny\nsharmon\nhucksterism\nbacas\nlenzini\nshemwell\nceisler\nphillion\npondlife\nnorthcross\nxiangying\nmohoni\nlaque\npiquets\ngpro\nfederacije\nsplurged\ngiavazzi\nhighhandedness\nlillas\npscu\nstaubitz\nbursik\npeevishly\natest\nbursk\nvulgarians\nconnétables\nvirutally\nchastan\nslouchy\nstrokemaker\ncartesio\nsheutiapik\ndormie\nsghir\nblelloch\nblacktopped\ndisneyfied\nfareri\ncuspid\nltfc\npaischer\nweplay\nundrilled\nsonenshein\nbjayou\ncphi\nwilnelia\napostola\ncertisign\npreprimary\npepall\nblacksands\nredferns\nomir\nenciu\nwaern\nalwasy\nborouge\ntanjim\nscantegrity\ngtsi\ncarrez\nbigum\nbetwee\nsadoon\ncsango\nshanny\ncoursesmart\nteera\ndhargye\npunchcard\nbabygirl\nunmodern\nshelties\nsteltzer\nelsaffar\nistedgade\nloosies\nippl\nehic\nlozito\norchila\nvouk\nconspriacy\nfrigia\nmalinvestment\ntiffee\nkarawan\nwiznitzer\ncarcassés\niaus\ngingerman\nzurb\nguyland\nbrastemp\naccomplishements\nmasterbatch\nsilverknowes\nblogotheque\nbarlanark\nstampless\nvehle\nkkgn\nringhals\nsempe\nmasura\nnvz\nbaverman\nchareau\nchurchstanton\npopma\nivari\noverides\nalexsander\nemployess\ngrindhouses\npescetarians\npassata\nauque\nogundele\nicop\nviccellio\nbransgrove\nmaarohanye\npcip\nkrenar\nhoriyoshi\nluxoft\nuninvested\nsuccesive\ngreenoak\nsermonising\ntwitterers\nbakwin\nsoneva\ntonaghmore\nwesternise\ngoosed\ndelectorskaya\ntysa\ncartus\ntoppel\nmnari\nnorlund\nboutiette\nsimental\ncommitees\nsalaciousness\nglassverket\nhostias\ngietzen\ndemidec\nrelitigation\nschizopolis\nva
lemont\nroaccutane\nunquotable\ncamau\nbernazzani\ngavard\nfischell\nmozhan\nlibrilla\nbattaile\nchhouk\ndooce\nmultiton\nconsipiracy\nexpertos\nalvart\ncayzac\narkleston\nmaumbury\nhikal\ndisjointedly\ndeutag\npyschology\nbobois\nbetweem\nmoruzzi\nfettling\nschoolhill\namatore\ntriallists\nmogor\npedestrianise\napkarian\nperriers\njaniya\nsawadi\narbourthorne\nwayyyyy\ndelwa\norginizations\ncounterspace\natyushov\nreacquainting\nweegie\nperay\nmahmad\nbavouzet\niret\naverment\nbaako\ntagliafico\nprequalify\nlandaker\nmunayyer\nvratil\nlompo\npenfriend\ngauntt\nbratter\ntunewiki\niochdar\nlluberes\nmenerba\nsomervale\nflatteries\nafren\nplie\nloac\nunallotment\nstrite\nresortquest\nwoodmark\nbreare\nhopfinger\nvercorin\ncalzati\nbluemountain\nnukri\nintoxicatingly\nquanitra\nsnts\ntallie\nstricklands\nsisia\nmabunda\nselectone\nhanscombe\nnurestan\ndaulby\ntangel\nmultisided\ndilaram\nuninhibitedly\nmigala\npattni\npaleochora\nsarvestani\nhasak\njelacic\nfacekoo\nipsl\nkwis\nolabode\ntson\nbasw\nrepor\nperrodo\nberec\nmacronix\nciller\ndrudgereport\nevault\nnakagin\nmpdu\ntamasi\nsangwa\nmallarangeng\nphoberomys\nbfsu\nzatar\ncamello\nconstanly\nbirring\nfijilive\nhilborne\npeplums\nmouline\nvazon\nozaeta\nunseats\nljr\nharss\namifampridine\njanofsky\ncervasio\nayagi\nwwdp\nfeldmeier\ncrabber\nsuperantigens\nlepor\nhobsbawn\nbannwart\nbraithwell\nthomley\nbuethe\nnuvasive\nremmeber\nhallgarth\nalcombe\ngordmans\ncalpains\nyanoff\nclape\nstoltmann\nmanadel\nlenzner\nbrugnetti\nkuzubov\nlhps\nplebanski\nrizkallah\nchimutengwende\npotawatomie\nsuperfetation\ngonzalvez\nseverances\nslurpy\nschnaars\nnettlesome\npunchbag\nmutiu\nuhaul\ncassimere\norecchiette\noutsprint\ntranslogic\ndepinho\nrazzing\npakha\nsummerlands\neurodocsis\nsupercentres\nwreathe\nohtahara\ndaylength\nproglide\nseabord\nblutch\ncobbetts\nyazici\nscharioth\nwatg\nanastatic\nhagwood\nlasarow\nlavastorm\nmcquitty\nkueng\ngunay\nfuturis\nrpix\nhomoeopath\nsampathkumar\nmalloth\nsanquin\nyawner\nklaehn\
nanrig\nalent\nturnbaugh\nburey\ngreencoat\nhererra\nsloppiest\ngarriot\nsigon\nsalehuddin\nvalmond\ndiezani\nmarionberry\nganczarski\npedophiliac\nkechichian\nwoodsburgh\nkeeve\nidutywa\nboardex\nsupremist\nnonparticipation\nklingner\noutstay\npervaz\nkocca\ndeihl\nplexes\nlingoes\nproenca\nhoskings\noutsourcer\nvalkova\nskeezy\npanzner\nbansley\ntelehandlers\nnreca\nvoluteer\ntabloidism\nsecong\nnewpoint\nstreethouse\nbaudrecourt\nenglebrecht\ndelamarche\nvanaskie\npayd\njimly\ncellartracker\nallmighty\nladdonia\nweills\nentrie\nkasrah\ncreciendo\nigwg\nchopticon\nwhipton\nperkiness\nharrells\nairsick\nfibbed\nevertonian\nmitfords\nscoiety\nrehnberg\nnaaah\nlitoff\narraycomm\nstacelita\ndaysi\npopinjays\nclaypotts\nslatternly\nsalcer\nbachian\nsnivel\npapuc\nuncollectible\nvederson\njaphy\ncoronograph\nkamlabai\nkgotla\nbeanstalks\ndyszel\ncutups\ntechpresident\nsukkari\nmorlok\ncompetions\nalmudevar\nrajkovic\nspinmaster\nbarware\ndeleasa\netnia\nraeff\nhoopster\njarko\nsipsmith\ndivesture\niyp\noverengineered\npollesch\nsimulsat\noctuplet\ndecend\nhujum\nperat\ngridrepublic\nfineries\nriziki\ntollerance\nvolitile\nmerchantile\nlongney\ncontois\nbismullah\nswiki\nniba\nfingringhoe\nospraie\nallieu\nsacdalan\nsandstedt\njabots\nunpreventable\nnavaro\npogemiller\nitpc\nvideocracy\nkinnucan\nmonning\nfukushiro\nfawal\ndicked\ndobies\nquintiliani\nsartoria\nperfoming\nviotia\nfreerolls\nslaff\nrigotto\nsysview\ntagliamonte\nkhoudary\nyouseff\nmyfoxla\ncommmercial\nvigenin\naizenman\nkotalik\nhomeloans\nniggemann\npramit\nsomary\ndrycleaning\nkazuharu\nomaid\nchaunce\nthekkekara\nemailers\nbudges\nashgabad\nelastane\nreorganizational\npilypaitis\nhalbertal\nhfba\njepleting\npetrokazakhstan\nmonewden\ncvetan\nellerston\nstanya\nfetion\nmeidrim\ntowthorpe\nunbossed\nshaariibuu\ncoxhealth\nblackhouses\nsatcon\nlordshill\nfendant\nwheelarch\ndaneshill\nincompetance\nhonigstein\nhandtools\nfridson\nelectonic\nhiter\nmatrosskaya\nstrohecker\nwealthtv\nwaiganjo\nlombino\nulf
ers\nmacae\nducketts\nherawati\nangstadt\nreconception\nahlone\ndecieving\nfillippo\nfascinatin\nlantra\nreceipted\ndearnaley\nweiguang\nmytikas\nmyerberg\nsamei\nhoneychile\njubbly\nkaroki\nparween\nindocyanine\nbrakewoman\nonbase\nspectralink\ndiscombobulating\nbibf\nridling\nbosniaherzegovina\npolitial\nschue\nmultiven\nlyrids\nfehbp\nkitzen\nwaterlog\npanouse\nibsc\ngrimason\nboyat\ngiantism\nploteus\nrelized\nrecenty\ncalem\nthougth\ndeselecting\nblacklistings\nzochonis\nsaidie\nhafter\ngrucza\ntailormade\nsaminejad\nwilb\noverell\npazmino\ntelegeography\nchampetre\nhightly\ntargedau\nbuid\nintrado\nweblo\ncybercafés\nfishtails\ntadmarton\nmodrak\ncentera\nrowberrow\nriverford\npenetralia\nmayanga\nobergurgl\nchbeeb\nsniders\ngoklany\nsoldera\nbazbaz\nschnarch\nsharrocks\nslotradio\nnavini\ngyrocam\n\nbrainbench\nchalvedon\npicknicking\nlepsis\nkostrikin\naudf\nvoluspa\nprivilige\ntazim\nhamermesh\nschuening\nlyapin\nsukkoth\nkinetin\nsaisi\nthornloe\nfereti\ncinching\ncityspace\nlasak\nsamji\nbrutoco\nshaquanda\nbingjun\nengmann\nakhond\nescalades\natlal\nmightiness\nwieben\nharik\ninterscan\nsalerosa\nsaloli\nknipling\nruemmler\nvillen\nverrry\ngreenlick\ninkless\nmicrocsp\naltata\nfisman\nsudik\ncripa\nelkwood\nsuping\nflitcraft\nseasonably\nneumunster\nrevich\nndeti\nmarleys\ndancelike\nqushan\nslimmon\nfosamax\ndumatrait\nmurliganj\njunling\nhaynos\nfishfinders\noveranxious\nfuschia\nbecom\nmykita\nsongsak\ndissapoint\nefficently\nfarrage\nadsb\nschilf\ntagme\ncuttone\neconomix\nbrocke\nnetani\ngreenlights\nswordfights\nmeão\nkambaksh\nplantlike\ndustbuster\nrebtel\nsipress\nfrowd\nshawwa\nbronzefield\nsirree\nthottathil\nzaras\nkrafts\npulverise\nazuz\nzirkelbach\nmultidetector\nansv\nedemocracy\nparadichlorobenzene\nlowenhaupt\nmhenni\nstretchiness\nbansky\nfrostick\nghaffarian\nhodginsii\nkhamtay\ntieback\ngabaix\nneonode\nmiloscia\nkeera\neneough\ntsem\nasiacell\nkurtzberg\nquicklink\nbowermaster\nyuai\ndworzak\nsuperintends\nkolade\nsubcribers\nsumme
rsgill\nseantrel\ndarfurians\nupskilling\ntagalong\nbobbyjo\nkatam\nguellal\nhulatt\nmarketingprofs\nzuberbuhler\ncagaptay\ngorsey\ntouboul\ndetron\ncotey\nvictimizations\nkatial\ncrossfires\ncraycraft\nestaleiros\nsultriness\nghioane\nnethers\ngonzalves\npaduch\nrumaihi\nkitcat\npolictical\nchimenea\ndazing\nbrandcenter\namgs\nrubeis\ninswingers\nhovhannisian\nkalko\nmacguff\nhobbyhorses\nsongstresses\nholzberger\npivoda\nvisability\neqb\nsonystyle\neilt\nfolkson\n\nbarabasi\ndebauching\nflowerings\nhillgate\nrussneft\npedair\npetchkoom\nhiros\ngraikos\nwhooshes\nstockline\nguennoun\nhourse\ngartenfeld\ndevaul\ndonaldsons\naktis\nsuperfluities\nnymphomaniacs\ndistachyon\ngoofily\nsreb\nmedenica\nharakas\nbackett\nrapelye\nchegini\nnlga\nparamos\nsgorio\nwitoelar\ngasfields\nskillicorn\nhabiby\ncotchford\nvprotect\nbilleaud\noverclaimed\nemberlin\nbefouled\nsherifi\ntinhorn\nsheptock\nnasn\nmoenchengladbach\npamfilova\nuniflex\nnormando\npanteli\nanile\nderanging\nngwesaung\nhilliest\nknobbe\naskenazi\nbreann\nfunemployed\nsupriyanto\nivannikov\nasefi\nzvulun\ndeadens\nepcos\nthiefs\nhastalis\ntelephonists\nduhau\nneuborne\nslowes\nsowles\nmalaney\nmedlam\nlimotive\ndurup\nupshire\nluxuriating\nmanganyi\nbhunu\nbedeviling\nshibanova\ncalvins\neuri\ndalemain\ninexactness\ncondole\nnagreen\nzalpa\ninspectional\nexoo\nciacco\nplackemeier\nbarti\ngordel\nrepotting\nusherettes\nalefacept\nflaska\nmzonke\nacccess\nsaicm\nunironic\nsugarmann\nnicassio\nelvian\nchaowai\nbaudy\nmanorcunningham\nokram\nscrunchie\nllion\nbraband\nblakewater\npontiki\ngrunin\ncasalese\nsexualizes\nmccollom\npolderman\nheyler\nsireau\nroualeyn\nbakyt\nagita\nunipersonal\ncirulli\nroshia\nlosana\nunprescribed\ncsotonyi\nriberio\nshooutout\njalli\nfcsc\nacjs\nsilek\ndemanders\nperfusionists\nbocadillos\nconfrere\nmarzolf\njurrasic\nzhengqing\nhabitues\nasge\neasliy\nmoneychanger\nlearnedness\nbirdingasia\ncoucil\npauze\nschoeneman\nwatring\nbedspring\nferdl\ncapocollo\nsnwa\natrophying\nwafik\nbr
oadbased\nbensadoun\nzolghadr\nmedge\nhennagan\nguiter\nkennya\nleetle\nstereographer\nrusas\npossesed\nmamitu\nruihua\nmassenhoven\nsubtenants\ndisseminations\nranbeer\nslightingly\nrsvps\nuninterruptable\nschartner\nmcgilvrey\nluteinising\ndrowsing\nwhiteaway\nwysing\negana\nddit\nholohoax\nsrygley\nkikaya\nneece\nrakefet\nmorhaim\nalbitreccia\nsabtu\nskillett\ncasselle\nilligitimate\nempathises\nddrc\nkaranfil\nkeyra\ncrothall\nxingquan\nsacla\ncolting\nstaunching\nduponts\nduea\njinadu\nscer\nbrezna\ncyntaf\nseshasayee\nlampel\ngjeldnes\nryakhovsky\nbauli\nhallissey\nstepanski\nabgal\n‟\nderrykeighan\nchapek\nsecularising\nsyllabub\nsirait\npouladi\nsuwarno\nhaidle\nsubijana\nflannelette\ndonativum\nlieck\nlenell\njnet\nalyoshin\nyoungor\ntrustingly\nfumusa\nharllee\neacute\ngutschmidt\nschenz\nstokeinteignhead\nbhebhe\nmolitoris\nsoopers\nkhatak\nwannen\ndespoilers\nbedsprings\nmalboro\nsoelistyo\nremling\ninebriety\nvotomatic\nreturneth\nmyxer\nbrehat\nskinsuits\nnimocks\nraymor\noeics\nnncc\nmacayeal\nnjha\npathlow\nsmilar\npajeros\nvaldimir\ndmjm\npoularity\nstrenously\nokula\nkifayatullah\nkeyesport\ncampagin\nneiko\nfishwrap\ndisabilites\nsidneys\nryehill\ndeerhorn\nkalidou\nuncoerced\ncoarsen\nghasiram\nschlief\nchaplins\ntechster\nmangoma\nbolivares\nroshanak\nlobbyism\ntranscrypt\nmessian\ngitsham\ngtac\nmianheng\ncimolino\nblasphemously\nmagnificant\nfreeley\nhagues\nfogt\nweierman\nhigney\npayslip\ncoddles\nfocalin\nmangalitsa\ndixielanders\nindelicacy\nsidiqi\nwitheford\nsyphillis\nnetnames\nmorethan\ncolemanballs\nxianmin\nminqin\ndekuyper\nomra\nfroylan\ngocco\nshemali\nshillelaghs\ngrayness\nynghylch\ncompletition\ncalladine\nredlaw\nsmolla\nrottler\ntochman\nsidikhin\nmptf\njurlique\nattiyeh\njarinje\ncelebreties\nbrokenburr\ncalyptogena\nrenslow\nkolini\nnimax\nsirias\nkyaiklat\nhomebodies\nbhubneshwar\ndrawled\nshipit\nmesiti\nimmigrationworks\nsplittist\nlideres\nnusoor\noncor\nsamarjit\npireos\ntromode\nmesaverde\nthemistokles\nsassiest\nlieb
ergot\nartlessly\nkelvinhall\nayaa\noverharvested\nbolde\nsubsegment\npauffley\nkaruba\nbohling\ntamarside\nlondolozi\ngoosefish\nocfa\nnonsexist\nasigra\nmagnoni\nburkas\npakeerah\niepe\nmollycoddling\njerzyk\nrepublicanos\nsunamerica\nbestride\nmovielink\ninvernesshire\nesys\nometto\ncampeao\nleathwood\ncpmiec\ndiale\naqn\nslogger\nmellins\nschloendorff\ndembner\ndcjcc\nprimex\nhenkels\nanitkabir\nnewlandsfield\nshiaism\nsnots\nmichéal\nhuther\ncrimdon\ncomert\nheske\npublinx\ninveighs\nmoreinfo\nrosindale\nfleamarket\nbhac\nkliem\nplonking\nmyller\nvucinic\nnaddi\ncaptura\nkniffin\nlatavious\ncommentariat\nritti\nmarayati\nunreel\nslutcracker\nnangahar\nenervate\nhobnobbed\nimpishly\ntrampy\nbickhart\ngossipers\nbowbridge\nfarcial\nbraue\nxixianykus\nincy\ngrygiel\nslaughterman\ntaliesen\nanteon\nzelon\nsalym\nhensch\ntimetree\nlaforme\ncolisseum\ncardiotocography\njaquess\nwiand\nsedran\nfocalpoint\nanowar\ntoptable\nkalaweit\noeic\ndenstad\nmaduaka\nmeken\ndrajat\nhumbrecht\nseditionists\nniederman\nkakabadse\njasiewicz\nmgat\nkhristina\nclerically\nzangmo\nnonsubscribers\njainist\nturchak\nbreezeblocks\ngiustozzi\nscaremongers\nspringform\nblamer\nshwemawdaw\nalkalay\nrepricing\nkraine\ncahps\nwanatka\nboncore\ncarlops\nchapelotte\nlaveist\ngroshong\nmandean\ngrish\ngrabsky\nslinkard\nitemise\ntither\ntelenovella\nszara\nheinously\nmylonakis\nkinshasha\nhownam\nchungs\nstreetfighters\nexhorbitant\nvalkovich\npuremovement\ngénocidaires\nlotsof\nfreckleface\nschuver\njwf\ntoplitz\naccupuncture\nbaranka\ntechzone\njolbert\nnaraine\nsvedka\ncinderblocks\nunsal\nmasciola\nhedgcock\nmanlangit\nherczog\ntamanaha\numpcs\nshushes\nwvas\nqanooni\npronuclei\nrheeder\nwanding\nyukitoshi\nharquail\noshea\ndauth\nlockview\nioulis\ntasimelteon\neffies\naustalia\ngreenmeadow\nhallewell\nabshero\ninklusive\ntakeshis\ninswing\nmweelrea\nlooki\nreprioritization\nantwun\ninsulinotropic\npropsals\nschive\nmipro\ncobner\nsayra\nlanntair\nzayets\nczarnikow\nbanotti\nmajete\nboeving\
nmoubamba\nseemant\ncaranta\ninbreed\nharmut\nmaue\nvandervalk\nabiomed\nryori\nhollinsclough\nbenzecry\nunusally\nweaponary\nshirtmakers\ndantesque\nbvute\nwüsthof\niattc\nkruijff\nmiilion\nunconsenting\nperenially\nhainline\ncounterpunching\ngötgatan\nbonnart\nprinzing\nfineprint\nflipkey\nallegros\nclearvision\nklehm\nlorkowski\nraymour\nfujisue\nkasriel\nsagit\nbaladin\ngsce\nschoenhut\nmisgovernance\nvobora\nsydneysider\ngardaworld\nthreatend\nradiat\nesolutions\nbloviation\nwiliness\nneumar\nenzon\nyongliang\nudzungwensis\ntetangco\nteutuls\nfilmaka\nclimactically\nshoosh\nweinschenk\noddbins\nmultidisciplinarity\nyouthnet\nnagamitsu\ngetliffe\nconfederado\nhiddush\nsurpreme\npiaba\nsemiformal\nalkatraz\nchorltonville\nyuqiang\nniesr\nrpts\nkcals\ndognappers\npluggedin\ngiarraputo\nsenofsky\nunmanipulated\ndimario\nitalianisation\nmingarelli\nhafed\npatrizzi\nscheraga\npky\nmozzies\nmoynagh\nstohrer\nshireland\nmultiplus\nnadac\nyukons\nvakhtin\nappaled\nglj\nlamingtons\nperfomer\nnaeemia\nzhakiyanov\nsorriest\nsashays\nlamposts\ndelfos\nfirewatcher\nosmanovic\nlangdorf\nassit\norlob\nsatrec\namsellem\ncarupano\nteresas\npinco\nmietzner\ndefintiely\ntaweelah\nmoustiques\nalbain\nraxibacumab\nwinstorm\nenkoping\nbromfenac\ndemailly\ntudoresque\nretamales\nqiuhong\nperfom\namicone\nresika\nkloza\nfieldglass\ntrafficing\nviolinistic\nayro\nkumbaro\ndirtgirlworld\npilarcik\ndrumpark\ncicheng\nshefferman\ntrockadero\nqingchun\nviperin\nblankmeyer\nretore\nballantines\nmatamoe\ntzd\ntranghese\nwoodrough\nseatmate\nshoemark\npracticies\nburgt\nunmistaken\nsakow\nelew\nderw\nokike\nbankcorp\nsaltshaker\ncrespina\nsnunit\ntowning\ncitysocialising\ncivan\nitälä\nsomeo\nramaala\nwrdf\ntyrnauer\ntruthfull\nkularb\npaulistanos\nrastagar\nbrattish\npitango\nespaliered\nscorcese\nwhitmill\nfehrer\nbegert\ngalens\napprehensible\nyarelis\nkadyrbek\nticketline\nzagg\nciwf\nsandbo\neverhardt\nchewiness\nahdr\nmmod\nvmotion\nfmris\nbrawlin\nbosire\nsheeny\nkrivokapic\nnajiba\npor
tlemouth\nbewigged\nparsai\nradient\nkampgrounds\nmaoyuan\ncherita\nstrathkinness\nfervant\nkrusell\nlamorde\nwattville\nopenhydro\njesli\navesco\nfleetingness\nsitex\nweightlessly\nisandra\nemiro\nmenb\nitati\nkiplings\natpa\nlimbos\nklonoff\nknowlingly\nlefsetz\nholzel\nsabahuddin\nduckhorn\nkipsiele\nwarbots\ngyalo\nwifehood\nmiscalculate\nattensity\nchpa\nkraeger\nabsl\nloganberries\ntrillionths\nladygrove\nbishow\nbarcha\njourne\nschlondorff\nprotalix\ncapb\nomiyale\ngoeman\niréne\npenholder\nquintiliano\nmujati\nbogarin\nveldmeijer\nkavenna\nshaoxiong\nmispoke\ntougias\nfetz\nestephe\npontio\ncinelatino\nhradcany\nintactivists\nshortchange\nocdetf\ncortella\njethani\ntegic\nmagheramason\ndakkak\nburring\nmeziani\ntwer\nupseting\nsonicare\napme\ndaugher\nallbeit\nintrepidus\navonex\nkaraoglan\nbintree\ncaroused\nsadkhan\nabdelbasset\nkaliese\nnapili\nscherza\nmythologising\nguneratne\norringer\npiecyk\naondoakaa\nvbieds\nglannau\nstormready\ncirg\nkapilow\ngaffikin\nblutt\nryabkov\nbabock\nrreef\nhandwrite\nprizefights\nreyad\nmediteranean\nbuntain\nspicey\nnextfest\nmcduffy\npakhalina\nspiridakis\nnyanchoka\nbrandesburton\noverlit\nechemandu\nclarizio\nbmds\ngardels\njihah\ncarmens\ngerneral\nanyidoho\nortica\nnevling\nclaybrooks\nlazslo\nbassanini\nuproars\noutstays\ncoiffured\nbibring\nuncalculated\norentlicher\nnevetheless\nessawy\nbasaran\nclinard\njcj\nketen\nconstantí\ndisposers\noptomec\nineichen\nhouseago\nkrzanich\nboasters\nexpensiveness\nfreshour\nramstead\nrailfuture\nrondy\nreteam\nkrolle\ntdca\nsekita\nmitiku\nferiani\ntradional\nnidhal\nmetaplace\nschallenberger\nitalias\ninnocense\ngreengauge\nbedbound\nponied\nimpark\npanted\nentrate\nbunkmates\ndunlewey\npoac\nlucasz\nzhur\nganef\nunstocked\nhueter\nfranchuk\nlorinczi\nsizomu\nfriedelind\nbouchons\nlmax\nbountifully\nsentencer\nsichenzia\nelction\nsimcyp\ntaismary\ncardfile\nunadvised\nhritik\nexpertice\nsclar\nnsofwa\ndiamondstone\njouyet\nbaudrand\ntkacik\nwingels\nethnik\nfulfords\nstonges
t\nxtension\nheavrin\nchitron\nintoxica\nashizawa\ngoldbart\nbenalcazar\nysursa\ncwla\npyatachok\nnnemkadi\nkoshiishi\nahanotu\nbulyanhulu\nbonacich\ngheni\ntwlight\npalavicini\nnekvasil\nlannen\nmagson\ntubelike\npropert\nheliocentrics\nrusol\ntraing\nbocker\naviall\nslathering\nauthers\naryanised\nimlil\nshamefull\nhanash\nwyser\npagunsan\nclubmates\nrinta\nchieftan\nbuscaino\nrccl\ngranatt\nallweddol\nmcgillivary\nsamirah\nnonindustrial\ndharmana\npadgitt\nperoid\nquandts\ngallarda\nfamm\nisnaji\ncedano\nposman\ngoetzinger\nproe\nradhakant\nlazydays\nyark\nriversource\nwatarrka\nskils\nruehle\nkunzer\nmurchinson\nwrigleys\nhotwater\nchiwenga\nsosen\nsupersizing\ndermalogica\nuncrackable\nstakhanovites\nprickers\ntaniwal\ngiveway\nbombassei\nmaloin\ngreenfaulds\nuncustomary\nsteinhorn\nclouting\npniewska\nsoftspoken\nsashayed\nramms\nwuan\npagella\nfscp\norigene\nvaisanen\narthro\nguzm\ncbcf\nblahous\nweyco\nvasogenic\ncobank\nhkmex\nhaimov\nmght\nbridgeclimb\nflexsys\npoussins\ncardinology\njnci\ndedem\nkohavi\ndoostang\nscribendi\nsillen\ngraziana\nfischhoff\nitalk\npple\nwyndford\nmonetate\nkanakuk\nkursinski\nhutchby\nbootstrappers\nnonpracticing\necamsule\nrajay\npeltason\nchej\nfieler\nicrier\nbosmajian\nfootsbarn\nglassful\notpp\nrtfo\nmatyszak\nchatani\ntrivets\nflashfloods\njiale\nyingxia\nrhosgadfan\nbirbalsingh\nsuky\nuntalkative\ncatcalling\nruncom\nchopine\nramdeen\nkabanicha\nbeastliness\nzarubezhneft\npayslips\ncerdin\nreinbolt\nrebating\nproductivities\ndaunay\nspeedskaters\nllwyngwril\nchinaco\ntatzmannsdorf\nchipmakers\nblockson\nincoherences\nsathyaprakash\nicosium\nbiocatalytic\nbisogni\nozlem\nsharnford\nnekrotzar\nwallbuilders\nclab\ncannito\nmontagner\nareligious\ncscb\nstickiest\nazmatullah\nmumbler\nbilro\ndoddery\nklepfer\nintrafamily\neltrombopag\nleissner\nzhengfei\nmuratovic\nmedika\namatitlan\nniederhuber\nroulades\nshifren\nbraeckel\nblowsy\nucedo\ngagas\ngreivances\ngalynker\nhardup\npaiewonsky\nstewartfield\nworls\nfufills\nwomma\n
monacolin\nlaios\nenawene\ndhahi\nayovi\ndrugmakers\nsherwyn\ndanii\nastrochemist\naznavorian\nsuruj\nmougenot\ncagatay\nhdhp\nnzimbi\ntufin\naristocratically\nsaddique\nplommer\nmorgenavisen\nfetherstone\npenstone\nperlson\nreaggravated\nsessegnon\nherkel\ntarcy\nchelsio\ncalie\nwarninglid\nmeidi\nclri\nplaycalling\nheitzler\nrevak\ntsimane\nquestnet\nmynor\nimagineable\nvoggenhuber\ngreenfuel\ncwellyn\nobaida\npartical\ntulipae\ncanjuers\npoolstock\nbackfoot\ngriffth\nlepselter\nsherbets\nherdan\nlifecasters\nuzzo\nclaycourt\npropganda\nvalerien\nrealmuto\ndossen\nbiomechanic\nvrijenhoek\nfurnisher\nkuleto\ndiservice\nfcbga\ncavolo\nholnest\nvcast\nryuteki\nsolden\nslossberg\ntansky\nlisset\nhomelite\nsterndrive\nshieff\nconjunctural\nhammerstrom\nnafaa\nduena\nsimanimals\nstalevo\nasmatullah\nglandyfi\nimperturbability\nclra\ntsundue\ngollen\nladieswear\nsuntans\naurang\ndiamoutene\nclayish\nbuoi\nabercastle\nhandwash\nhottovy\npiermaria\nwoodhatch\nmastromonaco\nflexjet\njftc\ndebbage\nblythin\noneapp\nfiaa\ngrishenko\ngaffed\nbudwood\nfreehouse\njubril\ncestari\nwarbly\ncraftsmanlike\nkoniaris\nogutu\nmullivaikal\nturneth\nsobp\njinmao\ncortaro\nfictionalising\nseddons\nhilmo\nmeronek\ntriperoxide\nooooooh\nweidensaul\ncwdc\nsmoothening\npalefaces\nchelgate\nconstructedness\nsolwhit\nmagomaev\ntruchot\nramsy\nmoider\nokaying\nuncataloged\nwegge\nwitchel\nforschungsgruppe\ninfratest\nlumio\nladypool\npersina\narader\nzilic\nsnappily\npikers\nnetlogic\nkimilsungia\ninflamation\nptech\ntrunki\nvbrick\nbradville\nlolled\nbangaldesh\nnyheim\npolycationic\nstannett\nbellyful\nshinguards\nbakana\ntrupe\nuneatable\nkneads\naustens\npgrp\nleberkäse\nblatanly\nittel\nenpro\nredeclared\nyolette\nzakira\ntricoire\nhandschu\nldraw\nreinemund\nmcguffins\nmiklaszewski\neffaces\nixabepilone\nsahafi\naraghchi\nfatmire\ntrawscoed\ncatastrophies\ncohr\ntahuamanu\nbottlerock\ngottemoeller\ngreedier\nanahuacalli\nbleazard\ncoonelly\nkombewa\nfloridita\nnutted\nsteinseifer\nstenzler
\nbyrddau\nforestethics\nzirkin\ngeliang\ngromia\nlolito\ncorelis\nseamons\ncefntilla\nmhec\nklatsky\nmamillius\nunclogging\ninauri\nukaj\nbruyette\nhechtman\njolinda\nguch\nmunlochy\npresort\nwlgc\nmarleix\nbatmanghelidj\nsturley\nlinbo\nslushies\nakerfeldt\nsioeng\nreligeous\nshuras\nideaology\nmeistrich\nbullwood\nritche\nshadingfield\ntutv\npartipants\nmegatrain\nmaiziere\ndunakin\nthemelis\nkrassimira\nstojadinovic\nmerland\norating\nsaintsations\nschräder\nletard\nmikayelyan\nheatherden\ntoomebridge\nmarcoola\nunparented\nseabolt\nwittert\ndufourg\nnyiso\npasedena\ngoldfire\nunsustained\nnechi\ncheifetz\nfinighan\ndsrl\npenneteau\nkiedrowski\nsajnani\nnobia\nblakean\ngrandvalira\ntetherless\nbreana\nvivisect\nbengay\nrheumatologic\nglitzenstein\nozkaya\npanang\nnewsmaking\nracioppi\nboffey\nrécemment\npyant\nwalvin\nkhamisa\nhoopman\npoveromo\nmonagle\ntilter\nexcitment\nexlusive\nmassouda\nnsms\nvanderhye\nfaguibine\ncayford\narmyworms\nbridling\nopportunites\npolyjet\nmariuz\npigasse\nschonwald\nkuettner\nacuson\nmpinganjira\nmadilyn\nvereador\naftersales\nfacilty\ngathy\nsnowier\nreallocates\ncockwell\njadriya\nbackplates\nchunlin\nairhorn\nrelativities\njewlery\ndelettrez\nlubiprostone\ncheresh\ncoronari\nbushill\npollick\ntopmouth\nbertodano\nsteinlauf\npildes\nevensky\nwaeli\ngalw\nleighanne\nkvaal\nboender\nunlaced\nhemostats\nlabatts\narrogates\nnguan\nmusah\nnozadze\nfacilely\nduologues\nhexworthy\nprakarn\nkinlochard\nairer\nwildean\nmakkawi\nhortas\ngoicolea\ngamemill\ncorumba\narné\ntouzaint\nsimilac\nsickled\nfattens\nhmmph\ndevani\nstabbins\narrata\nhowlingly\ntenían\ngelbman\ntaelor\nbainwol\nhambden\nshubhada\nmaniquis\ncoutadeur\nkavon\ndiamox\neyms\nmaxalt\ntitchy\npepelyaev\nfrancel\nluckwell\nintelect\nassurity\ntransnationals\nshorelands\nvolleyers\nwhitgiftians\npolioviruses\nmatambanadzo\nstorozynski\nsuitsupply\ntailandia\nraskind\neboda\ngleasons\nbydlak\nsabali\nunassumingly\npostnuclear\npublicaly\noilpatch\nbekay\nelusively\nforrist
er\ndopage\nhahahah\ntoppert\nsomila\nnarbeth\nsavuth\nnosecap\ndoonhamers\nsupové\naydogan\npumwani\nbental\nneckbrace\nacoustiblok\nekmani\nventastega\nprovett\nkilchrenan\nhigiro\nkippford\nmckilligan\nmastis\nbonnano\nwestaby\ndovgan\nlilliman\npucky\nquelccaya\nmilltimber\nmiswrote\nsulitzer\nseikel\ndenneboom\nermelino\nclarabridge\nclacket\nporterage\ntekebayev\njotischky\nforswearing\nzhaorong\nwaithera\noverbought\nmcgarrybowen\nfilyaw\nkamrava\nkorabelnikov\nfluty\nspottings\nteleni\nchelius\ndoumbe\nrenjen\nbeyton\nsquiggling\nkhouzam\nbarnavi\nbrighteye\nrinos\ngecov\ncutecircuit\nlukacevic\nferez\nmegayachts\nverardo\noverpaint\nngaujah\nkolek\nrollups\ndafur\nnetdragon\nmarari\ncochleas\nlantiq\nasipp\nvotewatch\ncapathia\nweedkillers\nreductil\nzyflo\nseting\nefunds\nwhuppin\naurothiomalate\ncscp\ncacb\nkadhafi\nsurving\npockmark\nestulin\nchadderdon\ncalavo\ndetikcom\nbitancurt\nbrejc\nsherak\nsmolyansky\nbegovich\nkatembo\nlahsa\nguynn\naplogize\neribulin\nparkstead\nmarget\nburkee\nreckermann\nsalvias\nwinnows\nmasche\nkherington\nbeated\noverachieve\nleimberg\nveltrop\ncroudace\nlisov\nmbpd\nswidnica\nglossily\ncassivi\ncentrefolds\nsalge\nniklaaskerk\nreadapt\navega\nchowdury\nmossawa\nkrakhmalnikova\nnetwitness\nmorgantini\nmbbl\nibasis\ndensen\ntiml\nuhlir\ndohop\nquiddities\nismm\nbrugs\novergraze\npiffling\nwendrich\nteenaids\nlucases\nmumbrella\ntribalists\ntaab\ncuvaison\narmegeddon\nthemos\nunderrating\ncozido\nsicknote\nemzar\nsukin\necofuel\nshanbag\nlorenze\nrotovator\nymddygiad\ndolara\npochoda\nsemisweet\nshlaudeman\nzamost\nesref\nbreadbaskets\nkankariya\nmcguires\ndockham\nsoltesz\nhaqqanis\nstapelton\nwodge\nmasterspy\nschriro\nghobash\nkelut\ncefp\nchiyangwa\ncrookedly\nmakhalina\nminikin\nstinting\nmescalin\ngroaners\ngyorffy\ncakmak\nlevave\nhatzadik\npahk\napéritifs\nrathbones\nrecharacterized\nokonogi\nzhuravli\nmoeti\nbasbaum\nmarchet\nkeiles\nobscurer\ngestamp\naslamazyan\nleverets\nnassery\nocurring\nrapke\nyoua\ntotalview\
nmaouche\nessner\nparticularize\nshufti\nthighmaster\ncocilovo\nsawasdee\ncomapred\nbibey\ntoywatch\nhoggish\ndussuyer\nmurell\neurodam\nhumetewa\nowlshead\nfeleke\naltarum\nkocoras\nvoluptua\nbasico\nneala\nfomula\nkanamit\nmisfortunates\naprl\nasieh\neleborate\nguisti\neterovic\nlouloudis\nmanwani\ngenomewide\ncrazyness\nrationalities\nintracorp\nintergration\ntillander\nsquirty\nspendable\nkollie\nmetalogix\nprommer\nkromberg\nfussible\npresuppositionalism\nbotflies\nambivalences\ncattoni\nboutaud\ntrymedia\ndslextreme\ndisjunctures\ngossart\ntapha\nbriskness\nzoelen\nflightiness\nclussexx\ngrinchy\npapadopolous\ncaixanova\nsimonaire\nendicia\nlockerley\nasila\nmarasa\nstess\nedox\nhailiang\nintercell\ncuifen\njanuvia\nvny\nmarquiss\nmajdic\ndumal\nenmei\nmultipacks\njazic\nrajdev\nfestivalgoers\nbowdenii\npumpy\nrizman\nkommunisticheskaya\nstevioside\nbridgitte\nonionskin\nkinnego\ndereje\nwoofy\nmorue\nmorila\nebace\njanise\nclarisonic\ncharap\nkuresoi\nkaltman\nsezone\nbeaconside\ngiulietto\npanjagutta\ncangandala\nistream\nrustamiyah\nkrimm\ntravelscope\nparthenolide\ntalkswitch\nsheshunoff\ngred\nsondi\ninglee\nmaawg\nwitrh\naffreux\nforrey\ncolono\ntennesseean\nsimels\nkugelman\nselsun\nlunchmeat\nkirkholt\nwahhaj\njaffari\nmohm\ntorcher\nboschen\nbalikun\nwkmk\nhrqol\ngrangent\ntrumpery\nunfollow\ngwastraff\nchamath\npinp\ntimmendequas\nschee\nustin\nlesportsac\nweyermann\nsixpoint\nenshrouding\nfreesheets\nbelapan\npepped\nellex\nreagrupament\nzakharia\nstiffing\namnis\nsymbicort\nguerinot\nrobreno\nswiecicki\nschaftenaar\neathorne\nafobaka\nmaxium\noghene\nkhashan\nmochas\nbiotherapy\ncitydance\npalaeoclimatic\ngarousi\nkusnyer\nklafter\ndelectably\nacitivity\nconcretize\noedd\ndrumaville\nhowatson\ncaiguna\nocie\noryem\nosodo\nippudo\nwygod\nsosnovski\nsockman\nenayatullah\ntsoumeleka\nvanderbei\njestina\nbruenn\nduderino\nzitoun\ndemocratizes\nskyscapes\nupcycle\nduhn\nbhawna\nooohhh\ntsavorite\nunace\nmanaseer\nscarely\nsharabati\nfeddis\nvvx\noperati
cs\neldrige\nrepetitiousness\npellicani\ndalenberg\ntabbaa\nzubok\nsblc\nbeatragus\nbernia\nfhcs\ntrebic\nudelnaya\nezor\nblubbery\ninjuriously\ncarrows\noverengineering\nruig\nabidemi\nmooo\njiandong\nweyeneth\nindiscriminating\npenanti\ndemotivation\nareawide\nbodjona\nabdoulay\nchapelfield\nnorbord\npearlberg\nunfcc\nrynku\ncurcillo\ndarbonne\nlesc\nedsp\ndjemil\naboutboul\nmaizy\nswiftbroadband\ncandelon\ndanderhall\necomotive\ncanahuati\nmafara\ngyromancer\nhestor\nasgaroladi\npaatsch\nsmrz\narangement\nyuken\nniomi\ncodepink\nverrastro\nzenani\npetriv\ngrawl\nmalayasia\ncollusions\nhayasaki\nintrastat\ntoyobo\nupaid\nupolo\nköbel\nvosawai\njabuticaba\nmyanmarese\nissers\nhaydenfilms\neverymen\nlitterbugs\nkidshealth\nwindowbox\nskistar\nkhalda\nvectras\njorges\nsuzhousaurus\nfelci\noverexert\nslobbish\ndiringer\nharpic\nepassports\nvocino\nragbir\nmarkino\npoilâne\nscalora\nmenseguez\nmellard\nenunciator\nkutigi\nintransigently\nlvcc\nazzaoui\nnsid\nrotw\nbarocas\nkrevel\nmaelog\nregadenoson\nsuperlambanana\nhrcc\ndsgi\nnplex\nachub\ndeskilling\nchestertons\nmangalaza\nchesko\nbirotte\nschlenkerla\nawbridge\ntelengana\nlumberjacking\ntransshipments\npowermat\ngateford\nsouha\nfrehner\nblochairn\nalouf\nnyka\nweintal\nschlossberger\npresious\nquantex\nsvyaz\nquicklogic\nbeddawi\ngabow\nnedrailways\ngoverners\nsaajid\ndisincentivize\nnayed\nvié\nstielow\njeffe\nseifollah\naldenderfer\noponents\nmilitans\nbenylin\ncrepeau\nmenilmontant\nfeasability\nsuperlambananas\narbc\nenxco\nseccession\nwaqaseduadua\ngyllenhall\noutraced\nkyri\nsestanovich\nsazhin\nunderworked\nramdat\nparous\nakapusi\nbeardies\nflipbooks\ngraby\nheckenberger\nimigrants\npavord\nbouyant\nmully\nlifescan\nmanguso\noakside\nterrorization\nkarliner\nariaudo\npalestinan\nunintellectual\nflagellifera\nscolnick\npurloining\ncelebracadabra\nvergnano\nclougherty\ncondomine\nwholewheat\nthundersprint\nparsvnath\njonigkeit\nbatabano\ndickersin\nicabad\nhyrbyair\nshawnae\nbrunnock\nbachmayer\nmonart\nka
lkhoff\nesfi\npersonalties\nexhortatory\nmetatarsalgia\nturkmeni\nzilberberg\nshopmobility\nthorong\nrebiana\nmicrocultures\nsieni\nvoast\nqahtaniya\nsloviter\ngromoll\nonich\nkerpoof\nkamens\negunkaria\nxtep\nmeschke\nherritage\nabbasiya\nuncollectable\nunnnecessary\ndecine\nbaramia\nmarkwest\nbioventures\ndigitaleurope\nptwa\nitat\nchromadex\narbas\nneofascists\ntidningarnas\nbetonsports\nmmae\nmayewski\npinelake\nyonty\njanelas\nghabra\nfreakiest\nmaculan\nleurbost\ndenik\nkawuki\ndisctrict\ncorticeira\nfancily\ncransberg\nnyakairima\nvyatkin\ncorpoelec\ngradowski\ngoursaud\nsokolovskiy\nkervella\nrietdijk\nbeukering\nweith\nschuchat\nprincipalists\nposessing\nogis\nsorc\narnup\nvagli\nchardara\ncaliforina\nstephansen\nqassemi\nsmolko\nmoleleki\nwallone\nannike\nrothlisberger\nesensten\nmdex\ngaretto\npicholine\nplentitude\nmcbains\nmondesire\nknuckleballs\nclammer\nsarikaya\ntransitionary\ncheyanne\nmeiping\ntodung\nmygrid\nmaseda\nnontherapeutic\ncanup\nunflaggingly\nyvenson\nzophres\nentech\nobidos\nespalin\ncosiness\nsegell\nprezza\ndarges\nphoners\nsolenni\nshaoqiang\nmonther\nblimunda\nhosani\nvomitoxin\nlifesized\negms\nchucklehead\nneonopolis\ntonsillectomies\ndepositers\npublicat\nmqg\nmilevski\ntravoltas\nharbarth\nciganer\nchsra\nvallado\nblyskawica\nshaanan\nblankie\nnasirli\nbellara\nclawlike\nmadheshis\nmozier\nscandle\npinderhughes\nubcv\nvisotzky\neichstaedt\nnewaz\ngorczynski\nartna\nfaggen\nmiklasz\nfiercly\nmusotto\nantiangiogenesis\nravishes\nnakal\njeronimos\neunavfor\nkildan\nchildproofing\njarnell\ntimberly\nsehd\nescholar\nmarchants\ncompustat\nkalooki\nfrechter\nmirlitons\nterasem\npoltermann\ndistractible\nelowitz\nwhichford\nkonoike\ngraib\nmodry\ngatsas\ninfuriation\nrazaaq\nistrabadi\nkoschnick\nishare\nsouldier\nvrts\ncheno\nolw\nbrivati\nhajjis\nmatsouka\nchunqing\nynysmaerdy\nrsoi\nbions\naltfest\ntibisay\nbinaghi\nwaehler\niitt\nkassabova\noutreached\nbindaree\npatner\naverbach\ninteractivecorp\nballar\nmohtaram\ncybercriminal\nma
sive\nwillbe\nortakoy\nhellsgate\nhandcarved\nusrbc\nbranstool\ntalentmanager\ndefibrillate\niigep\nphotopass\nrealtionship\nerso\nyueling\nnyssen\nbabon\noosterdam\navisar\nbrightleaf\nsupossed\nwickus\ntornqvist\nsnowboots\nrambourg\nskyburst\nspamount\noshiomogho\nabdesalam\nnyff\nkaramira\njoypads\nmangola\ntissi\nbratschi\ncolcoa\nhovensa\nfarjestad\nhatoon\ntomishige\nmcdermont\nhayball\nhillmans\nscemama\nshoubra\nschnebly\ntyf\nciriani\neuroskeptic\nkrajcir\nradanova\navermedia\nhygrade\npeakes\nphoun\ngilleran\ntheit\ndellamonica\nbootcheck\ntassled\niftas\ngrishchenko\nhaeuser\nminfile\ncavna\nrouseff\nsbme\ncanbyi\nrieveschl\nteargassed\ngeyen\nyaguas\npsilakis\ntransmogrify\nspoonfeed\ncollegefest\nprepubescents\nalteri\nqxm\nxuetong\nmouyokolo\nwalmarting\nzionazi\nkerschen\nemobile\nsciammarella\npredetermining\nriorden\nmedflies\nbasdevant\nraee\nerksine\nvanessi\nsalwak\nolivadotti\naortas\ngardenburger\nzhiliu\nrisanamento\nmasterpeace\ngenetti\npostholder\nlisanelly\nvundu\nramekin\nhinzpeter\nkonigssee\nkrzyzewskiville\nlyv\nsegert\nwebworks\nhalikarnas\npavones\nxianling\nexpoland\nemmigration\nstriaght\nmobuto\nfrease\nsunhat\ndabinett\ncaviare\nuare\nfrancombe\nworrywart\nbcuz\nmilwyn\ngennum\nlebrons\nlabbey\ntchuto\nodamtten\ndinging\ncannaday\nmongelli\nlejuene\nhansma\nepecially\nguinnesses\narberg\nskydived\nspruiell\ncorval\nbernadet\nonebeacon\nhightened\nmusicophilia\nheikkila\nsurfwatch\ngoubuli\nmallarach\nacroos\nchopines\nmohawked\ntigerish\nequifinality\nthenew\ntostao\neyebolt\nvignaux\nturkovich\ndolefully\ncricks\nspaceage\nviewty\npersbureau\nproration\ntranquillizers\nunhelmeted\nfriesan\nreggiolo\ncumaru\ncentralwings\nmerok\nkiddoo\nbellieni\nvewy\nzakum\npeladeau\nmochipet\nlayt\nglobalscape\ncyromazine\nstoate\ngastronomist\nsotc\nfofonov\nlosina\nohrp\nchirayath\ntaxidermia\nderbenev\nneuroreport\nhologic\nmohebbian\neilber\nwanyonyi\nrasai\nkuramata\nkeers\nrosenhead\nsvoray\ndemetro\nprachatai\nasdi\nvilvorde\ndigifest\
nflabbiness\nbronchospasms\ntradingscreen\ncarleto\nsilvinho\nmesages\nlofquist\nralsky\ndictatorially\npepsiamericas\nhörst\nzabir\nyealands\nmolleindustria\nactorly\ngisiger\nspallino\nguarnaccia\njetsuite\nspatchcock\nundoubled\nmayesbrook\njellyman\nsibos\nnnanna\ntagliariol\nkreiling\ndohertys\nradiotelescopes\ndenninghoff\nokurut\nthirstier\nlytel\nvillata\nmosalla\njelassi\nnovagold\ngarns\nfayrouz\nhangai\ncheckposts\nolshey\nmchappy\naberley\nhuzi\naahh\ndeminer\nmuzarabani\naujeszky\nbrassage\nlangoustine\nbooe\nlimnitis\nvignetta\nnoggins\nkalins\nquoter\ngutherie\ntemperamentals\nobarzanek\nkupferschmidt\nnewdick\ndudzik\ndalmer\ncibelli\nprivatklinik\nuplyme\nmacgillis\nsliminess\nlasermotive\npelites\njoergen\nopions\naesculap\ndupey\nintelligensia\nnonfamily\neuphorically\nvingtenier\ncoasties\nimrf\nmichigami\nepicures\nvinen\nportscatho\nmurrary\nmallavi\nadag\nenmesh\nakinaga\ncybersitter\namicability\nesseen\nkiyosi\nposistion\nrjdj\nwoofs\nligashesky\nsuperscape\nstiches\nopsource\ntiea\nkraul\nbuffleheads\nsumardi\nveleko\nhessie\ngreenbridge\nburayev\nbubenik\nhywell\nfaxfleet\nmyyahoo\nbellbottoms\nmariacarla\nrepublicrats\ndemilitarizing\nskycrapers\nkutol\nphiloktetes\nhfss\nwulkan\nalcat\ngönner\nfatwah\nmessagers\nmonstrousness\nmondot\npentex\nmegaresort\nbioactives\ngeotargeting\norkis\ndogfooding\nemblazoning\nsaqar\nchinext\nossur\nphotosynthesising\nbonomy\nalogliptin\njasperson\nscrotums\nonefs\ngrimsson\nlorah\npethica\ntoregas\nshitake\noakport\nlongtailed\nreitinger\ndentista\nknockwood\nfreefalls\ndorpalen\nhidef\npased\ngolesworthy\nhokazono\nlymphoplasmacytic\nstanilas\noverwrap\nholers\naccustoming\nbarghouthi\nuntarred\nepolicy\nfintage\npanforte\necards\ndraytons\nelfrink\nrevivial\nshaheeds\nlinday\nsubstracting\nmohebi\ncrljenak\nlindani\nhappinesses\ngioda\npithiness\nsciaf\nnesbett\nwilkommen\ntakeyh\nbizo\nsniggers\nstrba\ncollegeweeklive\nsbss\nmatsugen\nliekly\nzings\nvölsungs\ncaroma\nreimaginings\nruskies\ncaliguri\
nformable\nwestwoods\nswathing\nvnpt\nazaouagh\npacheh\nenviably\npegloticase\nluaus\nstutman\nkertz\nkoplewicz\nhurre\nsimilien\ntdic\nniniek\nanghenion\nseaquake\njetlite\nreddicliffe\nlilliane\npranali\nattact\nsonggang\ntremosa\nfaming\ntsopei\nenunciations\ndmfc\nnuong\nalexsandra\nvaval\nshoehorns\nlapenta\ngreycat\natml\nluai\nautomotrice\naurness\nlucilo\nhny\nquestrom\nherringthorpe\nalowing\ninterferance\nbayhill\nsoputan\nmahroof\nlaakmann\nhoylman\ngainsharing\nqcn\noverdrawing\nconspicuum\nbeiersdorfer\ngeminid\nfarahanipour\nmcshan\nastone\ndemonhead\nwiggington\nyudashkin\nmooallem\nsevelle\ndavisco\ncivb\nbellanaboy\nidress\ngrard\nwaelkens\nargonautika\nbozano\nposion\nceber\nshirecliffe\ngiliomee\nguillette\ntrimpley\nlepera\ncuratives\nrichmonds\nepilim\nsuffredin\nvalantin\nneykov\nflaster\npanamas\nlacefield\nrapidnet\njokonya\novergenerous\narteba\nflotus\nmicroentrepreneurs\nvolinsky\ngraviora\nbehney\nsciora\npilarski\nreodica\nstiltz\nexpections\nhobbyism\narrearages\nalmin\nkiniry\nchockstone\ntabarie\nmourenx\ncherating\nqueyranne\nbutterflied\nlemasson\nwarmblooded\nwhouley\nholson\nskurnik\nbertazzoni\ngolovinsky\nknipschildt\nmoonwalkers\njeraldine\nfireballer\nslym\nherstik\ntechiques\ndeutche\ngoggling\nwangled\nlazienki\nyellowhammers\nswoony\nmatalib\ninterrante\nworht\nunenthused\naragao\nsomnambulists\naffably\nselectwoman\npastilong\nbutterflyer\nthickheaded\nabouo\nsicky\ngedmin\nrhomobile\nguardium\noedenberg\niyman\ngemmayzeh\ngomang\nzerona\nwilkus\nplayden\nlatchingdon\nmacur\nxansa\nmogahed\nintralipid\npolticians\nmushak\ntweedside\nyendry\nrainshowers\nolatz\nbillal\neshaghian\nhalzle\nkaltschmitt\ngastronomically\nclunks\naltermodern\nvenetis\neyasu\nsmögen\nracepoints\nbuyable\nschincariol\nbirne\nmoorlach\nponosov\nwaddah\nhansjoerg\ncrescendoes\npicarones\nextorsion\nlewander\nufov\ngovindaraju\nbanjoko\nkarageorghis\nsuboptimally\nfractionalized\nboudoirs\nneverdie\nmichah\nbodeli\noverscaled\ndudette\nbirac\npribilo
fs\nzivin\nmerkato\npasnap\nuklanski\nengesser\nbrittenham\nsnazzier\nsakkie\njulissi\nkathalijne\nebrc\ntearjerking\nsurfaid\nhattenstone\neuroweek\nzilmer\nseniora\nlutzke\nmerrey\nentilted\nbelfor\ngraphomania\ncolacino\nvavae\nmouallem\npapf\nalvarezsaurs\nsannakji\ncaguan\nfidow\nzoheir\ntobalaba\nnins\nfelinski\nkaliati\ngallivant\nfouty\ninbicon\nkenmuir\nahlerich\njaschan\nlayde\nmasmoudi\ncluttons\nphiliosophy\nheher\nalbarino\nzimmermans\nliasson\nfreeside\nboxton\nglennerster\nyeaw\npartent\npodany\nkirkstile\ndoruma\ncombustive\noverpopulating\nfrair\ncloners\nlandamerica\nfacep\nsmts\nyakobashvili\nmigrane\nhemat\nupdo\nwessberg\nabbeymead\ngrohol\noverett\nquantites\navac\nfileds\ncodeplay\ncommercialises\nshaxson\nellegirl\nunconducive\ndescry\ncacuaco\ngirion\nbootlicker\nnaturalisations\ntelegdy\ngianvito\nlambraki\nelee\ngaddaffi\nkalinigrad\nalapont\nphoneix\nmultijurisdictional\ngabali\nnynke\nprorate\nwinichakul\ncadeo\nansuman\nvereecken\nharport\nfaynan\nrashidat\nwost\nmatali\nyonke\nshinh\nshufflebottom\nmathmatics\narchconservative\nexpecto\nmassau\nkuparinen\nugaas\npanchai\nkendyl\ncronian\ncritised\npostini\ngräßle\nglavany\nimportan\nrpsgb\nwellbores\nsavoyarde\nmatulef\ninterwrite\naltmejd\nbanez\nruched\nbacklines\njesrani\nflixter\nzawr\nredos\nrovia\nveeraphol\noorja\nbankowsky\nnmol\nkeali\narmario\ngoaltore\ncartvale\nmajestical\nkarkhanis\nyanney\nkemess\nkheirkhah\nunicorporated\nmagel\nnemtin\nkatsnelson\nromansky\npadwal\nbreidis\nhusnain\ninfantilizing\nchihuri\nburkini\nznaur\npasanda\nweonards\ntrappatoni\nbannos\nexpostulated\nfizzed\nchirisa\nlavasseur\nrapsons\nmevhibe\ndrezen\nkinatay\nmuxima\ndaugther\ndimbola\nsappiness\nactvities\nalook\nhenegouwen\nentabeni\nmineralize\nfetv\nkuczek\nseemlessly\nhoosegow\niberworld\nwenra\nmudrak\ndubiecki\nsignators\nlitsky\nkormendi\nlewter\nbrecciation\nbackslang\nsmartfusion\ntrelleck\nequinet\ndelamore\nshrivelling\nabousfian\nnimke\nultralase\nhamdaniya\nsadusky\nredc\nmikhel\
nbayovar\niosafe\ncmai\nparlano\norexigenic\nmandokhel\nacidifies\nskelbo\nmuira\novenstone\ngigauri\nstarkell\nstiflingly\nleadship\nakalitus\njaynee\nmillichamp\ndermarr\ntingyi\nsunrocket\nsoulbook\npegfilgrastim\ndezarn\noculists\njerseylicious\nincarnadine\noverfeed\nstepen\ngirlishness\ndevolo\nfortuity\ncarem\nstrefford\nlabier\nnhem\nkennebeck\nozil\nfindler\nretroplex\ndisinfects\nmilbauer\nyones\nauew\ngranbassi\nsensored\naghaly\nmoukarzel\nmikhalchuk\netelecare\nalands\nfaultily\nveline\nrlsb\noked\nmunyeshyaka\naccoutable\nsharim\nwanabee\ncréditos\nimitrex\nmethodologic\nhgcapital\ninconsequence\nfloriston\nudovicic\nnolta\nelghanian\nmarlana\npescante\ninterims\nelfan\nmoneyglass\nbashkirova\ndevern\ngerova\ngerassimos\nsinglemindedness\nkovio\nhochhauser\npalestinien\ndelijani\nnordictrack\nleasers\nsanitisation\nansca\nwinnabow\ndjinguereber\nhaselrieder\nmcquater\nplankensteiner\nrezso\nflisk\nclampdowns\nunbuckling\nkochnev\nuzomah\nmanweb\nshinbashira\njunwei\nfinallists\nrubenhold\nibmers\nnatually\nfredom\nmatui\ncadidate\ncouer\ncanchola\narbedion\nellsberry\nusariem\nopenvpx\nmbiyozo\nhervs\nbiyela\nsunstruck\nchowen\narieff\nanobile\nhappenes\nmigrainous\nstremmel\nessakane\nzymogenetics\nobsequiously\ngruberman\nlutfy\ncabb\nfloaties\nemex\nfuleihan\nquaida\njawann\nsuurhusen\nsauvaire\nfinnin\nrondelli\nunlived\nmabbs\noffic\nsmelliest\ndrawls\nkhwaza\nthuwal\nmujaji\nautocenter\nsocoby\nmirassou\nkingswells\nbiocom\ndustiness\nunrestrictive\nebif\ncontractural\nnaghavi\ndragoo\npastiching\nevolv\nlascahobas\ndeplace\nbarkos\npsda\nsteuernagel\nnotic\ndevincentis\nzweifach\nfaunia\nammaccapane\nmigingo\nangelilli\nunevolved\nkolahoi\nwapenaar\nroctober\nrpii\nmilkwall\nmigaloo\ntecau\ngimigliano\nromanee\nmecox\nbulhan\nbvudzijena\nleibo\nchainani\ntowerblock\ngaddhafi\nplumbly\nroumen\ndoohickey\ncurtsies\nkeithen\nthurmon\nkatehis\napparenty\ncaerfai\nplayón\njeremain\ndeactivations\neriez\nsingita\njennah\nsideswiping\nnonhierarchical\n
isdale\nlautsi\nparanjoy\nschmieding\nclienteles\ntarzans\npillowtex\nmoleketi\nsgfc\nwsib\ncamex\nvalextra\ntubia\netkes\nnardos\nharrabin\ndebski\nhiccuped\ngurhan\nsirivannavari\nkohrt\nhyalite\narlenis\npermier\nkorotyshkin\nairpass\nsharebuilder\nmpio\nissacson\nzaldarriaga\nlatek\nteching\nhanman\ngiannola\ntachia\najeel\nfeltenstein\ntyisha\nsigalert\nrescoring\nkilolitres\nahady\nchanae\nmoseman\nlargley\nwioth\navorio\nreichensteiner\neffectivley\ngilbern\nglycopyrrolate\nherbas\ndelicateness\nmuessig\ncloseouts\nfamour\nmahers\nschauerte\nupim\nfcib\ntroweled\nfantawild\nrandox\nmicroprocessing\ntusar\nboorn\njordanie\nrafalowski\nfulstow\nsubsidaries\nmistubishi\nodikadze\ntélécoms\npaduano\ngechter\npagrotsky\nwimbleball\nlipizzaners\nflouncing\ntumbril\ninterferer\nwoodfire\nmurarka\nfraisthorpe\nfinatawa\nermei\nspeliopoulos\ntourre\nmccheese\nlistrac\ncushwa\nwonderwoman\napet\nangiogenin\ndeafeningly\noluwatosin\nshioiri\nhavergate\nfahleson\ngoreme\nexperteer\nbarmbrack\ntaillamp\nshefield\npapooses\njauslin\nbutterflying\nfibi\nmoonblood\ndesferrioxamine\nimanyara\nbradlow\nbelyayevka\nrybakou\nabdenour\nalmalik\nvoicemale\nesenboga\nsadequee\ndorpel\nrenyel\nunground\ncheeto\nbeargrease\nhayeses\nmobileiron\nnymphets\ncafi\ndrizzles\nefds\nspigner\nhoohah\nhitchcox\nschwaninger\ntyrella\nwilsdon\nshanghaiist\nfaschingsschwank\nmiskiewicz\nregimenting\nmellone\nyeilded\ncelmer\nsherdley\ngyno\nvisitar\ncivista\nescobares\nzuiderdam\nnylan\nthroughways\nccer\nmuhammara\nazlynn\nzakki\nnoteably\ndiplay\nfktu\nlydall\nfranzino\nnsaba\nvegetating\npanell\nindohyus\ncaslin\nwestbay\ngilissen\nbloorview\nköstinger\nponytailed\nfaize\nmikeska\ncarrowreagh\nhegemonism\ntullycarnet\ngamecrush\nbelmontes\nflorigene\namaretti\nclearplay\nkelbaugh\ntelehandler\nkamuda\ncapcity\nemnity\nirradiators\nrwindi\nmagentis\nkolosoy\nmichalson\nmilchem\ntimesaving\nmisnumbered\nralia\nglommed\nlovergirl\nternovskiy\ntoughed\nprocesser\nswima\ncutbill\nsunwin\nayham\nro
cquier\nteperman\nnotcutts\nmicrorobots\nmoiseyeva\nblejer\ngafney\nkerkwijk\nmmhi\npellom\nschnitzels\nanthropomorphising\nshehada\nextravagent\nfazzari\ntibc\nburbles\nululating\ndichato\nmatchwood\nblecksmith\njudgmentalism\npredessor\ngarodnick\ntriptorelin\nredrafts\nwehrs\ninsubstantiality\ngalacticos\npeugot\nharbash\nmgib\nlastres\nbensayah\nkloppenberg\nmandaree\nshread\nahdab\ncein\nordoña\nguangpu\nhallgate\ngerein\nsimester\npeevy\nharteros\nambanis\npsykter\nrejab\nzuhairi\nmorkos\nsuccot\nharoldson\nchaddy\nprath\npeevishness\nbodnick\nwpte\nabston\nsilveiras\ndysport\ndammage\nsmolansky\nidentita\nmizengo\nfieldsend\nlaneast\ndirgham\nlaruffa\nshchennikov\nhuafeng\nmosaicos\ndesalted\ntokyoites\nkesal\nnesv\nnavnit\nkumbala\nhatcheck\ncorporatists\nimanaka\nsnouffer\nreev\nexpereinced\nvoecks\ncrissie\ndispraise\nmedpac\nhussainy\nrumenov\ntaffetas\nentrepreneurships\nbanio\nloveseat\nhawthornthwaite\ncaragabal\ntopdog\nmisezhnikov\nglaceau\nefficency\ncammo\ninstution\npfanzelter\nbeeche\nsaltend\nmenkerios\nbiondolillo\nlamost\nsoleas\nkeydata\nmakwan\nalladale\nrebarbative\nblobbies\nkoutstaal\nbicommunal\ndowjones\nnonideological\nchartwells\ncoinfected\nugma\nthecb\nshlemon\nsusurrus\npresurgical\njaelyn\ncorruptness\nshowplaces\nbamberski\nfalks\nunendingly\nkopke\nemanuella\nblackwaterfoot\nrhes\nsandrow\napocalyptically\nfcfe\nfripperies\nclimatechange\nnajdat\nmasic\nrafeek\ncaseley\nmumbaikars\nsportcoats\nrürup\nbursten\nsnowpocalypse\ntofurkey\nsuramericana\nfraternised\nunderinflated\ndcal\nmeiller\nbarchiesi\nflöge\nennenga\nshortterm\ncoughtrie\nvergin\nskride\nbonby\ntajzadeh\npalino\nisilda\npatriach\npetrobangla\nenergix\nsauipe\nrealdvd\nmedit\nmistruth\niprint\neink\nlavonda\nwurtman\nspfw\njnpr\nsuiciders\nillela\nunopenable\nhirut\nprayitno\ngoldleaf\nnduja\nunfiled\nfaudra\nshantell\nbluitgen\nunadoptable\nbaytsp\ntelemanagement\nhussainiya\nderuiter\nforlan\nrebok\nmatchmake\nhorsepowers\nadapidae\nmaccurtain\nleish\ndaille\nka
singa\ngronwall\nforeseable\nwaingroves\nboyishness\nhiglett\nconcertedly\namerli\nlisogor\nevencio\nberdouni\npaggett\neffors\nweathertight\nfawda\ncorporeally\nmccambley\nxiongfeng\npanks\nmcadie\nsrinigar\npacojet\ncabdrivers\nnbac\nwrighster\nasterley\nfeatherlight\npossesions\nsigale\naccessorizing\nthrombolytics\ndistrubution\nwhinger\npubwatch\nmyplace\nkyzer\naganga\nsedney\netchart\nzigiranyirazo\napcom\naracelis\nblanchards\nookie\nmediocracy\nfigh\nmarkai\ncotehill\nhycor\nbollworms\nratanga\nadriean\nrizan\nslominski\nneurolaw\ngialanella\nduloch\nsoliani\nuniversalised\nfurless\norizon\neuthanased\ncitera\njeffer\njellyby\nnamy\nmassoumeh\ngladed\npalivizumab\ncuram\nrathana\nglezerman\navorn\nfeeb\ncarmelle\ngianaclis\npreplanning\nofex\nciolino\nmeisenbach\nhyakuri\npeisey\nbrrrr\npharmacyclics\nctts\nwomanity\nincapital\nfilthier\nbalivanich\nrbai\nshulian\nchorines\njilali\nnyhuis\njcvd\narcenas\ncampaniles\nheriditary\ndnx\ntacu\ncatnap\nleagas\nstrengh\nliquored\naccame\nadeena\ninvidiously\ncasscells\nhirsche\nliubinskas\nhamastan\npricewatch\nsaccharose\nmindark\nulh\nbrandbergen\nprovincien\nwarheit\ntewary\njugendamt\nkwashi\nimmiseration\nmassachusets\nwgnr\nmoszynski\nkarake\nbogorad\nmarcelis\nmwiraria\nnocino\nhammersteins\neljay\navowals\ndeclaw\ninterpretively\nguangqian\nomischl\nclassiness\nrollino\nhasnaa\nunperformable\nntic\ndrycleaner\nvarvitsiotis\nlitzinger\njirtle\nexcellance\nshavendra\nnipponkoa\nstradbrook\nskretny\nprii\nlassaad\nsanjaagiin\nhorsting\nbouan\nsquamiferum\ntouchette\nwaterlogic\nwynds\nspaceline\noutpoll\nsoliel\ntarisa\navoidably\naignasse\nmpxpress\nnontheatrical\nclutterers\ngribetz\noofnik\nbensko\ntastelessly\nloamanu\npostracial\nrugmark\nepicness\nstriemer\nspeading\nlepu\ndriftnets\nashikari\nlanerie\nhealthone\nworthman\nfresian\ndamart\nlivieres\nmcluskie\ntelvision\nbaghdadis\nunclogged\nleftfielder\nspindoctor\nsercial\nkitahata\nclusterin\negozy\nbruco\nroubik\naveeno\nmaouloud\ndistrubuted\nsongl
in\nqayoumi\nasilisaurus\nestefani\ndowdie\nnatonski\nasanda\ngoset\nrosapepe\nfructify\nlambertye\nloiero\nbratwursts\nelyea\ngargash\nbenotman\nalecs\nnightshirts\ngreenend\nosanloo\njentzen\ncortazzo\njaxer\ncolur\nbimkom\nmangalica\nxiluodu\nmahesa\ncraignaught\nstoneburg\nboulat\nsatsumas\nsliimy\nsnowflex\nsentimentalizing\nlapwood\nclanks\nlifang\nmectizan\nfromanger\npradelle\ndouch\nkazmunaigas\ncylc\nmdsp\nalternitive\nflamineo\nlilico\nayalas\nmighall\ngafas\ndorfeuille\nkwegyir\nsamodurov\nspayd\nclott\nprocup\nstrumwasser\nbrotzu\nrabbatts\noutsoles\ncubavision\nkaminkow\nchengyun\nserfas\nclopyralid\nendsor\nmboungou\nirrawady\npurposelessness\nculty\ndomainer\neligibilty\nkronemer\nbannis\nmerifield\ndahyun\nfruttuoso\nsaudabayev\ngalev\nnbta\nefrag\ndemary\ngerston\nsavviness\npatson\nlockhard\nbaugo\nsuperciliousness\nturrall\njohe\nacsis\nlaurs\narive\nhaselbech\nhoarafushi\nbset\nmillichap\necause\nbrentor\nchauffer\ndevoteam\nplasmarl\nkkoh\nhitco\nmaybachs\nmagagula\nregenstrief\nestimo\nbrtish\nkorphe\nsumidouro\nofari\nguarinos\njonetta\ndorka\nbreadmaker\nmekonen\nensenat\nczwartek\neyeware\nkraditor\ntaktshang\nkanaskie\nsimena\nagedashi\nlamendola\nmcaughtrie\nweaselling\ncinamon\ndeitche\nweterings\nllanerfyl\nstanfields\npriyadi\nmotomasa\nwitmore\nzarrouk\nscantier\nkenkel\nshiona\nwilmart\nleachable\ntruffled\nversionone\nnewaj\nfeigenholtz\ninteroil\nunaustralian\nstralman\nwaagaard\notera\nntma\ntitilation\nbenahavis\nbouhali\ntexi\nsarsekbayev\nsukhpal\nmarksaeng\ngargles\nnygh\nowlish\nconnot\ntaghazout\ntinks\nonanistic\npunchcards\npenhallurick\ndrummey\nundercapitalised\nblogtalk\ncollegesports\ngimlette\nvisk\npannur\nradlauer\nsauciness\nantonellis\nmicoach\nnondisabled\nhusavik\nfaichney\nseliverstov\nblanchardville\nchochiyev\nthumbsticks\nclemenstone\nturnabouts\ndibartolo\nembrya\npurposedly\nsteenland\nhogsthorpe\ncidac\nglamorise\nsccf\nemasculates\nrudolphs\nkalatozishvili\nlfas\nmaimings\nquetzaltepec\nrelize\nhuarui\nl
atson\nvanzi\npipelay\nwrxs\norad\nsyneron\nreweighting\nraramuri\ndgen\ntrustedid\nmwapachu\nmullowney\nhius\nmilions\nundisrupted\nshirlow\nwaili\nbrondanw\nenrapturing\nmobeen\ningy\nintracompany\njolliest\nalgalita\niannarelli\nrecultivation\nmusnicki\nclickstar\niakovou\nkiehne\npalazhchenko\ngorik\nalvester\nladdha\nidealises\nchinesepod\nhelibase\nnicolazzi\nrecapitalizing\ntenaska\ndiscussio\nachievability\ninfratel\ncaiden\nronalda\nkatuwal\ncetrorelix\nabosch\nmedrich\nrothienorman\ngoyas\nsanquan\nbatmans\nyuanta\nreweaving\ndegrand\npirogi\nstreetcred\nallcomers\ndistracters\ntuiga\ncentilitre\ncytopenias\npalestinain\nsrebenica\neulogists\narcserve\nungentle\nchicom\nperrons\ntravelbag\nfreespirit\nsuler\nnigrelli\ndunderhead\nhorribleness\nballyhegan\nshomon\nboxbe\nminiati\ninoubli\nteeba\nmayhugh\nbemuses\ndolez\nbuffelsfontein\nevista\nappropriator\nsturgell\nponv\nreraise\nlanglee\nfremon\nmyface\njavea\nconvulsively\nexelixis\nfickenscher\nfangzhuo\nschastlivy\nmaunalua\nenform\nvinohradech\nhjerpe\ncauson\ntortelloni\nneurofocus\nmultimineral\nisakowitz\nculatello\nbesito\npoping\nhemakuta\ndapdune\nsleddogs\nhawkswood\ndendê\nfreada\nfrailer\nbrydekirk\nearsplitting\nelectrocardiology\ntucknott\noracene\nkucharik\nbodeker\nsazon\nembarkations\ngandanga\nbedcovers\nnect\nfrankens\ndaggash\nglezen\njananayagam\ntaichiro\npiñeres\nmimobot\npeterside\nongame\nworgret\nsedoc\nindefensibly\nchemosensitivity\neghan\nzlotnikov\nshopstyle\nbiocchi\nsinoti\ncuttance\nculverhay\nthngs\nbargery\ngiggler\nlonesomeness\nstuver\nwestlea\ncharacid\nctam\nsukow\nalemseged\npacketvideo\nskiiers\nstrope\necohomes\nskoula\ndissoluteness\nshtarkman\nschain\nrmds\nserramenti\nocaranza\nmarginalises\nfalconhead\nbackload\nbennite\nmalefane\nadino\nschoolbags\nfairham\ntizz\ntatelman\nrussh\nabdale\nkrima\nchhatisgarh\nmiragliotta\nenterra\ndabancheng\nkotsopoulos\nschumachers\nfatmata\nmednet\ncadivi\nfairthorne\nnympheas\nmariatorget\nhollydale\nshivek\nscarrott\npos
timpressionist\ngutteral\nmaber\nrostenberg\ndawran\ncumbersomely\neilde\nxjl\nszala\nkoshary\ncanstruction\nfaling\nseachd\nloppington\njheryl\nmaliau\nyenque\nroae\nravani\nbittker\nsurpising\ninartful\nmachipisa\njenike\nipls\nleic\nsielaff\ninstrumentalisation\nkaec\nmakeyevka\nqeb\nikanos\nqingyu\nautovaz\njacksie\ndissembler\npickerings\ncherishable\nellalan\nduplat\nmauricia\ndruggable\nmurshad\nhuffner\nchristeson\nlaughead\nrecommerce\negendorf\ncarancho\nimaginitive\nwendice\naaaasf\nblitzkreig\nriposted\nsafie\noccular\npeelable\nperold\nmonforton\njeannemarie\ndeping\nwalentas\nhypervigilant\nadularia\nidou\nstrachen\nmislow\nmmlp\nsamuelle\nledburn\napolgize\nskatelab\nkarfiol\nbalqees\nsfha\nokaukuejo\nspecchia\ndulcote\nlevaquin\npallino\nwayleave\nmacmorrow\nfeakins\nlefkaritis\nʼs\nassasinate\nnaughtily\njebidiah\nnightlong\nszczechura\nsolterra\nasagai\nkrege\nseond\nekahau\nmargalis\nesquenazi\nstarquakes\nperence\npressé\nsearchinger\ndesal\nbetra\nnejdi\nhonickman\nsorrowed\nsotp\nhomocide\ncitycard\npriveledge\nvelayudam\neicta\ncauseing\nhammo\nkrever\nheroles\nthint\nartier\nsamantar\npacot\nclotfelter\ngadirov\nhasset\nlottridge\ncoombeshead\nmiszewski\nfrnt\nepisiotomies\nconcilliatory\nsitanshu\nkrasney\nnexavar\nsafework\ndjurkovic\nhypponen\nboyling\npembertons\npersisent\nappletini\nremonde\npalaikastro\nredgraves\nqabalan\nyenesei\nfulfiling\nolsons\nrackenford\nautomaking\ncolsons\nshander\nnairas\nriechers\nirongray\nplateful\nstumpel\nezratty\ngilets\nkelloholm\nfishler\nfedderly\ngeurtsen\nownself\npaillou\nlenser\nwojdan\nthroughline\nsnowshow\nshackman\nmarwad\nzarchy\nresiliant\nwestermarkt\ncoachable\nlanlard\nkenspeckle\nfistic\nbrasilinvest\ncaptivatingly\nmanicuring\naszure\nspoonfeeding\nspoofy\nremotec\nvainest\nplaints\nhousebuster\nnilli\ncellufun\nanandita\nmanjgaladze\ngudelsky\nkhaledi\nosmometer\nxtabay\ntorcross\nclasslessness\nrbds\ncerrillo\ncaisley\nlashkars\nfonkoze\nmartillac\nkenechi\nglenrosa\naliph\nilangako
on\ncalfrac\ntalling\nslemrod\npeterstow\nshnur\nmidweeks\naidablu\nbenneweis\nrumsfield\nmuqim\nairikkala\nrittenband\nannouce\nbiorepositories\nrogat\nzetts\nfeixiong\nhuskier\nsoleman\nsmedegaard\nmcphilbin\nnvca\ndustcart\nladyzhensky\nkismaayo\nphilinte\nmoelleux\nsynbio\neldrin\nvargen\nschoenbach\nrawai\nbesian\naleviate\nblueray\nyaoting\nconcelebrate\nadebibe\nmortein\nmagliocchetti\nnrse\nhartmayer\ngarani\nlupberger\nfrownland\nsolandt\naftc\nkwol\nirag\neaiser\ntakatof\nunpc\nlivigni\noberender\nsuperweeks\ngwanas\ngilver\nkelya\ngentilhommes\nbuttonless\nballetomanes\nclanged\nglunt\nguérot\neurocrats\nrowantree\nsadaam\ncharltons\nfalahi\nkavsadze\nvereniki\nsnobbishly\nperpetuals\nsunquist\nholzhauer\nmurdie\nsaggaf\nafft\nreimported\ngensym\ndastageer\noffshot\nkrumnow\nkensie\nearbox\nhowen\namnuay\nkirland\nwardropper\nbrelis\ngolnaz\ngousse\nclammed\nscotairways\nnelyubin\nbifold\nhraun\ngwennan\nsvenssons\nflwyddyn\nhollyscoop\ntuesley\naljibe\nmumpsvax\nscambio\nflomax\napoligise\nmuharib\ncerminaro\nzomboid\nmainlining\nincurious\nsakeasi\nvaul\nmonnezza\nsanguino\nsureau\noverrating\nmisclassifying\ncatchgate\nsheinkman\ndemurring\ncapricia\nallieviate\ncreppy\nparanormally\nnadjim\namtrack\npredicta\newanrigg\nhuegel\nricupero\naimc\nkasotc\ngarthamlock\nmacknik\ndalayman\nrefurnish\nintegralis\nlawrentian\nakilisi\nmediene\naldates\nnadex\nthyroids\nhopeline\nkorian\nsurveilance\nmobilisa\nassetz\njuulia\npalatschinken\nslaiman\ntrasolini\ncumner\nunderachieve\nponsero\njoergensen\ndovel\ncompaines\nrimadyl\ntazeen\ncelebrini\nputins\ntocom\nalomst\ntalh\nesele\nworldwise\nwaterwatch\nswiftcover\nreasonover\nkradjian\noveriding\nweinhofer\nvingoe\nndep\nfleurival\nloudeye\nfreshminds\nlaoula\ncornicing\ntajikstan\nnadoolman\nicesi\nsieck\nfootware\nniebur\nmastergate\nqurani\ntamanisau\nreults\ntabaski\ngorane\ndawick\nmulsim\nglenmavis\npotables\nrittmeyer\ngranzow\nfaer\nfengchun\nmaingate\npentregarth\ncorraled\nsilverhawk\nscotlandwell\n
coem\nleineweber\nsterio\nabydosaurus\nmengcheng\ncalculatingly\ngoodramgate\nlatté\nsanctimoniously\nalyssia\nalbertbridge\naperion\naknin\nmaestripieri\ncatholocism\nonlies\nsaltiest\nmathwin\nbladelike\ntailford\ncononish\narond\nevotec\nmilitamen\nsmollan\nzemlin\npossuelo\nprigs\nunderexpose\nmchedlishvili\nlaparoendoscopic\ngettable\nseides\nimediate\nbahmanzadeh\nhanwang\nsamadpour\nboyn\ngruop\nxinmi\nventilla\nkriewaldt\npraus\ncritcal\nencash\ndiferences\nojiambo\nahmadshah\namendt\nfreakier\ntuder\ndcpd\nwasifi\ncrowbarred\nrubbly\ncpnb\nhoheisel\ngatluak\noutdrive\ncrashingly\nfullani\nbournigal\nfiesty\nundocks\nkazmin\nechakhch\nwhiterocks\nsparkpeople\nsubmeter\nwondrich\nworapoj\nsmashburger\ntipsheet\nbusinesseurope\nunderstan\nandew\nruusunen\ngunnoe\npallasades\ncressex\nzuckerbrot\nnovolog\nattornies\nstemnet\ncomani\npeaple\nstakeholding\nmannahatta\nstuhler\ndzidziguri\nethnie\nelyaniv\nliberadzki\nramalama\npreddie\ncassley\ncarmenita\njitterbugging\nhrinak\ndataline\nspecialisterne\nhogel\nshaquelle\nmunah\nrubisch\ndraghounds\nbenazepril\nvaleting\nqbic\nplymtree\ntidespring\nspco\nunstarted\nsharghi\nfeedstuff\nhegglin\nshashiashvili\naiqing\ngurrutxaga\nlangerman\nchappill\nahlenius\nlauney\ngrunke\nmahahual\ntassles\nbaqee\nhashmark\nkofte\nmufson\njerraud\nmaithan\nperking\nverkuyl\ndepressurizes\ntweezing\nmedjuck\nbrantes\nuncorking\ndemoura\ngnvqs\njinshanling\nbushs\nartfire\nunnuanced\nbowlplex\ngappah\nfroghoppers\nassuit\npaciullo\nlutzner\nfacg\njungled\nsuperflash\nthiksey\nkeiding\nlkq\ndjhone\npavletic\nmmbbl\ncoric\nmobilians\natie\nkalyakin\nvigilent\nyanelis\nrahua\nchevvy\nstengle\nintini\njoseloff\nfreydank\nmuncrief\ncarkhuff\nsenomyx\ngsfl\ngrussendorf\niwebema\ncairos\nrouyanian\noutragious\nurdahl\nnegociant\nmoobs\nchenot\nindh\nserwan\nwaines\nsuers\nyouxian\ncullip\nsmokejumping\nkrustyland\nlixion\ndisconcerts\nkolodkin\nzouaghi\npicerno\nciboodle\nchinnaiyan\nifight\ndimentia\ncoronagraphic\nrednock\ngeerhart\nfa
fen\nteruyo\nmascarades\nmoonpies\nwriggly\nmallomars\nexceptionnels\nnudger\nufap\nmumper\ncyberweapon\nnumbersixvalverde\nvtti\naydilek\nsunsat\nlieberose\numos\ncancelable\neackles\nurvilleana\nrompler\ngrouplets\nbbowt\ntestily\nsouhami\nclifft\ningestive\nruthanna\namimo\ncognovision\nstonegarden\nfarouche\nkhadevis\npajcin\ndanionella\ntomory\nkibitzers\nprettying\nselover\nbookishness\nleroys\nschwizer\nhomeister\nhatzigeorgiou\nnorenberg\nnatexis\nunimplementable\ntashkin\nreceving\nsewering\ndinatale\ngrowhow\nangkorean\navonbourne\njanaury\ncalsonic\nmerkies\njernhusen\noutlandishness\nmoeletsi\ncubillan\ngrenot\nchoreographically\ndusc\ntaquerias\nheadstands\nhabab\ncalang\nmorphey\niyayi\noktob\nmallaiah\nezzeldin\nxincai\ngesticulates\nhantuchova\npresidentin\npokéwalker\nzovirax\ntrividic\nfogeys\nmahoningtown\npentathol\nlandhuis\nkahramanmaras\nelyce\ndisporting\ngraverobbing\njadelle\nbalthaser\ngobbledigook\nyaichiro\nimprezas\nwhackos\nnosepiece\nunspeaking\nblardone\nvixia\nnazal\nisnr\nhayesfield\nreferals\ntamaiti\nhansley\npeske\neuthenasia\nneoware\nslivovica\nmahelona\nsleekness\nshawfair\nstanekzai\nkurkowski\nramrodded\nnowgam\ndrijver\ncheesier\nberba\npotempa\navero\nromoff\nprotess\napog\nmaruto\nsawaf\ntjia\novernighted\nnovakova\nduboin\nyld\njacquline\nzerkin\ndiciccio\nfuelcell\nsorena\nverao\nawwwww\ncoomarasamy\ngreenhalghs\nhumalog\nfanson\nbobyshev\nsemipublic\ntorygraph\nbaedekers\ngready\nchatlogs\nrosenwasser\nshamansky\nyiampoy\nnarochnitskaya\nexternalising\nzongliang\ndobrzyn\nsurewest\nstupified\nblackawton\nhiban\nspse\nmameshiba\nschklair\nmaviglio\nprofitmaking\ndatone\nkarpati\necobuild\nmcsp\ntrigem\nneison\noutswing\nklutts\nbrasel\naxcelis\ncoptics\nsuperenalotto\npaleoecologist\ninvestible\nterzigno\nzogaib\nfaife\nyarhouse\ndarmin\nthabethe\nnewfangle\nhannspree\nmagliari\nwendorff\nnesdb\nningqiang\nswanning\ngirassol\npettes\ngreasiness\nncpd\nalbet\nnikolos\nskopintsev\nnohilly\nonovo\nodundo\nfleshiness\ndacc
ache\nceglar\narghandiwal\nmarende\nhoomes\nmccuddy\nbranley\ndarawshe\ninnsuites\nkronzer\nunsuprisingly\nbvca\ntournaire\nmerzi\nwhannel\nlongar\nsponsers\nsensationalizes\nolenick\nsymptons\nsufen\narabshahi\nirga\nleiserowitz\nmeltingly\nknörr\ngriffard\ndisinvest\nwakiihuri\ndomincan\ndewsnap\nwileys\ngosselins\nnetsu\nperserve\nnuture\nremould\nrbct\nlaforte\ntscc\nthinkstation\nbrichambaut\nsasch\nberkun\ndivos\ntsikurishvili\nqihua\nsallenave\npendergraft\nsupernotes\nsliderocket\nsayyeda\nbdelloids\nheaden\nreijtenbagh\natflir\nsteinacker\npted\nguozhu\nenterprisedb\nbersoff\ncrampin\nvareniki\ndaril\nnexentastor\nrokit\nmontbrook\nvilardo\nbursík\ntazers\nnorries\nplested\nseeram\nsterilisations\nhallem\nlormel\nmewett\nbaqui\ncandlelighters\ngymslips\npahlka\nlukomir\nponnary\nincipiently\ngeohazard\nquestionning\nlimpness\nketteridge\nsurfcontrol\nmateriels\njáróka\npaisleys\nrehypothecation\nelladas\nschwendeman\nmilksop\nyobbos\nbaynote\nrosehips\nmagrez\nkalaris\nsquiffy\nnassari\ntouitou\ngvaramia\nnefedova\nmontañes\nbowdens\nraihani\njonzi\nheagerty\nteneyck\nsocheata\njiraporn\nkipipiri\nlonestars\namcas\ncpdc\nheitzeg\nfmes\nshivalingappa\ngaurds\npollis\njenrry\ncassileth\ngiorgianni\nreconstitutions\nroshek\nfrogh\nunchivalrous\nkinninmont\nmnemba\nincu\nconsentino\nmikari\ntazed\ncsincsak\ndemutualise\nsledders\nearthshare\nscruposa\nvisia\nconsious\nheightism\noriau\nmakol\nbingum\ntailgater\nprofitted\nporage\nvmtv\ncagy\ngeekiest\nsrichand\ngudim\njohnnes\nspeci\ncheeps\nhauri\nsheirer\ndisapprovals\nmolyviatis\nmbrg\ninvs\nhyperreactivity\nnieuwstadt\nguidepoint\ninmarko\nfoeme\nbodymedia\nyaneth\nsmaland\ncommingles\ncepii\nroskamp\nmischak\nsekouba\ncravinho\nahpa\nkuschynski\nhatchwell\nmorilov\naromatherapist\nelanco\nmedupe\ndestocked\nprofero\nhodzic\nxiuping\nmtagwa\nigmar\nbalčytis\nkeme\nuahs\nchassay\njacolby\nefectivo\nwhinning\npaderina\nsyabas\nunpresentable\ntromethamine\njuls\nwasikowski\nballig\naudronius\nsophmoric\nimperm
issable\nposterboy\nzavalloni\nparicalcitol\nmcgills\nherlambang\nmtpc\nlcvg\nubermensch\ndalakliev\nglobalwarming\ntjh\nohtsubo\nmccolls\nfrikkin\nnewsarchive\nzaiem\nsurour\npostwatch\nklecka\ndragages\nwingding\nobituarists\nyaneza\nhipocrisy\nsonglike\nacridone\ncslf\ntitnore\neyf\nmushangwe\nminidisk\ncioppino\nshelda\ngwaed\npapur\nguileful\nmelitone\nwicaksono\nfiana\nwizman\ncomfortability\nmizens\nexpences\nnfps\nteodori\njursa\noverexerting\nperambulators\nchedgrave\nkyani\ntoris\nmyfoxdc\nbinski\nkohona\nslapdown\ndramaturges\nwilliamette\nalakozai\ndutifulness\nbnim\ngradated\nmalera\neyermann\nneidermayer\nzaoua\npyschological\nmeditec\ntaiseer\nbougainvilleas\nkibungan\nanglomania\nmcalley\ndawasa\nwindale\nanhri\ncyndia\naltekruse\nlocums\ncapmark\nwhittal\nzillionth\nessek\nbaynunah\nseguiriya\nkvisle\nsomemore\njtwros\nwhitemark\ncedomir\ncortizone\ngalou\nlétard\nxenoposeidon\nbiomacromolecules\nmcgauchie\nmijail\nverbalizes\nhizbi\nsmyre\ntuanpai\nkuruvchi\nouzts\nmawsley\nyuanxi\nungheresi\nkapasi\naesha\ngaroowe\nundisplaced\nharrover\nbagmen\nbezemer\npageboys\nsaout\ncampkin\nreglan\nscalin\nkount\nrohais\nosklen\nkarus\nbesty\nsummiteer\nramoneda\nlassin\nskybitz\nbelim\nlairy\nidith\nmaisanta\nerikas\npfeiler\nsuperpass\nquirpon\nleutenbach\nwormery\neousa\ntorchin\nmillito\nbootees\nvectibix\nruchel\ndednam\nfelcsuti\nbelkhir\nwhispernet\nwallenbergs\npipefitting\nanaika\nanchalee\nnanoworld\ncbra\nanybots\nbaiardi\nkoutoukas\nunaccommodating\nzdanowicz\nbexington\nmanouevres\nimbiber\nqaddura\nlituma\nmascarenas\nbeneski\ntablespoonful\ndigestifs\nbiryanis\nrubbishes\nnonenglish\nrolax\nstaggies\nkenwin\nwychall\nmelanosporum\nmondoloni\nannihilationist\nrecoupling\nbirchleaf\nsnarked\nqaemshahr\ntoledos\ntacori\ninterestin\nlaquita\nromanoffs\nsabillasville\nterroist\nnuvu\nfrischling\nhushion\nmanciet\nmagnetise\nszabelski\nrexnord\nsidorkin\ndayem\nsadvakasov\nbyggmark\nspiropulu\netexilate\nsmokable\ngearshifts\nlesiewicz\nderx\nprereg
istered\nhinatsu\ncrostini\navello\nhearthrob\nsarukhán\nakhmedova\ntrigell\nboquerones\nkassman\nwilderstein\naabout\nunicel\nschupak\nlegitmacy\nbreasclete\nuncompassionate\nmacecraft\npanasas\nlonyae\nojek\nturbiville\nlocksbrook\nardfield\nxeloda\ndunganstown\nalinean\nbaggages\nlurling\nhardins\nbringham\nslackman\nflackett\npentalina\nheytvelt\naoukaz\nlockner\npacificare\ngoldgeier\nlightshows\ndebrided\ngonadotrophins\nplagnol\nburdale\nverryn\nwholley\nschelvis\nmanagerially\ninterestes\nmazoka\ntschütscher\nquantros\nsirm\nrarin\ninviolably\nsiejewicz\nemerman\nransdorf\nheartwrenching\noplc\neihi\nallbut\ntratos\ndykhovichny\ndroukdel\nchapatti\nadirondak\nsarcophaguses\nguarascio\nswoozie\nextensis\nschweickhardt\npamulang\nmerigo\njfsc\nscratby\naraud\nlabaf\nmitulescu\nsevereal\nhoguet\ngrandpop\nworldperks\nborsheim\nbiocraft\nsekurus\nherodotean\nmewshaw\nbrovold\ncannt\nworawut\nnomadix\nramonov\ndazé\npeascod\nkopczak\nlorra\ncockily\nsanitec\nmedl\noceanliner\ncrtical\norsedd\nbaqr\nghoramara\nsuheila\nbaldeon\ntrioval\nturnstyles\nadwar\ninterglobal\nkerekou\nimpared\ntilery\ngracetown\nimerman\ncombover\ntuffnell\npacala\nbolay\nsmutny\ngenson\nnonparticipating\npasman\nnyamwisi\nhectically\nversys\nbridgemont\nhuckstep\npesc\ncosmeceutical\nfirminger\ngoodboy\nconviently\njujie\nlortab\ndjelkhir\nkobeh\nmelipeuco\nkearnan\ndegrange\nunsensational\ngarantee\nlibéreau\nmiscounts\nbassis\nbaochang\nmaydays\nelfont\ngreediest\nsalesrooms\nsovreign\nemminent\nachaval\nsnits\ninsequent\nkineticism\njashanmal\nbemedalled\nlalt\nlibassi\nseapine\nachieveable\ndooey\nintiative\npedelty\nelkinson\nseluk\nhty\nafalava\nbrulée\nnarcotraffickers\nstyrenics\nswadi\nsetliff\nmyelomas\naffo\nmaliwan\nlariam\noutdueling\nmozaffarian\nisscr\nzuddas\nfraioli\nsoppressata\nmosimane\nunattainability\npatalano\nhackleback\nrogness\nmitler\nbickersdyke\ntrencrom\nzilka\nelectio\nnodine\nkapò\nuspga\nhaot\ncurrock\nkowey\nmarinaded\ngurri\ngurpegi\nebita\nschoner\ntel
um\njamame\nhedric\nzhongxue\ndeice\napwr\nzetterstrom\nereli\nnlea\ntrendrr\nariaal\ndaem\npleinmont\noveranalyzing\nungass\nfatherlessness\ncarender\nsimphony\nbogacheva\ntaimani\nnaksa\nenrapture\ngurvansaikhan\nheroique\nhaplin\ninama\ncorvaja\ngrillers\nteixera\nbetzold\nsuperfruits\nmastrud\nshihhi\nthanya\npellum\nnivose\nrebidding\npinvidic\nzichal\novermatch\ngeiges\ngrassia\nusajobs\nbarmulloch\nlevisohn\ngrovels\nshadiest\nehuman\ngagon\njowder\nbelani\ntilmouth\nfärberböck\nbukantz\nnowzari\nryers\nrutterschmidt\nanacap\nmicrocitrus\nshentel\ncutup\nrakhman\nriyadus\nhickin\navolar\nupasani\namyloids\nibéria\nmpsv\ntlab\nsahwat\neriswell\nszele\nrisottos\namzallag\nbombadier\nvelicka\nknouse\nindepence\naudobon\ncassidys\nmeletis\nbiostatisticians\nsledder\nstranglings\nmaggay\ninternationl\nccgi\nhryvnya\nzipit\ncmfs\ninterwork\nfelberbaum\ndenak\nschonbrunn\nexorcized\nazafrán\ntabd\nnunno\nkallström\ncranmere\nhirliman\nhellekant\ndsam\npilsners\ncocaethylene\njhoulys\nalipov\nwestmann\nberce\nsickener\nakme\ntxtmob\nkhatija\nghastliness\ncalyxes\nvehrencamp\ninidividual\nuzumba\nzarnowitz\ntarfusser\nneafc\ninovative\nmarshevet\nnebuliser\nsrrt\ntirofijo\nghalam\nroovers\nfactae\nllovet\nvasiljevs\njiefu\nmemecan\nlandahl\nriyom\nparellel\nunscanned\nmolodkin\nseesawing\nnavaros\nigougo\nkonenkamp\npopeo\nrepilado\nclerkland\nbioquarter\nlongmenshan\nhessenford\ntrematopid\nmuqdadiya\nnonin\nberryfields\nkranju\nmethamidophos\nmistatement\nglobix\niaccoca\nsepeda\nkaiserschmarrn\nopeb\nrhyddfrydol\nbhangal\nazmir\nscailex\ndelawar\nbwakira\ncaseosa\nglied\nmadisha\nnaftel\npublicty\npodolny\ntwelth\ndishu\nprofessio\nhanao\nhemstreet\nhinant\nsupermans\nnimotuzumab\ngenewatch\nfooey\nsantus\nonthank\nmauby\npankratova\nfractiousness\nmaisi\ngalbo\neint\nkrinos\nmigranyan\nlukats\nraeth\niscan\nmiscik\nlatinobarómetro\nmagnell\ncollectiveness\naccountably\nstetz\neuropolis\ndowloaded\nbotmaster\ngobsmacking\nunbackable\nfyw\nmandery\nincandela\nthribb
\ntooway\ngorfodol\ntillsley\nbouzaid\nblatancy\nahcc\npactiv\nsectretary\nunderoos\nberst\nfrovatriptan\nkilolo\nbifengxia\nfeuchtwang\norlovac\ndhoki\nhelda\ninspirers\narvinmeritor\nmelahat\nlifeguarded\ngustinetti\nduac\nsuebwonglee\ntransistion\nkitas\nifund\nmarkettools\nfussier\nplaypens\nstreches\nmartinsa\nhastier\nsimulataneously\nshaif\nlewand\ndenationalised\ntokujin\nhumaidan\nballuch\nfhit\nhensey\ngallardos\ngenarlow\naranoff\nretentiveness\nzeyad\nlautem\nyantorno\nyvonna\ncaracortado\naresty\nslinkys\nporthemmet\ngurkas\nhampreston\ntrannie\nmudbath\nsoasta\nbécasse\ncloseby\nsubpanel\nkissogram\nathwal\nghising\nptsc\nmeshoe\nbentprop\nvafi\nlimned\nrostang\najuste\nambulanceman\ndthe\nvinnik\nculicivora\nindigineous\nsenges\nkhazai\nschomerus\nfastsigns\nvygon\nibase\nmaddiston\nlockerroom\ngorick\nsagalevich\nbedeck\ndualeh\nbisheshwar\nhersha\nkaboose\nmoghimi\ningosstrakh\nyannan\nmudawi\nrheinecker\nexpelliarmus\nupturns\nmedee\nkaaot\nvidala\nscheppach\natheel\nsubindex\ndishearteningly\nfaild\narabba\nmedicalizing\njahmar\ncussins\ngisagara\nnahabedian\narnhart\ndragonwagon\nnonexempt\nellcock\nsuedes\nunreturnable\npeckhams\nbigmachines\nkatwal\ndrewiske\ntuxes\nmophead\nnoncircular\nsathyavagiswaran\nsilab\nmoldonado\ntidemark\noverinvestment\ntakhe\nsfwmd\nsconser\nhandballed\ngoldex\nfossetts\ncollop\nmagomedali\nkupers\naccurev\ngabbing\nlozes\nebookers\ntetiaroa\nswanagan\njevremovic\ncbssportsline\nslingplayer\ncsmfo\nifsl\nwikholm\nglicksman\nbrynu\nalnajjar\nsukru\ngeorgetta\ntheodoratus\nlocharbriggs\nvandebosch\nnullis\ndinitia\nenvironmen\ntuckered\nanthros\nsaleve\nunmasculine\nsavvier\nvendex\nfundamentalistic\nawino\nzeibert\nnorgen\nmomanyi\ndiiorio\nsubira\notherway\nultrasone\ncyanurate\nroedy\nconisholme\nviget\ninvitiation\nmatalqa\nshoeprints\nbmgf\nfirelighters\ngurria\nflewellen\ntuaminoheptane\nsukkwan\ncompeteing\nforbore\nrweyemamu\nsangean\nalieva\nrentfrow\neurocents\nkohtz\ncolaneri\nnaggingly\nkintnersville\njan
ulis\nchuluun\njampacked\nroadm\nantitussives\nrespironics\nsuzaan\nyoennis\nbdkj\ntheirself\nabdikarim\narow\nelmworth\noteh\nsaniora\npronexus\nmangou\nmephistophelean\nyingxiu\nophthalmics\nrohrssen\nvitrol\nwittenbrink\nkalisa\npfiffner\nmenteshashvili\nnwoye\nnbgi\nscottevest\nyoutrack\nholdway\ntrotanoy\nplenel\nmeatout\nwhichis\nbctgm\nthirdhand\nheii\nheltzel\nmwewa\npelmet\nlangsdorffi\nelleke\noberacker\njamychal\ntwitterer\nnances\nimpella\ncummersdale\nmitting\nexcrutiating\ngrieser\ncailleteau\nweinfurter\ndemaundray\ngorsel\neocarcharia\naddlethorpe\nwajma\nkhamanei\nstrengthener\nresprayed\nmingwei\ntopcroft\nmedvedow\nulsi\nunignorable\nzevran\nbrandejs\nabernyte\nkules\ntavui\ndaulerio\nintellichoice\nklaers\nmokarrameh\nccsso\ndoneger\nirabagon\nmidsentence\nunnat\ninstiture\nbradburys\nsoapmakers\ngenart\nmoaa\npreds\nnormalises\nshvidler\njackon\nunsubtly\nrayburns\nelectrocomponents\npravada\ndoreena\ncackled\nburban\nunigol\niriekpen\nstrummy\nwaaah\nhanzaki\norobator\nkouider\nzelika\nshallah\nacores\nphoner\nchronophage\ndeadpanned\nroybridge\nkeidar\nhousekeep\nunsocialized\nibbett\nschinkels\nbaraawe\nrasiej\nkasoulides\nholweg\nmacchietti\njiverly\nresoundly\neacho\nmnuchin\nlogocentric\naviculturists\nlineberry\ntafreshi\nsmartpass\nkwalik\nchattiness\nrobbrecht\nwitheringly\ndiscomfit\nnistelrooij\nmcaveety\nflymo\nsnappiness\ndeklerk\nbreguets\ngronant\nkabalan\nitsunori\nreligeon\nshuldiner\navenatti\nravenscraft\ninstranet\npunking\nsettlin\nmodiba\npastides\nthimbleful\neurocommerce\noverfamiliar\nqliktech\nbeleskey\nikee\nanacor\nsnacky\nsymo\niparadigms\nsoerensen\nerrin\nsharow\nlouisianna\nedmier\nyday\njosanne\nruching\nsinokrot\nferer\nhurezeanu\nockwell\noathall\nbiegun\nintermodality\njorisch\ndebasements\nmcarthy\nmardirossian\nzubayda\nbearup\npizzaiolo\nkarpaz\nghigliotti\nnayna\numbehr\nmegeve\nelwesii\naclidinium\nonhand\nklie\nsoltanieh\neular\nspehar\nnowcom\nitilleq\nsteidtmann\nbarnets\nmedvec\nmirium\ndulio\nposturi
ngs\nbiodetection\nthinkg\nfeamster\ndourly\nroubideaux\nreinjuring\ndeathbeds\nhoute\nspeechs\ndreadfulness\nmoskitia\nflytech\nfioritura\nsoniat\ndeifies\ndigsby\nmultichambered\naubourg\nprimaried\nkawneer\nseamicro\nquadrantids\ndresse\njonnes\ncrosscourt\nprescher\nperondi\nsweig\npalkot\nunwrought\nbagleys\nrecanalization\ncestria\naluizio\nsnakefish\ncampier\ncaptials\nmckubre\ntalanted\ndayyer\nunironically\nmanipulativeness\nmoorclose\nwozza\npopguns\neveryscape\nwinings\ntreacey\nzhihe\ndoninger\nbluerun\npivnice\ngeraldina\nclcv\ncrdc\nonforce\nfriendfield\nbnic\ngushingly\nfiances\nneovius\ntsakhia\nnurhayati\ntrehane\nlachelier\nmayersohn\nschafernaker\ncuillier\ngrenham\nholmrook\nfrancaviglia\nlashawna\naccessorised\nrollick\nyunsheng\ncorkboard\nsymlin\nbaylay\nangelfood\norgalime\nflorbetaben\nchiem\nvroomen\ngoip\nmudbank\nmojahed\nbolnick\nrieken\nfatique\ntéléthon\necodiesel\nkentz\nblairdardie\nwonderkids\nowlsmoor\ncerebrally\nqualles\nvalimo\nbradaigh\nlavander\nsextupled\nsenseney\nadpc\ngreencroft\nlauched\nhopke\nmmfm\nbattlewagons\nyedoma\nturridu\nsigneul\nessaydi\ncarumba\nmanook\nrafeedie\nberendo\nepay\nkhomein\noverbidding\ndodn\nsouheil\nreplenishable\njové\nsandelford\ngavil\nmponeng\nzieve\nupconverting\nformanchuk\ncomissioner\nlesiba\ncaturday\nshefrin\nanggraeni\nellenore\nalshamsi\nspiderlike\nfaliure\ndarlo\nbruar\nbaoliang\nstiteler\nkhushali\nfranscell\noinks\nestupinan\nspigit\nkleptocrats\npivni\nunbeautiful\ndetamble\ndemonstation\nsuie\nsapina\nhorowicz\nhardmen\nsicap\ntubbing\nwigtoft\nneuringer\nsplurges\nstreetscene\ntantular\nbetokened\necris\nqianfeng\npolicylink\nvyarawalla\nentrecanales\nmarefat\nwerfenweng\nmacconnachie\nokonjima\ndefusion\nmayben\ntittsworth\nsmalldon\nwolfard\nfuturistically\nvolkens\nfolb\npillco\ncollecters\ndebon\nrubbin\nkulakhmetov\nmonachino\ncachuela\nbutterkist\nrolta\nmisheck\nunattentive\nquilombolas\nderia\ntattooer\nchaurasiya\ntrescot\nhelderman\nuncoils\nmurriel\nundaria\ngogear\
nartsbeat\npelisson\nfinagled\nschiffahrts\nstarchitect\nhisser\nlousaka\nbaianas\nnerdier\nhollitt\ndorokhin\nsensable\nxenios\nhoushold\nschuttler\n\ntaxloss\ndemijohns\nkanasaki\nmarakele\ngostowski\nheadhunt\nbazzy\ndestabilizer\nkirwans\nscoyoc\nbrunnstrom\nslipman\nmoritorium\nsiddikur\nundeployed\nchallandes\nbernardins\nstepladders\nladling\nmetavante\nportmouth\nguertler\nnlets\nokuwaki\nkavoussi\nkonterra\nkeynsian\netblast\nceoe\nlapiro\nwhooo\nrosema\nerikssons\nducille\ncynuit\nsaretsky\nprogen\nparaguyan\nmotele\ndevaluate\npoutala\nwinterkill\nfastpath\ndanesmoor\nbucklesham\nlectorum\ndygalo\nmeatheads\ncratchits\nkanel\npitham\nciurciu\nmultrees\nmanufactuers\nbonesman\nutsc\ngreengages\nzumwinkel\nbayhealth\nguilelessness\nniiler\nlimpidity\nrenfred\nevpl\nfreee\nopmd\njawole\ncapoccia\nenvirofit\nanitelea\nnelan\nnimbu\ntahhan\nreconcilliation\nbitkom\nzepagain\ncaronna\nelecticity\nthummer\nrefire\ncanestrari\nwahiya\ngrumeti\ncheckdown\npacan\nbaltray\nathttp\ndotasia\ntencha\nsadangi\nsnptc\nforsbrand\nfungurume\ncorbusian\nfehrnstrom\nsavuti\ndryish\nlurn\nhuvafen\nstrangulations\nuchea\nseagirt\nartrock\nnawbo\npantywaist\nwerx\nvandamm\ncajani\nskana\nmugaritz\ndiabetologia\naïoli\nluebbe\ncretz\nlaroquette\nheireann\nlumpp\nkanit\nlamel\nbrenco\nocteon\npharmasset\nsuwalski\nuniversitie\nwhoes\ngeorgoulis\nschurrer\nzafferano\neejits\nshaggier\nejiogu\nthandeka\nfuzhan\nminohd\nddtc\necip\nherteman\naravit\npragna\ngonsalvez\nhathwar\nizat\nwhimpered\nwiggans\nhotblooded\nsapard\ndesposits\nrepik\nkrengel\ncedillos\ntarries\nvetinary\ncampness\nschackenborg\nlowed\nmocny\nwhalid\nfeeks\ngrabarkewitz\nparavati\nbortin\nmidgame\npatroled\npeloux\ngwerth\nunmailed\nkunick\nregualar\nheatherside\nnyeu\ncapaldo\nreichenthal\nowie\nbrassière\npearline\ncrotts\npatongo\nbloodsoaked\nwalkon\npirarucu\nstiil\ngastner\nsajil\nchangepoint\ngrindea\ntaimuraz\nsupersensitive\npierantoni\nambulating\ncheselbourne\nalqaeda\nlehmans\nurbn\nconniptions\npe
ninah\nwaltzers\nquangel\nqualifing\nscrimmaged\nhorray\nwashko\npalazzoli\nrosemarket\nmantillas\nbambis\nhandywork\nosterweis\nkorchemny\nbateyes\nvivari\nshamama\nmaelstroms\nguiora\nshokouhi\nnerma\nreinfected\njapenese\npraefcke\nskrewed\nsurpressed\nubiparipovic\nmanifa\ngruppetto\nsantaros\neffler\nsolemnising\nbodycare\nlilianna\ndominiczak\nduked\nwaddled\ninconsideration\nruksana\nfillibustering\ncarlsons\ngappers\ncodys\nxinpeng\nvandevorst\ncumper\nspytty\nkurella\namoni\nbrochier\nborzi\ncevs\nrobino\nenviorment\ngrillework\npeacable\nnypro\ndulic\nroxford\njunren\nongeri\npeccadillos\nsupping\nguffawing\ntoretti\nblackfordby\ndurana\ndiscusting\nmiserablism\nrapaciousness\nseabrooks\nchevez\nträsch\nlobberts\npaternalistically\nsobero\naboutreika\ngrcic\nabertoir\neradicable\nantismoking\ncoyles\nstormclouds\ninsolia\narctics\nmittee\nmephitic\nbeginnning\ncairenes\nrivasi\nkingsly\nstenvall\nmezzolombardo\nabdelbaki\nanglophilic\nmetaj\neletropaulo\nahney\narticulacy\nkeyboardless\nsemionova\nrequetes\nmedx\ndeskin\ninjuncted\noohh\nloadsamoney\nfscc\nnakanda\noteley\nkiddos\ntriebe\nsidetrip\nfacilties\nruitenbeek\ntugba\nrocketbelt\nlebovits\nbackscratching\ntelephonically\nbeinne\nduckler\nvivona\nsubtyped\nunpenalized\nberlais\naeis\nfloridean\nsoukupova\nafce\nvideocameras\nsavviest\nspeedbumps\nbalmossie\nergots\nadpt\nmeineck\npossibley\nmangor\nniederbrock\nkoelliker\nshanth\nmunyurangabo\ndreamforce\nbacus\nditchwater\nhaussmannian\nsessilee\ndetrol\nmatanovic\ntricor\nbenattar\nintertanko\ndoai\nswaner\ndeoderant\nkulzer\ndupay\ndanthine\nalderwoods\nxethanol\nrewrapped\nhyperview\neffed\nisoardi\nalderston\nundersubscribed\ngodsmark\ndeconstructor\ndcgi\nintermittence\nnaprosyn\nbuckfire\nprempro\nslummy\nbackstrokers\nfornas\nputtable\ninconsequentiality\nwimpassing\nplacedo\nbahnam\ngovernmentwide\nsenao\nunfertile\ndrinkability\nelberse\nnebti\nnemyria\nmoooi\nvorce\nkrisworld\nmaugh\nhuanglongbing\nesskay\nintino\nfumus\nkrimpets\nscole
se\noweis\nghappar\ncaffell\ncianflone\nmamajek\nmoneymen\nfenk\nzabari\ntwosomes\nphwa\nrapradar\nbaulas\nbungard\nstonger\nactifed\ngetresponse\nahrari\nbaubigny\nallowability\ncosponsorship\ntunneys\ninhumanities\neyjafjallajokull\nszeleczky\ncableone\ndttc\ntrwam\nnutra\ngotton\nbeaworthy\nmoddershall\narrangments\nmaurito\nfollitropin\nenquest\nadow\nongenaet\nmacpro\ndoorstops\nhydroprocessing\nnassawango\nschroepfer\nfazilet\nscalex\nperóns\nmandelas\nstoever\nwhinnies\ngranizo\nhardick\nroehrs\ncorvington\nsharsha\nskywatching\npensthorpe\nbloodmobile\nstereography\nloftleidir\nmegalopolises\nmokedi\nramonas\nndira\nhorter\nconstitutent\nfunguses\nmerlots\nswanswell\nagtmael\nprunings\nsabarsky\ncynicus\nintials\nblums\nstrothotte\nwashingtondc\nnataline\narchimage\ngobero\nguadalix\neupd\nvpotus\nnnoli\nkoryolink\njicky\nbohora\nbeckfield\npelkie\npotasnik\nwigmakers\nfrankman\nvardiman\nflameouts\nriedlinger\nolaleye\npylade\nhoweth\nmodiselle\nsiglio\njasmyne\naviacon\npaglicci\ntechnomedia\nshorstein\npiggin\ngetfriday\nabkar\ninshes\ndonihue\nopurum\neglitis\nconrades\npanuke\ncedulas\nmileusnic\nntpp\ndvornikov\nreclaimable\nefner\nmarfrig\neverdene\nhyperextending\nkotzur\nkreth\ntamberino\nzetumer\nafls\nmeydani\nsharmans\nhainsey\ngompel\nellay\nblsa\nhaskanita\nnadcap\ngreyest\nbrackpool\ncowlam\nunconciousness\ntraxis\nfibrocytes\nflamboyancy\necoterrorist\nhejab\novernment\nzerya\nviolaris\ndisgreements\nstaler\nperniciousness\nbudennovsk\nsupremecy\nerway\nhighstein\nnoblis\nfohrer\nlaurissa\nlapka\nbatofar\nmeler\nwaverman\nsözer\neclipsys\nluvaglio\nppci\nfeliksovich\ntecu\nlondell\npintxos\nntdc\ncenturians\nqueered\ncourtlands\nticia\namnewyork\nlabandeira\namsprop\nmchp\nwalhi\ntanzie\ngoloco\nbelviso\nsachinidis\nsankaralingam\nmicrodiscectomy\nsoyfoods\ncogitations\nsullenness\nwycech\nwuer\namburg\nadaobi\nmoskovski\nprobers\nconcia\nmakhmour\nferrús\ndndo\nunfished\nspherix\nkasukuwere\nyonnel\nhornbeak\ngombosi\nmulid\nabsolwent\nsavoi
\nunutilised\nlegendry\navodart\nderwenlas\ntsahi\njurow\ndoriel\nlandrys\nfutaleufu\ngaleo\nbuitendijk\nguardiani\ntajuan\nsabemos\nkaidanow\nromark\nnsimba\nadegbile\nsakkas\nassarat\nhaverson\ncovisint\nmarrington\nduntley\ntarence\nsidles\nrelativley\ngauldie\nmastrogiovanni\npeopleperhour\npramaggiore\nmarkwalder\nintc\nbauditz\nmussing\nforminte\ntatnam\nkaemmer\nbioport\nheric\ngitahi\nfootbo\nexuse\nkomodos\ntaromenane\nsukiyabashi\nkovarsky\nguebre\nbestrode\npnwer\nbuschman\ntarracino\nuniveral\npreheats\ntortu\nwilloughton\nniezgoda\nedinbugh\nsinglemindedly\nsaynow\nchitou\nhydrofluorocarbon\nbrightkite\nunbooked\ndhuru\njamilia\nharwit\noubaali\nadwait\ntietmeyer\nonsides\nelmasri\ncrispest\nmingyong\nborkovec\nefata\nmelhorn\npubcos\nchristmasy\nfritta\nattou\nschaldemose\nmindjolt\ncompetitve\nipulasi\nkapana\nttha\nlongserving\nkopinski\nbakhta\nswepco\ncytopenia\nraquil\ndilscoop\nprecollege\nberuit\nmicrocircuitry\noverspends\ngeenen\nliverpools\narghistan\nseifzadeh\nkhpal\nmufeed\ncontentfilm\nhimeself\nsékouba\njayasooriya\nramkishan\nmasaliyev\nsantanas\nultrazoom\nbassiouny\nmotherlover\npaglialunga\nxarelto\nnougats\naconex\npirez\nreconnections\nwildbore\nmowle\nwainiha\nalati\nfsap\npophams\npovaliy\ngrapnels\npopovers\ntimone\nkokinis\nraggedly\nalphatech\nstrowbridge\nphuck\nmackems\nschmear\ngoaling\niannello\nbinliner\necsp\nchipotles\nhucklesby\nhatefulness\ncynde\ndobberpuhl\nwppa\naltenbrak\nacclimates\nhebronites\ncatster\ntanard\nsoluable\nelsholz\nshirvell\ntorraca\ndailymed\nbenriach\nsasajima\nphilinda\nneinas\nvinitaly\nninoska\nbetgenius\nshoppable\nhaveron\ntertz\nsquan\nnadezhdy\npotowmack\nespressos\nantiunion\nmultiplo\ncraftiest\naigfp\nstavinoha\nsuhur\nmoebus\nzawadski\nbilello\nuhhhhh\nlednock\nfezza\nundigestible\nragozina\nparamhamsa\nkolditz\nstoneyard\nweglarz\nmegaten\nwananchi\nvnas\nchritianity\nprolongued\ngaiger\nmontias\nriads\nturbinton\nbaseera\nunderexplored\nkoloskov\nsquba\nrosendaal\nponces\nrajnoch\nsa
lyards\natsg\nsheepdrove\npokemones\nabedian\nglionna\npeckers\namirahmadi\nminnigaff\nisdr\nlizzio\ngtel\naglitter\nblemishing\ntankful\nkalapara\nmaydwell\nshodeinde\ninteriano\nsobon\neffervesce\ndrumkeen\ngildred\npustay\nbournside\nbenedi\nseaclose\ncbol\nkulchitsky\ndabah\nilora\nlemole\nchhewang\nfullalove\nmansyur\nexg\naderans\noutwear\nmckennis\ncounterinsurgent\nlasering\nspackling\nbangbang\nzly\nsenetor\nninebanks\nderico\nkausfiles\nstacchini\nsirimanne\nralpha\nntsebeza\nshastar\nobermayr\nhamptworth\nmbagala\ndrunkorexia\ncardoons\ncereso\nfallwell\nastrovan\ntabaro\nspringland\nhiim\nasantes\nvorsteher\nweate\ngaranca\ntalboy\nclemes\nmotech\nbiodel\namankwa\nweijing\nstrenous\nunspectacularly\ncdphe\njudical\nchauhdry\nstorminess\ndalhuisen\nboardsource\nactonel\ncappellacci\nkeryx\ndirectcompute\nbuturo\nsibleyras\ntqd\nscin\nfbml\nmycock\ninsdorf\nsuner\ncrewneck\nbraggarts\nkivuitu\nmonsma\nfellous\nsdunek\naffliations\nucpa\nwashcloths\nyowls\nmalei\nwizzart\nmillponds\nsahebjam\nvillepique\npigheadedness\ngannushkina\ncherner\nsetterstrom\ntegegne\nwhem\nconcertmistress\nvanderhook\ntsehaye\nextrahop\nrecladding\nhealh\nsciens\nbramhill\njhagra\nacass\nchervochkin\nponying\nphished\nsecularise\ncolateral\nagasint\nfibresand\nbellatti\nroopam\nsnowmobiler\neikos\nlaynie\ndonowitz\nyafeng\nlaugardalsvollur\nthinsulate\ntdz\nherme\nbuddys\ndobens\nhepatologist\nwagha\nstoessinger\nspaulings\nafsaruddin\ngoodwills\nhomicidally\nresuce\ntechtown\ncarring\npracharaj\nduddleswell\ndeheng\nfantle\npoczobut\ngammarth\ndiligences\nsuprime\njokhio\nmojitos\nportaledge\nmoyesii\nhaansoft\nboeglin\nsphr\ndenburg\nfolfox\neroticizes\nendris\nsudjatmiko\nalayo\nkenmoor\nbioserve\nmlinaric\ndustier\ninnocous\ncarizza\nwindbags\nhipocrites\nafetr\niridotomy\ntinetti\ncihlar\nhairbrained\nxertigny\nchengcheng\nhamingson\nmarsano\nclamed\nghei\ncozze\nreecie\nrushern\ndaynile\ntetelestai\ncydf\nolaroz\nnhongo\nstacom\nsublicenses\ntakkula\nhardyment\nfreekicks\n
hannahstown\nzaluar\numdf\nnmfa\nbaracouda\nawdah\njpra\ncravenly\nbiocatalyst\npevenage\ngeving\nflaxseeds\nmacaronis\nuralsib\ntearne\nvoilá\nprivacies\nbizy\nnosegays\npensioning\nanaesthetise\nsowings\nseinfelds\nsantuccione\ndeveloment\nlarché\nsucden\ngumbiner\nguccis\nalpf\nmaruziva\neccu\nteklanika\nsklarew\nborlotti\nmoneypak\nbedfordia\nblipping\nlaup\ntineid\ninadvisedly\nhuriwa\naustriamicrosystems\nculv\nstormier\nstankevicius\nlashun\nfacius\ntrumpian\naelvoet\nparcour\ngrowning\nsospan\nbrandeau\nprobablility\nscrote\nclonie\ndufry\nhargroves\ndepietro\nfloodline\nhudspith\nfiveforthree\ndistincly\ncalster\nhakimzadeh\nhygge\nlowara\nramde\nmeckling\ndarlyne\nhubayshi\ntaikonaut\nmusicial\namericanise\ndionisios\ncoovadia\ntyjuan\nsaintia\ntolerantly\ngeekier\nheisterberg\ndoomsters\nsnowmachine\ncityengine\ninlcudes\ntravagli\nlawbooks\nbisazza\ntalamona\nsilverlode\nptns\nzerain\nthicot\ndeyong\ncosmeceuticals\noutpitched\nsupersleuth\nabli\nblancarte\nlandrovers\nbleichroeder\nttwo\nnoninstitutional\ndecencies\nhodas\nsaxenian\ndicovered\nkoetsu\nirresistibility\nsoyabeans\nskytypers\ngasaway\nnunziato\ndornenburg\ncopenhageners\npetrojarl\ntril\nstrawflower\ndoodlers\nbarthmaier\nbuatois\nripply\nmcglowan\nbombsites\nhimm\nravinet\nnalbone\nitoo\nglassybaby\npainterliness\nbalbardie\nbaddoo\nrptp\nsmarthinking\ngalisteu\nslix\nwristlets\nbiodegrading\nrazeq\naquinos\nmiskella\nfrankenfood\notelixizumab\nhomebirths\nchabbert\ndovima\ncristan\nassociaton\npirinski\nbarondes\nmuxworthy\ndenos\nwholesomely\npimentos\nduporte\nworthingtons\nfarrokhroo\nprickliness\nbienemann\nverzaubert\ncolaninno\nhautzinger\nsanish\nnikons\nsuslick\nnexaweb\nunderrepresent\nzetar\nsedivy\nzeyada\nmoshayedi\nskvorecky\ndragoneer\nrepole\nabortively\npraire\nacras\ntheose\nrickerson\nteplica\nthte\nsaulters\nhabeb\nperkiomenville\naxinn\nfreska\nchaptalisation\ndamola\njohansens\nklunchun\nelectromedical\ntumwa\nnuble\nbudenberg\nfilmyard\nbeeland\nyouthwork\nbilwi\nle
mondrop\nhiccupping\npipersville\ndenzo\nrohnke\nnecessay\nthroneroom\ngrullon\nawardwinning\nafrobarometer\nmundinger\nafni\nvoteless\nqpos\ngoldsteins\napplebees\nscarselli\nneedier\ncortefiel\nnashan\noffguard\nleppink\nbrynford\naawc\nshoulderless\njeanloz\naltnet\ndaiquiris\nmanufactuer\ngillepsie\nvowlan\nandanson\ncdai\nlimitative\nhervik\nflewin\nunfroze\nlibeco\nkompan\nmarsali\npillcam\nassumpcao\nqddr\nformfactor\nstecf\nchampoluc\nrheo\ncsrwire\npawloski\nmobiletv\nacthar\nvaji\nboltby\nsorries\nstonking\nleasco\nomnipoint\nbasepath\nmultisource\npriyadarshana\ncrimbo\ngenderism\nstroytransgaz\nwhistlefield\nrashedi\narcadey\nstrategise\ngulfcoast\nneptec\napplebome\nmerkinch\nbreachers\nbloused\nchiliboy\ninluded\nhelta\nduvie\nsgms\norganistion\nplowes\ntoyomichi\nremeasurement\njarrai\nderrico\nschmill\nlipes\nmudhafar\ngruaud\nmirembe\nmarhsall\ntrakh\ngruntled\nscaynes\nwistman\nmavrakis\nsmolyaninov\npaczki\npianalto\norumiyeh\nosoria\npersley\nguttenburg\nxhb\nfellates\nleacon\nprelanding\nmunizaga\ninfringments\npopka\nmastracci\nbendiga\nsneakiest\ncurzan\nekstein\ntimidria\nunho\nenvironement\npreposterousness\nmergler\nsonghurst\nhonsik\nspaliviero\nturky\nkalimantanensis\ntheorys\nkanekoa\nriihimäen\nfuddled\nmedialand\nabdolali\nbabayi\npijanowski\nfiorile\nakab\nfindin\nimpanel\niaato\nmessineo\nazizulhasni\nsleepier\nissakov\nannonced\nmitina\nbiometrically\ngranich\nkalpoes\nprestigiously\nweatherproofed\nnargile\ncathall\ntamilselvan\ndebitel\nlamestream\nbracketbusters\nslieau\nsoltner\nalmany\nconventioneer\neluay\naicraft\nsecen\nbudelmann\naudioguide\ncahyono\nmulal\nperfomances\nunlubricated\nrabbity\ncuting\ngodzik\nindustriebank\ncasciani\ncytyc\nrechargeit\nfengqi\nmemberclicks\nloropetalum\nneulasta\nbananna\nqueffelec\nhardiyanti\nrostek\nvéfour\ntourk\ntopicals\npeacocked\nlovaza\nvlna\ngyori\nquatar\nbostonnow\nnavile\nedal\ncrog\ncriticsed\nmangoush\nmahmoodi\nsjoholm\ngarness\nentsminger\ncomé\nwellwishers\ngichon\nbagnat\n
meetmoi\ngatsometer\nsetmariam\nmitat\nvangundy\nstalbaum\nunterach\nbegona\nattmore\ndrakopoulos\ndecomissioned\nbuoncristiani\nhellsberg\nsawday\nolumuyiwa\ntruckies\nnomics\nloook\nslesnick\ngabyshev\nbackhander\najillo\njendeki\nkawhmu\nidealogue\natalon\nnbsk\nedrik\nluksch\njohnnycakes\nqpx\npredetermines\npcad\ndussap\nkaimar\ntrishaws\nbiofields\nedinaldo\nsingstore\nmichelins\nokulov\nshirrel\nsspca\nglucocorticosteroids\nscourer\nmadhuku\nparticipacoes\ndegrippo\ntholet\ndaspu\nnaoms\nslonina\nmoroko\nmanusky\nkoeneke\npelegrino\ngranquist\njabbie\nempreendimentos\nclimatique\nwisterias\nrburgring\nparkwest\nmantriji\nwinyard\nnichicon\nthinspiration\ngastao\npushpinder\notherworldy\nthinkvantage\nvaccarello\nbraidhurst\nautlan\ncarterfone\nindama\nnwlc\nclfr\nearnout\nhenigan\nescarra\ndysphasia\ndebica\nkenfield\ninfirmières\naleasha\nelysha\nbutterballs\nbrynien\ncampanilla\nstudly\niraw\nfeleknas\nbarehand\ncommitteee\nrumailah\nchillas\nlaffoy\nangelakis\nhepatitus\ntofutti\nschnittker\nkehrmann\nscentsy\nchockfull\ntoulou\nbordat\nsinornithomimus\nlezlee\npadmawati\npotray\ncarrison\ncoarc\ncytel\npyret\nholographs\nseussian\nrozsival\nuncomplainingly\npontycymmer\nopsm\nmeringer\ncroiter\nhelioseismic\nnatelson\nmokuena\ngophone\ndissappears\ndelce\nascetically\ncadan\nhospitalising\nkingseed\nreceipients\nstockbreeders\nreleaded\nporfolio\nblaiming\nroskovec\nsquired\ntopilejo\nladened\nshopland\nserran\nflightpaths\nsuryana\nindisciplined\nliliam\ndarroze\ncarrefours\ngrobel\ngilthead\nheles\nshikov\nholywells\nbourrées\nvlassakis\nhenehan\ncraigowl\ndownsizes\nparralel\nbusinesslink\nmaxick\nguarida\nsucumbios\ngremmo\ninfopro\nbreakingviews\ndsas\nkayum\nfernàndez\neqr\ndifonzo\nwisehart\nwmmm\nyauatcha\nopciones\ngoutal\nmccourts\nglowers\nehnert\nturkheimer\ndennee\nsizakele\nhitsman\ntuggy\nagley\nrelevations\nmformation\ncuneyt\ncelluci\ncarloss\njorvorskie\nqueasiness\ndubout\ngazzano\npostrace\nfirstplus\ngagnidze\nuspaskich\nglencove\nipo
ker\nloppers\nrackleff\nakkus\nbomke\npodimore\nfantus\nballyhackamore\ntreyarnon\nyardages\nnexcen\nacdf\ndedworth\nbelguim\ngosek\nchanon\nfitba\ncoastally\nraiter\nhaasara\nyohane\nbadreya\ndoxsey\njomhouri\nbirdcall\njephte\npretzelmaker\ndizikes\nhorwitch\nchappatte\nbumaye\njellylike\nbadui\nottobar\nincontrol\nkipngetich\nburte\nshinkay\ntaxer\ngrandads\ndraisey\ncytori\nkupono\nplash\nbarberas\nsouless\nsideco\npogam\nyaleglobal\ntarrion\nserialists\nditzen\nmevs\nchamberfest\nfilipovich\nqinghu\npapun\nmusicalized\njjimjilbang\nholocost\nhighflying\ngattari\nemptily\nsnode\njenefer\nhaasteren\nmichitoshi\nchongwe\nabakarov\natbc\ntrubowitz\nbelchatow\nguobin\nlosyukov\nrequoted\nskrebowski\nchandiramani\nfenroy\nmerise\nkörbes\nsuperbot\npankshin\numarji\npetrivna\nmpinga\nmatress\nchiuri\npearlson\nluchko\nchumbucket\nshneerson\nendako\nmagennises\npliage\npalframan\ndouda\nuscp\nphillipino\nthameur\nsnobbism\nchebotayev\nsoftnet\nhessert\nwallbrook\nshafritz\nunfortuante\nfipr\nwitcham\nlabrot\nhagmaier\nomahans\nawlaqi\nutest\npescod\nkinger\nmarinacci\nadulteresses\nmilitarise\nbitez\nbumpier\nsymbologist\nundiscouraged\nseenan\nleyvas\nchenevière\nstressy\nportantino\npogoed\nzayi\ncomebacker\nignagni\ndelusionary\nkazakhstanis\nwasbrough\nfeaunati\nhundleton\nfiondella\nkossler\nvitreomacular\nadrasan\nstatisics\nhalfshaft\nkurani\ncobu\nbartend\nprinicipal\ndishcloths\nvogster\ndoublehanded\nbarkwill\nheapy\nlumpish\nmefa\nwge\nbunbeg\nhutomo\nqdg\nconservatee\npodleski\ncorallium\nleopardskin\ngeed\nicariin\nzequeira\nasshiddiqie\ncelynnog\ninnovest\ncornley\ncanix\nmasunga\nobnova\nscarifier\nnikahang\nrainproof\npernil\ncozily\nqadirpur\nmicrodistilleries\nmuntasser\nbukstein\nrotblit\nuehiro\nbancgroup\nvodoo\nzhihu\nrapelay\nashmanskas\nmarcassin\nhunzeker\nimiglucerase\nsupernerd\nnuzzles\nbatjargal\ndoldrum\nnjea\nfriedrichstadtpalast\nforestside\nworell\nbolie\neasyway\nhotbutton\nnetdoctor\nbeschizza\nzonally\nvsia\nmetroeconomica\nmegaloman
ic\nsheperdson\ncastleside\noverexpanded\ncurvin\nbecas\nascuaga\ndelegitimisation\nbecomingly\nverndale\nbaheng\ntrenesha\nsmilovic\nvylka\nbottner\nngalande\nmilanetto\nbundlers\nkhalezin\nbfms\ngoluboff\ndenaturalisation\nzargham\ntravaris\nfigel\nnilaveli\ndangos\nfussel\nsupriatna\neurovegas\nnationbuilding\nlahiani\nasmerom\nshuaiba\ngagarinskaya\nsmethers\nvachhani\nbanguela\ngalron\ngreenaction\nkharrar\nwitchunt\nseariver\nelectively\nsupermaket\nlepeltier\nquesnelle\nwudil\npucon\nagualeguas\nwigix\nheathcliffe\nanafon\nabinanti\nkotsolis\nsoweth\njcra\ngresswell\ncapey\nslingin\nferromolybdenum\nbrewski\naccomarca\neacts\nwestfarms\nmarver\nsaltmine\ntreixadura\ngrumblers\nhamfatter\nbattels\ndubuclet\ncheeseboro\nsnored\ngrafft\nraibert\nmurcar\nhossenfelder\nmadumarov\nploof\nrecyclates\ndevolutionary\nxiaoshuang\nunappeasable\nvisionworks\nschweiss\nmédiatique\nhyperstar\nmulvehill\nnoruz\nazun\nspeechtek\ntedlar\ncomice\nsiminoff\ndashboarding\nsuperkings\nlated\nfuturebrand\nthornwillow\nliebesleid\njosen\nfazes\nmecaplast\npokalchuk\nsabrett\nkuvan\nmakvan\nchinawhite\nunremitted\nrygb\nliveblog\nsinkerball\nkormendy\npérol\nhayre\nburncoose\nrfis\nmfcc\nwarstler\ncackler\ncoopmans\nsuperrich\nattarian\nbookstand\npaffenroth\nshelp\ndettloff\npolehinke\nsaimira\nidva\nkibir\necbt\nkelbessa\nhumanistically\noverbuild\nredeposit\nshanara\nswaleside\ngwac\nkozmino\nprangs\nbouthaina\nostini\npalazzuolo\nkahlow\njadel\nboreks\ncarpetright\nramshackled\nyourslef\nmangieri\nyanli\ncobranded\nchupina\nfrontrow\nguiltiest\nvellis\ndecabromodiphenyl\nrhapsodized\nfiberoptics\ndeductable\ncoreconnect\nbalvinder\ngembo\ngullestrup\ntiegel\ncotorro\ngwartney\nrayyis\nbosserman\npaluska\nmachista\nishmel\nremunerating\npceu\nrivenburg\nplaysuit\nperrucci\ngraffy\nmomlogic\nstyriarte\nviviene\nkumbayah\nmamaa\nlicencee\npioro\ninnosight\nmorcambe\ngarishness\npriciples\nrdoba\nperaliya\nhisi\ncheeping\nshaojing\nfadika\ntrhat\nboute\nquietens\nogalo\nbaraldini\nq
ueensridge\nlybian\nblustein\nltachs\nmystick\ntegnestue\njetamerica\nfukubukuro\nshanoff\nskiworld\nwomenomics\nblinatumomab\nsvartedal\nrovera\nverace\noetz\nhouseparents\nartunduaga\nunderpayments\nsooam\ngrasberger\ntwitterature\nngare\npaperport\nibes\nairfinance\naufiero\nkillalea\nkadence\nkwaa\ninkersall\nnorthface\nsliger\nfontainebleu\njanmohamed\naskariya\ntekeste\nmcleer\ndonabe\nseiser\nstalter\nacutest\nnarmin\nappelate\nproselytizes\nbphc\ntechtonic\nmoledina\ngilula\nberinsky\noutgunning\ncnci\netilefrine\ngalvagni\nconservatorships\nunderrecognized\nmazatl\ntorturously\ncarosella\niabg\nkarunatilaka\ntosson\nifin\njazar\nmigalski\nlumm\ndotta\nbitterer\nremarketed\ngxc\nzijderveld\ndelonghi\nlavita\ncodiscoverer\nincomm\nmendaciously\nimmad\ngalban\ntayeng\nmiquelle\nwoodfold\nelbot\nsiscoe\nfluoroscopes\nvobs\nreseachers\nsouthest\npescovitz\nreesman\nolympico\nweaking\nfultons\nshopfloor\nblackavar\nfitterer\nblackmores\nernakulum\nnaawp\nkalca\ndobner\nbearsuit\nrspp\nanapol\naboudou\naughenbaugh\nfurmaniak\ncanellis\npiggish\nrettew\npalheiro\nphenominal\nurrugne\nbahoz\nbrijnath\nricket\njarosch\nfrazzles\nmauffray\nsospeter\nmortty\nzingarese\nphilou\nlumanu\nsoooooooo\ndaboub\nkelepi\nunjam\nmackenzy\nweely\nguadagnoli\ndolcezza\nleidolf\ngmps\nvitual\ntrafficks\nmilimetres\npodladtchikov\nabhey\ngrappas\ngourna\nwanma\nwoulld\npoct\nsaidman\nyouare\nkazahstan\nscrewcap\ncliton\nwhrrl\nproplem\nbuenaflor\nguilet\nsakuji\nimass\nprouve\nthonglor\nnubby\nthyroxin\ninterpeted\nterrenas\nkremikovtzi\nraelyn\nhaircutter\nsmokery\nrecharacterize\nschipani\ncainero\nmitsis\nimpertinently\natributes\ndebkafile\nplasticised\nxpresso\nscillian\nifsec\nionx\nhpsa\neruygur\nhacdc\nnondiabetic\nterabeam\nfacchino\nimpassivity\nporeba\nignorami\nprimly\nkkbc\nniswanger\nunreconcilable\nresposibility\njacarepagua\nmoonpig\nmediacenter\npaulovich\nmolsons\nwaterskier\nliuhua\nimperilling\nhollyhill\nlevolor\nhaswa\ncosigner\nblackcraig\nizere\nklebanoff\nsme
tters\nsabihuddin\ndraad\nnetronome\nsusah\ndominoe\nrauhouse\nmartitegi\ndeetjen\nadhamy\nmcelheran\ncounterstrikes\nauxvasse\nelimane\nboukpeti\ngogrid\ndeftera\neoir\ncordycepin\nabdullahu\ngasify\nhighberger\nelmos\ncattrell\nfatuity\naesp\nwaterzooi\nidentica\nasylmuratova\ndanjean\nlunardini\ngiardinello\nsararogha\nmhpa\ndepastino\nrubiana\ntanesha\nwvi\nmeissnitzer\nmerieux\nneather\nlubricin\nhohlt\npomanders\nxiulan\nnaragon\ngrizzles\nworick\nlighthizer\nkaraokes\nmuminovic\nmollik\nvillavaso\nkuzuhara\ninertly\npicciola\nusabc\nnaqash\nrenzaho\nbogler\ncareone\nyannos\nshinskie\nfrankurt\ngodsons\nespel\npanigrahy\nchooch\npixmania\nkludze\ndeltana\nrepublicrat\ngregoraci\nboundaryless\nbanic\nstockinged\ndataviz\nyershon\nlungen\nosmanoglu\nafricanness\nunprofitably\npeewees\ncharasse\nilyaas\njoggler\nlinaclotide\nshemari\nadaptiv\nmerrylees\nmulhauser\nhankers\ngoorin\nvoudouri\nchaiten\nsizzurp\nbreeziness\nbrightsolid\nmudhaffar\ncamoapa\nkve\namstrup\nyurendell\nelitest\nchronoswiss\nshools\nprettifying\ndolares\nmcghan\ndabblings\ngrittily\nrajith\nstonex\nsweetface\nlefthanders\ntoddrick\nstableyard\nclaxons\nmonacos\nwalmex\niceton\nranneberger\nlabovitch\ntryg\nnalecz\ninterindustry\nmarlabs\ncafetière\nislita\naniboom\npanzanella\nyusak\nthankyouverymuch\nfusspot\nswyddogol\nsurvelliance\nbancwest\nnoly\nvaratharaja\nnestell\nymck\nchristodoulides\nfastovsky\nmclaughin\népater\ndunkerly\ngplus\ndoumbouya\nslideluck\nchermak\nsheeesh\npaulites\nmariluz\ncecep\nchainsawing\ngrumbly\ngoozex\ncounteraccusations\ngelbaum\nreakes\ncorkscrewing\nvranjes\nunvanquishable\nschoolteaching\ngooners\ntamberi\nbudby\ndearn\nbissan\nstrenth\npreformance\npucino\nunrecommended\nlipnic\nproofreads\nwailoo\ntowneplace\ngoalbound\nbuld\ndicon\nferenci\nballoonsat\nshirkat\nkurlan\ndebrock\ncussin\npfiefer\nqoryoley\nmachinegunner\npaganistic\ncrimelord\nhorejs\ntakva\nkirschling\ndedapper\nderadicalization\nzacar\nfooball\ncabnet\nfishiness\nkoepfer\nnuclears\nb
endolph\nsusanu\nmeitzler\nvenezuala\njerini\ntajammal\nprzysiezny\nshrim\nchompers\naciphex\nprotonix\nzapak\njasmer\ntowl\nclarkey\nbellera\ncontinential\namidror\nskylighted\nhoschek\nnettled\nfalshaw\nhainford\nbamroli\nganss\njitian\nclearcube\nsureesh\nalelo\nubale\nketchups\npattypan\nabudi\ndargham\nexacty\nliposculpture\nrevotes\noutway\nunbuilding\ngyala\ntvss\noceanariums\nnarquin\nmildenberg\nrescanning\nsighters\nanodina\nfunfest\nvergilio\nmaypray\naurakzai\nstillit\nsubhiksha\npharmatech\ntartes\nallatt\ngottschling\nzerohedge\nboriana\ngodesses\nspunt\nengrosses\nredelivery\nrushmer\nalhaurin\ncompartamos\nvalmor\nchikwava\nsooting\nfiggures\natiur\njairazbhoy\nmustiness\ndorint\nsheerest\ngardenstone\nshapeliness\nspronk\nkhandoshkin\nshadoxhurst\nkroening\neventally\nstaticky\nbissey\nvukelic\nbathiudeen\nutilicorp\nuncomely\nmandefro\njewbelation\nyoue\nosteoinductive\nknuckling\nwestthorpe\nvaugn\ncullivan\ndgas\ndartanion\nlongly\ncizelj\ntrimont\ngujurat\nspinsterish\ninexhaustibly\napplicances\nmazombwe\ndiputed\npélisson\ntripath\nhagglunds\nfiorinal\nrhuallt\nquida\nacenta\nbegood\nhursday\nschaitberger\nbroadwoodwidger\nwakanoho\navature\ninestroza\nvestberg\nloasby\ndibbuk\ncfmc\nadlair\nfinching\nudovich\ndispensible\nreplastering\nabjures\nrooftopcomedy\nnetwon\nvandenbosch\nparlevliet\ndecrem\nhkac\nmeide\ngigatonne\ndescibing\nminuteclinic\nshankhill\nounsdale\nnocc\nglowy\nakasheh\ngeekish\nkanetsu\nfaryadi\ningratiatingly\nmadhat\nmalpede\ncrashgate\npostill\ndespensa\nrailworkers\nkibisi\nshinwaris\nnorvasc\nvitriolically\ntavakolian\nseeable\ndesignbox\nbouazzi\nwilcrest\nfaggins\nhorvilleur\npfitsch\nlightposts\nsvaty\nbresky\nlibiran\npaquerette\nopertunity\ncavilling\nhilwa\ngendex\nblankfeld\ndexcom\nsutjipto\nrothholz\nsaligman\nmclatchie\nasfur\ncruch\ncodirectors\nzipursky\nbabeau\ngjetja\nrooijakkers\nkidsworld\nstenske\ndesrve\ninmar\nsmartnet\nhatherill\ntrinlay\nreeti\nahronovitch\nzakumi\nmasochistically\nchopteeth\nesc
ajeda\nflexcube\nmontevina\nvarod\nbaccar\nmanale\ndininny\nyeilding\nsaiburi\nreak\nnoortman\ninnergex\nefraimsson\nkaime\nclaren\nshalesmoor\ndapd\nseasonale\nfertik\nmondesi\nkayford\nlingvall\nhomever\nvolcy\nsirenuse\nunheeding\nnrai\nsuperfresh\nartccs\nzhenghu\nloomans\nantimacassar\nsteinwand\ntrepidations\navascent\ngasten\nossetra\nngwynfa\ndrived\nbloomsberry\nnaiop\nfridriksson\nabuelaish\nalabamans\nvahafolau\nzicatela\nheremans\ncasteja\nactioncoach\nscoobie\npaultre\nmotoshige\nlunday\nservive\nlandeg\nvalvematic\ndartez\nthemseves\nfruman\nkittenz\naircastle\ndiacap\nprpl\nhusock\nmondegar\nisavuconazole\npawlcyn\njonason\ndehumidified\nbleiman\noddpost\ntzus\nnecesitas\nritterbusch\npluot\nkickboard\nzhitkeyev\nilakaka\nbrufau\ncitifinancial\nsoundlessly\ntcis\nrinnes\nitalease\nustoa\ncoatroom\nviracor\ncalvelli\nmystifyingly\nconstableville\nroskos\nastuti\ninnholders\npolares\njichi\nmamadouba\nasessment\nsabatoge\noverthought\nealry\namrika\ndashevsky\nbreden\naverdieck\ntabarzadi\nnvocc\nsamlor\ntamarasheni\nremmen\ntartiflette\nsolzen\ncabindans\nworktops\nnawani\nmcelvoy\nforcers\nwipperman\nsnuffle\nsexperts\nsuboh\ncantonian\nhandpicks\nhawlati\njankowitz\nfurloughing\ntaffi\nvalee\nhamdallaye\nasselta\nshimakawa\nbcls\ndesignlab\nsheepwalk\nbrickies\nimapct\nlunchpail\nsandmeier\nglaschu\nbiodynamically\nkahunas\nfutureworks\nhavranek\ncrottin\ndragani\nsubdelegation\nrabbu\naccordian\njuares\nfebrary\nrowold\nrevisitations\nlatag\ncarefirst\ndesertlike\nmortaud\nbuttonholed\nwallchart\nyummie\noilier\nzenei\nieoh\nkulman\nmatterface\natavisms\nporeotix\nschwarzenbergplatz\njohncox\nmitisek\nklempert\napprehensiveness\nwashability\nserverbeach\nmartavious\nspatharis\nemere\npoggetto\ngreenkeeping\nthiermann\nmukhlas\naifms\nkastanienallee\nminimart\nbronxworks\nlongcore\nconcommitant\nphobes\nzyvox\nlüke\neddying\nstockouts\nrohozinski\nmeiendorf\nvinaigrettes\nlindbloom\njgto\namimon\nblockparty\nbrucknerian\nfssa\nprystowsky\nmcenany\nba
shfully\nmischeif\nbloes\ntrydydd\nziegenbein\nzauberman\nkovalsky\nlifegem\ngibellini\nunguentarium\nkassinger\namerco\nwojak\nhomegrid\nbirkel\nroodenburg\ntechnogenesis\nnonpaying\nritger\nvalancia\ncaunce\nschapen\nmereside\nborbely\nbogucka\ntrounstine\ntransfixes\nlvfs\ngrossbart\nterlato\ncordelle\nconvera\nwyg\nanticrime\nhwadae\nzyla\nomesh\nsehrish\npaktya\njohnthan\nnondomestic\nnesrine\nbadeel\nswda\nmoeakiola\nhiott\nkraam\nokudzeto\noyon\ncamner\neyeghe\naaaaah\nnateglinide\nbreastcancer\nverheyde\ntsiamis\nmobilereference\ntwittered\ncpfa\njeremiads\nkrainin\ninrushing\nsolargen\nredeveloper\nberylson\natunrase\nmettraux\nesbenshade\nparfaitement\nnajarra\ntvel\nampelmann\nquestioningly\nparlamentu\nmayahi\nslaght\nphidippides\njohnmccain\ngrandnieces\ntenerelli\nkoumans\nmenyn\ndilorom\ndurrel\nchear\nzuckerburg\ndietlin\nkloefkorn\nstreckfuss\nimigration\ndelance\nbryjak\nopirus\ntrusina\nravallion\nhafize\ntogheter\nkholoud\ndenery\nbreadknife\ncabl\nceejay\nkorad\nverhees\nbubye\niratxe\nlafranchise\nsychdyn\ntelapak\nchoza\nombaka\nretik\nmargram\neutef\npangandaman\nsynagro\ndaiga\nmufj\niifm\nunderhills\nassabah\nwashingtonienne\nlvads\ntuhabonye\ntraduttore\nradeke\nbhuddist\ncorbas\nuplinger\nnaczi\nworriers\nsmooches\nquintuples\ncloudforest\nbrenzel\nliroff\nbackburning\ncsibi\nsiluva\nshotspotter\nogalde\nkershope\nlusciously\nchicola\nnpfs\ncandotti\nsabyinyo\nwackjob\nteachstreet\nnanosphere\ncoaltion\ntrester\nseesawed\nedfund\nabbamonte\nroseobacter\nnanofibrils\nperkowitz\ncsapo\nelizabetes\nboilard\nsinol\nammendale\nmcanthony\nmismanage\npendeli\ngroggily\nugel\neiast\nfcrn\nsiggers\nsusia\nrougue\nozpetek\ncaiso\nsummerstrand\npakradounian\njetlagged\nburenstam\nsillakh\nsadeeq\nbarbian\nunblinkingly\nsouplantation\nyanovski\ndichloroethylene\ncaerffili\nsteinger\neale\nmicrosensors\ntoolin\nkudela\nmanugistics\nbohnert\nsavoriness\ncetta\ntuzee\nkallstrom\niqoqi\nmaistry\ndeitchler\nrugelach\nbahcesehir\natpdea\nstockcross\nausma\
nabrades\nmesserly\ningushetian\ntianyong\ndurá\nunwrinkled\nlatti\nkinani\nhollahan\njembere\nunpressed\ncousar\nredchurch\nprepack\ncollators\nwaterthorpe\nalhazmi\nforswears\nnavellier\nbergrin\nabercombie\njje\nplaylisting\nvaisman\nhildren\nesmc\noutselves\nrounsaville\nfawdry\nlanoo\nkasyanova\nhymon\nmingy\nuruba\ninsinkerator\ntamagni\nborroughs\ngonorth\nsiwarak\ngillier\nthemslves\nleondis\naccountholders\nzyed\nfehsenfeld\npaltenghi\namms\nmaggiemoo\nbrasswork\nsasic\ninteg\nstornaway\nbeautifies\nbloqueo\nmcwhirters\nfransiskus\nbrucennial\npustilnik\nreckart\ngribenes\nbonitasoft\nbluewaters\ntrygstad\npiegza\ncestaro\npreciosity\nforinger\nbarfed\nalbes\nkurczewski\nbartnick\nthreadworms\nshoebills\nichsan\nsearchme\nbowane\najristan\nguardpost\nfelisberta\nceviches\nzhongjie\nnoodler\nypma\nsterndrives\npolyglandular\nyof\nryanne\nwittingham\nmeliden\nunspools\nmoniotte\ndomers\nslobodchikoff\ngenao\ndedryck\nsirot\ntokarska\nxiaorong\ndharmu\narfordir\nunclench\naspillaga\nbogles\nespelage\nbletchingly\nsteampunks\nassest\ninnie\nhyatts\ngeekspeak\nmulish\nurbanise\nnwagbuo\nmgallery\nkukava\nimbroglios\nnmls\ncastaignede\nringhofer\nmoncivaiz\nchiffonade\npaskoff\nowhali\nanatosuchus\nchimichangas\nferrández\nsereys\nsimontacchi\njasinowski\ntelevizor\nsearchmonkey\nguidobaldi\nnewera\nscolnik\nkogalniceanu\nplent\nlitovitz\nbroadtail\nyiqian\nfomunyoh\ntricoteuse\nsquidge\nbijagós\nemip\nsoother\nwacaday\nkipas\nfolllowing\npensant\nnafie\nsemifinished\nlasciviously\npalringo\npreloran\npyranopterin\nespineli\npomery\nshelanski\nrgensen\nhingson\nberzina\njamilya\nhardcash\ngurspan\nrury\ntoromocho\ndider\nraymone\nmuindi\naryzta\ncrnas\nalexiy\nmarzell\nmaloch\ntommasina\nnirat\nfaiola\nxtradb\ncentruy\npooches\ncongresscritters\nxiaoya\nsnowscapes\nfinvoy\nfaoa\nseyhun\nkiesle\nencourged\nwots\nbégaudeau\nwestquarter\nomnistar\nhazime\nliwski\n\ntearless\nhounam\ngrarup\nneimann\npontina\nunhurriedly\ncritisised\nhireable\nkhoder\npushtoon\nkarn
ats\nmeyinsse\nchicharito\nleegomery\nvincis\nintermediating\njungbauer\njinjun\nalgarin\ntestiness\nmoqadem\ndogcatchers\nsmoger\nterribleness\nphileleftheros\nshufen\nscheirlinckx\nnoski\nshoretel\nspinvox\nmoreleigh\njanyk\nxhelo\ndrivecam\navacado\nlandver\nricharson\nthesmar\nroslind\nmcgrogan\nestara\nwanogho\nfilek\nrealties\nandariese\nfullhurst\nsenyszyn\niaem\nkorsa\npoppyfields\nmukuni\nweymarn\nharbinder\npelak\ngardey\nsynj\ncancelations\niordanova\ndenegrating\nhoplessly\nhdsa\nincompatability\nvsos\nkatami\nashal\nviolacein\njaxtr\nfalkengren\npiacenti\ndenitto\nhodara\nsyha\nbeagling\nkeilty\nexxonmobile\ngrasscroft\nantholis\nchanteuses\nprosy\nembas\nsadistica\nwhitledge\ntrépardoux\nsuperagent\nequitana\nwoertz\nkovanen\njosko\nkunisch\nsidling\nsuldan\ncwlp\ndoliner\ndrevno\nvreed\ncageless\nparamax\nerekson\ntimboroa\nbroompark\ndekoda\neclinical\nfrought\nmisappropriates\npihlaja\nkotecki\nneocolonialist\nloconte\nmarineros\nbartfield\nvillaneuva\ndecieved\nrevas\norstad\nblutarsky\nredolence\nexultations\nbillary\ncousen\npurewave\ncwrdd\nfiering\ngridwise\nsturner\nvietcombank\nnajih\nnewlink\nkracow\ncontintent\nsemkiw\ncvijanovic\nxtca\nbrahimaj\nalvheim\nironists\nnonkosher\ndreda\nehrenkranz\nrevealled\nmillenials\ngimzewski\nmankinds\natheistical\nnetcentric\npixelate\ntermers\ngonave\nsnowie\nbrooklyner\njettas\nsiemionow\ncaravansaries\nbraynard\nbintu\npudlowski\ndisagreeableness\nintrona\nfauskanger\nasheesh\nwaffler\ncellura\npoltiical\nkawalek\nalawiya\nbushelman\nquantez\nshikwati\ncitzens\nskouries\npadeswood\nkibbitz\npecorella\nbeijie\nhearting\ndohner\nreho\nninots\nauthentec\ndemorrio\nmuscly\nzerbin\nsabria\ngastroplasty\nunitedairlines\nhafit\nstarbar\nbonnema\nsaleban\nburgaud\ntechtonics\nbillong\ntelereal\nprivette\ndiflucan\nserviceperson\norganogram\nvytorin\nhealthfully\nboycot\nroadtrips\nahya\nrosenannon\ntrinklein\ncismesia\nmoscows\nkomsan\nlenarčič\nsevinc\nrepetoire\nbioidenticals\nzinch\nlemisch\nlieving\ninnoc
ulation\ngolli\nponsanooth\nwoodpiles\nmorningness\nbatat\nfayman\ndicocco\nincurrence\nwolflike\ncrez\nexpec\nkatseanes\nimpracticalities\nminnig\nmalandain\nsmartsource\nshrubsall\nmontsame\npaulas\nballcap\nswerdloff\npoldek\nlokichoggio\ncdcc\nsuroosh\ntaxact\ndifferentness\nrecind\nvergette\nceladrin\noverregulation\nconvencion\nzenkov\npalmiter\nchênevert\ncaffet\nmapless\ncpff\nvellenga\nmalmer\nearflap\ngarufi\ntarrifs\nsafway\nbachoco\nvkl\nmingott\ncherisse\npatteri\nkrooks\nwalgate\nkaralus\ncrispies\nfrieston\nergh\nmountpottinger\nbachuil\ngenderen\nobbligatos\nmunatones\nwenchong\nheilborn\nmultitaskers\nunhão\nnahavandian\nprefeasibility\nshmendrik\nrottet\necds\npastrick\npoulner\ntumbly\nmorrical\nhanlen\nvogelhuber\ncunth\nkamras\ntraumerei\ncontined\nsammour\nevridge\nguglielmina\nclafoutis\nrozza\nmarshua\nbitonti\npolaszek\nmalinski\ndeltacom\nlaffineur\nglantraeth\nudrs\nputrescence\nvitolio\ndiang\ncpcu\ninbounding\nspraypainting\nfordhall\ntitzer\npenmanshiel\ncedarlane\nstaig\nhodosh\nduchoň\nknockeen\nshindigs\namericast\ncasiple\nhechizado\nsteinel\nculbokie\neconomised\nfaridoon\ncharsada\nacquaintences\noverextends\nedgbarrow\ncavicchi\ntraude\ninterfamilial\nfamilyfun\nmargoth\ncontrolls\ngreyrock\nmelaugh\npalagio\npnut\ncaucusgoers\nbiema\nfrankendael\ndelbeke\nyobes\ngiannakou\nmidlarsky\nmulkerrins\nbdms\nkliot\nquinapril\nschlepp\ndigitalcameras\nstudiedly\ngibbin\nkierstead\nunoffended\nopressed\nmapumental\ncabilly\nhonten\nladyga\nsteenburgh\ntoothsome\neaks\nreadspeaker\ngiantkiller\nbiomechanist\nbernsee\nwebforms\ntarnasky\nthrowleigh\nimmunodiagnostic\npaskett\nrsus\npanayotopoulos\nukec\nghilarducci\nmanylion\njandel\nlenowitz\nbroxden\nspendin\nhtib\nstragglethorpe\ndemobilizations\nkirikkale\nresiliently\nseaon\nelterngeld\nnozizwe\nejeta\ndiscontinuations\nmistras\nloopiness\nrabett\nmhoire\nwhittacker\nmierzwa\ncarabante\nhaughney\nmislabels\nbecaome\nserodiscordant\nalexan\nguac\ncotmanhay\naabey\ndanishes\nrtaa\njapan
imation\nkissable\ndayni\nplottings\nbraestrup\nbackhanders\nskovde\narbitraging\ntiebacks\njenessa\nultraorthodox\napplecart\ntokitaizan\nbarreales\npoujadism\nmojdeh\nhafidz\nmakova\nkatsikogiannis\nblankstein\novermedication\nthirer\nmunene\nxsight\nprexisting\ntalkman\nharbuzi\nsililo\nkhudadat\npodkański\nrcpo\npetcock\ndisinvested\nvinken\nboppy\nveillard\nbrussells\nboobytrapped\ncoevals\nspiegelworld\nundercapitalization\nkropper\nmakover\ndornhorst\nexprience\ngiedd\nchaswe\ntgfbi\npannone\ndropp\npehn\nsekayu\nashr\ntooooo\nliepman\njemiah\nschlaeger\nreroofing\njosico\nnansel\nbreakthough\nmiotto\nwhittick\nmlbpaa\nhypocrits\nbausor\ndenay\nflatmo\nguerron\nwanyoike\nshoretz\nmbodji\ncummard\njossinet\nphemister\nmabrouka\nfagre\nleanora\nwagnall\ncontemptable\nleting\ncarajas\nbarichello\ndunguaire\nprofectus\nbrettkelly\nnicb\nphotgrapher\nmemeti\nibarguen\nvolumn\nhappned\npresss\ncomentators\nemass\ntwitting\nrobischon\ntraxo\nsorrier\nopenzone\ntypcial\npirret\njerred\nmattani\njunying\nemployeed\nportayed\nskoura\ngasless\nwyplosz\nmoulard\nohny\nshelterless\nabullah\njiazheng\nhapuna\njiashi\npelzig\nnetminding\nsucced\nremberg\nkarkus\nkunk\ngalanteries\nassigment\nmmgg\ndechrau\ninnospec\ntoplists\nvulkano\nedmans\ndhurgham\nmiddagh\nzcam\nholstege\ntaxprof\noldhams\nniblets\nmtus\nleeane\ngops\nsteamrolls\nmagliarditi\nwaggin\nardinger\nklesse\nscandalise\nnutmegged\nhohlmeier\nlongways\nhaqlaniyah\npbjs\nmatchwinning\nmeqdad\nassaraf\nmatices\nzeppilli\ngeertrui\nhouseowners\nonyszko\nsubfertile\ncrispers\ngaryn\nuniversalise\nfalacious\nmangera\nprodhan\nwatchfully\ncarronshore\nhoughall\ncanadell\ntabbush\nmelch\nmbom\nnceh\nglucocerebroside\npetrisor\nbunich\ncravioto\nkydes\nmicrodisplay\nrosendorf\nfastt\nbonnee\nmicrofilter\nwaiola\nmirrorstone\nxolair\nsunbathed\nedemas\nprivatly\nllinos\nrashee\nrogak\nschaffter\ncommuncation\nmuffit\nbaylham\naimhigher\nnoseguard\ncadex\nkagisho\nwimon\nstouten\nmasrour\nortuondo\nbrynberian\ntamburrin
i\nwollerton\ndevah\ncastrejon\nserbedzija\nsimonich\nspross\nsequans\nkleanthous\nclickstart\npeiyi\nechavez\nnetone\nexpendability\ndobbing\nwryness\nflexitarian\nhinduness\nadbowl\nkaradsheh\njustfy\ncharoensri\nwootan\nsuperpredators\nmallouk\nklintworth\nallerca\nreappointments\nsackers\nobsurd\npertierra\nbarbadoro\nevangelically\nwoznuk\ncarleon\ncsgn\nwolens\nnativeness\nkneisky\nhewko\nkaigler\nihsanullah\ntalismen\neliahou\nbhuto\nsaltin\nitwas\nnevels\nunyieldingly\nradiolina\nnoveau\ncasalena\nzheijiang\nfollifoot\nmorfogen\nwaudo\nflajole\ndavidovna\nglng\nlearmount\numbilo\nohmigod\nwagenmakers\ncentrais\ncanggu\namcat\nfeilchenfeldt\nsoccerbot\nspreadeagled\nweq\ndazedly\nshalleck\nfuno\nbarkely\nhumoresken\nstiggy\nvelencoso\nbubbas\nstoled\nghairat\nhetrosexual\nghag\nwatanbe\ncoproducing\nmoxam\ndspp\nfeeva\nklibanoff\neurodac\nbernardito\nnewtowncunningham\nfloppers\nrozzell\neveillard\nrassak\nsayyari\nrobomodo\nbulatovic\ngemal\nmuskal\ntulkens\neurofound\ngreenwise\nrajaraam\ninspirationally\ntelefonos\nartemesinin\nteenies\nbalades\nledc\namraams\nanways\nharmoniemesse\nkynt\nzielbauer\nburooj\nadumbrate\npickaninnies\nciaravino\ncandlin\nhurrahs\nwescoat\nkailuan\nactuallity\npollaidh\nnastas\nmolemo\nkultgen\ncommunicasia\ntriscombe\nbapineuzumab\ncastington\nweiting\nxpedition\nsiwak\nbrisland\nhärtl\ncottco\nzhongde\nmassiter\nimouraren\nsomach\npitkerro\nmhanna\nscalpings\nghos\nwhislt\nnayeri\nsmip\nguessoum\nwingfields\ncrotchless\nhoerbiger\ngeranios\nleashing\nnellies\nelastoplast\nsurburban\nsecurian\nschake\nfanfold\nbelarius\nskulked\nseediness\nfordu\nshimahara\nexecutional\ntemporäre\nfunemployment\nshamsan\ncruthird\nbehravesh\ncoldiron\nbringsjord\nbakira\nlierman\ntemelko\noutdraws\nseage\nsainclair\nolimpiyskyi\nnosb\nnneji\nsomersall\nvuchic\nlogisitics\njiyane\nsiutsou\njackolski\ncastenada\nchuqui\nibok\nzigic\nagreat\nsymtoms\ntorton\nobssessed\nleafleted\nexstream\nsynesios\nkybosh\nfadeyechev\nracinos\nsmirky\ntsxv\nfat
mah\nmissett\nmiglin\nremailing\nbusanga\ngronauer\nantifreezes\nroughhead\ndeafens\nscrimped\nfemininely\nhamlili\nmesterolone\nwestenburg\nmedikidz\nsosban\nnonblack\nthrales\nferay\nhoggers\nqurabi\nsenut\nwestsiders\nkulczewski\ngallaxhar\nparaphenalia\nconfutatis\noput\nribay\nkanders\nfurgeson\nlprs\nfunsch\nanatomized\ncorscadden\nredrick\namfpa\ndisler\nschute\nlalmati\nditherer\nlinlathen\nganeshguri\nrunzler\nremineralize\nfulmination\ncorseting\nfials\nfeess\nbedframe\nfascitelli\ngerolymatos\nkhaliqyar\nlansberry\nmeleka\nfrischenschlager\npivitol\ndamanik\nintubating\nychwanegol\nacgt\ngildehaus\nzakharkin\nramlet\nlafree\nverrrry\nelsheikh\ntuttiett\nmashai\nmonetizes\nwdcc\nblazej\nconversive\nprounounced\nmazare\nkaitlynn\nimmmediately\nkamenar\nlaninamivir\npouffe\ndiscolours\nheckaman\ndacias\ngenevive\ntemaki\nsafarini\nhighlining\nschattner\ntawdriness\nmprf\ndaslu\ncanniffe\neufloria\nfrattarelli\nzeresh\nmbada\nlondongrad\nfrises\ntomorrownow\npaulée\necotaxes\nningrat\npachar\nbjornar\nhoarwithy\newni\nbelabouring\nmessitte\nvidattaltivu\nairmiles\njumeira\ncnha\nhomeservices\nlapostolle\nctid\ngrazax\nmarando\nfrontward\ndecharms\niasia\njitteriness\ndodgier\nupswings\nmarinelife\nblachley\nbarcud\nsafeways\nkavanjin\ndonetta\nsatlof\nskipinnish\nbagur\ncambronero\nswaggered\nrotarix\ncheckbooks\nelcombe\nagbank\naegisth\nfiftyone\nchapayevsk\nsobro\nunkoku\nfairwell\nsteinhafel\nheherson\ngluckian\njenzabar\nonetaste\nsportmanship\nlebsack\nriippa\nacomplia\nnewtech\nlandesbanken\ndtto\nrazored\ndegenstein\nandoga\nspikier\nagism\nthtat\nverminators\nmcaleavy\ntemblors\nthaddis\nmaplesden\nmckneely\ncolichman\nsemiotically\nmotavalli\nmillerson\njeeb\nbadenhausen\nweldments\nsisli\nforaying\nszumilas\nprimelocation\npolikoff\nfreezed\nmusem\ngaruba\nvillacarrillo\nbenlysta\nsiopa\nineffectuality\naltink\nddoe\ndeclassifies\nmcclaskey\nnpsf\ncompartmentation\nleared\nsantiz\ndragulescu\nillusional\nmidmonth\nhoodwinks\nbofo\nogolla\nahler\nun
couthness\nzynth\nsuccessories\nlimet\nwhibbs\nfochler\naahoa\nconatser\ntibbermore\nkirketerp\nmaydon\nsaintil\naggreed\nuitsig\ndrumless\nartisteer\nschwencke\ndungee\nnonliterary\natalissa\nswingmen\neithan\njinjer\ncrowston\nprofessonal\nostalgia\nzalul\nodein\ncallconnect\nokland\nkufor\nsolidthinking\nharira\nalliegence\ndejana\nffhs\nkurlbaum\nevildoing\nsupersuit\nhortle\nreflowable\nheenes\ncostumery\nuclear\nlitttle\nspirtual\nsherraden\niakf\nargentian\nshontz\npipperidge\ndispenza\ndramedies\nillegalize\ndamüls\nboroujerd\nokerson\nvinyan\nslah\ndiffence\ntoddles\nperschke\nlangdown\nconsumerists\nllanganates\njealott\noverpacked\nmohagher\nameerjan\nfurushima\nhooplah\nlittlehey\nbenalmadena\nsyrias\nstoneyholme\nalmor\npursuasive\nmoddelmog\nmissel\niwuh\nbatjer\nksbc\ntraffiking\ngorily\nbhoyrul\nomneon\nmamoud\njogmec\nautocentre\nunbreakables\nintermune\nsittercity\nprosectors\nmorganelli\nrapex\nclevin\nitamae\npleguezuelos\ncomposters\nbathie\npolydoras\nmukhim\nselenochlamys\nqcdoc\nrejer\nbahave\narlinghaus\nstylelist\nmulticurrency\nmatuszczak\nmuttathupadathu\nkirschvink\nbogeying\nbaudier\ntwistgrip\njanicek\nchubbier\nsamareh\nbirklands\naricom\ndistastefulness\narnwine\nfawcetts\ntoorale\nmantese\ndannehy\nmomentus\nkelsy\ndisaccord\nhoerth\nnogaideli\nballardian\nlifepak\nlochbihler\nagwunobi\nsuretype\npily\neyrow\nshibil\nqflix\nangelically\novercount\nlasowski\nwember\naunti\nflamming\nproductization\nsantiagos\nhealthywage\npartow\nmistating\ncurliness\nliangs\nsecac\nfundholding\ntribrid\nadeoti\ncandance\nchavula\ninotera\nsciencedebate\nlillehaug\nkirbo\nlobmeyr\nstebic\nweyts\nhajdib\nabdellilah\nroedad\nitabo\nnafri\nshpigel\npresedential\npublitalia\nkeylee\nlisamarie\nclich\ndisgyblion\ngandolph\nperlmuter\ndelterme\nhelgelien\ncolontonio\nhakura\nsugishima\nhovid\npaddypower\nlbbw\nwildcare\nmokane\nbelloto\nsinatro\ndevalera\npostitive\nreethi\nirobe\nchapins\ncornale\nimrpoved\nlincvolt\narived\nidexx\nmatouk\npapile\nventose
\nbaldragon\npribanic\nabayi\neverydayness\npiperlime\nomama\nbornaz\nfootcare\npritty\nmangatepopo\nforne\ndeltav\ndzp\ncastlemead\nmicrometastases\nguipure\nsantamans\nnoncommital\ntrisler\nschutzman\nreneker\nklender\nanung\ncorniest\nneidig\nunchronicled\nvandell\nnashel\nwälde\nfehrbelliner\nskmc\nudink\nsensibaugh\norfani\npoblah\nnikishov\nrookmangud\nmandina\nhectarage\nkhatem\nuncashed\ncaricola\napearance\nooohh\ntimbercorp\ndupattas\nhöweler\nhuntsmans\ngandur\npollutive\nathra\nlovedean\nhankiss\nhandleless\nzimmet\ngropers\nmegafan\nstirewalt\nportguese\ndepolo\nkatsusuke\nbarrella\nsarafem\nmultiaxis\ncloudlike\nmudblood\nabramenko\njohnell\nbatchley\nbosiljka\nleesong\nhydroxycitric\nlipsman\nhypnogogic\nswishy\nhostessing\nangland\narenysaurus\nschnoz\nstrads\ntuaca\nbriberies\nsuperchef\nmillgram\nhallucinogenics\ndooner\npromisses\nbosquez\nerbelding\nantiterror\ndhonau\ncupholder\nxiotech\nraynella\nhoffi\nlambke\nwatchability\nmoark\npettite\nwhitrick\nsmartbooks\nmichter\nmartner\nbromhall\ndeputises\nromiplostim\nbalsawood\nnonlawyer\ngardenview\nscpp\ncunit\nbébéar\nrunnalls\ncmim\njundal\nfrega\nnuvigil\npanichgul\nthermoses\neuroset\nsanzone\ncraftsy\nthurbert\nspinco\nantiheroic\nbocarsly\nfestooning\nghazy\nreprehensibility\ndunadry\nconell\nopina\nbagudu\nlowermoor\nlopsidedly\nchynlluniau\npontfadog\nsidaoui\nsinnadurai\nadspace\ntargetfollow\nzooborns\niskandarov\nchiso\nmeary\nstylista\neameses\napointed\nhaindl\ngorrill\nmulrow\ncroonquist\nexagerations\nlokshina\nproce\npinior\nbisgrove\nwersch\nautralian\nwalaker\nwimples\nsapirstein\njital\nroydell\nezchip\npranee\nsmilowitz\nscorseses\nstroeve\nsumaq\nextraordinarly\nyuanlu\nnonoperating\nsugardvd\nkaufmanns\nhenmore\nbootprints\nparlux\nwheedles\nsocarrat\nmoughton\nbasketful\ngahcho\nkorena\nabusos\ndomicelj\nschanzenbach\ncaveda\nsuhrab\nhelmstetter\nfatalistically\nvivara\nsozzled\nbubriski\nhuette\nedscha\npowersliding\npitchcroft\nhaffadh\ntimbol\naidsvax\nmonifa\nvesturport
\ngravenstijn\nchiggy\ndineequity\nlyubasha\nclipa\nwebair\nwennerholm\nahcccs\nviniar\nguillebon\ncorniness\nzakay\ncitified\npopule\nshushed\npatientpak\nmarranca\nrolexes\nlaplata\nwasteground\nnadca\nlabau\nbosnic\nrabern\nvoltaren\nmaroteaux\ncofman\nhefted\nchalencon\ndisentangles\nuncessary\nninepins\nsalissou\nshuklaphanta\narmishaw\nsohus\nricardas\nchalcraft\nminnesotacare\ngoave\nnetqin\nlowit\nbritglyph\ndidymosphenia\ngcwr\nabellan\ndismemberments\nholtec\nstobe\nfenics\npapayiannis\nundertreated\nbeknazarov\nkoloma\naccordent\nwochner\nsonatel\npodhorzer\namerkhanov\nmetho\nkroller\nuhomoibhi\ncvne\noblas\nrumaker\ncessy\nnuissance\nnotc\nassida\nschwacke\neasterside\nebly\ntapfs\nevfs\npostmatch\ntheraflu\nkeirnan\ndynadot\ncaglioti\nbeuselinck\nhoggy\nsouthgobi\nelecton\noneunited\nkochersperger\nlacoursiere\norganis\nbeinish\nkhushnood\nbrenkert\nbohlke\nsikura\nmoessinger\nlabrit\ncrabbit\nlvpei\nupperman\nchinsky\ncarying\nmetry\nroyana\nlovley\nizulu\nscsep\nmanboobs\nstomached\nincresed\nlamphier\nnget\nlaidi\nstaled\nmisdealings\ncoporation\ndisavowals\nwpz\nquadrilha\nmeints\nliptrott\nprodanovic\npurblind\nhausfeld\ntuggar\ntwitted\nrgen\ncecillon\nvorobiov\npinots\nhomburgs\ncritise\nguajana\nuncurl\nprets\nmedisin\nmoviebeam\nyackel\ngiuffra\nputrescible\nbawds\noccurr\nkimerling\nergonomists\njhai\nfgtb\nkoprivec\nsukaina\nrubberstamped\nuduak\nclankers\nweatherbird\nandreeff\ntrabado\ntianshuo\ncraigholme\nnicas\nlinkov\nmontecore\ndunsire\ndeconsolidation\ntjeldbergodden\nplsg\nmclenaghan\npennyweights\ncorsell\nmoronically\nhaulout\nfemp\nstaduim\npalanquero\ninfernally\nkuhlen\nbrasside\nsusar\nanfam\nunhooks\nbofferding\nalege\nemerainville\netbf\nplasco\nauchnagatt\nelizbeth\nherit\npucllana\ntelexes\nsiriusly\nwoldegiorgis\nshpakov\nmorso\nrakotoarisoa\nzbarsky\ngomolka\ncubanacan\nwodarg\ncontrollership\npasola\nyonekawa\nfebraury\namzn\npmds\nsaviana\naltemio\nschandorff\ningeominas\nverter\nmarawah\nkipevu\nlargoward\nnadirs\nosi
soft\nfluzone\nkutluca\nbrammah\nholdzkom\nunawarded\nhmmmmmmm\nhatzigakis\ncoatney\npulsenet\nbowbelle\ntransdigm\ntinging\nchickera\ninvesture\ncascini\nlozere\nmcgurran\nwiegandt\nwartelle\ncontributorily\ngoldhap\ntroplong\nvaios\nshingly\ndenoix\ngarafalo\nasru\nluxuriate\ncyrulnik\nboconcept\nwizet\ngelee\nclimatecare\nsyclo\nstayne\nmaliyah\nindicatively\nweiberg\nmanoucher\nexcisable\nammori\ndagze\nruthsburg\nsernageomin\nkaixuan\ntwthill\njashon\nwhadda\ncookshop\nndegwa\nbzhania\ncartiere\nlrads\neffeciency\nportney\nfranzone\nchegworth\nschit\nconvatec\npcln\ntsgs\nyaowapa\nwellink\nundecideds\ndichiara\nsamiu\nthauer\nxeroxing\nrotbart\ncloghogue\ndioum\nsomewhow\nbourlanges\nlevrat\npoundsgate\nstudentification\nscholanda\nhiccoughs\nstives\nezcorp\nterrone\nofrece\nlukewarmly\nmelcon\nryeish\nheartstopping\nmedvedkin\nwardani\nmorphis\nnorwine\nfissiparous\nglipizide\nwaveshape\ndolcis\nhausam\nknyveton\ndelaronde\nbiomasses\nreinspection\nshelthorpe\ntular\nohmar\nzaliukas\nsharnol\ngorbatenko\ncastagnède\nphins\ngandah\ndahei\nnonexplosive\nceradyne\nennon\narrendondo\nmackilligin\ncaptively\nsamnick\njagodowski\nballinluig\nbioproduction\nzelek\nvillaldama\ncompaired\nvamizi\nlocklair\nnovacor\nbeable\nflexitime\nclimbdown\nondeo\nkrup\nunflappability\ngeotech\nprokurorov\nanassa\nsubagent\nmultibillionaire\nbogend\nunobtrusiveness\nmonopolises\nnfrc\nmeterologist\ntearstained\nrosstat\nmesiano\npolkomtel\npurry\ngoulou\nscorcard\nirshaad\ncogifer\nbeatboxes\ngeoeconomic\nespandi\nfsrc\nllanellen\ndemane\neureko\nperigueux\nlownie\npharmasat\nsosnovskiy\nrootmetrics\nbarlev\nitit\ngottsegen\nshimrit\ndhanushka\neurovans\nclaborn\ndishonours\nwoddis\namouee\nbozzuto\nkoreng\nsmfg\nyipping\nschlaug\nroher\nmuilleoir\nbannering\nschaler\ndauterman\ntrewby\nincretins\nsugru\nwerleman\ninvincibly\nzachares\nnalukataq\ncommisions\ncarrard\nshoutfest\ndreazen\nszymkowicz\nskaret\nigoeti\nracegoer\nkabasha\nmccadam\namundi\ncoalson\nrossan\nshearsby\nvocif
erousness\nunfelt\ncianfrani\nniave\nharpooners\nkohll\nalkqn\ngicanda\nittersum\nblaemire\ngunhus\nmisuser\nokadas\nraharjo\nkrener\nbemusedly\nmejide\nbatkin\narlinda\nballat\nasselah\nkeyfob\nrafaat\ntobolski\npaletas\nkorissia\ndereon\nweaponise\nabbruscato\nhilburg\nollar\ntoloache\ndulkys\ncaragol\ntrafficmaster\nionises\ndivorcées\nbonusgate\ngrafer\nfairyhill\nlaracy\njiggered\npaellas\npallinghurst\noutcall\nshedloads\nsuppy\njalolov\npromethei\nreinflate\nalperovitch\nzutt\nyourselfs\nwfda\nhinam\nthingamajigs\nlecjaks\nstroehmann\ndirectconnect\nhettleman\narmann\nnkusi\nvolubly\napalara\ntwanky\nwintuk\ncorjuem\nbokke\nchiselhurst\npatdown\nlannoye\nrizzetta\nproshares\nkaymaz\naidh\njackstraws\ntomou\npakem\nkamula\nnewcott\nnhmf\nformworks\nmonyane\nistanbulites\nluxuriates\nperneczky\nrhapsodizes\npoetsch\njosifovski\ntipuric\nbraywick\nsominex\npennymac\nbusari\nminatory\ncloudworks\naloul\nnouredine\nnarcissistically\nmesmerizingly\nsportvision\nnmvtis\nsuvaddhana\nhifn\nparboil\nmworia\nspertzel\ndesset\nscaringe\nyounoodle\nwardhigley\ngaranzia\nmulchandani\nlizet\nshaftel\nrunelvys\njelveh\nfinansinspektionen\ncompetiting\noakleys\nspitale\njawarharlal\naarik\ntabat\nbosomy\nkazuyasu\nnassiriya\nchilcoat\nbidil\ntininho\nfirstsource\ndefering\nbreglia\nsaimo\njaumont\nostracisation\nbramscher\nhyperwords\nbidis\ngerisch\nhaskoning\nimplenia\nashkenas\nvoiceage\ntranser\nninties\nbukharians\nmisstepped\nwalizada\nelvert\ntranscervical\ncheny\nsoffritto\nphandu\nmiltiary\nnagyvary\nkaywa\nizea\nlinkscanner\nsorrillo\ncrailar\npayden\nstephanson\nmcjobs\nuekawa\nbrayboy\njeacock\nnoppharat\ndaglas\nzevalin\njazmines\nbusho\ndewaldt\ntegwen\nbischitz\ndonike\ndeshka\nflashily\ncavaney\ntannura\npaperno\nnevas\ncablecam\nspeakerbox\nloncaric\nthonged\nablity\nzimansky\nruchazie\nnestegg\njuakali\nmihaka\npluming\nmckinnely\ndetemine\numida\nkerasia\nbreidbart\npaskaljevic\nmerrows\nchameides\nmigente\nkhoshchehreh\nguojie\nclubmakers\nmalhuret\nwebid\
nroadsinger\nharnar\ncrisping\nrogaciano\ndowload\nkanok\ngrca\nphonelines\nmcilwee\nmotasim\ndickoff\nsiutation\ndildarian\nomeje\nkakarala\nboci\nbeacopp\nihss\nnattily\nsedmak\nshanay\nobservantly\nawvee\nkanavas\nezen\nrecalculates\ndegloving\nmarusa\njewna\nshyrone\nbiffer\ninfobel\nairness\nnewlook\nsadaoui\nvickilyn\nhydis\nvaratharajah\nhimbo\nfeshie\nquetame\nunmannerly\nowiti\nwazhma\nbittles\nochola\nradmacher\ndehoff\nloglogic\ngravelles\nincudes\nyodeled\noffill\nqalibaf\niaslc\nincased\nrowenta\nmoosman\nhermitic\nactivevideo\ngiveback\nchissick\nakea\ndezheng\nhupfield\ntoccare\nradiosurgical\nsuperglued\nviriginia\nshiftiness\npulqueria\nmomah\nvillainously\npiscatella\nkeker\nanacomp\nespenshade\npuckhandling\nkomag\nmaoying\nultraliberal\njavanfekr\nundergear\nundervote\nstonhard\npupus\ncasgliad\nluperon\nchalkface\ntseckares\nrivenbark\nchaobai\nniklander\nbedane\nhenthorne\nprotzman\nutsw\npisor\nsnooth\nqios\nginman\ntirelessness\ntwse\nnnoc\ndisinterring\nchidlom\nhongxiu\nuntrainable\nzvecan\nwibbling\nphonautogram\nwhingers\nnerdiest\ndemirtas\nsollano\nairhart\nwaterpartners\nplavnica\nfillinger\nsimilaun\nanqiu\navalons\narchera\nnavetas\nsusham\ntwiddy\nalizada\nteviothead\nlanehead\nsteris\nmekelburg\ntgscom\npapacostas\ndaith\nshimmied\nbitgravity\nsurftech\nkostunica\ndraut\nscandel\njudicialwatch\nmongbwalu\nriffy\nbalakhnichev\nrossiiskaya\nreaganites\ngavello\ngopie\ndiscolouring\nhoelter\nbackslid\nshaylor\nfatimi\nmeltz\nhumanscale\nshisler\nmiklowitz\nringos\nwaitressed\nrecinded\npensiveness\ntojam\nsenagalese\nradziszewska\narelis\nforcasting\nkosslick\nrisikko\nnxy\nmdba\nfindout\ncuisiner\neuropia\namsf\nmirrow\nfinancialisation\nagho\nadtr\nfidelco\nlawhorne\ntraicion\nabramashvili\nbosti\nrockstroh\nchalal\nodsts\nneuroprosthetic\nbrcm\nhartunian\ncattet\nburnhead\nrachford\nhondura\nmuchlinski\ncustodies\nwristy\nvancouverism\nequest\nauchengeich\ncocain\nbuether\nabeche\nskeate\nmulva\nwhitecrook\nortutay\nfilsham\nismaei
l\ncemagref\nnedge\nshatzkin\ntrabing\ndiedrichs\nzadravec\ndeggans\nbtwc\nmzinga\nkastens\nstormonth\nstenches\npedastal\nkrabbenhoft\namortise\nchipolata\nyashim\nhisse\nmaduekwe\nbramanti\nsarangan\nschable\nshreves\nbukowskis\nfascade\ntheirry\ncidem\ntorek\nchiangs\npritzkers\nstrokosch\nrewane\nspeechmaker\nbrunis\nhavce\nkinkan\nalléno\noxyglobin\ndevonald\ncrippens\nfalsetti\nvalyermo\ntuani\natvm\nsteamrollering\ndabbl\nkondek\nriddalls\nbarnholt\nbenty\nmeyrav\nsorice\nkasala\nkamiz\nverbij\nkndl\nartiness\nsarur\nfrissora\nmahawil\ndoeke\nweisshaar\nchurlishness\nredpill\nethnicly\nbamali\nmouldable\nprintex\nsupershuttle\nguehrer\nnanofibres\ncrowly\npanjsheri\nsheepridge\nnapali\nshenfeld\ncharruas\nwyllin\nbciu\njustham\nunsaveable\nhuntsberry\nesnc\nnightlinger\nhanessian\ntreworgy\ncontiue\nshoff\nbaroquely\nstonewashed\nwanas\nkatine\ntwot\nafrezza\nfoundem\npushta\nkenedi\nmoulinot\nstewy\ngekkos\ninterplanted\nliviero\nwitsell\ngregorka\nwatai\ngchat\nscalvini\ncompletionists\nbaztab\ntichaona\nkahol\nbeilke\ndommer\nkfy\ncharata\nmakim\nkreole\nrockler\ndisapointment\nravensbrueck\nelniski\nvalena\ntimesselect\nbutties\ngratta\nmathiang\nlehmon\nsarhat\ndunka\nmimodrame\ngodwulf\nboutcher\ngultekin\neurocare\nhangnails\nsuperbreak\nwindspire\ndamber\nraffone\nvideomakers\ndesirables\ntriacetone\nyakes\njonae\nhingerty\npaerson\nfolksiness\nzukang\npubco\nnbcf\nbeiges\ntyrannize\nstrathewen\nmarper\nharjono\nskywatchers\nhartsuch\nsfogliatelle\nmouneer\nbedder\ntowsey\nkostry\nindect\ndancemakers\nyoulgrave\nsnowbarger\nmesilate\npinau\nedwords\nrizkalla\nmaintian\nbrainwork\nmistatements\nfamiy\ntehranian\nturjeman\ndeyanira\nhustead\nkoltes\nbaillères\nairfast\nfirstperson\ngategroup\nbtop\nkingara\ncobrador\nlespinas\ncomputrace\nzierold\nsephaka\nschmutter\nbuckroyd\nkhazakhstan\nsayari\ntollfree\nmilibands\ncyberdissidents\ntrockener\nconnaghan\nclimed\nalisande\npressurises\npltp\nmartime\ngustaff\nchouly\nminnet\ndaitch\nilness\nbaifu\nange
lsey\npredraft\nnicklen\nunabombers\nfilostrat\nalhabib\ntrashier\nsawblades\nocansey\nprindiville\nsitski\nbenot\nbierenbaum\nbunkbeds\nhenington\nghosties\njunilistan\nsoliver\npwllglas\nkovals\ntsemberis\nmeidinger\nguloien\ngrixby\nplaypump\nkatelan\nkalentzi\nhosseiniyeh\ncrediblity\nméheut\nprevx\ngoyens\nsomohano\nletzig\nyurchenco\nsroc\nopensaf\nbandleading\nkeppens\neleuthero\namerispec\nwinance\nhelwing\njavani\nmonologists\nharouni\nresplendence\nimdr\ngurkhali\ngecina\nmucheke\nmaheras\nbaqaa\nleaners\nheadful\nhawijah\nwhovians\nzione\njardiniere\nrobon\nrotateq\nkamistan\nsadeghiyeh\nwebisodic\nmernier\ncoronell\nbrunache\nstrutts\nsumanthiran\nolevsky\njazouli\nnubuck\naflutter\nvelmanette\nloukoumi\nheddlu\nmuthui\nbundeskartellamt\nkatzoff\nnikkita\nmueser\neeks\noystrick\nmonotonal\nvishing\nsedyaningsih\novercorrect\nfrite\nruddier\neisenhour\ndocumentarists\nmollycoddled\nsoussa\nchabanenko\ntjaart\ndisquietingly\ncominsky\nbirdfeeder\npostimperial\nklusman\nhanikami\nhostless\npécoul\nemulsifies\nyuwei\ngarrahan\nschiliro\npetersohn\nwitth\nhoussami\ngreifeld\nbernita\nchimonanthus\nminshaw\nbillye\nfreezable\nwhitus\nsimer\ndisjointedness\ncourey\npunal\nwatahomigie\namenoff\ntooted\nlowfat\ncaadp\nupdrs\nbearnaise\nelal\ndpas\ndesme\nsadoughi\ntanowitz\nhuitt\ngrillner\ncarcelles\njekshenkulov\naxela\nbaskent\nthonis\niseh\nhoardes\nserebriakov\nganier\nhempsall\nkaheawa\nklandorf\nnegociants\ncuresearch\naaoms\nthicks\ntroytown\narimidex\npeare\nmagaoay\nspece\nkampela\ncerling\nsalchows\nabcarian\nterjesen\ndeadpans\nmiraculousness\npaamco\ndiscursions\nduruer\nschuemann\nberocca\nmawete\nbanro\nknoxes\nteligent\npoonsawat\ncoombefield\namobee\nritchiei\ndistrubing\ndeservingly\nstength\nsupected\naggrevated\ndullsville\nchonqing\nexecustay\nsonnex\ncorcione\ngazillionaire\nvisioneer\nmichelangelos\nsuperstretch\nluchina\navici\nnighbert\nscurge\nchimalhuacan\nasalache\nlimeback\nactwu\nkiffer\nchieftans\ncopius\nsecretay\nlearnedly\nservic
epeople\npresentiments\ntheallet\nlechers\naveratec\nfiresale\ngcib\nbaghaei\ndildoes\nborgonuovo\nslobbers\nwmzq\nntshangase\nwelshwoman\nlaverents\nrepriced\ngeminder\nchayevsky\nbugli\nlinzie\nburc\ntrigilio\nkrisjanis\nglanrafon\ncctb\nhardhearted\ndaood\nbeczala\nambastha\nslimly\nwaltho\nkwakwa\nrematerialized\nnjitap\nangellotti\nkhanzir\nmilliron\nlinsdell\nbenbihy\ngrimsditch\nkarweta\nshreffler\nbarberosgerby\nphio\ndefrosts\nsatsukawa\ndresslar\nsekela\nbmts\nergonomist\nklepac\nsyyn\njuryless\nnoluthando\nabdolfattah\nievoli\navamar\namiad\nkoplovitz\nvollweiler\nsilverfleet\nhadjigeorgiou\nbuffalonians\nrobedaux\njermareo\npretreating\nkobylanski\npeptoids\nfalkand\nhtere\nspbu\ndiyana\nmarvez\nmayiga\ntsatsouline\nplacatory\nraghda\nskystream\nmashkel\nnadelson\nhaushalter\ncapiche\nmilkey\nnrtc\nbargewell\npizzoccheri\nebike\nlongframlington\nperfumier\nrojales\npoliticshome\nsnsc\nbirdfeeders\nfutacs\ngwenan\nalllowed\nnematzadeh\nhalsdon\nsitagu\ndurrants\nadkerson\nmarsot\nhyperdrama\nngaus\npozder\nrefridgerator\nzarlenga\nhakhnazaryan\nexecutability\nhaibao\norituco\nsnowmasters\navelox\ndumenco\nnonpigmented\nbrégançon\nmvule\ndawnnews\nnitzanei\nlightsome\nfanselow\ncaseback\nbenca\nfhlb\nkhaiwani\noraha\nlaiyan\ngenous\nknec\nazzaman\nnewbuilds\npcpt\npoobahs\nfeatherbeds\nbulldozes\necoterra\nstruckman\nwitlin\nvitacress\ngharu\nkatula\nservidio\nerdkamp\nveitia\njakrapob\naccrediation\nretendered\nyouna\ngenae\nunvested\nrongwu\nnicita\npolitcally\nmcaboy\namergen\nbanrock\nscmm\nhamard\ncoltish\nhusselman\nalworths\nnativi\ngartocharn\nspetman\nprudie\nappdynamics\nquestair\nllwyngwern\nnevisport\nlionore\nmazelike\nswoger\nreclusively\ngrotell\nsaverton\ncondop\nclevert\nkolokoltsev\ndirecter\ntrayner\nfaqiryar\ntakaratomy\nmajimbo\nteensiest\nprty\ntacopina\ngristly\nconcequences\nifton\nlaraza\nstenico\ncansecwest\ntilp\nyeghern\ntehilim\nkomarkova\ntaec\nivoclar\ncermis\nsuperinsulated\nmigoo\nmillivres\nburrer\nspievey\ntelasco\ncheiri
\nsewai\nunworldliness\noutgain\nseyfullah\nribagorda\nvivadent\nparps\nvalainis\ncommunties\necybermission\nhypermiling\nrovnag\nimmobilisers\nsefydlu\nlatrepirdine\nluveniyali\nsnakebark\ncurioser\nskittery\ndementedly\nbukiet\nvanisha\nredistributors\nbalsys\nwineke\npsychemedics\ndrabu\ngheal\nzamen\nculican\nsannier\npurja\nmaniruzzaman\nhomechoice\ndrepanid\naguis\nenyce\noversexualized\nkuboya\nthaller\npetosa\nmultispace\nmalandros\nbobolinks\ncircumcisers\nmortine\nzangabad\nsubserviant\ndevolutionist\nswalmius\natryn\npyrotechnically\nkirghizstan\ngracelessly\nmespil\nvignaroli\nmutualisation\nharguindeguy\nsexcapades\naaid\nvirut\nshaodong\nluukko\nroduit\nkojola\noultram\nfaulknerian\ntyser\nfinlinson\napostasies\nindividualizes\nsmotrycz\novnand\nkaprawi\nkunavore\nnakayamai\nnadaf\ndjurgarden\nmaleli\nunty\nrelucant\nthomert\nfairisle\nshirttail\ncwtch\nanykind\nhickham\nsympton\naanerud\nundisruptive\nautomony\njokerized\nethnikis\nforbort\nmandarake\nintrepidly\ntesoriero\nnewsmarket\njorris\ndriton\ncamaradas\namstrong\ndisaproval\notion\nreoffended\nlavanderia\nkreibich\nsmithfields\nfountainview\nfootaction\nprotuding\nmelal\npulverizes\nlevram\nmollifies\ncompeling\nunbelieveably\nclariss\nwangers\nwalkowicz\ndixe\ncholitas\npsncr\nepit\nclearstone\npolacheck\nroils\npalal\ndaytrana\nlazerow\nforpadydeplasterer\ncareline\ndaybeds\nsomebeachsomewhere\ndingers\ncruiseliner\nosuagwu\nhojilla\ncausations\nwallbanks\nnilaja\nrossinian\naccesibility\nrepotted\nparfumeur\nzhovtis\nmuldersdrift\nstoved\ncianciolo\nkirnon\nhelinet\nsarova\nchiroto\noleaga\njouir\npulgram\nmelott\nkabayeva\nrunnign\nabasse\ndipnote\nlunacies\nschnuck\nrullman\nshimmen\nreasses\ngrimier\nnovatec\nsasses\nairflite\nnatano\ngmmb\ncharrisse\ndelanoe\nsprogs\napplaude\nmcclarin\ntealeaves\npiry\nrorrison\npressenda\nvoje\ncentry\nmccurrach\njarrel\npennybridge\nostroy\nvardzelashvili\nsnda\nreeboks\nspinningdale\nsoilders\nvivify\nmainsforth\npolypills\nachieveing\nfenjves\nlob
ack\ndyrs\njunc\nangrick\nmurabito\nguarguaglini\nibovespa\nkrepps\ngolomt\necuadoreans\nshatwell\nmogil\nyueting\nfamilylife\nseparovic\npolymorphously\nseliana\nwarier\nbrethern\nrnla\natheltic\njohathan\nservicos\npanelized\nmarentino\nvorobei\nmakhosetive\ndhandwar\nweihl\nmocospace\nceragon\nbeared\nheadcovers\nsjaelland\nkirkush\narticulateness\nrelles\nlitomerice\ngazidis\npantsless\neritherium\naerus\nbirinyi\nthefunded\ncyberpatriot\nwebste\nregling\nruthledge\nbunnytown\nlowfields\nchapagain\ntaichman\nzucchino\ndanuel\nlibral\nlazek\nmidprice\nmblox\nderraugh\nhehman\nhambastegi\ntangelos\nhernadi\nentrepeneurs\nboontje\ntouria\ntelenova\nsnoeren\nsukhee\nguttered\nikitelli\nwellywood\nuninstalls\nnonconstitutional\nriesbeck\noversizing\nwithut\nsilbertanne\nhaimoff\nphonagnosia\nwedad\nknovel\nperserverance\nafflecks\nbackelin\ngolau\nlorit\nindependentes\nshankaranarayanan\nlezaun\nhymietown\nimpove\nmedicinema\npartrick\nselesnick\nkachroo\nmmrf\nnovacare\nunfarmed\nbabida\ncryor\nbakshish\nbetzler\nmimun\nblatchly\nbangrak\nbessies\ntransdniester\nmandsager\nswsi\nrancilio\necons\nenviga\nsireen\nʼa\nwluml\nbouzigues\nlohuis\nrenationalise\nhameli\nsäumel\nmiluska\nwarkentien\nbethelehem\nmekhennet\nbrooksie\nstrepsils\nshibas\ntalibanisation\nensconce\nhioe\noverturf\nerikka\nbalendra\nmiscione\nunipublic\norangethorpe\nchemoradiation\ninkombank\nmudpie\ninterfereing\nstaretz\nmolberg\nmarif\nchebundo\nzekic\nbejeezus\nlipkins\npiggybank\nemmanuello\nsambili\npetojo\nspeea\nbarzón\ndeleg\ngillinson\ncardiocirculatory\nwoelfl\ncagen\nemoted\ncloude\nriccetto\nstancioiu\ngaspoz\naledort\ncavadas\njhawk\nelevenfold\nguetersloh\nshirwa\ngerardin\nsukau\ndrumadoon\nelkhonon\nlotuslive\nlopsidedness\nconstrutora\nkirkfieldbank\njonik\nroters\nscorchio\nsteigerwalt\nidelphonse\njersusalem\nmusikvergnuegen\nleichty\nenginyers\nqadissiya\ndreariest\nklamt\nmusion\nyacaman\nnovellos\nfighing\nmigraineurs\ncavernoma\nlinel\nkatsas\nhamiltonsbawn\nallpoint\nmart
ignette\nshaquana\nundiversified\nresmed\nshirring\nhesses\noffwell\ncrimint\nsportello\nintergrate\npoisonously\ncabretta\nsaidia\nkhh\ncollatoral\ndesultorily\nbenecken\nwanvig\nntera\nfreakiness\nnexuses\nunformulated\nhmtd\nkorakas\ncokehead\nincorrigibles\nplutocracies\nzhenrong\nbussanich\ncolombiere\nschnook\ncyrpus\nheddiw\nussia\nkanouni\nskelta\nadolescences\nsprayings\nreferans\nrassman\nwiebel\nbradnum\nchortled\nleccisi\nrephotographing\nvukmirovic\nfearmonger\nsearchengineland\nmulvenon\nlodovic\nspeedone\nrhosesmor\nleighann\nprough\ndelouise\nnacom\nferanec\nplepler\ngamier\ngreyshkul\ncalpol\nabhazia\necodynamics\nprovenly\nsafod\nmolea\nbeverland\nkulveer\nlukefahr\nunshowy\nkahlefeldt\nndoka\nnesper\nleichtman\ncohns\nlupito\nsnowling\nsuhartono\nsmicer\ncasualwear\ntinderboxes\nhakawati\nrustiness\nexploris\nakkawi\nwitb\nolefsky\nroadcast\ndenoke\nsarops\nghaddafi\noverprivileged\ntrachtman\nzhongren\naiusa\ntranscendently\nfarassino\nepoxied\nncfm\nodigram\nblytheswood\nelswehere\nclendinen\nknujon\nbeistline\nkimemia\nbouclé\nhistiophryne\njuvéderm\nmetabo\npalapas\ntillydrone\nglaxowellcome\navdic\nwillbros\nfabulism\nkwiecien\naquatically\nhtsql\nnighmare\nprotopappas\neveryblock\ntafazzul\nguogang\npelf\nsecci\naquaventure\nrcma\nasadata\nophelias\ntodra\nsiteminder\nafjp\nkalyx\nisenstadt\ncostil\nrart\nbonert\nuniformisation\nzanclea\nmujava\nchalat\npuzzo\nmeniscectomy\nkongsgaard\nbelshe\nwitteles\nwonderbox\nhegele\nbessard\ncoûteaux\nwujiu\nresponsibity\nmanirumva\nbuliding\nthordur\nhansler\ndarkhovin\nwoggles\nnabokovian\ncapinordic\ninformat\nbcrf\nralling\nffis\nshedders\noldhamstocks\nheadscarfs\ninconsequentially\nwernet\nabex\nbarsocchini\nnaqba\nmicroemboli\nbedie\ntelemental\nrutner\nmaryanna\napetite\nzamal\nshadchan\nsenegalais\naaiu\npanasuk\nodhav\nfionda\nkauch\nsponsler\nnesbitts\nanaky\nmalleny\ngannex\nnysschen\ncitycentre\nlambells\natwoods\nzourab\nsleekest\nvrbo\nworthit\nsandate\nnonrenewal\nhurriyah\nhertle\nshee
nan\nupturning\nciljan\nelps\nunpoliced\nfabulis\nhuwaida\ndruglords\nslyngstad\nmittelos\napim\nsulaymon\nattunity\nepocrates\nnueske\nfissette\nipoa\nfrence\nmetrostage\nlimm\nvizinczey\nweisert\nmiriana\nochr\nsalcito\nmacconnel\nrbnz\nnozoe\nbandimere\nmusawah\nlikierman\nbellanaleck\nrecharacterization\nmomsrising\nbudreau\nhelness\ntideless\nschuffenhauer\nblumgart\nvinchuca\nzuli\ntopstar\nfillin\nlungful\nsoulet\nsandycroft\nwtge\npseuds\nhospicecare\ncharneski\niming\nsharlie\nknewstubb\ndebting\ndaliberti\nsurfboarding\nhimpler\nflateyri\njajoo\ndrevitch\ncarbro\ngreendown\nloneragan\nphedi\nbruhier\nnoninteractive\ncalao\nsoaped\nocularist\njetbook\nnercwys\nkaniewski\naftertreatment\nvieled\nhulkster\ncapcities\naudur\nreferrred\nmotorheads\nmovlud\nbambuck\nleastways\nkaczala\nosotimehin\narpeggiating\nkronplatz\nnajid\ncorbat\nbruntland\nimafidon\nhicheur\nintercivic\nbpxa\nmoini\nhorrach\ncnst\nyoakley\ncandella\nrandock\negprs\ntenbroek\nimperas\ndetoxes\ncloar\nsegolene\ngowkthrapple\nshotmaking\nperambulating\ndeliquency\nnicolussi\nharebells\nconsultas\nhoutan\nunharmonious\nanatomised\ndemattei\nwanblee\nfilimone\nbrzyski\ninforcement\nnhang\nullger\ntinside\nacknoledge\njuvederm\nsefydliadau\nmalmoe\npalmucci\nmaune\nkeyrouz\nceccacci\naggreement\nyankilevsky\ntruva\nmasawi\nlaughers\nlumene\ndrainers\ndannenbaum\nbbpa\nbiotechs\nbouys\noqab\nbacanovic\nswiger\ngoldkorn\nprocyk\nheisted\nfingerworks\nstenfors\nxuesong\ncvas\nthrowdowns\nviengsay\nmegadroughts\ninsituform\nsluppick\nbiache\npapaj\nclaudon\ngonwa\nfamelab\nclites\nubhi\nnhma\nazelle\nbodysurfer\nmuxia\nbaisch\nrospars\nicandy\nhublin\nzhongjin\ncarletta\nmelican\nfigleaves\nmgpi\nrasheda\nstepmum\naxam\nbutchard\ndaudel\ntchama\nusuing\nopenmindedness\nrodah\nharbon\ncorsello\nkerschner\ncantoro\npattisons\nsureka\nremoulding\nomlts\njamine\nboubyan\nblogrolls\ndeelites\nvijaypat\nhackitt\nmorogiello\nfelliniesque\ndfis\nhartdegen\nmoïsi\nbadula\ncommisars\ncosmit\nbicetre\nbjorge
n\ndarkmarket\nnadimi\nmaingain\nseldane\nlodell\nkwalia\nbensaid\npuigdollers\nstraatman\ngobbets\nsseldorf\nlinyin\nkazek\nfsps\nparwin\nhasanain\nbehura\naffort\nreleaf\ngoldensohn\nwygal\nimprogo\ntredre\nkovykta\nhakapik\ntownsen\nmckimm\ncitco\nsvrs\nsugarplums\nunglamourous\nbenfotiamine\nshawqat\nendarkenment\nbarroway\ndenae\npodkoren\nadsafe\neòrpa\ntweakings\nraciest\nolaberria\nshamefulness\nmaeue\ndoletskaya\narifur\ncarbeth\nvergallo\nmakombe\ncoleburn\nreithian\nmbdc\ngauzès\nkorsrud\nbrévent\nsangki\nanaren\nbeansprout\nastani\npenparc\nmuhaisen\ncarpooled\ntickly\neressos\nkostelecky\nkinderhilfe\nsoumana\njüngst\ntroys\nporpose\ninviduals\njouvencel\nshibly\nffls\nfrieson\nmedtral\nswanier\nbiscaia\npairoj\ngriesbeck\nlubed\nchandratillake\nchrysographes\ntrashiest\nlaviola\nendodontist\ngraysmark\npolystylistic\ndisipline\nglute\nmcdougalls\ndovidio\nalcorcon\nspitty\ntimemachine\ncoquettishly\neasypaisa\nadjame\nparette\nrevoltingly\nabdusalomov\nsabathé\nwellburn\nunfulfillment\nunshackle\nmosstodloch\nleonera\nquesenberry\nsightscreens\nshamefacedly\nkantra\nrengen\ndareen\nmockel\ndaffer\nbaseliners\nxata\npopps\nexcessivly\nautarchic\nsouthernlinc\nreaddressing\ncannolis\nvitolins\ntthey\nneuroinflammatory\nbackburn\nvucevic\nunstarred\nhelpmeet\nmwakasungula\nlarcs\nhuxtables\ncounry\nlievremont\nswicegood\ntimurziev\nleeker\ngeurin\ninnerwear\nramsier\ntoylike\ncarholme\ndostoevski\nhanefeld\ngwec\nmcalexander\nbunscoill\npeebler\nlunasa\npeatfield\nrorys\nhankerchief\nmakfax\nzalika\nankaragucu\nmarkdowns\nooida\ntoughies\ntorbothie\nburmania\ncommandingly\nharyasz\nsiepr\nalini\ngunchester\ndodginess\nphrazes\nmagomedsalam\nbrisenia\nterroris\nklibanov\nexcells\nedradour\nfdep\ncholesterols\nqardash\nclwr\nprocessess\nhedgefund\nderogates\nnrfc\nshehong\nzania\nbertolaso\nglanaman\nnavle\nkilmahog\nfourviere\ndesses\nzpmc\natrianfar\nhogeg\nboethin\nunwillingess\nroadloans\nsmolkin\njellis\neusec\nmutakabbir\nglindon\ncaerwedros\nkignoumb
i\nrecalcitrants\nrhubodach\nbitrix\ncontrave\nhoshinoya\nseawatch\nkusuhara\nalgenon\ncorobo\nwhitebird\nwhiteys\nobair\nskelp\nmonumentensis\nlovvorn\nposiva\nrafd\nklingstubbins\nultimatley\ncontagiously\nminstry\nknibbe\ndevido\nnatika\nthangarajah\nbrackenfield\njingly\nduncarron\nfaley\nloeffelholz\nerha\ngilbeys\ntenden\ncauterisation\nkurkul\nkathee\ngutberlet\ntsurikov\ncaduet\nigancio\ndeicer\ngraybeard\nassocs\neglu\ndifficut\nelectrosmog\nardana\nzeltiq\nmelquan\npentecostalists\nawkwardnesses\nconflct\nbraillenote\nsevki\nslippered\nloadspace\narmagan\nautosuggest\npotstickers\nparasaran\nbryanna\nvolkhard\nloipa\nshabbier\nbrinkmeyer\njccf\npregancy\nsurgury\nkayelekera\nhapis\nheavan\noref\nmouin\neaccess\ndalreoch\ndibattista\nkuhfahl\nclifts\ndamario\nscrounges\ncéladon\nwallstrip\nutsire\nlashanda\nrlam\nbakone\ngoldstock\nhumayoun\nrasuli\nbirkrigg\nfagerstrom\nlscc\nofterschwang\nbeardsall\nantivaccine\nellidge\nbodis\nmcilduff\npercudani\niread\nspraggs\nsuporters\nzeyuan\nmccullochs\nannelis\nalgranti\nallissa\nmahaiwe\nbidtopia\nsermeq\ntransmodern\nbootman\nboloney\npeisner\nslackjawed\ngummelt\ninfastructure\nmordechay\npicca\nstuggle\npowerbuoy\nberings\nkryptops\nkildren\nwittelsheim\ndoughtery\nfessy\negeler\nieua\ncheerlead\ngambolling\npopieluszko\ndrudging\nrickham\nsaikua\nyellowhorse\ngoddamit\nbacote\nmobel\ndaviz\nsahayata\novervalues\ngreenworks\nborke\nnnlc\nnotarianni\nkrejcik\nchinitas\ndesem\nnorat\nvilne\nrudko\nhousehould\nspeciousness\ndawdled\ncondemed\ndogfighters\ncomplicatedly\nbelaboured\nmjunction\nxsel\ndowski\nportabello\naberdale\ndikler\ncnpv\nbavuma\nkussman\nflysheet\nsettop\nrewardingly\nbackcombing\npromotores\ninovation\nnoront\npelmets\ncrossly\noutqualifying\nwolan\nzensational\ndebilities\nfictionalise\nmondrago\nfreeriders\noptiks\nhortscience\nmikhalevich\nedirol\ndivyang\nrepentence\ntimane\nongarato\nhomu\nrapprochements\ndishrag\nshadsworth\npulce\nforign\ncuric\ngjepc\ntruanting\nmckinlaigh\nwaldseem
uller\ndesmangles\nfenz\nsurescripts\ngerbarg\nunfulfillable\nbroadbents\nwhitchester\nkaskelot\nkukly\nmacbookpro\ntrendex\nkilmadock\ncranapple\nwebtrust\nmugavin\nblazesports\nwatzman\nbeermats\nupsweep\nthinifers\ncalenick\nmollinsburn\ngreninger\nnaqibullah\nbafi\ntheatregoing\ntedlow\ncaitie\npejovic\ntdis\njaniot\nindemnifies\nexfoliant\nshalash\nlethargically\nlobke\npetillon\nclientless\nvolcic\nxizhong\ncalister\nnmas\ntarlau\nschwegman\nmergis\nwittbrodt\ngreenloaning\nkingarvie\nappealled\nminallah\npatarini\nboretto\nthorkelsson\ncordara\nechorouk\nshutan\nfuqiang\nleukocidin\nbookstock\nkingsberry\nasnes\ndurovic\nimplementability\nminexpo\nleukaemic\nsdst\nsalaciously\nmayfa\nanacetrapib\nschooltime\ntrands\nmuzicant\nspents\njonzon\ntusing\nnellysford\nivys\nsupremists\ncrbt\ncovec\nphilanderers\nophthamologist\nvelveteria\nmarleigh\nplesser\nwymott\nvalueact\nmacgillvray\nazerbaidjan\nclassiebawn\nscavino\nunrepressed\nprevedouros\nebidding\nlooj\ngaganjeet\nsoundmen\nmetallidurans\nyielders\nsherelle\nsloggi\nsagong\njoydens\npoliza\ntranstec\nlumison\nmonadhliath\nthierman\nballagas\nokonak\nunderemphasized\nsprayable\nmasduki\nverenium\ndecarnin\nqingtongxia\nshopbop\nkalogiannis\nwoodfree\nwull\ninson\nmosae\nyeppers\nmrgfus\ncarphedon\nneithardt\nsewin\nevca\nazafady\nstringbag\nloske\nsucursal\nagrast\ncphd\nepscor\ntasmagambetov\nchrysalises\nhanad\nkurtaj\nnobbled\ngraët\nscarfing\neighton\ntreavor\nunhygenic\nibol\nparrotta\ndimebon\nhindolveston\nboubekeur\ncetos\nmotaleb\ntitherington\npual\ndeviney\nunglue\nbalanoff\nusnavy\nbikavac\nltro\ngombocz\nboniva\npolene\nprender\nwaterdale\nphsc\nhelmsburg\ntiyapairat\ncadhay\nieca\nstratyner\nsicinski\nsukhinder\nponcino\ngoeppingen\nncrg\nseriouly\nlscd\nbarbaso\naacca\nmohammadou\nnovari\naccoutred\nsatsias\nextraordinariness\nwelnetham\nsoudant\ntanberg\ntemistocles\npelczarski\nbuyt\npiplica\ndnpa\nzgh\nnayfeld\ntoumai\nmarylynne\nkleeberger\nscioneaux\nkendric\nbackbite\njeanbart\nmanzano
s\ntinch\ndisbenefits\nguesstimated\nreguard\nruloff\nchint\nesveld\nsandercoe\nhissen\nkmmg\ncsmu\nglulisine\narsher\nkhakrez\nsuffrajets\nbuzzwire\nlxk\nillegibly\nwitthauer\nhhla\nmenú\nstecco\ndanli\ngymslip\noverstocks\nreemphasizing\nputsborough\njamarca\nclynder\nadolesc\nbohac\narpeggi\nxinggang\nareopagitou\nhidajat\nenaje\niostar\nmooncraft\nwhiteknight\npaulinia\nautostar\ncoburns\nannicelli\nyhoo\nseediest\nkozinsky\ndebtx\nlifeimi\nrepressiveness\nmupariwa\nusie\nclitsome\nuaeu\ninitital\ndonnellon\nrawstron\nnsrp\nbuale\nkohsar\nhulugalle\nvelaux\nbolberry\nneubig\nsodomising\npolomka\ncommuncations\nmultipane\nkatrena\npinkstinks\natousa\nbizony\nberlinerblau\ncagdas\nabduljabbar\nrockiness\nkuligowski\nbancyfelin\nfaragallah\nwalper\ndesplanques\nvialet\nmutahar\nhusari\nghazvin\nbreezers\nlineth\nchantell\nhatefull\nborick\nbeeford\ntiffins\nmylswamy\ninterracially\nallegaert\ntadelakt\nethoxylate\nkeggers\nlunyov\nstrazzullo\nakapo\nmcintrye\nmalaak\ncolombet\njeha\nbrendell\ncazo\nbunkroom\nauville\nhsaio\nkasowitz\njribi\nbabydolls\nchatrath\ncilluffo\nstartingly\nbuhrmann\nbryanboy\nsheppy\nglase\nderryck\nrecenly\ngamesmen\ndumpleton\nnostalgist\ntollerate\npostin\narender\nfateev\nintersts\nsilhan\ndeferr\nodato\ndiscordantly\nrowanfield\nlogoed\nfaggy\nserzone\namurao\nlavone\ngathuessi\nkibwana\nstroz\nnufarm\nmijke\nfeagans\ndellaqua\nranner\nconcientious\nnealson\ngenesia\nsiewierski\nkurobuta\nrosyln\nkabwa\nnatinal\naubeck\nmuntarbhorn\nextensification\nlxp\ntarrif\nnemawashi\ntilberg\npyos\nazcom\npageflakes\nroderique\nacrylamides\ntopland\nulimately\nksha\nrangon\nalessandrin\nlaeticia\ninvestools\nkusnitz\nregifted\ndebaty\nmorphix\nwilenius\nservranckx\ncondotel\ncroquetas\nmorbello\nhoier\ngaoith\nlabvantage\ngrindings\ntadena\nfaisaliah\nshantee\nsiderurgica\ncinématheque\nesbi\nglasto\ngadonneix\nuninsurance\nselic\nrowlatts\npietta\nflummox\npersey\nhorkan\ngleit\nbellyflop\nflexibile\nmokoro\npoped\npaceline\nackowledged\nidefe
nse\nkedric\nstraplines\nwatercar\nstuggling\nharrowell\ncubera\nforequarter\nacuma\nthongdee\nuraemic\nonwubiko\nlyuda\ndistrigas\nmehat\nbhere\nnothaft\nagreeability\nnondependent\ndellasega\njiwen\naltira\ncraniomaxillofacial\norianne\nhammans\neggcup\ncollateralize\nshakealert\nkarters\nguesstimating\nnewirth\nteaspoonfuls\nlifka\nauwal\nminilabs\ntatoos\nquatchi\nrobeez\norhttp\nnantie\nsmithwicks\nseinfield\ndoornbusch\nnochimson\nbuzard\nbolthole\npolydextrose\numprum\ngsic\nkamn\nlikably\njegathesan\nenourage\npieth\npongsu\ntiggywinkles\nlowenbrau\nanticlimactically\nrivinius\nsexaholic\nfraccari\nandrad\nrilwanu\ngrowthworks\nconnette\nhopeland\nhillarycare\nbaigrie\nshurrab\nlaspada\ngretsky\nmcguffy\nbuffler\ndrunkest\nitablet\ngoig\nsteamiest\nforterra\npilic\ntranquilisers\ngoerge\nalshehri\nfountainbleau\nenterline\nequiduct\nsuppposed\ntotzke\ndimbo\nrizgar\nstimpmeter\nalzbeta\ndelzell\ntrodding\nsaraghina\nmulrain\npromissed\naripuana\ntankus\nbairsto\nclabecq\nreproachable\nsecuritisations\nmoshkovich\nlemosho\nstamatia\njayyousi\nmitsumoto\nincludng\nseedat\nrelgions\nurucum\ngabell\nprozanski\nmontellier\nyesica\npalmeirim\ncobent\nsheneman\nnpts\nanerobic\nplokhov\nrejiggered\ngladiatorum\nbulaki\nripson\ngnma\nqibs\nthesp\nungphakorn\nenimont\nprerecording\nprogramers\nsmokeable\nsamadashvili\nmartinat\nmacerator\nbhukya\nworklight\nvertrek\ncarahsoft\nopcab\nargant\nrendl\nzipperstein\ntariffed\nsherle\ncpmf\npoad\njokery\ntrahtman\nsavarino\nreciept\ntendal\nkeilitz\npolten\nvounder\nscibona\nberibboned\ngyürk\nfonatur\nakuno\nloughview\ngiunchigliani\nobvi\ngladkiy\nmeirs\nbeijingers\nmancillas\nelfering\naktionsgruppe\nmillworks\noxycyte\nspragens\nafica\njoxel\ndopirak\nfuerstman\nflowbee\nbuildling\narulanantham\npropitiously\nnatee\nschaerr\ntaxine\nafida\nlionshead\nabdulhussain\ngoodguide\nfactless\nxingtong\naysar\nchalor\njalfrezi\nfurreal\nskimps\nonyia\nminnix\ncatam\nrubboard\ncompèred\njaniero\ncarufel\nmitraclip\nkardamili\nhyp
olito\nsavander\nronnen\nfeagan\nmalcolms\nreflexologist\nbirthmother\nbielicki\nemkay\nlovetta\ntheplatform\ndimetra\nmunninghoff\nwynston\nsmartdrive\ncordevalle\ngrayhawk\nnationalises\nsongping\nbryncrug\nritze\nroht\npsem\nmarbridge\npejak\nfjellner\nbernerd\naltogther\nwihda\nchipeur\ncliman\nglicksberg\npfspz\nsopheak\nnextpoint\nhomoet\nviklicky\ndegise\nuchishiba\ndemarches\ncritien\noddson\nknifeman\nhazam\nlifechurch\ncrons\nadvaiya\nperroncel\nnordeide\nsirignano\ngabari\nlupski\nrilonacept\nkasuba\nfrednet\ncabbed\nspaccia\ngoosse\nsklamberg\nstelt\nobousy\ngregorich\nmousketeer\npirtea\nwhalebones\nhahnium\npoidatz\nmushonga\nmacguineas\nwijenayake\nconradian\nundraped\ntashichho\nthosands\nmunthir\nmandlikova\ngollogly\njaiyen\nnexxt\nhoshiyama\ntelecommuncations\nplapinger\nchrisopher\nwebchats\namanor\ndeflators\nmbrace\nspringsoft\npaladina\nuzcategui\nuyttebroeck\naleshia\nsaffarzadeh\negelstaff\nodiousness\nenertech\ngarbhan\nmisdial\nscarceness\ndemetric\nfromages\nyerman\naltshul\nsurveillances\ndashty\nstunder\nprytherch\nhanqin\ncowlicks\njackton\nubergizmo\nnarte\ndelny\nhospitalizes\nstrews\njessberger\nvulindlela\ntresper\nmegahy\nabsard\ncynghorau\nesrey\ntumori\nfurballs\nlowys\nkadkhoda\nheddell\nlesil\nmaulings\nsimpley\nghota\ncormet\nwhitehat\nmammalodon\nunbuckle\nmanfo\natteveld\nchaimbeul\nkaulkin\nbellkor\ndegaris\nendoscopist\npolaner\ncaulkin\nillycaffè\nhyken\nhiesinger\ntalarion\nkasbahs\nzunga\nxtremedata\nhonomichl\nkandarpa\nyixi\ncarlby\ndefusal\ncolish\nlagdo\nkhazaal\nhoverman\nresnicks\nenrgy\nbezabih\nmaryka\ndonana\njumpstarts\nsoloski\nbowhouse\nwozniewski\ndemonises\npanthar\nunbend\nmacuxi\nunbolt\ndorpen\nabutu\nswith\ndochow\nconata\ncipf\nbeclomethasone\ncyberlab\nnoncognitive\nczop\nhavil\nfagged\nkurtag\nhillhall\ncrofelemer\ngretkowska\nweatherize\ngnps\nabili\nbluelithium\nschoepp\nmuumuus\nthickish\nmillerston\nlalov\nbuhrman\ntotted\nlenain\nknickman\nobah\nshgc\nchlordecone\nyjb\nwuyep\ncircularise\nsaqu
ib\ncourjault\nidga\ntuscano\nstropped\nyadavaran\nbousted\nmevacor\nwarmenhoven\necumenicalism\nlelaina\nkrysko\ncaesareans\ndehorned\ngenuflects\ntrilene\nlufrano\nalaneme\npicafort\npopout\nsquealers\nrepressurize\ncoakes\nunderware\npeenemuende\nduplicitious\nbalasubramanium\npuiforcat\nstreetdancing\nnanoflares\nmoaners\neyeteeth\nherdwicks\npetrolleri\npassaged\nbeznosiuk\npretorious\nnpap\nwrighty\noforka\ncurrnet\nassistan\nantiaging\nbhol\nfrancileudo\nlandstar\nlimns\nfoulkrod\nnerines\nnowosadzki\nendocyte\nretchin\njideonwo\ndetangling\nsuface\nwinfields\nsirotta\nsacrafice\nghilad\nrawkins\npanoptica\nkryzan\nreminiscient\nsapar\nguoman\nmubenga\ndought\ngeneses\nchocat\ntutukaka\najok\npuplic\npatasse\nmahanay\ngunshow\nnanobio\nkruszyniany\ndegiovanni\nkambarata\nunforgivingly\ndockter\nanothr\nvannuchi\nhudsonalpha\nperovich\nmaziotis\njammyland\ntreeman\ncrosshands\nsadick\nduckhouse\nsarvey\nbarod\nkeelhauled\nrght\nluhrman\nterroism\nfedflix\nvenoy\nkonstatin\nshapings\nshengyang\ncoastie\ngruca\ncuete\nleacach\nhardfacing\nrashim\nplaypumps\nkymry\nodendahl\nmethold\nrelending\nmariet\nchruszcz\ndalmation\nmosakowski\ngaitley\nlisnarick\nyanci\nbiniak\ncarrim\nsaurez\nsoffin\ndfferent\nbackhill\nphilagrafika\ntching\nherreria\nlibdeh\nwaxholm\ntrieschmann\nisoglossa\ncambar\nhirschkop\nfreaney\npostindependence\nfontis\nhoundshill\nvastest\nrimondi\npandeya\nedinbane\ntayab\ndesigntech\nwassit\nlianwei\nvaccarelli\nsalpetriere\nsógor\nwomanless\nsibl\ncadec\nhuwwara\nnergard\nuchibori\nwondershare\nhistroical\nstubner\ninfluents\nzoepf\nweathercasters\nlazarovici\nlewannick\nbringewood\nsamtech\npicaridin\nwiancko\nslitty\nsalivates\nbombmaking\nkundor\nenemo\nnaptip\ndewer\nshukoor\nmahachai\nsivananthan\ntimblin\nsvanoe\nicelandics\ngoerlich\nxianliang\norganovo\nfreshbooks\nsynterra\nhaminu\ncerza\nmovellan\ncoolfin\nbakhmina\naxten\ngynaecomastia\ncoastkeeper\nfederalistic\nyordany\nquana\naristede\nurfer\nwenzek\nbibipur\nalerion\ndevloping\
nkusile\nliasion\nksiazek\nmurwanashyaka\nkolaj\ncotarelo\nbreznican\nlamoureaux\ncemt\nunbudgeted\nguangyao\ncostumiers\nratzmann\naufdenblatten\nclannishness\nrustock\nkenidjack\nsiegsdorf\ntiredly\ndepersonalizing\nrosendin\nakdag\nripi\npourers\npoliticked\ntitantic\noponent\nmokoka\nsambucetti\ngudaibiya\nroenne\nprelimary\nsabaj\njampolis\nzootv\ntaramosalata\nolopade\ncontorni\nbodgit\nstarite\nyateman\newallet\necography\nkiriakidis\nsuraqah\norangs\nunificationism\ngiridharadas\naulc\ndominatrixes\ngesturetek\ncalçotada\norajel\nzwiefka\nmansewood\nklingholz\nstjames\noutgrossing\nthackwray\ndevassa\ndogpiled\nswormstedt\noppostition\ndespites\nbounciness\nstandfest\nsoupcon\nirisys\nglicker\nbourdoncle\nvoegtlin\ncyclamates\nkopane\ngobbet\nrudimentarily\nfornicated\netidronate\nouvi\ncandidiate\npowerlist\nwoessmann\ndamnatus\ntozzoli\nosetra\ndespagne\nguirguis\nwisecracker\nmorineau\nkloska\nclsoe\nwaverers\nmaroda\nluhuo\npassailaigue\nrunty\nmagnabosco\nschoenecker\nrelativisation\ncraneway\ncandleholder\nsrokowski\nfumigations\nmingji\ntellkamp\nterrawatt\nrashanda\nmedicins\nhugya\nblindley\ndimson\nunderexploited\nlumberjills\nlachter\nborui\neastcastle\nmarichka\njakartans\njuurlink\nkaffi\nfirelink\nkrolak\nconfield\naggrandised\nwebload\nsilverdust\nbroxted\ngamman\nmomument\nezatollah\nknur\nbacille\nkinkiness\nthunell\ntheya\nsametto\nrecission\nkiselo\nmfou\nispan\neigensinn\ngoeff\nmakarapa\neasyhotel\ngovenrment\nmancation\niacoboni\nmackeral\nlcpc\ncafetiere\nhershon\npostgrads\nshurpayev\ntamro\naugar\neickhout\noctobers\ndemke\nappreared\nvinalon\nfizan\nminicamps\ndurfy\nreboarding\njalandar\ncomepletely\nbilderbergers\nrequalifying\ndicipline\nchickenhawks\nmoamoa\ntiné\nministerships\nanniverary\nprompan\nshaub\nmyllyrinne\nspeis\nhankus\ndührkop\nkhataba\nportnaguran\ntibula\nzenter\ntriscuit\nedetate\nwhoonga\nnajmeh\ncratic\nexcerise\ntepel\nadvo\nzummar\njollier\ngenlyte\nsmyrnium\njuola\ngetlein\nhuziak\nmanjaca\ninsidiousness\nr
asilla\ndelpuech\nburuca\nghx\nhelyg\npajatén\nbroomloan\nprovoste\nfuntwo\njubilance\ngirmay\ndergachev\nmagomadova\nquieroz\nnalbandov\njaakonsaari\nlegspinner\nkapito\nmathais\nhitn\nkabinga\nghneim\nrembold\nindianopolis\nevli\nwirat\nyoostar\nbaghmati\ntibber\nnateq\nchoupana\nhlth\nbaiqi\nunbuttons\npennett\nnumhauser\nszczerbowski\nunequals\neplp\ntilborgh\ndairese\nsharkbait\nuejf\nbraises\nmcclennen\nnhin\nquitlines\nmoroseness\nresat\ntasawar\nnkan\ncozzie\nassymetric\ndaphnee\nnatb\nclopay\nkopatz\nsacramoni\ntmrw\nsupremicist\ngwir\ngrigoli\nwinterflood\nleblancs\npetrano\ntky\nstrimmer\nchingoka\nschumpert\nimmaculee\nrouwenhorst\nabdolvahed\nmbilu\ndecadently\nessola\nmoisturisers\nedham\nieep\nramsus\nlushest\nwtwt\nspongecake\nouevre\natually\naftel\nstofan\nbullheadedness\nhanescu\npediatrix\nfootling\nmohmad\nkyree\nalpharadin\nbellco\nwiratchant\nsouman\nballyoran\nhimelblau\nsaravanapavan\npasticceria\nceku\nemmigrated\nnonrepresentative\nwigging\npedde\nlakic\nmanoussi\nnonlegal\nmifeprex\nceratizit\nschupf\nmzalendo\nrunyonesque\nkamiko\nneurochem\neduventures\nlimewoods\nsuspcious\nfeelingly\nhoogesteijn\nweliveriya\ngssc\nfitfinder\nkamagra\nfloxx\nprapas\nphotinos\nlebioda\nkonicek\nsentimentalised\nlinpac\npoyon\nlingholm\nroadford\nronbo\nsouaid\nularu\nminkova\nchrx\nozzies\noverexaggerated\nabitova\nswimmy\nsamanez\nszaky\nkoche\nrewarmed\nlebherz\nkwatsi\nceling\nolofi\nroullier\nsilverspoon\njawzjan\nstangs\nvanpooling\nensuites\nbulletproofing\nsomodevilla\nbirmanie\nplaynormous\nmysimon\nflinstones\nrahimpour\nunparalled\nelkinton\nomda\namscreen\ndisect\ninaugurals\njelmini\nwoodstoves\nlakhdaria\nruimin\nwriteoffs\nsnoozefest\nbowcock\ngoodhall\nreinholz\nyatauro\nazk\ninterhome\nbulbrook\ncarnkie\nundercoating\nbinmen\nreidt\nkanshin\nlauture\nchondral\nskelhorne\nupgradability\ncarten\nribadier\ncosigners\npapell\nbreadwinning\ndisharmonies\npepu\ncaucasion\nwishfull\nbemand\nbaige\nolivennes\nelaha\nharalambous\nahmard\nsugababe
\nhmap\nfataki\nembarrasingly\nboesche\narleo\nchentouf\nhertforshire\novertrained\nmiell\nghayasuddin\nfreshpair\npenel\ncarmichaelii\nbexon\nhomerless\nwilhelmson\naturu\nindianan\noldemiro\nvenkatasubramanian\neorpa\nstaerk\nhealthconnect\nekanga\nvoisinage\nboquhan\nlossed\nvoyeuristically\nsolena\ntodday\nsobis\nmundulea\nlanoka\nkidsfest\nabdusakur\nskolrood\nbalnagask\nnovicki\ninnovis\nfaifili\nstranton\njoselin\nprocyclic\npersuation\npécrot\ntonik\nchemaly\nzachanassian\nferrexpo\nbaiyangdian\nstoelting\nteletrac\ninalterable\nvisionland\noverholtzer\ncostanoa\nwhitie\nkaffiyeh\ntoqueville\nscarlite\nturst\ndarvocet\nmewbourne\ncristalli\nrangzieb\nchampionsip\nfarmaceutici\npetrolheads\nzhikharev\nteisuke\nmosaid\nschlow\nhyperlens\nockelford\nberkett\nksentini\ncheeseheads\npennslyvania\ngouts\nkonecki\nashray\nreengagement\nmogadon\nprepaying\nneugeboren\nstefhon\nhengjiang\nshireman\ncrusing\ndcrp\nscheuneman\nhussani\nguyomard\nspellbrook\nabdabs\nshiftas\ndawra\ninfracore\nherenstraat\nkosmidis\nsubandi\nmarmie\ngalloy\namosu\nhyomandibula\nstretz\nparkfields\nipath\nhufbauer\nmitia\nhufanga\nebonised\nlankey\ncerrejon\nmainar\ngudmundsdottir\nvinyes\nnewhailes\nycg\nouvriere\ntracheotomies\naltarock\ndeboning\nwisenheimer\nhighish\nbedlinen\ncpea\nespnhd\ncaslen\nhussmann\nuaua\nbovensiepen\ntunit\ntroncale\nhornqvist\nlbbc\niberiabank\nmkhondo\ntuttis\npiggybac\ntwtc\nkungayeva\npapastavrou\niddy\npprs\nspurk\nplaceman\npostglobal\nlards\nmicroalloyed\nfattiness\nberlinecke\nklair\nkness\ncoge\ndehm\nmichole\nxingjiang\nchikari\nnoncardiac\nmutiah\ntressie\nsprengers\nalgore\nomidi\nchongyuan\nforeigns\nbroatch\nskibine\nyeonan\negality\nrocken\nthorbeck\nrathfelder\ngeneralisable\nhemingby\nflouncy\nmenichetti\navonwick\nrellys\nbotherer\nraggedness\nrootsier\npaktiya\nmorua\nlessini\nportégé\nzelenitsky\nmutally\nroudnitska\nstoldt\nyodle\nnitec\nkinsmon\nobliqueness\ncutrera\nheuermann\nzhone\nacapela\nalbade\nnicvax\nindacaterol\nciarrapico\nza
mrak\nkacar\nvomitous\nundcp\nfrappes\nastho\ncmmc\nspeedballs\nsafelink\nmashatile\nlondiani\nskivvies\nlateralised\nnonreturnable\ngeekchicdaily\npouted\nscruffier\npredix\nraniya\nkyba\ndeterrant\naminopyralid\nlember\nelica\nsurefooted\npreened\naddling\nanchorwomen\ntomabechi\ningibjorg\nbirthler\nnanosensor\nconnnection\nmattas\novereater\nfalstaffian\nbidh\nbaralla\ntheola\nbusaba\nmocktails\ntiptoed\nintereted\nobamae\nishiaku\nnhms\nwringers\ncodispoti\nmiyase\nrybarczyk\ncspn\nmillons\nrompres\nhogh\nzottola\nchanceless\nfurudate\nlarot\nbarzansky\nsaiccor\nyardenit\nshareowner\nstratoni\nbaodong\nrubicund\nexpectorate\nzebro\nnelva\naproximadamente\nsantizo\ngarbarski\ndogileva\ninstituion\nkousseri\nschwark\nwincc\nduaa\ndunie\nmegaresorts\nlejune\nkildow\ndittoheads\nccsi\nandelson\nwerer\nhaynesfield\nngawun\ntrupanion\nthemelves\ndinkas\ntendenza\nperpetuator\nadetomiwa\ndvur\nnucleators\ntownwide\nilpa\npathwork\nmesclun\niuda\nwinstel\nsèze\nqingyao\nkilliow\nstepgrandmother\ndozzi\nhillebrecht\nraducioiu\nzegerman\nindividualise\nrusbridge\narpel\nyovia\ngbci\nhazmieh\nmiaoke\nmarander\nintitiative\nninfo\ndordick\nillegimate\ntabarre\ngsol\nreddox\nprovate\nvornamen\nbattara\nnareau\nprestowitz\nhyma\nwouldent\ndiminshed\nsajko\nmudders\nmolosh\nteola\nekulona\ngavea\nkotlarsky\ncrdt\ncamaleon\nencarnacao\nbonizzi\nshunhe\nlthe\nameringen\nhatayama\nsmithberg\ncomapnies\nhedgeable\narngask\nogoegbunam\nabsorbingly\nneuhouser\nfadipe\nramekins\namnd\nbroadweave\ndeodorize\nabduallah\nkotchian\nsaucing\nestheticians\ncanaletes\narcelik\nfourtrack\nchiefswood\ndigimax\nspeckhardt\nemmissions\nivanovi\nmahammed\nmasiel\nquilombola\nsolvej\ncornor\nmonarrez\nmaccy\ncnev\nmisallocated\nministerially\nexpectorated\nutegate\ndaqduq\ntragardh\nlipoplasty\nrezaie\npassionel\nmissable\nindulkar\nrecoated\nelkhounds\nminaldi\nbansen\naldemar\nkettaneh\nberkshare\naccumlated\nbakchich\nvcxo\nhurwit\ngillilan\nblondness\nstavenger\nlopucki\neastonville\ncabrach
\ninnisbrook\nzotinca\ndariani\nfrenzie\naquamarines\nisqed\nhamshere\nsteelite\nnorthsix\ncymunedau\nleeville\nluftschifftechnik\nduckface\ndalsace\nbalogna\ndeglet\nzazzi\ndroi\nmouswald\nmirazon\namjid\nundercooking\nerulin\nnagarro\nabuza\nassisant\nswirral\ncorioni\nyonica\nmesopredators\nreifman\nmachart\niciness\nwelshofer\nexceedances\nsolkin\nforeswore\ncannellini\nsadjadpour\nburyatsky\ngarnetts\nhaidy\nsongsmiths\nschwartzes\nsauerberg\ncorrimony\nroszkowska\nsleazebag\nvarathan\npricol\njady\nqlipso\njadwa\nmeharg\nemeny\nfailiure\ndiddles\nmassood\noutproduced\nqueslett\nlipperhey\nngonyama\nmonix\nkrepela\nciparick\nidolatory\nruwaili\ncolladay\nlpsa\nblathers\nnewsweeks\njwn\nhorsetrading\nesrailian\nretrospectivity\nsmalti\nkhetaguri\ntweenage\nbmad\nfaligot\npeynier\nrosemore\nshaloub\nweltzin\ntriparty\ntuyn\nmoamen\ndigiwalker\nduathlons\nspoonbread\ndeustche\nheedful\ngenium\nsportsters\narrika\nnaulleau\nspeakerphones\nbulcha\ntavarres\nserbis\nsorest\nunassimilable\ndicatorship\nboluk\nrabonza\nplihal\nopperating\namericorp\naguera\nkavosh\ndobley\nsampas\nsemifreddo\nlantagne\nbardacke\nhaloacetic\nsolicitously\ntigiev\nakikiki\nantinazi\naptilo\ncallay\nteppco\nrecip\nmgscomm\nmahb\nwpuld\neletronuclear\nideh\nbeaterator\ncawp\nmccoskrie\ndarrtown\nsubmarining\nmaquire\nsagus\ncubbedge\nlefterov\nsaverino\nambered\nfesq\nstomas\nmavinkurve\ncorperate\nalgeta\nzyskowski\nentec\nrolheiser\nioma\nyuksekova\nostel\nmaghrawi\nfreeconomy\ngoam\nleferink\nbodensteiner\nvardanega\nshabbaz\nelams\nrichies\nrefoua\ncrosslegged\nlojka\nlatara\nportaloo\nbaycol\nnonliterate\ngenbutsu\nserbinis\nabbing\nironkids\nechavarren\nshihuangdi\nxuemin\nhsic\nhooghsaet\nfonner\ngreystar\ncoutiño\ncourset\ndollarized\nmcgonnell\nborodavkin\ntanzaniteone\nlundeby\nromashkova\ncaballa\nmalkki\nstrawberryfrog\nqueada\nmclonergan\nholmbridge\nsliceable\nlahyani\ncauna\nprostatectomies\narrivistes\njapannext\nfossilise\ncaroming\nloxam\nheimbuch\nunoticed\nfeinblatt\nsa
siprapha\noverinterpretation\nsupercroc\nellstrom\nweaksauce\npriviliged\nausberry\ndanahy\nviswas\nmojiva\noverdesigned\nkapya\ndosova\nidri\nyongda\ncrunchpad\nclimens\nblalack\ntabbat\naquathon\nnfwi\ngreenshoots\nmurdishaw\novereducated\nlaylin\nwindeler\ndragonlike\nsbtb\ndecomissioning\npreelection\nnbis\nbullmastiffs\nkahikina\nmarkstone\nmuxlim\nchijoff\ntoweling\ncabazitaxel\ntabare\ngascard\nnondairy\ntzipori\nmotorbiker\nletkemann\nintellectualised\ngorings\nbronzers\nklipspruit\nrecogntion\nhurdlow\nartfest\ncnnturk\nkufel\npyrg\nsenbahar\nxkss\nhuskiness\neurosurveillance\nbaoshun\npernicka\nceola\nemack\nbadria\nchlamydiosis\njctc\nholeta\nfareena\ncalev\nsupplementaries\nkitsmarishvili\nkronholm\ndumbya\nhalkerston\nstanforth\nstonesby\nstanziale\nketterson\ntejocotes\nmujahadin\ndaintry\ncarbonator\nsportservice\nhardluck\namlf\namericanlife\ndaredevilry\nimerovigli\nfieldview\nshatskiy\nroflumilast\nwalum\naldermans\nsalvayre\nsicklen\nddss\noverinterpreted\ngiantkilling\ntomaševski\nmeterology\naderman\nalsmost\nbartone\nmugyenyi\nondres\nwisenbaker\nindebtness\nsaichon\ncivilan\nsukamto\nhypercompetitive\nsuiciding\nuncomfortableness\nsouvenier\ngribbell\nzhenglong\nmagaha\nchernoy\nbahamondes\nplook\nmetalith\nabfab\nradjou\nalianca\nhalshaw\nkingweston\npipelayers\nevanses\nstremlau\nnorthminster\naundrae\nchiantis\nintellectualizing\nsusdorf\nusonians\nmarabese\ndehen\nkliesch\nrehrl\npogmoor\ndoniyorov\nwisan\nberinstein\nstorum\nnewsfutures\nsegee\ncyrila\nphel\nallowd\nsutha\nimagemakers\namorphously\nmendelevich\nrüter\nhighsides\nclandestini\nmidmer\nfixham\nextradicted\ncoccoon\nreassort\nhalamandaris\npersonol\nchenalho\nrehding\ninminban\nchastize\nprofepa\nstaska\nlevitts\nandrosia\ngolbeck\nstracchino\nmitk\nattemtps\ndendias\nsupercolonies\nquaffed\noshitani\ndergarabedian\nsceme\napimondia\nvehvilainen\nsibani\nghaida\nmohmmed\nshopgirls\nhkdl\nrecyclate\nnonattainment\ncfy\nkaztransoil\ndeliu\ncaucaus\nkuchling\nivuna\ntrafficlink\
ndicale\npaerl\nihad\nnuradin\nanash\noffspeed\nvizzard\ndunnion\nraimer\nurbandaddy\nmonieux\njumpdrive\nforein\nstoitchkov\nunderdiagnosis\nturangalila\njangmadang\nunderspending\nhejji\nkaouk\nfocued\nbrakebill\nroise\nappennine\nsonntagsblick\ngielan\nbardale\nbatarseh\ncleardebt\navemar\nescolastico\nchatline\nziketan\nukrspetsexport\nentomo\ntarne\narova\nhylenski\nogadeni\nlaghmani\nzurmat\nsurcease\nnetherfields\nkuzniar\nponderousness\nwearsiders\ntribromoanisole\ndaunis\nwinklepickers\nguanjun\nsurip\nizecson\nwnion\nsahebi\ndefaqto\nayouba\nhotted\ntsdb\nsunwear\nstraphanger\nwittur\nabandoner\ncomeup\nroithová\nropke\npranjic\nmanouvering\njavins\nanchen\ndonavin\nsahner\nwasso\nhlavka\nnichter\ndillendorf\naudri\ndelicta\npizzaz\ndisinvestments\ngilmours\nrollonfriday\nliberalisations\nvuthy\nccjs\nnewhills\nsanguineous\nsitutaion\nrodecker\nmufa\nkillham\nconsumeristic\nidigbe\nflustering\nbloggery\nbaarsma\nsalvadoreans\ncamillagate\narrogants\nsummervale\nuggen\nmagradze\nfannies\nchangizi\nshamni\nmigrators\nshaquan\nziprin\nsuppler\nexpells\ndoise\nspringham\nrcpch\nkhalib\nbabynames\nmeasey\nbuddists\nzaluska\ndelaema\nglander\ngachassin\ncorss\nnewhill\nsubstanceless\nmiquale\nnooz\nmcra\nprehab\nzucula\njahmaal\nfedspeak\ndotcoms\nwhoopers\npermenter\ngerten\nradhouane\nostracising\nlemrick\naliquo\nnardicio\nwanlaweyn\nthomana\nfangupo\nbilerico\npantev\nganllwyd\nrainswept\nmisunderestimate\nliuda\nomniport\nspewer\nrandich\ndoranne\ntensest\nnonvocal\nhooshmand\nyesteday\ngovernent\nxingguang\nwaifish\nlazarowich\njatto\nmanorhaven\nkgotso\nseley\nquckly\nnatex\nsmerling\nisins\nbarfrestone\nseperatists\nkarban\ncrockard\nintersolar\nhaselour\nbatoned\neinfall\ncentilitres\ndayshift\nallegedy\nethelston\nklinz\nkashmoula\ndellorto\nmulticarrier\nsophiline\nbacklighted\ndakarai\nairp\nnepco\nshinbones\nbrodys\nsoigné\nschvaneveldt\natlapulco\nnwigwe\nmckaigue\ncalascione\nemuzed\nmirii\nfleetest\nnasbo\naskjeeves\nfacilitations\nexultantly\nas
ustada\nijza\nkulski\nvaniqa\nbridgestones\nostentatiousness\nefie\nbryncir\nnickodemus\nslcg\nganjoo\napoligized\nunburdens\nsamkos\nupcourt\nblindart\nbagcho\nlacount\njetsetters\nenfora\ndispise\nkungyangon\nallice\nthibadeau\noutdrawn\ncwmp\nunliberated\nsameur\novenware\nprescreened\nfountainebleau\ncauleen\nankershoffen\nmeadowlake\nleftrightleftrightleft\ngeosequestration\nbatirov\npruce\npocks\nkotkai\nbraaid\nnaesp\ntuctuc\nperiapt\natvi\njokiness\nbizley\ngdba\ncandylion\ncrawforth\nchallice\nkeion\nmedpro\nhidenao\ngesté\ncoextinctions\ncentrefield\nchanrai\nlamal\nneuroendocrinologist\ngeartronic\nmañuel\nstramongate\nkratsa\nbonio\nalphabus\nziesel\nefalizumab\nslipstreams\nmckalip\nmcilhinney\nrecertifying\npoell\npetrolhead\nnurre\nruthian\nthuddingly\nzippi\nkenetic\nsyafi\nfreasier\njimani\ndemichele\nradosta\neurispes\nprivvy\nspacesaver\nbadisco\nhandwerg\ntrostel\ncoldingley\njumhuriya\nkitzman\nashbritt\nfreezeproof\noverscale\nhopless\nnfrn\nhesl\nmuxton\ncarred\nrockwells\nmeglen\nsabbaghian\nhometree\nscrewcaps\nchrisafis\nokolloh\nhording\nvranac\nprivledge\npollmächer\ntchouk\nrufinamide\ntasy\nkornstein\njotwani\nrunnig\nkrahnen\ngardephe\nnewn\ntrocchio\njxb\ngaladi\nhorrow\nmarcelus\nbialys\nlacena\njerret\nlifetree\ndragland\nseabase\ncallifer\nfairtlough\npoulation\nhumberts\nwilcove\nyouwang\nwesonga\ntuscani\nruqaya\nsubheadline\nazrack\npcis\nhazier\nlabná\nstriplight\nblindest\nmanduka\ndonyelle\ntogu\ncapitalone\nkormas\ncypriniform\nkarrubi\ninsys\nbobsleighing\ngrönfeldt\nsconset\nzubrowka\ncatchline\nfetai\nganeden\ndpmd\nminiweb\ndirico\nushcc\nanually\nblio\nbradsby\nchoicer\noberzan\neliash\nscuplture\nelrio\ndehydrators\nrengstorff\npottelsberghe\nnukui\njabrill\nmdladlana\ndoofs\nrockowitz\nhajela\ncellarman\nswydd\npapou\njawdropping\ngrunwell\nlaundrettes\nfrowick\nlachappelle\nkilburg\nmorger\ntreseder\nelsynge\nvitsoe\nbackorder\nchallaborough\nschmelling\nirva\nnenashev\ntharps\nstokker\ncritten\nfogginess\nbfas\npetr
ogal\ninstructionally\nufdg\nkukki\nescénica\noutstretch\ndesipio\njakeli\nshevach\nsaltzburg\nsentimentalities\npacome\nteetotaling\ntangoing\nrmif\nblairites\nschottlander\npezzaiuoli\ntadaka\ndebaucherous\nexperiece\nohsaka\nhopechest\nwncg\nkenshu\nsunspel\nmerisotis\namerge\nmandlenkosi\nvlahovic\nmulad\nelysabeth\naverkiyev\nlongboarders\nleukine\nmuenchner\nmeytal\nmyman\nthumpy\nnagatsuma\naverge\nwoodruffs\nabuu\ngeotechnique\nmacabe\nwaterpipes\norgainzation\nkolchinsky\nfayers\nmcclenachan\ngrevett\nshebar\naftre\nwehrey\nmcinness\nglenglassaugh\ntreuting\nnazila\nmizzle\nmilashina\nvedrine\nwussies\nexoticness\ngoldstaub\nguvava\nleibson\ngoodsync\ngeorgallides\nremitter\nstottie\nsnakebit\ncalagione\naviapartner\ndatek\nzelenetz\ngovekar\ntillikum\npsephological\nneigborhood\nwrapp\nchaddick\nodoptu\nmorenike\ndabbashi\nbazlur\nsightspeed\naffectless\nchountis\ntenthani\nwashkewicz\nmillbeck\nsiusi\nfaberman\nbradac\nmaturen\nfalcarinol\nsyncardia\nbozdag\npisanio\nguardianfilms\neurimene\nhemmis\nkarsai\nchelbi\nmartellini\nkabwela\nrrvs\ncamblin\nsemirural\nproxauf\nmicrosft\nbmra\ngiersz\ntimelag\nyildiray\nnebbett\nrivercentre\nhusick\nrhoca\naahpm\njakhrani\nmettee\ndisatisfied\nrecoupable\nphipa\nparkerization\nmatthe\nbolognaise\ndisagress\nnameth\nsnogs\nhadarim\nspeeddate\nunemphatic\njilli\ncemita\ntarke\nmuqimyar\nexterran\nhomeserve\nmenking\njiangbo\ntorpig\ncaloiaro\nderegistering\nnyombi\ngangplanks\nhanasaka\nlepad\nskypeout\nstielicke\nukriane\nplextronics\nameriserv\nhonkytonks\nkhazova\nmukhu\nthalesraytheonsystems\nciminera\nopinoin\npropulsively\nmynytho\nrecarte\ngeee\nbarnado\nroofthooft\nkwana\ntichfield\nwaaaaaaaaay\ncharterholder\nsargentiana\ninquorate\nteerathep\nkoshwal\ntreesnake\nknuckledusters\nconcom\nsnyderwine\namusedly\nangolagate\nkarpowitz\nrrip\nmonkeyed\nogunsola\naurangzaib\nmeikhtila\nprevoius\nreemphasizes\ndbmotion\ntarmacs\nsurtaxes\nincongrous\nraij\nearhole\nremorselessness\nchintheche\nlacatus\nhilgenbrinc
k\nbiodiesels\nriechmann\nhboi\nfottrell\nsmatter\nhronek\nutherverse\nbusurungi\neforms\nprolem\ngruer\nketumile\nkennford\nrepressively\nyasith\nremaind\nadelene\ndarchau\nplastinina\nkergan\nfrba\nvucci\njbala\nrezonings\nagadem\nresiliance\nnonpolluting\ninvestrust\nyouwriteon\nactable\nequens\nsunspace\nvathana\nglentrool\nsigm\nedhar\njobsites\nschlefer\nforysth\nkassianos\ncommericals\nwdfc\nsacrifical\nwichelstowe\nsullivantii\nveev\niboxx\nadministaff\nallighan\nvacilando\nwaldin\nsaipov\ncarbonfibre\nbabied\neliasoph\npollycarpus\nhoneybuns\ndastmalchi\nmarrige\nphocuswright\npintal\nprivatebank\njeremys\nprusty\nmedsker\njoncour\nzaniest\nabdulahat\nnebbishy\nsical\nserevent\nbuckwald\nermanii\nviamedia\nmakeshifts\nzangar\ntredup\nsombrely\nmouen\ndjau\noperagoers\nsuperwomen\ntiramisù\nperniciaro\nmerrel\nwheddon\narousability\nshibatani\nmaierato\nkashlinsky\nshotting\ncirrincione\nmoorsley\nloughhead\nsatmars\nansoft\ngermophobic\ngrandpuits\nhyperinflations\npandorama\nhemscott\ndonorship\nlitinsky\nqualifiy\nsarking\nsdac\nbronchoscopic\nbreuss\nramindra\nbookstaber\nnotarantonio\ncrady\ngazillionth\nbittorrents\nwimping\nneagles\npampeago\nsuffocatingly\nvaloria\njonquils\nuwem\nhastreiter\nmarinone\ndaleside\npauc\nyanira\ncetrulo\nschonbrun\nbenmussa\nrestios\nclinginess\nogoo\nkirktoun\nlustrons\nashvale\narluck\nyding\nlaserlike\nalliteratively\noberlies\nnauffts\nforgivness\nhenredon\ngutsier\nrushfield\nabuzayd\ntacomas\ndokhan\nbosideng\nmahoud\ntayburn\nglassgold\nweckerman\nbandic\nultralingua\nsmarthome\ncmro\nvenerdi\nbactroban\ntoddies\nabrt\nduststorms\ncentaline\nrivetingly\ntwanged\nwracks\nelektrarne\nmidnatsol\nphotospread\nnaeema\nboorishly\nincontact\nscheinert\npatchier\nfeatherlike\najlouny\nstrengthing\naltens\nlifedrive\nneatened\napachecon\nglenochil\nisbc\nmudsnail\nbistecca\nschulp\nshaat\nnobilmente\njunming\nchicoma\nhsmai\nhamieh\nosteologist\nparrini\namerenue\nlokken\ngroch\ndiby\nmechtronix\nrooflights\nolaszliszka\n
micromedex\nlablanc\nbroersen\nossendrijver\ntrosten\ngaviscon\nfishbowls\npreza\nbeltany\ninitialing\nrecrafted\nherawi\ngarlik\nshueyville\nhadian\necouen\nfilise\nmoellers\nyemenese\nhokuyo\noverexerted\nbadee\naseltine\nderbyn\nbadertscher\nplageman\nunderdose\nminnieville\nquiterio\npenality\ndanahar\nblommer\nfdml\nribbins\nairscarf\nmasterbeat\nochang\nphotogs\naddisu\ngoodbaby\nsefularo\nnopalitos\nspadoro\ncleggs\nharryville\nmarlias\nunposed\ncuranipe\ndulken\nutiashvili\nniederaussem\nunpurchased\nrichardon\nkamlish\nenzler\nastroturfers\nmaaden\nboraie\nmilthorpe\ncotteswold\ndimango\nhaefele\nuchitelle\noutstandings\ncandleriggs\nshoweast\ncakewalks\nsipunculids\nzinkan\npeponi\ngruffer\nwillhoite\nverimatrix\nbootstrapper\nbossanyi\ncyno\nvorapaxar\nhydrochlorofluorocarbon\nbhulaiya\ntouchier\noohhh\nspiccia\nbulyga\nrattoides\nrockiest\nyunsong\nwebsurfing\nregelous\nfresquez\nrumfitt\nplaynetwork\ncouncilwomen\ncelestially\nyoduk\nliqour\nfenves\nedifecs\ngrandmom\nngumbi\nphenomonen\nnsti\nfunambol\nunderdocumented\nmeshuga\nhapppy\nirise\ncucuzza\ngoicochea\ngalatasary\nrangwala\nploegh\ncornetta\nschleiff\ntoulemonde\nlozowy\nsubsonics\ntenorist\nteletrax\nthathe\nvaraut\nmuwrp\ntabtab\nkomatsuna\nmcgrigors\nfratti\nrivarly\nwomba\nwimper\nmortifies\ndesseigne\naffliate\nbrandons\nwaterwells\nrescources\nladram\nbellando\nwintonensis\naoci\nperol\nmlib\njebran\nnyangoma\nnpta\nvillifying\nbemf\nbaumgold\nteliris\nglendermott\nmadalin\ndigestions\nberlex\nprodisc\naarto\nstellent\nbedrolls\nyepsen\nseckin\nclampers\neljahmi\nastapovo\npetalotis\ncashmeres\nlocy\nfujirebio\nidid\njanuarys\nreinvade\nampatuans\nlazevski\nfcfc\nbuildability\nmicroneedles\nflatworld\nguangxin\nsirisak\ndjemma\nkaide\nopportunies\ncontemporized\nneuger\nsakhizada\nsnuggs\nmmna\nbordell\ntravelsafe\nconsomme\ndelizie\ndimeola\nnarcoleptics\ntalkes\nkoniambo\nkondanani\nninestiles\nveihmeyer\nbaisalov\nhomeliest\nguyford\nwijsenbeek\nimplys\nmcgookin\npassholder\ngodlings
ton\nkuwadzana\nqinghou\nmarketgait\ncapol\ndigennaro\nhomotaurine\nunequivically\ninterbanca\nsquawky\nlaywers\ngoana\ncodjo\nukrtransnafta\nsutureless\nhalaweh\ndelgadina\nsharpstein\nleily\nmirenda\nmorkunas\nclintonesque\nairpatrol\ncullagh\nbromsberrow\ndegressive\ncompells\nwindcheater\npsephurus\noutdates\nrussomanno\nglaciei\ncartoneros\neileanchelys\ncorvel\nkickapps\njuvenility\nqoba\nlysek\nthiab\nknudstorp\nxinliang\narmorsource\nquikpak\nlooniness\nwonderbread\nmezain\ntaskent\nserostim\nportugeuse\nctwg\nsweileh\nshorebreak\nhystericalady\nsitesearch\nasaps\nnonpotable\nwagnerians\nmalpani\nambulancia\nfischietti\nfutron\nviticella\npasierb\nunbelivable\ncrima\npoppens\nstonewater\nvanecko\ndohany\nkoelbel\nsentman\nzirp\nlyovochkin\nerento\njastrzembski\niguaran\nfishtailed\ntilstock\nquartarone\nbogdanchikov\ntehseen\ntashia\nrolark\nrhestr\nnewcastles\nhallab\nsleaziness\nunbreachable\nhktb\npulickel\nirrs\ndifi\ncushnan\nconcetration\nhärstedt\nlibbers\nboseley\nindusties\nthorps\nembratur\nmegapiranha\nmodernises\nvagabov\nkolarska\nbrightonian\nfreedomcar\ngilvarry\nwillowford\nheiken\nyalcinkaya\nrpro\nlorriane\nintergovermental\nmonteblanco\npiringer\nprimally\ndunard\noglaigh\ncoutnry\nsymmetricom\nfoodhall\nunderwhelm\nnerb\nshahreen\nriesberg\nsibbing\nnemertes\nsensorless\nsocializers\nworldwinner\npalwasha\navancer\ntambra\nkortekaas\nbravelle\ndeschenaux\nhalozyme\nmudslingers\nlameda\ndamaskos\nmerrist\nnicoderm\nchowing\nmceachen\nharisa\nflammini\nwahdan\nbartron\nzims\nflowerless\nriexinger\nmajaw\nincovenience\ntabermann\ntamperproof\nreallity\nwideorbit\nlarkrise\ndosky\ndipirro\nndfs\ntaliglucerase\ncramster\ngonnerman\ngajbhiye\nemling\nshayma\ntlsa\neliota\nsrisamutnak\ngulworthy\nkislov\nslackistan\nkirkendoll\nriesenbeck\ndisengenous\nreolysin\ncomcare\nmagied\nkräutler\nurbansim\nhtey\ntadjedin\npakastani\ncaucusus\nbigoli\njurcina\nfrontload\nyeywa\ncharpoy\ncaramelizing\nlanguorously\ngreycon\ntrialpay\nwalth\npenchard\nzieg
elman\nsnauwaert\ndonnici\nmisclassifications\nprakesh\nsorian\nnetfires\nverbillo\nashun\nmifumi\nklaris\ndarfour\nsubcommanders\nkintra\ndileepan\nhafith\npaulusma\nzerp\nauditees\nreniec\nmonjayaki\nstrubegger\nhybášková\ntaquería\nxpertdoc\ncigdem\nchildrenshospital\nbegrudges\nincreasinly\nblats\ncrissman\nleered\nsuwal\neschewal\nplantswoman\nhanses\ncasmoussa\netant\nrodrigez\najarian\nsharhabeel\ndilettantish\nminnijean\nsaydiya\ntartlet\nvideocasts\ndeaville\nyielder\nsikich\nrauseo\npompy\nwasent\nraage\ndesignware\nbrka\nsantisuk\nefforst\nmisbahul\nklauk\nstreetwars\nhenkelmann\ndrusillas\npaillettes\ngenise\nunconsoled\nkaraganov\npcmm\nactionaids\nstubberfield\nmaekyung\nmadelynne\nborsani\nhsinchun\nfilmforum\nbestas\ntreelines\nstaunched\nkräusel\nogac\ndenplan\nchlorophyl\ngueorguieva\nhamoodi\nputis\nmalphrus\nciska\nabusada\nbackseats\ntembu\nmysogyny\nzeitchik\nwanden\nlibson\nloebel\ntsaritsino\nburmis\ndisapearing\nunrenovated\ngkj\nhaizao\ntombolas\ncapdevilla\nsoeul\ntissanayagam\nnucatola\nhuanqiu\nlarget\njctd\nconferee\nnondrinkers\nlyndy\nlivek\nwindburn\nasell\nlstr\nsunders\ndukem\nuneducable\nstanwich\nrahmaan\nforkas\nchilliness\nkapetanakis\nunisom\nkambriel\nnawagai\nrissient\nbesly\njeangerard\nnarmeen\nwhatling\ncitzenship\ntoase\nglenaan\nnotetakers\nbagnone\nbibba\ncnnradio\ncorrugate\ndawaa\nmonopolism\nhyfforddiant\nhookwood\nsnowboarded\nredmonk\nhenricksons\nelmu\ndenodo\nghafir\ndeeson\npitkeathley\nnairns\nrenuzit\nhealthpoint\nyangguang\npayack\nzabola\njapex\nampad\ngiammanco\nbestriding\nteag\ncelerier\ndisatisfaction\nloquaciousness\nrothnie\ndegrey\nvisualsonics\nhorwits\ngubden\nhallandsås\nunpriced\nseventhly\nsolaraid\ngrosskreutz\nbelcombe\ngarica\nfollick\nsekoff\nalkylates\ntariceanu\nwilkis\ncianchetti\nsurachet\nbeachell\nniedzielan\neyesocket\nbuzzeo\nsaubade\nbuffaz\nzojirushi\nmcfe\nazmak\ncoricancha\nhateg\ngraffitists\ndismantler\nuncorks\nchavance\nsilim\nschabort\nlegent\nshafiqa\ngtaiv\nrtpj\nklaidman\
nprinty\nfeles\ngovernements\ntranseuropean\nzippity\nsamanna\ntagliero\nnachalat\nrockresorts\ngetups\nglucocorticosteroid\ncordoza\nanyadike\narbess\ngoffney\nliquefier\nbedruthan\njanesky\nstetser\nreschio\ngongwer\ncracolici\ncharbroiled\nduskey\nassailable\nadamatzky\nabasov\nfaceboook\ngudel\nrosefish\nvizza\ncafolla\npacificist\nazuaje\nsagent\nlimtiaco\ndirectnic\ncasteneda\ncitifield\nelektrobit\noverinflating\nwheedon\npellicori\npreauthorized\nchilmanov\nbrexton\ngooglemail\nkennestone\ncolaw\npoussepin\ngreendog\nsnowdonian\naagl\nchakai\ncortec\naurandt\nmegaplexes\ncdars\nbigshots\nballybeen\njoojoo\nmgps\nsuppling\nsizewise\nentekhab\nholestone\nsantagati\nsaifun\npeavoy\nkilicdaroglu\nteleplus\nnonevent\ncoinings\ndismisal\nndambuki\nbioparco\nvists\nfortunatly\nstibb\nsarahpac\nartt\npaxo\nintx\nwhyley\nhavazelet\nbazargani\nharmeyer\nazedo\ntomuraushi\necologo\nkagans\nsayable\ntsiklitiria\nboness\nnavtraffic\ndevelpment\ntenderise\nwwhi\npoulis\nkurdsat\nschuele\nmarcinowski\nmacdissi\nrankling\npaulite\npostolos\ntappah\ngousis\nwhizzinator\nutma\ntoftwood\nfratboy\nbarbarianism\nyoutubing\ntemporize\nmarylynn\ncompeition\nastrakan\nhrbaty\noldag\narchetypally\nmumbere\ndayon\nvancheri\nyazji\nkarinna\nipredator\nmenstruates\nmoneduloides\nwolson\nencouragment\nlincoff\njasman\nveitel\nmosharekat\nbuyuksehir\nbulkiest\nderosario\nohsumi\ngrapey\njuxtapositioning\nalsoswa\ncanabis\nmpoc\nunconstricted\nbirah\nfolmsbee\nwharfdale\ntaghreed\noumma\nanthonia\nyoink\nexpediters\ncheenath\nluminas\nwanrooy\nonglet\ncityboy\npoppadom\nmusuems\ntaurand\nstadco\nsmartops\nissenberg\nnitties\ncatalhoyuk\nbushiness\nburok\nbranzino\nrazorwire\nccusa\nlfepa\nstenild\nidilbi\nballymany\nsherell\nhaffield\nbapela\noutterside\nqindao\nmistargeted\nolders\ngrugan\nbazetta\nnanx\nchangwu\npatsalides\nteisha\nzimmerly\nchambost\nscowled\nmasalskis\njaggernauth\nenfeeble\ntousle\nallsteel\ndratel\nerkesso\nguliani\nqingguo\nabdirisak\nletko\ngonorrheal\nkaenel\ncil
ybebyll\npredeliction\nrissani\nsimolke\nbeglov\nbarrise\nvannina\ngloatingly\npathography\nmontrey\nlickhill\nbedzin\nmaniadakis\nneufeldt\nsapozhnikova\nogorodnik\neducationdynamics\njobmatch\ncbiz\ntapella\nspellmans\nbluntest\nexternalizes\ncavewomen\nbucalemu\nffrom\nstaibano\nmelaye\nenchautegui\nfuger\nheinricher\norgnization\nkrumper\nreyka\nmillpool\ncolanders\nsollazzo\nacoustiguide\nvogele\nmaconomy\nlubke\nfottorino\ncostopoulos\nmangoni\njasmon\nshabayeva\nyagihara\nncoil\nconsultees\njanácek\nsuceeds\nnorowzian\nstrongish\nbergene\nbojs\nlikkle\nhovater\nmccrosson\nvisionart\nmavric\ndigu\nceemea\ntredoux\nchavvy\nturchinov\nnudler\nindarjit\nbackstrokes\ndigitalise\nzhixue\nmcvarish\nagaoglu\neffectivness\nargutifolius\nauchtertyre\nschwartzwald\nuberuaga\nlongdistance\nmalitia\ntheatergoing\nmcgoran\ndepledge\nnetmotion\ngirlishly\ncountermands\nbumblers\nstroble\nglai\nbetor\ngalthie\nlammons\nukibc\nedmistone\nsportcoat\nnfda\nclatters\nioflupane\nripolin\ngellir\nadacel\nkabluey\nkadogo\nnanomagnets\nasadero\nprochoice\nshafar\nfolksay\nlefkovitz\nfirstgiving\nmoonlike\nnovorossiisk\ntradelect\nunflashy\ndecopac\nwyngate\navrio\nequistar\nuhmmm\nlatexes\niftc\nboente\nbbut\ncanellos\nsaccomano\nrisio\nabondoned\nplayfoot\ngudenrath\ntreesa\nalpenhorn\nminehart\nkibbutznik\nvillaluna\ncarbonating\nafghanistans\ntsipi\ngainous\nscutiny\nembued\npinkhassov\nnestbox\nchwilog\nkiteboarder\ngroppe\nstuporous\nfarmstay\nnorrises\nsuroso\nrapprochment\nsulikowski\nnikolich\nlousiville\nsafleoedd\nburullus\nlassco\nbaniyaghoob\nhafemeister\nkrutz\nhallhuber\ncoloradobiz\nkutyin\nclinicals\njiggering\naffiars\nmehsuds\nghlaschu\nseidle\nreidford\nagajan\nfoderaro\nvixs\nlontchi\nwizer\nlouiselle\nnibsc\nsaggitarius\nrenetta\nbrittenum\nsudapet\nuchel\nseverfield\numaro\nrolfson\nhungerhill\ndunda\nappolloni\nstripclub\nmcgilloway\nchakothi\nwessi\nnyange\nsmpc\njereissati\nengagment\nzokkomon\nmusbach\nituango\nfrienship\nsporogenes\nvinar\nlyas\ndolmabahce
\nvatapá\nmciff\ncashable\npancaking\ngarndolbenmaen\ncopers\nhockwell\nsawab\nmadoffs\nphotosharing\njarislowsky\nshaheeda\nprivatizes\nmaylander\nyanquis\ncarouser\ngrundys\nweatherized\ncultybraggan\nannemarieke\ntaele\ncbai\nrecievers\ndecis\nsoleirolii\nhitson\nitida\nfortyish\nescano\npilegaard\nfakkah\nzayouna\nmachetanz\npolsham\nwaggled\nfruska\ndtvs\nmroueh\nbostanci\nsupermaxi\nheadcollar\nugobe\nnyccah\nkondengui\natrivo\neliezrie\nbillboarding\nalousi\nsablich\nfites\nshanae\nblose\nnilgun\nperthcelyn\nnonscientist\nunwaxed\nlemerand\nbodhráns\njamalapuram\nlusti\nhoareau\nchubukov\nwatandar\nberdymukhamedov\nmariale\nimpromtu\nsagittarians\nroanhead\nbirkins\nbuchthal\ntottendale\nmawlamyinegyun\nkaauwai\nfresnedo\nomnova\nlazzaris\nlickona\ncragged\nmagentas\ntsatsa\ncheesemongers\nmediasphere\nblackglama\ncroser\npolcheewin\ncharnwit\nnahc\nbackaches\nmurisi\nlobosco\nlauretti\nobering\nbrezovan\nshanthakumaran\ncybermentors\ngrashow\nactualising\nborkan\nthakuria\ncavuoto\ntoisa\ncpmp\nmeddwl\npadlo\nrighthander\nwinklevi\ndemandtec\nrecalculations\nmermoud\ndimare\nacknowleging\nindonesias\nliveblogging\nheadier\nghuneim\nstrey\nmélida\nnegociations\naphrodisiacal\nschloegl\nnitasha\nweaponisation\nwielechowski\nknekt\nmarican\nfabish\nfierberg\nshaunie\nlycatel\nruohola\nopcc\nbeefburger\nrunningwolf\ngoatish\ncommonfund\nknews\ndistinta\ntauxe\nmasseroni\ncatizone\npronating\nmilarsky\naggrevating\ncaglayan\ntroman\ninseminator\nchands\nlonliness\ngladd\nlyudmilla\ndidima\nsgis\nkaynan\nprasco\nplessinger\ncamilio\npancks\nfarinetti\nkalingrad\ngumbasia\nsalahadin\ncannavan\nvisant\nuniters\nbodysurfers\nbarolos\nphurbu\nbrightscope\nspekman\nfasihi\n​​\nnatsvlishvili\nshahrad\ngujiao\ngafi\nwellbank\ntuchin\neliette\nscadden\nsmiliar\nleavesley\nimperi\nzambarloukos\nguzzone\npetcharat\nghorak\nrhoten\nmaradei\ndiger\ngsms\npetrzalka\nneronha\nbibbe\nkfaed\ncahi\noberli\nroseworth\nrubaish\nostergard\ndefenestrate\ndestructuring\nbefouling\nsurl
es\nbertelsman\ndelissio\nkassoum\ngekkeikan\nwodehousian\ntorick\nkorzenik\npodila\ngospelaires\nspurtle\ntvoi\nvenemous\nkabukuru\nsunergy\ngreenergy\nkrechetnikov\nkillke\ndigiallonardo\nsandaig\nmccleese\ninfrastucture\nmuscadines\nlaloosh\npucciariello\natandwa\nchugay\nstcherbina\nafba\nehteshami\nshikapwasha\ncdnetworks\navanafil\npriniciples\nozdil\nprequalifying\nahlman\ndunsdale\nnovich\nmarkhams\nvultaggio\nassymetrical\ndinosphere\nadvancedmc\nsafana\nvigilence\npeachment\nkienbaum\ndachan\nkimme\nlittlehale\nardith\nmeditteranean\nwnav\nscotoni\nupcrc\nservic\ngannons\ndaurov\nholymen\ntakamanda\ngramatan\nwewer\nconscionable\ncarlye\nwysopal\nsharkawi\nbumbled\ntopfen\nwalpert\nhirshson\nessmann\nfffm\norocobre\nwatersound\nembitters\nmilanowski\nandaloro\ntotaliser\nbistline\nsexercise\nmhaiskar\ngigamon\nwanjek\nantiestablishment\nbmss\naitha\nbobsleighs\naghanistan\nicims\nsonejee\nkahmann\nzáborská\nrisal\nyounousmi\ngajdosova\nmedja\necononomic\nneurointerventional\nrfmos\nforsaw\nnewhope\nkpene\npoust\nfreesias\nbaldermann\npoleaxed\nplayita\nnecesidades\nhestitation\nmlac\nfacchina\npurpuse\nhadly\ntheatergoer\ngonzalito\nalraedy\nllai\nhiropon\nfullenwider\nnorthstone\nvalleycrest\nallena\nksis\ndunain\nrumangabo\nbeleiving\nsweeped\npiniero\nnassan\ndebbies\nboecher\nhefling\nethereality\nwarnemunde\nhearson\nproctologists\ndwifungsi\nilaskivi\ngalioto\nshlep\ncapesius\namidol\nthommes\nadministation\nincentivisation\nnadich\ngodager\nthemselvs\nlazowska\nleevees\noodaaq\nattenion\nbankster\nvatubua\nstuggart\nyelped\noverslade\nsweded\nwhitworths\nidama\nwgbr\nbalistic\nstrenuousness\nsluiceways\ncarbonneutral\nmidsong\noboeist\ncarlan\nhautelook\nmundaneness\nlabalaba\nrainforested\nstoeffler\nportelet\nbenfluorex\ngristwood\nuserplane\nsonderlager\nsomeof\nritish\njny\nticas\nqingmei\nmerridy\ndulcificum\nnationstar\ndelhiites\nallmon\nwpeo\nspriegel\nvălean\nbrejcha\nzhuangwei\nlineswoman\nepoisses\nlocatell\ndoubledecker\nrevaccination\nt
erez\nteamcenter\ncrisises\nocegueda\nclems\ncausewayside\nnarrowmindedness\nsesama\nreddihough\nbalkinization\nchiad\nquinzani\nothaim\nldls\nwonderings\nlaraba\nlegro\nacomplishment\ncasana\ngabling\npoliticalization\norrisdale\ngnpoc\nnostalgists\nverkhovtsov\nhorseplayers\naltynai\nthoughtfull\nfobert\nfoodshare\nslugfests\nnonattendance\nvictoryland\ncytoxan\nteramachi\ncalculous\npicciotti\nlepofsky\nsistersong\nukpabio\npizango\netextbooks\nborinsky\nruffel\nstauts\nsxswi\nellise\nrefranchising\nsalamabad\nsomper\necovative\ngemany\nstabilizations\nhummler\ncommercialbank\nflacons\nazeffoun\ncharrin\nyots\nbroadreach\nraheleh\ngoldmacher\nkelash\nkiswa\ndumaux\niglinskiy\nearier\ntemedt\ndisbenefit\nunrelievedly\nvittozzi\nbuitelaar\nmieles\noatibix\nbalsera\nglamourised\ncandymakers\ncollicutt\narellanos\nlifemark\ntagines\nlaizer\nnabakov\ndrakensburg\nvindi\nbucheri\nsightedly\nharmeling\nhernadez\nmalthace\nhulcup\npanfilova\nlilikoi\ntradmark\ntufaro\nundertreatment\nbackcombed\nhomepride\nstraiges\nwegh\nfleuranges\ngcmhp\nfaulkners\nparavan\nlaughingstocks\nlentol\nzerofootprint\nfanm\nffyrdd\nslagheap\nmarinez\nthougts\nblogospheres\nsusceptable\nporny\ncaried\nsitelines\nrosg\nparwaz\nmerilees\ncereproc\nresx\nemergin\ncmls\nannerson\ndickons\ncogitating\nhoneymead\ntarnovski\ncoquillard\nsolrun\nharfenist\nkondas\nnutrioso\nconnaitre\nbinalshibh\nfreetel\ndebarati\nclocky\ntomsett\nfidis\nintruiging\nfabiszewski\ntinnies\ncondoles\nintelligable\necologics\nmaddo\nmowie\nfautino\ncrystalens\ntrigonatus\nlinctus\nsonys\nnathenson\nrelock\nlossada\nwerema\nvandellos\nkovick\npropanolol\nwimpish\njtx\nfunuke\ntelltales\nsiefkes\nmahlum\ntemplehall\ngnaoui\npalmatier\nsisolak\ntranspetrol\nsurján\narolia\nboyloaf\nsplaining\nwernli\njudaisation\nsoukhovetski\nkruer\nmarquetta\nghtunes\nquizzle\nclippie\ngrandclaude\nchryssie\neknaligoda\ntaie\nprotoge\ngullets\ncavusoglu\nbirdbaths\nesaa\nholoband\nvoluntears\nperb\nshofars\ngraymark\nradicalizes\nwestaf
f\nedemariam\nmetastasise\nschweddy\nsyphoned\nwaleses\nworkstreams\niceburg\nluzolo\ntugra\nlangenhoe\nrediculousness\nlepping\nswishes\nportakabins\nquaters\nxyratex\nsjoblom\npekahou\ngreengairs\nmasaro\ntitanothere\nsarji\ncwlf\nbritneys\nnangrahar\nvarshons\nlottes\ncitadines\nfayek\nartefill\nnakhabino\nreprocessors\ntolitoli\namechi\nballero\nrohanna\ngalactically\npremising\ndataplan\nrecardo\ncontast\nfoodmaker\nswannee\noganyan\nstoneyhurst\nbathstore\nsirichai\nstalisfield\nindelicately\navrdc\nnonscheduled\naitan\ngureshidze\nvillagrasa\nshebdon\nskosh\nwalliscote\nanegasaki\nbrookenby\nfisherpeople\nwaivable\ngnjidic\nbookhammer\nsbgi\nbudney\nbunnin\nconculsion\npolek\nbuyvip\ngabbling\nyounggu\ntransfomers\nizraeli\nmilleniums\nbarayeva\nzhura\nmedishare\nrighs\nramsburg\nczechvar\ndecof\npropell\nsayliyah\nkitja\nlabinger\nplotty\noncophage\nbrockbridge\naound\nplessi\nrisg\nmcpadden\ntanswell\nlandaburu\npureeing\nfarries\ngamersfirst\nrouiller\nadhab\nbargylus\nsimplegeo\ncachers\nsalikhin\nunlatching\nabdelgadir\nlfrs\nmaqaleh\nkhosti\nforgia\nkocik\nshequida\ndebone\nmusicnet\nshilleto\nlonoff\nrutf\njornaleros\ndinnes\nmichelozzi\nwiedmaier\niresearch\ninternati\nrouček\nzalkin\nkilpatric\nrrez\nabdollahzadeh\npickhardt\nbiondich\nchipstone\nobare\nsembiosys\nkarabey\nmuaskar\nbutlering\nredistributionist\ncfib\nbodging\ngaudiest\nmotivala\ntaposiris\nkearin\noverwatered\nvarischetti\ntenners\nturkina\ntrebay\nphema\ncockier\nscrappily\ntywain\nzougam\npantuso\nurbanely\ntamarra\nmynachdy\njayz\ncupet\nblurbing\nvivaldian\ntochterman\nmikonos\nlaganosuchus\nmaidenhill\nadultcon\nkiwayu\nbryshon\nsmlc\nmitsuka\nphlo\noverfinch\nmainero\nmitteleuropean\ndiprivan\nolevia\ndimmel\npappardelle\nmalignantly\nlangour\nhäusling\nbabec\ndogwatch\nguguletu\nezrati\nzafiropoulos\npociask\nkacyiru\niacovone\nroudier\nbolívars\nsvnt\nkoziej\nsyntocinon\nhillblazers\nrvus\ngreenwhich\nwallagrass\npereria\nchnages\nuparmored\nfadai\npriggishness\npovoledo\nphi
ndile\nsaimone\ndemattia\notelli\nfaddists\nzackham\nthurnherr\nmackynzie\nthieblot\nwhiffling\nachugar\nbraincells\ntombliboos\nstoccareddo\nozin\nsandcat\nbulkan\ngogarburn\namphistium\nkarinen\nbailgate\nunwearable\nyaodu\nhenhouses\ngualeguaychu\nkotlarz\nzohor\nrogol\nwahington\nelectrochromics\nordower\ndebossed\nbutrym\nkandhas\nxmega\nnickeled\nabouit\ndevloped\nwristing\ngambala\npropostion\ncyberchondria\nmethedrine\nserviettes\nsumarsono\nniquitin\ncanalys\ntransfats\ngokita\nhollopeter\nbelevedere\nshulte\nfreeload\ndanetre\ndevaan\ndotloop\ndevrouax\njuchau\nstautberg\nvirusbarrier\nstreleski\nniketown\nheadlice\nmazuronis\nconceeding\nmanshiet\nhonoury\nduvanov\nalirezaei\nunderrotated\nclassens\ngilbraith\nskypower\ntransplantology\nramniklal\nherrarte\nsfgh\nqoq\nreadyreturn\npeetu\nilts\ndebix\nbyrraju\nanonymizes\nnewstrack\nperifosine\nalomari\npopster\ntriumfalnaya\nstratt\nbespattered\ntelogis\nbranshaw\nschwamm\nbabaker\nquadrantid\nlaksman\noystercard\nstrompolos\nbroadsiding\npharmacutical\nmitsutomo\nviliv\nflautre\nkeenes\nmuttemwar\nslingback\npriorty\nayva\ndengir\nrepected\nmonitering\njees\nschory\nyehezkeli\ntomilson\nfathur\ntatooed\nbenquerenca\nocrelizumab\ntryscorers\ninterbolsa\nklnlf\novercommitted\nholidaybreak\nhomegirls\nsheskey\nflunkeys\nbjerk\noverpraised\nmuhren\ncobholm\ndimitrakis\nmachinimas\nkoshalek\nshujaaz\nannamay\nbirdguides\npgad\nshenghuo\nmegatooth\nesepcially\nstanion\nsheehans\nvpak\nmakaio\njelbert\nconvertirse\nplamegate\nuncuffed\ntowfiq\nalagem\nretrigger\ndrivesavers\navrett\nmatche\nmctamney\napmi\nphilosopical\ndongier\nsangfroid\ntremolando\nsecb\niafp\noliveiro\ngandelsman\nuxorious\nillinformed\nlaeven\nfarmacy\nunreeling\nfranprix\nwestine\nstreymur\nducate\nstephfon\novca\nsoshnick\nzuqar\nzormat\ncordylines\nmultipin\nafcis\ncousseau\nhittable\nitemising\nkaparo\nparsol\nrohrback\nvaldebenito\nozem\nmexicanalink\nqmy\nbiscombe\nhualan\nthunking\nbankserv\nparticularised\nrohrman\npellettieri\nfam
vir\npilferer\nimpastoed\ndrycleaners\nweblike\nkhachidze\nbereano\nnarcotizing\ncordelli\ncounterprotest\ncooil\nkarzi\nkhazaei\nsedums\nstasse\ncafergot\nlinagliptin\nluethi\ngodbee\nchercover\nshaweesh\nschimpff\nehrenheim\nplecas\nloidl\ninconspicuousness\nsiquiera\nsunopta\ncozying\ngilleo\ntokers\nboutwood\nbagci\nunhitch\nwihin\nsinanian\nvegesna\nlezmi\nwinegarner\namgylcheddol\ninterlex\nqarmat\nyetagun\ninfoterra\ndeothang\npartenon\nspywareblaster\nhilia\nhelferty\npinesdale\necogen\ndiffidently\nloehnis\nphonecaption\npdns\naetr\nradwaniyah\ngarpozis\nschnipper\nchloraseptic\nballypatrick\nrabbae\nrenationalising\ndepreciations\nresuscitations\nmaels\nguissou\nholmgaard\ngaffield\ndeerstalking\ndiffenbaugh\nthatiana\nspett\ntabajdi\ncupcakery\nrysavy\nintubations\nhomegroups\ntrepel\nmariacka\nswarmcast\nunsuprising\nsplatstick\nchaudri\ngbks\nsathnam\nbuyology\ngodsake\npopovec\nkocol\nusulutan\ndemarquette\ndifferece\nassayag\ncochairman\nkidiaba\nkaixian\nsweatbands\npemuteran\ninve\nuniformally\nbalathal\npiccante\nempassioned\nzehme\nwerhane\nduckstein\nbestayev\nhematol\nfrascella\nrothfeld\nzlotnick\ntanyang\nchernovsky\nrestring\nmarnhac\nkoeck\nnubani\nattacted\nbavette\nelran\nungovernability\nquorate\nsoulflayer\nlapook\nmeteab\ngorvy\nsuperfit\nfirstpage\nbcwipe\nmashouf\nermann\nkozmann\nsynergized\nannyas\nbeachfronts\nsaibao\nduchampian\nclincal\ngruters\ncliett\nupswell\nwenhold\nlangenegger\ngarano\nhollebon\ncapmed\nsfantu\nxeroxes\neuphemised\nrestuarants\noysho\ncostena\ngreenwire\nlawrimore\nreponding\noceanium\nfidaxomicin\nwahidullah\njillions\nylon\nbuyside\nballyreagh\nwatercross\nnexity\nginnelly\ndorwart\nbraefoot\nbourgnon\nbrookhollow\nplevin\ncamtek\nfilardo\ndumpings\nkotchneva\nabhirup\nkiesl\nneatnik\nalmihdhar\npotocari\nborlange\ncanogar\nnwj\ncavic\nexpectency\nfander\nmedieaval\nzipcars\nieta\nwooddell\nrensink\nrottenest\nelinkine\ningelsson\nlouai\nhaggi\nveyrac\ntarryl\nmisrouted\nglenney\nrapers\nwritetothem\nyazi
lim\nbruuns\nchsw\npojaman\narrambide\ntripplett\nferetti\nreinsure\nrepoted\nbernadetta\njdem\nmediterranian\ngyem\nshouild\ncavileer\ntonsorial\npoitical\ngrantleigh\ncapehorn\nmoeckel\naldrige\ngidu\nmainassara\nburguieres\nbettadapura\nnosediving\nnonsecure\nfacebooking\ngarsztka\nspeacial\nsnacker\naluu\nnextlabs\nsmilebox\nnasiha\nfuturechurch\nhorberg\nmenapace\nrabaska\ntuleh\nbvii\ncouloumbis\neleider\ndemchak\nlabourhome\nredcom\nnhsmail\ngremolata\nfinf\nnaame\nkhnata\nsidoides\nbeetaloo\ndeking\ndenouncers\nodabashian\ndaughterly\neperjesi\ntittering\nexpectance\nfurberg\nstanleybet\nnardil\nknakal\ngadafi\nstrenghts\napolosi\nimprecisions\ncounterprotesters\nepaf\nklimaforum\nloathesome\nblackcircles\noverruff\nrouton\nketek\nmehtas\naremissoft\nbelievs\njohanesburg\nartemije\nbradco\npodber\nsoulliere\nmangiaracina\nlipshultz\ndgif\nweirong\nkazeminy\nlmra\nleecia\nadsm\nduhhh\nwyebridge\ntobyn\ndrilldown\nipekci\naparatus\nlandier\ntrefechan\nwestraadt\nmicrotrend\nscarpaci\nprisioners\ntosoh\ninwhich\ndetorie\nbakman\ngirardo\nhosang\nbertonneau\nmoue\nminxy\nnutrional\nunsightliness\nhereditaries\ntransactors\nhousebreak\ncabaluna\nmultination\nroadbuilders\nwyett\ntamte\nmecher\nmassounde\nmegaliner\nmzikayise\nkurcz\nmarraccini\ninexhaustable\ntashnick\nmccrackens\npackway\nsemisoft\nleishan\ncyclacel\nmetsi\nlissen\ndpfs\ntitillates\nwynen\nlovan\nsekonaia\nzedginidze\nbedmates\npaymah\ncrystallizations\ntinkertoys\nlorings\nvolc\nnattans\ntmti\ndénériaz\npuurunen\ncattoi\nreinspected\noldmill\ndorofeev\nbestattung\nwhetsel\nnoubissie\nuors\nmendiratta\nqriocity\nmarsack\ninjuried\nyanza\nverino\nbascara\nchampers\ngridlike\nbrossette\nrebeiro\nlampam\ngrandparental\nacgh\ntietong\nelmlea\ntirin\nodowd\njeapordy\nhansjorg\nswandel\nbakhurst\npenedes\nhamoui\nndanusa\ncorporan\nglenearn\nsteinkuehler\nzainaba\nkaveny\nermmm\nzeidner\npiccaninnies\nelsenheimer\nvillaroger\ngiudia\nstramash\nhaakanson\ngnarliest\nseminoff\nediacarans\nzuchowski\naij
un\naosda\nencumbent\nelektromotive\nslackly\njbar\ngurule\nduzan\nsmoulders\nsiddy\nnanogenerators\ndokku\nquicklook\necwr\npreusser\nbunchrew\npyestock\nashtari\nkurton\nbeninois\nrazaullah\nskulason\nbalchunis\nsiyathemba\novershirt\niguaçú\ntigereye\ntelhami\nnacey\nastroglide\ngillbanks\nstrudels\ngorinsky\ncodatronca\nwelikanda\ngharawi\nsensationalists\nwarnaweera\nexercized\nzareth\nsedillot\nbalindlela\nnosimo\nunadmitted\nsaõ\nntawukuriryayo\nikeguchi\nlurma\nmicrobrewers\noverwheming\nkolokotroni\nlubrani\nshenyu\ndockrat\nonepulse\ngemzar\naddaction\nvelders\nsvengalis\nmachluf\nivell\nbirdine\nfixie\nkapuya\nfoxytunes\nuberalles\nblamelessness\nvoyatzis\ncalguns\nkapnick\nmohlala\nhftp\ndistribucion\nzubo\nheinrick\nghawas\nchampps\nrespectfull\nsunniness\nvilday\nwelterwight\nszen\novec\nefects\nprosecutive\nintervac\nocobamba\npspgo\ncareforce\nshopwatch\nposho\nbinationalism\nrubinald\nmcgree\nkhiyami\nseckinger\nborgna\nteitlebaum\ninumerable\nconfrimed\nprattens\nbamboozles\ncustomises\nfleeters\nporfido\nstoody\naahhh\nsorsby\nnsmb\nprepme\ngaravand\nislamofascists\nunhackable\nriboli\nhiggo\nkoelbl\nursala\nkilamanjaro\npischinger\nhamalian\njipson\nervell\nbracale\ncravendale\nlubel\nwojtal\ncpls\nhaydan\ncamerawomen\nkokoris\naudies\nburkhas\ngounden\nnshamihigo\ncemf\ndaaras\nforoyaa\nrockson\nrepat\nrhiwderin\nkenscoff\nhoselton\nflewitt\noacs\nsherrys\neyestorm\nslavisa\nlangbord\nswitt\nmeseberg\nmadnesses\nkelami\nphiroz\nzitty\npslc\nurbancic\nbelabors\ninfuence\nbullsharks\nnephropathic\nperdikis\nrougle\nsrur\nchowrasia\nistanbullu\ndustball\nroofspace\nquets\nstaginess\ncellu\nchadirji\nmadover\nbolderson\nvicitms\nfiltrated\nmeiff\nkielt\njarich\nsmallprint\ncoehlo\nwomenkind\ngeovax\nwolking\nmarketscope\npestival\nchuet\navouris\nsahiron\nstanifer\nmalovic\nsymeou\nenraptures\ninstructively\ntrillionaires\nphandroid\ngovindini\nmusademba\nnerco\ncurc\ntunesia\nmanlier\ndomeier\nchupeta\ngurewich\ntutka\npwnd\ntohidi\nsneesby\nbusman
n\ncimavax\nprobook\ncreperie\ntallish\nstrecher\nwinalot\nbialobrzeski\nguaderrama\nstamell\ngretar\nsadeer\nsalmide\nchelvan\novab\ninkd\ntcell\nschnozz\nmanful\nmiddlemoor\narmures\npaiche\nedemar\nsajmiste\nyuanjie\nuninsightful\ninaudibility\nstreetwork\nmcclintick\ndmhc\ncowpat\npaterakis\nzeku\nhelane\nmoestafa\nillogicalities\nlispy\nandreen\nbucala\nswraj\nzaidman\nvilimoni\nscandalmonger\nrolihlahla\nsidetur\nguamuchil\nelectons\nnavaras\nkauahikaua\nmmis\nnhmc\nminitruck\nnageeb\nwyma\nyuansheng\nincompentent\nyakasai\nbommarito\nshaoxuan\nbeliver\nseedbanks\npijbes\nsmartbike\ntoecap\nbeliefe\nhalfwits\nqutaiba\ncaffein\ncollecion\nultrabithorax\nantzas\ndizdarevic\nlapdancer\ncahnged\nattaya\ntuhakaraina\nlailatul\ntlil\nhaverstick\npoggione\nardous\ndemined\nhunko\nsesamoids\nwinzeler\ngensch\nantlike\nclickwheel\nhanify\npatricidal\nelkon\nexpostulate\nbespeaking\naspesi\ncharcutier\nnemesysco\njudicia\nibbc\niobridge\nokunade\ndogfaces\nfrostily\ndusik\nmaggotts\nminutaglio\nwestsider\nacccept\negate\nnazenin\nwitlings\nbancells\nmclehose\nmexoryl\nmegret\ngerstenzang\numshini\nextemporizing\njamanak\nerfle\nsukova\nmesmerises\nballaquayle\ndiabetologists\nchisanga\ndetoxed\nressner\nastmh\npugnaciously\nmercinary\nanbyon\ncompromisers\npowerreviews\nlapandry\ngŵr\ntoderasc\nmochamad\nforgetable\nrazeen\nidiotbox\nmehmedovic\nscotching\ndwimoh\ntirtoff\ncasaburi\nhorselike\ntherasense\ndesexualized\ncmpp\nqiam\nallof\nrendich\nmondeos\ngornstein\ntemporall\nkneeldown\npristavkin\nboireau\nnamotu\nvanceinfo\ncompactrio\nkerrea\notherton\njawing\nhuebener\nfleetbroadband\nsheriffhall\nlhoknga\nmyska\nbillionnaire\niscol\nblinkevičiūtė\nseattleite\nsenpaku\nvasts\ngearlever\nxrep\nhashbrowns\ntightfisted\neroticize\ncounterpunches\nimmie\nboshers\ndenuclearisation\ndojack\nrubberstamping\njimmyjane\nbunging\nfrez\nmoseleys\nachamore\ntirozzi\ngeltsdale\nverkooijen\nsakhile\noberton\nmussburger\nssma\neeob\noverwatching\nmargenthaler\nraffie\nbastarde\nd
rukier\nparadera\nsenal\naltagamma\nrosile\ncareeer\nonmedia\npeszka\nfullbridge\nauctiva\nromeike\ncyberethics\nadcps\nmasoff\nahumado\nardeonaig\nstevenote\nsetlock\npicklock\nseatons\nabsamat\nspiri\nlandco\nvocho\nurre\naircaft\nunlaid\nbrekkie\nnonappearance\ngoldenballs\nbendler\nmeddon\nfalor\nbaitings\natttempt\nshushufindi\nmilbrodt\nfitb\nsthat\nsquaremouth\ndutia\npeacedrums\nkrohmer\notri\npresh\noveranalyze\nzaldana\nmudwort\nphotofinish\nliquorish\nrimjingang\nngruki\nuncremated\nduschl\nalalam\ndahlie\nwhimp\nmelentyev\nshakour\nidentfied\nkabbara\neldene\nrecuperada\ntremelimumab\nunwarrantably\nkajran\nguatemaltecos\nkreitzburg\nasplundh\nchaats\nmanour\nxusheng\nsefik\nadiba\nlightpost\nunifirst\nagriturismo\nshedder\nwpnsa\ncalamandrana\nemmaneul\nexploitability\npubmatic\nswaco\ntankel\nnumerex\nguidiville\naccually\nhajiri\nellite\nbuchko\naldai\nproteolix\nlistserves\nspecktor\nturiscai\nbadei\nbelfonte\ndiabulimia\nmemfis\nmccuin\nsidhoum\nclotheshorse\nnadasi\ndannemark\npikser\nsheqi\ngrare\ncavinder\ndemocratice\ndretzin\nkeumgang\nsilha\nbalkind\ncompleteley\ntsukigawa\ngrynspan\nziaullah\nlazurus\npaksitan\ntsuris\ndongbang\nrashesh\nensnarled\ngawp\npittelkow\ngiffels\navondo\nvanderwagen\nkefah\njannarone\narraigning\nonebox\nerfani\nchepkok\ndetriot\nkangeroo\ncoppess\nmetalers\nschwermer\ntavasoli\ntwito\nlidor\nimplats\nkhalig\nantoniotti\nabrass\nreleif\npadoh\nswedishness\nlanaway\nreliv\nunsavvy\nstraighterline\nunbuttered\nwillborn\nsoapdom\nebot\naome\nsalloukh\nwescorp\nvalencias\nkhetagurovo\nsbab\nathanassiou\nalexovich\ndolinger\nyerkebulan\ngracelands\nauvert\nafrikaaners\nbjoerling\nbirthland\nwfmi\nschepel\nseideman\nbrutta\ngezellig\nzajic\nfreidgeimas\nedelca\nwolfhill\nedenside\nmutawakil\nenergycap\ndryburn\nmialo\naizue\nskavysh\ndiliegro\nfanfou\nslome\nnarins\ncirv\nkeluak\ncrossick\ngrapeville\npensieroso\nbeiqi\nnurhadi\nsezmi\nmaillots\npssr\nberiault\natambaev\njusman\nencrusts\nshantal\nsamadani\ngraceway\naah
p\ncitypoint\ngastronaut\nleathered\nhappies\naggeler\nquimeras\ncadue\npanaf\nadml\nbuzova\neletronic\nanseo\ndiscriminatorily\nwhilden\nprelapsarian\nduxelles\njmpr\ndegracia\nchiberta\ntadai\nfrontperson\nguoco\nsojern\nnematullah\nsigtarp\nlanguard\nwetherly\nteesmouth\nruthrieston\nmedassets\necvet\nofferring\nmedicalised\nimtoo\nwallcharts\ngihad\nserendipities\nrumin\nhmmmmmmmm\ntopcashback\nnumerologically\nambegaokar\nwongpuapan\nvantassel\nauchwitz\ninkley\ntaurons\ncepollina\nfledgewing\nrozenblit\nfacevsion\ntereshkina\npurssell\nchatterboxes\npizzotti\nmahgreb\ningoldisthorpe\nvpas\nmandarinate\nslts\nfgic\nbambinos\nefamol\ninternext\nannenbergs\nchandley\npodles\nyarnbombing\njoyti\nquadband\nwgts\nbypasser\ngillinov\nherheim\nsociably\nplebians\navtoframos\nstueber\ndepetro\nkhadam\nnorthwesterners\nnaturiol\npdcf\nbattilocchio\nbasavich\ndurach\ntomasita\nschaede\ncorodemus\nchockablock\nfoyleside\nlibber\nfixins\ntraumatise\ncynnal\nleffman\nnortheasterner\ndomolailai\nfravel\naccaoui\neverygirl\nmuddiest\ntieger\nelsbernd\netrade\ncracklins\nbackcheck\njerika\nmozie\nschuiling\nspagat\nsolarize\nrailbelt\nplantadit\ndobui\nhaugo\nparouse\nlabreche\ntandoors\noestergaard\nkentrail\nlifecar\nflaggs\noldner\nhacan\nrestoril\nmillmore\ndeltalina\npaladar\nthorhallsson\noutpowered\nkaramojo\nmazroui\ndurabrand\nlavely\ncentropa\nfarahar\ntrendmicro\nexplusion\nelectrovaya\nzegveld\ngoebels\nbebside\nagilely\nesapi\nalikbek\nmeghalayan\nadventitiously\nregardin\nkhial\nsenelec\nhhonors\nboninite\ngoddijn\nbaikeinuku\nbananaz\ncouvering\nuuac\nrynhold\nushguli\nrestaffed\nenthrals\nunderscan\nbartner\ngattlin\nokum\nakator\nsagario\nmangetout\nchabalier\ndematerialising\nmatchima\nspencelayh\nbehme\noravetz\ntankering\ntrunkless\nsentimentalize\nunsated\ncurnyn\nslabby\naltegrity\ntavinor\nmacquisten\nowoo\nshiat\nshapovalova\nentrechats\nepigenomes\ninculded\nmoellering\noversharing\nnanotechnologists\nclukey\ngrumbar\npötsch\nclarvoe\ninquries\ngelderm
ans\nplaun\nschulzke\nmeadfoot\nshaller\nsuperpark\nsinovac\nsyamsuardi\ndizayee\npenknives\nunreflecting\nmoelyci\nmasaiti\nbluestonehenge\npolensek\nbutterhead\npolledo\nburbine\ndentention\nbanahene\nsteare\nkahre\nrammo\nprody\nyllescas\nchames\ndeskilled\nnikitta\nmonetising\ncardas\nharborfront\nexhilarate\nphilogelos\ntorchi\ncontran\nshreiber\ngoualougo\nstudyblue\nschwammberger\nicestone\nhagworthingham\nuhrlau\nlacore\nnirere\nsozar\ngoginan\nsabouni\nsaunooke\ndassarma\nvhda\novertaxation\nnammco\nabloh\nvaxjo\nmovetis\nthinkway\nlousiest\npeita\nullett\nsaydnaya\nadline\nanthoula\nschlindwein\nhoopsters\nlandcruisers\nrezeigat\nfabrizzi\ncovestor\nfarabaugh\nvaluating\nflipflopping\njilma\nmarconiphone\nidenitified\nlldcs\ntimesys\nequipt\nadlene\nniemela\npuddester\nshynaliyev\nfatayer\nshamaqdari\nhudok\ngaymon\nidenty\nziems\nshiprepairers\nmanawi\nbajil\nsprd\ncentronia\nriddall\ngazarov\nmademoiselles\narrse\nazrouël\neymer\nsvox\nsampallo\nelectrifications\nlapore\nhorseboxes\nhypres\nsinskey\nrushka\noverinterpreting\nargenbright\ntoothmarks\nwrinn\nbutkov\nnovodevichye\nbastardising\nboubakar\ntowncar\navicena\njaekle\nguggul\nefinancialcareers\noakmead\nnopr\nguaino\ngunnarsdottir\nfreakery\nvinoly\nsuppli\naulsebrook\nboştinaru\novernighter\ndensborn\nobamba\nnexpress\nwhoremonger\nporas\nkisik\ngillotts\ninterdealer\nthordal\ningenix\npopkins\ntulaichean\nprobl\nrnln\nsunbaked\nmuhlestein\norexigen\npsychopharmacologic\nkulstad\nchanice\ntorgovnick\nborishade\nrahmanov\nfirsties\npenate\nheygood\nifart\nacred\nmassgeneral\ncastlecourt\nmailrooms\ncacg\ngrosfeld\npcpcc\nerulemaking\nbarrica\nbrancalion\ntakaso\nroadblocked\nmohommad\ncornichons\nalphameric\normando\nmirasierra\nwhiskys\ntranchemontagne\ntongayi\nroebke\nerevia\nancester\nyanlin\ngorefield\nklaß\nhardfought\nreenergizing\nbotequim\nseiphemo\nyayha\nostm\ntiguas\nhausding\nduoyuan\nundertows\nbasima\ndambazau\ndisqualifiers\nnambarrie\nsacramentalism\ncalifonia\nimagenation\nbutt
erell\nanastazja\npressreleases\nverheggen\nshaap\nbronowicki\nuwezu\nmortor\nwinthers\nwyvil\nseyferts\ncellmark\nbigaud\nethar\nsoveriegn\ntulipani\ndealtime\nghulab\nappartments\ncalphalon\nidiotarod\nonepoll\nfeinsod\nsenterfitt\nsurrick\ncaloia\nappelby\nstudabaker\nbusinesse\ninsteps\nsalsalate\ngrandholm\ntokofsky\npidgley\nsaiger\nhandmer\ndimishing\ncapoco\nbackhanding\nmyrone\nbluemke\nadminsitration\nseverence\nyorky\nglobrix\nbuhara\nhynor\ncadging\nalicudi\nquezadas\ncylchgrawn\neurospeak\nfromlowitz\njaked\ntattled\nrestaino\nspringstein\njapery\ntregolls\nflakier\nsuniva\narabias\nclumpiness\ninsectosaurus\nceasars\nvankor\ngiltburg\nschillerstrom\ncajolery\npoliglumex\nrosettastone\ncarmondean\nnisene\nheineke\nmeidel\nviolance\ndabbahu\nbroadnet\nzbot\nscherrenburg\nocsw\nwyszomirski\nenshroud\ncontura\ndced\nkanninen\nadezai\nohca\nwantanabe\nmultiwave\nsequa\ntupak\nkratovac\nmonai\nfinetto\nakusekijima\nesms\nmaritally\nglascoe\ngecad\nfrankcomb\nperuvemba\nzitka\npakul\ngaragistes\npacka\nncdt\nyevgenii\nsquarcini\ndissaving\npellens\nbudgeteer\nunpressured\ntranscendant\nworkovers\nyanggang\nkitaka\ntermist\ngreeve\nneuroarm\nguangwei\ngeerling\nfolklike\nroebroeks\nncbm\nhoshiko\njigmi\nkeyun\nhellholes\naquent\nfayot\nmonagh\nswellhead\noutkicked\nfootbath\nroenigk\nlaschenova\nfeklistov\nrockabillies\nlichtveld\nrenovacion\nluques\nhaematomas\nhohlbaum\nqdii\nrespec\nvenlaw\nbrandable\nsnowmachines\nsuperking\nbikeathon\nsucrerie\nsukhirin\nknrm\npolulation\nnwcu\ntakeway\nterlet\nucunf\nkameisha\nbolillos\navghi\nccrif\nunderweighting\ndaraghmeh\nsulek\nchoosiness\nlemunyon\nloadman\nnonpunitive\nmisplaying\narulanandam\nfiestaware\nzvents\nschwam\nmothetjoa\ntcherassi\nedesa\nrotina\ndijla\nhalilbegovich\nsansern\nyotsukura\nkowk\nfuelers\nkblb\nhormozi\nsohonet\nluchey\nsteinwedel\nagcenter\nqualifiying\nremaine\ncput\nqustions\nstoudermire\nsharpy\nmillirems\netumba\nreverand\nkauranen\nhermene\nbaneham\nparentline\nibrisagic\nwallboards
\ndarfuris\nzedar\nfanaika\nlobotomist\nbijal\nallshouse\naccommodator\nremoteview\nkalymon\nsummercase\nwachenfeld\nvozdovac\nsmashingly\nhaigwood\ncomponentsource\npizzahut\ncalamander\nsummerlong\nspartoo\ngrazin\nprevaricator\niannicelli\ncurrenttv\nfordhook\nbbdc\nhaterade\nareng\nkrainy\nfreedome\nbranchini\nmillsport\nribery\nadug\nlathallan\nexigente\nlasok\nlorinc\ndcpi\nalbader\nhochedlinger\ncrowdpleaser\ndobrik\nstudwork\nsashaying\naopo\nproglio\nbeizer\nspirax\nprochnik\nruvin\nguernesiais\nkwing\npbsg\narkstorm\nilano\nvawts\nviasystems\nacknowleges\nshoeprint\nunflustered\nwilkman\nsealyhams\nlinjun\nneigbours\nupex\nseeclickfix\nspelga\neskelsen\nintenational\ndabaghi\nwisecracked\nselnes\nforbiddingly\ntomihisa\nbliadhna\ntasovac\ntoeloop\nbodhar\nsartison\nmcmoore\nhideway\nstattersfield\nreibman\nstogel\nconinue\nesdm\nmicrobusinesses\nfacination\nforesworn\naftertastes\nhackenburg\nfinnbar\nthorsgaard\nprgs\nskalicky\nmipomersen\nrvot\ngloblex\nbeatboxed\ndignes\nanouma\nstealthwatch\nzimasco\nlinnel\nhyperaggressive\nmagnana\ncivette\ntyrees\nfernandis\npmvs\negly\nmarkelle\nmounis\njonnier\nwidescreens\npycnogenol\ndazl\noldways\naccustoms\nwoozley\nstreetworks\nglencaple\nschweichler\nurtiaga\ngoldsim\nundersupply\nlipidology\nuitslag\nhauter\nsapientnitro\nilinois\nhoxsie\nchadband\nbeninoise\ngoudswaard\ntamweel\nzabin\noluwaseyi\nflexbook\nvelafrons\nsauvion\nbeynat\ngmai\nmutilators\nunifab\ndokht\nmehrjoui\nsurestart\ntendinopathies\nbtween\ndivinia\ndouna\nfaintheart\ninsor\ngarringer\nnolvadex\nnumberof\natacand\nlavarra\nkribs\ntchotchkes\nicepacks\nalagno\nlateraling\nhaydnesque\nmonobrow\nmaizar\nlightish\nkillifer\nghoula\njarus\nuppercrust\nbereave\nmusikapong\nrichardsen\nchicest\nnuwer\nmacilwaine\nchards\nunderthrown\nconzo\nzephyros\ntecce\nistrobanka\ndirienzo\ncitrines\nshipster\nuntrendy\nsynopsize\nieah\nlissom\nchozick\naksentije\nliveatc\ndigenova\ncapoor\nrafli\nstalinistic\nunvalued\nclinked\nsarghoda\nmebroot\ncyllid\
nesfehan\nzavoral\nmartt\nmirrorbit\nkufeld\nheixiazi\nshnider\nmullahy\ndorko\nmaresco\npentremawr\nskodas\ncifg\ncampsmount\nflavorite\nhosl\nbanwart\nrailteam\nbernadito\ntomashoff\njanjigian\nrmst\nyagawa\ntalae\nwhitopia\nfeau\ncristofanilli\nimangali\ncolmers\nkilbert\ngackenbach\ncrassest\nhighnesse\nzhizhou\nretrolental\nogbulafor\nteether\nhairo\nbarket\ndecrescendos\nadorama\nreinagel\nngog\nperkey\nmicrocenter\ngossom\nouttara\necholot\nmasalin\ndownshifted\nrosenheck\nsuperjets\nnemwang\napparenlty\nelisetta\nlindners\nhafsah\nextenstion\ntulipmania\nahart\ncembalest\nayatolla\nirangate\ngubaz\nzavagno\nfnpt\nsnowies\nschwaig\ntzarev\ncroel\npalumbi\naelon\nkajen\ntwitterfeed\nfekter\npriorites\nzekeria\nvarmland\nfiercewireless\nbubkes\nkamonyi\nexpiating\npeekyou\nannapurnas\nishitani\neilenfeldt\nahmadinajad\npelargonic\nchawalit\nzarich\nrestavec\nlnat\ndoretha\nlinyekula\nncafp\nterie\nnexbtl\nbancsystem\ncybs\nrangeworthy\nelopements\nronks\nslurps\nchesnel\ndevraient\nrepigmentation\ntorgovnik\nkarnit\nshekelle\nolice\nrepresenation\nzhonggui\nestablishmentarian\nconquerable\nbujnoch\npsittacosaurs\nhiccupped\nvanderboegh\nwalleen\nsurtitle\nsippers\nrechichi\nrevalation\nmakuei\nfumba\ncammaert\nlampariello\nthredup\nbathon\nmakhenkesi\nbavituximab\nplusio\ncarandente\nbenfatto\ntrimega\nnextradiotv\nclev\ngilboy\npannabecker\nvahia\ncoquillette\nbratic\nbrudney\ncaliforniavolunteers\npaatero\nstultify\ntoqué\nkavaf\nileret\nincierto\nkansho\nmicromanages\nostle\nxunlight\nbardella\nisleib\nomarius\nskrastins\nmokwena\nakhgar\nrutino\nfulmore\ndocusoaps\nnonathletic\nlippis\ndipex\nkarayilan\nsacheri\nafesip\nbuonaiuto\ntaavo\nhesseldahl\nluisel\nputsches\npaven\nhartvigsen\nbezark\ninnundated\nindianans\nsivanesathurai\nchelopechene\ncyberweapons\nbouard\ndaury\ncolitti\nnonplayer\ndailynk\nneurotech\ncapels\nrespray\nlichtenwalner\ninvestorrelations\nvirtuallogix\njakucho\nschmölzer\ndlcc\nkleberson\nrotech\nsceti\nobtrude\nroombas\nkholwadia\n
cicale\nplayng\nyucumo\ningleson\nosenovo\nhumanitaria\ntrilion\nkashper\nkuthe\nanzhen\nalvine\nunangst\naethlon\nclintonite\njazziest\nprogams\njianjiang\njunkshop\nnungaray\nyabunaka\noskoui\ngolubchikova\nmagomedtagirov\ndangana\nelshaug\nncredible\ntisin\nouisa\nlisenby\nnuraini\nruecker\nebie\nonodi\nsomnambulant\nfallu\nunjustness\ndvorovenko\nconsalvos\ntlapehuala\nhootan\nlisis\nrelinquishments\ntenaculum\npalek\nfricassée\ncomilang\nreconceive\nfirtree\nrosegarten\ncollucci\nejehei\nunblindfolded\nprestedge\nvoordewind\nviewerships\nurli\nmalgieri\nababeel\nyumyum\nyandicoogina\nshouwang\natabani\nimbeds\ntottingham\nwidsets\nvasterling\nbleckman\ntomljanovic\nkhameini\ninnumberable\nargouges\nrhissa\nxlx\nvexingly\ndervi\ngullino\nmalasian\ncuemaster\nboxter\nwherewithall\nalekno\nschorle\nnigori\nvangard\nunloosed\nlekach\nharnek\nprepurchase\nsacn\nendter\nsielicki\namanjena\npelevine\nmicrocalcifications\namatrudo\nexoneree\nsmwf\nabereiddy\nbettane\nwebbies\nriverbay\notpc\npalmenberg\nmuolo\nackert\nrefinetti\nbusness\ncouncelling\nnasiganiyavi\nschlubs\nmorizo\nbeikou\nshokhin\ndysfunctionally\ndeerstalkers\nfrankensteinian\nbencivengo\ndusks\nanghie\nbisoi\nringlike\nengressia\nfrontloaded\nprattles\ndimwittedness\nonetel\nratnesh\nflanz\naggrieve\ndelcour\nnadali\nmoosajee\nszrom\nprogessive\nminczuk\nreinject\nindevus\nantiriot\nbortolini\nmicroculture\noufit\nkevern\ncrosswater\nmetrinko\nwlaschin\nwijeyadasa\nantiphishing\nroséan\nkorge\nsiguiriya\najwright\nyellan\nshellsuit\nlazie\nacdi\nspadeful\naustrlia\nskived\nreqall\ngirnius\nswigged\nvivra\nalaitz\nduhnke\nguaging\nricelands\nzerrouk\nsurk\nhailpern\nbatsuits\nbroughs\ndissemblance\npapaye\nrespo\nsetoff\nrepubs\nfudger\nwhiteways\ntotobiegosode\nkingwill\ndawki\nsintonia\nlejnieks\nmeei\nstrenk\nfreddiemac\ncounteroffers\nhemwall\npyenson\naaqil\nmcclear\nhunstable\nreasonalbe\njeromey\nacofp\ngabarone\nhyperventilated\nautoexpo\npeppersmith\nbibbings\njudiths\nhumorlessness\nraydel\n
nonsuicidal\nchamie\njaggedly\nbettaney\nfortissimos\nlutker\nnavic\nlasis\nriddile\nbirzer\nshayeb\nshanzhen\nauctionbytes\nmedvecky\nahic\nhärter\nsuperheats\ntaneal\nhouseowner\nsolino\ndeynes\nwensveen\ncmtl\nblackhearted\ntransoft\nscarify\nsauerwald\nmoreish\nantigenics\nfanciness\npisaturo\nnigussie\noppourtunity\nbochinche\nflunkey\nraynesway\nreinikka\nsaadeddin\nmvumi\nfeltmate\nkleptomaniacal\nactivequote\ndifferenet\nregasified\nbestrides\nbasardah\nfulsomely\ncoiley\nonama\nmeganne\nsentras\nbatom\nchilcombe\ndisembedded\ntrunkful\nplaycast\ntanaiste\ninspiriting\nblairon\nsneeky\nhuangci\nknaupp\nwiid\neuroclassic\nreimport\nghullam\nfinl\nnared\nsocìetas\ngachoka\njackmanii\ntrikilis\ndeclarable\nrismondo\nbasbous\ngodé\nfatahian\noilrig\nsdcp\nfurriners\ndowntowners\npompholyx\nfarbio\nvangala\nsuhardjono\ninmage\nnordson\njianfang\nsqo\ndcmd\nscossa\ncozar\nsuctions\npigmenting\nmonstering\nmeglioranza\nsweatheart\ninkpots\nbenthien\noneof\nvernola\nalore\nberettas\nhogwart\nffcb\nsalac\ncayler\ndoonies\nautopacific\nwiniecki\nnondefense\nsouthstreet\nallscott\ndemichiel\noverprice\nscorchingly\nberru\nwajsman\nsebonack\ngivebacks\ntrickily\nmullaithivu\nkotting\nperambulate\ncavelike\nnicandra\ncraigour\nguenot\nformidible\npowerlabs\nperkinses\naabpara\napls\nobayomi\njolita\nshkurtaj\nvanezis\nfssd\nkapetanovic\nscheft\nalyanak\nretrains\nleskin\nposole\npalansky\npuigpunyent\ndevay\nbieniawski\nnutmegging\nalbertrani\nhadjuk\ngoryunova\njumbish\nkhales\njegher\nfemara\ndefensman\navapro\nsavient\nspoofery\ndharoor\nwolensky\nuogb\nyhency\ncuminestown\ntresspass\npensilva\nchikwelu\npercuil\nsafyan\noverinvolved\nreactivations\nwoodfalls\nvarelas\nmuhtathir\nzabrocki\npentons\nligocki\njensvold\nvaneman\nmullee\ntolemaida\nmacrosty\nunrushed\nchariman\njauntiness\nnnpt\nnogy\nhorible\nasssociation\nkawale\nbollore\nprommers\nredounds\nfreeborough\nmaglis\nmensun\nstratstone\nvikuiti\nneurocare\nbustami\nmarenzi\ndalmane\nillmitz\nfrancises\nunder
shorts\nangwenyi\namadine\ndabic\nveddy\nhctz\ngruessner\nbifeng\nrecaap\ncoccaro\ndiminuitive\nbachur\nenpa\ncharleses\ncyberpatrol\njannuzzo\nwalkstation\nroppo\npolysorbates\nsianel\nchallengingly\niverse\nkarkkainen\ngreyman\nposedel\nbarfe\nneilia\njinal\naftermarkets\nhajda\nturismos\ncitkowitz\nzoidis\ntrotskys\nungraciously\nmidlist\nmagneux\nkamysz\nsinosure\nwihtin\nlesha\nrepellers\nhybritech\nfulston\nblitzkriegs\nbalconette\nfitchet\noutscores\nvirtis\ndipenta\ninteq\nrigl\noftsed\nneindorf\nhindcasting\ncorehead\ndorleac\nvukic\nkolola\npfis\nshovelin\nbaszak\nmelanee\nenglemon\nwbmd\nbowlhead\ninsm\nnolhga\nkeynoting\nlaussucq\nunstowed\nmedbøe\nbelayet\nlishen\ncheesegrater\nsubcription\ntracewell\ntambunting\nadolygu\nharshav\nfaruqee\nallahdadi\nsissified\ntheera\ntmsuk\nshugak\nomniscan\nchevettes\npikestaff\nphambili\nrhinefield\nkuchenbecker\nlukonin\nlifestreams\npolarn\nebps\nboxercise\ntexmex\nblondish\nshishkhanov\nwallender\nderrion\nrusticate\nnsia\nolfson\nsameem\nwhizzkids\nmazria\nbestir\nlakner\nkenshoo\nmeghir\nkrauthamer\nreannounced\nblousy\nzirok\nphotiadis\nsmia\nreinvasion\nsimbex\nmusicforthemorningafter\ngothberg\nfeasters\ngenteelly\nflometrics\nknickerbox\nfeuillatte\ntulchan\nkitcho\nquietman\ncopans\nwythnos\nbravinlee\nsanctimoniousness\nmathez\nszyf\nmamuyac\nnonroutine\nkamikatsu\ningla\nmizani\ngpic\npuedan\ngoadsby\nistore\nkrowne\nnorthacre\ndeganya\nkayishema\nspielers\ninternalises\nmuhummad\nanthuriums\nbankengruppe\niped\nheiligman\nportuguse\ncgdc\nrosseel\ninfantilize\nschenone\nelbagir\nbeccan\nyasouj\nhudgell\ncallfire\ndeerbolt\nleiblum\nineloquent\ncussedness\nhaslauer\ngotic\nmolindone\nthecall\nfeminising\nakipress\nesquerre\nsuriyasai\nagnitio\nneelofar\nmadrilena\nfolkloristas\novereats\nwhizzgo\nmajit\nlogomark\ndunlopillo\nmcleodusa\nschear\nnonvegetarian\nturkstat\nriversley\nrenoirs\ngadekar\ntigertext\nristelhueber\nhubbie\nshaprio\nhoffinger\nsylvor\nliepzig\nersek\ncomorans\nsouryal\npolitition\nz
ere\nbrideau\ncelizic\npmti\nglasfryn\nweinhart\nthumbplay\nduperreault\nmadencilik\nplestis\nshaiq\nshabih\nsoapies\nantimodern\nredevelops\nnegligees\ntorarica\ntagai\nbrothman\nnaggy\nigcs\nhahahahahahaha\nrohtenburg\nelectorial\nlongheld\nrooseveltian\noverbook\nthermie\nabfs\ndothard\nsintu\nechouafni\nnoshki\nmediacurves\nboosbeck\nreadytalk\nlarrowe\nwanabe\nappetisers\nbaumgaertner\ndeminish\nxiuyu\nleftfoot\nnakhle\ncloudveil\nkohail\nstrentz\ngermophobe\ntrotignon\nledgemont\nnaoma\nghosheh\nheadcover\ndrotar\npreatoni\ndynamicops\nchantra\nelchlepp\nfanteni\nususual\ncelebutantes\nalguire\nleivinha\nhaemost\nsagerman\nshraeger\ncolliford\ngardler\nbrutishly\nsendlerowa\nautoban\ncherkos\nsmull\ngolfballs\nwafertech\nmucinex\nreitell\nlogicvision\nzatloukal\ngleichmann\nlazca\nartigarvan\nwooky\nidyllically\nodabash\nbbumba\nrosheuvel\nefit\nelridge\ntaea\ndriade\nmcanelly\nhitsp\ngelowicz\ntalebzadeh\nsteppingley\nkontakthof\nvrae\nsatava\nvinacomin\nshpiel\ntechsnabexport\nkaptel\nhyperinflated\nduessel\nvileda\ndombrovski\nmasscap\nmoonfest\nkheirandish\nsupersecret\npyszczynski\nfenkel\nproposterous\neastell\nkoufman\nzierath\nnabintu\nchorcha\nløkkegaard\ncusos\nheicklen\nvlock\nartwerk\nschwabian\nthanachart\nwbenc\ndepinto\nusaec\nfenproporex\nmirakle\nkozodoy\narreaga\nkforce\nwepler\ntsalikov\nlapica\npezula\ntavaria\njoudi\ncorviglia\nsoftgels\nkapahulu\nkarber\ncretons\nscrapbookers\ntwinbill\nmansouria\ncommunit\nciza\nkriwet\nbdhf\nbreakish\npacesetting\nweisbuch\nshogan\nvivisecting\nschramma\nhampnett\ngameboys\nmcarabia\nlairige\nstrokeplayer\nsvase\nbibelots\nrivermark\nbelltel\napplera\nacomplished\nnjvc\nsepticaemic\nfelini\nrmhcsc\nrawood\nexubera\ncurnutte\nvechicle\nbajorek\nvashishth\nvolumetrics\ngongsheng\ngrumpiest\nramlow\nperatallada\neservglobal\ntroublé\ngetequal\nunartistic\nlytal\nstrausses\npedri\nmachiavellians\npazel\nridded\necotect\ncorveloni\nminimorum\nshamburger\nchambar\nperusals\nlandgasthof\ntrammeled\nplexxikon\n
derinda\ndalpiaz\ntoffy\nzafon\namirkhanova\nmarzola\nmutayri\ndtas\nrepot\nbassuk\nvillification\nsuncare\nkhomeinist\nluari\nlollygagging\nfireplug\nhalaco\ntriessl\ntendonectomy\nfalby\ngallahue\nehnac\ninflationism\ndardentor\nflueckiger\nlickteig\nfettled\ndomenika\nbarme\nmummiform\nivideosongs\nunopinionated\nrubeor\nintransparent\nvercoutre\nprofiteered\ndomenichini\nmonolines\ngentled\noverapplied\nhejna\nmaching\nballbearings\nfiners\nskycourts\nmillitants\nschlüer\nhomecomers\ngrommek\ngamt\nfranzo\nkeiaho\nrugasira\ndrčar\ncasulaties\ngroov\nnoonmark\nelderflowers\ngrauls\nbattlemind\necosmart\nsoftkinetic\nbesmirches\nclubgoer\nprediabetic\nibbeson\ndepouilly\nmultileaf\nnebulousness\ngorenflo\ncorace\nmahanna\nramadin\ncarbinoxamine\npacquaio\ngobbetti\nfoldover\nfurbearer\nwindsocks\ndabab\nwollack\nhladowski\nopinium\nspykee\nconvivia\ndevauchelle\nsuul\ndallasites\nshareprice\nkulinski\ninglemire\ndinola\nkevane\nschwehm\nmagovern\ncontritely\nyouthlink\nbegginning\nbatterberry\nbiodigesters\ngedow\nkocken\nafonwen\nluging\nneratinib\nkenzler\nborqs\ncorrance\nqureia\nmalcontented\nrilin\npcati\nprizant\ntanora\nanshakov\nhied\nbridgefield\nlocavores\nfalafels\nmarkhouse\nuspt\nsombreness\nbabycare\nwitthoft\njijun\ncheesing\ntherkildsen\nhaegele\nmadaleine\ntastemaking\nabvs\nawooga\ngemlike\ntahdig\neffertz\nautralia\ncalingasan\nkechik\nfriggen\nnetshitenzhe\nkoskas\ntaleon\nsarnez\ngunewardena\npolymethylene\nmicrobloggers\nhostelworld\nlyublinsky\nkwanchai\nmythbusting\naskale\nutsteinen\nbarimo\ntaybarns\nfibia\ndissatisfactory\ncomvita\nconsisently\nmccarra\ndrugan\nkabbal\nirurita\njanangelo\nsoliant\ncopemish\ngrimmelmann\nunhinges\ndubitsky\nchangey\nifanc\nsectra\nromich\nbronis\nmazard\ntognozzi\nfriebert\nafue\ndescripton\ndovale\nzuska\nyasman\nkollins\njalai\nafirm\nballyowen\npochat\nturrent\nkomisaruk\nhomoeopaths\nadvertisng\neconsult\nshelver\nfrix\nfuyao\nelectioneer\nonmessage\nminibond\nchumminess\nendulge\npastings\nbeanywhere
\nwinpenny\nkohrman\nweals\nfunkified\nchampine\nhugheses\numguza\ngähwiler\nviskase\ngeospatially\nmatw\ntestiment\njzj\nborther\nxaiver\nsplutters\ngicheru\ndiddams\nlincolnian\ngtec\ncwik\nlexcycle\nvljs\nwriteing\nwheatleigh\nstyrenic\ntabiou\nshough\nerhlich\nskimboarders\namantaka\ncalestous\npesacov\nstolzius\nmedallia\ndepletable\nresing\nbiomatrix\nconstantiner\nunexpressive\ntrilliant\nsooreh\nxybernaut\nkulakowski\nchunlei\nboffeli\ntravelodges\nscootie\nhubberts\nspongelike\nmirandized\nmirandize\nalss\nbulabula\nmojaddidi\nchoudhri\nbeva\ntardies\nfuoss\nbarankitse\nshidlovsky\nshanavia\npitie\nestrellatv\nborukhova\nthoroughman\ngywnn\ngoerges\ngoldenrain\nwtfc\nuspca\nfirb\nplunky\nchanc\nemptyhanded\nmumc\nnwanze\ncgnpc\nsteckman\nopthalmology\nboomeritis\nminova\nhumdingers\nsketchiest\nstoutz\nbrodcast\ncaschetta\nsarnies\nengrossingly\nanounce\nconcience\nnoud\ncompx\ndebottlenecking\nkanharith\nfynd\nlansbergen\nucberkeley\nwhiteburn\nsikkens\nalphacat\nswitchovers\ndbfx\nlocricchio\nchind\nvanned\ncruddace\nsupporte\nlilos\nnickiesha\nrosebys\nwellawatta\ncryobank\nhomosexualist\nredbaiting\nkimberlina\npiks\nabrham\nstoskopf\nzenergy\ntransparência\nayelen\naboudi\nlejuez\nstieren\ndunnhumby\nfarmout\nngobese\npoltician\nhoedemaker\nhousehusbands\nacik\nkarakus\ndillenbeck\nmowlavi\nbelluci\nuchc\ndoup\nsubzwari\navailablility\nborchelt\nvisudyne\nwimbeldon\nidlet\nmacchione\nbeytenu\nqaddumi\ntajique\nfromong\ngonul\njihaad\neucerin\nelkarra\nsolarcentury\nortegon\narciniaga\nalavian\nheadd\nfireworx\nmenedez\nsoftlines\nseroka\noncampus\nchmela\ndeserie\nrybeck\ncolee\nsupramax\nbaleira\nrvers\nchitr\ndziuk\nmolaschi\nmalinoski\ndellape\nbarcalounger\nnarcisstic\ngodah\nmiltant\nnaudero\nmarily\nkostow\nolmer\neqecat\nfoulstone\nkasav\nadahi\nathat\ncutchins\nrigaudo\nnancledra\nchtr\nnosherwan\nfaaf\ntjmaxx\nunbuild\nboffetta\ncoopi\nseipei\ntchuruk\nngosi\nhistrionically\nlalomanu\nkachinsky\nfalconnet\nquiter\nmcalear\npinked\nmalalane\nna
vajoa\nvakapuna\nraduka\npeacher\nhoneybears\nsudby\ndefalcations\ngabot\ngiessibl\nsolovic\nebulliently\ngarone\nyanshen\nrerras\njunny\norigamist\nkopenawa\ngunel\ntariffing\nhumongously\nowassa\nsqueo\nforyd\nnorthwester\nforgetfully\nlochardil\nmacromutation\nunsanitized\nthanklessly\nsadec\ndelamontagne\nfremer\nreadle\nfruz\ngoventure\nhollandois\nbulgurlu\ntseycum\nusdx\ndescenza\nncpdp\ndogheads\nfrenze\nbetaseron\npreferreds\nlating\nrevoy\nnusoj\ndispirit\nstaadt\npalled\nebitdar\nmilyo\ntappolet\naurn\nwellesian\nquaintest\nsherbedgia\nbehin\ncitroens\nmarcellos\ndonio\naxiotron\nzinkann\npeform\nxojet\nwoodpigeons\nratnesar\nkarisa\noutspark\ncynhyrchu\njurados\ntuputupu\nrepasts\nyoucan\nsyman\nshimel\namtote\nzerola\ndevoré\nsukhois\ncbocs\nrúm\ngeona\nsrlc\nnuvox\nplumchoice\ngingersnap\nsheinkopf\nlellenberg\nseic\nprexige\nmachinelike\nfighers\nlastinger\nplethodontids\ncaotang\ntarpin\nabreham\nhipposonic\nmagezi\nmoghaddas\nmishak\nretkofsky\nbogacz\nholtzbergs\ngomart\nkurniasari\nwalnes\nizdihar\nenpei\ngrappolo\nevanzz\nieak\npicosulfate\nfredenburg\nwelfling\nblinkoff\nreserveamerica\nhempsell\ncigler\ncvik\nkoelling\nonica\ntranquilised\nsinnette\nearthecho\nmerran\nnuumbembe\njuking\ncadus\nrushcroft\nneftegaz\nhazmiyeh\nschmadtke\nrefought\nmossmorran\nsinovel\ncalcifies\nmalary\ngracko\ntrangressions\nwickenberg\nkazinform\nakoi\nkcic\nkeerthisena\nsandboard\nmachalek\ngaleai\nspherification\nclixtr\nmascini\ncervalis\ntracesecurity\nrevoz\nexcoriations\ndistroyed\namrican\narduaine\netxeberri\nhasagawa\nshagrir\ntannaghmore\nredways\nanihilation\nolow\ncachay\ndomnitz\nmemorystick\nlabyrinthian\nenegy\nsaadiya\nbinyuan\nblazic\nriofrio\ngokay\ntachyarrhythmia\nborgny\nsedmoi\nshokusan\ngriebenow\nejk\nsunsetting\nroofbox\ntomt\ngqt\nincisoscutum\ngotkin\nottica\nstovroff\nkavenagh\nzaimoglu\nlewisdale\ngrimshaws\nimazapyr\nknautz\nyonemori\nanthropomorphizes\ncongel\ncysylltiad\nsierwald\njunade\ntrupish\nisrl\nmadivaru\nmewl\npretti\nelo
mar\ncourters\nbenningfield\nborsodchem\nbuvette\nkalist\nintrieri\npropser\nvellay\nrashkin\nmilliohms\ntechnopromexport\nwuffli\ntamug\nkerevan\nperscribed\nsquirrelled\nkernerman\ndarbee\nfowzie\nteeshirts\nsamels\nsoothers\nweiger\nplasari\ngallick\ndiginity\ngirhotra\nkomansky\ninveigling\nbelorus\nsoild\nmorandy\ncaralyn\nstoley\nhawlata\nrazeghi\ndidik\nalasin\nsuaveness\nmaxner\njoset\nedenwood\nadminstered\nabstractors\ninvincea\npennyless\nanado\njurgielewicz\naona\nmagnun\nlunchers\nkemkers\nlatifiyah\nhupond\ndominiak\nsambhi\nshepik\nkhupe\nspykers\ncliffhanging\nabsently\neabis\nmidpines\ndomori\npeterbrough\nrevivifying\nmickum\nnonstatutory\ninvisage\nmelaney\nhutcher\npaliivets\nnetshops\nquigo\nturny\nlamak\nghurkha\nearcup\nskab\nspringfree\nmpumi\nsigurdardóttir\ndepoliticizing\nmuayyed\nrepeate\nleising\nnarzan\nsuseno\nburkland\nelkady\nscratchpads\ntoeholds\ntimecard\nabhra\nnerakhoon\nlowbrows\noponyo\nghayas\ncomandantes\ngmwda\ngobie\nesayas\nlerners\nzuora\nklasko\nkokayi\ncinematch\nexobiologists\nluobei\nbouga\ndelusionally\ncvrs\nchetra\nbiaxin\nreichgott\nrequiris\nwoolite\ncounrty\nnamad\nameril\ntembagapura\nmockey\nsnickometer\nfabby\neirich\nadineta\nbogogno\nkubrik\ninterract\nqchat\neliaz\nnonaccredited\nbrudevold\nglenbranter\nlaglio\nwnats\nhavebeen\natayi\nlowgar\npolyradiculoneuropathy\nklutzes\nhearsts\nmoshen\nxiaoquan\nsebastain\nrousingly\ndiscoved\njouzel\ntradepoint\nalahuhta\nbirtwisle\ngirotra\nquestionned\nvasby\nhazina\nborgerding\npcga\nseakeeper\nwangjialing\nborntrager\nmagdaline\nrenumerated\nstubbled\nharneys\natfa\nmayakovskiy\nrumery\njacunski\nreibstein\nintraracial\npeolple\nnicaraguense\nresponsibily\nprasidh\nklaber\nblairo\nferrufino\ngallactica\ntareck\nsahaku\narciuli\nuplc\nnofas\nyurov\nblancher\nworklessness\nmilhaupt\ntanayev\nmoudgil\nlangert\nvinopolis\nnejma\ndnsr\nxega\nwilgoren\namplex\nlevkoff\nmittelplate\ngissa\nmattusch\nakeroyd\ncataclysmically\nignjatovic\nsiegelbaum\nmyspacer\ndefog\nbo
wab\naseza\nrousell\nafrician\nberaza\ncretul\nstabaek\nschéhérazade\nheatproof\nciatti\nraushenbush\nscathed\nquadrozzi\nconvis\nlabeur\nprecentage\nromers\ninsightec\nmplms\nbrasilero\nchanton\nsheffy\naliwa\ndhamala\nacelera\nthilmany\ntainsh\njędrzejewska\nnovitas\nqdd\nexpatriots\nbridalwear\nniquero\ncernota\nnovoseven\nnormobaric\nguoming\nsumroo\nsorah\npoed\ncthomas\nkibumba\nglendronach\nnucletron\nsneiderman\nkrusoe\nzephirin\ndmard\ndeskbound\ntradgedy\nbistis\nchaweng\nnthi\npathologized\ncumulous\nyoumail\nbrazzale\nkarunaratna\nadmc\nnordvig\ngcobani\nproselytisers\ntukuafu\ntritely\nlooing\nloyle\nkillinochchi\nsmbl\nproceeed\nnijmeijer\nsupplementally\naudika\ntodger\nspiridellis\ntenderfeet\nclynelish\nmemorizable\nsmattered\ndycks\nlitterers\nwirelesses\nkeville\nnasibov\ncasalotti\nzuccato\nlosec\nxalatan\nmessiri\nrafterman\nupskill\nshousheng\nierardi\nepicerie\nwaliur\ntrouillebert\naquaterra\namstelhof\nreplants\nkazlas\ndaedone\nvandebroek\nguiltier\nmoninder\nmarfork\ninterlog\ngumblar\ntalafar\nrattin\nboooo\npeopole\nhausknecht\nskarpnord\nthembisa\nchmc\nsmru\ninutiles\npongpaiboon\ntarrell\nddisgyblion\nvetterling\nnizich\nrltv\nhinni\nretirment\nlonglists\nmetronorth\nsellgren\nposterboard\ncrescencia\ncloues\ncleberg\nclaimd\nsalomao\ncasterline\nayrshires\nstuffin\nrepossessor\ndabelko\novercalled\nshinnojo\nmocktail\nnoerdin\nchekroun\ncolusso\nfurtiveness\noberwetter\nmafura\nenterpise\ndrymonakos\nzacharda\nkrasdale\njusy\nsluder\nbekke\nfareeha\ncasesa\nassisters\nsulkily\narchirodon\nborho\nreputationdefender\nnickelberry\nrejigger\nrhinocerous\nwbmc\npossis\nschipol\nlindelani\nhatzes\nfhlbank\nmrtyu\nliversage\ncaucchioli\nintellecutal\nkirtzman\nrdq\nffostrasol\nnownownow\nélitist\nvianet\nshakiba\nrecirc\nkhakoo\nemergance\nturndorf\nebco\nmalane\njarg\nbackpedals\ncrommie\nsaamiya\ntreehuggers\nbuffetts\njoele\ndezman\nosaghae\nkilburns\ncemea\nkalogera\nmadaen\nukad\nhenaac\nstraggles\nkattegatt\ncirtek\nmayhle\nefvs\nmogr
an\nnewschools\nbierling\ngarwe\nminezaki\npogram\ntegge\nioco\nsearchingly\nwasiqi\nsalky\nkalloo\nfollowes\njiskairumoko\nfejerman\nkrajisnik\ndupler\nranchita\nftrans\nmccarthyites\nambulancemen\nbalmorals\npotatoland\njowharah\ntransportion\nbelkas\ncaly\ncolbon\njimador\nnidorf\njreissati\nunplucked\nsaarwellingen\ntomcito\nsunich\nwhackjobs\nadesa\nschops\ndalesio\npiyasvasti\ntsetan\ndemarius\nchimdi\nkronkite\nfinz\nmusahars\nballycolman\nwybie\nstottlemyer\namranand\npargman\nanandalingam\ndanhi\nshabwani\nlinkon\ndepthless\ncamers\npuigdevall\nxacti\nrelandscaped\ngerel\npotpie\nracsa\nolotu\ncepia\nplumrose\nridgelys\ndriveaway\nmozafar\nvidlak\nmuhire\nofficiators\nwuerker\nzunshine\nunderlinings\nblisset\nguoying\npalmerino\npepparkakor\nmachipongo\ngashaw\nphilpots\nsenkel\nstandale\nferensway\nzebrano\ncolemen\nceysson\novulations\nsatterberg\neoms\nkavee\nflightview\nlebanons\nsolemly\nfangman\ncffi\nfinel\nmaundering\nnegotations\ngieringer\nfarouqi\nchallahs\njuliénas\nlehotsky\nrubinowitz\nstridex\ntwitterrific\nvenezualan\nedingburgh\ntallyrand\neconoboxes\ntaglieri\nsheiman\nmualem\ngpy\npommies\ncredulousness\ninvaluably\ntoukie\narrowpoint\nlmrabet\nervasti\narsu\ncitröen\nliszewski\nosunsanmi\novercommitment\nschisgall\ngiraldez\nshamalan\naquisitions\ntrigen\nvanrooyen\npowerfulness\nwilbrod\nlimbourgs\nsavou\nmasakela\nkalvitis\nmacwillie\njennerjahn\nterrico\ndemolli\nciic\ncatrinas\nalamolhoda\nyalies\nminibook\nshibis\nwintemute\nhaijiao\npadillo\njuenger\nsciquest\nmccarton\nwierzel\nhazelhead\nggtase\nbrefs\nbendich\nhamane\nsheratons\nkulat\nchisi\nplakun\ncdus\nrafea\nliguasan\nmcdermotts\nuncontrolable\nhulkenberg\nlegistlation\nshilou\nsalkantay\ndarkazanli\nunmistakenly\nfatuzzo\ntouristed\nsnorters\nbraison\nandolina\nnoven\nhegsted\nsurgicare\ncadapan\ngriddled\nbousada\nplcm\nusibc\nebersman\nmoontoast\nturballe\nhmoud\nsamares\nblusters\nshubaki\ncowells\nrumberger\nderestricted\nannouned\ngermaphobic\ntaggett\nbotticellis\nma
rgol\nmammuth\nerace\nnalge\nfiremark\nrefugios\nmontepeque\ngratuitousness\nwetters\ngreendykes\nsharil\nkinno\nsitrin\nfneish\nmabhena\nbadush\nrendleman\ncadged\noveur\nkroizer\ntrelliswork\nchagares\nratners\nhekhsher\nunloyal\naspec\nmorledge\nkamuntu\nbiesty\nhabibiya\nrobeks\nrefinancings\npaprikas\ngrrm\ntoking\ncomputerising\ncorodeanu\nmilkings\nkildary\nskyfari\nkidswear\nbushcricket\nbollixed\nwallcovering\nschoenwald\nffgs\nanalytix\nbirkenstocks\nsefland\narvanitakis\nbernholtz\njoltid\ndellia\nelcoteq\ndecends\nbuckmire\nteamwear\nbelby\nestablisment\nsufyian\ninnoculate\nameriques\nprocuraduria\nyunhui\ninnotech\nruwaida\nliolios\ncheapy\nrutowicz\nvigoa\ninstituions\namouage\naniasi\ncasias\nlennmarker\nyonto\nsauteing\nrouas\nspruiking\ninmobiliario\nforiegners\nextemporize\ncancercare\nwinehouses\nnumberi\ndiddums\nnonggang\nbucklo\nbrauncewell\nundercounter\nglybera\nzharmakhan\nsibeam\nstoutmire\nosscube\nunreplicated\nmohau\ndisneyesque\ncroute\ninescapability\ngroebli\ndollarisation\nappendino\nbernabè\ncmit\ntiii\ndipity\ncontabilidad\nhumanick\nburtone\nsheddon\nverex\nkalishnikov\nclimbie\ntexturizing\nsikku\nrecurrance\ngringas\nhipnotic\nexquisit\ngogava\nqeybdid\nhashash\ndefillippo\nmusonye\ndarunta\nrahmouni\ngeocacher\nautocentres\nchangingworlds\nembelished\nleitenberg\nureb\nnealey\nbaringdorf\nnotarizing\nklimts\ndementium\nkafoteka\nbarnshaw\noutracing\njounalist\nquerce\nwebsafe\njoek\nanzueto\nparayil\nbicyle\ndijlah\nrwdi\nglaspy\ndonnent\nmonigan\nnaadac\nolitsky\nguirane\nhumungously\nwallstrom\nlegistation\nexcuted\nslotboom\ngavilon\nwnes\nmpowerment\nplastech\nklawonn\nsticka\naspentech\nbronques\nballestra\ngrevill\nplatoni\nasss\nkhuria\nskittishness\njonagold\nintolerence\nciarrocca\nsmoaks\nversacold\nyalennis\nlinza\nzabini\nrubiner\nzhixiong\nestablishement\ndeliv\nlavatorial\nlayva\nepiscopals\ntagong\npoluleuligaga\nhabid\ngarwick\ntshepang\nsuperhead\nspainhour\nurband\nneutraceutical\nrhynchophis\naspirus\nkoelme
l\nimmunising\ndistaghil\nblacklaws\ntingled\ndrwy\nmevlud\nshelledy\nbreteche\nzhenmin\nmooches\nlieberfarb\ndamione\nspirko\nimbrasas\nlazaroo\nsugery\nchusak\ncanvin\nbashmilah\ncabieses\nopande\nspaisman\ncoulsfield\ngoidel\nsmooter\nburklo\nvirgoe\nlylle\nikhwanweb\nredprairie\nkhodos\nunrequitedly\ndestoyed\ninterheart\nmingqing\nschoolcenter\ntherminol\nhightails\nsokhom\nzhanar\ncoshed\nlimbering\noxytricha\napwg\nluochuan\ndefenselessness\nzeyda\nbootiful\nbevine\nenviornmental\nfreecom\npitoitua\noberwaltersdorf\nwraggs\narnautovic\nsilima\nsicked\ntapili\nyamgnane\nasymetrical\nsewnarine\nfassihi\nmilberger\nhaeggman\nderesse\nfreidoune\npelosse\nchirashi\nvenissieux\nadamji\nvogueing\ncomor\nconchord\nkobau\ngaravan\ngiannandrea\npanameno\nbamdad\ndufresnoy\ntymor\nelbmarsch\nxchanger\nhelgren\ntrabbi\nkhaymah\nlevya\nmoujik\nbenezette\nakhlaghi\npietropoli\ninfosurv\nmulticourse\nsmartwool\nthrowout\naltimas\ndrefelin\nhxm\ndelvings\nforia\nbarcaple\ndebtholders\nboissonneault\npetalas\nsorial\nmunyakazi\nnationalmannschaft\ncambadélis\ncircustances\nescrows\nkeilen\nkissenger\nlifecam\nsourer\nchumleigh\nmeciar\nbigeyes\nchailleach\nmoralised\nhymenoplasty\nagbami\nkeyaron\nzeebroek\nglimmung\nbombardini\nimune\nbalinda\nffriddoedd\nharach\ndjinnit\netirc\nsparts\nforbeswoman\nneumos\ntrickler\nborwankar\nbraais\nkerper\npeforming\ninoculates\nfleita\naguet\nabdelfettah\nkhajavi\ncommy\nfioriti\nloller\nfirebag\nrtog\nchakaipa\nlouks\nshabout\ntekori\ncoway\nursuleasa\nloudham\nfoulards\nvonderhaar\ngoetzl\nxcalibre\nblochs\nhabetz\ntawafuq\nhoujian\njerie\nhypergrowth\nheathcare\nwishnie\nnextar\nstst\nmitalipov\ncamdeborde\nwellenberg\nowyang\nllrw\nschouppe\notkritie\ncounterproductively\ntruley\nsaridakis\naubey\nlillestrom\ntrevers\nstepgrandfather\nvirnetx\nailman\nglobeleq\nibercaja\nreveiew\nsalmansohn\nstrm\nasaib\nyilishen\neliadis\njaquemet\nliscano\nyehiya\ncsbr\ningelise\nslochd\nbrandix\northofix\njunhe\nmariaca\nsaood\nyoffee\ndubens\nmy
nbayev\nsprawler\nrestabilize\nsfcs\nkhadraoui\nadrenalized\nsimbolon\nbescos\nkinghan\nyachtbau\ntsigdinos\ntrokel\nskywalking\nyefang\ncorbato\nmidelton\ndatoo\nkillt\ndajabon\ntcdl\nmasisak\nkiminori\nraleighs\nminging\ntimesand\nklebitz\nnauls\ncoffland\nfielea\ncurtley\nmailshot\nmacroy\nrebloom\ndeicers\nsundkvist\nbuyanov\nhatboxes\nbrookses\nunhealthier\nrathina\nluparello\nmoraal\noukaimeden\njodean\ndevyne\ncallgirls\npapple\nzeromax\nkurzyna\nscratchiness\nflinter\nsebc\ntouble\ncrockpot\nfoofy\nkusasi\nazilah\nslicking\nbntm\ndinerral\nzimmerle\nsardonicism\nhippiedom\nwtkn\nveddahs\nrefashions\nwiebenga\nstavins\ndarmer\nnbsc\nfreefloat\ndufficy\nsubashini\nsysytem\nkilquhanity\nrepresentivity\nsimman\npositons\npowerblock\noutstaying\nroeckel\nflowmaster\neckholm\ngyorgi\nhasibuan\nnaipospos\nbrittlestars\nayeli\nhuijia\ntransapical\nhousam\ndettra\npekins\ngwrs\nmalayev\nbocevski\nllorenti\nswingby\npaddywagon\nwavery\nmaniraptors\njanaagraha\nkrueckeberg\nschmall\nnizri\nmartore\nepbs\nburred\nlevulan\nbiumo\nwindbaggery\ntharrington\ncarped\nlomonte\nwizzit\nmcgory\nkasserman\ndenouements\nrcrc\nsertig\npercoco\nsaidur\ntrendspotter\nlupetey\npriniciple\nnuthetal\ngasprom\nunphased\nunhorse\nkarwoski\nberinge\narboc\nbonterra\nintiatives\nstiglic\nmokambo\nskilcraft\nmonsees\nmateljan\nlevantado\nsplodges\nartsquest\naforge\nmwaniki\nhuachen\nperogies\nvideoscape\naveion\nbreezier\nserykh\ncongree\nbandwaggon\nzerefos\ncanonisations\ntumminello\nyurkiw\nweisburg\nsamoens\nlichy\nabukhater\nsioban\ncruciverbalist\nextortionately\ncrocked\ndubik\nbaaaack\ncharnvit\negerson\npredevelopment\npotbury\naccg\ntrainride\nclottemans\nwizbit\ntrundell\nlifevest\nlammermuirs\nghosthorse\nhassas\nbasateen\ngavora\nhasman\nshaoyong\nsumaidaie\nocloo\nopressive\nskjodt\nwaltic\nnuvi\nconeway\nanthropomorphise\nhungai\nperfformiad\nmavimbela\nvetrazzo\naustralopith\ntamuly\nluchezar\nadivce\nadvanstar\napruzzese\nbowoto\nleyzaola\nireporter\neriam\nvonkleist\ntave
au\nhouselife\nintradivisional\ndefensives\nmonacolins\nkeevill\nsomfy\nraychel\nnonprime\nmaestracci\npgds\nyergeau\nblueeyes\nkostelac\ntouradji\navtec\ncarvello\nturnd\nbumpf\nseref\nmultihued\npulmoddai\nadriyatik\nneutze\ntrinajstic\nareheart\ntdbfg\nscdi\nmonocrop\ngemütlich\nrummaneh\npeterken\ndozoretz\nultralounge\nschuhbeck\nsharyati\nagunnaryd\nlaquinimod\nricigliano\nbewilderwood\ntripitikas\nhuthi\nchakushin\nkremlinologist\nunclamped\nglaciergate\ntechine\ndardennes\nzahwa\nsapergia\nmusngi\ncanoscan\ntizza\njived\nsztykiel\npishchalnikov\nsalathe\nscholnick\nmicrocurie\ncopaque\nmobilizers\nsafecrackers\nmenotropins\nziso\nnaimah\nzenithoptimedia\nintelliquest\nsobelman\nadjunctively\ncloudwatch\nguocun\nbourzac\nclearsky\nthornsbury\ncibils\npanaderia\nchangewater\nmyocardin\ndenene\nmadziva\nmezuzahs\nabdalati\ninexorability\nbioculture\neitf\nvitaioli\nvickroy\nnonstrategic\nmorotopithecus\nturken\nrawland\nyibi\ndmitro\nsigurjonsson\nmerighetti\nghaire\nbouteloup\nbeugre\napsco\noddsmaker\nrustically\nsunnyville\nhpmc\nlinkery\npottermania\nturiano\ncimpl\nnetsanet\nmuirtown\nffpe\nunderbidding\nempirix\nfarrey\nlahoris\nkubr\nletford\ngarapa\ncapful\npurevia\nclavichordist\ndiamanté\nmcmannis\noldstead\njahon\nscunny\nschnegg\nddess\ngransport\nirace\nardiansyah\ncosn\nbellweather\nbinit\nimjingak\nsinabang\ngrandkid\ncasasnovas\nmediahub\nredjeb\nrfec\nzacharakis\nraufi\nfutureless\ntomatoe\nphillabaum\nfoodwatch\natomises\nmerlau\npassacantando\nmyxo\nwinzenried\nayung\nmaryjo\nmumtalakat\nneigbouring\npanden\nshaminder\nchoragus\nkatesbridge\nsockless\nsebasco\nwoehrle\nvernae\nfamly\ndllr\nchemnutra\nmalangré\nhajis\njarwan\nmanorohanta\nplaschkes\nfelidia\nconfidents\nwhitnell\nrepremanded\nhahnfeldt\npfannberger\nxianting\nohiri\nnegotation\nsuleimaniya\ncashcade\nschofer\nfeczko\nshriller\nroulez\nnearn\ncarnaroli\ngronberg\nkeilholtz\nwattenbarger\nprometea\nvasileff\nvideoplay\napodeictic\ndispossesses\ndetered\nblogpulse\nlambreaux\nkin
sell\nforthside\nsicortex\ncartoonlike\nnontenured\nkyama\nmancrunch\nmavrinac\nunsaddling\nzloch\nbendien\ndustups\nibrayev\nmamool\nturcas\nrecentered\nidoit\npopularisers\nshopsavvy\ngawked\nfritzel\nbiblarz\nthowing\nsharrers\nmailander\ncerron\ntolins\nmarqueis\nrejiggering\ndevestation\ngramanet\npowerfuel\nexpe\npalliatives\nbroussards\niditarods\nulsh\ncurrid\nhérita\nverduno\nnatsal\neroticization\neasynews\nziha\nwiranti\nscoopful\nniyitegeka\nhallowich\nrehospitalization\ndevincenzi\nnadarasa\nsabriya\ndesexed\npollman\nsavaya\nnsofor\neisin\nubisort\nbelatacept\nshabbiest\nchangsan\naxid\nnorber\ndelievered\ngebregeorgis\ntukssport\nkhedafi\ngoldbar\noutift\nkoloroutis\nsheikholeslami\nborkenhagen\nsavlon\nwidemarsh\nsiarad\nhdaci\nzushan\nporshe\ngiddon\ncarrizalillo\nintersec\nkervorkian\nferk\neasybib\nshamsia\nbarrasa\nmhuintir\nfilopoulos\nswordmaker\nureilites\nprofesionals\nleigha\naustraila\neldrick\neggbeaters\ndickers\nkotzian\nhighquality\nemary\ncalderaro\nadako\nmowaffaq\ndevhub\nfarthermost\nlostness\nabdulmalek\nbarrista\ngiboney\nsoftbook\ncolesbourne\npaln\nwaginger\nnydegger\nxpressions\npalancas\nfobb\nschoolmarmish\nshowcourts\ngreenbergian\namericinn\nromick\nsemagacestat\nlloydspharmacy\nespinos\nrecommits\nacquaah\nbrouwersgracht\nrcnc\nyanobe\njossen\nkoenigssee\nonts\nsawadee\nchelminski\nclubpenguin\nundented\ngarske\nzacaria\nuchizono\nsniggered\nhouseal\nkhanjari\nkozerski\nkalfan\nkagaba\nantimalaria\ntroutner\nraspin\nneuropsych\nklowden\nlongpen\nbermillo\nhatzius\ncalvay\nepigonion\ndissatisfy\nicue\nwillebois\ncollydean\nhungerburgbahn\nrisoe\ntornoe\ntanae\nromanengo\nbroekema\ndenbrock\ntxeroki\nlanum\nimmitation\ninevitabilities\nwahaca\nntag\nexida\nreuinted\nfrbny\nlopresto\nplatteau\nbythell\nasociation\nneighouring\nlapera\nnaumannite\nkollmer\ncompetitiors\ntaghrid\nsummerell\nincrementalists\neerier\nkolpack\nsheinton\nbaytree\nmarivent\nphonthong\nsuglia\narchaelogist\nhoeksma\novercut\nstrelsin\nsilverley\nlich
tenheld\ndarelle\naskernish\nsegatti\nelandsrand\npriszm\nluciene\ninfragistics\nchorost\nmindflex\nnwando\nrydze\ntunebite\npaediatrica\ncubasch\nwakfs\nhasselback\nakill\natmopshere\nreborning\nsophoclis\nplassman\ngalliagh\nsloshy\njaibi\nostick\nplasticy\nschmiedt\nsanca\nreportin\nwolfango\nethirveerasingam\ngrami\neconic\nmsif\nmckerracher\nsemapimod\nklocko\ncarabineri\nincandescently\nmlilo\nevilest\nnorcros\nbitsakis\nthrottleman\nanthonisen\nmccoskey\ntriblocal\nnotal\nbesilate\nradya\ndescarte\nbionovo\nfirelighter\nclopping\nplacemen\nruchat\ndemisse\nparatek\nmenzes\nehring\nschmahmann\nchermont\nmerholz\nstockpot\naronhalt\nvalentenko\ntulles\npidsea\nthsn\nmujirushi\nsereboff\nthurairaja\nsealcoat\nelosegui\ncrohns\npecentage\nwhirrs\nxitao\ngraniero\ngawlo\nearthwards\nsimerini\northosilicic\njustanswer\nriggwelter\naorund\nbrouhahas\ndebrosse\nandruss\ncityroom\nakba\ntoneelhuis\nthanee\nkambwili\nselesky\ndropback\njesenik\ncurnutt\nherdswoman\nannisul\nenoughness\nocchiuto\nbiblioburro\nparodically\nkresch\nchallege\ncowtan\ncannarozzi\nyucho\ntelnic\nsholley\nlouisian\nrieuse\ntradmed\ntrevemper\nbedritsky\nqualies\nholtzinger\nairworks\nhernquist\nbahts\npappelallee\nerehwon\nsimspon\njetboil\ntabouleh\nxhelili\njocic\npursifull\nschuchman\nvilstrup\namross\ndecifer\nmarkwardt\nabazov\nphaswana\ndifferin\nvitrano\ntoprol\nbartoshuk\nguardistallo\nadulating\nseptermber\nbezirgan\nanisi\ndaulaire\nshowhome\ntafani\ngunked\nptec\nboobed\nkaroke\nantitrafficking\nkyriakopoulos\nstulman\nnickl\nnightastic\nguibovich\nbelskus\nkamynin\nkalisha\ncenturys\nbillionares\nfroebelian\ncardie\nutilisima\njackovich\nbarnosky\nweister\nbertinetto\nfinaldi\ncrowlas\nmingxuan\nclearable\nflarer\nsandpapered\nlobbygate\naqqaluk\nwayfield\nwaterflood\nbaseej\ngovloop\nchurlishly\nvismitananda\nlodenius\nracerback\nfoddering\nmangara\nbarotraumas\nlevale\ndeferrable\nblackfriday\nandex\nsanminiatelli\nnarjes\nhaggiag\nlvns\nacorp\nmuyi\ngcash\noverdecorated\nhibhib
\nsponseller\nstedim\nsuncadia\nrebanding\nusbf\ngospelly\nuncurled\nzommer\npedicurist\nluem\nfalahat\nnanasi\nprevitera\nglobaldata\nwintertons\nbevon\ndailylit\nchunlong\nhomevestors\nspinkai\naddustour\nparrog\nsimulus\npetrosaurus\npopke\nsmri\npassback\ncommunitie\nwassana\nsabans\noligoastrocytoma\ntalbieh\ngoucha\ntchatchouang\nndebeles\nmellace\nmagnetti\npenalities\njaroslow\nfelpausch\ncdrp\nmechira\nlagrassa\nmillesime\nntegrity\nvistec\nramjeet\nmukhisa\nandraz\nickburgh\nliebovitz\ndotts\noffroading\nsnowcover\nsupertex\ntuev\nimposimato\neutsler\ncibernet\nkazman\nlaermer\nfordcombe\ndtna\nscheuering\nsenghennydd\nwassom\nmelness\nmantarraya\ncravotta\nkattie\ncroddy\ngirlier\nalphine\nstacul\nbrangaene\ndramady\nguinazu\ndrugscope\njasiel\nwindrem\ndlugach\nsevenzo\nrisius\nsezno\nnyingi\nreflexologists\ndorritt\nmejstrik\nunblind\ndevaluating\nsummerfare\nfetoscopic\ndancemaker\nhosseinieh\nfrightener\nshenita\nshakr\nadalaide\nbianchino\nspiritclips\nnqetho\nctmm\nkickstands\nstodge\nbrosens\naseff\nmetroplitan\nzues\npeagram\nwigdortz\nyoshitani\nsupergrasses\nvalstar\nxiuyun\npannekoeken\npresentationally\nkullgren\nsooud\npupillages\nhumenik\nautoeurope\nlindia\nhomebuy\nteladoc\nlatzer\nfraxa\nkamimoto\nkortnie\nkamarck\nearthsat\nklaeden\nharethi\nkogito\noschner\nstaney\nspeciosissimus\ngubmint\nfireable\nworrad\njackknifes\npouchy\nanticlimatic\ngyrobike\nchelada\nhaeji\ntjibbe\nrolontz\nburnishes\nbosdet\nwinspit\ntarciso\nsliderule\nstultification\napplebys\ncutietta\nlusciousness\nmariton\nlupson\nautom\nrenegotiates\nprbi\npauperisation\ncopperwork\nstateliest\nrestringing\njurdan\nliechenstein\nengineeering\npndd\ncotonniers\nqateh\nkagwene\nsayansk\nscrumping\nbehuria\nhvps\nmizera\nescobars\nchewey\nkempski\nceraweek\narres\nserritella\ntinkerings\ncerrejonensis\ncken\nwyeths\nprupas\nhoosh\ntroadec\ntuntable\ntrendalyzer\nruebel\nmaykin\nduckfield\nkopczynski\nblus\netymotic\nwahner\nteamaker\nsyafii\nbiohackers\nprauss\nbernoskie\na
mwal\nandary\noclaro\nslurped\nlanzman\nmusicane\nawrt\nestrangements\nsetterholm\nchurchbury\ntendrich\neruditely\ntelk\ncrovie\ncominetti\nswaybar\narogant\nzeelander\nlamana\nbuthaina\nwosb\nkermabon\nsieracki\ndeward\ngphone\nmozafari\ngjorgje\nedig\ndeviceatlas\nunibomber\nwolsley\nkasulke\nrobotlike\ngalab\ntazu\nhawfinches\ngaouette\ndierberg\nsoneda\ndranginis\nsharisse\nsufferd\ngoldentree\nmepolizumab\nfradette\nshirtsleeve\ndisillusionments\nddaear\nfatemah\nsunlamps\npeachcare\nslickwater\nvlcd\nspluttered\nviacord\nberryden\npetionville\nextintion\ngulladuff\npheap\nalvery\nleonidis\naquacity\nflacking\ngrinspun\nilkham\nwenglish\nrabena\nridao\nyordas\nbuonaguro\nlevans\ncevipof\naosc\nmannello\nduzce\nnantcol\nzdunek\nhelico\nslong\nentrup\nwhiteleas\nstringari\nindeck\njochumsen\nwellstream\nmataponi\nniere\nmwesigye\nmeroi\ntremblois\nkaleme\nbiadillah\nrapporteurship\nmosala\ncuiv\nposegate\nchhon\nvalentins\nopthalmologist\npanmunjon\nmaryl\nmulticonfessional\ndomjan\nharlans\nmarzluff\npointsplus\npeoplefinders\nsimulative\nentraction\nzguladze\nfuyushiba\ngerami\nenergías\npernin\nfortfield\nnurowski\nlwara\nravishankara\napologias\nalberstadt\ntravesser\nyennifer\nuccio\ncannongate\nfitisemanu\nbraveboy\ndisastor\npregant\ninfusers\npantymwyn\nvaco\nhighleigh\nilcho\nrochez\nvapourising\nunderplanted\ncenckiewicz\nshadjareh\nplasticene\nramierez\ndrca\nbline\ndoje\nsymingtons\nhermelink\npedatzur\nlionette\npiacentile\nsandpapery\ncubbyholes\nknofel\nkalogeras\nfeiger\norganistation\ntweetups\novalles\nchainwide\nshoptaw\nlifesouth\nvandrevala\nbarbis\nsatyadeo\nbroccolino\nrabdhure\nkeylime\npohjamo\nsadean\njoudeh\nunpartnered\nfurfari\nsuperstocks\nbargu\naquilar\ncolbin\nprenups\nbuddakan\nhelsington\npowerpacks\nlipworth\nmoutsopoulos\nintereste\ntavelli\ngibgot\nvegitation\nallaho\nmashatu\nbuulo\ndynacast\nolagbaju\nschwerk\nsolventless\nwigga\nhaymans\narthenia\nbanze\ndeddeh\ngarzarelli\nspahic\nelwak\nbjog\nwideboy\nnaderites\nesraa\n
greenmead\ndeflations\ndurado\nmicrocentro\nyawei\ncordani\nnormacot\natishoo\noportunities\nfleischacker\ntrahn\namireh\nfalceto\ntonnere\nrensberger\naerni\ntimberhill\nnooooooo\nnisgs\nmunarriz\nconsignors\nadcommunal\nshulong\nmaralee\nincessent\nraddad\nshabes\nelfred\nrushgrove\naherf\nmamduh\nhizzy\nlucratively\nayalde\ndeskphone\nprogammes\nquailed\nrhymesmith\nmilitzok\nskytown\nrattanakiri\nsudoko\nsupportsoft\nkwasniewska\ntungesvik\ntotengco\nairmedia\ncatholes\nhirshler\nbaldeosingh\ndegrease\nhittisau\nelshof\ndanovich\nnarenda\nshaali\ntrepidatious\noverperformed\njamiyah\nklun\naxtmann\nvariani\nrephotographs\nlongroyd\noutdistances\nnowais\nmunud\nmoggies\nshirtwaists\nthuronyi\nfarw\notls\njorgan\nfallafel\ngoatherder\nkosterhavet\nnarvaiz\nkacelnik\nxinzhu\nexonerees\nkhurts\nollusion\nyaffle\nlilts\nzabaglione\ndenmans\nkumkapi\nkemfert\nmoonshadows\nalessando\naudouy\nkorsts\ndioli\ngendell\nslebs\nphotiades\nbramman\ngueliz\nslideout\nbiosaline\nsegements\nxenicibis\nrolos\nnithish\npolybag\nskittling\nwicoff\nbamieh\nsightholders\nhauntworld\ndirouilles\npratices\nyakker\nabcn\neurocrat\nsayedan\nfulmor\nmahmidzada\nluja\neritoran\nunderperformer\ntremopoulos\nweddy\ndjibrine\nsuttirat\nmidwifes\nmedplus\nnoncapital\ncheesbrough\nmadhoun\njuknevičienė\nrhinodoras\nkazai\nrespresent\nlanxon\nmesomorph\nshahpoor\ngasgoigne\nfalkands\nchemtob\nalestra\nregualtions\nkurshid\nmacropetala\nsensoy\nfirends\npantalons\nstuddal\nczisch\nsalinated\nbalefully\nintrax\nyarrowford\nemruz\nmonochromatically\npeschanski\noudah\ntrencherman\nhafedh\nkoitz\nchitiyo\nespley\nscambos\nahmady\nmorells\ntalek\nfqr\nmelome\nsoonercare\nogechi\narrowbear\nmpes\nfukahori\nlasondra\nhethersgill\nsantions\ncucs\nflatteners\nfuturesonic\nlehnertz\nngoun\nmatovic\nmckelheer\nciprianis\neough\nwiita\nkuncel\nduddleston\nprudenti\nzenti\nmaarty\nrttemberg\nfirinne\ndrummerless\nadelir\ndizzily\nbullishness\nsiadar\npauperized\nconwoman\nnblsc\ndragonwave\ncinespia\nvathia\
noutmuscled\nbailenson\nnyongesa\nburfi\nsoremekun\npccy\nhedonics\nconnahs\nbalyoz\ndrhp\nharsimran\ncaihong\nmuyin\nlagrue\npfaeffikon\ndisinformative\ntwitterings\nmaalla\namerock\ntassagh\njilo\nbabybjörn\ndefencelessness\nmorrisonn\ncevik\nchukri\napegga\nhottoni\nwreathing\ntokuichi\nsanahuja\natepa\nhardbitten\npluspetrol\nsesji\nvirginmega\nazadiya\niluvien\nhazzazi\nmirina\nmerico\nsteneck\nenomatic\nthixendale\ndominca\nuntransparent\nbehroz\nbigbelly\nhyacynth\nniqash\nshapelessness\ntarascio\ndarias\nreturfed\ninsurmountably\ncappellazzo\nfirther\nimaginis\nchampigneulle\nsignvideo\nmijac\nmalkenhorst\nsherjan\npgcil\nconforama\nbavarois\nyonghao\nzubeda\nglenturret\nhadjicostis\nburnikell\naltaqi\nhaigs\naanma\nbollwage\npongthep\nkitsyn\nnonpersons\nkeigher\nbarbaris\nmashtal\nchabraja\niwanski\nknockings\nmjallby\naldeanos\nqpod\nlangrock\nschweinshaxe\nraschhofer\ncocoavia\ngershowitz\ncheapies\ntargeter\nboppre\ngrmn\nbollix\nobopay\nlendell\nzeppenfeld\nreproachfully\ndoffcocker\ndoubleline\nkellestine\nonevu\nvirginiamycin\nppuc\ndormeuil\nciggy\nguilliani\nvanderbeken\nusaction\nssed\ntrimeris\nhitachino\njbwere\nnvta\nsentencings\ncorbieres\nherkert\nwhatshername\ngawanas\nsatyana\nbovt\ngozman\nincontinently\nkikambala\npravachol\npayloader\nreamin\ntuddy\ncothron\nsengamalam\nfahrman\napidra\ndolatabadi\nsagapolutele\nwaldholtz\nbobińska\nkapron\nchepkemei\nariail\ngoodone\nprelec\nvivix\nhabersetzer\nraptly\nenvenomings\nlizarbe\nvalhallians\nyogurtland\nrudahl\nneigbourhood\nraner\nstablize\nnacds\nijegun\nmindorashvili\nbarigye\najdarevic\nbeddar\nbiotch\nsubeliani\nyablon\ncolostomies\nshpigelman\nembarressment\njoisey\nsmaghi\ntrilokpuri\ngaroña\nwynick\nheadshaking\ntaherkhani\nasiantaeth\ncbtl\nfuzhen\nubci\nomache\ndenkaosan\ntarmiya\nfilewich\nbaiman\nkonczal\nukaid\nbeesands\ndidas\nstarcite\nmolyvos\nvandierendonck\nsalpigidis\njackmans\ndawdles\nscentific\ntemozón\nfrother\nlipc\nmilvina\nlonedell\njaffin\nufot\nhissong\nkleinfelte
rsville\nronneberg\nnaseef\njanian\npharmaca\nkooza\nclozaril\nwojdakowski\nkorostelev\ngastrique\nkondratowicz\nnulogy\ntkos\nhajizada\nkwaje\nnoncoercive\nnimalan\nkruglik\ngenerose\ntelbivudine\npenello\nunliterary\nlaurson\nunrelaxed\nsagrillo\nbiswamohan\ngrachvogel\nmyard\ncirculans\nkhusbu\nvoguish\nbeiliu\nlubinda\niget\ninartfully\najuba\nrancy\ndistributers\nrauchway\nunsually\nwizardy\nworkingwomen\nchmelar\nganakas\nunappeased\nlital\nsalera\noutdating\nserice\nnanogel\nebrahimian\ncombita\njermin\naudioboo\nslettedahl\nwories\nsreemathy\nimbongi\naslyum\nbartended\njermel\nmussbach\nminibridge\nmicromanagers\ncozaar\nshemmari\nbreznitz\nreiteralm\nharsant\ngafor\ndewji\nchilin\nndesandjo\nbugliari\nlabovich\npeepshows\napotheek\ntalibe\npettinella\nhufner\ngylve\nresponible\nhipoteca\nunefon\nmicrobusiness\ncapato\nmdri\nhaike\nudmf\nnovabay\nammonds\nprovokers\nsubmeters\ngutgsell\nrewrapping\ngagara\nliegey\nbouillard\napey\nmeasley\nkotrikadze\nzeidi\nokarma\nflaxby\nbrochstein\nricciardella\nazares\ntrongs\nscrooges\nhorsy\nwallenhorst\ntinyes\nadorableness\nlucimara\ndracup\nczapnik\npropulse\nsillick\nvanmatre\nschaumber\nkylies\ncompartmentalizes\norgansation\nemeklilik\ntamadon\nlackies\ntidi\nscrunchy\nbartran\nchargrilled\ngouray\nmesschaert\nmassud\nboteco\ngriethuysen\nsauvignons\nbicurious\nchuanhui\nivds\nguernesiaise\nbaldanza\ncrvo\nollivanders\ntalisma\nwalstrom\nmitham\nbourbonette\nbeideman\nbcpd\npowerphase\nwinnning\nnewstrom\nterible\ndurgham\nlobala\nkhromova\nrubane\nveliyev\njehmu\nsasparilla\nshorja\nchimbalanga\ndigitises\ndhahiri\nweixiong\nozmo\nstegbauer\nonsm\nrecyclings\nkilpatricks\naccuvote\nsmartbox\nakhunzada\ncabalettas\nbelobaba\nvyborny\nbourhane\negros\nwoodforth\nthambwe\nnazeeh\ndaimlerbenz\nowlia\nbushweller\nnoritsugu\nbaicker\normers\nraggie\nlynly\nagamez\nlostant\ncogsville\nwellpet\nzawislan\nwqma\nchandleresque\nunderspend\nkuhnhenn\nkubbeh\nmorghan\nsicced\nqubaysi\nrevies\ncabei\nperorations\nrealtions\
nmcclorey\ngribiche\naerophobia\nshahwan\nwebanywhere\nhablen\npluthero\nshindaiwa\nteleopti\ndohoney\nbochert\ncuddlesome\nallera\nbendross\ndongpeng\nkilninver\naestheticizing\nebondo\nexpropriates\ndsit\ngeisst\nsinot\nadelaine\npwyllgor\nambulette\nnadama\nstanfordville\ndipersia\nfagerlind\njazera\nhoneyborne\nthermoteknix\nkeatsian\nqinming\ngandus\numtri\nshahrudi\ntwitterverse\ninclue\nmilbanks\nfillman\nzisblatt\nkajwang\nweikle\nwolfgarten\nredearth\nteseq\nfernworthy\nlefor\nakse\ncurlicued\nwetsus\nguidette\nbakhchanyan\nbosci\nblobbing\nplanès\nsaghiri\nlaugable\nchesshire\npravex\ninternationa\nbeckam\nwordily\nnorthhampton\nmiljo\nbornhak\ngravley\nageron\nkoperski\nblamelessly\nlinkia\ntalaris\nmcintoshi\nbawdier\ntheise\nfdas\nsistare\nnaghmi\nipico\nnovostey\ntwighlight\nlackadaisically\nhirbet\nlewiner\nbedwar\nquestionaires\napeing\nfielakepa\ntheyear\nakuei\ncolonoscopic\nkinam\nxmpie\nkisby\nfamlies\nlukianov\nbrandent\nalimta\nkutlay\nsirchia\neitm\npatrico\nhasbun\nbarbequed\ncrackback\nzalesne\nurbanizaciones\nsuceptible\ndunaire\nfailled\nsathre\ntoniu\noverwhemingly\nmountainscape\ninsensitivities\nalimentum\nbackbends\nclearpoint\ntepperberg\nbarkoff\nsultanzoy\nklerks\nrudesheim\ncabassol\nifire\ngreenfleet\nbobulova\nticketweb\ncoinless\ndelwart\ntppf\nintital\nbosland\nroehrkasse\nlashof\nafterwork\nmiroff\ndrevon\nacatzingo\nkillavullan\ncovich\nunaffectedly\nbaechtold\nmaelle\nunsingable\nmadland\npetrominerales\nyonus\nbarafu\naudiocast\ngenteq\nsiglar\nfarham\nslapsticky\nkaletra\neurusd\nlueshing\nandrogel\nchanggyeong\ndairygold\neuraque\navtur\nmemela\nfischerandom\nkinkier\nseitaad\nkopecki\nourisman\ncrescendoing\nahmadenijad\nkerbstone\ngayego\nalpharma\nrptn\nradosevic\ndownshifters\ntrevo\nnaghshineh\nemmad\nbluekai\nprestipino\npetreikis\nkoswara\nasloan\nballakermeen\nshaqa\nkamagata\nstudioworks\nkolish\ndisparagements\nyatkin\nesbjorn\nsocitm\nkomid\nabsord\nmovewithus\ndecadron\naygul\npiraha\ninorganically\nkinmond\ns
paffords\nasmbs\nromcoms\nnshmba\nakpd\nfurmanski\nkhardo\npanard\nguideone\ncourroye\nlahouti\nlaborte\nhibba\nmdms\nstevil\nhardegree\nmamaliga\ngenexpert\ncifta\nzonderkidz\npopularism\ntalusan\npaycuts\nfeanny\nsorini\neeeh\nliccy\ngopers\nbeertje\nohios\nreshipped\nbudathoki\nmtvnhd\nhonghai\nkoifman\ncarruba\ntrilevel\ncautionable\nhonglin\nunnos\nallerslev\nwickramanayake\nelementeo\nsymplicity\nrtuk\nhornedjitef\ngeolo\nvccp\nreismann\nframlington\nvpsos\npygar\nratably\nhealthly\npaasschen\nboxclever\npeschka\nawpr\nmisadministration\nserenic\npeaceniks\nhomina\nholiman\nxingxiang\nentrepreneurially\nterpeluk\ncolbertaldo\nconsumated\nsimponi\nletroy\nbioc\nshbak\nmedlink\nsitefinder\nsordelet\nyaling\nkapandriti\ncustomizegoogle\nsennitt\nrotavator\ngilburne\nesmeray\ngingersnaps\ntealights\noptomistic\nroelfs\nlovieanne\nbuzzmetrics\nmembathisi\nmonasch\ntrifield\nkraisintu\nmusni\nfacelessness\nselenological\nkispert\ncohabitated\niekeliene\nvirigin\nmulemo\nsakvarelidze\nriscassi\nsmallhold\ndamians\ncityfile\ntextainer\nsisic\nsowood\nspriggins\nabstentia\nhaberg\nlving\nbanatwala\nnishma\nplatkin\nshnewer\nshyanne\nzubaid\nbocsa\nmemeory\ndeitra\nundebated\nsharkawy\njolkowski\nneuvax\ncasnocha\nnansan\npansion\nvacumn\nconfoundingly\npalnackie\ndraemel\nmochammad\nalgosaibi\nunbreached\nmacbeths\nameican\ndevaud\ngerlan\nbluephoenix\nsuliaman\nmahfoudh\nmedr\nbateses\nkorinek\nvolberg\ncusak\nhrouda\njaziya\nthuring\nnagayuki\nxiuqi\nknockemstiff\ngivner\nludivina\nmidpack\ntaramasalata\ngerrad\nslashfood\ndoxsee\nnechak\nogonis\ncriticizm\nmogilino\nmyvu\ndecarbonising\neact\nuatp\nalacchi\ndropshots\nprometic\nevain\ndoubledown\nbannermans\nglamourising\nanecdotage\ndemolishers\ncompari\ncoloseum\nkhadambi\nflâneurs\nkuppinger\nglitzier\nheartware\nnexbus\nabashilov\ncamowen\nlsgt\nimarex\nsisavangvong\ngimpl\nmcconell\nthuras\nreefat\nmevushal\ncacaphony\nsedore\namantino\ntillysburn\ntouchtunes\nsemsey\ncdoc\nthieren\nofice\nobaigbena\nmantraps\n
mashishing\nparvizi\nkuittinen\nhayli\nradition\nhaouari\nchrisi\ngiek\ntestolini\nnonmetropolitan\ngluteals\nkronur\nxinda\nkpatcha\nlrri\nklepach\nlinkbee\nglante\nfrengo\nkarubi\nmyobloc\notuam\nredivide\nciggie\nconsituency\nmoukheiber\nresculpting\nqcue\ndestoryed\nblinged\nsailab\nbouhnik\napiarists\nwoild\nholimont\nalameh\nveremko\nfiwi\nsmci\nunscary\notabenga\ncaleca\nfrerking\npjhq\nkiww\nbobsguide\ntvnotas\nlietch\ngvidas\nmeteorically\nmatynia\nwiessmann\nmanferdini\njamundi\ngeolearning\nhicox\ndobrish\noppresion\nlanice\ndacquoise\ntrosclair\nbrocher\nwindpipes\nfulbrights\nguayaberas\ngepetrol\nvarnagy\nkashief\nkosanov\nraineach\nsandhoke\nungloved\nwhippe\nsquarest\nraspiness\nrevelaed\nokech\nzhongzhuang\nsomaia\ndudettes\nandrikienė\nnatik\nramanarayanan\ndarnowski\ngentlemanliness\nadhp\nkhaidarov\ncerebal\nplutzik\nsantigie\nfallibilities\nschaeuble\nclariano\ndodgily\nfunkwerk\nashiestiel\ngoosestepping\nneugent\norinoquia\nmauel\nstructual\nritazza\nameo\nsolamere\nkopište\nlizzimore\nbasjoo\ndmva\nlynnea\nlukoshkov\nwaterwings\nsynchronoss\nitgi\nralfini\npretsch\nautho\nnourian\nsageview\nhassanien\nhedigan\ntavai\ncorsendonk\nsiriwan\njaybo\nverdiem\nfuljenz\nyarmash\nsweetners\nmolak\nmarcalo\nispor\nglru\nencams\nyouba\nhelyn\nwateridge\ndragonballs\nmuschett\naccountholder\nclemencies\ngemmologist\nfrometa\neuropejskiego\nporstmouth\nwaterspace\nconsumate\nfossilizing\nmussawi\nharkess\nuncanniness\ngyory\nzwahlen\naffilliate\ndramesi\nhiroichi\nfrogmarched\npresentability\nwhatsits\nnefe\nsweere\nszavay\ncarvo\nwanjin\nsangota\nrinconcito\ngobey\nfomentation\nfondues\nbeore\ntrakin\nawwwwww\nliftboat\nbenata\nmagande\npignoli\nwormuth\nmbow\ndyrell\nmariches\nkasdin\ntolou\nparliamenary\necoterrorists\nbrushoff\nwuethrich\ndundreggan\nbuckberg\nbhagidari\nmcgarrigles\nyamith\ntarged\nundersteering\nimmortalists\nswecker\nastropolitics\nnewfel\nlaudonio\nlowcost\nleadframe\nglorney\nhestitant\neythorsdottir\njianling\nmhrp\nkeag\nguerze
\ncidery\ncococay\nmccolly\nmellau\nwolchok\nlbsf\nbeefburgers\ndenaturalize\nnumayri\nssrt\ntwork\nbergstroem\nportalatin\nbrunellos\noddsac\ntraffiq\nsatisfication\nclawbacks\npublc\nnarcotized\nhartpence\ncolombopage\ndetoriating\nbrassiness\nstivanello\ncandelora\nfidanque\nmatricidal\nextened\nliveras\nibrohim\neinfochips\nmisetic\nteamworking\ninzlicht\nrockwater\nhoedowns\nkoteswar\ntackier\nsolih\nfleapit\napetit\nhandleys\npebsham\ntrustafarian\nqumranet\nnardy\nlunak\nimpenitence\nnonhlanhla\nmolgaard\noversample\nboguslawa\nnxstage\nroadsweeper\ntherapod\ninderal\ndewitts\ntutition\ndapuzzo\nlawleys\nombré\nmerzouki\ntootling\ndemostenes\nlotoro\nrepya\nsusica\nanbarasan\nflightlink\ntroj\nwinghouse\nraiments\nbtmu\nmienis\njasira\nflagyl\ntittilating\nmizban\nrodean\nmingli\nludila\nsosostris\nbrachiosaurs\ngrammaticas\nhartshay\nmasiko\nliptsin\namanzi\nflyouts\nalmery\nvatnajokull\nedilov\nmaliah\nolubayo\nbookin\nsolaces\nceiops\nshiach\nbudzik\nrageful\nsuppon\ntrichologist\ncourtoom\nantisatellite\nmenetrier\nopne\npludermacher\nplettner\nhennegau\nablynx\nconsquence\nwoollven\nnyalenda\netvs\nhasnan\nedwan\nmmce\nvoxant\nduhul\nrisgaard\nickiness\nheoa\nnesirky\nalishayev\nreoccurrences\nroisín\nduckinfield\njeanswest\narbd\nharpoonist\nalcos\nchabangu\nkomolafe\ndadush\nbearwalker\nqisda\ntabakovic\nboumeester\nghettoise\nquavas\naattou\nsobowale\nmetallireducens\njhoni\npreemies\nvagni\npcsu\nzigged\nviklang\nsterzel\nfinucan\nswithins\nshaull\nkuglin\nscylletium\ngokova\nplevan\nruun\nprefight\nfairmindedness\nmaplecrest\nblairtummock\nchatbi\nbuyline\nbwea\ncatrall\nprojectwise\neurophiles\nzeale\nbritanick\ncanniness\nguenet\nzachari\nadsu\ntargetpoint\nhuveaux\nbreazell\nbalaran\nlupardo\nmisconfigurations\nurstadt\neurofor\nsupersaver\nnedl\npantastico\nkurgapkina\nbaaack\nnudell\naudioid\nsypien\namrdec\nudey\nmcgruther\npâquis\nibmt\nmisbehavers\ncallix\nhomogeny\ntoystore\nfutala\nkrissa\nchansamone\nberrydale\nparachuters\nmogilner\netes
s\nmalfeasant\nlemore\nngconde\nhouseing\ndeewa\nbrewhouses\nvatandoust\nsemore\nblackheaded\npersent\nkamenetzky\ngroovaloos\nnorphel\nmisaligning\nlandefeld\nmcnaughten\nshaibal\nrevenews\nroundham\ncssn\nbelf\ncocoas\nnovogrod\nbuywithme\nnutrigenetic\nkakavas\nfountainwell\ndafarch\naristarkhov\nklvana\nmohassess\nfambrini\nketterman\njuqua\npetrohawk\ntommasinianus\nisraelies\nnishar\nedelstenne\nbreathalysed\nbuzea\nosteoconductive\nwiill\nstts\nunwieldily\nnnimmo\nattorneygeneral\nfalcos\nnextview\nattackable\nanoth\nslutkin\ndeodorizers\nracisim\nsarracini\nnhsbt\nemoney\nsagaria\nganmukhuri\npetrotech\ndelury\nharbormasters\nskierka\nzipingpu\nanfrel\ngroundsheets\nzeum\nahbabi\nhobeau\nluvsandorj\nıs\ndinman\nscurrilously\nainamoi\nlorho\nofflee\nfitur\nparashumti\ndeathers\nunshrouded\ninvigoratingly\neirwyn\nmagisterially\nnoninstitutionalized\ndisend\nspanishness\njianqun\nsholam\nbonrepaux\nballyearl\nloopallu\nchengli\nyssouf\ncallaways\nairnow\nfrien\nelseware\nkhairudin\nblowzy\nnemos\nhimebaugh\npulks\nsoberer\npopula\nbrawndo\nbrigandry\ndicciani\ndourness\nkusstatscher\nhefton\nvisitng\nshalgam\nphotographable\nafpd\noutguess\nunpretentiousness\nseglins\nuncompetitiveness\nmethilhill\nfridd\nselvaraju\nvauzelle\nbigton\nhaemoglobinuria\nunrunnable\nluhmuhlen\nmayagna\ncastresana\nreenforced\njalepeno\nanaesthetising\nbotner\nwallbirds\ndruzin\npumpgirl\nawilco\noverlimit\nossum\njetfuel\nwinegarden\nmetreleptin\nahmen\nsententiousness\nbacm\ndruckerman\nlanitis\nmaryscot\npohler\ngrantshouse\nsmogs\nbluu\nexitoso\nsexsomnia\nfleysher\nrightfielder\nsalving\ndietsmann\nsweidawi\nhornists\nqarantina\nglenkirk\nmotionbuilder\ngreensome\nibet\nqafco\ndragonas\nraviolo\nortigue\ntuilière\ncicconetti\nhisted\nmanuever\nintimas\nshcherbachenko\npericard\nthundercrack\npyscho\ncammisa\nharres\nposhness\nyakini\nclergerie\nmashele\nfimognari\npurnells\nmexicles\nschatzel\nbeschen\nalloudi\nleocorno\nmckhann\nobsene\nklosinski\ntoder\ngroundwell\nmerryday\
nvistakon\nfreakfest\nentrace\nradiotherapists\ntroubleshoots\nsetember\ntransdniestrian\nquizes\nshostack\nexarchia\nmwamwaya\nrehanging\nnauiyu\ncarlesii\nshehrbano\nflatshare\nhonts\nhaick\noaug\nkyliekonnect\ncliental\ncrenson\nmarere\nmusila\nmeii\nstarbrook\nchoroidopathy\nazadpur\nfarras\nwinmark\nfrailness\nanonymiser\nrestauranteurs\nguardans\nsuporn\nderden\nlemler\npeycheva\ntobji\nmawlynnong\nkipng\nsedlock\ngesticulated\nghariban\ndaloz\nunsureness\npadhraic\nradivojevic\nmaegle\nposcente\nunmooring\ncadelo\nbourjade\nbreitburn\nrebaine\nassps\nmistal\naggress\necowaste\nmogt\nargaric\nsueng\nusselman\nreprogramed\ngrassless\nvahrenholt\ngoeldner\nmcraith\njcpenny\nhalamka\nsisif\nmaeena\nvoepel\nbitatawa\nmuchembled\nsizeism\nmause\nvannatter\nboutboul\nfench\nelecricity\ntsumami\nswaibu\nnigmatulin\nascentium\nseelische\nurwand\nschütrumpf\ngulhati\nhacktone\nsummerleaze\ntreuille\nkappelhoff\namlani\nlinegar\nthenia\neuropeanize\nfunning\nmuia\nbilalian\nneomagic\npygott\nmoudeina\nmartson\ngarrards\nmtds\ngautum\nbergendorff\nplgf\nhords\nbasilevsky\nnorkom\ncryne\nelgas\nelektrim\nbiscuity\nyef\ntuwairqi\nbaktash\nschweer\necobee\nkraftfoods\nunembalmed\nautopark\nnatcen\nblogworld\nrappahanock\ntuxedoed\nimproverished\nsittar\nnorthala\nenlighting\nalexzander\naneurisms\nnuriya\nexplicatory\ntoai\nmaasailand\nbonter\nhornecker\nleonne\nramzee\nzalaznick\ngrasscourt\nmoviles\nstudentessa\nalixandra\nmicardis\nfontelles\ncacophany\nrephase\ngurmu\nballgowns\nbabbitts\nbadoit\nsouan\nfrigatti\nmisunderestimated\nsrsl\ntital\nweitemeyer\nbrembeck\npedery\nmcnearney\ncomoglio\nminkwon\njaseem\ntastykakes\nakca\nakissi\nheussner\nhurtfully\nsourouzian\ntremulousness\nlilke\nesmart\npeiying\nappauled\nbittersweetness\nindicent\ncurosity\nschevchenko\nbivvy\nunikko\nhjw\nkokonas\nbrysam\nyounesi\nqualman\nsaulny\nledua\nnafjan\nfalsest\nyarning\ntedindia\nadamonis\nteitipac\ndublanica\nchipless\ngodri\nveronda\nthreeasfour\ncharlieticket\nzider\nqlikview\
nsauquillo\ndarios\nskedee\nreyeses\nfléchard\npodsmead\nsoliday\nallden\nprowell\ncasketed\nhanhua\nsupon\nlagree\nmutalik\ncaymen\nslotmusic\nkerti\nskeevy\nhypercompetition\npakay\ndegraffenreidt\ndanishmand\nxwbs\nmatombo\nredmans\narshid\nnatirar\nanticounterfeiting\nsigurgeir\ngorbushka\ndebentureholders\nonton\nmacauslan\noosthuysen\nbabblings\ngrummitt\nrahmanipour\nfredia\nboudaries\naldabran\npressclub\nmargusity\nbadylak\nkaleidoscopically\nsafelayer\nstojnic\nwargotz\npatsis\ngrivich\nhaladas\narciaga\ndongdajie\nsutaria\nvouchercodes\nseddiq\njudelson\nmagnevist\neminate\nilincic\ntauiliili\nburght\nohland\nkasco\ntuzantla\ncentraliser\nbuyseasons\nmaratier\nyankiel\nyunwei\nconditons\ndruart\ncosandey\nwhirpool\nstadum\nmvume\njowly\ntsakalakis\nabingdoni\nconedison\napirat\nsearage\nstoesz\nscvs\noldbrook\nclublike\nsuradji\ncalasan\nrehang\nkotsos\nsccb\nguilliaume\nproo\nobstructiveness\nsarnie\nopportunties\nshepperdine\nmontervino\nachany\nacouple\nmiezis\nperlet\nrybinski\nsorl\nrayappu\nhaims\nangelson\nreadynas\npouvons\ncorkscrewed\nkawasakis\nnapatech\nreffert\nbloomgren\ntalaiasi\nconfederating\nandreau\nkabimba\nunpunctual\nguranteed\ntbilsi\ndratted\npizzicatos\npaustenbach\nkenol\naerium\nsofroniou\nmecury\nobstáculo\nmaritial\ntubigan\nburig\nmcrl\nbeatmaking\nweirdoes\nportch\nmaskery\ngoateed\npetroliam\nmemeorandum\ncontortionism\ngwylim\nanuvab\nmofidi\ngurmai\nnawr\npouquelaye\njeke\ntahboub\ngaragey\nnutsie\nswedelson\nsherringham\nmojgan\ngemaldegalerie\nrnld\nwambua\nsarnow\nláidir\nmenduh\nnewriver\ncreyts\nbabakir\nrunggye\nhateboer\ngénoise\nvinography\ngennette\nkorkishko\ncomninos\nbrkovic\nconstituional\nnassfeld\nsayeg\nsizeably\nsteepler\nevalu\nnonkululeko\nscaysbrook\nhumanware\ngerberas\nsonyma\numalat\nltci\nindieplex\nzanjero\nbertsche\nnikolskoe\ndurnian\nplisco\nridonkulous\nserbanescu\nhairapetian\nsixthman\nprooth\nkracik\nrenesys\nshortchanges\nhydrofracturing\nhumanties\nzhongyin\ntracky\nlyreco\npleadingly\nat
toub\noverweigh\nblogads\npaedos\nbisaro\npanagaris\nthueringer\noostlander\nkhudari\ntradewell\nwinchman\nsupersexy\ntorfeh\nbachvarova\noxonica\nsanye\nguger\nbenedikz\neskinazi\nakeju\ncermelli\nphilogene\ndadier\nngarlejy\nleavengood\nstented\nstarflex\nanawratha\nfrailest\nshriti\nschmancer\nvarfolomeev\nclattenberg\npozzilli\ndohmh\nchampo\nbussers\nfilesharers\nthronton\nffffound\nwcpb\nshabang\npolick\nrepackagings\nshewmaker\noverburdens\nfascinators\ngeordieland\nassaluyeh\nwanawake\novali\nboepd\neagleeye\narmsden\nsipila\nhipsterism\ndenktash\nnakra\nbriede\nvantone\nempanelling\nsugv\nubiquitious\nhyperworks\nvneshekonombank\nwelltec\npatreus\necoc\nbordt\nzantaz\nsodra\nelebash\nmuslimat\nmanbeck\nfurtney\nwoodshedding\npoliticing\nhippen\nditsi\nolbas\ndisgruntle\ngeoana\nsuozzo\ndroppeth\njalaladdin\nkongresshaus\nskraastad\nfrancophonic\negenhofer\nkathreya\nnimco\nloyality\npreseasons\nrublyovka\nkalapathar\nstrege\ncdnx\nlymari\nenfeebling\ntendil\ngrinny\nhtpp\nshisheng\nkierantimberlake\nacors\nrfmo\noverfills\npickren\nsenkakus\npratury\nzepnick\nprivitization\nhasanali\nnutbar\njagdeesh\nbewitchingly\ngreenbuild\necumen\ncaracollo\nrevatio\netchebest\nhairclip\nkoedinger\ndaffron\ncardean\ntoolo\nwagonr\ndelaitre\ntomsoni\nphotodisc\ncharcol\ntrumpauer\naurigo\nculduthel\ntosheva\nhijran\nlcbp\nmarajuana\nsolmar\ncervelas\nhavng\ntransmogrifying\nrescreened\ncracke\nbatdyyev\nmilivojevic\nasteroseismic\npalstaves\ngabla\ngeorgeous\nweegh\nmadzongwe\npeleton\nundramatically\ncisri\nalbinder\nfoamex\novercounted\nhoik\namgott\ndrozda\nhawra\nrochinha\nfarat\nautolyzed\navmt\nbochao\nzibiah\ntrevelgue\ncherdchai\ncmht\nhartlage\nroren\nnottm\nnightbook\nweekending\nmenlow\nreuseable\nmatere\nsaaka\ncastlebeck\ncanonicals\nrejecter\nbernoff\njibilian\ndincer\nmetametrics\nmadano\nlamerat\nhanusz\ndiserens\nleovy\nblatcher\nreborns\nalafco\naproned\npeffley\ngardberg\nectel\necomony\nhyperstudio\nkaptsova\naboody\nivara\njpmorganchase\nnearman\ntur
kalo\nlochmoor\nsmartstax\ndegreen\nbelaiz\nmolinaroli\nluedke\ndurborow\nragingly\nusglc\nkrkonose\ntiszalök\nhuynen\nauchenkilns\ncafecito\ntwistings\nparsenn\nirreproachably\nsensoji\ngooood\nannwyl\nthouht\nislamise\nmieras\nkanaley\nwogaman\nclingers\nhousedress\nminuk\nkiefner\ncivitarese\njumbolair\nayerra\nultraframe\ningenierie\nmailpieces\ncommmentary\nheftily\njermiah\noudated\nsolomont\nconspirata\ncringingly\ngastright\nmcconigley\nreheats\nvitton\nplewka\nhrapmann\nvicunas\naftermatch\nthadd\npento\norgainization\ngraessle\ngearchanges\ntabibian\nparvulescu\ngoolen\nwhinges\nerindi\nirias\nsolesbury\nfanniemae\nlolls\ncoustenis\nwaverer\nrumgay\nspringmeyer\nitelligence\ncaoyuan\nscotting\nstrandlof\nhongni\narbittier\ngroundsmanship\nmuchoki\nhellooo\nchuckchi\nshatilla\nshampooed\nmakharinsky\nmesters\nmcgavigan\ndudenhoeffer\npersecutionis\ndowncourt\nmaletic\nmanouri\ntramal\nkierre\nkontz\nriddock\numare\nfoodstamps\nopenstage\nstopgaps\nzenab\nundescribable\nbariza\npridefully\ncalaboz\nlopina\nbrooksher\nbenarroch\neirwen\ndemornay\npapaer\nbrosolat\nesafety\nseamlessweb\nsupsect\nmogge\npilferers\nazulgrana\nghoulishly\nllins\nwettermark\ncoccaglio\narclid\nkinglass\ndjellabas\nreincarcerated\ngunthardt\norpaz\nirreconciliable\nchhatradhar\ndouvall\nmandazi\nmannakee\nborkowsky\ncountr\nembalms\nlabout\nwasicky\nnonvolcanic\nriffling\nsabeeka\nsaddamist\nantistate\nchermoula\nskrzypiec\nlucard\ncittareale\nvaniel\neffuses\nbarcamps\nbaaji\nkolecki\nbongha\nbajramovic\nncdp\noenoke\nmimedx\niahv\nrunscoring\nstammerers\nbruene\nhoovered\ndanuza\noiks\nmuhame\njusr\nrigidified\nschinwald\ntogai\nalmorexant\nsathyamurthi\ncdebaca\nelfert\nliabilites\ndenoyer\nultralong\nklutziness\nexergen\nmcclenney\nchaussade\ncullina\nzibakalam\nziegfield\npeop\nschirmeister\nkenneled\nbodycon\nmoneydie\nensconcing\nbujie\ndragga\nflašíková\nsynbiotics\nbolno\ndnot\nphien\nglamourise\nrelected\nmiankova\npotot\nwintermantel\nsuhn\nsartiano\nmosle\nsquawker\ngle
ncadam\ngummett\nwwmd\njeyapaul\nggers\nantimacassars\nradov\ntransaven\nsentor\nsyangboche\nmultidisc\ngreycoat\nheinert\nchairty\nmaalem\nkctr\nbergstad\nalbicelestes\nrasjid\ncherrey\nmemorialcare\nsmoothstone\nhenpecking\nkerkhofs\ngagloev\nbarview\ntutoyer\npistacchio\nanthoussa\nratney\nromanucci\nmotorboy\ncubavera\nionatron\nequired\nflagellator\nwintersweet\ncaringly\nrauden\nnvoad\nfarmstand\nasiimwe\nganlea\nroodman\nwaingankar\nstrangulate\nrolito\nlitowitz\nkozachik\nknowledgably\nkopites\nfreudiger\nrechannel\nagresso\ncarterets\ntuksal\ncortivo\ndecato\njirgl\nfooler\nmadigans\nexpd\nflanaghan\ngloms\nhoutryve\narduousness\nsaintlike\npodeschi\ntransperency\nplyed\nresponisble\ngravedigging\nknafeh\ndiagramed\nsquawked\nendebted\nvondrak\nmalenkikh\niwinski\nofmdfm\nneesom\ngreenorder\nkurchaloi\nbahmanpour\nghantoot\nsaklad\nefran\ndepoliticise\nhalbfinger\nsaravanakumar\nmengiste\nferf\ncabined\nfontham\nwingstreet\nbachmans\ncology\ntheoneste\nbareroot\nnetgain\nknake\nmgtf\nventurewire\ntightlipped\njujuan\nvertigineux\nseguranca\nguendelsberger\nweisgarber\nrefrescos\nhatalsky\nclerge\ntapsfield\nunseriousness\nentinostat\npayperpost\njasdev\nhabshan\nspritzing\njaslo\nshmelka\nabduweli\nnonpartisans\ngaznavi\ngroundfire\nsnowfest\ntarpinian\npoddala\nkeppa\nsynagis\ngassani\njumaily\nbarcap\ncimarusti\nmegahn\ncoulsden\nlpld\nnhli\nsourcer\ncirkovic\nbuffardi\ncseu\nasgeirsson\ntweeker\nmacguffie\ndominent\nlevasa\nmaloni\nneeves\njehuu\npollitz\nfullfillment\nholmstead\njotters\nheretically\ncountermoves\nmiday\nswarowski\nheadblade\nriemschneider\nhorizion\nfueltank\ndymovsky\ngrundon\nlatsko\narlaten\nsesquipedalianism\nrubefacient\nassac\nnacianceno\nsandyhills\nunrecycled\nenginuity\nbaudains\nschouwenaar\nbisno\nsearchwiki\ndruidsynge\ndontarrious\nmoukarbel\nsmocked\namawi\nemptywheel\npradaxa\nabutaleb\nlicensures\nintacs\nunaccaptable\nsandrakasi\nvillongco\nhalek\nunshockable\nprotsch\necig\nmendick\nworringly\ncampsen\njaunarena\nukae
gbu\nsmetacek\npreprandial\nkamoshita\npoujadist\nresistere\nwestfeld\nxigaze\nprakken\nvictrex\ngeater\ntorgay\nsmarminess\nstuckler\noldster\ngerhartz\nszulik\nbenseddik\nsedarat\nplasticware\nbrainpan\neader\nplodders\ncalisi\nphuensum\nkikinzoku\nwangling\narsey\ngibala\ntantalizes\ndoroshow\nbellmen\nmerrylee\neelmaa\nrivère\nklitschkos\nunemployement\noinofyta\nuncategorisable\ngernat\ntsagaropoulou\nowever\nmeritz\nansawdd\neodt\ncommu\nmabala\nquintasket\nsimansky\ndavilmar\nvipre\nutahans\nmutsekwa\ngiazzon\nveveo\nsyatem\nomnikrom\nvidino\ndeyanat\nwatte\nsubpeona\nkupelian\nmazier\nlefar\nsegelström\ngulmira\nsaddlebow\nmontcoal\nlythcott\ncelox\nkranidiotis\nviscogliosi\ncarianne\ndudack\nbilchik\ncastlelike\ncapretta\nliipfert\nreliablesource\nchidzonga\nnilico\nchikurubi\nguiller\ndrillholes\nyadegari\nadvantange\ntencate\ncommi\nkobliner\ntsvangarai\nhalkia\nbondaruk\noutride\nbachmanns\nruangkit\nhydromel\nronksley\nlaronidase\naurlandsfjord\nkutten\nhamadah\ndoghmush\njinwei\ntarangul\nmoinina\nfibrillating\nolsher\nguatanamo\nhamiltion\ndulberger\ndonly\nabreva\nnanceen\ncomatosed\nreacquaints\nmagnequench\nostaz\ncahr\ngarlon\nknoa\nfuturesource\npennrose\nalvac\ntinkebell\nglandwr\nrecuperations\nzhuravel\njingtao\nstiking\nfrancios\nbiotherapeutic\nshov\nrodericks\ngyopo\nvergiat\nincredimail\nlodeve\nahhhhhh\ncombivir\nalouni\nziercke\nsunlamp\nhywyn\nnuval\nfutatsuki\nkneiss\nmadwed\nmccoin\nmarkshausen\nlinhope\nnorit\nunnegotiable\nbiodomes\neredvi\nlagudi\npricha\ndusia\nhtng\ntalkasia\ndnaprint\nqrx\nsickout\nfasciani\npalcic\nsagardia\nanzack\npraisner\nunprecidented\ngenine\nsuppresion\nrehage\nbennitt\ngugelot\nblaim\nallida\nrephasing\nalertme\nstasinowsky\nwidey\neastcroft\ntragara\nidose\nagilysys\nwassuk\nmitsubushi\nkingy\nunrefreshing\nunabating\ndanien\nmcdonaugh\nguttierrez\nluijten\nkadom\nwilderotter\nriuven\nwidspread\nnienow\ndagvadorj\nkipman\ngrzesiek\nnoxiously\nclimateworks\ncabourne\nedemir\nelkoff\njohm\naquivaldo\nsab
aah\nmmcf\npavluk\nairwair\npristiq\ntraoui\niolotan\nnorhayati\nautodialers\nhendersin\nhyperbolizing\nsubstain\nnureki\nirineos\nniknam\nmenthols\nhakonarson\nsuffredini\nleyrit\ncardoz\nilchman\npratali\npaczuski\nunbury\ncrookhall\nnaturalizer\nbardhaj\nlahim\npleiotrophin\ncopehill\nvengoechea\nnasief\nbotanika\ncalvan\nredstones\ntapitsfly\nmaners\ningoldby\nviolanti\ndeveopment\ndegregory\nchabba\narthrotec\nimmoderation\nlousing\nlaggers\nxiaosu\ncèpes\ntranchina\nmohidin\nbedmate\ncontucci\nknestout\nreboosts\ncotgrove\nsaruni\noaxen\ntofranil\nabike\ndetian\nkurkin\ncutesiness\naddage\nmmbbls\nantisocially\ncyclamic\ntayde\nshahda\ndaniszewski\nrobberson\nopondo\nfougerite\nmannos\ngigapix\nmiczek\nfurhman\nswia\npébereau\nwarco\nsnoasis\nstamatiou\ntropicalism\nreprivatized\nhiemenz\nversaci\neastone\ntombliboo\nminskip\nrevoting\nnicr\nfishier\nscoveston\nfallenius\nkhammas\nstorimans\ngriffithsin\noutcoached\ncassellis\ngarorim\nweightroom\nngowi\nsfogliatella\nmartinko\ntinharé\nlandesgericht\nlivepc\ntynell\nsledded\nopporunity\nabboccato\nblackminster\nfumbler\nramsha\nhankla\nevaulated\napplabs\nblasing\njadco\nquadruplicate\noffi\nperelshteyn\nannuled\ntrupo\nchistmas\nautismspeaks\nsantiphap\nlekon\ngeosentric\nsandioriva\nrothbort\nconney\nmoralistically\nhipe\npoulterers\ncapacchione\ndunlevie\nprevaricates\nronez\niboa\nhaemonetics\nprodesse\nchockful\nfremeaux\nipbx\nnajai\nleggier\nteason\nnarubin\ntricoteuses\nklyve\nkohinur\nrushaway\ntalbiyah\nhuffnagle\naites\nhaidl\ngaouaoui\ngaoxing\nmcclenathan\ncrapulous\nraczko\ndersa\nrasist\nimbeni\nswithenbank\ncollingwoods\nbrynle\ngalactico\ncelum\nzvaigzde\ndownsizings\njubur\nserphin\nseree\nacqualina\nidylic\nbsja\nkisielius\nrasenberger\naliecer\nperinpanayagam\ntestar\ndagsa\nzamarra\nrambunctiously\nsarlis\ndurepos\nchatwell\naadnevik\nsurfwise\nreorientate\nriisgaard\nsaperia\nlentran\nbittlestone\ntwrs\npentabromodiphenyl\nfmsa\nunbureaucratic\npaisarn\nlochren\nneltner\ndogtime\nkashmir
s\nlorelie\nanemona\nthunborg\nhulya\nsayidat\nredevelopers\nanasthetic\ncumins\nafrh\nsupersensory\nregicidal\nspls\nnebbishes\nmodero\npekli\nkinevane\nezat\nindebting\nicenhower\neprize\npedestrianising\nrosiland\nbronicki\nterroists\nbresland\nclangorous\nknicknamed\nhaoge\natrocites\nekazhevo\nsatinsky\ncontemptibly\nadversly\nkleinubing\npeerindex\ninvideo\ndestigmatizing\ngrenny\ngeele\ntommye\nchekwa\ncomtempt\nnyhart\nqureishi\nsimchon\ntunjang\ndeked\nkenndal\nonglyza\nchunyuan\nciccolo\nwirayuda\nbarnados\nmotorports\ntsuyuzaki\ninfs\nmaiolini\nintskirveli\ncorkers\naloudat\nchiazi\nunagreed\nawir\nboughten\ngemkow\nmerentes\npolitco\nberlioux\nmacaree\nconfernce\nyiasoumis\nlhamon\npersiankiwi\nmetlin\nturbinado\njancic\nhetze\nshunichiro\nstonghold\ngavaldon\nlorrenzo\nelier\nwikispeed\nhimilayan\ndiddled\nsoakaway\nsinofert\nhorsewhipping\nmursyid\neggrolls\nruskins\nbargehouse\nzweigle\njarecke\nshikanda\nhostpur\nstodginess\ngenuair\nmikoliunas\ncwcc\nkrinitz\njando\nsarasponda\nvivadixiesubmarinetransmissionplot\nmobileone\nmathiason\nnewbo\nyette\nnaema\nspanis\nneddylation\nsufferes\ngoatlike\nhelimed\njameah\ncathell\nmcparlin\nmountaire\nhoggets\ndestoop\nmediterranen\nshinrock\nzarattini\nsilió\ncornutt\nwoodbird\nhochar\nconvieniently\nundercharged\nhymnlike\nbhonsala\neidur\nmontanera\nmuellbauer\nwrongfooted\nlehmiller\nmediabank\ncollahuasi\nschiera\ndémarches\nshetrit\nmewhinney\nfeminis\nkuljic\ntimeslips\nketino\nrohaya\nparachini\nblakebrough\nfaler\nliteralized\nbkmu\nscruse\nastelit\nrepositionable\ngolba\nqassas\nnammi\nsupranationality\nneuroaid\nspeliotis\nbraconi\npimex\nrestak\nkoscik\nmaternities\nrelearns\nbenit\ngootee\ndrozak\nrockpoint\nmadit\nlachhiman\njontel\npymont\nneedlessness\nlupoi\ncorrosively\nhoell\nmihelic\namercia\noybike\ntnsalp\nabandoment\nnaple\nmcneilage\nringfenced\nlindenauer\neconimic\ncerovic\nmuehlen\ntrubus\nthurday\nrefermented\nrymans\nsandhaven\nbelkovich\nberserking\nsadiyah\nstanmeyer\ncacy\nsnoc
ountry\nbesylate\nschizophrenically\npennsylvannia\ntraiman\nzatuliveter\ncaputy\nmpam\nabramsohn\nprma\nameritox\nnonk\npyromaniacal\nvideogamers\nwenaweser\ntigerwoods\nmasklike\nmougey\nspindling\nisais\nosode\nbleik\ntotia\nobstinancy\nepromos\nlatibeaudiere\nknaul\nsamsungs\nwesternising\ntagetik\nwomacks\nmadkins\nbonarrigo\nzhaxi\neconomicas\nevetually\nmaiali\nchrysanthe\nimberger\nfrischkorn\nvieshow\nsrdp\negde\nzemek\nmarianjoy\nstrakhanovich\nsexualising\ncashline\nmarafi\nbeader\ndismang\nsibum\nmalagasies\nstuffbak\nsenturk\ninglesbatch\ninterlocken\nluoland\nmatarasso\nincanting\nsleeveface\nballinalacken\nzhurov\nkangtai\nboaretto\nfriendo\nkeroche\nuvse\nsporepedia\nmccelland\nmulsims\nketrin\nsudayrah\nreapproach\nbondian\nthermopower\ndowsed\nboagiu\nabawi\nburdzhanadze\nmarcinkowska\nmyoe\ndigitalising\npatk\nmohaqeq\nobang\nyetiv\ntytle\nbritel\nyrbs\nunwarrantedly\ndmepos\ncostan\npenegoes\nrelocking\nsmartnav\nalaixys\njeremiahs\nbactec\nnoncontributory\nwinterize\niabf\nyuthasak\nscartel\nengelson\nwaybright\nhaladjian\naduro\nbellantoni\ncisos\nsahf\nvodone\nislamicists\nhomeform\nfragueiro\nmoben\nminap\ntupolevs\nsubardjo\ntongkor\ngibben\nhandcream\npinakin\nhurlston\nclaf\nmichelito\nsimrit\ntreays\ngewgaws\ngiavanni\nbridlepath\npioquinto\ndesgagne\nrhogam\npowerseraya\npaydown\nmesadieu\nmawrey\nchedia\nmspot\nballasteros\neluxury\nparacetemol\nunintimidating\nwhored\nlazerine\noenophiles\nnarcoterrorist\nfwab\nlawlessly\nlizan\ngasmasks\nperetu\nfridjonsson\ncriminalists\nlenelle\noutgo\nndga\nrojewski\ncurleys\nsevey\nmorolica\nmollan\nnugo\ncolazzo\nmatinenga\nlolapps\nspeeks\njamecia\nilanaaq\nshemayah\nunol\nminisub\npuchala\nsarkozi\nsainato\ncahoy\ngbar\nglendoe\ncheaping\ntelasi\nhastingwood\nnuctech\npalella\nvermot\nwxco\nbladderwrack\nzetters\nvibey\nsuaad\nhwkn\nibahri\nwhitetailed\njeles\noffsetters\npolyfilla\ntruffer\ntechflash\nmultiproxy\npreviewers\nkindrochit\nairpot\nminikus\nbalante\nfizzer\nhandprinted\ndemulling\
nrudrakumaran\nbugaled\nhlep\nuncomprehendingly\nactogenix\nogbuke\nunipac\nnghien\nneckpiece\nbhojak\nshoptalk\nsermonized\nagaing\ngiacomotto\nstasenko\ntousa\nrevitalises\nvitalising\ndattakhel\nevaw\ngetsemaní\nnbcam\nloami\npalestian\ncvis\ntextphone\nfakhrizadeh\ntobman\npasovic\ngintung\ntrevell\ncartoonishness\nvicitims\nkarrington\nemnid\nessoin\nelectrathon\nfolasade\nabdulhussein\nbamsey\ndhurki\nbaasch\naggravatingly\nvijit\nmambe\nqabel\nluwei\nfriguia\nbaguilat\nknittle\nunfastening\nlettinga\ncaiani\nsydrome\ntxtr\nwasila\nfundraisings\ncannava\ningc\nsterilants\nstovetops\nlinty\nhryvnas\nmstf\nbelmain\nunwaivering\ncarpers\npetlyuk\nbroschart\ncencus\nslatten\nneuroeconomist\nbibical\nmuganga\ncatfighting\nsuttar\nafpp\nsuperheroism\nmarketriders\nnimalka\nuncosted\nharazim\nabdulbaset\nmilllion\nciosek\nslipperier\ngrocock\nantimilitary\nrobocalling\npopaj\ndoners\nlynnae\nsenaida\nbinjie\nfayson\nkidstart\nmashangva\nsuleimania\ncalter\ncuragen\nbreakbone\nyulex\ngiammattei\nximin\nunsticking\npullia\noverspilling\npetties\nhaziz\nichinokawa\narsad\nkuchins\ntamarkan\njilleanne\nunpursued\nreinvestments\nchivvy\npiersons\nparowski\noffsites\ngutteres\ngoewey\nstren\ntoolbag\ncomputerlinks\nspierdijk\nczwg\npaitoon\npowmill\nfarasi\ngoeltz\nalexsei\nppera\nkaczmarska\nstadtmueller\nshanking\nrobbeson\necherer\npova\nlightweighting\ncookalong\nstepehen\nrepsonsible\nmegabank\nstrim\nshivender\nbeatifying\nsauciest\nszigetvar\npixs\nrolodexes\npillin\ngothenberg\nsketchwriter\nbarnabei\nflossed\nclitoraid\nzulum\ndramaturgically\nbendicks\nscios\nceredase\nstreppel\ndessange\nwasna\nljdam\nilsac\nyermolai\nsuppositious\npurakayastha\nrollig\naloui\nwuite\nhibell\ncytotec\nuyilankulam\npeacoat\npremeasured\nlocair\nlemuroid\neastlack\npetrizzo\nmalachowsky\nturiansky\nvolskaya\ntribeswomen\nratuva\nchildie\nhooning\nmurban\nmowforth\nyatedo\nsandblaster\nintergovernment\njajab\nmarilson\ndaffiness\nclapgate\ngairns\nshampine\nschneyer\nbeehner\nretime
d\nstalement\nunprecedently\ncicerones\niclaprim\nislamicised\nsejdic\nchakladar\nprecancer\nsovietize\nxiaoqiu\ncountersue\nchitown\ninfonetics\nlipsticked\nkissels\nzubari\ntreaders\nwollensky\ncenterparcs\nlamazou\nnodjoumi\nupconvert\nmagelli\ntakaesu\ndalessio\ncontrafund\nfnih\nschwapp\nzesiger\nkathay\nbewhiskered\nfukomoto\nsalmesbury\nzimbawean\nfiscalia\nerbst\nklyuka\ncandying\nkleisterlee\ngiorgobiani\nojougboh\nschriock\nsusic\nmezedes\nthrobbin\npresicce\nredlawsk\nlijian\npetroperu\nbecchia\nkettlebrook\nmahamood\nyanuar\nstateswomen\nnonelectric\nandrettis\nthanamalwila\ncastleland\nwebphone\njungly\nostros\npandorans\nvaubel\ndodel\nunspayed\nkalafut\nherdson\nymarfer\nchigirinsky\nmyfyrwyr\nbifeprunox\njackhammering\nseasearch\nkiyah\nblackpole\nhendarman\nmascho\nmassaguet\ndevasish\natssa\nmoralisation\ncvti\nganguzza\nsnorkeller\ncppp\ngootman\nqingtai\nmuzhda\nalambo\nloeillot\nsuvanjieff\ncalicchio\nbrnjak\npunjana\ndukascopy\nsparcely\nmylifebits\nismel\nshompole\ncalbeck\nwulfeck\nrebif\nbodytech\nairson\nmaltiness\ntreaster\nboudrow\ndraguhn\nflanken\npukhova\nnpoiu\ninfobright\nassination\nnorani\nmeijaard\naronen\ndrattsev\naccton\nstarchase\nmewed\nmeowed\nsuleymanoglu\nfanlo\nshammies\nbannar\nnyehaus\nsebaoun\nladenson\ndepoliticisation\nlladro\nchurchard\nmontelago\nwaalkens\npremeditating\ntreadell\ncaubul\nnightstands\nsverrisdottir\nrosofsky\nkruimel\narvedlund\nmerkes\ndrasdo\nkhinchagishvili\nintersector\nattram\nmagnanini\ngleckman\ndiame\ndsnet\nsebasti\nmedpoint\nhalterneck\nmccaffree\nunformat\nvatanka\ngallazzi\npspca\nbaghtu\nblagging\nkentallen\ngiannasi\ncontolled\nroaringly\nswitcharoo\nhorami\nequivlent\nxtremes\nnceo\nportentious\nconron\ncircannual\nsarkany\ndatascape\nsubtance\nburky\narikian\nperuke\nfndd\nselimaj\nclearancejobs\nsurveils\ndryfhout\nglbc\nmainous\npdex\nsoelden\nchigas\nblotchiness\nsbordone\nnagpaul\nmalinowsky\nkravi\nwejustgotback\nmisclassify\ncontrino\nbletchly\nrieves\nesmir\ngyns\naiting\nmas
souma\nsharhan\nportentousness\nkauvar\ndemier\nmilblog\noluwakemi\nthreesixty\nleshawn\nirtiza\nowenses\nblahniks\npremcor\nrheumatol\nidealogues\nnorback\ngellein\nstromstad\ngelitin\nfeeing\nciborowski\nrimage\nbjerkan\ntenterhook\nvanalkemade\nanticorrosion\nspillius\njavaux\nhoarde\ndatacards\nhezam\npapangelopoulos\ndefinably\nbienaime\nshrewder\nsmallhouse\njellen\nstaybrite\ncontal\nmkhuseli\nryzuk\ndinertown\nchrisochoidis\ngiltwood\nbuyukada\nreinsel\nadebari\nmoudry\nprocomp\nsweatiest\nyeatts\nerradicate\ntweetmyjobs\norlewicz\ntrashiness\nnesterovic\nsagolla\nchilliest\nmatevz\nfaoin\nsheephouse\ngrundhofer\nfocuss\neconobox\nalner\nsalba\ntalascend\nclatto\nsupurb\nvotin\nyosypenko\nenmeshing\nequivocator\nsimcere\nblasim\noobr\nalante\nconsumtion\nblackarmor\npanyard\nkomie\nelebert\ndigitalbridge\nambrozy\nsnowmaiden\nparamés\ndoncasters\nhurman\ncuttingly\nuaefa\nhardhitting\njeselsohn\nalbiglutide\ndalka\ndragutinovic\nballhawk\nshikaki\nspectactors\nasokoro\nswansfield\ninfosoft\nconfits\nlacedarius\nfraunfelder\ngastinger\nfreyman\nosterhoudt\nadampan\ninterflex\npaytons\nlekuton\ndanhostel\nhabitue\nvatalanib\nbernadac\nmarchfield\nduso\nvondrell\ntooni\nsamander\nscario\nlangoustines\nprologic\nforechecker\ngembe\nbettyann\nworkboots\ngissen\nhysta\nmidlantic\nzoombak\ncirali\nwashbag\nhararians\nmowaa\nzalan\nrambunctiousness\nadrean\ninteraxon\nsampil\nshtayyeh\nyesanguan\nleftwinger\nroettig\nchascona\nwurg\nkouddous\nokenyodo\ndolfor\nsmorodov\nshahedul\nsmta\npolyglutamate\nharjap\nwatcharapol\navantgard\ncrankier\nlungstrum\nexhi\nmohrhoff\nloogies\ntyrrany\nubig\nmushiness\nodmhsas\nkasunic\nkickable\nkochneva\nschwadel\npflaumer\ntriplicates\nrajalaxmi\nbeardley\nwilverley\nsexualise\ngullibly\nbaryalai\nrefusnik\nflowerlike\njetsetting\nsacsayhuaman\nbaruffaldi\nturbanned\nunstapled\nateke\nbuyelwa\nogushi\ngooshays\ntwitvid\npaible\nstrating\ninternetworldstats\nrestacked\nkovalov\nnetburn\nfiledby\nbarbic\nfruitwood\ntasho\npadiri\nq
uartaroli\nstarent\nmufleh\nconferance\ngorat\neidelson\nrubendall\nfavory\nfrauding\nsuhad\nsconyers\ncornips\nveugelers\ngottsch\noosterveer\npulsepoint\ncierge\nsitcen\ndenigris\nguestimates\nmontrouis\njhpiego\nrecriminatory\nperónist\nmwanda\npogie\nboudjenane\ntutima\nbavelier\ncoolabi\ngorol\ntuilevuka\ngartung\nahole\nmuehle\nnahass\ntcheky\ntdwi\nserviceberries\nvivaciously\nfriedly\nnovazyme\nheimbeck\nbeachgoer\nhoferlin\nhopscotching\nossoff\njunquillal\nigiv\nforsythias\ntiney\nbrindas\nintentness\nodep\nhabuba\nhoweve\nbandstocks\nmarcee\nritzler\nreyum\nsalkini\nzafonte\nhenseler\nvemic\nmawene\nrahlir\njacquiline\napture\nunig\nmailat\nlaffel\ndemilitarising\nappositely\nlitchis\ndipert\nzeinat\nspevak\nsayef\ndenesha\nshorefields\nlapdancing\nvieria\ngumshield\ngisby\nbaerbel\ndustings\ncommonweath\nahilan\nkamitatu\nguotuan\nebrie\ncallipygian\ncoama\nzorbeez\nespeciallly\ndisclosable\nruttgers\nfuligni\nbeldini\nkutbi\nnatche\nbohaty\ndrappier\nmyvatn\ntoureg\nantiskid\nwierman\narrearage\nsükrü\ndellasala\nwhinged\nblocklike\nparrothead\nelminate\nqamzi\nfarls\ntaneisha\nyoussra\nkleynkunst\naltr\ngewgaw\nwiesman\nmexus\nholmlea\ntyrany\nkuentzel\nwaledac\ncgnu\nsuperlotto\ncomtan\nnocker\nvygaudas\nbobat\ntightknit\nsajjid\ncoai\nfilerman\nkambakht\nnanoradio\ncybersafety\ntripati\ngoehner\nrainawari\nogelsby\npaksane\nrabeder\nshigetaro\nheffers\ncoregulation\nkausea\ntillabery\nchongshi\nsuchada\nsitzes\npartan\nbreba\nmanzke\ntagliolini\nrevalidating\nsneakerheads\npęk\njcbs\nstudenski\ndisotell\nwhitendale\nbaraniak\nlinsker\nsunbrella\nehouzou\njedburghs\nignus\noutpassed\npuertorriquena\nmoxom\ndaneshouse\ngehlert\ntzaban\nkhatua\nloughrin\nwornat\npunkass\nveghte\nharcar\nspoofable\nbambarger\nispca\nbarnatan\ngaravoglia\nplauge\naccusatorial\ncontinung\nhasmat\nnyangweso\nwideford\nhairclips\ncastrodad\nfeminise\nnaftogas\nrelis\nnieburg\nkurina\nlthough\nesthesioneuroblastoma\nnakli\nanzalduas\nmajhu\nvasty\nchristoforous\nmorganthau\na
llaga\neggless\nrecks\nparayno\npetropars\nrosneftegaz\nipys\nelins\nbulletholes\ncontines\noutstreched\ncapouya\nopthalmic\nkojedal\ninvervar\nrozlyn\nzibibbo\npanhypopituitarism\nathermal\nraskoff\nshaea\nrosendall\nthorbjarnarson\nkenyas\nghri\nthaiss\nmcgoey\ngechem\nofficemates\ngrodsky\npittsbugh\nvitorio\nmanahattan\nmoisturise\nniglio\nnyaumbe\nprialt\ntardun\nexfoliants\nhelveta\nsickipedia\ndelawari\nrelet\nawny\ntrcc\nuntravelled\nprotaganists\nengergy\nhealthyliving\ngaesong\njaskiewicz\nwsna\nrafanan\npolitians\ntapui\nbloodies\nsolidcore\nsabbatucci\nyezerskiy\nritzes\narunma\ntulsiani\nsriharan\nloyan\nbraining\nbagdasaryan\nconsignia\npecr\nuviedo\nsilentnight\ntevot\niecex\ninstedd\nmelvinia\nevanne\nkinesthetically\nhyperseal\nunreeled\nkeckly\nstolidity\nbrackmills\njaacks\nindespensable\nlissavetzky\nzillia\noverinflate\nterorism\nantonione\ncerimon\nministeries\ndefinites\nsobue\nbleated\nmyrthen\naddidas\nmaftoul\njenay\nhriz\nharperson\ndrubbings\nzangrilli\nestling\nkongwe\nbacskai\nragheads\nsilcon\ncompanied\ntesema\nnapwa\narcinazzo\nlagmore\nhydromassage\njaljalat\nphax\nbetokening\nuncurling\nscheyder\nlarrigan\nzetlan\nkully\nurumiyeh\nnovitec\nturnround\nminorty\nadew\ngradillas\nunseemliness\nletterier\nwcccd\noout\ntolee\nprediliction\ntsapelas\nkrent\ncarwashes\ntsatsi\ncondidate\nropper\ngutíerrez\nonyejekwe\nryness\ngróbarczyk\nklappa\nwvvi\nbedoin\nbiodyl\nodec\nhria\npozdniakova\ndeclasse\nsalmones\nbornhorst\nrealtech\noakfields\npellant\nelleithee\ndubrowski\nfashing\nbraney\nwiewel\nthorleifsson\ngazgireyeva\nizlar\nasig\ndearmon\nstatland\nbarjon\nmanouvre\nfezs\nopenhpi\nclergyperson\ncentrastate\nbodegon\nscrabbled\naldeasa\nrathgama\neiag\nreopro\neconomywatch\nwesterterp\nbluenog\nfbj\nbacaro\nacucar\nunavailingly\nrethorst\ncuvs\nghiglieri\npaleopathologists\nwhitmanesque\nserlet\nactioners\nrepeatedy\nrogacki\ngrimp\nsedrakyan\ncasodex\nheucherella\nkilfinan\nunanalysed\ndicroce\nsnowworld\nmykonian\nahlaam\nkosseff\no
eur\njinguo\nsurpressing\ngoetterdaemmerung\nheithold\nmasnaa\niarfhlaith\ncolifata\nmulrine\ndaewoos\nmemoli\nhomex\ntidland\nbarncroft\nveljkovic\ntraikov\njolena\nwisneski\nabraxxas\nhaymet\noverscheduling\nzemedkun\nvictum\ngosin\nexcruciation\nplumpers\ntalve\nbliese\nmielgo\nneurobics\ngewen\nmangul\natempted\nbanchetti\nmckellips\narwady\nscarbro\nsocialmedia\nceannaichean\nhoens\ntpss\nfrankovic\nkotyli\nhyperventilates\ncribbar\nschmoozer\nlezark\njibouri\ngaddopur\nmahalak\nnetworkable\nscalelike\nreivews\ndyfs\narrotino\npontignano\ndebategraph\nsawc\naryani\ndazzo\nbullyboy\nmasstige\nadbe\nlustau\nfarruco\nexpatiated\nserivce\ngemperle\nfreakishness\nalathara\ntintero\nunrealisable\nminidresses\nxtrax\numhoefer\nporfolios\ninveroran\nbrisbee\nkiplingesque\nchorman\nstationwagon\nroastee\naqueel\nservicepersons\ndevoloped\nmicrus\nkazmierz\nshoulod\nsoulen\nmandolines\nmaharashtran\njobsohio\ntshing\ntehm\neldery\navnon\nlrdc\ndesmar\nrostekhnologii\nzitacuaro\nelshani\nedap\ngymastics\nmergent\nwishtan\nmereway\nparkgoers\nimars\nnasulgc\njcaa\narito\nouatah\nwaterbugs\nwhif\nmatrex\nspiffs\nemployeers\norotund\nimpossibe\nfelitti\nnehad\nfosil\ncigui\nhammerskin\ngaxa\nszczech\ndaleep\naloulou\nprzygodzki\noyw\namatista\ntrubion\npostmarket\ncallisthenics\njielian\niovs\nbecora\nkhaindrava\nwoodier\nblindsnakes\nbpex\nsmokier\nkarahalios\nvitone\nglaberson\naaadt\nnonvoters\nanwarullah\nclarky\nolimpa\ncherkizovsky\nvillainized\nlancot\nbarzanti\nerrachidi\nmcgetrick\ndundonians\nhamengku\nyolton\nswartzbaugh\nlevav\nbouzar\nyousefzadeh\nbystolic\ncardizem\nproventia\nyerawada\ndieteman\nfructifying\nsamarasan\ntriump\nphotoreal\nkmpg\nslatyford\nsupershort\ndjbouti\nsaproxylic\nschertler\ncusí\nzadorozhniuk\nequitorial\nfarelly\nbritiain\nʼal\noceanlinx\nsayerlack\nhespanha\njolil\nhoyoux\nmassaud\nlomitapide\nshucker\nginis\ncitifx\nloshak\nrosekind\nstba\nmanakhah\nsvest\nvaret\nbademosi\nkorst\nsamreen\ncobbolds\nfonarow\nneighours\nkinnesswood\npl
anemos\nmemora\nschoppa\nslatalla\ngiragosian\nrncc\nfleshpots\nisocs\nurtis\nmoevao\nonlocation\nappathurai\nyiin\ntechconnect\nretailiation\nmilitzer\nschlabowske\nzargani\nbattenbergs\nbirdell\nspazzing\nmizejewski\nbillerud\nklitzman\nprograf\nangloamerican\nashamedly\nwishner\nwirec\nrustagi\nbiliousness\ntopbas\ndabchicks\ntorrentes\npostmus\nroundscale\ninfinitis\nhornabrook\nbadden\nskochinsky\nplainclothed\nexoticisms\nsaralyn\ntervalon\nsecurecode\nconfiguresoft\nmultidrive\nrightsflow\nfzco\nmulraine\nmundipharma\nnonvascular\nnägel\nbreathily\nbodywear\nappals\nbowlder\ncityteam\nglistering\nfereira\nmartons\ninrap\nardelan\npentech\nnoaman\nilight\nbaggish\nangoitia\nwmrs\nmohammeds\ncovia\ncelzijus\nwiltel\nstepleton\nhermidas\nkipkalya\nbarlyn\nzambas\nraeesi\ncitzen\niturgaiz\nprotesteth\npcdc\noverinvolvement\npelekas\nlagnieu\ndaisher\nfunjet\npiasentin\nboulangeries\npodoski\nbobbys\nultradns\nimpracticably\ncathdral\nkedrosky\nrasheem\nperunicic\nprostition\nerka\nbryarly\npricelock\nornais\nmoneycorp\nameristeel\nhughen\nxianfu\nmngadi\nventavis\nfewsmith\nfrash\nrothweiler\ngabart\nfalkenrath\npakpour\nsietes\nouches\nyoutek\njonases\nbugaighis\nprimet\nhoegel\ndybkjær\ndodkin\nconstruciton\nkhouloud\ncrezdon\nmindblowingly\nhadnott\nglangwili\npaddlings\nhajicek\nwidepread\nrulespace\nectodysplasin\ndandeker\nzeitgeisty\nperries\nshpetim\namadiyah\nkeydrick\nwehlener\nshowjumpers\nspron\ndanehy\nterveen\nchodak\ncompart\nvirually\nsemons\nmariell\nmethaemoglobinaemia\nbohrs\noleszek\ntriathalon\nprodution\ndalecki\ntoeman\ngentiva\ntruedelta\nupheavel\nriekstins\nbouramdane\nyelpy\ncanaloplasty\nkodjovi\nzirakashvili\nlocurto\njarandilla\nrafis\ncachagua\ndahlstrand\nbitterballen\nmcclellen\nwassersug\noverindulges\nstentiford\nclema\nsautéeing\nunklesbay\ntregroes\ndalguise\naspatore\nschlippenbachii\nimmobilizers\nraktim\nshamaa\nchanterlands\nsaddos\nresharper\nboytoy\nforesty\njelous\ndiesotto\nmulticandidate\nkokaral\ntaregna\nmenachim\nf
ayeds\nventur\ngintare\nlgiu\nmicroproducts\ntafesse\nnoonen\ndelinski\nnoebels\njaunted\nnaheem\nselcan\nindiewood\nwashir\nnurallah\nmanufacturered\ncanonising\nsolpadeine\ngrazhdankin\nscences\nzanevsky\nsnoozes\nmarrietta\nhamrol\nkantes\npreperations\ninfotag\nbrittini\nsecurency\nfuul\naeberhard\nasteriod\nroomstore\nlukeba\nrusssian\neloul\npfaffmann\nrislund\nbronces\nsploshing\nswep\nralton\nligth\nempoverished\nboomburbs\nnonsequential\nlhcf\npolania\nmokango\nthanawala\nsinopharm\nkryshtanovskaya\nagrobiological\nfirmwide\nwillikers\nkerbow\njarris\nargumosa\nattampt\nhochbrueckner\neversons\nephelia\nsecurtiy\nnetlearning\nfreons\nasantha\nfifith\npiwna\npigneto\nsiddiqah\ncarifest\nhealthcheck\nintoduced\nbeshbarmak\nzauba\nretiling\ndeganit\nstevely\nfruitlets\nenvolve\nlinardos\nrepell\nhosteling\ninsouciantly\nonliest\nnoiron\ncardes\ngreaseless\ndoleuze\nmadewa\nbutuo\nsiral\nciragan\nslurried\nairbeds\ntabloidization\nstewmaker\nobamanomics\nvansteenkiste\ngoates\nmeieran\nzoodsma\nnasrawi\nrecue\nmesches\noevp\nminerally\nvenkys\nkahlid\nkondaurova\nordianry\nlikeably\naquarids\nhorelick\nhypermarché\nsingularitarians\ngoken\ntarasiuk\nfluffles\njakusz\nsmetanka\ntirus\nminmin\nsterzenbach\ntournment\nameron\ninsititutions\notex\nbohland\npetrolifera\nmalecha\nsalerooms\ntressed\nlifewave\ntelenorba\ndeceuninck\nshujie\ncrêperie\nchadiza\nsexbots\njeanice\nukieri\nrobatayaki\nmalenotti\ninititiative\nolch\ndanilkin\nlabrandon\nprosten\nartparis\nmohne\ncochetti\nendp\ndasient\nsarinana\nschamel\ninvestindustrial\nwwxx\njawanza\nsuperpowerful\ntargacept\nalbanes\ntoffa\nfeinsilber\nafbs\nscaffolders\nshidane\nkrejca\nutimately\nbookbags\nfremlins\napsell\nfirestarters\nhealthsource\nrachofsky\nbistrots\ntzavela\nbellavitano\nagreeed\nfreemantlemedia\nmaxcom\neloui\nhotelicopter\nallenna\nmcgowne\ntorgard\noulo\nblaven\nzebley\ntenerian\njacquelynn\ntheinternational\nmarzella\nvandenberge\nleuckert\nstraigt\ncetron\ncyfartha\nkolba\nkoukidis\nluftig
\nprestart\ndorkiness\nmohabir\nnauss\nsanitoa\nkhameni\neurosceptical\natwah\nadung\nyubaraj\nzierdt\nhereos\nslaoui\nbiowaste\nsandimmune\nspeto\ntomase\ncollander\nssing\nbuttenweiser\nhorsholm\nwinkling\nconsommés\nquestex\nwatchstander\nlehmen\ntversity\nsigaty\nkarydis\nheikin\nhumorus\nkourula\narechabala\ndevines\ndehghanpisheh\nbratich\ndinowitz\nkarason\nqadari\njacobsons\ntonjes\nfilipiová\nrestitching\nshoddier\nscrivani\ntibbatts\ndefoliates\nunholstered\nunmoveable\nhulthén\nmoaser\ngyrowheel\ninflationists\nneckpieces\ndeleston\ngegenheimer\nerrrrr\namunategui\nkingon\nnaciria\nmaehr\npulchritudinous\nwaitperson\nappelberg\namselem\nswanlike\nspicerhaart\nfindaproperty\nbartop\neucharides\nbigongiari\npomaded\ngieg\ncleanshaven\nhalotherapy\norlane\nanakara\nkhanafeyeva\nthitinan\nkarugarama\nhakimian\nvnus\nretched\nmaysfield\ncvos\nbearce\nbioinnovation\njailen\nbookeeping\ncooed\naddante\nmafileo\nmiscegenated\nfervidly\ncaroon\nsamouk\npatelco\nbenzodiazapines\nhaufiku\ntandikat\nfeldzer\nquing\nmuheisen\nkiwaukee\ntilem\nsarejevo\nschieb\ncastledown\nstuttery\nineris\nannozero\nraidrs\nphaal\ngarzones\nyongling\nfoldershare\nanzelmo\nhandicam\nzimnoch\natabayev\naicr\ncovention\njingled\npetroliana\nhotornot\njerritt\nseprafilm\nsymbiocity\nganchi\nbrynaman\nnazami\nkavkasia\noceanhouse\nlegutiano\nbestowers\nreturing\ncantick\nmulticharacter\nsycolin\naphorists\nfritti\njacomelli\ntoradol\nmujihadeen\nonzo\nturitzin\ndenica\nwhaaaa\nnarusinsight\nchampley\nshangjie\nignobly\noveremphasising\nnessers\nlatsky\nacknowlegement\nsbics\nctitf\nctsp\ncebalo\nluxim\nmival\nflotations\nfobney\njohnsondiversey\nbrokovich\niguapop\nzelevansky\nboccio\nknightwood\nbrahams\nrogerses\nglassel\nhartzer\nmurado\nhuneycutt\nkersen\nhimelf\ntargeters\npecong\ntunwell\nprinster\nconstiuency\ntrihatmodjo\ncollás\nvaidi\ngrimbsy\nnasbla\nnakhumicha\nvöslauer\nolaide\ngwella\nliqiang\nfilmically\nvalhi\nmorters\norlovic\nhilgeman\ngrizzling\nipev\nfarnen\ngovernmnet\
nfeyling\nhutsby\nviennetta\nzippori\nhushkits\nbiofit\nrusen\nndonye\ncallwave\nmonomaniacs\nbusineses\nandreano\ndegorski\nwhipsawed\nmhaith\nproyecciones\nuppish\nviread\ndefnitely\nintracacies\nknable\nmohin\nforseeably\neunick\npresdient\nhintlian\nmtgo\nenergysmart\nkalemeh\ntalez\nrazanamahasoa\ndeceipt\nvainstein\nwatercube\nbastes\nnecesitan\nbreysse\npowerscreen\nbashari\nkrysti\nncore\nnalbach\nnathe\norencia\ndivorty\nfussbudget\nanusat\ngroenig\nphard\nteachman\nchandrasekera\nchamni\nmoquet\nunaroused\naspics\nhalaas\nrrsat\nerlys\ncollacott\nysbryda\ndowncounty\nishitsuka\nvelling\nrital\nhagers\ndjabal\nhometrack\nmionet\ndymanic\njandreau\nattiq\njannetti\noibda\nmaesbrook\nguilbe\nsudack\nkhaidarkan\noverhelming\naquatec\narenivar\nlicuados\nladsous\nxiuqin\nboomgaarden\narchuletta\nschooltube\nadforton\nmashek\nbisimwa\nelshadai\nsldf\nmahimahi\nopenscape\nmcewens\npouplin\nzingman\ndhody\nseaney\ntantalise\ngirao\nnachito\nafairs\nnawlins\nvasilyan\nmecki\ngwledig\nmarzuk\nmazetier\nvisitied\ninoccuous\ntechau\nsonelgaz\nbrothy\nporkie\nhuiyong\npannack\nbahng\nfelow\ninfighter\nfussily\nilao\nthecompany\ncommitteed\nsakkinen\ntroester\nunprecented\nmetalcrafters\nrollaway\nsaddams\nraqia\nezpass\npriestgate\nyenakiyevo\nghostbar\nbellord\nbagginess\ndalonte\nbirckhead\ngridwork\nafrims\nglasfiber\ngelcaps\nlashonda\nlmxb\ngloder\nkorenmarkt\nturbocharge\nsoyle\nsctf\npixlr\nserfass\nwesterhaus\nwatb\nclarencefield\nscudded\ncircadence\nlowgate\nferlaino\nvaananen\nmoneghan\nclayhidon\neeze\nvisen\nviciano\nkiyofumi\nfrazell\nfrontgate\ncherigat\nkabban\ncrcd\nsecuresphere\nmedicore\nschulle\nkrabak\nbustleholme\nbebawi\ndelonas\nstreetwalking\npresedo\nmissier\ndupper\nfwix\nxosha\nzestril\nconside\nlmhr\nspottier\nspeyers\nlgpa\nmoview\nloped\nclassfied\ncomplicities\nfrosses\nscheungraber\nchieppa\nthouands\nsanitiser\nstaggerford\njurrell\nsilking\ngalewski\ncarveries\nnaffa\nfarnquist\nfrisoli\nperdut\nstovitz\ndrowsily\nabstinance\nkorpal\n
disclike\nbetterinvesting\npredicitions\nyoslan\nquotability\nphotoswitches\ngrpr\nmoviedom\nmaullin\nmaddness\nstalwartly\nstarchitects\nchhom\nfinkley\ndistroy\nchainstores\ncwmnïau\nzuckers\nkudrycka\ndesterrados\nshanmugaraja\nlakesha\nshaone\nshermanesque\neuille\nmedhekar\nkowsari\nahhhs\nminsiter\nlrpl\nromanens\nwhelbourne\nchelbat\npenelec\nkalsoum\nbasille\namgylchedd\ncollectivise\ntohave\nboycs\nhustinx\ncolagreco\nhrsd\nviñedo\nbluelinx\npalmason\nseillier\nkaernten\npredictify\nshirzai\nshahrestani\nbialowitz\nmsrps\nkickabouts\nlattman\nloosener\njabugo\nizosimov\nobonyo\nbaumfree\npriot\nresearc\nsalubi\nraviglione\ndarlingtons\nsoukar\ntripit\ndonorgate\nghindin\nwifeless\nvppa\ndestinationcrm\nmiwg\nwhalemeat\nunchoreographed\nauriculas\ncanapes\nbardavid\ngimick\ninsipidly\napercus\nnumeiri\nradioheads\nchamberlins\npeoople\nfosberry\neaglerider\nirhabi\nchepkurgor\nsqueegeed\nsoulez\nreacquaintance\nvolkl\nconery\nciaglia\nlaedc\npietravallo\nspokesdog\nsteelbox\ndevashard\nesteruelas\nanderon\nmediabrands\narbey\nolmützer\nkreab\nagfc\ngarimpo\nmcnenney\nsabatine\nrabiyah\nkumzar\nmusicid\npilgims\nhomefree\ngepford\nlimeside\nlousiness\nprisi\nadenekan\nlazrak\nguclu\nwolever\nrestages\nbleum\nvirtualise\ntscp\nmaternite\nkuken\nsnackwell\nderrys\naippa\nstibal\novercommit\natircm\nzolfa\nmurderousness\npeckolt\nxdms\nmidgate\ngorza\nruentex\nsarachek\nrotondella\nratkovich\nrudoren\nshoucair\nshmyrev\nhachamovitch\nswoopers\nremodulin\nzahorian\nmathmatically\npisarro\nperfervid\ncofinanced\naccutronics\nzomig\ninferrence\nverboven\ncornishness\ncavendar\ntoddled\ngothically\nmanuli\nwopping\nsnarkier\nhmmn\nmilenov\nbrightseat\nbardoni\nauldyn\ntinkerbells\nmccarricks\npreplan\nmatijasevic\ndeschaux\nsedonia\ndusic\nsusceptability\nyokoshi\ntifatul\nbatties\ndellwyn\nfritada\nakond\ndefuelling\ndurica\nvarbusiness\nloilo\nstriplights\npapastamkos\nbudiriro\nmeeeting\nmackert\nunrestrainedly\nfranic\nstampolidis\nmetascores\nadivar\nimmensitie
s\nsanostee\nchedraoui\nlatricia\nsilaigwana\njoselio\nvillemont\ncallans\nfridjof\nactivinspire\nreimplanted\nmidstocket\nfitties\nvayl\nkhanbhai\nsensualists\nomelyanchuk\nradiesse\ndevriendt\nfilp\nkennedies\ngabitril\noverhype\nmaquin\ncofinancing\nshemara\nkidzapalooza\nbelpietro\ncommmitted\nkwamain\nefyrnwy\ngressman\nbrodigan\nsemtek\nadirs\nqoh\noetiker\nstellaservice\ncpix\nfreightcar\nmatria\nnasier\narboriculturist\nbozize\ntranshipments\ntanrikulu\ncrossruff\nsuperstuds\nrquez\nkalkut\nragoonath\nantipolitical\nfortunado\nargiano\npetrobas\nrumsas\nelisco\ncadora\njelleff\nncal\nsbsa\ntravelalls\nfuruvik\nmagalski\nratcatchers\nmennello\ndoorbuster\nyunchuan\nzellij\ncoldsore\nintroducted\nphytopharm\nbuerki\ntwizzler\nscharfen\nlellouch\narahuay\nparvizian\nmcconochie\nlangstein\ntradegy\npinkwart\nhirschkorn\nfoulers\namazongate\nschüll\ncensoriousness\nrebook\neruvim\nzenkel\nwhiffy\ntsarukaeva\nkirabo\ntrevard\nbiogenetics\npingeton\nenfeldt\ncallowness\nzarein\nmiseducated\nahmedzay\nesfandyar\nboxloads\nklipa\ncaovilla\nheimeroth\nenraku\ndearland\nsoundin\ntrunnell\nzuoming\nilyukhina\ngrupetto\nmidwifing\nanerican\nnondeductible\nespd\nfuscia\ntrubek\nmenashri\nprigozhin\ngoodmail\nlelands\noilexco\nwotter\nundedicated\nkadannappally\npillaiyan\nvilaro\nuralchem\nmemish\ncashedge\nkhorramshahi\nmailouts\nmiccolis\namendoeira\nlavarreda\nsardjito\nawarders\ntrenchancy\nddal\nwfes\ncurruption\ndhoble\nquattrochi\nnosebag\ninterestng\ntamasy\nendesha\nhanawon\ndatillo\narktikum\nlavapies\nprotrayal\nprotectins\ntelemonitoring\ntackiest\nstrokemakers\ngoldinger\nunexcusable\nintelliquote\nsunnyhillboy\nhyperic\nyuanshao\nturnpoint\ngsoh\nspinelessness\ncorralejas\nmildrid\nclomping\npaisal\nricked\nputbacks\nsbrana\nchoiceodds\ncnaan\nmonics\nraihana\nsebnem\nenobled\nstudentcam\nschuppert\nendboards\nkafashian\nsujak\ninfinia\nduplessie\naltinay\nstarchitecture\nkirubakaran\ndentzer\ncalvanico\ngalamsey\nmukisa\nduoji\npayrise\nbatoning\nustelecom\
naplace\nchloraprep\nyachters\nsekulich\nsuddeth\nfengwei\nbicentini\nkmarts\nkliuyev\nnickolds\nmedbery\ncoronaries\nbielas\ngaull\ngonxha\nmindray\nsibio\nsohil\nfradgley\nfastlicht\nsperrazza\nberez\neprivacy\ngingeras\nnoticer\nanabtawi\nnaiem\ntalyor\nvolksrant\nmussolino\nabeloff\nrubare\nthirtymile\npulchritudo\nnahiyan\nmchutchison\nkhovanschina\nuxbal\ndakwar\nkookaï\nlovenheim\ngudavadze\nmidence\neatr\nvanderzee\nsickliness\nnixonland\nresiled\nschwanewilms\nossatron\nsmidgin\ndoodlings\nindebt\nsaugy\nrasula\nzachy\nuniongyrchol\norcoyen\nkingdee\nkomarom\nlecarre\nviennas\nlemlem\nmirixa\nlovallo\nisaby\nhightlight\nfcram\nhoariest\nidell\nscreeming\nthenca\napptec\ndorsher\nzebtab\nzesn\nexergame\nnetquote\nfistfighting\nbesir\napakan\nnikolajeva\nviatronix\nleftwingers\nnumnah\najdar\naborning\ngualano\ncharbit\nkleptomaniacs\nhoussels\nmishara\nabseilers\nhradilek\ngrayken\npesko\nwheek\nmetalor\ndasko\nbazaarvoice\ndownsizer\nkechele\nbatcho\nnonmineral\nmanteros\ngatier\nmillésimes\nokasan\nmulben\nwaple\nbrasiliera\nshmuger\nprofetica\ntheravance\nbialo\nhonsel\nfcso\ntssc\nlouvrier\nschleisner\nhujjaj\nmaschek\ngardebring\nbrashest\nautopart\nbcbgmaxazriagroup\nfrabjous\npeacor\nnorthton\nshoulderpads\ncoporations\nhspice\nveriface\ndryvax\njulys\nmultilateralist\ngoosing\nusce\ngoudiaby\ngraffagnino\nmuhedin\nmanata\nsweetmaker\nchannelweb\ngrauso\ntremin\nwmk\ncenes\nkimbrose\npozze\ndybbuks\nnozkowski\nmarraud\nmedjools\ndemosphere\nsunvisors\nchotaro\ncondemming\nzouerat\nsorton\nzetterman\nxinyong\nfulgosi\nklenz\nteeranun\nwisened\nchmielinski\nnorek\nswooshes\nalsaud\nrossiskaya\nplues\naahad\nbeecken\nhollihan\nkoezuka\nrzo\nhagiu\nwoofing\nlacourière\nyottabytes\nzerog\ndamuth\npopick\ncruely\npayge\nloopier\nlaxmanananda\nciwidey\nkebler\naneiros\nsherrow\nunterkofler\njameos\nakritidis\ngonone\nshanetta\nlacinato\nbreema\nfarechase\nidtgv\nbiospherics\nmirise\nnoboby\nmecaniques\nhillendale\nmishits\nlotharios\nsanglah\npsbr\nsleepwalk
ed\nsiblinghood\nbarioz\nvasakronan\nbordwin\nscoleri\nfailling\ninevitible\njabaal\nnuzzled\ncrouzat\njavedanfar\noverspills\nkathon\npengxi\nrimowa\nbeauge\nparaquad\nteyssen\nmesis\nchlg\nasheninka\nsarcona\ncegh\nmartineck\nwilens\npomponi\npitcaple\nhaphazardness\nmallay\nbodywash\nzoonation\nfolketrygdfondet\npinheaded\ntierny\nmilien\nponderland\npongracz\ncontroladora\nmusberger\nsterksel\ncocis\nnarcotraffic\ntaegan\npollanen\ntechnolgies\nganek\nchechan\nogguere\nozumo\nprisonners\njapansese\nglinert\nconditon\nstayaway\nblackann\nonesphore\njiaomei\nagboluaje\ncratos\nhirko\nhaukohl\nmonotherapies\nbergian\nbucardo\ntrowelled\nviermetz\njuniti\npossoni\nkasselman\nbonnani\naumento\ndémodé\ntatoyan\nkasonga\njesmer\ncavness\naffymax\nliddells\nyurgens\ncahlin\nbackbiters\nmcdouble\nsaulino\nbertagnoli\nzwakman\nziccardi\nsupremecist\nheeeeere\nnighties\nkulbicki\nmasline\nlalena\naffonço\nyumasheva\necuries\nyhis\nmaici\nbernasko\nsahidulla\nklyberg\nnannys\nheilveil\nrosengaard\nuprighting\nreindustrialisation\nnordaas\nraisinets\ntweneboa\nmitrichev\ngawks\nqasmani\nbikesafe\nunspooling\nkhukhashvili\nmélisse\nexective\nstvp\nsecurer\nduerrenmatt\nfitel\ntifanny\nnoray\nafeared\nnyanzale\nspecia\nshalson\nbhattacharjea\nmyisha\nfranzoia\nsuberb\ncatheline\nkrahom\nseditionaries\nstockmarkets\njokanovic\nltcfp\norygen\nbirkmeyer\njnbridge\nunfavorability\nsumbandilasat\nneelmani\ntechnocentre\nbeaunier\nbossler\nbrownism\nharbourage\nmhhe\ntbps\nstaner\ndepersonalise\nshopfitter\nalbless\nbirchers\nsheeplike\nmitsuma\nramsaur\nkostyrko\ntailgated\nbreadlines\ncepko\njivanjee\nnenuphar\nuramin\nrevolutionises\nkwikchex\nmultihomer\nsiheyuans\nwardieburn\nbarakula\nparrasch\nkanpachi\ncrewcuts\nsebio\nnikpay\npappers\nlengkeek\nsalares\njaffry\nthorseth\nsaltness\nlotery\nbeversdorf\ntulashboy\nshiavo\ncaravaggios\nelmay\namenah\nlicy\nvovan\ngleysteen\nosbrink\nekay\nsimmond\nbukata\nborovay\nmadisyn\naulisio\nthinset\ndakim\nrealtive\naltantic\nincommunic
ative\nwadsted\nnimbys\nwerbel\nscarpato\nkarith\nkohlenberger\nantipollution\nsengstaken\nmunqeth\nogut\nbabbio\nvunga\nzakiur\niciest\nmussoni\nbefriender\nefdi\nptpi\npelma\ntransitting\ntarino\ntindy\nsculptra\ndalcin\ndugel\nsanzari\nlumleys\npolkinhorn\ncampaiging\nbaroncini\npellerito\nkaloyeros\nskycaps\nkozari\nwatchcon\nnukular\ncunat\nreligously\nnhsa\nnomineee\nbibic\nalpuche\nritualize\nriotto\nstauncher\nboulodrome\nhairst\ncaiz\ntrattorias\nsilvercorp\ntidmington\nwigren\netcheberry\ntennessees\nvaghar\ngaoke\nbatemen\ncharlow\nbellevarde\nbobbly\ngoodear\nyazdovsky\neeurope\ncryopreserve\nbroubster\njaynarayan\nfordian\nguilelessly\nindarra\nquestcor\ncannabalism\nlehmacher\nencashed\nhiggerson\nmarketside\nactiontec\nwraxhall\nsycophantically\nrenasant\nskyforest\ntoothlessness\ntoshishige\ncooneys\nlovborg\nwaggles\nmetelsky\ninayet\nnetessine\npooks\nthassa\ncummis\nlookbooks\ntpsac\ncollpase\nmonicans\nmikul\ntscl\nausn\ndext\nsweatsuits\ndaklak\npeerlessly\nplaytimes\nwestfaelische\nmaneouvres\nwoobie\niezzo\npethau\nmetascientific\nunsentimentally\ncruxton\nnigera\nsimplyhired\nafican\nsemneby\nbackorders\nmalikyar\nswansborough\nozd\nblackweir\nbecareful\nvarlan\nbalbeggie\ncfats\nsimpe\nbouilhou\nalwaki\nkalentieva\ndardai\nelectrorock\nshipwide\ncosily\nwithn\nretino\ngridlocking\niggs\ndenuclearize\ngubbeen\nmauerfall\nwindchills\nugss\ntananbaum\nluskentyre\nmhangura\nvenkataram\nwondeful\nkalite\nkingmaking\nstrozzapreti\nhappeneing\nantipaxos\nfransje\nchargoggagoggmanchauggagoggchaubunagungamaugg\nedies\ngiricek\nfinnebrogue\nhomeboykris\ntristone\nmobus\nkriese\nschrauben\nadilgerei\nshmira\ninarticulately\nthumbdrives\ncegedim\nphilomen\nlawnbott\nfornalutx\nfilev\npoulters\npoofed\nwardwick\nepistolatory\nzanoun\nabingworth\nuntaru\nhilldrop\naltamarea\nmegabreccia\ngoogletalk\nmotionflow\nthuyen\nfugitt\nobray\nferrai\ncricieth\nimangi\ntranferring\nwillyum\npufferbellies\nolajos\nisaichev\nintellectualise\nfinanciarul\nmeejin\ncort
y\nrickeys\nczarism\nsrifa\nlookng\nkopera\ngeoengineer\ntwistier\nhundreths\nixer\nfrightning\npollins\ncssv\nrenfors\noiba\nloglines\nguoqi\nfranceschelli\neconmic\nprodemocracy\npacl\nleeuwenburgh\nhawketts\nnonsensicality\ntylerton\nelshinta\ncheathem\ngedco\nspilak\nklisch\npurchasepro\ngonvick\nmoussaid\ngogii\nkurgo\nnovostei\nmoletta\ndargate\nfabianski\npyatykh\nhibees\nniccolls\nvetrone\nzottoli\ninglenooks\neicu\nrexrode\nwarrenty\nrikardo\nunerotic\nimmersiveness\nlichstein\npunnery\nstigger\nanonymise\nwanyeki\nmetahaven\nstetch\nholosko\nmantau\ntironi\nbechtolf\njoedy\nruedrich\nettner\nmonolaurate\nhearron\nmutemwa\nnetex\nmanzitti\ngalwan\npitchwoman\nhotelera\nseereal\ngwq\ntankan\nrakus\nakallo\ndyatchin\ncoupledom\nsubtrochanteric\nsumayya\nnightjack\nryzik\nhardmeyer\njacketless\ndsda\nmoble\nleruo\niraninan\nresorces\nfilaq\ntsfs\nvirilion\nflamboyán\nwrotto\nxiaflex\nbettio\nholzheimer\nunhired\nzahera\nidealogies\npredo\nquixotes\nchirst\nlaurionite\nbuisnesses\nhirshenson\nlescault\nadventuristic\nwraithlike\ndossing\ntabouli\nluege\niwave\nrivellini\nsemicon\ncoshes\nviehbacher\nvolpp\nnonfuel\nlatunde\ngamalath\nmouquin\ncappex\npingsha\nreupholstering\ngulty\nmsosa\nndarc\nolazo\nspeedpark\nflovent\nquintupling\nrybovich\npyttel\nbankwatch\ntruimph\nhelocs\nsemalaysia\nclerkly\nropelike\nbradke\nnrda\nelagolix\nmonokroussos\npresentaciones\nlepauw\ncowpats\ndorb\ntrenfield\nmatiullah\nmpal\nbrcd\nhimax\nnosamo\nexcrutiatingly\nmicrofun\nnoymer\nblossomy\nqayara\nbjorge\nnighthawking\ntamisha\nkalenjins\ngarosci\nhelvacioglu\nbiosurgical\nunemployability\nmcclaran\nfloxin\nhyaric\nchiezo\ncolascione\nmassucci\nautonational\nbaseco\ngodat\njuchitan\nbertamini\nmainds\nkolontár\nnoncommittally\naptr\ncottrol\nvideophiles\nirifune\nbarakani\nmegadrought\nbolotbek\nbiolley\nmosbah\nijewere\nnaqura\nbunless\npreinstall\ndunnichay\nfortresslike\nlovenkrands\nrivello\nunrisked\nstoemp\nccould\nheptullah\nsiggs\nykhc\npuhalo\nmirapex\ntomnahurich\
nmcstays\nferarro\ntrouserless\nleanse\ngoerg\njacquards\nhromadka\nnoncumulative\nunroaded\nimpieties\nmikkelsons\nsweenie\nblumengarten\nsaleemul\npercentagewise\njizzini\nfrantoio\nzinetti\nlunera\nkalaye\ntumbrels\nkenber\nstruhl\nbaheer\nlangenhahn\nhairsbreadth\ndecarbonizing\nmundanities\nredepositing\nvideosphere\nmuqdad\ndonmoyer\nchaleff\nyhf\nosteosarcomas\nnasuni\ngettng\ncrescenza\nslingboxes\ncollardi\nsaeedullah\nmisremembers\ngrandmougin\nbgrc\nmodfather\navenbury\nskanled\nsurvitec\ngrossmarkthalle\nvideoboards\nyondelis\nfugnido\nsarewitz\nbonxie\nknobble\nnerdfest\nstribley\ntarekegn\nhoogie\nrhenaniae\nlentine\nepir\nlautin\nrekso\nleaue\njasikevicius\njobsworths\nhsss\nkachikwu\nvanwalleghem\ndontell\ndizzied\nrothgery\nfoxiness\nchalamish\nlittwin\nwhpc\ncaniato\ncoolish\nincuriosity\nproassurance\nkcfa\nvandergaw\nandarabi\ngovermnent\nmanhoods\ndiscofied\ntumbaco\nvyomesh\ntoweringly\ntestagrossa\ndolmades\nyehle\ndussindale\nthwacked\ndelerious\ninscrutably\ntapuaenuku\nelegent\nterravista\nadventuresses\nnorthumberlandia\nsaiedi\nbramleys\nosmanova\nhyperactivated\ncultiver\nmuhle\nechoworx\nmurderes\nbordean\nboodman\negnazia\npaffendorf\nzakhu\nschels\npitaro\nbritspeak\nskinflats\nphonautograms\npenine\nstorediq\ntionna\nyaourt\neurosur\nabasing\ngerrymanders\nsidled\nkoblas\nnexterra\neefting\nsdsers\nmoonfleece\npuela\nhedgelaying\nwoja\nmatieu\nnlcr\nviebrock\ndoumgor\naeronatics\nkochon\nsiriwat\ngloppy\nperceptics\ndunseath\nveeps\nmxim\ncapeless\ncostcutters\npurhonen\nharmanis\ndorheim\nbollene\nlidove\nsmartconnect\nkeyna\ncrunchtime\npoeteray\npalmprint\nhuante\nunconventionals\nteresitas\ncascal\nlavee\nincompetencies\ncampign\nantitheft\nsulub\ndiscombobulate\nlarkcom\nrotable\nshurah\ntaneshia\nsabraw\nbrunstock\npourdastan\ntimetrial\ngewanter\nfidc\nunionbancal\nintenders\nunpeaceful\nkoelzer\nmaizes\ncorfman\ntunza\nngirabatware\npostler\nsudr\nrozak\naple\nfecally\noncoplastic\nkaching\nmacrogenics\nmakhar\nsyscon\nsexify
\nnagdi\nkubesh\nfruad\nmandoo\nunticketed\nrimmell\nburatha\nsleeting\nsmolianoff\ndionyssos\nambrook\nmacaleese\nknells\ndimiss\nmedigene\nbddk\nvigilanteism\ncioloş\niofina\nkirkorian\nmegaliters\nsaladdin\nreappoints\nmanich\nvarsos\nsugarcoats\npremeir\ngudiberg\nmatzger\nmeville\ncollateralizing\nnijah\nbeirutis\nfishbowldc\npreparty\ncardlock\nsnorer\nthirtysomethings\nlochview\nkát\nbhumipol\naquatech\ninarticulacy\nraffele\nbedzyk\nsporormiella\ndefoggers\nweithaas\npromphan\nhantho\nunconcernedly\nseattlites\nnasari\nelating\nscreecher\ncharpai\neceiza\nbeinert\nschapp\nnelco\neasyoffice\ndownblending\ndabaga\nsuthin\nwergs\nvouchercloud\npersichini\nfertonani\nrunflats\ntagney\npenpole\ntarratt\nharmening\nballetboyz\nqeis\nonziema\nhouseproud\nrespess\ncasglu\nspidcom\nbrtain\nakinyi\nkatwala\nwooziness\ndynapac\nnccg\nlinescore\nchrysanthis\nladau\nstuller\nnowhatta\nokosuns\ntroesch\nportgual\nbluelines\nschnare\nkaneoka\nyarkas\ncescau\naskarieh\notjivero\nkalashov\nramoin\ncreeth\nafterparties\nexcelstor\nthecentre\nkhulani\nrabjohn\nbiovex\nzvinavashe\nrosty\ncomposedly\nadulate\nweydert\nsqueri\nscorable\nrafin\natbs\nnuedexta\nkegels\nsawrymowicz\nflabbergast\nmistrals\nmolsoncoors\nsinglehop\nblooped\ngrambau\ncortèges\nohley\ndangerious\ncitma\nprocaccianti\nrelevation\npubpat\nrazzed\noveneke\nosvs\npancanadian\nlenane\ngiulianotti\noutsiderness\nteaster\nnegedu\ncosterton\npropagandised\nkanektok\nunimpressively\nsordidly\nelslander\narall\nritvars\nsaudino\ntupra\npoliteo\npremeditate\nbasbas\nkonzelmann\nyakcop\naurus\nfunsters\nhezlett\nrakishly\neviscerations\nmeritcare\nunderutilisation\nchristofis\nfiltronic\nhuiskes\npashmul\nromeyer\ncoywolves\ndezell\nrigerous\nhousesit\npanjiva\nboatful\nmetamorphosised\nhulky\nroures\nbilingües\nsnerdly\ntrusso\ndraig\nheadquarted\nviafara\naltringham\noppotunity\nlaudicina\nberisa\nunderrepresents\nbroadbrush\nconningsby\nemfesz\novercash\noutproduce\nolke\ngudni\nakbulatov\nfruitseller\nashecliffe
\namortising\npolicitian\nmommer\nsnis\nshoeboxed\ninteko\nunconstitional\ntsoukalis\nmezzaroma\nloveshy\necycling\ndjibrilla\nexperinece\nkeithly\nirenas\nsaghal\nrcni\ngubner\nbangaldeshi\ngrandt\nescentuals\ndorros\nsayafi\ngalavotti\nactec\nglinted\nseminfinals\ndendoncker\ncryle\nguachos\ncicalese\nreeg\nilliniois\nthemal\nisenstein\nzarkasih\nmatternes\ninzerillos\nfetishising\nmowj\ndasmunshi\nstallergenes\nsevugan\ndolsi\nnawaja\nnaieve\ndabovich\nstephannie\npolydrug\nunieuro\nbashiqa\nbajur\ncscmp\nhypomineralization\nnspca\ncockscombs\nreeher\nvigliatore\nbvrla\nschlomach\nmathrani\njurisidiction\njaralla\nstojiljkovic\nhaskells\ninflammed\ngauranteed\nseebs\nsideroom\nkiddingly\nkanneh\nsquirrell\nshekhovtseva\ncrooking\nwfic\neconergy\nchemoperfusion\nsoullessness\ncavolina\nsophistications\nguaceto\nzotova\nreimchen\nyoungins\nabaas\nexcitebots\nshadeless\nshwam\narmload\ncbry\nbjerga\ntrizzino\nsawula\nbelluscio\nwithrington\nscrawlings\ntranquillizing\ndisinviting\ntechonology\nzhengang\ncountertrend\nloooove\nlummy\nwintonbury\nrevention\ndzongu\nmohtashim\ntorax\nlycourgos\nsnakeshead\nhandpresso\naureos\nstecoah\nzubiria\nshanas\ndjalo\nbpoc\ndipippo\nsanthera\nbuckweed\nyoutie\nheathhall\nofcc\ninterservices\ncheongsams\nrodenberry\ncommodifies\narchaelogists\nrepesa\noverfunded\nmamalahoa\nghayyur\nnaaco\njammaz\naverbeck\nscanbuy\ntahita\nsithanen\nbilaal\nradkin\ncitrano\nnonessentials\nkulasekera\nhsse\ngrumann\nfishermead\nploddingly\nschratter\ncrimpers\nphillp\ndanat\ntrendspotters\njoman\nhawklike\nneighed\nguadamuz\nstoeckle\ntianyin\nmaccray\nvincor\nweldmesh\nregieme\nsteriliser\nsrdf\nrobindale\nselfserving\ngeneive\npukey\nknuffke\ntravelwise\ndalmain\ntradionally\nbemuse\ncharlack\nsherika\nambasssador\nddct\npokiest\npeulla\nsolowij\nsheenah\nchakrabortty\nwaheedullah\nbaaria\npostconsumer\nhiiran\nstolly\nauberg\nsubprimes\nvanhoenacker\nsnowshed\nlovasi\nherrling\nbarbicide\nquemere\nxuren\nsatays\nabdulrazzaq\nnarusova\nwarinner
\nmakombo\ndharkenley\nrestrictionists\ninvestees\nforyourart\nsecla\nraizals\nkaijima\nbegrimed\nejide\njumpiness\nkirtankhola\nfifteenfold\nkleercut\nabosolutely\nchintzes\nklyph\nbransome\ngroundforce\nandriod\nirandokht\nobono\nbiegert\ncenterist\nerikkson\nmmia\nvesicarius\nflonase\nlotrel\narmfuls\nhosiden\ntavernise\nwoodenness\nwsbtv\njabrin\ndualview\nelectile\nroyak\neurand\npafilis\njoffrion\nhoneyben\nursitti\nsteepling\nseimens\nreservationists\nhankham\nmornati\ntoped\nhoeffner\nbeaconfield\nrepresentan\neadington\nnevres\ncerattepe\nvanderhoek\nindefatigability\nfenwal\nsurbaugh\nwiedmeier\nperserved\nbirgun\nplastinate\nfuts\nereading\nthillaiyampalam\nognyanova\nunliveable\nzlaten\nkumutha\nrinzen\nscharbach\nundershoots\ngoldensource\nmusslewhite\nquaden\nnastaran\nosmak\nfartusi\nsaponaro\nchernofsky\ndisinform\ndilsaver\nipelegeng\nselosse\ntrifactor\nbivings\nareshian\ngyurcsany\nnethken\nconfernece\ntuerlinckx\nhammaren\ntrofile\nsteinbergs\nwobblers\nreesha\noutgross\nnorthgatearinso\ngangjee\nmukantabana\nmeshkati\nmessianically\nkejuan\ncelebriducks\nweteringschans\nkuzo\nmisdating\nshlapak\nselecky\nrodocker\nchebaa\nabednico\nscalamandre\ncompañeras\nbuttsbury\nfriedensen\nmontepellier\nmadle\nutek\nfacette\npyrolyzing\nfieuzal\nlaproscopic\nyashkin\njermyns\nhelmsleys\nnozipho\nmcpate\nbobó\nalexes\njacobelli\nmainiero\ntsumani\nneognathous\nmcommerce\nwirahadi\nboersch\nbanser\nmudpools\nplungington\nagroterrorism\nburnison\nbeachem\nfmcgs\nmankiev\nbuchris\nihrsa\nlahza\nduchak\nchartkoff\nmmvii\narrrrrr\nforlornness\ngonchor\napotheosized\nmagnoliana\njaurez\ncyrenians\nchiofalo\nabfd\ngasòliba\nnonlife\neobs\nabdolsamad\nforbeslife\nchouchan\nsidefooted\nbeautful\nrunstrom\nuncynical\ngymuned\nthinkequity\nschendler\npalchak\nwiltons\npitters\nhaigood\ntrisenox\ncolbost\njneid\ngiacaman\nkellari\nrebanded\nisinbaeva\nsimpcw\nunrecognisably\nwabbes\nalamzeb\noverhit\nsinese\nmuner\nrepacholi\nwodicka\ndemou\nbarrickman\nperving\nauburn
s\neducat\ncontageous\nwhithead\nmortages\nunderstatedly\nsubmarkets\nkleibacker\nselction\ndonnish\nschaltenbrand\nheartscore\nshamless\nrowdily\npaleja\ngenarro\noarswomen\nfreekin\ncdhc\ngoldstuck\nopeing\ntuxedoes\nraucousness\ntechnocracies\nsuppported\ntyranical\npannacotta\nperimiter\nshrivers\nkhinsagov\nschiavocampo\nmeadham\nnoshing\nextramadura\nminesh\ngranades\nforhead\ngimmies\nsheib\nkhury\ngøtzsche\nceril\natebits\nmidfields\nteethers\nclosey\nzammo\nmedcath\nslosser\nladoucette\nafricanamerican\ncarryforward\nhanenburg\ntradi\nellenese\ncrammers\ncjackson\nfinanciación\nwildmill\nsidekan\nsurrend\natomising\nfabrazyme\nfrisée\nboullard\nrhythmless\nphcg\ntvert\npropanganda\nsmigelski\nscrappiness\nghadaffi\nweihuang\ngreenwold\nboebinger\nkwsc\nereck\nnabali\nglenzer\ndrumshoreland\nzubairu\nleichtfried\nservce\ntransation\ncomittment\nchoksey\nampinga\nmotorcyle\nquadbike\nstarbrite\nbcame\ngoksel\ngermophobia\ndaddow\ndepresion\nmaleza\nprogenics\ngoosestep\ndonghui\ndaddona\nweilheimer\nnorred\npitcairngreen\nantcliff\nbollihope\nvocabularly\nsachsgate\nkortlander\npremediated\nagoraphobe\ngorovikov\npierogis\ndobrinski\nsinglehander\nkamais\nludzidzini\nrapkay\nbalero\neviter\nscarletts\nisria\nnapierala\nnewwest\npericak\naharonovitch\nsnugs\nercument\nschoewe\nalykhan\ninterntional\nsgitheanach\nlambrew\nboneheadedness\nrhadigan\npwer\nbianna\ntaranabant\nluluwa\nfosmire\nagahozo\nwickliff\nairrion\nlianfang\npangalangan\nsukhodolsky\ndocilely\nclearence\nsendell\nminnite\npannek\nblubster\ntowables\nferraiuolo\nbossone\nbarosso\nwinterizing\nslrk\nkarcic\nheára\nproducted\nredounded\nshovelfuls\nfullabrook\nhairbands\nnonstock\ndadswells\nalemparte\nozor\nbangkwang\nschlössl\nslaloming\nniswander\nguzovsky\nmillennarian\nratsat\nforeighn\ninarticulateness\naudenaert\nheuze\nlambikiza\nkoechler\nbespangled\ngtce\nvaction\ncontrairement\nrajapaksas\npieczynski\nunforseeable\nrumbatap\nmediterannean\naviran\ndocksides\ncatavento\nmodert\nitalpet
roli\ndemari\nodioworks\nmcgougan\nschroedel\nabery\nmontemerlo\nspasic\neglis\ndunkenhalgh\nislanova\npixelmags\nrhoys\nkempinska\ncleansheets\nfregola\ncortman\ngerbrandt\nviacell\ngranulates\ncobeaga\nsunbathes\nsnorty\ngreywalls\nsbfa\nvakas\ncilurzo\ndifx\nassiduousness\nwbztv\nmcln\navona\nstandhill\naisenbergs\nmorroccan\ncolllege\nunderspent\nvisitphilly\nkyndall\nuntinted\nmmscmd\nunremedied\nbuddied\nporetz\npogoplug\nphangnga\ngudhe\ngroundedness\ngibrilla\nvirkus\neducationguardian\nchemu\ngarofalini\nsahd\nextortive\nsotherby\nwienermobiles\nhotdoggers\numiker\npallozas\ndueles\nterprom\nkaboni\nplotts\nvukelich\nkazakhastan\nsideeffects\nvanska\naccolate\nheesom\nadgenda\nhardbat\nskeens\nsilgan\nmelbreak\nflouquet\nzokwana\nnarika\nrhoi\nebridge\nyge\nmushailov\nmiiro\nimbibers\nconjurings\nsengalese\nbombardiere\nandrin\nselamet\ngriesmer\nbattino\nzumas\nmananged\nsevices\ningrouille\nkarnell\nfotp\nrzss\nfakhriya\nwyndgate\nzaradic\nphotoquai\nbanej\nitrust\ngardenwalk\nadaniya\nethiopa\nmicrotrends\ndincin\niftars\nphrasavath\nlabl\nlaicite\nasarch\nbeanos\nexended\ntombini\nileka\nxuehui\nsoury\ndifferntly\nlystedt\nshocktoberfest\nthinky\nasiantaethau\nreinaga\nwillenburg\nessan\ngeuder\njianyin\ntruckles\nnayereh\nalmezaan\nmeadowes\nisupport\nsengis\ndziak\ndemerjian\ndvbt\nsajat\nunderripe\nhigazi\nrhostryfan\nmoconews\ncaler\nbeneift\nklaussen\nemvall\nsomjai\nopatowek\nsealeyi\nhfmweek\nglangrwyney\nvilhelmson\naccumen\nnbjc\nladgate\nkarlekar\nbatschelet\ndelegitimizes\nwumart\nlawlis\nqanta\nboredomresearch\nhendawi\ninoffensiveness\nlonley\nlipoff\nrevaluate\ndrillhole\nearlyish\necssr\nthurlo\nexurbia\nmatrimonially\nmobbers\nberlingieri\nforeshorten\njourneywoman\nwintles\nqtrs\ngirlfirend\nhipping\nfrontloading\nallisa\nrebeaud\ncézannes\npdii\npizjuan\nzellnik\nunamendable\ntemporising\njafry\noveranalysis\nzaiwalla\ntopsiders\nheadshake\npetroci\nnodl\nburnstone\naltmans\nmanuevers\npemberly\nalliedbarton\nbrucey\nsecetary\nfullcirc
le\nkhomeinism\nsashenka\nandalgalornis\nwonkiness\nbeckitt\nborthwicks\nbanrural\nunadvisedly\nmlynarczyk\njouault\nmoromizato\numming\nbotvin\nschultis\nthase\nmuffing\nsliska\nmamere\nechikson\ncodero\nbrutishness\nsoilihi\nkitterick\nbloodlettings\nrhinestoned\nicej\nniesiolowski\nanandappa\nrelativise\nblockfront\nfrova\ncamozzato\nphyiscal\nskillfull\nmickleson\nqahwash\ndraaisma\npiconewtons\nchernovetskiy\nbashier\nsuperliminal\ngenthner\nconsititution\ncolumist\nthurin\nbuzaigh\npnvs\nguestworker\nlvam\ndevocht\ndemokrazia\ntechteam\nwildnerness\ngonch\nneoclassicals\nreductivism\ncrasnianski\npanwa\nsnjezana\nvasterbotten\nmedding\nlerille\namezkua\nplayacar\nferrells\nucsmp\nnavisworks\nkesselheim\nbetsworth\nsnowsheds\ntolossa\nreebie\nrajaan\nbuddi\nmurkily\ncoronia\nseiniger\njanal\nprebeg\ndefecit\nwhitetips\nagassa\nescuder\ntorlot\nomaya\nmolehunt\nconcretisation\nmarchinko\nballotine\nreailty\ntudworth\ndelahoyde\ncornioley\nroenning\nsarcinelli\ncelmo\nmulbah\nspritzers\nzees\ngoffer\njerilynn\ntenedor\nrebic\ngershenz\nrevanchists\npakista\nmemy\nrojansky\nsoulcraft\nlamno\nmottern\ndantherm\nrestis\nembonpoint\ndeblase\nelesin\nolic\ncpdo\nkavran\nmimnagh\ngibaut\nhudlow\nguestlogix\ndecolletage\nallrich\ncloffocks\nlashell\npietrucha\nfreerunners\nechocardiograph\nvargos\nvoloder\ndeche\ncompartmentalising\nkupferschmid\ninveralmond\nerdy\nsoovin\nlaskhar\nbiocompatibles\nwinbourne\nparticpates\nhandsom\ncertegy\ntambasco\nixg\nbuffenbarger\ncindee\nsspo\nbreathalysers\nforys\nodweyne\nrabuck\nsaeka\nerturk\nbocot\ntssam\ncallori\nelnar\ndéshabillé\namericanhumane\ncoralled\njournies\nspado\nlafranchi\nswapp\ncardelle\ndeschamp\nkynikos\nparkgrove\nnrtee\nldtx\ndanstrup\nfleurent\ndeazley\ngriffall\nfoerst\nøn\nvelleron\nmensik\nnesmachniy\ndannette\nconcepcíon\ncounterinsurgents\ntambyah\neurodata\ndrumwright\npineoblastoma\nghanouj\nmayernik\nmorentin\nkanaha\ngesac\nmalagon\nwesterhever\ndoorknocker\npeacful\nthirdforce\nshoeshiners\nquesney
\ndogwhistle\nalape\nhoornweg\nconsolidants\nwilliamsen\nftaap\nlippiett\npolitkovsky\nmoqattam\njewsons\ndechra\ntranform\ntregele\neducap\nwearyingly\nflorally\nmugunga\nlafauci\nbrundell\nbackloaded\nwittenauer\nbydesign\nscreamy\nthealby\ncauthron\nleavenheath\nswartzberg\nsdrm\nomigosh\nnomnation\nkødbyen\nbaytril\nsassing\nsidko\nswopping\ntraschel\nrhsc\ngailen\nmcgrand\nitins\nvinuta\npatuano\nprezant\npuhn\nmygig\nseenigama\nchasteness\nshamila\nmcclammy\nlekae\nextenuated\ntartines\nlingeringly\nincomings\nshwarma\nkianoosh\nbelken\nnagatacho\naltcourse\nbundley\nheisting\ngionatha\nellaktor\nbasrawi\nzainey\nunapplied\nilean\nzaineb\nsimclar\nitalain\ndevonside\nmuehlhausen\nrochkind\nguarnera\nquanis\nrenewdata\nhorakova\nzoocheck\ndisentitlement\nsooki\nboichuk\nguantanmo\nbwambale\nzirh\ncomroe\namersterdam\nkurty\nfcpo\nlawil\nganca\nprodcution\nmihn\nwagster\ngalperina\nlandfast\nabdelmunim\nbpop\nstreetballer\nforeswear\nausthink\nincriminations\nsuperchip\nhulaween\nradiotime\nfeay\nairmagnet\nsecurum\novertons\nbrancott\nbourk\nburdeyna\nunfug\nduchez\nsmooshing\nchaiwan\npennbury\nnpwt\ntmep\noutred\nmonfredo\nndirangu\nwzaa\ncryptarithm\nreimportation\nfasanenstrasse\norback\nbholua\npromens\nenglon\norganizatin\nunentertaining\nseatmates\nhcahps\nigaly\nquadricep\ndrawbaugh\nmispricings\nteyba\nsafeware\naltanta\nweedons\nquaterly\nredecard\nreconviction\nthomasis\nsedq\ntuomanen\nwoojin\ncopule\ndrival\nwegher\nbionova\nenormousness\nwillemsorde\ntabesh\ndreeben\nmisdescriptions\ntrubeck\nrosenwaks\nbalmacaan\nremeha\nbrodyaga\npeppier\nskapa\ncongueros\nbayanat\nkaiserstrasse\ngatorback\nbeouf\ncrablike\ndisx\ntapizar\ncorcept\ndivay\nlabégorce\nairmailed\ndegression\nrickrack\nummmmmm\nqvarnstrom\ntempland\nvegatable\ndimitrouleas\nalteris\ngitlen\nshopfitters\nbipro\nabufarha\nsmari\nkenyetta\npolygraphers\ndenuclearized\nballengee\nsteadicams\ngiuggioli\nbleiweiss\niluh\nkimat\nschockenhoff\nrotenstreich\nvisarts\nunstrap\nteenangels\nkowa
lenko\nrafacz\nnatual\nsubfractions\nunblushingly\ncobern\nvittrup\nmargreiter\nsannwald\nnetprice\nsurfy\nqkl\ningénues\ngiammario\nakekee\nmujadidi\narounf\nzfps\nmisfiling\nmicronetics\nnazak\nozsoy\nwattlebridge\nharalambidis\njuritz\nethienne\nreynecke\ncléac\naltafaj\nsplinternet\nizunaso\nflitters\npolticial\nserapong\ncawt\nsnaefellsnes\nsupermileage\nwaltemeyer\ntrammo\nvanersborg\nvadino\nnorrback\nsjowall\nmomemtum\nfadzai\nmiltants\nshunra\nfnirs\nmezie\ntrüpel\nsnackers\nsunchoke\nrecrafting\npeijun\naccomadate\ncusden\nmullhouse\nxuming\nbaruffi\nniggled\ncentrafrican\nmoufarrige\nqunar\nblavod\nabudwaq\nbodysnatching\nkneafsey\npbfa\narthrogram\ngubareva\nibarlucea\nswiming\nunfortuanately\nunintelligence\nisrat\nmarenariello\nalltami\namshold\nselectorial\ncerus\nsoligas\nworsely\nuwimana\nbiopat\nbuddenhagen\npietrowski\ncarmex\nwhizzbang\ndinc\nfoxbury\nmelanine\nterlo\nsupedi\npanjshiri\nbroadcastable\nagensys\nziobrowski\nshoddiest\nsemass\neducaiton\nnataphon\nsplitboard\nparring\nungracefully\nrubacuori\nnuclearelectrica\nstigmatises\nunderutilizing\nblacksite\nmeaby\nhazelbank\nacidini\nhachamah\nfraxel\nkingsmarkham\nsunporch\narsalai\ndiaster\nmurazzi\nbedazzling\ndiming\ngonchen\nkoroshetz\nloirston\nkongkiat\ndiscplinary\nmuiruri\nwatergun\nbriber\nmotar\npatick\ntufino\nbeejays\ndecampment\nfroufrou\nnasho\nsiasa\nwyddgrug\nsflg\nhagoel\nhuuhtanen\npettier\ngamawan\nrecordbreaking\nwotcha\nminiskirted\nfidgeted\nzemaj\nfileman\nwdowiak\ncounterplots\nhoick\nreaganesque\ngaggles\ncelestae\nmultitiered\nlaurans\nlumenis\naqmi\nnvla\nprosystem\nwelcare\ndibiasio\ndesalinisation\npacificor\nirens\nhwyr\nbanjarsari\nworksurface\neltek\ngfcis\nrhinoconjunctivitis\nsaracoglu\nphoung\nspraggins\nmcmonigle\nhewad\nhongfang\nholdbacks\nbriefel\nrichette\nkalchbrenner\npolygraphed\nsauteeing\nkaltschmidt\nafelee\nalfaraj\nsreet\nessenburg\neggland\ncoolhunter\nfetchet\nsupersites\nrindner\ngouro\ngruntal\nhaueter\nmeyrelles\nkaplicky\ndizzywood\nviv
aki\nwipfli\nmonemvassia\nalsumaria\nkhamisiyah\njiuhe\nschlemko\nloewith\nrttnews\ngrassian\nfileshare\ncabangbang\nglink\nseafowl\nhydrospheric\nbougatsos\nincentivises\nhandwrote\nmuhammaed\nrecraft\npibworth\numarin\nwaladi\nschene\nrabhan\nidtvs\nlauckner\nselbin\ntasktop\nsteuerle\notellos\nsebire\npoeni\nraisor\nreenrolled\nfeart\nfinkenbinder\njennerex\nbergtheil\nwarblings\nwalcoff\nchungtak\nwftf\ndegremont\ncaeathro\ngalphay\nzhila\npratter\nmorocca\ntieas\ndinker\npenzeys\nviastore\nicontrol\nfashingbauer\nberelowitz\nspiceball\nsinglar\nsilkbank\niacopelli\nretrovir\nswfr\nmashiter\ntratman\nmanandafy\nlaylaz\nlidoderm\nnoorin\noback\nnonstriking\nanesthetised\nholstering\nsupergrade\neafrd\nindigenious\ncardmember\npedre\npugel\nandrogynously\nfootbaths\nmiara\njhamar\niftf\nhornall\nindpendence\nexantus\ngainsays\nunsaddled\nfarmhill\nchalendar\nukrainskaya\nmatonoha\nbagsy\nyalikavak\ntamnamore\nkeyssar\nzvegintzov\nsmilovitz\ncloddish\nnonnatives\npötschke\nraineth\nambisome\ncatheryn\nhelsper\nstambecco\npupster\nzinkia\npowersite\napaolaza\nconstructionskills\nmazurian\ngrasz\navitat\nannell\nsimpletuition\nronaldos\ncounterpulsation\ngoldhay\nbusinesss\ndubson\nomnovia\namazonencore\nthermofisher\nwonked\navaility\ncravers\nporeless\ngsam\nmactec\nquedamos\nstressfulness\nantonovs\ncsbg\nmolsa\npvos\nnyag\nlakhvir\nrustlings\nunbloodied\nfootbed\noudéa\naliadiere\nsodefor\nnextgov\nspohrs\ndisgrees\ntortise\nderacination\nfoelsch\naircruise\ncontempories\ntankie\nadapid\nfahimuddin\nunidentifed\nyobbish\nbakhtyari\nexpresstoll\nebanking\ntcab\nghettoising\nenshrouds\nyavala\nmissakian\nsnellin\nikenson\ntrifectas\nkhinshtein\ntyen\ncomentator\nflyballs\nintergrating\nsatorius\ntaxand\nmetcalfs\nslimiest\nvolodko\nwashash\nanayat\ncorenblith\ngrobelaar\ngrandtop\ntraavik\nhobijn\nyuanwei\nkobren\nmushore\ncourtman\noverprescribing\ndhiyaa\nmainey\ntercica\noosterlinck\nmaplets\npenans\nculatra\nmotloung\nmudiwa\nnauseate\nlohans\npiatco\nmacheras\
nhorsefall\nesmerling\nchuntering\nhapppen\nbaswell\nleighnor\nkalkidan\ngulestan\npaleokastritsa\nvegliante\nmarrage\ndejour\nyionoulis\ngroveman\nmasachapa\ncvcp\nfarking\nundershooting\nbockheim\nrajasingam\nsnidey\nkorkman\nbackslides\ngoldsithney\ndemureness\nfrozan\nlobotomise\nschawbel\nescapable\ntetramisole\nprebiotin\nsimandle\npolizzotto\ndqg\nfoamers\nprevidi\nbeltranena\nordnungspolitik\nrooked\ndigicable\ngreensun\nmakhauri\nebex\namorousness\naldermore\ncrasser\nswarner\ndillihay\nitalee\ndatca\nkirkwoods\ndubendorf\nnorampac\ncopeley\ncaviars\nrafte\nthykier\nnormanbrook\ncubicin\navoding\nkharabadze\nberjaoui\ncerist\ncorsock\ntoussas\nmisleader\njanuaries\npiquantly\nwasman\novidsp\nkilali\ndemaster\nholtgrave\neremian\nfrappuccinos\ncharcoaled\ncollywobbles\ndorvil\nstuthman\nwingos\nnafd\nfrysinger\nconsternated\npropertyshark\njindals\nsirulnick\nasimow\nkhicks\ngeomicrobiologist\nschuknecht\ntrøim\nreyaz\nrosiest\nolivan\ngolarz\nvivane\nstachelberg\nmetatools\naeromech\nretek\nremirez\nbarbastelles\nserhal\nobamania\nsprucewood\nblfs\ngraythorp\nuniversitites\nunivited\nkyrgystan\nbattlenet\nhorizontales\nrumaitha\nmaiers\nrexite\ngreenbergs\nltach\nlnas\nrittenberry\nrhji\nsulafa\nenox\nimmenent\nweyerbacher\ndebilitates\ntrimel\nsnuffling\nrocap\nbackscratchers\nleinsterman\npricewaterhousecooper\nmisogynism\nwisconson\ndedmond\nsuyitno\ndadge\nmthalane\npinenuts\nkuritzky\nshanawaz\njaylyn\nconfe\nrottenstone\nstollar\ndebtline\ngritta\npambianchi\ngroocock\nbredberg\neuropeanising\nllandyssul\ncapitamalls\najack\nhabbah\nasdale\npresbyopic\nswyddogion\nallinges\nkyleigh\ncarcache\nstracher\npinoncelli\ntheleme\nngeno\nmessagero\nmoodey\ncefin\ngaddafis\nsalaway\ngershfield\nvical\ndoah\nlionstone\nalverado\nnichollsia\nsatsair\nhannstar\nunconvential\nsarchal\nsommersdorf\ndisallowable\nmilovanovich\ncoloproctology\nkumang\nundercofler\nisraelsen\nclubcards\nsoveriegnty\ncatain\nprescotts\nsitilides\nmisrepresenation\nsnuggies\ntitouan\nir
revelent\nbrattiness\nvickory\neuroccp\nbedf\ncoloradoans\nluqu\npgcb\nkarimou\nanipals\nasats\ngiyas\nladdingford\npaleros\nrukhi\nsmidgens\nmoncoutie\nschnarf\napropros\nleahurst\nchindo\nnobilia\nredetermine\nscanted\nwestry\ndefourny\njerrid\nflatulate\nfeifdom\neightsome\nalipac\nbuckly\nkazani\ndirkzwager\ncummine\nsexlessness\nrutfs\nvould\nbifma\nnazarali\npesis\nhoock\npaperstone\npresilla\ngommes\nornek\nkroners\nbredeson\nlahya\nsponginess\nrahwa\naynte\nkotsovolos\ngamex\nrietjens\nmummifies\nreoffering\nincresing\nsemiautomatics\nfiscuteanu\nabrecht\nruffler\nraiff\novercounts\nluvvies\neeriest\nmingxiang\npichaya\nsmushing\nozonoff\npelisek\ncessar\nvadakan\ndinked\npoggia\nroros\nbusienss\ngrouchiness\nsansabelt\nupstretched\nskvarla\nwerrick\ndispiritingly\npetrikin\nsuebu\nkoruk\njekabs\npensham\ngarritt\nnoncore\nwoloshin\ncreditcard\nsupertax\nporousness\nadipec\npstd\nazotam\ntiltons\nghadhban\nfrechon\nhandys\nandriole\nkocab\nbrainshark\nkraushofer\nunlimted\nintouniversity\ndriulis\nfoucrault\njunnier\nweatherworn\nringfence\ngribkowsky\ntumer\ndecorex\nunderpant\naitzol\nmobmov\nlochburn\npreferisco\ntravisty\nmathad\ndowm\nprivets\nwoolhandler\nnooky\nkuklina\nboutall\nkhamkhoyev\nbridgeable\nbuatta\nsecondees\ntawafiq\nsaucily\ndebbane\nbubbies\nrapska\ngemmells\ntablespoonfuls\nmasbah\nraddled\nconvinction\nadelos\npalenik\nkesington\nmuniruzzaman\nunibanka\nsunchips\njasmijn\nunmanageability\nmlda\nspivvy\nsurpless\nequably\nsynaesthetes\nthirtyish\ndiscombobulation\nnabq\npetchabun\nduthy\nchlorox\nleslea\npolute\nmetaldyne\neerp\ndailiness\ndioskuria\nabiam\naretakis\nvreth\nwhae\nguant\nliasing\nmaydew\ncampatelli\nqorey\nnuview\ncarmindy\ndristan\ncitrone\npetersilia\nmeagerness\noilworkers\nhemgesberg\nbllack\npodgers\namplats\ningly\nfeihe\nwiemar\ngoolkasian\nopfermann\nbunja\nautorisée\nflatrate\nkaddafi\nampitheatre\ntranformed\nkudrina\nshukat\nochakovo\nmarcouch\nchathrand\nstosny\nsekoba\ninbs\nrudgate\nsnobbiest\ndemoff\nsupe
rmouse\ntunnocks\njamuana\npreannounced\nhossien\nstodir\nguegan\nowsinski\nvalicenti\naquaintances\nmuhanad\nlazovic\nmujaheed\nanano\nkosachev\nwanny\nkavcic\nmoede\nbreska\nettelbrick\ncoulds\npozon\nszkotak\nwalderstown\ndelafon\nnickolson\noepa\nkalinic\nflabbier\nkhoshjamal\nsherland\nlavance\nfillup\ndunkelberg\nrokafella\npbpc\nccni\ngabig\nsuwayrah\nshmatikov\nruffly\ntestwork\nvalesco\nsunblocks\ndouyon\ngrowbag\ngorkss\npiredda\nhooches\nrycart\nbathursts\nseiffer\nsharlip\ngermicides\nchhiring\nkurskis\ncapka\npenyak\nzdravkova\npoitevent\nmounga\nleposavic\nnachreiner\nsumiden\nmondre\nunderfund\nhaoyu\namerichip\nbucketed\nalimera\nlitigous\nenamour\ntawian\nindictor\nheartstart\nonguard\nstetched\nkennette\nunhulled\nvaccinates\ndisobliging\ndostoevskian\nbardrick\nshihadeh\nmagodonga\nreplated\nuntersteiner\nglasslike\nbangadi\nillinios\nparging\ndundale\nlibbertz\namsafe\nmeudwy\ndelyagin\neasom\naboubaker\nnasonex\nhashani\nradder\npomposities\nrodny\nkirchnerismo\nbarncastle\nsemdinli\nhezza\ninablility\nsolitoki\npowere\nnarcotrafficker\npitboss\nlazerson\nmuit\ngottula\noinker\nbestball\nguzzles\nteravision\nrushdan\nrubenfire\ninfrasource\nklimkiewicz\ncnmg\nsurprizingly\nchinodya\nremilitarizing\nxiangang\nreconsult\ndorback\ndoft\nsoutherlies\nleinhart\nbernardinis\nobselidia\ntchepalova\nmashery\noverseeded\nlabroue\nripsnorter\nremcom\ngunningham\ncrarae\nfilskov\nbierschenk\nusnik\npiereson\njeetay\ncarryback\nfinneyi\nhottopics\ncyberhomes\nsmithline\nfylingthorpe\nrykner\nterziu\nargandab\nyenikapi\nnomaguchi\nderecognise\nborain\nocober\nsparaco\ntrifari\nbaranwal\nnocas\nslobby\nhommels\nsliddery\njouwe\nwenkui\nchiennes\nyachties\ningnorant\nabermorddu\nlekander\nnourizad\ndiffe\nbekamenga\nrocabado\noguike\nmartrydom\nsapic\naffilate\nmitayev\nmeindertsma\nwarduni\nbiotex\nkritzler\nvanairsdale\ntegtmeyer\ndevasted\nchopinesque\nmoistest\nulve\nkouvelas\nstringfellows\nthreets\nzombifies\njeht\npaslode\nalasdhair\nstandardaero\nempat
ic\nmaryss\nthreedimensional\nplaysuits\nsiguier\nbabbitty\ngodsakes\nkayange\npandermalis\nstrautins\nsaharawis\ntechnopak\nconleys\noroweat\nleggie\nsesti\ndefago\nhasids\ngazzale\nherculex\ntranscipt\nractliffe\nmillefeuille\nhabberjam\ngantvoort\nbilary\nlankier\ncentech\nzaynar\nsanluis\niongh\ntzofit\nmussab\namreit\nweslye\nodelin\nminstral\nslopeside\nsoundcast\nsorowitsch\nparkmobile\nginormica\nsnortland\nmeritocracies\nindecorously\nretsinas\nnotchy\ncharmz\nsouix\nhyberbole\ndrowing\nreprice\ndecitions\nsaltie\nxerion\nmokara\nwhitecollar\nchrysostomides\ncorcodilos\nhaniyah\nsubindexes\nezeagwula\nzaneis\nerbey\nbullmer\nbeckets\nbradely\npfcu\nbroodingly\nskurygin\nhourmadji\ngiade\nyerkin\nmaddahi\nkukmin\nscbu\nzorigt\nniftiest\nmaeyens\nleratong\nwettstone\nmeanwhiles\nhickses\nefast\nmanservisi\nmithoefer\nrobinowitz\nczechoslavakia\nstyczynski\nperreten\nmilltir\nturbomentor\nemrg\nsharashidze\nverrucas\nstraatjes\ncaladiums\nidleaire\narboform\nmpombo\ngougère\ntuszynski\nsunaoka\nswiryn\nnonresponders\npoisonville\nbaloga\nbiovest\nreidl\njamalca\ngameela\nngarua\nhrpc\nfsbi\nchignons\nvarnadore\ndashikis\ncirstea\nstrobbe\nreconcilation\nfiresuit\nprofert\nlandsite\nstymying\nclubbish\ncaceras\nrecision\nacbp\ndarif\ngrooviness\nbiscotto\nlkcm\nsecuritise\nskyped\navivit\npotholders\nntaf\nmesmerisingly\nfime\nsensibleness\nmwaka\nbasata\nleabrooks\ninquira\nmingkang\neskdaleside\nklovstad\nquanjie\nobessed\nraschker\npluckily\nheadwrap\nchoruss\nhiratzka\nviruslike\neathyn\ngesm\nnettlesworth\noestrogenic\nhonaman\nshister\nmultinationally\naykley\nirep\nroasty\nsupplimental\nmartocci\nhomelier\nchipkin\nmiamian\nparenty\nshipworkers\ndisinformed\nferosh\ncurtseying\ncompactdaq\nuncertainity\nvsla\nnapbc\nsangqu\nnorks\nlubenow\nniesha\ndeterrance\nsuhar\ntebbits\nproell\nmauras\nsucceeed\nkuando\nonky\nbrandade\noverfarming\nlaurys\nipga\nmcld\nchernack\ndroser\nnaftiran\nqaedat\nscaap\ncankles\nnetidentity\nblcs\nueberlingen\nsparkleberry\na
fforable\ncommunitywide\nendosurgery\nschloter\njhung\nrecognisers\nsunswept\npheby\namael\nnutrimetics\npurvin\nshivaun\ncanetto\ncylone\nultrapar\nprotoype\njanerio\ndragus\nthreadcount\ngiday\nhammerly\nlesleigh\nzhikai\napalisok\nforrell\nhowdens\nvardag\npropmaster\nbikestation\noptoma\njinsoo\ngarstein\ndamaseb\nballboys\nahanger\nhelstein\ncaliforna\nsowash\ndbouk\nmagati\nanastassopoulos\nsuranyi\nyoungistan\nstateliners\nsonitus\nthebault\nsimeonidis\nmaryton\nbirkhahn\nprudentially\nthigpenn\nmatkovsky\ninflammables\ndugatkin\ntreneer\ntigresse\nbaluja\ndatatech\naubrayo\npuréeing\ncatrachos\nvallillo\nverplancke\nmoreys\nbrickcon\nunimagineable\nhindell\nfundementalist\nincisionless\njichun\nvanthan\ngagfah\nlindele\nlazards\nlashway\nstepback\nsaintpaul\naleklett\ndollet\nacronymous\ndeschaine\ncwmgwili\nchapero\nhazimeh\nbelachew\nclearflow\narmorlite\nrakipi\nfiduccia\npicpa\nvideosurf\ncyberteam\nhemicycles\nurspelerpes\nshoulld\nsukhotsky\necet\nmularski\nnewjack\nsolimene\nabowitz\nmowmacre\nbarnich\ngenego\ngursewak\nsafieddine\nprobono\ndadms\nallested\nmarolla\nperspired\nmiddlesworth\nouderkirk\nmohrbacher\nsanidas\noseira\nkinkeade\natgofion\nsayette\nsuller\nribollita\nmindiashvili\ngratins\ngdula\nkarolczak\nczinger\nproselytiser\nnonconvertible\nproganda\nhospi\nvishnevski\nsluggerrr\nkumiki\nblendstock\nbusaidy\nduchaufour\nrebonding\nmcgeehin\nrenkel\nbodysurf\nhalfvarson\ntesarz\nmulkearns\nalyami\nziswiler\nlumpar\noverweights\nyewlands\ndeconcentrated\nstichelton\nboggins\nwarser\ndiffenderffer\nstonelike\npillayan\nvetheuil\ndealbase\nyinxing\nmilitarising\nzagaris\nprofligately\njgbs\nruales\ntimofejevas\nmiscoding\ndrinkaware\nmambili\nwmam\nrowlestone\nllez\nhalona\nnoninterventionist\nlevalbuterol\nkonashenkov\nspikol\nradionet\nmaneouvre\ndataupia\nniakhar\nanarchically\ndileu\nnvds\nadzick\nmasculinised\ncrankiest\nrxamerica\nknafel\nrenfew\ntestifiers\nachacollo\namorphousness\nnatsheh\nboscamp\ndepature\ncrookfur\ndiyer\nmilway
\naichr\nristra\nbackchecking\nptown\nprolith\nhollywod\ngamercize\nbalce\ncarnesky\ngaffel\ncostell\nsuryakusuma\ngeeez\nnielssen\nmyocet\nbeauvier\nemtech\ndiffculty\ndubbelman\nshifflet\nsafestore\nyapacani\nmarlenka\nvectura\ngalymzhan\njupiterresearch\nqinnan\ntiggywinkle\nregreen\ngadish\nsaffan\nooip\nmockable\nwhitehand\ncesareans\ngreisman\nglushak\ndjeljosevic\nclassie\nnumbly\ndazheng\ncoughtrey\nsurmelis\nbelgiums\nyway\ncaramelizes\nlombarte\nbundtzen\nobongo\nformigenes\nnavias\nbaaaaack\nharrumphing\niovate\nrevazi\nsangdrol\nnonmusicians\nhausers\ndaccord\nmagarotto\ngoeteborg\npassported\nunimmunized\nwiedemer\nsmitka\nderrières\ngfoeller\nexhibtion\nsquanderer\npongsapat\nchemmedchem\ncheekiest\ndjoghlaf\nsmalll\nanthropedia\naffronti\ncepgl\npolicyarchive\ngottliebsen\nrapnik\ntrasti\nbastiman\nkrautchan\nketts\ncontinuted\nklironomos\ngolembeski\nkriesch\nbrogliatti\npepperoncini\nlivlihood\nzinkevich\nfrohlick\nperkus\nheayweight\nshanghua\nculmone\natakol\nmugals\nstablized\ncultureless\nmazzari\nhonni\nporrit\nsawaneh\nmatchball\nexternalise\ncerruto\nlachelle\ncomsys\njalander\nkolanda\nwhitehaugh\nmomentously\nmismeasurement\nrosilyn\nmcbarnette\ngreasier\nmadziar\nmurugiah\ngossy\nmedchi\nsystemising\nkavira\ntravelcare\nmontine\nhezbolla\nrachline\nkarkos\ngolodryga\nmanzeck\nnicar\nqidi\ncurtsying\nmcdougals\ntoniolatti\nstraght\nleplae\nshabbazz\noraquick\nbaghurst\nnulo\ngoldbelt\nfigawi\nshrauger\nselmayr\nwierdos\ndiagraming\nyesilyurt\nlexiscan\nleglise\nyoos\nswindal\nagnellis\nvitabiotics\nbabygro\niteere\nwecu\ninfolinks\noptos\nenflaming\nadapation\nmadrilenians\neigenharp\nrejuvenations\nzeglin\nwackily\ngaggioli\ncorkage\nobih\nhillaker\nvallerie\ntouchstar\nifty\nreyhani\nruhulla\ncounterreaction\ndmfcc\ndussey\njeds\ninvestco\ntanginess\njannuzi\nlehnhardt\nrichochet\nentiled\nooking\ndrewal\nfeelgoods\nspirts\ndebriefers\nsinecatechins\nunitisation\nmeesh\ntoireasa\nprimped\nguilting\nephremidis\npooneh\nbackcourts\nwhiplash
es\ngulfsands\ncutaş\njurcic\nnaiditch\nkablooey\ndaftly\nwahabist\nhumouredly\nzhilian\nkassaye\nchapattis\nmakumbe\namené\ntlhagale\nchaparhar\nsnootiness\nuncorseted\nunwontedly\nfewcott\nefore\nproscout\nwsox\nmacherio\nbozilovic\nxtet\ninterogation\nebwy\ntoecaps\nsiobahn\ndatamentors\nwhippey\nimmitate\nherszenhorn\nuruzgani\nlollypops\norumieh\nmethodone\nbelyaninov\nbareth\nhighman\ncinepop\ntossiat\nberghold\nbmvss\ntopchi\npentrechwyth\nbricky\ntrilogue\nuntenably\nadwent\nmofi\nelixhauser\npaitson\nroués\nlephone\nvindec\nabubakir\namidoamine\nvantrease\nmariellen\nhyperviolent\npernando\ngasifying\nmotney\ntipperty\nmaxlife\ndhillion\ntursunbai\nameel\noverreliant\nsouters\nunsterilised\nsarjono\nsharrad\nmckemy\nripkin\noaters\npostconcussion\npynoos\ncasagranda\nproclo\nmitrice\nibeam\ndaila\naukett\ncrappers\nhechts\nblings\nblashaw\nkragnes\ngarrulousness\npolderbaan\npollyannaish\nancestoral\nablondi\nnkonyeni\nmjos\nhereabout\nhulver\nwainger\nscaremonger\nbirdieing\nidearc\nchirkunov\nbeninson\nddce\nrestituyo\nstambolic\nboomlet\nmigranes\ndepartees\nipanemas\nsortun\nbaskas\nstanching\nstacho\nloestrin\ndonnica\ndelevoye\nsharespost\nglaiel\nharabin\nchezelles\neurfyl\nklitchko\nnerin\nogorek\noffman\ndorber\ndownlisting\ninstictively\njacamo\nhaematologists\nnmsdc\ndeanthony\ndeadheaded\nzagala\nfoncia\nhachikian\ngreenfort\nmusicmaking\nspiessens\ndesertxpress\nbilharzias\nhomoerotica\nsunsail\nkhushtov\nnazyr\npandorapedia\ndistinquished\nfnbo\nblaszko\ndunnellen\nkaissi\nffilm\nurfan\nrosbrook\ncaughman\ncourneya\nfranquelis\nrightwinger\noverseeding\nendomorph\nsgurrenergy\nabudullah\nhowsare\nonsat\nconstructech\nnonsinging\nalupo\nlavachet\ndelcher\nhauslaib\nincovenient\nsuniya\nsavey\napesteguia\nshiree\nyoandris\nhelex\nsoliloquizing\npolyphenolics\nlevoxyl\npresstek\nnarcotraficantes\nbestinvest\nbolinhos\nzoomsystems\ncomvest\nserte\nlistel\nkelaidis\nkenderick\nsalel\ncarnebone\nalchoholic\nbelwind\nseverenergia\nglimse\nheilicher\n
gartzen\nbrookover\ntuvey\nuninvented\npittam\niurato\nunderwritings\nendometrin\ntrezona\nwoys\ncasassa\nmindlab\nannotative\nlammel\nkrawczynski\nsubmersing\nthemn\nvalcent\nthinkuknow\nsingledom\ntaharka\npossati\nredplum\npenayo\nsardelis\nfarecast\nsubramaniyan\ncontinuingly\ntuilleadh\nelhami\nthroughfare\nnozhkin\noversaturating\nsmoriginas\npredicatable\nchooky\nwynott\nswiftboated\nameerul\nvalades\npouters\ntujague\ngfirst\nvisionless\ncamileon\nrancatore\nkaeson\ngryaznoi\nmoelgg\navaition\nstiwt\nferumoxytol\ncudgelled\njeggle\ncebt\ncargurus\nklausenpass\nearings\nkittredges\nprorates\ncaonima\nmacvean\novercritical\npatzold\nmareya\nfromageries\nbattut\ndreena\nsomet\nyisrayl\nhamouly\nchiego\ncraiginches\npléthore\nshojin\nbhith\nwaitering\nsufferred\netouffee\nfelgtb\ndrra\nditalini\nweaber\nupwash\nboutih\njanikhel\nschwieterman\ntryanny\nheeschen\nkleivan\nluxuriated\nkatzev\nmauad\npodding\njumeriah\nrosetto\nenergey\namortizes\nprimobolan\nishiya\nmcilhone\npenalisation\ngotopless\njounalists\nkalmijn\nazocar\nkebbeh\nartown\nyabuno\nhydrolized\nmisimpressions\ncatrow\nroseisle\nloarie\nguzzled\nchheang\nspped\ngecc\napah\nphotoscape\nselody\nmachisu\nwolesley\npeblig\nlssi\nkagona\nmarziah\nlajko\nastonishments\ncounterdemonstrations\nquillagua\nwangmene\nhakiwai\nnipro\ngardenweb\nartmosphere\nashoori\nmakhubela\nzhushu\nislambad\nsubfunds\nszonda\nmaňka\ndesolately\nlatests\nlitwinowicz\nshamol\nodiously\nbufwack\ntrustworth\nafmxa\nwhiddett\nwoomer\nsauvey\nlewitsky\nficara\nperridge\npassalaqua\nsayedi\nnyeshia\noverreporting\nstablised\nweathernews\nstabex\nobamamania\nkaeda\ngynormous\njaemin\naxcient\npmle\nmammotome\nqeada\npisf\naltamed\nstepovers\nuscinski\ndefronzo\nkéchichian\nskeoge\nacccused\nwindiness\ngouldings\nfavourities\nzhichun\nguysville\nwoolfitt\nrosani\noludeniz\nfedscoop\nradfar\nhemrick\nhooksiel\ncrissakes\nguagliardo\nallodin\nleningrado\ncentamin\ndhaifallah\npaspa\nabdulamir\npatulea\nmicroprudential\nfredrix\ncoli
mon\nelaws\navihai\nkrents\nglassless\nabsolutisms\nsharespace\ngweld\nbuttriss\nsummmit\ntheramin\ncouldve\ninsideline\nuygar\nintriging\nbahkshi\ncldf\ncbhf\ncalgreen\nunrenewed\nnonadjustable\nmicrsoft\nfenstersheib\nletna\nmiddlecroft\nschauland\nbegic\nduralex\ngenara\nlardaro\nnanoengineered\necchr\nnonbiodegradable\nbressman\nkorupensis\nholmeside\ndizziest\nfilppu\ngaglio\ngnashed\nrudkovsky\nauthorizers\nvirgnia\nlochfield\ncemusa\nbeclouded\npotshow\nazizova\ninsectariums\nunshadowed\npanasci\nlesane\nmicosoft\npermeti\npiecrust\nblunderers\ninterestedly\nstigmatism\nrustier\nmahida\ndeyon\njonikal\nskovbo\nbeevis\nryotei\nsmolke\nmalast\nikechuku\ndecertifying\ncyberharassment\nsidefoot\nsipio\nunderthrew\nkemery\nbigmore\nyubamrung\npedwell\nknoxy\nspeechmakers\nmtop\nticketable\noldoinyo\nchemjor\nwamiq\nschalfkogel\nlongenbaugh\nrasmala\nperkstreet\nexosolar\nsainbury\nguradian\nbaladiat\nkacou\nnorson\npaaswell\nannoucing\nunguaranteed\nmellqvist\nlexon\nfilesoup\ndarnovsky\neuronaval\nquadbikes\nrawod\nfraenzi\nmahaman\nnwoga\nassoication\ncitimortgage\nmusir\npostelection\nbloomie\nezzedin\nmielles\noscarcast\ndonʼt\nkickboards\nnchabeleng\npanickers\nkenoy\nchascomus\ncricitism\nipodtouch\nlifechat\nriskily\ndopy\ntophoven\nléonid\nlebi\nmtan\nblcu\nortique\nmaxlinear\nbamut\nanlaysis\nnewva\nbijagos\ncholnoky\nkoksal\nbircken\nusbourne\nbarnfields\nddmi\nambergis\nbittova\niberri\nambaye\nretendering\nschoenholtz\nrawboned\nburberrys\nghayoor\ndongshen\ndansaert\nakhondzadeh\nshuttlebus\npavlopetri\nmentouri\nbjornsdottir\ncloistering\neykel\neemea\nmolnia\ncinnamond\ncesan\negington\nfirtina\ndatakhel\ncorgiville\nyeatsian\nkelberman\nabdellahi\noverbaked\neljvir\nmelonas\nxpak\nnaqu\nhoffbrand\nymwneud\nvolberding\ndebronkart\nfondazioni\nintollerant\nzareer\nrangier\nmelgren\nrecalibrates\ndimunitive\nkerimli\nxhale\nmhora\npotbellies\nseminerio\nreclaimation\nshyy\nrerating\nmortillaro\nchenghu\nparisel\nmccreet\nmetje\nchulov\nbssf\ncantrel\n
boehle\ngrogginess\needs\natherothrombosis\nguediawaye\ndepositos\nkrawchenko\ntruckfest\nmeringolo\ndcsp\npanarina\njsea\npijls\nbocom\ngrwp\nhavillands\nunleveraged\nbarakett\nowener\nsmagula\ntsod\nphraselator\ntrevisanato\narkeia\ngarbino\nbalilty\nbacalzo\ndegooyer\nhosteria\nsensitivies\ngateses\npcar\nhdma\nzonegran\nlaabidi\ntomisue\nphemt\npanhandled\ngroharing\nklimentova\ndegauque\ndafri\nkazachkov\nrushie\ntargetman\nmiljevic\ngenerra\nupscales\nafix\nwetli\ndomanic\nmclinn\nsherita\nexhanges\ngronvold\nlamouche\nanahtar\nreinfecting\npaedophiliac\nbijoor\nneivua\nrubensteins\nbudgeter\nwineglasses\nmoszczynski\nadmp\ndjavad\nvisitorial\nhemingwayesque\nnanosilver\noncothyreon\npresumptiveness\ncarpeneto\neverflex\nblogospheric\ndhyanapeetam\neschatologists\nolivant\nweatherunderground\nciggies\nnandigna\nborrman\nlowriding\ndostam\nbaltiska\nlisped\nhasira\ntweeness\nghoulishness\nimmigrationist\nbrigyn\npolicians\nframwork\nbogdal\nhammeri\nscreechers\ndrizin\nshakeys\ninabilty\nkirksanton\ngoodfood\npersbo\nwalletpop\nsalaberria\nvermet\ncadidates\nhyperv\ntarassenko\nhawalas\ndrnc\nvistes\ntangly\nwordscraper\nabderrahime\nsunlin\ntheodosopoulos\nunsigning\nschwartzenegger\nsteepbank\nsiebenaler\nidjits\nhandwritting\nprovactive\nunderexposing\nacabq\nprewash\nrimsza\ngoning\ngoldford\nlomondside\ndavender\nrijkman\njobseeking\nzulfu\nbeerntsen\ntsnas\nunderfire\nsorgdrager\nracecards\nnattiv\nsadriya\nacquital\npromperu\nfreeradical\nguldimann\ncogentrix\ntzekos\nbabalawos\nnosedives\ntendar\nmubaraks\nhahahahahahahaha\nspickernell\nipcom\nergneti\nechan\ndaivd\ncamarota\ncasciato\nqare\nguentner\nfoxhollow\noswold\nguodu\ncmlp\nbuttree\ntimipre\nresurgance\nmulyasari\npiehole\nzinchuk\nknobkerries\nbupkes\nassemi\ndibens\ndryman\nchimneyed\nmasterlink\nduddies\nbusemeyer\nstaceys\nenouraged\nagaporomorphus\ndettre\ntoysrus\nmulticulturally\nislamising\ngoldvarg\nmontalbini\nelhaj\nkyalo\nkamlani\nhysear\nlietenant\ngalanz\ntroublespot\nbardakjian\n
ditziness\nverticalresponse\ndrakakis\nsarcs\nhirakubo\nshainova\nvizner\nscintillo\nstockebrand\nmochomo\nmanclark\naborad\nmassolo\nstimuvax\nnirkh\ncated\nksnd\nslowpokes\naksh\nplastinates\nintothe\nnecarne\nantiapartheid\nmoruti\nuglified\npodimata\nmpshe\ncfso\nfelisbret\nmountainscapes\nhadhramout\nshtik\nseedcorn\narizonia\nwinklers\nsomocurcio\nlichtensteins\ninsuremytrip\ndutchified\nropeik\nbothfeld\ntchp\nteamsheets\nnationalbanken\nbrussles\nlaugar\ntechnologized\nsupperstone\ncorridore\nstomaching\nwuxiu\nstidolph\ncomodity\nflyspecking\nhostmark\nallensmore\nrolfsrud\nmaridadi\nvlsci\ncarrem\nkoumiss\nresponsbile\nbuctzotz\nprasow\nlegimately\nhersen\nwibbles\npodlike\npigeonroost\nsiwik\namericone\ndirndls\nlameduck\nqadoura\nreoperations\ncaldrons\ntolek\nhubertz\npoppadoms\ntushishvili\nwalkaways\ntodoli\npracti\nprewashed\ntimebank\nbrainchildren\nmauia\nschankweiler\ntunelessly\ngavio\nwinkett\nvilmorinii\ndecarbonise\npepperball\ndrywalls\naduku\ncareerwise\ngeldmacher\nbollant\nlirhus\nwmpt\nsattui\nspeechome\nsunkuli\nmajestyes\npopogrebsky\ndworken\nhafida\nskrivanek\nmicrodrones\nbaltrunas\nkhaldan\naudrin\nvanselow\ndeneroff\ndadak\nkascak\nafbi\nmontecastillo\nharaf\nqtip\nliveuniverse\nisnora\nmaulidi\nstraightjackets\nradilla\nmicropilot\nremache\nyongyue\nmazeikiu\nannouces\nstrangfeld\nstrategizes\nvicorp\nunrationed\nmassaglia\ntbvi\nceramtec\neiris\npotlines\nlaskoski\nlongwei\nperdent\npiriapolis\nstudenka\nprolink\nsovereignists\nganglands\nrampike\nesperion\nintitially\nschmuckler\nbeanfeast\nschrenzel\nnonelectronic\nbamarni\nconversative\npeterkins\nrejecters\nstaropromyslovsky\nheadiest\naquirre\nlipodissolve\nzahim\nsteelfab\ngotze\nlhotellerie\nflns\nstaffrooms\nfaughey\noutrush\npurewire\ngeorgaris\nemoi\nenchancing\nartiss\nchiaiano\nbeeden\nreseacher\nmceniry\nspongey\ncannoning\nirrate\nlickspittle\nskory\ninnexus\nlintuan\nlazeric\ncapaign\nbraxis\nquiffs\nsarapiqui\nbrüner\npooterish\nklimley\nexcelerate\nmycolors\nadde
x\nrhydycar\nstaale\nsaklikent\nposma\ntrupia\nairington\nidong\npeerwani\nveryfine\npolyhydroxy\nbvps\nmosleys\ntereu\nfamliy\ntechnlogy\nhomestall\ntzafrir\nkosayodhin\nbarhopping\nbyoe\ntimberon\nwinzar\nmelazzi\nflogos\nplainclothesmen\nshpl\nlitty\nmontalte\npenningroth\ncarnivorism\nbarkeeps\nthemost\nrefroze\nultraclean\nzwcad\nbingai\ndaokui\ncapricans\ntanjin\nkharbash\ngriffee\nbotherers\nphort\nloscher\nbrombergs\nschoolbuses\nattacts\nmooched\nhillheads\nmonnat\nkuźmiuk\nanted\navish\nvsevelod\ntopdown\npashminas\nmidichlorians\nélitism\nsonnega\nbahukutumbi\nassosa\ngichuki\nmaleyev\nyoco\npetronis\nsplinterheads\nsacristía\nszejna\ngrolar\nfladung\nersland\ncksw\nfellus\nceasfire\nweened\nsatid\ntanpinar\ncianna\ndebaggio\npotholder\nmartissant\ncheesmond\ncyberonics\nfloyde\npedophelia\nleanord\nnaiive\nstanched\nardura\nminuites\ndatatec\nschenkelberg\nlibberton\nsidm\nalspac\nuncoachable\nloiron\noverprescription\nskinput\ncuzick\nmangolte\njehani\nmananger\nmenrad\npettiti\nkharzeev\ndealflow\nratagan\ngambera\nyellowbook\nurofollitropin\npreowned\nearthend\nsaccacio\ntrentmann\ncomish\nzhenglan\nshtreimels\nbaylake\nnyfix\nsutherst\nolima\nvaquillas\nciorciari\nboulygina\nspallen\nrhubarbs\nforbearances\nvaujour\ndescottes\ndaintith\ncapelet\nmaesydre\nmackenzi\nkabelis\nhomier\nregaldo\nplanadas\nprimecare\ncananda\ntreaments\nbrunnera\nfehlhaber\ndefillo\nkhabaronline\naramnau\nnnec\ncarrousels\nfrancisley\nsteriods\npevehouse\ndairakudakan\nrafiullah\nchrust\nncsli\nmicrobanking\ndocumentor\nindabas\ncarsickness\nbahran\nweilded\nschaben\njonney\ndmes\nsunnywood\nsuperboat\nrotisseries\nstancl\nmalvey\nmpay\nnorbolethone\ndaszak\nkazillion\nsemitool\nthehotel\nprimadonnas\nphotochromatic\nvarland\nwulfman\npenacilin\nsharek\nreligionism\ndonielle\ncingulated\ndezzi\nmillmead\ntorbati\npeiry\nspaly\nlyngdorf\nresoled\ntlbb\nmitzner\ninterveiw\nbraodcast\nfairpensions\npavees\nkaschalk\nboarland\nsanfield\nslendertone\nkhev\nusablenet\noccurrs\ns
chlagenhauf\ngovernemt\npeepolykus\nelctions\nicesheets\nhaltime\ntagliata\ncarrillos\nishkhans\ndibadj\nstiti\nstyres\nmcelmurray\nhacu\npéchenard\nlibtard\nrhydlewis\nlosinski\nsudekum\ncheatom\nspinnato\nphaup\nmanoguayabo\nshaum\nafssaps\nbossaso\nparziale\nlobon\ngirolle\nreau\ndestounis\nsapristi\npropagada\nchodan\nrashakai\ncurty\ntangoed\nbatzer\ncalvanese\neremic\nexiler\ngriel\ncloseminded\nwikicrimes\nagoraphobics\nriddiough\nsandalow\nlobenstine\nfactorys\nkrutonog\nkhodari\nwemi\ntamelen\nvanrell\nkohestani\nroddrick\nallanna\nthenm\nshewsbury\ndeutschemark\nnisenthal\nbushara\nkamchybek\nermira\nprimondo\nlayaways\nmccally\ndihle\nopiod\nbiosurgery\nfratzke\nschnetter\ngerrell\ntelphone\nfractionators\nnighclub\nvardes\nnirja\ngrebby\nilyse\nhouselights\nmaninger\nrémoulade\nwsii\npetruzziello\nchicherit\ncsssi\nfabrico\ngreenpalm\ngransmoor\ntronconi\nbouland\nbooys\nsceince\ncmtx\nschellens\nraydiance\nsarpourenx\nmavni\nwilfie\ndefered\nmonosol\nmyogen\nschaghen\nobsenity\nchakiwara\nwhirs\ndlan\nskyscout\nonone\nmisrouting\npyworthy\nrhônes\nnortheasters\nprevatt\nfrigstad\nseetoh\ninwest\nexces\ndeju\nfuthey\nnimol\neliraz\nderossett\nwoolsery\nlacusovagus\nmanless\nlbpd\nemrouz\nsherez\nroadwarrior\nexagerrating\nbioflex\nlaysha\ngodzillas\nwaivered\nmassler\njoseva\noverpays\nnitsana\npurdis\nocklynge\noveruled\nfourpiece\nprofeet\nfederbush\nheadscratching\nembarrassedly\nprotiviti\nzhesi\nuniko\nfagiuoli\nviccei\nzagha\nhelpdesks\nthreadhead\nflexibilization\nfassold\nlooniest\ninvigilation\nvaciago\nkilkeary\nchannick\nismayl\ngeofence\nstockhill\noakmark\nosaze\npakkoku\nestacao\nwhodat\nkizilay\nbarovsky\nigss\nnetcu\ndiptheria\nmareer\nharingay\nfirmount\nrenationalize\novereaten\nanatomize\nfroglife\ntraipsed\ninterpretors\nkhetrapal\ncanl\npikelets\nresplendently\ngilyeat\nzabib\nendocrinal\nobenhaus\nbioprosthesis\ngufeng\nlowlier\ndeisher\nmultisector\nbaddock\nynni\nblaenannerch\nhuitian\nrocquaine\nbrodeck\njayanarayan\nbrigend\njil
al\nbreakables\nshkedy\nphwoar\nkravat\nproverbio\njelleyman\nchelokee\ntojirakarn\ngoudin\nqomo\nearthcraft\nagrc\nshantsev\nmangasaryan\nsuncream\nfaceful\nbesharov\nchailert\namibitious\ndesensitise\noulmers\nfrothiness\nlesers\npissoirs\ntrickel\nservicement\nswampier\ndrotske\nvivon\nseocnd\nidjmg\naftewards\ndravucz\ncaixacorp\nspirtos\nelizabete\npissaladière\ndrivelling\ngharab\ncounterinsurgencies\nnecci\nsiefer\nthahabi\nqinsheng\nhaidan\nkasteler\npolititian\ntipsword\npenodol\nchoike\nancesters\nkyabakura\nwrep\nhtcia\nlydmar\ndykhoff\nkabiller\nicban\nnavjit\nbeccause\nnixzaliz\nunprosecutable\nceter\nsipkins\nnankabirwa\nomfug\namoro\ntherapods\nphildelphia\nmeltons\nsystec\ntelexed\nunchlorinated\nraquenel\nwilderhill\ncarbombs\nwieseman\nsedgh\nrecolonising\nolsiewski\nbraje\nsollo\nkiefaber\nnerudova\nsimonsohn\npowl\nchipolatas\ndorith\nsleaves\nplatystele\ntwinsets\nspaeder\nmattle\nraechelle\nsemsar\npaciolan\nmosqueta\ndieperink\njingoists\nantiglare\nissifou\nvastic\nsweltered\nshrills\ncolsey\nhestness\ndoerries\nmotorino\niwbs\nbeyersdorf\nralliers\ninweh\nagendia\nhabinek\nreinstitutes\ndunmire\noverstimulate\nambue\nkhannas\ntultitlan\nnightguard\ncsag\nsleazeballs\ngmtc\ngansert\nburakowski\nbeersbridge\nmaribou\nvalras\ndamscus\naairpass\nbigtent\nmunichs\nahdyar\nknudsens\nannointing\nhogeveen\nbekbosunov\nlaquinta\namurdag\nbannaby\nabramorama\nevanka\ndemeanours\nscoraig\nshway\ntilkin\narmagost\nsubconcussive\nbreakfield\ngiattino\nsodhani\nmaccaull\nnauseates\naspirationally\nniloo\nsilverrock\nmuyambo\nzoladex\nderelictions\nithought\nvfinance\nuranishi\npinneys\nzumobi\nporogen\nvincz\naaprp\nplaywin\nfermon\nrocketi\nslotte\nportovaya\ngwraig\nwuzheng\ngrŵp\nquntar\ncowcliffe\nchldren\ndisabuses\ndownington\nangmo\ngovx\nreconvicted\nbuddeke\ngyalzen\nhicker\ncovenas\naromasin\nshebab\nstonewash\npersonratings\nmarcuson\nbayti\nmiscalibrated\nbusinessowners\nleeburn\nbyalalu\nmarberger\nwiergate\nmicrobrewed\nkrystofer\nquada\najo
rlou\nversveld\nsupatra\nrachinel\ngenepax\ncahillane\nakinbola\ncuende\ntribewanted\nboughn\nmaslovskiy\nrenshon\nmaidencombe\njoguet\nsimoco\nplyer\ncomisiynydd\ndrumrolls\nagland\nyermoshina\nhypervirulent\nwillenken\nkuryla\nkellyton\nshovkovsky\nvenzuela\nbassell\nxaltepec\nunpegged\nmarzok\nchutian\nconservera\nmichaelwood\nseeminly\nwarthan\nprotzmann\nazrouel\ntrombitas\ndremiel\nintercomm\nbeedenbender\ntoeless\nelishia\nshipside\nheleta\nwoollier\ngotschall\ntaliafero\nsannitz\nglucometers\ncountryish\nasimco\nswinbank\nbentovim\nredward\nbarall\nnazereth\nlitner\nagrobacteria\nwhitcome\nvircom\nupender\ngolsan\ncorespondent\nbenguerra\ngeoge\nforver\nhounsome\nworldfirst\nbotanicas\nrmts\ntyrannised\nsunned\npelavin\nniembro\nlehmkuhle\nmultifetal\ngranddads\npersonalizable\ncarbonare\nganoush\njurijus\nstultifyingly\ncrabcakes\nbulok\nacierno\npapahanaumokuakea\nbollingbrook\ncortesio\nkatoey\nmukhrovani\nyanyun\nbisol\nlanxiang\nlosty\nyeohlee\ncardiotocograph\nharnet\nheryawan\nrashidan\ndazzlement\nmonkmoor\ndyspraxic\ngatkuoth\ncentage\nfeminem\nsawl\nilyushins\nskidoos\npizzetta\ndostoyevskian\njamye\nmuehlbauer\nlmvh\ninergize\nhmbana\nboeh\nvirigina\ngoochie\ntepetlán\nwlcsp\nncsp\nbodytalk\nkretzulesco\nkhristine\nelectrity\ngremillet\ntrefz\nkatasila\ngulets\nrohra\ngrigorjev\nbedrails\nkaliss\ntransf\namoralists\ncannier\nsigheh\nlled\nfujeirah\ndevestated\nrepublician\ncoreg\nsamachablo\nanyansi\nbiunno\nbriefers\nsukkahs\naccessportal\nwhitestrips\nangawi\nruscetti\nsheillah\nbauerly\nbunkrooms\nawri\nmocassins\nlassegue\nvolpenhein\nmungin\nreitemeier\nchukudu\nkalcheim\ngreenestreet\niosono\nelfstrom\nbackhauling\nxpec\ncapial\nbataoil\nlebeauf\nfrothers\nmelloan\npanzhinskiy\nturtelboom\nplayfull\nslathers\ncugnon\nexectutive\ncctm\ncharasmatic\nhajizade\ntairia\nlabourist\nizzatullah\nsuccessul\noverleveraged\noverley\nphiltjens\ncerfontyne\novx\nmutahida\nyounghee\nhoil\nthreatre\njizzax\nprevalant\nstauble\npilled\ntpye\nververs\ngerler
\nkjustendil\nchavancy\nantigravitational\nzakinthos\nremeasure\nhaugabook\nyaftali\ninfotrieve\nladish\nmediamark\njalynn\nprutsman\nwalzes\nvalabik\npmfm\ndawidiuk\nvnaa\nheadguard\ntennies\nbiart\nfacebooker\nsanm\nwackermann\nshmotkin\nbtter\ntweini\njersualem\npartl\ndressier\nalkhalifa\nrendevous\nsophisicated\nlorenzos\nbriskets\nspinasse\nfondants\ntamarah\ndarknesse\nwhoppingly\nbrushings\ncaviling\njinxin\nmcilmoyle\njadaa\nstickings\niftaar\nlayas\npechacek\nvoropayev\npirb\nzaio\njiangying\nhakurk\noxtails\nmournfulness\npreferido\nbicsi\nstverak\nconiker\nmakawa\nrodopoli\nnafpliotou\nablauf\nzenima\ndishrags\nkriechbaum\nchipiro\nascarrunz\nlmar\nliquin\narapoglou\nsubsonically\nhairshirts\nholevas\nwiyono\nscamsters\nblackmount\nampim\nspiffed\nmunaim\ncyrte\norgell\nmonastary\nzimbabawe\nagressors\ngpif\nscuttler\ntrucost\narnestad\nmosehle\nsavci\npricings\ndivoll\nadigwe\ndrumintee\nexercizing\nvillainize\nkhadidja\ndulaymi\nkonsam\nharperstudio\nmaccario\norlowska\nlinkout\nvisalam\nkawaler\negstad\ncagnon\nkalameh\nmouselike\nglaab\nmargett\npancini\nminins\ncommisioners\nrussoti\nchyzh\nkalvis\nngcc\nthaut\nlayettes\nmuynak\nshoulderblades\ntaribavirin\nimidiwan\nmedvin\nsunshields\nwouuld\nproductize\ncepl\nzanupf\nbroked\nzarghona\ntoptier\njeffrion\ndenckla\nquandrangle\nchiropracter\nplaisier\nbazzetta\nnxtcomm\nstudenty\ntalkboard\nquaak\nmartavius\npetrosun\nvolpara\nthavisouk\ncorogeanu\nlighthart\neuropeanness\nschenkar\ncollerton\nmonteiths\nheppleston\nelsina\npinstripers\nborocz\nlafetra\nmarmoreal\nhotpsur\nsarsembayev\nerraji\nneidstein\nmahne\nsendar\npowfoot\nchanthaly\nshiyah\nrayssac\nhyperendemic\ntschuggen\nrecapitalising\nrendeiro\ncimzia\ndrinsey\ntavoletta\nilhota\ndubchak\nmcelman\nhessy\novais\nabujihaad\ndeglaze\nanghelache\nsgobba\nnonstory\nintellectualising\noduoza\nkissables\nhensal\nrewarm\nwoofter\nlumpier\ncheesily\npacier\ndeall\nneuralstem\nhaplessness\ncaifornia\nghatan\nzitan\napalachi\nshustek\nofran\nbalakha
ni\nadoyo\nfallico\nnanocenter\nfebruay\nyoicks\nrosbifs\npcapa\nrafaelov\nbulgargaz\nstollard\ndiebert\nspeedwalk\nfawningly\njokily\ntygard\nvallings\nlibrandi\nkurrum\nmotorbikers\nharded\nhalfsies\nburgler\nkovalyk\nthombs\nworktables\nboogiemen\ntaikonauts\nqalah\nweseman\nsadibou\nunfriending\npennyslvania\nfwice\nfryett\ngiampilieri\nbehnsen\nhousely\njiazhi\nebok\njanean\ndisabusing\nmehdar\nfaultfinding\nbirdstrikes\nsoonr\ntobola\nlangenhan\nsumirago\nnahirny\nshorina\ndobransky\ntcdt\nmcalpines\nnoninterest\nleonardelli\nmacugen\nludwina\nparveena\nmarberg\nbiofuelwatch\ndombek\ndustpans\naccupressure\ncasee\nnyia\nhajem\npief\nchubbiness\nvífill\nmoistureloc\nmaroone\nchiringuito\njetico\nflippage\ngalmo\nschuyff\nreinstatment\nenvivio\ndemythologized\nkadaria\nonecat\nhowt\nsensipar\nsherpalo\nmeadowgate\nnurdi\ncasac\nrétromobile\nmistiness\ngreensource\npirko\nlaforgia\nhottle\ncrapy\ntrenchi\netxaburu\nnagydij\ntoffel\nwoundedness\nbdcp\nbuccini\noutraising\nbruesewitz\nscacchorum\nlovatelli\narteriotomy\ntregonetha\nstuewe\nonother\ndepoliticised\nsenties\nnatthawut\nkenndy\nroozrokh\nbodyattack\nleeholme\nwashton\nwedginald\nmohebian\nsyden\nehmcke\nprotocluster\nexplorelearning\nghashghavi\ncarida\nmatussek\nhomeira\nalsabah\nmazyck\ncomissao\nbassoff\nhubdub\nmalamed\nneophobic\nzarnke\nnapack\nswogger\nmukonoweshuro\nmaxit\nerrosion\nraveloson\nlahaleeb\njericevich\ngadgeteers\numraniye\nrusenko\nsotware\nchevillot\nkettels\nsantarchy\nchitting\nbindschedler\nkamajor\nshmooze\napparu\nreissa\nmureithi\nravas\nsharaud\nfajinmi\nrsantiago\nassload\nsalicyclic\nkuchinoerabujima\nkontraband\nwysocky\ngoodmark\nlawncare\nkryvobok\nleibig\nguindulungan\nexpostulating\nvahradian\nunexhibited\nholbeins\nschoech\ndetectible\npamelor\nfireams\nzation\nstitchings\nspoilery\nelegist\nmohamadi\nsacktor\nboomslangs\nzemlianichenko\npremesis\nthita\nvinopal\nlascola\nprotrader\nlegrice\ndemauro\nwazzu\nriebel\nrottweiller\nsauts\nsabeg\nservigistics\ntomlinson
s\nkalkwerk\nhininger\npluth\nlongdowns\nsubtone\nangiodynamics\ncatastophic\nfavorables\ncaravanner\nanip\nwyshak\nsulmasy\nokuyan\nkekhvi\ncarpiagne\nlutropin\nchruscinski\ngarell\nthingummyjig\nsereena\nhammerings\ntielve\namangalla\nnellans\nungainliness\nbetro\nconsience\nstrenghtened\nwernig\nehrlichs\nphysicial\nxueyong\nundecisive\nbritrail\nkramim\nchevely\nsulistyowati\nbibp\nflurried\nprapawadee\njaroch\nkittaka\nkamienski\npamon\nbakol\nmailstream\nbrif\nbattams\nmuttonchop\ncampingaz\nsylwadau\nkyndiah\nrollenhagen\npsaros\nigglepiggle\ncritcized\nmemolli\nflumotion\nshoar\nsousi\nhoungans\nfpns\nreplaster\nsoulstress\nosmek\nantipodium\nvalidis\nlakhvinder\nautospy\nyogabugs\ncomplainin\nyanito\nviperous\nrecapturetheglory\nmuney\nmdtv\nkemach\nmismarked\nhumourlessness\npitigal\nhaileyesus\nbolchover\nturturice\ndemonination\nalbondigas\nzinged\nhonn\nghuzlan\nidentigen\nbrdl\ngatza\nsipla\nfinacially\ncoudreaut\nkubango\nsmartshops\ntyrannobdella\nkhaiber\nnondas\nshasun\nsankary\nkipre\nemamul\nfirststep\nimaki\nringles\npaukner\nsorosky\nseparado\nstatelier\nspunkiness\nbiocontrols\nstrahs\nscintillometer\ntches\ncacchioli\nmahnic\nhytest\nextradiction\nnimoo\nbenfits\nlelouche\nmasgouf\ngangsterish\nboltholes\nmusana\nsemaw\nddaw\nfloraholland\nexemptive\nprimous\npaiboon\nreconrobotics\natttacks\nfrothier\npwajok\nplusha\nposties\npenwithick\nhochheiser\nyongyoot\ncuigezhuang\nchisnell\nportego\nfretfulness\namvrakikos\nintegrilin\ncarrotmob\nmutashar\nburgat\nlelieveld\nemmorey\npseudophedrine\nabdual\noilrigs\nwzab\nuptightness\naaaaaaaa\ngorbey\nbootlicking\nspykes\nasacol\nmonitorship\nbionumbers\ndistaster\nwesselius\nargippo\nbrunners\nmahdieh\nsceptism\ngrotke\nhostaged\nthebritish\nwieandt\nperille\nsnowshoer\ndedvukaj\nreemploy\nbribers\norfinger\norine\nlatchin\nthisclose\nbunf\narmloads\nendundo\nffransis\nfettucine\nbackstabs\nrasunda\nrgus\nlorillards\nunmetabolised\nyolanta\nsaffra\nbartabas\nhommos\nsabresonic\nqiyada\nworldblu\nha
rehill\nprating\nemployeer\npunditocracy\nburqua\ndahlvig\nsmeco\ntikos\nzippier\nkeroack\nbonset\nbenmehidi\nteeshirt\nmadunina\ndsec\nkazamias\nmidlem\ncloseting\nqudra\nlanzillotti\northoaccel\nlidvall\nehrsson\nstarkjohann\nvezie\ncapoulas\ningolfur\nseave\nsifc\nkarenzi\nebison\ncallpod\nkardava\npettily\nperiph\npowerpac\nthummalapally\nhcry\nweissglas\nmeridio\nbowlines\njaksto\nfulcra\nribnovo\ntopoff\nmachipanda\nflytes\nthirstiest\nnatchiappan\nreffet\nzarren\ngwell\nmozarella\nnaatha\ntowfighi\nsofabed\ntsuneoka\nannelisa\nsupman\nathers\nglobalizers\npirandellian\nirené\nprizen\nzaafaraniyah\nmulticity\ncornhusk\nharestone\nlhbs\nverbraak\ntrialogues\nborrus\nringford\ndavd\nrawbers\nczin\npilkadaris\nsoakings\ntanzy\nhilarous\nmchann\nhengda\negglike\nmultipiece\ngroinal\nfragrantly\nhomeworking\nkomanoff\nbuzzmachine\ngraubuenden\nnilab\ngutermann\nruaro\nmetrick\nyotei\npaparazzis\nbogusness\nmonba\nehenside\nalgebris\nultz\nthumpingly\nbudhwa\ngluteoplasty\nalwara\nfiscardo\nstukalova\nparizek\nvalenica\nshoraka\nbeggaring\nrattue\nformidability\ngergaji\nwazungu\nleostream\ntrasks\njunious\ncyberwarriors\nirritatedly\ndelduca\ndegutis\nshuftan\nmegaliter\nhamiel\nsshe\navalere\nmcguinnes\nlaropiprant\nwonderlich\nberkowicz\nslotsplads\ninnovene\nnstein\nmaurauding\nmalesko\nfundrasing\nwardrobing\nbuzás\nlobis\nbarerra\nunderprice\nsuson\nmausner\nsabaudin\ncramant\ngracer\nchechyna\nzarkin\njanati\nunmagical\ncontactus\nkabari\nchykie\nwimped\ninghilleri\nquibdo\nteachin\nequiment\nschmatta\nfaceguard\nceyla\nrsrm\nkhojir\ntrendily\ncambridgshire\nsciencefest\nbimont\nsteptoes\nchipidea\nfogy\nnaïr\nholidaymaking\ndiess\ntrancendence\nsafeminds\nsandaled\nroarie\nfirstbrook\nsunshiney\nqorwk\nniederungen\njuanqinzhai\nzentek\njamac\nbjalcf\nquenin\nrunouts\nfetishise\nmolikpaq\nkryolan\ngarrana\nbutterbaugh\njahic\nmuhajer\ntalibs\nsexagenarians\ndisbelievingly\neminonu\ndicator\nkunzelman\nmerriness\nabidingly\ndottino\ncantellano\ncommoditisation
\ncoconspirator\njerath\nhippyish\nprinceridge\nimperturbably\ndoermann\nploghaus\nbuget\nazher\nníos\narboriculturalist\nhpti\nunderweighted\nwesabe\nzakanitch\npackrats\nverismic\nsubscore\nrehersal\nodze\nmckaughan\npaprec\ncristals\nbrownsell\nsurepayroll\nitip\narym\ndeinstitutionalized\nbutah\nuttr\npleged\nstudiosystems\nsakanaka\ndantewara\nciliv\ngeosteering\nsurapol\namnio\ntriyaningsih\njonal\nzweben\nyahuza\nspigarelli\nesophagi\napercu\ngrenson\ndepraving\nnonsmall\nmakistos\nsuspision\nbreconridge\nbookswim\nabbassian\nmerigot\nreconditions\ntowcar\ngudal\ncondemend\nkitabata\nyoss\narmao\nmirrorlike\nbirthparent\nlifestage\nvlos\nmerriot\nczepiel\nmuskin\nboatpeople\nnonchemical\nvelocci\ncyfan\ndranko\ntwitterfall\nmckersie\ndodders\nliverail\nsztorc\nmisruled\nheidgen\nyouseph\nslivovice\nkosmopoulos\nendorectal\nhumevale\nribbonlike\nredtag\ngrefenstette\ntrucktown\ncolsten\nsnapnames\nnochten\nhaithman\ngaleao\nparites\nsellotaped\nrozett\ncgaq\nmakaridze\nomnipeace\nrealdolls\nwillinge\ncrewelwork\nzitserman\nstreambox\nsilmi\nzamzow\nsilguy\nculbertsons\nministate\nnutritiously\njakobshorn\ngovernmant\napdp\ncotis\nthjat\ndigusting\nivascu\neperon\nbatalona\nbarishnikov\nofpra\ninterupts\nrochlen\nreissman\nmazuryk\nschnupp\ncertifica\nmenacker\nrohdes\nmanifester\nnoncompeting\neujust\ntranformation\nschuchter\neconomistes\naripov\ncathlyn\nmusaab\nmegary\njankovskis\nhusbandless\nsppi\nthulagi\nescalopes\nrooda\ninadvisably\ntabizel\nunacclaimed\nbaathism\neclerx\npurpled\nimprovemnt\nnontariff\nconsciouness\nmoec\naboutorab\nabousamra\nliazid\nrutlege\nbaupin\nsarotte\nteleverde\nsouthernness\nepidem\nkuwai\nchharia\nrichieri\nteeuwissen\nbepotastine\nbouhail\ngeldolf\nwolynes\ndiadkova\npelerins\nceud\ntcpi\njence\ngonazalez\nkeanae\nqualfied\npirouetted\nmaunderings\nkhachatourian\nvaccariello\nupscaler\nvasilevskis\ncrasset\ntoensmeier\nsnugli\nollson\nhpra\nscld\nicescape\nmyfi\ncubicularis\nbiop\nmdtf\nbasargin\nbatelle\nbrewley\nnidri\n
viracept\nsexbot\ndevictor\nsirard\namenagement\ndyches\nkepak\ndiddymen\nrabidity\nmancin\ntonnato\npressgrove\ngalliver\nvixxen\nvaalco\nfumblings\nkhvichava\ndjodjo\nadorability\nmabrook\ncorebrand\nipel\nkaelke\nfuerzabruta\nschoolwear\ndiscernably\ndeutschemarks\nboyarchuk\nrecellular\njubak\nsogebank\nmicoperi\nizbasa\nhaetzni\nsahri\nmunkenbeck\nmegaband\nkoukkula\njonthan\nbigpark\njagusch\noverconcentration\npetrecca\nstrempler\nteint\nkabando\nconfidance\nradzicki\nthicko\nbackflipped\ntriantaphyllides\natgwu\nnonslip\nismailzai\nhoves\nresilence\nnasto\ninsideradvantage\ncanori\ndevit\ngearknob\nirrefragable\nulpd\nrigths\nexercisetv\ntrackies\nunassailed\nstrenghtening\nshakhshir\nvisionmaster\npadera\nmeachin\nsivuqaq\nmerdian\nhellevang\nkoetting\nmicrolender\npinters\nacquiror\nmanizha\nnaftidrofuryl\nledia\ncreakiness\nfastco\nzuckermans\nmogmog\neconohomes\ndulvy\nzaryn\nlishchynska\npaperwhites\nsocialthing\nescorza\nkhamene\ncajanek\ngeiszler\ntonked\nespring\nleggitt\nliaigre\nkasetsiri\nbragagnolo\ncrowdstar\nsamknows\ntreillage\nemulative\nehlmann\nsteptext\nbesifloxacin\nllinares\npafumi\nkofmehl\nmacaulayite\ndisarmanent\nsdvosb\nsporanox\nphilanthrophy\nmultistrand\ninsourced\ntropiquaria\nadnexus\nbobroff\nbubl\nnehst\npassats\nphuthuma\nemmisions\nexpungements\nrootball\nchatlines\ngcy\nguariniello\nhelljesen\nboultinghouse\nremarket\npcrd\nimpaneling\njocelynn\nbougourd\nbrownite\nbacigal\ngermanier\nindys\nbuveuse\nuncoached\ndhliwayo\nundersexed\npautsch\ndaniluk\nafricanised\nsindani\nincreas\nbuybuy\nrikhvanova\nchristains\ntamfourhill\nwaakye\nhomoine\naberhonddu\nhurón\nitʼs\nbergsund\nhallac\nbetj\niranophobia\nyayale\nsmellier\ngancheng\nharbuck\ngochar\nfloorpans\nbeverely\nbamfo\nopenvibe\ncarnglas\npresciption\nprogesterones\nfasahat\ntresspassing\nsulkiness\ngroft\nmuuto\nprescheduled\ngigle\nbeatrisa\nnanopatterning\nnaeng\njailable\neavey\nenchaine\nbeyou\nrawitz\ndisected\nsennowe\ndiogelwch\nterino\nosna\nnetflights\ncaher
listrane\nedpr\nmodric\nfairfull\nunroadworthy\nkariamu\nvermuelen\nagcom\nballestros\nesot\nspiriva\nwirtschafter\ncrisislink\nbreakins\nzermeno\neurolat\nmerjos\ncringey\nmachiavellis\nobesogen\nngpl\nscrra\nweikang\ntarriffs\nsoundoff\ntavalaro\nbedevilling\nfragomen\nlahudood\ncipralex\nsepich\nkoleston\nolukemi\novercorrecting\nslopey\nbbam\nstemagen\nchiadzwa\nmarketised\npongsudhirak\nunarranged\ncintec\nmorgansen\nruez\neurobancshares\nsaadun\nserviceably\nfeting\nkellagher\novermighty\nabergils\nhaemorrhaged\nbalzacian\nhosue\nglasslab\ncrespadoro\ninterivew\ncorsentino\nmowasalat\nsourest\noktapodi\ngalvanises\ncoaliton\ncompsych\nuncollared\nboutrous\nasppa\nhureira\nbanpro\nkarabus\ncambr\ngulvin\nallrighty\ncoffeeheaven\nsarene\ninhi\nlovenox\nweigner\nspethmann\npvsa\nteraelectronvolt\nmcop\nchinse\neggstravaganza\nwarmish\nredrobe\ncelebrites\ndustmann\nroeselii\nmarraffino\narking\nheithaus\nrosarie\nautex\nharvati\nrecessionista\nbrodnitz\ndatablog\nrebooking\ntakeh\nkrema\ngalatica\nripston\ncooklin\nwoollands\nmultistrategy\nzubiate\nconcertacion\nambroeus\nmarsey\nhireright\nurazov\ntackies\ndazhalan\ntweaky\nmuscley\ncameronism\nantiwhite\npostform\nmokgadi\nschwarzenneger\ncroff\nclosests\nmisiaszek\nthaumastos\nwheeeeee\ntanic\nsempell\nbutkevicius\nmanisco\nembyros\nmellifluously\nchiroubles\nmoerk\nramraja\nsercel\nmalesan\ncaleton\nhrer\narnaoutakis\ncappucinos\ncounterpanes\nsisha\nevenett\nsurkhakhi\ncermony\nkonskaya\nhoitink\nrecarpeted\nlmod\nvarqa\ngoleizovsky\ndcsnet\ncarsberg\nknackering\nmohhamed\nagitatedly\nraybin\nbupkus\nocensa\nabyd\ncpcn\navalanched\nsaril\nkejun\nhief\nravjaa\nbalkholme\nadamiya\nfreedomnomics\nshaunte\nholiff\nartworker\ndunetz\nnaspe\nunscrutinized\noleandra\nunbuckles\ntakete\ndarlys\nuniqlock\notbs\ngalluci\nunsheath\nschellhammer\nfreson\ncrossparty\nmineseeker\nunmolded\nkadhimiyah\nhåkensmoen\npettiest\nkiogora\nswaid\narhuaca\ntobold\nlanxade\nghirga\nrissole\nxifaxan\nsilkiness\nthermoplasty\ngetwel
lnetwork\nchuanfu\nvicoprofen\nartworkers\nsahlstrom\ncorrag\ngengsheng\ncrocketford\nalikozai\nwoooooo\ntabankin\ndipalermo\ngildenhorn\nconcilliation\ntroughed\nyolaine\nemmins\nexchequers\npapalo\nsynplicity\ntimpsons\nbusniess\nrandlay\nremitters\nvivenne\nknifings\nhaileselassie\nvechile\nibrado\nfinanical\nrigside\nelasticised\nbrightwells\nendrik\nbuttry\neyptian\nfoodmaxx\nsopped\nconsumerland\nnyweide\nvlahides\nteuben\nmellizos\nretrofest\namunition\nfarreaching\ntorbit\nmellila\nkuvaas\nkrabacher\neloxatin\nhekkema\nthanenthiran\nmobiler\nkymani\nakqi\nxingdou\nillamasqua\nnewgard\nzonolite\nfendry\npakhomenko\ntenschert\nbrusiloff\nbombiviridis\ntrubey\nmuralee\nalmondine\nvovkovinskiy\nwauchob\nmaccalla\ndessources\ndinte\nvicous\nmarole\nlifesource\nzhangazha\ncriticalblue\nfatboys\njwad\ndingzhi\nbakoyanni\nmarjam\nheimoff\nroeca\nkucharska\nsingsongy\nhospitalise\nkuersteiner\ntestrake\nveyance\nunkeepable\nabsoluetly\nmunire\ndraznin\namercians\nbodypainted\npittilo\naiada\ndoskocil\nfiftyish\nfujayrah\ninists\nsweetcakes\nadacher\ncasva\nhexaflouride\nthissara\ntortorice\netchebarne\nrichels\nhser\nwolwedans\nception\nseniat\nmenahi\njoschi\nbousses\nkarolides\npaazab\ngantly\nregusci\nmazarron\nstainthorp\naguirres\nsafle\nmondro\nimpossibilty\npredecesors\ntrotts\npakpahan\nsanaria\nturmi\nneibert\nbillpay\nrittie\nachabeti\nieci\nbreeziest\ndoueh\nhypoxico\npzena\ncannada\ndcypher\njerrianne\nbarrenwort\npargneaux\nlinselles\nmoamer\ngrockit\nweejuns\ndudenhöffer\nghafor\ncrapness\nrechecks\nscruffily\nkrmc\ndiino\nduravit\nairbed\nbidzos\ntimebends\nbiesk\nflaux\nskirsgill\nandollo\nmiscikowski\narnika\nmiteb\nalkmonton\nmuseles\nwidevine\nsoetrisno\npittsinger\nermonela\nmediasurface\nyanguan\nkotlyakov\ntranscantábrico\nrespon\ndisincentivise\nyulis\nduroville\nhtfc\ndusika\nifec\nlarence\nazimbek\ndathorne\nnbra\nreedijk\npolygot\nwheelnut\nboedihardjo\nthiazolides\nrideon\nquatela\nnondrug\nsoltow\nsergeac\ngropings\nchippac\nwillockx\ncagi
ly\ndownswings\nruias\nticor\nburnsong\nalexin\ncannucciari\ngodsday\nwasch\nokeowo\nindistiguishable\nbeyonc\noffiical\nphadia\npisasale\nhealthspace\nbachatas\ntaupes\nembler\nridiculas\ngaszynski\nstrbske\nsiyoung\nodwaga\nchimore\nsedapal\nbetrixaban\nkimenyi\nsemuels\ngelbmann\nsarsam\nchigishev\njursidictions\nsavain\nzylinski\nsuddely\ndiefenderfer\nthecurrent\ndejevsky\ndlesk\nshinnecocks\ntoowomba\ntortiously\ngriffeath\nscruffiness\nmorizio\ninestimably\nnellas\njittered\nbarkema\njetzer\nkushler\naecio\npicowatt\nreevoo\nslsp\nnobbling\nnciia\nflosses\nvascellaro\npublis\nsalteñas\noccupanther\nvallandry\nactigraph\nfattiest\nvasper\ntoorock\nsbaraglini\nsureno\nzennstrom\npumalin\nbrumit\naschieri\ngrampp\nlaclair\nnoncarbonated\nadminster\nsarwal\nnoof\ndalfen\nagem\nhogberg\ncaffera\nhassanain\nkubrickian\nrevolutionarily\nstotlar\ndqed\nehrenhauser\norderd\nalonside\ncoproxamol\nbrotherish\neletronics\nandalsnes\ntrsl\nbjorgvin\nfrostier\nbedazzle\ncancelmi\nsimendinger\njamesetta\nlewtas\nsquishier\nladny\npinenut\nricot\ndelectables\nvasconez\ntumultous\nswitaj\nmaltam\nwisewindow\nhabip\naliviane\ncrimereports\npromac\nusarec\nslinga\nbernfield\nweaponising\nltat\ndriveling\nmeienhofer\nmakus\nconstrasting\nloprinzi\nbarneses\nrefigure\nipzs\nquibblers\nblauwet\nnarrowish\nsooliman\nfasps\nragoon\nmbuzi\nchappelear\nswantee\nworkwithinwork\ntalkfest\ntecco\nlakiesha\nbdrs\nakoh\nashja\nofhis\nvisocchi\nspudding\nyousouf\nmahvelous\njanece\nhewgley\npengana\ncablemas\nmomentousness\ntheatricalized\njagh\ndisaffecting\naccustions\ntoama\nmothershead\nabck\ndecarbo\nclovenstone\nclms\nmutualised\nagnelet\nrattal\namoes\ngovernates\ncamonetti\ngargurevich\ndimento\nheith\nkarolewski\ntfank\npenpoll\nfranklen\nenergos\ninderstand\nscintillated\ntarpy\ntsujiura\nrailrider\nsucccess\ngreasestock\ntakanishi\nshoemate\ndelemere\ntavendale\nimplacability\nhallbauer\nlfec\ncarryon\ndimished\nriniker\ngilbar\nkrikler\ncinton\nmurkiest\nkrstajic\ngerodimos\neas
tler\ndubais\nforlines\nygal\nenfys\nanagh\ncresitello\nhoneydukes\nlinkohr\noliker\nkoczi\ndreich\nperik\nfolfiri\npeaceloving\ndellibovi\nturbow\nmabanga\ngoozner\nmeinshausen\nagoumi\nairwick\nasssessment\nmaldanado\npachtman\nbroan\nkmsa\nbasirat\ndwtc\ntaioseach\nfilc\nforeca\nwahh\nglencanisp\npepke\nsurenos\nbizuneh\nhaemorrhoid\nlinkenholt\nzongjin\nlatecoere\ndebenedetto\nkatzburg\nventilates\nirgcn\navmed\nshakshuka\nweindruch\ncappas\nprescibed\noveranalyzed\nceglarek\nquian\napeh\nhebblewhite\nploughmans\nhammudi\npetesch\nkissas\nyageo\nalbiev\nsimec\nchiranuch\nspielbergian\ncappio\nonismor\ncondolances\nwebdale\nkucik\ndangamvura\nakec\nsomini\nkallweit\nfractionals\naminda\nsamoura\nodce\nshenergy\ndatanálisis\nhagelauer\ngoldbergian\nmurdani\nloewinger\nriddens\nconnetion\nyanadi\nscrumworks\ntruckdrivers\nbedazzlement\ntrilaterally\nomwami\nergezen\nloonier\ngoedhuis\nyokokume\ncinciripini\nzacinto\nbergeman\ntribalization\nbreeanna\nqizheng\nclydes\noussekine\ndelrae\nprotectio\nisayas\nlimbones\nmightly\nmetastorm\nobano\nvanjoki\nmassify\nfréchon\noutling\nshakili\nwojtak\nkeffiyah\nsunesis\nupcounty\nhugoson\nloutzenhiser\nmallins\ndncc\noriard\nchurchley\ntarganta\npement\nmicrometastatic\numholtz\ndekom\nidalgo\ngeogia\nfloderus\naredia\ntaona\ncicpc\nrealclearmarkets\nsloaney\nhulnick\ngenr\nmshini\nenergyguide\nblueant\npoulicek\nspokesong\nkransco\ngatefolds\nunphotogenic\ncockfighters\nanastagi\nstowells\nthomashow\nhartcourt\nkazza\nkebony\nbogliolo\npornified\negpyt\nsamarskoye\nlagani\nkhosrokhavar\ninextricabilis\nfoliofn\nmerksamer\nploddy\npimkina\ncoutlangus\nminikit\nxopenex\nherdes\numarzai\nyetty\nthreathen\nallaw\nspyrus\nhvlp\nnvcjd\nstigson\ncovault\nlarizadeh\nnessinger\nkiejman\nmorparia\nbaquer\ntimbit\nrenaissancere\nrhaid\ndivorcés\nnakarin\nnonrealistic\nwepco\nhumilation\ntokoza\nhottrix\npantsman\nvithy\nthobes\nmisdials\nhualon\nabdiwahid\nxinpei\nselari\nsolastalgia\nrodeohouston\nnonken\nkudrik\nelectronuclear\ncom
ag\ncourtesty\npetruschke\nmethar\nkudina\nunshackling\nshriven\nmahaley\nsoggier\npaikan\nleakiest\nachrafiyeh\nfurtherwick\nrishe\ncaringo\nsoftcat\nbourride\nkhudzhand\npurità\nperod\nkurnev\nnewsbites\nduraflame\nreupholster\nkeeril\nmintzlaff\nisotoner\nstaycations\nnekunam\ntomazin\nbanaa\nbradmanesque\ndaunts\nvogelheim\nbeetge\ngladedale\nexceptionals\nssst\nbermudans\nkeyhani\nzwillenberg\nflashguns\nnbbs\npaintballers\nemmalee\naftershaves\ngwybod\nmicta\nkrejcir\nduhy\npuscau\ncarcelle\ncanegrowers\ntechserve\ntfets\nsnitty\ndeliberators\nfatuousness\nkumuka\nskovsgaard\nmontgri\nfishmore\nicli\nansol\nlendable\ninvesment\nsuperproducer\nsaleslogix\nwhipkey\nbarraques\nsprigged\nsantine\noedi\ndesensitising\nantiviolence\narséne\ndrakeley\npugnaciousness\ngeleijnse\nijcic\narhaus\nmwampembwa\nbasketfuls\nfingerwork\nadultry\npekgul\nbukaty\ndelus\ngpaa\ntpct\npapastavros\nmcclam\nunderpriviledged\neffient\nlevenston\nltpa\nihda\ndaypack\norbusneich\nantidiabetics\nmalysz\ngrelier\nshumer\nnonsports\nprofligates\nbaibakova\nbandelli\nlimberis\ncorelation\npantperthog\nhockenos\njorrick\nshieks\nghahraman\ncrashy\nmeckfessel\naweau\neservice\njomba\ncurvacious\novercapitalized\ncondroyer\nmshda\nnovespace\nushar\nubcp\nfampridine\nyogli\nteeples\ncrystina\nkgoroge\nfassbind\nbrookly\nerteszek\nechterhoff\nbonefishing\nsturmia\nfenay\npanathenian\nmomjian\nmustansiriyah\nantonick\naaslaug\nbfei\nelsbury\nbatheja\nduperval\nrozner\nphotgraphed\nscharman\nearnse\nvosgerau\nseventythree\nnotini\nwedgy\nrackswitch\nballyarnett\nganthier\ncognetas\nzacny\nreciva\nsuljic\namfibus\norwoll\nsiggil\nlareo\nutahamerican\nwardheer\nyukky\ncruiselines\ntreelined\nhbsc\nchitau\nleitmotivs\ndefexpo\nshoupe\nzhengs\nfotoweek\nfastpoint\nyingpu\ndqd\ndouzeniers\namcol\ngervich\nedrych\npreisendorfer\nsharpcast\nunopen\ncapparell\nfunduk\njobserf\nvilebrequin\nwvcm\ncyberagent\npolicer\nforoohar\nmulticountry\ncheeking\nsimens\njaunary\nsarbox\nhcam\nfnsea\nmunwha\nkarokhel\
npunchiness\nderschau\nshilda\npeverly\nvelaglucerase\ncallau\nkhadhar\nswyddfa\nsquirrelpox\nhorsepool\nchesting\nsupramaniam\nmotshabi\nsfta\ndumbiedykes\npeskowitz\nbathmat\nballywillan\nyugraneft\nrattly\nwinterscheid\nbybox\ncraftmatic\ngambriel\nalauya\nvizplex\ndekosky\nsupportors\nkleptocrat\neffusing\ndenamrk\nrestino\nsmajlovic\ncolarado\nmilligauss\nreithmayer\ntheyt\nspraints\nportugual\nxsite\nguarentees\ndisario\nlavrakas\npedrone\nshiane\nmontazah\nmalakpour\nbounciest\nkaeslin\ndenudes\nkilnacrott\npfaender\nsuraev\nlevigne\nflaen\ngodfinger\nchérèque\nperfomers\nubbeston\nmarchinhas\nkilinochi\nchatanooga\ngenebach\ncambricum\novodda\ndisapperance\nvervoordt\nwoodycrest\nparsh\nmontefiores\nbadza\nlagattuta\nsuperthin\namaizing\ndeperately\nvspc\nroadtest\nvtss\nvieregge\ndodard\ndudzinski\nvastani\ncastellito\ntullygally\nkinoki\nkarokhail\npalsey\nmilipol\nleafier\nbenbassat\ndemil\ntolpeko\ngilleard\nmcosker\ndigitalchalk\nrazzetti\nhyzaar\nalgesiras\nvalueoptions\ntdameritrade\npartenariats\nnadhem\ndeitchman\nmerlone\nellaone\nsottish\nwehrwein\niflo\npuritanically\nseduccion\nweepiness\npolycotton\nosuri\nwithdrawls\ndreki\npenasquito\nlaraby\nbimetallics\nwindmueller\nstilettoes\nahmud\nkovachik\njoepa\nwinkled\ntoursim\nparessant\nkozima\nstupple\nnestwatch\nchirtoaca\nmadridistas\ndlodlo\nrachmanism\nachos\nflunkie\nalmaleki\njiau\nadlakha\nmoderow\ncosit\nmattscherodt\nlafraniere\nazadian\ngmpta\nlemenager\nresecure\ntucsonans\nschouwenberg\njabareen\naltoumaimi\nparraguirre\ncitysafe\nrikleen\ntersigni\nlcdx\nsomeonelse\nimmunotec\nhueppe\nkittila\nskorykh\nrealini\nprolia\nperer\nehambe\nhearbeat\nyijinjing\nzeitgeists\nlibow\nyanny\nstylewatch\ncorter\nisaps\nukio\nhumayra\nbosselman\ngroveled\nspeid\nmonjeza\ndireness\ntuleev\naurp\ncartizze\nmalayappan\nblockheaded\nmultisymptom\nhfas\ncloque\nsanderijn\ngkpi\nkaskeala\nichill\nkysar\nshrinivasan\nbingemann\nmartevious\nbivb\nrepresentitives\ntajarin\ndoind\nshuyong\nbialkowski\nkurya
nov\nportec\nteitzel\ntefap\nmikaya\nrabineau\nlaphil\njunsai\neverbridge\nvapidly\nhuldahl\nhinkles\nreichbach\nclattery\namadie\nkorisha\npenkivel\nneuhart\nsubseqent\narabise\ntheary\nshamsudheen\nmenomune\naomar\nnnal\nramassage\nfootytube\ngayboy\nlopex\nmercadito\nboostrom\nfunnymen\nmcmurdie\nglozier\nwaney\nraiber\nshambled\naskk\nopaques\nlocalnet\nthisthatandtother\nmmpl\nasnawi\nribeirinhos\nheredad\njanies\nnuvuk\ncheishvili\nlohafex\nlokeris\nhanooti\nnhanh\nunitholder\nrotozaza\ntratner\nwebtech\nwpy\namerifit\noptitex\nröcken\nojani\ngallantree\nmcbl\nabdukadir\nencouter\nrathakrishnan\nwangoi\npubilc\ncrafer\ngylfe\ntebutt\nstrugling\ndelrish\ncolomendy\nnegusie\nydri\nhryvna\nbeautysleep\nnxea\nenfeebles\nogboru\nrukwanzi\nghostworld\npoernomo\ncooncil\ndramis\novulates\nvermund\nkuyl\nmatemwe\nnongreen\nportovesme\npsycopathic\nvolkwein\napsītis\nonhollywood\nwarrantees\ninjunct\nenrd\ntokyoite\namerus\nsieć\nphotoswitch\nshdema\ncoalco\nkroloff\ncrustastun\ncullan\nctna\njustifing\nsatawu\noleoylethanolamide\ndecriminalises\nmehne\nbreat\nborysik\nlabarda\ncolboc\nllwyr\nhillsmere\nmilbook\ncvision\nkdhe\nbleeper\nmedmerry\ngenasense\noblimersen\nhomedics\nbaoanan\nmillest\nunsoiled\nnyias\nrukshana\ncoastbound\ntwcn\nprocrit\ncrymble\ncantankerousness\nmattich\nadiana\ndiloreto\nfritolay\nsnifters\nyately\nwhupped\nridouane\njoycie\nfrontlist\ntimberwest\nunstimulating\nbraslow\nrecalibrations\nvitrola\nthiopurines\nunderyling\ndilnawaz\nneurocase\nsignability\nlighteners\nschreiberg\ntlachinollan\nmacin\nphsyical\nritualizing\ndrawdy\nnewschaffer\nopalach\nsweady\nnwec\nstruldbrugs\npurply\nyatooma\nsexted\nffirth\nkristic\nutoyo\njohnstonebridge\nmaringo\ncharnota\nrepublicon\nanesthetizes\npirla\nchieming\ncompretta\nwesterback\nflatish\ncoolmax\nshamarr\nbosfor\nzwingle\njudisch\nihamuotila\nsaitas\nwerren\nnkadimeng\nxinqiang\nnativistic\nringbinder\nidesign\nraliegh\nbosanko\nhillan\nmisell\ncrepps\naracinovo\ndankerode\nhelcio\nguanliang\n
rockspring\nfelito\ntheede\nwelching\njulong\nncin\nmaasbommel\nnontextual\nwharrie\ngwyrdd\ntrumark\nwindups\nkizs\nsnapvine\nsillanpaa\ndefensics\nstriplings\nselka\nanastasijevic\nredtops\nimamovic\ntrendwatching\novertasked\ngaraicoa\nedles\nmycoupons\nservicechannel\narbaiza\ndishonesties\ngeolocators\nurbanbaby\npodowski\nprausnitzii\ncosying\napuc\ntriboelectrification\nshinewater\nelectroshocks\nhappell\nbrittannia\nicenogle\nrsquo\nmachery\neicholz\nchiropracty\ntavey\nstumblebum\nmydicar\nnonsecular\nacuerdate\nundule\nmuthumudalige\nstinkiest\nbreglio\nsebarenzi\nworonzoff\nscheana\nsupercomm\nmexecutioner\njpatrickbedell\nhealthplans\ntarica\nlarowe\ncomsumption\nultraportables\nhijji\nyrg\ncomodi\ngmita\ndespouy\nomot\nntsiki\nsmarmily\nbolinaga\nenvirocab\nllun\nbrimmeier\nedctp\npoufs\nklackenberg\nfdaaa\nmickah\nbryder\ntruckling\nvatagin\nalpargata\nefficiacy\nanstine\npalistine\ntaxcut\nlagrell\nmasaood\nleevers\nospca\ntruvo\nmartinrea\nehhhh\nshuval\nclassily\ndishevelment\neffeciently\nfastbreaks\npairts\nbleick\nlokker\nalkhatib\npulwarty\nfedral\nnyuki\nunpredicatable\nmanguzi\nthorsteinsdottir\npiretti\nbloodymindedness\nhorizontina\nstucked\ndemandingly\ndoback\ngimnastic\nyuce\noverfish\nulyatt\nwowzers\ncampest\npetray\nprouser\nupback\nepratuzumab\ngreeners\nfrancelino\nwbrt\nscrewiness\ngismervik\neconomizes\njaheel\nalkhair\nbcny\neuthanising\nedgwick\nsciton\nstuttle\nveggetti\nlaydee\nvrus\ntoters\nxiexia\nlauden\ncunagin\nportentously\ngalantuomini\nvaclavik\nnortenos\ncueller\naramini\ntrown\npolytec\ngyar\nlacoe\npuffier\nisafjordur\nmoistly\naberhafesp\nchavarro\ntechnolog\nhurlow\noshins\nnasbe\nnoncommunicative\napnoeic\ntemares\ncoachroof\nguayakí\ncaramazza\nalderden\nchaderchi\nallstetter\ncosseting\nmersley\nbalfanz\nmuskateers\nfolchi\nracivir\nhargon\nfruiter\ntopseos\ndeantonio\nshuzhong\nbuzzarté\nlhergy\nzinwa\nmiscalls\nbagmet\nsalaams\nsaveock\nlocksets\nrochom\npreszler\ntpus\nschwankert\nbvmw\nteamlease\nthosand\nbe
kkay\nkadakin\ncrushproof\nrestorick\nucap\nkitabat\nfohe\nbriso\ntemporise\ndipuccio\nsouare\nalhuda\ntranched\nameris\njolean\nchengue\nkipred\ndahdal\nwmz\ncohera\ngrotenhuis\nimpishness\nherbalgram\nvlassi\ndubiotech\ndayboat\nmidcycle\nadmax\nuncarpeted\nmobiclip\nflushers\nohhhhhh\nmascaraque\nfargione\nzongfu\nsovreignty\ncalavia\nmertinak\nhuzzahs\nhardscapes\nvohr\nwonil\nzestfully\nsemisubmersibles\nplanktos\nbramhaputra\nodimba\nskulkers\ngonxhe\nelegua\nblamers\nyaer\nstrangerer\nharasym\nmorcenx\neyg\nillegitamate\nexploitatively\ntschannen\nstuffier\npoliwood\naffifi\nparlyament\nhuttary\nsimlab\nraaum\nibandronate\nnanofilm\nnioxin\nonegeology\nnonsupervisory\negemonye\nspotkick\nwilmhurst\nleslyn\nlylia\nbonier\nholzhammer\nedsinger\ncowberries\nciolos\nnaysay\nhuijser\nguleff\nkillenard\nreplating\ncosabella\ndistorters\nsimerly\ninteriew\nshoreh\nbarankin\nmarben\nbartech\nsunaryo\nhaygate\nbluehybrid\nsaccoccia\nbunnyland\ndezcallar\nplattin\nsidewind\nmogaka\nmultipanel\ncushenberry\nsuperbrat\nsorillo\nheckbert\nyanying\naudaciousness\ntarabarov\nrumormongers\ncnbv\nhotelplanner\nnallamothu\njarillo\nyuewei\nearcups\nraynaldo\noktoberfests\npentrebychan\ntimecards\nsturminger\nlochsie\nstojanowski\nnsmt\ngastar\njovetic\nbajilan\nloafed\nlaffa\nnamrood\nszpiner\nmarvet\nseebrig\nglenborrodale\nbonvissuto\nsouchard\ncompanionably\nuryadova\ninsiderpages\napplica\nhoraire\nbastuerk\nfeoh\nsellek\nadakhan\ndecarr\nkapatos\nketra\ndatasphere\nabbasgholizadeh\npoppema\nbehalves\nsalahat\nfalteisek\nsaksin\nverticalnews\nshockumentaries\npeskoff\njabbarin\nsemmelhack\nchewables\nwearden\nzimmerstrasse\nmazdzer\nschneiderhahn\npearlmutter\nsecuritymetrics\nfuniture\nngodup\nsurabi\ndangour\ntheier\nnatcho\npethokoukis\napichart\nkuljanin\nbendas\npoltically\nhardricourt\namanresorts\ntabery\ndynamex\npiconewton\ntrogolo\ndalsass\nbilsborough\nflantz\nsotheara\nkolaches\nmorgunbladid\ndecodeme\nwiseacres\nouranoupolis\nandrosova\ncovali\ncèpe\nflightsui
t\nedmondus\nsonosite\ngawping\ncaergeiliog\nkandlbauer\npettem\ntorotrak\nmyregistry\npotholers\nsnoozed\nfawza\nsuperlawyer\nharoutounian\nneedly\nclavenna\nbenesova\nchristodolou\nsweder\ncancerbackup\ntorita\ngazundering\nchilren\nbachchans\ngiddiest\nembrassing\nfunez\nspaunton\nalbenda\nplatais\nicier\nmaamari\nntes\ngillane\nhelitech\nsolerno\nddwy\npostpubescent\nversweyveld\neffusiveness\nraikabula\nbigged\nhaddah\nishkanian\nmonicagate\nweatherstrip\nsicr\nteleatlas\nigrc\nmisspeaks\nhartbreak\ncrownvetch\nlogroll\nplauged\nmashakada\nfirex\nturneresque\nrapaciously\nyearout\nforewarnings\nkuratani\nvraalsen\nduartes\nshebly\nﬁ\necolodges\nbarthau\ntouze\ncrotonville\ntitstorm\nmainshill\nmatrafi\ndevanei\nhannick\nreconciliate\ngeotec\nnorgle\nsutel\nsevene\ndireko\nsamalut\nacclarent\ncohabitees\nnahawa\nmugira\nfingerhuth\nbeleiver\nstaffies\nghufron\ninfometrics\nhonerable\ninnovasjon\nvaporetti\nherdlicka\nuncertaintly\nathlinks\npiccino\nmarbourg\nabdramane\nyitta\njourneyers\neasycar\nclairoix\nbaldisserri\nemblaze\nsuksan\nsculleries\nmoayedi\nthriftily\nmataban\nkulju\nmorlin\ndunavan\nseani\nkuratomi\ncanniest\nwrithings\ngrandbaby\ncarthorses\nijet\ntadier\nrehabcare\nczamanske\nstraussy\nstretchmarks\nunderwhelms\netkins\nchinitz\nnvrs\nshvitz\ndougill\nimmunoadhesins\njanneys\nvideocam\nschoepke\ntriptik\nmezuza\nutvi\nsmartening\nheuwer\nspellbind\nhafstrom\nwislocki\nneej\nyangaroo\njessies\nzigging\nmhashu\nfarstone\nkishkovsky\nszorenyi\nmenochet\nzuitube\nkirkhams\ngarganelli\nbeztu\nmcelholm\nmmviii\ninnoculated\ncomissiong\nfidelite\nspherics\nwellbeck\ncoremetrics\nzuanic\ngluttonously\nwascals\nbooooo\ncaucases\npéchiney\njadedness\nmalawai\nmagnetites\nmatchet\nfogelsonger\nregionalise\nschave\ntmst\ncoffen\nhamisu\nneilyoungi\ntexai\npixeled\npetatlan\nghulan\nproir\ntarantinos\naxsys\nwhomes\nmwando\npianissimos\nexplotar\nmembrez\nshiffler\njeruselem\nopthamologist\njuvic\nquinnett\nlowyck\nbydd\nsundareshwarar\ncablinasian\npresc
ripted\natttack\nalsation\nfallibly\ncacutt\noverprotect\ndishabille\nheavican\nraiken\njeroboams\npiccillo\nhollowly\nshiryaeva\nuchannel\nraile\ntackeray\ncyberthreat\nbookrunners\ndemeurent\noutfielding\nmatuzalem\nforestweb\ntolerx\nmigaud\npharmanex\npolykoff\ndemoc\ncalcuations\nmicrocantilever\nachfary\nmidwifed\ndrumbrae\nwwxt\natsutoshi\narrugadas\nbutteries\napoliona\nchivan\ndogwalk\nmarivi\ncrescimanno\nblacklegged\ngermanika\nslingbacks\nmukit\nforcasts\nnexar\nankunda\ngeigers\nbelohlávek\narkland\npedott\ntopcu\ncomercials\nnelton\nbarrachnie\nifpte\nculinarian\nnundroo\nbrandied\ntrasviña\ncuases\ndpic\npsychologising\ntopolanek\nfutureit\ntrokavec\nurogynecologic\nphotoshoppers\nshiia\nperparim\ncyclamineus\nyadvinder\nreawoken\npropps\ngottdiener\nbexxar\nweliweriya\ndepartee\nfldr\ndelubac\nmewies\nlisotta\nfurlined\ndrumskin\nmuntinglupa\negre\nsurive\ncelltrion\ncalato\nhouttuin\nhowdle\nwitkos\nwjzw\neuromax\ntanezumab\ndamco\ngiornetti\nbiolife\nmatisses\nfrieds\nstormily\nmaribavir\nheptathlons\nbluring\ncharitible\nswallowable\nbengdara\nmsdc\ncenturo\nadetula\ntolvaddon\nakaretler\ncakar\nkhiel\nryonbong\nmcdarby\nletiecq\nhonigsberg\npuddifoot\nafriq\nzeif\nballotting\nbaldaro\neducaton\nsperian\ntundergarth\nresponsibile\ncalpirg\nsouping\nnutbags\nchressanthis\nsoparrkar\npenymynydd\nquietening\nbloviators\ntarre\nwardhouse\ndisunite\nglobalmedia\ndenerley\npomerai\necrg\nairaudo\nnobriga\ndispaly\nhcrc\ngradd\nndongou\nadblocking\nreaffirmations\ncontentnext\nvainuku\nstranocum\nproducton\nblugirl\ndarbelnet\nheidelbaugh\nmilborrow\nfarabow\nyaros\npapaflessia\nvietnames\npownell\ngijima\nalcwyn\nozat\nkuoy\nbaltschug\ntegryn\nmcneela\ndarpakhel\nplanalytics\njospe\nwallmart\nofman\ncompletism\nbreul\nwitlessly\ngoodmon\nfasinating\ncollidge\ngrissini\nbooksmart\nzikria\norjiakor\naccuvant\nhofshi\nczlowiek\nunroch\nmasculinisation\nsunber\nkimbisa\nwaggishly\nadfer\nsouthernism\ndilallo\nmegatrade\ncadas\npriestlands\nrosciano\nhonein\
nchangelessness\narbora\nramattan\nmasseron\nhironao\nnø\nccbn\nausmin\natfer\nkerven\nbrannoch\nsnogged\ndexheimer\ncalamitously\nbrasell\ncnla\nunlet\ncayennes\nfdls\nezj\niniciative\nbolnore\ninsurrecta\nadickman\nboultings\nperipherique\nmarcavage\nndvf\nfalnama\nmanettino\nopinionators\nthinnish\nmitrova\nmilarch\noirish\nmoraski\njrti\nblaenafon\nsnatchings\nsouffrant\nemmission\nterenteva\nbalouchi\ngentzkow\npamams\nopprobium\nrtdd\ndwon\nsweidan\nmazzaschi\ncrowstepped\nscramp\nscroogenomics\nnatonal\nwoodlarks\njaabari\nbetacarotene\nchepchugov\nteborg\ndehumanises\nborgstedt\nxianghong\nortrie\nheatlh\nbankcards\ntynesha\nbablock\nandwele\nstelara\nunattainably\nvansville\nkloes\nfurlotti\ntromps\ncheslock\nminxes\nflyhalves\nharnal\nlasjan\npreh\nvalizadeh\nallars\nboogyman\ngwernymynydd\ngimara\nmicrofluidizer\nhardily\nsaydia\nbakkavor\nventenac\nluncheonettes\ncoshquin\nrecoat\nbroadworks\nvalteri\nsatpol\nsparest\nreynié\nmeeja\ndisovered\nbodgan\nconstitutents\nopiods\nrtpark\ninchmarlo\nopdebeeck\nnonvenereal\norgiva\ngunshops\nrevani\noracy\nwhoof\noverdramatize\ngenwal\nsasbout\ngilberthorpe\nweissenkirchen\nshirtdress\nmonterio\nllanfilo\ntabreed\nbaczkiewicz\nhelmly\nyushau\ndolmio\nkonkatsu\nschtum\nfreckly\nkubzansky\ninverarnan\nslomer\nvolkening\nthabani\nitsma\nmignardi\nsentate\nhansala\nfrenki\nblumenthals\nxevo\ntasat\npawky\npgmol\nkieny\nfelcman\nvisualforce\nallgaeu\nvirgis\ndrasek\ndampha\nfairoak\nysios\nmistele\nforsight\nbalcons\nroitstein\ninuendos\nnicoe\ngodineaux\nbonnerichthys\nbezner\ngaluba\nworldfamous\nemiew\ndecarbonised\ncleron\nwhuh\nladman\nmroue\ntsirbas\nsustiva\nclarinex\nattemsi\nmarinhos\ncarrageen\nmachulis\nmeea\nmulongoti\ngiradi\nzayim\nwhirred\nkibbutzes\naquascapes\nayap\nchapparal\nschnuelle\nbananana\npuckrin\nalfridi\nbysouth\nwaterwell\nlushes\noverinvested\nkowtows\ndetangle\nusbg\ncongolose\nrisbury\npdpt\nffom\npehub\nsmuck\nnfsp\nsimring\nkeyamo\nomoyele\ndabinderjit\nhamparian\npixillated\nblubbin
g\nspaetzle\nsönksen\nwilcomes\ndisabilty\npamart\ncorrex\nglamorisation\nzurnal\nncaf\nrauterkus\nprognosed\nlenett\nlenkei\nkarrasch\nverfuerth\ndelcath\nrehad\nsegerstrale\ndistrubed\npuamau\nwebslices\nboedker\nahogada\nmershin\nelmaghraby\ncopling\npaneque\nwowtv\npurnhagen\ncoplink\ncagin\nhellotxt\nmalaren\ndobrovic\nglasgows\nduann\nnylag\nphilhower\ncatcalled\npinapple\nsurburb\nkopetsky\nhypos\nteenhood\nbackordered\nkokoi\niapv\nkhastoo\nkvinta\nstageworthy\narmanis\nkurbos\nbenchlike\ntyhypko\nxiaochao\nfilches\nngakoue\nbanyjima\nlinkups\ngliadel\nealim\ntschorn\nmudzingwa\namberry\nrissing\njozic\ngemmayze\nverenda\npolitcians\nenviromentalist\nmeleady\nhabarugira\ndesfosses\nbiosource\nsrob\nslanket\ngarazh\ngalafassi\nnollette\ndasanayake\ndeitzler\nyewen\nnster\nalcr\nlaluz\ntalgarreg\nrestfulness\nnmhh\nimmunodiagnostics\nbefogged\nsamray\nzanchini\ncharniele\ndbes\ntaimour\nllok\nomgeo\nsalava\nkuriansky\noldington\nledue\nflunisolide\nmsim\nzarudneva\nbalir\naaric\nvarvasaina\nuncapturable\ndakake\nbeatlemaniac\ndhusa\nstingingly\nblankies\nkhaddafi\nhadman\nvendormate\nelasmar\nsgpt\nbyamba\nluvians\nleejohn\nmarcoci\nplayfighting\nkabballah\ncarmaking\nmcnorton\nkhankhel\ncollateralisation\nfrenck\nwallbox\nglimmered\npuligal\nmersiades\nbillionairess\nfriskier\nbranstrom\nhorndogs\npaulann\nnunnelly\nkatsenelson\nkulchy\nabdirisaq\ngrislier\nmylie\nhodac\nbuschow\nschnozzle\ngoldia\ntolchuck\ndiease\nzometa\nechosign\ntindyebwa\nlahovnik\nchamil\ncefx\nvahidnia\ndlask\nbandarin\nrhani\nmeningosepticum\nheilberg\necclectic\npresidnet\nscalici\ndentistas\npouffy\nlavinge\nstaysafe\nkossangue\ngorblimey\nfurriness\ngourgeon\nweeren\ndiscimination\nlabordi\nexecrably\nogunjobi\ndisater\nsceney\npichan\nsuperviser\nvekic\nfraility\nreslizumab\njernvall\nhillo\npluckiest\nireports\nrowinsky\nmkvi\nmeglomaniac\ngiradeau\nlaicize\nwatina\ncohibas\niecee\npulmonx\nfintor\nechazu\njugulars\nlaygo\neght\nbioservices\ncornisha\nzondas\ngreycrook\nfranzett
a\nsyndric\nespeed\nunclassy\nchadsmoor\nsabril\nraffinee\nimpingements\nsmartsearch\nshipmans\nmbithi\nchunquan\nmililtary\ntoremar\nobamacons\nfuranones\nvayrynen\nrizaj\nliquescent\nexpessed\ntotvs\nhellmans\nsaifulislam\nsrbska\nedfs\nattcked\ncarbfix\nmigereko\nhillstead\nwyhs\nspoofers\nhosepipes\nikhwanis\npolicical\nyongon\nstoneyhill\ncharwood\ngittus\nvalvulopathy\ndedring\ndustiest\njankovec\ncolcrys\nunsoaked\nlongyis\nmikhalkin\nmndp\ndigney\nsplashpower\nhomeaid\nsantano\nberragan\nagnieska\nseanez\nazcueta\nwatanagase\nplanemakers\nfrontwomen\nxingwei\nfreebrough\nswieringa\nlonning\nrhinogs\nmichni\ngillece\ntoppenberg\nchainsmoking\nchattily\ncurrenciesdirect\nseasonless\nelwynn\nsinglemost\nbarazza\nwakodo\ndemaci\nkeyani\nmuntadar\nplimsoles\nhembrook\ngruenenfelder\nninewah\nportege\ninteropnet\nbinikos\nglobic\npapco\nwernyol\nclydesider\nklish\nironpants\nultradense\nwanded\nikena\nkéréon\npesznecker\nwaycott\ncomras\nseveali\njerrald\nmcconneloug\nnopporn\nspeding\narduthie\nadirus\nkopping\nmartiza\noncy\nrowlson\nhamhanded\nlenchwick\nngsa\npanoff\nattebury\ncapozziello\nsarvananthan\nmalkovitch\ndalís\nmichaeljohn\nimrg\nskrtl\nwurmbii\nziaulhaq\nsaposnik\nsilecchia\nsoftic\nshopworker\ngengnian\nskycouch\nbovett\ntreasurable\nindustrializes\nchicozapote\nbromidic\nnervewracking\nrococco\ntasir\nadjagas\ngasparoni\nhomewear\nbeanballs\nchiense\ninkaterra\nveganic\nkuppermann\nsuspensefully\nbadrah\ntrémolat\ndecling\ndoomster\nhursts\ncigarrettes\naltitudinous\ninconvienient\nlevra\nspokeman\nyushenko\naquapalooza\nunluckly\nrepsonsibility\nmagrittes\nedtp\nneckbands\nwaitng\ngrimmel\nkapris\ntriscuits\nanaemically\nfeinsmith\nmontuschi\nmolndal\nanastassiades\nworing\nanabl\nrebublican\nehrenthal\nlikel\ngrunenthal\ncouthy\ntnav\nontarget\nupperline\nshucksmith\nbouverot\narfeuille\nstefanello\nffdm\nminimills\ncnsv\nhellowell\nhvms\nnaug\nmapjack\nmunizi\nagenices\nnonnarrative\nsaviotti\naleyna\nletterbreen\nswiftie\ntranscosmos\nmcaleav
ey\nmamand\nbaetge\ncarolos\ndownhiller\nstreetlinks\ncediranib\nsooka\noatt\nkartoum\nidacorp\nmacchiatos\nxonacatlan\nboylepoker\nlavorante\ncoutino\nbahrampour\nwearings\ngnomen\nchironis\nrauzi\nostomates\ncieneguillas\nmerchdirect\ndiguido\nyuzheng\nbullocking\nginevan\npostprimary\nsoyun\njunsho\nmisconnected\nanleu\nthundr\ntortilleria\nbeache\nsunhats\nhardouvelis\n\nsouz\navtr\nexcutive\nstenquist\nruchbah\nschlip\nsviblova\nqureish\nhelil\nosorto\nyesodey\ngalvins\nhypercore\ncraffonara\nkarbouli\nsebrina\nlolololol\ndonnachadh\nromatet\nmdvip\nmoderateness\nedidin\nfacebookers\nthinprep\nroyere\nkoping\ndiasaster\nberdymukhammedov\nnicoson\nhellfires\nessaioi\nunctuously\nmeditor\nsiamwalla\npourous\nsordillo\ngcla\naerogarden\nsexts\nregrind\ngidel\nknolton\nketchmark\nkucova\nredcarpet\nmarwoto\nzahary\nsiroty\npwsa\nzwally\nozden\nkjellander\nelctricity\ndunifer\nrheos\nkowaljow\npolicitians\nttsi\nyolie\ncusanero\nstanislavskian\ngurthrö\nprovent\nbrownsberg\ndehavenon\nstephanopolous\nbrookgate\nbaribault\nhecklinski\nqaneh\nminivehicle\nspiegelstraat\nlenette\nzaninovich\nbacharan\nexpan\nbodybags\ngarrida\ndivebombing\ntrichelle\njamshad\njankins\nsmailovic\nclevage\nwvcs\nvisisted\nbecomin\ndemine\npentrehafod\nrahimian\nnassri\ntoosh\nkichler\nbratke\nsmythers\nmasoomi\npoborsky\nsítv\nhonny\npottered\nishac\ncloggy\nuchytil\numitaka\nkyohwaso\nmmboe\nkimjang\nquianna\ntufft\nkeenyn\nfederalise\ntouchtronic\nprouzel\nterorists\nstoelwinder\nmagdziarz\nproliance\ndeloittes\ntecchannel\npoddars\nsocialst\nmelafind\nwathne\nmkiva\nyoggie\nkillstreaks\nwhipsaws\npuoy\neopa\ncloudcomputing\nfaultlessness\nmanagerialist\nnuvia\nmagerman\neivers\noverpromotion\nyanbaev\ncupas\nbuczak\ndatafinity\nsebban\nmeretsky\nmutawassit\ncentertel\ntekah\nvoase\nlafico\ngourgel\npashton\nxinhau\nschaepe\njaquan\nupdegrave\nparttime\nmpeta\nattactive\ninconcert\nabgr\npomahac\nsedenquist\nhypocrasy\nbaniata\ntrussle\naxiant\ncatchily\nbrovik\nphama\npleet\nsiphokaz
i\nmughniya\nmesz\nctsb\ndisloyally\nzajaczkowski\nenvironemental\ntraeg\npinnies\nverc\nkryostega\nmjsbigblog\nitinerans\njaggs\nabelino\nicpw\nsurfas\nmccullar\nhyperdynamics\nbobier\ndialoguer\ndodgiest\nchritians\nbgca\niavoloha\nlangleywood\nkrtk\ntyrka\nmccanne\nmammels\nsaccharolyticus\nkameen\nchisley\nawyr\nversfelt\nnyone\nsapropterin\npureplay\nmosaka\ndivertingly\nbasejumping\nkhilanmarg\ndankgesang\nqalyan\nscheucher\ndelahooke\nanmyeon\nlyutenitsa\nmadyun\nhrubes\nparsely\nmeimou\nkozhemyaka\nshuchat\nindustrail\noxfeld\nncbw\ngulalai\nschierhuber\nwriglesworth\nbregazzi\nwisnefski\nzieser\nynon\nsexpots\nhyperdunk\nmanbert\nchitengo\nptcb\nahlering\nzagaja\nserotsky\ntanatside\nabassan\ntulshiram\nbanchetta\nstuelpnagel\ninurnments\nenvestnet\nysern\nselloffs\nfretlight\nroitz\nstelma\nxinmiao\nnoteholder\njacomini\ndimicco\nrecidivistic\nmontengro\nsakihito\nspaceview\nalmasmari\nmavy\npurnick\nkakule\nshaffrey\nparches\nsuperstein\nhelsinn\nlovobalavu\nlotensin\nnchv\njailyard\noffsping\nsarler\ncauseyside\nfazullah\nskylounge\ncoryells\ndemoralises\nshubart\nraisiny\nkutelia\nfeedmill\ntromping\npressue\nalecha\ncakelove\npancrelipase\npirozzolo\nehrnfelt\nrunningen\nfalteringly\nazoz\nmetselaar\nhollee\nactivits\nzannie\nboltman\naryashahr\nproventil\nlonsway\nrvrc\nzhongce\npreauthorization\nboikarabelo\nmarazul\nkhalisadar\nschlievert\nleasebacks\nxueping\nmchendry\nbutit\ncanfa\noplinger\nnevrkla\nhamdaniyah\nseghatchian\njermantown\nmenick\nmaštálka\nbuydown\nanalysists\noxbo\nszabos\nkerusch\ncolleluori\nraincity\nkonchalski\nslobodzianek\nimagesat\ncrossbanding\nivania\nyaghoob\nmeain\nselecao\ntoked\nkirstens\nsavvidi\nkaywin\nodidi\nremanufacturer\nibhs\ngaracad\nvaijanti\nnellor\ndyax\nnzsx\nsouflias\nfishworks\nponten\nemcore\nluckraft\nbraidholm\nlipsen\nbartecko\nbatterings\nmohallim\namoke\nweyel\neees\nxomba\nerhman\nmalenky\nlavieri\nbizarrerie\nlancz\npartnerless\nances\nneedlesticks\nansfield\nreciepts\nemeter\neviler\nbusalacchi\
njeukendrup\nalcarez\nerdbrink\nvegeterian\nmoamar\ngptw\nmauvernay\nslipchuk\nkaival\nprayerfulness\ncarlozzi\nbintel\nkukic\nhasecic\nhindujas\nkleptocracies\nuptrends\nseaorbiter\ndettling\nbrigalia\nrepella\nkipros\ntalmidge\ncluness\nbillʼs\nhaddows\nmusicpass\nsoetikno\nkornitzer\npittburgh\nimpactive\nmironyuk\ngoteberg\nattenboroughs\npescaíto\naholes\nbremzen\ntpma\nscordo\nhepatologists\neconned\nefejuku\nstructurer\nsupporta\nurssaf\nddyfi\ngrudi\nonlin\nshoplocal\nosvath\nmaheson\nvermeesch\nswotting\nlastrella\nmuwonge\ncouriard\niproduct\nstraddler\nneuromedicine\nbezard\nmarchiony\nmiceage\ndosmukhamedov\ntransmyocardial\nbarbequing\nbiemer\nmarkbygden\ninstitutionalises\navmf\nfizziness\nuntether\nhotpicks\nbaldiris\ncorperations\ncarollton\npasteurising\ntsking\ncivial\nringstrom\norangette\ngallicas\ncocucci\nultraswim\njvania\nmustert\nsixapart\nvolkwagen\nviravan\nmoiree\namtower\ndpuc\nschwirtz\nranderee\nlacole\niwish\npellacani\noddicombe\nmerryweathers\ncighid\noverdrinking\ncdha\nserrailler\nabdelbasit\nferrah\ndevonna\nwayto\nbejach\nheinken\nbmxs\nmiligrams\nvivagel\nmson\ncynnwys\npoltroons\nfarmaceutica\ninimicable\nmuawia\nbristleworms\ntatsoi\ngegenschatz\nquipster\npnemonia\npettry\npareos\nphilipshill\ntruculently\nxius\nbloomsberries\nhandbasin\nkeehner\ngammick\npoutch\nhourig\ncheptai\narmelie\nrogering\ncredem\nisailovic\nlealamanua\nquamut\nprzedmiescie\ncarmoisine\nharriger\nmediametrie\nkastrinos\naulbach\nzumra\ncriscio\noversweet\nmcnenny\nsheshinski\nshasteen\nrbts\ncollegeview\ncarveout\ntowerblocks\ngenuflected\ngodfilms\nrobsons\ntoolmarks\ntomlan\nschieman\nglencarron\nreagonomics\ncollagists\nkennes\nuncomplicatedly\nthongkongtoon\nsalesclerks\ndecimos\nmircosoft\novertrading\nsabaawi\nsyrahs\nconcupiscent\nadede\nacidless\nrobet\nparquette\nseptmeber\nakrum\ndatebooks\nparadors\narauquita\nbucossi\ncomdemned\nephemeres\ndachelet\nqueiró\ntelcagepant\ngoddaughters\nrodgin\nmentaly\nstepneys\nlaczko\ncommoditizing\nover
saturate\nchoriogonadotropin\nshtum\ntaubate\nnonregulated\nkassid\nbekking\nknutti\npickable\nvaubaillon\nkardasz\nredbulls\ntabacón\nleathering\nironfire\nwhipsawing\ninmans\ntweezerman\nbcbgmaxazria\nfishapod\nbruha\nojdanic\nurell\nchibaya\nmursaleen\ntumpach\nopunohu\nprizel\nscrace\npoliform\nduncow\nretrogenes\ngiovannucci\ncatnaps\nschratz\nendplayed\ntaigs\niniparib\ngereida\necoupled\ngiantkillers\nesmon\nharway\ncharicature\nkovell\nfidaa\nroadworkers\naaphp\nwhitelees\ncollapso\nyirrell\nllywd\nmanging\nreminescent\nqatami\nbussinger\ntomskneft\nundercharging\nincumbants\ncomanies\netyen\nhuayong\nknobkerrie\npuedpong\ncanalettos\nshiastan\nlevittowners\npories\ndechane\ntoffeemen\npoeu\nbaleegh\ntraficking\nmmwave\nbuerck\nmuré\nnonpoisonous\nmeddyg\npalisi\nsinuplasty\nbluffness\nboumelha\nvirent\nhatefest\nkavishe\nmostaghim\nmarkiet\ncarelink\nmachie\noverbright\njohnann\ncoralling\noctabromodiphenyl\nuntaxable\ngyegu\nresponcible\nherita\nexhibiton\nramchurn\nmazyek\nbeuracracy\nhumanises\ngeither\nhyfryd\nmeatiness\nvetrovec\ngruppioni\nedof\nuloth\nrampersaud\nknockbacks\novermastered\npearton\nboysenberries\nresole\nadamkhel\nofficiale\nmorestead\npishtacos\noverproduces\nsacrilegiously\nreciprical\nlogorrheic\ncohabitations\niannacone\nppan\nnutrisse\nwaicu\ncurfewed\nonanist\naboudihaj\nscratchmann\nglubok\nmossbay\nrialtas\ngaribotto\ndalcetrapib\nzwak\ntaliking\ncantens\nzizola\nmntx\nmillimole\nowamagbe\ntrentside\njackalberry\ntacambaro\nprudishly\nmuraca\nbilefsky\nconsulations\nxway\nphilanthrocapitalism\nneverlost\ngellideg\nbaverez\ndebens\nlifto\nsiddharam\nnasariyah\nmaasz\nnezamuddin\napotheoses\nmwangura\npapura\nsaymeh\nkufrah\nlodrick\nitronix\nmoldowan\nlefotu\nmeglomaniacal\nmohtaj\nptcda\naghassi\nvanderbuilt\ndagunduro\nstason\npacn\nmisconnections\ntydtwd\nturnto\ngeob\nreassortant\nryhurst\nnelso\namortizations\nnktr\naousc\nrizai\nfreris\nprochymal\noutsted\ndrelich\nturocy\nvourloumis\nfarchnad\ngubernick\nchaping\nsadker\
npenfed\ncollegeamerica\nmacdorman\nsimsch\nstarbucking\nmaryfran\nmohanram\ntooher\ncotliar\nmccrann\nlvhs\ntriaminic\nvbvoice\ncopehagen\nmidgrade\nshawni\nbirkelbach\ngeoapi\ncoifed\ndisapearance\npressie\nhulhumale\nunsown\nbilks\npowerlessly\nacuras\nhamchetou\nbudburst\nukic\nwetherhold\nkeinon\nminues\ncongressdaily\ngoolding\nbosasso\nlafeuille\npropertyfinder\nncib\npelem\ntestimoney\nnaswa\nmicrosavings\nbondioli\nnephros\nanbaris\necycle\ntelocation\nlawniczak\nfanball\nummersen\nwarmists\nfaveri\nirrestible\npopkey\nsolodyn\nentrail\nstagebound\nstylefeeder\nplacke\nrozerem\ndoomsaying\nkwangba\nbpom\nhighchairs\ncorteges\npunnishment\nreihill\npedofile\nsteeber\nnecastro\nbritnee\nmuzhakhoyeva\nwinkey\nbabyboomers\nproscia\nbibw\nderrieres\nzandanshatar\nboecke\nstybel\nwintek\nwebsurfers\ninury\nschornagel\nhiree\nthwap\nunpunishable\nislamicism\nsniffly\nbontan\ntryscoring\npenisa\nekaitz\nstrihavka\ndetatchment\nfebreeze\ngonnot\ncchq\nwilcha\nparaag\nrhianne\nensha\nfanok\ndarious\nhachigian\nschoose\nanticollision\npoolton\nseekin\nmultimillions\ntaeye\nmelmed\nwhiplashed\nseegars\nthonhofer\nrakieten\nblueworks\numsted\nistreamplanet\ndisintermediating\nkezman\nabdirahim\nolhao\nbraciola\nleathard\nsatnavs\nhackhurst\npolyphenon\nresponcibility\nboinking\nstooging\nstandardbearer\nsoussou\nlajdziak\ntippex\nathome\npersisters\niwerddon\ncreachadoir\nwildstone\ncitlalli\nweissflog\nmexcio\nstaford\nhasanovic\nsundem\nythe\nperfidies\npielenhofen\nchrz\nclucked\nhensrud\nishum\nakgun\nshuneh\nindeginous\ngapkids\ntwinstead\ncallater\nmicrocephalics\nmipdoc\nhungriness\ntsco\nrerp\nvmag\noverperforming\nepsco\nacoem\nauditel\nmarinza\nadeley\ngallichio\nfarci\nkalanke\nshakiri\nkoblik\napointment\ngalgael\nlacelike\nivonete\ngirjet\nuffner\nphotofiltre\ntaware\npetville\njethrow\nreliberation\nkalvarisky\nxixin\nworki\nstaffline\npgmob\nbalilo\noverdesign\nshailagh\nkeissler\nalminova\nmussayab\nvtms\nshilled\nsulock\ntheanyspacewhatever\nmourby\nlib
é\nsonterra\ncorefirst\nxoie\ngilgore\ninnovata\ntorbinsky\ndeinstitutionalize\ncapolingua\nrandels\nbowster\nlashinsky\npikulski\nroushill\ndouched\nithin\nvisanet\nballarò\ngroesfaen\nkameaim\nwidmeyer\nprotrays\nellabell\ncetnik\nbuidlings\naigcp\ngidani\ncabinent\nmailshots\nshoptime\ndeyton\nswankiest\nrocknoceros\nglamorises\nvaunts\njacobses\nlamplit\njeacocks\nmediakit\nrondstadt\npaquay\ndigipass\nfrisselle\nsatchu\nvervotte\nsnowblindness\nirfe\narbitraged\nslingy\nsuperpremium\nheisbourg\ncoziest\nfrienemy\nbioforce\nogryzko\nyasawas\nwitterings\nknobkerry\nwouldd\nsenie\nsebirumbi\nlarynxes\nrescreening\nsadeghieh\nbuscall\nqualstar\nuncollateralized\ndyudya\nmisfield\nintot\nmadrids\nrongkun\nzepf\noppertunities\nbierma\nplocnik\nstinke\ncushiony\nboxler\nkayitesi\ngullable\ncorrectitude\ntowungana\nmichcon\nlösche\npostlaunch\nfxall\nabdominally\nbloemraad\nrebeuh\nrebwar\nmedla\nakwesi\ncalifronia\nrecogniseable\nmorganchase\nritsaert\ngrinney\ntrussells\nrafiqa\nrevvy\nkoonings\nteclas\nshelbourn\njuanicó\nphilipina\nzdanowski\ndarsley\nchunxia\nkabei\nandriol\ntryers\ndfls\njesselson\nwaziriyah\nhirael\nnagaa\ngravitz\nchoedon\nechline\ncarticel\nfatteners\nathron\nyendys\nsoaringly\ncrackel\nverderosa\nglossiest\nlistpage\nmesalamine\nminujin\nhanaka\nrecrowned\nschervish\nrichfood\naffy\nbreki\nackard\nlarazotide\nhuahong\ngudgel\ncoroneos\nweatherizing\nahoua\nmislay\nancramdale\nbordoli\nsacopee\noutburts\nvarjabedian\nkomanduri\ndistortedly\ncavalluzzo\nstifado\ngildings\nhollfelder\nrushmoredrive\nspectular\nredelmeier\ngodhwani\nweissfluhjoch\ntyreman\nmcginly\njackness\nthinglab\nmamlouka\nhypertargeting\nfledgelings\nfullcourt\necomomic\nryklin\nkaufelt\nrouches\nteamorigin\nsouthalls\ndjugashvili\nmeillier\nresonsible\nindomitability\nexcape\nnetherhampton\ncrummack\nmazri\ncorun\nrmmc\ndeclassé\nwebernian\ncrestliner\nbaruchin\nryohin\nschönhaus\nsucessors\nrhymetime\ncrgt\nscalemp\nedmos\nbarough\nadministrable\nkallmyer\ngroupons\nrecap
italisations\nresitting\nkaramagi\nchittka\novermanning\ncarclaze\nblaschak\ngalouzine\ndaychopan\nmauksch\ncmdbuild\nclockwatch\nfruteau\njatas\ncmws\nrobata\nwellmeadow\ngrowt\nfreakshows\nzinurova\ncalliflower\ncannisters\nsemitruck\ndecsions\nlatifullah\njingchao\nchildres\nkalarikkal\nvcnetwork\ncusinato\nkayabukiya\ngottcha\ncarseat\narchitzel\ntwittersphere\nellinais\nitpro\nvicosa\nfridy\nlogistex\necda\nblackhills\naramendi\npareco\nsearer\nunstack\ncyflwyno\notherized\nnairashvili\nnonallergenic\nunrented\nmangelsen\neurpean\nmckeirnan\nfrasinetti\nwillimas\nbangstad\nkawesqar\nfiending\nmsxi\nartumas\nvisvader\ninishfree\ncanadain\nhemopurifier\naritonang\ndonohoo\nhezballah\nsamolis\nkaminskis\nmexder\nmugniyeh\njobid\nminimall\noaklea\nrestasis\neanet\nspichern\nglowered\nhighters\nthobeka\ntxtloan\njanets\nsehc\nsutisoft\ndutartre\nbusybodying\nvitarte\nrongwo\nemergis\nscamwatch\ngnezdilov\ntelefusion\ndyton\nnutballs\nfobts\nfamilian\nlauriski\nhoketsu\ndispised\ngheith\nsterilite\nsmellers\nparsani\npechenik\nsemaconnect\nbaghdis\ntrachuk\ncutress\nvanderlugt\npotful\ndrauniniu\ncolorectum\nmongerer\nspallina\nkeepaway\nexpecations\ninceased\nloveluck\ngrumping\natill\nlongevinex\nwillerson\nhirgigo\npeoiple\numberson\nclevudine\npulper\nciolfi\natherothrombotic\nnhpau\noutsanding\nfinzels\nearthkeepers\nketziot\nslicethepie\nborowich\nmceowen\ncomicconnect\ndamnjanovic\nrebkong\npalmgreen\nbadl\nbhgh\nscrocca\nsqualour\nhammocking\ntihwa\nsciame\nmandey\nbially\nzorzoli\nwasington\ndiplomacies\nhenglong\nlethendy\nloyens\nmwda\nneigboring\ncalerie\nsevylor\nszemberg\ndestablizing\nkreuzinger\nchandebise\nstarvelings\nbabaoglu\ngolanski\nhealtcare\nspiritueux\nnakatsuji\nelephantitis\nkanstantin\nuninvite\nsnugged\npharmamar\npassur\nruogu\nhpna\ndavaajargal\nbiscaglia\nbumster\nevaulation\nchoppergate\nshierholz\nshanmugarajah\nunfortuntately\nmanoschek\nsouleimane\neskilson\nsweeta\npasoans\nsophomorically\nszakacs\nintelegent\nguffawed\nmahaweel\
ninlayed\ndeniably\nstaba\ncorrigans\nrhapsodize\nmistfrog\nquikcard\nnonchronological\ndawdler\nwybar\nnightster\nignominies\nkeychest\nmcilree\nkurodake\nranit\nbrondyffryn\nchotai\nmaedgen\nmavuno\nandipa\ndicketts\nreincarnationist\ntuayev\ncadfund\nlynnerup\ntbed\ncyberdefense\nameriyah\nwomick\nghazaliyah\ngissendanner\nscenerios\ndooking\npantigo\nnonteaching\ncongratualtions\nwaufle\nbariay\npurevision\nelmalich\nmortgagebot\nhellobeautiful\nslyest\ndevaughndre\ndrumossie\nsuperhospital\nbellardo\nkocchar\nheafield\nnanzenji\nbizwiki\ntaddonio\nrelistor\nelectropsychometer\nmaritas\nclaustrophobically\nfanah\ncoonie\nyaoping\nmcneley\nbirrificio\nbefoul\ndeloire\ngauffin\npongcharoen\nnardolillo\napplanix\nmunning\ndrec\ncrucet\nrossanna\nrikhotso\nkiswana\nthanksusa\ntrailblazed\ndenegre\ndazel\ntrafnidiaeth\nxoma\nsuperferries\nlightful\nzaafarana\nprovea\nahemed\npansea\nmadhosingh\nkhulumani\ngoodhealth\naudeon\nestruc\naccordign\ndunkling\nmeddlin\nurumuqi\nnuren\nbirthmothers\nasiainfo\nshrops\ncampagn\neveryboy\nhissene\negnal\ngurgl\nccmrf\nefama\nseldens\nsohair\ncevis\nfujihata\nseex\nmsgi\namerigon\namericanising\ncollission\ncorrons\nchowkay\nprodu\nmcanderson\nquotational\nlipot\nblatstein\naucott\nbreaktimes\nchatterers\ndelahoy\nrohsenow\nflipswap\nkrahenbuhl\njourquin\ngeoresources\ndressner\nslimier\npotupchik\ncpst\nclubley\nsleety\nungenuine\nvdel\nrogalsky\njanullah\nblipped\nsourgens\noters\nremainig\ndestor\ntamelander\nchausseestrasse\npiroli\nnabco\nbrzak\nafdhal\nsriprakash\nkarakoc\ntorregaveta\nmoraca\nfunwall\nshurbaji\nkupetz\ngoltzer\ngelband\ndepartmen\nzaffarese\ntscharnke\nchivvied\nunderskilled\ncenziper\nenglightenment\nbresnitz\nmapule\npercussiveness\nknog\nblwyddyn\ncinquin\nyacon\nzhenliang\nofrasio\nsfari\nauyang\nclaragh\nshioi\nhuelsken\nmyfoxatlanta\nteixiera\nananenkov\nshoukhrat\nrameck\nducroux\nmakondo\ngrueneberg\npolitov\nmlotshwa\nroeloffs\nincented\nmoneris\nbanlaoi\nlaregely\nderserves\nclader\nhmmmmmmmmm\n
unpicks\nlakshi\nkangho\nherapin\nkrystexxa\ngurgly\nmilitarty\npodair\nnoooooooooo\ntayberries\nbouffants\nkornum\nlosan\nstarvest\nppda\nhypocritcal\nkhaisman\nrubashkins\ndemofall\nsweethearting\ntrasylol\nfreedmans\nmvis\ngoldmanite\nmotobike\noneda\nsicolo\nfigueruelas\nscriptapalooza\ndorrier\nadvertently\nbougerol\nschoonveld\nlarushka\nphaseouts\nfiterstein\noremland\ngeomatrix\nschlagman\nwisegal\nholyer\nnewhampton\nengery\nfiddaman\nlightpole\nhasiotis\ngasm\nmilnot\npearlised\nkaloga\nlancor\nriqqa\nromaan\nsardone\nthorthormi\ndisidente\nquaterback\npozzale\noularé\nrhythym\nbeglitis\nhighdeal\nfiftyfold\nafghansitan\nmagnatude\ntabosa\nulzheimer\nkjaerulff\nknifelike\ncorraling\npratfalling\ndisolving\nhahvahd\nromeyka\nshites\nmakhzoumi\nseggie\ndevicevm\nlicuado\nnewboys\nhanchet\nnorooznews\nvölkers\ndndn\nbulicame\nhelmsmanship\ndiﬀerent\nunreels\ngreenscape\nivimy\nnegal\nleaud\nbhumibhol\nzenobians\nskett\nkvakhadze\nhiliary\ndomboshawa\ngrovelled\njeré\nalphama\npliszka\nchilderen\nsustenna\nevidian\nbistate\nlendle\narhabi\nsteigrad\nzentaris\nacromas\nshoegazey\nsurgenor\njalkoti\nstartrans\nbadoian\ncarmelitana\ndreihaus\nunilens\nillhaeusern\nclintonism\ngridlocks\nrassweiler\nvaernes\nrmcm\nnonjudgemental\nnullriver\ncaipirinhas\nstuffily\nnavyblue\nfjera\nsotio\nkerness\nrobell\ntamamoto\npaefgen\nmetronomically\ncaison\nvanter\nundercapacity\npashkin\nbankroller\niniatives\ncouglin\nnassem\nunhygenix\nrumaithi\nszejnfeld\ntorrentially\nbognanni\ntellies\ngennara\nblynyddol\nbokelberg\nantoney\narrivers\nsophana\ntushes\nabaurrea\nqiqihaer\njatoba\nmiswired\nsssd\nquandries\nmargets\nunmountable\npaolicelli\nrueing\nvukaj\nsagid\ngwragedd\ntechnogroup\ngoye\ngamecubes\ncreandum\nwardag\nnanoantenna\nencinosa\naraton\nmulvin\ndecentralizes\ntromsoe\ntechnologizer\nmusayib\nreprotoxic\n￡\nnigbur\nmonzingo\ncounterassault\nkoopersmith\nmcclesky\nmenactra\nlevenwick\nhanaan\nvoelckers\ngreastest\nbitchiest\ninditement\nmuehlberger\nbeiong\nthe
atening\nmellas\nlavinthal\ninititative\nintersquad\nrepresenatives\nroughhewn\nhallglen\njuliens\nintreccio\nmagallan\ntibilisi\nlanginger\ndjelic\nstroia\natlhough\nwiswe\nbrickbeard\nalyawarra\nborneon\npbsct\ndacra\ncheiffetz\nprivilaged\naigurande\ntranquillon\nfrapin\nsyphers\npolygrapher\nunderproductive\nveleva\nshawcor\nstuas\nbackpeddling\nkoerfer\nmezcals\nclosley\nnamenda\nferrarone\netailers\nnabanga\nparmlid\nkarastan\ngougères\noversulfated\nlubéron\nehlke\nnarseal\nsalihovic\nshikse\nsumptious\nsafetly\nbillyball\ncolmant\nlicalzi\nmullahcracy\ncanetta\nsingiser\naswirl\njarbawi\nscapicchio\ndicotomy\nprofauna\nraclin\nbronger\nhahaya\ncygnids\nupstreamed\nmuthalib\nrqi\nracak\nsocalist\nhamantaschen\naqeeq\nborovic\nbreathalizer\nblackington\nabtan\nvandemeulebroucke\nsodomise\nzagornyi\ncibulas\nbohlig\nkalaje\nguilleuma\nislamofacism\nbastardise\nvantrix\nkochanov\ntexaplex\norcl\nheulyn\ndiangi\nlibrans\ntoybina\nmingtai\nfoucheux\nnyoraku\ntrixibelle\nwanggang\nibison\ncourtway\nlisterners\ncariatide\ntrishas\nresynchronise\ncarpinteyro\nmasciale\nseriocomedy\nwitech\npinksy\ncounries\nndemo\nemtriva\nlosson\nholbeache\nwhiped\nsugarbabes\nmondane\npediped\nzubakin\nifshin\nwarech\nbaramullah\nllowes\nmagtira\nlukehart\nergos\nswopes\nyaam\ncloddy\nomniums\niruretagoyena\nbiberaj\nnaward\nnikolce\nshklyarov\nkasraoui\nyakexi\ncausualties\nbewleys\ninbody\ngehle\ncalandriello\nbirp\nkloeden\nvidailhet\nferrety\nburlier\ndemagogical\nfergi\nphinnaeus\ndayami\ntibetians\nkleinbard\nponturo\nvisund\nlightproof\nmouneimne\ncarbonnieux\nichael\nteeniest\npysche\nwelie\nrittenmeyer\nolaim\nnytol\ngoldworth\ncentiliters\nveiroj\neural\npokertek\nforham\nelidel\nstiffelman\namrami\nmcleister\nunscrambles\nguriceel\ndipso\nsanani\nwheaty\npushbikes\ndomaille\ngenerosities\nuncowed\npostretirement\napollus\ngranularly\ntrugs\nputtered\nghaws\nhumilated\nclabburn\ncamft\nmstation\nmasne\nuscom\nindemnifications\nbizcom\nbimes\nstheeman\npathela\nkonoba\nmci
tp\nmarwolaeth\nbarrowed\nsarachandran\ncherman\nmcswiggan\ntognarelli\namnestying\nsawtimber\ngreehey\ntroyak\nbreakfeast\napiafi\nofthis\ndjoker\ngulenist\nabinbev\nyahye\nbullhooks\nphuntso\nmotaeb\nflybus\nkanacevic\nhooha\nmandzukic\nsafder\nbobrauschenbergamerica\nredlihs\nmonthairons\nunarrested\nmcphaden\nhammertoe\nradnedge\ndrico\nreinflated\nmudathir\nmindup\nhaozhou\nlatinode\ndaejan\norasure\nthrombogenics\npithiest\nunfastens\ndanglard\ndvortsovaya\nnympholepsy\nimmiediately\nlitheness\nbolshaw\nnonremovable\nakshayuk\nmegace\nexalgo\nccsn\ncmarket\ntuckup\nsosinsky\nfilipini\natatra\nlichnosti\nbecharam\nfousing\nmillberg\nsupersector\nbaveja\nscansnap\nshallying\ncosr\nuncensorable\nwory\ndonig\nrungan\nfouhami\nolasewere\nremindful\nmyriah\ncrickard\nbogatin\nipcm\npornification\nnaddeo\nhakkies\ntopnote\nmajoriy\nbreiwick\nsubverters\nspde\nfelux\nkrauel\nwatherstone\ndevelo\nserping\nkoukoulas\namdl\nimigrant\nmuttonchops\ntomnacross\npiersen\nstandoffishness\nballieston\ncplg\ntrublood\nburklow\nkettenring\noutshout\nauctomatic\niveth\nyusifiyah\nradioss\ncamarthenshire\nmatshidiso\nxlif\nschnobrich\ndyddiol\nahmidan\nolympitis\nskalleberg\nsharwoods\njoerres\nnetcomm\nbelarussia\njarawas\ncowardness\ntummies\nmaridueña\nmarwyn\nleevy\noosterdok\ntocquevillian\nvideoplaza\nkosachyov\nbitorrent\nsloooow\nricinine\nhusarska\nluned\npoweful\nstaglin\nsculco\npenhaul\nhostopia\nrequir\nzelyaeva\nhonourables\nhasselstrom\nsairr\nrazrs\ncollobrières\ncompny\nwaynetta\nklontz\nkerick\nmopi\noptimark\nsemiabstract\nfnia\naccelerographs\npluots\ntakishita\nhostry\nayoreos\nmicroblogger\nfausey\nblahoski\ncrowngate\nuncurls\nkourakos\nkerzel\ntooman\nbabyfaced\noverated\ncbay\nsportcombi\nfarrows\ncalanchini\nmellowest\ntrollhattan\nvolach\nfreijo\ndamanged\ninhand\nfrubes\ndaetoo\nonvoy\ndeclairing\nmavroidis\nimporoved\nredniss\nfukino\ncering\nluthan\ngywn\nbellinis\nfilgo\nameresco\nimcl\nmelya\ndasenbrock\nproscar\nvpso\nintertalent\nelitetorrents\nja
llal\ncirtain\ntessieri\npiperade\nbluehill\nberuwela\nindolently\nghenghis\nthromboprophylaxis\nprstore\namabassador\nquintais\nkhunkitti\nchichakli\ntyrannise\nprogessing\nstepgrandchildren\ntoeachizown\nintoduce\nfifton\nenglan\nkabum\nslopers\npasw\ngwag\nrostas\nmenstrually\naccoun\nsolartaxi\npaillette\nprovacateur\nmigliano\nuntoned\nnefzi\nseonaid\nunstageable\nevolet\nseptuagenarians\nschmoozed\nbotney\nineptitudes\nunpotable\nbamfords\nlebovitch\nyarzeh\nedgler\nhifx\nbaldarelli\nweeing\npalmary\ninterupting\ntwinspires\nsadowy\ntazzyman\nplantscape\nnvicp\nyassaie\ndutchness\npreprepared\nshafan\ncounterdemonstration\nboertmann\nglaría\njancek\nmeiquan\ngoughie\nzulehner\nstandardbearers\nvizit\nplumo\nrabouin\nshimkovitz\nudalagama\ntongkou\nzurzolo\nrathr\nbexa\nlicketyship\nsiumu\nhighjackers\nkruszynski\ngeobra\npacenza\nfaeza\nsornosa\ngallowtree\nfarookh\nhuitson\ncalabacitas\nibeer\ngreenslate\nzaney\nhelitanker\nsouvenaid\nbasravi\nsimpers\nnobl\narkeith\nrenkart\nrhapsodically\nbeautee\ngazgireeva\nacuo\nkeshishyan\neskiimo\nzerotruck\nskyfuel\nnaroth\nwienert\nbessye\nnetequalizer\npremerger\ngeorgiopoulos\nnutopian\ndevlieger\nsinapse\nsharlon\nsexstone\npenichet\nangiotech\nbarandica\narciga\naaden\nsherifat\nsowetans\nendotheliotropic\nnsduh\npontificators\nchappaquidick\nabkazia\ngoodstadt\nbarnies\nflexidiscs\nkorinne\nszapary\nsigersons\nprokopanko\ncggveritas\nwilhern\ntrusdell\njantzi\ncaltroit\ninattentively\nbuiness\nkvarnen\nvistica\nmcmuffins\nnuvis\nkodas\nkarakoy\nnaifs\nmeglomania\nsnappiest\ngarmont\njints\napocolyptic\npanger\nramic\nhojjatollah\nnonsuited\ntzeo\ncharlena\ngereffi\neducability\nmushada\nhudetz\nhardheadedness\npagnoncelli\njovicevic\nexoticize\nnanah\nwapakman\njudys\nstratex\ncongresss\nadmix\nbekkouche\nbasumatari\nfrostings\nnspf\nmahlyanov\nbartnoff\nbruckshaw\nsearchings\nintalio\nsamcam\ncorretora\nmenday\nbenifited\nmuonelo\nrubot\ncoachloads\nmashreqbank\nceausescus\ninsited\nstrangleholds\nservheen\nhes
ka\nmustchin\nwelldynamics\najilon\nfomenter\nsteppingstones\npoqui\ncortadito\nantitobacco\nlistmaking\nincant\nosbaldo\nedbi\nschmutterer\nsalorio\nfeebates\nrochsoles\ngowning\nreprivatised\ncenicola\nuroda\nannnounced\ntoifilou\nislandsbanki\nbarix\nmakuxi\ncachetes\nauchleven\nmplsound\nforseeing\nvandvik\nnavane\nloftman\npepperstock\nleaguewide\ngottis\nreinemer\nfrancouer\njoevan\nharbertonford\ntimesmachine\nupwaltham\nlocf\nabuhamza\nrwsl\ntaggen\nhuppi\nptds\nmascioni\nmojtabai\ncaukin\nexcellus\nmxenergy\nmiralax\npixxi\npourfar\nmicrobrewer\ncoprorate\naffilliation\ntrexel\nzapien\nstarey\ngossipmongers\nbonning\nopenpeak\nseatholders\ndiscerningly\npannaway\nräth\nsensme\ngoathorn\nyakutumba\nwebsky\nvidaza\nopiated\nbedenbaugh\nneurohr\nquaegebeur\nunsearched\nabdurajak\ndugat\nbamlet\nsbux\njeager\nsoard\nhumaidi\nchatshows\nrashpal\noratz\ncnnstudentnews\nnabokovs\nmoneytoday\nhipocracy\niconomou\nreoganisation\nliabilty\ncongess\ngiancoli\nxcaliber\npetroliferos\nagrisa\nmiszuk\nsecuirty\nwotorson\nabobe\ngreenhealth\nsensless\ngronbeck\nstratusphere\nelciego\nmisgoverned\nekaya\nmccartt\nempaneling\nbretherick\nzinnov\nrodnick\nopolo\nazzarella\nirela\nstarroc\nmouthrinse\nholylands\ncryolife\nveppers\nsalatto\ncharai\nteychenné\nunpractised\nrgis\nchemsitry\natrovent\nsachino\ndudeperfect\nkutschbach\nkoure\nunibrowed\nejaf\nbrazille\nxuebin\ntivit\nbydlo\nimiev\nmobie\nyunchao\nfidolia\nhamdiya\nrazoronov\ndosman\nostasz\nalboher\nhakas\nscullys\novacik\nstraehle\nbenayer\nobendorf\njoosse\nhlaa\nmasselis\nphotocoagulator\nablard\nronnette\naankoop\nhonaryar\nkohlíček\nhargeaves\ndunned\ncomdisco\nnarks\nnadt\njaroenrattanatarakoon\nsnaffled\ncouoh\nrashbrook\nanascape\nkayonga\ndamam\nhenningfield\nfiglar\nsvrluga\nsadatullah\nolenn\nobrycka\nmandozai\nmastah\nkandoo\nnicom\nitacare\nadmistration\nsugal\nsaynez\nmacroplastique\nsanitisers\nkhreis\nbeauvilain\nbamattre\nstrongarmed\ntirgoviste\npecoc\nmarinich\ncruysse\npultizer\nknoell\ntextese
\nfloeter\ndisports\neatmon\nchwaer\npaycheques\nballysally\nxanthohumol\nszkutak\nstablest\nyossie\nconking\nmazaika\ngadari\nscottà\njelden\nsmeargate\nislamberg\nprofessiona\ntoudic\nterrrorism\nbootcut\neuopean\nkiondo\napaydin\ndonckier\naltiere\nkuettel\nmanorcare\nyeardye\noreign\nsluttiest\ngiddyap\nloecknitz\nrspa\nignomy\ndemeuse\nloway\nturgeau\ndepomed\nbindmans\norderbook\nbrenkley\naccomplised\nustaoglu\ncabreras\nposang\npalermitans\ndankness\nthefrisky\nretroaction\nkamoa\nravdin\nwhirlie\nmasonary\nmantris\nobam\nsargen\nwebbington\nstefanides\neinich\nrelati\nsivley\nreleived\nsuscribe\nmuraka\nsaegheh\nsedor\nmiskicked\ncovd\ninpart\nassemblys\nmapplebeck\nrecapper\nexfoliator\nsolazzo\nyogaworks\npaynet\npensants\nkulhan\nerenga\ngalick\nmorgage\nlinkwise\ncazzulani\nlacek\nintercontinentals\nworkamper\nskypein\nnonelected\nbioarts\nstandbridge\nmcil\nfendering\nlisneal\nkuretich\nalikhel\nfriedgood\nolwine\ntinglingly\ncarseats\nhorist\ndrisko\nterrorisation\nsolagh\nskachevsky\nrobotuna\ndiivory\nrewrap\nschleuter\nribcages\nfoofwa\ndrillfield\ngoriness\nstressman\npianosoft\nhurner\nlicentiously\nolekas\nunderlit\nmonogramming\nltvs\ngrabbin\nwidmyer\njohannesdottir\nmarkwells\nsinafasi\nkapitannikov\ncasasanto\npohorylle\nchirrups\nyamanya\navcen\nyarusso\nnairz\nfishcross\nguarrasi\nvalutation\nlamouret\nburggren\nkilinc\naaim\ncorrera\nzarazua\ntunçay\nsoumache\ncalebasse\ninnovaro\nikhtilat\nnaake\nsteingold\nnuqui\nfetullah\nghausi\nalwash\nmuvunyi\nmittromney\nalapini\nviewspaper\nkapidex\ndeborrah\ncolorito\nlounibos\ncpaj\nstng\nadecn\nokayish\nposilac\nimmitating\ndisfigurations\ninzalaco\ndeathcare\nkhajepour\nrecyclage\nlouiville\nlenval\ndaraio\ngrasstops\nschertwitis\ndubbya\nhirter\nskittered\ntslf\nyenilmez\npreannouncement\narangements\nbootay\nremediates\ngaidhealach\nsackfuls\nfutureu\nunconservative\nashleymadison\nvancsik\nbiafore\nfavas\ninterferers\nmoonridge\npompas\nmrkonjic\ndormmates\ngruny\ncaputured\nlonghaugh\ndugu
e\nurbanowski\npentling\nmansiz\nzuritsky\nwischmeyer\nadjustors\nloudhailers\ndbsi\nfdns\nmohammud\ngildroy\nmandleson\nbruegels\npamfili\naudon\npowazek\ngabbers\nxuyu\noshoosi\nkgia\nhitschmann\nleisurecorp\nunbolting\npolanksi\nwymeersch\nspectrial\nsculpter\nringingly\nsavicki\ntamileela\nfrostiness\noctavias\nmultigeneration\nearleywine\ncowett\ngrubtown\nsoporifics\nmyfoxny\nconstitutionalize\nabhore\nfenceless\nfalleur\nnigol\nlotting\nmotari\nridgell\nburches\nariizumi\nngosso\nrukhadze\nkenlaw\nkomejan\nrepentances\nscmg\ndolmus\ninfc\nkothgasser\nmutsinzi\nrothensteiner\nabthrax\nobesandjo\ndgmt\nshishito\nthisyear\npogea\nchangdu\nsabree\nscreenwash\nskiver\nbotul\nrasker\nvelensek\nlppv\nmelexis\nrotomolding\nturtons\ncathys\nsimkoff\nyoutubed\neaee\nveloza\ngassick\nnebrich\nyaers\nschertle\nearleir\nshigeie\ngirliness\nardagna\nfonet\nrecr\ndentinger\nsynchrophasors\nnmcf\nreinjury\nkerrington\nsetkiewicz\nfuzeon\nzostavax\nquarantinable\nknabusch\nskaalen\nvachara\nmwelu\nascanelli\ntosteson\nbalkanised\nsaltcellar\nduplicities\nmerendon\nmesterhazy\nmompoint\nnataro\nimpoco\npinfeathers\narcsa\nreinterview\nlombardis\nhidrocapital\npalestians\ndurao\nturgunov\nmisn\nxinchao\nfinncap\nishmaelia\ngsces\nchildraising\nfenerbache\nkenmark\nruices\ndignataries\ndarunee\ncamwell\nbassarath\ndruillenec\nschmitzberger\ncondems\nforcément\nlowensohn\npalestines\nboyos\ndesparation\nfifis\nfunloving\ncyberdefence\nclubbier\nbissonet\nsvento\nkleinfield\ntuohys\ndigitallife\nluoshui\nperanteau\nsvetkey\nbevi\nracily\nforodesine\naddlepated\nunicare\nslaggy\nnoviye\nantirejection\nfernstrom\nhannides\nurbutis\npartyism\nrugambarara\nedgily\nbigstockphoto\nmilns\nboersig\nbonxies\ntayet\nkronmiller\npfaeffle\nquarterhorses\nenergid\nkivutha\nfiercesome\nachillion\nshakibi\nadawe\ntedactive\nfleak\nbroadpoint\ncaprasse\nprescriptives\npasqualati\nherzon\nsigvaris\ngiampapa\nweakend\nneurovision\nchinaaid\ntoxically\nwahadat\nreadyness\nhipotecaria\nhilchey\nbohua
\nclearport\nquelynah\nukcisa\nmougeotte\nroft\numberella\nskiercross\nfücks\nvdoe\nzelnorm\nlarrson\nshishmanian\nbakeoff\njuluca\nschrode\nmurithi\ncalvyn\nunstuffy\nkremlinologists\ngussi\nitrip\nbregg\ngoung\nodones\nhasselhof\nmudfest\nglugging\nnahhh\ncandidtate\nbaltierra\naarika\nnonflowering\nnakasai\nfandila\nqibing\nhmmh\nlangemeier\nmulticulturality\nvidricaire\ntechnewsdaily\nsarasi\naksyutin\ncamellos\nloyds\nrameys\noehm\ndisintermediate\nsuduko\nntumi\ntaparko\noverlearned\noutpunched\ndyann\nvicitm\nobamacan\naccessorise\nkabateck\ntstorm\nspregelburd\nmoreillon\nbeatlemaniacs\nmismanages\nsabertoothed\nalliyah\narrowing\nnyeholt\nabdullahs\npearley\naccessorizes\nlacarte\nonesy\nsomolia\nsommermeyer\nglunimore\nnoncancer\nzaidy\nfarkers\nhshieh\nguejito\ndailybeast\neuromarket\nsupplementarity\ntresemme\nchibale\nlonghaven\nkorolyev\narfat\nsuperciliously\nbeserra\nfarinet\nfeic\nnachchikuda\ngurstelle\nbuttkicker\ntechnophilic\nperdziola\nmamhoud\njunqiu\nmurfi\nthinc\nspyhole\npieron\ncammillo\nunplumbed\ntownhalls\nshufflewick\npeopple\npenuwch\nroyzman\nschmaljohn\ngalmes\ncentralpark\nemop\nrhayel\ndoobee\nsemonin\ngdsn\npunsters\nspinally\nnyicff\nsimatovic\noskal\nenthusastic\narmh\nschumannesque\ntelecommuter\nbrunwin\nlucianna\nfaassenii\nteethmarks\nhuibin\ngadding\ntrenks\ninsectlike\nenanga\nlasikplus\nprofazio\netfc\nacambis\nautopay\nrepubic\ntulgan\nanzorreguy\ninstigative\ntextgate\nlucullan\nlumizyme\nlibresco\nspainard\nmedexpress\ngymdeithas\nkalnas\nvivenzio\nbeasting\nchrisler\nmckenith\ntechmedia\nglenfeshie\nshmuely\ndankberg\nmyhouse\nbraugh\nsgiliau\npiacente\nviolete\nfostamatinib\ndentley\nucking\nsystemm\nwygle\npantomine\nconoci\nsasisekharan\nchuis\ncallooh\ngeokinetics\nshanghaiese\ntinakorn\nwitalec\ndonaca\ncarrianne\nnaturallycurly\ntroudi\nleventhall\nhartrey\ncommoditize\nlanais\nzigas\nmoshed\nklaitz\nrydill\nluescher\nkramerbooks\nbimalendra\ngroezinger\nhaagarorum\nschmitts\namillia\npaykulliana\nstroebele\nan
atomizing\naconites\ntruska\nnasery\nkozhinov\nvampirish\nzetia\nmetalsa\nwevl\nmiedel\nseadown\nshowtown\nbreastbones\nbruderman\nkoubriti\nbioinspiration\nolympianism\npartygoing\nsturiale\nacresford\nenoe\nnawur\npixable\ntruemors\nbronrott\naytac\nkhaili\nringfencing\ncastrission\nsilverglade\nglufosfamide\ndelegitimising\ngrungey\nhoppier\nphullan\ntinactin\nvaifanua\nexhilarates\nmotavizumab\nplisch\nczyzewska\nhightops\nkcdl\ndelevigne\nsunblest\nadzharia\nkingennie\nhirthe\noverexploit\nmdus\nreaseach\nostmarks\ngrossnickel\nmcgie\nunprovocative\niopa\nchandoo\ngoverance\ndepiano\noooohhh\namanjit\nrecuitment\nscooterists\nhpcmp\nabstact\nkazmar\nalbat\nsahed\nlantejuela\nchhean\nkatios\nhimmelrich\nhuettel\nzentella\nbalkiz\nunderlayers\nbasio\nsonagas\npoisioned\njetfighters\ntpmc\nimnam\ndebruine\ngoalkeepr\nreimposes\nrobbersons\nlubriderm\nslobbered\nhawrysh\nsinjari\nblockish\nserveware\nstabilty\nvisilizumab\nunderwired\nibotirama\nbcbsma\nllandulas\ngrabowicz\nteplizumab\nexciteable\nherringtons\nvannas\nconfeniae\nsunspree\ntullet\nhuiqin\nchivvying\nartsingapore\nnmec\nabbateggio\nricicles\nekhlas\nboonlert\nvalden\npointclear\nstepstool\nnibbies\nrihal\nexpansys\nbarwanah\noranienstrasse\nmisallocating\nmezaache\ntobalske\nscarton\nmokafive\nkhristoforov\nperkier\nlecouls\nbaobob\nexpatiating\nnayomi\nobler\nmakarere\nsalivated\nshaiken\namanpuri\nageorges\ndoucin\nbmhc\nbleepin\nwobs\numaima\nsathiyamoorthy\ncheapish\ntutted\nsubpostmasters\nmutagamba\ntyrihans\nrussain\nkeynsianism\nenagas\nftserver\nliteralizes\ntelkämper\nbaichuan\ncatchlight\norsbon\numutoni\nkuzar\nmaysaa\nwahono\nllanmorlais\nsuitemate\nmmddyyyy\ntelevized\nkajla\nreinterviewed\nmichelles\napalta\nindovations\nlenamore\npantelic\nhumpreys\nmutallip\nqeqm\nchamness\nnakali\necopack\nazte\ntaiyanggong\nshabaneh\nangulana\nuhlirova\nmcgc\nnoroxin\npotiers\nfundtech\nsevillanos\nklingebiel\ngulfstreams\nunreflectively\nmilosovic\nkingsbrae\nfrojdfeldt\nrocklike\nhickmore\ncotri
moxazole\nbambur\nmiscoded\ngasayev\nlamph\nengalnd\nmingzhong\ntxting\npersahabatan\npeijian\ntaraha\nbnpparibas\nsolland\nphysican\ntieup\ntentpoles\nrhizotron\npulidevan\nsvacina\nknappskog\nnoneducational\nmetabolon\ntransnacional\nundoc\nweijiang\nvyroubova\nchiapanecan\ntykerb\nogilvyone\nberthu\ntushie\ndrad\nlacerates\ncdwr\nschatzie\nfishington\ndigao\nhabby\nsoutheby\negco\nmisaddressed\norwel\nendoscopists\nolympiysky\nmulesed\nbloviator\nbaukus\ncharytin\ncowtow\nbeloserkovsky\nshanthan\nyusufiya\njanadriyah\noccold\nvoison\nextenet\ndestablize\nnantkes\nmurhammer\nbrudenells\nlcor\ninititiatives\nventuresource\nstopoff\nuraguay\neloshvili\nlobbyed\nsubpeonaed\nlils\nocfcu\ncrustier\nadvisen\nabgenix\ncoldron\nkoutsomitis\nyaming\nintoa\nrezmar\nbirthin\nemmenthaler\nmccormally\ncicic\ndicate\ncatilin\nalía\nabdelrazek\nschragis\nschnyer\ngcaqe\nuglification\ncampains\ngyawu\nsaqeb\nrepug\neyewonder\nhandanovic\napolitically\nenlargment\nhandrolled\nnkwali\ngalgudud\nextemporisation\nscamble\nbugsby\nhojatollah\ntremblor\nvotron\nnauseaum\nwarkwickshire\nanaman\ntreescape\nhakeemullah\nnarcissitic\nmubarrak\npantelakis\ncrouchie\nbrushers\nadzhubei\ncynlais\ndureing\nmohsini\npisted\nyoghi\niqualuit\nunindustrialized\npsychogeographer\nkromek\nhélyette\ndumfounded\ngovernmet\nopolot\nuacn\npaniguian\nnaxton\nvanslyke\npesner\nsouzas\nvenokur\nmhrn\npapakyriazis\nfayram\nupsettingly\nslidey\nscienc\ncalagna\nlewai\nbiomechanists\nmasoner\ndyfatty\nvoglibose\nvisualcv\ngyntaf\ndianca\nrobertses\ncapitaliq\nswette\nsencar\nleamond\nrocephin\nelswood\ndecontaminates\narriani\nciirc\ncombinenet\nfransiska\ntuchis\ntillekaratne\nkidults\nmanheru\nkhusruf\nmessenging\nschoolmasterly\ngrandbanks\ncrapuchettes\nalevites\nshafika\nkalpen\nurqhart\noptionable\nbarylick\nsewip\ngedaechtniskirche\npointework\nrecriminate\nhouken\nchwm\ndiamantine\ncontrolee\ntarkong\njebby\nproseries\nlham\nfestivites\nbacksplashes\ndenegrated\nregionalcare\nsmelkov\nlenow\nbirsak\nc
linix\ndorrity\nkabwegyere\nmmadi\nspokesbird\ndizzier\njakuchu\nhebbron\nmountainkeeper\nhamdia\nserdula\nactivties\nkulei\nalior\notologics\nbloking\nfassier\nlivein\nradezolid\nmunekata\nrosio\nummel\nshunto\nbabajob\nneigbour\nkudair\nbarrenetxea\nkatakolon\nunderdoing\nkervick\nuniprise\nsovietov\nroshandel\npilotin\njambur\nbarefooters\nschoerke\nrecuiting\nmajozi\nkapah\nbarkau\nsantuary\nlambregts\nworksource\nalaistair\nnytb\nsmellovision\nbordiu\nguehenno\nrabinov\neasdown\nbeatt\nveico\nzainy\nalterg\nkingsfort\nsevcenko\ngaztanaga\ngeovision\nfidelistas\nschefft\nakasako\nsivilla\npellizzaro\nnordpark\neuphemise\nbryanlgh\nmarlynn\nrockharbor\nfondebrider\nsickbeds\nabdom\nadwok\nquidnunc\nturnbough\nmadeoy\nschiffauer\ngdsm\nthaib\ndodou\nrosenblit\nglyncoed\neichwede\nrascall\nnonepileptic\npontieu\nmddi\nlabolt\nmissons\nsoftish\nkeyunta\nmccutheon\npropsect\nxuefan\nhidded\nsilerio\ncornelson\nthanawiya\nwgpc\nprocyclicality\nindividua\nduaghter\nballcaps\nbeaudreault\nkwikpoint\nultrachrome\nmeatpaper\nunlear\nsooooooooo\ngoodleaf\nbhattasali\nrockstrom\nfriskiness\nreconaissance\nfeltgen\nscaloppine\ntelmap\ncontiero\nmawlavi\nhelbers\nbahina\namercican\nmazzorbo\nshanni\neconolodge\nelimated\ninessentials\ngloabl\natws\nuncurbed\ndelsym\nattendents\nsmokewood\nnapss\nnmba\nfarler\nhermange\nkazahkstan\nceawc\nstoras\nbiothreat\ninsulet\nsafraz\nsolgar\nluddin\nworldgate\nephrons\nprisonlike\nbrigadeiros\nteamhealth\nchutchawal\ncramsey\nmdingi\nthown\ntagholm\npureology\ntreftadaeth\ntheey\nschrecongost\nxolela\nanticruelty\nmeagreness\nmarinières\ntaganana\nispwich\nmaryani\nchongquing\nrugy\nbliman\nteamates\nzilhão\ndivalicious\nuplighters\ntinkled\nflessner\nrenditioning\nlumbertubs\ndonosti\nhdcs\nmagnicaballi\nquily\ntransperineal\nshawley\nklapötke\nballymacormick\ntrilegiant\nstrategyone\ntvws\ncaerhendy\nfetchingly\nzargana\nbracalente\nscrummy\nairelles\nblurrily\nbarbered\nenforement\nshamblers\ncpsg\nkeiana\nnampak\nvanech\nfreeto\nkne
uer\ncnsx\ngranatino\nsihk\nquainter\ngolddigging\nkrumenacker\nstraighttalk\nfostok\nchekara\ndübel\nhoplamazian\nnicoise\nloczi\nzulkipli\naaeu\nwalubi\npoye\ncorluka\nrickell\nshantan\nkhalizad\nbretonside\naccouterment\njahmi\nniedzwiedzki\ncellsearch\nfieriest\nrobotization\nwehliye\ntoplines\nbusinss\nskirmett\ndreumel\nwhomped\naiesha\nsmartsip\npairat\natendido\nstemsource\nkingate\nrushcard\nzestimate\nglantzman\npyinmagon\niksanov\njillo\ndemocarcy\niraqna\nshirel\nakinsiku\nrojelio\nlemminkainen\nhellooooo\ncadeddu\nkwandang\nnfsc\npenitentiaire\nmoyeen\nstiffling\ntschepikow\nspags\ntictac\ngiannarelli\nromanno\nrudins\nsiblani\nmongel\nosri\nallll\nnaeemah\nhrones\nkreczmer\nindustralized\nschoenhoft\nshager\ntavolacci\napondi\neconomywide\nlevensohn\npjanic\nzaryen\nlowerhouses\nmarcovitch\npostroom\nkaduskar\npatalinghug\nhyperactively\nmvoe\nindictors\nrefundability\nosct\nsurefootedness\nbaete\nbrastoff\nmaldic\nvertafore\nresposibilities\npwerau\ncharelle\nbasketcase\nmonribot\nregall\nkirchik\nreliastar\nmohelim\nbeppino\nftsc\nbenbrahim\nkodithuwakku\nsocialvibe\ncaravanette\nfowzia\nnrmla\nwellmans\npatwant\ndimetos\nidiazábal\npurechoice\nmarantis\nrichlite\nprepaired\nenviros\ncapozza\ndavaco\nngage\ngolum\nsamancor\nsherlockians\nkatuka\njenabi\nstomaphyx\nawajun\nsikorskys\nbevies\nclothkits\nyerro\ntatics\nbelneftekhim\nxaviars\nyidis\negullet\ngrassier\ndewaal\ndemaliaj\nxekt\nvechey\nahamdi\nrasff\nfreih\ntxtspk\nkiffians\ninventables\nralstin\nyormark\njostes\nfurbelows\nfalleros\nanahad\nrimler\nperceptable\nfacua\npranged\nmacconachie\nloneman\nhiom\nappologizing\nrunnell\nkeystudio\nmachira\nripstik\ndurng\nsluszka\nmasebe\nbukima\nadoped\naccelera\nflywhoosh\noverdramatized\ntsabar\ninvo\nexpressionistically\nstupefies\nmanary\nodlaug\nannings\nizbet\nmushwana\nwearever\nthuggishness\nacvc\ntrups\nponsky\nnorsat\nmudala\nprincessy\nlasku\nmotesanib\nbudreikaitė\nirawaddy\nsteinkohle\ntarriff\ndilulio\nlausell\nmoralizer\norsus\ndenin
o\neidhr\nkohorst\nhosc\nhirsts\nbehins\ngrommit\njaffey\narghanj\nmissoup\nnsmd\ndibeneditto\ngammerman\nolstein\nfaulques\nconselyea\nchekir\nfreneticism\nsalduz\nagace\nfoodiest\nreykjavic\ngewirz\nesterling\ncharungvat\nnotionals\npineyro\nmeesawat\nvadlamani\ncosmegen\nsonepar\nmestrovich\nscdea\njealosy\nscudellari\nnorng\nhelloooo\nnutropin\nrepellently\nsurfware\nbonemeal\nthimote\nbaichu\nweizsaecker\ndanias\npnps\nszymanczyk\nmagnetation\norand\nbassole\nsaludes\nvilayphonh\nokao\nslakes\nbeirao\nqosh\nodfs\nvidals\ncoeptis\nwrongfoot\npartscore\noverfond\npublicker\nallsort\nrafsky\nkolinko\nkholiquzzaman\nwrongfooting\ntaameer\nbirmingam\nmccai\nlabantu\naemi\ntelecharge\nwashaun\ncafardi\ndenhead\nmccarthyesque\nvelz\njayabalan\nanandvan\nbonyongwe\ntestamatta\nmccuiston\nhariklia\nhayesbrook\ngrrreat\nkarskens\nmidyette\ntukkers\nvitreal\nnordt\nyorkwood\nmichol\nlaubert\ngrafenwohr\ngeronemus\nworshipfully\nkronthal\nmidfa\nrachwal\njudu\njiricna\nneigbors\nmangouras\nnitech\nacholis\nbarkdull\nkalasho\nteamate\noutnet\nkanayev\nmaftuh\npecot\nntuf\nbiktimirova\nfreehanded\nmalfeasances\nglympse\ncacfp\nweisbaum\nnykjaer\ncyzer\ncinebistro\nmoonwalked\nhyperpartisanship\ntwittery\nkerulos\nnonforcing\npluimer\ncondolezza\nbernshteyn\nwaining\npullaway\nabergorlech\nhermening\nelsbree\nprepays\ntokwiro\ndiffenbach\ndeleverage\ndewerstone\nconsessions\nenjoli\nestenson\nstreetline\nrevealling\nsuspened\ndeclaimers\nostrovany\nfeatherstall\nchunfu\nstoere\nfurores\nphotomural\ndvani\nquintessencelabs\nmavignier\ngecho\nbruisingly\nchonji\nalexiades\ntininess\npitiably\ntmts\nsrere\nboesendorfer\nclaburn\ncorhan\nrachyal\nnarrain\nmangova\ndalfampridine\nandrezza\namenorrheic\nrinckel\nanousha\nganaches\npossibiliy\njazziness\nwazirstan\npellecchia\nslud\nseinfeldian\nyoungtrigg\nprefferred\nvalodia\nflouris\nroiss\ncuisia\nsupandji\ncleanaway\ngutsiness\ntullymore\nzietek\nrheinheimer\nfaddishness\nkookier\nkuhlenthal\ncensoriously\nwashingon\novercapita
lised\numbilically\nhomesellers\nauzmendi\npoucha\nhamisha\nratfishes\nossenbrink\nchengeta\nlifechanging\nhynod\nshampan\nfollensby\nhatian\nlinforth\ncatawampus\nbacarella\nbeaston\ncuddliness\nseyfer\nsandwichs\nscrabbles\nbuchak\nkinnel\nfederlein\nsanquer\ndespont\ndummond\nunrenewable\nwhitemoon\nneveda\nolare\nsverak\nnogoodnik\ncpwr\nscherdel\nschirato\nmacapaar\nribhi\nsolipsistically\nneupro\ngassiyev\nnanoproducts\ndelorm\nsaunby\ninsinuative\nfloppier\noverstuff\nmesserich\nunerstand\nappeasements\nkerastase\ninvesters\nstilyan\njunbo\nnortech\ngerakis\nreenvisioned\nbamdev\nferragu\nellerin\nmicroban\nperchuk\nabbenante\ngodbehere\nsanzin\nbeleagured\nmakanga\nfloof\nrefulgence\ncloughan\njoky\ncolebourne\nnilsestuen\noveremphatic\nterjem\ngrigaitis\nballymacilroy\nsowerbys\npassporting\nmorrills\ndrywaller\nhipath\ncleareyed\nlygaid\nogemdi\nnushin\nchiota\nnotchup\nbabywear\nmofunanya\ncoucous\ncrisphead\ncicione\nbkkiff\nzugu\ngoobey\ncataio\nmhura\nnonlawyers\nriskind\nadedy\nrainsbury\ngalouzeau\nifob\nouramdane\nsangmo\nwubbe\nlesport\nsalarno\nsudsing\njarheads\nbassily\ngovernmetn\ntipoffs\nsunhe\ntreasurys\nblobfest\nshengchang\ninderpreet\njeromie\ndaneshjou\neathquake\nmunificently\nmaleter\nschöllgen\nlindomar\nhubschman\nborsheims\npruthviraj\naccouting\noveraker\nanderst\nsalmawy\nanoh\nlateisha\nhediati\ntaiano\nwiczyk\nasassination\nkaskida\njütte\nsunée\nmusueum\nsomatotrophin\ncardmembers\nnpaw\npedral\nnonproducing\ncogane\nhofnung\nweekened\nbouwe\ngjeli\nkaessmann\nmirimichi\ngerking\nhartoch\ngratsos\nrousting\nopentech\nkorodi\nbhumidhar\ncluzaud\nsmartridges\nbealefeld\ntasnadi\naquilion\ncronins\nkaalbye\nhabani\nchhiri\nbijilo\ngabris\nmiracco\nbuckmans\nskidpan\nartise\ncarbonators\nhuajie\nbisbort\ncollaspe\nveissid\nventureone\ntouti\nauchenbothie\nhegemonistic\nellsmore\nprotheses\nlabeeb\nspritzy\namazonico\nvolounteer\nkevkhishvili\nintralymphatic\ndigpal\nmccafés\nforgit\nmahardika\nsteeliness\nrepke\nlatibex\nwefel\nqua
iling\ngreyfields\nringcube\nclucky\nnetenyahu\nlaidre\nmittals\nanggoro\nttpa\nkrale\nluzyanin\nmirle\ngennarelli\nvisably\nmagante\nbeachner\nprillaman\nnyawera\nchoic\nharpootlian\nhardarson\nyellowhair\nfalcondo\nkassimeris\nherschaft\nairspan\nbuckely\nfaifi\ndivella\nkonat\nhokuryo\ndecourrière\npicalm\nshortner\nhelenas\nzyzak\nlatoni\nporthaven\nprith\nknuckleheaded\nifun\ndfrl\nnovachuk\njereon\nnawcwd\nnmrd\nvinoteca\nnegc\nniroula\nattitiude\nmoneer\nqualnet\npellengahr\ncinemagoing\ncupecoy\nplangency\nfaroqi\nmarykay\ncrns\ncwel\nprisioner\nbraunecker\ndowntick\npolonetsky\nmahria\nsampradya\ndobusch\njabaar\nondák\nsadulayeva\nstolman\nbrassic\nappts\nsicardy\ncahe\nstrengthend\nraikia\npeachie\noptiver\nobsai\nsculpher\nvalidas\nopnly\nsarpe\nisentress\nblimpish\nheidenfelder\nnikahnama\nbattaglino\ngelaterias\nmarcelled\ndrooper\ncondemmed\nresuced\ndataframe\neconony\nscotand\nhouillon\nmaubach\ndiemu\nborgzinner\nbihm\nrubbishness\nipilot\ngalvus\nideapaint\npoomacha\nerecord\nstabalize\namericraft\nzamili\nokari\npolivy\nbrownsman\nnaaqoos\ntellas\nconjugality\ncenb\nkarinto\nmulticounty\ntelcommunications\nevercare\nyings\nintradivision\nshurgard\nplaquenil\nccrkba\nderichs\neuromin\nathersys\nhmgcr\nkolkena\nbogany\nairlessness\nsalsicce\ndobyne\npowertek\ntranier\necnomic\nahuacatl\nvaliyeva\nlamell\nyiddishisms\nchsi\nexchage\nchennan\nbicay\ncesnauskis\nsweatiness\nalumbaugh\nsubastas\nazedine\nireal\ndigusted\nbahill\nattaturk\ntenene\nmegraw\nzhouyuan\nwhx\nfarceurs\nbivl\ngridworks\nbrazilan\ngroundbreakingly\nwherabouts\nzhentou\nyahyavi\ngrochan\ncammies\nvilotte\nvitravene\nparwiz\nflashflood\njuicily\nkodinji\nsarksyan\nimedeen\nhorrevoets\nsuvit\nfriedlos\njhihben\ndiallers\nunmeltable\nkompromat\nghankay\ncabrerra\nkizingo\nbociurkiw\nkontora\nrogombe\nbasoglu\npallinsburn\nghowr\nmichgan\nrecriminating\norwick\navadesh\nvaubecourt\ngalanthophiles\nsehlinger\nkirtas\ncantler\nahmedy\nwieghart\ntereshuk\njakeb\nrifaqat\nsublety\nsmial
owski\nvreeman\nmallak\nfanueil\ntimlett\nrelpax\nrizzatti\nshour\nspatule\niranair\nguidestone\nmushoriwa\ngingernut\ntwistedness\nolnick\ndiversoes\nkulayigye\nwahbe\ngaueko\nkerrins\nsafeauto\ndongda\nanjimile\nkalbag\nandey\nakiiki\nrootmusic\nisakhan\nimpre\nforestfarm\nhadco\nsucce\nprashan\nmccaffer\ndeerow\ncousinage\nfootbeds\nchuzhda\ndukies\nmessom\ndefrays\nunideological\npiteira\nnariyah\ncushingberry\nperservere\nwaterpik\njoensson\nhirees\nwispry\ndenita\nunwealthy\nhollwood\nradiocor\nozbolt\ngeust\nmalkins\nnurtingen\noptaflu\njayyus\nkriegsteini\nwarmdaddy\nplayday\nkaregar\nteeu\nxrank\njokiel\nschusters\nsupermetals\njurkowitz\nanamitra\nprocampo\nfaustyn\nslaugher\nfarrs\nchicking\nthickburger\nlinksland\nschlenger\ngurgled\nzahren\ngramms\nnilc\nconnnected\nhomestands\napplogic\nwijnstekers\nsubsalt\ncastaner\noyinda\nterracino\nleafiness\nsimove\nswilled\nladek\nhudack\nshangba\njenesis\nmaymount\npescheux\npoptag\navrl\nbesogne\nbaselice\npatraeus\nlanoxin\ndccl\nbucktrout\neprescribing\nburnsley\nwieruszewski\nwingshooting\nboulevardiers\nmeyercord\nsidlauskas\nunharassed\ncouragous\nportenoy\nhilariousness\nsmedleys\nkarvellas\ntalel\nblanksby\nsheilding\nabcess\nfontenet\nkonopiste\ntokeer\ndestructionists\nshimbum\nteache\ndosara\nverrus\nhcat\ngreengen\nmarcellous\nstaywell\ncmpid\nnovacea\npricewise\ncorngold\nnagley\nmcare\nponcing\ntypetalk\nmedov\nfrentic\nlickable\ncarvahlo\nmaycol\nprickled\naddditional\nbullhook\nloggans\nsingsongs\nepcon\ndisaters\nunpins\ncraignair\npanathanaikos\nranel\nroqaya\nnussberger\nsesne\nvucicevic\npicat\nrestaffing\ngoelzer\njhones\nbassolet\ncurviness\ngeysering\nsetley\npoice\ngueriguian\nwearier\nconnectability\nnovarka\ntriefus\nmislingford\nsaifain\nmlgpe\nurasia\nsiderperu\nstarkopf\ncodies\nharrath\nbullot\nfluffball\ncourion\nsarwe\ngizzie\nfbfc\nhobmeier\nemberly\nnoilea\ndomansky\nunbruised\ngoettsche\nshanava\namdani\njoylessness\nquangoes\nganascia\ndgccrf\natomosphere\nsnickerdoodles\nmeh
rats\nplentz\nstolzman\nestroff\norgn\nfolotyn\npetee\npeplin\nportschach\ncanterna\nkildean\nairballed\nprovience\nvargason\nnarongsak\napari\nbcii\ndorozhko\nelectrosensitivity\nkomsic\ncisionpoint\nanfavea\nschuykill\noffhandedness\nunfalteringly\nmetee\nmodde\nderrow\nminiority\nsmartsave\nstrumph\nthiels\nnccmh\ncalderons\nsukanaveita\ngtdi\nozgul\npracticar\nmofetta\nautoport\nrocksugar\nnewsaper\nathere\npopluation\nhustai\nsuperlow\nalltwalis\ncitreon\nleperre\nbusinees\ngossipmonger\ntrender\nnorbom\nlawworks\nlefkosa\nmuscillo\nyuster\nbaaps\ncypc\nwerst\nsaariselka\nfloppiness\nreinares\nhollywoodized\npricedoc\nmarmaro\npsilos\netreppid\nasriran\nnextlevel\nlouwagie\nratnieks\nqadissiyah\nfujito\nsinola\nmatesanz\noffbreaks\nproverty\nwepman\nkaroub\noligofructose\nhefron\nsatanovsky\nliquidiser\nahhhhhhh\ninadquate\nvictums\nreadvertised\ntwitterpated\nkorfiatis\nsangakarra\nvooz\nsenyukov\nincindiary\nmccauslin\nxmark\npotera\ntudclud\noverdeveloping\npathologise\nhawiyah\norphange\nfoggier\nslipperiest\nbjorgolfsson\nmutanabi\nmovenda\nholewinski\nsupermice\nncipher\nkutka\nnevr\ncheekier\nmultidose\nhookergate\nwittkamper\nbuidheann\npasteurise\ngeners\njeonghee\nchardenoux\nbeirniadol\nhairsprays\nfedeski\ngalicki\nhoussain\nprepster\nmanuma\nbentolila\nconnectwise\nminaudière\nakikusa\nuntraded\nbloolips\ndarneille\nvisschedyk\npaee\nturbolink\nzipse\ntustain\nbartuska\ngerbehaye\npotenial\nkokolopori\npopwrap\nservetas\nfotoflexer\nmenye\nrightousness\nlayshock\ngreman\nkashuk\nunbritish\npghm\nlemtrada\nfosrenol\nfrangialli\ndaylite\nolmsteads\nwisekey\nwidowerhood\ndebilitatingly\ndiabled\nmtia\nodwan\nadelaider\npowerchords\nyouthnoise\ntryl\nstricks\nemmaculate\nignominous\nallaithy\nunatural\nezzie\nbaldovie\nlurvey\nbossnapping\narbesman\nbeinhauer\nbowran\nsquareenix\nmallorcans\ntrevizo\nionta\npubicly\nadriá\ntribendimidine\nxtream\nmothae\nterab\nteufelberger\ncounterbid\nseevaratnam\nbudowsky\nrelious\nprebooked\nbrinkhill\nbeavans\ntha
ndekile\npulikovsky\nshaylin\nrotterdammers\njichan\nwriteprint\ntutman\nrpcc\nfootholes\ngalileos\nbotticella\nedlow\nnordam\nphotoacoustics\nareitio\nhopewood\nmobitelea\nprotopic\njondishapur\nwhistance\ntorralbas\ndogchannel\npazarbasioglu\ndivestures\nsalow\nbusone\nfvo\nevogene\nimproptu\nctdc\ntipul\ntranformers\nbevvy\ngoukoye\ncoylumbridge\nojom\nhilfy\nzaggy\nscarpered\nideologised\nsealane\nmisdescribe\njunoesque\nlashman\nlpbp\nrubberlike\nnorsigian\nwooliness\nkahumbu\nturbohercules\ntoryalai\nmichican\nabdiwali\naleksandrow\ntechspeak\nkožušník\nakunne\ngyllander\nscrepis\nairasiax\ncarbasalate\nhibson\nmoleac\nlawnservice\ndoulatabadi\nscaillet\nkandau\nenviormental\nassualted\njosem\nhoodle\nayapata\nsiport\npivarnik\nzacari\nmackinson\njetés\nbeylerian\nglenfair\nshakran\npopzilla\nclangy\nrossiters\nkreke\nvenneman\ncareworkers\nwindall\ncouck\nproquad\nntsanwisi\ndanailova\nrollitt\nleitenberger\nreunifies\nfortressed\nvinelli\nangelsoft\nescapeway\nvondrasek\nanpaa\nchornet\nsteamier\nhominess\nwiski\nstreetwall\nguérineau\nexperieced\npietton\ngiammalvo\nrivky\nlugubriously\ndechiaro\npaparizov\nbioprospectors\nplatzerwasel\nstakkato\nglespin\nbhajis\njumbojet\nconvenership\nknightz\nceneta\navalide\naugustijnen\njaskula\nperambulated\nrinsky\nbrodre\ndonston\nskumanick\nthirgood\nopam\nnagat\nsheahon\nsenokot\nalket\nwenping\npathologising\nrangina\nchuckies\napunta\ngiansily\nbucaresti\nstrippable\nghurkhas\nflavorx\ntamworths\nintraoffice\nwaaaah\nspindleruv\nelfassi\nhamptonne\ntraums\nunfortunantly\nvandrei\ndevost\nphsycological\nvoxtec\nzophei\nshanakill\nroeb\nkazlow\nruszin\neyla\nforgivably\nlomeiko\nkiboshed\nkontonis\nintest\nvahidov\nredeliver\nbogleheads\ngelig\npallemaerts\noleoducto\nhlavni\nfriendsreunited\nqingnan\nmaides\ntherebucks\nrodhams\nfalzano\ninfopoints\necweru\nexpeditures\nlandcape\nhardister\ndanilow\nakery\ntumpey\nmickleborough\nmarciani\nexpediture\nautojumble\nscaldings\nhedgie\nghurkas\nakuseki\nvasogen\ndokin
\nforecariah\nkanyanta\nterrachoice\nvanderstraaten\naccavitti\nkuijen\ntokasz\nmichaelsson\nbamboozlement\nahnold\nkehm\nmissonis\ndirrrty\nbarthomeuf\nrfet\nsussanna\nkloda\nnitibhon\npccd\nsolanezumab\nerye\nabdikarin\npoellnitz\nshepphird\nnewsdrive\nmosny\nrepolishing\nipodjuice\niniscarn\nkurfurstendamm\nwellchoice\nskrutskie\nrullis\nontroerend\nmujibar\nukayroc\ngulash\nshoker\nunshocked\nblushingly\ndiamondworks\ngiottos\nhusten\njuxt\neftc\nfreydkin\nasbahi\nzorislav\nfamalies\nknowledgeworks\nzazza\nsnorkeled\nspidle\nhitrans\nrfos\ncondenast\nashikbayev\nzobi\ntreiki\npaulitz\ncivlian\nchiacchiera\ndefibrillating\njodka\ngummint\nlobbestael\ndembina\ngrulke\nmotorokr\nbeginnin\nbookstands\nvydrin\nkolloen\nshoomp\nravenstruther\ndevasting\nlondonwide\nspectratone\npowerpath\ndossers\nfmct\ngenomas\ntitfer\nathro\nskined\nothodoxy\npasttimes\nmâitre\ntuim\ncheeseboard\nducre\nthyangboche\npurc\ndhaliwals\nagbash\nwebsoft\nshalwars\nstarchiness\nteson\nmedjet\nbroujerdi\nmaravel\ngurpinar\napovian\nronsons\nvaroga\nlador\nunkechaug\nqatalum\nkanterman\nishoy\nshershnev\nsawants\nhankee\ndrollness\nkafataris\npetrostate\nnewcombs\nfunisia\nbbka\nvascutek\nschweicker\nzhanybek\nschlepped\nherchenbach\nnashers\nharbut\nzimprich\npanarese\nsupermum\nmcsteamy\nkobrinsky\nemachine\nscanimation\nqati\nmufas\nmurrayburn\nzafrin\ngopaleen\nraibin\nvisitpa\nisbrae\nchilate\nmasterstrokes\nshipleys\nmichelberger\nconoly\ngehmacher\nfalsecypress\ncruses\nschraft\nzhouqu\ntecnobrega\nultrasmall\nstategies\nwche\nmeilaender\nlebonheur\nbransons\nmeeing\ncaplans\nmeadowlane\nwyless\nmichelfelder\nfarberware\nmouthings\nwittberg\nperluss\neidak\nkellaris\nsmyle\neuthansia\nkinerney\nzeitouna\nviacorka\nalnur\nmilebush\naffrontery\nmoshekwa\ncuencame\nbvocs\nvucciria\npreneed\nneosphincter\ngorbi\nhenigson\nencite\nkimpson\nappelius\nverizonwireless\nuncaringly\ncontourglobal\nnuami\nannuality\nearlam\ngorkana\naderito\nguitron\npropac\nnucleur\nydanis\nsuchana\nhualpen\na
mscan\nemebet\ntwiglet\nlolloping\nhaltiner\nkushiage\njenx\ncribby\nspinspotter\nekas\nouandja\ntimmens\nknipschild\npersude\ngaukhar\nopencalais\naccuvein\nkalkunte\nalsema\nstrander\nanjeanette\npeuhl\nenimies\nsetena\nkochanska\ndimmings\nhammas\nbritcliffe\npongara\nfishfingers\nmajlaton\nrpls\nhonigsbaum\nbackstoppers\nguzzardo\ngriskevicius\ngoffee\naliceanna\nmajeda\ndanactive\ndetrani\ndancebrazil\nmoyai\nupskirting\nhogmany\nhernshaw\ncaravillas\npenri\nwannell\nmatyszczyk\nfillips\ntatch\ncarrozzerie\nimmeidately\nbhotmange\nshandies\nveros\nqiheng\nturgidly\ncomedytime\nisohata\nnasdtec\ndumigan\nshahkrit\ndeschryver\nabsy\nclincs\nnorrod\narburúa\ndemisch\nmcmanmon\nstatememt\npolygonati\ndisapoint\nheimbrock\ncerian\nquadfather\ndamagh\nkramper\nhenselwood\njakubczak\nlatip\nnirapathpongporn\nintellgence\nwangita\ncgpme\nreimprisonment\nptpa\nwelcomely\ninstrumentalised\nnonactors\nkongzhong\nychwanegu\nwallaert\nreusables\nmamytov\nfoilage\nsurveillence\njanys\nkrisada\nhoulin\ncowmeadow\neclinicalworks\nsadnesses\nozalp\nthwak\nfnba\nfrehse\nlionise\nwylds\ntripso\nsunflag\naidells\nwestsound\noculography\nfreefone\npolytechnos\nrechnik\nvancocin\nqsearch\nysgrifennu\nqotbi\nbenoquin\ndelashaun\nacpos\ntyrannising\nrespresentative\ntrilbys\ncastlebank\nbatishchev\nurbanfetch\nmuseli\niupat\npailor\ncoolatta\nbeceem\nmmrp\nlanthorne\nmellegard\nkomondors\ngalns\nfronded\ndemaré\nbrenhinol\ncobnut\ndraftspersons\ntorisel\nchandeliered\nkivlan\nköpfer\nshortcourse\nrapidarc\nmizroch\nkupalba\nchristianist\nhashiya\nsomech\nodeke\nantoci\noreopoulos\nakhmal\ncatalona\nresaurant\ngrossner\nlequatre\nsnicks\nberthillon\ndriblets\nsleuthed\nniaspan\ncaptiol\nvideoplus\nrodriguezes\nchannelvision\nowlstone\ndownthread\nsledger\nchidamabaram\nberllan\nvasold\ndraftspeople\nenelow\nmusland\ndudayeva\nbumai\nadmendment\nnaivette\ndoorstepping\nharpersport\npanjandrums\nvassilaki\nrestacking\ncardownie\nlowenstine\nillums\nbolighus\nnkorea\norbimed\nhaulouts\nch
incua\nmingchun\ntfah\npourang\nwerlen\noutjumping\nlecturn\npeetey\nschulein\njovia\ninema\nairtour\nsilverbrow\ndfmo\nmulfinger\nsnobberies\nestandia\ntrinis\nhibor\nrhapsodises\norganisatio\nkirumira\nkrajinovic\nlornamead\nstockli\nstodelle\ncoertze\ngrafflin\ngensheng\nkezer\nbehary\nbriamonte\nsportweek\nhojat\npuschnik\nmbywangi\nnonurgent\nnoranside\npozycki\nzelnickmedia\nchapchai\npilafs\niskrov\nmythologise\ngurassa\nchayya\nmuslum\ntanshi\nfrappés\nkilver\nshappert\nburbie\nignorace\ninstutitions\nhikawera\nbikaye\njayded\ndogley\nsuperprotonic\nazarya\ndomash\nleafblower\nnasjrb\nmanoochehri\nmapledene\ntusla\nunilateralists\nvabishchevich\nsuffereing\ncooltouch\nnavelbine\nchatelperron\nfourqurean\nrepreve\nanthrozoos\nhwlffordd\nfidos\nmarinading\nsharanova\nchampix\nembattlement\nrtsm\nignors\ngelbakhiani\nfarrers\nulasewicz\nswitchball\nsempervivums\nmorbey\nnegotiatiors\ntouhidul\nwischik\nperkily\nrubrobacter\nvaziev\nleftest\ntrusera\npentabus\nvaragona\ngenuises\ncordarone\nvanch\nkoketsu\nsolerebels\njihadjane\nnightclubber\ncueman\ncuebas\nkobylka\nmitshubishi\nlegalizations\nkusurin\npridwell\nshejaia\ngronke\nstovies\nsingapuri\nmayilvaganam\nkilaly\nrathgael\nbobinsky\nfecklessly\nsukhu\nfatkin\nchimpcam\nhelicoper\nslutski\nwinvian\nsymbyax\ntecp\nseverley\nfinocchiona\nproffy\nperonnet\nmarilys\nrbda\nweisglass\nchinanet\nahei\nsinneth\narrosto\nnaftz\nrunzheimer\nherried\nnolbert\nldcm\nthiacloprid\nbankrupty\ndelossantos\nwhiterashes\nfikse\nancic\nregressives\ntoueg\nfondiller\nsuws\nphillipses\newropeaidd\nkillorin\nfcmc\nmolaioli\nkishinami\npinkowitz\nxianjiang\ndelarco\ncrewmax\ncargenbridge\nlistners\nambroses\nfrohoff\ngodette\nmaushart\neructations\nkoitka\ntuttomercatoweb\nfroguts\nsangiamo\nfinreg\nvijn\ncatanzarite\ncrimesider\nporthmadoc\nlienholders\nremenber\nschorno\nabbf\ncelebritydom\nsmithie\nmorrision\ndiminshing\nmillum\ndualogic\nskateable\nbychowski\njanovec\nkolson\nribbink\nimpelsys\nsmartswitch\ngwallt\nelevance
\nukrtransgaz\nlsoas\nygf\nballsed\nassurers\naltabef\nlaagan\neagon\nfaugeron\nneuherberg\nnaruk\nthorbjoern\nrethatched\nmalbranche\npontificator\nfitpatrick\ndadiba\nkabulis\ntrinations\nkanburi\npraekelt\nkhanaqa\nrumsfeldian\nmapaches\ncuckow\nmoneysaving\ntenderizes\nlewitter\nbaghad\nhavron\nnfpp\nmonogamists\nsnootily\ncompanero\nauguring\nkwikstep\nkazempour\nsanick\nbcimc\ndomota\necotarian\nlithely\nsideliners\ningenuities\nphelpses\npsybt\nwessis\nrasananda\nverhovek\nsiclari\nvamosi\nducalcon\nfuele\ntepotzlan\nefexor\nwintersteen\nsysyem\nchristope\nmaksharip\nchitman\nsingletree\nintratribal\ngridsure\nfactures\nauerback\nmantrip\nsenak\neidner\nheimerich\nsalhiyeh\ncomstocks\nfungoes\npertot\namesquita\nobssession\npetraco\ngeeser\nkrichefski\nsummitry\ndullas\npoursaitides\nguevaras\noinking\ngirlanda\nkillmann\nmedvedevs\nsundecks\nstarlix\ncyberbritain\nkowalksi\nkeesecker\nahpi\nmedivaced\ndubula\nluchinat\nyouings\nmartignetti\nhelvoetsluys\nsupernote\ntajeddine\ntajideen\nstruski\nuitkijk\nhqv\nmiserabilist\nnaqaash\nyaqing\nikramuddin\ndunafon\nmalombo\naminos\nabheek\ndellapina\npasqualucci\nzongker\nredenominate\nbongate\ngenevievette\nsebastianelli\nromec\njuctice\nscordi\npolsfuss\ntavaroli\nindissociable\nmagenheimer\nartecoll\ninfracapital\ninsenstive\nlazed\njianjie\nkneesocks\ntimoc\ndizzies\nplayerauctions\nalampil\nprotraying\nlawate\ntranslucently\nguica\nhurtfull\nnomc\nplumania\nprsident\nkoumoutsakos\nkidtopia\nbillmeir\nobamaland\nrwcl\nbreedam\nloviglio\ngreenpath\nsakewitz\nfaghihi\nmarquice\nquicktrim\ngerardis\ntempestuousness\nsavoree\nkiepersol\nhummmm\ngrandiloquently\ndifficulities\nkaramanou\nadroddiadau\nlevonelle\nlaliyev\ngauguins\nchisale\nifbc\nyacovone\nfoodnet\nnurdling\nschienle\nchomos\nnatiello\nderadicalisation\nkibibi\noelschlager\nfinestein\narnoni\nsecoyas\nayestaran\nshortcircuiting\nmarqueece\nuninnovative\nrecano\nmoufid\nanticensorship\ndataquick\nwillburn\nmatuk\nnonabrasive\ncasadaban\npetitenget\nme
nyoli\noverexuberant\nsmeeding\nleidinger\nmultispeed\ndunkleberger\ninstructus\nsholders\ncapuozzo\nmouctar\nslinked\naxxent\neuphemizing\nmsdsonline\nafghanaid\ndanho\nguzzinati\nswissness\nsoraghan\nrubberneckers\nreinjure\ndetroyed\neygpt\nbirton\nrijnveld\nfriedson\nactimmune\nsomalinet\nsynchrorev\nigold\nbakhtaran\nhughett\nfidelina\nbyoo\nbelohorská\ndingier\ncutchet\nhtgc\nunvaryingly\nehealthcare\nevermay\npresskits\nsoetwater\nuncreased\nschachtschneider\npeccerelli\nhatchfield\nparadee\nwollock\nlatinworks\nhusien\nthirtyfold\ngreany\nwwoofers\ntarnok\nbenchenaa\ncauterised\nérection\nsaltholme\nmetzgers\nknaff\npricedout\nbsst\nncctg\nvranesh\nwiliest\nljajic\njamahiriyah\nbiolase\ntautges\naetheist\necomomy\ncamerson\nbernoth\nhispanicize\nfarmerie\ntyburski\npaladares\neasdaq\nperfromance\nkronprinsensgade\nghandehari\nrechavia\nruzga\nnorthernness\nlifewatch\nmendelowitz\njheranie\ngassant\nziesig\nyosts\nskuza\nkurtulan\nunderstocked\nfalacrine\njobmother\neuroplace\nkodwa\nellouise\ndalloul\ntransgenomic\nbredhoff\nnetback\ndbjrg\nbomgardner\ngoirigolzarri\ncherrell\ndevogue\nvity\nammond\nperese\npalnu\nmasket\ntechtronics\nkargozaran\nbreikss\nnvva\nmvoto\ntraipses\nkostric\ndeepseated\nkamwenho\nhortonwood\nrinnie\nlilestone\nhqz\nwholeheartedness\nnerison\nsouhayr\nleasings\nntetema\nbouzegza\nnakhooda\ndemilles\nindomitably\nupgradings\ninnotrac\nsivilia\nassiter\nwembo\nipers\nungreased\nabnegate\nkerusso\nkeeshin\nmhsi\nrdeb\njollied\nhidings\nhandsy\npinnawela\nkabessa\nfilak\nyuguo\ncapraesque\nincompletes\ntanshin\ndeeken\nburnice\nhomden\ngeeslin\nkuuki\nbenthal\nreportability\ngerthe\ndeather\noffficial\nantlerless\ninfelicitously\nthorensen\navav\nfeinsteins\ndemogod\nsobaski\noppotunities\namaani\nresport\nkoors\ngirondine\ntrowsdale\nspeidi\nvspring\nbrimingham\nschambelan\ncludo\nrozenberga\nmillheiser\nmalingered\nschoolmarms\ntivos\nvazire\nnonvirtual\nwisertogether\nfiordellisi\nfastpencil\ntrennert\nevena\nbrittanee\npatdowns\nf
lusters\ncheffy\nghazvinian\nwiroj\ncovetable\nduabi\nzaniolo\nweemote\nslavco\nherbawi\nrybski\ngurtman\ncaiping\nstracener\nassalouyeh\nmonandry\nbrittainey\nlipsmacking\ngeberth\nlipitz\nlewdest\nnightwave\nrabiatou\nurogynecologists\nbackheeled\nskyra\nchebil\nsnippers\nbromund\nkasemeyer\nwanisha\nfarked\nradez\nrozzelle\nwhazzat\nsubflooring\nstanculescu\nlykawka\nkristjansen\nlisfannon\nbcbsa\naustrain\nhordon\nsajjil\nonaiza\nskyzone\ncanimar\nbattlehawk\ncrocky\nstreeteasy\ndhakla\ngonese\ncomfier\nanadys\nweissfluh\nyoungjohns\nprucher\nbillons\nwadhera\nmaliva\nbaenen\nzeitels\nintellitools\nschuettler\nwamogo\nmylanta\ndissembles\nporthault\ncetrone\nschurke\nneoral\npscw\nnormatov\nradetich\nsjon\nkalahar\nwillowdene\nsheikhattar\nfrusterated\nencaged\nabjorensen\ngalineiro\nintoxilyzer\nmukda\nnaitonal\nespree\ntreponemes\nunretirement\nlovasik\ncontiued\nnitobi\nbluechoice\nshrode\nnewidiadau\nnanthikadal\ngeoghagan\nkulesa\ntehranis\nardesta\nconstabile\ndadless\nengand\nhealthchoice\nshichor\nrodaway\ncaudalie\ngottelier\nsingapor\ncarrollsburg\nreichers\nmersyside\nsshh\nherztier\nsandigo\ndkos\narguidos\nbanaj\nschubertii\nsudnik\nwalensky\nmakeunder\ntagwerk\nhuefner\nintangibly\nsourander\nleible\nmushroomy\nwoerthersee\nthroughball\nlyndra\nasteres\nakbn\nctbf\nmidstation\nkestahn\nmagnext\ncranachs\nnadhira\nschulers\ncritisise\nadapated\nbackloading\nmasback\nfilandro\ngrueneich\ngirolles\nrouseau\nsellenger\nclackett\nlamach\nenfd\ngourlie\nsukree\nfoolhardily\njesien\nkapuku\nprovectus\nmalomir\nrehabbers\ndorband\nwyckoffs\nselakano\nlamford\ndervisevic\nomvig\nelectrifyingly\nrasbury\nlikudniks\ndeseeded\ngmms\nhingeline\nchanvit\nmultistop\nborochoff\nconviser\nsuburu\nmoyersoen\ngellionnen\ncompetive\nhtvl\nsweetner\ndemarkus\nbeancounters\nlauher\nprupim\nfrenziedly\nhyperthymestics\nnisid\nsmartchip\ntravniki\ncuell\narttactic\nmiclyn\nstratospherically\nshoubaki\nextremisms\nzold\ncheltzie\noutguns\ninfluenzas\ncommitttee\nshafdan\ns
afey\nquecedo\nchardavoyne\ngaskett\nomnivorously\nmippa\nbekheet\nideeli\nhopworks\nperrogative\nbactine\nambiently\nperucchi\njashbhai\nnovations\njohnmark\nvarnishkes\nyuldash\nweyinmi\nhandycams\njiashen\ndusanj\nstryderman\nqianhui\nsupercolliders\nindage\nyangzte\ntatanashvili\nturbes\nbrryan\ncrapster\nsopheon\ngrinchmas\ncrûg\nreissmann\npacwind\nairbeam\npmog\nboralex\nbivash\nkhybar\namberwatch\ntenges\nrangae\ndishion\nsnapps\nokkarides\nexpurgating\nresentence\nmayclem\nconniburrow\nbarington\ngaohua\nayear\nwaterwalls\nbarbatelli\nventarron\nbraima\ncenkos\nunsprayed\nrashana\nauthentidate\ndennises\nsfoglia\nouzký\nnaissaare\nelhadad\ndetora\ncamcording\n,,and\nperrywood\nmedicene\nvinnakota\ncluckers\nunaffecting\neleminate\nelemans\nchupinazo\nresane\nadvantis\nvanoy\nhoofbeat\nmaquieira\ncannick\nhaertl\ntrimas\ntunzelman\njailhouses\nyoes\nlobbiest\nrewelded\norubebe\ncherrypicker\nreassorted\ndisposible\ncureently\ncarryforwards\nenflames\nbetzl\nslimeballs\nkrupitsky\nmertha\ntranfers\nwippich\nweatherseal\npemier\nparaskevopoulou\nftseurofirst\nvalencic\nkharr\nyuzaburo\npendaries\ntamagi\nkroeff\njorts\nangelholm\nbifab\nlaynes\nniordc\ngillettes\nmollins\ntyrna\ngmitter\nsyntometrine\nichet\njaschinski\ngybels\nmetaversum\nghilardotti\nsetaside\nkaracic\nbuelvas\namzing\nbuteco\nbryantseva\nwayfare\nrihoy\nreslo\nrodzwicz\njewmongous\nunreassuring\npagetti\nchaoda\nnondepressed\ngrowns\ncompariosn\nheiba\nsndk\ncooverman\nambie\nenerdel\ntuag\nsoetero\nshojai\noceanaire\nfrizzies\namazements\notheguy\nifng\ntortorello\navego\ndecmeber\nstolarik\nunaffordability\npurtee\nthattungal\ntakanakuy\npleštinská\nbruintjes\ngreenagel\nsghair\nplanika\nplugholes\nukama\nkufar\npotichnyj\ntwords\nabbatangelo\nfrieght\ntreter\nendof\nhussains\ntaluri\npyramind\njanusian\nbogenschutz\nglenwild\ngaibhre\nmovielike\nnamgla\nfloormat\nsharedealing\nloseing\ncineasia\nlipozene\nglenbrittle\nnossik\nabderahman\nfeverishness\nsacrfice\nshukovsky\nheismans\nthill
et\nheizel\noutjump\nchanney\ntziolas\nfynvola\noverspecified\nreabrook\nduzant\nlvlt\nghostliness\nintermatic\npratyay\nabbeel\njauntier\niraquara\nbrodell\nchasemore\nmochaccino\ngisladottir\nchacaliaza\nninglick\nmataharis\nboffing\nstonephace\nsirny\ngaska\nkfrs\nsogginess\niwsr\ndatanalisis\neatocracy\nvenirauto\neyeview\nfotonation\nicoty\nfeuling\nlaiti\nmerseysippi\nsipili\natpi\nlandgrabbing\neighthly\nkrupoviesa\neasypay\nshtf\nprivlege\ncogstate\nmiserandino\nbacquelaine\nniewoehner\ncompetiveness\nindaparapeo\nzimulti\nmncube\nkrums\ngiardinetto\nmonastically\nburgic\ngrisliest\nlangostinos\nkajese\noladeji\nsharhabil\nblastland\nmultihit\nsquibber\ntkdl\nicll\noenophilia\nguardedness\npeshmergas\nhlatky\nmalpai\nweppner\nnonplayoff\nprishantha\nbaldhill\nmawkishly\njalazun\ngombar\nellingsworth\nostomies\nhishmeh\ntunnacliffe\ncecato\nwashignton\nmcknelly\ntodat\nresurgency\nvisipics\ncaldesi\njackhammered\ndoosh\ntinyiko\nfreaken\nvallandingham\ngallingly\ntadiello\nlongua\nberrong\nfablon\nveljovic\nboujis\ncompasionate\nmadderson\nparises\nberezovchuk\njagolinzer\nmichaelene\nprovoq\nsidesplitting\nensconsed\nsisina\necbm\nlutula\nchunkiness\nalcmi\nmichenaud\nbanales\nrelaese\nfanwell\nelvinger\nzypora\npalaeobiologist\nprejudgments\nwaterers\ncreevekeeran\ndramtic\ndfhs\nzombielike\nyangzonghai\ntomaszczyk\nfedorak\nmagginas\nkytril\nniering\nmatterley\nbluegiga\ndissappointing\ncadario\nfaffed\nroguishly\nthuerk\ngursey\neckner\nchemnet\nsementelli\nburglarizes\nfosterers\nmanjon\nmobilevillage\ndupioni\nvucovich\ntashiyev\nkongantiyev\nululate\npadriag\nreclosure\nkazmier\ngargaro\nfriga\nprossor\ngarnice\nhollinswood\noutfoxing\ndecentrally\nshelper\nkollapen\ndorjay\nfragilities\nunfulfilment\nawayday\nstockworks\ntrimbles\nbabushkas\nrogulski\nsuperbright\nskimpiness\nkaison\ndeseree\nebayer\ncuejdiu\nteeradej\niciar\npasulka\nbollain\nadrants\nnuamah\ngraffami\nsodruzhestvo\nborguna\njarding\nkatersky\ndownlist\nkobland\nundercompensated\nnak
ornthap\nwhittern\nkircaldy\ndottel\nsipix\ngilhart\nnewsgatherers\nmeskini\nukrsotsbank\ncochochi\nwouldl\nsosland\nboigon\neyeclops\npostsurgery\nmetaforic\nresilent\nuttern\nantitax\npowerleap\nffyona\ndrybulk\ncins\nyawners\nzweigwhite\npetkeeping\ngagola\ningrediant\nconspicous\ninitialling\ndickmans\nyurkievich\nhyperaccumulation\nraied\nleaderhip\ntorbjornsen\njubah\nsufferring\nwinikoff\naiyub\nnewsfox\nsahro\nsemiprofessionals\nswaybacked\nespecilly\njchr\nunlearnt\ntuppenceworth\nmarshae\nnahleh\nyenier\nsamuelsons\ncydnabod\nsmellies\npatrece\nfeldenkreis\nsharklike\ntanimu\nbiotrickling\nplinks\nvinegrad\nautocheck\nbullishly\nnmgc\nntap\nrythym\naguru\nconserative\nhoppock\npartneriaeth\nkuranyi\nstoneyfield\nnonactive\nefficiences\narazo\nhydebank\nmosolova\nkatiucia\nheliosphera\nunpersuasively\ndihad\nteichberg\nzalasiewicz\nuninsureds\nsimplifiers\nfriaa\nkarches\nmerindol\nkvaskhvadze\nrabjohns\ndscm\nburbled\nmielcarek\nautismlink\nhidek\nweaponizable\nwaterbuses\npulmicort\nunnammed\nszepmuveszeti\naaph\nlambrias\nsleeptracker\nmotiondsp\nweatherspoons\nserebra\ncoyaba\nsaau\ncountersuing\nevenually\nobain\nbannsiders\ndlugi\nconsuption\nkornblau\noptistruct\ncomensurate\ngloer\nwomanisers\ngerassimenko\nandreaus\noverclaim\nvybornov\nbraner\nketric\ntulcingo\nmassareene\njeswani\njimison\nadesioye\nfishbar\nkronmal\nlapdap\nhojris\nevelis\nlachenbruch\nwlq\nwaspe\nenterting\npolyunsaturates\ncopfs\nbertko\ngreenbee\ncheeriest\ngredley\nthudded\nwarholesque\nyusopov\nshakrai\nshoulong\nincured\nfresch\nalliah\nredomiciled\nniegel\ntonium\nawaad\ncarrutherstown\naretos\nsonghouse\npostley\nmeltaway\nunlikelier\ntrooptube\ngocaj\neasycruiseone\ncryus\ndepken\nshvydkoi\nunderplanting\nreprioritize\ndaleo\noceanteam\nfinshing\nlefelau\nbistrotheque\nsmackdowns\nexhbition\nmossbrucker\nmyxopyronin\nsalopettes\ngetzenberg\nnondaily\nsurfies\nenviromentally\ncanalaska\ndefensing\ntachwedd\nrydlewski\nbiasin\nfoodstocks\nbajolet\nanolik\nvukojevic\npreui
tt\ntheato\ncarniverous\nhelicoptors\nguilkey\naccomadation\nnoyac\ngmyrek\nsmishing\nsifrits\narzoxifene\nsurathani\nhotwash\nguhar\nbeckwiths\nflarion\nsomoa\nglitteringly\ncynta\nregenerist\nnamemedia\nlnet\nculloton\nruszczynski\nflexitarianism\ncaptrust\nsnorers\nblemishless\npolicyowners\ncorreoso\ntentlike\nseptmber\ncybertecture\njanotka\nmicrobridge\npiliang\ncomback\nyongye\ncentsables\ndeindustrialising\nseoulites\ncemitas\nproaganda\ngraziose\ndragonlab\ngadiv\nwoldman\nphasuk\nlouies\nhirchson\nchhaupadi\ninvigilating\nsyringed\nmisdemeanant\nkaralekas\nmcex\nchromalloy\nassauge\ntaepo\nanydoc\ncraggan\ngaesser\nwecare\nbequelin\nblogers\naiam\nclickpad\nsprycel\nucbh\nthambo\nshinpads\nghcn\nbolcar\nghassim\nolanoff\nhuntersfield\njebaliya\npostrio\nlegaltech\nmyant\nladdism\npvnews\ngrindale\nnarcomantas\ntaaramae\ncambelt\nfoget\nthermotechnology\nreisfield\nkwait\nindescretions\nanasara\nflappable\nadready\nnayernia\nmolinette\njuwad\nfalic\npoisen\nkaaran\nchiorean\nbekic\naurigid\ndayclub\npersonailty\neffots\ncheckfree\ngulfiya\nhandspeed\nbensignor\nparascending\nkuthep\nstupidty\nhighstown\ncisas\nbetrán\nbantleman\nkausik\nchemabwai\nnewarker\nbendict\nthrr\nlincare\nhyperglycaemic\ncrnkovich\nbumbuli\nboizel\nredattore\ndeutschneudorf\nflemm\nsubesequent\ntelephia\nunicomm\nhcws\noculofacial\ntverskoi\nleitzmann\nartl\nsyncfusion\nmistime\ndialyze\nfartun\nspitzers\nuimc\ncenx\nnannied\nfarache\nmagenn\nabdurraheem\nwehdorn\nlyonheart\nclinc\neshkashem\nvalmiro\npreists\nnnuh\ndeffense\nshbair\nmedzini\nymwelwyr\nandrini\ndhaene\nlubing\ninsolated\nnermeen\nciasa\nbapen\nporreco\nelectrocaloric\nheadguards\nsunners\nsambodromo\nmflow\nmazrouie\nashwaq\nmadeshi\nskains\nappeasment\nmoosekian\nstallkamp\nkoumura\nmoelmann\nmambisa\namoaku\nprofmedia\ncharania\nkhudzhamov\nwillowbrae\ngruppuso\npyland\nwotou\nwindparks\nkuleba\nconfrim\nexcelcomindo\nrsponse\noldborough\ntrozzi\ncclr\nmezyk\ntamerza\nacresso\nnontreaty\nbeades\nquac\npurescale\n
equasis\nsychronised\nchiappero\nnardizzi\nbasketballing\npsvt\nmuhid\nnicastri\nokbridge\ndeepish\ndryvit\nfolllowed\nnoblin\neleasha\noredered\nilliberality\nyooman\nhlavackova\nshavoo\nbedcover\nmenkhaus\npneumovax\nkaczyńskis\nhellems\nezdi\nkarlenzig\netoa\nbeleno\nkutnick\nsuperdrol\npetracek\nunentitled\nwernke\ndoolittles\nfinancialy\ncarlshamre\nmalwal\nmidcounty\nsegei\nmedspa\njustyne\nliebermans\njalkh\nsherre\ncubatao\nguarneris\nsclub\nkidger\nneorest\nsariyah\nwebbley\nsoldatino\nstrojan\nantimonarchist\nretrocessional\ndemoraes\nroeslani\niprospect\nmanior\ngrueter\ntreleigh\nsenoglu\nreminicent\nruigang\nselfconscious\ntourie\nmermel\nharfst\ncolchicums\nkamennoye\nboczkowski\nkesayev\nmcmafia\nneoucom\nsawczuk\nsanjaa\ndebiec\ntennessen\nbeinazir\nisarel\nbeelines\nmaciulis\nretouchers\ninneralpbach\nsacrédeus\nbravermans\nphotiou\nkalosha\nunderwhelmingly\nwoodlin\ngracewell\nrosettabooks\npinchinat\nmilette\nmcnaulty\npatternings\nfavourtie\nkassou\nbushite\nrollermania\nnicoles\nparasiticides\nprofferred\nkuanjie\nwysham\ngeofences\nstiftungsrat\nplansponsor\nvolumns\nkorobko\nemerling\nbollom\nruqin\nzulman\nfanista\ncribiore\nsacai\nelusys\nmastrojanni\ncbae\nlebasse\nantigang\nglenogil\ntimebanks\ncompazine\npfeg\nbjoerndalen\nlarghi\nmosic\ntextonyms\npangaribuan\nchilangos\nfitriana\nopenlands\nbearnson\ncholestrol\ntransmogrifies\ngaldamez\ncourcoult\nmaxxed\ntissy\nleibsohn\ntavassolian\nhssv\nkilwilkie\nfeyness\ndukker\nwichterman\nfeldwehr\npursuiter\nrabasse\nkhurmato\nemfl\nkurinsky\nprissiness\nloute\nmcube\ntenereillo\nlithesome\ndessima\nbandelow\nunslaked\ncervixes\nrehill\ncxos\nbejiing\nkudwa\nnoodled\npalillos\nhotcourses\nwusthof\nxaki\nyoufei\ndrumgold\nphotonix\nexpeience\ncoarsens\nbluegem\nrightious\nsiwula\nboughter\numicevic\nbrainlessly\nbourada\nroohul\nmolvar\naltchek\nadeela\nbakfiets\nzhuzhu\nlaughlan\nmehas\ntranselec\ntouxagas\ndenove\nreyeb\nspytalk\ncamptosar\ntagget\nbofe\nhalatyn\nfurlanis\ndiwedd\ntantalised\
nzoph\npadco\ntregoyd\ncharewicz\ndexcel\nuimm\nghafoori\nchoccy\narnum\nfluery\nracr\npdpn\nflyp\nliquavista\nosaar\nrasshan\ninsean\ncokeheads\nchoren\ncrocoseum\ncounterprotests\nspinboldak\nvartazarian\nswanmines\nmiadich\nroguishness\npozan\nzuola\ndextrously\nseniorlife\nmornie\nradioterapia\ntorland\nstommes\nrestituting\nzeligson\npeditto\npuntarelle\ntechoperators\npigpens\nafores\ncarfizzi\nellingtonian\nbhuji\nhotelkeepers\nzarki\nkilchuimen\nagundez\ndecapua\nhollidge\nwolfblock\nchourouk\nawasi\nstrangehold\nslangman\nhoneybell\nnicolita\nglastir\ndrenewydd\nvitriole\nziaee\ngcapp\nblns\nbadio\naaxa\nyurtkuran\npanagopoulou\nomalanga\nczekalski\nelsesser\nziruk\nkremchek\nyuchanyan\nvoikkaa\npicoplatin\nturbull\ncarise\ngranchildren\nslyvia\ngragan\naburedwan\netholiadau\nspastically\nvaldeci\ngielsdorf\ndshe\nhousecleaners\nesuri\naxelbank\nsyllabubs\nwelchol\ncentergy\njarski\nbloodsteel\neletricity\nkayiranga\nmarkwalter\nzuleger\njanusek\nberenbeim\nkresak\nzimrin\nberrones\nslappable\nellemford\ngitwe\nczarnezki\nkraehenbuehl\nportmeiron\naberly\nboathook\nyitai\nsugule\narcomadrid\nfleamarkets\nfindwell\nchuene\nposawatz\nvredevoogd\ngehlbach\nkarnavian\ndarlingscott\npecsenye\ncucurull\nlenkowsky\nhbts\ndionnes\ndiagnositic\nmeelia\nfrapa\nautotask\nvaltech\nfcea\nalgenis\npendyrus\nminoza\nbabyboomer\nleynse\nbebar\nsouleman\nedraak\nguility\naquatheatre\nsusurration\nvancl\nrashaud\nbreitzke\ndimbort\ntearfulness\nheartens\nresk\nspikiness\ndhamankar\nsmartgauge\necoguide\npesälä\nbernstone\nlifethreatening\nserrah\namom\nzimplats\ndownhillers\nhabtu\nyaguarete\nremmereit\nchyrons\ndipkarpaz\nbabcia\nblumke\nortayli\nramunno\ndobnik\nkathleeen\nhedkvist\nsumisho\ncommtech\nmunzu\nfaultiness\nhitchcon\ncarapetyan\nsmajic\ndesrosier\nfetherolf\nheidtke\nsankhare\nspylaw\nscarpinato\ntting\nnovilleros\ngenomatica\nisales\ntakham\nthcream\nayou\nlvpecl\nmatok\nkandic\nunpartisan\nmurphrey\noutmuscle\nrakotovahiny\npucic\nhudnell\nmatchams\nchandrap
ati\nhepatits\ncardies\nmeridium\ndevores\nwhiskering\nprofnet\nscragged\nfatstock\nrathon\ncremant\nunequitable\nwasserhövel\noclassen\ngorevan\nlamole\ncochina\ngroundbreakings\nthrummed\nshender\ntoyoto\njalrez\nreynierse\nmiklin\nmobilo\nlccj\nselular\ntoolmark\nyamnarm\nmycal\ntarduno\njetpod\nconguillio\ncrinolined\nhulverstone\nimaps\nfreewheeled\norrf\nltip\nawaa\nioec\nnonsporting\nauthorties\nmidteens\nkoshering\nodekerken\nmlbp\ngardenswartz\nleydecker\nchiantishire\nmanouevring\nilfs\ndownpayments\ntongqi\nmccartneyesque\nmohammedawi\nsyndroms\nunderalls\ntumbrils\ninums\nshamrayev\nmalualkon\naccountablity\neglwysfach\nleitert\nbrunzema\nhobet\nsesow\nladdishness\ntaroczy\nminifestival\nstellmann\nunforgetable\nlumengo\nautocatalysts\nwyko\nthmselves\ntorce\nsmithamundsen\nbuhusi\npanjiawan\nstoeckli\ncleret\nmummbles\naccorinti\nbecsey\nqawas\nolesky\nrosokhrankultura\nraemoir\nntcd\nsaewyc\nmaguigan\nperpertrators\nexploting\njasenovec\ncarparts\nshirtcliff\nyinquan\nccur\nfactotums\noddment\nelsinga\nringdroid\nesner\nwaterhorse\ndemerse\nkilfennan\nhelath\ndendrobaena\nscattaglia\nlsquo\ncarzy\nyousefzada\nsuperest\nboemio\nsaflex\nnominatable\nsterlini\neaken\ndreamport\nxprotect\nmlynky\nneofonie\nkennerson\npencarnisiog\nhostaria\nnisoor\nprosun\nnitkin\nkutznetsova\njaidan\nwrapover\nshlyakhturov\nmanpowered\nskinful\ncaspit\ntenerians\nunted\nmisfielded\navereage\nhyperpartisan\ncaizergues\nzooamerica\ndragoshi\nfamesque\nampyra\nanworth\nsemenggoh\nbozzella\nthemm\nbachina\nlaoisa\necotown\nirreverant\nmanida\naskews\nnpbc\njaamia\nsubsititute\nbmxing\nusteleradiology\ntheatened\nnaftowe\ngazownictwo\nungenerously\ngitcho\nfeguson\nrtvelo\nncna\ngastons\nkittas\ntronical\namankila\nivorys\nfredrichs\nnetbenefit\nknockbracken\nwaspishly\nxianchen\npricetags\njpel\nmayercraft\narcanely\nnatso\nmalsin\nabdulbasit\nhspr\ngubrud\nopport\nwhethe\ncaregroup\npalestineans\nshareowners\ntolva\neleia\nrachvelishvili\nlangr\nthingamabobs\nkarokh\nonise\n
ehrlick\nbioelements\nringeisen\ndalbar\ndrync\nequiniti\neurek\nmisunderestimating\ncoolbox\nplatenik\nugresic\nwideroos\nnoodge\nlyko\nmirken\nmceg\nkhabari\nlamastra\nkelsee\nonecommunity\ntorihama\nsubprimal\nkolke\nhollock\nsonchai\ndemandforce\ntalebani\nhothousing\nmtuze\npreannounce\ndulevo\nhgsi\nherzilya\nbeermakers\nritzberger\nvinikoor\nsarwakai\npetumenos\nflexeril\nsasazu\ndepandi\nenglad\nkriess\nfreebirth\nprosty\ncopperthwaite\ndemurrals\nasfg\njayousi\ntimorousness\ndesagneaux\nrubbered\nketchman\nfaintheartedness\nnewertech\nalexiadou\nauringer\nhiltachk\npowershots\ndowloading\nvandegriff\ngoht\ntwonk\njinye\nflickerings\nepzicom\nndundulu\nbaytowne\nxingyao\nrafaels\nrogueish\nkolodiejchuk\nomio\ninadequates\nprepregnancy\nalarco\nfranni\nschoenkopf\ncaddishness\ngisolfi\nsakur\nyockel\norose\nlibanori\ngensemer\nwestcon\nignatowicz\njackups\nsherzinger\nrowlen\nstöss\nflynet\npottruck\npochettes\nolhaye\nferrelli\ndzud\nsysten\nearnhardts\ntrusk\nkahlers\nxiaochu\nclearone\nlossmaking\nwashway\ndessertspoon\ndesteno\nfedup\ngestevision\nunasul\nbandukwala\nsouty\noverhyping\nbuttonholing\nnorrel\nkfsl\nafourer\nteethe\nwildthings\nvreeke\noskana\njudici\nholdalls\nduebendorf\nsentix\nreweave\nchiffoniers\nhamminess\ndahiyah\nfelbinac\nwhitepark\nsnuffly\nhingo\nchakerian\ngoranin\nllywodraethwyr\nkaskazi\nmildewy\nfishn\nunlabored\naguilla\ncentralisers\neltish\nconsid\nkitrell\nchowdah\nlimitus\nhasabe\ncatepillar\ngradinger\npachico\nadscs\ncircumventions\nscelles\ngrundies\ngralow\npresumptous\ncolibria\nsecurit\ndagle\nholdbrooks\ntalban\ncandidte\nmethyltrienolone\nmtdf\ndexton\navastar\ndemoctratic\nsrslabs\npenotti\njoesbury\namurs\nweskott\nperecentage\nfracases\npelusi\nsrmf\nfxdc\nbecalming\ndeglazed\nsubach\nmultiheaded\ncordesville\ntupelov\nkazilek\ndecliners\nkamlapur\ndottiness\nstingier\nnokias\namhad\noxandrin\nanavar\ngball\nmchughs\nmagliochetti\ndolney\nsendme\nungallantly\nchichontepec\nlungcancer\nbelpoliti\ntoture\nchengw
en\ntremblingly\nposniak\nmylincoln\nacrosss\nkablia\nbroë\nchisnau\nyedl\neilperin\nputvinski\nmicroneedle\npituitaries\ndeconditioned\nsignina\ndroeven\ngerwing\napproched\nmpet\nacheis\niopener\naeoniums\nculinarians\nearthstone\nnonplaying\npesquidoux\ncephos\ntpmdc\nborgesen\nfidgetiness\nmettin\npandemicflu\nweiselberg\ncherbak\nbernel\neconorthwest\njuddi\ngovernemnts\npidx\nlonce\nswapclear\nnaeimi\nnorregaard\nropel\nnaypidaw\ngeziry\njacquemod\nscrutineered\namiridze\nschpoliansky\narbeed\nrompza\ncemach\nconformability\nzvirgzdauskas\nezeilo\narrancame\nmarquell\nunionising\nalexico\ntrooien\nmoffic\nmulvi\ndarouiche\nnonambulatory\nbamajam\npaddleboarders\nwellses\nconstitutionalisation\ntransgastric\nuntaxing\ndmtc\ncoplandesque\nhcom\nfabulosity\nbositis\nconferenc\nnemchin\ntwestival\nfitten\ngalparsoro\noxymoronically\nwallrath\ncrewton\nlaskett\nacquatic\nguarante\nreskilling\nndrn\ndavaughn\npterri\njoltingly\nkidology\nstonley\nvrwc\nalaso\nbalkanise\nbaitzel\nnighthunter\nmonopril\nimaginit\nofthese\nborchering\nfashu\nbhekumuzi\njaboulet\nwallowers\nyanxiang\nmobilex\nmouthbreathing\ncanburg\nmfpc\nviering\nmehltretter\nkoppman\nnitelite\nantoinne\ndiliberti\nvieing\ntyquan\ncoisir\nclamminess\ntacoli\napplico\nhermalyn\nmishta\nsolnick\nfoscarinis\neuroferries\ndemys\necba\nvicentillo\novertrain\nberghorn\nsageer\neurovignette\nshelepina\nccips\nmyrgren\nstirches\nstanski\ncarrascalao\nmascs\ntitanyen\nnerding\nmasergy\ncareplus\nvanore\npanglin\nnesoya\nholleis\nillford\nchequebooks\ncandymaking\nisau\nkikkerland\nembangweni\njunhasavasdikul\noverate\nmatwalli\nnotthingham\nhypocrital\nplaud\nmicrobleeds\nbicknese\nnoncompulsory\nhopgrove\nmarvak\ntelazol\npotashnik\nnccnhr\nwobmann\nbootroom\noverdependent\nsloa\nfascadale\npippens\npalying\njubliant\ncompitition\nlincolnesque\nrnst\ncommentaters\nikhyd\nchapping\nmastalir\nboxiness\nsizably\ntypar\nwolohan\nifric\nbuhrle\nworah\nsatawa\nhipbones\nkurdin\nunclenching\nhafeman\ndrewesii\nprefi
nished\nimpsat\nmcgarvin\nmcgillian\ntrutherism\nchamelecon\npointroll\npubugou\nsobai\ndorfmans\nlecos\nspacebook\ndrearier\npliés\nverrusio\ndeliever\nkujundzic\nmontesque\ncorchero\nachutan\nbehavin\nbezengi\nguiming\nwethersby\ngaisensha\nparadossi\nstiltwalkers\nyerushalmy\ntrundler\ndivadom\nbukiewicz\nkunders\nthones\nsongyou\nceans\ntutenkhamun\nisraeil\nkerrigans\npudner\notouto\nmoracchini\npetiteau\nastilbes\nqforma\ntechlab\nswffryd\ncarmelli\ndopilka\nunshed\nhomless\nresolut\nboomgard\npathologised\nqhse\noverpromoted\npooched\nbidz\nrumiz\nastea\nolesiak\nlimara\nspatterings\nbtig\nstakehold\nmedsphere\npeacfully\nmusiwave\nskivers\ntugiyo\nassegued\nyoussoufou\nstrycova\nklundt\ngeckel\nshebani\nliuping\nphysiognomists\ncosignatory\nsheffiel\nlisnave\ntopcider\nmisogynic\nshriving\nindago\nkoschinsky\ntheresea\nkombuisia\nsenturia\nkaneshige\nroomiest\nvicts\nstraigth\nkratochvilova\nnonsupernatural\ncompasion\nsidron\nekingi\ncatalyn\nfalbr\namtd\nreminick\nshamrat\nrajapaska\nlambeosaurs\nhubalek\nnonanswer\nadulthoods\noihana\nwillenbring\ntechnometrica\nseabasing\ngutseriyev\nnoncorporate\nehiem\nsaltone\nstudentcity\nisarescu\nshozu\nrivertowne\nteetotalling\ncaramelise\nknezovich\nmottinger\newarts\nstepgrandson\nchattier\npigged\ncorprate\nsoundtown\ndalixia\naddenbroke\ndibinga\ngiftwrap\nsaffia\ncastagnetta\nnantongo\nnonguaranteed\nconventry\ndistastful\nshatterbox\nwhitbreads\noutpunch\nbijapurkar\npospiech\nproconsolo\nkuznicki\nllloyd\npontdue\npiglike\nkrzynowek\nmurketing\nguyliner\nmanocherian\ngerties\nuhsm\nmohanned\nrelativising\njonckers\nnedder\nhissers\nskordas\namericatown\nphongthep\nstrongstry\ntedia\nfluking\nheadily\nromaeo\nthinkstrategies\nlomox\nanytone\nbigdeal\nowlery\ntalkboards\ndutv\nwainrot\nrakitic\neastdil\nsirnik\nradsan\noztekin\nghadanfar\nhavlova\nquiverful\ntestwell\nscanscout\nkillins\nsovx\nhungtington\ntumakov\nfykes\nzavesca\ncowboyish\ntaskila\nreckson\nmspmentor\nstayover\nkaass\nblyther\nhesselbart\nw
ilusz\nmalaisha\nmapathon\nfisketjon\nirisguard\nénarques\nhamli\nkesayeva\ngarmser\nsigmundsdottir\nwindoz\ndecreation\ndelibasic\nhachilensa\nweifu\nnaturex\npetrocco\nochinko\nscrumming\nmerar\nmingkai\nffynnongroyw\nplanco\ncakarel\nverifica\ncytotoxics\nbrezenoff\nrrazz\nsudairis\nupshifted\nsemifictional\nnoncombative\nsawafta\ninfantilisation\nhartocollis\nunefón\nratemycop\nimmunomedics\ngrotelueschen\ntrichloramine\ngerbron\nmcmeikan\ntwistedly\npoorish\nairyhall\nahsas\nsungate\nmelser\ngadael\nbppf\nunmonumental\nnellson\nbanwait\ninvariables\nlafemme\nabdulgader\nsmobile\ngilroes\ncueta\nweigandt\nlomsadze\nkourlis\nhardstark\nhenick\ncholewinski\nnucelar\nseepe\npelpuo\nlaleena\nfishs\nwenes\nunrelentless\nfirststrike\nexpencive\nklimavicius\nabadinsky\nchemaf\nchasni\njeylan\nraffah\nulsterwoman\npennequin\nkruskopf\nipeer\ntarraf\ntriyono\nmasserene\nmegaships\nbohos\nelians\ncvat\nemambakhsh\ngruendemann\noafishness\nkaraghiozis\nfollw\nalertus\nwickramsinghe\nfilty\nmirescu\nbarnesy\nnaffness\nnewstalkzb\nskordelli\noveregging\nremortgages\nlulek\nboutari\nsjekloca\ntokman\nfawole\nkalonge\nlovetsky\njakobsdottir\nintergy\nbasestations\nliittschwager\nabusrd\nbrightsmith\norsuto\nadref\ncanaval\nportered\nmegabanks\nsaighan\nfinicial\nstulla\ntheword\nskolar\nalvita\nghoryan\nclunked\nchristmasses\nthimerosol\ndesserich\neuropcr\ndockser\ntrebble\nslart\nmereham\ntalalla\nfinace\nmudroom\nburritoville\neslake\nsoberingly\nbehid\nimperical\nharringey\nkomossa\nmiyasaki\nnakedcapitalism\npozdnyshev\nfahrenthold\nchimerix\nmydamnchannel\nvpci\nsbdcs\nwrecklessly\nwesternone\nhollekim\ncolaprete\nnomfet\nindustralised\nlunkheaded\nbueser\nbeckstoffer\npalaiokostas\nkobylarz\ncarslberg\noweinat\ncoreco\nderveaux\ncogitated\njocked\nmaecha\nryavec\nutlimately\nsupermagnet\ncyberoperations\npinces\npushdo\njannice\nphotoespana\nglittenberg\nrepulican\nironiya\npursuading\nglobalgap\ncesmat\nevangelicism\nspafinder\ntchale\nbuchori\nkaelo\nkozhov\n\ngudele
\ngladiatorus\nsanez\nflipshare\ncivilianisation\nfirek\negoscue\nphaneesh\nbrayn\ngocycle\nspoonists\ngulsun\naestheticize\nsermonise\nkondic\nekasala\nhelitankers\nnewlins\nbudgetting\nyounsi\nfunktional\njaffre\nutccr\ngyrion\nerold\ngandas\ndisinflationary\ntenjune\nxpressprint\nlafig\nclerkish\ngargonza\namidan\nkolevar\neatings\nnextage\nhelthwyzer\nniehuus\nrashun\nattorny\nmothing\nlivant\nhocha\ncalafati\nbouillabaise\nergil\nmotorguide\ndeathbowl\nkopeloff\ndownclimb\nbookcrossers\nantonias\nkartin\nconsumerfed\npourquery\ntzds\natlanticists\nasmedia\nbesmillah\nmakansutra\nmikulecky\ncommercia\nangiddy\nhirable\nnantana\nsuzella\nmapeley\nsixti\ntankosic\nlscp\nkirsta\nnonconsumption\nnuturing\nfujimorism\nsecuritising\nflublok\nzizhuyuan\ncomunications\nhollinquest\ncostermans\nprintables\nawail\nwheezed\npyonyang\nmiksche\nmileaf\nbuppies\neurotelecom\ngovnerment\ncurnick\nsarbayev\nrightfield\nlumie\nteatimes\nnplate\nsafco\nwobbliness\nszul\nkeysaney\npiratbyran\ncumbiamba\nruppy\nricucci\ncottoy\ndarce\nczitrom\nbutterwegge\nmohssin\ngloser\njazzes\ncuratorially\nthankgiving\nbrazlian\nliant\nanaide\nwonderfall\nsevc\nbadnaban\ntristans\nsulskis\nseliverstova\nflokati\nbpgc\nsorabi\nasaish\nwokorach\nsocialable\npartnernetwork\nglunk\nnmrx\nmanazel\nennico\nkinro\nnattavut\ndietziker\nnoodlings\ndadoun\npettazzi\nunspecificed\ntxtng\nalburn\nnormany\nextenstive\nmaarib\nupgrowth\nniklison\nkickings\ngraffius\nimpressionability\npoitevien\nrittgers\nbawb\nedrms\nbaltanas\ntairu\ncygielman\naution\nterissa\ntreehole\nincredulousness\nfueding\nrussound\nkabemba\napharwat\njodphur\nrsbp\nmartabe\ngoalward\nqeemat\nwutke\nsateh\nhealther\ntremolandi\nzhuping\nmakone\nntdoy\npiranesian\nimomali\nfachette\npammie\nvoytenko\ndsgv\nhaferman\nmollart\ngiorgy\nnajja\nasheley\nvalups\nroseen\nwhitticase\nalesco\nacorss\nfawziya\nhatefilled\nfonté\neastbanc\narenot\nsaffre\narthroscopies\nlapucci\ndetoc\ntapjoy\ngastronomists\n\nabent\nsmolinsky\nspisto\nvisitfl
orida\ntiaoyutai\ngweithwyr\nshabdarbayev\nkuechle\nindata\nbehesti\nwoozily\nsumray\nmmbo\nsokudo\nkachine\nhupert\ntalecris\nnamerow\nthistly\npeyankov\npondel\nhomefires\nkermits\nrssd\ndursts\nscrappier\ngnaas\nzewdu\nstubholt\nmigl\nlaunchbox\nbuchaille\njerera\ncteep\ngillygate\ncosmetiques\nmullahey\ntuffers\nslepnev\nepiderm\nsolantic\ntorjanac\nvslas\npelletised\nporini\ncomraderie\ncaramelises\nduans\nroguy\nlehwald\nfehilly\nzimonline\natiqi\nhuorn\nkomesaroff\ntaxs\ncarfare\nmittleider\nsummun\nconfiteria\nlosties\nkanunnik\ncyxymu\nshular\ndriis\nmccrickard\npluff\nvolpes\npittances\nkatsarou\ndelgardo\nschoenebaum\nshantia\nwinnner\nasadoorian\novercaffeinated\nhappyton\nkadetsky\nmegasites\ngondas\njirong\nticketnews\ncockshaw\ncubias\nbannissement\nkuduru\ntalkathon\nyrcw\nlecarpentier\nmunsef\nposiblity\nwinpro\njibriel\nxianchun\nmcblain\nclendaniel\nchessum\nangelisa\ncastlebrae\nnassiriyah\ndrillable\ningérence\nmadakhel\ngenkin\ndevilishness\nvaricoceles\nprotoconsciousness\nevoras\nsemenkovich\nprincella\nlisterial\nayurvastra\nhorua\nrumbak\nzonko\nbradshawgate\nenternships\nchestertonian\nnervier\nthanapol\nzachelmie\nbarasia\ncheonsu\nlocasio\nencoring\nsplotching\nwintergirls\ndelbianco\nkamaron\ngfig\nstci\nbellalago\nbeushausen\nbirdbrains\nfoodtech\nsuleimaniyah\ngoshdarnit\ncimab\njamere\nlalchandani\nolaciregui\nalmspeople\nyuwali\nemelye\nmncn\nabbyasov\ntrinton\nnainima\nbestthinking\npolicinski\nstovepiped\nallick\nelagina\nwiedl\nkonyn\nmachanga\nsolartech\ndescouensi\ncoffeepots\nroomfuls\nmayhane\nmachikhel\nmyfaves\nmasoum\npaniza\nbentleyforbes\nelemendorf\nmithering\nnogee\nhillfarm\ngordyn\nokeelanta\nhaslehurst\ntheemithi\ndzierzanowski\nniveen\nbedoni\nknottier\nchanterie\nminiucchi\nbuscato\nsungwon\ngoolag\nlmag\nunsnarl\nspectable\nkichel\npotholm\ndisraelian\npalacerigg\ncoronor\ngloton\nnewenergy\nmeggyesi\nmoniquet\ndatagate\ndrapey\narterra\ntandrup\nsummersell\ncieba\npospech\nblotz\nbackcare\nantalina\nruchei\nedge
dale\nmaliti\nweaked\ndenrées\narshadi\nniehus\njulfest\npassionflowers\ntillema\nromenskaya\nmuntazer\ntakashio\nteamsite\nnihiwatu\nlesak\nyubraj\nveltliners\nveneficum\nrusalca\nanthrasimias\nforticom\njheng\nvalencio\nrouhalamini\nminwax\ndeunta\neiderdowns\ngoedjen\njumptap\nbadric\ninterceed\nlinburg\nresue\nabdalhadi\ncanden\nhurder\nmicrobia\niljans\nkaleja\ngovernmen\nlibanan\nubran\ndowgielewicz\nfuzziest\nvarlotta\ncarepages\nchuckin\nputhukudiyiruppu\nkourtni\nhovatter\ndetyen\nlanguir\nadviceuk\nplca\nbaczkowski\ndrcm\ntotico\ndwis\ncompnies\nzelenkova\ntohani\ngphin\ndubovi\nusbsf\ntribalised\nmaqtari\nghadafi\nhhrs\nintrinsa\neberli\nostenberg\nsoaraway\nsancrosanct\ntexass\nbrownen\npremeditatedly\nsaucepot\ncityvista\natifa\nwilfing\nwastefull\nduchowny\nyelpers\nsestaret\nkruszewska\nfloridans\nrogeriee\nphotofits\nantza\nmoyad\nsimitra\nrationalizer\nsliwka\nzianet\nservents\ndollor\nagritalk\nrechler\nlhag\nedesia\neyadou\ncherylyn\nﬁrms\nparasiliti\nlaxest\nsuperweeds\nzacatecanos\nfrontpoint\npedlosky\nsplittism\nderoch\nwaiflike\nextemporise\nstridulations\nkadaré\nterrorisim\neichele\nphoslo\njihangir\nzestier\nmundorff\nstrugging\nmynmar\nsedotti\ndesertscape\ncheongdamdong\nchernof\nslyusarev\nnulf\nlauryssens\nkrizmanich\ndeclinism\nmachreth\nsolomeo\nwordle\nvindico\nslumdogs\nhumilate\nhydrapak\ngrandclément\nparapropalaehoplophorus\npasssed\ntargui\nalsanea\npacquement\nshaed\ncaisi\nelemis\nmonteleon\ncreepfest\nxoft\nupshots\ngenery\nkhadivi\nliando\nvidana\nrukmangad\nblaymire\nmathiew\ntrzeszkowski\nschimelpfenig\nbrechet\nzhoucheng\nbaugham\nmelera\nverschure\nprority\njarping\nkontarinis\ngypsyism\nsoodeen\ntransracially\nstuker\ncatwalking\nmaislin\nwsri\njreri\ndornbracht\nbetani\nmocal\nrepercusions\nglyver\nyahyo\nbnrt\nchysler\nrajprasong\nmompreneur\npennypot\nbatakovic\nchouquette\nduesler\nmangelwurzels\npoethlyn\ndouple\nbezzubenkov\nmimmick\nnewsosaur\nfamis\najorlu\nupdraughts\nvanica\nnvó\ncornettos\neonnagata\nshaboo\
nbouchar\nhaimowitz\ncasanas\npadmanathan\nhironoshin\nmushey\ninvestgation\npunchless\nibaceta\ncatney\nsheehys\nloglisci\nunerasable\ngehua\nbiznet\nmodirzadeh\nleuh\nazade\nsplattery\naastrom\nsparlin\nhumad\nacrophobe\nsallard\nkastanek\nmailbu\npazornik\nonieva\nslingjaw\nfranticness\ngureckis\ncopolla\nokropiridze\nritziest\ngiulani\nbikinians\nbolofo\nreprioritizing\nstorewide\ntorreal\ncraigielaw\ncrouin\nhuppuch\ndaranee\nmokhiber\nrsvped\ncabrnoch\nqutenza\nkhvalynskoye\nquivery\nshigal\nclintec\nabeysundara\npffs\nlizak\nlaminusa\njcct\nlvas\nmedialunas\narnots\nbufali\nhadzidakis\nquanxin\nmaeslant\nhealthteacher\nvrijdagmarkt\nkotwani\nsvns\nwoldt\nlyron\ntakirambudde\ntransitcenter\nratemds\nberekely\nthirstily\nhalusa\ncraythorne\npaupiette\nkannam\nobetta\nclct\ngirifushi\nshiratsuka\ngoldshield\nikejiani\nmonkerhostin\nwaldmeir\ndonatelle\nabrevaya\nblistery\nstockrooms\nooten\nkalkilya\ncaplis\npostrelease\northodoxly\nmdst\npitchforked\nsokolin\nnosazi\ncolat\njeapardy\nmitschele\noptium\nshalaan\nshehanshah\nellenbecker\nvarshavyanka\nschulzes\nradco\nizaga\nmerscom\nfeistiest\nkrindjabo\nreturnables\nhospenthal\nmanyani\npowerbases\nheadiness\nasankya\nfarakhan\neldonian\ninnovent\nelectromotors\nlongeurs\nnuhanovic\nintersouth\ngeorgens\nkissman\nlarfaoui\nharges\npvcu\ncheatam\nrehabiltation\nufland\ngyeonghui\nqutbal\nmeleca\nbeibars\nchermen\nabdukhadir\niois\nimmodium\npoeticizing\nhyvonen\ndecosse\nherriges\nmallandaine\ncosmotv\ntrajal\nlankapuvath\noveroptimism\nchrysohoidis\nvujadinovic\nbiomerieux\nbeleivers\nwimhurst\nbleecher\nsubserviently\nnonart\nfootlik\nalmds\nsevastopulo\nginandes\nhumanbeing\nnonowners\nchinses\naarone\nhulser\nweatherwoman\nfatas\ndamnosa\nwheatgerm\nmegaraid\ngalatolo\nssoe\nhubbies\nmccrimmons\nfinancialservices\nsausan\nmilwiki\ngamboling\natender\nouthalf\namaroq\nexillon\nrafert\ntactis\npriobskoye\nllywydd\nstrozyk\nmormoris\nspiegelfeld\ngenerousity\nmybo\ndogeared\nlucato\nmontamat\nabdilahi\nfvpf\ndi
charry\nharridans\nxbot\narbinet\nnooristan\nwantchekon\nshaleum\nchocoholics\nhankerin\nshigakogen\nmoscaritolo\nschweiterman\nidenix\nspeac\npazdur\nmouthrinses\nexpatriot\nscrunches\nosgoi\ndemiglace\nfirstflight\nceiron\ngxowa\nyolan\nkayrol\nlbdr\nbeluah\nshaggily\nfurlatt\ntrochez\ndevakumaran\nsuperglass\ngerstenblith\nsqw\nsadafi\nkeelen\nsicherer\nexpostulates\nesfandyari\nnombo\nbozhong\ncoloroso\nwomblike\nnonjury\nkaradjian\ngloriousness\nclasketgate\ndurgahee\nfrontbenches\nthalesnano\ngogola\nsnildal\nkarradah\nmachnig\nrabeling\nrevus\nbairey\nvizjak\npalistinian\npredicter\ncollegeconfidential\ngrassmayr\nlosingest\nmereenie\npiong\nsystinet\ndenit\nsmartcar\nbatzeli\ntiumen\nphfa\ngormez\nhursting\nkither\ninsmed\nhassina\nsterilises\nkarukinka\ngavelled\nstonner\nmobayi\npspn\ntoyloy\ntransatel\nbayoil\nregenia\ntangerini\nbravata\nzwirko\nnatrecor\ncaravantes\nmarset\nzubasnabar\nainain\nwizansky\nhulihan\nleendertz\nabeckaser\nmassarray\ntrollops\nsoooooooooooo\nshoushtari\nputaway\nseeen\nedul\nlovain\ndopier\ndistration\nrosenholtz\nloreille\ntechlink\nsasmaz\nloebinger\nreiging\nbonagrass\niupa\nsmykal\netoken\nrailbirds\nconcusion\nwawruch\ndigene\nforsees\netinde\nmarakon\ngastrotomy\nkielich\nyuewen\nroffler\nfollath\njerriais\nschizophragma\ndasic\nbuppie\nsacharov\npeochar\nhertzke\nbargaal\nmarguarite\nduboef\nconstantis\ncompeat\njadaf\nmgas\nloumia\nhiridjee\nwillowcroft\nrailpass\nunwigged\ndowngradings\nottenhoff\nstrandheim\nhadjimichael\nozgo\nkomkommer\nhoneybadger\nrephased\nrespun\nhicl\nnereyda\nveltre\ntreebhoowoon\nduirng\nsharethrough\ndelre\nballough\nreroof\nperevi\nlimra\nmaestrecampo\neorpach\nthearon\nforestries\nzarandeado\ngrimana\npixol\nfindomestic\nnaryanan\nneimo\npreseren\nviolari\npolymedica\ncmops\nmascola\nkaltbaum\nflatscreens\ninkdata\nfateyev\nashker\nsisso\nfurda\nnorlandia\ncprg\ncontnue\nsaroeun\ncrystaltalk\nbudaors\nyessayan\nmagwire\npromming\nmobissimo\nnwamitwa\nplanemaker\ndinampo\nmojie\nredpants\
numberta\nenchainé\nmealbox\nparliamentarisation\nsummerley\ndamiran\nmachanic\nkalivas\nrengin\nganswein\ngjana\ncredu\nisturiz\nunreachably\nmcpaper\nsherkan\nalure\nemrose\nbioterrorists\ndebussian\neurorealist\nsecretarygeneral\nincease\nirrelevence\nstemis\nwinsky\nretailored\ngbod\nairdefense\nlulejian\nrevarnished\ndreweatts\nkrajnak\nscoffingly\nzengana\ncardpoint\nboogieing\ndebrecin\nvorley\ndohse\nkhidi\nstaidness\ncoicou\ngarabitos\nwadir\nhadibo\nrutskoi\nmaureau\nmorkūnaitė\nhoofy\nalkifah\nchioce\nstaloff\nfireraising\nultrarich\nundercook\niseminger\ncommentateurs\nbeanpoles\njamaatul\njuleanna\nmcwalters\nvenofer\ntwinsuk\nlarium\nhilots\nrakotoarinivo\nnishta\nburtonshaw\noperationa\nnozad\nmaciborski\ndankest\nungallant\nrippleffect\nbonstein\nideaconnection\nasrgs\nwilpower\nmakahs\naxcan\nfredik\nharedit\nsoltanifar\nunpragmatic\nbasestation\nbattaglio\nauffhammer\nminendra\nscoreable\ntickover\npgba\nqualita\npetraske\nbalsille\ngarrera\nsneineh\nmeleta\ntalampanel\nwoolliness\nrfma\nbkis\nrummyroyal\nblayn\nreceeding\nmisalliances\nscruntiny\nqole\nsocheat\nhfdc\noffp\nchemobrain\nboosaaso\naebleskiver\nkitschiness\nslatterns\nafriat\nbulletproofed\nfinaid\nkundai\nquinceaneras\nsammys\nkabati\nkambeitz\nswingingest\nrisonare\nmarinates\nfinkelshteyn\ncieszynski\nxeomin\nkharadze\ndonathan\nblumka\nasiko\nhandblown\nexcellium\ncavco\nkabealo\nnfzs\nmacedone\nstreetcorners\nbabiera\nkrepon\nramages\npulles\nblanas\nturkomens\nprovopoulos\nsuperhigh\nudenze\napplehans\ndeyun\nibbo\ncatalfamo\npocknell\nppgi\nrpcl\ngoodhind\nlynchpins\ngobbing\nmadail\ngueridon\nheped\nislamicise\noders\nkaraiskos\nsilkily\nbuesing\nkabakumba\nhsmr\nmuehling\nkollwitzplatz\njande\npocp\nhindsboro\nfclo\nmccarson\nbraininess\nastringently\nzwetsch\nquavery\nravussin\nfundementals\nangiomax\nqueasily\nojul\npatientline\ncrabbiness\nmartiz\nyturralde\njalang\nsolvesborg\nhortsman\nbalber\nshintos\npriorties\nmeteast\nleciester\npriklopil\nchaorach\nadmetech\nvictem\
nbaciagalupo\njurade\nreinherz\ndraskovics\ntamau\nloungani\nsphp\nmetalline\nmartern\nfoodbuzz\njunketing\nchanomi\nmofya\nblogdom\ndongfan\naltintop\noctagenarian\nwatercoolers\narpil\nrahate\nncys\ninnefective\nbellying\ncochetel\noversexualization\nflittering\ndowndraughts\nabduhl\nkennnedy\nwatchetts\nrebundled\ndornic\nmilevskyi\nchengelis\nengrain\nloretha\naltiparmak\nbelto\ncuranderas\nsmwm\nliberalizes\nhellga\njassat\ndemarzo\nngxoli\ngerlis\nbutmalai\nexecunet\nwindsurfed\nexected\nmgma\neaggf\nsarway\nultrasonographer\nscoill\nontak\nlaalou\norginality\nsurfcomber\nsuchlicki\nsoulquarian\nseffrin\nearthport\nhouplain\nlazere\ncyruses\npasich\ntrimac\nkazanjy\nvernakalant\nropeless\nmuchdi\nhonberg\ncalongne\ntutterow\nenstar\nsheratt\nluftballoons\nshinbrot\nunnessarily\nmcclanan\nandwas\nhernried\nprophy\ncorporat\nschield\nunclenched\nmobots\nnusairi\nhollensteiner\nmeidt\nchimmaya\nhanunuo\ntamica\nhelicoptering\nsurpirse\nmyhal\nhimeslf\nstroumillo\nkannady\nllanederyn\nmenstral\nwoelfle\nperuto\narocatus\nannoor\nniemuth\ndecailly\nunjamming\nhogendoorn\ncarlike\nmuzdalifa\nvoigtsberger\nkuloglu\nhoors\nvelocix\ngoalside\nvasys\nfinchatton\ncoffaro\nzhuwawo\nandrelated\nmyatts\ntruscan\nlaulan\njariah\nrotflmfao\nnewsevents\nbillgates\ndranove\nrisibility\nqpsa\nqueenies\npauperizing\nchalleng\ncynnig\nchadborn\ndaelman\nkaplanoglu\nairwire\nsimpfendorfer\nfleischli\nmeuniere\nzeltia\nmardomsalari\nkamlari\nperipherality\nmalhorta\npierazzo\ncomitment\nsilajdzic\nsmolarski\nswithering\nradovcic\ntwmc\nrosselkhozbank\nprzybyla\ndatacash\nkabatu\nblynyddoedd\ngasifies\nexploracion\ndaliesque\nminocha\nhempels\npennenvironment\nfawcette\nguarinisuchus\nbingers\ncurrupt\nsasseen\nparapattan\nbanteaux\nmhip\ncinebarre\ndudina\nyugosphere\nstiltwalker\nchessen\ndubit\ndotmed\ndahaf\nhelrich\nwishfulness\niacopi\ngongmeng\npamperin\ntallymen\ntulawie\nreverance\nsunoo\nwakpa\nvollet\nekawat\neconomis\nblaid\npecnik\nbebear\nchimtal\ngerogia\nbickson\navro
ko\nbinkos\nstooksbury\nchinula\nmaxxaudio\ndcar\ndatscan\ngandrange\nwearingly\nnadesapillai\nhorribilus\nmagavern\nzatorre\nwheelgate\nviprinex\nstrohschein\ndeonta\nbarchana\nestepp\nrouman\nafssa\nmingenbach\nalmahata\ncampaignin\ndailin\nmisvalued\nsamanthas\nklitscher\nsirantha\ncouvillon\nvouvrays\nschlosstein\npalmour\ntromped\nnextone\nnazirhat\nbioform\npollers\ntwitscoop\nhrebejk\ngibbers\nwieske\nplanéte\ncoachin\nchàvez\nkenerly\nqedumim\nmerily\nmedfusion\nmirwald\ncytox\ndŷ\naizhixing\niapac\nofits\ntsagaev\ndeconcentrate\nskyseer\narway\ntezal\nesary\nrabenold\nmellamphy\ncamisón\nmattaliano\nshibor\nmarrapodi\ncachel\nrossminster\nmceneany\ncethromycin\nreconstructors\nairil\nhibe\nmoderaters\nschenter\nanfac\nvixenish\ncoolcullen\ntaunya\ndzhakishev\nolymic\nfalcondrone\ngulbai\nfondamente\nmazurak\nkhuloud\npbteen\ngrindler\nunwhipped\nkulcinski\nheterodontosaurs\nsiegessaeule\nprosinecki\ncharisa\nriness\nmanssor\ntcktcktck\nnandrive\ndularge\nmarazziti\ngethi\nmonocracy\nsnarfing\nwolkind\njaffeholden\nstrummers\ntotties\nbrulant\njmma\ndgacm\nfcag\nfastiggi\nmansionization\nhospiscare\nfloorcoverings\nrovit\ndurrence\nobamessiah\nsanregret\nunderpays\nhealthcard\nzhanjun\nrhones\nkoolhaus\nhatworld\nmubai\nocelote\nfimmvorduhals\nfelco\nffliw\nelectrosensitive\npriebatsch\nhightailed\nbirreria\nemster\ngussying\narjuns\nshirtdresses\nborthers\nabajian\nbadros\ninvirase\nwellnet\ndetzel\noxby\nlesnicki\nphotoshow\nappenines\nurunana\nbelyeu\nmenji\nhomeusers\nunfashionability\ninemi\nbadiani\nleakier\nylan\npennisula\nneumuenster\ngravatai\nkaukenas\nsynnwyr\ninsensative\nspravedlivost\nproximagen\nfisca\nlumeng\nwilhemsson\npolakovs\neglseder\noverbred\nsocialiser\nhadil\nfranticly\nbuffings\nbrashers\nyoungsun\nsadove\nsokou\nbokas\ncordozar\nnucks\nmadobi\nlupanare\nsajar\nintelence\nswarbrook\ngraessler\ndammerman\ntrony\nmunkley\ngadabouts\nviazzo\ncomputors\ncoughter\nraceco\nallopass\namplifon\ngladinet\nvolnard\navaliani\napmss\nmenviell
e\nnaroua\nshumeet\nsainbayar\nhanakamp\nroseberg\nalsol\ntrugreen\nkuntoro\nnightclubbers\nlaox\nviitasalo\njanetzky\nsettting\nafdo\nbickelman\ndicier\nhandsewn\nhuipu\ndatafeed\nmaidlow\nseiman\nmarchioli\ndragomán\ndilyn\nspacie\nwigoda\nvictimises\nconsituents\niradi\nfelipo\ncuciurgan\nmemushi\nhfsa\nmunnoch\narachnophobes\ncurations\nmiskick\nmarokvia\nnzonzi\nshengen\nbluesign\nprogession\nmossbawn\nunshakeably\nwafering\nghdx\nminggao\npityingly\ncaranobe\nusband\nwafflers\nvuth\ndrumlike\ngorgadze\nhopefullly\nintegre\ncattlewomen\neborall\nmeritocrats\nrealny\nsnoxell\ntarenflurbil\nknuchel\nsheliah\nkedersha\ntearily\ntrhis\nsiliconware\nnonsampling\nangore\ngrundvig\nmonterubio\nukli\nburkwoodii\nyibao\nnamwamba\nmemolo\ndogwalkers\nextemporising\nwarpspeed\nphillipon\nsuperzooms\nrequete\nmalbecs\nhilterman\nrakatan\ncarnick\nbasuta\ntrbic\ngentic\nunblinkered\ngodean\ncundle\nguilliams\nmisruling\nnavindra\njohndrow\ntherapuetic\ndenuccio\nymlaen\nmothersole\nsaltry\npoppyscotland\nastoundsound\noutlaying\nthinkbroadband\netzkorn\ngulfood\ntwittelator\nsatchit\nnlihc\nnowling\narlingtonian\nsiezure\nmajonica\nondracek\ncenzic\npovall\nunsinkability\nsassenachs\nkoterec\nundercapitalisation\nkizilkaya\nchauzy\nkovalevska\nzzzzzzzzzzzz\nchinaamc\nlabioplasty\nyugonostalgia\nirregularites\ncoldstreamer\ntrymaine\ncupless\nolphen\nmabledon\nminicucci\nkriegstein\nsonorously\ntawnies\ntazered\niddles\ntechonline\ntrygvesta\nrandzio\naquavits\nkhinda\npemgroup\ngerontocratic\nrawanchaikul\ntactico\njermie\nconciliations\nbaimoensis\nbraques\nkepkay\noteha\nandalucians\ntrebing\nhedayatollah\nflimflammery\nsiennas\nburghes\nacaai\nkevelos\nfinnfellow\nperinatologist\nmuliana\nqifa\nwcdi\narapata\nrouder\nraulini\nmaleos\nprezista\ntrickski\ntarogi\nwaxworm\nivhs\npetzen\nguilermo\nniyaki\nrealtionships\nwolridge\nparliamant\nmoneyspinner\nsaddah\neliès\ngorund\ngoehler\nlambah\ndörentrup\ntwihard\ngearrannan\npereplotkins\ngravner\nlearningexpress\nhousewife
ly\nmaskiot\nakhileshwar\ngravagna\nfaiez\nueckert\ngillingstool\npecoul\nhirawi\nmulpuru\nbellozanne\nzanner\nkhomeinis\ncerphe\nurgen\ndinnae\nfunkyzeit\nnoldus\ntailândia\nmarlines\nazzarelli\njohnasson\nmarkridge\naezs\ncalderan\nefinancial\nunknowables\ntekturna\njookie\ndaintier\nintere\nloungy\nquoran\nmarshalik\nembroilled\njaoshvili\nscreenline\ntestim\neggcups\nmietto\nwaxter\nmonjaraz\ncoldhams\neurocracy\nlevian\ncalimocho\nhaband\ntwinem\ntalenfeld\nweisfeldt\njsda\ncastellinaldo\nundisc\nneurostar\nlagravere\ngodapitiya\nrajohnson\nlederhandler\ncamardo\nalberquerque\nmopup\nchiclero\nbreckconnect\nbalatony\nhommen\nmicrolenders\nacjw\nfaizaan\nilaris\nllanrhuddlad\nclinto\nnonmuslims\nmartier\nraptiva\nparrin\nkumbwada\nrhiann\nvitalo\ngrncarov\nchevanton\nrunarounds\nnobama\ngttc\neurostoxx\niorworth\nbackpedalled\nsinsuwong\nrothwax\ngiezen\ntwangiza\ndjenane\nshyte\nebct\nbrynllys\nicepower\ndaozheng\nunfond\ncristocea\nimado\ncarloz\nnicois\nnanomagnet\nlemacks\nkinkri\nfennebresque\ncertificados\nmanufaturer\nvoldermort\niopc\ncurseen\nhscrp\nreumayr\nstrimmers\nimponderabilia\ncapline\nbenecio\nwainting\ncibers\ndaschund\nunpardonably\ntrafeh\nemmitted\nenria\nsimcor\namarger\nnflx\nqualit\ntrevia\ncaplon\neteraz\nborosage\nunamimous\ncytokinetics\nhailers\nstryde\nvirtuosically\ndetheridge\ntarifs\nmadhwani\nbuddenberg\nmwangaguhunga\nstickless\nunconsulted\nivatury\nfontanillas\nmaradonas\nmagern\njumpshots\nbaskis\ncelldex\necovoyager\nsomaliweyn\nsabauddin\nconar\nesye\nakkaz\nshanghi\ntablescapes\nbetao\nslarke\nnonurban\nlelliot\njunquiera\nexecutiv\nooer\nstepha\nschoneborn\nmoanings\nyeezys\ncelgard\nwiedenhoeft\nnonsedating\nmunlo\nlerga\nossies\njalaledin\nknitzer\nbaudelairean\nsupremicists\nintelegence\ncoaltrans\nelahian\nsences\naarin\nzavadskaya\nzéribi\nhousholds\nbossasso\nachievo\nhervi\nmcelwrath\nromaniw\nfosback\nfirmdale\nmpdus\nboliva\nhammily\noetsch\nngci\ngrillings\nscamster\nachelpohl\nharkett\nmouthers\nverhoff\ncryst
alox\ntarado\nredinbo\nminibars\nplunz\nkentwan\npanicing\nloserdom\nmuhamud\nsufferable\nvollebaek\nburtless\npowderject\ngeldorf\ncostive\nunrecyclable\nschnable\ngrogin\njobbies\nnewgent\nredialing\nxpac\nswica\nfroideur\nscambaiters\nseropyan\nlohachara\nhabibinia\nskyroom\nshiral\nathashri\nshukarno\nposners\ndakdouk\nviznar\noutswinging\ncarryalls\ntescopoly\nchristoffers\nmulyo\npgeo\nsnohetta\nruholamini\nuyttendaele\nfhlbs\nfreeflowing\nsuddeutsche\nspiffier\nmaldeikis\ncraigengillan\nhypercharged\nsnarry\ngriscti\nsudnick\nsoberest\nhlang\ncarradines\nmeixin\narriaran\nsecuity\ndourer\nmesenbourg\ntawke\nstoniest\ncordex\nseretide\ndepersia\ncmpmedica\ncmpb\nrillette\nattabi\nmagincalda\nharild\nmozypro\nformenting\nsyste\nairboarding\nhealthpro\nszebeni\nheartspring\nskiiing\ndumanski\njeneen\nnycity\nmahbob\nwomanize\nrubatos\ncorrpution\nkarbovanec\nomnisource\namcore\ntonery\nspaciness\nflummoxing\nknsy\nambulence\ncarmat\nchazi\ndelsman\nchallem\nachata\ntaukafa\naltzheimer\nbundred\ntruckmaker\nbraught\nkurzak\nwavebob\nsartwelle\nristroph\norelans\nsyatt\nsheddocksley\nunivesrity\ncpzs\ngravesides\npharmachem\nadeiladu\nhaberdasheries\naplf\ncouldwell\nfineran\nnonswimmers\notarola\nlegaue\nevacuators\nmpitsang\nfereos\nesnol\nmainegeneral\nlibala\nafghnistan\nqnexa\nbenoits\nseviour\nyenegoa\nfineable\nvermögensverwaltung\npsam\nfacpm\nphytonutrient\nnurik\ncrispbreads\nsemilong\nmeresamun\ntombol\nandews\nunfurrowed\nbjoergen\nmillionares\nanesta\ntornagrain\nstolzfus\nemphasys\nyelstin\ngoubaud\nanjale\nhawkishly\nminuting\nchifundo\nhullis\nlambeosaur\nwloszczowa\nsensemaya\nsandalled\niversens\nkleinhendler\nwestfalische\nbelabed\ntchico\npentref\nenmeshes\nwatanabes\nidustry\nreaggravating\npredeccesor\nmarenberg\nyahe\nweimeraner\npremajayantha\npennsyvlania\nlightningcast\ndickover\npaccheri\nkodzo\nbiggoted\nundersupported\nschulkin\nidentidy\nfoppishness\nscrummagers\nsiphan\ngovermment\neqii\nniknejad\ndonerson\neckaus\nduncrue\nshaghaghi
\ndrcc\nlatisse\nwackadoo\nantisa\nfarcus\naerothermodynamic\naprilla\nvuic\nsubaiya\nmlam\nboinod\nlletty\nbostelman\nmelonguane\norbicule\nbeewolves\nreassortants\ntruanted\nhuler\norgias\ncazmo\nnyiregyhazi\nupfitter\nrepored\nyakumi\ndeptartment\nminidisks\ngreedheads\ndocumenary\ngunowners\nintitials\noverfamiliarity\npiquot\nreposa\nconsecuences\nhelathcare\ngunterman\nshimanaka\nsteaz\ncottoning\ngeraghtys\nikarian\nfentora\nkinstellar\nselzentry\nimposssible\nphormiums\ncharrieriana\nwhump\nboguchanskaya\nboonekamp\ncorecommerce\nzigun\nnaglazyme\ngalsulfase\nscardelletti\nsabatello\nphippses\nshadowood\nnonfactor\nfelmlee\nyanire\nporking\nmisallocations\nwiould\ndegroof\npcubed\nkisiis\nmucoadhesives\nablikim\nehsi\nbiriotti\nmisadvised\nkirkston\nasawin\nrxnorth\nlegeros\ngiganticism\njudaise\npresentencing\njuiceless\nclaireece\nhauda\nchelseas\nvolex\nloderick\nsarwary\nlitigiously\njacksplace\nmusati\nogunniyi\nwebathon\nperbacco\nunwinable\nschmechel\ndancejam\nwayser\nwirex\nyakhdan\nlonghini\nwafels\navinza\ncutwail\ngirat\nziraba\njopari\nthruppence\nssnc\ndroba\nngwira\nsakalis\nsukhova\ncssiw\nsokunthea\naptuit\nislamaj\nameriville\nlaiping\ndilxat\nholencik\neeewww\nmasese\nseafronts\nlongcourse\ncholerton\ntoblerones\nlabañino\ndushkina\nfazalur\nclaymates\nshambas\npelletiere\nlucoff\nchutkan\nmaximedia\nbierbower\nmajimboism\nredrag\nbithel\nhinterlaces\nconveyers\namons\nperplexedly\nrazziq\nyaggy\ntakabuti\nhealthscape\npopcast\ntukuls\njaiani\ntravella\nunseasonally\nloofahs\nskion\nyozzer\nduveens\ngomedia\nsspt\nmenks\ncritcally\nandriesen\nstretchier\nshawana\nommar\ndragseth\nkoziarski\nlasama\nimoogi\ntingvall\nklarissa\nvidir\ngtgp\ntuvshinbayar\nunneutered\npoisened\nhanesyddol\nmyride\nmerluzzi\nmydroilyn\npreseident\nprofounders\nalafoti\nstollsteiner\nalfies\nvendanta\nkillinochi\ninteruptions\npoortown\nmetiner\nbearishness\nsousvide\npsittacosaur\nretured\ncasarett\nlanita\naaahs\nkimberlain\niaro\nbischel\nshortcircuited\nfore
ced\nyelsky\nhematopoeitic\nmalliouhana\nzakouski\ncouthard\nmarjority\nmochileros\neavenson\npopcorns\narkans\npochepa\ninititive\ndugle\nsrla\nuniversecity\nkampang\ngloddfa\ntangee\ntanigue\nwreghitt\nboltonian\nbournmouth\ndrapper\ntaekjip\nhilgart\nfullham\nbeefcakes\novereaction\nstremel\nalariachi\ntarmachan\nhasab\nmonaveen\nyoungwoo\nkalimanzira\norneriness\nfusf\nmantuo\ndiavolina\nhebior\nrebaza\nchelson\npyrolitic\ncvvm\nrompel\nmagliozzis\nkossek\nscobleizer\ntyab\nmarghoob\nexent\ncavalho\nspontex\nitvplc\noutrushing\ncpfs\nmunajid\nbabymaking\nunderclothed\nfelloni\nmcmurrey\nkliemt\nmanghera\nwichy\nkinglas\naljabri\noutgang\nlegalisations\njennilyn\ntoquam\ncharmil\nmultihyphenate\nconnivers\nbrighthaupt\nminocin\nmonickers\nargentini\nhveragerdi\nbartmess\ngrivko\ncontradictoriness\ntacticts\nayouni\ncrystallex\nimpm\nnietsch\nshouk\nuninstructive\npdois\nhaynor\nkaradzhova\natci\nsamsami\nopuc\nkloske\ngillert\nneverthess\nstoiko\nechenard\nlagreat\ntonier\nbourtree\nswaggerers\ncloman\numgd\nhejl\nprewired\nhusbandly\neminant\ndeflationist\ngrytsenko\nrodearmel\nalbosinensis\nmiomi\nbellio\nevox\nncade\nliteralizing\nslanker\npteroptyx\nsanudi\ndrmo\nsulaikh\ncannistra\neqipment\nyvras\nfreh\ncateura\nweskeag\ntycko\nmagwilde\nraxit\nrenationalizing\nmonfried\nbolakoro\nwiederer\nblunderings\ngladwellian\nswedishamerican\ndarstein\ngeissbuhler\ngeronzi\nviviant\nfierek\nmelquisedet\nwestmuir\nchillaton\nbellshaw\nmicroprojects\nsylviana\nwaszak\nacqusition\nclothilda\nlabeij\nstansbery\noptumhealth\nkhumbanyiwa\nsentimentalizes\nworldheart\ntackman\nqteros\nsmartrend\nsoftsoap\npedrocco\nsherbin\nfestgoers\ncrimebeat\nshustak\nmaureece\neuropewide\nwdrs\ndysautonomic\nbhagh\nharpst\nretailleau\nkamide\nubinetics\ncuona\narchfoe\nyingxi\nmehlberg\nchafkin\npodrían\noduma\npieminister\nmacroalbuminuria\npinballing\npoleyeff\nangery\nrheolau\ndefendory\nplantsmanship\nrajdamnoen\nkemoko\nmzembi\ntedmund\nstansall\nmcfarley\nkäch\nmajolo\nwadongo\nmu
he\nwudnt\nfliegelman\nshedroff\noverkeen\nunloveable\nanticalins\nnegotiatior\nvellanoweth\nnoticin\nmanglings\ncoldstreamers\ncahalin\nstückl\ndunmores\nvanillylamide\nwansee\nsantona\njoshes\nhighl\ngrosveld\nmasieri\nlahud\npuglise\nkhatiya\nmcbrady\nausp\nquiffed\npandalabs\nwitlessness\nnonbanking\nrelion\nepaule\nstasya\nstoopball\nstronati\nensequence\nmaufacturer\nwinski\ngschlacht\nratoons\nracinger\nmateyka\nberezowitz\ntaulafo\ntalanx\nrudwan\nmyhres\nfrankfinn\noverdrafted\nwelsummer\ncollaboratorium\ntokusei\npohlschmidt\nkazakoff\ndalehouse\nplice\ngillim\nmavizen\nboritzer\naaaaaah\nmikolášik\ncendra\nwinnercomm\nappbrain\nghappour\nfloodplane\nstaffy\nallidina\nmirós\nmostaque\ntantus\nclienti\nrealstar\nincents\nsutiexpense\nziagen\nlerue\nlemnis\ndistractedness\nionix\nsadlon\npeformances\ncoppolla\ntaepung\nmaximiser\ndetkov\nmoiser\nsobaru\neckern\nhollywould\nscapagnini\nseymourpowell\nvapnyar\nskabba\nsporicidal\nmoota\neddye\nbearnageeha\nlevatich\nkigawa\ngrassfires\nwriststrong\nbriggeman\npdbs\nunfresh\nffch\nsnowcaps\nbialka\nstajner\nstaycationers\nmoualem\nwedick\ndisis\ngochfeld\nbhall\ndifinitely\narrar\npendy\ncontrac\nsmiljka\nprimarys\nbuckmiller\ngourneau\nanzorena\nmarcouiller\nscriptpro\nsecurest\nherbein\nbrideaux\nflotel\ndisplaymate\nbaabaa\npharmas\ntecaccess\ngricia\neconomc\nginobli\nlitepanels\nkingzett\natropo\nhammerling\novermanned\nnusseirat\nvanwart\nforaminotomy\narington\nwitzenburg\nmamond\nsinesis\nleszcz\ngwinett\nthomand\noakervee\npoundings\njieho\ngrippier\npushier\nmersman\nboogying\nkwadzo\nadrenalyn\neorann\nnasutra\nactigraphs\nvasotec\nprinivil\ndsdha\nalternadad\nkrissoff\nmeerwala\nkiffy\ncortera\nbirgfeld\nricasa\nharmohinder\ngadarif\npihlstrom\nchocka\ndotorg\nlumpectomies\nbanxico\nlivecity\nfinlo\nvenezeula\nfallbarrow\nfeatherbrained\nkeiy\nbergmanesque\nunflappably\ncornova\nagossi\ncampiest\ngolagha\nsunw\nnusta\nswedroe\nkagda\nkissine\nrealview\nlaudamiel\nmediumsized\ninternaitonal\nlongevia
lle\nleighninger\nharshed\nstokking\noceanico\nguilano\npoghisio\nchaisaeng\nestrace\njuthamas\nandrunache\nnrbs\ntwdc\nglitziest\nhukins\nastorina\nbeachbum\nmoushaumi\nlanglees\ngantumur\nchialvo\ntantalises\nsekaggya\ngruebel\nflowerbomb\nbaraclude\nolimpos\nmitoji\nineffecient\nretière\njeannene\ncowhey\ntrippled\nsciencexpress\ncubita\nvilliard\ngbms\nceidiog\nnativeenergy\ninopportunely\ndeante\nyanoviak\npretreat\nmosterd\nzeljka\nstrianese\nbusches\nzypad\nandrewandrew\nmamozai\nbergsrud\ncitytime\nshalwitz\narthouses\ncrotzer\nsubaccounts\nsunscape\njegley\ntracleer\nnyregion\nbrugnaut\nloanmod\nwoelper\ntransdniestr\nschoenbohm\nshimari\ntappable\ndwpi\nneomedia\ndelusioned\npruhealth\nraneen\nrelan\nsimplifydigital\nbrydes\nurgup\nnosologist\ngrocki\ndeeker\nkaldenbach\nfotowatio\ngeophysic\npromptu\nbridis\nbiosys\nlscb\nantiamericanism\nnonwork\nsynapt\nlengele\nundersung\nyawningly\nstennack\nmareli\nhanian\nviveurs\ngoldmeier\nabosede\ngloveless\ngrassfed\nlievense\ncomeek\namericredit\nminqiang\npdks\nschkolne\neuroyen\nathoi\nwillibrod\namnor\nsweer\nformfitting\ndiweddar\nnichanian\nmandere\nalgt\nboutris\niqms\natiz\npfgbest\nknuutila\nkairy\nestreller\nstepinski\nnetcord\nkandids\nsharieff\nkimmick\nbraml\ngarrotting\nnapeo\ncomisiwn\npollaro\nproietto\ndirexion\nwaghaz\ncitgroup\nsalhiya\nchristofore\nnonuniformed\nchipaumire\nrehabilitacion\nrovaris\nhanakee\npaternos\naeol\nlanau\nmuttawa\nyuchai\npreaubert\nwakeups\nfolletts\nhypocrates\nconvivially\nreado\nmagarshak\nalavarez\ndimbos\npeoplesupport\naltropane\nnetspend\nmingquan\nbluenoses\nnonhormonal\nprebil\ncomng\npepperjam\naltunian\ncristinas\noversells\nmbet\nqannik\ntundidor\nesipova\ncpdos\ncutsie\ntigerskin\nsonodynamic\nphantasmagorias\nsalord\nthars\nprowar\nalmli\nfasto\ndanielovitch\nseniorcare\njibao\nplocker\nbolgiano\nkampachi\nqudoos\ncyfle\nrecanvassing\nmarqueze\nderyan\nirlando\nbamcafé\nduii\nchasseuil\nrackow\nkurnosova\nfeugère\nmatoo\ndieugenio\nsholz\nrunbacks\npesa
pane\nbureaucratise\nariannol\nsebangau\nkubatko\ndispossesed\nbalaresque\nmeaklim\nkharji\nngure\npunctiliousness\nwisard\nhumanizer\njinpan\nglicks\ntruppi\nmagestic\nyosbany\njangl\nicjb\nmonastaries\nallweather\nrodalquilar\nsollman\npremeal\nptolomey\nkefta\nalsobrooks\nsandyknowes\ndupaul\ncluizel\ngiulesti\ncritcizing\nlelly\nrahala\nreiffenstuel\nrhywbeth\npenzes\ninsurrectionism\nitagui\nshaladi\nwelched\nvlahou\nreineccius\nwennemer\nkhorafi\nvolumizing\nfilmmagic\nlutex\nhalfax\nsakelarios\nlepse\nxigaoxue\nshaolian\nooof\nhebu\nbrennig\nvelsor\nfristrup\nhayz\nmuchauraya\ndessaline\nalaksa\nresponsibilties\npontoh\nnederkoorn\npomsox\ngrimiest\nsourceone\nventureworks\nsfsf\nweadon\nreavealed\ndenoument\nsanhuan\npeltor\nsentrysafe\nrydbergs\ngmed\nrightsignature\nrosnano\nbezielle\ntrilbies\nvisitpittsburgh\nnakanowatari\nunbusy\nfacilitie\ncsosa\ncutomers\nniepoort\nkazmunaigaz\nbifas\nbmsn\nspeakaboos\nsnicked\nyeowart\nmiuntes\nwolowicz\nshatswell\nsphb\nsanctimonius\npanfried\nmirnehad\nnonelderly\nlancearmstrong\nwiniarz\njournalis\nkohly\nkricker\ntirua\nmotznik\ndelusionist\nsnicko\ngluhwein\nliberalness\nchillaxin\nboyev\nridgback\nsizeist\nuniqema\ncisel\nleimsider\nenouth\nprobablys\ndisintermediated\npenkair\nbitchery\nacephalic\nrnsa\nintelisano\nettalhi\nsnowsuits\nquanities\nmicrofilters\ntheamerican\nmowings\ngallanagh\nzimiles\nfaugere\nbioenergia\nfstd\nsaechao\nabseiler\nrvucom\ndisappoinment\nhuiet\nmppi\ndemitrios\nzosyn\nokiharu\nintermet\ncoltellacci\nwohlschlegel\nicepicks\nkarambir\noktem\nsimrill\nlhabu\nalokozai\nkursman\nheshu\nduadji\npolkes\nfarfur\nproboszcz\nvricella\nbuerhle\nbahdon\nmultidecade\ncocksureness\nogap\nskiied\npraill\ndicecco\nsyaifudin\ndaunivucu\nrabuor\nhaartez\nunderdosing\naganda\nrowark\nfincad\nabisaab\nperibere\nllabres\nmurenzi\naltegris\nbuechley\nnimblest\nfetc\noestmann\nnoguiera\nsellz\ncontemptous\navtc\nkandao\nkomombo\nunappealingly\nkrenwinkle\nstranack\nwirya\nsimulcrypt\nleviten\nsebagh\nf
aceing\ndejarnett\nknocknagoney\nkohyama\nmidsemester\ncamellones\nshawnda\niotova\nstapert\necotarium\nadconion\nfasanaro\ncatryn\nchakrabati\nappinions\nstuebe\nprecontractual\ntlaa\nasiate\nkosciuszki\nwhoda\ncatylist\ngptv\nrendlen\nsaslong\nadfusion\nmilstd\ndubrock\nbuzhala\nfortrex\ngazump\nchargeoffs\nrealtysouth\ndurned\ncroftlands\nkunsman\nshaktu\nkennedyesque\nlaudin\nwanzhi\ngallups\nsierpina\npaylocity\ngrandparenthood\nballyard\nschneberger\nsilvesters\ngarimpos\nabdrabou\ncupcakestop\nlemorin\nlaicité\nzekria\ncadetii\nmcintyres\nbubbliness\nkahakuloa\njavacool\nsayare\ntoutanji\nalimar\ndecaid\nbowins\nvashee\nzagre\nwebrangers\njiam\nautorite\nedci\ncybots\nhigginsen\ndistaval\ncarvajales\nhybridpower\nsukeyasu\nbrizzio\nbiodrama\nlabarbe\nyaaqob\nsummize\necoark\nsnpl\nkaziu\nmostof\ndissaprove\ngwertzman\nkustoff\neforts\nmonarcy\nmotorexpo\nkressen\nsumeth\nbrilinta\nfuggy\ncukurca\nhangabehi\nswipeit\ndirez\neckoh\nagostine\nembaressed\nmoschofilero\ncuggino\nlightener\ndeffered\ncyberpolice\nchantaco\negoavil\ndumstorf\nestabillo\nkrezel\ndustcarts\ngentianes\ncphpc\nrodenator\ncalifornina\npolitkovskaja\nveddw\nembarrasement\nascofare\nbeckmen\ndrthom\nlapada\nlittlies\nmultihulled\nbookbuilding\nmahvish\ngutturally\ngaunter\nbramow\nguzelimian\nleinin\numerzai\npoipoi\nislamshahr\nphuah\ndrawerful\nopportun\nwalkus\nlacors\nharringon\nplaney\nvengenance\nhofelich\ncvff\noffenheiser\nlowham\ngordonian\nblakovich\nkrenke\nafghanstan\ntakavesi\nfreewest\namercans\nsecamb\nplaygolf\ncptr\nzibel\nmergener\nchinches\nseuer\ncouba\nstarclass\nseghal\nlatosha\ninalterably\nsilagy\nconfortably\ntamboen\njmba\ncatrell\nnkuku\ndutybound\nmagagnini\nfelesky\njezibaba\ncramdown\nfolabi\ngvtc\ngordys\nrugero\nlunchmeats\nrexcorp\nleedle\nconflab\ncrankbait\ngerds\nklion\nhazut\nrepenteth\nkyriat\nwonderingly\ncdmrp\ntabbakh\nbarick\nhaisong\ndjemah\nescritt\ntikvat\nsmize\nhollowest\nsurdin\njacoub\nnoncomedogenic\nantaviliai\nomerovic\nbaissour\nkaboudva
nd\nklarich\ncolbertist\nbodenmiller\ngroeninx\ntekura\ngolotsutskov\nsollitto\nmercuro\ngbubemi\nalfoneh\nbogot\nenpocket\nsofugan\ntowergroup\nseviche\nhiruy\nannahof\nfratantoni\ngoiabeira\nbuzzd\nslaughterings\nglenzier\ninceasing\nbogenberger\ndalaro\ndisaffections\noutduel\nherminator\nmarlaud\nfitzerald\ntrichlorethylene\ncbkn\nspunbond\nreshetin\nbecquelin\nbiotherapies\ntilburn\nottomeyer\npunamiya\nbarakova\nskordis\nmarszal\nzélindor\nshiit\njeremis\nvumilia\ncamgian\nnealley\nportie\nserhani\nrickarby\nmarandola\nuauy\nthepkanchana\nrozzers\ndarkmans\nparacuelles\nreadys\ndemerges\ncampara\nbotchergate\nlevinsons\ngiubbilei\nedusoft\njacqua\ndualchas\nbodhnath\nbildman\nsitcommy\nankudinoff\nudovicki\nrearrests\nfundholders\nuncategorically\npontolillo\nnken\nkeyontyli\nreprogramme\ngittrich\nobamafest\nglitterbest\nzhadobin\ntoktumi\nplumerias\ncopney\nswilcan\ncorpselike\nschwenkler\nteulere\ngammex\npennese\nvaghari\nshakhova\ndeutschebog\npauperised\nbdrc\nhomecall\nwrily\ncytos\nbracek\nlindrup\nwinsomely\nbettocchi\nsibbach\nhozelock\ncastion\nwellsphere\nbusinesselite\ngabbatt\nseptwolves\nalailima\nrublein\nkuehnlein\nidox\nbsharpsonata\nsuperstruct\nshwani\ndebriefer\nwhyles\nretaped\ncrescendoed\nwedner\nhietikko\naaraji\nsidanko\nschenkein\nextrabudgetary\npingg\ncorticeiro\nunilin\ngoedbloed\nkoreanness\nshishas\nduffles\ndadhwal\nantenatally\njounalism\nhiec\ndecrepid\nvenardos\nepals\nrussoli\ncoldiretti\nlichened\nfdtc\nawdc\ncorrolla\nakoskin\nuafm\ntanys\nmfhs\navruga\ngroehler\nmatejovsky\njecca\neathen\nnwnw\nsheeeesh\nsaltshakers\nrepublicca\ngoettle\nmonfortino\nlerck\nfemgineers\nmicrogenres\ndrzal\natatcks\nfudds\nmagicard\nwyeside\nnotbe\nreincarceration\ntranslatlantic\nmenawi\nmutungo\ngaidica\ngotvoice\nfbma\nkishenganj\nbathaa\nbackpain\ngorlier\nsemakula\ngalumph\ncramdowns\nmodelworks\ncompugroup\ngroskinsky\nfoudland\nsdvizhkov\ntamaela\nshalvata\nglenmachan\nhowgills\nbernsdorff\nshernhall\nmerseysider\nnonreportable\nhala
bjah\nfashionableness\nkillaloo\nsvia\nfloersch\ncountermajoritarian\nchengetai\nhemmingwell\nclocklike\ncornici\ngostic\nrealitywanted\npastic\nmintwood\naustock\nswindonians\nfadumo\noffseting\nmenrath\nsqueeking\nhypocaloric\nmeehans\nexculpates\nshowily\nloiederman\nsevmorneftegaz\nridong\nblgm\ndewormed\nsmashups\ndyomushkin\nkocijancic\nleacann\nthaobh\nfoofaraw\nsacrified\nlegbone\nshurqat\nqueuer\nsmartcool\nrychleski\nzrihen\nunselfconsciousness\nmuguti\nhydroenergy\nresponibility\nresel\ntetchiness\nsaffiotti\niobe\nyoussry\nshillibeers\noverindebted\nvillement\ngiorla\nmohammmad\ngadw\nfayant\ngovernmment\ngabeler\neisenbraun\npepic\ngorbold\nleisureville\ncollombat\nobsta\ndowdier\nkraskin\nsnowballers\nhysenaj\ndolfino\necott\nbornfree\nfergany\nleverence\nmavaddat\ngrewar\nresetters\nfuriouser\ncarhaulers\ndwurnik\nfetishises\npretape\nfrankest\njenkinses\nrhuhel\nbegner\nleefolt\njenev\nmayed\nbanlieus\ntodner\nfreakily\nmagunga\nfebruarys\npushiest\nneoclassics\nnoncommunication\nschui\nkuehnert\neindoven\nsquitiro\nkallakis\nsimantha\npratheepan\nhickorytech\nnacil\ncyberactivists\nhealthtrust\nlijit\nagiles\nhirsuteness\npowerbuoys\nsuperchic\njesdanun\ngulliani\nsluglike\ngortex\npentapeptides\npowerpad\nbuckleysandler\npurbecks\ncomforce\nmksm\nguadalupanos\nmerkavas\nkhizeh\nkharatian\nemoly\ngoffard\nbelinde\nvinals\nhoovler\nbakkevig\nsigwalt\najones\nhermangarde\nvillicana\nvietman\nhungy\nuhpa\nthostrup\nesotouric\naudibled\nvladymyr\nchevedden\ntimbercon\nunczur\ninamed\ndalati\nnonvitamin\nkaraganis\nspatisserie\nheinitzburg\nsparapani\nsentencers\nchvotkin\npordy\nnorpramin\nfirrantello\nswooningly\nmuseminali\noivind\noverprescribed\nmonoply\nliddard\ngoalkick\nbejzat\nschuey\nneohoodoo\nactivitists\nregionalising\ntienna\ncortelco\npevero\ncradlepoint\ncromme\npharyngolaryngeal\nholters\nalvarsson\nguidettes\njcrew\npixon\narshed\nouthitting\nicpas\nistabl\nswiterland\nkodindo\neilu\nadhyaksa\nshrooming\nstridence\nadcolor\nhantro\nmaur
ading\ngvep\npharmavite\nshuffett\nsomprasong\nmvaas\nprevor\nsharpatov\nahhing\nfourwinds\nspaghettini\nmeing\nirelend\ncciced\ngunyon\nbrincko\nmachingura\nunsubordinated\nbrilliantined\nbitsberger\neasytown\nmilenkovich\nfofanny\nagleam\ntataei\nhomoepathic\nvartys\ncspro\nbagl\ndowsey\nmegabudget\nmajescomastek\nhousemade\nabdhul\nchikez\nufizzi\navator\ngrawunder\nartrip\npaudorf\nknockbreck\naestheticising\nayanoglu\nirbd\ncantania\nsinawatra\ntaxachusetts\nmitsubish\necopy\nsabip\npaskiewicz\nunanswerably\narmagen\nlengfelder\nallerdene\npertoldi\ncosmopulos\ntaddio\nkatiforis\nlangdeau\nbabkas\nrationalizers\nmenance\nsouléymane\nhuslin\nmenveo\nfranceye\nreedlike\nmontmelo\nuscategui\nctagg\nagrofuels\npacknett\nobstreperousness\ncinemanx\nkocen\nrecchio\nwoebcken\nmenupages\nbedmaker\naffrunti\nnuegados\ndukette\ndarzins\nconstituants\nclickables\nlbjs\nfinuoli\nlovrien\nrhude\nschöppner\nviilo\nwanabees\nsurenas\nshalley\nauconie\nzhengzheng\nchalcroft\nvottero\nofflimits\nweedless\nsovie\nbetreiben\nmizuni\nmuyale\nagjobs\npanjwin\ncallvantage\nshadowboxes\nbuzhinsky\nconcernd\ntaffety\ntskj\nabbaya\nmypublisher\nkahumba\nsehee\nfolksier\nhaszeldine\nmutrif\nbarbach\nbedforshire\nroglic\nsubsidizer\nolarn\ndippings\nperell\nmawatheeq\nsunkids\neasdon\nhealthview\nefmr\nchunter\ncirendeu\nlovesounds\ngottliebova\nintelispend\nparnaz\nsenizergues\nlancovo\nnumide\nrewatchable\ngerdak\nfuradan\nolsenboye\ngeyskens\nibaviosa\ninstantaneousness\nbouchiha\nnleomf\npicolotti\nwaltonen\njesture\nchunping\nalevels\ncaluco\nswashed\ndefenestrating\nmakaela\nchitrabon\nshallman\nonhttp\notarian\nsangzao\npollyannas\nnegotions\ndialaflight\nporgo\nnorenzayan\nnebras\ngenderblind\ncraggier\ndumsday\ncroisieres\nopenhouse\nerrupt\nabeling\npieraerts\nkendar\nagovino\nyisan\nweatherbill\nlybba\nembyro\nvidosevic\nmadeover\npansiyon\nchipkar\ndelgates\nchalghoumi\nstapely\nbasqueness\nmurasawa\nremunerates\nhanick\ndocca\nndubi\nieni\nsuhy\ndundees\nhweg\nhealthfood\nfi
llpot\npresinal\noverinterpret\nisotis\nkimanthi\nsloter\ndandana\nsprinboks\njanuzzi\nstodgily\nsynflorix\nrabanel\nnamibrand\npremsingh\nembeda\nartventure\nviccaro\nmoalin\ngaisanov\ncaleen\nlinktone\npeterkiewicz\ntransperformance\nkudlacek\nmauritsson\ncarpluk\ngagor\nsiemieniec\nromatic\namazins\nlavasier\ncafedirect\nllif\nlocca\nsupertour\nbrooming\nbestsellerdom\nshaqil\nbraintech\nlightcycler\nsenesac\nexcercize\nsalleras\nsonthoff\nfingermarks\nschoendorfer\nundivulged\nschmerge\nalhajji\ngoryachko\nalcaino\nsermonize\nendorsable\negmi\nalldred\ncotweet\nshinguard\nlshc\nunderthings\nwedag\npreboarding\ncarhaul\nguindilla\nelitek\njardee\nbribesville\ncandlewax\nbetsafe\nalayan\ncrisfar\naquafarm\ndrayna\nfenyn\nhamahara\ntranquillising\nliquidised\nbollgard\nabhorant\nsegaram\nduwisib\nkondaurov\ninnerpreneurs\nboxroom\nviap\ndelegitimised\nstuerzinger\nshalina\nmicropulse\nwoodburners\nlegalizers\ndeulbari\npalies\nnoncurrent\ntimeliest\nesikia\nneflix\nddeddf\ngrässle\nyahnke\nwakings\ngudmunsson\nsciple\ndadur\nandrezj\nizundu\nmudawar\npreservación\ntashiev\nfabrizius\nmbcn\nsanjayas\ntriossi\nshengying\nvakacegu\nhalahuni\nkanamma\nfyfyrwyr\nnumaniya\npartnerworld\ngovdelivery\ndelapp\nyribe\nmuntaser\nbudgetarily\ngreenopia\ntricast\nngler\nsoprovich\nlivng\nwelcomingly\ngilliot\ncupchik\ndroptop\nhashimzada\nsarzanini\neroshevich\nmentgen\nfajarina\nwolozin\nsweetshops\nabeibara\nthaweesak\nhauptli\nzehfuss\nschmocker\ndrehers\njurisprudentially\nkantstrasse\nmexicola\nkabbia\noctf\naelod\nzaitschek\nholslag\nragil\nseamoor\nyusanto\nspryly\ndurabook\nyamileth\nmcdermed\nvardai\nauwarter\nmorphotek\nwimpier\nlamielle\ncimatrone\ncreditreform\nbaybio\nsquidlike\nwarmist\ndouetil\nempoyees\nincessently\nscppa\nnukhazhiyev\npockmarking\nrypkema\nbudweisers\nbesuited\nestrasorb\nmatlary\nsisvel\ngrandstander\nintermissionless\nenzos\nphotodna\nensus\npipart\ntacnav\ninvestorplace\nziegelmueller\ndiscomfits\nteeson\ndiscourtesies\nscardapane\nmmcfd\ndj
idonou\nzdralek\nowiso\ncemetries\nsiwakoti\nroussouw\nlykendra\nhulahan\ntezampanel\ngalzerano\nbliemeister\ntarculovski\nstuppy\nbestirred\nclarklewis\ndiscrace\ndressiness\nzuffo\npeales\nchidlren\njfin\nucdmo\nmaoxian\nthatwas\nresperate\nrosprirodnadzor\nbelcom\nstoddards\nswistowicz\neatingwell\ngankhuyag\nwamai\nlativa\narsema\nlounguine\nhyundais\ntelik\nszello\nflakiest\nwidyalankara\nbaumohl\npriveledged\nkarzais\nmcgrathnicol\nzibari\nredpeg\nnonathletes\nskiby\ncroony\ngraduat\ngedarif\nimaginero\ntongswood\nmakowka\nngers\ndiemtigtal\nterer\ncusati\nforadil\ntyposquatters\nmunith\ngameplans\nfrowzy\ntriozzi\nschokko\nmoviehouses\nlaznicka\nmodishness\nenewsletters\nbitomsky\nsevercorr\nkardel\nmistimes\nperfectability\nmeaulte\npeformers\nromazzino\nevrony\nmonachs\njeitawi\nmercher\npolictics\nuncrustables\nivobank\nsmartcenter\nmizners\ninvestisseurs\nmendendez\nraisingkids\nanselmetti\nandroidguys\nkjaerholm\nrathode\nkvell\nmbrt\nsnippety\nflageollet\nduanna\nwaddya\nsekitani\nmbomio\nshipquay\nbhailal\ndeflon\nphlebologist\nehly\nsrebnick\nunstrapped\nprobléme\nlamperth\ndenty\nkoik\nbestofmedia\neconomicaly\nlouring\nchioco\ntapey\nmamouri\nmanslayers\njiabo\nfeckner\ncapla\nfarimex\nfanuzzi\nkeybanc\nbspm\njaromin\ngabrysia\ndyarbakir\nunderresourced\nkatorza\njalalludin\nlaperrouze\ntftd\ngeosa\nhadlaq\nfltc\ncaucasin\ncamaign\ncepelak\narboriculturists\nrymers\nsuccomb\nbrainlock\nfolksters\nbowmark\nsuperlarge\nunexcitable\nwerrin\ntrively\nnajlah\nvegging\novercalculated\nhonex\nnonroster\ncblc\nappworld\nlinseeds\ndandey\nmirarchi\ngivernment\nniangadou\ngeorgeou\nbrownites\nlandsource\namicas\namariyah\nensky\nfuturesex\ndandling\nromash\nschuerger\npeadophile\nsadeek\nhobrough\ndeveoped\nireo\nalwad\npervent\nstarpharma\npointiest\npeacemonger\nbedecking\ncherenfant\nblockdot\nturnas\nfrockcoat\nrameriz\neggos\njoester\ntakash\nnitrofen\nkirtonkhola\nbankrutpcy\nkinleigh\ngrunander\njinqiang\nrotundone\nalberes\nhoehler\nsczech\nfedroff\nf
akka\nsaakashvilli\nshahidur\nariannu\ngeravand\nmillana\nbrainforest\ncanyou\nmurfee\ncasopitant\nvirganskaya\nmiscontrolled\nchalupsky\njagging\nmetsblog\npaparrazi\njaakke\nnickolenko\nkoumis\nfassotte\ncrocop\nshosteck\nnationalsecurity\nfarmstands\nsustainabilty\nkuwik\nshaabab\ntongyao\nrightsize\nwhaps\ncourics\ncrosstabs\nwindenergy\nmasonson\nreupped\nshengjie\nrayaam\ninnerlight\nusdjpy\nsteudle\ngiebelhausen\nakahane\nopont\nbrichet\nviptera\nwetly\nrefrom\nalmajiri\nnunemaker\nnitzkydorf\nharrenstien\naddm\ntianmenshan\nparrson\nwickelgren\nmiatas\nchipmaking\nnanomoles\nclampi\ndjabar\nbiddlecome\nbackcomb\nhawkamah\nlotempio\nsheese\ngoldbogen\ngoldhawks\nmyot\nconstution\ngrogans\nphonetapping\nimigran\nwimpiness\nconservers\nwesla\ngweinidog\ncolonics\nbritsoft\nweetjens\nclapometer\nlyssenko\nsuhandi\nconstar\ncotcher\nfullol\naltamuskin\nkovida\nsankov\nbohstedt\nnosir\npbxpress\nhladyr\nnimrawi\nfiroozi\nnivalin\namerichoice\nflurrying\nsimonia\nintracommunity\nunsatiable\nretailvision\nhomosexualists\nleisman\nviehland\nazda\nschwarznegger\naudacities\nioanes\nfiim\nswaba\ndipeso\nvaldan\nunthawed\nhouseroom\nscatback\ndamous\nanberber\nwoundingly\npaslow\nfinsley\nschnaidt\ntoelke\nstressfully\ndiamonbacks\nscrawnier\nandriamananoro\nsolamar\nbirillo\nfrape\neverdream\nblugerman\ncanolbarth\ndenbigshire\nmotiani\nlimons\nchhoekyapa\nldar\nballacloan\nhuco\nmiljen\nmuhmand\ngripers\nxiongbing\nrungi\nthuer\norsow\nkaewkamnerd\nstrizhev\nreestraat\nmawji\nbasei\nsadrau\nwemos\nfulbridge\ndillute\npreldzic\nguskiewicz\ntsuneyo\nintelligroup\nnadaam\negotastic\ntepidity\nlorencin\nequitisation\njitloff\nskimpiest\nlegetic\nsagovsky\nayoo\nbourdonnec\nbiodeisel\nsalesgirls\nlaviv\npuchase\nbrotherwood\ngaricano\nmedcare\nlourda\nvanderhill\nexomos\nvaniak\nnicma\nmyrddyn\nsuweidi\necstasea\nthamanya\nvergnoux\nsafenano\ntopups\nshaherose\noverridingly\nmtoko\ncommisssion\nkrkic\nbytemobile\nmicunovic\nzaffuto\npamodzi\nmonib\nmouphtaou\nexonerative\n
gilgoff\nwormersley\nvorkapic\naloxi\natutxa\nvesselbo\ncauterise\nskomal\nmediaedge\nrefraim\nproscuitto\nzannel\ntorries\nharthiya\nserener\nbutzke\nwittkopp\nraea\njoyella\nhehea\nsneery\ntarlike\nfreilla\nkittlaus\nnallely\nuxua\nmercadeo\nohmed\nnesvold\nskiptracers\noligarchial\ngrmovsek\nlashay\nchavarin\nnafee\nnankani\nmailee\ndestigmatized\nkajer\nsteffanoni\nshofield\nsimonovis\ngraveses\nstoba\nnontribal\nbaleri\nlaysiepen\nsheepmeat\nplushly\ntechinsights\nmalariologists\nleeae\njinren\nlontscharitsch\nlaunsky\nestefans\nmilovidov\ntrpceski\ncucurto\nsheepshearing\nnovantrone\nelecnor\nthoumieux\nimrali\nisports\npeckerar\nxyzal\nmerler\nwiredsafety\nfatheree\nrepublicn\nccdo\nmauly\nldbrain\npascuale\ndecitre\nembezzelment\noverparenting\nresegregated\nsopexa\ntrayless\nkiyawa\nferraccioli\nmirthala\nfoodspotting\nfittall\nmanchay\nkanyen\ncdii\nrekulak\nroscar\nsellergren\nprivateness\nfujimoristas\nsulphonylureas\ncrosscage\nlowgrade\nsilversneakers\nyodler\npaulsin\ncndr\nmadagascans\nbramell\ncommittted\noncourt\nroesing\naddlington\ncflp\nmotodev\nskcin\nmulverhill\ncoddler\nmiaohe\nshortlanesend\nkanatzidis\ntianren\ndiulio\ncnoa\norrante\nmhia\ntonankai\ngelil\namanatullah\nspyk\nkafalas\nbearhop\ncaurier\nsorgato\nstroehlein\ntextplus\nchuvashov\nswaptree\nannc\nproblaby\nrobohm\ndesigncon\ndeviatovski\ndreisler\nbotach\ncalifornicate\nsushinskiy\nchocalate\ndejac\nhosptal\nvolosky\ncashay\nverazzano\nzimberoff\nmahmoodzada\nspasmed\nbelongia\ndcip\nsinkable\ncardtronics\nbiancoceleste\nppdi\nmalpasse\nbambling\nmytongate\nproxes\nredsell\nefforting\nphotocalls\nnonclassified\nairprox\ncentenery\nmanhali\nlasensky\nbilions\ntancer\nsosinski\nsinuousness\nbancruptcy\ncosmeticians\ncurbsides\nozguc\noverbill\nnoshehra\ndoht\nfasslabend\nrhoni\nukcmri\nvawd\nbjorkestra\namygdalas\ndillweed\nstreetbrand\nmobilebeat\naurilla\nfeuchtgebiete\nagdur\nprotetion\nunbylined\npetroline\nmonistere\ntramontozzi\nhyytia\nshortbreads\ngauntness\nvictom\ntzampa
zi\nunumb\nexecuitive\ncmev\nticketleap\nfulminator\nonseong\ninitatives\nazzoli\nkettelkamp\nrahmin\nclonking\nrubefacients\nllidi\ncubukcu\neliphante\nxhua\nprixes\nyuexing\nthirteenfold\ngiley\nworoud\ngeniesse\nseonag\ntschofen\nmelodramatist\ndruba\nrositano\nmagyarosi\ntubruq\ndeurwaarder\njanoyan\nalmight\npresgraves\ncawrse\nbethworks\ndruglink\ncornah\nzorome\nhonnington\ncuvees\nkeybridge\nskyports\nsegalini\nmausoom\nnekhaychik\nlisnagelvin\ndroopiness\nmpel\nkahayla\nreawoke\nsplashbacks\ncivilans\nfcpf\ndatolo\ncryans\ncatastrophize\nzeulner\nchvc\nschoolboyish\ncampanologists\naquasplash\nshopdropping\niwonder\nvideomaking\nantiseptically\narmenistis\nflexibilisation\nmichelletti\nklueger\ncrofthouse\nreservationist\nmicrofleece\nnsengimana\nemanuello\nmojadeddi\nwafty\nbreaktrough\nfragmin\necel\nhuvane\nhildburg\nskirr\narzerra\nzainabu\nmirosoft\nneelsville\nsaintula\nregistrational\ntoporoff\nkolymsky\ndiogel\nweidenhamer\ngellerman\ndelapoer\nsoedarmo\nbreea\nmohaned\noversalted\nhŷn\nsvonavec\nnigec\nweitzen\nsabhnani\nfluffiest\nkirmanto\nsheduled\namsec\nnooriala\ngyrru\nbernann\nvolac\nabdourahim\nraithatha\nheakin\nbabyliss\nrodabaugh\nlymphopoietin\ninautix\nsekur\nsopera\ntedxeast\nprakosa\nsafeen\nsovietskiy\norcun\nibfd\nvoong\ndarparu\nhosseiny\nwesthrin\nsurrended\nsandostatin\niritani\nflorinef\navgousti\nbidegorry\nshaanika\nflocoumafen\ntravelportland\nwolraich\nwerschkul\nfabtech\nbermanzohn\nbedstuy\nhammertoes\ncrabaugh\ncompasspoint\nburnoose\ntriccas\nafbnp\ntrillon\ntaiy\nillegall\nmodernica\nprefunded\nsubservicing\nsuperefficient\nnacr\nirascibly\norcy\necochic\nkittoch\nroona\nsuitemates\nsassily\nundersells\nliphardt\nmagrao\nbasyir\nladaris\ncockups\nnonblacks\ndulsori\nunknowably\nmangassarian\ngratl\nkhodori\nbestel\nomazic\nmorayniss\nexcessing\neniel\nbaimurat\nshortgate\nzeszutek\nverwilst\nballcarriers\nnasry\nkabaha\ndeminishes\nsafeseanet\nabdisalan\ncorevalve\nwynx\nrushaid\nbaofang\nbarnacled\nintercytex\ntjoka\n
bratzel\ntouadi\nlawenda\nmtlqq\nwheeland\ncrystalising\nkösters\nvoped\ngotzis\ngulity\narbabzadeh\nkodner\nebank\nslickened\nibeo\nmandoon\nbbag\ncauterising\nalmeira\nvulovic\nriase\nuttecht\nreteach\nnoridian\nissott\ndentaquest\nproperity\niskoot\ndensign\nmckenzy\ntogarashi\nbcbst\ncaufman\nkuryama\nlogvinov\nadimora\nmidcalf\nsoooooooooo\nkozlowsky\nnalci\nspectracef\npeterses\noikomi\nmaksin\naliferis\nintersegment\nvetoryl\nmotterlini\nguadian\nenzhu\ntacoda\nsweazy\nmalbran\nyankess\naskana\nschweiter\nbarrowing\nrepoxygen\nhibey\nyogman\nvisioneering\nllds\ncheriegate\nrixie\nrastetter\nblameable\nberkenfield\nkanowitz\nyurk\ncocktales\nangelette\nsarikas\nyekaterinberg\nexaggerators\nmolczan\nsungar\nbelhoul\ndannals\nroundbush\nlarotonda\ncheskis\nhoiem\nmyrup\nvoraciousness\nsiblin\nmulleted\nretainability\naqazadeh\nslavishness\nbushco\ntasali\nozgen\nblabbers\nbamat\neshaunte\ncbrj\ncavates\nmegaclub\nflyblown\ngaulthier\nluzkhov\nlantheus\nnicholashayne\nsayedabad\nmethu\nfatula\nzaback\nknutmania\nwunderkinds\ndescriminated\ndapartment\nkardous\npratomo\ndujkovic\nchedham\nsleepiest\nwhitts\nroubatis\nathenaeums\nopportuntiy\ncbnc\nsavageries\nstarkevičiūtė\nwhitwham\nnessiteras\nfrontcourts\nganeshas\nskeered\nlebida\nuptegrove\nmilnesand\nkuwaits\nmecary\nalexza\ngrisliness\ncommodites\ndharmapalan\nfriestad\nraafa\nhaymills\noverwhlemed\npursestrings\nfanhood\nhalldora\nooil\nmungers\nvautrey\nchunxiang\nholtslander\nbonczek\nparkatmyhouse\nalderwick\nwimik\nhogervorst\nlucquin\nabdiqadir\nribstein\nyakutugol\nscovilles\nbaybasin\nactemra\nquellenhof\ndeeanna\ndanday\nasrv\nognianova\nvolpendesto\nvetcogray\nslipenchuk\nepiduo\nreputationally\nrorshach\nbeanee\ndingfu\nivvr\nopprtunity\nnazek\nenior\nrapdily\nrocketbook\nmeterologists\ntushies\nsautés\nwybodaeth\nyoussifiyah\ncavm\nmoeny\nzerina\nbosinger\nobag\nsfiso\naminosalicylates\ntravelscene\nrrev\nwondertime\nindusrty\nluotuoshan\nreplikins\nsceats\nkemigisa\nlingerfelt\npacheo\nringwalk
\npostponment\ndecalred\nstyleless\ngieseker\ngimcracks\ntreatery\ncovar\npalmchip\ninovis\nienm\npeberdy\ncountesswells\npredjudices\nbirthweights\nqioptiq\nkafayat\nwawrzynski\ntertsakian\nteixeria\nbastianello\nbrauser\nexpotition\ntepotzlán\nboardies\npolically\nmeltoff\nsorur\nexplainin\nferrlecit\ndomboshava\ndataworks\nconrol\nehcr\njdcc\nstoleshnikov\nsongfests\npetursdottir\nhaakan\nflation\nhussell\nbetac\ngghc\ncovalt\ndoumato\nmgcs\nscanties\nsuyatno\nuninterupted\ngjenero\nségalot\nvitamind\nlibforall\nmegabuck\npropeack\nfuhrerbunker\narijon\nrondeli\noudea\nshivalika\nwrapit\nshufro\nsaparmurad\nfreshfarm\nbouskill\ngreyber\neurolist\nmobclix\nschappacher\nsovio\nbaghaturia\napplink\nshorters\ntweeze\ntrustnet\nnaidex\njanetos\nsheffielder\njcwi\ncoldbloodedly\nhoobrook\nnesnplus\nstiksel\njotr\ngventer\nkinnings\nteochow\nrightie\npancost\ndetillion\nbabyyeah\nbibbers\naneg\nnkombo\nsauvagere\nanham\nimplmented\njharkand\ntamishia\nadimab\nzlinux\njabob\nifcj\nritchhart\nmantsch\nheeeere\nyunessun\nnagareda\neleviate\nresponsble\nkiconco\nklebolds\ncravenness\nwitlox\nmoezi\nborghoff\nngaruiya\nsnackable\nrohmat\ndepatment\nphotoframe\nkubaisa\nmarktest\nmbeke\nraydi\nbiersdorfer\natyr\nretraceable\nalrajhi\nlybrel\nnfumu\ntemik\nmennill\nshareese\naisight\nasdso\nmcalorum\nsuperphones\nkfhp\nipsonar\nwohle\nheinbockel\ndinasaur\npuct\nfarhani\nnerazzuri\nloveing\nentrepeneurship\nsubowo\nenrst\ngardenless\nmagazzeni\nokusanya\nprimness\ndipersio\nmilenky\npreppiness\nkruess\nslfc\nschwingel\nrulemaker\noffthe\nrdcm\nidiz\nmyvouchercodes\nstaidly\nehealthinsurance\ncalambokidis\navandamet\nxlhealth\nhatherall\nokeyo\nstuiber\nhellmold\nhuseein\narmanino\nlegasse\nmersbergen\nveteto\ntegni\ngonek\noverdorf\ntauting\nslavelike\nmaguwu\nmishura\nboniwell\nkulibaev\nnightdresses\nhawkpoint\nchandrakasan\ntfwa\ngarshelis\nqadus\nhanci\ndecarbonized\nluckson\nfrownfelter\nweece\nnamina\ncunsolo\nharsheim\nspritzed\nemergengy\ncommunitization\nlinak\nnilpfe
rd\nsibello\nguyaux\nchalkpit\ntastily\neconomi\nazilect\npatchworking\nammart\nshamshatoo\ngrugel\nsupermiddleweight\nrenane\nschurenberg\nmotcomb\nunitedlex\ndisported\nzarkovich\nbondareff\nacquiesence\noitavos\nbardino\nmoygannon\ndystopianism\nfadeaways\nchuvit\nlobukhin\nzargun\nscreenagers\nyrfa\nmetwest\ninaugeration\nudemezue\nkrajacic\nmurm\nrtea\nainardi\nbrainstems\nsmartplant\nresown\nnethawk\nillahun\npurivatra\npapalexopoulos\ndreamiest\nconomic\nrouyet\nxxviiith\nbirded\nligorano\nyinghong\ncogcc\nlaipson\nanpac\nlegarie\nsherbino\nsakakeeny\nrakosky\nyazel\nwagemakers\nerbol\nbueti\nparadysz\ncornw\nbimha\njungala\ndemartinis\nmaialino\nzovath\ncaplat\ncreditex\nturver\nsoays\nmannato\nfelhi\nslipskin\nemmans\nbabyfood\ngirthy\nmathendele\nballydonaghy\nschallhorn\nevote\nembrassed\nusprotect\nrudebusch\neveyrone\nkonyk\nammor\nmarcelles\nejii\nhelloooooo\nmcgivering\nalchoholism\nsonfield\nnomaan\nchenia\ntjarnqvist\ndanneker\nextavia\nwtert\ngibbscam\nhnilicka\nsarandrea\nmoletress\nmanjang\nbdti\natoki\nambitous\nmaldron\noktavec\nasmundsson\nbennathan\nbiojet\nkeefauver\nmouillefarine\nbackslap\nsixteenfold\nmarketaxess\nvarasano\npeoplecare\nsgarabhaigh\nmulticard\nsameere\nlaserline\ndelao\nofeibea\nyouselves\noneriot\ndreamtown\nioli\nliskey\nsatellier\nkelsoe\nweitekamp\nbutchness\nignoramous\ntumai\nmimb\nprepandemic\ncarmountside\nzanzinger\nsteelriver\nmoneyback\ndrabbest\nhadijat\nrelishable\nsiyavus\nchiavaroli\ncapelles\nfordrough\nvatuvoka\ncolllingwood\ngalichia\ndenegration\nsalkovskis\ndegideo\nlusuardi\nmyfortic\nkijivu\nwotring\nbarkor\nclolar\nmehp\nedgelab\nheavyhandedness\nforard\nkruzich\npizam\nmundow\nheshmatollah\nkingspark\nrequelme\ngillibrands\npregis\nyaverbaum\nrajivan\nriverso\nwifelets\nhandlova\nswiebodzin\nsheronick\nkultida\njumale\nfrothily\nwadnaha\nremail\nsuddock\nfelzer\nkaradogan\npapped\ngausi\nthainstone\nrolofylline\nwasbir\ndepoliticising\nrippons\nspiehs\ntambaro\nszczygiel\nguduric\nhradecka\npuckery\
nsiliconsystems\nsnarkiest\nrcgm\njumhuriyah\nroddymoor\nsmuggs\nwhitmyre\ncontompasis\ntransis\ndrevets\nschnure\nwolflin\nkoyle\nmerkels\ntarnstrom\ncolclasure\nplasil\nantimiscegenation\nghalanai\ncatroga\nblansett\nultramarathoners\nbusefink\ntoumaz\nuthemann\nfloatables\nmiuro\nrethemeier\nvanquishers\nxcellerator\nyeffet\nwagtendonk\nlofalk\ndiminsh\njfit\nkueffer\nstroppiness\nimboela\nsatphone\ncvrg\nadmati\nloyko\nmoany\ninestrosa\nmileyworld\ndelapidated\npeerialism\ntirschwell\nsheppeck\ngenender\ntripartisan\nkorndorfer\nragatzu\nlemaine\ndongrong\nngarongo\nkayse\ncanally\nkipness\nbiofach\nbettinga\nolibeau\ntravelsphere\ngerasimowicz\nsirls\naleece\nschooyans\npuckishly\nfrederikson\ntelecommutes\nboozin\nscoblic\nreinarman\neulala\nkremo\nshetgaonkar\nsipkin\ncameronbridge\ndemutualisations\ngueros\nveyrons\nanticrisis\nravelian\nbuttheads\nspalga\ntiwonge\nnovaquest\ndecarl\ndinned\ngreeland\nquershi\nwazne\ncruisy\nitlay\naidcp\nincomptence\nrozmarin\ndinela\ntumorgrafts\nbiogeneric\nguiltlessly\nsentel\nbridgelux\nministy\nphotronics\nmyscene\nbaharvand\namankora\nnuron\ncuhna\njinzhao\nilleagal\namaney\ncharchian\npoupi\ncdxc\nkhoula\ncribsheet\njovtchev\ngelées\nwindstars\nbehrad\npollastro\ngaloots\nuberior\nmicras\nlibaud\ntrueful\ncharfauros\nmarketpulse\ndelorian\nvucitrn\nhoelzl\ncynnie\nshawler\nvefour\nmislaying\nchinaillon\ncanlyniadau\nthinkpiece\nabdulkhaleq\nmicrosofties\nbothof\ndiversifier\nozery\ngoodhartz\nricking\nlloydstsb\nmittys\npollyannish\nportmead\natayeva\ndeflationists\nfishbowlny\nopensea\nchybowski\ngoplo\nmozakka\nyarish\nmultitasks\nyaghoobian\nsperle\ncrossmans\neurong\nusiba\ntimorously\nnlst\ntawassoul\npongy\nmnlu\nprueitt\nwuerttemburg\ninvolontaire\nchuckleheads\nprendegast\nongoin\nnannis\naerias\ncivitello\nshpt\nunctuousness\nherpetophobia\nmaastrict\nipof\nmenerbes\nnintendos\nstarmaking\nsaifan\nphoniest\nlochrane\ncopenhangen\nwildbeast\nfunseekers\nunhealable\nmisconnection\nnumpties\nsostanza\nannapolit
an\ngotova\ntenojoki\ntrueform\nsubstudy\nalioli\nddpa\nworlsey\nfrostfrench\ngeneralissima\nkopanya\nphoneys\nalteratives\nimpassiveness\nsvvs\nhandu\natemschaukel\npreachments\nfsrp\nmicato\npopolzai\nzahidov\nashooh\nbackdale\nfougera\nmonumentalizing\naranyosi\nbrentside\nafinitor\nhantzis\naugignac\nneujahrskonzert\nchistians\nupdos\nhargy\nmohammara\noveregged\nhazeldell\nagwara\ntipsily\nbeehaus\npresniakov\nbizanski\ngioffredi\nkisseberth\ncloakware\npsychopharmacologists\nmissery\ngêm\nitrk\nlittlestar\nsaydabad\nlaiskonis\nvonona\nbrimbles\nambiq\naltaroma\nprampero\ngelpe\narmellin\nmiskimmon\nlugacy\nmebaa\noetzi\nharns\ncentrebacks\ngayhart\npfoten\ndeleb\nbraehmer\ndesurvire\npetrák\ndujanah\nfouettes\ndomeyer\nbevelacqua\nokechi\ncinner\nghebremedhin\nhanoman\npulino\npuladi\nwaldera\nmexes\njunifer\nexpobank\nsherona\nfrappucino\nkerchers\nedwight\nngunyi\nnumerati\ngawrieh\nkeppert\nrecyling\nplumy\nmrct\njulfikar\nmazrooei\nsikelele\nunpeel\nccggy\nminiaturise\nschretter\nbrussow\nmuseumnacht\npencalenick\nduzgun\ntarsin\nbmwed\nvappie\nseventysomething\nmaturén\nwonthe\nghanea\neuropeanise\npowwowing\nsaod\nkrebes\nmbec\nlecourtier\nhisave\nniezgodski\nafghanization\nxtify\nmishori\nsimine\ncyberthief\nnardell\nsuitter\nebio\nultradome\nicengelo\nkierstede\nchayanda\nbeniston\nmagrakvelidze\npregnacy\nambercrombie\ntsep\nepus\nbertrands\npasick\nundermanning\nskellett\ninnoncent\nmanochehr\nthewhole\nnlsi\nmergasur\ntwea\nlasharie\nvinotherapy\nshriberg\nmisrecording\nnathee\nzelio\nustp\ncraigrownie\nokume\ncraphole\nbergthold\nsuntalk\nimportatn\ndamgard\npescoluse\ndonguan\noffroader\ndufan\nllanspyddid\nzivana\ntroyers\nswapceinski\ninhospitably\nqongqothwane\nbureacracies\nterriost\nkoshlyakov\nchintzware\nwackjobs\nvalensa\ngadlin\ntidemill\nmarketforce\ndjouadi\nbennites\nturbinator\nmillesima\nddisc\nsarkozys\ntbar\nsongkitti\nbottlenoses\nsickey\npribetich\ndelamo\nojon\nkruman\nyelovich\nrehomes\nsmartridge\nhendarso\nbulovic\nmogulof\ng
fcm\npitville\nmontari\nkassinove\nkaddatz\nrammeloo\nzahhar\nbudino\ncorlato\nbellafiore\nagurs\nyongze\nunbrushed\ndoepfner\ncasteu\nbascher\ntecsis\nesterhammer\nopenhanded\nampareen\ngalleano\nlongers\ngorwitz\npenrhosgarnedd\nnejia\nmrkt\nbionj\nbanksys\nkucy\nhuenke\ncastignoli\ncallxpress\nlyophilizer\nsternii\njaaskelainen\nbordry\nmehrer\nmlyako\nmahami\ncharvez\ncaginess\npigmyweed\nnegbi\nscabrously\nmollura\nrhotert\ngindalbie\namerca\nluedicke\nhyggelig\nnonutility\ncorraleja\nspacehoppers\nhspda\ndelisio\njenden\nscane\ncampie\nrhor\nvoyennykh\nperezes\nprvni\nagbere\nehleringer\nroadley\nshabaa\nwapava\nditomaso\nlionising\nlafollete\ncallusing\nregurgitative\ncandacraig\nkasisopa\nifcn\noladokun\nabdulbari\ncrystalizing\nvardakas\nalvidrez\nswaggert\nperhach\nunberth\nplonks\nkinnair\nfloobs\nexpendible\ndjahid\ndeclassifed\ngirdner\nsaleapaga\nchelski\nlahouri\ncontitution\nexiqon\nqristyl\neccelstone\ngfes\ngonso\nsentimentalise\nengwirda\nciesco\njanury\ngriffinger\ntosas\ngrubesic\nparalyzingly\nheadscarved\nkonje\nambelopoulia\nbirbili\nzoerner\ndannels\nbleepers\nboessenkool\nswirlie\nhanono\njrct\ntoadied\nskoby\ndexis\ndebtload\ncarkeel\nlaubhan\nychydig\nmegastardom\ntrusch\ntovel\namericn\ntilhill\ncoachload\nhirabe\nbetchley\nterravina\niggles\ntradelines\npoyry\nefjohnson\nncayiyana\nmarmoleum\ncordings\nschorner\ncommmunist\nshelterer\nmehmedinovic\ncanuteson\ntalkiness\nboutoille\nbirbragher\nnonoffensive\nlefurgy\nqueenmaker\nflosser\ncadungog\nsummerdance\nbillips\ngambolled\ngindara\nwedgetails\nhusseiniyah\ngirfriend\neacom\nwitnessess\nborha\nrodhouse\ntrevarrian\ngaucherie\nbrinkmanns\nnjar\nfasttech\nlandikotal\ndavymarkham\nsaglie\narclin\nshowgoers\nsahaab\ngazumped\ntailies\necomonic\nthriftiest\noverprovision\natwinkle\nkifuji\nmethyr\ncashpoints\ngarganas\nwetang\njekabsone\npandaw\nwanye\nrozansky\ndorger\nleadbeatter\ndtvpal\nsusanyi\nkiesner\nbeituniya\nnoncaloric\nhampsters\nboqiang\nkinneson\nloetz\nshinawatras\nspringw
oods\nswaddles\nhabbash\nhseni\nshafranik\nistodax\nbodhrans\nwyand\nbebenek\nnasami\nmokhtarian\nquattara\ncolllins\nojelade\nclearbridge\nmurugathasan\nflégère\nbalack\nalcoentre\nspigler\ndocklow\ntwanda\nprefeminist\nkdolsky\nprovolo\nnashawtuc\nfonduta\nshipperlee\nitexpo\nwrzesniewski\nkayci\ncacophonies\nrecultivate\nfriendlessness\naildenafil\ncompumentor\njeranimo\nadvar\nedcuation\nfriedes\nvriesman\nspycam\ntweeple\nowentown\nchebanenko\nmogin\npalipane\ncurtainless\nsunburning\noathes\nhoehnke\nspinmasters\nubse\nexhuberant\nkhvaja\ncyberthreats\ngyffredinol\nhohenschoenhausen\ncloudmont\ncampanulas\nlowballed\nzumbrunn\nlemaricus\nrahaim\nabdulhakeem\nniederhofer\nmanei\ncpii\nalaeldin\nboatmakers\nneurotronics\ncotgreave\nclivias\nkimmeter\nstenny\nslushing\nyxta\nadgey\nantidepression\npazur\nweariest\nonyeri\nbelghazi\nghalamnews\nrstandard\npilk\nboudjenah\ngeeya\nestopinan\nfarasat\nperdigao\nqliance\nnumed\ntenhaaf\nofour\nsadrs\nkhrista\ndelac\nscratchier\nusbank\ngrowmore\nvakidis\nkouremenos\nconvergex\neigerman\nsuperegos\nponderers\nsahidullah\nmajercik\nprogestagen\nwinkingly\npeaco\nvillatuerta\nadrees\nkurmanov\nrezae\ncagoules\nfemip\nmaurcio\nkhakasia\ngeomet\nsurlier\nscalcione\npemberthy\nanastazia\nyudkoff\nulset\nhalilhodzic\nrsst\nkameezes\nbaggan\nprofitstars\nbahuga\nkhusa\npegum\nskinkers\nlattig\napono\nwestendarp\nbeaties\nmakovich\nabdalaziz\natunyote\nbiocheck\nswarns\npusses\niphs\nectomorphs\npropects\nmomslikeme\nmifs\nuntethering\ndelhusa\naardt\ngoddell\ngrassett\nreoffends\npomarici\nolness\nrelaise\nlueking\nkairen\nsitlika\ntwantay\ncytopia\noponion\nallendes\ncyberslacking\ndoonhamer\nquallion\nstmts\nzelenovic\nchungchong\nehlenfeldt\ncrossflo\nsvemo\nkavlico\nwillebaldo\nspievak\nvisicu\nfeldmar\nhopscotched\nrecompetition\ndelcined\nhidayati\ntemcor\nprimewest\ngacesa\nlandmined\nmutinously\nmudsnails\nphizzy\nbalague\nlenova\nbankrollers\nadcirca\nprimesight\nstarcat\nsoubiane\ngatignon\nnoncovered\nspinnler\ngoa
cher\naccedo\nemiley\ntaumua\nbagherian\nfeezell\npekurny\nnontraded\nspudman\nreproted\nweidenmier\npostpartisan\ncousteaus\nabilitynet\nstagged\npikulthong\neckerberg\nusuf\nvalayanmadam\nlilamani\nshortsightedly\nthelast\nstinken\nskymarket\nfireballing\nscampish\noffier\nskillend\nsleazefest\noversimplication\nrifes\ngodsick\naluise\nranaivoniarivo\nblottnitz\nscrofani\nstiffie\nnewsmag\nanait\nciticard\nhabeebullah\nburningly\ntretherras\nperelstein\ntechnomarine\nsteinreich\nderms\nboccie\nraphäel\ndibler\ntempkin\nunhoped\nstricklan\nikamva\ngundegmaa\nsincavage\ncocklers\nesclapez\nflyde\npoliburo\nsectorally\narritola\nmooly\nbruchsaler\nnachawati\nabsolutepoker\ncohabitators\nkhimm\nwelmoed\ngoetschl\nkoubi\nsofronios\nhrabik\nreinsures\nchaffins\nplainspeak\nsynctv\nconsummations\nstavreva\nzavi\njaabar\nobjectvideo\nnurkin\nvandemoortele\ndukkah\nqaddoura\nwingcon\nselega\nbashinsky\nsaddiqi\ngenocided\nhomeys\norphanidea\nhimnself\nesmerian\nghambir\ndropcard\nmncwango\nabulhoul\ngorringes\ngfsr\nhubnik\nblubbers\nsecretrary\nkahau\nexperienc\nthxa\nnaupa\npreheim\nnykanen\ntussionex\nindepabis\nlabuanbajo\nfoodcalc\ndecherney\nvarnai\nintertain\nobiol\ntrialx\ncuttable\nstrongeagle\ntocto\nkorkoneas\nvicepresidential\nkampars\nmilkpep\nnasacort\nmiddled\nsojurn\nomenetto\nlaprevotte\nbimingham\ncrummier\ncourtiour\nrabinor\nchantecaille\nunderreact\nbarthmuss\npayplan\nmoyao\ntohamy\nwhipcrack\nnajmedin\nnotel\ngoshdarn\nshakhsiyah\nhasanein\nchanceller\nworldpac\noverconsume\nkobsak\nstoyer\nmckiver\nvondran\nkoernig\nwilben\nmusicskins\nmilliion\ngelbin\ndalha\nalcolac\niapso\nredxdefense\nmccaughley\nuntransferable\nadgie\ntreadwells\nameriya\ngopilal\nnsemi\nlandina\nbakhty\nsamecki\nmalomuzh\nzoneperfect\ncarrazana\nameircan\neventfulness\nabelcet\nexportability\nmatoian\nherrle\nmadatyan\ntrenn\naarron\ndlar\nweiqun\nlasecki\nshlemov\npaleokostas\nsledai\nraucus\nshammary\nmonstered\nweltons\ngalliwasps\nzapor\nzickefoose\nchanate\ndoani\nedeen\n
giftwrapped\nvebas\nunstitch\ncuddlier\ntzun\naytat\nkvasnicka\nlitef\ntemodar\nscreenburn\nsecurityholders\nbethenod\nneocate\nunbottled\ntobaccowala\nprynu\ngacula\namornwiwat\novercoated\nkrenning\nradeta\nbornt\nclaustrophobics\nwackadoodle\nkingshall\ntoube\nroadmate\nffilmiau\nwetbikes\nperfectville\ncabler\nbuffetted\nrajaprasong\nakbaraly\nhaythorn\nverhoeve\nhouriya\nrottinghaus\njaniece\ntayub\ngoware\ncoiffeurs\neuroskepticism\nsideswept\ndroeger\nbeigler\nfaatau\nzeppetella\ndeysbrook\npreinstalling\ncaua\nreappreciation\nchaeruddin\nmuzzaker\nbachgen\nfrugoli\ndeductability\nresynchronisation\nbrni\nunfamiliarly\ncreamiest\nginco\nissmp\naminiasi\npangeo\nkargha\nmvbase\nchristandl\nspotxchange\nromski\ndesperance\nuntch\nhilliary\nmompreneurs\njakabok\ngolser\nsitdowns\nsprado\nunshuffled\nbrti\nyardenna\ngoaltend\npannullo\nmckerchar\nkalkay\ncrosschecks\ndinb\nagrisure\nnosers\nlaceless\nhaihua\nschlecks\nfinacee\nbrazoban\nstryland\nordemann\nchartouni\nbollingbroke\nbonat\nreneuron\nuchucchacua\neschenbacher\nuntame\nschiffbauer\nogreish\nstubbylee\njosmer\nnonturbo\nmisfields\nselleca\ncronosoft\ntocheri\nyannakis\ncatrack\nrichlist\nkabkabiya\norbotix\nyocca\npunternet\ncamae\nariesen\ngisy\nnaftzger\nchezzi\nmrugala\nbasteiro\nehya\nuvat\nclassness\nsanitises\nmadginford\nwuryanto\nworthersee\ndimunation\nelsohly\nordovas\ngoodair\nseasteads\npomerols\nizhaki\ntailspins\ntowork\nunfancy\nabaete\nsuann\nʼi\ngeniom\ngologone\npreema\nporkulus\nouamba\ngontmakher\nanemically\ntrma\ncryptoportico\ncvilak\naglycons\ntazeem\ntruronian\narmaris\nnasdq\naguettant\nalderly\nkavulich\ncitrucel\ncerelink\nneuronetics\ngonter\ncolpy\nglah\nbrayed\nbocchia\njinhang\nzeedyk\nstainmaster\ndatafeeds\ngiammo\nvechicles\nmontobbio\nbluedorn\nmainardo\nconfiming\nhempey\npearlized\ndanielczyk\ncadwraeth\nheddatron\nsauat\nmauviel\ncatapault\nzaiqing\nintuniv\nschistomiasis\ntullydonnell\nvacine\nannihilatory\nreort\ndigeplayers\ncrawleyside\nauchmill\nlachar\ngens
cape\nonges\nsomaxon\nzhilyaev\nshafkat\nquartely\nlenarcic\nakylbek\nkwangchul\nkramberg\nhighfliers\nrashevski\nmatsuhita\nmarmillion\nmaryin\ntavelman\ncervelats\nferronniere\ndahllof\nputzes\nsubasi\nschandelmeier\nastromaterials\nassulted\ntemblador\nwahishi\nbengtol\neyeris\ngrandmet\nteresea\nbannapot\ntyland\nshohin\npirilli\nenglobal\nfazalullah\nshirtlessness\nmisbahuddin\njumpsuited\ntribulete\ncillier\narcalyst\nsuperswarms\nshanshal\nhumantarian\nscalinatella\ngecker\nmcelheron\nalmirida\nwaxings\nvenezula\ncgibin\nmuffi\ncompetitior\nhandbuilding\nkopchinski\nazzaz\ndotcomedy\netol\nnewheights\ninterfear\nreflate\nioactive\nddydd\ntalevi\nessakow\ntetherow\nadnitt\ncloughie\nbirkenfield\ncolussy\nzoppe\ngoldkamp\ncollexis\ngamarekian\nestablishmentarians\nlanap\nreann\nzawislak\nsuperscooper\nwhitny\nslogar\ntesfamichael\nparentsconnect\nworobey\nbaijiao\nloathsomely\nknickmeyer\nspazzes\ntunstill\nnoncyclical\nbalgrayhill\nconfesercenti\nvitino\ndogumentary\njanusiak\nndjai\nbirgin\nbasyouni\nsolictor\nhpas\npixieish\nmuhyadin\ngoverenment\nbacterin\nhammouri\nsaftler\nhughstan\nmonetaire\nundelineated\ninterferred\nglobaltrans\nulitmately\nregathers\ntraumeel\nkreeps\nrailplus\nportaloos\nduilia\ngamidov\nclippered\nwiedinmyer\nquavis\nneswick\nchmm\njabalee\nzendani\nditsa\ntianliang\nauthentium\nsnackfoods\ngrauvogel\ngenilson\ntrustcenter\nhoerling\nsuperelite\nflittner\nbicyles\nterrritory\nlehecka\nniwar\npintxo\nkachka\npreopening\nisratine\ndrabelle\nbluestring\nheadlingley\nmuraguri\nsexcapade\nbrovetto\nsedovic\nmalteses\nexurbanites\ninnocuousness\neberwine\npressganged\nabadoned\nmcnesby\novermedicalization\nprkr\nhuthis\nblaringly\nreemphasise\nreyl\npickiest\nrespraying\nsepil\nllares\nshmura\nmorhouse\nthorplands\nakmad\nhougardy\ncottigny\ngulkis\nzorpette\nkaltoft\nkatokichi\nfatcats\ncornblatt\nunharried\nfrownies\neldridges\nfairfx\nlandgrab\ninstitutionality\nscheiwe\nardanaiseig\nprht\nzislis\nfertittas\nreivich\nkadii\nsquiring\n
rodins\npramudwinai\nwhoopla\nblutrich\nsoftbrands\ninvigorator\ncrudité\ngwom\nziolo\nffrindiau\njenard\ntyruss\nprehearing\ndesquesnes\nantufiev\ncasber\navichal\nfulvolineata\nconando\nforzley\novershare\npulmonarias\nhammadou\nhanoians\nfootit\nsolaren\nchestang\nvyxsin\nfreiden\nlermon\nstablise\nmashadani\nrubacky\nfadillioglu\nmurmurous\ndannye\nmanorway\nlitvan\ngoatie\nrabits\ngramke\norlandella\nredenvelope\nviransehir\nimmersively\nboussin\nmalehorn\nsiochána\nfastac\nmicast\ndrymades\nklingvall\ngrillmaster\ntomasevski\nmizher\nprofusions\nnegrohead\ngucciardi\nsandbergs\nprevette\nmegaports\njianglong\nspaclub\nadelis\nprabaharan\npanagiotopoulou\nvermon\ntrmb\nlarvin\njennelle\njempson\nziouani\neppert\nrobotopia\ninseams\nneukum\nbaracky\neyelike\nshuddery\nsophoan\ngosinski\nrepondents\nrothlin\ndasy\nkonowaloff\nmonzur\nmikelle\nfineliving\nswide\ntanier\nguaranted\nalicart\nchantlike\ncolberto\ntecua\nalphamosaic\nkimondo\nzuca\niogear\nmexp\nstonkingly\ndelwit\nfrankeny\nshuweihat\nfortebio\naylene\ntachosil\ncolica\ncyborgian\nborgogno\nsunbleached\nmegafights\nrihannas\nversimilitude\nsuccesss\ntechnogeeks\naerocrine\nalphalab\nchinemelu\nkrason\nshrillest\narthrocare\nbreininger\ndamne\njavorn\nspinmeisters\nshaokun\nambay\ncostafilm\naqualab\nhandyperson\nhynick\npasskeys\nnhsta\nrobier\ncoxmoor\ntextme\nhupton\nteendom\nusbln\nagensky\nyonglan\nbaseships\nhaberkamp\ntrelinski\nfraggings\nneosoul\ncvmp\nsavouriness\nkeresey\njinc\nciecc\nburkus\nleisch\nquib\nmillert\nbarrelful\nmaclarty\nbamrah\npenbury\nlobbenberg\ntraumatises\nsquishiness\nbehenji\nnebulisers\nnoxema\nezon\nmoseying\ngoergens\nscudettos\nsamalout\nsagaciously\npizzella\nbeems\nborisch\nshiray\ndecaff\nwombell\nsnowfort\nschlafengehen\nenthrallingly\nsofronis\ngamelogic\ngiannobile\ncadwyn\nkennebrew\nicddr\nsnacked\nchonail\npomajambo\nelpistostegids\nmorael\npowerseller\nlatecoming\nschaloske\nwolfed\nthrupenny\ncleggy\nmutiga\nrachow\nchfn\ntegrity\nsomayli\npecheux\nazar
mehr\ndayya\nbordine\nheeran\nevolene\ndhasmana\ngarikai\nateb\nworkrights\nrovian\ndeglamorized\ngeorgelle\noddes\nmutaa\nofffice\nhassaine\nlostroh\ntruffling\ndimunition\nepedemic\noutthought\nparenton\nubachs\nftfm\nbluejohn\nertong\njerkbait\nharmie\ntharan\narancam\ntertrais\nboltless\nkensy\ncapablities\ngondolo\nzoomy\nrenkes\ngleefulness\ngorens\nmadban\nvizzavi\nstonefrost\nthrifting\njazzmutant\nuithoven\nteraelectronvolts\nbter\nwarwak\nblear\nreprioritised\nhaziest\nriedo\nmitschek\nohja\ngribbins\nunevacuated\nnowacek\nwursts\nmelograno\nknapke\nhasanin\npunnets\nkamore\ngadgetwise\ncredenzas\nhoromones\nhaiders\nsensia\nunknowledgable\nmontioni\nenung\naplix\nemmar\nbluffy\nreindert\nmuyuni\nkilomters\nmagram\nyannas\nglucotrol\nmatasano\nvdrs\ngreitemeyer\nceleberity\nrummery\nbrudnick\ncomack\npopcake\nafghanastan\ncottagey\ntavlin\npafuramidine\nayyalusamy\nhenss\nshinwatra\nriccadonna\nlubit\nrunless\nrefettorio\nyuban\nhvhc\nruefulness\nwinterisation\nngemelis\nimberman\nneigborhoods\nnarel\nkisamba\nholidome\nrutab\nbiggots\nmelenie\nowour\ngerui\nvietnamwar\nkameroff\nmushier\nboobis\nallue\nklosk\nskooba\nyabroff\nfaigenbaum\nbridgeen\nsiomara\nradzius\nconvergency\nscandelous\nyourmoney\nleismer\nstrohal\neverpower\nexbs\nheddings\nmasoumian\ntoree\nvinelike\ndasaad\nmakaibari\ncrummiest\nmuzijevic\nsugich\netene\nbleatings\nshoetops\nhseq\nsemperian\nleporati\npafr\nsarrafi\ntweentribune\njirovec\nbonitatibus\nfossgate\nplantiffs\nlockfield\nachosion\nrighful\nbillfolds\ndimascio\ndrywallers\nbeligerence\nhincker\nsvetlik\nbaiyaa\nwauck\nacrodea\nlannett\ntaous\nknicknacks\njoshpe\nbluchers\nlawnmowing\nbromptons\nrebuying\nshaniece\nscappati\normesher\nascendia\nneglectfully\nparadyne\nmoussin\nahlfeldt\nautothrust\nalmatis\ngreendimes\ncychwynnol\ncprit\ncorcell\nfattahian\nbrownstoner\nkallinis\nministates\nprawdzik\nunsusceptible\ngueant\naminpour\nmuers\nakhilgov\nprmkt\ntrememdous\nhuffmon\nalbinger\nunenticing\nsahenk\nrivertime\nawei
dah\nbasagoitia\nissaias\nsusno\ncorwm\ncastillas\nbirchfields\nupswelling\nfinmin\nbloombergs\nvkernel\ngaohe\nreynar\nfiese\nkiltmakers\nmarleine\ncoppolas\nheech\nzilliontv\nhoussoy\nrobna\npoliakine\nilliteracies\nbouveries\nthadee\njilbabs\nandimuthu\nwadian\njulmiste\ndiffi\ntenreyro\njamarkus\nsterigenics\nrhonnda\nsnbc\nschlachet\nboukous\nbamboccioni\nkhoory\nrécitations\nrhiannan\nmastronarde\ngaveling\nistiqbal\nnacarat\ndumebi\nbierhanzl\naryasova\nnormil\nspalton\nbadiozamani\necomedia\ndevistated\natiha\nelektrobay\nelfine\nselfsufficiency\ngamemanship\nadvisery\ncorall\nsahmarani\nugnivenko\nstting\nmcroskey\nparzych\ntwitterview\nvickki\ncestyll\nmtrx\nunhealth\nsilverpop\ncollpased\nrehersals\nbuchanek\nsjfc\nrailomo\nbeligum\nsainvil\ngotaland\nbroyd\namlen\nklaussner\nphuyal\ncapered\nblacksell\nscrybe\nmonzavous\naxert\ncelandines\ndemonstraters\nkyphon\nreybier\njüst\ndidarul\nmontealtosuchus\nzardar\nmetelitsa\ndefensless\nsalloway\nkathat\ntrenell\ndricks\nkinseys\nwahedi\nwoodfuel\nguestrin\nsadikoglu\nwonduruba\nsuperinsulators\niscoe\nhulteen\nundercharge\noverstayer\nbeginin\nstrokeless\noniya\ncorepharma\nsavp\nhasabu\ncourchinoux\npollermann\nctei\nchlopak\nmuaid\nsymlabs\nscouarnec\nsyagen\noverexcitement\nmehrat\ncdhe\nforeyt\norganlike\nnberg\nretreaters\nmulyanto\nsolarmagic\nphobot\nsaintbridge\neversham\ntorezolid\nhuffily\nrdova\nviselman\nsorgeloos\nnuvio\nithat\nvenktesh\ngarafolo\ndawald\nknickel\norllewin\nepayables\nlafemina\nguinsburg\nworldtech\nvivitrol\nretureta\nconsective\nobaidah\nsambucas\noverstuffing\njatna\ntussel\noptison\ndaikondi\nmacovich\nbordelet\nsidanco\ncompenents\nmygas\nmaaouiya\npellegrine\nlokhi\nraniga\nbeddia\nnovogradac\nhastelow\nspeedware\npinzone\nloggos\nabakr\nmahgerefteh\nunderinvested\nrosendael\npicnicker\nstoitsov\nchicer\nstroch\nclenell\nhinderlider\nphilant\nnstep\nregieoper\nwharrels\ncelacade\nouidad\nstanos\nneutraceuticals\nrunnebaum\nmahlak\nskeikh\npaesa\nageis\njsfs\nsmokeries\nda
rapladib\nkingliness\njukowski\ndonadello\ndemontez\nbattersbee\negyptain\nerox\ncostetti\nrushenden\ndeligates\ndaogah\npiemaker\nzueco\nstarriest\npercec\ngroenvold\nvibiano\nmurphi\nwiiitis\nzhejian\nmilktoast\nscoltock\nenergiekontor\nseapak\nsmartshop\nnarredu\nsceintists\nitij\nnfvcb\nixic\ncyclicals\npatentholder\nwithrawal\nwelber\nisln\nugartemendia\npontocho\nhadizatou\nmlim\nganyang\nyarovoi\nvegatables\neftychiou\nsmolanoff\ntresierra\ngwacs\ngscb\ntalinda\nliquider\nvestoids\njunping\njoylessly\njintropin\ngbgc\nelaborator\nbedair\nnonhospital\nmcnosc\nmexicanness\ndannijo\nmishavonna\nhcais\nappenine\ntrevine\nchadiha\nsuslik\ngalitzia\nmererani\ninnnings\ncoccidiostats\nyeskey\nwendesday\nreclast\nbarelegged\ntwigging\nprotrade\nconstitition\nmasterofthehorse\nkuritzkes\nwirelines\nkosawa\nmetalrax\nduopolistic\ncrowells\npenpower\nwqh\nfrizzed\nkeithan\nrabines\nsjostad\nsapphist\nscrabbly\ngandarilla\ninacol\nrosgill\ntaketv\ncatheterized\nzinging\ndahlitz\nnatshe\nadapatation\nyudelquis\ncutrara\nmulticulturism\nerqs\nlton\ninterbanking\nschimenti\ndhongchai\npongpaibul\nmarceles\nthyrogen\nramsower\nchiumento\nmotuara\ncyberstates\ncampsies\nkolch\nmesalles\ntenrehte\nweeky\nlythic\ndelannon\ntaojin\nspoe\ngastronuts\ngalbreth\ntacugama\ndetaines\nfifg\nguildtown\nclackety\ngainsboroughs\ntarrantino\ncarmelized\nallweiler\nengzell\nnoresco\nquois\ncubreacov\nthiruvenkadam\nkimla\niwanowicz\ntreaury\nploger\nweediest\nmangalik\nprefiled\nlowballing\ncorken\nconab\nattisso\nsurvivaball\nllsh\nadahdad\nancellotti\nihry\nlinuxlink\nbioforensics\nqashqavi\nspottiness\nlikudnik\nbohatyryova\nvillagey\nppmo\ndaskas\ncowcher\nmrag\nhorizen\nouldn\nschrider\nbiedrins\nmedicalise\noberschledorn\nflulike\ngwynfa\ndummermuth\nentertaing\nmaggioncalda\nvelathri\nosipoff\nfoqaha\nsouviron\nmicrocephalia\nnurdle\npurvanov\nestiatorio\nhongke\nronsky\nscaleability\nhypercapitalism\nicrime\nstarluck\nmohmoud\neverfi\nlescol\ngianino\nsarcmark\nprivilidge\nadeeba\n
headfake\nkandeepan\nthirugnanam\nmistier\ncatrini\nsteinmayer\notiso\npolpette\nunloseable\napoaequorin\nungless\ndubak\nunderattended\nfleetnet\nkelimbetov\nboomj\nplummetted\npolitiicans\nnghaerdydd\nseruga\ndworetzky\ncounterpiracy\nkhulud\ndahianna\neyjolfsson\nmckamie\nchakhkiev\npachecho\ntongon\ncelm\nrubinomics\nficando\nflaam\ncanare\naelric\nljubco\ncatapaulted\nlauterback\nrezgar\ntaksta\nexplorit\nlobying\nagenstvo\nszubski\nmaxximo\nlaquanda\nkazhakstan\nsleekit\nbyronesque\ntaikong\nsalmina\naafaq\nsafecall\namotosalen\nweedbusters\norgasmically\ndprc\nayira\ndiguglielmo\nsayekti\nwidmay\nsteelier\nscienceinsider\nureneck\nvisciously\nbarentsobserver\nbukasov\ndaabas\nibrain\ndirgelike\nsumnall\ncassiotou\ntoebben\nmileycyrus\nmitee\nforzoni\nglobby\nhawkishness\nrheumy\nhydref\nchrylser\nnakahiro\narbani\nyizhousaurus\nrosteck\nfewn\nenviromentalists\ndanchick\ncorix\npraticò\nturginovo\nmegawave\nkozena\nantitakeover\nhematide\ndrennec\nsybron\nrhombopteryx\nstaffie\nrejeski\nadavantage\nghoram\nbaage\nchinarat\nmilhorat\nmasisa\nwilpons\ndabryan\nleanback\nflycast\nsteenvoorden\nshalo\nalnc\nbarsade\nperimekar\ncalportland\njuqueri\naruond\nsanjae\npliss\nmoomilk\nmintec\njancovici\nbambale\nditkoff\nabdolahi\ncavadino\ndeweycheatumnhowe\nlambinon\nswineflu\nxtion\nnordfinanz\nwybot\nschoeder\nmayuran\nswagging\nsoglasiye\nlahmajun\nfordwat\ncupfuls\nnewbeauty\nfintur\nsotudeh\nretriggered\nbraindistrict\nmikelic\ngirsky\nyuppified\ntimberlakes\nkucharova\npertusio\nsalviatino\nburaida\njuette\nooomph\nfelicetta\nyumurtalik\nschefenacker\nrosiness\nharkatul\ntabakman\namelior\nmommywood\nkagamé\nmelaas\nimdc\ncizhong\nassomo\nlightfooted\nzwitserloot\ngroundhandling\nkoskimaki\nwavra\nwidness\nunguardedly\necuadorans\nghwell\naltmanesque\namerisur\nguvamombe\nlydianne\nreykavik\nflector\nyerks\ntrevessa\nselanikio\narrogent\nquarantillo\nabdelwahhab\ndemutualising\nmajestik\nshafayet\ntorkhum\npacewicz\nhagees\npoussier\nnorthwater\nfountained\natr
onic\nystafell\nchersich\nnmpp\nramono\nscupltures\nseascope\nepiskin\ngoussous\ndelphon\nchicness\nassetts\nsenseman\nnorrona\nhollmen\nmrina\ndeterminato\nlittmoden\nabashova\nknowetop\nshamdinan\nbaghdassarian\nhopscotches\ntungsha\njudders\niurgi\nlumberingly\nsinkevicius\nwilberding\nsprightlier\nkapani\nwoodrick\nbudulis\nmauriceo\nfaqirullah\nbalmier\nnakivale\ncockhorse\nclickagents\nllifogydd\nsimplemail\nunrefusable\nzoolights\nmobilitrix\npalastanga\nwindcheaters\ntutssel\nvoudrai\npoliticain\nresignees\nfirefront\ntovagliari\nmountleigh\narogance\npreregister\nsolarmission\ngreyser\nbronzeoak\nsalanti\nintrepidness\nbippu\nxenazine\ntibnah\ntrammelled\nchengbo\nvalcyte\ngooier\nmoerdiono\nsukarna\ntowes\nevolta\nvinovation\nadenotonsillectomy\nnurko\ncorproate\nmergerstat\nmatekitonga\nuncharming\nislamified\nsoliloquising\nboardwear\nsansalone\ngawkiness\npffffft\nvtuner\nwyithe\nburshteyn\nwaisanen\noverincarceration\nhopsack\nbrainquicken\nwajar\ncappellani\nsinahala\nsmackheads\npswr\nseleshi\nbedinghaus\naugill\nwaikupanaha\nwizemann\ngesel\nyorking\nbronwylfa\nantisocialism\nshiberghan\ninnotrust\nambrozaitis\ngrowbags\nbokks\nsalarkia\nmediamarkt\ninprisoned\nmohlenkamp\nsitecatalyst\nsundloff\nmarketsource\norvarsson\nvistory\nluxalpha\nkajouji\npirnhall\ncrustiness\npedaller\nallarton\nmisspend\nthaix\ngajon\nkapolczynski\nrobocroc\nabdygany\ninterestd\nmarkenson\nidividuals\nvanadia\nwirestone\nkampeter\nstiffies\nakonix\nzweletemba\nsandeela\nnendick\nmypunchbowl\ndemetrus\nstoilova\nansworth\ncityryde\ngrandner\nyanghui\nnascetti\numemployment\nkerkstra\nbaikalsk\nnextio\nskinceuticals\niband\nshellys\nkalinak\noktrends\nbecici\nmotoboy\niserman\nhafeth\nfruhis\ncongileo\nsqueamishly\nwindshift\nmaktoums\nqualye\nadultvest\nschnidejoch\nacounted\nchenevey\nheadage\nbeccalli\nkazutsugi\nqueiro\nbromirski\nfarraher\nnanobusiness\nzolqadr\nscarbrow\nhyunjin\nkengeter\nmutler\nliuetenant\nreynen\nlevistre\nsawalich\nidoya\navports\ntejdeep\nnewsc
ale\nsutow\ndramexchange\ngejon\noverboiled\ntekelioglu\nkushners\njobatey\nschrapnel\nsasomsub\nyongyut\nkatalia\nsalstein\nqubeka\npairolero\nnedds\nradojicic\nmuzer\nicebag\nshakif\ngeogheghan\nshanafelt\nridgeworth\ncounternarrative\nechometrix\nalcoholically\ntribalisms\nbarganing\nintellipharmaceutics\nmoeb\nmerepark\nfurors\nsharrief\nbaskan\ngreehouse\nwesternzagros\nuplit\nramblingly\necononmy\nsourcemap\nsedensky\npsirri\nbackcast\nmcrel\ngraumlich\nkeneshbek\nmalezhik\noffshorable\ngolftyn\nnilled\ntranquilising\nmedstory\ncedare\nczjzek\nagys\nbilll\nisletmeleri\nhennaed\ninefficiences\nsilverwear\nfallica\nkasereka\nvemdalen\nehlis\nnemecz\njjohnson\nglitner\nnonadherent\nblokeish\nhoelzle\ngrunstra\npilgrimmage\nthermocool\nattaboys\nantiwhaling\nsirci\nprostitues\ntyrelle\nspanakos\nperusers\nvongs\nformbook\nyowled\nbasyan\ntinkerman\ncockiest\nseraphically\nemoze\nsupriyatna\nmolori\nstiffled\nbruijns\ngoodlyburn\nkulbel\nkovacova\nvangent\nhagenbuckle\ncasenergy\ndreno\niragi\npostapartheid\noceg\nbersell\nmafeje\ntatley\nsalesladies\nkurzarbeit\ndaibes\ndarea\nfasfous\nvangeline\ndebowski\nemplyees\nbronxites\nvolumizers\nmezzomo\nborgula\nbestcovery\njanvey\nsimpon\nhosbrook\ninsideflyer\ntradeables\nspokewoman\nmalungisa\nhotelclub\nnonpetroleum\nlacers\nreadius\nzirko\neslc\npowerlong\nbungeed\nstoptech\nfejtö\nmostart\ngilletts\nhudacký\ngerspach\nkhais\nfundementalists\nzahlmann\nsmedile\nweihrer\nnisly\nrisken\nctem\nyukna\nspotco\njoebert\nlunchable\nyinfeng\nzimrights\nsumco\nkallari\nvimac\nogide\nsabitsana\nneatt\nlacanche\nmittersteig\ntauxemont\nchurms\nneugut\nschiffren\nsymmetrel\nmanaeesh\nstratecast\nshamso\nisotec\nbechta\nmelquisedec\nfrieburg\nmontmayeur\nmoldomusa\nmarkeljevic\naccera\nnipbl\nquinlisk\nsakhakot\nunipaas\ngargula\nlouvard\nnonhomicide\nvolksparteien\nrizwanullah\nshoutalong\nzakhari\nyacqub\nfiscalis\nficos\ntemudjin\narbx\nshaimiyev\nlimbaughs\ndopps\nrenesola\nzsweet\ntransdnistria\nshowboated\nkandyland\ndese
nsitises\nijjas\ntriartisan\naliffi\ngawds\npinacci\nvasquezes\nsangprapai\ncarcich\nflytower\nnewcasters\nacolades\nrunnan\nmatathia\nmarsilli\ncontemporizing\nblowy\nbotiga\nvewwy\nfirstdirect\nfusidate\nshamsiddin\nnoboard\nmisoft\ninphi\nrightmedia\nspirita\nmayrhuber\nbeckylyn\nrelphorde\nclaimimg\ndragao\nsquanderers\ntourondel\nbittani\nnjambi\ngaranimals\nprojectory\nbamboozler\nsimé\ntrikini\nccvp\nfoodex\nperogordo\nchontos\nitwould\nsupprelin\nbctga\nqidfa\nhedenstrom\ntruffade\nfursty\nsnowblades\nshate\nahktar\ndisraught\ntarbett\nbuhary\nsubasri\nnonstudent\nwembridge\nkadamovas\nrapidata\nwibautstraat\nsteepish\ndarimont\nnonhealing\ntackily\nlaatz\nzaelke\nlokoff\ncorrpro\ngreenblade\narixtra\ngrundie\nvanouver\nbennigans\nexactpro\neckelt\npowercast\nvfend\ndases\nbarcaloungers\npozition\nkhahar\ndzhennet\nmahrus\nhernieder\nlanzkron\npenderels\npakhrin\ntiarza\nhardnut\nraghzai\nclanked\nquegan\nviarengo\ndimissing\nskinniness\ntiremaker\narcanto\nprajak\nanticar\numle\ndyssynergic\ntigergate\nexfoliators\nsefydliad\nantibribery\nkersteen\nbaghram\nophra\neuthanizations\nrequ\nwawas\nakule\ncarnhill\noesterman\ngenevers\ntusty\nvandrey\nanjun\ndorotan\nqalqilia\nfanguy\ndepilate\nkulacz\nzonebridge\nhatriot\ncorcella\ncountres\nghreadaidh\nworrywarts\neudemons\nmodishly\ntomblike\ngeille\nmaswik\nmuhmud\ninaccuarate\nmccrobie\nksander\nschlaeppi\nkustendorf\ngovens\nwendtland\nnelstrops\njolyn\nstancanelli\nvorovoro\nharbourers\nqualisystems\nheidgerd\nfezziwigs\nsukkariyeh\nbeccio\nmymedicalrecords\nosmaan\nlavanga\ngrilikhes\nlaborc\nhevenu\nrietmeijer\nmaneuverer\nlifevests\ngollies\nlandgrabs\nraddo\naccelerade\njablonowo\neddc\nngenera\nzellwegers\nphysicans\nshwak\nkundtz\ncoolblue\natyniad\nwilgenbusch\nteltsch\nrachmadi\ndanatus\nkósáné\nclomazone\nzolensky\nsecurites\nkleinfelter\naclara\nelosu\nsojewish\nmurefu\nbrasiliero\nmontegufoni\nnosiest\ntelecommuted\nargetine\nryvkin\nulitimately\nshirian\nfalujah\nkamami\nchazzie\nzuykov\nquadda
fi\nlemeshow\nteixeiras\ndostis\nlonngren\nadct\nwhodathunkit\nbeachwatch\nsucceds\nilabor\nmaobama\nlenczner\nshemel\nhardern\nzegrean\ncaidan\nxterasys\nmodiface\njibbed\ncontinuin\ntaneka\nfuzesi\nimcompetence\nnintento\npharmatelevision\nwehran\nprecooking\nschwabish\nprovado\nniklason\ndiegue\npitale\nrockwellian\nmitigator\nmünchau\ngbis\nblazak\ngruach\nairknight\nthalomid\naugustawestland\nstonefaced\nvesterdorf\nratanakkiri\nlebedevs\njerseyeans\nstarbirth\nmillrock\nbragnalo\ntimonthy\nallvin\nneuromedical\ndelnaet\nfingerpoint\nmidset\nkhorsheed\nstalest\nmucousal\nmingkwan\nwastall\nsqueegeeing\nkazmunaygaz\nafeworki\ninexcuseable\nbyrs\nliverpoool\ncoslov\neasylife\nbirdbrained\narune\nrecarpeting\nadlea\nkrausman\njigdal\nratifiable\ncompozr\nsnowlines\nmelvillian\nengellau\nmtendeni\nmediatex\ntonsai\nchargeholder\nmelnikow\nsiniyah\nbattiness\nfeseha\nmoussed\nwithagen\nhtibs\npastelería\nshenneika\nbiobusiness\njohnannesburg\nbrattons\nnaween\nlindeperg\ntrunkload\nunprintably\nhyperrational\nbearnes\npiturca\nfujimorismo\nholografika\nbugati\nvelcroed\nmustier\nmajedie\nhygate\nsadikovic\nboltuch\npromarket\nsillice\npakstani\ntounges\nmikolashek\ngnaizda\nluttmer\nneurologics\nteeuwisse\nhookipa\nbirdflu\nhartenbaum\ndebrowski\ntrowelling\nmiclot\nsandick\ngillens\nkinksters\nberlamont\ntodwong\ncounceling\nceberus\nsecrtary\nattitutde\nautobox\nbcri\nwiganer\nbiothreats\nkhodi\nkinzett\njinkosolar\nwithdrawel\nbeotch\ngormlessness\nkirston\nseitou\ndemcratic\nmyser\nwomencount\nuncloseted\nlassaline\ncasseroled\npiggles\ntraiterous\nkomaci\nfindingdulcinea\ndelamor\npostbop\ndepersonalising\ncuihu\nbrisc\ncatastophe\naleksanian\nhypermilers\nsharati\nhambrey\nmcnairs\nbromenshenk\npoertschach\ncompudyne\nthickos\nmiljas\nzukoski\nsagalowsky\nbabouches\nczomba\ndmitrovic\nfriedmanite\nbioalliance\nbacarro\nkahiem\njurkovac\nnewsconference\nlarusdottir\ndiscorse\npavluchenko\nufberg\nhsip\narrabiata\nalhaq\nmammosite\ntelemetrics\npatricot\nscme\nh
ealthit\ninfotrends\nkimre\ninsurence\nkhalina\narcua\nmulner\nchassard\nbcbsm\ngurtovoy\nchinaedu\nhaentjes\nkirrage\nfesenjan\ntltc\ntilletts\nnetprospex\naquaclass\npaktiawal\noutterson\nnextenergy\nnawasi\nchassés\njadva\nvinzavod\nadolor\nopde\npartono\nhofinger\nregivaldo\nmanuitt\nkulasi\ntylette\nzuanazzi\nblousons\njsna\nhonered\npanzirer\nexorbitance\nnobeltec\nmisraje\nlewkowitz\nmnister\nrsvping\nyirgacheffe\nhuluplus\npeua\nrippee\njeyarajah\nlumsdens\nnonlocals\nrcht\nmommys\ncottees\nanabolics\ngourinchas\nchilewich\nturkeltaub\nrimensnyder\nsoljacic\nbulgrin\naranxta\nperlwitz\nactressy\nmidstage\ndissaray\nwinsomeness\nvoelkischer\nbeslow\nwodges\nkaganda\njoaqin\nadvancers\nrantao\nbapm\nwillliam\ncubiche\nkappatos\nlanong\nparaschuk\nprepetition\nkallop\nyolimar\npanosh\nkambakhsh\naquastar\nghalyoun\nreahard\ndeclinist\nspitals\nmarkinor\ntrembow\nlamebrain\nwestates\nphoan\nmultilateralisation\nheterodontosaur\ngrifts\nscheinbaum\nplayact\nacdl\nchanonat\nmangersta\ngwenigale\nhealm\nenvivo\nfrerot\njazzlike\nfoseco\nvacationeers\nsmartmeters\ndalbadin\nbednet\nxxib\ntranslumenal\nnoscar\ncopperwynd\nxaviere\namitiza\nshangbao\nkidderminister\nlobodzinski\nhydrator\nsnarlingly\nbenzaclin\nblaentillery\nkkottongnae\nmyawadi\ntricksiness\necoatm\ndisipate\ndynavax\npeisinoe\nwouhra\nimmemory\nkienan\nsoltwedel\nprolla\nauru\nhildebrandts\ngontebanye\nrentabiliweb\nhatcheted\nbiorubber\ngreden\nhuysse\nlustman\nmallicki\nkatewell\nenmired\nhummous\nvedemosti\nwaea\nfazlic\nhebranko\njinzhan\nmeleri\nmagacin\nseghesio\nranexa\nmartinette\nshlm\nsuperpops\npreordain\nschwelb\ntreestand\nmhrt\nrowzee\novertreated\npresdent\nshirell\nmediaplanet\ngairnshiel\nhyojung\nnonseminomas\nindignations\njjones\ntautest\ndanesford\nxiangbin\nslickster\nwinlatter\nannoint\nhousebuyers\nmatchar\nnagaski\ntraply\nkumming\nsoarian\nrenthop\nkriner\nexplorist\nmediasite\nsirloins\nhipsterdom\nquievrain\nlucinella\nhelpped\nchiddix\nunviolated\ndfcon\nmixim\ntourino\n
grubbiness\nbudeprion\nmooed\nroseblade\njaparov\nroundpoint\nmczs\nintarcia\nyishion\ndjingarey\neartag\nloropeni\nmamonekene\nalakrana\ncikeas\nhyperliterate\necocare\nbuttie\ngreenmarkets\nbarrathon\nrancourous\nschnoll\nawearness\nmikhalov\nornskoldsvik\nhohenhaus\nketover\nhallenborg\nmuhanned\nhabibian\ncuprinol\naxcell\ngeranger\nsirit\nhiroshimas\ncherrybank\nfoladi\nmilligrammes\nloveably\nboonyakiat\nmazeina\ntallahasee\nsatalia\nnpsas\nmarcuccio\nrehospitalized\nbetik\nlplayer\nzhenling\nnogga\nrmaileh\nzighy\nkahmed\nespouser\noildex\nfadam\nkrupskaia\nvisalberghi\noverallocation\nzyrianov\nadenin\neunie\nanastenarides\ncouget\ncompassionless\nbertzi\nmirsayafi\nofficescan\npasticheur\nkhatiashvili\nhandilift\ncamgymeriad\nmossvale\ntreskerby\nirudayaraj\npamboukis\nviselli\ndrinktec\nsajudis\nschlubby\nwahey\ncreditwatch\ntestoterone\nbudianto\nlandshare\narashima\nstairsteady\nirimpen\nzaafaraniya\nroyalities\nemerilware\ndistractable\ncannibalises\npanzella\nprotasiuk\nmeknassi\nlistining\ngrandeau\nfecitt\nschwendimann\ncarrasquero\ngaich\ndrexen\nmadelain\npaystubs\ntelegrah\nvitacco\nmanvar\ngorowitz\nconfuciusi\nhedonistically\nbrisbanites\nshortle\nkillertal\nwharington\nrashy\npalmeter\ntomatoey\nthinkspain\nucyclyd\nbuphenyl\nthroughts\nsupposdly\nhoureld\nliffen\nkarlah\nfuggin\npatchworked\nclimatization\nguochao\nshrimplike\nlawnwood\ngivony\naptivus\nlivinghomes\npitjantjatjarra\ncoalbeds\nwakaresaseya\nlaloi\nazuelo\npropinvest\nkrafcik\neligon\ndaehlie\nbluegene\nmachmouchi\ncokas\nchundering\nstengthen\nctcm\npsychopharmaceuticals\nhagase\nsuthanthirapuram\nflowriders\ncodels\nsipam\nqaraqe\naskalani\nzautcke\nauroi\nalbanna\nfibrex\nhadjidemetriou\nwhisteblower\nvisioncare\nbannat\nskielik\nsteriod\ngraybeards\nrefineria\nshoffman\npilaro\nbabadiya\nmangkusubroto\npressprich\ndrowart\nmaffulli\nmonticciolo\nstenvinkel\nlazzeretti\narizonians\ngreenforce\noptisolar\nbelajac\ngorree\ntseliso\nheeman\nleavisite\nalbattikhi\npulungan\ndonme
z\nhaulfryn\ntwitchhiker\nlightfair\ngüllner\nobletz\natfaluna\nsambuaga\ncompeta\nheaddon\nthreatended\njabbouri\nkartashyan\nmunyard\nmuataz\nconsequnces\nthaniah\nintollerance\ndurands\ndekatherms\nzazzara\npalinistas\nschwanenberg\nindepende\nkuckelkorn\nmostoufi\nfinlandisation\nzelinger\nzareian\nbheil\nballyarnet\nfleschner\nmayweathers\nsebik\nsaberton\ntangoe\njezabeel\naproaches\nkarstetter\nathero\nsettelen\njaghato\nsentanta\nltcg\nlefist\ncomsumer\nabenaa\nsigurjon\nyellers\nhasaj\nqanan\ninister\nfortifiers\naskeraser\nmatxin\ntechnlogies\nbridgewaters\nsulphonylurea\nrokonuzzaman\ndellara\nmagheralave\nphly\nsackable\nunderexamined\namenties\nsilberztein\ndalog\noverscheduled\ndoofuses\nvinacapital\noverambition\nwonkier\navitzur\nparentdish\ndumighan\nanticpated\nyankus\ntingman\nbasine\ntyagachev\nfanrocket\ndarori\npavlata\nnumbnut\nleanachan\nfarance\nkasirer\netherealness\nyatabare\nharrott\nobstreperously\nlrmr\nbapras\ngilarski\nuzeta\nwillnot\njerichos\nzerihoun\nkhec\nbarbwires\nracheting\nlasseigne\nouwbsm\nsweetspire\nlyglenson\ninaccessability\nkranzbach\ncuissardes\nuncoolness\nindepented\ngnashes\nditlow\ninoma\nsummonded\nbbmg\nbohndorf\njuress\nnonveterans\nprodeco\ngradates\nfelgner\ntebidaba\nbourneside\nkozloduj\natock\nunbankable\nfedecamaras\ngrammatis\npaseornek\notuoma\ncarsebridge\nbigal\nmarxloh\nyaker\nfarruquito\nwelagedera\npitcox\nkatchkie\ncassama\nbrijit\nshmatte\npapakipos\nbfoe\nriddence\nbendickson\nloungey\ntitarchuk\nbault\nklasnic\nhomaira\nscagell\ncanadean\nmabina\nfarmelant\nmehmoud\nfeaga\ngvss\nmkts\nfmtvs\nfesik\nsendoffs\nbaybak\ntelops\ngeybel\nmoseys\ntoegther\nnourmand\nxibrom\nuncrashable\nimcas\nbuzziest\npartyer\npharmacuticals\nkushins\npakistanies\npirici\nclarisia\njomana\nfroeb\nferuzzi\ncamerapeople\nnossell\nnetblazer\nmilband\ndancetown\nfueld\nkodima\ncompensational\nerignac\njamiyyat\niiat\ngrouchily\npokerpro\nlangerhorst\nyazicioglu\nmccarragher\nspinmeister\nriazat\nmadaí\nmeadwell\ngridski
pper\ncrunkleton\nstroot\naftere\ndamianova\nleingang\nchiroux\nmaqueen\ndharsi\nurbaneye\nfrelick\nhipperson\novercautiousness\nozaka\nmaradonian\nhatties\nscotgold\nsvedang\ncoresoft\ncloudspotter\nflummoxes\nroundbox\nrhyw\nbäte\nbilbro\nconstella\nstiffeniis\nalfasuds\nlipham\nlevx\nchorush\nresponsys\nunomedical\nwaterproofer\ndhevi\nfailor\nbhbc\nscientest\naqlaam\nlibanes\nvegtables\nschran\nhariyadi\ncfpo\nhopleys\nvaluefirst\nnonedible\nmadrilenos\negdf\nraisonée\npoujadists\nattitudinizing\nfarwolaeth\naficio\nlaturnus\nmoeai\nrecharacterizing\nprotohuman\nosbo\nbaymon\nchrobot\nschoeppner\nstrenghth\ndonnar\nwidish\nkitaeva\nrezistans\nlandskroner\nladleful\nrefarming\nvictems\ncorros\nrisktaking\nelmerton\nswiller\nhepped\nsunwave\npson\ndesoky\nsheffie\nblunderingly\npassikoff\nsuwyn\nkvetches\nvukusic\ninnovas\nplude\njudases\necocampus\nfwank\nqunnipiac\nreinspect\nfianza\nlosable\nunpredecented\nwondwossen\nmoadel\nknomo\nviec\nincongruousness\ntugrik\nultraperipheral\nversar\nderegulator\ntaifook\nencysive\nharazin\npauga\nkarpreilly\nvertuno\nnadem\nunderdue\ngalvestonians\nbejmuk\nblefari\ncurascript\ndemspey\nzakotnik\nhcvp\ndamges\nvirgance\ncentano\nsabretoothed\ncyoung\npfanstiehl\nsausser\nchippiness\njeang\ngalvanizer\nlarmenier\nmwyaf\nmurria\nkosutnjak\nfalesly\nkleintjes\ngullibles\nborrini\nrentar\nmultitrillion\ncoutee\nyachana\nporici\noverborrowing\ntornelli\nantinoro\nsafcol\nyurkanin\njelana\npennybaker\nmalinow\nmccarthyistic\nlazeez\nlaubsch\nsamudro\ncharties\njailal\nhaydom\nchkb\nrespall\nsklaver\nillnois\nfluno\nromdeng\nmarshay\nxterras\nwaltiea\ncummines\npubgoers\nmycka\npalous\nroukos\nguineabissau\neinreinhofer\nthehub\nkafwain\ndryweryn\ncofiwch\nhipermart\npurgar\nosteoperosis\nodyessy\nzepped\nhynny\njiqin\narbayeen\nlunne\nartily\nshcontemporary\nporcellato\nkremes\nzhivilo\nanderselite\nswinder\ninvigilated\nantinozzi\nspude\nkaiserball\nparaphenylenediamine\ncaseby\ndeductables\nconvnet\nperrou\ntechnolo\nesmailyn\n
bakhshayesh\nvitalmiro\npotterat\nbinbags\nparlos\nvalires\ndebrazza\nnainakala\numbulharjo\nebidta\nrootphi\nuselton\npischa\ngact\ncannedy\nskandium\nbondzio\nlarrubia\nkitulagoda\nvozoff\ncatarivas\nchrisanne\nseignon\nsufering\nstoneview\nherwerth\nlobira\nfabulistic\npaaso\nalltell\nsansert\nplymbridge\nsyncora\nyoomi\nnagelmann\nperruccio\nsnugger\nschiralli\njamnicky\nleverndale\ncarmonas\nverhaaren\ncagefighter\nclarient\nnewsier\nloadsa\nzherka\ntrianing\nlysova\nkirollos\ncrumblier\noverpraise\ngongxin\nstiverson\nminutage\nappolicious\nfiostv\nkhateri\ndolorfino\noostdyk\ntaskstream\nsatbariya\ngoussev\nradwaniya\nrenegotiable\ntwihards\nmartinezes\nnaikuni\nmmfs\nsaxobank\nwuebbles\nkaidarashvili\ninedibles\nkanmi\nrakovich\nradlo\ntranquillised\nloudwell\nportmsouth\nhtsi\nslackerdom\nfallat\nperupetro\nkhayankhyarvaa\nsaltboxes\nringmistress\nsilagyi\ntexis\ncutera\nkeriako\ndrymala\ngenvec\ngnashers\nségo\nteacupful\njarlais\npotson\nmaroot\nakhlas\npolq\ncaucci\nfamillies\nhangtag\nshahruddin\nsemaphoring\nisebrook\nboroughwide\nsuwung\npelters\naxeda\nstealthiest\nhakamies\neuple\nstepic\ngoulborn\niishiba\nintefered\ntarenghi\nboozehound\ncapitolism\nmatzbacher\noverkalix\nboulesteix\nppifs\namphastar\ntalkmobile\nhaughtiest\nqvar\nsauturaga\ndemocracts\nbarsuglia\nkukuwa\nchuanlin\nmirander\nwolvie\ndukuzov\nstemwinder\ntauf\nsubmolecular\nhumilating\nbianchet\ntufegdzic\ncachaças\nusacheva\nahhed\naseefa\nrobbinschilds\ncagc\nshortcovers\nkabalu\nxiangchen\nzadworna\nsuseptible\ncoffeecake\nstratege\nsomeboby\ndarrie\ntummie\nregranex\nringscan\nzoëtry\nsubro\ncalpastatin\nkarlsten\nperfluoroelastomer\nclomp\nlobuje\nfuentez\ncareercast\ncustardy\nbreating\npremchaiporn\nglamming\nschapps\nsnugness\nhemcon\ndecarta\ngottino\nqaissi\nbamcinématek\nkiranchi\nuchena\ngeomôn\natmgurus\nnonstaining\nenterpreneurs\nominousness\njolliness\nyeeee\nquarterhouse\nzhuoma\ncoordinatior\nrilvan\nzinszer\nkanilai\npatrycia\nbratts\nzulifqar\nakilov\nfurmedge\n
yohnka\nnetxtreme\nabota\nmarkunas\nbarhum\nnetmethods\nfosko\nbullfeathers\nheartstealer\nkaluma\nzurik\nidotic\nridgeling\nfaurlin\nbovenzi\nazziman\nbogomilsky\nthamesbank\nmormom\nextracare\nnetsationals\ncoreflood\nheppelmann\ngarbages\nunwrecked\nyoursel\nnahyans\nsivelov\nescmid\nkuyam\njerkier\ncartmail\ncueno\nboliviarian\ncliseam\nlemonette\nwincingly\nsecfinex\nlanzate\naxilrod\ndhiaa\nmuschaweck\nsoldinger\nmotivaction\njatania\ndfob\nhassabi\nratholes\nkashmula\nfenglian\nwanh\njermoluk\nsteckart\nunexpectly\nvanderhaar\npennsylania\nhuffling\naslc\nelmalan\nmackerron\ndelvac\numwelthilfe\nmeevee\nsycrest\narbaje\nblazered\ncelious\nkazimiyah\ngrapegrowers\nmauquoy\nchraplyvy\ndanuri\nsnaptrack\nbroes\njezic\nnonmuslim\nbahgdad\nsucharow\nbaddoch\nabenaqui\nwesterburger\ndeltapodus\nrubdowns\nripcords\nluemba\ntamco\ndogubayazit\nchangbao\nferoli\nieae\ngollums\nwinkelreid\ntepilo\nbrandless\nlincl\nredeterminations\nclearpad\ndhuluiyah\nzafaryab\nbarbotte\nhedgefunds\nbegov\nachmadinejad\nhorberry\nswashing\nembuggerance\nkwedit\nmandiyu\nhufft\nperishingly\nmushraff\nvavreck\nprelicensure\ncaprail\nscatigna\namhras\nnjbpu\nhaydos\ncontech\nceasers\nibeanu\ngressett\nbieze\ncapitilize\npassportmd\nrozencwajg\nlethola\nloefgren\nstastical\nwintrob\njardo\nspatting\ntinklings\nyuhnke\nduxfield\ndeluzio\nbrokerswood\nmarktplaats\nadgas\nmackes\nazamiyah\nnondisplaced\naugustavas\nghelardini\nmcgautha\nmukoni\nallouettes\nmarckwardt\noboma\nremobilize\nqayoum\nhoines\nhyperfocused\nwampach\ndasaro\nverbio\nzwikel\nshutterly\nwailani\nanzemet\nbradburd\nmarijuna\nenduing\nabduljabar\nfratboys\nhomaizi\nportmuck\nnusym\nmakda\nloutishness\nramminger\nkhowlan\nmalgir\nchangpin\nschwertz\nfreakanomics\nkiwanga\nkaracadag\nperent\nmillsteed\ntawteen\nwinkelberg\nnoorderhaven\nabedine\ncoccoluto\nuncustomarily\ncyfeirio\nnavr\nmudavanhu\nneedlman\nremobilizing\nalbaisa\nalexanco\naurangajeb\nmontsoleil\nnaffest\ncontracyclical\nlongstick\nhydrovac\nsubstitues\nm
ultiplanet\nhyprocrisy\nmoffle\nslowworms\nabsnet\nhairmax\nfavourties\nwebsdale\nretrenches\nbonusses\njockying\nnekzad\ndevington\nringbacks\nmastertones\ntelzrow\nriverrink\nmaalula\ncibani\nunluckier\ntomrrow\nfmaily\ntrigt\nrawanda\nceralyte\ncapezzali\nileene\nturkcan\nfcoj\nkearsage\nchronix\ndeadpanning\ncassman\nradiohole\nmcskillet\nintergovernmentally\nventrassist\nunconvertible\nmorlat\nfugel\ncges\nromanc\ngeckeler\nmkda\nmeyskens\nmauffrey\norgal\ndicers\nkjaersgaard\nhextalls\nrodgeriqus\npuffet\nbrainwashers\nsypris\nrahodeb\nwheeden\nplattel\nesmailin\nuibel\nrasfer\nyurisel\ndahok\nsablic\ntijanis\nlechlitner\ngarbuja\nquickeys\nsolarfun\ncongu\nkawash\nsherter\nbaseem\npuissantes\nschraufnagel\ndictorship\nmentallity\nredtv\nbustar\nbanally\ndanehurst\nfthomas\ndikky\nfairl\nmarscher\nwhyke\nhazeldon\nssdnow\nnovamont\naccountabilty\nacountable\nhuajin\nkhawad\nspinwam\nsmarttrade\nbiosite\nchiacgo\nmajoda\nhandbagging\ncalilfornia\ndelmendo\ncivitano\nmiyeegombo\nsaieg\nmussig\nflashpots\nchecksfield\nshenyin\nosanga\nwordell\nwrister\nretoric\ncelsis\nscogs\nrazumova\nxingchang\nmikova\ninquiringly\nsuperpotent\nstampar\nhealthroster\nfoodsafe\nfienstein\nbejjani\nthundershower\nkututwa\nkutano\nburgeons\noponnents\nspeedwork\nconita\npronovias\ncherrymount\nadventis\ndecoufle\nsemeta\nmonomaniacally\nabuelhawa\nalcotest\ntranscallosal\nolago\njimmied\nkijafa\nrenzer\nhoshor\ngulowsen\nzamari\nlindoff\nmoyard\ngodswmobile\njanusson\ntikkas\nhartnady\nschottenfeld\noutpolls\nscotiamocatta\njimela\nballyhooing\nruymen\npalestini\nenlander\nstempler\nbestpic\niguidensis\nereleases\nabbamondi\nsimonka\nstaehr\nadlen\nbaecke\ntrajedy\nrennó\nvolling\nsubletter\nguyt\nsamaraneftegaz\nmagora\nkhaleeq\nguantun\nlaskawy\nsukhjeet\npinsentry\ngadur\nlambdon\nelmores\nzaanin\nwemheuer\nvarsallone\nbikable\nunderutilize\nhyperaware\nsluggards\nsharyar\nkuchinski\nmossialos\nstainken\nloster\nsuddoth\nakkeron\nfranfurt\nspawling\nkemkes\ngirneys\nkrati\nshan
well\nzabradli\nnonleague\nshilsky\ncommunisim\nkuadey\nanticopyright\ncedain\nliposuctioned\nfotl\nmassacusetts\nkeedwell\neffeithiau\nkunselman\nlerrick\nbledaite\ncaners\nbenoin\nphiloxenia\nspiffiest\nrecomplete\nhollicombe\npadiet\nrahmaniyah\nzurbatiyah\nzimpapers\nweatherpeople\nmomentoes\nxience\nyokata\nlanear\nunsavoriness\nwynaendts\nfreedarko\npeshwari\nmcphilips\noggle\nswomley\nrowhome\nzamarai\naawas\nkizzi\nthriftier\nrecompete\nqunshan\nbelbacha\npebo\nkingold\nsanctioner\nvetsfirst\nohhhhhhh\nsarlot\nmiloh\nblackmont\nincrementality\nberdieyinne\nbolkenstein\ncamperships\ngamecorp\nromuzga\nspandikow\ntermansen\nprebirth\nrajt\npoisining\nsyscan\ndvbe\nmassawe\nvietnamisation\njnem\nbooksonboard\nsainjon\ngorff\nrefusniks\nmarrujo\nfeeman\nbahceli\nallaux\ndenounciation\nexceutive\nellinghorst\nhilfman\nlukken\ndimassimo\nmikeyy\nparver\nforclosure\nmajorgeneral\ndolydd\nxueju\nunchilled\nlepro\nspellbindingly\npalestinains\nvidusha\nverchot\nshaltz\nraleys\nklangsang\njinou\nthemeless\nerlenbusch\nfintecna\nestrostep\ngerszberg\nmainolfi\nimazon\nsirajudin\npetitte\njoyrich\nturaqistan\neecl\nroadmonkey\njigwan\nneyhart\nmoralioglu\nsuweon\npsychoanalysed\nstimilus\nmcqueeny\ngusov\nmcguffee\nwauwinet\ncgpl\ntribole\nputzing\nmachievelli\nrakeen\nraheim\nkasambala\nshelk\nbiocentury\nwillomitzer\nkalban\nvelissarides\nhakskeen\nsynerject\nthrivers\nzakarneh\ndughmush\nwashingotn\nfacebooked\ndeeres\ndiscreditably\ndecommitment\nseandel\nfrakkin\nschoolgirlish\nbiven\nprodigally\nwisked\nglurdjidze\nnaziq\npqri\nheadtorch\nwickramatunge\nniftier\nliverpol\niiwa\nhekmatullah\nnoncertified\nratanamorn\nhoppert\npalpitates\ntdtv\nbarbituates\npetitbois\nmatone\nclosantel\nnuvomedia\npoldma\naruh\nsheigra\ndiamondware\nhunosa\ndkrw\nbiffed\nvenk\nunromanticised\nslinkiest\nigive\napologizer\nmagrino\nonebiggame\npassafiume\nzivancevic\nreaccelerate\nzerista\nhaelterman\ncajou\ngovement\nnuclearisation\nmordenti\nroumel\ntheright\nuncollateralised\ncelg
\ntutana\nshuala\ntakkt\nverhaagh\nworldlier\nsentrus\nalceus\nsemaya\ndogubeyazit\nraingear\nzugaza\nlazette\ntursunova\nglassings\nschaffart\nhritz\nefilecabinet\ngrischuna\nsalicath\nabderamane\nmombaur\ntoevs\nmercedeses\nprvt\nduffed\nsakong\njouney\nrydning\ncalingaert\ndsti\ndefife\nelectriccommander\nlasseur\nspamford\nsinglehurst\nadelsons\nurogynecologist\nkanikka\nfirchau\nscbm\nmidsixties\nairfreighted\ntenbrink\nlooooove\nooohs\nbaymiller\nkayakoy\ndovebid\nchatikavanich\nrepreive\nalavaro\nvujanic\ncharlottenstrasse\nbohuslan\nstephanowicz\nkerkyasharian\ndawsy\nalberson\nchindogu\nkohmann\nthruyou\nsobelle\nlasaro\nnondiscretionary\nspauling\ncalzon\nimmsi\nrazadyne\nunauditable\nexousia\nclaustrophobe\nmangiola\nawtan\nclafouti\nresomation\nsammadar\nrulemakers\nrussmann\nsignifi\nqtech\ntogoimi\nmicrophage\nliftboats\nkirchmaier\ncatienus\naugi\nzakout\nrdpr\ntenneh\nmpdi\nethirajan\nellacoya\ndroukdal\nmmscf\nnisantasi\nmotormouthed\nmehiläinen\ndockdogs\njoric\nantelava\nsusol\nesmin\nepuron\ncushier\nshaplin\nfudgey\ncoffinettes\nswertz\nreamined\ndumers\ninvironment\nparchet\ngorbechev\nracialize\nvladovic\nwillmake\nkronors\nworrisomely\nguarrantee\nolgay\ncinet\ntedsters\nbushit\nsoyjoy\nbicksler\naurs\nprovencare\nemapa\nkatzourakis\nkaumeyer\ncharmlessly\nreindicted\njerril\nmediterra\nshork\nbouncily\naccordioned\nangerame\nraddho\nwetroom\nhawlicek\natragene\ndevistation\nnorbank\nselfsufficient\ntatarowicz\nmurier\nurni\ndincel\nkommounistiko\ntoshirou\nsawasaki\nliepold\nbiamp\ngabridge\nmujuthaba\npennay\nkwapa\necrehous\ntrackie\ncarregosa\nsiripan\nbrickles\nmatthewses\nstelluto\ncastelbello\ncloudshield\noperat\ntresset\nalguera\nmeritocrat\nwahabbist\nbasturk\nkolpin\nllani\nunindicated\nhopleaf\nteenee\nkaziboni\nkuvshinova\ntritest\nmamarbachi\nxianguo\ndoodycalls\ncolonscopy\nretroscope\ncoverflow\nberkwitz\nministrokes\nfongwan\ngluckson\nsidakan\nepitiro\nkaloyanides\nsupri\ncoundley\nantireflux\nbercero\nswartzman\nbritians\nc
aritiana\nuanble\ngeam\nvenjah\npreggo\nchammari\nunoprostone\nermatov\nnukala\ncronyist\nsiraz\nhospitol\noursleves\ntwitchiness\nhockeysticks\nconfabs\narestat\njalaeipour\nennstone\nchangewave\nmalebogo\npredigital\nauturo\nbritflick\nclake\nslenk\npureit\nvigent\narkal\napof\nbingsheng\nunifest\ndowntrends\napero\nkamaaina\ntogged\nnawabad\ncyberbox\nkommunalkredit\nfdac\nartemisias\nhandwarmers\nnewlaithes\ntighest\nadrenalised\nrambaldini\ncliatt\nlajcak\ntrizetto\nmegastomias\nprochazkova\nwurud\nretout\nipayment\nrogles\ncresyn\nahmadzada\ndeleveraged\ncarlinsky\ngiambelluca\ndisarmers\nkermanian\ncyncial\nmcaliley\narayama\npopultion\nqualeh\nzarbakht\nunderstimate\ngapay\nraceable\nkajuri\ncereplast\nexenberger\nwhimpy\nhaneline\ncrimminal\nwegryn\nklatches\ntravelsupermarket\nfrikken\nmetsavaht\nmerrywell\nyoobamrung\npalmitoleate\nquidel\nwestlawnext\nsucccessful\nunjaded\nvlyf\nkatzke\nastronergy\nmlawer\nchangson\nauwaerter\nocsober\ngrienke\nbaloloy\nredeemability\nspands\nnayaf\nchaffoteaux\ntechstreet\nalshabab\naestheticised\nshipitko\neifan\ngladhanding\nmannigan\ndecollete\nzuzin\nkottmyer\nnovogrudek\nconcertinaed\nunpresidential\nosteoplasty\nhelioslough\nmarquest\namardev\nmishitting\nburgans\nlagzdina\ndramtically\nbonnant\nfilebound\nquadramed\nfieldside\ninterfaceflor\noverspreading\nhulf\nickies\ngenaudio\nnonpaper\npursuiting\nashcrofts\nosmonaliyev\nberentson\nratshitanga\nrentel\nbetablocker\ngwefan\ndimmery\ncholestorol\nrhaglen\nsafflowers\nsallenger\nrushid\ngorgiashvili\nlumgair\nkaryagin\nshokubai\ndipal\ntolerancy\nakpele\nburliest\nskoblikov\ngoldins\nvincentric\nmaqar\npallancata\nmichalchyshyn\nmultaka\nmensil\nmadurell\ninshriach\ngargoyled\nnewshole\nbecasse\ngeegee\nshanman\ndjousse\ncarouses\nbridion\ngubenatorial\nyasmann\nmelodeo\ndualmode\nohlhaber\nhosselkus\nasliddin\nasteco\nglci\nwomanized\nngarambe\nsarigerme\nzests\nfrenna\nbotkier\natheron\nchernomorneftegaz\ndracaenas\npostwork\nhatchel\nshillaker\ncyberdisplay\n
hernst\nsandelman\netnz\nlightborne\njarrom\njinduicheng\nanoraky\nmalealea\nimundo\nbowriders\nozhan\nneglegent\nkusmer\nskreba\nshajoy\nlewaravu\nunderprotected\nfyock\nbibliowicz\nstoppardian\nkazombiaze\nfortyfold\nballein\nmoundros\ntaoping\nrhenigidale\ncobbes\nbrezhnevian\npelaccio\neffortfully\nbergermeer\nnonjews\ncagnazzi\nwonkily\nsendups\nontheir\nmoelleken\nchiames\nsardeha\nzimbawe\nkocker\norbuch\ngoldengrove\nheijokyo\nwaterloos\nzaarir\norexo\nyermack\nrotert\nplayfootball\nwasseem\nfilsoufi\nbozard\nthreatexpert\ntheken\nrobinul\nmough\ncommittedly\nwhinnied\nmisna\nvigal\nboezio\nalprin\nmuniwireless\nhochgurgl\nkeppelhoff\ncogeval\ntigher\ndynamise\nqurbi\nstabalized\nkookiest\nticoll\nsoscia\nduekoue\nmedicide\nvaark\nwoodburner\nallybar\nlhakhangs\nantii\nreaccelerated\nkiriakakis\njaheem\nsaadnayel\nhydrodome\nnongamers\nkoverman\ntanoos\nchoongh\nboordy\npressive\nchawkay\nanderies\nprisna\nrones\ndesperaux\nvivaty\nmoneea\njoszt\nscrappiest\nbrocko\nderar\nunproud\nadesta\ntbsc\natomredmetzoloto\npolymedix\nmpuc\nlumera\nchemosurgery\nhandrolling\nnoncareer\numiyuki\ngurbantunggut\nelithis\nthandisizwe\noutpitching\nkikuyuland\nelinogrel\ngaribyan\ngatorfest\nthabault\nsanbona\nbracketologists\nhycrete\nsbrefa\nhersonski\nkrudy\nirlan\nagued\nbudihari\nkérastase\neverlyne\ntrippiness\ncrystalise\nbesseberg\narwain\nverbiscer\nskretteberg\ntaccetti\nirigonegaray\nhmag\nostir\nsheikhly\nmarzouka\npoortmans\nspeisman\nlegowo\nrogered\ndragooning\ngruenstein\nballahutchin\nbinangun\nfugazzeta\ncamahort\nhudziak\nconservitive\nundeland\nentura\nzeravica\nnetchoice\ntreesje\nkalief\nmackriell\ngaccio\ncpei\nserd\nleontaris\nmusinski\nmasaud\nsmileybooks\ndrollest\npatrak\ndalvinder\nmyfoxphoenix\nunderlap\nxcell\nnonwinners\ngpsi\nnatterings\nstorzer\nbernasek\ncorway\ningorant\ntreiser\nknbt\nhallandsåsen\nseramas\ndwikhondito\nswanned\nasifi\nwyndal\nbowcutt\ntinseth\nmidtower\nviggiu\npuhleeze\nracetrax\nmarota\nbelluardo\nmusiccares\npalanzo\n
ngdt\ncasorso\nyertysbayev\nhealthsmart\nlecka\nrasilez\nproesch\ndpri\nlicsw\nsourpusses\nsaghand\ncanoodle\nkeoghs\nmyrthil\nproreader\nmulhaupt\nsanukite\nbnsp\nworldbeaters\nachtner\nvopium\nblondeness\ngroople\ngroupabout\ncharcon\nleidenberger\ntheochari\nluckhaupt\nnefariousness\nuyeki\nipth\nmychoice\nsloganeer\ninnoculations\npunchestowns\ninsipidness\ncrasbo\nchoeden\nmyespn\nbalcas\nwaldmani\nscanlife\nspoonable\npichichero\ncompnaies\nivad\nkralyevich\nschwellnus\nsenjaray\ncitac\nceau\nmereilles\nforclosed\nglycomark\ndissatified\nprepsters\nspringier\niufms\nveeva\nsixmilewater\nenyele\nzebrugge\nmacksoud\nglitazones\nkontilai\nhevly\nshlein\nwiswedel\ndudl\nhodmezovasarhely\nclonings\nohva\naspidistras\nlemack\ncastiglionis\nperipatetically\najras\nstonecastle\ncaramanlis\ndelagates\nseeit\npirotti\nwajba\nbilllion\ncagelike\nhugins\nstarcaps\nboggier\ngrosholtz\nbakkom\nkortuem\nkvivik\nyhey\nnusing\ncanyonside\naliriza\ntalbooth\nlmga\naasmundstad\nutrechtsestraat\nviradouro\ninaccordance\nekeli\nleadres\nmutobo\nnetcents\nbuangan\nconvington\nenell\nroshydromet\nkozelka\nfoodvest\nsangduen\nskarssen\nprelapse\nparamiltary\nclimatesmart\nelaa\nziecker\npaksitani\njavers\nleshno\nbouncebackability\nhueh\nmiqdadiya\njakari\nchindex\ncatcote\ntangwanghe\nidrissu\ndurrua\nsuperlong\nhazuka\nwindyhall\nmerked\nrezconnect\naserf\nabdelhai\nmwelwa\nhertzmann\nivanplats\naneres\nhimbos\nvilaceca\ncelebrityhood\nroofbeam\ntypcially\nmabthera\nespad\nbajillions\nsgeis\nchdos\ndellaporta\nrippeth\ndoim\nleanore\neblex\nlaundromatinee\nbillotti\nraishbrook\nmengqian\nlandsnes\ncydcor\neslr\ndicastro\nsukhdave\nstavitskaya\nmemolink\nthomassey\nvisipaque\nbadurdeen\nyesawich\nurnov\nkreidel\nibrahimia\npostcrash\nmijoro\nmannuzza\nnoninflationary\nalizyme\ncuunjieng\nstupek\nhadlington\npolanksy\nmedstrat\nkrygyzstan\nsiyamak\ndestablise\nnoteless\ngiftable\nultreo\nbanmiller\nberingea\nterrito\nlamic\ngutlessness\nraychoudhuri\nseddi\nakaz\nhwnnw\nhbac\nrendell
s\nnantas\nhootnick\nkickouts\nreyda\npogrebniak\nmutumbo\nrhapsodise\ncuzon\nzabaleen\nberoiz\nrazorgator\niacolino\niotum\npaline\nstylemark\nqiujiang\nnonincumbent\nscholly\nhostelbookers\nfamulare\nannualisation\npitots\ncendon\ndownlighters\ngerrero\nfirly\ntahdia\nhascup\ndiettribe\nppdg\nnamuncura\ndecentralises\nartope\ncolagiovanni\nzaranek\nconstiuents\nprusci\nvanhool\ncirclelending\nsaidd\nawyren\njading\nmootral\nsmithbucklin\nickier\nbusinessworks\nshuana\nokema\nhoneman\nkaramouzis\nerdaoqiao\nchepulis\nlyndin\nlockness\ngreencycle\nidelogical\nfloweriness\njerime\ncardica\nsleepness\ngastelu\nzimondi\nsundher\nnicodeme\nfastballer\nchifunyise\nzubeyde\nlorich\nopensides\ngwsc\nkamrob\nskwerl\nstaco\nrpis\niljazi\nsandeels\nbrandelli\nforwar\ncompetefor\nacqusitions\nsulkovsky\nblueskies\njannett\nburutin\nukccis\npaekdu\nnealry\ngrandos\ndmec\nvilhjalmsson\nferaz\nyaldara\nchickcharnie\nchaupad\nbirkmire\nkagro\ncciee\nkurbegovic\ngeplak\nswiston\nportpin\nschewel\nnjpa\nigodigital\nfreidan\nhsopital\nprotho\nsteeliest\nkhawari\nakinaka\nchirring\ntimebanking\nmohamedain\nivotronic\nsegelstein\njunkier\nhomaid\nsorros\nsmsi\nmediaconnect\nstrmecki\nclownishly\npampe\nnutlike\nnonautistic\nostrolenk\nsoundabout\nbillingses\npyeritz\nnonforfeitable\nnerby\nnedanovski\nsandlofer\nouirgane\nmargoles\npubby\nchicanna\nharerimana\ngeosentinel\nelomire\nutilites\npodhajski\ncnpem\nspinale\nhuxlin\nfrable\nverbrugh\nmachil\npredecisional\ntibbert\nscarved\nnamugala\nfasick\nturaihi\ncounterstrategy\ncorupt\nchoubina\nkleinhaus\nautomart\ndanyong\nloundy\navanex\nfrnakly\nchler\nvooks\nronnee\nphurnacite\naccoya\nthuggishly\ndustlike\nbrynmally\npttow\nbossenger\nqaswarah\nwipsi\nenterprisewide\nszakaly\ndatalabs\nawwwwwwww\nnorthless\nperezcano\nhilam\nurds\nkrausner\nentralled\nslepicka\ngyurmey\nbuildouts\nunfrock\nmedjumbe\nericsdottir\nschme\nmooore\nlevetan\ntwoway\nmonitary\nhorillo\nuntheatrical\nhunshi\nunderstandin\nhillon\nmicturating\nparkus\npoly
dopamine\nscotchguard\ntchilaia\nlungundu\nvdec\namerah\nyorkshirewoman\neyeshadows\nhunkiest\npuppyish\nidcg\ntweenagers\nhusani\nfardosa\nscrogie\ntureli\nhomemanager\nundereye\nifthikar\nunsexiest\nvodpod\nbrutzman\npromacta\nevco\nniaki\nchyngton\ncavvy\nwraig\neuroconsult\njetbus\nhoumous\nstepter\ncbeex\nfaiure\nbenchings\ngundogan\nwesal\nherubel\nrainsoaked\nostensen\napplbaum\nbejaysus\ninterlinings\ntritch\nzemlyansky\ndeconto\njamiaa\ncosport\nlimco\ncredant\nyepiz\nharralson\npizzola\nsalwens\nsevenhills\npeasents\nsieno\nfmlc\nniekirk\nfatless\ntraditonally\nsentimentalising\ncurosurf\nnsrt\ncossetted\nhoneytraps\ntickett\npreformulation\nibone\nddod\nrellas\nluxlash\nmcaneney\nnosovice\nbedsharing\ngordji\nbodystep\nyogarasa\nplaninic\nbombmakers\nlastonia\nmicrodistilling\nknca\npohoryles\nimagenetix\nwootens\ndaxas\nronilson\nsirilal\nmuhlenkamp\nbefera\nveterens\nunderhit\nipct\notana\nsarjo\ncagier\nzekelman\nchernoi\nsuzettes\ntaride\nwindwood\nblingy\nkhalass\nmissimi\naffirmitive\nwolaner\nluduena\nnanodragster\nfeeneys\nedulink\ntownfoot\nlixun\nboomerangers\ndimieari\nbijeel\ngollywog\nschiestel\ntwickers\nsnivels\nvaltchev\nmuhlhauser\nmourayan\nmarakis\nmtsc\nmvrs\nstosberg\ndominka\nhoggie\nlatture\ndhanju\nresoling\nreinjecting\nmyungji\nelitech\nhoellwarth\nnanev\nunjudgmental\nbelth\nowczarski\nbnac\nchoedak\nvaidisova\ncontemptuousness\nbunol\nfinalcut\nzayuna\naocl\naccorn\ncollasped\ncostcutting\nviawest\nsciclone\ndourest\nchwaraeon\nhargey\nalchol\nxcz\ndemocratiques\nmilinovic\ndinsmores\ncondis\ndunclug\ndiscrimated\nsmeland\nmousad\nzlatanovic\ndjebbar\nparkervision\ndeftest\nsocialight\nmoldow\nhoopty\nwojtecki\npianin\nfizzier\nenvigorated\nhpsas\nmayelikohan\npordon\ntrochet\nfiberweb\ncomdexvirtual\nrepucci\nbenrock\nfusnesau\ndapidran\ngyorfi\nwilsterman\nwichcraft\nmickeyd\nyangpyong\nlaneve\nmcely\ncreemos\njdimytai\nqianfo\ntolerent\nsagey\nelfenworks\nfbars\nmarose\nnellcote\ndrudger\nairportal\nessangui\nteachfirst\nope
ratin\nlewinksy\nmosalikanti\nshije\ngoinggreen\nallegeldy\noverexpand\nchepkemboi\nnteziryayo\nuntradeable\nworldcard\ndaltonian\nnonsquamous\nmangiantini\nikebal\ncherquenco\ngrandluxe\narzou\nodlozil\ndecoff\nvampirical\nfoxily\nkpnc\nsoxman\nponne\nguinnea\nshowal\nuncommercialised\ntimberlawn\nfurtw\nsteiker\nbankman\nghazawi\nraisian\ndajohn\nbreedveld\ndeedie\narfaa\njockish\nultratravel\nhuldisch\nmacrolane\nsathiyan\nvancisin\nforkful\nbedevilment\nbankfirst\nrosebrock\nlefraks\ncwaf\nsaelee\nsilkiest\nainkawa\naixtron\nsuspose\ntrattorie\ngreengo\nontex\nuncensured\ndermatologically\nbohanna\nobenschain\nfunfit\nmedflash\nskippyjon\nprefete\niscn\nymgynghori\nscantest\nicmeler\nwangsness\nwextrust\nsuperquadra\neheart\nmapasua\nbesecker\npetercam\nmazraq\ninfocrossing\nburkheiser\nsvanes\nsladjan\nherszberg\ngiribone\nlizardy\nsquibbed\njaisham\nhapn\npigd\ngrishkoff\nbioenterprise\ngynnig\nalleron\nmckeeve\nbarnholtz\namvac\nhamsik\npatineur\nfinancal\npeignoirs\neisiau\nminzolini\ntelepiu\ncrosstech\ninferometer\nreminyl\njanielle\nnobbly\ndomainsbyproxy\noveramplified\npetroenergy\nsmfm\nstagging\neskovitz\nmisiura\nleafers\nshaqs\nbefouls\nshahabeddin\ncustomiser\noptimor\nchupka\ntannachy\npmscs\nscarjo\nmolica\nsehnert\nzzzs\nlevittowns\npanetteria\ndahalani\naspercreme\njidori\npreganant\nreblackpool\ngherzi\naujila\nyansha\ncryobanks\nroughneen\nassocations\nswifly\npobanz\nedicson\nazmal\nlykourgou\ndamchoe\nhmics\nwireforms\ngardesana\nhokiness\nbernardaud\nayudas\nmultaq\nnaraki\ngaudiello\nsocietythe\nshopsins\ngoodbrand\nliram\ntowheaded\nchantana\nthumpings\nmzili\nwozzy\nirizarri\ngeeing\nenotah\naappo\nthatje\ntuttman\ndarxia\nchowns\ncheckett\nirksomely\nvahtera\nbourzai\ndaycoval\nazazeel\nnannelli\npunchiest\ndruidale\nkimbriel\nforgaard\ndiscoverx\nmcellrath\nhyclak\ncentrico\nparagallo\ndaryatmo\nunscrutinised\nuapd\nharebreaks\nnatalo\nvanhooydonck\nchurnings\nlasorsa\nperacchia\nalbinski\ndianyuan\nrohd\ntraited\nmuhtaseb\ngromicko\n
writeroom\nspenhill\nsuppoters\nhornbarger\ngohir\nmorquecho\ndirshe\nturchet\nwicab\nmazamanian\nsupergraphic\ngraap\nliftline\nbouncebacks\nkowalcyk\nbaraou\nprisions\nmushraf\narthroscopes\nhyvarinen\nflammia\nnorklun\nkatiria\nteleconferenced\noverfloweth\npokrovnik\nyatskievych\nklaviter\njamjoom\npipsqueaks\nplechner\ntdvcodec\nillionois\nhubsi\nrushwaya\nmccraty\ncironi\nblta\ncomapanies\nzhancheng\noppponent\ndribbly\nhayab\nskedaddled\nrogowicz\nunstowing\nfrienemies\nfirstcity\nfaridany\naccure\ngammagard\nharussani\nyalen\nsalaheldin\narrm\nrosnani\nkacandes\nchuffs\nsohat\nwynfrey\nalcobas\nmiserabilism\ndogtopia\noogjes\nstoryvault\ndamange\nfrazeur\npitnick\nspigaroli\npedestrianize\nreleaved\neurotaxglass\nqaderzadeh\nunderoccupied\nmerkushev\nborval\ngoeas\nkrasn\nsmallball\nirizuki\ncoalmen\nbarladeanu\nspeet\ndaugter\nbertucco\nknowehead\ntiresomeness\nyelnikov\nsolian\ncoquillat\nyeondoo\nnamouh\nforida\nneopeltolide\nmatatizo\ndiory\nliugui\nunflushed\nsnowshoed\nhogli\nlandrell\nmwakasungura\nbatwomen\nantithrombotics\njessicas\nfrincke\nallback\nanecdotalist\nslobbing\nunholster\nchantrill\ntellam\nkanzus\nleichtenstein\nshowhorse\ncorgentum\nfsmt\nentrancingly\npolicharki\ndeparments\nnonunionized\nvanillas\nglir\nnsqip\nnohpat\niurc\nenovate\ncherveny\ntelergy\nnonphysicians\npamos\nwrange\newusi\nvobile\nsweatpant\nguolong\ndisb\nzajtman\nselectadisc\ngenocidaire\nhamidzada\naltamore\nchumra\nclacked\nsquassoni\nmcgriddle\nfaridullah\ndrobisz\nasets\naloun\ngertmenian\nreticient\nconstituences\nnqobizitha\nreciben\nhbtc\nbahrains\nshult\nsomadikarta\nfalewicz\npwnc\nbenightedness\npornthiva\ncamiro\nqiodravu\ncdhp\ndepegged\nhelldorfer\nracebrook\nchemchemal\nxeriscaped\nrushower\nimataca\noossanen\nıf\nmoenig\nlafeet\nhendzel\nbinui\nnaabzada\nspectical\nngere\nhospodarske\nhausenblas\naircap\ncomcam\ngombossy\nfronteer\nretrogene\nliggers\nfuschias\nbagfuls\nbauert\nnasirzadeh\nwiimotes\nlongmay\nwatermaster\ndunievitz\ntreanda\nincrimenta
l\natoyebi\nrawitch\norock\nbendett\nblackberrying\ninvigilate\nhajdasz\norbaum\nusuary\netlin\nmichellod\nmidevil\nhowr\nphillipina\nsmooched\nandarko\neyesmart\ndywed\nizaurralde\nsukant\nprimeurs\ntixylix\nboardfest\ngoulios\nunstopable\nsedaqat\nmajome\nsoflens\nthuba\nryaguzov\nfuriousness\nguereda\npizzette\nmoronuki\nbudesliga\njpsk\nluminere\ndweud\ngovernership\ndoneghy\nbasyrov\nkristianto\ngwneud\nweathy\njppm\ngmtn\nprovamel\ninkubation\nkandaharis\ncanawati\noverstyled\nappsec\neathai\nelasha\npenggen\ngimer\nshoebomber\nelanbach\neyewatering\nnorthoff\ntysson\nbeseechingly\nglospace\nmegamovie\nsamure\nbomboniere\ndeshar\nmuhrcke\nfannying\nleveretts\nsabeckis\ncadna\nwilldorf\nposioned\nlattif\noutdoorswoman\nmisamore\natxalandabaso\nstjc\nneiderer\npiccaso\nunlacing\nsharick\nkhourshid\nsidearmed\nlifflander\nburrelli\nxiaoduan\nfrazzini\npresepi\nabdennadher\nhypermesh\nrenagel\nmaronian\nwavelight\nunderheated\nfstr\ntauntings\nadamovicz\nkatoro\ntheut\ntrumba\nglenhill\nunlaundered\ntogelius\ncellai\npontarsais\njuliber\nfanatec\nemotionalize\nfreewrite\nsnbts\nbarnao\nbestpractices\npreibus\npulbere\ntrayport\nccsbt\ncaccavella\ncalosha\nwirehouse\ngklavakis\ncrongeyer\nxantrex\nphytoglycogen\nhaeley\nsuperprime\nmcgrorty\nkossie\nsabcs\naxene\npanflu\nnhow\nreactiv\nsilverfern\nsandretti\nnemakonde\nlandesklinikum\ndemythologise\nbaniulis\nspunkiest\nchoska\nanahiem\nandrianony\ncspl\nflemke\nmilosa\nquimet\npgic\nstearing\npontzer\nchadrick\nrogasch\nbidoon\nundermotivated\nfashionair\nscarpette\noveralled\nacquifer\nrazoring\nmeangingful\nexhaustedly\nrastafarism\ntrichopoulou\nazadnagar\nbaccellieri\nrorb\nweintz\nshemy\ndatcp\nbwcabus\nyabad\nreorientating\npoliticains\nrampager\ntrebevic\nstrulovici\nbohigian\nleesfield\ngibh\nsorbera\nwladyka\nclincial\nconstitutency\ntiime\ngloersen\ndisneylands\nfarisa\nsinsuwongse\nsplitty\nnextmap\ntruecompanion\ngiacchetto\nsetlow\nsmirkingly\natisreal\nloorz\nyaquby\ngearworks\nmycoop\nbertilson\nsta
ngler\nbashika\nowlishly\npregancies\neyjolfur\nkanoto\nsenocak\nskarrild\nmantecal\nsilversol\nsolaicx\nrietze\ncozied\nbubser\ntinhay\ncrouchley\nslipcovered\naurik\nchristoyannis\ncequent\nantiprostitution\ndfeb\nwollschlager\nknuckleballing\nmacroeconomically\nsrisook\njiyad\nunight\nemafo\nccbp\npagnamenta\neacha\ngawell\ngiarra\nwashbowls\nunaffectionately\nkehoes\nmetropulos\nxvala\nfrohwein\nshowbizzy\nreappeal\ncongresscenter\ncaravanos\nconsigner\nhumaidhi\nuhlenhopp\nnashvilles\nracusin\nalvogen\nwaayeel\nhealthtalk\nskunking\ngcec\nkotchen\nmazuka\ndanaei\nkolltan\nhimandhoo\ngarbutts\ncombinatorx\nfettuccini\nwidewell\ntivoed\nrobala\nkomlosi\nhaloumi\nstirlitz\ntravelmaster\nlassise\nhomoerotically\npejcinovic\ncarruades\nswitcheroos\nturkewitz\nperithia\ndealertrack\nquilvest\nmoatassim\nalinia\nmeshad\nyamoun\narysta\ndeinhardt\nimageware\nfebbo\ntshkinvali\nfortrans\nrunions\nkochifas\nveiwers\nrolufs\nbaragar\nvineys\nladderlike\nyinhong\nzenie\ndojaka\nheartrendingly\ndjordjevich\nhitlery\nmadrenas\nbronczek\ngianfranceschi\nkarabak\ncistuses\nwebctrl\ngrivnov\nburleys\nfarrakan\ngywneth\ndaisycutter\nzenns\nflouch\nllawn\npierzchala\nmetatartaric\ncontituency\ntranquilise\nphadungsil\nmccamy\nkebkabiya\nbotwinik\nsemcken\nkenneway\nkhyali\nwwoofing\ncastron\nmikoczy\nahfc\nglasers\nbegolly\ndeposer\npartnerka\nspendy\ngnvc\nblixseths\nthanatologists\ncloseknit\nlladd\nvirtuoz\napoligised\nammenities\nshimaa\nneosa\nchatterati\ntoumeh\nparaportiani\ndaypacks\npaavonen\nsauey\nroofscapes\ncmft\nprosound\nsittert\nmasibambane\nmccammack\nmccauliffe\noshel\nnydj\nnesbo\nkimotong\ndshi\ngelatos\nspielbergs\nmikulska\nmisdefense\nqleibo\nammuntion\njachnik\ntotallyjewish\nhonsberg\noxilp\neinum\nolass\nsalaisons\nruesselsheim\ntelepan\nkomanski\nflautt\nlimbad\nruessi\nrepolling\nneafsey\nstuggles\nixempra\nupfitting\nbrokercheck\nlacalamita\ndetonative\nyoubet\ncempa\nprekaze\nswaggeringly\nstursa\nsampong\nspitzauer\nbuinevicius\nnortin\nelgammal\nmi
guelez\ndulford\ndrpic\ntegwyn\nkerstie\nloubeau\nsokaluk\ngokce\ncrme\nscheri\nbudgeters\nmyday\nhousseini\nfiream\nhellishness\nsoussana\nwitchiness\nseptemer\nitallian\nthenuwara\nantiproliferation\nschappler\nmicroplanet\nkhogiani\ngiancarla\ntepav\nincestuousness\nstraayer\nsqaud\nsielemann\ntouranment\nethanols\nmineer\nanamarie\nncjfcj\ntavulares\nwalkiria\ntanjiashan\nfiotakis\nenterprisingly\npushinka\nexámenes\nownit\ntimeing\nzojoji\ndiseasome\nfelana\nmegwa\nharjedalen\ngrapplings\nerrami\nrullah\ndiepolder\nnasfaa\nferniehill\nthingummybob\nmahabeer\nforklifting\ndikoy\ncreanza\nmarzah\nurethras\nwirecutters\ntpps\nvillagevines\npostpay\nsteirteghem\ngabura\nparcio\nzirandaro\nemcp\npresssure\nramussen\npimpi\npabell\nboudrot\nsatomiae\nrearly\nbardaweel\nlowde\nprattled\nrodota\nsecuritizer\ntoates\nwidlife\nkaribuni\nwoolich\negelien\nfibropapilloma\nartstorm\ndekelboum\nguantanimo\nhallihan\ncherkov\nmakili\ndoryman\nwahanda\nmbpa\nmagarsa\ndavore\narroni\nreoffenders\nkotevski\nheximer\nbehud\nticey\npalatably\nenginee\nparadisis\nruzgas\nconsolingly\nbuesaco\nisakovic\nwestins\nzinnanti\nsuaya\naytre\njspca\nbageecha\nnardine\nguardala\nleifland\npummelos\nsmartsense\nhydeskov\npitzarella\nmarchadier\nmiksanek\ncounterrevolutions\nconnextions\nallmenus\nsiig\nflexbar\nfacehunter\nstalkerazzi\nteeni\nmiscuing\nwoger\nersie\nkickz\nkeslassy\nbiotown\nmehboba\nvoied\ndeconsolidated\nlintala\nsmugs\noverweaning\nhillraiser\nlailvaux\npresliced\nzonias\nrimli\nsnedegar\ndoyennes\nfetterhoff\nkruegers\ncnossen\nfaizasyah\nbagila\nnewsmith\nmwakio\ngubta\nmujawayo\nonerously\nsemisecret\nschaye\nalisara\nbriitish\ngargagliano\nvizioncore\nhandstitched\nrohmah\nlovies\nopencrowd\nlibertiny\ngaľa\ndelleney\nbiocrude\nashstead\nkausal\nneidl\nlunchbucket\ndharmawardena\nbellieve\nultcw\nlvaro\nlittleheath\nkyriad\npursuiters\nmbbi\nslyer\ndihok\nboggild\nlacquement\naspinalls\nosenat\npynter\nlaksin\nsmartstart\nmanarin\nbogaards\nsibotshiwe\nuntackled\nendg
adget\nunthreateningly\ndeodorise\nmohammaden\nwedepohl\nhandyphone\nabotsway\nuncinematic\ncaviola\nliquidware\nlanarth\novercollateralization\nmourino\nnfus\ncrocetto\ncheptais\nmelmount\nmotzer\nboespflug\njakstas\ntalkfests\ncresapartners\nshmear\npowergenix\nhardekopf\nnarcoterrorists\nreflationary\nprovexis\npostfight\namselle\nuplinq\nmuirhall\nleeney\nzeuli\ntrusim\nacciardi\ntessas\nplotholders\nflenner\nkrosby\nhalfnight\nopenband\nbougette\nobagi\nsebrle\nnewstex\nmcwar\nmbmg\nangemi\nmoseyed\nfearmongers\nlumpu\npalanco\nahour\nsinutab\nsodian\nsheikhli\nsanofipasteur\ncinephilic\nlothstein\nsoubie\nzerno\nbkhm\ncystography\nusocial\nsermonised\nobhi\nammoudi\nlousier\nereckson\nparacletes\nkrpshtskan\ngiganomics\nccording\nseasonique\nperlroth\nfieriness\nmugginess\nbarthelemey\nxenoport\nmetrogel\ntranc\nteguest\nmuslimyar\nmulticlient\nfoccacia\nkosaisuk\nmenopur\nhizer\nschelberg\nbarnlike\ncebri\ntosay\nturnow\nanalex\npowerskin\nsidie\nschoeber\nahijado\nzlhr\ncushty\nigadd\nfpies\nputumattalan\noglesbee\nselfesteem\nmallipo\nsubisidies\nafrough\njustwhistledixie\ncycnical\ncfnc\nwoollies\nbiben\nsimultaneoulsy\nkreishan\nuoya\nadapoids\nkrovvidi\npatelis\nyalayalatabua\nbrasato\nconnerys\nfunkily\nsurendiran\ncomora\nkanmen\ngasby\nheglin\nameripath\nfanswarm\nlydiatt\nmetalmaster\nchiroma\nwindemuth\nlukacz\nsplashier\nkrentcil\ncataouatche\nyastremskiy\nmcclanaghan\nmegamerger\nshticky\nartherosclerosis\ngrueskin\ntendy\nlignieres\ngallmeyer\ngirthed\nhomman\njebur\nrahati\nwahayshi\nbluetraker\nyospe\ntuduv\ninfantilising\ndufflet\nbalony\nadrianto\nappcraver\nrubbled\nescapia\nmostari\nbelche\ngülenists\nkhalidis\nweitzberg\nsouhail\nusfm\nchirpiness\nproffessors\nmusolini\nstonily\nnothingburger\npengassan\nmantrips\nkarele\nungenteel\nwepower\npipline\npanoramically\ndinnin\nbravey\nsubeh\npoerschke\ndzemaili\nslentrol\npallipat\nunequivical\nquestek\npointment\ntawain\ndonnybrooks\nabaar\njambia\nmoerheim\npithed\nemalie\nontier\nunderbidde
r\nshourong\nkaldoun\nauthenticom\ncapak\nbluedogs\nsoquem\njunyao\nkenzero\nchocbox\nasmq\ndabovic\nboersen\nvideocan\nskelid\npingzhong\ncarrycot\ncranfleet\nbonsib\nncreif\nwaterseal\neucalyptuses\nvallish\nhafeezuddin\nbrodricks\nchristianists\nmachler\nariaan\npsychopathically\nluxmanor\nbayari\nfrivol\nscheeler\nembarressed\nstickability\nreganomics\ntwttr\ntystiolaeth\nndds\ndagmawit\ndavise\nstartegy\njetin\nsaidenov\nunbendingly\nderbyites\nblathered\njamayel\nmccullars\ntheopold\nbulgers\nmastoris\nmultiseason\ncyfarfod\nlevano\ndarrenkamp\ntelecardiology\nbpsi\nnytta\nstadelmayer\ncontemporised\nfouhse\nlunkheads\nevites\nfirsthealth\nclywch\ntrvs\ncoratti\nreletting\nesconsed\nbroffman\nstonefire\nlorans\nnewspics\nkarlbaum\npenaflorida\nnorthburn\nhurkey\nctiy\ntinselled\nmarketsandmarkets\nlilang\nhirtes\nrcent\nrougeou\ncontribued\nfobis\nerwiah\ndellavigna\nunstainable\nwurstfest\npalpitate\noomc\nneopharm\ngroenefeld\nbrainlessness\nscreentonic\nviisage\nhotelchatter\nlequn\nkemala\nreesby\ndiemtig\nellisman\nmdco\nvrsn\nmamade\ninvastion\nunambivalent\nhaileys\ntrivan\nsternick\nrooperi\nserff\nherrgesell\nwieczynski\nsuperlabs\nbidabe\nhfth\nuzumcu\ncojan\nignia\nabstraktes\nstrenthen\naqualine\ncialdea\njeyapalan\nnegronis\npaudert\nzhenyao\natiende\nmagnetom\nhymning\nbonadie\ntrollopian\nthagi\nschnetler\nesrp\nstathakopoulos\nnachterstedt\nboipeba\nultrasecret\npanshir\necocidal\nkatania\nblendini\nrecalcine\nmomager\nsaphris\nrediculas\ndiavlogs\novcon\nsusick\njihadic\nconeybury\nhethmon\nshirm\nrishwain\nkireker\nmirell\nsheeren\neramerica\nluxehills\ntouchiest\nyalon\nbanglaore\nrommi\nlenigas\nmegaquake\nmascone\ncoruption\nsmartjog\ngollinger\nbabeh\npendall\nmisek\nsubaqua\ndallavalle\nvoreloxin\ndithery\ngonacon\njanumet\nvettes\nstaylor\nsenternovem\nexce\nhitleresque\nrocen\ndigrf\nramondini\nbushway\ncbase\nihilani\nhirtenstein\nslakteris\nunete\nhilarides\nserices\nticca\nshockin\npedegg\ncinep\ntranstac\nappma\naqah\nprudentiel\nr
ostar\nilluminatingly\nflimflammed\nboogied\nhyfforddi\nplzensky\neithinog\notesaga\naspling\npollalis\nindigovision\nblachard\nporciello\ntereas\neshki\ndalgish\nakafuku\npetroquimica\ntunlan\nedaily\noorjapac\npunjani\nsagie\ncnrd\nhometowne\nconced\nsilga\nmultigame\nbrinlee\nsurepress\nstatemented\njharkhali\npasquarello\ngerianne\npastapur\ngartain\nlamonaco\nbluehenge\nsonc\npipho\nmischon\ntradetech\nbunnets\npenjore\nanshur\nsondervan\nebels\nfacai\nbiostructures\nklenert\nbagrock\nnutrarev\ndusoulier\narjowiggins\nskylogic\n"
  },
  {
    "path": "assignments/word_transform/eval.vocab",
    "content": "tiene,has\nhabían,had\nentendido,understood\nclase,class\nharry,harry\npluma,pen\nguerra,war\ntan,so\ndios,god\nle,you\nestés,are\nmarea,tide\nmr,mr\nel,he\njerry,jerry\npuedo,can\ncoño,cone\nmarca,brand\ndebió,must\ndiferente,different\ntras,after\nrival,rival\npelículas,films\nésta,this\npiel,skin\nintención,intention\nshow,show\nir,go\nos,you\naumentar,increase\npaís,country\nmarcador,marker\nperfecta,perfect\nben,ben\npresión,pressure\npasada,pass\ndeje,leave\ndia,day\ndólares,dollars\nporque,why\nmaldita,damn\nlocura,madness\nfotos,photos\nhinchar,swell\nregresar,return\nalto,high\nchico,boy\nsoberanía,sovereignty\naquella,that\nhables,speak\npoder,power\ntomado,taken\nverde,green\nnube,cloud\nplaya,beach\nmercado,market\nnadie,nobody\ncontrario,contrary\nolvidar,forget\njodido,fucking\naltavoz,speaker\npobre,poor\noigan,hear\nviuda,widow\nvivo,alive\nverle,see\ncreí,believed\nmalas,bad\nhubiera,would\nperra,dog\nmuestra,sample\nbienvenidos,welcome\ncalcetines,socks\ndónde,where\nteléfono,phone\nhuele,smells\nclientes,customers\nsería,would\nbiblioteca,library\npaciente,patient\nruido,noise\npasa,happens\ndiplomático,diplomatic\nllamaba,called\nprosperar,prosper\nnosotros,us\nvas,go\nemergencia,emergency\nsucia,dirty\ndesastre,disaster\ndavid,david\npensar,think\nreal,real\nhumano,human\nvuelvas,return\nestaría,be\ncomprar,buy\nred,net\nsea,be\nray,ray\npresa,dam\nganado,won\nsexo,sex\noficina,office\nrecibir,receive\nmaravilloso,wonderful\ndura,hard\nestupendo,great\ndepende,depends\nbastardo,bastard\nmedia,half\npedazo,piece\nunas,nail\nojalá,hopefully\nbanda,band\nmetros,meters\nsiente,feels\nposibilidad,possibility\ninevitable,inevitable\nbatalla,battle\nseñorita,miss\npeor,worst\nnaval,naval\nbuenas,good\ncompletamente,completely\nsientes,feel\npaso,passed\ncallejón,alley\nobservación,observation\nperfecto,perfect\nflor,flower\nimposible,impossible\nhagan,make\nconversión,conversion\ntrasero,rear\ndiez,ten\nlínea,line\nc,c\nbuena,good\nadel
ante,ahead\nee,ee\notras,other\nvoz,voice\nmofeta,skunk\npolítica,politics\nah,ah\nnombres,name\nmaestro,teacher\nablandar,soften\ndará,give\nencantado,charmed\ncállate,quiet\nocho,eight\nfuimos,went\nfiesta,party\nquedo,remain\nsentí,felt\ncansado,tired\noro,gold\nabierta,open\ncámara,camera\nmagnético,magnetic\nratón,mouse\nseguro,insurance\ncomo,as\nimagino,imagine\nguantes,gloves\nespacio,space\notros,others\nbailando,dancing\nherido,injured\noportunidad,opportunity\nbobby,bobby\nrobert,robert\nuso,use\nencontrado,found\nmanos,hands\nver,see\nafuera,outside\nhabéis,have\nquienes,who\niluminación,lighting\nfácil,easy\nmenor,less\ndirección,address\nnegocios,business\nprivado,private\nlengua,language\ninformática,computing\nmary,mary\ntratando,trying\nejército,army\nperros,dogs\ncosecha,harvest\nsiempre,always\nvienes,viennese\ncabra,goat\ngana,desire\nempieza,starts\ndeben,should\nvengo,come\ntuvo,had\ndolor,pain\ntuve,had\nefecto,effect\nquedado,left\nllegue,arrived\ncaluroso,hot\norganizado,organized\nquede,stay\nestarás,be\neso,that\nhijos,children\ntuvimos,had\nvergüenza,shame\nalegra,happy\ngobierno,government\ncaro,expensive\noscuridad,darkness\ninvestigación,investigation\nmike,mike\ndinero,money\nhacia,toward\ndulce,sweet\nsiéntate,sit\nparecer,seem\nvistazo,glance\nhistorias,stories\nvender,sell\nroja,red\ngallo,rooster\nvayan,go\nchicos,boys\ncontrato,contract\n"
  },
  {
    "path": "assignments/word_transform/train.vocab",
    "content": "catedral,cathedral\nescúchame,listen\naccidente,accident\nté,tea\ngorda,fat\nregresa,returned\nnegación,denial\npato,duck\nprecisamente,accurately\nimagen,image\npersona,person\npistola,pistol\ndonde,where\ncafé,coffee\nnegocio,business\nquería,wanted\npensaba,thought\nespectáculo,show\nseguridad,security\njuvenil,juvenile\nvenga,come\nalrededor,around\neres,are\nrobo,stole\nespecial,special\nsolos,alone\nolvidé,forgot\nárbol,tree\ndanny,danny\nhicimos,did\nay,oh\nnoche,night\nregalo,present\nentiendes,understand\ndisculpe,sorry\nes,is\nimpulso,impulse\ninteractuar,interact\ncerebro,brain\ncosas,things\nsupuesto,supposed\nreina,queen\nbaile,dance\nayudarme,help\ntraído,brought\nescuela,school\ndiario,daily\ntu,you\ngran,great\nprincipio,beginning\ndejas,let\nvuelve,returns\nvoluntad,will\nfavor,favor\npersonal,personal\ndirecto,direct\ntal,such\nlobo,wolf\ninmigrante,immigrant\nsemanas,weeks\nbase,base\ninterior,inside\npreguntar,ask\npasé,pass\ntejer,weave\nlector,reader\noigo,hear\npiedra,stone\nmadre,mother\nhoy,today\ncaballero,gentleman\nsistema,system\nfamilia,family\npodía,could\nexamen,exam\nrestaurante,restaurant\nconveniencia,convenience\ncara,face\nhora,hour\nempleo,job\npista,track\npronto,soon\naño,year\nmillón,million\npasará,happen\nbob,bob\ndomingo,sunday\nhacerme,me\nmaravillosa,wonderful\nbrutal,brutal\nciudad,city\ncome,eat\nbilly,billy\nincalculable,incalculable\ndeleite,delight\ndebido,due\nmala,bad\nestúpido,stupid\nlibre,free\ncontacto,contact\nenamorado,love\ndesde,since\npasar,happen\nbailar,dance\nverano,summer\nprima,premium\ndate,date\nmano,hand\ncine,cinema\nbonito,beautiful\nconsecutivo,consecutive\nconocer,know\nsermón,sermon\nseñoras,ladies\ntigre,tiger\nseñora,mrs\nrecuerdas,remember\ncuarto,room\nvez,time\naquí,here\nrepugnante,disgusting\nestoy,am\nverás,see\ndio,gave\nganas,forward\namigo,friend\ntendré,have\nquímica,chemistry\nverdadero,true\ncansada,tired\ncocido,cooked\ncual,which\ncielo,sky\npolicía,police\npad
re,father\ndando,giving\nasiento,seat\ntoque,touch\nagente,agent\nisla,island\ncuántos,many\nnena,baby\nentender,understand\ninstante,instant\niglesia,church\nsuerte,luck\nluego,then\nperfectamente,perfectly\nanimal,animal\ncorazón,heart\ngracias,thank\nprefiero,prefer\ncreía,thought\nrenta,rent\ndelgado,thin\nbañar,bathe\nestuviste,were\ncontinuar,continue\nla,the\nllevaré,take\ncomienzo,start\nmujeres,women\nvea,see\ncreen,believe\ncontrol,control\ncabrón,dumbass\nmitad,half\narena,sand\nabsolutamente,absolutely\nmata,bush\ndoy,give\nconejo,spider\nti,you\ndetrás,behind\nhablamos,speak\nanna,anna\nencuentro,meeting\nperdona,forgives\nmayor,higher\nganar,win\ntrabajando,working\ngay,gay\nencontró,found\nconseguir,get\npeter,peter\nfunciona,works\npreciosa,precious\nesperen,expect\nhacemos,make\nharé,do\nvelocidad,speed\nvecino,neighbor\ncrimen,crime\nposición,position\nbosque,forest\nnuestro,our\nhecho,fact\nsr,mr\ntenía,had\nsaliendo,leave\nángeles,angels\nnutritivo,nutrient\nfinal,final\nnota,note\nasunto,issue\nnos,us\ncarga,load\ntalento,talent\nsegundos,seconds\napenas,barely\nexplosión,explosion\nalma,soul\nvaqueros,jeans\nmujer,woman\notra,other\nidea,idea\nabogado,attorney\nrayos,ray\ncrudo,raw\nacuerdas,remember\nanillo,ring\nmente,mind\nparte,part\nmal,wrong\nproyecto,draft\nchaqueta,jacket\nlisto,ready\nonda,wave\ntommy,tommy\nlados,sides\nhabía,was\nbuenos,good\nimportante,important\ndama,lady\naeropuerto,airport\nirresistible,compelling\nsiento,feel\ncorriendo,running\noscuro,dark\nmirar,look\nedad,age\nsalgan,leave\npapá,dad\ntardes,afternoons\ntío,uncle\nfantástico,fantastic\nmemoria,memory\ncamisa,shirt\nconfianza,trust\nperder,lose\nnueva,new\ncomida,food\nmomentos,moments\nvamos,go\ncuento,story\nestupidez,stupidity\nteológico,theological\nnuestros,our\namo,love\ncama,bed\nsois,are\ndijiste,said\nninguno,any\nsorpresa,surprise\nsucio,dirty\ntarde,late\nciudadanía,citizenship\ncrucero,cruise\ndetente,stop\npulmón,lung\ncinturón,belt\nsiendo,being\n
traje,suit\ncuidado,attention\nniño,boy\ntenga,have\nintentar,try\nenseñar,teach\nextranjero,foreigner\nllamas,calls\ntontería,nonsense\nmierda,shit\ntomar,drink\nbien,well\nlastimado,hurt\nlocos,crazy\nmilitar,military\nmotocicleta,motorcycle\nacá,here\nsí,yes\ncalor,heat\nlibro,book\nya,already\ndar,give\njunto,together\nnivel,level\nidiotas,idiots\nprofesor,professor\nunos,some\nhorrible,horrible\nhacerle,make\ndeseo,wish\nsostener,sustain\nodio,hate\ndías,days\ndespierta,awake\nrelámpago,lightning\nser,be\nacaba,just\ntodo,all\nquedarme,stay\nestará,be\nmucha,much\nvidas,lives\nbasta,enough\nenorme,huge\nreligión,religion\nquerida,dear\npongo,put\ncreo,believe\nllegamos,arrived\nempresa,company\npodré,can\ndiablo,devil\ndemonios,damn\nverá,see\npregunto,ask\nvisita,visit\nsocorro,help\nfeliz,happy\nbar,pub\ntemprano,early\npiscina,pool\na,to\nexactamente,exactly\nbicicleta,bicycle\nintento,attempt\ncódigo,code\nobjetivo,objective\nculpable,guilty\ngustó,taste\nmiles,thousands\ndoble,double\njack,jack\ndejó,left\nencontraron,found\nponga,put\npartes,parts\nfilete,steak\ncomún,common\nmaestra,teacher\nves,see\ncebolla,onion\nresto,rest\niba,going\nvena,vein\ntienes,have\nceño,frown\nfusil,rifle\ntranquila,quiet\npienso,think\npróxima,next\nllevan,carry\nhablan,speak\nespada,sword\nr,r\ndrogas,drugs\nusar,use\nfrustrar,frustrate\nllevar,carry\nmuchachos,boys\ndemocracia,democracy\nmedicina,medicine\nnavidad,christmas\nlluvia,rain\nbella,beautiful\nesperanza,hope\nanimales,animals\ndejaste,left\nsola,alone\ngrandes,big\ncomenzó,started\nexacto,exact\nesperaba,expected\nbonita,beautiful\ncharles,charles\nespecie,species\nbiblia,bible\ney,ey\nhumanos,humans\ntrata,about\nduda,doubt\nmuy,very\nmajestad,majesty\ncambio,change\nestar,be\nhabría,be\nlímite,limit\nhonor,honor\ncomienza,begins\nmortalidad,mortality\nlista,list\nmuchacho,boy\nprisión,prison\ntome,take\nmono,monkey\ncuando,when\nrey,king\ndurante,during\ncontento,happy\nejemplo,example\nvolveré,return\ntécnic
o,technician\nbuscar,search\nfuerzas,forces\ndifícil,difficult\nvaya,go\njurisdicción,jurisdiction\nfrancés,french\ncuesta,cost\ncuántas,many\ntv,tv\ncastillo,castle\ncinco,five\ncambiar,change\nrealmente,really\nbaja,low\nregreso,returned\nhace,does\ndecirle,tell\nfatiga,fatigue\nviene,comes\ncomputadora,computer\nviernes,friday\ntenido,had\nbebida,drink\nsuena,sounds\nlimpio,clean\nha,has\ngrande,big\njuicio,judgment\nquedan,are\nmojado,wet\ncambia,change\nhijo,son\npapel,paper\njugar,play\ncarrera,career\ntrabajar,work\nespecificar,specify\ndebí,should\nfrente,front\nescritorio,desk\ncariño,sweetie\nmatarme,kill\nnecesitas,need\nhombres,mens\nmansión,mansion\neducación,education\nidiota,moron\nfuturo,future\nplanta,plant\npagar,pay\ncompañero,companion\nestados,state\ncosa,thing\npendientes,earrings\nllevó,wear\nestas,these\ntaxi,taxi\nquieren,want\npápa,pope\nsofá,couch\nmas,more\nespecular,speculate\nhubo,was\nideas,ideas\ndébil,weak\nquerido,dear\nmejor,best\nvino,wine\ncoordinar,coordinate\nsostenible,sustainable\ncalifornia,california\nocurrió,occurred\nintercambio,exchange\ncomenzar,start\nchicas,girls\noye,hears\nviste,dresses\nfui,was\nusa,uses\ndisculpa,sorry\ndirecciones,directions\ndistancia,distance\ndiablos,devils\ngordo,fat\npocos,few\ndiga,tell\ntoda,all\nhaber,have\nsrta,ms\nhablado,spoken\nvictoria,victory\npríncipe,prince\núltimos,latest\nmultitud,crowd\nve,go\nelección,choice\nalguien,someone\ntengas,have\npensando,thinking\nprueba,proof\ndebes,must\nimporta,matters\npetición,plea\ncasa,house\ncumpleaños,birthday\nactualizar,update\ntenemos,have\nusted,you\npudiera,could\nloco,crazy\nmédico,doctor\nbeber,drink\neh,eh\nestan,are\njake,jake\nrespeto,respect\nfreno,break\ncamino,path\nrazón,reason\nsol,sun\ncuerpo,body\nmotor,engine\nrecuerda,remember\npareces,seem\ndepositar,deposit\nmiren,look\nseguir,follow\nguapo,handsome\nescritor,writer\nquieto,still\nbrazos,arms\nhaces,do\nempezar,start\nentra,enters\ncuál,which\npresidente,president\narmon
ía,harmony\noiga,listen\npedido,order\nintelectual,intellectual\nnecesario,necessary\ndedos,fingers\npunto,point\nalemán,german\ngranizo,hail\nsalud,health\nirás,go\nguapa,beautiful\nsandalia,sandals\npruebas,tests\nelefante,elephant\nfavorable,favorable\ndarte,give\npreocupes,worry\nllega,arrives\nuds,you\nmuertos,dead\nningún,any\nhorno,oven\ndarme,give\nflores,flowers\nentrar,enter\nformas,shapes\nenemigo,enemy\nllorar,cry\nlamento,lament\nhola,hello\njohnny,johnny\npared,wall\ngusto,taste\npropio,own\ntodos,everybody\nsalió,left\namar,love\nencantaría,love\nextranjeros,languages\nrepublicano,republican\ntuyo,yours\nserá,be\npodido,have\nestamos,are\ngratis,free\ncliente,client\nllegó,arrived\ncaucho,rubber\ndebía,should\nsido,been\nabrigo,coat\nexcelente,excellent\nnaturaleza,nature\nblusa,blouse\nmúsica,music\nprobabilidad,probability\nestrella,star\nsan,saint\ncascada,waterfall\nterminar,terminate\ndepredador,predatory\nsra,mrs\nsarah,sarah\npuerta,door\nbusca,search\nseleccionado,selected\njardín,garden\nlibros,books\nciencia,science\nencontré,found\namas,love\npues,well\nescuchar,hear\nmataré,kill\npobres,poor\npequeña,small\npez,fish\nllama,call\nhacerlo,do\nsociedad,society\ncreerlo,believe\ntratar,try\nponte,ponte\nalquiler,rent\nsir,sir\nlanzamiento,launch\ncaso,case\ninherente,inherent\nmax,max\ninformación,information\npelícula,movie\naun,yet\naceptación,acceptance\nlos,the\nmuseo,museum\nsolamente,only\npasando,passing\ndepartamento,department\ntuya,yours\niré,go\najo,garlic\nhumor,humor\nsigues,follow\ninvencible,invincible\npredicar,preach\ndecisión,decision\nautobús,bus\navión,airplane\nzona,zone\nde,from\nconocía,knew\ncasi,almost\nhéroe,hero\ndigo,say\ntenedor,fork\nesperar,wait\npelaje,fur\ngarganta,throat\nconmigo,with\neddie,eddie\neran,were\nlargo,long\nconfiar,trust\nmovimiento,movement\nlámpara,lamp\nnieve,snow\ntesoro,treasure\nhermanos,brothers\nquedar,stay\nnovia,girlfriend\nfuera,outside\ninspector,inspector\nlee,read\ndamas,ladies\nirs
e,leave\npodrás,can\npar,pair\ncompleto,full\nanoche,night\nespecialmente,especially\nfin,end\nmejores,top\nrico,rich\nmuerta,dead\nfondo,bottom\nsé,know\namigos,friends\ntoma,taking\nquieres,want\nvacaciones,holidays\nirnos,leave\nuniversidad,university\nbuscando,searching\nveinte,twenty\nvida,life\ndas,give\nalegro,glad\nbolsa,bag\njoven,young\nbebé,baby\ncaminar,walk\npie,foot\nestabas,were\njohn,john\nllegar,arrive\ndetective,detective\nprograma,program\nhice,did\nsomos,are\nentiendo,understand\nhabrá,have\napuesto,handsome\ncalma,calm\nhombre,man\nvuelto,turned\nmarcha,march\ntipo,kind\namarillo,yellow\nquédate,stay\narco,bow\nmami,mommy\ndefinitivamente,definitely\ntecho,roof\ncarro,car\nirme,go\ntema,theme\nestén,are\nllegué,arrived\ncolocación,placement\ncasado,married\ninteresante,interesting\narticular,articulate\ndelante,ahead\nveras,see\nprisa,hurry\nsentir,feel\ntenéis,have\nmedio,medium\nsignifica,means\nponer,place\npiensas,think\ndecir,say\ncuentas,accounts\ndespués,after\nazul,blue\narrepentirse,repent\nsiéntese,sit\npropiedad,property\nalgo,something\nperdido,lost\nmontaña,mountain\ndaré,give\nuno,one\nfrágil,fragile\nnoches,nights\nloca,crazy\nhacer,do\nrostro,face\nambos,both\nbelleza,beauty\nbronce,bronze\ncapitán,captain\nsupongo,suppose\npidió,asked\nnuevo,new\nmuerto,dead\nhubieras,had\nfamiliar,familiar\nmirada,look\nprometo,promise\ntrabajo,job\nrazones,reasons\nquerer,want\npiso,floor\ngiro,twist\nsemejanza,similarity\ncosta,coast\nagradecer,appreciate\nsaberlo,know\nestuvo,was\ncirculo,circle\noí,hear\npuerto,door\ntú,you\nrepente,suddenly\nbarco,ship\nfotografía,photograph\nhogar,home\nhacen,make\nmí,me\nterminado,finished\nminutos,minutes\nustedes,you\nresulta,result\njóvenes,young\nego,ego\ntambien,also\ndejen,leave\nempezó,started\ncargo,position\ncomandante,commander\nalmohada,pillow\nhago,make\ncaballo,horse\ndemandante,plaintiff\ncanción,song\nprofesional,professional\nescena,scene\nelegible,eligible\nmayoría,most\ntribunal,court\n
comentario,remark\niremos,go\nhabló,speak\ndice,says\nmorir,die\nporqué,why\npiensa,think\ndescansar,rest\npotable,potable\ntrato,treatment\ntuviera,had\ncocina,kitchen\nclub,club\nahí,there\nreunión,meeting\nsal,salt\nsean,are\nespiar,spy\ngracia,grace\ncalle,street\nreloj,clock\nayudar,help\nropa,clothes\ncalles,streets\nbeso,kiss\ntarjeta,card\nmark,mark\nfrancia,france\nfracción,fraction\nhará,will\ngeometría,geometry\ndebajo,below\ntrampa,trap\nperdone,forgive\nputa,bitch\nchispa,spark\nviviendo,living\njefe,boss\nbajar,down\nintimidad,intimacy\nesposa,wife\njabón,soap\ncasas,houses\nironía,irony\npropósito,purpose\npersonas,people\nmuelle,dock\nbote,boat\npero,but\nesta,this\nmatar,kill\nabuela,grandmother\nniebla,fog\ncamión,truck\nsale,leaves\nplato,plate\noyes,hear\ninocente,innocent\ndan,give\npide,asks\núnica,only\nreferir,refer\nhizo,did\nrevólver,revolver\natención,attention\ninjusto,unfair\nésa,that\ngustan,like\nequivalente,equivalent\nmi,my\nvan,go\naburrido,boring\nperro,dog\nalcalde,mayor\nentiende,understands\nbusco,search\nbueno,good\ndormido,asleep\nnunca,never\nprecioso,precious\néxito,success\nblanco,white\ncuanto,many\nencima,above\ndelicioso,delicious\ntantas,many\nálgebra,algebra\nwhisky,whiskey\nperdonar,forgive\noh,oh\notro,other\nfoto,photo\nescuche,heard\npájaro,bird\nnegros,black\nrobar,steal\ntrabaja,working\nfortuna,fortune\nal,to\nrelación,relationship\nfuerza,force\nllanta,wheel\nembargo,embargo\nabierto,open\npalabra,word\nserán,be\nproblemas,problems\nthomas,thomas\ncon,with\ngrueso,thick\nbill,bill\ncaliente,hot\nbañador,swimsuit\ndejes,let\naburrida,boring\nalemania,germany\nsu,his\ngarantía,guarantee\nunidad,unity\natrás,behind\ntemo,fear\ninglaterra,england\nsalido,protruding\nm,m\nescucha,listen\ndisparar,shoot\nademás,besides\nmolécula,molecule\nobra,work\nninguna,any\nsegundo,second\nmía,mine\nagradable,nice\nlistos,ready\nclaro,clear\nvemos,see\npalabras,words\nsube,up\núltimo,latest\nnoticia,news\ncielos,heavens\nfelices
,happy\ndijeron,said\nsituación,situation\ntoca,plays\npreocupado,worried\ntensión,strain\ntodas,all\ndave,dave\npuertas,doors\nvolvió,returned\ntocar,play\nayude,help\nvieja,old\nhonesto,honest\nparecen,seem\nj,j\nelaborar,elaborate\nvuelo,flight\nvacío,vacuum\nentre,between\nparecía,seemed\nnoticias,news\ncartas,letters\namante,lover\nesperando,waiting\nentonces,then\ncheque,check\naduana,customs\nvayamos,go\nespina,spine\nducha,shower\nacusación,accusation\nsigue,follow\nmientras,while\nretirada,retreat\norar,pray\nabsoluto,absolute\nllevas,take\ndelincuente,offender\ndanza,dance\nacabo,finished\ntren,train\nvendedor,seller\nfísica,physics\nmasa,dough\npon,put\nbautismo,baptism\ndijo,said\nbajo,low\ndivertido,fun\nprotestante,protestant\nmataron,killed\ns,s\nnuestra,our\nluchar,fight\nnariz,nose\narcilla,clay\nsaca,removes\nyork,york\nserás,be\nconducir,drive\ntranquilo,quiet\nturno,turn\nsano,healthy\ngusta,like\nminuto,minute\nfea,ugly\nera,was\ndedo,finger\nexcepto,except\nsiquiera,even\namable,friendly\nbravo,bravo\nayúdame,help\nboda,wedding\noferta,sale\nhija,daughter\nadónde,where\ndueño,owner\nmisión,mission\ndoctor,doctor\nseguramente,surely\nsaben,know\npaz,peace\nrepentino,sudden\ncualquiera,anyone\nepidemia,epidemic\ntarifa,rate\nequivocado,wrong\nmurió,died\nserio,serious\nveré,see\nwww,www\nestimación,estimate\nsalga,out\ndentro,inside\naqui,here\nmamá,mom\ndestino,destination\ncuello,neck\nnuestras,our\npuente,bridge\nsuficiente,enough\ndebe,should\nexperiencia,experience\nembarazada,pregnant\nchofer,driver\ntienda,store\npantalones,pants\namericano,american\npaseo,walk\npone,places\nhonestamente,honestly\npata,duck\ncambiado,changed\nparque,park\npartido,match\nbiología,biology\nquedó,stayed\nsangre,blood\nbaño,bathroom\nhechos,acts\nlado,side\nprimero,first\nlevántate,raise\nhey,hey\nescuchen,listen\ndiferentes,different\nvelcro,velcro\ngenial,great\nquedarse,stay\nchina,china\nestá,this\narma,weapon\nmis,my\nverdad,true\nfilosófico,philosophical
\npatata,potato\ntemplo,temple\nnovio,boyfriend\nhospital,hospital\nabuelo,grandfather\nocurre,occurs\nvivir,live\noír,hear\nsuéter,sweater\ndeber,must\nvete,go\nsentía,felt\npodemos,can\ndiciendo,saying\nventana,window\nsentido,sense\nlibrería,bookstore\ngeneral,general\nquién,who\nvos,you\nverlo,see\nescaleras,stairs\ncuestión,question\ntendremos,have\ncomplicado,complicated\ntrauma,trauma\nhermano,brother\nsemana,week\nveremos,see\nculo,ass\npresuntamente,allegedly\nmillones,millions\nantiguo,old\nfe,faith\nconsejo,advice\nmolesta,bothers\nturismo,tourism\nhas,have\nintervalo,interval\nedificio,building\ngustaba,liked\noído,ear\ndecirme,tell\nalex,alex\nalguna,any\ntoalla,towel\ndame,give\nespalda,back\ncerda,pig\ncenar,dine\narrodillarse,kneel\ndi,gave\ncamboya,cambodia\nmapa,map\nvenir,come\nmonasterio,monastery\nvigésimo,twentieth\nrueda,wheel\nmás,more\nhablé,talked\ndiferencia,difference\nnuevos,new\npresente,present\nalboroto,riot\nenferma,sick\nhablas,speak\nsaldrá,will\nvd,you\ncorre,run\nante,before\nimbécil,fool\ndarle,give\nvoy,go\nechar,throw\nenderezar,straighten\ncorte,cut\ntengo,have\ncomer,eat\nrana,frog\nataque,attack\naños,years\ncontar,tell\nvine,came\ndroga,drug\nyo,i\npeligroso,dangerous\nnecesitaba,needed\nun,a\nbrillante,brilliant\núltima,last\nligero,light\npor,by\nprimer,first\nmatrimonio,marriage\ndormir,sleep\nhablar,talk\nsoldados,soldiers\nbarrio,neighborhood\ndirector,director\nterminó,finished\npila,sink\nvosotros,you\nvista,view\nquisiera,want\ncorrer,run\ndiría,say\nqueda,remains\nprimo,cousin\nluna,moon\nbroma,joke\nnosotras,we\nok,okay\nrápido,fast\njim,jim\nhermoso,beautiful\npedir,ask\nesa,that\njames,james\npatada,kick\nbienvenida,welcome\nviaje,travel\nsabemos,know\nhombro,shoulder\ngente,people\nunidos,united\nlondres,london\npido,ask\ntriste,sad\nobispo,bishop\nvuestro,your\ntenías,had\nquien,who\nconstitución,constitution\nparece,seems\nmatado,killed\npreguntas,questions\ncargador,charger\ndemasiado,too\ndije,said\ncorrec
to,right\nirte,leave\ndigamos,say\npúblico,public\nestán,are\nacelerar,accelerate\nsaber,know\narmas,weapons\nlinda,pretty\npelear,fight\nestúpida,stupid\nencanto,charm\nestaremos,be\ntendrás,have\nsepa,know\nconocido,known\nsi,if\ncae,falls\ndejo,left\nmuñeca,wrist\nmontón,heap\nfundir,melt\nvenido,come\nabajo,down\nenergía,energy\nesto,this\ntendrá,have\nperdón,sorry\nahi,there\nhiciera,do\ncorrea,belt\npantalla,screen\nagua,water\npequeños,little\nruego,beg\nocurrido,happened\nhenry,henry\ntendrán,will\nestación,station\nbastante,quite\ntermina,ends\ncola,tail\nmuerte,death\nque,what\nayer,yesterday\npanaderia,shop\nboca,mouth\nhacía,toward\nb,b\nhaciendo,doing\ncaballos,horses\nmodo,mode\nsecreto,secret\nverte,see\ngato,cat\nfábrica,factory\npiensan,think\nsabe,knows\nmensaje,message\ndime,tell\ncierre,zipper\ntampoco,neither\nto,to\nestado,state\nllamada,call\nd,d\nmuchas,many\nojo,eye\nlapicero,pen\ntanto,much\npierna,leg\nacabar,finish\nojos,eyes\nputo,fucking\ncresta,ridge\ncomprendo,comprehend\ngrave,serious\ndebería,should\ncentro,center\nmismo,same\nviudo,widower\nórdenes,orders\nmonstruo,monster\ndeberías,should\nvisto,viewed\npiernas,legs\nnada,nothing\nseñor,mister\ncorreos,office\nteníamos,had\nborracho,drunk\nestadio,stadium\nencuentras,find\npueblo,town\nclases,lessons\nnatural,natural\ndices,say\nproclamar,proclaim\nfuese,was\nolvides,forget\ndefensa,defending\nestarán,be\nsupe,knew\ncarne,meat\nantes,before\nllave,key\nmanta,blanket\nllaman,call\ncoge,grabs\ntravés,through\nizquierda,left\nasuntos,issues\nalgunas,some\nenfermero,nurse\nquiénes,who\nprobar,try\ncristianismo,christianity\nleal,loyal\ndetalles,details\njugando,playing\nsam,sam\ncierto,true\nplacer,pleasure\npollo,chicken\npase,pass\nmundo,world\nmiedo,fear\ndos,two\naunque,although\nhermana,sister\npatrón,patron\npuñetazo,punch\njamás,never\ntony,tony\ntrago,drink\nfalda,skirt\nexplícito,explicit\ntelevisión,television\nsino,but\nhay,are\nfinalmente,finally\ndecía,said\nsalida,exit\n
adentro,in\ncaja,box\nhígado,liver\ndespierto,awake\nescapar,escape\nrica,rich\njuntos,together\nnervioso,nervous\npapi,daddy\ncerrar,close\ndibujar,draw\nnegro,black\nsuya,his\ntodavía,still\nanterior,underwear\nseas,are\nestuviera,was\nincluso,even\nmañana,morning\ninforme,report\ntolerancia,tolerance\ngloria,glory\ncontigo,with\nteatro,theater\nnaríz,nose\nhablando,speaking\namérica,america\ntiro,threw\npareja,couple\nme,me\ndaño,hurt\ncuidar,care\ncopa,cup\noso,bear\njuro,swear\ncantar,sing\narriba,above\nlibras,pounds\nsimple,simple\nlugares,places\npudo,could\ntendría,have\nrevisión,review\nveamos,see\ntrajo,brought\nvolver,return\nellos,they\nproblema,problem\nalemanes,german\nson,are\ndiré,say\ndecirte,tell\nama,love\naire,air\nopción,option\nministro,minister\nveía,looked\nvio,saw\nnaranja,orange\nwalter,walter\nhuevos,eggs\nencontramos,find\namiga,friend\nmuevas,move\ndía,day\nsoldado,soldier\ncabeza,head\nlapiz,pencil\nhaga,make\nhabitación,room\nfútbol,football\ndenso,dense\nmantener,keep\nperforar,drill\nluces,lights\ncharlie,charlie\nqué,what\ntomó,took\ncampo,field\nmatemáticas,math\nlleva,carries\nbienvenido,welcome\ncita,appointment\npatrocinador,sponsor\nqueja,complaint\ncarta,letter\ncaer,fall\nsiete,seven\nempujón,poke\nviejos,old\nestudiar,study\nmil,thousand\norgulloso,proud\nllamar,call\nocéano,ocean\nido,gone\npoco,little\ndientes,teeth\njusticia,justice\ndejado,left\nviejo,old\nlleno,full\nsalvo,except\nposible,possible\nlejos,far\ndígame,tell\nallí,there\ncerdo,pig\nrojo,red\nintenta,try\nquedarte,stay\ncarretera,highway\npolvo,dust\ndel,of\nparar,stop\nnave,ship\njuego,game\nciclomotor,moped\nparís,paris\nhubiese,had\nlas,the\np,p\ncausa,cause\nconoce,known\nalegar,allege\nél,he\nfeo,ugly\nhaya,beech\nvuestra,your\nlíquido,liquid\ntonto,stupid\nsiguiente,following\nsentado,seated\nvestíbulo,hallway\npelea,fight\nprofesora,professor\nmenos,less\nquerría,want\ncerveza,beer\nbromeando,joking\nrespecto,respect\ninmediato,now\nmando,send\nsólo,
only\nseré,be\neconomía,economics\nlleve,carried\nverla,see\nesos,those\nroma,rome\nasesinato,murder\ncolegio,college\ncharca,pond\ndebo,must\npelo,hair\nquizá,maybe\nsábado,saturday\nrecortar,trim\nleer,read\ninmediatamente,immediately\ncapaz,able\naprender,learn\nespaña,spain\nllamaré,call\nviendo,seeing\nolvidado,forgotten\nmesa,table\nofficina,office\nenemigos,enemies\nmirando,looking\nmadera,timber\nacción,action\naquel,that\nacerca,about\ntener,have\ngustaría,like\nactuar,act\nballena,whale\ncena,dinner\nsolía,accustomed\ndeja,let\ntotal,total\nbus,bus\nave,bird\nviento,wind\njoder,fuck\nmentira,lie\numbral,threshold\ncayó,fell\ncompañía,company\noperación,operation\ntapa,lid\ncasarse,marry\namor,love\nbomba,bomb\nconozco,know\nanda,walks\ninvención,invention\ncuatro,four\nsur,south\nsabías,know\nextraña,strange\nllevará,carry\ncompromiso,compromise\nsheriff,sheriff\nespere,waited\nvolar,fly\ntanta,much\ncontabilidad,accounting\nrutinariamente,routinely\nlibertad,freedom\nabre,opens\nsilla,chair\nharemos,will\ntomando,taking\nsobre,on\nprecio,price\ncinta,ribbon\npara,for\naspirina,aspirin\nmotivo,reason\nperdió,lost\ntotalmente,totally\ndigas,say\nsus,their\nseñores,sirs\nfalta,lack\nmuere,die\nzapatos,shoes\nhiciste,did\nrecuperar,recover\npermiso,permission\nmalditos,damn\nio,io\nelectrónica,electronics\nseco,dry\npuntos,points\ncrees,believe\ncapa,coat\nsigo,follow\nguardia,guard\nágil,agile\nahora,now\nnuevas,news\ncerca,close\nllevo,wear\npensé,thought\npeligro,danger\nen,in\nbrazo,arm\nsombrero,hat\npreocupe,worry\nrato,while\nresponsable,responsable\nmichael,michael\ninevitablemente,inevitably\npodremos,can\ncierra,closes\nalmacén,warehouse\nextraño,strange\nnombre,name\nrosa,pink\ndéjeme,let\néste,east\nhable,talked\ndejar,leave\nrío,river\ncolor,color\noeste,west\nalta,high\njuventud,youth\ncontribuyente,contributor\nestudio,study\nraro,rare\nlucha,fight\npesar,weigh\npueden,may\nnick,nick\npasado,past\naspecto,appearance\njoe,joe\nsucedió,happened\n
traer,bring\npijama,pajamas\nyou,you\nescupir,spit\npuesto,position\neras,were\nvestido,dress\nángel,angel\nadiós,goodbye\ndemás,other\nhayas,have\nsueños,dreams\ncuchillo,knife\ndemócrata,democrat\nsirve,serves\nda,gives\naquellos,those\ntiempo,time\ncruel,cruel\nvaliente,brave\nderecho,right\npermite,allows\ncodo,elbow\nequipaje,luggage\nabrir,open\ncabello,hair\npapa,dad\ngraduación,graduation\nleche,milk\nperiódico,newspaper\nlago,lake\nestufa,stove\nsalir,leave\npuse,put\nforma,shape\nacto,act\nroto,broken\nluz,light\norden,order\nconoces,know\ncada,each\nveterano,veteran\nvarias,several\nmucho,much\ntránsito,transit\nvale,okay\nplan,plan\ntambién,also\njesús,jesus\nsargento,sergeant\nauto,car\nchica,girl\nprensa,press\ncontinúa,continue\nduro,hard\ndado,dice\nhaz,make\ndurmiendo,sleeping\ncoger,take\ninteligente,intelligent\npreparado,prepared\npies,feet\nestaba,was\ntornillo,bolt\nellas,they\nuh,uh\nley,law\ndiccionário,dictionary\nverdadera,real\ncálculo,calculus\nvive,lives\nsegún,according\nviva,live\npaja,straw\ndé,from\nasesino,murderer\nmire,look\nespíritu,spirit\nuna,a\ncoronel,colonel\njacob,jacob\ncabo,cape\nmira,look\ntí,you\nva,goes\nservicio,service\ncarajo,fuck\ntengan,have\nentrada,entry\nespera,wait\nreservado,reserved\nvuelva,return\ncálmate,calm\nrespuesta,answer\nbañera,bathtub\npedí,asked\nsteve,steve\nrecibido,received\nfué,was\nespejo,mirror\nmaldición,curse\nnacional,national\nquiere,wants\nhabla,speaks\nthe,the\nculpa,guilt\nlindo,pretty\nvalle,valley\nsonido,sound\noficial,official\nniñas,girls\ncómo,how\nesas,those\nayuda,help\nlástima,pity\nmomento,moment\nfarmácia,pharmacy\ncampesino,peasant\ntejido,fabric\ngeorge,george\nllaves,keys\nopinión,opinion\nrichard,richard\nmuchacha,girl\nsuyo,yours\nmírame,look\nerror,error\ndejarlo,leave\nllamado,called\nvendrá,come\ndejé,leave\ninfeliz,unhappy\nladrón,thief\nenfermedad,disease\nmes,month\nsilencio,silence\nvengan,come\nbanco,bank\nenfermo,sick\ninfierno,hell\nlión,lion\nmicroonda,micro
wave\nsubir,up\npequeño,small\nigual,same\nnormal,normal\nllámame,call\napartamento,apartment\ntumba,grave\nrádio,radio\nacaso,perhaps\npena,pain\ntipos,types\nenseguida,immediately\nahogar,drown\nmalo,bad\npapeles,papers\nsala,room\nlugar,place\nselva,jungle\nalguno,any\nsentimientos,feelings\npuedes,can\ngracioso,funny\nsimplemente,simply\ndejaré,leave\nfrío,cold\nprofeta,prophet\npasó,passed\nhabías,had\nautonomía,autonomy\nsacar,take\nalabanza,praise\npadres,parents\ncuenta,account\nmuévete,move\nsiga,follow\nnueve,nine\ncolina,hill\nsin,without\npecho,chest\nlíder,leader\nasí,yes\nriesgo,risk\nrodilla,knee\napología,apology\nu,or\nayudarte,help\nni,neither\npropia,own\nllegado,arrived\ntus,your\nmarido,husband\ndieron,gave\nacuerdo,agreement\neste,east\npuso,put\npago,payment\ntoques,touches\ngolpe,knock\nsuelo,floor\nhambre,hungry\nridículo,ridiculous\ntom,tom\ndesea,wish\nnecesitamos,need\ninteresa,interested\ntres,three\npreocupa,worries\nocupado,occupied\nsanta,saint\ntransmitir,transmit\ntomas,shots\npaga,pay\nniños,children\ncree,believes\naún,yet\nsupone,supposed\nhasta,until\ncuchara,spoon\npareció,seemed\narte,art\ncintura,waist\ncien,hundred\ndicho,saying\nhablemos,talk\nadorar,worship\nsanto,holy\ndr,dr\nexperto,skilled\npuede,can\ngenio,genius\nmar,sea\nhagamos,do\nhe,have\njuez,judge\nella,she\nsueño,dream\nrefiero,refer\nseis,six\nvi,saw\ntestigo,witness\nseñoría,lordship\nmisma,same\nhablo,speak\nimpuesto,tax\nverme,see\nhielo,ice\ntenían,had\nmáquina,machine\nvaca,cow\nnecesita,needs\nrealidad,reality\nmundial,world\ndéjalo,leave\ngeografía,geography\ninútil,useless\npan,bread\nescribir,write\nlarry,larry\nmuchos,many\nchris,chris\nfuego,fire\nhotel,hotel\nexiste,exists\nmaldito,damned\nlavaplatos,dishwasher\nsabía,knew\ndespacio,slowly\nfamoso,famous\nmármol,marble\ninglés,english\nlarga,long\nacabó,finished\nllame,called\naceptar,accept\ndecidido,decided\nescrito,written\ncerrado,closed\nacabado,finish\nbotella,bottle\nyendo,going\nautomovíl,c
ar\nsalvar,save\nrecuerdo,memory\nallá,there\nincreíble,amazing\nfue,was\nsolo,alone\no,or\nveces,times\nterriblemente,terribly\nvolverá,return\ncoco,coconut\nvienen,come\nhumana,human\nperdí,lost\npartir,from\nsiguen,follow\nencontrar,find\ndéjame,let\nbasura,trash\noreja,ear\nzoológico,zoo\nmeses,months\nescuché,heard\nestrellas,stars\naraña,spider\nduerme,sleeps\njudaismo,judaism\nestáis,are\npude,could\nt,t\nmodos,modes\npueda,can\njusto,fair\ny,and\nestábamos,were\narreglar,fix\nhan,have\nacelerado,accelerated\ncuándo,when\ndicen,say\ncontemplar,contemplate\npregunta,question\njimmy,jimmy\ntierra,earth\nsegura,safe\nteniente,lieutenant\nello,it\npaul,paul\náguila,eagle\nno,no\nconocí,met\nsabes,know\narroz,rice\nles,them\nwashington,washington\nvarios,various\nvalor,value\ntonta,dumb\nllena,full\nmiel,honey\nnecesitan,need\nsexualidad,sexuality\nprincesa,princess\ntantos,many\nhoras,hours\ngallina,chicken\ncentral,central\nmenudo,often\nhalcón,hawk\ncostar,cost\ndeprisa,quickly\nprobablemente,probably\nplanes,plans\nblanca,white\nbiografía,biography\nevitar,avoid\nibas,were\ntienen,have\nvoluntario,voluntary\nesposo,husband\nnúmero,number\nencuentra,find\nconversación,conversation\ncárcel,jail\nte,tea\ncaballeros,gentlemen\nveo,see\nprimera,first\nirá,go\nnegra,black\nn,n\nsubterráneo,subway\nei,ei\npodrá,can\nterrible,terrible\nplatillo,saucer\ngrupo,group\ntía,aunt\nestilo,style\nrecordar,remember\nnorte,north\ncoche,car\ndescanso,rest\nprincipal,principal\ndemonio,demon\ndile,tell\nmunicipal,municipal\nse,oneself\narmario,closet\ndeberíamos,should\nestos,these\nsitio,site\nentero,whole\nmetido,involved\noveja,sheep\nbarato,cheap\npeso,weight\nllevaba,took\nmanera,way\ncualquier,any\nárboles,trees\ncreer,believe\népoca,time\nespero,hope\nequipo,team\nbuen,good\ntrae,brings\nmío,mine\nsoleado,sunny\njane,jane\nllamó,called\npróximo,next\nfuerte,strong\nresumen,summary\nreglas,rules\nnecesito,need\nsoy,am\nhermosa,beautiful\nbebe,baby\nfelicidad,happiness\nfrag
mento,fragment\nintentando,trying\nglobo,balloon\nvayas,go\nderecha,right\nvuelvo,return\nsucede,happens\npalo,stick\nestaré,be\nuva,grape\nestás,are\nabrumar,overwhelm\npuedas,can\nárea,area\ncontra,against\nvuelta,return\nlágrimas,tears\nestuve,was\nfrank,frank\nhistoria,history\nalgún,some\neuropa,europe\nesté,be\nllamo,call\nhicieron,made\nniña,girl\ndonación,donation\nmismos,same\nquizás,maybe\nradio,radio\nalgunos,some\nmató,killed\nplaneta,planet\nduele,hurts\nven,come\nseñal,signal\nunir,merge\núnico,only\n"
  },
  {
    "path": "examples/02_lazy_loading.py",
    "content": "\"\"\" Example of lazy vs normal loading\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 02\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf \n\n######################################## \n## NORMAL LOADING   \t\t\t      ##\n## print out a graph with 1 Add node  ## \n########################################\n\nx = tf.Variable(10, name='x')\ny = tf.Variable(20, name='y')\nz = tf.add(x, y)\n\nwith tf.Session() as sess:\n\tsess.run(tf.global_variables_initializer())\n\twriter = tf.summary.FileWriter('graphs/normal_loading', sess.graph)\n\tfor _ in range(10):\n\t\tsess.run(z)\n\tprint(tf.get_default_graph().as_graph_def())\n\twriter.close()\n\n######################################## \n## LAZY LOADING   \t\t\t\t\t  ##\n## print out a graph with 10 Add nodes## \n########################################\n\nx = tf.Variable(10, name='x')\ny = tf.Variable(20, name='y')\n\nwith tf.Session() as sess:\n\tsess.run(tf.global_variables_initializer())\n\twriter = tf.summary.FileWriter('graphs/lazy_loading', sess.graph)\n\tfor _ in range(10):\n\t\tsess.run(tf.add(x, y))\n\tprint(tf.get_default_graph().as_graph_def()) \n\twriter.close()"
  },
  {
    "path": "examples/02_placeholder.py",
    "content": "\"\"\" Placeholder and feed_dict example\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 02\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\n# Example 1: feed_dict with placeholder\n\n# a is a placeholder for a vector of 3 elements, type tf.float32\na = tf.placeholder(tf.float32, shape=[3])\nb = tf.constant([5, 5, 5], tf.float32)\n\n# use the placeholder as you would a constant\nc = a + b  # short for tf.add(a, b)\n\nwriter = tf.summary.FileWriter('graphs/placeholders', tf.get_default_graph())\nwith tf.Session() as sess:\n    # compute the value of c given the value of a is [1, 2, 3]\n    print(sess.run(c, {a: [1, 2, 3]}))                 # [6. 7. 8.]\nwriter.close()\n\n\n# Example 2: feed_dict with a regular (non-placeholder) tensor\na = tf.add(2, 5)\nb = tf.multiply(a, 3)\n\nwith tf.Session() as sess:\n    print(sess.run(b))                                 # >> 21\n    # compute the value of b given the value of a is 15\n    print(sess.run(b, feed_dict={a: 15}))              # >> 45"
  },
  {
    "path": "examples/02_simple_tf.py",
    "content": "\"\"\" Simple TensorFlow's ops\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\n\n# Example 1: Simple ways to create log file writer\na = tf.constant(2, name='a')\nb = tf.constant(3, name='b')\nx = tf.add(a, b, name='add')\nwriter = tf.summary.FileWriter('./graphs/simple', tf.get_default_graph()) \nwith tf.Session() as sess:\n    # writer = tf.summary.FileWriter('./graphs', sess.graph) \n    print(sess.run(x))\nwriter.close() # close the writer when you’re done using it\n\n# Example 2: The wonderful wizard of div\na = tf.constant([2, 2], name='a')\nb = tf.constant([[0, 1], [2, 3]], name='b')\n\nwith tf.Session() as sess:\n    print(sess.run(tf.div(b, a)))\n    print(sess.run(tf.divide(b, a)))\n    print(sess.run(tf.truediv(b, a)))\n    print(sess.run(tf.floordiv(b, a)))\n    # print(sess.run(tf.realdiv(b, a)))\n    print(sess.run(tf.truncatediv(b, a)))\n    print(sess.run(tf.floor_div(b, a)))\n\n# Example 3: multiplying tensors\na = tf.constant([10, 20], name='a')\nb = tf.constant([2, 3], name='b')\n\nwith tf.Session() as sess:\n    print(sess.run(tf.multiply(a, b)))\n    print(sess.run(tf.tensordot(a, b, 1)))\n\n# Example 4: Python native type\nt_0 = 19 \nx = tf.zeros_like(t_0) \t\t\t\t\t# ==> 0\ny = tf.ones_like(t_0) \t\t\t\t\t# ==> 1\n\nt_1 = ['apple', 'peach', 'banana']\nx = tf.zeros_like(t_1) \t\t\t\t\t# ==> ['' '' '']\n# y = tf.ones_like(t_1) \t\t\t\t# ==> TypeError: Expected string, got 1 of type 'int' instead.\n\nt_2 = [[True, False, False],\n       [False, False, True],\n       [False, True, False]] \nx = tf.zeros_like(t_2) \t\t\t\t\t# ==> 3x3 tensor, all elements are False\ny = tf.ones_like(t_2) \t\t\t\t\t# ==> 3x3 tensor, all elements are True\n\nprint(tf.int32.as_numpy_dtype())\n\n# Example 5: printing your graph's definition\nmy_const = tf.constant([1.0, 2.0], 
name='my_const')\nprint(tf.get_default_graph().as_graph_def())"
  },
  {
    "path": "examples/02_variables.py",
    "content": "\"\"\" Variable examples\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 02\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\n\n# Example 1: creating variables\ns = tf.Variable(2, name='scalar') \nm = tf.Variable([[0, 1], [2, 3]], name='matrix') \nW = tf.Variable(tf.zeros([784,10]), name='big_matrix')\nV = tf.Variable(tf.truncated_normal([784, 10]), name='normal_matrix')\n\ns = tf.get_variable('scalar', initializer=tf.constant(2)) \nm = tf.get_variable('matrix', initializer=tf.constant([[0, 1], [2, 3]]))\nW = tf.get_variable('big_matrix', shape=(784, 10), initializer=tf.zeros_initializer())\nV = tf.get_variable('normal_matrix', shape=(784, 10), initializer=tf.truncated_normal_initializer())\n\nwith tf.Session() as sess:\n    sess.run(tf.global_variables_initializer())\n    print(V.eval())\n\n# Example 2: assigning values to variables\nW = tf.Variable(10)\nW.assign(100)\nwith tf.Session() as sess:\n    sess.run(W.initializer)\n    print(sess.run(W))                    \t# >> 10\n\nW = tf.Variable(10)\nassign_op = W.assign(100)\nwith tf.Session() as sess:\n    sess.run(assign_op)\n    print(W.eval())                     \t# >> 100\n\n# create a variable whose original value is 2\n# (a new name is needed: 'scalar' was already created with tf.get_variable above)\na = tf.get_variable('scalar_a', initializer=tf.constant(2)) \na_times_two = a.assign(a * 2)\nwith tf.Session() as sess:\n    sess.run(tf.global_variables_initializer()) \n    sess.run(a_times_two)                 \t# >> 4\n    sess.run(a_times_two)                 \t# >> 8\n    sess.run(a_times_two)                 \t# >> 16\n\nW = tf.Variable(10)\nwith tf.Session() as sess:\n    sess.run(W.initializer)\n    print(sess.run(W.assign_add(10)))     \t# >> 20\n    print(sess.run(W.assign_sub(2)))     \t# >> 18\n\n# Example 3: Each session has its own copy of variable\nW = tf.Variable(10)\nsess1 = tf.Session()\nsess2 = 
tf.Session()\nsess1.run(W.initializer)\nsess2.run(W.initializer)\nprint(sess1.run(W.assign_add(10)))        \t# >> 20\nprint(sess2.run(W.assign_sub(2)))        \t# >> 8\nprint(sess1.run(W.assign_add(100)))        \t# >> 120\nprint(sess2.run(W.assign_sub(50)))        \t# >> -42\nsess1.close()\nsess2.close()\n\n# Example 4: create a variable with the initial value depending on another variable\nW = tf.Variable(tf.truncated_normal([700, 10]))\nU = tf.Variable(W * 2)"
  },
  {
    "path": "examples/03_linreg_dataset.py",
    "content": "\"\"\" Solution for simple linear regression example using tf.data\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\nimport utils\n\nDATA_FILE = 'data/birth_life_2010.txt'\n\n# Step 1: read in the data\ndata, n_samples = utils.read_birth_life_data(DATA_FILE)\n\n# Step 2: create Dataset and iterator\ndataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))\n\niterator = dataset.make_initializable_iterator()\nX, Y = iterator.get_next()\n\n# Step 3: create weight and bias, initialized to 0\nw = tf.get_variable('weights', initializer=tf.constant(0.0))\nb = tf.get_variable('bias', initializer=tf.constant(0.0))\n\n# Step 4: build model to predict Y\nY_predicted = X * w + b\n\n# Step 5: use the square error as the loss function\nloss = tf.square(Y - Y_predicted, name='loss')\n# loss = utils.huber_loss(Y, Y_predicted)\n\n# Step 6: using gradient descent with learning rate of 0.001 to minimize loss\noptimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)\n\nstart = time.time()\nwith tf.Session() as sess:\n    # Step 7: initialize the necessary variables, in this case, w and b\n    sess.run(tf.global_variables_initializer()) \n    writer = tf.summary.FileWriter('./graphs/linear_reg', sess.graph)\n    \n    # Step 8: train the model for 100 epochs\n    for i in range(100):\n        sess.run(iterator.initializer) # initialize the iterator\n        total_loss = 0\n        try:\n            while True:\n                _, l = sess.run([optimizer, loss]) \n                total_loss += l\n        except tf.errors.OutOfRangeError:\n            pass\n            \n        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))\n\n    # close the writer when you're done using it\n    writer.close() 
\n    \n    # Step 9: output the values of w and b\n    w_out, b_out = sess.run([w, b]) \n    print('w: %f, b: %f' %(w_out, b_out))\nprint('Took: %f seconds' %(time.time() - start))\n\n# plot the results\nplt.plot(data[:,0], data[:,1], 'bo', label='Real data')\nplt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data with squared error')\n# plt.plot(data[:,0], data[:,0] * (-5.883589) + 85.124306, 'g', label='Predicted data with Huber loss')\nplt.legend()\nplt.show()"
  },
  {
    "path": "examples/03_linreg_placeholder.py",
    "content": "\"\"\" Solution for simple linear regression example using placeholders\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\nimport utils\n\nDATA_FILE = 'data/birth_life_2010.txt'\n\n# Step 1: read in data from the .txt file\ndata, n_samples = utils.read_birth_life_data(DATA_FILE)\n\n# Step 2: create placeholders for X (birth rate) and Y (life expectancy)\nX = tf.placeholder(tf.float32, name='X')\nY = tf.placeholder(tf.float32, name='Y')\n\n# Step 3: create weight and bias, initialized to 0\nw = tf.get_variable('weights', initializer=tf.constant(0.0))\nb = tf.get_variable('bias', initializer=tf.constant(0.0))\n\n# Step 4: build model to predict Y\nY_predicted = w * X + b \n\n# Step 5: use the squared error as the loss function\n# you can use either mean squared error or Huber loss\nloss = tf.square(Y - Y_predicted, name='loss')\n# loss = utils.huber_loss(Y, Y_predicted)\n\n# Step 6: using gradient descent with learning rate of 0.001 to minimize loss\noptimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)\n\n\nstart = time.time()\nwriter = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())\nwith tf.Session() as sess:\n\t# Step 7: initialize the necessary variables, in this case, w and b\n\tsess.run(tf.global_variables_initializer()) \n\t\n\t# Step 8: train the model for 100 epochs\n\tfor i in range(100): \n\t\ttotal_loss = 0\n\t\tfor x, y in data:\n\t\t\t# Session execute optimizer and fetch values of loss\n\t\t\t_, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y}) \n\t\t\ttotal_loss += l\n\t\tprint('Epoch {0}: {1}'.format(i, total_loss/n_samples))\n\n\t# close the writer when you're done using it\n\twriter.close() \n\t\n\t# Step 9: output the values of w and b\n\tw_out, 
b_out = sess.run([w, b]) \n\nprint('Took: %f seconds' %(time.time() - start))\n\n# plot the results\nplt.plot(data[:,0], data[:,1], 'bo', label='Real data')\nplt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')\nplt.legend()\nplt.show()"
  },
  {
    "path": "examples/03_linreg_starter.py",
    "content": "\"\"\" Starter code for simple linear regression example using placeholders\nCreated by Chip Huyen (huyenn@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\n\nimport utils\n\nDATA_FILE = 'data/birth_life_2010.txt'\n\n# Step 1: read in data from the .txt file\ndata, n_samples = utils.read_birth_life_data(DATA_FILE)\n\n# Step 2: create placeholders for X (birth rate) and Y (life expectancy)\n# Remember both X and Y are scalars with type float\nX, Y = None, None\n#############################\n########## TO DO ############\n#############################\n\n# Step 3: create weight and bias, initialized to 0.0\n# Make sure to use tf.get_variable\nw, b = None, None\n#############################\n########## TO DO ############\n#############################\n\n# Step 4: build model to predict Y\n# e.g. 
how would you arrive at Y_predicted given X, w, and b\nY_predicted = None\n#############################\n########## TO DO ############\n#############################\n\n# Step 5: use the squared error as the loss function\nloss = None\n#############################\n########## TO DO ############\n#############################\n\n# Step 6: using gradient descent with learning rate of 0.001 to minimize loss\noptimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)\n\nstart = time.time()\n\n# Create a filewriter to write the model's graph to TensorBoard\n#############################\n########## TO DO ############\n#############################\n\nwith tf.Session() as sess:\n    # Step 7: initialize the necessary variables, in this case, w and b\n    #############################\n    ########## TO DO ############\n    #############################\n\n    # Step 8: train the model for 100 epochs\n    for i in range(100):\n        total_loss = 0\n        for x, y in data:\n            # Execute optimizer and get the value of loss.\n            # Don't forget to feed in data for placeholders.\n            # Store the fetched value in l so the loss tensor is not overwritten.\n            _, l = ########## TO DO ############\n            total_loss += l\n\n        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))\n\n    # close the writer when you're done using it\n    #############################\n    ########## TO DO ############\n    #############################\n    writer.close()\n    \n    # Step 9: output the values of w and b\n    w_out, b_out = None, None\n    #############################\n    ########## TO DO ############\n    #############################\n\nprint('Took: %f seconds' %(time.time() - start))\n\n# uncomment the following lines to see the plot \n# plt.plot(data[:,0], data[:,1], 'bo', label='Real data')\n# plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')\n# plt.legend()\n# plt.show()"
  },
  {
    "path": "examples/03_logreg.py",
    "content": "\"\"\" Solution for simple logistic regression model for MNIST\nwith tf.data module\nMNIST dataset: yann.lecun.com/exdb/mnist/\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nimport time\n\nimport utils\n\n# Define parameters for the model\nlearning_rate = 0.01\nbatch_size = 128\nn_epochs = 30\nn_train = 60000\nn_test = 10000\n\n# Step 1: Read in data\nmnist_folder = 'data/mnist'\nutils.download_mnist(mnist_folder)\ntrain, val, test = utils.read_mnist(mnist_folder, flatten=True)\n\n# Step 2: Create datasets and iterator\ntrain_data = tf.data.Dataset.from_tensor_slices(train)\ntrain_data = train_data.shuffle(10000) # if you want to shuffle your data\ntrain_data = train_data.batch(batch_size)\n\ntest_data = tf.data.Dataset.from_tensor_slices(test)\ntest_data = test_data.batch(batch_size)\n\niterator = tf.data.Iterator.from_structure(train_data.output_types, \n                                           train_data.output_shapes)\nimg, label = iterator.get_next()\n\ntrain_init = iterator.make_initializer(train_data)\t# initializer for train_data\ntest_init = iterator.make_initializer(test_data)\t# initializer for test_data\n\n# Step 3: create weights and bias\n# w is initialized to random values with mean of 0, stddev of 0.01\n# b is initialized to 0\n# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)\n# shape of b depends on Y\nw = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer(0, 0.01))\nb = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())\n\n# Step 4: build model\n# the model that returns the logits.\n# these logits will later be passed through a softmax layer\nlogits = tf.matmul(img, w) + b \n\n# Step 5: define loss function\n# use cross entropy of softmax of 
logits as the loss function\nentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=label, name='entropy')\nloss = tf.reduce_mean(entropy, name='loss') # computes the mean over all the examples in the batch\n\n# Step 6: define training op\n# using the Adam optimizer with learning rate of 0.01 to minimize loss\noptimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)\n\n# Step 7: calculate accuracy with test set\n# accuracy here is the number of correct predictions in a batch;\n# it is divided by n_test after all test batches have been summed\npreds = tf.nn.softmax(logits)\ncorrect_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))\naccuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\nwriter = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())\nwith tf.Session() as sess:\n   \n    start_time = time.time()\n    sess.run(tf.global_variables_initializer())\n\n    # train the model n_epochs times\n    for i in range(n_epochs): \t\n        sess.run(train_init)\t# drawing samples from train_data\n        total_loss = 0\n        n_batches = 0\n        try:\n            while True:\n                _, l = sess.run([optimizer, loss])\n                total_loss += l\n                n_batches += 1\n        except tf.errors.OutOfRangeError:\n            pass\n        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))\n    print('Total time: {0} seconds'.format(time.time() - start_time))\n\n    # test the model\n    sess.run(test_init)\t\t\t# drawing samples from test_data\n    total_correct_preds = 0\n    try:\n        while True:\n            accuracy_batch = sess.run(accuracy)\n            total_correct_preds += accuracy_batch\n    except tf.errors.OutOfRangeError:\n        pass\n\n    print('Accuracy {0}'.format(total_correct_preds/n_test))\nwriter.close()\n"
  },
  {
    "path": "examples/03_logreg_placeholder.py",
    "content": "\"\"\" Solution for simple logistic regression model for MNIST\nwith placeholder\nMNIST dataset: yann.lecun.com/exdb/mnist/\nCreated by Chip Huyen (huyenn@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nfrom tensorflow.examples.tutorials.mnist import input_data\nimport time\n\nimport utils\n\n# Define parameters for the model\nlearning_rate = 0.01\nbatch_size = 128\nn_epochs = 30\n\n# Step 1: Read in data\n# using TF Learn's built-in function to load MNIST data to the folder data/mnist\nmnist = input_data.read_data_sets('data/mnist', one_hot=True)\nX_batch, Y_batch = mnist.train.next_batch(batch_size)\n\n# Step 2: create placeholders for features and labels\n# each image in the MNIST data is of shape 28*28 = 784\n# therefore, each image is represented with a 1x784 tensor\n# there are 10 classes for each image, corresponding to digits 0 - 9. 
\n# each label is a one-hot vector.\nX = tf.placeholder(tf.float32, [batch_size, 784], name='image') \nY = tf.placeholder(tf.int32, [batch_size, 10], name='label')\n\n# Step 3: create weights and bias\n# w is initialized to random values with mean of 0, stddev of 0.01\n# b is initialized to 0\n# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)\n# shape of b depends on Y\nw = tf.get_variable(name='weights', shape=(784, 10), initializer=tf.random_normal_initializer(0, 0.01))\nb = tf.get_variable(name='bias', shape=(1, 10), initializer=tf.zeros_initializer())\n\n# Step 4: build model\n# the model that returns the logits.\n# these logits will later be passed through a softmax layer\nlogits = tf.matmul(X, w) + b \n\n# Step 5: define loss function\n# use cross entropy of softmax of logits as the loss function\nentropy = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y, name='loss')\nloss = tf.reduce_mean(entropy) # computes the mean over all the examples in the batch\n# loss = tf.reduce_mean(-tf.reduce_sum(tf.cast(Y, tf.float32) * tf.log(tf.nn.softmax(logits)), reduction_indices=[1]))\n\n# Step 6: define training op\n# using the Adam optimizer with learning rate of 0.01 to minimize loss\noptimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)\n\n# Step 7: calculate accuracy with test set\npreds = tf.nn.softmax(logits)\ncorrect_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(Y, 1))\naccuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\nwriter = tf.summary.FileWriter('./graphs/logreg_placeholder', tf.get_default_graph())\nwith tf.Session() as sess:\n\tstart_time = time.time()\n\tsess.run(tf.global_variables_initializer())\t\n\tn_batches = int(mnist.train.num_examples/batch_size)\n\t\n\t# train the model n_epochs times\n\tfor i in range(n_epochs): \n\t\ttotal_loss = 0\n\n\t\tfor j in range(n_batches):\n\t\t\tX_batch, Y_batch = mnist.train.next_batch(batch_size)\n\t\t\t_, loss_batch = sess.run([optimizer, loss], {X: X_batch, Y:Y_batch}) 
\n\t\t\ttotal_loss += loss_batch\n\t\tprint('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))\n\tprint('Total time: {0} seconds'.format(time.time() - start_time))\n\n\t# test the model\n\tn_batches = int(mnist.test.num_examples/batch_size)\n\ttotal_correct_preds = 0\n\n\tfor i in range(n_batches):\n\t\tX_batch, Y_batch = mnist.test.next_batch(batch_size)\n\t\taccuracy_batch = sess.run(accuracy, {X: X_batch, Y:Y_batch})\n\t\ttotal_correct_preds += accuracy_batch\t\n\n\tprint('Accuracy {0}'.format(total_correct_preds/mnist.test.num_examples))\n\nwriter.close()\n"
  },
  {
    "path": "examples/03_logreg_starter.py",
    "content": "\"\"\" Starter code for simple logistic regression model for MNIST\nwith tf.data module\nMNIST dataset: yann.lecun.com/exdb/mnist/\nCreated by Chip Huyen (chiphuyen@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 03\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nimport time\n\nimport utils\n\n# Define parameters for the model\nlearning_rate = 0.01\nbatch_size = 128\nn_epochs = 30\nn_train = 60000\nn_test = 10000\n\n# Step 1: Read in data\nmnist_folder = 'data/mnist'\nutils.download_mnist(mnist_folder)\ntrain, val, test = utils.read_mnist(mnist_folder, flatten=True)\n\n# Step 2: Create datasets and iterator\n# create training Dataset and batch it\ntrain_data = tf.data.Dataset.from_tensor_slices(train)\ntrain_data = train_data.shuffle(10000) # if you want to shuffle your data\ntrain_data = train_data.batch(batch_size)\n\n# create testing Dataset and batch it\ntest_data = None\n#############################\n########## TO DO ############\n#############################\n\n\n# create one iterator and initialize it with different datasets\niterator = tf.data.Iterator.from_structure(train_data.output_types, \n                                           train_data.output_shapes)\nimg, label = iterator.get_next()\n\ntrain_init = iterator.make_initializer(train_data)\t# initializer for train_data\ntest_init = iterator.make_initializer(test_data)\t# initializer for test_data\n\n# Step 3: create weights and bias\n# w is initialized to random values with mean of 0, stddev of 0.01\n# b is initialized to 0\n# shape of w depends on the dimension of X and Y so that Y = tf.matmul(X, w)\n# shape of b depends on Y\nw, b = None, None\n#############################\n########## TO DO ############\n#############################\n\n\n# Step 4: build model\n# the model that returns the logits.\n# these logits will later be passed through a softmax layer\nlogits = 
None\n#############################\n########## TO DO ############\n#############################\n\n\n# Step 5: define loss function\n# use cross entropy of softmax of logits as the loss function\nloss = None\n#############################\n########## TO DO ############\n#############################\n\n\n# Step 6: define optimizer\n# using the Adam optimizer with the pre-defined learning rate to minimize loss\noptimizer = None\n#############################\n########## TO DO ############\n#############################\n\n\n# Step 7: calculate accuracy with test set\npreds = tf.nn.softmax(logits)\ncorrect_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(label, 1))\naccuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\nwriter = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())\nwith tf.Session() as sess:\n   \n    start_time = time.time()\n    sess.run(tf.global_variables_initializer())\n\n    # train the model n_epochs times\n    for i in range(n_epochs): \t\n        sess.run(train_init)\t# drawing samples from train_data\n        total_loss = 0\n        n_batches = 0\n        try:\n            while True:\n                _, l = sess.run([optimizer, loss])\n                total_loss += l\n                n_batches += 1\n        except tf.errors.OutOfRangeError:\n            pass\n        print('Average loss epoch {0}: {1}'.format(i, total_loss/n_batches))\n    print('Total time: {0} seconds'.format(time.time() - start_time))\n\n    # test the model\n    sess.run(test_init)\t\t\t# drawing samples from test_data\n    total_correct_preds = 0\n    try:\n        while True:\n            accuracy_batch = sess.run(accuracy)\n            total_correct_preds += accuracy_batch\n    except tf.errors.OutOfRangeError:\n        pass\n\n    print('Accuracy {0}'.format(total_correct_preds/n_test))\nwriter.close()"
  },
  {
    "path": "examples/04_linreg_eager.py",
    "content": "\"\"\" Solution for a simple regression example using eager execution.\nCreated by Akshay Agrawal (akshayka@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 04\n\"\"\"\nimport time\n\nimport tensorflow as tf\nimport tensorflow.contrib.eager as tfe\nimport matplotlib.pyplot as plt\n\nimport utils\n\nDATA_FILE = 'data/birth_life_2010.txt'\n\n# In order to use eager execution, `tfe.enable_eager_execution()` must be\n# called at the very beginning of a TensorFlow program.\ntfe.enable_eager_execution()\n\n# Read the data into a dataset.\ndata, n_samples = utils.read_birth_life_data(DATA_FILE)\ndataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))\n\n# Create variables.\nw = tfe.Variable(0.0)\nb = tfe.Variable(0.0)\n\n# Define the linear predictor.\ndef prediction(x):\n  return x * w + b\n\n# Define loss functions of the form: L(y, y_predicted)\ndef squared_loss(y, y_predicted):\n  return (y - y_predicted) ** 2\n\ndef huber_loss(y, y_predicted, m=1.0):\n  \"\"\"Huber loss.\"\"\"\n  t = y - y_predicted\n  # Note that enabling eager execution lets you use Python control flow and\n  # specify dynamic TensorFlow computations. 
Contrast this implementation\n  # to the graph-construction one found in `utils`, which uses `tf.cond`.\n  return t ** 2 if tf.abs(t) <= m else m * (2 * tf.abs(t) - m)\n\ndef train(loss_fn):\n  \"\"\"Train a regression model evaluated using `loss_fn`.\"\"\"\n  print('Training; loss function: ' + loss_fn.__name__)\n  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)\n\n  # Define the function through which to differentiate.\n  def loss_for_example(x, y):\n    return loss_fn(y, prediction(x))\n\n  # `grad_fn(x_i, y_i)` returns (1) the value of `loss_for_example`\n  # evaluated at `x_i`, `y_i` and (2) the gradients of any variables used in\n  # calculating it.\n  grad_fn = tfe.implicit_value_and_gradients(loss_for_example)\n\n  start = time.time()\n  for epoch in range(100):\n    total_loss = 0.0\n    for x_i, y_i in tfe.Iterator(dataset):\n      loss, gradients = grad_fn(x_i, y_i)\n      # Take an optimization step and update variables.\n      optimizer.apply_gradients(gradients)\n      total_loss += loss\n    if epoch % 10 == 0:\n      print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples))\n  print('Took: %f seconds' % (time.time() - start))\n  print('Eager execution exhibits significant overhead per operation. '\n        'As you increase your batch size, the impact of the overhead will '\n        'become less noticeable. Eager execution is under active development: '\n        'expect performance to increase substantially in the near future!')\n\ntrain(huber_loss)\nplt.plot(data[:,0], data[:,1], 'bo')\n# The `.numpy()` method of a tensor retrieves the NumPy array backing it.\n# In future versions of eager, you won't need to call `.numpy()` and will\n# instead be able to, in most cases, pass Tensors wherever NumPy arrays are\n# expected.\nplt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r',\n         label=\"huber regression\")\nplt.legend()\nplt.show()\n"
  },
  {
    "path": "examples/04_linreg_eager_starter.py",
    "content": "\"\"\" Starter code for a simple regression example using eager execution.\nCreated by Akshay Agrawal (akshayka@cs.stanford.edu)\nCS20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nLecture 04\n\"\"\"\nimport time\n\nimport tensorflow as tf\nimport tensorflow.contrib.eager as tfe\nimport matplotlib.pyplot as plt\n\nimport utils\n\nDATA_FILE = 'data/birth_life_2010.txt'\n\n# In order to use eager execution, `tfe.enable_eager_execution()` must be\n# called at the very beginning of a TensorFlow program.\n#############################\n########## TO DO ############\n#############################\n\n# Read the data into a dataset.\ndata, n_samples = utils.read_birth_life_data(DATA_FILE)\ndataset = tf.data.Dataset.from_tensor_slices((data[:,0], data[:,1]))\n\n# Create weight and bias variables, initialized to 0.0.\n#############################\n########## TO DO ############\n#############################\nw = None\nb = None\n\n# Define the linear predictor.\ndef prediction(x):\n  #############################\n  ########## TO DO ############\n  #############################\n  pass\n\n# Define loss functions of the form: L(y, y_predicted)\ndef squared_loss(y, y_predicted):\n  #############################\n  ########## TO DO ############\n  #############################\n  pass\n\ndef huber_loss(y, y_predicted):\n  \"\"\"Huber loss with `m` set to `1.0`.\"\"\"\n  #############################\n  ########## TO DO ############\n  #############################\n  pass\n\ndef train(loss_fn):\n  \"\"\"Train a regression model evaluated using `loss_fn`.\"\"\"\n  print('Training; loss function: ' + loss_fn.__name__)\n  optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)\n\n  # Define the function through which to differentiate.\n  #############################\n  ########## TO DO ############\n  #############################\n  def loss_for_example(x, y):\n    pass\n\n  # Obtain a gradients function using 
`tfe.implicit_value_and_gradients`.\n  #############################\n  ########## TO DO ############\n  #############################\n  grad_fn = None\n\n  start = time.time()\n  for epoch in range(100):\n    total_loss = 0.0\n    for x_i, y_i in tfe.Iterator(dataset):\n      # Compute the loss and gradient, and take an optimization step.\n      #############################\n      ########## TO DO ############\n      #############################\n      optimizer.apply_gradients(gradients)\n      total_loss += loss\n    if epoch % 10 == 0:\n      print('Epoch {0}: {1}'.format(epoch, total_loss / n_samples))\n  print('Took: %f seconds' % (time.time() - start))\n  print('Eager execution exhibits significant overhead per operation. '\n        'As you increase your batch size, the impact of the overhead will '\n        'become less noticeable. Eager execution is under active development: '\n        'expect performance to increase substantially in the near future!')\n\ntrain(huber_loss)\nplt.plot(data[:,0], data[:,1], 'bo')\n# The `.numpy()` method of a tensor retrieves the NumPy array backing it.\n# In future versions of eager, you won't need to call `.numpy()` and will\n# instead be able to, in most cases, pass Tensors wherever NumPy arrays are\n# expected.\nplt.plot(data[:,0], data[:,0] * w.numpy() + b.numpy(), 'r',\n         label=\"huber regression\")\nplt.legend()\nplt.show()\n"
  },
  {
    "path": "examples/04_word2vec.py",
    "content": "\"\"\" starter code for word2vec skip-gram model with NCE loss\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 04\n\"\"\"\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nfrom tensorflow.contrib.tensorboard.plugins import projector\nimport tensorflow as tf\n\nimport utils\nimport word2vec_utils\n\n# Model hyperparameters\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128            # dimension of the word embedding vectors\nSKIP_WINDOW = 1             # the context window\nNUM_SAMPLED = 64            # number of negative examples to sample\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 100000\nVISUAL_FLD = 'visualization'\nSKIP_STEP = 5000\n\n# Parameters for downloading data\nDOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'\nEXPECTED_BYTES = 31344016\nNUM_VISUALIZE = 3000        # number of tokens to visualize\n\n\ndef word2vec(dataset):\n    \"\"\" Build the graph for word2vec model and train it \"\"\"\n    # Step 1: get input, output from the dataset\n    with tf.name_scope('data'):\n        iterator = dataset.make_initializable_iterator()\n        center_words, target_words = iterator.get_next()\n\n    \"\"\" Step 2 + 3: define weights and embedding lookup.\n    In word2vec, it's actually the weights that we care about \n    \"\"\"\n    with tf.name_scope('embed'):\n        embed_matrix = tf.get_variable('embed_matrix', \n                                        shape=[VOCAB_SIZE, EMBED_SIZE],\n                                        initializer=tf.random_uniform_initializer())\n        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embedding')\n\n    # Step 4: construct variables for NCE loss and define loss function\n    with tf.name_scope('loss'):\n        nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE],\n                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 
0.5)))\n        nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))\n\n        # define loss function to be NCE loss function\n        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, \n                                            biases=nce_bias, \n                                            labels=target_words, \n                                            inputs=embed, \n                                            num_sampled=NUM_SAMPLED, \n                                            num_classes=VOCAB_SIZE), name='loss')\n\n    # Step 5: define optimizer\n    with tf.name_scope('optimizer'):\n        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)\n    \n    utils.safe_mkdir('checkpoints')\n\n    with tf.Session() as sess:\n        sess.run(iterator.initializer)\n        sess.run(tf.global_variables_initializer())\n\n        total_loss = 0.0 # accumulated so we can report the average loss over the last SKIP_STEP steps\n        writer = tf.summary.FileWriter('graphs/word2vec_simple', sess.graph)\n\n        for index in range(NUM_TRAIN_STEPS):\n            try:\n                loss_batch, _ = sess.run([loss, optimizer])\n                total_loss += loss_batch\n                if (index + 1) % SKIP_STEP == 0:\n                    print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))\n                    total_loss = 0.0\n            except tf.errors.OutOfRangeError:\n                sess.run(iterator.initializer)\n        writer.close()\n\ndef gen():\n    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, \n                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)\n\ndef main():\n    dataset = tf.data.Dataset.from_generator(gen, \n                                (tf.int32, tf.int32), \n                                (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))\n    word2vec(dataset)\n\nif __name__ == '__main__':\n    
main()\n"
  },
  {
    "path": "examples/04_word2vec_eager.py",
    "content": "\"\"\" starter code for word2vec skip-gram model with NCE loss\nEager execution\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu)\nLecture 04\n\"\"\"\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nimport tensorflow.contrib.eager as tfe\n\nimport utils\nimport word2vec_utils\n\ntfe.enable_eager_execution()\n\n# Model hyperparameters\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128            # dimension of the word embedding vectors\nSKIP_WINDOW = 1             # the context window\nNUM_SAMPLED = 64            # number of negative examples to sample\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 100000\nVISUAL_FLD = 'visualization'\nSKIP_STEP = 5000\n\n# Parameters for downloading data\nDOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'\nEXPECTED_BYTES = 31344016\n\nclass Word2Vec(object):\n  def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):\n    self.vocab_size = vocab_size\n    self.num_sampled = num_sampled\n    self.embed_matrix = tfe.Variable(tf.random_uniform(\n                                      [vocab_size, embed_size]))\n    self.nce_weight = tfe.Variable(tf.truncated_normal(\n                                    [vocab_size, embed_size],\n                                    stddev=1.0 / (embed_size ** 0.5)))\n    self.nce_bias = tfe.Variable(tf.zeros([vocab_size]))\n\n  def compute_loss(self, center_words, target_words):\n    \"\"\"Computes the forward pass of word2vec with the NCE loss.\"\"\" \n    embed = tf.nn.embedding_lookup(self.embed_matrix, center_words)\n    loss = tf.reduce_mean(tf.nn.nce_loss(weights=self.nce_weight, \n                                        biases=self.nce_bias, \n                                        labels=target_words, \n                                        inputs=embed, \n                                        
num_sampled=self.num_sampled, \n                                        num_classes=self.vocab_size))\n    return loss\n\n\ndef gen():\n  yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,\n                                      VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,\n                                      VISUAL_FLD)\n\ndef main():\n  dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),\n                              (tf.TensorShape([BATCH_SIZE]),\n                              tf.TensorShape([BATCH_SIZE, 1])))\n  optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)\n  model = Word2Vec(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE)\n  grad_fn = tfe.implicit_value_and_gradients(model.compute_loss)\n  total_loss = 0.0  # for average loss in the last SKIP_STEP steps\n  num_train_steps = 0\n  while num_train_steps < NUM_TRAIN_STEPS:\n    for center_words, target_words in tfe.Iterator(dataset):\n      if num_train_steps >= NUM_TRAIN_STEPS:\n        break\n      loss_batch, grads = grad_fn(center_words, target_words)\n      total_loss += loss_batch\n      optimizer.apply_gradients(grads)\n      if (num_train_steps + 1) % SKIP_STEP == 0:\n        print('Average loss at step {}: {:5.1f}'.format(\n                num_train_steps, total_loss / SKIP_STEP))\n        total_loss = 0.0\n      num_train_steps += 1\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "examples/04_word2vec_eager_starter.py",
    "content": "\"\"\" starter code for word2vec skip-gram model with NCE loss\nEager execution\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu) & Akshay Agrawal (akshayka@cs.stanford.edu)\nLecture 04\n\"\"\"\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nimport tensorflow as tf\nimport tensorflow.contrib.eager as tfe\n\nimport utils\nimport word2vec_utils\n\n# Enable eager execution!\n#############################\n########## TO DO ############\n#############################\n\n# Model hyperparameters\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128            # dimension of the word embedding vectors\nSKIP_WINDOW = 1             # the context window\nNUM_SAMPLED = 64            # number of negative examples to sample\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 100000\nVISUAL_FLD = 'visualization'\nSKIP_STEP = 5000\n\n# Parameters for downloading data\nDOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'\nEXPECTED_BYTES = 31344016\n\nclass Word2Vec(object):\n  def __init__(self, vocab_size, embed_size, num_sampled=NUM_SAMPLED):\n    self.vocab_size = vocab_size\n    self.num_sampled = num_sampled\n    # Create the variables: an embedding matrix, nce_weight, and nce_bias\n    #############################\n    ########## TO DO ############\n    #############################\n    self.embed_matrix = None\n    self.nce_weight = None\n    self.nce_bias = None\n\n  def compute_loss(self, center_words, target_words):\n    \"\"\"Computes the forward pass of word2vec with the NCE loss.\"\"\" \n    # Look up the embeddings for the center words\n    #############################\n    ########## TO DO ############\n    #############################\n    embed = None\n\n    # Compute the loss, using tf.reduce_mean and tf.nn.nce_loss\n    #############################\n    ########## TO DO ############\n    #############################\n    loss = None\n    return loss\n\n\ndef 
gen():\n  yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES,\n                                      VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW,\n                                      VISUAL_FLD)\n\ndef main():\n  dataset = tf.data.Dataset.from_generator(gen, (tf.int32, tf.int32),\n                              (tf.TensorShape([BATCH_SIZE]),\n                              tf.TensorShape([BATCH_SIZE, 1])))\n  optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE)\n  # Create the model\n  #############################\n  ########## TO DO ############\n  #############################\n  model = None\n\n  # Create the gradients function, using `tfe.implicit_value_and_gradients`\n  #############################\n  ########## TO DO ############\n  #############################\n  grad_fn = None\n\n  total_loss = 0.0  # for average loss in the last SKIP_STEP steps\n  num_train_steps = 0\n  while num_train_steps < NUM_TRAIN_STEPS:\n    for center_words, target_words in tfe.Iterator(dataset):\n      if num_train_steps >= NUM_TRAIN_STEPS:\n        break\n\n      # Compute the loss and gradients, and take an optimization step.\n      #############################\n      ########## TO DO ############\n      #############################\n      \n      if (num_train_steps + 1) % SKIP_STEP == 0:\n        print('Average loss at step {}: {:5.1f}'.format(\n                num_train_steps, total_loss / SKIP_STEP))\n        total_loss = 0.0\n      num_train_steps += 1\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "examples/04_word2vec_visualize.py",
    "content": "\"\"\" word2vec skip-gram model with NCE loss and \ncode to visualize the embeddings on TensorBoard\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 04\n\"\"\"\n\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport numpy as np\nfrom tensorflow.contrib.tensorboard.plugins import projector\nimport tensorflow as tf\n\nimport utils\nimport word2vec_utils\n\n# Model hyperparameters\nVOCAB_SIZE = 50000\nBATCH_SIZE = 128\nEMBED_SIZE = 128            # dimension of the word embedding vectors\nSKIP_WINDOW = 1             # the context window\nNUM_SAMPLED = 64            # number of negative examples to sample\nLEARNING_RATE = 1.0\nNUM_TRAIN_STEPS = 100000\nVISUAL_FLD = 'visualization'\nSKIP_STEP = 5000\n\n# Parameters for downloading data\nDOWNLOAD_URL = 'http://mattmahoney.net/dc/text8.zip'\nEXPECTED_BYTES = 31344016\nNUM_VISUALIZE = 3000        # number of tokens to visualize\n\nclass SkipGramModel:\n    \"\"\" Build the graph for word2vec model \"\"\"\n    def __init__(self, dataset, vocab_size, embed_size, batch_size, num_sampled, learning_rate):\n        self.vocab_size = vocab_size\n        self.embed_size = embed_size\n        self.batch_size = batch_size\n        self.num_sampled = num_sampled\n        self.lr = learning_rate\n        self.global_step = tf.get_variable('global_step', initializer=tf.constant(0), trainable=False)\n        self.skip_step = SKIP_STEP\n        self.dataset = dataset\n\n    def _import_data(self):\n        \"\"\" Step 1: import data\n        \"\"\"\n        with tf.name_scope('data'):\n            self.iterator = self.dataset.make_initializable_iterator()\n            self.center_words, self.target_words = self.iterator.get_next()\n\n    def _create_embedding(self):\n        \"\"\" Step 2 + 3: define weights and embedding lookup.\n        In word2vec, it's actually the weights that we care about \n        \"\"\"\n        with 
tf.name_scope('embed'):\n            self.embed_matrix = tf.get_variable('embed_matrix', \n                                                shape=[self.vocab_size, self.embed_size],\n                                                initializer=tf.random_uniform_initializer())\n            self.embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embedding')\n\n    def _create_loss(self):\n        \"\"\" Step 4: define the loss function \"\"\"\n        with tf.name_scope('loss'):\n            # construct variables for NCE loss\n            nce_weight = tf.get_variable('nce_weight', \n                        shape=[self.vocab_size, self.embed_size],\n                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (self.embed_size ** 0.5)))\n            # size the bias from the model's own vocab_size rather than the module-level constant\n            nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([self.vocab_size]))\n\n            # define loss function to be NCE loss function\n            self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, \n                                                biases=nce_bias, \n                                                labels=self.target_words, \n                                                inputs=self.embed, \n                                                num_sampled=self.num_sampled, \n                                                num_classes=self.vocab_size), name='loss')\n\n    def _create_optimizer(self):\n        \"\"\" Step 5: define optimizer \"\"\"\n        self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss, \n                                                              global_step=self.global_step)\n\n    def _create_summaries(self):\n        with tf.name_scope('summaries'):\n            tf.summary.scalar('loss', self.loss)\n            tf.summary.histogram('histogram loss', self.loss)\n            # because there are several summaries, merge them all\n            # into one op to make them easier to manage\n            
self.summary_op = tf.summary.merge_all()\n\n    def build_graph(self):\n        \"\"\" Build the graph for our model \"\"\"\n        self._import_data()\n        self._create_embedding()\n        self._create_loss()\n        self._create_optimizer()\n        self._create_summaries()\n\n    def train(self, num_train_steps):\n        saver = tf.train.Saver() # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias, global_step\n\n        initial_step = 0\n        utils.safe_mkdir('checkpoints')\n        with tf.Session() as sess:\n            sess.run(self.iterator.initializer)\n            sess.run(tf.global_variables_initializer())\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))\n\n            # if that checkpoint exists, restore from checkpoint\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n\n            total_loss = 0.0 # accumulated so we can report the average loss over the last SKIP_STEP steps\n            writer = tf.summary.FileWriter('graphs/word2vec/lr' + str(self.lr), sess.graph)\n            initial_step = self.global_step.eval()\n\n            for index in range(initial_step, initial_step + num_train_steps):\n                try:\n                    loss_batch, _, summary = sess.run([self.loss, self.optimizer, self.summary_op])\n                    writer.add_summary(summary, global_step=index)\n                    total_loss += loss_batch\n                    if (index + 1) % self.skip_step == 0:\n                        print('Average loss at step {}: {:5.1f}'.format(index, total_loss / self.skip_step))\n                        total_loss = 0.0\n                        saver.save(sess, 'checkpoints/skip-gram', index)\n                except tf.errors.OutOfRangeError:\n                    sess.run(self.iterator.initializer)\n            writer.close()\n\n    def visualize(self, visual_fld, num_visualize):\n        \"\"\" run 
'tensorboard --logdir=visualization' to see the embeddings \"\"\"\n        \n        # create the list of the num_visualize most common words to visualize\n        word2vec_utils.most_common_words(visual_fld, num_visualize)\n\n        saver = tf.train.Saver()\n        with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))\n\n            # if that checkpoint exists, restore from checkpoint\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n\n            final_embed_matrix = sess.run(self.embed_matrix)\n            \n            # you have to store embeddings in a new variable\n            embedding_var = tf.Variable(final_embed_matrix[:num_visualize], name='embedding')\n            sess.run(embedding_var.initializer)\n\n            config = projector.ProjectorConfig()\n            summary_writer = tf.summary.FileWriter(visual_fld)\n\n            # add embedding to the config file\n            embedding = config.embeddings.add()\n            embedding.tensor_name = embedding_var.name\n            \n            # link this tensor to its metadata file, in this case the first num_visualize words of vocab\n            embedding.metadata_path = 'vocab_' + str(num_visualize) + '.tsv'\n\n            # saves a configuration file that TensorBoard will read during startup.\n            projector.visualize_embeddings(summary_writer, config)\n            saver_embed = tf.train.Saver([embedding_var])\n            saver_embed.save(sess, os.path.join(visual_fld, 'model.ckpt'), 1)\n\ndef gen():\n    yield from word2vec_utils.batch_gen(DOWNLOAD_URL, EXPECTED_BYTES, VOCAB_SIZE, \n                                        BATCH_SIZE, SKIP_WINDOW, VISUAL_FLD)\n\ndef main():\n    dataset = tf.data.Dataset.from_generator(gen, \n                                (tf.int32, tf.int32), \n                      
          (tf.TensorShape([BATCH_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))\n    model = SkipGramModel(dataset, VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)\n    model.build_graph()\n    model.train(NUM_TRAIN_STEPS)\n    model.visualize(VISUAL_FLD, NUM_VISUALIZE)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "examples/05_randomization.py",
    "content": "\"\"\" Examples to demonstrate ops level randomization\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 05\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\n# Example 1: session keeps track of the random state\nc = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.574932\n    print(sess.run(c)) # >> -5.9731865\n\n# Example 2: each new session will start the random state all over again.\nc = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.574932\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.574932\n\n# Example 3: with operation level random seed, each op keeps its own seed.\nc = tf.random_uniform([], -10, 10, seed=2)\nd = tf.random_uniform([], -10, 10, seed=2)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 3.574932\n    print(sess.run(d)) # >> 3.574932\n\n# Example 4: graph level random seed\ntf.set_random_seed(2)\nc = tf.random_uniform([], -10, 10)\nd = tf.random_uniform([], -10, 10)\n\nwith tf.Session() as sess:\n    print(sess.run(c)) # >> 9.123926\n    print(sess.run(d)) # >> -4.5340395\n    "
  },
  {
    "path": "examples/05_variable_sharing.py",
    "content": "\"\"\" Examples to demonstrate variable sharing\nCS 20: 'TensorFlow for Deep Learning Research'\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 05\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport tensorflow as tf\n\nx1 = tf.truncated_normal([200, 100], name='x1')\nx2 = tf.truncated_normal([200, 100], name='x2')\n\ndef two_hidden_layers(x):\n    assert x.shape.as_list() == [200, 100]\n    w1 = tf.Variable(tf.random_normal([100, 50]), name='h1_weights')\n    b1 = tf.Variable(tf.zeros([50]), name='h1_biases')\n    h1 = tf.matmul(x, w1) + b1\n    assert h1.shape.as_list() == [200, 50]  \n    w2 = tf.Variable(tf.random_normal([50, 10]), name='h2_weights')\n    b2 = tf.Variable(tf.zeros([10]), name='2_biases')\n    logits = tf.matmul(h1, w2) + b2\n    return logits\n\ndef two_hidden_layers_2(x):\n    assert x.shape.as_list() == [200, 100]\n    w1 = tf.get_variable('h1_weights', [100, 50], initializer=tf.random_normal_initializer())\n    b1 = tf.get_variable('h1_biases', [50], initializer=tf.constant_initializer(0.0))\n    h1 = tf.matmul(x, w1) + b1\n    assert h1.shape.as_list() == [200, 50]  \n    w2 = tf.get_variable('h2_weights', [50, 10], initializer=tf.random_normal_initializer())\n    b2 = tf.get_variable('h2_biases', [10], initializer=tf.constant_initializer(0.0))\n    logits = tf.matmul(h1, w2) + b2\n    return logits\n\n# logits1 = two_hidden_layers(x1)\n# logits2 = two_hidden_layers(x2)\n\n# logits1 = two_hidden_layers_2(x1)\n# logits2 = two_hidden_layers_2(x2)\n\n# with tf.variable_scope('two_layers') as scope:\n#     logits1 = two_hidden_layers_2(x1)\n#     scope.reuse_variables()\n#     logits2 = two_hidden_layers_2(x2)\n\n# with tf.variable_scope('two_layers') as scope:\n#     logits1 = two_hidden_layers_2(x1)\n#     scope.reuse_variables()\n#     logits2 = two_hidden_layers_2(x2)\n\ndef fully_connected(x, output_dim, scope):\n    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE) as scope:\n        w 
= tf.get_variable('weights', [x.shape[1], output_dim], initializer=tf.random_normal_initializer())\n        b = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))\n        return tf.matmul(x, w) + b\n\ndef two_hidden_layers(x):\n    h1 = fully_connected(x, 50, 'h1')\n    h2 = fully_connected(h1, 10, 'h2')\n    return h2  # without this return, logits1 and logits2 below would be None\n\nwith tf.variable_scope('two_layers') as scope:\n    logits1 = two_hidden_layers(x1)\n    # scope.reuse_variables()\n    logits2 = two_hidden_layers(x2)\n\nwriter = tf.summary.FileWriter('./graphs/cool_variables', tf.get_default_graph())\nwriter.close()"
  },
  {
    "path": "examples/07_convnet_layers.py",
    "content": "\"\"\" Using convolutional net on MNIST dataset of handwritten digits\nMNIST dataset: http://yann.lecun.com/exdb/mnist/\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 07\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time \n\nimport tensorflow as tf\n\nimport utils\n\nclass ConvNet(object):\n    def __init__(self):\n        self.lr = 0.001\n        self.batch_size = 128\n        self.keep_prob = tf.constant(0.75)\n        self.gstep = tf.Variable(0, dtype=tf.int32, \n                                trainable=False, name='global_step')\n        self.n_classes = 10\n        self.skip_step = 20\n        self.n_test = 10000\n        self.training=False\n\n    def get_data(self):\n        with tf.name_scope('data'):\n            train_data, test_data = utils.get_mnist_dataset(self.batch_size)\n            iterator = tf.data.Iterator.from_structure(train_data.output_types, \n                                                   train_data.output_shapes)\n            img, self.label = iterator.get_next()\n            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])\n            # reshape the image to make it work with tf.nn.conv2d\n\n            self.train_init = iterator.make_initializer(train_data)  # initializer for train_data\n            self.test_init = iterator.make_initializer(test_data)    # initializer for train_data\n\n    def inference(self):\n        conv1 = tf.layers.conv2d(inputs=self.img,\n                                  filters=32,\n                                  kernel_size=[5, 5],\n                                  padding='SAME',\n                                  activation=tf.nn.relu,\n                                  name='conv1')\n        pool1 = tf.layers.max_pooling2d(inputs=conv1, \n                                        pool_size=[2, 2], \n                                        strides=2,\n                                        
name='pool1')\n\n        conv2 = tf.layers.conv2d(inputs=pool1,\n                                  filters=64,\n                                  kernel_size=[5, 5],\n                                  padding='SAME',\n                                  activation=tf.nn.relu,\n                                  name='conv2')\n        pool2 = tf.layers.max_pooling2d(inputs=conv2, \n                                        pool_size=[2, 2], \n                                        strides=2,\n                                        name='pool2')\n\n        feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3]\n        pool2 = tf.reshape(pool2, [-1, feature_dim])\n        fc = tf.layers.dense(pool2, 1024, activation=tf.nn.relu, name='fc')\n        dropout = tf.layers.dropout(fc, \n                                    rate=1 - self.keep_prob,  # `rate` is the fraction of units to drop, not to keep\n                                    training=self.training, \n                                    name='dropout')\n        self.logits = tf.layers.dense(dropout, self.n_classes, name='logits')\n\n    def loss(self):\n        '''\n        define loss function\n        use softmax cross entropy with logits as the loss function\n        compute mean cross entropy, softmax is applied internally\n        '''\n        with tf.name_scope('loss'):\n            entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits)\n            self.loss = tf.reduce_mean(entropy, name='loss')\n    \n    def optimize(self):\n        '''\n        Define training op\n        using the Adam optimizer to minimize cost\n        '''\n        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, \n                                                global_step=self.gstep)\n\n    def summary(self):\n        '''\n        Create summaries to write on TensorBoard\n        '''\n        with tf.name_scope('summaries'):\n            tf.summary.scalar('loss', self.loss)\n            tf.summary.scalar('accuracy', 
self.accuracy)\n            tf.summary.histogram('histogram loss', self.loss)\n            self.summary_op = tf.summary.merge_all()\n    \n    def eval(self):\n        '''\n        Count the number of right predictions in a batch\n        '''\n        with tf.name_scope('predict'):\n            preds = tf.nn.softmax(self.logits)\n            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))\n            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\n    def build(self):\n        '''\n        Build the computation graph\n        '''\n        self.get_data()\n        self.inference()\n        self.loss()\n        self.optimize()\n        self.eval()\n        self.summary()\n\n    def train_one_epoch(self, sess, saver, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init) \n        self.training = True\n        total_loss = 0\n        n_batches = 0\n        try:\n            while True:\n                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])\n                writer.add_summary(summaries, global_step=step)\n                if (step + 1) % self.skip_step == 0:\n                    print('Loss at step {0}: {1}'.format(step, l))\n                step += 1\n                total_loss += l\n                n_batches += 1\n        except tf.errors.OutOfRangeError:\n            pass\n        saver.save(sess, 'checkpoints/convnet_layers/mnist-convnet', step)\n        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n        return step\n\n    def eval_once(self, sess, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init)\n        self.training = False\n        total_correct_preds = 0\n        try:\n            while True:\n                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])\n                
writer.add_summary(summaries, global_step=step)\n                total_correct_preds += accuracy_batch\n        except tf.errors.OutOfRangeError:\n            pass\n\n        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n\n    def train(self, n_epochs):\n        '''\n        The train function alternates between training one epoch and evaluating\n        '''\n        utils.safe_mkdir('checkpoints')\n        utils.safe_mkdir('checkpoints/convnet_layers')\n        writer = tf.summary.FileWriter('./graphs/convnet_layers', tf.get_default_graph())\n\n        with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            saver = tf.train.Saver()\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_layers/checkpoint'))\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n            \n            step = self.gstep.eval()\n\n            for epoch in range(n_epochs):\n                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)\n                self.eval_once(sess, self.test_init, writer, epoch, step)\n        writer.close()\n\nif __name__ == '__main__':\n    model = ConvNet()\n    model.build()\n    model.train(n_epochs=15)"
  },
  {
    "path": "examples/07_convnet_mnist.py",
    "content": "\"\"\" Using convolutional net on MNIST dataset of handwritten digits\nMNIST dataset: http://yann.lecun.com/exdb/mnist/\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 07\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time \n\nimport tensorflow as tf\n\nimport utils\n\ndef conv_relu(inputs, filters, k_size, stride, padding, scope_name):\n    '''\n    A method that does convolution + relu on inputs\n    '''\n    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:\n        in_channels = inputs.shape[-1]\n        kernel = tf.get_variable('kernel', \n                                [k_size, k_size, in_channels, filters], \n                                initializer=tf.truncated_normal_initializer())\n        biases = tf.get_variable('biases', \n                                [filters],\n                                initializer=tf.random_normal_initializer())\n        conv = tf.nn.conv2d(inputs, kernel, strides=[1, stride, stride, 1], padding=padding)\n    return tf.nn.relu(conv + biases, name=scope.name)\n\ndef maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'):\n    '''A method that does max pooling on inputs'''\n    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:\n        pool = tf.nn.max_pool(inputs, \n                            ksize=[1, ksize, ksize, 1], \n                            strides=[1, stride, stride, 1],\n                            padding=padding)\n    return pool\n\ndef fully_connected(inputs, out_dim, scope_name='fc'):\n    '''\n    A fully connected linear layer on inputs\n    '''\n    with tf.variable_scope(scope_name, reuse=tf.AUTO_REUSE) as scope:\n        in_dim = inputs.shape[-1]\n        w = tf.get_variable('weights', [in_dim, out_dim],\n                            initializer=tf.truncated_normal_initializer())\n        b = tf.get_variable('biases', [out_dim],\n                         
   initializer=tf.constant_initializer(0.0))\n        out = tf.matmul(inputs, w) + b\n    return out\n\nclass ConvNet(object):\n    def __init__(self):\n        self.lr = 0.001\n        self.batch_size = 128\n        self.keep_prob = tf.constant(0.75)\n        self.gstep = tf.Variable(0, dtype=tf.int32, \n                                trainable=False, name='global_step')\n        self.n_classes = 10\n        self.skip_step = 20\n        self.n_test = 10000\n        self.training = True\n\n    def get_data(self):\n        with tf.name_scope('data'):\n            train_data, test_data = utils.get_mnist_dataset(self.batch_size)\n            iterator = tf.data.Iterator.from_structure(train_data.output_types, \n                                                   train_data.output_shapes)\n            img, self.label = iterator.get_next()\n            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])\n            # reshape the image to make it work with tf.nn.conv2d\n\n            self.train_init = iterator.make_initializer(train_data)  # initializer for train_data\n            self.test_init = iterator.make_initializer(test_data)    # initializer for test_data\n\n    def inference(self):\n        conv1 = conv_relu(inputs=self.img,\n                        filters=32,\n                        k_size=5,\n                        stride=1,\n                        padding='SAME',\n                        scope_name='conv1')\n        pool1 = maxpool(conv1, 2, 2, 'VALID', 'pool1')\n        conv2 = conv_relu(inputs=pool1,\n                        filters=64,\n                        k_size=5,\n                        stride=1,\n                        padding='SAME',\n                        scope_name='conv2')\n        pool2 = maxpool(conv2, 2, 2, 'VALID', 'pool2')\n        feature_dim = pool2.shape[1] * pool2.shape[2] * pool2.shape[3]\n        pool2 = tf.reshape(pool2, [-1, feature_dim])\n        fc = fully_connected(pool2, 1024, 'fc')\n        dropout = 
tf.nn.dropout(tf.nn.relu(fc), self.keep_prob, name='relu_dropout')\n        self.logits = fully_connected(dropout, self.n_classes, 'logits')\n\n    def loss(self):\n        '''\n        define loss function\n        use softmax cross entropy with logits as the loss function\n        compute mean cross entropy, softmax is applied internally\n        '''\n        # \n        with tf.name_scope('loss'):\n            entropy = tf.nn.softmax_cross_entropy_with_logits(labels=self.label, logits=self.logits)\n            self.loss = tf.reduce_mean(entropy, name='loss')\n    \n    def optimize(self):\n        '''\n        Define training op\n        using Adam Gradient Descent to minimize cost\n        '''\n        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, \n                                                global_step=self.gstep)\n\n    def summary(self):\n        '''\n        Create summaries to write on TensorBoard\n        '''\n        with tf.name_scope('summaries'):\n            tf.summary.scalar('loss', self.loss)\n            tf.summary.scalar('accuracy', self.accuracy)\n            tf.summary.histogram('histogram loss', self.loss)\n            self.summary_op = tf.summary.merge_all()\n    \n    def eval(self):\n        '''\n        Count the number of right predictions in a batch\n        '''\n        with tf.name_scope('predict'):\n            preds = tf.nn.softmax(self.logits)\n            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))\n            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\n    def build(self):\n        '''\n        Build the computation graph\n        '''\n        self.get_data()\n        self.inference()\n        self.loss()\n        self.optimize()\n        self.eval()\n        self.summary()\n\n    def train_one_epoch(self, sess, saver, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init) \n        self.training = True\n        total_loss = 
0\n        n_batches = 0\n        try:\n            while True:\n                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])\n                writer.add_summary(summaries, global_step=step)\n                if (step + 1) % self.skip_step == 0:\n                    print('Loss at step {0}: {1}'.format(step, l))\n                step += 1\n                total_loss += l\n                n_batches += 1\n        except tf.errors.OutOfRangeError:\n            pass\n        saver.save(sess, 'checkpoints/convnet_mnist/mnist-convnet', step)\n        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n        return step\n\n    def eval_once(self, sess, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init)\n        self.training = False\n        total_correct_preds = 0\n        try:\n            while True:\n                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])\n                writer.add_summary(summaries, global_step=step)\n                total_correct_preds += accuracy_batch\n        except tf.errors.OutOfRangeError:\n            pass\n\n        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n\n    def train(self, n_epochs):\n        '''\n        The train function alternates between training one epoch and evaluating\n        '''\n        utils.safe_mkdir('checkpoints')\n        utils.safe_mkdir('checkpoints/convnet_mnist')\n        writer = tf.summary.FileWriter('./graphs/convnet', tf.get_default_graph())\n\n        with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            saver = tf.train.Saver()\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_mnist/checkpoint'))\n            if ckpt and 
ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n            \n            step = self.gstep.eval()\n\n            for epoch in range(n_epochs):\n                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)\n                self.eval_once(sess, self.test_init, writer, epoch, step)\n        writer.close()\n\nif __name__ == '__main__':\n    model = ConvNet()\n    model.build()\n    model.train(n_epochs=30)\n"
  },
  {
    "path": "examples/07_convnet_mnist_starter.py",
    "content": "\"\"\" Using convolutional net on MNIST dataset of handwritten digits\nMNIST dataset: http://yann.lecun.com/exdb/mnist/\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 07\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport time \n\nimport tensorflow as tf\n\nimport utils\n\ndef conv_relu(inputs, filters, k_size, stride, padding, scope_name):\n    '''\n    A method that does convolution + relu on inputs\n    '''\n    #############################\n    ########## TO DO ############\n    #############################\n    return None\n\ndef maxpool(inputs, ksize, stride, padding='VALID', scope_name='pool'):\n    '''A method that does max pooling on inputs'''\n    #############################\n    ########## TO DO ############\n    #############################\n    return None\n\ndef fully_connected(inputs, out_dim, scope_name='fc'):\n    '''\n    A fully connected linear layer on inputs\n    '''\n    #############################\n    ########## TO DO ############\n    #############################\n    return None\n\nclass ConvNet(object):\n    def __init__(self):\n        self.lr = 0.001\n        self.batch_size = 128\n        self.keep_prob = tf.constant(0.75)\n        self.gstep = tf.Variable(0, dtype=tf.int32, \n                                trainable=False, name='global_step')\n        self.n_classes = 10\n        self.skip_step = 20\n        self.n_test = 10000\n\n    def get_data(self):\n        with tf.name_scope('data'):\n            train_data, test_data = utils.get_mnist_dataset(self.batch_size)\n            iterator = tf.data.Iterator.from_structure(train_data.output_types, \n                                                   train_data.output_shapes)\n            img, self.label = iterator.get_next()\n            self.img = tf.reshape(img, shape=[-1, 28, 28, 1])\n            # reshape the image to make it work with tf.nn.conv2d\n\n            
self.train_init = iterator.make_initializer(train_data)  # initializer for train_data\n            self.test_init = iterator.make_initializer(test_data)    # initializer for train_data\n\n    def inference(self):\n        '''\n        Build the model according to the description we've shown in class\n        '''\n        #############################\n        ########## TO DO ############\n        #############################\n        self.logits = None\n\n    def loss(self):\n        '''\n        define loss function\n        use softmax cross entropy with logits as the loss function\n        tf.nn.softmax_cross_entropy_with_logits\n        softmax is applied internally\n        don't forget to compute mean cross all sample in a batch\n        '''\n        #############################\n        ########## TO DO ############\n        #############################\n        self.loss = None\n    \n    def optimize(self):\n        '''\n        Define training op\n        using Adam Gradient Descent to minimize cost\n        Don't forget to use global step\n        '''\n        #############################\n        ########## TO DO ############\n        #############################\n        self.opt = None\n\n    def summary(self):\n        '''\n        Create summaries to write on TensorBoard\n        Remember to track both training loss and test accuracy\n        '''\n        #############################\n        ########## TO DO ############\n        #############################\n        self.summary_op = None\n        \n    def eval(self):\n        '''\n        Count the number of right predictions in a batch\n        '''\n        with tf.name_scope('predict'):\n            preds = tf.nn.softmax(self.logits)\n            correct_preds = tf.equal(tf.argmax(preds, 1), tf.argmax(self.label, 1))\n            self.accuracy = tf.reduce_sum(tf.cast(correct_preds, tf.float32))\n\n    def build(self):\n        '''\n        Build the computation graph\n        '''\n     
   self.get_data()\n        self.inference()\n        self.loss()\n        self.optimize()\n        self.eval()\n        self.summary()\n\n    def train_one_epoch(self, sess, saver, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init) \n        total_loss = 0\n        n_batches = 0\n        try:\n            while True:\n                _, l, summaries = sess.run([self.opt, self.loss, self.summary_op])\n                writer.add_summary(summaries, global_step=step)\n                if (step + 1) % self.skip_step == 0:\n                    print('Loss at step {0}: {1}'.format(step, l))\n                step += 1\n                total_loss += l\n                n_batches += 1\n        except tf.errors.OutOfRangeError:\n            pass\n        saver.save(sess, 'checkpoints/convnet_starter/mnist-convnet', step)\n        print('Average loss at epoch {0}: {1}'.format(epoch, total_loss/n_batches))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n        return step\n\n    def eval_once(self, sess, init, writer, epoch, step):\n        start_time = time.time()\n        sess.run(init)\n        total_correct_preds = 0\n        try:\n            while True:\n                accuracy_batch, summaries = sess.run([self.accuracy, self.summary_op])\n                writer.add_summary(summaries, global_step=step)\n                total_correct_preds += accuracy_batch\n        except tf.errors.OutOfRangeError:\n            pass\n\n        print('Accuracy at epoch {0}: {1} '.format(epoch, total_correct_preds/self.n_test))\n        print('Took: {0} seconds'.format(time.time() - start_time))\n\n    def train(self, n_epochs):\n        '''\n        The train function alternates between training one epoch and evaluating\n        '''\n        utils.safe_mkdir('checkpoints')\n        utils.safe_mkdir('checkpoints/convnet_starter')\n        writer = tf.summary.FileWriter('./graphs/convnet_starter', tf.get_default_graph())\n\n        
with tf.Session() as sess:\n            sess.run(tf.global_variables_initializer())\n            saver = tf.train.Saver()\n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/convnet_starter/checkpoint'))\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n            \n            step = self.gstep.eval()\n\n            for epoch in range(n_epochs):\n                step = self.train_one_epoch(sess, saver, self.train_init, writer, epoch, step)\n                self.eval_once(sess, self.test_init, writer, epoch, step)\n        writer.close()\n\nif __name__ == '__main__':\n    model = ConvNet()\n    model.build()\n    model.train(n_epochs=15)"
  },
  {
    "path": "examples/07_run_kernels.py",
    "content": "\"\"\"\nSimple examples of convolution to do some basic filters\nAlso demonstrates the use of TensorFlow data readers.\n\nWe will use some popular filters for our image.\nIt seems to be working with grayscale images, but not with rgb images.\nIt's probably because I didn't choose the right kernels for rgb images.\n\nkernels for rgb images have dimensions 3 x 3 x 3 x 3\nkernels for grayscale images have dimensions 3 x 3 x 1 x 1\n\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nChip Huyen (chiphuyen@cs.stanford.edu)\nLecture 07\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nimport sys\nsys.path.append('..')\n\nfrom matplotlib import gridspec as gridspec\nfrom matplotlib import pyplot as plt\nimport tensorflow as tf\n\nimport kernels\n\ndef read_one_image(filename):\n    ''' This method is to show how to read image from a file into a tensor.\n    The output is a tensor object.\n    '''\n    image_string = tf.read_file(filename)\n    image_decoded = tf.image.decode_image(image_string)\n    image = tf.cast(image_decoded, tf.float32) / 256.0\n    return image\n\ndef convolve(image, kernels, rgb=True, strides=[1, 3, 3, 1], padding='SAME'):\n    images = [image[0]]\n    for i, kernel in enumerate(kernels):\n        filtered_image = tf.nn.conv2d(image, \n                                      kernel, \n                                      strides=strides,\n                                      padding=padding)[0]\n        if i == 2:\n            filtered_image = tf.minimum(tf.nn.relu(filtered_image), 255)\n        images.append(filtered_image)\n    return images\n\ndef show_images(images, rgb=True):\n    gs = gridspec.GridSpec(1, len(images))\n    for i, image in enumerate(images):\n        plt.subplot(gs[0, i])\n        if rgb:\n            plt.imshow(image)\n        else: \n            image = image.reshape(image.shape[0], image.shape[1])\n            plt.imshow(image, cmap='gray')\n        plt.axis('off')\n    
plt.show()\n\ndef main():\n    rgb = False\n    if rgb:\n        kernels_list = [kernels.BLUR_FILTER_RGB, \n                        kernels.SHARPEN_FILTER_RGB, \n                        kernels.EDGE_FILTER_RGB,\n                        kernels.TOP_SOBEL_RGB,\n                        kernels.EMBOSS_FILTER_RGB]\n    else:\n        kernels_list = [kernels.BLUR_FILTER,\n                        kernels.SHARPEN_FILTER,\n                        kernels.EDGE_FILTER,\n                        kernels.TOP_SOBEL,\n                        kernels.EMBOSS_FILTER]\n\n    kernels_list = kernels_list[1:]\n    image = read_one_image('data/friday.jpg')\n    if not rgb:\n        image = tf.image.rgb_to_grayscale(image)\n    image = tf.expand_dims(image, 0) # make it into a batch of 1 element\n    images = convolve(image, kernels_list, rgb)\n    with tf.Session() as sess:\n        images = sess.run(images) # convert images from tensors to float values\n    show_images(images, rgb)\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "examples/11_char_rnn.py",
    "content": "\"\"\" A clean, no_frills character-level generative language model.\n\nCS 20: \"TensorFlow for Deep Learning Research\"\ncs20.stanford.edu\nDanijar Hafner (mail@danijar.com)\n& Chip Huyen (chiphuyen@cs.stanford.edu)\nLecture 11\n\"\"\"\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\nimport random\nimport sys\nsys.path.append('..')\nimport time\n\nimport tensorflow as tf\n\nimport utils\n\ndef vocab_encode(text, vocab):\n    return [vocab.index(x) + 1 for x in text if x in vocab]\n\ndef vocab_decode(array, vocab):\n    return ''.join([vocab[x - 1] for x in array])\n\ndef read_data(filename, vocab, window, overlap):\n    lines = [line.strip() for line in open(filename, 'r').readlines()]\n    while True:\n        random.shuffle(lines)\n\n        for text in lines:\n            text = vocab_encode(text, vocab)\n            for start in range(0, len(text) - window, overlap):\n                chunk = text[start: start + window]\n                chunk += [0] * (window - len(chunk))\n                yield chunk\n\ndef read_batch(stream, batch_size):\n    batch = []\n    for element in stream:\n        batch.append(element)\n        if len(batch) == batch_size:\n            yield batch\n            batch = []\n    yield batch\n\nclass CharRNN(object):\n    def __init__(self, model):\n        self.model = model\n        self.path = 'data/' + model + '.txt'\n        if 'trump' in model:\n            self.vocab = (\"$%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n                    \" '\\\"_abcdefghijklmnopqrstuvwxyz{|}@#➡📈\")\n        else:\n            self.vocab = (\" $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ\"\n                    \"\\\\^_abcdefghijklmnopqrstuvwxyz{|}\")\n\n        self.seq = tf.placeholder(tf.int32, [None, None])\n        self.temp = tf.constant(1.5)\n        self.hidden_sizes = [128, 256]\n        self.batch_size = 64\n        self.lr = 0.0003\n        self.skip_step = 1\n        self.num_steps = 50 # for RNN 
unrolled\n        self.len_generated = 200\n        self.gstep = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')\n\n    def create_rnn(self, seq):\n        layers = [tf.nn.rnn_cell.GRUCell(size) for size in self.hidden_sizes]\n        cells = tf.nn.rnn_cell.MultiRNNCell(layers)\n        batch = tf.shape(seq)[0]\n        zero_states = cells.zero_state(batch, dtype=tf.float32)\n        self.in_state = tuple([tf.placeholder_with_default(state, [None, state.shape[1]]) \n                                for state in zero_states])\n        # this line to calculate the real length of seq\n        # all seq are padded to be of the same length, which is num_steps\n        length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)\n        self.output, self.out_state = tf.nn.dynamic_rnn(cells, seq, length, self.in_state)\n\n    def create_model(self):\n        seq = tf.one_hot(self.seq, len(self.vocab))\n        self.create_rnn(seq)\n        self.logits = tf.layers.dense(self.output, len(self.vocab), None)\n        loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits[:, :-1], \n                                                        labels=seq[:, 1:])\n        self.loss = tf.reduce_sum(loss)\n        # sample the next character from Maxwell-Boltzmann Distribution \n        # with temperature temp. 
It works equally well without tf.exp\n        self.sample = tf.multinomial(tf.exp(self.logits[:, -1] / self.temp), 1)[:, 0] \n        self.opt = tf.train.AdamOptimizer(self.lr).minimize(self.loss, global_step=self.gstep)\n\n    def train(self):\n        saver = tf.train.Saver()\n        start = time.time()\n        min_loss = None\n        with tf.Session() as sess:\n            writer = tf.summary.FileWriter('graphs/gist', sess.graph)\n            sess.run(tf.global_variables_initializer())\n            \n            ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/' + self.model + '/checkpoint'))\n            if ckpt and ckpt.model_checkpoint_path:\n                saver.restore(sess, ckpt.model_checkpoint_path)\n            \n            iteration = self.gstep.eval()\n            stream = read_data(self.path, self.vocab, self.num_steps, overlap=self.num_steps//2)\n            data = read_batch(stream, self.batch_size)\n            while True:\n                batch = next(data)\n\n            # for batch in read_batch(read_data(DATA_PATH, vocab)):\n                batch_loss, _ = sess.run([self.loss, self.opt], {self.seq: batch})\n                if (iteration + 1) % self.skip_step == 0:\n                    print('Iter {}. \\n    Loss {}. 
Time {}'.format(iteration, batch_loss, time.time() - start))\n                    self.online_infer(sess)\n                    start = time.time()\n                    checkpoint_name = 'checkpoints/' + self.model + '/char-rnn'\n                    # save a checkpoint on the first report and whenever the loss improves\n                    if min_loss is None or batch_loss < min_loss:\n                        saver.save(sess, checkpoint_name, iteration)\n                        min_loss = batch_loss\n                iteration += 1\n\n    def online_infer(self, sess):\n        \"\"\" Generate sequence one character at a time, based on the previous character\n        \"\"\"\n        for seed in ['Hillary', 'I', 'R', 'T', '@', 'N', 'M', '.', 'G', 'A', 'W']:\n            sentence = seed\n            state = None\n            for _ in range(self.len_generated):\n                batch = [vocab_encode(sentence[-1], self.vocab)]\n                feed = {self.seq: batch}\n                if state is not None: # for the first decoder step, the state is None\n                    for i in range(len(state)):\n                        feed.update({self.in_state[i]: state[i]})\n                index, state = sess.run([self.sample, self.out_state], feed)\n                sentence += vocab_decode(index, self.vocab)\n            print('\\t' + sentence)\n\ndef main():\n    model = 'trump_tweets'\n    utils.safe_mkdir('checkpoints')\n    utils.safe_mkdir('checkpoints/' + model)\n\n    lm = CharRNN(model)\n    lm.create_model()\n    lm.train()\n    \nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "examples/kernels.py",
    "content": "import numpy as np\nimport tensorflow as tf\n\na = np.zeros([3, 3, 3, 3])\na[1, 1, :, :] = 0.25\na[0, 1, :, :] = 0.125\na[1, 0, :, :] = 0.125\na[2, 1, :, :] = 0.125\na[1, 2, :, :] = 0.125\na[0, 0, :, :] = 0.0625\na[0, 2, :, :] = 0.0625\na[2, 0, :, :] = 0.0625\na[2, 2, :, :] = 0.0625\n\nBLUR_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\n# a[1, 1, :, :] = 0.25\n# a[0, 1, :, :] = 0.125\n# a[1, 0, :, :] = 0.125\n# a[2, 1, :, :] = 0.125\n# a[1, 2, :, :] = 0.125\n# a[0, 0, :, :] = 0.0625\n# a[0, 2, :, :] = 0.0625\n# a[2, 0, :, :] = 0.0625\n# a[2, 2, :, :] = 0.0625\na[1, 1, :, :] = 1.0\na[0, 1, :, :] = 1.0\na[1, 0, :, :] = 1.0\na[2, 1, :, :] = 1.0\na[1, 2, :, :] = 1.0\na[0, 0, :, :] = 1.0\na[0, 2, :, :] = 1.0\na[2, 0, :, :] = 1.0\na[2, 2, :, :] = 1.0\nBLUR_FILTER = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[1, 1, :, :] = 5\na[0, 1, :, :] = -1\na[1, 0, :, :] = -1\na[2, 1, :, :] = -1\na[1, 2, :, :] = -1\n\nSHARPEN_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[1, 1, :, :] = 5\na[0, 1, :, :] = -1\na[1, 0, :, :] = -1\na[2, 1, :, :] = -1\na[1, 2, :, :] = -1\n\nSHARPEN_FILTER = tf.constant(a, dtype=tf.float32)\n\n# a = np.zeros([3, 3, 3, 3])\n# a[:, :, :, :] = -1\n# a[1, 1, :, :] = 8\n\n# EDGE_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\nEDGE_FILTER_RGB = tf.constant([\n\t\t\t[[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n            [[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]],\n            [[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ 8., 0., 0.], [ 0., 8., 0.], [ 0., 0., 8.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]],\n\t\t\t[[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]],\n\t\t\t[[ -1., 0., 0.], [ 0., -1., 0.], [ 0., 0., -1.]]]\n])\n\na = np.zeros([3, 3, 1, 1])\n# a[:, :, :, :] = -1\n# a[1, 1, :, :] = 8\na[0, 
1, :, :] = -1\na[1, 0, :, :] = -1\na[1, 2, :, :] = -1\na[2, 1, :, :] = -1\na[1, 1, :, :] = 4\n\nEDGE_FILTER = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[0, :, :, :] = 1\na[0, 1, :, :] = 2 # originally 2\na[2, :, :, :] = -1\na[2, 1, :, :] = -2\n\nTOP_SOBEL_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[0, :, :, :] = 1\na[0, 1, :, :] = 2 # originally 2\na[2, :, :, :] = -1\na[2, 1, :, :] = -2\n\nTOP_SOBEL = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 3, 3])\na[0, 0, :, :] = -2\na[0, 1, :, :] = -1 \na[1, 0, :, :] = -1\na[1, 1, :, :] = 1\na[1, 2, :, :] = 1\na[2, 1, :, :] = 1\na[2, 2, :, :] = 2\n\nEMBOSS_FILTER_RGB = tf.constant(a, dtype=tf.float32)\n\na = np.zeros([3, 3, 1, 1])\na[0, 0, :, :] = -2\na[0, 1, :, :] = -1 \na[1, 0, :, :] = -1\na[1, 1, :, :] = 1\na[1, 2, :, :] = 1\na[2, 1, :, :] = 1\na[2, 2, :, :] = 2\nEMBOSS_FILTER = tf.constant(a, dtype=tf.float32)"
  },
  {
    "path": "examples/utils.py",
    "content": "import os\nimport gzip\nimport shutil\nimport struct\nimport urllib\n\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n\nfrom matplotlib import pyplot as plt\nimport numpy as np\nimport tensorflow as tf\n\ndef huber_loss(labels, predictions, delta=14.0):\n    residual = tf.abs(labels - predictions)\n    def f1(): return 0.5 * tf.square(residual)\n    def f2(): return delta * residual - 0.5 * tf.square(delta)\n    return tf.cond(residual < delta, f1, f2)\n\ndef safe_mkdir(path):\n    \"\"\" Create a directory if there isn't one already. \"\"\"\n    try:\n        os.mkdir(path)\n    except OSError:\n        pass\n\ndef read_birth_life_data(filename):\n    \"\"\"\n    Read in birth_life_2010.txt and return:\n    data in the form of NumPy array\n    n_samples: number of samples\n    \"\"\"\n    text = open(filename, 'r').readlines()[1:]\n    data = [line[:-1].split('\\t') for line in text]\n    births = [float(line[1]) for line in data]\n    lifes = [float(line[2]) for line in data]\n    data = list(zip(births, lifes))\n    n_samples = len(data)\n    data = np.asarray(data, dtype=np.float32)\n    return data, n_samples\n\ndef download_one_file(download_url, \n                    local_dest, \n                    expected_byte=None, \n                    unzip_and_remove=False):\n    \"\"\" \n    Download the file from download_url into local_dest\n    if the file doesn't already exists.\n    If expected_byte is provided, check if \n    the downloaded file has the same number of bytes.\n    If unzip_and_remove is True, unzip the file and remove the zip file\n    \"\"\"\n    if os.path.exists(local_dest) or os.path.exists(local_dest[:-3]):\n        print('%s already exists' %local_dest)\n    else:\n        print('Downloading %s' %download_url)\n        local_file, _ = urllib.request.urlretrieve(download_url, local_dest)\n        file_stat = os.stat(local_dest)\n        if expected_byte:\n            if file_stat.st_size == expected_byte:\n                
print('Successfully downloaded %s' %local_dest)\n                if unzip_and_remove:\n                    with gzip.open(local_dest, 'rb') as f_in, open(local_dest[:-3],'wb') as f_out:\n                        shutil.copyfileobj(f_in, f_out)\n                    os.remove(local_dest)\n            else:\n                print('The downloaded file has unexpected number of bytes')\n\ndef download_mnist(path):\n    \"\"\" \n    Download and unzip the dataset mnist if it's not already downloaded \n    Download from http://yann.lecun.com/exdb/mnist\n    \"\"\"\n    safe_mkdir(path)\n    url = 'http://yann.lecun.com/exdb/mnist'\n    filenames = ['train-images-idx3-ubyte.gz',\n                'train-labels-idx1-ubyte.gz',\n                't10k-images-idx3-ubyte.gz',\n                't10k-labels-idx1-ubyte.gz']\n    expected_bytes = [9912422, 28881, 1648877, 4542]\n\n    for filename, byte in zip(filenames, expected_bytes):\n        download_url = os.path.join(url, filename)\n        local_dest = os.path.join(path, filename)\n        download_one_file(download_url, local_dest, byte, True)\n\ndef parse_data(path, dataset, flatten):\n    if dataset != 'train' and dataset != 't10k':\n        raise NameError('dataset must be train or t10k')\n\n    label_file = os.path.join(path, dataset + '-labels-idx1-ubyte')\n    with open(label_file, 'rb') as file:\n        _, num = struct.unpack(\">II\", file.read(8))\n        labels = np.fromfile(file, dtype=np.int8) #int8\n        new_labels = np.zeros((num, 10))\n        new_labels[np.arange(num), labels] = 1\n    \n    img_file = os.path.join(path, dataset + '-images-idx3-ubyte')\n    with open(img_file, 'rb') as file:\n        _, num, rows, cols = struct.unpack(\">IIII\", file.read(16))\n        imgs = np.fromfile(file, dtype=np.uint8).reshape(num, rows, cols) #uint8\n        imgs = imgs.astype(np.float32) / 255.0\n        if flatten:\n            imgs = imgs.reshape([num, -1])\n\n    return imgs, new_labels\n\ndef read_mnist(path, 
flatten=True, num_train=55000):\n    \"\"\"\n    Read in the mnist dataset, given that the data is stored in path\n    Return two tuples of numpy arrays\n    ((train_imgs, train_labels), (test_imgs, test_labels))\n    \"\"\"\n    imgs, labels = parse_data(path, 'train', flatten)\n    indices = np.random.permutation(labels.shape[0])\n    train_idx, val_idx = indices[:num_train], indices[num_train:]\n    train_img, train_labels = imgs[train_idx, :], labels[train_idx, :]\n    val_img, val_labels = imgs[val_idx, :], labels[val_idx, :]\n    test = parse_data(path, 't10k', flatten)\n    return (train_img, train_labels), (val_img, val_labels), test\n\ndef get_mnist_dataset(batch_size):\n    # Step 1: Read in data\n    mnist_folder = 'data/mnist'\n    download_mnist(mnist_folder)\n    train, val, test = read_mnist(mnist_folder, flatten=False)\n\n    # Step 2: Create datasets and iterator\n    train_data = tf.data.Dataset.from_tensor_slices(train)\n    train_data = train_data.shuffle(10000) # if you want to shuffle your data\n    train_data = train_data.batch(batch_size)\n\n    test_data = tf.data.Dataset.from_tensor_slices(test)\n    test_data = test_data.batch(batch_size)\n\n    return train_data, test_data\n    \ndef show(image):\n    \"\"\"\n    Render a given numpy.uint8 2D array of pixel data.\n    \"\"\"\n    plt.imshow(image, cmap='gray')\n    plt.show()"
  },
  {
    "path": "examples/word2vec_utils.py",
    "content": "from collections import Counter\nimport random\nimport os\nimport sys\nsys.path.append('..')\nimport zipfile\n\nimport numpy as np\nfrom six.moves import urllib\nimport tensorflow as tf\n\nimport utils\n\ndef read_data(file_path):\n    \"\"\" Read data into a list of tokens \n    There should be 17,005,207 tokens\n    \"\"\"\n    with zipfile.ZipFile(file_path) as f:\n        words = tf.compat.as_str(f.read(f.namelist()[0])).split() \n    return words\n\ndef build_vocab(words, vocab_size, visual_fld):\n    \"\"\" Build vocabulary of VOCAB_SIZE most frequent words and write it to\n    visualization/vocab.tsv\n    \"\"\"\n    utils.safe_mkdir(visual_fld)\n    file = open(os.path.join(visual_fld, 'vocab.tsv'), 'w')\n    \n    dictionary = dict()\n    count = [('UNK', -1)]\n    index = 0\n    count.extend(Counter(words).most_common(vocab_size - 1))\n    \n    for word, _ in count:\n        dictionary[word] = index\n        index += 1\n        file.write(word + '\\n')\n    \n    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n    file.close()\n    return dictionary, index_dictionary\n\ndef convert_words_to_index(words, dictionary):\n    \"\"\" Replace each word in the dataset with its index in the dictionary \"\"\"\n    return [dictionary[word] if word in dictionary else 0 for word in words]\n\ndef generate_sample(index_words, context_window_size):\n    \"\"\" Form training pairs according to the skip-gram model. 
\"\"\"\n    for index, center in enumerate(index_words):\n        context = random.randint(1, context_window_size)\n        # get a random target before the center word\n        for target in index_words[max(0, index - context): index]:\n            yield center, target\n        # get a random target after the center wrod\n        for target in index_words[index + 1: index + context + 1]:\n            yield center, target\n\ndef most_common_words(visual_fld, num_visualize):\n    \"\"\" create a list of num_visualize most frequent words to visualize on TensorBoard.\n    saved to visualization/vocab_[num_visualize].tsv\n    \"\"\"\n    words = open(os.path.join(visual_fld, 'vocab.tsv'), 'r').readlines()[:num_visualize]\n    words = [word for word in words]\n    file = open(os.path.join(visual_fld, 'vocab_' + str(num_visualize) + '.tsv'), 'w')\n    for word in words:\n        file.write(word)\n    file.close()\n\ndef batch_gen(download_url, expected_byte, vocab_size, batch_size, \n                skip_window, visual_fld):\n    local_dest = 'data/text8.zip'\n    utils.download_one_file(download_url, local_dest, expected_byte)\n    words = read_data(local_dest)\n    dictionary, _ = build_vocab(words, vocab_size, visual_fld)\n    index_words = convert_words_to_index(words, dictionary)\n    del words           # to save memory\n    single_gen = generate_sample(index_words, skip_window)\n    \n    while True:\n        center_batch = np.zeros(batch_size, dtype=np.int32)\n        target_batch = np.zeros([batch_size, 1])\n        for index in range(batch_size):\n            center_batch[index], target_batch[index] = next(single_gen)\n        yield center_batch, target_batch\n"
  },
  {
    "path": "setup/requirements.txt",
    "content": "tensorflow==1.4.1\nscipy==1.0.0\nscikit-learn==0.19.1\nmatplotlib==2.1.1\nxlrd==1.1.0\nipdb==0.10.3\nPillow==5.0.0\nlxml==4.1.1"
  },
  {
    "path": "setup/setup_instruction.md",
    "content": "Please follow the official instruction to install TensorFlow [here](https://www.tensorflow.org/install/). For this course, I will use Python 3.6 and TensorFlow 1.4. You’re welcome to use either Python 2 or Python 3 for the assignments. The starter code, though, will be in Python 3.6. You don't need GPU for most code examples in this course, though having GPU won't hurt. If you install TensorFlow on your local machine, my ecommendation is always set up Tensorflow using virtualenv. \n\nFor the list of dependencies, please consult the file requirements.txt. This list will be updated as the course progresses. \n\nThere are a few things to note:\n- As of version 1.2, TensorFlow no longer provides GPU support on macOS.\n- On macOS, Python 3.6 might gives warning but still works.\n- TensorFlow with GPU support will only work with CUDA® Toolkit 8.0 and cuDNN v6.0, not the newest CUDA and cnDNN version. Make sure that you install the correct CUDA and cuDNN versions to avoid frustrating issues.\n- On Windows, TensorFlow supports only 64-bit Python 3.5 anx Python 3.6.\n- If you see the warning:\n```bash\nYour CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA\n```\nit's because you didn't install TensorFlow from sources to take advantage of all these settings. You can choose to install TensorFlow from sources -- the process might take up to 30 minutes. To silence the warning, add this before importing TensorFlow: <br>\n\n```bash\nimport os\nos.environ['TF_CPP_MIN_LOG_LEVEL']='2'\n```\n\n- If you want to install TensorFlow from sources, keep in mind that TensorFlow doesn't officially support building TensorFlow on Windows. On Windows, you may try using the highly experimental Bazel on Windows or TensorFlow CMake build.\n\nBelow is a simpler instruction on how to install TensorFlow on macOS. 
If you have any problems installing TensorFlow, feel free to post them on [Piazza](https://piazza.com/stanford/winter2018/cs20)\n\nIf you get a “permission denied” error in any command, use “sudo” in front of that command.\n\nYou will need pip3 (or pip if you use Python2), and virtualenv.\n\nStep 1: install python3 and pip3. Skip this step if you already have both. You can find the official instructions [here](http://docs.python-guide.org/en/latest/starting/install3/osx/)\n\nStep 2: upgrade six\n```bash\n$ sudo easy_install --upgrade six\n```\n\nStep 3: install virtualenv. Skip this step if you already have virtualenv\n```bash\n$ pip3 install virtualenv\n```\n\nStep 4: set up a project directory. You will do all work for this class in this directory\n```bash\n$ mkdir cs20\n```\n\nStep 5: set up virtual environment with python3\n```bash\n$ cd cs20\n$ python3 -m venv .env\n```\nThese commands create a .env subdirectory in your project where everything is installed.\n\nStep 6: activate the virtual environment \n```bash\n$ source .env/bin/activate\n```\n\nIf you type:\n```bash\n$ pip3 freeze\n```\n\nYou will see that nothing is shown, which means no package is installed in your virtual environment. So you have to install all packages that you need. For the list of packages you need for this class, you can see/download the list of requirements in [the setup folder of this repository](https://github.com/chiphuyen/stanford-tensorflow-tutorials/blob/master/setup/requirements.txt).\n\nStep 7: Install TensorFlow and other dependencies\n```bash\n$ pip3 install -r requirements.txt\n```\n\nStep n: \nTo exit the virtual environment, use:\n```bash\n$ deactivate\n```\n\n### Other options\n#### Floydhub\nFloydhub has a clean, GitHub-like interface that allows you to create and run TensorFlow projects.\n\n# Possible set up problems\n## Matplotlib\nIf you have problems using Matplotlib in a virtual environment, here are two simple ways to fix them. <br>\n1. 
If you installed matplotlib using pip, there is a directory in your home directory called ~/.matplotlib.\nGo there, create a file ~/.matplotlib/matplotlibrc, and add the following line: ```backend: TkAgg```\n2. After importing matplotlib, simply add: ```matplotlib.use(\"TkAgg\")```\n\nIf you run into more problems, feel free to post your questions on [Piazza](https://piazza.com/stanford/winter2018/cs20) or email us at cs20-win1718-staff@lists.stanford.edu."
  }
]